Moving the server-side gamelogic to PNaCl

This post is aimed at Unvanquished/Dæmon developers, to explain the big changes that will happen soon, as well as at people interested in the technology behind the Dæmon engine. It contains background on the upcoming change to PNaCl-only sandboxes for the server-side gamelogic (the game virtual machine) as well as an overview of the changes for developers.

When we first announced our long-term plan to upgrade the Dæmon engine, we were looking to replace QVMs with PNaCl sandboxes and had a first proof of concept working. The biggest advantage of using PNaCl is that gameplay developers will be able to use modern C++ as well as external C/C++ libraries directly inside the VMs. In addition, it will allow us to share code between the engine and the gamelogic, and it will provide a performance boost on the raw processing of the gamelogic.

While PNaCl sandboxes for game have been supported by the engine for a while now, we never actually shipped PNaCl files. The reason is that we wanted other parts of the engine (e.g. the filesystem) to be rewritten first, and we also rewrote all the PNaCl-related code to make it more type-safe and portable.

So now we are looking to completely remove QVM support for the game virtual machine. This is a huge change because the means of communication between the engine and the gamelogic are totally different. In order to minimize breakage we are going to merge it now, just after the release of the 25th alpha. Our hope is that by the 26th alpha all the bugs and distribution issues will be ironed out. It will still not work right away for some users, but the advantage of removing QVM support only for game is that users will still be able to play on servers without problems and give us valuable feedback, so that eventually all the gamelogic runs on PNaCl flawlessly.

Differences between the Quake3 Virtual Machines and PNaCl

QVM and PNaCl are quite different, so let's have a look at how each of them works.

Quake3 uses a custom version of the Light C Compiler (LCC) to compile the C89 source code of the gamelogic. Internally LCC compiles C code to a stack machine assembly; QVMs are basically the stack-based internal representations of several files linked into one binary. At runtime the QVMs are interpreted or JITed by the engine, so communication between the VM and the engine is made by (almost) direct function calls and is very fast. The engine can also access the VM's memory easily. That means that the VM can call the engine "for free": for example, ftol isn't supported by QVMs, so the gamelogic used to call the engine each time it needed to convert a float to an int. We did not measure the overhead of function calls for QVMs but it is tiny.

For PNaCl, this is quite a different story. A custom LLVM toolchain compiles the gamelogic to a .pexe file containing instrumented LLVM bytecode, which is then translated at runtime into $ARCH.nexe files depending on the architecture of the processor. Finally the Dæmon engine starts a NaCl sandboxed process running the gamelogic, but because it is in another process, communication is more difficult. Basically the only two means of communication are shared memory and sockets, effectively putting the different parts of the game in a network [1]. NaCl also provides a way to share file handles with the sandboxed process (it's the process' only way to open files) and that's actually how we do shared memory. Our measurements show that making a function call (sending a message and waiting for its answer) takes around 10 microseconds, while just sending a message (not waiting for an answer) takes roughly 1 microsecond. Even though this latency is high, the bandwidth is pretty much unlimited and the cost of a message barely depends on its size.

For game this difference isn't that much of a problem because it makes relatively few calls to the engine compared to the other part of the gamelogic, cgame. This allows us to focus on stabilizing the general communication framework before we start to port and optimize cgame for PNaCl.

How the engine works with PNaCl

Up until now we have been speaking of PNaCl, but that was an "enhanced truth": what the engine really supports is basic NaCl. This only makes a difference in the loading of the sandbox, all the rest of the code staying the same. The main reason why we don't support PNaCl yet is that it requires us to write a custom PNaCl loader that uses an undocumented [2] protocol to feed the PNaCl bytecode to the PNaCl-to-NaCl translator. Instead of using PNaCl we are currently compiling NaCl modules twice, for the x86 and x86_64 architectures, which are the only two we support.

There are 3 ways to compile the NaCl gamelogic, which correspond to 3 use cases and 3 CMake options:

  • BUILD_GAME_NACL compiles the gamelogic as NaCl binaries for x86 and x86_64 as well as a (not currently used) PNaCl binary. This is to be used for distribution but isn't very convenient to work with.
  • BUILD_GAME_NACL_NATIVE_EXE compiles the gamelogic as a native executable that is launched in another process and behaves exactly like the NaCl binaries. This is to be used by servers that care about the last 5-10% of performance or those that don't want to use the NaCl toolchain.
  • BUILD_GAME_NACL_NATIVE_DLL compiles the gamelogic as a native shared library that can be loaded in the engine process but still uses restricted means of communication. This is particularly useful when debugging as you would need to attach the debugger to only one process. However it should never be used for production because the gamelogic might not exit cleanly.

The vm_game cvar selects which type of gamelogic is loaded for game:

  • 0 or 1 mean the NaCl executable should be used and if it is 1, the gamelogic is started in a gdb server waiting on port 4014.
  • 2 or 3 mean the native executable should be used and if it is 3, the gamelogic is started in a gdb server waiting on port 4014.
  • 4 means the native shared library version is used.
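
The mapping above can be summed up in a few lines of C++. This is only an illustrative sketch: GameVMKind, GameVMConfig and DecodeVmGame are hypothetical names made up for this example, not the engine's actual code.

```cpp
#include <cassert>

// Hypothetical names, for illustration only.
enum class GameVMKind { NaClExecutable, NativeExecutable, NativeSharedLibrary };

struct GameVMConfig {
    GameVMKind kind;
    bool debug; // true when the gamelogic is started in a gdb server (port 4014)
};

// Maps a vm_game value (0 to 4) to the configuration described in the list above.
inline GameVMConfig DecodeVmGame(int value) {
    assert(value >= 0 && value <= 4);
    switch (value) {
        case 0:  return {GameVMKind::NaClExecutable, false};
        case 1:  return {GameVMKind::NaClExecutable, true};
        case 2:  return {GameVMKind::NativeExecutable, false};
        case 3:  return {GameVMKind::NativeExecutable, true};
        default: return {GameVMKind::NativeSharedLibrary, false};
    }
}
```

Note that in this reading the shared library version has no debug variant: since it lives in the engine process, you attach the debugger to the engine instead.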

Developing with NaCl

As far as compilation goes, the NaCl toolchain binaries are automatically provided in the external_deps folder and used for the build. If you need to debug game, you should use the shared library version so that running the debugger on the engine also attaches it to the gamelogic.

Writing syscalls in NaCl

The way we write syscalls has changed: we are now using typed messages that are automatically serialized and deserialized with some template magic. There are two types of messages (and more coming for performance): Message<ID, Types...>, used for cheap asynchronous messages, and SyncMessage<Message<ID, Types...>, Reply<Types...>>, used for more expensive messages that wait for the return values. By default you should use SyncMessage; Message can be used when you don't need a return value and no SyncMessage is sent back when handling that Message. When in doubt, just use a SyncMessage with no reply, i.e. SyncMessage<Message<ID, Types...>>, or ask on IRC.

Let's say we want to create a message that asks the engine for the list of clients whose name contains some string and whose ping is lower than some number. You would define the message like this:

 typedef IPC::SyncMessage<
    // A message ID is made of a major number specifying which subsystem
    // handles that message, and a minor number specifying which message
    // of that subsystem it is.
    IPC::Message<IPC_ID(CLIENT, QUERY), std::string, int>,
    IPC::Reply<std::vector<int>>
 > ClientQueryMsg;

Here CLIENT and QUERY together form a unique pair of numbers that identifies the message type (IPC_ID will become IPC::Id soon). Sending the message is done like this:

// First define the variables that will hold the result of the message
std::vector<int> gummyBearClients;

// Then call VM::SendMsg with the inputs then the variable holding the outputs of the message
VM::SendMsg<ClientQueryMsg>("gummybear", 100, gummyBearClients);

Handling the message is a bit more complex. First the IPC system looks at the ID and splits it into a major number and a minor number. It then hands the subsystem given by the major number a Reader that contains the message as a bit string and a Socket over which to send the answer. Usually the subsystem will do a switch on the minor number and handle the messages. Here is an example for our ClientQueryMsg:

switch (minorNumber) {

    //...

    case QUERY:
        IPC::HandleMsg<ClientQueryMsg>(socket, std::move(reader), [](Str::StringRef nameSubString, int maxPing, std::vector<int>& queryResult) {
                // Do something with nameSubString and maxPing and fill the queryResult
        });
        break;

    //...

}
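
To make the major/minor split more concrete, here is a minimal sketch of how such an ID could be packed and unpacked. The 16/16 bit layout, the helper names MakeId, MajorOf and MinorOf, and the numeric values given to CLIENT and QUERY are all assumptions made for this example; the real IPC_ID may well be laid out differently.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the major number lives in the upper 16 bits and the minor
// number in the lower 16 bits of a 32-bit ID.
constexpr std::uint32_t MakeId(std::uint16_t major, std::uint16_t minor) {
    return (static_cast<std::uint32_t>(major) << 16) | minor;
}

constexpr std::uint16_t MajorOf(std::uint32_t id) {
    return static_cast<std::uint16_t>(id >> 16);
}

constexpr std::uint16_t MinorOf(std::uint32_t id) {
    return static_cast<std::uint16_t>(id & 0xFFFFu);
}

// Hypothetical values: one enum for subsystems (major numbers) and one for
// the messages of the CLIENT subsystem (minor numbers).
enum : std::uint16_t { CLIENT = 3 };
enum : std::uint16_t { QUERY = 7 };

// Packing then splitting gives back the original pair.
static_assert(MajorOf(MakeId(CLIENT, QUERY)) == CLIENT, "major survives a round trip");
static_assert(MinorOf(MakeId(CLIENT, QUERY)) == QUERY, "minor survives a round trip");
```

With something like this, the dispatcher would pick the subsystem from MajorOf(id) and the subsystem would switch on MinorOf(id), as in the handler above.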

So now with NaCl you can start using C++ as well as include third-party libraries in game. Enjoy!

Challenges and possible solutions for porting cgame

The next step will be to port cgame, the client-side gamelogic, to NaCl: in addition to a lot of gruntwork to translate all the communication from QVM-style to NaCl-style, we need to worry about performance. As we saw earlier, in NaCl communication isn't free anymore; this wasn't too much of a problem for game as it doesn't communicate a lot with the engine. It is a different story for cgame as it currently needs 2000 to 4000 messages per frame: even if all the messages were asynchronous (roughly 1 microsecond each), that would be 2 to 4 milliseconds per frame spent doing nothing, which we cannot afford. And this is a very optimistic computation: in reality things are even worse because a lot of these messages need to be synchronous.
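
The back-of-the-envelope computation above can be written down as a small helper. The two costs are the measurements quoted earlier in this post; the function name MessagingCostMs is hypothetical, and any synchronous fraction you plug in is a guess for illustration, not something we measured.

```cpp
#include <cassert>

// Measured costs quoted earlier in this post, in microseconds.
constexpr double kAsyncCostUs = 1.0;  // send a message, don't wait
constexpr double kSyncCostUs = 10.0;  // send a message and wait for the answer

// Time spent on messaging per frame, in milliseconds, for a given number of
// messages of which a fraction syncFraction (0.0 to 1.0) is synchronous.
constexpr double MessagingCostMs(int messages, double syncFraction) {
    return messages
        * (syncFraction * kSyncCostUs + (1.0 - syncFraction) * kAsyncCostUs)
        / 1000.0;
}
```

MessagingCostMs(2000, 0.0) and MessagingCostMs(4000, 0.0) give the optimistic 2 and 4 milliseconds. If, purely as an example, half of 4000 messages were synchronous, the cost would be around 22 milliseconds, which is more than the roughly 16.7 milliseconds available per frame at 60 frames per second.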

We have several strategies to make cgame fast on NaCl too.

First of all some subsystems are hosted in the engine because they do a lot of computation that would be too slow for Quake3 to do in the QVMs: this is the case for the collision code and math routines. These can be moved to the gamelogic relatively easily, removing a lot of synchronous messages.

Also some messages are used to stream data from the engine that is known at the beginning of the frame but may not be needed. Here we can instead use a single asynchronous message that sends all this data in a single batch at the beginning of the frame.

Finally the bulk of the messages sent by cgame are used to tell the engine what to draw (or what sounds to play). These messages do not need to be processed right away and their order isn't important: we can buffer them and send them all in a single message. These buffered messages could even live in a shared memory buffer to avoid the cost of calling socket operations. Our first estimates show that this optimization alone would make two thirds of the messages free.
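
Here is a minimal sketch of the buffering idea, assuming a made-up DrawCommand struct and CommandBuffer class (neither exists in the engine under these names). The send callback stands in for whatever would actually ship the batch to the engine, be it one asynchronous message or a shared memory buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical draw command; the real ones carry more data.
struct DrawCommand {
    int shaderHandle;
    float x, y, w, h;
};

// Accumulates commands during the frame and hands them over in one go.
class CommandBuffer {
public:
    // Queues a command locally; nothing is sent to the engine yet.
    void Push(const DrawCommand& cmd) { pending_.push_back(cmd); }

    std::size_t Size() const { return pending_.size(); }

    // Passes all pending commands to send in a single call, then clears
    // the buffer. Returns the number of commands that were flushed.
    template <typename SendFn>
    std::size_t Flush(SendFn&& send) {
        const std::size_t count = pending_.size();
        if (count > 0) {
            send(pending_);
            pending_.clear();
        }
        return count;
    }

private:
    std::vector<DrawCommand> pending_;
};
```

The point of the design is that cgame pays the per-message cost once per flush instead of once per draw call: pushing 3000 commands and flushing at the end of the frame results in a single call to send.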

Summing it up

We ported game relatively easily and are going to start testing it with NaCl only over the next couple of alphas. This will allow us to stabilize the NaCl code; meanwhile we will be refactoring parts of cgame to make them more efficient in terms of number of messages and eventually run it in NaCl too. After that is done, all the gamelogic will be able to work with third-party libraries, with C++ and at full speed, sharing useful code with our engine. At the same time, we will have changed one of the most crucial components of our engine, truly marking our departure from the Quake3 legacy.

[1]While we can reimplement spinlocks and other more complex mutexes using shared memory and atomic operations, we cannot do any signaling like we would do with condition variables. We cannot busy-wait on a spinlock to acquire the mutex because, on a single-core processor, this would prevent the OS from scheduling the thread holding the spinlock. Any mutex more complex than a spinlock involves a call to yield, and without proper signaling we wouldn't be able to wake up exactly when we need to.
[2]There is no documentation about making a NaCl host, unless you consider Chromium's sources as documentation.