Dev Diary: Synchronization Woes!
V1.12 Sync Test: http://www.ironcladgames.com/sins/Sins112SyncTest.zip
The goal this past weekend was to seek and destroy the elusive desync problem. It's disappointing that it's showing up again with higher frequency, no doubt due to the increase in the number of people joining Sins multiplayer after the 1.1 release. Unfortunately, we didn't get consistent desyncs during the lengthy beta so we were never able to nail it down. I also really want it out of the way before Entrenchment is released. Its multiplayer component is a lot of fun so I don't want it blemished by sync issues.
For the longest time I was sure it was some mod-related problem. We hadn't personally seen a desync in over 6 months. However, when the reports started coming in again after 1.1 went live, I decided to dedicate the entire weekend to doing nothing but tracking it down. All of Friday, Saturday and most of Sunday I played with a lot of people who were all as dedicated as I was to eradicating the beast. There were theories to test, combinations to try, players to track down (particularly those who reportedly could produce desyncs consistently), IRC debates, logs to submit and analyze, and much more.
Not one problem occurred all of Friday and Saturday, but then on Sunday afternoon I got word through the grapevine that a mysterious player named "Krunk" was currently the desync king. So a bunch of Sins fans on ICO hunted him down for me and we set up a game. Sure enough, no more than a few seconds into the game, he desynced. I couldn't believe it when the red desync text materialized and proceeded to burn my eyes. Well, I guess there really is a desync problem.
What is a Sync Bug?
In the world of developing RTS games, there is nothing more painful than a sync bug. Sync bugs consume vast amounts of time to track down, and much of the engine design is focused on methodology to prevent them. Just because I think it's interesting, I'm going to explain what they are and how they are typically caused.
Unlike MMOs, FPSs or many other multiplayer games, in an RTS there is no master server that has the final say on the current state of the game. You may notice in something like World of Warcraft that your avatar suddenly gets his position corrected. What is happening there is that your local simulation of the WoW universe has the character moving a certain way, but then the master server says that is incorrect and tells him to reposition himself. You were out of sync with the boss and the boss put you in your place. For an RTS game it's far too impractical (in many different respects) to have a master server checking over everybody, so we rely on determinism to make sure everyone stays in sync. Determinism is about making sure everything happens in the exact same way on every machine. If you can guarantee everything happens in the same way, you don't need a master server telling everyone what the results are and you don't have to send much information between the players' computers.
Here is a simple example from Sins: if an AI player on my machine randomly decides to attack Player X, then I need to trust that the AI on your machine will also randomly decide to attack Player X. There is no communication between our two computers about that decision; the synchronization is implicit in the math, logic, and structure. If something goes wrong with this, the AI on your machine may decide to attack someone else. From this point onward our universes diverge and we see completely different results, or in the worst case our games crash. Sometimes the divergence starts so small and grows so slowly that we don't notice for a very long time, if at all. Regardless, any form of divergence is a desync, or a sync bug.
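To make the idea concrete, here is a minimal Python sketch of that AI decision. It is not Sins code: the function names, the player list and the shared seed are all made up for illustration, and I am assuming the seed is agreed on when the game is set up. Each "machine" builds its own generator from the same seed and never talks to the other one.

```python
import random

def pick_attack_target(rng, players):
    """Hypothetical AI decision: 'randomly' choose which player to attack."""
    return rng.choice(players)

def simulate_machine(seed, players, turns):
    """Run the same sequence of AI decisions on one machine."""
    rng = random.Random(seed)  # built locally from the shared seed (an assumption)
    return [pick_attack_target(rng, players) for _ in range(turns)]

players = ["Player X", "Player Y", "Player Z"]
shared_seed = 1234  # illustrative value only

mine = simulate_machine(shared_seed, players, turns=5)
yours = simulate_machine(shared_seed, players, turns=5)
print(mine == yours)  # True: same seed, same calls, same decisions
```

Both lists match because the two generators start from the same state and are asked the same questions in the same order. Break either of those conditions and the lists can drift apart.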
Sync Bug Causes
So what can cause the divergence? Why are my ships in a different position than yours? Why did the AI make a different decision on my machine than on yours? Here are a few examples straight from Sins:
1. First, we might be using different CPUs. Different architectures can generate slightly different results, particularly between brands (say Intel vs AMD). This usually comes down to the Floating Point Unit, so one of the first things we do is make sure we synchronize the FPUs on everyone's machines using a special command. You may remember a sync bug shortly after we released the ability to load up mods much earlier in the year. This was caused by the mod setup codepath bypassing the FPU Control call that the regular setup of the game used. From that point onward, there is a small chance your mathematical calculations are going to give slightly different results than mine, which usually shows up first as a miscalculation in the orientation or position of a ship since there is a lot of floating point math going on there (particularly with the matrix multiplies required for rotation).
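Python can't change the FPU settings, so the following is only an analogy, not what the game does. It assumes the difference between two machines amounts to how much precision each one keeps for intermediate results, and it imitates that by rounding a running total to single precision after every addition on one "machine" while the other keeps double precision. The helper names are made up.

```python
import struct

def round_to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times, optionally rounding every
    intermediate result to single precision."""
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = round_to_single(total)
    return total

as_double = accumulate(0.1, 1000, single_precision=False)
as_single = accumulate(0.1, 1000, single_precision=True)
print(as_double, as_single, as_double == as_single)
```

The two totals are close but not equal, and in a simulation that feeds each result into the next calculation, "close" is not good enough.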
2. Next, we might call non-deterministic operations in a deterministic code block. An RTS engine is really broken up into two separate parts: the Simulation and the Presentation. The Simulation is what is actually happening (AI, physics, gameplay etc) and is the part that has to be deterministic and in sync. The Presentation (rendering, particle systems, etc) is what you see, and it doesn't have to be in sync or be deterministic. Typically, the Presentation is a custom interpretation of the Simulation (e.g. it looks at the Simulation and decides how best to show you that information based on how powerful your computer is). For example, the Simulation says that one of your ships blew up. The Presentation then realizes you have a very powerful graphics card so it decides to render a ton of particle effects to make the explosion look pretty. On an older graphics card it may decide to show a boring white blob that grows and shrinks. Now, to make the nice pretty explosion the Presentation may make many calls to a random number generator to spew various fireball images in random directions, while the crappy white explosion doesn't make any calls to the random number generator. It's important to note here that random number generators aren't really random; they just spit out numbers that look random to humans but really follow a very predictable pattern and order. The generator "remembers" where it was the last time it was called so that each successive call doesn't generate the same starting pattern over and over again. But what happens if the AI decides to use that same random number generator to randomly decide which player to attack? Because my pretty explosion made many calls to it and your crappy explosion didn't make any, our random number generators will generate different results because they were left at different positions in the sequence of numbers. Your random numbers are behind mine in the sequence.
In order to solve this problem we have to use two separate random number generators: the Deterministic Generator and the Non-Deterministic Generator. Everything in the Presentation calls the ND-Generator and everything in the Simulation calls the D-Generator. One of the early sync bugs in Sins was caused by the autocast code on the Novalith cannon calling the ND-Generator when trying to decide which enemy planet to fire at. This of course would give completely different results on every machine. It took a long time to find this one because (a) it takes a long time to tech up to the Novalith and (b) not many people use the Novalith's autocast.
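The two-generator split can be sketched in a few lines of Python. Again, this is an illustration and not the engine's code: the class, its fields and the shared seed are invented, and I am assuming the deterministic generator is seeded identically on every machine while the other one can be seeded from anything.

```python
import random

class Machine:
    """One player's computer, with separate generators for Simulation and Presentation."""

    def __init__(self, shared_seed, fancy_graphics):
        self.sim_rng = random.Random(shared_seed)  # D-Generator: same seed everywhere
        self.fx_rng = random.Random()              # ND-Generator: seeded from the OS
        self.fancy_graphics = fancy_graphics

    def render_explosion(self):
        # Presentation only. A powerful machine draws many random fireballs,
        # a weak one draws none. Only fx_rng is touched here.
        if self.fancy_graphics:
            return [self.fx_rng.uniform(0.0, 360.0) for _ in range(50)]
        return []

    def ai_pick_target(self, players):
        # Simulation. It must consume numbers from the deterministic generator only.
        return self.sim_rng.choice(players)

players = ["Player X", "Player Y", "Player Z"]
mine = Machine(shared_seed=42, fancy_graphics=True)
yours = Machine(shared_seed=42, fancy_graphics=False)

mine.render_explosion()   # 50 calls to my ND-Generator
yours.render_explosion()  # no calls at all
print(mine.ai_pick_target(players) == yours.ai_pick_target(players))  # True
```

If `ai_pick_target` used `fx_rng` instead, which is the shape of the Novalith autocast bug, the two machines would have no reason to agree.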
3. Finally, desyncs can be caused by bad state initialization. One of the key ideas of determinism is that given the same initial conditions, a series of operations on the state of the system will generate the same result on any machine. Naturally, if the initial conditions are different you are screwed from the get-go. When programming RTS games it's very important that the various objects you create (ships, buildings etc) always have the same state from the start of the game. To be honest, we rarely release code to the public that has bad state initialization because this type of desync is pretty easy to detect as soon as the game starts (as opposed to the other kinds, which take a long time to occur) and our testers don't have to play for hours upon hours to see if one exists. However, there are some special cases where bad state initialization sneaks in and is very difficult, if not nearly impossible, to detect because the conditions that expose it are highly improbable. In a sense, bad state initialization is both the easiest and the most difficult type of desync to find. It turns out the sync bug I was tracking on the weekend was one such bug. It has existed since last spring, and until Sunday afternoon I swore it didn't exist. Here is what caused it and why it was so elusive:
The Failing Market:
The market system has two special state variables called "stateStartTime" and "stateEndTime" that control the time interval of various market states (e.g. Metal Boom, Crystal Crash etc). These values were not initialized properly. As I said above, this is typically caught pretty quickly, but this case falls under the very improbable. If two players start their first multiplayer game, both their market state variables will be incorrectly initialized in the same way, so even though they are incorrect, at least they are in sync. Every time they end a game, those two values are left in whatever state they were in and carried into the start of the next game. But even then, those two players can continue playing with each other all day long without any sync problems. So suppose they decide to play against someone else. That new player's market values are not screwed up in the same way theirs are. Normally, in this case they would go out of sync right away and we wouldn't have much of a problem. Easy find, easy fix. Nope, not in this case. How is this possible? It's clear that everyone's market values are completely different, yet they stay in sync? As I said to myself on Sunday, "wtf!!!???!!!"
The reason they stay in sync is that the market simulation code that uses those particular state values is very rarely executed. It takes a very particular set of conditions to cause the market to enter the state that will use these values to determine the evolution of the market.
You need the following conditions to get this sync bug to occur:
1. One player who has already played a game of Sins.
2. His original game must have entered one of a few, very rare market states.
3. This player must play someone he hasn't already played a game with.
4. He must not have restarted Sins.
5. Their new game must also enter the same, very rare market state.
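Those five conditions can be reproduced with a toy model. Only the two variable names come from the real market code. The `Market` class, the prices, the "rare state" flag and what the rare path does with the values are all invented, and I am assuming the market object lives for as long as the program is running.

```python
class Market:
    """Toy market that lives for the whole process, across several games."""

    def __init__(self):
        self.state_start_time = 0.0  # stands in for stateStartTime
        self.state_end_time = 0.0    # stands in for stateEndTime

    def new_game(self, reset_state):
        # The buggy path skips the reset, so values from the last game leak in.
        if reset_state:
            self.state_start_time = 0.0
            self.state_end_time = 0.0

    def tick(self, now, rare_state_active):
        """Return a made-up metal price for this tick."""
        if not rare_state_active:
            return 100.0  # the common path never reads the timing fields
        # The rare path reads the timing fields, then updates them.
        elapsed = now - self.state_start_time
        self.state_start_time = now
        self.state_end_time = now + 60.0
        return 100.0 + elapsed

def make_players():
    veteran = Market()
    veteran.new_game(reset_state=False)
    veteran.tick(30.0, True)  # conditions 1 and 2: an earlier game hit the rare state
    newcomer = Market()       # condition 3: someone he hasn't played before
    return veteran, newcomer  # condition 4: neither process was restarted

def play_together(veteran, newcomer, reset_state):
    veteran.new_game(reset_state)
    newcomer.new_game(reset_state)
    common_in_sync = veteran.tick(10.0, False) == newcomer.tick(10.0, False)
    rare_in_sync = veteran.tick(20.0, True) == newcomer.tick(20.0, True)  # condition 5
    return common_in_sync, rare_in_sync

print(play_together(*make_players(), reset_state=False))  # (True, False)
print(play_together(*make_players(), reset_state=True))   # (True, True)
```

In the buggy run the two markets hold different values from the very first tick, yet they agree on every price until the rare path finally reads the stale fields.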
So on Sunday, I finally got to play against two players (Krunk and ZanZ) who met these rare conditions. After desyncing with them, they sent me their sync logs and I was able to compare them against my own to determine that their market state variables differed. Using that information, I could trace back what caused the divergence. The process of detecting a divergence and tracing its causes is also a very interesting topic, so maybe I'll do a write-up on that if there is some interest.
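As a rough idea of what comparing logs can look like, here is a small Python sketch. I am not describing the actual Sins log format. It assumes each machine records one checksum of its simulation state per frame, and the state fields and function names are hypothetical.

```python
import hashlib

def checksum(state):
    """Hash a dict of simulation values into one hex string."""
    text = repr(sorted(state.items()))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def record_log(states):
    """Build a log of (frame number, checksum) pairs from per-frame states."""
    return [(frame, checksum(state)) for frame, state in enumerate(states)]

def first_divergence(log_a, log_b):
    """Return the first frame whose checksums differ, or None if the logs agree."""
    for (frame_a, sum_a), (frame_b, sum_b) in zip(log_a, log_b):
        if frame_a != frame_b:
            raise ValueError("logs are not aligned frame by frame")
        if sum_a != sum_b:
            return frame_a
    return None

my_log = record_log([
    {"metal_price": 100.0, "ships": 12},
    {"metal_price": 100.0, "ships": 13},
    {"metal_price": 120.0, "ships": 13},
])
their_log = record_log([
    {"metal_price": 100.0, "ships": 12},
    {"metal_price": 100.0, "ships": 13},
    {"metal_price": 90.0, "ships": 13},
])
print(first_divergence(my_log, their_log))  # 2
```

Knowing the first frame that differs narrows the search to whatever the simulation did on that frame, which is where the real detective work starts.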
Before we officially release 1.12 I'd like to have the fix tested with a lot more people. You can grab a special 1.12 build at http://www.ironcladgames.com/sins/Sins112SyncTest.rar or http://www.ironcladgames.com/sins/Sins112SyncTest.zip if you want to give it a shot. It also fixes the buff stacking issue. Just extract the exe into your Sins install folder and run it. You will only be able to play against people also using this build so don't overwrite your 1.11 exe if you want to jump back and forth between versions. Also keep in mind that this new exe will need to be allowed through your firewall.
It's also possible there is another sync bug out there, but I doubt it very much given that all the logs from the weekend pointed to the same cause. But just in case, I won't be making any "monkey's uncle" claims like I did before.
Special thanks to everyone who participated in tracking this down, especially:
(in no particular order)
and of course SpaceFish for his sync snapshot code.