Monday, February 25, 2013

The bounce-pad hack in LEGO Universe

The bounce-pads (also known as "bouncers") in LEGO Universe were a means of travel: when you stepped on one, it sent you to a specific location in the map, usually somewhere nearby that you could not otherwise reach. Bounce-pads were one of the first gameplay elements created during the production of the game (almost 3 years before the game was released), and had no real issues during the entire production run or alpha/beta testing, so we were all somewhat surprised to run into a huge bug on the very first day the game opened up to the public.

The whole company gathered in the gym around some big-screen TVs so we could watch the game go live for the first time. We watched a few people get in the game and play for a short time, and then we all scattered back to our desks so we could log in and play the game ourselves. No sooner had I created my first character than a person from the live service team was at my desk describing a serious bug. Apparently a large number of players were stuck in the very first area of the game, "The Venture Explorer", a small spaceship that served as an introduction level to teach players the basics. About two thirds of the way through the map there is a spot where you must quickbuild your first LEGO model, a bounce-pad, and afterwards you step on it and it bounces you to the next NPC, who gives you your next mission. It seemed that about 1 in 100 players would build the bounce-pad but could not step on it to get bounced, so they had no way to get to the NPC to get the next mission.

We had some in-game GMs talking with some of the players having this problem, and apparently they could see the bounce-pad but they couldn't step on it. The code responsible for bouncing the player was the bounce-pad code, which set the player's velocity when the physics system told it that the player had collided with the bounce-pad. Somewhere in that flow something was broken, and I wanted to find out what it was. Nobody in the office was experiencing this bug, and none of our testers had seen it at any point during production either. Because we couldn't reproduce this in-house, my next thought was to see what information I could get from the clients that were seeing the bug. As a gameplay programmer I didn't really know the details of what kinds of reporting and tools the live team had set up with the game to get me information about this problem. As it turns out, there was almost nothing.

The client version of the game was set up to generate log messages for any errors, and there was a good chance the log file might tell me if something was amiss, like the physics for the bounce-pad failing to load, or something going wrong in the collision check. Sadly, the live team never got around to setting up a way for the clients to send us their logs, or for a GM to send a message to the client's program and have it return the logs to us. At the time they said something about possible legal reasons for not having information about players sent to us automatically, which seemed a bit ridiculous since the logs didn't contain any account information besides the name of the in-game character and a 32-bit account ID, which is worthless even if it gets into the wrong hands. Anyway, getting any information from the client was impossible at the time.

Server log files were one thing I did have access to. Unfortunately, this looked like a client issue, since not everyone was seeing it and since the functionality to make the player bounce was entirely client-side, with only some server-side monitoring to check for cheating.

My next idea was to use one of the in-game tools I had written during development, which produced huge amounts of data about any object in the world. I was hoping this tool could find any issues with the bounce-pad itself. The tool would analyze thousands of points of data on an object (at run-time), check for any inconsistencies, and report them back. This tool was not built into the version of the client that players used, so I would need to build an internal version of the client and log in with it. Additionally, for security reasons the server would not return the requested information unless you were using a GM account. It took a while, but finally I was given a temporary GM account to analyze the problem, and still we found nothing, mostly because the client information about the object was based on my client, which was working fine.

After about a day of sifting through logs and using tools to try to find any issues, I had found nothing, and in the meantime GMs were having to sit in the game and teleport these players to the NPC so they could get their missions. The pressure was on to find a solution, but the only information I had to go on was that a small percentage of players were seeing a problem. Because this appeared to be a client-side problem and there was no system to get client logs back to me, there was nothing I could put in the game to get me any information about it. This problem was on me because I had written the bounce-pad system, the system that now appeared to be broken.

I enlisted the help of a couple of fellow gameplay programmers to see what we could do to reproduce the problem in-house. We tried removing the physics asset from the computer to see how the game would react, but when the game started the patching system would see the missing asset and simply download it again. There were code paths that could be hit if a physics file failed to load, so we put in some code to force the physics asset to fail to load. In that case the game put in a fallback physics shape (a 1x1x1 cube), and even though that wasn't the proper shape it was still enough that the player could touch it and the system would respond and bounce them, so that was a no-go. We checked to see if maybe the collision could be succeeding and somehow the bounce-pad code was failing to translate that into a bounce, but we couldn't see any point of failure, or a way to force it to fail.
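
To illustrate the fallback path described above, here is a minimal sketch in Python. It is not the game's actual code: the `Box` type, the `load_collision_shape` function, the `loader` callable, and the use of `OSError` to signal a failed load are all assumptions made for the example. Only the 1x1x1 fallback cube comes from the story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned collision box (dimensions in world units)."""
    width: float
    height: float
    depth: float


# The 1x1x1 cube used when the real physics asset cannot be loaded.
FALLBACK_SHAPE = Box(1.0, 1.0, 1.0)


def load_collision_shape(asset_name, loader):
    """Return the shape for asset_name, or the fallback cube on failure.

    `loader` is a hypothetical callable that returns a shape or raises
    OSError when the asset is missing or unreadable.
    """
    try:
        return loader(asset_name)
    except OSError:
        # Wrong shape, but the player can still collide with it,
        # so the bounce-pad would still receive a collision event.
        return FALLBACK_SHAPE
```

This is why forcing a load failure did not reproduce the bug: as long as some shape exists, a collision can still be reported to the bounce-pad.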

So here we were with a problem we could not reproduce, a system we couldn't seem to make fail, and no way to get any information from the players who were seeing the problem. Leaving the bug as-is was unacceptable, as it would mean a lot of lost customers or the expense of GMs permanently stationed near the broken bounce-pad to teleport players. So the solution was a hack.

During early development of LEGO Universe, almost all gameplay was entirely server-based. Things like attacking, picking up power-ups on the ground, and using bounce-pads were done entirely on the server, and then the server would inform the client of the event. This was very secure but it resulted in laggy gameplay, which didn't work well for an action game like LEGO Universe. Along with other systems, the bounce-pads were changed so that the client-side object did the bouncing of the player and simply told the server what it had done, so that the server could check for any possible cheating or hacking. Once I remembered that bouncing used to work on the server years ago, and that the bounce-pads were still properly loading on the server, the solution presented itself. I put in code on the server so that if the player stood on a bounce-pad for more than half a second and the server did not get a message from the client's bounce-pad saying the player had bounced, it would assume the bounce-pad on the client was broken and bounce the player from the server. The result was that for those 1 in 100 players seeing the problem, that one bounce-pad on the first level would feel a little bit laggy but it would work. We also set up some server-side logging for any time the server-side bounce-pad needed to take over, and we found that we were only ever seeing the logging for that one bounce-pad in the first level of the game. For some reason that we never tracked down, it was only ever that one asset that exhibited this problem in the game; there was never another problem related to the physics for an asset not properly loading.
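
The server-side fallback can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not the code that shipped: the `ServerBouncePad` class, its method names, the per-tick `update` call, and the shape of the log entries are all hypothetical. Only the half-second timeout, the "no bounce message from the client" condition, and the logging of takeovers come from the description above.

```python
from dataclasses import dataclass, field

# "More than half a second" is the threshold given in the post.
BOUNCE_TIMEOUT_SECONDS = 0.5


@dataclass
class ServerBouncePad:
    """Hypothetical server-side watchdog for one bounce-pad."""
    pad_id: str
    launch_velocity: tuple
    # player_id -> time the player was first seen standing on the pad
    _standing_since: dict = field(default_factory=dict)
    # (pad_id, player_id, time) for every server-side takeover
    fallback_log: list = field(default_factory=list)

    def on_client_bounce_message(self, player_id):
        """The client says it bounced the player, so stop the timer."""
        self._standing_since.pop(player_id, None)

    def update(self, player_id, is_on_pad, now):
        """Call once per server tick.

        Returns the velocity the server should force on the player,
        or None if no server-side bounce is needed.
        """
        if not is_on_pad:
            self._standing_since.pop(player_id, None)
            return None
        started = self._standing_since.setdefault(player_id, now)
        if now - started > BOUNCE_TIMEOUT_SECONDS:
            # Assume the client's bounce-pad is broken and take over.
            self._standing_since.pop(player_id, None)
            self.fallback_log.append((self.pad_id, player_id, now))
            return self.launch_velocity
        return None
```

For a healthy client the bounce message arrives well before the timeout and the server never intervenes. For an affected client the player waits about half a second and is then launched by the server, which matches the "a little bit laggy but it would work" behavior.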

We did make the assumption that the physics were likely failing somehow on those clients, because the only way a player could fail to bounce on the client was if a physics collision message was never sent to the bouncer. So I do feel some comfort in that the system I wrote may not have been the problem; the problem just affected my system. Even though as a team we take responsibility for the entire product, rather than saying "this is my code, that is your code", it still feels good when you've written a system that works well, so you never want to see it break down and fail. It remains the biggest hack that I've ever made to a released product, but I don't feel bad about it. I feel like I made the best of the situation with what I was given, and in the end the players never knew the difference.

6 comments:

  1. Interesting. Do you happen to remember which systems were handled exclusively by the server? I assume it was just attacking and drops since those would be the main area for cheating/hacking. Or was everything handled by the client in this manner?

    Also, are there any other general hints about the servers you could share? There is an open source project on Sourceforge that is trying to recreate the servers. With the servers shut down, the only thing we have to go by are our packet captures and whatever information we can get from the client. Granted, we are stuck with the initial client/server handshake so it has not gotten very far.

  2. I can't really talk about much in detail because I'm under an NDA that doesn't expire. There was more than just attacking and drops that were verified from the server, things like vendor purchases for example. Every item generated in the game has a global ID that must be generated by the server for it to be valid, if the server doesn't know about the ID then it won't accept it, so it cannot be created by a client. Speed hacks are another example of something the server checked for, as well as teleporting hacks. There was even a hook on the server where we could have checked for people hacking bouncers. Players could, for example, alter the velocity a bouncer gives them and end up at a different location because of that, and that was something we could have checked for, but we didn't end up with people cheating in this way so we never set up the server-side logging for it.

    Replies
    1. Thanks for the reply. Do you think there were a lot of people actually trying to cheat? I know there were plenty of people trying to find glitches. I had someone show me the one in Nimbus Station behind the portal to Starbase 3000. He liked to go down there but it got pretty boring to me after a minute or two. You would have to smash to get out of it so I did not go there more than that one time.

      I guess your NDA pretty much sucks for me :) Do you know/can you talk about how they used RakNet and what version? It does not seem RakNet was used "per the manual." The initial 09/0A throws me since RakPeer-->receive only recognizes RakNet packets. From what I can gather, it looks like it was version 3.5 up to 3.7. Not using RakNet "per the manual" seems odd so it adds to my confusion.

    2. I'm not sure how many players were trying to cheat, but we knew that some people were able to do speed hacks, and I believe some other cheats that involved tweaking values that were in memory on the client, and were client authoritative. The client's position was essentially client-authoritative, so we had to add in some checks, like checking if the speed was faster than it should be. We also made some of the most commonly hacked values harder to modify by obfuscating them.

  3. Hello again. I was wondering if you could tell me if LEGO Universe uses RakNet Bitstreams? The packet encryption has been broken and we are going through them trying to determine different things. I assume that since I am not asking specifics this will not violate your NDA. If it does, then I will leave you alone :)

    My other question had been answered when one of our team determined it had to be RakNet version 3.25. That is what happens when you start with the latest version not realizing it had changed that much :)

    Replies
    1. I haven't maintained this blog in forever, but I feel bad having not responded to messages. It's been over 7 years since you asked this question, so by now you probably know, but yes, LU uses RakNet Bitstreams.
