How did Eve Online use Honeycomb’s observability tooling to migrate from monolith to microservices, and from on-premises to the cloud?
Imagine an arcane mainframe whose inner workings no one fully understands. CCP Games Technical Director of Infrastructure Nick Herring had heard such things existed but had never experienced one firsthand. Then he started working on Eve Online, a massively multiplayer online role-playing game (MMORPG) created 20 years ago that has set Guinness World Records for the scale of its online battles in space.
Since the game’s first star system went live, Eve Online has grown from an arcane fleet of monoliths running the same software stack to a partially decoupled architecture in which new features run as microservices on Amazon Web Services (AWS), on a platform known as Quasar. Getting there involved two parallel migrations: from the monolith to microservices, and from on-premises to the cloud.
While migrating across these parallel universes, CCP Games kept the spaceship running the whole time, ramping observability up from cryptic error messages to full-ecosystem visibility with Honeycomb’s observability tooling. Along the way, CCP devs also had to adapt to the cultural shift that came with a new architecture and a new approach to observability. This post looks at how they used observability to ease the transition to Quasar and shine a light on how all of their services work together.
An interstellar landscape in need of API traffic control
In the 20 years since Eve Online began, not only has the game’s functionality grown and expanded, but so have the player community and the complexity of its activities. The mechanisms that manage player-created alliances, corporations, and kill boards connect to the game through a large number of APIs. That network has shaped Eve Online into something that behaves like a traditional application: most of its API interactions are request/response rather than streaming.
This growth in APIs was also the turning point that led to the creation of Quasar. CCP devs wanted more control over how they deployed features and managed API traffic, so they created the Eve Swagger Interface, known as ESI, which was Quasar in its infancy. Using RabbitMQ for messaging and gRPC as a framework, Quasar offered a more flexible communication mechanism that made it easier to expand the player ecosystem.
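To make the messaging side of that concrete, here is a minimal sketch of publishing a request-style message to RabbitMQ with the pika client. It illustrates the general pattern rather than CCP’s actual code; the queue name, payload fields, and broker address are all assumptions.

```python
# A minimal sketch (not CCP's actual code) of request-style messaging over
# RabbitMQ using the pika client. Queue name and payload are illustrative.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Declare a durable queue so messages survive a broker restart.
channel.queue_declare(queue="esi.character.requests", durable=True)

# Publish a request-style message; a consumer service would pick it up,
# do the work, and reply on another queue or over gRPC.
payload = {"character_id": 2112625428, "operation": "get_killboard"}
channel.basic_publish(
    exchange="",
    routing_key="esi.character.requests",
    body=json.dumps(payload).encode("utf-8"),
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)

connection.close()
```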
But with greater expansion came a greater traffic management burden. ‘It got to the point where we couldn’t possibly reason or see what we needed to understand the ingress into this system and the effect that it was having,’ Nick says. With ESI serving third-party APIs and a mobile client on top of it, it was tough to get a handle on all the new traffic patterns.
Until recently, observability was less science and more fiction
During the massive proliferation of APIs and connections, Eve Online ran Prometheus to monitor its Stackless Python architecture. But in terms of observability, error messages were something of a black hole. If a database call or Python function was taking too long, determining the cause came down to someone’s best guess. ‘There was a lot of institutional knowledge around how to read those chicken bones,’ Nick says. For instance, pinpointing the guilty line of Python relied on one developer’s C++-level analysis that was more or less impossible to explain to the larger team. In addition, the alerts were generic, indicating things like CPU spikes or RAM about to run out.
That’s when CCP decided to bring in Honeycomb for its high-cardinality observability, which allows for granularity at scale. With the deeper level of tracing enabled by Honeycomb, the team could understand how quickly messages were being processed and where they were getting stuck. ‘Before, you basically got in the neighborhood, but the tracing allows us to pick a room in a house,’ Nick says.
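The article doesn’t show CCP’s instrumentation, but the idea behind that level of tracing can be sketched with OpenTelemetry’s Python SDK sending spans to Honeycomb over OTLP. The service name, attribute names, and API key placeholder below are illustrative assumptions.

```python
# A minimal sketch of tracing a message handler with OpenTelemetry and
# exporting spans to Honeycomb over OTLP. Names are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "quasar-esi"}))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="api.honeycomb.io:443",
            headers={"x-honeycomb-team": "YOUR_API_KEY"},  # placeholder key
        )
    )
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)


def handle_message(message: dict) -> None:
    # High-cardinality attributes (player id, operation) are what make it
    # possible to slice a query down to a single misbehaving request.
    with tracer.start_as_current_span("process_message") as span:
        span.set_attribute("app.character_id", message["character_id"])
        span.set_attribute("app.operation", message["operation"])
        ...  # do the actual work here
```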
Honeycomb makes modern systems easier to observe because its datastore and query engine are purpose-built to detect patterns across billions of requests in under three seconds, even with highly unique and granular data where problems lurk behind any arbitrary combination of attributes. This gives game developers fast feedback loops to understand how their code operates in the chaotic real world of production.
Eve Online launches Honeycomb’s observability tooling and shifts a culture
Around the same time that Eve Online decoupled its monolith into microservices, the game also moved from on-premises infrastructure and Google Pub/Sub to AWS. The migrations set off a cascade of cultural changes around deployment cycles, releases, and engineering responsibilities. ‘We started learning about how a monolith conditions engineering practices,’ Nick says. ‘Having to worry about absolutely everything in a monolith to just being concerned with a section of it was the biggest cultural change.’
Being able to visualize not only issues but also microservice boundaries with Honeycomb tracing has been instrumental. Now, CCP devs can share a link to a specific Honeycomb query where teammates can explore the data directly, quickly spot anomalies, and start building teams that rely on each other.
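Seeing those boundaries in a trace depends on each service passing trace context along with its messages. The sketch below shows one common way to do that with OpenTelemetry’s propagation API; the function and span names are assumptions for illustration, not CCP’s code.

```python
# A minimal sketch of carrying trace context across a service boundary in
# message headers, so producer and consumer spans join into a single trace.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)


def publish_with_context(publish, payload: dict) -> None:
    # Producer side: inject the current trace context into the headers
    # that travel with the message.
    with tracer.start_as_current_span("publish_request"):
        headers: dict = {}
        inject(headers)
        publish(payload, headers=headers)


def consume_with_context(payload: dict, headers: dict) -> None:
    # Consumer side: extract the context so this span becomes a child of
    # the producer's span, and the trace crosses the service boundary.
    ctx = extract(headers)
    with tracer.start_as_current_span("handle_request", context=ctx):
        ...  # process the payload
```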
Are we at the end of Eve Online’s galactic migration?
Two migrations later, Quasar is only the beginning. For Nick and his team, AWS and microservices were just stops along the technological journey that has brought Eve Online to where it is today.