Complex systems are all around us. In the technology world, we often pigeonhole the term to refer to our computer services, but the concepts of complex systems are much broader than that.
Anything that has more than one component with a behavior can be considered a complex system, from living organisms to political systems, from the cars that drive on our roads to the buildings that constitute our cities and towns. In fact, those road systems and cities are complex systems as well.
Complex systems are difficult to understand largely because they exhibit emergent behavior. Emergence is when a system exhibits behaviors that you could not readily infer from its component parts and which only come to be when these components are working in unison.
Like all complex systems, computer services are difficult to understand. At the heart of this is emergence. The more components we introduce, the more observable behaviors the system is capable of producing, and we can never predict all of them ahead of time. As our services become deeper, larger, and more complex, the number of these emergent behaviors will continue to grow.
This is why observability is such an important topic. Observability is the capacity to infer the internal states of a system from its external outputs. It is important because we cannot continue to rely on traditional metrics and monitoring alone to understand modern software architectures. We now live in a world of microservices, multiple dependencies, and an ever-growing reliance on SaaS, cloud vendors, third-party products, and open source components that create ever more complex systems. These systems are increasingly difficult to observe, and therefore more difficult to understand and diagnose when something behaves in an unexpected way.
I’ve been teaching people how to measure their services for a long time. The biggest problem I’ve encountered while educating people about this often lies in the definition of the problem space. It is no accident that this article started by defining complex systems, emergent behavior, and observability. People often think they know what observability means and entails, but that frequently turns out not to be the case.
When you ask an engineering organization about their observability, they’ll often assume you’re asking them about their monitoring, metrics, or logs. All of these can be important parts of an observable system, but they generally don’t cover the trickiest piece: emergent behavior.
To produce metrics and logs for a service, engineers have to decide which internal states to expose. In most circumstances, someone has to write a line of code that decides to log an event that took place or export a data point to a metrics system. This is often excellent data that you can use to understand the current state or recent behavior of a single component of your complex system, but it can rarely tell you much about the system as a whole. Emergent behaviors can theoretically be predicted, but in practice, this is very difficult. You generally won’t really know how a multi-component system will behave until all of those components are in place.
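As a minimal sketch of what "deciding what to expose" looks like in practice, consider the following. It uses only the Python standard library; the service name `checkout`, the function `process_order`, and the `metrics` counter (a stand-in for a real metrics client) are all hypothetical and chosen purely for illustration.

```python
import logging
from collections import Counter

logger = logging.getLogger("checkout")  # hypothetical service name
metrics = Counter()  # stand-in for a real metrics client (assumption)


def process_order(order_id: str, amount: float) -> bool:
    """Handle one order, exposing only the states someone chose to expose."""
    if amount <= 0:
        # An engineer decided this rejection deserved a log line and a data point.
        logger.warning("rejected order %s: invalid amount %.2f", order_id, amount)
        metrics["orders_rejected"] += 1
        return False
    # The happy path is counted but not logged: another explicit choice.
    metrics["orders_processed"] += 1
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    process_order("A-1", 25.0)
    process_order("A-2", -3.0)
    print(dict(metrics))
```

Everything this component reveals about itself was chosen in advance by whoever wrote it. Nothing here describes how `checkout` interacts with the services around it, which is exactly where emergent behavior appears.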
If our first problem is getting teams to understand what observing a complex system actually entails, the second problem is that these sorts of systems are often spread across many teams. The more components your service has, the more likely it is that different teams are writing, maintaining, and operating them. This exacerbates the difficulties of observing your systems properly because often no one owns the discovery of emergent behaviors across the system as a whole.
For example, in common microservices architectures, each component of the system may have a team assigned to it. Each team may find that the logging and metrics for their own service are sufficient to keep things running. But as these components interact, emergence will come into play and no one will have the data required to understand what the entire system is currently doing (or why).
One example I witnessed in the real world involved the upgrade of a logging agent. This upgrade caused some metadata for each log line to change. The log processing pipeline didn’t mind this difference at all, but the data processing job that ran at 03:00 every morning to process the previous day’s logs did care. Suddenly people were being paged in the middle of the night with no clear understanding as to why this job couldn’t run. There was no real observability into the entire pipeline, and the data processing job was five layers removed from the logging agents.
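This kind of failure can be sketched in a few lines. The example below is an illustration only: the field names (`host` renamed to `hostname`), the log format, and every function are assumptions, not details of the actual incident. It shows a pipeline stage that never inspects metadata, a downstream job that depends on it, and a small contract check that could surface the change at ingestion.

```python
import json

# Hypothetical log lines before and after the agent upgrade (assumed format).
OLD_LINE = '{"msg": "user login", "meta": {"host": "web-1"}}'
NEW_LINE = '{"msg": "user login", "meta": {"hostname": "web-1"}}'


def forward(line: str) -> dict:
    """Pipeline stage: parses and forwards; it never looks at metadata keys."""
    return json.loads(line)


def nightly_host_counts(records: list[dict]) -> dict[str, int]:
    """Downstream batch job: silently depends on meta['host'] being present."""
    counts: dict[str, int] = {}
    for record in records:
        host = record["meta"]["host"]  # raises KeyError once the key is renamed
        counts[host] = counts.get(host, 0) + 1
    return counts


def missing_fields(record: dict, required=("host",)) -> list[str]:
    """Contract check that could run at ingestion to flag the change early."""
    meta = record.get("meta", {})
    return [key for key in required if key not in meta]


if __name__ == "__main__":
    for line in (OLD_LINE, NEW_LINE):
        record = forward(line)  # succeeds for both formats
        print(missing_fields(record))
```

The pipeline stage accepts both formats, so every team's own telemetry looks healthy. Only the job several layers downstream fails, and only a check that encodes the downstream expectation makes the dependency visible where the change happens.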
This is not an easy problem to solve. But there are tools at your disposal that can help.
First, ensure everyone in your organization understands the concepts of complex systems, emergence, and observability. Don’t let them fall prey to marketing departments that dilute these terms! None of these things are ‘monitoring’ and you cannot explain emergent behaviors by only knowing the CPU utilization percentages of your containers.
Second, encourage people to think about how to expose telemetry that can help troubleshoot issues not only with their own service, but also with any services that may depend upon it. Make this data available to all teams because you never know who needs to be aware of the state of a certain component system.
Third, introduce distributed tracing using projects such as OpenTelemetry. Distributed tracing by definition measures the interactions between many different services as opposed to exposing details about only a single one. This can help you discover emergent behaviors much more easily.
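To make the idea concrete, here is a minimal sketch of the mechanism underneath distributed tracing: a shared trace identifier passed from hop to hop, with each span recording its parent. This is deliberately not the OpenTelemetry API. It is a standard-library illustration, and the `Span` class, the `COLLECTED` list (a stand-in for a tracing backend), and the three service names are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    service: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None


COLLECTED: list[Span] = []  # stand-in for a tracing backend (assumption)


def start_span(name: str, service: str, parent: Optional[Span] = None) -> Span:
    """Create a span, reusing the parent's trace_id so the hops stay linked."""
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    span = Span(name, service, trace_id, parent_id=parent_id)
    COLLECTED.append(span)
    return span


def log_agent() -> Span:
    root = start_span("emit_log", "log-agent")
    pipeline(root)  # the context travels with the request
    return root


def pipeline(parent: Span) -> None:
    span = start_span("forward", "log-pipeline", parent)
    batch_job(span)


def batch_job(parent: Span) -> None:
    start_span("aggregate", "nightly-job", parent)


def services_in_trace(trace_id: str) -> list[str]:
    """List every service that took part in one trace, in recorded order."""
    return [s.service for s in COLLECTED if s.trace_id == trace_id]
```

Calling `log_agent()` and then `services_in_trace(root.trace_id)` on the returned span lists every service the request touched. No single team's logs or metrics hold that view. A real deployment would rely on a tracing library to propagate this context across process and network boundaries rather than passing objects between functions.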
And, finally, understand that observability isn’t a thing you do once. It’s not something you can set an OKR for, tick off a checklist, and move on from. Observability requires a different state of mind when thinking about your complex systems and how to both measure and react to their emergent behaviors. Give your teams the time and resources they need in order to adjust to this shift, not only in the near term but also far into the future.
You’ll never be able to capture or anticipate all of the potential behaviors of your systems, and as they grow in size and complexity this will compound over and over. That’s just fine! What you can do is arm yourself and your teams with the correct way of thinking about systems and the understanding that behaviors are emergent and not always predictable, and provide the tools and telemetry that best help them observe their world.