Essentially, observability is how well you can understand the internal state of a system from its external outputs. This definition was coined more than 60 years ago, well before the modern internet and most of the services we know today. The importance of observability cannot be overstated, especially as it pertains to modern systems. As these systems become more complex, run on ephemeral infrastructure, or on serverless platforms where we have no access to the underlying hosts, the outputs become the only visibility we have into the system. If we want to know what is going on with our services, we need to rely on these outputs for a general understanding of their operation, as well as for troubleshooting when things don’t go as expected.
In this article, I will discuss the groundwork for getting observability started at your company. This won’t be an explicit map, because there is no single route to improving observability. Likewise, I will not discuss specific tools, as the toolchain you rely on will depend heavily on your system’s architecture. There is no one right way to do observability, but the list below should give you an overview of how to get started, or where to focus your efforts to continue a journey you have already begun.
These tips are not listed in any particular order. Like I said, there is no explicit roadmap to get from where you are today to where you want to be tomorrow. This article should be seen as a helpful set of bullet points for you to consider while starting or continuing your observability journey.
1. It has to start somewhere
When beginning your observability revolution, you need to know where to start. To determine which system is ripe for observability, look at your incidents. These incidents will show which systems are regularly failing for one reason or another. From there, dig into the remediation actions your team is taking, the mean time to recovery (MTTR), and the process required to understand what was going on in the first place.
While every service deserves observability, you have to start somewhere. If you try to focus on all of your services at once, you will quickly find that you are attempting to boil the ocean. Focus on the single service you will get the most value from and take it from there. This reduces the scope of what needs to be done and lets you see a quick return on your time. While you are building out an observability plan for this first service, you should still keep other services and architectures in mind; it is important to ensure that the way you capture outputs will work for them too. While I don’t recommend digging into all services, you should be aware of them and cognizant of how you will implement observability moving forward. This will help prevent tool creep and ensure that you are not repeating work when you move to the next service.
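If your incident tracker can export incident records, even a rough script can help identify that first service. Here is a minimal sketch in Python, assuming a hypothetical CSV export with `service` and `minutes_to_resolve` columns; the file name and columns are placeholders, so adapt them to whatever your tooling actually produces.

```python
import csv
from collections import defaultdict

def rank_services(path: str):
    """Rank services by incident count and mean time to recovery (MTTR).

    Assumes a CSV export with 'service' and 'minutes_to_resolve' columns;
    adjust the column names to match your incident tracker.
    """
    counts = defaultdict(int)
    total_minutes = defaultdict(float)

    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            service = row["service"]
            counts[service] += 1
            total_minutes[service] += float(row["minutes_to_resolve"])

    # Sort by incident count, then MTTR, so the noisiest and
    # slowest-to-recover services float to the top.
    return sorted(
        ((svc, counts[svc], total_minutes[svc] / counts[svc]) for svc in counts),
        key=lambda item: (item[1], item[2]),
        reverse=True,
    )

if __name__ == "__main__":
    for service, incident_count, mttr in rank_services("incidents.csv"):
        print(f"{service}: {incident_count} incidents, MTTR {mttr:.0f} min")
```

Whatever the script says, sanity-check the result with the people who were on call; the numbers are a starting point for a conversation, not the decision itself.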
2. Focus on value and improving people’s lives
When you bring up observability to people who have never considered it, you will need to demonstrate its value and how it will improve their working lives. Focus on how observability makes troubleshooting and understanding the operation of a system much faster; this will make it easier to sell people on the idea. The value of observability is in how quickly you can understand what is going on. This is helpful not only when troubleshooting production, but also during development when services aren’t behaving as expected. Show that faster MTTR reduces on-call fatigue. Prove that observability also reduces the tool creep involved in troubleshooting an issue, both during development and in production. Most people don’t want to go spelunking through servers and tools to gather details; they simply may not know a better way.
3. Don’t make assumptions – ask questions
It might seem obvious, but a lot of people will assume they know what others are looking for when it comes to observability. If you are not sure what is most important to those around you, the best thing you can do is ask.
You’d be surprised at the little things you can discover, which you would otherwise be unaware of. At the end of the day, observability is designed to improve the lives of the people we work with. If we are not providing insight into the areas they care about, then we are doing them a disservice. It’s a good idea to have a quarterly sit-down with the people you see as your internal customers for this project. Ask them what is working with existing observability and what they are struggling with. Frontline workers are your most important source for learning what needs to be fixed.
Likewise, realize that these are the people who are going to live with the decisions that you make today. Whether it’s updating code to fit your model, writing fresh code to push new outputs, or reading the data that you gather during troubleshooting sessions, their feedback will help you to gauge whether you are making progress and achieving your goals.
This is important even if you think you have a solid foundation because you have done this work before at another organization. The people in your world are constantly changing, so their needs may evolve over time. As previously stated, there is no one way to do observability, so don’t assume that what you did at org X will work at your current org.
4. Set achievable goals and find a measuring stick
Without setting goals and finding a way to measure them, we will not know if our journey has been successful. Once you have chosen how you are going to start, set goals for what you hope to achieve. This could be a reduction in MTTR, an improvement in velocity, or happier engineers. Whatever your goal is, make sure that it is achievable and well-documented. While chasing your North Star, consider the small bite-size chunks that will get you there. These will be your goals and stepping stones.
As an example of a measuring stick, say you want to improve troubleshooting for an application. You could measure this during an incident by recording whether the data required to solve the issue existed. Your goal could be that 80% of incidents on service X have all of the necessary data to assist the on-call engineer. The other 20% will highlight gaps that become future goals.
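To make that measurement concrete, here is a minimal sketch of how the coverage number could be computed, assuming you record a boolean such as `had_required_data` against each incident during its review; the field names and data shape are hypothetical, not a prescribed schema.

```python
def data_coverage(incidents: list[dict], service: str) -> float:
    """Fraction of a service's incidents where the on-call engineer had
    all the data they needed (hypothetical 'had_required_data' flag
    recorded during each incident review)."""
    relevant = [i for i in incidents if i["service"] == service]
    if not relevant:
        return 0.0
    covered = sum(1 for i in relevant if i["had_required_data"])
    return covered / len(relevant)

incidents = [
    {"service": "checkout", "had_required_data": True},
    {"service": "checkout", "had_required_data": False},
    {"service": "checkout", "had_required_data": True},
]

# 2 of 3 incidents had the needed data -> 67%, short of an 80% goal.
print(f"{data_coverage(incidents, 'checkout'):.0%}")
```

The mechanics matter less than the habit: record the answer every time, and review the trend alongside your other goals.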
5. Care and feeding
Observability isn’t something you do once and then forget about. If you treat it that way, you will quickly find that services you thought had great visibility turn into black boxes. This is due to the one constant in our world: change. In some companies, this change happens slowly, on quarterly releases.
In others, changes ship at least daily. With change as a constant, we need to make sure that the way we observe a service continues to make sense. Regularly make time to review the outputs of a system and check that you still understand what is going on under the hood. How often you do this depends heavily on the company’s release cycle.
Releasing multiple times a day? Run exercises monthly to validate observability. Launch weekly? Maybe you check quarterly. Quarterly release cycle? Your releases are probably large enough that you only need to review them before release.
The cadences above are minimum recommendations. You may find that you want to review more often, depending on the size of your changes; for some mature services, you may not need to check as often. The goal is simply that you check. Without validating that what you built still works, you will get the unfortunate surprise of something being broken when you desperately need it.
6. Constant improvement
As systems change, strive to improve observability. If you are at a company that does blame-free postmortems, then you’re in luck! You can use these meetings to surface observability gaps and push to close them. The postmortem is the perfect place to highlight how having more data can greatly reduce the MTTR of an outage. Discussing this can go a long way. Even if you ignore everything else written here, it’s worth noting that postmortems are a great way to push an observability agenda. You hold the keys to making the next outage shorter and less impactful, and you are at a table where that’s exactly the type of thing people want to hear.
If you are at a company that doesn’t do postmortems, you can still push for improvements. The battle here may be a bit harder as an absence of postmortems may indicate a lack of willingness to change. Regardless of this, you can still demonstrate how getting data out of a system can improve investigations moving forward.
If you are not currently doing postmortems, I would argue that you should encourage them prior to your observability push. Postmortems are extremely beneficial for ensuring consistent improvement across your service, not just in observability. Some of the techniques for gaining sway with observability may help here, but honestly, the importance of postmortems is probably its own article.
7. Finding sponsors/partners
In order to get anything done on a team, you need two things – time and the support of others.
Finding a sponsor, such as your manager or an executive in your organization, will ensure that you are given the time to focus on observability. They will make sure your plan gets added to the roadmap so that it can be done properly. Without this time you aren’t stuck, but you may well be pushing this initiative during spare cycles rather than giving it the dedicated attention it deserves. Trust me when I say having allocated time is better for your mental health.
When finding partners, focus on individuals outside of your team. The goal here is not to sell to everyone, but to find people who will help carry the mantle with you. The more business units you partner with, the more services you can make effective change in. These partners will not only help get the work done, they will also help sell the idea further and train their peers.
8. Observability should be as easy as your day job
Consider the consumption and generation of the data you are collecting. For your developers, make sure they have libraries or sidecars to make building out observability seamless. Making it hard to implement is a surefire way for observability to be an afterthought at best, or only done when somebody points out the gap at worst.
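As an illustration of what “seamless” can look like for developers, here is a minimal sketch of an in-house helper built on the OpenTelemetry Python API: one decorator instead of hand-written span management. The decorator name, service name, and business function are assumptions for the example, not part of any standard library.

```python
# pip install opentelemetry-api opentelemetry-sdk
import functools

from opentelemetry import trace

tracer = trace.get_tracer("payments-service")  # hypothetical service name

def observed(name: str):
    """Hypothetical in-house decorator: wraps a function in a span and
    records failures, so developers get tracing without boilerplate."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(name) as span:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    span.record_exception(exc)
                    span.set_status(trace.StatusCode.ERROR)
                    raise
        return wrapper
    return decorator

@observed("charge_card")
def charge_card(amount_cents: int) -> None:
    ...  # business logic stays unchanged
```

The specific mechanism matters less than the principle: if adding a span, metric, or structured log is one line, it will happen; if it takes a code review’s worth of ceremony, it won’t.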
Also, prioritize your consumers. Done right, your data should be consumable by everyone from the executive level down to a level-one support tech trying to make sense of an issue. How the data is consumed will differ from person to person and vary greatly depending on their role. For those not comfortable spelunking, provide useful dashboards so they can gather data at a glance. For those who need to dig deeper, provide effective tools and enablement; they will need to understand where these tools live and how to use them effectively. If going through the tools you’ve provided is harder than a familiar CLI tool, you will quickly notice people reverting to old habits.
Outro
I hope I have given you some tips and food for thought to consider when embarking on your journey, and that this has got your brain going on how to push observability in your own company. If you take nothing else away, understand that the best thing you can do is prove the value of what you are attempting, even if that means implementing observability as a side project to demonstrate exactly how valuable it can be.