The missing piece to improving your user experience might just be adopting an observability strategy.
According to McKinsey, the average share of digital customer interactions jumped from 18% pre-COVID to 55% during the pandemic. This digital acceleration is here to stay, making user experience (UX) a critical aspect of every digital product and online service. But can you really understand UX without a proper observability approach?
Monitoring techniques
To ensure a positive UX, DevOps teams need to collect data to understand how transactions are performing, which often requires several methods and monitoring tools. Primary examples include real user monitoring (RUM), synthetic monitoring, application performance management (APM), and, more recently, distributed tracing. For a more comprehensive understanding, let’s look at each of these monitoring techniques.
APM is a broader term that refers to the process of monitoring, managing, detecting, and mitigating performance issues within an application.
APM was born in the era of big monolithic applications, but with the adoption of microservices running on containers, operations teams encountered many new “unknown unknowns” once code moved into production at scale. A new approach was required to understand how user transactions behaved in a highly distributed and ephemeral environment, hence the need for distributed tracing.
Distributed tracing is a more recent method of tracking the flow of requests and responses between multiple microservices in a distributed system. This is especially important in microservice architectures, where a single user request may involve multiple microservices, making it difficult to identify performance bottlenecks.
With distributed tracing, the service that receives a request generates a unique trace identifier and passes it along to every downstream microservice involved in the transaction. This allows developers to see the complete end-to-end flow of requests and responses and identify which microservice is causing performance issues. Distributed tracing also provides visibility into each microservice’s processing time, as well as how long requests take to travel between microservices. It is a critical tool for ensuring the performance and reliability of cloud-native applications.
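To make the mechanics concrete, here is a minimal sketch using the OpenTelemetry Python SDK (introduced later in this article). The service and span names are invented for illustration, the console exporter stands in for a real tracing backend, and in production the trace context would be propagated between services over HTTP headers rather than an in-process function call.

```python
# A minimal distributed tracing sketch with the OpenTelemetry Python SDK.
# Service/span names are invented; the console exporter is a stand-in for
# a real tracing backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-demo")

def payment_service():
    # Child span: it reuses the trace ID of the active parent span, so the
    # whole request shows up as a single end-to-end trace.
    with tracer.start_as_current_span("payment-service.charge"):
        pass  # call the payment provider here

def checkout_service():
    # Root span: a new trace ID is generated when the request enters the system.
    with tracer.start_as_current_span("checkout-service.handle_request"):
        payment_service()

checkout_service()
```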
Both traditional APM and distributed tracing technologies mainly look for the sources of performance issues from inside the application. To properly understand the full UX, teams also need to monitor the outside world. They can achieve this by asking questions like, “Is my content delivery network (CDN) impacting performance? Or the domain name system (DNS)?”
This brings us to the need for RUM and synthetic monitoring.
RUM captures information in real time from users’ browsers and sends it to the RUM solution for analysis. Using RUM, teams can identify performance issues that may impact UX in the real world.
Synthetic monitoring, on the other hand, uses automated scripts to simulate user interactions with the application. The goal of synthetic monitoring is to identify performance issues and to provide an early warning of potential problems before they impact real users. This monitoring technique can be performed from a specific location or multiple locations, mimicking the actions of a real user accessing the application. It can additionally run scheduled tests at specific intervals, allowing it to continuously monitor the performance of an application. The data collected by synthetic monitoring can be used to measure the availability, response time, and functionality of an application, as well as to identify potential performance issues.
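As a rough illustration, the sketch below shows the core of a synthetic check: an automated probe that measures availability and response time on a fixed schedule. The URL, threshold, and interval are placeholders, not values from any particular tool.

```python
# A bare-bones synthetic probe: scheduled checks of availability and
# response time. URL, threshold, and interval are placeholders.
import time
import requests

URL = "https://example.com/login"  # hypothetical page a real user would hit
TIMEOUT_S = 5
INTERVAL_S = 60
SLOW_THRESHOLD_S = 1.0

def probe(url: str) -> None:
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=TIMEOUT_S)
        elapsed = time.monotonic() - start
        status = "OK" if response.ok and elapsed < SLOW_THRESHOLD_S else "DEGRADED"
        print(f"{status}: {url} -> {response.status_code} in {elapsed:.2f}s")
    except requests.RequestException as exc:
        # This is the early warning: the check fails before a real user notices.
        print(f"DOWN: {url} -> {exc}")

while True:
    probe(URL)
    time.sleep(INTERVAL_S)
```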
Now imagine that you have all those tools. Does that mean that your application is observable? Are you sure that you are collecting the right data, with the right agent?
Observability
Observability should be the central element of IT modernization strategies that aim to support UX improvements.
Observability is the ability to understand the internal state of a system by measuring its external behavior. In today’s complex and distributed systems, it is crucial for understanding and troubleshooting issues, as well as making informed decisions about a system’s performance and capacity.
The origins of observability can be traced back to a single blog post shared by Twitter’s observability team in 2013. The post highlighted the difficulty the team faced in making its distributed architecture “observable”, lighting the match that sparked a wildfire of industry debate. As discussed earlier, with the shift to microservices, containers, and serverless, traditional monitoring tools could no longer provide the expected visibility into these complex systems. The necessary metrics and traces were missing.
How observability came about
Our good old monitoring tools were not ready to handle the volume of metrics generated by the large-scale shift to containers, especially the explosion in cardinality that occurs when microservices and containers are combined.
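A back-of-the-envelope example (with invented numbers) shows why: every combination of label values becomes its own time series.

```python
# Invented numbers to illustrate cardinality, not measurements from a real system.
pods = 500          # short-lived containers, each with a unique pod label
endpoints = 50      # instrumented HTTP routes per service
status_codes = 5    # buckets such as 2xx, 3xx, 4xx, 5xx, timeout

series_per_metric = pods * endpoints * status_codes
print(series_per_metric)  # 125,000 time series for a single request counter
```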
Traditional APM tools and their sampling-based approach no longer allowed a clear understanding of the customer journey, hence the emergence of distributed tracing. As user expectations keep rising, we can no longer afford to lose or overlook potentially crucial information that falls outside the samples of a traditional APM.
The problem of “real time” still remains, however. When containers can start, crash, and be restarted in a matter of seconds (or less!) by Kubernetes, what’s the point of checking only once a minute? Collecting logs, metrics, and traces in real time and exhaustively becomes a necessity in a world where the entropy of our systems is constantly increasing.
This entropy also affects the efficiency of the machine learning (ML) algorithms used by AIOps technology to reduce alert noise and detect anomalies. We all want to improve UX, but not at the cost of greater alert fatigue for teams. Observability provides more data and insights for training and operating AI models than current monitoring approaches do.
Observability is therefore a data problem: the basic datasets (logs, metrics, and traces) must be collected without sampling and in real time. But to transform this data into useful information, you also need to be able to correlate it, provide context, and give it meaning.
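One common way to do that correlation, sketched below, is to stamp every log line with the identifiers of the active trace, so that a metric spike, a trace, and the matching logs can be joined into a single investigation. The sketch assumes the OpenTelemetry Python API covered in the next section; the logger name and message format are illustrative.

```python
# Stamping log lines with the active trace and span IDs so that logs, metrics,
# and traces can be correlated later. Assumes a TracerProvider has already been
# configured, as in the earlier tracing sketch.
import logging
from opentelemetry import trace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")
tracer = trace.get_tracer("orders-demo")

def handle_order(order_id: str) -> None:
    with tracer.start_as_current_span("handle_order") as span:
        ctx = span.get_span_context()
        # The same trace_id shows up in the tracing backend and in this log
        # line, which is what turns three separate datasets into one story.
        logger.info(
            "processing order %s trace_id=%032x span_id=%016x",
            order_id, ctx.trace_id, ctx.span_id,
        )

handle_order("A-1042")
```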
Most of the network operation centers (NOCs) I’ve known in my career have had 20 to 30 different tools, from specialized tools like Oracle Enterprise Manager for database administrators (DBAs) to Nagios for the network, and many others. This required installing multiple agents on the servers/VMs to collect all these logs, metrics, and traces, and each agent is often proprietary. The concern with this approach is that maintaining agents becomes quite heavy when you have a multitude of machines: imagine the number of agents to install, update, or replace (when tools change) in a hybrid multi-cloud environment, with legacy IT talking to microservices running in one or more clouds. Not only is this extremely difficult to manage, but the processing and memory costs are not trivial either.
How observability and OpenTelemetry work together
One of the key tools for achieving observability is OpenTelemetry (a.k.a. OTel). The open-source project grew out of the merger of the OpenTracing project, which was initiated by a group of engineers from Uber and Lightstep, among others, with the OpenCensus project. It is now governed by the OpenTelemetry Governance Committee, which is made up of representatives from various organizations and companies, including Google, Microsoft, Splunk, AWS, and more. The project is backed by the Cloud Native Computing Foundation (CNCF), which provides support and resources for its development and promotion.
Today, it is a vendor-neutral framework for generating, collecting, and exporting telemetry data. It provides a consistent and standardized way of instrumenting applications and services, enabling engineers to understand the health and performance of their systems across different environments and technologies.
Observability without OpenTelemetry doesn’t make sense for several reasons:
- Standardization: OpenTelemetry provides a common set of application programming interfaces (APIs) and protocols for instrumenting applications and services, which means that telemetry data can be collected, stored, and analyzed in a consistent manner across different environments and technologies. This standardization enables your engineers to easily understand and troubleshoot issues, regardless of the underlying infrastructure. Without OpenTelemetry, each application or service would likely have its own way of instrumenting and collecting telemetry data, resulting in a lack of consistency across different environments and technologies.
- Vendor neutrality: OpenTelemetry is vendor-neutral. This allows you to use the tools and services you prefer, without being locked into a particular vendor or ecosystem. You can change your back-end monitoring tool without having to re-instrument your environment, as the sketch after this list illustrates.
- Rich data: OpenTelemetry provides a rich set of data – including metrics, traces, and logs – which enables you to understand the health and performance of your systems at different levels of granularity. This rich data enables DevOps teams to troubleshoot issues, identify performance bottlenecks, and make informed decisions about capacity and scalability.
- Open source: OpenTelemetry is free to use and can be modified and extended to meet the specific needs of an organization. This also means that the community can contribute to the development of the framework and add new features and integrations.
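To make the standardization and vendor-neutrality points concrete, here is a minimal sketch, assuming the standard opentelemetry-sdk and opentelemetry-exporter-otlp Python packages and a placeholder collector endpoint. The instrumentation stays the same when the backend changes; only the exporter configuration does.

```python
# Vendor-neutrality sketch: the instrumentation below never changes, no matter
# which backend receives the data; only the exporter (or the standard
# OTEL_EXPORTER_OTLP_ENDPOINT environment variable) changes.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Placeholder endpoint: point this at whichever collector/backend you use.
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("storefront")
with tracer.start_as_current_span("render-home-page"):
    pass  # business logic is untouched when the monitoring backend changes
```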
OpenTelemetry is an essential tool and is now considered the standard for observability in cloud-native systems; it has become the de facto lightweight single agent supported by all major vendors.
Final thoughts
Hopefully, we can now see how OpenTelemetry, APM, RUM, distributed tracing, synthetic monitoring, and even AIOps come together to help us understand and improve UX.
By combining these tools, organizations can start to identify performance issues, understand root causes, and make data-driven decisions about changes to their applications that will result in a better UX.