The term Observability is moving fast toward the peak of the hype cycle, but it is critical to managing cloud native architectures in any modern enterprise. In this research brief, we stake out our position on the evolution of Observability in IT operations and highlight the potential of using Machine Learning (ML) and Artificial Intelligence (AI) to make Observability more useful in distributed environments. We also highlight some considerations for organizations exploring the use of ML or AI to manage Observability data.
The term Observability is all the rage today among IT operations, SRE and DevOps teams, but there is also quite a bit of confusion in the minds of modern IT decision makers about how it fits into their strategy. They ask questions such as: Is it Monitoring++? Does it help with security? This research brief is meant to address the basic questions modern enterprise decision makers have about Observability and to lay out the nascent but evolving landscape.
Let us start with the Wikipedia definition of the term:
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. The concept of observability was introduced by Hungarian-American engineer Rudolf E. Kálmán for linear dynamic systems.
Even though it doesn’t directly translate to how the term is used in the context of modern enterprise IT, it does offer a partial definition of Observability. Observability is about knowing the internal states of a (distributed) system through knowledge of its external outputs. Knowledge of the internal states of a highly distributed system is critical, and the only way IT operations can infer it is through externally available data, including monitoring data, log data and tracing data.
Our take on what constitutes Observability aligns with Twitter’s original blog post on the topic:
- Distributed systems tracing infrastructure
- Log aggregation/analytics
Clearly, it goes well beyond monitoring and transforms the traditional approach of IT Operations.
Observability: Transforming “what” to “why”
In the traditional world of IT operations, monitoring has focused on what is happening in the system rather than on why it is happening. When you were dealing with monolithic apps on servers hosted in your own data center, this traditional approach to monitoring solved most of the problems IT operations faced in their daily tasks. With the cloud native approach, powered by containers at the infrastructure layer and microservices at the application layer, modern IT faces increasingly distributed environments where traditional ideas about reliability and monitoring break down. Both the infrastructure underneath and the applications on top are more distributed. With the focus on resiliency in modern distributed computing, it is critical to go beyond what is happening and figure out why systems are behaving in a certain way.
Observability brings together monitoring data, log data and tracing data to add correlation and context, so that IT Operations, SRE and DevOps teams can better understand the system dynamics and act proactively rather than follow the traditional reactive approach. In other words, Observability helps people go beyond what is happening in their system to why it is happening. This knowledge is key to managing cloud native infrastructure and applications. Observability helps teams anticipate failures, including grey failures, which are difficult to anticipate with traditional tools and operations. The wealth of knowledge in Observability data helps SRE and DevOps teams manage both known failures and grey failures more gracefully, without impacting the end user experience.
The shift from what to why requires a mindset change among the people responsible for modern IT operations. They shouldn’t consider Observability as:
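As a rough illustration of what "correlation and context" can mean in practice, the sketch below joins the three kinds of data on a shared request identifier. The record shapes, the field names (`trace_id`, `latency_ms`, `duration_ms`, `level`) and the 500 ms threshold are all assumptions made for this example; real monitoring, logging and tracing tools each use their own formats.

```python
from dataclasses import dataclass, field


@dataclass
class RequestView:
    """Everything known about one request, gathered from all three signals."""
    trace_id: str
    spans: list = field(default_factory=list)    # tracing data
    logs: list = field(default_factory=list)     # log data
    metrics: list = field(default_factory=list)  # monitoring data


def correlate(spans, logs, metrics):
    """Group records from the three sources by their shared trace_id."""
    views = {}
    for kind, records in (("spans", spans), ("logs", logs), ("metrics", metrics)):
        for record in records:
            tid = record["trace_id"]
            view = views.setdefault(tid, RequestView(trace_id=tid))
            getattr(view, kind).append(record)
    return views


def explain_slow_requests(views, latency_ms_threshold=500):
    """For each slow request (the "what"), attach the slowest span and error logs (the "why")."""
    findings = []
    for view in views.values():
        slow = [m for m in view.metrics if m["latency_ms"] > latency_ms_threshold]
        if not slow:
            continue
        slowest_span = max(view.spans, key=lambda s: s["duration_ms"], default=None)
        findings.append({
            "trace_id": view.trace_id,
            "latency_ms": slow[0]["latency_ms"],
            "slowest_service": slowest_span["service"] if slowest_span else None,
            "errors": [l["message"] for l in view.logs if l["level"] == "ERROR"],
        })
    return findings


# Hypothetical sample data for two requests.
spans = [
    {"trace_id": "t1", "service": "checkout", "duration_ms": 40},
    {"trace_id": "t1", "service": "payments", "duration_ms": 900},
    {"trace_id": "t2", "service": "checkout", "duration_ms": 35},
]
logs = [
    {"trace_id": "t1", "level": "ERROR", "message": "payments: upstream timeout"},
    {"trace_id": "t2", "level": "INFO", "message": "ok"},
]
metrics = [
    {"trace_id": "t1", "latency_ms": 950},
    {"trace_id": "t2", "latency_ms": 40},
]

findings = explain_slow_requests(correlate(spans, logs, metrics))
print(findings)
```

With this sample data, the latency metric alone only says that request `t1` was slow. The correlated view adds that the `payments` span dominated the request and that an upstream timeout was logged for it.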
- a new set of tools to use for cloud native environments
- a new way of doing the old things
- a more sophisticated monitoring tool that has better debuggability
- a magic pill that makes failures go away
Instead, they should treat Observability as a paradigm that helps them understand the variations in the underlying dynamics of their systems, helping them “smell” potential failures well before they happen and take remediation measures. Observability helps IT Operations/SRE/DevOps teams do their jobs as the systems they manage transition to being more distributed and complex.
Observability: Going beyond rules-based approach
While the idea of Observability is gaining steam, the way the data is used still relies mostly on a traditional approach: using existing knowledge of systems and failure domains to set up rules that help Operations proactively tackle these failures before they happen. Such an approach is effective, but it is not scalable, especially as the underlying infrastructure becomes more distributed and fluid due to easily portable containers, edge computing and IoT devices. The complexity added by these increasingly distributed systems cannot be handled with just the existing knowledge about failure domains. Relying on humans to decide which failures matter means throwing away data that looks useless from the perspective of those known failures. This severely limits the operations team’s ability to find grey failures with potentially catastrophic impacts.
In order to avoid grey failures, it is important to collect data from many more sources and to correlate that data to see patterns that lie beyond existing human knowledge. To do that, one shouldn’t be getting rid of “unwanted data” but should collect more data from more sources to add better context and provide better correlation. Humans cannot process such large volumes of data efficiently. This is where machine learning and artificial intelligence become important.
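To show the limitation concretely, here is a minimal sketch of a rules-based check. The rule names, metric fields and thresholds are hypothetical and chosen only for illustration; they are not taken from any particular monitoring product.

```python
# Hypothetical hand-written rules: each one encodes a failure mode
# that someone on the team has already seen and understood.
RULES = [
    ("high_cpu", lambda s: s["cpu_pct"] > 90),
    ("high_error_rate", lambda s: s["error_rate"] > 0.05),
    ("high_latency", lambda s: s["p99_latency_ms"] > 1000),
]


def evaluate_rules(sample):
    """Return the names of all rules that fire for one metrics sample."""
    return [name for name, predicate in RULES if predicate(sample)]


# A known failure: the error rate crosses its threshold, so a rule fires.
outage = {"cpu_pct": 55, "error_rate": 0.20, "p99_latency_ms": 400}

# A grey failure: every value is under its threshold, yet latency is three
# times its usual level of about 200 ms. No rule anticipates this.
grey = {"cpu_pct": 40, "error_rate": 0.01, "p99_latency_ms": 600}

print(evaluate_rules(outage))  # ['high_error_rate']
print(evaluate_rules(grey))    # []
```

The second sample passes every check even though the system is clearly behaving differently from normal, which is the kind of blind spot described above.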
Observability: Machine Learning to the rescue
Machine learning (and eventually deep learning) can help organizations take advantage of vast amounts of Observability data to identify grey failures that are otherwise not visible to human processing. With edge computing and IoT becoming the norm, modern IT’s scope expands beyond the traditional systems in the data center and the perimeter becomes more fluid. In today’s world, user experience is king, and it is critical for organizations to collect data across all the devices that play a role in modern applications. Not only are the infrastructure and the applications on top distributed, but the devices users consume them on are also distributed and more global. In such an environment, using human centric rules as the driving force for seamless user experience will be of limited help. Such distributed environments not only bring new types of challenges but also magnify human blind spots in troubleshooting.
Machine learning, where computers dig through vast volumes of data to find patterns that can be correlated to failure domains and grey failures, plays a significant role in Observability. Without machine intelligence drawn from vast amounts of data, IT operations/SRE/DevOps teams are looking at only a subset of problem domains and are not in a position to avert catastrophic failures. One good, but very unfortunate, example is the recent engine failure on a Southwest Airlines flight, where a human centric approach to aircraft maintenance failed to notice the grey failure caused by a subsurface flaw. We are not arguing that the use of machine learning would have prevented this accident. We are just making the case that using a large number of Observability data sources, coupled with machine learning or deep learning models, has a higher chance of detecting such grey failures than traditional human centric approaches with a limited set of Observability data.
At Rishidot Research, we feel that the use of Machine Learning and Artificial Intelligence in Observability is only at the beginning stages, with organizations using them to tackle problems that are considered low hanging fruit. We expect this trend to accelerate in the next few years, with ML/AI engines becoming mainstream in a 2-3 year timeframe. The biggest obstacle to the use of machine learning on Observability data is the lack of training data that fits the needs of a given organization. We expect this to change in the future, with web scale cloud providers either offering data from their operations as the seed for training the models (think of an open source approach to sharing data) or offering it as a service for different products to train their models. The other option is to start with a human centric approach using data from the end user organization and slowly let the learning engine learn from the production data. Since it is early days in the use of machine learning and artificial intelligence on Observability data, there is no clear-cut prescription for this problem, but we expect this to change in the coming months.
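To make the contrast with hand-written rules concrete, here is a minimal sketch of learning a baseline from historical data instead of fixing a threshold in advance. It uses a plain z-score from Python's `statistics` module as a deliberately simple stand-in; it is not a machine learning model of the kind a production system would use, and the latency values, the 1000 ms fixed threshold and the z-score cutoff of 3.0 are all assumptions for this example.

```python
from statistics import mean, stdev


def fit_baseline(history):
    """Learn what "normal" looks like from historical values of one signal."""
    return mean(history), stdev(history)


def is_anomalous(value, baseline, z_threshold=3.0):
    """Flag a value more than z_threshold standard deviations from the learned mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical p99 latency history in milliseconds, hovering around 200 ms.
history = [198, 205, 201, 195, 210, 202, 199, 204, 197, 203]
baseline = fit_baseline(history)

FIXED_THRESHOLD_MS = 1000  # hypothetical hand-written rule
new_value = 600

print(new_value > FIXED_THRESHOLD_MS)     # False: the fixed rule stays silent
print(is_anomalous(new_value, baseline))  # True: far outside the learned baseline
print(is_anomalous(206, baseline))        # False: within normal variation
```

The point of the sketch is only that "normal" is derived from the data itself. Applying the same idea across many signals and sources at once is where actual learning models come in.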
ML/AI in Observability: Some considerations
Whether you are instrumenting your Observability data platform using a DIY approach and open source software or building it using packaged vendor tools, there are some considerations we want to highlight that can help you maximize the benefits.
- Build the right mindset and culture. Get IT operations to start thinking about resiliency over reliability and to take advantage of Observability data in rolling out resilient services. This cultural change is critical not just in managing distributed systems but also in using Observability efficiently
- Stop discarding data and bring together various data sources by breaking down the silos. Whether it is data from your DevOps pipeline, your production environment including edge locations, or end user performance data, it is important to feed in data from all the sources so that machine learning or deep learning models can be applied effectively
- Focus on training data. It is difficult to train the models efficiently because the needs of each organization are different. Using generic data may end up creating more problems. Understand the training data to see if it can help your organization or start with a rules-based system and slowly use the organization’s data to train the learning models
- Focus on instrumentation. Instrumentation in the context of Observability is about increasing Observability. So, focus on building instrumentation into everything from infrastructure to the code you run. Instrumentation cannot be an afterthought
- Focus on automation and how it can be seamlessly hooked into Observability data to ensure a more autonomous self-healing and self-evolving system
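The instrumentation point in the list above can be sketched as follows: a small decorator that records the duration and outcome of every call as a structured event, so that the signal is built into the code rather than added afterwards. The `EVENTS` list, the `instrumented` decorator and the `charge_card` function are hypothetical names for this example; a real system would send events to its own telemetry pipeline instead of an in-memory list.

```python
import functools
import time

EVENTS = []  # hypothetical in-memory sink; a real system would ship these elsewhere


def instrumented(func):
    """Record the duration and outcome of every call as a structured event."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            outcome = type(exc).__name__  # record the failure type, then re-raise
            raise
        finally:
            EVENTS.append({
                "operation": func.__name__,
                "outcome": outcome,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
    return wrapper


@instrumented
def charge_card(amount):
    """Hypothetical business function used only to demonstrate the decorator."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return f"charged {amount}"


charge_card(10)
try:
    charge_card(0)
except ValueError:
    pass

for event in EVENTS:
    print(event["operation"], event["outcome"])
```

Both the successful call and the failed one leave an event behind, which is the raw material the correlation and learning steps depend on.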
As you modernize your IT and start embracing cloud native architectures, Observability is key to running resilient systems. Machine Learning and Artificial Intelligence have a critical role to play in enhancing Observability, especially as the perimeter moves towards the edge and IoT devices. The use of ML and AI in Observability is still in very early stages, but we expect it to become mainstream in 2-3 years. As a modern enterprise stakeholder, it is important you understand the role of Observability and how machine learning and AI can shape its future.