Autonomic computing is not new. The idea has been around since 2001, when IBM put it forward. But it has stayed on the sidelines ever since, as a research project with little mainstream attention. Industry conversation has instead centered on automation, with cloud as the underlying fabric. Even though analytics has played a critical role in smoothing IT operations, heavy reliance on humans at the intersection of analytics and operations still causes outages, disruption, and financial loss. Think about the AWS S3 outage, where a small mistake in a human operator's input caused a large-scale outage of their services, or the British Airways outage due to a human switching off the servers too quickly. The cost of such mistakes is enormous. As long as humans are involved in critical operations tasks, such costly mistakes will keep happening.
I am not arguing for the removal of humans from operations. First, it is not possible with today’s technologies, and human beings are essential to handle the machinery behind capitalism. Second, it is inhumane to replace human beings with machines at large scale before a credible socio-economic system is put in place to support such a switch (a topic beyond this publication). However, in today’s enterprise IT, with all the automation in place, humans still hold the responsibility for critical operations, with machine analytics playing a supporting role. As technologies built on machine learning or deep learning mature further, we may be in a position to put autonomic systems in place to handle critical operations, with humans playing supporting roles. Autonomic systems are not a magic pill, and they can have their own emergent problems (for example, think about some of the catastrophic failures of autonomic trading on Wall Street), so it is critical for us to build the necessary safety nets to handle such scenarios. Still, artificial intelligence combined with automation holds promise for removing humans from critical IT operations in the future.
Let us be clear here. The premise of this argument is not “No humans in operations”. Rather, it is about using autonomic systems to let operations teams handle systems at large scale. It is about empowering them to do operations at a scale that is not possible even in today’s automation-driven IT. It is not just about scale but also about injecting resiliency into operations by using a “learning system” as the nerve center of the automation. With the digitization of the world, the need for “operations skills” is not going away, but autonomic systems can help smaller teams manage large-scale distributed systems without talent shortages impacting the organization. Abstracting away complexity through automation and machine intelligence puts human beings in supervisory roles rather than having them woken up at 3 AM to manage an immediate crisis in production systems.
We have a long way to go on the maturity curve before the scenario described above becomes a reality, but the industry is taking baby steps towards this future. Even at this early stage of the evolution, machine learning holds a lot of promise:
- The insights offered by analytics tools can be more personalized than the generic insights produced by a fixed set of rules
- Analyzing logs can uncover patterns missed by human operators or even by rule-based pattern matching
- Alerting can be more targeted, cutting out the false alerts that take a toll on the health of human operators
- Security can be handled proactively rather than reactively
- Detection of problems can be more proactive, and issues can be caught much earlier by making learning systems part of root cause analysis
We are just scratching the surface. As machine learning is integrated more deeply into existing tools and new ones, the face of operations will differ from what we see today.
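To make the alerting and early-detection points above a little more concrete, here is a minimal sketch that contrasts a fixed rule with a baseline learned from recent data. It is only an illustration under stated assumptions, not how any particular product works: the function names, the window size, the z-score threshold, the `min_sigma` floor, and the sample latency values are all hypothetical.

```python
from statistics import mean, stdev


def static_rule_alerts(values, limit):
    """Rule-based alerting: flag every sample above a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]


def learned_baseline_alerts(values, window=5, z_threshold=3.0, min_sigma=1.0):
    """Flag samples that deviate strongly from a rolling baseline.

    The baseline is the mean and standard deviation of the previous
    `window` samples, so it follows whatever "normal" looks like for
    this particular system instead of relying on a generic limit.
    `min_sigma` (an assumed floor) avoids division by zero on flat data.
    """
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)
        if abs(values[i] - mu) / sigma > z_threshold:
            alerts.append(i)
    return alerts


# Hypothetical latency samples (ms) for a service that normally runs
# near 900 ms, with a single genuine spike at index 7.
latency_ms = [900, 905, 898, 902, 901, 899, 903, 1500, 900, 902]

rule_alerts = static_rule_alerts(latency_ms, limit=850)
baseline_alerts = learned_baseline_alerts(latency_ms)

print("fixed rule alerts at:", rule_alerts)
print("learned baseline alerts at:", baseline_alerts)
```

With this sample data, the generic 850 ms rule fires on every sample because the service always runs above that limit, while the rolling baseline flags only the spike. Real systems would use far richer models, but the contrast is the same one the list above describes.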
In this section, I will highlight efforts by various vendors to fuse AI with automation. This is not an exhaustive list by any means, but it will give readers a flavor of what they can expect in the coming decade. If you are a vendor in this space tapping AI to streamline operations, please contact us to set up a briefing.
- Splunk is well positioned to take advantage of ML and AI, as data is at the core of their products. Splunk IT Service Intelligence uses ML and AI to detect patterns and help IT teams manage their systems
- Insight Engines’ cybersecurity investigator combines Splunk’s data with Natural Language Processing (NLP) to offer security teams an easier way to handle security
- Instana’s Application Performance Management uses ML and AI to offer APM for microservices, including automatic identification of the root cause of problems
- CloudFabrix’s AppDimension uses machine learning to learn the optimal state and baselines so that it can better predict negative outcomes, which helps speed up remediation
- Moogsoft’s IT incident management platform uses ML and AI for root cause analysis and to offer better predictions (see the recent recording of Moogsoft in our Cloud Native Vendors Video Series)
- Elasticsearch can be used with machine learning for better IT operations
- PagerDuty has been using machine learning to better manage events that impact the health of the operations team
- New Relic’s Applied Intelligence uses machine learning for better prediction and root cause analysis
These are just some examples of products available in the industry, and most of them are standalone ML- or AI-enabled products. They do not yet amount to autonomic computing, but more innovation along these product lines will move the industry towards it. Most of the web-scale providers, like Amazon Web Services, Microsoft Azure, and Google Cloud, are using ML and AI for everything from data center management to systems management and monitoring. Oracle has announced that they will integrate ML and AI capabilities into their cloud platforms. It is still early days, but watch for more innovation and maturity in this field, which will lead to a day when modern enterprises can use autonomic systems to manage their IT operations with humans playing supervisory roles. Even though this future is still 5-8 years away, it is critical for CIOs to consider it as they plot their modernization strategy.
Disclosure: CloudFabrix is a client of Rishidot Research