Modernising operations on AWS Cloud with AIOps.

Nanthan Rasiah
7 min read · Aug 6, 2022

AIOps (artificial intelligence for IT operations) is the practice of applying data analytics and machine learning to operational data in order to enhance IT operations and reduce human intervention. It helps us identify and resolve issues faster by making root cause analysis (RCA) more accurate and by detecting problems proactively.

AWS provides a number of services to monitor your workloads on AWS Cloud. Events, alarms, logs, traces and metrics in Amazon CloudWatch and rules in AWS Config provide valuable insights. You can get even more insights by applying machine learning techniques to this data, and that is essentially what AIOps does: it applies machine learning to operational data to solve operational problems.

AIOps helps you reduce operational incidents and increase service quality by grouping related incidents and even resolving incidents automatically by suggesting relevant knowledge base articles. It also lets you predict incidents before they happen or respond to them in real time.

In this post, let’s explore the services from AWS that can be used to deliver AIOps capabilities to IT operations.

The following AI-enabled AWS services provide AIOps capabilities to IT operations and help improve application availability and reliability on AWS.

  1. Amazon DevOps Guru
  2. Amazon Detective
  3. Amazon CloudWatch Anomaly Detection
  4. Predictive scaling for Amazon EC2 Auto Scaling
  5. Amazon Macie
  6. Amazon Lookout for Metrics
  7. Amazon GuardDuty

Amazon DevOps Guru

Amazon DevOps Guru is a fully managed AIOps service that helps improve application availability and operational performance. It uses pre-trained ML models to detect operational issues and recommend remediation actions. It is designed to save the time and effort spent on debugging and resolving operational issues, and it lets you monitor complex applications effectively.

Amazon DevOps Guru automatically ingests operational data from Amazon CloudWatch, AWS Config, AWS CloudTrail and AWS X-Ray, continuously analyses metrics such as latency, error rates and request rates, and learns normal operating patterns and behaviours by leveraging pre-trained ML models. It then identifies anomalous application behaviour, such as increased latency, error rates or resource constraints, that could cause outages or service disruptions, and enriches the data with relevant information fetched from a number of different data sources. Finally, it sends an alert about the issue with a summary of related anomalies, contextual information about why and when the issue occurred, and recommendations on how to remediate it and reduce application downtime.

DevOps Guru can be used as a standalone service, and it also integrates with AWS Systems Manager OpsCenter, Amazon SNS and partner applications from PagerDuty and Atlassian. It is easy to get started: you can enable it for all resources in your AWS account, for resources in your AWS CloudFormation stacks, or for resources grouped together by AWS tags. No ML expertise is required to configure DevOps Guru.
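As a rough illustration of how simple the setup can be, here is a minimal Python sketch that scopes DevOps Guru to a set of CloudFormation stacks and attaches an SNS topic for notifications. It assumes the boto3 SDK and valid AWS credentials; the function names, stack names and topic ARN are hypothetical placeholders, and the default region is arbitrary.

```python
def build_resource_collection_request(stack_names):
    """Build the arguments for DevOps Guru's UpdateResourceCollection call."""
    if not stack_names:
        raise ValueError("at least one CloudFormation stack name is required")
    return {
        "Action": "ADD",  # add these stacks to the analysed resource collection
        "ResourceCollection": {
            "CloudFormation": {"StackNames": list(stack_names)}
        },
    }


def build_sns_channel_request(topic_arn):
    """Build the arguments for DevOps Guru's AddNotificationChannel call."""
    return {"Config": {"Sns": {"TopicArn": topic_arn}}}


def enable_devops_guru(stack_names, topic_arn, region="us-east-1"):
    """Enable DevOps Guru for the given stacks and send insights to SNS."""
    import boto3  # imported lazily so the builders above work without the SDK

    client = boto3.client("devops-guru", region_name=region)
    client.update_resource_collection(
        **build_resource_collection_request(stack_names)
    )
    client.add_notification_channel(**build_sns_channel_request(topic_arn))
```

Calling something like enable_devops_guru(["orders-api"], "arn:aws:sns:us-east-1:111122223333:ops-alerts") would then start analysis for that stack. Keeping the request builders separate from the API calls makes the configuration easy to review and unit test.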

For more information, please visit https://aws.amazon.com/devops-guru/

Amazon Detective

Amazon Detective makes it easy to analyse, investigate and quickly identify the root cause of potential security issues or suspicious activities. It automatically collects log data from your AWS resources and applies machine learning, statistical analysis and graph theory to build a linked set of data that lets you conduct faster and more efficient security investigations.

Amazon Detective simplifies the investigative process for security teams. Its prebuilt data aggregations, summaries and context help you quickly analyse and determine the nature and extent of possible security issues. Amazon Detective maintains up to a year of aggregated data and makes it easily available through a set of visualisations that show changes in the type and volume of activity over a selected time window and link those changes to security findings. There are no upfront costs and you pay only for the events analysed, with no additional software to deploy or log feeds to enable.

For more information, please visit https://aws.amazon.com/detective/

Amazon CloudWatch Anomaly Detection

Amazon CloudWatch anomaly detection uses machine learning algorithms to continuously analyse the historical values of system and application metrics, determine predictable patterns that repeat at regular intervals, and create models to detect anomalies. This lets customers anticipate expected metric values and cleanly differentiate normal from problematic behaviour.

After you create a model, CloudWatch anomaly detection continually evaluates the model and makes adjustments to it to ensure that it is as accurate as possible. This includes re-training the model to adjust if the metric values evolve over time or have sudden changes, and also includes predictors to improve the models of metrics that are seasonal, spiky, or sparse.
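To make this concrete, the following Python sketch builds an alarm that fires when a metric leaves the band predicted by the anomaly detection model. It is a minimal example under stated assumptions: it uses boto3's put_metric_alarm with an ANOMALY_DETECTION_BAND metric math expression, and the function names, alarm name, metric and dimension values are hypothetical placeholders.

```python
def build_anomaly_alarm_request(alarm_name, namespace, metric_name,
                                dimensions, band_width=2, period=300):
    """Build PutMetricAlarm arguments for an anomaly detection alarm.

    band_width is the number of standard deviations used for the
    expected-value band; a larger value gives a wider, less sensitive band.
    """
    return {
        "AlarmName": alarm_name,
        # Alarm when the metric is outside the band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",
        "TreatMissingData": "missing",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": [
                            {"Name": name, "Value": value}
                            for name, value in dimensions.items()
                        ],
                    },
                    "Period": period,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


def create_anomaly_alarm(request, region="us-east-1"):
    """Send the request to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**request)
```

For example, build_anomaly_alarm_request("cpu-anomaly", "AWS/EC2", "CPUUtilization", {"InstanceId": "i-0123456789abcdef0"}) produces a request for a CPU alarm on a single (hypothetical) instance.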

For more information, please visit https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html

Predictive scaling for Amazon EC2 Auto Scaling

Predictive scaling lets you increase the number of EC2 instances in an Auto Scaling group in advance, based on patterns in traffic flows. For example, if you have high resource usage during regular business hours and low usage during evenings and weekends, predictive scaling is a good fit because it helps you scale faster by launching capacity in advance of forecasted load.

With predictive scaling, you don’t need to over-provision EC2 instances, which reduces your bill. It adds capacity before the first influx of traffic, which helps the application maintain high availability and performance when going from a period of lower utilisation to a period of higher utilisation.
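A predictive scaling policy can be attached to an existing Auto Scaling group with a single API call. The sketch below assumes boto3's put_scaling_policy with a CPU-based predefined metric pair; the function names, policy name, target value and buffer time are illustrative choices rather than recommendations. Starting in forecast-only mode lets you inspect the forecast before it is allowed to change capacity.

```python
def build_predictive_policy_request(asg_name, target_cpu=50.0,
                                    forecast_only=True, buffer_seconds=300):
    """Build PutScalingPolicy arguments for a predictive scaling policy."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-predictive-cpu",
        "PolicyType": "PredictiveScaling",
        "PredictiveScalingConfiguration": {
            "MetricSpecifications": [
                {
                    # Keep average CPU utilisation of the group near this value.
                    "TargetValue": target_cpu,
                    "PredefinedMetricPairSpecification": {
                        "PredefinedMetricType": "ASGCPUUtilization"
                    },
                }
            ],
            # ForecastOnly generates forecasts without scaling the group.
            "Mode": "ForecastOnly" if forecast_only else "ForecastAndScale",
            # Launch instances this many seconds ahead of the forecasted load.
            "SchedulingBufferTime": buffer_seconds,
        },
    }


def apply_predictive_policy(request, region="us-east-1"):
    """Send the request to EC2 Auto Scaling (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("autoscaling", region_name=region).put_scaling_policy(**request)
```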

For more information, please visit https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-predictive-scaling.html

Amazon Macie

As we migrate huge amounts of data to AWS Cloud, it is extremely important to protect valuable data such as personally identifiable information (PII). The best way is to automate the discovery of sensitive data so that you don’t have to classify data and its permissions manually.

Amazon Macie is a fully managed data security service that uses machine learning and pattern matching to automatically discover and help protect sensitive data stored in AWS, such as personally identifiable information (PII) and protected health information (PHI). It also provides a summary of publicly accessible buckets, unencrypted buckets and shared buckets.

It helps you achieve the security you need in AWS Cloud and meet regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR).
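Automated discovery can be started programmatically as well as from the console. The following sketch assumes Macie is already enabled in the account and uses boto3's macie2 client to create a one-time sensitive data discovery job; the function names, account ID, bucket names and job name are hypothetical placeholders.

```python
import uuid


def build_classification_job_request(account_id, bucket_names, job_name):
    """Build CreateClassificationJob arguments for a one-time Macie job."""
    if not bucket_names:
        raise ValueError("at least one S3 bucket name is required")
    return {
        # Idempotency token so that a retried request does not create a duplicate job.
        "clientToken": str(uuid.uuid4()),
        "jobType": "ONE_TIME",
        "name": job_name,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(bucket_names)}
            ]
        },
    }


def start_discovery_job(request, region="us-east-1"):
    """Create the job in Macie and return its ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    macie = boto3.client("macie2", region_name=region)
    return macie.create_classification_job(**request)["jobId"]
```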

For more information, please visit https://aws.amazon.com/macie/

Amazon Lookout for Metrics

Amazon Lookout for Metrics is a new service that uses machine learning to automatically detect and diagnose anomalies in operational data. This enables you to build highly accurate machine learning models (called detectors) to identify outliers in live or real-time data. You can use historical data to train a model, which will then detect outliers in live data. If you do not have historical data, Amazon Lookout for Metrics will train a model on the go.

It also lets you provide feedback on detected anomalies to tune the results and improve accuracy over time, and it makes detected anomalies easy to diagnose by grouping together anomalies related to the same event and sending an alert that includes a summary of the potential root cause. In addition, it ranks anomalies by severity so that you can focus your attention on what matters most to your business.

Amazon Lookout for Metrics uses this custom-trained detector to monitor your chosen metrics for outliers, allowing you to quickly identify and resolve issues that are likely to impact your business. Amazon Lookout for Metrics can also integrate with Amazon SNS to alert you when the service detects important outliers.

For more information, please visit https://aws.amazon.com/lookout-for-metrics/

Amazon GuardDuty

Amazon GuardDuty is an intelligent managed threat detection service based on sophisticated machine learning algorithms. The service continuously monitors and protects one or multiple AWS accounts, workloads, and data stored in Amazon S3 for malicious activity and delivers detailed security findings for visibility and remediation.

It monitors and analyses VPC Flow Logs, AWS CloudTrail event logs and DNS logs, and it uses threat intelligence, such as lists of malicious IP addresses and URLs, together with machine learning to identify unexpected and potentially unauthorised or malicious activity within your AWS environment. For example, you can use GuardDuty to detect compromised EC2 instances serving malware or mining bitcoin.

GuardDuty informs you of the status of your AWS environment by producing security findings that you can view in the GuardDuty console or through Amazon CloudWatch Events. It lets you create your own custom automated functions to handle threats using the CLI and HTTPS APIs. GuardDuty provides three levels of severity:

  • Low severity: indicates threats that have already been removed or blocked before compromising any resource.
  • Medium severity: indicates suspicious activity.
  • High severity: indicates a resource that is fully compromised and is constantly being used for unintended purposes.
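In automated handlers, the severity of a finding arrives as a number rather than a label. The sketch below maps that number to the three levels above so that a handler can decide how to react. The numeric ranges (1.0 to 3.9 low, 4.0 to 6.9 medium, 7.0 to 8.9 high) are an assumption based on the GuardDuty documentation and are not stated in this post, so check them against the current documentation before relying on them; the function names and sample findings are hypothetical.

```python
def severity_level(severity):
    """Map a numeric GuardDuty finding severity to Low, Medium or High.

    The ranges are assumptions taken from the GuardDuty documentation:
    1.0-3.9 is low, 4.0-6.9 is medium and 7.0-8.9 is high.
    """
    if 1.0 <= severity < 4.0:
        return "Low"
    if 4.0 <= severity < 7.0:
        return "Medium"
    if 7.0 <= severity < 9.0:
        return "High"
    raise ValueError(f"severity {severity} is outside the assumed ranges")


def findings_needing_action(findings, minimum="Medium"):
    """Return the findings whose level is at or above the given minimum."""
    order = {"Low": 0, "Medium": 1, "High": 2}
    return [
        finding for finding in findings
        if order[severity_level(finding["Severity"])] >= order[minimum]
    ]


# Hypothetical findings, reduced to the two fields used here.
sample = [
    {"Type": "Recon:EC2/PortProbeUnprotectedPort", "Severity": 2.0},
    {"Type": "CryptoCurrency:EC2/BitcoinTool.B!DNS", "Severity": 8.0},
]
urgent = findings_needing_action(sample, minimum="High")
```

A handler could, for instance, only page the on-call engineer for the findings returned with minimum="High" and log the rest for later review.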

For more information, please visit https://aws.amazon.com/guardduty/

The purpose of this post was to explain AIOps concepts and the services AWS provides to implement AIOps in an AWS Cloud environment.

Nanthan Rasiah

Ex. AWS APN Ambassador | Architect | AWS Certified Pro | GCP Certified Pro | Azure Certified Expert | AWS Certified Security & Machine Learning Specialty