Another interesting article from my fav guest author.
Amazon Web Services (AWS) provides a comprehensive infrastructure you can use to develop, deploy and scale applications. As cloud deployments grow and support mission-critical applications, it is becoming more important to gain visibility into what is happening in the cloud, which services or workloads are experiencing issues, and how to resolve them.
Amazon provides CloudWatch, the foundation of most AWS monitoring strategies. CloudWatch collects metrics from nearly everything that runs on AWS, from EC2 compute instances to managed services like AWS Lambda and the Relational Database Service (RDS).
In this article I’ll explain which metrics are most critical, and which other tools and best practices you can use to monitor critical workloads on AWS.
What Should You Monitor in AWS?
Here are the most important things to monitor in an AWS environment.
Status Checks
To ensure the health of EC2 servers, AWS performs status checks by default. There are two types of status checks:
- System status checks—monitor issues that AWS is responsible for fixing, such as hardware failures and network connectivity issues.
- Instance status checks—monitor issues that you are responsible for fixing, such as corrupt file systems and exhausted memory.
You can set up status check alarms and get notifications when an issue occurs.
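As a rough sketch of what such an alarm could look like, the snippet below builds the arguments for the CloudWatch `PutMetricAlarm` API and passes them to boto3. The alarm naming scheme, the instance ID, and the SNS topic ARN are placeholders, and the period and evaluation count are illustrative assumptions rather than recommended values.

```python
def status_check_alarm_params(instance_id, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when StatusCheckFailed is 1 or more for two
    consecutive one-minute periods (illustrative values).
    """
    return {
        "AlarmName": f"status-check-failed-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",  # combines system and instance checks
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that receives the notification
    }


def create_status_check_alarm(instance_id, sns_topic_arn):
    """Create the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**status_check_alarm_params(instance_id, sns_topic_arn))
```

To alert on only one type of check, the `StatusCheckFailed_System` or `StatusCheckFailed_Instance` metric can be used in place of the combined one.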
CPU Usage
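CloudWatch reports the CPU usage of each instance as a percentage (the CPUUtilization metric), which you can monitor to ensure your instances are not hitting a CPU limit. This matters most for burstable instances, which share physical CPU and run on CPU credits: if such an instance is being throttled, check its CPU credit balance in CloudWatch. You can also monitor CPU steal time, which shows how much CPU time went to other virtual machines (VMs) on the same host while your instance was waiting to run. Steal time is measured inside the guest operating system, so you can collect it with an agent such as collectd.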
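To make the steal time idea concrete, here is a minimal sketch that computes it on a Linux guest from two samples of the aggregate `cpu` line in `/proc/stat`. It assumes a modern kernel that exposes the steal counter; the function names are made up for this example, and a monitoring agent would normally do this work for you.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Return steal time as a percentage of all CPU ticks between two samples.

    Field order on Linux: user, nice, system, idle, iowait, irq, softirq, steal.
    Guest counters follow, but they are already included in user and nice,
    so only the first eight fields are summed.
    """
    if len(before) < 8 or len(after) < 8:
        raise ValueError("kernel does not report a steal counter")
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def read_cpu_sample(path="/proc/stat"):
    """Read the current counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())
```

Taking one sample, sleeping for a few seconds, taking a second sample, and passing both to `steal_percent` gives the share of time the instance spent waiting for a physical CPU during that interval.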
Network Usage
You can also monitor the network usage of your AWS resources. To ensure optimal performance, track the number of packets and the throughput of each instance, and compare them with the limits of its instance type. If those limits are hurting performance, you can move to a larger instance type.
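CloudWatch reports EC2 network metrics such as NetworkIn and NetworkOut as byte counts per period, so a small conversion is needed before comparing them with an instance type's bandwidth figure. The helpers below are a sketch of that arithmetic; the function names are hypothetical, and the input is assumed to be the Sum statistic over the given period.

```python
def throughput_mbps(total_bytes, period_seconds):
    """Convert a byte count summed over a period into megabits per second."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return total_bytes * 8 / period_seconds / 1_000_000


def packets_per_second(total_packets, period_seconds):
    """Convert a packet count summed over a period into packets per second."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return total_packets / period_seconds
```

For example, 3,750,000,000 bytes received over a 300-second period works out to 100 Mbps on average. Averages hide short bursts, so a shorter period gives a more accurate picture of peaks.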
Application Performance
The performance of your application has a direct impact on user experience. To keep that experience positive, you need to monitor application-level performance metrics. CloudWatch, however, does not provide detailed performance metrics for applications deployed on EC2 out of the box. You need to set this up yourself, using an agent like collectd or by integrating with a monitoring service.
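Besides an agent, another option is to publish your own measurements as CloudWatch custom metrics through the `PutMetricData` API. The sketch below times a function call and builds a datum for it. Note that this is a different technique from collectd, and the metric name, dimension, and `MyApp` namespace are invented for the example.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def latency_datum(endpoint, latency_ms):
    """Build one MetricData entry for CloudWatch put_metric_data."""
    return {
        "MetricName": "RequestLatency",  # hypothetical metric name
        "Dimensions": [{"Name": "Endpoint", "Value": endpoint}],
        "Value": float(latency_ms),
        "Unit": "Milliseconds",
    }


def publish(datum, namespace="MyApp"):
    """Send the datum to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=[datum])
```

In a real service you would likely batch several data points per call instead of publishing one request at a time.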
AWS Monitoring Tools
Amazon provides several tools you can use to monitor Amazon services and your workloads running on AWS.
AWS CloudTrail
CloudTrail is designed to give AWS users visibility into application programming interface (API) usage and user activity across AWS environments. Once you set up CloudTrail, it automatically records and stores event logs of actions performed in your AWS accounts, including who performed each action and when.
Amazon CloudWatch
CloudWatch is a comprehensive monitoring solution designed especially for developers, engineers, security professionals, and DevOps teams. The solution provides capabilities for monitoring both operational and security tasks across an IT infrastructure, including AWS services and on-premises environments. Notable features include automated incident response, insights into operations, troubleshooting, and anomaly detection.
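As an illustration of working with these event logs, the sketch below counts recorded actions per user. It assumes events shaped like the entries returned by the CloudTrail `LookupEvents` API (with `Username` and `EventName` keys); the helper names are made up for this example.

```python
from collections import Counter


def count_actions_by_user(events):
    """Count how often each (user, action) pair appears in a list of events."""
    counts = Counter()
    for event in events:
        user = event.get("Username", "unknown")  # some events carry no user name
        counts[(user, event["EventName"])] += 1
    return counts


def fetch_recent_events(max_results=50):
    """Fetch recent management events. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the counter above works without it

    client = boto3.client("cloudtrail")
    return client.lookup_events(MaxResults=max_results)["Events"]
```

Passing the output of `fetch_recent_events()` to `count_actions_by_user` gives a quick view of who has been doing what, for example to spot an unexpected burst of `TerminateInstances` calls.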
AWS Security Hub
Security Hub lets you centralize security data and alerts aggregated from AWS security services, including Amazon Macie and Amazon Inspector. Once the findings are collected in one place, you can organize and prioritize alerts, and create your own customized dashboards.
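Prioritizing alerts can be as simple as sorting findings by severity. The sketch below assumes findings in the format Security Hub returns, where each finding carries a `Severity` object with a `Label` such as `CRITICAL` or `LOW`. The function names and the choice to place unlabeled findings last are assumptions made for this example.

```python
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3, "INFORMATIONAL": 4}


def prioritize(findings):
    """Return findings sorted from most to least severe.

    Findings without a recognized severity label are placed last.
    """
    def rank(finding):
        label = finding.get("Severity", {}).get("Label")
        return SEVERITY_ORDER.get(label, len(SEVERITY_ORDER))

    return sorted(findings, key=rank)


def fetch_active_findings():
    """Fetch one page of active findings. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so prioritize() works without it

    client = boto3.client("securityhub")
    response = client.get_findings(
        Filters={"RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}]},
        MaxResults=100,
    )
    return response["Findings"]
```

Because Python's sort is stable, findings with the same severity keep the order in which they were returned.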
Amazon Inspector
Amazon Inspector lets you automatically perform security assessments in AWS. You can use this tool to assess workloads deployed in the AWS cloud and investigate specific aspects of their security posture. For example, you can use Amazon Inspector to check EC2 instances for known software vulnerabilities and unintended network exposure.
AWS Monitoring Best Practices
Here are a few best practices that can help you get the most out of AWS monitoring.
Collect and Analyze Data from Across the AWS Environment
It’s important to create a monitoring plan that collects data from your entire AWS environment—including cloud services, compute resources, and storage resources. By collecting and analyzing this data together, you can quickly correlate events to identify failures involving multiple resources or services.
Aggregate Data to Take Action
A monitoring strategy should provide you with actionable insights. To achieve this, you need to aggregate data from multiple sources, perform analysis on a regular basis, and receive prioritized alerts. AWS offers several tools that let you record and store logs (CloudTrail), monitor your infrastructure (CloudWatch), centralize data aggregation and alerting (Security Hub), and run security assessments (Inspector).
Use Automation Whenever Possible
It is not enough merely to monitor resources on AWS and notify personnel when something goes wrong. Automatic response to alarms is essential for maintaining the AWS environment and efficiently managing resources in a complex environment.
You can identify problems and resolve them with scripted actions or Amazon automation features, minimizing human intervention. For example, if memory usage on a compute instance reaches a critical level, you can scale up using a script or Amazon’s Auto Scaling service.
Creating automated responses to alerts can help you:
- Dynamically configure services to improve efficiency and conserve costs
- Immediately respond to problems that can affect application availability or user experience
- Save time for IT and DevOps staff
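As one possible shape for such an automated response, the sketch below builds a target tracking policy for an EC2 Auto Scaling group and applies it with boto3. It tracks average CPU utilization, not memory, because CPU is available as a predefined Auto Scaling metric while memory has to be published by an agent first. The group name, policy name, and 50 percent target are placeholders.

```python
def cpu_target_tracking_policy(group_name, target_percent=50.0):
    """Build keyword arguments for the Auto Scaling put_scaling_policy call.

    The policy adds or removes instances to keep the group's average CPU
    utilization near target_percent (illustrative default).
    """
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": f"{group_name}-cpu-target",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": float(target_percent),
        },
    }


def apply_policy(group_name, target_percent=50.0):
    """Attach the policy to the group. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    autoscaling = boto3.client("autoscaling")
    autoscaling.put_scaling_policy(**cpu_target_tracking_policy(group_name, target_percent))
```

With a policy like this in place, scaling happens without anyone having to react to the alarm by hand.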
Conclusion
In this article I explained the basics of AWS monitoring, and provided the following best practices to help you monitor workloads more effectively:
- Collect data from across your AWS environment, to be able to triage issues affecting multiple Amazon services
- Aggregate data using CloudTrail, CloudWatch and additional Amazon services
- Don’t just monitor: put in place automated actions that can resolve common problems in your AWS environment
I hope this will help you build and operate robust, resilient applications in the Amazon cloud.