Cloud environments are complex and dynamic. There is a lot happening as workloads expand and contract over a 24-hour cycle. Users, both human and machine, log in and out from all around the world, add datasets, run queries, and perform computations. Application developers push out new versions of code whilst DevOps teams deploy new services and retire others. And even whilst all this is happening, the expectation is that the system will always be up.
You can certainly engineer reliability into an architecture, but how do you reduce unplanned outages that cause:
- loss of revenue and customers, and brand damage, for public- or partner-facing systems
- loss of productivity and continuity of processes and development for internal systems
The smartest and most cost-effective approach is to implement a structured monitoring regime.
Challenges
Often, there is a lack of understanding of what can go wrong: many environments may have been over-provisioned, equipped with autoscaling, and set up across multiple Availability Zones, but this does not make them immune to user error or communications link failure.
Application teams want to focus on the applications that support their core business, not on the availability of the systems on which they run. Typically, they lack the expertise or resources to establish a monitoring regime, and watching alerts themselves is distracting and time-consuming.
Frequently, there is a misunderstanding of what should be monitored, what thresholds should be set, and where events and alerts should be directed.
Without a well-thought-through monitoring regime, you might be monitoring the wrong things, have incorrect thresholds set on the right things, or fail to direct alerts somewhere they can be acted on quickly and collated for periodic review to identify patterns or systemic issues.
Solution
A structured monitoring regime monitors resources, services, and even budgets within an AWS account, collectively known as ‘metrics’. It creates an incident record with a service desk whenever a metric steps outside predetermined thresholds. Thresholds can be set on each metric for each of the severities: info, warning, error, and critical.
For example, if a disk hits 90% capacity, a warning might be raised, escalating to an error if it hits 98%. This allows service desk staff to deal with the issue proactively before it affects the business, and to notice that the disk is regularly hitting 90% and take preventative measures.
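As a minimal sketch of how such severity thresholds might be expressed on AWS, assuming the CloudWatch agent is publishing a disk_used_percent metric and that an SNS topic (hypothetically named service-desk-alerts here) routes alerts to the service desk:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical SNS topic that forwards alerts to the service desk.
ALERT_TOPIC = "arn:aws:sns:ap-southeast-2:123456789012:service-desk-alerts"

# One alarm per severity: warning at 90% disk usage, error at 98%.
for severity, threshold in [("warning", 90.0), ("error", 98.0)]:
    cloudwatch.put_metric_alarm(
        AlarmName=f"disk-usage-{severity}",
        # disk_used_percent is published by the CloudWatch agent
        # into the CWAgent namespace.
        Namespace="CWAgent",
        MetricName="disk_used_percent",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,           # evaluate over 5-minute windows
        EvaluationPeriods=2,  # require two consecutive breaches
        Threshold=threshold,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=[ALERT_TOPIC],
        AlarmDescription=f"Disk usage at or above {threshold}% ({severity})",
    )
```

In practice, the instance ID and topic ARN would come from your own environment; the point is that each severity is simply a separate alarm on the same metric with a different threshold and routing.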
Notifications are collated on a dashboard using a traffic-light system, with the highest-criticality issues visible above all others. No one needs to sit in front of this dashboard 24 hours a day: the service is automated to the point where human attention is required only when imminent or immediate action is needed, and those alerts are typically routed through to your managed service provider to deal with.
In most cases, you read about intercepted issues and prevented incidents in the monthly review report. These issues can be subjected to Root Cause Analysis and then remediated to reduce the likelihood of recurrence.
Benefits
A well-implemented monitoring regime significantly reduces unplanned outages, along with the brand damage and the loss of revenue and productivity associated with them.
Even if your monitoring service does not pre-empt an unplanned outage, it can alert you to its existence so you can proactively mitigate its impact on your business or your customers.
The monitoring service can also be set up to track AWS account charges against set budgets.
This facilitates a proactive approach to resource management, alerts you to unnecessary resource consumption or leakage, and avoids an end-of-month ‘bill shock’.
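As a rough illustration, and assuming a hypothetical account ID and notification address, AWS Budgets can be configured to raise an alert when actual spend crosses a percentage of a monthly limit:

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # hypothetical account ID
    Budget={
        "BudgetName": "monthly-spend-guardrail",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            # Alert when actual charges exceed 80% of the budget,
            # well before any end-of-month 'bill shock'.
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "ops@example.com"}
            ],
        }
    ],
)
```

A FORECASTED notification type can be added alongside the ACTUAL one if you want warning before the spend materialises, rather than after.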
If you are concerned about unplanned outages, or that you may have unnecessarily over-provisioned in the hope of avoiding them, book a free consultation or callback, and our cloud professionals can discuss monitoring options to suit your unique circumstances.