Cloud Monitoring & Auto-Remediation System

Real-time AWS monitoring system with automated incident detection, decision-making, and remediation.

Detects EC2 incidents with CloudWatch and automates resolution via EventBridge, Lambda, ASG, DynamoDB, and SNS.

Overview

This system provides cloud-native incident management for EC2 instances in an Auto Scaling Group. CloudWatch alarms emit events to EventBridge, a decision engine running in Lambda classifies each incident, and remediation runs automatically, with state tracked in DynamoDB and notifications sent through SNS.

Features

Event-driven monitoring

CloudWatch alarms feed incidents into EventBridge for real-time responses.

Automated remediation

Lambda functions execute corrective actions without manual intervention.
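
As a rough illustration of what a remediation step could look like, the sketch below dispatches an action name to an EC2 call. The function name `remediate` and the status strings (`EXECUTED`, `DELEGATED`, `SKIPPED`) are assumptions made for this example, not names taken from the project. `reboot_instances` and `stop_instances` are the standard boto3 EC2 client methods; the client is passed in so the logic can be exercised without AWS access.

```python
from typing import Any, Dict


def remediate(ec2_client: Any, action: str, instance_id: str) -> Dict[str, str]:
    """Run the EC2 call for an action and return a small result record.

    `ec2_client` is expected to behave like boto3.client("ec2").
    The status values are hypothetical labels for this sketch.
    """
    if action == "REBOOT":
        ec2_client.reboot_instances(InstanceIds=[instance_id])
        status = "EXECUTED"
    elif action == "STOP":
        ec2_client.stop_instances(InstanceIds=[instance_id])
        status = "EXECUTED"
    elif action == "SCALE_MANAGED_BY_ASG":
        # Scaling is left to the Auto Scaling Group; no direct EC2 call.
        status = "DELEGATED"
    else:
        # Unknown actions are ignored rather than guessed at.
        status = "SKIPPED"
    return {"instance_id": instance_id, "action": action, "status": status}
```

Inside a Lambda handler this would typically be called with `boto3.client("ec2")` as the first argument.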

Cooldown logic

Duplicate alarms are suppressed to avoid repeated actions and alert fatigue.
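
A minimal sketch of a cooldown check, assuming a 300-second window and a key of (instance ID, incident type). Neither the window length nor the key layout is stated in this README, and a plain dict stands in for the DynamoDB cooldown state so the example runs locally.

```python
import time
from typing import Dict, Optional, Tuple

COOLDOWN_SECONDS = 300  # assumed window; the project may use a different value


def in_cooldown(
    last_action: Dict[Tuple[str, str], float],
    instance_id: str,
    incident_type: str,
    now: Optional[float] = None,
    cooldown: int = COOLDOWN_SECONDS,
) -> bool:
    """Return True if this incident should be suppressed as a duplicate.

    `last_action` maps (instance_id, incident_type) to the timestamp of the
    last action taken. When the incident is not suppressed, the timestamp
    is updated to `now`.
    """
    now = time.time() if now is None else now
    key = (instance_id, incident_type)
    previous = last_action.get(key)
    if previous is not None and now - previous < cooldown:
        return True
    last_action[key] = now
    return False
```

Suppressed alarms do not refresh the timestamp here, so a steady stream of duplicates cannot extend the cooldown indefinitely. That is a design choice of this sketch, not necessarily of the project.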

ASG integration

HIGH_CPU incidents are left to the Auto Scaling Group, whose scaling policies adjust capacity.

DynamoDB logging

Incident decisions and actions are persisted for auditing and analysis.
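
One possible shape for a persisted incident record is sketched below. The attribute names (`instance_id`, `timestamp`, `incident_type`, `action`) are assumptions for illustration; the actual table schema is not described here. `put_item(Item=...)` is the standard method on a boto3 DynamoDB `Table` resource, and the table is passed in so the record builder can be tested on its own.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_incident_record(
    instance_id: str,
    incident_type: str,
    action: str,
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Build a flat record describing one decision (hypothetical schema)."""
    now = now or datetime.now(timezone.utc)
    return {
        "instance_id": instance_id,
        "timestamp": now.isoformat(),
        "incident_type": incident_type,
        "action": action,
    }


def log_incident(table: Any, record: Dict[str, str]) -> None:
    """Persist the record; `table` should behave like a boto3 Table resource."""
    table.put_item(Item=record)
```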

CloudWatch dashboard

Centralized metrics, alarms, and health overview for state visibility.

Architecture

Architecture diagram
  • CloudWatch alarms change state and emit events to EventBridge.
  • Decision Lambda classifies incidents and checks cooldown.
  • Remediation Lambda performs the EC2 action and records the event.
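
To make the classification step concrete, here is a hedged sketch of how a decision function might read a CloudWatch alarm state change event delivered by EventBridge. The event fields used (`detail-type`, `detail.alarmName`, `detail.state.value`, and the metric `dimensions`) follow the documented event format. The alarm naming convention in `ALARM_KEYWORDS` and the `UNKNOWN` fallback are assumptions invented for this example.

```python
from typing import Any, Dict, Optional

# Hypothetical convention: the alarm name contains one of these keywords.
ALARM_KEYWORDS = {
    "high-cpu": "HIGH_CPU",
    "status-check": "STATUS_CHECK_FAILED",
    "low-util": "LOW_UTILIZATION",
}


def classify_alarm_event(event: Dict[str, Any]) -> Optional[Dict[str, Optional[str]]]:
    """Return incident details for an alarm in ALARM state, else None."""
    if event.get("detail-type") != "CloudWatch Alarm State Change":
        return None
    detail = event.get("detail", {})
    if detail.get("state", {}).get("value") != "ALARM":
        return None

    alarm_name = detail.get("alarmName", "")
    incident_type = "UNKNOWN"
    for keyword, incident in ALARM_KEYWORDS.items():
        if keyword in alarm_name.lower():
            incident_type = incident
            break

    # The instance ID, when present, is a dimension of the alarm's metric.
    instance_id = None
    for metric in detail.get("configuration", {}).get("metrics", []):
        dims = metric.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        if "InstanceId" in dims:
            instance_id = dims["InstanceId"]
            break

    return {
        "alarm_name": alarm_name,
        "incident_type": incident_type,
        "instance_id": instance_id,
    }
```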

Incident Handling

HIGH_CPU

Action: SCALE_MANAGED_BY_ASG

STATUS_CHECK_FAILED

Action: REBOOT

LOW_UTILIZATION

Action: STOP
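
The incident-to-action mapping above can be expressed as a simple lookup table. The three pairs come from this section; the `NO_ACTION` fallback for unrecognized incident types is an assumption added for the sketch.

```python
# Incident type -> action, as listed in this section.
INCIDENT_ACTIONS = {
    "HIGH_CPU": "SCALE_MANAGED_BY_ASG",
    "STATUS_CHECK_FAILED": "REBOOT",
    "LOW_UTILIZATION": "STOP",
}


def decide_action(incident_type: str) -> str:
    """Return the action for an incident type.

    Unrecognized types fall back to "NO_ACTION" (a hypothetical default).
    """
    return INCIDENT_ACTIONS.get(incident_type, "NO_ACTION")
```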

AWS Services Used

CloudWatch

Monitor metrics and define alarms.

EventBridge

Route alarm events into the decision workflow.
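
For reference, an EventBridge rule that forwards only alarms entering the ALARM state could use a pattern along these lines. This is an illustrative pattern, not the rule deployed by this project; it is written as a Python dict and serialized to the JSON that EventBridge expects.

```python
import json

# Matches CloudWatch alarm state changes whose new state is ALARM.
ALARM_EVENT_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# JSON form, suitable for the EventPattern field of a rule.
pattern_json = json.dumps(ALARM_EVENT_PATTERN, indent=2)
print(pattern_json)
```

Filtering on the ALARM state at the rule level keeps OK and INSUFFICIENT_DATA transitions from invoking the decision Lambda at all.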

Lambda

Decision engine and remediation actions.

DynamoDB

Store incident decisions and cooldown state.

SNS

Notifications for operators and status updates.

EC2 + ASG + ALB

Compute layer with scaling and traffic routing.

Demo / Proof

Dashboard screenshot: real-time CPU utilization, alarm states, target health, and ASG capacity.
SNS email screenshot: automated notification sent when the system detects and processes an incident.
Logs screenshot: decision engine logs showing incident classification, cooldown check, and remediation flow.

Live Demo Flow

1. CloudWatch detects incident
2. EventBridge routes alarm event
3. Decision Lambda classifies issue
4. Remediation Lambda takes action
5. SNS sends notification
6. Logs stored for auditing