CloudWatch (Logs, Metrics, Alarms)
Logs
# Tail logs for a Lambda
aws logs tail /aws/lambda/staging-EcomIndexerFunction --follow
# Search logs for errors in the last hour
# (date -v is BSD/macOS syntax, here and below; on GNU/Linux use: date -d '1 hour ago' +%s000)
aws logs filter-log-events \
--log-group-name /aws/lambda/staging-EcomIndexerFunction \
--start-time $(date -v-1H +%s000) \
--filter-pattern "?ERROR ?Traceback ?Exception"
# Monolith logs (ECS Fargate)
aws logs filter-log-events \
--log-group-name staging-monolith-logs \
--start-time $(date -v-15M +%s000) \
--filter-pattern "ERROR"
# Exclude health checks from monolith logs
aws logs filter-log-events \
--log-group-name staging-monolith-logs \
--start-time $(date -v-15M +%s000) \
--filter-pattern "-\"GET /openapi.json 200\""
# List log groups
aws logs describe-log-groups --query 'logGroups[].[logGroupName,storedBytes]' --output table
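The `--start-time` values above rely on `date -v`, which only exists in BSD/macOS `date`. A minimal sketch of a portable alternative follows; `ms_ago_minutes` is a hypothetical helper name, not part of the original runbook, and it uses only `date +%s` plus shell arithmetic.

```shell
# Hypothetical helper (an assumption, not from the runbook):
# print the epoch time in milliseconds for N minutes ago.
# Works with both GNU and BSD/macOS date because it only uses `date +%s`.
ms_ago_minutes() {
  local minutes="$1"
  local now
  now=$(date +%s)                        # current epoch time in seconds
  echo $(( (now - minutes * 60) * 1000 ))  # shift back, convert to ms
}
```

With this in place, `--start-time "$(ms_ago_minutes 15)"` could stand in for `$(date -v-15M +%s000)` in the commands above.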
Alarms
# List all alarms
aws cloudwatch describe-alarms --query 'MetricAlarms[].[AlarmName,StateValue,MetricName]' --output table
# List alarms in ALARM state
aws cloudwatch describe-alarms --state-value ALARM --output table
# Get alarm history
aws cloudwatch describe-alarm-history --alarm-name "staging-EcomMetricsWorkerDLQAlarm" --max-items 10
Key Alarms
| Alarm | Trigger | Severity |
|---|---|---|
| {env}-EcomMetricsWorkerDLQAlarm | Messages in metrics DLQ | Sev2 (Slack + PagerDuty) |
| {env}-EcomMonitoringServiceAlarm | Monitoring Lambda errors | Sev2.5 |
| {env}-EcomPartialDocumentsDetectedGlobalAlarm | Partial docs in indexer | Sev2 |
| {env}-Agentic5xxRpsAlarm | Agentic 5XX rate > 2/s | Sev2 |
| MerchandisingExporterErrorAlarm-{env} | Merch exporter errors | Sev2 |
| MerchandisingExporterHeartbeatAlarm-{env} | No merch exporter invocations | Sev2 |
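To check every key alarm for one environment in a single call, the `{env}` placeholder in the names above can be expanded and passed to `describe-alarms --alarm-names`. The sketch below assumes the names in the table are complete and exact; `key_alarm_names` and `key_alarm_states` are hypothetical helper names introduced for illustration.

```shell
# Hypothetical helper: expand the {env} placeholder in the key alarm
# names listed in the table for one environment (e.g. staging).
key_alarm_names() {
  local env_name="$1"
  local template
  for template in \
    '{env}-EcomMetricsWorkerDLQAlarm' \
    '{env}-EcomMonitoringServiceAlarm' \
    '{env}-EcomPartialDocumentsDetectedGlobalAlarm' \
    '{env}-Agentic5xxRpsAlarm' \
    'MerchandisingExporterErrorAlarm-{env}' \
    'MerchandisingExporterHeartbeatAlarm-{env}'
  do
    # In a basic regex, {env} is matched literally.
    printf '%s\n' "$template" | sed "s/{env}/$env_name/"
  done
}

# Hypothetical helper: print name and state for each key alarm.
# The names contain no spaces, so word splitting of the list is intended.
key_alarm_states() {
  aws cloudwatch describe-alarms \
    --alarm-names $(key_alarm_names "$1") \
    --query 'MetricAlarms[].[AlarmName,StateValue]' \
    --output table
}
```

Running `key_alarm_states staging` should then show the state of each key alarm that exists in the current account and region.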
Dashboards
# List dashboards
aws cloudwatch list-dashboards --query 'DashboardEntries[].[DashboardName]' --output table
Key dashboards: {env}-EcomDashboard, CloudControllerDashboard-{env}, MerchandisingExporterDashboard-{env}.
SNS Notification Topics
| Topic | Purpose |
|---|---|
| CloudwatchAlarmNotifySlack | Slack alerts |
| CloudwatchAlarmNotifyPagerduty | PagerDuty Sev2 |
| CloudwatchAlarmNotifyPagerdutySev2_5 | PagerDuty Sev2.5 |
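To see which alarms route to a given topic, one option is to filter `describe-alarms` on the `AlarmActions` ARNs. This is a sketch under two assumptions: the alarms reference these SNS topics directly as alarm actions, and the topic names appear verbatim at the end of the ARN (no generated suffix). `alarms_for_topic` is a hypothetical helper name.

```shell
# Hypothetical helper: list alarms whose AlarmActions include an ARN
# ending in ":<topic name>". Matching on the ARN suffix keeps
# CloudwatchAlarmNotifyPagerduty from also matching ...PagerdutySev2_5.
alarms_for_topic() {
  aws cloudwatch describe-alarms \
    --query "MetricAlarms[?AlarmActions[?ends_with(@, ':$1')]].[AlarmName,StateValue]" \
    --output table
}
```

For example, `alarms_for_topic CloudwatchAlarmNotifyPagerdutySev2_5` would list the alarms expected to page at Sev2.5.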
What to Look For
| Symptom | Check |
|---|---|
| Alert firing | aws cloudwatch describe-alarms --state-value ALARM |
| Missing logs | Verify the log group exists; check that the Lambda execution role has CloudWatch Logs permissions (logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents) |
| High error rate | Filter the log group with the ERROR pattern |
| Latency issues | Check the dashboard widgets for p99 latency |
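For the "Missing logs" check, the first step (confirming the log group exists) can be scripted. The sketch below uses `describe-log-groups` with a name prefix and an exact-match query; `log_group_exists` is a hypothetical helper name, and the check runs against whatever account and region the CLI is configured for.

```shell
# Hypothetical helper: succeed (exit 0) only if a log group with exactly
# this name exists in the current account and region.
log_group_exists() {
  local name="$1"
  local found
  found=$(aws logs describe-log-groups \
    --log-group-name-prefix "$name" \
    --query "logGroups[?logGroupName=='$name'].logGroupName" \
    --output text)
  [ "$found" = "$name" ]
}
```

A possible use: `log_group_exists /aws/lambda/staging-EcomIndexerFunction || echo "log group missing"`. If the group is missing, the execution role permissions are the next thing to check.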