Proactive failure notification

A mind map of the DORA capability "Proactive failure notification".

1. Pitfalls

1.1. Alert only when broken

1.1.1. too late

1.2. Noisy or numerous alerts leading to missing the relevant alert

1.2.1. e.g. rule: at most one false positive per week (or day)

1.2.2. Anti-pattern: "cause-based alerting", i.e. the monitoring solution tries to enumerate all possible error conditions and writes an alert for each of them, leading to a very bad signal-to-noise ratio (SNR) and pager fatigue. Recommended pattern: "symptom-based alerting"

1.3. Unadapted metric alignment

1.3.1. Alignment means regularizing a time series: the raw time series can be aligned on a small window or on a larger window

1.3.2. Window (aka duration). Example: sampling 1 min, align period = 5 min, aligner = SUM, condition = GREATER THAN 2, assume evaluation every 1 min


1.3.3. Summary: larger alignment windows mean more metric samples -> more stability; smaller alignment windows mean fewer metric samples -> more sensitivity
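The alignment example above can be sketched in Python; the helper name `align` and the sample data are illustrative, not taken from any monitoring API:

```python
# Sketch of window alignment: 1-min samples grouped into 5-min windows,
# reduced with a SUM aligner, then checked against "GREATER THAN 2".

def align(samples, period, aligner=sum):
    """Group consecutive samples into windows of `period` and reduce each."""
    return [aligner(samples[i:i + period]) for i in range(0, len(samples), period)]

raw = [0, 1, 0, 1, 1, 0, 0, 2, 1, 0]   # one sample per minute (10 minutes)
aligned = align(raw, period=5)          # -> [3, 3]
violations = [v > 2 for v in aligned]   # condition: GREATER THAN 2
print(aligned, violations)
```

With a larger align period, each window absorbs more samples, so a single spike moves the aligned value less, which is the stability/sensitivity trade-off summarized above.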

1.4. Unadapted condition duration window

1.4.1. Starts firing on the first measurement matching the condition, when it should wait to confirm. Example: sampling 1 min, align period = 5 min, aligner = SUM, condition = GREATER THAN 2, assume evaluation every 1 min AND duration 3 min

1.4.2. Summary: larger duration means less noise but longer to alert; smaller duration means faster alerting and potentially more false positives. Possible to set duration = 0 -> a single aligned result will trigger
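The duration-window behavior above can be sketched as a small gate; the function name and thresholds are illustrative:

```python
def should_fire(values, threshold, duration):
    """Fire only once the condition has held for `duration` consecutive
    evaluations; duration == 0 means a single violating result triggers."""
    if duration == 0:
        return bool(values) and values[-1] > threshold
    recent = values[-duration:]
    return len(recent) == duration and all(v > threshold for v in recent)

# A duration of 3 evaluations smooths out a one-off spike:
print(should_fire([1, 5, 1], threshold=2, duration=3))  # False
print(should_fire([3, 4, 5], threshold=2, duration=3))  # True
```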

1.5. Partial metric data

1.5.1. Missing data for more than the duration -> unknown state: no alerts, incidents not closing

1.5.2. Choose what to do on missing data: nothing, treat as violating, or treat as not violating

1.5.3. Increase the duration, at the cost of responsiveness
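The three missing-data policies listed above can be sketched as follows; the policy names are assumptions, and `None` marks a missing aligned sample:

```python
def violates(value, threshold, on_missing):
    """Evaluate "value > threshold", applying a policy when data is missing."""
    if value is not None:
        return value > threshold
    if on_missing == "violating":       # treat missing as violating
        return True
    if on_missing == "not_violating":   # treat missing as not violating
        return False
    return None                         # "nothing": unknown state

series = [3, None, 4]
print([violates(v, 2, "not_violating") for v in series])  # [True, False, True]
```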

1.6. Unadapted notification latency

1.6.1. Metric sampling delay. E.g. sampled every 60 seconds; after sampling, data is not visible for up to 180 seconds

1.6.2. Alerting computation delay. E.g. Cloud Monitoring: 60 to 90 sec

1.6.3. Duration window E.g. 3 minutes

1.6.4. Time to deliver notification E.g. 3 min

1.6.5. Time for the operator to acknowledge the alert, e.g. 15 min

1.6.6. E.g. total = 3 + 1.5 + 3 + 3 + 15 = 25.5 min, plus the impact of the alignment window: last hour vs. last 24 hours
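Summing the five delay components listed above (all values illustrative) gives the worst-case time to acknowledgment:

```python
# End-to-end notification latency budget, using the example values
# from nodes 1.6.1-1.6.5 above.
delays_min = {
    "metric sampling delay": 3.0,     # data visible up to 180 s after sampling
    "alerting computation":  1.5,     # 60 to 90 s
    "duration window":       3.0,
    "notification delivery": 3.0,
    "operator ack":          15.0,
}
total = sum(delays_min.values())
print(f"worst-case time to ack: {total} min")  # 25.5 min
```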

2. Measure

2.1. % of incidents not detected through monitoring/alerting

2.2. Number of false-positive alerts per week for a given team/product

2.3. % of alerts acknowledged within the agreed time

2.4. % of unactionable alerts

2.5. % of silenced alerts

2.6. Distribution of alerts across hours and team locations

3. Understand

3.1. Fix issues before they start impacting users

3.2. Anti-pattern

3.2.1. Customer-reported issues

4. Implement

4.1. Use alerting rules

4.1.1. e.g. Prometheus Alertmanager

4.1.2. e.g. Cloud Monitoring Alerting Policies
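As a sketch of such an alerting rule, here is a hypothetical Prometheus rule combining ideas from this map (a symptom-based error-rate expression plus a duration window); the job name, threshold, and evaluation interval are assumptions:

```yaml
groups:
  - name: availability
    interval: 1m              # how often the rules in this group are evaluated
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m])) > 0.001
        for: 3m               # condition duration window
        labels:
          severity: page
        annotations:
          summary: "API error rate above 0.1% for 3 minutes"
```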

4.2. Identify related monitoring metric

4.2.1. Built-in, e.g. URL map filtering to a specific feature; error rate based on response_code filtering

4.2.2. Log-based, e.g. distribution of duration values; enables detecting when the system is getting too slow

4.2.3. Custom: any specific measurement implemented directly in the application code

4.2.4. Service-level, e.g. error budget burn rate: sli_measurement = good_events_count / (good_events_count + bad_events_count) for a given aggregation window; slo = the target, e.g. 99.9%; error_budget_target = 100% - slo, e.g. 0.1%; error_budget_measurement = 100% - sli_measurement; error_budget_burn_rate = error_budget_measurement / error_budget_target. Example one: 0.08% / 0.1% = 0.8 burn rate. Example two: 0.15% / 0.1% = 1.5 burn rate
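The burn-rate formulas above translate directly to code; the event counts below are illustrative:

```python
def burn_rate(good, bad, slo=0.999):
    """Error-budget burn rate for one aggregation window.
    A value above 1.0 means the budget is being consumed too fast."""
    sli_measurement = good / (good + bad)
    error_budget_target = 1.0 - slo               # e.g. 0.1%
    error_budget_measurement = 1.0 - sli_measurement
    return error_budget_measurement / error_budget_target

# The two examples from the text:
print(round(burn_rate(99920, 80), 2))   # 0.08% errors / 0.1% budget -> 0.8
print(round(burn_rate(99850, 150), 2))  # 0.15% errors / 0.1% budget -> 1.5
```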

4.3. Define the metric computation

4.3.1. Right duration window to query a given metric, e.g. CPU higher than 80% for at least 5 minutes

4.3.2. Aggregation, e.g. by resource group

4.3.3. Calculation: none, e.g. the number of processes running on a VM instance is more than, or less than, a threshold; rate-of-change, values in a time series increase or decrease by a specific percent; metric-ratio, e.g. error rate: Good / (Good + Bad)
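The three calculation types above can be sketched as small helpers; the function names and example values are illustrative:

```python
def threshold_breach(value, upper):      # "none": plain threshold comparison
    return value > upper

def rate_of_change_pct(prev, curr):      # rate-of-change, as a percent
    return (curr - prev) / prev * 100

def sli_ratio(good, bad):                # metric-ratio: Good / (Good + Bad)
    return good / (good + bad)

print(threshold_breach(12, upper=10))    # True
print(rate_of_change_pct(200, 250))      # 25.0
print(sli_ratio(999, 1))                 # 0.999
```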

4.4. Set conditions to define when to alert

4.4.1. One or multiple conditions

4.4.2. Use thresholds to define early-warning indicators, i.e. look at trends before it is too late

4.4.3. E.g. alerting on burn rate, remaining capacity, or quotas

4.5. Check how often the rule is evaluated

4.5.1. Configurable, e.g. the Prometheus rule group interval setting; contributes to the end-to-end detection time; reduces unnecessary load on the monitoring backend

4.5.2. Fixed, e.g. Cloud Monitoring alerting policies: "Alerting policy conditions are evaluated at a fixed frequency." "Alerting policy computations take an additional delay of 60 to 90 seconds."

4.6. Use adapted notification channel

4.6.1. How fast should the response be? Bad news should travel fast, e.g. email, PagerDuty, Slack

4.6.2. Where are the skills to deal with the detected issue? Who can best limit the potential blast radius? E.g. a dedicated 24x7 SRE team: opportunity to improve the service and go to root cause (skill economy). Anti-pattern: a common NOC/SOC, outsourced offshore, e.g. an L0 operator reboots the VM and keeps it as is (scale economy)