Grafana alerting for IoT: rules, contact points, and notification policies
Grafana Unified Alerting for IoT platforms: alert rules in PromQL and LogQL, contact points (email/Slack/PagerDuty), notification policies, silences, and Grafana OnCall integration.
By M-Bus Gateway
Grafana Unified Alerting replaces the legacy alerting of Grafana 8 and supports PromQL, LogQL, and Mimir-based rules. Below is the setup used for IoT gateway monitoring.
Grafana Unified Alerting: architecture
# Grafana alerting stack (docker-compose):
services:
  grafana:
    image: grafana/grafana:11.0.0
    environment:
      # Legacy alerting was removed in Grafana 11, so GF_ALERTING_ENABLED is not set
      GF_UNIFIED_ALERTING_ENABLED: "true"
    # Alerts are evaluated in Grafana (not in Prometheus)
    # Upstream: Mimir/Prometheus for metrics, Loki for logs
  prometheus:
    image: prom/prometheus:v2.53.0
    # Scrapes: FastAPI /metrics (OpenTelemetry Prometheus exporter)
  loki:
    image: grafana/loki:3.1.0
    # Receives: structlog JSON output via Promtail
# Alerting flow:
#   Prometheus metrics → Grafana alert rule (PromQL) → fires alert
#   Loki logs → Grafana alert rule (LogQL) → fires alert
#   Alert → Contact point → Email/Slack/PagerDuty
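To make the first hop of that flow concrete, the sketch below renders the kind of text a Prometheus scrape of /metrics could return for the `mbus_gateway_last_seen_timestamp` gauge used in the alert rules. It is an illustration only: the platform itself relies on the OpenTelemetry Prometheus exporter, and the `render_last_seen` helper, the help text, and the gateway IDs are hypothetical.

```python
def render_last_seen(last_seen: dict[str, float]) -> str:
    """Render a gauge in the Prometheus text exposition format.

    Assumes gateway IDs contain no quotes, backslashes, or newlines,
    so no label escaping is performed.
    """
    name = "mbus_gateway_last_seen_timestamp"
    lines = [
        f"# HELP {name} Unix time of the last message received from a gateway.",
        f"# TYPE {name} gauge",
    ]
    # One sample per gateway, labelled with gateway_id.
    for gateway_id, ts in sorted(last_seen.items()):
        lines.append(f'{name}{{gateway_id="{gateway_id}"}} {ts}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_last_seen({"gw-001": 1700000000.0, "gw-002": 1700003600.0}))
```

Each gateway becomes its own time series, which is what lets the alert rule aggregate with `max by (gateway_id)`.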
Alert rules: gateway-specific
# grafana/provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: mbus-gateways
    folder: IoT Alerts
    interval: 5m  # Evaluation interval
    rules:
      # 1. Gateway stale: no data for 36+ hours
      - uid: gateway-stale
        title: "Gateway stale: no readings"
        condition: C
        for: 0s  # Fire on the first breaching evaluation
        data:
          - refId: A
            datasourceUid: prometheus  # Assumed data source UID; replace with your own
            relativeTimeRange: { from: 600, to: 0 }
            model:
              refId: A
              instant: true  # The threshold expression needs one value per series
              expr: |
                time() - max by (gateway_id) (
                  last_over_time(mbus_gateway_last_seen_timestamp[1h])
                ) > 129600  # 36 hours in seconds
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: A
              conditions:
                - evaluator: { type: gt, params: [0] }
                  reducer: { type: last }
        labels:
          severity: warning
        annotations:
          summary: "Gateway {{ $labels.gateway_id }} stale for 36+ hours"
          description: "No MQTT heartbeat. Check the SIM card and network."
      # 2. AES decryption errors (Loki):
      - uid: aes-decrypt-errors
        title: "AES decryption failing"
        condition: B
        for: 0s
        data:
          - refId: A
            datasourceUid: loki  # Assumed data source UID; replace with your own
            relativeTimeRange: { from: 600, to: 0 }
            model:
              refId: A
              queryType: instant
              expr: |
                sum(count_over_time(
                  {service="mbus-server"}
                    |= "decryption failed" [5m]
                )) > 5
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: threshold
              expression: A
              conditions:
                - evaluator: { type: gt, params: [0] }
                  reducer: { type: last }
        labels:
          severity: critical
        annotations:
          summary: ">5 AES decryption errors in 5 min"
          description: "Likely key rotation. Update the AES key in the platform."
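The staleness rule boils down to simple arithmetic: 36 hours is 36 × 3600 = 129600 seconds, and a gateway is stale when the current time minus its newest last-seen timestamp exceeds that. The sketch below mirrors that comparison in plain Python; the `stale_gateways` helper and the sample timestamps are hypothetical and only illustrate what the PromQL expression computes per `gateway_id`.

```python
STALE_AFTER_SECONDS = 36 * 60 * 60  # 129600, same threshold as the PromQL rule


def stale_gateways(last_seen: dict[str, float], now: float) -> list[str]:
    """Return the IDs of gateways whose last-seen age exceeds the threshold.

    Mirrors `time() - last_seen > 129600`: the comparison is strict, so a
    gateway seen exactly 36 hours ago is not yet stale.
    """
    return sorted(
        gateway_id
        for gateway_id, ts in last_seen.items()
        if now - ts > STALE_AFTER_SECONDS
    )


if __name__ == "__main__":
    now = 1_700_000_000.0
    seen = {
        "gw-a": now - 129_601,  # one second past the threshold
        "gw-b": now - 129_600,  # exactly on the threshold
        "gw-c": now - 60,       # seen a minute ago
    }
    print(stale_gateways(seen, now))
```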
Contact points
# grafana/provisioning/alerting/contact_points.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: email-ops
    receivers:
      - uid: email-ops-receiver
        type: email
        settings:
          addresses: "ops@mbus-platform.dk"
          subject: "[Alert] {{ .CommonLabels.alertname }}"
          message: |
            {{ range .Alerts }}
            Status: {{ .Status }}
            Summary: {{ .Annotations.summary }}
            Detail: {{ .Annotations.description }}
            Gateway: {{ .Labels.gateway_id }}
            Started: {{ .StartsAt }}
            {{ end }}
  - orgId: 1
    name: slack-iot-alerts
    receivers:
      - uid: slack-receiver
        type: slack
        settings:
          url: "${SLACK_WEBHOOK_URL}"
          recipient: "#iot-alerts"  # Grafana's Slack integration names the target channel "recipient"
          title: "{{ .CommonLabels.alertname }}"
          text: |
            *{{ .CommonAnnotations.summary }}*
            {{ .CommonAnnotations.description }}
          color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pagerduty-receiver
        type: pagerduty
        settings:
          integrationKey: "${PAGERDUTY_KEY}"
          severity: critical
          class: gateway-failure
Notification policies: routing
# grafana/provisioning/alerting/notification_policies.yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: email-ops  # Default receiver
    group_by: [alertname, gateway_id]
    group_wait: 30s      # Wait before the first notification
    group_interval: 5m   # Interval between updates for an existing group
    repeat_interval: 4h  # Re-send active alerts every 4 hours
    routes:
      # Critical alerts → PagerDuty:
      - receiver: pagerduty-critical
        matchers:
          - severity = critical
        continue: true  # Keep evaluating the sibling routes below
      # Gateway-specific → Slack + email:
      - receiver: slack-iot-alerts
        matchers:
          - alertname =~ "Gateway.*"
        group_wait: 1m
        continue: true  # Without this, matching stops at Slack and the email route is never reached
      # Routine alerts → email only, no PagerDuty:
      - receiver: email-ops
        matchers:
          - severity = warning
        repeat_interval: 24h
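The `continue` flag decides whether an alert that matched one route is also offered to the later sibling routes. The sketch below is a simplified model of that first-level matching, not Grafana's implementation: the `Route` class, `resolve_receivers`, and `demo_routes` are hypothetical names, nested routes are not modelled, and only `=` and `=~` matchers are supported. The route tree resembles the one above, with the Slack route's `continue` flag made switchable so both outcomes are visible.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    equals: dict[str, str] = field(default_factory=dict)  # label = value
    regex: dict[str, str] = field(default_factory=dict)   # label =~ pattern
    continue_: bool = False

    def matches(self, labels: dict[str, str]) -> bool:
        eq_ok = all(labels.get(k) == v for k, v in self.equals.items())
        # Regex matchers are treated as fully anchored.
        re_ok = all(
            re.fullmatch(pattern, labels.get(k, "")) is not None
            for k, pattern in self.regex.items()
        )
        return eq_ok and re_ok


def resolve_receivers(
    labels: dict[str, str], routes: list[Route], default_receiver: str
) -> list[str]:
    """Walk sibling routes in order; stop at the first match without continue."""
    receivers: list[str] = []
    for route in routes:
        if route.matches(labels):
            receivers.append(route.receiver)
            if not route.continue_:
                break
    # An alert that matches no route falls back to the default receiver.
    return receivers or [default_receiver]


def demo_routes(slack_continues: bool) -> list[Route]:
    return [
        Route("pagerduty-critical", equals={"severity": "critical"}, continue_=True),
        Route("slack-iot-alerts", regex={"alertname": "Gateway.*"},
              continue_=slack_continues),
        Route("email-ops", equals={"severity": "warning"}),
    ]


if __name__ == "__main__":
    stale = {"alertname": "Gateway stale", "severity": "warning"}
    print(resolve_receivers(stale, demo_routes(False), "email-ops"))
    print(resolve_receivers(stale, demo_routes(True), "email-ops"))
```

In this model a warning-level gateway alert reaches only Slack when the Slack route stops the walk, and reaches both Slack and email when it continues.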
Silences: maintenance windows
# Planned maintenance → silence alerts automatically
import os
from datetime import datetime, timedelta, timezone
import httpx
# Read from the environment; the variable names are assumptions
GRAFANA_URL = os.environ["GRAFANA_URL"]
GRAFANA_API_TOKEN = os.environ["GRAFANA_API_TOKEN"]
async def create_grafana_silence(
    gateway_id: str,
    duration_hours: int = 2,
    reason: str = "Planned maintenance",
) -> str:
    """Create a Grafana silence for a specific gateway and return its ID."""
    now = datetime.now(timezone.utc)  # timezone-aware; utcnow() is deprecated
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{GRAFANA_URL}/api/alertmanager/grafana/api/v2/silences",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"},
            json={
                "matchers": [
                    {"name": "gateway_id", "value": gateway_id,
                     "isRegex": False, "isEqual": True}
                ],
                "startsAt": now.isoformat(),
                "endsAt": (now + timedelta(hours=duration_hours)).isoformat(),
                "comment": reason,
                "createdBy": "mbus-platform",
            },
        )
        resp.raise_for_status()
        # The Alertmanager v2 API returns the new silence under "silenceID"
        return resp.json()["silenceID"]
# Called from the portal when a technician starts an OTA update:
# await create_grafana_silence(gateway_id, duration_hours=1, reason="OTA update")
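If an OTA update finishes early, the silence can be expired instead of waiting for it to run out. The sketch below only builds the HTTP request, using the standard library so it can be inspected without a running Grafana. It assumes Grafana mirrors the Alertmanager v2 `DELETE .../silence/{id}` endpoint under the same path prefix as the create call; verify that against your Grafana version. The `build_expire_silence_request` helper and the example URL are hypothetical.

```python
import urllib.request


def build_expire_silence_request(
    grafana_url: str, api_token: str, silence_id: str
) -> urllib.request.Request:
    """Build (but do not send) a DELETE request that expires a silence."""
    base = grafana_url.rstrip("/")
    # Assumed path, following the Alertmanager v2 API layout.
    url = f"{base}/api/alertmanager/grafana/api/v2/silence/{silence_id}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {api_token}"},
    )


if __name__ == "__main__":
    req = build_expire_silence_request(
        "https://grafana.example.com/", "example-token", "abc-123"
    )
    print(req.get_method(), req.full_url)
```

Sending the request is then a matter of passing it to `urllib.request.urlopen`, or issuing the same DELETE through the `httpx` client used above.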
Recording rules: pre-aggregated data
# prometheus/recording_rules.yaml
# Recording rules reduce the query load from dashboards and alerts
groups:
  - name: mbus_gateway_summary
    interval: 5m
    rules:
      # Daily average per gateway (for trend alerts):
      - record: mbus:gateway:readings_per_day_avg5m
        expr: |
          avg_over_time(
            rate(mbus_readings_ingested_total[1h])[24h:1h]
          ) * 86400
      # Active gateways (seen within the last 36 h):
      - record: mbus:gateways:active_count
        expr: |
          count(
            time() - mbus_gateway_last_seen_timestamp < 129600
          )
      # P99 API latency per endpoint:
      # (http_route is an assumed label name; use the label that identifies the endpoint)
      - record: mbus:api:p99_latency_ms
        expr: |
          histogram_quantile(0.99,
            sum by (le, http_route) (
              rate(http_server_duration_milliseconds_bucket[5m])
            )
          )
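The readings-per-day rule is a unit conversion: `rate()` yields readings per second, the subquery samples that rate once an hour over 24 hours, and multiplying the average by 86400 (seconds per day) turns it into readings per day. A minimal sketch of that arithmetic, with a hypothetical `readings_per_day` helper and made-up sample values:

```python
from statistics import fmean

SECONDS_PER_DAY = 86_400


def readings_per_day(hourly_rates_per_second: list[float]) -> float:
    """Average per-second rates and scale the result to a per-day figure.

    Mirrors `avg_over_time(rate(...)[24h:1h]) * 86400`. Raises
    statistics.StatisticsError if the list is empty.
    """
    return fmean(hourly_rates_per_second) * SECONDS_PER_DAY


if __name__ == "__main__":
    # A gateway delivering one reading per hour has a rate of 1/3600 per second.
    one_per_hour = [1 / 3600] * 24
    print(round(readings_per_day(one_per_hour), 6))
```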
Conclusion
Grafana Unified Alerting with YAML provisioning gives reproducible alert configuration as code. Routing through notification policies sends critical alerts to PagerDuty, gateway alerts to Slack, and routine warnings to email with a daily repeat. Recording rules reduce query load and provide stable metrics for dashboard panels and alert conditions. Automatic silences during OTA updates prevent alert storms during planned maintenance.
See also the OpenTelemetry observability guide or the Grafana IoT monitoring guide.