M-Bus Gateway
· Grafana· alerting· IoT· PromQL· monitoring· observability· Prometheus· Loki

Grafana alerting for IoT: rules, contact points and notification policies

Grafana Unified Alerting for IoT platforms: alert rules in PromQL and LogQL, contact points (email/Slack/PagerDuty), notification policies, silences and Grafana OnCall integration.

By M-Bus Gateway

Grafana Unified Alerting replaces the legacy alerting of Grafana 8 and supports PromQL, LogQL and Mimir-based rules. This post walks through a setup for monitoring IoT gateways.


Grafana Unified Alerting: architecture

# Grafana alerting stack (docker-compose):

services:
  grafana:
    image: grafana/grafana:11.0.0
    environment:
      # Legacy alerting (GF_ALERTING_ENABLED) was removed in Grafana 11,
      # so only the unified alerting switch is set here.
      GF_UNIFIED_ALERTING_ENABLED: "true"
      # Alerts are evaluated in Grafana (not in Prometheus)
      # Upstream: Mimir/Prometheus for metrics, Loki for logs

  prometheus:
    image: prom/prometheus:v2.53.0
    # Scrapes: FastAPI /metrics (OpenTelemetry Prometheus exporter)

  loki:
    image: grafana/loki:3.1.0
    # Receives: structlog JSON output via Promtail

# Alerting flow:
# Prometheus metrics → Grafana alert rule (PromQL) → alert fires
# Loki logs → Grafana alert rule (LogQL) → alert fires
# Alert → notification policy → contact point → Email/Slack/PagerDuty
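
The stale-gateway alert depends on a per-gateway timestamp gauge, `mbus_gateway_last_seen_timestamp`, being available on `/metrics`. As a rough illustration of what Prometheus would scrape, the sketch below renders such a gauge in the Prometheus text exposition format using only the standard library. The function name `render_last_seen_metric` and the sample gateway IDs are hypothetical; the stack described here uses the OpenTelemetry Prometheus exporter rather than hand-written output.

```python
def render_last_seen_metric(last_seen: dict) -> str:
    """Render {gateway_id: unix_timestamp} as a Prometheus gauge in text format."""
    name = "mbus_gateway_last_seen_timestamp"
    lines = [
        f"# HELP {name} Unix time of the last reading per gateway.",
        f"# TYPE {name} gauge",
    ]
    for gateway_id in sorted(last_seen):
        # Label values must escape backslash, double quote and newline.
        escaped = (
            gateway_id.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'{name}{{gateway_id="{escaped}"}} {last_seen[gateway_id]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical gateways and timestamps, for illustration only.
    print(render_last_seen_metric({"gw-001": 1718000000.0, "gw-002": 1718003600.0}))
```

One sample line per gateway keeps `gateway_id` available as a label, which is what lets the alert rule report which gateway went quiet.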

Alert rules: gateway-specific

# grafana/provisioning/alerting/rules.yaml

apiVersion: 1
groups:
  - name: mbus-gateways
    folder: IoT Alerts
    interval: 5m    # Evaluation interval
    rules:

      # 1. Gateway stale: no data for 36+ hours
      - uid: gateway-stale
        title: "Gateway stale: no readings"
        condition: C
        for: 0s    # Required field; 0s fires on the first breaching evaluation
        data:
          - refId: A
            datasourceUid: prometheus    # Assumed UID of the Prometheus data source
            relativeTimeRange: { from: 600, to: 0 }
            model:
              refId: A
              instant: true
              expr: |
                time() - max by (gateway_id) (
                  last_over_time(mbus_gateway_last_seen_timestamp[1h])
                )
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: A
              conditions:
                - evaluator: { type: gt, params: [129600] }   # 36 hours in seconds
        labels:
          severity: warning
        annotations:
          summary: "Gateway {{ $labels.gateway_id }} stale for 36+ hours"
          description: "No MQTT heartbeat. Check SIM card and network."

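      # 2. AES decryption errors (Loki):
      - uid: aes-decrypt-errors
        title: "AES decryption failing"
        condition: B
        for: 0s    # Required field; 0s fires on the first breaching evaluation
        noDataState: OK    # No matching log lines means no errors
        data:
          - refId: A
            queryType: instant
            datasourceUid: loki    # Assumed UID of the Loki data source
            relativeTimeRange: { from: 300, to: 0 }
            model:
              refId: A
              queryType: instant
              expr: |
                sum(count_over_time(
                  {service="mbus-server"}
                  |= "decryption failed" [5m]
                ))
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: threshold
              expression: A
              conditions:
                - evaluator: { type: gt, params: [5] }
        labels:
          severity: critical
        annotations:
          summary: "More than 5 AES decryption errors in 5 min"
          description: "Likely key rotation. Update the AES key in the platform."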
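
The stale-gateway condition is simple arithmetic: the current time minus the last-seen timestamp, compared against 36 hours (129600 seconds). The sketch below reproduces that comparison in plain Python so the threshold can be unit-tested outside Grafana. The function name `stale_gateways` and the sample data are hypothetical, and this is an illustration of the rule's logic rather than code Grafana runs.

```python
STALE_AFTER_SECONDS = 36 * 3600  # 129600, the same threshold as the alert rule


def stale_gateways(last_seen: dict, now: float) -> dict:
    """Return {gateway_id: age_in_seconds} for gateways silent longer than 36 hours."""
    return {
        gateway_id: now - timestamp
        for gateway_id, timestamp in last_seen.items()
        if now - timestamp > STALE_AFTER_SECONDS
    }


if __name__ == "__main__":
    now = 1_000_000.0
    last_seen = {
        "gw-a": now - 3600,       # seen one hour ago: healthy
        "gw-b": now - 40 * 3600,  # seen 40 hours ago: stale
    }
    print(stale_gateways(last_seen, now))  # only gw-b is reported
```

A gateway that is exactly 36 hours old is not reported, because the comparison is strictly greater than, matching the `gt` evaluator.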

Contact points

# grafana/provisioning/alerting/contact_points.yaml

apiVersion: 1
contactPoints:
  - orgId: 1
    name: email-ops
    receivers:
      - uid: email-ops-receiver
        type: email
        settings:
          addresses: "ops@mbus-platform.dk"
          subject: "[Alert] {{ .CommonLabels.alertname }}"
          message: |
            {{ range .Alerts }}
            Status: {{ .Status }}
            Summary: {{ .Annotations.summary }}
            Detail: {{ .Annotations.description }}
            Gateway: {{ .Labels.gateway_id }}
            Start time: {{ .StartsAt }}
            {{ end }}

  - orgId: 1
    name: slack-iot-alerts
    receivers:
      - uid: slack-receiver
        type: slack
        settings:
          url: "${SLACK_WEBHOOK_URL}"
          recipient: "#iot-alerts"    # Grafana's Slack setting for the channel is "recipient"
          title: "{{ .CommonLabels.alertname }}"
          text: |
            *{{ .CommonAnnotations.summary }}*
            {{ .CommonAnnotations.description }}
          # Single-line scalar avoids the trailing newline a block scalar would add
          color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pagerduty-receiver
        type: pagerduty
        settings:
          integrationKey: "${PAGERDUTY_KEY}"
          severity: critical
          class: gateway-failure
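
The contact points above take their secrets from `${SLACK_WEBHOOK_URL}` and `${PAGERDUTY_KEY}`, and an unset variable is easy to miss until the first alert fails to arrive. A pre-deploy check along the following lines could catch that. It is a minimal sketch using only the standard library: the function name `missing_env_vars` is hypothetical, and only the `${VAR}` placeholder form used in this file is recognized.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${VAR_NAME}; bare $VAR is deliberately ignored so that
# Go template variables are never mistaken for placeholders.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(text: str, env: Optional[Mapping[str, str]] = None) -> list:
    """Return the sorted names of ${VAR} placeholders that are unset or empty."""
    env = os.environ if env is None else env
    names = {match.group(1) for match in _PLACEHOLDER.finditer(text)}
    return sorted(name for name in names if not env.get(name))


if __name__ == "__main__":
    sample = 'url: "${SLACK_WEBHOOK_URL}"\nintegrationKey: "${PAGERDUTY_KEY}"\n'
    print(missing_env_vars(sample))  # names not set in the current environment
```

Running it against the provisioning file in CI, before Grafana starts, turns a silent misconfiguration into a failed build.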

Notification policies: routing

# grafana/provisioning/alerting/notification_policies.yaml

apiVersion: 1
policies:
  - orgId: 1
    receiver: email-ops          # Default receiver
    group_by: [alertname, gateway_id]
    group_wait: 30s              # Wait before the first notification for a new group
    group_interval: 5m           # Interval between updates for an existing group
    repeat_interval: 4h          # Re-send still-firing alerts every 4 hours

    routes:
      # Critical alerts → PagerDuty:
      - receiver: pagerduty-critical
        matchers:
          - severity = critical
        continue: true           # Keep evaluating the sibling routes below

      # Gateway-specific → Slack + email:
      - receiver: slack-iot-alerts
        matchers:
          - alertname =~ "Gateway.*"
        group_wait: 1m
        continue: true           # Without this, matching stops here and email is skipped

      # Everyday alerts → email only, no PagerDuty:
      - receiver: email-ops
        matchers:
          - severity = warning
        repeat_interval: 24h
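
Route order and the `continue` flag decide which receivers an alert reaches: sibling routes are checked top to bottom, matching stops at the first hit unless that route sets `continue: true`, and the default receiver is used only when nothing matches. The sketch below simulates that behavior for a flat list of routes. It is a simplified model with hypothetical names (`Route`, `receivers_for`, `example_routes`); nested routes, grouping and timing are left out, and only the `=` and `=~` operators are handled.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    # Each matcher is (label, operator, value); only "=" and "=~" are modeled.
    matchers: list = field(default_factory=list)
    continue_: bool = False


def route_matches(route: Route, labels: dict) -> bool:
    """True if every matcher on the route accepts the alert's labels."""
    for name, op, value in route.matchers:
        actual = labels.get(name, "")
        if op == "=":
            ok = actual == value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # regex matchers are anchored
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True


def receivers_for(labels: dict, routes: list, default: str) -> list:
    """Walk sibling routes in order; stop at the first match without continue."""
    hits = []
    for route in routes:
        if route_matches(route, labels):
            hits.append(route.receiver)
            if not route.continue_:
                break
    return hits or [default]


def example_routes(slack_continues: bool) -> list:
    """Routes modeled on the policy above; the Slack continue flag is a parameter."""
    return [
        Route("pagerduty-critical", [("severity", "=", "critical")], continue_=True),
        Route("slack-iot-alerts", [("alertname", "=~", "Gateway.*")], continue_=slack_continues),
        Route("email-ops", [("severity", "=", "warning")]),
    ]


if __name__ == "__main__":
    stale = {"alertname": "Gateway stale", "severity": "warning", "gateway_id": "gw-001"}
    print(receivers_for(stale, example_routes(False), "email-ops"))
    print(receivers_for(stale, example_routes(True), "email-ops"))
```

With the Slack route not continuing, a warning-level gateway alert stops at Slack; with it continuing, the same alert also reaches the email route. A critical alert that matches neither later route reaches PagerDuty only, since the default receiver applies only when no route matches at all.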

Silences: maintenance window

# Planned maintenance → silence alerts automatically

import os
from datetime import datetime, timedelta, timezone

import httpx

# Read from the environment (these variable names are assumptions)
GRAFANA_URL = os.environ["GRAFANA_URL"]
GRAFANA_API_TOKEN = os.environ["GRAFANA_API_TOKEN"]

async def create_grafana_silence(
    gateway_id: str,
    duration_hours: int = 2,
    reason: str = "Planned maintenance",
) -> str:
    """Create a Grafana silence for a specific gateway and return its ID."""
    # Take one timestamp so startsAt and endsAt differ by exactly the duration.
    # datetime.utcnow() is deprecated; use a timezone-aware UTC time instead.
    starts_at = datetime.now(timezone.utc)
    ends_at = starts_at + timedelta(hours=duration_hours)
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{GRAFANA_URL}/api/alertmanager/grafana/api/v2/silences",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"},
            json={
                "matchers": [
                    {"name": "gateway_id", "value": gateway_id, "isRegex": False}
                ],
                "startsAt": starts_at.isoformat(),
                "endsAt": ends_at.isoformat(),
                "comment": reason,
                "createdBy": "mbus-platform",
            },
        )
        resp.raise_for_status()
        # The Alertmanager v2 API returns the new ID under "silenceID"
        return resp.json()["silenceID"]

# Called from the portal when a technician starts an OTA update:
# await create_grafana_silence(gateway_id, duration_hours=1, reason="OTA update")
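
Building the request body in a separate, pure function makes the silence window easy to test without calling Grafana. The sketch below does that with only the standard library. The function name `build_silence_payload` is hypothetical, the field names follow the Alertmanager v2 silence schema, and the explicit `isEqual: True` is an assumption about what the matcher should express (it is the schema's default).

```python
from datetime import datetime, timedelta, timezone


def build_silence_payload(
    gateway_id: str,
    start: datetime,
    duration_hours: float = 2,
    reason: str = "Planned maintenance",
    created_by: str = "mbus-platform",
) -> dict:
    """Build the JSON body for a silence that covers one gateway."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    if duration_hours <= 0:
        raise ValueError("duration_hours must be positive")
    start_utc = start.astimezone(timezone.utc)
    end_utc = start_utc + timedelta(hours=duration_hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 in UTC, second precision
    return {
        "matchers": [
            {"name": "gateway_id", "value": gateway_id, "isRegex": False, "isEqual": True}
        ],
        "startsAt": start_utc.strftime(fmt),
        "endsAt": end_utc.strftime(fmt),
        "comment": reason,
        "createdBy": created_by,
    }


if __name__ == "__main__":
    payload = build_silence_payload("gw-001", datetime.now(timezone.utc), 1, "OTA update")
    print(payload["startsAt"], "to", payload["endsAt"])
```

Rejecting naive datetimes up front avoids a silence that starts an hour or two off because local time was sent as if it were UTC.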

Recording rules: pre-aggregated data

# prometheus/recording_rules.yaml
# Recording rules reduce query load for dashboards and alerts

groups:
  - name: mbus_gateway_summary
    interval: 5m
    rules:
      # Daily average per gateway (for trend alerts):
      - record: mbus:gateway:readings_per_day_avg5m
        expr: |
          avg_over_time(
            rate(mbus_readings_ingested_total[1h])[24h:1h]
          ) * 86400

      # Active gateways (seen within the last 36 hours):
      - record: mbus:gateways:active_count
        expr: |
          count(
            time() - mbus_gateway_last_seen_timestamp < 129600
          )

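      # P99 API latency per endpoint.
      # Buckets are summed by le plus the endpoint label before taking the
      # quantile; http_route is an assumed label name, adjust to your exporter.
      - record: mbus:api:p99_latency_ms
        expr: |
          histogram_quantile(0.99,
            sum by (le, http_route) (
              rate(http_server_duration_milliseconds_bucket[5m])
            )
          )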
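
The P99 rule relies on `histogram_quantile`, which estimates a quantile by finding the bucket that contains the target rank and interpolating linearly inside it. The sketch below reproduces that estimate for a single set of cumulative buckets. It is a simplified illustration rather than Prometheus's exact implementation: the function name and sample counts are hypothetical, and special cases such as non-positive first-bucket bounds are not handled.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must contain at least one finite bucket plus the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, cum) in enumerate(ordered) if cum >= rank)
    upper, cumulative = ordered[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return ordered[-2][0]
    lower, below = ordered[index - 1] if index > 0 else (0.0, 0.0)
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    # Hypothetical latency buckets in milliseconds: 100 requests in total.
    sample = [(50, 90), (100, 98), (250, 100), (math.inf, 100)]
    print(histogram_quantile(0.99, sample))  # approximately 175 ms
```

In the sample, the 99th request lands in the 100 to 250 ms bucket, halfway through its two observations, so the estimate is about 175 ms. The result is only as precise as the bucket boundaries, which is worth keeping in mind when alerting on P99 values.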

Conclusion

Grafana Unified Alerting with YAML provisioning gives you reproducible alert configuration as code. Routing through notification policies sends critical alerts to PagerDuty, gateway alerts to Slack, and warnings to email only, repeated once a day. Recording rules reduce query load and provide stable metrics for dashboard panels and alert conditions. Automatic silences during OTA updates prevent alert storms during planned maintenance.

See the OpenTelemetry observability guide or the Grafana IoT monitoring guide.