Grafana · Loki · monitoring · IoT · gateway · RSSI · battery · alerting · TimescaleDB

IoT gateway monitoring with Grafana and Loki

Monitoring IoT gateways with Grafana: RSSI trends, battery level, reading coverage, structured logging with Loki, alerting, and dashboard setup.

By M-Bus Gateway

Grafana with TimescaleDB and Loki gives full observability over a fleet of wM-Bus gateways, from signal strength to reading coverage and log analysis.


Architecture

[Raspberry Pi gateway]
        ↓ structlog JSON to /var/log/mbus/
        ↓ MQTT metrics payload
[Hetzner server]
        ↓ TimescaleDB (readings, gateway status)
        ↓ Loki (gateway logs via Promtail)
[Grafana]
        ← TimescaleDB datasource (PostgreSQL)
        ← Loki datasource
        → Alerting → Brevo email / PagerDuty
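
The diagram mentions an MQTT metrics payload without showing its shape. As a rough illustration only, a gateway could serialize its metrics like this. The field names, the `GatewayMetrics` class, and the topic layout are all assumptions made for this sketch, not the format the gateways actually use.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GatewayMetrics:
    """Hypothetical metrics payload; every field name here is an assumption."""
    gateway_id: str
    timestamp: str           # ISO 8601 string, UTC
    telegrams_received: int  # telegrams seen since the previous report
    avg_rssi_dbm: float      # mean RSSI over the same interval


def encode_metrics(metrics: GatewayMetrics) -> tuple[str, bytes]:
    """Return an (assumed) MQTT topic and a compact JSON payload."""
    topic = f"mbus/gateways/{metrics.gateway_id}/metrics"  # assumed topic layout
    payload = json.dumps(asdict(metrics), separators=(",", ":")).encode("utf-8")
    return topic, payload
```

The returned pair can be handed to whichever MQTT client the gateway uses. Keeping the encoding separate from the publish call makes the payload easy to unit test.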

TimescaleDB datasource configuration

# grafana/provisioning/datasources/timescaledb.yml
apiVersion: 1
datasources:
  - name: TimescaleDB
    type: postgres
    url: timescaledb:5432
    database: mbus
    user: grafana_ro
    secureJsonData:
      password: "${GRAFANA_DB_PASSWORD}"
    jsonData:
      sslmode: require
      maxOpenConns: 5
      maxIdleConns: 2
      postgresVersion: 1600
      timescaledb: true

-- Create a read-only user for Grafana:
CREATE USER grafana_ro WITH PASSWORD 'strong-password';
GRANT CONNECT ON DATABASE mbus TO grafana_ro;
GRANT USAGE ON SCHEMA public TO grafana_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO grafana_ro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO grafana_ro;

Dashboard: RSSI trends per gateway

-- Grafana query: RSSI over time per gateway:
SELECT
  time_bucket('1 hour', r.timestamp) AS time,
  g.name AS gateway,
  AVG(r.rssi_dbm) AS avg_rssi
FROM reading r
JOIN meter_installation mi ON mi.id = r.meter_installation_id
JOIN meter m ON m.id = mi.meter_id
JOIN gateway g ON g.id = m.gateway_id
WHERE
  $__timeFilter(r.timestamp)
  AND g.tenant_id = '${tenant_id}'
GROUP BY 1, 2
ORDER BY 1

Grafana panel configuration:
  Type: Time series
  Y-axis: RSSI (dBm); values are negative, so lower means a weaker signal
  Thresholds:
    > -70 dBm       → green (good)
    -85 to -70 dBm  → yellow (acceptable)
    < -85 dBm       → red (critical)
  Alert: if avg_rssi < -85 dBm for 30 min → send an alert
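
The threshold bands and the alert condition can also be written as a small function, for example to reuse the same colour logic in a report outside Grafana. This is a minimal Python sketch. The names `classify_rssi` and `sustained_breach` are illustrative, and the exact boundary values (-70 and -85 dBm) are assumed to fall in the yellow band.

```python
from typing import Sequence


def classify_rssi(avg_rssi_dbm: float) -> str:
    """Map an average RSSI in dBm to the panel's colour bands."""
    if avg_rssi_dbm > -70:
        return "green"   # good
    if avg_rssi_dbm >= -85:
        return "yellow"  # acceptable
    return "red"         # critical


def sustained_breach(samples: Sequence[float], threshold_dbm: float = -85.0) -> bool:
    """True if every sample in the window is below the threshold.

    `samples` is assumed to hold the avg_rssi values covering the
    30-minute alert window; an empty window never triggers an alert.
    """
    return bool(samples) and all(value < threshold_dbm for value in samples)
```

Requiring every sample in the window to breach the threshold mirrors the "for 30 min" condition: a single weak reading does not raise an alert.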

Dashboard: Battery level and expected lifetime

-- Battery trend per installation (set the dashboard time range to the last 90 days):
SELECT
  time_bucket('1 day', timestamp) AS day,
  meter_installation_id,
  AVG(battery_level_pct) AS battery_pct
FROM reading
WHERE
  $__timeFilter(timestamp)
  AND battery_level_pct IS NOT NULL
  AND meter_installation_id = '${installation_id}'
GROUP BY 1, 2
ORDER BY 1

-- Estimated number of days until the critical level (20%):
WITH daily AS (
  SELECT
    date_trunc('day', timestamp) AS day,
    AVG(battery_level_pct) AS pct
  FROM reading
  WHERE
    meter_installation_id = '${installation_id}'
    AND timestamp >= now() - INTERVAL '90 days'
    AND battery_level_pct IS NOT NULL
  GROUP BY 1
),
regression AS (
  SELECT
    -- x is in epoch seconds, so slope is percentage points per second
    regr_slope(pct, extract(epoch FROM day)) AS slope,
    regr_intercept(pct, extract(epoch FROM day)) AS intercept,
    -- most recent daily average (MAX(pct) would return the 90-day maximum)
    (array_agg(pct ORDER BY day DESC))[1] AS latest_pct
  FROM daily
)
SELECT
  latest_pct AS current_battery_pct,
  CASE
    -- only extrapolate while the battery is draining and still above 20%
    WHEN slope < 0 AND latest_pct > 20 THEN
      round(((20 - latest_pct) / slope / 86400)::numeric, 0)
    ELSE NULL
  END AS days_to_20pct
FROM regression
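
The same estimate can be reproduced outside the database, which helps when sanity-checking the SQL against a known series. The sketch below fits an ordinary least-squares line to daily averages and extrapolates from the latest value to the 20% level. It assumes one value per day with no gaps, which the SQL does not require, and the function name `days_to_threshold` is illustrative.

```python
from typing import Optional, Sequence


def days_to_threshold(daily_pct: Sequence[float], threshold: float = 20.0) -> Optional[float]:
    """Estimate days until the battery reaches `threshold` percent.

    daily_pct[i] is the average battery percentage on day i, oldest first.
    Returns None when there is too little data or the level is not falling.
    """
    n = len(daily_pct)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_pct))
    slope = sxy / sxx  # percentage points per day
    if slope >= 0:
        return None    # flat or charging: no meaningful estimate
    latest = daily_pct[-1]
    if latest <= threshold:
        return 0.0     # already at or below the critical level
    return (threshold - latest) / slope
```

Because x is measured in days here rather than epoch seconds, the division by 86400 from the SQL version is not needed.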

Loki logging from the gateway

# gateway/src/logging/setup.py

import logging
import sys

import structlog


def setup_logging(log_level: str = "INFO") -> None:
    """Structured JSON logging to stdout, collected by Promtail.

    Promtail tails /var/log/mbus/*.log, so stdout must be redirected there.
    """
    structlog.configure(
        processors=[
            structlog.stdlib.filter_by_level,
            structlog.stdlib.add_log_level,
            structlog.stdlib.add_logger_name,
            structlog.processors.TimeStamper(fmt="iso", utc=True),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ],
        # add_logger_name reads logger.name, so a stdlib logger is required;
        # loggers from PrintLoggerFactory have no name attribute.
        wrapper_class=structlog.stdlib.BoundLogger,
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
    )
    # The stdlib handler prints the already-rendered JSON string unchanged.
    logging.basicConfig(
        format="%(message)s",
        stream=sys.stdout,
        level=getattr(logging, log_level.upper()),
    )

# Usage in gateway code (after setup_logging() has been called):
log = structlog.get_logger()
log.info("telegram_received",
    meter_id="12345678",
    rssi_dbm=-72,
    status="OK",
    gateway_id="GW-0001",
)
# Output, approximately (key order and timestamp precision may differ, and a "logger" field with the module name is also included):
# {"event": "telegram_received", "meter_id": "12345678", "rssi_dbm": -72, "status": "OK", "gateway_id": "GW-0001", "level": "info", "timestamp": "2026-05-24T06:00:00Z"}

Promtail configuration

# /etc/promtail/config.yml (on the Hetzner server)
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: mbus-gateway
    static_configs:
      - targets: ["localhost"]
        labels:
          job: mbus-gateway
          env: production
          __path__: /var/log/mbus/*.log

    pipeline_stages:
      - json:
          expressions:
            level: level
            gateway_id: gateway_id
            meter_id: meter_id
            event: event
      - labels:
          level:
          gateway_id:
          event:
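
To make the effect of the two pipeline stages concrete, the following Python sketch imitates what they do to a single log line: the `json` stage extracts fields, and the `labels` stage promotes only `level`, `gateway_id` and `event` to Loki labels, while `meter_id` is extracted but stays out of the label set. This is a simplified model and not Promtail's implementation. It ignores labels Promtail adds on its own, and the function name `extract_labels` is illustrative.

```python
import json
from typing import Dict

# Fields promoted to labels by the `labels` stage in the config above.
LABEL_KEYS = ("level", "gateway_id", "event")


def extract_labels(line: str, static_labels: Dict[str, str]) -> Dict[str, str]:
    """Return the label set a log line would roughly end up with."""
    labels = dict(static_labels)
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return labels  # assumed behaviour: non-JSON lines keep static labels only
    if not isinstance(record, dict):
        return labels
    for key in LABEL_KEYS:
        value = record.get(key)
        if value is not None:
            labels[key] = str(value)
    return labels
```

Leaving `meter_id` out of the label set keeps the number of Loki streams bounded by the number of gateways and event types rather than the number of meters. The field is still searchable at query time through `| json`.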

Loki query: AES errors per gateway

# LogQL: count DEC_ERR per gateway over 24h:
sum by (gateway_id) (
  count_over_time(
    {job="mbus-gateway", event="telegram_received"} | json | status="DEC_ERR"
    [24h]
  )
)

# Log search: show all error-level entries:
{job="mbus-gateway", level="error"} | json
  | line_format "{{.timestamp}} [{{.gateway_id}}] {{.event}}: {{.error}}"
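
For checking the first query against a local log file, the same count can be computed offline. The sketch below is a plain Python equivalent that groups `DEC_ERR` telegrams by `gateway_id`. It leaves out the 24-hour window, so the caller is expected to pass only lines from the period of interest. The function name `count_decrypt_errors` is illustrative.

```python
import json
from collections import Counter
from typing import Iterable


def count_decrypt_errors(lines: Iterable[str]) -> Counter:
    """Count telegram_received events with status DEC_ERR per gateway."""
    counts: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        if record.get("event") == "telegram_received" and record.get("status") == "DEC_ERR":
            counts[record.get("gateway_id", "unknown")] += 1
    return counts
```

Called as `count_decrypt_errors(open("/var/log/mbus/gateway.log"))`, where the file name is a placeholder, it returns a `Counter` keyed by gateway ID that can be compared with the Grafana panel.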

Alerting: Stale gateway

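# grafana/provisioning/alerting/rules.yml
# NOTE: simplified, Prometheus-style notation for readability. Grafana's
# alert provisioning files use a different schema (apiVersion, folder,
# interval, and per-rule data/condition blocks), and dashboard variables
# are not interpolated in alert queries, so ${tenant_id} is a placeholder.
groups:
  - name: gateway-health
    rules:
      - alert: GatewayStale
        # Fires when the count is greater than 0.
        expr: |
          SELECT COUNT(*) FROM gateway
          WHERE last_seen_at < NOW() - INTERVAL '36 hours'
            AND tenant_id = '${tenant_id}'
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Gateway has not sent data for 36+ hours"
          # gateway_id is only populated if the query returns it as a column;
          # COUNT(*) alone does not.
          description: "Gateway {{ $labels.gateway_id }} is stale"

      - alert: LowCoverage
        # Fires when coverage is below 0.8 (80%).
        expr: |
          SELECT
            COUNT(*) FILTER (WHERE last_reading_at > NOW() - INTERVAL '48 hours')::float /
            NULLIF(COUNT(*), 0) AS coverage
          FROM meter_installation
          WHERE tenant_id = '${tenant_id}' AND deleted_at IS NULL
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Reading coverage below 80%"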
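
The two conditions are simple enough to restate in Python, which can be handy for unit-testing the thresholds before wiring them into Grafana. This is a sketch under the same definitions as the queries: a gateway is stale after 36 hours without contact, and coverage is the share of installations with a reading in the last 48 hours. The function names are illustrative, and the 0.8 limit comes from the alert summary.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def is_stale(last_seen_at: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=36)) -> bool:
    """True if the gateway has not been seen within `max_age`."""
    return last_seen_at < now - max_age


def coverage(last_readings: Iterable[Optional[datetime]], now: datetime,
             window: timedelta = timedelta(hours=48)) -> Optional[float]:
    """Fraction of installations with a reading inside `window`.

    Returns None for an empty fleet, mirroring NULLIF(COUNT(*), 0) in SQL.
    Installations that have never reported (None) count as not covered.
    """
    readings = list(last_readings)
    if not readings:
        return None
    fresh = sum(1 for ts in readings if ts is not None and ts > now - window)
    return fresh / len(readings)


def low_coverage(value: Optional[float], limit: float = 0.8) -> bool:
    """True if coverage is known and below the limit."""
    return value is not None and value < limit
```

Passing `now` in explicitly, instead of calling `datetime.now()` inside the functions, keeps the checks deterministic in tests.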

Docker Compose: Grafana + Loki + Promtail

services:
  grafana:
    image: grafana/grafana:10-alpine
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD}"
      GF_AUTH_ANONYMOUS_ENABLED: "false"
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: smtp.brevo.com:587
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    networks: [internal]

  loki:
    image: grafana/loki:2.9-alpine
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - loki_data:/loki
    networks: [internal]

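  promtail:
    image: grafana/promtail:2.9-alpine
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - /var/log/mbus:/var/log/mbus:ro
      - /etc/promtail:/etc/promtail
    networks: [internal]

# Named volumes and the network referenced above must be declared:
volumes:
  grafana_data:
  loki_data:

networks:
  internal: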

Conclusion

Grafana with a TimescaleDB datasource gives a real-time overview of RSSI, battery level and reading coverage across the gateway fleet. Loki with Promtail collects structured JSON logs and enables analysis of AES errors per gateway. Alerting sends email when a gateway is stale (>36h) or coverage drops below 80%.

See the TimescaleDB continuous aggregates guide or the gateway troubleshooting guide.