Grafana Alloy¶
Grafana Alloy is a unified telemetry collector. One agent ships host metrics and container logs to Grafana Cloud — replacing what would otherwise be two separate processes (node_exporter, Promtail). Per-container metrics are deferred to a later phase of #15 (see Out of scope below).
Why¶
The lab needs trend-based visibility (per-container CPU/memory over weeks, host disk pressure, ZFS pool health) and centralised logs that survive container restarts. Self-hosting the full Grafana + Prometheus + Loki stack would cost ~500 MB RAM and require ongoing maintenance for zero functional benefit over the Grafana Cloud Free tier.
Alloy collapses what previously required several agents into a single process:
- Host metrics via prometheus.exporter.unix (the embedded node_exporter library; /proc, /sys, and the rootfs are bind-mounted from the host). Series are labelled job=integrations/node_exporter and instance=$HOSTNAME_OVERRIDE to match Grafana Cloud's "Linux Server" integration, whose dashboards and alerts query node_* series with that exact job and use instance as the per-host variable. The static host label is retained alongside for in-house dashboards.
- Container logs via loki.source.docker, with a discovery.docker step that auto-discovers running containers and promotes compose project/service labels. Reaches the Docker socket through a read-only LinuxServer socket-proxy. A loki.process stage drops entries older than 167h before they reach Loki to stay inside Grafana Cloud Free's 7-day ingest window; this is what keeps a post-restart backfill of long-lived containers (immich-db-backup, matter-server) from getting whole batches rejected with HTTP 400.
- Traefik metrics via prometheus.scrape "traefik", targeting the local Traefik instance at traefik:8082/metrics over the shared alloy-frontend Docker network (60s interval, host and job=traefik labels added via relabeling). Per-router, per-service, and per-entrypoint label cardinality is enabled on the Traefik side. The :8082 entrypoint is internal-only and gated by an ipAllowList restricted to the pinned alloy-frontend subnet (172.30.100.8/29); see Architecture § Alloy Metrics Scrape Entrypoint.
- Immich Postgres via prometheus.scrape "postgres_immich", targeting immich-db-exporter:9187 over the immich-backend Docker network (60s interval, host and job=postgres_immich labels added via relabeling). The exporter sidecar lives in services/immich/compose.yaml and reuses Immich's IMMICH_DB_PASSWORD, so no DB credentials are added to Alloy's secret.sops.env. Scoped to svlnas (dropped on svlazext via compose.svlazext.yaml).
- Outline Postgres via prometheus.scrape "postgres_outline", targeting outline-db-exporter:9187 over the outline-backend Docker network (60s interval, host and job=postgres_outline labels added via relabeling). The exporter sidecar lives in services/outline/compose.yaml and reuses Outline's OUTLINE_DB_PASSWORD. Scoped to svlnas (dropped on svlazext via compose.svlazext.yaml).
- GitHub repo stats via prometheus.exporter.github, polling the GitHub REST API for DevSecNinja/truenas-apps and DevSecNinja/dotfiles (10m interval, host and job=integrations/github_exporter labels added via relabeling). Surfaces rate-limit headroom, stars/forks/watchers, open PR/issue counts, and repo size. Authenticates with a fine-grained GitHub PAT (Metadata + Issues + Pull requests, read-only on the listed repos) stored as GITHUB_API_TOKEN in secret.sops.env. Single-host scrape: gated to svlnas via a discovery.relabel keep rule on HOSTNAME_OVERRIDE so only one Alloy instance polls the API; on svlazext the target list filters to empty and prometheus.scrape is a no-op.
- Host systemd journal via loki.source.journal, reading /var/log/journal directly so host-level signals not visible from container logs (sshd, smartd, kernel/OOM, ZFS events) become searchable in Loki, and so dccd.sh deploy logs (already emitted to journald via logger -t dccd) land in the same place as everything else. The pipeline is loki.source.journal.host → loki.relabel.journal → loki.process.journal → loki.write.grafana_cloud.

  Promoted labels: unit (transport-aware: __journal_syslog_identifier for entries with transport=syslog, otherwise __journal__systemd_unit; see below), syslog_identifier (from __journal_syslog_identifier), transport (from __journal__transport), and level (from __journal_priority_keyword), alongside the static host, instance=$HOSTNAME_OVERRIDE, and job=integrations/node_exporter.

  The transport-aware unit rule exists because TrueNAS' cron spawns the user shell inside a transient session-<N>.scope unit, so __journal__systemd_unit is not empty for dccd.sh lines; it's set to a useless scope name. Preferring the syslog identifier when transport=syslog makes dccd (and any other logger -t … producer) appear under its identifier in the Grafana "Linux Server / Logs" dashboard's unit variable; real services like sshd.service or smartd.service continue to use their proper unit names because they don't go through the syslog transport.

  The job/instance pair matches the host metrics stream so Grafana Cloud's "Linux Server" integration "Logs" dashboard works against this Alloy without modification (a single dashboard variable drives both panels). __journal__boot_id is intentionally not promoted: it's bounded, but a fresh stream per reboot accumulates over months. Higher-cardinality fields like _PID or _HOSTNAME stay on the line.

  Two loki.process drop stages run on the stream: a match stage suppresses high-volume, zero-signal pam_unix .* session (opened|closed) messages from CRON and systemd-logind, and a 167h stale-entry drop stage matches the docker pipeline's Cloud Free 168h ingest-window protection. The source itself is capped with max_age = "12h0m0s" so a fresh tail (first start, or after losing its cursor) cannot read far enough back to produce rejected batches.

  Read access is granted by mounting /var/log/journal, /run/log/journal, and /etc/machine-id read-only and adding the host's systemd-journal GID via group_add (hardcoded per host: 102 on svlnas, 999 on svlazext via the compose override).
- Self-observability via prometheus.exporter.self.
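The journal pipeline described in the list above could look roughly like the following. This is a minimal sketch, not the repository's actual config.alloy: the component labels follow the pipeline names given above, but the exact rule layout, the match selector, the drop_counter_reason value, and the use of sys.env (which assumes a recent Alloy release) are assumptions.

```alloy
// Relabel rules consumed by the journal source. forward_to is empty because
// only the exported `rules` value is used here (an assumed wiring).
loki.relabel "journal" {
  forward_to = []

  // Default: promote the systemd unit as `unit`.
  rule {
    source_labels = ["__journal__systemd_unit"]
    target_label  = "unit"
  }

  // Override: for syslog-transport entries with a non-empty identifier,
  // prefer the syslog identifier (e.g. `dccd` from `logger -t dccd`).
  rule {
    source_labels = ["__journal__transport", "__journal_syslog_identifier"]
    separator     = ";"
    regex         = "syslog;(.+)"
    target_label  = "unit"
    replacement   = "$1"
  }

  rule {
    source_labels = ["__journal_syslog_identifier"]
    target_label  = "syslog_identifier"
  }
  rule {
    source_labels = ["__journal__transport"]
    target_label  = "transport"
  }
  rule {
    source_labels = ["__journal_priority_keyword"]
    target_label  = "level"
  }
}

loki.source.journal "host" {
  path          = "/var/log/journal"
  max_age       = "12h0m0s" // cap how far back a fresh tail may read
  relabel_rules = loki.relabel.journal.rules
  forward_to    = [loki.process.journal.receiver]
  labels        = {
    job      = "integrations/node_exporter",
    instance = sys.env("HOSTNAME_OVERRIDE"),
    host     = sys.env("HOSTNAME_OVERRIDE")
  }
}

loki.process "journal" {
  forward_to = [loki.write.grafana_cloud.receiver]

  // Drop PAM session open/close noise from cron and logind
  // (the selector is an illustrative guess at the real one).
  stage.match {
    selector = "{syslog_identifier=~\"CRON|systemd-logind\"} |~ \"pam_unix .* session (opened|closed)\""
    action   = "drop"
  }

  // Drop entries that would fall outside the 168h ingest window.
  stage.drop {
    older_than          = "167h"
    drop_counter_reason = "too_old"
  }
}
```

The second unit rule runs after the first, so it only overwrites unit when the transport is syslog and the identifier is non-empty; everything else keeps its systemd unit name.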
Out of scope¶
Per-container metrics (CPU, memory, network per container) are deliberately deferred. Alloy's only built-in option is prometheus.exporter.cadvisor, which wraps the embedded cAdvisor library and requires privileged: true plus mounts on /sys, /var/lib/docker, /dev/disk, and /var/run — conflicting with the hardened-container posture used elsewhere in this repo. Container logs already surface the signals that matter for alerting (restarts, OOM, crashes), so this is a deliberate Phase 2 decision rather than an oversight.
Everything is shipped to Grafana Cloud Free (Frankfurt region) — Prometheus for metrics, Loki for logs, Grafana for dashboards and alerting.
See issue #15 for the full monitoring/IRM rollout plan.
Compose File¶
- compose.yaml
- compose.svlazext.yaml: per-host override (sets HOSTNAME_OVERRIDE=svlazext)
Access¶
| URL | Description |
|---|---|
| https://alloy.${DOMAINNAME} | Alloy debug UI (component graph, scrape state), SSO-protected |
| https://alloy-ext.${DOMAINNAME} | Same UI on svlazext |
Architecture¶
- Image: dhi/alloy (Docker Hardened Image, Debian 13 base), with a minimal rootfs, no shell beyond /bin/sh (dash), and continuous rebuilds against patched bases. Requires the host to be logged in to a DHI-entitled Docker Hub account.
- User/Group: 3125:3125 (svc-app-alloy)
- Networks: alloy-frontend (Traefik-facing, also used for outbound Grafana Cloud traffic), alloy-backend (internal; Docker socket proxy)
- Reverse proxy: Traefik with chain-auth@file (SSO required)
Services¶
| Container | Role |
|---|---|
| alloy-init | One-shot init: chowns ./data to 3125:3125 so the Alloy WAL + queue path is writable |
| alloy | Telemetry collector: reads host /proc, /sys, and rootfs, polls Docker stats, tails container logs |
| alloy-docker-proxy | LinuxServer socket-proxy: read-only Docker API access (CONTAINERS=1, EVENTS=1, INFO=1, POST=0) |
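To illustrate how the alloy container might reach Docker through alloy-docker-proxy, here is a hedged sketch of the container-log pipeline. The proxy address tcp://alloy-docker-proxy:2375, the component labels, and the target label names compose_project and compose_service are assumptions rather than values taken from this repository; job=docker mirrors the verification query used in First-Run Setup.

```alloy
// Assumption: the socket-proxy listens on port 2375 on the alloy-backend network.
discovery.docker "containers" {
  host = "tcp://alloy-docker-proxy:2375"
}

// Promote the compose project/service container labels to log labels.
discovery.relabel "docker_logs" {
  targets = discovery.docker.containers.targets

  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_project"]
    target_label  = "compose_project"
  }
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_service"]
    target_label  = "compose_service"
  }
}

loki.source.docker "containers" {
  host       = "tcp://alloy-docker-proxy:2375"
  targets    = discovery.relabel.docker_logs.output
  labels     = { job = "docker" }
  forward_to = [loki.process.docker.receiver]
}

// Drop entries older than 167h so a post-restart backfill stays inside
// the 168h ingest window.
loki.process "docker" {
  forward_to = [loki.write.grafana_cloud.receiver]

  stage.drop {
    older_than          = "167h"
    drop_counter_reason = "too_old"
  }
}
```

Because both discovery.docker and loki.source.docker only list containers and stream logs, the proxy's read-only settings (POST=0) should be sufficient for this pipeline.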
Volumes¶
- ./config:/etc/alloy:ro holds config.alloy (git-tracked, read-only).
- ./data:/var/lib/alloy is the Alloy state directory; Alloy creates data/ (WAL + queue) and remotecfg/ subdirectories at startup (gitignored, chowned by init container).
- /:/host/rootfs:ro,rslave, /proc:/host/proc:ro, and /sys:/host/sys:ro give prometheus.exporter.unix visibility into the host filesystem.
- /var/log/journal:/var/log/journal:ro, /run/log/journal:/run/log/journal:ro, and /etc/machine-id:/etc/machine-id:ro give loki.source.journal access to the host systemd journal (read-only; readability granted via group_add with the host's systemd-journal GID).
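The host mounts above only help if the exporter is pointed at them. A minimal sketch of how prometheus.exporter.unix could consume the /host/* paths and apply the job/instance/host labels follows; the component labels and the prometheus.remote_write.grafana_cloud name are assumptions, and sys.env assumes a recent Alloy release.

```alloy
// Point the embedded node_exporter at the bind-mounted host paths.
prometheus.exporter.unix "host" {
  procfs_path = "/host/proc"
  sysfs_path  = "/host/sys"
  rootfs_path = "/host/rootfs"
}

// Label the series the way Grafana Cloud's "Linux Server" integration expects.
discovery.relabel "host_metrics" {
  targets = prometheus.exporter.unix.host.targets

  rule {
    target_label = "job"
    replacement  = "integrations/node_exporter"
  }
  rule {
    target_label = "instance"
    replacement  = sys.env("HOSTNAME_OVERRIDE")
  }
  // Static label kept for in-house dashboards.
  rule {
    target_label = "host"
    replacement  = sys.env("HOSTNAME_OVERRIDE")
  }
}

prometheus.scrape "host_metrics" {
  targets    = discovery.relabel.host_metrics.output
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}
```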
Resource footprint¶
Target on a host with ~30 containers: <200 MB RAM, <2% sustained CPU. Adjust MEM_LIMIT in secret.sops.env if a host scrapes additional targets.
Secrets¶
Managed via secret.sops.env (decrypted to .env at deploy time):
| Variable | Source |
|---|---|
| GRAFANA_PROM_URL | Grafana Cloud → stack details → Prometheus push URL |
| GRAFANA_PROM_USERNAME | Numeric instance ID shown next to the push URL |
| GRAFANA_PROM_PASSWORD | Access Policy token with metrics:write scope |
| GRAFANA_LOKI_URL | Grafana Cloud → stack details → Loki push URL |
| GRAFANA_LOKI_USERNAME | Numeric instance ID shown next to the Loki URL |
| GRAFANA_LOKI_PASSWORD | Same Access Policy token (or a separate one with logs:write scope) |
| GITHUB_API_TOKEN | Fine-grained GitHub PAT, read-only Metadata/Issues/Pull requests (svlnas only) |
Optional resource overrides: MEM_LIMIT, SOCKET_PROXY_MEM_LIMIT.
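As a rough illustration of how the Grafana Cloud variables in the table might be consumed, the two write components could read them from the environment as shown below. The prometheus.remote_write label grafana_cloud is an assumption chosen to mirror loki.write.grafana_cloud, and sys.env assumes a recent Alloy release.

```alloy
// Metrics: push to the Grafana Cloud Prometheus endpoint with basic auth.
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = sys.env("GRAFANA_PROM_URL")

    basic_auth {
      username = sys.env("GRAFANA_PROM_USERNAME")
      password = sys.env("GRAFANA_PROM_PASSWORD")
    }
  }
}

// Logs: push to the Grafana Cloud Loki endpoint with basic auth.
loki.write "grafana_cloud" {
  endpoint {
    url = sys.env("GRAFANA_LOKI_URL")

    basic_auth {
      username = sys.env("GRAFANA_LOKI_USERNAME")
      password = sys.env("GRAFANA_LOKI_PASSWORD")
    }
  }
}
```

Reading the values at load time keeps config.alloy free of credentials, which is what allows it to stay git-tracked.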
The numeric GID of the host's systemd-journal group (needed by the Alloy container to read mode-0640 journal files) is not a secret. It's hardcoded per host via group_add in compose.yaml (svlnas: 102) and compose.svlazext.yaml (svlazext: 999). Verify with getent group systemd-journal if rebuilding a host.
Per-host configuration¶
The static host label injected into every metric and log line is set via the HOSTNAME_OVERRIDE environment variable in compose.yaml (default: svlnas). For non-default hosts, override it in a compose.<server>.yaml file; see compose.svlazext.yaml. It is not a secret and lives in the compose file rather than secret.sops.env.
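HOSTNAME_OVERRIDE can also drive per-host gating, as with the GitHub exporter that should run on svlnas only. One way to express such a keep rule is sketched below; the temporary label __tmp_host, the component labels, and the prometheus.remote_write.grafana_cloud name are hypothetical, and the real config.alloy may implement the gate differently.

```alloy
prometheus.exporter.github "repos" {
  repositories = ["DevSecNinja/truenas-apps", "DevSecNinja/dotfiles"]
  api_token    = sys.env("GITHUB_API_TOKEN")
}

discovery.relabel "github" {
  targets = prometheus.exporter.github.repos.targets

  // Stamp the configured host name onto a temporary label ...
  rule {
    target_label = "__tmp_host"
    replacement  = sys.env("HOSTNAME_OVERRIDE")
  }
  // ... and keep the target only when that host is svlnas. On any other
  // host the target list becomes empty and the scrape below does nothing.
  rule {
    source_labels = ["__tmp_host"]
    regex         = "svlnas"
    action        = "keep"
  }

  rule {
    target_label = "host"
    replacement  = sys.env("HOSTNAME_OVERRIDE")
  }
  rule {
    target_label = "job"
    replacement  = "integrations/github_exporter"
  }
}

prometheus.scrape "github" {
  targets         = discovery.relabel.github.output
  scrape_interval = "10m"
  forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
}
```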
First-Run Setup¶
1. Create the dataset: vm-pool/apps/services/alloy on TrueNAS.
2. Create the TrueNAS service account: svc-app-alloy with UID/GID 3125:3125.
3. Create a Grafana Cloud Access Policy token:
   - Grafana Cloud → "Access Policies" → "Create access policy".
   - Scopes: metrics:write, logs:write. Realms: limit to your stack.
   - Generate a token; copy the glc_… value.
4. Populate secret.sops.env with the URLs from the stack details page and the token from step 3.
5. Encrypt the secrets.
6. Validate the compose file.
7. Deploy:

       bash scripts/dccd.sh -d /mnt/vm-pool/apps -t -f -A alloy # svlnas
       bash scripts/dccd.sh -d /opt/apps -S svlazext -A alloy # svlazext

8. Verify in Grafana Cloud Explore:
   - Metrics: up{job="alloy",host="svlnas"} == 1
   - Logs: {host="svlnas", job="docker"}
Privacy Notes¶
Container logs leave the network to Grafana Cloud. The default loki.source.docker config ships everything from every container; per-service redaction and exclusion rules are planned for Phase 2 of #15. Until then, treat this as "everything that goes to stdout/stderr from any container is in Grafana Cloud (EU region) for 14 days."
If a service must never have its logs leave the network, exclude it in the meantime by adding a discovery.relabel drop rule against __meta_docker_container_name in config.alloy.
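Such an exclusion might look like the sketch below. The container name hypothetical-private-service and the component labels are placeholders, not names from this repository; the rule would sit in whichever discovery.relabel block feeds loki.source.docker.

```alloy
discovery.relabel "docker_logs" {
  targets = discovery.docker.containers.targets

  // Docker reports container names with a leading slash (e.g. "/my-service"),
  // so the regex tolerates it. Matching targets are removed before
  // loki.source.docker ever tails them.
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/?hypothetical-private-service"
    action        = "drop"
  }
}
```

Dropping at discovery time means the container's logs are never read by Alloy at all, rather than being read and then filtered in a later loki.process stage.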
Reference¶
- Alloy components: https://grafana.com/docs/alloy/latest/reference/components/
- Grafana Cloud free-tier limits: https://grafana.com/pricing/
- Issue #15 — full monitoring rollout plan: https://github.com/DevSecNinja/truenas-apps/issues/15