Observability Metrics
PopChoice can run a small self-hosted Prometheus stack for application and VPS health. It is intentionally separate from the app stack: if Prometheus, exporters, or Grafana are down, recommendations keep running.
Files
- docker-compose.observability.yml starts Prometheus, cAdvisor, node exporter, Postgres exporter, Redis exporter, Grafana, Loki, Alloy, and Uptime Kuma.
- observability/images/*/Dockerfile builds small config images so Coolify can deploy the stack without runtime bind mounts for repo-owned YAML/JSON files.
- observability/prometheus/prometheus.yml configures scrape targets.
- observability/prometheus/rules/popchoice.yml adds starter recording rules.
- observability/grafana/provisioning/datasources/prometheus.yaml provisions the Prometheus data source.
- observability/grafana/provisioning/dashboards/popchoice.yaml loads dashboards from observability/grafana/dashboards.
- observability/grafana/provisioning/alerting/popchoice-alerts.yaml provisions conservative Grafana alert rules.
- observability/grafana/dashboards/popchoice-overview.json is the initial PopChoice overview dashboard.
App Metrics
The web app exposes metrics at:
/api/metrics
The worker process exposes metrics at:
:9464/metrics
Production metrics are disabled by default. Enable them on the app stack with:
METRICS_ENABLED=true
METRICS_BEARER_TOKEN=<long-random-token>
WORKER_METRICS_PORT=9464
Prometheus must use the same METRICS_BEARER_TOKEN. In development, metrics are
enabled without a token unless METRICS_ENABLED=false is set.
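One way to confirm these settings is to request the endpoint by hand. The sketch below assumes the app accepts the token as a standard Authorization: Bearer header, which is how Prometheus sends bearer tokens. The helper names bearer_header and check_metrics are hypothetical and not part of the repo, and the URL in the example is a placeholder.

```shell
# Hypothetical helpers for checking a metrics endpoint by hand.

# Print the header Prometheus would send for a given token.
bearer_header() {
  printf 'Authorization: Bearer %s' "$1"
}

# Fetch a metrics URL with the token and print the first five lines.
# Usage: check_metrics <url> <token>
check_metrics() {
  curl -fsS -H "$(bearer_header "$2")" "$1" | sed -n '1,5p'
}

# Example (the URL is an assumption; adjust to your deployment):
#   check_metrics "http://localhost:3000/api/metrics" "$METRICS_BEARER_TOKEN"
```

A long random token can be generated with, for example, openssl rand -hex 32. With curl -f, a rejected token shows up as a non-zero exit status rather than an error page.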
Scrape Targets
Prometheus scrapes:
- popchoice-web via ${POPCHOICE_WEB_METRICS_TARGET:-web:3000}/api/metrics
- popchoice-workers via ${POPCHOICE_WORKERS_METRICS_TARGET:-workers:9464}/metrics
- postgres via observability-postgres-exporter:9187
- redis via observability-redis-exporter:9121
- cadvisor via observability-cadvisor:8080
- node via observability-node-exporter:9100
- prometheus itself
The observability compose file joins an external app network named
${POPCHOICE_APP_NETWORK:-popchoice_default}. If Coolify gives the app resource
a different Docker network name, set POPCHOICE_APP_NETWORK on the
observability resource.
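The ${VAR:-default} form behaves like shell default expansion: the variable is used when it is set and non-empty, otherwise the default applies. A minimal sketch for checking which network name would be used and whether it exists on the Docker host follows. The function names are hypothetical.

```shell
# Resolve the network name the same way the default expansion does:
# use POPCHOICE_APP_NETWORK if set and non-empty, else popchoice_default.
resolve_app_network() {
  printf '%s\n' "${POPCHOICE_APP_NETWORK:-popchoice_default}"
}

# Succeeds if the resolved network exists on this Docker host.
app_network_exists() {
  docker network inspect "$(resolve_app_network)" >/dev/null 2>&1
}

# Example:
#   app_network_exists || echo "network $(resolve_app_network) not found"
```

If the check fails, docker network ls lists the names Docker actually knows about, which helps find the one Coolify assigned.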
For the Postgres exporter, configure:
POSTGRES_EXPORTER_DATA_SOURCE_URI=db:5432/popchoice?sslmode=disable
POSTGRES_EXPORTER_DATA_SOURCE_USER=popchoice
POSTGRES_EXPORTER_DATA_SOURCE_PASS=<postgres-password>
In Coolify, the observability stack is a separate resource, so it does not
inherit POSTGRES_PASSWORD from the PopChoice app resource. Copy the app
database password into POSTGRES_EXPORTER_DATA_SOURCE_PASS; otherwise the
exporter starts but Prometheus reports up{job="postgres"} == 0 and the
exporter logs password authentication failed for user "popchoice".
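To see what Prometheus currently reports for a scrape job, the instant-query endpoint of the Prometheus HTTP API can be called directly. This is a sketch with hypothetical helper names. It assumes Prometheus is reachable at the base URL you pass in.

```shell
# Build the PromQL selector for a job's scrape health.
up_query() {
  printf 'up{job="%s"}' "$1"
}

# Ask Prometheus for the current value of up{job="<job>"}.
# Usage: query_up <prometheus-base-url> <job>
query_up() {
  curl -fsS -G "$1/api/v1/query" --data-urlencode "query=$(up_query "$2")"
}

# Example (base URL is an assumption):
#   query_up "http://127.0.0.1:9090" postgres
```

In the JSON response, a sample value of "0" means the last scrape of that target failed and "1" means it succeeded.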
For the Redis exporter, set the address only when the Redis host is not redis:6379:
REDIS_EXPORTER_REDIS_ADDR=redis://redis:6379
For Grafana Telegram alert notifications, configure both values on the observability resource:
GRAFANA_TELEGRAM_BOT_TOKEN=<telegram-bot-token>
GRAFANA_TELEGRAM_CHAT_ID=<telegram-chat-id>
If either value is missing, Grafana starts without provisioning the Telegram contact point so the observability stack can still run safely.
Telegram alert notifications use a PopChoice-specific plain-text template. Set
GF_SERVER_ROOT_URL to the public Grafana URL if silence, dashboard, and
Grafana links should be useful outside the Docker network.
The template is intentionally compact for mobile triage: firing or resolved state, severity, alert name, summary, action, affected targets, and links. Raw Grafana expression values are omitted from Telegram notifications.
Local or VPS Start
From the repo root:
export GRAFANA_ADMIN_PASSWORD='replace-with-a-long-password'
export METRICS_BEARER_TOKEN='replace-with-a-long-token'
docker compose -f docker-compose.observability.yml up -d
Prometheus listens on 127.0.0.1:9090; Grafana listens on 127.0.0.1:3001.
On a VPS, keep both behind Coolify authentication or an SSH tunnel.
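For the SSH tunnel option, a minimal sketch is shown below. It forwards both loopback ports from the VPS to the same ports on the local machine. The function name and the user@host value in the example are placeholders.

```shell
# Forward Prometheus (9090) and Grafana (3001) from the VPS loopback
# interface to the same ports locally. -N opens the tunnel without
# running a remote command; stop it with Ctrl-C.
# Usage: open_observability_tunnel <user@host>
open_observability_tunnel() {
  ssh -N \
    -L 9090:127.0.0.1:9090 \
    -L 3001:127.0.0.1:3001 \
    "$1"
}

# Example (host is a placeholder):
#   open_observability_tunnel deploy@vps.example.com
```

While the tunnel is open, Prometheus is available at http://127.0.0.1:9090 and Grafana at http://127.0.0.1:3001 on the local machine.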
Metrics Included
Application metrics keep labels low-cardinality. They do not include user ids, request ids, recommendation ids, job ids, or movie titles.
- popchoice_recommendations_total{mode,status} counts recommendation completions and failures.
- popchoice_recommendation_duration_seconds{mode,status} tracks recommendation processing latency.
- popchoice_provider_errors_total{provider,operation,reason} counts OpenAI and TMDB degradation, including timeouts, HTTP errors, validation errors, and rate limits.
- popchoice_queue_depth{queue,status} reports BullMQ waiting, active, delayed, completed, failed, and paused counts at scrape time.
- popchoice_queue_jobs_total{queue,job,event,final} counts worker job completions and retry/final failures.
- popchoice_dependency_health{dependency} reports the most recent web health check view of Postgres and Redis.
- popchoice_dependency_health_failures_total{dependency} counts health check failures.
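The low-cardinality rule can be spot-checked against scraped output. The sketch below reads Prometheus exposition text on stdin and prints any popchoice_ sample whose label names look like per-user or per-request identifiers. The label names in the pattern are illustrative guesses, not names taken from the codebase, and the function name is hypothetical.

```shell
# Print popchoice_* sample lines that carry an identifier-style label.
# Reads exposition text on stdin. Exit status follows grep:
# 0 if at least one offending line was found, 1 if none.
find_high_cardinality_labels() {
  grep -E '^popchoice_[A-Za-z0-9_]*\{([^}]*,)?(user_id|request_id|recommendation_id|job_id|title)="'
}

# Example (URL and token handling are assumptions):
#   curl -fsS -H "Authorization: Bearer $METRICS_BEARER_TOKEN" \
#     "http://localhost:3000/api/metrics" | find_high_cardinality_labels
```

An empty result is the expected outcome. The pattern only matches label names, so an identifier embedded in a label value would need a separate check.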
Prometheus also collects default Node.js process metrics from the web and
worker processes, container CPU/memory/filesystem metrics from cAdvisor, host
CPU/memory/disk metrics from node exporter, and pg_up / redis_up from the
database exporters.
Dashboard
Grafana provisions the PopChoice Overview dashboard. It starts with:
- web and worker scrape target health
- Postgres and Redis exporter health
- recommendation throughput and p50/p95 duration
- BullMQ queue depth and final failures
- OpenAI/TMDB degradation events
- container CPU and memory
- host disk usage
Panels may be empty until the relevant metric has been emitted. For example,
recommendation latency appears only after a recommendation completes, and
dependency health appears after /api/health has been called.
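For reference, a duration percentile like the p50/p95 panel can be expressed in PromQL with histogram_quantile. This sketch assumes popchoice_recommendation_duration_seconds is exported as a Prometheus histogram, so that a _bucket series with an le label exists. The 5m default window and the function name are also assumptions, and the dashboard's actual query may differ.

```shell
# Print a PromQL expression for a recommendation latency quantile.
# Usage: duration_quantile_query <quantile> [window]
duration_quantile_query() {
  printf 'histogram_quantile(%s, sum by (le) (rate(popchoice_recommendation_duration_seconds_bucket[%s])))' \
    "$1" "${2:-5m}"
}

# Example: run the p95 query against Prometheus (base URL is an assumption):
#   curl -fsS -G "http://127.0.0.1:9090/api/v1/query" \
#     --data-urlencode "query=$(duration_quantile_query 0.95)"
```

An empty result vector here corresponds to the empty-panel case: no recommendation has completed inside the window yet.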
Alerts
Grafana provisions a first conservative alert set for #503. The rules are grouped by severity and include owner/action annotations:
- P1: Postgres and Redis outages. Public app outage paging should come from Uptime Kuma popchoice-prod-health or failed post-deploy verification, not app metrics scrape visibility alone.
- P2: app metrics scrape target downtime, sustained queue backlog, provider timeout/rate-limit spikes, and disk pressure.
- P3: monitoring scrape target failures and elevated recommendation failure ratio.
Deploy-sensitive scrape alerts carry noise_profile=deploy-sensitive. The
GitHub deploy workflow can create a short Grafana silence for that label before
triggering Coolify and then verify public /api/health plus /api/build after
the webhook.
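As an illustration of what such a silence could look like, the sketch below builds a payload matching noise_profile=deploy-sensitive and posts it to Grafana's Alertmanager-compatible silences endpoint. It is not the workflow's actual implementation: the function names, the createdBy and comment values, and the use of a service account token as a bearer credential are all assumptions.

```shell
# Print a silence payload for alerts labelled noise_profile=deploy-sensitive.
# Usage: silence_payload <startsAt> <endsAt>   (RFC 3339 timestamps, UTC)
silence_payload() {
  printf '{"matchers":[{"name":"noise_profile","value":"deploy-sensitive","isRegex":false,"isEqual":true}],"startsAt":"%s","endsAt":"%s","createdBy":"deploy-workflow","comment":"deploy in progress"}\n' \
    "$1" "$2"
}

# Post the silence to Grafana.
# Usage: create_silence <grafana-base-url> <token> <startsAt> <endsAt>
create_silence() {
  curl -fsS -X POST "$1/api/alertmanager/grafana/api/v2/silences" \
    -H "Authorization: Bearer $2" \
    -H 'Content-Type: application/json' \
    --data "$(silence_payload "$3" "$4")"
}

# Example with GNU date (ten-minute silence; URL and token are placeholders):
#   create_silence "https://grafana.example.com" "$GRAFANA_TOKEN" \
#     "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
#     "$(date -u -d '+10 minutes' +%Y-%m-%dT%H:%M:%SZ)"
```

Keeping the silence short limits the window in which a real scrape outage would go unnoticed.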
See Observability Alerts for thresholds, retention, and backup expectations. See Observability Runbooks for incident response.
For Tempo traces and request/job correlation, see Observability Traces.