Observability Runbooks

These runbooks pair with the provisioned Grafana alerts in observability/grafana/provisioning/alerting/popchoice-alerts.yaml. They are written for a small self-hosted VPS/Coolify deployment where one operator may own app, database, and infrastructure response.

General Triage

Check whether this is user-facing:
- open the app
- open /api/health
- check Uptime Kuma and Coolify health
Check Grafana dashboards:
- PopChoice Overview
- logs in Loki if available
- Prometheus targets page
Check recent changes:
- latest deploy
- migrations
- environment variable edits
- catalog/backfill jobs
Prefer reversible actions:
- pause noisy workers before killing data stores
- restart one service at a time
- keep logs before pruning or recreating containers

App Down

Owner: App operator.

Symptoms:

/api/health fails or times out.
Coolify shows the web container unhealthy or restarting.
Uptime Kuma popchoice-prod-health is down after retries.
Post-deploy verification fails to recover before its retry budget expires.

Immediate checks:

docker ps --filter name=popchoice
docker logs --tail=200 <web-container>
curl -i https://your-domain.example/api/health

Actions:

Check Coolify deployment status and whether the latest image pulled successfully.
Check web logs for startup validation errors, missing env vars, migration failures, or unhandled runtime exceptions.
Check /api/build if the app responds but looks like the wrong version.
Check DB and Redis health because /api/health depends on both.
If the latest deploy caused the outage, roll back to the last known good image tag.

Follow-up alerting work is tracked in the deploy workflow so normal redeploy churn can be separated from confirmed outages.

Recovery check:

/api/health returns 200.
Prometheus target popchoice-web is up.
A low-cost recommendation smoke test reaches a persisted result. It should not spend on live providers unless that is explicitly intended.
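
The /api/health recovery check can be polled rather than retried by hand. The following is a minimal sketch, not the project's own verification script: the function name wait_for_health, the default of 10 attempts, and the 5-second delay are illustrative assumptions.

```shell
# Poll a health URL until it returns HTTP 200 or the attempts run out.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# The name and the defaults are hypothetical; adjust them to your retry budget.
wait_for_health() {
  local url="$1"
  local attempts="${2:-10}"
  local delay="${3:-5}"
  local i=1
  local code
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -o /dev/null: discard the body,
    # -w: print only the status code, --max-time: cap each try at 10 seconds.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP ${code:-none}" >&2
    i=$((i + 1))
    # Sleep only if another attempt is coming.
    if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_health https://your-domain.example/api/health 12 10` allows roughly two minutes before giving up. A non-zero exit status means the endpoint never returned 200 within the budget.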

App Metrics Scrape Target Down

Owner: App operator.

Symptoms:

Grafana alert: P2 App metrics scrape target down.
Prometheus target popchoice-web is down.
Grafana panels for app metrics are stale or empty.
Public /api/health may still be healthy during normal Coolify redeploy churn.

Immediate checks:

curl -i https://your-domain.example/api/health
curl -i https://your-domain.example/api/build
docker ps --filter name=web

Actions:

Check whether a Coolify deploy is currently running or just completed.
If /api/health is healthy and /api/build reports the expected commit, treat this as a metrics visibility issue and wait for the scrape target to recover.
If /api/health fails, follow the App Down runbook.
If the target stays down after the app recovers, check METRICS_ENABLED, METRICS_BEARER_TOKEN, POPCHOICE_WEB_METRICS_TARGET, and Prometheus logs.
During planned deploys, use the deploy-aware silence script or the GitHub deploy workflow so this alert does not create avoidable noise.

Recovery check:

/api/health returns 200.
/api/build reports the expected commit or image metadata.
Prometheus target popchoice-web returns to up == 1.
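
The decision in the actions above (app outage versus metrics-only gap) can be written down as a small helper. This is a sketch under stated assumptions: the function name triage_scrape_down is hypothetical, the caller supplies the reported commit because the /api/build response format is not described here, and the "wrong-version" branch is an added assumption rather than something this runbook prescribes.

```shell
# Decide which path to follow when the popchoice-web scrape target is down.
# Arguments: HEALTH_HTTP_CODE  REPORTED_COMMIT  EXPECTED_COMMIT
# Hypothetical helper; it only encodes the branching, it does not call the app.
triage_scrape_down() {
  local health_code="$1"
  local reported="$2"
  local expected="$3"
  if [ "$health_code" != "200" ]; then
    echo "app-down: follow the App Down runbook"
  elif [ "$reported" != "$expected" ]; then
    # Assumption: a version mismatch suggests a deploy is still in flight or failed.
    echo "wrong-version: check the Coolify deploy status"
  else
    echo "metrics-only: wait for the scrape target to recover"
  fi
}
```

The health code can come from `curl -s -o /dev/null -w '%{http_code}' https://your-domain.example/api/health`. The reported commit has to be read from /api/build in whatever form that endpoint returns it.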

DB Down

Owner: Database operator.

Symptoms:

Grafana alert: P1 Postgres exporter reports database down.
/api/health reports database failure.
App logs show connection, migration, or query errors.

Immediate checks:

docker ps --filter name=db
docker logs --tail=200 <db-container>
docker exec -it <db-container> pg_isready -U popchoice
df -h

Actions:

Check disk pressure first; Postgres can fail or become read-only when the host is full.
Check whether credentials changed in Coolify without restarting all dependent services.
Check recent migrations and startup logs.
If the DB container is restarting, inspect the earliest startup error, not only the latest health check line.
Restore from the latest verified database backup only after confirming the volume is corrupted or missing.

Recovery check:

pg_isready succeeds.
/api/health returns 200.
The Postgres exporter reports pg_up == 1.
App reads existing recommendations and can create a new deterministic/local recommendation job.
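
To inspect the earliest startup error rather than the latest health check line, the full container log can be filtered from the top. A minimal sketch follows, assuming the default Postgres severity prefixes (ERROR, FATAL, PANIC) appear in the log and that grep supports -m (GNU, BSD, and BusyBox grep do); the function name first_pg_errors is hypothetical.

```shell
# Print the first few error-level lines from a Postgres log stream on stdin.
# Usage: docker logs <db-container> 2>&1 | first_pg_errors [COUNT]
first_pg_errors() {
  local count="${1:-5}"
  # -n prefixes each match with its line number in the stream.
  # -m stops after COUNT matches, which keeps the earliest ones.
  grep -n -m "$count" -E 'FATAL|PANIC|ERROR'
}
```

Note that this reads `docker logs` without `--tail`, since `--tail` keeps only the newest lines and would hide the first failure.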

Redis Down

Owner: App operator.

Symptoms:

Grafana alert: P1 Redis exporter reports Redis down.
/api/health reports Redis failure.
Workers stop processing BullMQ jobs.
Bull Board cannot load queues.

Immediate checks:

docker ps --filter name=redis
docker logs --tail=200 <redis-container>
docker exec -it <redis-container> redis-cli ping
df -h

Actions:

Check Redis memory and disk pressure.
Check whether the Redis container restarted and lost in-memory queued work.
Restart workers after Redis recovers so BullMQ connections reconnect cleanly.
Inspect Bull Board for delayed or failed jobs after recovery.

Recovery check:

redis-cli ping returns PONG.
/api/health returns 200.
Redis exporter reports redis_up == 1.
Queue depths stop growing and workers process jobs.
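
For the Redis memory check, the used_memory and maxmemory fields from `redis-cli info memory` can be summarised in one line. This is a sketch only; the function name redis_memory_summary is hypothetical, and it assumes the output is piped in unchanged.

```shell
# Summarise memory use from `redis-cli info memory` output on stdin.
# Usage: docker exec <redis-container> redis-cli info memory | redis_memory_summary
redis_memory_summary() {
  # INFO replies use CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0)
        # maxmemory 0 means Redis has no configured memory limit.
        printf "used=%.0f bytes, no maxmemory limit\n", used
      else
        printf "used=%.0f bytes, %d%% of maxmemory\n", used, int(used * 100 / max)
    }'
}
```

When no limit is configured, compare the used figure against the memory available on the host or container instead.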

Stuck Queues

Owner: App operator.

Symptoms:

Grafana alert: P2 BullMQ queue backlog sustained.
Bull Board shows waiting or delayed jobs that do not drain.
Recommendation results remain pending.

Immediate checks:

docker logs --tail=300 <workers-container>
docker logs --tail=200 <redis-container>

Actions:

Open Bull Board and identify the affected queue and job type.
Check worker logs for repeated provider timeouts, DB errors, validation errors, or final failures.
Check Redis health and memory.
If provider timeouts are driving retries, pause catalog/backfill or lower worker concurrency before retrying.
Retry only jobs whose failure cause is understood. Do not bulk retry unknown failures if they can spend AI credits or hammer TMDB.

Recovery check:

Waiting and delayed queue depth trends downward.
Active jobs appear and complete.
Final failure count does not keep increasing.
New recommendation jobs complete within normal latency.
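
Whether queue depth "trends downward" is easier to judge from a few samples taken a minute or so apart than from a single reading. The helper below is an illustrative sketch: the name queue_trend is hypothetical, and the operator supplies the samples (for example, waiting plus delayed counts read from Bull Board).

```shell
# Classify queue depth samples (oldest first) as draining, growing, or flat.
# Usage: queue_trend 120 95 60
queue_trend() {
  if [ "$#" -lt 2 ]; then
    echo "need at least two samples" >&2
    return 2
  fi
  local first="$1"
  local last
  # Walk the arguments so that $last ends up holding the final sample.
  for last in "$@"; do :; done
  if [ "$last" -lt "$first" ]; then
    echo "draining"
  elif [ "$last" -gt "$first" ]; then
    echo "growing"
  else
    echo "flat"
  fi
}
```

This compares only the first and last samples, so it is a coarse signal. A queue that reports "flat" or "growing" across several minutes while workers are running is the case worth investigating.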

TMDB/OpenAI Timeout Spike

Owner: App operator.

Symptoms:

Grafana alert: P2 Provider timeout or rate-limit spike.
Logs show OpenAI or TMDB timeouts, 429, or upstream HTTP failures.
Recommendation latency increases or queue backlog grows.

Immediate checks:

docker logs --tail=300 <web-container>
docker logs --tail=300 <workers-container>

Actions:

Check provider status pages and account/API key limits.
Check recent traffic, backfill, discovery, or maintenance jobs that may have increased request volume.
Pause or slow catalog-maintenance workers if TMDB is rate-limiting.
Avoid live AI evals while OpenAI errors are elevated.
If only one feature path is failing, keep unrelated workers running and isolate the failing queue/job type.

Recovery check:

The rate of new provider errors falls to near zero.
Queue backlog drains.
Deterministic recommendation eval still passes locally/CI.
Optional live-provider validation is run only when explicitly desired.
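
A rough per-provider count of timeout and rate-limit lines can show which provider is driving the spike. The sketch below assumes the logs mention the provider name and the failure in plain text on the same line; the actual PopChoice log format is not documented here, so treat the patterns and the function name count_provider_errors as placeholders to adapt.

```shell
# Count timeout / rate-limit lines per provider from log text on stdin.
# Usage: docker logs --tail=300 <workers-container> 2>&1 | count_provider_errors
count_provider_errors() {
  awk '
    BEGIN { openai = 0; tmdb = 0 }
    {
      line = tolower($0)
      # Coarse match: "429" can also appear inside timestamps or IDs.
      if (line !~ /timeout|timed out|etimedout|429|rate limit/) next
      if (line ~ /openai/) openai++
      if (line ~ /tmdb/) tmdb++
    }
    END { printf "openai=%d tmdb=%d\n", openai, tmdb }'
}
```

Running it against both the web and workers logs, a few minutes apart, gives a quick sense of whether the error rate is still climbing.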

Disk Pressure

Owner: Infrastructure operator.

Symptoms:

Grafana alert: P2 Host disk usage high.
Postgres, Prometheus, Loki, or Docker logs fail to write.
Containers restart or fail with "no space left on device" errors.

Immediate checks:

df -h
docker system df
du -h -d 1 /var/lib/docker 2>/dev/null | sort -h

Actions:

Confirm the latest app and database backups exist before deleting anything important.
Prune unused Docker images and build cache if safe for the host.
Reduce log growth or retention if Loki/Docker logs are the largest consumer.
Expand the VPS disk if normal retention no longer fits.
Restart services that failed because writes were denied.

Recovery check:

Disk usage is below 85%.
Postgres and Redis health checks pass.
Prometheus and Loki are ingesting again.
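
The 85% recovery threshold can be checked mechanically from `df -P`, whose column layout is fixed by POSIX. A minimal sketch, assuming mount points and device names contain no spaces; the function name disk_over_threshold is hypothetical.

```shell
# Print filesystems at or above a usage threshold from `df -P` output on stdin.
# Exits 0 if at least one filesystem is over the threshold, 1 otherwise.
# Usage: df -P | disk_over_threshold [PERCENT]
disk_over_threshold() {
  local limit="${1:-85}"
  awk -v limit="$limit" '
    NR > 1 {
      # Column 5 is capacity (for example "90%"), column 6 is the mount point.
      pct = $5
      sub(/%/, "", pct)
      if (pct + 0 >= limit + 0) { print $6 " " $5; found = 1 }
    }
    END { exit found ? 0 : 1 }'
}
```

An empty result with exit status 1 means every filesystem is below the threshold, which matches the recovery condition above.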

Monitoring Stack Down

Owner: Infrastructure operator.

Symptoms:

Grafana alert: P3 Monitoring scrape target down.
Grafana dashboards stop updating.
Prometheus targets show exporters down.
Uptime Kuma or Coolify shows Grafana/Prometheus/Loki unhealthy.

Immediate checks:

docker compose -f docker-compose.observability.yml ps
docker compose -f docker-compose.observability.yml logs --tail=200 observability-prometheus
docker compose -f docker-compose.observability.yml logs --tail=200 observability-grafana

Actions:

Check whether the observability stack can reach the PopChoice app network.
Check METRICS_BEARER_TOKEN, exporter credentials, and POPCHOICE_APP_NETWORK.
Restart the failed observability service.
If Prometheus data is missing but the config in Git is intact, prefer redeploying from Git and accepting a metrics gap over risky volume surgery.
If Grafana provisioning fails on startup, validate the changed YAML before starting the stack again.

Recovery check:

Grafana opens.
Prometheus /targets shows expected targets up.
PopChoice Overview panels update.
Alert rules are visible under the PopChoice Alerts folder.

Backup Restore Drill

Owner: Infrastructure operator.

Run after alerting/provisioning changes and before relying on a new VPS setup:

restore_dir="$(mktemp -d /tmp/popchoice-observability-restore.XXXXXX)"
tar -cf "$restore_dir/observability-config.tar" \
  docker-compose.observability.yml \
  observability/prometheus \
  observability/grafana/provisioning \
  observability/grafana/dashboards \
  observability/loki \
  observability/alloy \
  docs/OBSERVABILITY-*.md
mkdir "$restore_dir/restored"
tar -xf "$restore_dir/observability-config.tar" -C "$restore_dir/restored"
diff -qr observability "$restore_dir/restored/observability"
diff -q docker-compose.observability.yml "$restore_dir/restored/docker-compose.observability.yml"

Expected result: no diff output for the restored observability config. If the archive cannot recreate Grafana provisioning files, do not deploy the config.
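
One possible refinement of the drill is to diff only the paths that were archived, so files under observability that were never added to the archive do not show up as differences, and to return a non-zero status on any mismatch. This is a sketch, not the project's drill script; the function name roundtrip_check is hypothetical.

```shell
# Round-trip the given paths through a tar archive and diff each one.
# Returns non-zero if archiving, extracting, or any comparison fails.
# Usage: roundtrip_check PATH...
roundtrip_check() {
  local work path rc=0
  work="$(mktemp -d)" || return 1
  mkdir "$work/restored"
  if tar -cf "$work/config.tar" "$@" &&
     tar -xf "$work/config.tar" -C "$work/restored"; then
    for path in "$@"; do
      # -q prints only the names of differing files; -r recurses into directories.
      diff -qr "$path" "$work/restored/$path" || rc=1
    done
  else
    rc=1
  fi
  # Clean up the scratch directory whether or not the check passed.
  rm -rf "$work"
  return "$rc"
}
```

Run it from the repository root with the same relative paths as the drill, for example `roundtrip_check docker-compose.observability.yml observability/prometheus observability/grafana/provisioning`. No output and exit status 0 correspond to the expected result described above.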
