Observability Runbooks
These runbooks pair with the provisioned Grafana alerts in
observability/grafana/provisioning/alerting/popchoice-alerts.yaml. They are
written for a small self-hosted VPS/Coolify deployment where one operator may
own app, database, and infrastructure response.
General Triage
- Check whether this is user-facing:
  - open the app
  - open /api/health
  - check Uptime Kuma and Coolify health
- Check Grafana dashboards:
  - PopChoice Overview
  - logs in Loki if available
  - Prometheus targets page
- Check recent changes:
  - latest deploy
  - migrations
  - environment variable edits
  - catalog/backfill jobs
- Prefer reversible actions:
  - pause noisy workers before killing data stores
  - restart one service at a time
  - keep logs before pruning or recreating containers
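The "keep logs before pruning or recreating containers" step can be sketched as a small helper. This is a minimal sketch, not part of the provisioned tooling: the function name `save_container_logs` and the default `./incident-logs` directory are illustrative assumptions.

```shell
# Save a container's logs to a timestamped file before recreating it.
# Usage: save_container_logs CONTAINER [OUTPUT_DIR]
# The function name and default directory are hypothetical.
save_container_logs() {
  container="$1"
  out_dir="${2:-./incident-logs}"
  mkdir -p "$out_dir" || return 1
  out_file="$out_dir/${container}-$(date -u +%Y%m%dT%H%M%SZ).log"
  # docker logs relays the container's stderr on stderr, so capture both streams.
  docker logs --timestamps "$container" > "$out_file" 2>&1 || return 1
  # Print the path so the caller can attach it to incident notes.
  printf '%s\n' "$out_file"
}
```

For example, `save_container_logs <web-container> /root/incidents` before recreating the web container keeps the startup errors available after the old container is gone.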
App Down
Owner: App operator.
Symptoms:
- /api/health fails or times out.
- Coolify shows the web container unhealthy or restarting.
- Uptime Kuma popchoice-prod-health is down after retries.
- Post-deploy verification fails to recover before its retry budget expires.
Immediate checks:
docker ps --filter name=popchoice
docker logs --tail=200 <web-container>
curl -i https://your-domain.example/api/health
Actions:
- Check Coolify deployment status and whether the latest image pulled successfully.
- Check web logs for startup validation errors, missing env vars, migration failures, or unhandled runtime exceptions.
- Check /api/build if the app responds but looks like the wrong version.
- Check DB and Redis health because /api/health depends on both.
- If the latest deploy caused the outage, roll back to the last known good image tag.
Follow-up alerting work is tracked in the deploy workflow so normal redeploy churn can be separated from confirmed outages.
Recovery check:
- /api/health returns 200.
- Prometheus target popchoice-web is up.
- A cheap recommendation smoke flow reaches a persisted result without live provider spending unless explicitly intended.
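The first recovery check can be polled instead of re-running curl by hand. The following is a rough sketch assuming a fixed attempt budget; the function name `wait_for_health` and the default attempt count and delay are illustrative, not values taken from the deploy workflow.

```shell
# Poll a health URL until it returns HTTP 200 or the attempt budget runs out.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Name and defaults are hypothetical; tune them to the real retry budget.
wait_for_health() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-6}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; curl reports 000 when it cannot connect.
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")"
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP $code" >&2
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

Usage: `wait_for_health https://your-domain.example/api/health 10 6`. A non-zero exit means the budget expired without a 200, which is the point to consider the rollback action above.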
App Metrics Scrape Target Down
Owner: App operator.
Symptoms:
- Grafana alert: P2 App metrics scrape target down.
- Prometheus target popchoice-web is down.
- Grafana panels for app metrics are stale or empty.
- Public /api/health may still be healthy during normal Coolify redeploy churn.
Immediate checks:
curl -i https://your-domain.example/api/health
curl -i https://your-domain.example/api/build
docker ps --filter name=web
Actions:
- Check whether a Coolify deploy is currently running or just completed.
- If /api/health is healthy and /api/build reports the expected commit, treat this as a metrics visibility issue and wait for the scrape target to recover.
- If /api/health fails, follow the App Down runbook.
- If the target stays down after the app recovers, check METRICS_ENABLED, METRICS_BEARER_TOKEN, POPCHOICE_WEB_METRICS_TARGET, and Prometheus logs.
- During planned deploys, use the deploy-aware silence script or the GitHub deploy workflow so this alert does not create avoidable noise.
Recovery check:
- /api/health returns 200.
- /api/build reports the expected commit or image metadata.
- Prometheus target popchoice-web returns to up == 1.
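The `up == 1` check can also be run from the shell against the Prometheus HTTP query API. This is a sketch under assumptions: it treats `popchoice-web` as the Prometheus job label and `http://localhost:9090` as the Prometheus address, neither of which is confirmed by this runbook. It matches the response with grep to avoid extra dependencies; a JSON parser would be more robust.

```shell
# Return 0 when Prometheus reports up == 1 for at least one instance of a job.
# Usage: prom_target_up [JOB] [PROMETHEUS_URL]
# The job label and URL defaults are assumptions; adjust to the real scrape config.
prom_target_up() {
  job="${1:-popchoice-web}"
  prom_url="${2:-http://localhost:9090}"
  # -G with --data-urlencode sends the PromQL expression as an encoded query string.
  body="$(curl -sG --max-time 5 "$prom_url/api/v1/query" \
    --data-urlencode "query=up{job=\"$job\"}")" || return 2
  # Prometheus encodes sample values as strings, e.g. "value":[1700000000.123,"1"].
  printf '%s' "$body" | grep -Eq '"value":\[[0-9.]+,"1"\]'
}
```

Usage: `prom_target_up popchoice-web && echo "scrape target is up"`. An empty result set (the job name does not match any target) also returns non-zero, so check the job label first if this never succeeds.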
DB Down
Owner: Database operator.
Symptoms:
- Grafana alert: P1 Postgres exporter reports database down.
- /api/health reports database failure.
- App logs show connection, migration, or query errors.
Immediate checks:
docker ps --filter name=db
docker logs --tail=200 <db-container>
docker exec -it <db-container> pg_isready -U popchoice
df -h
Actions:
- Check disk pressure first; Postgres can fail or become read-only when the host is full.
- Check whether credentials changed in Coolify without restarting all dependent services.
- Check recent migrations and startup logs.
- If the DB container is restarting, inspect the earliest startup error, not only the latest health check line.
- Restore from the latest verified database backup only after confirming the volume is corrupted or missing.
Recovery check:
- pg_isready succeeds.
- /api/health returns 200.
- The Postgres exporter reports pg_up == 1.
- App reads existing recommendations and can create a new deterministic/local recommendation job.
Redis Down
Owner: App operator.
Symptoms:
- Grafana alert: P1 Redis exporter reports Redis down.
- /api/health reports Redis failure.
- Workers stop processing BullMQ jobs.
- Bull Board cannot load queues.
Immediate checks:
docker ps --filter name=redis
docker logs --tail=200 <redis-container>
docker exec -it <redis-container> redis-cli ping
df -h
Actions:
- Check Redis memory and disk pressure.
- Check whether the Redis container restarted and lost in-memory queued work.
- Restart workers after Redis recovers so BullMQ connections reconnect cleanly.
- Inspect Bull Board for delayed or failed jobs after recovery.
Recovery check:
- redis-cli ping returns PONG.
- /api/health returns 200.
- Redis exporter reports redis_up == 1.
- Queue depths stop growing and workers process jobs.
Stuck Queues
Owner: App operator.
Symptoms:
- Grafana alert: P2 BullMQ queue backlog sustained.
- Bull Board shows waiting or delayed jobs that do not drain.
- Recommendation results remain pending.
Immediate checks:
docker logs --tail=300 <workers-container>
docker logs --tail=200 <redis-container>
Actions:
- Open Bull Board and identify the affected queue and job type.
- Check worker logs for repeated provider timeouts, DB errors, validation errors, or final failures.
- Check Redis health and memory.
- If provider timeouts are driving retries, pause catalog/backfill or lower worker concurrency before retrying.
- Retry only jobs whose failure cause is understood. Do not bulk retry unknown failures if they can spend AI credits or hammer TMDB.
Recovery check:
- Waiting and delayed queue depth trends downward.
- Active jobs appear and complete.
- Final failure count does not keep increasing.
- New recommendation jobs complete within normal latency.
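When Bull Board is slow or unavailable, the waiting and delayed depths can be read straight from Redis. The sketch below assumes the default BullMQ key prefix `bull`, an unauthenticated Redis, and a placeholder queue name; none of these are confirmed by this runbook, so compare the output against Bull Board before trusting it.

```shell
# Print waiting and delayed depth for one BullMQ queue.
# Usage: queue_depth REDIS_CONTAINER QUEUE_NAME
# Assumes the default "bull" key prefix; the queue name is whatever the app registers.
queue_depth() {
  container="$1"
  queue="$2"
  # With default BullMQ keys, waiting jobs sit in a list and delayed jobs in a sorted set.
  waiting="$(docker exec "$container" redis-cli LLEN "bull:$queue:wait")" || return 1
  delayed="$(docker exec "$container" redis-cli ZCARD "bull:$queue:delayed")" || return 1
  printf '%s waiting=%s delayed=%s\n' "$queue" "$waiting" "$delayed"
}
```

Running `queue_depth <redis-container> <queue-name>` a few minutes apart shows whether the depth is trending downward, which is the first recovery check above.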
TMDB/OpenAI Timeout Spike
Owner: App operator.
Symptoms:
- Grafana alert: P2 Provider timeout or rate-limit spike.
- Logs show OpenAI or TMDB timeouts, 429, or upstream HTTP failures.
- Recommendation latency increases or queue backlog grows.
Immediate checks:
docker logs --tail=300 <web-container>
docker logs --tail=300 <workers-container>
Actions:
- Check provider status pages and account/API key limits.
- Check recent traffic, backfill, discovery, or maintenance jobs that may have increased request volume.
- Pause or slow catalog-maintenance workers if TMDB is rate-limiting.
- Avoid live AI evals while OpenAI errors are elevated.
- If only one feature path is failing, keep unrelated workers running and isolate the failing queue/job type.
Recovery check:
- The rate of new provider errors drops to near zero.
- Queue backlog drains.
- Deterministic recommendation eval still passes locally/CI.
- Optional live-provider validation is run only when explicitly desired.
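To judge whether provider errors are slowing down, it can help to count matching log lines over a fixed window. This is a rough sketch: the pattern is a guess at how timeouts and rate limits are worded, not the app's actual log format, so adjust it after looking at real log lines.

```shell
# Count log lines on stdin that look like provider timeouts or rate limits.
# The pattern is an assumption about log wording, not the app's real format.
count_provider_errors() {
  # -c prints the number of matching lines; grep exits 1 on zero matches, so mask that.
  # The 429 branch requires non-digit neighbours so job IDs like 14290 are not counted.
  grep -Eic 'timeout|timed out|rate.?limit|(^|[^0-9])429([^0-9]|$)' || true
}
```

Usage: `docker logs --since 15m <workers-container> 2>&1 | count_provider_errors`. Repeating the command after a pause or concurrency change shows whether the count per window is falling.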
Disk Pressure
Owner: Infrastructure operator.
Symptoms:
- Grafana alert: P2 Host disk usage high.
- Postgres, Prometheus, Loki, or Docker logs fail to write.
- Containers restart or fail with "no space left on device" errors.
Immediate checks:
df -h
docker system df
du -h -d 1 /var/lib/docker 2>/dev/null | sort -h
Actions:
- Confirm the latest app and database backups exist before deleting anything important.
- Prune unused Docker images and build cache if safe for the host.
- Reduce log growth or retention if Loki/Docker logs are the largest consumer.
- Expand the VPS disk if normal retention no longer fits.
- Restart services that failed because writes were denied.
Recovery check:
- Disk usage is below 85%.
- Postgres and Redis health checks pass.
- Prometheus and Loki are ingesting again.
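The 85% threshold in the recovery check can be tested mechanically. The following is a minimal sketch that reads POSIX-format `df -P` output; the function name is illustrative, and mount points containing spaces are not handled.

```shell
# Read `df -P` output on stdin and print mounts at or above a usage threshold.
# Usage: df -P | disks_over_threshold [PERCENT]
# Exits non-zero if any mount is at or above the threshold.
disks_over_threshold() {
  threshold="${1:-85}"
  awk -v limit="$threshold" '
    NR > 1 {
      # Column 5 is the capacity (e.g. "90%"), column 6 is the mount point.
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) { print $6 " " $5; over = 1 }
    }
    END { exit (over ? 1 : 0) }
  '
}
```

Usage: `df -P | disks_over_threshold 85 && echo "disk usage is below 85%"`. Any mount listed in the output still needs cleanup or a larger disk.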
Monitoring Stack Down
Owner: Infrastructure operator.
Symptoms:
- Grafana alert: P3 Monitoring scrape target down.
- Grafana dashboards stop updating.
- Prometheus targets show exporters down.
- Uptime Kuma or Coolify shows Grafana/Prometheus/Loki unhealthy.
Immediate checks:
docker compose -f docker-compose.observability.yml ps
docker compose -f docker-compose.observability.yml logs --tail=200 observability-prometheus
docker compose -f docker-compose.observability.yml logs --tail=200 observability-grafana
Actions:
- Check whether the observability stack can reach the PopChoice app network.
- Check METRICS_BEARER_TOKEN, exporter credentials, and POPCHOICE_APP_NETWORK.
- Restart the failed observability service.
- If Prometheus data is missing but config is intact, prefer restoring from Git and accepting a metrics gap over risky volume surgery.
- If Grafana provisioning fails on startup, validate the changed YAML before starting the stack again.
Recovery check:
- Grafana opens.
- Prometheus /targets shows expected targets up.
- PopChoice Overview panels update.
- Alert rules are visible under the PopChoice Alerts folder.
Backup Restore Drill
Owner: Infrastructure operator.
Run after alerting/provisioning changes and before relying on a new VPS setup:
restore_dir="$(mktemp -d /tmp/popchoice-observability-restore.XXXXXX)"
tar -cf "$restore_dir/observability-config.tar" \
docker-compose.observability.yml \
observability/prometheus \
observability/grafana/provisioning \
observability/grafana/dashboards \
observability/loki \
observability/alloy \
docs/OBSERVABILITY-*.md
mkdir "$restore_dir/restored"
tar -xf "$restore_dir/observability-config.tar" -C "$restore_dir/restored"
diff -qr observability "$restore_dir/restored/observability"
diff -q docker-compose.observability.yml "$restore_dir/restored/docker-compose.observability.yml"
Expected result: no diff output for the restored observability config. If the
archive cannot recreate Grafana provisioning files, do not deploy the config.