Uptime Kuma Monitoring
Issue #502 starts the self-hosted observability stack with an external view of PopChoice production reachability. This page is intentionally limited to uptime monitoring and cheap synthetic checks in Uptime Kuma. Logs, metrics, traces, and dashboards are tracked in separate observability issues.
Stack
Run Uptime Kuma outside the PopChoice application Compose stack so app deploys, database migrations, and worker restarts cannot take the monitor down with the service it is watching.

    docker compose -f docker-compose.observability.yml up -d

The local Compose file exposes Kuma on http://127.0.0.1:3002 and stores all
monitor state in the named uptime-kuma-data volume. On a VPS, put this service
behind Coolify, Caddy, nginx, or a private network such as Tailscale. Do not
publish the Kuma UI directly without authentication and TLS.
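To confirm the stack actually came up, a small polling loop against the loopback port can help. This is a minimal sketch, not part of the repository: the function name `wait_for_http` and its defaults are hypothetical, and it treats any 2xx or 3xx answer as "up" on the assumption that a fresh Kuma instance may redirect to its setup or dashboard page.

```shell
# Hypothetical helper: poll a URL until it answers with a 2xx or 3xx status.
# Usage: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_http() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s hides progress, -o discards the body, -w prints only the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
    case $code in
      2??|3??)
        echo "up after $i attempt(s): HTTP $code"
        return 0
        ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no answer from $url after $attempts attempt(s)" >&2
  return 1
}
```

After `docker compose ... up -d`, calling `wait_for_http http://127.0.0.1:3002` should return once the UI is listening, or fail after roughly a minute with the defaults shown.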
Recommended VPS shape:
- One small Uptime Kuma service from docker-compose.observability.yml.
- A persistent volume mounted at /app/data.
- A private or protected public URL such as https://uptime.example.com.
- Backups enabled for the Kuma volume before monitors become the source of truth.
Production Monitors
Create these monitors manually in Kuma. Keep names stable so alert history stays readable.
| Name | Type | Target | Interval | Success criteria | Purpose |
|---|---|---|---|---|---|
| popchoice-prod-health | HTTP(s) | https://pop-choice.shchilkin.dev/api/health | 60s | HTTP 200 | Catches app, PostgreSQL, or Redis outages because /api/health fails closed when required dependencies are unavailable. |
| popchoice-prod-build | HTTP(s) Keyword | https://pop-choice.shchilkin.dev/api/build | 5m | HTTP 200 and keyword version | Confirms the deployed app is serving build metadata for release/debug visibility. |
| popchoice-prod-homepage | HTTP(s) Keyword | https://pop-choice.shchilkin.dev/ | 5m | HTTP 200 and keyword PopChoice | Cheap browserless smoke check for the public app shell. |
| popchoice-prod-catalog-page | HTTP(s) | https://pop-choice.shchilkin.dev/available-movies | 10m | HTTP 200 | Cheap smoke check for a catalog-backed page without invoking recommendations or external AI providers. |
Use retries before alerting to reduce noise from transient network blips. A good starting point is 2 retries, 20s retry interval, and a 20s request timeout.
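Before creating the monitors in Kuma, it can be useful to run the same checks once by hand. The sketch below mirrors the table's targets and success criteria with `curl`, using the suggested 20s timeout. The function names are hypothetical, and the keyword check is slightly looser than the table: `curl -f` only rejects HTTP 400 and above rather than requiring exactly 200.

```shell
# Hypothetical helpers that approximate the monitor table's success criteria.
TIMEOUT=20

# check_status URL [EXPECTED_STATUS]
check_status() {
  expected=${2:-200}
  actual=$(curl -s -o /dev/null -w '%{http_code}' --max-time "$TIMEOUT" "$1") || actual=000
  if [ "$actual" = "$expected" ]; then
    echo "ok   $1 ($actual)"
  else
    echo "FAIL $1 (got $actual, want $expected)"
    return 1
  fi
}

# check_keyword URL KEYWORD
check_keyword() {
  # -f makes curl exit non-zero on HTTP >= 400; grep -F matches the keyword literally.
  if curl -fsS --max-time "$TIMEOUT" "$1" | grep -F -- "$2" >/dev/null; then
    echo "ok   $1 (contains '$2')"
  else
    echo "FAIL $1 (missing '$2' or request failed)"
    return 1
  fi
}

# Run all four read-only checks; the exit status is non-zero if any failed.
run_prod_checks() {
  base=https://pop-choice.shchilkin.dev
  failed=0
  check_status  "$base/api/health"        || failed=1
  check_keyword "$base/api/build" version || failed=1
  check_keyword "$base/" PopChoice        || failed=1
  check_status  "$base/available-movies"  || failed=1
  return "$failed"
}
```

Running `run_prod_checks` prints one line per target. It issues only GET requests against the routes in the table, so it stays within the same cost constraints as the Kuma monitors.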
Synthetic Smoke Strategy
The default synthetic checks must not spend OpenAI or TMDB credits. Keep the always-on Kuma checks to read-only routes:
- /api/health for app, PostgreSQL, and Redis readiness.
- /api/build for deploy provenance.
- / for the public shell.
- /available-movies for a cheap catalog-backed page load.
Do not make an always-on production monitor POST to
/api/movie-recommendation, /api/recommendations, or
/api/recommendations/[id]/more-picks. Those paths can call embeddings,
chat completions, TMDB, Redis workers, or recommendation persistence depending
on environment and request shape.
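If monitor targets are ever generated or reviewed by script, the rule in this section can be expressed as a small allow-list guard. This is an illustrative sketch only: `is_safe_monitor_target` is a hypothetical name, and the allow-list simply restates the four read-only routes named in this section.

```shell
# Hypothetical guard: succeed only for read-only requests to allow-listed routes.
# Usage: is_safe_monitor_target METHOD PATH
is_safe_monitor_target() {
  method=$1
  path=$2
  # Always-on monitors must not send state-changing requests.
  case $method in
    GET|HEAD) ;;
    *) return 1 ;;
  esac
  # Exact-match allow-list; recommendation routes are rejected by omission.
  case $path in
    /api/health|/api/build|/|/available-movies) return 0 ;;
    *) return 1 ;;
  esac
}
```

An allow-list is used rather than a deny-list so that a new credit-spending route is rejected by default until someone deliberately adds it.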
For deeper recommendation smoke coverage, use one of these explicitly separated paths:
- Run the existing Playwright e2e smoke suite in CI or after deploy with E2E_DETERMINISTIC_RECOMMENDATIONS=1. That validates the browser, API, database, results, feedback, and movie-memory flow without live AI calls.
- Add a staging-only deterministic endpoint or staging deployment later, then point a Kuma monitor at staging rather than production.
- Run live-provider recommendation checks manually when you intentionally want to spend API credits and inspect model/provider behavior.
If a future synthetic job posts to a PopChoice API, give it a dedicated API key, rate-limit it tightly, and document whether it is deterministic, staging-only, or allowed to call live providers.
Notifications
Configure at least one low-friction notification channel in Kuma before relying on the monitor:
- Telegram: bot token plus chat ID for fast personal alerts.
- Slack: incoming webhook URL for a project or ops channel.
- Email/SMTP: host, port, username, password, from, and recipient address.
Suggested alert behavior:
- Alert when popchoice-prod-health is down after retries. Treat this as production-impacting because it covers the app and required dependencies.
- Alert when popchoice-prod-build is down for deploy visibility loss, but use a lower urgency than /api/health.
- Alert on homepage or catalog-page failures after retries. These usually mean routing, certificate, rendering, or application availability problems.
- Send recovery notifications so incident timelines include both outage and restoration times.
Keep secrets only in Kuma's notification settings or the host secret manager. Do not commit bot tokens, webhook URLs, SMTP passwords, or destination IDs.
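Before pasting a Telegram bot token and chat ID into Kuma, it can help to confirm the pair works outside Kuma. The sketch below calls the Telegram Bot API `sendMessage` method with `curl`. The function name and the environment variable names `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` are assumptions chosen for this example; the values are read from the environment so nothing secret lands in a committed file.

```shell
# Hypothetical helper: send one test message through the Telegram Bot API.
# Expects TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID in the environment.
# Usage: telegram_test_message [TEXT]
telegram_test_message() {
  # Abort with a clear message if either credential is missing.
  : "${TELEGRAM_BOT_TOKEN:?set TELEGRAM_BOT_TOKEN in the environment}"
  : "${TELEGRAM_CHAT_ID:?set TELEGRAM_CHAT_ID in the environment}"
  text=${1:-"PopChoice Kuma notification test"}
  # --data-urlencode sends a POST and escapes spaces and punctuation in the text.
  curl -fsS --max-time 20 \
    --data-urlencode "chat_id=$TELEGRAM_CHAT_ID" \
    --data-urlencode "text=$text" \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage"
}
```

Set both variables in the current shell session only, call `telegram_test_message`, and check that the message arrives. Note that the token is part of the request URL, so avoid running this on a shared host where other users can read the process list.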
Backup and Restore
Kuma stores monitors, notification settings, status pages, and history under
/app/data. In this repository's Compose file, that path is backed by the
uptime-kuma-data Docker volume.
Back up before changing monitor definitions and on a regular VPS backup schedule:

    docker run --rm \
      -v popchoice-observability_uptime-kuma-data:/data:ro \
      -v "$PWD/backups":/backup \
      alpine \
      tar czf /backup/uptime-kuma-$(date +%Y%m%d-%H%M%S).tgz -C /data .

Restore into a stopped Kuma service:

    docker compose -f docker-compose.observability.yml down
    docker run --rm \
      -v popchoice-observability_uptime-kuma-data:/data \
      -v "$PWD/backups":/backup \
      alpine \
      sh -c 'rm -rf /data/* && tar xzf /backup/uptime-kuma-YYYYMMDD-HHMMSS.tgz -C /data'
    docker compose -f docker-compose.observability.yml up -d

The exact volume name can vary by Compose project name. Check it with:

    docker volume ls | grep uptime-kuma

Kuma also has an in-app export/import path for monitor definitions. Use it for quick edits, but keep volume backups as the recovery source because notification settings and history live in the data directory.
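A backup is only useful if it can be read back. The sketch below adds two hypothetical helpers around the commands in this section: `kuma_volume` prints the data volume name regardless of the Compose project prefix, and `verify_backup` checks that an archive is readable and contains a `kuma.db` file. The `kuma.db` name is an assumption based on Kuma's default SQLite setup; adjust it if your instance stores its database differently.

```shell
# Hypothetical helpers around the backup commands in this section.

# Print the first volume whose name contains "uptime-kuma-data".
kuma_volume() {
  docker volume ls -q --filter name=uptime-kuma-data | head -n 1
}

# verify_backup ARCHIVE
# Succeeds only if the archive exists, can be listed, and contains kuma.db
# (assumed to be Kuma's SQLite database file).
verify_backup() {
  archive=$1
  [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
  listing=$(tar tzf "$archive") || { echo "unreadable archive: $archive" >&2; return 1; }
  if printf '%s\n' "$listing" | grep -Eq '(^|/)kuma\.db$'; then
    echo "ok: $archive contains kuma.db"
  else
    echo "no kuma.db in $archive" >&2
    return 1
  fi
}
```

For example, `verify_backup backups/uptime-kuma-YYYYMMDD-HHMMSS.tgz` can run right after the backup command, and `kuma_volume` can replace the hard-coded volume name in the `-v` flags. This check only inspects the archive listing; it does not prove the database inside is consistent.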
Operational Checks
After the monitor stack starts:
- Open the Kuma UI and create the admin account.
- Add the production monitors from this document.
- Configure at least one notification channel.
- Use Kuma's test notification button.
- Temporarily pause any monitor you are intentionally breaking during deploy work, and add a note to the incident timeline if an alert fires.
- Export monitor definitions after the first working setup and after major changes.