# Observability Guide
Track the health of LLM runs, automations, and integrations to keep service levels predictable.
## Metrics architecture
- Backend collectors: Django emits business metrics, Celery tasks export automation counters, and the gateway records Prometheus histograms.
- Scrape endpoint: expose `/metrics` via the Django ASGI app; secure it with basic auth or IP whitelisting (see the sketch after this list).
- Dashboards: the internal dashboard hosts a consolidated view; external teams can mirror the metrics into Grafana.
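A minimal sketch of that endpoint, assuming plain `prometheus_client` inside the Django project; the `METRICS_USER` and `METRICS_PASSWORD` settings names are placeholders, not documented configuration:

```python
# urls.py-style sketch: expose Prometheus metrics behind basic auth.
# METRICS_USER / METRICS_PASSWORD are hypothetical settings names.
import base64

from django.conf import settings
from django.http import HttpResponse
from django.urls import path
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest


def metrics_view(request):
    # Reject scrapes that do not present the expected basic-auth header.
    expected = base64.b64encode(
        f"{settings.METRICS_USER}:{settings.METRICS_PASSWORD}".encode()
    ).decode()
    if request.headers.get("Authorization", "") != f"Basic {expected}":
        return HttpResponse(status=401)
    # generate_latest() renders every registered series in text format.
    return HttpResponse(generate_latest(), content_type=CONTENT_TYPE_LATEST)


urlpatterns = [path("metrics", metrics_view)]
```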
## Key Prometheus series
| Metric | Type | Description |
|---|---|---|
| `hestyna_llm_runs_total` | Counter | Number of LLM runs, labeled by intent and outcome. |
| `hestyna_llm_latency_seconds` | Histogram | Model latency; P95 should stay under 6 seconds. |
| `hestyna_integration_failures_total` | Counter | Failed calls per integration slug. |
| `hestyna_automation_runs_total` | Counter | Runs executed per automation version. |
| `hestyna_requests_resolution_seconds` | Histogram | Lifecycle from intake to resolution. |
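For reference, the series above could be registered with `prometheus_client` roughly as follows; label names are read off the table descriptions, and the bucket boundaries are illustrative assumptions, not documented values:

```python
# Sketch of the series above as prometheus_client objects.
from prometheus_client import Counter, Histogram

LLM_RUNS = Counter(
    "hestyna_llm_runs_total",
    "Number of LLM runs, labeled by intent and outcome.",
    ["intent", "outcome"],
)
LLM_LATENCY = Histogram(
    "hestyna_llm_latency_seconds",
    "Model latency; P95 should stay under 6 seconds.",
    buckets=[0.5, 1, 2, 4, 6, 10, 30],  # assumed boundaries
)
INTEGRATION_FAILURES = Counter(
    "hestyna_integration_failures_total",
    "Failed calls per integration slug.",
    ["integration"],
)
AUTOMATION_RUNS = Counter(
    "hestyna_automation_runs_total",
    "Runs executed per automation version.",
    ["automation", "version"],
)
REQUEST_RESOLUTION = Histogram(
    "hestyna_requests_resolution_seconds",
    "Lifecycle from intake to resolution.",
    buckets=[60, 300, 900, 3600, 14400, 86400],  # assumed boundaries
)

# Usage: LLM_RUNS.labels(intent="faq", outcome="success").inc()
```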
## Dashboard walkthrough
- Open Dashboard → Observability.
- Filter by organization or date range; widgets update automatically every 60 seconds.
- Review LLM Runs for success rate, average tokens, and common failure reasons (a cross-check sketch follows this list).
- Inspect Automation throughput to ensure published playbooks run as expected.
- Check Integration health for spike alerts; drill into logs via the link icon.
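The success rate shown in LLM Runs can also be cross-checked against Prometheus directly. A sketch using the Prometheus HTTP API, assuming a server at `localhost:9090` and an `outcome="success"` label value; both are assumptions, not documented specifics:

```python
# Cross-check the dashboard's LLM success rate against Prometheus.
import requests

QUERY = (
    'sum(rate(hestyna_llm_runs_total{outcome="success"}[1h])) '
    "/ sum(rate(hestyna_llm_runs_total[1h]))"
)

resp = requests.get(
    "http://localhost:9090/api/v1/query",  # assumed Prometheus address
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print(f"LLM success rate (1h): {float(result[0]['value'][1]):.1%}")
else:
    print("No samples in the last hour.")
```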
## Alerting
- Prometheus alert rules (example):

```yaml
- alert: IntegrationFailures
  expr: increase(hestyna_integration_failures_total[5m]) > 5
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.integration }} failing repeatedly"
    description: "Investigate credentials or downstream outages."
```

- Slack notifications: configure the webhook under Settings → Notifications; incident alerts post to `#automation-alerts` by default (a webhook smoke test follows this list).
- PagerDuty: map Prometheus Alertmanager receivers to on-call schedules.
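To confirm the Slack routing end to end, a one-off smoke test against the webhook can help; the URL below is a placeholder for the real webhook from your Slack app configuration:

```python
# One-off smoke test for the Slack webhook configured under
# Settings → Notifications.
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

resp = requests.post(
    WEBHOOK_URL,
    json={"text": "Test: observability alert routing to #automation-alerts"},
    timeout=10,
)
resp.raise_for_status()  # Slack answers 200 with body "ok" on success
```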
## Tracing & logs
- Enable OpenTelemetry in Django (via the `OTEL_EXPORTER_OTLP_ENDPOINT` env var); see the sketch after this list.
- Attach trace IDs to LLM runs; the frontend links each observability row to the tracing UI.
- Stream structured logs (JSON) to your SIEM; fields include `organization`, `intent`, `automation_id`, and `latency_ms`.
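A sketch tying these three points together, assuming the standard `opentelemetry-sdk` and OTLP exporter packages; the span name, logger name, and field values are illustrative:

```python
# OTLP tracing plus a structured JSON log line carrying the fields above.
import json
import logging

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("hestyna.llm")  # illustrative tracer name

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("hestyna")

with tracer.start_as_current_span("llm_run") as span:
    trace_id = format(span.get_span_context().trace_id, "032x")
    # Emit one JSON log line per run so the SIEM can index the fields.
    logger.info(json.dumps({
        "organization": "acme",   # illustrative value
        "intent": "faq",          # illustrative value
        "automation_id": 42,      # illustrative value
        "latency_ms": 830,        # illustrative value
        "trace_id": trace_id,     # links the row to the tracing UI
    }))
```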
## Troubleshooting checklist
- [ ] Metrics endpoint returns HTTP 200 and includes all expected series (see the smoke test after this checklist).
- [ ] Worker processes expose Celery metrics; run `celery inspect stats` if counts are stuck.
- [ ] Time ranges align: the dashboard uses UTC; ensure external tools do as well.
- [ ] Alerts include run IDs or integration names so responders have immediate context.
- [ ] Post-incident reviews update guardrails or automations to prevent repeats.
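A smoke test covering the first checklist item might look like this; the host and credentials are placeholders:

```python
# Verify the endpoint answers 200 and every expected series is present.
import requests

EXPECTED = [
    "hestyna_llm_runs_total",
    "hestyna_llm_latency_seconds",
    "hestyna_integration_failures_total",
    "hestyna_automation_runs_total",
    "hestyna_requests_resolution_seconds",
]

resp = requests.get(
    "https://app.example.com/metrics",  # placeholder host
    auth=("metrics", "change-me"),      # placeholder basic-auth credentials
    timeout=10,
)
assert resp.status_code == 200, f"metrics endpoint returned {resp.status_code}"

missing = [name for name in EXPECTED if name not in resp.text]
assert not missing, f"missing series: {missing}"
print("All expected series present.")
```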