Observability Guide

Track the health of LLM runs, automations, and integrations to keep service levels predictable.

Metrics architecture

Backend collectors: Django emits business metrics, Celery tasks export automation counters, and the gateway records Prometheus histograms.
Scrape endpoint: expose /metrics via the Django ASGI app; secure it with basic auth or IP whitelisting.
Dashboards: the internal dashboard hosts a consolidated view; external teams can mirror the metrics into Grafana.
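
The two protections mentioned for the scrape endpoint can be sketched with the standard library alone. This is a minimal illustration, not the product's own implementation: the function names `basic_auth_ok` and `ip_allowed` are hypothetical, and wiring them into a Django view or middleware is left out.

```python
import base64
import hmac
import ipaddress
from typing import Optional, Sequence


def basic_auth_ok(header: Optional[str], username: str, password: str) -> bool:
    """Return True if an HTTP Basic Authorization header matches the credentials."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header[len("Basic "):], validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    supplied_user, separator, supplied_password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest keeps the comparison time independent of where strings differ.
    user_ok = hmac.compare_digest(supplied_user.encode(), username.encode())
    password_ok = hmac.compare_digest(supplied_password.encode(), password.encode())
    return user_ok and password_ok


def ip_allowed(remote_addr: str, allowed_networks: Sequence[str]) -> bool:
    """Return True if remote_addr falls inside any allowed CIDR network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False
    return any(address in ipaddress.ip_network(network) for network in allowed_networks)
```

A view serving /metrics could call either check before rendering the metrics and return 401 or 403 otherwise. Behind a proxy, the client address usually has to come from a trusted forwarding header rather than the socket peer.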

Key Prometheus series

Metric	Type	Description
`hestyna_llm_runs_total`	Counter	Number of LLM runs by intent and outcome.
`hestyna_llm_latency_seconds`	Histogram	Model latency; P95 should stay under 6 seconds.
`hestyna_integration_failures_total`	Counter	Failed calls per integration slug.
`hestyna_automation_runs_total`	Counter	Runs executed per automation version.
`hestyna_requests_resolution_seconds`	Histogram	Lifecycle from intake to resolution.
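
To make the P95 target for `hestyna_llm_latency_seconds` concrete, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the general approach Prometheus takes in `histogram_quantile`. The bucket boundaries and counts are invented for illustration; they are not the service's real configuration.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative bucket counts for hestyna_llm_latency_seconds.
EXAMPLE_BUCKETS = [(0.5, 120), (1.0, 480), (2.5, 820), (5.0, 960), (10.0, 995), (math.inf, 1000)]

p95 = estimate_quantile(0.95, EXAMPLE_BUCKETS)
print(f"estimated P95: {p95:.2f}s, within 6s target: {p95 < 6}")
```

Because the estimate is interpolated, its accuracy depends on having bucket boundaries close to the 6 second target.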

Dashboard walkthrough

Open Dashboard → Observability.
Filter by organization or date range; widgets update automatically every 60 seconds.
Review LLM Runs for success rate, average tokens, and common failure reasons.
Inspect Automation throughput to ensure published playbooks run as expected.
Check Integration health for spike alerts; drill into logs via the link icon.

Alerting

Prometheus alert rules (example):

yaml

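groups:
  - name: integration-alerts  # hypothetical group name
    rules:
      - alert: IntegrationFailures
        # More than 5 new failures per series within the last 5 minutes.
        expr: increase(hestyna_integration_failures_total[5m]) > 5
        # The condition must hold for 10 minutes before the alert fires.
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.integration }} failing repeatedly"
          description: "Investigate credentials or downstream outages."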
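
The interaction between the 5 minute `increase` window and the 10 minute `for` clause is easy to misread, so here is a simplified simulation of it. It is a sketch under stated assumptions: it evaluates once per minute, measures increase as last value minus first value in the window, and ignores counter resets and the extrapolation that Prometheus applies. The function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[int, float]  # (unix seconds, counter value)


def window_increase(samples: Sequence[Sample], now: int, window: int) -> float:
    """Counter growth between the first and last sample in [now - window, now]."""
    in_window = [value for ts, value in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1] - in_window[0]


def first_firing_time(
    samples: Sequence[Sample],
    threshold: float = 5,
    window: int = 300,
    hold: int = 600,
    step: int = 60,
) -> Optional[int]:
    """Return the first evaluation time at which the alert would fire, or None."""
    pending_since: Optional[int] = None
    start, end = samples[0][0], samples[-1][0]
    for now in range(start, end + 1, step):
        if window_increase(samples, now, window) > threshold:
            if pending_since is None:
                pending_since = now  # alert enters the pending state
            if now - pending_since >= hold:
                return now  # condition held for the whole "for" duration
        else:
            pending_since = None  # condition cleared, pending state resets
    return None


# Two failures per minute for 30 minutes versus a single burst of ten failures.
steady = [(minute * 60, 2.0 * minute) for minute in range(31)]
burst = [(0, 0.0)] + [(minute * 60, 10.0) for minute in range(1, 31)]
print(first_firing_time(steady), first_firing_time(burst))
```

In this model a sustained failure rate eventually pages, while a single burst drops out of the 5 minute window before the 10 minute hold elapses and never fires.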

Slack notifications: configure the webhook under Settings → Notifications; incident alerts post to #automation-alerts by default.
PagerDuty: map Prometheus Alertmanager receivers to on-call schedules.
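
To verify a webhook after configuring it, a test message can be sent by hand. The sketch below builds a Slack incoming-webhook payload (a JSON object with a `text` field) and posts it with the standard library. The message layout, the function names, and the `salesforce` slug in the usage example are assumptions for illustration, not the format the product itself posts.

```python
import json
import urllib.request


def build_slack_payload(alert_name: str, integration: str, severity: str, summary: str) -> bytes:
    """Build the JSON body for a Slack incoming webhook."""
    text = f"[{severity.upper()}] {alert_name}: {summary} (integration: {integration})"
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(webhook_url: str, payload: bytes, timeout: float = 5.0) -> int:
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example payload only; nothing is sent until post_to_slack is called with a real URL.
example_payload = build_slack_payload(
    "IntegrationFailures", "salesforce", "page", "salesforce failing repeatedly"
)
```

The target channel is determined by the webhook configuration, so the payload does not name `#automation-alerts` itself.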

Tracing & logs

Enable OpenTelemetry in Django by setting the OTEL_EXPORTER_OTLP_ENDPOINT environment variable to your collector's OTLP endpoint.
Attach trace IDs to LLM runs; the frontend links each observability row to the tracing UI.
Stream structured logs (JSON) to your SIEM; fields include organization, intent, automation_id, and latency_ms.
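
One way to produce logs in this shape is a small JSON formatter on top of the standard `logging` module. This is a sketch, not the project's actual logging setup: the logger name, the `trace_id` field, and the timestamp layout are assumptions, while the other field names come from the list above.

```python
import json
import logging
import time

# Field names from the guide; trace_id is an assumed extra for linking to traces.
CONTEXT_FIELDS = ("organization", "intent", "automation_id", "latency_ms", "trace_id")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with UTC timestamps."""

    converter = time.gmtime  # match the dashboard's UTC time ranges

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy only the context fields that the caller supplied via `extra`.
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(name: str = "hestyna.observability") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical logger name)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    build_logger().info(
        "llm run finished",
        extra={"organization": "acme", "intent": "summarize", "automation_id": 42, "latency_ms": 1830},
    )
```

Passing context through `extra` keeps the message text stable, which makes the fields easier to index and filter on the SIEM side.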

Troubleshooting checklist

[ ] Metrics endpoint returns HTTP 200 and includes all expected series.
[ ] Worker processes expose Celery metrics; run celery inspect stats if counts are stuck.
[ ] Time ranges align: the dashboard uses UTC, so make sure external tools do as well.
[ ] Alerts include run IDs or integration names so responders have immediate context.
[ ] Post-incident reviews update guardrails or automations to prevent repeats.
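
The first checklist item can be automated with a small parser over the text returned by the metrics endpoint. The sketch below is a minimal version that assumes the Prometheus text exposition format and the five series named in this guide; fetching the text (for example with `urllib.request`) and the function names are left as assumptions.

```python
from typing import List, Sequence, Set

EXPECTED_SERIES = (
    "hestyna_llm_runs_total",
    "hestyna_llm_latency_seconds",
    "hestyna_integration_failures_total",
    "hestyna_automation_runs_total",
    "hestyna_requests_resolution_seconds",
)

# Histograms are exposed as <name>_bucket, <name>_sum and <name>_count samples.
HISTOGRAM_SUFFIXES = ("_bucket", "_sum", "_count")


def exposed_metric_names(metrics_text: str) -> Set[str]:
    """Collect metric names found in Prometheus text exposition output."""
    names: Set[str] = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample name ends at the first '{' (labels) or whitespace (value).
        name = line.split("{", 1)[0].split()[0]
        names.add(name)
        for suffix in HISTOGRAM_SUFFIXES:
            if name.endswith(suffix):
                names.add(name[: -len(suffix)])  # also record the histogram base name
                break
    return names


def missing_series(metrics_text: str, expected: Sequence[str] = EXPECTED_SERIES) -> List[str]:
    """Return the expected series that do not appear in the scrape output."""
    return sorted(set(expected) - exposed_metric_names(metrics_text))
```

An empty result means every expected series is present. Labelled counters only appear after their first increment, so a series can be reported missing on a freshly started process.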
