Observability Guide

Track the health of LLM runs, automations, and integrations to keep service levels predictable.

Metrics architecture

  • Backend collectors: Django emits business metrics, Celery tasks export automation counters, and the gateway records Prometheus histograms.
  • Scrape endpoint: expose /metrics via the Django ASGI app and restrict access with basic auth or an IP allowlist.
  • Dashboards: the internal dashboard hosts a consolidated view; external teams can mirror the metrics into Grafana.
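The access rule for the scrape endpoint can be sketched as a plain function that a Django view or middleware could call before serving /metrics. This is a minimal illustration using only the standard library, not the product's actual implementation; the function name is_scrape_allowed, its parameters, and the choice to accept either an allowlisted IP or valid basic-auth credentials are assumptions.

```python
import base64
import hmac
import ipaddress


def is_scrape_allowed(auth_header, remote_ip, username, password, allowed_networks):
    """Return True if a request may read /metrics (hypothetical helper).

    Access is granted when the client IP is inside an allowlisted network
    OR the Authorization header carries matching basic-auth credentials.
    """
    # 1. IP allowlist: accept scrapers coming from trusted networks.
    ip = ipaddress.ip_address(remote_ip)
    if any(ip in ipaddress.ip_network(net) for net in allowed_networks):
        return True

    # 2. Basic auth: header must look like "Basic <base64(user:password)>".
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(auth_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False  # malformed base64 or non-UTF-8 payload

    user, sep, pwd = decoded.partition(":")
    if not sep:
        return False

    # Constant-time comparison avoids leaking credential contents via timing.
    user_ok = hmac.compare_digest(user.encode(), username.encode())
    pwd_ok = hmac.compare_digest(pwd.encode(), password.encode())
    return user_ok and pwd_ok
```

In a real deployment the credentials and networks would come from settings or environment variables, and the same check is often enforced at the reverse proxy instead of in application code.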

Key Prometheus series

Metric                               | Type      | Description
hestyna_llm_runs_total               | Counter   | Number of LLM runs by intent and outcome.
hestyna_llm_latency_seconds          | Histogram | Model latency; P95 should stay under 6 seconds.
hestyna_integration_failures_total   | Counter   | Failed calls per integration slug.
hestyna_automation_runs_total        | Counter   | Runs executed per automation version.
hestyna_requests_resolution_seconds  | Histogram | Lifecycle from intake to resolution.
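To make the P95 target for hestyna_llm_latency_seconds concrete, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the same idea PromQL's histogram_quantile() applies. The bucket boundaries and counts are invented for illustration; the real bucket layout depends on how the gateway configures its histogram.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include the +Inf bucket, whose count is the total
    number of observations. Values are assumed to be non-negative.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the overflow bucket: report the highest
                # finite boundary, since no upper limit is known.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Illustrative cumulative counts only (le = upper bound in seconds).
example_buckets = [(1.0, 50), (2.0, 80), (5.0, 95), (10.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, example_buckets)
within_target = p95 < 6.0  # the 6-second P95 objective from the table
```

In Prometheus itself the equivalent query would typically look like histogram_quantile(0.95, sum by (le) (rate(hestyna_llm_latency_seconds_bucket[5m]))), where the 5-minute window is an assumption to tune.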

Dashboard walkthrough

  1. Open Dashboard → Observability.
  2. Filter by organization or date range; widgets update automatically every 60 seconds.
  3. Review LLM Runs for success rate, average tokens, and common failure reasons.
  4. Inspect Automation throughput to ensure published playbooks run as expected.
  5. Check Integration health for spike alerts; drill into logs via the link icon.

Alerting

  • Prometheus alert rules (example, YAML):
    - alert: IntegrationFailures
      expr: increase(hestyna_integration_failures_total[5m]) > 5
      for: 10m
      labels:
        severity: page
      annotations:
        summary: "{{ $labels.integration }} failing repeatedly"
        description: "Investigate credentials or downstream outages."
  • Slack notifications: configure the webhook under Settings → Notifications; incident alerts post to #automation-alerts by default.
  • PagerDuty: map Alertmanager receivers to on-call schedules.
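The interaction between the expr threshold and the for: clause in the example rule is easy to misread: the alert first becomes pending when the expression is true, and only fires once it has stayed true for the full hold duration. The sketch below simulates that behaviour over a series of rule evaluations. The function name and the sample values are hypothetical; the thresholds mirror the example rule (more than 5 failures, held for 10 minutes).

```python
def alert_states(samples, threshold=5.0, for_seconds=600):
    """Simulate pending/firing transitions for a rule with a `for:` clause.

    `samples` is a list of (timestamp_seconds, value) pairs, one per rule
    evaluation, where value stands in for
    increase(hestyna_integration_failures_total[5m]).
    """
    states = []
    active_since = None  # time at which the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held_for = ts - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the hold timer.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical evaluations every 5 minutes.
timeline = [(0, 2), (300, 7), (600, 8), (900, 9), (1200, 3)]
states = alert_states(timeline)
```

With these sample values the condition becomes true at 300 s, so the page is sent at 900 s, after the condition has held for 600 s. A brief dip below the threshold in between would restart the timer, which is why short failure bursts do not page.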

Tracing & logs

  • Enable OpenTelemetry in Django and point it at your collector with the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
  • Attach trace IDs to LLM runs; the frontend links each observability row to the tracing UI.
  • Stream structured JSON logs to your SIEM. Fields include organization, intent, automation_id, and latency_ms.
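One way to produce log lines with the fields listed above is a small JSON formatter on top of the standard logging module. This is a sketch, not the platform's actual logging setup: the logger name, the JsonFormatter class, and the inclusion of a trace_id field alongside the documented fields are assumptions made for illustration.

```python
import io
import json
import logging

# Documented fields, plus trace_id as an assumed extra for trace correlation.
CONTEXT_FIELDS = ("organization", "intent", "automation_id", "latency_ms", "trace_id")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for name in CONTEXT_FIELDS:
            # Fields passed through `extra=` become attributes on the record.
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload)


def build_logger(stream):
    """Return a logger that writes JSON lines to `stream` (hypothetical name)."""
    logger = logging.getLogger("hestyna.observability.example")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# Usage: an in-memory stream stands in for the SIEM shipper.
stream = io.StringIO()
log = build_logger(stream)
log.info(
    "llm run finished",
    extra={
        "organization": "acme",
        "intent": "summarize",
        "automation_id": 42,
        "latency_ms": 1830,
        "trace_id": "0af7651916cd43dd8448eb211c80319c",
    },
)
entry = json.loads(stream.getvalue())
```

Keeping one JSON object per line lets most log shippers forward the stream without custom parsing.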

Troubleshooting checklist

  • [ ] Metrics endpoint returns HTTP 200 and includes all expected series.
  • [ ] Worker processes expose Celery metrics; run celery inspect stats if counts are stuck.
  • [ ] Time ranges align: the dashboard uses UTC, so ensure external tools do as well.
  • [ ] Alerts include run IDs or integration names so responders have immediate context.
  • [ ] Post-incident reviews update guardrails or automations to prevent repeats.
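The first checklist item can be automated by parsing the text returned by /metrics and comparing it with the series listed under Key Prometheus series. The sketch below works on the response body as a string, so it can be paired with any HTTP client; the helper names and the sample payload are illustrative, and the parsing only covers the common "name{labels} value" line shape of the Prometheus text format.

```python
EXPECTED_SERIES = (
    "hestyna_llm_runs_total",
    "hestyna_llm_latency_seconds",
    "hestyna_integration_failures_total",
    "hestyna_automation_runs_total",
    "hestyna_requests_resolution_seconds",
)


def exposed_series(metrics_text):
    """Collect metric names found in a Prometheus text-format payload."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
        # Histograms are exposed as <name>_bucket, <name>_sum, <name>_count.
        for suffix in ("_bucket", "_sum", "_count"):
            if name.endswith(suffix):
                names.add(name[: -len(suffix)])
    return names


def missing_series(metrics_text, expected=EXPECTED_SERIES):
    """Return the expected series that the payload does not expose."""
    found = exposed_series(metrics_text)
    return [name for name in expected if name not in found]


# Hypothetical scrape body with two of the five series present.
sample_body = """\
# TYPE hestyna_llm_runs_total counter
hestyna_llm_runs_total{intent="summarize",outcome="success"} 12
hestyna_llm_latency_seconds_bucket{le="5.0"} 11
hestyna_llm_latency_seconds_bucket{le="+Inf"} 12
hestyna_llm_latency_seconds_sum 31.5
hestyna_llm_latency_seconds_count 12
"""
missing = missing_series(sample_body)
```

An empty result means every expected series is present. Note that labelled counters only appear after their first increment, so a freshly started process may legitimately report some series as missing.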

Produced by the Hestyna team