ML Log Watcher Utilities

This repository now contains two automation entry points that work together to triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted language models.

1. scripts/log_monitor.py

Existing script that queries Elasticsearch indices, pulls a recent window of logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via cron/systemd.

ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
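
If you schedule the script with cron, an entry along these lines works. The install path and the idea of sourcing credentials from an env file are assumptions for illustration, not part of the repository:

```shell
# Illustrative crontab entry (paths and env file location are assumptions):
# run the anomaly check every 30 minutes, sourcing credentials first.
*/30 * * * * . /opt/log-watcher/.env && /usr/bin/python3 /opt/log-watcher/scripts/log_monitor.py --index 'log*' --minutes 30 >> /var/log/log_monitor.log 2>&1
```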

2. scripts/grafana_alert_webhook.py

A FastAPI web server that accepts Grafana alert webhooks, finds the matching entry in alert_runbook.yaml, renders the LLM prompt, and posts it to OpenRouter. The response text is returned to Grafana (or any caller) immediately so automation can fan out to chat, ticketing, etc.

Dependencies

python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain

Environment

  • OPENROUTER_API_KEY required.
  • OPENROUTER_MODEL optional (default openai/gpt-4o-mini).
  • RUNBOOK_PATH optional (default alert_runbook.yaml in repo root).
  • ANSIBLE_HOSTS_PATH optional (default /etc/ansible/hosts). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
  • OPENROUTER_REFERER / OPENROUTER_TITLE optional; forwarded to OpenRouter as attribution headers when set.
  • TRIAGE_ENABLE_COMMANDS set to 1 to let the webhook execute runbook commands (default 0 keeps it in read-only mode).
  • TRIAGE_COMMAND_RUNNER ssh (default) or local. When using ssh, also set TRIAGE_SSH_USER and optional TRIAGE_SSH_OPTIONS.
  • TRIAGE_COMMAND_TIMEOUT, TRIAGE_MAX_COMMANDS, TRIAGE_OUTPUT_LIMIT, TRIAGE_DEFAULT_OS tune execution behavior.
  • TRIAGE_VERBOSE_LOGS set to 1 to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
  • TRIAGE_EMAIL_ENABLED when 1, the webhook emails the final LLM summary per alert. Requires TRIAGE_EMAIL_FROM, TRIAGE_EMAIL_TO (comma-separated), TRIAGE_SMTP_HOST, and optional TRIAGE_SMTP_PORT, TRIAGE_SMTP_USER, TRIAGE_SMTP_PASSWORD, TRIAGE_SMTP_STARTTLS, TRIAGE_SMTP_SSL.
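
The TRIAGE_* toggles above follow one convention: "1" enables, anything else disables. A minimal sketch of how such variables can be read; the default values here are assumptions for illustration, not necessarily the script's actual defaults:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    # Treat "1" as enabled, anything else as disabled (assumed convention).
    return os.environ.get(name, "1" if default else "0") == "1"

def env_int(name: str, default: int) -> int:
    # Fall back to the default on missing or non-numeric values.
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default

# Hypothetical config snapshot; defaults are assumptions, not the script's.
config = {
    "enable_commands": env_flag("TRIAGE_ENABLE_COMMANDS"),
    "command_runner": os.environ.get("TRIAGE_COMMAND_RUNNER", "ssh"),
    "command_timeout": env_int("TRIAGE_COMMAND_TIMEOUT", 30),
    "max_commands": env_int("TRIAGE_MAX_COMMANDS", 5),
    "output_limit": env_int("TRIAGE_OUTPUT_LIMIT", 4000),
}
```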

Running

source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081

The server loads the runbook at startup and exposes:

  • POST /alerts Grafana webhook target.
  • POST /reload-runbook force runbook reload without restarting.
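
For example, to pick up runbook edits in place (assuming the default port from the Running section):

```shell
# Force a runbook reload after editing alert_runbook.yaml; no restart needed.
curl -X POST http://localhost:8081/reload-runbook
```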

When TRIAGE_ENABLE_COMMANDS=1, the server executes the relevant triage commands for each alert (via SSH or locally), captures stdout/stderr, and appends the results to both the OpenRouter prompt and the HTTP response JSON, so evidence gathering can be automated directly from the runbook instructions. Use the environment variables above to control which user/host the commands target and to limit timeouts and output size.

LangChain powers the multi-turn investigation flow: the LLM can call the provided tools (run_local_command, run_ssh_command) to gather additional evidence until it is ready to deliver a final summary. When /etc/ansible/hosts (or ANSIBLE_HOSTS_PATH) is available, the server automatically enriches the alert context with SSH metadata (user, host, port, identity file, and common args) so runbook commands default to running over SSH against the alerting host instead of on the webhook server.
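
The local-runner path can be sketched roughly as follows. The function name, return shape, and defaults are assumptions for illustration; the SSH path would work the same way but prepend an ssh invocation built from the enriched alert context:

```python
import subprocess

def run_triage_command(argv, timeout=30, output_limit=4000):
    """Run one runbook command locally, capturing and truncating output.

    Sketch of the local-runner path only; names and defaults are assumed,
    not taken from scripts/grafana_alert_webhook.py.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
        stdout, stderr, code = proc.stdout, proc.stderr, proc.returncode
    except subprocess.TimeoutExpired:
        # Surface the timeout to the LLM instead of raising.
        stdout, stderr, code = "", f"timed out after {timeout}s", -1
    return {
        "command": " ".join(argv),
        "exit_code": code,
        # Truncate so huge outputs do not blow up the prompt or response.
        "stdout": stdout[:output_limit],
        "stderr": stderr[:output_limit],
    }
```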

Running with Docker Compose

  1. Copy .env.example to .env and fill in your OpenRouter key, email SMTP settings, and other toggles.
  2. Place any SSH keys the webhook needs inside ./.ssh/ (the compose file mounts this directory read-only inside the container).
  3. Run docker compose up -d to build and launch the webhook. It listens on port 8081 by default and uses the mounted alert_runbook.yaml plus the host /etc/ansible/hosts.
  4. Use docker compose logs -f to watch verbose LangChain output or restart with docker compose restart when updating the code/runbook.

Sample payload

curl -X POST http://localhost:8081/alerts \
  -H 'Content-Type: application/json' \
  -d '{
        "status":"firing",
        "ruleUid":"edkmsdmlay2o0c",
        "ruleUrl":"http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
        "alerts":[
          {
            "status":"firing",
            "labels":{
              "alertname":"High Mem.",
              "host":"unit-02",
              "rule_uid":"edkmsdmlay2o0c"
            },
            "annotations":{
              "summary":"Memory usage above 95% for 10m",
              "value":"96.2%"
            },
            "startsAt":"2025-09-22T17:20:00Z",
            "endsAt":"0001-01-01T00:00:00Z"
          }
        ]
      }'

With a valid OpenRouter key this returns a JSON body containing the LLM summary per alert plus any unmatched alerts (missing runbook entries or rule UIDs).

Testing without OpenRouter

Set OPENROUTER_API_KEY=dummy and point the DNS entry to a mock (e.g. mitmproxy) if you need to capture outbound requests. Otherwise, requests will fail fast with HTTP 502 so Grafana knows the automation needs to be retried.