ML Log Watcher Utilities

This repository now contains two automation entry points that work together to triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted language models.

1. scripts/log_monitor.py

Existing script that queries Elasticsearch indices, pulls a recent window of logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via cron/systemd.

ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
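
If you schedule the script with cron, an entry along these lines works. The install path and the idea of sourcing credentials from an env file are assumptions for illustration, not part of the repository:

```shell
# Illustrative crontab entry (paths and env file location are assumptions):
# run the anomaly check every 30 minutes, sourcing credentials first.
*/30 * * * * . /opt/log-watcher/.env && /usr/bin/python3 /opt/log-watcher/scripts/log_monitor.py --index 'log*' --minutes 30 >> /var/log/log_monitor.log 2>&1
```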

2. scripts/grafana_alert_webhook.py

A FastAPI web server that accepts Grafana alert webhooks, finds the matching entry in alert_runbook.yaml, renders the LLM prompt, and posts it to OpenRouter. The response text is returned to Grafana (or any caller) immediately so automation can fan out to chat, ticketing, etc.

Dependencies

python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain

Environment

  • OPENROUTER_API_KEY required.
  • OPENROUTER_MODEL optional (default openai/gpt-4o-mini).
  • RUNBOOK_PATH optional (default alert_runbook.yaml in repo root).
  • ANSIBLE_HOSTS_PATH optional (default /etc/ansible/hosts). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
  • OPENROUTER_REFERER / OPENROUTER_TITLE optional; forwarded to OpenRouter as attribution headers when set.
  • TRIAGE_ENABLE_COMMANDS set to 1 to let the webhook execute runbook commands (default 0 keeps it in read-only mode).
  • TRIAGE_COMMAND_RUNNER ssh (default) or local. When using ssh, also set TRIAGE_SSH_USER and optional TRIAGE_SSH_OPTIONS.
  • TRIAGE_COMMAND_TIMEOUT, TRIAGE_MAX_COMMANDS, TRIAGE_OUTPUT_LIMIT, TRIAGE_DEFAULT_OS tune execution behavior.
  • TRIAGE_VERBOSE_LOGS set to 1 to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
  • TRIAGE_EMAIL_ENABLED when 1, the webhook emails the final LLM summary per alert. Requires TRIAGE_EMAIL_FROM, TRIAGE_EMAIL_TO (comma-separated), TRIAGE_SMTP_HOST, and optional TRIAGE_SMTP_PORT, TRIAGE_SMTP_USER, TRIAGE_SMTP_PASSWORD, TRIAGE_SMTP_STARTTLS, TRIAGE_SMTP_SSL.
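
The TRIAGE_* toggles above follow one convention: "1" enables, anything else disables. A minimal sketch of how such variables can be read; the default values here are assumptions for illustration, not necessarily the script's actual defaults:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    # Treat "1" as enabled, anything else as disabled (assumed convention).
    return os.environ.get(name, "1" if default else "0") == "1"

def env_int(name: str, default: int) -> int:
    # Fall back to the default on missing or non-numeric values.
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default

# Hypothetical config snapshot; defaults are assumptions, not the script's.
config = {
    "enable_commands": env_flag("TRIAGE_ENABLE_COMMANDS"),
    "command_runner": os.environ.get("TRIAGE_COMMAND_RUNNER", "ssh"),
    "command_timeout": env_int("TRIAGE_COMMAND_TIMEOUT", 30),
    "max_commands": env_int("TRIAGE_MAX_COMMANDS", 5),
    "output_limit": env_int("TRIAGE_OUTPUT_LIMIT", 4000),
}
```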

Running

source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081

The server loads the runbook at startup and exposes:

  • POST /alerts Grafana webhook target.
  • POST /reload-runbook force runbook reload without restarting.
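
For example, to pick up runbook edits in place (assuming the default port from the Running section):

```shell
# Force a runbook reload after editing alert_runbook.yaml; no restart needed.
curl -X POST http://localhost:8081/reload-runbook
```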

When TRIAGE_ENABLE_COMMANDS=1, the server executes the relevant triage commands for each alert (via SSH or locally), captures stdout/stderr, and appends the results to both the OpenRouter prompt and the HTTP response JSON, so evidence gathering can be automated directly from the runbook instructions. Use the environment variables above to control which user/host the commands target and to limit timeouts and output size.

LangChain powers the multi-turn investigation flow: the LLM can call the provided tools (run_local_command, run_ssh_command) to gather additional evidence until it is ready to deliver a final summary. When /etc/ansible/hosts (or ANSIBLE_HOSTS_PATH) is available, the server automatically enriches the alert context with SSH metadata (user, host, port, identity file, and common args) so runbook commands default to running over SSH against the alerting host instead of on the webhook server.
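
The local-runner path can be sketched roughly as follows. The function name, return shape, and defaults are assumptions for illustration; the SSH path would work the same way but prepend an ssh invocation built from the enriched alert context:

```python
import subprocess

def run_triage_command(argv, timeout=30, output_limit=4000):
    """Run one runbook command locally, capturing and truncating output.

    Sketch of the local-runner path only; names and defaults are assumed,
    not taken from scripts/grafana_alert_webhook.py.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
        stdout, stderr, code = proc.stdout, proc.stderr, proc.returncode
    except subprocess.TimeoutExpired:
        # Surface the timeout to the LLM instead of raising.
        stdout, stderr, code = "", f"timed out after {timeout}s", -1
    return {
        "command": " ".join(argv),
        "exit_code": code,
        # Truncate so huge outputs do not blow up the prompt or response.
        "stdout": stdout[:output_limit],
        "stderr": stderr[:output_limit],
    }
```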

Running with Docker Compose

  1. Copy .env.example to .env and fill in your OpenRouter key, email SMTP settings, and other toggles.
  2. Place any SSH keys the webhook needs inside ./.ssh/ (the compose file mounts this directory read-only inside the container).
  3. Run docker compose up -d to build and launch the webhook. It listens on port 8081 by default and uses the mounted alert_runbook.yaml plus the host /etc/ansible/hosts.
  4. Use docker compose logs -f to watch verbose LangChain output or restart with docker compose restart when updating the code/runbook.

Sample payload

curl -X POST http://localhost:8081/alerts \
  -H 'Content-Type: application/json' \
  -d '{
        "status":"firing",
        "ruleUid":"edkmsdmlay2o0c",
        "ruleUrl":"http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
        "alerts":[
          {
            "status":"firing",
            "labels":{
              "alertname":"High Mem.",
              "host":"unit-02",
              "rule_uid":"edkmsdmlay2o0c"
            },
            "annotations":{
              "summary":"Memory usage above 95% for 10m",
              "value":"96.2%"
            },
            "startsAt":"2025-09-22T17:20:00Z",
            "endsAt":"0001-01-01T00:00:00Z"
          }
        ]
      }'

With a valid OpenRouter key this returns a JSON body containing the LLM summary per alert plus any unmatched alerts (missing runbook entries or rule UIDs).

Testing without OpenRouter

Set OPENROUTER_API_KEY=dummy and point the DNS entry to a mock (e.g. mitmproxy) if you need to capture outbound requests. Otherwise, requests will fail fast with HTTP 502 so Grafana knows the automation needs to be retried.