# ML Log Watcher Utilities

This repository now contains two automation entry points that work together to triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted language models.

## 1. `scripts/log_monitor.py`

Existing script that queries Elasticsearch indices, pulls a recent window of logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via cron/systemd.

```
ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
```

## 2. `scripts/grafana_alert_webhook.py`

A FastAPI web server that accepts Grafana alert webhooks, finds the matching entry in `alert_runbook.yaml`, renders the LLM prompt, and posts it to OpenRouter. The response text is returned to Grafana (or any caller) immediately so automation can fan out to chat, ticketing, etc.

### Dependencies

```
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain
```

### Environment

- `OPENROUTER_API_KEY` – required.
- `OPENROUTER_MODEL` – optional (default `openai/gpt-4o-mini`).
- `RUNBOOK_PATH` – optional (default `alert_runbook.yaml` in the repo root).
- `ANSIBLE_HOSTS_PATH` – optional (default `/etc/ansible/hosts`). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
- `OPENROUTER_REFERER` / `OPENROUTER_TITLE` – forwarded as headers if needed.
- `TRIAGE_ENABLE_COMMANDS` – set to `1` to let the webhook execute runbook commands (the default `0` keeps it in read-only mode).
- `TRIAGE_COMMAND_RUNNER` – `ssh` (default) or `local`. When using `ssh`, also set `TRIAGE_SSH_USER` and, optionally, `TRIAGE_SSH_OPTIONS`.
- `TRIAGE_COMMAND_TIMEOUT`, `TRIAGE_MAX_COMMANDS`, `TRIAGE_OUTPUT_LIMIT`, `TRIAGE_DEFAULT_OS` – tune execution behavior.
- `TRIAGE_VERBOSE_LOGS` – set to `1` to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
- `TRIAGE_EMAIL_ENABLED` – when `1`, the webhook emails the final LLM summary for each alert. Requires `TRIAGE_EMAIL_FROM`, `TRIAGE_EMAIL_TO` (comma-separated), and `TRIAGE_SMTP_HOST`, plus optional `TRIAGE_SMTP_PORT`, `TRIAGE_SMTP_USER`, `TRIAGE_SMTP_PASSWORD`, `TRIAGE_SMTP_STARTTLS`, `TRIAGE_SMTP_SSL`.

### Running

```
source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```

The server loads the runbook at startup and exposes:

- `POST /alerts` – Grafana webhook target.
- `POST /reload-runbook` – force a runbook reload without restarting.

When `TRIAGE_ENABLE_COMMANDS=1`, the server executes the relevant triage commands for each alert (via SSH or locally), captures stdout/stderr, and appends the results to both the OpenRouter prompt and the HTTP response JSON. This lets you automate evidence gathering directly from the runbook instructions. Use the environment variables above to control which user/host the commands target and to limit timeouts and output size.

LangChain powers the multi-turn investigation flow: the LLM can call the provided tools (`run_local_command`, `run_ssh_command`) to gather additional evidence until it is ready to deliver a final summary.

When `/etc/ansible/hosts` (or the path in `ANSIBLE_HOSTS_PATH`) is available, the server automatically enriches the alert context with SSH metadata (user, host, port, identity file, and common args) so runbook commands default to using SSH against the alerting host instead of the webhook server.

### Running with Docker Compose

1. Copy `.env.example` to `.env` and fill in your OpenRouter key, SMTP email settings, and other toggles.
2. Place any SSH keys the webhook needs inside `./.ssh/` (the compose file mounts this directory read-only inside the container).
3. Run `docker compose up -d` to build and launch the webhook. It listens on port `8081` by default and uses the mounted `alert_runbook.yaml` plus the host's `/etc/ansible/hosts`.
4. Use `docker compose logs -f` to watch verbose LangChain output, or restart with `docker compose restart` after updating the code or runbook.

### Sample payload

```
curl -X POST http://localhost:8081/alerts \
  -H 'Content-Type: application/json' \
  -d '{
    "status": "firing",
    "ruleUid": "edkmsdmlay2o0c",
    "ruleUrl": "http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
    "alerts": [
      {
        "status": "firing",
        "labels": {
          "alertname": "High Mem.",
          "host": "unit-02",
          "rule_uid": "edkmsdmlay2o0c"
        },
        "annotations": {
          "summary": "Memory usage above 95% for 10m",
          "value": "96.2%"
        },
        "startsAt": "2025-09-22T17:20:00Z",
        "endsAt": "0001-01-01T00:00:00Z"
      }
    ]
  }'
```

With a valid OpenRouter key this returns a JSON body containing the LLM summary per alert plus any unmatched alerts (missing runbook entries or rule UIDs).

### Testing without OpenRouter

Set `OPENROUTER_API_KEY=dummy` and point the DNS entry to a mock (e.g. mitmproxy) if you need to capture outbound requests. Otherwise, requests will fail fast with HTTP 502 so Grafana knows the automation needs to be retried.
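The fields the webhook matches against the runbook can be pulled out of a Grafana payload like the sample above with a short helper. This is an illustrative sketch, not the webhook's actual code; the fallback order (per-alert `rule_uid` label first, then the top-level `ruleUid`) is an assumption:

```python
# Sketch: extract the matching keys from a Grafana alert webhook payload.
# Field names come from the sample payload above; the rule_uid fallback
# order is an assumption about how the webhook resolves runbook entries.
def extract_alert_keys(payload: dict) -> list[dict]:
    """Return one {rule_uid, alertname, host} dict per alert in the payload."""
    keys = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        keys.append({
            # Prefer the per-alert label, fall back to the top-level field.
            "rule_uid": labels.get("rule_uid") or payload.get("ruleUid"),
            "alertname": labels.get("alertname"),
            "host": labels.get("host"),
        })
    return keys
```

Alerts whose `rule_uid` resolves to `None` (or has no runbook entry) would end up in the "unmatched alerts" portion of the response.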
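As an alternative to intercepting traffic with mitmproxy, a stdlib-only stand-in can serve canned completions locally. This is a minimal sketch under assumptions: the OpenAI-style response shape mirrored by OpenRouter, and that you can point the webhook's upstream hostname (via DNS or `/etc/hosts`) at this process:

```python
# mock_openrouter.py -- tiny local stand-in for the OpenRouter chat API.
# The response schema below is an assumption (OpenAI-style "choices" list);
# adjust it if the webhook expects different fields.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

CANNED_REPLY = "No anomalies detected in the supplied evidence."

class MockOpenRouter(BaseHTTPRequestHandler):
    def do_POST(self):
        # Echo the requested model back inside a canned chat completion.
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({
            "model": request.get("model", "mock"),
            "choices": [{"message": {"role": "assistant", "content": CANNED_REPLY}}],
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def run(port: int = 8082) -> None:
    """Serve canned completions forever on the given port (blocks)."""
    HTTPServer(("127.0.0.1", port), MockOpenRouter).serve_forever()
```

Calling `run()` blocks, so launch it in a spare terminal; any `OPENROUTER_API_KEY` value works since the mock never checks it.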