# ML Log Watcher Utilities
This repository now contains two automation entry points that work together to triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted language models.
## 1. `scripts/log_monitor.py`

The existing script: it queries Elasticsearch indices, pulls a recent window of logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via cron/systemd.
```bash
ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
```
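For scheduled runs, a crontab entry along these lines works (the install path `/opt/ml-log-watcher` is illustrative; adjust it to wherever the repo is checked out):

```cron
# Every 30 minutes, scan the last 30 minutes of logs.
# /opt/ml-log-watcher is an assumed install path.
*/30 * * * * ELASTIC_HOST=https://casper.localdomain:9200 ELASTIC_API_KEY=... OPENROUTER_API_KEY=... python3 /opt/ml-log-watcher/scripts/log_monitor.py --index 'log*' --minutes 30 >> /var/log/log_monitor.log 2>&1
```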
## 2. `scripts/grafana_alert_webhook.py`

A FastAPI server that accepts Grafana alert webhooks, finds the matching entry in `alert_runbook.yaml`, renders the LLM prompt, and posts it to OpenRouter. The response text is returned to Grafana (or any caller) immediately so automation can fan out to chat, ticketing, etc.
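The authoritative schema lives in `alert_runbook.yaml` itself; purely as orientation, an entry might look roughly like this, assuming entries are keyed by Grafana rule UID (the field names here are illustrative, not the actual schema):

```yaml
# Hypothetical runbook entry; check alert_runbook.yaml for the real schema.
edkmsdmlay2o0c:            # Grafana rule UID from the alert labels
  name: High memory usage
  prompt: |
    Memory on the alerting host is above threshold. Review the command
    output below and summarise likely causes and next steps.
  commands:
    - free -m
    - ps aux --sort=-%mem | head -n 10
```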
### Dependencies

```bash
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain
```
### Environment

- `OPENROUTER_API_KEY` – required.
- `OPENROUTER_MODEL` – optional (default `openai/gpt-4o-mini`).
- `RUNBOOK_PATH` – optional (default `alert_runbook.yaml` in repo root).
- `ANSIBLE_HOSTS_PATH` – optional (default `/etc/ansible/hosts`). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
- `OPENROUTER_REFERER` / `OPENROUTER_TITLE` – forwarded headers if needed.
- `TRIAGE_ENABLE_COMMANDS` – set to `1` to let the webhook execute runbook commands (default `0` keeps it in read-only mode).
- `TRIAGE_COMMAND_RUNNER` – `ssh` (default) or `local`. When using `ssh`, also set `TRIAGE_SSH_USER` and optional `TRIAGE_SSH_OPTIONS`.
- `TRIAGE_COMMAND_TIMEOUT`, `TRIAGE_MAX_COMMANDS`, `TRIAGE_OUTPUT_LIMIT`, `TRIAGE_DEFAULT_OS` – tune execution behavior.
- `TRIAGE_VERBOSE_LOGS` – set to `1` to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
- `TRIAGE_EMAIL_ENABLED` – when `1`, the webhook emails the final LLM summary per alert. Requires `TRIAGE_EMAIL_FROM`, `TRIAGE_EMAIL_TO` (comma-separated), `TRIAGE_SMTP_HOST`, and optional `TRIAGE_SMTP_PORT`, `TRIAGE_SMTP_USER`, `TRIAGE_SMTP_PASSWORD`, `TRIAGE_SMTP_STARTTLS`, `TRIAGE_SMTP_SSL`.
### Running

```bash
source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```
The server loads the runbook at startup and exposes:
- `POST /alerts` – Grafana webhook target.
- `POST /reload-runbook` – force a runbook reload without restarting.
When `TRIAGE_ENABLE_COMMANDS=1`, the server executes the relevant triage commands for each alert (via SSH or locally), captures stdout/stderr, and appends the results to both the OpenRouter prompt and the HTTP response JSON. This lets you automate evidence gathering directly from the runbook instructions. Use environment variables to control which user/host the commands target and to limit timeouts and output size. LangChain powers the multi-turn investigation flow: the LLM can call the provided tools (`run_local_command`, `run_ssh_command`) to gather additional evidence until it is ready to deliver a final summary.
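Conceptually, the local-command tool reduces to a bounded `subprocess` call. The following is a minimal sketch, not the actual implementation — only the tool name and the `TRIAGE_COMMAND_TIMEOUT`/`TRIAGE_OUTPUT_LIMIT` variables come from this README:

```python
import os
import subprocess

# Defaults here are illustrative; the real service reads the same variables.
TIMEOUT = int(os.environ.get("TRIAGE_COMMAND_TIMEOUT", "30"))
OUTPUT_LIMIT = int(os.environ.get("TRIAGE_OUTPUT_LIMIT", "4000"))


def run_local_command(command: str) -> str:
    """Run a shell command locally and return truncated, labelled output."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=TIMEOUT
        )
    except subprocess.TimeoutExpired:
        return f"ERROR: command timed out after {TIMEOUT}s"
    combined = (result.stdout + result.stderr)[:OUTPUT_LIMIT]
    return f"exit={result.returncode}\n{combined}"
```

Capping output size matters here because everything the tool returns is fed back into the LLM context.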
When `/etc/ansible/hosts` (or `ANSIBLE_HOSTS_PATH`) is available, the server automatically enriches the alert context with SSH metadata (user, host, port, identity file, and common args) so runbook commands default to running over SSH against the alerting host instead of on the webhook server.
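The real loader handles groups and group vars, but the per-host enrichment boils down to parsing inventory lines such as `unit-02 ansible_user=ops ansible_port=2222`. A sketch, assuming an INI-style inventory (the function name is illustrative):

```python
def parse_inventory_line(line: str) -> tuple[str, dict[str, str]]:
    """Split 'host key=value key=value ...' into the host name and its vars."""
    parts = line.split()
    host_vars = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    return parts[0], host_vars
```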
### Running with Docker Compose

- Copy `.env.example` to `.env` and fill in your OpenRouter key, email SMTP settings, and other toggles.
- Place any SSH keys the webhook needs inside `./.ssh/` (the compose file mounts this directory read-only inside the container).
- Run `docker compose up -d` to build and launch the webhook. It listens on port `8081` by default and uses the mounted `alert_runbook.yaml` plus the host `/etc/ansible/hosts`.
- Use `docker compose logs -f` to watch verbose LangChain output, or restart with `docker compose restart` when updating the code/runbook.
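Putting those steps together, the compose service might be wired roughly like this (the service name and in-container `/app` paths are assumptions; the port and mounts mirror the steps above):

```yaml
# Sketch only; match paths to the actual Dockerfile and compose file.
services:
  webhook:
    build: .
    env_file: .env
    ports:
      - "8081:8081"
    volumes:
      - ./alert_runbook.yaml:/app/alert_runbook.yaml:ro
      - ./.ssh:/app/.ssh:ro
      - /etc/ansible/hosts:/etc/ansible/hosts:ro
```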
### Sample payload

```bash
curl -X POST http://localhost:8081/alerts \
  -H 'Content-Type: application/json' \
  -d '{
    "status": "firing",
    "ruleUid": "edkmsdmlay2o0c",
    "ruleUrl": "http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
    "alerts": [
      {
        "status": "firing",
        "labels": {
          "alertname": "High Mem.",
          "host": "unit-02",
          "rule_uid": "edkmsdmlay2o0c"
        },
        "annotations": {
          "summary": "Memory usage above 95% for 10m",
          "value": "96.2%"
        },
        "startsAt": "2025-09-22T17:20:00Z",
        "endsAt": "0001-01-01T00:00:00Z"
      }
    ]
  }'
```
With a valid OpenRouter key this returns a JSON body containing the LLM summary per alert plus any unmatched alerts (missing runbook entries or rule UIDs).
### Testing without OpenRouter

Set `OPENROUTER_API_KEY=dummy` and point the DNS entry at a mock (e.g. mitmproxy) if you need to capture outbound requests. Otherwise, requests will fail fast with HTTP 502 so Grafana knows the automation needs to be retried.