# ML Log Watcher Utilities
This repository now contains two automation entry points that work together to
triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted
language models.
## 1. `scripts/log_monitor.py`
An existing script that queries Elasticsearch indices, pulls a recent window of
logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via
cron/systemd.
```
ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
```
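Internally the flow is roughly: pull the last N minutes of hits from
Elasticsearch, flatten them to text, and send them to an OpenRouter-hosted
model. A minimal sketch of that flow, using a hypothetical helper
(`summarize_recent_logs`) rather than the script's actual internals:

```
import os
import requests

def summarize_recent_logs(index: str = "log*", minutes: int = 30) -> str:
    """Hypothetical sketch of the log_monitor.py flow, not its real code."""
    es_host = os.environ["ELASTIC_HOST"]
    # Pull the most recent window of logs from Elasticsearch.
    resp = requests.post(
        f"{es_host}/{index}/_search",
        headers={"Authorization": f"ApiKey {os.environ['ELASTIC_API_KEY']}"},
        json={
            "size": 500,
            "sort": [{"@timestamp": "desc"}],
            "query": {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
        },
        verify=False,  # self-signed homelab cert assumed; tighten in production
        timeout=30,
    )
    resp.raise_for_status()
    lines = [hit["_source"].get("message", "") for hit in resp.json()["hits"]["hits"]]

    # Ask an OpenRouter-hosted model for anomaly highlights.
    llm = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": os.getenv("OPENROUTER_MODEL", "openai/gpt-4o-mini"),
            "messages": [
                {"role": "system", "content": "Highlight anomalies in these logs."},
                {"role": "user", "content": "\n".join(lines)},
            ],
        },
        timeout=120,
    )
    llm.raise_for_status()
    return llm.json()["choices"][0]["message"]["content"]
```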
## 2. `scripts/grafana_alert_webhook.py`
A FastAPI web server that accepts Grafana alert webhooks, finds the matching
entry in `alert_runbook.yaml`, renders the LLM prompt, and posts it to
OpenRouter. The response text is returned to Grafana (or any caller) immediately
so automation can fan out to chat, ticketing, etc.
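In outline, the handler matches each alert's rule UID against the runbook and
forwards a rendered prompt to OpenRouter. A condensed sketch; the runbook shape
(a `prompt` field keyed by rule UID) and the response keys are assumptions, not
the script's real structure:

```
import os
import requests
import yaml
from fastapi import FastAPI, Request

app = FastAPI()

with open(os.getenv("RUNBOOK_PATH", "alert_runbook.yaml")) as fh:
    RUNBOOK = yaml.safe_load(fh)  # assumed shape: {rule_uid: {"prompt": "..."}}

@app.post("/alerts")
async def alerts(request: Request):
    payload = await request.json()
    summaries, unmatched = [], []
    for alert in payload.get("alerts", []):
        entry = RUNBOOK.get(alert.get("labels", {}).get("rule_uid", ""))
        if entry is None:
            unmatched.append(alert)
            continue
        # Render the runbook prompt with the alert's labels/annotations.
        ctx = {**alert.get("labels", {}), **alert.get("annotations", {})}
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={
                "model": os.getenv("OPENROUTER_MODEL", "openai/gpt-4o-mini"),
                "messages": [{"role": "user", "content": entry["prompt"].format(**ctx)}],
            },
            timeout=120,
        )
        resp.raise_for_status()
        summaries.append(resp.json()["choices"][0]["message"]["content"])
    return {"summaries": summaries, "unmatched": unmatched}
```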
### Dependencies
```
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain
```
### Environment
- `OPENROUTER_API_KEY` required.
- `OPENROUTER_MODEL` optional (default `openai/gpt-4o-mini`).
- `RUNBOOK_PATH` optional (default `alert_runbook.yaml` in repo root).
- `ANSIBLE_HOSTS_PATH` optional (default `/etc/ansible/hosts`). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
- `OPENROUTER_REFERER` / `OPENROUTER_TITLE` optional; forwarded to OpenRouter as attribution headers if set.
- `TRIAGE_ENABLE_COMMANDS` set to `1` to let the webhook execute runbook commands (default `0` keeps it in read-only mode).
- `TRIAGE_COMMAND_RUNNER` `ssh` (default) or `local`. When using ssh, also set `TRIAGE_SSH_USER` and optional `TRIAGE_SSH_OPTIONS`.
- `TRIAGE_COMMAND_TIMEOUT`, `TRIAGE_MAX_COMMANDS`, `TRIAGE_OUTPUT_LIMIT`, `TRIAGE_DEFAULT_OS` tune execution behavior.
- `TRIAGE_VERBOSE_LOGS` set to `1` to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
- `TRIAGE_EMAIL_ENABLED` when `1`, the webhook emails the final LLM summary per alert. Requires `TRIAGE_EMAIL_FROM`, `TRIAGE_EMAIL_TO` (comma-separated), `TRIAGE_SMTP_HOST`, and optional `TRIAGE_SMTP_PORT`, `TRIAGE_SMTP_USER`, `TRIAGE_SMTP_PASSWORD`, `TRIAGE_SMTP_STARTTLS`, `TRIAGE_SMTP_SSL`.
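A rough illustration of how these toggles could be read at startup; only the
variable names come from the list above, and the numeric defaults shown are
assumptions:

```
import os

def _flag(name: str, default: str = "0") -> bool:
    # Toggles such as TRIAGE_ENABLE_COMMANDS are the strings "1"/"0".
    return os.getenv(name, default) == "1"

COMMANDS_ENABLED = _flag("TRIAGE_ENABLE_COMMANDS")  # read-only unless "1"
VERBOSE_LOGS = _flag("TRIAGE_VERBOSE_LOGS")
EMAIL_ENABLED = _flag("TRIAGE_EMAIL_ENABLED")
RUNNER = os.getenv("TRIAGE_COMMAND_RUNNER", "ssh")  # "ssh" or "local"
TIMEOUT = int(os.getenv("TRIAGE_COMMAND_TIMEOUT", "60"))  # assumed default
MAX_COMMANDS = int(os.getenv("TRIAGE_MAX_COMMANDS", "5"))  # assumed default
EMAIL_TO = [a.strip() for a in os.getenv("TRIAGE_EMAIL_TO", "").split(",") if a.strip()]
```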
### Running
```
source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```
The server loads the runbook at startup and exposes:
- `POST /alerts` Grafana webhook target.
- `POST /reload-runbook` force runbook reload without restarting.
When `TRIAGE_ENABLE_COMMANDS=1`, the server executes the relevant triage commands
for each alert (via SSH or locally), captures stdout/stderr, and appends the
results to both the OpenRouter prompt and the HTTP response JSON. This lets you
automate evidence gathering directly from the runbook instructions. Use
environment variables to control which user/host the commands target and to
limit timeouts/output size. LangChain powers the multi-turn investigation flow:
the LLM can call the provided tools (`run_local_command`, `run_ssh_command`) to
gather additional evidence until it's ready to deliver a final summary.
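A hedged sketch of how such tools might be declared with LangChain's `@tool`
decorator; the actual signatures, timeouts, and output caps in
`grafana_alert_webhook.py` may differ:

```
import subprocess
from langchain_core.tools import tool

@tool
def run_local_command(command: str) -> str:
    """Run a shell command on the webhook host and return its output."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=60)
    return (proc.stdout + proc.stderr)[:8000]  # cap output, cf. TRIAGE_OUTPUT_LIMIT

@tool
def run_ssh_command(host: str, command: str) -> str:
    """Run a shell command on a remote host over SSH."""
    proc = subprocess.run(["ssh", host, command], capture_output=True,
                          text=True, timeout=60)
    return (proc.stdout + proc.stderr)[:8000]
```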
When `/etc/ansible/hosts` (or `ANSIBLE_HOSTS_PATH`) is available, the server
automatically enriches the alert context with SSH metadata (user, host, port,
identity file, and common args) so runbook commands default to using SSH against
the alerting host instead of the webhook server.
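The enrichment boils down to reading per-host `ansible_*` variables from the
INI-style inventory. A simplified sketch (`ssh_vars_for_host` is a hypothetical
helper; the script's parser may handle groups and more inventory syntax):

```
def ssh_vars_for_host(hostname: str, path: str = "/etc/ansible/hosts") -> dict:
    """Pull ansible_user/ansible_port/... for one host from an INI inventory."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(("#", "[")):
                continue  # skip comments and group headers
            parts = line.split()
            if parts[0] != hostname:
                continue
            # Remaining tokens are key=value pairs, e.g. ansible_user=root
            return dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    return {}
```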
### Running with Docker Compose
1. Copy `.env.example` to `.env` and fill in your OpenRouter key, email SMTP
settings, and other toggles.
2. Place any SSH keys the webhook needs inside `./.ssh/` (the compose file
mounts this directory read-only inside the container).
3. Run `docker compose up -d` to build and launch the webhook. It listens on
port `8081` by default and uses the mounted `alert_runbook.yaml` plus the
host `/etc/ansible/hosts`.
4. Use `docker compose logs -f` to watch verbose LangChain output or restart
with `docker compose restart` when updating the code/runbook.
### Sample payload
```
curl -X POST http://localhost:8081/alerts \
  -H 'Content-Type: application/json' \
  -d '{
    "status": "firing",
    "ruleUid": "edkmsdmlay2o0c",
    "ruleUrl": "http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
    "alerts": [
      {
        "status": "firing",
        "labels": {
          "alertname": "High Mem.",
          "host": "unit-02",
          "rule_uid": "edkmsdmlay2o0c"
        },
        "annotations": {
          "summary": "Memory usage above 95% for 10m",
          "value": "96.2%"
        },
        "startsAt": "2025-09-22T17:20:00Z",
        "endsAt": "0001-01-01T00:00:00Z"
      }
    ]
  }'
```
With a valid OpenRouter key this returns a JSON body containing the LLM summary
per alert plus any unmatched alerts (missing runbook entries or rule UIDs).
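The exact field names depend on the implementation; an illustrative shape
(not a contract) might be:

```
{
  "summaries": [
    "unit-02 memory at 96.2%; likely causes and suggested next steps ..."
  ],
  "unmatched": []
}
```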
### Testing without OpenRouter
Set `OPENROUTER_API_KEY=dummy` and point DNS for the OpenRouter endpoint at a
mock (e.g. mitmproxy) if you need to capture outbound requests. Otherwise,
requests fail fast with HTTP 502 so Grafana knows the automation needs to be
retried.
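For a self-contained alternative, you can stand up a tiny mock that mimics
OpenRouter's chat-completions response shape; a sketch, not part of this repo:

```
from fastapi import FastAPI

mock = FastAPI()

@mock.post("/api/v1/chat/completions")
async def completions(body: dict):
    # Return the minimal OpenAI-compatible shape the webhook expects back.
    return {"choices": [{"message": {"role": "assistant",
                                     "content": "mock triage summary"}}]}
```

Run it with `uvicorn mock_openrouter:mock --port 9999` (assuming the file is
saved as `mock_openrouter.py`) and route the OpenRouter hostname to it; since
the real client speaks HTTPS, put mitmproxy or another TLS terminator in front.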