# ML Log Watcher Utilities

This repository now contains two automation entry points that work together to
triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted
language models.

## 1. `scripts/log_monitor.py`

Existing script that queries Elasticsearch indices, pulls a recent window of
logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via
cron/systemd.

```
ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
```

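For scheduled runs, a crontab entry along these lines works; the install path and the 30-minute schedule below are illustrative, not fixed by the script:

```
# Illustrative crontab entry: run the monitor every 30 minutes.
# /opt/ml-log-watcher is a placeholder install path; credentials elided.
*/30 * * * * ELASTIC_HOST=https://casper.localdomain:9200 ELASTIC_API_KEY=... OPENROUTER_API_KEY=... /usr/bin/python3 /opt/ml-log-watcher/scripts/log_monitor.py --index 'log*' --minutes 30
```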
## 2. `scripts/grafana_alert_webhook.py`

A FastAPI web server that accepts Grafana alert webhooks, finds the matching
entry in `alert_runbook.yaml`, renders the LLM prompt, and posts it to
OpenRouter. The response text is returned to Grafana (or any caller) immediately
so automation can fan out to chat, ticketing, etc.

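The runbook schema is not reproduced here; as a rough sketch, an entry keyed by Grafana rule UID might look like the following (all field names are hypothetical and should be checked against the actual `alert_runbook.yaml`):

```
# Hypothetical runbook entry; field names are illustrative only.
edkmsdmlay2o0c:
  summary: Investigate sustained high memory usage
  prompt: >
    Memory on the alerting host has exceeded the threshold.
    Review the command output below and summarize likely causes.
  commands:
    - free -m
    - ps aux --sort=-%mem | head -n 10
```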
### Dependencies

```
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain
```

### Environment

- `OPENROUTER_API_KEY` – required.
- `OPENROUTER_MODEL` – optional (default `openai/gpt-4o-mini`).
- `RUNBOOK_PATH` – optional (default `alert_runbook.yaml` in the repo root).
- `ANSIBLE_HOSTS_PATH` – optional (default `/etc/ansible/hosts`). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
- `OPENROUTER_REFERER` / `OPENROUTER_TITLE` – forwarded as headers if needed.
- `TRIAGE_ENABLE_COMMANDS` – set to `1` to let the webhook execute runbook commands (default `0` keeps it in read-only mode).
- `TRIAGE_COMMAND_RUNNER` – `ssh` (default) or `local`. When using `ssh`, also set `TRIAGE_SSH_USER` and, optionally, `TRIAGE_SSH_OPTIONS`.
- `TRIAGE_COMMAND_TIMEOUT`, `TRIAGE_MAX_COMMANDS`, `TRIAGE_OUTPUT_LIMIT`, `TRIAGE_DEFAULT_OS` – tune command-execution behavior.
- `TRIAGE_VERBOSE_LOGS` – set to `1` to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
- `TRIAGE_EMAIL_ENABLED` – when `1`, the webhook emails the final LLM summary per alert. Requires `TRIAGE_EMAIL_FROM`, `TRIAGE_EMAIL_TO` (comma-separated), `TRIAGE_SMTP_HOST`, and optionally `TRIAGE_SMTP_PORT`, `TRIAGE_SMTP_USER`, `TRIAGE_SMTP_PASSWORD`, `TRIAGE_SMTP_STARTTLS`, `TRIAGE_SMTP_SSL`.

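As a starting point, a minimal read-only configuration might look like this (every value below is a placeholder):

```shell
# Minimal read-only configuration sketch; all values are placeholders.
export OPENROUTER_API_KEY="sk-or-..."         # required
export OPENROUTER_MODEL="openai/gpt-4o-mini"  # optional; shown value is the default
export TRIAGE_ENABLE_COMMANDS=0               # keep command execution disabled
export TRIAGE_VERBOSE_LOGS=1                  # log the full LLM dialogue
```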

### Running

```
source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```

The server loads the runbook at startup and exposes:

- `POST /alerts` – Grafana webhook target.
- `POST /reload-runbook` – force a runbook reload without restarting.

When `TRIAGE_ENABLE_COMMANDS=1`, the server executes the relevant triage commands
for each alert (via SSH or locally), captures stdout/stderr, and appends the
results to both the OpenRouter prompt and the HTTP response JSON. This lets you
automate evidence gathering directly from the runbook instructions. Use
environment variables to control which user/host the commands target and to
limit timeouts/output size. LangChain powers the multi-turn investigation flow:
the LLM can call the provided tools (`run_local_command`, `run_ssh_command`) to
gather additional evidence until it is ready to deliver a final summary.

When `/etc/ansible/hosts` (or `ANSIBLE_HOSTS_PATH`) is available, the server
automatically enriches the alert context with SSH metadata (user, host, port,
identity file, and common args) so runbook commands default to using SSH against
the alerting host instead of the webhook server.

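As a rough illustration of the bounds involved with `TRIAGE_COMMAND_RUNNER=local`, each runbook command is effectively constrained like the sketch below; this mirrors what `TRIAGE_COMMAND_TIMEOUT` and `TRIAGE_OUTPUT_LIMIT` control, though the script's exact mechanism may differ:

```shell
# Illustrative only: bound a triage command by a wall-clock timeout and
# an output-size cap, as TRIAGE_COMMAND_TIMEOUT / TRIAGE_OUTPUT_LIMIT do.
timeout "${TRIAGE_COMMAND_TIMEOUT:-30}" sh -c 'uname -a' 2>&1 \
  | head -c "${TRIAGE_OUTPUT_LIMIT:-4000}"
```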
### Running with Docker Compose

1. Copy `.env.example` to `.env` and fill in your OpenRouter key, SMTP email
   settings, and other toggles.
2. Place any SSH keys the webhook needs inside `./.ssh/` (the compose file
   mounts this directory read-only inside the container).
3. Run `docker compose up -d` to build and launch the webhook. It listens on
   port `8081` by default and uses the mounted `alert_runbook.yaml` plus the
   host's `/etc/ansible/hosts`.
4. Use `docker compose logs -f` to watch verbose LangChain output, or restart
   with `docker compose restart` after updating the code or runbook.

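The compose file itself is not reproduced here; based on the mounts and port described above, a service along these hypothetical lines would match (the container-side paths are assumptions, so check them against the repository's actual `docker-compose.yml`):

```
# Hypothetical sketch consistent with the steps above; container paths
# are assumptions, and the real docker-compose.yml may differ.
services:
  webhook:
    build: .
    env_file: .env
    ports:
      - "8081:8081"
    volumes:
      - ./.ssh:/root/.ssh:ro
      - ./alert_runbook.yaml:/app/alert_runbook.yaml:ro
      - /etc/ansible/hosts:/etc/ansible/hosts:ro
```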
### Sample payload

```
curl -X POST http://localhost:8081/alerts \
  -H 'Content-Type: application/json' \
  -d '{
    "status":"firing",
    "ruleUid":"edkmsdmlay2o0c",
    "ruleUrl":"http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
    "alerts":[
      {
        "status":"firing",
        "labels":{
          "alertname":"High Mem.",
          "host":"unit-02",
          "rule_uid":"edkmsdmlay2o0c"
        },
        "annotations":{
          "summary":"Memory usage above 95% for 10m",
          "value":"96.2%"
        },
        "startsAt":"2025-09-22T17:20:00Z",
        "endsAt":"0001-01-01T00:00:00Z"
      }
    ]
  }'
```

With a valid OpenRouter key this returns a JSON body containing the LLM summary
per alert plus any unmatched alerts (missing runbook entries or rule UIDs).

### Testing without OpenRouter

Set `OPENROUTER_API_KEY=dummy` and point the DNS entry at a mock (e.g. mitmproxy)
if you need to capture outbound requests. Otherwise, requests will fail fast with
HTTP 502 so Grafana knows the automation needs to be retried.