# ML Log Watcher Utilities

This repository now contains two automation entry points that work together to
triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted
language models.

## 1. `scripts/log_monitor.py`

This existing script queries Elasticsearch indices, pulls a recent window of
logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via
cron/systemd.

```
ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
```
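
For scheduled runs, a crontab entry along these lines works; the interval,
install path, and env file below are illustrative, not part of the repo:

```
# Hypothetical schedule: run the monitor every 15 minutes, sourcing
# credentials (lines of `export ELASTIC_*` / `export OPENROUTER_API_KEY=...`)
# from an env file you maintain yourself.
*/15 * * * * . /opt/mllogwatcher/.env && python3 /opt/mllogwatcher/scripts/log_monitor.py --index 'log*' --minutes 15 >> /var/log/log_monitor.log 2>&1
```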

## 2. `scripts/grafana_alert_webhook.py`

A FastAPI web server that accepts Grafana alert webhooks, finds the matching
entry in `alert_runbook.yaml`, renders the LLM prompt, and posts it to
OpenRouter. The response text is returned to Grafana (or any caller) immediately
so automation can fan out to chat, ticketing, etc.
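
Matching is keyed on the Grafana rule UID (see the sample payload below). As a
loose sketch only: the `prompt` and `commands` field names here are hypothetical
placeholders, so check the shipped `alert_runbook.yaml` for the real schema:

```
cat >> alert_runbook.yaml <<'EOF'
edkmsdmlay2o0c:        # Grafana rule UID, matched against the webhook payload
  prompt: |            # hypothetical field: rendered and sent to OpenRouter
    Memory on {{ host }} exceeded threshold. Summarize likely causes.
  commands:            # hypothetical field: triage commands (TRIAGE_ENABLE_COMMANDS=1)
    - free -m
EOF
```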

### Dependencies

```
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain
```

### Environment

- `OPENROUTER_API_KEY` – required.
- `OPENROUTER_MODEL` – optional (default `openai/gpt-4o-mini`).
- `RUNBOOK_PATH` – optional (default `alert_runbook.yaml` in repo root).
- `ANSIBLE_HOSTS_PATH` – optional (default `/etc/ansible/hosts`). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
- `OPENROUTER_REFERER` / `OPENROUTER_TITLE` – forwarded headers if needed.
- `TRIAGE_ENABLE_COMMANDS` – set to `1` to let the webhook execute runbook commands (default `0` keeps it in read-only mode).
- `TRIAGE_COMMAND_RUNNER` – `ssh` (default) or `local`. When using ssh, also set `TRIAGE_SSH_USER` and optional `TRIAGE_SSH_OPTIONS` (see the sketch after this list).
- `TRIAGE_COMMAND_TIMEOUT`, `TRIAGE_MAX_COMMANDS`, `TRIAGE_OUTPUT_LIMIT`, `TRIAGE_DEFAULT_OS` – tune execution behavior.
- `TRIAGE_VERBOSE_LOGS` – set to `1` to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
- `TRIAGE_EMAIL_ENABLED` – when `1`, the webhook emails the final LLM summary per alert. Requires `TRIAGE_EMAIL_FROM`, `TRIAGE_EMAIL_TO` (comma-separated), `TRIAGE_SMTP_HOST`, and optional `TRIAGE_SMTP_PORT`, `TRIAGE_SMTP_USER`, `TRIAGE_SMTP_PASSWORD`, `TRIAGE_SMTP_STARTTLS`, `TRIAGE_SMTP_SSL`.
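
A minimal sketch of enabling SSH command execution; the user and options shown
here are placeholders, not defaults shipped with the repo:

```
export TRIAGE_ENABLE_COMMANDS=1    # allow runbook commands to run
export TRIAGE_COMMAND_RUNNER=ssh   # default; 'local' runs on the webhook host
export TRIAGE_SSH_USER=triage      # hypothetical remote user
export TRIAGE_SSH_OPTIONS='-o ConnectTimeout=5'   # optional extra ssh flags
```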

### Running

```
source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```

The server loads the runbook at startup and exposes:

- `POST /alerts` – Grafana webhook target.
- `POST /reload-runbook` – force runbook reload without restarting (see the example below).
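
For example, after editing `alert_runbook.yaml` you can reload it in place,
assuming the server is listening on `localhost:8081` as above:

```
curl -X POST http://localhost:8081/reload-runbook
```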

When `TRIAGE_ENABLE_COMMANDS=1`, the server executes the relevant triage commands
for each alert (via SSH or locally), captures stdout/stderr, and appends the
results to both the OpenRouter prompt and the HTTP response JSON. This lets you
automate evidence gathering directly from the runbook instructions. Use
environment variables to control which user/host the commands target and to
limit timeouts/output size. LangChain powers the multi-turn investigation flow:
the LLM can call the provided tools (`run_local_command`, `run_ssh_command`) to
gather additional evidence until it’s ready to deliver a final summary.

When `/etc/ansible/hosts` (or `ANSIBLE_HOSTS_PATH`) is available, the server
automatically enriches the alert context with SSH metadata (user, host, port,
identity file, and common args) so runbook commands default to using SSH against
the alerting host instead of the webhook server.
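
For instance, an inventory entry like the following is enough for the webhook to
build SSH commands against `unit-02` (the host matches the sample payload below;
the user, port, and key path are hypothetical):

```
cat <<'EOF' | sudo tee -a /etc/ansible/hosts
unit-02 ansible_user=triage ansible_port=22 ansible_ssh_private_key_file=/root/.ssh/id_ed25519
EOF
```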

### Running with Docker Compose

1. Copy `.env.example` to `.env` and fill in your OpenRouter key, email SMTP
   settings, and other toggles.
2. Place any SSH keys the webhook needs inside `./.ssh/` (the compose file
   mounts this directory read-only inside the container).
3. Run `docker compose up -d` to build and launch the webhook. It listens on
   port `8081` by default and uses the mounted `alert_runbook.yaml` plus the
   host `/etc/ansible/hosts`.
4. Use `docker compose logs -f` to watch verbose LangChain output or restart
   with `docker compose restart` when updating the code/runbook; the whole flow
   is condensed below.
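
Condensed, the same flow as a shell session:

```
cp .env.example .env     # then edit in your OpenRouter key and SMTP settings
mkdir -p .ssh            # drop any SSH keys the webhook needs in here
docker compose up -d     # build and launch on port 8081
docker compose logs -f   # follow the (optionally verbose) LangChain output
```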

### Sample payload

```
curl -X POST http://localhost:8081/alerts \
  -H 'Content-Type: application/json' \
  -d '{
    "status": "firing",
    "ruleUid": "edkmsdmlay2o0c",
    "ruleUrl": "http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
    "alerts": [
      {
        "status": "firing",
        "labels": {
          "alertname": "High Mem.",
          "host": "unit-02",
          "rule_uid": "edkmsdmlay2o0c"
        },
        "annotations": {
          "summary": "Memory usage above 95% for 10m",
          "value": "96.2%"
        },
        "startsAt": "2025-09-22T17:20:00Z",
        "endsAt": "0001-01-01T00:00:00Z"
      }
    ]
  }'
```

With a valid OpenRouter key this returns a JSON body containing the LLM summary
per alert plus any unmatched alerts (missing runbook entries or rule UIDs).

### Testing without OpenRouter

Set `OPENROUTER_API_KEY=dummy` and point the DNS entry to a mock (e.g. mitmproxy)
if you need to capture outbound requests. Otherwise, requests will fail fast with
HTTP 502 so Grafana knows the automation needs to be retried.
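
An alternative to DNS redirection, assuming the OpenRouter calls go through
`requests` (which honors the standard proxy variables): run mitmproxy as a
regular proxy and point the webhook process at it before starting it:

```
mitmdump --listen-port 8080 &                                  # capture proxy
export HTTPS_PROXY=http://127.0.0.1:8080
export REQUESTS_CA_BUNDLE=~/.mitmproxy/mitmproxy-ca-cert.pem   # trust mitmproxy's CA
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```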