# ML Log Watcher Utilities

This repository now contains two automation entry points that work together to
triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted
language models.

## 1. `scripts/log_monitor.py`

This existing script queries Elasticsearch indices, pulls a recent window of
logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via
cron/systemd.

```
ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
```
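
For scheduled runs, a crontab entry along these lines works; the interval,
install path, and env file below are illustrative, not part of the repo:

```
# Hypothetical schedule: run the monitor every 15 minutes, sourcing
# credentials (lines of `export ELASTIC_*` / `export OPENROUTER_API_KEY=...`)
# from an env file you maintain yourself.
*/15 * * * * . /opt/mllogwatcher/.env && python3 /opt/mllogwatcher/scripts/log_monitor.py --index 'log*' --minutes 15 >> /var/log/log_monitor.log 2>&1
```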

## 2. `scripts/grafana_alert_webhook.py`

A FastAPI web server that accepts Grafana alert webhooks, finds the matching
entry in `alert_runbook.yaml`, renders the LLM prompt, and posts it to
OpenRouter. The response text is returned to Grafana (or any caller) immediately
so automation can fan out to chat, ticketing, etc.
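
Matching is keyed on the Grafana rule UID (see the sample payload below). As a
loose sketch only: the `prompt` and `commands` field names here are hypothetical
placeholders, so check the shipped `alert_runbook.yaml` for the real schema:

```
cat >> alert_runbook.yaml <<'EOF'
edkmsdmlay2o0c:        # Grafana rule UID, matched against the webhook payload
  prompt: |            # hypothetical field: rendered and sent to OpenRouter
    Memory on {{ host }} exceeded threshold. Summarize likely causes.
  commands:            # hypothetical field: triage commands (TRIAGE_ENABLE_COMMANDS=1)
    - free -m
EOF
```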

### Dependencies

```
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain
```

### Environment

- `OPENROUTER_API_KEY` – required.
- `OPENROUTER_MODEL` – optional (default `openai/gpt-4o-mini`).
- `RUNBOOK_PATH` – optional (default `alert_runbook.yaml` in repo root).
- `ANSIBLE_HOSTS_PATH` – optional (default `/etc/ansible/hosts`). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
- `OPENROUTER_REFERER` / `OPENROUTER_TITLE` – forwarded headers if needed.
- `TRIAGE_ENABLE_COMMANDS` – set to `1` to let the webhook execute runbook commands (default `0` keeps it in read-only mode).
- `TRIAGE_COMMAND_RUNNER` – `ssh` (default) or `local`. When using ssh, also set `TRIAGE_SSH_USER` and optional `TRIAGE_SSH_OPTIONS` (see the sketch after this list).
- `TRIAGE_COMMAND_TIMEOUT`, `TRIAGE_MAX_COMMANDS`, `TRIAGE_OUTPUT_LIMIT`, `TRIAGE_DEFAULT_OS` – tune execution behavior.
- `TRIAGE_VERBOSE_LOGS` – set to `1` to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
- `TRIAGE_EMAIL_ENABLED` – when `1`, the webhook emails the final LLM summary per alert. Requires `TRIAGE_EMAIL_FROM`, `TRIAGE_EMAIL_TO` (comma-separated), `TRIAGE_SMTP_HOST`, and optional `TRIAGE_SMTP_PORT`, `TRIAGE_SMTP_USER`, `TRIAGE_SMTP_PASSWORD`, `TRIAGE_SMTP_STARTTLS`, `TRIAGE_SMTP_SSL`.
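
A minimal sketch of enabling SSH command execution; the user and options shown
here are placeholders, not defaults shipped with the repo:

```
export TRIAGE_ENABLE_COMMANDS=1    # allow runbook commands to run
export TRIAGE_COMMAND_RUNNER=ssh   # default; 'local' runs on the webhook host
export TRIAGE_SSH_USER=triage      # hypothetical remote user
export TRIAGE_SSH_OPTIONS='-o ConnectTimeout=5'   # optional extra ssh flags
```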

### Running

```
source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```

The server loads the runbook at startup and exposes:

- `POST /alerts` – Grafana webhook target.
- `POST /reload-runbook` – force runbook reload without restarting (see the example below).
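
For example, after editing `alert_runbook.yaml` you can reload it in place,
assuming the server is listening on `localhost:8081` as above:

```
curl -X POST http://localhost:8081/reload-runbook
```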

When `TRIAGE_ENABLE_COMMANDS=1`, the server executes the relevant triage commands
for each alert (via SSH or locally), captures stdout/stderr, and appends the
results to both the OpenRouter prompt and the HTTP response JSON. This lets you
automate evidence gathering directly from the runbook instructions. Use
environment variables to control which user/host the commands target and to
limit timeouts/output size. LangChain powers the multi-turn investigation flow:
the LLM can call the provided tools (`run_local_command`, `run_ssh_command`) to
gather additional evidence until it’s ready to deliver a final summary.

When `/etc/ansible/hosts` (or `ANSIBLE_HOSTS_PATH`) is available, the server
automatically enriches the alert context with SSH metadata (user, host, port,
identity file, and common args) so runbook commands default to using SSH against
the alerting host instead of the webhook server.
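
For instance, an inventory entry like the following is enough for the webhook to
build SSH commands against `unit-02` (the host matches the sample payload below;
the user, port, and key path are hypothetical):

```
cat <<'EOF' | sudo tee -a /etc/ansible/hosts
unit-02 ansible_user=triage ansible_port=22 ansible_ssh_private_key_file=/root/.ssh/id_ed25519
EOF
```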

### Running with Docker Compose

1. Copy `.env.example` to `.env` and fill in your OpenRouter key, email SMTP
   settings, and other toggles.
2. Place any SSH keys the webhook needs inside `./.ssh/` (the compose file
   mounts this directory read-only inside the container).
3. Run `docker compose up -d` to build and launch the webhook. It listens on
   port `8081` by default and uses the mounted `alert_runbook.yaml` plus the
   host `/etc/ansible/hosts`.
4. Use `docker compose logs -f` to watch verbose LangChain output or restart
   with `docker compose restart` when updating the code/runbook; the whole flow
   is condensed below.
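
Condensed, the same flow as a shell session:

```
cp .env.example .env     # then edit in your OpenRouter key and SMTP settings
mkdir -p .ssh            # drop any SSH keys the webhook needs in here
docker compose up -d     # build and launch on port 8081
docker compose logs -f   # follow the (optionally verbose) LangChain output
```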

### Sample payload

```
curl -X POST http://localhost:8081/alerts \
  -H 'Content-Type: application/json' \
  -d '{
    "status": "firing",
    "ruleUid": "edkmsdmlay2o0c",
    "ruleUrl": "http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
    "alerts": [
      {
        "status": "firing",
        "labels": {
          "alertname": "High Mem.",
          "host": "unit-02",
          "rule_uid": "edkmsdmlay2o0c"
        },
        "annotations": {
          "summary": "Memory usage above 95% for 10m",
          "value": "96.2%"
        },
        "startsAt": "2025-09-22T17:20:00Z",
        "endsAt": "0001-01-01T00:00:00Z"
      }
    ]
  }'
```

With a valid OpenRouter key this returns a JSON body containing the LLM summary
per alert plus any unmatched alerts (missing runbook entries or rule UIDs).

### Testing without OpenRouter

Set `OPENROUTER_API_KEY=dummy` and point the DNS entry to a mock (e.g. mitmproxy)
if you need to capture outbound requests. Otherwise, requests will fail fast with
HTTP 502 so Grafana knows the automation needs to be retried.
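
An alternative to DNS redirection, assuming the OpenRouter calls go through
`requests` (which honors the standard proxy variables): run mitmproxy as a
regular proxy and point the webhook process at it before starting it:

```
mitmdump --listen-port 8080 &                                  # capture proxy
export HTTPS_PROXY=http://127.0.0.1:8080
export REQUESTS_CA_BUNDLE=~/.mitmproxy/mitmproxy-ca-cert.pem   # trust mitmproxy's CA
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```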