Add dev stacks
stacks/mllogwatcher/.dockerignore (Normal file)
@@ -0,0 +1,7 @@
.venv
__pycache__/
*.pyc
.git
.gitignore
.env
tmp/
stacks/mllogwatcher/.env.example (Normal file)
@@ -0,0 +1,14 @@
OPENROUTER_API_KEY=
OPENROUTER_MODEL=openai/gpt-5.2-codex-max
TRIAGE_ENABLE_COMMANDS=1
TRIAGE_COMMAND_RUNNER=local
TRIAGE_VERBOSE_LOGS=1
TRIAGE_EMAIL_ENABLED=1
TRIAGE_EMAIL_FROM=alertai@example.com
TRIAGE_EMAIL_TO=admin@example.com
TRIAGE_SMTP_HOST=smtp.example.com
TRIAGE_SMTP_PORT=465
TRIAGE_SMTP_USER=alertai@example.com
TRIAGE_SMTP_PASSWORD=
TRIAGE_SMTP_SSL=1
TRIAGE_SMTP_STARTTLS=0
stacks/mllogwatcher/.env.template (Normal file)
@@ -0,0 +1,14 @@
OPENROUTER_API_KEY=
OPENROUTER_MODEL=openai/gpt-5.2-codex-max
TRIAGE_ENABLE_COMMANDS=1
TRIAGE_COMMAND_RUNNER=local
TRIAGE_VERBOSE_LOGS=1
TRIAGE_EMAIL_ENABLED=1
TRIAGE_EMAIL_FROM=alertai@example.com
TRIAGE_EMAIL_TO=admin@example.com
TRIAGE_SMTP_HOST=smtp.example.com
TRIAGE_SMTP_PORT=465
TRIAGE_SMTP_USER=alertai@example.com
TRIAGE_SMTP_PASSWORD=
TRIAGE_SMTP_SSL=1
TRIAGE_SMTP_STARTTLS=0
stacks/mllogwatcher/Dockerfile (Normal file)
@@ -0,0 +1,20 @@
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /var/core/mlLogWatcher

RUN apt-get update && \
    apt-get install -y --no-install-recommends openssh-client && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY alert_runbook.yaml ./alert_runbook.yaml
COPY scripts ./scripts

EXPOSE 8081

CMD ["uvicorn", "scripts.grafana_alert_webhook:app", "--host", "0.0.0.0", "--port", "8081"]
stacks/mllogwatcher/README.md (Executable file)
@@ -0,0 +1,120 @@
# ML Log Watcher Utilities

This repository now contains two automation entry points that work together to triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted language models.

## 1. `scripts/log_monitor.py`

Existing script that queries Elasticsearch indices, pulls a recent window of logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via cron/systemd.

```
ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
```

## 2. `scripts/grafana_alert_webhook.py`

A FastAPI web server that accepts Grafana alert webhooks, finds the matching entry in `alert_runbook.yaml`, renders the LLM prompt, and posts it to OpenRouter. The response text is returned to Grafana (or any caller) immediately so automation can fan out to chat, ticketing, etc.
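
Prompt rendering uses simple `{{ name }}` placeholders filled from the alert context (the script compiles a `_TEMPLATE_PATTERN` regex for this). A minimal sketch of that rendering step; the exact behavior for unknown placeholders is an assumption here, the regex itself matches the one in the script:

```python
import re

# Same pattern as _TEMPLATE_PATTERN in scripts/grafana_alert_webhook.py.
TEMPLATE_PATTERN = re.compile(r"{{\s*([a-zA-Z0-9_]+)\s*}}")

def render(template: str, context: dict) -> str:
    # Replace each {{ name }} with the context value; leave unknown names as-is
    # (assumed behavior for this sketch).
    return TEMPLATE_PATTERN.sub(
        lambda m: str(context.get(m.group(1), m.group(0))), template
    )

prompt = render(
    "Alert {{ alertname }} fired for {{ host }}.",
    {"alertname": "High Mem.", "host": "unit-02"},
)
print(prompt)  # Alert High Mem. fired for unit-02.
```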
### Dependencies

```
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain
```

### Environment

- `OPENROUTER_API_KEY` – required.
- `OPENROUTER_MODEL` – optional (default `openai/gpt-4o-mini`).
- `RUNBOOK_PATH` – optional (default `alert_runbook.yaml` in repo root).
- `ANSIBLE_HOSTS_PATH` – optional (default `/etc/ansible/hosts`). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
- `OPENROUTER_REFERER` / `OPENROUTER_TITLE` – forwarded headers if needed.
- `TRIAGE_ENABLE_COMMANDS` – set to `1` to let the webhook execute runbook commands (default `0` keeps it in read-only mode).
- `TRIAGE_COMMAND_RUNNER` – `ssh` (default) or `local`. When using ssh, also set `TRIAGE_SSH_USER` and optional `TRIAGE_SSH_OPTIONS`.
- `TRIAGE_COMMAND_TIMEOUT`, `TRIAGE_MAX_COMMANDS`, `TRIAGE_OUTPUT_LIMIT`, `TRIAGE_DEFAULT_OS` – tune execution behavior.
- `TRIAGE_VERBOSE_LOGS` – set to `1` to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
- `TRIAGE_EMAIL_ENABLED` – when `1`, the webhook emails the final LLM summary per alert. Requires `TRIAGE_EMAIL_FROM`, `TRIAGE_EMAIL_TO` (comma-separated), `TRIAGE_SMTP_HOST`, and optional `TRIAGE_SMTP_PORT`, `TRIAGE_SMTP_USER`, `TRIAGE_SMTP_PASSWORD`, `TRIAGE_SMTP_STARTTLS`, `TRIAGE_SMTP_SSL`.
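
The boolean toggles above are parsed with a simple truthy-string check, the same pattern `scripts/grafana_alert_webhook.py` uses for the `TRIAGE_*` flags; the `env_flag` helper name below is just for this sketch:

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # "1", "true", "yes", "on" (any case) count as enabled, matching the
    # membership check used in the webhook script.
    return os.environ.get(name, default).lower() in {"1", "true", "yes", "on"}

os.environ["TRIAGE_ENABLE_COMMANDS"] = "yes"
print(env_flag("TRIAGE_ENABLE_COMMANDS"))        # True
print(env_flag("TRIAGE_SMTP_STARTTLS", "1"))     # True (default enabled)
```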
### Running

```
source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```

The server loads the runbook at startup and exposes:

- `POST /alerts` – Grafana webhook target.
- `POST /reload-runbook` – force runbook reload without restarting.

When `TRIAGE_ENABLE_COMMANDS=1`, the server executes the relevant triage commands for each alert (via SSH or locally), captures stdout/stderr, and appends the results to both the OpenRouter prompt and the HTTP response JSON. This lets you automate evidence gathering directly from the runbook instructions. Use environment variables to control which user/host the commands target and to limit timeouts/output size. LangChain powers the multi-turn investigation flow: the LLM can call the provided tools (`run_local_command`, `run_ssh_command`) to gather additional evidence until it's ready to deliver a final summary.

When `/etc/ansible/hosts` (or `ANSIBLE_HOSTS_PATH`) is available the server automatically enriches the alert context with SSH metadata (user, host, port, identity file, and common args) so runbook commands default to using SSH against the alerting host instead of the webhook server.

### Running with Docker Compose

1. Copy `.env.example` to `.env` and fill in your OpenRouter key, email SMTP settings, and other toggles.
2. Place any SSH keys the webhook needs inside `./.ssh/` (the compose file mounts this directory read-only inside the container).
3. Run `docker compose up -d` to build and launch the webhook. It listens on port `8081` by default and uses the mounted `alert_runbook.yaml` plus the host `/etc/ansible/hosts`.
4. Use `docker compose logs -f` to watch verbose LangChain output or restart with `docker compose restart` when updating the code/runbook.

### Sample payload

```
curl -X POST http://localhost:8081/alerts \
  -H 'Content-Type: application/json' \
  -d '{
    "status":"firing",
    "ruleUid":"edkmsdmlay2o0c",
    "ruleUrl":"http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
    "alerts":[
      {
        "status":"firing",
        "labels":{
          "alertname":"High Mem.",
          "host":"unit-02",
          "rule_uid":"edkmsdmlay2o0c"
        },
        "annotations":{
          "summary":"Memory usage above 95% for 10m",
          "value":"96.2%"
        },
        "startsAt":"2025-09-22T17:20:00Z",
        "endsAt":"0001-01-01T00:00:00Z"
      }
    ]
  }'
```
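
The same alert body can be built and sent from Python with `requests` (already a declared dependency); a sketch, assuming the webhook is listening on `localhost:8081` (the live call is shown commented out so the snippet runs standalone):

```python
import json

# With requests installed and the webhook running, posting is one line:
#   requests.post("http://localhost:8081/alerts", json=payload, timeout=30)
payload = {
    "status": "firing",
    "ruleUid": "edkmsdmlay2o0c",
    "ruleUrl": "http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "High Mem.",
                "host": "unit-02",
                "rule_uid": "edkmsdmlay2o0c",
            },
            "annotations": {
                "summary": "Memory usage above 95% for 10m",
                "value": "96.2%",
            },
            "startsAt": "2025-09-22T17:20:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
        }
    ],
}

print(json.dumps(payload)[:40])
```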
With a valid OpenRouter key this returns a JSON body containing the LLM summary per alert plus any unmatched alerts (missing runbook entries or rule UIDs).

### Testing without OpenRouter

Set `OPENROUTER_API_KEY=dummy` and point the DNS entry to a mock (e.g. mitmproxy) if you need to capture outbound requests. Otherwise, requests will fail fast with HTTP 502 so Grafana knows the automation needs to be retried.
stacks/mllogwatcher/alert_runbook.yaml (Executable file)
@@ -0,0 +1,254 @@
# Grafana alert triage playbook for the HomeLab telemetry stack.
# Each entry contains the alert metadata, what the signal means,
# the evidence to capture automatically, and the manual / scripted steps.
metadata:
  generated: "2025-09-22T00:00:00Z"
  grafana_url: "http://casper:3000"
  datasource: "InfluxDB telegraf (uid=P951FEA4DE68E13C5)"
  llm_provider: "OpenRouter"
alerts:
  - name: "Data Stale"
    rule_uid: "fdk9orif6fytcf"
    description: "No CPU usage_user metrics have arrived for non-unit hosts within 5 minutes."
    signal:
      metric: "cpu.usage_user"
      condition: "count(host samples over 5m) < 1"
      impact: "Host is no longer reporting to Telegraf/Influx -> monitoring blind spot."
    evidence_to_collect:
      - "Influx: `from(bucket:\"telegraf\") |> range(start:-10m) |> filter(fn:(r)=>r._measurement==\"cpu\" and r.host==\"{{ host }}\") |> count()`"
      - "Telegraf log tail"
      - "System journal for network/auth errors"
    triage:
      - summary: "Verify Telegraf agent health"
        linux: "sudo systemctl status telegraf && sudo journalctl -u telegraf -n 100"
        windows: "Get-Service telegraf; Get-Content 'C:\\Program Files\\telegraf\\telegraf.log' -Tail 100"
      - summary: "Check connectivity from host to Influx (`casper:8086`)"
        linux: "curl -sSf http://casper:8086/ping"
        windows: "Invoke-WebRequest -UseBasicParsing http://casper:8086/ping"
      - summary: "Confirm host clock drift <5s (important for Influx line protocol timestamps)"
        linux: "chronyc tracking"
        windows: "w32tm /query /status"
    remediation:
      - "Restart Telegraf after config validation: `sudo telegraf --test --config /etc/telegraf/telegraf.conf` then `sudo systemctl restart telegraf`."
      - "Re-apply Ansible telemetry playbook if multiple hosts fail."
    llm_prompt: >
      Alert {{ alertname }} fired for {{ host }}. Telegraf stopped sending cpu.usage_user metrics. Given the collected logs and command output, identify root causes (agent down, auth failures, firewall, time skew) and list the next action.

  - name: "High CPU"
    rule_uid: "fdkms407ubmdcc"
    description: "Mean CPU usage_system over the last 10 minutes exceeds 85%."
    signal:
      metric: "cpu.usage_system"
      condition: "mean over 10m > 85%"
      impact: "Host is near saturation; scheduler latency and queueing likely."
    evidence_to_collect:
      - "Top CPU processes snapshot (Linux: `ps -eo pid,cmd,%cpu --sort=-%cpu | head -n 15`; Windows: `Get-Process | Sort-Object CPU -Descending | Select -First 15`)"
      - "Load vs CPU core count"
      - "Recent deploys / cron jobs metadata"
    triage:
      - summary: "Confirm sustained CPU pressure"
        linux: "uptime && mpstat 1 5"
        windows: "typeperf \"\\Processor(_Total)\\% Processor Time\" -sc 15"
      - summary: "Check offending processes/services"
        linux: "sudo ps -eo pid,user,comm,%cpu,%mem --sort=-%cpu | head"
        windows: "Get-Process | Sort-Object CPU -Descending | Select -First 10 Name,CPU"
      - summary: "Inspect cgroup / VM constraints if on Proxmox"
        linux: "sudo pct status {{ vmid }} && sudo pct config {{ vmid }}"
    remediation:
      - "Throttle or restart runaway service; scale workload or tune limits."
      - "Consider moving noisy neighbors off shared hypervisor."
    llm_prompt: >
      High CPU alert for {{ host }}. Review process table, recent deploys, and virtualization context; determine why cpu.usage_system stayed above 85% and recommend mitigation.

  - name: "High Mem."
    rule_uid: "edkmsdmlay2o0c"
    description: "Mean memory used_percent over 10 minutes > 95% (excluding hosts jhci/nerv*/magi*)."
    signal:
      metric: "mem.used_percent"
      condition: "mean over 10m > 95%"
      impact: "OOM risk and swap thrash."
    evidence_to_collect:
      - "Free/available memory snapshot"
      - "Top consumers (Linux: `sudo smem -rt rss | head`; Windows: `Get-Process | Sort-Object WorkingSet -Descending`)"
      - "Swap in/out metrics"
    triage:
      - summary: "Validate actual memory pressure"
        linux: "free -m && vmstat -SM 5 5"
        windows: "Get-Counter '\\Memory\\Available MBytes'"
      - summary: "Identify leaking services"
        linux: "sudo ps -eo pid,user,comm,%mem,rss --sort=-%mem | head"
        windows: "Get-Process | Sort-Object WS -Descending | Select -First 10 ProcessName,WS"
      - summary: "Check recent kernel/OOM logs"
        linux: "sudo dmesg | tail -n 50"
        windows: "Get-WinEvent -LogName System -MaxEvents 50 | ? { $_.Message -match 'memory' }"
    remediation:
      - "Restart or reconfigure offender; add swap as stop-gap; increase VM memory allocation."
    llm_prompt: >
      High Mem alert for {{ host }}. After reviewing free memory, swap activity, and top processes, explain the likely cause and propose remediation steps with priority.

  - name: "High Disk IO"
    rule_uid: "bdkmtaru7ru2od"
    description: "Mean merged_reads/writes per second converted to GB/s exceeds 10."
    signal:
      metric: "diskio.merged_reads + merged_writes"
      condition: "mean over 10m > 10 GB/s"
      impact: "Storage controller saturated; latency spikes, possible backlog."
    evidence_to_collect:
      - "iostat extended output"
      - "Process level IO (pidstat/nethogs equivalent)"
      - "ZFS/MDADM status for relevant pools"
    triage:
      - summary: "Inspect device queues"
        linux: "iostat -xzd 5 3"
        windows: "Get-WmiObject -Class Win32_PerfFormattedData_PerfDisk_LogicalDisk | Format-Table Name,DiskWritesPersec,DiskReadsPersec,AvgDisksecPerTransfer"
      - summary: "Correlate to filesystem / VM"
        linux: "sudo lsof +D /mnt/critical -u {{ user }}"
      - summary: "Check backup or replication windows"
        linux: "journalctl -u pvebackup -n 50"
    remediation:
      - "Pause heavy jobs, move backups off-peak, evaluate faster storage tiers."
    llm_prompt: >
      High Disk IO on {{ host }}. With iostat/pidstat output provided, decide whether activity is expected (backup, scrub) or abnormal and list mitigations.

  - name: "Low Uptime"
    rule_uid: "ddkmuadxvkm4ge"
    description: "System uptime converted to minutes is below 10 -> host rebooted recently."
    signal:
      metric: "system.uptime"
      condition: "last uptime_minutes < 10"
      impact: "Unexpected reboot or crash; may need RCA."
    evidence_to_collect:
      - "Boot reason logs"
      - "Last patch/maintenance window from Ansible inventory"
      - "Smart log excerpt for power events"
    triage:
      - summary: "Confirm uptime and reason"
        linux: "uptime && last -x | head"
        windows: "Get-WinEvent -LogName System -MaxEvents 50 | ? { $_.Id -in 41,6006,6008 }"
      - summary: "Check kernel panic or watchdog traces"
        linux: "sudo journalctl -k -b -1 | tail -n 200"
      - summary: "Validate patch automation logs"
        linux: "sudo tail -n 100 /var/log/ansible-pull.log"
    remediation:
      - "Schedule deeper diagnostics if crash; reschedule workloads once stable."
    llm_prompt: >
      Low Uptime alert: host restarted within 10 minutes. Inspect boot reason logs and recommend whether this is maintenance or a fault needing follow-up.

  - name: "High Load"
    rule_uid: "ddkmul9x8gcn4d"
    description: "system.load5 > 6 for 5 minutes."
    signal:
      metric: "system.load5"
      condition: "last value > 6"
      impact: "Runnable queue more than CPU threads -> latency growth."
    evidence_to_collect:
      - "Load vs CPU count (`nproc`)"
      - "Process states (D/R blocked tasks)"
      - "IO wait percentage"
    triage:
      - summary: "Correlate load to CPU and IO"
        linux: "uptime && vmstat 1 5"
      - summary: "Identify stuck IO"
        linux: "sudo pidstat -d 1 5"
      - summary: "Check Proxmox scheduler for resource contention"
        linux: "pveperf && qm list"
    remediation:
      - "Reduce cron concurrency, add CPU, or fix IO bottleneck causing runnable queue growth."
    llm_prompt: >
      High Load alert on {{ host }}. Based on vmstat/pidstat output, explain whether CPU saturation, IO wait, or runnable pile-up is at fault and propose actions.

  - name: "High Network Traffic (Download)"
    rule_uid: "cdkpct82a7g8wd"
    description: "Derivative of bytes_recv > 50 MB/s on any interface over last hour."
    signal:
      metric: "net.bytes_recv"
      condition: "mean download throughput > 50 MB/s"
      impact: "Link saturation, potential DDoS or backup window."
    evidence_to_collect:
      - "Interface counters (Linux: `ip -s link show {{ iface }}`; Windows: `Get-NetAdapterStatistics`)"
      - "Top talkers (Linux: `sudo nethogs {{ iface }}` or `iftop -i {{ iface }}`)"
      - "Firewall/IDS logs"
    triage:
      - summary: "Confirm interface experiencing spike"
        linux: "sar -n DEV 1 5 | grep {{ iface }}"
        windows: "Get-Counter -Counter '\\Network Interface({{ iface }})\\Bytes Received/sec' -Continuous -SampleInterval 1 -MaxSamples 5"
      - summary: "Identify process or remote peer"
        linux: "sudo ss -ntu state established | sort -k4"
        windows: "Get-NetTCPConnection | Sort-Object -Property LocalPort"
    remediation:
      - "Throttle offending transfers, move backup replication, verify no compromised service."
    llm_prompt: >
      High download throughput on {{ host }} interface {{ iface }}. Review interface counters and connection list to determine if traffic is expected and advise throttling or blocking steps.

  - name: "High Network Traffic (Upload)"
    rule_uid: "aec650pbtvzswa"
    description: "Derivative of bytes_sent > 30 MB/s for an interface."
    signal:
      metric: "net.bytes_sent"
      condition: "mean upload throughput > 30 MB/s"
      impact: "Excess upstream usage; may saturate ISP uplink."
    evidence_to_collect:
      - "Interface statistics"
      - "NetFlow sample if available (`/var/log/telegraf/netflow.log`)"
      - "List of active transfers"
    triage:
      - summary: "Measure upload curve"
        linux: "bmon -p {{ iface }} -o ascii"
        windows: "Get-Counter '\\Network Interface({{ iface }})\\Bytes Sent/sec' -Continuous -SampleInterval 1 -MaxSamples 5"
      - summary: "Find process generating traffic"
        linux: "sudo iftop -i {{ iface }} -t -s 30"
        windows: "Get-NetAdapterStatistics -Name {{ iface }}"
    remediation:
      - "Pause replication jobs, confirm backups not stuck, search for data exfiltration."
    llm_prompt: >
      High upload alert for {{ host }} interface {{ iface }}. Using captured traffic samples, determine whether replication/backup explains the pattern or if anomalous traffic needs blocking.

  - name: "High Disk Usage"
    rule_uid: "cdma6i5k2gem8d"
    description: "Disk used_percent >= 95% for Linux devices (filters out unwanted devices)."
    signal:
      metric: "disk.used_percent"
      condition: "last value > 95%"
      impact: "Filesystem full -> service crashes or write failures."
    evidence_to_collect:
      - "`df -h` or `Get-Volume` output for device"
      - "Largest directories snapshot (Linux: `sudo du -xhd1 /path`; Windows: `Get-ChildItem | Sort Length`)"
      - "Recent deploy or backup expansion logs"
    triage:
      - summary: "Validate usage"
        linux: "df -h {{ mountpoint }}"
        windows: "Get-Volume -FileSystemLabel {{ volume }}"
      - summary: "Identify growth trend"
        linux: "sudo journalctl -u telegraf -g 'disk usage' -n 20"
      - summary: "Check for stale docker volumes"
        linux: "docker system df && docker volume ls"
    remediation:
      - "Prune temp artifacts, expand disk/VM, move logs to remote storage."
    llm_prompt: >
      High Disk Usage alert on {{ host }} device {{ device }}. Summarize what consumed the space and recommend reclaim or expansion actions with priority.

  - name: "CPU Heartbeat"
    rule_uid: "eec62gqn3oetcf"
    description: "Counts cpu.usage_system samples per host; fires if <1 sample arrives within window."
    signal:
      metric: "cpu.usage_system"
      condition: "sample count within 10m < 1"
      impact: "Indicates host stopped reporting metrics entirely (telemetry silent)."
    evidence_to_collect:
      - "Influx query for recent cpu samples"
      - "Telegraf service and logs"
      - "Network reachability from host to casper"
    triage:
      - summary: "Check host alive and reachable"
        linux: "ping -c 3 {{ host }} && ssh {{ host }} uptime"
        windows: "Test-Connection {{ host }} -Count 3"
      - summary: "Inspect Telegraf state"
        linux: "sudo systemctl status telegraf && sudo tail -n 100 /var/log/telegraf/telegraf.log"
        windows: "Get-Service telegraf; Get-EventLog -LogName Application -Newest 50 | ? { $_.Source -match 'Telegraf' }"
      - summary: "Validate API key / Influx auth"
        linux: "sudo grep -n 'outputs.influxdb' /etc/telegraf/telegraf.conf"
    remediation:
      - "Re-issue Telegraf credentials, run `ansible-playbook telemetry.yml -l {{ host }}`."
      - "If host intentionally offline, silence alert via Grafana maintenance window."
    llm_prompt: >
      CPU Heartbeat for {{ host }} indicates telemetry silent. Use connectivity tests and Telegraf logs to determine if host is down or just metrics disabled; propose fixes.
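At startup the webhook indexes the runbook above by `rule_uid` (see `load_runbook` in `scripts/grafana_alert_webhook.py`); a minimal sketch of that indexing with an inline dict standing in for the parsed YAML:

```python
# Shape mirrors alert_runbook.yaml: a top-level "alerts" list of entries.
runbook = {
    "alerts": [
        {"name": "High Mem.", "rule_uid": "edkmsdmlay2o0c"},
        {"name": "Data Stale", "rule_uid": "fdk9orif6fytcf"},
        {"name": "No UID entry"},  # entries without rule_uid are skipped
    ]
}

# Same loop as load_runbook: key entries by their stringified rule_uid.
index = {}
for entry in runbook.get("alerts", []):
    uid = entry.get("rule_uid")
    if uid:
        index[str(uid)] = entry

print(sorted(index))  # ['edkmsdmlay2o0c', 'fdk9orif6fytcf']
```

Incoming Grafana payloads carry `ruleUid`, so matching an alert to its runbook entry is a dict lookup.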
stacks/mllogwatcher/docker-compose.yml (Normal file)
@@ -0,0 +1,14 @@
version: "3.9"

services:
  grafana-alert-webhook:
    build: .
    env_file:
      - .env
    ports:
      - "8081:8081"
    volumes:
      - ./alert_runbook.yaml:/var/core/mlLogWatcher/alert_runbook.yaml:ro
      - /etc/ansible/hosts:/etc/ansible/hosts:ro
      - ./.ssh:/var/core/mlLogWatcher/.ssh:ro
    restart: unless-stopped
stacks/mllogwatcher/requirements.txt (Normal file)
@@ -0,0 +1,5 @@
fastapi==0.115.5
uvicorn[standard]==0.32.0
pyyaml==6.0.2
requests==2.32.3
langchain==0.2.15
988
stacks/mllogwatcher/scripts/grafana_alert_webhook.py
Executable file
988
stacks/mllogwatcher/scripts/grafana_alert_webhook.py
Executable file
@@ -0,0 +1,988 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Minimal FastAPI web server that accepts Grafana alert webhooks, looks up the
|
||||
matching runbook entry, builds an LLM prompt, and calls OpenRouter to return a
|
||||
triage summary.
|
||||
|
||||
Run with:
|
||||
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
|
||||
|
||||
Environment variables:
|
||||
RUNBOOK_PATH Path to alert_runbook.yaml (default: ./alert_runbook.yaml)
|
||||
OPENROUTER_API_KEY Required; API token for https://openrouter.ai
|
||||
OPENROUTER_MODEL Optional; default openai/gpt-4o-mini
|
||||
OPENROUTER_REFERER Optional referer header
|
||||
OPENROUTER_TITLE Optional title header (default: Grafana Alert Webhook)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import Any, Dict, List, Optional, Tuple
|
||||
import json
|
||||
import shlex
|
||||
import subprocess
|
||||
from textwrap import indent
|
||||
import smtplib
|
||||
from email.message import EmailMessage
|
||||
|
||||
import requests
|
||||
import yaml
|
||||
from fastapi import FastAPI, HTTPException, Request
|
||||
from langchain.llms.base import LLM
|
||||
|
||||
LOGGER = logging.getLogger("grafana_webhook")
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
||||
|
||||
RUNBOOK_PATH = Path(os.environ.get("RUNBOOK_PATH", "alert_runbook.yaml"))
|
||||
ANSIBLE_HOSTS_PATH = Path(os.environ.get("ANSIBLE_HOSTS_PATH", "/etc/ansible/hosts"))
|
||||
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
|
||||
OPENROUTER_MODEL = os.environ.get("OPENROUTER_MODEL", "openai/gpt-4o-mini")
|
||||
OPENROUTER_REFERER = os.environ.get("OPENROUTER_REFERER")
|
||||
OPENROUTER_TITLE = os.environ.get("OPENROUTER_TITLE", "Grafana Alert Webhook")
|
||||
|
||||
TRIAGE_ENABLE_COMMANDS = os.environ.get("TRIAGE_ENABLE_COMMANDS", "0").lower() in {"1", "true", "yes", "on"}
|
||||
TRIAGE_COMMAND_RUNNER = os.environ.get("TRIAGE_COMMAND_RUNNER", "ssh").lower()
|
||||
TRIAGE_SSH_USER = os.environ.get("TRIAGE_SSH_USER", "root")
|
||||
TRIAGE_SSH_OPTIONS = shlex.split(
|
||||
os.environ.get("TRIAGE_SSH_OPTIONS", "-o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout=5")
|
||||
)
|
||||
TRIAGE_COMMAND_TIMEOUT = int(os.environ.get("TRIAGE_COMMAND_TIMEOUT", "60"))
|
||||
TRIAGE_DEFAULT_OS = os.environ.get("TRIAGE_DEFAULT_OS", "linux").lower()
|
||||
TRIAGE_MAX_COMMANDS = int(os.environ.get("TRIAGE_MAX_COMMANDS", "3"))
|
||||
TRIAGE_OUTPUT_LIMIT = int(os.environ.get("TRIAGE_OUTPUT_LIMIT", "1200"))
|
||||
# LangChain-driven investigation loop
|
||||
TRIAGE_MAX_ITERATIONS = int(os.environ.get("TRIAGE_MAX_ITERATIONS", "3"))
|
||||
TRIAGE_FOLLOWUP_MAX_COMMANDS = int(os.environ.get("TRIAGE_FOLLOWUP_MAX_COMMANDS", "4"))
|
||||
TRIAGE_SYSTEM_PROMPT = os.environ.get(
|
||||
"TRIAGE_SYSTEM_PROMPT",
|
||||
(
|
||||
"You are assisting with on-call investigations. Always reply with JSON containing:\n"
|
||||
"analysis: your findings and next steps.\n"
|
||||
"followup_commands: list of command specs (summary, command, optional runner/os) to gather more data.\n"
|
||||
"complete: true when sufficient information is gathered.\n"
|
||||
"Request commands only when more evidence is required."
|
||||
),
|
||||
)
|
||||
TRIAGE_VERBOSE_LOGS = os.environ.get("TRIAGE_VERBOSE_LOGS", "0").lower() in {"1", "true", "yes", "on"}
|
||||
TRIAGE_EMAIL_ENABLED = os.environ.get("TRIAGE_EMAIL_ENABLED", "0").lower() in {"1", "true", "yes", "on"}
|
||||
TRIAGE_EMAIL_FROM = os.environ.get("TRIAGE_EMAIL_FROM")
|
||||
TRIAGE_EMAIL_TO = [addr.strip() for addr in os.environ.get("TRIAGE_EMAIL_TO", "").split(",") if addr.strip()]
|
||||
TRIAGE_SMTP_HOST = os.environ.get("TRIAGE_SMTP_HOST")
|
||||
TRIAGE_SMTP_PORT = int(os.environ.get("TRIAGE_SMTP_PORT", "587"))
|
||||
TRIAGE_SMTP_USER = os.environ.get("TRIAGE_SMTP_USER")
|
||||
TRIAGE_SMTP_PASSWORD = os.environ.get("TRIAGE_SMTP_PASSWORD")
|
||||
TRIAGE_SMTP_STARTTLS = os.environ.get("TRIAGE_SMTP_STARTTLS", "1").lower() in {"1", "true", "yes", "on"}
|
||||
TRIAGE_SMTP_SSL = os.environ.get("TRIAGE_SMTP_SSL", "0").lower() in {"1", "true", "yes", "on"}
|
||||
TRIAGE_SMTP_TIMEOUT = int(os.environ.get("TRIAGE_SMTP_TIMEOUT", "20"))
|
||||
|
||||
|
||||
def log_verbose(title: str, content: Any) -> None:
|
||||
"""Emit structured verbose logs when TRIAGE_VERBOSE_LOGS is enabled."""
|
||||
if not TRIAGE_VERBOSE_LOGS:
|
||||
return
|
||||
if isinstance(content, (dict, list)):
|
||||
text = json.dumps(content, indent=2, sort_keys=True)
|
||||
else:
|
||||
text = str(content)
|
||||
LOGGER.info("%s:\n%s", title, text)
|
||||
|
||||
|
||||
def email_notifications_configured() -> bool:
|
||||
if not TRIAGE_EMAIL_ENABLED:
|
||||
return False
|
||||
if not (TRIAGE_SMTP_HOST and TRIAGE_EMAIL_FROM and TRIAGE_EMAIL_TO):
|
||||
LOGGER.warning(
|
||||
"Email notifications requested but TRIAGE_SMTP_HOST/TRIAGE_EMAIL_FROM/TRIAGE_EMAIL_TO are incomplete."
|
||||
)
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
def format_command_results_for_email(results: List[Dict[str, Any]]) -> str:
|
||||
if not results:
|
||||
return "No automation commands were executed."
|
||||
lines: List[str] = []
|
||||
for result in results:
|
||||
lines.append(f"- {result.get('summary')} [{result.get('status')}] {result.get('command')}")
|
||||
stdout = result.get("stdout")
|
||||
stderr = result.get("stderr")
|
||||
error = result.get("error")
|
||||
if stdout:
|
||||
lines.append(indent(truncate_text(stdout, 800), " stdout: "))
|
||||
if stderr:
|
||||
lines.append(indent(truncate_text(stderr, 800), " stderr: "))
|
||||
if error and result.get("status") != "ok":
|
||||
lines.append(f" error: {error}")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def build_email_body(alert: Dict[str, Any], result: Dict[str, Any], context: Dict[str, Any]) -> str:
|
||||
lines = [
|
||||
f"Alert: {result.get('alertname')} ({result.get('rule_uid')})",
|
||||
f"Host: {result.get('host') or context.get('host')}",
|
||||
f"Status: {alert.get('status')}",
|
||||
f"Value: {alert.get('value') or alert.get('annotations', {}).get('value')}",
|
||||
f"Grafana Rule: {context.get('rule_url')}",
|
||||
"",
|
||||
"LLM Summary:",
|
||||
result.get("llm_summary") or "(no summary returned)",
|
||||
"",
|
||||
"Command Results:",
|
||||
format_command_results_for_email(result.get("command_results") or []),
|
||||
]
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def send_summary_email(alert: Dict[str, Any], result: Dict[str, Any], context: Dict[str, Any]) -> None:
|
||||
if not email_notifications_configured():
|
||||
return
|
||||
subject_host = result.get("host") or context.get("host") or "(unknown host)"
|
||||
subject = f"[Grafana] {result.get('alertname')} - {subject_host}"
|
||||
body = build_email_body(alert, result, context)
|
||||
message = EmailMessage()
|
||||
message["Subject"] = subject
|
||||
message["From"] = TRIAGE_EMAIL_FROM
|
||||
message["To"] = ", ".join(TRIAGE_EMAIL_TO)
|
||||
message.set_content(body)
|
||||
try:
|
||||
smtp_class = smtplib.SMTP_SSL if TRIAGE_SMTP_SSL else smtplib.SMTP
|
||||
with smtp_class(TRIAGE_SMTP_HOST, TRIAGE_SMTP_PORT, timeout=TRIAGE_SMTP_TIMEOUT) as client:
|
||||
if TRIAGE_SMTP_STARTTLS and not TRIAGE_SMTP_SSL:
|
||||
client.starttls()
|
||||
if TRIAGE_SMTP_USER:
|
||||
client.login(TRIAGE_SMTP_USER, TRIAGE_SMTP_PASSWORD or "")
|
||||
client.send_message(message)
|
||||
LOGGER.info("Sent summary email to %s for host %s", ", ".join(TRIAGE_EMAIL_TO), subject_host)
|
||||
except Exception as exc: # pylint: disable=broad-except
|
||||
LOGGER.exception("Failed to send summary email: %s", exc)
|
||||
|
||||
app = FastAPI(title="Grafana Alert Webhook", version="1.0.0")
|
||||
|
||||
_RUNBOOK_INDEX: Dict[str, Dict[str, Any]] = {}
|
||||
_INVENTORY_INDEX: Dict[str, Dict[str, Any]] = {}
|
||||
_INVENTORY_GROUP_VARS: Dict[str, Dict[str, str]] = {}
|
||||
_TEMPLATE_PATTERN = re.compile(r"{{\s*([a-zA-Z0-9_]+)\s*}}")
|
||||
|
||||
|
||||
DEFAULT_SYSTEM_PROMPT = TRIAGE_SYSTEM_PROMPT
|
||||
|
||||
|
||||
class OpenRouterLLM(LLM):
    """LangChain-compatible LLM that calls OpenRouter chat completions."""

    api_key: str
    model_name: str

    def __init__(self, api_key: str, model_name: str, **kwargs: Any) -> None:
        super().__init__(api_key=api_key, model_name=model_name, **kwargs)

    @property
    def _llm_type(self) -> str:
        return "openrouter"

    def __call__(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        return self._call(prompt, stop=stop)

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        payload = {
            "model": self.model_name,
            "messages": [
                {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
        }
        log_verbose("OpenRouter request payload", payload)
        if stop:
            payload["stop"] = stop
        LOGGER.info("Posting to OpenRouter model=%s via LangChain", self.model_name)
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        if OPENROUTER_REFERER:
            headers["HTTP-Referer"] = OPENROUTER_REFERER
        if OPENROUTER_TITLE:
            headers["X-Title"] = OPENROUTER_TITLE
        response = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            json=payload,
            headers=headers,
            timeout=90,
        )
        if response.status_code >= 400:
            try:
                detail = response.json()
            except ValueError:
                detail = response.text
            raise RuntimeError(f"OpenRouter error {response.status_code}: {detail}")
        data = response.json()
        log_verbose("OpenRouter raw response", data)
        choices = data.get("choices")
        if not choices:
            raise RuntimeError("OpenRouter returned no choices")
        return choices[0]["message"]["content"].strip()


def load_runbook() -> Dict[str, Dict[str, Any]]:
    """Load runbook YAML into a dict keyed by rule_uid."""
    if not RUNBOOK_PATH.exists():
        raise FileNotFoundError(f"Runbook file not found: {RUNBOOK_PATH}")
    with RUNBOOK_PATH.open("r", encoding="utf-8") as handle:
        data = yaml.safe_load(handle) or {}
    alerts = data.get("alerts", [])
    index: Dict[str, Dict[str, Any]] = {}
    for entry in alerts:
        uid = entry.get("rule_uid")
        if uid:
            index[str(uid)] = entry
    LOGGER.info("Loaded %d runbook entries from %s", len(index), RUNBOOK_PATH)
    return index


def _normalize_host_key(host: str) -> str:
    return host.strip().lower()


def _parse_key_value_tokens(tokens: List[str]) -> Dict[str, str]:
    data: Dict[str, str] = {}
    for token in tokens:
        if "=" not in token:
            continue
        key, value = token.split("=", 1)
        data[key] = value
    return data


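A standalone sketch of the token parsing above (hypothetical names, mirroring `_parse_key_value_tokens`): tokens without `=` are skipped, and only the first `=` splits, so values such as SSH option strings may themselves contain `=`.

```python
def parse_key_value_tokens(tokens):
    # Keep only "key=value" tokens; split on the first "=" only,
    # so values may themselves contain "=".
    data = {}
    for token in tokens:
        if "=" not in token:
            continue
        key, value = token.split("=", 1)
        data[key] = value
    return data

print(parse_key_value_tokens(["web01", "ansible_host=10.0.0.5", "opts=a=b"]))
# -> {'ansible_host': '10.0.0.5', 'opts': 'a=b'}
```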
def load_ansible_inventory() -> Tuple[Dict[str, Dict[str, Any]], Dict[str, Dict[str, str]]]:
    """Parse a simple INI-style Ansible hosts file into host/group maps."""
    if not ANSIBLE_HOSTS_PATH.exists():
        LOGGER.warning("Ansible inventory not found at %s", ANSIBLE_HOSTS_PATH)
        return {}, {}
    hosts: Dict[str, Dict[str, Any]] = {}
    group_vars: Dict[str, Dict[str, str]] = {}
    current_group: Optional[str] = None
    current_section: str = "hosts"

    with ANSIBLE_HOSTS_PATH.open("r", encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith("[") and line.endswith("]"):
                header = line[1:-1].strip()
                if ":" in header:
                    group_name, suffix = header.split(":", 1)
                    current_group = group_name
                    current_section = suffix
                else:
                    current_group = header
                    current_section = "hosts"
                group_vars.setdefault(current_group, {})
                continue
            cleaned = line.split("#", 1)[0].strip()
            if not cleaned:
                continue
            tokens = shlex.split(cleaned)
            if not tokens:
                continue
            if current_section == "vars":
                vars_dict = _parse_key_value_tokens(tokens)
                group_vars.setdefault(current_group or "all", {}).update(vars_dict)
                continue
            host_token = tokens[0]
            host_key = _normalize_host_key(host_token)
            entry = hosts.setdefault(host_key, {"name": host_token, "definitions": [], "groups": set()})
            vars_dict = _parse_key_value_tokens(tokens[1:])
            entry["definitions"].append({"group": current_group, "vars": vars_dict})
            if current_group:
                entry["groups"].add(current_group)

    LOGGER.info("Loaded %d Ansible inventory hosts from %s", len(hosts), ANSIBLE_HOSTS_PATH)
    return hosts, group_vars


def _lookup_inventory(host: Optional[str]) -> Optional[Dict[str, Any]]:
    if not host:
        return None
    key = _normalize_host_key(host)
    entry = _INVENTORY_INDEX.get(key)
    if entry:
        return entry
    # Try stripping the domain suffix (e.g. "web01.localdomain" -> "web01").
    short = key.split(".", 1)[0]
    if short != key:
        return _INVENTORY_INDEX.get(short)
    return None


def _merge_group_vars(groups: List[str], host_os: Optional[str]) -> Dict[str, str]:
    merged: Dict[str, str] = {}
    global_vars = _INVENTORY_GROUP_VARS.get("all")
    if global_vars:
        merged.update(global_vars)
    normalized_os = (host_os or "").lower()
    for group in groups:
        vars_dict = _INVENTORY_GROUP_VARS.get(group)
        if not vars_dict:
            continue
        connection = (vars_dict.get("ansible_connection") or "").lower()
        if connection == "winrm" and normalized_os == "linux":
            continue
        merged.update(vars_dict)
    return merged


def _should_include_definition(group: Optional[str], vars_dict: Dict[str, str], host_os: Optional[str]) -> bool:
    if not vars_dict:
        return False
    normalized_os = (host_os or "").lower()
    connection = (vars_dict.get("ansible_connection") or "").lower()
    if connection == "winrm" and normalized_os != "windows":
        return False
    if connection == "local":
        return True
    if group and "windows" in group.lower() and normalized_os == "linux" and not connection:
        return False
    return True


def apply_inventory_context(context: Dict[str, Any]) -> None:
    """Augment the alert context with SSH metadata from the Ansible inventory."""
    host = context.get("host")
    entry = _lookup_inventory(host)
    if not entry:
        return
    merged_vars = _merge_group_vars(list(entry.get("groups", [])), context.get("host_os"))
    for definition in entry.get("definitions", []):
        group_name = definition.get("group")
        vars_dict = definition.get("vars", {})
        if _should_include_definition(group_name, vars_dict, context.get("host_os")):
            merged_vars.update(vars_dict)
    ansible_host = merged_vars.get("ansible_host") or entry.get("name")
    ansible_user = merged_vars.get("ansible_user")
    ansible_port = merged_vars.get("ansible_port")
    ssh_common_args = merged_vars.get("ansible_ssh_common_args")
    ssh_key = merged_vars.get("ansible_ssh_private_key_file")
    connection = (merged_vars.get("ansible_connection") or "").lower()
    host_os = (context.get("host_os") or "").lower()
    if connection == "winrm" and host_os != "windows":
        # A WinRM definition leaked onto a non-Windows host; drop the WinRM vars.
        for key in (
            "ansible_connection",
            "ansible_port",
            "ansible_password",
            "ansible_winrm_server_cert_validation",
            "ansible_winrm_scheme",
        ):
            merged_vars.pop(key, None)
        connection = ""

    context.setdefault("ssh_host", ansible_host or host)
    if ansible_user:
        context["ssh_user"] = ansible_user
    if ansible_port:
        context["ssh_port"] = ansible_port
    if ssh_common_args:
        context["ssh_common_args"] = ssh_common_args
    if ssh_key:
        context["ssh_identity_file"] = ssh_key
    context.setdefault("inventory_groups", list(entry.get("groups", [])))
    if connection == "local":
        context.setdefault("preferred_runner", "local")
    elif connection in {"", "ssh", "smart"}:
        context.setdefault("preferred_runner", "ssh")


def render_template(template: str, context: Dict[str, Any]) -> str:
    """Very small mustache-style renderer for {{ var }} placeholders."""
    def replace(match: re.Match[str]) -> str:
        key = match.group(1)
        return str(context.get(key, match.group(0)))

    return _TEMPLATE_PATTERN.sub(replace, template)


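The renderer above can be sketched standalone (the pattern is re-declared locally so the snippet runs on its own); note that placeholders missing from the context are left verbatim rather than erased, which keeps templating failures visible in rendered commands.

```python
import re

_PATTERN = re.compile(r"{{\s*([a-zA-Z0-9_]+)\s*}}")


def render(template, context):
    # Substitute {{ var }} with the context value; unknown
    # placeholders are left untouched.
    return _PATTERN.sub(lambda m: str(context.get(m.group(1), m.group(0))), template)


print(render("systemctl status {{ unit }} on {{ host }}", {"unit": "nginx"}))
# -> systemctl status nginx on {{ host }}
```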
def extract_rule_uid(alert: Dict[str, Any], parent_payload: Dict[str, Any]) -> Optional[str]:
    """Grafana webhooks may include rule UID in different fields."""
    candidates: List[Any] = [
        alert.get("ruleUid"),
        alert.get("rule_uid"),
        alert.get("ruleId"),
        alert.get("uid"),
        alert.get("labels", {}).get("rule_uid"),
        alert.get("labels", {}).get("ruleUid"),
        parent_payload.get("ruleUid"),
        parent_payload.get("rule_uid"),
        parent_payload.get("ruleId"),
    ]
    for candidate in candidates:
        if candidate:
            return str(candidate)
    # Fall back to Grafana URL parsing if present
    url = (
        alert.get("ruleUrl")
        or parent_payload.get("ruleUrl")
        or alert.get("generatorURL")
        or parent_payload.get("generatorURL")
    )
    if url and "/alerting/" in url:
        return url.rstrip("/").split("/")[-2]
    return None


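The URL fallback above assumes Grafana rule links of the form `.../alerting/<source>/<rule_uid>/view` (an assumption about URL shape, not something the payload guarantees), so the second-to-last path segment is the UID. A minimal sketch with a hypothetical URL:

```python
def uid_from_rule_url(url):
    # Assumes .../alerting/<source>/<rule_uid>/view, so the
    # second-to-last path segment is the rule UID.
    return url.rstrip("/").split("/")[-2]


print(uid_from_rule_url("https://grafana.example.com/alerting/grafana/adkq1xz/view"))
# -> adkq1xz
```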
def derive_fallback_rule_uid(alert: Dict[str, Any], parent_payload: Dict[str, Any]) -> str:
    """Construct a deterministic identifier when Grafana omits rule UIDs."""
    labels = alert.get("labels", {})
    candidates = [
        alert.get("fingerprint"),
        labels.get("alertname"),
        labels.get("host"),
        labels.get("instance"),
        parent_payload.get("groupKey"),
        parent_payload.get("title"),
    ]
    for candidate in candidates:
        if candidate:
            return str(candidate)
    return "unknown-alert"


def build_fallback_runbook_entry(alert: Dict[str, Any], parent_payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a generic runbook entry so every alert can be processed."""
    labels = alert.get("labels", {})
    alertname = labels.get("alertname") or parent_payload.get("title") or "Grafana Alert"
    host = labels.get("host") or labels.get("instance") or "(unknown host)"
    return {
        "name": f"{alertname} (auto)",
        "llm_prompt": (
            "Grafana alert {{ alertname }} fired for {{ host }}.\n"
            "No dedicated runbook entry exists. Use the payload details, command outputs, "
            "and your own reasoning to propose likely causes, evidence to gather, and remediation steps."
        ),
        "triage": [],
        "evidence_to_collect": [],
        "remediation": [],
        "metadata": {"host": host},
    }


def summarize_dict(prefix: str, data: Optional[Dict[str, Any]]) -> str:
    if not data:
        return f"{prefix}: (none)"
    parts = ", ".join(f"{key}={value}" for key, value in sorted(data.items()))
    return f"{prefix}: {parts}"


def determine_host_os(alert: Dict[str, Any]) -> str:
    """Infer host operating system from labels or defaults."""
    labels = alert.get("labels", {})
    candidates = [
        labels.get("os"),
        labels.get("platform"),
        labels.get("system"),
        alert.get("os"),
    ]
    for candidate in candidates:
        if candidate:
            value = str(candidate).lower()
            if "win" in value:
                return "windows"
            if any(token in value for token in ("linux", "unix", "darwin")):
                return "linux"
    host = (labels.get("host") or labels.get("instance") or "").lower()
    if host.startswith("win") or (host.endswith(".localdomain") and "win" in host):
        return "windows"
    inventory_os = infer_os_from_inventory(labels.get("host") or labels.get("instance"))
    if inventory_os:
        return inventory_os
    return TRIAGE_DEFAULT_OS


def infer_os_from_inventory(host: Optional[str]) -> Optional[str]:
    if not host:
        return None
    entry = _lookup_inventory(host)
    if not entry:
        return None
    for definition in entry.get("definitions", []):
        vars_dict = definition.get("vars", {}) or {}
        connection = (vars_dict.get("ansible_connection") or "").lower()
        if connection == "winrm":
            return "windows"
    for group in entry.get("groups", []):
        if "windows" in (group or "").lower():
            return "windows"
    return None


def truncate_text(text: str, limit: int = TRIAGE_OUTPUT_LIMIT) -> str:
    """Trim long outputs to keep prompts manageable."""
    if not text:
        return ""
    cleaned = text.strip()
    if len(cleaned) <= limit:
        return cleaned
    return cleaned[:limit] + "... [truncated]"


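The truncation helper can be demonstrated in isolation (with a small hard-coded limit standing in for `TRIAGE_OUTPUT_LIMIT`): output is stripped first, returned whole when short enough, and otherwise cut with an explicit marker so the LLM knows evidence was elided.

```python
def truncate(text, limit=12):
    # Strip, keep short strings whole, and mark anything cut.
    cleaned = text.strip()
    if len(cleaned) <= limit:
        return cleaned
    return cleaned[:limit] + "... [truncated]"


print(truncate("  short  "))        # -> short
print(truncate("x" * 40, limit=5))  # -> xxxxx... [truncated]
```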
def gather_command_specs(entry: Dict[str, Any], host_os: str) -> List[Dict[str, Any]]:
    """Collect command specs from triage steps and optional automation sections."""
    specs: List[Dict[str, Any]] = []
    for step in entry.get("triage", []):
        cmd = step.get(host_os)
        if not cmd:
            continue
        specs.append(
            {
                "summary": step.get("summary") or entry.get("name") or "triage",
                "shell": cmd,
                "runner": step.get("runner"),
                "os": host_os,
            }
        )
    for item in entry.get("automation_commands", []):
        target_os = item.get("os", host_os)
        if target_os and target_os.lower() != host_os:
            continue
        specs.append(item)
    if TRIAGE_MAX_COMMANDS > 0:
        return specs[:TRIAGE_MAX_COMMANDS]
    return specs


def build_runner_command(
    rendered_command: str,
    runner: str,
    context: Dict[str, Any],
    spec: Dict[str, Any],
) -> Tuple[Any, str, bool, str]:
    """Return the subprocess args, display string, shell flag, and runner label."""
    runner = runner or TRIAGE_COMMAND_RUNNER
    runner = runner.lower()
    if runner == "ssh":
        host = spec.get("host") or context.get("ssh_host") or context.get("host")
        if not host:
            raise RuntimeError("Host not provided for ssh runner.")
        ssh_user = spec.get("ssh_user") or context.get("ssh_user") or TRIAGE_SSH_USER
        ssh_target = spec.get("ssh_target") or f"{ssh_user}@{host}"
        ssh_options = list(TRIAGE_SSH_OPTIONS)
        common_args = spec.get("ssh_common_args") or context.get("ssh_common_args")
        if common_args:
            ssh_options.extend(shlex.split(common_args))
        ssh_port = spec.get("ssh_port") or context.get("ssh_port")
        if ssh_port:
            ssh_options.extend(["-p", str(ssh_port)])
        identity_file = spec.get("ssh_identity_file") or context.get("ssh_identity_file")
        if identity_file:
            ssh_options.extend(["-i", identity_file])
        command_list = ["ssh", *ssh_options, ssh_target, rendered_command]
        display = " ".join(shlex.quote(part) for part in command_list)
        return command_list, display, False, "ssh"
    # Default to local shell execution.
    display = rendered_command
    return rendered_command, display, True, "local"


def run_subprocess_command(
    command: Any,
    display: str,
    summary: str,
    use_shell: bool,
    runner_label: str,
) -> Dict[str, Any]:
    """Execute subprocess command and capture results."""
    LOGGER.info("Executing command (%s) via %s: %s", summary, runner_label, display)
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=TRIAGE_COMMAND_TIMEOUT,
            shell=use_shell,
            check=False,
        )
        result = {
            "summary": summary,
            "command": display,
            "runner": runner_label,
            "exit_code": completed.returncode,
            "stdout": (completed.stdout or "").strip(),
            "stderr": (completed.stderr or "").strip(),
            "status": "ok" if completed.returncode == 0 else "failed",
        }
        log_verbose(f"Command result ({summary})", result)
        return result
    except subprocess.TimeoutExpired as exc:
        result = {
            "summary": summary,
            "command": display,
            "runner": runner_label,
            "exit_code": None,
            "stdout": truncate_text((exc.stdout or "").strip()),
            "stderr": truncate_text((exc.stderr or "").strip()),
            "status": "timeout",
            "error": f"Command timed out after {TRIAGE_COMMAND_TIMEOUT}s",
        }
        log_verbose(f"Command timeout ({summary})", result)
        return result
    except Exception as exc:  # pylint: disable=broad-except
        LOGGER.exception("Command execution failed (%s): %s", summary, exc)
        result = {
            "summary": summary,
            "command": display,
            "runner": runner_label,
            "exit_code": None,
            "stdout": "",
            "stderr": "",
            "status": "error",
            "error": str(exc),
        }
        log_verbose(f"Command error ({summary})", result)
        return result


def run_command_spec(spec: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    summary = spec.get("summary") or spec.get("name") or "command"
    shell_cmd = spec.get("shell")
    if not shell_cmd:
        return {"summary": summary, "status": "skipped", "error": "No shell command provided."}
    rendered = render_template(shell_cmd, context)
    preferred_runner = context.get("preferred_runner")
    runner_choice = (spec.get("runner") or preferred_runner or TRIAGE_COMMAND_RUNNER).lower()
    try:
        command, display, use_shell, runner_label = build_runner_command(rendered, runner_choice, context, spec)
    except RuntimeError as exc:
        LOGGER.warning("Skipping command '%s': %s", summary, exc)
        return {"summary": summary, "status": "skipped", "error": str(exc), "command": rendered}
    return run_subprocess_command(command, display, summary, use_shell, runner_label)


def execute_triage_commands(entry: Dict[str, Any], alert: Dict[str, Any], context: Dict[str, Any]) -> List[Dict[str, Any]]:
    host_os = context.get("host_os") or determine_host_os(alert)
    context["host_os"] = host_os
    specs = gather_command_specs(entry, host_os)
    if not specs:
        LOGGER.info("No triage commands defined for host_os=%s", host_os)
        return []
    if not TRIAGE_ENABLE_COMMANDS:
        LOGGER.info("Command execution disabled; %d commands queued but skipped.", len(specs))
        return []
    LOGGER.info("Executing up to %d triage commands for host_os=%s", len(specs), host_os)
    results = []
    for spec in specs:
        results.append(run_command_spec(spec, context))
    return results


def format_command_results_for_llm(results: List[Dict[str, Any]]) -> str:
    lines: List[str] = []
    for idx, result in enumerate(results, start=1):
        lines.append(f"{idx}. {result.get('summary')} [{result.get('status')}] {result.get('command')}")
        stdout = result.get("stdout")
        stderr = result.get("stderr")
        error = result.get("error")
        if stdout:
            lines.append("   stdout:")
            lines.append(indent(truncate_text(stdout), "     "))
        if stderr:
            lines.append("   stderr:")
            lines.append(indent(truncate_text(stderr), "     "))
        if error and result.get("status") != "ok":
            lines.append(f"   error: {error}")
    if not lines:
        return "No command results were available."
    return "\n".join(lines)


def parse_structured_response(text: str) -> Optional[Dict[str, Any]]:
    cleaned = text.strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        start = cleaned.find("{")
        end = cleaned.rfind("}")
        if start != -1 and end != -1 and end > start:
            snippet = cleaned[start : end + 1]
            try:
                return json.loads(snippet)
            except json.JSONDecodeError:
                return None
        return None


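The parsing strategy above is worth illustrating standalone: try the whole LLM reply as JSON first, then fall back to the outermost `{...}` span, which tolerates conversational prose wrapped around the JSON object.

```python
import json


def parse_structured(text):
    # Whole-string parse first; otherwise try the outermost {...} span.
    cleaned = text.strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(cleaned[start:end + 1])
            except json.JSONDecodeError:
                return None
        return None


reply = 'Sure - here is my verdict:\n{"analysis": "disk full", "complete": true}'
print(parse_structured(reply))
# -> {'analysis': 'disk full', 'complete': True}
```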
def normalize_followup_command(item: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "summary": item.get("summary") or item.get("name") or "Follow-up command",
        "shell": item.get("command") or item.get("shell"),
        "runner": item.get("runner"),
        "host": item.get("host") or item.get("target"),
        "ssh_user": item.get("ssh_user"),
        "os": (item.get("os") or item.get("platform") or "").lower() or None,
    }


def investigate_with_langchain(
    entry: Dict[str, Any],
    alert: Dict[str, Any],
    parent_payload: Dict[str, Any],
    context: Dict[str, Any],
    initial_outputs: List[Dict[str, Any]],
) -> Tuple[str, List[Dict[str, Any]]]:
    command_outputs = list(initial_outputs)
    prompt = build_prompt(entry, alert, parent_payload, context, command_outputs)
    log_verbose("Initial investigation prompt", prompt)
    if not OPENROUTER_API_KEY:
        return "OPENROUTER_API_KEY is not configured; unable to analyze alert.", command_outputs

    llm = OpenRouterLLM(api_key=OPENROUTER_API_KEY, model_name=OPENROUTER_MODEL)
    dialogue = (
        prompt
        + "\n\nRespond with JSON containing fields analysis, followup_commands, and complete. "
        "Request commands only when more evidence is required."
    )
    total_followup = 0
    final_summary = ""
    for iteration in range(TRIAGE_MAX_ITERATIONS):
        log_verbose(f"LLM dialogue iteration {iteration + 1}", dialogue)
        llm_text = llm(dialogue)
        log_verbose(f"LLM iteration {iteration + 1} output", llm_text)
        dialogue += f"\nAssistant:\n{llm_text}\n"
        parsed = parse_structured_response(llm_text)
        if parsed:
            log_verbose(f"LLM iteration {iteration + 1} parsed response", parsed)
        if not parsed:
            final_summary = llm_text
            break

        analysis = parsed.get("analysis") or ""
        followups = parsed.get("followup_commands") or parsed.get("commands") or []
        final_summary = analysis
        complete_flag = bool(parsed.get("complete"))

        if complete_flag or not followups:
            break

        log_verbose(f"LLM iteration {iteration + 1} requested follow-ups", followups)
        allowed = max(0, TRIAGE_FOLLOWUP_MAX_COMMANDS - total_followup)
        if not TRIAGE_ENABLE_COMMANDS or allowed <= 0:
            dialogue += (
                "\nUser:\nCommand execution is disabled or budget exhausted. Provide final analysis with JSON format.\n"
            )
            continue

        normalized_cmds: List[Dict[str, Any]] = []
        for raw in followups:
            if not isinstance(raw, dict):
                continue
            normalized = normalize_followup_command(raw)
            if not normalized.get("shell"):
                continue
            cmd_os = normalized.get("os")
            if cmd_os and cmd_os != context.get("host_os"):
                continue
            normalized_cmds.append(normalized)

        log_verbose(f"Normalized follow-up commands (iteration {iteration + 1})", normalized_cmds)
        if not normalized_cmds:
            dialogue += "\nUser:\nNo valid commands to run. Finalize analysis in JSON format.\n"
            continue

        normalized_cmds = normalized_cmds[:allowed]
        executed_batch: List[Dict[str, Any]] = []
        for spec in normalized_cmds:
            executed = run_command_spec(spec, context)
            command_outputs.append(executed)
            executed_batch.append(executed)
            total_followup += 1

        result_text = "Follow-up command results:\n" + format_command_results_for_llm(executed_batch)
        dialogue += (
            "\nUser:\n"
            + result_text
            + "\nUpdate your analysis and respond with JSON (analysis, followup_commands, complete).\n"
        )
        log_verbose("Executed follow-up commands", result_text)
    else:
        # for/else: the loop exhausted TRIAGE_MAX_ITERATIONS without a break.
        final_summary = final_summary or "Reached maximum iterations without a conclusive response."

    if not final_summary:
        final_summary = "LLM did not return a valid analysis."

    log_verbose("Final LLM summary", final_summary)
    return final_summary, command_outputs


def build_context(alert: Dict[str, Any], parent_payload: Dict[str, Any]) -> Dict[str, Any]:
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    context = {
        "alertname": labels.get("alertname") or alert.get("title") or parent_payload.get("title") or parent_payload.get("ruleName"),
        "host": labels.get("host") or labels.get("instance"),
        "iface": labels.get("interface"),
        "device": labels.get("device"),
        "vmid": labels.get("vmid"),
        "status": alert.get("status") or parent_payload.get("status"),
        "value": alert.get("value") or annotations.get("value"),
        "rule_url": alert.get("ruleUrl") or parent_payload.get("ruleUrl"),
    }
    context.setdefault("ssh_user", TRIAGE_SSH_USER)
    return context


def build_prompt(
    entry: Dict[str, Any],
    alert: Dict[str, Any],
    parent_payload: Dict[str, Any],
    context: Dict[str, Any],
    command_outputs: Optional[List[Dict[str, Any]]] = None,
) -> str:
    template = entry.get("llm_prompt", "Alert {{ alertname }} fired for {{ host }}.")
    rendered_template = render_template(template, {k: v or "" for k, v in context.items()})

    evidence = entry.get("evidence_to_collect", [])
    triage_steps = entry.get("triage", [])
    remediation = entry.get("remediation", [])

    lines = [
        rendered_template.strip(),
        "",
        "Alert payload summary:",
        f"- Status: {context.get('status') or alert.get('status')}",
        f"- Host: {context.get('host')}",
        f"- Value: {context.get('value')}",
        f"- StartsAt: {alert.get('startsAt')}",
        f"- EndsAt: {alert.get('endsAt')}",
        f"- RuleURL: {context.get('rule_url')}",
        f"- Host OS (inferred): {context.get('host_os')}",
        "- Note: All timestamps are UTC/RFC3339 as provided by Grafana.",
        summarize_dict("- Labels", alert.get("labels")),
        summarize_dict("- Annotations", alert.get("annotations")),
    ]

    if evidence:
        lines.append("")
        lines.append("Evidence to gather (for automation reference):")
        for item in evidence:
            lines.append(f"- {item}")

    if triage_steps:
        lines.append("")
        lines.append("Suggested manual checks:")
        for step in triage_steps:
            summary = step.get("summary")
            linux = step.get("linux")
            windows = step.get("windows")
            lines.append(f"- {summary}")
            if linux:
                lines.append(f"  Linux: {linux}")
            if windows:
                lines.append(f"  Windows: {windows}")

    if remediation:
        lines.append("")
        lines.append("Remediation ideas:")
        for item in remediation:
            lines.append(f"- {item}")

    if command_outputs:
        lines.append("")
        lines.append("Command execution results:")
        for result in command_outputs:
            status = result.get("status", "unknown")
            cmd_display = result.get("command", "")
            lines.append(f"- {result.get('summary')} [{status}] {cmd_display}")
            stdout = result.get("stdout")
            stderr = result.get("stderr")
            error = result.get("error")
            if stdout:
                lines.append("  stdout:")
                lines.append(indent(truncate_text(stdout), "    "))
            if stderr:
                lines.append("  stderr:")
                lines.append(indent(truncate_text(stderr), "    "))
            if error and status != "ok":
                lines.append(f"  error: {error}")

    return "\n".join(lines).strip()


def get_alerts(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    alerts = payload.get("alerts")
    if isinstance(alerts, list) and alerts:
        return alerts
    return [payload]


@app.on_event("startup")
def startup_event() -> None:
    global _RUNBOOK_INDEX, _INVENTORY_INDEX, _INVENTORY_GROUP_VARS
    _RUNBOOK_INDEX = load_runbook()
    _INVENTORY_INDEX, _INVENTORY_GROUP_VARS = load_ansible_inventory()
    LOGGER.info(
        "Alert webhook server ready with %d runbook entries and %d inventory hosts.",
        len(_RUNBOOK_INDEX),
        len(_INVENTORY_INDEX),
    )


@app.post("/alerts")
async def handle_alert(request: Request) -> Dict[str, Any]:
    payload = await request.json()
    LOGGER.info("Received Grafana payload: %s", json.dumps(payload, indent=2, sort_keys=True))
    results = []
    unmatched = []
    for alert in get_alerts(payload):
        LOGGER.info("Processing alert: %s", json.dumps(alert, indent=2, sort_keys=True))
        unmatched_reason: Optional[str] = None
        alert_status = str(alert.get("status") or payload.get("status") or "").lower()
        if alert_status and alert_status != "firing":
            details = {"reason": "non_firing_status", "status": alert_status, "alert": alert}
            unmatched.append(details)
            LOGGER.info("Skipping alert with status=%s (only 'firing' alerts are processed).", alert_status)
            continue
        rule_uid = extract_rule_uid(alert, payload)
        if not rule_uid:
            unmatched_reason = "missing_rule_uid"
            derived_uid = derive_fallback_rule_uid(alert, payload)
            details = {"reason": unmatched_reason, "derived_rule_uid": derived_uid, "alert": alert}
            unmatched.append(details)
            LOGGER.warning("Alert missing rule UID, using fallback identifier %s", derived_uid)
            rule_uid = derived_uid
        entry = _RUNBOOK_INDEX.get(rule_uid)
        runbook_matched = entry is not None
        if not entry:
            unmatched_reason = unmatched_reason or "no_runbook_entry"
            details = {"reason": unmatched_reason, "rule_uid": rule_uid, "alert": alert}
            unmatched.append(details)
            LOGGER.warning("No runbook entry for rule_uid=%s, using generic fallback.", rule_uid)
            entry = build_fallback_runbook_entry(alert, payload)
        context = build_context(alert, payload)
        context["host_os"] = determine_host_os(alert)
        context["rule_uid"] = rule_uid
        apply_inventory_context(context)
        initial_outputs = execute_triage_commands(entry, alert, context)
        try:
            llm_text, command_outputs = investigate_with_langchain(entry, alert, payload, context, initial_outputs)
        except Exception as exc:  # pylint: disable=broad-except
            LOGGER.exception("Investigation failed for rule_uid=%s: %s", rule_uid, exc)
            raise HTTPException(status_code=502, detail=f"LLM investigation error: {exc}") from exc
        result = {
            "rule_uid": rule_uid,
            "alertname": entry.get("name"),
            "host": alert.get("labels", {}).get("host"),
            "llm_summary": llm_text,
            "command_results": command_outputs,
            "runbook_matched": runbook_matched,
        }
        if not runbook_matched and unmatched_reason:
            result["fallback_reason"] = unmatched_reason
        results.append(result)
        send_summary_email(alert, result, context)
    return {"processed": len(results), "results": results, "unmatched": unmatched}


@app.post("/reload-runbook")
def reload_runbook() -> Dict[str, Any]:
    global _RUNBOOK_INDEX, _INVENTORY_INDEX, _INVENTORY_GROUP_VARS
    _RUNBOOK_INDEX = load_runbook()
    _INVENTORY_INDEX, _INVENTORY_GROUP_VARS = load_ansible_inventory()
    return {"entries": len(_RUNBOOK_INDEX), "inventory_hosts": len(_INVENTORY_INDEX)}
178
stacks/mllogwatcher/scripts/log_monitor.py
Executable file
@@ -0,0 +1,178 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Log anomaly checker that queries Elasticsearch and asks an OpenRouter-hosted LLM
|
||||
for a quick triage summary. Intended to be run on a schedule (cron/systemd).
|
||||
|
||||
Required environment variables:
|
||||
ELASTIC_HOST e.g. https://casper.localdomain:9200
|
||||
ELASTIC_API_KEY Base64 ApiKey used for Elasticsearch requests
|
||||
OPENROUTER_API_KEY Token for https://openrouter.ai/
|
||||
|
||||
Optional environment variables:
|
||||
OPENROUTER_MODEL Model identifier (default: openai/gpt-4o-mini)
|
||||
OPENROUTER_REFERER Passed through as HTTP-Referer header
|
||||
OPENROUTER_TITLE Passed through as X-Title header
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import datetime as dt
|
||||
import os
|
||||
import sys
|
||||
from typing import Any, Iterable
|
||||
|
||||
import requests
|
||||
|
||||
|
||||
def utc_iso(ts: dt.datetime) -> str:
|
||||
"""Return an ISO8601 string with Z suffix."""
|
||||
return ts.replace(microsecond=0).isoformat() + "Z"
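As a quick sanity check, the helper's truncate-and-suffix behavior can be exercised in isolation (the function is re-declared here so the snippet runs standalone):

```python
import datetime as dt


def utc_iso(ts: dt.datetime) -> str:
    # Truncate microseconds, then append the Z (UTC) suffix, matching the helper above.
    return ts.replace(microsecond=0).isoformat() + "Z"


stamp = utc_iso(dt.datetime(2025, 12, 29, 10, 30, 45, 123456))
print(stamp)  # 2025-12-29T10:30:45Z
```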


def query_elasticsearch(
    host: str,
    api_key: str,
    index_pattern: str,
    minutes: int,
    size: int,
    verify: bool,
) -> list[dict[str, Any]]:
    """Fetch recent logs from Elasticsearch."""
    end = dt.datetime.utcnow()
    start = end - dt.timedelta(minutes=minutes)
    url = f"{host.rstrip('/')}/{index_pattern}/_search"
    payload = {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "range": {
                "@timestamp": {
                    "gte": utc_iso(start),
                    "lte": utc_iso(end),
                }
            }
        },
        "_source": ["@timestamp", "message", "host.name", "container.image.name", "log.level"],
    }
    headers = {
        "Authorization": f"ApiKey {api_key}",
        "Content-Type": "application/json",
    }
    response = requests.post(url, json=payload, headers=headers, timeout=30, verify=verify)
    response.raise_for_status()
    hits = response.json().get("hits", {}).get("hits", [])
    return hits


def build_prompt(logs: Iterable[dict[str, Any]], limit_messages: int) -> str:
    """Create the prompt that will be sent to the LLM."""
    selected = []
    for idx, hit in enumerate(logs):
        if idx >= limit_messages:
            break
        source = hit.get("_source", {})
        message = source.get("message") or source.get("event", {}).get("original") or ""
        timestamp = source.get("@timestamp", "unknown time")
        host = source.get("host", {}).get("name") or source.get("host", {}).get("hostname") or "unknown-host"
        container = source.get("container", {}).get("image", {}).get("name") or ""
        level = source.get("log", {}).get("level") or source.get("log.level") or ""
        selected.append(
            f"[{timestamp}] host={host} level={level} container={container}\n{message}".strip()
        )

    if not selected:
        return "No logs were returned from Elasticsearch in the requested window."

    prompt = (
        "You are assisting with HomeLab observability. Review the following log entries collected from "
        "Elasticsearch and highlight any notable anomalies, errors, or emerging issues. "
        "Explain the impact and suggest next steps when applicable. "
        "Use concise bullet points. Logs:\n\n"
        + "\n\n".join(selected)
    )
    return prompt
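The per-hit flattening above can be sketched against a single hypothetical hit (the field values are invented for illustration; the chained `.get(...)`/`or` fallbacks mirror the loop body):

```python
# Hypothetical Elasticsearch hit, shaped like the _source fields the query requests.
hit = {
    "_source": {
        "@timestamp": "2025-12-29T10:30:45Z",
        "message": "disk read error on /dev/sda",
        "host": {"name": "casper"},
        "container": {"image": {"name": "filebeat"}},
        "log": {"level": "error"},
    }
}

source = hit.get("_source", {})
# Chained `or` fallbacks: prefer `message`, then `event.original`, else empty string.
message = source.get("message") or source.get("event", {}).get("original") or ""
host = source.get("host", {}).get("name") or source.get("host", {}).get("hostname") or "unknown-host"
container = source.get("container", {}).get("image", {}).get("name") or ""
level = source.get("log", {}).get("level") or source.get("log.level") or ""
line = f"[{source.get('@timestamp', 'unknown time')}] host={host} level={level} container={container}\n{message}".strip()
print(line)
```

Missing fields never raise `KeyError`; they degrade to `"unknown-host"` or an empty string, so one malformed document cannot abort the whole prompt build.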


def call_openrouter(prompt: str, model: str, api_key: str, referer: str | None, title: str | None) -> str:
    """Send prompt to OpenRouter and return the model response text."""
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if referer:
        headers["HTTP-Referer"] = referer
    if title:
        headers["X-Title"] = title

    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a senior SRE helping analyze log anomalies."},
            {"role": "user", "content": prompt},
        ],
    }

    response = requests.post(url, json=body, headers=headers, timeout=60)
    response.raise_for_status()
    data = response.json()
    choices = data.get("choices", [])
    if not choices:
        raise RuntimeError("OpenRouter response did not include choices")
    return choices[0]["message"]["content"]
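The `choices` handling can be exercised offline with a hand-built dict shaped like an OpenAI-style chat-completion response (the payload here is invented; in the script it comes from `response.json()`):

```python
# Hypothetical response body shaped like a chat-completion result.
data = {
    "choices": [
        {"message": {"role": "assistant", "content": "- No anomalies found."}},
    ]
}

choices = data.get("choices", [])
if not choices:
    # Mirrors the guard above: an empty or missing choices list is a hard error.
    raise RuntimeError("OpenRouter response did not include choices")
text = choices[0]["message"]["content"]
print(text)  # - No anomalies found.
```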


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Query Elasticsearch and summarize logs with OpenRouter.")
    parser.add_argument("--host", default=os.environ.get("ELASTIC_HOST"), help="Elasticsearch host URL")
    parser.add_argument("--api-key", default=os.environ.get("ELASTIC_API_KEY"), help="Elasticsearch ApiKey")
    parser.add_argument("--index", default="log*", help="Index pattern (default: log*)")
    parser.add_argument("--minutes", type=int, default=60, help="Lookback window in minutes (default: 60)")
    parser.add_argument("--size", type=int, default=200, help="Max number of logs to fetch (default: 200)")
    parser.add_argument("--message-limit", type=int, default=50, help="Max log lines sent to LLM (default: 50)")
    parser.add_argument("--openrouter-model", default=os.environ.get("OPENROUTER_MODEL", "openai/gpt-4o-mini"))
    parser.add_argument("--insecure", action="store_true", help="Disable TLS verification for Elasticsearch")
    return parser.parse_args()
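Because the env var is read at `add_argument` time as the flag's default, an explicit CLI flag always wins over the environment. A minimal sketch of that precedence (using a demo variable name, not the real `ELASTIC_HOST`, so nothing real is touched):

```python
import argparse
import os

# Demo variable so this snippet has no effect on a real configuration.
os.environ["ELASTIC_HOST_DEMO"] = "https://es.internal:9200"

parser = argparse.ArgumentParser()
parser.add_argument("--host", default=os.environ.get("ELASTIC_HOST_DEMO"))

args = parser.parse_args([])  # no flag given -> env-derived default applies
assert args.host == "https://es.internal:9200"

args = parser.parse_args(["--host", "https://other:9200"])  # explicit flag wins
print(args.host)  # https://other:9200
```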


def main() -> int:
    args = parse_args()
    if not args.host or not args.api_key:
        print("ELASTIC_HOST and ELASTIC_API_KEY must be provided via environment or CLI", file=sys.stderr)
        return 1

    logs = query_elasticsearch(
        host=args.host,
        api_key=args.api_key,
        index_pattern=args.index,
        minutes=args.minutes,
        size=args.size,
        verify=not args.insecure,
    )

    prompt = build_prompt(logs, limit_messages=args.message_limit)
    if not prompt.strip() or prompt.startswith("No logs"):
        print(prompt)
        return 0

    openrouter_key = os.environ.get("OPENROUTER_API_KEY")
    if not openrouter_key:
        print("OPENROUTER_API_KEY is required to summarize logs", file=sys.stderr)
        return 1

    referer = os.environ.get("OPENROUTER_REFERER")
    title = os.environ.get("OPENROUTER_TITLE", "Elastic Log Monitor")
    response_text = call_openrouter(
        prompt=prompt,
        model=args.openrouter_model,
        api_key=openrouter_key,
        referer=referer,
        title=title,
    )
    print(response_text.strip())
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
17
stacks/mllogwatcher/testing.py
Executable file
@@ -0,0 +1,17 @@
# pip install -qU langchain "langchain[anthropic]"
from langchain.agents import create_agent


def get_weather(city: str) -> str:
    """Get weather for a given city."""
    return f"It's always sunny in {city}!"


agent = create_agent(
    model="claude-sonnet-4-5-20250929",
    tools=[get_weather],
    system_prompt="You are a helpful assistant",
)

# Run the agent
agent.invoke(
    {"messages": [{"role": "user", "content": "what is the weather in sf"}]}
)
10
stacks/mllogwatcher/worklog-2025-12-29.txt
Normal file
@@ -0,0 +1,10 @@
# Worklog – 2025-12-29

1. Added containerization assets for grafana_alert_webhook:
   - `Dockerfile`, `.dockerignore`, `docker-compose.yml`, `.env.example`, and a consolidated `requirements.txt`.
   - Compose mounts the runbook, `/etc/ansible/hosts`, and `.ssh` so SSH automation works inside the container.
   - README now documents the compose workflow.
2. Copied knight's SSH key to `.ssh/webhook_id_rsa` and updated the `jet-alone` inventory entry with `ansible_user` + `ansible_ssh_private_key_file` so remote commands can run non-interactively.
3. Updated `OpenRouterLLM` to satisfy Pydantic's field validation inside the container.
4. Brought the webhook up under Docker Compose, tested alerts end-to-end, and reverted `OPENROUTER_MODEL` to the valid `openai/gpt-5.1-codex-max`.
5. Created `/var/core/ansible/ops_baseline.yml` to install sysstat/iotop/smartmontools/hdparm and enforce synchronized Bash history (`/etc/profile.d/99-bash-history.sh`). Ran the playbook against the primary LAN hosts; noted remediation items for the few that failed (outdated mirrors, pending grub configuration, missing sudo password).