Add dev stacks

2025-12-31 20:11:44 -05:00
parent 0bcfed8fb8
commit 13989e2b59
49 changed files with 4948 additions and 0 deletions

View File

@@ -0,0 +1,7 @@
.venv
__pycache__/
*.pyc
.git
.gitignore
.env
tmp/

View File

@@ -0,0 +1,14 @@
OPENROUTER_API_KEY=
OPENROUTER_MODEL=openai/gpt-5.2-codex-max
TRIAGE_ENABLE_COMMANDS=1
TRIAGE_COMMAND_RUNNER=local
TRIAGE_VERBOSE_LOGS=1
TRIAGE_EMAIL_ENABLED=1
TRIAGE_EMAIL_FROM=alertai@example.com
TRIAGE_EMAIL_TO=admin@example.com
TRIAGE_SMTP_HOST=smtp.example.com
TRIAGE_SMTP_PORT=465
TRIAGE_SMTP_USER=alertai@example.com
TRIAGE_SMTP_PASSWORD=
TRIAGE_SMTP_SSL=1
TRIAGE_SMTP_STARTTLS=0

View File

@@ -0,0 +1,20 @@
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1
WORKDIR /var/core/mlLogWatcher
RUN apt-get update && \
    apt-get install -y --no-install-recommends openssh-client && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY alert_runbook.yaml ./alert_runbook.yaml
COPY scripts ./scripts
EXPOSE 8081
CMD ["uvicorn", "scripts.grafana_alert_webhook:app", "--host", "0.0.0.0", "--port", "8081"]

stacks/mllogwatcher/README.md Executable file
View File

@@ -0,0 +1,120 @@
# ML Log Watcher Utilities
This repository now contains two automation entry points that work together to
triage Elasticsearch logs and Grafana alerts with the help of OpenRouter-hosted
language models.
## 1. `scripts/log_monitor.py`
Existing script that queries Elasticsearch indices, pulls a recent window of
logs, and asks an LLM for anomaly highlights. Run it ad hoc or schedule it via
cron/systemd (sample crontab entry below).
```
ELASTIC_HOST=https://casper.localdomain:9200 \
ELASTIC_API_KEY=... \
OPENROUTER_API_KEY=... \
python3 scripts/log_monitor.py --index 'log*' --minutes 30
```
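If you schedule it, a crontab entry along these lines works; the 30-minute interval, log path, and working directory are illustrative, and the elided secrets stay elided:
```
*/30 * * * * cd /var/core/mlLogWatcher && ELASTIC_HOST=https://casper.localdomain:9200 ELASTIC_API_KEY=... OPENROUTER_API_KEY=... python3 scripts/log_monitor.py --index 'log*' --minutes 30 >> /var/log/log_monitor.log 2>&1
```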
## 2. `scripts/grafana_alert_webhook.py`
A FastAPI web server that accepts Grafana alert webhooks, finds the matching
entry in `alert_runbook.yaml`, renders the LLM prompt, and posts it to
OpenRouter. The response text is returned to Grafana (or any caller) immediately
so automation can fan out to chat, ticketing, etc.
### Dependencies
```
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn pyyaml requests langchain
```
### Environment
- `OPENROUTER_API_KEY` required.
- `OPENROUTER_MODEL` optional (default `openai/gpt-4o-mini`).
- `RUNBOOK_PATH` optional (default `alert_runbook.yaml` in repo root).
- `ANSIBLE_HOSTS_PATH` optional (default `/etc/ansible/hosts`). When set, the webhook auto-loads the Ansible inventory so alerts targeting known hosts inherit their SSH user/port/key information.
- `OPENROUTER_REFERER` / `OPENROUTER_TITLE` forwarded headers if needed.
- `TRIAGE_ENABLE_COMMANDS` set to `1` to let the webhook execute runbook commands (default `0` keeps it in read-only mode).
- `TRIAGE_COMMAND_RUNNER` `ssh` (default) or `local`. When using ssh, also set `TRIAGE_SSH_USER` and optional `TRIAGE_SSH_OPTIONS`.
- `TRIAGE_COMMAND_TIMEOUT`, `TRIAGE_MAX_COMMANDS`, `TRIAGE_OUTPUT_LIMIT`, `TRIAGE_DEFAULT_OS` tune execution behavior.
- `TRIAGE_VERBOSE_LOGS` set to `1` to stream the entire LLM dialogue, prompts, and command outputs to the webhook logs for debugging.
- `TRIAGE_EMAIL_ENABLED` when `1`, the webhook emails the final LLM summary per alert. Requires `TRIAGE_EMAIL_FROM`, `TRIAGE_EMAIL_TO` (comma-separated), `TRIAGE_SMTP_HOST`, and optional `TRIAGE_SMTP_PORT`, `TRIAGE_SMTP_USER`, `TRIAGE_SMTP_PASSWORD`, `TRIAGE_SMTP_STARTTLS`, `TRIAGE_SMTP_SSL`.
### Running
```
source .venv/bin/activate
export OPENROUTER_API_KEY=...
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
```
The server loads the runbook at startup and exposes:
- `POST /alerts`: the Grafana webhook target.
- `POST /reload-runbook`: reloads the runbook without restarting the server.
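For example, to pick up runbook edits without restarting:
```
curl -X POST http://localhost:8081/reload-runbook
```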
When `TRIAGE_ENABLE_COMMANDS=1`, the server executes the relevant triage commands
for each alert (via SSH or locally), captures stdout/stderr, and appends the
results to both the OpenRouter prompt and the HTTP response JSON. This lets you
automate evidence gathering directly from the runbook instructions. Use
environment variables to control which user/host the commands target and to
limit timeouts/output size. LangChain powers the multi-turn investigation flow:
the LLM can call the provided tools (`run_local_command`, `run_ssh_command`) to
gather additional evidence until it's ready to deliver a final summary.
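Concretely, on each turn the model is asked to reply with a small JSON envelope. The field names below match the default system prompt and the command-spec parser in `scripts/grafana_alert_webhook.py`; the values are illustrative:
```
{
  "analysis": "Telegraf appears down on unit-02; journal shows repeated auth failures against casper:8086.",
  "followup_commands": [
    {"summary": "Check Telegraf status", "command": "sudo systemctl status telegraf", "runner": "ssh", "os": "linux"}
  ],
  "complete": false
}
```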
When `/etc/ansible/hosts` (or `ANSIBLE_HOSTS_PATH`) is available the server
automatically enriches the alert context with SSH metadata (user, host, port,
identity file, and common args) so runbook commands default to using SSH against
the alerting host instead of the webhook server.
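A hypothetical inventory entry the webhook would recognize looks like this; the host, group, and values are made up, but the keys mirror what the inventory loader reads (`ansible_host`, `ansible_user`, `ansible_port`, `ansible_ssh_private_key_file`, `ansible_ssh_common_args`):
```
[linux_hosts]
unit-02 ansible_host=unit-02.localdomain ansible_user=ops ansible_ssh_private_key_file=.ssh/webhook_id_rsa
```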
### Running with Docker Compose
1. Copy `.env.example` to `.env` and fill in your OpenRouter key, email SMTP
settings, and other toggles.
2. Place any SSH keys the webhook needs inside `./.ssh/` (the compose file
mounts this directory read-only inside the container).
3. Run `docker compose up -d` to build and launch the webhook. It listens on
port `8081` by default and uses the mounted `alert_runbook.yaml` plus the
host `/etc/ansible/hosts`.
4. Use `docker compose logs -f` to watch verbose LangChain output or restart
with `docker compose restart` when updating the code/runbook.
### Sample payload
```
curl -X POST http://localhost:8081/alerts \
-H 'Content-Type: application/json' \
-d '{
"status":"firing",
"ruleUid":"edkmsdmlay2o0c",
"ruleUrl":"http://casper:3000/alerting/grafana/edkmsdmlay2o0c/view",
"alerts":[
{
"status":"firing",
"labels":{
"alertname":"High Mem.",
"host":"unit-02",
"rule_uid":"edkmsdmlay2o0c"
},
"annotations":{
"summary":"Memory usage above 95% for 10m",
"value":"96.2%"
},
"startsAt":"2025-09-22T17:20:00Z",
"endsAt":"0001-01-01T00:00:00Z"
}
]
}'
```
With a valid OpenRouter key this returns a JSON body containing the LLM summary
per alert plus any unmatched alerts (missing runbook entries or rule UIDs).
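A sketch of that response shape, mirroring the fields the webhook assembles per alert (summary text abbreviated):
```
{
  "processed": 1,
  "results": [
    {
      "rule_uid": "edkmsdmlay2o0c",
      "alertname": "High Mem.",
      "host": "unit-02",
      "llm_summary": "...",
      "command_results": [],
      "runbook_matched": true
    }
  ],
  "unmatched": []
}
```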
### Testing without OpenRouter
Set `OPENROUTER_API_KEY=dummy` and point the DNS entry to a mock (e.g. mitmproxy)
if you need to capture outbound requests. Otherwise, calls fail fast with
HTTP 502 so Grafana knows the automation needs to be retried.

View File

@@ -0,0 +1,254 @@
# Grafana alert triage playbook for the HomeLab telemetry stack.
# Each entry contains the alert metadata, what the signal means,
# the evidence to capture automatically, and the manual / scripted steps.
metadata:
generated: "2025-09-22T00:00:00Z"
grafana_url: "http://casper:3000"
datasource: "InfluxDB telegraf (uid=P951FEA4DE68E13C5)"
llm_provider: "OpenRouter"
alerts:
- name: "Data Stale"
rule_uid: "fdk9orif6fytcf"
description: "No CPU usage_user metrics have arrived for non-unit hosts within 5 minutes."
signal:
metric: "cpu.usage_user"
condition: "count(host samples over 5m) < 1"
impact: "Host is no longer reporting to Telegraf/Influx -> monitoring blind spot."
evidence_to_collect:
- "Influx: `from(bucket:\"telegraf\") |> range(start:-10m) |> filter(fn:(r)=>r._measurement==\"cpu\" and r.host==\"{{ host }}\") |> count()`"
- "Telegraf log tail"
- "System journal for network/auth errors"
triage:
- summary: "Verify Telegraf agent health"
linux: "sudo systemctl status telegraf && sudo journalctl -u telegraf -n 100"
windows: "Get-Service telegraf; Get-Content 'C:\\Program Files\\telegraf\\telegraf.log' -Tail 100"
- summary: "Check connectivity from host to Influx (`casper:8086`)"
linux: "curl -sSf http://casper:8086/ping"
windows: "Invoke-WebRequest -UseBasicParsing http://casper:8086/ping"
- summary: "Confirm host clock drift <5s (important for Influx line protocol timestamps)"
linux: "chronyc tracking"
windows: "w32tm /query /status"
remediation:
- "Restart Telegraf after config validation: `sudo telegraf --test --config /etc/telegraf/telegraf.conf` then `sudo systemctl restart telegraf`."
- "Re-apply Ansible telemetry playbook if multiple hosts fail."
llm_prompt: >
Alert {{ alertname }} fired for {{ host }}. Telegraf stopped sending cpu.usage_user metrics. Given the collected logs and command output, identify root causes (agent down, auth failures, firewall, time skew) and list the next action.
- name: "High CPU"
rule_uid: "fdkms407ubmdcc"
description: "Mean CPU usage_system over the last 10 minutes exceeds 85%."
signal:
metric: "cpu.usage_system"
condition: "mean over 10m > 85%"
impact: "Host is near saturation; scheduler latency and queueing likely."
evidence_to_collect:
- "Top CPU processes snapshot (Linux: `ps -eo pid,cmd,%cpu --sort=-%cpu | head -n 15`; Windows: `Get-Process | Sort-Object CPU -Descending | Select -First 15`)"
- "Load vs CPU core count"
- "Recent deploys / cron jobs metadata"
triage:
- summary: "Confirm sustained CPU pressure"
linux: "uptime && mpstat 1 5"
windows: "typeperf \"\\Processor(_Total)\\% Processor Time\" -sc 15"
- summary: "Check offending processes/services"
linux: "sudo ps -eo pid,user,comm,%cpu,%mem --sort=-%cpu | head"
windows: "Get-Process | Sort-Object CPU -Descending | Select -First 10 Name,CPU"
- summary: "Inspect cgroup / VM constraints if on Proxmox"
linux: "sudo pct status {{ vmid }} && sudo pct config {{ vmid }}"
remediation:
- "Throttle or restart runaway service; scale workload or tune limits."
- "Consider moving noisy neighbors off shared hypervisor."
llm_prompt: >
High CPU alert for {{ host }}. Review process table, recent deploys, and virtualization context; determine why cpu.usage_system stayed above 85% and recommend mitigation.
- name: "High Mem."
rule_uid: "edkmsdmlay2o0c"
description: "Mean memory used_percent over 10 minutes > 95% (excluding hosts jhci/nerv*/magi*)."
signal:
metric: "mem.used_percent"
condition: "mean over 10m > 95%"
impact: "OOM risk and swap thrash."
evidence_to_collect:
- "Free/available memory snapshot"
- "Top consumers (Linux: `sudo smem -rt rss | head`; Windows: `Get-Process | Sort-Object WorkingSet -Descending`)"
- "Swap in/out metrics"
triage:
- summary: "Validate actual memory pressure"
linux: "free -m && vmstat -SM 5 5"
windows: "Get-Counter '\\Memory\\Available MBytes'"
- summary: "Identify leaking services"
linux: "sudo ps -eo pid,user,comm,%mem,rss --sort=-%mem | head"
windows: "Get-Process | Sort-Object WS -Descending | Select -First 10 ProcessName,WS"
- summary: "Check recent kernel/OOM logs"
linux: "sudo dmesg | tail -n 50"
windows: "Get-WinEvent -LogName System -MaxEvents 50 | ? { $_.Message -match 'memory' }"
remediation:
- "Restart or reconfigure offender; add swap as stop-gap; increase VM memory allocation."
llm_prompt: >
High Mem alert for {{ host }}. After reviewing free memory, swap activity, and top processes, explain the likely cause and propose remediation steps with priority.
- name: "High Disk IO"
rule_uid: "bdkmtaru7ru2od"
description: "Mean merged_reads/writes per second converted to GB/s exceeds 10."
signal:
metric: "diskio.merged_reads + merged_writes"
condition: "mean over 10m > 10 GB/s"
impact: "Storage controller saturated; latency spikes, possible backlog."
evidence_to_collect:
- "iostat extended output"
- "Process level IO (pidstat/nethogs equivalent)"
- "ZFS/MDADM status for relevant pools"
triage:
- summary: "Inspect device queues"
linux: "iostat -xzd 5 3"
windows: "Get-WmiObject -Class Win32_PerfFormattedData_PerfDisk_LogicalDisk | Format-Table Name,DiskWritesPersec,DiskReadsPersec,AvgDisksecPerTransfer"
- summary: "Correlate to filesystem / VM"
linux: "sudo lsof +D /mnt/critical -u {{ user }}"
- summary: "Check backup or replication windows"
linux: "journalctl -u pvebackup -n 50"
remediation:
- "Pause heavy jobs, move backups off-peak, evaluate faster storage tiers."
llm_prompt: >
High Disk IO on {{ host }}. With iostat/pidstat output provided, decide whether activity is expected (backup, scrub) or abnormal and list mitigations.
- name: "Low Uptime"
rule_uid: "ddkmuadxvkm4ge"
description: "System uptime converted to minutes is below 10 -> host rebooted recently."
signal:
metric: "system.uptime"
condition: "last uptime_minutes < 10"
impact: "Unexpected reboot or crash; may need RCA."
evidence_to_collect:
- "Boot reason logs"
- "Last patch/maintenance window from Ansible inventory"
- "Smart log excerpt for power events"
triage:
- summary: "Confirm uptime and reason"
linux: "uptime && last -x | head"
windows: "Get-WinEvent -LogName System -MaxEvents 50 | ? { $_.Id -in 41,6006,6008 }"
- summary: "Check kernel panic or watchdog traces"
linux: "sudo journalctl -k -b -1 | tail -n 200"
- summary: "Validate patch automation logs"
linux: "sudo tail -n 100 /var/log/ansible-pull.log"
remediation:
- "Schedule deeper diagnostics if crash; reschedule workloads once stable."
llm_prompt: >
Low Uptime alert: host restarted within 10 minutes. Inspect boot reason logs and recommend whether this is maintenance or a fault needing follow-up.
- name: "High Load"
rule_uid: "ddkmul9x8gcn4d"
description: "system.load5 > 6 for 5 minutes."
signal:
metric: "system.load5"
condition: "last value > 6"
impact: "Runnable queue more than CPU threads -> latency growth."
evidence_to_collect:
- "Load vs CPU count (`nproc`)"
- "Process states (D/R blocked tasks)"
- "IO wait percentage"
triage:
- summary: "Correlate load to CPU and IO"
linux: "uptime && vmstat 1 5"
- summary: "Identify stuck IO"
linux: "sudo pidstat -d 1 5"
- summary: "Check Proxmox scheduler for resource contention"
linux: "pveperf && qm list"
remediation:
- "Reduce cron concurrency, add CPU, or fix IO bottleneck causing runnable queue growth."
llm_prompt: >
High Load alert on {{ host }}. Based on vmstat/pidstat output, explain whether CPU saturation, IO wait, or runnable pile-up is at fault and propose actions.
- name: "High Network Traffic (Download)"
rule_uid: "cdkpct82a7g8wd"
description: "Derivative of bytes_recv > 50 MB/s on any interface over last hour."
signal:
metric: "net.bytes_recv"
condition: "mean download throughput > 50 MB/s"
impact: "Link saturation, potential DDoS or backup window."
evidence_to_collect:
- "Interface counters (Linux: `ip -s link show {{ iface }}`; Windows: `Get-NetAdapterStatistics`)"
- "Top talkers (Linux: `sudo nethogs {{ iface }}` or `iftop -i {{ iface }}`)"
- "Firewall/IDS logs"
triage:
- summary: "Confirm interface experiencing spike"
linux: "sar -n DEV 1 5 | grep {{ iface }}"
windows: "Get-Counter -Counter '\\Network Interface({{ iface }})\\Bytes Received/sec' -Continuous -SampleInterval 1 -MaxSamples 5"
- summary: "Identify process or remote peer"
linux: "sudo ss -ntu state established | sort -k4"
windows: "Get-NetTCPConnection | Sort-Object -Property LocalPort"
remediation:
- "Throttle offending transfers, move backup replication, verify no compromised service."
llm_prompt: >
High download throughput on {{ host }} interface {{ iface }}. Review interface counters and connection list to determine if traffic is expected and advise throttling or blocking steps.
- name: "High Network Traffic (Upload)"
rule_uid: "aec650pbtvzswa"
description: "Derivative of bytes_sent > 30 MB/s for an interface."
signal:
metric: "net.bytes_sent"
condition: "mean upload throughput > 30 MB/s"
impact: "Excess upstream usage; may saturate ISP uplink."
evidence_to_collect:
- "Interface statistics"
- "NetFlow sample if available (`/var/log/telegraf/netflow.log`)"
- "List of active transfers"
triage:
- summary: "Measure upload curve"
linux: "bmon -p {{ iface }} -o ascii"
windows: "Get-Counter '\\Network Interface({{ iface }})\\Bytes Sent/sec' -Continuous -SampleInterval 1 -MaxSamples 5"
- summary: "Find process generating traffic"
linux: "sudo iftop -i {{ iface }} -t -s 30"
windows: "Get-NetAdapterStatistics -Name {{ iface }}"
remediation:
- "Pause replication jobs, confirm backups not stuck, search for data exfiltration."
llm_prompt: >
High upload alert for {{ host }} interface {{ iface }}. Using captured traffic samples, determine whether replication/backup explains the pattern or if anomalous traffic needs blocking.
- name: "High Disk Usage"
rule_uid: "cdma6i5k2gem8d"
description: "Disk used_percent >= 95% for Linux devices (filters out unwanted devices)."
signal:
metric: "disk.used_percent"
condition: "last value > 95%"
impact: "Filesystem full -> service crashes or write failures."
evidence_to_collect:
- "`df -h` or `Get-Volume` output for device"
- "Largest directories snapshot (Linux: `sudo du -xhd1 /path`; Windows: `Get-ChildItem | Sort Length`)"
- "Recent deploy or backup expansion logs"
triage:
- summary: "Validate usage"
linux: "df -h {{ mountpoint }}"
windows: "Get-Volume -FileSystemLabel {{ volume }}"
- summary: "Identify growth trend"
linux: "sudo journalctl -u telegraf -g 'disk usage' -n 20"
- summary: "Check for stale docker volumes"
linux: "docker system df && docker volume ls"
remediation:
- "Prune temp artifacts, expand disk/VM, move logs to remote storage."
llm_prompt: >
High Disk Usage alert on {{ host }} device {{ device }}. Summarize what consumed the space and recommend reclaim or expansion actions with priority.
- name: "CPU Heartbeat"
rule_uid: "eec62gqn3oetcf"
description: "Counts cpu.usage_system samples per host; fires if <1 sample arrives within window."
signal:
metric: "cpu.usage_system"
condition: "sample count within 10m < 1"
impact: "Indicates host stopped reporting metrics entirely (telemetry silent)."
evidence_to_collect:
- "Influx query for recent cpu samples"
- "Telegraf service and logs"
- "Network reachability from host to casper"
triage:
- summary: "Check host alive and reachable"
linux: "ping -c 3 {{ host }} && ssh {{ host }} uptime"
windows: "Test-Connection {{ host }} -Count 3"
- summary: "Inspect Telegraf state"
linux: "sudo systemctl status telegraf && sudo tail -n 100 /var/log/telegraf/telegraf.log"
windows: "Get-Service telegraf; Get-EventLog -LogName Application -Newest 50 | ? { $_.Source -match 'Telegraf' }"
- summary: "Validate API key / Influx auth"
linux: "sudo grep -n 'outputs.influxdb' -n /etc/telegraf/telegraf.conf"
remediation:
- "Re-issue Telegraf credentials, run `ansible-playbook telemetry.yml -l {{ host }}`."
- "If host intentionally offline, silence alert via Grafana maintenance window."
llm_prompt: >
CPU Heartbeat for {{ host }} indicates telemetry silent. Use connectivity tests and Telegraf logs to determine if host is down or just metrics disabled; propose fixes.

View File

@@ -0,0 +1,14 @@
version: "3.9"
services:
grafana-alert-webhook:
build: .
env_file:
- .env
ports:
- "8081:8081"
volumes:
- ./alert_runbook.yaml:/var/core/mlLogWatcher/alert_runbook.yaml:ro
- /etc/ansible/hosts:/etc/ansible/hosts:ro
- ./.ssh:/var/core/mlLogWatcher/.ssh:ro
restart: unless-stopped

View File

@@ -0,0 +1,5 @@
fastapi==0.115.5
uvicorn[standard]==0.32.0
pyyaml==6.0.2
requests==2.32.3
langchain==0.2.15

View File

@@ -0,0 +1,988 @@
#!/usr/bin/env python3
"""
Minimal FastAPI web server that accepts Grafana alert webhooks, looks up the
matching runbook entry, builds an LLM prompt, and calls OpenRouter to return a
triage summary.
Run with:
uvicorn scripts.grafana_alert_webhook:app --host 0.0.0.0 --port 8081
Environment variables:
RUNBOOK_PATH Path to alert_runbook.yaml (default: ./alert_runbook.yaml)
OPENROUTER_API_KEY Required; API token for https://openrouter.ai
OPENROUTER_MODEL Optional; default openai/gpt-4o-mini
OPENROUTER_REFERER Optional referer header
OPENROUTER_TITLE Optional title header (default: Grafana Alert Webhook)
"""
from __future__ import annotations
import logging
import os
import re
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
import json
import shlex
import subprocess
from textwrap import indent
import smtplib
from email.message import EmailMessage
import requests
import yaml
from fastapi import FastAPI, HTTPException, Request
from langchain.llms.base import LLM
LOGGER = logging.getLogger("grafana_webhook")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
RUNBOOK_PATH = Path(os.environ.get("RUNBOOK_PATH", "alert_runbook.yaml"))
ANSIBLE_HOSTS_PATH = Path(os.environ.get("ANSIBLE_HOSTS_PATH", "/etc/ansible/hosts"))
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
OPENROUTER_MODEL = os.environ.get("OPENROUTER_MODEL", "openai/gpt-4o-mini")
OPENROUTER_REFERER = os.environ.get("OPENROUTER_REFERER")
OPENROUTER_TITLE = os.environ.get("OPENROUTER_TITLE", "Grafana Alert Webhook")
TRIAGE_ENABLE_COMMANDS = os.environ.get("TRIAGE_ENABLE_COMMANDS", "0").lower() in {"1", "true", "yes", "on"}
TRIAGE_COMMAND_RUNNER = os.environ.get("TRIAGE_COMMAND_RUNNER", "ssh").lower()
TRIAGE_SSH_USER = os.environ.get("TRIAGE_SSH_USER", "root")
TRIAGE_SSH_OPTIONS = shlex.split(
os.environ.get("TRIAGE_SSH_OPTIONS", "-o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout=5")
)
TRIAGE_COMMAND_TIMEOUT = int(os.environ.get("TRIAGE_COMMAND_TIMEOUT", "60"))
TRIAGE_DEFAULT_OS = os.environ.get("TRIAGE_DEFAULT_OS", "linux").lower()
TRIAGE_MAX_COMMANDS = int(os.environ.get("TRIAGE_MAX_COMMANDS", "3"))
TRIAGE_OUTPUT_LIMIT = int(os.environ.get("TRIAGE_OUTPUT_LIMIT", "1200"))
# LangChain-driven investigation loop
TRIAGE_MAX_ITERATIONS = int(os.environ.get("TRIAGE_MAX_ITERATIONS", "3"))
TRIAGE_FOLLOWUP_MAX_COMMANDS = int(os.environ.get("TRIAGE_FOLLOWUP_MAX_COMMANDS", "4"))
TRIAGE_SYSTEM_PROMPT = os.environ.get(
"TRIAGE_SYSTEM_PROMPT",
(
"You are assisting with on-call investigations. Always reply with JSON containing:\n"
"analysis: your findings and next steps.\n"
"followup_commands: list of command specs (summary, command, optional runner/os) to gather more data.\n"
"complete: true when sufficient information is gathered.\n"
"Request commands only when more evidence is required."
),
)
TRIAGE_VERBOSE_LOGS = os.environ.get("TRIAGE_VERBOSE_LOGS", "0").lower() in {"1", "true", "yes", "on"}
TRIAGE_EMAIL_ENABLED = os.environ.get("TRIAGE_EMAIL_ENABLED", "0").lower() in {"1", "true", "yes", "on"}
TRIAGE_EMAIL_FROM = os.environ.get("TRIAGE_EMAIL_FROM")
TRIAGE_EMAIL_TO = [addr.strip() for addr in os.environ.get("TRIAGE_EMAIL_TO", "").split(",") if addr.strip()]
TRIAGE_SMTP_HOST = os.environ.get("TRIAGE_SMTP_HOST")
TRIAGE_SMTP_PORT = int(os.environ.get("TRIAGE_SMTP_PORT", "587"))
TRIAGE_SMTP_USER = os.environ.get("TRIAGE_SMTP_USER")
TRIAGE_SMTP_PASSWORD = os.environ.get("TRIAGE_SMTP_PASSWORD")
TRIAGE_SMTP_STARTTLS = os.environ.get("TRIAGE_SMTP_STARTTLS", "1").lower() in {"1", "true", "yes", "on"}
TRIAGE_SMTP_SSL = os.environ.get("TRIAGE_SMTP_SSL", "0").lower() in {"1", "true", "yes", "on"}
TRIAGE_SMTP_TIMEOUT = int(os.environ.get("TRIAGE_SMTP_TIMEOUT", "20"))
def log_verbose(title: str, content: Any) -> None:
"""Emit structured verbose logs when TRIAGE_VERBOSE_LOGS is enabled."""
if not TRIAGE_VERBOSE_LOGS:
return
if isinstance(content, (dict, list)):
text = json.dumps(content, indent=2, sort_keys=True)
else:
text = str(content)
LOGGER.info("%s:\n%s", title, text)
def email_notifications_configured() -> bool:
if not TRIAGE_EMAIL_ENABLED:
return False
if not (TRIAGE_SMTP_HOST and TRIAGE_EMAIL_FROM and TRIAGE_EMAIL_TO):
LOGGER.warning(
"Email notifications requested but TRIAGE_SMTP_HOST/TRIAGE_EMAIL_FROM/TRIAGE_EMAIL_TO are incomplete."
)
return False
return True
def format_command_results_for_email(results: List[Dict[str, Any]]) -> str:
if not results:
return "No automation commands were executed."
lines: List[str] = []
for result in results:
lines.append(f"- {result.get('summary')} [{result.get('status')}] {result.get('command')}")
stdout = result.get("stdout")
stderr = result.get("stderr")
error = result.get("error")
if stdout:
lines.append(indent(truncate_text(stdout, 800), " stdout: "))
if stderr:
lines.append(indent(truncate_text(stderr, 800), " stderr: "))
if error and result.get("status") != "ok":
lines.append(f" error: {error}")
return "\n".join(lines)
def build_email_body(alert: Dict[str, Any], result: Dict[str, Any], context: Dict[str, Any]) -> str:
lines = [
f"Alert: {result.get('alertname')} ({result.get('rule_uid')})",
f"Host: {result.get('host') or context.get('host')}",
f"Status: {alert.get('status')}",
f"Value: {alert.get('value') or alert.get('annotations', {}).get('value')}",
f"Grafana Rule: {context.get('rule_url')}",
"",
"LLM Summary:",
result.get("llm_summary") or "(no summary returned)",
"",
"Command Results:",
format_command_results_for_email(result.get("command_results") or []),
]
return "\n".join(lines)
def send_summary_email(alert: Dict[str, Any], result: Dict[str, Any], context: Dict[str, Any]) -> None:
if not email_notifications_configured():
return
subject_host = result.get("host") or context.get("host") or "(unknown host)"
subject = f"[Grafana] {result.get('alertname')} - {subject_host}"
body = build_email_body(alert, result, context)
message = EmailMessage()
message["Subject"] = subject
message["From"] = TRIAGE_EMAIL_FROM
message["To"] = ", ".join(TRIAGE_EMAIL_TO)
message.set_content(body)
try:
smtp_class = smtplib.SMTP_SSL if TRIAGE_SMTP_SSL else smtplib.SMTP
with smtp_class(TRIAGE_SMTP_HOST, TRIAGE_SMTP_PORT, timeout=TRIAGE_SMTP_TIMEOUT) as client:
if TRIAGE_SMTP_STARTTLS and not TRIAGE_SMTP_SSL:
client.starttls()
if TRIAGE_SMTP_USER:
client.login(TRIAGE_SMTP_USER, TRIAGE_SMTP_PASSWORD or "")
client.send_message(message)
LOGGER.info("Sent summary email to %s for host %s", ", ".join(TRIAGE_EMAIL_TO), subject_host)
except Exception as exc: # pylint: disable=broad-except
LOGGER.exception("Failed to send summary email: %s", exc)
app = FastAPI(title="Grafana Alert Webhook", version="1.0.0")
_RUNBOOK_INDEX: Dict[str, Dict[str, Any]] = {}
_INVENTORY_INDEX: Dict[str, Dict[str, Any]] = {}
_INVENTORY_GROUP_VARS: Dict[str, Dict[str, str]] = {}
_TEMPLATE_PATTERN = re.compile(r"{{\s*([a-zA-Z0-9_]+)\s*}}")
DEFAULT_SYSTEM_PROMPT = TRIAGE_SYSTEM_PROMPT
class OpenRouterLLM(LLM):
"""LangChain-compatible LLM that calls OpenRouter chat completions."""
api_key: str
model_name: str
def __init__(self, api_key: str, model_name: str, **kwargs: Any) -> None:
super().__init__(api_key=api_key, model_name=model_name, **kwargs)
@property
def _llm_type(self) -> str:
return "openrouter"
def __call__(self, prompt: str, stop: Optional[List[str]] = None) -> str:
return self._call(prompt, stop=stop)
def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
payload = {
"model": self.model_name,
"messages": [
{"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
{"role": "user", "content": prompt},
],
}
log_verbose("OpenRouter request payload", payload)
if stop:
payload["stop"] = stop
LOGGER.info("Posting to OpenRouter model=%s via LangChain", self.model_name)
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json",
}
if OPENROUTER_REFERER:
headers["HTTP-Referer"] = OPENROUTER_REFERER
if OPENROUTER_TITLE:
headers["X-Title"] = OPENROUTER_TITLE
response = requests.post("https://openrouter.ai/api/v1/chat/completions", json=payload, headers=headers, timeout=90)
if response.status_code >= 400:
try:
detail = response.json()
except ValueError:
detail = response.text
raise RuntimeError(f"OpenRouter error {response.status_code}: {detail}")
data = response.json()
log_verbose("OpenRouter raw response", data)
choices = data.get("choices")
if not choices:
raise RuntimeError("OpenRouter returned no choices")
return choices[0]["message"]["content"].strip()
def load_runbook() -> Dict[str, Dict[str, Any]]:
"""Load runbook YAML into a dict keyed by rule_uid."""
if not RUNBOOK_PATH.exists():
raise FileNotFoundError(f"Runbook file not found: {RUNBOOK_PATH}")
with RUNBOOK_PATH.open("r", encoding="utf-8") as handle:
data = yaml.safe_load(handle) or {}
alerts = data.get("alerts", [])
index: Dict[str, Dict[str, Any]] = {}
for entry in alerts:
uid = entry.get("rule_uid")
if uid:
index[str(uid)] = entry
LOGGER.info("Loaded %d runbook entries from %s", len(index), RUNBOOK_PATH)
return index
def _normalize_host_key(host: str) -> str:
return host.strip().lower()
def _parse_key_value_tokens(tokens: List[str]) -> Dict[str, str]:
data: Dict[str, str] = {}
for token in tokens:
if "=" not in token:
continue
key, value = token.split("=", 1)
data[key] = value
return data
def load_ansible_inventory() -> Tuple[Dict[str, Dict[str, Any]], Dict[str, Dict[str, str]]]:
"""Parse a simple INI-style Ansible hosts file into host/group maps."""
if not ANSIBLE_HOSTS_PATH.exists():
LOGGER.warning("Ansible inventory not found at %s", ANSIBLE_HOSTS_PATH)
return {}, {}
hosts: Dict[str, Dict[str, Any]] = {}
group_vars: Dict[str, Dict[str, str]] = {}
current_group: Optional[str] = None
current_section: str = "hosts"
with ANSIBLE_HOSTS_PATH.open("r", encoding="utf-8") as handle:
for raw_line in handle:
line = raw_line.strip()
if not line or line.startswith("#"):
continue
if line.startswith("[") and line.endswith("]"):
header = line[1:-1].strip()
if ":" in header:
group_name, suffix = header.split(":", 1)
current_group = group_name
current_section = suffix
else:
current_group = header
current_section = "hosts"
group_vars.setdefault(current_group, {})
continue
cleaned = line.split("#", 1)[0].strip()
if not cleaned:
continue
tokens = shlex.split(cleaned)
if not tokens:
continue
if current_section == "vars":
vars_dict = _parse_key_value_tokens(tokens)
group_vars.setdefault(current_group or "all", {}).update(vars_dict)
continue
host_token = tokens[0]
host_key = _normalize_host_key(host_token)
entry = hosts.setdefault(host_key, {"name": host_token, "definitions": [], "groups": set()})
vars_dict = _parse_key_value_tokens(tokens[1:])
entry["definitions"].append({"group": current_group, "vars": vars_dict})
if current_group:
entry["groups"].add(current_group)
LOGGER.info("Loaded %d Ansible inventory hosts from %s", len(hosts), ANSIBLE_HOSTS_PATH)
return hosts, group_vars
def _lookup_inventory(host: Optional[str]) -> Optional[Dict[str, Any]]:
if not host:
return None
key = _normalize_host_key(host)
entry = _INVENTORY_INDEX.get(key)
if entry:
return entry
# try stripping domain suffix
short = key.split(".", 1)[0]
if short != key:
return _INVENTORY_INDEX.get(short)
return None
def _merge_group_vars(groups: List[str], host_os: Optional[str]) -> Dict[str, str]:
merged: Dict[str, str] = {}
global_vars = _INVENTORY_GROUP_VARS.get("all")
if global_vars:
merged.update(global_vars)
normalized_os = (host_os or "").lower()
for group in groups:
vars_dict = _INVENTORY_GROUP_VARS.get(group)
if not vars_dict:
continue
connection = (vars_dict.get("ansible_connection") or "").lower()
if connection == "winrm" and normalized_os == "linux":
continue
merged.update(vars_dict)
return merged
def _should_include_definition(group: Optional[str], vars_dict: Dict[str, str], host_os: Optional[str]) -> bool:
if not vars_dict:
return False
normalized_os = (host_os or "").lower()
connection = (vars_dict.get("ansible_connection") or "").lower()
if connection == "winrm" and normalized_os != "windows":
return False
if connection == "local":
return True
if group and "windows" in group.lower() and normalized_os == "linux" and not connection:
return False
return True
def apply_inventory_context(context: Dict[str, Any]) -> None:
"""Augment the alert context with SSH metadata from the Ansible inventory."""
host = context.get("host")
entry = _lookup_inventory(host)
if not entry:
return
merged_vars = _merge_group_vars(list(entry.get("groups", [])), context.get("host_os"))
for definition in entry.get("definitions", []):
group_name = definition.get("group")
vars_dict = definition.get("vars", {})
if _should_include_definition(group_name, vars_dict, context.get("host_os")):
merged_vars.update(vars_dict)
ansible_host = merged_vars.get("ansible_host") or entry.get("name")
ansible_user = merged_vars.get("ansible_user")
ansible_port = merged_vars.get("ansible_port")
ssh_common_args = merged_vars.get("ansible_ssh_common_args")
ssh_key = merged_vars.get("ansible_ssh_private_key_file")
connection = (merged_vars.get("ansible_connection") or "").lower()
host_os = (context.get("host_os") or "").lower()
if connection == "winrm" and host_os != "windows":
for key in (
"ansible_connection",
"ansible_port",
"ansible_password",
"ansible_winrm_server_cert_validation",
"ansible_winrm_scheme",
):
merged_vars.pop(key, None)
connection = ""
context.setdefault("ssh_host", ansible_host or host)
if ansible_user:
context["ssh_user"] = ansible_user
if ansible_port:
context["ssh_port"] = ansible_port
if ssh_common_args:
context["ssh_common_args"] = ssh_common_args
if ssh_key:
context["ssh_identity_file"] = ssh_key
context.setdefault("inventory_groups", list(entry.get("groups", [])))
if connection == "local":
context.setdefault("preferred_runner", "local")
elif connection in {"", "ssh", "smart"}:
context.setdefault("preferred_runner", "ssh")
context.setdefault("inventory_groups", list(entry.get("groups", [])))
def render_template(template: str, context: Dict[str, Any]) -> str:
"""Very small mustache-style renderer for {{ var }} placeholders."""
def replace(match: re.Match[str]) -> str:
key = match.group(1)
return str(context.get(key, match.group(0)))
return _TEMPLATE_PATTERN.sub(replace, template)
def extract_rule_uid(alert: Dict[str, Any], parent_payload: Dict[str, Any]) -> Optional[str]:
"""Grafana webhooks may include rule UID in different fields."""
candidates: List[Any] = [
alert.get("ruleUid"),
alert.get("rule_uid"),
alert.get("ruleId"),
alert.get("uid"),
alert.get("labels", {}).get("rule_uid"),
alert.get("labels", {}).get("ruleUid"),
parent_payload.get("ruleUid"),
parent_payload.get("rule_uid"),
parent_payload.get("ruleId"),
]
for candidate in candidates:
if candidate:
return str(candidate)
# Fall back to Grafana URL parsing if present
url = (
alert.get("ruleUrl")
or parent_payload.get("ruleUrl")
or alert.get("generatorURL")
or parent_payload.get("generatorURL")
)
if url and "/alerting/" in url:
return url.rstrip("/").split("/")[-2]
return None
def derive_fallback_rule_uid(alert: Dict[str, Any], parent_payload: Dict[str, Any]) -> str:
"""Construct a deterministic identifier when Grafana omits rule UIDs."""
labels = alert.get("labels", {})
candidates = [
alert.get("fingerprint"),
labels.get("alertname"),
labels.get("host"),
labels.get("instance"),
parent_payload.get("groupKey"),
parent_payload.get("title"),
]
for candidate in candidates:
if candidate:
return str(candidate)
return "unknown-alert"
def build_fallback_runbook_entry(alert: Dict[str, Any], parent_payload: Dict[str, Any]) -> Dict[str, Any]:
"""Return a generic runbook entry so every alert can be processed."""
labels = alert.get("labels", {})
alertname = labels.get("alertname") or parent_payload.get("title") or "Grafana Alert"
host = labels.get("host") or labels.get("instance") or "(unknown host)"
return {
"name": f"{alertname} (auto)",
"llm_prompt": (
"Grafana alert {{ alertname }} fired for {{ host }}.\n"
"No dedicated runbook entry exists. Use the payload details, command outputs, "
"and your own reasoning to propose likely causes, evidence to gather, and remediation steps."
),
"triage": [],
"evidence_to_collect": [],
"remediation": [],
"metadata": {"host": host},
}
def summarize_dict(prefix: str, data: Optional[Dict[str, Any]]) -> str:
if not data:
return f"{prefix}: (none)"
parts = ", ".join(f"{key}={value}" for key, value in sorted(data.items()))
return f"{prefix}: {parts}"
def determine_host_os(alert: Dict[str, Any]) -> str:
"""Infer host operating system from labels or defaults."""
labels = alert.get("labels", {})
candidates = [
labels.get("os"),
labels.get("platform"),
labels.get("system"),
alert.get("os"),
]
for candidate in candidates:
if candidate:
value = str(candidate).lower()
if "win" in value:
return "windows"
if any(token in value for token in ("linux", "unix", "darwin")):
return "linux"
host = (labels.get("host") or labels.get("instance") or "").lower()
if host.startswith("win") or host.endswith(".localdomain") and "win" in host:
return "windows"
inventory_os = infer_os_from_inventory(labels.get("host") or labels.get("instance"))
if inventory_os:
return inventory_os
return TRIAGE_DEFAULT_OS
def infer_os_from_inventory(host: Optional[str]) -> Optional[str]:
if not host:
return None
entry = _lookup_inventory(host)
if not entry:
return None
for definition in entry.get("definitions", []):
vars_dict = definition.get("vars", {}) or {}
connection = (vars_dict.get("ansible_connection") or "").lower()
if connection == "winrm":
return "windows"
for group in entry.get("groups", []):
if "windows" in (group or "").lower():
return "windows"
return None
def truncate_text(text: str, limit: int = TRIAGE_OUTPUT_LIMIT) -> str:
"""Trim long outputs to keep prompts manageable."""
if not text:
return ""
cleaned = text.strip()
if len(cleaned) <= limit:
return cleaned
return cleaned[:limit] + "... [truncated]"
def gather_command_specs(entry: Dict[str, Any], host_os: str) -> List[Dict[str, Any]]:
"""Collect command specs from triage steps and optional automation sections."""
specs: List[Dict[str, Any]] = []
for step in entry.get("triage", []):
cmd = step.get(host_os)
if not cmd:
continue
specs.append(
{
"summary": step.get("summary") or entry.get("name") or "triage",
"shell": cmd,
"runner": step.get("runner"),
"os": host_os,
}
)
for item in entry.get("automation_commands", []):
target_os = item.get("os", host_os)
if target_os and target_os.lower() != host_os:
continue
specs.append(item)
if TRIAGE_MAX_COMMANDS > 0:
return specs[:TRIAGE_MAX_COMMANDS]
return specs
def build_runner_command(
rendered_command: str,
runner: str,
context: Dict[str, Any],
spec: Dict[str, Any],
) -> Tuple[Any, str, bool, str]:
"""Return the subprocess args, display string, shell flag, and runner label."""
runner = runner or TRIAGE_COMMAND_RUNNER
runner = runner.lower()
if runner == "ssh":
host = spec.get("host") or context.get("ssh_host") or context.get("host")
if not host:
raise RuntimeError("Host not provided for ssh runner.")
ssh_user = spec.get("ssh_user") or context.get("ssh_user") or TRIAGE_SSH_USER
ssh_target = spec.get("ssh_target") or f"{ssh_user}@{host}"
ssh_options = list(TRIAGE_SSH_OPTIONS)
common_args = spec.get("ssh_common_args") or context.get("ssh_common_args")
if common_args:
ssh_options.extend(shlex.split(common_args))
ssh_port = spec.get("ssh_port") or context.get("ssh_port")
if ssh_port:
ssh_options.extend(["-p", str(ssh_port)])
identity_file = spec.get("ssh_identity_file") or context.get("ssh_identity_file")
if identity_file:
ssh_options.extend(["-i", identity_file])
command_list = ["ssh", *ssh_options, ssh_target, rendered_command]
display = " ".join(shlex.quote(part) for part in command_list)
return command_list, display, False, "ssh"
# default to local shell execution
display = rendered_command
return rendered_command, display, True, "local"
def run_subprocess_command(
command: Any,
display: str,
summary: str,
use_shell: bool,
runner_label: str,
) -> Dict[str, Any]:
"""Execute subprocess command and capture results."""
LOGGER.info("Executing command (%s) via %s: %s", summary, runner_label, display)
try:
completed = subprocess.run(
command,
capture_output=True,
text=True,
timeout=TRIAGE_COMMAND_TIMEOUT,
shell=use_shell,
check=False,
)
result = {
"summary": summary,
"command": display,
"runner": runner_label,
"exit_code": completed.returncode,
"stdout": (completed.stdout or "").strip(),
"stderr": (completed.stderr or "").strip(),
"status": "ok" if completed.returncode == 0 else "failed",
}
log_verbose(f"Command result ({summary})", result)
return result
except subprocess.TimeoutExpired as exc:
result = {
"summary": summary,
"command": display,
"runner": runner_label,
"exit_code": None,
"stdout": truncate_text((exc.stdout or "").strip()),
"stderr": truncate_text((exc.stderr or "").strip()),
"status": "timeout",
"error": f"Command timed out after {TRIAGE_COMMAND_TIMEOUT}s",
}
log_verbose(f"Command timeout ({summary})", result)
return result
except Exception as exc: # pylint: disable=broad-except
LOGGER.exception("Command execution failed (%s): %s", summary, exc)
result = {
"summary": summary,
"command": display,
"runner": runner_label,
"exit_code": None,
"stdout": "",
"stderr": "",
"status": "error",
"error": str(exc),
}
log_verbose(f"Command error ({summary})", result)
return result
def run_command_spec(spec: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
summary = spec.get("summary") or spec.get("name") or "command"
shell_cmd = spec.get("shell")
if not shell_cmd:
return {"summary": summary, "status": "skipped", "error": "No shell command provided."}
rendered = render_template(shell_cmd, context)
preferred_runner = context.get("preferred_runner")
runner_choice = (spec.get("runner") or preferred_runner or TRIAGE_COMMAND_RUNNER).lower()
try:
command, display, use_shell, runner_label = build_runner_command(rendered, runner_choice, context, spec)
except RuntimeError as exc:
LOGGER.warning("Skipping command '%s': %s", summary, exc)
return {"summary": summary, "status": "skipped", "error": str(exc), "command": rendered}
return run_subprocess_command(command, display, summary, use_shell, runner_label)
def execute_triage_commands(entry: Dict[str, Any], alert: Dict[str, Any], context: Dict[str, Any]) -> List[Dict[str, Any]]:
host_os = context.get("host_os") or determine_host_os(alert)
context["host_os"] = host_os
specs = gather_command_specs(entry, host_os)
if not specs:
LOGGER.info("No triage commands defined for host_os=%s", host_os)
return []
if not TRIAGE_ENABLE_COMMANDS:
LOGGER.info("Command execution disabled; %d commands queued but skipped.", len(specs))
return []
LOGGER.info("Executing up to %d triage commands for host_os=%s", len(specs), host_os)
results = []
for spec in specs:
results.append(run_command_spec(spec, context))
return results
def format_command_results_for_llm(results: List[Dict[str, Any]]) -> str:
lines: List[str] = []
for idx, result in enumerate(results, start=1):
lines.append(f"{idx}. {result.get('summary')} [{result.get('status')}] {result.get('command')}")
stdout = result.get("stdout")
stderr = result.get("stderr")
error = result.get("error")
if stdout:
lines.append(" stdout:")
lines.append(indent(truncate_text(stdout), " "))
if stderr:
lines.append(" stderr:")
lines.append(indent(truncate_text(stderr), " "))
if error and result.get("status") != "ok":
lines.append(f" error: {error}")
if not lines:
return "No command results were available."
return "\n".join(lines)
def parse_structured_response(text: str) -> Optional[Dict[str, Any]]:
cleaned = text.strip()
try:
return json.loads(cleaned)
except json.JSONDecodeError:
start = cleaned.find("{")
end = cleaned.rfind("}")
if start != -1 and end != -1 and end > start:
snippet = cleaned[start : end + 1]
try:
return json.loads(snippet)
except json.JSONDecodeError:
return None
return None
def normalize_followup_command(item: Dict[str, Any]) -> Dict[str, Any]:
return {
"summary": item.get("summary") or item.get("name") or "Follow-up command",
"shell": item.get("command") or item.get("shell"),
"runner": item.get("runner"),
"host": item.get("host") or item.get("target"),
"ssh_user": item.get("ssh_user"),
"os": (item.get("os") or item.get("platform") or "").lower() or None,
}
def investigate_with_langchain(
entry: Dict[str, Any],
alert: Dict[str, Any],
parent_payload: Dict[str, Any],
context: Dict[str, Any],
initial_outputs: List[Dict[str, Any]],
) -> Tuple[str, List[Dict[str, Any]]]:
command_outputs = list(initial_outputs)
prompt = build_prompt(entry, alert, parent_payload, context, command_outputs)
log_verbose("Initial investigation prompt", prompt)
if not OPENROUTER_API_KEY:
return "OPENROUTER_API_KEY is not configured; unable to analyze alert.", command_outputs
llm = OpenRouterLLM(api_key=OPENROUTER_API_KEY, model_name=OPENROUTER_MODEL)
dialogue = (
prompt
+ "\n\nRespond with JSON containing fields analysis, followup_commands, and complete. "
"Request commands only when more evidence is required."
)
total_followup = 0
final_summary = ""
for iteration in range(TRIAGE_MAX_ITERATIONS):
log_verbose(f"LLM dialogue iteration {iteration + 1}", dialogue)
llm_text = llm(dialogue)
log_verbose(f"LLM iteration {iteration + 1} output", llm_text)
dialogue += f"\nAssistant:\n{llm_text}\n"
parsed = parse_structured_response(llm_text)
if parsed:
log_verbose(f"LLM iteration {iteration + 1} parsed response", parsed)
if not parsed:
final_summary = llm_text
break
analysis = parsed.get("analysis") or ""
followups = parsed.get("followup_commands") or parsed.get("commands") or []
final_summary = analysis
complete_flag = bool(parsed.get("complete"))
if complete_flag or not followups:
break
log_verbose(f"LLM iteration {iteration + 1} requested follow-ups", followups)
allowed = max(0, TRIAGE_FOLLOWUP_MAX_COMMANDS - total_followup)
if not TRIAGE_ENABLE_COMMANDS or allowed <= 0:
dialogue += (
"\nUser:\nCommand execution is disabled or budget exhausted. Provide final analysis with JSON format.\n"
)
continue
normalized_cmds: List[Dict[str, Any]] = []
for raw in followups:
if not isinstance(raw, dict):
continue
normalized = normalize_followup_command(raw)
if not normalized.get("shell"):
continue
cmd_os = normalized.get("os")
if cmd_os and cmd_os != context.get("host_os"):
continue
normalized_cmds.append(normalized)
log_verbose(f"Normalized follow-up commands (iteration {iteration + 1})", normalized_cmds)
if not normalized_cmds:
dialogue += "\nUser:\nNo valid commands to run. Finalize analysis in JSON format.\n"
continue
normalized_cmds = normalized_cmds[:allowed]
executed_batch: List[Dict[str, Any]] = []
for spec in normalized_cmds:
executed = run_command_spec(spec, context)
command_outputs.append(executed)
executed_batch.append(executed)
total_followup += 1
result_text = "Follow-up command results:\n" + format_command_results_for_llm(executed_batch)
dialogue += (
"\nUser:\n"
+ result_text
+ "\nUpdate your analysis and respond with JSON (analysis, followup_commands, complete).\n"
)
log_verbose("Executed follow-up commands", result_text)
else:
final_summary = final_summary or "Reached maximum iterations without a conclusive response."
if not final_summary:
final_summary = "LLM did not return a valid analysis."
log_verbose("Final LLM summary", final_summary)
return final_summary, command_outputs
def build_context(alert: Dict[str, Any], parent_payload: Dict[str, Any]) -> Dict[str, Any]:
labels = alert.get("labels", {})
annotations = alert.get("annotations", {})
context = {
"alertname": labels.get("alertname") or alert.get("title") or parent_payload.get("title") or parent_payload.get("ruleName"),
"host": labels.get("host") or labels.get("instance"),
"iface": labels.get("interface"),
"device": labels.get("device"),
"vmid": labels.get("vmid"),
"status": alert.get("status") or parent_payload.get("status"),
"value": alert.get("value") or annotations.get("value"),
"rule_url": alert.get("ruleUrl") or parent_payload.get("ruleUrl"),
}
context.setdefault("ssh_user", TRIAGE_SSH_USER)
return context
def build_prompt(
entry: Dict[str, Any],
alert: Dict[str, Any],
parent_payload: Dict[str, Any],
context: Dict[str, Any],
command_outputs: Optional[List[Dict[str, Any]]] = None,
) -> str:
template = entry.get("llm_prompt", "Alert {{ alertname }} fired for {{ host }}.")
rendered_template = render_template(template, {k: v or "" for k, v in context.items()})
evidence = entry.get("evidence_to_collect", [])
triage_steps = entry.get("triage", [])
remediation = entry.get("remediation", [])
lines = [
rendered_template.strip(),
"",
"Alert payload summary:",
f"- Status: {context.get('status') or alert.get('status')}",
f"- Host: {context.get('host')}",
f"- Value: {context.get('value')}",
f"- StartsAt: {alert.get('startsAt')}",
f"- EndsAt: {alert.get('endsAt')}",
f"- RuleURL: {context.get('rule_url')}",
f"- Host OS (inferred): {context.get('host_os')}",
"- Note: All timestamps are UTC/RFC3339 as provided by Grafana.",
summarize_dict("- Labels", alert.get("labels")),
summarize_dict("- Annotations", alert.get("annotations")),
]
if evidence:
lines.append("")
lines.append("Evidence to gather (for automation reference):")
for item in evidence:
lines.append(f"- {item}")
if triage_steps:
lines.append("")
lines.append("Suggested manual checks:")
for step in triage_steps:
summary = step.get("summary")
linux = step.get("linux")
windows = step.get("windows")
lines.append(f"- {summary}")
if linux:
lines.append(f" Linux: {linux}")
if windows:
lines.append(f" Windows: {windows}")
if remediation:
lines.append("")
lines.append("Remediation ideas:")
for item in remediation:
lines.append(f"- {item}")
if command_outputs:
lines.append("")
lines.append("Command execution results:")
for result in command_outputs:
status = result.get("status", "unknown")
cmd_display = result.get("command", "")
lines.append(f"- {result.get('summary')} [{status}] {cmd_display}")
stdout = result.get("stdout")
stderr = result.get("stderr")
error = result.get("error")
if stdout:
lines.append(" stdout:")
lines.append(indent(truncate_text(stdout), " "))
if stderr:
lines.append(" stderr:")
lines.append(indent(truncate_text(stderr), " "))
if error and status != "ok":
lines.append(f" error: {error}")
return "\n".join(lines).strip()
def get_alerts(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
alerts = payload.get("alerts")
if isinstance(alerts, list) and alerts:
return alerts
return [payload]
@app.on_event("startup")
def startup_event() -> None:
global _RUNBOOK_INDEX, _INVENTORY_INDEX, _INVENTORY_GROUP_VARS
_RUNBOOK_INDEX = load_runbook()
_INVENTORY_INDEX, _INVENTORY_GROUP_VARS = load_ansible_inventory()
LOGGER.info(
"Alert webhook server ready with %d runbook entries and %d inventory hosts.",
len(_RUNBOOK_INDEX),
len(_INVENTORY_INDEX),
)
@app.post("/alerts")
async def handle_alert(request: Request) -> Dict[str, Any]:
payload = await request.json()
LOGGER.info("Received Grafana payload: %s", json.dumps(payload, indent=2, sort_keys=True))
results = []
unmatched = []
for alert in get_alerts(payload):
LOGGER.info("Processing alert: %s", json.dumps(alert, indent=2, sort_keys=True))
unmatched_reason: Optional[str] = None
alert_status = str(alert.get("status") or payload.get("status") or "").lower()
if alert_status and alert_status != "firing":
details = {"reason": "non_firing_status", "status": alert_status, "alert": alert}
unmatched.append(details)
LOGGER.info("Skipping alert with status=%s (only 'firing' alerts are processed).", alert_status)
continue
rule_uid = extract_rule_uid(alert, payload)
if not rule_uid:
unmatched_reason = "missing_rule_uid"
derived_uid = derive_fallback_rule_uid(alert, payload)
details = {"reason": unmatched_reason, "derived_rule_uid": derived_uid, "alert": alert}
unmatched.append(details)
LOGGER.warning("Alert missing rule UID, using fallback identifier %s", derived_uid)
rule_uid = derived_uid
entry = _RUNBOOK_INDEX.get(rule_uid)
runbook_matched = entry is not None
if not entry:
unmatched_reason = unmatched_reason or "no_runbook_entry"
details = {"reason": unmatched_reason, "rule_uid": rule_uid, "alert": alert}
unmatched.append(details)
LOGGER.warning("No runbook entry for rule_uid=%s, using generic fallback.", rule_uid)
entry = build_fallback_runbook_entry(alert, payload)
context = build_context(alert, payload)
context["host_os"] = determine_host_os(alert)
context["rule_uid"] = rule_uid
apply_inventory_context(context)
initial_outputs = execute_triage_commands(entry, alert, context)
try:
llm_text, command_outputs = investigate_with_langchain(entry, alert, payload, context, initial_outputs)
except Exception as exc: # pylint: disable=broad-except
LOGGER.exception("Investigation failed for rule_uid=%s: %s", rule_uid, exc)
raise HTTPException(status_code=502, detail=f"LLM investigation error: {exc}") from exc
result = {
"rule_uid": rule_uid,
"alertname": entry.get("name"),
"host": alert.get("labels", {}).get("host"),
"llm_summary": llm_text,
"command_results": command_outputs,
"runbook_matched": runbook_matched,
}
if not runbook_matched and unmatched_reason:
result["fallback_reason"] = unmatched_reason
results.append(result)
send_summary_email(alert, result, context)
return {"processed": len(results), "results": results, "unmatched": unmatched}
@app.post("/reload-runbook")
def reload_runbook() -> Dict[str, Any]:
global _RUNBOOK_INDEX, _INVENTORY_INDEX, _INVENTORY_GROUP_VARS
_RUNBOOK_INDEX = load_runbook()
_INVENTORY_INDEX, _INVENTORY_GROUP_VARS = load_ansible_inventory()
return {"entries": len(_RUNBOOK_INDEX), "inventory_hosts": len(_INVENTORY_INDEX)}

View File

@@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""
Log anomaly checker that queries Elasticsearch and asks an OpenRouter-hosted LLM
for a quick triage summary. Intended to be run on a schedule (cron/systemd).
Required environment variables:
ELASTIC_HOST e.g. https://casper.localdomain:9200
ELASTIC_API_KEY Base64 ApiKey used for Elasticsearch requests
OPENROUTER_API_KEY Token for https://openrouter.ai/
Optional environment variables:
OPENROUTER_MODEL Model identifier (default: openai/gpt-4o-mini)
OPENROUTER_REFERER Passed through as HTTP-Referer header
OPENROUTER_TITLE Passed through as X-Title header
"""
from __future__ import annotations
import argparse
import datetime as dt
import os
import sys
from typing import Any, Iterable
import requests
def utc_iso(ts: dt.datetime) -> str:
"""Return an ISO8601 string with Z suffix."""
return ts.replace(microsecond=0).isoformat() + "Z"
def query_elasticsearch(
host: str,
api_key: str,
index_pattern: str,
minutes: int,
size: int,
verify: bool,
) -> list[dict[str, Any]]:
"""Fetch recent logs from Elasticsearch."""
end = dt.datetime.utcnow()
start = end - dt.timedelta(minutes=minutes)
url = f"{host.rstrip('/')}/{index_pattern}/_search"
payload = {
"size": size,
"sort": [{"@timestamp": {"order": "desc"}}],
"query": {
"range": {
"@timestamp": {
"gte": utc_iso(start),
"lte": utc_iso(end),
}
}
},
"_source": ["@timestamp", "message", "host.name", "container.image.name", "log.level"],
}
headers = {
"Authorization": f"ApiKey {api_key}",
"Content-Type": "application/json",
}
response = requests.post(url, json=payload, headers=headers, timeout=30, verify=verify)
response.raise_for_status()
hits = response.json().get("hits", {}).get("hits", [])
return hits
def build_prompt(logs: Iterable[dict[str, Any]], limit_messages: int) -> str:
"""Create the prompt that will be sent to the LLM."""
selected = []
for idx, hit in enumerate(logs):
if idx >= limit_messages:
break
source = hit.get("_source", {})
message = source.get("message") or source.get("event", {}).get("original") or ""
timestamp = source.get("@timestamp", "unknown time")
host = source.get("host", {}).get("name") or source.get("host", {}).get("hostname") or "unknown-host"
container = source.get("container", {}).get("image", {}).get("name") or ""
level = source.get("log", {}).get("level") or source.get("log.level") or ""
selected.append(
f"[{timestamp}] host={host} level={level} container={container}\n{message}".strip()
)
if not selected:
return "No logs were returned from Elasticsearch in the requested window."
prompt = (
"You are assisting with HomeLab observability. Review the following log entries collected from "
"Elasticsearch and highlight any notable anomalies, errors, or emerging issues. "
"Explain the impact and suggest next steps when applicable. "
"Use concise bullet points. Logs:\n\n"
+ "\n\n".join(selected)
)
return prompt
def call_openrouter(prompt: str, model: str, api_key: str, referer: str | None, title: str | None) -> str:
"""Send prompt to OpenRouter and return the model response text."""
url = "https://openrouter.ai/api/v1/chat/completions"
headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json",
}
if referer:
headers["HTTP-Referer"] = referer
if title:
headers["X-Title"] = title
body = {
"model": model,
"messages": [
{"role": "system", "content": "You are a senior SRE helping analyze log anomalies."},
{"role": "user", "content": prompt},
],
}
response = requests.post(url, json=body, headers=headers, timeout=60)
response.raise_for_status()
data = response.json()
choices = data.get("choices", [])
if not choices:
raise RuntimeError("OpenRouter response did not include choices")
return choices[0]["message"]["content"]
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Query Elasticsearch and summarize logs with OpenRouter.")
parser.add_argument("--host", default=os.environ.get("ELASTIC_HOST"), help="Elasticsearch host URL")
parser.add_argument("--api-key", default=os.environ.get("ELASTIC_API_KEY"), help="Elasticsearch ApiKey")
parser.add_argument("--index", default="log*", help="Index pattern (default: log*)")
parser.add_argument("--minutes", type=int, default=60, help="Lookback window in minutes (default: 60)")
parser.add_argument("--size", type=int, default=200, help="Max number of logs to fetch (default: 200)")
parser.add_argument("--message-limit", type=int, default=50, help="Max log lines sent to LLM (default: 50)")
parser.add_argument("--openrouter-model", default=os.environ.get("OPENROUTER_MODEL", "openai/gpt-4o-mini"))
parser.add_argument("--insecure", action="store_true", help="Disable TLS verification for Elasticsearch")
return parser.parse_args()
def main() -> int:
args = parse_args()
if not args.host or not args.api_key:
print("ELASTIC_HOST and ELASTIC_API_KEY must be provided via environment or CLI", file=sys.stderr)
return 1
logs = query_elasticsearch(
host=args.host,
api_key=args.api_key,
index_pattern=args.index,
minutes=args.minutes,
size=args.size,
verify=not args.insecure,
)
prompt = build_prompt(logs, limit_messages=args.message_limit)
if not prompt.strip() or prompt.startswith("No logs"):
print(prompt)
return 0
openrouter_key = os.environ.get("OPENROUTER_API_KEY")
if not openrouter_key:
print("OPENROUTER_API_KEY is required to summarize logs", file=sys.stderr)
return 1
referer = os.environ.get("OPENROUTER_REFERER")
title = os.environ.get("OPENROUTER_TITLE", "Elastic Log Monitor")
response_text = call_openrouter(
prompt=prompt,
model=args.openrouter_model,
api_key=openrouter_key,
referer=referer,
title=title,
)
print(response_text.strip())
return 0
if __name__ == "__main__":
raise SystemExit(main())

stacks/mllogwatcher/testing.py Executable file
View File

@@ -0,0 +1,17 @@
# pip install -qU langchain "langchain[anthropic]"
from langchain.agents import create_agent


def get_weather(city: str) -> str:
    """Get weather for a given city."""
    return f"It's always sunny in {city}!"


agent = create_agent(
    model="claude-sonnet-4-5-20250929",
    tools=[get_weather],
    system_prompt="You are a helpful assistant",
)

# Run the agent
agent.invoke(
    {"messages": [{"role": "user", "content": "what is the weather in sf"}]}
)

View File

@@ -0,0 +1,10 @@
# Worklog 2025-12-29
1. Added containerization assets for grafana_alert_webhook:
- `Dockerfile`, `.dockerignore`, `docker-compose.yml`, `.env.example`, and consolidated `requirements.txt`.
- Compose mounts the runbook, `/etc/ansible/hosts`, and `.ssh` so SSH automation works inside the container.
- README now documents the compose workflow.
2. Copied knights SSH key to `.ssh/webhook_id_rsa` and updated `jet-alone` inventory entry with `ansible_user` + `ansible_ssh_private_key_file` so remote commands can run non-interactively.
3. Updated `OpenRouterLLM` to satisfy Pydantic's field validation inside the container.
4. Brought the webhook up under Docker Compose, tested alerts end-to-end, and reverted `OPENROUTER_MODEL` to the valid `openai/gpt-5.1-codex-max`.
5. Created `/var/core/ansible/ops_baseline.yml` to install sysstat/iotop/smartmontools/hdparm and enforce synchronized Bash history (`/etc/profile.d/99-bash-history.sh`). Ran the playbook against the primary LAN hosts; noted remediation items for the few that failed (outdated mirrors, pending grub configuration, missing sudo password).