Compare commits

10 Commits

Author SHA1 Message Date
f299126ab2 Point compose to remote Elasticsearch and Qdrant 2025-11-18 13:25:41 -05:00
86fd017f3c Add Docker and compose setup 2025-11-18 13:21:14 -05:00
40d4f41f6e Add graph and vector search features 2025-11-09 14:24:50 -05:00
14d37f23e4 Add clickable reference badges and improve UI layout
- Add clickable badges for backlinks and references that trigger query string searches
- Improve toggle checkbox layout with better styling
- Add description block styling with scrollable container
- Update results styling with bordered cards and shadows
- Add favicon support across pages
- Enhance .env loading with logging for debugging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 14:56:43 -05:00
d8d2c5e34c Fix results overflow and add debug logging for reference badges
CSS Changes:
- Added max-width and overflow handling to .badge-row
- Added word-wrap and overflow protection to .item
- Added overflow-x: hidden to .window-body
- Badges now use white-space: nowrap to prevent text wrapping
- Item titles now break words properly with word-break

JavaScript Changes:
- Added console.log debugging for reference counts
- Logs show whether fields are present and their values
- Helps diagnose why badges aren't appearing

This should fix the overflow issue and help debug badge visibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 11:18:17 -05:00
595b19f7c7 Fix sorting by referenced_by_count with unmapped_type handling
- Added unmapped_type parameter to referenced_by_count sort
- This handles documents that don't have the field yet
- Updated ingest.py to include reference fields when indexing:
  * internal_references
  * internal_references_count
  * referenced_by
  * referenced_by_count
- Updated index mapping to include reference fields
- Documents without the field will sort as 0 (appear last)

Fixes BadRequestError: No mapping found for [referenced_by_count]

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 11:10:56 -05:00
d616b87701 Add python-dotenv support for automatic .env loading
- Added python-dotenv to requirements.txt
- Config now automatically loads .env file if present
- Allows local development without manually exporting env vars
- Gracefully falls back if python-dotenv not installed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 11:03:42 -05:00
7988e2751a Add video reference tracking and display
- Add "Most referenced" sort option to sort by backlink count
- Backend now supports sorting by referenced_by_count field
- Search results now display reference counts as badges:
  - Shows number of backlinks (videos linking to this one)
  - Shows number of internal references (outbound links)
- Reference badges appear alongside transcript source badges

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 10:52:00 -05:00
2846e13a81 Fix timestamp parsing for string format timestamps
Both primary and secondary transcripts use 'timestamp' field
with string format "HH:MM:SS.mmm" instead of numeric seconds.

Changes:
- Add parseTimestampToSeconds() to handle string timestamps
- Parse "HH:MM:SS.mmm" format (e.g., "00:00:39.480")
- Also handle "MM:SS" format
- Still support numeric timestamps (seconds or milliseconds)
- Check 'timestamp' field first (primary format in data)

This fixes the NaN issue and displays correct timestamps
for both primary and secondary transcripts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 01:16:26 -05:00
e241d206c5 Fix NaN timestamps with proper type checking
Previous || chain could pass through invalid values causing NaN.
Now explicitly checks each possible timestamp field with:
- null check (field != null)
- NaN check (!isNaN(parseFloat(field)))
- Takes first valid numeric value found

This ensures timestamps always have a valid number, defaulting
to 0 if no valid timestamp field is found.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 01:09:21 -05:00
19 changed files with 3499 additions and 337 deletions
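
The timestamp rules in commits 2846e13a81 and e241d206c5 are described only in prose, and the JavaScript they touch is not part of this excerpt. The following is a minimal Python sketch of those rules, not the author's code: `parse_timestamp_to_seconds` is a hypothetical name mirroring `parseTimestampToSeconds`, numeric inputs are assumed to already be in seconds (the commit's milliseconds handling is not visible here), and invalid input falls back to 0 as the second commit describes.

```python
import math
from typing import Any


def parse_timestamp_to_seconds(value: Any) -> float:
    """Convert "HH:MM:SS.mmm", "MM:SS", or a numeric value to seconds."""
    if value is None:
        return 0.0
    try:
        if isinstance(value, (int, float)):
            # Assumption: numeric timestamps are already expressed in seconds.
            seconds = float(value)
        else:
            text = str(value).strip()
            if ":" in text:
                seconds = 0.0
                # Fold left to right so "MM:SS" and "HH:MM:SS.mmm" share one path.
                for part in text.split(":"):
                    seconds = seconds * 60 + float(part)
            else:
                seconds = float(text)
    except ValueError:
        return 0.0
    # Guard against NaN so callers always receive a usable number.
    return 0.0 if math.isnan(seconds) else seconds
```

For example, `parse_timestamp_to_seconds("01:02")` yields `62.0`, and an unparseable string yields `0.0` instead of propagating NaN to the UI.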

11
.dockerignore Normal file

@@ -0,0 +1,11 @@
.git
.gitignore
.venv
__pycache__
*.pyc
*.pyo
.DS_Store
node_modules
data
videos
*.log

31
AGENTS.md Normal file

@@ -0,0 +1,31 @@
# Repository Guidelines
## Project Structure & Module Organization
- Core modules live under `python_app/`: `config.py` centralizes settings, `transcript_collector.py` gathers transcripts, `ingest.py` handles Elasticsearch bulk loads, and `search_app.py` exposes the Flask UI.
- Static assets belong in `static/` (`index.html`, `frequency.html`, companion JS/CSS). Keep HTML here and wire it up through Flask routes.
- Runtime artifacts land in `data/` (`raw/` for downloads, `video_metadata/` for cleaned payloads). Preserve the JSON schema emitted by the collector.
- When adding utilities, place them in `python_app/` and use package-relative imports so scripts continue to run via `python -m`.
## Build, Test, and Development Commands
- `python -m venv .venv && source .venv/bin/activate`: bootstrap the virtualenv used by all scripts.
- `pip install -r requirements.txt`: install Flask, Elasticsearch tooling, Google API clients, and dotenv support.
- `python -m python_app.transcript_collector --channel UC... --output data/raw`: fetch transcript JSON for a channel; rerun to refresh cached data.
- `python -m python_app.ingest --source data/video_metadata --index this_little_corner_py`: index prepared metadata and auto-create mappings when needed.
- `python -m python_app.search_app`: launch the Flask server on port 8080 for UI smoke tests.
## Coding Style & Naming Conventions
- Follow PEP 8 with 4-space indentation, `snake_case` for functions/modules, and `CamelCase` for classes; reserve UPPER_SNAKE_CASE for configuration constants.
- Keep Elasticsearch payload keys lower-case with underscores, and centralize shared values in `config.py` rather than scattering literals.
## Testing Guidelines
- No automated suite is committed yet; when adding coverage, create `tests/` modules using `pytest` with files named `test_*.py`.
- Focus tests on collector pagination, ingest transformations, and Flask route helpers, and run `python -m pytest` locally before opening a PR.
- Manually verify by ingesting a small sample into a local Elasticsearch node and checking facets, highlights, and transcript retrieval via the UI.
## Commit & Pull Request Guidelines
- Mirror the existing history: short, imperative commit subjects (e.g. “Fix results overflow”, “Add video reference tracking”).
- PRs should describe scope, list environment variables or indices touched, link issues, and attach before/after screenshots whenever UI output changes. Highlight Elasticsearch mapping or data migration impacts for both search and frontend reviewers.
## Configuration & Security Tips
- Load credentials through environment variables (`ELASTIC_URL`, `ELASTIC_USERNAME`, `ELASTIC_PASSWORD`, `ELASTIC_API_KEY`, `YOUTUBE_API_KEY`) or a `.env` file, and keep secrets out of version control.
- Adjust `ELASTIC_VERIFY_CERTS`, `ELASTIC_CA_CERT`, and `ELASTIC_DEBUG` only while debugging, and prefer branch-specific indices (`this_little_corner_py_<initials>`) to avoid clobbering shared data.

32
Dockerfile Normal file

@@ -0,0 +1,32 @@
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
WORKDIR /app
# System deps kept lean to support torch/sentence-transformers wheels.
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential git curl \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt
# Copy the package into /app/python_app so `python -m python_app.search_app` works.
COPY . /app/python_app
ENV ELASTIC_URL=http://elasticsearch:9200 \
    ELASTIC_INDEX=this_little_corner_py \
    ELASTIC_VERIFY_CERTS=0 \
    QDRANT_URL=http://qdrant:6333 \
    QDRANT_COLLECTION=tlc-captions-full \
    QDRANT_VECTOR_NAME= \
    QDRANT_VECTOR_SIZE=1024 \
    QDRANT_EMBED_MODEL=BAAI/bge-large-en-v1.5 \
    LOCAL_DATA_DIR=/app/data/video_metadata
EXPOSE 8080
WORKDIR /app
CMD ["python", "-m", "python_app.search_app"]

View File

@@ -85,3 +85,22 @@ Visit <http://localhost:8080/> and you'll see a barebones UI that:
 Feel free to expand on this scaffold (add proper logging, schedule transcript
 updates, or flesh out the UI) once you're happy with the baseline behaviour.
+## Run with Docker Compose (App Only; Remote ES/Qdrant)
+The provided compose file builds and runs only the Flask app and expects **remote** Elasticsearch/Qdrant endpoints. Supply them via environment variables (set directly or in a `.env` file alongside `docker-compose.yml`):
+```bash
+ELASTIC_URL=https://your-es-host:9200 \
+QDRANT_URL=https://your-qdrant-host:6333 \
+docker compose up --build
+```
+Other tunables (defaults shown in compose):
+- `ELASTIC_INDEX` (default `this_little_corner_py`)
+- `ELASTIC_USERNAME` / `ELASTIC_PASSWORD` or `ELASTIC_API_KEY`
+- `ELASTIC_VERIFY_CERTS` (set to `1` for real TLS verification)
+- `QDRANT_COLLECTION` (default `tlc-captions-full`)
+- `QDRANT_VECTOR_NAME` / `QDRANT_VECTOR_SIZE` / `QDRANT_EMBED_MODEL`
+Port 8080 on the host is forwarded to the app. Mount `./data` (read-only) if you want local fallbacks for metrics (`LOCAL_DATA_DIR=/app/data/video_metadata`); otherwise the app will rely purely on the remote backends. Stop the container with `docker compose down`.

View File

@@ -16,6 +16,20 @@ from dataclasses import dataclass
 from pathlib import Path
 from typing import Optional
+# Load .env file if it exists
+try:
+    from dotenv import load_dotenv
+    import logging
+    _logger = logging.getLogger(__name__)
+    _env_path = Path(__file__).parent / ".env"
+    if _env_path.exists():
+        _logger.info("Loading .env from: %s", _env_path)
+        result = load_dotenv(_env_path, override=True)
+        _logger.info("load_dotenv result: %s", result)
+except ImportError:
+    pass  # python-dotenv not installed
 @dataclass(frozen=True)
 class ElasticSettings:
@@ -44,6 +58,11 @@ class AppConfig:
     elastic: ElasticSettings
     data: DataSettings
     youtube: YoutubeSettings
+    qdrant_url: str
+    qdrant_collection: str
+    qdrant_vector_name: Optional[str]
+    qdrant_vector_size: int
+    qdrant_embed_model: str
 def _env(name: str, default: Optional[str] = None) -> Optional[str]:
@@ -75,7 +94,16 @@ def load_config() -> AppConfig:
     )
     data = DataSettings(root=data_root)
     youtube = YoutubeSettings(api_key=_env("YOUTUBE_API_KEY"))
-    return AppConfig(elastic=elastic, data=data, youtube=youtube)
+    return AppConfig(
+        elastic=elastic,
+        data=data,
+        youtube=youtube,
+        qdrant_url=_env("QDRANT_URL", "http://localhost:6333"),
+        qdrant_collection=_env("QDRANT_COLLECTION", "tlc_embeddings"),
+        qdrant_vector_name=_env("QDRANT_VECTOR_NAME"),
+        qdrant_vector_size=int(_env("QDRANT_VECTOR_SIZE", "1024")),
+        qdrant_embed_model=_env("QDRANT_EMBED_MODEL", "BAAI/bge-large-en-v1.5"),
+    )
 CONFIG = load_config()

26
docker-compose.yml Normal file

@@ -0,0 +1,26 @@
version: "3.9"
# Runs only the Flask app container, pointing to remote Elasticsearch/Qdrant.
# Provide ELASTIC_URL / QDRANT_URL (and related) via environment or a .env file.
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      ELASTIC_URL: ${ELASTIC_URL:?set ELASTIC_URL to your remote Elasticsearch URL}
      ELASTIC_INDEX: ${ELASTIC_INDEX:-this_little_corner_py}
      ELASTIC_USERNAME: ${ELASTIC_USERNAME:-}
      ELASTIC_PASSWORD: ${ELASTIC_PASSWORD:-}
      ELASTIC_API_KEY: ${ELASTIC_API_KEY:-}
      ELASTIC_VERIFY_CERTS: ${ELASTIC_VERIFY_CERTS:-0}
      QDRANT_URL: ${QDRANT_URL:?set QDRANT_URL to your remote Qdrant URL}
      QDRANT_COLLECTION: ${QDRANT_COLLECTION:-tlc-captions-full}
      QDRANT_VECTOR_NAME: ${QDRANT_VECTOR_NAME:-}
      QDRANT_VECTOR_SIZE: ${QDRANT_VECTOR_SIZE:-1024}
      QDRANT_EMBED_MODEL: ${QDRANT_EMBED_MODEL:-BAAI/bge-large-en-v1.5}
      LOCAL_DATA_DIR: ${LOCAL_DATA_DIR:-/app/data/video_metadata}
    volumes:
      - ./data:/app/data:ro
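
The compose file relies on two interpolation forms: `${VAR:-default}` falls back to the default when the variable is unset or empty, while `${VAR:?message}` aborts with the message in the same situation. The Python sketch below imitates that behaviour for illustration only; `interpolate` and its parameters are hypothetical names, not part of Compose or of this repository.

```python
from typing import Mapping, Optional


def interpolate(
    env: Mapping[str, str],
    name: str,
    *,
    default: Optional[str] = None,
    error: Optional[str] = None,
) -> str:
    """Mimic ${NAME:-default} (default given) and ${NAME:?error} (error given)."""
    value = env.get(name, "")
    if value:
        # A set, non-empty variable always wins.
        return value
    if error is not None:
        # ${NAME:?error}: unset or empty is a hard failure.
        raise ValueError(f"required variable {name} is missing a value: {error}")
    # ${NAME:-default}: unset or empty falls back to the default.
    return default or ""
```

Under this model, `ELASTIC_INDEX` resolves to `this_little_corner_py` when nothing is exported, whereas an empty `QDRANT_URL` stops the run before the container starts.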

View File

@@ -90,6 +90,10 @@ def build_bulk_actions(
             "transcript_full": transcript_full,
             "transcript_secondary_full": doc.get("transcript_secondary_full"),
             "transcript_parts": parts,
+            "internal_references": doc.get("internal_references", []),
+            "internal_references_count": doc.get("internal_references_count", 0),
+            "referenced_by": doc.get("referenced_by", []),
+            "referenced_by_count": doc.get("referenced_by_count", 0),
         },
     }
@@ -121,6 +125,10 @@ def ensure_index(client: "Elasticsearch", index: str) -> None:
                         "text": {"type": "text"},
                     },
                 },
+                "internal_references": {"type": "keyword"},
+                "internal_references_count": {"type": "integer"},
+                "referenced_by": {"type": "keyword"},
+                "referenced_by_count": {"type": "integer"},
             }
         },
     )
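
The reference fields indexed here pair with the `referenced` sort added in the search module. The sketch below shows both halves side by side so the defaults are easy to check; `reference_fields` and `build_sort_clause` are hypothetical helper names, and only the sort branches visible in this diff are included. As far as I understand the Elasticsearch sort options, `unmapped_type` lets the sort proceed against an index whose mapping lacks the field, treating those documents as having no value.

```python
from typing import Any, Dict, List


def reference_fields(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the four reference fields with the defaults used at ingest time."""
    return {
        "internal_references": doc.get("internal_references", []),
        "internal_references_count": doc.get("internal_references_count", 0),
        "referenced_by": doc.get("referenced_by", []),
        "referenced_by_count": doc.get("referenced_by_count", 0),
    }


def build_sort_clause(sort: str) -> List[Dict[str, Any]]:
    """Return the Elasticsearch sort list for the sort modes shown in the diff."""
    if sort == "older":
        return [{"date": {"order": "asc"}}]
    if sort == "referenced":
        # unmapped_type avoids "No mapping found" on indices without the field.
        return [{"referenced_by_count": {"order": "desc", "unmapped_type": "long"}}]
    # Assumption: any other value keeps the default relevance ordering.
    return []
```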

View File

@@ -2,3 +2,6 @@ Flask>=2.3
 elasticsearch>=7.0.0,<9.0.0
 youtube-transcript-api>=0.6
 google-api-python-client>=2.0.0
+python-dotenv>=0.19.0
+requests>=2.31.0
+sentence-transformers>=2.7.0

View File

@@ -1,11 +1,15 @@
 """
-Flask application exposing a minimal search API backed by Elasticsearch.
+Flask application exposing search, graph, and transcript endpoints for TLC.
 Routes:
-    GET / -> Static HTML search page.
-    GET /api/channels -> List available channels (via terms aggregation).
-    GET /api/search -> Search index with pagination and simple highlighting.
-    GET /api/transcript -> Return full transcript for a given video_id.
+    GET / -> static HTML search page.
+    GET /graph -> static reference graph UI.
+    GET /vector-search -> experimental Qdrant vector search UI.
+    GET /api/channels -> channels aggregation.
+    GET /api/search -> Elasticsearch keyword search.
+    POST /api/vector-search -> Qdrant vector similarity query.
+    GET /api/graph -> reference graph API.
+    GET /api/transcript -> transcript JSON payload.
 """
 from __future__ import annotations
@@ -15,13 +19,20 @@ import json
 import logging
 import re
 from pathlib import Path
-from typing import Any, Dict, Iterable, List, Optional, Sequence, Set
-from collections import Counter
+from typing import Any, Dict, Iterable, List, Optional, Sequence, Set, Tuple
+from collections import Counter, deque
 from datetime import datetime
 from flask import Flask, jsonify, request, send_from_directory
+import requests
+try:
+    from sentence_transformers import SentenceTransformer  # type: ignore
+except ImportError:  # pragma: no cover - optional dependency
+    SentenceTransformer = None
 from .config import CONFIG, AppConfig
 try:
@@ -32,6 +43,35 @@ except ImportError: # pragma: no cover - dependency optional
     BadRequestError = Exception  # type: ignore
 LOGGER = logging.getLogger(__name__)
+_EMBED_MODEL = None
+_EMBED_MODEL_NAME: Optional[str] = None
+def _ensure_embedder(model_name: str) -> "SentenceTransformer":
+    global _EMBED_MODEL, _EMBED_MODEL_NAME
+    if SentenceTransformer is None:  # pragma: no cover - optional dependency
+        raise RuntimeError(
+            "sentence-transformers is required for vector search. Install via pip install sentence-transformers."
+        )
+    if _EMBED_MODEL is None or _EMBED_MODEL_NAME != model_name:
+        LOGGER.info("Loading embedding model: %s", model_name)
+        _EMBED_MODEL = SentenceTransformer(model_name)
+        _EMBED_MODEL_NAME = model_name
+    return _EMBED_MODEL
+def embed_query(text: str, *, model_name: str, expected_dim: int) -> List[float]:
+    embedder = _ensure_embedder(model_name)
+    vector = embedder.encode(
+        [f"query: {text}"],
+        show_progress_bar=False,
+        normalize_embeddings=True,
+    )[0].tolist()
+    if len(vector) != expected_dim:
+        raise RuntimeError(
+            f"Embedding dimension mismatch (expected {expected_dim}, got {len(vector)})"
+        )
+    return vector
 def _ensure_client(config: AppConfig) -> "Elasticsearch":
@@ -286,6 +326,24 @@ def parse_channel_params(values: Iterable[Optional[str]]) -> List[str]:
     return channels
+def build_year_filter(year: Optional[str]) -> Optional[Dict]:
+    if not year:
+        return None
+    try:
+        year_int = int(year)
+        return {
+            "range": {
+                "date": {
+                    "gte": f"{year_int}-01-01",
+                    "lt": f"{year_int + 1}-01-01",
+                    "format": "yyyy-MM-dd"
+                }
+            }
+        }
+    except (ValueError, TypeError):
+        return None
 def build_channel_filter(channels: Optional[Sequence[str]]) -> Optional[Dict]:
     if not channels:
         return None
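
`build_year_filter` uses a half-open range: `gte` the first day of the year and `lt` the first day of the next year, so 31 December is included and 1 January of the following year is not. The sketch below restates that boundary logic with plain Python dates to make it easy to verify; `year_bounds` and `in_year` are hypothetical helpers for illustration and do not talk to Elasticsearch.

```python
from datetime import date
from typing import Optional, Tuple


def year_bounds(year: Optional[str]) -> Optional[Tuple[date, date]]:
    """Return (inclusive start, exclusive end) for a year string, or None."""
    if not year:
        return None
    try:
        year_int = int(year)
        return date(year_int, 1, 1), date(year_int + 1, 1, 1)
    except (ValueError, TypeError):
        # Non-numeric or out-of-range years mean "no filter", as in the diff.
        return None


def in_year(day: date, year: Optional[str]) -> bool:
    """Check whether a date would pass the half-open year filter."""
    bounds = year_bounds(year)
    if bounds is None:
        return True  # no filter applied
    start, end = bounds
    return start <= day < end
```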
@@ -320,6 +378,7 @@ def build_query_payload(
     query: str,
     *,
     channels: Optional[Sequence[str]] = None,
+    year: Optional[str] = None,
     sort: str = "relevant",
     use_exact: bool = True,
     use_fuzzy: bool = True,
@@ -333,6 +392,10 @@ def build_query_payload(
     if channel_filter:
         filters.append(channel_filter)
+    year_filter = build_year_filter(year)
+    if year_filter:
+        filters.append(year_filter)
     if use_query_string:
         base_fields = ["title^3", "description^2", "transcript_full", "transcript_secondary_full"]
         qs_query = (query or "").strip() or "*"
@@ -376,6 +439,8 @@ def build_query_payload(
             body["sort"] = [{"date": {"order": "desc"}}]
         elif sort == "older":
             body["sort"] = [{"date": {"order": "asc"}}]
+        elif sort == "referenced":
+            body["sort"] = [{"referenced_by_count": {"order": "desc", "unmapped_type": "long"}}]
         return body
     if query:
@@ -403,6 +468,17 @@ def build_query_payload(
                     }
                 }
             )
+            should.append(
+                {
+                    "match_phrase": {
+                        "title": {
+                            "query": query,
+                            "slop": 0,
+                            "boost": 50.0,
+                        }
+                    }
+                }
+            )
         if use_fuzzy:
             should.append(
                 {
@@ -479,6 +555,8 @@ def build_query_payload(
         body["sort"] = [{"date": {"order": "desc"}}]
     elif sort == "older":
         body["sort"] = [{"date": {"order": "asc"}}]
+    elif sort == "referenced":
+        body["sort"] = [{"referenced_by_count": {"order": "desc", "unmapped_type": "long"}}]
     return body
@@ -486,15 +564,182 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
     app = Flask(__name__, static_folder=str(Path(__file__).parent / "static"))
     client = _ensure_client(config)
     index = config.elastic.index
+    qdrant_url = config.qdrant_url
+    qdrant_collection = config.qdrant_collection
+    qdrant_vector_name = config.qdrant_vector_name
+    qdrant_vector_size = config.qdrant_vector_size
+    qdrant_embed_model = config.qdrant_embed_model
     @app.route("/")
     def index_page():
         return send_from_directory(app.static_folder, "index.html")
+    @app.route("/graph")
+    def graph_page():
+        return send_from_directory(app.static_folder, "graph.html")
+    @app.route("/vector-search")
+    def vector_search_page():
+        return send_from_directory(app.static_folder, "vector.html")
     @app.route("/static/<path:filename>")
     def static_files(filename: str):
         return send_from_directory(app.static_folder, filename)
+    def normalize_reference_list(values: Any) -> List[str]:
+        if values is None:
+            return []
+        if isinstance(values, (list, tuple, set)):
+            iterable = values
+        else:
+            iterable = [values]
+        normalized: List[str] = []
+        for item in iterable:
+            candidate: Optional[str]
+            if isinstance(item, dict):
+                candidate = item.get("video_id") or item.get("id")  # type: ignore[assignment]
+            else:
+                candidate = item  # type: ignore[assignment]
+            if candidate is None:
+                continue
+            text = str(candidate).strip()
+            if not text:
+                continue
+            if text.lower() in {"none", "null"}:
+                continue
+            normalized.append(text)
+        return normalized
+    def build_graph_payload(
+        root_id: str, depth: int, max_nodes: int
+    ) -> Dict[str, Any]:
+        root_id = root_id.strip()
+        if not root_id:
+            return {"nodes": [], "links": [], "root": root_id, "depth": depth, "meta": {}}
+        doc_cache: Dict[str, Optional[Dict[str, Any]]] = {}
+        def fetch_document(video_id: str) -> Optional[Dict[str, Any]]:
+            if video_id in doc_cache:
+                return doc_cache[video_id]
+            try:
+                result = client.get(index=index, id=video_id)
+                doc_cache[video_id] = result.get("_source")
+            except Exception as exc:  # pragma: no cover - elasticsearch handles errors
+                LOGGER.debug("Graph: failed to load %s: %s", video_id, exc)
+                doc_cache[video_id] = None
+            return doc_cache[video_id]
+        nodes: Dict[str, Dict[str, Any]] = {}
+        links: List[Dict[str, Any]] = []
+        link_seen: Set[Tuple[str, str, str]] = set()
+        queue: deque[Tuple[str, int]] = deque([(root_id, 0)])
+        queued: Set[str] = {root_id}
+        visited: Set[str] = set()
+        while queue and len(nodes) < max_nodes:
+            current_id, level = queue.popleft()
+            queued.discard(current_id)
+            if current_id in visited:
+                continue
+            doc = fetch_document(current_id)
+            if doc is None:
+                if current_id == root_id:
+                    break
+                visited.add(current_id)
+                continue
+            visited.add(current_id)
+            nodes[current_id] = {
+                "id": current_id,
+                "title": doc.get("title") or current_id,
+                "channel_id": doc.get("channel_id"),
+                "channel_name": doc.get("channel_name") or doc.get("channel_id") or "Unknown",
+                "url": doc.get("url"),
+                "date": doc.get("date"),
+                "is_root": current_id == root_id,
+            }
+            if level >= depth:
+                continue
+            neighbor_ids: List[str] = []
+            for ref_id in normalize_reference_list(doc.get("internal_references")):
+                if ref_id == current_id:
+                    continue
+                key = (current_id, ref_id, "references")
+                if key not in link_seen:
+                    links.append(
+                        {"source": current_id, "target": ref_id, "relation": "references"}
+                    )
+                    link_seen.add(key)
+                neighbor_ids.append(ref_id)
+            for ref_id in normalize_reference_list(doc.get("referenced_by")):
+                if ref_id == current_id:
+                    continue
+                key = (ref_id, current_id, "referenced_by")
+                if key not in link_seen:
+                    links.append(
+                        {"source": ref_id, "target": current_id, "relation": "referenced_by"}
+                    )
+                    link_seen.add(key)
+                neighbor_ids.append(ref_id)
+            for neighbor in neighbor_ids:
+                if neighbor in visited or neighbor in queued:
+                    continue
+                if len(nodes) + len(queue) >= max_nodes:
+                    break
+                queue.append((neighbor, level + 1))
+                queued.add(neighbor)
+        # Ensure nodes referenced by links exist in the payload.
+        for link in links:
+            for key in ("source", "target"):
+                node_id = link[key]
+                if node_id in nodes:
+                    continue
+                doc = fetch_document(node_id)
+                if doc is None:
+                    nodes[node_id] = {
+                        "id": node_id,
+                        "title": node_id,
+                        "channel_id": None,
+                        "channel_name": "Unknown",
+                        "url": None,
+                        "date": None,
+                        "is_root": node_id == root_id,
+                    }
+                else:
+                    nodes[node_id] = {
+                        "id": node_id,
+                        "title": doc.get("title") or node_id,
+                        "channel_id": doc.get("channel_id"),
+                        "channel_name": doc.get("channel_name") or doc.get("channel_id") or "Unknown",
+                        "url": doc.get("url"),
+                        "date": doc.get("date"),
+                        "is_root": node_id == root_id,
+                    }
+        links = [
+            link
+            for link in links
+            if link.get("source") in nodes and link.get("target") in nodes
+        ]
+        return {
+            "root": root_id,
+            "depth": depth,
+            "nodes": list(nodes.values()),
+            "links": links,
+            "meta": {
+                "node_count": len(nodes),
+                "link_count": len(links),
+            },
+        }
     @app.route("/api/channels")
     def channels():
         base_channels_body = {
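
The traversal in `build_graph_payload` is a breadth-first walk bounded by `depth` and `max_nodes`, following both `internal_references` (outbound) and `referenced_by` (inbound). The sketch below reproduces just that walk over an in-memory dictionary so it can be run without Elasticsearch. It is a simplification, not the PR's code: `walk_references` is a hypothetical name, nodes are bare IDs, and links whose endpoints were never visited are dropped instead of being backfilled with placeholder nodes as the diff does.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple

Doc = Dict[str, List[str]]


def walk_references(
    docs: Dict[str, Doc], root_id: str, depth: int, max_nodes: int
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Breadth-first walk over reference links, limited by depth and node count."""
    nodes: List[str] = []
    links: Set[Tuple[str, str]] = set()
    queue: Deque[Tuple[str, int]] = deque([(root_id, 0)])
    seen: Set[str] = {root_id}

    while queue and len(nodes) < max_nodes:
        current, level = queue.popleft()
        doc = docs.get(current)
        if doc is None:
            continue  # unknown video: skip it, as a failed lookup would
        nodes.append(current)
        if level >= depth:
            continue  # keep the node but do not expand its neighbours

        outbound = [(current, target) for target in doc.get("internal_references", [])]
        inbound = [(source, current) for source in doc.get("referenced_by", [])]
        for source, target in outbound + inbound:
            if source == target:
                continue  # ignore self-references
            links.add((source, target))
            neighbor = target if source == current else source
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, level + 1))

    # Simplification: keep only links whose two endpoints were collected.
    kept = sorted(link for link in links if link[0] in nodes and link[1] in nodes)
    return nodes, kept
```

With `depth=0` only the root is returned; raising `depth` widens the neighbourhood one hop at a time, and `max_nodes` caps how many documents are collected.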
@@ -553,21 +798,99 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
             .get("channels", {})
             .get("buckets", [])
         )
-        data = [
-            {
-                "Id": bucket.get("key"),
-                "Name": (
+        data = []
+        for bucket in buckets:
+            key = bucket.get("key")
+            name_hit = (
                 bucket.get("name", {})
                 .get("hits", {})
                 .get("hits", [{}])[0]
                 .get("_source", {})
-                .get("channel_name", bucket.get("key"))
-            ),
+                .get("channel_name")
+            )
+            display_name = name_hit or key or "Unknown"
+            data.append(
+                {
+                    "Id": key,
+                    "Name": display_name,
+                    "Count": bucket.get("doc_count", 0),
+                }
+            )
+        data.sort(key=lambda item: item["Name"].lower())
+        return jsonify(data)
+    @app.route("/api/graph")
+    def graph_api():
+        video_id = (request.args.get("video_id") or "").strip()
+        if not video_id:
+            return jsonify({"error": "video_id is required"}), 400
+        try:
+            depth = int(request.args.get("depth", "1"))
+        except ValueError:
+            depth = 1
+        depth = max(0, min(depth, 3))
+        try:
+            max_nodes = int(request.args.get("max_nodes", "200"))
+        except ValueError:
+            max_nodes = 200
+        max_nodes = max(10, min(max_nodes, 400))
+        payload = build_graph_payload(video_id, depth, max_nodes)
+        if not payload["nodes"]:
+            return (
+                jsonify({"error": f"Video '{video_id}' was not found in the index."}),
+                404,
+            )
+        payload["meta"]["max_nodes"] = max_nodes
+        return jsonify(payload)
+    @app.route("/api/years")
+    def years():
+        body = {
+            "size": 0,
+            "aggs": {
+                "years": {
+                    "date_histogram": {
+                        "field": "date",
+                        "calendar_interval": "year",
+                        "format": "yyyy",
+                        "order": {"_key": "desc"}
+                    }
+                }
+            }
+        }
+        if config.elastic.debug:
+            LOGGER.info(
+                "Elasticsearch years request: %s",
+                json.dumps({"index": index, "body": body}, indent=2),
+            )
+        response = client.search(index=index, body=body)
+        if config.elastic.debug:
+            LOGGER.info(
+                "Elasticsearch years response: %s",
+                json.dumps(response, indent=2, default=str),
+            )
+        buckets = (
+            response.get("aggregations", {})
+            .get("years", {})
+            .get("buckets", [])
+        )
+        data = [
+            {
+                "Year": bucket.get("key_as_string"),
                 "Count": bucket.get("doc_count", 0),
             }
             for bucket in buckets
+            if bucket.get("doc_count", 0) > 0
         ]
-        data.sort(key=lambda item: item["Name"].lower())
         return jsonify(data)
@app.route("/api/search") @app.route("/api/search")
@@ -578,6 +901,7 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
         if legacy_channel:
             raw_channels.append(legacy_channel)
         channels = parse_channel_params(raw_channels)
+        year = request.args.get("year", "", type=str) or None
         sort = request.args.get("sort", "relevant", type=str)
         page = max(request.args.get("page", 0, type=int), 0)
         size = max(request.args.get("size", 10, type=int), 1)
@@ -598,6 +922,7 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
         payload = build_query_payload(
             query,
             channels=channels,
+            year=year,
             sort=sort,
             use_exact=use_exact,
             use_fuzzy=use_fuzzy,
@@ -642,10 +967,13 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
         for hit in hits.get("hits", []):
             source = hit.get("_source", {})
             highlight_map = hit.get("highlight", {})
-            transcript_highlight = (
-                (highlight_map.get("transcript_full", []) or [])
-                + (highlight_map.get("transcript_secondary_full", []) or [])
-            )
+            transcript_highlight = [
+                {"html": value, "source": "primary"}
+                for value in (highlight_map.get("transcript_full", []) or [])
+            ] + [
+                {"html": value, "source": "secondary"}
+                for value in (highlight_map.get("transcript_secondary_full", []) or [])
+            ]
             title_html = (
                 highlight_map.get("title")
@@ -665,12 +993,18 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
                     "description": source.get("description"),
                     "descriptionHtml": description_html,
                     "date": source.get("date"),
+                    "duration": source.get("duration"),
                     "url": source.get("url"),
                     "toHighlight": transcript_highlight,
                     "highlightSource": {
                         "primary": bool(highlight_map.get("transcript_full")),
                         "secondary": bool(highlight_map.get("transcript_secondary_full")),
                     },
+                    "internal_references_count": source.get("internal_references_count", 0),
+                    "internal_references": source.get("internal_references", []),
+                    "referenced_by_count": source.get("referenced_by_count", 0),
+                    "referenced_by": source.get("referenced_by", []),
+                    "video_status": source.get("video_status"),
                 }
             )
@@ -716,6 +1050,7 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
         if legacy_channel:
             raw_channels.append(legacy_channel)
         channels = parse_channel_params(raw_channels)
+        year = request.args.get("year", "", type=str) or None
         interval = (request.args.get("interval", "month") or "month").lower()
         allowed_intervals = {"day", "week", "month", "quarter", "year"}
         if interval not in allowed_intervals:
@@ -723,45 +1058,50 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
         start = request.args.get("start", type=str)
         end = request.args.get("end", type=str)
-        filters: List[Dict] = []
-        channel_filter = build_channel_filter(channels)
-        if channel_filter:
-            filters.append(channel_filter)
+        def parse_flag(name: str, default: bool = True) -> bool:
+            value = request.args.get(name)
+            if value is None:
+                return default
+            lowered = value.lower()
+            return lowered not in {"0", "false", "no"}
+        use_exact = parse_flag("exact", True)
+        use_fuzzy = parse_flag("fuzzy", True)
+        use_phrase = parse_flag("phrase", True)
+        if use_query_string:
+            use_exact = use_fuzzy = use_phrase = False
+        search_payload = build_query_payload(
+            term,
+            channels=channels,
+            year=year,
+            sort="relevant",
+            use_exact=use_exact,
+            use_fuzzy=use_fuzzy,
+            use_phrase=use_phrase,
+            use_query_string=use_query_string,
+        )
+        query = search_payload.get("query", {"match_all": {}})
         if start or end:
             range_filter: Dict[str, Dict[str, Dict[str, str]]] = {"range": {"date": {}}}
             if start:
                 range_filter["range"]["date"]["gte"] = start
             if end:
                 range_filter["range"]["date"]["lte"] = end
-            filters.append(range_filter)
-        base_fields = ["title^3", "description^2", "transcript_full", "transcript_secondary_full"]
-        if use_query_string:
-            qs_query = term or "*"
-            must_clause: List[Dict[str, Any]] = [
-                {
-                    "query_string": {
-                        "query": qs_query,
-                        "default_operator": "AND",
-                        "fields": base_fields,
-                    }
-                }
-            ]
+            if "bool" in query:
+                bool_clause = query.setdefault("bool", {})
+                existing_filter = bool_clause.get("filter")
+                if existing_filter is None:
+                    bool_clause["filter"] = [range_filter]
+                elif isinstance(existing_filter, list):
+                    bool_clause["filter"].append(range_filter)
                 else:
-            must_clause = [
-                {
-                    "multi_match": {
-                        "query": term,
-                        "fields": base_fields,
-                        "type": "best_fields",
-                        "operator": "and",
-                    }
-                }
-            ]
-        query: Dict[str, Any] = {"bool": {"must": must_clause}}
-        if filters:
-            query["bool"]["filter"] = filters
+                    bool_clause["filter"] = [existing_filter, range_filter]
+            elif query.get("match_all") is not None:
+                query = {"bool": {"filter": [range_filter]}}
+            else:
+                query = {"bool": {"must": [query], "filter": [range_filter]}}
         histogram: Dict[str, Any] = {
             "field": "date",
@@ -791,12 +1131,20 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
 "field": "channel_id.keyword",
 "size": channel_terms_size,
 "order": {"_count": "desc"},
+},
+"aggs": {
+"channel_name_hit": {
+"top_hits": {
+"size": 1,
+"_source": {"includes": ["channel_name"]},
 }
 }
 },
 }
 },
 }
+},
+}
 if config.elastic.debug:
 LOGGER.info(
@@ -830,7 +1178,7 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
 .get("buckets", [])
 )
-channel_totals: Dict[str, int] = {}
+channel_totals: Dict[str, Dict[str, Any]] = {}
 buckets: List[Dict[str, Any]] = []
 for bucket in raw_buckets:
 date_str = bucket.get("key_as_string")
@@ -840,14 +1188,28 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
 cid = ch_bucket.get("key")
 count = ch_bucket.get("doc_count", 0)
 if cid:
-channel_entries.append({"id": cid, "count": count})
-channel_totals[cid] = channel_totals.get(cid, 0) + count
+hit_source = (
+ch_bucket.get("channel_name_hit", {})
+.get("hits", {})
+.get("hits", [{}])[0]
+.get("_source", {})
+)
+channel_name = hit_source.get("channel_name") if isinstance(hit_source, dict) else None
+channel_entries.append({"id": cid, "count": count, "name": channel_name})
+if cid not in channel_totals:
+channel_totals[cid] = {"total": 0, "name": channel_name}
+channel_totals[cid]["total"] += count
+if channel_name and not channel_totals[cid].get("name"):
+channel_totals[cid]["name"] = channel_name
 buckets.append(
 {"date": date_str, "total": total, "channels": channel_entries}
 )
 ranked_channels = sorted(
-[{"id": cid, "total": total} for cid, total in channel_totals.items()],
+[
+{"id": cid, "total": info.get("total", 0), "name": info.get("name")}
+for cid, info in channel_totals.items()
+],
 key=lambda item: item["total"],
 reverse=True,
 )
@@ -867,6 +1229,145 @@ def create_app(config: AppConfig = CONFIG) -> Flask:
 def frequency_page():
 return send_from_directory(app.static_folder, "frequency.html")
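The date-range handling in the frequency endpoint earlier in this file now has to merge a range filter into whatever query `build_query_payload` returned. The following is a minimal standalone sketch of that merge, covering the three cases visible in the hunk (an existing `bool` query, a bare `match_all`, and any other query). `attach_range_filter` is a hypothetical helper name, not a function in this repository, and unlike the inline code it copies the query rather than mutating it.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


def attach_range_filter(
    query: Dict[str, Any], start: Optional[str] = None, end: Optional[str] = None
) -> Dict[str, Any]:
    """Return a copy of `query` restricted to the [start, end] date window."""
    if not (start or end):
        return deepcopy(query)

    date_bounds: Dict[str, str] = {}
    if start:
        date_bounds["gte"] = start
    if end:
        date_bounds["lte"] = end
    range_filter = {"range": {"date": date_bounds}}

    query = deepcopy(query)  # avoid mutating the caller's dict
    if "bool" in query:
        bool_clause = query["bool"]
        existing = bool_clause.get("filter")
        if existing is None:
            bool_clause["filter"] = [range_filter]
        elif isinstance(existing, list):
            existing.append(range_filter)
        else:
            # A single filter clause given as a dict: promote it to a list.
            bool_clause["filter"] = [existing, range_filter]
        return query
    if query.get("match_all") is not None:
        # match_all contributes nothing once a filter is present.
        return {"bool": {"filter": [range_filter]}}
    # Any other query type is wrapped so the range can sit beside it.
    return {"bool": {"must": [query], "filter": [range_filter]}}
```

Keeping the merge in one pure function would make each branch easy to unit test without a running Elasticsearch instance.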
+@app.route("/api/vector-search", methods=["POST"])
+def api_vector_search():
+payload = request.get_json(silent=True) or {}
+query_text = (payload.get("query") or "").strip()
+filters = payload.get("filters") or {}
+limit = max(int(payload.get("size", 10)), 1)
+offset = max(int(payload.get("offset", 0)), 0)
+if not query_text:
+return jsonify(
+{"items": [], "totalResults": 0, "offset": offset, "error": "empty_query"}
+)
+try:
+query_vector = embed_query(
+query_text, model_name=qdrant_embed_model, expected_dim=qdrant_vector_size
+)
+except Exception as exc: # pragma: no cover - runtime dependency
+LOGGER.error("Embedding failed: %s", exc, exc_info=config.elastic.debug)
+return jsonify({"error": "embedding_unavailable"}), 500
+qdrant_vector_payload: Any
+if qdrant_vector_name:
+qdrant_vector_payload = {qdrant_vector_name: query_vector}
+else:
+qdrant_vector_payload = query_vector
+qdrant_body: Dict[str, Any] = {
+"vector": qdrant_vector_payload,
+"limit": limit,
+"offset": offset,
+"with_payload": True,
+"with_vectors": False,
+}
+if filters:
+qdrant_body["filter"] = filters
+try:
+response = requests.post(
+f"{qdrant_url}/collections/{qdrant_collection}/points/search",
+json=qdrant_body,
+timeout=20,
+)
+response.raise_for_status()
+data = response.json()
+except Exception as exc:
+LOGGER.error("Vector search failed: %s", exc, exc_info=config.elastic.debug)
+return jsonify({"error": "vector_search_unavailable"}), 502
+points = data.get("result", []) if isinstance(data, dict) else []
+items: List[Dict[str, Any]] = []
+missing_channel_ids: Set[str] = set()
+for point in points:
+payload = point.get("payload", {}) or {}
+raw_highlights = payload.get("highlights") or []
+highlight_entries: List[Dict[str, str]] = []
+for entry in raw_highlights:
+if isinstance(entry, dict):
+html_value = entry.get("html") or entry.get("text")
+else:
+html_value = str(entry)
+if not html_value:
+continue
+highlight_entries.append({"html": html_value, "source": "primary"})
+channel_label = (
+payload.get("channel_name")
+or payload.get("channel_title")
+or payload.get("channel_id")
+)
+items.append(
+{
+"video_id": payload.get("video_id"),
+"channel_id": payload.get("channel_id"),
+"channel_name": channel_label,
+"title": payload.get("title"),
+"titleHtml": payload.get("title"),
+"description": payload.get("description"),
+"descriptionHtml": payload.get("description"),
+"date": payload.get("date"),
+"url": payload.get("url"),
+"chunkText": payload.get("text")
+or payload.get("chunk_text")
+or payload.get("chunk")
+or payload.get("content"),
+"chunkTimestamp": payload.get("timestamp")
+or payload.get("start_seconds")
+or payload.get("start"),
+"toHighlight": highlight_entries,
+"highlightSource": {
+"primary": bool(highlight_entries),
+"secondary": False,
+},
+"distance": point.get("score"),
+"internal_references_count": payload.get("internal_references_count", 0),
+"internal_references": payload.get("internal_references", []),
+"referenced_by_count": payload.get("referenced_by_count", 0),
+"referenced_by": payload.get("referenced_by", []),
+"video_status": payload.get("video_status"),
+"duration": payload.get("duration"),
+}
+)
+if (not channel_label) and payload.get("channel_id"):
+missing_channel_ids.add(str(payload.get("channel_id")))
+if missing_channel_ids:
+try:
+es_lookup = client.search(
+index=index,
+body={
+"size": len(missing_channel_ids) * 2,
+"_source": ["channel_id", "channel_name"],
+"query": {"terms": {"channel_id.keyword": list(missing_channel_ids)}},
+},
+)
+hits = es_lookup.get("hits", {}).get("hits", [])
+channel_lookup = {}
+for hit in hits:
+src = hit.get("_source", {}) or {}
+cid = src.get("channel_id")
+cname = src.get("channel_name")
+if cid and cname and cid not in channel_lookup:
+channel_lookup[cid] = cname
+for item in items:
+if not item.get("channel_name"):
+cid = item.get("channel_id")
+if cid and cid in channel_lookup:
+item["channel_name"] = channel_lookup[cid]
+except Exception as exc:
+LOGGER.debug("Vector channel lookup failed: %s", exc)
+return jsonify(
+{
+"items": items,
+"totalResults": len(items),
+"offset": offset,
+}
+)
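`api_vector_search` converts `size` and `offset` with a bare `int(...)`, so a non-numeric or `null` value in the JSON body would raise before any response is built. One possible guard is sketched below. `parse_bounded_int` and the upper bounds (100 and 10,000) are illustrative assumptions, not part of this commit.

```python
from typing import Any


def parse_bounded_int(raw: Any, default: int, minimum: int, maximum: int) -> int:
    """Coerce `raw` to an int within [minimum, maximum], using `default` on failure."""
    try:
        value = int(raw)
    except (TypeError, ValueError, OverflowError):
        # None, lists, "abc", NaN and infinity all land here.
        value = default
    return max(minimum, min(value, maximum))


# Hypothetical request body, shaped like what request.get_json() might return.
payload = {"query": "example search", "size": "abc", "offset": None}

limit = parse_bounded_int(payload.get("size", 10), default=10, minimum=1, maximum=100)
offset = parse_bounded_int(payload.get("offset", 0), default=0, minimum=0, maximum=10_000)
```

Capping `limit` also keeps a single request from asking Qdrant for an arbitrarily large page.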
 @app.route("/api/transcript")
 def transcript():
 video_id = request.args.get("video_id", type=str)
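The channel-name fallback in `api_vector_search` asks Elasticsearch for `len(missing_channel_ids) * 2` arbitrary documents, so a channel with many videos could crowd the others out of the result window. Field collapsing returns one hit per distinct value instead. The sketch below only builds the request body and parses hits. It assumes `channel_id.keyword` is a keyword field with doc values (the mapping is not shown in this diff), and `build_channel_lookup_body` and `extract_channel_names` are hypothetical helper names.

```python
from typing import Any, Dict, Iterable, List


def build_channel_lookup_body(channel_ids: Iterable[str]) -> Dict[str, Any]:
    """Build a search body that returns at most one named hit per channel."""
    ids = sorted(set(channel_ids))
    return {
        "size": len(ids),
        "_source": ["channel_id", "channel_name"],
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"channel_id.keyword": ids}},
                    # Skip documents that cannot supply a name anyway.
                    {"exists": {"field": "channel_name"}},
                ]
            }
        },
        # Field collapsing: keep only the top hit for each distinct channel.
        "collapse": {"field": "channel_id.keyword"},
    }


def extract_channel_names(hits: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map channel_id to channel_name from a list of Elasticsearch hits."""
    names: Dict[str, str] = {}
    for hit in hits:
        source = hit.get("_source") or {}
        cid, cname = source.get("channel_id"), source.get("channel_name")
        if cid and cname:
            names.setdefault(cid, cname)
    return names
```

With collapsing, `size` counts groups rather than documents, so one slot per requested channel is enough.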

File diff suppressed because it is too large

BIN
static/favicon.png Normal file

Binary file not shown.

Size: 1.9 KiB


@@ -4,6 +4,7 @@
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1" />
 <title>Term Frequency Explorer</title>
+<link rel="icon" href="/static/favicon.png" type="image/png" />
 <link rel="stylesheet" href="/static/style.css" />
 <style>
 #chart {
@@ -65,4 +66,3 @@
 <script src="/static/frequency.js"></script>
 </body>
 </html>

85
static/graph.html Normal file

@@ -0,0 +1,85 @@
<!doctype html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>TLC Reference Graph</title>
<link rel="icon" href="/static/favicon.png" type="image/png" />
<link rel="stylesheet" href="https://unpkg.com/xp.css" />
<link rel="stylesheet" href="/static/style.css" />
<script src="https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js"></script>
</head>
<body>
<div class="window graph-window" style="max-width: 1100px; margin: 20px auto;">
<div class="title-bar">
<div class="title-bar-text">Reference Graph</div>
<div class="title-bar-controls">
<a class="title-bar-link" href="/">⬅ Search</a>
</div>
</div>
<div class="window-body">
<p>
Explore how videos reference each other. Enter a <code>video_id</code> to see its immediate
neighbors (referenced and referencing videos). Choose a larger depth to expand the graph.
</p>
<form id="graphForm" class="graph-controls">
<div class="field-group">
<label for="graphVideoId">Video ID</label>
<input
id="graphVideoId"
name="video_id"
type="text"
placeholder="e.g. dQw4w9WgXcQ"
required
/>
</div>
<div class="field-group">
<label for="graphDepth">Depth</label>
<select id="graphDepth" name="depth">
<option value="1">1 hop</option>
<option value="2">2 hops</option>
<option value="3">3 hops</option>
</select>
</div>
<div class="field-group">
<label for="graphMaxNodes">Max nodes</label>
<select id="graphMaxNodes" name="max_nodes">
<option value="100">100</option>
<option value="150">150</option>
<option value="200" selected>200</option>
<option value="300">300</option>
</select>
</div>
<div class="field-group">
<label for="graphLabelSize">Labels</label>
<select id="graphLabelSize" name="label_size">
<option value="off">Off</option>
<option value="tiny" selected>Tiny</option>
<option value="small">Small</option>
<option value="normal">Normal</option>
<option value="medium">Medium</option>
<option value="large">Large</option>
<option value="xlarge">Extra large</option>
</select>
</div>
<button type="submit">Build graph</button>
</form>
<div id="graphStatus" class="graph-status">Enter a video ID to begin.</div>
<div id="graphContainer" class="graph-container"></div>
</div>
<div class="status-bar">
<p class="status-bar-field">Click nodes to open the video on YouTube</p>
<p class="status-bar-field">Colors represent channels</p>
</div>
</div>
<script src="/static/graph.js"></script>
</body>
</html>

670
static/graph.js Normal file

@@ -0,0 +1,670 @@
(() => {
const global = window;
const GraphUI = (global.GraphUI = global.GraphUI || {});
GraphUI.ready = false;
const form = document.getElementById("graphForm");
const videoInput = document.getElementById("graphVideoId");
const depthInput = document.getElementById("graphDepth");
const maxNodesInput = document.getElementById("graphMaxNodes");
const labelSizeInput = document.getElementById("graphLabelSize");
const statusEl = document.getElementById("graphStatus");
const container = document.getElementById("graphContainer");
const isEmbedded =
container && container.dataset && container.dataset.embedded === "true";
if (!form || !videoInput || !depthInput || !maxNodesInput || !labelSizeInput || !container) {
console.error("Graph: required DOM elements missing.");
return;
}
const color = d3.scaleOrdinal(d3.schemeTableau10);
const colorRange = typeof color.range === "function" ? color.range() : [];
const paletteSizeDefault = colorRange.length || 10;
const PATTERN_TYPES = [
{ key: "none", legendClass: "none" },
{ key: "diag-forward", legendClass: "diag-forward" },
{ key: "diag-back", legendClass: "diag-back" },
{ key: "cross", legendClass: "cross" },
{ key: "dots", legendClass: "dots" },
];
const ADDITIONAL_PATTERNS = PATTERN_TYPES.filter((pattern) => pattern.key !== "none");
const sanitizeDepth = (value) => {
const parsed = parseInt(value, 10);
if (Number.isNaN(parsed)) return 1;
return Math.max(0, Math.min(parsed, 3));
};
const sanitizeMaxNodes = (value) => {
const parsed = parseInt(value, 10);
if (Number.isNaN(parsed)) return 200;
return Math.max(10, Math.min(parsed, 400));
};
const LABEL_SIZE_VALUES = ["off", "tiny", "small", "normal", "medium", "large", "xlarge"];
const LABEL_FONT_SIZES = {
tiny: "7px",
small: "8px",
normal: "9px",
medium: "10px",
large: "11px",
xlarge: "13px",
};
const DEFAULT_LABEL_SIZE = "tiny";
const isValidLabelSize = (value) => LABEL_SIZE_VALUES.includes(value);
const getLabelSize = () => {
if (!labelSizeInput) return DEFAULT_LABEL_SIZE;
const value = labelSizeInput.value;
return isValidLabelSize(value) ? value : DEFAULT_LABEL_SIZE;
};
function setLabelSizeInput(value) {
if (!labelSizeInput) return;
labelSizeInput.value = isValidLabelSize(value) ? value : DEFAULT_LABEL_SIZE;
}
const getChannelLabel = (node) =>
(node && (node.channel_name || node.channel_id)) || "Unknown";
function appendPatternContent(pattern, baseColor, patternKey) {
pattern.append("rect").attr("width", 8).attr("height", 8).attr("fill", baseColor);
const strokeColor = "#1f1f1f";
const strokeOpacity = 0.35;
const addForward = () => {
pattern
.append("path")
.attr("d", "M-2,6 L2,2 M0,8 L8,0 M6,10 L10,4")
.attr("stroke", strokeColor)
.attr("stroke-width", 1)
.attr("stroke-opacity", strokeOpacity)
.attr("fill", "none");
};
const addBackward = () => {
pattern
.append("path")
.attr("d", "M-2,2 L2,6 M0,0 L8,8 M6,-2 L10,2")
.attr("stroke", strokeColor)
.attr("stroke-width", 1)
.attr("stroke-opacity", strokeOpacity)
.attr("fill", "none");
};
switch (patternKey) {
case "diag-forward":
addForward();
break;
case "diag-back":
addBackward();
break;
case "cross":
addForward();
addBackward();
break;
case "dots":
pattern
.append("circle")
.attr("cx", 4)
.attr("cy", 4)
.attr("r", 1.5)
.attr("fill", strokeColor)
.attr("fill-opacity", strokeOpacity);
break;
default:
break;
}
}
function createChannelStyle(label, baseColor, patternKey) {
const patternInfo =
PATTERN_TYPES.find((pattern) => pattern.key === patternKey) || PATTERN_TYPES[0];
return {
baseColor,
hatch: patternInfo ? patternInfo.key : "none",
legendClass: patternInfo ? patternInfo.legendClass : "none",
};
}
let currentGraphData = null;
let currentChannelStyles = new Map();
let currentDepth = sanitizeDepth(depthInput.value);
let currentMaxNodes = sanitizeMaxNodes(maxNodesInput.value);
let currentSimulation = null;
function setStatus(message, isError = false) {
if (!statusEl) return;
statusEl.textContent = message;
if (isError) {
statusEl.classList.add("error");
} else {
statusEl.classList.remove("error");
}
}
function sanitizeId(value) {
return (value || "").trim();
}
async function fetchGraph(videoId, depth, maxNodes) {
const params = new URLSearchParams();
params.set("video_id", videoId);
params.set("depth", String(depth));
params.set("max_nodes", String(maxNodes));
const response = await fetch(`/api/graph?${params.toString()}`);
if (!response.ok) {
const errorPayload = await response.json().catch(() => ({}));
const errorMessage =
errorPayload.error ||
`Graph request failed (${response.status} ${response.statusText})`;
throw new Error(errorMessage);
}
return response.json();
}
function resizeContainer() {
if (!container) return;
const minHeight = 520;
const viewportHeight = window.innerHeight;
container.style.height = `${Math.max(minHeight, Math.round(viewportHeight * 0.6))}px`;
}
function renderGraph(data, labelSize = "normal") {
if (!container) return;
if (currentSimulation) {
currentSimulation.stop();
currentSimulation = null;
}
container.innerHTML = "";
const width = container.clientWidth || 900;
const height = container.clientHeight || 600;
const svg = d3
.select(container)
.append("svg")
.attr("viewBox", [0, 0, width, height])
.attr("width", "100%")
.attr("height", height);
const defs = svg.append("defs");
defs
.append("marker")
.attr("id", "arrow-references")
.attr("viewBox", "0 -5 10 10")
.attr("refX", 18)
.attr("refY", 0)
.attr("markerWidth", 6)
.attr("markerHeight", 6)
.attr("orient", "auto")
.append("path")
.attr("d", "M0,-5L10,0L0,5")
.attr("fill", "#6c83c7");
defs
.append("marker")
.attr("id", "arrow-referenced-by")
.attr("viewBox", "0 -5 10 10")
.attr("refX", 18)
.attr("refY", 0)
.attr("markerWidth", 6)
.attr("markerHeight", 6)
.attr("orient", "auto")
.append("path")
.attr("d", "M0,-5L10,0L0,5")
.attr("fill", "#c76c6c");
const contentGroup = svg.append("g").attr("class", "graph-content");
const linkGroup = contentGroup.append("g").attr("class", "graph-links");
const nodeGroup = contentGroup.append("g").attr("class", "graph-nodes");
const labelGroup = contentGroup.append("g").attr("class", "graph-labels");
const links = data.links || [];
const nodes = data.nodes || [];
currentChannelStyles = new Map();
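const uniqueChannels = [];
const seenChannels = new Set();
nodes.forEach((node) => {
const label = getChannelLabel(node);
// currentChannelStyles is still empty in this loop, so checking it would
// push one entry per node; track seen labels to keep one entry per channel.
if (!seenChannels.has(label)) {
seenChannels.add(label);
uniqueChannels.push(label);
}
});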
const additionalPatternCount = ADDITIONAL_PATTERNS.length;
uniqueChannels.forEach((label, idx) => {
const baseColor = color(label);
let patternKey = "none";
if (idx >= paletteSizeDefault && additionalPatternCount > 0) {
const patternInfo =
ADDITIONAL_PATTERNS[(idx - paletteSizeDefault) % additionalPatternCount];
patternKey = patternInfo.key;
}
const style = createChannelStyle(label, baseColor, patternKey);
currentChannelStyles.set(label, style);
});
const linkSelection = linkGroup
.selectAll("line")
.data(links)
.enter()
.append("line")
.attr("stroke-width", 1.2)
.attr("stroke", (d) =>
d.relation === "references" ? "#6c83c7" : "#c76c6c"
)
.attr("stroke-opacity", 0.7)
.attr("marker-end", (d) =>
d.relation === "references" ? "url(#arrow-references)" : "url(#arrow-referenced-by)"
);
let nodePatternCounter = 0;
const nodePatternRefs = new Map();
const getNodeFill = (node) => {
const style = currentChannelStyles.get(getChannelLabel(node));
if (!style) {
return color(getChannelLabel(node));
}
if (!style.hatch || style.hatch === "none") {
return style.baseColor;
}
const patternId = `node-pattern-${nodePatternCounter++}`;
const pattern = defs
.append("pattern")
.attr("id", patternId)
.attr("patternUnits", "userSpaceOnUse")
.attr("width", 8)
.attr("height", 8);
appendPatternContent(pattern, style.baseColor, style.hatch);
pattern.attr("patternTransform", "translate(0,0)");
nodePatternRefs.set(node.id, pattern);
return `url(#${patternId})`;
};
const nodeSelection = nodeGroup
.selectAll("circle")
.data(nodes, (d) => d.id)
.enter()
.append("circle")
.attr("r", (d) => (d.is_root ? 10 : 7))
.attr("fill", (d) => getNodeFill(d))
.attr("stroke", "#1f1f1f")
.attr("stroke-width", (d) => (d.is_root ? 2 : 1))
.call(
d3
.drag()
.on("start", (event, d) => {
if (!event.active) simulation.alphaTarget(0.3).restart();
d.fx = d.x;
d.fy = d.y;
})
.on("drag", (event, d) => {
d.fx = event.x;
d.fy = event.y;
})
.on("end", (event, d) => {
if (!event.active) simulation.alphaTarget(0);
d.fx = null;
d.fy = null;
})
)
.on("click", (event, d) => {
if (d.url) {
window.open(d.url, "_blank", "noopener");
}
})
.on("contextmenu", (event, d) => {
event.preventDefault();
loadGraph(d.id, currentDepth, currentMaxNodes, { updateInputs: true });
});
nodeSelection
.append("title")
.text((d) => {
const parts = [];
parts.push(d.title || d.id);
if (d.channel_name) {
parts.push(`Channel: ${d.channel_name}`);
}
if (d.date) {
parts.push(`Date: ${d.date}`);
}
return parts.join("\n");
});
const labelSelection = labelGroup
.selectAll("text")
.data(nodes, (d) => d.id)
.enter()
.append("text")
.attr("class", "graph-node-label")
.attr("text-anchor", "middle")
.attr("fill", "#1f1f1f")
.attr("pointer-events", "none")
.text((d) => d.title || d.id);
applyLabelAppearance(labelSelection, labelSize);
const simulation = d3
.forceSimulation(nodes)
.force(
"link",
d3
.forceLink(links)
.id((d) => d.id)
.distance(120)
.strength(0.8)
)
.force("charge", d3.forceManyBody().strength(-320))
.force("center", d3.forceCenter(width / 2, height / 2))
.force(
"collide",
d3.forceCollide().radius((d) => (d.is_root ? 20 : 14)).iterations(2)
);
simulation.on("tick", () => {
linkSelection
.attr("x1", (d) => d.source.x)
.attr("y1", (d) => d.source.y)
.attr("x2", (d) => d.target.x)
.attr("y2", (d) => d.target.y);
nodeSelection.attr("cx", (d) => d.x).attr("cy", (d) => d.y);
labelSelection.attr("x", (d) => d.x).attr("y", (d) => d.y - (d.is_root ? 14 : 12));
nodeSelection.each(function (d) {
const pattern = nodePatternRefs.get(d.id);
if (pattern) {
const safeX = Number.isFinite(d.x) ? d.x : 0;
const safeY = Number.isFinite(d.y) ? d.y : 0;
pattern.attr("patternTransform", `translate(${safeX}, ${safeY})`);
}
});
});
const zoomBehavior = d3
.zoom()
.scaleExtent([0.3, 3])
.on("zoom", (event) => {
contentGroup.attr("transform", event.transform);
});
svg.call(zoomBehavior);
currentSimulation = simulation;
}
async function loadGraph(videoId, depth, maxNodes, { updateInputs = false } = {}) {
const sanitizedId = sanitizeId(videoId);
if (!sanitizedId) {
setStatus("Please enter a video ID.", true);
return;
}
const safeDepth = sanitizeDepth(depth);
const safeMaxNodes = sanitizeMaxNodes(maxNodes);
if (updateInputs) {
videoInput.value = sanitizedId;
depthInput.value = String(safeDepth);
maxNodesInput.value = String(safeMaxNodes);
}
setStatus("Loading graph…");
try {
const data = await fetchGraph(sanitizedId, safeDepth, safeMaxNodes);
if (!data.nodes || data.nodes.length === 0) {
setStatus("No nodes returned for this video.", true);
container.innerHTML = "";
currentGraphData = null;
currentChannelStyles = new Map();
renderLegend([]);
return;
}
currentGraphData = data;
currentDepth = safeDepth;
currentMaxNodes = safeMaxNodes;
renderGraph(data, getLabelSize());
renderLegend(data.nodes);
setStatus(
`Showing ${data.nodes.length} nodes and ${data.links.length} links (depth ${data.depth})`
);
updateUrlState(sanitizedId, safeDepth, safeMaxNodes, getLabelSize());
} catch (err) {
console.error(err);
setStatus(err.message || "Failed to build graph.", true);
container.innerHTML = "";
currentGraphData = null;
currentChannelStyles = new Map();
renderLegend([]);
}
}
async function handleSubmit(event) {
event.preventDefault();
await loadGraph(videoInput.value, depthInput.value, maxNodesInput.value, {
updateInputs: true,
});
}
function renderLegend(nodes) {
let legend = document.getElementById("graphLegend");
if (!legend) {
legend = document.createElement("div");
legend.id = "graphLegend";
legend.className = "graph-legend";
if (statusEl && statusEl.parentNode) {
statusEl.insertAdjacentElement("afterend", legend);
} else {
container.parentElement?.insertBefore(legend, container);
}
}
legend.innerHTML = "";
const edgesSection = document.createElement("div");
edgesSection.className = "graph-legend-section";
const edgesTitle = document.createElement("div");
edgesTitle.className = "graph-legend-title";
edgesTitle.textContent = "Edges";
edgesSection.appendChild(edgesTitle);
const createEdgeRow = (swatchClass, text) => {
const row = document.createElement("div");
row.className = "graph-legend-row";
const swatch = document.createElement("span");
swatch.className = `graph-legend-swatch ${swatchClass}`;
const label = document.createElement("span");
label.textContent = text;
row.appendChild(swatch);
row.appendChild(label);
return row;
};
edgesSection.appendChild(
createEdgeRow(
"graph-legend-swatch--references",
"Outgoing reference (video references other)"
)
);
edgesSection.appendChild(
createEdgeRow(
"graph-legend-swatch--referenced",
"Incoming reference (other video references this)"
)
);
legend.appendChild(edgesSection);
const channelSection = document.createElement("div");
channelSection.className = "graph-legend-section";
const channelTitle = document.createElement("div");
channelTitle.className = "graph-legend-title";
channelTitle.textContent = "Channels in view";
channelSection.appendChild(channelTitle);
const channelList = document.createElement("div");
channelList.className = "graph-legend-channel-list";
const channelEntries = Array.from(currentChannelStyles.entries()).sort((a, b) =>
a[0].localeCompare(b[0], undefined, { sensitivity: "base" })
);
const maxChannelItems = 20;
channelEntries.slice(0, maxChannelItems).forEach(([label, style]) => {
const item = document.createElement("div");
item.className = `graph-legend-channel graph-legend-channel--${
style.legendClass || "none"
}`;
const swatch = document.createElement("span");
swatch.className = "graph-legend-swatch graph-legend-channel-swatch";
swatch.style.backgroundColor = style.baseColor;
const text = document.createElement("span");
text.textContent = label;
item.appendChild(swatch);
item.appendChild(text);
channelList.appendChild(item);
});
const totalChannels = channelEntries.length;
if (channelList.childElementCount) {
channelSection.appendChild(channelList);
if (totalChannels > maxChannelItems) {
const note = document.createElement("div");
note.className = "graph-legend-note";
note.textContent = `+${totalChannels - maxChannelItems} more channels`;
channelSection.appendChild(note);
}
} else {
const empty = document.createElement("div");
empty.className = "graph-legend-note";
empty.textContent = "No channel data available.";
channelSection.appendChild(empty);
}
legend.appendChild(channelSection);
}
function applyLabelAppearance(selection, labelSize) {
if (labelSize === "off") {
selection.style("display", "none");
} else {
selection
.style("display", null)
.attr("font-size", LABEL_FONT_SIZES[labelSize] || LABEL_FONT_SIZES.normal);
}
}
function updateUrlState(videoId, depth, maxNodes, labelSize) {
if (isEmbedded) {
return;
}
const next = new URL(window.location.href);
next.searchParams.set("video_id", videoId);
next.searchParams.set("depth", String(depth));
next.searchParams.set("max_nodes", String(maxNodes));
if (labelSize && labelSize !== DEFAULT_LABEL_SIZE) {
next.searchParams.set("label_size", labelSize);
} else {
next.searchParams.delete("label_size");
}
history.replaceState({}, "", next.toString());
}
function initFromQuery() {
const params = new URLSearchParams(window.location.search);
const videoId = sanitizeId(params.get("video_id"));
const depth = sanitizeDepth(params.get("depth") || "");
const maxNodes = sanitizeMaxNodes(params.get("max_nodes") || "");
const labelSizeParam = params.get("label_size");
if (videoId) {
videoInput.value = videoId;
}
depthInput.value = String(depth);
maxNodesInput.value = String(maxNodes);
if (labelSizeParam && isValidLabelSize(labelSizeParam)) {
setLabelSizeInput(labelSizeParam);
} else {
setLabelSizeInput(getLabelSize());
}
if (!videoId || isEmbedded) {
return;
}
loadGraph(videoId, depth, maxNodes, { updateInputs: false });
}
resizeContainer();
window.addEventListener("resize", resizeContainer);
form.addEventListener("submit", handleSubmit);
labelSizeInput.addEventListener("change", () => {
const size = getLabelSize();
if (currentGraphData) {
renderGraph(currentGraphData, size);
renderLegend(currentGraphData.nodes);
}
updateUrlState(
sanitizeId(videoInput.value),
currentDepth,
currentMaxNodes,
size
);
});
initFromQuery();
Object.assign(GraphUI, {
load(videoId, depth, maxNodes, options = {}) {
const targetDepth = depth != null ? depth : currentDepth;
const targetMax = maxNodes != null ? maxNodes : currentMaxNodes;
return loadGraph(videoId, targetDepth, targetMax, {
updateInputs: options.updateInputs !== false,
});
},
setLabelSize(size) {
if (!labelSizeInput || !size) return;
setLabelSizeInput(size);
labelSizeInput.dispatchEvent(new Event("change", { bubbles: true }));
},
setDepth(value) {
if (!depthInput) return;
const safe = sanitizeDepth(value);
depthInput.value = String(safe);
currentDepth = safe;
},
setMaxNodes(value) {
if (!maxNodesInput) return;
const safe = sanitizeMaxNodes(value);
maxNodesInput.value = String(safe);
currentMaxNodes = safe;
},
focusInput() {
if (videoInput) {
videoInput.focus();
videoInput.select();
}
},
stop() {
if (currentSimulation) {
currentSimulation.stop();
currentSimulation = null;
}
},
getState() {
return {
depth: currentDepth,
maxNodes: currentMaxNodes,
labelSize: getLabelSize(),
nodes: currentGraphData ? currentGraphData.nodes.slice() : [],
links: currentGraphData ? currentGraphData.links.slice() : [],
};
},
isEmbedded,
});
GraphUI.ready = true;
setTimeout(() => {
window.dispatchEvent(new CustomEvent("graph-ui-ready"));
}, 0);
})();


@@ -3,7 +3,8 @@
 <head>
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1" />
-<title>This Little Corner (Python)</title>
+<title>TLC Search</title>
+<link rel="icon" href="/static/favicon.png" type="image/png" />
 <link rel="stylesheet" href="https://unpkg.com/xp.css" />
 <link rel="stylesheet" href="/static/style.css" />
 <script src="https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js"></script>
@@ -11,8 +12,9 @@
 <body>
 <div class="window" style="max-width: 1200px; margin: 20px auto;">
 <div class="title-bar">
-<div class="title-bar-text">This Little Corner — Elastic Search</div>
+<div class="title-bar-text">This Little Corner</div>
 <div class="title-bar-controls">
+<button id="aboutBtn" aria-label="About">?</button>
 <button id="minimizeBtn" aria-label="Minimize"></button>
 <button aria-label="Maximize"></button>
 <button aria-label="Close"></button>
@@ -20,6 +22,10 @@
 </div>
 <div class="window-body">
 <p>Enter a phrase to query title, description, and transcript text.</p>
+<p style="font-size: 11px;">
+Looking for semantic matches? Try the
+<a href="/vector-search">vector search beta</a>.
+</p>
 <fieldset>
 <legend>Search</legend>
@@ -30,19 +36,22 @@
 </div>
 <div class="field-row" style="margin-bottom: 8px; align-items: center;">
-<label style="width: 60px;">Channel:</label>
-<details id="channelDropdown" class="channel-dropdown" style="flex: 1;">
-<summary id="channelSummary">All Channels</summary>
-<div id="channelOptions" class="channel-options">
-<div>Loading channels…</div>
-</div>
-</details>
+<label for="channel" style="width: 60px;">Channel:</label>
+<select id="channel" style="flex: 1;">
+<option value="">All Channels</option>
+</select>
+<label for="year" style="margin-left: 8px;">Year:</label>
+<select id="year">
+<option value="">All Years</option>
+</select>
 <label for="sort" style="margin-left: 8px;">Sort:</label>
 <select id="sort">
 <option value="relevant">Most relevant</option>
 <option value="newer">Newest first</option>
 <option value="older">Oldest first</option>
+<option value="referenced">Most referenced</option>
 </select>
 <label for="size" style="margin-left: 8px;">Size:</label>
@@ -53,18 +62,30 @@
 </select>
 </div>
-<div class="field-row">
+<div class="field-row toggle-row">
+<div class="toggle-item toggle-item--first">
 <input type="checkbox" id="exactToggle" checked />
 <label for="exactToggle">Exact</label>
+<span class="toggle-help">Match all terms exactly.</span>
+</div>
+<div class="toggle-item">
 <input type="checkbox" id="fuzzyToggle" checked />
 <label for="fuzzyToggle">Fuzzy</label>
+<span class="toggle-help">Allow small typos and variations.</span>
+</div>
+<div class="toggle-item">
 <input type="checkbox" id="phraseToggle" checked />
 <label for="phraseToggle">Phrase</label>
+<span class="toggle-help">Boost exact phrases inside transcripts.</span>
+</div>
+<div class="toggle-item">
 <input type="checkbox" id="queryStringToggle" />
 <label for="queryStringToggle">Query string mode</label>
+<span class="toggle-help">Use raw Lucene syntax (overrides other toggles).</span>
+</div>
 </div>
 </fieldset>
@@ -78,7 +99,7 @@
</fieldset> </fieldset>
</div> </div>
<div class="summary-right"> <div class="summary-right">
<fieldset style="height: 100%;"> <fieldset>
<legend>Timeline</legend> <legend>Timeline</legend>
<div id="frequencySummary" style="font-size: 11px; margin-bottom: 8px;"></div> <div id="frequencySummary" style="font-size: 11px; margin-bottom: 8px;"></div>
<div id="frequencyChart"></div> <div id="frequencyChart"></div>
@@ -97,6 +118,105 @@
</div> </div>
</div> </div>
<div class="about-panel" id="aboutPanel" hidden>
<div class="about-panel__header">
<strong>About This App</strong>
<button id="aboutCloseBtn" aria-label="Close about panel">×</button>
</div>
<div class="about-panel__body">
<p>Use the toggles to choose exact, fuzzy, or phrase matching. Query string mode accepts raw Lucene syntax.</p>
<p>Results are ranked by your chosen sort order; the timeline summarizes the same query.</p>
<p>You can download transcripts, copy MLA citations, or explore references via the graph button.</p>
</div>
</div>
<div
id="graphModalOverlay"
class="graph-modal-overlay"
aria-hidden="true"
>
<div
class="window graph-window graph-modal-window"
id="graphModalWindow"
role="dialog"
aria-modal="true"
aria-labelledby="graphModalTitle"
>
<div class="title-bar">
<div class="title-bar-text" id="graphModalTitle">Reference Graph</div>
<div class="title-bar-controls">
<button id="graphModalClose" aria-label="Close"></button>
</div>
</div>
<div class="window-body">
<p>
Explore how this video links with its neighbors. Adjust depth or node cap to expand the graph.
</p>
<form id="graphForm" class="graph-controls">
<div class="field-group">
<label for="graphVideoId">Video ID</label>
<input
id="graphVideoId"
name="video_id"
type="text"
placeholder="e.g. dQw4w9WgXcQ"
required
/>
</div>
<div class="field-group">
<label for="graphDepth">Depth</label>
<select id="graphDepth" name="depth">
<option value="1" selected>1 hop</option>
<option value="2">2 hops</option>
<option value="3">3 hops</option>
</select>
</div>
<div class="field-group">
<label for="graphMaxNodes">Max nodes</label>
<select id="graphMaxNodes" name="max_nodes">
<option value="100">100</option>
<option value="150">150</option>
<option value="200" selected>200</option>
<option value="300">300</option>
<option value="400">400</option>
</select>
</div>
<div class="field-group">
<label for="graphLabelSize">Labels</label>
<select id="graphLabelSize" name="label_size">
<option value="off">Off</option>
<option value="tiny" selected>Tiny</option>
<option value="small">Small</option>
<option value="normal">Normal</option>
<option value="medium">Medium</option>
<option value="large">Large</option>
<option value="xlarge">Extra large</option>
</select>
</div>
<button type="submit">Build graph</button>
</form>
<div id="graphStatus" class="graph-status">Enter a video ID to begin.</div>
<div
id="graphContainer"
class="graph-container"
data-embedded="true"
></div>
</div>
<div class="status-bar">
<p class="status-bar-field">Right-click a node to set a new root</p>
<p class="status-bar-field">Colors (and hatches) represent channels</p>
</div>
</div>
</div>
<script src="/static/graph.js"></script>
<script src="/static/app.js"></script> <script src="/static/app.js"></script>
</body> </body>
</html> </html>


@@ -63,7 +63,7 @@ body.dimmed {
} }
.field-row input[type="text"], .field-row input[type="text"],
.field-row .channel-dropdown { .field-row select#channel {
flex: 1 1 100% !important; flex: 1 1 100% !important;
min-width: 0 !important; min-width: 0 !important;
max-width: 100% !important; max-width: 100% !important;
@@ -86,63 +86,73 @@ body.dimmed {
max-width: 100%; max-width: 100%;
min-width: 100%; min-width: 100%;
} }
.graph-controls {
flex-direction: column;
align-items: stretch;
} }
/* Channel dropdown custom styling */ .graph-controls .field-group,
.channel-dropdown { .graph-controls input,
position: relative; .graph-controls select {
display: inline-block; width: 100%;
min-width: 0;
}
} }
.channel-dropdown summary { .toggle-row {
list-style: none; flex-direction: column;
cursor: pointer; align-items: flex-start;
padding: 3px 4px; gap: 4px;
background: ButtonFace; margin-top: 8px;
border: 1px solid;
border-color: ButtonHighlight ButtonShadow ButtonShadow ButtonHighlight;
min-width: 180px;
text-align: left;
} }
.channel-dropdown summary::-webkit-details-marker { .toggle-row > * {
display: none; margin-left: 0 !important;
} }
.channel-dropdown summary::after { .toggle-item {
content: ' ▼';
font-size: 8px;
float: right;
}
.channel-dropdown[open] summary::after {
content: ' ▲';
}
.channel-options {
position: absolute;
margin-top: 2px;
padding: 4px;
background: ButtonFace;
border: 1px solid;
border-color: ButtonHighlight ButtonShadow ButtonShadow ButtonHighlight;
max-height: 300px;
overflow-y: auto;
box-shadow: 2px 2px 0 rgba(0, 0, 0, 0.2);
z-index: 100;
min-width: 220px;
}
.channel-option {
display: flex; display: flex;
align-items: center; align-items: center;
gap: 6px; gap: 6px;
margin-bottom: 4px; user-select: none;
font-size: 11px;
} }
.channel-option:last-child { .toggle-item label {
margin-bottom: 0; cursor: pointer;
width: auto !important;
}
.toggle-item--first {
margin-left: 0;
}
.toggle-item input[type="checkbox"] {
margin: 0;
}
.toggle-item input[type="checkbox"]:disabled + label {
color: GrayText;
opacity: 0.7;
}
.toggle-item input[type="checkbox"]:disabled {
cursor: not-allowed;
}
.toggle-item input[type="checkbox"]:disabled + label {
cursor: not-allowed;
}
.description-block {
background: Window;
border: 1px solid #919b9c;
padding: 6px 8px;
margin-top: 6px;
font-size: 11px;
white-space: pre-wrap;
max-height: 6em;
overflow-y: auto;
} }
/* Layout helpers */ /* Layout helpers */
@@ -163,15 +173,373 @@ body.dimmed {
min-width: 300px; min-width: 300px;
} }
.graph-window {
width: 95%;
}
.graph-controls {
display: flex;
flex-wrap: wrap;
gap: 12px;
align-items: flex-end;
margin-bottom: 12px;
}
.graph-controls .field-group {
display: flex;
flex-direction: column;
gap: 4px;
}
.graph-controls label {
font-size: 11px;
font-weight: bold;
}
.graph-controls input,
.graph-controls select {
min-width: 160px;
}
.graph-status {
font-size: 11px;
margin-bottom: 8px;
color: #1f1f1f;
}
.graph-status.error {
color: #b00020;
}
.graph-container {
background: Window;
border: 1px solid #919b9c;
box-shadow: inset -1px -1px #0a0a0a, inset 1px 1px #fff;
position: relative;
width: 100%;
min-height: 520px;
height: auto;
overflow: visible;
}
.graph-modal-overlay {
position: fixed;
inset: 0;
display: none;
align-items: center;
justify-content: center;
padding: 24px;
background: rgba(0, 0, 0, 0.35);
z-index: 2000;
}
.graph-modal-overlay.active {
display: flex;
}
.graph-modal-window {
width: min(960px, 100%);
max-height: calc(100vh - 48px);
}
.graph-modal-window .window-body {
max-height: calc(100vh - 180px);
overflow-y: auto;
}
.graph-modal-window .graph-container {
height: 560px;
}
body.modal-open {
overflow: hidden;
}
.result-header {
display: flex;
justify-content: flex-start;
gap: 6px;
flex-wrap: wrap;
align-items: flex-start;
}
.result-header-main {
flex: 1 1 auto;
min-width: 220px;
}
.result-actions {
display: flex;
align-items: flex-start;
gap: 6px;
margin-left: auto;
}
.result-action-btn {
white-space: nowrap;
font-family: "Tahoma", "MS Sans Serif", sans-serif;
font-size: 11px;
padding: 4px 10px;
}
.result-meta {
display: flex;
align-items: center;
flex-wrap: wrap;
gap: 4px;
}
.result-status {
display: inline-flex;
align-items: center;
gap: 4px;
padding: 1px 6px;
border-radius: 3px;
font-size: 10px;
line-height: 1.3;
border: 1px solid #c4a3a3;
background: #fff6f6;
color: #6b1f1f;
}
.result-status::before {
content: "⚠";
font-size: 10px;
line-height: 1;
}
.result-status--deleted {
border-color: #d1a6a6;
background: #fff8f8;
color: #6b1f1f;
}
.graph-launch-btn {
white-space: nowrap;
}
.graph-node-label {
text-shadow: -1px -1px 0 #fff, 1px -1px 0 #fff, -1px 1px 0 #fff, 1px 1px 0 #fff;
}
.graph-nodes circle {
cursor: pointer;
}
.graph-legend {
margin: 12px 0;
font-size: 11px;
background: Window;
border: 1px solid #919b9c;
padding: 8px 10px;
display: inline-flex;
flex-direction: column;
gap: 4px;
box-shadow: inset -1px -1px #0a0a0a, inset 1px 1px #fff;
}
.graph-legend-section {
display: flex;
flex-direction: column;
gap: 4px;
}
.graph-legend-title {
font-weight: bold;
color: #1f1f1f;
}
.graph-legend-row {
display: flex;
align-items: center;
gap: 8px;
}
.graph-legend-swatch {
display: inline-block;
width: 18px;
height: 12px;
border: 1px solid #1f1f1f;
}
.graph-legend-swatch--references {
background: #6c83c7;
}
.graph-legend-swatch--referenced {
background: #c76c6c;
}
.graph-legend-channel-list {
display: flex;
flex-wrap: wrap;
gap: 8px;
}
.graph-legend-channel {
display: flex;
align-items: center;
gap: 6px;
}
.graph-legend-channel-swatch {
width: 14px;
height: 14px;
background-repeat: repeat;
background-position: 0 0;
background-size: 6px 6px;
}
.graph-legend-channel--none .graph-legend-channel-swatch {
background-image: none;
}
.graph-legend-channel--diag-forward .graph-legend-channel-swatch {
background-image: repeating-linear-gradient(
45deg,
rgba(0, 0, 0, 0.35) 0,
rgba(0, 0, 0, 0.35) 2px,
transparent 2px,
transparent 4px
);
background-blend-mode: multiply;
}
.graph-legend-channel--diag-back .graph-legend-channel-swatch {
background-image: repeating-linear-gradient(
-45deg,
rgba(0, 0, 0, 0.35) 0,
rgba(0, 0, 0, 0.35) 2px,
transparent 2px,
transparent 4px
);
background-blend-mode: multiply;
}
.graph-legend-channel--cross .graph-legend-channel-swatch {
background-image:
repeating-linear-gradient(
45deg,
rgba(0, 0, 0, 0.25) 0,
rgba(0, 0, 0, 0.25) 2px,
transparent 2px,
transparent 4px
),
repeating-linear-gradient(
-45deg,
rgba(0, 0, 0, 0.25) 0,
rgba(0, 0, 0, 0.25) 2px,
transparent 2px,
transparent 4px
);
background-blend-mode: multiply;
}
.graph-legend-channel--dots .graph-legend-channel-swatch {
background-image: radial-gradient(rgba(0, 0, 0, 0.35) 30%, transparent 31%);
background-size: 6px 6px;
background-blend-mode: multiply;
}
.graph-legend-note {
font-size: 10px;
color: #555;
font-style: italic;
}
.title-bar-link {
display: inline-block;
color: inherit;
text-decoration: none;
font-size: 11px;
padding: 2px 6px;
border: 1px solid;
border-color: ButtonHighlight ButtonShadow ButtonShadow ButtonHighlight;
background: ButtonFace;
}
.title-bar-controls #aboutBtn {
font-weight: bold;
font-size: 12px;
padding: 0 6px;
margin-right: 6px;
}
.toggle-item {
display: flex;
align-items: center;
gap: 6px;
}
.toggle-help {
font-size: 10px;
color: #555;
}
.about-panel {
position: fixed;
top: 20px;
right: 20px;
width: 280px;
background: Window;
border: 2px solid #919b9c;
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.25);
z-index: 2100;
font-size: 11px;
}
.about-panel__header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 6px 8px;
background: #0055aa;
color: #fff;
}
.about-panel__body {
padding: 8px;
background: Window;
color: #000;
}
.about-panel__header button {
border: none;
background: transparent;
color: inherit;
font-weight: bold;
cursor: pointer;
}
/* Results styling */ /* Results styling */
#results .item { #results .item {
border-bottom: 1px solid ButtonShadow; background: Window;
padding: 12px 0; border: 2px solid #919b9c;
padding: 12px;
margin-bottom: 8px; margin-bottom: 8px;
max-width: 100%;
overflow: hidden;
word-wrap: break-word;
box-sizing: border-box;
box-shadow: 2px 2px 0 rgba(0, 0, 0, 0.15);
} }
#results .item:last-child { #results .item:last-child {
border-bottom: none; margin-bottom: 0;
}
#results .item strong {
word-break: break-word;
max-width: 100%;
display: inline-block;
}
.window-body {
max-width: 100%;
overflow-x: hidden;
margin: 0;
padding: 1rem;
box-sizing: border-box;
} }
/* Badges */ /* Badges */
@@ -180,6 +548,8 @@ body.dimmed {
display: flex; display: flex;
gap: 4px; gap: 4px;
flex-wrap: wrap; flex-wrap: wrap;
max-width: 100%;
overflow: hidden;
} }
.badge { .badge {
@@ -189,6 +559,25 @@ body.dimmed {
padding: 2px 6px; padding: 2px 6px;
font-size: 10px; font-size: 10px;
font-weight: bold; font-weight: bold;
white-space: nowrap;
word-break: keep-all;
}
.badge--transcript-primary {
background: #0b6efd;
}
.badge--transcript-secondary {
background: #8f4bff;
}
.badge-clickable {
cursor: pointer;
}
.badge-clickable:focus {
outline: 2px solid rgba(11, 110, 253, 0.6);
outline-offset: 1px;
} }
/* Transcript and highlights */ /* Transcript and highlights */
@@ -212,9 +601,14 @@ body.dimmed {
} }
.highlight-row { .highlight-row {
padding: 4px; padding: 4px 6px;
cursor: pointer; cursor: pointer;
border: 1px solid transparent; border: 1px solid transparent;
display: flex;
align-items: flex-start;
gap: 8px;
max-width: 100%;
box-sizing: border-box;
} }
.highlight-row:hover { .highlight-row:hover {
@@ -223,6 +617,77 @@ body.dimmed {
border: 1px dotted WindowText; border: 1px dotted WindowText;
} }
.highlight-text {
flex: 1 1 auto;
word-break: break-word;
overflow-wrap: anywhere;
}
.highlight-source-indicator {
width: 10px;
height: 10px;
border-radius: 2px;
border: 1px solid transparent;
margin-left: auto;
flex: 0 0 auto;
}
.highlight-source-indicator--primary {
background: #0b6efd;
border-color: #084bb5;
}
.highlight-source-indicator--secondary {
background: #8f4bff;
border-color: #5d2db3;
}
.vector-chunk {
margin-top: 8px;
padding: 8px;
background: #f3f7ff;
border: 1px solid #c7d0e2;
font-size: 11px;
line-height: 1.5;
word-break: break-word;
}
@media screen and (max-width: 640px) {
.result-header {
flex-direction: column;
gap: 6px;
}
.result-header-main {
flex: 1 1 auto;
min-width: 0;
width: 100%;
}
.result-actions {
width: auto;
align-self: flex-start;
justify-content: flex-start;
flex-wrap: wrap;
gap: 4px;
margin-left: 0;
}
.result-action-btn {
width: 100%;
text-align: left;
}
.highlight-row {
flex-direction: column;
gap: 4px;
}
.highlight-source-indicator {
align-self: flex-end;
}
}
mark { mark {
background: yellow; background: yellow;
color: black; color: black;
@@ -237,8 +702,7 @@ mark {
margin-top: 12px; margin-top: 12px;
padding: 8px; padding: 8px;
background: Window; background: Window;
border: 2px solid; border: 2px solid #919b9c;
border-color: ButtonShadow ButtonHighlight ButtonHighlight ButtonShadow;
max-height: 400px; max-height: 400px;
overflow-y: auto; overflow-y: auto;
font-size: 11px; font-size: 11px;
@@ -250,6 +714,10 @@ mark {
border-bottom: 1px solid ButtonShadow; border-bottom: 1px solid ButtonShadow;
} }
.transcript-segment--matched {
background: #fff6cc;
}
.transcript-segment:last-child { .transcript-segment:last-child {
border-bottom: none; border-bottom: none;
margin-bottom: 0; margin-bottom: 0;
@@ -294,27 +762,9 @@ mark {
line-height: 1.4; line-height: 1.4;
} }
.transcript-header { .transcript-header,
font-weight: bold;
margin-bottom: 8px;
display: flex;
align-items: center;
justify-content: space-between;
background: ActiveCaption;
color: CaptionText;
padding: 2px 4px;
}
.transcript-close { .transcript-close {
cursor: pointer; display: none;
font-size: 16px;
padding: 0 4px;
font-weight: bold;
}
.transcript-close:hover {
background: Highlight;
color: HighlightText;
} }
/* Chart styling */ /* Chart styling */

46
static/vector.html Normal file

@@ -0,0 +1,46 @@
<!doctype html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>TLC Vector Search</title>
<link rel="icon" href="/static/favicon.png" type="image/png" />
<link rel="stylesheet" href="https://unpkg.com/xp.css" />
<link rel="stylesheet" href="/static/style.css" />
</head>
<body>
<div class="window" style="max-width: 1200px; margin: 20px auto;">
<div class="title-bar">
<div class="title-bar-text">Vector Search (Experimental)</div>
<div class="title-bar-controls">
<a class="title-bar-link" href="/">⬅ Back to Search</a>
</div>
</div>
<div class="window-body">
<p>Enter a natural language prompt; results come from the Qdrant vector index.</p>
<fieldset>
<legend>Vector Query</legend>
<div class="field-row" style="margin-bottom: 8px;">
<label for="vectorQuery" style="width: 60px;">Query:</label>
<input id="vectorQuery" type="text" placeholder="Describe what you are looking for" style="flex: 1;" />
<button id="vectorSearchBtn">Search</button>
</div>
</fieldset>
<div id="vectorMeta" style="margin-top: 12px; font-size: 11px;"></div>
<fieldset style="margin-top: 16px;">
<legend>Results</legend>
<div id="vectorResults"></div>
</fieldset>
</div>
<div class="status-bar">
<p class="status-bar-field">Experimental mode • Qdrant</p>
</div>
</div>
<script src="/static/vector.js"></script>
</body>
</html>

423
static/vector.js Normal file

@@ -0,0 +1,423 @@
(() => {
const queryInput = document.getElementById("vectorQuery");
const searchBtn = document.getElementById("vectorSearchBtn");
const resultsDiv = document.getElementById("vectorResults");
const metaDiv = document.getElementById("vectorMeta");
const transcriptCache = new Map();
if (!queryInput || !searchBtn || !resultsDiv || !metaDiv) {
console.error("Vector search elements missing");
return;
}
/** Utility helpers **/
const escapeHtml = (str) =>
(str || "").replace(/[&<>"']/g, (ch) => {
switch (ch) {
case "&":
return "&amp;";
case "<":
return "&lt;";
case ">":
return "&gt;";
case '"':
return "&quot;";
case "'":
return "&#39;";
default:
return ch;
}
});
const fmtDate = (value) => {
try {
return (value || "").split("T")[0];
} catch {
return value;
}
};
const fmtSimilarity = (score) => {
if (typeof score !== "number" || Number.isNaN(score)) return "";
return score.toFixed(3);
};
const getVideoStatus = (item) =>
(item && item.video_status ? String(item.video_status).toLowerCase() : "");
const isLikelyDeleted = (item) => getVideoStatus(item) === "deleted";
const formatTimestamp = (seconds) => {
if (!seconds && seconds !== 0) return "00:00";
const hours = Math.floor(seconds / 3600);
const mins = Math.floor((seconds % 3600) / 60);
const secs = Math.floor(seconds % 60);
if (hours > 0) {
return `${hours}:${mins.toString().padStart(2, "0")}:${secs
.toString()
.padStart(2, "0")}`;
}
return `${mins}:${secs.toString().padStart(2, "0")}`;
};
const formatSegmentTimestamp = (segment) => {
if (!segment) return "";
if (segment.timestamp) return segment.timestamp;
const fields = [
segment.start_seconds,
segment.start,
segment.offset,
segment.time,
];
for (const value of fields) {
if (value == null) continue;
const num = parseFloat(value);
if (!Number.isNaN(num)) {
return formatTimestamp(num);
}
}
return "";
};
const serializeTranscriptSection = (label, parts, fullText) => {
let content = "";
if (typeof fullText === "string" && fullText.trim()) {
content = fullText.trim();
} else if (Array.isArray(parts) && parts.length) {
content = parts
.map((segment) => {
const ts = formatSegmentTimestamp(segment);
const text = segment && segment.text ? segment.text : "";
return ts ? `[${ts}] ${text}` : text;
})
.join("\n")
.trim();
}
if (!content) return "";
return `${label}\n${content}\n`;
};
const fetchTranscriptData = async (videoId) => {
if (!videoId) return null;
if (transcriptCache.has(videoId)) {
return transcriptCache.get(videoId);
}
const res = await fetch(`/api/transcript?video_id=${encodeURIComponent(videoId)}`);
if (!res.ok) {
throw new Error(`Transcript fetch failed (${res.status})`);
}
const data = await res.json();
transcriptCache.set(videoId, data);
return data;
};
const buildTranscriptDownloadText = (item, transcriptData) => {
const lines = [];
lines.push(`Title: ${item.title || "Untitled"}`);
if (item.channel_name) lines.push(`Channel: ${item.channel_name}`);
if (item.date) lines.push(`Published: ${item.date}`);
if (item.url) lines.push(`URL: ${item.url}`);
lines.push("");
const primaryText = serializeTranscriptSection(
"Primary Transcript",
transcriptData.transcript_parts,
transcriptData.transcript_full
);
const secondaryText = serializeTranscriptSection(
"Secondary Transcript",
transcriptData.transcript_secondary_parts,
transcriptData.transcript_secondary_full
);
if (primaryText) lines.push(primaryText);
if (secondaryText) lines.push(secondaryText);
if (!primaryText && !secondaryText) {
lines.push("No transcript available.");
}
return lines.join("\n").trim() + "\n";
};
const flashButtonMessage = (button, message, duration = 1800) => {
if (!button) return;
const original = button.dataset.originalLabel || button.textContent;
button.dataset.originalLabel = original;
button.textContent = message;
setTimeout(() => {
button.textContent = button.dataset.originalLabel || original;
}, duration);
};
const handleTranscriptDownload = async (item, button) => {
if (!item.video_id) return;
button.disabled = true;
try {
const transcriptData = await fetchTranscriptData(item.video_id);
if (!transcriptData) throw new Error("Transcript unavailable");
const text = buildTranscriptDownloadText(item, transcriptData);
const blob = new Blob([text], { type: "text/plain" });
const url = URL.createObjectURL(blob);
const link = document.createElement("a");
link.href = url;
link.download = `${item.video_id}.txt`;
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
URL.revokeObjectURL(url);
flashButtonMessage(button, "Downloaded");
} catch (err) {
console.error("Download failed", err);
alert("Unable to download transcript right now.");
} finally {
button.disabled = false;
}
};
const formatMlaDate = (value) => {
if (!value) return "n.d.";
const parsed = new Date(value);
if (Number.isNaN(parsed.valueOf())) return value;
const months = [
"Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
"July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec.",
];
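// Use UTC getters: date-only ISO strings parse as UTC midnight, so local getters can shift the day.
return `${parsed.getUTCDate()} ${months[parsed.getUTCMonth()]} ${parsed.getUTCFullYear()}`;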
};
const buildMlaCitation = (item) => {
const channel = (item.channel_name || item.channel_id || "Unknown").trim();
const title = (item.title || "Untitled").trim();
const url = item.url || "";
const publishDate = formatMlaDate(item.date);
const today = formatMlaDate(new Date().toISOString().split("T")[0]);
return `${channel}. "${title}." YouTube, uploaded by ${channel}, ${publishDate}, ${url}. Accessed ${today}.`;
};
const handleCopyCitation = async (item, button) => {
const citation = buildMlaCitation(item);
try {
if (navigator.clipboard && window.isSecureContext) {
await navigator.clipboard.writeText(citation);
} else {
const textarea = document.createElement("textarea");
textarea.value = citation;
textarea.style.position = "fixed";
textarea.style.opacity = "0";
document.body.appendChild(textarea);
textarea.select();
document.execCommand("copy");
document.body.removeChild(textarea);
}
flashButtonMessage(button, "Copied!");
} catch (err) {
console.error("Citation copy failed", err);
alert(citation);
}
};
/** Rendering helpers **/
const createHighlightRows = (entries) => {
if (!Array.isArray(entries) || !entries.length) return null;
const container = document.createElement("div");
container.className = "transcript highlight-list";
entries.forEach((entry) => {
if (!entry) return;
const row = document.createElement("div");
row.className = "highlight-row";
const textBlock = document.createElement("div");
textBlock.className = "highlight-text";
const html = entry.html || entry.text || entry;
textBlock.innerHTML = html || "";
row.appendChild(textBlock);
const indicator = document.createElement("span");
indicator.className = "highlight-source-indicator highlight-source-indicator--primary";
indicator.title = "Vector highlight";
row.appendChild(indicator);
container.appendChild(row);
});
return container;
};
const createActions = (item) => {
const actions = document.createElement("div");
actions.className = "result-actions";
const downloadBtn = document.createElement("button");
downloadBtn.type = "button";
downloadBtn.className = "result-action-btn";
downloadBtn.textContent = "Download transcript";
downloadBtn.addEventListener("click", () => handleTranscriptDownload(item, downloadBtn));
actions.appendChild(downloadBtn);
const citationBtn = document.createElement("button");
citationBtn.type = "button";
citationBtn.className = "result-action-btn";
citationBtn.textContent = "Copy citation";
citationBtn.addEventListener("click", () => handleCopyCitation(item, citationBtn));
actions.appendChild(citationBtn);
const graphBtn = document.createElement("button");
graphBtn.type = "button";
graphBtn.className = "result-action-btn graph-launch-btn";
graphBtn.textContent = "Graph";
graphBtn.disabled = !item.video_id;
graphBtn.addEventListener("click", () => {
if (!item.video_id) return;
const target = `/graph?video_id=${encodeURIComponent(item.video_id)}`;
window.open(target, "_blank", "noopener");
});
actions.appendChild(graphBtn);
return actions;
};
const renderVectorResults = (payload) => {
resultsDiv.innerHTML = "";
const items = payload.items || [];
if (!items.length) {
metaDiv.textContent = "No vector matches for this prompt.";
return;
}
metaDiv.textContent = `Matches: ${items.length} (vector mode)`;
items.forEach((item) => {
const el = document.createElement("div");
el.className = "item";
const header = document.createElement("div");
header.className = "result-header";
const headerMain = document.createElement("div");
headerMain.className = "result-header-main";
const titleEl = document.createElement("strong");
titleEl.innerHTML = item.titleHtml || escapeHtml(item.title || "Untitled");
headerMain.appendChild(titleEl);
const metaLine = document.createElement("div");
metaLine.className = "muted result-meta";
const channelLabel = item.channel_name || item.channel_id || "Unknown";
const dateLabel = fmtDate(item.date);
let durationSeconds = null;
if (typeof item.duration === "number") {
durationSeconds = item.duration;
} else if (typeof item.duration === "string" && item.duration.trim()) {
const parsed = parseFloat(item.duration);
if (!Number.isNaN(parsed)) {
durationSeconds = parsed;
}
}
const durationLabel = durationSeconds != null ? formatTimestamp(durationSeconds) : "";
// Join only the non-empty parts so the meta line reads "Channel • Date • Duration".
metaLine.textContent = [channelLabel, dateLabel, durationLabel].filter(Boolean).join(" • ");
if (isLikelyDeleted(item)) {
metaLine.appendChild(document.createTextNode(" "));
const statusEl = document.createElement("span");
statusEl.className = "result-status result-status--deleted";
statusEl.textContent = "Likely deleted";
metaLine.appendChild(statusEl);
}
headerMain.appendChild(metaLine);
if (item.url) {
const linkLine = document.createElement("div");
linkLine.className = "muted";
const anchor = document.createElement("a");
anchor.href = item.url;
anchor.target = "_blank";
anchor.rel = "noopener";
anchor.textContent = "Open on YouTube";
linkLine.appendChild(anchor);
headerMain.appendChild(linkLine);
}
if (typeof item.distance === "number") {
const scoreLine = document.createElement("div");
scoreLine.className = "muted";
scoreLine.textContent = `Similarity score: ${fmtSimilarity(item.distance)}`;
headerMain.appendChild(scoreLine);
}
header.appendChild(headerMain);
header.appendChild(createActions(item));
el.appendChild(header);
if (item.descriptionHtml || item.description) {
const desc = document.createElement("div");
desc.className = "muted description-block";
desc.innerHTML = item.descriptionHtml || escapeHtml(item.description);
el.appendChild(desc);
}
if (item.chunkText) {
const chunkBlock = document.createElement("div");
chunkBlock.className = "vector-chunk";
if (item.chunkTimestamp && item.url) {
const tsObj =
typeof item.chunkTimestamp === "object"
? item.chunkTimestamp
: { timestamp: item.chunkTimestamp };
const ts = formatSegmentTimestamp(tsObj);
const tsLink = document.createElement("a");
const paramValue =
typeof item.chunkTimestamp === "number"
? Math.floor(item.chunkTimestamp)
: item.chunkTimestamp;
tsLink.href = `${item.url}${item.url.includes("?") ? "&" : "?"}t=${encodeURIComponent(
paramValue
)}`;
tsLink.target = "_blank";
tsLink.rel = "noopener";
tsLink.textContent = ts ? `[${ts}]` : "[timestamp]";
chunkBlock.appendChild(tsLink);
chunkBlock.appendChild(document.createTextNode(" "));
}
const chunkTextSpan = document.createElement("span");
chunkTextSpan.textContent = item.chunkText;
chunkBlock.appendChild(chunkTextSpan);
el.appendChild(chunkBlock);
}
const highlights = createHighlightRows(item.toHighlight);
if (highlights) {
el.appendChild(highlights);
}
resultsDiv.appendChild(el);
});
};
/** Search handler **/
const runVectorSearch = async () => {
const query = queryInput.value.trim();
if (!query) {
alert("Please enter a query.");
return;
}
metaDiv.textContent = "Searching vector index…";
resultsDiv.innerHTML = "";
searchBtn.disabled = true;
try {
const res = await fetch("/api/vector-search", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ query }),
});
if (!res.ok) {
throw new Error(`Vector search failed (${res.status})`);
}
const data = await res.json();
if (data.error) {
metaDiv.textContent = "Vector search unavailable.";
return;
}
renderVectorResults(data);
} catch (err) {
console.error(err);
metaDiv.textContent = "Vector search unavailable.";
} finally {
searchBtn.disabled = false;
}
};
searchBtn.addEventListener("click", runVectorSearch);
queryInput.addEventListener("keypress", (event) => {
if (event.key === "Enter") {
runVectorSearch();
}
});
})();
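
Note on the chunk timestamp link in `renderVectorResults`: the `t=` parameter is appended by string concatenation. A minimal sketch of the same idea using the standard `URL` API is shown here. `withTimestamp` is a hypothetical helper name, not part of this commit, and the sketch assumes `item.url` is an absolute URL and the timestamp is a number of seconds.

```javascript
// Hypothetical helper (not in the commit): append or replace the `t` query
// parameter on an absolute video URL, flooring fractional seconds.
function withTimestamp(videoUrl, seconds) {
  // new URL() throws a TypeError on a relative or malformed URL.
  const url = new URL(videoUrl);
  // set() replaces any existing `t` value instead of adding a duplicate.
  url.searchParams.set("t", String(Math.floor(seconds)));
  return url.toString();
}

console.log(withTimestamp("https://www.youtube.com/watch?v=dQw4w9WgXcQ", 75.9));
```

Unlike concatenation, `searchParams.set` overwrites a `t` parameter that is already present in the URL, which may or may not be the behavior wanted here.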

188
sync_qdrant_channels.py Normal file

@@ -0,0 +1,188 @@
"""
Utility to backfill channel titles/names inside the Qdrant payloads.
Usage:
python -m python_app.sync_qdrant_channels \
--batch-size 512 \
--max-points 200 \
--dry-run
"""
from __future__ import annotations
import argparse
import logging
from typing import Dict, Iterable, List, Optional, Set, Tuple
import time
import requests
from .config import CONFIG
from .search_app import _ensure_client
LOGGER = logging.getLogger(__name__)
def chunked(iterable: Iterable, size: int):
chunk: List = []
for item in iterable:
chunk.append(item)
if len(chunk) >= size:
yield chunk
chunk = []
if chunk:
yield chunk
def resolve_channels(channel_ids: Iterable[str]) -> Dict[str, str]:
client = _ensure_client(CONFIG)
ids = list(set(channel_ids))
if not ids:
return {}
body = {
"size": len(ids) * 2,
"_source": ["channel_id", "channel_name"],
"query": {"terms": {"channel_id.keyword": ids}},
}
response = client.search(index=CONFIG.elastic.index, body=body)
resolved: Dict[str, str] = {}
for hit in response.get("hits", {}).get("hits", []):
source = hit.get("_source") or {}
cid = source.get("channel_id")
cname = source.get("channel_name")
if cid and cname and cid not in resolved:
resolved[cid] = cname
return resolved


def upsert_channel_payload(
    qdrant_url: str,
    collection: str,
    channel_id: str,
    channel_name: str,
    *,
    dry_run: bool = False,
) -> bool:
    """Set channel_name/channel_title for all vectors with this channel_id."""
    payload = {"channel_name": channel_name, "channel_title": channel_name}
    body = {
        "payload": payload,
        "filter": {"must": [{"key": "channel_id", "match": {"value": channel_id}}]},
    }
    LOGGER.info("Updating channel_id=%s -> %s", channel_id, channel_name)
    if dry_run:
        return True
    resp = requests.post(
        f"{qdrant_url}/collections/{collection}/points/payload",
        json=body,
        timeout=120,
    )
    if resp.status_code >= 400:
        LOGGER.error("Failed to update %s: %s", channel_id, resp.text)
        return False
    return True


def scroll_missing_payloads(
    qdrant_url: str,
    collection: str,
    batch_size: int,
    *,
    max_points: Optional[int] = None,
) -> Iterable[List[Tuple[str, Dict[str, object]]]]:
    """Yield batches of (point_id, payload) missing channel names."""
    fetched = 0
    next_page = None
    while True:
        current_limit = batch_size
        transient_errors = 0
        while True:
            body = {
                "limit": current_limit,
                "with_payload": True,
                "filter": {"must": [{"is_empty": {"key": "channel_name"}}]},
            }
            if next_page is not None:
                body["offset"] = next_page
            try:
                resp = requests.post(
                    f"{qdrant_url}/collections/{collection}/points/scroll",
                    json=body,
                    timeout=120,
                )
                resp.raise_for_status()
                break
            except requests.HTTPError as exc:
                # HTTP-level failure: retry with a smaller page, down to 5 points.
                LOGGER.warning(
                    "Scroll request failed at limit=%s: %s", current_limit, exc
                )
                if current_limit <= 5:
                    raise
                current_limit = max(5, current_limit // 2)
                LOGGER.info("Reducing scroll batch size to %s", current_limit)
                time.sleep(2)
            except requests.RequestException as exc:
                # Connection errors and timeouts: retry at the same limit, but
                # give up after a few attempts instead of looping forever.
                transient_errors += 1
                LOGGER.warning("Transient scroll error: %s", exc)
                if transient_errors >= 5:
                    raise
                time.sleep(2)
        payload = resp.json().get("result", {})
        points = payload.get("points", [])
        if not points:
            break
        batch: List[Tuple[str, Dict[str, object]]] = []
        for point in points:
            pid = point.get("id")
            p_payload = point.get("payload") or {}
            batch.append((pid, p_payload))
        yield batch
        fetched += len(points)
        if max_points and fetched >= max_points:
            break
        next_page = payload.get("next_page_offset")
        if next_page is None:
            break


def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    parser = argparse.ArgumentParser(
        description="Backfill missing channel_name/channel_title in Qdrant payloads"
    )
    parser.add_argument("--batch-size", type=int, default=512)
    parser.add_argument(
        "--max-points",
        type=int,
        default=None,
        help="Limit processing to the first N points for testing",
    )
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()
    q_url = CONFIG.qdrant_url
    collection = CONFIG.qdrant_collection
    total_updates = 0
    for batch in scroll_missing_payloads(
        q_url, collection, args.batch_size, max_points=args.max_points
    ):
        channel_ids: Set[str] = set()
        for _, payload in batch:
            cid = payload.get("channel_id")
            if cid:
                channel_ids.add(str(cid))
        if not channel_ids:
            continue
        resolved = resolve_channels(channel_ids)
        if not resolved:
            LOGGER.warning("No channel names resolved for ids: %s", channel_ids)
            continue
        for cid, name in resolved.items():
            if upsert_channel_payload(
                q_url, collection, cid, name, dry_run=args.dry_run
            ):
                total_updates += 1
        LOGGER.info("Updated %s channel payloads so far", total_updates)
    LOGGER.info("Finished. Total channel updates attempted: %s", total_updates)


if __name__ == "__main__":
    main()
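
Review note on the scroll retry in `scroll_missing_payloads`: after each HTTP error the loop halves the scroll limit with `max(5, current_limit // 2)` and re-raises once a request fails at a limit of 5 or less. The sketch below lists the limits that would be tried if every request failed. `halving_schedule` is a hypothetical helper written only for this illustration; it is not part of the file and makes no network calls.

```python
from typing import List


def halving_schedule(start: int, floor: int = 5) -> List[int]:
    """Return every limit tried when each request fails.

    Mirrors the update rule max(floor, limit // 2) from the scroll loop;
    the helper name and signature are assumptions made for illustration.
    """
    limits = [start]
    # The loop in the script raises once a request fails at limit <= floor,
    # so the schedule ends with the first value that is not above the floor.
    while limits[-1] > floor:
        limits.append(max(floor, limits[-1] // 2))
    return limits


if __name__ == "__main__":
    print(halving_schedule(512))
```

With the default `--batch-size 512` this gives eight attempts (512, 256, 128, 64, 32, 16, 8, 5), separated by the 2-second sleeps, before the error propagates.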