TLC-Search/README.md
2026-01-08 15:24:05 -05:00

# Python Search Toolkit (Rough Draft)
This minimal Python implementation covers three core needs:
1. **Collect transcripts** from YouTube channels.
2. **Ingest transcripts/metadata** into Elasticsearch.
3. **Expose a simple Flask search UI** that queries Elasticsearch directly.
The code lives alongside the existing C# stack so you can experiment without
touching production infrastructure.
## Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r python_app/requirements.txt
```
Configure your environment as needed:
```bash
export ELASTIC_URL=http://localhost:9200
export ELASTIC_INDEX=this_little_corner_py
export ELASTIC_USERNAME=elastic # optional
export ELASTIC_PASSWORD=secret # optional
export ELASTIC_API_KEY=XXXX # optional alternative auth
export ELASTIC_CA_CERT=/path/to/ca.pem # optional, for self-signed TLS
export ELASTIC_VERIFY_CERTS=1 # set to 0 to skip verification (dev only)
export ELASTIC_DEBUG=0 # set to 1 for verbose request/response logging
export LOCAL_DATA_DIR=./data/video_metadata # defaults to this
export YOUTUBE_API_KEY=AIza... # required for live collection
```
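As a rough illustration of how these variables might be read in one place, here is a minimal sketch. The `ElasticSettings` class, the `load_elastic_settings` helper, and the fallback defaults are assumptions made for this example; the real `python_app.config` module may be organised differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ElasticSettings:
    """Hypothetical container for the Elasticsearch-related variables."""
    url: str
    index: str
    username: Optional[str]
    password: Optional[str]
    api_key: Optional[str]
    ca_cert: Optional[str]
    verify_certs: bool
    debug: bool


def _flag(value: str) -> bool:
    # Treat "1", "true", "yes" and "on" (any case) as enabled.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_elastic_settings(env: Mapping[str, str] = os.environ) -> ElasticSettings:
    # The fallback values below are assumptions, not confirmed defaults.
    return ElasticSettings(
        url=env.get("ELASTIC_URL", "http://localhost:9200"),
        index=env.get("ELASTIC_INDEX", "this_little_corner_py"),
        username=env.get("ELASTIC_USERNAME") or None,
        password=env.get("ELASTIC_PASSWORD") or None,
        api_key=env.get("ELASTIC_API_KEY") or None,
        ca_cert=env.get("ELASTIC_CA_CERT") or None,
        verify_certs=_flag(env.get("ELASTIC_VERIFY_CERTS", "1")),
        debug=_flag(env.get("ELASTIC_DEBUG", "0")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.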
## 1. Collect Transcripts
```bash
python -m python_app.transcript_collector \
--channel UCxxxx \
--output data/raw \
--max-pages 2
```
Each video becomes a JSON file containing metadata plus transcript segments
(`TranscriptSegment`). Downloads require both `google-api-python-client` and
`youtube-transcript-api`, as well as a valid `YOUTUBE_API_KEY`.
> Already have cached JSON? You can skip this step and move straight to ingesting.
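
To give a feel for the per-video JSON described above, the sketch below serialises one record. Only the name `TranscriptSegment` comes from this README; the `VideoRecord` class and every field name are hypothetical, and the collector's actual schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TranscriptSegment:
    # Field names are assumed for illustration.
    text: str
    start: float     # seconds from the beginning of the video
    duration: float  # seconds


@dataclass
class VideoRecord:
    # Hypothetical metadata fields; the real files may carry more.
    video_id: str
    channel_id: str
    title: str
    transcript: List[TranscriptSegment] = field(default_factory=list)


def to_json(record: VideoRecord) -> str:
    # asdict() recurses into the nested segment dataclasses.
    return json.dumps(asdict(record), ensure_ascii=False, indent=2)
```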
## 2. Ingest Into Elasticsearch
```bash
python -m python_app.ingest \
--source data/video_metadata \
--index this_little_corner_py
```
The script walks the source directory, builds `bulk` requests, and creates the
index with a lightweight mapping when needed. Authentication is handled via
`ELASTIC_USERNAME` / `ELASTIC_PASSWORD` if set.
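
The walk-and-bulk step can be sketched with the standard library alone. This is not the actual `python_app.ingest` code: the helper names and the use of a `video_id` field as the document ID are assumptions. The newline-delimited layout (an action line followed by the document, with a trailing newline) is the format the Elasticsearch `_bulk` endpoint expects.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_bulk_lines(source_dir: Path, index: str) -> Iterator[str]:
    """Yield alternating action/document lines for every JSON file found."""
    for path in sorted(source_dir.rglob("*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        # Assumption: use "video_id" as the document ID, else the file name.
        doc_id = doc.get("video_id", path.stem)
        yield json.dumps({"index": {"_index": index, "_id": doc_id}})
        yield json.dumps(doc, ensure_ascii=False)


def build_bulk_body(source_dir: Path, index: str) -> str:
    """Join the lines into an NDJSON payload; the body must end with a newline."""
    lines = list(iter_bulk_lines(source_dir, index))
    return "\n".join(lines) + "\n" if lines else ""
```

Using a stable `_id` means re-running the ingest overwrites existing documents instead of duplicating them.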
## 3. Serve the Search Frontend
```bash
python -m python_app.search_app
```
Visit <http://localhost:8080/> and you'll see a barebones UI that:
- Lists channels via a terms aggregation.
- Queries titles/descriptions/transcripts with toggleable exact, fuzzy, and phrase clauses plus optional date sorting.
- Surfaces transcript highlights.
- Lets you pull the full transcript for any result on demand.
- Shows a stacked-by-channel timeline for each search query (with `/frequency` offering a standalone explorer) powered by D3.js.
- Supports a query-string mode toggle so you can write advanced Lucene queries (e.g. `meaning OR purpose`, `meaning~2` for fuzzy matches, `title:(meaning crisis)`), while the default toggles stay AND-backed.
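
The toggles listed above could map onto an Elasticsearch request body roughly as follows. This is a sketch, not the app's actual query builder: the function name and the field names (`title`, `description`, `transcript`, `channel_id`, `published_at`) are assumptions. The query types used (`multi_match`, `query_string`, `terms` aggregation, `highlight`) are standard Elasticsearch Query DSL.

```python
from typing import Any, Dict, List

# Hypothetical field names; adjust them to the real index mapping.
SEARCH_FIELDS = ["title", "description", "transcript"]


def build_search_body(
    text: str,
    *,
    exact: bool = True,
    fuzzy: bool = False,
    phrase: bool = False,
    query_string: bool = False,
    sort_by_date: bool = False,
    size: int = 20,
) -> Dict[str, Any]:
    if query_string:
        # Advanced mode: pass the user's Lucene syntax straight through.
        query: Dict[str, Any] = {
            "query_string": {"query": text, "fields": SEARCH_FIELDS}
        }
    else:
        base = {"query": text, "fields": SEARCH_FIELDS}
        should: List[Dict[str, Any]] = []
        if exact:
            should.append({"multi_match": {**base, "operator": "and"}})
        if fuzzy:
            should.append(
                {"multi_match": {**base, "operator": "and", "fuzziness": "AUTO"}}
            )
        if phrase:
            should.append({"multi_match": {**base, "type": "phrase"}})
        if not should:
            # No toggle selected: fall back to the AND-backed exact clause.
            should.append({"multi_match": {**base, "operator": "and"}})
        query = {"bool": {"should": should, "minimum_should_match": 1}}

    body: Dict[str, Any] = {
        "query": query,
        "size": size,
        "highlight": {"fields": {"transcript": {}}},
        "aggs": {"channels": {"terms": {"field": "channel_id"}}},
    }
    if sort_by_date:
        body["sort"] = [{"published_at": {"order": "desc"}}]
    return body
```

Keeping the builder as a pure function that returns a dict makes it straightforward to unit-test without a running cluster.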
## Integration Notes
- All modules share configuration through `python_app.config.CONFIG`, so you can
fine-tune paths or credentials centrally.
- The ingest flow reuses existing JSON schema from `data/video_metadata`, so no
re-download is necessary if you already have the dumps.
- Everything is intentionally simple (no Celery, task queues, or custom auth) to
keep the draft approachable and easy to extend.
Feel free to expand on this scaffold (add proper logging, schedule transcript
updates, or flesh out the UI) once you're happy with the baseline behaviour.
## Run with Docker Compose (App Only; Remote ES/Qdrant)
The provided compose file builds/runs only the Flask app and expects **remote** Elasticsearch/Qdrant endpoints. Supply them via environment variables (directly or a `.env` alongside `docker-compose.yml`):
```bash
ELASTIC_URL=https://your-es-host:9200 \
QDRANT_URL=https://your-qdrant-host:6333 \
docker compose up --build
```
Other tunables (defaults shown in compose):
- `ELASTIC_INDEX` (default `this_little_corner_py`)
- `ELASTIC_USERNAME` / `ELASTIC_PASSWORD` or `ELASTIC_API_KEY`
- `ELASTIC_VERIFY_CERTS` (set to `1` for real TLS verification)
- `QDRANT_COLLECTION` (default `tlc-captions-full`)
- `QDRANT_VECTOR_NAME` / `QDRANT_VECTOR_SIZE` / `QDRANT_EMBED_MODEL`
- `RATE_LIMIT_ENABLED` (default `1`)
- `RATE_LIMIT_REQUESTS` (default `60`)
- `RATE_LIMIT_WINDOW_SECONDS` (default `60`)
Port 8080 on the host is forwarded to the app. Mount `./data` (read-only) if you want local fallbacks for metrics (`LOCAL_DATA_DIR=/app/data/video_metadata`); otherwise the app will rely purely on the remote backends. Stop the container with `docker compose down`.
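
For intuition about what `RATE_LIMIT_REQUESTS` and `RATE_LIMIT_WINDOW_SECONDS` control, here is a minimal in-memory sliding-window limiter. It is an illustrative sketch only: the class name is hypothetical, and the app's real limiter may use a different algorithm or key clients differently.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(
        self,
        max_requests: int = 60,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True
```

In a Flask app, a check like `limiter.allow(request.remote_addr)` could run in a `before_request` hook and return HTTP 429 when it fails. Note that this state is per process, so it would not be shared across multiple workers.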
## CI (Docker build)
A Gitea Actions workflow (`.gitea/workflows/docker-build.yml`) builds and pushes the Docker image on every push to `master`. Configure the following repository secrets in Gitea:
- `DOCKER_USERNAME`
- `DOCKER_PASSWORD`
The image is tagged as `gitea.ghost.tel/knight/tlc-search:latest` and with the commit SHA. Adjust `IMAGE_NAME` in the workflow if you need a different registry/repo.