# Python Search Toolkit (Rough Draft)

This minimal Python implementation covers three core needs:

1. **Collect transcripts** from YouTube channels.
2. **Ingest transcripts/metadata** into Elasticsearch.
3. **Expose a simple Flask search UI** that queries Elasticsearch directly.

The code lives alongside the existing C# stack so you can experiment without
touching production infrastructure.

## Setup

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r python_app/requirements.txt
```

Configure your environment as needed:

```bash
export ELASTIC_URL=http://localhost:9200
export ELASTIC_INDEX=this_little_corner_py
export ELASTIC_USERNAME=elastic              # optional
export ELASTIC_PASSWORD=secret               # optional
export ELASTIC_API_KEY=XXXX                  # optional alternative auth
export ELASTIC_CA_CERT=/path/to/ca.pem       # optional, for self-signed TLS
export ELASTIC_VERIFY_CERTS=1                # set to 0 to skip verification (dev only)
export ELASTIC_DEBUG=0                       # set to 1 for verbose request/response logging
export LOCAL_DATA_DIR=./data/video_metadata  # defaults to this
export YOUTUBE_API_KEY=AIza...               # required for live collection
```
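
If you want a feel for how those `ELASTIC_*` variables end up being used, here is a minimal sketch of building an `elasticsearch-py` client from them. It is illustrative only; the actual wiring lives in `python_app.config` and may differ, and the helper name `client_from_env` is made up for this example.

```python
import os

from elasticsearch import Elasticsearch  # pip install elasticsearch


def client_from_env() -> Elasticsearch:
    """Build an Elasticsearch client from the environment variables above (sketch only)."""
    kwargs = {
        "hosts": [os.environ.get("ELASTIC_URL", "http://localhost:9200")],
        "verify_certs": os.environ.get("ELASTIC_VERIFY_CERTS", "1") == "1",
    }
    if os.environ.get("ELASTIC_API_KEY"):
        # API-key auth takes precedence when provided
        kwargs["api_key"] = os.environ["ELASTIC_API_KEY"]
    elif os.environ.get("ELASTIC_USERNAME"):
        kwargs["basic_auth"] = (
            os.environ["ELASTIC_USERNAME"],
            os.environ.get("ELASTIC_PASSWORD", ""),
        )
    if os.environ.get("ELASTIC_CA_CERT"):
        kwargs["ca_certs"] = os.environ["ELASTIC_CA_CERT"]
    return Elasticsearch(**kwargs)
```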
## 1. Collect Transcripts

```bash
python -m python_app.transcript_collector \
  --channel UCxxxx \
  --output data/raw \
  --max-pages 2
```

Each video becomes a JSON file containing metadata plus transcript segments
(`TranscriptSegment`). Downloads require both `google-api-python-client` and
`youtube-transcript-api`, as well as a valid `YOUTUBE_API_KEY`.
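
For orientation, a single fetch-and-dump step with `youtube-transcript-api` looks roughly like the sketch below. This is not the collector's actual code: the field names (`video_id`, `segments`) are placeholders, and the real files carry the full YouTube metadata alongside the segments.

```python
import json
from pathlib import Path

from youtube_transcript_api import YouTubeTranscriptApi  # pip install youtube-transcript-api


def dump_transcript(video_id: str, out_dir: str = "data/raw") -> Path:
    """Fetch caption segments for one video and write them as a JSON dump (sketch)."""
    # Older releases expose the classmethod get_transcript(); newer 1.x releases
    # use YouTubeTranscriptApi().fetch(video_id) instead.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    doc = {
        "video_id": video_id,  # hypothetical schema -- the real collector's differs
        "segments": [
            {"text": s["text"], "start": s["start"], "duration": s["duration"]}
            for s in segments
        ],
    }
    path = Path(out_dir) / f"{video_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(doc, indent=2), encoding="utf-8")
    return path
```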
> Already have cached JSON? You can skip this step and move straight to ingesting.

## 2. Ingest Into Elasticsearch

```bash
python -m python_app.ingest \
  --source data/video_metadata \
  --index this_little_corner_py
```

The script walks the source directory, builds `bulk` requests, and creates the
index with a lightweight mapping when needed. Authentication is handled via
`ELASTIC_USERNAME` / `ELASTIC_PASSWORD` if set.
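
Conceptually the ingest step boils down to something like the following sketch, using `elasticsearch.helpers.bulk`. The mapping, document IDs, and error handling here are illustrative, not the script's actual choices.

```python
import json
from pathlib import Path

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk  # batched _bulk indexing


def ingest(es: Elasticsearch, source: str, index: str) -> None:
    """Index every JSON dump under `source` into `index` via the bulk helper (sketch)."""
    if not es.indices.exists(index=index):
        # Placeholder mapping: the real script defines its own lightweight one.
        es.indices.create(
            index=index,
            mappings={"properties": {"transcript": {"type": "text"}}},
        )
    actions = (
        {
            "_index": index,
            "_id": path.stem,  # e.g. the video ID, if files are named that way
            "_source": json.loads(path.read_text(encoding="utf-8")),
        }
        for path in Path(source).rglob("*.json")
    )
    bulk(es, actions)
```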
## 3. Serve the Search Frontend

```bash
python -m python_app.search_app
```

Visit <http://localhost:8080/> and you’ll see a barebones UI that:

- Lists channels via a terms aggregation.
- Queries titles/descriptions/transcripts with toggleable exact, fuzzy, and phrase clauses plus optional date sorting (see the sketch after this list).
- Surfaces transcript highlights.
- Lets you pull the full transcript for any result on demand.
- Shows a stacked-by-channel timeline for each search query (with `/frequency` offering a standalone explorer), powered by D3.js.
- Supports a query-string mode toggle so you can write advanced Lucene queries (e.g. `meaning OR purpose`, `meaning~2` for fuzzy matches, `title:(meaning crisis)`), while the default toggles stay AND-backed.
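
To make the list above concrete, the request the frontend sends to Elasticsearch is roughly shaped like the body below. The field names (`title`, `description`, `transcript`, `channel_name`, `published_at`) are assumptions about the index mapping, not a contract; the `python_app.search_app` module builds the actual clauses from the toggles described above.

```python
# Representative search body: toggleable exact / fuzzy / phrase clauses,
# transcript highlights, a channel terms aggregation, and a per-month
# stacked-by-channel timeline. Field names are assumed, not confirmed.
query_body = {
    "query": {
        "bool": {
            "should": [
                {"multi_match": {                     # "exact" word matching, AND-backed
                    "query": "meaning crisis",
                    "fields": ["title", "description", "transcript"],
                    "operator": "and",
                }},
                {"multi_match": {                     # fuzzy clause
                    "query": "meaning crisis",
                    "fields": ["title", "description", "transcript"],
                    "fuzziness": "AUTO",
                }},
                {"match_phrase": {"transcript": "meaning crisis"}},  # phrase clause
            ],
            "minimum_should_match": 1,
        }
    },
    "highlight": {"fields": {"transcript": {}}},      # transcript snippets for results
    "aggs": {
        "channels": {"terms": {"field": "channel_name", "size": 50}},   # channel list
        "timeline": {                                                    # D3 timeline data
            "date_histogram": {"field": "published_at", "calendar_interval": "month"},
            "aggs": {"by_channel": {"terms": {"field": "channel_name", "size": 10}}},
        },
    },
    "sort": [{"published_at": "desc"}],               # optional date sorting
}
```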
## Integration Notes

- All modules share configuration through `python_app.config.CONFIG`, so you can
  fine-tune paths or credentials centrally (a rough sketch of that pattern follows this list).
- The ingest flow reuses the existing JSON schema from `data/video_metadata`, so no
  re-download is necessary if you already have the dumps.
- Everything is intentionally simple (no Celery, task queues, or custom auth) to
  keep the draft approachable and easy to extend.
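
The shared-config pattern amounts to something like the sketch below: read the environment once at import time and hand every module the same object. The attribute names here are invented for illustration; see `python_app.config` for the real ones.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Central settings object (illustrative field names, not CONFIG's actual ones)."""
    elastic_url: str = os.environ.get("ELASTIC_URL", "http://localhost:9200")
    elastic_index: str = os.environ.get("ELASTIC_INDEX", "this_little_corner_py")
    local_data_dir: str = os.environ.get("LOCAL_DATA_DIR", "./data/video_metadata")
    youtube_api_key: str = os.environ.get("YOUTUBE_API_KEY", "")


# Imported by the collector, ingester, and search app instead of each module
# re-reading os.environ on its own.
CONFIG = Config()
```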
Feel free to expand on this scaffold (add proper logging, schedule transcript
updates, or flesh out the UI) once you’re happy with the baseline behaviour.

## Run with Docker Compose (App Only; Remote ES/Qdrant)

The provided compose file builds and runs only the Flask app and expects **remote** Elasticsearch/Qdrant endpoints. Supply them via environment variables (directly or through a `.env` file alongside `docker-compose.yml`):

```bash
ELASTIC_URL=https://your-es-host:9200 \
QDRANT_URL=https://your-qdrant-host:6333 \
docker compose up --build
```

Other tunables (defaults shown in the compose file):

- `ELASTIC_INDEX` (default `this_little_corner_py`)
- `ELASTIC_USERNAME` / `ELASTIC_PASSWORD` or `ELASTIC_API_KEY`
- `ELASTIC_VERIFY_CERTS` (set to `1` for real TLS verification)
- `QDRANT_COLLECTION` (default `tlc-captions-full`)
- `QDRANT_VECTOR_NAME` / `QDRANT_VECTOR_SIZE` / `QDRANT_EMBED_MODEL`
- `RATE_LIMIT_ENABLED` (default `1`; the rate-limit semantics are sketched after this list)
- `RATE_LIMIT_REQUESTS` (default `60`)
- `RATE_LIMIT_WINDOW_SECONDS` (default `60`)
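
The rate-limit knobs read as a simple requests-per-window budget per client. The sketch below illustrates that reading (a sliding window keyed by client IP); it describes the intended semantics only and is not the app's actual limiter.

```python
import os
import time
from collections import defaultdict, deque

# At most RATE_LIMIT_REQUESTS requests per client within RATE_LIMIT_WINDOW_SECONDS.
LIMIT = int(os.environ.get("RATE_LIMIT_REQUESTS", "60"))
WINDOW = float(os.environ.get("RATE_LIMIT_WINDOW_SECONDS", "60"))
_hits: dict[str, deque] = defaultdict(deque)


def allow(client_ip: str) -> bool:
    """Return True if this request still fits inside the client's current window."""
    now = time.monotonic()
    window = _hits[client_ip]
    while window and now - window[0] > WINDOW:  # drop timestamps older than the window
        window.popleft()
    if len(window) >= LIMIT:
        return False
    window.append(now)
    return True
```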
Port 8080 on the host is forwarded to the app. Mount `./data` (read-only) if you want local fallbacks for metrics (`LOCAL_DATA_DIR=/app/data/video_metadata`); otherwise the app will rely purely on the remote backends. Stop the container with `docker compose down`.

## CI (Docker build)

A Gitea Actions workflow (`.gitea/workflows/docker-build.yml`) builds and pushes the Docker image on every push to `master`. Configure the following repository secrets in Gitea:

- `DOCKER_USERNAME`
- `DOCKER_PASSWORD`

The image is tagged as `gitea.ghost.tel/knight/tlc-search:latest` and with the commit SHA. Adjust `IMAGE_NAME` in the workflow if you need a different registry/repo.