# Python Search Toolkit (Rough Draft)

This minimal Python implementation covers three core needs:

1. **Collect transcripts** from YouTube channels.
2. **Ingest transcripts/metadata** into Elasticsearch.
3. **Expose a simple Flask search UI** that queries Elasticsearch directly.

The code lives alongside the existing C# stack so you can experiment without
touching production infrastructure.

## Setup

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r python_app/requirements.txt
```

Configure your environment as needed:

```bash
export ELASTIC_URL=http://localhost:9200
export ELASTIC_INDEX=this_little_corner_py
export ELASTIC_USERNAME=elastic # optional
export ELASTIC_PASSWORD=secret # optional
export ELASTIC_API_KEY=XXXX # optional alternative auth
export ELASTIC_CA_CERT=/path/to/ca.pem # optional, for self-signed TLS
export ELASTIC_VERIFY_CERTS=1 # set to 0 to skip verification (dev only)
export ELASTIC_DEBUG=0 # set to 1 for verbose request/response logging
export LOCAL_DATA_DIR=./data/video_metadata # defaults to this
export YOUTUBE_API_KEY=AIza... # required for live collection
```
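
Inside the app these variables are read once into a shared settings object; below is a minimal sketch of how such a config module might look (the real `python_app.config` may differ in names and defaults):

```python
import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ElasticConfig:
    """Snapshot of the ELASTIC_* environment at import time."""

    url: str = field(default_factory=lambda: os.getenv("ELASTIC_URL", "http://localhost:9200"))
    index: str = field(default_factory=lambda: os.getenv("ELASTIC_INDEX", "this_little_corner_py"))
    username: Optional[str] = field(default_factory=lambda: os.getenv("ELASTIC_USERNAME"))
    password: Optional[str] = field(default_factory=lambda: os.getenv("ELASTIC_PASSWORD"))
    # Treat any value other than "0" as "verify TLS certificates".
    verify_certs: bool = field(default_factory=lambda: os.getenv("ELASTIC_VERIFY_CERTS", "1") != "0")


# Modules import this single instance rather than reading os.environ themselves.
CONFIG = ElasticConfig()
```

Freezing the dataclass keeps the snapshot immutable, so every module sees the same values regardless of later environment changes.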

## 1. Collect Transcripts

```bash
python -m python_app.transcript_collector \
  --channel UCxxxx \
  --output data/raw \
  --max-pages 2
```

Each video becomes a JSON file containing metadata plus transcript segments
(`TranscriptSegment`). Downloads require both `google-api-python-client` and
`youtube-transcript-api`, as well as a valid `YOUTUBE_API_KEY`.

> Already have cached JSON? You can skip this step and move straight to ingesting.
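
For orientation, one collected file looks roughly like the sketch below; the field names here are illustrative guesses, so treat the collector's actual output as authoritative:

```python
import json

# Hypothetical shape of one collected video file: metadata at the top level,
# one entry per TranscriptSegment under "transcript".
video_doc = {
    "video_id": "dQw4w9WgXcQ",
    "channel_id": "UCxxxx",
    "title": "Example video title",
    "published_at": "2023-01-01T00:00:00Z",
    "transcript": [
        {"text": "Welcome back to the channel.", "start": 0.0, "duration": 3.2},
        {"text": "Today: full-text search.", "start": 3.2, "duration": 2.8},
    ],
}

# Files round-trip cleanly through the standard json module.
serialized = json.dumps(video_doc, indent=2)
```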

## 2. Ingest Into Elasticsearch

```bash
python -m python_app.ingest \
  --source data/video_metadata \
  --index this_little_corner_py
```

The script walks the source directory, builds `bulk` requests, and creates the
index with a lightweight mapping when needed. Authentication is handled via
`ELASTIC_USERNAME` / `ELASTIC_PASSWORD` if set.
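
The bulk payload boils down to plain action dicts in the shape `elasticsearch.helpers.bulk` expects; here is a pure-Python sketch of that generator (the file layout and the `video_id` field are assumptions, not the ingest script's exact code):

```python
import json
from pathlib import Path
from typing import Iterator


def bulk_actions(source_dir: str, index: str) -> Iterator[dict]:
    """Yield one bulk action per JSON file under source_dir.

    Using the video id as _id makes re-ingestion idempotent: re-running
    the script overwrites documents instead of duplicating them.
    """
    for path in sorted(Path(source_dir).glob("*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        yield {
            "_op_type": "index",
            "_index": index,
            "_id": doc.get("video_id", path.stem),  # fall back to the filename
            "_source": doc,
        }
```

With the official client these actions feed straight into `helpers.bulk(client, bulk_actions(...))`.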

## 3. Serve the Search Frontend

```bash
python -m python_app.search_app
```

Visit <http://localhost:8080/> and you’ll see a barebones UI that:

- Lists channels via a terms aggregation.
- Queries titles/descriptions/transcripts with toggleable exact, fuzzy, and phrase clauses, plus optional date sorting.
- Surfaces transcript highlights.
- Lets you pull the full transcript for any result on demand.
- Shows a stacked-by-channel timeline for each search query (with `/frequency` offering a standalone explorer), powered by D3.js.
- Supports a query-string mode toggle so you can write advanced Lucene queries (e.g. `meaning OR purpose`, `meaning~2` for fuzzy matches, `title:(meaning crisis)`), while the default toggles stay AND-backed.
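
Those toggles map naturally onto an Elasticsearch `bool` query; the sketch below shows one plausible assembly (the field names and clause choices are assumptions, not the app's exact query):

```python
def build_query(text: str, exact: bool = True, fuzzy: bool = False, phrase: bool = False) -> dict:
    """Assemble a bool query over title/description/transcript fields.

    Each UI toggle contributes one `should` clause; `minimum_should_match`
    ensures at least one clause has to match.
    """
    fields = ["title", "description", "transcript"]
    should = []
    if exact:
        # operator "and" requires every term, matching the AND-backed default.
        should.append({"multi_match": {"query": text, "fields": fields, "operator": "and"}})
    if fuzzy:
        should.append({"multi_match": {"query": text, "fields": fields, "fuzziness": "AUTO"}})
    if phrase:
        should.append({"multi_match": {"query": text, "fields": fields, "type": "phrase"}})
    return {
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
        # Ask Elasticsearch for highlighted transcript fragments.
        "highlight": {"fields": {"transcript": {}}},
    }
```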

## Integration Notes

- All modules share configuration through `python_app.config.CONFIG`, so you can
  fine-tune paths or credentials centrally.
- The ingest flow reuses the existing JSON schema from `data/video_metadata`, so no
  re-download is necessary if you already have the dumps.
- Everything is intentionally simple (no Celery, task queues, or custom auth) to
  keep the draft approachable and easy to extend.

Feel free to expand on this scaffold (add proper logging, schedule transcript
updates, or flesh out the UI) once you’re happy with the baseline behaviour.

## Run with Docker Compose

A quick single-node stack (app + Elasticsearch + Qdrant) is included:

```bash
docker compose build
docker compose up
```

Services:
- **app** (port 8080): Flask UI/API, embeds queries on demand (downloads the model on first run).
- **elasticsearch** (port 9200): single node, security disabled for local use.
- **qdrant** (port 6333): vector index used by `/vector-search`.

Key environment wiring (see `docker-compose.yml` for defaults):
- `ELASTIC_URL=http://elasticsearch:9200`
- `ELASTIC_INDEX=this_little_corner_py`
- `QDRANT_URL=http://qdrant:6333`
- `QDRANT_COLLECTION=tlc-captions-full`
- `LOCAL_DATA_DIR=/app/data/video_metadata` (mounted from `./data`)

Mount `./data` (read-only) if you want local fallbacks for metrics; otherwise the app relies entirely on Elasticsearch/Qdrant. Stop the stack with `docker compose down` (add `-v` to clear ES/Qdrant volumes).
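
Once the containers are up, a quick stdlib-only readiness check can confirm both stores answer; the endpoints below are the standard ES cluster-health route and Qdrant's REST collections listing:

```python
from urllib.request import urlopen


def service_ready(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False on any failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # Covers connection refused, DNS failure, timeouts, and bad URLs.
        return False


# Ports as wired in docker-compose.yml:
# service_ready("http://localhost:9200/_cluster/health")  # Elasticsearch
# service_ready("http://localhost:6333/collections")      # Qdrant REST API
```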