# Python Search Toolkit (Rough Draft)

This minimal Python implementation covers three core needs:

1. **Collect transcripts** from YouTube channels.
2. **Ingest transcripts/metadata** into Elasticsearch.
3. **Expose a simple Flask search UI** that queries Elasticsearch directly.

The code lives alongside the existing C# stack so you can experiment without
touching production infrastructure.

## Setup

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r python_app/requirements.txt
```

Configure your environment as needed:

```bash
export ELASTIC_URL=http://localhost:9200
export ELASTIC_INDEX=this_little_corner_py
export ELASTIC_USERNAME=elastic              # optional
export ELASTIC_PASSWORD=secret               # optional
export ELASTIC_API_KEY=XXXX                  # optional alternative auth
export ELASTIC_CA_CERT=/path/to/ca.pem       # optional, for self-signed TLS
export ELASTIC_VERIFY_CERTS=1                # set to 0 to skip verification (dev only)
export ELASTIC_DEBUG=0                       # set to 1 for verbose request/response logging
export LOCAL_DATA_DIR=./data/video_metadata  # defaults to this
export YOUTUBE_API_KEY=AIza...               # required for live collection
```

## 1. Collect Transcripts

```bash
python -m python_app.transcript_collector \
    --channel UCxxxx \
    --output data/raw \
    --max-pages 2
```

Each video becomes a JSON file containing metadata plus transcript segments
(`TranscriptSegment`). Downloads require both `google-api-python-client` and
`youtube-transcript-api`, as well as a valid `YOUTUBE_API_KEY`.

> Already have cached JSON? You can skip this step and move straight to ingesting.

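A sketch of the one-JSON-file-per-video output this step produces. The field names here are illustrative assumptions, not the collector's exact schema; the real fields come from the YouTube metadata and the `TranscriptSegment` objects.

```python
import json
from pathlib import Path

# Hypothetical shape of one collected video; field names are assumptions.
video = {
    "video_id": "vid001",
    "channel_id": "UCxxxx",
    "title": "Example title",
    "published_at": "2023-01-15T00:00:00Z",
    "segments": [
        {"text": "welcome back to the channel", "start": 0.0, "duration": 3.2},
        {"text": "today we talk about search", "start": 3.2, "duration": 2.8},
    ],
}

# One file per video, named after its id, under the --output directory.
out_dir = Path("data/raw")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / f"{video['video_id']}.json").write_text(json.dumps(video, indent=2))
```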
## 2. Ingest Into Elasticsearch

```bash
python -m python_app.ingest \
    --source data/video_metadata \
    --index this_little_corner_py
```

The script walks the source directory, builds `bulk` requests, and creates the
index with a lightweight mapping when needed. Authentication is handled via
`ELASTIC_USERNAME` / `ELASTIC_PASSWORD` if set.

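Elasticsearch's `_bulk` endpoint expects newline-delimited JSON: an action line followed by the document itself, with a trailing newline at the end. A sketch of how the ingest step might assemble that payload (the helper name and document fields are illustrative, not the script's actual code):

```python
import json


def build_bulk_body(docs, index):
    """Serialize documents into the NDJSON payload the _bulk endpoint expects."""
    lines = []
    for doc in docs:
        # Action line: index into the target index, using the video id as _id
        # so re-running ingestion updates documents instead of duplicating them.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["video_id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline


body = build_bulk_body(
    [{"video_id": "abc123", "title": "Example", "transcript": "hello world"}],
    "this_little_corner_py",
)
```

Keying `_id` to the video id is what makes the ingest idempotent across re-runs.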
## 3. Serve the Search Frontend

```bash
python -m python_app.search_app
```

Visit <http://localhost:8080/> and you’ll see a barebones UI that:

- Lists channels via a terms aggregation.
- Queries titles/descriptions/transcripts with toggleable exact, fuzzy, and phrase clauses plus optional date sorting.
- Surfaces transcript highlights.
- Lets you pull the full transcript for any result on demand.
- Shows a stacked-by-channel timeline for each search query (with `/frequency` offering a standalone explorer) powered by D3.js.
- Supports a query-string mode toggle so you can write advanced Lucene queries (e.g. `meaning OR purpose`, `meaning~2` for fuzzy matches, `title:(meaning crisis)`), while the default toggles stay AND-backed.

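The toggleable exact/fuzzy/phrase behaviour maps naturally onto an Elasticsearch `bool` query. A sketch of the kind of request body the UI might send (the clause mix and field names are assumptions, not the app's actual query builder):

```python
def build_query(text, fuzzy=True, phrase=False):
    """Combine exact, fuzzy, and phrase clauses plus transcript highlighting."""
    should = [{"match": {"title": {"query": text}}}]
    if fuzzy:
        # AUTO fuzziness tolerates small typos, e.g. "meening" still matches "meaning".
        should.append({"match": {"transcript": {"query": text, "fuzziness": "AUTO"}}})
    if phrase:
        should.append({"match_phrase": {"transcript": text}})
    return {
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
        # Highlighted transcript fragments power the snippet display.
        "highlight": {"fields": {"transcript": {}}},
        # Terms aggregation backs the channel list in the sidebar.
        "aggs": {"channels": {"terms": {"field": "channel_id"}}},
    }
```

In query-string mode the `bool` block would instead be replaced with a single `query_string` clause so Lucene syntax passes through unmodified.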
## Integration Notes

- All modules share configuration through `python_app.config.CONFIG`, so you can
  fine-tune paths or credentials centrally.
- The ingest flow reuses the existing JSON schema from `data/video_metadata`, so no
  re-download is necessary if you already have the dumps.
- Everything is intentionally simple (no Celery, task queues, or custom auth) to
  keep the draft approachable and easy to extend.

Feel free to expand on this scaffold—add proper logging, schedule transcript
updates, or flesh out the UI—once you’re happy with the baseline behaviour.