Python Search Toolkit (Rough Draft)

This minimal Python implementation covers three core needs:

  1. Collect transcripts from YouTube channels.
  2. Ingest transcripts/metadata into Elasticsearch.
  3. Expose a simple Flask search UI that queries Elasticsearch directly.

The code lives alongside the existing C# stack so you can experiment without touching production infrastructure.

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r python_app/requirements.txt

Configure your environment as needed:

export ELASTIC_URL=http://localhost:9200
export ELASTIC_INDEX=this_little_corner_py
export ELASTIC_USERNAME=elastic          # optional
export ELASTIC_PASSWORD=secret           # optional
export ELASTIC_API_KEY=XXXX              # optional alternative auth
export ELASTIC_CA_CERT=/path/to/ca.pem   # optional, for self-signed TLS
export ELASTIC_VERIFY_CERTS=1            # set to 0 to skip verification (dev only)
export ELASTIC_DEBUG=0                   # set to 1 for verbose request/response logging
export LOCAL_DATA_DIR=./data/video_metadata  # defaults to this
export YOUTUBE_API_KEY=AIza...           # required for live collection
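
These variables feed the shared configuration object used by every module (see Integration Notes). A rough sketch of how python_app/config.py might map them into a CONFIG object, purely illustrative and likely to differ from the real module:

# Illustrative sketch only; the actual python_app/config.py may look different.
import os
from dataclasses import dataclass


def _flag(name: str, default: str = "0") -> bool:
    # Treat "1", "true", and "yes" (any case) as enabled.
    return os.getenv(name, default).lower() in ("1", "true", "yes")


@dataclass
class Config:
    elastic_url: str = os.getenv("ELASTIC_URL", "http://localhost:9200")
    elastic_index: str = os.getenv("ELASTIC_INDEX", "this_little_corner_py")
    elastic_username: str = os.getenv("ELASTIC_USERNAME", "")
    elastic_password: str = os.getenv("ELASTIC_PASSWORD", "")
    elastic_api_key: str = os.getenv("ELASTIC_API_KEY", "")
    elastic_ca_cert: str = os.getenv("ELASTIC_CA_CERT", "")
    elastic_verify_certs: bool = _flag("ELASTIC_VERIFY_CERTS", "1")
    elastic_debug: bool = _flag("ELASTIC_DEBUG", "0")
    local_data_dir: str = os.getenv("LOCAL_DATA_DIR", "./data/video_metadata")
    youtube_api_key: str = os.getenv("YOUTUBE_API_KEY", "")


CONFIG = Config()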

1. Collect Transcripts

python -m python_app.transcript_collector \
  --channel UCxxxx \
  --output data/raw \
  --max-pages 2

Each video becomes a JSON file containing metadata plus transcript segments (TranscriptSegment). Downloads require both google-api-python-client and youtube-transcript-api, as well as a valid YOUTUBE_API_KEY.
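
For orientation, a minimal sketch of fetching one video's transcript and dumping it to JSON is shown below. The field names are illustrative rather than the toolkit's actual schema, and the transcript call assumes the classic pre-1.0 youtube-transcript-api interface (newer releases expose a slightly different API):

# Illustrative only; the real collector also pulls metadata via the YouTube Data API.
import json
from pathlib import Path

from youtube_transcript_api import YouTubeTranscriptApi  # pre-1.0 style interface


def dump_transcript(video_id: str, out_dir: str = "data/raw") -> Path:
    # Returns a list of {"text", "start", "duration"} segments.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    record = {
        "video_id": video_id,          # illustrative field name
        "transcript_parts": segments,  # illustrative field name
    }
    out_path = Path(out_dir) / f"{video_id}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(record, ensure_ascii=False, indent=2))
    return out_path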

Already have cached JSON? You can skip this step and move straight to ingesting.

2. Ingest Into Elasticsearch

python -m python_app.ingest \
  --source data/video_metadata \
  --index this_little_corner_py

The script walks the source directory, builds bulk requests, and creates the index with a lightweight mapping when needed. Authentication is handled via ELASTIC_USERNAME / ELASTIC_PASSWORD if set.
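
In outline, the flow is a directory walk feeding the client's bulk helper. A rough sketch using the official elasticsearch package (8.x-style keyword arguments; the mapping's field names are assumptions, not the toolkit's actual schema):

# Rough sketch of the ingest loop, not the actual python_app.ingest code.
import json
from pathlib import Path

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def ingest(source: str, index: str, es_url: str = "http://localhost:9200") -> None:
    es = Elasticsearch(es_url)  # pass basic_auth=(...) or api_key=... as needed
    if not es.indices.exists(index=index):
        # Lightweight mapping: transcript text searchable, channel as a keyword.
        es.indices.create(index=index, mappings={
            "properties": {
                "title": {"type": "text"},
                "channel_name": {"type": "keyword"},  # assumed field name
                "transcript": {"type": "text"},       # assumed field name
            }
        })

    def actions():
        for path in Path(source).rglob("*.json"):
            doc = json.loads(path.read_text())
            yield {"_index": index, "_id": path.stem, "_source": doc}

    ok, errors = bulk(es, actions(), raise_on_error=False)
    print(f"indexed {ok} documents, {len(errors)} errors")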

3. Serve the Search Frontend

python -m python_app.search_app

Visit http://localhost:8080/ and you'll see a barebones UI that:

  • Lists channels via a terms aggregation.
  • Queries titles/descriptions/transcripts with toggleable exact, fuzzy, and phrase clauses plus optional date sorting (a sketch of the resulting query follows this list).
  • Surfaces transcript highlights.
  • Lets you pull the full transcript for any result on demand.
  • Shows a D3.js-powered timeline, stacked by channel, for each search query (with /frequency offering a standalone explorer).
  • Supports a query-string mode toggle so you can write advanced Lucene queries (e.g. meaning OR purpose, meaning~2 for fuzzy matches, title:(meaning crisis)), while the default toggles stay AND-backed.
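
For orientation, the request body those toggles translate into looks roughly like the sketch below. The field names (title, description, transcript, publish_date) are assumptions, and the real app may weight or combine clauses differently:

# Illustrative bool query combining exact, fuzzy, and phrase clauses.
FIELDS = ["title", "description", "transcript"]  # assumed field names


def build_query(text: str, exact: bool = True, fuzzy: bool = False, phrase: bool = False) -> dict:
    should = []
    if exact:
        should.append({"multi_match": {"query": text, "fields": FIELDS, "operator": "and"}})
    if fuzzy:
        should.append({"multi_match": {"query": text, "fields": FIELDS, "fuzziness": "AUTO"}})
    if phrase:
        should.append({"multi_match": {"query": text, "fields": FIELDS, "type": "phrase"}})
    return {
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
        "highlight": {"fields": {"transcript": {}}},    # transcript highlights
        "sort": [{"publish_date": {"order": "desc"}}],  # optional date sort (assumed field)
    }

In query-string mode the bool block would presumably be replaced by a single query_string clause so that Lucene syntax such as meaning~2 is passed through verbatim.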

Integration Notes

  • All modules share configuration through python_app.config.CONFIG, so you can fine-tune paths or credentials centrally.
  • The ingest flow reuses existing JSON schema from data/video_metadata, so no re-download is necessary if you already have the dumps.
  • Everything is intentionally simple (no Celery, task queues, or custom auth) to keep the draft approachable and easy to extend.

Feel free to expand on this scaffold (add proper logging, schedule transcript updates, or flesh out the UI) once you're happy with the baseline behaviour.

Run with Docker Compose (App Only; Remote ES/Qdrant)

The provided compose file builds/runs only the Flask app and expects remote Elasticsearch/Qdrant endpoints. Supply them via environment variables (directly or a .env alongside docker-compose.yml):

ELASTIC_URL=https://your-es-host:9200 \
QDRANT_URL=https://your-qdrant-host:6333 \
docker compose up --build

Other tunables (defaults shown in compose; an example .env combining them follows this list):

  • ELASTIC_INDEX (default this_little_corner_py)
  • ELASTIC_USERNAME / ELASTIC_PASSWORD or ELASTIC_API_KEY
  • ELASTIC_VERIFY_CERTS (set to 1 for real TLS verification)
  • QDRANT_COLLECTION (default tlc-captions-full)
  • QDRANT_VECTOR_NAME / QDRANT_VECTOR_SIZE / QDRANT_EMBED_MODEL
  • RATE_LIMIT_ENABLED (default 1)
  • RATE_LIMIT_REQUESTS (default 60)
  • RATE_LIMIT_WINDOW_SECONDS (default 60)
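
For example, a .env placed next to docker-compose.yml might look like this (all values are placeholders):

ELASTIC_URL=https://your-es-host:9200
ELASTIC_INDEX=this_little_corner_py
ELASTIC_API_KEY=changeme
ELASTIC_VERIFY_CERTS=1
QDRANT_URL=https://your-qdrant-host:6333
QDRANT_COLLECTION=tlc-captions-full
RATE_LIMIT_ENABLED=1
RATE_LIMIT_REQUESTS=60
RATE_LIMIT_WINDOW_SECONDS=60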

Port 8080 on the host is forwarded to the app. Mount ./data (read-only) if you want local fallbacks for metrics (LOCAL_DATA_DIR=/app/data/video_metadata); otherwise the app will rely purely on the remote backends. Stop the container with docker compose down.

CI (Docker build)

A Gitea Actions workflow (.gitea/workflows/docker-build.yml) builds and pushes the Docker image on every push to master. Configure the following repository secrets in Gitea:

  • DOCKER_USERNAME
  • DOCKER_PASSWORD

The image is tagged as gitea.ghost.tel/knight/tlc-search:latest and with the commit SHA. Adjust IMAGE_NAME in the workflow if you need a different registry/repo.
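
Once a build has been pushed you can also run the published image directly (assuming the container listens on port 8080, as in the compose setup):

docker pull gitea.ghost.tel/knight/tlc-search:latest
docker run --rm -p 8080:8080 \
  -e ELASTIC_URL=https://your-es-host:9200 \
  -e QDRANT_URL=https://your-qdrant-host:6333 \
  gitea.ghost.tel/knight/tlc-search:latest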
