FIX SOURCE INGESTION: DEDUPLICATE AND RELEVANCE-FILTER YOUTUBE INPUTS
The input set contains the same YouTube video twice and content unrelated to backend/AI-in-SDLC, exposing gaps in our ingestion pipeline. Add deterministic deduplication by YouTube videoId and a lightweight relevance classifier on titles/descriptions to filter off-topic items. This reduces noise before summarization and speeds editorial review.
- Cuts reviewer time and model token spend on irrelevant media.
- Improves trust in automated digests and downstream metrics.
- Compare LLM zero-shot vs. a small supervised classifier over embeddings for relevance on a labeled set.
- Evaluate exact videoId matching vs. embedding-based near-duplicate detection to catch re-uploads and playlist variants.
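The near-duplicate idea can be sketched with a cosine-similarity check over title vectors. As an assumption for illustration, a bag-of-words vector stands in for a real sentence-embedding model, and the 0.85 threshold is a placeholder to be tuned on the labeled set.

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    # Bag-of-words stand-in for an embedding model (assumption: production
    # would use learned sentence embeddings, not raw token counts).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_near_duplicate(title_a: str, title_b: str, threshold: float = 0.85) -> bool:
    """Flag likely re-uploads or playlist variants whose titles barely differ."""
    return cosine(bow_vector(title_a), bow_vector(title_b)) >= threshold
```

Exact videoId matching catches resubmitted links; a similarity pass like this is what would catch a re-upload under a new videoId with a near-identical title.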
Legacy codebase integration strategies...
- 01. Insert a pre-processing stage in the existing ETL to run in shadow mode and report precision/recall before enforcing drops.
- 02. Route uncertain items to a quarantine queue and use human feedback to retrain the classifier weekly.
Fresh architecture paradigms...
- 01. Model ingestion around canonical IDs (YouTube videoId) with content hashes and explicit source provenance in the schema.
- 02. Define SLOs for relevance precision/recall and gate deploys with automated evaluation in CI.
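Both ideas above can be sketched together: a canonical record keyed by videoId with a content hash and provenance, plus an SLO check a CI job could fail a deploy on. Field names and the 0.95/0.90 thresholds are placeholders, not the real schema or targets.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class IngestedVideo:
    # Canonical identity and provenance (field names are illustrative).
    video_id: str      # canonical YouTube videoId
    content_hash: str  # hash of title + description, for change detection
    source: str        # how the item entered the pipeline, e.g. "rss", "manual"

def make_record(video_id: str, title: str, description: str, source: str) -> IngestedVideo:
    digest = hashlib.sha256(f"{title}\n{description}".encode()).hexdigest()
    return IngestedVideo(video_id=video_id, content_hash=digest, source=source)

# Hypothetical CI gate: the deploy fails when evaluated metrics miss the SLO.
def meets_slo(precision: float, recall: float,
              min_precision: float = 0.95, min_recall: float = 0.90) -> bool:
    return precision >= min_precision and recall >= min_recall
```

Hashing title and description alongside the videoId lets the pipeline distinguish an unchanged resubmission from a genuine metadata update to the same video.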