A system that enables semantic search over your personal YouTube subscriptions, so you can quickly find the channels you are looking for.
git clone https://github.com/yuting1214/YouTube_Subscription_Search.git
Virtual Environment (Optional)
pip install --upgrade setuptools wheel
pip install .
Make sure to place the file youtube_credential.json in the API_key/ folder.
- Create your personal Database
youtube_subscription createdb
- Render front-end App
youtube_subscription runapp
- Update the database after adding new subscriptions
youtube_subscription updatedb
- 📺 Channel Management
  - Streamline and enhance the experience of managing subscribed YouTube channels.
- 🤖 Advanced Modelling Techniques
  - Leverage state-of-the-art Large Language Models (LLMs), embedding techniques, and vector storage methods to improve system functionality.
- ⚖️ Automated Judging Mechanism
  - Introduce an LLM-based judge as an alternative to human grading. The judge is designed to gauge how relevant a result is to the user's intent, which significantly reduces labor costs.
- 🔍 Information Retrieval Enhancement
  - Harness advanced IR techniques, including:
    - Specialized search algorithms tailored for rich content discovery.
    - Integration of LLMs to bolster the overall project development and refine user queries.
- 🔧 Tool Exploration and Integration:
  - LLM Exploration: Dived deep into various Large Language Models for content completion using platforms like:
  - Embedding Tools: For text embeddings, relied on both commercial and open-source solutions:
    - Commercial Vendors:
    - Open-Source Solutions:
      - SBERT (all-MiniLM-L6-v2)
  - Integration: Leveraged LangChain to seamlessly link and combine these tools, ensuring a smooth and integrated workflow.
- 🌱 Future Development:
  - Work toward developing an intelligent agent that automatically prioritizes high-quality channels and videos to foster users' life-long learning.
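
The embedding-based search idea in the feature list can be illustrated with a minimal sketch. This is not the project's implementation: a toy bag-of-words `embed` function stands in for a real embedding model such as SBERT so that the example runs with the standard library only, and the channel names and descriptions are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, channels: dict, k: int = 3) -> list:
    """Return the names of the k channels most similar to the query."""
    q = embed(query)
    ranked = sorted(
        channels,
        key=lambda name: cosine(q, embed(channels[name])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical subscription data: channel name -> text built from video titles.
channels = {
    "CookingDaily": "easy pasta recipes and weeknight dinner cooking",
    "PyTips": "python programming tutorials and coding tips",
    "TrailRunner": "trail running gear reviews and marathon training",
}

top = search("python coding tutorials", channels, k=1)
print(top)
```

In a real setup, `embed` would be replaced by a dense embedding model and the linear scan by a vector store lookup; the ranking logic stays the same.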
Data was collected to evaluate the proposed search methods. The best-performing configuration from these experiments is used in the application, which runs on your personal subscription data.
Splitting Method | Embedding Method | Avg_Rel @ 1 | Avg_Rel @ 3 |
---|---|---|---|
256 | all-mpnet-base-v2 (SBERT) | 1.422 | 3.622 |
256 | all-MiniLM-L12-v2 (SBERT) | 1.400 | 3.533 |
256 | text-embedding-ada-002 (OpenAI) | 1.378 | 3.222 |
4500 | text-embedding-ada-002 (OpenAI) | 1.356 | 3.422 |
256 | textembedding-gecko-multilingual@latest (Google Vertex AI) | 1.289 | 3.400 |
256 | textembedding-gecko@001 (Google Vertex AI) | 1.267 | 3.333 |
3000 | textembedding-gecko@001 (Google Vertex AI) | 1.267 | 3.422 |
256 | embed-english-light-v2.0 (Cohere) | 1.222 | 3.067 |
256 | embed-multilingual-v2.0 (Cohere) | 1.156 | 3.200 |
- Rubric for relevance score (See more details)
- Avg_Rel @ k = Average relevance scores on top k retrieved channels.
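
A minimal sketch of how Avg_Rel @ k could be computed is shown here. It assumes that the relevance scores of the top k channels are summed per query and then averaged across queries; the source does not state this explicitly, but it is consistent with the @ 3 values in the table being larger than the @ 1 values. The function name and the judge scores are hypothetical.

```python
def avg_rel_at_k(ranked_scores: list, k: int) -> float:
    """Average, over queries, of the summed relevance of the top-k results."""
    per_query = [sum(scores[:k]) for scores in ranked_scores]
    return sum(per_query) / len(per_query)


# Hypothetical judge scores: one list per query, ordered by retrieval rank.
judged = [
    [2, 1, 1],
    [1, 2, 0],
    [2, 2, 1],
]

rel_at_1 = avg_rel_at_k(judged, k=1)
rel_at_3 = avg_rel_at_k(judged, k=3)
print(rel_at_1, rel_at_3)
```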
Grading examples
{
"query": "A step-by-step makeup tutorial for achieving a natural, everyday look using drugstore products",
"query_type": "detail_query",
"channel": "Tati",
"genre": "Beauty and Fashion",
"Relevance_Score": 2,
"Score_explanation": "The channel has a high relevance score because the majority of the video titles are directly related to the search prompt. The videos cover various aspects of makeup, including drugstore products, tutorials, product reviews, comparisons, and recommendations. This provides a comprehensive understanding of achieving a natural, everyday look using drugstore products."
}
[Figures: score distribution for search methods; average scores for different types of queries]
Query examples
# Genre: Vlogging
{
"general_query": "daily vlogs of people's lives",
"specific_query": "A day in the life of a travel vlogger exploring Bali",
"detail_query": "A daily vlog of a college student's study routine, including tips and tricks for effective studying"
}
Splitting Method | Embedding Method | Accuracy @ 1 |
---|---|---|
256 | embed-english-light-v2.0 (Cohere) | 0.844 |
256 | all-mpnet-base-v2 (SBERT) | 0.822 |
256 | all-MiniLM-L12-v2 (SBERT) | 0.800 |
256 | textembedding-gecko@001 (Google Vertex AI) | 0.778 |
3000 | textembedding-gecko@001 (Google Vertex AI) | 0.778 |
256 | text-embedding-ada-002 (OpenAI) | 0.756 |
256 | textembedding-gecko-multilingual@latest (Google Vertex AI) | 0.711 |
4500 | text-embedding-ada-002 (OpenAI) | 0.689 |
256 | embed-multilingual-v2.0 (Cohere) | 0.667 |
- There exists a ground-truth genre for each channel in the experiment.
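
A minimal sketch of the genre accuracy metric follows. It assumes that Accuracy @ 1 is the fraction of queries whose top-ranked channel has the same ground-truth genre as the genre the query was written for; the function name and the sample genres are hypothetical.

```python
def accuracy_at_1(predicted_genres: list, true_genres: list) -> float:
    """Fraction of queries whose top-ranked channel has the intended genre."""
    if len(predicted_genres) != len(true_genres):
        raise ValueError("inputs must have the same length")
    hits = sum(p == t for p, t in zip(predicted_genres, true_genres))
    return hits / len(true_genres)


# Hypothetical data: genre of the top-1 channel vs. the genre each query targets.
top1_genres = ["Vlogging", "Gaming", "Beauty and Fashion", "Gaming"]
query_genres = ["Vlogging", "Gaming", "Beauty and Fashion", "Music"]

acc = accuracy_at_1(top1_genres, query_genres)
print(acc)
```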
[Figures: genre accuracy for search methods; average genre accuracy for different types of queries]
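- Open-source SBERT embeddings (all-mpnet-base-v2 and all-MiniLM-L12-v2) achieve the highest LLM-judged relevance in our tests, ahead of all commercial vendors, and rank second and third on genre accuracy behind Cohere's embed-english-light-v2.0. As a result, we've integrated the efficient all-MiniLM-L12-v2 model into our application.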
- The LLM judge provides a relevance score that reflects a user's search intent more closely than genre accuracy alone. It also reduces manual grading, saving both time and resources.
- Queries of type "general_query" most closely resemble everyday usage. The evaluated models perform particularly well on this query type, which suggests that the experimental results carry over to daily use.
- The LLM still sometimes returns output in an unintended format, such as non-JSON text.
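
One common way to cope with non-JSON replies is to parse defensively and fall back to retrying or skipping the item. The sketch below is an illustration, not the project's code: `parse_judge_output` is a hypothetical helper that first tries strict JSON parsing, then tries the first `{...}` span found in the reply, and returns `None` if both fail.

```python
import json
import re
from typing import Optional


def parse_judge_output(raw: str) -> Optional[dict]:
    """Try strict JSON first, then the outermost {...} span; None on failure."""
    try:
        parsed = json.loads(raw)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass
    # Fall back to the text between the first "{" and the last "}".
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


clean = parse_judge_output('{"Relevance_Score": 2}')
wrapped = parse_judge_output(
    'Sure! Here is the result:\n{"Relevance_Score": 1}\nHope it helps.'
)
broken = parse_judge_output("I could not grade this channel.")
print(clean, wrapped, broken)
```

A caller can treat a `None` result as a signal to re-prompt the judge or to exclude that query from the averages.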
- Special thanks to Hitesh Kumar Saini for youtube-search-python.