YouTube_Subscription_Search

A system that enables semantic search within your personal YouTube subscriptions, so you can efficiently find the channels you want.

Demonstration

⬇️ Installation

Clone Repo:

git clone https://github.com/yuting1214/YouTube_Subscription_Search.git

Virtual Environment (Optional)

Install Dependencies:

pip install --upgrade pip setuptools wheel
pip install .

API Key

See detailed instructions

Place the file youtube_credential.json in the API_key/ folder.

🚀 Quickstart

  1. Create your personal database
youtube_subscription createdb
  2. Launch the front-end app
youtube_subscription runapp
  3. Update the database after adding new subscriptions
youtube_subscription updatedb

🎯 Purpose of the Project:

  • 📺 Channel Management

    • Streamline and enhance the experience of managing subscribed YouTube channels.
  • 🤖 Advanced Modelling Techniques

    • Leverage state-of-the-art Large Language Models (LLMs), embedding techniques, and vector storage methods to improve system functionality.
  • ⚖️ Automated Judging Mechanism

    • Introduce an LLM-based judge as an alternative to human grading. It is designed to gauge relevance to the user's intent more accurately and to reduce manual labeling effort.
  • 🔍 Information Retrieval Enhancement

    • Harness advanced IR techniques, including:
      • Specialized search algorithms tailored for rich content discovery.
      • Integration of LLMs to support project development and refine user queries.
  • 🔧 Tool Exploration and Integration:

    • LLM Exploration: Dived deep into various Large Language Models for content completion using platforms like:
    • Embedding Tools: For text embeddings, relied on both commercial solutions (OpenAI, Google Vertex AI, Cohere) and open-source solutions (SBERT).
    • Integration: Leveraged LangChain to seamlessly link and combine these tools, ensuring a smooth and integrated workflow.
  • 🌱 Future Development:

    • Work toward developing an intelligent agent that automatically prioritizes high-quality channels and videos to foster users' lifelong learning.
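
To make the embedding-based search idea in the list above concrete, here is a minimal sketch of ranking channels by cosine similarity to a query. It is not the project's implementation: the function names, the channel names, and the 3-dimensional vectors are all hypothetical stand-ins for real embeddings and a real vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_channels(query_vec, channel_vecs, k=3):
    """Return the names of the k channels most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in channel_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy vectors standing in for real channel embeddings (hypothetical data).
channel_vecs = {
    "cooking_channel": [0.9, 0.1, 0.0],
    "travel_vlog": [0.1, 0.9, 0.2],
    "makeup_tutorials": [0.0, 0.2, 0.9],
}
query_vec = [0.1, 0.8, 0.3]  # pretend embedding of a travel-vlog query

top = search_channels(query_vec, channel_vecs, k=2)
print(top)
```

In a real setup the vectors would come from an embedding model and the linear scan would typically be replaced by a vector store lookup.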

Experiment Workflow

Data was collected to assess the proposed search methods. The best-performing solutions from these experiments are adopted in the application, which runs on personal subscription data.

See more details.

Experiment results

Comparison for Relevance

| Splitting Method | Embedding Method | Avg_Rel @ 1 | Avg_Rel @ 3 |
| --- | --- | --- | --- |
| 256 | all-mpnet-base-v2 (SBERT) | 1.422 | 3.622 |
| 256 | all-MiniLM-L12-v2 (SBERT) | 1.400 | 3.533 |
| 256 | text-embedding-ada-002 (OpenAI) | 1.378 | 3.222 |
| 4500 | text-embedding-ada-002 (OpenAI) | 1.356 | 3.422 |
| 256 | textembedding-gecko-multilingual@latest (Google Vertex AI) | 1.289 | 3.400 |
| 256 | textembedding-gecko@001 (Google Vertex AI) | 1.267 | 3.333 |
| 3000 | textembedding-gecko@001 (Google Vertex AI) | 1.267 | 3.422 |
| 256 | embed-english-light-v2.0 (Cohere) | 1.222 | 3.067 |
| 256 | embed-multilingual-v2.0 (Cohere) | 1.156 | 3.200 |
  • Rubric for relevance score (See more details)
  • Avg_Rel @ k = Average relevance scores on top k retrieved channels.
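
The definition above can be sketched in a few lines. The README does not say whether the top-k scores are summed or averaged within a query. Because the k = 3 values in the table are larger than the k = 1 values, this sketch assumes they are summed per query and then averaged across queries. The function name and the sample scores are hypothetical.

```python
def avg_rel_at_k(ranked_scores_per_query, k):
    """Sum the relevance scores of the top k channels for each query,
    then average those sums over all queries (assumed definition)."""
    if not ranked_scores_per_query:
        raise ValueError("need at least one query")
    totals = [sum(scores[:k]) for scores in ranked_scores_per_query]
    return sum(totals) / len(totals)


# Hypothetical judge scores for three queries, listed in retrieval order.
judged = [
    [2, 1, 1],
    [1, 2, 0],
    [2, 2, 1],
]

print(avg_rel_at_k(judged, 1))
print(avg_rel_at_k(judged, 3))
```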

Grading examples

{
  "query": "A step-by-step makeup tutorial for achieving a natural, everyday look using drugstore products",
  "query_type": "detail_query",
  "channel": "Tati",
  "genre": "Beauty and Fashion",
  "Relevance_Score": 2,
  "Score_explanation": "The channel has a high relevance score because the majority of the video titles are directly related to the search prompt. The videos cover various aspects of makeup, including drugstore products, tutorials, product reviews, comparisons, and recommendations. This provides a comprehensive understanding of achieving a natural, everyday look using drugstore products."
}

[Figure: Score distribution for search methods]
[Figure: Average scores for different types of queries]

Query examples

# Genre: Vlogging
{
  "general_query": "daily vlogs of people's lives",
  "specific_query": "A day in the life of a travel vlogger exploring Bali",
  "detail_query":  "A daily vlog of a college student's study routine, including tips and tricks for effective studying"
}

Comparison for Accuracy (Genre)

| Splitting Method | Embedding Method | Accuracy @ 1 |
| --- | --- | --- |
| 256 | embed-english-light-v2.0 (Cohere) | 0.844 |
| 256 | all-mpnet-base-v2 (SBERT) | 0.822 |
| 256 | all-MiniLM-L12-v2 (SBERT) | 0.800 |
| 256 | textembedding-gecko@001 (Google Vertex AI) | 0.778 |
| 3000 | textembedding-gecko@001 (Google Vertex AI) | 0.778 |
| 256 | text-embedding-ada-002 (OpenAI) | 0.756 |
| 256 | textembedding-gecko-multilingual@latest (Google Vertex AI) | 0.711 |
| 4500 | text-embedding-ada-002 (OpenAI) | 0.689 |
| 256 | embed-multilingual-v2.0 (Cohere) | 0.667 |
  • There exists a ground-truth genre for each channel in the experiment.
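
As a rough illustration of how Accuracy @ 1 could be computed from those ground-truth genres, the sketch below counts a query as a hit when the top-ranked channel's genre equals the genre the query was written for. This matching rule, the function name, the dictionary keys, and the sample records are assumptions, not the project's evaluation code.

```python
def genre_accuracy_at_1(results):
    """Fraction of queries whose top-ranked channel has the query's genre."""
    if not results:
        raise ValueError("need at least one result")
    hits = sum(1 for r in results
               if r["top_channel_genre"] == r["query_genre"])
    return hits / len(results)


# Hypothetical evaluation records: one per query.
results = [
    {"query_genre": "Vlogging", "top_channel_genre": "Vlogging"},
    {"query_genre": "Beauty and Fashion", "top_channel_genre": "Beauty and Fashion"},
    {"query_genre": "Vlogging", "top_channel_genre": "Gaming"},
    {"query_genre": "Beauty and Fashion", "top_channel_genre": "Beauty and Fashion"},
]

print(genre_accuracy_at_1(results))
```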

[Figure: Genre accuracy for search methods]
[Figure: Average genre accuracy for different types of queries]

Key Insights:

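  1. Open-source SBERT embedding models outperform commercial vendors on relevance in our tests: all-mpnet-base-v2 and all-MiniLM-L12-v2 rank first and second on both Avg_Rel @ 1 and Avg_Rel @ 3. On genre accuracy they rank second and third, behind Cohere's embed-english-light-v2.0. As a result, we integrated the efficient all-MiniLM-L12-v2 model into our application.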
  2. The LLM judge provides a relevance score that aligns more closely with a user's search intent than genre accuracy alone. It also reduces manual grading, which saves time and resources.
  3. Queries of type "general_query" most closely resemble everyday usage. The tested models perform well on this query type, which suggests that the experimental results carry over to daily use.

Known Issues and Future Resolutions

  1. The LLM still occasionally returns output in an unintended format, such as non-JSON text.
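
One possible mitigation, sketched here under assumptions rather than taken from the project, is to parse the LLM output defensively: try strict JSON first, then fall back to the outermost `{...}` span in case the model wrapped the JSON in prose. The function name `parse_judge_output` and the sample strings are hypothetical.

```python
import json
import re


def parse_judge_output(raw_text):
    """Try to recover a JSON object from LLM output; return None on failure."""
    try:
        parsed = json.loads(raw_text)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, in case the JSON is wrapped in prose.
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


clean = parse_judge_output('{"Relevance_Score": 2}')
wrapped = parse_judge_output(
    'Sure! Here is the grade: {"Relevance_Score": 1} Hope this helps.'
)
broken = parse_judge_output("The channel is highly relevant.")
print(clean, wrapped, broken)
```

When the function returns `None`, a caller could retry the request or skip the item instead of crashing.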

Acknowledgements

Reference

LLM Judge:

YouTube Scraping API:

YouTube channels:
