- End-to-end Personalized Video Recommendation System driven by Natural Language Processing (NLP) and Machine Learning (ML), with recommendations based on:
- Video Content: TF-IDF analysis of video transcripts.
- Sentiment Analysis: Comment sentiment scores from a RoBERTa model (Hugging Face).
- Clustering: Unsupervised ML for grouping videos.
- Methodology:
- Constructs a Cosine Similarity Matrix (TF-IDF).
- Enhances with Sentiment and Clustering Scores for refined recommendations.
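The similarity step above can be sketched in plain Python. This is a minimal illustration, not the project's pipeline: the transcripts are made up, tokenization is a simple whitespace split, and the helper names (`tfidf_vectors`, `cosine`, `similarity_matrix`) are hypothetical. The IDF uses a common smoothed form; the notebooks may use a library vectorizer with different settings.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using smoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(docs):
    """Pairwise cosine similarity between the TF-IDF vectors of all documents."""
    vectors = tfidf_vectors(docs)
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical transcripts, not taken from the project's dataset.
transcripts = [
    "neural networks learn patterns from data",
    "deep neural networks learn from large data",
    "banks use ai to detect fraud",
]
sim = similarity_matrix(transcripts)
```

Here `sim[i][j]` is the similarity between videos `i` and `j`. The first two transcripts share most of their terms and score well above the third, which shares none with the first.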
The final score is calculated as follows:
where:
Sentiment-Driven-Video-Recommendations/
├── assets/ # Auxiliary files and resources (e.g., images for documentation)
├── data/ # Data folder
│ ├── clean/ # Data from Sentiment Analysis through PySpark
│ ├── clean_data/ # Processed data
│ ├── raw_data/ # Unprocessed, original data from YouTube API
│ └── README.md # Explanation of the data folder contents
├── docker_app/ # Docker-related files and application code
│ ├── __pycache__/ # Python cache for compiled files
│ ├── Dockerfile # Instructions for building the Docker image
│ ├── final_score_matrix.joblib # Precomputed final score matrix for recommendations
│ ├── main_app.py # Main application script
│ ├── readme.rst # Documentation for Docker app setup
│ └── requirements.txt # Python dependencies for the project
└── notebooks/ # Jupyter notebooks for analysis and modeling
    ├── 0_fetch_and_clean_data_youtube_api.ipynb # Fetch and clean data from YouTube API
    ├── 2_emotion_analysis_pyspark.ipynb # Perform sentiment analysis using PySpark
    ├── 3_clustering.ipynb # Clustering analysis for video grouping
    └── 4_tfidf_matrix_and_model_pipeline.ipynb # Build TF-IDF matrix and model pipeline
- Built a TF-IDF matrix from video transcriptions.
- Applied Cosine Similarity to identify similar videos.
- 💬 Incorporated sentiment analysis to refine similarity scores.
- 📊 Integrated video statistics (view count, like count, comment count) into the final recommendation score.
- Used DBSCAN and K-Means to cluster videos based on their features.
- Prioritized videos from the same cluster in the recommendation process.
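The steps above (similarity, sentiment, video statistics, and cluster priority) could be blended along the following lines. This is only a sketch under stated assumptions: the weights, the cluster bonus, the log scaling of engagement, and the function names are all hypothetical and are not the project's actual scoring formula. The sample values are invented.

```python
import math


def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def final_scores(similarity_row, sentiment, engagement, clusters, query_idx,
                 w_sim=0.6, w_sent=0.2, w_eng=0.2, cluster_bonus=0.1):
    """Blend the signals for one query video into a score per candidate.

    All weights and the bonus are hypothetical placeholders, not the
    project's actual values.
    """
    # Log-scale engagement counts so very popular videos do not dominate.
    eng = min_max([math.log1p(e) for e in engagement])
    sent = min_max(sentiment)
    scores = []
    for i, sim in enumerate(similarity_row):
        score = w_sim * sim + w_sent * sent[i] + w_eng * eng[i]
        # Prioritize videos from the same cluster as the query video.
        # Label -1 is treated as noise (the DBSCAN convention in scikit-learn)
        # and receives no bonus.
        if clusters[i] == clusters[query_idx] and clusters[i] != -1:
            score += cluster_bonus
        scores.append(score)
    scores[query_idx] = float("-inf")  # never recommend the query video itself
    return scores


def recommend(scores, k):
    """Return the indices of the k highest-scoring videos."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical data for four videos; video 0 is the one being watched.
similarity_row = [1.0, 0.8, 0.3, 0.75]
sentiment = [0.5, 0.9, 0.2, 0.4]        # e.g. mean comment sentiment score
engagement = [1000, 5000, 200, 4000]    # e.g. views + likes + comments
clusters = [0, 0, 1, 1]                 # cluster label per video

top = recommend(final_scores(similarity_row, sentiment, engagement, clusters, 0), k=2)
```

With these sample values, video 1 ranks first because it leads on every signal and shares the query's cluster, and video 3 ranks second on similarity and engagement alone.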
Data was collected through the YouTube API using the following search queries about artificial intelligence:
- "What is artificial intelligence?"
- "Artificial intelligence applications in healthcare"
- "AI in autonomous vehicles"
- "Machine learning vs deep learning"
- "Artificial intelligence in finance"
- "How does AI work?"
- "Top AI tools for data science"
- "Artificial intelligence in robotics"
- df_videos: Video data including view count, like count, comment count, and more.
- df_comments: Comment data with sentiment analysis and engagement metrics.
- df_channels: Channel-level data including subscriber count and total video views.
- df_categories: Categorical data related to video genres and types.
- Python, PySpark
- Natural Language Processing (NLTK, Google Translate)
- Machine Learning (RoBERTa LLM, TF-IDF, K-Means, DBSCAN, PCA)
- Docker
- FastAPI
- Google Cloud (Cloud Storage, Cloud Run)
git clone https://github.com/ivanseldas/Sentiment-Driven-Video-Recommendations.git
cd Sentiment-Driven-Video-Recommendations
- Precision@K, Recall@K, F1-Score: The system's top-K recommendations score well on these relevance metrics.
- Clustering Insights: The DBSCAN clustering effectively groups similar videos, enhancing the recommendation diversity.
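For reference, the top-K metrics named above can be computed as in the sketch below. The video IDs and the relevance set are invented for illustration, and the function name is hypothetical; how relevance labels were obtained in this project is not covered here.

```python
def precision_recall_f1_at_k(recommended, relevant, k):
    """Compute Precision@K, Recall@K and their F1 for one ranked list.

    recommended: ranked list of item IDs, best first.
    relevant:    set of item IDs considered relevant for this query.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; zero when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical ranked recommendations and ground-truth relevant videos.
recommended = ["v7", "v2", "v9", "v4", "v1"]
relevant = {"v2", "v4", "v5"}

p, r, f1 = precision_recall_f1_at_k(recommended, relevant, k=4)
```

In this example two of the top four recommendations are relevant, so Precision@4 is 2/4 and Recall@4 is 2/3. Averaging these values over many query videos gives a system-level figure.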
- Explore Supervised Learning: Implement supervised models for further improving recommendation accuracy.
- A/B Testing: Deploy the system in a real-world setting for user feedback and further refinement.
- Scalability: Optimize the system for larger datasets and real-time recommendations.
- Ivan Seldas Perulero
This project is licensed under the MIT License - see the LICENSE file for details.