Podcast Recommendation Project

Overview

This project is designed to scrape podcast data, process it, and provide recommendations based on various criteria such as categories, popularity, and embeddings. The project uses a combination of web scraping, data processing, and machine learning techniques to achieve this.

Features

Scrape podcast data from various sources.
Process and clean the data.
Generate recommendations based on different criteria.
Visualize data using Dash.

Setup

Clone the repository:

git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install the required packages:
```
pip install -r requirements.txt
```

Download vector embeddings:

pip install gdown
gdown --folder https://drive.google.com/drive/folders/1XP_eSoZ2uNW_poyMnsK9TPWHcxpbwvsg

Run the Dash application:
```
python dash_main.py
```

Dataset collection and data cleaning

Dataset collection

The dataset was collected using web scraping techniques. The primary sources of data were:

Podtail: scraped using podtail_main.py, which collects podcast names, links, and related podcasts with an asynchronous BFS scraping algorithm.
Podcast Index: queried through the PodcastIndex API to extract descriptions, RSS feed URLs, language, categories, and episode count.
iTunes: scraped to extract podcast logos for the application and average ratings.
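
The asynchronous BFS idea used for Podtail can be sketched roughly as follows. This is a minimal illustration, not the code in podtail_main.py: the names `bfs_crawl`, `fetch_related`, `max_nodes`, and `batch_size` are hypothetical, and an in-memory graph stands in for real HTTP requests so the example runs offline.

```python
import asyncio
from collections import deque


async def bfs_crawl(start, fetch_related, max_nodes=100, batch_size=10):
    """Breadth-first crawl that fetches each batch of queued links concurrently.

    `fetch_related` is a hypothetical coroutine: given a podcast link, it
    returns the links of the related podcasts found on that page.
    """
    seen = {start}
    queue = deque([start])
    order = []  # links in the order they were visited

    while queue and len(order) < max_nodes:
        # Take up to `batch_size` links from the front of the queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        # Fetch the whole batch concurrently; results keep the batch order.
        results = await asyncio.gather(*(fetch_related(link) for link in batch))
        for link, related in zip(batch, results):
            order.append(link)
            for neighbour in related:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return order[:max_nodes]


# Stand-in for real HTTP requests: a tiny in-memory "related podcasts" graph.
FAKE_GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


async def fake_fetch(link):
    await asyncio.sleep(0)  # yield control, as a network call would
    return FAKE_GRAPH.get(link, [])


crawl_order = asyncio.run(bfs_crawl("a", fake_fetch))
print(crawl_order)
```

In a real scraper, `fake_fetch` would be replaced by a coroutine that downloads and parses a page; the `seen` set is what keeps the crawl from revisiting podcasts that link to each other.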

Data cleaning

Data were cleaned primarily with pandas and other standard data science libraries. The cleaning steps are documented in the Jupyter notebooks. Podcasts were split by language, and their descriptions were indexed (wiki-embeddings) to accelerate information retrieval. The most common words were removed to better capture the semantic meaning of each podcast.
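
As an illustration of the common-word removal step, here is a small sketch using only the standard library. The tokenizer, the `top_n` cutoff, and the function names are assumptions made for this example; the notebooks may use a different tokenizer or a fixed stop-word list.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_common_words(descriptions, top_n=2):
    """Drop the `top_n` most frequent tokens across all descriptions.

    Returns the cleaned token lists and the set of dropped words.
    """
    tokenized = [tokenize(d) for d in descriptions]
    counts = Counter(tok for toks in tokenized for tok in toks)
    common = {word for word, _ in counts.most_common(top_n)}
    cleaned = [[t for t in toks if t not in common] for toks in tokenized]
    return cleaned, common


descriptions = [
    "A podcast about the history of science",
    "The podcast about the future of music",
]
cleaned, dropped = remove_common_words(descriptions, top_n=4)
print(dropped)
print(cleaned)
```

Words such as "podcast" appear in almost every description, so removing them leaves the tokens that actually distinguish one show from another.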

Recommendation system

Data limitations are the main issue: matrix decomposition cannot be performed because there is no real data about user preferences. Recommendations are therefore based on the semantic meaning of podcast descriptions and on shared categories. To address the question of recommending unpopular podcasts, a coefficient based on total episodes, number of reviews, and average rating was taken into account. SBERT embeddings combined with cosine similarity proved to be the best approach and were used in the final version.
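
The ranking idea can be sketched as below. This is a sketch under stated assumptions, not the project's actual scoring: the 3-dimensional vectors stand in for real SBERT embeddings, and the formula in `popularity_coefficient`, the `weight` parameter, and the way the two scores are blended are all hypothetical, since the exact coefficient is not given here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def popularity_coefficient(episodes, reviews, rating):
    """Hypothetical coefficient in [0, 1): grows with episode count,
    review count, and average rating (assumed to be on a 0-5 scale)."""
    volume = math.log1p(episodes) + math.log1p(reviews)
    return (rating / 5.0) * volume / (1.0 + volume)


def recommend(query_vec, catalog, top_k=3, weight=0.2):
    """Rank podcasts by a blend of semantic similarity and popularity."""
    scored = []
    for name, item in catalog.items():
        sim = cosine_similarity(query_vec, item["embedding"])
        pop = popularity_coefficient(item["episodes"], item["reviews"], item["rating"])
        scored.append((name, (1 - weight) * sim + weight * pop))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy catalog; the embeddings are made-up placeholders, not SBERT output.
catalog = {
    "health_show": {"embedding": [1.0, 0.0, 0.0], "episodes": 120, "reviews": 500, "rating": 4.8},
    "mood_show": {"embedding": [0.9, 0.1, 0.0], "episodes": 10, "reviews": 5, "rating": 4.0},
    "sports_show": {"embedding": [0.0, 1.0, 0.0], "episodes": 300, "reviews": 2000, "rating": 4.9},
}
results = recommend([1.0, 0.0, 0.0], catalog)
print([name for name, _ in results])
```

Keeping `weight` small means similarity dominates the ranking, so a very popular but unrelated podcast (here `sports_show`) still ranks below a relevant but little-known one.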

Dash application

A Dash application was used to investigate the quality of recommendations. Similar podcasts can be found by podcast name or iTunes ID. There are several modes to choose from:

Most popular (random 10 from 200 most popular)
Recommendations from SBERT embeddings
Averaged word embeddings from Wikipedia
Optimal recommendations
Recommendations based on user explanations (e.g. "A podcast about a scientific approach to improving health and mood")
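
The "Most popular" mode (10 random picks from the 200 most popular podcasts) could look roughly like this. The `popularity` field, the function name, and the `seed` parameter are assumptions for illustration; the application may define popularity differently.

```python
import random


def sample_most_popular(podcasts, pool_size=200, sample_size=10, seed=None):
    """Pick `sample_size` random podcasts from the `pool_size` most popular."""
    ranked = sorted(podcasts, key=lambda p: p["popularity"], reverse=True)
    pool = ranked[:pool_size]
    rng = random.Random(seed)  # seeded generator for reproducible picks
    return rng.sample(pool, min(sample_size, len(pool)))


# Toy catalog: 500 podcasts whose hypothetical popularity equals their index.
podcasts = [{"name": f"podcast_{i}", "popularity": i} for i in range(500)]
picks = sample_most_popular(podcasts, seed=42)
print([p["name"] for p in picks])
```

Sampling from a fixed-size pool rather than always showing the top 10 gives some variety between visits while still limiting results to well-known shows.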

Examples
