Welcome to the KTH DD2477 Podcast Search Project!
The Podcast Search project is a Flask-based web application that helps users find relevant podcast segments for their search queries. It uses Elasticsearch to index podcast data, including transcriptions and time markers. Users can search for specific topics or keywords, retrieve the matching segments of audio content, specify the maximum duration of the returned segments, and browse the results in a web interface.
- Python 3.6 or later installed on your system
- Docker 20.0 or later installed on your system
- Git installed on your system (if cloning the repository)
- Download the dataset from spotify-podcasts-2020.
- Unzip `podcasts-transcripts` to the path defined by `podcasts_transcripts_path` in `elasticsearch.yml`.
- Unzip `metadata.tsv` to the path defined by `metadata_tsv_path` in `elasticsearch.yml`.
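Before indexing, it can help to confirm that the dataset landed where the configuration expects it. The following is a minimal sketch, not part of the project: the function name `check_dataset_layout` is hypothetical, and it assumes that `podcasts_transcripts_path` points to a directory and `metadata_tsv_path` points to a file. Pass in the values from your own `elasticsearch.yml`.

```python
from pathlib import Path


def check_dataset_layout(podcasts_transcripts_path, metadata_tsv_path):
    """Return a list of problems; an empty list means both paths look usable.

    Assumption: the transcripts path is a directory and the metadata path is a file.
    """
    problems = []
    transcripts = Path(podcasts_transcripts_path)
    metadata = Path(metadata_tsv_path)
    if not transcripts.is_dir():
        problems.append(f"transcripts directory not found: {transcripts}")
    if not metadata.is_file():
        problems.append(f"metadata file not found: {metadata}")
    return problems


if __name__ == "__main__":
    # Placeholder paths; replace them with the values from your elasticsearch.yml.
    for problem in check_dataset_layout("data/podcasts-transcripts", "data/metadata.tsv"):
        print(problem)
```

If the function returns an empty list, both paths exist in the assumed form.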
To install and run this project locally, follow these steps:
- Clone the repository:

      git clone git@github.com:IsakNordg/DD2477-Podcast-search.git
- Install dependencies:

      cd ./DD2477-Podcast-search
      pip install -r es/requirements.txt
      pip install -r backend/requirements.txt
- Set up Elasticsearch (using Docker is recommended):

  Ensure that Elasticsearch is installed and running locally or on a remote server. Docker example:

      docker network create elastic
      docker pull docker.elastic.co/elasticsearch/elasticsearch:8.13.2
      docker run --name elasticsearch --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -t docker.elastic.co/elasticsearch/elasticsearch:8.13.2

  Configure the Elasticsearch connection settings in `es/config/elasticsearch.yml`. Change the IP and port of `hosts` if you run Elasticsearch remotely. Replace the password and certificate in `es/config/.env` and `es/config/http_ca.crt` with your Elasticsearch credentials. The default path to `http_ca.crt` in Docker is `/usr/share/elasticsearch/config/certs/http_ca.crt`.
- Run the Flask application:

      python meow.py
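If the application cannot reach Elasticsearch, it may help to test the connection settings from the steps above in isolation. The sketch below uses only the Python standard library and is not taken from the project. The helper names are hypothetical, and the `elastic` username and the `https://localhost:9200` address in the usage note are assumptions based on the Docker example.

```python
import base64
import json
import ssl
import urllib.request


def build_request(host, password, username="elastic"):
    """Build a GET request for the cluster root endpoint with HTTP basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(host, headers={"Authorization": f"Basic {token}"})


def check_connection(host, password, ca_cert_path):
    """Return the version number reported by Elasticsearch.

    TLS is verified against the CA certificate at ca_cert_path.
    """
    context = ssl.create_default_context(cafile=ca_cert_path)
    request = build_request(host, password)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        info = json.load(response)
    return info["version"]["number"]
```

For example, `check_connection("https://localhost:9200", "<your password>", "es/config/http_ca.crt")` should return the server version string when the password and certificate are correct, and raise an exception otherwise.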
The repository also provides a Docker Compose file for running the application in Docker containers.
The program supports the following command-line options:
- `reindex`: Set to `true` to force re-indexing of all podcasts (default is `false`).
- `append`: Set to `false` to disable indexing new podcasts (default is `true`).
- `debug`: Set to `true` to enable debug mode for the Flask application (default is `false`).
- `limit`: Specify the maximum number of podcasts to index (default is 105360).
- `hosts`: Specify the host IP address to run the Flask application (default is "0.0.0.0").
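The options above can be mirrored with `argparse`, as in the sketch below. This README does not show how the program actually parses its options, so the `--name value` syntax, the `str_to_bool` helper, and the `build_parser` function are all assumptions made for illustration. Only the option names and defaults come from the list above.

```python
import argparse


def str_to_bool(value):
    """Interpret 'true' or 'false' (case-insensitive) as a boolean."""
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser():
    """Build a parser with the documented option names and defaults (assumed syntax)."""
    parser = argparse.ArgumentParser(description="Podcast search options (illustrative)")
    parser.add_argument("--reindex", type=str_to_bool, default=False,
                        help="force re-indexing of all podcasts")
    parser.add_argument("--append", type=str_to_bool, default=True,
                        help="index new podcasts")
    parser.add_argument("--debug", type=str_to_bool, default=False,
                        help="enable Flask debug mode")
    parser.add_argument("--limit", type=int, default=105360,
                        help="maximum number of podcasts to index")
    parser.add_argument("--hosts", default="0.0.0.0",
                        help="host IP address for the Flask application")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this sketch, `build_parser().parse_args(["--reindex", "true", "--limit", "100"])` yields `reindex=True` and `limit=100`, with the remaining options left at their defaults.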
Once the application is running, open a web browser and search at `http://localhost:5000/`:
- `Clip duration`: Define the maximum duration of the podcast clips to search for.
- `Search method`: Select the search method.
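This README does not describe how the project builds clips, so the following is only one possible way a clip-duration limit could be applied to time-marked transcript segments. The function `group_into_clips` and the `(start, end, text)` tuple layout are hypothetical. The sketch groups consecutive segments into clips and starts a new clip whenever adding a segment would exceed the limit.

```python
def group_into_clips(segments, max_duration):
    """Group time-ordered (start, end, text) segments into clips.

    Each clip spans at most max_duration seconds, except that a single
    segment longer than max_duration becomes a clip on its own.
    """
    clips = []
    current = []
    for start, end, text in segments:
        # Close the current clip if this segment would push it past the limit.
        if current and end - current[0][0] > max_duration:
            clips.append(current)
            current = []
        current.append((start, end, text))
    if current:
        clips.append(current)
    return [
        {
            "start": clip[0][0],
            "end": clip[-1][1],
            "text": " ".join(text for _, _, text in clip),
        }
        for clip in clips
    ]
```

For instance, four segments covering 0 to 80 seconds with a 60-second limit are split into two clips, the first ending where the limit would otherwise be exceeded.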
Contributions to this project are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on GitHub.
This project is licensed under the Apache 2.0 License.
Special thanks to the contributors and maintainers of Flask, Elasticsearch, and other open-source libraries used in this project.