Box folder URL | Static repo website | Kaggle dataset
🎵 AZLyrics scraper for getting all the song lyrics and publishing to Box.
This project uses Python 3. The following requirements are specified in the requirements.lock file.
- Requests: used for retrieving the HTML content of a website.
- BeautifulSoup: used for parsing and scraping HTML content.
- Tor: used for making requests anonymous by routing them through other IPs.
- Stem: used for authenticating with Tor so that every request can use a different IP.
- Fake User-Agent: used for sending a random User-Agent with every request.
- Unidecode: used for cleaning non-ASCII characters out of strings.
- Box SDK: used for uploading/downloading files to/from Box Cloud Storage.
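As a rough illustration of how these libraries can fit together, the sketch below fetches a page through Tor with a random User-Agent and reduces it to plain ASCII text. It is not taken from this project's source: the function names are hypothetical, and it assumes Tor is listening on its default SOCKS port (9050) and that Requests was installed with SOCKS support.

```python
"""Sketch: fetch a page through Tor with a random User-Agent, then clean its text."""

# Assumption: Tor's default SOCKS port on localhost.
TOR_SOCKS_PORT = 9050


def tor_proxies(port=TOR_SOCKS_PORT):
    """Return a Requests-style proxy mapping that routes traffic through Tor."""
    # "socks5h" asks the proxy (Tor) to resolve host names too, so DNS does not leak.
    proxy = f"socks5h://127.0.0.1:{port}"
    return {"http": proxy, "https": proxy}


def fetch_html(url, timeout=30):
    """Download a page through Tor using a random User-Agent header."""
    # Third-party imports are local so tor_proxies() works without them installed.
    import requests  # SOCKS proxies need the requests[socks] extra
    from fake_useragent import UserAgent

    headers = {"User-Agent": UserAgent().random}
    response = requests.get(
        url, headers=headers, proxies=tor_proxies(), timeout=timeout
    )
    response.raise_for_status()
    return response.text


def html_to_ascii_text(html):
    """Strip markup with BeautifulSoup and transliterate the text to ASCII."""
    from bs4 import BeautifulSoup
    from unidecode import unidecode

    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    return unidecode(text)
```

Calling html_to_ascii_text(fetch_html(url)) would chain the two steps; the real scraper may select specific page elements instead of taking all of the text.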
Usage of virtualenv is recommended for package library / runtime isolation.
To run this script, execute the following from the root directory:
- Set up a virtual environment
- Install dependencies
  pip3 install -r requirements.lock
- Move the JWT configuration file from the Box API into the data folder
- Install Tor
- Configure Tor IP renewal by editing the /etc/tor/torrc file
  ControlPort 9051
  CookieAuthentication 1
- Restart Tor
  sudo service tor restart
- Run the script
  python3 -m src
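The Tor settings above are what allow a script to ask for a new circuit through the control port. The following is a minimal sketch of that idea, assuming the Stem library and control port 9051 from the steps above; the helper names are hypothetical and this is not necessarily how the project implements renewal.

```python
"""Sketch: check the torrc settings and request a new Tor circuit with Stem."""


def parse_torrc(torrc_text):
    """Return the options found in torrc text as {name: value}, skipping comments."""
    settings = {}
    for line in torrc_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def is_ready_for_ip_renewal(torrc_text, control_port=9051):
    """True when the control port and cookie authentication are both enabled."""
    settings = parse_torrc(torrc_text)
    return (
        settings.get("ControlPort") == str(control_port)
        and settings.get("CookieAuthentication") == "1"
    )


def renew_tor_ip(control_port=9051):
    """Ask Tor for a fresh circuit, which usually changes the exit IP."""
    # Imported here so the helpers above can be used without Stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        # With no arguments, Stem tries the methods Tor advertises (cookie here).
        controller.authenticate()
        controller.signal(Signal.NEWNYM)
```

Tor rate-limits NEWNYM signals and a new circuit does not guarantee a different exit IP, so a caller may want to wait between renewals and verify the address if that matters.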
To use the Box Cloud Storage API securely, this project is configured to use JWT authentication. After following the tutorial, you will obtain a configuration file, which must be placed in the data folder with the name jwt_config.json, as the __init__.py configuration file specifies:
# Box integration
BOX_CONFIG_FILE_PATH = 'data/jwt_config.json'
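For illustration, a client could be built from that file roughly as follows. This is a sketch under assumptions, not the project's code: the helper names are hypothetical, the key check assumes the usual layout of the JSON file exported by Box (top-level boxAppSettings and enterpriseID), and JWTAuth requires the Box SDK to be installed with its JWT extra.

```python
"""Sketch: validate the Box JWT config file and build an SDK client from it."""
import json

BOX_CONFIG_FILE_PATH = 'data/jwt_config.json'


def load_box_jwt_config(path=BOX_CONFIG_FILE_PATH):
    """Read the config file and check for the top-level keys Box normally exports."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    # Assumption: these are the top-level keys of the file downloaded from Box.
    missing = [key for key in ("boxAppSettings", "enterpriseID") if key not in config]
    if missing:
        raise ValueError(f"{path} is missing expected keys: {', '.join(missing)}")
    return config


def build_box_client(path=BOX_CONFIG_FILE_PATH):
    """Return an authenticated Box client using JWT settings from the file."""
    # Imported here so the validation helper works without the Box SDK installed.
    from boxsdk import Client, JWTAuth

    load_box_jwt_config(path)  # fail early with a readable error message
    auth = JWTAuth.from_settings_file(path)
    return Client(auth)
```

Validating the file before handing it to the SDK gives a clearer error when the file was moved to the wrong place or exported incompletely.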
MIT © AZLyrics scraper