AlbertSuarez/azlyrics-scraper

AZLyrics scraper

Box folder URL | Static repo website | Kaggle dataset

🎵 AZLyrics scraper for getting all the song lyrics and publishing to Box.

Python requirements

This project uses Python 3. All of the requirements below are specified in the requirements.lock file.

  1. Requests: used for retrieving the HTML content of a website.
  2. BeautifulSoup: used for parsing and scraping HTML content.
  3. Tor: used for anonymizing requests by routing them through other IPs.
  4. Stem: used for authenticating with Tor so that every request can use a different IP.
  5. Fake User-Agent: used for sending a random User-Agent with every request.
  6. Unidecode: used for transliterating non-ASCII characters in strings to plain ASCII.
  7. Box SDK: used for uploading/downloading files to/from Box Cloud Storage.
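
How Requests, BeautifulSoup and Tor fit together can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names tor_proxies and fetch_title are hypothetical, the SOCKS address 127.0.0.1:9050 is Tor's usual default and is assumed here rather than taken from the repository, and routing through SOCKS assumes Requests was installed with SOCKS support.

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Build a Requests-style proxy mapping that routes traffic through Tor.

    Assumption: Tor listens for SOCKS connections on host:port.
    The socks5h scheme asks the proxy to resolve DNS names as well.
    """
    proxy = f"socks5h://{host}:{port}"
    return {"http": proxy, "https": proxy}


def fetch_title(url, user_agent):
    """Fetch a page through Tor and return its <title> text, or None."""
    # Third-party imports are kept local so tor_proxies() works without them.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(
        url,
        headers={"User-Agent": user_agent},
        proxies=tor_proxies(),
        timeout=30,
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.title.string if soup.title else None
```

The user_agent argument is left to the caller; a random value, such as one produced by the Fake User-Agent package, could be passed in for each request.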

Recommendations

Using virtualenv is recommended to isolate the project's packages and runtime.

Usage

To run this script, please execute the following from the root directory:

  1. Set up a virtual environment

  2. Install dependencies

    pip3 install -r requirements.lock

  3. Move the JWT configuration file obtained from the Box API to data/jwt_config.json

  4. Install Tor

  5. Configure Tor IP renewal by editing the /etc/tor/torrc file

    ControlPort 9051
    CookieAuthentication 1

  6. Restart the Tor service

    sudo service tor restart

  7. Run the script

    python3 -m src
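
The torrc settings above are what allow a script to ask Tor for a new identity. A minimal sketch of that request using Stem is shown here; it is an illustration under stated assumptions, not the repository's code. The names CONTROL_PORT and renew_tor_identity are hypothetical, and the sketch assumes Tor is running locally with the ControlPort and CookieAuthentication values shown in the configuration step.

```python
CONTROL_PORT = 9051  # assumed to match the ControlPort line in /etc/tor/torrc


def renew_tor_identity(port=CONTROL_PORT):
    """Ask the local Tor process to switch to new circuits."""
    # Imported lazily so this module still loads where Stem is not installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=port) as controller:
        # Called without arguments, Stem tries the authentication methods
        # Tor offers, which includes cookie authentication when it is enabled.
        controller.authenticate()
        # NEWNYM requests fresh circuits for subsequent connections.
        controller.signal(Signal.NEWNYM)
```

Tor rate-limits NEWNYM requests, and a new circuit usually, but not necessarily, exits through a different IP, so a caller may want to pause between renewals.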

JWT configuration

To use the Box Cloud Storage API securely, this project is configured to use JWT authentication. Following the tutorial produces a configuration file, which must be placed in the data folder under the name jwt_config.json, as the __init__.py configuration file specifies:

# Box integration
BOX_CONFIG_FILE_PATH = 'data/jwt_config.json'
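
As a rough sketch of how this path might be consumed, the snippet below checks that the file exists and parses as JSON before handing it to the Box SDK. The helper names load_jwt_settings and build_box_client are hypothetical and do not come from the repository; the SDK part assumes the boxsdk package is installed with its JWT extras.

```python
import json
from pathlib import Path

BOX_CONFIG_FILE_PATH = 'data/jwt_config.json'


def load_jwt_settings(path=BOX_CONFIG_FILE_PATH):
    """Read the JWT settings file, failing early if it is missing or malformed."""
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"Box JWT configuration not found: {config_path}")
    with config_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def build_box_client(path=BOX_CONFIG_FILE_PATH):
    """Create a Box client authenticated with the JWT settings file."""
    # Imported lazily so load_jwt_settings() works without the Box SDK.
    from boxsdk import Client, JWTAuth

    load_jwt_settings(path)  # validate before passing the file to the SDK
    auth = JWTAuth.from_settings_file(str(path))
    return Client(auth)
```

Validating the file first gives a clearer error message than letting the SDK fail on a missing or truncated configuration.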

Authors

License

MIT © AZLyrics scraper