- Python 3.7 (`sudo apt-get install python3.7`)
- Pip (`sudo apt-get install python3-pip`)
- VirtualEnv (`sudo pip3 install virtualenv`)
- MongoDB with the `linkedin_companies`, `linkedin_crawlers` and `linkedin_scrapers` collections (a creation sketch follows below)
- Write permission in the app directory, to save cookies
To run the crawler and scraper at scale, you will also need a residential proxy server.
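The three collections can be created up front. Below is a minimal sketch, assuming `pymongo` is installed and MongoDB is reachable with the credentials shown later in `client_db.py`; the `linkedin` database name is a placeholder, since this README does not name one.

```python
from pymongo import MongoClient

# Placeholder database name; adjust to whatever the project actually uses.
client = MongoClient('mongodb://root:123456@127.0.0.1:80')
db = client['linkedin']

# MongoDB creates collections lazily, so this just forces them to exist.
for name in ('linkedin_companies', 'linkedin_crawlers', 'linkedin_scrapers'):
    if name not in db.list_collection_names():
        db.create_collection(name)

print(db.list_collection_names())
```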
Clone the repository and enter the project directory:

```
git clone git@github.com:robertoarruda/linkedin-public-dir-companies.git
cd ./linkedin-public-dir-companies
```
Within the project root, run the command below to create a virtual environment:

```
virtualenv venv --python=python3.7
```
Run the command below to activate it:

```
source venv/bin/activate
```
Run the command below to install the project dependencies:

```
pip install -r requirements.txt
```
Enter the database connection settings in the `client_db.py` file:

```python
class ClientDB():
    __MONGO = 'mongodb://root:123456@127.0.0.1:80'
```
Enter the host of your residential proxy server in the `main.py` file:

```python
class Main():
    __PROXIES = {
        'http': 'http://127.0.0.1:80'
    }
```
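For reference, a proxies dict of this shape is the format the `requests` library expects. The sketch below is illustrative only; this README does not confirm which HTTP client the project uses internally.

```python
import requests

# Replace with your residential proxy host.
PROXIES = {'http': 'http://127.0.0.1:80'}

# Note: requests picks the proxy by URL scheme, so with only an 'http'
# key, HTTPS traffic bypasses the proxy; add an 'https' key to cover it.
response = requests.get('http://www.linkedin.com/', proxies=PROXIES, timeout=30)
print(response.status_code)
```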
Execute the command below to run the crawler:

```
python main.py crawler
```
The crawler data is saved in the `linkedin_crawlers` collection. The crawled companies are saved in the `linkedin_companies` collection.
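To verify a crawl run, you can query those collections directly. A minimal sketch with `pymongo`; the `linkedin` database name is a placeholder:

```python
from pymongo import MongoClient

db = MongoClient('mongodb://root:123456@127.0.0.1:80')['linkedin']

print(db.linkedin_crawlers.count_documents({}))  # number of crawl records
print(db.linkedin_companies.find_one())          # peek at one crawled company
```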
Execute the command below to run the scraper:

```
python main.py scraper
```
The scraper data is saved in the `linkedin_scrapers` collection. The scraped companies are updated in the `linkedin_companies` collection.
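A similar spot check works for the scraper output. The `scraped` field below is hypothetical, since this README does not document the document schema:

```python
from pymongo import MongoClient

db = MongoClient('mongodb://root:123456@127.0.0.1:80')['linkedin']  # placeholder db name

print(db.linkedin_scrapers.count_documents({}))  # number of scrape records

# 'scraped' is a hypothetical flag the scraper might set on update.
for company in db.linkedin_companies.find({'scraped': True}).limit(5):
    print(company)
```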
Execute the command below to deactivate the virtual environment:

```
deactivate
```