[Crawler + Scraper] LinkedIn Public Directory Companies

Prerequisites

Python 3.7: sudo apt-get install python3.7
Pip: sudo apt-get install python3-pip
VirtualEnv: sudo pip3 install virtualenv
MongoDB with collections linkedin_companies, linkedin_crawlers and linkedin_scrapers
Write permission in the app directory to save cookies
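
The write-permission requirement can be checked before the first run. The sketch below is not part of the project: the helper name can_write_cookies is hypothetical, and it only assumes that cookies are saved as ordinary files inside the app directory. It uses the Python standard library alone.

```python
import os
import tempfile


def can_write_cookies(app_dir):
    """Return True if a file can be created (and removed) inside app_dir.

    Hypothetical helper, not part of the project: it creates a throwaway
    temporary file, which is deleted automatically when the block exits.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=app_dir):
            pass
        return True
    except OSError:
        # Covers a missing directory as well as a permission error.
        return False


if __name__ == "__main__":
    # Run from the project root to check the current directory.
    print(can_write_cookies(os.getcwd()))
```

If the check prints False, adjust the directory ownership or permissions before running the crawler or the scraper.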

Considerations

To run the crawler and scraper at scale, you will need a residential proxy server.

Installation

Clone the project:

git clone git@github.com:robertoarruda/linkedin-public-dir-companies.git

Enter the project directory:

cd ./linkedin-public-dir-companies

Create the environment:

Within the project root, run the command below:

virtualenv venv --python=python3.7

Activate the environment:

Run the command below to activate it:

source venv/bin/activate

Install dependencies:

Run the command below to install the project dependencies:

pip install -r requirements.txt

Configure MongoDB

Enter the database connection settings in the client_db.py file.

class ClientDB():
    __MONGO = 'mongodb://root:123456@127.0.0.1:80'
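
If the database password contains characters such as @, : or /, they must be percent-encoded before they go into the connection string. The sketch below shows one way to assemble the value; the function build_mongo_uri and the environment variable names MONGO_USER and MONGO_PASSWORD are assumptions made for illustration and do not exist in the project. Port 27017 is the MongoDB default; the example above uses 80.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host="127.0.0.1", port=27017):
    """Build a mongodb:// connection string with escaped credentials.

    Hypothetical helper: quote_plus percent-encodes reserved characters
    so that they are not mistaken for URI delimiters.
    """
    return "mongodb://{}:{}@{}:{}".format(
        quote_plus(user), quote_plus(password), host, port
    )


if __name__ == "__main__":
    # MONGO_USER and MONGO_PASSWORD are illustrative names only.
    uri = build_mongo_uri(
        os.environ.get("MONGO_USER", "root"),
        os.environ.get("MONGO_PASSWORD", "123456"),
    )
    print(uri)
```

The resulting string can then be pasted as the value of __MONGO in client_db.py.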

[Optional step] Setting a residential proxy

Enter the host of your residential proxy server in the main.py file.

class Main():
    __PROXIES = {
        'http': 'http://127.0.0.1:80'
    }
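
The mapping above only has an 'http' entry. If the project passes it to the requests library (an assumption; the HTTP client is not stated here), proxies are selected by URL scheme, so HTTPS pages would need an 'https' entry as well. The sketch below builds such a mapping from a single proxy URL; build_proxies is a hypothetical helper, not project code.

```python
from urllib.parse import urlparse


def build_proxies(proxy_url):
    """Return a scheme-keyed proxy mapping covering http and https.

    Hypothetical helper: rejects values that are not full http(s) URLs,
    since a bare host:port is a common configuration mistake.
    """
    parsed = urlparse(proxy_url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("proxy URL must look like http://host:port")
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    print(build_proxies("http://127.0.0.1:80"))
```

The returned dictionary has the same shape as __PROXIES and could replace its literal value.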

Execute the crawler:

Execute the command below to run the crawler:

python main.py crawler

The crawler data is saved in the linkedin_crawlers collection. The crawled companies are saved in the linkedin_companies collection.

Execute the scraper:

Execute the command below to run the scraper:

python main.py scraper

The scraper data is saved in the linkedin_scrapers collection. The scraped companies are updated in the linkedin_companies collection.

Turn off the environment:

Execute the command below to deactivate:

deactivate
