Web server app built with Node.js and the hapi framework, whose purpose is to scrape company information from different sources such as linkedin.com and societe.com

nicolaspayot/company-scraper-server

company-scraper-server

Context

company-scraper-server is a web server app built with Node.js and the hapi framework. Its purpose is to scrape company information from different sources such as linkedin.com and societe.com.

Use cases

Here are some high-level design schemas describing three use cases:

Cache hit

[Schema: cache hit]

Cache miss / DB hit

[Schema: cache miss / DB hit]

Cache miss / DB miss

[Schema: cache miss / DB miss]

Disclaimer: no cache is involved in this version of company-scraper-server. The schemas show how it would ideally work. A cache could be implemented with Redis or Memcached, for example, but that seems like overkill for now.
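The lookup order in the three schemas can be sketched as follows. This is only an illustration under assumptions: `cache`, `db` and `scrape` are hypothetical stand-ins passed in by the caller, not modules from this project, and the current version has no cache tier at all.

```javascript
// Sketch of the cache -> DB -> scraper lookup order from the schemas.
// `cache` (a Map-like object), `db` ({ find, save }) and `scrape` are
// hypothetical stand-ins, not the project's actual modules.
async function findCompany(key, { cache, db, scrape }) {
  // 1. Cache hit: return the cached record immediately.
  if (cache.has(key)) {
    return { source: 'cache', company: cache.get(key) };
  }

  // 2. Cache miss / DB hit: return the stored record and warm the cache.
  const stored = await db.find(key);
  if (stored) {
    cache.set(key, stored);
    return { source: 'db', company: stored };
  }

  // 3. Cache miss / DB miss: scrape, persist, then cache the result.
  const scraped = await scrape(key);
  await db.save(key, scraped);
  cache.set(key, scraped);
  return { source: 'scraper', company: scraped };
}
```

Passing the three tiers in as arguments keeps the sketch independent of any particular cache or database client.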

Low level design

API

company-scraper-server serves a REST API implemented with hapi. It has 2 routes:

POST /api/companies/query

Returns a list of company pages (URLs) that match query, from linkedin.com and societe.com.

  • Parameters
{
  query: 'company_name'
}
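As a rough illustration, a client could build this request as shown below. The base URL is an assumption (the host and port are not stated here), as is sending the parameters as a JSON body, and `buildQueryRequest` is a hypothetical helper name.

```javascript
// Hypothetical base URL: the host and port are not documented here.
const BASE_URL = 'http://localhost:3000';

// Builds the arguments for fetch() without sending anything.
// Assumes the route accepts the parameters as a JSON payload.
function buildQueryRequest(companyName, baseUrl = BASE_URL) {
  return {
    url: `${baseUrl}/api/companies/query`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: companyName }),
    },
  };
}

// Usage (needs a running server and Node 18+ for the global fetch):
// const { url, options } = buildQueryRequest('company_name');
// const response = await fetch(url, options);
// const pages = await response.json();
```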

POST /api/company/urls

Returns company information scraped from the given company page URL(s).

  • Parameters
{
  linkedin: 'https://www.linkedin.com/company/company-name',
  societe: 'https://www.societe.com/societe/company-name'
}

At least one URL is required (linkedin, societe, or both).
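The "at least one URL" rule could be checked along these lines. This is a minimal sketch, not the server's actual validation; `isValidUrlsPayload` is a hypothetical name, and requiring an http(s) URL is an extra assumption.

```javascript
const ALLOWED_KEYS = ['linkedin', 'societe'];

// Returns true when at least one of the two keys is present and every
// provided value parses as an http(s) URL.
function isValidUrlsPayload(payload) {
  if (payload === null || typeof payload !== 'object') return false;

  const provided = ALLOWED_KEYS.filter((key) => payload[key] !== undefined);
  if (provided.length === 0) return false;

  return provided.every((key) => {
    try {
      const { protocol } = new URL(payload[key]);
      return protocol === 'http:' || protocol === 'https:';
    } catch {
      return false; // new URL() throws on strings that are not valid URLs
    }
  });
}
```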

Scraping

Scraping services use puppeteer to extract data from company pages.

Persistence

Company information is persisted in a MongoDB collection named companies. When a company whose data has already been scraped is searched for again, its data is returned from the DB instead of being scraped again. A scheduled job cleans up old companies (it should run every day at midnight).
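The cleanup logic can be sketched in plain JavaScript as below. Everything specific here is an assumption: the `scrapedAt` field name, the maximum age, and the use of an in-memory array in place of the MongoDB collection are illustrative only.

```javascript
// Keeps only the records scraped within `maxAgeMs` of `now`.
// `scrapedAt` is a hypothetical field name for the scrape timestamp.
function removeStale(companies, now, maxAgeMs) {
  const cutoff = now.getTime() - maxAgeMs;
  return companies.filter((company) => company.scrapedAt.getTime() >= cutoff);
}

// Milliseconds from `now` until the next local midnight, which a
// scheduler could use to time the first run of a daily job.
function msUntilNextMidnight(now) {
  const next = new Date(now.getFullYear(), now.getMonth(), now.getDate() + 1);
  return next.getTime() - now.getTime();
}
```

With a real collection, the same cutoff would typically be used in a delete query rather than an in-memory filter.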

Setup

⚠️ Requirements ⚠️

.env file properties

  • Copy the .env.example file and rename the copy to .env
  • Add LinkedIn credentials
  • Add MongoDB connection string URI
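A startup check for these properties might look like the sketch below. The variable names are hypothetical, since the actual keys are defined in .env.example and are not listed here.

```javascript
// Hypothetical variable names; see .env.example for the real ones.
const REQUIRED_VARS = ['LINKEDIN_EMAIL', 'LINKEDIN_PASSWORD', 'MONGODB_URI'];

// Returns the names of required variables that are unset or blank.
function missingEnvVars(env, required = REQUIRED_VARS) {
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

// Usage: fail fast before starting the server.
// const missing = missingEnvVars(process.env);
// if (missing.length > 0) {
//   throw new Error(`Missing .env properties: ${missing.join(', ')}`);
// }
```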

Usage

$ yarn install    # install dependencies
$ yarn serve      # start the dev server with nodemon
$ yarn lint       # lint files
$ yarn lint:fix   # lint and fix files
$ yarn test       # run unit tests with Jest
