Modular framework for building and scaling web scraping workloads over CLI, HTTP & WebSockets.
To install this in your project, make sure Node.js is installed on your workstation and run the command below:
npm install @swipefintech/scrapist --save
# or if using yarn
yarn add @swipefintech/scrapist
First, implement your scraping jobs (commands) as classes.
Extend either the ScrapeUsingBrowserCommand class or the ScrapeUsingHttpClientCommand class to create your jobs, as below.
import { IInput, IOutput, Status, ScrapeUsingHttpClientCommand, HttpClient } from '@swipefintech/scrapist'
export default class YourCommand extends ScrapeUsingHttpClientCommand {
  async handle (input: IInput, client: HttpClient): Promise<IOutput> {
    const { body } = await this.sendRequest(client, {
      // request options
    })
    return {
      data: body,
      status: Status.SUCCESS
    }
  }
}
You can persist session data (e.g., cookies) between commands automatically by using the @StoreCookies(<unique-key>) decorator.
The key that you specify in the decorator is the name of the key in your input whose value is used as a unique identifier to load and save the data.
import { ScrapeUsingHttpClientCommand, StoreCookies } from '@swipefintech/scrapist'
@StoreCookies("accountId")
export default class YourCommand extends ScrapeUsingHttpClientCommand {
  // command implementation
}
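To make the key lookup concrete, the sketch below shows one way a decorator like this could resolve its key. It is not the library's implementation: SketchInput, cookieCache, cacheKeyFor, loadCookies and saveCookies are hypothetical names, the input is assumed to be a flat object, and an in-memory object stands in for a real cache.

```typescript
// Hypothetical stand-ins; none of these names come from @swipefintech/scrapist.
type SketchInput = { [key: string]: string }
type Cookies = { [name: string]: string }

// In-memory stand-in for a real cache implementation.
const cookieCache: { [cacheKey: string]: Cookies } = {}

// Resolve the decorator key (e.g. "accountId") to the value found in the input.
function cacheKeyFor (keyName: string, input: SketchInput): string {
  const value = input[keyName]
  if (value === undefined || value === '') {
    throw new Error(`Input is missing "${keyName}", so cookies cannot be stored`)
  }
  return `cookies:${value}`
}

function loadCookies (keyName: string, input: SketchInput): Cookies {
  return cookieCache[cacheKeyFor(keyName, input)] || {}
}

function saveCookies (keyName: string, input: SketchInput, cookies: Cookies): void {
  cookieCache[cacheKeyFor(keyName, input)] = cookies
}

// Two runs with the same accountId share cookies; a different accountId does not.
saveCookies('accountId', { accountId: 'user-1' }, { sid: 'abc' })
const sameAccount = loadCookies('accountId', { accountId: 'user-1' })
const otherAccount = loadCookies('accountId', { accountId: 'user-2' })
```

The point of the sketch is that the decorator argument names a field, while the value of that field in each input decides which stored session is reused.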
You can also validate the data in your input (powered by Joi) by overriding the rules() method in your command, as below.
import Joi, { PartialSchemaMap } from 'joi'
import { ScrapeUsingHttpClientCommand } from '@swipefintech/scrapist'
export default class YourCommand extends ScrapeUsingHttpClientCommand {
  rules (): PartialSchemaMap {
    return {
      email: Joi.string().email().required(),
      password: Joi.string().required(),
      ...super.rules() // make sure to keep this
    }
  }
}
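The reason for keeping the super.rules() spread can be shown with a dependency-free sketch. It assumes the base command class defines rules of its own; the README does not show what those are, so BaseCommandSketch and its single command rule are hypothetical, and plain strings stand in for Joi schemas.

```typescript
// Plain strings stand in for Joi schemas so the sketch has no dependencies.
type RuleMap = { [field: string]: string }

// Hypothetical base class; the real base rules in scrapist are not shown in this README.
class BaseCommandSketch {
  rules (): RuleMap {
    return { command: 'string, required' }
  }
}

class WithSuper extends BaseCommandSketch {
  rules (): RuleMap {
    return {
      email: 'email, required',
      password: 'string, required',
      ...super.rules() // inherited rules are merged in
    }
  }
}

class WithoutSuper extends BaseCommandSketch {
  rules (): RuleMap {
    // Inherited rules are silently dropped here.
    return { email: 'email, required' }
  }
}

const merged = Object.keys(new WithSuper().rules()).sort()
const dropped = Object.keys(new WithoutSuper().rules()).sort()
```

Because the spread comes last in the object literal, an inherited rule wins if the subclass declares the same field name.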
For bigger projects, it is advised to organise commands into modules like below:
import { IEngine, IModule } from '@swipefintech/scrapist'
import YourCommandNo1 from './YourCommandNo1'
import YourCommandNo2 from './YourCommandNo2'
export default class YourModule implements IModule {
  register (engine: IEngine): void {
    engine.register('YourCommandNo1', new YourCommandNo1())
    engine.register('YourCommandNo2', new YourCommandNo2())
    // and so on
  }
}
Now that you have defined your commands, create an instance of the Engine class, register your commands (or mount modules) and handle the input.
import { Engine, IInput, IOutput } from '@swipefintech/scrapist'
import YourCommand1 from './YourCommand1'
import YourCommand2 from './YourCommand2'
import YourModule from './YourModule'
const engine = new Engine()
// either register commands
engine.register('YourCommand1', new YourCommand1())
engine.register('YourCommand2', new YourCommand2())
// or mount the module
engine.mount('YourModule', new YourModule())
const input: IInput = {
  command: 'YourCommand1', // or 'YourModule/YourCommand1' if using modules
  data: {
    username: 'name@example.com',
    password: 'super_secret',
  },
  externalId: 'Premium-User-123', // if using @StoreCookies(...) decorator
}
engine.handle(input)
  .then((output: IOutput) => {
    // deal with output
  })
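The 'YourModule/YourCommand1' naming can be illustrated with a toy engine. This is only a sketch of the idea, not the scrapist Engine: MiniEngine, MiniRegistry and MiniModule are hypothetical, handlers are plain synchronous functions, and prefixing at mount time is an assumption about how the namespacing might be done.

```typescript
// A toy engine, NOT the scrapist Engine: it only illustrates name resolution.
type Data = { [key: string]: string }
type Handler = (data: Data) => string

interface MiniRegistry {
  register (name: string, handler: Handler): void
}

interface MiniModule {
  register (registry: MiniRegistry): void
}

class MiniEngine implements MiniRegistry {
  private handlers: { [name: string]: Handler } = {}

  register (name: string, handler: Handler): void {
    this.handlers[name] = handler
  }

  // Assumption: mounting prefixes every command the module registers.
  mount (prefix: string, mod: MiniModule): void {
    mod.register({
      register: (name, handler) => this.register(`${prefix}/${name}`, handler)
    })
  }

  handle (command: string, data: Data): string {
    const handler = this.handlers[command]
    if (!handler) {
      throw new Error(`Unknown command: ${command}`)
    }
    return handler(data)
  }
}

const mini = new MiniEngine()
mini.register('Ping', () => 'pong')
mini.mount('YourModule', {
  register (registry) {
    registry.register('Greet', (data) => `hello ${data['name']}`)
  }
})

const direct = mini.handle('Ping', {}) // 'pong'
const namespaced = mini.handle('YourModule/Greet', { name: 'world' }) // 'hello world'
```

Handing the module a wrapping registry keeps the module unaware of where it is mounted, so the same module could be mounted under a different prefix.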
If you are using the @StoreCookies(<unique-key>) decorator, you also need to provide a Cache implementation (from cache-manager) when creating the Engine object, as below.
import path from 'path'
import { caching } from 'cache-manager'
import store from 'cache-manager-fs-hash'
import { Engine } from '@swipefintech/scrapist'
// create a file-system (or any other) cache
const cache = caching({
  store,
  options: {
    path: path.join(__dirname, 'cache'),
    subdirs: true
  }
})
const engine = new Engine(cache)
This project also includes sample frontends that show how to use scrapist via CLI, HTTP (using Express) and WebSockets (using ws).
Clone this repository and follow the instructions below to test the sample apps on your local workstation.
To run the command-line sample, run the commands below inside the cloned folder:
npm run start:cli -- \
ExampleDotCom/GetHomePageLinkUsingBrowser \
--session=Premium-User-123
npm run start:cli -- \
ExampleDotCom/GetHomePageLinkUsingHttpClient \
--referer=https://example.com/
To run the web (API) sample, run the commands below inside the cloned folder:
# start development server
npm run start:web
# run test commands (in another terminal)
curl http://localhost:3000/ExampleDotCom/GetHomePageLinkUsingBrowser \
-H "Content-Type: application/json" \
-d '{"session": "Premium-User-123"}'
curl http://localhost:3000/ExampleDotCom/GetHomePageLinkUsingHttpClient \
-H "Content-Type: application/json" \
-d '{"referer": "https://example.com/"}'
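The same requests can be built from TypeScript. The sketch below mirrors the curl calls above and assumes only what they show: the sample server listens on localhost:3000, the command name is the URL path, and the body is JSON sent with POST (which is what curl -d implies). buildCommandRequest and CommandRequest are hypothetical names, not part of scrapist.

```typescript
// Hypothetical helper; mirrors the curl calls above (curl -d implies POST).
interface CommandRequest {
  url: string
  method: 'POST'
  headers: { [name: string]: string }
  body: string
}

function buildCommandRequest (
  baseUrl: string,
  command: string,
  data: { [key: string]: string }
): CommandRequest {
  return {
    // The command path keeps its slash, e.g. ExampleDotCom/GetHomePageLinkUsingHttpClient.
    url: `${baseUrl.replace(/\/+$/, '')}/${command}`,
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(data)
  }
}

const request = buildCommandRequest(
  'http://localhost:3000/',
  'ExampleDotCom/GetHomePageLinkUsingHttpClient',
  { referer: 'https://example.com/' }
)
// With the sample server running, send it from Node 18+ or a browser:
// fetch(request.url, request).then((res) => res.text()).then(console.log)
```

Keeping request construction separate from sending makes it possible to check the URL and payload without a running server.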
To run the WebSocket sample, run the commands below inside the cloned folder:
# start development server
npm run start:ws
# connect to web-socket server
npx wscat -c ws://localhost:3000/
# run test commands
{"command": "ExampleDotCom/GetHomePageLinkUsingBrowser", "session": "Premium-User-123"}
{"command": "ExampleDotCom/GetHomePageLinkUsingHttpClient", "referer": "https://example.com/"}
Please see the LICENSE file.