Web spider on Serverless!
Spiderless is the backend layer of KMPPP, a web-spider-as-a-service application that lets you monitor nearly anything on the web and get notified about it. It is built on top of these technologies:
Technology | Used For |
---|---|
Bulma, Buefy | UI |
Vue.js | Front-end logic |
AWS S3 | Website hosting |
AWS Lambda | Backend API |
AWS SNS | Message queue |
AWS DynamoDB | Database |
AWS API Gateway | API gateway |
AWS CloudFront | CDN |
AWS Route 53 | DNS |
Get a list of subscriptions. A single response holds at most 1 MB of data, which is the limit DynamoDB applies to one read operation.
Parameters: none
curl /api/subscriptions
[
  {
    "createdAt": 1544833435070,
    "targets": [
      {
        "selector": "#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span",
        "label": "ratingCount"
      }
    ],
    "id": "b4d98de0-ffff-11e8-a4c9-9b9ee9089058",
    "url": "https://www.imdb.com/title/tt0111161/",
    "interval": 60
  }
]
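If the subscription table ever grows past the 1 MB limit, one read will not return everything. The sketch below shows one way to collect all pages. It assumes a hypothetical `scanPage` function that takes an optional start key and resolves to an object shaped like a DynamoDB Scan response (`Items` plus an optional `LastEvaluatedKey`). It is an illustration, not how Spiderless currently behaves.

```javascript
// Collect every page of a DynamoDB-style scan.
// `scanPage` is a hypothetical async function (not part of Spiderless):
// it receives an optional start key and resolves to
// { Items: [...], LastEvaluatedKey: <key or undefined> }.
async function scanAll(scanPage) {
  const items = [];
  let startKey; // undefined on the first call
  do {
    const page = await scanPage(startKey);
    items.push(...(page.Items || []));
    // LastEvaluatedKey is present only when more data remains.
    startKey = page.LastEvaluatedKey;
  } while (startKey !== undefined);
  return items;
}
```

In a real handler, `scanPage` would wrap the DynamoDB client call and pass the start key through as `ExclusiveStartKey`.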
Create a new subscription to feed the spider.
- url (required) - URL of the target website
- targets (required) - List of objects, each with a label and a CSS selector whose text content is extracted
- interval (required) - Interval between scrapes, in minutes
curl -X POST /api/subscriptions -d '{"url":"https://www.imdb.com/title/tt0111161/","targets":"[{\"label\":\"ratingCount\",\"selector\":\"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span\"}]","interval":"60"}' -H "Content-Type: application/json"
{
  "id": "ef417d30-ffff-11e8-a4c9-9b9ee9089058",
  "url": "https://www.imdb.com/title/tt0111161/",
  "targets": [
    {
      "label": "ratingCount",
      "selector": "#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span"
    }
  ],
  "interval": 60,
  "createdAt": 1544833533059,
  "updatedAt": 1544833533059
}
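The three required fields can be checked before a subscription is stored. The following is a minimal sketch of such a check, not the validation Spiderless actually performs. The function name and error messages are made up for illustration. It accepts `targets` either as a list or as a JSON string, and `interval` either as a number or a numeric string, because the request above sends both as strings.

```javascript
// Validate a subscription payload.
// Returns a list of error messages; an empty list means the payload looks valid.
// Name, rules, and messages are illustrative assumptions, not Spiderless's own.
function validateSubscription(body) {
  const errors = [];

  // url: must parse as an absolute http(s) URL.
  let parsedUrl = null;
  try {
    parsedUrl = new URL(body.url);
  } catch (err) {
    parsedUrl = null;
  }
  if (!parsedUrl || !/^https?:$/.test(parsedUrl.protocol)) {
    errors.push('url must be an http(s) URL');
  }

  // targets: a list (or JSON-encoded list) of { label, selector } objects.
  let targets = body.targets;
  if (typeof targets === 'string') {
    try {
      targets = JSON.parse(targets);
    } catch (err) {
      targets = null;
    }
  }
  if (!Array.isArray(targets) || targets.length === 0) {
    errors.push('targets must be a non-empty list');
  } else {
    targets.forEach((target, index) => {
      const ok = target &&
        typeof target.label === 'string' && target.label !== '' &&
        typeof target.selector === 'string' && target.selector !== '';
      if (!ok) {
        errors.push(`targets[${index}] needs a label and a selector`);
      }
    });
  }

  // interval: a whole number of minutes, at least 1.
  const interval = Number(body.interval);
  if (!Number.isInteger(interval) || interval < 1) {
    errors.push('interval must be a whole number of minutes, 1 or more');
  }

  return errors;
}
```

Returning a list of messages instead of throwing lets an API handler report every problem in one response.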
Delete a subscription.
- id (required) - Subscription ID
curl -X DELETE /api/subscriptions/:id
{
  "id": "d72c05d0-ffff-11e8-a4c9-9b9ee9089058"
}
Scrape target websites and extract target contents.
yarn invoke:local scrape -d '{"createdAt":1544833435070,"updatedAt":1544833435070,"targets":[{"selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span","label":"ratingCount"}],"id":"b4d98de0-ffff-11e8-a4c9-9b9ee9089058","url":"https://www.imdb.com/title/tt0111161/","interval":60}'
[
  {
    "label": "ratingCount",
    "content": "2,025,796"
  }
]
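The mapping from a subscription's targets to the result above can be sketched as follows. This is an assumption about the shape of the step, not the actual scrape function. `queryText` is a hypothetical helper that takes a CSS selector and returns the text of the matched element, or `null` when nothing matches. In practice it would wrap an HTML parser.

```javascript
// Turn a list of targets into scrape results of the form { label, content }.
// `queryText` is a hypothetical helper supplied by the caller:
// (selector) => text of the first matching element, or null if none matches.
function extractTargets(targets, queryText) {
  return targets.map(({ label, selector }) => {
    const text = queryText(selector);
    // Keep null for missing elements so callers can tell "empty" from "absent".
    return { label, content: text == null ? null : text.trim() };
  });
}
```

Passing the lookup in as a function keeps the mapping independent of any particular parsing library.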
Fetch subscriptions from the database and select the ones that are due to be executed.
yarn invoke:local cron
None
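One way to decide which subscriptions are due is to compare the time since the last run with the interval. The sketch below assumes a hypothetical `lastScrapedAt` field (milliseconds since the epoch) that records the last run and falls back to `createdAt` when a subscription has never run. The real cron function may use a different rule.

```javascript
// Select the subscriptions that are due at time `now` (ms since the epoch).
// Assumption: `lastScrapedAt` is a hypothetical field, not one shown in the
// API responses; subscriptions without it fall back to `createdAt`.
function dueSubscriptions(subscriptions, now) {
  const MINUTE_MS = 60 * 1000;
  return subscriptions.filter((sub) => {
    const last = sub.lastScrapedAt !== undefined ? sub.lastScrapedAt : sub.createdAt;
    // `interval` is expressed in minutes.
    return now - last >= sub.interval * MINUTE_MS;
  });
}
```

Taking `now` as an argument, instead of calling `Date.now()` inside, makes the rule easy to test with fixed timestamps.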
# install dependencies
yarn install
# start api server on port 8090
yarn start
# invoke function locally
yarn invoke:local function_name
# invoke remote function
yarn invoke function_name
# first set up your AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
yarn deploy