- Fix `dataset.pushData()` validation which would not allow anything other than plain objects.
- Fix `PuppeteerLaunchContext.stealth` throwing when used in `PuppeteerCrawler`.
After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result: Apify SDK v1. This release had two goals: stability, and support for more browsers - Firefox and WebKit (Safari).
The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in and by releasing SDK v1, we commit to only make breaking changes once a year, with a new major release.
We added support for more browsers by replacing `PuppeteerPool` with `browser-pool`, a new library that we created specifically for this purpose. It builds on the ideas from `PuppeteerPool` and extends them to support Playwright. Playwright is a browser automation library similar to Puppeteer. It works with all well-known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new `BrowserPool`.
A large breaking change is that neither `puppeteer` nor `playwright` is bundled with SDK v1. To make the choice of a library easier and installs faster, users have to install the selected modules and versions themselves. This also allows us to add support for even more libraries in the future.
Thanks to the addition of Playwright, we now have a `PlaywrightCrawler`. It is very similar to `PuppeteerCrawler` and you can pick whichever you prefer. This also meant we needed to make some interface changes. The `launchPuppeteerFunction` option of `PuppeteerCrawler` is gone and `launchPuppeteerOptions` was replaced by `launchContext`. We also moved things around in the `handlePageFunction` arguments. See the migration guide for a more detailed explanation and migration examples.
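To illustrate the `launchPuppeteerOptions` to `launchContext` change, here is a minimal sketch; the option values are illustrative, not a definitive migration:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        // v1: launchContext replaces the old launchPuppeteerOptions.
        launchContext: {
            useChrome: false, // illustrative values
            launchOptions: { headless: true },
        },
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}`);
        },
    });

    await crawler.run();
});
```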
What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well.
- BREAKING: Removed `puppeteer` from dependencies. If you want to use Puppeteer, you must install it yourself.
- BREAKING: Removed `PuppeteerPool`. Use `browser-pool`.
- BREAKING: Removed `PuppeteerCrawlerOptions.launchPuppeteerOptions`. Use `launchContext`.
- BREAKING: Removed `PuppeteerCrawlerOptions.launchPuppeteerFunction`. Use `PuppeteerCrawlerOptions.preLaunchHooks` and `postLaunchHooks`.
- BREAKING: Removed `args.autoscaledPool` and `args.puppeteerPool` from `handle(Page/Request)Function` arguments. Use `args.crawler.autoscaledPool` and `args.crawler.browserPool`.
- BREAKING: The `useSessionPool` and `persistCookiesPerSession` options of crawlers are now `true` by default. Explicitly set them to `false` to override the behavior.
- BREAKING: `Apify.launchPuppeteer()` no longer accepts `LaunchPuppeteerOptions`. It now accepts `PuppeteerLaunchContext`.
- DEPRECATED: `PuppeteerCrawlerOptions.gotoFunction`. Use `PuppeteerCrawlerOptions.preNavigationHooks` and `postNavigationHooks` (see the sketch below).
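A minimal sketch of the hook-based replacement for `gotoFunction`, assuming you only need to tweak navigation options:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        // Runs before page.goto(); gotoOptions can be mutated here.
        preNavigationHooks: [
            async (crawlingContext, gotoOptions) => {
                gotoOptions.waitUntil = 'networkidle2';
            },
        ],
        // Runs after navigation completes.
        postNavigationHooks: [
            async ({ page }) => {
                console.log(`Navigated to ${page.url()}`);
            },
        ],
        handlePageFunction: async () => {},
    });

    await crawler.run();
});
```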
- BREAKING: Removed `Apify.utils.puppeteer.enqueueLinks()`. Deprecated in 01/2019. Use `Apify.utils.enqueueLinks()`.
- BREAKING: Removed `autoscaledPool.(set|get)MaxConcurrency()`. Deprecated in 2019. Use `autoscaledPool.maxConcurrency`.
- BREAKING: Removed `CheerioCrawlerOptions.requestOptions`. Deprecated in 03/2020. Use `CheerioCrawlerOptions.prepareRequestFunction`.
- BREAKING: Removed `Launch.requestOptions`. Deprecated in 03/2020. Use `CheerioCrawlerOptions.prepareRequestFunction`.
- Added `Apify.PlaywrightCrawler` which is almost identical to `PuppeteerCrawler`, but it crawls with the `playwright` library (see the sketch after this list).
- Added `Apify.launchPlaywright(launchContext)` helper function.
- Added `browserPoolOptions` to `PuppeteerCrawler` to configure `BrowserPool`.
- Added `crawler` to `handle(Request/Page)Function` arguments.
- Added `browserController` to `handlePageFunction` arguments.
- Added `crawler.crawlingContexts` `Map` which includes all running `crawlingContext`s.
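A minimal `PlaywrightCrawler` sketch, assuming `playwright` is installed alongside `apify`; the Firefox launcher is just one illustrative choice (Chromium is the default):

```js
const Apify = require('apify');
const playwright = require('playwright');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' });

    const crawler = new Apify.PlaywrightCrawler({
        requestQueue,
        launchContext: {
            launcher: playwright.firefox, // illustrative; chromium is the default
        },
        handlePageFunction: async ({ page, request }) => {
            console.log(`${request.url}: ${await page.title()}`);
        },
    });

    await crawler.run();
});
```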
- Fix issues with `Apify.pushData()` and `keyValueStore.forEachKey()` by updating `@apify/storage-local` to `1.0.2`.
- Fix `puppeteerPool` missing in handle page arguments.
- Pinned `cheerio` to `1.0.0-rc.3` to avoid install problems in some builds.
- Increased default `maxEventLoopOverloadedRatio` in `SystemStatusOptions` to 0.6.
- Updated packages and improved docs.
This is the last major release before SDK v1.0.0. We're committed to delivering v1 at the end of 2020, so stay tuned. Besides Playwright integration via the new `BrowserPool`, it will be the first release of the SDK that we'll support for an extended period of time. We will not make any breaking changes until 2.0.0, which will come at the end of 2021. But enough about v1, let's see the changes in 0.22.0.
In this release we've changed a lot of code, but you may not even notice. We've updated the underlying `apify-client` package, which powers all communication with the Apify API, to version `1.0.0`. This means a completely new API for all internal calls. If you use `Apify.client` calls in your code, this will be a large breaking change for you. Visit the client docs to see what's new in the client, but also note that we removed the default client available under `Apify.client` and replaced it with an `Apify.newClient()` function. We think it's better to have separate clients for users and internal use.
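A minimal sketch of the new client creation; the token and dataset ID are illustrative:

```js
const Apify = require('apify');

// Before 0.22, the SDK exposed a shared client as Apify.client.
// From 0.22 on, create your own instance:
const client = Apify.newClient({ token: 'MY_APIFY_TOKEN' });

(async () => {
    const { items } = await client.dataset('my-dataset-id').listItems({ limit: 10 });
    console.log(items);
})();
```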
Until now, local emulation of Apify storages was a part of the SDK. We moved the logic into a separate package, `@apify/storage-local`, which shares its interface with `apify-client`. `RequestQueue` is now powered by `SQLite3` instead of the file system, which improves reliability and performance quite a bit. `Dataset` and `KeyValueStore` still use the file system, for easy browsing of data. The structure of the `apify_storage` folder remains unchanged.
After collecting common developer mistakes, we've decided to make argument validation stricter. You will no longer be able to pass extra arguments to functions and constructors. This is to alleviate the frustration when you mistakenly pass `useChrome` to `PuppeteerPoolOptions` instead of `LaunchPuppeteerOptions` and don't realize it. Before this version, the SDK wouldn't let you know and would silently continue with Chromium. Now it will throw an error saying that `useChrome` is not an allowed property of `PuppeteerPoolOptions`.
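A sketch of the behavior change; the exact error message comes from the SDK's validation layer and may differ:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        puppeteerPoolOptions: {
            useChrome: true, // belongs in launchPuppeteerOptions, not here
        },
        handlePageFunction: async () => {},
    });
    // Before 0.22: the extra property was silently ignored and Chromium launched anyway.
    // From 0.22 on: the constructor throws a validation error pointing at `useChrome`.
});
```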
Based on developer feedback, we decided to remove `--no-sandbox` from the default Puppeteer launch args. It will only be used on the Apify Platform. This gives you the chance to use your own sandboxing strategy.
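If you run browsers as root in a container (common in Docker), you may need to add the flag back yourself; a minimal sketch under the 0.22 options shape:

```js
const Apify = require('apify');

Apify.main(async () => {
    const browser = await Apify.launchPuppeteer({
        // Restores the pre-0.22 default; only do this in trusted environments.
        args: ['--no-sandbox'],
    });
    await browser.close();
});
```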
`LiveViewServer` and `puppeteerPoolOptions.useLiveView` were never very user-friendly or performant solutions, due to the inherent performance issues with rapidly taking many screenshots in Puppeteer. We've decided to remove them. If you need similar functionality, try the `devtools-server` NPM package, which utilizes the Chrome DevTools frontend for screen-casting a live view of the running browser.
Full list of changes:
- BREAKING: Updated `apify-client` to `1.0.0` with a completely new interface. We also removed the `Apify.client` property and replaced it with an `Apify.newClient()` function that creates a new `ApifyClient` instance.
- BREAKING: Removed `--no-sandbox` from default Puppeteer launch arguments. This will most likely be breaking for Linux and Docker users.
- BREAKING: Function argument validation is now stricter and will not accept extra parameters which are not defined by the functions' signatures.
- DEPRECATED: `puppeteerPoolOptions.useLiveView` is now deprecated. Use the `devtools-server` NPM package instead.
- Added `postResponseFunction` to `CheerioCrawlerOptions`. It allows you to override properties on the HTTP response before processing by `CheerioCrawler`.
- Added HTTP2 support to `utils.requestAsBrowser()`. Set `useHttp2` to `true` in `RequestAsBrowserOptions` to enable it (see the sketch after this list).
- Fixed handling of XML content types in `CheerioCrawler`.
- Fixed capitalization of headers when using `utils.puppeteer.addInterceptRequestHandler`.
- Fixed `utils.puppeteer.saveSnapshot()` overwriting screenshots with HTML on local.
- Updated `puppeteer` to version `5.4.1` with Chrom(ium) 87.
- Removed `RequestQueueLocal` in favor of the `@apify/storage-local` API emulator.
- Removed `KeyValueStoreLocal` in favor of the `@apify/storage-local` API emulator.
- Removed `DatasetLocal` in favor of the `@apify/storage-local` API emulator.
- Removed the `userData` option from `Apify.utils.enqueueLinks` (deprecated in Jun 2019). Use `transformRequestFunction` instead.
- Removed `instanceKillerIntervalMillis` and `killInstanceAfterMillis` (deprecated in Feb 2019). Use `instanceKillerIntervalSecs` and `killInstanceAfterSecs` instead.
- Removed the `memory` option from `Apify.call` `options` (deprecated in 2018). Use `memoryMbytes` instead.
- Removed `delete()` methods from `Dataset`, `KeyValueStore` and `RequestQueue` (deprecated in Jul 2019). Use `.drop()`.
- Removed `utils.puppeteer.hideWebDriver()` (deprecated in May 2019). Use `LaunchPuppeteerOptions.stealth`.
- Removed `utils.puppeteer.enqueueRequestsFromClickableElements()` (deprecated in 2018). Use `utils.puppeteer.enqueueLinksByClickingElements`.
- Removed `request.doNotRetry()` (deprecated in June 2019). Use `request.noRetry = true`.
- Removed `RequestListOptions.persistSourcesKey` (deprecated in Feb 2020). Use `persistRequestsKey`.
- Removed the `The function passed to Apify.main() threw an exception` error message, because it was confusing to users.
- Removed automatic injection of `charset=utf-8` in `keyValueStore.setValue()` to the `contentType` option.
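A minimal sketch of the new HTTP2 option mentioned in the list above; the URL is illustrative:

```js
const Apify = require('apify');

(async () => {
    const response = await Apify.utils.requestAsBrowser({
        url: 'https://example.com',
        useHttp2: true, // new in 0.22; off by default
    });
    console.log(response.statusCode);
})();
```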
- Technical release, see 0.22.1
- Pinned `cheerio` to `1.0.0-rc.3` to avoid install problems in some builds.
- Bump Puppeteer to 5.5.0 and Chrom(ium) 88.
- Fix various issues in `stealth`.
- Fix `SessionPool` not retiring sessions immediately when they become unusable. It fixes a problem where `PuppeteerPool` would not retire browsers with bad sessions.
- Make `PuppeteerCrawler` safe against malformed Puppeteer responses.
- Update default user agent to Chrome 86.
- Bump Puppeteer to 5.3.1 with Chromium 86
- Fix an error in `PuppeteerCrawler` caused by `page.goto()` randomly returning `null`.
It appears that `CheerioCrawler` was correctly retiring sessions on timeouts and blocked status codes (401, 403, 429), whereas `PuppeteerCrawler` did not. Apologies for the omission; this release fixes the problem.
- Fix sessions not being retired on blocked status codes in `PuppeteerCrawler`.
- Fix sessions not being marked bad on navigation timeouts in `PuppeteerCrawler`.
- Update `apify-shared` to version `0.5.0`.
This is a very minor release that fixes some issues that were preventing use of the SDK with Node 14.
- Update the request serialization process used in `RequestList` to work with Node 10+.
- Update some TypeScript types that were preventing the build due to changes in typed dependencies.
The statistics that you may remember from the logs are now persisted in the key-value store, so you won't lose count when your actor restarts. We've also added a lot of useful stats in there which can be useful to you after a run finishes. Besides that, we fixed some bugs and annoyances and improved the TypeScript experience a bit.
- Add persistence to the `Statistics` class and automatically persist it in `BasicCrawler`.
- Fix an issue where an inaccessible Apify Proxy would cause `ProxyConfiguration` to throw a timeout error.
- Update default user agent to Chrome 85.
- Bump Puppeteer to 5.2.1 which uses Chromium 85
- TypeScript: Fix `RequestAsBrowserOptions` missing some values and add `RequestQueueInfo` as a return value from `requestQueue.getInfo()`.
- Fix useless logging in `Session`.
- Fix cookies with leading dot in domain (as extracted from Puppeteer) not being correctly added to Sessions.
We fixed some bugs, improved a few things and bumped Puppeteer to match latest Chrome 84.
- Allow `Apify.createProxyConfiguration` to be used seamlessly with the proxy component of Actor Input UI.
- Fix integration of plugins into `CheerioCrawler` with the `crawler.use()` function.
- Fix a race condition which caused `RequestQueueLocal` to fail handling requests.
- Fix broken debug logging in `SessionPool`.
- Improve `ProxyConfiguration` error message for missing password / token.
- Update Puppeteer to 5.2.0.
- Improve docs, update packages and so on.
This release comes with breaking changes that will affect most, if not all of your projects. See the migration guide for more information and examples.
The first large change is a redesigned proxy configuration. `Cheerio` and `Puppeteer` crawlers now accept a `proxyConfiguration` parameter, which is an instance of `ProxyConfiguration`. This class now exclusively manages both Apify Proxy and custom proxies. Visit the new proxy management guide for more information. We also removed `Apify.utils.getRandomUserAgent()` as it was no longer effective in avoiding bot detection, and changed the default values for empty properties in `Request` instances.
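A minimal sketch of the redesigned configuration; the proxy group name is illustrative:

```js
const Apify = require('apify');

Apify.main(async () => {
    // Create a ProxyConfiguration instance (Apify Proxy here;
    // use { proxyUrls: [...] } for custom proxies).
    const proxyConfiguration = await Apify.createProxyConfiguration({
        groups: ['MY_PROXY_GROUP'], // illustrative group name
    });

    const crawler = new Apify.CheerioCrawler({
        requestList: await Apify.openRequestList('start', ['https://example.com']),
        proxyConfiguration,
        handlePageFunction: async ({ proxyInfo }) => {
            console.log(`Used proxy: ${proxyInfo.url}`);
        },
    });

    await crawler.run();
});
```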
- BREAKING: Removed `Apify.getApifyProxyUrl()`. To get an Apify Proxy URL, use `proxyConfiguration.newUrl([sessionId])`.
- BREAKING: Removed `useApifyProxy`, `apifyProxyGroups` and `apifyProxySession` parameters from all applications in the SDK. Use `proxyConfiguration` in crawlers and `proxyUrl` in `requestAsBrowser` and `Apify.launchPuppeteer`.
- BREAKING: Removed `Apify.utils.getRandomUserAgent()` as it was no longer effective in avoiding bot detection.
- BREAKING: `Request` instances no longer initialize empty properties with `null`, which means that:
  - empty `errorMessages` are now represented by `[]`, and
  - empty `loadedUrl`, `payload` and `handledAt` are `undefined`.
- Add `Apify.createProxyConfiguration()` `async` function to create `ProxyConfiguration` instances. `ProxyConfiguration` itself is not exposed.
- Add `proxyConfiguration` to `CheerioCrawlerOptions` and `PuppeteerCrawlerOptions`.
- Add `proxyInfo` to `CheerioHandlePageInputs` and `PuppeteerHandlePageInputs`. You can use this object to retrieve information about the currently used proxy in `Puppeteer` and `Cheerio` crawlers.
- Add click buttons and scroll up options to `Apify.utils.puppeteer.infiniteScroll()`.
- Fixed a bug where intercepted requests would never continue.
- Fixed a bug where `Apify.utils.requestAsBrowser()` would get into redirect loops.
- Fix `Apify.utils.getMemoryInfo()` crashing the process on AWS Lambda and on systems running in Docker without memory cgroups enabled.
- Update Puppeteer to 3.3.0.
- Add `Apify.utils.waitForRunToFinish()` which simplifies waiting for an actor run to finish.
- Add standard prefixes to log messages to improve readability and orientation in logs.
- Add support for `async` handlers in `Apify.utils.puppeteer.addInterceptRequestHandler()`.
- EXPERIMENTAL: Add `cheerioCrawler.use()` function to enable attaching a `CrawlerExtension` (a plugin that extends functionality) to the crawler to modify its behavior.
- Fix bug with cookie expiry in `SessionPool`.
- Fix issues in documentation.
- Updated `@apify/http-request` to fix an issue in the `proxy-agent` package.
- Updated Puppeteer to 3.0.2.
- DEPRECATED: `CheerioCrawlerOptions.requestOptions` is now deprecated. Please use `CheerioCrawlerOptions.prepareRequestFunction` instead.
- Add `limit` option to `Apify.utils.enqueueLinks()` for situations when full crawls are not needed.
- Add `suggestResponseEncoding` and `forceResponseEncoding` options to `CheerioCrawler` to allow users to provide a fall-back or forced encoding of responses in situations where websites serve invalid encoding information in their headers.
- Add a number of new examples to the documentation and update existing ones.
- Fix duplicate file extensions in `Apify.utils.puppeteer.saveSnapshot()` when used locally.
- Fix encoding of multi-byte characters in `CheerioCrawler`.
- Fix formatting of navigation buttons in documentation.
- Fix an error where persistence of `SessionPool` would fail if a cookie included an invalid `expires` value.
- Jumping a patch version because of an error in publishing via CI.
- BREAKING: `Apify.utils.requestAsBrowser()` no longer aborts the request on status code 406 or when a type other than `text/html` is received. Use `options.abortFunction` if you want to retain this functionality.
- BREAKING: Added `useInsecureHttpParser` option to `Apify.utils.requestAsBrowser()` which is `true` by default and forces the function to use an HTTP parser that is less strict than the default Node 12 parser, but also less secure. It is needed to be able to bypass certain anti-scraping walls and fetch websites that do not comply with the HTTP spec.
- BREAKING: `RequestList` now removes all the elements from the `sources` array on initialization. If you need to use the sources somewhere else, make a copy. This change was added as one of several measures to improve memory management of `RequestList` in scenarios with a very large number of `Request` instances.
- DEPRECATED: `RequestListOptions.persistSourcesKey` is now deprecated. Please use `RequestListOptions.persistRequestsKey`.
- `RequestList.sources` can now be an array of `string` URLs as well.
- Added `sourcesFunction` to `RequestListOptions`. It enables dynamic fetching of sources and will only be called if persisted `Requests` were not retrieved from the key-value store. Use it to reduce memory spikes and also to make sure that your sources are not re-created on actor restarts (see the sketch after this list).
- Updated `stealth` hiding of `webdriver` to avoid recent detections.
- `Apify.utils.log` now points to an updated logger instance which prints colored logs (in TTY) and supports overriding with custom loggers.
- Improved `Apify.launchPuppeteer()` code to prevent triggering bugs in Puppeteer by passing more than the required options to `puppeteer.launch()`.
- Documented the `BasicCrawler.autoscaledPool` property, and added `CheerioCrawler.autoscaledPool` and `PuppeteerCrawler.autoscaledPool` properties.
- `SessionPool` now persists state on `teardown`. Before, it only persisted state every minute. This ensures that after a crawler finishes, the state is correctly persisted.
- Added TypeScript typings and typedef documentation for all entities used throughout the SDK.
- Upgraded `proxy-chain` NPM package from 0.2.7 to 0.4.1 and many other dependencies.
- Removed all usage of the now deprecated `request` package.
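A minimal sketch of `sourcesFunction`; the URLs and persistence key are illustrative:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        persistRequestsKey: 'my-request-list', // illustrative key
        sourcesFunction: async () => {
            // Called only when persisted Requests were not found in the
            // key-value store, so sources are not re-created on restarts.
            return [
                { url: 'https://example.com/page-1' },
                { url: 'https://example.com/page-2' },
            ];
        },
    });
    await requestList.initialize();
});
```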
- BREAKING (EXPERIMENTAL): `session.checkStatus()` -> `session.retireOnBlockedStatusCodes()`. The `Session` API is no longer considered experimental.
- Updated documentation and introduced a few internal changes.
- BREAKING: The `APIFY_LOCAL_EMULATION_DIR` env var is no longer supported (deprecated on 2018-09-11). Use `APIFY_LOCAL_STORAGE_DIR` instead.
- `SessionPool` API updates and fixes. The API is no longer considered experimental.
- Logging of system info moved from `require` time to `Apify.main()` invocation.
- Use native `RegExp` instead of `xregexp` for unicode property escapes.
- Fix `SessionPool` not automatically working in `CheerioCrawler`.
- Fix incorrect management of page count in `PuppeteerPool`.
- BREAKING: `CheerioCrawler` now ignores SSL errors by default - `options.ignoreSslErrors: true`.
- Add `SessionPool` implementation to `CheerioCrawler`.
- Add `SessionPool` implementation to `PuppeteerPool` and `PuppeteerCrawler`.
- Fix `Request` constructor not making a copy of objects such as `userData` and `headers`.
- Fix `desc` option not being applied in local `dataset.getData()`.
- BREAKING: Node 8 and 9 are no longer supported. Please use Node 10.17.0 or higher.
- DEPRECATED: `Apify.callTask()` `body` and `contentType` options are now deprecated. Use `input` instead. It must be of `content-type: application/json`.
- Add default `SessionPool` implementation to `BasicCrawler`.
- Add the ability to create ad-hoc webhooks via `Apify.call()` and `Apify.callTask()`.
- Add an example of form filling with `Puppeteer`.
- Add `country` option to `Apify.getApifyProxyUrl()`.
- Add `Apify.utils.puppeteer.saveSnapshot()` helper to quickly save HTML and a screenshot of a page.
- Add the ability to pass `got` supported options to `requestOptions` in `CheerioCrawler`, thus supporting things such as `cookieJar` again.
- Switch Puppeteer to web socket again due to suspected `pipe` errors.
- Fix an issue where some encodings were not correctly parsed in `CheerioCrawler`.
- Fix parsing of bad Content-Type headers for `CheerioCrawler`.
- Fix custom headers not being correctly applied in `Apify.utils.requestAsBrowser()`.
- Fix dataset limits not being correctly applied.
- Fix a race condition in `RequestQueueLocal`.
- Fix `RequestList` persistence of downloaded sources in the key-value store.
- Fix `Apify.utils.puppeteer.blockRequests()` always including default patterns.
- Fix inconsistent behavior of `Apify.utils.puppeteer.infiniteScroll()` on some websites.
- Fix retry histogram statistics sometimes showing invalid counts.
- Added regexps for YouTube videos (`YOUTUBE_REGEX`, `YOUTUBE_REGEX_GLOBAL`) to `utils.social`.
- Added documentation for the `json` option in the `handlePageFunction` of `CheerioCrawler`.
- Bump Puppeteer to 2.0.0 and use `{ pipe: true }` again because the upstream bug has been fixed.
- Add `useIncognitoPages` option to `PuppeteerPool` to enable opening new pages in incognito browser contexts. This is useful to keep cookies and cache unique for each page.
- Added options to load every content type in `CheerioCrawler`. There are new `body` and `contentType` options in `handlePageFunction` for this purpose.
- DEPRECATED: The CheerioCrawler `html` option in `handlePageFunction` was replaced with the `body` option.
- This release updates `@apify/http-request` to version 1.1.2.
- Update `CheerioCrawler` to use `requestAsBrowser()` to better disguise itself as a real browser.
- This release just updates some dependencies (not Puppeteer).
- DEPRECATED: `dataset.delete()`, `keyValueStore.delete()` and `requestQueue.delete()` methods have been deprecated in favor of `*.drop()` methods, because the `drop` name more clearly communicates the fact that those methods drop / delete the storage itself, not individual elements in the storage (see the sketch after this list).
- Added `Apify.utils.requestAsBrowser()` helper function that enables you to make HTTP(S) requests disguised as a browser (Firefox). This may help in overcoming certain anti-scraping and anti-bot protections.
- Added `options.gotoTimeoutSecs` to `PuppeteerCrawler` to enable easier setting of navigation timeouts.
- `PuppeteerPool` options that were deprecated in the `PuppeteerCrawler` constructor were finally removed. Please use `maxOpenPagesPerInstance`, `retireInstanceAfterRequestCount`, `instanceKillerIntervalSecs`, `killInstanceAfterSecs` and `proxyUrls` via the `puppeteerPoolOptions` object.
- On the Apify Platform, a warning will now be printed when using an outdated `apify` package version.
- `Apify.utils.puppeteer.enqueueLinksByClickingElements()` will now print a warning when the nodes it tries to click become modified (detached from DOM). This is useful to debug unexpected behavior.
- `Apify.launchPuppeteer()` now accepts `proxyUrl` with the `https`, `socks4` and `socks5` schemes, as long as it doesn't contain a username or password. This is to fix Issue #420.
- Added `desiredConcurrency` option to the `AutoscaledPool` constructor, removed an unnecessary bound check from the setter property.
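A minimal sketch of the rename; the dataset name is illustrative:

```js
const Apify = require('apify');

Apify.main(async () => {
    const dataset = await Apify.openDataset('my-results'); // illustrative name
    // Before 0.16: await dataset.delete();
    await dataset.drop(); // removes the storage itself, not individual items
});
```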
- Fix error where Puppeteer would fail to launch when pipes are turned off.
- Switch back to default Web Socket transport for Puppeteer due to upstream issues.
- BREAKING CHANGE: Removed support for Web Driver (Selenium) since no further updates are planned. If you wish to continue using Web Driver, please stay on Apify SDK version ^0.14.15.
- BREAKING CHANGE: `Dataset.getData()` throws an error if the user provides an unsupported option when using local disk storage.
- DEPRECATED: `options.userData` of `Apify.utils.enqueueLinks()` is deprecated. Use `options.transformRequestFunction` instead.
- Improve logging of memory overload errors.
- Improve error message in `Apify.call()`.
- Fix multiple log lines appearing when a crawler was about to finish.
- Add `Apify.utils.puppeteer.enqueueLinksByClickingElements()` function which enables you to add requests to the queue from pure JavaScript navigations, form submissions etc.
- Add `Apify.utils.puppeteer.infiniteScroll()` function which helps you with scrolling to the bottom of websites that auto-load new content.
- The `RequestQueue.handledCount()` function has been resurrected from deprecation, in order to have a compatible interface with `RequestList`.
- Add `useExtendedUniqueKey` option to the `Request` constructor to include `method` and `payload` in the `Request`'s computed `uniqueKey`.
- Updated Puppeteer to 1.18.1.
- Updated `apify-client` to 0.5.22.
- Fixes in `RequestQueue` to deal with inconsistencies in the underlying data storage.
- BREAKING CHANGE: `RequestQueue.addRequest()` now sets the ID of the newly added request to the passed `Request` object.
- The `RequestQueue.handledCount()` function has been deprecated; please use `RequestQueue.getInfo()` instead.
- Fix error where live view would crash when started with concurrency already higher than 1.
- Fix `POST` requests in Puppeteer.
- `Snapshotter` will now log critical memory overload warnings at most once per 10 seconds.
- Live view snapshots are now made right after navigation finishes, instead of right before page close.
- Add `Statistics` class to track crawler run statistics.
- Use pipes instead of web sockets in Puppeteer to improve performance and stability.
- Add warnings to all functions using Puppeteer's request interception to inform users about its performance impact caused by automatic cache disabling.
- DEPRECATED: `Apify.utils.puppeteer.blockResources()` because of its negative impact on performance. Use `.blockRequests()` (see below).
- Add `Apify.utils.puppeteer.blockRequests()` to enable blocking URL patterns without request interception involved. This is a replacement for `.blockResources()` until the performance issues with request interception are resolved.
- Update `Puppeteer` to 1.17.0.
- Add `idempotencyKey` parameter to `Apify.addWebhook()`.
- Better logs from the `AutoscaledPool` class.
- Replace the `cpuInfo` Apify event with the new `systemInfo` event in `Snapshotter`.
- Bump `apify-client` to 0.5.17.
- Bump `apify-client` to 0.5.16.
- Stringification to JSON of actor input in `Apify.call()`, `Apify.callTask()` and `Apify.metamorph()` now also supports functions via `func.toString()`. The same holds for the record body in the `setValue()` method of the key-value store.
- The request queue now monitors the number of clients that have accessed the queue, which allows crawlers to finish without the 10s wait if the run was not migrated during its lifetime.
- Update Puppeteer to 1.15.0.
- Added the `stealth` option to `launchPuppeteerOptions`, which decreases the chance of headless browser detection.
- DEPRECATED: `Apify.utils.puppeteer.hideWebDriver`; use `launchPuppeteerOptions.stealth` instead.
- `CheerioCrawler` now parses HTML using streams. This improves performance and memory usage in most cases.
- The request queue now allows crawlers to finish quickly without waiting in case the queue was used by a single client.
- Better logging of errors in `Apify.main()`.
- Fix invalid type check in `puppeteerModule`.
- Made UI and UX improvements to `LiveViewServer` functionality.
- `launchPuppeteerOptions.puppeteerModule` now supports `Object` (pre-required modules).
- Removed the `--enable-resource-load-scheduler=false` Chromium command line flag, as it has no effect. See https://bugs.chromium.org/p/chromium/issues/detail?id=723233
- Fixed inconsistency in `prepareRequestFunction` of `CheerioCrawler`.
- Update Puppeteer to 1.14.0.
- BREAKING CHANGE: Live View is no longer available by passing `liveView = true` to `launchPuppeteerOptions`.
- A new version of Live View is available by passing the `useLiveView = true` option to `PuppeteerPool`.
  - Only shows snapshots of a single page from a single browser.
  - Only makes snapshots when a client is connected, having very low performance impact otherwise.
- Added `Apify.utils.puppeteer.addInterceptRequestHandler` and `removeInterceptRequestHandler`, which can be used to add multiple request interception handlers to Puppeteer's pages.
- Added `puppeteerModule` to `LaunchPuppeteerOptions`, which enables use of other Puppeteer modules, such as `puppeteer-extra`, instead of plain `puppeteer`.
- Fix a bug where an invalid response from `RequestQueue` would occasionally cause crawlers to crash.
- Fix `RequestQueue` throttling at high concurrency.
- Fix bug in `addWebhook` invocation.
- Fix `puppeteerPoolOptions` object not being used in `PuppeteerCrawler`.
- Fix `REQUEST_QUEUE_HEAD_MAX_LIMIT is not defined` error.
- `Snapshotter` now marks the Apify Client as overloaded on the basis of 2nd retry errors.
- Added `Apify.addWebhook()` to invoke a webhook when an actor run ends. Currently this only works on the Apify Platform and will print a warning when run locally.
- BREAKING CHANGE: Added `puppeteerOperationTimeoutSecs` option to `PuppeteerPool`. It defaults to 15 seconds and all Puppeteer operations such as `browser.newPage()` or `puppeteer.launch()` will now time out. This is to prevent hanging requests.
- BREAKING CHANGE: Added `handleRequestTimeoutSecs` option to `BasicCrawler` with a 60 second default.
- DEPRECATED: `PuppeteerPool` options in the `PuppeteerCrawler` constructor are now deprecated. Please use the new `puppeteerPoolOptions` argument of type `Object` to pass them. `launchPuppeteerFunction` and `launchPuppeteerOptions` are still available as shortcuts for convenience.
- `CheerioCrawler` and `PuppeteerCrawler` now automatically set `handleRequestTimeoutSecs` to 10 times their `handlePageTimeoutSecs`. This is a precaution that should keep requests from hanging forever.
- Added `options.prepareRequestFunction()` to the `CheerioCrawler` constructor to enable modification of `Request` before the HTTP request is made to the target URL (see the sketch after this list).
- Added back the `recycleDiskCache` option to `PuppeteerPool` now that it is supported even in headless mode (read more).
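A minimal sketch of `prepareRequestFunction`; the header value is illustrative:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = await Apify.openRequestList('start', ['https://example.com']);

    const crawler = new Apify.CheerioCrawler({
        requestList,
        prepareRequestFunction: async ({ request }) => {
            // Modify the Request before the HTTP request is made.
            request.headers = { ...request.headers, 'accept-language': 'en-US' };
            return request;
        },
        handlePageFunction: async ({ $, request }) => {
            console.log(`${request.url}: ${$('title').text()}`);
        },
    });

    await crawler.run();
});
```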
- Parameters `input` and `options` added to `Apify.callTask()`.
- Added oldest active tab focusing to `PuppeteerPool` to combat resource throttling in Chromium.
- Added `Apify.metamorph()`, see documentation for more information.
- Added `Apify.getInput()`.
- BREAKING CHANGE: Reduced default `handlePageTimeoutSecs` for both `CheerioCrawler` and `PuppeteerCrawler` from 300 to 60 seconds, in order to prevent stalling crawlers.
- BREAKING CHANGE: `PseudoUrl` now performs case-insensitive matching, even for the query string part of the URLs. If you need case-sensitive matching, use an appropriate `RegExp` in place of a Pseudo URL string.
- Upgraded to puppeteer@1.12.2 and xregexp@4.2.4.
- Added `loadedUrl` property to `Request` that contains the final URL of the loaded page after all redirects.
- Added memory overload warning log message.
- Added `keyValueStore.getPublicUrl` function.
- Added `minConcurrency`, `maxConcurrency`, `desiredConcurrency` and `currentConcurrency` properties to `AutoscaledPool`, improved docs.
- Deprecated `AutoscaledPool.setMinConcurrency` and `AutoscaledPool.setMaxConcurrency` functions.
- Updated `DEFAULT_USER_AGENT` and `USER_AGENT_LIST` with new User Agents.
- Bugfix: `LocalRequestQueue.getRequest()` threw an exception if the request was not found.
- Added `RequestQueue.getInfo()` function.
- Improved `Apify.main()` to provide nicer stack traces on errors.
- `Apify.utils.puppeteer.injectFile()` now supports injection that survives page navigations and caches file contents.
- Fix the `keyValueStore.forEachKey()` method.
- Fix version of `puppeteer` to prevent errors with automatic updates.
- Apify SDK now logs basic system info when `require`d.
- Added `utils.createRequestDebugInfo()` function to create standardized debug info from request and response.
- `PseudoUrl` can now be constructed with a `RegExp`.
- `Apify.utils.enqueueLinks()` now accepts `RegExp` instances in its `pseudoUrls` parameter.
- `Apify.utils.enqueueLinks()` now accepts a `baseUrl` option that enables resolution of relative URLs when parsing a Cheerio object. (It's done automatically in the browser when using Puppeteer.)
- Better error message for an invalid `launchPuppeteerFunction` passed to `PuppeteerPool`.
- DEPRECATION WARNING: `Apify.utils.puppeteer.enqueueLinks()` was moved to `Apify.utils.enqueueLinks()`.
- `Apify.utils.enqueueLinks()` now supports an `options.$` property to enqueue links from a Cheerio object.
- Disabled the `PuppeteerPool` `reusePages` option for now, due to a memory leak.
- Added a `keyValueStore.forEachKey()` method to iterate all keys in the store.
- Improvements in `Apify.utils.social.parseHandlesFromHtml` and `Apify.utils.htmlToText`.
- Updated docs.
- Fix `reusePages` causing Puppeteer to fail when used together with request interception.
- Fix missing `reusePages` configuration parameter in `PuppeteerCrawler`.
- Fix a memory leak where `reusePages` would prevent browsers from closing.
- Fix missing `autoscaledPool` parameter in `handlePageFunction` of `PuppeteerCrawler`.
- BREAKING CHANGE: `basicCrawler.abort()`, `cheerioCrawler.abort()` and `puppeteerCrawler.abort()` functions were removed in favor of a single `autoscaledPool.abort()` function.
- Added a reference to the running `AutoscaledPool` instance to the options object of `BasicCrawler`'s `handleRequestFunction` and to the `handlePageFunction` of `CheerioCrawler` and `PuppeteerCrawler`.
- Added a sources persistence option to `RequestList` that works best in conjunction with the state persistence, but can be toggled separately too.
- Added `Apify.openRequestList()` function to place it in line with `RequestQueue`, `KeyValueStore` and `Dataset`. A `RequestList` created using this function will automatically persist state and sources.
- Added `pool.pause()` and `pool.resume()` functions to `AutoscaledPool`. You can now pause the pool, which will prevent additional tasks from being run and wait for the running ones to finish.
- Fixed a memory leak in `CheerioCrawler` and potentially other crawlers.
- Added `Apify.utils.htmlToText()` function to convert HTML to text and removed the unnecessary `html-to-text` dependency. The new function is now used in `Apify.utils.social.parseHandlesFromHtml()`.
- Updated `DEFAULT_USER_AGENT`.
- `autoscaledPool.isFinishedFunction()` and `autoscaledPool.isTaskReadyFunction()` exceptions will now cause the `Promise` returned by `autoscaledPool.run()` to reject, instead of just logging a message. This is in line with the `autoscaledPool.runTaskFunction()` behavior.
- Bugfix: `PuppeteerPool` was incorrectly overriding `proxyUrls` even if they were not defined.
- Fixed an issue where an error would be thrown when `datasetLocal.getData()` was invoked with an overflowing offset. It now correctly returns an empty `Array`.
- Added the `reusePages` option to `PuppeteerPool`. It will now reuse existing tabs instead of opening new ones for each page when enabled.
- `BasicCrawler` (and therefore all crawlers) now logs a message explaining why it finished.
- Fixed an issue where the `maxRequestsPerCrawl` option would not be honored after restart or migration.
- Fixed an issue with timeout promises that would sometimes keep the process hanging.
- `CheerioCrawler` now accepts `gzip` and `deflate` compressed responses.
- Upgraded Puppeteer to 1.11.0
- DEPRECATION WARNING: `Apify.utils.puppeteer.enqueueLinks()` now uses an options object instead of individual parameters and supports passing of `userData` to the enqueued `request`. Previously: `enqueueLinks(page, selector, requestQueue, pseudoUrls)`. Now: `enqueueLinks({ page, selector, requestQueue, pseudoUrls, userData })`. Using individual parameters is DEPRECATED.
- Added API response tracking to `AutoscaledPool`, leveraging the `Apify.client.stats` object. The pool now considers the system overloaded when a large number of `429 Too Many Requests` responses is received.
- Updated NPM packages to fix a vulnerability reported at dominictarr/event-stream#116
- Added a warning if Node.js is an older version that doesn't support the regular expression syntax used by the tools in the `Apify.utils.social` namespace, instead of failing to start.
- Added back support for the `memory` option in `Apify.call()`; it now writes a deprecation warning instead of silently failing.
- Improvements in `Apify.utils.social` functions and tests.
- Added new `Apify.utils.social` namespace with functions to extract emails, phone numbers and social profile URLs from HTML and text documents. Specifically, it supports Twitter, LinkedIn, Instagram and Facebook profiles.
- Updated NPM dependencies.
- `Apify.launchPuppeteer()` now sets the `defaultViewport` option if not provided by the user, to improve screenshots and the debugging experience.
- Bugfix: `Dataset.getInfo()` sometimes returned an object with an `itemsCount` field instead of `itemCount`.
- Improvements in the deployment script.
- Bugfix: `Apify.call()` was causing a permissions error.
- Automatically adding the `--enable-resource-load-scheduler=false` Chrome flag in `Apify.launchPuppeteer()` to make crawling of pages in all tabs run equally fast.
- Bug fixes and improvements of internals.
- Package updates.
- Added the ability of `CheerioCrawler` to request and download only `text/html` responses.
- Added a workaround for a long-standing `tunnel-agent` package error to `CheerioCrawler`.
- Added `request.doNotRetry()` function to prevent further retries of a `request`.
- Deprecated the `request.ignoreErrors` option. Use `request.doNotRetry`.
- Fixed `Apify.utils.puppeteer.enqueueLinks` to allow a `null` value for the `pseudoUrls` param.
- Fixed `RequestQueue.addRequest()` to gracefully handle invalid URLs.
- Renamed `RequestOperationInfo` to `QueueOperationInfo`.
- Added `request` field to `QueueOperationInfo`.
- DEPRECATION WARNING: The `timeoutSecs` parameter of `Apify.call()` is now used for the actor run timeout. For the time to wait for the run to finish, use the `waitSecs` parameter.
- DEPRECATION WARNING: The `memory` parameter of `Apify.call()` was renamed to `memoryMbytes`.
- Added `Apify.callTask()` that enables starting an actor task and fetching its output.
- Added an option enforcing cloud storage to be used in `openKeyValueStore()`, `openDataset()` and `openRequestQueue()`.
- Added `autoscaledPool.setMinConcurrency()` and `autoscaledPool.setMaxConcurrency()`.
- Fix a bug in `CheerioCrawler` where `useApifyProxy` would only work with `apifyProxyGroups`.
- Reworked `request.pushErrorMessage()` to support any message and not throw.
- Added Apify Proxy (`useApifyProxy`) support to `CheerioCrawler`.
- Added custom `proxyUrls` support to `PuppeteerPool` and `CheerioCrawler`.
- Added Actor UI `pseudoUrls` output support to `Apify.utils.puppeteer.enqueueLinks()`.
- Created dedicated project page at https://sdk.apify.com
- Improved docs, guides and other texts, and pointed links to the new page.
- Bugfix in `PuppeteerPool`: pages were sometimes considered closed even though they weren't.
- Improvements in documentation.
- Upgraded Puppeteer to 1.9.0
- Added `Apify.utils.puppeteer.cacheResponses` to enable response caching in headless Chromium.
- Fixed `AutoscaledPool` terminating before all tasks are finished.
- Migrated to v0.1.0 of `apify-shared`.
- Allow `AutoscaledPool` to run tasks up to `minConcurrency` even when the system is overloaded.
- Upgraded `@apify/ps-tree` dependency (fixes "Error: spawn ps ENFILE"), upgraded other NPM packages.
- Updated documentation and README, consolidated images.
- Added CONTRIBUTING.md
- Updated documentation and README.
- Bugfixes in `RequestQueueLocal`.
- Updated documentation and README.
- Optimized autoscaled pool default configuration.
- BREAKING CHANGES IN AUTOSCALED POOL
  - It has been completely rebuilt for better performance.
  - It also now works locally.
  - See the Migration Guide for more information.
- Updated to apify-shared@0.0.58
- Bug fixes and documentation improvements.
- Upgraded Puppeteer to 1.8.0
- Upgraded NPM dependencies, fixed lint errors
- `Apify.main()` now sets the `APIFY_LOCAL_STORAGE_DIR` env var to a default value if neither `APIFY_LOCAL_STORAGE_DIR` nor `APIFY_TOKEN` is defined.
- Updated `DEFAULT_USER_AGENT` and `USER_AGENT_LIST`.
- Added `recycleDiskCache` option to `PuppeteerPool` to enable reuse of disk cache and thus speed up browsing.
- WARNING: The `APIFY_LOCAL_EMULATION_DIR` environment variable was renamed to `APIFY_LOCAL_STORAGE_DIR`.
- Environment variables `APIFY_DEFAULT_KEY_VALUE_STORE_ID`, `APIFY_DEFAULT_REQUEST_QUEUE_ID` and `APIFY_DEFAULT_DATASET_ID` now have the default value `default`, so there is no need to define them when developing locally.
- Added `compileScript()` function to `utils.puppeteer` to enable use of external scripts at runtime.
- Fixed persistent deprecation warning of `pageOpsTimeoutMillis`.
- Moved `cheerio` to dependencies.
- Fixed `keepDuplicateUrls` errors with a persistent `RequestList`.
- Added `getInfo()` method to `Dataset` to get meta-information about a dataset.
- Added `CheerioCrawler`, a specialized class for crawling the web using `cheerio`.
- Added `keepDuplicateUrls` option to `RequestList` to allow duplicate URLs.
- Added `.abort()` method to all crawler classes to enable stopping the crawl programmatically.
- Deprecated the `pageOpsTimeoutMillis` option. Use `handlePageTimeoutSecs`.
- Bluebird promises are being phased out of `apify` in favor of `async`/`await`.
- Added `log` to `Apify.utils` to improve the logging experience.
- Replaced git-hosted version of our fork of ps-tree with @apify/ps-tree package
- Removed the old unused `Apify.readyFreddy()` function.
- Improved logging of URL and port in `PuppeteerLiveViewBrowser`.
- `PuppeteerCrawler`'s default page load timeout changed from 30 to 60 seconds.
- Added `Apify.utils.puppeteer.blockResources()` function.
- More efficient implementation of the `getMemoryInfo` function.
- Upgraded NPM dependencies
- Dropped support for Node 7
- Fixed unresponsive magnifying glass and improved status tracking in LiveView frontend
- Fixed invalid URL parsing in RequestList.
- Added support for non-Latin language characters (unicode) in URLs.
- Added validation of payload size and automatic chunking to `dataset.pushData()`.
- Added support for all content types and their known extensions to `KeyValueStoreLocal`.
- Puppeteer upgraded to 1.6.0.
- Removed the `pageCloseTimeoutMillis` option from `PuppeteerCrawler` since it only affects debug logging.
- Fixed a bug where a failed `page.close()` in `PuppeteerPool` was causing the request to be retried.
- Added `memory` parameter to `Apify.call()`.
- Added `PuppeteerPool.retire(browser)` method allowing you to retire a browser before it reaches its limits. This is useful when its IP address gets blocked by anti-scraping protection.
- Added option `liveView: true` to `Apify.launchPuppeteer()` that will start a live view server providing a web page with an overview of all running Puppeteer instances and their screenshots.
- `PuppeteerPool` now kills opened Chrome instances on the `SIGINT` signal.
- Bugfix in `BasicCrawler`: native `Promise` doesn't have a `finally()` function.
- Parameter `maxRequestsPerCrawl` added to `BasicCrawler` and `PuppeteerCrawler` classes.
- Reverted back: `Apify.getApifyProxyUrl()` again accepts `session` and `groups` options instead of `apifyProxySession` and `apifyProxyGroups`.
- Parameter `memory` added to `Apify.call()`.
- `PseudoUrl` class can now contain a template for `Request` object creation, along with a `PseudoUrl.createRequest()` method.
- Added `Apify.utils.puppeteer.enqueueLinks()` function which enqueues requests created from links matching given pseudo-URLs.
- Added a 30s timeout to the `page.close()` operation in `PuppeteerCrawler`.
- Added `dataset.getData()`, `dataset.map()`, `dataset.forEach()` and `dataset.reduce()` functions.
- Added `delete()` method to `RequestQueue`, `Dataset` and `KeyValueStore` classes.
- Added `loggingIntervalMillis` option to `AutoscaledPool`.
- Bugfix: the `utils.isProduction` function was incorrect.
- Added `RequestList.length()` function.
- Bugfix in `RequestList`: skip invalid in-progress entries when restoring state.
- Added `request.ignoreErrors` option. See documentation for more info.
- Bugfix in `Apify.utils.puppeteer.injectXxx` functions.
- Puppeteer updated to v1.4.0
- Added `Apify.utils` and `Apify.utils.puppeteer` namespaces for various helper functions.
- The autoscaling feature of `AutoscaledPool`, `BasicCrawler` and `PuppeteerCrawler` is disabled on the Apify platform until all issues are resolved.
- Added `Apify.isAtHome()` function that returns `true` when the code is running on the Apify platform and `false` otherwise (for example, locally).
- Added `ignoreMainProcess` parameter to `AutoscaledPool`. Check documentation for more info.
- `pageOpsTimeoutMillis` of `PuppeteerCrawler` increased to 300 seconds.
- Parameters `session` and `groups` of `getApifyProxyUrl()` renamed to `apifyProxySession` and `apifyProxyGroups` to match the naming of the same parameters in other classes.
- `RequestQueue` now caches known requests and their state to avoid unneeded API calls.
- WARNING: The `disableProxy` configuration of `PuppeteerCrawler` and `PuppeteerPool` was removed. By default, no proxy is used. You must either use the new configuration `launchPuppeteerOptions.useApifyProxy = true` to use Apify Proxy, or provide your own proxy via `launchPuppeteerOptions.proxyUrl`.
- WARNING: The `groups` parameter of `PuppeteerCrawler` and `PuppeteerPool` was removed. Use `launchPuppeteerOptions.apifyProxyGroups` instead.
- WARNING: The `session` and `groups` parameters of `Apify.getApifyProxyUrl()` are now validated to contain only alphanumeric characters and underscores.
- `Apify.call()` now throws an `ApifyCallError` error if the run doesn't succeed.
- Renamed the `abortInstanceAfterRequestCount` option of `PuppeteerPool` and `PuppeteerCrawler` to `retireInstanceAfterRequestCount`.
- Logs are now in plain text instead of JSON for better readability.
- WARNING: `AutoscaledPool` was completely redesigned. Check documentation for reference. It still supports previous configuration parameters for backwards compatibility, but in the future compatibility will break.
- `handleFailedRequestFunction` in both `BasicCrawler` and `PuppeteerCrawler` now also has the error object available in `ops.error`.
- Request Queue storage type implemented. See documentation for more information.
- `BasicCrawler` and `PuppeteerCrawler` now support both `RequestList` and `RequestQueue`.
- `launchPuppeteer()` changes `User-Agent` only when in headless mode or if not using full Google Chrome, to reduce the chance of detection of the crawler.
- The Apify package now supports Node 7 and newer.
- `AutoscaledPool` now scales down less aggressively.
- `PuppeteerCrawler` and `BasicCrawler` now allow the `isFunction` of their underlying `AutoscaledPool` to be overridden.
- New events `persistState` and `migrating` added. Check the documentation of `Apify.events` for more information.
- `RequestList` has a new parameter, `persistStateKey`. If this is used, `RequestList` persists its state in the default key-value store at regular intervals.
- Improved `README.md` and the `/examples` directory.
- Added `useChrome` flag to `launchPuppeteer()` function.
- Bugfixes in `RequestList`.
- Removed again the `--disable-dev-shm-usage` flag when launching headless Chrome; it might be causing issues with high IO overheads.
- Upgraded Puppeteer to version 1.2.0
- Added `finishWhenEmpty` and `maybeRunPromiseIntervalMillis` options to the `AutoscaledPool` class.
- Fixed false positive errors logged by the `PuppeteerPool` class.
- Added back `--no-sandbox` to the launch of Puppeteer to avoid issues on older kernels.
- If the `APIFY_XVFB` env var is set to `1`, then avoid headless mode and use Xvfb instead.
- Updated `DEFAULT_USER_AGENT` to Linux Chrome.
- Consolidated startup options for Chrome: use `--disable-dev-shm-usage`, skip `--no-sandbox`, use `--disable-gpu` only on Windows.
- Updated docs and package description.
- Puppeteer updated to `1.1.1`.
- A lot of new stuff. Everything is backwards compatible. Check https://sdk.apify.com/ for reference
- `Apify.setPromiseDependency()` / `Apify.getPromiseDependency()` / `Apify.getPromisePrototype()` removed.
- A bunch of classes such as `AutoscaledPool` or `PuppeteerCrawler` added; check documentation.
- Renamed GitHub repo
- Changed links to Travis CI
- Changed links to Apify GitHub repo
- `Apify.pushData()` added.
- Upgraded `puppeteer` optional dependency to version `^1.0.0`.
- Initial development, a lot of new stuff.