Krawler is a fully configurable, asynchronous HTML crawler written in Kotlin (JVM). Powered by Coroutines, Kotlin Serialization (JSON), Ktor Client, Exposed, and SQLite (via SQLite JDBC), Krawler makes it easy to scrape HTML webpages.
- **Asynchronous Processing**: Utilizing Kotlin's coroutines, Krawler is designed for high-performance, concurrent web crawling (see the sketch after this list).
- **Configurability**: Krawler is highly customizable through the `krawler_config.json` file, placed at the project path.
- **Extensive Logging**: Verbose logs can be enabled via the configuration file.
- **Persisting Errors**: Errors during the crawling process are stored in the `CrawlErrors` table (with the necessary metadata) and printed to the standard output.
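As a rough illustration of the coroutine-based approach (not Krawler's internal crawl loop), the sketch below fetches a handful of pages concurrently with Ktor Client and coroutines. The URLs and client settings are placeholders.

```kotlin
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import kotlinx.coroutines.*

fun main() = runBlocking {
    // Placeholder URLs; Krawler itself starts from the configured seeds.
    val urls = listOf(
        "https://en.wikipedia.org/wiki/NASA",
        "https://en.wikipedia.org/wiki/Apollo_program",
    )
    val client = HttpClient(CIO)
    // Launch one coroutine per URL and await all results concurrently.
    val pages = urls.map { url -> async { url to client.get(url).bodyAsText() } }.awaitAll()
    pages.forEach { (url, html) -> println("$url -> ${html.length} chars") }
    client.close()
}
```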
Krawler uses the following tables to persist data:
```kotlin
import org.jetbrains.exposed.dao.id.IntIdTable

object CrawlActivities : IntIdTable() {
    val sessionId = varchar("sessionId", 100)
    val atEpochSeconds = long("atEpochSeconds")
    val type = varchar("type", 50)
}

object CrawlErrors : IntIdTable() {
    val sessionId = varchar("sessionId", 100)
    val atEpochSeconds = long("atEpochSeconds")
    val url = text("url")
    val error = text("error")
}

object CrawlingStates : IntIdTable() {
    val sessionId = varchar("sessionId", 100)
    val url = text("url")
    val depth = integer("depth")
    val priority = long("priority")
}

object Webpages : IntIdTable() {
    val sessionId = varchar("sessionId", 100)
    val atEpochSeconds = long("atEpochSeconds")
    val url = text("url")
    val html = text("html")
}
```
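Once a crawl has finished, the persisted pages can be read back with Exposed. This is a minimal sketch; the SQLite file name (`krawler.db`) is an assumption, so point it at whichever database file your run actually produces.

```kotlin
import org.jetbrains.exposed.sql.Database
import org.jetbrains.exposed.sql.selectAll
import org.jetbrains.exposed.sql.transactions.transaction

fun main() {
    // The database file name is an assumption; use the SQLite file your Krawler run writes to.
    Database.connect("jdbc:sqlite:krawler.db", driver = "org.sqlite.JDBC")
    transaction {
        Webpages.selectAll().limit(10).forEach { row ->
            println("${row[Webpages.url]} (${row[Webpages.html].length} chars)")
        }
    }
}
```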
Krawler is highly customizable through the `krawler_config.json` file, placed at the project path. Below is a sample configuration containing all settings:
```json
{
  "seeds": [
    "https://en.wikipedia.org/wiki/NASA"
  ],
  "filter": {
    "#": "dev.yekta.krawler.model.CrawlingFilter.Whitelist",
    "allowPatterns": [
      "https://en\\.wikipedia\\.org/wiki/.*"
    ]
  },
  "depth": 8,
  "maxPages": 100,
  "maxPageSizeKb": null,
  "concurrentConnections": 16,
  "verbose": true,
  "shouldFollowRedirects": true,
  "userAgent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  "connectTimeoutMs": 6000,
  "readTimeoutMs": 6000,
  "retriesOnServerError": 0,
  "customHeaders": null
}
```
- `seeds`: Starting URLs for crawling.
- `filter`: Crawling filter configuration, either `Whitelist` or `Blacklist`.
- `depth`: Maximum depth of crawling.
- `maxPages`: Maximum number of pages to crawl.
- `maxPageSizeKb`: Maximum page size in kilobytes.
- `concurrentConnections`: Number of concurrent connections for crawling.
- `verbose`: Enable verbose logging.
- `shouldFollowRedirects`: Specify whether redirects should be followed.
- `userAgent`: User agent string for HTTP requests.
- `connectTimeoutMs`: Connection timeout in milliseconds.
- `readTimeoutMs`: Read timeout in milliseconds.
- `retriesOnServerError`: Number of retries on server errors (`5xx`).
- `customHeaders`: Additional custom headers for HTTP requests.
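As a hedged sketch of how such a file can be parsed with Kotlin Serialization, the example below models the configuration and reads it from disk. The data classes and field types are illustrative assumptions, not Krawler's actual model classes (those live in `dev.yekta.krawler.model`); the `"#"` key in the sample above is the polymorphic class discriminator.

```kotlin
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json
import java.io.File

// Illustrative model; the real classes live in dev.yekta.krawler.model and may differ.
@Serializable
sealed class CrawlingFilter {
    @Serializable
    @SerialName("dev.yekta.krawler.model.CrawlingFilter.Whitelist")
    data class Whitelist(val allowPatterns: List<String>) : CrawlingFilter()
    // A Blacklist variant exists as well; its field names are not shown in the sample,
    // so it is omitted from this sketch.
}

@Serializable
data class KrawlerConfig(
    val seeds: List<String>,
    val filter: CrawlingFilter,
    val depth: Int,
    val maxPages: Int,
    val maxPageSizeKb: Int? = null,
    val concurrentConnections: Int,
    val verbose: Boolean,
    val shouldFollowRedirects: Boolean,
    val userAgent: String,
    val connectTimeoutMs: Long,
    val readTimeoutMs: Long,
    val retriesOnServerError: Int,
    // The value type is an assumption; the sample sets this key to null.
    val customHeaders: Map<String, String>? = null,
)

fun main() {
    // "#" matches the discriminator key used in the sample configuration above.
    val json = Json { classDiscriminator = "#"; ignoreUnknownKeys = true }
    val config = json.decodeFromString<KrawlerConfig>(File("krawler_config.json").readText())
    println("Crawling ${config.seeds.size} seed(s) up to depth ${config.depth}")
}
```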
Things that would benefit Krawler the most:

- Implementing Pause/Resume
  - Hint: The `UrlPool` is the only state that isn't currently being persisted but needs to be, in order to be able to restore paused sessions (see the sketch after this list).
- Config: `respectRobotsTxt: Boolean`
- Config: `consecutiveErrorsToPause: Int?`
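For the Pause/Resume hint above, one hypothetical way to persist the `UrlPool` would be a table in the same style as the existing schema, written out when a session is paused and read back on resume. Everything here (the table name, its columns, and the `savePool` helper) is an assumption for illustration, not part of Krawler today.

```kotlin
import org.jetbrains.exposed.dao.id.IntIdTable
import org.jetbrains.exposed.sql.batchInsert
import org.jetbrains.exposed.sql.transactions.transaction

// Hypothetical table for the pending-URL frontier; name and columns are assumptions.
object UrlPoolEntries : IntIdTable() {
    val sessionId = varchar("sessionId", 100)
    val url = text("url")
    val depth = integer("depth")
    val priority = long("priority")
}

// Hypothetical helper: dump the in-memory pool so a paused session can be restored later
// by selecting the rows that share the same sessionId.
fun savePool(session: String, pending: List<Triple<String, Int, Long>>) = transaction {
    UrlPoolEntries.batchInsert(pending) { (pageUrl, pageDepth, pagePriority) ->
        this[UrlPoolEntries.sessionId] = session
        this[UrlPoolEntries.url] = pageUrl
        this[UrlPoolEntries.depth] = pageDepth
        this[UrlPoolEntries.priority] = pagePriority
    }
}
```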
Krawler was conceived and brought to life over a weekend, starting as a pet project. It was initially planned as a component of the coursework for the Web & Search Engines course at Yazd University, then grew well beyond that scope due to a sudden desire to make a "good thing" out of it! Note that no explicit guarantees are extended regarding its correctness, functionality, support, or any other aspect. With that in mind, happy Krawling!
Please refer to LICENSE to view the project's license.