Recursive Site Crawler

http-crawler is a script and an API in one bundle: given a URL, it recursively crawls the <a> tags on each page and returns all links found along with the number of occurrences of each link.

Preview

[preview-usage GIF]

Usage

Standalone (results to console):

$ npm run crawl [website]

API server:

$ npm run start

Tests:

$ npm run test

Build dist:

$ npm run build

Develop Locally:

$ npm run dev

Features

  • given a base URL, collects all anchor elements on pages with content type text/html
  • produces JSON of visited URLs and the number of references to each URL, in the format shown below
  • works with HTTP and HTTPS
  • handles malformed URLs
  • unit tests available in src/tests (Jest)
[
    {
        "url": "https://example.com",
        "visits": 10
    },
    {
        "url": "https://example.com/blog",
        "visits": 4
    },
    {
        "url": "https://example.com/about",
        "visits": 1
    }
]

Technologies

  • Node.js
  • ts-node
  • TypeScript (type checking)
  • Express (API functionality)
  • Babel (transpilation)
  • Jest (unit tests)

Implementation

The core functionality of the app is built around 3 main functions:

  1. crawl

crawl takes in a base URL and an initially empty object that holds all the URLs the function finds. It handles several higher-level tasks (see the sketch after this list):

  • Ensures that we crawl within the page limit (limit: number is a parameter with a fairly low default; this is to prevent accidentally DDoSing a site)
  • Ensures that we do not crawl external sites
  • Calls the helper function getURLsFromHTML when a fetched page has a content type of text/html, then iterates through the list of URLs it returns.
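
A minimal sketch of how such a crawl loop can be structured. The parameter names, the default limit, and the use of Node's global fetch are assumptions for illustration, not the repository's actual implementation; it leans on the normalizeURL and getURLsFromHTML helpers described below.

type PageCounts = Record<string, number>

async function crawl(
    baseURL: string,
    currentURL: string = baseURL,
    pages: PageCounts = {},
    limit: number = 50, // fairly low default to avoid accidentally DDoSing a site
): Promise<PageCounts> {
    // Stop once the page budget is spent.
    if (Object.keys(pages).length >= limit) return pages

    // Never crawl external sites.
    if (new URL(currentURL).hostname !== new URL(baseURL).hostname) return pages

    const normalized = normalizeURL(currentURL)
    if (pages[normalized] !== undefined) {
        pages[normalized]++ // already visited: just count the extra reference
        return pages
    }
    pages[normalized] = 1

    // Only parse pages served as text/html.
    const res = await fetch(currentURL)
    if (!res.headers.get("content-type")?.includes("text/html")) return pages

    const html = await res.text()
    for (const url of getURLsFromHTML(html, baseURL)) {
        pages = await crawl(baseURL, url, pages, limit)
    }
    return pages
}
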
  2. normalizeURL

normalizeURL takes in a URL and cleanses it before data is aggregated. Cleansing here means removing trailing slashes so the aggregator doesn't count a hostname with a trailing slash as a separate path (https://google.ca/ => https://google.ca).
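
As an illustration only (the real implementation may differ), the trailing-slash cleansing can be as simple as:

// Sketch: strip one trailing slash so "https://google.ca/" and
// "https://google.ca" aggregate under the same key.
export function normalizeURL(rawURL: string): string {
    return rawURL.endsWith("/") ? rawURL.slice(0, -1) : rawURL
}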

  3. getURLsFromHTML

getURLsFromHTML is a helper function that collects all <a> elements and returns their URLs in an array. It includes validation logic to determine whether an anchor element holds a relative or an absolute URL.
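
A hedged sketch of that idea, assuming jsdom is used to parse the HTML (the README does not name the parser, so treat the details as illustrative):

import { JSDOM } from "jsdom"

export function getURLsFromHTML(htmlBody: string, baseURL: string): string[] {
    const urls: string[] = []
    const anchors = new JSDOM(htmlBody).window.document.querySelectorAll("a")

    for (const anchor of anchors) {
        const href = anchor.getAttribute("href") ?? ""
        // A leading slash marks a relative URL, so prefix the base URL;
        // anything else must already be absolute or the URL constructor throws.
        const candidate = href.startsWith("/") ? `${baseURL}${href}` : href
        try {
            urls.push(new URL(candidate).href)
        } catch {
            continue // malformed href: skip it
        }
    }
    return urls
}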

Struggles

A big struggle I ran into was unexpected unit test behavior. Given an invalid URL, the URL constructor that Node provides throws a TypeError, according to the documentation.

test("getURLsFromHTML skip invalid URL", () => {
    const htmlBody: string = `
<html>
    <body>
    <a href="invalid">
        No slash or protocol - broken link
    </a>
    </body>
</html>
`
    const inputBaseURL = "https://blog.msoup.com"
    const actual = getURLsFromHTML(htmlBody, inputBaseURL)
    const expected: [] = []
    expect(actual).toEqual(expected)
})

While the test passed, the code took a completely different turn when run through Jest, like so:

try {
    const url = new URL(`${link.href}`)
    urls.push(url.href)
} catch (err: unknown) {
    if (err instanceof TypeError) {
        // path TypeError: this is the expected path
        continue
    } else {
        // path Other: this is what happens only through Jest
        continue
    }
}

Upon digging, I discovered that the root issue was that Jest uses completely different globals from Node's globals. This has been a long-standing issue, reported as early as January 2017.

The take-away is that, as of 2023, someArray instanceof Array and, when using http, someError instanceof Error will return false inside Jest, even if everything else suggests they should be true. This hasn't been patched because Jest ensures that every test runs in its own sandbox.

In the case of arrays, we can simply switch to Array.isArray as a band-aid fix, but Error.isError is not a function, so there is no direct equivalent for errors.

As a temporary fix, I have made both paths do the same thing whether err instanceof TypeError is true or not.
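
For illustration (none of this is code from the repository), the array band-aid and one possible cross-realm check for errors look roughly like this:

// Array.isArray works across realms, unlike `value instanceof Array`.
function countItems(value: unknown): number {
    return Array.isArray(value) ? value.length : 0
}

// There is no Error.isError, so a duck-type guard is one cross-realm option.
function looksLikeError(err: unknown): err is Error {
    return (
        typeof err === "object" &&
        err !== null &&
        typeof (err as Error).name === "string" &&
        typeof (err as Error).message === "string"
    )
}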

TODO

I hope to extend this project so it becomes a callable API, not a standalone module run locally.
