TimTosi/mcrawler

Mini Web Crawler

What this repository is about

This repository contains the source code of a simple web crawler written in Go.

This work is inspired by the Mercator web crawler, minus the distributed part.

Architecture

This web crawler is built as a configurable pipeline based on the internal.Pipe type. Any internal.Pipe in the pipeline can be removed, reordered, or combined with user-defined internal.Pipe implementations that add new features or filters to change the program's behaviour.

This project ships with several ready-to-use internal.Pipe implementations.
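To make the composition idea concrete, here is a minimal sketch of stages being chained and reordered. It assumes a simplified `Target` type standing in for `domain.Target`, and the `upperPipe`, `prefixFilter`, `chain` and `runLinear` names are hypothetical helpers written for this illustration only. The pipeline here is linear rather than cyclic, and it is not how `crawler.Run` is necessarily implemented.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Target is a simplified stand-in for `domain.Target` (assumption).
type Target struct {
	BaseURL string
}

// Pipe mirrors the method signature shown later in this README.
type Pipe interface {
	Pipe(wg *sync.WaitGroup, in <-chan *Target, out chan<- *Target)
}

// upperPipe is a hypothetical stage that transforms every element.
type upperPipe struct{}

func (upperPipe) Pipe(wg *sync.WaitGroup, in <-chan *Target, out chan<- *Target) {
	defer close(out)
	for t := range in {
		t.BaseURL = strings.ToUpper(t.BaseURL)
		out <- t
	}
}

// prefixFilter is a hypothetical stage that discards non-matching elements.
type prefixFilter struct{ prefix string }

func (p prefixFilter) Pipe(wg *sync.WaitGroup, in <-chan *Target, out chan<- *Target) {
	defer close(out)
	for t := range in {
		if !strings.HasPrefix(t.BaseURL, p.prefix) {
			wg.Done() // The element leaves the pipeline here.
			continue
		}
		out <- t
	}
}

// chain starts each pipe in its own goroutine, connects them in order and
// returns the output channel of the last one.
func chain(wg *sync.WaitGroup, in <-chan *Target, pipes ...Pipe) <-chan *Target {
	for _, p := range pipes {
		out := make(chan *Target)
		go p.Pipe(wg, in, out)
		in = out
	}
	return in
}

// runLinear feeds urls through the given pipes and collects what comes out.
func runLinear(urls []string, pipes ...Pipe) []string {
	var wg sync.WaitGroup
	src := make(chan *Target)
	out := chain(&wg, src, pipes...)

	wg.Add(len(urls)) // One unit of work per element entering the pipeline.
	go func() {
		defer close(src)
		for _, u := range urls {
			src <- &Target{BaseURL: u}
		}
	}()

	var got []string
	for t := range out {
		got = append(got, t.BaseURL)
		wg.Done() // The element leaves the pipeline at the end of the chain.
	}
	wg.Wait()
	return got
}

func main() {
	urls := []string{"http://a", "ftp://b"}
	// Filter first, then transform: one element survives.
	fmt.Println(runLinear(urls, prefixFilter{prefix: "http://"}, upperPipe{}))
	// Same stages, swapped: the filter no longer matches anything.
	fmt.Println(runLinear(urls, upperPipe{}, prefixFilter{prefix: "http://"}))
}
```

Swapping the two stages changes the result, which is the practical consequence of the pipeline order being configurable.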

The default pipeline provided with this project and found in the cmd directory looks like this:


S        +-------------------+       +-------------------+
T        |                   |       |                   |
A  ----> |     Archiver      |------>|      Mapper       |--+
R        |                   |       |                   |  |
T        +-------------------+       +-------------------+  |
                                                            |
                                                            |
 +----------------------------------------------------------+
 |
 |
 |      +-------------------+       +-------------------+
 |      |                   |       |                   |
 +----> |     Follower      |------>|      Worker       |---+
        |                   |       |                   |   |
        +-------------------+       +-------------------+   |
                                                            |
                                                            |
 +----------------------------------------------------------+
 |
 |
 |      +-------------------+        S
 |      |                   |        T
 +----> |     Extractor     |------> A
        |                   |        R
        +-------------------+        T

❗ This pipeline cycles on itself, so if you do not use the components provided by default you have to introduce your own termination conditions to avoid infinite loops.

All the goroutines are controlled & coordinated through a sync.WaitGroup created in crawler.Run.

💡 If you add new internal.Pipe implementations to the pipeline, remember to call wg.Done() each time you discard an element and wg.Add(1) each time you add a new element to the pipeline.
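The accounting rule above can be sketched with a tiny cyclic pipeline over an in-memory link graph. Everything here is an assumption made for illustration: the `crawl` function, the `graph` map and the two anonymous stages are hypothetical and do not reproduce the code in `crawler.Run`. The point is only to show why the counter must go up for every element that enters the cycle and down for every element that leaves it.

```go
package main

import (
	"fmt"
	"sync"
)

// crawl walks a hypothetical in-memory link graph through a two-stage cycle
// and returns the number of distinct pages visited.
func crawl(graph map[string][]string, start string) int {
	var wg sync.WaitGroup
	entry := make(chan string)    // Entry point of the cycle.
	expanded := make(chan string) // Pages that passed the duplicate filter.
	done := make(chan struct{})
	visited := 0

	// Stage 1: discard pages already seen, similar in spirit to an archiver.
	go func() {
		defer close(expanded)
		seen := map[string]bool{}
		for page := range entry {
			if seen[page] {
				wg.Done() // Discarded: one element leaves the pipeline.
				continue
			}
			seen[page] = true
			visited++
			expanded <- page
		}
	}()

	// Stage 2: send every link of a page back to the entry of the cycle.
	go func() {
		defer close(done)
		for page := range expanded {
			for _, link := range graph[page] {
				wg.Add(1) // Added: one new element enters the pipeline.
				// Send from a separate goroutine so this stage never blocks
				// on the feedback edge of the cycle.
				go func(l string) { entry <- l }(link)
			}
			wg.Done() // The page itself has been consumed.
		}
	}()

	wg.Add(1)
	entry <- start
	wg.Wait()    // Zero means no element is left in flight.
	close(entry) // Safe: every pending send has already been received.
	<-done       // Wait for both stages to drain and exit.
	return visited
}

func main() {
	graph := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/", "/b"},
		"/b": {"/a"},
	}
	fmt.Println(crawl(graph, "/")) // prints 3
}
```

If a stage forgot `wg.Done()` on a discarded element, `wg.Wait()` would never return. If it forgot `wg.Add(1)` on a new element, the entry channel could be closed while a send is still pending.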

Quickstart

First, go get this repository:

go get -d github.com/timtosi/mcrawler

Quickstart with Docker

❗ If you don't have Docker and Docker Compose installed, you can still run this program by compiling the binary.

This program comes with a preconfigured Docker Compose file that crawls a website located at localhost:8080.

You can use the run target in the provided Makefile to use it easily.

💡 If you want to change the crawled target, you will have to update the Docker Compose file accordingly.

Quickstart without Docker

First install dependencies & compile the binary:

cd $GOPATH/src/github.com/timtosi/mcrawler/
make install && make build

Then launch the program, passing the target as an argument:

cd $GOPATH/src/github.com/timtosi/mcrawler/cmd/
go build && ./mcrawler "http://localhost:8080"

Component List

Here is a list of the components provided with this program, with a short description of each:

  • Worker: This component fetches a webpage located at domain.Target.BaseURL and populates domain.Target.Content.

  • Archiver: This component discards any domain.Target already seen.

  • Mapper: This component records every domain.Target that passes through it so that the mapper.Render function can display a sitemap visualization.

  • Follower: This component discards any domain.Target whose host differs from internal.Follower.originHost.

  • Extractor: This component parses a domain.Target to retrieve every link matching one of its extractor.CheckFunc functions.
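As a rough illustration of the two filtering components, the sketch below shows one way a duplicate check and a host check could be written with the standard library. The `sameHost`, `seenSet` and `keep` names are hypothetical, URLs are plain strings instead of `domain.Target` values, and comparing `url.URL.Host` (which includes the port) is an assumption about how hosts are matched, not a description of the actual Follower.

```go
package main

import (
	"fmt"
	"net/url"
)

// sameHost reports whether rawURL points at originHost, the kind of test a
// Follower-style filter could apply. Unparsable URLs are rejected, and so are
// relative links, whose Host is empty until they are resolved.
func sameHost(originHost, rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	return u.Host == originHost
}

// seenSet remembers URLs, the kind of bookkeeping an Archiver-style filter
// needs.
type seenSet map[string]struct{}

// firstVisit records rawURL and reports whether it was new.
func (s seenSet) firstVisit(rawURL string) bool {
	if _, ok := s[rawURL]; ok {
		return false
	}
	s[rawURL] = struct{}{}
	return true
}

// keep applies both checks in the order of the default pipeline: the
// duplicate check first, then the host check.
func keep(seen seenSet, originHost string, urls []string) []string {
	var out []string
	for _, u := range urls {
		if seen.firstVisit(u) && sameHost(originHost, u) {
			out = append(out, u)
		}
	}
	return out
}

func main() {
	urls := []string{
		"http://localhost:8080/",
		"http://localhost:8080/a",
		"http://localhost:8080/", // duplicate, dropped by the first check
		"https://example.com/",   // foreign host, dropped by the second check
	}
	fmt.Println(keep(seenSet{}, "localhost:8080", urls))
}
```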

How To Add a Component

To add a component to the pipeline, create a struct that implements the internal.Pipe interface.

package example

import (
	"sync"

	"github.com/timtosi/mcrawler/internal/domain"
)

// UserPipe is a `struct` implementing the `internal.Pipe` interface.
type UserPipe struct {
	// properties ...
}

// NewUserPipe returns a new `example.UserPipe`.
func NewUserPipe() *UserPipe {
	return &UserPipe{}
}

// Pipe is a user defined function used in the pipeline launched by
// `crawler.Crawler`.
func (up *UserPipe) Pipe(wg *sync.WaitGroup, in <-chan *domain.Target, out chan<- *domain.Target) {
	defer close(out)

	for t := range in {
		//
		// --------> Here, do something with element received from `in`.
		//
		if t == nil { // Placeholder condition: replace it with your own discard rule.
			wg.Done() // Don't forget to wg.Done() when you discard an element !
		} else {
			out <- t
		}
	}
}

Then plug it into main:

package main

import (
	"log"
	"os"

	"github.com/user/example"
	"github.com/timtosi/mcrawler/internal/crawler"
	"github.com/timtosi/mcrawler/internal/domain"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal(`usage: ./mcrawler <BASE_URL>`)
	}

	t := domain.NewTarget(os.Args[1])

	if err := crawler.NewCrawler().Run(
		t, // Assumption: Run takes the start target first; check crawler.Run for the exact signature.
		example.NewUserPipe(), // ------- > Insert here !!!
	); err != nil {
		log.Fatal(err)
	}
	log.Printf("shutdown")
}

FAQ

None so far 🙌

License

Every file provided here is available under the MIT License.

Not Good Enough?

If you encounter any issue while using what is provided here, please let me know! Help me improve by sending your thoughts to timothee.tosi@gmail.com.
