- About
- Architecture
- Quickstart with Docker
- Quickstart with binary
- Component List
- How To Add a Component
- FAQ
- Support & Feedback
This repository contains the source code of a simple web crawler written in Go. This work was inspired by the Mercator web crawler, minus the distributed part.

This web crawler is composed of a configurable pipeline based on the `internal.Pipe` type. Any `internal.Pipe` in the pipeline can be removed, reordered, or combined with new user-defined `internal.Pipe`s that introduce new features or filters modifying the internal behaviour of this program.

This project already comes with several `internal.Pipe`s that you can use.
The default pipeline provided with this project and found in the `cmd` directory looks like this:
```
S    +-------------------+       +-------------------+
T    |                   |       |                   |
A -->|     Archiver      |------>|      Mapper       |--+
R    |                   |       |                   |  |
T    +-------------------+       +-------------------+  |
                                                         |
                                                         |
  +------------------------------------------------------+
  |
  |
  |  +-------------------+       +-------------------+
  |  |                   |       |                   |
  +->|     Follower      |------>|      Worker       |--+
     |                   |       |                   |  |
     +-------------------+       +-------------------+  |
                                                         |
                                                         |
  +------------------------------------------------------+
  |
  |
  |  +-------------------+        S
  |  |                   |        T
  +->|     Extractor     |------> A
     |                   |        R
     +-------------------+        T
```
❗ This pipeline cycles on itself, so if you do not use the mechanisms provided by default, you have to introduce your own edge conditions to avoid infinite loops.
All the goroutines are controlled and coordinated through a `sync.WaitGroup` created in `crawler.Run`.
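To make that coordination concrete, here is a made-up sketch of the pattern, not the actual `crawler.Run` implementation: every pipe runs in its own goroutine, channels chain them together, and one shared `sync.WaitGroup` counts the elements still travelling through the cyclic pipeline. The function name, `pipeFunc` type, and channel sizes are all assumptions.

```go
package sketch

import (
	"sync"

	"github.com/timtosi/mcrawler/internal/domain"
)

// pipeFunc mirrors the `Pipe` method signature used throughout this README.
type pipeFunc func(wg *sync.WaitGroup, in <-chan *domain.Target, out chan<- *domain.Target)

// run is a hypothetical illustration of the coordination pattern, not the
// real `crawler.Run`: pipes are chained with channels and a shared WaitGroup
// tracks every element still inside the pipeline.
func run(seed *domain.Target, pipes ...pipeFunc) {
	var wg sync.WaitGroup

	first := make(chan *domain.Target, 1024)
	in := (<-chan *domain.Target)(first)
	for _, p := range pipes {
		out := make(chan *domain.Target, 1024)
		go p(&wg, in, out) // every pipe gets its own goroutine.
		in = out
	}

	// Feed the last pipe's output back into the first one: this is the
	// cycle shown in the architecture diagram above.
	go func() {
		for t := range in {
			first <- t
		}
	}()

	wg.Add(1) // account for the seed element...
	first <- seed
	wg.Wait() // ...and return once every element has been discarded.
}
```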
💡 If you introduce new `internal.Pipe`s in the pipeline, don't forget to call `wg.Done()` each time you discard an element and `wg.Add(1)` each time you add a new element to the pipeline.
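For instance, a pipe that injects brand-new elements, such as links discovered in a page, has to grow the counter before sending them. Here is a minimal sketch, assuming a made-up `extractLinks` helper that is not part of this project:

```go
package example

import (
	"sync"

	"github.com/timtosi/mcrawler/internal/domain"
)

// LinkPipe is a hypothetical pipe showing the wg.Add(1) side of the
// accounting rule described above.
type LinkPipe struct{}

// extractLinks is a made-up helper; a real implementation would parse
// the target content and return the URLs found in it.
func extractLinks(t *domain.Target) []string { return nil }

// Pipe forwards every element it receives and injects one new element per
// discovered link, registering each one with wg.Add(1) before sending it.
func (p *LinkPipe) Pipe(wg *sync.WaitGroup, in <-chan *domain.Target, out chan<- *domain.Target) {
	defer close(out)
	for t := range in {
		for _, u := range extractLinks(t) {
			wg.Add(1)                  // one wg.Add(1) per new element...
			out <- domain.NewTarget(u) // ...before it enters the pipeline.
		}
		out <- t // the original element keeps its existing slot.
	}
}
```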
First, go get this repository:

```bash
go get -d github.com/timtosi/mcrawler
```
❗ If you don't have Docker and Docker Compose installed, you can still execute this program by compiling the binary.
This program comes with an already configured Docker Compose file that crawls a website located at `localhost:8080`. You can use the `run` target in the provided Makefile to launch it easily.
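For example, from the repository root:

```bash
cd $GOPATH/src/github.com/timtosi/mcrawler/
make run
```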
💡 If you want to change the crawled target, you will have to update the Docker Compose file accordingly.
First, install dependencies and compile the binary:

```bash
cd $GOPATH/src/github.com/timtosi/mcrawler/
make install && make build
```
Then launch the program, specifying the target as an argument:

```bash
cd $GOPATH/src/github.com/timtosi/mcrawler/cmd/
go build && ./mcrawler "http://localhost:8080"
```
Here is a list and a short description of the components provided with this program:

- Worker: fetches the webpage located at `domain.Target.BaseURL` and populates `domain.Target.Content`.
- Archiver: discards any `domain.Target` already seen.
- Mapper: keeps a record of every single `domain.Target` passing through, in order to display a sitemap visualization with the `mapper.Render` function.
- Follower: discards any `domain.Target` whose host differs from `internal.Follower.originHost`.
- Extractor: parses a `domain.Target` to retrieve any link matching one of its `extractor.CheckFunc` functions.
In order to add a component to the pipeline, you need to create a `struct` implementing the `internal.Pipe` interface.
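The interface definition itself is not reproduced in this document; judging from the `Pipe` method used below, it presumably looks like this:

```go
package example

import (
	"sync"

	"github.com/timtosi/mcrawler/internal/domain"
)

// Pipe is the presumed shape of the `internal.Pipe` interface, inferred
// from the method signature used in the example below.
type Pipe interface {
	Pipe(wg *sync.WaitGroup, in <-chan *domain.Target, out chan<- *domain.Target)
}
```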
```go
package example

import (
	"sync"

	"github.com/timtosi/mcrawler/internal/domain"
)

// UserPipe is a `struct` implementing the `internal.Pipe` interface.
type UserPipe struct {
	// properties ...
}

// NewUserPipe returns a new `example.UserPipe`.
func NewUserPipe() *UserPipe {
	return &UserPipe{}
}

// shouldDiscard is a placeholder for your own filtering logic.
func (up *UserPipe) shouldDiscard(t *domain.Target) bool {
	return false
}

// Pipe is a user defined function used in the pipeline launched by
// `crawler.Crawler`.
func (up *UserPipe) Pipe(wg *sync.WaitGroup, in <-chan *domain.Target, out chan<- *domain.Target) {
	defer close(out)
	for t := range in {
		//
		// --------> Here, do something with the element received from `in`.
		//
		if up.shouldDiscard(t) {
			wg.Done() // Don't forget to wg.Done() when you discard an element!
		} else {
			out <- t
		}
	}
}
```
Then you just have to plug it into the main:
```go
package main

import (
	"log"
	"os"

	"github.com/user/example"

	"github.com/timtosi/mcrawler/internal/crawler"
	"github.com/timtosi/mcrawler/internal/domain"
)

func main() {
	if len(os.Args) < 2 || os.Args[1] == "" {
		log.Fatal(`usage: ./mcrawler <BASE_URL>`)
	}
	t := domain.NewTarget(os.Args[1])

	if err := crawler.NewCrawler().Run(
		t,
		example.NewUserPipe(), // -------> Insert here !!!
	); err != nil {
		log.Fatal(err)
	}
	log.Printf("shutdown")
}
```
None so far 🙌
Every file provided here is available under the MIT License.
If you encounter any issue using what is provided here, please let me know! Help me improve by sending your thoughts to timothee.tosi@gmail.com!