- Crawling of multiple domains
- Allows writing flexible rules to decide which links to crawl.
- Support for robots.txt
- MongoDB (GridFS) as storage for crawled content
- TitanDB (with an InMemory, BerkeleyDB or Cassandra backend) to store the graph of links.
- Written in Scala.
- Works on Linux. It should work on Windows as well, but I haven't tested it.
Nomad uses Gradle as its build system. To build from source you need to:

- Install Gradle
- Check out the sources
- Go to the folder with build.gradle and run `gradle distZip`

You can find nomad*.zip in `build/distributions/`.
Download a ready-to-use binary here: https://bitbucket.org/hudvin/nomad/downloads/nomad-release-0.3.zip
- JRE/JDK 7
- MongoDB
- Linux. Currently tested on Debian 7 only.
To run nomad, execute from a shell:
`./bin/nomad <path to profile>`
For example: `./bin/nomad profiles/template`
###What is a profile?
To simplify the use of different configurations, nomad lets you create profiles. A profile is a folder with 3 files (an example layout follows this list):
- application.conf. Contains the configuration of the graph and file storages and of the crawling strategy.
- filters.groovy. A Groovy file with two functions, filterUrl and filterEntity. Here you can define any logic you want to filter URLs and files.
- seed.txt. A list of URLs to crawl.
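For illustration, a profile folder could look like this (the folder name `myprofile` is just a placeholder; only the three files above are required):

```
profiles/
  myprofile/            # pass this path to ./bin/nomad
    application.conf    # storage and crawling configuration
    filters.groovy      # filterUrl / filterEntity
    seed.txt            # start URLs, one per line
```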
####application.conf
```
app {
  // name of the file with seed urls
  default_seed = seed.txt
}

master {
  // one worker crawls one domain, so the number of workers is the number of simultaneously crawled domains
  workers = 10
  // number of links fetched simultaneously
  threads_in_worker = 10
}

links {
  // size of the cache for links to crawl
  bfs_limit = 5000
  // links extracted from pages are kept in memory; when their number exceeds this value
  // they are flushed to the db
  extracted_links_cache = 200000
}

storage {
  // mongo is used as storage for all fetched files
  mongo {
    host = "127.0.0.1"
    port = 27017
    db_name = nomad
    drop = true
  }
  // titan and blueprints are used as storage for the graph of links
  titan {
    // backend for titan - inmemory, cassandra or berkeley
    // drop = true means that the db will be dropped on each start
    main_connector = inmemory
    backends {
      cassandra {
        host = "127.0.0.1"
        drop = false
      }
      berkeley {
        directory = /tmp/berkeley
        drop = true
      }
      inmemory {
      }
    }
  }
}
```
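For example, to keep the link graph on disk between runs instead of in memory, the relevant part of application.conf could be changed like this (a sketch based only on the options shown above):

```
storage {
  titan {
    main_connector = berkeley
    backends {
      berkeley {
        directory = /tmp/berkeley
        drop = false   // keep the graph between runs
      }
    }
  }
}
```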
####filters.groovy
Contains two functions:

```groovy
def filterUrl(url) { return true }
def filterEntity(size, url, mimeType) { return true }
```
If a function returns true, the URL or file (entity) will be downloaded, otherwise it is skipped. filterUrl is called after a link has been extracted, so if filterUrl returns false for a link, nomad will never try to crawl it. filterEntity is called after the headers for a file have been received; if the function returns false, the file is skipped. This can be useful to prevent downloading large files, for example.
Example implementation (from profiles/template/filters.groovy):
```groovy
def filterUrl(url) {
    if (url.contains(".pdf")) {
        return false
    }
    if (url.contains(".tgz")) {
        return false
    }
    if (url.contains("http://consc.net/online/")) {
        return false
    }
    return true
}

def filterEntity(size, url, mimeType) {
    if (size > 10000000) {
        return false
    }
    return true
}
```
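As a further illustration (this variant is not part of the repository), filterEntity could also use the mimeType parameter, for instance to skip everything except HTML pages:

```groovy
def filterEntity(size, url, mimeType) {
    // hypothetical rule: only download HTML pages smaller than ~10 MB
    if (mimeType == null || !mimeType.startsWith("text/html")) {
        return false
    }
    return size <= 10000000
}
```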
####seed.txt
Contains the list of URLs to crawl. Each URL must look like:
`http(s)://ibm.com`
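For example, a seed.txt might look like this (the entries are just placeholders):

```
http://ibm.com
https://en.wikipedia.org
```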
- It still contains a lot of bugs.
- I am working on an external API to provide access to the graph and files.
- Stability needs more testing.
- Performance optimization is still needed.