YaCy Grid Component: Parser

The YaCy Grid is the second-generation implementation of YaCy, a peer-to-peer search engine. A YaCy Grid installation consists of a set of microservices which communicate with each other using the MCP (see https://github.com/yacy/yacy_grid_mcp).

Purpose

The Parser is a microservice which can be deployed, for example, using Docker. When the Parser component is started, it searches for an MCP and connects to it. By default the local host is searched for an MCP, but you can configure a different one yourself.

What it does

The Parser is able to read a WARC file and parse its content. The content is analyzed, and the plain text, links, images and further entities are extracted. The result is stored in a JSON object. Calling the parser generates a list of JSON objects, each containing the analyzed content of one internet resource. The parser understands not only HTML but also a wide range of other document formats, including PDF, all OpenOffice and MS Office document formats and many more.

Installation: Download, Build, Run

At this time, yacy_grid_parser is not provided in compiled form, but you can easily build it yourself. It is not difficult and takes about a minute. The source code is hosted at https://github.com/yacy/yacy_grid_parser; you can download it and run yacy_grid_parser with:

> git clone --recursive https://github.com/yacy/yacy_grid_parser.git
> cd yacy_grid_parser
> gradle run

This repository uses git submodules to integrate yacy_grid_mcp into yacy_grid_parser. If you cloned this repository without the --recursive option, run this now:

> git submodule update --init --recursive

Because of the submodules, each subsequent

> git pull origin master

must be followed by a pull for the submodules as well, in case anything has changed there. You can do that easily with:

> git submodule foreach git pull origin master
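If you update often, the two commands above can be wrapped into one small shell function. This is only a convenience sketch: the function name update_all is made up for this example and is not part of the repository.

```shell
# Hypothetical helper (not part of yacy_grid_parser): pull the main
# repository first, then pull every submodule. The "&&" makes sure the
# submodules are only touched if the first pull succeeded.
update_all() {
    git pull origin master &&
    git submodule foreach git pull origin master
}
```

Run update_all from inside the yacy_grid_parser folder.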

Example for Parsing a set of Documents

For this example, a hosted version of yacy_grid_parser is provided at http://yacygrid.com:8500. The example shows how a web site is crawled using wget, then parsed with yacy_grid_parser and finally indexed with yacy_search_server ('legacy' YaCy/1.x) using the surrogate dump reading method.

First, crawl a site (here: publicplan.de):

> wget -r -l3 "https://www.publicplan.de/" --warc-file="publicplan.de"

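This produces the file "publicplan.de.warc.gz". That file can then be sent to yacy_grid_parser. The command below addresses an instance running on the local host at port 8500; to use the hosted version instead, replace the host part of the URL accordingly: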

> curl -X POST -F "sourcebytes=@publicplan.de.warc.gz" -F "flatfile=true" -o "publicplan.de.flatjson" http://localhost:8500/yacy/grid/parser/parser.json

The result is a file "publicplan.de.flatjson" with 774 single JSON objects, each printed on its own line in the flatjson file. To index that file with legacy YaCy (YaCy/1.x), just copy it into the yacy_search_server/DATA/SURROGATES/in/ path.
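The parse and copy steps can also be scripted. The following is a minimal sketch, not an official tool: the function names parse_warc, count_objects and install_surrogate are made up for this example, the default address http://localhost:8500 is taken from the curl command above, and the location of your yacy_search_server folder is an assumption you have to adapt. The only addition to the curl call is --fail, so that an HTTP error is reported as a failure.

```shell
# Hypothetical helpers, not part of yacy_grid_parser.

# Send a WARC file to a parser instance and print the name of the result file.
# $1: WARC file, e.g. publicplan.de.warc.gz
# $2: base URL of the parser (optional, defaults to the local instance)
parse_warc() {
    warc=$1
    base_url=${2:-http://localhost:8500}
    out=${warc%.warc.gz}.flatjson
    curl --fail -X POST -F "sourcebytes=@$warc" -F "flatfile=true" \
        -o "$out" "$base_url/yacy/grid/parser/parser.json" &&
    echo "$out"
}

# Count the JSON objects in a flatjson file (one object per line).
# Empty lines are not counted.
count_objects() {
    grep -c . "$1"
}

# Copy a flatjson file into the surrogate input folder of legacy YaCy.
# $1: flatjson file
# $2: folder of the yacy_search_server installation (assumed location)
install_surrogate() {
    flatjson=$1
    target="$2/DATA/SURROGATES/in"
    if [ ! -s "$flatjson" ]; then
        echo "error: $flatjson is missing or empty" >&2
        return 1
    fi
    if [ ! -d "$target" ]; then
        echo "error: $target does not exist" >&2
        return 1
    fi
    echo "copying $(count_objects "$flatjson") objects to $target"
    cp "$flatjson" "$target/"
}
```

A possible call sequence, assuming yacy_search_server is located next to the current folder:

> parse_warc publicplan.de.warc.gz
> install_surrogate publicplan.de.flatjson ../yacy_search_server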

Contribute

This is a community project and your contribution is welcome!

Check for open issues or open a fresh one to start a discussion around a feature idea or a bug.
Fork the repository on GitHub to start making your changes (branch off of the master branch).
Write a test that shows the bug was fixed or the feature works as expected.
Send a pull request and bug us on Gitter until it gets merged and published. :)

What is the software license?

LGPL 2.1

Have fun!

@0rb1t3r
