PeARS Fruit Fly

It is widely believed that Web search engines require immense resources to operate, making it impossible for individuals to explore alternatives to the dominant information retrieval paradigms. The PeARS project aims at changing this view by providing search tools that can be used by anyone to index and share Web content on specific topics. The focus is specifically on designing algorithms that will run on entry-level hardware, producing compact but semantically rich representations of Web documents. In this project, we use a cognitively-inspired algorithm to produce queryable representations of Web pages in a highly efficient and transparent manner. The proposed algorithm is a hashing function inspired by the olfactory system of the fruit fly, which has already been used in other computer science applications and is recognised for its simplicity and high efficiency. We will implement and evaluate the algorithm on the task of document retrieval. It will then be integrated into a Web application aimed at supporting the growing practice of 'digital gardening', allowing users to research and categorise Web content related to their interests, without requiring access to centralised search engines.

This repository contains all code necessary to run and replicate our work. Note that the present README provides a minimal overview of the repository. Individual directories contain additional README files. Please also browse the Wiki for extensive information about the framework and our experiments so far.

We gratefully acknowledge financial support from NLnet under the NGI Zero programme.

Install

We recommend installing the code in a virtual environment (under Python3.6):

git clone https://github.com/PeARSearch/PeARS-fruit-fly.git
virtualenv -p python3.6 PeARS-fruit-fly

Install requirements:

cd PeARS-fruit-fly/
source bin/activate
pip install -r requirements.txt

Repository structure

This repository contains three directories, as described below. Each directory contains its own README, which contains further details on each aspect of the framework.

Dataset

The datasets/ directory contains data for evaluating document vectors.

CommonCrawl processor

The common-crawl-processor/ directory contains code for extracting and cleaning documents from CommonCrawl dumps.

Fruit-fly algorithm

The fruit-fly/ directory contains our implementation of the FFA for text classification. This will eventually include:

An implementation of a baseline system with Bayesian hyper-parameters search. [available now]
A genetic algorithm to improve the performance of the FFA. [todo]
A multi-layer FFA. [todo]

Name		Name	Last commit message	Last commit date
Latest commit History 363 Commits
budgeting		budgeting
common_crawl_processor		common_crawl_processor
datasets		datasets
dense_fruit_fly		dense_fruit_fly
fruit_fly		fruit_fly
projection_store		projection_store
spm		spm
web_map		web_map
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PeARS Fruit Fly

Install

Repository structure

Dataset

CommonCrawl processor

Fruit-fly algorithm

About

Releases

Packages

Contributors 4

Languages

License

PeARSearch/PeARS-fruit-fly

Folders and files

Latest commit

History

Repository files navigation

PeARS Fruit Fly

Install

Repository structure

Dataset

CommonCrawl processor

Fruit-fly algorithm

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages