It is widely believed that Web search engines require immense resources to operate, making it impossible for individuals to explore alternatives to the dominant information retrieval paradigms. The PeARS project aims to change this view by providing search tools that anyone can use to index and share Web content on specific topics. The focus is on designing algorithms that run on entry-level hardware and produce compact but semantically rich representations of Web documents. In this project, we use a cognitively inspired algorithm to produce queryable representations of Web pages in a highly efficient and transparent manner. The proposed algorithm is a hashing function inspired by the olfactory system of the fruit fly, which has already been used in other computer science applications and is recognised for its simplicity and high efficiency. We will implement and evaluate the algorithm on the task of document retrieval. It will then be integrated into a Web application aimed at supporting the growing practice of 'digital gardening', allowing users to research and categorise Web content related to their interests, without requiring access to centralised search engines.
This repository contains all code necessary to run and replicate our work. Note that the present README provides a minimal overview of the repository. Individual directories contain additional README files. Please also browse the Wiki for extensive information about the framework and our experiments so far.
We gratefully acknowledge financial support from NLnet under the NGI Zero programme.
We recommend installing the code in a virtual environment (Python 3.6):
git clone https://github.com/PeARSearch/PeARS-fruit-fly.git
virtualenv -p python3.6 PeARS-fruit-fly
Install requirements:
cd PeARS-fruit-fly/
source bin/activate
pip install -r requirements.txt
This repository contains three directories, described below. Each has its own README with further details on that part of the framework.
The datasets/ directory contains data for evaluating document vectors.
The common-crawl-processor/ directory contains code for extracting and cleaning documents from CommonCrawl dumps.
The fruit-fly/ directory contains our implementation of the fruit fly algorithm (FFA) for text classification. It will eventually include:
- An implementation of a baseline system with Bayesian hyper-parameters search. [available now]
- A genetic algorithm to improve the performance of the FFA. [todo]
- A multi-layer FFA. [todo]
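For readers new to the approach, the sketch below illustrates the general idea behind fruit-fly-style hashing: a sparse random projection followed by a winner-take-all step. It is not the code in fruit-fly/. All names (`make_projection`, `fly_hash`), the dimensions, and the use of NumPy are assumptions made for illustration only.

```python
import numpy as np


def make_projection(input_dim, num_cells, connections_per_cell, rng):
    """Build a sparse binary projection: each output cell samples a few inputs."""
    projection = np.zeros((num_cells, input_dim), dtype=np.int8)
    for row in range(num_cells):
        chosen = rng.choice(input_dim, size=connections_per_cell, replace=False)
        projection[row, chosen] = 1
    return projection


def fly_hash(vector, projection, top_k):
    """Project the input, then keep only the top_k most active cells."""
    activations = projection @ vector
    # Winner-take-all: indices of the top_k largest activations.
    winners = np.argsort(activations)[-top_k:]
    code = np.zeros(projection.shape[0], dtype=np.int8)
    code[winners] = 1
    return code


# Hypothetical sizes: 50 input features, 200 output cells, 10 winners.
rng = np.random.default_rng(0)
projection = make_projection(input_dim=50, num_cells=200,
                             connections_per_cell=6, rng=rng)

doc_a = rng.random(50)                  # stand-in for a document vector
doc_b = doc_a + 0.01 * rng.random(50)   # a slightly perturbed copy of doc_a

hash_a = fly_hash(doc_a, projection, top_k=10)
hash_b = fly_hash(doc_b, projection, top_k=10)

# Number of active cells shared by the two binary hashes.
overlap = int(np.sum(hash_a & hash_b))
print("shared active cells:", overlap)
```

Each hash is a binary vector with exactly `top_k` active cells, so two documents can be compared by counting the cells they share. The values used here for the number of cells, connections per cell, and winners are arbitrary; in practice they are hyper-parameters to be tuned.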