Software architecture
The HCI project hosts four levels of data:
- user level - corpus: web entities and their metadata (qualification, notes, indicators, etc.) as seen and manipulated by users. This is the level where the corpus is built. It is a virtual layer of information built on top of the LRUs tree: the information is aggregated by web entity and presented as such.
- memory structure level - Reverse_URLs Tree: reversed URLs, called LRUs, and the links between them (a sketch of the URL-to-LRU transformation follows this list). This is a first level of aggregation of page-level information. It is more detailed than the user level and allows redefining web entities on the fly, that is, without having to re-crawl. There is a system limitation to the precision of the information: beyond a certain depth, links are aggregated (see Precision_limit).
- raw data level - extraction of web resources: extracted links and the full content of crawled pages, stored as flat files. This level of storage is used only to reconstruct or update the above layers on demand. It has to be optimised for writing only; reconstructions should be exceptional, so reading from this level of storage may be slow.
- archive level: dump of web contents by an archiving engine. Fully detailed but not vectorized into a specific memory structure, since this level focuses on archive browsing features.
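As a rough illustration of the memory structure level, here is a minimal sketch of how a URL might be reversed into an LRU. The pipe-separated stem format used below is an assumption for readability; the normative format is defined in the Memory structure specifications.

```python
from urllib.parse import urlparse

def url_to_lru(url):
    """Reverse a URL into an LRU-like stem string.

    Illustrative only: the exact stem prefixes and ordering are an
    assumption here, not the project's normative format.
    """
    parsed = urlparse(url)
    stems = ["s:" + parsed.scheme]
    # Host components are reversed so the hierarchy reads from the
    # most general (TLD) down to the most specific (subdomain).
    stems += ["h:" + part for part in reversed(parsed.hostname.split("."))]
    # Path components keep their natural order below the host.
    stems += ["p:" + part for part in parsed.path.split("/") if part]
    if parsed.query:
        stems.append("q:" + parsed.query)
    if parsed.fragment:
        stems.append("f:" + parsed.fragment)
    return "|".join(stems)

# Example:
# url_to_lru("http://www.example.com/team/about?lang=en")
# -> "s:http|h:com|h:example|h:www|p:team|p:about|q:lang=en"
```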
This is draft specification material. Feel free to discuss it on the discussion page.
The corpus builder is the main user interface. It is where users will define web entities and control their navigation and crawls. It will be built as a Firefox plugin.
It has to interface with the core through a JSON API (see Core specifications).
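To make the idea concrete, here is a hedged sketch of what a call from the corpus builder to the core's JSON API could look like. The endpoint URL, port and method name are illustrative assumptions, not the actual API (see Core specifications for that).

```python
import json
from urllib.request import urlopen, Request

# Hypothetical call from the corpus builder to the core's JSON API.
# The endpoint, port and method name ("declare_web_entity") are
# assumptions for illustration only.
payload = json.dumps({
    "method": "declare_web_entity",
    "params": {"name": "Example site", "lru_prefix": "s:http|h:com|h:example"},
}).encode("utf-8")

request = Request("http://localhost:6978/api/",
                  data=payload,
                  headers={"Content-Type": "application/json"})
with urlopen(request) as response:
    print(json.loads(response.read()))
```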
The core is the coordinator of the other modules. It bridges the UI, the memory structure and the crawler. It is an important piece which can be seen as the corpus maintainer described in previous project versions.
The first prototype has been developed in Python using the Twisted framework; see Core specifications for more details.
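For illustration only, a minimal sketch of how such a core prototype might expose a JSON method with Twisted; the method dispatch and port number are assumptions, not the prototype's actual code.

```python
import json
from twisted.web import resource, server
from twisted.internet import reactor

class CoreApi(resource.Resource):
    """Sketch of a single JSON endpoint the core prototype might expose."""
    isLeaf = True

    def render_POST(self, request):
        call = json.loads(request.content.read())
        # A real core would dispatch to the memory structure or the
        # crawler; here we simply echo the method name back.
        result = {"result": "received %s" % call.get("method")}
        request.setHeader(b"Content-Type", b"application/json")
        return json.dumps(result).encode("utf-8")

if __name__ == "__main__":
    reactor.listenTCP(6978, server.Site(CoreApi()))
    reactor.run()
```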
Well, we will finally have to crawl the web! The crawler will probably be coded using the Scrapy framework.
The specifications of the crawler will be written from the work done here: Scrapy_implementation_proposal. See also Core#Toward_scrapy_implementation and Core#interface_with_the_crawler.
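As a hedged sketch of the kind of spider the crawler could be built from (the spider name, start URL and item fields are assumptions; the real specification lives in Scrapy_implementation_proposal):

```python
import scrapy

class PageLinksSpider(scrapy.Spider):
    """Minimal sketch of a Scrapy spider extracting outgoing links."""
    name = "hci_page_links"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # Emit the page body and its outgoing links, which the core
        # would forward to the raw data level and the memory structure.
        yield {
            "url": response.url,
            "body": response.text,
            "links": response.xpath("//a/@href").extract(),
        }
```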
The memory structure has to efficiently store the web contents provided by the core, on which the corpus will be built.
The memory structure will store reversed URLs (LRUs), links and web entities.
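A minimal sketch of the kind of records this implies, written in Python for readability even though the memory structure itself is built on Lucene; the field names are assumptions, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LruNode:
    lru: str                 # reversed URL stem, e.g. "s:http|h:com|h:example"
    crawled: bool = False

@dataclass
class Link:
    source_lru: str
    target_lru: str
    weight: int = 1          # aggregated count once the precision limit is reached

@dataclass
class WebEntity:
    name: str
    # Pages whose LRU starts with one of these prefixes belong to the entity.
    lru_prefixes: List[str] = field(default_factory=list)
```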
It will be developed on top of Lucene technologies.
A first prototype has been developed by Jérome Thièvre from INA DL WEB to test assumptions about the reactivity of such a memory structure when scaling up. The tests were promising.
Lucene will therefore be our framework for this important part of the software. The memory structure and the core are not integrated yet, but this is the next step in the roadmap (see Core specifications).
Memory structure specifications
See raw data level above.
Online cartographic representation of the corpus.