Session raw discussion 2
jrault edited this page Dec 21, 2012
·
1 revision
e-Diasporas Atlas
- get web entity
- crawl at distance 1
- Q : redefinition of distance and depth in relation to web entities. Limiting the crawl to distance 1 is a good idea to keep things at hand.
- Q : what is the default behavior, how to define exceptions, in what scope
- get seed in and seed out to decide whether the web resources discovered at distance 1 co-link the current entity
- import
- URL
- rules to create web entities from URL with/without heuristics and then refine
- extract stem and consider it as a web entity
- extract stem and consider it as a web entity AND ask for pages
- web entities: a grammar has to be defined, but a stem string might be a good idea.
- codebook
- URL
- focused crawling is a special functionality with a prospective aim. This task will require a greater distance.
- history of exploration or traceability - time of the day for web crawling, in what order were pages crawled, what definition and redefinition of web entities can we see, etc.
- hide technical details of the crawling and expert terminology
- possibly create a branch to offer crawling experts a tool that matches their expectations
- cursor to define web entity
- HCI is live crawling software: it requires human intervention at each step
- each step is defined and planned in a task manager
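
The import and distance-1 notes above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not the tool's actual behavior: the "stem" is assumed here to be the lower-cased host name without a leading `www.`, and the names `url_stem`, `group_by_stem` and `co_linking_stems` are hypothetical. The real grammar for web entities is still to be defined.

```python
from urllib.parse import urlsplit


def url_stem(url: str) -> str:
    """Return an assumed 'stem' for a URL: its host name without 'www.'.

    This is a placeholder heuristic; the notes leave the grammar open.
    """
    host = urlsplit(url).hostname or ""  # hostname is already lower-cased
    return host[4:] if host.startswith("www.") else host


def group_by_stem(urls):
    """Group imported URLs into candidate web entities keyed by stem."""
    entities = {}
    for url in urls:
        entities.setdefault(url_stem(url), []).append(url)
    return entities


def co_linking_stems(entity_stem, links_out, links_in):
    """Return stems at distance 1 that co-link the current entity.

    links_out: URLs the entity links to ("seed out").
    links_in:  URLs that link to the entity ("seed in").
    A stem co-links the entity when it appears in both sets.
    """
    out_stems = {url_stem(u) for u in links_out}
    in_stems = {url_stem(u) for u in links_in}
    return (out_stems & in_stems) - {entity_stem}


if __name__ == "__main__":
    imported = [
        "http://www.example.org/blog/post",
        "http://example.org/about",
        "http://other.net/",
    ]
    print(group_by_stem(imported))
    print(co_linking_stems(
        "example.org",
        links_out=["http://other.net/a", "http://third.com/"],
        links_in=["http://www.other.net/b"],
    ))
```

A stricter or looser stem (for example keeping the first path segment) would only change `url_stem`; the grouping and co-link check would stay the same, which leaves room for the "refine" step mentioned in the import rules.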