Description
Summary. In June 2018, DeepMind published the paper "Relational inductive biases, deep learning, and graph networks" (https://arxiv.org/abs/1806.01261), which presents a framework that generalizes existing graph network (GN) approaches. In our work, "Pagerank Estimation using Deep Graph Networks", we want to train a graph network to estimate the global pagerank of a domain.
Data gathering. To do so, we want to develop a web crawler that visits roughly the top 200,000 domains, takes screenshots of about 20 web pages per domain, and gathers further data such as HTML structure, text content, and number of backlinks. In contrast to other web crawlers, our crawler will crawl only a fixed number of URLs within the same domain, pre-process them, and save the result into a database.
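The bounded, same-domain crawl described above could be sketched as follows. This is only a minimal illustration, not the planned crawler: `crawl_domain` and the `get_links` callable are hypothetical names, and fetching, screenshotting, and database storage are left out. It also assumes that "same domain" means "same host name", which is a simplification.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_domain(start_url, get_links, max_pages=20):
    """Breadth-first crawl that stays on the start URL's host.

    get_links is a hypothetical callable mapping a URL to the list of
    absolute URLs it links to; a real crawler would fetch and parse HTML here.
    Returns the visited pages and the same-host links found between them.
    """
    host = urlparse(start_url).netloc
    visited = []              # pages in visit order
    seen = {start_url}        # URLs that were already queued
    edges = []                # (source, target) links between same-host pages
    queue = deque([start_url])

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for target in get_links(url):
            if urlparse(target).netloc != host:
                continue      # skip links that leave the domain
            edges.append((url, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)

    # Keep only links whose endpoints were both actually visited.
    kept = set(visited)
    edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return visited, edges
```

Passing `max_pages=20` matches the roughly 20 pages per domain mentioned above; the returned page list and link list are the raw material for one graph per domain.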
Input. The stored graphs are the input to the graph network. Each node represents a single web page and consists of an image (screenshot), textual information that we would encode with word embeddings, and other hand-picked features. Each edge corresponds to a link between two web pages.
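One possible in-memory layout for such a graph is sketched below. The class and field names (`PageNode`, `DomainGraph`, `features`, and so on) are assumptions made for illustration, as is the choice to store the screenshot as nested lists of grayscale values; the actual storage format is not fixed yet.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PageNode:
    """One web page of a domain (all field names are illustrative)."""
    url: str
    screenshot: List[List[float]]    # grayscale pixels, rows x cols (assumed format)
    text_embedding: List[float]      # e.g. averaged word embeddings of the page text
    features: Dict[str, float] = field(default_factory=dict)  # hand-picked features


@dataclass
class DomainGraph:
    """All crawled pages of one domain and the links between them."""
    domain: str
    nodes: List[PageNode] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)  # (sender, receiver) indices

    def add_page(self, node: PageNode) -> int:
        """Append a page and return its node index."""
        self.nodes.append(node)
        return len(self.nodes) - 1

    def add_link(self, sender: int, receiver: int) -> None:
        """Add a directed link between two existing pages."""
        n = len(self.nodes)
        if not (0 <= sender < n and 0 <= receiver < n):
            raise IndexError("link endpoint is not a node of this graph")
        self.edges.append((sender, receiver))

    def out_degree(self, index: int) -> int:
        """Number of links leaving the page at the given index."""
        return sum(1 for s, _ in self.edges if s == index)
```

Storing edges as index pairs keeps the structure close to the sender/receiver convention used in the graph network paper, which should make a later conversion to tensors straightforward.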
Output. We would then train the network to estimate a score that positions a given domain relative to other domains. This ranking (and some other information that we might predict as well) is available to us as ground truth for existing web pages (https://www.alexa.com/siteinfo).
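How the ground-truth rank is turned into a training target is still open. One option, shown here purely as an assumption, is a log-scaled score in [0, 1] so that differences among the top ranks weigh more than differences far down the list. The function name `rank_to_score` and the default `max_rank` (taken from the 200,000 domains mentioned above) are illustrative.

```python
import math


def rank_to_score(rank, max_rank=200_000):
    """Map a rank in 1..max_rank to a score in [0, 1]; rank 1 gives 1.0.

    The log scaling is an assumed design choice, not a fixed part of the
    project: it compresses the long tail of low-ranked domains.
    """
    if not 1 <= rank <= max_rank:
        raise ValueError("rank must be between 1 and max_rank")
    return 1.0 - math.log(rank) / math.log(max_rank)
```

A pairwise ranking loss would be an alternative that avoids committing to any particular score scale.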
Motivation. We want to (1) run inference on new, unranked web pages to get a rough estimate of their expected popularity, (2) compute a heat map that highlights the areas of a screenshot that support a higher vs. a lower pagerank (computed from the gradient of the loss with respect to the input), and (3) examine the applicability and power of GNs on real-world problems in a scientific manner.
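The heat map in (2) amounts to the magnitude of the gradient of the loss with respect to each screenshot pixel. The sketch below illustrates the idea with central finite differences on a toy loss; with a real network this gradient would come from the framework's automatic differentiation instead. `input_gradient_heatmap`, `toy_loss`, and `WEIGHTS` are hypothetical names, and the toy "model" is just a fixed weighted sum of pixels.

```python
def input_gradient_heatmap(loss_fn, image, eps=1e-4):
    """Absolute gradient of loss_fn w.r.t. each pixel, via central differences."""
    heatmap = []
    for r, row in enumerate(image):
        heat_row = []
        for c, value in enumerate(row):
            # Perturb a single pixel up and down, leaving the rest unchanged.
            plus = [list(x) for x in image]
            minus = [list(x) for x in image]
            plus[r][c] = value + eps
            minus[r][c] = value - eps
            grad = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)
            heat_row.append(abs(grad))
        heatmap.append(heat_row)
    return heatmap


# Hypothetical toy "model": a fixed weighted sum of pixels with squared-error loss.
WEIGHTS = [[0.0, 2.0], [-1.0, 0.0]]


def toy_loss(image, target=0.0):
    score = sum(
        w * p
        for w_row, p_row in zip(WEIGHTS, image)
        for w, p in zip(w_row, p_row)
    )
    return (score - target) ** 2


heatmap = input_gradient_heatmap(toy_loss, [[1.0, 1.0], [1.0, 1.0]])
```

In this toy case, pixels with zero weight get zero heat, and the pixel with the largest absolute weight gets the most, which is the behaviour one would hope to see on screenshots.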
Related work. Google Scholar searches suggest that there are only a few publications in this field. The most notable one (http://delab.csd.auth.gr/~dimitris/courses/ir_spring06/page_rank_computing/01517930.pdf) is from 2005 and uses neither visual information (screenshots) nor deep representations. Another work (https://dl.acm.org/citation.cfm?id=1487317) appears to be an overview paper.