This dataset was created and used in the student research project Vision-based Page Rank Estimation by Timo Denk and Samed Güner. It contains screenshots and meta information of some of the top 100k most popular domains on the web. It was used to train a combination of a CNN and a graph network (Battaglia et al., 2018) to determine the correlation between the page rank and the appearance of a web page.
The dataset contains 83,165 samples. Each sample is a screenshot of a given web page as a visitor would see it with a common web browser on a desktop machine. The dataset has the following folder structure (the rank is the image name):
The dataset contains 83,165 samples. Each sample has a rank and represents an enriched directed graph of the link structure of a given domain. Nodes are web pages, and edges represent hyperlinks between them. Each web page comes with a mobile and a desktop screenshot.
A sample is a directed graph with up to eight nodes. It is encoded in JSON format, where nodes contain the following information:
id: identifier of the node, corresponds to the image number
baseUrl: base URL of the website
client_status: client error status (e.g., unverified SSL connection)
server_status: server status code (e.g., 404 Not Found)
startNode: whether the given node is the entry point to the domain
loading_time: loading time in milliseconds
size: size of the web page in kB
title: web page title
urls: list of hyperlinks to other web pages on the same domain
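The node fields listed above could be read as in the following minimal Python sketch. It makes several assumptions that are not confirmed by the dataset description: that a sample's JSON is a top-level array of node objects, that the value types are as annotated, and that the inline `EXAMPLE` data (entirely made up) resembles a real sample. The names `Node`, `parse_sample`, and `find_start_node` are hypothetical helpers, not part of the dataset.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One web page of a sample graph; field types are assumptions."""
    id: int
    base_url: str
    client_status: Optional[str]
    server_status: Optional[int]
    start_node: bool
    loading_time: float  # milliseconds
    size: float          # kB
    title: str
    urls: List[str]      # hyperlinks to pages on the same domain


def parse_sample(text: str) -> List[Node]:
    """Parse a sample, ASSUMING the JSON is a top-level array of node objects."""
    return [
        Node(
            id=raw["id"],
            base_url=raw["baseUrl"],
            client_status=raw.get("client_status"),
            server_status=raw.get("server_status"),
            start_node=bool(raw.get("startNode", False)),
            loading_time=raw["loading_time"],
            size=raw["size"],
            title=raw.get("title", ""),
            urls=list(raw.get("urls", [])),
        )
        for raw in json.loads(text)
    ]


def find_start_node(nodes: List[Node]) -> Optional[Node]:
    """Return the node flagged as the entry point to the domain, if any."""
    return next((node for node in nodes if node.start_node), None)


# Made-up sample for illustration only; real values and layout may differ.
EXAMPLE = json.dumps([
    {"id": 1, "baseUrl": "https://example.com", "client_status": None,
     "server_status": 200, "startNode": True, "loading_time": 840,
     "size": 512, "title": "Example", "urls": ["https://example.com/about"]},
    {"id": 2, "baseUrl": "https://example.com", "client_status": None,
     "server_status": 404, "startNode": False, "loading_time": 120,
     "size": 3, "title": "Not Found", "urls": []},
])

nodes = parse_sample(EXAMPLE)
entry = find_start_node(nodes)
```

Since a sample has at most eight nodes, a plain list with a linear scan for the entry point is sufficient; no graph library is needed for this kind of inspection.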
The dataset has the following folder structure:
The websites and their ranks are taken from the top 100,000 websites of Open PageRank, which relies on the number of backlinks to rank websites.
Please cite our work on Vision-based Page Rank Estimation if you intend to use this dataset.