Vision-based Page Rank Estimation Dataset

Context

This dataset was created and used in the student research project Vision-based Page Rank Estimation by Timo Denk and Samed Güner. It contains screenshots and meta information of some of the top 100k most popular domains on the web. It was used to train a combination of a CNN and a graph network (Battaglia et al., 2018) to determine the correlation between the page rank and the appearance of a web page.

Content

Dataset Version 1

The dataset contains 83,165 samples. Each sample is a screenshot of a given web page as a visitor would see it with a common web browser on a desktop machine. The dataset has the following folder structure (the rank is the image name):

    \<domain-rank>.jpg
   ⋮
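
Because the rank is encoded in the file name, a list of (rank, path) pairs can be built from the directory alone. The following is a minimal sketch, not part of the original project code. It assumes that the file stem is a plain integer rank and that all images sit directly in one folder; the function name `list_screenshots` is hypothetical.

```python
from pathlib import Path


def list_screenshots(root):
    """Return (rank, path) pairs sorted by rank.

    Assumption: files are named <domain-rank>.jpg and the stem is an integer.
    Files whose stem is not a number are skipped.
    """
    pairs = []
    for path in Path(root).glob("*.jpg"):
        if path.stem.isdigit():
            pairs.append((int(path.stem), path))
    # Sort numerically by rank so that 2.jpg comes before 10.jpg.
    return sorted(pairs, key=lambda pair: pair[0])
```

Sorting numerically rather than by file name avoids the lexicographic ordering in which `10.jpg` would precede `2.jpg`.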

Dataset Version 2

The dataset contains 83,165 samples. Each sample has a rank and represents an enriched directed graph of the link structure of a given domain. Nodes are web pages, and edges represent hyperlinks between them. Each web page comes with a mobile and a desktop screenshot.

A sample is a directed graph with up to eight nodes. It is encoded in JSON format, where nodes contain the following information:

The dataset has the following folder structure:

        \<domain-rank>\<domain-rank>.json
        \img\<image-number>.jpeg
        \img\ …
         ⋮
   ⋮
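
A single sample could be read along these lines. This is a hedged sketch, not the authors' loader. It assumes that each sample lives in a folder named after its rank, that the JSON file and an `img` folder sit directly inside that folder, and that screenshots use the `.jpeg` extension. The function name `load_sample` is hypothetical, and the JSON content is returned as parsed, without assuming any particular keys.

```python
import json
from pathlib import Path


def load_sample(sample_dir):
    """Load one Version 2 sample.

    Assumed layout (an interpretation of the folder structure above):
        <domain-rank>/<domain-rank>.json
        <domain-rank>/img/<image-number>.jpeg
    Returns the rank (folder name, as a string), the parsed JSON graph,
    and the list of screenshot paths.
    """
    sample_dir = Path(sample_dir)
    rank = sample_dir.name
    with open(sample_dir / f"{rank}.json", encoding="utf-8") as f:
        graph = json.load(f)
    # Sorted by file name; no numeric ordering of image numbers is implied.
    images = sorted((sample_dir / "img").glob("*.jpeg"))
    return rank, graph, images
```

If the actual archive nests the folders differently, only the two path expressions need to change.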

Sample JSON File



    

Acknowledgments

The websites and their ranks are taken from the top 100,000 websites of Open PageRank, which relies on the number of backlinks to rank websites.

Downloads

Reference

Please cite our work on Vision-based Page Rank Estimation if you intend to use this dataset.


Legal