Linguistic-based features for Web Spam Detection

The datasets contain linguistic-based attributes computed at the document level.

There are 4 datasets available: each of 2 different Web Corpora (URL: http://www.yr-bcn.es/webspam/datasets/) was processed by each of 2 different software tools.

The datasets:
size | name
228M | lingSpamFeat-corleone06.gz
779M | lingSpamFeat-corleone07.gz
613M | lingSpamFeat-gi06.gz
1.9G | lingSpamFeat-gi07.gz

The names of the 4 data files encode their contents:
a) the strings '06' and '07' refer to the 2006 and 2007 Web Spam Corpora, respectively (in all cases, at most the first 400 pages from each host were considered)
b) the strings 'corleone' and 'gi' refer to the software tools used to compute the attributes
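
The naming scheme above can be decoded mechanically. The following Python sketch is only an illustration: the function name describe_dataset and the returned keys are hypothetical, and the pattern covers only the 4 file names listed in this README.

```python
import re

# Pattern derived from the 4 file names listed in this README.
_NAME = re.compile(r"^lingSpamFeat-(corleone|gi)(06|07)\.gz$")


def describe_dataset(filename):
    """Return the tool and corpus year encoded in a dataset file name."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError("unexpected dataset file name: %r" % filename)
    tool, year = match.groups()
    # '06' -> 2006 corpus, '07' -> 2007 corpus
    return {"tool": tool, "corpus_year": 2000 + int(year)}
```

For example, describe_dataset("lingSpamFeat-gi07.gz") reports the tool 'gi' and the 2007 corpus.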

Format of the data:
- each dataset is a gz-compressed ASCII file
- each dataset contains records (lines)
- the first line contains a tab-separated list of the attribute names
- every other line is a record describing a single Web page: a URL followed by a tab-separated list of numerical attribute values
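
A minimal Python sketch of reading a file in this format, streaming it so that the large files need not fit in memory. The function name read_records is hypothetical. This README does not say whether the header line also names the URL column, so the sketch handles both cases; that handling is an assumption, as is the conversion of every attribute value with float().

```python
import gzip
from typing import Dict, Iterator, List, Tuple


def read_records(path: str) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield (url, {attribute_name: value}) for each record in a dataset."""
    # errors="replace" keeps reading if a stray non-ASCII byte appears.
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        # First line: tab-separated attribute names.
        header: List[str] = handle.readline().rstrip("\r\n").split("\t")
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if not fields[0]:
                continue  # skip blank lines
            url, values = fields[0], fields[1:]
            # Assumption: if the header has one name per field, its first
            # name labels the URL column; otherwise it lists attributes only.
            names = header[1:] if len(header) == len(fields) else header
            yield url, {n: float(v) for n, v in zip(names, values)}
```

Usage could look like: for url, attrs in read_records("lingSpamFeat-corleone06.gz"): ...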

Integrity of the downloaded data can be checked with the md5sums.txt file.
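
A possible way to perform this check in Python is sketched below. It assumes that md5sums.txt uses the usual md5sum output layout (a hex checksum, whitespace, then the file name on each line); this README does not specify the layout, and the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(path):
    """Parse lines of the assumed form '<checksum>  <file name>'."""
    expected = {}
    with open(path, "r", encoding="ascii") as handle:
        for line in handle:
            parts = line.split(None, 1)
            if len(parts) == 2:
                checksum, name = parts
                # md5sum marks binary mode with a leading '*' on the name.
                expected[name.strip().lstrip("*")] = checksum.lower()
    return expected


def verify(data_path, name, md5sums_path="md5sums.txt"):
    """Return True if the file's digest matches the listed checksum."""
    return parse_md5sums(md5sums_path).get(name) == md5_of_file(data_path)
```

For example, verify("lingSpamFeat-gi06.gz", "lingSpamFeat-gi06.gz") returns True only if the downloaded file matches its listed checksum.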

More information:
http://www.users.pjwstk.edu.pl/~msyd/lingSpamFeatures.html

(msyd 18.04.2008)