Search Engines vs. SEO Spam: Statistical Methods
by Broder et al. [2] will be used. Figure 7 shows the distribution of cluster sizes for near-duplicate pages in Set 1. The horizontal axis shows the size of a cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many clusters of that size Set 1 contains.
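To make the idea of a near-equivalence class concrete, here is a minimal sketch of shingle-based clustering and of the cluster-size distribution that a plot like Figure 7 is built from. It is a simplified illustration, not the implementation used in [1] or [2]: the function names, the 4-word shingle width, the 0.5 resemblance threshold and the brute-force pairwise comparison are all assumptions chosen for readability, and a web-scale system would need sampling or hashing rather than comparing every pair of pages.

```python
from collections import Counter


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word sequences) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short texts yield a single shingle holding all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Jaccard resemblance of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(docs, threshold=0.5, w=4):
    """Group documents whose pairwise resemblance reaches the threshold.

    Uses union-find over all pairs (quadratic time), which is only
    practical for small illustrative inputs.
    """
    sets = [shingles(d, w) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if resemblance(sets[i], sets[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def cluster_size_distribution(clusters):
    """Map cluster size -> number of clusters of that size."""
    return Counter(len(c) for c in clusters)
```

Plotting the keys of the returned counter on the horizontal axis and the values on the vertical axis gives a chart of the same kind as the one described above. Unusually large clusters then stand out as outliers.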
The outliers fall into two groups. The first group contained no spam pages; its clusters are related to the duplicated-content issue instead. The second group, by contrast, is populated predominantly by spam documents: 15 of the 20 largest clusters were spam, containing 2,080,112 pages in total (1.38% of all pages in Set 1).
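The kind of summary quoted above (how many of the largest clusters are spam, and what share of all pages they hold) can be computed as in the following sketch. The function name, the `(size, is_spam)` tuple layout and the sample numbers are hypothetical and are not taken from the study. The spam labels are assumed to come from manual inspection of each cluster.

```python
def top_cluster_spam_summary(clusters, total_pages, k=20):
    """Summarize spam among the k largest clusters.

    clusters: list of (size, is_spam) tuples, one per cluster.
    Returns (number of spam clusters in the top k,
             pages in those spam clusters,
             their share of total_pages).
    """
    top = sorted(clusters, key=lambda c: c[0], reverse=True)[:k]
    spam_pages = sum(size for size, is_spam in top if is_spam)
    spam_count = sum(1 for _, is_spam in top if is_spam)
    return spam_count, spam_pages, spam_pages / total_pages


# Hypothetical toy data: five clusters in a collection of 10,000 pages.
sample = [(500, True), (300, False), (200, True), (50, True), (10, False)]
count, pages, share = top_cluster_spam_summary(sample, total_pages=10_000, k=3)
```

With this toy input, the three largest clusters are considered, two of them are spam, and together they hold 700 pages, or 7% of the collection.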
To Sum Up
The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning techniques, which allow search engines to detect and fight spam with relatively high efficiency and an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition, based on the quality of web resources rather than on technical tricks.
References:
1. D. Fetterly, M. Manasse, and M. Najork. “Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages”. Microsoft Research, 2004.
2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. “Syntactic Clustering of the Web”. In 6th International World Wide Web Conference, April 1997.
Oleg Ishenko, MCSE, MCDBA, BSc