The Connectivity Sonar: Detecting Site Functionality by Structural Patterns

Authors

  • Ronny Lempel, IBM Research Labs, Haifa 31905, Israel
  • Einat Amitay, IBM Research Labs, Haifa 31905, Israel
  • David Carmel, IBM Research Labs, Haifa 31905, Israel
  • Adam Darlow, IBM Research Labs, Haifa 31905, Israel
  • Aya Soffer, IBM Research Labs, Haifa 31905, Israel

Abstract

Web sites today serve many different functions: corporate sites, search engines, e-stores, and so forth. As sites are created for different purposes, their structure and connectivity characteristics vary. This research argues, however, that sites with similar roles exhibit similar structural patterns, because the functionality of a site naturally induces a typical hyperlinked structure and typical connectivity patterns to and from the rest of the Web. Thus, the functionality of Web sites is reflected in a set of structural and connectivity-based features that form a typical signature. In this paper, we automatically categorize sites into eight distinct functional classes, and highlight several search-engine related applications that could make immediate use of such technology. We purposely restrict our categorization algorithms to connectivity and structural data alone, making no use of any content analysis whatsoever. When two classification algorithms were applied to a set of 202 sites from the eight defined functional categories, they correctly classified between 54.5% and 59% of the sites. On some categories, the precision of the classification exceeded 85%. An additional result of this work indicates that the structural signature can be used to detect spam rings and mirror sites by clustering sites with almost identical signatures.

Published

2006-02-27