Robert Meusel defended his PhD Thesis

From left to right: Heiner Stuckenschmidt, Christian Bizer, Robert Meusel, Wolfgang Nejdl

On March 10th, Robert Meusel successfully defended his PhD thesis "Web-Scale Profiling of Semantic Annotations in HTML Pages". The supervisor was Prof. Christian Bizer; the second reader was Prof. Wolfgang Nejdl from Leibniz Universität Hannover.

Abstract of the thesis:

The vision of the Semantic Web was coined by Tim Berners-Lee almost two decades ago. The idea describes an extension of the existing Web in which “information is given well-defined meaning, better enabling computers and people to work in cooperation” [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one realization of this vision which was adopted by large numbers of web sites in the last years. Semantic annotations are integrated into the code of HTML pages using one of the three markup languages Microformats, RDFa, or Microdata. Major consumers of semantic annotations are the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic annotations from crawled web pages to enrich the presentation of search results and to complement their knowledge bases.

However, outside the large search engine companies, little is known about the deployment of semantic annotations: How many web sites deploy semantic annotations? What are the topics covered by semantic annotations? How detailed are the annotations? Do web sites use semantic annotations correctly? Are semantic annotations useful for others than the search engine companies? And how can semantic annotations be gathered from the Web in that case? The thesis answers these questions by profiling the web-wide deployment of semantic annotations.

The topic is approached in three consecutive steps: In the first step, two approaches for extracting semantic annotations from the Web are discussed. The thesis evaluates first the technique of focused crawling for harvesting semantic annotations. Afterward, a framework to extract semantic annotations from existing web crawl corpora is described. The two extraction approaches are then compared for the purpose of analyzing the deployment of semantic annotations in the Web.

In the second step, the thesis analyzes the overall and markup language-specific adoption of semantic annotations. This empirical investigation is based on the largest web corpus that is available to the public. Further, the topics covered by deployed semantic annotations and their evolution over time are analyzed. Subsequent studies examine common errors within semantic annotations. In addition, the thesis analyzes the data overlap of the entities that are described by semantic annotations from the same and across different web sites.

The third step narrows the focus of the analysis towards use case-specific issues. Based on the requirements of a marketplace, a news aggregator, and a travel portal the thesis empirically examines the utility of semantic annotations for these use cases. Additional experiments analyze the capability of product-related semantic annotations to be integrated into an existing product categorization schema. Especially, the potential of exploiting the diverse category information given by the web sites providing semantic annotations is evaluated.

Keywords:

Dataspace Profiling, RDFa, Microformats, Microdata, Schema.org, Crawling

Full-text:

The full text of the thesis is available from the MADOC document server.