Writing a Thesis

On this page you will find a list of thesis topics that are currently available or proposed by members of our group. Each proposal includes a link to, or the email address of, a contact person.

You can also find more information on writing a bachelor's or master's thesis within our group via the following links.

Available, ongoing and completed topics

Master Thesis: Integrating Schema.org Product Data into a Global Product Catalog (Bizer/Primpeli)

A large number of e-shops have started to mark up structured data about products, offers and reviews in their HTML pages using the markup standard Microdata and the schema.org vocabulary.

In the context of the WebDataCommons project, we have extracted a large corpus of product data from the Common Crawl web corpus. The product data corpus can be found here (177,000,000 product records, 143,000,000 offers, 20,000,000 reviews). In general, schema.org products are described on the Web with only a few properties, including a free-text description of the product. A relatively small number of e-shops also publish product identifiers (productID, gtin13, mpn) as well as categorization information (category) for their products or offers.
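
To illustrate what such markup looks like and how the properties mentioned above could be read from a page, here is a minimal sketch using only the Python standard library. The HTML snippet, the product values and the class name ProductMicrodataParser are made up for illustration; the sketch only handles flat schema.org/Product items (nested items such as offers are not treated separately) and is not the extraction pipeline used by the WebDataCommons project.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change the nesting depth.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class ProductMicrodataParser(HTMLParser):
    """Collect itemprop values of schema.org/Product items (flat sketch)."""

    def __init__(self):
        super().__init__()
        self.products = []    # one dict per product item found
        self._current = None  # properties of the product being read
        self._depth = 0       # open elements inside the current product
        self._prop = None     # itemprop whose text content is still expected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            itemtype = attrs.get("itemtype") or ""
            if "itemscope" in attrs and itemtype.endswith("schema.org/Product"):
                self._current = {}
                self._depth = 1
            return
        if tag not in VOID_TAGS:
            self._depth += 1
        prop = attrs.get("itemprop")
        if prop:
            if "content" in attrs:
                # e.g. <meta itemprop="gtin13" content="...">
                self._current[prop] = attrs["content"]
            elif tag not in VOID_TAGS:
                self._prop = prop

    def handle_data(self, data):
        if self._current is not None and self._prop and data.strip():
            self._current[self._prop] = data.strip()
            self._prop = None

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        self._prop = None
        if self._depth == 0:
            # The element that opened the product item has been closed.
            self.products.append(self._current)
            self._current = None


# Hypothetical example page fragment (all values are invented).
html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Acme Phone X</span>
  <meta itemprop="gtin13" content="0000000000017">
  <span itemprop="mpn">APX-64</span>
  <p itemprop="description">64 GB phone with a 6.1 inch screen.</p>
</div>
"""

parser = ProductMicrodataParser()
parser.feed(html)
parser.close()
print(parser.products)
```

In this invented example the product carries both an identifier (gtin13, mpn) and a free-text description, which is the less common case described above; many pages would only provide the name and description.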

The aim of this thesis is to develop methods for integrating schema.org product data into a global product catalog covering a single or multiple product categories.

The thesis would focus on developing and evaluating methods in one or more of the following areas:

  • Feature Extraction: Develop methods for extracting product features (brand, screen size, memory size, …) from the textual product descriptions. The methods can use existing product catalogs or product IDs published by multiple shops as a source of supervision for learning feature extractors.
  • Identity Resolution: Develop identity resolution methods for finding out which e-shops sell the same product. The methods can use product IDs published by multiple shops as a source of supervision for learning identity resolution heuristics.
  • Product Categorization: Develop methods for assigning products to the categories of a product hierarchy. The methods can use existing product classifications, classification information that is published by the e-shops, as well as product IDs published by multiple shops as a source of supervision.
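
The idea of using product IDs published by multiple shops as a source of supervision can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed method: the function name build_training_pairs, the record fields shop and title, and the example offers are hypothetical. Offers from different shops that share the same gtin13 are treated as positive (matching) pairs, while offers with different identifiers yield negative pairs; such labeled pairs could then be used to train or evaluate an identity resolution method.

```python
from collections import defaultdict
from itertools import combinations


def build_training_pairs(offers, id_key="gtin13"):
    """Derive labeled title pairs from shared product identifiers.

    Returns (positives, negatives), each a list of (title_a, title_b, label).
    Offers without an identifier are skipped.
    """
    # Group offers by their published identifier.
    clusters = defaultdict(list)
    for offer in offers:
        identifier = offer.get(id_key)
        if identifier:
            clusters[identifier].append(offer)

    # Same identifier, different shops: the offers describe the same product.
    positives = []
    for group in clusters.values():
        for a, b in combinations(group, 2):
            if a["shop"] != b["shop"]:
                positives.append((a["title"], b["title"], 1))

    # Different identifiers: the offers describe different products.
    # Only one pair per identifier combination is kept to keep the sketch small.
    negatives = []
    for id_a, id_b in combinations(sorted(clusters), 2):
        negatives.append(
            (clusters[id_a][0]["title"], clusters[id_b][0]["title"], 0)
        )
    return positives, negatives


# Invented example offers; shop names, titles and identifiers are placeholders.
offers = [
    {"shop": "shop-a.example", "title": "Acme Phone X 64GB black", "gtin13": "0000000000017"},
    {"shop": "shop-b.example", "title": "ACME PhoneX (64 GB)", "gtin13": "0000000000017"},
    {"shop": "shop-c.example", "title": "Acme Tablet 10 inch", "gtin13": "0000000000024"},
    {"shop": "shop-d.example", "title": "Some product without an ID"},
]

positives, negatives = build_training_pairs(offers)
print(positives)
print(negatives)
```

The same grouping could serve the other two areas as well: offers in one identifier cluster can be compared to learn which description tokens denote features, and a category published by one shop can be propagated to the other offers in the cluster. In practice, published identifiers can be noisy, so the derived labels should be treated as weak supervision.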

You will first work with a subset of the data. Once the methods work for the subset, you will be given the necessary compute power (locally at DWS or on Amazon EC2) to apply your methods to the complete dataset.

Your skills:

  • Preferred Expertise: Programming (Java or another language), Data Mining, Databases; NLP is a plus.
  • Relevant Lectures: IE 500 Data Mining, IE 670 Web Data Integration, IE 671 Web Mining, IE 663 Information Retrieval

For more information please contact Christian Bizer or Anna Primpeli.

References: