Master Thesis: Integrating Product Data into a Global Product Catalog (Bizer/Primpeli)

A large number of e-shops have started to mark up structured data about products, offers and reviews in their HTML pages using the markup standard Microdata and the schema.org vocabulary.

In the context of the WebDataCommons project, we have extracted a large corpus of product data from the Common Crawl web corpus. The corpus contains 177,000,000 product records, 143,000,000 offers, and 20,000,000 reviews. In general, products are described on the Web with only a few properties, including a free-text description of the product. A relatively small number of e-shops also publish product identifiers (productID, gtin13, mpn) as well as categorization information (category) for their products or offers.
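To make the shape of such records concrete, the following minimal sketch reads the itemprop values from a small, flat Microdata snippet using only the Python standard library. The HTML snippet, the product, the GTIN value, and the helper names (ItempropCollector, extract_props) are made up for illustration. This is not the extraction pipeline used by WebDataCommons, and it ignores nested items.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collects itemprop -> value pairs from a flat (non-nested) Microdata item."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._open = None  # itemprop whose text content is currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if "content" in attrs:
            # e.g. <meta itemprop="gtin13" content="...">
            self.props[name] = attrs["content"]
        else:
            # The value is the text inside the element.
            self._open = name
            self.props[name] = ""

    def handle_data(self, data):
        if self._open is not None:
            self.props[self._open] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._open = None


def extract_props(html):
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return {name: value.strip() for name, value in parser.props.items()}


# Hypothetical product page fragment (all values are invented).
SNIPPET = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Acme Phone X</span>
  <meta itemprop="gtin13" content="0000000000017">
  <span itemprop="category">Electronics &gt; Phones</span>
  <p itemprop="description">A 64 GB phone with a 6.1 inch screen.</p>
</div>
"""

record = extract_props(SNIPPET)
```

Here record maps each property name to its value. Most pages in the corpus would carry only name and description, which is why the identifier and category properties are valuable when they do appear.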

The aim of this thesis is to develop methods for integrating product data into a global product catalog covering one or more product categories.

The thesis will focus on developing and evaluating methods in one or more of the following areas:

  • Feature Extraction: Develop methods for extracting product features (brand, screen size, memory size, …) from the textual product descriptions. The methods can use existing product catalogs or product IDs published by multiple shops as a source of supervision for learning feature extractors.
  • Identity Resolution: Develop identity resolution methods for finding out which e-shops sell the same product. The methods can use product IDs published by multiple shops as a source of supervision for learning identity resolution heuristics.
  • Product Categorization: Develop methods for assigning products to categories in a product hierarchy. The methods can use existing product classifications, classification information that is published by the e-shops, as well as product IDs published by multiple shops as a source of supervision.
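
All three areas mention product IDs published by multiple shops as a source of supervision. The following minimal sketch illustrates that idea for the identity resolution case: offers from different shops that share a gtin13 become positive training pairs, offers with different gtin13 values become negative pairs, and a simple token-overlap similarity of the titles serves as one example feature. The offers, shop names, GTIN values, and function names are hypothetical, and Jaccard similarity is only one of many possible features, not a method prescribed by this topic.

```python
import re
from itertools import combinations

# Hypothetical toy offers; the keys mirror the properties mentioned above.
offers = [
    {"shop": "shop-a.example", "gtin13": "0000000000017",
     "title": "Acme Phone X 64GB black"},
    {"shop": "shop-b.example", "gtin13": "0000000000017",
     "title": "Acme Phone X (64 GB), black"},
    {"shop": "shop-c.example", "gtin13": "0000000000024",
     "title": "Acme Phone X case, black"},
    {"shop": "shop-d.example", "gtin13": None,
     "title": "Acme Phone X 64GB"},
]


def tokens(text):
    """Lowercased alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap of two titles, between 0.0 and 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def labeled_pairs(offers):
    """Build (title_1, title_2, similarity, label) tuples from offers with IDs.

    label is 1 if both offers carry the same gtin13, else 0.
    Offers without an ID cannot be labeled and are skipped.
    """
    with_id = [o for o in offers if o.get("gtin13")]
    pairs = []
    for x, y in combinations(with_id, 2):
        if x["shop"] == y["shop"]:
            continue  # only cross-shop pairs are of interest here
        label = int(x["gtin13"] == y["gtin13"])
        sim = jaccard(x["title"], y["title"])
        pairs.append((x["title"], y["title"], sim, label))
    return pairs


training_pairs = labeled_pairs(offers)
```

A matcher trained on such pairs could then be applied to the large share of offers that publish no identifier, such as the fourth offer in the toy data. The same labeled groups could also supervise feature extraction or categorization, for example by propagating a category from one shop's offer to offers with the same ID.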

You will first work with a subset of the data. Once the methods work for the subset, you will be given the necessary compute power (locally at DWS or on Amazon EC2) to apply your methods to the complete dataset.

Your skills:

  • Preferred Expertise: Programming (Java or another language), Data Mining, Databases; NLP is a plus.
  • Relevant Lectures: IE 500 Data Mining, IE 670 Web Data Integration, IE 671 Web Mining, IE 663 Information Retrieval

For more information please contact Christian Bizer or Anna Primpeli.