Master Thesis: Integrating Product Data using Supervision from the Web (Bizer/Paulheim/Primpeli)

A large number of e-shops have started to mark up structured data about products and offers in their HTML pages using the Microdata markup standard and the schema.org vocabulary.

In the context of the WebDataCommons project, we have extracted a large corpus of product data from the Common Crawl web corpus. The product data corpus contains 682,000,000 product records and 497,000,000 offers. A relatively small number of e-shops also publish product identifiers, which are indicated with one of the following properties: sku, productID, mpn, identifier, gtin14, gtin13, gtin12, and gtin8.

The aim of this thesis is to analyze and evaluate the utility of product identifiers found on the Web as supervision for matching product descriptions. More concretely, the goal of the thesis is to investigate whether it is possible to learn enough product characteristics from the small set of e-shops that do provide product identifiers in order to detect the same products on websites that do not provide identifiers.
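To make the idea of identifiers as supervision more tangible, the following is a minimal sketch of how labeled training pairs might be derived from offers that carry an identifier. The record layout (the fields `id`, `shop`, `title`), the toy offers, the GTIN values, and the function name `label_pairs` are all assumptions made for illustration; they do not reflect the actual schema of the corpus or a prescribed method.

```python
from itertools import combinations

# Hypothetical toy offers; field names and values are made up for illustration.
OFFERS = [
    {"id": 0, "shop": "a.example", "title": "Acme Phone X 64GB black", "gtin13": "0000000000017"},
    {"id": 1, "shop": "b.example", "title": "ACME PhoneX 64 GB (black)", "gtin13": "0000000000017"},
    {"id": 2, "shop": "c.example", "title": "Acme Tablet 10 inch", "gtin13": "0000000000024"},
    {"id": 3, "shop": "d.example", "title": "Acme Phone X case", "gtin13": None},
]


def label_pairs(offers, key="gtin13"):
    """Derive labeled offer pairs from a shared identifier property.

    Returns (positives, negatives), each a list of (id, id) tuples.
    Offers without a value for `key` are skipped: they are the ones
    a learned matcher would later have to handle.
    """
    with_identifier = [o for o in offers if o.get(key)]
    positives, negatives = [], []
    for a, b in combinations(with_identifier, 2):
        if a["shop"] == b["shop"]:
            continue  # only compare offers from different e-shops
        pair = (a["id"], b["id"])
        if a[key] == b[key]:
            positives.append(pair)  # same identifier: assumed same product
        else:
            negatives.append(pair)  # different identifier: assumed different product
    return positives, negatives


positives, negatives = label_pairs(OFFERS)
```

In this sketch, offer 3 has no identifier and therefore produces no training pair; it stands in for the websites on which a model trained on the labeled pairs would be applied.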

The thesis would involve the following tasks:

  • Analysis of Product Identifiers: Analyze the distribution of product identifiers published on the Web. This involves the identification of product entities and product categories for which identifiers are more frequently assigned.
  • Identity Resolution: Develop identity resolution methods for finding out which e-shops sell the same product. Product identifiers will be used as a source of supervision in order to learn classification models. The learned models will be evaluated in terms of how well they can generalize to products without assigned identifiers.
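As a rough illustration of the first task, the sketch below counts how often each identifier property occurs and what share of records per category carries at least one identifier. The property names come from the list above; the `category` field, the sample records, and the function name `identifier_coverage` are assumptions, since the corpus may not expose a category in this form.

```python
from collections import Counter

# Identifier properties named in the topic description.
ID_PROPERTIES = ("sku", "productID", "mpn", "identifier",
                 "gtin14", "gtin13", "gtin12", "gtin8")


def identifier_coverage(records):
    """Return (per_property, share_by_category).

    per_property: Counter of how many records carry each identifier property.
    share_by_category: fraction of records per category with at least one identifier.
    The "category" key is a hypothetical field used only for this sketch.
    """
    per_property = Counter()
    total = Counter()
    with_identifier = Counter()
    for record in records:
        category = record.get("category", "unknown")
        total[category] += 1
        present = [p for p in ID_PROPERTIES if record.get(p)]
        per_property.update(present)
        if present:
            with_identifier[category] += 1
    share = {c: with_identifier[c] / total[c] for c in total}
    return per_property, share


# Made-up sample records.
RECORDS = [
    {"category": "phones", "gtin13": "0000000000017", "sku": "A1"},
    {"category": "phones"},
    {"category": "books", "mpn": "M-9"},
]
per_property, share = identifier_coverage(RECORDS)
```

For the second task, one plausible evaluation design would be to hide the identifiers of some held-out e-shops at training time and use them only to check the predicted matches.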



Your skills:

  • Preferred Expertise: Programming (Java or another language) and Data Mining; NLP is a plus.
  • Relevant Lectures: IE 500 Data Mining, IE 670 Web Data Integration, IE 671 Web Mining, IE 663 Information Retrieval

For more information, please contact Christian Bizer, Heiko Paulheim, or Anna Primpeli.