In the context of the WebDataCommons project, we have extracted a large corpus of product data from the Common Crawl web corpus. The product data corpus comprises 177,000,000 product records, 143,000,000 offers, and 20,000,000 reviews. In general, schema.org/products are described on the Web with only a few properties, including a free-text description of the product. A relatively small number of e-shops also publish product identifiers (productID, gtin13, mpn) as well as categorization information (category) for their products or offers.
The aim of this thesis is to develop methods for integrating schema.org product data into a global product catalog covering one or more product categories.
The thesis would focus on developing and evaluating methods in one or more of the following areas:
- Feature Extraction: Develop methods for extracting product features (brand, screen size, memory size, …) from the textual product descriptions. The methods can use existing product catalogs or product IDs published by multiple shops as a source of supervision for learning feature extractors.
- Identity Resolution: Develop identity resolution methods for finding out which e-shops sell the same product. The methods can use product IDs published by multiple shops as a source of supervision for learning identity resolution heuristics.
- Product Categorization: Develop methods for assigning products to the categories of a product hierarchy. The methods can use existing product classifications, classification information that is published by the e-shops, as well as product IDs published by multiple shops as a source of supervision.
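All three areas above mention product IDs published by multiple shops as a source of supervision. The following is a minimal sketch of that idea for identity resolution, not a prescribed method: offers from different shops that share a gtin13 value are treated as positive pairs, all other cross-shop pairs as negative, and a simple token-based Jaccard similarity on the titles stands in for a learned matcher. The shop names, GTIN values, titles, and the helper names `tokens`, `jaccard`, and `labeled_pairs` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical offers: (shop, gtin13, title). All values are made up.
offers = [
    ("shop-a.example", "0000000000017", "Acme Phone X 64GB black"),
    ("shop-b.example", "0000000000017", "Acme Phone X (64 GB, black)"),
    ("shop-c.example", "0000000000024", "Acme Phone X case, leather"),
    ("shop-a.example", "0000000000024", "Leather case for Acme Phone X"),
]


def tokens(title):
    """Lowercase the title and split it into a set of alphanumeric tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return set(cleaned.split())


def jaccard(a, b):
    """Jaccard similarity of the token sets of two titles."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def labeled_pairs(records):
    """Yield (title_1, title_2, label) for each pair of offers from different shops.

    label is True when both offers carry the same GTIN (assumed to be the same product).
    """
    for (s1, g1, t1), (s2, g2, t2) in combinations(records, 2):
        if s1 != s2:
            yield t1, t2, g1 == g2


pairs = list(labeled_pairs(offers))
for t1, t2, label in pairs:
    print(f"{label!s:5} {jaccard(t1, t2):.2f}  {t1!r} vs {t2!r}")
```

In this toy data the phone and its case share most title tokens, so pairs labeled as non-matches can still score fairly high. That is the kind of case where extracted features or a learned model would likely be needed instead of a fixed similarity threshold.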
You will first work with a subset of the data. Once the methods work for the subset, you will be given the necessary compute power (locally at DWS or on Amazon EC2) to apply your methods to the complete dataset.
- Preferred Expertise: Programming (Java or another language), Data Mining, Databases; NLP experience is a plus.
- Relevant Lectures: IE 500 Data Mining, IE 670 Web Data Integration, IE 671 Web Mining, IE 663 Information Retrieval
- Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC2014) - RDB Track, pp. 277-292, Riva del Garda, Italy, October 2014.
- Petar Petrovski, Volha Bryl, Christian Bizer: Integrating Product Data from Websites offering Microdata Markup. To appear in the Proceedings of the 4th Workshop on Data Extraction and Object Search (DEOS2014) @ WWW 2014, Seoul, Korea, April 2014.
- Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, Johanna Völker: Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. 12th International Semantic Web Conference (ISWC2013), Proceedings Part II, pp. 17-32, Sydney, Australia, October 2013.
- Website: Google Webmaster Tools: Use of structured data to enrich search results
- Lecture: Web Data Integration
- Lecture: Data Mining
- Lecture: Web Mining