WDC Training Dataset and Gold Standard for Large-Scale Product Matching released

The research focus in the field of entity resolution (also known as link discovery or duplicate detection) is moving from traditional symbolic matching methods to matching based on embeddings and deep neural networks. A problem with evaluating deep learning based matchers is that they require large amounts of training data, while the benchmark datasets traditionally used for comparing matching methods are often too small to properly evaluate this new family of methods.

By publishing the WDC Training Dataset and Gold Standard for Large-scale Product Matching, we hope to contribute to solving this problem. The training dataset consists of 26 million product offers (16 million English language offers) originating from 79 thousand different e-shops. For grouping the offers into clusters describing the same product, we rely on product identifiers such as GTINs or MPNs that are annotated with schema.org markup in the HTML pages of the e-shops. Using these identifiers and a specific cleansing workflow, we group the offers into 16 million clusters. Considering only clusters of English offers that contain more than five offers, and excluding clusters with more than 80 offers, which may introduce noise, 20.7 million positive training examples (pairs of matching product offers) and a maximum of 2.6 trillion negative training examples can be derived from the dataset. The training dataset is thus several orders of magnitude larger than the largest training set for product matching that has been publicly accessible so far.
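To make the derivation of training pairs more concrete, the following is a minimal sketch and not the actual WDC cleansing workflow. It assumes that each offer has already been reduced to a hypothetical `(offer_id, identifier)` tuple and that each offer belongs to exactly one cluster. The function names and the default size bounds (more than five and at most 80 offers) are illustrative choices based on the description above.

```python
from collections import defaultdict
from itertools import combinations


def cluster_offers(offers):
    """Group offer ids by their shared product identifier (e.g. a GTIN or MPN)."""
    clusters = defaultdict(list)
    for offer_id, identifier in offers:
        clusters[identifier].append(offer_id)
    return dict(clusters)


def filter_clusters(clusters, min_size=6, max_size=80):
    """Keep clusters with more than five and at most 80 offers (assumed bounds)."""
    return {
        identifier: members
        for identifier, members in clusters.items()
        if min_size <= len(members) <= max_size
    }


def positive_pairs(clusters):
    """Yield every pair of offers that share a cluster (matching pairs)."""
    for members in clusters.values():
        yield from combinations(sorted(members), 2)


def count_negative_pairs(clusters):
    """Count pairs of offers from different clusters (upper bound on negatives)."""
    sizes = [len(members) for members in clusters.values()]
    total = sum(sizes)
    all_pairs = total * (total - 1) // 2
    same_cluster_pairs = sum(n * (n - 1) // 2 for n in sizes)
    return all_pairs - same_cluster_pairs
```

A cluster of n offers yields n(n-1)/2 positive pairs, while every pair of offers from different clusters is a candidate negative pair. This is why the number of possible negative examples grows much faster than the number of positive examples.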

In addition to the training dataset, we have also built a gold standard for evaluating matching methods by manually verifying for 2000 pairs of offers whether or not they refer to the same product. The gold standard covers the product categories computers, shoes, watches, and cameras. Using both artefacts to publicly verify the results that Mudgal et al. (SIGMOD 2018) recently achieved using private training data, we find that embeddings and deep learning based methods outperform traditional symbolic matching methods (SVMs and random forests) by 6% to 10% in F1 on our gold standard.
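For readers unfamiliar with the F1 measure used in this comparison, here is a minimal sketch of how it can be computed on a set of labelled offer pairs. The function name and the input format (two parallel lists of booleans or 0/1 values, one entry per pair) are assumptions made for illustration and do not describe the evaluation code used in the experiments.

```python
def f1_score(labels, predictions):
    """Compute F1 for the 'match' class over labelled offer pairs.

    labels[i] is truthy if pair i is a true match in the gold standard;
    predictions[i] is truthy if the matcher predicted a match for pair i.
    """
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y and p)
    false_pos = sum(1 for y, p in pairs if not y and p)
    false_neg = sum(1 for y, p in pairs if y and not p)
    if true_pos == 0:
        # No correct match found: precision or recall is zero or undefined.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

F1 is the harmonic mean of precision and recall for the matching class, so a matcher only scores well if it both finds most true matches and avoids predicting false ones.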

We think that the creation of the WDC Training Dataset nicely demonstrates the utility of the Semantic Web. Without the website owners putting semantic annotations into their HTML pages, it would have been much harder, if not impossible, to extract product offers from 79 thousand e-shops, and we would likely not have dared to approach this task.

More information about the WDC Training Dataset and Gold Standard for Large-scale Product Matching can be found on the WDC website, which also offers both artefacts for public download.

Many thanks to

  • Anna Primpeli for extracting the training dataset from the CommonCrawl and developing the cleansing workflow.
  • Ralph Peeters for creating the gold standard and performing the matching experiments.