News

WDC Training Dataset and Gold Standard for Large-Scale Product Matching released

The research focus in the field of entity resolution (aka link discovery or duplicate detection) is moving from traditional symbolic matching methods to matching based on embeddings and deep neural networks. A problem with evaluating deep learning based matchers is that they require large amounts of training data, and the benchmark datasets traditionally used for comparing matching methods are often too small to properly evaluate this new family of methods.

By publishing the WDC Training Dataset and Gold Standard for Large-scale Product Matching, we hope to contribute to solving this problem. The training dataset consists of 26 million product offers (16 million English language offers) originating from 79 thousand different e-shops. For grouping the offers into clusters describing the same product, we rely on product identifiers such as GTINs or MPNs that are annotated with schema.org markup in the HTML pages of the e-shops. Using these identifiers and a specific cleansing workflow, we group the offers into 16 million clusters. Considering only clusters of English offers with more than five offers, and excluding clusters with more than 80 offers, which may introduce noise, 20.7 million positive training examples (pairs of matching product offers) and a maximum of 2.6 trillion negative training examples can be derived from the dataset. The training dataset is thus several orders of magnitude larger than the largest training set for product matching that has been accessible to the public so far.
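As a rough illustration of how training pairs can be derived from identifier-based clusters, the following Python sketch groups toy offers by a shared identifier, applies the size filter described above, and counts the resulting pairs. It is a simplification under stated assumptions: the actual cleansing workflow is not reproduced, all function names and the toy data are hypothetical, and the negative count assumes that every pair of offers from two different retained clusters is a candidate negative.

```python
from collections import defaultdict
from itertools import combinations


def cluster_offers(offers):
    """Group offer ids by a shared product identifier (e.g. a GTIN or MPN)."""
    clusters = defaultdict(list)
    for offer_id, identifier in offers:
        clusters[identifier].append(offer_id)
    return dict(clusters)


def filter_clusters(clusters, min_size_exclusive=5, max_size=80):
    """Keep clusters with more than five and at most 80 offers."""
    return {
        identifier: members
        for identifier, members in clusters.items()
        if min_size_exclusive < len(members) <= max_size
    }


def positive_pairs(clusters):
    """Yield every pair of offers that share a cluster (matching pairs)."""
    for members in clusters.values():
        yield from combinations(sorted(members), 2)


def count_pairs(clusters):
    """Return (positives, negatives) without materialising the pairs."""
    sizes = [len(members) for members in clusters.values()]
    total = sum(sizes)
    positives = sum(n * (n - 1) // 2 for n in sizes)
    # Assumption: all cross-cluster pairs count as possible negatives.
    negatives = total * (total - 1) // 2 - positives
    return positives, negatives


# Hypothetical toy data: (offer id, product identifier).
toy_offers = (
    [(f"a{i}", "gtin-1") for i in range(6)]
    + [(f"b{i}", "gtin-2") for i in range(7)]
    + [("c0", "gtin-3")]  # singleton cluster, removed by the size filter
)
kept = filter_clusters(cluster_offers(toy_offers))
print(count_pairs(kept))
```

Because the number of pairs grows quadratically with cluster size, counting from cluster sizes is far cheaper than enumerating pairs, which matters at the scale of millions of clusters.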

In addition to the training dataset, we have also built a gold standard for evaluating matching methods by manually verifying that 2000 pairs of offers refer or do not refer to the same products. The gold standard covers the product categories computers, shoes, watches, and cameras. Using both artefacts to publicly verify the results that Mudgal et al. (SIGMOD 2018) recently achieved using private training data, we find that embeddings and deep learning based methods outperform traditional symbolic matching methods (SVMs and random forests) by 6% to 10% in F1 on our gold standard.
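For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall on the "match" class. The sketch below shows how it could be computed for a matcher's decisions on manually labelled pairs. The function name and the labels are hypothetical toy values, not taken from the gold standard or the experiments.

```python
def f1_score(gold, predicted):
    """F1 for the positive ("match") class over parallel lists of booleans."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    false_pos = sum(1 for g, p in zip(gold, predicted) if not g and p)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if true_pos == 0:
        return 0.0  # no correct matches: precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for five offer pairs (True = same product).
gold_labels = [True, True, True, False, False]
matcher_output = [True, True, False, True, False]
score = f1_score(gold_labels, matcher_output)
print(round(score, 3))
```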

We think that the creation of the WDC Training Dataset nicely demonstrates the utility of the Semantic Web. Without the website owners putting semantic annotations into their HTML pages, it would have been much harder, if not impossible, to extract product offers from 79 thousand e-shops, and we would likely not have dared to approach this task.

More information about the WDC Training Dataset and Gold Standard for Large-scale Product Matching is available on the WDC website, which also offers both artefacts for public download.

Many thanks to

  • Anna Primpeli for extracting the training dataset from the CommonCrawl and developing the cleansing workflow.
  • Ralph Peeters for creating the gold standard and performing the matching experiments.

Prof. Dr.-Ing. Margret Keuper

Image Processing

B6, 26, Room B 015

Email: keuper (at) uni-mannheim . de

Phone: +49 621 181 2602

 

I joined the Data and Web Science Group in April 2017 as a Juniorprofessor. My research interests are Computer Vision and Image Processing. More specifically, I am interested in grouping problems such as

 

During my PhD with Thomas Brox at the University of Freiburg, I focused on segmentation in volumetric biomedical image data.

I am looking for a highly motivated PhD student.

  • Applications that do not include CV, transcripts, and a short (!) cover letter will likely be ignored.
  • We generally do not offer short internships (3 months or less).

Publications

You can also find a full list of my publications on Google Scholar.

2018

Margret Keuper, Siyu Tang, Bjoern Andres, Thomas Brox and Bernt Schiele: Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.

Amirhossein Kardoost and Margret Keuper: Solving Minimum Cost Lifted Multicut Problems by Node Agglomeration, accepted at the Asian Conference on Computer Vision (ACCV), 2018.

Eddy Ilg, Tonmoy Saikia, Margret Keuper, Thomas Brox: Occlusions, Motion and Depth Boundaries with a Generic Network for Disparity, Optical Flow or Scene Flow Estimation, European Conference on Computer Vision (ECCV), 2018.

S. Broscheit, R. Gemulla, M. Keuper: Learning Distributional Token Representations from Visual Features. In RepL4NLP workshop, 2018.

2017

Margret Keuper: Higher-Order Minimum Cost Lifted Multicuts for Motion Segmentation, in Proc. of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct. 2017. [pdf]

Anne S. Wannenwetsch, Margret Keuper, and Stefan Roth: ProbFlow: Joint optical flow and uncertainty estimation, in Proc. of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct. 2017. [pdf]

Yang He, Margret Keuper, Bernt Schiele and Mario Fritz: Learning Dilation Factors for Semantic Segmentation of Street Scenes, 39th German Conference on Pattern Recognition (GCPR), 2017.

Yang He, Wei-Chen Chiu, Margret Keuper, Mario Fritz: STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data Driven Pooling, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [pdf]

Eddy Ilg, Nikolaus Mayer, T. Saikia, Margret Keuper, Alexey Dosovitskiy, Thomas Brox: FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [pdf]

2016

Margret Keuper, Thomas Brox: Point-Wise Mutual Information-Based Video Segmentation with High Temporal Consistency. ECCV Workshops (3) 2016: 789-803. [pdf]

Margret Keuper, Thomas Brox: Segmentation in Point Clouds from RGB-D Using Spectral Graph Reduction, Perspectives in Shape Analysis, 155-168

2015

Margret Keuper, Bjoern Andres, Thomas Brox: Motion Trajectory Segmentation via Minimum Cost Multicuts. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2015: 3271-3279 [pdf]

Margret Keuper, Evgeny Levinkov, Nicolas Bonneel, Guillaume Lavoué, Thomas Brox, Bjoern Andres: Efficient Decomposition of Image and Mesh Graphs by Lifted Multicuts. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2015: 1751-1759 [pdf]

2014

Fabio Galasso, Margret Keuper, Thomas Brox, Bernt Schiele: Spectral Graph Reduction for Efficient Image and Streaming Video Segmentation. In Proc. of the CVPR 2014: 49-56 [pdf]

2013

Thorsten Schmidt, Jasmin Dürr, Margret Keuper, Thomas Blein, Klaus Palme, Olaf Ronneberger: Variational attenuation correction in two-view confocal microscopy. BMC Bioinformatics 14: 366 (2013) [paper]

Margret Keuper, Thorsten Schmidt, Maja Temerinac-Ott, Jan Padeken, Patrick Heun, Olaf Ronneberger, Thomas Brox: Blind Deconvolution of Widefield Fluorescence Microscopic Data by Regularization of the Optical Transfer Function (OTF). CVPR 2013: 2179-2186 [pdf]

Thorsten Schmidt, Jasmin Dürr, Margret Keuper, Thomas Blein, Klaus Palme, Olaf Ronneberger: Variational attenuation correction of two-view confocal microscopic recordings. ISBI 2013: 169-172 [paper]

2012

Thorsten Schmidt, Margret Keuper, Taras Pasternak, Klaus Palme, Olaf Ronneberger: Modeling of Sparsely Sampled Tubular Surfaces Using Coupled Curves. DAGM/OAGM Symposium 2012: 83-92 [pdf]

Margret Keuper, Maja Temerinac-Ott, Jan Padeken, Patrick Heun, Thomas Brox, Hans Burkhardt, Olaf Ronneberger: Blind deconvolution with PSF regularization for wide-field microscopy. ISBI 2012: 1292-1295 [pdf]

2011

Margret Keuper, Thorsten Schmidt, Marta Rodriguez-Franco, Wolfgang Schamel, Thomas Brox, Hans Burkhardt, Olaf Ronneberger: Hierarchical Markov Random Fields for Mast Cell Segmentation in Electron Microscopic Recordings. In Proc. of the International Symposium on Biomedical Imaging: From Nano to Macro (ISBI), 2011, pages 973-978. [pdf]

2010

M. Keuper, R. Bensch, K. Voigt, A. Dovzhenko, K. Palme, H. Burkhardt, O. Ronneberger: Semi-Supervised Learning of Edge Filters for Volumetric Image Segmentation, in Proc. of the 32nd DAGM Symposium 2010, LNCS, pages 462-471. [pdf]

M. Keuper, T. Schmidt, J. Padeken, P. Heun, K. Palme, H. Burkhardt, O. Ronneberger: 3D Deformable Surfaces with Locally Self-Adjusting Parameters - A Robust Method to Determine Cell Nucleus Shapes, in Proc. of the ICPR, 20th International Conference on Pattern Recognition, 2010, pages 2254-2257. [pdf]

M. Keuper, J. Padeken, P. Heun, H. Burkhardt, O. Ronneberger: Mean Shift Gradient Vector Flow: A Robust External Force Field for 3D Active Surfaces, in Proc. of the ICPR, 20th International Conference on Pattern Recognition, 2010, pages 2784-2787. [pdf]

M. Temerinac, M. Keuper and H. Burkhardt: Evaluation of a New Point Clouds Registration Method based on Group Averaging Features, in Proc. of the ICPR, 20th International Conference on Pattern Recognition, 2010, pages 2452-2455. [pdf]

2009

M. Keuper, J. Padeken, P. Heun, H. Burkhardt, O. Ronneberger: A 3D Active Surface Model for the accurate Segmentation of Drosophila Schneider Cell Nuclei and Nucleoli. In Proc. of the ISVC 2009, LNCS, pages 865-874. [pdf]

Awards and Nominations

Teaching

At the University of Mannheim, I am teaching the following courses:

Higher Level Computer Vision (CS 646)

Image Processing (CS 647)

Professional Activities

Invited Talks:

  • "Introduction to Computer Vision" at Female's Favour{IT}e Conference 2018, Feb. 2018.
  • "Probabilistic Graphical Models for Multi-Class Segmentation in Images and Videos", colloquium talk, Institut für Neuroinformatik, Bochum, Sep. 2016.
  • "Segmentation of Neuronal Structures using Multi-Label Level Sets". In the ISBI-2012 workshop/challenge on Segmentation of neuronal structures in EM stacks organized by Ignacio Arganda-Carreras, Sebastian Seung, Albert Cardona, and Johannes Schindelin, March 2012.

 

PC Member / Reviewer:

Conferences: ECCV (since 2014), ICCV (since 2015), CVPR (since 2016), ICPR (2016), ACCV (since 2016), BMVC (since 2017), NIPS (2018), GCPR 2018 (meta-reviewer)

Journals: IEEE Transactions on Circuits and Systems for Video Technology - TCSVT (2015), Image and Vision Computing Journal - IMAVIS (2016), IEEE Transactions on Pattern Analysis and Machine Intelligence - TPAMI (2015, 2016, 2017, 2018), IEEE Transactions on Image Processing - TIP (2017, 2018), Pattern Recognition (2017), The Visual Computer Journal (2017), Journal of Electronic Imaging (2016, 2017), Applied Computing and Informatics (2018), Entropy (2017), Computer Vision and Image Understanding (2017)


Other:

Co-organization of the LWDA 2018 conference (Lernen. Wissen. Daten. Analysen) in Mannheim.