Resources

The Research Group Data and Web Science offers the following resources for free download:

  1. Open Data
  2. Open Source Software
  3. Benchmarks
     

1. Open Data

DBpedia - Querying Wikipedia Like a Database

DBpedia is a community effort to extract structured information from Wikipedia editions in over 90 languages and to make the resulting knowledge base available on the Web. The DBpedia knowledge base currently describes more than 3.5 million things, out of which 1.6 million are classified in a consistent ontology. It is one of the most comprehensive multi-lingual knowledge bases that currently exist and has developed into an interlinking hub for the Web of Data. The knowledge base is widely used by research projects as well as in industry. More information about the project can be found on the DBpedia website.

Duration: Active since 2007
Project partners: Universität Leipzig, OpenLink Software, and a world-wide community of developers and mapping editors.
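The knowledge base is accessible through a public SPARQL endpoint at http://dbpedia.org/sparql. As a minimal sketch of how a query can be sent over HTTP (the class and property names in the query are illustrative, and the request URL is built but not actually sent here):

```python
from urllib.parse import urlencode

# Illustrative SPARQL query: German cities and their populations.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:country dbr:Germany ;
        dbo:populationTotal ?population .
} LIMIT 10
"""

def sparql_request_url(endpoint, query):
    """Build a GET request URL for a SPARQL endpoint, asking for JSON results."""
    params = urlencode({"query": query,
                        "format": "application/sparql-results+json"})
    return endpoint + "?" + params

url = sparql_request_url("http://dbpedia.org/sparql", query)
print(url.startswith("http://dbpedia.org/sparql?query="))  # True
```

Fetching the URL with any HTTP client returns the query results as JSON.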

W3C Linking Open Data

The W3C Linking Open Data community project supports and loosely coordinates the extension of the Web with a global data space by publishing open-license datasets as RDF and by setting data links between data items within different data sources. The project maintains the LOD dataset catalogue on CKAN as well as tool listings in the W3C ESW wiki. It regularly publishes statistics about the LOD data cloud and maintains the LOD cloud diagram. More information about the project can be found on the LOD website.

Duration: Active since 2007 
Project partners: Over 100 worldwide, including the Massachusetts Institute of Technology (USA), DERI (Ireland), Talis (UK), University of Southampton (UK), Open University (UK), OpenLink Software (USA), BBC (UK), and Geonames (USA).

Web Data Commons - RDFa, Microdata and Microformat Corpus

More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF quads as well as CSV tables for common entity types. More information about the project can be found at www.webdatacommons.org/structureddata/

Duration: Active since March 2012
Project partner: Karlsruhe Institute of Technology (Germany)
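The extracted data is distributed as N-Quads, i.e. lines of the form `<subject> <predicate> <object> <graph> .`, where the graph URI records the page a statement was extracted from. A toy parser for well-formed lines (it handles only URI and plain-literal terms, not the full N-Quads grammar):

```python
import re

# Match: subject (URI or blank node), predicate (URI), object (any term,
# matched lazily), graph (URI), terminating dot.
QUAD = re.compile(r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s+(<[^>]*>)\s+\.$')

def parse_quad(line):
    """Return (subject, predicate, object, graph) or None for non-matching lines."""
    m = QUAD.match(line.strip())
    return m.groups() if m else None

line = ('<http://example.org/p1> <http://schema.org/name> '
        '"Acme Phone" <http://example.org/page.html> .')
print(parse_quad(line))
```

A production parser should use a proper N-Quads library; this only illustrates the file layout.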

Web Data Commons - Web Table Corpora

The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a small subset of the tables is relational, meaning that they contain structured data describing a set of entities. The Web Data Commons project extracts relational Web tables from the Common Crawl, the largest and most up-to-date Web corpus that is currently available to the public. The resulting Web table corpora are provided for public download. In addition, we calculate statistics about the structure and content of the tables. More information about the project can be found at http://www.webdatacommons.org/webtables/

Duration: Active since March 2014
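As an illustrative heuristic only (not the project's actual classifier), one can flag a table as a relational candidate if it has a header row, several data rows, and a consistent column count:

```python
def looks_relational(rows):
    """Rough relational-table check on a table given as a list of cell rows."""
    if len(rows) < 3:                      # header plus at least two data rows
        return False
    width = len(rows[0])
    if width < 2 or any(len(r) != width for r in rows):
        return False                       # ragged tables are likely layout
    # header cells should be short labels rather than long layout text
    return all(0 < len(cell) <= 40 for cell in rows[0])

layout = [["Welcome to our site, the best place on the whole web for news!"]]
data = [["Country", "Capital"], ["France", "Paris"], ["Japan", "Tokyo"]]
print(looks_relational(layout), looks_relational(data))  # False True
```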

Web Data Commons - Hyperlink Graph

The project provides a large hyperlink graph for public download and analyses the topology of the graph. The WDC Hyperlink Graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, this graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. The graph and the results of the analysis can be found at http://webdatacommons.org/hyperlinkgraph

Duration: Active since November 2013
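The graph is distributed as an index of pages together with an edge list of page ids; basic topology statistics such as degree distributions can be computed directly over such a file. A minimal sketch on a toy edge list:

```python
from collections import Counter

# Toy edge list of (source, target) page ids, standing in for the
# arc files of the real corpus.
edges = [(0, 1), (0, 2), (1, 2), (3, 2)]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

print(out_degree[0], in_degree[2])  # 2 3
```

Counters default to 0 for unseen nodes, so pages without incoming links need no special handling.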

2. Open Source Software

Silk - Link Discovery Framework

The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk can also be used as an identity resolution component within Linked Data applications. Silk provides a declarative language for expressing identity resolution heuristics and implements a sophisticated blocking method (MultiBlock). Single-machine and Hadoop-based implementations are available. More information about the project can be found on the Silk website.

Duration: Active since 2009
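To give a flavour of the underlying idea (this is a conceptual sketch, not Silk's declarative language, and the one-character blocking key is a drastic simplification of MultiBlock): records from two sources are compared with a similarity heuristic, but only within blocks, avoiding the full cross product.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def block_key(record):
    """Crude blocking: records only get compared if names share a first letter."""
    return record["name"][:1].lower()

def find_links(source, target, threshold=0.8):
    blocks = {}
    for t in target:
        blocks.setdefault(block_key(t), []).append(t)
    links = []
    for s in source:
        for t in blocks.get(block_key(s), []):
            if similarity(s["name"], t["name"]) >= threshold:
                links.append((s["id"], t["id"]))
    return links

a = [{"id": "a1", "name": "Berlin"}, {"id": "a2", "name": "Paris"}]
b = [{"id": "b1", "name": "berlin"}, {"id": "b2", "name": "London"}]
print(find_links(a, b))  # [('a1', 'b1')]
```

In Silk, the comparison rules and thresholds are expressed declaratively rather than coded by hand.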

RapidMiner Linked Open Data Extension

The RapidMiner Linked Open Data Extension is an extension to the open source data mining software RapidMiner. It allows data from Linked Open Data to be used both as input for data mining and for enriching existing datasets with background knowledge. More information about the extension as well as its use cases can be found on the project's website.

Duration: Active since 2013

D2RQ Platform - Accessing Relational Databases as Virtual RDF Graphs

The D2RQ Platform is a system for accessing relational databases as virtual, read-only RDF graphs. It offers RDF-based access to the content of relational databases without having to replicate it into an RDF store. Using D2RQ you can:

  1. query a non-RDF database using SPARQL;
  2. access the content of the database as Linked Data over the Web;
  3. create custom dumps of the database in RDF formats for loading into an RDF store;
  4. access information in a non-RDF database using the Apache Jena API.

The D2RQ Platform has been downloaded over 15,000 times from SourceForge. More information about the platform can be found on the D2RQ website.

Duration: Active since 2004 
Project partners: DERI (Ireland)
OEM distributor: TopBraid (USA)
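The core idea of a virtual RDF view can be sketched in a few lines (this uses SQLite and a hand-written mapping for illustration, not D2RQ's actual mapping language): rows are translated into triples on the fly rather than materialized in a store.

```python
import sqlite3

# Toy relational table standing in for an existing database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [(1, "Mannheim"), (2, "Leipzig")])

def triples(conn):
    """Yield (subject, predicate, object) per row, generated on demand."""
    for row_id, name in conn.execute("SELECT id, name FROM city"):
        subject = f"http://example.org/city/{row_id}"
        yield (subject, "http://xmlns.com/foaf/0.1/name", name)

print(list(triples(conn))[0])
```

Because the triples are generated from a query, the RDF view always reflects the current database content.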

LDIF - Linked Data Integration Framework

The LDIF – Linked Data Integration Framework is a Hadoop-based framework for integrating and cleansing large amounts of web and enterprise data. LDIF provides an expressive mapping language, an identity resolution component, as well as data quality assessment and data fusion modules. More information about the project can be found on the LDIF website.

Duration: Active since June 2011
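One simple data-fusion strategy, shown here purely as an illustration of the fusion step (not necessarily LDIF's actual strategy; the data is made up): when sources disagree about a property value, keep the most frequent one.

```python
from collections import Counter

def fuse_by_vote(values):
    """Return the most common value among conflicting source values."""
    return Counter(values).most_common(1)[0][0]

# The same population property for one entity, reported by three sources.
populations = ["3645000", "3645000", "3400000"]
print(fuse_by_vote(populations))  # 3645000
```

Real fusion modules also support strategies such as preferring the most trusted or most recent source.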

ALCOMO - Applying Logical Constraints to Match Ontologies

ALCOMO is a project that has been developed by Christian Meilicke in the context of his PhD. It is a debugging system that transforms incoherent alignments into coherent alignments by removing some correspondences from the alignment. The removed part of the alignment is called a diagnosis. The system is complete in the sense that it detects any kind of incoherence in SHIN(D) ontologies. At the same time, a computed diagnosis is always minimal in the sense that the tool never removes too much, i.e., the removed subset of the alignment is always a minimal hitting set over all conflicts. The system is available under the MIT license and can be downloaded here.

Duration: Available since 2012
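The hitting-set idea can be illustrated with a toy greedy procedure (this is not ALCOMO's reasoning machinery, and a greedy choice does not guarantee minimality in general): given conflict sets of correspondences that cannot jointly hold, repeatedly remove the correspondence involved in the most remaining conflicts.

```python
def greedy_diagnosis(conflicts):
    """Greedily pick correspondences to remove until no conflict set survives."""
    conflicts = [set(c) for c in conflicts]
    removed = set()
    while conflicts:
        counts = {}
        for conflict in conflicts:
            for corr in conflict:
                counts[corr] = counts.get(corr, 0) + 1
        worst = max(counts, key=counts.get)      # hits the most conflicts
        removed.add(worst)
        conflicts = [c for c in conflicts if worst not in c]
    return removed

# Correspondences c1..c3; {c1, c2} and {c1, c3} cannot jointly hold,
# so removing c1 alone resolves both conflicts.
print(greedy_diagnosis([{"c1", "c2"}, {"c1", "c3"}]))  # {'c1'}
```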

Semtinel - Thesaurus analysis beyond numbers

Semtinel is a graphical thesaurus analysis and maintenance system developed mainly by Kai Eckert as part of his dissertation. It also formed the technical basis for several master theses and student research projects. The software is available at Semtinel.org.

Duration: Available since 2008

WDC - Extraction Framework

The Web Data Commons - Extraction Framework is used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, Web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation. The framework provides an easy-to-use basis for the distributed processing of large web crawls using Amazon EC2 cloud services. The framework is published under the terms of the Apache license and can easily be customized to perform different data extraction tasks. More information and the download instructions can be found on the Web Data Commons website.

Duration: Available since July 2014

3. Benchmarks

Berlin SPARQL Benchmark (BSBM)

The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems. As SPARQL is taken up by the community, there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protocol. The Berlin SPARQL Benchmark (BSBM) defines a suite of benchmarks for comparing the performance of these systems across architectures. The benchmark is built around an e-commerce use case in which a set of products is offered by different vendors and consumers have posted reviews about products. The benchmark query mix illustrates the search and navigation pattern of a consumer looking for a product. More information about the benchmark can be found on the BSBM website.

Duration: Active since 2008
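A central BSBM metric is throughput in query mixes per hour (QMpH). A toy harness illustrates the measurement (the stub queries here stand in for real SPARQL queries against the endpoint under test):

```python
import time

def run_mix(queries):
    """Execute one query mix and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    for query in queries:
        query()
    return time.perf_counter() - start

def qmph(mix_seconds):
    """Query mixes per hour for a given per-mix runtime."""
    return 3600.0 / mix_seconds

elapsed = run_mix([lambda: None] * 12)  # stub query mix
print(qmph(1.0))  # a mix taking 1 s corresponds to 3600 QMpH
```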

OAEI Anatomy and Library Track

The Ontology Alignment Evaluation Initiative (OAEI) is a coordinated international initiative to assess the strengths and weaknesses of alignment/matching systems and to compare the performance of techniques. In 2006 we offered the Anatomy track for the first time. This track consists of finding alignments between the Adult Mouse Anatomy and a part of the NCI Thesaurus (describing the human anatomy). The task is placed in a domain where we find large, carefully designed ontologies that are described in technical terms. Since 2012 we have been offering a second track, called the Library track. The Library track is a real-world task to match the STW and the TheSoz thesauri, which provide vocabularies for economic and social science subjects, respectively, and are used by libraries for indexing and retrieval. The latest versions of the datasets as well as the tools to process them are available via http://oaei.ontologymatching.org/2012/.

Duration: Since 2006 as part of the annual OAEI campaign
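Matching systems are typically scored by the precision and recall of the produced alignment against a reference alignment, each alignment being a set of correspondences. A minimal sketch (the correspondence identifiers below are made up):

```python
def precision_recall(produced, reference):
    """Score a produced alignment against a reference alignment of entity pairs."""
    correct = len(produced & reference)
    precision = correct / len(produced) if produced else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

reference = {("mouse:Brain", "nci:Brain"), ("mouse:Heart", "nci:Heart")}
produced = {("mouse:Brain", "nci:Brain"), ("mouse:Lung", "nci:Skin")}
print(precision_recall(produced, reference))  # (0.5, 0.5)
```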

T2D Gold Standard for Evaluating Web Table Matching Systems

A small fraction of the HTML tables on the Web contains structured data. As this data has a wide coverage, it could potentially be very valuable for filling missing values and extending cross-domain knowledge bases. As a prerequisite for being able to use table data for knowledge base extension, the Web tables need to be matched to the knowledge base in question. The T2D gold standard provides a rich set of correspondences between a public Web table corpus and the DBpedia knowledge base. The gold standard can be used to compare systems that match Web tables to rich knowledge bases. More information about the gold standard can be found on the T2D website.

Duration: Available since April 2015

WDC Gold Standard for Product Matching and Product Feature Extraction

In order to support the evaluation and comparison of product feature extraction and product matching methods, we have created two public gold standards for these tasks. The gold standards comprise several hundred products from the categories mobile phones, TVs, and headphones. More information can be found on the project website.

Duration: Available since June 2016