DWS Group - News http://example.com This RSS feed provides news about the research activities of the Data and Web Science Group at the University of Mannheim. en-gb TYPO3 News Thu, 21 Jun 2018 00:21:28 +0000 Thu, 21 Jun 2018 00:21:28 +0000 TYPO3 EXT:news news-2119 Mon, 11 Jun 2018 13:23:27 +0000 Papers accepted at ACL 2018 https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/papers-accepted-at-acl-2018/ We have three papers to be presented at the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), the premier international conference on Computational Linguistics and Natural Language Processing.

Two short papers prepared in collaboration with our colleagues from the University of Cambridge, the University of Hamburg and the University of Oslo have been accepted at the main conference track:

One paper has been accepted at the 3rd Workshop on Representation Learning for NLP (RepL4NLP) hosted by ACL 2018:

  • Samuel Broscheit: Learning Distributional Token Representations from Visual Features.
Research
news-2108 Mon, 07 May 2018 06:56:32 +0000 Roche Hypo University Challenge won by DWS-AI https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/roche-hypo-university-challenge-won-by-dws-ai/ We are happy to announce that Jakob Huber and Timo Sztyler took first place in the Hypo University Challenge, which was hosted by Roche Diabetes Care GmbH and powered by IBM. The goal of the challenge was to develop an algorithm that predicts the probability of a nocturnal hypoglycemic event (severe, mild, hypo) in the upcoming 10, 20, 30, 40, and 60 minutes.


Today, more than 425 million people have Diabetes Mellitus, a metabolic disorder characterized by an increased blood sugar level. Left untreated, it can lead to hyperglycemia, which results in confusion, abdominal pain, and coma. Diabetes requires lifelong treatment; there is no cure.


After the challenge, they were invited to present their solution at the Roche-internal "Diagnostics R&D Fair" in Basel, where they also received a trophy for winning the challenge.

Research
news-2105 Fri, 27 Apr 2018 09:58:42 +0000 Data Science Conference LWDA 2018 in Mannheim https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/data-science-conference-lwda-2018-in-mannheim-1/ The Data and Web Science Group is hosting the Data Science Conference LWDA 2018 in Mannheim on August 22-24, 2018.

LWDA, which expands to „Lernen, Wissen, Daten, Analysen“ („Learning, Knowledge, Data, Analytics“), covers recent research in areas such as knowledge discovery, machine learning and data mining, knowledge management, database management and information systems, and information retrieval.

The LWDA conference is organized by the special interest groups of the Gesellschaft für Informatik (German Informatics Society) in this area and brings them together. The program comprises joint research sessions and keynotes as well as workshops organized by each special interest group.

Further information can be found on the conference website: https://www.uni-mannheim.de/lwda-2018/.

Download the conference poster.

Other Topics - Künstliche Intelligenz I Topics - Data Mining Topics - Decision Support Topics - Web Search and IR Chris Heiner Rainer Simone
news-2098 Tue, 17 Apr 2018 09:27:36 +0000 Paper accepted at IJCAI 2018 https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/paper-accepted-at-ijcai-2018/ Together with our colleagues Paola, Irene and Stefano at Sapienza University in Rome we have a paper accepted at the 27th International Joint Conference on Artificial Intelligence (IJCAI), the premier conference in the field of AI:

  • Stefano Faralli, Irene Finocchi, Simone Paolo Ponzetto and Paola Velardi: Efficient Pruning of Large Knowledge Graphs.
Publications Simone Research
news-2097 Tue, 17 Apr 2018 09:24:14 +0000 Paper accepted at JCDL 2018 https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/paper-accepted-at-jcdl-2018/ We have a paper accepted at the 2018 Joint Conference on Digital Libraries (JCDL), the top conference in the field of digital libraries:

  • Federico Nanni, Simone Paolo Ponzetto and Laura Dietz: Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context.

The work presented in the paper is a collaboration between the DWS group and Prof. Laura Dietz at the University of New Hampshire in the context of an Elite Post-Doc grant of the Baden-Württemberg Stiftung recently awarded to Laura.

Research Publications Simone
news-2096 Tue, 17 Apr 2018 09:08:19 +0000 Paper accepted at SIGIR 2018 https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/paper-accepted-at-sigir-2018/ Together with our colleague Ivan Vulic at the University of Cambridge we have a paper accepted at the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), the premier conference in the field of Information Retrieval:

  • Robert Litschko, Goran Glavas, Ivan Vulic and Simone Paolo Ponzetto: Unsupervised Cross-Lingual Information Retrieval using Monolingual Data Only.
Research Publications Simone
news-2084 Mon, 12 Mar 2018 11:57:47 +0000 Third Cohort of Students starts Part-time Master in Data Science https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/third-cohort-of-students-starts-part-time-master-in-data-science/ The third cohort, consisting of 32 students, has started its studies in the part-time master program in Data Science that professors of the DWS group offer together with the Hochschule Albstadt-Sigmaringen.

This weekend the students of the third cohort of the master program as well as students participating in the certificate program Data Science were in Mannheim for a data mining project weekend.

The students worked in teams on two case studies, one in the area of online marketing, the other in the area of text mining. The teams were coached by Prof. Christian Bizer, Dr. Robert Meusel, and Alexander Diete, and we were very happy to see an exciting competition between the teams for the best F1 scores as well as the highest increases in sales.

Projects Chris
news-2075 Fri, 23 Feb 2018 14:41:28 +0000 Dmitry Ustalov has defended his PhD thesis https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/dmitry-ustalov-has-defended-his-phd-thesis/ Dmitry Ustalov has successfully defended his Kandidat Nauk (PhD) thesis on “Models, Methods and Algorithms for Constructing a Word Sense Network for Natural Language Processing” («Модели, методы и алгоритмы построения семантической сети слов для задач обработки естественного языка» in Russian). The defense was held at the South Ural State University (Chelyabinsk, Russia) on February 21, 2018. Among many other contributions, the thesis proposes the Watset and Watlink methods for extracting, inducing, clustering, and linking word senses from unstructured data.

Abstract

The goal of the thesis is to develop models, methods, and algorithms for constructing a semantic network that establishes semantic links between individual word senses using weakly structured dictionaries, and to implement them as a software system for word sense network construction. To this end, Part I reviews the state of the art in the field of natural language processing and motivates the development of new efficient ontology induction algorithms for under-resourced languages.

Part II proposes two new algorithms, Watset and Watlink, that extract and structure the knowledge available in unstructured form. Watset is a meta-algorithm for fuzzy graph clustering. This algorithm creates an intermediate representation of the input graph that naturally reflects the “ambiguity” of its nodes. Then, it uses hard clustering to discover clusters in this intermediate graph. This makes it possible to discover synsets in a synonymy graph. Watlink is an algorithm for discovering the disambiguated hierarchical links between individual word senses. This algorithm uses the synsets obtained using Watset to contextualize the input asymmetric word links. To increase the recall of the linking, it optionally uses a regularized projection learning approach to predict additional relevant links.
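The two-stage idea behind Watset can be illustrated with a toy Python sketch (our own simplification, not the thesis code: all names are made up, and plain connected components stand in for the dedicated clustering algorithms, such as Chinese Whispers or Markov Clustering, used in the actual method):

```python
from collections import defaultdict

def connected_components(nodes, adj):
    """Plain hard clustering: connected components via depth-first search."""
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def watset_sketch(graph):
    """Toy sketch of the Watset idea on an undirected synonymy graph
    given as {word: set of neighbouring words}."""
    # 1) Sense induction: cluster each word's ego network (its
    #    neighbours and the edges among them, without the word itself).
    senses = {}
    for w, nbrs in graph.items():
        ego = {u: graph[u] & nbrs for u in nbrs}
        senses[w] = [frozenset(c) for c in connected_components(nbrs, ego)]
    # 2) Intermediate sense graph: nodes are (word, sense id) pairs;
    #    each sense links to the best-matching sense of every word in
    #    its context (the disambiguation step).
    sense_adj = defaultdict(set)
    for w, word_senses in senses.items():
        for i, ctx in enumerate(word_senses):
            for u in ctx:
                j = max(range(len(senses[u])),
                        key=lambda k: len(senses[u][k] & (ctx | {w})))
                sense_adj[(w, i)].add((u, j))
                sense_adj[(u, j)].add((w, i))
    # 3) Hard clustering of the sense graph yields (possibly
    #    overlapping) synsets over the original words.
    return [{word for word, _ in c}
            for c in connected_components(list(sense_adj), sense_adj)]
```

On a small synonymy graph where one word is connected to two otherwise separate cliques, the sketch assigns that word two senses and places it in two overlapping synsets, which a hard clustering of the original word graph could not do.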

Part III describes the implementation of the proposed models, methods, and algorithms as a software system. The system is implemented in the Python, AWK, and Bash programming languages using the scikit-learn, TensorFlow, NetworkX, and Raptor libraries. It also defines the representation of the produced word sense network as Linked Data.

Part IV reports the results of the experiments conducted on the Russian language, an under-resourced natural language. Both Watset and Watlink show state-of-the-art performance on the synset induction and hypernymy detection tasks on the RuWordNet and Yet Another RussNet gold standards.

Research Group
news-2073 Tue, 20 Feb 2018 14:28:00 +0000 Paper accepted at AAAI: On Multi-Relational Link Prediction with Bilinear Models https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/paper-accepted-at-aaai-on-multi-relational-link-prediction-with-bilinear-models/ The paper "On Multi-Relational Link Prediction with Bilinear Models" (pdf) by Y. Wang, R. Gemulla and H. Li has been accepted at the 2018 AAAI Conference on Artificial Intelligence (AAAI).

Abstract
We study bilinear embedding models for the task of multi-relational link prediction and knowledge graph completion. Bilinear models are among the most basic models for this task; they are comparably efficient to train and use, and they can provide good prediction performance. The main goal of this paper is to explore the expressiveness of and the connections between various bilinear models proposed in the literature. In particular, a substantial number of models can be represented as bilinear models with certain additional constraints enforced on the embeddings. We explore whether or not these constraints lead to universal models, which can in principle represent every set of relations, and whether or not there are subsumption relationships between various models. We report results of an independent experimental study that evaluates recent bilinear models in a common experimental setup. Finally, we provide evidence that relation-level ensembles of multiple bilinear models can achieve state-of-the-art prediction performance.
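To make the notion of bilinear models with constrained embeddings concrete, here is a minimal sketch (our own illustration, not the paper's code): a bilinear model scores a triple (s, r, o) as e_s^T W_r e_o, and restricting the relation matrix W_r to a diagonal matrix yields a DistMult-style model:

```python
def bilinear_score(e_s, W_r, e_o):
    """Generic bilinear score e_s^T W_r e_o: the subject/object
    embeddings are length-d lists, W_r is a d x d relation matrix."""
    return sum(e_s[i] * W_r[i][j] * e_o[j]
               for i in range(len(e_s)) for j in range(len(e_o)))

def distmult_score(e_s, w_r, e_o):
    """DistMult constrains W_r to a diagonal matrix; w_r holds only
    the diagonal entries."""
    return sum(s * w * o for s, w, o in zip(e_s, w_r, e_o))
```

A DistMult relation vector w_r is exactly the bilinear model with W_r = diag(w_r); this kind of "constrained bilinear" subsumption relationship is what the paper studies systematically.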

Publications Rainer
news-2072 Fri, 16 Feb 2018 09:07:44 +0000 Semester Kick-Off BBQ https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/semester-kick-off-bbq-2/ Traditionally, the DWS group takes the beginning of each new semester as an opportunity to host a barbecue in order to welcome new colleagues and introduce the upcoming courses to the best students of last semester. Accompanied by cold beverages and grilled food, the professors gave an overview of the current activities of the group and presented the spring/summer semester program. The courses for this term are:

Data Mining I, Data Mining II, Web Mining, Web Search and Information Retrieval, Data Mining and Matrices, Higher Level Computer Vision, and Database Technology


The BBQ was attended by around 40 people. We thank all the participants for coming and wish our students a good and successful start into the new semester!

We also would like to give a big thank you to mayato who sponsored the BBQ this year!

German Version

Traditionally, the Data and Web Science research group uses the beginning of the semester as an opportunity to welcome new colleagues at a barbecue, present the current course offerings, and invite the best students of the past semester. Accompanied by grilled food and cold drinks, the professors presented the upcoming courses of the current semester.

The following courses were presented:

Data Mining I, Data Mining II, Web Mining, Web Search and Information Retrieval, Data Mining and Matrices, Higher Level Computer Vision, and Database Technology

We thank all participants for coming and wish our students a good and successful start into the new semester!

Special thanks go to mayato, who sponsored this year's barbecue!

Group Other
news-2060 Fri, 19 Jan 2018 13:07:59 +0000 Paper accepted for Digital Scholarship in the Humanities https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/paper-accepted-for-digital-scholarship-in-the-humanities/ We have a paper accepted in Digital Scholarship in the Humanities, the premier journal in the field of Digital Humanities.

Federico Nanni, Laura Dietz and Simone Paolo Ponzetto. Toward a computational history of universities: Evaluating text mining methods for interdisciplinarity detection from PhD dissertation abstracts. To appear in Digital Scholarship in the Humanities. DOI: 10.1093/llc/fqx062 (available with a free-access article link here). 

The work presented in the paper is a collaboration between the DWS group and Prof. Laura Dietz at the University of New Hampshire.

Abstract

For the first time, historians of higher education have at their disposal large data sets of primary sources that reflect the complete output of academic institutions. To analyze this unprecedented abundance of digital materials, scholars have access to a large suite of computational methods developed in the field of Natural Language Processing. However, when the intention is to move beyond exploratory studies and use the results of such analyses as quantitative evidence, historians need to take into account the reliability of these techniques. The main goal of this article is to investigate the performance of different text mining methods for a specific task: the automatic identification of interdisciplinary works in a corpus of PhD dissertation abstracts. Based on the output of our study, we provide the research community with a new data set for analyzing recent changes in interdisciplinary practices in a large sample of European universities. We show the potential of this collection by tracking the growth in adoption of computational approaches across different research fields during the past 30 years.

Research Simone Publications
news-2059 Fri, 19 Jan 2018 12:56:24 +0000 Paper accepted for Knowledge-Based Systems https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/paper-accepted-for-knowledge-based-systems/ Together with our colleagues of the Natural Language Engineering (NLE) Lab of the University of Valencia, we have a paper accepted for the Knowledge-Based Systems journal (2016 Impact Factor: 4.529).

Goran Glavaš, Marc Franco-Salvador, Simone P. Ponzetto and Paolo Rosso. A resource-light method for cross-lingual semantic textual similarity. To appear in Knowledge-Based Systems. DOI: 10.1016/j.knosys.2017.11.041. A pre-print version is available here.

Abstract

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting the cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognizers) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via a linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a sufficiently large corpus, required to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks.
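The projection-and-alignment idea can be sketched as follows (a toy illustration under our own assumptions: the linear translation map W is taken as given rather than learned from word translation pairs, and the similarity measure is a simple average of best cosine matches rather than the measures investigated in the paper):

```python
def matvec(W, x):
    """Project vector x with the linear translation map W (a list of rows)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def cosine(a, b):
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def cross_lingual_similarity(src_vecs, tgt_vecs, W):
    """Project source-language word vectors into the target space and
    score the text pair as the average best cosine match of each
    projected source word among the target-language words."""
    projected = [matvec(W, v) for v in src_vecs]
    return sum(max(cosine(p, t) for t in tgt_vecs)
               for p in projected) / len(projected)
```

In the actual approach, W would be estimated from a limited-size set of word translation pairs (e.g., via least squares over the two monolingual embedding spaces), which is what makes the method resource-light.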

Research Simone Publications
news-2058 Fri, 19 Jan 2018 12:44:46 +0000 Paper accepted for the Journal of Natural Language Engineering https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/paper-accepted-for-the-journal-of-natural-language-engineering/ We have a new journal paper in Natural Language Engineering summarizing the findings of the first part of our DFG JOIN-T (Joining Ontologies and semantics INduced from Text) project with our colleagues of the Language Technology Group of the University of Hamburg.

Chris Biemann, Stefano Faralli, Alexander Panchenko and Simone Paolo Ponzetto: A framework for enriching lexical semantic resources with distributional semantics. To appear in the Journal of Natural Language Engineering. DOI: 10.1017/S135132491700047X. A pre-print version is available here.

You can find the project homepage here.

Abstract

We present an approach to combining distributional semantic representations induced from text corpora with manually constructed lexical semantic networks. While both kinds of semantic resources are available with high lexical coverage, our aligned resource combines the domain specificity and availability of contextual information from distributional models with the conciseness and high quality of manually crafted lexical networks. We start with a distributional representation of induced senses of vocabulary terms, which are accompanied with rich context information given by related lexical items. We then automatically disambiguate such representations to obtain a full-fledged proto-conceptualization, i.e. a typed graph of induced word senses. In a final step, this proto-conceptualization is aligned to a lexical ontology, resulting in a hybrid aligned resource. Moreover, unmapped induced senses are associated with a semantic type in order to connect them to the core resource. Manual evaluations against ground-truth judgments for different stages of our method as well as an extrinsic evaluation on a knowledge-based Word Sense Disambiguation benchmark all indicate the high quality of the new hybrid resource. Additionally, we show the benefits of enriching top-down lexical knowledge resources with bottom-up distributional information from text for addressing high-end knowledge acquisition tasks such as cleaning hypernym graphs and learning taxonomies from scratch.

Simone Research Publications
news-2055 Mon, 15 Jan 2018 15:48:14 +0000 Petar Ristoski has defended his PhD thesis https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/petar-ristoski-has-defended-his-phd-thesis/ Petar Ristoski has successfully defended his PhD thesis on "Exploiting Web Knowledge Graphs in Data Mining" today. Among many other contributions, his thesis proposes the RDF2Vec method for generating vector space embeddings of RDF graphs.

Abstract

Data Mining and Knowledge Discovery in Databases (KDD) is a research field concerned with deriving higher-level insights from data. The tasks performed in that field are knowledge-intensive and can often benefit from using additional knowledge from various sources. Therefore, many approaches have been proposed in this area that combine Semantic Web data with the data mining and knowledge discovery process. Semantic Web knowledge graphs are a backbone of many information systems that require access to structured knowledge. Such knowledge graphs contain factual knowledge about real-world entities and the relations between them, which can be utilized in various natural language processing, information retrieval, and data mining applications. Following the principles of the Semantic Web, Semantic Web knowledge graphs are publicly available as Linked Open Data. Linked Open Data is an open, interlinked collection of datasets in machine-interpretable form, covering most real-world domains.

In this thesis, we investigate the hypothesis that Semantic Web knowledge graphs can be exploited as background knowledge in different steps of the knowledge discovery process and in different data mining tasks. More precisely, we aim to show that Semantic Web knowledge graphs can be utilized for generating valuable data mining features that can be used in various data mining tasks.

Identifying, collecting and integrating useful background knowledge for a given data mining application can be a tedious and time-consuming task. Furthermore, most data mining tools require features in propositional form, i.e., binary, nominal or numerical features associated with an instance, while Linked Open Data sources are usually graphs by nature. Therefore, in Part I, we evaluate unsupervised feature generation strategies from types and relations in knowledge graphs, which are used in different data mining tasks, i.e., classification, regression, and outlier detection. As the number of generated features grows rapidly with the number of instances in the dataset, we provide a strategy for feature selection in hierarchical feature space, in order to select only the most informative and most representative features for a given dataset. Furthermore, we provide an end-to-end tool for mining the Web of Linked Data, which provides functionalities for each step of the knowledge discovery process, i.e., linking local data to a Semantic Web knowledge graph, integrating features from multiple knowledge graphs, feature generation and selection, and building machine learning models. However, we show that such feature generation strategies often lead to high-dimensional feature vectors even after dimensionality reduction, and also, the reusability of such feature vectors across different datasets is limited.

In Part II, we propose an approach that circumvents the shortcomings introduced with the approaches in Part I. More precisely, we develop an approach that is able to embed complete Semantic Web knowledge graphs in a low-dimensional feature space, where each entity and relation in the knowledge graph is represented as a numerical vector. Projecting such latent representations of entities into a lower-dimensional feature space shows that semantically similar entities appear closer to each other. We use several Semantic Web knowledge graphs to show that such latent representations of entities have high relevance for different data mining tasks. Furthermore, we show that such features can be easily reused for different datasets and different tasks.
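The first stage of such an embedding pipeline can be sketched as follows (our own simplification in the spirit of RDF2Vec, with made-up names: random walks over the graph produce token sequences that would then be fed to a word2vec-style model, which is omitted here):

```python
import random

def random_walks(graph, num_walks=10, depth=4, seed=0):
    """Generate entity -> relation -> entity token sequences from an
    RDF-style graph given as {subject: [(predicate, object), ...]}.
    Feeding these sequences to a word2vec-style model yields numerical
    vectors for entities and relations (the idea behind RDF2Vec)."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(num_walks):
            walk, node = [start], start
            for _ in range(depth):
                edges = graph.get(node)
                if not edges:  # dead end: stop this walk early
                    break
                predicate, obj = rng.choice(edges)
                walk += [predicate, obj]
                node = obj
            walks.append(walk)
    return walks
```

Because the walks treat entities and predicates as plain tokens, any off-the-shelf word embedding trainer can then place semantically related graph entities close together in the vector space.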

In Part III, we describe a list of applications that exploit Semantic Web knowledge graphs beyond the standard data mining tasks, like classification and regression. We show that the approaches developed in Part I and Part II can be used in applications in various domains. More precisely, we show that Semantic Web knowledge graphs can be exploited for analyzing statistics, building recommender systems, modeling entities and documents, and inducing taxonomies.

Group Research
news-2049 Thu, 11 Jan 2018 09:42:41 +0000 38.7 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data published https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/387-billion-quads-microdata-embedded-json-ld-rdfa-and-microformat-data-published/ The DWS group is happy to announce a new release of the WebDataCommons Microdata, Embedded JSON-LD, RDFa and Microformat data corpus. The data has been extracted from the November 2017 version of the Common Crawl covering 3.2 billion HTML pages which originate from 26 million websites (pay-level domains).

In summary, we found structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38.9%). These pages originate from 7.4 million different pay-level domains out of the 26 million pay-level-domains covered by the crawl (28.4%). Approximately 3.7 million of these websites use Microdata, 2.6 million websites use JSON-LD, and 1.2 million websites make use of RDFa. Microformats are used by more than 3.3 million websites within the crawl.

Background:

More and more websites annotate data describing for instance products, people, organizations, places, events, reviews, and cooking recipes within their HTML pages using markup formats such as Microdata, embedded JSON-LD, RDFa and Microformat. The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and Microformat data from the Common Crawl web corpus, the largest web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. We run yearly extractions since 2012 and we provide the dataset series as well as the related statistics at:

http://webdatacommons.org/structureddata/

Statistics about the November 2017 Release:

Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

http://webdatacommons.org/structureddata/2017-12/stats/stats.html

Markup Format Adoption

The page below provides an overview of the increase in the adoption of the different markup formats as well as widely used schema.org classes from 2012 to 2017:

http://webdatacommons.org/structureddata/#toc10

Comparing the statistics from the new 2017 release to the statistics about the October 2016 release of the data sets (http://webdatacommons.org/structureddata/2016-10/stats/stats.html), we see that the adoption of structured data keeps on increasing, while Microdata remains the most dominant markup syntax. The different nature of the crawling strategy that was used makes it hard to compare absolute as well as certain relative numbers between the two releases. More concretely, we observe that the November 2017 Common Crawl corpus is much deeper for certain domains like blogspot.com and wordpress.com, while other domains are covered in a shallower way, with fewer URLs crawled in comparison to the October 2016 Common Crawl corpus. Nevertheless, it is clear that the growth rate of Microdata and Microformats is much higher than that of RDFa and embedded JSON-LD. Although the latter format is widespread, it is mainly used to annotate metadata for search actions (80% of the domains using JSON-LD), while only a few domains use it for annotating content information such as Organizations (25% of the domains using JSON-LD), Persons (4% of the domains using JSON-LD) or Offers (0.1% of the domains using JSON-LD).

Vocabulary Adoption

Concerning vocabulary adoption, schema.org, the vocabulary recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be the most dominant in the context of Microdata, with 78% of the webmasters using it, in comparison to its predecessor, data-vocabulary.org, which is only used by 14% of the websites containing Microdata. In the context of RDFa, the Open Graph Protocol recommended by Facebook remains the most widely used vocabulary.

Parallel Usage of Multiple Formats

Analyzing topic-specific subsets, we discover some interesting trends. As observed in previous extractions, content-related information is mostly described either with the Microdata format or, less frequently, with the JSON-LD format, in both cases using the schema.org vocabulary. However, we find that 30% of the websites that use JSON-LD annotations to describe product-related information make use of both Microdata and JSON-LD to cover the same topic. This is not the case for other topics, such as Hotels or Job Postings, for which webmasters use only one format to annotate their content.

Richer Descriptions of Job Postings

Following the release of the “Google for Jobs” search vertical and the more detailed guidance by Google on how to annotate job postings (https://developers.google.com/search/docs/data-types/job-posting), we see an increase in the number of websites annotating job postings (2017: 7,023, 2016: 6,352). In addition, the job posting annotations tend to become richer in comparison to previous years, as the number of JobPosting-related properties adopted by at least 30% of the websites containing job offers has increased from 4 (2016) to 7 (2017). The newly adopted properties are JobPosting/url, JobPosting/datePosted, and JobPosting/employmentType. You can find a more extended analysis concerning specific topics, like Job Posting and Product, here:

http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis

Download:

The overall size of the November 2017 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 38.7 billion RDF quads. For download, we split the data into 8,433 files with a total size of 858 GB.

http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html

In addition, we have created separate files for over 40 different schema.org classes, each including all quads extracted from pages that use the specific schema.org class.

http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html

Lots of thanks to:

  • the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project.
  • the Any23 project for providing their great library of structured data parsers.
  • Amazon Web Services in Education Grant for supporting WebDataCommons.
  • the Ministry of Economy, Research and Arts of Baden-Württemberg, which supported the extraction and analysis of the November 2017 corpus through the ViCE project.

General Information about the WebDataCommons Project:

The WebDataCommons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the WebDataCommons project also provides large hyperlink graphs, the largest public corpus of WebTables, a corpus of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the WebDataCommons project can be found at

http://webdatacommons.org/

Have fun with the new data set!

Cheers,

Anna Primpeli, Robert Meusel, and Christian Bizer

Research - Web-based Systems Topics - Linked Data Projects Chris
news-2045 Mon, 08 Jan 2018 12:32:58 +0000 Papers accepted at PerCom 2018 https://dws.informatik.uni-mannheim.de/en/news/singleview/detail/News/papers-accepted-at-percom-2018/ We have a few papers accepted at the 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom), one of the top-tier conferences in the field of Pervasive Computing:


Main Conference


Hips Do Lie! A Position-Aware Mobile Fall Detection System
(Christian Krupitzer, Timo Sztyler, Janick Edinger, Martin Breitbach, Heiner Stuckenschmidt and Christian Becker)

NECTAR: Knowledge-based Collaborative Active Learning for Activity Recognition
(Gabriele Civitarese, Claudio Bettini, Timo Sztyler, Daniele Riboni, and Heiner Stuckenschmidt)

Satellite Events (Workshops)

Towards Systematic Benchmarking of Activity Recognition Algorithms
(Timo Sztyler, Christian Meilicke and Heiner Stuckenschmidt)

Modeling and Reasoning with ProbLog: An Application in Recognizing Complex Activities
(Timo Sztyler, Gabriele Civitarese and Heiner Stuckenschmidt)

Improving Motion-based Activity Recognition with Ego-centric Vision
(Alexander Diete, Timo Sztyler, Lydia Weiland and Heiner Stuckenschmidt)

]]>
Publications Heiner Research
news-2038 Tue, 12 Dec 2017 20:01:51 +0000 Understanding Euroscepticism Through the Lens of Big Data at Villa Vigoni https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/understanding-euroscepticism-through-the-lens-of-big-data-at-villa-vigoni/ During the first week of December 2017 we participated in a hackathon that brought together researchers from the field of natural language processing and political science to look at ways to leverage today's abundance of digital primary sources for better understanding the continent-wide rise of Euroscepticism.

Colleagues from leading institutions such as Bocconi University (Italy), GESIS (Germany), the London School of Economics and the Turing Institute (UK), among others, worked closely together to share and discuss complementary methodologies and to develop new models of spatial placement from text for the topic of European integration.

The event was a joint effort organized by the Data and Web Science Group, the Digital Humanities Group at FBK Trento and Unitelma Roma. This collaboration is part of an ongoing larger effort by members of DWS to explore how expertise in the fields of artificial intelligence and natural language processing can support cutting-edge research in political and computational social sciences (see also our work in the context of our Collaborative Research Center SFB 884 on the "Political Economy of Reforms").

]]>
Research Simone
news-2022 Tue, 07 Nov 2017 16:30:43 +0000 DepCC: A Dependency-Parsed Text Corpus from the Common Crawl https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/depcc-a-dependency-parsed-text-corpus-from-the-common-crawl/ Together with our colleagues at the University of Hamburg, we just released a new web-scale dependency-parsed corpus based on the Common Crawl. DepCC is a large linguistically analyzed corpus of English comprising 365 million documents, with 252 billion tokens and 7.5 billion named entity occurrences in 14.3 billion sentences from a web-scale crawl.

You can find the corpus here: https://commoncrawl.s3.amazonaws.com/contrib/depcc/CC-MAIN-2016-07/index.html

A description is available in this paper: https://arxiv.org/abs/1710.01779

]]>
Research Simone
news-2021 Mon, 06 Nov 2017 14:11:00 +0000 Dominique Ritze defended her PhD Thesis https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dominique-ritze-defended-her-phd-thesis/ On November 6th, Dominique Ritze successfully defended her PhD thesis "Web-Scale Web Table to Knowledge Base Matching". Supervisor was Prof. Christian Bizer, second reader was Prof. Kai Eckert from Hochschule der Medien Stuttgart.

Abstract of the thesis:

Millions of relational HTML tables are found on the World Wide Web. In contrast to unstructured text, relational web tables provide a compact representation of entities described by attributes. The data within these tables covers a broad topical range. Web table data is used for question answering, augmentation of search results, and knowledge base completion. Until a few years ago, only search engine companies like Google and Microsoft owned large web crawls from which web tables are extracted. Thus, researchers outside these companies were not able to work with web tables.

In this thesis, the first publicly available web table corpus containing millions of web tables is introduced. The corpus enables interested researchers to experiment with web tables. A profile of the corpus is created to give insights into its characteristics and topics. Further, the potential of web tables for augmenting cross-domain knowledge bases is investigated. For the use case of knowledge base augmentation, it is necessary to understand the web table content. For this reason, web tables are matched to a knowledge base. The matching comprises three matching tasks: instance, property, and class matching. Existing web table to knowledge base matching systems either focus on a subset of these matching tasks or are evaluated using gold standards which also only cover a subset of the challenges that arise when matching web tables to knowledge bases.

This thesis systematically evaluates the utility of a wide range of different features for the web table to knowledge base matching task using a single gold standard. The results of the evaluation are used afterwards to design a holistic matching method which covers all matching tasks and outperforms state-of-the-art web table to knowledge base matching systems. In order to achieve these goals, we first propose the T2K Match algorithm which addresses all three matching tasks in an integrated fashion. In addition, we introduce the T2D gold standard which covers a wide variety of challenges. By evaluating T2K Match against the T2D gold standard, we find that considering only the table content is insufficient. Hence, we include features of three categories: features found in the table, features from the table context like the page title, and features that are based on external resources like a synonym dictionary.

We analyze the utility of the features for each matching task. The analysis shows that certain problems cannot be overcome by matching each table in isolation to the knowledge base. In addition, relying on these features alone is not enough for the property matching task. Based on these findings, we extend T2K Match into T2K Match++ which exploits indirect matches to web tables about the same topic and uses knowledge derived from the knowledge base. We show that T2K Match++ outperforms all state-of-the-art web table to knowledge base matching approaches on the T2D and Limaye gold standards. Most systems show good results on one matching task, but T2K Match++ is the only system that achieves F-measure scores above 0.8 for all tasks. Compared to the results of the best performing system TableMiner+, the F-measure for the difficult property matching task is increased by 0.08, and for the class and instance matching tasks by 0.05 and 0.03, respectively.

Bibliographic meta-information and download of the thesis.

]]>
Chris Research Research - Web-based Systems Publications
news-2014 Fri, 27 Oct 2017 08:41:55 +0000 SWSA Ten-Year Award won by DBpedia Paper https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/swsa-ten-year-award-won-by-dbpedia-paper/ We are happy to announce that Professor Christian Bizer has received the SWSA Ten-Year Award at the 16th International Semantic Web Conference (ISWC 2017) in Vienna for the paper "DBpedia: A Nucleus for a Web of Open Data", which he co-authored in 2007.

The SWSA Ten-Year Award recognizes the highest impact papers from the ISWC proceedings ten years prior (i.e., in 2017 the award honors a paper from 2007). The decision is based primarily, but not exclusively, on the number of citations to the papers from the proceedings in the intervening decade.

DBpedia is a large-scale cross-domain knowledge base which we extract from Wikipedia and make available on the Web under an open license. DBpedia allows users to ask sophisticated queries against Wikipedia knowledge and serves as an interlinking hub in the Web of Linked Data. In addition, DBpedia is widely used as background knowledge for applications such as search, natural language understanding, and data integration.

According to Google Scholar, the paper "DBpedia: A Nucleus for a Web of Open Data" has been cited 2,770 times as of October 2017.

]]>
Research - Web-based Systems Projects Chris
news-2003 Wed, 04 Oct 2017 11:52:07 +0000 Paper accepted at K-CAP 2017: Detection of Relation Assertion Errors in Knowledge Graphs https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-k-cap-2017-detection-of-relation-assertion-errors-in-knowledge-graphs/ The paper "Detection of Relation Assertion Errors in Knowledge Graphs", authored by André Melo and Heiko Paulheim, has been accepted at the Ninth International Conference on Knowledge Capture (K-CAP 2017).

Abstract:

Although the link prediction problem, where missing relation assertions are predicted, has been widely researched, error detection has not received as much attention. In this paper, we investigate the problem of error detection in relation assertions of knowledge graphs, and we propose an error detection method which trains a classifier for every relation in the graph on path and type features, exploiting local feature selection. We perform an extensive evaluation on a variety of datasets, backed by a manual evaluation on DBpedia and NELL, and we propose and evaluate heuristics for the selection of relevant graph paths to be used as features in our method.

Download paper

]]>
Publications Group
news-1707 Thu, 14 Sep 2017 13:28:00 +0000 Paper accepted at EMNLP: MinIE: Minimizing Facts in Open Information Extraction https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-emnlp-minie-minimizing-facts-in-open-information-extraction/ The paper "MinIE: Minimizing Facts in Open Information Extraction" (pdf) by K. Gashteovski, L. del Corro, and R. Gemulla has been accepted for the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Abstract
The goal of Open Information Extraction (OIE) is to extract surface relations and their arguments from natural-language text in an unsupervised, domain-independent manner. In this paper, we propose MinIE, an OIE system that aims to provide useful, compact extractions with high precision and recall. MinIE approaches these goals by (1) representing information about polarity, modality, attribution, and quantities with semantic annotations instead of in the actual extraction, and (2) identifying and removing parts that are considered overly specific. We conducted an experimental study with several real-world datasets and found that MinIE achieves competitive or higher precision and recall than most prior systems, while at the same time producing shorter, semantically enriched extractions.

]]>
Publications Rainer
news-1988 Wed, 13 Sep 2017 08:10:14 +0000 EMNLP 2017 Outstanding Paper https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/emnlp-2017-outstanding-paper/ Our paper on "Topic-Based Agreement and Disagreement in US Electoral Manifestos", co-authored with the DH Group at FBK Trento, was selected as an Outstanding Paper at EMNLP 2017, one of the premier conferences in the field of NLP.

The paper presents a topic-based analysis of agreement and disagreement in US political manifestos, which relies on a new method for topic detection based on key concept clustering. Data and software can be found here.

]]>
Research Simone
news-1984 Mon, 11 Sep 2017 06:16:10 +0000 Semester Kick-Off BBQ https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/semester-kick-off-bbq-1/ As it is a good tradition at the DWS group, we kicked off the semester with our semester start BBQ. Since we were lucky to catch one of the last warm and sunny evenings of the late summer, more than 50 students joined and took the opportunity to enjoy food and drinks, to learn about the research and teaching activities of the group, and to get in touch with the DWS group in an informal setting.

]]>
Group Other
news-1970 Fri, 01 Sep 2017 17:34:28 +0000 Juniorprofessor for Text Analytics for Interdisciplinary Research https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/juniorprofessor-for-text-analytics-for-interdisciplinary-research/ The Data and Web Science group is very happy to announce that Prof. Dr. Goran Glavaš has been appointed as Juniorprofessor for Text Analytics for Interdisciplinary Research!

Below, you find an overview of Professor Glavaš' research interests as well as the courses in which he will participate.

Research Interests 

Prof. Glavaš' interests lie in statistical natural language processing (NLP), with a special focus on:

  • Lexical and Computational Semantics
  • Information Extraction
  • Multilingual and Cross-lingual NLP
  • NLP Applications for Social Sciences and Humanities

Teaching

Prof. Glavaš is currently involved in teaching activities in the following courses:

  • Information Retrieval & Web Search
  • Text Analytics
  • Web Mining
  • Knowledge Management
  • Seminar in Text Analytics

Former Positions

  • 2015-2017: PostDoc at the University of Mannheim
  • 2014-2015: PostDoc at the University of Zagreb
  • 2011-2014: Research assistant (and PhD student) at the University of Zagreb

Further Information

Further information about Professor Glavaš can be found on his webpage.

]]>
Other Topics - Artificial Intelligence (NLP)
news-1966 Fri, 25 Aug 2017 09:06:16 +0000 DWS Company Outing https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-company-outing/ In the 2017 edition of the DWS company outing we once again visited the beautiful Neckar region. While we took a riverboat two years ago, we decided to go by canoe or on foot this year. The canoe group started in Hirschhorn with the final destination Neckargemünd. On their way (about 2-3 hours out on the water) they had to master several dangerous challenges, including the passing of a water gate. The hiking group started in Neckarsteinach, taking a route over Dilsberg, also with the final destination Neckargemünd. All of us met at the target location (some earlier, some later), where we visited a restaurant to celebrate the successful completion of our excursion.

]]>
Other
news-1964 Tue, 22 Aug 2017 11:22:20 +0000 Artificial Intelligence Journal 2017 Prominent Paper Award https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/artificial-intelligence-journal-2017-prominent-paper-award/ The BabelNet paper that Simone co-authored back in 2012 with his former colleague Roberto Navigli won the Prominent Paper Award 2017 of the Artificial Intelligence journal, the most prestigious journal in the field of AI.

The award recognizes outstanding papers published not more than seven years ago in the AI Journal that are exceptional in their significance and impact. You can find the official announcement here.

]]>
Research Simone
news-1958 Thu, 17 Aug 2017 12:09:02 +0000 We won the CVPR 2017 Multiple-Object Tracking Challenge https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/we-won-the-cvpr-2017-multiple-object-tracking-challenge/ Our approach

Motion Segmentation and Multiple Object Tracking by Correlation Clustering, Margret Keuper, Siyu Tang, Bjoern Andres, Thomas Brox and Bernt Schiele

won the

CVPR 2017 Multiple-Object Tracking Challenge.

A detailed method description is given in: M. Keuper, S. Tang, Y. Zhongjie, B. Andres, T. Brox, B. Schiele: A multi-cut formulation for joint segmentation and tracking of multiple objects. arXiv preprint arXiv:1607.06317, 2016.

Abstract: Recently, Minimum Cost Multicut Formulations have been proposed and proven to be successful in both motion trajectory segmentation and multi-target tracking scenarios. Both tasks benefit from decomposing a graphical model into an optimal number of connected components based on attractive and repulsive pairwise terms. The two tasks are formulated on different levels of granularity and, accordingly, leverage mostly local information for motion segmentation and mostly high-level information for multi-target tracking. In this paper we argue that point trajectories and their local relationships can contribute to the high-level task of multi-target tracking and also argue that high-level cues from object detection and tracking are helpful to solve motion segmentation. We propose a joint graphical model for point trajectories and object detections whose Multicuts are solutions to motion segmentation and multi-target tracking problems at once.

]]>
Research
news-1949 Wed, 19 Jul 2017 12:01:14 +0000 Nomination for the Best Student Paper Award at JCDL 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/nomination-for-the-best-student-paper-award-at-jcdl-2017/ Our paper on "Building Entity-Centric Event Collections" was nominated for the Best Student Paper Award at the 2017 edition of the Joint Conference on Digital Libraries (JCDL), the top conference in the field of digital libraries. The work presented in the paper is a collaboration between the DWS group and Prof. Laura Dietz at the University of New Hampshire in the context of an Elite Post-Doc grant of the Baden-Württemberg Stiftung recently awarded to Laura. The paper is available here.

Abstract. Web archives preserve an unprecedented abundance of materials regarding major events and transformations in our society. In this paper, we present an approach for building event-centric subcollections from large archives, which include not only the core documents related to the event itself but, even more importantly, documents describing related aspects (e.g., premises and consequences). This is achieved by 1) identifying relevant concepts and entities from a knowledge base, and 2) detecting their mentions in documents, which are interpreted as indicators for relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record, and we test its performance on the TREC KBA Stream corpus, a large and publicly available web archive.

]]>
Research Simone
news-1939 Fri, 14 Jul 2017 07:55:09 +0000 DSGDpp library for parallel matrix factorization available as open source https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dsgdpp-library-for-parallel-matrix-factorization-available-as-open-source/ We have open-sourced the DSGDpp library, which contains implementations of various parallel algorithms for computing low-rank matrix factorizations. Both shared-memory and shared-nothing (via MPI) implementations are provided.

This library (finally!) makes our implementations of the DSGD++ algorithm of Teflioudi et al. (2012) and the CSGD algorithm of Makari et al. (2013) publicly available.
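The core technique behind these algorithms, stochastic gradient descent on a regularized low-rank factorization V ≈ WH, can be sketched in a few lines (a plain sequential illustration with hypothetical parameter defaults, not the library's parallel DSGD++ or CSGD implementation):

```python
import random

def sgd_factorize(entries, n_rows, n_cols, rank=2, lr=0.01, reg=0.05,
                  epochs=500, seed=0):
    """Factorize a sparse matrix, given as (i, j, value) triples, into two
    low-rank factors W and H by stochastic gradient descent."""
    rnd = random.Random(seed)
    entries = list(entries)  # do not mutate the caller's list when shuffling
    W = [[rnd.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rnd.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        rnd.shuffle(entries)
        for i, j, v in entries:
            err = v - sum(W[i][k] * H[j][k] for k in range(rank))
            for k in range(rank):
                w, h = W[i][k], H[j][k]
                W[i][k] += lr * (err * h - reg * w)  # step on the row factor
                H[j][k] += lr * (err * w - reg * h)  # step on the column factor
    return W, H
```

In the parallel DSGD-style variants, the entries are additionally partitioned into blocks so that blocks sharing no rows or columns can be processed concurrently without conflicting updates.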

More information can be found at the GitHub page: github.com/uma-pi1/DSGDpp

]]>
Research - Data Analytics Research Topics - Data Mining Rainer
news-1937 Tue, 11 Jul 2017 08:10:36 +0000 Paper accepted at VLDB 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-vldb-2017/ We have a paper accepted at the 43rd International Conference on Very Large Data Bases (VLDB 2017), a premier conference in the field of databases and data management. The conference takes place in Munich at the end of August 2017.

Authors:
Oliver Lehmberg, Christian Bizer 

Title:
Stitching Web Tables for Improving Matching Quality

Abstract:
HTML tables on web pages ("web tables") cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be matched to the respective knowledge base or base table. The challenges of web table matching are the high heterogeneity and the small size of the tables.
Though it is known that the majority of web tables are very small, the gold standards that are used to compare web table matching systems mostly consist of larger tables. In this experimental paper, we evaluate T2K Match, a web table to knowledge base matching system, and COMA, a standard schema matching tool, using a sample of web tables that is more realistic than the gold standards that were previously used. We find that both systems fail to produce correct results for many of the very small tables in the sample. As a remedy, we propose to stitch (combine) the tables from each web site into larger ones and match these enlarged tables to the knowledge base or base table afterwards. For this stitching process, we evaluate different schema matching methods in combination with holistic correspondence refinement. Limiting the stitching procedure to web tables from the same web site decreases the heterogeneity and allows us to stitch tables with very high precision. Our experiments show that applying table stitching before running the actual matching method improves the matching results by 0.38 in F1-measure for T2K Match and by 0.14 for COMA. Also, stitching the tables allows us to reduce the number of tables in our corpus from 5 million original web tables to as few as 100,000 stitched tables.
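The stitching step itself can be illustrated with a toy sketch (hypothetical table layout, and exact-header grouping for brevity; the paper evaluates proper schema matching methods with holistic correspondence refinement instead of header equality):

```python
from collections import defaultdict
from urllib.parse import urlparse

def stitch_tables(tables):
    """Union web tables that come from the same web site and share a header.
    Each table is a dict: {'url': str, 'header': tuple, 'rows': list}."""
    groups = defaultdict(list)
    for t in tables:
        site = urlparse(t['url']).netloc          # stitch only within one site
        groups[(site, t['header'])].append(t)
    return [{'site': site, 'header': header,
             'rows': [row for t in ts for row in t['rows']]}
            for (site, header), ts in groups.items()]
```

Replacing the exact-header equality with a schema matcher is what lets the actual system also stitch tables whose headers differ syntactically.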

]]>
Research - Web-based Systems Publications Chris
news-1934 Thu, 06 Jul 2017 19:55:31 +0000 Four papers accepted at EMNLP 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/four-papers-accepted-at-emnlp-2017/ We have four papers accepted at the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), one of the top-tier conferences in the field of Natural Language Processing:

  • Kiril Gashteovski, Rainer Gemulla, Luciano del Corro: MinIE: Minimizing Facts in Open Information Extraction
  • Goran Glavaš and Simone Paolo Ponzetto: Dual Tensor Model for Detecting Asymmetric Lexico-Semantic Relations
  • Stefano Menini, Federico Nanni, Simone Paolo Ponzetto and Sara Tonelli: Topic-Based Agreement and Disagreement in US Electoral Manifestos
  • Alexander Panchenko, Fide Marten, Eugen Ruppert, Stefano Faralli, Dmitry Ustalov, Simone Paolo Ponzetto and Chris Biemann: Unsupervised & Knowledge-Free & Interpretable Word Sense Disambiguation
]]>
Rainer Simone Research Publications
news-1930 Tue, 27 Jun 2017 12:40:38 +0000 Two papers accepted at German Conference on Artificial Intelligence https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/two-papers-accepted-at-german-conference-on-artificial-intelligence/ Two papers - "A Robust Number Parser based on Conditional Random Fields" by Heiko Paulheim and "One Knowledge Graph to Rule them All?" by Daniel Ringler and Heiko Paulheim - have been accepted at the 40th German Conference on Artificial Intelligence (KI 2017).

A Robust Number Parser based on Conditional Random Fields

Abstract: When processing information from unstructured sources, numbers have to be parsed in many cases to do useful reasoning on that information. However, since numbers can be expressed in different ways, a robust number parser that can cope with number representations in different shapes is required in those cases. In this paper, we show how to train such a parser based on Conditional Random Fields. As training data, we use pairs of Wikipedia infobox entries and numbers from public knowledge graphs. We show that it is possible to parse numbers at an accuracy of more than 90%.
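The variety of surface forms such a parser has to cope with can be illustrated by a small rule-based normalizer (a toy sketch of the problem with heuristic separator handling, not the CRF-based approach of the paper):

```python
import re

SCALES = {'thousand': 1e3, 'million': 1e6, 'billion': 1e9}

def parse_number(text):
    """Normalize a number written in one of several surface forms, e.g.
    '1,234.5', '1.234,5' (decimal comma), '1 234 567', or '3.4 million'.
    Heuristic toy rules; real-world inputs are far messier."""
    m = re.match(r'([\d.,\s]+)\s*([a-z]*)', text.strip().lower())
    if not m or not m.group(1).strip():
        raise ValueError('not a number: %r' % text)
    num, suffix = m.group(1).strip(), m.group(2)
    num = num.replace(' ', '')                 # spaces only ever group digits
    if re.search(r'\d\.\d{3}(\D|$)', num):     # '.' used as thousands separator
        num = num.replace('.', '').replace(',', '.')
    elif re.search(r',\d{1,2}$', num):         # trailing ',d' or ',dd': decimal comma
        num = num.replace('.', '').replace(',', '.')
    else:                                      # English convention: ',' groups thousands
        num = num.replace(',', '')
    return float(num) * SCALES.get(suffix, 1.0)
```

The CRF approach replaces such hand-written rules with labels learned per character or token, which is what makes it robust to shapes these heuristics miss.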

Download PDF

One Knowledge Graph to Rule them All? Analyzing the Differences between DBpedia, YAGO, Wikidata & co.

Abstract: Public Knowledge Graphs (KGs) on the Web are considered a valuable asset for developing intelligent applications. They contain general knowledge which can be used, e.g., for improving data analytics tools, text processing pipelines, or recommender systems. While the large players, e.g., DBpedia, YAGO, or Wikidata, are often considered similar in nature and coverage, there are, in fact, quite a few differences. In this paper, we quantify those differences, and identify the overlapping and the complementary parts of public KGs. From those considerations, we can conclude that the KGs are hardly interchangeable, and that each of them has its strengths and weaknesses when it comes to applications in different domains.

Download PDF

]]>
Group Research Publications
news-1929 Tue, 27 Jun 2017 12:35:29 +0000 Article published in AI Review Journal https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/article-published-in-ai-review-journal/ The article "Local and global feature selection for multilabel classification with binary relevance" by André Melo and Heiko Paulheim has been accepted and is going to be published in the Artificial Intelligence Review journal.

Abstract:
Multilabel classification has become increasingly important for various use cases. Amongst the existing multilabel classification methods, problem transformation approaches, such as Binary Relevance, Pruned Problem Transformation, and Classifier Chains, are some of the most popular, since they break a global multilabel classification problem into a set of smaller binary or multiclass classification problems. Transformation methods enable the use of two different feature selection approaches: local, where the selection is performed independently for each of the transformed problems, and global, where the selection is performed on the original dataset, meaning that all local classifiers work on the same set of features. While global methods have been widely researched, local methods have received little attention so far. In this paper, we compare those two strategies on one of the most straightforward transformation approaches, i.e., Binary Relevance. We empirically compare their performance on various flat and hierarchical multilabel datasets of different application domains. We show that local outperforms global feature selection in terms of classification accuracy, without drawbacks in runtime performance.
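The local strategy on Binary Relevance can be sketched as follows (an illustrative stdlib-only sketch with a toy mean-difference feature scorer and a nearest-centroid base learner, not the authors' implementation):

```python
def select_features(X, y, k):
    """Local feature selection: score each feature by the absolute difference
    of its mean in positive vs. negative examples and keep the top k."""
    def mean(mask, f):
        vals = [x[f] for x, m in zip(X, mask) if m]
        return sum(vals) / len(vals) if vals else 0.0
    pos = [label == 1 for label in y]
    neg = [not p for p in pos]
    scores = sorted(((abs(mean(pos, f) - mean(neg, f)), f)
                     for f in range(len(X[0]))), reverse=True)
    return sorted(f for _, f in scores[:k])

def centroid_classifier(Xs, y):
    """Toy base learner: nearest class centroid (assumes both classes occur)."""
    def centroid(cls):
        rows = [x for x, lab in zip(Xs, y) if lab == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return lambda x: 1 if dist(x, c1) < dist(x, c0) else 0

def train_binary_relevance(X, Y, k, make_classifier):
    """Binary Relevance: one binary problem per label, each trained on its own
    locally selected feature subset (global selection would share one subset)."""
    models = []
    for label in range(len(Y[0])):
        y = [row[label] for row in Y]
        feats = select_features(X, y, k)
        models.append((feats, make_classifier([[x[f] for f in feats] for x in X], y)))
    return models

def predict(models, x):
    return [clf([x[f] for f in feats]) for feats, clf in models]
```

The point of the local variant is visible in `train_binary_relevance`: each label gets its own `feats`, so different binary problems may rely on entirely different features.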

Download PDF

]]>
Group Research Publications
news-1928 Fri, 23 Jun 2017 08:52:12 +0000 Web Data Integration Framework (WInte.r) released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/web-data-integration-framework-winter-released/ We are happy to announce the release of the Web Data Integration Framework (WInte.r).

WInte.r is a Java framework for end-to-end data integration. The framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions. In addition, these pre-defined building blocks can be used as foundation for implementing advanced integration methods.
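As an illustration of the building-block idea, here is a minimal identity-resolution pipeline with a blocker, a similarity function, and a threshold-based matching rule (a Python sketch of the concept only; WInte.r itself is a Java framework with its own interfaces):

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity-function building block: normalized edit-based similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def blocker(records):
    """Blocking building block: only records sharing a first letter are compared."""
    blocks = {}
    for r in records:
        blocks.setdefault(r['name'][:1].lower(), []).append(r)
    return blocks.values()

def match(records_a, records_b, threshold=0.8):
    """Matching-rule building block: keep cross-source pairs above a threshold."""
    pairs = []
    for block in blocker(records_a + records_b):
        for ra in (r for r in block if r['source'] == 'A'):
            for rb in (r for r in block if r['source'] == 'B'):
                if name_similarity(ra['name'], rb['name']) >= threshold:
                    pairs.append((ra['id'], rb['id']))
    return pairs
```

Swapping any one of the three functions, e.g. a token-based similarity or a sorted-neighborhood blocker, leaves the rest of the pipeline untouched, which is the customizability the framework description refers to.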

The WInte.r framework forms the foundation for our research on large-scale web data integration. The framework contains an implementation of the T2K Match algorithm for matching millions of Web tables against a central knowledge base. The framework is also used in the context of the DS4DM research project for matching tabular data for data search.

Besides being used for research, we also use the WInte.r framework for teaching. The students of our Web Data Integration course use the framework to solve the course case study. In addition, most students use the framework as the foundation for their term projects.

Detailed information about the WInte.r framework can be found at

https://github.com/olehmberg/winter

The WInte.r framework can be downloaded from the same web site. The framework can be used under the terms of the Apache 2.0 License.

]]>
Research - Web-based Systems Chris Projects
news-1948 Mon, 19 Jun 2017 11:32:00 +0000 Federico Nanni defended his PhD Thesis https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/federico-nanni-defended-his-phd-thesis/

On June 7th, Federico Nanni successfully defended his PhD thesis "The Web as a Historical Corpus: Collecting, Analysing and Selecting Sources on the Recent Past of Academic Institutions", jointly supervised by Prof. Simone Ponzetto and Prof. Maurizio Matteuzzi (University of Bologna).

]]>
Research Simone
news-1904 Sat, 17 Jun 2017 07:17:00 +0000 DWS Students Score Top Results in International Data Science Competition https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-students-score-top-results-in-international-data-science-competition/ The Data Mining Cup is an annual competition for data science students all over the world. Within six weeks' time, student teams have to solve a data science task based on real-world data. This year's task was to predict revenues in an online store for pharmaceuticals with varying prices. Each university was allowed to register two teams, and all in all, 202 teams from 150 universities in 48 countries participated in 2017.

The two teams from the University of Mannheim, master students in the Data Science and Business Informatics study programs, are among the top 10 of the 202 participating teams and will be invited to the prudsys personalization summit in Berlin on June 28th/29th to present their solutions. The final winners will be announced at the summit in Berlin.

It is the fourth year that students from the University of Mannheim have participated in the Data Mining Cup. Participation in the cup is an integral part of the Data Mining 2 course taught by Prof. Heiko Paulheim, allowing the students to deepen the skills acquired in the lecture in a competitive real-world setting.

We congratulate the two student teams on this great achievement!

]]>
Projects Other Topics - Data Mining
news-1959 Mon, 12 Jun 2017 21:01:00 +0000 Two short papers accepted at ACL 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/two-short-papers-accepted-at-acl-2017/ DWS will be present at the poster session of the forthcoming ACL 2017 conference, the top-tier conference in the field of Natural Language Processing, with two short papers.

  • Sergiu Nisioi; Sanja Štajner; Simone Paolo Ponzetto; Liviu P. Dinu: Exploring Neural Text Simplification Models
  • Sanja Štajner; Marc Franco-Salvador; Simone Paolo Ponzetto; Paolo Rosso; Heiner Stuckenschmidt: Sentence Alignment Methods for Improving Text Simplification Systems

]]>
Research Publications Simone
news-1873 Wed, 26 Apr 2017 07:27:04 +0000 New Professor on Computer Vision joining the DWS group: Prof. Dr.-Ing. Margret Keuper https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/new-professor-on-computer-vision-joining-the-dws-group-prof-dr-ing-margret-keuper/ The Data and Web Science group is very happy to announce that Prof. Dr.-Ing. Margret Keuper has joined the group as Juniorprofessor for Computer Vision and Image Processing. A warm welcome to Professor Keuper, we are very happy to have her in the group now!

Below, you find an overview of Professor Keuper's research as well as the courses that she will offer at the University of Mannheim.

Research

Extracting relevant information from a scene is an easy task for humans. During car driving, the driver knows where pedestrians are walking dangerously close to the street or which bicyclist might cross their way. Thus, image analysis is a highly relevant aspect in technologies such as humanoid robotics or autonomous driving. The task is to understand images and image sequences with their motion patterns, to recognize and delineate object categories such as "person", "pedestrian" or "car". Such questions drive the research of computer scientist Margret Keuper. Her focus is on the formulation and optimization of segmentation or clustering problems, i.e. the grouping of picture and video elements to relevant entities such as pedestrians or driving cars in crowded street scenes. During her PhD, she focused on the automated analysis of volumetric, microscopic recordings, with high relevance for biological research questions, for example in collaboration with the Max-Planck-Institute for Immunobiology and Epigenetics.

Research Interests

  • Image and Video Segmentation
  • Motion Analysis

Teaching

  • Image Processing (autumn semester): This lecture focuses on the fundamentals of image processing. Central aspects are: imaging and fundamental image operators, feature extraction, image segmentation, and motion estimation.
  • Higher Level Computer Vision (spring semester): This lecture focuses on more complex computer vision algorithms, such as image and video segmentation by continuous and combinatorial optimization, motion estimation, and depth reconstruction. A further important topic is the fundamentals of convolutional neural networks for the implementation of deep learning algorithms for computer vision.

Former Positions

  • 2012-2017: PostDoc at the University of Freiburg
  • 2013-2017: Visiting researcher at the Max-Planck-Institute for Informatics, Saarbrücken

Further Information

Further information about Professor Keuper is found on her webpage.

]]>
Research Other
news-1872 Tue, 25 Apr 2017 10:28:00 +0000 Paper accepted at IJCAI 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-ijcai-2017/ We have a paper accepted at the 26th International Joint Conference on Artificial Intelligence (IJCAI), the premier conference in the field of AI:

  • Sanja Štajner, Simone Paolo Ponzetto and Heiner Stuckenschmidt: Automatic Assessment of Absolute Sentence Complexity.

The work presented in the paper is a collaboration between the NLP and AI groups of DWS in the context of project C4 of the Collaborative Research Center SFB 884 on "Political Economy of Reforms".

]]>
Research Publications Simone Heiner
news-1848 Thu, 23 Mar 2017 13:07:07 +0000 Paper accepted at JCDL 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-jcdl-2017/ We have a paper accepted at the 2017 Joint Conference on Digital Libraries (JCDL), the top conference in the field of digital libraries:

  • Federico Nanni, Simone Paolo Ponzetto and Laura Dietz: Building Entity-Centric Event Collections.

The work presented in the paper is a collaboration between the DWS group and Prof. Laura Dietz at the University of New Hampshire in the context of an Elite Post-Doc grant of the Baden-Württemberg Stiftung recently awarded to Laura.

]]>
Research Publications Simone
news-1847 Wed, 22 Mar 2017 11:53:12 +0000 Application open for Mannheim Master in Data Science starting in Fall term https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/application-open-for-mannheim-master-in-data-science-starting-in-fall-term/ The application for the Mannheim Master in Data Science programme is now open for the Fall term (starting in September 2017). You can apply online until May 31st.

Please review the admission criteria first to make sure you include all necessary files.

Link to the online application system

]]>
Group Other
news-1840 Tue, 14 Mar 2017 14:43:49 +0000 Two papers accepted at ESWC 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/two-papers-accepted-at-eswc-2017/ Two papers from the DWS group have been accepted at ESWC 2017: "Synthesizing Knowledge Graphs for Link and Type Prediction Benchmarking", and "Data-driven Joint Debugging of the DBpedia Mappings and Ontology: Towards Addressing the Causes instead of the Symptoms of Data Quality in DBpedia".

"Synthesizing Knowledge Graphs for Link and Type Prediction Benchmarking" by André Melo and Heiko Paulheim.

Abstract:

Despite the growing amount of research in link and type prediction in knowledge graphs, systematic benchmark datasets are still scarce. In this paper, we propose a synthesis model for the generation of benchmark datasets for those tasks. Synthesizing data is a way of having control over important characteristics of the data, and allows the study of the impact of such characteristics on the performance of different methods. The proposed model uses existing knowledge graphs to create synthetic graphs with similar characteristics, such as distributions of classes, relations, and instances, the selection of instances, and horn rules. As a first step, we replicate already existing knowledge graphs in order to validate the synthesis model. To do so, we perform extensive experiments with different link and type prediction methods. We show that we can systematically create knowledge graph benchmarks which allow for quantitative measurements of the result quality and scalability of link and type prediction methods.

Download paper

"Data-driven Joint Debugging of the DBpedia Mappings and Ontology: Towards Addressing the Causes instead of the Symptoms of Data Quality in DBpedia" by Heiko Paulheim

Abstract:

DBpedia is a large-scale, cross-domain knowledge graph extracted from Wikipedia. For the extraction, crowd-sourced mappings from Wikipedia infoboxes to the DBpedia ontology are utilized. In this process, different problems may arise: users may create wrong and/or inconsistent mappings, use the ontology in an unforeseen way, or change the ontology without considering all possible consequences. In this paper, we present a data-driven approach to discover problems in mappings as well as in the ontology and its usage in a joint, data-driven process. We show both quantitative and qualitative results about the problems identified, and derive proposals for altering mappings and refactoring the DBpedia ontology.

Download paper

]]>
Group Publications
news-1839 Tue, 14 Mar 2017 08:11:52 +0000 Best Paper Award at ICAART 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/best-paper-award-at-icaart-2017/ We are happy to announce that our paper Where is that Button Again?! – Towards a Universal GUI Search Engine won the best paper award at the 9th International Conference on Agents and Artificial Intelligence (ICAART 2017) in the artificial intelligence area.

In feature-rich software, a wide range of functionality is spread across various menus, dialog windows, toolbars, etc. Remembering where to find each feature is usually very hard, especially if it is not regularly used. We therefore provide a GUI search engine which is universally applicable to a large number of applications. Besides giving an overview of related approaches, we describe three major problems we had to solve: analyzing the GUI, understanding the user's query, and executing a suitable solution to find the desired UI element. Based on a user study, we evaluated our approach and showed that it is particularly useful when searching for a feature that is not regularly used. We have already identified much potential for further applications based on our approach.

This research was funded in part by the German Federal Ministry of Education and Research under grant no. 01IS12050 (project SuGraBo).

The paper was one of the 32 full papers accepted for presentation in the artificial intelligence area of the conference.

]]>
Publications Projects
news-1837 Mon, 13 Mar 2017 10:36:46 +0000 Robert Meusel defended his PhD Thesis https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/robert-meusel-defended-his-phd-thesis/ On March 10th, Robert Meusel successfully defended his PhD thesis Web-Scale Profiling of Semantic Annotations in HTML Pages. Supervisor was Prof. Christian Bizer, second reader was Prof. Wolfgang Nejdl from Leibniz Universität Hannover. 

Abstract of the thesis:

The vision of the Semantic Web was coined by Tim Berners-Lee almost two decades ago. The idea describes an extension of the existing Web in which “information is given well-defined meaning, better enabling computers and people to work in cooperation” [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one realization of this vision which was adopted by large numbers of web sites in the last years. Semantic annotations are integrated into the code of HTML pages using one of the three markup languages Microformats, RDFa, or Microdata. Major consumers of semantic annotations are the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic annotations from crawled web pages to enrich the presentation of search results and to complement their knowledge bases. However, outside the large search engine companies, little is known about the deployment of semantic annotations: How many web sites deploy semantic annotations? What are the topics covered by semantic annotations? How detailed are the annotations? Do web sites use semantic annotations correctly? Are semantic annotations useful for others than the search engine companies? And how can semantic annotations be gathered from the Web in that case? The thesis answers these questions by profiling the web-wide deployment of semantic annotations. The topic is approached in three consecutive steps: In the first step, two approaches for extracting semantic annotations from the Web are discussed. The thesis evaluates first the technique of focused crawling for harvesting semantic annotations. Afterward, a framework to extract semantic annotations from existing web crawl corpora is described. The two extraction approaches are then compared for the purpose of analyzing the deployment of semantic annotations in the Web. In the second step, the thesis analyzes the overall and markup language-specific adoption of semantic annotations. 
This empirical investigation is based on the largest web corpus that is available to the public. Further, the topics covered by deployed semantic annotations and their evolution over time are analyzed. Subsequent studies examine common errors within semantic annotations. In addition, the thesis analyzes the data overlap of the entities that are described by semantic annotations from the same and across different web sites. The third step narrows the focus of the analysis towards use case-specific issues. Based on the requirements of a marketplace, a news aggregator, and a travel portal the thesis empirically examines the utility of semantic annotations for these use cases. Additional experiments analyze the capability of product-related semantic annotations to be integrated into an existing product categorization schema. Especially, the potential of exploiting the diverse category information given by the web sites providing semantic annotations is evaluated.

Keywords:

Dataspace Profiling, RDFa, Microformats, Microdata, Schema.org, Crawling

Full-text:

The full-text of the thesis is available from the MADOC document server. 

]]>
Chris Research
news-1824 Fri, 24 Feb 2017 10:45:50 +0000 Journal Papers on Demand Forecasting and Process Model Management accepted https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/journal-papers-on-demand-forecasting-and-process-model-management-accepted/ Two journal papers from the Artificial Intelligence group have been accepted recently.

The paper " Cluster-based hierarchical demand forecasting for perishable goods by Jakob Huber Alexander Gossmann and Heiner Stuckenschmidt has been accepted for Publication in Elseviers' Expert Systems with Applications(Impact Factor . 2.981)

The Paper "Overcoming Individual Process Model Matcher Weaknesses Using Ensemble Matching" By Christian Meilicke, Henrik Leopold, Elena Kuss and Heiner Stuckenschmidt has been accepted in Elsevier's Decision Support Systems (Impact Factor 2.604)

]]>
Research Publications Heiner
news-1823 Fri, 24 Feb 2017 10:28:00 +0000 Semester Kick-off BBQ https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/semester-kick-off-bbq/ Traditionally, the DWS group takes the beginning of each new semester as an opportunity to host a barbecue in order to welcome new colleagues and introduce the upcoming courses to the best students of last semester. This year, the BBQ was also used to welcome the first students of the new Mannheim Master in Data Science. Accompanied by cold beverages and grilled food the professors gave an overview of the current activities of the group and presented the spring/summer semester program. The courses for this term are:

  • Knowledge Management
  • Data Mining I
  • Data Mining II
  • Web Mining
  • Web Search and Information Retrieval
  • Large-Scale Data Management

We thank all the participants for coming and wish our students a good and successful start into the new semester!

German Version 

Traditionell nutzt die Forschungsgruppe Data und Web Science den Beginn des Semesters, um bei einem Grillfest neue Kolleginnen und Kollegen willkommen zu heißen, das aktuelle Lehrangebot vorzustellen und dazu die besten Studierenden des letzten Semesters einzuladen. Dieses Semester wurde zudem die Gelegenheit genutzt, die ersten Studenten des neuen Mannheim Master in Data Science willkommen zu heißen. Begleitet von Grillgut und kühlen Getränken präsentierten die Professoren die nächsten Kurse des aktuellen Semesters.

Die folgenden Kurse wurden vorgestellt:

  • Knowledge Management
  • Data Mining I
  • Data Mining II
  • Web Mining
  • Web Search and Information Retrieval
  • Large-Scale Data Management

Wir bedanken uns bei allen Teilnehmern für ihr Kommen und wünschen unseren Studenten einen guten und erfolgreichen Start ins neue Semester!

]]>
Group Other
news-1822 Thu, 23 Feb 2017 11:41:31 +0000 Three short papers accepted at DH2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/three-short-papers-accepted-at-dh2017/ DWS will be present at the forthcoming DH 2017 conference, the premier forum for research in Digital Humanities, with three posters.

  • Nanni, Federico; Marinov, Nikolay; Ponzetto, Simone Paolo; Dietz, Laura. Building Entity-Centric Event Collections For Supporting Research in Political and Social History.
  • Nanni, Federico; Zhao, Yang; Ponzetto, Simone Paolo; Dietz, Laura. Enhancing Domain-Specific Entity Linking in DH.
  • Lauscher, Anne; Nanni, Federico; Ponzetto, Simone Paolo. SLaTE: A System for Labeling Topics with Entities.
]]>
Research Publications Simone
news-1819 Tue, 21 Feb 2017 08:20:04 +0000 Registration Open for DataFest Germany 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/registration-open-for-datafest-germany-2017-1/ DataFest is a competition and networking event. You will get the unique opportunity to work with a large dataset and analyze it using your own ideas as well as meet leaders and companies in the field of statistics. DataFest began in 2011 at UCLA, and is now sponsored by the American Statistical Association.

DataFest Germany 2017 is the third annual DataFest event organized in Germany. It is hosted by a consortium of the Statistics and Social Science Methodology Chair at the University of Mannheim, the Institute of Statistics of LMU Munich, and the P3 Group.

The task and dataset of this year's DataFest are still secret, but registration for the DataFest is already open: Bachelor's and Master's students of all subjects are welcome to apply in teams of 2-5 people. Applications are open until Thursday, March 12, 2017. Space is limited, so only the first 20 teams will be accepted.

As DWS students were quite successful and won a prize at DataFest 2015 in Mannheim, we strongly encourage our students to participate again in this year's DataFest.

]]>
Projects Topics - Data Mining
news-1807 Thu, 09 Feb 2017 20:07:34 +0000 Two short papers accepted at EACL 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/two-short-papers-accepted-at-eacl-2017/ DWS will be present at the poster session of the forthcoming EACL 2017 conference, the premier European forum for research in Natural Language Processing, with two short papers.

  • Goran Glavaš, Federico Nanni and Simone Paolo Ponzetto: Unsupervised Cross-Lingual Scaling of Political Texts
  • Patrick Klein, Simone Paolo Ponzetto and Goran Glavaš: Improving Neural Knowledge Base Completion with Cross-Lingual Projections

]]>
Research Publications Simone
news-1786 Tue, 17 Jan 2017 14:47:30 +0000 44.2 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/442-billion-quads-microdata-embedded-json-ld-rdfa-and-microformat-data-published/ The DWS group is happy to announce a new release of the WebDataCommons Microdata, Embedded JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the October 2016 version of the CommonCrawl covering 3.2 billion HTML pages which originate from 34 million websites (pay-level domains).

Altogether, we discovered structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.6 million different pay-level domains out of the 34 million pay-level domains covered by the crawl (16.5%).

Approximately 2.5 million of these websites use Microdata, 2.1 million websites employ JSON-LD, and 938 thousand websites use RDFa. Microformats are used by over 1.6 million websites within the crawl.

Background: 

More and more websites annotate structured data within their HTML pages using markup formats such as RDFa, Microdata, embedded JSON-LD, and Microformats. The annotations cover topics such as products, reviews, people, organizations, places, events, and cooking recipes.

The WebDataCommons project extracts all Microdata, RDFa data, and Microformat data, and since 2015 also embedded JSON-LD data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. 
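To illustrate what such embedded annotations look like in practice, here is a minimal sketch (not the project's actual extraction pipeline, which is based on Any23) that pulls embedded JSON-LD blocks out of an HTML page using only the Python standard library; the page content and the `JsonLdExtractor` class name are hypothetical:

```python
# Illustrative sketch: collect embedded JSON-LD from an HTML page.
# This is NOT the WebDataCommons/Any23 extractor, just a stdlib-only example.
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []          # parsed JSON-LD objects found in the page

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            try:
                self.items.append(json.loads(data))
            except json.JSONDecodeError:
                pass             # tolerate broken markup, as real pages require

# A made-up page with a schema.org Product annotation:
html_page = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body>...</body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(html_page)
print(extractor.items[0]["@type"])   # Product
```

A real-world extractor additionally has to handle Microdata and RDFa attributes, malformed JSON, and multiple annotation blocks per page, which is why the project relies on a dedicated parser library.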

Besides the markup data, the WebDataCommons project also provides large web table corpora and web graphs for download. General information about the WebDataCommons project is found at 

webdatacommons.org 


Data Set Statistics: 

Basic statistics about the October 2016 Microdata, Embedded JSON-LD, RDFa, and Microformat data sets, as well as the vocabularies that are used together with each markup format, are found at:

webdatacommons.org/structureddata/2016-10/stats/stats.html

Comparing these statistics to those of the November 2015 release of the data sets

webdatacommons.org/structureddata/2015-11/stats/stats.html

we see that the Microdata syntax remains the most dominant annotation format. Although it is hard to compare the adoption of the syntax between the two years in absolute numbers, as the October 2016 crawl corpus is almost double the size of the November 2015 one, a relative increase can be observed: in the October 2016 corpus, over 44% of the pay-level domains containing markup data make use of the Microdata syntax, compared to 40% one year earlier. Even though the absolute numbers for RDFa adoption rise, the relative increase does not keep pace with the growth of the corpus, indicating that RDFa is being used by a smaller share of websites. Similar to the 2015 release, the adoption of embedded JSON-LD has increased considerably, even though the main focus of the annotation remains the search action offered by the websites (70%).

As already observed in the previous years, the schema.org vocabulary is most frequently used in the context of Microdata while the adoption of its predecessor, the data vocabulary, continues to decrease. In the context of RDFa, we still find the Open Graph Protocol recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue. We see that besides navigational, blog, and CMS-related meta-information, many websites annotate e-commerce-related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress). More concretely, the October 2016 corpus includes more than 682 million product records originating from 249 thousand websites which use the schema.org vocabulary. The new release contains postal address data for more than 291 million entities originating from 338 thousand websites. Furthermore, the content describing hotels has doubled in size in this release, with a total of 61 million hotel descriptions.

Visualizations of the main adoption trends concerning the different annotation formats, popular schema.org, as well as RDFa classes within the time span 2012 to 2016 are found at

webdatacommons.org/structureddata/

Download:

The overall size of the October 2016 Microdata, RDFa, Embedded JSON-LD, and Microformat data sets is 44.2 billion RDF quads. For download, we split the data into 9,661 files with a total size of 987 GB. 

webdatacommons.org/structureddata/2016-10/stats/how_to_get_the_data.html
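For readers who want a feel for the format: each quad is a line of subject, predicate, object, and graph, where the graph is the URL of the page the statement was extracted from. The following sketch parses one such line; the example line is made up, and real dumps should be read with a full N-Quads parser (e.g. the one in rdflib), since this simple regex only covers IRI/blank-node terms and plain or typed literals:

```python
# Minimal sketch of reading an N-Quads line (subject, predicate, object, graph).
# Not a complete N-Quads parser; assumes well-formed terms without escaped quotes.
import re

QUAD = re.compile(
    r'(\S+)\s+'             # subject: IRI or blank node
    r'(\S+)\s+'             # predicate: IRI
    r'(".*"\S*|\S+)\s+'     # object: literal (optionally typed/tagged), IRI, or blank node
    r'(\S+)\s*\.\s*$'       # graph: IRI of the page the quad came from
)

def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line doesn't match."""
    m = QUAD.match(line.strip())
    return m.groups() if m else None

line = ('_:node1 <http://schema.org/name> "Example Hotel" '
        '<http://example.com/page.html> .')
s, p, o, g = parse_quad(line)
print(o)   # prints "Example Hotel" (quotes included, as in the raw data)
```

Filtering the class-specific subset files mentioned below then amounts to streaming such lines and grouping them by the graph URL.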

In addition, we have created separate files for over 40 different schema.org classes, containing all quads from pages that deploy the specific class at least once.

webdatacommons.org/structureddata/2016-10/stats/schema_org_subsets.html

Lots of thanks to: 

+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 
+ the Ministry of Economy, Research and Arts of Baden-Württemberg, which supported the extraction and analysis of the October 2016 corpus by means of the ViCe project.


Have fun with the new data set. 

Anna Primpeli, Robert Meusel and Chris Bizer

]]>
Research - Data Mining and Web Mining Research - Data Analytics Topics - Data Mining Topics - Linked Data Projects Chris
news-1785 Fri, 13 Jan 2017 10:09:38 +0000 Open PhD or PostDoc Position in Data Search and Data Integration https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/open-phd-or-postdoc-position-in-data-search-and-data-integration/ The Data and Web Science Group at the University of Mannheim invites applications for the following open position:

    PhD Student or PostDoc in the field of Data Search, Data Integration, and Data Mining (18 months)

in a third party funded research project performed together with a leading German data mining company as industry partner [1]. The project will develop data search methods for enabling large amounts of intranet and web data to be used for (semi-) automatic table extension. Applicants should have experience in one or more of the following areas and be fluent in Java:

  • Data search, information retrieval, and efficient indexing structures
  • Data integration, schema and instance matching, data fusion
  • Data mining, in particular feature creation and selection

The Data and Web Science Group is one of the largest research groups in the area of data science in Germany, comprising 5 professors and over 25 PostDocs and PhD students, and thus provides a fruitful ecosystem for researchers.

The project builds on earlier work of the group [2-5]. We seek to fill the position with either a PostDoc researcher or a PhD student, depending on the individual skills of the candidate. The position is paid according to German regulations (TV-L 13, >3300€ before tax depending on your experience). The earliest possible starting date is March 1, 2017. The initial contract duration is 18 months (until the end of the project in August 2018). It is very likely that we will be able to offer a follow-up contract after the successful completion of the project.

Applications should contain a CV, university transcripts, a link to the PhD/master thesis, a publication list (if applicable), and a list of personal references. All applications received by February 6th, 2017 will receive full consideration, but you are invited to send your application earlier (as we already start interviewing candidates before the deadline).

Applications can be sent via e-mail to ds4dm-project(at)dwslab.de. Questions concerning the position are answered via the same email address by Christian Bizer and Heiko Paulheim.

The University of Mannheim seeks to increase the proportion of women in research and teaching. Preference will be given to suitably qualified women or persons with disabilities, all other considerations being equal.

[1] DS4DM website: http://dws.informatik.uni-mannheim.de/en/projects/ds4dm-data-search-for-data-mining/

[2] Lehmberg et al.: The Mannheim Search Join Engine. JWS 2015,
http://dx.doi.org/10.1016/j.websem.2015.05.001

[3] Ristoski et al.: Mining the Web of Linked Data with RapidMiner. JWS 2015,
http://dx.doi.org/10.1016/j.websem.2015.06.004

[4] Ritze et al.: Matching HTML Tables to DBpedia. WIMS 2015, 
http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Ritze-etal-MatchingTablesToDBpedia-WIMS2015.pdf

[5] Gentile et al.: Extending RapidMiner with Data Search and Integration Capabilities. Demo at ESWC 2016,
http://ub-madoc.bib.uni-mannheim.de/40718/1/DataSearchDemo.pdf

]]>
Chris Open Positions - Staff
news-1779 Thu, 22 Dec 2016 08:21:56 +0000 DAGStat-Bulletin features Mannheim Data Science degree programs https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dagstat-bulletin-features-mannheim-data-science-degree-programs-1/ The DAGStat-Bulletin of the Deutsche Arbeitsgemeinschaft Statistik features, in its December 2016 issue, the different Data Science degree programs that are offered by the University of Mannheim or in which professors from the university participate. The report, titled Mannheimer Data Science Offensive, can be found on page 4:

DAGStat-Bulletin - Neues über Statistik und aus den Gesellschaften der Deutschen Arbeitsgemeinschaft Statistik, Issue 18, December 2016.

The featured programs are:

  1. Mannheim Master in Data Science (MMDS)
  2. International Program in Survey and Data Science (IPSDS)
  3. Part-time Master Program Data Science (PTMDS)
]]>
Projects Chris
news-1773 Tue, 13 Dec 2016 10:18:15 +0000 Paper accepted at ICSC 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-icsc-2017/ Our paper on "Domain Adaptation for Automatic Detection of Speculative Sentences" by Sanja Štajner, Goran Glavas, Simone Paolo Ponzetto and Heiner Stuckenschmidt has been accepted as a full paper for the 11th International Conference on Semantic Computing (IEEE ICSC 2017).

]]>
Research Publications Heiner Simone
news-1772 Mon, 12 Dec 2016 15:12:18 +0000 Paper Accepted for PerCom 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-percom-2017/ The paper "Online Personalization of Cross-Subjects based Activity Recognition Models on Wearable Devices" by Timo Sztyler and Heiner Stuckenschmidt has been accepted as a full paper for the IEEE International Conference on Pervasive Computing and Communications 2017 (PerCom'17). Research Publications Heiner news-1771 Thu, 08 Dec 2016 13:34:20 +0000 Student team wins the community prize of the 2nd BMVI DATA-RUN https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/studententeam-gewinnt-community-preis-des-2nd-bmvi-data-run/ For 24 hours, more than 90 programmers developed innovative solutions for Mobility 4.0 at the Federal Ministry of Transport and Digital Infrastructure. Mannheim, 5 December 2016. Under the motto "Our data. Your ideas.", Federal Minister Alexander Dobrindt launched the 2nd BMVI DATA-RUN, the second government hackathon in Germany, on 2 December 2016. For 24 hours, programmers and founders developed innovative solutions for Mobility 4.0. To this end, they were given access to selected real-time data from the ministry and its subordinate agencies. Afterwards, an expert jury awarded the best ideas.

Quote from Alexander Dobrindt, Federal Minister of Transport and Digital Infrastructure:

"At the 2nd BMVI DATA-RUN we once again have more participants, more data, and more speed than the year before. This time the focus is on connecting real-time traffic data, and we provide real-time mobility data for this purpose. My goal is to build the best ecosystem for mobility start-ups in Germany. To this end, we are opening up our ministry's data treasures, bringing the creative minds together, and providing 100 million euros through the mFUND to promote digital innovation."

More than 90 programmers took part in the 2nd BMVI DATA-RUN. In total, 12 mostly professionally organized teams participated, some of which had also prepared intensively for the hackathon. Among them was the team 'Rhein-Neckar Data Generation' with 13 mostly student developers from the region. The team of the project partners Institute for Enterprise Systems of the University of Mannheim (InES), the Chair of GIScience at Heidelberg University, SAP Next-Gen Consulting, and geomer GmbH was initiated and coached by the Netzwerk Geoinformation der Metropolregion Rhein-Neckar e.V. (GeoNet.MRN). The community prize went to developers from the 'Rhein-Neckar Data Generation' team for their contribution "Truckoo": within 24 hours, they built an app that suggests suitable rest stops to truck drivers based on real-time parking occupancy data and other real-time data. Participation was made possible by the kind support of Metropolregion Rhein-Neckar GmbH, the Institute for Enterprise Systems of the University of Mannheim, SAP University Alliances, mayato GmbH, JobRouter AG, and GeoNet.MRN e.V.

]]>
Research
news-1770 Tue, 06 Dec 2016 09:54:05 +0000 Two papers accepted at EACL 2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/two-papers-accepted-at-eacl-2017/ We have two papers with the Language Technology Group of the University of Hamburg accepted for the forthcoming EACL 2017 conference, the premier European forum for research in Natural Language Processing.

  • Stefano Faralli, Alexander Panchenko, Chris Biemann and Simone Paolo Ponzetto: The ContrastMedium Algorithm: Taxonomy Induction From Noisy Knowledge Graphs With Just A Few Links
  • Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto and Chris Biemann: Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation

]]>
Research Simone Publications
news-1769 Mon, 05 Dec 2016 15:06:25 +0000 Industry Talk on Information Extraction for E-Commerce https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/industry-talk-on-information-extraction-for-e-commerce/ Martin Rezk from Rakuten Tokyo/Paris presented today his recent work in collaboration with Simona Maggio, Bruno Charron, David Purcell, Hirate Yu and Béranger Dumont on how Rakuten and PriceMinister (e-commerce) extract and combine semantic information from product titles, descriptions, and images to maintain and enhance their ontologies, to improve the user selling experience, and to build fine-grained marketing campaigns. 

Martin's work nicely fits together with the work within the DWS group on product data integration.

Details about parts of the talk can be found in their ISWC 2016 industry track paper on Extracting Semantic Information for e-Commerce.

]]>
Topics - Linked Data Research - Web-based Systems Research Projects
news-1766 Thu, 01 Dec 2016 09:44:09 +0000 ACM JDIQ Special Issue on Web Data Quality published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/acm-jdiq-special-issue-on-web-data-quality-published/ We are happy to announce that a special issue of the ACM Journal of Data and Information Quality (JDIQ) on Web Data Quality has been published. The special issue was edited by Christian Bizer, Luna Dong (Amazon), Ihab Ilyas (University of Waterloo), and Maria-Esther Vidal (Universidad Simon Bolivar).

A summary of the articles in the special issue is provided in the issue's editorial.

All articles of the special issue are accessible via the ACM Digital Library.

]]>
Publications Chris
news-1747 Tue, 15 Nov 2016 08:14:22 +0000 mayato GmbH becomes Industry Partner of Mannheim Master in Data Science https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/mayato-gmbh-becomes-industry-partner-of-mannheim-master-in-data-science-1/ The analysis and interpretation of large, often complex data sets are key factors for the economic success of companies, and this requires well-trained specialists. The BI analyst and consulting firm mayato and the School of Business Informatics and Mathematics of the University of Mannheim have agreed on a cooperation for the new degree program “Mannheim Master in Data Science” (MMDS), which starts in February 2017. The joint goal is to offer a scientifically sound education for data scientists that is oriented towards the needs of professional practice. In the new degree program, students acquire theoretical knowledge in statistics, mathematics, and data analysis methods, covering topics such as database technologies, data mining, text analytics, machine learning, optimization, algorithms, and data security. The cooperation with mayato additionally strengthens the practical orientation: as part of the MMDS program, mayato's experts supervise theses and practical projects. In addition, talks on current applications of data science are planned as part of lectures and information events.

“The excellent theoretical education of data scientists, combined with mayato's market experience and expertise in data science, analytics, big data, and business intelligence, produces highly qualified data specialists who can derive relevant insights from large and complex data sets,” says Eric Ecker, Head of the Industry Analytics division, mayato GmbH.

“Gaining practical experience is essential for our students. We are delighted to have won mayato as a competent partner for the new degree program,” explains Prof. Dr. Christian Bizer (Web-based Systems), and Prof. Dr. Heiner Stuckenschmidt (Artificial Intelligence) adds: “The consulting firm is dedicated to the field of data analysis; for us, this is a real stroke of luck.”

Further information:

]]>
Topics - Data Mining Projects Chris Heiner Simone
news-1722 Thu, 03 Nov 2016 08:30:46 +0000 Paper accepted at MMM2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-mmm2017/ Our paper on "Using Object Detection, NLP, and Knowledge Bases to Understand the Message of Images" by Lydia Weiland, Ioana Hulpus, Simone Ponzetto and Laura Dietz has been accepted at the 23rd International Conference on MultiMedia Modeling.

]]>
Research Publications Simone
news-1720 Wed, 26 Oct 2016 09:20:00 +0000 Rim Helaoui receives Best PhD Thesis Award https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/rim-helaoui-receicves-best-phd-thesis-award/ Rim Helaoui, a former PhD student in the Artificial Intelligence group, received the award for the best PhD thesis submitted to the School of Business Informatics and Mathematics in the academic year 2015/2016 for her thesis on human activity recognition.

]]>
Research Heiner Rim
news-1715 Mon, 24 Oct 2016 14:10:00 +0000 Christian Bizer gives Keynote Talk at ISWC2016 in Japan https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/christian-bizer-gives-keynote-talk-at-iswc2016-in-japan/ In his keynote at the 15th International Semantic Web Conference in Kobe, Christian Bizer compares the expectations about the Semantic Web with the deployment patterns of Semantic Web Technologies that are currently observed on the Web and discusses the challenges that arise from these patterns.

Title

Is the Semantic Web what we expected? Adoption Patterns and Content-driven Challenges

Abstract

Semantic Web technologies, such as Linked Data and Schema.org, are used by a significant number of websites to support the automated processing of their content. In the talk, I will contrast the original vision of the Semantic Web with empirical findings about the adoption of Semantic Web technologies on the Web. The analysis will show areas in which data providers behave as envisioned by the Semantic Web community but will also reveal areas in which real-world adoption patterns strongly deviate. Afterwards, I will discuss the challenges that result from the current adoption situation. To address these challenges, I will exemplify entity reconciliation, vocabulary matching, and data quality assessment techniques which exploit all semantic clues that are provided while being tolerant to noise and lazy data providers.

]]>
Research Topics - Linked Data Chris
news-1716 Mon, 24 Oct 2016 12:22:32 +0000 Paper accepted for EDBT2017 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-edbt2017/ The full research paper Matching Web Tables To DBpedia - A Feature Utility Study by Dominique Ritze and Christian Bizer has been accepted for the 20th International Conference on Extending Database Technology 2017 (EDBT'17).

]]>
Publications Research Chris
news-1994 Wed, 12 Oct 2016 10:20:56 +0000 Article accepted at TODS: Exact and Approximate Maximum Inner Product Search with LEMP https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/article-accepted-at-tods-exact-and-approximate-maximum-inner-product-search-with-lemp-copy-1/ The article "Exact and Approximate Maximum Inner Product Search with LEMP" (author version) by C. Teflioudi and R. Gemulla has been accepted for publication in ACM Transactions on Database Systems (TODS).

Abstract
We study exact and approximate methods for maximum inner product search, a fundamental problem in a number of data mining and information retrieval tasks. We propose the LEMP framework, which supports both exact and approximate search with quality guarantees. At its heart, LEMP transforms a maximum inner product search problem over a large database of vectors into a number of smaller cosine similarity search problems. This transformation allows LEMP to prune large parts of the search space immediately and to select suitable search algorithms for each of the remaining problems individually. LEMP is able to leverage existing methods for cosine similarity search, but we also provide a number of novel search algorithms tailored to our setting. We conducted an extensive experimental study that provides insight into the performance of many state-of-the-art techniques - including LEMP - on multiple real-world datasets. We found that LEMP often was significantly faster or more accurate than alternative methods.
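The transformation at the heart of LEMP can be illustrated with a small sketch (our own toy code, not the authors' implementation; all names are made up): since dot(q, v) = |q|·|v|·cos(q, v), a database vector can be pruned by its length alone whenever |q|·|v| already falls below the search threshold.

```python
import math

def mips_with_length_pruning(db, q, theta):
    """Return indices i with dot(db[i], q) >= theta.

    Toy sketch of the central LEMP observation (not the authors'
    implementation): dot(q, v) = |q| * |v| * cos(q, v), so any vector v
    with |q| * |v| < theta can be pruned without computing a dot product.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    norm = lambda a: math.sqrt(dot(a, a))
    q_len = norm(q)
    result = []
    for i, v in enumerate(db):
        # Length-based pruning: even a perfectly aligned v cannot reach theta.
        if norm(v) * q_len < theta:
            continue
        if dot(v, q) >= theta:  # exact check on the surviving candidates
            result.append(i)
    return result
```

The real LEMP framework goes further by grouping vectors of similar length into buckets and choosing a specialized cosine similarity search algorithm per bucket; the sketch only shows the length-based pruning bound.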

]]>
Publications Rainer
news-1701 Tue, 04 Oct 2016 15:52:44 +0000 Paper accepted at ICDM: What You Will Gain By Rounding: Theory and Algorithms for Rounding Rank https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-icdm-what-you-will-gain-by-rounding-theory-and-algorithms-for-rounding-rank/ The paper "What You Will Gain By Rounding: Theory and Algorithms for Rounding Rank" by Stefan Neumann, Rainer Gemulla, and Pauli Miettinen has been accepted at the 2016 IEEE International Conference on Data Mining (ICDM).

Abstract:
When factorizing binary matrices, we often have to make a choice between using expensive combinatorial methods that retain the discrete nature of the data and using continuous methods that can be more efficient but destroy the discrete structure. Alternatively, we can first compute a continuous factorization and subsequently apply a rounding procedure to obtain a discrete representation. But what will we gain by rounding? Will this yield lower reconstruction errors? Is it easy to find a low-rank matrix that rounds to a given binary matrix? Does it matter which threshold we use for rounding? Does it matter if we allow for only non-negative factorizations? In this paper, we approach these and further questions by presenting and studying the concept of rounding rank. We show that rounding rank is related to linear classification, dimensionality reduction, and nested matrices. We also report on an extensive experimental study that compares different algorithms for finding good factorizations under the rounding rank model.
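To make the concept concrete, here is a toy example (ours, not taken from the paper): the upper-triangular all-ones matrix has full linear rank n, yet a real matrix of rank at most 2 rounds to it under threshold 0, so its rounding rank is at most 2.

```python
def round_matrix(A, tau=0.0):
    """Entry-wise rounding: 1 where A[i][j] >= tau, else 0."""
    return [[1 if x >= tau else 0 for x in row] for row in A]

n = 4
# Upper-triangular all-ones matrix; its ordinary rank over the reals is n.
B = [[1 if j >= i else 0 for j in range(n)] for i in range(n)]

# A[i][j] = j - i + 0.5 is the sum of two outer products,
# (all-ones) x (0, 1, ..., n-1) and (0.5 - i) x (all-ones), hence has
# rank <= 2, and it rounds to B: j - i + 0.5 >= 0 iff j >= i for integers.
A = [[j - i + 0.5 for j in range(n)] for i in range(n)]
assert round_matrix(A) == B
```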

]]>
Publications Rainer Research - Data Mining and Web Mining
news-1700 Tue, 04 Oct 2016 15:48:42 +0000 Paper accepted at ICDM: DESQ: Frequent Sequence Mining with Subsequence Constraints https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-icdm-desq-frequent-sequence-mining-with-subsequence-constraints/ The paper "DESQ: Frequent Sequence Mining with Subsequence Constraints" by Kaustubh Beedkar and Rainer Gemulla has been accepted at the 2016 IEEE International Conference on Data Mining (ICDM).

Abstract:

Frequent sequence mining methods often make use of constraints to control which subsequences should be mined; e.g., length, gap, span, regular-expression, and hierarchy constraints. We show that many subsequence constraints—including and beyond those considered in the literature—can be unified in a single framework. In more detail, we propose a set of simple and intuitive “pattern expressions” to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. A unified treatment allows researchers to study jointly many types of subsequence constraints (instead of each one individually) and helps to improve usability of pattern mining systems for practitioners.
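One of the constraint types mentioned above, a maximum-gap constraint, can be sketched in a few lines (a toy enumerator of ours; DESQ itself compiles pattern expressions into automata and mines frequent subsequences, which this sketch does not do):

```python
def gapped_subsequences(seq, length, max_gap):
    """Enumerate subsequences of `seq` of the given length in which
    consecutive picked items are at most `max_gap` positions apart.
    Toy illustration of one gap constraint that DESQ's pattern
    expressions can express."""
    out = []

    def extend(prefix, last_idx):
        if len(prefix) == length:
            out.append(tuple(prefix))
            return
        # First item may start anywhere; later items must respect the gap.
        start = 0 if last_idx is None else last_idx + 1
        end = len(seq) if last_idx is None else min(len(seq), last_idx + max_gap + 2)
        for k in range(start, end):
            extend(prefix + [seq[k]], k)

    extend([], None)
    return out
```

For example, over the sequence a b c d with length 2, max_gap 0 yields only the contiguous pairs ab, bc, cd, while max_gap 1 additionally admits ac and bd.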

]]>
Publications Rainer Research - Data Mining and Web Mining
news-1691 Mon, 19 Sep 2016 10:33:46 +0000 DWS Students Take Part in Data Science Game 2016 Finals https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-students-take-part-in-data-science-game-2016-finals/ The finals of the Data Science Game 2016 took place at the Les Fontaines castle near Paris from September 9th to 11th. For the second consecutive year, a team of four DWS students - Christopher Zech, Thomas Stach, Robert Litschko and Benjamin Schäfer - reached the final phase, this time qualifying in 5th place out of 143 student teams from universities around the world.

The task in the qualifying round revolved around the prediction of roof orientation in satellite images, which has relevance for applications in solar energy. Much like the other finalists, the team solved the problem using deep learning.

In the final round among the top 20 teams, they tackled a task provided by the insurance company AXA, which entailed predicting conversion rates of customers receiving car insurance quotes. The team finished in 12th place. Applying their classroom knowledge from lectures on Data Mining, Machine Learning, and other topics in a three-day competition against a strong field of fellow students was both fun and a great learning opportunity for the entire team.

Based on the very positive feedback from sponsors and participants, the organizers plan to establish the event as a regular annual installment. http://www.datasciencegame.com/

]]>
Group Topics - Data Mining Research Research - Data Mining and Web Mining
news-1664 Mon, 01 Aug 2016 18:13:59 +0000 Master Thesis: Adaptive query generation for finding customers’ hot topics (Ponzetto) https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/master-thesis-adaptive-query-generation-for-finding-customers-hot-topics-ponzetto/ The Web offers a goldmine of information describing a multitude of companies whose products and services can potentially be matched against Web users’ profiles (e.g., Twitter or Facebook profiles) in order to raise their consumer interest. However, searching the Web poses non-trivial challenges due to its large size as well as its noisy and heterogeneous content: building one's own Web search engine is impractical in most, if not all, scenarios due to a variety of scalability and other engineering issues.

In this thesis we will focus on learning user queries for lead enrichment. To this end, different methods will be explored to build a query generation engine that adapts to different users’ profiles and automatically generates Web search queries which, when used in conjunction with a general-purpose search engine like Google or Bing, retrieve documents from the websites of companies that provide products or services of interest to a potential customer.

This thesis is offered in collaboration with the GMS department of Siemens AG. Global Marketing Services (GMS) is Siemens' in-house partner for sales and marketing topics across Siemens Global market research projects. Sales and marketing concepts, customer loyalty projects, lead generation, market potential models and automated sales solutions as well as sales management via dashboards and tablets all form part of their innovative and highly specialized portfolio, which is available to all Siemens divisions and regions.

]]>
Thesis - Master Thesis Simone Topics - Artificial Intelligence (NLP)
news-1657 Thu, 21 Jul 2016 13:15:28 +0000 Five Papers accepted for ISWC2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/five-papers-accepted-for-iswc2016/ Five papers from the DWS group have been accepted for the 15th International Semantic Web Conference (ISWC2016) in Kobe, Japan: two for the Research Track and three for the Resources Track.

Papers accepted for the ISWC2016 Research Track

  • "RDF2Vec: RDF Graph Embeddings for Data Mining" by Petar Ristoski and Heiko Paulheim
  • "Containment of Expressive SPARQL Navigational Queries" by Melisachew Wudage Chekol and Giuseppe Pirrò

Papers accepted for the ISWC2016 Resources Track

  • "Linking Lexical Resources to Disambiguated Distributional Semantic Networks" by Stefano Faralli, Alexander Panchenko, Chris Biemann and Simone P. Ponzetto
  • "Conference Linked Data: the ScholarlyData project" by Andrea Giovanni Nuzzolese, Anna Lisa Gentile, Valentina Presutti and Aldo Gangemi
  • "A Collection of Benchmark Datasets for Systematic Evaluations of Machine Learning on the Semantic Web" by Petar Ristoski, Gerben Klaas Dirk de Vries and Heiko Paulheim
]]>
Publications Research Chris
news-1644 Tue, 12 Jul 2016 08:04:33 +0000 Paper accepted for ICTIR 2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-ictir-2016/ The Paper "Understanding the Message of Images by Knowledge Base Traversals" by Lydia Weiland, Ioana Hulpus, Simone Paolo Ponzetto, and Laura Dietz has been accepted as a full paper for the ACM International Conference on the Theory of Information Retrieval (ICTIR 2016). Publications Simone Research news-1645 Fri, 08 Jul 2016 08:14:00 +0000 Paper accepted for EyeWear 2016 (UbiComp 2016) https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-eyewear-2016-ubicomp-2016/ The Paper "Exploring a Multi-Sensor Picking Process in the Future Warehouse" by Alexander Diete, Timo Sztyler, Lydia Weiland, and Heiner Stuckenschmidt has been accepted for the First Workshop on Eye Wear Computing (EyeWear 2016, at the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2016)). Publications news-1634 Wed, 29 Jun 2016 08:47:38 +0000 New Degree Program Mannheim Master in Data Science https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/new-degree-program-mannheim-master-in-data-science-1/ The School of Business Informatics and Mathematics together with the School of Social Sciences start to offer the new degree program Mannheim Master in Data Science in Spring 2017. The degree program equips students with solid theoretical foundations as well as the necessary practical skills to obtain operational insights from large and complex data sets. 

The acquisition and utilization of large quantities of data nowadays influence all areas of our daily life. The analysis and interpretation of large and often complex datasets - called Data Science - is a key factor for the economic success of businesses, the improvement of processes in public administration, and the advancement of science. The central obstacle that hinders companies, governments, and research institutions from realizing a larger number of Big Data projects is the lack of well-trained specialists with the ability to integrate, analyze, and interpret large amounts of data. With its new degree program “Mannheim Master in Data Science“ (MMDS), the University of Mannheim is one of the first universities in Germany to contribute to closing this education gap.

Due to the traditionally strong quantitative and empirical orientation of Mannheim’s social sciences, as well as the focus of Mannheim’s business informatics and mathematics on analyzing large data sets, the University of Mannheim provides the ideal environment for educating data scientists. The new degree program teaches students both fundamental knowledge about methods of empirical social research and explorative data mining, and the skills to apply this knowledge to large data sets in practice. The program is highly interdisciplinary and is run as a collaboration of the University of Mannheim’s Data and Web Science Group, Institute of Business Informatics, Department of Sociology, Department of Political Science, and Institute of Mathematics.

The MMDS degree program starts in the spring term of 2017 with 25 students per year. The program is financially supported by the second stage of the “Master 2016” extension program of the state of Baden-Württemberg. The target audience of the degree program are graduates of technically oriented Bachelor programs such as business informatics, business mathematics, informatics, mathematics and statistics, as well as graduates of quantitatively oriented Bachelor programs in social and economic sciences such as business administration, political sciences, sociology, and economics.

 
Structure and Content

The program is structured into the five blocks Fundamentals, Data Management, Data Analytics, Projects and Seminars, and the Master’s Thesis.

Fundamentals: The goal of the fundamentals block is to align the previous knowledge of students from different degree programs. Graduates from computer science and mathematics acquire the required knowledge in empirical research (in particular, data collection and multivariate statistics). Graduates from the social sciences and other fields acquire the required knowledge in computer science (in particular, programming and database technology).

Data Management: One of the central challenges in the Big Data area is to handle the enormous amount, speed, heterogeneity, and quality of the data collected in industry, the public sector, and science. The Data Management block covers methods and concepts for obtaining, storing, integrating, managing, querying, and processing large amounts of data. The block includes courses on modern data management technology (such as parallel database systems, Spark, and NoSQL databases), data integration, information retrieval and search, software engineering, and algorithms.

Data Analytics: The Data Analytics block forms the core of the study program. It provides courses ranging from data mining, machine learning, and decision support, over text analytics and natural language processing, to advanced social science methods such as cross-sectional and longitudinal data analysis. The range of methodological courses is enhanced by courses on optimization, visualization, mathematics and information, and algebraic statistics.

Projects and Seminars: The Projects and Seminars block introduces students to independent research and teaches the skills necessary to successfully participate in and contribute to larger data science projects. The block consists of research seminars, individual projects, team projects, as well as data science competitions. The projects are conducted jointly with industrial partners and/or support ongoing research efforts of participating institutes.

Master Thesis: In the master thesis, students apply what they learned throughout the program. The master thesis has a duration of 6 months. Students are encouraged to write their thesis either in the context of research projects conducted by participating institutes or together with an industrial partner.


More Information

More information about the degree program and how to apply can be found at http://www.wim.uni-mannheim.de/de/fakultaet/studiengaenge/msc-in-data-science/.

]]>
Chris Rainer Simone Heiner Projects
news-1633 Fri, 24 Jun 2016 09:59:27 +0000 Rhein-Neckar Smart Data Meetup https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/rhein-neckar-smart-data-meetup/ The second Smart Data meetup was hosted at the University of Mannheim.

There were two technical talks.

Marcel Karnstedt-Hulpus, Senior Data Architect at Springer Nature, talked about Building a Landscape of Natural Scientific Facts Databases at Springer Nature.

Anna Lisa Gentile and Heiko Paulheim, working with the Web Data Mining group at the University of Mannheim, presented their work on Extending RapidMiner with Data Search and Integration Capabilities.

]]>
Group Other
news-1626 Mon, 06 Jun 2016 12:40:00 +0000 RapidMiner Data Search Extension wins ESWC2016 Best Demo Award https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/rapidminer-data-search-extension-wins-eswc2016-best-demo-award/ We are happy to announce that our demonstration Extending RapidMiner with Data Search and Integration Capabilities won the Best Demo Award at the 13th European Semantic Web Conference (ESWC2016).

Analysts are increasingly confronted with the situation that data which they need for a data mining project exists somewhere on the Web or in an organization’s intranet, but they are unable to find it. The data mining tools currently available on the market offer a wide range of powerful data mining methods but hardly support analysts in searching for suitable data or in integrating data from multiple sources.

Our demonstration at ESWC2016 showed an extension to RapidMiner, a popular data mining framework, which enables analysts to search for relevant datasets and integrate discovered data with data that they already know. In particular, we support the iterative extension of data tables with additional attributes. To this end we propose (1) a data search and integration framework and (2) an initial Open Source implementation of the framework as a RapidMiner extension.
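The table-extension step can be pictured as a left join between the analyst's table and a discovered dataset on a shared key (a minimal sketch with made-up values and names; the actual extension runs inside RapidMiner against datasets found by the search framework):

```python
# Toy illustration of iterative table extension: an analyst's table is
# augmented with one attribute from a discovered dataset via a join on
# a shared key. All values below are invented for the example.
known = [{"city": "Mannheim", "state": "BW"},
         {"city": "Heidelberg", "state": "BW"}]
discovered = {"Mannheim": 309000, "Heidelberg": 160000, "Berlin": 3600000}

def extend_table(rows, key, attr_name, lookup):
    """Left join: add `attr_name` to each row; None if the key is unknown."""
    return [{**row, attr_name: lookup.get(row[key])} for row in rows]

extended = extend_table(known, "city", "population", discovered)
```

Repeating this step with further discovered datasets yields the iterative extension of a table with additional attributes described above.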

The demo was one of 19 demos accepted for presentation at ESWC2016.

The RapidMiner extension is being developed as part of the BMBF-funded research project DS4DM.

]]>
Topics - Linked Data Projects Chris
news-1641 Mon, 06 Jun 2016 12:40:00 +0000 ESWC2016 7-Year Best Paper Award https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/eswc2016-7-years-best-paper-award/ We are happy to announce that the paper Media meets Semantic Web – How the BBC uses DBpedia and Linked Data to make connections, co-authored by Christian Bizer, won the 7-Year Best Paper Award at the 13th European Semantic Web Conference (ESWC2016).

This application paper describes how the British Broadcasting Corporation (BBC) uses Linked Data technologies together with the MusicBrainz and DBpedia knowledge bases to integrate data and link documents across multiple content management systems. It is one of the first papers promoting Linked Data as a lightweight integration technology for increasing the interoperability of heterogeneous systems in an enterprise context.

]]>
Topics - Linked Data Projects Chris
news-1618 Thu, 02 Jun 2016 10:52:16 +0000 Paper Accepted for UbiComp 2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-ubicomp-2016/ The paper "Unsupervised Recognition of Interleaved Activities of Daily Living through Ontological and Probabilistic Reasoning" by Daniele Riboni, Timo Sztyler, Gabriele Civitarese, and Heiner Stuckenschmidt has been accepted as a full paper for the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2016).

]]>
Publications Research Heiner
news-1617 Tue, 31 May 2016 06:43:32 +0000 DWS Excursion: Climbing in the Trees https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-excursion-climbing-in-the-trees/ As part of the 2016 company outing, the most courageous and lion-hearted members of the DWS group went to the climbing garden in Viernheim on May 22nd, 2016. We had perfect weather conditions, and it was a great experience to climb in the trees (especially for those who did this for the first time). After three hours of exhausting climbing, we went back to lovely Mannheim and enjoyed the rest of the day in a beer garden close to the Neckar, where we also met our less courageous colleagues and told them about our adventures. We are looking forward to the 2017 company outing; meanwhile, we have to focus on teaching and research ... ;-)

]]>
Group Other
news-1591 Mon, 02 May 2016 09:49:18 +0000 Paper accepted for EAMT 2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-eamt-2016/ The paper "Can Text Simplification Help Machine Translation?" by Sanja Štajner and Maja Popović has been accepted for the research track of EAMT 2016.

]]>
Publications Simone Research
news-1583 Mon, 25 Apr 2016 13:33:43 +0000 24.4 billion quads of RDFa, Microdata and Microformat data published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/244-billion-quads-rdfa-microdata-and-microformat-data-published/ The DWS group is happy to announce a new release of the Web Data Commons RDFa, Microdata, Embedded JSON-LD and Microformat data corpus.

The data corpus has been extracted from the November 2015 version of the Common Crawl, which covers 1.77 billion HTML pages originating from 14.4 million websites (pay-level domains).

Altogether we discovered structured data within 541 million HTML pages out of the 1.77 billion pages contained in the crawl (30%). These pages originate from 2.7 million different pay-level domains out of the 14.4 million pay-level domains covered by the crawl (19%).

Approximately 521 thousand of these websites use RDFa, while 1.1 million websites use Microdata. Microformats are also used by over 1 million websites within the crawl. For the first time, we have also extracted embedded JSON-LD, which we can report is used by more than 596 thousand websites.

Background

More and more websites embed structured data describing for instance products, people, organizations, places, events, reviews, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats.

The WebDataCommons project extracts all Microformat, Microdata and RDFa data, and since 2015 also embedded JSON-LD data, from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format.

Besides the data extracted from the aforementioned markup syntaxes, the WebDataCommons project also provides one of the largest publicly accessible corpora of WebTables extracted from web crawls, as well as a collection of hypernyms extracted from billions of web pages, for download.

General information about the WebDataCommons project can be found at http://webdatacommons.org/

Data Set Statistics

Basic statistics about the November 2015 RDFa, Microdata, Embedded JSON-LD and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

http://webdatacommons.org/structureddata/2015-11/stats/stats.html

Comparing the statistics to the statistics about the December 2014 release of the data sets

http://webdatacommons.org/structureddata/2014-12/stats/stats.html

we see that the adoption of the Microdata markup syntax has again increased (1.1 million websites in 2015 compared to 819 thousand in 2014, with both crawls covering a comparable number of websites), while the deployment of RDFa and Microformats is more or less stable.

As in the previous year, the schema.org vocabulary, recommended by Google, Microsoft, Yahoo!, and Yandex, is the vocabulary most frequently used by webmasters in the context of Microdata. We observe a decreasing deployment of its predecessor, the Data Vocabulary. In the context of RDFa, we still find the Open Graph Protocol recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue. Besides navigational, blog, and CMS-related meta-information, many websites annotate e-commerce related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress).

For the first time, we have also extracted information marked up using embedded JSON-LD. Over 99% of all webmasters using this syntax use it to mark-up search boxes on their webpages (http://schema.org/SearchAction). Only a small part of the websites also use embedded JSON-LD to annotate other information, e.g. about organizations (92 thousand websites) or persons (18 thousand websites).

Download 

The overall size of the November 2015 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 24.4 billion RDF quads. For download, we split the data into 3,961 files with a total size of 404 GB. 

http://webdatacommons.org/structureddata/2015-11/stats/how_to_get_the_data.html
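The download files contain RDF quads, one per line (subject, predicate, object, graph, terminated by a dot). A minimal reader for a single such line might look as follows; the example line is invented, and a serious pipeline should use a proper N-Quads parser (e.g. rdflib or Apache Jena) instead of this regex sketch:

```python
import re

# Minimal reader for one N-Quads-style line: subject, predicate, object,
# graph IRI, final dot. Good enough for a first look at a dump file,
# not for production parsing (no escapes, no malformed-line handling).
QUAD = re.compile(r'(\S+)\s+(\S+)\s+(.+?)\s+(<[^>]+>)\s*\.\s*$')

def parse_quad(line):
    m = QUAD.match(line.strip())
    if not m:
        return None
    s, p, o, g = m.groups()
    return {"subject": s, "predicate": p, "object": o, "graph": g}

# Invented example line in the style of the corpus (graph = source page).
line = ('_:node1 <http://schema.org/name> "Example Product" '
        '<http://example.com/page.html> .')
quad = parse_quad(line)
```

The fourth component (the graph) records the URL of the page the triple was extracted from, which is what makes the corpus a collection of quads rather than plain triples.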

In addition, we have created separate files for over 50 different schema.org classes, each including all quads from pages that deploy the specific class at least once.

http://webdatacommons.org/structureddata/2015-11/stats/schema_org_subsets.html 

Lots of thanks to

+ the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project. 
+ the Any23 project for providing their great library of structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 


Have fun with the new data set. 

Robert Meusel and Christian Bizer

]]>
Topics - Data Mining Topics - Linked Data Projects Chris Research - Data Mining and Web Mining Research - Data Analytics
news-1216 Tue, 05 Apr 2016 09:29:00 +0000 Paper Accepted for IJCAI 2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-ijcai-2016/ The Paper "Group Decision Making via Probabilistic Belief Merging" by Nico Potyka, Erman Acar, Matthias Thimm and Heiner Stuckenschmidt has been accepted for the main research track of IJCAI 2016.

]]>
Research Heiner Publications
news-1209 Tue, 22 Mar 2016 10:07:00 +0000 Open PhD/PostDoc Position in Data & Information Science https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/open-phdpostdoc-position-in-data-information-science/ The following position is offered by the WISS Research Group at Stuttgart Media University, led by Prof. Dr. Kai Eckert, one of our cooperating partners. If desired, doing your PhD thesis and working within the DWS Research Group is possible, too, depending on your qualifications and research topic and subject to the approval of one of our professors.

Stuttgart Media University educates media specialists for enterprises, organizations, and other institutions. With sixteen Bachelor's and nine Master's programs, we provide an attractive academic environment for over 4,500 students. In the Web-based Information Systems and Services Research Group (WISS, wiss.iuk.hdm-stuttgart.de), directed by Prof. Dr. Kai Eckert, we are offering a full-time position for a

RESEARCHER

IN THE FIELD OF COMPUTER, DATA & INFORMATION SCIENCE (Stuttgart / Frankfurt) up to paygrade E13 TV-L, reference number SS1602AM

The position is for a fixed term of two years, which may be extended to three years subject to a positive evaluation of the project by the German Science Foundation (DFG). Use of the project results in a PhD thesis is encouraged.

In the context of a DFG-funded project with the University Library Johann Christian Senckenberg in Frankfurt/Main, WISS is developing an online information service, “Jewish Studies”, for exploring Hebrew literature. WISS is responsible for developing the processes for data integration and data enrichment on which the innovative research services will be based. Because of the close cooperation within the project, it would be possible to work part-time or completely in Frankfurt; a workplace can be provided both in Stuttgart and in Frankfurt.

Your duties:

  • Independent execution and completion of the project
  • Development and implementation of suitable processes for data integration and enrichment
  • Co-development and evaluation of a coordinated workflow with the University Library in Frankfurt
  • Evaluation and documentation of the results

We offer:

  • Work with the latest technologies in an exciting research field
  • Flexible worktime models and the option of home working or telework
  • Compatibility of family and work life

Your profile:

  • University degree (Master's or equivalent) in Computer Science or a related field
  • Programming skills, ideally in Java
  • Very good command of written and spoken English
  • Self-starter

Knowledge of data integration, Linked Open Data, XML or Web technology is a plus. For any questions, do not hesitate to ask Prof. Dr. Kai Eckert via email.

Here is the link to the official job offer for your application: bewerbung.hdm-stuttgart.de/job-offer.html

]]>
Open Positions - Staff
news-1196 Tue, 23 Feb 2016 12:17:00 +0000 Rim Helaoui defended her PhD Thesis https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/rim-helaoui-defended-her-phd-thesis/ On February 19th, Rim Helaoui successfully defended her PhD thesis "On Leveraging Statistical and Relational Information for the Representation and Recognition of Complex Human Activities". Her supervisor was Prof. Stuckenschmidt; the second reader was Prof. Riboni from the University of Cagliari.

]]>
Publications Heiner Research
news-1189 Mon, 22 Feb 2016 09:58:00 +0000 Semester Kickoff BBQ https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/semester-kickoff-bbq/ Traditionally, the Research Group Data and Web Science takes the beginning of a new semester as an opportunity to host a barbecue in order to welcome new colleagues and introduce the upcoming courses to the best students of the previous semester. Accompanied by cold beverages and grilled food, the professors presented the program for the new semester. The courses for this term are:

<link />Knowledge Management, <link />Data Mining I, <link />Data Mining II, <link />Web Mining, <link />Web Search and Information Retrieval, <link />Large-Scale Data Management, <link />Process Mining Seminar, <link />Data and Web Science Seminar

Throughout the evening, many interesting topics were discussed and some cornerstones for future theses were laid.

We thank all the participants for coming and wish especially our students a good and successful start into the new semester.

<link />Here you can download the slides presented during the event.

German Version 

Traditionally, the Research Group Data and Web Science uses the beginning of the semester to welcome new colleagues at a barbecue, to present the current course program, and to invite the best students of the previous semester. On Thursday (18.02.2016), the professors presented the upcoming courses of the current semester, accompanied by grilled food and cold beverages.

The following courses were presented:

<link />Knowledge Management, <link />Data Mining I, <link />Data Mining II, <link />Web Mining, <link />Web Search and Information Retrieval, <link />Large-Scale Data Management, <link />Process Mining Seminar, <link />Data and Web Science Seminar

Throughout the evening, many interesting discussions on the group's research topics arose among the attendees.

We thank all participants for coming and wish especially our students a good and successful start into the new semester.

The slides presented during the event can be downloaded <link />here.

]]>
Other
news-1187 Thu, 18 Feb 2016 10:16:00 +0000 Survey Article published in Semantic Web Journal https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/survey-article-published-in-semantic-web-journal/ The article "Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods" by Heiko Paulheim has been accepted for publication in the Semantic Web Journal.

The article reviews automatic and semi-automatic methods for refining (i.e., completing and correcting) general-purpose knowledge graphs like DBpedia, YAGO or Wikidata. Furthermore, it takes a critical look at the methodological questions of how to evaluate such approaches, and sketches a roadmap for future research in the field.

A preprint version is available here.

]]>
Group Publications
news-1177 Tue, 09 Feb 2016 08:13:00 +0000 Arnab Dutta defended his PhD thesis https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/arnab-dutta-defended-his-phd-thesis/ Arnab Dutta successfully defended his PhD thesis on February 4th. His research, supervised by Prof. Dr. Heiner Stuckenschmidt, was concerned with the topic "Automated Knowledge Base Extension Using Open Information". The second reader was Prof. Fabian Suchanek from Télécom ParisTech, Paris.


]]>
Publications Research Other People Heiner
news-1174 Mon, 08 Feb 2016 12:29:00 +0000 JOIN-T DWS NLP and LT TU Darmstadt team wins SemEval task on Taxonomy Extraction Evaluation (TExEval-2) https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/join-t-dws-nlp-and-lt-tu-darmstadt-team-wins-semeval-task-on-taxonomy-extraction-evaluation-texeval/ A system developed as part of a collaboration between the NLP group of DWS and the Language Technology group of TU Darmstadt has been ranked first in the SemEval challenge on Taxonomy Extraction Evaluation (TExEval-2). The system, named TAXI, relies on two sources of evidence, substring matching and Hearst-like patterns, and can easily be ported to many different languages.

The results of the challenge can be found here. SemEval is the premier evaluation forum of the computational semantics community. This work is part of an ongoing DFG-project (JOIN-T) collaboration between the two groups.

]]>
Simone Publications Research
news-1169 Tue, 02 Feb 2016 08:36:00 +0000 2nd DataFest Germany https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/2nd-datafest-germany/ The 2nd DataFest Germany will be held during the weekend of April 1-3, 2016 at the University of Munich LMU.

As DWS students were quite successful and won a prize at last year's DataFest in Mannheim, we strongly encourage students to participate again in this year's DataFest.

The registration for the 2nd DataFest is now open to students (BA and MA). Space is limited. Registration deadline is March 1st, 2016. 

DataFest(TM) is a data analysis competition that started in the US (ASA DataFest(TM)) a few years ago and found its way to Germany last year. Teams of up to five students have a weekend to attack a large, complex, and surprise dataset. Your job is to represent your school by finding and communicating insights into these data. The teams that impress the judges will win prizes as well as glory for their school. Everyone else will have a great experience, lots of food, and fun! Impressions from last year can be found here: https://www.youtube.com/watch?v=Z9RPgV0zoxg

For more information about the 2nd DataFest Germany please visit http://datafest.de/

]]>
Projects
news-1159 Wed, 20 Jan 2016 07:48:00 +0000 Paper Accepted for WACV 2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-wacv-2016/ The paper "A Real-Time Visual Card Reader for Mobile Devices" by Lukas Stehr, Robert Meusel and Stephan Kopf has been accepted at the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). The paper presents an approach to automatically detect the card provider, card type and customer number of loyalty cards.

]]>
Publications Research Topics - Data Mining
news-1157 Fri, 15 Jan 2016 08:40:00 +0000 Article accepted in Journal of Web Semantics https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/article-accepted-in-journal-of-web-semantics/ The article "Semantic Web in data mining and knowledge discovery: A comprehensive survey" by Petar Ristoski and Heiko Paulheim has been accepted for publication by the Journal of Web Semantics (http://dx.doi.org/10.1016/j.websem.2016.01.001). The article reviews more than 100 approaches using Semantic Web data at various stages of the data mining process, and thus provides a comprehensive, timely survey of the field.

]]>
Publications
news-1155 Tue, 22 Dec 2015 08:39:00 +0000 New Lecture Videos and Screen Casts online https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/new-lecture-videos-and-screen-casts-online/ The Data and Web Science Group records core lectures for Master students on video and provides screen casts of accompanying exercises in order to enable students to be more flexible in their learning patterns.

In the Spring 2015 semester, we have recorded the Data Mining I lecture (Prof. Bizer) and the Data Mining II lecture (Prof. Paulheim) and have produced screen casts for the Data Mining I exercise (Robert Meusel). 

This fall semester, we have recorded two additional lectures: Web Data Integration (Prof. Bizer) which introduces techniques for integrating data from large numbers of Web data sources and Text Analytics (Prof. Ponzetto) which provides an introduction to state-of-the-art principles and methods of Natural Language Processing. In addition, we have produced screen casts on data translation with MapForce and on how to use the Java frameworks that are necessary for the Web Data Integration exercise.

The videos and screen casts are available from the Lecture Videos page.

We plan to record more lectures in the upcoming semesters. Next semester, we will record the Web Mining lecture (Prof. Bizer) and the Web Search and Information Retrieval lecture (Prof. Ponzetto, Dr. Dietz).

Lots of thanks to the Referat Neue Medien of the Stabsstelle Studium und Lehre for supporting us in recording the lecture videos.

Besides being used at the University of Mannheim, the lecture videos and screen casts will also be used in the part-time master program Data Science, which the DWS group runs together with the University of Tübingen and Albstadt-Sigmaringen University.

]]>
Chris Projects
news-1150 Wed, 16 Dec 2015 07:34:00 +0000 Papers Accepted for WWW2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/papers-accepted-for-www2016/ The full research paper "Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases" by Dominique Ritze, Oliver Lehmberg, Yaser Oulabi and Christian Bizer as well as the poster "A Large Corpus of Web Tables containing Time and Context Metadata" by Oliver Lehmberg, Dominique Ritze, Robert Meusel and Christian Bizer have been accepted for the World Wide Web Conference 2016 (WWW'16).

]]>
Publications Research Chris
news-1147 Tue, 15 Dec 2015 07:48:00 +0000 DWS and Yahoo Open Source Structured Data Crawler https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-and-yahoo-open-source-structured-data-crawler/ Anthelion, an Open-Source Focused Crawler, Publicly Released. Would you like to collect large datasets from the Web? If so, then we have good news! Together with Yahoo Labs, we recently released Anthelion on GitHub: a focused crawler for semantic annotations in Web pages that steers towards HTML pages annotated with markup languages like RDFa, Microformats, and Microdata.

Anthelion can be targeted to crawl for specific pages; for example, those including markup describing movies with at least two different attributes such as the title of and actors in a movie. The system includes a ready-to-run extension for the Apache Nutch Crawler (nutch-anth), which can be run on a single machine as well as a Hadoop cluster. In addition, the GitHub project includes a testing environment for crawler simulations that makes it possible to measure the efficiency of the crawler in a controlled environment, as demonstrated in our research paper, “Focused Crawling for Structured Data.”

With regard to methodology, Anthelion combines the benefits of online learning and a bandit-based selection strategy to adapt to the current crawling environment. Each newly discovered URL is classified with respect to a given target function; after a page has been crawled, it is analyzed and passed back to the learner to further improve the classifier's quality. The final selection of which page to crawl next is made by the bandit, which chooses between exploration and exploitation (i.e., between a random page and the most promising page) based on a configuration parameter.
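The exploration/exploitation trade-off described above can be sketched as a simple epsilon-greedy selection. This is a minimal illustration, not Anthelion's actual implementation; the function name `select_next_url` and the scored frontier are hypothetical:

```python
import random

def select_next_url(frontier, epsilon=0.1):
    """Pick the next URL to crawl from a scored frontier.

    frontier: list of (url, score) pairs, where score is the online
    classifier's estimate that the page carries semantic markup.
    epsilon: the configuration parameter trading exploration
    against exploitation.
    """
    if random.random() < epsilon:
        # Exploration: crawl a random page to discover new regions of a site.
        return random.choice(frontier)[0]
    # Exploitation: crawl the most promising page according to the classifier.
    return max(frontier, key=lambda pair: pair[1])[0]
```

With epsilon = 0 the bandit always exploits the classifier's top-ranked page; with epsilon = 1 it degenerates to random crawling.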

Experiments have shown that, in comparison to a pure breadth-first-search selection strategy, the number of retrieved relevant pages can be increased by a factor of three. In comparison to a pure online-classification-based selection, the improvement is about 26%.

The complete code, which is released under Apache License 2.0, as well as a more comprehensive description can be found at the Yahoo GitHub repository: https://github.com/yahoo/anthelion

Lots of thanks to:
+ the Web Data Commons project and the Common Crawl project for providing their crawl data and extraction data
+ Yahoo Labs for providing the funding for some of the work 
+ the Any23 project for providing their great library of structured data parsers
+ the Nutch project for providing a great extendible crawling infrastructure

Have fun with the code and please let us know if you have any feedback!

Petar Ristoski, Peter Mika, Roi Blanco, and Robert Meusel

Original posting from the Yahoo Tumblr.

]]>
Research Publications Chris
news-1145 Thu, 10 Dec 2015 09:55:00 +0000 Topic Model Tutorial at WebSci2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/topic-model-tutorial-at-websci2016/ In cooperation with GESIS, Laura Dietz will co-teach the tutorial on Topic Models at the WebSci conference in 2016.

]]>
Laura Simone Research
news-1141 Thu, 03 Dec 2015 14:24:00 +0000 Paper Accepted for PerCom 2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-percom-2016/ The paper "On-body Localization of Wearable Devices: An Investigation of Position-Aware Activity Recognition" by Timo Sztyler and Heiner Stuckenschmidt has been accepted as a full paper for the IEEE International Conference on Pervasive Computing and Communications 2016 (PerCom'16).

]]>
Heiner Research Publications
news-174 Wed, 25 Nov 2015 11:25:00 +0000 Invited talk on Web Data Search at Heidelberg University https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/invited-talk-on-web-data-search-at-heidelberg-university/ Professor Christian Bizer talks about Data Search at the Department of Computational Linguistics on 26th November 2015.

See: http://www.cl.uni-heidelberg.de/colloquium/

Title of the talk:

Data Search and Search Joins

Abstract:

The amount of structured data that is published on the Web has increased sharply over the last years. This deluge of available data calls for new search techniques which support users in finding and integrating data from large numbers of data sources. In his talk, Christian Bizer will give an overview of the different types of data search that have been proposed so far: entity search, table search, and constrained and unconstrained search joins. As an example of a system from the last category, he will introduce the Mannheim Search Join Engine, which executes unconstrained search joins over different types of Web data, including Linked Data, Microdata, Web tables and Wikipedia tables.
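The basic idea of a search join can be illustrated with a toy sketch: a local table is extended with additional attributes discovered in web tables that mention the same entities. This is only a conceptual illustration, not the Mannheim Search Join Engine's actual method or API; the function `search_join` and all data values are made up:

```python
def search_join(local_table, web_tables, key):
    """Extend each row of a local table with additional attributes
    found in web tables that mention the same entity (toy sketch)."""
    extended = []
    for row in local_table:
        merged = dict(row)
        for table in web_tables:
            for web_row in table:
                if web_row.get(key) == row[key]:
                    # Adopt any attribute the local table does not have yet.
                    for attr, value in web_row.items():
                        merged.setdefault(attr, value)
        extended.append(merged)
    return extended

# Hypothetical local table and two "web tables" sharing the key column.
cities = [{"city": "Mannheim"}, {"city": "Heidelberg"}]
web = [[{"city": "Mannheim", "population": 300000}],
       [{"city": "Heidelberg", "state": "Baden-Württemberg"}]]
print(search_join(cities, web, key="city"))
```

A real engine additionally has to rank candidate tables, resolve entity-matching ambiguity, and fuse conflicting values, which this sketch deliberately omits.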

Slides:

Slides of the talk

]]>
Chris Research
news-1136 Thu, 19 Nov 2015 09:20:00 +0000 Web Table Corpus containing 233 million tables released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/web-table-corpus-containing-233-million-tables-released/ The DWS group is happy to announce the release of the WDC Web Table Corpus 2015.

The corpus has been extracted from the July 2015 version of the Common Crawl which contains 1.78 billion HTML pages originating from 15 million pay-level domains.

The WDC Web Tables Corpus 2015 consists of 233 million HTML tables which are classified into the categories: relational, entity, and matrix. In addition to the actual tables, the corpus also contains table metadata such as table orientation, header rows, and key columns, as well as table context information such as the text on the HTML page before and after the table, the page title, and timestamp information from the page.
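To give an impression of what a table-plus-metadata record might look like, here is a minimal parsing sketch. The JSON field names and values are illustrative assumptions, not the corpus's documented schema:

```python
import json

# A hypothetical record combining table cells with the kinds of metadata
# the corpus description mentions: table type, header row, page context.
record = json.loads("""{
  "tableType": "relational",
  "headerRowIndex": 0,
  "pageTitle": "European capitals",
  "textBeforeTable": "The following capitals were visited:",
  "rows": [["Country", "Capital"],
           ["Germany", "Berlin"],
           ["France", "Paris"]]
}""")

# Separate the header row from the data rows using the metadata.
header = record["rows"][record["headerRowIndex"]]
data_rows = [row for i, row in enumerate(record["rows"])
             if i != record["headerRowIndex"]]
# Turn each data row into a dict keyed by the header cells.
table = [dict(zip(header, row)) for row in data_rows]
print(record["tableType"], table[0]["Capital"])
```

Consult the corpus documentation at webdatacommons.org/webtables/ for the actual record layout before processing the real data.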

Detailed statistics about the corpus, information about its application domains, as well as instructions on how to download the corpus are found at

http://webdatacommons.org/webtables/

We want to thank the Common Crawl Foundation for gathering their great web corpora and thus enabling the creation of the WDC Web Tables Corpus. We also want to thank Amazon Web Services for supporting the Web Data Commons project by allowing us to use their cloud infrastructure. Great thanks also to the Dresden Web Table Corpus team for their extensions of the WDC framework, which we further extended and used for this extraction.

Enjoy the new corpus!

Dominique Ritze, Oliver Lehmberg, Robert Meusel, Sanikumar Zope, and Christian Bizer

]]>
Research Chris Projects
news-1133 Tue, 17 Nov 2015 15:31:00 +0000 Laura Dietz admitted to Elite Program for Post-docs of the BW-Stiftung https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/laura-dietz-admitted-to-elite-program-for-post-docs-of-the-bw-stiftung/ Through its Elite Program for Post-docs, the Baden-Württemberg Stiftung awarded Laura Dietz a 110,000 EUR fellowship for her project on "Knowledge Consolidation and Organization for Query-specific Wikipedia Construction".

The goal of the research project is to make information on the Web accessible in a Wikipedia-like form through a query-driven interaction paradigm. This research requires a combination of methods from information retrieval and automatic text understanding to provide the user with a synthesis of the information through summarization, sub-topic identification, and article organization.

We are looking for a prospective PhD student who is interested in this project. A job announcement will follow.

]]>
Simone Projects Laura
news-1131 Mon, 16 Nov 2015 10:03:00 +0000 Paper Accepted for AAAI 2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-aaai-2016/ The paper "On the Containment of SPARQL Queries under Entailment Regimes" by Melisachew Wudage Chekol has been accepted as a full paper for the 30th AAAI Conference on Artificial Intelligence (AAAI 2016).

]]>
Research Heiner Publications
news-751 Fri, 30 Oct 2015 08:49:00 +0000 9th Linked Data on the Web Workshop at WWW2016 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/9th-linked-data-on-the-web-workshop-at-www2016/ Together with Sir Tim Berners-Lee (W3C/MIT, USA), Tom Heath (Open Data Institute, UK) and Sören Auer (University of Bonn and Fraunhofer IAIS, Germany), Christian Bizer is organizing the 9th Linked Data on the Web Workshop (LDOW2016)  at the 25th World Wide Web Conference (WWW2016) in Montreal, Canada.

Goals of the Workshop

The Web is developing from a medium for publishing textual documents into a medium for sharing structured data. This trend is fueled on the one hand by the adoption of the Linked Data principles by a growing number of data providers. On the other hand, large numbers of websites have started to semantically mark up the content of their HTML pages and thus also contribute to the wealth of structured data available on the Web.

The 9th Workshop on Linked Data on the Web (LDOW2016) aims to stimulate discussion and further research into the challenges of publishing, consuming, and integrating structured data from the Web as well as mining knowledge from the global Web of Data. The special focus of this year’s LDOW workshop will be Web Data Quality Assessment and Web Data Cleansing.

Important Dates

  • Submission deadline: 24 January, 2016 (23:59 Pacific Time)
  • Notification of acceptance: 10 February, 2016
  • Workshop date: 11-13 April, 2016

Topics of Interest

Topics of interest for the workshop include, but are not limited to, the following:

Web Data Quality Assessment

  • methods for evaluating the quality and trustworthiness of web data
  • tracking the provenance of web data
  • profiling and change tracking of web data sources
  • cost and benefits of web data quality assessment
  • web data quality assessment benchmarks

Web Data Cleansing

  • methods for cleansing web data
  • data fusion and truth discovery
  • conflict resolution using semantic knowledge
  • human-in-the-loop and crowdsourcing for data cleansing
  • cost and benefits of web data cleansing
  • web data cleansing benchmarks

Integrating Web Data from Large Numbers of Data Sources

  • linking algorithms and heuristics, identity resolution
  • schema matching and clustering
  • evaluation of linking and schema matching methods

Mining the Web of Data

  • large-scale derivation of implicit knowledge from the Web of Data
  • using the Web of Data as background knowledge in data mining

Linked Data Applications

  • application showcases including Web data browsers and search engines
  • marketplaces, aggregators and indexes for Web Data
  • security, access control, and licensing issues of Linked Data
  • role of Linked Data within enterprise applications (e.g. ERP, SCM, CRM)
  • Linked Data applications for life-sciences, digital humanities, social sciences etc.

Submissions

We seek the following kinds of submissions:

  1. Full scientific papers: up to 10 pages in ACM format
  2. Short scientific and position papers: up to 5 pages in ACM format

Submissions must be formatted using the ACM SIG template (as per the WWW2016 Research Track) available at http://www.acm.org/sigs/publications/proceedings-templates. Please note that the author list does not need to be anonymized, as we do not operate a double-blind review process. Submissions will be peer reviewed by at least three independent reviewers. Accepted papers will be presented at the workshop and included in the workshop proceedings. 

Proceedings

Accepted papers will be made available on this website and will be published as a volume of the CEUR series of workshop proceedings.

]]>
Research Publications Chris
news-1115 Mon, 28 Sep 2015 14:59:00 +0000 Boating and Hiking Excursion along the River Neckar https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/boating-and-hiking-excursion-along-the-river-neckar/ On Friday, 25.09.2015, the DWS group spent a day off together along the river Neckar. We started with a boat trip from Heidelberg through the Neckar valley into the Odenwald. Leaving the boat in Neckarsteinach, we hiked up the hill and visited the three castles Schadek, Hinterburg and Mittelburg. Afterwards, we refreshed ourselves in a beer garden and enjoyed the wonderful view of the Neckar valley.

]]>
Other
news-1106 Mon, 14 Sep 2015 07:38:00 +0000 Semester Kickoff BBQ https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/semester-kickoff-bbq-1/ English Version

Traditionally, the Research Group Data and Web Science takes the beginning of the new semester as an opportunity to host a barbecue in order to welcome new colleagues and introduce the upcoming courses to the best students of the previous semester. Thus, accompanied by cold beverages and grilled food, the professors presented the autumn/winter semester program. The courses for this term are:

Data Mining I, Decision Support, Semantic Web Technologies, Text Analytics, Web Data Integration, Hot Topics in Machine Learning, Queripidia Seminar.

Throughout the evening, many interesting topics were discussed and some cornerstones for future theses were laid.

We thank all the participants for coming and wish especially our students a good and successful start into the new semester.

 

German Version 

Traditionally, the Research Group Data and Web Science uses the beginning of the semester to welcome new colleagues at a barbecue, to present the current course program, and to invite the best students of the previous semester. On Thursday (10.11.2015), the professors presented the upcoming courses of the current semester, accompanied by grilled food and cold beverages.

The following courses were presented:

Data Mining I, Decision Support, Semantic Web Technologies, Text Analytics, Web Data Integration, Hot Topics in Machine Learning, Queripidia Seminar.

Throughout the evening, many interesting discussions on the group's research topics arose among the attendees.

We thank all participants for coming and wish especially our students a good and successful start into the new semester.

]]>
Group Other
news-1099 Fri, 11 Sep 2015 14:18:00 +0000 Amazon AWS awarded to the Queripidia project https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/amazon-aws-awarded-to-the-queripidia-project/ Amazon awarded an AWS grant to Laura Dietz for the project "Queripidia - A New Web Search Paradigm through Query-specific Knowledge Resource Construction"

Project Description


Some questions demand complex answers. Today's dominant search paradigm of "ten blue links" is not sufficient to meet such demands. Instead, most users turn towards Wikipedia, but find themselves at the limits of its coverage. We are working on a new web search paradigm, which we call Queripidia, that constructs a query-specific knowledge resource combining text and structure.

In response to a search query, Queripidia will compose a knowledge resource describing relevant entities embedded in relevant text. Human users are provided with a hypertext interface that resembles Wikipedia, but is automatically generated. In addition, machine readable information on the search query is provided for further inference and incremental population of a knowledge base.

Inputs to the extraction process are unstructured web pages and structured knowledge bases. This requires new algorithms that reason over text and knowledge in a unified way, combining novel information retrieval algorithms for text and knowledge with natural language processing. This touches on the topics of optimal index building, on-demand natural language processing, incremental feature extraction, and large-scale training of re-ranking models.

Our research on Queripidia is an ongoing effort which has already led to several publications in premier venues on Information Retrieval (SIGIR and TREC) and on Automated Knowledge Base Construction (AKBC). A web demo based on the knowledge bases Freebase and Wikipedia as well as the 3 TB ClueWeb12 corpus is available at ciir.cs.umass.edu/~dietz/queripidia/

We apply for this AWS grant in order to study Queripidia on a realistic web-scale corpus, the Common Crawl, which is available as an AWS Public Dataset. At about 500 TB, the Common Crawl is roughly 200 times larger than the ClueWeb12 corpus we have used so far. We want to evaluate whether the high precision and recall we have seen so far can also be achieved in a realistic web setting, and furthermore want to meet the algorithmic challenges that grow with the data size.

A second motivation is to welcome new researchers into this new paradigm by providing query-focused inverted indexes, query-relevant subsets of the Common Crawl, and query-relevant knowledge base entries from DBpedia, Freebase, and Wikipedia, together with semantic annotations linking text and knowledge. In particular, we provide this for the queries in all available TREC Web benchmarks [1]. To this end, we aim to push ongoing community efforts such as the webdatacommons framework [2] and the linguistic annotations of Stanford's CoreNLP [3] to the next level.


REFERENCES

[1] trec.nist.gov
[2] www.webdatacommons.org/framework
[3] deepdive.stanford.edu/doc/opendata/

Laura Dietz, Michael Schuhmacher, Simone Paolo Ponzetto. Queripidia: Query-specific Wikipedia Construction. AKBC 2014.

Jeffrey Dalton, Laura Dietz, James Allan. Entity query feature expansion using knowledge base links. SIGIR 2014.

Michael Schuhmacher, Laura Dietz, Simone Paolo Ponzetto. Ranking Entities for Web Queries through Text and Knowledge. In Proc. of CIKM-15, 2015.

]]>
Projects
news-1101 Fri, 28 Aug 2015 12:48:00 +0000 New Industry Project on Data Search https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/new-industry-project-on-data-search/ We are happy to announce the start of the new research project Data Search for Data Mining (DS4DM). Within the project, we will work together with Rapidminer GmbH on extending their data mining plattform with data search and data integration functionalities.

The DS4DM project has a duration of 3 years and is funded by the German Federal Ministry of Education and Research (BMBF) under the funding scheme KMU-innovativ with an amount of 400k €.

More information about the project is found on the DS4DM project page.

]]>
Chris Projects
news-1098 Wed, 19 Aug 2015 09:05:00 +0000 New Project Grant on Web Knowledge Graphs https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/new-project-grant-on-web-knowledge-graphs/ We are happy to announce the grant of the new research project SyKo²W² (Synthesis of Completion and Correction for Web Knowledge Graphs). The project will examine how completion and correction methods for knowledge graphs (both of which have been studied extensively in isolation) can be combined into efficient joint methods. The project, with a budget of 150 k€, is funded by the state of Baden-Württemberg in its funding scheme for assistant professors (Juniorprofessoren-Programm), with Heiko Paulheim as the principal investigator.

]]>
Projects
news-1097 Tue, 18 Aug 2015 13:29:00 +0000 Open PhD / PostDoc position in Text Analysis https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/open-phd-postdoc-position-in-text-analysis/ The Data and Web Science Research Group at the University of Mannheim invites applications for

ONE PHD / POSTDOC POSITION IN TEXT ANALYSIS

The researcher is expected to contribute to the C4 project on “Measuring a common space and the dynamics of reform positions” within the DFG-funded Collaborative Research Centre (SFB) 884 “Political Economy of Reforms” (http://reforms.uni-mannheim.de) at the University of Mannheim. The PhD topic will focus on exploiting computational methods for analysing discourse phenomena such as uncertainty, vagueness and bias in political texts. This is a joint collaboration between the Natural Language Processing and Information Retrieval group (Prof. Simone Paolo Ponzetto) and the Chair of Artificial Intelligence (Prof. Heiner Stuckenschmidt), and will also involve close collaboration with project partners at the Department of Political Science (Prof. Dr. Nicole Rae Baerg, Prof. Dr. Thomas Gschwend), ranked as the best Political Science department in Germany in various national and international university rankings. The student will be located at the Data and Web Science Group (DWS) of the University of Mannheim, one of the leading centers for Data Science in Germany.

We are particularly interested in candidates with a background in one or several of the following areas:

  • statistical semantics and discourse processing

  • machine learning and natural language processing

  • discourse analysis

  • automated text-based scaling

Applicants should have a Master's or PhD degree (or obtain it in the near future) in Computer Science, Natural Language Processing, Machine Learning or Social Science, and have previous experience in applying human language technology.

Duration: initially one year (starting in Fall 2015) with possible extension to 3-5 years.

Salary range: according to German public scale TV-L 13 100% (full time, ranging between 3,200 and 4,600 Euro before taxes, depending on qualification).

Applications can be made via e-mail (sfb884(at)informatik.uni-mannheim.de) and should include a short research statement, CV, copies of university degrees and transcripts, and - if available - a copy of the master thesis, as well as a list of publications and published software. Further information about the groups can be found at http://dws.informatik.uni-mannheim.de/. All documents should be e-mailed as a single PDF. All applications sent before September 15, 2015 will receive full consideration. The positions remain open until filled.

The University of Mannheim is committed to increasing the percentage of female scientists and encourages women to apply. Among candidates of equal aptitude and qualifications, a person with disabilities will be given preference.

Please contact Simone Paolo Ponzetto (simone(At)informatik(DoT)uni-mannheim(DoT)de) and Heiner Stuckenschmidt (heiner(At)informatik(DoT)uni-mannheim(DoT)de) for informal enquiries.

]]>
Open Positions - Staff Heiner Simone
news-1096 Tue, 18 Aug 2015 10:54:00 +0000 Paper accepted at EMNLP: "FINET: Context-Aware Fine-Grained Named Entity Typing" https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-emnlp-finet-context-aware-fine-grained-named-entity-typing/ The paper "FINET: Context-Aware Fine-Grained Named Entity Typing" by Luciano del Corro, Abdalghani Abujabal, Rainer Gemulla, and Gerhard Weikum has been accepted at the 2015 Conference on Empirical Methods on Natural Language Processing (EMNLP).

Abstract:

We propose FINET, a system for detecting the types of named entities that occur in short inputs—such as sentences or tweets—with respect to WordNet’s super fine-grained type system. FINET generates candidate types using a sequence of multiple extractors, ranging from explicitly mentioned types to implicit types, and subsequently selects the most appropriate type using ideas from word-sense disambiguation. FINET combats the data scarcity and noise problems that plague existing systems for named entity typing: It does not rely on supervision in its extractors and generates training data for type selection directly from WordNet and other resources. FINET supports the most fine-grained type system so far, including types for which no annotated training data is provided. Our experiments indicate that FINET outperforms state-of-the-art methods in terms of recall, precision, and granularity of extracted types.

]]>
Publications Research - Data Analytics
news-1095 Tue, 18 Aug 2015 10:50:00 +0000 Paper accepted at EMNLP: "CORE: Context-Aware Open Relation Extraction with Factorization Machines" https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-emnlp-core-context-aware-open-relation-extraction-with-factorization-machines/ The paper "CORE: Context-Aware Open Relation Extraction with Factorization Machines" by Fabio Petroni, Luciano del Corro, and Rainer Gemulla has been accepted at the 2015 Conference on Empirical Methods on Natural Language Processing (EMNLP).

Abstract:

We propose CORE, a novel matrix factorization model that leverages contextual information for open relation extraction. Our model is based on factorization machines and integrates facts from various sources, such as knowledge bases or open information extractors, as well as the context in which these facts have been observed. We argue that integrating contextual information—such as metadata about extraction sources, lexical context, or type information—significantly improves prediction performance. Open information extractors, for example, may produce extractions that are unspecific or ambiguous when taken out of context. Our experimental study on a large real-world dataset indicates that CORE has significantly better prediction performance than state-of-the-art approaches when contextual information is available.

]]>
Publications Research - Data Analytics
news-1089 Wed, 22 Jul 2015 15:41:00 +0000 DWS sponsors student team for the Data Science Game 2015 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-sponsors-student-team-for-the-data-science-game-2015/ The Data Science Game 2015, organized by the ParisTech association of technical universities, took place in the castle Les Fontaines near Paris on June 20th and 21st. In total, 20 teams of students from all over Europe participated. They tackled a task provided by the main sponsor Google, which entailed predicting categories of web videos based on various features such as title, description, and number of views. A team of four DWS students - Christopher Zech, Thomas Stach, Robert Litschko and Benjamin Schäfer - represented the research group. Applying their classroom knowledge from lectures on Data Mining, Text Analytics, and other topics in the 48-hour competition against a strong field of fellow students was both fun and a great learning opportunity for the entire team. Based on the very positive feedback from sponsors and participants, a sequel is planned for June 2016. http://www.datasciencegame.com/

]]>
Other
news-1093 Tue, 21 Jul 2015 12:50:00 +0000 Open PostDoc Position in Data Search, Data Integration, and Data Mining https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/open-postdoc-position-in-data-search-data-integration-and-data-mining/  

The Data and Web Science Group at the University of Mannheim invites applications for the following open position:

    PostDoc in the field of Data Search, Data Integration, and Data Mining for 33 months

in a third party funded research project performed together with a leading German data mining company as industry partner. The project will develop data search methods for enabling large amounts of intranet and web data to be used for (semi-) automatic table extension. Applicants should have a strong background in one or more of the following areas:

  • Data search, information retrieval, and efficient indexing structures
  • Data integration, schema and instance matching, data fusion
  • Data mining, in particular feature creation and selection

The Data and Web Science Group is one of the largest research groups in the area of data science in Germany, comprising 6 professors and over 20 PostDocs and PhD students, and thus provides a fruitful ecosystem for researchers.

The project builds on earlier work of the group [1-3]. As a PostDoc researcher in the project, you will coordinate 2-3 PhD students working in the same field.  

The position is paid according to German regulations (TV-L 13, >3300€ before tax, depending on your experience). The earliest possible starting date is Oct 1, 2015.

We seek to fill the position with a PostDoc researcher; nevertheless, PhD candidates with an exceptional track record are also invited to apply.

Applications should contain a CV, a publication list, university transcripts, a link to the PhD/master thesis, as well as a list of personal references. All applications received until August 21st, 2015 will receive full consideration, but you are invited to send your application earlier.

Applications can be sent via e-mail to ds4dm-project(at)dwslab.de. Questions concerning the position are answered via the same email address by Christian Bizer and Heiko Paulheim.

The University of Mannheim seeks to increase the proportion of women in research and teaching. Preference will be given to suitably qualified women or persons with disabilities, all other considerations being equal.

[1] Lehmberg et al.: The Mannheim Search Join Engine. JWS 2015,
http://dx.doi.org/10.1016/j.websem.2015.05.001

[2] Ristoski et al.: Mining the Web of Linked Data with RapidMiner. JWS 2015,
http://dx.doi.org/10.1016/j.websem.2015.06.004

[3] Ritze et al.: Matching HTML Tables to DBpedia. WIMS 2015,
http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Ritze-etal-MatchingTablesToDBpedia-WIMS2015.pdf

]]>
Open Positions Open Positions - Staff
news-567 Mon, 20 Jul 2015 13:25:00 +0000 Student Assistant for the Web Data Commons Research Project https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/studentische-hilfskraft-fuer-das-web-data-commons-forschungsprojekt/ More and more websites have started to embed structured data about people, products, places, and events within their HTML pages. The open-source project Web Data Commons (www.webdatacommons.org) extracts this data from several billion web pages and makes the extracted data available for download. Web Data Commons thus enables anyone to use and work with the data without having to crawl the entire Web themselves.

In order to react promptly to the ongoing developments in the area of structured data and the analysis of web crawls, the Chair Wifo V is looking for active support with the

  • further development of the Web Data Commons extraction framework
  • analysis of the extracted data

Requirements

  • Solid knowledge of Java
  • First experience with structured (web) data (RDF(a), Microformats, and Microdata)
  • Experience with cloud computing (AWS) and/or web crawlers is a plus
  • Motivation and enthusiasm for familiarizing yourself with new topic areas
  • Ability to work independently

What we offer:

  • Flexible working hours
  • Collaboration on an open-source project
  • Applying what you have learned at university in practice
  • A coffee machine at the chair ;)

Interested candidates should send their application, including a list of their IT skills and experience as well as a transcript of records, to Prof. Dr. Christian Bizer and Robert Meusel.

]]>
Open Positions - Hiwis
news-1086 Tue, 14 Jul 2015 10:09:00 +0000 Yahoo Faculty Research And Engagement Award https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/yahoo-faculty-research-and-engagement-award/ We are happy to announce that Professor Christian Bizer's team has received a Yahoo Faculty Research and Engagement (FREP) award in the 2015 award program.

The goal of the Yahoo Faculty Research And Engagement (FREP) Award program is to establish high-quality scientific collaborations between Yahoo Labs and selected universities around the globe in order to conduct research in areas of mutual interest.

The goal of our joint research with Yahoo Labs is to develop methods for completing cross-domain knowledge graphs with data from large numbers of external data sources.

The work will build on existing research at the University of Mannheim on matching large numbers of Web tables against DBpedia [1] as well as on learning fusion policies for resolving data conflicts while augmenting the DBpedia knowledge base [2].

[1] Dominique Ritze, Oliver Lehmberg, Christian Bizer: Matching HTML Tables to DBpedia. 5th International Conference on Web Intelligence, Mining and Semantics (WIMS2015), Limassol, Cyprus, July 2015.

[2] Volha Bryl, Christian Bizer: Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data Fusion. 4th Workshop on Web Quality (WebQuality2014) @ WWW 2014, Seoul, Korea, April 2014.

]]>
Projects Chris
news-1084 Tue, 14 Jul 2015 07:21:00 +0000 Article Analyzing the Graph Structure of the Web accepted by Journal of Web Science https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/article-analyzing-the-graph-structure-of-the-web-accepted-by-journal-of-web-science/ The article "The Graph Structure in the Web - Analyzed on Different Aggregation Levels" by Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer has been accepted for publication by the Journal of Web Science.

The article analyzes the structure of the overall hyperlink graph of the Web and updates the findings of Broder et al. from the year 2000.

A preprint version of the article is available.

Title

The Graph Structure in the Web - Analyzed on Different Aggregation Levels

Abstract

Knowledge about the general graph structure of the World Wide Web is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we analyze a large web graph. The graph was extracted from a large publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012. The graph covers over 3.5 billion web pages and 128.7 billion links. We analyse and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components.

We conduct our analysis on three different levels of aggregation: page, host, and pay-level domain (PLD) (one “dot level” above public suffixes). Our analysis shows that, as evidenced by previous research, some of the features previously observed by Broder et al. are very dependent on artifacts of the crawling process, whereas others appear to be more structural. We confirm the existence of a giant strongly connected component; we however find, as observed by other researchers, very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the “bow-tie structure” described by Broder et al. is strongly dependent on the crawling process, and to the best of our current knowledge is not a structural property of the web. More importantly, statistical testing and visual inspection of size-rank plots show that the distributions of indegree, outdegree, and sizes of strongly connected components of the page and host graphs are not power laws, contrary to what was previously reported for much smaller crawls, although they might be heavy tailed. If we aggregate at the pay-level domain, however, a power law emerges. We also provide for the first time accurate measurements of distance-based features, using recently introduced algorithms that scale to the size of our crawl.

]]>
Publications Chris
news-1085 Tue, 14 Jul 2015 06:51:00 +0000 Article accepted by Journal of Web Semantics: Mining the Web of Linked Data with RapidMiner https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/article-accepted-by-journal-of-web-semantics-mining-the-web-of-linked-data-with-rapidminer/ We are happy to announce that the article Mining the Web of Linked Data with RapidMiner by Petar Ristoski, Christian Bizer, and Heiko Paulheim has been accepted for publication by the Journal of Web Semantics.

Abstract

Lots of data from different domains is published as Linked Open Data (LOD). While there are quite a few browsers for such data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this system paper, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining and analysis platform RapidMiner, and offers operators for accessing Linked Open Data in RapidMiner, allowing for using it in sophisticated data analysis workflows without the need for expert knowledge in SPARQL or RDF. The extension allows for autonomously exploring the Web of Data by following links, thereby discovering relevant datasets on the fly, as well as for integrating overlapping data found in different datasets. As an example, we show how statistical data from the World Bank on scientific publications, published as an RDF data cube, can be automatically linked to further datasets and analyzed using additional background knowledge from ten different LOD datasets.

Keywords

  • Linked Open Data
  • Data Mining
  • RapidMiner

More information about the RapidMiner LOD Extension is found here.

]]>
Publications Chris
news-181 Mon, 13 Jul 2015 11:36:00 +0000 ACM CIKM 2015 paper accepted https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/acm-cikm-2015-paper-accepted/ Michael will present his joint work with Laura and Simone on "Ranking Entities for Web Queries through Text and Knowledge" at the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015), one of the top-tier ACM conferences in the areas of information retrieval, knowledge management and databases. More info at http://www.cikm-2015.org/.

]]>
Publications Simone
news-1087 Mon, 13 Jul 2015 11:31:00 +0000 CfP: ACM Journal of Data and Information Quality (JDIQ), Special Issue on Web Data Quality https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/cfp-acm-journal-of-data-and-information-quality-jdiq-special-issue-on-web-data-quality/ Together with Luna Dong (Google), Ihab Ilyas (University of Waterloo) and Maria-Esther Vidal (Universidad Simon Bolivar), Christian Bizer is editing a special issue of the ACM Journal of Data and Information Quality (JDIQ) on Web Data Quality. The goal of the special issue is to present innovative research in the areas of Web Data Quality Assessment and Web Data Cleansing. 

The submission deadline for the special issue is November 1st 2015.

The call for papers is found below.

Call for Papers

ACM Journal of Data and Information Quality (JDIQ)

Special Issue on Web Data Quality

Guest editors

  • Christian Bizer, University of Mannheim, Germany
  • Luna Dong, Google, USA
  • Ihab Ilyas, University of Waterloo, Canada
  • Maria-Esther Vidal, Universidad Simon Bolivar, Venezuela

Introduction

The volume and variety of data that is available on the web has risen sharply. In addition to traditional data sources and formats such as CSV files, HTML tables and deep web query interfaces, new techniques such as Microdata, RDFa, Microformats and Linked Data have found wide adoption. In parallel, techniques for extracting structured data from web text and semi-structured web content have matured resulting in the creation of large-scale knowledge bases such as NELL, YAGO, DBpedia, and the Knowledge Vault.

Independent of the specific data source or format or information extraction methodology, data quality challenges persist in the context of the web. Applications are confronted with heterogeneous data from a large number of independent data sources while metadata is sparse and of mixed quality. In order to utilize the data, applications must first deal with this widely varying quality of the available data and metadata.

Topics

The goal of this special issue of JDIQ is to present innovative research in the areas of Web Data Quality Assessment and Web Data Cleansing. Specific topics within the scope of the call include, but are not limited to, the following:

WEB DATA QUALITY ASSESSMENT

  • Metrics and methods for assessing the quality of web data, including Linked Data, Microdata, RDFa, Microformats and tabular data.
  • Methods for uncovering distorted and biased data / data SPAM detection.
  • Methods for quality-based web data source selection.
  • Methods for copy detection.
  • Methods for assessing the quality of instance- and schema-level links in Linked Data.
  • Ontologies and controlled vocabularies for describing the quality of web data sources and metadata.
  • Best practices for metadata provision.
  • Cost and benefits of web data quality assessment and benchmarks.

WEB DATA CLEANSING

  • Methods for cleansing Web data, Linked Data, Microdata, RDFa, Microformats and tabular data.
  • Conflict resolution using semantic knowledge and truth discovery.
  • Human-in-the-loop and crowdsourcing for data cleansing.
  • Data quality for automated knowledge base construction.
  • Empirical evaluation of scalability and performance of data cleansing methods and benchmarks.

APPLICATIONS AND USE CASES IN THE LIFE SCIENCES, HEALTHCARE, MEDIA, SOCIAL MEDIA, GOVERNMENT AND SENSOR DATA.

Important dates

  • Initial submission: November 1, 2015
  • First review: January 15, 2016
  • Revised manuscripts: February 15, 2016
  • Second review: March 30, 2016
  • Publication: May 2016

Submission guidelines

jdiq.acm.org/authors.cfm

]]>
Publications Chris
news-1075 Tue, 30 Jun 2015 08:23:00 +0000 Prof. Bizer gives keynote speech at 18th International Conference on Business Information Systems https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/prof-bizer-gives-keynote-speech-at-18th-international-conference-on-business-information-systems/ Professor Christian Bizer was invited to give a keynote speech at the 18th International Conference on Business Information Systems (BIS2015) in Poznań, Poland. 

The slides of the keynote are available on Slideshare.

Title: 

Evolving the Web into a Global Dataspace – Advances and Applications

Abstract:

Motivated by Google, Yahoo!, Microsoft, and Facebook, hundreds of thousands of websites have started to annotate structured data within their pages using markup formats such as Microdata, RDFa, and Microformats. In parallel, the adoption of Linked Data technologies by government agencies, libraries, and scientific institutions has risen considerably. In his talk, Christian Bizer will give an overview of the content profile of the resulting Web of Data. He will showcase applications that exploit the Web of Data and will discuss the challenges of integrating and cleansing data from thousands of independent Web data sources.

Further Information:

More Information about the conference is found here.

]]>
Chris Research Publications
news-235 Fri, 26 Jun 2015 06:59:00 +0000 Paper accepted at International Semantic Web Conference https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-international-semantic-web-conference/ The paper "Serving DBpedia with DOLCE - More than Just Adding a Cherry on Top", co-authored by Heiko Paulheim from the Data and Web Science Group and Aldo Gangemi from Universite Paris 13 - Sorbonne and STLab, ISTC-CNR Rome, has been accepted at the 14th International Semantic Web Conference (ISWC 2015).

Abstract: Large knowledge bases, such as DBpedia, are most often created heuristically due to scalability issues. In the building process, both random as well as systematic errors may occur. In this paper, we focus on finding systematic errors, or anti-patterns, in DBpedia. We show that by aligning the DBpedia ontology to the foundational ontology DOLCE-Zero, and by combining reasoning and clustering of the reasoning results, errors affecting millions of statements can be identified at a minimal workload for the knowledge base designer. Furthermore, identifying systematic errors can also lead to interesting questions about the nature of categorization, and the trade-off between data design and cognitive semantics.

]]>
Publications
news-1082 Tue, 23 Jun 2015 09:02:00 +0000 Paper published in Machine Learning Journal https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-published-in-machine-learning-journal/ The paper A decomposition of the outlier detection problem into a set of supervised learning problems by Heiko Paulheim and Robert Meusel has been published in Machine Learning. The paper discusses a novel outlier detection approach, which trains a set of regression models to capture the inherent patterns of the data, and determines outliers based on the deviations from those patterns. With that approach, it is also possible to create precise explanations for outliers.

Having been submitted to the ECML/PKDD journal track, the paper will also be presented at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2015).

]]>
Research
news-1081 Mon, 22 Jun 2015 14:30:00 +0000 Two Papers Accepted for the 5th International Conference on Web Intelligence, Mining and Semantics https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/two-papers-accepted-for-the-5th-international-conference-on-web-intelligence-mining-and-semantics/ The papers "A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time" by Robert Meusel, Christian Bizer, and Heiko Paulheim as well as "Matching HTML Tables to DBpedia" by Dominique Ritze, Oliver Lehmberg, and Christian Bizer have been accepted for the 5th International Conference on Web Intelligence, Mining and Semantics in Limassol, Cyprus.

Please find the abstracts of the papers below:

1. A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare different versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at different points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) as well as bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.

2. Matching HTML Tables to DBpedia

Millions of HTML tables containing structured data can be found on the Web. With their wide coverage, these tables are potentially very useful for filling missing values and extending cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph. As a prerequisite for being able to use table data for knowledge base extension, the HTML tables need to be matched with the knowledge base, meaning that correspondences between table rows/columns and entities/schema elements of the knowledge base need to be found. This paper presents the T2D gold standard for measuring and comparing the performance of web table to knowledge base matching systems. T2D consists of 8,700 schema-level and 26,100 entity-level correspondences between the WebDataCommons Web Tables Corpus and the DBpedia knowledge base. In contrast to related work on web tables to knowledge base matching, the Web Tables Corpus (147 million tables), the knowledge base, as well as the gold standard are publicly available. The gold standard is used afterward to evaluate the performance of T2K Match, an iterative matching method which combines schema and instance matching. T2K Match is designed for the use case of matching large quantities of mostly small and narrow web tables against large cross-domain knowledge bases. The evaluation using the T2D gold standard shows that T2K Match discovers table-to-class correspondences with a precision of 94%, row-to-entity correspondences with a precision of 90%, and column-to-property correspondences with a precision of 77%.

]]>
Publications Chris Research - Data Mining and Web Mining
news-306 Mon, 22 Jun 2015 09:15:00 +0000 New Part-time Master Program in Data Science https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/new-part-time-master-program-in-data-science/ Starting this fall, the Data and Web Science Group is offering a new part-time master program in Data Science together with the University of Tübingen and the University of Applied Sciences Albstadt-Sigmaringen.

The program is designed for students from industry who want to gain profound knowledge about Data Science while continuing to work in their jobs. The program consists of a mixture of distance teaching and phases of attendance at the three universities.

Detailed information about the program is found at Masterstudiengang Data Science.

The registration deadline for the new program is July 15th, 2015. More information about the registration is found here.

]]>
Projects Chris Heiner Simone
news-1080 Mon, 15 Jun 2015 14:12:00 +0000 Paper Accepted for the 9th International Conference on Scalable Uncertainty Management. https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-the-9th-international-conference-on-scalable-uncertainty-management/ The paper "Towards Large Scale Probabilistic OBDA" by Jörg Schönfisch and Heiner Stuckenschmidt has been accepted for the 9th International Conference on Scalable Uncertainty Management in Quebec City, Canada.

]]>
Heiner Research
news-1077 Thu, 11 Jun 2015 06:51:00 +0000 Paper Accepted for the 4th International Conference on Algorithmic Decision Theory. https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-the-4th-international-conference-on-algorithmic-decision-theory/ The paper "Towards Decision Making via Expressive Probabilistic Ontologies" by Erman Acar, Camilo Thorne and Heiner Stuckenschmidt has been accepted for the 4th International Conference on Algorithmic Decision Theory in Lexington, Kentucky.

]]>
Publications Heiner
news-1072 Wed, 03 Jun 2015 13:39:00 +0000 Article accepted by Journal of Web Semantics: The Mannheim Search Join Engine https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/article-accepted-by-journal-of-web-semantics-the-mannheim-search-join-engine/ We are happy to announce that the article "The Mannheim Search Join Engine" by Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paulheim, and Christian Bizer has been accepted for publication by the Journal of Web Semantics.

Abstract

A Search Join is a join operation which extends a user-provided table with additional attributes based on a large corpus of heterogeneous data originating from the Web or corporate intranets. Search Joins are useful within a wide range of application scenarios: Imagine you are an analyst having a local table describing companies and you want to extend this table with attributes containing the headquarters, turnover, and revenue of each company. Or imagine you are a film enthusiast and want to extend a table describing films with attributes like director, genre, and release date of each film. This article presents the Mannheim Search Join Engine which automatically performs such table extension operations based on a large corpus of Web data. Given a local table, the Mannheim Search Join Engine searches the corpus for additional data describing the entities contained in the input table. The discovered data are joined with the local table and are consolidated using schema matching and data fusion techniques. As a result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. We evaluate the Mannheim Search Join Engine using heterogeneous data originating from over one million different websites. The data corpus consists of HTML tables, as well as Linked Data and Microdata annotations which are converted into tabular form. Our experiments show that the Mannheim Search Join Engine achieves a coverage close to 100% and a precision of around 90% for the tasks of extending tables describing cities, companies, countries, drugs, books, films, and songs.

Keywords

  • Table extension
  • Data search 
  • Search joins 
  • Web tables 
  • Microdata 
  • Linked data

More information about the Mannheim Search Join engine is found here.

]]>
Publications Chris Research - Data Mining and Web Mining
news-710 Tue, 26 May 2015 12:40:00 +0000 Master Thesis: Multilingual Entity Linking (Ponzetto, Bizer) https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/master-thesis-multilingual-entity-linking-ponzetto-bizer/ Entity linking, the task of linking mentions of entities in text to wide-coverage concept repositories like DBpedia or Freebase, has so far concentrated almost exclusively on English [1]. This is reflected in the available taggers, which work only on English, a major limitation for the multilingual web of data. The goal of this thesis, accordingly, is to extend existing taggers such as DBpedia Spotlight [2] to a wide range of languages other than English.

Requirements

  • Solid programming skills
  • Experience with and genuine interest in working with large datasets
  • Previous knowledge of LOD, NLP, and Machine Learning is a plus

References

[1] A framework for benchmarking entity-annotation systems. M. Cornolti, P. Ferragina and M. Ciaramita. In WWW-13

[2] DBpedia Spotlight: Shedding Light on the Web of Documents. P.N. Mendes, M. Jakob, A. García-Silva and C. Bizer. In I-Semantics-11


Contact: Prof. Dr. Bizer or Prof. Dr. Ponzetto

]]>
Topics Chris Simone Topics - Artificial Intelligence (NLP) Thesis - Master
news-1064 Thu, 07 May 2015 07:02:00 +0000 Data Mining I and II Lecture Videos online https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/data-mining-i-and-ii-lecture-videos-online/ The Data and Web Science Group has started to record core lectures for Master students on video and to provide screen casts of accompanying exercises in order to enable students to be more flexible in their learning patterns.

So far, we have recorded the Data Mining I and the Data Mining II lectures and provide screen casts for the Data Mining I exercise.

The Data Mining I lecture gives an introduction to data mining and covers fundamental mining tasks such as classification, clustering, and association analysis, as well as the basics of text mining. The screen casts of the accompanying exercise explain how the data mining software RapidMiner is used to apply the learned methods within various use cases.

The Data Mining II lecture covers advanced data mining topics such as dimensionality reduction, anomaly detection, time series analysis, regression, ensembles, and online learning.

The videos and screen casts are available from the Lecture Videos page.

We plan to record more lectures in the upcoming semesters. Candidate lectures for being recorded are: Decision Support, Web Data Integration, and Text Analytics.

Lots of thanks to the Referat Neue Medien of the Stabsstelle Studium und Lehre for supporting us in recording the lecture videos.

]]>
Projects Chris
news-1059 Mon, 04 May 2015 10:03:00 +0000 KDnuggets and American Scientist coverage of WebDataCommons project https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/kdnugget-and-american-scientist-coverage-of-webdatacommons-project/ We are happy that KDnuggets has invited us to present the WebDataCommons project to the international data mining community in the form of a blog post on the portal.

The post about the project is found here.

Our work on updating the findings of Broder et al. from the year 2000 about the overall graph structure of the World Wide Web by analyzing the WebDataCommons Hyperlink Graph was also mentioned in the current issue of American Scientist as a use case for the Common Crawl.

The section mentioning our work is found here.
The whole article about the Common Crawl is found here.

]]>
Research Chris Projects Publications
news-1050 Mon, 13 Apr 2015 08:48:00 +0000 RDFa, Microdata and Microformat Corpus published containing data from 2 billion webpages https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/rdfa-microdata-and-microformat-corpus-published-containing-data-from-2-billion-webpages/ The DWS group is happy to announce a new release of the WebDataCommons RDFa, Microdata, and Microformat data corpus.

The data corpus has been extracted from the December 2014 version of the Common Crawl, covering 2.01 billion HTML pages which originate from 15.7 million websites (pay-level domains).

Altogether we discovered structured data within 620 million HTML pages out of the 2.01 billion pages contained in the crawl (30%). These pages originate from 2.7 million different pay-level domains out of the 15.7 million pay-level domains covered by the crawl (17%).

Approximately 571 thousand of these websites use RDFa, while 819 thousand websites use Microdata. Microformats are used by over 1 million websites within the crawl.
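A quick sanity check of the coverage figures above, using only the numbers quoted in this announcement:

```python
# Verify the rounded coverage percentages reported for the 2014 corpus.
pages_with_data, pages_total = 620e6, 2.01e9   # HTML pages with structured data
plds_with_data, plds_total = 2.7e6, 15.7e6     # pay-level domains with data

print(f"{pages_with_data / pages_total:.1%}")  # prints "30.8%", i.e. roughly 30%
print(f"{plds_with_data / plds_total:.1%}")    # prints "17.2%", i.e. roughly 17%
```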


Background

More and more websites embed structured data describing for instance products, people, organizations, places, events, reviews, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats.
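To make the markup idea concrete, here is a minimal, hypothetical sketch of how the Microdata attributes (itemscope, itemtype, itemprop) carry structured data inside ordinary HTML. Note that the actual WebDataCommons extraction uses the Any23 parser library, not this toy parser, and that the product markup below is invented.

```python
# Toy Microdata extractor: collect itemprop values nested inside an itemscope.
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items, self._current, self._prop = [], None, None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:          # a new item (entity) begins here
            self._current = {"@type": attrs.get("itemtype")}
            self.items.append(self._current)
        if "itemprop" in attrs and self._current is not None:
            self._prop = attrs["itemprop"]  # next text node is this property

    def handle_data(self, data):
        if self._prop and data.strip():
            self._current[self._prop] = data.strip()
            self._prop = None

page = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">ACME Anvil</span>
  <span itemprop="price">99.00</span>
</div>
"""
p = MicrodataParser()
p.feed(page)
print(p.items)  # [{'@type': 'http://schema.org/Product', 'name': 'ACME Anvil', 'price': '99.00'}]
```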

The WebDataCommons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format.

General information about the WebDataCommons project is found at http://webdatacommons.org/


Data Set Statistics 

Basic statistics about the December 2014 RDFa, Microdata, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

http://webdatacommons.org/structureddata/2014-12/stats/stats.html

Comparing the statistics to those of the November 2013 release of the data sets

http://webdatacommons.org/structureddata/#trend-2012-2014

we see that the adoption of the Microdata markup syntax has again increased (819 thousand websites in 2014 compared to 463 thousand in 2013, where both crawls cover a comparable number of websites), while the deployment of RDFa and Microformats is more or less stable.

Looking at the adoption of different vocabularies, we see that webmasters mostly follow the recommendation by Google, Microsoft, Yahoo!, and Yandex to use the schema.org vocabularies as well as their predecessors in the context of Microdata. In the context of RDFa, the most widely used vocabulary is the Open Graph Protocol recommended by Facebook.

Topic-wise, the trend that was already identified from 2012 to 2013 continues. We see that, besides navigational, blog, and CMS-related meta-information, many websites mark up e-commerce-related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress).


Download

The overall size of the December 2014 RDFa, Microdata, and Microformat data sets is 20.4 billion RDF quads. For download, we split the data into 3,533 files with a total size of 357 GB.

http://webdatacommons.org/structureddata/2014-12/stats/how_to_get_the_data.html

In addition, we have created separate files for over 50 different schema.org classes, each containing all quads from pages that deploy the respective class at least once.

http://webdatacommons.org/structureddata/2014-12/stats/schema_org_subsets.html


Lots of thanks to 

  • the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project. 
  • the Any23 project for providing their great library of structured data parsers. 
  • Amazon Web Services in Education Grant for supporting WebDataCommons. 

Have fun with the new data set.

Robert Meusel, Anna Primpeli, and Christian Bizer 

]]>
Research Projects Chris
news-150 Tue, 07 Apr 2015 13:46:00 +0000 Paper accepted in TODS: Closing the Gap: Sequence Mining at Scale https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-in-tods-closing-the-gap-sequence-mining-at-scale/ The paper "Closing the Gap: Sequence Mining at Scale" by Kaustubh Beedkar, Klaus Berberich, Rainer Gemulla and Iris Miliaraki has been accepted in the ACM Transactions on Database Systems (TODS).

 

Abstract:

Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
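The role of positional gap constraints can be illustrated with a toy, single-machine sketch. This is not MG-FSM itself (which partitions the database for MapReduce and supports temporal gaps, maximality, and closedness); it only shows how a maximum-gap constraint limits which length-2 subsequences count towards support. The example database is invented.

```python
# Count length-2 subsequences whose items are at most `max_gap` positions
# apart, keeping those that reach the minimum support.
from collections import Counter
from itertools import combinations

def frequent_pairs(database, max_gap, min_support):
    support = Counter()
    for seq in database:
        found = set()
        for i, j in combinations(range(len(seq)), 2):
            if j - i - 1 <= max_gap:          # positional gap constraint
                found.add((seq[i], seq[j]))   # count once per input sequence
        support.update(found)
    return {p: s for p, s in support.items() if s >= min_support}

db = [list("abcd"), list("axbc"), list("abxc")]
print(frequent_pairs(db, max_gap=1, min_support=2))
```

With `max_gap=1`, the pair ("a", "c") is supported only by "abcd" (in the other two sequences the gap is too large), so it falls below the support threshold.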

]]>
Research - Data Analytics Publications
news-160 Sat, 04 Apr 2015 08:57:00 +0000 DWS Students Win Prize at DataFest Germany https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-students-win-prize-at-datafest-germany/ From March 20th to 22nd, the DataFest Germany was held at University of Mannheim. 18 teams from all over Germany, coming from various backgrounds (computer science, maths, social science, etc.) were asked to analyze a given dataset of the usage of mobile phone apps.

A team of five DWS students - Mats Schade, Daniel Ringler, Alexander Renz-Wieland, Robert Litschko, and Benjamin Schaefer, some of whom also work as student assistants for the group - won the prize for the best use of outside data. They developed a framework for measuring the outreach of mobile phones, both geographically and temporally, and showed that mobile phone users can easily be de-anonymized based on indicators such as their browser history and geolocation data.

Congratulations to the winners!

]]>
Research Other
news-192 Wed, 01 Apr 2015 10:16:00 +0000 Paper accepted at SIGMOD 2015: LEMP: Fast Retrieval of Large Entries in a Matrix Product https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-sigmod-2015-lemp-fast-retrieval-of-large-entries-in-a-matrix-product/ The paper "LEMP: Fast Retrieval of Large Entries in a Matrix Product" by Christina Teflioudi, Rainer Gemulla, and Olga Mykytiuk was accepted for the SIGMOD 2015 conference in Melbourne, Australia.

Abstract:

We study the problem of efficiently retrieving large entries in the product of two given matrices, which arises in a number of data mining and information retrieval tasks. We focus on the setting where the two input matrices are tall and skinny, i.e., with millions of rows and tens to hundreds of columns. In such settings, the product matrix is large and its complete computation is generally infeasible in practice. To address this problem, we propose the LEMP algorithm, which efficiently retrieves only the large entries in the product matrix without actually computing it. LEMP maps the large-entry retrieval problem to a set of smaller cosine similarity search problems, for which existing methods can be used. We also propose novel algorithms for cosine similarity search, which are tailored to our setting. Our experimental study on large real-world datasets indicates that LEMP is up to an order of magnitude faster than state-of-the-art approaches.
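The reduction at the heart of the abstract rests on the identity q·p = ‖q‖‖p‖cos(q, p): an entry of the product exceeds a threshold theta exactly when the cosine similarity of the corresponding rows exceeds theta / (‖q‖‖p‖). The brute-force sketch below (invented matrices) only demonstrates this equivalence; LEMP's actual contribution is avoiding the full scan via pruning and tailored similarity-search algorithms.

```python
# Retrieve all entries of Q·Pᵀ that are at least theta, phrased as a
# cosine-similarity test rather than a direct dot-product test.
from math import sqrt

def norm(v):
    return sqrt(sum(x * x for x in v))

def large_entries(Q, P, theta):
    """All (i, j) with Q[i] . P[j] >= theta, found via cosine similarity."""
    hits = []
    for i, q in enumerate(Q):
        for j, p in enumerate(P):
            cos = sum(a * b for a, b in zip(q, p)) / (norm(q) * norm(p))
            if cos >= theta / (norm(q) * norm(p)):  # equivalent to q.p >= theta
                hits.append((i, j))
    return hits

Q = [[1.0, 0.0], [3.0, 4.0]]
P = [[2.0, 0.0], [0.0, 1.0]]
print(large_entries(Q, P, theta=4.0))  # only row 1 of Q produces entries >= 4
```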

]]>
Publications Rainer Topics - Data Mining Research - Data Analytics Research
news-162 Wed, 01 Apr 2015 08:23:00 +0000 Gold Standard for Evaluating Web Table Matching Systems released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/gold-standard-for-evaluating-web-table-matching-systems-released/ Many HTML tables on the Web are used for layout purposes, but a small fraction of all tables contains structured data. As this data has a wide coverage, it could potentially be very valuable for filling missing values and extending cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph. As a prerequisite for being able to use table data for knowledge base extension, the Web tables need to be matched to the knowledge base in question, meaning that correspondences between the rows of the tables and the entities described in the knowledge base as well as between the columns of the tables and the schema of the knowledge base need to be found.

Various systems have been developed to solve this matching task. Until now, it was difficult to compare the performance of these systems, as they were evaluated using partly non-public Web table data as well as different knowledge bases.

The Data and Web Science group has developed the T2D Gold Standard which tries to fill this gap by providing a large set of human-generated correspondences between a public Web tables corpus and the DBpedia knowledge base.

The T2D Gold Standard contains schema-level correspondences between 1748 Web tables from the English-language subset of the Web Data Commons Web Tables Corpus and DBpedia Version 2014. For 233 of these tables, all rows have been manually mapped to entities in the DBpedia knowledge base (altogether 26,124 correspondences).

More information about the T2D Gold standard is found at http://webdatacommons.org/webtables/goldstandard.html
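Matching systems are typically scored against such a gold standard with precision, recall, and F1 over the sets of correspondences. A minimal sketch of that evaluation, with made-up correspondences rather than actual T2D data:

```python
# Score a system's correspondences against human-generated gold correspondences.
def evaluate(system, gold):
    tp = len(system & gold)                      # correct correspondences
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("table1.row3", "dbpedia:Mannheim"),
        ("table1.row4", "dbpedia:Berlin"),
        ("table2.row1", "dbpedia:Germany")}
system = {("table1.row3", "dbpedia:Mannheim"),   # correct
          ("table2.row1", "dbpedia:France")}     # wrong entity
p, r, f = evaluate(system, gold)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```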

]]>
Research Projects Chris
news-231 Wed, 25 Mar 2015 08:28:00 +0000 Three Papers accepted at ESWC 2015 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/three-papers-accepted-at-eswc-2015/ We are happy to announce that three papers got accepted at the 12th Extended Semantic Web Conference (ESWC 2015), held in Portoroz, Slovenia. The ESWC is an important international forum for the Semantic Web / Linked Data community.

Abstracts of the accepted papers: 

RODI: A Benchmark for Automatic Mapping Generation in Relational-to-Ontology Data Integration (Christoph Pinkel, Carsten Binnig, Ernesto Jimenez-Ruiz, Wolfgang May, Dominique Ritze, Martin G. Skjaeveland, Alessandro Solimando and Evgeny Kharlamov)
A major challenge in information management today is the integration of huge amounts of data distributed across multiple data sources. One suggested approach to this problem is ontology-based data integration where legacy data systems are integrated via a common ontology that represents a unified global view over all data sources. In many domains (e.g., biology, medicine) there exist established ontologies to integrate data from existing data sources. However, data is often not natively born using these ontologies. Instead, much data resides in relational databases. Therefore, mappings that relate the legacy data sources to the ontology need to be constructed. Recent techniques and systems that automatically construct such mappings have been developed. The quality metrics of these systems are, however, often only based on self-designed, highly biased benchmarks. This paper introduces a new publicly available benchmarking suite called RODI which is designed to cover a wide range of integration challenges in Relational-to-Ontology Data Integration scenarios. RODI provides a set of different relational data sources and ontologies as well as a scoring function with which the performance of relational-to-ontology mapping construction systems may be evaluated.

Towards Linked Open Data enabled Data Mining: Strategies for Feature Generation, Propositionalization, Selection, and Consolidation (Petar Ristoski)
Background knowledge from Linked Open Data sources can be used to improve the results of a data mining problem at hand: predictive models can become more accurate, and descriptive models can reveal more interesting findings. However, collecting and integrating background knowledge is a tedious manual work. In this paper we propose a set of desiderata, and identify the challenges for developing a framework for unsupervised generation of data mining features from Linked Data.

Heuristics for Fixing Common Errors in Deployed schema.org Microdata (Robert Meusel and Heiko Paulheim)
Being promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using schema.org, has become one of the most important markup languages for the Web. However, deployed Microdata is most often not free from errors, which limits its practical use. In this paper, we use the WebDataCommons corpus of Microdata extracted from more than 250 million web pages for a quantitative analysis of common mistakes in Microdata provision. Since it is unrealistic that data providers will provide clean and correct data, we discuss a set of heuristics that can be applied on the data consumer side to fix many of those mistakes in a post-processing step. We apply those heuristics to provide an improved knowledge base constructed from the raw Microdata extraction.  

 

 

]]>
Publications Research - Data Mining and Web Mining Chris
news-229 Tue, 17 Mar 2015 08:20:00 +0000 Semantic Web Journal Special Issue on Semantic Web Interfaces published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/semantic-web-journal-special-issue-on-on-semantic-web-interfaces-published/ The Semantic Web Journal has published a special issue on Semantic Web Interfaces, co-edited by Roberto García, Heiko Paulheim, and Paola Di Maio.

The special issue collects research works addressing the question of how humans can interact with the semantic web along the whole lifecycle from data creation to data consumption, considering both individual and collaborative interaction.

]]>
Group Publications
news-238 Thu, 05 Mar 2015 15:04:00 +0000 Springer starts publishing Linked Data https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/springer-starts-publishing-linked-data/ The publishing company Springer has started to publish Linked Data on the Web. Springer has opened selected metadata for conference proceedings, heeding the European Commission’s call for promoting open data. More details on the Springer Linked Data portal can be found at lod.springer.com, with the visualization available at lod.springer.com/live.

The pilot project is a joint effort of Springer with the Data and Web Science Group, University of Mannheim, Germany and Netwise, Italy. It currently involves open data on roughly 8,000 proceedings volumes from around 1,200 conference series, including Springer’s Lecture Notes in Computer Science. The data currently available focuses on computer science, with other fields to follow.

The pilot group was recently joined by the Vienna University of Technology, providing data about members of conference program committees, and the PEERE project, which provides additional information on the peer-review processes employed at different conferences. The Springer Linked Data is, for instance, being used by the participants of the Semantic Publishing Challenge 2015.

See also: Springer Press Release about pilot project on Linked Open Data

]]>
Projects Chris
news-292 Fri, 20 Feb 2015 09:50:00 +0000 Paper accepted for SIGMOD 2015: LASH: Large-Scale Sequence Mining with Hierarchies https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-for-sigmod-2015-lash-large-scale-sequence-mining-with-hierarchies/ The paper "LASH: Large-Scale Sequence Mining with Hierarchies" by Kaustubh Beedkar and Rainer Gemulla was accepted for the SIGMOD 2015 conference in Melbourne, Australia.

Abstract:

We propose LASH, a scalable, distributed algorithm for mining sequential patterns in the presence of hierarchies. LASH takes as input a collection of sequences, each composed of items from some application-specific vocabulary. In contrast to traditional approaches to sequence mining, the items in the vocabulary are arranged in a hierarchy: both input sequences and sequential patterns may consist of items from different levels of the hierarchy. Such hierarchies naturally occur in a number of applications including mining natural-language text, customer transactions, error logs, or event sequences. LASH is the first parallel algorithm for mining frequent sequences with hierarchies; it is designed to scale to very large datasets. At its heart, LASH partitions the data using a novel, hierarchy-aware variant of item-based partitioning and subsequently mines each partition independently and in parallel using a customized mining algorithm called pivot sequence miner. LASH is amenable to a MapReduce implementation; we propose effective and efficient algorithms for both the construction and the actual mining of partitions. Our experimental study on large real-world datasets suggests good scalability and run-time efficiency.
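The effect of the hierarchy can be sketched on a single machine. The toy below illustrates only the core generalization step (an item supports patterns at every ancestor level of the hierarchy) for single items rather than full sequential patterns, and is nothing like LASH's distributed, partition-based mining; hierarchy and sequences are invented.

```python
# Each item in a sequence also supports all of its ancestors, so frequent
# patterns can emerge at higher hierarchy levels than the raw data.
from collections import Counter

PARENT = {"poodle": "dog", "beagle": "dog", "dog": "animal",
          "siamese": "cat", "cat": "animal"}

def generalizations(item):
    """The item itself plus all of its ancestors in the hierarchy."""
    out = [item]
    while item in PARENT:
        item = PARENT[item]
        out.append(item)
    return out

def frequent_items(database, min_support):
    support = Counter()
    for seq in database:
        seen = {g for item in seq for g in generalizations(item)}
        support.update(seen)                  # count once per sequence
    return {i: s for i, s in support.items() if s >= min_support}

db = [["poodle", "siamese"], ["beagle"], ["siamese"]]
print(frequent_items(db, min_support=2))
```

Neither "poodle" nor "beagle" is frequent on its own, but their common ancestor "dog" is, which is exactly the kind of pattern that flat sequence mining misses.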

]]>
Research - Data Analytics Publications
news-380 Tue, 10 Feb 2015 11:10:00 +0000 Semester Kickoff BBQ https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/semester-kickoff-bbq-2/ English Version

It is slowly becoming a tradition that the Data and Web Science research group takes the beginning of the new semester as an opportunity to host a barbecue in order to welcome new colleagues and introduce the upcoming courses to the best students of the last semester. Thus, on Monday (9 February 2015), accompanied by cold beverages and grilled food, the professors presented the spring/summer semester program. You can find the slides at this link; the courses are:

Knowledge Management, Database Systems II, Data Mining I, Data Mining II, Web Mining, Web Search and Information Retrieval, Process Mining, Database Analytics Seminar, Team Project: Sequential Pattern Analytics.

Throughout the evening, many interesting topics were discussed and some cornerstones for future theses could be laid.

We thank all the participants for coming and wish especially our students a good and successful start into the new semester.

 


]]>
Other Group
news-388 Tue, 10 Feb 2015 08:39:00 +0000 Full Paper accepted for CAiSE 2015 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/full-paper-accepted-for-caise-2015/ The paper "Towards the Automated Annotation of Process Models" by Henrik Leopold, Christian Meilicke, Michael Fellmann, Fabian Pittke, Heiner Stuckenschmidt and Jan Mendling has been accepted for the 27th International Conference on Advanced Information Systems Engineering (CAiSE 2015). It will be presented in Stockholm, June 8-12, 2015.

]]>
Publications Heiner Research
news-442 Sun, 18 Jan 2015 11:55:00 +0000 Full Paper Accepted for WWW'15 Research Track https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/full-paper-accepted-for-www15-research-track/ The paper "Enriching Structured Knowledge with Open Information" by Arnab Dutta, Christian Meilicke and Heiner Stuckenschmidt has been accepted for the Research track of WWW 2015 in Florence, Italy. 

]]>
Publications Heiner
news-526 Thu, 11 Dec 2014 12:20:00 +0000 Karl Steinbuch Scholarship for Viktor Schulz https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/karl-steinbuch-scholarship-for-viktor-schulz/ Viktor Schulz has been selected for receiving a Karl-Steinbuch Scholarship from the federal state of Baden Württemberg for carrying out an individual research project along with his studies in the Business Informatics program of the faculty. His Project on Business Process Matching will be supervised by Prof. Stuckenschmidt and carried out in cooperation with Dr. Henrik Leopold from the University of Business and Economics in Vienna.

]]>
Heiner Research
news-1128 Wed, 03 Dec 2014 08:49:00 +0000 8th Linked Data on the Web Workshop at WWW2015 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/8th-linked-data-on-the-web-workshop-at-www2015/ Together with Sir Tim Berners-Lee (W3C/MIT, USA), Tom Heath (Open Data Institute, UK) and Sören Auer (University of Bonn and Fraunhofer IAIS, Germany), Christian Bizer is organizing the 8th Linked Data on the Web Workshop (LDOW2015)  at the 24th World Wide Web Conference (WWW2015) in Florence, Italy.

Goals of the Workshop

The Web is continuing to develop from a medium for publishing textual documents into a medium for sharing structured data. In 2014, the Web of Linked Data grew to a size of about 1000 datasets with contributions coming from companies, governments and other public sector bodies such as libraries, statistical bodies or research institutions. In parallel, the schema.org initiative has found increasing adoption with large numbers of websites semantically marking up the content of their HTML pages.

The 8th Workshop on Linked Data on the Web (LDOW2015) aims to stimulate discussion and further research into the challenges of publishing, consuming, and integrating structured data from the Web as well as mining knowledge from the global Web of Data. In addition to its traditional focus on open web data, the special focus of this year’s LDOW workshop will be the application of Linked Data technologies in enterprise settings as well as the potentials of interlinking closed enterprise data with open data from the Web.

Important Dates

  • Submission deadline: 15 March, 2015 (23:59 Pacific Time)
  • Notification of acceptance: 6 April, 2015
  • Camera-ready versions of accepted papers: 20 April, 2015
  • Workshop date: 19 May, 2015

Topics of Interest

Topics of interest for the workshop include, but are not limited to, the following:

Linked Enterprise Data

  • role of Linked Data within enterprise applications (e.g. ERP, SCM, CRM)
  • integration of SOA and Linked Data approaches in joint frameworks
  • authentication, security and access control approaches for Linked Enterprise Data
  • use cases combining closed enterprise data with open data from the Web

Mining the Web of Data

  • large-scale derivation of implicit knowledge from the Web of Data
  • using the Web of Data as background knowledge in data mining

Integrating Large Numbers of Linked Data Sources

  • linking algorithms and heuristics, identity resolution
  • schema matching and clustering
  • data fusion
  • evaluation of linking, schema matching and data fusion methods

Quality Assessment, Provenance Tracking and Licensing

  • evaluating quality and trustworthiness of Web data
  • profiling and change tracking of Web data sources
  • tracking provenance and usage of Web data
  • licensing issues in Linked Data publishing

Linked Data Applications

  • application showcases including browsers and search engines
  • marketplaces, aggregators and indexes for Web data
  • visualization and exploration of Web data
  • business models for Linked Data publishing and consumption
  • Linked Data applications for life-sciences, digital humanities, social sciences etc.

Submissions

We seek the following kinds of submissions:

  1. Full scientific papers: up to 10 pages in ACM format
  2. Short scientific and position papers: up to 5 pages in ACM format

Submissions must be formatted using the ACM SIG template (as per the WWW2015 Research Track) available at http://www.acm.org/sigs/publications/proceedings-templates. Please note that the author list does not need to be anonymized, as we do not operate a double-blind review process. Submissions will be peer reviewed by at least three independent reviewers. Accepted papers will be presented at the workshop and included in the workshop proceedings. 

Proceedings

Accepted papers will be made available on this website and published as a volume of the CEUR series of workshop proceedings.

]]>
Research Publications Chris
news-872 Fri, 21 Nov 2014 10:42:00 +0000 Heiner Stuckenschmidt will be PC Chair of EC-Web 2015 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/heiner-stuckenschmidt-will-be-pc-chair-of-ec-web-2015/ Heiner Stuckenschmidt has been nominated as Program Chair of the 16th International Conference on Electronic Commerce and Web Technologies. Together with Dietmar Jannach who is heading the e-Services Research Group at TU Dortmund he will be in charge of the scientific program of the Conference  that will take place in Valencia, Spain in September 2015.

]]>
Heiner Research
news-1049 Mon, 27 Oct 2014 14:43:00 +0000 RapidMiner Linked Open Data Extension wins Semantic Web Challenge Open Track https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/rapidminer-linked-open-data-extension-wins-semantic-web-challenge-open-track/ We are very happy that the RapidMiner Linked Open Data Extension has won the Semantic Web Challenge 2014 Open Track.

The Semantic Web Challenge is the premier competition for showcasing progress towards the realization of the Semantic Web. The challenge has taken place for 11 years at the annual International Semantic Web Conference. It is organized into two tracks: The goal of the Open Track is to showcase the benefits that Semantic Web technologies can bring to end-user applications. The goal of the Big Data Track is to demonstrate approaches that can work on Web scale using realistic Web-quality data.

Our submission to the Open Track demonstrates how Linked Data from the Web can be used as background knowledge within data mining processes. The extension hooks into the powerful data mining platform RapidMiner and offers operators for accessing Linked Open Data in RapidMiner, allowing for using it in sophisticated data analysis workflows without the need to know SPARQL or RDF.

Full reference

]]>
Research Publications Chris
news-1048 Mon, 27 Oct 2014 13:57:00 +0000 Search Join Engine wins Semantic Web Challenge Big Data Track https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/search-join-engine-wins-semantic-web-challenge-big-data-track/ We are very happy that the Mannheim Search Join Engine has won the Semantic Web Challenge 2014 Big Data Track.

The Semantic Web Challenge is the premier competition for showcasing progress towards the realization of the Semantic Web. The challenge has taken place for 11 years at the annual International Semantic Web Conference. It is organized into two tracks: The goal of the Open Track is to showcase the benefits that Semantic Web technologies can bring to end-user applications. The goal of the Big Data Track is to demonstrate approaches that can work on Web scale using realistic Web-quality data.

Our submission to the Big Data Track demonstrates how local tables about arbitrary topics can be extended with additional columns based on data from 36.3 million web tables which originate from over 1.5 million different websites. For instance, given a table about countries the system finds several hundred web tables that contain data about the population of these countries and adds a column to the input table by merging (fusing) all population numbers from these web tables. 

Full reference


]]>
Research Publications Chris
news-1047 Mon, 27 Oct 2014 13:17:00 +0000 DBpedia Article wins Semantic Web Journal Outstanding Paper Award https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dbpedia-article-wins-semantic-web-journal-outstanding-paper-award/ We are happy to announce that our article DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia has won the Semantic Web Journal Outstanding Paper Award.

The article gives an overview of the development of the DBpedia knowledge base and the DBpedia data extraction framework since the publication of the previous DBpedia overview article in 2009.

Since its acceptance for publication in January 2014, the article has been

Full Reference

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, Christian Bizer, DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia.

Abstract

The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different language editions of Wikipedia. The largest DBpedia knowledge base which is extracted from the English edition of Wikipedia consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other 110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The DBpedia project maps Wikipedia infoboxes from 27 different language editions to a single shared ontology consisting of 320 classes and 1,650 properties. The mappings are created via a world-wide crowd-sourcing effort and enable knowledge from the different Wikipedia editions to be combined. The project publishes releases of all DBpedia knowledge bases for download and provides SPARQL query access to 14 out of the 111 language editions via a global network of local DBpedia chapters. In addition to the regular releases, the project maintains a live knowledge base which is updated whenever a page in Wikipedia changes. DBpedia sets 27 million RDF links pointing into over 30 external data sources and thus enables data from these sources to be used together with DBpedia data. Several hundred data sets on the Web publish RDF links pointing to DBpedia themselves and make DBpedia one of the central interlinking hubs in the Linked Open Data (LOD) cloud. In this system report, we give an overview of the DBpedia community project, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and applications.

]]>
Research Chris Publications
news-1030 Mon, 15 Sep 2014 10:18:00 +0000 Internship from October 2014 in IT Infrastructure & Operations (ITI/CA), Germersheim Plant https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/praktikum-ab-oktober-2014-im-bereich-it-infrastruktur-betrieb-itica-werk-germersheim/ Tasks

  • The "DAS" team (IT Datacenter and Application Services) is responsible for the server infrastructure of all GLC and LC sites. In this varied and independent role at ITI/CA-DAS, you will take part in the operation and support of Windows, Unix and Linux servers and applications. During your internship you will have the opportunity to work on your own responsibility and to get to know the corresponding processes.
  • This includes the following activities:
  • Supporting the team in the server environment
  • Contributing to server management, primarily in operations
  • Independent handling of defined work packages
  • Preparation of presentations, documentation and briefings for internal meetings
  • Working out the configuration of the monitoring software "HP OpenView" to correlate application status information
  • Design and setup of a central log file analysis based on Logstash

Qualifications

  • Degree programme: (business) informatics, (industrial) engineering or a comparable course of study
  • Language skills: confident written and spoken German and English
  • IT skills: confident use of MS Office and knowledge of server technologies
  • Personal competencies: flexibility, self-motivation, initiative, a confident manner, communication skills, good articulation and an independent way of working

Additional information

  • The position is full-time
  • Interested? Then please apply exclusively online via our homepage with your complete documents attached (CV, certificate of enrolment, current transcript of grades, relevant certificates and, if applicable, proof of a mandatory internship and proof of the standard period of study).
  • Nationals of countries outside the European Economic Area should also enclose their residence/work permit, if applicable.
  • Contact (department): Mr Rainer Geiges, tel.: +49 7274/56-2881.
  • Contact (HR): HR Services, tel.: +49 711/17-99544.

Key facts
Published on:
06.08.2014
Job reference number:
124360
Location:
Global Logistics Center Germersheim
Department:
Datacenter & Application Operation Services
Contact (department):
Dr Rainer Geiges
Field of activity:
IT / Telecommunications


]]>
Open Positions - Hiwis
news-1026 Tue, 09 Sep 2014 12:21:00 +0000 DBpedia Knowledge Base Version 2014 released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dbpedia-knowledge-base-version-2014-released/ We are happy to announce the release of DBpedia 2014.

Knowledge bases are playing an increasingly important role in enhancing the intelligence of Web and enterprise search and in supporting information integration as well as natural language processing. Today, most knowledge bases cover only specific domains, are created by relatively small groups of knowledge engineers, and are very cost intensive to keep up-to-date as domains change. At the same time, Wikipedia has grown into one of the central knowledge sources of mankind, maintained by thousands of contributors.

The DBpedia project leverages this gigantic source of knowledge by extracting structured information from Wikipedia and by making this information accessible on the Web as a large, multilingual, cross-domain knowledge base.

The most important improvements of the new DBpedia 2014 release compared to DBpedia 3.9 release are:

1. the new release is based on updated Wikipedia dumps dating from April / May 2014 (the 3.9 release was based on dumps from March / April 2013), leading to an overall increase of the number of things described in the English edition from 4.26 to 4.58 million things.

2. the DBpedia ontology is enlarged and the number of infobox to ontology mappings has risen, leading to richer and cleaner data.

The English version of the DBpedia knowledge base currently describes 4.58 million things, out of which 4.22 million are classified in a consistent ontology (http://wiki.dbpedia.org/Ontology2014), including 1,445,000 persons, 735,000 places (including 478,000 populated places), 411,000 creative works (including 123,000 music albums, 87,000 films and 19,000 video games), 241,000 organizations (including 58,000 companies and 49,000 educational institutions), 251,000 species and 6,000 diseases.

We provide localized versions of DBpedia in 125 languages. All these versions together describe 38.3 million things, out of which 23.8 million are localized descriptions of things that also exist in the English version of DBpedia. The full DBpedia data set features 38 million labels and abstracts in 125 different languages, 25.2 million links to images and 29.8 million links to external web pages; 80.9 million links to Wikipedia categories, and 41.2 million links to YAGO categories. DBpedia is connected with other Linked Datasets by around 50 million RDF links.

Altogether, the DBpedia 2014 release consists of 3 billion pieces of information (RDF triples), out of which 580 million were extracted from the English edition of Wikipedia and 2.46 billion from other language editions.

Detailed statistics about the DBpedia data sets in 28 popular languages are provided at Dataset Statistics page (http://wiki.dbpedia.org/Datasets2014/DatasetStatistics).

You can download the new DBpedia datasets from http://wiki.dbpedia.org/Downloads. As usual, the new dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql.
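As a minimal sketch of how the SPARQL endpoint can be queried over HTTP (the query below is only illustrative, and the helper function is an assumption, not part of the DBpedia tooling):

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: English abstracts of ten resources typed as dbo:Country.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?thing ?abstract WHERE {
  ?thing a dbo:Country ;
         dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
} LIMIT 10
"""

def build_request_url(query, fmt="application/sparql-results+json"):
    """Build a GET request URL for a SPARQL endpoint.

    Standard SPARQL endpoints accept the query as a URL parameter;
    the result format parameter is endpoint-specific.
    """
    return ENDPOINT + "?" + urlencode({"query": query, "format": fmt})
```

Fetching the resulting URL with any HTTP client returns the query results in the requested serialization.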

More information about DBpedia can be found at http://dbpedia.org/About, in the new overview article about the project, or in our DBpedia two-pager.

Lots of thanks to

  1. Daniel Fleischhacker (University of Mannheim) and Volha Bryl (University of Mannheim) for improving the DBpedia extraction framework, for extracting the DBpedia 2014 data sets for all 125 languages, for generating the updated RDF links to external data sets, and for generating the statistics about the new release.
  2. All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  3.  The whole DBpedia Internationalization Committee for pushing the DBpedia internationalization forward.
  4. Dimitris Kontokostas (University of Leipzig) for improving the DBpedia extraction framework and loading the new release onto the DBpedia download server in Leipzig.
  5. Heiko Paulheim (University of Mannheim) for re-running his algorithm to generate additional type statements for formerly untyped resources and to identify and remove wrong statements.
  6. Petar Ristoski (University of Mannheim) for generating the updated links pointing at the GADM database of Global Administrative Areas. Petar will also generate an updated release of DBpedia as Tables soon.
  7. Aldo Gangemi (LIPN University, France & ISTC-CNR, Italy) for providing the links from DOLCE to DBpedia ontology.
  8.  Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink Software) for loading the new data set into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.
  9.  OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.
  10. Michael Moore (University of Waterloo, as an intern at the University of Mannheim) for implementing the anchor text extractor and contributing to the statistics scripts.
  11. Ali Ismayilov (University of Bonn) for implementing Wikidata extraction, on which the interlanguage link generation was based.
  12. Gaurav Vaidya (University of Colorado Boulder) for implementing and running Wikimedia Commons extraction.
  13. Andrea Di Menna, Jona Christopher Sahnwaldt, Julien Cojan, Julien Plu, Nilesh Chakraborty and others who contributed improvements to the DBpedia extraction framework via the source code repository on GitHub.
  14. All GSoC mentors and students for working directly or indirectly on this release: https://github.com/dbpedia/extraction-framework/graphs/contributors

The work on the DBpedia 2014 release was financially supported by the European Commission through the project LOD2 - Creating Knowledge out of Linked Data (http://lod2.eu/).

Have fun with the new knowledge base!

Daniel Fleischhacker, Volha Bryl, and Christian Bizer

]]>
Research Chris Volha Projects Topics - Linked Data Publications
news-1018 Thu, 04 Sep 2014 10:28:00 +0000 Third Semester Kickoff BBQ https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/third-semester-kickoff-bbq/ English Version

Accompanied by windy late summer weather, the third semester kickoff barbecue of the Research Group Data and Web Science took place last Wednesday (September 3, 2014).

In addition to the researchers and staff of the group, the most excellent students from our lectures last semester were invited to join the event.

Alongside grilled food and beer, the new joint teaching concept of our group was briefly introduced and an overview of our new courses Decision Support, Data Mining and Matrices, Text Analytics, and Web Data Integration was given.

Throughout the evening, many interesting topics were discussed and some cornerstones for future theses were laid.

We thank all the participants for coming and wish especially our students a good and successful start to the new semester.


]]>
Group Other
news-993 Mon, 01 Sep 2014 10:28:00 +0000 "Most Valuable Paper" at NLP Unshared Task in PoliInformatics 2014 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/most-valuable-paper-at-nlp-unshared-task-in-poliinformatics-2014/ The paper "Estimating Central Bank Preferences" by Nicole Rae Baerg, Will Lowe, Simone Paolo Ponzetto, Heiner Stuckenschmidt, and Caecilia Zirn has been elected the most valuable contribution to the Unshared Task on "Understanding the Financial Crisis". The paper is part of the joint work in the SFB 884.

]]>
Research Publications Simone
news-1016 Wed, 27 Aug 2014 11:37:00 +0000 Framework for the Distributed Processing of Large Web Crawls released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/framework-for-the-distributed-processing-of-large-web-crawls-released/ Hi all,

We are happy to announce the release of the WDC Extraction Framework, which is used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation. The framework provides an easy-to-use basis for the distributed processing of large web crawls using Amazon Cloud Services. The framework is published under the terms of the Apache license and can easily be customized to perform different data extraction tasks.

More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize the framework for your extractions can be found at

http://webdatacommons.org/framework

We encourage all interested parties to make use of the framework and also to contribute their own improvements.

Best Regards,

Robert, Hannes, Oliver, Petar and Chris

]]>
Research Projects Chris
news-1012 Mon, 25 Aug 2014 07:53:00 +0000 Paper accepted at International Conference on Information and Knowledge Management 2014 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-international-conference-on-information-and-knowledge-management-2014/ We are happy to announce that the paper "Focused Crawling for Structured Data" by Robert Meusel (DWS), Peter Mika (Yahoo Labs) and Roi Blanco (Yahoo Labs) has been accepted at CIKM'14. CIKM is a top-tier conference sponsored by ACM in the areas of Information Retrieval, Knowledge Management and Databases, bringing together leading researchers and practitioners from the three communities to identify challenging problems facing the development of future knowledge and information systems, and to shape future research directions through the publication of high-quality applied and theoretical research findings.

Abstract of the paper:

The Web is rapidly transforming from a pure document collection to the largest connected public data space. Semantic annotations of web pages make it notably easier to extract and reuse data and are increasingly used by both search engines and social media sites to provide better search experiences through rich snippets, faceted search, task completion, etc. In our work, we study the novel problem of crawling structured data embedded inside HTML pages. We describe Anthelion, the first focused crawler addressing this task. We propose new methods of focused crawling specifically designed for collecting data-rich pages with greater efficiency. In particular, we propose a novel combination of online learning and bandit-based explore/exploit approaches to predict data-rich web pages based on the context of the page as well as using feedback from the extraction of metadata from previously seen pages. We show that these techniques significantly outperform state-of-the-art approaches for focused crawling, measured as the ratio of relevant pages and non-relevant pages collected within a given budget. 
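The bandit-based explore/exploit idea from the abstract can be sketched in a few lines. This is not Anthelion's actual implementation: the function is a hypothetical epsilon-greedy selector, and `score` stands in for the online classifier's prediction that a candidate page is data-rich:

```python
import random

def pick_next(frontier, score, epsilon=0.1, rng=random):
    """Epsilon-greedy selection of the next URL to crawl.

    frontier: list of candidate URLs.
    score: callable returning the classifier's estimate that a URL
    leads to a page with embedded structured data.
    With probability epsilon we explore a random candidate (gathering
    feedback to train the classifier); otherwise we exploit the
    highest-scoring one.
    """
    if rng.random() < epsilon:
        return rng.choice(frontier)
    return max(frontier, key=score)
```

After each fetch, the extraction result (data-rich or not) can be fed back to update the classifier, closing the online-learning loop described in the abstract.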

]]>
Research Publications
news-1008 Wed, 13 Aug 2014 14:34:00 +0000 Large Hyperlink Graph from April 2014 Web Crawl published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/large-hyperlink-graph-from-april-2014-web-crawl-published/ The DWS team is happy to announce the publication of a second large hyperlink graph. The graph has been extracted by Robert Meusel from the April 2014 Common Crawl web corpus and covers 1.7 billion web pages and 64 billion hyperlinks between these pages. 

The graph can be downloaded in various formats from http://webdatacommons.org/hyperlinkgraph/2014-04/download.html

We provide initial statistics about the topology of the graph at http://webdatacommons.org/hyperlinkgraph/2014-04/topology.html

We hope that the graph will be useful for researchers who develop

  • Search algorithms that rank results based on the hyperlinks between pages.
  • Spam detection methods which identify networks of web pages that are published in order to trick search engines.
  • Graph analysis algorithms, using the hyperlink graph to test the scalability and performance of their tools.
  • Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.

We want to thank the Common Crawl Foundation for providing their great web crawls and thus enabling the creation of the WDC Hyperlink Graph.

The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by Amazon Web Services.  

]]>
Research Publications Other Chris
news-1005 Wed, 13 Aug 2014 08:36:00 +0000 Version 1.4 of RapidMiner Linked Open Data extension released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/version-14-of-rapidminer-linked-open-data-extension-released/ The Data and Web Science group has released version 1.4 of the RapidMiner Linked Open Data extension, which allows for using Linked Open Data in the RapidMiner platform both as background knowledge for a given data mining task, as well as for carrying out analyses on LOD as such. Possible use cases include, but are not limited to:

  • Error detection in LOD
  • Schema matching for LOD
  • Using background knowledge from LOD to build better predictive and descriptive models


Features added to the latest release include:

  • Automatic following of links in the LOD cloud
  • Support for RDF data cubes
  • New propositionalization algorithms, as described in this paper
  • Feature selection algorithms optimized for LOD, as described in this paper
  • More fine-grained control of data generation operators


A manual, installation instructions, and example processes can be obtained on the extension's website.

]]>
Research Projects
news-995 Fri, 01 Aug 2014 08:14:00 +0000 Paper accepted at the International Journal on Semantic Web and Information Systems https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-the-international-journal-on-semantic-web-and-information-systems/ We are happy to announce that a paper got accepted for publication at the International Journal on Semantic Web and Information Systems (IJSWIS), and will be published in a special issue on "Web Data Quality".

Improving the Quality of Linked Data Using Statistical Distributions (Heiko Paulheim and Christian Bizer).

Linked Data on the Web is either created from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets: SDType adds missing type statements, and SDValidate identifies faulty statements. Neither of the algorithms uses external knowledge, i.e., they operate only on the data itself. We evaluate the algorithms on the DBpedia and NELL knowledge bases, showing that they are both accurate as well as scalable. Both algorithms have been used for building the DBpedia 3.9 release: With SDType, 3.4 million missing type statements have been added, while using SDValidate, 13,000 erroneous RDF statements have been removed from the knowledge base.
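The SDType intuition can be illustrated with a minimal sketch, assuming hypothetical property and type names; the real algorithm additionally weights properties by how discriminative their type distributions are:

```python
from collections import defaultdict

def sd_type(resource_properties, type_dist):
    """Score candidate types for a resource from the statistical type
    distributions of the properties it uses (sketch of the SDType idea).

    type_dist: property -> {type: probability that a subject using this
    property has that type}, estimated from the dataset itself, so no
    external knowledge is required.
    """
    scores = defaultdict(float)
    for prop in resource_properties:
        for t, p in type_dist.get(prop, {}).items():
            scores[t] += p
    n = len(resource_properties)
    # Average the votes so that resources with many properties are
    # comparable to resources with few.
    return {t: s / n for t, s in scores.items()}
```

Types whose averaged score exceeds a threshold would then be added as missing type statements.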

]]>
Publications Research Chris
news-1007 Fri, 01 Aug 2014 08:14:00 +0000 Three Papers accepted at ISWC 2014 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/three-papers-accepted-at-iswc-2014/ We are happy to announce that three papers have been accepted at the 13th International Semantic Web Conference (ISWC 2014), held in Riva del Garda, Italy. ISWC is the premier international forum for the Semantic Web / Linked Data community.

Abstracts of the accepted papers: 

  • Detecting Errors in Numerical Linked Data using Cross-Checked Outlier Detection (Daniel Fleischhacker, Heiko Paulheim, Volha Bryl, Johanna Völker and Christian Bizer): Outlier detection used for identifying wrong values in data is typically applied to single datasets to search them for values of unexpected behavior. In this work, we instead propose an approach which combines the outcomes of two independent outlier detection runs to get a more reliable result and to also prevent problems arising from natural outliers which are exceptional values in the dataset but nevertheless correct. Linked Data is especially suited for the application of such an idea, since it provides large amounts of data enriched with hierarchical information and also contains explicit links between instances. In a first step, we apply outlier detection methods to the property values extracted from a single repository, using a novel approach for splitting the data into relevant subsets. For the second step, we exploit owl:sameAs links for the instances to get additional property values and perform a second outlier detection on these values. Doing so allows us to confirm or reject the assessment of a wrong value. Experiments on the DBpedia and NELL datasets demonstrate the feasibility of our approach.
  • Adoption of Linked Data Best Practices in Different Topical Domains (Max Schmachtenberg, Heiko Paulheim and Christian Bizer)
    The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying to a set of best practices in the areas of linking, vocabulary usage, and metadata provision. In 2011, the "State of the LOD Cloud" report analyzed the adoption of these best practices by linked datasets within different topical domains. The report was based on information that was provided by the dataset publishers themselves via the datahub.io Linked Data catalog. In this paper, we revisit and update the findings of the 2011 "State of the LOD Cloud" report based on a crawl of the Web of Linked Data conducted in April 2014. We analyze how the adoption of the different best practices has changed and present an overview of the linkage relationships between datasets in the form of an updated LOD cloud diagram, this time not based on information from dataset providers, but on data that can actually be retrieved by a Linked Data crawler. Among others, we find that the number of linked datasets has approximately doubled between 2011 and 2014, that there is increased agreement on common vocabularies for describing certain types of entities, and that provenance and license metadata is still rarely provided by the data sources.
  • The WebDataCommons Microdata, RDFa and Microformat Dataset Series (Robert Meusel, Petar Petrovski and Christian Bizer)
    In order to support web applications to understand the content of HTML pages an increasing number of websites have started to annotate structured data within their pages using markup formats such as Microdata, RDFa, Microformats. The annotations are used by Google, Yahoo!, Yandex, Bing and Facebook to enrich search results and to display entity descriptions within their applications. In this paper, we present a series of publicly accessible Microdata, RDFa, Microformats datasets that we have extracted from three large web corpora dating from 2010, 2012 and 2013. Altogether, the datasets consist of almost 30 billion RDF quads. The most recent of the datasets contains amongst other data over 211 million product descriptions, 54 million reviews and 125 million postal addresses originating from thousands of websites. The availability of the datasets lays the foundation for further research on integrating and cleansing the data as well as for exploring its utility within different application contexts. As the dataset series covers four years, it can also be used to analyze the evolution of the adoption of the markup formats.

]]>
Publications Research
news-1003 Thu, 24 Jul 2014 14:17:00 +0000 JudaicaLink released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/judaicalink-released/ Data extractions from two encyclopediae from the domain of Jewish culture and history have been released as Linked Open Data within our JudaicaLink project.

JudaicaLink now provides access to 22,808 concepts in English (~ 10%) and Russian (~ 90%), mostly locations and persons. 

See here for further information: http://www.judaicalink.org/blog/kai-eckert/encyclopedia-russian-jewry-released-updates-yivo-encyclopedia 

]]>
Projects
news-1000 Wed, 16 Jul 2014 08:52:00 +0000 InFoLiS II DFG project granted https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/infolis-ii-dfg-project-granted/ We are happy to announce that the DFG will fund the project InFoLiS II: Integration von Forschungsdaten und Literatur (integration of research data and literature). The project is conducted by the DWS group in cooperation with the Mannheim University Library and GESIS - Leibniz Institute for the Social Sciences. It will focus on the identification of references to research data in publications in different languages and domains and will develop a sustainable LOD infrastructure for publishing the references. For more information about the project, please contact Kai Eckert.

]]>
Projects Research Kai
news-998 Wed, 09 Jul 2014 15:46:00 +0000 Paper accepted at Discovery Science 2014 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-discovery-science-2014/ We are happy to announce that our paper "Feature Selection in Hierarchical Feature Spaces" is accepted at the Discovery Science Conference. The scope of the conference includes the development and analysis of methods for discovering scientific knowledge, coming from machine learning, data mining, and intelligent data analysis, as well as their application in various scientific domains.

 

Abstract of the paper:

Feature selection is an important preprocessing step in data mining, which has an impact on both the runtime and the result quality of the subsequent processing steps. While there are many cases where hierarchic relations between features exist, most existing feature selection approaches are not capable of exploiting those relations. In this paper, we introduce a method for feature selection in hierarchical feature spaces. The method first eliminates redundant features along paths in the hierarchy, and further prunes the resulting feature set based on the features' relevance. We show that our method yields a good trade-off between feature space compression and classification accuracy, and outperforms both standard approaches as well as other approaches which also exploit hierarchies.
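The two-stage method described in the abstract can be sketched as follows. This is a toy illustration, not the paper's implementation: redundancy along a path is approximated as a child whose value vector duplicates its parent's, and `relevance` is an externally supplied score:

```python
def select_features(features, parent, values, relevance, min_relevance=0.1):
    """Two-stage feature selection in a hierarchical feature space.

    features: list of feature names.
    parent: dict mapping a child feature to its parent in the hierarchy.
    values: dict mapping each feature to a tuple of 0/1 values over all
    instances.
    relevance: dict mapping each feature to a relevance score.
    Stage 1 drops features that are redundant along a hierarchy path
    (identical value vector to their parent); stage 2 prunes the
    remaining features by relevance.
    """
    kept = [f for f in features
            if parent.get(f) is None or values[f] != values[parent[f]]]
    return [f for f in kept if relevance[f] >= min_relevance]
```

In a hierarchy like animal > dog, a "dog" feature that fires on exactly the same instances as "animal" carries no extra information and is removed in the first stage.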

]]>
Research Publications
news-984 Wed, 18 Jun 2014 11:16:00 +0000 Hiking and Wine Tasting Excursion https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/hiking-and-wine-tasting-excursion/ On Tuesday, 10 June 2014, the DWS group took a day off together in the Pfalz region. We hiked from Bad Dürkheim to the Klosterruine Limburg and back. Afterwards, we refreshed ourselves with a wine tasting in a cool wine cellar.

]]>
Other
news-976 Fri, 06 Jun 2014 08:25:00 +0000 Three awards in LOD-enabled Recommender Systems Challenge at ESWC 2014 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/three-awards-in-lod-enabled-recommender-systems-challenge-at-eswc-2014/ A joint team of Petar Ristoski and Heiko Paulheim from the DWS group and Eneldo Loza Mencía from the Knowledge Engineering Group at the Technical University of Darmstadt has won three (out of four) awards at the Linked Open Data-enabled Recommender Systems challenge, held at the 11th Extended Semantic Web Conference:

  • Best performing approach for task "Rating prediction in cold-start situations"
  • Best performing approach for task "Diversity"
  • Best performing approach overall

The proposed recommender system uses different features from DBpedia and RDF Book Mashup to create book recommendations with a variety of strategies, which are combined using stacking and rank aggregation.

]]>
Projects Research Publications
news-961 Wed, 21 May 2014 11:44:00 +0000 JOIN-T DFG Project with TU Darmstadt https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/join-t-dfg-project-with-tu-darmstadt/ The DFG will fund a 3-year project - JOIN-T (Joining Ontologies and semantics INduced from Text), with an overall budget of approx. 500,000 EUR - jointly led by Chris Biemann (TU Darmstadt) and Simone Paolo Ponzetto (University of Mannheim). The project will focus on combining ontological information from large-scale knowledge bases (like Freebase, YAGO or DBpedia) with distributional semantic information encoded within large (i.e. web-scale) amounts of text. Stay tuned for a soon-to-come project webpage and publications!

]]>
Research Projects Simone
news-952 Mon, 12 May 2014 14:36:00 +0000 Paper accepted at ACM Web Science 2014 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-acm-web-science-2014/ We are happy to announce that our paper "Graph Structure in the Web - Aggregated by Pay-Level Domain" has been accepted at the ACM Web Science 2014 conference. The conference brings together researchers from computer science with researchers from the physical and social sciences to complement each other in understanding how the Web affects our interactions and behaviors.


Abstract of the paper:

Previous research on the overall graph structure of the World Wide Web mostly focused on the page level, meaning that the graph that directly results from hyperlinks between individual web pages was analyzed. This paper aims to provide additional insights about the macroscopic structure of the World Wide Web by analyzing an aggregated version of a recent web graph. The graph covers over 3.5 billion web pages and 128 billion hyperlinks between pages. It was crawled in the first half of 2012. We aggregate this graph by pay-level domain (PLD), meaning that all pages that belong to the same pay-level domain are represented by a single node and that an arc exists between two nodes if there is at least one hyperlink between pages of the corresponding pay-level domains. The resulting PLD graph covers 43 million PLDs and contains 623 million arcs between PLDs. Analyzing this aggregated graph allows us to present findings about linkage patterns between complete websites and not only individual HTML pages. In this paper, we present basic statistics about the PLD graph, such as degree distributions, top-ranked PLDs, distances and diameter. We analyze whether the bow-tie structure introduced by Broder et al. can also be identified in our PLD graph and reveal a backbone of highly interlinked websites within the graph. We group the websites by top-level domain and report findings about the overall linkage within and between different top-level domains. In a last experiment, we use data from the Open Directory Project (DMOZ) to categorize websites by topic and report findings about linkage patterns between websites belonging to different topical categories.
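The aggregation step described in the abstract can be sketched in a few lines. This is a simplification under stated assumptions: the pay-level domain is approximated by the last two host labels (a real implementation consults the Public Suffix List), and intra-PLD links are dropped:

```python
from urllib.parse import urlparse

def pld(url):
    """Approximate the pay-level domain by the last two host labels."""
    host = urlparse(url).netloc
    return ".".join(host.split(".")[-2:])

def aggregate_by_pld(page_edges):
    """Collapse a page-level hyperlink graph into a PLD graph:
    one node per pay-level domain, and one arc between two PLDs if at
    least one page-level hyperlink exists between them."""
    return {(pld(src), pld(dst))
            for src, dst in page_edges
            if pld(src) != pld(dst)}
```

Because many page-level links collapse onto a single PLD arc, the 128 billion page-level hyperlinks reduce to 623 million arcs in the aggregated graph.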

]]>
Research Publications
news-951 Wed, 07 May 2014 13:51:00 +0000 New DFG Project at the Knowledge Representation and Reasoning Group https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/new-dfg-project-at-the-knowledge-representation-and-reasoning-goup/ The DFG funds a new research project "Matching Representations at different levels of granularity" with approximately EUR 270,000 over the next three years. The project will continue previous work on ontology matching by extending it to the problem of process matching and to the investigation of specific problems that arise from models with different levels of detail.

]]>
Heiner Research Research - Reasoning and Learning Christian Projects
news-937 Thu, 10 Apr 2014 10:44:00 +0000 Paper accepted at ACL SRW https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-acl-srw/ Caecilia's paper on "Analyzing Positions and Topics in Political Discussions of the German Bundestag" has been accepted at the ACL SRW 2014, the premier doctoral consortium for graduate students working in the field of Natural Language Processing.

]]>
Publications Research
news-933 Fri, 04 Apr 2014 13:29:00 +0000 Paper accepted at NLDB'2014 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/paper-accepted-at-nldb2014/ Arnab and Michael will present their most recent work on "Entity Linking for Open Information Extraction" at the 19th International Conference on Application of Natural Language to Information Systems (NLDB'2014).

]]>
Publications Research
news-928 Wed, 02 Apr 2014 08:53:00 +0000 RDFa, Microdata and Microformat corpus published containing data from 1.7 million websites https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/rdfa-microdata-and-microformat-corpus-published-containing-data-from-17-million-websites/ We are happy to announce a new release of the WebDataCommons RDFa, Microdata, and Microformat data sets.

The data sets have been extracted from the November 2013 version of the Common Crawl covering 2.24 billion HTML pages which originate from 12.8 million websites (pay-level-domains).

Altogether we discovered structured data within 585 million HTML pages out of the 2.24 billion pages contained in the crawl (26%). These pages originate from 1.7 million different pay-level-domains out of the 12.8 million pay-level-domains covered by the crawl (13%).

Approximately 471 thousand of these websites use RDFa, while 463 thousand websites use Microdata. Microformats are used on 1 million websites within the crawl.

Background

More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats.

The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format.
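As a rough illustration of what such an extractor looks for, the following stdlib-only sketch detects Microdata itemscope/itemtype annotations in a page. The HTML snippet is made up, and the actual WDC pipeline uses the Any23 parser library rather than a toy sniffer like this:

```python
# Toy sketch: detect Microdata annotations in an HTML page using only
# the standard library. The real extraction uses the Any23 parsers.

from html.parser import HTMLParser

class MicrodataSniffer(HTMLParser):
    """Collects the itemtype values of all itemscope elements."""
    def __init__(self):
        super().__init__()
        self.itemtypes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.itemtypes.append(attrs.get("itemtype", ""))

html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">9.99</span>
  </div>
</div>
"""

sniffer = MicrodataSniffer()
sniffer.feed(html)
print(sniffer.itemtypes)
# -> ['http://schema.org/Product', 'http://schema.org/Offer']
```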

General information about the WebDataCommons project is found at

http://webdatacommons.org/

Data Set Statistics

Basic statistics about the November 2013 RDFa, Microdata, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

http://webdatacommons.org/structureddata/2013-11/stats/stats.html

Comparing the statistics to the statistics about the August 2012 release of the data sets

http://webdatacommons.org/structureddata/2012-08/stats/stats.html

we see that the adoption of the Microdata markup syntax has strongly increased (463 thousand websites in 2013 compared to 140 thousand in 2012, even though the 2013 version of the Common Crawl covers significantly fewer websites than the 2012 version).

Looking at the adoption of different vocabularies, we see that webmasters mostly follow the recommendation by Google, Microsoft, Yandex, and Yahoo to use the schema.org vocabularies as well as their predecessors in the context of Microdata. In the context of RDFa, the most widely used vocabulary is the Open Graph Protocol recommended by Facebook.

Looking at the most frequently used classes, we see that besides navigational, blog, and CMS-related meta-information, many websites mark up e-commerce-related data (products, offers, and reviews) as well as contact information (LocalBusiness, Organization, PostalAddress).

Download

The overall size of the November 2013 RDFa, Microdata, and Microformat data sets is 17.2 billion RDF quads. For download, we split the data into 3,398 files with a total size of 332 GB.

http://webdatacommons.org/structureddata/2013-11/stats/how_to_get_the_data.html

Lots of thanks to

  • the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project.
  • the Any23 project for providing their great library of structured data parsers.
  • the LOD2 and PlanetData research projects as well as Amazon Web Services for supporting WebDataCommons.

Best,

Christian Bizer, Petar Petrovski, and Robert Meusel

]]>
Research Publications
news-924 Thu, 27 Mar 2014 08:12:00 +0000 Christian Bizer gives Invited Lecture about Search Joins at ICDT https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/christian-bizer-gives-invited-lecture-about-search-joins-at-icdt/ Prof. Christian Bizer gave an invited lecture about Search Joins with the Web at the International Conference on Database Theory (ICDT2014).

Abstract of the Lecture

The lecture will discuss the concept of Search Joins. A Search Join is a join operation which extends a local table with additional attributes based on the large corpus of structured data that is published on the Web in various formats. The challenge for Search Joins is to decide which Web tables to join with the local table in order to deliver high-quality results. Search joins are useful in various application scenarios. They allow, for example, a local table about cities to be extended with an attribute containing the average temperature of each city for manual inspection. They also allow tables to be extended with large sets of additional attributes as a basis for data mining, for instance to identify factors that might explain why the inhabitants of one city claim to be happier than the inhabitants of another.

In the talk, Christian Bizer will draw a theoretical framework for Search Joins and will highlight how recent developments in the context of Linked Data, RDFa and Microdata publishing, public data repositories as well as crowd-sourcing integration knowledge contribute to the feasibility of Search Joins in an increasing number of topical domains. 
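The city-temperature example above can be sketched minimally as follows. The tables and the naive matching on city names are illustrative only; in practice the hard parts are selecting the right web tables and resolving entity and schema heterogeneity:

```python
# Toy sketch of a Search Join: extend a local table with an attribute
# found in a "web table". All data here is made up for illustration.

local_table = [
    {"city": "Mannheim", "population": 310000},
    {"city": "Heidelberg", "population": 160000},
]

# Pretend this table was retrieved from a large corpus of web tables.
web_table = [
    {"city": "Mannheim", "avg_temp_c": 11.5},
    {"city": "Heidelberg", "avg_temp_c": 11.8},
]

# Join naively on the city name and add the new attribute.
temp_by_city = {row["city"]: row["avg_temp_c"] for row in web_table}
extended = [
    {**row, "avg_temp_c": temp_by_city.get(row["city"])}
    for row in local_table
]
print(extended[0])
# -> {'city': 'Mannheim', 'population': 310000, 'avg_temp_c': 11.5}
```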

Slides of the Lecture

Bizer: Search Joins with the Web

]]>
Research Publications
news-923 Wed, 19 Mar 2014 09:06:00 +0000 Master Thesis (Stuckenschmidt, Debus, Weiland): Look who is talking: recognizing politicians in images and videos https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/master-thesis-stuckenschmidt-debus-weiland-look-who-is-talking-recognizing-politicians-in-imag/ In political science, the way parties position themselves in the space of opinions is a central matter of investigation. As certain opinions are often strongly connected with a certain person, appearances of these persons in the media are also an indication of a political position of the whole party.

In this thesis, which is offered in cooperation with the Chair of Political Science III (Prof. Debus), the task is to support the analysis of party positioning by developing methods for automatically recognizing politicians in images and videos found on party websites. As part of the thesis, a collection of images and scenes from videos first has to be created and annotated with the expected results. The candidate has to study the literature on face recognition, then select, implement, and test a suitable method for solving the task on the created image collection.

Candidates should have programming skills (ideally in C++). Basic knowledge in image processing is an advantage.

For more information contact Lydia Weiland (Lydia(at)informatik.uni-mannheim.de).

]]>
Topics
news-920 Tue, 18 Mar 2014 11:21:00 +0000 Win attractive prizes at the Linked Data Mining Challenge https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/win-attractive-prizes-at-the-linked-data-mining-challenge/ This year's ESWC workshop on Knowledge Discovery and Data Mining Meets Linked Open Data (Know@LOD) will feature the second edition of the Linked Data Mining Challenge.

You can win attractive prizes:

  • The best submission to the predictive task wins a RapidMiner software package worth $3,000, sponsored by RapidMiner, Inc.
  • The best challenge paper wins an Amazon voucher worth EUR 500, sponsored by the EU project LOD2

All information about the challenge, the tasks, datasets, and important dates can be found here.

]]>
Research Group Projects
news-918 Mon, 17 Mar 2014 10:20:00 +0000 Two papers accepted at ESWC https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/two-papers-accepted-at-eswc/ We have two papers accepted at the forthcoming 11th European Semantic Web Conference (ESWC), the premier European conference on the Semantic Web:

  • Arnab Dutta, Christian Meilicke and Simone Paolo Ponzetto: "A Probabilistic Approach for Integrating Heterogeneous Knowledge Sources"
  • Dominik Wienand and Heiko Paulheim: "Detecting Incorrect Numerical Data in DBpedia"

]]>
Research Publications
news-914 Fri, 07 Mar 2014 16:23:00 +0000 Corpus of 147 million relational Web tables published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/corpus-of-147-million-relational-web-tables-published/ The DWS group is happy to announce the release of a corpus containing 147 million quasi-relational Web tables.

The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a fraction of the tables is also quasi-relational, meaning that they contain structured data describing a set of entities.

A corpus of Web tables can be useful for research and applications in areas such as data search, table augmentation, knowledge base construction, and for various NLP tasks.

The WDC Web Tables corpus has been extracted from the 2012 version of the Common Crawl [1], the largest Web crawl that is available to the public. The corpus contains the subset of the 11 billion HTML tables found in the Common Crawl that are likely quasi-relational.
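As a toy illustration of the layout- vs. content-table distinction, the following heuristic (not the actual WDC classifier, which is more involved) treats a table as quasi-relational if it has header cells and several rows:

```python
# Hedged sketch of a layout- vs. content-table heuristic. The real WDC
# classification is more sophisticated; this only illustrates the idea
# that quasi-relational tables have a header row and several data rows.

from html.parser import HTMLParser

class TableStats(HTMLParser):
    """Counts rows and header cells of a single HTML table."""
    def __init__(self):
        super().__init__()
        self.rows = 0
        self.header_cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1
        elif tag == "th":
            self.header_cells += 1

def looks_relational(table_html, min_rows=3):
    stats = TableStats()
    stats.feed(table_html)
    return stats.header_cells > 0 and stats.rows >= min_rows

content = ("<table><tr><th>City</th></tr>"
           "<tr><td>Mannheim</td></tr>"
           "<tr><td>Heidelberg</td></tr></table>")
layout = "<table><tr><td><img src='logo.png'></td></tr></table>"
print(looks_relational(content), looks_relational(layout))
# -> True False
```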

More information about the corpus, its application domains as well as information about how to download the corpus is found at

http://webdatacommons.org/webtables/

We want to thank the Common Crawl Foundation for providing their great web crawl and thus enabling the creation of the WDC Web Tables corpus.

The creation of the WDC Web Tables corpus was supported by the German Research Foundation (DFG), the EU FP7 project PlanetData and by Amazon Web Services. We thank our sponsors a lot.

Enjoy the new corpus!

Best regards,

Petar Ristoski, Oliver Lehmberg, Heiko Paulheim, Robert Meusel, and Christian Bizer

[1] commoncrawl.org

]]>
Publications Research
news-915 Fri, 07 Mar 2014 11:10:00 +0000 New Software Updates: Sieve, Silk, and RapidMiner Linked Open Data Extension https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/new-software-updates-sieve-silk-and-rapidminer-linked-open-data-extension/ We are happy to announce updates for three software packages developed at the DWS group:

  • LDIF 0.5.2: The latest version of LDIF (Linked Data Integration Framework) now contains a component for learning fusion policies to be used when integrating data from different sources.
  • Silk 2.6: The latest version of the link discovery framework Silk includes a new version of the Silk Workbench that offers a REST API and a plugin system. Furthermore, a plugin for preprocessing free text is now available.
  • RapidMiner Linked Open Data Extension 1.3.1: The extension to the popular Data Mining platform RapidMiner now also allows for accessing Linked Open Data sources that do not provide a SPARQL endpoint, and comes with multiple performance improvements. The extension is available via the RapidMiner marketplace from within RapidMiner.
]]>
Research Projects
news-907 Thu, 27 Feb 2014 16:16:00 +0000 LOD2 Plenary Meeting in Mannheim https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/lod2-plenary-meeting-in-mannheim/ The 4th plenary meeting of the LOD2 EU project was hosted by the Data and Web Science group on February 24-25, 2014.

The LOD2 EU project develops tools and methodologies for exposing and managing large amounts of structured data on the Web, and applies them in e-government, media and enterprise search use cases. The focus of the DWS group within LOD2 is on methods for interlinking and fusing Web data as well as for assessing the quality of Web data.

A detailed report about the meeting is given in this blog post by Orri Erling from OpenLink Software.

]]>
Research Projects
news-906 Thu, 27 Feb 2014 15:00:00 +0000 2nd Linked Data Mining Challenge https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/2nd-linked-data-mining-challenge/ This year, in conjunction with the 3rd Workshop on Knowledge Discovery and Data Mining meets Linked Open Data (Know@LOD 2014), we will hold the 2nd Linked Data Mining Challenge, co-organized by Vojtech Svatek and Jindrich Mynarz from the University of Economics in Prague, and Heiko Paulheim from the University of Mannheim.

The challenge aims at applying data mining techniques to Linked Open Data. It offers data from two different domains - public procurement and research collaborations - and seeks contributions in a predictive (regression) task, and two explorative tasks, which aim at the discovery of interesting nuggets such as descriptive patterns.

Details about the challenge are available here.

Important dates:

  • 31 March 2014: Submission deadline for predictive task results
  • 3 April 2014: Submission deadline for LDMC papers
  • 10 April 2014: Notification of acceptance for LDMC papers
  • 15 April 2014: Camera-ready versions of papers
  • 15 May 2014: Complete evaluation results (both quantitative and panel-based) published, and LDMC session schedule finalized
  • 25 May 2014: LDMC session held as part of Know@LOD
]]>
Projects Research
news-883 Thu, 27 Feb 2014 08:03:00 +0000 Mannheim Linked Open Data Meetup https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/mannheim-linked-open-data-meetup/ The Data and Web Science research group has organized the “Mannheim Linked Open Data Meetup” in order to bring together people interested in Linked Data from Southern Germany to exchange ideas, chat, and drink beer.

The meetup was a public event, half-way between a workshop and an informal gathering, meaning that the talks were followed by free drinks and snacks provided by organizers, the registration was free and everyone was invited to attend.

The meetup was co-located with the plenary meeting of the LOD2 EU project, which was hosted by the DWS group in the two following days.

We had the following invited talks at the meetup:

1. Sören Auer (University of Bonn), Overview of the LOD2 project

2. Katja Eck (Wolters Kluwer), LOD in the publishing industry

3. Martin Kaltenböck (Semantic Web Company, Vienna), Enterprise Semantics - Information Management with PoolParty

4. Mirjam Kessler, Aliaksandr Birukou (Springer), Linked Data Initiatives at Springer Verlag

5. Peter Haase (Fluid Operations, Walldorf), Linked Data Applications with the Information Workbench

6. Sebastian Hellman, Dimitris Kontokostas (University of Leipzig), DBpedia Project and Dutch DBpedia for Searching and Visualising Dutch Library Data

The meetup took place at the University of Mannheim in the building B6 26, Room A101 on Sunday evening, February 23, 18:30-22:30.

Around 50 participants attended the meetup, not only from Mannheim and Heidelberg, but from the broader area including Karlsruhe (KIT), Frankfurt (German National Library) and Cologne (GESIS). LOD2 project participants also attended the event, extending the geography to other German cities as well as to Austria, Serbia, the Czech Republic, the Netherlands and the UK.

Slides of the talks are available here.

Find out more in a blog post by Orri Erling from OpenLink Software, and at the meetup page

http://www.meetup.com/OpenKnowledgeFoundation/Mannheim-DE/1092882/

The DWS group thanks the speakers and the participants of the meetup!

]]>
Projects Research
news-896 Thu, 13 Feb 2014 07:57:00 +0000 First Open Ranking of the World Wide Web available https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/first-open-ranking-of-the-world-wide-web-available/ The Laboratory for Web Algorithmics of the Università degli studi di Milano together with the Data and Web Science Group of the University of Mannheim have put together the first entirely open ranking of more than 100 million sites of the Web. The ranking is based on classic and easily explainable centrality measures applied to a host graph, and it is entirely open — all data and all software used is publicly available. Sites are ranked using harmonic centrality, with raw indegree centrality, Katz's index, and PageRank provided for comparison.
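Harmonic centrality is indeed easy to explain and implement: a node's score is the sum of reciprocal shortest-path distances from all other nodes to it. The following stdlib-only sketch computes it on a toy directed graph by BFS over incoming paths:

```python
# Harmonic centrality of node x: H(x) = sum over y != x of 1 / d(y, x),
# where d(y, x) is the shortest-path distance from y to x (unreachable
# nodes contribute 0). Computed here by BFS on the reversed toy graph.

from collections import deque

def harmonic_centrality(succ, node):
    """succ maps each node to its list of successors (outgoing links)."""
    # Build the predecessor lists (reversed graph).
    pred = {u: [] for u in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    # BFS from `node` along incoming links gives d(y, node) for all y.
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for w in pred[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return sum(1.0 / d for n, d in dist.items() if n != node)

# b is linked from a and c directly, and from d via c.
graph = {"a": ["b"], "c": ["b"], "d": ["c"], "b": []}
print(harmonic_centrality(graph, "b"))
# -> 2.5  (1/1 + 1/1 + 1/2)
```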

More information about the web graph is available in a pre-print paper that will be presented at the World Wide Web Conference in April.

]]>
Research Projects Publications
news-890 Wed, 05 Feb 2014 16:45:00 +0000 DWS participates in SFB488 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-participates-in-sfb488/ The DWS Research Group participates in the C4 project of the SFB 488 "Political Economy of Reforms". The goal of the project is to measure party, media and voter positions, an essential task for understanding political reforms. Given that the core data on which these analyses rely consists of text, we will contribute our expertise in Natural Language Processing to develop intelligent algorithms for position analysis.

]]>
Projects Research
news-889 Wed, 05 Feb 2014 09:17:00 +0000 bw-FDM-Communities: Project on Research Data Management started https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/bw-fdm-communities-projekt-on-research-data-management-started/ The DWS Research Group participates in the project bw-FDM-Communities, which aims at defining requirements for the long-term management, and thus availability, of data used or created within the research process.

All 9 universities in Baden-Württemberg take part in this project, which will end in 2015.

For more details visit bwFDM-communities or contact Johannes Knopp.

]]>
Projects
news-888 Tue, 04 Feb 2014 11:22:00 +0000 WWW 2014 Web Science Track Paper accepted https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/www-2014-web-science-track-paper-accepted/ A paper by Robert Meusel, Sebastiano Vigna, Oliver Lehmberg and Christian Bizer analyzing the hyperlink structure of the World Wide Web was accepted for the Web Science track of the 23rd International World Wide Web Conference (WWW2014).

The Web graph that is analyzed in the paper has been extracted from the Common Crawl 2012 web corpus. The graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. 

Pre-print version of the paper:

  • Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer: Graph Structure in the Web — Revisited. 23rd International World Wide Web Conference (WWW2014), Web Science Track, Seoul, Korea, April 2014.

Public download of the analyzed Web graph:

The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by Amazon Web Services. We thank our sponsors a lot.

]]>
Publications Projects
news-887 Tue, 04 Feb 2014 11:11:00 +0000 LREC 2014 Paper accepted https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/lrec-2014-paper-accepted/ A paper by our student Gregor Titze, Volha and Simone on discovering topics in Wikipedia categories was accepted as poster for the 9th Edition of the Language Resources and Evaluation Conference (LREC).

]]>
Volha Research Publications
news-885 Tue, 04 Feb 2014 09:45:00 +0000 ACM WSDM 2014 Paper accepted https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/acm-wsdm-2014-paper-accepted/ Michael and Simone will present their latest work on knowledge-rich, graph-based document content modeling and semantic similarity at the 7th ACM Conference on Web Search and Data Mining, one of the top-tier publication venues for research in search and data mining.

]]>
Research Publications Simone
news-882 Tue, 04 Feb 2014 08:49:00 +0000 Lectures offered in the FSS2014 Semester https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/lectures-offered-in-the-fss2014-semester/ The Data and Web Science research group is offering the following lectures for Master students in the upcoming FSS2014 semester:

For PhD students and students writing their master thesis, we are offering:

  • DWS Colloquium (Prof. Stuckenschmidt, Prof. Bizer, Prof. Ponzetto)

An overview of the content and organization of the lectures is given in the following slide set:

]]>
Other
news-845 Tue, 26 Nov 2013 13:22:00 +0000 7th Linked Data on the Web Workshop at WWW2014 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/7th-linked-data-on-the-web-workshop-at-www2014/ Together with Sir Tim Berners-Lee (W3C/MIT, USA), Tom Heath (Open Data Institute, UK) and Sören Auer (University of Bonn and Fraunhofer IAIS, Germany), Christian Bizer is organizing the 7th Linked Data on the Web Workshop (LDOW2014) at the 23rd World Wide Web Conference (WWW2014) in Seoul, Korea.

Linked Data is a set of best practices for publishing structured data on the Web which focuses on identifying data items with URIs and setting hyperlinks between data items provided by different web servers. These hyperlinks connect the data from all servers into a global data graph - the Web of Linked Data.

The 7th Workshop on Linked Data on the Web (LDOW2014) aims to stimulate discussion and further research into the challenges of publishing, integrating and consuming Linked Data, as well as evaluating and mining knowledge from the global Web of Linked Data. The challenges associated with Linked Data management range from lower-level technical issues such as large-scale data processing, quality assessment and mining, to higher-level conceptual questions of value propositions and business models. LDOW2014 will provide a forum for exposing novel, high-quality research and applications in all of these areas. By bringing together researchers in the field, the workshop will further shape the ongoing Linked Data research agenda.

Important Dates

  • Submission deadline: 16 February, 2014 
  • Notification of acceptance: 3 March, 2014
  • Camera-ready versions of accepted papers: 16 March, 2014
  • Workshop date: 8 April, 2014

Topics of Interest

Topics of interest for the workshop include, but are not limited to, the following:

Mining the Web of Linked Data

  • large-scale approaches to deriving implicit knowledge from the Web of Linked Data
  • using the Web of Linked Data as background knowledge in data mining applications

Linking and Fusion

  • linking algorithms and heuristics, identity resolution
  • increasing the value of Schema.org and OpenGraphProtocol data through linking
  • Web data integration and fusion
  • performance of linking infrastructures/algorithms on Web data

Quality, Trust, Provenance and Licensing in Linked Data

  • profiling and change tracking of Linked Data sources
  • tracking provenance and usage of Linked Data
  • evaluating quality and trustworthiness of Linked Data
  • licensing issues in Linked Data publishing

Linked Data Applications and Business Models

  • Linked Data browsers and search engines
  • Linked Data as pay-as-you-go data integration technology within corporate contexts
  • marketplaces, aggregators and indexes for Linked Data
  • interface and interaction paradigms for Linked Data applications
  • business models for Linked Data publishing and consumption
  • Linked Data applications for life-sciences, digital humanities, social sciences etc.

More information about the workshop is found at http://events.linkeddata.org/ldow2014/

]]>
Projects
news-837 Tue, 12 Nov 2013 13:54:00 +0000 Largest public Hyperlink Graph published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/largest-public-hyperlink-graph-published/ The DWS group is happy to announce the publication of a large hyperlink graph covering 3.5 billion web pages and 128 billion hyperlinks between these pages.

The graph has been extracted by Robert Meusel and Oliver Lehmberg from the Common Crawl 2012 web corpus.

To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft.

The graph can be downloaded in various formats from

http://webdatacommons.org/hyperlinkgraph

We provide initial statistics about the topology of the graph at

http://webdatacommons.org/hyperlinkgraph/topology.html

We hope that the graph will be useful for researchers who develop

  • search algorithms that rank results based on the hyperlinks between pages,
  • spam detection methods that identify networks of web pages published in order to trick search engines, and
  • graph analysis algorithms that can use the hyperlink graph for testing the scalability and performance of their tools,

as well as for Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.
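As a small example of working with such a graph, out-degree statistics can be computed by streaming over an edge list. The two-column "src dst" format used below is an assumption for illustration; check the download page for the actual file formats provided:

```python
# Sketch: compute an out-degree distribution from a streamed edge list.
# The whitespace-separated "src dst" line format is an assumption for
# illustration, not necessarily the published format.

from collections import Counter
import io

# Stand-in for an open edge-list file.
edge_list = io.StringIO("0 1\n0 2\n1 2\n3 2\n")

out_degree = Counter()
for line in edge_list:
    src, _dst = line.split()
    out_degree[src] += 1

# Distribution: how many nodes have out-degree k.
distribution = Counter(out_degree.values())
print(dict(distribution))
# -> {2: 1, 1: 2}
```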

We want to thank the Common Crawl project for providing their great web crawl and thus enabling the creation of the WDC Hyperlink Graph.

The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by Amazon Web Services. We thank our sponsors a lot.

]]>
Projects
news-836 Thu, 07 Nov 2013 14:30:00 +0000 Student Assistant for Linked Data Publishing https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/student-assistant-for-linked-data-publishing/ We are looking for a student assistant to support us in the creation of Linked Data representations of various datasets. An example is our JudaicaLink project, where we provide Linked Data access to encyclopediae from the domain of Jewish culture and history. To this end, structured information has to be extracted from the original data and transformed into RDF. The RDF data is then provided on the Web as Linked Data.

In the context of this work, several approaches for the extraction of information and the interlinking of the resulting datasets can be developed, applied, and improved, depending on your interests and own ideas. Subsequently writing a Bachelor's or Master's thesis in this area is also possible.
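To give a flavor of the "transform into RDF" step, here is a minimal sketch that emits an N-Triples statement for an extracted entry. The URI and the entry data are made up for illustration, and a real pipeline would typically use a library such as rdflib:

```python
# Minimal sketch of emitting RDF as N-Triples for an extracted entry.
# The subject URI and entry data below are hypothetical examples.

def ntriple(subject, predicate, obj, literal=False):
    """Serialize one triple as an N-Triples line."""
    if literal:
        # Escape backslashes and quotes per the N-Triples grammar.
        obj = obj.replace("\\", "\\\\").replace('"', '\\"')
        return f'<{subject}> <{predicate}> "{obj}" .'
    return f"<{subject}> <{predicate}> <{obj}> ."

entry = {"id": "http://example.org/judaicalink/Mannheim",
         "label": "Mannheim"}

triples = [
    ntriple(entry["id"],
            "http://www.w3.org/2000/01/rdf-schema#label",
            entry["label"], literal=True),
]
print(triples[0])
# -> <http://example.org/judaicalink/Mannheim> <http://www.w3.org/2000/01/rdf-schema#label> "Mannheim" .
```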

Requirements:

  • Programming skills in Java,
  • Experience with the development of Web applications (HTTP, REST, Web-Services),
  • Experience or genuine interest in the data and the domain of cultural heritage and digital humanities is a plus.

 Contact: Kai Eckert

]]>
Open Positions - Hiwis Kai
news-835 Thu, 07 Nov 2013 14:05:00 +0000 Student Assistant for Web-Scale Knowledge Harvesting https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/student-assistant-for-web-scale-knowledge-harvesting/ We are looking for a student assistant to develop and conduct experiments regarding the creation of fact databases from the Common Crawl, a huge dataset of crawled web pages. 

Requirements are:

  • Solid programming skills in Java,
  • Experience with Hadoop and/or work with very large datasets (or the willingness to learn it),
  • Previous knowledge of natural language processing is a plus.

Contact: Kai Eckert or Simone Ponzetto.

Note: We also provide topics for a Master thesis in the same area.

]]>
Open Positions Open Positions - Hiwis Kai
news-832 Tue, 05 Nov 2013 09:36:00 +0000 Best Poster Award at ISWC 2013 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/best-poster-award-at-iswc-2013/ Heiko Paulheim from the DWS group and Sven Hertling from the Knowledge Engineering group at TU Darmstadt have won the best poster award at the International Semantic Web Conference 2013 for their joint work "Discoverability of SPARQL Endpoints in Linked Open Data".

The work provides an experimental analysis of different approaches for automatically discovering a SPARQL endpoint, given the URI of a resource. The results show that mechanisms such as VoID and Provenance are still not adopted widely enough for creating a reliable service, and that catalogues such as DataHub are often outdated and may contain unreliable information.

With the best evaluation paper award going to Carlos Buil-Aranda et al. for their work "SPARQL Web-Querying Infrastructure: Ready for Action?", a major trend at this year's ISWC was a critical review of existing infrastructures and technologies, which is an important step for the semantic web to become adopted on a large scale.

]]>
Group Projects Publications
news-826 Thu, 24 Oct 2013 22:31:00 +0000 Winners of the Semantic Statistics Challenge https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/winners-of-the-semantic-statistics-challenge/ Petar Ristoski and Heiko Paulheim have won the first prize at the Semantic Statistics Challenge, held at the SemStats workshop 2013 in conjunction with the International Semantic Web Conference.

The task of the challenge was to analyze a given dataset of unemployment data in France and Australia, using Linked Open Data. In their solution, Petar and Heiko used background knowledge from Linked Open Data sources such as DBpedia, Eurostat, GADM, and LinkedGeoData to create advanced visualizations as well as to discover patterns that explain the unemployment data.

While working on their challenge contribution, Petar and Heiko created various assets that have been used in other projects: a new linkset between DBpedia and GADM was created, which became part of the DBpedia 3.9 release, and various linking algorithms were developed, which are now available as operators in the RapidMiner Linked Open Data extension.

Slides of Heiko's presentation of the challenge contribution

]]>
Projects Publications Group Other
news-825 Tue, 22 Oct 2013 07:43:00 +0000 DWS Group receives Amazon Machine Learning Research Grant https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-group-receives-amazon-machine-learning-research-grant/ We are happy to announce that the DWS Group was awarded an Amazon Machine Learning Research Grant.

With its machine learning grant program, Amazon supports selected research projects in machine learning and big data analysis with free access to the Amazon Web Services (AWS) infrastructure in order to enable them to use large numbers of machines in the cloud. The grants focus on supporting novel applications in the area of distributed data transformation, feature discovery and feature selection, large-scale and/or online classification, regression, recommendation and clustering learning as well as structure discovery. 

The DWS group was awarded a US$ 20,000 grant, which translates into approximately 100,000 to 150,000 machine hours on the AWS cloud, depending on the size of the machines we use.

We will use the grant to continue our work on mining the Common Crawl, the largest Web crawl currently available to the public (3.5 billion HTML pages from 40 million websites, 210 terabytes), within the following projects:

  • Web Data Commons: More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages, using markup standards such as RDFa, Microdata and Microformats. The Web Data Commons project extracts this data from all web pages contained in the Common Crawl and provides the data for free download, along with statistics about the deployment of the different formats.
  • Hyperlink Graph Extraction: Web pages are connected by hyperlinks, which make them part of a large global hyperlink graph. Public research on the structure of this global graph died down around 10 years ago, as the research community outside Google and Microsoft no longer had access to representative Web graphs. Within the Hyperlink Graph Extraction project, we extract a large Web graph from the Common Crawl and will provide this graph for free download. The page-level graph consists of 3.5 billion nodes and 128 billion edges (hyperlinks) and is thus the largest real-world graph available to the public. We will publish the graph in November 2013, along with an initial analysis of its structure.
  • Web Tables Extraction: Large parts of the world's knowledge are contained in HTML tables, which are thus very valuable for question answering and for building comprehensive knowledge bases (see here). There are hardly any large-scale HTML-table extraction projects outside Google and Microsoft. We are currently working on extracting all HTML tables from the Common Crawl and classifying them into content and layout tables. If we are successful, we will publish a large corpus of content tables, likely in spring/summer 2014.
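The markup-extraction step described in the Web Data Commons bullet can be sketched in a few lines of standard-library Python (illustrative only, not the project's actual extraction framework): an HTML parser that records the itemtype of every Microdata itemscope element it encounters. The class name and sample page are made up for this sketch.

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Collects the itemtype of every itemscope element in a page."""

    def __init__(self):
        super().__init__()
        self.itemtypes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; bare attributes
        # such as "itemscope" arrive with value None.
        attrs = dict(attrs)
        if "itemscope" in attrs and "itemtype" in attrs:
            self.itemtypes.append(attrs["itemtype"])

page = """
<html><body>
  <div itemscope itemtype="http://schema.org/Product">
    <span itemprop="name">Example Product</span>
  </div>
  <div itemscope itemtype="http://schema.org/Organization">
    <span itemprop="name">Example Corp</span>
  </div>
</body></html>
"""

parser = MicrodataParser()
parser.feed(page)
print(parser.itemtypes)
```

A production extractor additionally has to resolve nested itemscopes and collect the itemprop values, but the detection step shown here is the core of format-deployment statistics.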

We believe that by publishing these data sets, we contribute to enabling the research community to perform research on Web-scale data and to draw level with research efforts inside companies such as Google and Microsoft. We are thus very thankful to Amazon for supporting our work with this grant.


]]>
Projects Other
news-822 Wed, 16 Oct 2013 12:09:00 +0000 Heiko Paulheim gives Invited Talk at ECML/PKDD Workshop https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/heiko-paulheim-gives-invited-talk-at-ecmlpkdd-workshop/ Heiko Paulheim has given an invited talk at the first Workshop on Data Mining from Linked Data (DMoLD'13), which was held on September 23, 2013, in Prague, co-located with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), a prime scientific event in the data mining field. The workshop attracted a lot of attention in the machine learning and data mining community, with a total of around 40 participants.

In his invited talk on Exploiting Linked Open Data as Background Knowledge in Data Mining, Heiko presented the recently released RapidMiner LOD Extension and showed how such technology makes it possible to exploit rich background knowledge (for both horizontal and vertical enrichment of the original data) while outsourcing its maintenance to the LOD infrastructure, for example, the regular DBpedia updates via extraction from Wikipedia. Using various examples, he discussed lessons learned since the development of the first prototype, as well as current challenges, which are, among others, addressed in the DFG-funded project Mine@LOD. The audience was interested, among other things, in applying the presented approach in bioinformatics and in comparing and combining it with more traditional relational data mining approaches such as Inductive Logic Programming.

]]>
Projects Group Publications
news-808 Mon, 23 Sep 2013 07:48:00 +0000 DBpedia Knowledge Base Version 3.9 released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dbpedia-knowledge-base-version-39-released/ We are very happy to announce the release of the DBpedia Knowledge Base Version 3.9.

Knowledge bases are playing an increasingly important role in enhancing the intelligence of Web and enterprise search and in supporting information integration as well as natural language processing. Today, most knowledge bases cover only specific domains, are created by relatively small groups of knowledge engineers, and are very cost intensive to keep up-to-date as domains change. At the same time, Wikipedia has grown into one of the central knowledge sources of mankind, maintained by thousands of contributors.

The DBpedia project leverages this gigantic source of knowledge by extracting structured information from Wikipedia and by making this information accessible on the Web as a large, multilingual, cross-domain knowledge base.

The English version of the DBpedia knowledge base currently describes 4.0 million things, out of which 3.22 million are classified in a consistent ontology, including 832,000 persons, 639,000 places (including 427,000 populated places), 372,000 creative works (including 116,000 music albums, 78,000 films and 18,500 video games), 209,000 organizations (including 49,000 companies and 45,000 educational institutions), 226,000 species and 5,600 diseases.

We provide localized versions of DBpedia in 119 languages. All these versions together describe 24.9 million things, out of which 16.8 million overlap (are interlinked) with the concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 12.6 million unique things in 119 different languages; 24.6 million links to images and 27.6 million links to external web pages; 45.0 million external links into other RDF datasets, 67.0 million links to Wikipedia categories, and 41.2 million YAGO categories.

Altogether the DBpedia 3.9 release consists of 2.46 billion pieces of information (RDF triples) out of which 470 million were extracted from the English edition of Wikipedia, 1.98 billion were extracted from other language editions, and about 45 million are links to external data sets.

Detailed statistics about the DBpedia data sets in 24 popular languages are provided at Dataset Statistics.

The most important improvements of the new DBpedia release compared to DBpedia 3.8 are:

1. the new release is based on updated Wikipedia dumps dating from March / April 2013 (the 3.8 release was based on dumps from June 2012), leading to an overall increase in the number of concepts in the English edition from 3.7 to 4.0 million things.

2. the DBpedia ontology has been enlarged and the number of infobox-to-ontology mappings has risen, leading to richer and cleaner concept descriptions.

3. we extended the DBpedia type system to also cover Wikipedia articles that do not contain an infobox.

4. we provide links pointing from DBpedia concepts to Wikidata concepts and updated the links pointing at YAGO concepts and classes, making it easier to integrate knowledge from these sources.

More information about DBpedia can be found at http://dbpedia.org/About as well as in the new overview article about the project.

Lots of thanks to

  • Jona Christopher Sahnwaldt (Freelancer funded by the University of Mannheim, Germany) for improving the DBpedia extraction framework, for extracting the DBpedia 3.9 data sets for all 119 languages, and for generating the updated RDF links to external data sets.
  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • Heiko Paulheim (University of Mannheim, Germany) for inventing and implementing the algorithm to generate additional type statements for formerly untyped resources.
  • The whole Internationalization Committee for pushing the DBpedia internationalization forward.
  • Dimitris Kontokostas (University of Leipzig) for improving the DBpedia extraction framework and loading the new release onto the DBpedia download server in Leipzig.
  • Volha Bryl (University of Mannheim, Germany) for generating the statistics about the new release.
  • Petar Ristoski (University of Mannheim, Germany) for generating the updated links pointing at the GADM database of Global Administrative Areas.
  • Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink Software) for loading the new data set into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.
  • OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.
  • Julien Cojan, Andrea Di Menna, Ahmed Ktob, Julien Plu, Jim Regan and others who contributed improvements to the DBpedia extraction framework via the source code repository on GitHub.

The work on the DBpedia 3.9 release was financially supported by the European Commission through the project LOD2 - Creating Knowledge out of Linked Data (http://lod2.eu/).

Have fun with the new DBpedia release!

Cheers,

Christian Bizer and Christopher Sahnwaldt

 

]]>
Projects
news-798 Fri, 13 Sep 2013 08:49:00 +0000 RapidMiner Linked Open Data Extension released https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/rapidminer-linked-open-data-extension-released/ The Data and Web Science Group has released the first version of the RapidMiner Linked Open Data extension. The extension can be downloaded from the RapidMiner marketplace.

The extension provides access to Linked Open Data within the open source data mining tool RapidMiner. Linked Open Data can be used both as input to data mining processes and for enriching existing data mining problems with background knowledge. Possible use cases include:

  • Importing data from a Linked Data source, such as Eurostat, into RapidMiner and analyzing it using RapidMiner operators
  • Adding data about population, GDP, and literacy from Eurostat to a data set of countries
  • Adding data about universities and companies to a data set of cities
  • Adding data about turnover and number of employees to a data set of companies

The project web site contains detailed workflow examples, including links to myExperiment.
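The enrichment use cases above boil down to joining a local data set with background-knowledge attributes fetched from Linked Open Data. A minimal Python sketch of that idea follows; all names and values are placeholders for illustration, and the actual extension works as operators inside RapidMiner, not in Python.

```python
# Local data set: a table of countries with no attributes yet.
countries = [{"country": "A"}, {"country": "B"}]

# Stand-in for attributes retrieved from a Linked Data source such as
# Eurostat; the numbers are placeholders, not real statistics.
background = {
    "A": {"population": 1_000_000, "gdp": 50_000},
    "B": {"population": 2_000_000, "gdp": 80_000},
}

def enrich(rows, knowledge):
    """Add background-knowledge attributes to each row when available."""
    return [{**row, **knowledge.get(row["country"], {})} for row in rows]

for row in enrich(countries, background):
    print(row)
```

The enriched rows can then feed any downstream mining step, which is exactly the "background knowledge as extra features" pattern the extension supports.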

]]>
Projects
news-797 Wed, 11 Sep 2013 06:22:00 +0000 Semester Kickoff BBQ https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/semester-kickoff-bbq-3/

Accompanied by beautiful summer weather, the first semester kickoff barbecue of the Research Group Data and Web Science took place last Friday (September 6, 2013).

In addition to the researchers and staff of the group, the most successful students from our lectures last semester were invited to join the event.

Over grilled food and beer, the group's new joint teaching concept was briefly introduced, which includes new courses such as Web Data Integration and Decision Support Systems.

Throughout the evening, the attendees discussed many interesting topics, and some cornerstones were laid for future theses.

We thank all participants for coming and wish our students in particular a good and successful start to the new semester.


]]>
Other Group
news-783 Mon, 02 Sep 2013 18:14:00 +0000 Congratulations to Robert Isele for receiving his Doctoral degree https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/congratulations-to-robert-isele-for-receiving-his-doctoral-degree/ We congratulate Robert Isele on receiving his Doctoral degree. The topic of Robert Isele's Doctoral thesis is entity matching in the context of the Web of Data. Within the thesis, he developed a genetic programming algorithm for learning expressive linkage rules as well as an active learning method that minimizes user involvement in the learning process. The thesis was supervised by Professor Bizer and Professor Stuckenschmidt.

The thesis is available online at https://ub-madoc.bib.uni-mannheim.de/33418/.
The title and abstract of the thesis are found below.

Title:

Learning Expressive Linkage Rules for Entity Matching using Genetic Programming

Abstract:

A central problem in data integration and data cleansing is to identify pairs of entities in data sets that describe the same real-world object. Many existing methods for matching entities rely on explicit linkage rules, which specify how two entities are compared for equivalence. Unfortunately, writing accurate linkage rules by hand is a non-trivial problem that requires detailed knowledge of the involved data sets. Another important issue is the efficient execution of linkage rules. In this thesis, we propose a set of novel methods that cover the complete entity matching workflow, from the generation of linkage rules using genetic programming algorithms to their efficient execution on distributed systems. First, we propose a supervised learning algorithm that is capable of generating linkage rules from a gold standard consisting of a set of entity pairs that have been labeled as duplicates or non-duplicates. We show that the introduced algorithm outperforms previously proposed entity matching approaches, including the state-of-the-art genetic programming approach by de Carvalho et al., and is capable of learning linkage rules that achieve a similar accuracy to the human-written rule for the same problem. In order to also cover use cases for which no gold standard is available, we propose a complementary active learning algorithm that generates a gold standard interactively by asking the user to confirm or decline the equivalence of a small number of entity pairs. In the experimental evaluation, labeling at most 50 link candidates was necessary in order to match the performance that is achieved by the supervised GenLink algorithm on the entire gold standard. Finally, we propose an efficient execution workflow that can be run on a cluster of multiple machines. The execution workflow employs a novel multidimensional indexing method that allows the efficient execution of learned linkage rules by reducing the number of required comparisons significantly.
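To illustrate what a linkage rule in this sense looks like, here is a deliberately simple, hand-written sketch (not GenLink's actual rule representation or learning algorithm): per-attribute string similarities are combined with weights and thresholded into a match decision. The attributes, weights, threshold, and example records are all illustrative.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def linkage_rule(e1, e2, threshold=0.75):
    """A toy linkage rule: weighted aggregation of attribute
    similarities, compared against a threshold."""
    score = 0.6 * similarity(e1["name"], e2["name"]) \
          + 0.4 * similarity(e1["city"], e2["city"])
    return score >= threshold

a = {"name": "Robert Isele", "city": "Mannheim"}
b = {"name": "R. Isele",     "city": "Mannheim"}
print(linkage_rule(a, b))
```

What GenLink learns automatically is precisely the structure of such a rule: which attributes to compare, with which similarity measures, weights, and thresholds.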

]]>
Projects Publications
news-761 Tue, 16 Jul 2013 14:54:00 +0000 4 Papers accepted at the International Semantic Web Conference (ISWC2013) https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/4-papers-accepted-at-the-international-semantic-web-conference-iswc2013/ Four papers from our research group were accepted at this year's International Semantic Web Conference (ISWC 2013), the premier conference for research in the area of the Semantic Web.

Research Track (Acceptance rate:  21.5%)

  • Heiko Paulheim and Christian Bizer: Type Inference on Noisy RDF Data.
  • Heiner Stuckenschmidt, Michael Schuhmacher, Johannes Knopp, Christian Meilicke and Ansgar Scherp: On the Status of Experimental Research on the Semantic Web.

In-Use Track  (Acceptance rate:  20.25%)

Evaluation Track

  • Dominique Ritze, Heiko Paulheim and Kai Eckert: Evaluation Measures for Ontology Matchers in supervised Matching Scenarios.

]]>
Publications
news-760 Mon, 15 Jul 2013 10:01:00 +0000 Book "Semantic Models for Adaptive Interactive Systems" published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/book-semantic-models-for-adaptive-interactive-systems-published/ The book Semantic Models for Adaptive Interactive Systems, co-edited by Heiko Paulheim together with Tim Hussein and Jürgen Ziegler from University of Duisburg Essen, Stephan Lukosch from TU Delft, and Gaëlle Calvary from University of Grenoble, has been published by Springer in the Human Computer Interaction Series.

Providing insights into methodologies for designing adaptive systems based on semantic data, and introducing semantic models that can be used for building interactive systems, this book showcases many of the applications made possible by the use of semantic models.

Ontologies may enhance the functional coverage of an interactive system as well as its visualization and interaction capabilities in various ways. Semantic models can also contribute to bridging gaps; for example, between user models, context-aware interfaces, and model-driven UI generation. There is considerable potential for using semantic models as a basis for adaptive interactive systems. A variety of reasoning and machine learning techniques exist that can be employed to achieve adaptive system behavior. The advent and rapid growth of Linked Open Data as a large-scale collection of semantic data has also paved the way for a new breed of intelligent, knowledge-intensive applications.

Semantic Models for Adaptive Interactive Systems includes ten complementary chapters written by experts from both industry and academia. Rounded off by a number of case studies in real world application domains, this book will serve as a valuable reference for researchers and practitioners exploring the use of semantic models within HCI.

]]>
Group Publications Projects
news-732 Fri, 14 Jun 2013 09:57:00 +0000 Japanese Translation of Linked Data Book published https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/japanese-translation-of-linked-data-book-published/ We are happy to announce that a Japanese translation of our book Heath/Bizer: Linked Data – Evolving the Web into a Global Data Space has been published, and we want to thank Professor Hideaki Takeda from the National Institute of Informatics, Japan, for coordinating the translation and publication of the book.

The translation can be ordered at Amazon Japan.

This is already the second translation of the book. The first one, the French translation, is available at Pearson France.

A free online version of the original English edition of the book can be found at http://linkeddatabook.com/

]]>
Publications Projects
news-706 Sun, 19 May 2013 10:45:00 +0000 Social media matching app mobEx wins 10,000 EUR klickTel Award! https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/social-media-matching-app-mobex-wins-10000-eur-klicktel-award/ Congratulations to C. Bikar, M. Jess, F. Knip, B. Opitz, B. Pfister and T. Sztyler!

A student team headed by Jun.-Prof. Dr. habil. Ansgar Scherp has won the 10,000 EUR klickTel Award for the location and event finder mobEx. More details can be found here (in German): blog.telegate.com/entwicklerwettbewerb-%E2%80%9Emobex-gewinnt-klicktel-award-2013/

The application is available for Android in German and English, see: https://play.google.com/store/apps/details?id=de.unima.mobex.client&hl=en

 

]]>
Other Group
news-700 Sun, 12 May 2013 11:03:00 +0000 AAAI and UbiComp: Two papers accepted at top conferences! https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/aaai-and-ubicomp-two-papers-accepted-at-top-conferences/ The Artificial Intelligence Research Group has had two papers accepted at international top-tier conferences:

With the paper "Exploiting Parallelism and Symmetry for MAP Inference in Statistical Relational Models", Jan Nößner and his co-authors managed to publish recent results on efficient probabilistic inference at AAAI, one of the top AI conferences, while Rim Helaoui and her co-authors had their paper "A Probabilistic Ontological Framework for the Recognition of Multilevel Human Activities" accepted at UbiComp, the leading conference on ubiquitous and pervasive computing.

]]>
Publications
news-695 Mon, 06 May 2013 08:53:00 +0000 1st Workshop on Semantic Web Enterprise Adoption and Best Practice https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/1st-workshop-on-semantic-web-enterprise-adoption-and-best-practice/ Marco Neumann (KONA LLC), Sam Coppens (iMinds – Multimedia Lab – Ghent University), Karl Hammer (Jönköping University, Linköping University), Magnus Knuth (Hasso Plattner Institute - University of Potsdam), Dominique Ritze (DWS - University of Mannheim) and Miel Vander Sande (iMinds – Multimedia Lab – Ghent University) are organizing the first Workshop on Semantic Web Enterprise Adoption and Best Practice, to be held at the 12th International Semantic Web Conference (ISWC) in Sydney, Australia.

Over the years, Semantic Web based systems, applications, and tools have shown significant improvement. Their development and deployment show the steady maturing of semantic technologies and demonstrate their value in solving current and emerging problems. Examples include enabling generic clients, facilitating autonomous agents, and large-scale distributed data integration. Despite these encouraging signs, the number of enterprises working on and with these technologies is dwarfed by the large number that have not yet adopted semantic technologies. Current adoption is mainly restricted to methodologies provided by the research community. Although the Semantic Web is a candidate technology for industry, it has not yet prevailed in current enterprise challenges such as data fusion, data integration, or natural language processing (e.g., IBM Watson). To better understand the market dynamics, uptake needs to be addressed and, if possible, quantified.


The Workshop on Semantic Web Enterprise Adoption and Best Practice is intended to close the gap between the industry tracks and research tracks at ISWC2013. Topics for presentation and discussion at the workshop include both technical and usage-oriented issues. They include everything that helps shorten development and deployment time for an academic or a practitioner wishing to work with semantic technologies. Relevant topics include:

- Surveys or case studies on Semantic Web technology in enterprise systems
- Comparative studies on the evolution of Semantic Web adoption
- Semantic systems and architectures of methodologies for industrial challenges
- Semantic Web based implementations and design patterns for enterprise systems
- Enterprise platforms using Semantic Web technology as part of the workflow
- Architectural overviews for Semantic Web systems
- Design patterns for semantic technology architectures and algorithms
- System development methods as applied to semantic technologies
- Semantic toolkits for enterprise applications
- Surveys on identified best practices based on Semantic Web technology

]]>
Other Projects
news-687 Fri, 26 Apr 2013 09:51:00 +0000 DBpedia @ Google Summer of Code 2013 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dbpedia-google-summer-of-code-2013/ Google Summer of Code (GSoC) is a global program that offers post-secondary student developers (ages 18 and older, BSc, MSc, PhD) stipends to write code for various open source software projects. Since its inception in 2005, the program has brought together over 6,000 successful student participants and over 3,000 mentors from over 100 countries worldwide, all for the love of code.

After successfully participating in last year's GSoC, the DBpedia proposal has been accepted again for GSoC 2013. All DBpedia-family products participate this time: DBpedia, DBpedia Spotlight, and DBpedia Wiktionary.

This year, brand-new and exciting ideas have been proposed, so if you know energetic students (BSc, MSc, PhD) interested in working with DBpedia, text processing, and semantics, please encourage them to apply!

The application deadline for students is May 3rd.

Interested applicants are referred to the dedicated mailing list, the ideas and warm-up tasks, and the application guidelines.

]]>
Projects Other
news-686 Thu, 25 Apr 2013 07:31:00 +0000 1st Workshop on Linked Data for Information Extraction https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/1st-workshop-on-linked-data-for-information-extraction/ Together with Anna Lisa Gentile and Ziqi Zhang from University of Sheffield and Claudia d'Amato from University of Bari, Heiko Paulheim is organizing the first workshop on Linked Data for Information Extraction (LD4IE), to be held at the 12th International Semantic Web Conference (ISWC) in Sydney, Australia.

This workshop focuses on the exploitation of Linked Data for Web Scale Information Extraction (IE), which concerns extracting structured knowledge from unstructured/semi-structured documents on the Web. One of the major bottlenecks for the current state of the art in IE is the availability of learning materials (e.g., seed data, training corpora), which are typically created manually and are expensive to build and maintain.

Linked Data (LD) defines best practices for exposing, sharing, and connecting data, information, and knowledge on the Semantic Web using uniform means such as URIs and RDF. This has so far resulted in a gigantic knowledge source, the Linked Open Data (LOD) cloud, which constitutes a mine of learning materials for IE. However, its massive quantity requires efficient learning algorithms, and the unguaranteed quality of the data requires robust methods to handle redundancy and noise.

LD4IE intends to gather researchers and practitioners to address the multiple challenges arising from the usage of LD as learning material for IE tasks, focusing on (i) modelling user-defined extraction tasks using LD; (ii) gathering learning materials from LD while assuring quality (training data selection, cleaning, feature selection, etc.); (iii) robust learning algorithms for handling LD; and (iv) publishing IE results to the LOD cloud.
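The core idea of using LOD as learning material can be illustrated with a toy sketch: a small gazetteer of entity labels, which in the LD4IE setting would be harvested from LOD sources such as DBpedia, drives a simple dictionary-based entity spotter. The gazetteer contents and type names below are made up for illustration.

```python
# Stand-in for labels and types that would be extracted from an LOD
# source; the "dbpedia:" type names here are illustrative placeholders.
gazetteer = {
    "Mannheim": "dbpedia:City",
    "DBpedia": "dbpedia:Project",
    "Sydney": "dbpedia:City",
}

def spot_entities(text, dictionary):
    """Return (surface form, type) pairs for dictionary labels found in text."""
    return [(label, etype) for label, etype in dictionary.items()
            if label in text]

text = "The workshop will be held in Sydney, with data sets from DBpedia."
print(sorted(spot_entities(text, gazetteer)))
```

Real systems must of course handle ambiguity and noisy LOD labels, which is exactly the robustness challenge the workshop targets.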

We welcome paper submissions on topics related to Information Extraction using Linked Data, such as:

Modeling Extraction Tasks

  • modeling extraction tasks
  • extracting knowledge patterns for task modeling
  • user friendly approaches for querying linked data

Information Extraction

  • selecting relevant portions of LOD as training data
  • selecting relevant knowledge resources from linked data
  • IE methods robust to noise in training data
  • Information Extraction tasks/applications exploiting LOD (wrapper induction, table interpretation, IE from unstructured data, named entity recognition, etc.)
  • publishing information extraction results as Linked Data
  • linking extracted information to existing LOD datasets

Linked Data for Learning

  • assessing the quality of LOD data for training
  • selecting an optimal subset of LOD to seed learning
  • managing incompleteness, noise, and uncertainty of LOD
  • scalable learning methods
  • pattern extraction from LOD

Format

We accept the following formats of submissions:

  • Full paper with a maximum of 12 pages including references
  • Short paper with a maximum of 6 pages including references
  • Poster with a maximum of 4 pages including references

All research submissions must be in English. Submissions must be in PDF formatted in the style of the Springer Publications format for Lecture Notes in Computer Science (LNCS). Submissions are not anonymous.

Accepted papers will be published online via CEUR-WS.

]]>
Other Projects
news-662 Wed, 27 Mar 2013 16:31:00 +0000 Collaborative Semantic Mind Maps https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/kollaborative-semantische-mind-maps/ Mind-mapping can be used to visualize new ideas, to structure plans, or to relate previously learned material. The goal of this thesis is to create a way to build such mind maps collaboratively and to explore them visually.

For a stronger visual experience, the mind maps are to be made explorable in a 2D/3D environment. Users should be able to move through the mind map in a virtual 2D/3D world and to modify or extend it. Technically, suitable synchronization mechanisms must ensure that multiple users can work on the same mind map at the same time. Both client-server and peer-to-peer approaches can be used for this.

The representation of the nodes in the mind maps should support the RDF format [1], so that existing data sources can be used and imported. For example, existing background knowledge from DBpedia [2], an RDF version of Wikipedia, can be connected. In addition, a chat feature should enable communication between the users.

[1] W3C Resource Description Framework, www.w3.org/RDF/
[2] DBpedia, de.dbpedia.org

Contact: Jun.-Prof. Dr. Ansgar Scherp (ansgar@informatik.uni-mannheim.de), Philip Mildner, Prof. Dr.-Ing. Wolfgang Effelsberg

]]>
Topics
news-642 Tue, 26 Feb 2013 09:12:00 +0000 We are seeking two additional researchers https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/we-are-seeking-two-additional-researchers/ The Research Group Data and Web Science is currently seeking to fill two additional researcher positions:

  • PhD Position with a research focus on Ontologies, Knowledge Representation and Probabilistic Methods.
  • PhD or PostDoc Position with a research focus on Information Extraction, Data Cleansing and Large-Scale Data Management.
]]>
Projects
news-609 Fri, 08 Feb 2013 12:47:00 +0000 2nd Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data (Know@LOD) https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/2nd-workshop-on-knowledge-discovery-and-data-mining-meets-linked-open-data-knowlod/ After last year’s successful debut, the second international workshop on Knowledge Discovery and Data Mining Meets Linked Open Data (Know@LOD) will be held at the 10th Extended Semantic Web Conference (ESWC). The workshop will be organized by Johanna Völker and Heiko Paulheim from the Data and Web Science Group, together with Jens Lehmann from University of Leipzig, Mathias Niepert from University of Washington, and Harald Sack from University of Potsdam.

Knowledge discovery and data mining (KDD) is a well-established field with a large community investigating methods for the discovery of patterns and regularities in large data sets, including relational databases and unstructured text. Research in this field has led to the development of practically relevant and scalable approaches such as association rule mining, subgroup discovery, graph mining, and clustering. At the same time, the Web of Data has grown into one of the largest publicly available collections of structured, cross-domain data sets. While the growing success of Linked Data and its use in applications, e.g., in the e-Government area, has provided numerous novel opportunities, its scale and heterogeneity are posing challenges to the field of knowledge discovery and data mining:

  • The extraction and discovery of knowledge from very large data sets;
  • The maintenance of high quality data and provenance information;
  • The scalability of processing and mining the distributed Web of Data; and
  • The discovery of novel links, both on the instance and the schema level.

 

Topics of interest include data mining and knowledge discovery methods for generating, processing, or using linked data, such as

  • Automatic link discovery
  • Event detection and pattern discovery
  • Frequent pattern analysis
  • Graph mining
  • Knowledge base debugging, cleaning and repair
  • Large-scale information extraction
  • Learning and refinement of ontologies
  • Modeling provenance information
  • Ontology matching and object reconciliation
  • Scalable machine learning
  • Statistical relational learning
  • Text and web mining
  • Usage mining

Important Dates

Submission deadline: March 4th, 2013
Notification: April 1st, 2013
Camera ready version: April 15th, 2013
Workshop: May 26th or 27th, 2013

]]>
Other Projects
news-600 Fri, 25 Jan 2013 08:38:00 +0000 Bachelor/Master Thesis (Stuckenschmidt): Semi-Automatic Evolution Support for DBpedia https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/bachelormaster-thesis-stuckenschmidt-semi-automatic-evolution-support-for-dbpedia/ Type of Work: Bachelor/Master Thesis

Person of Contact: Dr. Johanna Völker

Topic: DBpedia [1] is a large knowledge repository, which contains several billion facts automatically extracted from Wikipedia infoboxes. Representing these facts using W3C standards such as RDF and making them accessible via the SPARQL query language, DBpedia has become one of today's most popular resources for intelligent knowledge-based applications.

In order to facilitate logical inference and query answering, DBpedia provides a rich schema [3] expressed in the Web Ontology Language (OWL). The schema defines, for example, the permissible values of relations (e.g., the mayor of a city must be a person, and the name of a person is of type xsd:string) and also serves as a guideline for the DBpedia extraction process. Manually created mappings [2] connect concepts and relations in the schema to the templates which are used by Wikipedia infoboxes.

However, neither the schema nor the data of DBpedia are perfect, and DBpedia-based applications still have to deal with incorrect information. Many errors result from unsynchronized changes to the contents of Wikipedia, the infobox templates, as well as to the DBpedia schema and associated mappings.
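Such range violations can be detected mechanically once the schema and the instance types are known. Below is a minimal sketch of the idea; the prefixes, triples, and type assignments are invented for illustration, and a real check would of course query the live DBpedia data rather than toy dictionaries.

```python
# Toy range check: the schema says the object of dbo:mayor must be a
# dbo:Person; any triple whose object has a different type is flagged.
RANGES = {"dbo:mayor": "dbo:Person"}  # hypothetical excerpt of the schema

TYPES = {  # made-up instance data in the spirit of DBpedia
    "dbr:Jane_Doe": "dbo:Person",
    "dbr:Acme_Corp": "dbo:Company",
}

TRIPLES = [
    ("dbr:Mannheim", "dbo:mayor", "dbr:Jane_Doe"),
    ("dbr:Springfield", "dbo:mayor", "dbr:Acme_Corp"),  # violates the range
]

def range_violations(triples, ranges, types):
    # Collect every triple whose predicate has a declared range that the
    # object's type does not satisfy.
    return [
        (s, p, o)
        for (s, p, o) in triples
        if p in ranges and types.get(o) != ranges[p]
    ]

violations = range_violations(TRIPLES, RANGES, TYPES)
```

Flagged triples like the second one above are exactly the kind of error the thesis would catalogue and trace back to unsynchronized updates.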

---

Bachelor Thesis: Evaluating the Quality of Crowdsourced Schema-Mappings in DBpedia

The goal of this bachelor thesis is to identify the most severe types of errors caused by the dynamics of this large and partially crowdsourced dataset. Using methods for data profiling and cleansing, you will analyse the impact of unsynchronized updates on the quality of information in DBpedia.

Requirements

  • Programming skills
  • Ideally, knowledge about semantic web technologies

---

Master Thesis: A Web-based User Interface for the Semi-Automatic Maintenance of DBpedia

Based on the findings from the Bachelor thesis, you will develop strategies for synchronizing the evolution of the DBpedia schema and associated mappings with structural changes in Wikipedia infoboxes. These strategies will have to be implemented by developing a wiki-based user interface for the DBpedia community, which efficiently combines crowd resources with automatic approaches.

Requirements

  • Advanced programming skills
  • Ideally, knowledge about semantic web technologies

---

[1] http://dbpedia.org
[2] http://mappings.dbpedia.org
[3] http://wiki.dbpedia.org/Ontology

]]>
Topics
news-581 Thu, 17 Jan 2013 09:56:00 +0000 6th Linked Data on the Web Workshop at WWW2013 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/6th-linked-data-on-the-web-workshop-at-www2013/ Together with Sir Tim Berners-Lee (W3C/MIT, USA), Tom Heath (Talis Information Ltd, UK), Sören Auer (Universität Leipzig), and Michael Hausenblas (DERI, Ireland), Christian Bizer is organizing the 6th Linked Data on the Web Workshop (LDOW2013) at the 22nd World Wide Web Conference (WWW2013) in Rio de Janeiro, Brazil.

Linked Data is a set of best practices for publishing structured data on the Web which focuses on setting hyperlinks between data items provided by different web servers. These hyperlinks connect the data from all servers into a single global data graph - the Web of Linked Data.

The 6th Workshop on Linked Data on the Web (LDOW2013) aims to stimulate further research into exploiting this global data graph to deliver transformative applications to large user bases, as well as to mine the graph for implicit knowledge. Inevitably, the challenges associated with Linked Data range from low-level plumbing issues through large-scale data processing and mining to higher-level conceptual questions of value propositions and business models. LDOW2013 will provide a forum for exposing novel, high-quality research and applications in all of these areas. In addition, by bringing together researchers in the field, the workshop will further shape the ongoing Linked Data research agenda.

Important Dates

  • Submission deadline: 10 March, 2013
  • Notification of acceptance: 30 March, 2013
  • Camera-ready versions of accepted papers: 15 April, 2013
  • Workshop date: 14 May, 2013

Topics of Interest

Topics of interest for the workshop include, but are not limited to, the following:

Mining the Web of Linked Data

  • large-scale approaches to deriving implicit knowledge from the Web of Linked Data
  • using the Web of Linked Data as background knowledge in data mining applications

Linking and Fusion

  • linking algorithms and heuristics, identity resolution
  • increasing the value of Schema.org and OpenGraphProtocol data through linking
  • Web data integration and fusion
  • performance of linking infrastructures/algorithms on Web data

Quality, Trust, Provenance and Licensing in Linked Data

  • profiling and change tracking of Linked Data sources
  • tracking provenance and usage of Linked Data
  • evaluating quality and trustworthiness of Linked Data
  • licensing issues in Linked Data publishing

Linked Data Applications and Business Models

  • Linked Data browsers and search engines
  • Linked Data as pay-as-you-go data integration technology within corporate contexts
  • marketplaces, aggregators and indexes for Linked Data
  • interface and interaction paradigms for Linked Data applications
  • business models for Linked Data publishing and consumption
  • Linked Data applications for life-sciences, digital humanities, social sciences etc.

More information about the workshop can be found at http://events.linkeddata.org/ldow2013/

]]>
Projects Other
news-602 Tue, 11 Dec 2012 14:53:00 +0000 Web Data Commons publishes data from 3 billion webpages https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/web-data-commons-publishes-data-from-3-billion-webpages/ More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats.

The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download. In addition, we calculate and publish statistics about the deployment of the different formats as well as the vocabularies that are used together with each format.

Web Data Commons is a joint effort of the Research Group Data and Web Science at the University of Mannheim (Christian Bizer, Robert Meusel, Michael Schuhmacher, Johanna Völker, Kai Eckert) and the Institute AIFB at the Karlsruhe Institute of Technology (Andreas Harth, Steffen Stadtmüller).

Today, we are happy to announce the release of a new WebDataCommons dataset.

The dataset has been extracted from the latest version of the Common Crawl. This August 2012 version of the Common Crawl contains over 3 billion HTML pages which originate from over 40 million websites (pay-level-domains).

Altogether we discovered structured data within 369 million HTML pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million websites (5.65%).  Approximately 519 thousand of these websites use RDFa, while 140 thousand websites use Microdata. Microformats are used on 1.7 million websites.

Basic statistics about the extracted dataset as well as the vocabularies that are used together with each encoding format are found at:

http://www.webdatacommons.org/2012-08/stats/stats.html

Additional statistics that analyze top-level domain distribution and the popularity of the websites covered by the Common Crawl, as well as the topical domains of the embedded data are found at:

http://www.webdatacommons.org/2012-08/stats/additional_stats.html

The overall size of the August 2012 WebDataCommons dataset is 7.3 billion quads. The dataset is split into 1,416 files each having a size of around 100 MB. In order to make it easier to find data from a specific website or top-level-domain, we provide indexes about the location of specific data within the files.

In order to make it easy for third parties to investigate the usage of different vocabularies and to generate seed-lists for focused crawling endeavors, we provide a website-class-property matrix for each format. The matrices indicate which vocabulary term (class/property) is used by which website, so there is no need to download and scan the whole dataset to obtain this information.
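Used that way, a website-class-property matrix could drive seed-list generation roughly as follows. The matrix layout, site names, and vocabulary terms below are invented for illustration; see the download page for the actual file format.

```python
# Sketch: derive a seed list for a focused crawl from a (hypothetical)
# website-class-property matrix mapping each website to the vocabulary
# terms found in its markup.
matrix = {
    "shop-a.example": {"schema:Product", "schema:Offer"},
    "blog-b.example": {"schema:BlogPosting"},
    "shop-c.example": {"schema:Product"},
}

def seed_list(matrix, term):
    # Return all websites whose markup uses the given vocabulary term,
    # sorted for reproducible output.
    return sorted(site for site, terms in matrix.items() if term in terms)

seeds = seed_list(matrix, "schema:Product")
```

A crawler targeting e-commerce data could then restrict itself to the returned sites instead of scanning the full 7.3 billion quads.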

The extracted dataset and website-class-property matrix can be downloaded from:

http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html

Lots of thanks to:

+ the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data parsers.
+ the PlanetData and the LOD2 EU research projects for supporting WebDataCommons.

Have fun with the new dataset.

Christian Bizer and Robert Meusel 

]]>
Projects
news-540 Thu, 08 Nov 2012 12:24:00 +0000 Metadata Provenance Workshop at Semantic Web in Libraries Conference https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/metadata-provenance-workshop-at-semantic-web-in-libraries-conference/ We organize a workshop/tutorial about metadata provenance and provenance in the Semantic Web at the SWIB conference (Semantic Web in Libraries):

When metadata is distributed, combined, and enriched as Linked Data, the tracking of its provenance becomes a hard issue. Using data encumbered with licenses that require attribution of authorship may eventually become impracticable as more and more data sets are aggregated - one of the main motivations for the call to open data under permissive licenses like CC0. Nonetheless, there are important scenarios where keeping track of provenance information becomes a necessity. A typical example is the enrichment of existing data with automatically obtained data, for instance as a result of automatic indexing. Ideally, the origins, conditions, rules and other means of production of every statement are known and can be used to put it into the right context.
Part 1 - Metadata Provenance in RDF: In RDF, the mere representation of provenance - i.e., statements about statements - is challenging. We explore the possibilities, from the unloved reification and other proposed alternative Linked Data practices through to named graphs and recent developments regarding the upcoming next version of RDF.
Part 2 - Interoperable Metadata Provenance: As with metadata itself, common vocabularies and data models are needed to express basic provenance information in an interoperable fashion. We investigate the PROV model that is currently developed by the W3C Provenance Working Group and compare it to Dublin Core as a representative of a flat, descriptive metadata schema.
We actively encourage participants to present their own use cases and open challenges at this workshop. Please contact the organizers for details.
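The contrast discussed in Part 1 can be sketched in a few lines: the same statement is represented once via classic RDF reification, which expands one triple into four, and once as a single quad in a named graph. All URIs below are invented for illustration.

```python
# One statement, two provenance representations: RDF reification blows a
# single triple up into four auxiliary triples, while a named graph only
# attaches one graph URI to which provenance can then be linked.
def reify(s, p, o, stmt_id):
    # Classic RDF reification: describe the statement as a resource.
    return [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", s),
        (stmt_id, "rdf:predicate", p),
        (stmt_id, "rdf:object", o),
    ]

def named_graph_quad(s, p, o, graph):
    # Named-graph style: the triple itself is kept, extended by a graph URI.
    return (s, p, o, graph)

triple = ("ex:book1", "dc:creator", "ex:alice")
reified = reify(*triple, stmt_id="ex:stmt1")
quad = named_graph_quad(*triple, graph="ex:provenance-graph-1")
```

Provenance statements (author, date, rule of production) would then be attached to `ex:stmt1` in the first case and to `ex:provenance-graph-1` in the second.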

The workshop is chaired by Kai Eckert and Magnus Pfeffer.

]]>
Other Projects
news-565 Wed, 26 Sep 2012 15:42:00 +0000 Best Paper Awards at TPDL 2012 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/best-paper-awards-at-tpdl-2012/ Dominique Ritze and Katarina Boland (Gesis), together with Brigitte Mathiak (Gesis) and Kai Eckert, won both the Best Paper Award and the Best Student Paper Award at the International Conference on Theory and Practice of Digital Libraries (TPDL) 2012 for their paper on "Identifying References to Datasets in Publications".

]]>
Publications
news-500 Thu, 30 Aug 2012 12:22:00 +0000 Bachelor/Master Thesis (Stuckenschmidt): Comparison of Fast Approximate String Matching Methods https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/bachelormaster-thesis-stuckenschmidt-vergleich-schneller-approximativer-string-matching-verfahre/ In companies, large databases often have to be linked with one another. One of the data integration challenges here is to identify entries that are "almost" identical. For example, when integrating a person database, an algorithm should recognize that "Hans, Peter" and "P. Hans" are quite similar.

The problem is that, for very large data sets, it is no longer feasible to compare every entry with every other one (quadratic runtime). Comparing, for example, two databases with 1 million entries each (not much!) requires 1,000,000,000,000 comparison operations (a great deal!). This thesis is therefore about studying algorithms that are faster than that (i.e., not quadratic).

Numerous Master's/Bachelor's theses are conceivable in this area. For example, different approaches could be implemented and then compared against each other on real-world data. The result could be an easy-to-use toolbox that can be made available to other researchers.
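One standard family of sub-quadratic approaches is blocking: candidate pairs are generated only for entries that share a blocking key, such as a character n-gram, so most of the all-pairs comparisons are never performed. A minimal sketch follows; the toy data is invented, and a real system would combine blocking with a proper similarity measure such as edit distance.

```python
# A minimal blocking sketch: index one database by character trigrams and
# probe it with the trigrams of the other; pairs sharing no trigram are
# never compared at all, avoiding the quadratic all-pairs loop.
from collections import defaultdict

def ngrams(s, n=3):
    s = s.lower().replace(" ", "")
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}

def candidate_pairs(left, right, n=3):
    index = defaultdict(set)
    for j, r in enumerate(right):
        for g in ngrams(r, n):
            index[g].add(j)
    pairs = set()
    for i, l in enumerate(left):
        for g in ngrams(l, n):
            for j in index[g]:
                pairs.add((i, j))
    return pairs

left = ["Hans, Peter", "Maier, Anna"]
right = ["P. Hans", "Anna Maier", "Schmidt, Klaus"]
pairs = candidate_pairs(left, right)
```

Here "Hans, Peter" and "P. Hans" share the trigrams "han" and "ans" and thus become a candidate pair, while "Schmidt, Klaus" is never compared with either left-hand entry.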

]]>
Topics Christian Topics - Künstliche Intelligenz I Topics - Decision Support
news-492 Fri, 27 Jul 2012 12:26:00 +0000 1st Workshop on Linked Data for Information Extraction https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/1st-workshop-on-linked-data-for-information-extraction-1/ Together with Anna Lisa Gentile and Ziqi Zhang from University of Sheffield and Claudia d'Amato from University of Bari, Heiko Paulheim is organizing the first workshop on Linked Data for Information Extraction (LD4IE), to be held at the 12th International Semantic Web Conference (ISWC) in Sydney, Australia.

This workshop focuses on the exploitation of Linked Data for Web Scale Information Extraction (IE), which concerns extracting structured knowledge from unstructured/semi-structured documents on the Web. One of the major bottlenecks for the current state of the art in IE is the availability of learning materials (e.g., seed data, training corpora), which are typically created manually and are expensive to build and maintain.

Linked Data (LD) defines best practices for exposing, sharing, and connecting data, information, and knowledge on the Semantic Web using uniform means such as URIs and RDF. This has so far created a gigantic knowledge source, the Linked Open Data (LOD) cloud, which constitutes a mine of learning materials for IE. However, its massive quantity requires efficient learning algorithms, and its unguaranteed data quality requires robust methods to handle redundancy and noise.

LD4IE intends to gather researchers and practitioners to address multiple challenges arising from the usage of LD as learning material for IE tasks, focusing on (i) modelling user defined extraction tasks using LD; (ii) gathering learning materials from LD assuring quality (training data selection, cleaning, feature selection etc.); (iii) robust learning algorithms for handling LD; (iv) publishing IE results to the LOD cloud.

We welcome paper submissions with topics related to Information Extraction using Linked Data, such as:

Modeling Extraction Tasks

  • modeling extraction tasks
  • extracting knowledge patterns for task modeling
  • user friendly approaches for querying linked data

Information Extraction

  • selecting relevant portions of LOD as training data
  • selecting relevant knowledge resources from linked data
  • IE methods robust to noise in training data
  • Information Extraction tasks/applications exploiting LOD (Wrapper induction, Table interpretation, IE from unstructured data, Named Entity Recognition etc.)
  • publishing information extraction results as Linked Data
  • linking extracted information to existing LOD datasets

Linked Data for Learning

  • assessing the quality of LOD data for training
  • selecting an optimal subset of LOD to seed learning
  • managing incompleteness, noise, and uncertainty of LOD
  • scalable learning methods
  • pattern extraction from LOD

Format

We accept the following formats of submissions:

  • Full paper with a maximum of 12 pages including references
  • Short paper with a maximum of 6 pages including references
  • Poster with a maximum of 4 pages including references

All research submissions must be in English. Submissions must be in PDF formatted in the style of the Springer Publications format for Lecture Notes in Computer Science (LNCS). Submissions are not anonymous.

Accepted papers will be published online via CEUR-WS.

]]>
Other Projects
news-484 Thu, 12 Jul 2012 08:29:00 +0000 SEALS mission accomplished https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/seals-mission-accomplished/ Final review of the SEALS project successfully passed. The project officer and the reviewers from the European Commission have reviewed the work done in the SEALS project over the last three years very positively. Our part in the project was related to the field of Ontology Matching evaluation. After three years of hard work and many successful OAEI evaluations, we are happy that we could contribute to the excellent work done in the project.

]]>
Projects
news-481 Wed, 11 Jul 2012 13:41:00 +0000 Data Integration Games https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/data-integration-games/  

In the MappingAssistant project, we have shown that it is possible to support domain users in creating and maintaining data integration rules and to significantly increase the quality of the resulting mappings.

An open question is how to get domain users to monitor the quality of the integrated data regularly on their own initiative. Providing tools such as the MappingAssistant is a necessary precondition for this, but it is not sufficient. Suitable incentive systems therefore have to be created to encourage participation.

In this project, incentive systems for the continuous monitoring of data integration quality are to be developed and supported in software as an extension of OntoStudio. The resulting methods are to be evaluated and assessed in user studies.

One way to create incentives is to formulate the integration task as a game with a serious background. The performance of employees in this game can then be linked to concrete incentives such as bonuses or performance reviews.

We illustrate the possibility of formulating data integration as a game using the task of recognizing records that describe the same object but differ in its concrete description. The corresponding game could look as follows:

  • In each round of the game, a pair of records that possibly describe the same object is selected and assigned a certain point value for the correct solution.

  • The pair is first shown face down to several players. Over several steps, the players then have the following options:

      • They cast a vote on whether they believe the records describe the same object or not. If the vote is correct, the point value is credited to them; if it is wrong, the point value is deducted (this prevents mere guessing).

      • The players buy further information about the data in the form of requests to reveal certain attributes of the records. The purchase price is deducted from the player's account for the current round and may not exceed the value of the task. Mechanisms of varying complexity are conceivable here:

          • A fixed amount is deducted per attribute. This motivates players to concentrate on a few highly informative attributes.

          • Players compete for the attributes in an auction, and only the highest bids are considered. This yields information about the importance of an attribute for recognizing identical objects.

          • In a combinatorial auction, players can bid on certain combinations of attributes. Attributes are treated as a scarce resource and may only be considered in a limited number of bids. This yields information about informative combinations of attributes.

      • Once all players have cast their votes, the majority vote is taken to be true and the points are distributed accordingly.
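The scoring rules of such a round can be sketched in a few lines; the player names, attribute costs, and point value below are invented for the example.

```python
# Illustrative sketch of one scoring round: players have paid for revealed
# attributes and cast votes; the majority vote is taken as the truth, and
# each player gains or loses the round's point value accordingly.
def score_round(votes, attribute_costs, point_value):
    """votes: {player: True/False}; attribute_costs: {player: points spent}."""
    majority = sum(votes.values()) > len(votes) / 2
    scores = {}
    for player, vote in votes.items():
        # Correct voters gain the point value, wrong voters lose it,
        # and everyone pays for the attributes they bought.
        gain = point_value if vote == majority else -point_value
        scores[player] = gain - attribute_costs.get(player, 0)
    return majority, scores

majority, scores = score_round(
    votes={"alice": True, "bob": True, "carol": False},
    attribute_costs={"alice": 3, "carol": 1},
    point_value=10,
)
```

In this round the majority decides the records match; alice nets 7 points (10 minus 3 spent on attributes), bob 10, and carol loses 11.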

Applying this game has a number of concrete advantages for a company. First, one obtains a relatively reliable judgment as to whether two descriptions refer to the same object or not. In addition, insights are gained into which attributes were decisive for the decision. This information can be used in developing and optimizing automatic matching methods for data. Finally, a company also learns which of its employees are particularly good at recognizing duplicates. These employees can be rewarded accordingly and deployed in a more targeted manner for data curation.

]]>
Topics Topics - Decision Support Christian
news-480 Tue, 10 Jul 2012 07:55:00 +0000 RR and ReasoningWeb 2013 in Mannheim https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/rr-and-reasoningweb-2013-in-mannheim/ The 7th International Conference on Web Reasoning and Rule Systems and the 9th Reasoning Web Summer School will take place at the University of Mannheim in August 2013.

]]>
Other
news-478 Tue, 10 Jul 2012 07:54:00 +0000 Best Paper Award at ICEIS 2012 https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/best-paper-award-at-eceis-2012/ Heiner Stuckenschmidt, Jan Nößner and Faraz Fallahi have won the best paper award at the 14th International Conference on Enterprise Information Systems (ICEIS 2012) for their work "A Study in User-Centric Information Integration".

]]>
Publications
news-479 Tue, 10 Jul 2012 07:54:00 +0000 Google Research Award for Mathias Niepert! https://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/google-research-award-for-mathias-niepert/ Mathias Niepert won a Google Faculty Research Award. With Dr. Mathias Niepert and Prof. Heiner Stuckenschmidt, researchers of the Stuckenschmidt research group at the Institute for Enterprise Systems (InES) of the University of Mannheim have received one of the renowned "Google Faculty Research Awards" for their research on distributed probabilistic knowledge bases. InES thus joins the ranks of prominent award recipients such as MIT, Carnegie Mellon University, and Harvard University, and is one of only 28 institutions outside the USA to receive the award.

Since its founding in 2011, the Institute for Enterprise Systems at the University of Mannheim has conducted interdisciplinary research on enterprise information systems. In addition to questions concerning the life cycle of these systems, its research focuses on the development of innovative technologies. In research projects, InES cooperates both with technology companies such as SAP AG and with user companies such as Deutsche Bank AG. With Google, one of the most important IT companies in the USA has now also been convinced of the quality of the research done at InES.

Funded by the award, a research assistant in Prof. Stuckenschmidt's InES research group will now work on the development of efficient, distributed algorithms for building large knowledge bases on the Web, cooperating closely with Google's research department in Zurich. In contrast to conventional methods, the methods to be developed will not only extract and store information found on web pages but also assess its correctness at the same time. The resulting models are therefore of very high quality and can, for example, form the basis for search engines that are able to answer complex queries as well.

The primary goal of the research award, which Google presents annually, is to build closer cooperation between Google and academic institutions. The decision on which of the submitted project proposals are accepted is made by numerous expert committees within the company itself, which assess the proposals with respect to their practical importance, innovativeness, and relevance to Google.

]]>
Projects