RSS-Feed en-gb TYPO3 News Sun, 27 May 2018 00:02:41 +0000 Sun, 27 May 2018 00:02:41 +0000 TYPO3 EXT:news news-2007 Tue, 10 Oct 2017 12:59:52 +0000 Master Thesis: Integrating Product Data using Supervision from the Web (Bizer/Paulheim/Primpeli) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-integrating-product-data-using-supervision-from-the-web-bizerpaulheimprimpeli/ A large number of e-shops have started to mark up structured data about products and offers in their HTML pages using the markup standard Microdata and the schema.org vocabulary.

In the context of the WebDataCommons project we have extracted a large corpus of product data from the Common Crawl web corpus. The product data corpus is found here (682,000,000 product records, 497,000,000 offers). A relatively small number of e-shops also publish product identifiers which are indicated with one of the following properties: sku, productID, mpn, identifier, gtin14, gtin13, gtin12, and gtin8.

The aim of this thesis is to analyze and evaluate the utility of product identifiers found on the Web as supervision for matching product descriptions. More concretely, the goal of the thesis is to investigate whether it is possible to learn enough product characteristics from the small set of e-shops that do provide product identifiers in order to detect the same products on websites that do not provide identifiers.

The thesis involves the following tasks:

  • Analysis of Product Identifiers: Analyze the distribution of product identifiers published on the Web. This involves the identification of product entities and product categories for which identifiers are more frequently assigned.
  • Identity Resolution: Develop identity resolution methods for finding out which e-shops sell the same product. Product identifiers will be used as a source of supervision in order to learn classification models. The learned models will be evaluated in terms of how well they can generalize to products without assigned identifiers.
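
One way to picture how identifiers can serve as supervision: offers from different shops that share an identifier value become positive training pairs, and offers with different identifier values become negative pairs. The sketch below only illustrates this idea. The record layout (`shop`, `title`, `gtin13`), the toy identifier values and the helper name `build_training_pairs` are assumptions made for the example, not the format of the WebDataCommons corpus.

```python
from itertools import combinations

def build_training_pairs(offers, id_key="gtin13"):
    """Derive labeled offer pairs from a shared identifier (hypothetical helper).

    Offers with the same identifier value yield a positive pair (label 1),
    offers with different identifier values yield a negative pair (label 0).
    Offers without an identifier are skipped: they are the ones a learned
    matcher would later be applied to.
    """
    labeled = [o for o in offers if o.get(id_key)]
    pairs = []
    for a, b in combinations(labeled, 2):
        if a["shop"] == b["shop"]:
            continue  # only compare offers across different shops
        label = 1 if a[id_key] == b[id_key] else 0
        pairs.append((a["title"], b["title"], label))
    return pairs

# Toy offers with made-up identifier values
offers = [
    {"shop": "shop-a", "title": "Acme Phone X 64GB black", "gtin13": "0000000000017"},
    {"shop": "shop-b", "title": "ACME PhoneX (64 GB)", "gtin13": "0000000000017"},
    {"shop": "shop-c", "title": "Acme Tablet 10 inch", "gtin13": "0000000000024"},
    {"shop": "shop-d", "title": "Acme Phone X black 64GB"},  # no identifier
]
pairs = build_training_pairs(offers)
```

The resulting title pairs could then be turned into similarity features for a classifier, which is evaluated on offers such as the one from `shop-d` that carry no identifier.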



Your skills:

  • Preferred Expertise: Programming (Java or another language), Data Mining; NLP is a plus.
  • Relevant Lectures: IE 500 Data Mining, IE 670 Web Data Integration, IE 671 Web Mining, IE 663 Information Retrieval

For more information please contact Christian Bizer, Heiko Paulheim, or Anna Primpeli.



Thesis - Master Chris
news-1981 Thu, 07 Sep 2017 09:48:14 +0000 Master thesis: Address management and geocoding (Gemulla, DHL) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-address-management-and-geocoding-gemulla-dhl/ eCommerce is on the rise. Logistics companies like Deutsche Post DHL are expanding and building up new logistics networks in emerging markets around the globe. Delivering parcels to end-consumers quickly, reliably and efficiently with an outstanding service requires much planning effort. This is especially complex in emerging countries where the infrastructure poses additional challenges. One essential part of making the delivery more efficient is route optimization of the driven tours on the “last mile”. It is done by using optimization algorithms connecting the delivery locations considering all kinds of restrictions (capacity, working hours of courier, traffic etc.).

In emerging countries, the delivery location cannot be easily deduced from the address provided by the customer. Addresses can follow different local address logics or can take a form such as "Slip road from Megenagna to Imperial Hotel, In front of Anbessa Garage, P.O. Box 184 Code 1110, Addis Ababa, Ethiopia". Before an address can be used for geocoding, it has to be broken down into a structured format, analyzed, compared with existing addresses in databases, possibly updated, and then translated into a geocode. There are many possible methods to achieve this. In the course of this master thesis, the student is asked to give a scientific overview of current methods and algorithms in this context and to come up with a suitable solution incorporating the latest developments in machine learning.
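
To make the "structured format" step concrete, here is a minimal rule-based sketch applied to the example address above. It is only a baseline illustration, not the machine-learning solution the thesis asks for. The field names (`landmarks`, `po_box`, `code`, `city`, `country`) and the assumption that the last two comma-separated parts are city and country are choices made for this example.

```python
import re

# Matches "P.O. Box 184" with an optional trailing "Code 1110"
PO_BOX = re.compile(r"P\.?\s*O\.?\s*Box\s+(\d+)(?:\s+Code\s+(\d+))?", re.IGNORECASE)

def parse_address(raw):
    """Split a free-text address into a simple record (hypothetical schema)."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    record = {"landmarks": [], "po_box": None, "code": None,
              "city": None, "country": None}
    # Assumption: the last two comma-separated parts are city and country
    if len(parts) >= 2:
        record["country"] = parts.pop()
        record["city"] = parts.pop()
    for part in parts:
        match = PO_BOX.search(part)
        if match:
            record["po_box"] = match.group(1)
            record["code"] = match.group(2)
        else:
            # Everything else is kept as a free-text landmark description
            record["landmarks"].append(part)
    return record

example = ("Slip road from Megenagna to Imperial Hotel, In front of Anbessa Garage, "
           "P.O. Box 184 Code 1110, Addis Ababa, Ethiopia")
record = parse_address(example)
```

A learned approach would replace the fixed rules with a sequence labeler, but the target record could look similar.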

The student is expected to have excellent analytical skills, knowledge of how to create and describe an algorithm, and some previous machine learning experience. Knowledge of an object-oriented programming language is a plus.

The master thesis will be written in cooperation with Deutsche Post DHL. The student will work closely together with project teams in Singapore, Thailand, Malaysia, Vietnam and the global headquarters in Bonn. To apply for this master thesis, please send your CV to Gunnar Buchhold <gunnar.buchhold(at)> and briefly state your motivation and why you deem yourself suitable to cover this topic.

DHL eCommerce is the e-commerce logistics specialist of Deutsche Post DHL Group, the biggest logistics company worldwide. We offer choice, convenience, control and quality for both the merchant and the consumer. Our global team of e-commerce experts is dedicated to providing innovative solutions that create a great online shopping experience.

ma master data science theses Thesis Thesis - Bachelor Rainer
news-1816 Fri, 17 Feb 2017 11:38:00 +0000 Master thesis: Text Mining for Cyber Threat Analysis (Gemulla, Schönhofer GmbH) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-text-mining-for-cyber-threat-analysis-gemulla-schoenhofer-gmbh/ Since mid-September 2015, the threat from ransomware has grown considerably [1]. Against this background, comprehensive geographical and temporal mapping of cyber attacks and early detection of such attacks have become particularly important. Attacks on an organisation's own IT-infrastructure are typically analysed and defended against at network level. Outside an organisation's own infrastructure, other sources, e.g., news portals and social media, usually have to be used. Given the large volume and variety of this unstructured data as well as the speed with which it is generated, automated analytical procedures from the fields of text mining and machine learning to handle it are not only particularly promising but also the only practical approach. 


The work includes a review and evaluation of sources dealing with cyber-threat analysis, e.g.,

  • News websites, news portals and social media such as Facebook & Twitter
  • Pre-evaluation / Prediction / forecasting websites such as Google Trends or Europe Media Monitor
  • Reports from Computer Emergency Response Teams (CERTs)
  • Reports from anti-virus software companies, e.g., Kaspersky

The sources may contain previously evaluated and summarised results. The sources should be analysed and metadata extracted. The following directions are of particular interest:

  • Reports on new threats
  • Differentiation between duplicated confirmation and new reports
  • Sentiment analysis, classification of phishing, hoaxes & fake news
  • Regional / geographical and temporal distribution
  • Significant parties (parties issuing threats as well as those analysing / defending against threats)

Moreover, on the basis of configurable taxonomies the texts should be subjected to an entity analysis and, if possible, to relations analysis.
A data corpus, which has been created on the basis of relevant RSS feeds, is available to test the procedure and can be expanded during the work. In addition, the possibilities of adding further metadata while importing data should be investigated, e.g., designation of source / publisher, evaluation of source (reliability, trustworthiness etc.), which can then be considered when extracting the metadata.
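
The distinction between duplicated confirmations and new reports mentioned above can be approximated with a near-duplicate check. The following sketch compares word trigrams with Jaccard similarity. The function names, the threshold of 0.5 and the sample headlines are assumptions chosen for illustration, not part of the described procedure.

```python
def shingles(text, n=3):
    """Return the set of word n-grams of a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_confirmation(new_report, known_reports, threshold=0.5):
    """True if the new report largely repeats an already known report."""
    new_shingles = shingles(new_report)
    return any(jaccard(new_shingles, shingles(old)) >= threshold
               for old in known_reports)

known = ["new ransomware variant encrypts files on hospital networks in europe"]
repeated = "new ransomware variant encrypts files on hospital networks in germany"
unrelated = "phishing campaign targets bank customers with fake invoices"

repeated_is_confirmation = is_confirmation(repeated, known)
unrelated_is_confirmation = is_confirmation(unrelated, known)
```

For a large corpus, the pairwise comparison would have to be replaced by an indexed or hashed variant; the sketch only shows the decision rule.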


Detailed knowledge of text analysis / text mining as well as programming skills in Java/Scala, Python or a comparable programming language is required. Knowledge of virtualisation and databases is an advantage. In-depth knowledge of cyber security is not required.


The Master thesis is supervised by the Chair for Data Analytics (Prof. Gemulla) as well as by the Schönhofer Sales & Engineering GmbH.

Schönhofer Sales & Engineering GmbH is an innovative systems and software company. The company, which is located in Siegburg, realises complex projects and products for complex event prediction, big-data analytics and metadata processing for public sector clients, banks, insurance companies and corporates.

If you are interested in this thesis topic, please contact

Holger Krispin
Schönhofer S&E GmbH, IT-Systems area
Tel. +49 (0)2241 3099 37


[1] Ransomware: Bedrohungslage, Prävention & Reaktion. BSI-Report. March 2016

Thesis Rainer ma master data science theses Thesis - Master partner companies
news-1810 Fri, 10 Feb 2017 08:37:44 +0000 Master Thesis: Natural Language Processing and Information Retrieval for Political Science (Nanni/Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-natural-language-processing-and-information-retrieval-for-political-science-nannipo/ Target: Master

Type: Survey

Short abstract: This thesis should provide an in-depth overview of the adoption of natural language processing and information retrieval approaches in political science research (for example, adoption of supervised classifiers in content analysis tasks, use of unsupervised approaches for political scaling studies). The thesis should explain in detail all prominent tasks, stressing the benefits and current limitations of NLP and IR solutions. A very successful thesis would also include the re-implementation of one of the solutions examined and application to a different dataset (e.g. re-implementation of Wordfish and political scaling of UK parliament debates).

Thesis - Master
news-1809 Fri, 10 Feb 2017 08:34:41 +0000 Master Thesis: Cross-lingual Topic-Based Manifesto Classification (Nanni/Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-cross-lingual-topic-based-manifesto-classification-nanniponzetto/ Target: Master

Type: Experiments

Introduction/problem: Party manifestos present the vision of a specific party on different topics. Manifestos have been labeled at the topic level in English documents (e.g., US and UK party manifestos), but the same annotations are not always available in other languages.

Goal: Classify manifestos in other languages into topics, using the English manifestos as the training set.

Approach: The thesis trains a cross-lingual classifier that learns topics from English manifestos and detects topics in, for example, German manifestos, using embeddings and translation matrices. The classifier is trained on English documents (with embeddings as features) and can then be applied to texts in other languages by first using the translation matrix to map these texts into the English embedding space. Evaluation will be on the few manifestos in other languages that have been labeled with topics.
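
The mapping step can be sketched as follows. The example assumes that a translation matrix `W` has already been learned and that documents are represented by two-dimensional toy embeddings. The nearest-centroid classifier, the topic names and all numbers are stand-ins chosen to keep the example short, not the setup prescribed by the thesis.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def train_centroids(labeled_embeddings):
    """Average the English document embeddings per topic."""
    by_topic = {}
    for embedding, topic in labeled_embeddings:
        by_topic.setdefault(topic, []).append(embedding)
    return {topic: [sum(col) / len(vectors) for col in zip(*vectors)]
            for topic, vectors in by_topic.items()}

def classify(embedding, centroids):
    """Return the topic whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda topic: sq_dist(embedding, centroids[topic]))

# Toy English training data: (document embedding, topic)
english_docs = [
    ([1.0, 0.0], "economy"), ([0.9, 0.1], "economy"),
    ([0.0, 1.0], "environment"), ([0.1, 0.9], "environment"),
]
centroids = train_centroids(english_docs)

# Hypothetical translation matrix: the toy German space is the English
# space with its two axes swapped.
W = [[0.0, 1.0],
     [1.0, 0.0]]

german_doc = [0.05, 0.95]          # embedding in the German space
mapped = matvec(W, german_doc)     # embedding in the English space
predicted = classify(mapped, centroids)
```

Without the mapping, the same German vector would land closest to the wrong centroid, which is why the translation step comes before classification.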

Requirements: Skilled programmer.



Thesis - Master
news-1808 Fri, 10 Feb 2017 08:32:19 +0000 Master Thesis: Enhancing Domain Specific Entity Linking (Nanni/Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-enhancing-domain-specific-entity-linking-nanniponzetto/ Title: Enhancing Domain Specific Entity Linking

Target: Master

Type: Experiments

Introduction/problem: Entity linking could improve text exploration and information retrieval in the digital humanities (DH). However, DH researchers are currently limited to annotating texts with links to entities that are represented in Wikipedia.

Goal: Develop a domain-adaptable entity linking pipeline and evaluate performances.

Approach: Re-implementation of the TagMe algorithm and development of a pipeline in which the knowledge base can be easily changed.

Requirements: Confidence in programming (web scraping, Wikipedia dump processing) and knowledge of supervised classification algorithms and text processing.



Thesis - Master
news-1788 Thu, 19 Jan 2017 13:46:23 +0000 Master Thesis: Extraction and decoration of social networks users' needs (Faralli/Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-extraction-and-decoration-of-social-networks-users-needs-faralliponzetto/ In recent years, thanks to the success of social networks (e.g., Facebook, Twitter, …) and the availability of collaborative customer-made company portals (e.g., Amazon product reviews, TripAdvisor, …), companies can analyze customers’ opinions on existing products from both unstructured text (social network posts) and structured repositories (product reviews and specifications). The aims of the thesis are: i) to investigate new models to represent users’ needs; ii) to identify rational (e.g., “My iPhone is too expensive!”) and irrational opinions (e.g., “I’m not happy with my iPhone”) on products and/or product aspects (e.g., “iPhone battery”); iii) to automatically identify product properties (e.g., “Battery life”) and property values (e.g., “3h standby”).



Thesis - Master
news-753 Tue, 20 Dec 2016 09:03:00 +0000 Master Thesis: Integrating Product Data into a Global Product Catalog (Bizer/Primpeli) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-integrating-schemaorg-product-data-into-a-global-product-catalog-bizerprimpeli/ A large number of e-shops have started to mark up structured data about products, offers and reviews in their HTML pages using the markup standard Microdata and the schema.org vocabulary.

In the context of the WebDataCommons project we have extracted a large corpus of product data from the Common Crawl web corpus. The product data corpus is found here (177,000,000 product records, 143,000,000 offers, 20,000,000 reviews). In general, products are described on the Web with only a few properties, including a free-text description of the product. A relatively small number of e-shops also publish product identifiers (productID, gtin13, mpn) as well as categorization information (category) for their products or offers.

The aim of this thesis is to develop methods for integrating product data into a global product catalog covering a single or multiple product categories.

The thesis would focus on developing and evaluating methods in one or more of the following areas:

  • Feature Extraction: Develop methods for extracting product features (brand, screen size, memory size, …) from the textual product descriptions. The methods can use existing product catalogs or product IDs published by multiple shops as a source of supervision for learning feature extractors.
  • Identity Resolution: Develop identity resolution methods for finding out which e-shops sell the same product. The methods can use product IDs published by multiple shops as a source of supervision for learning identity resolution heuristics.
  • Product categorization: Develop methods for assigning the products into a product hierarchy.  The methods can use existing product classifications, classification information that is published by the e-shops as well as product IDs published by multiple shops as a source of supervision. 

You will first work with a subset of the data. Once the methods work for the subset, you will be given the necessary compute power (locally at DWS or on Amazon EC2) to apply your methods to the complete dataset.

Your skills:

  • Preferred Expertise: Programming (Java or another language), Data Mining, Databases; NLP is a plus.
  • Relevant Lectures: IE 500 Data Mining, IE 670 Web Data Integration, IE 671 Web Mining, IE 663 Information Retrieval

For more information please contact Christian Bizer or Anna Primpeli.


Thesis - Master Thesis Chris
news-1188 Sun, 18 Dec 2016 09:01:00 +0000 Master Thesis: Design and Implementation of a Data Integration Extension for RapidMiner (Bizer/Lehmberg) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-design-and-implementation-of-a-data-integration-extension-for-rapidminer-bizerlehmb/ Data integration problems arise whenever data from separate sources needs to be combined as the basis for new applications. Within the context of the Web, data integration techniques form the foundation for taking advantage of the ever growing number of publicly-accessible data sources and for enabling applications such as product comparison portals, location-based mashups, and data search engines.

Many data integration solutions, however, require a high level of technical understanding from their users. There is currently no tool that allows a user to integrate datasets in an ad-hoc way without requiring deep knowledge of the underlying process.

In the area of data mining, the same situation existed and the data mining tool RapidMiner provided a solution with a graphical user interface and easy-to-use operators. Your task is to develop an extension for RapidMiner that contains operators for data integration tasks such as Identity Resolution, Schema Matching and Data Fusion. These operators should allow any user who is comfortable with using RapidMiner to perform a full data integration process. The implementation of the algorithms is provided by the framework used in the IE 670 Web Data Integration lecture.

The aim of this thesis is twofold:

  1. Develop a RapidMiner extension for Data Integration algorithms based on an existing Java framework that implements the algorithms

  2. Evaluate the extension with multiple use cases with respect to the data integration result and extension usability

Your skills:

For more information please contact Christian Bizer or Oliver Lehmberg.


[1] AnHai Doan, Alon Halevy, Zachary Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
[2] Ulf Leser, Felix Naumann: Informationsintegration. Dpunkt Verlag, 2007.
[3] Luna Dong, Divesh Srivastava: Big Data Integration. Morgan & Claypool, 2015.
[4] Serge Abiteboul, et al: Web Data Management. Cambridge University Press, 2012.
[5] Jérôme Euzenat, Pavel Shvaiko: Ontology Matching. Springer, 2007.
[6] Felix Naumann: An Introduction to Duplicate Detection. Morgan & Claypool, 2012.
[7] Peter Christen: Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.
[8] S. Kirstein, S. Land, D. Halfkann: RapidMiner 7 How to extend RapidMiner
[9] RapidMiner

Thesis - Master Thesis Chris
news-1735 Fri, 04 Nov 2016 11:18:24 +0000 Master Thesis: Recurrent Neural Networks for Natural Language Processing (Glavaš, Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-recurrent-neural-networks-for-natural-language-processing-glavas-ponzetto/ This thesis should provide an in-depth overview of the various recurrent neural network models (fully recurrent networks, recursive networks, long short-term memory networks, etc.) and their variants (bidirectionality, attention-based extensions) that are used in different natural language processing tasks. The thesis should analyse in detail all relevant models (emphasizing the advantages and shortcomings of each of them) and the NLP tasks in which these models have been successfully applied, i.e., tasks in which these models achieve state-of-the-art performance. The candidate is also expected to focus on the implementation of one of the state-of-the-art RNN models and its application to one particular NLP task.

Thesis - Master Simone
news-1734 Fri, 04 Nov 2016 11:16:15 +0000 Master Thesis: Methods for Embedding Knowledge Graphs/Bases into Semantic Vector Spaces (Glavaš, Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-methods-for-embedding-knowledge-graphsbases-into-semantic-vector-spaces-glavas-po/ This thesis should provide an in-depth overview of the state-of-the-art methods for representing knowledge graphs and knowledge bases (i.e., concepts from their nodes and relations from their edges) in a continuous vector space. The thesis should explain in detail all prominent models, stressing the advantages and shortcomings of each of them. Additionally, the candidate is expected to focus on the implementation of one of the state-of-the-art models and its application to one particular knowledge graph.
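
To illustrate what such an embedding looks like, the sketch below uses the scoring idea of TransE, a well-known translation-based model that the announcement does not name and that is picked here only as an example. A triple (head, relation, tail) is considered plausible when head + relation lies close to tail in the vector space. The entity names and vectors are hand-set toy values, not trained embeddings.

```python
import math

def transe_distance(head, relation, tail):
    """Euclidean distance between (head + relation) and tail; lower is more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Hand-set toy vectors (assumptions for illustration only)
entities = {
    "Mannheim": [0.0, 0.0],
    "Germany": [1.0, 1.0],
    "France": [3.0, -1.0],
}
relations = {"located_in": [1.0, 1.0]}

def rank_tails(head, relation):
    """Rank all other entities as candidate tails for (head, relation, ?)."""
    h, r = entities[head], relations[relation]
    candidates = [name for name in entities if name != head]
    return sorted(candidates, key=lambda name: transe_distance(h, r, entities[name]))

ranking = rank_tails("Mannheim", "located_in")
```

In a real model the vectors are learned so that observed triples receive smaller distances than corrupted ones; other model families covered by such a survey use different scoring functions.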



Thesis - Master Simone
news-1733 Fri, 04 Nov 2016 11:13:01 +0000 Master Thesis: Linking Social Network profiles to Wikipedia (Faralli, Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-linking-social-network-profiles-to-wikipedia-faralli-ponzetto/ Social networks are of high interest for many applications, ranging from simple user profiling to user-customized advertisement. In this thesis, we will explore methods for linking social network users (e.g., from Twitter and Facebook) to existing knowledge bases (e.g., Wikipedia).

Thesis - Master Simone
news-1731 Fri, 04 Nov 2016 11:10:25 +0000 Master Thesis: Continuous Emotions detection from live speech (Faralli, Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-continuous-emotions-detection-from-live-speech-faralli-ponzetto/ Continuous emotion detection is a core aspect of many real-world applications. In this work we will experiment with an existing interactive installation used for therapy treatments. The goals of this thesis are: i) to improve the existing speech-to-text module (now implemented with the help of the Google API); ii) to investigate and design a new model of emotions including sarcasm inflection.



Thesis - Master Thesis Simone
news-1732 Fri, 04 Nov 2016 11:10:06 +0000 Master Thesis: Event/Topic Detection and Tracking from German News (Glavaš, Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-eventtopic-detection-and-tracking-from-german-news-glavas-ponzetto/ The goal of this thesis is to organize news from German news outlets so as to detect events and salient topics in the news. A successful thesis would include the following steps: (1) crawling news stories from several German news outlets, (2) implementing algorithms for new event detection and for recognizing news belonging to previously identified events, (3) implementing methods for analyzing events in terms of named entities (participants, location, time) and keywords/keyphrases, and (4) building a prototype system (user interface) that presents the trending events and their information to the end user.
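
Step (2) can be sketched as a single-pass clustering: each incoming story is compared with the events seen so far and either joins the most similar event or starts a new one. The bag-of-words representation, the cosine threshold of 0.4 and the English toy headlines are assumptions for illustration; they are not the algorithms the thesis would finally implement.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts of a story."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def assign_events(stories, threshold=0.4):
    """Assign each story to an existing event or open a new one."""
    events = []        # one aggregated term-count vector per event
    assignments = []   # event id for each story, in input order
    for story in stories:
        vec = vectorize(story)
        best_id, best_sim = None, 0.0
        for event_id, event_vec in enumerate(events):
            sim = cosine(vec, event_vec)
            if sim > best_sim:
                best_id, best_sim = event_id, sim
        if best_id is not None and best_sim >= threshold:
            events[best_id].update(vec)   # story belongs to a known event
            assignments.append(best_id)
        else:
            events.append(Counter(vec))   # story starts a new event
            assignments.append(len(events) - 1)
    return assignments

stories = [
    "flood hits city centre after heavy rain",
    "heavy rain causes flood in city centre",
    "parliament votes on new budget",
]
assignments = assign_events(stories)
```

Here the first two stories end up in the same event and the third opens a new one. TF-IDF weighting and a time window are common refinements of this scheme.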



Thesis - Master Simone
news-1730 Fri, 04 Nov 2016 11:06:11 +0000 Master thesis: Transfer Learning for Text Classification with Convolutional Neural Networks (Glavaš, Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-transfer-learning-for-text-classification-with-convolutional-neural-networks-glavas/ Convolutional neural networks have been shown to be very successful in various text classification tasks. The main shortcoming of CNNs used for text classification is that they, like most neural models, require a large number of annotated instances (i.e., training examples) in order to achieve solid classification performance. The goal of this thesis is to explore and experiment with several transfer learning techniques at different network layers that would allow for a smaller number of examples in each particular classification task. Transfer learning means that some of the parameters trained on one dataset can be set as initial parameter values for the CNN trained for another classification task. The underlying assumption is that the early-layer parameters of the CNN (e.g., the semantic vectors of words used as input) are general and transferable across domains. The thesis should include the development of custom CNNs, the implementation of transfer learning and a proper evaluation. All of these steps should be extensively described and documented in the thesis itself.
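
The parameter hand-over described above can be pictured with a framework-free sketch in which a model is just a dictionary of layers. Only the initialization step is shown, and no training takes place. The layer names, sizes and the helper `transfer` are hypothetical and stand in for whatever deep learning framework the thesis would use.

```python
import copy
import random

def init_model(vocab_size, embed_dim, n_classes, seed, n_filters=4):
    """Create randomly initialized parameters for a toy text CNN."""
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "embedding": matrix(vocab_size, embed_dim),   # early, general layer
        "conv_filters": matrix(n_filters, embed_dim),
        "output": matrix(n_classes, n_filters),       # task-specific layer
    }

def transfer(source, target, layers=("embedding",)):
    """Return a copy of `target` whose named layers start from `source` values."""
    model = copy.deepcopy(target)
    for name in layers:
        model[name] = copy.deepcopy(source[name])
    return model

# Source task with 2 classes (stands in for a model already trained elsewhere)
source_model = init_model(vocab_size=5, embed_dim=3, n_classes=2, seed=1)
# Target task with 3 classes and few labeled examples
target_model = init_model(vocab_size=5, embed_dim=3, n_classes=3, seed=2)

transferred = transfer(source_model, target_model, layers=("embedding",))
```

Passing `layers=("embedding", "conv_filters")` would transfer a deeper part of the network, which is the kind of layer-wise comparison the thesis is meant to explore.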

Thesis - Master Simone
news-1729 Fri, 04 Nov 2016 11:06:02 +0000 Master Thesis: Linking a Web-scale collection of isa relations to DBPedia (Faralli, Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-linking-a-web-scale-collection-of-isa-relations-to-dbpedia-faralli-ponzetto/ Recently the DWS group released a huge repository of hypernymy relations extracted from the Web, the WebIsADb, containing a large amount of relations between pairs of lexical terms, e.g., (“Katy Perry”, “celebrity”). In this work we aim at linking the two arguments of these relations to DBpedia concepts, e.g., “Katy Perry” and “celebrity” to their corresponding DBpedia resources.

Thesis - Master Simone
news-1727 Fri, 04 Nov 2016 10:29:11 +0000 Bachelor Thesis: Analysis of Bitcoin Transactions (Gemulla, Schönhofer GmbH) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/bachelorarbeit-analyse-von-bitcoin-transaktionen-gemulla-schoenhofer-gmbh/ MOTIVATION
Bitcoin is a virtual currency with a market value of approx. 10 billion US$ and approx. 200,000 transactions per day. Bitcoin is based on blockchain technology. An analysis of Bitcoin transactions can provide insights into, or indications of,

  • Typical actors (e.g., exchange offices, online casinos)
  • Typical transaction patterns (e.g., also indications of illegal activities)
  • Statistics and monitoring information (e.g., service quality of transaction processing)

Some of the insights and of the underlying analysis methods can also be applied to other implementations of blockchain technology (e.g., Ethereum).

The thesis should provide an overview of the currently available tools for and results of the analysis of Bitcoin transactions. The Bachelor thesis should investigate

  • which interfaces can be used to access Bitcoin data,
  • which analysis tools exist,
  • which goals or tasks these tools address,
  • which analysis methods are used in the tools,
  • which results have been or can be achieved with them,
  • optionally also, which further tools and libraries (e.g., for event stream processing, data storage, interactive graphical visualization) are used in this environment.

Ideally, the thesis concludes with an assessment of the results found with respect to functional scope, maturity of development, reliability of the platform or projects, and the architectural constraints that have to be observed when using them.

In-depth knowledge of statistical data analysis as well as programming skills in Java or a comparable programming language are required.

An overview of available open-source solutions as well as their functional scope and limitations is helpful. Since the relevant literature is essentially available only in English, corresponding language skills are required for working on this topic.

The Bachelor thesis is supervised by the Chair for Data Analytics (Prof. Gemulla) as well as by Schönhofer GmbH.

Schönhofer is an innovative systems and software company. The company, which is located in Siegburg, realises complex projects and products in the areas of complex event prediction, big-data analytics and metadata processing for public sector clients, banks, insurance companies and corporates.

If you are interested, please first contact

Dr. Wolfgang Schneider
IT-Systems area
Tel. 02241 3099 33

Thesis - Bachelor Thesis Rainer
news-1726 Thu, 03 Nov 2016 13:40:22 +0000 Bachelor Thesis: Improving the Annotation of Images from News Media (Weiland, Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/bachelor-thesis-improving-the-annotation-of-images-from-news-media-weiland-ponzetto/ In this thesis we will build upon and extend an annotation tool to conduct a user study and better understand the requirements towards image understanding from news media. We plan to focus, in particular, on complex topics such as global warming, biodiversity, and sustainability.

Thesis - Bachelor Thesis Simone
news-1104 Thu, 03 Nov 2016 13:37:00 +0000 Master Thesis: Understanding Images from News Media (Dietz, Ponzetto, Weiland) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-understanding-images-from-news-media-dietz-ponzetto-weiland/ Object detection in images from news articles is a very challenging task. On the one hand, training data for object detectors is only available for a limited number of classes such as persons, bikes, or oranges. On the other hand, detection is complicated by complex scenes with different objects in front of important backgrounds.

The task of this thesis is to train new object detectors for classes that are important for understanding news articles on the topics of global warming, biodiversity, and sustainability.

If interested, please contact Prof. Simone Paolo Ponzetto.

Thesis Thesis - Master Simone
news-1736 Thu, 03 Nov 2016 12:01:00 +0000 Master Thesis: Speculation detection in political speeches (Štajner, Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-speculation-detection-in-political-speeches-stajner-ponzetto/ Introduction/problem: Speculation/hedging/vagueness identification plays a significant role in many applications, e.g., information extraction, machine translation, and text simplification.

Goal: Automatic identification of sentences which contain speculation/hedging or are vague, as those sentences need special care when being translated or when used in information extraction systems (i.e. we usually just want information that is certain and not speculations and hypotheses).

Approach: Knowledge-rich speculation detection approach on political speeches.

Additional goals: Direct comparison of systems built for different domains (political speeches vs. Wikipedia).

Requirements: Basic knowledge of supervised classification algorithms and text processing (tokenisation, lemmatisation, etc.)

Thesis - Master Simone Thesis
news-1721 Thu, 03 Nov 2016 08:15:59 +0000 Master Thesis: Deciphering Abbreviations in Medieval Legal Texts (Ponzetto, Kümper) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-deciphering-abbreviations-in-medieval-legal-texts-ponzetto-kuemper/ The "ius commune" or "learned laws" (= "Roman and canon law" of the Middle Ages) are full of citations that follow a set of generally common rules, which in turn differ in their actual spelling, sequencing and other details. This repeatedly makes it hard for scholars to resolve actual citations to precise modern conventions. Often, one finds oneself searching through the actual reference law codes again to find what was meant by the medieval scholar.

This thesis will look at ways to deploy state-of-the-art methods for identifying abbreviation definitions in medieval legal texts (written in Latin) and assess the challenges of their application in this specific domain, as well as build a tool that provides easy access to this functionality for humanities scholars.

This thesis is offered in collaboration with the Lehrstuhl für Geschichte des Spätmittelalters und der frühen Neuzeit (Prof. Dr. Hiram Kümper).

Thesis - Master Thesis Simone
news-1664 Mon, 01 Aug 2016 18:13:59 +0000 Master Thesis: Adaptive query generation for finding customers’ hot topics (Ponzetto) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-adaptive-query-generation-for-finding-customers-hot-topics-ponzetto/ The Web offers a goldmine of information describing a multitude of companies whose products and services can be potentially matched against Web users’ profiles (e.g., Twitter or Facebook profiles) in order to raise their consumer interest. However, searching the Web poses non-trivial challenges due to its large size as well as its noisy and heterogeneous content: in fact, DIY Web search engine building is, of course, impractical in most, if not all scenarios, due to a variety of scalability and other engineering issues.

In this thesis we will focus on learning user queries for lead enrichment. To this end, different methods will be explored to build a query generation engine that adapts to different users' profiles and automatically generates Web search queries. When used with a general-purpose search engine like Google or Bing, these queries should retrieve Web documents from the websites of companies that provide products or services of interest to a potential customer.


This thesis is offered in collaboration with the GMS department of Siemens AG. Global Marketing Services (GMS) is Siemens' in-house partner for sales and marketing topics across Siemens. Global market research projects, sales and marketing concepts, customer loyalty projects, lead generation, market potential models and automated sales solutions, as well as sales management via dashboards and tablets, all form part of their innovative and highly specialized portfolio, which is available to all Siemens divisions and regions.


Thesis - Master Thesis Simone Topics - Artificial Intelligence (NLP)
news-1648 Fri, 15 Jul 2016 09:31:09 +0000 Master Thesis: Dirty cheap text classification from the CommonCrawl (Ponzetto, Glavas) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-dirty-cheap-text-classification-from-the-commoncrawl-ponzetto-glavas/ Recently, there has been much interest in exploiting Web-scale resources like the CommonCrawl for intelligent text processing and information extraction -- e.g., see our WebIsaDB.

In this thesis, we will look at ways to exploit "cheap" heuristics in order to collect training data for building supervised text classifiers from very large amounts of text. The key objective is to improve the performance of standard supervised methods by automatically harvesting high-quality labeled data from the Web in a simple, yet effective fashion.

If interested, please contact Prof. Simone Paolo Ponzetto.

Thesis - Master Thesis Simone
news-1647 Fri, 15 Jul 2016 09:23:45 +0000 Master Thesis: Multilingual WebIsA Database (Ponzetto, Faralli) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-multilingual-webisa-database-ponzetto-faralli/ Recently, we started investigating methods and frameworks to automatically extract high-quality hypernym relations from Web-scale amounts of data, such as those found in the publicly available CommonCrawl. The result is the so-called WebIsA database (available here).

This thesis will look at ways to extend our pattern-based framework to languages other than English, e.g., Spanish, Arabic, etc.
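To make the idea of pattern-based hypernym extraction concrete, here is a minimal sketch in Python. It is not the WebIsA framework itself, whose pattern set is not shown in this announcement. It assumes a single illustrative English pattern ("X such as A, B and C") and single-word lowercase terms; the names `SUCH_AS` and `extract_isa` are hypothetical.

```python
import re

# One illustrative English pattern (an assumption, not the WebIsA pattern set):
#   "<hypernym> such as <hyponym>(, <hyponym>)* ((,)? and|or <hyponym>)?"
# Terms are restricted to single lowercase words to keep the sketch short.
SUCH_AS = re.compile(
    r"\b(?P<hyper>[a-z]+) such as "
    r"(?P<hypos>[a-z]+(?:, (?!and\b|or\b)[a-z]+)*(?:,? (?:and|or) [a-z]+)?)"
)


def extract_isa(sentence: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs found by the single 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence.lower()):
        # Split the enumeration on ", ", " and ", " or " (with optional Oxford comma).
        hypos = re.split(r",? (?:and|or) |, ", match.group("hypos"))
        pairs.extend((hypo, match.group("hyper")) for hypo in hypos if hypo)
    return pairs


if __name__ == "__main__":
    for pair in extract_isa("Languages such as Spanish, Arabic and German are covered."):
        print(pair)
```

Under these assumptions, supporting a new language would mean supplying language-specific patterns of this kind, which is where most of the work lies: word order, articles, and morphology differ from English.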

If interested, please contact Prof. Simone Paolo Ponzetto.

Thesis Thesis - Master Simone
news-1200 Mon, 07 Mar 2016 08:13:00 +0000 Bachelor Thesis: Distributed Text Recognition for a Historical Newspaper http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/bachelorarbeit-verteilte-texterkennung-einer-historischen-zeitung/ The bachelor thesis is jointly supervised by Prof. Rainer Gemulla (Lehrstuhl für Praktische Informatik I) and Stefan Weil (UB Mannheim). The UB Mannheim holds image data for more than 700,000 pages of the newspaper Deutscher Reichsanzeiger und Preussischer Staatsanzeiger, which appeared under various titles between 1819 and 1945. To make this archive accessible for further research, the text contained in the pages is to be recognized and made machine-readable (OCR, optical character recognition).

To do so, more than 350,000 TIFF image files (>20 TB of data) have to be processed. A single machine needs about 10 minutes to run text recognition on one image file with common OCR software; processing all image files this way would take just under 7 years.
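The estimate can be checked with a few lines of Python. The file count and per-file time come from the text above; the cluster size of 100 machines and the assumption of perfect parallelism without overhead are purely illustrative.

```python
# Figures taken from the announcement text.
NUM_IMAGES = 350_000        # TIFF files to process
MINUTES_PER_IMAGE = 10      # approximate OCR time per file on one machine

MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365


def sequential_years(num_images: int, minutes_per_image: float) -> float:
    """Years needed when a single machine processes all images one by one."""
    return num_images * minutes_per_image / MINUTES_PER_YEAR


def parallel_days(num_images: int, minutes_per_image: float, workers: int) -> float:
    """Days needed with `workers` equally fast machines and no overhead (idealized)."""
    return num_images * minutes_per_image / workers / MINUTES_PER_DAY


years = sequential_years(NUM_IMAGES, MINUTES_PER_IMAGE)
# 100 workers is a hypothetical cluster size, not a number from the announcement.
days = parallel_days(NUM_IMAGES, MINUTES_PER_IMAGE, workers=100)

print(f"single machine: {years:.2f} years")
print(f"100 machines:   {days:.1f} days")
```

A real volunteer cluster would not reach the idealized figure, since nodes differ in speed and availability and the image data has to be distributed to them.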

The goal of the bachelor thesis is to build a cluster of PCs similar to SETI@Home. For example, the public PCs in all library areas, the workstations of library staff (provided they agree), and machines of other volunteers could be used. The cluster nodes are therefore somewhat heterogeneous, e.g., they differ in hardware equipment and performance as well as in operating system.

The cluster could be built on free software such as BOINC [2,3] or Docker Swarm [3,4]. The OCR will likewise use free software (Tesseract [5], possibly OCRopus [6], plus image preprocessing software). Besides the purely technical challenges, topics such as security and acceptance also have to be considered.

If you are interested or have questions, please contact Rainer Gemulla or Stefan Weil.

Rainer Thesis Thesis - Bachelor
news-1149 Mon, 07 Mar 2016 07:20:00 +0000 Bachelor Thesis: Image Preprocessing for Automatic Text Recognition (OMR) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/bachelorarbeit-vorverarbeitung-von-bildern-fuer-automatische-texterkennung-omr/ The bachelor thesis is jointly supervised by Prof. Rainer Gemulla (Lehrstuhl für Praktische Informatik I) and Dr. Philipp Zumstein (UB Mannheim). The UB Mannheim holds image data for more than 700,000 pages of the newspaper Deutscher Reichsanzeiger und Preussischer Staatsanzeiger. To make this archive accessible for further research, the text contained in the pages is to be recognized and made machine-readable (OCR, optical character recognition).

This thesis will investigate whether and to what extent the quality of common OCR software can be improved by suitable preprocessing of the images. This includes, for example, techniques for splitting scans into pages, de-warping [1], detection of non-text regions [2], and binarization [3]. Both commercial OCR programs such as ABBYY FineReader and free OCR software such as Tesseract with Leptonica or OCRopus already perform some preprocessing. However, this preprocessing is only partly effective and may be improved further (among other things, by preprocessing steps developed specifically for this dataset). To this end, various preprocessing steps are to be proposed and evaluated with respect to their effectiveness.
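As an illustration of one such preprocessing step, the following Python sketch performs global binarization with Otsu's threshold. This is a standard textbook technique used here as a stand-in; it is not the method of reference [3] and not what Tesseract or OCRopus do internally. The image is assumed to be a flat list of 8-bit grey levels, and the function names are hypothetical.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the grey level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, all others the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0      # number of pixels with value <= t
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0 or n_light == 0:
            continue  # one class is empty, variance is undefined
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance, up to the constant factor 1 / total**2.
        between_var = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map every pixel to black (0) or white (255) using the given threshold."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    scan = [10, 12, 11, 200, 210, 205]  # toy data: dark ink on a light page
    t = otsu_threshold(scan)
    print(t, binarize(scan, t))
```

A single global threshold tends to struggle with uneven illumination and stains, which is one reason methods designed for historical and degraded documents exist.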

If you are interested or have questions, please contact Rainer Gemulla or Philipp Zumstein.

[1] Le, Thoma, Wechsler (1994): Automated page orientation and skew angle detection for binary document images.

[2] Bukhari, Al Azawi, Shafait, Breuel (2010): Document Image Segmentation using Discriminative Learning over Connected Components.

[3] Gatos, Pratikakis, Perantonis (2008): Efficient Binarization of Historical and Degraded Document Images.

Thesis - Bachelor Thesis Rainer
news-710 Tue, 26 May 2015 12:40:00 +0000 Master Thesis: Multilingual Entity Linking (Ponzetto, Bizer) http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/master-thesis-multilingual-entity-linking-ponzetto-bizer/ Entity linking, the task of linking mentions of entities in text to wide-coverage concept repositories like DBpedia or Freebase, has so far concentrated almost exclusively on English [1]. This is reflected in the available taggers, most of which work only on English, which is a major limitation for the multilingual Web of data. The goal of this thesis, accordingly, is to extend existing taggers such as DBpedia Spotlight [2] to a wide range of languages other than English.


  • Solid programming skills
  • Experience with, or genuine interest in, working with large datasets
  • Previous knowledge of LOD, NLP and Machine Learning is a plus



[1] A framework for benchmarking entity-annotation systems. M. Cornolti, P. Ferragina and M. Ciaramita. In WWW-13

[2] DBpedia Spotlight: Shedding Light on the Web of Documents. P.N. Mendes, Max Jakob, A. García-Silva and C. Bizer. In I-Semantics-11


Contact: Prof. Dr. Bizer or Prof. Dr. Ponzetto

Topics Chris Simone Topics - Artificial Intelligence (NLP) Thesis - Master
news-886 Tue, 04 Feb 2014 10:05:00 +0000 Bachelor Thesis (Meilicke): Development of a Game AI http://dws.informatik.uni-mannheim.deen/thesis/singleview/detail/News/bachelorarbeit-meilicke-entwicklung-einer-spiele-ki/ The KI (artificial intelligence) course introduces, among other things, fundamental methods for developing a game AI. Students who have attended the course can write a bachelor thesis in which an AI for one of the games in the following list is developed and evaluated.

  • 6 nimmt
  • Qwirkle
  • Lost Cities

To this end, suitable methods have to be identified, adapted, and extended. Besides the implementation, the thesis also includes an evaluation of the AI's playing strength. In consultation with the supervisor, the listed games can also be replaced by other games.

Note: This bachelor thesis topic requires successful completion of the KI course!

Supervisors: Christian Meilicke / Jörg Schönfisch

Thesis - Bachelor Thesis