Master Thesis: Integrating Product Data using Supervision from the Web (Bizer/Paulheim/Primpeli)
Tue, 10 Oct 2017

A large number of e-shops have started to mark up structured data about products and offers in their HTML pages using the Microdata markup standard and the schema.org vocabulary.

In the context of the WebDataCommons project we have extracted a large corpus of product data from the Common Crawl web corpus. The resulting corpus contains 682 million product records and 497 million offers. A relatively small number of e-shops also publish product identifiers, which are indicated with one of the following schema.org properties: sku, productID, mpn, identifier, gtin14, gtin13, gtin12, and gtin8.

The aim of this thesis is to analyze and evaluate the utility of product identifiers found on the Web as supervision for matching product descriptions. More concretely, the goal of the thesis is to investigate whether it is possible to learn enough product characteristics from the small set of e-shops that do provide product identifiers in order to detect the same products on websites that do not provide identifiers.

The tasks involved in the thesis would be:

  • Analysis of Product Identifiers: Analyze the distribution of product identifiers published on the Web. This involves the identification of product entities and product categories for which identifiers are more frequently assigned.
  • Identity Resolution: Develop identity resolution methods for finding out which e-shops sell the same product. Product identifiers will be used as a source of supervision in order to learn classification models. The learned models will be evaluated in terms of how well they can generalize to products without assigned identifiers (see the sketch after this list).
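
A minimal sketch of how such identifier-based supervision could be set up (the record fields, the single Jaccard title feature, and the use of scikit-learn are illustrative assumptions, not the prescribed approach):

    from itertools import combinations
    from sklearn.linear_model import LogisticRegression

    def jaccard(a, b):
        # Token overlap between two product titles.
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    # Toy records from identifier-providing e-shops.
    records = [
        {"title": "Acme Phone X 64GB black", "gtin13": "1111111111111"},
        {"title": "ACME Phone X (64GB, black)", "gtin13": "1111111111111"},
        {"title": "Acme Phone Y 128GB silver", "gtin13": "2222222222222"},
    ]

    # Shared identifiers yield labeled match/non-match pairs "for free".
    X, y = [], []
    for r1, r2 in combinations(records, 2):
        X.append([jaccard(r1["title"], r2["title"])])
        y.append(int(r1["gtin13"] == r2["gtin13"]))

    matcher = LogisticRegression().fit(X, y)

    # The learned matcher can then score record pairs from shops without identifiers.
    pair = [jaccard("Acme Phone X black 64GB", "ACME Phone X (64GB, black)")]
    print(matcher.predict_proba([pair]))

In the thesis itself, richer features and a systematic evaluation of how well such models generalize to identifier-free shops would replace this toy setup.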


Your skills:

  • Preferred Expertise: Programming (Java or another language) and Data Mining; NLP is a plus.
  • Relevant Lectures: IE 500 Data Mining, IE 670 Web Data Integration, IE 671 Web Mining, IE 663 Information Retrieval

For more information, please contact Christian Bizer, Heiko Paulheim, or Anna Primpeli.

Master thesis: Address management and geocoding (Gemulla, DHL)
Thu, 07 Sep 2017

eCommerce is on the rise. Logistics companies like Deutsche Post DHL are expanding and building up new logistics networks in emerging markets around the globe. Delivering parcels to end consumers quickly, reliably and efficiently with an outstanding service requires considerable planning effort. This is especially complex in emerging countries, where the infrastructure poses additional challenges. One essential part of making delivery more efficient is route optimization of the driven tours on the "last mile". It is done using optimization algorithms that connect the delivery locations while respecting all kinds of restrictions (capacity, working hours of the courier, traffic, etc.).

In emerging countries, the delivery location cannot be easily deduced from the address provided by the customer. Addresses can follow different local address logics or can take the form of something like "Slip road from Megenagna to Imperial Hotel, In front of Anbessa Garage, P.O. Box 184 Code 1110, Addis Ababa, Ethiopia". In order to use such an address for geocoding, it first has to be broken down into a structured format, analyzed, compared with existing addresses in databases, possibly updated, and then translated into a geocode. There are many ways and possible methods to achieve this. In the course of this master thesis, the student is asked to give a scientific overview of current methods and algorithms in that context and to come up with a suitable solution incorporating the latest developments in machine learning.
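
As a toy illustration of the structuring step, a hedged sketch (the field names and keyword rules are invented for illustration; a real solution would learn such a parser rather than hand-code it):

    import re

    def parse_address(raw):
        # Very naive splitter for unstructured delivery addresses.
        # Splits on commas and picks out a few recognizable components;
        # a real solution would use sequence labeling / ML instead.
        parts = [p.strip() for p in raw.split(",")]
        parsed = {"raw": raw, "country": None, "po_box": None, "hints": []}
        for part in parts:
            if re.match(r"P\.?O\.?\s*Box", part, re.IGNORECASE):
                parsed["po_box"] = part
            elif part in {"Ethiopia", "Kenya", "Nigeria"}:  # toy country list
                parsed["country"] = part
            else:
                parsed["hints"].append(part)  # landmarks, directions, city
        return parsed

    addr = ("Slip road from Megenagna to Imperial Hotel, In front of Anbessa Garage, "
            "P.O. Box 184 Code 1110, Addis Ababa, Ethiopia")
    print(parse_address(addr))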

The student is expected to have excellent analytical skills, knowledge of how to design and describe algorithms, and some prior machine learning experience. Knowledge of an object-oriented programming language is a plus.

The master thesis will be written in cooperation with Deutsche Post DHL. The student will work closely with project teams in Singapore, Thailand, Malaysia, Vietnam and the global headquarters in Bonn. To apply for this master thesis, please send your CV to Gunnar Buchhold <gunnar.buchhold(at)dpdhl.com>, briefly stating your motivation and why you consider yourself suited to this topic.

DHL eCommerce is the e-commerce logistics specialist of Deutsche Post DHL Group, the biggest logistics company worldwide. We offer choice, convenience, control and quality for both the merchant and the consumer. Our global team of e-commerce experts is dedicated to providing innovative solutions that create a great online shopping experience.

Master thesis: Text Mining for Cyber Threat Analysis (Gemulla, Schönhofer GmbH)
Fri, 17 Feb 2017

Since mid-September 2015, the threat from ransomware has grown considerably [1]. Against this background, comprehensive geographical and temporal mapping of cyber attacks and early detection of such attacks have become particularly important. Attacks on an organisation's own IT infrastructure are typically analysed and defended against at the network level. Outside an organisation's own infrastructure, other sources, e.g., news portals and social media, usually have to be used. Given the large volume and variety of this unstructured data as well as the speed with which it is generated, automated analytical procedures from the fields of text mining and machine learning are not only particularly promising but also the only practical approach to handling it.

TASK

A review and evaluation of sources dealing with cyber-threat analysis, e.g.,

  • News websites, news portals and social media such as Facebook & Twitter
  • Pre-evaluation/prediction/forecasting websites such as Google Trends or Europe Media Monitor
  • Reports from Computer Emergency Response Teams (CERTs)
  • Reports from anti-virus software companies, e.g., Kaspersky

The sources may contain previously evaluated and summarised results. The sources should be analysed and metadata extracted. The following directions are of particular interest:

  • Reports on new threats
  • Differentiation between duplicate confirmations and new reports
  • Sentiment analysis, classification of phishing, hoaxes & fake news
  • Regional / geographical and temporal distribution
  • Significant parties (parties issuing threats as well as those analysing / defending against threats)

Moreover, on the basis of configurable taxonomies, the texts should be subjected to an entity analysis and, if possible, to relation analysis.
 
A data corpus, which has been created on the basis of relevant RSS feeds, is available to test the procedure and can be expanded during the work. In addition, the possibilities of adding further metadata while importing data should be investigated, e.g., designation of source / publisher, evaluation of source (reliability, trustworthiness etc.), which can then be considered when extracting the metadata.
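
Such a feed corpus could, for example, be assembled and enriched with source-level metadata roughly as follows (a minimal sketch using the feedparser library; the feed URL and the reliability score are made-up placeholders):

    import feedparser  # pip install feedparser

    # Hypothetical source registry: feed URL -> publisher metadata to attach.
    SOURCES = {
        "https://example.org/security-news.rss": {"publisher": "Example CERT", "reliability": 0.9},
    }

    corpus = []
    for url, meta in SOURCES.items():
        feed = feedparser.parse(url)
        for entry in feed.entries:
            corpus.append({
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published", ""),
                "text": entry.get("summary", ""),
                # Source-level metadata recorded at import time, as suggested above.
                "publisher": meta["publisher"],
                "reliability": meta["reliability"],
            })
    print(len(corpus), "documents collected")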

PREREQUISITES

Detailed knowledge of text analysis / text mining as well as programming skills in Java/Scala, Python or a comparable programming language are required. Knowledge of virtualisation and databases is an advantage. In-depth knowledge of cyber security is not required.

CONTACT

The Master thesis is supervised by the Chair of Data Analytics (Prof. Gemulla) together with Schönhofer Sales & Engineering GmbH.

Schönhofer Sales & Engineering GmbH is an innovative systems and software company. The company, which is located in Siegburg, realises complex projects and products for complex event prediction, big-data analytics and metadata processing for public sector clients, banks, insurance companies and corporates.

If you are interested in this thesis topic, please contact

Holger Krispin
Schönhofer S&E GmbH, IT-Systems area
Holger.Krispin@schoenhofer.de
Tel. +49 (0)2241 3099 37

REFERENCES

[1] Ransomware: Bedrohungslage, Prävention & Reaktion. BSI report, March 2016.

Master Thesis: Natural Language Processing and Information Retrieval for Political Science (Nanni/Ponzetto)
Fri, 10 Feb 2017

Target: Master

Type: Survey

Short abstract: This thesis should provide an in-depth overview of the adoption of natural language processing and information retrieval approaches in political science research (for example, the adoption of supervised classifiers in content analysis tasks, or the use of unsupervised approaches for political scaling studies). The thesis should explain all prominent tasks in detail, stressing the benefits and current limitations of NLP and IR solutions. A very successful thesis would also include the re-implementation of one of the solutions examined and its application to a different dataset (e.g., a re-implementation of Wordfish, sketched below, and political scaling of UK parliament debates).
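
For reference, Wordfish-style political scaling is commonly formulated as a Poisson scaling model (sketched here from the literature; notation may differ from the original paper):

    y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad \log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \, \omega_i

where y_ij is the count of word j in document i, α_i is a document fixed effect, ψ_j a word fixed effect, β_j the discrimination weight of word j, and ω_i the estimated position of document i on the latent policy scale.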


Master Thesis: Cross-lingual Topic-Based Manifesto Classification (Nanni/Ponzetto)
Fri, 10 Feb 2017

Target: Master

Type: Experiments

Introduction/problem: Party manifestos (https://manifestoproject.wzb.eu/) present the vision of a specific party on different topics. Manifestos have been labeled at the topic level for English documents (e.g., manifestos of US and UK parties), but the same annotations are not always available in other languages.

Goal: Classify manifestos in other languages by topic, using the English manifestos as the training set.

Approach: The goal of this thesis is to train a cross-lingual classifier that learns topics from English manifestos and detects them in, for example, German manifestos, using word embeddings and translation matrices. Evaluation will be on the few manifestos in other languages that have been labeled with topics. The classifier is trained on English documents (with embeddings as features) and can then be applied to texts in other languages by first using the translation matrix to map these texts into the English embedding space (see the sketch below).
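
A minimal sketch of this pipeline on toy data (random vectors stand in for real word and document embeddings; the least-squares translation matrix follows the approach popularized by Mikolov et al., which is an assumption here, not a prescription):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Toy stand-ins: 300-dim embeddings for a seed dictionary of translation pairs.
    de_seed = rng.normal(size=(1000, 300))                  # "German" word vectors
    en_seed = de_seed @ rng.normal(size=(300, 300)) * 0.1   # fake "English" counterparts

    # Mikolov-style translation matrix: least-squares fit of de -> en.
    W, *_ = np.linalg.lstsq(de_seed, en_seed, rcond=None)

    # Train a topic classifier on English document embeddings...
    en_docs = rng.normal(size=(200, 300))
    topics = rng.integers(0, 2, size=200)
    clf = LogisticRegression(max_iter=1000).fit(en_docs, topics)

    # ...and classify a German document by mapping it into the English space first.
    de_doc = rng.normal(size=(1, 300))
    print(clf.predict(de_doc @ W))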

Requirements: Skilled programmer.

Master Thesis: Enhancing Domain Specific Entity Linking (Nanni/Ponzetto)
Fri, 10 Feb 2017

Target: Master

Type: Experiments

Introduction/problem: Entity linking could improve text exploration and information retrieval in the digital humanities (DH). However, DH researchers are currently limited to annotating texts with links to entities represented in Wikipedia.

Goal: Develop a domain-adaptable entity linking pipeline and evaluate its performance.

Approach: Re-implementation of the TagMe algorithm and development of a pipeline in which the knowledge base can easily be exchanged (see the sketch below).
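
One possible shape for such a pipeline (a hedged sketch; the class and method names are invented, and the trivial candidate selection merely stands in for TagMe's commonness/coherence scoring):

    from abc import ABC, abstractmethod

    class KnowledgeBase(ABC):
        # Abstraction that lets the linking pipeline swap knowledge bases.

        @abstractmethod
        def candidates(self, mention: str) -> list[str]:
            # Return candidate entity IDs for a surface mention.
            ...

    class WikipediaKB(KnowledgeBase):
        def __init__(self, anchor_index):
            # anchor_index: mention text -> list of page titles (e.g., from a Wikipedia dump).
            self.anchor_index = anchor_index

        def candidates(self, mention):
            return self.anchor_index.get(mention.lower(), [])

    def link(text_mentions, kb: KnowledgeBase):
        # Trivial linker: pick the first candidate; TagMe would instead score
        # candidates by commonness and coherence with the other mentions.
        return {m: (kb.candidates(m) or [None])[0] for m in text_mentions}

    kb = WikipediaKB({"mannheim": ["Mannheim"], "dh": ["Digital_humanities"]})
    print(link(["Mannheim", "DH"], kb))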

Requirements: Confidence in programming (web scraping, Wikipedia dump processing) and knowledge of supervised classification algorithms and text processing.

Master Thesis: Extraction and decoration of social network users' needs (Faralli/Ponzetto)
Thu, 19 Jan 2017

In recent years, thanks to the success of social networks (e.g., Facebook, Twitter) and the availability of collaborative customer review portals (e.g., Amazon product reviews, jmcauley.ucsd.edu/data/amazon/; TripAdvisor, times.cs.uiuc.edu/~wang296/Data/), companies can analyze customers' opinions on existing products from both unstructured text (social network posts) and structured repositories (product reviews and specifications). The aims of the thesis are: i) to investigate new models to represent users' needs; ii) to identify rational (e.g., "My iPhone is too expensive!") and irrational (e.g., "I'm not happy with my iPhone") opinions on products and/or product aspects (e.g., "iPhone battery"); iii) to automatically identify product properties (e.g., "battery life") and property values (e.g., "3h standby").

44.2 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data published
Tue, 17 Jan 2017

The DWS group is happy to announce a new release of the WebDataCommons Microdata, Embedded JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the October 2016 version of the CommonCrawl covering 3.2 billion HTML pages which originate from 34 million websites (pay-level domains).

Altogether we discovered structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.6 million different pay-level domains out of the 34 million pay-level domains covered by the crawl (16.5%).

Approximately 2.5 million of these websites use Microdata, 2.1 million websites employ JSON-LD, and 938 thousand websites use RDFa. Microformats are used by over 1.6 million websites within the crawl.

Background: 

More and more websites annotate structured data within their HTML pages using markup formats such as RDFa, Microdata, embedded JSON-LD and Microformats. The annotations cover topics such as products, reviews, people, organizations, places, events, and cooking recipes.

The WebDataCommons project extracts all Microdata, RDFa data, and Microformat data, and since 2015 also embedded JSON-LD data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. 

Besides the markup data, the WebDataCommons project also provides large web table corpora and web graphs for download. General information about the WebDataCommons project is found at webdatacommons.org


Data Set Statistics: 

Basic statistics about the October 2016 Microdata, Embedded JSON-LD, RDFa and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

webdatacommons.org/structureddata/2016-10/stats/stats.html

Comparing these statistics to those about the November 2015 release of the data sets

webdatacommons.org/structureddata/2015-11/stats/stats.html

we see that the Microdata syntax remains the dominant annotation format. Although it is hard to compare the adoption of the syntax between the two years in absolute numbers, as the October 2016 crawl corpus is almost double the size of the November 2015 one, a relative increase can be observed: in the October 2016 corpus, over 44% of the pay-level domains containing markup data make use of the Microdata syntax, compared to 40% one year earlier. Although the absolute numbers for RDFa markup adoption rise, the relative increase does not keep pace with the growth of the corpus, indicating that RDFa is used by a smaller share of websites. As in the 2015 release, the adoption of embedded JSON-LD has increased considerably, even though the main focus of the annotations remains the search action offered by websites (70%).

As already observed in previous years, the schema.org vocabulary is most frequently used in the context of Microdata, while the adoption of its predecessor, the data-vocabulary.org vocabulary, continues to decrease. In the context of RDFa, we still find the Open Graph Protocol recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue. We see that, besides navigational, blog and CMS-related meta-information, many websites annotate e-commerce related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress). More concretely, the October 2016 corpus includes more than 682 million product records originating from 249 thousand websites which use the schema.org vocabulary. The new release contains postal address data for more than 291 million entities originating from 338 thousand websites. Furthermore, the content describing hotels has doubled in size in this release, with a total of 61 million hotel descriptions.

Visualizations of the main adoption trends concerning the different annotation formats as well as popular schema.org and RDFa classes within the time span 2012 to 2016 are found at

webdatacommons.org/structureddata/

Download:

The overall size of the October 2016 Microdata, RDFa, Embedded JSON-LD, and Microformat data sets is 44.2 billion RDF quads. For download, we split the data into 9,661 files with a total size of 987 GB. 

webdatacommons.org/structureddata/2016-10/stats/how_to_get_the_data.html

In addition, we have created separate files for over 40 different schema.org classes, each including all quads from pages that deploy the specific class at least once.

webdatacommons.org/structureddata/2016-10/stats/schema_org_subsets.html

Lots of thanks to: 

+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 
+ the Ministry of Economy, Research and Arts of Baden-Württemberg, which supported the extraction and analysis of the October 2016 corpus by means of the ViCe project.


Have fun with the new data set. 

Anna Primpeli, Robert Meusel and Chris Bizer

Master Thesis: Integrating Schema.org Product Data into a Global Product Catalog (Bizer/Primpeli)
Tue, 20 Dec 2016

A large number of e-shops have started to mark up structured data about products, offers and reviews in their HTML pages using the Microdata markup standard and the schema.org vocabulary.

In the context of the WebDataCommons project we have extracted a large corpus of product data from the Common Crawl web corpus. The resulting corpus contains 177 million product records, 143 million offers, and 20 million reviews. In general, schema.org products are described on the Web with only a few properties, including a free-text description of the product. A relatively small number of e-shops also publish product identifiers (productID, gtin13, mpn) as well as categorization information (category) for their products or offers.

The aim of this thesis is to develop methods for integrating schema.org product data into a global product catalog covering a single or multiple product categories.

The thesis would focus on developing and evaluating methods in one or more of the following areas:

  • Feature Extraction: Develop methods for extracting product features (brand, screen size, memory size, …) from the textual product descriptions. The methods can use existing product catalogs or product IDs published by multiple shops as a source of supervision for learning feature extractors (a toy extraction sketch follows this list).
  • Identity Resolution: Develop identity resolution methods for finding out which e-shops sell the same product. The methods can use product IDs published by multiple shops as a source of supervision for learning identity resolution heuristics.
  • Product categorization: Develop methods for assigning products to a product hierarchy. The methods can use existing product classifications, classification information published by the e-shops, as well as product IDs published by multiple shops as a source of supervision.
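
A toy illustration of the feature extraction task (the regular expressions and the brand seed list are invented for illustration; the thesis would learn such extractors from supervision instead of hand-writing them):

    import re

    # Hypothetical patterns for a notebook/phone category.
    PATTERNS = {
        "screen_size": re.compile(r'(\d{1,2}(?:\.\d)?)\s*(?:inch|"|zoll)', re.I),
        "memory": re.compile(r'(\d{1,4})\s*(gb|tb)\b', re.I),
    }
    KNOWN_BRANDS = {"acer", "asus", "lenovo", "samsung"}  # seed list, assumed given

    def extract_features(description: str) -> dict:
        feats = {}
        for name, pat in PATTERNS.items():
            m = pat.search(description)
            if m:
                feats[name] = m.group(0)
        brand = next((b for b in KNOWN_BRANDS if b in description.lower()), None)
        if brand:
            feats["brand"] = brand
        return feats

    print(extract_features("Lenovo ThinkPad, 14 inch display, 512 GB SSD"))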

You will first work with a subset of the data. Once the methods work for the subset, you will be given the necessary compute power (locally at DWS or on Amazon EC2) to apply your methods to the complete dataset.

Your skills:

  • Preferred Expertise: Programming (Java or another language), Data Mining, and Databases; NLP is a plus.
  • Relevant Lectures: IE 500 Data Mining, IE 670 Web Data Integration, IE 671 Web Mining, IE 663 Information Retrieval

For more information, please contact Christian Bizer or Anna Primpeli.

Master Thesis: Design and Implementation of a Data Integration Extension for RapidMiner (Bizer/Lehmberg)
Sun, 18 Dec 2016

Data integration problems arise whenever data from separate sources needs to be combined as the basis for new applications. Within the context of the Web, data integration techniques form the foundation for taking advantage of the ever-growing number of publicly accessible data sources and for enabling applications such as product comparison portals, location-based mashups, and data search engines.

Many data integration solutions, however, require a high level of technical understanding from their users. There is currently no tool that allows a user to integrate datasets in an ad-hoc way without requiring deep knowledge of the underlying process.

In the area of data mining, the same situation existed and the data mining tool RapidMiner provided a solution with a graphical user interface and easy-to-use operators. Your task is to develop an extension for RapidMiner that contains operators for data integration tasks such as Identity Resolution, Schema Matching and Data Fusion. These operators should allow any user who is comfortable with using RapidMiner to perform a full data integration process. The implementation of the algorithms is provided by the framework used in the IE 670 Web Data Integration lecture.

The aim of this thesis is twofold:

  1. Develop a RapidMiner extension for Data Integration algorithms based on an existing Java framework that implements the algorithms

  2. Evaluate the extension with multiple use cases with respect to the data integration result and extension usability

Your skills:

For more information, please contact Christian Bizer or Oliver Lehmberg.

References:

[1] AnHai Doan, Alon Halevy, Zachary Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
[2] Ulf Leser, Felix Naumann: Informationsintegration. Dpunkt Verlag, 2007.
[3] Luna Dong, Divesh Srivastava: Big Data Integration. Morgan & Claypool, 2015.
[4] Serge Abiteboul, et al: Web Data Management. Cambridge University Press, 2012.
[5] Jérôme Euzenat, Pavel Shvaiko: Ontology Matching. Springer, 2007.
[6] Felix Naumann: An Introduction to Duplicate Detection. Morgan & Claypool, 2012.
[7] Peter Christen: Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.
[8] S. Kirstein, S. Land, D. Halfkann: RapidMiner 7 - How to Extend RapidMiner.
[9] RapidMiner

Adtelligence - Master thesis on Reporting for Machine Learning
Mon, 21 Nov 2016

Adtelligence is an international software and technology company that was named a Technology Pioneer by the World Economic Forum in 2014 and a Technology Fast 50 Rising Star by Deloitte in 2015.
The Adtelligence Customer Intelligence and Personalization Platform uses big data and machine learning technologies to create an individual customer experience for every visitor of a website. As a result, there are no static websites anymore: like Google, every website reacts to the visitor in real time and adapts to their needs and current situation, so that every visitor is shown the content that fits them. Adtelligence supports leading eCommerce shops, banks, insurance companies, automotive and gaming companies in increasing their digital competitiveness and raises online conversion rates and revenues. Adtelligence works on the future of the Internet with more than 60 partners worldwide, such as Facebook, Google, and SAP/Hybris.

Adtelligence - Master thesis on a Validation Framework for Machine Learning
Mon, 21 Nov 2016

Adtelligence is an international software and technology company that was named a Technology Pioneer by the World Economic Forum in 2014 and a Technology Fast 50 Rising Star by Deloitte in 2015.
The Adtelligence Customer Intelligence and Personalization Platform uses big data and machine learning technologies to create an individual customer experience for every visitor of a website. As a result, there are no static websites anymore: like Google, every website reacts to the visitor in real time and adapts to their needs and current situation, so that every visitor is shown the content that fits them. Adtelligence supports leading eCommerce shops, banks, insurance companies, automotive and gaming companies in increasing their digital competitiveness and raises online conversion rates and revenues. Adtelligence works on the future of the Internet with more than 60 partners worldwide, such as Facebook, Google, and SAP/Hybris.

Master Thesis: Recurrent Neural Networks for Natural Language Processing (Glavaš, Ponzetto)
Fri, 04 Nov 2016

This thesis should provide an in-depth overview of the various recurrent neural network models (fully recurrent networks, recursive networks, long short-term memory networks, etc.) and their variants (bidirectionality, attention-based extensions) that are used in different natural language processing tasks. The thesis should analyse in detail all relevant models (emphasizing the advantages and shortcomings of each of them) and the NLP tasks in which these models have been successfully applied, i.e., tasks in which these models achieve state-of-the-art performance. The candidate is also expected to implement one of the state-of-the-art RNN models and apply it to one particular NLP task.
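
For reference, the long short-term memory cell mentioned above is commonly defined by the following gate equations (standard formulation; notation varies across papers):

    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad
    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
    \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \quad
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad
    h_t = o_t \odot \tanh(c_t)

where σ is the logistic sigmoid, ⊙ denotes element-wise multiplication, and i, f, o are the input, forget, and output gates.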

Master Thesis: Methods for Embedding Knowledge Graphs/Bases into Semantic Vector Spaces (Glavaš, Ponzetto)
Fri, 04 Nov 2016

This thesis should provide an in-depth overview of the state-of-the-art methods for representing knowledge graphs and knowledge bases (i.e., the concepts at their nodes and the relations at their edges) in a continuous vector space. The thesis should explain all prominent models in detail, stressing the advantages and shortcomings of each of them. Additionally, the candidate is expected to implement one of the state-of-the-art models and apply it to one particular knowledge graph.
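
For illustration, one prominent model in this family is TransE, which represents entities and relations as vectors and scores a candidate triple (h, r, t) by how well the relation acts as a translation between the entity embeddings:

    f(h, r, t) = -\lVert \mathbf{e}_h + \mathbf{r} - \mathbf{e}_t \rVert

so that plausible triples, for which e_h + r ≈ e_t, receive scores close to zero.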

Master Thesis: Linking Social Network profiles to Wikipedia (Faralli, Ponzetto)
Fri, 04 Nov 2016

Social networks are of high interest for many applications, ranging from simple user profiling to customized advertisement. In this thesis, we will explore methods for linking social network users (e.g., from Twitter and Facebook) to existing knowledge bases (e.g., Wikipedia).

Master Thesis: Continuous Emotions detection from live speech (Faralli, Ponzetto)
Fri, 04 Nov 2016

Continuous emotion detection is a core aspect of many real applications. In this work we will experiment with an existing interactive installation used for therapy treatments; see http://voicingelder.com/. The goals of this thesis are: i) to improve the existing speech-to-text module (currently implemented with the help of the Google API); ii) to investigate and design a new model of emotions that includes sarcasm inflection.

Master Thesis: Event/Topic Detection and Tracking from German News (Glavaš, Ponzetto)
Fri, 04 Nov 2016

The goal of this thesis is to organize news from German news outlets in such a way as to detect events and salient topics in the news. A successful thesis would include the following steps: (1) crawling news stories from several German news outlets, (2) implementing algorithms for detecting new events and recognizing news belonging to previously identified events, (3) implementing methods for analyzing events in terms of named entities (participants, location, time) and keywords/keyphrases, and (4) building a prototype system (user interface) that presents trending events and their information to the end user.

Master thesis: Transfer Learning for Text Classification with Convolutional Neural Networks (Glavaš, Ponzetto)
Fri, 04 Nov 2016

Convolutional neural networks have been shown to be very successful in various text classification tasks. The main shortcoming of CNNs used for text classification is that they, like most neural models, require a large number of annotated instances (i.e., training examples) in order to achieve solid classification performance. The goal of this thesis is to explore and experiment with several transfer learning techniques at different network layers that would allow for a smaller number of examples in each particular classification task. Transfer learning means that some of the parameters trained on one dataset can be used as initial parameter values for the CNN trained on another classification task (a minimal sketch of the idea follows). The underlying assumption is that the early-layer parameters of the CNN (e.g., the semantic vectors of the input words) are general and transferable across domains. The thesis should include the development of custom CNNs, the implementation of transfer learning, and a proper evaluation. All of these steps should be extensively described and documented in the thesis itself.
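
A minimal sketch of the parameter-transfer idea in PyTorch (the architecture and layer choices are illustrative assumptions; the thesis would experiment with which layers to transfer and whether to freeze them):

    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        # Tiny CNN for text classification (Kim-style, simplified).
        def __init__(self, vocab_size=5000, emb_dim=100, n_classes=2):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.conv = nn.Conv1d(emb_dim, 64, kernel_size=3, padding=1)
            self.fc = nn.Linear(64, n_classes)

        def forward(self, token_ids):                       # (batch, seq_len)
            x = self.emb(token_ids).transpose(1, 2)         # (batch, emb_dim, seq_len)
            x = torch.relu(self.conv(x)).max(dim=2).values  # global max pooling
            return self.fc(x)

    # Source task: assume source_model was trained on a large labeled dataset.
    source_model = TextCNN()

    # Target task: copy the (presumably general) early layers; the
    # task-specific classifier layer stays randomly initialized.
    target_model = TextCNN(n_classes=4)
    target_model.emb.load_state_dict(source_model.emb.state_dict())
    target_model.conv.load_state_dict(source_model.conv.state_dict())

    # Optionally freeze transferred layers when target data is scarce.
    for p in target_model.emb.parameters():
        p.requires_grad = False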

Master Thesis: Linking a Web-scale collection of isa relations to DBpedia (Faralli, Ponzetto)
Fri, 04 Nov 2016

Recently, the DWS group released a huge repository of hypernymy relations from the Web, the WebIsADb (http://webdatacommons.org/isadb/), containing a large number of relations between lexical pairs of terms, e.g., ("Katy Perry", "celebrity"). In this work we aim at linking the two arguments of these relations to DBpedia concepts, e.g., "Katy Perry" to the corresponding http://dbpedia.org/page/Katy_Perry and "celebrity" to http://dbpedia.org/page/Celebrity (a naive baseline is sketched below).
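
A naive baseline for the linking step, assuming the common convention that DBpedia resource URIs mirror Wikipedia page titles (the /resource namespace is the canonical RDF identifier behind the /page HTML views):

    from urllib.parse import quote

    def naive_dbpedia_link(term: str) -> str:
        # Map a surface term to a candidate DBpedia resource URI by mimicking
        # Wikipedia title conventions (capitalize words, spaces -> underscores).
        # A real linker must additionally handle ambiguity, redirects, and
        # non-existing pages.
        title = "_".join(w.capitalize() for w in term.strip().split())
        return "http://dbpedia.org/resource/" + quote(title)

    print(naive_dbpedia_link("Katy Perry"))  # http://dbpedia.org/resource/Katy_Perry
    print(naive_dbpedia_link("celebrity"))   # http://dbpedia.org/resource/Celebrity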

Bachelor thesis: Analysis of Bitcoin Transactions (Gemulla, Schönhofer GmbH)
Fri, 04 Nov 2016

MOTIVATION
Bitcoin is a virtual currency with a market value of roughly 10 billion US$ and roughly 200,000 transactions per day. Bitcoin is based on blockchain technology. An analysis of Bitcoin transactions can provide insights into, or indications of,

  • typical actors (e.g., exchange services, online casinos)
  • typical transaction patterns (including, e.g., indications of illegal activities)
  • statistics and monitoring information (e.g., the quality of service of transaction processing).

Parts of these insights, and of the underlying analysis methods, can also be applied to other implementations of blockchain technology (e.g., Ethereum).

TASK
The thesis should provide an overview of the currently available tools for, and results of, the analysis of Bitcoin transactions. The bachelor thesis should investigate

  • through which interfaces Bitcoin data can be accessed,
  • which analysis tools exist,
  • which goals and tasks these tools pursue,
  • which analysis methods the tools employ,
  • which results have been or can be achieved with them,
  • and optionally, which further tools and libraries (e.g., for event stream processing, data storage, or interactive graphical visualization) are used in this context.

Ideally, the thesis would conclude with an assessment of the findings with respect to functional scope, maturity of development, reliability of the platforms and projects, and the architectural constraints that have to be considered when using them.

PREREQUISITES
In-depth knowledge of statistical data analysis as well as programming skills in Java or a comparable programming language are required.

An overview of available open-source solutions as well as their functional scope and limitations is helpful. Since the relevant literature is essentially available only in English, corresponding language skills are required for working on this topic.

CONTACT
The bachelor thesis is supervised by the Chair of Data Analytics (Prof. Gemulla) together with Schönhofer GmbH.

Schönhofer is an innovative systems and software company. The company, located in Siegburg, realises complex projects and products for complex event prediction, big-data analytics and metadata processing for public sector clients, banks, insurance companies and corporates.

If you are interested, please first contact

Dr. Wolfgang Schneider
IT-Systems area
Wolfgang.Schneider@schoenhofer.de
Tel. 02241 3099 33

Bachelor Thesis: Improving the Annotation of Images from News Media (Weiland, Ponzetto)
Thu, 03 Nov 2016

In this thesis we will build upon and extend an annotation tool to conduct a user study and better understand the requirements for image understanding from news media. We plan to focus, in particular, on complex topics such as global warming, biodiversity, and sustainability.

Master Thesis: Understanding Images from News Media (Dietz, Ponzetto, Weiland)
Thu, 03 Nov 2016

Object detection in images from news articles is a very challenging task. On the one hand, training data for object detectors is only available for a limited number of classes such as persons, bikes, or oranges. On the other hand, the task is complicated by complex scenes of different objects in front of important backgrounds.

The task of this thesis is to train new object detectors for classes that are important for understanding news articles on the topics of global warming, biodiversity, and sustainability.

If interested, please contact Prof. Simone Paolo Ponzetto.

Master Thesis: Speculation detection in political speeches (Štajner, Ponzetto)
Thu, 03 Nov 2016

Introduction/problem: Speculation/hedging/vagueness identification plays a significant role in many applications, e.g., information extraction, machine translation, and text simplification.

Goal: Automatic identification of sentences which contain speculation/hedging or are vague, as those sentences need special care when being translated or when used in information extraction systems (i.e., we usually want only information that is certain, not speculations and hypotheses).

Approach: Knowledge-rich speculation detection approach on political speeches.

Additional goals: Direct comparison of systems built for different domains (political speeches vs. Wikipedia).

Requirements: Basic knowledge of supervised classification algorithms and text processing (tokenisation, lemmatisation, etc.)

Master Thesis: Deciphering Abbreviations in Medieval Legal Texts (Ponzetto, Kümper)
Thu, 03 Nov 2016

The "ius commune" or "learned laws" (i.e., the Roman and canon law of the Middle Ages) are full of citations which follow a set of generally common rules but differ in their actual spelling, sequencing, and other details. This repeatedly makes it hard for scholars to resolve such citations to precise modern conventions. Often, one finds oneself searching through the actual reference law codes again to find what was meant by the medieval scholar.

This thesis will look at ways to deploy state-of-the-art methods for identifying abbreviation definitions in medieval legal texts (written in Latin) and assess the challenges of their application in this specific domain, as well as to build a tool that provides easy access to this functionality for humanities scholars.

This thesis is offered in collaboration with the Lehrstuhl für Geschichte des Spätmittelalters und der frühen Neuzeit (Prof. Dr. Hiram Kümper).

Paper accepted at ICDM: What You Will Gain By Rounding: Theory and Algorithms for Rounding Rank
Tue, 04 Oct 2016

The paper "What You Will Gain By Rounding: Theory and Algorithms for Rounding Rank" by Stefan Neumann, Rainer Gemulla, and Pauli Miettinen has been accepted at the 2016 IEEE International Conference on Data Mining (ICDM).

Abstract:
When factorizing binary matrices, we often have to make a choice between using expensive combinatorial methods that retain the discrete nature of the data and using continuous methods that can be more efficient but destroy the discrete structure. Alternatively, we can first compute a continuous factorization and subsequently apply a rounding procedure to obtain a discrete representation. But what will we gain by rounding? Will this yield lower reconstruction errors? Is it easy to find a low-rank matrix that rounds to a given binary matrix? Does it matter which threshold we use for rounding? Does it matter if we allow for only non-negative factorizations? In this paper, we approach these and further questions by presenting and studying the concept of rounding rank. We show that rounding rank is related to linear classification, dimensionality reduction, and nested matrices. We also report on an extensive experimental study that compares different algorithms for finding good factorizations under the rounding rank model.

Paper accepted at ICDM: DESQ: Frequent Sequence Mining with Subsequence Constraints
Tue, 04 Oct 2016

The paper "DESQ: Frequent Sequence Mining with Subsequence Constraints" by Kaustubh Beedkar and Rainer Gemulla has been accepted at the 2016 IEEE International Conference on Data Mining (ICDM).

Abstract:

Frequent sequence mining methods often make use of constraints to control which subsequences should be mined; e.g., length, gap, span, regular-expression, and hierarchy constraints. We show that many subsequence constraints—including and beyond those considered in the literature—can be unified in a single framework. In more detail, we propose a set of simple and intuitive “pattern expressions” to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. A unified treatment allows researchers to study jointly many types of subsequence constraints (instead of each one individually) and helps to improve usability of pattern mining systems for practitioners.
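
To make the notion of a subsequence constraint concrete, here is a small self-contained example of mining frequent length-2 subsequences under a maximum-gap constraint (this illustrates the kind of constraint the paper unifies; it is not DESQ's pattern-expression syntax or algorithm):

    from collections import Counter
    from itertools import combinations

    def frequent_pairs(sequences, max_gap=1, min_support=2):
        # Count subsequences (a, b) where b occurs at most max_gap positions
        # after a, and keep those meeting the support threshold.
        counts = Counter()
        for seq in sequences:
            seen = set()  # count each pattern at most once per sequence
            for i, j in combinations(range(len(seq)), 2):
                if j - i - 1 <= max_gap:
                    seen.add((seq[i], seq[j]))
            counts.update(seen)
        return {p: c for p, c in counts.items() if c >= min_support}

    data = [["a", "x", "b", "c"], ["a", "b", "d"], ["a", "y", "y", "b"]]
    print(frequent_pairs(data, max_gap=1, min_support=2))  # only ('a', 'b') survives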

DWS Students Take Part in Data Science Game 2016 Finals
Mon, 19 Sep 2016

The finals of the Data Science Game 2016 took place at castle Les Fontaines near Paris from September 9th to 11th. For the second consecutive year, a team of four DWS students - Christopher Zech, Thomas Stach, Robert Litschko and Benjamin Schäfer - reached the final phase, this time qualifying in 5th place out of 143 student teams from universities around the world.

The task in the qualifying round revolved around the prediction of roof orientation in satellite images, which has relevance for applications in solar energy. Much like the other finalists, the team solved the problem using deep learning.

In the final round of the top 20 teams, they tackled a task provided by the insurance company AXA, which entailed predicting conversion rates of customers receiving car insurance quotes. The team finally finished in 12th place; applying their classroom knowledge from lectures on Data Mining, Machine Learning and other topics in the three-day competition against a strong field of fellow students was both fun and a great learning opportunity for the entire team.

Based on the very positive feedback from sponsors and participants, the organizers plan to establish the competition as a regular annual event. http://www.datasciencegame.com/

Master Thesis: Adaptive query generation for finding customers' hot topics (Ponzetto)
Mon, 01 Aug 2016

The Web offers a goldmine of information describing a multitude of companies whose products and services can potentially be matched against Web users' profiles (e.g., Twitter or Facebook profiles) in order to raise their consumer interest. However, searching the Web poses non-trivial challenges due to its large size as well as its noisy and heterogeneous content: in fact, building a DIY Web search engine is, of course, impractical in most, if not all, scenarios, due to a variety of scalability and other engineering issues.

In this thesis we will focus on learning user queries for lead enrichment: to this end, different methods will be explored to build a query generation engine that adapts to different users' profiles and makes it possible to automatically generate Web search queries that, when used in conjunction with a general-purpose search engine like Google or Bing, retrieve Web documents from websites of companies that provide products or services of interest for a potential customer.

This thesis is offered in collaboration with the GMS department of Siemens AG. Global Marketing Services (GMS) is Siemens' in-house partner for sales and marketing topics across Siemens. Global market research projects, sales and marketing concepts, customer loyalty projects, lead generation, market potential models and automated sales solutions, as well as sales management via dashboards and tablets, all form part of their innovative and highly specialized portfolio, which is available to all Siemens divisions and regions.

Master Thesis: Dirty cheap text classification from the CommonCrawl (Ponzetto, Glavaš)
Fri, 15 Jul 2016

Recently, there has been much interest in exploiting Web-scale resources like the CommonCrawl for intelligent text processing and information extraction -- e.g., see our WebIsaDB.

In this thesis, we will look at ways to exploit "cheap" heuristics in order to collect training data for building supervised text classifiers from very large amounts of text. The key objective is to improve the performance of standard supervised methods by automatically harvesting high-quality labeled data from the Web in a simple, yet effective fashion.

If interested, please contact Prof. Simone Paolo Ponzetto.

Master Thesis: Multilingual WebIsA Database (Ponzetto, Faralli)
Fri, 15 Jul 2016

Recently, we started investigating methods and a framework to automatically extract high-quality hypernym relations from Web-scale amounts of data, i.e., like those found in the publicly available CommonCrawl. The result is the so-called WebIsA database, available at http://webdatacommons.org/isadb/.

This thesis will look at ways to extend our pattern-based framework to languages other than English, e.g., Spanish or Arabic.

If interested, please contact Prof. Simone Paolo Ponzetto.

24.4 billion quads of RDFa, Microdata and Microformat data published
Mon, 25 Apr 2016

The DWS group is happy to announce a new release of the Web Data Commons RDFa, Microdata, Embedded JSON-LD and Microformat data corpus.

The data corpus has been extracted from the November 2015 version of the Common Crawl, covering 1.77 billion HTML pages which originate from 14.4 million websites (pay-level domains).

Altogether we discovered structured data within 541 million HTML pages out of the 1.77 billion pages contained in the crawl (30%). These pages originate from 2.7 million different pay-level domains out of the 14.4 million pay-level domains covered by the crawl (19%).

Approximately 521 thousand of these websites use RDFa, while 1.1 million websites use Microdata. Microformats are also used by over 1 million websites within the crawl. For the first time, we have also extracted embedded JSON-LD, which we can report is used by more than 596 thousand websites.

Background

More and more websites embed structured data describing for instance products, people, organizations, places, events, reviews, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats.

The WebDataCommons project extracts all Microformat, Microdata and RDFa data, and since 2015 also embedded JSON-LD data, from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format.

Besides the data extracted from these markup syntaxes, the WebDataCommons project also provides one of the largest publicly accessible corpora of web tables extracted from web crawls, as well as a collection of hypernyms extracted from billions of web pages, for download.

General information about the WebDataCommons project is found at http://webdatacommons.org/  

Data Set Statistics

Basic statistics about the November 2015 RDFa, Microdata, Embedded JSON-LD and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

http://webdatacommons.org/structureddata/2015-11/stats/stats.html

Comparing these statistics to those about the December 2014 release of the data sets

http://webdatacommons.org/structureddata/2014-12/stats/stats.html

we see that the adoption of the Microdata markup syntax has again increased (1.1 million websites in 2015 compared to 819 thousand in 2014, where both crawls cover a comparable number of websites), whereas the deployment of RDFa and Microformats is more or less stable.

As already observed in the previous year, the schema.org vocabulary, recommended by Google, Microsoft, Yahoo!, and Yandex, is most frequently used by webmasters in the context of Microdata. We observe a decreasing deployment of its predecessor, the data-vocabulary.org vocabulary. In the context of RDFa, we still find the Open Graph Protocol recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue. We see that, besides navigational, blog and CMS-related meta-information, many websites annotate e-commerce related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress).

For the first time, we have also extracted information marked up using embedded JSON-LD. Over 99% of all webmasters using this syntax use it to mark up search boxes on their webpages (http://schema.org/SearchAction). Only a small fraction of the websites also use embedded JSON-LD to annotate other information, e.g., about organizations (92 thousand websites) or persons (18 thousand websites).

Download 

The overall size of the November 2015 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 24.4 billion RDF quads. For download, we split the data into 3,961 files with a total size of 404 GB. 

http://webdatacommons.org/structureddata/2015-11/stats/how_to_get_the_data.html

In addition, we have created separate files for over 50 different schema.org classes, each including all quads from pages that deploy the specific class at least once.

http://webdatacommons.org/structureddata/2015-11/stats/schema_org_subsets.html 

Lots of thanks to

+ the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project. 
+ the Any23 project for providing their great library of structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 


Have fun with the new data set. 

Robert Meusel and Christian Bizer

Bachelor thesis: Distributed Text Recognition for a Historical Newspaper
Mon, 07 Mar 2016

The bachelor thesis is jointly supervised by Prof. Rainer Gemulla (Chair of Practical Computer Science I) and Stefan Weil (Mannheim University Library). The library holds image data of more than 700,000 pages of the newspaper Deutscher Reichsanzeiger und Preussischer Staatsanzeiger, which appeared under various names between 1819 and 1945. To make this archive accessible for further research, the texts contained in the pages are to be recognized and made machine-readable (OCR, optical character recognition).

To this end, more than 350,000 TIFF image files (>20 TB of data) have to be processed. With common OCR software, a single computer needs about 10 minutes for the text recognition in one image file; processing all image files sequentially would thus take almost 7 years.

The goal of the bachelor thesis is to realize a cluster of PCs similar to SETI@Home. For this purpose, the public PCs of all library branches, the workstations of library staff (where they agree), and machines of other volunteers could be used, for example. The cluster nodes therefore exhibit a certain heterogeneity, e.g., different hardware and performance as well as different operating systems.

Free software such as BOINC [1,2] or Docker Swarm [3,4] could serve as the basis for realizing the cluster. The OCR itself will also use free software (Tesseract [5], possibly OCRopus [6], as well as software for image preprocessing). Besides the purely technical challenges, topics such as security and acceptance also have to be taken into account.
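
Independently of the distribution platform, the work unit executed on each node is simple: run Tesseract on a batch of page images. A minimal single-machine sketch of that work unit using the pytesseract wrapper (the directory name and language setting are illustrative assumptions):

    import glob
    from multiprocessing import Pool

    import pytesseract  # pip install pytesseract (wraps the Tesseract binary)
    from PIL import Image

    def ocr_one(path: str) -> tuple[str, str]:
        # Run Tesseract on a single scanned page and return (path, text).
        # lang="deu" assumes the German language pack; historical Fraktur
        # pages may require a different model.
        text = pytesseract.image_to_string(Image.open(path), lang="deu")
        return path, text

    if __name__ == "__main__":
        pages = glob.glob("reichsanzeiger/*.tif")  # hypothetical local directory
        with Pool(processes=8) as pool:            # one worker per core
            for path, text in pool.imap_unordered(ocr_one, pages):
                with open(path + ".txt", "w", encoding="utf-8") as f:
                    f.write(text)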

If you are interested or have questions, please contact Rainer Gemulla or Stefan Weil.

[1] https://de.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing

[2] https://boinc.berkeley.edu/

[3] https://de.wikipedia.org/wiki/Docker_%28Software%29

[4] https://docs.docker.com/swarm/

[5] https://de.wikipedia.org/wiki/Tesseract_%28Software%29

[6] https://de.wikipedia.org/wiki/OCRopus

Bachelor thesis: Preprocessing of Images for Automatic Text Recognition (OCR)
Mon, 07 Mar 2016

The bachelor thesis is jointly supervised by Prof. Rainer Gemulla (Chair of Practical Computer Science I) and Dr. Philipp Zumstein (Mannheim University Library). The library holds image data of more than 700,000 pages of the newspaper Deutscher Reichsanzeiger und Preussischer Staatsanzeiger. To make this archive accessible for further research, the texts contained in the pages are to be recognized and made machine-readable (OCR, optical character recognition).

This thesis should investigate whether, and to what extent, the quality of common OCR software can be improved by suitable preprocessing of the images. This includes, for example, techniques for splitting scans into pages, de-warping [1], detection of non-text regions [2], and binarization [3]. Both commercial OCR programs such as ABBYY FineReader and free OCR software such as Tesseract with Leptonica or OCRopus already perform some preprocessing. However, this is only partially effective and can potentially be improved further, among other things through preprocessing steps developed specifically for this dataset. To this end, various preprocessing steps should be proposed and evaluated with respect to their effectiveness (a minimal binarization example follows).
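
For example, a minimal global binarization step could look as follows (a sketch using OpenCV's Otsu thresholding; the file path is an invented placeholder, and the thesis would compare such standard methods against dataset-specific ones):

    import cv2  # pip install opencv-python

    # Load a scanned page in grayscale; imread returns None if the file is missing.
    page = cv2.imread("reichsanzeiger/page_0001.tif", cv2.IMREAD_GRAYSCALE)

    # Global Otsu binarization: picks the threshold that best separates
    # foreground (ink) from background (paper).
    _, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    cv2.imwrite("page_0001_binarized.png", binary)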

If you are interested or have questions, please contact Rainer Gemulla or Philipp Zumstein.

[1] Le, Thoma, Wechsler (1994): Automated page orientation and skew angle detection for binary document images. http://doi.org/10.1016/0031-3203(94)90068-X

[2] Bukhari, Al Azawi, Shafait, Breuel (2010): Document Image Segmentation using Discriminative Learning over Connected Components. http://doi.org/10.1145/1815330.1815354

[3] Gatos, Pratikakis, Perantonis (2008): Efficient Binarization of Historical and Degraded Document Images. http://doi.org/10.1109/DAS.2008.66

]]>
Thesis - Bachelor Thesis Rainer
news-1081 Mon, 22 Jun 2015 14:30:00 +0000 Two Papers Accepted for the 5th International Conference on Web Intelligence, Mining and Semantics http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/two-papers-accepted-for-the-5th-international-conference-on-web-intelligence-mining-and-semantics/ The papers A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time by Robert Meusel, Christian Bizer, and Heiko Paulheim and Matching HTML Tables to DBpedia by Dominique Ritze, Oliver Lehmberg, and Christian Bizer have been accepted for the 5th International Conference on Web Intelligence, Mining and Semantics in Limassol, Cyprus.

Please find the abstracts of the papers below:

1. A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare different versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at different points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) as well as bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.

2. Matching HTML Tables to DBpedia

Millions of HTML tables containing structured data can be found on the Web. With their wide coverage, these tables are potentially very useful for filling missing values and extending cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph. As a prerequisite for being able to use table data for knowledge base extension, the HTML tables need to be matched with the knowledge base, meaning that correspondences between table rows/columns and entities/schema elements of the knowledge base need to be found. This paper presents the T2D gold standard for measuring and comparing the performance of web table to knowledge base matching systems. T2D consists of 8,700 schema-level and 26,100 entity-level correspondences between the WebDataCommons Web Tables Corpus and the DBpedia knowledge base. In contrast to related work on web tables to knowledge base matching, the Web Tables Corpus (147 million tables), the knowledge base, as well as the gold standard are publicly available. The gold standard is used afterward to evaluate the performance of T2K Match, an iterative matching method which combines schema and instance matching. T2K Match is designed for the use case of matching large quantities of mostly small and narrow web tables against large cross-domain knowledge bases. The evaluation using the T2D gold standard shows that T2K Match discovers table-to-class correspondences with a precision of 94%, row-to-entity correspondences with a precision of 90%, and column-to-property correspondences with a precision of 77%.



]]>
Publications Chris Research - Data Mining and Web Mining
news-1072 Wed, 03 Jun 2015 13:39:00 +0000 Article accepted by Journal of Web Semantics: The Mannheim Search Join Engine http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/article-accepted-by-journal-of-web-semantics-the-mannheim-search-join-engine/ We are happy to announce that the article "The Mannheim Search Join Engine" by Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paulheim, and Christian Bizer has been accepted for publication by the Journal of Web Semantics.

Abstract

A Search Join is a join operation which extends a user-provided table with additional attributes based on a large corpus of heterogeneous data originating from the Web or corporate intranets. Search Joins are useful within a wide range of application scenarios: Imagine you are an analyst having a local table describing companies and you want to extend this table with attributes containing the headquarters, turnover, and revenue of each company. Or imagine you are a film enthusiast and want to extend a table describing films with attributes like director, genre, and release date of each film. This article presents the Mannheim Search Join Engine which automatically performs such table extension operations based on a large corpus of Web data. Given a local table, the Mannheim Search Join Engine searches the corpus for additional data describing the entities contained in the input table. The discovered data are joined with the local table and are consolidated using schema matching and data fusion techniques. As a result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. We evaluate the Mannheim Search Join Engine using heterogeneous data originating from over one million different websites. The data corpus consists of HTML tables, as well as Linked Data and Microdata annotations which are converted into tabular form. Our experiments show that the Mannheim Search Join Engine achieves a coverage close to 100% and a precision of around 90% for the tasks of extending tables describing cities, companies, countries, drugs, books, films, and songs.
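To make the table extension operation itself concrete, here is a toy illustration of the final join step only, with invented data; the actual engine additionally performs corpus search, schema matching, and data fusion across many candidate tables:

# Toy illustration of extending a local table with an attribute from a
# discovered table; all data and column names are invented for the example.
import pandas as pd

local = pd.DataFrame({"company": ["BASF", "SAP"]})

discovered = pd.DataFrame({  # stand-in for a table found in a web corpus
    "company": ["SAP", "BASF"],
    "headquarters": ["Walldorf", "Ludwigshafen"],
})

extended = local.merge(discovered, on="company", how="left")
print(extended)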

Keywords

  • Table extension
  • Data search 
  • Search joins 
  • Web tables 
  • Microdata 
  • Linked data

More information about the Mannheim Search Join Engine is found here.

]]>
Publications Chris Research - Data Mining and Web Mining
news-710 Tue, 26 May 2015 12:40:00 +0000 Master Thesis: Multilingual Entity Linking (Ponzetto, Bizer) http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/master-thesis-multilingual-entity-linking-ponzetto-bizer/ Entity linking, the task of linking mentions of entities in text to wide-coverage concept repositories like DBpedia or Freebase, has so far concentrated almost exclusively on English [1]. This is reflected in the available taggers, which work only on English, a major limitation for the multilingual Web of Data. The goal of this thesis, accordingly, is to extend existing taggers such as DBpedia Spotlight [2] to a wide range of languages other than English.
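For first experiments, DBpedia Spotlight exposes a public annotation endpoint; a minimal call might look like the sketch below (the endpoint URL, parameters, and response keys reflect the public demo service and should be verified before relying on them):

# Minimal call against the public DBpedia Spotlight demo endpoint; URL,
# parameters, and response layout are assumptions to verify.
import requests

def annotate(text: str, lang: str = "en", confidence: float = 0.5) -> dict:
    resp = requests.get(
        f"https://api.dbpedia-spotlight.org/{lang}/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for res in annotate("Mannheim lies in Baden-Württemberg.").get("Resources", []):
    print(res["@surfaceForm"], "->", res["@URI"])

Switching the language path segment (e.g. to "de" or "it") selects the language model, which is exactly the axis along which this thesis would extend coverage.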

Requirements

  • Solid programming skills
  • Experience with / genuine interest in working with large datasets
  • Previous knowledge of LOD, NLP, and Machine Learning is a plus


References

[1] A framework for benchmarking entity-annotation systems. M. Cornolti, P. Ferragina and M. Ciaramita. In WWW-13

[2] DBpedia Spotlight: Shedding Light on the Web of Documents. P.N. Mendes, Max Jakob, A. García-Silva and C. Bizer. In I-Semantics-11


Contact: Prof. Dr. Bizer or Prof. Dr. Ponzetto

]]>
Topics Chris Simone Topics - Artificial Intelligence (NLP) Thesis - Master
news-231 Wed, 25 Mar 2015 08:28:00 +0000 Three Papers accepted at ESWC 2015 http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/three-papers-accepted-at-eswc-2015/ We are happy to announce that three papers have been accepted at the 12th Extended Semantic Web Conference (ESWC 2015), held in Portoroz, Slovenia. The ESWC is an important international forum for the Semantic Web / Linked Data community.

Abstracts of the accepted papers: 

RODI: A Benchmark for Automatic Mapping Generation in Relational-to-Ontology Data Integration (Christoph Pinkel, Carsten Binnig, Ernesto Jimenez-Ruiz, Wolfgang May, Dominique Ritze, Martin G. Skjaeveland, Alessandro Solimando and Evgeny Kharlamov)
A major challenge in information management today is the integration of huge amounts of data distributed across multiple data sources. One suggested approach to this problem is ontology-based data integration where legacy data systems are integrated via a common ontology that represents a unified global view over all data sources. In many domains (e.g., biology, medicine) there exist established ontologies to integrate data from existing data sources. However, data is often not natively born using these ontologies. Instead, much data resides in relational databases. Therefore, mappings that relate the legacy data sources to the ontology need to be constructed. Recent techniques and systems that automatically construct such mappings have been developed. The quality metrics of these systems are, however, often only based on self-designed, highly biased benchmarks. This paper introduces a new publicly available benchmarking suite called RODI which is designed to cover a wide range of integration challenges in Relational-to-Ontology Data Integration scenarios. RODI provides a set of different relational data sources and ontologies as well as a scoring function with which the performance of relational-to-ontology mapping construction systems may be evaluated.

Towards Linked Open Data enabled Data Mining: Strategies for Feature Generation, Propositionalization, Selection, and Consolidation (Petar Ristoski)
Background knowledge from Linked Open Data sources can be used to improve the results of a data mining problem at hand: predictive models can become more accurate, and descriptive models can reveal more interesting findings. However, collecting and integrating background knowledge is tedious manual work. In this paper we propose a set of desiderata, and identify the challenges for developing a framework for unsupervised generation of data mining features from Linked Data.

Heuristics for Fixing Common Errors in Deployed schema.org Microdata (Robert Meusel and Heiko Paulheim)
Being promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using schema.org, has become one of the most important markup languages for the Web. However, deployed Microdata is most often not free from errors, which limits its practical use. In this paper, we use the WebDataCommons corpus of Microdata extracted from more than 250 million web pages for a quantitative analysis of common mistakes in Microdata provision. Since it is unrealistic that data providers will provide clean and correct data, we discuss a set of heuristics that can be applied on the data consumer side to fix many of those mistakes in a post-processing step. We apply those heuristics to provide an improved knowledge base constructed from the raw Microdata extraction.  
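To give a flavor of such consumer-side repairs, the sketch below normalizes schema.org type URLs; the specific rules shown (namespace normalization and case-correction of known terms) are illustrative assumptions, not the paper's exact rule set:

# Illustrative consumer-side heuristics for deployed Microdata:
# normalize namespace variants and repair the casing of known terms.
import re

KNOWN_TYPES = {"product": "Product", "offer": "Offer", "person": "Person"}

def fix_type_url(url: str) -> str:
    # Accept https://, www., and trailing-slash variants of the namespace.
    m = re.match(r"https?://(?:www\.)?schema\.org/(.+?)/?$", url.strip())
    if m is None:
        return url  # not a schema.org URL; leave it untouched
    term = m.group(1)
    term = KNOWN_TYPES.get(term.lower(), term)  # e.g. "product" -> "Product"
    return "http://schema.org/" + term

print(fix_type_url("https://www.schema.org/product"))  # http://schema.org/Product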



]]>
Publications Research - Data Mining and Web Mining Chris
news-886 Tue, 04 Feb 2014 10:05:00 +0000 Bachelor thesis (Meilicke): Developing a game AI http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/bachelorarbeit-meilicke-entwicklung-einer-spiele-ki/ The AI lecture covers, among other things, fundamental methods for developing a game-playing AI. Students who have attended the lecture can write a bachelor thesis in which an AI for one of the games from the following list is developed and evaluated.

  • 6 nimmt
  • Qwirkle
  • Lost Cities


For this purpose, suitable methods have to be identified, adapted, and extended. Besides the implementation, the thesis also includes an evaluation of the AI's playing strength (see the sketch below). In consultation with the supervisor, the listed games may also be replaced by other games.
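A minimal sketch of such a strength evaluation: play many games between the developed AI and a baseline and report the win rate. The game and agent interfaces here are entirely hypothetical stand-ins:

# Strength-evaluation sketch; play_game is a hypothetical stand-in
# for the real game engine and agent interfaces.
import random

def play_game(agent_a, agent_b) -> int:
    """Return 1 if agent_a wins, else 0. Stand-in: replace with the game."""
    return random.choice([0, 1])

def win_rate(agent_a, agent_b, n_games: int = 1000) -> float:
    return sum(play_game(agent_a, agent_b) for _ in range(n_games)) / n_games

print(f"win rate vs. random baseline: {win_rate('my_ai', 'baseline'):.2%}")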

Note: This bachelor topic requires successful completion of the AI lecture!

Supervisors: Christian Meilicke / Jörg Schönfisch

]]>
Thesis - Bachelor Thesis