RSS-Feed http://example.com en-gb TYPO3 News Sun, 27 May 2018 00:01:35 +0000 Sun, 27 May 2018 00:01:35 +0000 TYPO3 EXT:news news-2084 Mon, 12 Mar 2018 11:57:47 +0000 Third Cohort of Students starts Part-time Master in Data Science http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/third-cohort-of-students-starts-part-time-master-in-data-science/ The third cohort consisting of 32 students has started their studies in the part-time master program in Data Science that professors of the DWS group offer together with the Hochschule Albstadt-Sigmaringen.

This weekend the students of the third cohort of the master program as well as students participating in the certificate program Data Science were in Mannheim for a data mining project weekend.

The students worked in teams on two case studies, one in the area of online marketing, the other in the area of text mining. The teams were coached by Prof. Christian Bizer, Dr. Robert Meusel, and Alexander Diete, and we were very happy to see an exciting competition between the teams for the best F1 scores as well as the highest increases in sales.

Additional Information:

 

]]>
Projects Chris
news-2049 Thu, 11 Jan 2018 09:42:41 +0000 38.7 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data published http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/387-billion-quads-microdata-embedded-json-ld-rdfa-and-microformat-data-published/ The DWS group is happy to announce a new release of the WebDataCommons Microdata, Embedded JSON-LD, RDFa and Microformat data corpus. The data has been extracted from the November 2017 version of the Common Crawl covering 3.2 billion HTML pages which originate from 26 million websites (pay-level domains).

In summary, we found structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38.9%). These pages originate from 7.4 million different pay-level domains out of the 26 million pay-level-domains covered by the crawl (28.4%). Approximately 3.7 million of these websites use Microdata, 2.6 million websites use JSON-LD, and 1.2 million websites make use of RDFa. Microformats are used by more than 3.3 million websites within the crawl.

Background:

More and more websites annotate data describing for instance products, people, organizations, places, events, reviews, and cooking recipes within their HTML pages using markup formats such as Microdata, embedded JSON-LD, RDFa, and Microformats. The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and Microformat data from the Common Crawl web corpus, the largest web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. We have run yearly extractions since 2012 and we provide the dataset series as well as the related statistics at:

http://webdatacommons.org/structureddata/

Statistics about the November 2017 Release:

Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

http://webdatacommons.org/structureddata/2017-12/stats/stats.html

Markup Format Adoption

The page below provides an overview of the increase in the adoption of the different markup formats as well as widely used schema.org classes from 2012 to 2017:

http://webdatacommons.org/structureddata/#toc10

Comparing the statistics from the new 2017 release to the statistics about the October 2016 release of the data sets (http://webdatacommons.org/structureddata/2016-10/stats/stats.html), we see that the adoption of structured data keeps on increasing while Microdata remains the dominant markup syntax. The different nature of the crawling strategy that was used makes it hard to compare absolute as well as certain relative numbers between the two releases. More concretely, we observe that the November 2017 Common Crawl corpus is much deeper for certain domains like blogspot.com and wordpress.com while other domains are covered in a shallower way, with fewer URLs crawled in comparison to the October 2016 Common Crawl corpus. Nevertheless, it is clear that the growth rate of Microdata and Microformats is much higher than that of RDFa and embedded JSON-LD. Although the latter format is widespread, it is mainly used to annotate metadata for search actions (80% of the domains using JSON-LD) while only a few domains use it for annotating content information such as Organizations (25% of the domains using JSON-LD), Persons (4% of the domains using JSON-LD) or Offers (0.1% of the domains using JSON-LD).

Vocabulary Adoption

Concerning vocabulary adoption, schema.org, the vocabulary recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be the dominant vocabulary in the context of Microdata: 78% of the webmasters use it, whereas its predecessor, the data-vocabulary, is used by only 14% of the websites containing Microdata. In the context of RDFa, the Open Graph Protocol recommended by Facebook remains the most widely used vocabulary.

Parallel Usage of Multiple Formats

Analyzing topic-specific subsets, we discover some interesting trends. As observed in the previous extractions, content-related information is mostly described either with the Microdata format or, less frequently, with the JSON-LD format, in both cases using the schema.org vocabulary. However, we find that 30% of the websites that use JSON-LD annotations to describe product-related information use Microdata as well as JSON-LD to cover the same topic. This is not the case for other topics, such as Hotels or Job Postings, for which webmasters use only one format to annotate their content.

Richer Descriptions of Job Postings

Following the release of the “Google for Jobs” search vertical and the more detailed guidance by Google on how to annotate job postings (https://developers.google.com/search/docs/data-types/job-posting), we see an increase in the number of websites annotating job postings (2017: 7,023, 2016: 6,352). In addition, the job posting annotations tend to become richer in comparison to the previous years: the number of JobPosting-related properties adopted by at least 30% of the websites containing job offers has increased from 4 (2016) to 7 (2017). The newly adopted properties are JobPosting/url, JobPosting/datePosted, and JobPosting/employmentType. You can find a more extended analysis concerning specific topics, like Job Posting and Product, here:

http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis
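To illustrate what an annotation using the three newly adopted properties might look like, here is a minimal Python sketch that builds an embedded JSON-LD snippet for a schema.org JobPosting. All values (title, URL, date, employment type) are hypothetical placeholders, and the helper name `to_jsonld_script` is an assumption made for this example; it is not taken from the WebDataCommons tooling.

```python
import json

# Hypothetical job posting; every value below is a placeholder for illustration.
job_posting = {
    "@context": "https://schema.org",
    "@type": "JobPosting",
    "title": "Data Scientist",
    # The three properties named above as newly adopted in 2017:
    "url": "https://example.com/jobs/data-scientist",
    "datePosted": "2017-11-01",
    "employmentType": "FULL_TIME",
}


def to_jsonld_script(data: dict) -> str:
    """Wrap a JSON-LD object in the script element used to embed it in HTML."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(to_jsonld_script(job_posting))
```

The resulting script element would be placed in the HTML of the page that describes the job offer.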

Download:

The overall size of the November 2017 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 38.7 billion RDF quads. For download, we split the data into 8,433 files with a total size of 858 GB.

http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html
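As a rough sketch of how the downloaded files could be inspected, the Python snippet below counts quads per host. It assumes that the files are gzip-compressed N-Quads in which the fourth element (the graph label) is the URL of the page the data was extracted from; the file path passed to `count_file` is hypothetical. The line parsing is deliberately simplified and only looks at the last term of each line, so a full N-Quads parser should be preferred for anything beyond quick statistics.

```python
import gzip
from collections import Counter
from typing import Iterable, Optional
from urllib.parse import urlparse


def graph_of(line: str) -> Optional[str]:
    """Return the graph IRI (fourth term) of one N-Quads line, or None."""
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    tokens = line[:-1].split()
    if len(tokens) < 4:
        return None
    last = tokens[-1]
    # Simplification: accept the last term only if it is written as an IRI <...>.
    if last.startswith("<") and last.endswith(">"):
        return last[1:-1]
    return None


def count_quads_per_host(lines: Iterable[str]) -> Counter:
    """Count quads per host name of the page they were extracted from."""
    counts: Counter = Counter()
    for line in lines:
        graph = graph_of(line)
        if graph is not None:
            counts[urlparse(graph).netloc] += 1
    return counts


def count_file(path: str) -> Counter:
    """Count quads per host in one gzip-compressed file (path is hypothetical)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_quads_per_host(handle)


# Small in-memory sample, so the sketch runs without downloading anything.
sample = [
    '<http://example.com/p1> <http://schema.org/name> "Blue Shirt"@en <http://example.com/page1.html> .',
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.com/page1.html> .',
    '_:b1 <http://schema.org/price> "9.99" <http://shop.example.org/item> .',
]
print(count_quads_per_host(sample))
```

Streaming the file line by line keeps memory usage low, which matters given the overall corpus size.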

In addition, we have created separate files for over 40 different schema.org classes, each including all quads extracted from pages that use the specific schema.org class.

http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html

Lots of thanks to:

  • the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project.
  • the Any23 project for providing their great library of structured data parsers.
  • Amazon Web Services in Education Grant for supporting WebDataCommons.
  • the Ministry of Economy, Research and Arts of Baden-Württemberg, which supported the extraction and analysis of the November 2017 corpus through the ViCE project.

General Information about the WebDataCommons Project:

The WebDataCommons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the WebDataCommons project also provides large hyperlink graphs, the largest public corpus of WebTables, a corpus of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the WebDataCommons project is found at

http://webdatacommons.org/

Have fun with the new data set!

Cheers,

Anna Primpeli, Robert Meusel, and Christian Bizer

 

 

]]>
Research - Web-based Systems Topics - Linked Data Projects Chris
news-2014 Fri, 27 Oct 2017 08:41:55 +0000 SWSA Ten-Year Award won by DBpedia Paper http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/swsa-ten-year-award-won-by-dbpedia-paper/ We are happy to announce that Professor Christian Bizer has received the SWSA Ten-Year Award at the 16th International Semantic Web Conference (ISWC2017) in Vienna for the paper "DBpedia: A Nucleus for a Web of Open Data" that he co-authored in 2007.

The SWSA Ten-Year Award recognizes the highest impact papers from the ISWC proceedings ten years prior (i.e., in 2017 the award honors a paper from 2007). The decision is based primarily, but not exclusively, on the number of citations to the papers from the proceedings in the intervening decade.

DBpedia is a large-scale cross-domain knowledge base which we extract from Wikipedia and make available on the Web under an open license. DBpedia allows users to ask sophisticated queries against Wikipedia knowledge and serves as an interlinking hub in the Web of Linked Data. In addition, DBpedia is widely used as background knowledge for applications such as search, natural language understanding, and data integration.

According to Google Scholar, the paper "DBpedia: A Nucleus for a Web of Open Data" has been cited 2770 times as of October 2017.

 

]]>
Research - Web-based Systems Projects Chris
news-1928 Fri, 23 Jun 2017 08:52:12 +0000 Web Data Integration Framework (WInte.r) released http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/web-data-integration-framework-winter-released/ We are happy to announce the release of the Web Data Integration Framework (WInte.r).

WInte.r is a Java framework for end-to-end data integration. The framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions. In addition, these pre-defined building blocks can be used as foundation for implementing advanced integration methods.
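To make the idea of exchangeable building blocks more concrete, the following Python sketch shows how a blocker, a similarity function, and a matching rule can be combined for identity resolution. It is a generic illustration only: it does not use the WInte.r API (which is written in Java), and all names (`block_key`, `jaccard`, `match`), the records, and the threshold are assumptions made for this example.

```python
from collections import defaultdict


def block_key(record: dict) -> str:
    """Blocker: records are only compared if their names share a first letter."""
    return record["name"][:1].lower()


def jaccard(a: str, b: str) -> float:
    """Similarity function: Jaccard overlap of lower-cased word tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def match(left: list, right: list, threshold: float) -> list:
    """Matching rule: a pair is a match if its similarity reaches the threshold."""
    blocks = defaultdict(list)
    for record in right:
        blocks[block_key(record)].append(record)

    matches = []
    for l_rec in left:
        # Only candidates from the same block are compared.
        for r_rec in blocks.get(block_key(l_rec), []):
            score = jaccard(l_rec["name"], r_rec["name"])
            if score >= threshold:
                matches.append((l_rec["id"], r_rec["id"], score))
    return matches


# Hypothetical product records from two sources.
left = [
    {"id": "a1", "name": "Apple iPhone 7"},
    {"id": "a2", "name": "Samsung Galaxy S8"},
]
right = [
    {"id": "b1", "name": "apple iphone 7 32GB"},
    {"id": "b2", "name": "Sony Xperia"},
    {"id": "b3", "name": "Samsung Galaxy Tab"},
]
print(match(left, right, threshold=0.6))
```

Swapping `block_key` or `jaccard` for another function changes the behavior without touching `match`, which is the kind of customization the paragraph above describes.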

The WInte.r framework forms the foundation for our research on large-scale web data integration. The framework contains an implementation of the T2K Match algorithm for matching millions of Web tables against a central knowledge base. The framework is also used in the context of the DS4DM research project for matching tabular data for data search.

Besides using the WInte.r framework for research, we also use it for teaching. The students of our Web Data Integration course use the framework to solve the course case study. In addition, most students use the framework as foundation for their term projects.

Detailed information about the WInte.r framework is found at

https://github.com/olehmberg/winter

The WInte.r framework can be downloaded from the same web site. The framework can be used under the terms of the Apache 2.0 License.

]]>
Research - Web-based Systems Chris Projects
news-1904 Sat, 17 Jun 2017 07:17:00 +0000 DWS Students Score Top Results in International Data Science Competition http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dws-students-score-top-results-in-international-data-science-competition/ The Data Mining Cup is an annual competition for data science students from all over the world. Within six weeks, student teams have to solve a data science task based on real-world data. This year's task was to predict revenues in an online store for pharmaceuticals with varying prices. Each university was allowed to register two teams, and all in all, 202 teams from 150 universities in 48 countries participated in 2017.

The two teams from the University of Mannheim, consisting of master students in the Data Science and Business Informatics study programs, are among the top 10 teams out of the 202 participating teams, and will be invited to the prudsys personalization summit in Berlin on June 28th/29th to present their solutions. The final winners will be announced at the summit in Berlin.

 

It is the fourth year that students from the University of Mannheim participate in the Data Mining Cup. Participation in the cup is an integral part of the Data Mining 2 course taught by Prof. Heiko Paulheim, allowing the students to deepen the skills acquired in the lecture in a competitive real-world setting.

 

We congratulate the two student teams for this great achievement!

]]>
Projects Other Topics - Data Mining
news-1839 Tue, 14 Mar 2017 08:11:52 +0000 Best Paper Award at ICAART 2017 http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/best-paper-award-at-icaart-2017/ We are happy to announce that our paper Where is that Button Again?! – Towards a Universal GUI Search Engine won the best paper award at the 9th International Conference on Agents and Artificial Intelligence (ICAART 2017) in the artificial intelligence area.

In feature-rich software, a wide range of functionality is spread across various menus, dialog windows, toolbars etc. Remembering where to find each feature is usually very hard, especially if it is not regularly used. We therefore provide a GUI search engine which is universally applicable to a large number of applications. Besides giving an overview of related approaches, we describe three major problems we had to solve: analyzing the GUI, understanding the users’ query, and executing a suitable solution to find a desired UI element. Based on a user study, we evaluated our approach and showed that it is particularly useful when searching for a feature that is not used regularly. We have already identified considerable potential for further applications based on our approach.

 This research was funded in part by the German Federal Ministry of Education and Research under grant no. 01IS12050 (project SuGraBo).

 The paper was one of the 32 full papers in the artificial intelligence area accepted for presentation at the 9th International Conference on Agents and Artificial Intelligence (ICAART 2017).

]]>
Publications Projects
news-1819 Tue, 21 Feb 2017 08:20:04 +0000 Registration Open for DataFest Germany 2017 http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/registration-open-for-datafest-germany-2017-1/ DataFest is a competition and networking event. You will get the unique opportunity to work with a large dataset and analyze it using your own ideas as well as meet leaders and companies in the field of statistics. DataFest began in 2011 at UCLA, and is now sponsored by the American Statistical Association.

DataFest Germany 2017 is the third annual DataFest event organized in Germany. It is hosted by a consortium of the Statistics and Social Science Methodology Chair at the University of Mannheim, the Institute of Statistics of LMU Munich, and the P3 Group.

The task and dataset of this year's DataFest are still secret, but the registration for the DataFest is already open: Bachelor and Masters-level students of all subjects are welcome to apply in teams of 2-5 people. Applications are open until Thursday, March 12, 2017. Space is limited, so only the first 20 teams will be accepted.

As DWS students were quite successful and won a prize at DataFest 2015 in Mannheim, we strongly encourage our students to participate again in this year's DataFest.

 

 

]]>
Projects Topics - Data Mining
news-1786 Tue, 17 Jan 2017 14:47:30 +0000 44.2 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data published http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/442-billion-quads-microdata-embedded-json-ld-rdfa-and-microformat-data-published/ The DWS group is happy to announce a new release of the WebDataCommons Microdata, Embedded JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the October 2016 version of the CommonCrawl covering 3.2 billion HTML pages which originate from 34 million websites (pay-level domains).

Altogether, we discovered structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.6 million different pay-level domains out of the 34 million pay-level domains covered by the crawl (16.5%).

Approximately 2.5 million of these websites use Microdata, 2.1 million websites employ JSON-LD, and 938 thousand websites use RDFa. Microformats are used by over 1.6 million websites within the crawl.

 

Background: 

More and more websites annotate structured data within their HTML pages using markup formats such as RDFa, Microdata, embedded JSON-LD and Microformats. The annotations cover topics such as products, reviews, people, organizations, places, events, and cooking recipes.

The WebDataCommons project extracts all Microdata, RDFa data, and Microformat data, and since 2015 also embedded JSON-LD data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. 

Besides the markup data, the WebDataCommons project also provides large web table corpora and web graphs for download. General information about the WebDataCommons project is found at 

webdatacommons.org 


Data Set Statistics: 

Basic statistics about the October 2016 Microdata, Embedded JSON-LD, RDFa and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:

webdatacommons.org/structureddata/2016-10/stats/stats.html

Comparing the statistics to the statistics about the November 2015 release of the data sets

 

webdatacommons.org/structureddata/2015-11/stats/stats.html

we see that the Microdata syntax remains the dominant annotation format. Although it is hard to compare the adoption of the syntax between the two years in absolute numbers, as the October 2016 crawl corpus is almost double the size of the November 2015 one, a relative increase can be observed: In the October 2016 corpus, over 44% of the pay-level domains containing markup data make use of the Microdata syntax in comparison to 40% one year earlier. Even though the absolute numbers concerning the adoption of the RDFa markup syntax rise, the relative increase does not keep pace with the increase of the corpus size, indicating that RDFa is becoming relatively less used by websites. Similar to the 2015 release, the adoption of embedded JSON-LD has considerably increased, even though the main focus of the annotation remains the search action offered by the websites (70%).

As already observed in the previous years, the schema.org vocabulary is most frequently used in the context of Microdata while the adoption of its predecessor, the data vocabulary, continues to decrease. In the context of RDFa, we still find the Open Graph Protocol recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue. We see that besides navigational, blog, and CMS-related meta-information, many websites annotate e-commerce-related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress). More concretely, the October 2016 corpus includes more than 682 million product records originating from 249 thousand websites which use the schema.org vocabulary. The new release contains postal address data for more than 291 million entities originating from 338 thousand websites. Furthermore, the content describing hotels has doubled in size in this release, with a total of 61 million hotel descriptions.

Visualizations of the main adoption trends concerning the different annotation formats as well as popular schema.org and RDFa classes within the time span 2012 to 2016 are found at

webdatacommons.org/structureddata/

 

Download:

The overall size of the October 2016 Microdata, RDFa, Embedded JSON-LD, and Microformat data sets is 44.2 billion RDF quads. For download, we split the data into 9,661 files with a total size of 987 GB. 

webdatacommons.org/structureddata/2016-10/stats/how_to_get_the_data.html

In addition, we have created separate files for over 40 different schema.org classes, each including all quads from pages that deploy the specific class at least once.

webdatacommons.org/structureddata/2016-10/stats/schema_org_subsets.html

 

Lots of thanks to: 

+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 
+ the Ministry of Economy, Research and Arts of Baden-Württemberg, which supported the extraction and analysis of the October 2016 corpus by means of the ViCe project.


Have fun with the new data set. 

Anna Primpeli, Robert Meusel and Chris Bizer

]]>
Research - Data Mining and Web Mining Research - Data Analytics Topics - Data Mining Topics - Linked Data Projects Chris
news-1779 Thu, 22 Dec 2016 08:21:56 +0000 DAGStat-Bulletin features Mannheim Data Science degree programs http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/dagstat-bulletin-features-mannheim-data-science-degree-programs-1/ In its December 2016 issue, the DAGStat-Bulletin of the Deutsche Arbeitsgemeinschaft Statistik features the different Data Science degree programs that are offered by the University of Mannheim or in which professors from the university participate. The report, titled "Mannheimer Data Science Offensive", is found on page 4:

DAGStat-Bulletin - Neues über Statistik und aus den Gesellschaften der Deutschen Arbeitsgemeinschaft Statistik, Issue 18, December 2016.

The featured programs are:

  1. Mannheim Master in Data Science (MMDS)
  2. International Program in Survey and Data Science (IPSDS)
  3. Part-time Master Program Data Science (PTMDS)
]]>
Projects Chris
news-1769 Mon, 05 Dec 2016 15:06:25 +0000 Industry Talk on Information Extraction for E-Commerce http://dws.informatik.uni-mannheim.deen/news/singleview/detail/News/industry-talk-on-information-extraction-for-e-commerce/ Martin Rezk from Rakuten Tokyo/Paris today presented his recent work, carried out in collaboration with Simona Maggio, Bruno Charron, David Purcell, Hirate Yu and Béranger Dumont, on how Rakuten and PriceMinister (e-commerce) extract and combine semantic information from product titles, descriptions, and images to maintain and enhance their ontologies, to improve the user selling experience, and to build fine-grained marketing campaigns.

Martin's work nicely fits together with the work within the DWS group on product data integration.

Details about parts of the talk can be found in their ISWC 2016 industry track paper on Extracting Semantic Information for e-Commerce.

]]>
Topics - Linked Data Research - Web-based Systems Research Projects