Master thesis: Text Mining for Cyber Threat Analysis (Gemulla, Schönhofer GmbH)

Since mid-September 2015, the threat from ransomware has grown considerably [1]. Against this background, comprehensive geographical and temporal mapping of cyber attacks and their early detection have become particularly important. Attacks on an organisation's own IT infrastructure are typically analysed and defended against at the network level. For activity outside that infrastructure, other sources, e.g., news portals and social media, usually have to be used. Given the large volume and variety of this unstructured data, as well as the speed with which it is generated, automated analytical procedures from the fields of text mining and machine learning are not only particularly promising but also the only practical approach.

TASK

A review and evaluation of sources dealing with cyber-threat analysis, e.g.,

  • News websites, news portals and social media such as Facebook & Twitter
  • Pre-evaluation / prediction / forecasting websites such as Google Trends or Europe Media Monitor
  • Reports from Computer Emergency Response Teams (CERTs)
  • Reports from anti-virus software companies, e.g., Kaspersky

The sources may contain previously evaluated and summarised results. The sources should be analysed and metadata extracted. The following directions are of particular interest:

  • Reports on new threats
  • Differentiation between duplicate confirmations of known reports and genuinely new reports
  • Sentiment analysis, classification of phishing, hoaxes & fake news
  • Regional / geographical and temporal distribution
  • Significant parties (parties issuing threats as well as those analysing / defending against threats)
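
As an illustration of the duplicate-versus-new distinction listed above, the following is a minimal sketch that compares word shingles using Jaccard similarity. The function names, the shingle size and the threshold of 0.5 are assumptions chosen for illustration only; the thesis is free to use any other technique.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text (lower-cased)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        # Very short texts are represented by a single shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(new_report: str, known_reports: list, threshold: float = 0.5) -> bool:
    """True if the new report is similar enough to any known report.

    The threshold is a hypothetical value and would need tuning on real data.
    """
    new_sh = shingles(new_report)
    return any(jaccard(new_sh, shingles(old)) >= threshold for old in known_reports)


known = ["New ransomware campaign targets hospitals in Germany"]
print(is_duplicate("New ransomware campaign targets hospitals in Germany, officials say", known))
print(is_duplicate("Phishing wave hits bank customers across Europe", known))
```

On a corpus of realistic size, pairwise comparison becomes expensive; approximate methods such as MinHash would be a natural next step to evaluate.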

Moreover, on the basis of configurable taxonomies, the texts should be subjected to an entity analysis and, if possible, to a relation analysis.
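
One simple way to read "entity analysis on the basis of configurable taxonomies" is a dictionary lookup driven by a taxonomy that can be changed without touching the code. The sketch below assumes this reading; the taxonomy categories, the terms and the function name tag_entities are hypothetical examples, not part of the task definition.

```python
import re
from collections import defaultdict

# Hypothetical taxonomy: category -> list of terms. In practice this would
# be loaded from a configuration file rather than hard-coded.
TAXONOMY = {
    "malware_type": ["ransomware", "trojan", "botnet"],
    "actor_role": ["CERT", "anti-virus vendor"],
}


def tag_entities(text: str, taxonomy: dict) -> dict:
    """Return {category: [(matched_text, start_offset), ...]} for all taxonomy terms.

    Matching is case-insensitive and restricted to whole words.
    """
    found = defaultdict(list)
    for category, terms in taxonomy.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found[category].append((match.group(0), match.start()))
    return dict(found)


sample = "A CERT warned that the ransomware spread via a botnet."
print(tag_entities(sample, TAXONOMY))
```

A lookup of this kind finds only terms already present in the taxonomy; statistical named-entity recognition and relation extraction would be needed to go beyond it.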
 
A data corpus, which has been created on the basis of relevant RSS feeds, is available to test the procedure and can be expanded during the work. In addition, the possibilities of adding further metadata while importing data should be investigated, e.g., designation of source / publisher, evaluation of source (reliability, trustworthiness etc.), which can then be considered when extracting the metadata.
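
The idea of attaching source metadata at import time can be sketched as follows, assuming RSS 2.0 input and a hand-maintained source registry. The registry contents, the feed URL, the reliability score and the names SourcedDocument and import_feed are all hypothetical and serve only to illustrate the import step.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Hypothetical registry: feed URL -> metadata maintained by an analyst.
SOURCE_REGISTRY = {
    "https://example.org/feed": {"publisher": "Example CERT", "reliability": 0.9},
}


@dataclass
class SourcedDocument:
    title: str
    link: str
    published: str
    publisher: str
    reliability: Optional[float]  # None if the source has not been rated


def import_feed(rss_xml: str, feed_url: str, registry: dict) -> list:
    """Parse RSS 2.0 items and enrich each one with metadata about its source."""
    source = registry.get(feed_url, {})
    root = ET.fromstring(rss_xml)
    docs = []
    for item in root.iter("item"):
        docs.append(SourcedDocument(
            title=item.findtext("title", default="").strip(),
            link=item.findtext("link", default="").strip(),
            published=item.findtext("pubDate", default="").strip(),
            publisher=source.get("publisher", "unknown"),
            reliability=source.get("reliability"),
        ))
    return docs


SAMPLE_RSS = """<rss version="2.0"><channel><title>Demo</title>
<item><title>New ransomware variant observed</title>
<link>https://example.org/a1</link>
<pubDate>Mon, 07 Mar 2016 10:00:00 GMT</pubDate></item>
</channel></rss>"""

for doc in import_feed(SAMPLE_RSS, "https://example.org/feed", SOURCE_REGISTRY):
    print(doc)
```

Storing the source rating alongside each document lets later extraction steps weight or filter results by reliability.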

PREREQUISITES

Detailed knowledge of text analysis / text mining and programming skills in Java/Scala, Python or a comparable programming language are required. Knowledge of virtualisation and databases is an advantage. In-depth knowledge of cyber security is not required.

CONTACT

The Master thesis is supervised by the Chair for Data Analytics (Prof. Gemulla) and by Schönhofer Sales & Engineering GmbH.

Schönhofer Sales & Engineering GmbH is an innovative systems and software company. The company, which is located in Siegburg, delivers complex projects and products in the areas of complex event prediction, big-data analytics and metadata processing for public-sector clients, banks, insurance companies and other corporate clients.

If you are interested in this thesis topic, please contact

Holger Krispin
Schönhofer S&E GmbH, IT-Systems area
Holger.Krispin@schoenhofer.de
Tel. +49 (0)2241 3099 37

REFERENCES

[1] Ransomware: Bedrohungslage, Prävention & Reaktion. BSI-Report. March 2016