Dmitry Ustalov has defended his PhD thesis

Dmitry Ustalov has successfully defended his Kandidat Nauk (PhD) thesis on “Models, Methods and Algorithms for Constructing a Word Sense Network for Natural Language Processing” («Модели, методы и алгоритмы построения семантической сети слов для задач обработки естественного языка» in Russian). The defense was held at the South Ural State University (Chelyabinsk, Russia) on February 21, 2018.

Among many other contributions, the thesis proposes the Watset and Watlink methods for extracting, inducing, clustering, and linking word senses from unstructured data.

Abstract

The goal of the thesis is to develop models, methods, and algorithms for constructing a semantic network that establishes semantic links between individual word senses using weakly structured dictionaries, and to implement them as a software system for word sense network construction. To this end, Part I reviews the state of the art in natural language processing and argues for the development of new, efficient ontology induction algorithms for under-resourced languages.

Part II proposes two new algorithms, Watset and Watlink, that extract and structure knowledge available in unstructured form. Watset is a meta-algorithm for fuzzy graph clustering. It creates an intermediate representation of the input graph that naturally reflects the “ambiguity” of its nodes, and then applies hard clustering to discover clusters in this intermediate graph. This makes it possible to discover synsets in a synonymy graph. Watlink is an algorithm for discovering disambiguated hierarchical links between individual word senses. It uses the synsets obtained with Watset to contextualize the input asymmetric word links. To increase the recall of the linking, it optionally uses a regularized projection learning approach to predict additional relevant links.
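The Watset idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the toy synonymy graph, and the use of connected components as the hard clustering step are assumptions made here for brevity. Any hard clustering algorithm could take the place of connected components, and the sense-linking rule below is a simplification that works because the contexts of a word partition its neighbors.

```python
from collections import defaultdict


def connected_components(nodes, adjacency):
    """Hard clustering stand-in (an assumption): components restricted to `nodes`."""
    nodes = set(nodes)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(n for n in adjacency[node]
                         if n in nodes and n not in component)
        seen |= component
        clusters.append(component)
    return clusters


def watset_sketch(edges):
    """Simplified fuzzy clustering: local sense induction, then global clustering."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    adjacency = dict(adjacency)

    # Step 1: cluster each word's ego network (its neighbors, without the word
    # itself). Every cluster is treated as the context of one sense of the word.
    contexts = {}
    senses_of = defaultdict(list)
    for word in adjacency:
        ego_clusters = connected_components(adjacency[word], adjacency)
        for i, cluster in enumerate(ego_clusters):
            contexts[(word, i)] = cluster
            senses_of[word].append((word, i))

    # Step 2: build the intermediate sense graph. A sense of `word` is linked
    # to the sense of each context neighbor whose own context contains `word`.
    sense_adjacency = {sense: set() for sense in contexts}
    for sense, context in contexts.items():
        word = sense[0]
        for neighbor in context:
            for other in senses_of[neighbor]:
                if word in contexts[other]:
                    sense_adjacency[sense].add(other)
                    sense_adjacency[other].add(sense)

    # Step 3: hard-cluster the sense graph and map senses back to words.
    # A word with several senses may appear in several clusters.
    sense_clusters = connected_components(sense_adjacency, sense_adjacency)
    return [{word for word, _ in cluster} for cluster in sense_clusters]


# Toy synonymy graph (hypothetical data): "bank" is ambiguous.
edges = [
    ("bank", "riverbank"), ("bank", "streambank"), ("riverbank", "streambank"),
    ("bank", "building"), ("bank", "edifice"), ("building", "edifice"),
]
synsets = watset_sketch(edges)
```

In this toy graph the ego network of "bank" splits into two components, so the word receives two senses and ends up in two different synsets, while every other word belongs to exactly one.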

Part III describes the implementation of the proposed models, methods, and algorithms as a software system. The system is implemented in the Python, AWK, and Bash programming languages using the scikit-learn, TensorFlow, NetworkX, and Raptor libraries. Part III also defines the representation of the produced word sense network as Linked Data.

Part IV reports the results of the experiments conducted on the Russian language, an under-resourced natural language. Both Watset and Watlink show state-of-the-art performance on the synset induction and hypernymy detection tasks on the RuWordNet and Yet Another RussNet gold standards.