Categorizing multilingual open-ended survey responses and social media comments using Natural Language Processing

A technical overview of our approach to overcoming the challenges of unsupervised text classification.

Text classification is a popular and well-explored area of machine learning research and application, and statistical approaches to classifying text keep growing more sophisticated. Unsupervised text classification, or clustering, is comparatively less popular. We aim to address the text classification problem with an unsupervised approach.

We receive many unstructured, open-ended datasets that cannot be annotated or processed because resources or domain knowledge are scarce. Without labels, supervised techniques cannot be applied to them, no matter how advanced or sophisticated the algorithms. Building a robust and easily tunable unsupervised algorithm is therefore our primary goal.

The principal challenge for this unsupervised task is that the input dataset is a corpus of unstructured, free-form documents. There are no restrictions on the length of each sample or on the nature of the data itself, so a dataset can be topical, conversational, or domain-specific. The datasets also contain a lot of noise, which makes good results harder to achieve.

Methodology

To develop the initial model, we used unstructured survey responses as our primary evaluation dataset. This dataset was labelled by human annotators, so it provided a benchmark for our clustering algorithm. We used both internal and external metrics to evaluate on this dataset.

Our clustering algorithm is based on contextual, sentence-level semantic similarity rather than on topic modelling, which gives it a deeper understanding of language. The vectorization we use goes beyond the word/token level and seeks to capture the underlying meaning of words and their relationships. The algorithm follows the transfer learning approach: instead of training only on a small dataset, a model is first trained on a large-scale dataset and then fine-tuned on smaller ones. Training on the large dataset gives the model the general language skills it draws on when it is fine-tuned on a small dataset. A further benefit is the larger vocabulary inherited from the large dataset, which minimizes out-of-vocabulary tokens.

The algorithm uses a multi-stage process. Since our aim is to support mixed-language datasets (multiple languages in a single file), the first stage is to detect non-English content and translate it into English. After translation, all instances are processed in the same way as a monolingual dataset.
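The detect-then-translate stage can be sketched roughly as follows. This is only an illustration of the control flow, not our production code: `normalize_to_english`, `toy_detect`, and `toy_translate` are hypothetical names, and the two toy functions stand in for a real language identification model and a real translation service.

```python
from typing import Callable, List


def normalize_to_english(
    texts: List[str],
    detect_language: Callable[[str], str],
    translate_to_english: Callable[[str, str], str],
) -> List[str]:
    """Return the texts with every non-English entry translated to English."""
    normalized = []
    for text in texts:
        lang = detect_language(text)
        if lang == "en":
            normalized.append(text)  # already English, pass through unchanged
        else:
            normalized.append(translate_to_english(text, lang))
    return normalized


# Toy stand-ins (assumptions) so the sketch runs without external services.
_TOY_LANGUAGES = {"muy bueno": "es", "sehr gut": "de"}
_TOY_DICTIONARY = {"muy bueno": "very good", "sehr gut": "very good"}


def toy_detect(text: str) -> str:
    """Look the text up in a tiny table; anything unknown is treated as English."""
    return _TOY_LANGUAGES.get(text.lower(), "en")


def toy_translate(text: str, source_lang: str) -> str:
    """Translate via a tiny lookup table; unknown text is returned as-is."""
    return _TOY_DICTIONARY.get(text.lower(), text)


responses = ["Very good", "Muy bueno", "sehr gut"]
english = normalize_to_english(responses, toy_detect, toy_translate)
print(english)
```

Passing the detector and translator in as callables keeps the later stages independent of whichever services are actually used.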

The next stage computes pairwise similarity between sentences. The subsequent stages are hierarchical clustering steps that progressively dissolve minor clusters in an attempt to cluster the remaining loose elements. A set of parameters controls the resolution of the clusters: whether they are few and general, or many and specific.
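A minimal sketch of these stages is shown below, under several assumptions that are not taken from our actual implementation: sentence vectors are toy 2-D lists, similarity is cosine, merging uses average linkage, and the parameter names `threshold`, `min_size`, and `relaxed_threshold` are hypothetical. A higher `threshold` yields many specific clusters; a lower one yields few general clusters.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(vectors: List[List[float]]) -> List[List[float]]:
    """Symmetric matrix of cosine similarities between all sentence vectors."""
    n = len(vectors)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine_similarity(vectors[i], vectors[j])
    return sim


def average_link(sim: List[List[float]], a: List[int], b: List[int]) -> float:
    """Mean similarity over all cross-cluster pairs (average linkage)."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def agglomerate(sim, clusters, threshold):
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                score = average_link(sim, clusters[p], clusters[q])
                if best is None or score > best[0]:
                    best = (score, p, q)
        if best[0] < threshold:
            break
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def cluster(vectors, threshold=0.95, min_size=2,
            relaxed_threshold=0.8) -> Tuple[List[List[int]], List[int]]:
    sim = pairwise_similarity(vectors)
    merged = agglomerate(sim, [[i] for i in range(len(vectors))], threshold)
    # Dissolve minor clusters: their members become loose elements.
    kept = [c for c in merged if len(c) >= min_size]
    loose = [i for c in merged if len(c) < min_size for i in c]
    base = [list(c) for c in kept]  # snapshot so the result is order-independent
    unassigned = []
    for i in loose:
        scores = [average_link(sim, [i], c) for c in base]
        best = max(range(len(base)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= relaxed_threshold:
            kept[best].append(i)  # attach to the closest surviving cluster
        else:
            unassigned.append(i)
    return kept, unassigned


# Toy vectors: two tight groups plus one element lying between them.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.6]]
clusters, unassigned = cluster(vectors)
print(clusters, unassigned)
```

With the defaults, the in-between element is first left as a singleton, then dissolved and attached to the nearer group in the relaxed pass. Raising `relaxed_threshold` to 0.9 leaves it unassigned instead.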

Evaluation

Because the algorithm is unsupervised, it must generalize well enough to be applied to varied datasets. Future data will come without human labels that could help verify accuracy, so only internal evaluation metrics will be available. During development, we relied on the Fowlkes-Mallows index to validate performance against the expert-labelled data. For unlabelled data, we use a variety of internal evaluation metrics, including cosine, correlation, squared Euclidean (sqeuclidean), and Chebyshev.
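For reference, the Fowlkes-Mallows index compares two labellings by counting pairs of items: it is the geometric mean of pairwise precision and recall, FM = TP / sqrt((TP + FP) * (TP + FN)). The sketch below is a plain pair-counting version written for illustration; the function name and the example labels are hypothetical, and a library implementation would normally be used on real data.

```python
import math
from itertools import combinations
from typing import Hashable, Sequence


def fowlkes_mallows(labels_true: Sequence[Hashable],
                    labels_pred: Sequence[Hashable]) -> float:
    """Fowlkes-Mallows index computed by counting item pairs.

    TP: pairs grouped together in both labellings.
    FP: pairs grouped together only in the predicted clustering.
    FN: pairs grouped together only in the reference labelling.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labellings must cover the same items")
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))


# Hypothetical example: annotator categories vs. cluster ids.
expert = ["price", "price", "service", "service", "service"]
predicted = [0, 0, 1, 1, 0]
score = fowlkes_mallows(expert, predicted)
print(score)  # 2 / sqrt(4 * 4) = 0.5
```

The index ranges from 0 to 1, and a clustering that reproduces the annotators' grouping exactly scores 1.0 regardless of how the cluster ids are named.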