
An Approach for Textual Based Clustering Using Word Embedding

Publication

Abstract

Numerous attempts have been made to improve the retrieval procedure in Textual Case-Based Reasoning (TCBR) using clustering and feature selection strategies. The SOPHisticated Information Analysis (SOPHIA) approach is one of the most successful of these efforts; it is characterized by its ability to work without domain knowledge or language dependency.

SOPHIA is based on conditional probability, which facilitates an advanced Knowledge Discovery (KD) framework for case-based retrieval. SOPHIA attracts clusters by themes, each of which contains only one word.
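
To make the idea of a single-word theme described by conditional probability more concrete, the following is a minimal sketch, not SOPHIA's actual procedure. It assumes the theme word is characterized by the distribution of words in the documents that contain it; the function name `conditional_distribution`, the whitespace tokenization, and the toy documents are all hypothetical.

```python
from collections import Counter


def conditional_distribution(docs, theme_word):
    """Estimate p(word | theme_word) from the documents that contain theme_word.

    Illustrative assumption only: counts every token in each document
    that mentions the theme word, then normalizes the counts.
    """
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenization
        if theme_word in tokens:
            counts.update(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


# Toy documents, made up for illustration.
docs = [
    "engine oil leak",
    "engine noise",
    "printer paper jam",
]
p = conditional_distribution(docs, "engine")
```

Here `p` only covers words that co-occur with "engine", so words from the unrelated printer case receive no probability mass.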

However, a single word is not sufficient to construct a cluster attractor, because excluding the other words associated with that word in the same context fails to give a full picture of the theme. The main contribution of this chapter is an enhanced clustering approach called GloSOPHIA (GloVe SOPHIA), which extends SOPHIA by integrating a word embedding technique to enhance KD in TCBR.

A new algorithm is proposed to feed SOPHIA with a vector space of similar terms obtained from the Global Vectors (GloVe) embedding technique. The proposed approach is evaluated on corpora in two different languages, and the results are compared with SOPHIA, K-means, and Self-Organizing Map (SOM) on several evaluation criteria.
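
The idea of enriching a single-word theme with similar terms from an embedding space can be sketched as follows. This is an illustration under stated assumptions, not the algorithm proposed in the chapter: the vectors are made-up three-dimensional stand-ins for pretrained GloVe embeddings, and the helper `expand_theme` with its `top_k` and `min_similarity` parameters is hypothetical.

```python
import math

# Toy vectors standing in for pretrained GloVe embeddings (made-up values).
toy_vectors = {
    "engine": [0.9, 0.1, 0.0],
    "motor": [0.8, 0.2, 0.1],
    "printer": [0.0, 0.9, 0.3],
    "paper": [0.1, 0.8, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_theme(word, vectors, top_k=1, min_similarity=0.8):
    """Return the theme word followed by its most similar terms.

    Hypothetical helper: keeps at most top_k neighbours whose cosine
    similarity to the theme word is at least min_similarity.
    """
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word] + [other for sim, other in scored[:top_k] if sim >= min_similarity]


theme = expand_theme("engine", toy_vectors)
```

With these toy vectors, the theme for "engine" gains "motor" while the printer-related terms stay out, which is the kind of contextual enrichment a one-word theme lacks.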

The results indicate that GloSOPHIA outperforms the other clustering methods in most of the evaluation criteria.