Paper ID: 2207.13771

CompText: Visualizing, Comparing & Understanding Text Corpus

Suvi Varshney, Divjeet Singh Jas

A common practice in Natural Language Processing (NLP) is to visualize a text corpus so that its central idea and key points can be grasped without reading through the entire literature. For a long time, researchers focused on extracting topics from the text and visualizing them based on their relative significance in the corpus. Recently, however, researchers have started developing more complex systems that expose not only the topics of the corpus but also the words closely related to each topic, giving users a holistic view. These detailed visualizations spawned research on comparing text corpora based on their visualizations. Topics are often compared to identify the differences between corpora. However, to capture greater semantics from different corpora, researchers have started to compare texts based on the sentiment of the topics related to the text. By comparing the words carrying the most weight, we can get an idea of the important topics in a corpus. Several existing text comparison methods compare topics rather than sentiments, but we feel that focusing on sentiment-carrying words would better compare the two corpora. Sentiments, and not topics alone, convey the real feeling of a text; topics without sentiments are just nouns. We aim to differentiate the corpora with a focus on sentiment, as opposed to comparing all the words appearing in the two corpora. The rationale is that the two corpora do not have many identical words for side-by-side comparison, so comparing the sentiment words gives us an idea of how the corpora appeal to the emotions of the reader. We can argue that the entropy, or unexpectedness, and the divergence of topics should also be of importance, helping us identify key pivot points and the importance of certain topics in the corpus alongside relative sentiment.
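
As a rough illustration of the idea of comparing corpora through sentiment-carrying words, entropy, and divergence, the following minimal sketch may help. It is not the authors' implementation: the tiny lexicon `SENTIMENT_WORDS`, the function names, and the choice of Jensen-Shannon divergence are all assumptions made for this example, since the abstract does not name a specific lexicon or divergence measure, and the sketch applies entropy and divergence to sentiment-word distributions for simplicity.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (an assumption for illustration only);
# a real system would use a full sentiment lexicon.
SENTIMENT_WORDS = {"good", "great", "love", "bad", "terrible", "hate"}


def sentiment_distribution(text):
    """Return relative frequencies of lexicon words found in the text."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t in SENTIMENT_WORDS)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def entropy(p):
    """Shannon entropy in bits: how spread out the distribution is."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    vocab = set(p) | set(q)
    # Mixture distribution over the union of both vocabularies.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # KL divergence from distribution a to the mixture m.
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


corpus_a = "Good food, great service. I love it!"
corpus_b = "Bad food. Terrible, terrible service!"
p = sentiment_distribution(corpus_a)
q = sentiment_distribution(corpus_b)
print(entropy(p), entropy(q), js_divergence(p, q))
```

Because the two toy corpora share the topic words "food" and "service" but no sentiment words, a topic-only comparison would treat them as similar, while the sentiment-word divergence separates them completely.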

Submitted: Jul 27, 2022