DonNTU   Masters' portal

Abstract

Introduction

Thanks to modern technologies, users can share information with each other and express their opinions about everything that surrounds them, be it a book, a film, a statement by a public figure, or a complaint about a delivery service. The volume of text on the Web grows every second, so it is physically impossible for a person to process it manually. This gave rise to the field of intelligent text analysis, or text mining.

Text mining is the automated extraction of information from text data. Unlike the analysis of other kinds of data, it works with input that is not formalized: the text cannot be described by a simple mathematical function [1].

For companies and corporations to act successfully, it is important to determine users' reactions quickly; this need is one of the main drivers of sentiment analysis.

Sentiment analysis is a class of content analysis methods designed for the automated detection of emotionally colored vocabulary in texts and of the authors' emotional attitude toward the objects discussed in the text [2]. The main purpose of sentiment analysis is to find opinions in the text and identify their properties. Which properties are investigated depends on the task at hand: for example, the object of analysis can be the author, that is, the person who holds the opinion.

Determining the sentiment of a text is, in a broad sense, a text classification task. Document classification is one of the tasks of information retrieval (and is commonly solved with machine learning); it consists in assigning a document to one of several categories based on its content [3]. In this case, the classes are the kinds of opinion expressed by users.

1. Relevance of the topic

In recent years, there has been a growing need for tools that track the reactions of Internet users to events, products, and even songs. Positive and negative opinions are powerful: they can win a buyer's trust or significantly damage a reputation among fans. For example, it is reported that 40% of buyers form an opinion about a business after reading 1-3 reviews. People are also much more likely to choose a product over its alternatives if it is recommended by someone they trust. The tracking of user opinions can be automated by collecting reviews, organizing and preprocessing them, and applying text sentiment analysis techniques.

The properties of opinions under study depend on the task at hand. For some, the overall reaction of the community to a book (generally positive or negative) is enough, while others, say, a cosmetics company, need a more detailed analysis: for example, determining which target audience the author of the text belongs to and what the author focused on. The main tools used to solve this problem are the Python programming language and various text-processing libraries.

Python is usually chosen for its versatility and for its many libraries, which simplify the work.

2. Goal and tasks of the research

The object of research is the determination of the sentiment of a text.

The subject of the research is methods for determining the sentiment of a text.

The aim of the research is to study approaches to text sentiment analysis and to develop a tool that analyzes the sentiment of an uploaded text corpus and generates statistics based on the results.

The main tasks of the research:

  1. study of existing algorithms and methods of text preprocessing;
  2. study of algorithms for determining the sentiment of a text;
  3. creation of an original corpus of news texts from the field of culture;
  4. development of an original algorithm for determining the sentiment of a text, using the created corpus of news texts as an example;
  5. development of a software model that determines the sentiment of uploaded texts and compiles statistics based on them to show the attitude of society (represented by the authors of the uploaded articles) to various news.

The developed software model is planned to have an intuitive user interface and the ability to save the analysis results and export them for further work in other programs.

3. Overview of references

3.1 Worldwide references

Bing Liu's article Sentiment Analysis and Subjectivity examines the relationships and differences between facts and opinions. Opinions are usually subjective expressions that describe people's sentiments, appraisals, or feelings toward objects, events, and their properties.

The article by Bo Pang and Lillian Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, explores sentiment analysis aimed at identifying the viewpoint(s) underlying a span of text. To determine sentiment polarity, the authors propose a new machine learning method that applies text categorization techniques only to the subjective parts of the document.

3.2 National references

The article by A.G. Pazelskaya and A.N. Solovyov, Method for determining emotions in texts in Russian, considers methods for automatically determining the emotional component (sentiment) of a text and describes the experience of a working practical implementation of a system for Russian-language media texts. The system is based on dictionaries of lexical sentiment and a set of combinatorial rules for combining individual words and phrases. The paper proposes a method for determining sentiment based on the predication relationship in the sentence.

In the article by V.V. Osokin and M.V. Shegai, Analysis of the sentiment of a Russian-language text, a naive Bayes classifier is used as the classifier. Various feature selection methods are applied, and the results are compared with the results of classifying English-language text.

3.3 Local references

No master's work with exactly the same problem statement (analysis of text sentiment in order to characterize how society perceives news) was found; however, similar topics have already been covered in master's theses. For example, A.A. Prokapovich, in his master's thesis Development of algorithmic support for an intellectual module for analyzing the emotional content of natural language messages of blogs and forums, set the goal of analyzing the emotional content of messages from various blogs and forums and developing the corresponding algorithmic support for the intelligent analysis module. In his work, he reviewed existing algorithms, discussed the scientific novelty of the approach, and noted that algorithms based on the linguistic approach are more popular and more accurate.

A.S. Pilipenko also touched upon this topic in his work on methods and algorithms for determining the sentiment of natural language text. He compared popular tools that determine the uniqueness of a text (the services Text.ru, Antiplagiat.ru, Advego Plagiatus, and Etxt Antiplagiarism) and noted that, although they report some characteristics of the uploaded text, they do not determine its sentiment in the form required by the task at hand.

4. Stages of sentiment analysis

Before the sentiment of a document can be determined, the document must be preprocessed. Text preprocessing includes converting all words to lower case, tokenization, removing stop words, and lemmatization or stemming [4]. These steps reduce the noise inherent in any ordinary text and increase the accuracy of the classifier: after preprocessing, all significant words found in the document act as features.
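The preprocessing steps can be illustrated with a minimal Python sketch that uses only the standard library. The stop-word list, the suffix list, and the function names are hypothetical and chosen only for illustration; the crude suffix stripping stands in for a real stemmer or lemmatizer, and the tokenizer handles only Latin letters, so a Russian-language corpus would need a different pattern and a morphological library.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real projects use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "or", "of", "to", "in", "it", "this"}

# Hypothetical suffix list for a crude stemmer (this is not the Porter algorithm).
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic (Latin) tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return the stemmed tokens of a document with stop words removed."""
    return [crude_stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]


features = preprocess("The film was surprisingly good, and the actors played brilliantly!")
print(features)
# ['film', 'surprising', 'good', 'actor', 'play', 'brilliant']
```

The resulting list of normalized words is what a classifier would later treat as the features of the document.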

5. Methods for the classification of texts

There are several groups of text classification methods.

Rule-based and dictionary-based methods rely on pre-compiled sentiment dictionaries. Compiling these dictionaries is laborious and error-prone, since one word can carry different sentiments in different contexts (for example, the word complex is a positive characteristic of a security system but a negative one for a registration or user authorization procedure). Using the dictionaries correctly therefore requires a large number of rules. A number of approaches automate the compilation of dictionaries for a specific subject area.

In methods based on graph-theoretic models, the text is represented as a graph, on the assumption that some words carry more weight and therefore affect the sentiment of the text more strongly. Text analysis begins with building the graph and ranking its vertices. After ranking, the words are classified according to a dictionary in which each word is marked as "negative", "neutral", or "positive". The result is defined as the ratio of the number of words with a positive score to the number of words with a negative score: if the ratio is close to 1, the text is neutral; if it is greater, the text is positive; if it is less, the text is negative.

The key element of supervised machine learning methods is a machine classifier, which is used as follows:

  1. collection of the documents on which training will take place;
  2. representation of each document as a feature vector, on which the analysis will be based;
  3. indication of the correct sentiment class for each document;
  4. selection of a classification algorithm and a method for training the classifier;
  5. use of the resulting model to determine the sentiment of a new set of documents.
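The supervised workflow described above (collecting documents, building feature vectors, labeling, training, and applying the model) can be sketched in plain Python. This is only an illustration under stated assumptions: the training texts and labels are made up, the features are simple bag-of-words counts, and a multinomial naive Bayes classifier with add-one smoothing is chosen as one possible algorithm; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy, made-up labeled collection (illustrative data only).
TRAIN = [
    ("a wonderful and moving film", "positive"),
    ("great acting and a wonderful score", "positive"),
    ("a boring and predictable plot", "negative"),
    ("terrible acting and a boring script", "negative"),
]


def to_features(text):
    """Represent a document as a bag-of-words count vector."""
    return Counter(text.lower().split())


def train(labeled_docs):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    class_docs = Counter()               # number of documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocab = set()
    for text, label in labeled_docs:
        feats = to_features(text)
        class_docs[label] += 1
        word_counts[label].update(feats)
        vocab.update(feats)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior for a new document."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word, count in to_features(text).items():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            prob = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += count * math.log(prob)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "a wonderful script"))   # positive
print(predict(model, "boring and terrible"))  # negative
```

In practice the raw texts would first go through preprocessing, and the training collection would have to be far larger for the estimates to be meaningful.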

Unsupervised machine learning is based on the idea that the greatest weight belongs to terms that occur frequently in a given text and at the same time are present in only a small number of texts in the whole collection. The conclusion about the sentiment is based on selecting such terms and determining their sentiment.
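This weighting idea corresponds to the TF-IDF measure, which can be sketched as follows. The collection below is made up, the function name is hypothetical, and the plain logarithmic form of the inverse document frequency is only one of several common variants; the sketch also assumes the document itself belongs to the collection, so the document frequency is never zero.

```python
import math
from collections import Counter

# Toy, made-up collection of already tokenized documents (illustrative data only).
DOCS = [
    "the premiere was a triumph".split(),
    "the premiere was delayed".split(),
    "the exhibition was a triumph a real triumph".split(),
]


def tf_idf(doc, collection):
    """Weight each term by its frequency in `doc` and its rarity in `collection`."""
    counts = Counter(doc)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc)                                  # term frequency
        df = sum(1 for other in collection if term in other)   # document frequency
        idf = math.log(len(collection) / df)                   # inverse document frequency
        weights[term] = tf * idf
    return weights


weights = tf_idf(DOCS[2], DOCS)
top_terms = sorted(weights, key=weights.get, reverse=True)
print(top_terms[:2])
```

Terms that occur in every document, such as "the", receive a weight of zero, while terms specific to one document rise to the top; the sentiment of these top-weighted terms is then what the conclusion is drawn from.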

Conclusions

For further research, the preferred methods are support vector machines (for the high quality of their results and the ability to train on a small data set), the naive Bayes classifier (for its high speed and the easy interpretability of its results), and methods based on neural networks.

References

  1. Intelligent text analysis, or Text Mining [Electronic resource]. – Access mode: интеллектуальный-анализ-текста-что-это-и-зачем-он-нужен.aspx . – Title from the screen.
  2. Text sentiment analysis [Electronic resource]. – Access mode: https://ru.wikipedia.org/wiki/Анализ_тональности_текста. – Title from the screen.
  3. Document classification with the support vector machine [Electronic resource]. – Access mode: https://habr.com/ru/post/130278/. – Title from the screen.
  4. Batura T.V. Methods of automatic text classification / T.V. Batura // Программные продукты и системы. 2017. Vol. 30. No. 1. P. 85–99; DOI: 10.15827/0236-235X.030.1.085-099.
  5. Data and process analysis: textbook / A. A. Barsegyan, M. S. Kupriyanov, I. I. Kholod, M. D. Tess, S. I. Elizarov. – 3rd ed., revised and expanded. – St. Petersburg: BHV-Petersburg, 2009. – 512 p.
  6. What is stemming [Electronic resource]. – Access mode: https://textis.ru/stemming/. – Title from the screen.
  7. Lemmatization [Electronic resource]. – Access mode: https://cropas.by/seo-slovar/lemmatizatsiya/. – Title from the screen.
  8. Word2Vec [Electronic resource]. – Access mode: https://en.wikipedia.org/wiki/Word2vec. – Title from the screen.
  9. TF-IDF [Electronic resource]. – Access mode: https://seonomad.net/slovar/tf-idf. – Title from the screen.
  10. Naive Bayes classifier [Electronic resource]. – Access mode: http://bazhenov.me/blog/2012/06/11/naive-bayes.html. – Title from the screen.
  11. Text classification and sentiment analysis [Electronic resource]. – Access mode: http://neerc.ifmo.ru/wiki/index.php?title=Классификация_текстов_и_анализ_тональности – Title from the screen.
  12. Decision Trees – scikit-learn [Electronic resource]. – Access mode: https://scikit-learn.org/stable/modules/tree.html. – Title from the screen.
  13. Support vector machine [Electronic resource]. – Access mode: https://ru.wikipedia.org/wiki/Метод_опорных_векторов. – Title from the screen.
  14. Convolutional neural networks [Electronic resource]. – Access mode: https://ru.wikipedia.org/wiki/Свёрточная_нейронная_сеть. – Title from the screen.