DonNTU Masters' portal

Abstract on the topic of graduation work

Introduction

The amount of information in the world has grown rapidly over the past few years, and more appears every day (Figure 1). It is stored in many forms: books, magazines, the Internet. To find useful information among such a huge number of sources, it has to be sorted and studied.

Today there is not always time to sort through it all and pick out what really matters. Computers can help here by acting as detectors of text characteristics: sentiment (tonality), volume, degree of uniqueness, and so on. This master's thesis proposes to develop a C# application that determines the sentiment of a text using Text Mining, modernizing the standard algorithms and supplementing them with mechanisms for determining other characteristics of the text, for example, errors.

Figure 1 – Growth in the amount of information [1]

1. Relevance of the topic

This topic is relevant because few Text Mining tools can determine the sentiment (tonality) of a text together with other characteristics of the text at the same time. Moreover, there are almost no tools that determine the sentiment of a text at all.

Correctly determining the sentiment of a text helps protect the user from reading depressing literature that would spoil their mood.

2. Purpose and objectives of the study, planned results

The aim of the study is:

Main research objectives:

  1. examining existing tools for extracting characteristics from text;
  2. studying algorithms for working with Text Mining to determine sentiment;
  3. creating an original algorithm for determining the sentiment of the text;
  4. creating a C# program to determine the sentiment and some other characteristics of the text.

Research object: determining the tonality of the text.

Research subject: the effectiveness of methods for determining the sentiment of text.

As part of the master's work, it is planned to obtain relevant scientific results in the following areas:

For an experimental assessment of the theoretical results obtained and to lay the foundation for subsequent research, the planned practical result is the development of a customizable and functional system for determining some characteristics of the text:

It is planned that this system will have:

3. Research and development overview

We first consider the basic concepts needed to determine the sentiment of a text, then two approaches (algorithms) for determining it, and finally the tools that use Text Mining algorithms.

3.1 Text Mining

Text Mining is an area of artificial intelligence whose goal is to extract information from collections of text documents, relying on practical methods of Machine Learning and Natural Language Processing [2].

The key task groups are:

It is important to understand what document categorization is.

Document categorization is the assignment of documents to one or several groups (classes, clusters) of similar texts (for example, similar in topic or style). Categorization can take place both with and without human participation.

In the first case, called document classification, the system must assign documents to classes that are already defined. The user must therefore provide the system with all the classes and with sample documents belonging to them [3].

The second case is called document clustering. Here the system itself must determine the set of clusters, which requires unsupervised learning (learning without a teacher). In this case, the user must tell the system the number of clusters into which the processed collection should be divided [4].
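The classification case described above can be sketched in Python. This is only a minimal illustration under stated assumptions: the function classify, the two sample classes and the word-overlap measure are hypothetical and stand in for a real trained classifier.

```python
def classify(document, samples):
    """Return the class whose sample documents share the most words with the document."""
    words = set(document.lower().split())

    def overlap(label):
        # Count the words shared with every sample document of this class.
        return sum(len(words & set(sample.lower().split())) for sample in samples[label])

    return max(samples, key=overlap)


# The user supplies all classes together with sample documents for each of them.
samples = {
    "sport": ["the team won the match", "a fast football game"],
    "weather": ["rain and wind tomorrow", "sunny weather today"],
}

label = classify("the match was a fast game", samples)
print(label)  # sport
```

In clustering, by contrast, no labelled samples are given, and the system receives only the documents and the requested number of clusters.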

Text Mining is applied in many fields of science, and new opportunities appear every day. At a minimum, Text Mining is used in the security field, where it helps analyze the text of news sites, and in software development, where text analysis technologies are studied with a view to future automation of analysis and data extraction. Text Mining can also be used commercially [5].

The main stages of Text Mining (Figure 2):

  1. search for information;
  2. preprocessing documents;
  3. retrieving information;
  4. using Text Mining methods;
  5. interpreting the results [6].

Figure 2 – Stages of Text Mining [6]

3.2 Algorithm overview

Sentiment is the emotional attitude of the author of a statement toward some object, as expressed in the text. The object can be a real-world object, process, property, attribute or event [7].

Text sentiment analysis is a class of content-analysis (data-analysis) methods in computational linguistics designed to automatically find emotionally colored vocabulary in texts and the opinions of authors about the objects discussed in the text [8].

The main tasks of sentiment analysis are:

The sentiment score can be set, for example, as a percentage (%).

Thus, the sentiment can be:

It can also be:

The choice of evaluation option is implementation dependent. In the second case, every negative word or sentence lowers the overall score of the text, every positive one raises it, and neutral ones leave it unchanged.

There are many methods for determining the sentiment of a text and many libraries for different programming languages, each with its own advantages and disadvantages.

To illustrate, let's consider a simple example.

Suppose we have the following text:

Много людей в этом зале. Я тоже в зале. Я огорчён.
(In English: "There are many people in this hall. I am in the hall too. I am upset.")

Ignoring punctuation and case (the library should perform these steps) and carrying out linguistic processing of the text, we obtain the array of words of the text:

M1 (array) = [много, люд, в, этом, зал, я, тоже, в, зал, я, огорчиться];
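The preprocessing step above can be sketched in Python. The NORMAL_FORMS table is a hypothetical stand-in for a real morphological analyser and is filled in by hand only for the words of this example; the function name to_word_array is likewise an assumption.

```python
import re

# Hypothetical lookup table standing in for a real morphological analyser:
# it maps each inflected form from the example to the normal form used in M1.
NORMAL_FORMS = {"людей": "люд", "зале": "зал", "огорчён": "огорчиться"}


def to_word_array(text):
    """Lower-case the text, drop punctuation and normalise every word."""
    tokens = re.findall(r"\w+", text.lower())
    return [NORMAL_FORMS.get(token, token) for token in tokens]


m1 = to_word_array("Много людей в этом зале. Я тоже в зале. Я огорчён.")
print(m1)
```

Repeated words are deliberately kept at this stage, since the next step counts them.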

Sometimes it also makes sense to remove words that carry no meaning of their own in the language (stop words), for example the English word "the", since they do not affect the score.
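A minimal sketch of such filtering is shown below. The STOP_WORDS set is a hypothetical, deliberately tiny list; in practice the list would come from the chosen library or dictionary.

```python
# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an"}


def remove_stop_words(words):
    """Return the words that are not stop words, keeping their order."""
    return [word for word in words if word not in STOP_WORDS]


filtered = remove_stop_words(["the", "hall", "is", "full"])
print(filtered)  # ['hall', 'is', 'full']
```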

The next step is to count the number of occurrences of identical words (this speeds up the program: there is no need to store every occurrence separately, although keeping them can sometimes lead to new results in a modified implementation). We represent the result of this step in a JSON-like format:

V1 (vector) = {много:1, люд:1, в:2, этом:1, зал:2, я:2, тоже:1, огорчиться:1};
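The counting step can be sketched with Python's standard collections.Counter. The variable names m1 and v1 simply mirror M1 and V1 from the text; serialising with json.dumps is one possible way to obtain the JSON form.

```python
import json
from collections import Counter

# The word array M1 obtained after preprocessing.
m1 = ["много", "люд", "в", "этом", "зал", "я", "тоже", "в", "зал", "я", "огорчиться"]

# Counter maps every unique word to its number of occurrences.
v1 = Counter(m1)

# ensure_ascii=False keeps the Cyrillic words readable in the output.
print(json.dumps(v1, ensure_ascii=False))
```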

As you can see, the keys of the vector (array) are the unique words of the text being analyzed. This vector can therefore be called a dictionary (it holds only unique values, each with its count). In the language of sets, the vector of a text equals the union (sum) of the vectors of its sentences, taking multiplicity into account.
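The statement about sentence vectors can be checked with a short Python sketch. The three Counter objects below are the per-sentence vectors of the example text, written out by hand; adding Counter objects sums the counts of equal words, which is exactly the multiset sum.

```python
from collections import Counter

# One vector per sentence of the example text.
sentence_vectors = [
    Counter({"много": 1, "люд": 1, "в": 1, "этом": 1, "зал": 1}),
    Counter({"я": 1, "тоже": 1, "в": 1, "зал": 1}),
    Counter({"я": 1, "огорчиться": 1}),
]

# Summing the sentence vectors gives the vector of the whole text.
text_vector = sum(sentence_vectors, Counter())
print(dict(text_vector))
```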

This model is mainly used for extracting information from text. After this stage there are several options for what can be learned about the text. The simplest is to extract the most common word or, for example, to determine the percentage of filler ("water") in the text, but here we are interested in determining the sentiment.

In the next step, to determine the sentiment, we need to connect to the system dictionaries that contain sentiment scores for most words of the language, in our case Russian. How exactly a sentiment score should be interpreted depends on the dictionary. In most cases dictionaries contain not only sentiment scores but also other characteristics of words. Depending on the dictionary, different approaches to the relationships between words are also used [9].
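As an illustration of the simplest option, the most common words can be read directly from the vector. The helper name most_frequent is an assumption; it returns every word that shares the highest count, because in the example several words are tied.

```python
# The vector V1 of the example text.
v1 = {"много": 1, "люд": 1, "в": 2, "этом": 1, "зал": 2, "я": 2, "тоже": 1, "огорчиться": 1}


def most_frequent(vector):
    """Return all words that occur the maximum number of times."""
    top = max(vector.values())
    return [word for word, count in vector.items() if count == top]


common = most_frequent(v1)
print(common)  # ['в', 'зал', 'я']
```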

3.2.1 Concept of sentiment using Bag-of-words

The Bag-of-words model is a simplified representation of text used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity [10].

The model is commonly used in document classification methods, where the frequency of occurrence of each word is used as a feature for training the classifier.

The model is often presented as a matrix in which each row corresponds to one text and each column to a word; each cell holds the number of occurrences of that word in the corresponding document [11].
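The matrix form can be sketched in Python as follows. For illustration, each sentence of the earlier example is treated as a separate document; the names documents, vocabulary and to_row are assumptions.

```python
from collections import Counter

# Each sentence of the example is treated here as one document (already preprocessed).
documents = [
    ["много", "люд", "в", "этом", "зал"],
    ["я", "тоже", "в", "зал"],
    ["я", "огорчиться"],
]

# Columns: every distinct word of the collection, in a fixed (sorted) order.
vocabulary = sorted({word for doc in documents for word in doc})


def to_row(doc):
    """Build one matrix row: the count of every vocabulary word in the document."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]


matrix = [to_row(doc) for doc in documents]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order inside each document is lost in this representation; only the counts remain.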

With this approach (method), the dictionary consists of words and sentiment scores for each of them.

For example, let the connected dictionary contain the sentiment scores of our words presented in Table 1.

Table 1 – Dictionary of words
Word         Sentiment score
много        0.01
люд          0.01
в            0
этом         0
зал          0
я            0.01
тоже         0.01
огорчиться   -0.02

In this case, the sentiment score shows by how much each occurrence of a word changes the emotional coloring of the text (0.01 corresponds to 1%): a score with a + sign raises the overall score of the text toward the positive side, 0 leaves it unchanged, and a score with a - sign moves it toward the negative side.

After connecting the dictionary, the words are matched against it and the overall score is calculated. With such a dictionary, the initial score is 0.5, which is neutral (50%).

Also, for simplicity, each word in the dictionary often has an index, and these indices are inserted into the vector instead of the words, which makes comparison easier. Performing the calculation (words with a score of 0 are omitted):

0.5 + 0.01 + 0.01 + 0.01 * 2 + 0.01 - 0.02 = 0.5 + 0.05 - 0.02 = 0.5 + 0.03 = 0.53.

Thus, the text is positive in tonality. However, it should be noted that this assessment is not accurate (objective); a more accurate assessment requires more accurate dictionaries. The full algorithm using Bag-of-words is shown in graphical form in Figure 3 [9].

Figure 3 – Algorithm for determining sentiment with the Bag-of-words model
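The calculation above can be reproduced with a short Python sketch. The names SENTIMENT, NEUTRAL and score_text are assumptions; the values come from Table 1 and the vector V1. Treating words missing from the dictionary as neutral (0) is also an assumption of this sketch.

```python
# Sentiment scores from Table 1.
SENTIMENT = {
    "много": 0.01, "люд": 0.01, "в": 0.0, "этом": 0.0,
    "зал": 0.0, "я": 0.01, "тоже": 0.01, "огорчиться": -0.02,
}

# The neutral starting score (50%).
NEUTRAL = 0.5

# The vector V1 of the example text.
v1 = {"много": 1, "люд": 1, "в": 2, "этом": 1, "зал": 2, "я": 2, "тоже": 1, "огорчиться": 1}


def score_text(vector, dictionary, start=NEUTRAL):
    """Add each word's score, multiplied by its count, to the starting score."""
    total = start
    for word, count in vector.items():
        # Assumption: words absent from the dictionary contribute nothing.
        total += dictionary.get(word, 0.0) * count
    return total


result = score_text(v1, SENTIMENT)
print(round(result, 2))  # 0.53
```

A result above 0.5 is read as positive and a result below 0.5 as negative, following the interpretation of this particular dictionary.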

In conclusion, Bag-of-words is a fairly simple model for determining the characteristics of a text.

Benefits:

Disadvantages:

3.2.2 Concept of sentiment with Word2Vec

Word2Vec is the generic name for a set of models based on artificial neural networks and designed to obtain vector representations of words in natural language. It is used to analyze the semantics of natural languages and is based on distributional semantics, machine learning and vector representations of words [12] [13].

This approach assumes not just words and their scores but also semantic relationships between words. Semantically identical words form semantic groups, and the comparison is performed between a group and a word from the text. For example, the words животные ("animals") and звери ("beasts") are semantically the same and belong to the same group (an advantage of the algorithm). However, this is also where a problem (drawback) of this common method arises: words that look semantically similar can differ in meaning. For example, the words тёмный ("dark") and чёрный ("black") have the same semantics in some sentences and different semantics in others [14] [15].

In our case, this approach also compares every word of the text, but against word groups, and then calculates the score.
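The comparison of a word with semantic groups can be sketched in Python with cosine similarity. Everything numeric here is an assumption made for illustration: the three-dimensional vectors in VECTORS and the group centres in GROUP_CENTRES are invented by hand, whereas real Word2Vec vectors are learned from a corpus and have far more dimensions.

```python
import math

# Toy word vectors invented for illustration (not trained Word2Vec output).
VECTORS = {
    "животные": (0.9, 0.1, 0.0),
    "звери": (0.8, 0.2, 0.1),
    "тёмный": (0.1, 0.9, 0.2),
    "чёрный": (0.0, 0.8, 0.3),
}

# Hypothetical semantic groups, each represented by a centre vector.
GROUP_CENTRES = {
    "animals": (0.85, 0.15, 0.05),
    "dark": (0.05, 0.85, 0.25),
}


def cosine(a, b):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_group(word):
    """Return the name of the group whose centre is most similar to the word."""
    return max(GROUP_CENTRES, key=lambda name: cosine(VECTORS[word], GROUP_CENTRES[name]))


for word in VECTORS:
    print(word, "->", nearest_group(word))
```

With these toy vectors, животные and звери fall into the same group, as do тёмный and чёрный. The sketch does not model the drawback mentioned above, since each word has a single fixed vector regardless of the sentence it appears in.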

3.3 Overview of tools using Text Mining

There are many tools for detecting (determining) the characteristics of a text, but none of the tools reviewed below determines the sentiment. Let's consider tools that can determine several characteristics of a text at once.

Text.ru service

This service checks a text for uniqueness by comparing it with various sources. Uniqueness is the percentage of the text that did not match any source. The service can also check spelling and perform SEO analysis of the text [16].

Benefits of the service:

Disadvantages:

Antiplagiat.ru service

This service is the first system for detecting text borrowing. It offers two modes of work: for students and for organizations. In student mode, the user must register in the system, to which the results of checks are then sent [17].

Benefits:

Disadvantages:

Advego Plagiatus service

The service provides a program that can thoroughly (fully) check a text for uniqueness. It is known for taking a very long time to scan a text for plagiarism; in return, the user receives a high-quality assessment (since the check was long enough) [18].

Benefits:

Disadvantages:

Etxt Антиплагиат service

This is a program for searching for plagiarism on the network and assessing the uniqueness of texts. With its help you can check a text for uniqueness quickly and efficiently. It allows you to conduct a detailed analysis of the uniqueness of the text and determine the originality of an article as a percentage. It shows non-unique phrases by highlighting them in different colors and allows you to edit them immediately and send the text for re-checking [19].

Advantages:

Disadvantages:

3.4 Overview of research at different levels

Let us consider the research carried out at different levels in this area: Text Mining (determining the characteristics of a text) and determining the sentiment of a text.

3.4.1 World level

Consider the research that is carried out at the global level in this area by studying various sources (articles, abstracts, term papers, etc.).

For example, the article Determining an Author's Native Language by Mining a Text for Errors by Moshe Koppel, Jonathan Schler and Kfir Zigdon discusses determining an author's native language by looking for errors in the text. Data Mining tools are used to solve this problem. The stylistic features of a text can be used to determine the native language of an anonymous author with high accuracy.

Yuejin Xu, Noah Reynolds' article Using Text Mining Techniques to Analyze Students' Written Responses to a Teacher Leadership Dilemma explores text mining techniques for analyzing student written responses to the teacher leadership dilemma. This article also discusses Text Mining tools. The purpose of this study was to test the accuracy of the categories generated by IBM SPSS Text Analytics for Surveys.

Bing Liu's article Sentiment Analysis and Subjectivity explores the relationships and differences between facts and opinions. Opinions are usually subjective expressions that describe people's sentiments, appraisals or feelings toward objects, events and their properties. Facts are objective expressions about entities, events and their properties.

The article A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts by Bo Pang and Lillian Lee explores sentiment analysis aimed at identifying the viewpoint(s) underlying a span of text. To determine the polarity of sentiment, a new machine learning method is proposed that applies text categorization methods only to the subjective parts of the document.

The article Automated Classification of Text Sentiment by Emmanuel Dufourq and Bruce A. Bassett discusses automatic sentiment detection using two new Genetic Algorithms (GAs). These algorithms learn whether the words in a text are sentiment words or amplifier words, together with their corresponding magnitude. This approach builds a dictionary of sentiments. The results show that the proposed approach was able to outperform several public and / or commercial sentiment analysis algorithms.

The article Text Segmentation as a Supervised Learning Task by Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman and Jonathan Berant formulates text segmentation as a supervised learning problem and presents a large new dataset for text segmentation that is automatically extracted and labeled from Wikipedia. Moreover, the authors develop a segmentation model based on this dataset and show that it generalizes well to unseen natural text.

3.4.2 National level

At the national level, a lot of research is carried out on this topic to solve various problems, so we can say that this task is quite relevant.

The article Анализ тональности текста с использованием методов машинного обучения by A.S. Romanova, M.I. Vasilyeva, A.V. Kurtukova, R.V. Meshcheryakova presents the results of a study of a text sentiment analysis technique using machine learning methods, such as support vector machines, the naive Bayesian classifier, and random tree methods. An overview of research, methods and software products in the field of text sentiment analysis is given; the stages of modeling the process of conducting experiments and determining the sentiment of text are described; and descriptions of the created text corpora and dictionaries, as well as the results of the research, are given.

The article Лингвистическая модель для компьютерного анализа тональности публикаций СМИ by A.E. Ermakova, S.L. Kiseleva highlights the experience of practically solving the problem of determining the tonality of a text in relation to a given object. The means used by the author of a text to form a toned image of the object are systematized, a linguistic model is constructed to single out all the components of this image, and a scheme is described for assessing positive / negative tonality that takes into account the places that tonal and neutral words occupy in the composition of propositions, as well as the means of expressing negation and inversion of meaning.

The article Метод определения эмоций в текстах на русском языке by A.G. Pazelskaya, A.N. Solovyova considers methods of automatically determining the emotional component (sentiment) of a text and describes the experience of a practical implementation of such a system for media texts in Russian, which is based on dictionaries of lexical sentiment and a set of combinatorial rules for combining individual words and phrases. According to the authors, this work is the first to propose a method for determining sentiment based on predication relations in a proposition. In this regard, the authors propose a classification of verbs depending on their emotive impact and the location of the tonality object.

In the article Анализ тональности русскоязычного текста by V.V. Osokina, M.V. Shegai, a naive Bayesian classifier is used as the classifier. Various methods are used to select features, and the results obtained are compared with the results of classifying English-language text.

The article Открытое тестирование систем анализа тональности на материале русского языка (N.V. Lukashevich, I.I. Chetverkin) describes the experience of conducting an open assessment of methods for analyzing the sentiment of Russian-language texts on the basis of the ROMIP seminar in 2011 – 2012. As part of the track, several training collections were created, which are now freely available. An overview of the current state of affairs in the processing of evaluative texts in Russian, a description of the main tasks, the characteristics of the collections, and the quality measures are given.

The article Разработка системы анализа тональности текстовой информации (V.V. Garshina, K. S. Kalabukhov, V. A. Stepantsov, S. V. Smotrov) analyzes approaches to automatically determining the sentiment of text data, gives a comparative analysis of machine learning methods and algorithms for classifying the sentiment of text, and describes the developed software for identifying the sentiment of text data, which implements an approach based on supervised machine learning with an optimal set of classification parameters.

In the article Использование синтаксиса для анализа тональности твитов на русском языке (Yu.V. Adaskina, P.V. sentiment analysis in Russian. The described algorithm was applied in the track for analyzing the sentiment of tweets about banks and telecommunications companies. For these data, a classification into three classes was developed and evaluated: positive, negative and neutral.

The article Entity Based Sentiment Analysis Using Syntax Patterns and Convolutional Neural Network (Karpov IA, Kozhevnikov MV, Kazorin VI, Nemov NR) proposes an alternative method for extracting subjective sentiment in text messages, based on a modified method previously proposed by Mingbo, which first parses the syntax and then matches sentiment to the object of analysis. Two approaches to classifying mood polarity are shown: syntax rule templates and convolutional neural network (CNN).

The article Сентимент-анализ текста (Zvereva P. P.) explores the emotional assessment of the text, in particular the emotional assessment of the texts of the mass media. Such concepts as media text, media linguistics, sentimentality of the text are considered. Sentiment analysis of fragments of printed articles of one of the leading US publications, extracted from the corpus by the method of textual analysis and by keywords, is carried out. The data obtained as a result of the sentiment analysis are compared with the results of a questionnaire survey conducted among a group of respondents.

The article Применение сентимент-анализа текстов для оценки общественного мнения (Posevkina R. V., Bessmertny I. A.) describes an approach to assessing the emotional coloring of natural language texts based on tonality dictionaries. A method for automatic assessment of public opinion using sentiment analysis of reviews and discussions of published documents on the Internet, based on the statistics of words used, is proposed. A research prototype of a software system that performs sentiment analysis of a natural language text in Russian on the basis of a linear scale has been developed.

In the work Анализ тональности текстов на основе ДСМ-метода (S. Vychegzhanin, E. Kotelnikov), the analysis of the sentiment of a text based on the JSM method is considered. The advantage of the JSM method over statistical methods is the transparency and correctness of the inference process, good interpretability of the generated hypotheses, and the absence of the need for a large number of examples for training.

In the work Анализ тональности текстов с использованием нейросетевых моделей (Nefedova E. A., Mishenin A. N.), the definition of the sentiment of the text using neural networks (neural network models) is considered.

3.4.3 Local level

At the local level (in the works of master's students), no work on exactly the same problem (determining the sentiment of a text) was found; however, works studying Text Mining tools were found.

In the work Разработка распределенного поискового робота (Pranskevichus V. A.) search robots, their structure, as well as their advantages and disadvantages are studied, and an effective implementation is proposed.

In the work Методы и алгоритмы извлечения структурированных данных из текстов новостей (Sarah N. A.), an algorithm for extracting structured data from news media is proposed, the relevance of this problem is shown, implementations at different levels are considered, and an implementation of the algorithm for extracting data from news about science is presented.

In the work Разработка и исследование алгоритмов для повышения эффективности интеллектуального анализа web-контента (O. Arbuzova), algorithms for extracting data from web content are considered, their advantages and disadvantages are studied, and a more optimized algorithm for this task is proposed.

In the work Разработка и исследование алгоритма формирования семантического ядра веб-сайта на основе методов Data Mining (Kisnichenko E. A.), Data Mining tools are considered for the implementation of the set goals (creation of an algorithm for the formation of the semantic core of the site). It is assumed that this algorithm will be implemented in the site administration systems or in the means of supporting the work of SEO specialists to increase the completeness, accuracy and reduce the development time for the web sites with dynamic content.

The work Разработка алгоритмического обеспечения интеллектуального модуля анализа эмоционального содержания естественно языковых сообщений блогов и форумов (Prokapovich A.A.) considers algorithms for determining the sentiment of a text, the scientific novelty of determining the sentiment of a text, and also proposes a specific algorithm for finding emotionality in blogs and forums.

The work Исследование методов и алгоритмов определения жанра литературных произведений на основе технологии Text Mining (N. Storozhuk) notes that finding semantic similarities between texts is a serious problem for automatic text processing. The need to find the distance between documents arises in various tasks, such as plagiarism detection, authorship attribution, information search, machine translation, formation of tests and tasks, automatic construction of abstracts, etc. It is proposed to implement an algorithm for determining the literary genre of a Russian text using Data Mining.

Conclusions

As shown above, there are almost no services that determine the sentiment of a text, so the task is relevant. To make the service (program) more useful, it is also necessary to implement additional algorithms for extracting other characteristics of the text.

In this work:

Also planned:

The work is not yet finished; it is planned to be completed by May 29, 2021. Full information about the work can be obtained from the developer of the program or from the scientific supervisor or consultant.

List of sources

  1. Synthesis of an electrochromic film based on a compound of lithium fullerene and a transition metal oxide // https://en.ppt-online.org/463200 – Title from the screen;
  2. Peskova O. V. Algorithms for the classification of full-text documents // Automatic processing of texts in natural language and computational linguistics. – M .: MIEM (Moscow State Institute of Electronics and Mathematics), 2011. – P. 170 – 212.
  3. Survey of Text Mining I: Clustering, Classification, and Retrieval // Ed. by M. W. Berry. – 2004. – Springer, 2003. – 261 p.
  4. Aggarwal C. C., Zhai C. Mining Text Data // Springer, 2012. – 527 p.
  5. Do Prado H. A. Emerging Technologies of Text Mining: Techniques and Applications // Ed. by H. A. Do Prado, E. Ferneda. – Idea Group Reference, 2007. – 358 p.
  6. Text preprocessing methods // https://megapredmet.su/1-53369.html – Title from the screen;
  7. Bo Pang, Lillian Lee. Opinion Mining and Sentiment Analysis (English) // Foundations and Trends in Information Retrieval: journal. – 2008. – No. 2. – P. 1 – 135;
  8. Bing Liu. Sentiment Analysis and Subjectivity // Handbook of Natural Language Processing (English) / ed. N. Indurkhya and F. J. Damerau. – 2010. – P. 28 – 105;
  9. Automatic sentiment analysis (Sentiment Analysis) // https://habr.com/ru/post/263171 – Title from the screen;
  10. Sivic, Josef (April 2009). Efficient visual search of videos cast as text retrieval (PDF) // IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 4. IEEE. pp. 591 – 605;
  11. Harris, Zellig (1954). Distributional Structure // Word. 10 (2/3): 146 – 62. doi:10.1080/00437956.1954.11659520. "And this stock of combinations of elements becomes a factor in the way later choices are made ... for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use";
  12. Mikolov T., Chen K., Corrado G., Dean J. Efficient Estimation of Word Representations in Vector Space // In Proceedings of Workshop at ICLR. – 2013a;
  13. Mikolov T., Yih W., Zweig G. Linguistic Regularities in Continuous Space Word Representations // In Proceedings of NAACL HLT. – 2013b;
  14. Bengio Y., Ducharme R., Vincent P. A neural probabilistic language model // In Journal of Machine Learning Research. – 2003;
  15. Collobert R., Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning // In Proceedings of the 25th ICML. – 2008;
  16. Text.ru // https://text.ru/antiplagiat – Title from the screen;
  17. АНТИПЛАГИАТ // https://www.antiplagiat.com – Title from the screen;
  18. ADVEGO // https://advego.com/plagiatus – Title from the screen;
  19. Etxt Антиплагиат // https://www.softportal.com/software-17702-etxt-antiplagiat.html – Title from the screen.