Abstract
- Introduction
- 1. Relevance of the topic
- 2. Purpose and objectives of the study, planned results
- 3. Sentiment analysis of text
- 3.1 Algorithm overview
- 3.1.1 Concept of Sentiment with Bag-of-Words
- 3.1.2 Concept of Sentiment with TF-IDF
- Conclusions
- List of sources
Introduction
In recent years, considerable effort has gone into studying and classifying two of the three main components of music: tone and rhythm. The third component, the song lyrics, has received comparatively little attention.
The study of song lyrics differs from general sentiment analysis of, say, literary fiction. Lyrics have attributes that would be out of place in literature or formal prose, such as repetition, rhyme, and rhythmic delivery. Moreover, lyrics can be assumed to be more stereotyped, or even clichéd, which makes them more amenable to pattern analysis than prose in general.
1. Relevance of the topic
Lyrics are unstructured information that is difficult to process manually. Collecting and processing this information is nevertheless worthwhile, because it yields new information from existing data and thereby widens the range of decisions that can be made. The problem of automatic data analysis is therefore relevant, and many methods and models have been developed to solve it. One such approach is Data Mining.
Data Mining is the process of automatically discovering hidden information in raw data that was previously unknown, is non-trivial and practically useful, and is open to human interpretation [1].
The topic is timely: many music services (Apple Music, Yandex.Music) already analyze the songs a user listens to and adjust playlists to match the mood of the songs the user plays most often.
2. Purpose and objectives of the study, planned results
Object of research: determining the sentiment of song lyrics.
Subject of research: the effectiveness of text sentiment analysis methods as applied to a library of song lyrics.
The aim of the research is to study approaches to text sentiment analysis and to develop a tool that analyzes the sentiment of uploaded lyrics and generates statistics from the results.
The main objectives of the study:
- study algorithms for determining the sentiment of a text;
- create our own corpus of popular song lyrics;
- develop our own algorithm for determining the sentiment of a text, using the lyrics of famous performers as an example;
- develop a software model that determines the sentiment of uploaded lyrics and compiles statistics from the results.
As part of the master's thesis, scientific results are expected in the following areas:
- advantages and disadvantages of algorithms for determining the sentiment of a text;
- advantages and disadvantages of determining the sentiment of a text by computer.
The planned system is expected to provide:
- an intuitive user interface with prompts;
- an optimized algorithm;
- the ability to save session results for later work.
3. Sentiment analysis of text
Sentiment analysis, also called opinion mining, is a natural language processing approach that identifies the emotional tone behind a body of text. Organizations commonly use it to identify and categorize opinions about a product, service, or idea.
Sentiment analysis systems help gather information from disorganized, unstructured text that comes from online sources such as emails, blog posts, support tickets, web chats, social media channels, forums, and comments. These systems replace manual data processing with rule-based, automatic, or hybrid methods [7]. Rule-based systems apply predefined, lexicon-based rules, while automatic systems learn from data using machine learning techniques. Hybrid sentiment analysis combines both approaches.
In general, text sentiment analysis is equivalent to text classification in which the categories are sentiment ratings. Examples of sentiment ratings:
- positive;
- negative;
- neutral (the text does not contain emotional connotation).
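As an illustration of how a text can be assigned one of these three ratings, here is a minimal rule-based sketch in Python. The word lists `POSITIVE_WORDS` and `NEGATIVE_WORDS` and the function names are hypothetical and chosen only for this example; a practical system would rely on a curated sentiment lexicon and handle negation, intensifiers, and similar cases.

```python
import re

# Hypothetical toy lexicons; a real system would use a curated sentiment dictionary.
POSITIVE_WORDS = {"love", "loving", "happy", "true", "sweet"}
NEGATIVE_WORDS = {"cry", "lonely", "hate", "pain", "lost"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify_sentiment(text):
    """Return 'positive', 'negative', or 'neutral' from lexicon hit counts."""
    tokens = tokenize(text)
    positive_hits = sum(token in POSITIVE_WORDS for token in tokens)
    negative_hits = sum(token in NEGATIVE_WORDS for token in tokens)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no emotional words, or hits cancel out
```

For example, `classify_sentiment("All my loving I will send to you")` finds one positive word and no negative ones, so the text is rated positive.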
The main stages of sentiment analysis are shown in Figure 1 [8].
3.1 Algorithm overview
Sentiment analysis uses a variety of natural language processing techniques and algorithms, which this section examines in more detail.
The main types of sentiment analysis algorithms are:
- Rule-based: systems perform sentiment analysis using a set of manually created rules.
- Automatic: systems rely on machine learning techniques to learn from data.
- Hybrid: systems combine rule-based and automatic approaches.
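To make the automatic approach more concrete, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing on word counts. Naive Bayes is only one possible machine learning technique and is chosen here as an assumption for illustration; the training examples are invented, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count label and per-label word frequencies from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the label with the highest log-posterior under add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        label_total = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # skip words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only.
training = [
    ("my loving darling true".split(), "positive"),
    ("sweet love happy day".split(), "positive"),
    ("lonely night cry pain".split(), "negative"),
    ("lost love cry alone".split(), "negative"),
]
model = train_naive_bayes(training)
```

With this toy model, `predict("darling true".split(), model)` returns `"positive"`, because both words occur only in the positive training examples. A hybrid system could, for instance, fall back to manually written rules when the learned model has seen none of the words in a text.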
3.1.1 Concept of Sentiment with Bag-of-Words
Bag-of-words (BoW) is a way to extract features from text for use in modeling, for example with machine learning algorithms [9]. The approach is simple and flexible, and it can be used in a variety of ways to extract features from documents.
A bag of words is a text representation that describes the occurrence of words in a document. It involves two things:
- a vocabulary of known words;
- a measure of the presence of known words.
It is called a bag of words because any information about the order or structure of words in the document is discarded. The model only cares about whether known words occur in the document, not where in the document they occur.
Intuitively, documents are similar if they have similar content. Moreover, the content alone tells us something about the meaning of the document.
A bag of words can be as simple or as complex as needed. The complexity lies both in deciding how to build the vocabulary of known words (or tokens) and in how to score the presence of known words.
Consider a corpus C of D documents {d1, d2, ..., dD}, from which N unique tokens are extracted. The N tokens (words) form the vocabulary, and the bag-of-words matrix M has size D × N. Row i of M contains the frequency of each token in document di. For example, take two documents:
- D1: All my loving I will send to you.
- D2: All my loving, darling I will be true.
A vocabulary is built from the unique words across all documents (my, loving, I, send, you, true, darling). The words all, will, to, and be are left out of this set because they do not carry information that the model needs.
The resulting 2 × 7 matrix M (D = 2 is the number of documents, N = 7 is the number of words in the vocabulary) is presented in Table 1.
Document | my | loving | I | send | you | true | darling
D1 | 1 | 1 | 1 | 1 | 1 | 0 | 0
D2 | 1 | 1 | 1 | 0 | 0 | 1 | 1
The table above shows the training features: the frequency of each vocabulary term in each document. The approach is called bag of words because only the number of occurrences matters, not the sequence or order of the words.
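The construction of the matrix for documents D1 and D2 can be sketched in a few lines of Python. The stop-word set `STOP_WORDS` (which here also drops "to", a word absent from the table) and the helper names are assumptions made for this example, not part of any particular library.

```python
import re

# Stop words dropped in this example (an assumed, minimal list).
STOP_WORDS = {"all", "will", "to", "be"}

documents = [
    "All my loving I will send to you.",
    "All my loving, darling I will be true.",
]


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_vocabulary(docs):
    """Collect unique tokens in order of first appearance."""
    vocabulary = []
    for doc in docs:
        for token in tokenize(doc):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def bag_of_words(docs, vocabulary):
    """Return a D x N matrix of token counts, one row per document."""
    return [[tokenize(doc).count(term) for term in vocabulary] for doc in docs]


vocabulary = build_vocabulary(documents)
matrix = bag_of_words(documents, vocabulary)
```

Here `vocabulary` holds seven tokens and `matrix` has two rows of seven counts. Because the vocabulary follows the order of first appearance, the column for "darling" comes before the column for "true"; the counts themselves do not depend on the column order.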
3.1.2 Concept of Sentiment with TF-IDF
TF-IDF (TF stands for term frequency, IDF for inverse document frequency) is a statistical measure of how relevant a word is to a document in a collection of documents [10].
TF-IDF was developed for document search and information retrieval. Its value increases in proportion to the number of times a word appears in a document and is offset by the number of documents that contain the word. Words that are common to every document, such as "this", "what", and "if", therefore rank low even though they may appear many times, since they mean little to any particular document.
TF-IDF is usually the product of two terms:
- term frequency (TF);
- inverse document frequency (IDF).
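The first term is the normalized term frequency (TF): the number of times the term occurs in a document divided by the total number of words in that document.
Consider the lyrics of Michael Jackson's The Way You Make Me Feel, containing 493 words, in which the word Baby occurs 10 times. The term frequency (TF) for the word Baby is then 10 / 493 ≈ 0.0203.
The second term is the inverse document frequency (IDF): the logarithm of the total number of documents divided by the number of documents that contain the term.
Suppose we have the lyrics of 190 Michael Jackson songs, and the word Baby occurs in 85 of them. With a base-10 logarithm, the inverse document frequency (IDF) is log10(190 / 85) ≈ 0.35.
TF-IDF is then calculated as the product of the two terms:
$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)$$
In the examples above, the term frequency is 0.0203 and the inverse document frequency is 0.35.
Thus, TF-IDF is the product of these values: 0.0203 × 0.35 ≈ 0.0071.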
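The arithmetic of the worked example can be checked with a short Python sketch. It assumes that term frequency is the term count divided by the document length and that IDF uses a base-10 logarithm of the ratio of all documents to the documents containing the term; other TF-IDF variants (natural logarithm, smoothing) would give different numbers. The function names are hypothetical.

```python
import math


def term_frequency(term_count, total_terms):
    """Share of the document's words taken up by the term."""
    return term_count / total_terms


def inverse_document_frequency(total_docs, docs_with_term):
    """Base-10 log of the ratio of all documents to documents containing the term."""
    return math.log10(total_docs / docs_with_term)


# Values from the example: "Baby" occurs 10 times in a 493-word song
# and appears in 85 of 190 songs.
tf = term_frequency(10, 493)
idf = inverse_document_frequency(190, 85)
tf_idf = tf * idf

print(f"TF = {tf:.4f}, IDF = {idf:.2f}, TF-IDF = {tf_idf:.4f}")
```

Under these assumptions, a word that occurs in every song would get an IDF of log10(1) = 0 and therefore a TF-IDF of 0, regardless of how often it appears in a single song.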
Conclusions
In the course of this work, algorithms and approaches for text sentiment analysis were reviewed, and their advantages and disadvantages were identified. As a result of this review, the following directions for further research can be noted:
- A detailed study of the Bag-of-words and TF-IDF methods
- Development of combined methods
The Bag-of-words and TF-IDF methods are of particular interest, since they are simple to understand and implement. A promising direction is to develop our own algorithm based on them.
At the time of writing this abstract, the master's thesis has not yet been completed. Final completion: June 2022. The full text of the thesis and related materials can be obtained from the author or his supervisor after that date.
List of sources
- Пескова О. В. Алгоритмы классификации полнотекстовых документов // Автоматическая обработка текстов на естественном языке и компьютерная лингвистика. – М.: МИЭМ (Московский государственный институт электроники и математики), 2011. – С. 170–212.
- Rachel Harsley. Hit Songs’ Sentiments Harness Public Mood & Predict Stock Market / Rachel Harsley, Bhavesh Gupta, Barbara Di Eugenio, Huayi Li // WASSA 16. – 2016. – pp. 17–25.
- Yunqing Xia. Lyric-based Song Sentiment Classification with Sentiment Vector Space Model / Yunqing Xia, Linlin Wang, Kam-Fai Wong, Mingxing Xu // ACL-08. – 2008. – pp. 133–136.
- Сперцян К. М. Сравнительный анализ методов определения эмоциональной окраски сообщений в социальных сетях с применением обучения с учителем / Н. Ю. Рязанова, К. М. Сперцян // Новые информационные технологии в автоматизированных системах. Компьютерные и информационные науки. – 2018.
- Семина Т. А. Анализ тональности текста: современные подходы и существующие проблемы / Т. А. Семина // Социальные и гуманитарные науки. Отечественная и зарубежная литература. Сер. 6, Языкознание: Реферативный журнал. – 2020. – С. 47–64.
- Ландэ Д. В. Интернетика. Навигация в сложных сетях. Модели и алгоритмы. – М.: Книжный дом «ЛИБРОКОМ», 2009. – С. 87–88.
- Классификация текстов и анализа тональности [Electronic resource].
- Wisam A. Qader. An Overview of Bag of Words; Importance, Implementation, Applications, and Challenges / Wisam A. Qader, Musa M. Ameen, Bilal I. Ahmed // Fifth International Engineering Conference on Developments in Civil & Computer Engineering Applications 2019 (IEC2019). – Erbil, Iraq.
- Bijoyan Das. An Improved Text Sentiment Classification Model Using TF-IDF and Next Word Negation / Bijoyan Das, Sarit Chakraborty // IEEE, Kolkata, India. – 2018.
- Bijoyan Das. An Improved Text Sentiment Classification Model Using TF-IDF and Next Word Negation / Bijoyan Das, Sarit Chakraborty // IEEE, Kolkata, India. – 2018.