Abstract
Contents
- Introduction
- 1. Theme urgency
- 2. Goal and tasks of the research
- 3. Text processing
- 3.1 Ways of preprocessing texts
- 3.2 Approaches to creating recommender systems
- Conclusion
- References
Introduction
On average, about 350 feature films are released per year [1], and this number tends to grow. Under these conditions, viewers need a way to record their impressions of films and share them with others. For this purpose, services have been developed that contain information about films and allow users to express their opinions.
Technologies are evolving to make life easier for users, so most of these services recommend films based on user preferences. Many recommendation algorithms exist, but not all of them are effective in the field of cinema; as a result, many services stop being useful once the user has rated a certain number of films.
Introducing a recommender system is commercially viable, because users are more likely to choose a service that helps them find products in a particular domain. Recommendations are used when searching for movies, music, products in an online store, news, and services of various kinds. For example, a recommender system lets a user find the next film to watch without spending much time searching, by ranking films that match the user's tastes above the others in the list.
It is therefore relevant to create a system of our own that meets the requirements of a modern user who actively uses recommendation services to find new films.
This work is devoted to the analysis of methods and models for studying the similarity of texts. Its results will be used to implement our own method of analyzing natural language texts, with the aim of improving the recommendation system developed for the bachelor's thesis project.
1. Theme urgency
To improve the system developed for the bachelor's thesis project, it was decided not only to use the genre-based algorithm for selecting recommended films, but also to analyze film descriptions and user reviews. Descriptions and reviews are unstructured information that is too labor-intensive to process manually. Nevertheless, this information must be collected and processed, if only because it makes it possible to derive new information from existing data and thereby widen the range of possible decisions. The task of automatic data analysis is therefore relevant, and many methods and models have been developed to solve it. One such approach is Data Mining.
Data Mining is the process of automatically detecting, in source data, hidden information that was previously unknown, non-trivial, practically useful, and accessible to human interpretation [2].
A separate area of knowledge processing is the analysis of unstructured textual information. Unstructured textual information is a set of documents consisting of logically coherent text that is not constrained by structural components [3].
The study examined approaches to issuing recommendations and identified the need to combine two of them: content filtering and collaborative filtering. It also determined the stages of building a list of recommended films from the analysis of information received from users and information about the films.
2. Goal and tasks of the research
In new systems, there is often no established list of user preferences from which recommendations of similar films could be generated, and for the recommended objects (films) there is no information about user interactions with them. This situation is called the cold start problem, and standard, unmodified collaborative filtering algorithms cannot be effective in this case. The looping problem occurs when a user who requests a list of recommended films too rarely adds to the list of viewed and rated objects: the system then keeps recommending the same objects. To solve these problems, hybrid systems are created that combine collaborative filtering based on user actions with content filtering based on information about the films. Thus, the aim of the study is to develop an approach to recommending films to users that solves both the cold start problem and the looping problem.
The main objectives of the study:
- Analysis of models and algorithms for text information classification.
- Analysis of metrics to determine the proximity of texts.
- Development of software model architecture.
- Modification of existing metrics for determining the proximity of texts to determine the category to which the film belongs.
- Evaluation of the effectiveness of the developed metric for determining the category to which the film belongs.
Object of study: text processing algorithms.
Subject of research: creating a recommendation system by improving existing methods for issuing recommendations.
As part of the master's thesis, it is planned to obtain relevant scientific results in the following areas:
- Development of a software model for an automated system for determining the category to which a film belongs, according to its description.
- Development of an algorithm for automated determination of the category to which the film belongs.
- Modification of well-known metrics and methods for making recommendations and evaluating the effectiveness of their application in the system.
To evaluate the theoretical results experimentally and to lay the foundation for subsequent research, it is planned, as a practical result, to develop a cross-platform, customizable, and functional recommendation system with the following properties:
- A graphical user interface in the form of a website.
- An approach to making recommendations based on user reviews, descriptions, and other data from movie information.
- Presentation of the generated list of recommended films in a human-readable form.
3. Text processing
Fig. 1 shows the process of extracting important data from texts.
The analysis of information presented in text form includes:
- Search for information. At this stage, a set of documents to be analyzed is determined and their availability for further processing is ensured. In the movie recommender system, documents include movie descriptions and user reviews.
- Document preprocessing. The next step is common to all analysis methods, but differs in implementation. All text documents found at the previous stage are pre-processed in order to isolate a certain structure for further use of this data in automatic similarity detection methods. Thus, unnecessary words are removed from the text and the text takes on a more structured form.
- Extraction of useful knowledge. At this stage, the selected Text Mining methods are applied to extract structured data from the texts. Examples include determining frequent sets of words and combining them into key concepts, calculating the probability that a document belongs to a class, compiling an index of documents for keyword search, and shortening the text while retaining its meaning.
- Processing the results. The last stage in the process of finding useful information solves the problem of analyzing the results. The result of work in a recommender system is a list of recommended objects.
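One of the knowledge extraction tasks listed above, compiling an index of documents for keyword search, can be illustrated with a minimal sketch. The function names `build_index` and `search`, as well as the three film descriptions, are hypothetical and invented for this example; this is not the index used in the developed system.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for raw_word in text.lower().split():
            word = raw_word.strip(".,!?;:")
            if word:  # skip tokens that were punctuation only
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical film descriptions, invented for illustration only.
documents = {
    "film_a": "A detective investigates a murder in a small town.",
    "film_b": "A small robot explores a distant planet.",
    "film_c": "Two friends open a bakery in Paris.",
}
index = build_index(documents)
print(search(index, "small town"))
```

With these sample descriptions, the query "small town" matches only film_a, because film_b contains "small" but not "town".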
3.1 Ways of preprocessing texts
Preprocessing is needed to prepare the text for the subsequent identification of keywords. Raw text contains many words that carry no useful information. For example, natural languages are flexible, so formally different words may have similar or identical meanings (synonyms). Non-informative words, such as auxiliary parts of speech (conjunctions, prepositions), are also unnecessary for the analysis. Therefore, at this stage all such words are deleted, and words with similar meanings are reduced to a common form. This shortens the analysis time and allows the system to give more accurate results.
The following text preprocessing methods are used:
- removing uninformative words: lists of uninformative words ("that is", "as said", "possible") are compiled in advance, and all matches are deleted throughout the text;
- morphological search (stemming): converting words into a single form suitable for a given part of speech; for example, the words "use" and "used" can be reduced to the infinitive "use". Different algorithms must be implemented for each language, taking its lexical features into account;
- n-grams: strings are divided into parts of n characters, and the characters around each such part are analyzed. This method is less sensitive to random spelling errors than the previous two methods and is independent of the linguistic representation of words, but it copes poorly with the task of reducing the number of uninformative words;
- case normalization: all alphabetic characters of the text are converted to lowercase to simplify working with the text.
Text processing is most effective when all of the above methods are used together.
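The preprocessing methods listed above can be sketched together as follows. This is only an illustration under stated assumptions: the stop-word list `STOP_WORDS` is a tiny hypothetical sample, and `naive_stem` is a toy suffix stripper for English, not a real stemming algorithm such as Porter's.

```python
import re

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "and", "or", "in", "on", "of", "to", "is", "was"}

# Toy English suffixes, checked in this order. This is an assumption made
# for illustration and does not handle irregular forms.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase the text, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(preprocess("The viewer watched films and is watching a film"))
print(char_ngrams("watch"))
```

For the sample sentence, the different forms "watched" and "watching" are both reduced to "watch", and "films" to "film", so later stages can treat them as the same keyword.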
3.2 Approaches to Creating Recommender Systems
The task of a recommendation system is to analyze user actions, the properties of objects, and the features of the recommendation domain in order to predict further user actions. There are three types of recommendation systems: those based on content filtering (item-item), those based on collaborative filtering (user-user), and hybrid systems.
Content filtering is based on the fact that each film has a profile with some parameters (for example, genre, actors). Each such profile is compared with the films that the user has rated highly, and the search algorithm looks among these profiles for the ones with the most similar parameters. It is recommended to use rated objects from a short time period only, as people's tastes change over time.
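The comparison of film profiles can be sketched as follows. The sketch assumes that a profile is a set of parameter values and uses the Jaccard coefficient as the similarity measure; both choices, along with the function names and sample profiles, are assumptions for illustration and not necessarily the metric adopted in this work.

```python
def jaccard(a, b):
    """Share of common parameters among all parameters of two profiles."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_by_content(profiles, liked, top_n=3):
    """Rank unseen films by their best similarity to any liked film."""
    scores = {}
    for film, features in profiles.items():
        if film in liked:
            continue
        scores[film] = max(
            (jaccard(features, profiles[l]) for l in liked), default=0.0
        )
    # Sort by descending score, then by name to keep the order stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical profiles: genres and actors, invented for illustration.
profiles = {
    "film_a": {"crime", "thriller", "actor_x"},
    "film_b": {"crime", "drama", "actor_x"},
    "film_c": {"comedy", "romance", "actor_y"},
    "film_d": {"thriller", "crime", "actor_z"},
}
result = recommend_by_content(profiles, liked={"film_a"})
print(result)
```

Here film_b and film_d each share two of four distinct parameters with the liked film_a (similarity 0.5), while film_c shares none and is ranked last.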
Collaborative filtering reflects the attitude of other users toward a film and is based on building a table of rated films for each user. The system searches for users who have rated the same movies as the current user. From the film lists of these users, the films that the current user has not yet rated but that the other users rated highly are added to the current user's recommendations.
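A minimal sketch of this user-user scheme is given below. The rating scale (1 to 5), the threshold for "rated highly", and the idea of weighting a candidate film by the number of films the two users have in common are assumptions made for illustration; the source does not fix these details.

```python
def recommend_collaborative(ratings, user, min_common=1, like_threshold=4):
    """Recommend films liked by users who rated the same films as `user`."""
    own = ratings[user]
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        common = set(own) & set(other_ratings)
        if len(common) < min_common:
            continue  # this user shares too few rated films
        for film, rating in other_ratings.items():
            if film not in own and rating >= like_threshold:
                # Assumed weighting: more films in common, more trust.
                scores[film] = scores.get(film, 0) + len(common)
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical rating table, invented for illustration.
ratings = {
    "alice": {"film_a": 5, "film_b": 4},
    "bob": {"film_a": 5, "film_b": 5, "film_c": 5, "film_d": 2},
    "carol": {"film_a": 4, "film_e": 5},
    "dave": {"film_f": 5},
}
print(recommend_collaborative(ratings, "alice"))
```

For alice, film_c comes first because bob shares two rated films with her, film_e follows via carol, film_d is skipped because bob rated it low, and dave contributes nothing because he has no films in common with alice.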
The following methods are suggested for making recommendations for a specific user:
- Analysis of film descriptions in order to highlight key concepts and association rules; based on the association rules, the user will be recommended films whose descriptions contain keywords from the descriptions of films that he already liked.
- Analysis of reviews of films with which the user has not yet interacted. It is proposed to evaluate the tone of reviews, i.e. to highlight emotionally colored vocabulary and reveal the attitude of the review's author to the film. If the reviews have a positive tone, then it is highly likely that the user will like the film.
- Analysis of the reviews that the current user leaves for the films he has interacted with. Text Mining methods make it possible to find out what exactly the user liked or disliked in a film and to select new films based on this information. For example, if a user indicated in a review that he liked the plot of a film, the recommendation system will select films for him using parameters related to the plot.
- Statistical analysis based on user ratings, film genres, participating actors, directors, etc.
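The evaluation of review tone mentioned in the list above can be sketched with a simple lexicon-based score. The word lists `POSITIVE` and `NEGATIVE` are tiny hypothetical samples, and counting lexicon words is only one possible way to estimate tone; the method actually used in the system is not specified in the source.

```python
import re

# Hypothetical lexicons of emotionally colored words (illustration only).
POSITIVE = {"great", "brilliant", "enjoyable", "loved", "excellent"}
NEGATIVE = {"boring", "weak", "terrible", "hated", "dull"}


def review_tone(text):
    """Return a score in [-1, 1]: above 0 is positive, below 0 is negative."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no emotionally colored vocabulary found
    return (pos - neg) / (pos + neg)


print(review_tone("A brilliant plot and great acting, but a dull ending."))
```

The sample review contains two positive words and one negative word, so its score is (2 - 1) / 3, a mildly positive tone. A sketch like this ignores negation ("not great") and sarcasm, which a practical system would need to handle.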
Thus, the developed recommendation system is hybrid, because it includes both content and collaborative filtering methods. This approach removes the main drawback of new systems, the lack of information from users, and reduces the problem of recommending the same objects thanks to the variety of methods used to produce recommendations.
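One possible way to combine the two kinds of filtering is sketched below. It assumes that both methods produce a score in [0, 1] and that the weight of the collaborative score grows linearly with the number of films the user has rated; the function `hybrid_score` and the parameter `full_trust_at` are hypothetical and serve only to show how a hybrid scheme can fall back on content filtering during a cold start.

```python
def hybrid_score(content_score, collab_score, n_user_ratings, full_trust_at=20):
    """Blend content and collaborative scores for one candidate film.

    With no ratings (cold start) only the content score is used; once the
    user has rated `full_trust_at` films (an assumed threshold), only the
    collaborative score is used.
    """
    weight = min(n_user_ratings / full_trust_at, 1.0)
    return (1 - weight) * content_score + weight * collab_score


# A new user: the collaborative score is unavailable, so it is ignored.
print(hybrid_score(content_score=0.8, collab_score=0.0, n_user_ratings=0))
# A user with 10 ratings: both scores contribute equally.
print(hybrid_score(content_score=0.8, collab_score=0.2, n_user_ratings=10))
```

A linear weight is the simplest choice; other blending schemes are equally possible and would need to be compared experimentally.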
Conclusion
This work considers the main tasks of Data Mining with regard to their applicability in a film recommendation system. To solve the recommendation problem, approaches are proposed to the statistical analysis of the independent parameters of objects (films) and to the analysis of natural language textual information, such as film descriptions and reviews. In the future, it is planned to formalize the algorithms, implement them, and conduct experiments to evaluate the effectiveness of the system.
The master's thesis is devoted to the relevant scientific problem of processing textual information. As part of the research:
- The types of recommendation systems and the principles of issuing recommendations were considered.
- Based on an analysis of the literature, the main algorithms that can be used in the proposed approach to recommending films were identified.
- Data Mining methods were analyzed with respect to the problem of comparing textual information and extracting useful knowledge.
- A combination of recommendation approaches was proposed.
Further research focuses on the following aspects:
- Qualitative improvement of the proposed approach to creating a list of recommended films, its addition and expansion.
- Adaptation of well-known methods of providing recommendations and analyzing texts in order to extract useful knowledge.
- Development of a cross-platform and functional recommendation system in the form of a web service.
At the time of writing this abstract, the master's thesis is not yet complete. Final completion: June 2019. The full text of the work and materials on the topic can be obtained from the author or his supervisor after that date.
References
- Gomzin A. G., Korshunov A. V. Recommender systems: an overview of modern approaches [Text]. Moscow: Proceedings of the Institute for System Programming of the RAS, 2012. 20 p.
- Batura T. V. Methods of automatic text classification. Novosibirsk: A. P. Ershov Institute of Informatics Systems SB RAS, 2017. pp. 87-93.
- Principles of operation of Internet recommendation engines [Electronic resource]. Available at: https://www.ibm.com/developerworks/ru/library/os-recommender1/index.html (accessed: 24.11.2019).
- Barsegyan A. A., Kupriyanov M. S., Kholod I. I., Tess M. D., Elizarov S. I. Analysis of data and processes: a textbook. 3rd ed., revised and expanded. St. Petersburg: BHV-Petersburg, 2009. 512 p.