Baev Dmitry Eduardovich
Faculty of Intelligent Systems and Programming
Department of Software Engineering named after L.P. Feldman
Specialty: Methods and tools for software development

Classification of texts on websites based on the subject area

Supervisor: Ph.D., Associate Professor of the Software Engineering Department, Skvortsov Anatoly Efremovich
Consultant: Senior Lecturer of the Software Engineering Department, Kolomoytseva Irina Alexandrovna
This abstract describes a work that has not yet been completed. Estimated completion date: June 2022. Please contact the author or his scientific adviser after that date to obtain the complete text.
Abstract
1 Relevance of the topic

Text classification is a Data Mining technique, and Data Mining in turn is considered one of the nine main methods for processing big data (Big Data). The term Big Data denotes arrays of information that cannot be processed or analyzed by traditional means, that is, by human labor and desktop computers. Another feature of Big Data is that such arrays keep growing exponentially over time, so operational analysis of the collected material requires the computing power of supercomputers. Accordingly, processing Big Data calls for cost-effective, innovative methods of handling information and extracting insights.
The problems of processing large volumes of textual information, including determining the sentiment of text documents and classifying them by various parameters, have remained highly relevant over the past few years. This can be judged simply from the main sources that feed Big Data; the principal ones are:

  • Internet of Things (IoT), as well as connected devices;
  • social networks, blogs and media;
  • company data: transactions, orders for goods and services, taxi and car sharing trips, customer profiles;
  • instrument readings: meteorological stations, air and water composition meters, satellite data;
  • statistics of cities and states: data on movements, births and deaths;
  • medical data: tests, diseases, diagnostic images.
2 Goals and objectives of the study, planned results

The purpose of this work is the software implementation of one of the Data Mining tasks: the classification of texts by subject area.
Based on this purpose, the following tasks related to the processing of large amounts of information were set:

  • explore the main issues related to Big Data and Data Mining in particular;
  • study the tools used for processing big data (DBMS, programming languages, frameworks);
  • consider examples of the use and implementation of Big Data and Data Mining algorithms;
  • programmatically implement one of the Data Mining algorithms - text classification.

The object of the study is Data Mining technology - one of the nine main methods for processing big data.
The subject of the study is the classification of texts.

3 Overview of research and development

The research area is popular not only in the international scientific community but also in national ones.

3.1 Review of international sources

In the international, primarily English-language space, questions of programming and software development are covered by the O'Reilly publishing house. The publisher pays close attention to making the studied material understandable: whatever industry or area of development its authors address, O'Reilly strives first of all to present the information intelligibly in its articles, books, and magazines, and thanks to this it has practically monopolized the delivery of such information to a wide audience.
O'Reilly Media produces a great deal of printed material on programming and development; within the researched topic of data processing, the following books, considered de facto standards internationally, stand out: "Data Science from Scratch", "Fundamentals of Data Engineering", and "Generative Deep Learning". Other English-language international publishers also offer good textbooks that cover nuances omitted by these authors.

3.2 Overview of national sources

As for national interest in the field of big data processing, virtually all of the available literature consists of Russian translations of English-language sources, so the study of Big Data algorithms can be expected to remain in demand over the next five years. Domestically, most material on Big Data research comes down to Internet articles by unknown authors. A similar situation can be traced in the scientific space of the Russian Federation: the articles are, in one form or another, translations of English-language sources and studies.

3.3 Overview of local sources

On the masters' portal of Donetsk National Technical University, several master's theses similar in subject matter were found.

A student of DonNTU, Berdyukova Svetlana Sergeevna, conducted a study of methods for analyzing the sentiment of texts in order to characterize how society perceives news from the field of culture [1]. In this study, she considered the concepts of Text Mining and Sentiment Analysis, as well as issues related to document classification.

Seryozhenko Anna Alexandrovna also studied big data processing and recorded her results in the work "Study of sentiment analysis methods using the example of song lyrics" [2]. Her work was likewise based on the concept of Text Mining, with an in-depth analysis of how music services work. The relevance of the study lay in analyzing the songs a user listens to and adjusting playlists to the mood of the tracks played most often.

Lyutova Ekaterina Igorevna studied methods of classifying information using a Bayesian classifier [3]. Her research is motivated by the rapid growth in popularity of electronic communications, including email, and the low cost of using them, which results in an ever-increasing flow of unsolicited mass mailings. To address the problem of such mailings, Ekaterina considered classification based on the Bayesian method, which relies on the rule that some words occur more often in spam and others in ordinary letters; the algorithm becomes ineffective when this assumption does not hold.

Pilipenko Artem Sergeevich considered issues related to methods and algorithms for determining the sentiment of natural-language text [4]. In his study, Artem focused on sentiment determination, since not all Text Mining tools can determine the tone of a text simultaneously with the other characteristics that interest the user.

Guma Svetlana Nikolaevna studied methods of comparative text analysis using the example of a film recommender system [5]. As a practical result, intended for experimental evaluation of the theoretical findings and as a foundation for subsequent research, Svetlana planned to develop a cross-platform, customizable, and functional recommender system.

A student of DonNTU, Vlasyuk Dmitry Alexandrovich, conducted a study of methods for extracting knowledge about sports competitions from HTML pages on the Internet [6]. Dmitry also considered the preliminary processing of information and its automatic collection and handling.

Storozhuk Natalya Olegovna prepared a practical study of methods and algorithms for determining the genre of literary works based on Text Mining technology [7], during which she designed and implemented a system for determining the genre of a literary work. Along the way, Natalya considered the problem of effective automated text processing.

Titarenko Mikhail Gennadievich investigated methods for classifying information about the foreign trade activities of states within an information retrieval system [8]. Mikhail also considered the problem of universal automatic classification, to solve which he proposed studying and implementing several specialized algorithms.

A student of DonNTU, Poletaev Vladislav Anatolyevich, researched image search methods in graphic databases [9]. This study does not involve Text/Data Mining technology, but it is directly related to solving one of the main problems of Big Data if we consider the question of obtaining data from cloud storage. Vladislav himself noted this in his work: "Searching in a large amount of information is a complex task that requires the development of efficient indexing and search algorithms, along with the creation of productive software systems that implement these algorithms."

My adviser, Irina Alexandrovna Kolomoytseva, works on Big Data issues and over the past few years has been fostering students' interest in the topic of Big Data.

4 Big data theory

Big data refers to information whose volume can exceed hundreds of terabytes and reach petabytes, and which, moreover, is regularly updated. Examples include data coming from contact centers, social media, stock exchange trading, etc. [10, 11, 12]. The concept of "big data" sometimes also covers the methods and approaches used to process it.
In terms of terminology, "Big Data" means not only the data itself but also the principles of processing it, the possibilities for its further use, and the procedures for finding a specific block of information in large arrays. Questions related to such processes do not lose their relevance; solving them is important for systems that have been generating and accumulating diverse information for many years [11].

4.1 Information criteria that determine belonging to Big Data

There are criteria, defined in 2001 by the Meta Group, that make it possible to evaluate whether data corresponds to the concept of Big Data or not [11]:

  • Volume - approximately 1 petabyte and above;
  • Velocity - data is generated, received, and processed at high speed;
  • Variety - heterogeneity of the data, different formats, and a possible lack of structure;
  • Variability [13] - varying intensity of data inflow, which affects the choice of processing methods;
  • Value - differences in the level of complexity of the information received.

Thus, data coming from chatbot messages in online stores has one level of complexity, while data produced by machines tracking the planet's seismic activity is on a completely different level.
In most cases, the incoming raw data is stored in a so-called "data lake" (Data Lake) [10, 11, 12, 14, 15, 17]. The format and level of structure of this information can vary [15]:

  • structured (data in the form of rows and columns);
  • partially structured (logs, CSV, XML, JSON files);
  • unstructured (PDF files, office documents, etc.);
  • binary (video, audio and image formats).
4.2 Tools for storing and processing data in Data Lake

A Data Lake [10, 11, 12, 14, 15, 17], in addition to its storage function, also includes a software platform (such as Hadoop) and defines data sources and replenishment methods, clusters of nodes for storing and processing information, and management and training tools. A Data Lake scales to many hundreds of nodes as needed without stopping the cluster.
The "lake" is usually located in the cloud: about 72% of companies working with Big Data prefer cloud infrastructure to their own servers. Processing large databases requires serious computing power, and the cloud significantly reduces the cost of this work, which is why companies choose cloud storage. The cloud imposes no limit on the amount of data stored, so it saves costs both for companies whose workload is growing rapidly and for businesses built around testing various hypotheses.
Hadoop [10, 14, 16, 18] is a package of utilities and libraries used to build systems that process, store, and analyze large amounts of non-relational data: data from sensors, Internet traffic, JSON objects, log files, images, and social media messages.
HPCC (DAS, Data Analytics Supercomputer) is a platform capable of processing data in real time or in batch mode. It was created by LexisNexis Risk Solutions.
Storm is a Big Data framework designed to work with real-time information. Developed in the Clojure programming language.

4.3 Three main principles of working with big data
4.3.1 Horizontal scalability

The amount of data is unlimited, so the system that processes it must be able to expand: as data volumes increase, the amount of equipment must increase proportionally to maintain the operability of the entire system.

4.3.2 Fault tolerance

Horizontal scalability implies a large number of machines in a computing cluster; for example, some Hadoop clusters contain over 40,000 machines [13]. Inevitably, the equipment wears out and will periodically fail, and big data processing systems must function in such a way that they safely survive such failures.

4.3.3 Data locality

In large systems, data is distributed across many pieces of equipment. If the data is located on one server while its processing takes place on another, the cost of transferring the information between servers may exceed the cost of the processing itself [13]. To avoid this, data should be kept on the same equipment where it is processed.

4.4 Nine main methods of big data processing
4.4.1 Machine learning

At the core of this data-analysis method is the ability of an analytical system to learn on its own while solving various problems: the program is given an algorithm that allows it to learn to identify certain patterns. The areas of application are quite diverse [12, 15]; for example, machine learning is used in marketing research, social networks use it to select posts to show, and medical applications are being developed with it.
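As a minimal illustration of this idea applied to the topic of the thesis, the sketch below trains a simple topic classifier with scikit-learn. The toy texts, labels, and choice of model are assumptions made purely for demonstration, not the final implementation of this work.

    # A minimal sketch of supervised text classification by subject area (toy data).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical training examples: short texts labelled with a subject area.
    texts = [
        "the team won the championship final",
        "the stock market fell after the report",
        "a new vaccine passed clinical trials",
        "the striker scored twice in the derby",
        "the central bank raised interest rates",
        "doctors recommend regular screening",
    ]
    labels = ["sport", "finance", "medicine", "sport", "finance", "medicine"]

    # TF-IDF features plus naive Bayes: the system learns the patterns itself.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)

    print(model.predict(["the goalkeeper saved a penalty"]))  # expected: ['sport']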

4.4.2 Neural network

Neural networks are used, among other things, to recognize visual images [10, 12, 15]. They are mathematical models expressed in program code, and they work on the principle of a living being's neural network: receive information, process and transmit it, and produce a result.
A neural network can do the work of several dozen people. It is used in various social and professional fields: entertainment, forecasting, security, medical diagnosis, and so on.

4.4.3 Data mining technology

The mathematician Grigory Pyatetsky-Shapiro introduced this term in 1989. The method involves detecting certain patterns in raw data [11]. Data Mining is used for the following tasks (a small clustering sketch follows the list):

  • detecting atypical data in the general flow of information through deviation analysis;
  • searching for identical information in different sources using associations;
  • determining the factors that influence a given parameter through regression analysis;
  • grouping data with similar characteristics (clustering);
  • assigning records to predefined classes (classification).
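As mentioned above, here is a small sketch of one of these tasks, grouping records with similar characteristics (clustering). The sample points and the choice of k-means are assumptions made only for illustration.

    # A minimal clustering sketch: group similar records without predefined classes.
    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical records: (average check, purchases per month) for six customers.
    records = np.array([
        [5.0, 1], [6.0, 2], [5.5, 1],        # occasional low spenders
        [50.0, 20], [55.0, 25], [48.0, 18],  # frequent high spenders
    ])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)
    print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: two groups of similar customers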
4.4.4 Crowdsourcing strategy

In some situations, when developing an AI (artificial intelligence) system offers no economic benefit, a large number of people are recruited to perform one-off work. They can solve problems that a computer cannot handle on its own. An example is the collection and processing of sociological survey data: such information may not be digitized and may contain errors and abbreviations. A person can understand this format and organize the data into a form that program algorithms can read.

4.4.5 Predictive analytics method

In other words, this is a forecasting technique. With enough relevant information, you can make a forecast and answer the question "How will events develop?". The principle of predictive analytics is as follows: first, examine the data for the past period; then identify the patterns or factors that caused the result; finally, using a neural network or mathematical calculations, build a model that can make predictions.
The forecasting technique is used in various fields [10, 12]. For example, predictive analytics makes it possible to identify and prevent fraudulent schemes in lending or insurance. In medicine, predictive analysis based on patient data helps determine a patient's predisposition to particular diseases.
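A minimal sketch of this principle could look as follows, assuming a toy monthly sales history and an ordinary linear trend; the numbers and the model are illustrative assumptions rather than a recommended forecasting method.

    # A minimal forecasting sketch: fit a trend to past data, predict the next period.
    import numpy as np

    np.random.seed(0)
    months = np.arange(1, 13)                              # past period: months 1..12
    sales = 100 + 5 * months + np.random.normal(0, 3, 12)  # hypothetical history

    # "Identify the pattern" with a simple linear fit, then extrapolate one month ahead.
    slope, intercept = np.polyfit(months, sales, deg=1)
    forecast_month_13 = slope * 13 + intercept
    print(round(forecast_month_13, 1))  # an answer to "how will events develop?"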

4.4.6 Principle of statistical analysis

The essence of the method is to collect data, study it on the basis of specific parameters, and obtain a result, usually expressed as a percentage. The method has a weak link: inaccuracy on small samples. Therefore, to obtain the most accurate results, a large amount of initial data must be collected [10, 15].
Statistical analysis is often used as part of another way to process Big Data [10, 12, 15], such as in machine learning or predictive analytics.
The following are used to obtain statistical indicators [19] (a small correlation sketch follows the list):

  • correlation analysis, to determine the interdependence of indicators;
  • expression of analysis results as percentages;
  • time series, to assess the intensity of change under certain conditions over a specific time interval;
  • calculation of averages.
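Here is a small sketch of the first item, correlation analysis, on assumed temperature and sales data; the Pearson coefficient computed by NumPy measures the interdependence of the two indicators.

    # A minimal correlation-analysis sketch on assumed data.
    import numpy as np

    temperature = np.array([10, 15, 20, 25, 30, 35])       # hypothetical daily averages
    ice_cream_sales = np.array([20, 35, 50, 70, 85, 100])  # hypothetical sales volumes

    # Pearson correlation coefficient between the two indicators.
    r = np.corrcoef(temperature, ice_cream_sales)[0, 1]
    print(round(r, 3))  # values close to 1 indicate strong interdependence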
4.4.7 Simulation technology

Simulation modeling differs from the forecasting technique in that it takes into account factors whose influence on the result is difficult to track in real conditions: models are built on hypothetical rather than real data, and these models are then explored in a virtual environment [10, 12, 15].
The method of simulation models is used to analyze how various circumstances affect the final indicator. In sales, for example, the impact of price changes, discount offers, the number of sellers, and other conditions is examined this way. Varying these conditions helps determine the most effective marketing strategy to implement in practice. For this kind of modeling, a large number of possible factors must be taken into account to reduce the risk of unreliable results.
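A minimal Monte Carlo style sketch of this idea is shown below; the demand model, the price range, and all coefficients are hypothetical assumptions chosen only to show how hypothetical rather than real data can be explored.

    # A minimal simulation sketch: explore the effect of price on expected revenue
    # using an assumed demand model instead of real observations.
    import numpy as np

    rng = np.random.default_rng(seed=42)
    prices = np.linspace(5.0, 15.0, 11)  # candidate prices to explore

    for price in prices:
        # Assumed demand model: higher price -> lower demand, plus random noise.
        demand = np.maximum(0, 200 - 12 * price + rng.normal(0, 10, size=1000))
        revenue = price * demand.mean()
        print(f"price {price:5.1f} -> expected revenue {revenue:8.1f}")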

4.4.8 Analytical data visualization method

Data visualization is used to make the results of analysis easier to evaluate. When working with big data, this method is implemented with virtual reality and "large screens". The main advantage of visualization is that this format is perceived better than text, because a person assimilates up to 90% of all information through vision.
The analytical data visualization method makes it possible to quickly perceive and compare, for example, sales levels in different regions, or to evaluate how sales volumes depend on a decrease or increase in the price of goods.
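A minimal Matplotlib sketch, assuming made-up regional sales figures, shows the kind of quick visual comparison described above.

    # A minimal visualization sketch: compare sales levels across regions (assumed data).
    import matplotlib.pyplot as plt

    regions = ["North", "South", "East", "West"]
    sales = [120, 95, 140, 80]  # hypothetical sales volumes

    plt.bar(regions, sales)
    plt.title("Sales by region (illustrative data)")
    plt.ylabel("Units sold")
    plt.show()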

4.4.9 Data mixing and integration method

In the vast majority of cases, Big Data is obtained from various sources, so it arrives in heterogeneous formats [12, 13]. Loading such data into one database as is makes no sense, since its parameters are not mutually related. It is in such cases that mixing and integration are used, that is, all the data is brought to a single form.
To use information from various sources, the following methods are used:

  • bringing data into a single format by converting documents, translating text into numbers, text recognition;
  • information for one object is supplemented with data from different sources;
  • unnecessary information is filtered out, and data that cannot be analyzed is removed.

Once integration is complete, analysis and processing of the data follow. As an example of data integration and mixing, consider a store that sells through several channels: offline sales, a marketplace, and one of the social networks. To fully assess sales and demand, data must be collected on orders through the marketplace, receipts from offline sales, orders through the social network, stock balances, and so on.
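A minimal sketch of bringing such heterogeneous sources to a single form with pandas, assuming toy order tables from the marketplace and the offline store, could look like this.

    # A minimal integration sketch: merge orders from two assumed sources into one table.
    import pandas as pd

    marketplace = pd.DataFrame({
        "order_id": [1, 2],
        "amount": [250.0, 120.0],
        "source": "marketplace",
    })
    offline = pd.DataFrame({
        "order_id": [3],
        "amount": [99.0],
        "source": "offline",
    })

    # Bring both sources to a single format and combine them for joint analysis.
    orders = pd.concat([marketplace, offline], ignore_index=True)
    print(orders.groupby("source")["amount"].sum())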

4.5 Data classification
4.5.1 Structured data

Structured data is typically stored in relational databases and organized at the table level, as in Excel, for example. Big Data differs from the information that can be analyzed in Excel itself primarily in its much larger volume.

4.5.2 Partially structured

Such data does not fit into tables but can be organized hierarchically; text documents or event-log files fit this description.

4.5.3 Unstructured

Unstructured data has no organized structure: audio and video materials, photos, and other images.

4.6 Data sources
4.6.1 Human-generated social data

The main sources of social data are social networks, the web [12], and GPS movement data [10]. Also, Big Data specialists use statistical indicators of cities and countries: birth rate, death rate, standard of living and any other information that reflects the indicators of people's lives.

4.6.2 Transaction information

This type of information appears during any monetary transactions and interaction with ATMs: transfers, purchases, deliveries.

4.6.3 Machine data

Smartphones, IoT gadgets, cars and other equipment, sensors, tracking systems and satellites serve as a source of machine data.

5 Problems of analysis and processing of large amounts of data

The main problem of processing large amounts of data lies on the surface: high costs [12]. These include the costs of purchasing, maintaining, and repairing equipment, as well as the salaries of specialists competent in working with Big Data.
The next problem is related to the sheer amount of information that needs to be processed. For example, if the research yields not two or three results but a large number of possible outcomes, it becomes extremely difficult to choose exactly those that will have a real impact on the indicators of a particular event.
Another issue is big data privacy [11]. Privacy can be compromised as more and more customer-facing services use data online, and this fuels the growth of cybercrime. Even routine storage of customers' personal data in the cloud can lead to leaks. Keeping personal data safe is one of the most important tasks to solve when using Big Data methods.
Finally, there is the threat of data loss. A single backup does not solve the problem of preserving information: the storage requires at least two or three backup copies, and as data volumes grow, the redundancy problem grows with them. Experts are therefore searching for the most effective way out of this situation.

6 Big data tools

One method of distributed computing is Google's MapReduce parallel processing model [10, 11, 12, 17]. The framework organizes data into records; the functions work independently and in parallel, which upholds the principle of horizontal scalability. Processing takes place in three stages (a toy single-machine sketch follows the list):

  • Map [12, 17]. The function is defined by the user and serves for initial processing and filtering. It is applied to a single input record and produces many key-value pairs. It runs on the same server where the data is stored, in line with the principle of data locality.
  • Shuffle [12, 17]. The output of map is divided into "baskets" (buckets), each corresponding to one output key of the first stage; sorting happens in parallel. The baskets serve as input for the third stage.
  • Reduce [12, 17]. Each basket of values is fed to the reduce function, which is defined by the user and calculates the final result for its basket. The set of all reduce outputs becomes the final result.
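To make the three stages concrete, here is a toy, single-process sketch of the MapReduce idea in plain Python; real frameworks distribute the same steps across a cluster, and the word-count task is only an assumed example.

    # A toy single-process illustration of the map -> shuffle -> reduce stages.
    from collections import defaultdict

    documents = ["big data needs big tools", "data tools process data"]  # assumed input

    # Map: each record produces many (key, value) pairs.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group values into "baskets" by key.
    baskets = defaultdict(list)
    for key, value in mapped:
        baskets[key].append(value)

    # Reduce: compute the final result for each basket.
    result = {key: sum(values) for key, values in baskets.items()}
    print(result)  # {'big': 2, 'data': 3, 'needs': 1, 'tools': 2, 'process': 1}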

Hadoop [10, 14, 16, 18] is a set of utilities, libraries, and a framework used to develop and run programs on clusters of any size. It is open-source software from the Apache Software Foundation for storing, scheduling, and working with data.
Apache Spark [15] is an open-source framework that is part of the Hadoop ecosystem and is used for cluster computing. The Apache Spark library set performs in-memory calculations, which significantly speeds up the solution of many problems and is suitable for machine learning.
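As a minimal sketch of cluster-style computation with Spark (assuming a local PySpark installation and a hypothetical input file name), a distributed word count might look as follows.

    # A minimal PySpark sketch: distributed word count (assumes pyspark is installed).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

    # Hypothetical input file; in a real cluster it would live in HDFS or a data lake.
    lines = spark.read.text("documents.txt").rdd.map(lambda row: row[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in counts.collect():
        print(word, count)

    spark.stop()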
NoSQL is a class of non-relational DBMS in which data storage and retrieval are modeled by means other than tabular relations. No predefined data schema is required to store information, and the main advantage of this approach is that any data can be quickly stored in and retrieved from the storage. The term stands for "Not Only SQL" [15].
The databases listed below all belong to the Amazon family (a minimal key-value access sketch follows the list):

  • DynamoDB is a managed, serverless key-value database built to run high-performance applications at scale, suitable for IoT, gaming, and advertising applications.
  • DocumentDB is a document database designed to work in directories, user profiles and content management systems, where each document is unique and changes over time.
  • Neptune [13] is a managed graph database service. Simplifies the development of applications that work with sets of complex data. Suitable for working with recommendation services, social networks, fraud detection systems.
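As promised above, here is a minimal key-value access sketch for DynamoDB using the boto3 library; the table name, key schema, and item fields are assumptions, and the table and AWS credentials are presumed to exist already.

    # A minimal DynamoDB access sketch with boto3 (assumes configured AWS credentials
    # and an existing table named "UserProfiles" with partition key "user_id").
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("UserProfiles")

    # Store a key-value record.
    table.put_item(Item={"user_id": "42", "name": "Alice", "plan": "premium"})

    # Retrieve it back by key.
    response = table.get_item(Key={"user_id": "42"})
    print(response.get("Item"))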
7 Most Popular Programming Languages for Working with Big Data
7.1 R

The language is used to process data, collect statistics and work with graphics. Loadable modules link R to GUI frameworks and allow you to develop GUI analysis utilities [19]. Graphics can be exported to popular formats and used for presentations. Statistics are displayed in the form of graphs and charts.

7.2 Scala

Native language for Apache Spark, used for data analysis. The Apache Software Foundation projects Spark and Kafka are written primarily in Scala.

7.3 Python

Python has ready-made libraries for AI, ML, and other statistical computations: TensorFlow, PyTorch, scikit-learn, Matplotlib, SciPy, Pandas. Most data processing and storage frameworks, including Apache Kafka, Spark, and Hadoop, provide Python APIs.
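A short pandas sketch, on assumed order data, illustrates the kind of everyday data processing these libraries are used for.

    # A minimal pandas sketch: aggregate assumed order data by category.
    import pandas as pd

    orders = pd.DataFrame({
        "category": ["books", "games", "books", "games", "music"],
        "amount": [12.5, 59.9, 8.0, 39.9, 9.9],
    })

    summary = orders.groupby("category")["amount"].agg(["count", "sum", "mean"])
    print(summary)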

8 Examples of using analytics based on Big Data: business, IT, media

Big data is used to develop IT products. For example, Netflix predicts consumer demand with predictive models when planning new features for its online movie theater. The streaming platform's experts classify the key attributes of the popularity of films and series and analyze the commercial success of products and features. A key feature of such services is built on this: recommender systems that predict users' interests.
Gamedev uses big data to calculate player preferences and analyze behavior in video games. Research like this helps improve gaming experiences and monetization schemes.
For any large-scale production, Big Data allows you to analyze revenues and feedback from customers, detail information about production chains and logistics. Factors like these improve demand forecasting, reducing costs and downtime.
Big Data also helps with semi-structured data about spare parts and equipment: log entries and sensor readings can signal an imminent breakdown. Predicting it in time increases the serviceability, service life, and maintenance efficiency of the equipment.
In retail, big data analytics provide in-depth knowledge of customer behavior patterns. Analytics of information from social networks and websites improves the quality of service, increases loyalty and solves the problem of customer churn.
In medicine, Big Data will help with the analysis of drug use statistics, the effectiveness of services provided, and the organization of work with patients.
Banks use distributed computing to work with transactional information, which is useful for detecting fraud and improving services.
Government agencies analyze big data to improve the safety of citizens and improve urban infrastructure, improve the work of housing and communal services and public transport.

Conclusions

In conclusion, it should be noted that the development of big data processing technologies opens up wide opportunities for improving the efficiency of various areas of human activity: medicine, transport services, public administration, finance, and production. This is what determines the intensity of development of this direction in recent years.

List of sources
  1. Berdyukova S.S. Study of methods for analyzing the tone of texts to characterize the perception of news from the field of culture by society. [Electronic resource]. Access mode: https://masters.donntu.ru/2021/fisp/berdiukova/diss/index.htm
  2. Serezhenko A.A. Investigation of sentiment analysis methods on the example of song lyrics. [Electronic resource]. Access mode: https://masters.donntu.ru/2021/fisp/serozhenko/diss/index.htm
  3. Lyutova E.I. Study of methods for classifying information using a Bayesian classifier. [Electronic resource]. Access mode: https://masters.donntu.ru/2020/fknt/lutova/diss/indexru.html
  4. Pilipenko A.S. The study of methods and algorithms for determining the tonality of a natural language text. [Electronic resource]. Access mode: https://masters.donntu.ru/2020/fknt/pilipenko/diss/index.htm
  5. Guma S.N. The study of methods of comparative analysis of texts on the example of a recommender system of films. [Electronic resource]. Access mode: https://masters.donntu.ru/2019/fknt/guma/diss/index.htm
  6. Vlasyuk D.A. Study of methods for extracting knowledge from HTML pages of the Internet about sports competitions. [Electronic resource]. Access mode: https://masters.donntu.ru/2018/fknt/vlasiuk/diss/index.htm
  7. Storozhuk N.O. Research of methods and algorithms for determining the genre of literary works based on Text Mining technology. [Electronic resource]. Access mode: https://masters.donntu.ru/2018/fknt/storozhuk/diss/index.htm
  8. Titarenko M.G. Study of methods for classifying information on foreign trade activities of states within the framework of an information retrieval system. [Electronic resource]. Access mode: https://masters.donntu.ru/2018/fknt/titarenko/diss/index.htm
  9. Poletaev V.A. Study of image search methods in graphic databases. [Electronic resource]. Access mode: https://masters.donntu.ru/2019/fknt/poletaev/diss/index.htm
  10. Анналин Ын, Кеннет Су. Теоретический минимум по Big Data. Всё что нужно знать о больших данных. - СПб.: Питер, 2019 - 208 с.: ил.
  11. Кукьер К., Майер-Шенбергер В. Большие данные. Революция, которая изменит то, как мы живем, работаем и мыслим. / Виктор Майер-Шенбергер, Кеннет Кукьер ; пер. с англ. Инны Гайдюк. — М.: Манн, Иванов и Фербер, 2014. — 240 с.: ил.
  12. Уоррен Дж., Марц Н. Большие данные. Принципы и практика построения масштабируемых систем обработки данных в реальном времени. - М.: Вильямс, 2018 - 368 с.: ил.
  13. Сенько А. Работа с BigData в облаках. Обработка и хранение данных с примерами из Microsoft Azure. - СПб.: Питер, 2019 - 448 с.: ил.
  14. Вайгенд Андреас. BIG DATA. Вся технология в одной книге. - М.: Эксмо, 2021 - 384 с.: ил.
  15. O'Reilly Media. Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale / 4th Edition. - V.: O’Reilly, 2015 - 754 с.: ил.
  16. Зыков Р. Роман с Data Science. Как монетизировать большие данные. - СПб.: Питер, 2022 - 320 с.: ил.
  17. Благирев А. Big data простым языком. - М.: АСТ, 2019. - 256 с.: ил.
  18. Грас Д. Data Science. Наука о данных с нуля: Пер. с англ. - 2-е изд., перераб. и доп. - СПб.: БХВ-Петербург, 2021. - 416 с.: ил.
  19. Garrett Grolemund, Hadley Wickham. R for Data Science. - Sebastopol, CA : O'Reilly, 2017. - 494 c.: ил.