Summary of the final work

Introduction
Currently, the main problems of linguistics are the study of vocabulary and semantics and fast automated translation. Such research is impossible without dictionaries and archives, but researchers do not always have access to the necessary information resources. Applied computational linguistics, the branch of science that builds systems for processing natural language, can help modern linguists.

The purpose of this paper is to study natural language processing, one of the areas of artificial intelligence and mathematical linguistics, which deals with the problems of computer analysis and synthesis of natural languages.

This goal determined the following tasks:
- define the concept of natural language processing;
- identify the main tasks of natural language processing;
- identify the difficulties encountered in performing natural language processing tasks.
Natural language processing
Natural language processing is a broad area of artificial intelligence and mathematical linguistics. It studies the problems of computer analysis and synthesis of natural languages. For artificial intelligence, analysis means understanding the language, and synthesis means generating well-formed text. Solving these problems would make interaction between a computer and a person more convenient.

Understanding natural language is the key task, because recognizing living human language requires extensive knowledge of the language system, its structure, and its features and patterns.

Ten main tasks of natural language processing are the most relevant today.[1,8]
1. Speech recognition, one of the most important tasks, is the conversion of the speech signal of a human voice into digital information. It can be used by people who are unable to type by hand, or to simplify and speed up text entry.
2. Speech synthesis is the generation of a speech signal from printed text, that is, the artificial production of human speech. This task belongs to artificial intelligence, a field at the junction of modern computer science, computational linguistics, and information technology.
  It is mainly used in information and reference systems and dispatch services, for answering information requests about technological processes, and to help people with impaired vision or speech.
3. Text analysis is the process of extracting meaningful, high-quality information from natural language text in order to automate data extraction and analysis.
4. Text synthesis is the combination of words into sentences and of sentences into text according to the pragmatic structure established at the analysis stage. Synthesis can be regarded as the inverse of analysis. One example is multilingual generation: the automatic preparation of specialized documents in several languages (patent claims, instructions for the use of technical products or software systems). Detailed language models are used to solve problems of this type.
5. Machine, or automatic, translation is the translation of written or spoken texts from one natural language into another by computer programs designed for this purpose.
6. Building question answering systems, that is, information systems that can receive, recognize, and classify questions and answer them in natural language.
  This task is carried out according to the following algorithm:
  - determine the type of the question;
  - search for texts that potentially contain an answer to the question;
  - extract the answer from these sources.
  Such systems can be classified into:
  - those designed to work with texts and topics of a particular subject area;
  - those able to work with information from different areas of knowledge.
7. Information retrieval is the process of finding, in the documents held in databases accessible to the search system, the information that matches a given query on a topic.
  This task involves the following sequence of operations:
  - formulate the information request;
  - search for potential holders of relevant information;
  - extract information from the documents found;
  - review the search results and select the sources that best match the query conditions.
  There are 3 types of information retrieval:
  - search by the content of the entire document;
  - search by document metadata such as name, creation date, author, and size;
  - search by image, that is, by an object depicted in it.[2]
8. Information extraction is the task of automatically extracting the required data from an information source, usually unstructured text.
9. Sentiment analysis (analysis of the tonality of the text) is the analysis of the tokens of a text, the evaluation of their emotional coloring, and their classification as belonging to the neutral, positive, or negative lexical layer of the language.
10. Summarization (abstracting) is the reduction of the volume of a text by identifying its main points, found by matching the given search keywords, and presenting them as a summary.
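
The information retrieval steps in item 7 can be illustrated with a minimal sketch. It assumes a toy in-memory collection; the `DOCUMENTS` dictionary, the function names, and the TF-IDF style weighting are illustrative choices, not taken from the cited sources. A real search system would rely on an index and a more careful ranking function.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_documents(query, documents):
    """Return (doc_id, score) pairs with a positive score, best match first.

    The score is the sum of TF-IDF weights over the query terms.
    """
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_freq[term]:
                # Smoothed inverse document frequency: rarer terms weigh more.
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += term_freq[term] * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy collection standing in for a document database.
DOCUMENTS = {
    "doc1": "Speech recognition converts a speech signal into text.",
    "doc2": "Machine translation converts text from one language into another.",
    "doc3": "Speech synthesis produces a speech signal from printed text.",
}

results = rank_documents("speech recognition", DOCUMENTS)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

For the query "speech recognition" the sketch ranks `doc1` first and `doc3` second, while `doc2` shares no term with the query and is left out. Selecting the most suitable sources, the last step in item 7, then amounts to taking the top of this list.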
Applying machine learning to understand and use text
Natural language processing is a very broad field that yields exciting new results. In practice, the following applications are the most common:
- identifying different cohorts of users or customers (for example, predicting customer churn, total profit per customer, product preferences);
- accurately detecting and extracting different categories of reviews (positive and negative opinions, mentions of individual attributes such as clothing size, etc.);
- classifying text according to its meaning (a request for basic help, an urgent problem).[3]
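
The last application, classifying text by meaning, can be sketched with a small multinomial naive Bayes classifier. This is only an illustration under stated assumptions: the training messages, the labels `urgent` and `help`, and the class name are invented for the example, and the source does not say which learning algorithm is used in practice.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # number of messages per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total)
            class_total = sum(self.word_counts[label].values())
            for token in tokens:
                score += math.log(
                    (self.word_counts[label][token] + 1) / (class_total + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training messages, invented for this illustration.
TRAIN_TEXTS = [
    "The server is down, fix it immediately",
    "Urgent: payment failed and customers cannot order",
    "Site crashed, we are losing orders right now",
    "How do I change my password",
    "Where can I find the user manual",
    "Please explain how to export a report",
]
TRAIN_LABELS = ["urgent", "urgent", "urgent", "help", "help", "help"]

classifier = NaiveBayesClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)
print(classifier.predict("The site is down and payment failed"))
print(classifier.predict("How can I export the manual"))
```

With this toy data the first message is labeled `urgent` and the second `help`. The model only counts words, so it needs far more training text than six messages before its predictions can be trusted.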
Difficulties in performing tasks
Certain features of natural language create obstacles when these tasks are performed, for example:
- the language family and group to which the language belongs;
- word order (direct, inverted, or free);
- characteristic features of the national culture of the speakers of the language;
- logical structure of speech;
- syntactic structure of speech;
- grammatical structure of speech;
- literacy;
- phonetic features of speech;
- polysemy of the language;
- the presence of homonyms in the language;
- the methods of word formation inherent in a particular language;
- neologisms and occasionalisms (nonce words);
- idioms and set expressions.[4,5]
Phases of language analysis for natural language processing
Computers work great with structured information, such as tables in databases. But people communicate with each other not in tables, but in words. For computers, this is too complicated.

The problem of machine extraction of data from plain text is addressed by a dedicated area of artificial intelligence: natural language processing (NLP).

Computers cannot yet fully understand living human language, but they are already capable of a great deal. NLP can do remarkable things and save a tremendous amount of time.

Reading and understanding text is itself a very complicated process, and people often do not keep to strict logic and order in what they write.

Implementing a complex task in machine learning usually means building a pipeline. The point of this approach is to break the problem into very small parts and solve them separately. By connecting several such models that pass data to one another, you can get remarkable results.
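
The pipeline idea can be shown in a few lines. The sketch below is an assumption made for illustration, not the implementation from the cited sources: `run_pipeline` and the three stage functions are hypothetical names, and the stages are deliberately trivial so that the data flow is visible.

```python
from functools import reduce


def run_pipeline(stages, data):
    """Feed data through the stages in order; each stage consumes the previous output."""
    return reduce(lambda value, stage: stage(value), stages, data)


# Three deliberately tiny stages; a real pipeline would plug trained models in here.
def normalize(text):
    """Collapse runs of whitespace and lowercase the text."""
    return " ".join(text.split()).lower()


def split_words(text):
    """Split normalized text into words at single spaces."""
    return text.split(" ")


def count_words(words):
    """Count how many times each word occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


result = run_pipeline([normalize, split_words, count_words], "  The cat   saw the dog ")
print(result)
```

Because every stage only depends on the output of the previous one, a stage can be replaced or removed without touching the others.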

First you need to break up the language analysis process into stages and understand how they work.[6–8]
1. Sentence segmentation. It can be assumed that each sentence is an independent thought or idea. It is easier to teach a program to understand a single sentence than a whole paragraph.
  One could simply split the text at certain punctuation marks. But modern NLP pipelines have more sophisticated methods in stock, suitable even for working with unformatted fragments.
2. Tokenization, or word segmentation. The text is split into separate tokens wherever a space occurs. Punctuation marks are also tokens, because they can be important.
3. Part-of-speech tagging. The tagger looks at each token and tries to guess which part of speech it is: noun, verb, adjective, or something else. Knowing the role of each word in a sentence, one can understand its general meaning.

  Each word is analyzed together with its immediate surroundings using a previously prepared classification model. The model was trained on a million sentences in which the part of speech of each word was already indicated, and it is now able to recognize them. This analysis is based on statistics: the model does not actually understand the meaning that people put into the words.
4. Lemmatization. In natural languages, words can take various forms. A computer that processes text must know the base form of each word in order to recognize that different forms refer to the same concept. In NLP, this process is called lemmatization: finding the base form (lemma) of each word in a sentence.
5. Identifying stop words. This step determines the importance of each word in a sentence. English, for example, has many auxiliary words such as "and", "the", "a". In statistical analysis of text these tokens create a lot of noise, because they appear more often than others. Some NLP pipelines mark them as stop words and filter them out before counting. Ready-made lists are usually used to detect stop words.
6. Dependency parsing. This step establishes the relationships between the words in a sentence. The ultimate goal is to build a tree in which each token has a single parent. The root can be the main verb. The model takes the words as input and returns the result. However, this is a more difficult task.
7. Named entity recognition (NER). This step detects nouns and links them to real-world concepts. NER systems do not simply look words up in dictionaries. They analyze the context of the token in a sentence and use statistical models to guess which object it represents.
8. Coreference resolution. Coreference resolution tracks pronouns across sentences in order to identify all the words that refer to the same entity. Combining this technique with the parse tree and the information about named entities makes it possible to extract a huge amount of useful data from a document.
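
Steps 1, 2, 4, and 5 are simple enough to sketch with the standard library alone. The code below is a rough illustration under clear assumptions: the `STOP_WORDS` and `LEMMAS` tables are tiny hand-made stand-ins, the sentence splitter uses the naive punctuation rule mentioned in step 1, and all function names are hypothetical. Part-of-speech tagging, dependency parsing, NER, and coreference resolution need trained models and are not attempted here.

```python
import re

# Assumed toy resources: a real pipeline would use full stop-word lists and a
# morphological dictionary or trained model instead of these tiny tables.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "in", "of"}
LEMMAS = {"is": "be", "are": "be", "cities": "city", "ponies": "pony", "ran": "run"}


def split_sentences(text):
    """Step 1: split after '.', '!' or '?' followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Step 2: words and punctuation marks both become tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def is_word(token):
    """True for word tokens, False for punctuation tokens."""
    return re.fullmatch(r"\w+", token) is not None


def lemmatize(token):
    """Step 4: map a word form to its lemma by table lookup."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def is_stop_word(token):
    """Step 5: flag frequent function words."""
    return token.lower() in STOP_WORDS


def analyze(text):
    """Return, per sentence, the lemmas of the word tokens that are not stop words."""
    result = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        content = [lemmatize(t) for t in tokens if is_word(t) and not is_stop_word(t)]
        result.append(content)
    return result


text = "London is the capital of England. Cities are large!"
print(split_sentences(text))
print(tokenize("Cities are large!"))
print(analyze(text))
```

The punctuation rule breaks on abbreviations such as "Dr." or "e.g.", which is one reason the more sophisticated methods mentioned in step 1 exist.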
NLP Pipeline in Python

Figure 1 shows the standard steps of a typical NLP pipeline, but depending on the final goal of the project and the specifics of the model, some of them can be skipped or reordered. All of the listed steps are already implemented and ready to use.[7]

Figure 1. Summary of the pipeline (animation: 9 frames, 1 repetition cycle, 16.5 kilobytes)
Natural language processing and machine learning
Thanks to natural language processing and machine learning, chatbots can interpret natural language data. Dialogue systems turn this data into meaningful information and produce a response to the request.

Many companies are trying to develop the ideal chatbot, one that holds a dialogue indistinguishable from ordinary communication between people. Newer chatbots use deep learning both to analyze human input and to generate responses. NLP also converts input and output into a text format that both the computer and the person can understand.

The following is a list of tasks that natural language processing must solve for chatbots. Many of them involve the recognition of text, speech, or even images.[9,10]
1. Summarization. The task is to create an abstract or summary of a large text.
2. Open and closed questions. Modern chatbots are expected to answer questions regardless of whether they are open or closed.
3. Matching. A bot must match words with objects and understand when different words refer to the same object.
4. Ambiguity. Ambiguity, which is common in natural language, is still a serious problem for bots. Homonymy alone requires that the correct meaning be chosen depending on the context.
5. Morphology. A chatbot must be able to divide words into morphemes.
6. Semantics. This is the task of determining the meaning of sentences or words in a natural language and of generating statements in a natural language.
7. Text structure. These tasks are connected with the structure of the text and its punctuation.
8. Tonality. A chatbot must distinguish the emotional coloring of a person's statements and the person's attitude to the subject of conversation. From the manner of expression, the structure of sentences, and the choice of words, it must recognize what mood the person is in: angry, happy, or sad.
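
The ambiguity problem in item 4 can be made concrete with a simplified gloss-overlap heuristic in the spirit of the Lesk algorithm. This technique is brought in here for illustration and is not named in the source. The `SENSES` inventory, the glosses, and the stop-word list are all invented for the example.

```python
import re

# Hypothetical mini sense inventory; a real system would draw on a full lexical resource.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money to customers",
        "river side": "the sloping land along the edge of a river or lake",
    }
}

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "or", "that", "in", "on", "at", "i", "my"}


def content_words(text):
    """Return the set of lowercase words in the text, without stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most content words with the sentence.

    Ties are resolved in favor of the sense listed first.
    """
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "I went to the bank to deposit money"))
print(disambiguate("bank", "We sat on the bank of the river"))
```

The first sentence shares "money" with the financial gloss and the second shares "river" with the other one, so the two occurrences of "bank" receive different senses. Exact word matching is brittle ("deposit" does not match "deposits"), which hints at why homonymy remains hard for bots.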
Conclusion

This paper defined the concept of natural language processing and identified its main tasks and the difficulties encountered in performing them. Despite the large number of scientific publications and tutorials on NLP available on the Internet, there are still practically no full-fledged recommendations on how to cope with NLP tasks effectively that also explain the solutions from the very foundations.

The paper also considered the concept of artificial intelligence and the approaches to and directions of its development.

Natural language processing is a very broad field that yields impressive new results.
List of sources
1. Korn, G., Korn, T. Mathematical Handbook for Scientists and Engineers. Moscow: Nauka, Main Editorial Office of Physical and Mathematical Literature, 1974. 832 p.
2. Giarratano, Joseph. Expert Systems: Principles of Development and Programming. Moscow: Vysshaya Shkola, 2002. 1152 p.
3. Yasnitsky, L. N. Introduction to Artificial Intelligence. Moscow: Akademiya Publishing Center, 2005. 176 p.
4. Jackson, Peter. Introduction to Expert Systems. Kharkiv, 1997. 112 p.
5. Bolshakova, E. I., Klyshinsky, E. S., Lande, D. E., Noskov, A. A., Peskova, O. V., Yagunova, E. V. Automatic Processing of Natural Language Texts and Computational Linguistics. Moscow: USSR-USA JV Paragraf, 1990. 160 p.
6. How to solve 90% of NLP problems: a step-by-step guide to natural language processing. Available at: https://habr.com/company/oleg-bunin/blog/352614/
7. Natural language processing in Python. Available at: https://proglib.io/p/fun-nlp/
8. Stolbunskaya, A. S., Kravets, T. N. Creating an intelligent system for stylistic evaluation of text // Software Engineering: Methods and Technologies for Developing Information and Computing Systems (PIIVS-2018): proceedings of the II scientific and practical conference (student section), Vol. 2, November 14-15, 2018. Donetsk: GOUVPO Donetsk National Technical University, 2018. pp. 253-256.
9. Andryukhin, A. I., Poletaev, V. A. Reflexive reference and analysis of quines // Informatics, Control Systems, Mathematical and Computer Modeling, within the III forum Innovative Prospects of Donbass (IUSMKM-2017): VIII International Scientific and Technical Conference, May 25, 2017, Donetsk / Donetsk National Technical University; editorial board: Yu. K. Orlov et al. Donetsk: DonNTU, 2017. pp. 163-166.
10. How to solve the problem of machine understanding of natural language. Available at: https://habr.com/post/271321/
11. Features of the functioning of an intelligent search system. Available at: http://elib.bsu.by/..
