UDC: 004.415.2.031.43

NATURAL LANGUAGE PRE-PROCESSING

FOR MACHINE LEARNING MODELS

Zolushkin Y.A.

yura.zolushkin@mail.ru

Abstract: This article discusses the concept of natural language processing. Existing methods of natural language processing are analysed, and their advantages and disadvantages are highlighted.

Keywords: tokenization, bag of words, lemmatization, stemming, stop words.

Formulation of the problem.

Natural language processing is used in almost every industry, except perhaps the most conservative ones. Recognition and processing of "human" language were introduced into many technological solutions long ago: ordinary IVR (interactive voice response) systems with rigidly predetermined response options are gradually becoming outdated, voice assistants are beginning to communicate adequately without a human operator, and so on. Natural language processing deals with teaching computers to understand, process and apply natural languages. This article considers some of the common techniques used in NLP problems.

What is Natural Language Processing?

Natural Language Processing (hereinafter NLP) is a subfield of computer science and artificial intelligence dedicated to how computers analyze natural (human) languages. NLP makes it possible to apply machine learning algorithms to text and speech.

For example, NLP can be used to build systems for speech recognition, machine translation, document summarization, named entity recognition, spam detection, question answering, predictive text input, etc.

Today many people have smartphones with speech recognition that applies NLP to understand our speech. Many laptops likewise ship with speech recognition and voice assistants built into the OS (for example, Cortana or Siri).

Language processing is done with various methods such as:

1.  Tokenization;

2.  Stemming;

3.  Filtration;

4.  Semantic reasoning.

Sentence Tokenization 

Sentence tokenization is the process of dividing written language into its component sentences. In English text we can usually extract a sentence each time we find a certain punctuation mark: the period.

But even in English this task is not trivial, since the period is also used in abbreviations. An abbreviation table can be of great help during text processing to avoid misplaced sentence boundaries. In most cases, libraries are used for this purpose.
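For illustration, here is a minimal sketch of sentence tokenization using the NLTK library (the library choice and the sample text are ours, not prescribed by the article):

# Sentence tokenization with NLTK; the abbreviation "Dr." must not end a sentence.
import nltk

nltk.download("punkt")  # pretrained sentence tokenizer (one-time download)

text = "Dr. Smith teaches NLP. His course covers tokenization."
for sentence in nltk.sent_tokenize(text):
    print(sentence)
# Dr. Smith teaches NLP.
# His course covers tokenization.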

Word Tokenization 

Word tokenization is the process of splitting sentences into their component words. In texts based on the Latin alphabet the separator is usually a space.

However, problems arise if we split only on spaces: English compound nouns are spelled in different ways, and the parts of some of them are separated by a space (for example, "ice cream").
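A minimal sketch of word tokenization, again with NLTK (our choice); note that a proper tokenizer does more than split on spaces:

# Word tokenization with NLTK.
import nltk

nltk.download("punkt")

print(nltk.word_tokenize("Don't split on spaces alone."))
# ['Do', "n't", 'split', 'on', 'spaces', 'alone', '.']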

Lemmatization and Stemming

Texts usually contain different grammatical forms of the same word and may also contain words sharing the same root. The task of lemmatization and stemming is to reduce all occurring forms of a word to a single normal form.

Lemmatization and stemming are both special cases of normalization, but they differ.

Stemming is a crude heuristic process that cuts the "excess" off words in the hope of arriving at the root, which often leads to the loss of derivational suffixes.

Lemmatization is a subtler process that uses a vocabulary and morphological analysis to reduce a word to its canonical form, the lemma.

The difference is that a stemmer acts without knowledge of the context and therefore cannot distinguish between words whose meaning depends on the part of speech. However, stemmers have their advantages: they are easier to implement and they run faster.
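The contrast can be seen in a minimal sketch with NLTK's Porter stemmer and WordNet lemmatizer (our choice of tools; the part-of-speech hint pos="v" is an assumption we supply):

# Stemming vs. lemmatization in NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "crying", "seen"]:
    # the stemmer blindly chops endings; the lemmatizer consults a vocabulary
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
# studies -> studi | study
# crying -> cri | cry
# seen -> seen | see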

Stop words

Stop words are words that are filtered out of the text before or after processing. When applying machine learning to texts, such words can add a lot of noise, so it is necessary to get rid of these irrelevant words.

Articles, interjections, conjunctions, etc., which carry no semantic load, are usually treated as stop words. It should be understood that there is no universal stop word list; everything depends on the specific case.
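As a sketch, stop words can be removed with NLTK's built-in English list (one of many possible lists, per the caveat above; the sample sentence is ours):

# Stop-word removal with NLTK's English stop word list.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
tokens = nltk.word_tokenize("This is one of the oldest known board games")
print([t for t in tokens if t.lower() not in stop_words])
# e.g. ['one', 'oldest', 'known', 'board', 'games']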

Regular Expressions

A regular expression (regex, regexp) is a sequence of characters that defines a search pattern.

Examples:

·      . – any character except a newline;

·      \w – one word character (a letter, digit, or underscore);

·      \d – one digit;

·      \s – one whitespace character;

·      \W – one NON-word character;

·      \D – one NON-digit;

·      \S – one NON-whitespace character;

·      [abc] – matches any one of the listed characters: a, b, or c;

·      [^abc] – matches any character except the listed ones;

·      [a-g] – matches one character in the range from a to g.

We can use regular expressions to filter our text further. For example, all characters that are not word characters can be removed. In many cases punctuation is unnecessary and can easily be removed with regular expressions.

Regular expressions are a powerful tool that can be used to create much more complex patterns.
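For instance, a minimal sketch of punctuation removal with Python's re module (the pattern and sample text are ours):

# Removing everything except word characters and whitespace.
import re

text = "Hello, world! Regular expressions cost $0 to learn..."
cleaned = re.sub(r"[^\w\s]", "", text)
print(cleaned)  # Hello world Regular expressions cost 0 to learn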

Bag of words

Machine learning algorithms cannot work with raw text directly, so the text must be converted into sets of numbers (vectors). This is called feature extraction.

Bag of words is a popular and simple feature extraction technique used when working with text. It describes the occurrence of each word in the text.

To use the model we need:

1.  To define a vocabulary of known words (tokens).

2.  To choose a measure of the presence of the known words.

Any information about word order or structure is discarded, which is why the model is called a bag of words. It captures whether a known word occurs in the document, but not where it occurs.
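A minimal sketch of the model using scikit-learn's CountVectorizer (the library is our choice; any counting implementation would do):

# Bag of words with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

documents = ["I like this movie", "I dislike this movie"]
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(documents)

# note: single-letter tokens such as "I" are dropped by the default token pattern
print(vectorizer.get_feature_names_out())  # learned vocabulary (sorted)
print(vectors.toarray())                   # one count vector per document
# ['dislike' 'like' 'movie' 'this']
# [[0 1 1 1]
#  [1 0 1 1]]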

The tricky part of this model is how to define the vocabulary and how to count the occurrence of words. As the vocabulary size grows, the document vector grows as well.

Consequently, the vector representation will contain many zeros. Vectors with many zeros are called sparse vectors; they require more memory and processing resources.

However, when using this model we can reduce the number of known words in order to lower the demands on processing resources.

The same techniques considered earlier can be applied before creating the bag of words for this purpose (a combined sketch follows the list):

·    ignoring word case;

·    ignoring punctuation;

·    removing stop words;

·    reducing words to their base forms (lemmatization and stemming);

·    correcting misspelled words.
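A combined sketch of these reduction steps (our own composition, assuming the NLTK tools shown earlier; any step can be dropped):

# Vocabulary-reducing preprocessing before building a bag of words.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")
nltk.download("punkt")

def preprocess(text):
    text = text.lower()                     # ignore word case
    text = re.sub(r"[^\w\s]", "", text)     # ignore punctuation
    tokens = nltk.word_tokenize(text)
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]  # reduce words to base forms

print(preprocess("The cats are sitting on the mats!"))
# ['cat', 'sit', 'mat']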

Another, more complex way to compose a vocabulary is to use grouped words. This changes the size of the vocabulary and gives the bag of words more detail about the document. The approach is called an "N-gram".

An N-gram is a sequence of any entities (words, letters, numbers, digits, etc.). In the context of linguistic corpora, an N-gram usually means a sequence of words: a unigram is one word, a bigram is a sequence of two words, a trigram is three words, and so on. The number N denotes how many grouped words the N-gram contains. Not all possible N-grams are included in the model, only those that appear in the corpus.
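A sketch of how the vocabulary changes with bigrams, reusing scikit-learn's CountVectorizer (ngram_range is that library's parameter for this; the sample text is ours):

# Unigrams plus bigrams with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))  # N = 1 and N = 2
vectorizer.fit(["the quick brown fox"])
print(vectorizer.get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'quick' 'quick brown' 'the' 'the quick']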

Conclusions.

This article deals with the principles of NLP for text, namely:

·      NLP makes it possible to apply machine learning algorithms to text and speech;

·      sentence tokenization is the process of dividing written language into component sentences;

·      word tokenization is the process of dividing sentences into component words;

·      lemmatization and stemming are intended to reduce all occurring forms of a word to a single normal form;

·      stop words are words removed from the text before or after processing;

·      a regular expression (regex, regexp) is a sequence of characters that defines a search pattern;

·      bag of words is a popular and simple feature extraction technique used when working with text; it describes the occurrence of each word in the text.



Information about the authors:

Zolushkin Yuri Alekseevich – student of group IS-20M, DonNTU

Vasyaeva Tatyana Aleksandrovna – Candidate of Technical Sciences, Associate Professor, Department of Automated Control Systems, DonNTU

Revina Natalya Vladimirovna – Senior Lecturer, Department of English, DonNTU