DonNTU   Masters' portal

Introduction

The task of intelligent processing of natural-language text first arose in the late 1960s and early 1970s. Since then, many studies have been carried out in this field: algorithms have been developed and pilot programs have been built that can analyze a sentence. These systems, however, have not become widely known because of their narrow specialization or the high cost of computer time and resources.

As computer technology penetrates ever deeper into our lives, the task of providing a convenient interface for communicating with it becomes more and more urgent. For a person unfamiliar with computers, it is difficult to get used to operating such equipment. To ease this process, human-computer interaction should be made as close as possible to communication between people.

Providing interaction with computers in natural language is one of the most important tasks of artificial intelligence. This area includes machine translation, text summarization, natural-language interfaces to database management systems, and information retrieval from texts.

One of the important problems of computer processing of natural-language texts is the selection of words in a text that are related to each other in meaning. It arises in the construction of ontologies and collocation dictionaries and in extracting knowledge from text. In connected speech, the grammatical expression of structural and semantic relations is the syntactic relation.

1. Relevance of the topic

In social terms, the importance of the linguistic problems of computerization is associated with the emergence of new types of media, including the construction of artificial languages and computer dictionaries, the development of information banks, the construction of text-processing algorithms, and the development of modes of communication in the "man-machine-man" system. More generally, the linguistic dimension matters for all major areas of the knowledge-processing industry: the collection, creation, storage, ordering, distribution, and interpretation of information.

Well-known Russian companies such as Garant-Park-Internet, INTELTEK PLUS, and DIALING have worked on the task of selecting syntactically related words in Russian text, and a team at Cognitive Technologies works on the same problem. In Ukraine, tools for automatic text analysis based on linguistic methods are not well developed, which indicates the relevance of this work.

2. Goal and tasks of the research

Purpose: development of software for the automatic selection of syntactically related words in a simple, uncomplicated, extended sentence of the Russian language.

Subject of the study: the simple, uncomplicated, extended sentence of the Russian language.

Object of the study: methods for selecting syntactically related words in a sentence.

This work aims to develop automatic syntactic analysis based on linguistic techniques. The following approach is proposed: first, a set of pairs of word forms that are potentially related is found; then a full parse of the sentence is carried out over this set of pairs, which determines the final set of syntactically related words in the sentence.
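The first stage of this approach can be sketched as follows. This is a minimal illustration, not the actual system: the toy feature dictionary and compatibility rules are invented English stand-ins for the morphological analysis and syntactic-link rules the work assumes.

```python
# A sketch of stage 1: collect candidate pairs of word forms whose
# grammatical features are compatible. The lexicon and rules below are
# illustrative placeholders, not output of a real morphological analyzer.
from itertools import combinations

# Toy morphological dictionary: word -> (part of speech, number tag).
FEATURES = {
    "fox":   ("NOUN", "sg"),
    "sees":  ("VERB", "sg"),
    "young": ("ADJ",  "sg"),
    "wolf":  ("NOUN", "sg"),
}

def compatible(w1, w2):
    """A crude stand-in for syntactic-link rules: adjective-noun and
    noun-verb pairs with matching number are accepted as candidates."""
    (p1, n1), (p2, n2) = FEATURES[w1], FEATURES[w2]
    return n1 == n2 and {p1, p2} in ({"ADJ", "NOUN"}, {"NOUN", "VERB"})

def candidate_pairs(sentence):
    words = sentence.lower().split()
    return [(a, b) for a, b in combinations(words, 2) if compatible(a, b)]

pairs = candidate_pairs("young fox sees wolf")
# The candidate set overgenerates (e.g. it pairs "young" with "wolf");
# this is exactly why the second, full-parse stage is needed.
```

The overgeneration visible here motivates the second stage: the full parse over the candidate pairs filters out spurious links.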

3. Review of Research and Development

Among formal theories for describing natural language, one distinguishes the formal-grammatical and the probabilistic-statistical approaches. The formal-grammatical approach aims to create a complex system of rules that would allow a decision in favor of one syntactic structure in each case; the statistical approach collects statistics on the occurrence of different structures in similar contexts and chooses among candidate structures on that basis.

Methods of syntactic analysis based on data from psychology and neurophysiology are also known. One example is the method of isolating sentence nuclei.

The formal-grammatical approach rests on the classification of formal languages and grammars proposed by Chomsky. For computational linguistics, the most important of these are finite-automaton grammars, context-free (CF) grammars, and context-sensitive grammars. Natural-language phenomena are mainly described using CF grammars with some extensions.

A finite-automaton grammar (finite-state transition network) formally corresponds to a simple grammar of the third type. A finite-state machine contains a set of states (non-terminal symbols), among which one or more initial and final states are distinguished, together with conditions for transitions between states. The input for the transition conditions consists of the characters coming from the tape read by the machine. Sometimes the machine can also write symbols to another tape; in the English tradition such a machine is called a transducer. In most linguistic applications the transition conditions are not stated directly over characters but are computed by a lexical component that maps a character or string of characters to generalized symbols.

Finite-state machines are a declarative means of representation, which implies reversibility, i.e., use both for analysis and for synthesis. They are also very efficient in terms of speed, but are limited in their ability to describe many structures found in natural language, such as nested constructions, for example nested subordinate clauses.
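A finite-state transition network of the kind described above can be sketched directly as a transition table. The lexicon (playing the role of the lexical component that maps words to generalized symbols) and the state set are invented for illustration; they recognize only a tiny "noun group - verb - noun group" pattern.

```python
# Finite-state transition network: state -> {category: next_state}.
# The lexicon stands in for the lexical component that maps input
# strings to generalized symbols; all entries are illustrative.
LEXICON = {"the": "DET", "fox": "N", "wolf": "N", "sees": "V", "young": "ADJ"}

TRANSITIONS = {
    "S0": {"DET": "S1", "ADJ": "S1", "N": "S2"},   # start of subject group
    "S1": {"ADJ": "S1", "N": "S2"},                # modifiers before the noun
    "S2": {"V": "S3"},                             # the verb
    "S3": {"DET": "S4", "ADJ": "S4", "N": "S5"},   # start of object group
    "S4": {"ADJ": "S4", "N": "S5"},
}
FINAL = {"S5"}

def accepts(sentence):
    """Run the machine over the word tape; accept if it ends in a final state."""
    state = "S0"
    for word in sentence.lower().split():
        cat = LEXICON.get(word)
        state = TRANSITIONS.get(state, {}).get(cat)
        if state is None:          # no transition defined: reject
            return False
    return state in FINAL
```

Note how awkward it would be to extend this table to nested subordinate clauses: finite-state networks have no stack, which is exactly the limitation mentioned above.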

A higher level of grammar is constituted by the context-free grammars, which are written as productions (rules) relating a non-terminal symbol on the left-hand side (before the "=") to a string of terminal and non-terminal symbols on the right-hand side.

Such a grammar describes sentences like "the fox sees the wolf," "the young fox sees the old wolf," "the young old fox sees the lying wolf," "the fox lies," and so on. It is easy to extend this grammar so that the dictionary represents Russian morphology more fully. Note that when the rules for building verb groups (VP rules) or noun groups (NP rules) offer several options, the grammar itself cannot guarantee a choice between them; such a grammar is called non-deterministic.
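A CF grammar of the kind sketched here, with NP and VP rules over a tiny lexicon, can be checked with a naive top-down recognizer. The productions below are illustrative English analogues of the Russian example, not the grammar from the work itself; the non-determinism shows up as the recognizer trying every rule and every split point.

```python
# Context-free productions: non-terminal -> list of alternative right-hand
# sides. Keys are non-terminals; anything not a key is a terminal word.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["N"], ["ADJ", "NP"]],     # "fox", "young old fox"
    "VP":  [["V"], ["V", "NP"]],       # "lies", "sees old wolf"
    "N":   [["fox"], ["wolf"]],
    "ADJ": [["young"], ["old"]],
    "V":   [["sees"], ["lies"]],
}

def derives(symbol, words):
    """True if `symbol` derives exactly the word sequence `words`."""
    if symbol not in GRAMMAR:                       # terminal symbol
        return list(words) == [symbol]
    return any(matches(rhs, words) for rhs in GRAMMAR[symbol])

def matches(rhs, words):
    """True if the sequence of symbols `rhs` derives `words`."""
    if not rhs:
        return not words
    # Non-deterministically split the span between the first symbol and the rest.
    return any(derives(rhs[0], words[:i]) and matches(rhs[1:], words[i:])
               for i in range(len(words) + 1))

def accepts(sentence):
    return derives("S", sentence.lower().split())
```

The blind search over rule alternatives and split points is what makes plain non-deterministic CF parsing expensive, and it is one motivation for the probability-weighted extensions discussed below.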

The syntax of CF rules is very simple, but the plain CF apparatus is not sufficient to describe many phenomena of natural language. In particular, context-free rules are inconvenient for describing agreement (e.g., in person and number between subject and predicate). The CF apparatus is also inconvenient for representing discontinuous relationships caused by the movement of words within a phrase, or for describing missing components.

In current international work on the analysis of natural-language texts, much attention is paid to statistical analysis schemes. Most statistical methods of analysis are based on so-called PCFGs (probabilistic context-free grammars), which are context-free grammars in which each rule is supplemented with a probability estimate. Although plain CFGs do not achieve the required accuracy of analysis (a conclusion reached as early as the 1970s), various analysis schemes built on extensions of context-free grammars have been used successfully in modern natural-language systems.

The choice of representation for syntactic structure is largely tied to the design of the parsing algorithm. Formal grammars usually work with the syntactic representation as a constituent tree. Attractive properties of dependency graphs are their efficiency, their ease of use in transformations, and the ability to view partial results of the analysis as a set of subgraphs.
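The dependency representation mentioned here can be sketched as a set of labeled head-to-dependent arcs, with partial analysis results as reachable subgraphs. The sentence, indices, and labels below are illustrative only.

```python
# Dependency graph as a set of arcs (head index, dependent index, label)
# for the illustrative sentence: 0:young 1:fox 2:sees 3:old 4:wolf
DEPS = {
    (2, 1, "subject"),    # sees -> fox
    (2, 4, "object"),     # sees -> wolf
    (1, 0, "modifier"),   # fox  -> young
    (4, 3, "modifier"),   # wolf -> old
}

def subtree(root, deps):
    """All word indices reachable from `root` via dependency arcs —
    a partial result of the analysis viewed as a subgraph."""
    nodes, frontier = {root}, [root]
    while frontier:
        head = frontier.pop()
        for h, d, _ in deps:
            if h == head and d not in nodes:
                nodes.add(d)
                frontier.append(d)
    return nodes
```

For example, `subtree(4, DEPS)` recovers just the object group "old wolf", which is the kind of partial result a dependency-based analyzer can hand over before the full parse is finished.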

To create "exact" algorithms for the semantic and syntactic analysis of texts, it would be essential that the functioning of language follow strict "rules," i.e., that language be a kind of calculus; high-level programming languages are an example. But natural language is not a calculus. Such rules as linguists have identified in it (e.g., "grammar rules") have a vague scope and are inexact. Language is a universal means of communication between people, and a simple solution to the problem of modeling it should not be expected. It is like a black box: only its inputs and outputs can be observed, and one can only speculate about the mechanism of its functioning.

Currently, the following tools exist for the automatic parsing of Russian sentences: RCO Syntactic Engine, DIALING, Solarix, TREETON, MCA.

Conclusion

This work aims to develop software for the automatic selection of syntactically related words in a simple, uncomplicated, extended sentence.

To this end, a review of parsing methods and tools was carried out. It showed that formal-grammatical analysis techniques are gradually being replaced by methods that, in one form or another, use probabilistic estimates.

Methods based on the probabilistic principle cannot provide 100% accuracy of analysis, but on real texts their results are quite satisfactory for many applications. The development cost of probabilistic analyzers can be significantly lower than the cost of creating comprehensive structural-grammatical models of natural language, though at the price of lower accuracy and completeness of analysis.

The structure of complex and complicated sentences was analyzed: the types of segments and the functions of punctuation marks, conjunctions, and connective words. This analysis led to the conclusion that, before parsing, it is impossible to determine whether a sentence is complex or complicated. Therefore, complex, complicated, and simple sentences will all be analyzed by a single algorithm.

An algorithm is proposed in the form of a general analysis scheme with a list of information resources (a base of stable combinations of punctuation marks; a base of set phrases, conjunctions, and prepositions; and a base of complex prepositional expressions), together with an algorithm for parsing the segments of a sentence.
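The segment-splitting step of such a scheme can be sketched as follows. The boundary lists are invented English placeholders for the resource bases named above (punctuation combinations, conjunctions), and the tokenization is deliberately simple.

```python
# Sketch of segment splitting: the sentence is cut into segments at
# boundaries drawn from a punctuation base and a conjunction base.
# Both bases below are illustrative stand-ins for the real resources.
import re

PUNCT_BASE = {",", ";", ":"}
CONJ_BASE = {"and", "but", "which"}   # English stand-ins for Russian conjunctions

def segments(sentence):
    """Split a sentence into segments at punctuation marks and conjunctions."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    parts, current = [], []
    for tok in tokens:
        if tok in PUNCT_BASE or tok in CONJ_BASE:
            if current:               # close the segment accumulated so far
                parts.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        parts.append(current)
    return parts
```

Each resulting segment can then be handed to the single parsing algorithm described above, regardless of whether the original sentence was simple, complex, or complicated.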
