DonNTU   Masters' portal

Abstract

Introduction

Automatic text summarization is one of the major branches of modern information technology. Computers and the Internet make it possible to obtain and publish any information quickly. On the one hand, this speeds up the search for required data and makes work with different types of information more efficient; on the other hand, the same development of information technology has moved society to a new, information-based type. Under these conditions the volume of information has grown tens of times over and continues to grow beyond the human capacity to perceive and process it.

A person acquires most knowledge by analyzing, comparing and synthesizing information from different sources, which is often presented as text. The share of new knowledge that a person acquires by studying texts reaches 85%. Scientific and technical progress has produced a large number of publications (books, papers, etc.) on various problems of science, technology and education, and experts do not have time to keep track of the latest literature in their fields. Open information sources give access to a large number of publications, which raises the problem of working effectively with large volumes of data.

Text summarization is a difficult kind of intellectual activity, and preparing abstracts by hand takes a long time. Automatic text summarization replaces the time-consuming manual process of identifying important information. Condensing the meaning of the original text into an abstract speeds up the analysis of text documents several times over.

1. Relevance of the topic

At the present stage of social development, time is the most critical human resource. People constantly have to deal with large amounts of varied information that must be processed, and a significant part of it consists of natural-language texts. When there are too many documents to read carefully in the time available, automatic text summarization systems come to the aid.

Text summarization is one of the most important branches of modern information technology: the amount of information is constantly growing, and working through all the necessary material becomes simply impossible. The development of automatic text summarization algorithms is therefore not losing its relevance; on the contrary, it is becoming increasingly necessary as the amount of text data keeps growing.

2. Goal and tasks of the research

The goal is to study and solve the problem of automatic text summarization using technologies based on fuzzy logic. The designed system is expected to improve the semantic quality of summaries, increase the efficiency of data and knowledge processing in computer systems, and work well with texts of different genres, levels of complexity and length.

The main research tasks:

The work should result in a developed structure for an automatic text summarization system, a selection of summarization methods and algorithms, and proposed ways of improving the developed system.

3. Review of existing methods of automatic summarization

Automatic text summarization is the extraction of the most important data from one or several documents and the generation, on that basis, of concise and information-rich reports. There are two directions of automatic text summarization: quasi-summarization (extraction) and content summarization (abstraction). Content summarization uses artificial intelligence methods and special information languages to select the most important information from texts and to generate new texts that generalize the content of the primary documents [7].

Quasi-summarization extracts from the primary documents, by means of certain formal features, the "most informative" phrases (fragments), which together form an extract (quasi-abstract). Summarization proper selects the most essential information from texts by means of special information languages and generates new texts (abstracts) that are, to a greater or lesser extent, isomorphic in content to the primary documents or to their parts [6].

Compared with summarization proper, quasi-summarization relies on the analysis of the surface syntactic relations that are explicitly expressed in the text. It does not require recourse to deep semantic processes, which are still not understood well enough to describe the properties of an arbitrary text. The second direction, summarization proper, is currently represented by pilot studies and has not yet reached broad implementation [6].

The extractive method focuses on selecting characteristic fragments (as a rule, sentences). To do this, phrase-template matching is used to select the blocks with the greatest lexical and statistical relevance. The final document is then produced by joining the selected fragments [6].

Most methods apply the model of linear weight coefficients [8]. The analytical stage of this model is a procedure that assigns weight coefficients to each block of text according to characteristics such as the position of the block in the original, its frequency of occurrence in the text, its frequency of use in key sentences, and indicators of statistical significance. The sum of the individual weights, usually adjusted by tuning parameters associated with each weight, gives the total weight of the text block.

The location weight coefficient (Location) in this model depends on where the fragment appears in the whole text or in a given paragraph (at the beginning, in the middle or at the end) and on whether it occurs in key sections such as the introduction or the conclusion.
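
A position-based weight of this kind can be sketched as follows. This is only an illustration: the function name `location_weight`, the 20 % boundaries for "beginning" and "end", and all numeric weights are assumptions made for the example and are not taken from the cited model.

```python
def location_weight(index: int, total: int, in_key_section: bool = False) -> float:
    """Return a position-based weight for block `index` out of `total` blocks.

    All thresholds and weights below are illustrative assumptions.
    """
    if total <= 0 or not 0 <= index < total:
        raise ValueError("index must lie inside the document")

    # Relative position: 0.0 is the first block, 1.0 is the last block.
    relative = index / max(total - 1, 1)

    if relative <= 0.2:
        weight = 1.0   # beginning of the text
    elif relative >= 0.8:
        weight = 0.7   # end of the text
    else:
        weight = 0.3   # middle of the text

    # Extra weight for blocks inside key sections (introduction, conclusion).
    if in_key_section:
        weight += 0.5
    return weight
```

For example, `location_weight(0, 10)` treats the first of ten blocks as belonging to the beginning of the text, while `location_weight(5, 10)` falls into the lower-weighted middle zone.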

Cue phrases (CuePhrase) are lexical or phrasal summarizing constructions, such as "in summary", "in this article", "according to results of the analysis" and so on.
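
Detecting such phrases can be reduced to simple substring matching. The sketch below assumes a fixed, hand-made phrase list and a flat bonus per matched phrase; the names `CUE_PHRASES` and `cue_phrase_weight` are hypothetical, and a real system would need a much larger, language-specific list.

```python
# Illustrative phrase list taken from the examples in the text.
CUE_PHRASES = (
    "in summary",
    "in this article",
    "according to results of the analysis",
)


def cue_phrase_weight(block: str, bonus: float = 1.0) -> float:
    """Return `bonus` for every distinct cue phrase found in the block."""
    # Lowercase and collapse whitespace so that matching ignores case and line breaks.
    text = " ".join(block.lower().split())
    return bonus * sum(1 for phrase in CUE_PHRASES if phrase in text)
```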

In addition, this model takes into account an indicator of statistical significance (StatTerm) when assigning weight coefficients. Statistical significance is calculated from data obtained in research on automatic indexing, in which researchers identify and evaluate a number of metrics that define the weight coefficients of a term. These metrics make it possible to distinguish a document from the others in a given collection [8].

One group of metrics, for example tf.idf, balances the frequency of a term in the document against its frequency in the collection of documents (as a rule, it is used together with other frequency metrics and with length normalization) [9].
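
A minimal sketch of one common tf.idf variant is given below. It assumes that documents are already tokenized, uses the term frequency normalized by document length and the unsmoothed inverse document frequency log(N / df); other variants exist, and the function name `tf_idf` is chosen only for this example.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Compute tf.idf scores for every term of every tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        scores.append({
            # Length-normalized term frequency times inverse document frequency.
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

With this variant a term that occurs in every document of the collection receives a score of zero, because log(N / N) = 0, while a term that is frequent in one document and rare elsewhere receives a high score.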

Finally, this model examines the terms in a block of text and determines its weight coefficient according to the additional presence of those terms (AddTerm): whether they also occur in the title, the headings, the first paragraph or the text of the user query. Giving priority to the terms that most precisely reflect the user's interests is one way to tailor an abstract or summary to a specific person or group [8].
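
One possible way to express this coefficient is to count the terms that a block shares with the title, the first paragraph and the user query. The sketch below is an assumption-laden illustration: the function name `add_term_weight`, the parameter names and the bonus values are hypothetical, and the larger bonus for query terms merely reflects the idea of tuning the result to the user.

```python
def add_term_weight(
    block_terms,
    title_terms=(),
    first_paragraph_terms=(),
    query_terms=(),
    title_bonus: float = 1.0,
    first_paragraph_bonus: float = 0.5,
    query_bonus: float = 2.0,
) -> float:
    """Weight a block by the terms it shares with privileged parts of the text.

    The bonus values are illustrative assumptions, not values from the model.
    """
    terms = set(block_terms)
    return (
        title_bonus * len(terms & set(title_terms))
        + first_paragraph_bonus * len(terms & set(first_paragraph_terms))
        + query_bonus * len(terms & set(query_terms))
    )
```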

At the analytical stage, the model of linear weight coefficients performs a sequence of frequency calculations and string or template matching operations that yield four types of weight coefficients (Location, CuePhrase, StatTerm, AddTerm) for each block of the source text. These coefficients are then summed for each block, and the n blocks with the highest sums are included in the abstract (the value of n can be determined from the required degree of compression).
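
The summation and selection step can be sketched as follows, assuming that the four coefficients have already been computed for every block. The class `BlockScores`, the functions `total_weight` and `select_blocks`, the tuning weights `w_loc`, `w_cue`, `w_stat`, `w_add` and the rule n = round(number of blocks × compression) are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class BlockScores:
    """A text block together with its four precomputed weight coefficients."""
    text: str
    location: float
    cue_phrase: float
    stat_term: float
    add_term: float


def total_weight(block, w_loc=1.0, w_cue=1.0, w_stat=1.0, w_add=1.0) -> float:
    """Linear combination of the four coefficients with tunable weights."""
    return (
        w_loc * block.location
        + w_cue * block.cue_phrase
        + w_stat * block.stat_term
        + w_add * block.add_term
    )


def select_blocks(blocks, compression=0.3, **weights):
    """Return the texts of the n highest-scoring blocks in their original order."""
    if not blocks:
        return []
    # n follows from the degree of compression; at least one block is kept.
    n = max(1, round(len(blocks) * compression))
    ranked = sorted(
        range(len(blocks)),
        key=lambda i: total_weight(blocks[i], **weights),
        reverse=True,
    )
    # Restore document order so that the extract reads coherently.
    return [blocks[i].text for i in sorted(ranked[:n])]
```

Returning the selected blocks in document order rather than in score order is a design choice of this sketch: it keeps the extract readable, since the fragments are simply joined together.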

Unlike the linear model used in extraction methods, preparing a content summary requires powerful computing resources for natural language processing (NLP) systems, including grammars and dictionaries for parsing and for generating natural-language constructions. In addition, this approach needs ontological reference resources that capture common-sense reasoning and domain-specific concepts, in order to make decisions during analysis and to identify the most important information [8].

Content summarization involves two main approaches. The first relies on the traditional linguistic method of parsing sentences [6].

This method also uses semantic information to annotate the parse trees. Comparison procedures manipulate the trees directly in order to remove and regroup their parts, for example by pruning branches according to structural criteria such as parentheses or embedded conditional or subordinate clauses. After this procedure the parse tree becomes significantly simpler and is, in essence, a structural "compression" of the source text [8].
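
The pruning of branches by structural criteria can be illustrated on a toy parse tree. The sketch below assumes that the tree has already been produced by some parser and that parenthetical and subordinate constituents carry the Penn Treebank-style labels `PRN` and `SBAR`; the `Node` class and the `prune` and `leaves` functions are hypothetical helpers, not part of any cited system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A parse-tree node: inner nodes have children, leaves carry a word."""
    label: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


# Assumed structural criterion: drop parentheticals and subordinate clauses.
PRUNABLE = {"PRN", "SBAR"}


def prune(node: Node) -> Node:
    """Return a copy of the tree without branches whose label is prunable."""
    kept = [prune(child) for child in node.children if child.label not in PRUNABLE]
    return Node(node.label, node.word, kept)


def leaves(node: Node) -> List[str]:
    """Collect the words of the tree from left to right."""
    if node.word is not None:
        return [node.word]
    return [word for child in node.children for word in leaves(child)]
```

Applied to a tree for a sentence such as "The system (a prototype) selects sentences that matter", `leaves(prune(tree))` would keep only the words of the main clause, provided that the parenthetical and the relative clause are labeled as assumed above.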

The second approach to content summarization originates in artificial intelligence systems and relies on natural language understanding [12]. Parsing is also a component of this method of analysis, but parse trees are not generated. Instead, conceptual representation structures of all the source information are formed and accumulated in a text knowledge base. Predicate logic formulas or representations such as a semantic network or a set of frames can serve as these structures.

Conclusion


This work studied the existing methods and approaches to automatic text summarization, analyzed their advantages and disadvantages, and identified the need to use modern technologies in this field. The study of the methods that underlie modern automatic summarization systems leads to the following conclusions:

  1. The problem of automatic summarization includes the following subtasks: extracting key words and phrases, finding the sentences that contain them, and synthesizing the text of the abstract on that basis.
  2. An automatic text summarization system includes three key stages: analysis of the input text (preprocessing, data preparation); analysis of the document content, in which, among other steps, keywords are identified and redundant or unnecessary information is discarded; and preparation of the summary from the information obtained at the previous stage.

Thus, the relevance of using an algorithm based on fuzzy logic for automatic text summarization has been shown. Further development of this subject should consider possible ways of applying fuzzy logic in automatic text summarization systems.

References

  1. Luhn H. The automatic creation of literature abstracts. In IBM Journal of Research and Development, Vol. 2(2), 1958. – P. 159–165.
  2. Берзон В.Е. Синтаксические сверхфразовые связи и их инженерно-лингвистическое моделирование / В.Е. Берзон (отв. ред. Р.Г. Пиотровский). – Кишинев: Штиинца, 1984. – 167 с.
  3. Севбо И.П. Структура связного текста и автоматизация реферирования / И.П. Севбо // М.: Наука, 1969. – 135 с.
  4. Скороходько Э.Ф. Семантические сети и автоматическая обработка текста / Э.Ф. Скороходько // К.: Наук. думка, 1983. – 220 с.
  5. Леонов В.П. О методах автоматического реферирования / В.П. Леонов // НТИ. Сер. 2. – 1975. – № 6. – С. 16–20.
  6. Луканин А.В. Автоматическая обработка естественного языка / А.В. Луканин; М-во образования и науки Российской Федерации, Южно-Уральский гос. ун-т, Каф. "Общая лингвистика". – Челябинск: Изд. центр ЮУрГУ, 2011. – 70 с.
  7. Гинкул А.С. Сравнительный анализ существующих систем автоматического реферирования текста / А.С. Гинкул // Політ. сучасні проблеми науки – Киев, 2012. – С. 255.
  8. Хан У. Системы автоматического реферирования / У. Хан, И. Мани // Открытые системы. – 2000. – № 12. – [Электронный ресурс]. – Режим доступа: http://www.osp.ru/os/2000/12/178370.
  9. Jurafsky D. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition / D. Jurafsky, J.H. Martin. — New Jersey: Prentice Hall, 2000. – 934 p.
  10. Приходько С.М. Автоматическое реферирование на основе анализа межфразовых связей / С.М. Приходько, Э.Ф. Скороходько // НТИ. – Сер. 2, № 1, 1982 – С. 27–31.
  11. Богданов В.В., Реферирование / В.В. Богданов // Прикладное языкознание: учебник. – СПб.: Изд-во С.-Петербург. ун-та, 1996. – С. 389–398.
  12. Hutchins J. Summarization: Some Problems and Methods. In Proc. Informatics 9: Meaning-The Frontier of Informatics, K.P. Jones, ed., Aslib, London, 1987. – P. 151–173.
  13. Кутукова Е.С. Технология Text mining / Е.С. Кутукова // SWorld: Перспективные инновации в науке, образовании, производстве и транспорте. – Одесса, 2013.
  14. Dan Sullivan. Document Warehousing and Textmining. NY; Wiley publishing house, 2001. – P. 36–38.
  15. Харламов А.А. Автоматический структурный анализ текстов / А. Харламов. //Открытые системы. – 2002. – № 10. – С. 16–22.
  16. Kupiec J., Pedersen J. and Chen F. A trainable document summarizer. In Proceedings of the 18th ACM/SIGIR Annual Conference on Research and Development in Information Retrieval, Seattle, 1995. – P. 68–73.
  17. А. Михаилян. Некоторые методы автоматического анализа естественного языка, используемые в промышленных продуктах, 2000. – [Электронный ресурс]. – Режим доступа: http://www.inteltec.ru/publish/articles/textan/natlang.shtml.
  18. Ступин B.C. Система автоматического реферирования методом симметричного реферирования / B.C. Ступин // Компьютерная лингвистика и интеллектуальные технологии. Труды межд. конференции «Диалог 2004». — М.: Наука, 2004. – С. 579–591.
  19. Моніторинг діяльності органів виконавчої влади із застосуванням комп’ютерної системи контент-аналізу електронних ЗМІ / Г. Леліков, В. Сороко, О. Григор’єв, Д. Ланде // Вісн. держ. служби України. – 2002. – № 2. – С. 21–38.
  20. Танатар Н.В., Федорчук А.Г. Интеллектуальные поисково-аналитические системы мониторинга СМИ / Н.В Танатар., А.Г. Федорчук // Научно-практический и теоретический сборник. – Киев, 2008. – 477 с.
  21. Iatsko V. Linguistic Aspects of Summarization // Philologie im Netz – № 18. – 2001. – P. 33–46. – [Electronic resource]. – Access mode: http://www.fu-berlin.de/phin/phin18/p18t3.htm.