DonNTU   Masters' portal

Abstract

Introduction

The task of annotating and summarizing documents is relevant to any information repository, from libraries to Internet portals. The growing flow of information in contemporary society, including the amount of information on the Internet, makes it increasingly difficult to obtain accurate summaries quickly. Producing summaries and annotations manually requires enormous human resources, so the task of building effective methods of automatic summarization and annotation is becoming increasingly important.

Abstracting and annotating documents are among the main types of human information activity and belong to the traditional search techniques. The resulting analytical review is a unique information product: it gives scientists and specialists complete and concentrated information through the classification, analysis and evaluation that are specific to a review and, above all, through the concentration of valuable material scattered among various sources. By summarizing data on scientific advances, concepts, problems and the different approaches to them, an analytical review serves as an information model of how the development problems of a given field are solved.

Under these conditions, techniques for the semantic compression of data, primarily text, grow in importance. Among them, intelligent (text-mining) methods for summarizing documents and document collections occupy a special place.

All this testifies to the relevance and great practical significance of the chosen topic.

Developing systems for automatic summarization is considered the most difficult task in automatic text processing, because it requires deep syntactic, semantic, lexical and morphological analysis of the document, followed by synthesis of a correct result for the user. Although no system can yet generate a full abstract (only quasi-summarization systems have been built so far), such systems, along with automatic search and machine translation, already help us navigate the world information space and find the information we need.

1. Relevance of the topic

The use of computers in human activities, including research, not only accelerates the creation and processing of documents but also dramatically increases their number and volume. Today, many users regularly need to look through a large number of documents quickly and select the most relevant and genuinely necessary ones. This problem arises when working with text documents and databases, analyzing e-mail, and searching for information on the Internet. In addition, the record-keeping rules of large organizations and companies very often require every document to be accompanied by a brief abstract. In all of these cases, the solution is not to view the whole document but to use its compressed description: a summary or an abstract. This has led to the need for research into automatic summarization of full-text documents.

2. Purpose and objectives of the research, expected results

The aim of the work is to study and improve existing algorithms for automatically generating an abstract from the content of a text, so as to improve the semantic quality of the abstract.

The main objectives of the study:

  1. Review and analyze existing solutions in the field of automatic text summarization.
  2. On the basis of this analysis, justify the choice of algorithms for:
    • determining the content, i.e. extracting key words, phrases and sentences;
    • organizing the information, i.e. arranging the statements of the abstract in a logical sequence;
    • processing the sentences, i.e. simplifying and harmonizing the selected sentences.

As a result of the work, the structure of an automatic summarization system is to be designed, the methods to be implemented in its modules are to be chosen, and ways of improving the quality of the system are to be identified.

3. Algorithm for ranking connected structures (Manifold Ranking)

The Manifold Ranking algorithm makes it possible to describe the cohesive structure of a text by means of matrices. Originally, the algorithm selects the elements (sentences) closest to a given one (the topic); this is the information retrieval interpretation of the problem. Automatic summarization likewise selects the set of sentences closest to the topic of a given cluster, but it must also apply an algorithm that cuts off "similar" sentences, which is especially important for multi-document summarization. Automatic summarization of a set of documents with the ranking algorithm for connected structures consists of two phases:

  1. Computing the rank of each sentence. This ranks all sentences by their "closeness" to the topic of the given cluster.
  2. Applying the cut-off algorithm to sentences that are most similar to those already included in the survey summary. This excludes identical or similar sentences from the survey summary.

As a result, a number of sentences with the highest rank are chosen for the resulting summary. The approach does not, in general, specify how the chosen sentences are ordered. I implemented the simplest selection algorithm: sentences are taken in the order of their relative position, with priority given to shorter sentences, which is natural for Russian. Strictly speaking, the coherence of the resulting summary is the subject of a separate study.
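The second phase and the final ordering step can be illustrated with a minimal Python sketch. It assumes that a rank is already available for every sentence. The bag-of-words cosine similarity, the `cutoff` threshold, the `limit` parameter and the function names are illustrative assumptions, not details of the implementation described here.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(sentence):
    """Lowercased word counts; a crude stand-in for real morphological analysis."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_sentences(sentences, ranks, limit=3, cutoff=0.6):
    """Walk sentences from highest to lowest rank and skip any sentence that is
    too similar to one already chosen (the threshold is a hypothetical value)."""
    vectors = [bag_of_words(s) for s in sentences]
    # Higher rank first; shorter sentences win ties.
    order = sorted(range(len(sentences)), key=lambda i: (-ranks[i], len(sentences[i])))
    chosen = []
    for i in order:
        if len(chosen) == limit:
            break
        if all(cosine(vectors[i], vectors[j]) < cutoff for j in chosen):
            chosen.append(i)
    # Emit the chosen sentences in their original relative order.
    return [sentences[i] for i in sorted(chosen)]
```

A fixed similarity threshold is only one way to cut off near-duplicates; lowering the rank of similar sentences instead of discarding them would be an equally plausible variant.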

The rank of a sentence takes two criteria into account:

  1. Information significance: given a set of sentences and a topic T, a vector of the information significance of each sentence is computed. The information significance of a sentence is defined as its degree of proximity to the topic T. The cluster topic T is assumed to fully reflect the content of the set of documents and to contain the most complete vocabulary.
  2. Information novelty: for each sentence, its proximity to the other sentences of the set is determined. The overall rating that decides whether a sentence enters the survey summary is then calculated from both the information significance of the sentence and its "information novelty."
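The information significance score can be sketched with the standard manifold ranking iteration f ← αSf + (1 − α)y, where S is the symmetrically normalized sentence similarity matrix and y marks the topic node. The sketch below is only an illustration under stated assumptions: the bag-of-words cosine similarity, the value of `alpha`, the fixed number of iterations and all function names are hypothetical, and the novelty component is not modelled.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercased word counts (no stemming or stop-word removal)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def manifold_rank(topic, sentences, alpha=0.6, iterations=100):
    """Return one score per sentence; node 0 of the graph is the topic T."""
    vectors = [bag_of_words(t) for t in [topic] + sentences]
    n = len(vectors)
    # Affinity matrix W with a zero diagonal (no self-loops).
    w = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    degree = [sum(row) for row in w]
    s = [[w[i][j] / sqrt(degree[i] * degree[j]) if degree[i] and degree[j] else 0.0
          for j in range(n)] for i in range(n)]
    # Prior vector y: only the topic node carries an initial score.
    y = [1.0] + [0.0] * (n - 1)
    f = y[:]
    for _ in range(iterations):
        f = [alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f[1:]
```

Because the score spreads through the whole graph, a sentence can receive a non-zero rank through its links to other sentences even if it shares no words with the topic itself, while a sentence with no links at all keeps a score of zero.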

Conclusion

An examination of the methods that underpin modern automatic summarization systems led to the following conclusions:

  1. in general, the summarization task involves identifying documents, selecting key words and phrases, searching for sentences that contain them, and synthesizing, on this basis, phrases and sentences that reflect the main themes of the text in the summary;
  2. the general structure of all automatic text summarization systems is the same and consists of three interrelated parts: a block for analyzing the input text, including pre-processing and data preparation; a block for weighting text elements, which may be words, phrases, sentences, paragraphs, headings, etc.; and a block for generating the abstract.
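The three-part structure can be illustrated with a small extractive sketch in Python. It is only a toy under stated assumptions: the stop-word list, the word-frequency weighting and the function names `analyze`, `weigh` and `generate` are hypothetical choices and do not describe any particular system.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a full one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on"}


def analyze(text):
    """Block 1: split the text into sentences and content words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [[w for w in re.findall(r"\w+", s.lower()) if w not in STOP_WORDS]
             for s in sentences]
    return sentences, words


def weigh(words):
    """Block 2: weight each sentence by the average frequency of its words."""
    freq = Counter(w for sent in words for w in sent)
    return [sum(freq[w] for w in sent) / len(sent) if sent else 0.0 for sent in words]


def generate(sentences, weights, limit=2):
    """Block 3: keep the top-weighted sentences in their original order."""
    top = sorted(range(len(sentences)), key=lambda i: -weights[i])[:limit]
    return " ".join(sentences[i] for i in sorted(top))


def summarize(text, limit=2):
    """Run the three blocks in sequence."""
    sentences, words = analyze(text)
    return generate(sentences, weigh(words), limit)
```

Keeping the blocks as separate functions mirrors the structure described above: the weighting block can be swapped for a different scoring method without touching analysis or generation.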

Thus, current summarization systems can provide invaluable assistance to people whose professional activity involves analyzing large amounts of information. This area of science and engineering offers many promising directions for development.

Notice

At the time of writing this abstract, the master's thesis is not yet complete. The estimated completion date is December 2013. The full text of the thesis and related materials can be obtained from the author or his supervisor only after that date.