Summary of the final work

Contents


1. Introduction.

2. Relevance.

3. Purpose and tasks.

4. Planned scientific novelty.

5. Planned practical results.

6. Review of related research and development.

7. Algorithm for cleaning web pages of information noise.

8. Conclusions.

9. References.

INTRODUCTION

The high availability of a large and constantly growing volume of information, together with the rising popularity of web services among all categories of users, has aggravated the problem of extracting the part of that information which is significant to the user.

RELEVANCE

Let us outline a number of domains in which the task of cleaning web pages can be applied:

The task of cleaning web pages of information noise is highly relevant today. Solving it will help present the required information to the user in a convenient form, and will also positively affect the results of web search, information classification, extraction of textual information, etc.

PURPOSE AND TASKS

The purpose of the master's work is to create an accessible software tool that allows web pages to be cleaned of information noise.

To achieve this goal, the following main tasks must be solved:

  1. Perform a comparative analysis of methods for extracting the main content of web pages
  2. Develop a classification scheme for the information blocks of a site
  3. Develop an adaptive algorithm for evaluating the information blocks of a page
  4. Develop tools that allow the identified blocks to be processed
  5. Test the effectiveness of the developed tools

PLANNED SCIENTIFIC NOVELTY

  1. A new classification scheme for the information blocks of a site, with a set of values that takes into account the structure and specific features of the site
  2. A model for cleaning a web page of information noise, based on the block classification scheme (an illustrative sketch follows this list)
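
For illustration only, such a classification scheme and block model could be expressed as a small TypeScript type. The category names and fields below are assumptions of this sketch, not the final scheme to be developed in the thesis.

```typescript
// Hypothetical block categories; the names are illustrative assumptions,
// not the thesis's final classification scheme.
type BlockCategory =
  | "main-content"   // the text the user actually came for
  | "navigation"     // menus, breadcrumbs, site-wide link lists
  | "advertising"    // banners and sponsored inserts
  | "service"        // headers, footers, copyright notices
  | "related";       // "see also" blocks, tag clouds, comments

// A classified information block: a DOM node, its category, and a numeric
// significance estimate that the cleaning model can use to keep or drop it.
interface InformationBlock {
  node: Element;
  category: BlockCategory;
  significance: number; // e.g. in [0, 1], higher means "keep this block"
}
```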

PLANNED PRACTICAL RESULTS

The developed tools will perform the following tasks:

REVIEW OF RELATED RESEARCH AND DEVELOPMENT

The methods applied for the analysis of web pages can be divided into the following groups:

  1. Methods based on extracting fragments of information that repeat across all (or some) pages of a site [1]
  2. Methods based on analysis of the DOM tree of a site's pages [3]
  3. Combined methods [2]
  4. Methods of syntactic and visual analysis [5]
  5. Methods of analysis of pages built on HTML 5

Analysis of the existing methods has shown that methods based on DOM tree analysis are effective and simple, and also make it possible to process an individual web page.
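
A minimal sketch of the DOM tree approach, assuming it runs in the browser on an already parsed document: it estimates how much of a block's text is link text (a common heuristic in the cited works) and keeps blocks dominated by plain text. The 0.5 threshold and the choice of candidate tags are assumptions of this sketch, not measured values.

```typescript
// Estimate how "contentful" an element is by comparing the length of its
// text with the length of the link text it contains.
function linkDensity(el: Element): number {
  const textLength = (el.textContent ?? "").length;
  if (textLength === 0) return 1; // no text at all: treat as pure noise
  let linkTextLength = 0;
  el.querySelectorAll("a").forEach(a => {
    linkTextLength += (a.textContent ?? "").length;
  });
  return linkTextLength / textLength;
}

// Collect block-level candidates and keep those dominated by plain text.
function significantBlocks(root: Document): Element[] {
  const candidates = root.querySelectorAll("div, article, section, td");
  return Array.from(candidates).filter(el => linkDensity(el) < 0.5);
}
```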

There are tools that partially solve the problem of extracting the main content of a web page:

A review of the existing tools for cleaning web pages of information noise has made it possible to identify the main difficulties that users face:

For the development of the algorithm for cleaning web pages of information noise, the choice was made in favor of implementing it as a browser bookmarklet.
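
A bookmarklet is an ordinary browser bookmark whose address is a javascript: snippet executed on the currently open page. A minimal sketch of such a bookmarklet body is shown below; the script URL and the global cleanPage() entry point are hypothetical placeholders for the tool being developed.

```typescript
// Body of the bookmarklet; in the browser it is stored as a bookmark whose
// address is "javascript:" followed by this code collapsed to one line.
(function () {
  const s = document.createElement("script");
  s.src = "https://example.org/cleaner.js";      // hypothetical location of the cleaning script
  s.onload = () => (window as any).cleanPage();  // hypothetical entry point that cleans the page
  document.body.appendChild(s);
})();
```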

The algorithm for cleaning web pages of information noise consists of the following steps (a sketch of the whole pipeline is given after the list):

  1. Obtain the address of the web page
  2. Determine the structure of the DOM tree from the HTML page
  3. Classify the tags (nodes) of the DOM tree
  4. Determine the significant blocks
  5. Process the information blocks
  6. Check whether the processing was effective
  7. Adjust the settings if necessary
  8. Save the results
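
Under the assumption that the tool runs inside the browser on an already loaded page, the steps above could be tied together roughly as in the sketch below. Every helper function is a placeholder for a part of the work that is yet to be developed, and the 0.5 and 0.3 thresholds are arbitrary illustrative values.

```typescript
// Simplified version of the InformationBlock shape sketched in the scientific novelty section.
interface InformationBlock { node: Element; significance: number; }

// Rough skeleton of the cleaning pipeline; steps 1-2 (page address and DOM
// tree) are already available when the code runs in the browser (doc = document).
function cleanWebPage(doc: Document): string {
  const blocks = classifyNodes(doc);                           // step 3: classify DOM nodes into blocks
  let significant = blocks.filter(b => b.significance > 0.5);  // step 4: select the significant blocks
  let result = processBlocks(significant);                     // step 5: process the selected blocks

  // Steps 6-7: if the result looks ineffective, relax the settings
  // (here, simply lower the significance threshold) and process again.
  if (!looksEffective(result)) {
    significant = blocks.filter(b => b.significance > 0.3);
    result = processBlocks(significant);
  }
  return result;                                               // step 8: the caller saves the result
}

// Placeholder declarations so the sketch type-checks; the real implementations
// are the subject of the master's work and are not reproduced here.
declare function classifyNodes(doc: Document): InformationBlock[];
declare function processBlocks(blocks: InformationBlock[]): string;
declare function looksEffective(html: string): boolean;
```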

CONCLUSIONS

Cleaning web pages of information noise is one of the promising directions in the development of information and communication technologies.

Evidence of the relevance of extracting the main content of a page in response to a user's query is the continuous improvement of software tools for creating and displaying web pages. In a market saturated with technological services, methods and means of personalizing content streams are attracting increasing attention.

REFERENCES

  1. Агеев М.С., Добров Б.В., Лукашевич Н.В., Сидоров А.В. Экспериментальные алгоритмы поиска/классификации и сравнение с «basic line». // Российский семинар по Оценке Методов Информационного Поиска (РОМИП 2004) [электронный ресурс]. Режим доступа – http://romip.narod.ru/...
  2. И. Некрестьянов, Е. Павлова. Обнаружение структурного подобия HTML-документов. СПГУ, 2002 [электронный ресурс]. Режим доступа – http://meta.math.spbu.ru
  3. М.С. Агеев, И.В. Вершинников, Б.В. Добров. Извлечение значимой информации из web-страниц для задач информационного поиска. Интернет-математика 2005. Сборник работ по программам научных стипендий Яндекса. Москва, 2005.
  4. Р.Ф. Кузнецов, Н.В. Мурашов. Оценка влияния извлечения значимой информации на качество классификации web-страниц
  5. Определение понятия «информационный шум» [электронный ресурс]. Режим доступа – http://mediart.ru/...
  6. Yi, L., Liu, B., Web Page Cleaning for Web Mining through Feature Weighting, in the proceedings of Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, August, 2003.
  7. Краковецкий А. Очищаем веб-страницы от информационного шума [электронный ресурс]. Режим доступа – http://msug.vn.ua/...
  8. Soumen Chakrabarti. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction // In Proceedings of WWW10, May 1-5, 2001, Hong Kong. [электронный ресурс]. Режим доступа – http://www10.org/...
  9. Suhit Gupta, Gail E Kaiser, Peter Grimm, Michael Chiang, Justin Starren, Automating Content Extraction of HTML Documents // World Wide Web Journal, January 2005
  10. Краковецкий А. Получение основного контента веб-страниц программно [электронный ресурс]. Режим доступа – http://habrahabr.ru/...
  11. Методы и средства извлечения слабоструктурированных схем из документов в HTML и конвертирования HTML документов в их XML представление [электронный ресурс]. Режим доступа – http://www.raai.org/resurs/...