Summary of the final work
Contents
1. Introduction
2. Urgency
3. Purpose and tasks
4. Planned scientific novelty
5. Planned practical results
6. Review of related research and development
7. Algorithm for cleaning web pages
8. Conclusions
9. References
INTRODUCTION
The ready availability of a large and constantly growing body of information, together with the rising popularity of web services among all categories of users, has aggravated the problem of extracting the part of that information which is significant to the user.
URGENCY
There are a number of domains in which the task of cleaning web pages can be applied:
- content delivery services, when other approaches are unsuitable for some reason (for example, when no RSS feed is available);
- systems that collect information from various sources;
- mobile applications, where it is important to minimize traffic;
- data mining systems (data mining is the process of discovering in raw data previously unknown, non-trivial, practically useful, and interpretable knowledge needed for decision-making in various spheres of human activity).
The task of cleaning web pages of information noise is highly relevant today. Solving this problem will help present the required information to the user in a convenient form, and will also positively affect the results of web search, information classification, text extraction, and similar tasks.
PURPOSE AND TASKS
The purpose of the master's work is to create a software tool that allows web pages to be cleared of information noise.
To achieve this goal, the following basic tasks must be solved:
- perform a comparative analysis of methods for separating the main content of web pages;
- develop a classification scheme for the information blocks of a site;
- develop an adaptive algorithm for evaluating the information blocks of pages;
- develop tools that allow the information blocks to be processed;
- test the effectiveness of the developed tools.
PLANNED SCIENTIFIC NOVELTY
- a new classification scheme for the information blocks of sites, with a set of values that takes into account the structure and specifics of the site;
- a model for cleaning a web page of information noise based on this classification scheme.
PLANNED PRACTICAL RESULTS
The developed tools will perform the following tasks:
- hiding banner ad units and multimedia content that distract the user's attention;
- adapting the information on a site to the user's requests.
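As a minimal illustration of the first task, the following sketch hides advertising blocks by class-name heuristics. The class patterns, the function name, and the sample markup are assumptions of this sketch, not part of the work; well-formed (XHTML-like) markup is assumed so that the standard XML parser can be used.

```python
import xml.etree.ElementTree as ET

# Class-name substrings that typically mark advertising blocks
# (a heuristic assumption for this sketch, not an exhaustive list).
AD_PATTERNS = ("banner", "advert", "popup")

def hide_ad_blocks(html: str) -> str:
    """Remove elements whose class attribute matches an ad pattern."""
    root = ET.fromstring(html)
    # Walk parents so matching children can be detached in place.
    for parent in list(root.iter()):
        for child in list(parent):
            cls = child.get("class", "").lower()
            if any(p in cls for p in AD_PATTERNS):
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

page = '<div><p>Main article text.</p><div class="top-banner">Buy now!</div></div>'
print(hide_ad_blocks(page))  # the banner <div> is gone
```

A real tool would work on browser-rendered DOM rather than re-parsed markup, but the filtering idea is the same.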
REVIEW OF RELATED RESEARCH AND DEVELOPMENT
The methods applied to the analysis of web pages can be divided into:
- methods based on extracting fragments of information that are repeated across all (or some) pages of a site [1];
- methods based on analysis of the DOM tree of the site's pages [3];
- combined methods [2];
- methods of syntactic and visual analysis [5];
- methods for the analysis of pages built on HTML5.
Analysis of the existing methods has shown that those based on DOM-tree analysis are effective and simple, and also make it possible to process an individual web page.
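A common building block of such DOM-based methods is a per-block link-density score: blocks whose text consists mostly of link anchors are likely navigation or noise rather than main content. A minimal sketch (the class name and the interpretation thresholds are illustrative assumptions):

```python
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    """Accumulate total text length and the share of it inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total = 0      # length of all visible text
        self.linked = 0     # length of text inside links
        self._in_link = 0   # current <a> nesting depth

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self._in_link:
            self.linked += n

def link_density(html: str) -> float:
    p = LinkDensity()
    p.feed(html)
    return p.linked / p.total if p.total else 0.0

nav = '<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>'
article = '<div><p>Long article text with a single <a href="#">reference</a> inside.</p></div>'
print(link_density(nav))      # close to 1.0: likely navigation noise
print(link_density(article))  # low: likely main content
```

Such a score can be computed for every subtree of the DOM and combined with other features when classifying blocks.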
There are tools that partially solve the problem of extracting the main web content:
- NoScript
- AdBlock Plus
- Flash Block
- Safari Reader
- Readability
A review of existing tools for cleaning web pages of information noise revealed the main difficulties that users face:
- Blocking of useful content. Systems for extracting the main content often block useful information together with navigation and banner blocks, and that information becomes available only by cancelling the processing of the web page.
- Lack of universality. Many existing tools are designed for a specific browser, which narrows the range of users.
- No adaptation to a particular user. Tools for extracting the main content of a web page rely on a general notion of "useful information" as a block of textual information, which does not always match the user's needs.
- Low efficiency.
It can be concluded that the development of tools for cleaning web pages of information noise is quite active, but there are still no universal tools that could satisfy all user requests.
When developing the algorithm for cleaning web pages of information noise, the choice fell on the idea of creating a bookmarklet.
The algorithm for cleaning web pages of information noise consists of the following steps:
- obtaining the address of the web page;
- determining the structure of the DOM tree from the HTML page;
- classifying the tags (nodes) of the DOM tree;
- determining the significant blocks;
- processing the information blocks;
- checking whether the processing is effective;
- customizing the processing if it is not;
- saving the results.
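The steps above can be sketched as a simple pipeline. The tag classification and the significance rule below are toy assumptions of this sketch; the real tool would also fetch the page by its address, check effectiveness, and support user customization.

```python
from html.parser import HTMLParser

# A toy classification of DOM nodes (an assumption for this sketch;
# the thesis proposes a richer classification scheme).
NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "iframe"}
CONTENT_TAGS = {"p", "h1", "h2", "h3", "article"}

class Cleaner(HTMLParser):
    """Walk the DOM, classify nodes, and collect significant blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._noise_depth = 0   # nesting depth inside noise tags
        self._keep = 0          # nesting depth inside content tags

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag in CONTENT_TAGS:
            self._keep += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth:
            self._noise_depth -= 1
        elif tag in CONTENT_TAGS and self._keep:
            self._keep -= 1

    def handle_data(self, data):
        text = data.strip()
        # A block is significant if it sits inside a content tag and
        # not inside a noise tag (illustrative rule only).
        if text and self._keep and not self._noise_depth:
            self.blocks.append(text)

def clean_page(html: str) -> str:
    parser = Cleaner()
    parser.feed(html)
    return "\n".join(parser.blocks)  # the result to be saved

page = """
<html><body>
<nav><a href="/">Home</a></nav>
<h1>Title</h1>
<p>Main content of the page.</p>
<aside>Advertising block</aside>
</body></html>
"""
print(clean_page(page))
```

For the bookmarklet form, the same logic would run as JavaScript against the live DOM of the open page rather than over an HTML string.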
CONCLUSIONS
Cleaning web pages of information noise is one of the promising directions in the development of information and communication technologies.
Evidence of the relevance of extracting the main content of a page according to the user's query is the continuous improvement of software tools for creating and displaying web pages. As the market for technology services becomes saturated, methods and means of personalizing content streams attract ever more attention.
REFERENCES
1. Агеев М.С., Добров Б.В., Лукашевич Н.В., Сидоров А.В. Экспериментальные алгоритмы поиска/классификации и сравнение с «basic line» // Российский семинар по Оценке Методов Информационного Поиска (РОМИП 2004) [online]. Available at: http://romip.narod.ru/...
2. Некрестьянов И., Павлова Е. Обнаружение структурного подобия HTML-документов. СПГУ, 2002 [online]. Available at: http://meta.math.spbu.ru
3. Агеев М.С., Вершинников И.В., Добров Б.В. Извлечение значимой информации из web-страниц для задач информационного поиска // Интернет-математика 2005. Сборник работ по программам научных стипендий Яндекса. Москва, 2005.
4. Кузнецов Р.Ф., Мурашов Н.В. Оценка влияния извлечения значимой информации на качество классификации web-страниц.
5. Определение понятия «информационный шум» [online]. Available at: http://mediart.ru/...
6. Yi L., Liu B. Web Page Cleaning for Web Mining through Feature Weighting // Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, August 2003.
7. Краковецкий А. Очищаем веб-страницы от информационного шума [online]. Available at: http://msug.vn.ua/...
8. Chakrabarti S. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction // Proceedings of WWW10, May 1-5, 2001, Hong Kong [online]. Available at: http://www10.org/...
9. Gupta S., Kaiser G.E., Grimm P., Chiang M., Starren J. Automating Content Extraction of HTML Documents // World Wide Web Journal, January 2005.
10. Краковецкий А. Получение основного контента веб-страниц программно [online]. Available at: http://habrahabr.ru/...
11. Методы и средства извлечения слабоструктурированных схем из документов в HTML и конвертирования HTML документов в их XML представление [online]. Available at: http://www.raai.org/resurs/...