Summary of the final work
Contents
1. Introduction
2. Urgency
3. Purpose and tasks
4. Planned scientific novelty
5. Planned practical results
6. Review of related research and development
7. Algorithm for cleaning web pages
8. Conclusions
9. References
INTRODUCTION
The ready availability of a large and constantly growing body of information, together with the rising popularity of web services among all categories of users, has aggravated the problem of extracting the part of that information which is significant to the user.
URGENCY
There are a number of domains in which the task of cleaning web pages can be applied:
- content delivery services, when other approaches are unsuitable for some reason (for example, when no RSS feed is available);
- systems that collect information from various sources;
- mobile applications, where it is important to minimize traffic;
- data mining systems (data mining is the process of discovering in raw data previously unknown, non-trivial, practically useful, and interpretable knowledge needed for decision-making in various spheres of human activity).
The task of cleaning web pages of information noise is highly relevant today. Solving this problem will help present the required information to the user in a convenient form, and will also positively affect the results of web search, information classification, text extraction, and similar tasks.
PURPOSE AND TASKS
The purpose of the master's work is to create a software tool that allows web pages to be cleared of information noise.
To achieve this goal, the following basic tasks must be solved:
- perform a comparative analysis of methods for separating the main content of web pages;
- develop a classification scheme for the information blocks of a site;
- develop an adaptive algorithm for evaluating the information blocks of pages;
- develop tools that allow the information blocks to be processed;
- test the effectiveness of the developed tools.
PLANNED SCIENTIFIC NOVELTY
- a new classification scheme for the information blocks of sites, with a set of values that takes into account the structure and specifics of the site;
- a model for cleaning a web page of information noise based on this classification scheme.
PLANNED PRACTICAL RESULTS
The developed tools will perform the following tasks:
- hiding banner ad units and multimedia content that distract the user's attention;
- adapting the information on a site to the user's requests.
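As a minimal illustration of the first task, the following sketch hides advertising blocks by class-name heuristics. The class patterns, the function name, and the sample markup are assumptions of this sketch, not part of the work; well-formed (XHTML-like) markup is assumed so that the standard XML parser can be used.

```python
import xml.etree.ElementTree as ET

# Class-name substrings that typically mark advertising blocks
# (a heuristic assumption for this sketch, not an exhaustive list).
AD_PATTERNS = ("banner", "advert", "popup")

def hide_ad_blocks(html: str) -> str:
    """Remove elements whose class attribute matches an ad pattern."""
    root = ET.fromstring(html)
    # Walk parents so matching children can be detached in place.
    for parent in list(root.iter()):
        for child in list(parent):
            cls = child.get("class", "").lower()
            if any(p in cls for p in AD_PATTERNS):
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

page = '<div><p>Main article text.</p><div class="top-banner">Buy now!</div></div>'
print(hide_ad_blocks(page))  # the banner <div> is gone
```

A real tool would work on browser-rendered DOM rather than re-parsed markup, but the filtering idea is the same.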
REVIEW OF RELATED RESEARCH AND DEVELOPMENT
The methods applied to the analysis of web pages can be divided into:
- methods based on extracting fragments of information that are repeated across all (or some) pages of a site [1];
- methods based on analysis of the DOM tree of the site's pages [3];
- combined methods [2];
- methods of syntactic and visual analysis [5];
- methods for the analysis of pages built on HTML5.
Analysis of the existing methods has shown that those based on DOM-tree analysis are effective and simple, and also make it possible to process an individual web page.
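A common building block of such DOM-based methods is a per-block link-density score: blocks whose text consists mostly of link anchors are likely navigation or noise rather than main content. A minimal sketch (the class name and the interpretation thresholds are illustrative assumptions):

```python
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    """Accumulate total text length and the share of it inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total = 0      # length of all visible text
        self.linked = 0     # length of text inside links
        self._in_link = 0   # current <a> nesting depth

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self._in_link:
            self.linked += n

def link_density(html: str) -> float:
    p = LinkDensity()
    p.feed(html)
    return p.linked / p.total if p.total else 0.0

nav = '<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>'
article = '<div><p>Long article text with a single <a href="#">reference</a> inside.</p></div>'
print(link_density(nav))      # close to 1.0: likely navigation noise
print(link_density(article))  # low: likely main content
```

Such a score can be computed for every subtree of the DOM and combined with other features when classifying blocks.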
There are tools that partially solve the problem of extracting the main web content:
- NoScript
- AdBlock Plus
- Flash Block
- Safari Reader
- Readability
A review of existing tools for cleaning web pages of information noise revealed the main difficulties that users face:
- Blocking of useful content. Systems for extracting the main content often block useful information together with navigation and banner blocks, and that information becomes available only by cancelling the processing of the web page.
- Lack of universality. Many existing tools are designed for a specific browser, which narrows the range of users.
- No adaptation to a particular user. Tools for extracting the main content of a web page rely on a general notion of "useful information" as a block of textual information, which does not always match the user's needs.
- Low efficiency.
It can be concluded that the development of tools for cleaning web pages of information noise is quite active, but there are still no universal tools that could satisfy all user requests.
When developing the algorithm for cleaning web pages of information noise, the choice fell on the idea of creating a bookmarklet.
The algorithm for cleaning web pages of information noise consists of the following steps:
- obtaining the address of the web page;
- determining the structure of the DOM tree from the HTML page;
- classifying the tags (nodes) of the DOM tree;
- determining the significant blocks;
- processing the information blocks;
- checking whether the processing is effective;
- customizing the processing if it is not;
- saving the results.
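The steps above can be sketched as a simple pipeline. The tag classification and the significance rule below are toy assumptions of this sketch; the real tool would also fetch the page by its address, check effectiveness, and support user customization.

```python
from html.parser import HTMLParser

# A toy classification of DOM nodes (an assumption for this sketch;
# the thesis proposes a richer classification scheme).
NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "iframe"}
CONTENT_TAGS = {"p", "h1", "h2", "h3", "article"}

class Cleaner(HTMLParser):
    """Walk the DOM, classify nodes, and collect significant blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._noise_depth = 0   # nesting depth inside noise tags
        self._keep = 0          # nesting depth inside content tags

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag in CONTENT_TAGS:
            self._keep += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth:
            self._noise_depth -= 1
        elif tag in CONTENT_TAGS and self._keep:
            self._keep -= 1

    def handle_data(self, data):
        text = data.strip()
        # A block is significant if it sits inside a content tag and
        # not inside a noise tag (illustrative rule only).
        if text and self._keep and not self._noise_depth:
            self.blocks.append(text)

def clean_page(html: str) -> str:
    parser = Cleaner()
    parser.feed(html)
    return "\n".join(parser.blocks)  # the result to be saved

page = """
<html><body>
<nav><a href="/">Home</a></nav>
<h1>Title</h1>
<p>Main content of the page.</p>
<aside>Advertising block</aside>
</body></html>
"""
print(clean_page(page))
```

For the bookmarklet form, the same logic would run as JavaScript against the live DOM of the open page rather than over an HTML string.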
CONCLUSIONS
Cleaning web pages of information noise is one of the promising directions in the development of information and communication technologies.
Evidence of the relevance of extracting the main content of a page according to the user's query is the continuous improvement of software tools for creating and displaying web pages. As the market for technology services becomes saturated, methods and means of personalizing content streams attract ever more attention.
REFERENCES
1. Агеев М.С., Добров Б.В., Лукашевич Н.В., Сидоров А.В. Экспериментальные алгоритмы поиска/классификации и сравнение с «basic line» // Российский семинар по Оценке Методов Информационного Поиска (РОМИП 2004) [online]. Available at: http://romip.narod.ru/...
2. Некрестьянов И., Павлова Е. Обнаружение структурного подобия HTML-документов. СПГУ, 2002 [online]. Available at: http://meta.math.spbu.ru
3. Агеев М.С., Вершинников И.В., Добров Б.В. Извлечение значимой информации из web-страниц для задач информационного поиска // Интернет-математика 2005. Сборник работ по программам научных стипендий Яндекса. Москва, 2005.
4. Кузнецов Р.Ф., Мурашов Н.В. Оценка влияния извлечения значимой информации на качество классификации web-страниц.
5. Определение понятия «информационный шум» [online]. Available at: http://mediart.ru/...
6. Yi L., Liu B. Web Page Cleaning for Web Mining through Feature Weighting // Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, August 2003.
7. Краковецкий А. Очищаем веб-страницы от информационного шума [online]. Available at: http://msug.vn.ua/...
8. Chakrabarti S. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction // Proceedings of WWW10, May 1-5, 2001, Hong Kong [online]. Available at: http://www10.org/...
9. Gupta S., Kaiser G.E., Grimm P., Chiang M., Starren J. Automating Content Extraction of HTML Documents // World Wide Web Journal, January 2005.
10. Краковецкий А. Получение основного контента веб-страниц программно [online]. Available at: http://habrahabr.ru/...
11. Методы и средства извлечения слабоструктурированных схем из документов в HTML и конвертирования HTML документов в их XML представление [online]. Available at: http://www.raai.org/resurs/...