DonNTU   Masters' portal

Abstract

Introduction

Modern information technologies based on the global Internet are now widely used in business. Although the Internet has a long history, its commercial use began only in 1988. It is hard to imagine life today without the Internet, which is filled with millions of websites that together form a virtual information space. Internet resources have become an everyday working tool for people in many professions.

The rapid growth of information has turned the network into an ocean of diverse data, whose importance increases in proportion to its volume. New documents appear online every day. Without retrieval systems the vast majority of them would never be in demand: nothing could be found, and all this wealth of information would be useless. This created a need for tools that make it easy to navigate the information resources of global networks and to find the desired information quickly and reliably. This is how dedicated search tools appeared on the Internet.

1. Relevance

Search engines are the most important part of the modern Internet. Previously, search relied on special directories containing links to existing resources, but today the number of resources is so large that fully automated systems are required to search the Internet.

Today, the Internet is constantly being extended with new sites. To attract more visitors to their websites, webmasters look for new optimization approaches that help a site keep a leading position.

User behavior models are one of the main areas of research in improving Internet search. This master's thesis is devoted to the topical problem of search optimization: it reviews current models of user behavior and studies how behavioral features are combined with other features in the ranking function. Existing methods of search query optimization require considerable time and skill. The subject of study is therefore the impact of different search engine optimization methods on raising a site's position in search results.

2. Research goals and objectives, expected results

The main objective of search is to provide answers to questions. A significant share of queries are searches for goods and services. Among the many sites offering goods and services, a search engine should find and offer users the highest-quality ones: convenient, informative and authoritative. Clearly, all these characteristics are subjective, and a search algorithm can use only measurable parameters.

Main tasks of the research:

  1. Analysis of user behavior [9].
  2. Analysis of user session modeling.
  3. Analysis of new principles for improving search engines [8].
  4. Development of a method for optimizing search queries in the field of online provision of goods and services.

Research object: optimization of search queries in the field of online provision of goods and services.

Research subject: a method for modeling user behavior.

The master's work is expected to yield relevant scientific results in the following areas:

  1. Developing approaches to modeling, in software, user behavior when placing an order in the field of online provision of goods and services.
  2. Developing a personal site on which studies will be conducted to test this method.
  3. Modifying known methods of modeling user behavior and evaluating the effectiveness of their application to search query optimization.

3. A review of research and development

Every year the Internet becomes a more familiar way to access a wide variety of information. Search engines are the most important part of the modern Internet and have already become an integral feature of the modern information society. User behavior models are one of the main areas of research for advancing Internet search.

3.1 Overview of international sources

Internet search technology grows with users' needs. Specialists constantly have to develop their skills and keep track of changes in the requirements and algorithms of the leading search engines. New breakthroughs can be expected in this research area in the near future.

I would like to highlight the following foreign experts in this field:

  1. Eugene Agichtein – professor at Emory University, Georgia, USA [1].
  2. Chris Bishop – member of the Royal Academy of Engineering [2].
  3. Nick Craswell – researcher at Bing in Bellevue, Washington [4].
  4. Monica Wright – audience director for publications.
  5. Trevor Hastie – professor of Mathematical Sciences, Stanford University [3].

3.2 Review of national sources

In Ukraine, few specialists work on this research subject. The first of them was A. G. Dubinsky, a graduate student of the National Technical University of Ukraine "Kiev Polytechnic Institute" [10].

3.3 Review of local sources

At Donetsk National Technical University, the analysis of Internet pages was studied by master's student V. Shinkarenko in the work "Audience analysis and forecasting of site traffic". The work analyzes a site's target audience and finds dependencies for predicting and evaluating site visits and other parameters.

4. Principles of information retrieval

Information Retrieval

Nowadays, searching for information usually means searching the Internet; however, the term "information retrieval" originated much earlier. According to the monograph [6], information retrieval is the process of finding, in a large collection (usually stored in computer memory), material of an unstructured nature that satisfies an information need.

To interact with a search engine, the user submits a query, that is, formulates an information need in a language the system understands. In response, the system returns an ordered list of documents. To describe how well documents meet the information need, information retrieval theory introduces the notion of relevance: the correspondence of a document to the information request. By the way it is determined, relevance is usually divided into formal and substantive. Formal relevance is determined by an algorithm implemented in the retrieval system. Substantive relevance is the correspondence of a document to the user's request as determined informally, from the semantics of the document.

At first glance, the purpose of information retrieval can be formulated as follows: find all the relevant documents. But with large collections, the number of documents matching a request can be so great that a person simply cannot look through them all. Thus, one important task of a search engine is ranking documents according to how well they match the request. The function that assigns each document a number, the calculated relevance of the document to the request, is called a ranking function. This function takes into account various features of the document and the request, as well as of the entire document collection of the search engine.
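As a rough illustration of a ranking function, the sketch below scores each document with a simple TF-IDF-style sum. This is only an assumed toy scoring scheme, not the method of the thesis or of any real search engine; the function name `rank_documents` and the sample texts are hypothetical. It is meant to show how features of the document (term counts), the request (query terms) and the whole collection (document frequencies) can enter one score.

```python
import math
from collections import Counter


def rank_documents(query, documents):
    """Return (document index, score) pairs sorted by decreasing score.

    Toy ranking function (an assumption for illustration): the score is
    the sum over query terms of term frequency times a smoothed inverse
    document frequency.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Collection-level feature: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scored = []
    for index, tokens in enumerate(tokenized):
        term_counts = Counter(tokens)  # document-level feature
        score = 0.0
        for term in query.lower().split():  # request-level feature
            if term_counts[term] == 0:
                continue
            idf = math.log(1 + n_docs / doc_freq[term])
            score += term_counts[term] * idf
        scored.append((index, score))

    # Higher scores first; ties are broken by original document order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


documents = [
    "buy cheap laptop online",
    "laptop repair service",
    "weather forecast today",
]
ranking = rank_documents("cheap laptop", documents)
```

Here the first document matches both query terms and is ranked first, while the third matches none and receives a score of zero.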

Probabilistic search model

A user cannot formulate an information need exactly in the form of a query. Given only the query, the system cannot determine the relevance of a document with certainty. Decision-making under such uncertainty requires the mathematical apparatus of probability theory.

Assume that the relevance judgment is binary: a document is either relevant to the request or not. Thus, for each document d and query q a random variable R(d, q) is introduced, the relevance indicator; it equals one if document d is relevant to query q and zero otherwise. Where this causes no confusion, we denote the relevance indicator simply by R.

In this model it is natural to rank documents by the estimated probability of their relevance to the query: p(R(d, q) = 1). This approach is the basis of the probability ranking principle proposed by Robertson in 1977 [7].

Its major provisions:

  1. The relevance of a document to a query is independent of the other documents in the collection;
  2. Probabilistic ranking principle: if, in response to each user query, the search engine ranks documents in descending order of their probability of being relevant to the user's request, and this probability is estimated as accurately as possible from the data available, then the overall effectiveness of the system is the best obtainable on the basis of that data.
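The principle can be illustrated with a small sketch. The probability estimates below are invented for the example, and the helper names are hypothetical. The code orders documents by decreasing p(R = 1) and checks by brute force that, for this toy input, no other ordering gives a larger expected number of relevant documents among the first two results.

```python
from itertools import permutations


def prp_ranking(probabilities):
    """Order document ids by decreasing estimated probability of relevance."""
    return sorted(probabilities, key=lambda doc: probabilities[doc], reverse=True)


def expected_relevant_in_top_k(ranking, probabilities, k):
    """Expected number of relevant documents among the first k results.

    By linearity of expectation this is the sum of p(R = 1) over the top k.
    """
    return sum(probabilities[doc] for doc in ranking[:k])


# Hypothetical estimates of p(R(d, q) = 1) for one query.
estimates = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}

ranking = prp_ranking(estimates)
prp_value = expected_relevant_in_top_k(ranking, estimates, 2)

# Brute-force check over all orderings of the four documents.
best_value = max(
    expected_relevant_in_top_k(list(order), estimates, 2)
    for order in permutations(estimates)
)
```

The check is exhaustive only because the example has four documents; it is a demonstration on one input, not a proof of the principle.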

Features of evaluating the technical effectiveness of search

1. Priority of precision. Typically, a query matches many documents, including many relevant ones. Therefore precision matters more than recall. Indeed, consider two search engines. For some query the 1st engine finds 200 documents, and all of them are relevant. The 2nd engine returns 5,000 documents for the same query, 500 of which are relevant, and only 100 of the first 200 documents are relevant. Although the recall of the 2nd engine is much higher, the 1st engine is clearly better, since a user rarely views hundreds of the documents found (more likely, the user looks only at the first page of results).

2. Ranking quality must be tested. The documents found are returned in ranked order, so quality assessment must take into account a document's position in the result list. That is, search quality for a query should be characterized by a set of precision values at different sizes of the initial part of the list, for example precision at the first 10, 30, 50, 70 and 100 documents. The more values are used, the better the assessment, but the more laborious the evaluation.

3. Precision values need to be weighted. Among the set of precision values characterizing search quality for a query, the more important ones are those obtained for a small number of documents. For example, precision at 30 documents is more important than precision at 300 documents. In other words, the main interest is the relationship between recall and precision in the region of small recall values.
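The three points above can be sketched with the two engines from the example. The text does not say where the relevant documents sit in the 2nd engine's list, so the code assumes they alternate within the first 200 results. The 1/k weights are likewise only one possible way to favor small cutoffs, and the function names are hypothetical.

```python
def precision_at_k(relevance, k):
    """Share of relevant documents among the first k results.

    `relevance` is a list of booleans in ranked order. If fewer than k
    results were returned, the share is taken over what was returned
    (a simplifying assumption of this sketch).
    """
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0


def weighted_precision(relevance, cutoffs=(10, 30, 50, 70, 100)):
    """Weighted mean of precision values; smaller cutoffs weigh more.

    The 1/k weighting is an illustrative assumption, not a standard.
    """
    weights = [1.0 / k for k in cutoffs]
    values = [precision_at_k(relevance, k) for k in cutoffs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# 1st engine: 200 documents, all relevant.
engine_1 = [True] * 200
# 2nd engine: 5,000 documents, 500 relevant, 100 of them in the first 200.
# The placement of relevant documents is assumed for the example.
engine_2 = [True, False] * 100 + [True] * 400 + [False] * 4400

p200_engine_1 = precision_at_k(engine_1, 200)
p200_engine_2 = precision_at_k(engine_2, 200)
score_engine_1 = weighted_precision(engine_1)
score_engine_2 = weighted_precision(engine_2)
```

Under these assumptions the 1st engine has precision 1.0 at 200 documents against 0.5 for the 2nd, and the weighted score preserves the same ordering.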

The criteria for assessing the quality of the search

To assess search quality, a test set is needed containing "reliable" information about which documents are relevant to which queries. Usually the test set is built by expert assessors and includes relevance judgments for (query, document) pairs. Judgments can be numeric or categorical. Since the judgments are obtained from people, the test set covers only a small part of the search engine's entire database, and preparing it is time-consuming and costly.

The classical measures of search engine performance are precision and recall:

  1. precision – the number of relevant documents in the results for a query, divided by the total number of documents in the results;
  2. recall – the number of relevant documents in the results for a query, divided by the total number of relevant documents in the search engine's database.
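A minimal sketch of both measures, computed against a tiny test set of (query, document) judgments, follows. The judgments, query name and document ids are made up for the example, and binary labels are assumed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    `retrieved` holds the document ids returned by the engine,
    `relevant` holds the ids judged relevant in the test set.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical expert judgments: 1 means relevant, 0 means not relevant.
judgments = {
    ("q1", "d1"): 1,
    ("q1", "d2"): 0,
    ("q1", "d3"): 1,
    ("q1", "d4"): 0,
    ("q1", "d7"): 1,
    ("q1", "d8"): 1,
    ("q1", "d9"): 1,
}

relevant_for_q1 = {
    doc for (query, doc), label in judgments.items()
    if query == "q1" and label == 1
}
retrieved_for_q1 = ["d1", "d2", "d3", "d4"]

precision, recall = precision_recall(retrieved_for_q1, relevant_for_q1)
```

In this example two of the four returned documents are relevant (precision 0.5), and they cover two of the five relevant documents in the test set (recall 0.4).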

Conclusion

Based on an analysis of how information resources are placed on the Internet, of thematic groupings in the Web information space, of how search engines and their indexing mechanisms function, and of the most current techniques for studying and optimizing search queries, it was concluded that it is both possible and necessary to create a simplified method for rapid user-side assessment of quality and ranking of search queries.

The results will be presented in several stages. The first stage will be a comparative analysis of several information retrieval methods. The second will be an experimental verification of the proposed methods on standard test sets.

At the time of writing this abstract, the master's work is not yet complete. Final completion: December 2015. The full text of the work and materials on the topic can be obtained from the author or his supervisor after that date.

References

  1. Ageev M., Guo Q., Lagun D., Agichtein E. Find it if you can: a game for modeling different types of web search success using interaction data. Proceedings of the 34th Annual ACM SIGIR Conference, 2011. P. 345–354.
  2. Bishop C. M. Pattern Recognition and Machine Learning. Springer, 2006.
  3. Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning. Springer, 2008.
  4. Craswell N., Zoeter O., Taylor M., Ramsey B. An experimental comparison of click position-bias models. Proceedings of the 1st ACM International Conference on Web Search and Data Mining, 2008. P. 87–94.
  5. Yandex. Internet search: what users look for and how. Yandex information bulletin.
  6. Manning C. D., Raghavan P., Schütze H. Introduction to Information Retrieval. Cambridge University Press, 2008.
  7. Robertson S. E. Probability ranking principle in IR. Journal of Documentation, 1977. P. 294–304.
  8. Breiman L., Friedman J. H., Olshen R. A., Stone C. J. Classification and Regression Trees. New York: Chapman & Hall, 1984.
  9. Nikolenko S. I., Fishkov A. A. SCM: a new probabilistic model of Internet search user behavior. SPIIRAS Proceedings, 2012 (in Russian).
  10. Dubinsky A. G. Factors affecting the quality of information retrieval. System Analysis and Information Technologies: abstracts of the international scientific and practical conference of students, postgraduates and young scientists. Kiev: NTUU "KPI", 2001. P. 43–48 (in Russian).