Bezuglyi Ievgen
Faculty: Computer Science and Informatics
Speciality: System Programming
Theme of master's work: Increasing the Effectiveness of Search Engines
Scientific supervisor: Anoprienko Alexandr
1. Introduction
The goal of information retrieval (IR) is to provide users with those documents
that will satisfy their information need. We use the word "document" as a general
term that could also include non-textual information, such as multimedia objects.
Users have to formulate their information need in a form that can be understood
by the retrieval mechanism. There are several steps involved in this translation
process that we will briefly discuss below. Likewise, the contents of large document
collections need to be described in a form that allows the retrieval mechanism to
identify the potentially relevant documents quickly. In both cases, information
may be lost in the transformation process leading to a computer-usable representation.
Hence, the matching process is inherently imperfect.
The process of information retrieval consists of four phases [5]:
- Indexing the data
- Analysing the query
- Applying the retrieval model
- Ranking the results
The main idea in constructing the retrieval system is to build an index, which represents a database of the documents and the words they contain. Using this index it is possible to build several models of IR.
2. General Models of Information Retrieval
2.1 Standard Boolean Model
The Boolean model of information retrieval (BIR) is a classical information retrieval
(IR) model and, at the same time, the first and most widely adopted one. It is used by
virtually all commercial IR systems today. The BIR is based on Boolean logic and
classical set theory: both the documents to be searched and the user's query
are conceived as sets of terms. Retrieval is based on whether or not the documents
contain the query terms.
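Boolean retrieval over an inverted index reduces to ordinary set operations. The sketch below is a minimal illustration; the contents of the toy index are hypothetical.

```python
# Toy inverted index (term -> set of document ids); contents are hypothetical.
index = {
    "information":   {1, 2},
    "retrieval":     {1, 2, 3},
    "boolean":       {1},
    "probabilistic": {3},
}

def boolean_and(index, terms):
    """Documents containing every query term (Boolean AND)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, terms):
    """Documents containing at least one query term (Boolean OR)."""
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result

print(boolean_and(index, ["information", "retrieval"]))  # {1, 2}
print(boolean_or(index, ["boolean", "probabilistic"]))   # {1, 3}
```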
2.2 Vector Space Model
The vector space model represents the documents and queries as vectors in a multidimensional
space, whose dimensions are the terms used to build an index to represent the documents.
The creation of an index involves lexical scanning to identify the significant terms,
where morphological analysis reduces different word forms to common "stems", and
the occurrence of those stems is computed.
Query and document surrogates are compared by comparing their vectors, using, for
example, the cosine similarity measure. In this model, the terms of a query surrogate
can be weighted to take into account their importance, and they are computed by
using the statistical distributions of the terms in the collection and in the documents.
The vector space model can assign a high ranking score to a document that contains
only a few of the query terms if these terms occur infrequently in the collection
but frequently in the document.
The vector space model makes the following assumptions: 1) The more similar a document
vector is to a query vector, the more likely it is that the document is relevant
to that query. 2) The words used to define the dimensions of the space are orthogonal
or independent. While it is a reasonable first approximation, the assumption that
words are pairwise independent is not realistic.
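A minimal sketch of the vector space model follows, assuming a toy collection and plain tf-idf weighting with cosine similarity; real systems use more refined weighting schemes.

```python
import math
from collections import Counter

# Hypothetical toy collection used only for illustration.
docs = {
    1: "boolean model of information retrieval",
    2: "vector space model with cosine similarity",
    3: "latent semantic indexing of documents",
}

def tf_idf_vectors(docs):
    """Compute tf-idf weighted term vectors (as dicts) for every document."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    idf = {t: math.log(n / df_t) for t, df_t in df.items()}
    vectors = {d: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for d, tokens in tokenized.items()}
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vectors, idf = tf_idf_vectors(docs)
q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter("vector model".split()).items()}
ranking = sorted(((cosine(q_vec, v), d) for d, v in vectors.items()), reverse=True)
print(ranking)   # document 2 should rank first
```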
3. Probabilistic Model
The probabilistic retrieval model is based on the Probability Ranking Principle,
which states that an information retrieval system is supposed to rank the documents
based on their probability of relevance to the query, given all the evidence available.
The principle takes into account that there is uncertainty in the representation
of the information need and the documents. There can be a variety of sources of
evidence that are used by the probabilistic retrieval methods, and the most common
one is the statistical distribution of the terms in both the relevant and non-relevant
documents.
The system estimates the probability P(information need | document) that an information need is satisfied by a given document.
An inference network consists of a directed acyclic dependency graph, where edges
represent conditional dependency or causal relations between propositions represented
by the nodes. The inference network consists of a document network, a concept representation
network that represents indexing vocabulary, and a query network representing the
information need.
The concept representation network is the interface between documents and queries.
To compute the rank of a document, the inference network is instantiated and the
resulting probabilities are propagated through the network to derive a probability
associated with the node representing the information need. These probabilities
are used to rank documents.
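As a hedged illustration of probabilistic ranking (not the inference-network approach itself, but the simpler binary independence model), the sketch below scores a document by summing per-term weights estimated from toy collection statistics; the smoothing constant of 0.5 and the relevance-feedback counts are assumptions made only for the example.

```python
import math

# Toy statistics: collection size, document frequency of each query term,
# and hypothetical counts obtained from user relevance feedback.
N = 1000                                         # documents in the collection
df = {"retrieval": 120, "probabilistic": 15}     # documents containing the term
R = 10                                           # known relevant documents
r = {"retrieval": 8, "probabilistic": 6}         # relevant docs containing the term

def bim_weight(term):
    """Binary-independence term weight with add-0.5 smoothing."""
    p = (r[term] + 0.5) / (R + 1.0)              # P(term | relevant)
    q = (df[term] - r[term] + 0.5) / (N - R + 1.0)   # P(term | non-relevant)
    return math.log(p * (1 - q) / (q * (1 - p)))

def rsv(document_terms, query_terms):
    """Retrieval status value: sum of weights of the query terms present."""
    return sum(bim_weight(t) for t in query_terms if t in document_terms)

print(rsv({"probabilistic", "retrieval", "model"}, ["retrieval", "probabilistic"]))
```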
4. Latent Semantic Indexing
Several statistical and AI techniques have been used in association with domain
semantics to extend the vector space model to help overcome some of the retrieval
problems described above, such as the "dependence problem" or the "vocabulary problem".
One such method is Latent Semantic Indexing (LSI).
In LSI the associations among terms and documents are calculated and exploited in
the retrieval process. The assumption is that there is some "latent" structure in
the pattern of word usage across documents and that statistical techniques can be
used to estimate this latent structure. An advantage of this approach is that queries
can retrieve documents even if they have no words in common.
The LSI technique captures deeper associative structure than simple term-to-term
correlations and is completely automatic. The only difference between LSI and vector
space methods is that LSI represents terms and documents in a reduced dimensional
space of the derived indexing dimensions. As with the vector space method, differential
term weighting and relevance feedback can improve LSI performance substantially.
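A minimal sketch of the dimensionality reduction behind LSI, assuming NumPy and a tiny hypothetical term-document matrix; real systems use large sparse matrices and truncated SVD, but the principle is the same.

```python
import numpy as np

# Rows = terms, columns = documents (toy occurrence counts).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # keep the k largest singular values
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

doc_vectors = (np.diag(sk) @ Vtk).T     # documents in the reduced space
q = np.array([1, 0, 0, 0, 0], dtype=float)   # query: "car"
q_k = (q @ Uk) / sk                     # fold the query into the reduced space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print([round(cos(q_k, d), 3) for d in doc_vectors])
# The second document (only "automobile" and "engine", no "car") still
# receives a positive score, because LSI exploits term co-occurrence.
```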
Foltz and Dumais (1992) compared four retrieval methods that are based on the vector-space
model.
The four methods were the result of crossing two factors, the first factor being
whether the retrieval method used Latent Semantic Indexing or keyword matching,
and the second factor being whether the profile was based on words or phrases provided
by the user (Word profile), or documents that the user had previously rated as relevant
(Document profile). The LSI match-document profile method proved to be the most
successful of the four methods. This method combines the advantages of both LSI
and the document profile.
The document profile provides a simple, but effective, representation of the user's
interests. Indicating just a few documents that are of interest is as effective
as generating a long list of words and phrases that describe one's interest. Document
profiles have an added advantage over word profiles: users can just indicate documents
they find relevant without having to generate a description of their interests.
5. Linguistic and Knowledge-based Approaches
In the simplest form of automatic text retrieval, users enter a string of keywords
that are used to search the inverted indexes of the document keywords. This approach
retrieves documents based solely on the presence or absence of exact single word
strings as specified by the logical representation of the query.
Clearly this approach will miss many relevant documents because it does not capture
the complete or deep meaning of the user's query. The Smart Boolean approach and
the statistical retrieval approaches, each in their specific way, try to address
this problem. Linguistic and knowledge-based approaches have also been developed
to address this problem by performing a morphological, syntactic and semantic analysis
to retrieve documents more effectively. In a morphological analysis, roots and affixes
are analyzed to determine the part of speech (noun, verb, adjective etc.) of the
words.
Next, complete phrases have to be parsed using some form of syntactic analysis. Finally,
the linguistic methods have to resolve word ambiguities and/or generate relevant
synonyms or quasi-synonyms based on the semantic relationships between words. The
development of a sophisticated linguistic retrieval system is difficult and it requires
complex knowledge bases of semantic information and retrieval heuristics. Hence
these systems often require techniques that are commonly referred to as artificial
intelligence or expert systems techniques.
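As a hedged sketch of the morphological step only, the code below uses a crude, hypothetical suffix stripper (not a real morphological analyser such as the Porter stemmer) to reduce a few inflected forms to a shared stem before indexing.

```python
# Hypothetical, very crude stemmer: strips a few common English suffixes.
SUFFIXES = ("ization", "ations", "ation", "ingly", "ings",
            "ing", "ies", "ied", "es", "ed", "ly", "s")

def crude_stem(word):
    """Strip the longest matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

for w in ["retrieving", "retrieved", "indexes", "indexing"]:
    print(w, "->", crude_stem(w))
# retrieving/retrieved -> retriev, indexes/indexing -> index
```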
6. Some Problems of the Classical Models
Boolean Model
Very rigid:
- AND means all; OR means any.
- Difficult to express complex user requests.
- Difficult to control the number of documents retrieved. All matched documents will
be returned.
- Difficult to rank output. All matched documents logically satisfy the query.
- Difficult to perform relevance feedback. If a document is identified by the user
as relevant or irrelevant, how should the query be modified?
Vector Model
- Missing semantic information (e.g. word sense).
- Missing syntactic information (e.g. phrase structure, word order, proximity information).
- Assumption of term independence (e.g. ignores synonymy).
- Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
Given a two-term query "A B", it may prefer a document containing A frequently but
not B over a document that contains both A and B, but each less frequently (see the sketch below).
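A small hypothetical example of the last point: with raw term-frequency weights and cosine scoring, a document that repeats A many times but never mentions B can outrank a document that contains both query terms.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v))

query = {"A": 1, "B": 1}
doc_1 = {"A": 10}                     # mentions only A, but very often
doc_2 = {"A": 1, "B": 1, "C": 5}      # mentions both A and B once

print(cosine(query, doc_1))   # ~0.71
print(cosine(query, doc_2))   # ~0.27 -> doc_1 wrongly outranks doc_2
```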
Bibliography
1. Безуглый Е.Н., Аноприенко А.Я. Models of Information Retrieval: A Study of the Boolean Model (in Russian). Abstract of a talk at the V International Scientific and Technical Conference of Young Scientists and Students "Informatics and Computer Technologies 2009", Donetsk, DonNTU, 12 May 2009.
2. Manning C. D., Schutze H. Introduction to Information Retrieval. 2008. The ABC of all search engines; covers the essential data structures and algorithms from the very basics. (HTML/PDF)
3. Hájek J., Sidak Z., Sen P. K. Theory of Rank Tests, 2nd Edition. Academic Press, 1999. 425 p. The book contains numerous ranking algorithms. (HTML)
4. Некрестьянов И. Latent Semantic Analysis (in Russian). An introduction to latent semantic analysis. (HTML)
5. Лившиц Ю. Introduction to Search Engines (in Russian). Search engine architecture, algorithms and optimisation. (PDF)
6. Author unknown. The Vector Model of Retrieval (in Russian). An introduction to the vector model of retrieval. (HTML)
7. Воронин Е.А., Бородин О.Н. Application of Probabilistic Models to the Content Analysis of Information Documents (in Russian). Describes a probabilistic model of an algorithm for linguistic processing of the text of an information document. (HTML)
8. Леонтьева Н.Н. Automatic Text Understanding: Systems, Models, Resources (in Russian). Москва: Академия, 2006. Describes the algorithms, structures and methods of computational linguistics. (HTML)
9. Singhal A. Modern Information Retrieval: A Brief Overview. Google, Inc. A survey article on information retrieval; includes statistics from the Google search engine. (PDF)
10. Grossman D. A., Frieder O. Information Retrieval: Algorithms and Heuristics, 2nd Edition. Springer, 2004. 332 p. A book containing a large number of algorithms, including heuristic ones, devoted to information retrieval. (HTML/PDF)