RERANKING METHODS FOR VISUAL SEARCH

This article describes reranking methods for visual search.

Source: Reranking Methods for Visual Search

 

Winston H. Hsu, Lyndon S. Kennedy

Most semantic video search methods use text-keyword queries or example video clips and images. But such methods have limitations. Here the authors introduce two reranking processes for image and video search that automatically reorder results from initial text-only searches based on visual features and content similarity.

The continuing growth of online video data, digital photos, and 24-hour broadcast news has helped make video and image retrieval an active research area. Current semantic video search usually builds on text search by using text associated with video content, such as speech transcripts, closed captions, or video optical character recognition (OCR) text. Adding multiple modalities such as image features, audio, face detection, and high-level concept detection can improve text-based video search systems.1-4 Researchers often achieve this using multiple query example images, applying specific concept detectors, or incorporating highly tuned retrieval models for specific query types. However, because it's difficult to acquire the extra image information or build the models in advance, such approaches are impractical for large-scale applications.5
It's also difficult for users to acquire example images for example-based queries. When users provide image examples, the examples are often few in number and fail to capture the relevant regions of high-dimensional feature space that might contain the true positive videos. Practically, users prefer "search by text" rather than "search by examples."6
To address the problems of example-based video search approaches and avoid the use of specialized models, we conduct semantic video searches using a reranking method that automatically reorders the initial text search results based on visual cues and associated context. We developed two general reranking methods that explore the recurrent visual patterns in many contexts, such as the returned images or video shots from initial text queries, and video stories from multiple channels.

Visual reranking
Assume our initial text search returns n visual documents {d1, d2, ..., dn}. These documents might include Web pages, images, and video stories. The visual reranking process improves search accuracy by reordering the visual documents based on multimodal cues extracted from the initial text search results and any available auxiliary knowledge. The auxiliary knowledge can be features extracted from each visual document or the multimodal similarities between them.
Pseudorelevance feedback (PRF)2,7,9 is an effective tool for improving initial text search results in both text and video retrieval. PRF assumes that a significant fraction of top-ranked documents are relevant and uses them to build a model for reranking the search result set.7 This is in contrast to relevance feedback, in which users explicitly provide feedback by labeling results as positive or negative.
Some researchers have implemented the PRF concept in video retrieval. Chua et al., for example, used the textual information in top-ranking shots to obtain additional keywords to perform retrieval and rerank the baseline shot lists.2 Their experiment improved mean average precision (MAP) from 0.120 to 0.124 in the TRECVID 2004 video search task (see http://www-nlpir.nist.gov/projects/trecvid; MAP is a search performance metric used in TRECVID, a National Institute of Standards and Technology-sponsored conference focused on information retrieval in digital video). Other researchers sampled the pseudonegative images from the lowest rank of the initial query results, used the query videos and images as the positive examples, and formulated retrieval as a classification problem.8 This improved the search performance from MAP 0.105 to 0.112 in TRECVID 2003.
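To make the pseudonegative idea concrete, the following Python sketch (an illustration, not the cited authors' exact configuration) treats the query examples as positives, samples pseudonegatives from the bottom of the initial text ranking, and rescores every shot with a simple classifier; the feature representation, classifier choice, and cutoff are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def prf_rerank(features, text_scores, query_examples, n_neg=50):
    # Classification-style pseudorelevance feedback: query examples act as
    # positives, the lowest-ranked shots of the initial text search act as
    # pseudonegatives, and a classifier rescores the whole return set.
    features = np.asarray(features, dtype=float)
    text_scores = np.asarray(text_scores, dtype=float)
    order = np.argsort(text_scores)                 # ascending initial scores
    pseudo_neg = features[order[:n_neg]]            # assume bottom ranks are irrelevant
    X = np.vstack([query_examples, pseudo_neg])
    y = np.hstack([np.ones(len(query_examples)), np.zeros(len(pseudo_neg))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    relevance = clf.predict_proba(features)[:, 1]   # estimated P(relevant | x)
    return np.argsort(-relevance)                   # reranked indices, best first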

Reranking approaches
Patterns with strong visual similarities often recur across diverse video sources. Such recurrent images or videos frequently appear in image search engines (such as Yahoo or Google) and photo-sharing sites (such as Flickr). In earlier work,10 we analyzed the frequency of such recurrent patterns (in terms of visual duplicates) for cross-language topic tracking (a large percentage of international news videos share common video clips, near duplicates, or high-level semantics). Based on our observations, we propose two general visual search reranking approaches.
We first propose an image search reranking approach called information bottleneck (IB) reranking.5,11 This approach explores the fact that in image search, multiple similar images are often spread across different spots in the top-level pages of the initial text search results.5 IB reranking revises the search relevance scores to favor images that occur multiple times with high visual similarity and high initial text retrieval scores. Visual similarity is typically based on matching image features extracted from image frames in video shots. Thus, we can easily apply it to rerank search results at the image or video shot levels.
The preferred search level is sometimes higher than the image or shot level, that is, a semantic level such as news stories or scenes in films. In such cases, feature-based similarity between visual documents isn't well defined. To address this issue, we propose contextual links. For example, two stories share the same or similar contexts if they're related to the same topic and thus share some common or near-duplicate visual footage. Our context-reranking approach explores such contextual similarity among high-level video stories using a random walk framework.12
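As a rough illustration of that random walk framework, the sketch below runs a personalized PageRank-style walk over story-level contextual links and restarts from a distribution derived from the initial text scores; the transition construction, damping factor, and convergence test are assumptions for illustration and may differ from the authors' exact formulation.

import numpy as np

def context_rerank(link_weights, text_scores, alpha=0.85, n_iter=100, tol=1e-8):
    # link_weights[i, j]: contextual similarity between stories i and j
    # (for example, the amount of shared near-duplicate footage).
    # text_scores: nonnegative relevance scores from the initial text search.
    W = np.asarray(link_weights, dtype=float)
    scores = np.asarray(text_scores, dtype=float)
    n = len(scores)
    # Row-normalize the links into a stochastic transition matrix;
    # stories with no links fall back to a uniform transition.
    P = np.empty_like(W)
    for i in range(n):
        s = W[i].sum()
        P[i] = W[i] / s if s > 0 else np.full(n, 1.0 / n)
    prior = scores / scores.sum()                   # restart distribution from text search
    x = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        x_new = alpha * P.T @ x + (1 - alpha) * prior
        if np.abs(x_new - x).sum() < tol:
            break
        x = x_new
    return np.argsort(-x)                           # stories ordered by stationary score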

IB-reranking: Image/video reranking based on recurrent frequency
In image/video search systems, we're given a query or a statement of information need, and must estimate the relevance R(x) of each image or video shot in the search set, x ∈ X, and order them by relevance scores. Researchers have tested many approaches in recent years, from simply associating video shots with text search scores to fusing multiple modalities. Some approaches rely on user-provided query images as positive examples to train a supervised classifier to approximate the posterior probability P(Y|X), where Y is a random variable representing search relevance.9 They then use the posterior probability for R(x) in visual ranking.
Text search results generally put certain relevant images near the top of the return set. In large image or video databases, in particular, some positive images might share great visual similarity, but receive quite different initial text search scores. For example, consider the top 24 shots returned by a story-level text search for TRECVID 2005 topic 153, "Find shots of Tony Blair." We can see recurrent relevant shots of various appearances among the return set, but they're mixed with some irrelevant shots (for example, anchor or graphic shots).
Thus, instead of user-provided search examples, we consider the search posterior probability, estimated from initial text-based results in an unsupervised fashion, and the recurrent patterns (or feature density) among image features, and use them to rerank the search results. A straightforward implementation of the idea is to fuse both measures (search posterior probability and feature density) in a linear fashion. This fusion approach is commonly used in multimodal video search systems.5 We can formulate the approach as

R(x) = α · p(y|x) + β · p(x),    (1)

where p(y|x) is the search posterior probability, p(x) is the retrieved image's feature density, and α and β are linear fusion scalars.
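Read literally, Equation 1 can be computed as in the sketch below, where p(x) is approximated with a Gaussian kernel density estimate over the returned images' features; the bandwidth and the fusion weights α and β are illustrative assumptions.

import numpy as np

def linear_fusion_scores(features, posterior, alpha=1.0, beta=0.5, bandwidth=1.0):
    # Equation 1: R(x) = alpha * p(y|x) + beta * p(x), with p(x) approximated
    # by a Gaussian kernel density over the return set (untuned bandwidth).
    features = np.asarray(features, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    density = np.exp(-sq / (2.0 * bandwidth ** 2)).mean(axis=1)
    return alpha * posterior + beta * density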
Equation 1 incurs two main problems, which our prior experiment confirms.5 The posterior probability p(y|x), estimated from the text query results and (soft) pseudolabeling strategies, is noisy; we need a denoised representation for the posterior probability. In addition, the feature density estimation p(x) in Equation 1 might be problematic because recurrent patterns that are irrelevant to the search (for example, anchors, commercials, and crowd scenes) can be frequent. Instead, we should consider only those recurrent patterns within buckets (or clusters) of higher relevance. To exploit both search relevance and recurrent patterns, we represent the search relevance score R(x) as

R(x) = α · p(y|c) + β · p(x|c),    (2)

where p(y|c) is a denoised posterior probability smoothed over a relevance-consistent cluster c, which covers image x; and p(x|c) is the local feature density estimated at feature x. The cluster-denoising process has been effective in text search.13 Meanwhile, we use the local feature density p(x|c) to favor images occurring multiple times with high visual similarity. The choice of parameter α or β will affect the reranked results. In our preliminary experiments, the denoised posterior probability p(y|c) was more effective and played a more important role in search relevance than the pattern recurrence within the same relevant clusters. Accordingly, an intuitive approach is to let α be significantly larger than β so that the reranking process first orders clusters at a coarse level and then refines the image order in each cluster according to local feature density.
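A direct reading of letting α be much larger than β is the two-stage ordering sketched below: clusters are sorted by their denoised posterior p(y|c), and images within each cluster are sorted by their local density p(x|c). The inputs are assumed to be precomputed, for example by the clustering step described next.

def rerank_by_cluster_then_density(cluster_ids, cluster_posterior, local_density):
    # cluster_ids[i]       : cluster label of image i
    # cluster_posterior[c] : denoised posterior p(y|c) for cluster c
    # local_density[i]     : local feature density p(x|c) at image i
    indices = range(len(cluster_ids))
    return sorted(indices,
                  key=lambda i: (-cluster_posterior[cluster_ids[i]], -local_density[i]))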
Two main issues arise in this proposed approach:

- How can we find the relevance-consistent clusters, in an unsupervised fashion, from noisy text search results and high-dimensional visual features?
- How can we use the recurrent patterns across video sources?

To address the first problem, we adopt the IB principle,5,11 which finds the optimal clustering of images that preserves the maximal mutual information about the search relevance. We iteratively update the denoised posterior probabilities p(y|c) during the clustering process. We then estimate the feature densities p(x|c) from each cluster c accordingly.
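The sketch below mimics that pipeline with stand-in components: k-means plays the role of the IB clustering (the actual method optimizes the mutual information preserved about search relevance, which k-means does not), noisy per-image posteriors are smoothed into per-cluster averages for p(y|c), and a within-cluster Gaussian kernel density supplies p(x|c). All settings are illustrative assumptions; the outputs plug directly into the two-stage ordering sketched earlier.

import numpy as np
from sklearn.cluster import KMeans

def cluster_and_smooth(features, posterior, n_clusters=4, bandwidth=1.0):
    # Simplified stand-in for the IB clustering step: group images, smooth the
    # noisy posterior p(y|x) into a per-cluster p(y|c), and estimate a
    # within-cluster kernel density p(x|c) for every image.
    features = np.asarray(features, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    cluster_posterior = {c: posterior[labels == c].mean() for c in range(n_clusters)}
    local_density = np.zeros(len(features))
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sq = ((features[idx, None, :] - features[None, idx, :]) ** 2).sum(axis=-1)
        local_density[idx] = np.exp(-sq / (2.0 * bandwidth ** 2)).mean(axis=1)
    return labels, cluster_posterior, local_density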

For this example, the reranking method discovers four relevance-consistent clusters automatically. Images in the same cluster have the same denoised posterior probability p(y|c) but might contain recurrent patterns with different appearances; one cluster, for instance, has three regions with high density in the feature space. We first rank the image clusters by posterior probability p(y|c) and then order within-cluster images by the local feature density p(x|c). In short, the visually consistent images that occur most frequently within higher-relevance clusters will receive higher ranks. Note that the visual features we adopted are grid color moments and Gabor texture. More technical details are available elsewhere.
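For completeness, here is one common way to compute grid color moments of the kind mentioned above; the grid size, the color space, and the use of the cube root of the third central moment as the skewness term are assumptions, since the article doesn't spell out those details, and the Gabor texture features are not sketched here.

import numpy as np

def grid_color_moments(image, grid=(5, 5)):
    # image: H x W x 3 float array. For each cell of a spatial grid, compute
    # the first three color moments (mean, standard deviation, and the cube
    # root of the third central moment) per channel and concatenate them.
    h, w, _ = image.shape
    rows, cols = grid
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = image[r * h // rows:(r + 1) * h // rows,
                         c * w // cols:(c + 1) * w // cols].reshape(-1, 3)
            mean = cell.mean(axis=0)
            std = cell.std(axis=0)
            skew = np.cbrt(((cell - mean) ** 3).mean(axis=0))
            feats.append(np.concatenate([mean, std, skew]))
    return np.concatenate(feats)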

