Joo-Hwee Lim and Qi Tian
Institute for Infocomm Research, Singapore
Philippe Mulhem
Image Processing and Applications Laboratory, National Center of Scientific Research, France
Rapid advances in sensor, storage, processor, and communication technologies let consumers store large digital photo collections. Consumers need effective tools to organize and access photos in a semantically meaningful way. We address the semantic gap between feature-based indexes computed automatically and human query and retrieval preferences.
Because digital cameras are so easy to use, consumers tend to take and accumulate more and more digital photos. Hence they need effective and efficient tools to organize and access photos in a semantically meaningful way without too much manual annotation effort. We define semantically meaningful as the ability to index and search photos based on the purposes and contexts of taking the photos.
From a user study1 and a user survey that we conducted, we confirmed that users prefer to organize and access photos along semantic axes such as the event (for example, a birthday party, swimming pool trip, or park excursion), people (for example, myself, my son, Mary), time (for example, last month, this year, 1995), and place (for example, home, Disneyland, New York). However, users are reluctant to annotate all their photos manually as the process is too tedious and time consuming.
As a matter of fact, content-based image retrieval (CBIR) research in the last decade2 has focused on general CBIR approaches (for example, Corel images). As a consequence, key efforts have concentrated on using low-level features such as color, texture, and shapes to describe and compare image contents. CBIR has yet to bridge the semantic gap between feature-based indexes computed automatically and human query and retrieval preferences.
We address this semantic gap by focusing on the notion of event in home photos. In the case of people identification in home photos, we can tap the research results from the face recognition literature. We recognize that general face recognition in still images is a difficult problem when dealing with small faces (20 x 20 pixels or less), varying poses, lighting conditions, and so on. However, in most circumstances, consumers are only interested in a limited number of faces (such as family members, relatives, and friends) in their home photos, so we might achieve a more satisfactory face recognition performance for home photos.
With advances in digital cameras, we can easily recover the time stamps of photo creation. Industrial players are looking into the standardization of the file format that contains this information, for example, the Exchangeable Image File Format version 2.2 (http://www.jeita.or.jp/english/index.htm). Similarly, with the advances in Global Positioning System technology, the camera can provide the location where a photo was taken (for example, the Kodak Digital Science 420 GPS camera).
Related Works
Text annotation by human users is tedious, inconsistent, and erroneous due to the huge number of possible keywords. An efficient annotation system needs to limit the possible keyword choices from which a user can select. The MiAlbum system described in Liu, Sun, and Zhang1 performs automatic annotation in the context of relevance feedback.2 Essentially, the text keywords in a query are assigned to positive feedback examples (that is, retrieved images that the user who issues the query considers relevant). This would require constant user intervention (in the form of relevance feedback), and the keywords issued in a query might not necessarily correspond to what is considered relevant in the positive examples.
The works on image categorization3,4 are also related to our work in that they attempt to assign semantic class labels to images. However, none of these approaches has made a complete taxonomy for home photos. In particular, the attempts at photo classification are: indoor versus outdoor,3 natural versus man made,3,4 and categories of natural scenes.4 Furthermore, the classifications were based on low-level features such as color, edge directions, and so on. In our article, the notion of events is meant to be more complex than visual classes, though we only demonstrate our approach on visual events. Also, we focus on relevance measures of unlabeled photos to events to support event-based retrieval rather than class memberships for classification purposes.
Home photo event taxonomy
We define home photos as typical digital photos taken by average consumers to record their lives as digital memory, as opposed to those taken by professional photographers for commercial purposes (for example, stock photos like the Corel collection and others; see http://www.fotosearch.com). At one typical Web site where consumers upload and share their home photos, users apparently prefer occasions or activities as a broad descriptor for photos over other characteristics (like objects present in the photo or the location at which a photo was taken). In particular, the site's classification directory contains many more photos under the category "Family and Friends" (more than 9 million) than under all the other categories combined (around 1 million). Furthermore, categories such as "Scenery and Nature," "Sports," "Travel," and so on are the outcome of activities. Although Vailaya et al.3 have presented a hierarchy of eight categories for vacation photos, these categories are skewed toward scenery classification. Hence an event-based taxonomy is what consumers need. The "Related Works" sidebar discusses other approaches.
We suggest a typical event taxonomy for home photos. Because home photo collections are highly personal contents, we view our proposed event taxonomy as a classification basis open to individual customization. Our proposed computational learning framework for event models facilitates the personalization of the event taxonomy nodes. In broad terms, a typical event could be a gathering, family activity, or visit to some place during a holiday. These correspond to the purposes of meeting with someone, performing some activity, and going to some place respectively.
A gathering event could be in the form of parties for occasions such as birthdays or weddings, or simply having meals together. A family activities event would involve family members. We keep our classification simple and general by dividing this event type into indoor and outdoor family activities. Examples of indoor activities include kids playing, dancing, chatting, and so on; outdoor activities include sports, kids at a playground, picnics, and so on.
The third major type of event is visiting places. These events could be people centric or not. People centric refers to the case when a photo has family members as its focus. In non-people-centric photos, family members are either absent or not clearly visible. We divide this latter case into photos of natural (nature) and urban (man-made) scenes. A nature event photo is one taken at, for example, a mountainous area, riverside or lakeside (waterside), beach, or park (also garden, field, forest, and so on). For the man-made event, we include photos taken at a swimming pool, roadside (or street), or inside a structure.
The notion of event usually encompasses four aspects:
- who takes part in the event (for example, John and his wife),
- what occasion or activity is involved (for example, my birthday),
- where the event takes place (for example, my house), and
- when the event takes place (for example, last month).
Using visual content alone (that is, without using information from people, time, and place) wouldn't let us model all these four aspects of an event. For example, it's not feasible to further differentiate breakfast, lunch, and dinner for the meals event if we don't use the time information. In this article, we approximate event by visual event, defined as an event based on the visual content of photos (that is, the "what" aspect).
Modeling visual events
For each event $E_i$, we assume that there is an associated computational model $M_i$ that lets us compute the relevance measure $R(M_i, x) \in [0, 1]$ of a photo $x$ to $E_i$. To preserve a high level of semantics when modeling the events in our proposed event taxonomy, we require an expressive knowledge representation of image content. To minimize the effort of manual annotation and to allow personalization of event models $M_i$ for a user's photo collection (for example, a modern city street versus a rural roadside, an Asian wedding versus a Western wedding), we propose a computational learning approach that constructs event models $M_i$ from a small set of photos $L$ labeled by the event $E_i$ and computes the relevance measures of the other, unlabeled photos $U$ to the event models. In this article, we adopt a vocabulary-based indexing methodology called visual keywords4 to automatically extract relevant semantic tokens, a conceptual graph formalism5 to model the visual events, and instance-based and graph-generalization representations to learn the event models. So, an event model $M_i$ has two facets: the event model according to the visual keywords representation, namely $Mv_i$, and the event model according to conceptual graphs, namely $Mg_i$.
Visual keywords indexing
The visual keywords approach4 is a new attempt to achieve content-based image indexing and retrieval beyond feature-based (for example, QBIC6) and region-based (for example, Blobworld7) approaches. Visual keywords are intuitive and flexible visual prototypes, with relevant semantic labels, that are extracted or learned from a visual content domain.
In this article, we have designed a visual vocabulary for home photos. There are eight classes of visual keywords, each subdivided into two to five subclasses. Hence there are 26 distinct labels in total. We used a three-layer feed-forward neural network with dynamic node creation capabilities to learn these 26 visual keywords subclasses from 375 labeled image patches cropped from home photos. Color and texture features4 are computed for each training region as an input vector for the neural network.
Once the neural network has learned a visual vocabulary, our approach subjects an image to be indexed to multiscale, view-based recognition against the 26 visual keywords. It reconciles the recognition results across multiple resolutions and aggregates them according to a configurable spatial tessellation. For example, the swimming pool image is indexed as five visual keyword histograms, one each for the left, right, top, bottom, and center areas. In essence, an image area is represented as a histogram of visual keywords based on local recognition certainties. For instance, the visual keyword histogram for the center area in the swimming pool image has a 0.36 value for subclass water: pool, and small values for the rest of the bins. This spatial configuration is appropriate for home photos because the center area is usually the image's focus, and hence we can assign a higher weight to it during similarity matching. Other spatial configurations include uniform grids (for example, 4 x 4) of equal weights, and so on.
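The aggregation into five area histograms can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the recognition stage has already produced an array of per-location certainties for each visual keyword, that each area histogram is the mean certainty over the area, and that the center area covers the middle half of each image dimension. The function name `area_histograms` and the parameter `center_frac` are hypothetical.

```python
import numpy as np

def area_histograms(certainty, center_frac=0.5):
    """Aggregate per-location keyword certainties into five area histograms.

    certainty: array of shape (H, W, K), one certainty per location and per
    visual keyword (assumed to be the output of the recognition stage).
    center_frac: assumed fraction of each dimension covered by the center area.
    Returns a dict mapping area name to a length-K histogram.
    """
    h, w, _ = certainty.shape
    dh = int(h * (1 - center_frac) / 2)  # height of the top and bottom bands
    dw = int(w * (1 - center_frac) / 2)  # width of the left and right bands
    areas = {
        "top": certainty[:dh, :, :],
        "bottom": certainty[h - dh:, :, :],
        "left": certainty[dh:h - dh, :dw, :],
        "right": certainty[dh:h - dh, w - dw:, :],
        "center": certainty[dh:h - dh, dw:w - dw, :],
    }
    # Mean certainty per keyword within each area (an assumed aggregation rule).
    return {name: block.reshape(-1, block.shape[-1]).mean(axis=0)
            for name, block in areas.items()}
```

With 26 visual keywords, each photo is then indexed by five 26-bin histograms. The sketch assumes the image is large enough that none of the five areas is empty.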
We can compute similarity matching between two images as the weighted average of the similarities (for example, histogram intersection) between the images' corresponding local visual keyword histograms. When comparing a single image with index $x^{vk}$ against a group of images with indexes $V = \{v_j\}$, we compute a similarity matching score $S_v(V, x^{vk})$ between the image and the group (Equation 1).
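The image-to-image step can be sketched as below. This is only an illustration under stated assumptions: the area weights (center counted twice) are hypothetical, and histogram intersection is normalized by the smaller histogram mass, which is one common variant. The sketch does not cover how the score against a group of images is formed.

```python
import numpy as np

# Hypothetical area weights; the text only says the center area may weigh more.
AREA_WEIGHTS = {"left": 1.0, "right": 1.0, "top": 1.0, "bottom": 1.0, "center": 2.0}

def histogram_intersection(p, q):
    """Sum of bin-wise minima, normalized by the smaller total mass."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    denom = min(p.sum(), q.sum())
    return float(np.minimum(p, q).sum() / denom) if denom > 0 else 0.0

def image_similarity(a, b, weights=AREA_WEIGHTS):
    """Weighted average of the per-area histogram intersections.

    a, b: dicts mapping area name to a visual keyword histogram.
    """
    total = sum(weights.values())
    return sum(w * histogram_intersection(a[k], b[k])
               for k, w in weights.items()) / total
```

Two photos with identical, non-empty area histograms score 1.0; a disagreement in the center area lowers the score more than the same disagreement in a border area.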
Although the Blobworld approach also performs similarity matching based on local regions, the visual keywords approach does not perform image segmentation, and it associates semantic labels to the local image regions. Mohan and colleagues have proposed an interesting approach to recognize objects by their components and applied it to people detection based on adaptive combination of classifiers.8 However, many objects in home photos do not have a well-defined part-whole structure. Instead, we adopt a graph-theoretic representation to allow hierarchies of concepts and relations. During the indexing process, we also generate fuzzy labels to facilitate object segmentation with a clustering algorithm in the visual keyword space. At the graph-based representation level, the most probable label is kept, and inference allows keeping specificities of relationships like symmetry and transitivity.
Visual event graph
We chose the knowledge representation formalism called conceptual graphs as a framework to handle concepts, concept hierarchies, and relation hierarchies.5 Conceptual graphs are finite, bipartite, directed graphs composed of concept and relation nodes. Concept nodes are composed of a concept type and a referent. A generic referent denotes the existence of a referent, while an individual referent refers to one instance of the concept type. In our case, the concept type set includes the objects of the real world present in the photos, and they're extracted as visual keywords. We organized concept types in a lattice that reflects generalization/specialization relationships. The relationship set includes absolute spatial (position of the center of gravity), relative spatial, and structural relationships.
The weighting scheme considers only media-dependent weights, and its inputs are the certainties of the recognition of concepts. Each concept $c$ carries two values:
- the weight of the concept $w$, representing the importance of the concept in the photo (defined as the relative size of its region), and
- the certainty of recognition $ce$ of the concept $c$ (we use the certainty values that come from a labeling process).
We then represent a concept as [type: referent | w | ce]. For example, consider an image composed of two objects: a foliage region with an importance of 0.34 and a certainty of recognition of 0.59, and a water-pool region with an importance of 0.68 and a certainty of recognition of 0.32.
An event model graph has the same components as the image graphs, except that the concepts corresponding to the image objects might be complex. That is, they might contain more than one concept type, each associated with a value indicating the concept type's relevance for the model, and the relationships might also have a relevance value for the model. Consider a part of an event graph composed from two images chosen by a user to describe a swimming pool: one composed of a foliage region and a water-pool region, and another composed of a building region and a water pool that touches the building. This event graph stores the fact that water pool regions appear in all the selected images, and the complex concept related to the other concept denotes that 50 percent of the selected images contain foliage (that is, the relevance rel of this concept type is 0.5) and 50 percent contain a building. Then, using the concept type hierarchy, 50 percent of the selected images contain a natural object, 50 percent contain a man-made object, and both images (that is, a relevance rel of 1.0) contain a general object. The relevance values for the relations on the arcs follow the same principle.
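The relevance values in this swimming pool example can be reproduced with a short sketch. It assumes a small fragment of the concept type hierarchy (foliage under natural object, building under man-made object, both under object) and defines the relevance of a concept type as the fraction of selected images whose region is of that type or of one of its specializations. The names `PARENT`, `generalizations`, and `type_relevance` are hypothetical.

```python
# Assumed fragment of the concept type hierarchy (child -> parent).
PARENT = {
    "foliage": "natural object",
    "building": "man-made object",
    "natural object": "object",
    "man-made object": "object",
}

def generalizations(concept_type):
    """Return the concept type followed by all of its ancestors."""
    chain = [concept_type]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def type_relevance(types_per_image):
    """Relevance of each concept type for one complex concept.

    types_per_image: the concept type observed in each selected image for
    the region that the complex concept stands for.
    """
    n = len(types_per_image)
    counts = {}
    for concept_type in types_per_image:
        for g in generalizations(concept_type):
            counts[g] = counts.get(g, 0) + 1
    return {t: c / n for t, c in counts.items()}

# The two user-selected swimming pool images: foliage in one, building in the other.
rel = type_relevance(["foliage", "building"])
```

Here `rel` gives 0.5 for foliage, building, natural object, and man-made object, and 1.0 for the general object type, matching the values described in the text.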
The matching between a graph corresponding to an event model $Mg_i$ and an image graph $x^{cg}$ must incorporate elements coming from the concepts (that is, the elements present in the images) as well as the relationships between these elements, represented in our case by arcs. So, we compute the matching value between the model graph $Mg_i$ and the document graph $x^{cg}$ according to the weights of the matching arcs $a_d$ (an arc is a component of a graph of the form $[type_{d1}: referent_{d1} \mid w_{d1} \mid ce_{d1}] \rightarrow (Relation_d) \rightarrow [type_{d2}: referent_{d2} \mid w_{d2} \mid ce_{d2}]$) and the weights of the matching concepts $c_d$ of $x^{cg}$. The relevance status value for an event model graph $Mg_i$ and a document graph $x^{cg}$ is computed over $\pi_{Mg_i}(x^{cg})$, the set of possible projections of the event model graph into the image graph.
The matching function $match_c$ of an image concept $c_d = [type_d: referent_d \mid w_d \mid ce_d]$ and a corresponding complex concept $c_p$ of the projection $g_p$ of an event model graph is based on the certainty of recognition of the concept, the weight of the concept, and the relevance value computed from the images used to define an event model. The value given by $match_c$ is proportional to the certainty of recognition, to the weight of the concept, and to the relevance of the best concept of the event model.
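One way to read this proportionality is as a plain product, sketched below. This is an assumption, not the article's definition: the proportionality constant is taken as 1, and the "best concept" is taken to be the concept type of the complex concept with the highest relevance among those equal to or more general than the image concept's type. The hierarchy fragment and the names `PARENT`, `ancestors_or_self`, and `match_concept` are hypothetical.

```python
# Assumed fragment of the concept type hierarchy (child -> parent).
PARENT = {
    "foliage": "natural object",
    "building": "man-made object",
    "natural object": "object",
    "man-made object": "object",
}

def ancestors_or_self(concept_type, parent=PARENT):
    """Return the concept type followed by all of its ancestors."""
    chain = [concept_type]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def match_concept(image_type, weight, certainty, complex_concept, parent=PARENT):
    """Product of certainty, weight, and the best compatible relevance.

    complex_concept: dict mapping concept type to its relevance in the
    event model. Returns 0.0 when no type of the complex concept is
    compatible with the image concept.
    """
    compatible = ancestors_or_self(image_type, parent)
    best = max((complex_concept[t] for t in compatible if t in complex_concept),
               default=0.0)
    return certainty * weight * best
```

For the water-pool region of the example image (weight 0.68, certainty 0.32) matched against a model concept in which water pool has relevance 1.0, this product is 0.68 x 0.32 x 1.0, or about 0.218.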
Visual event learning
In this article, to learn a model $M_i$ for an event $E_i$, we have two levels of representation, namely local visual keyword histograms and a graph-based abstraction. As presented previously, we adopt an instance-based approach to construct a visual keywords representation $Mv_i$ for $E_i$ as the set of visual keyword indexes $\{v_j\}$ of the labeled photos $L$. Then, given an unlabeled photo $x$, we compute the similarity matching score $S_v(Mv_i, x^{vk})$ as in Equation 1. Using the conceptual graph formalism, we compute the conceptual graph representation $Mg_i$ of an event $E_i$ from the generalization of a given set of labeled photos $L$, as described in the previous section. Then, given an unlabeled photo $x$, we can compute the similarity matching score $S_g(Mg_i, x^{cg})$. Finally, we combine these relevance measures to obtain an overall relevance measure
$$R(M_i, x) = \lambda \cdot S_v(Mv_i, x^{vk}) + (1 - \lambda) \cdot S_g(Mg_i, x^{cg}) \qquad (2)$$
where we might determine the $\lambda$ parameter a priori or with empirical tuning.
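The combination in Equation 2 is a convex mix of the two matching scores, which can be sketched as follows. The function names and the weighting parameter name `lam` are hypothetical, and the two scores are assumed to be already computed and to lie in [0, 1].

```python
def overall_relevance(s_v, s_g, lam=0.5):
    """Equation 2: weighted combination of the visual keyword score s_v
    and the conceptual graph score s_g, with weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * s_v + (1.0 - lam) * s_g

def rank_photos(scores, lam=0.5):
    """Sort photo ids by decreasing overall relevance.

    scores: dict mapping photo id to a (s_v, s_g) pair for one event model.
    """
    return sorted(scores,
                  key=lambda pid: overall_relevance(*scores[pid], lam=lam),
                  reverse=True)
```

Setting `lam` to 1 ranks by the visual keyword score alone and setting it to 0 ranks by the graph score alone, so empirical tuning amounts to sweeping `lam` between these two extremes and keeping the value with the best retrieval results.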
Experimental evaluation
To evaluate the effectiveness of event-based retrieval of home photos, we conducted event model learning and event-based query experiments using 2,400 heterogeneous home photos collected over a five-year period with both indoor and outdoor settings. The images are those of the smallest resolution (that is, 256 x 384) from Kodak PhotoCDs, in both portrait and landscape layouts. After removing possibly noisy marginal pixels, the images have a resolution of 240 x 360. We did not remove bad-quality photos from our test collection, so that it reflects the true complexity of the original real data.
For our experimental evaluation, we selected four events from our proposed home photo event taxonomy. They are parks, swimming pools, waterside, and wedding.
Event-based learning and query
For each of these events, a user constructed the list of photos considered relevant to the event from the 2,400 photos. The sizes of these ground truth lists are 306 for parks, 52 for swimming pools, 114 for waterside, and 241 for wedding.
To learn an event, we decided to use a training set of only 10 labeled photos to simulate practical situations (that is, a user only needs to label 10 photos for each event). To ensure unbiased training samples, we generated 10 different training sets from the ground truth list for each event based on uniform random distribution. The learning and retrieval of each event were thus performed 10 times and the respective results are averages over these 10 runs. Note that for each of these runs, we removed the photos used as training from the ground truths when computing the precision and recall values.
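The sampling protocol described above can be sketched as follows. The function name `make_runs` and the fixed seed are assumptions added for reproducibility; the article only specifies 10 runs of 10 uniformly sampled training photos per event, with the training photos excluded from the ground truth used for scoring.

```python
import random

def make_runs(ground_truth, n_runs=10, n_train=10, seed=0):
    """Build (training set, evaluation ground truth) pairs for one event.

    ground_truth: ids of all photos judged relevant to the event.
    Each run draws n_train training photos uniformly at random and removes
    them from the ground truth used to compute precision and recall.
    """
    rng = random.Random(seed)  # seed is an assumption, for repeatable runs
    pool = sorted(ground_truth)
    runs = []
    for _ in range(n_runs):
        train = set(rng.sample(pool, n_train))
        runs.append((train, set(pool) - train))
    return runs
```

For the swimming pool event, with 52 relevant photos, each run trains on 10 photos and evaluates against the remaining 42; reported figures are then averaged over the 10 runs.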
To query based on event, a user issues or selects one of the four event labels. Based on the learned event models, we computed the relevance measures using Equation 2.
Comparison and analysis
We compare three methods on event-based retrieval. The first method, which we denote as hue saturation value (HSV)-global, indexed photos as global histograms of 11 key colors (red, green, blue, black, gray, white, orange, yellow, brown, pink, purple) in the HSV color space, as adopted by the original PicHunter system.9
The second method (denoted as HSV-grid) uses the same color histogram as HSV-global but computes it for each cell of a 4 x 4 grid on the photos.
The third method, visual event retrieval (VER), implemented the approach proposed in this article. In particular, the visual keywords learned are characterized using color and texture features.4 We used five local visual keyword histograms (left, right, top, bottom, and center blocks with center block having a higher weight). We computed the conceptual graph representation as compound least common generalization from training photos of an event. The relevance measure was based on Equation 2 with $\lambda = 0.5$, which produced the overall best result among all $\lambda$ values (between 0 and 1 at 0.05 intervals) that we experimented with.
Table 1 lists the average precisions (over 10 runs) of retrieval for each event and for all events using different methods. Table 2 shows the average precisions (over 10 runs) among the top 20 and 30 retrieved photos for each event and for all events using the methods compared.
From our experimental results in Table 1, we observe that the overall results of our proposed VER method are encouraging: the VER approach performed much better than the compared methods, and the same can be said for the individual events. This could be attributed to the integration of orthogonal representations used in our VER method (that is, instance based and abstraction based). In particular, the overall average precision is 61 percent and 37 percent better than that of HSV-global and HSV-grid respectively.
Furthermore, for practical needs, the overall average precisions of the top 20 and 30 retrieved photos were improved by 52 percent and 48 percent respectively when compared with the HSV-global method, and by 37 percent and 35 percent respectively when compared with the HSV-grid method (see Table 2). In concrete terms, the VER approach, on average, retrieved 13 and 19 relevant photos among the top 20 and 30 photos for any event query.
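The two kinds of figures reported here can be computed as in the sketch below. The article does not state which definition of average precision it uses; the non-interpolated form (mean of the precision values at the rank of each relevant photo) is assumed here, and the function names are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved photos that are relevant."""
    return sum(1 for pid in ranked[:k] if pid in relevant) / k

def average_precision(ranked, relevant):
    """Non-interpolated average precision over a ranked list (assumed form)."""
    hits = 0
    total = 0.0
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant photo's rank
    return total / len(relevant) if relevant else 0.0
```

In these terms, retrieving 13 relevant photos among the top 20 corresponds to a precision of 13/20 = 0.65, and 19 among the top 30 to 19/30, or about 0.63.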
From Tables 1 and 2, we notice that the waterside event is less accurately recognized by each of the approaches considered. This comes from the fact that the waterside images are more varied than other kinds of images considered, negatively impacting the learning quality.
Future work
We presented our event taxonomy for home photo content modeling and a computational learning framework to personalize event models from labeled sample photos and to compute relevance measures of unlabeled photos to the event models. These event models are built automatically from learning and indexing using a predefined visual vocabulary with conceptual graph representation. The strength of our event models comes from the intermediate semantic representations of visual keywords and conceptual graphs that are abstractions of low-level feature-based representations to bridge the semantic gap.
In actual deployment, a user annotates a small set of photos with event labels selected from a limited event vocabulary during photo import (from digital cameras to hard disk) or photo upload (to online sharing Web sites). Our approach would then build the event models from the labeled photos, upon semantics extracted from the photos annotated with events.
In the near future, we'll experiment with more events and with other indexing cues such as people identification and time stamps. We'll also explore other computational models and learning algorithms for personalized event modeling to increase the quality of learning on events that contain a lot of variability.
Readers may contact Joo-Hwee Lim at the Inst. for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613; joohwee@i2r.a-star.edu.sg.
References
[1] E. Chang, K.-T. Cheng, W.-C. Lai, C.-T. Wu, C.-W. Chang, and Y.-L. Wu. PBIR - a system that learns subjective image query concepts. Proceedings of ACM Multimedia, http://www.mmdb.ece.ucsb.edu/~demo/corelacm/, pages 611-614, October 2001.
[2] E. Chang and B. Li. Mega — the maximizing expected generalization algorithm for learning complex query concepts (extended version). Technical Report http://www-db.stanford.edu/~echang/mega-extended.pdf, November 2000.
[3] A. Gersho and R. Gray. Vector Quantization and Signal Compression. Kluwer Academic, 1991.
[4] B. Li, E. Chang, and C.-S. Li. Learning image query concepts via intelligent sampling. Proceedings of IEEE Multimedia and Expo, August 2001.
[5] Y. Rui, T. S. Huang, and S.-F. Chang. Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, March 1999.
[6] S. Tong and E. Chang. Support vector machine active learning for image retrieval. Proceedings of ACM International Conference on Multimedia, pages 107-118, October 2001.