Face Detection Using Multimodal Density Models

M.-H. Yang, N. Ahuja, and D. Kriegman

Computer Vision and Image Understanding (CVIU), vol. 84, no. 2, pp. 264-284, 2001







Images of human faces are central to intelligent human-computer interaction. Many current research topics involve face images, including face recognition, face tracking, pose estimation, facial expression recognition, and gesture recognition. However, most existing solutions assume that human faces in an image or an image sequence have already been identified and localized. To build fully automated systems that extract information from images with human faces, it is essential to develop robust and efficient algorithms to detect faces.

Given a single image or a sequence of images, the goal of face detection is to identify and locate all of the human faces regardless of their positions, scales, orientations, poses, expressions, occlusions, and lighting conditions. This is a challenging problem because faces are nonrigid objects with a high degree of variability in size, shape, color, texture, facial hair, jewelry, makeup, and glasses. Most recent face detection methods can only detect upright, frontal faces under certain lighting conditions. Since the images of a human face lie in a complex subset of the image space that is unlikely to be modeled by a single linear subspace or characterized by a unimodal probability density function, we use multimodal density models to estimate the distribution of face and nonface patterns. Although some methods [19, 35] have applied mixture models for face detection, these use principal component analysis (PCA) for projection, which does not find the optimal subspace maximizing class separation.

Statistical pattern recognition approaches for face detection generally fall into two major categories, generative or discriminative methods, depending on the estimation criteria used for adjusting the model parameters and/or structure. Generative approaches such as Markov random fields (MRF) [24], the naive Bayes classifier [32], hidden Markov models (HMM) [25], and higher-order statistics [25] rely on estimating a probability distribution over examples using maximum likelihood (ML) or maximum a posteriori (MAP) methods, whereas discriminative methods such as neural networks [28, 35], support vector machines (SVM) [21], and SNoW [39] aim to find a decision surface between face and nonface patterns.1 Discriminative methods require both positive (face) and negative (nonface) samples to find a decision boundary.

Nevertheless, studies in cognitive psychology have suggested that humans learn to recognize objects (e.g., faces) from positive examples alone, without the need for negative examples [17]. Furthermore, while it is relatively easy to gather a representative set of face samples, it is extremely difficult to collect a representative set of nonface samples; the effectiveness of discriminative methods therefore hinges on the effort invested in collecting nonface patterns. On the other hand, generative mixture methods such as a mixture of Gaussians and a mixture of factor analyzers rely on a joint probability distribution over examples, classification labels, and hidden variables (i.e., mixture weights). Although the joint distribution in this approach carries a number of advantages, e.g., in handling incomplete examples, the typical estimation criterion (maximum likelihood or its variants) is nevertheless suboptimal from the classification viewpoint. Furthermore, generative methods usually require data sets larger than those of discriminative methods since most of them involve estimating covariance matrices. Discriminative methods that focus directly on the parametric decision boundary, e.g., SVMs or Fisher's linear discriminant, typically yield better classification results when they are applicable and properly utilized.

In this paper, we aim to investigate the advantages and disadvantages of generative and discriminative approaches to face detection. In the generative approach, we use only positive examples (i.e., face samples) and aim to estimate a probability distribution of face patterns. Furthermore, we use a mixture method to better model the distribution of face patterns. In the discriminative approach, we use Fisher’s linear discriminant to find a decision boundary between face and nonface patterns. We then compare the performance of both methods on several benchmark data sets in order to investigate their pros and cons.

1 Note that it is possible to incorporate generative methods in discriminative methods and vice versa.

The first detection method is an extension of factor analysis. Factor analysis (FA) is a statistical method for modeling the covariance structure of high-dimensional data using a small number of latent variables. FA is analogous to PCA in several aspects. However, PCA, unlike FA, does not define a proper density model for the data since the cost of coding a data point is equal anywhere along the principal component subspace (i.e., the density is unnormalized along these directions). Further, PCA is not robust to independent noise in the features of the data since the principal components maximize the variances of the input data, thereby retaining unwanted variations. Synthetic and real examples in [3, 5, 8, 9] have shown that the projected samples from different classes in the PCA subspace can often be smeared. For cases where the samples have certain structure, PCA is suboptimal from the classification standpoint. Hinton et al. applied FA to digit recognition and compared the performance of PCA and FA models [14]. A mixture model of factor analyzers has recently been extended [11] and applied to face recognition [10]. Both studies show that FA performs better than PCA in digit and face recognition. Since pose, orientation, expression, and lighting condition affect the appearance of a human face, the distribution of faces in the image space can be better represented by a multimodal density model where each modality captures certain characteristics of certain face appearances. We present a probabilistic method that uses a mixture of factor analyzers (MFA) to detect faces with wide variations. The parameters in the mixture model are estimated using the EM algorithm.
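To make the density model concrete, the following is a minimal sketch (not the authors' implementation) of how a mixture of factor analyzers assigns a log-likelihood to a flattened image patch: each component j models the data covariance as Lambda_j Lambda_j' + Psi_j, where Lambda_j is the factor loading matrix and Psi_j a diagonal noise matrix. All function and variable names here are illustrative; in practice the parameters would be fitted with EM rather than supplied by hand.

```python
import numpy as np

def mfa_log_likelihood(x, pis, mus, lambdas, psis):
    """Log-likelihood of a data vector x under a mixture of factor analyzers.

    Component j is a Gaussian with mean mus[j] and low-rank-plus-diagonal
    covariance C_j = lambdas[j] @ lambdas[j].T + diag(psis[j]);
    pis[j] is its mixing weight.
    """
    d = x.shape[0]
    log_terms = []
    for pi_j, mu_j, lam_j, psi_j in zip(pis, mus, lambdas, psis):
        C = lam_j @ lam_j.T + np.diag(psi_j)      # implied covariance
        diff = x - mu_j
        _, logdet = np.linalg.slogdet(C)
        quad = diff @ np.linalg.solve(C, diff)    # Mahalanobis term
        log_n = -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)
        log_terms.append(np.log(pi_j) + log_n)
    # log-sum-exp over components for numerical stability
    m = max(log_terms)
    return m + np.log(sum(np.exp(t - m) for t in log_terms))

# Degenerate check: one component, zero loadings, unit noise -> standard normal
val = mfa_log_likelihood(np.zeros(2), [1.0], [np.zeros(2)],
                         [np.zeros((2, 1))], [np.ones(2)])
```

In an EM fit, the E-step would compute per-component responsibilities from exactly these weighted component log-densities, and the M-step would re-estimate the loadings and noise variances; a patch is then classified as a face when its log-likelihood under the face model exceeds a threshold.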

The second method that we present uses Fisher's linear discriminant (FLD) to project samples from a high-dimensional image space to a lower-dimensional feature space. Recently, the Fisherface method [3] and others [36, 40] based on linear discriminant analysis have been shown to outperform the widely used Eigenface method [37] in face recognition on several data sets, including the Yale face database where face images are taken under varying lighting conditions. One possible explanation is that FLD provides a better projection than PCA for pattern classification since it aims to find the most discriminant projection direction. Consequently, the classification results in the projected subspace may be superior to those of other methods. (See [18] for a discussion about training set size.) In the second proposed method, we decompose the training face and nonface samples into several subclasses using Kohonen's self-organizing map (SOM). From these relabeled samples, the within-class and between-class scatter matrices are computed, thereby generating the optimal projection based on FLD. For each subclass, we use a Gaussian to model its class-conditional density function where the parameters are estimated based on maximum likelihood [8, 9]. To detect faces, each input image is scanned with a rectangular window in which the class-dependent probability is computed. The maximum likelihood decision rule is used to determine whether a face is detected.
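The scatter-matrix step above can be sketched as follows: given samples relabeled into subclasses, compute the within-class scatter S_w and between-class scatter S_b, then take the leading eigenvectors of the generalized eigenproblem S_b w = lambda S_w w as the projection. This is an illustrative sketch of standard FLD, not the paper's code, and the names below are hypothetical.

```python
import numpy as np

def fld_projection(X, labels, n_components):
    """Fisher's linear discriminant projection.

    X: (n_samples, d) data matrix; labels: subclass index per sample.
    Returns a (d, n_components) matrix whose columns maximize
    between-class scatter relative to within-class scatter.
    """
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    s_w = np.zeros((d, d))  # within-class scatter
    s_b = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        s_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all).reshape(-1, 1)
        s_b += len(Xc) * (diff @ diff.T)
    # Solve the generalized eigenproblem via pinv(S_w) @ S_b
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(s_w) @ s_b)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]

# Two synthetic subclasses separated along one axis, noisy along the other:
rng = np.random.default_rng(0)
Xa = rng.normal([0.0, 0.0], [0.5, 2.0], size=(50, 2))
Xb = rng.normal([3.0, 0.0], [0.5, 2.0], size=(50, 2))
W = fld_projection(np.vstack([Xa, Xb]),
                   np.array([0] * 50 + [1] * 50), 1)
za, zb = Xa @ W[:, 0], Xb @ W[:, 0]
```

In the detection stage described above, a scanned window would be projected by W, its class-conditional Gaussian density evaluated for each subclass, and the maximum likelihood rule applied to the face versus nonface subclasses.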

To capture the variations in face patterns, we use a set of 1681 face images from the Olivetti [31], UMIST [12], Harvard [13], Yale [3], and FERET [23] databases. Our experimental results on the data sets used in [28, 35] (which consist of 145 images with 619 faces) show that our methods perform as well as the reported methods in the literature, yet with fewer false detections. To further test our methods, we collected a set of 80 images containing 252 faces. This data set is rather challenging since it contains profile views of faces, faces with a wide variety of expressions, and faces with heavy shadows. Our methods detect most of these faces as well, again with fewer false detections than other methods.

The remainder of this paper is organized as follows. We review previous work on face detection in Section 2. In Section 3, we describe a mixture of factor analyzers and apply it to face detection. We then present the second multimodal density model, using Kohonen's self-organizing map algorithm for clustering and Fisher's linear discriminant for projection, in Section 4. Comprehensive experiments on several benchmark data sets are detailed in Section 5. We also compare the results from our methods with other methods in the literature. Finally, we conclude this paper with comments and future work in Section 6.