Vasileios Mezaris 1,2 , Ioannis Kompatsiaris 2 and Michael G. Strintzis 1,2
1 Aristotle University of Thessaloniki, Thessaloniki, Greece
2 Informatics and Telematics Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Definition: Segmentation is the process of partitioning a piece of information into meaningful elementary parts termed segments .
Considering still images, (spatial) segmentation means partitioning the image to a number of arbitrarily shaped regions, each of them typically being assumed to constitute a meaningful part of the image, i.e. to correspond to one of the objects depicted in it or to a part of one such object. Considering moving images, i.e. video, the term segmentation is used to describe a range of different processes for partitioning the video to meaningful parts at different granularities. Segmentation of video can thus be temporal, aiming to break down the video to scenes or shots, spatial, addressing the problem of independently segmenting each video frame to arbitrarily shaped regions, or spatio-temporal, extending the previous case to the generation of temporal sequences of arbitrarily shaped spatial regions. The term segmentation is also frequently used to describe foreground/background separation in video, which can be seen as a special case of spatio-temporal segmentation. Regardless of the employed decision space, i.e. 1D, 2D or 3D for temporal, spatial and spatio-temporal segmentation, respectively, the application of any segmentation method is often preceded by a simplification step for discarding unnecessary information (e.g. low-pass filtering) and a feature extraction step for modifying or estimating features not readily available in the visual medium (e.g. texture, motion features etc., but also color features in a different color space etc.), as illustrated in for a variety a segmentation algorithms (Figure 1).
Segmentation of images and video is generally an ill-posed problem, i.e. for a given natural image or image sequence, there exists no unique solution to the segmentation problem; the spatial, temporal or spatio-temporal segments that should ideally be formed as a result of segmentation largely depend on the application under consideration and most frequently on the subjective view of each human observer.
Commonly considered applications of segmentation include region-based image and video description, indexing and retrieval, video summarization, interactive region-based annotation schemes, detection of objects that can serve as cues for event recognition, region-based coding, etc. Particularly image and video description, indexing and retrieval has been on the focus of attention of many researchers working on segmentation, since the benefits of introducing segmentation to this application have recently been documented well and significant progress has been made on related topics such as region-based description for indexing, most notably with the introduction of the MPEG-7 Standard.
Most segmentation methods serving all aforementioned applications are generic , i.e. make no restrictive assumptions regarding the semantics of the visual content, such as that the content belongs to a specific domain; however, domain-specific methods for applications like medical image segmentation also exist.
Segmentation methods for 2D images may be divided primarily into region-based and boundary-based methods. Region-based approaches rely on the homogeneity of spatially localized features such as intensity, texture, and position. On the other hand, boundary-based methods use primarily gradient information to locate object boundaries. Hybrid techniques that integrate the results of boundary detection and homogeneity-based clustering (e.g. region growing), as well as techniques exploiting additional information such as structural properties (e.g. inclusion), have also been proposed.
Traditional region-based approaches include region growing and split and merge techniques. Starting from an initial region represented by an arbitrarily chosen single pixel, region growing is the process of adding neighboring pixels to this region by examining their similarity to the ones already added; when no further additions are possible according to the defined similarity criteria, a new region is created and grows accordingly. The opposite of this approach is split and merge. Starting from a single initial region spanning the entire image, region homogeneity is evaluated; if the homogeneity criterion is not satisfied, the region is split according to a pre-defined Page 782 pattern and neighboring regions are subsequently merged, providing this does not violate the homogeneity criterion. The interchange of split and merge steps continues until the latter is satisfied for all regions.
Region-based approaches also include the Recursive Shortest Spanning Tree (RSST) algorithm, which starts from a very fine partitioning of the image and performs merging of neighboring nodes while considering the minimum of a cost function; the latter preserves the homogeneity of the generated regions. In order to avoid a possible premature termination of the merging process, resulting to over-segmentation, in the case that the desired final number of regions is not explicitly defined, the introduction of syntactic visual features to RSST has been proposed . The K-means algorithm, an iterative classification method, has also been used as the basis of several region-based approaches. In, the K-Means-with-Connectivity-Constraint variant of K-means is used to effect segmentation by means of pixel clustering in the combined intensity-texture-position feature space (Fig. 2). Another approach to pixel clustering is based on the Expectation-Maximization (EM) algorithm, which is a method for finding maximum likelihood estimates when there is missing or incomplete data. For the application of EM to segmentation, the cluster membership for each pixel can be seen as such. In, image segmentation is treated as a graph partitioning problem and the normalized cut, a global criterion measuring both the total dissimilarity between the different groups as well as the total similarity within the groups, is employed for segmenting the graph. In contrast to the aforementioned methods, boundary-based methods rely on detecting the discontinuities present in the feature space. The Canny edge detector is a popular such scheme, based on the convolution of the image, over a small window, with the directional derivatives of a Gaussian function. Another approach to boundary detection is anisotropic diffusion, which can be seen as a robust procedure for estimating a piecewise smooth image from a noisy input image. Anisotropic diffusion employs an edge-stopping function that allows the preservation of edges while diffusing the rest of the image.
Mathematical morphology methods, including in particular the watershed algorithm , have also received considerable attention for use in image segmentation. The watershed algorithm determines the minima of the gradients of the image to be segmented, and associates a segment to each minimum. Conventional gradient operators generally produce many local minima, which are caused by noise or quantization errors, and hence, the watershed transformation with a conventional gradient operator usually results in over-segmentation. To alleviate this problem, the use of multiscale morphological gradient operators has been proposed. More recently, the use of the watershed algorithm to generate an initial over-segmentation and the subsequent representation of this result as a graph, to which partitioning via the weighted mean cut criterion is applied, was proposed to combat the over-segmentation effect .
Finally, global energy minimization schemes, also known as snakes or active contour models, involve the evolution of a curve from an initial position toward the boundary of an object in such a way that a properly defined energy functional is minimized. Depending on the definition of the energy functional, the resulting scheme may be Page 783 edge-based, region-based or based on a combination of boundary detection and homogeneity-preserving criteria.
Temporal video segmentation aims to partition the video to elementary image sequences termed scenes and shots . A shot is defined as a set of consecutive frames taken without interruption by a single camera. A scene, on the other hand, is usually defined as the basic story-telling unit of the video, i.e. as a temporal segment that is elementary in terms of semantic content and may consist of one or more shots.
Temporal segmentation to shots is performed by detecting the transition from one shot to the next. Transitions between shots, which are effects generated at the video editing stage, may be abrupt or gradual , the former being detectible by examining two consecutive frames, the latter spanning more than two frames (Fig. 3) and being usually more difficult to detect, depending among others on the actual transition type (e.g. fade, dissolve, wipe, etc.). Temporal segmentation to shots in uncompressed video is often performed by means of pair-wise pixel comparisons between successive or distant frames or by comparing the color histograms corresponding to different frames. Methods for histogram comparison include the comparison of absolute differences between corresponding bins and histogram intersection. Other approaches to temporal segmentation include block-wise comparisons, where the statistics of corresponding blocks in different frames are compared and the number of “changed” blocks is evaluated by means of thresholding, and edge-based methods. The latter involve the detection of edges by application of an edge detector (e.g. Canny) and the subsequent comparison of edges in different frames to calculate values such as the edge change ratio. Other methods, involving the comparison of motion features at different time instances have also been proposed. In, a method for foveated shot detection is proposed. This is based on the generation of patterns of visuomotor behavior (traces) for the sequence and their subsequent use for inferring shot boundary presence.
Other recent efforts on shot detection have focused on avoiding the prior decompression of the video stream, resulting to significant gains in terms of efficiency. Such methods consider mostly MPEG video, but also other compression schemes such as wavelet-based ones. These exploit compression-specific cues such as macroblock-type ratios to detect points in the ID decision space where temporal redundancy, which is inherent in video and greatly exploited by compression schemes, is reduced.
Temporal segmentation to scenes involves grouping of shots to semantically coherent clusters. Several methods have been proposed for this, including clustering algorithms also used to effect homogeneity-based spatial segmentation. One approach is, starting from one shot, to progressively group similar shots with it until no similar shots can be found within a chosen temporal distance; this is a simple application of the previously discussed region-growing algorithm to the ID segmentation case. Another approach, also used for 2D segmentation as well, involves treating shots as nodes of a graph and using the normalized cut to segment the graph, this forming shot clusters corresponding to the scenes . The aforementioned approaches make use of visual features to describe the different shots and to estimate the similarity between any two. The use of audio and linguistic information extracted via speech recognition, although possibly not applicable to any kind of video, has also been proposed to assist the detection of semantically similar shots.
Several approaches have been proposed for spatio-temporal video segmentation (i.e. segmentation in a 3D decision space), both unsupervised, as is almost always the case in ID and 2D decision spaces, and supervised. The latter require human interaction for defining the number of objects present in the sequence, for defining an initial contour of the objects to be tracked or more often for grouping homogeneous regions to semantic objects, while the former require no such interaction. In both types of approaches, however, it is frequently assumed that the video comprises a single shot; this assumption dictates that spatio-temporal segmentation is preceded by temporal segmentation to shots.
Some spatio-temporal segmentation approaches rely on initially applying spatial segmentation methods to each frame independently. Spatio-temporal objects are subsequently formed by associating the spatial regions formed in successive frames using their low-level features. In, this general approach is realized by adopting a seeded region-growing scheme for performing spatial segmentation; the temporal correspondence of spatial regions between adjacent frames is established by examining the overlapping of seeds, which in the case of the selected region growing technique may be quite large, ranging from 1/3 to 2/3 of the final region size.
A different approach is to use motion information to perform motion projection, i.e. to estimate the position of a region at a future frame, based on its current position and its estimated motion features. In this case, a spatial segmentation method need only be applied to the first frame of the sequence, whereas in subsequent frames only refinement of the motion projection result is required. In , a homogeneity-based spatial segmentation method, namely the watershed algorithm, is applied to the first frame of the sequence; then, motion projection is used to estimate a segmentation of the following frame, which is eventually refined by applying a marker-controlled watershed algorithm. Final results are obtained by merging those of the resulting 3D watershed volumes that have similar motion characteristics. A similar in nature approach is followed in, where the need for motion projection is substituted by a Bayes-based approach to color-homogeneous region-tracking using color information. Color- and motion-homogeneous spatio-temporal regions generated by means of spatial segmentation at the first frame of the sequence and tracking at subsequent frames are eventually clustered to different objects using again the region long-term motion trajectories (Figure 4).
Region tracking is an important part of 3D segmentation. This is true both for approaches of the previous category, where tracking of regions generated via 2D segmentation at previous frames may be required, as well as for interactive schemes supporting the manual initialization of the tracked region. Noisy data and illumination conditions that change with time are two key factors limiting the efficiency of temporal tracking algorithms. To address these issues, a color-based deformable model that is robust against noisy data and changing illumination is developed in . Measuring color constant gradients and estimating the corresponding amount of sensor noise, this approach uses a weighting term in the deformation process to force noisy and unstable gradient information to contribute less to the deformation process than reliable gradient information yielding robust tracking. In a geometric prior is introduced to the active contour framework in order to improve the temporal coherency of the tracking result, and the application of this approach to interactive segmentation is considered.
Alternatively to the above techniques, one could restrict the problem of video segmentation to foreground/background separation. The latter can be achieved either by using primarily motion information, or by performing in parallel segmentation using other feature spaces as well (e.g. intensity information) and employing rule-based processing to enhance the motion segmentation result. In , where a number of different methods are presented, a method for foreground/background separation using intensity information, based on the assumption that the background comprises large uniform regions, is developed. In a fast moving object segmentation algorithm is Page 786 developed, based upon change detection and background registration techniques; this algorithm also incorporates a shadow cancellation technique for dealing with light changing and shadow effects.
Finally, as in temporal segmentation, the spatio-temporal segmentation of compressed video has recently attracted considerable attention. Algorithms of this category generally employ coarse motion and color information that can be extracted from the video stream without full decompression, such as macroblock motion vectors and DC coefficients of DCT-coded image blocks.
Segmentation methods for generic visual content have significantly improved over the last few years. While the further improvement of generic segmentation techniques will continue to attract attention, particularly techniques considering the segmentation of compressed data, an important future direction is the introduction of prior knowledge to the segmentation procedure, leading to the development of knowledge-assisted analysis techniques. Although the applicability of the latter is by definition limited to specific domains, the introduction of prior knowledge to the segmentation process is the key to transforming the ill-posed segmentation problem to a well-posed one by restricting the number of admissible solutions.