Segmentation of Images and Video - Spatial segmentation, Temporal segmentation, Spatio-temporal segmentation, Future directions
Vasileios Mezaris 1,2, Ioannis Kompatsiaris 2 and Michael G. Strintzis 1,2
1 Aristotle University of Thessaloniki, Thessaloniki, Greece
2 Informatics and Telematics Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Source of information: http://encyclopedia.jrank.org/articles/pages/6890/Segmentation-of-Images-and-Video.html
Definition: Segmentation is the process of partitioning a piece of
information into meaningful elementary parts termed segments.
Considering still images, (spatial) segmentation means partitioning the
image into a number of arbitrarily shaped regions, each of them typically
being assumed to constitute a meaningful part of the image, i.e. to
correspond to one of the objects depicted in it or to a part of one such
object. Considering moving images, i.e. video, the term segmentation is used
to describe a range of different processes for partitioning the video into
meaningful parts at different granularities. Segmentation of video can thus
be temporal, aiming to break down the video into scenes or shots, spatial,
addressing the problem of independently segmenting each video frame to
arbitrarily shaped regions, or spatio-temporal, extending the previous case
to the generation of temporal sequences of arbitrarily shaped spatial
regions. The term segmentation is also frequently used to describe
foreground/background separation in video, which can be seen as a special
case of spatio-temporal segmentation. Regardless of the employed decision
space, i.e. 1D, 2D or 3D for temporal, spatial and spatio-temporal
segmentation, respectively, the application of any segmentation method is
often preceded by a simplification step for discarding unnecessary
information (e.g. low-pass filtering) and a feature extraction step for
modifying or estimating features not readily available in the visual medium
(e.g. texture or motion features, or color features in a different color space), as illustrated for a variety of segmentation algorithms in Figure 1.
Segmentation of images and video is generally an ill-posed problem, i.e. for
a given natural image or image sequence, there exists no unique solution to
the segmentation problem; the spatial, temporal or spatio-temporal segments
that should ideally be formed as a result of segmentation largely depend on
the application under consideration and most frequently on the subjective
view of each human observer.
Commonly considered applications of segmentation include region-based image
and video description, indexing and retrieval, video summarization,
interactive region-based annotation schemes, detection of objects that can
serve as cues for event recognition, region-based coding, etc. Particularly
image and video description, indexing and retrieval has been the focus of
attention of many researchers working on segmentation, since the benefits of
introducing segmentation to this application have recently been documented
well and significant progress has been made on related topics such as
region-based description for indexing, most notably with the introduction of
the MPEG-7 Standard.
Most segmentation methods serving all aforementioned applications are
generic, i.e. they make no restrictive assumptions regarding the semantics of
the visual content, such as that the content belongs to a specific domain;
however, domain-specific methods for applications like medical image
segmentation also exist.
Spatial segmentation
Segmentation methods for 2D images may be divided primarily into
region-based and boundary-based methods. Region-based approaches rely on the
homogeneity of spatially localized features such as intensity, texture, and
position. On the other hand, boundary-based methods use primarily gradient
information to locate object boundaries. Hybrid techniques that integrate
the results of boundary detection and homogeneity-based clustering (e.g.
region growing), as well as techniques exploiting additional information
such as structural properties (e.g. inclusion), have also been proposed.
Traditional region-based approaches include region growing and split and
merge techniques. Starting from an initial region represented by an
arbitrarily chosen single pixel, region growing is the process of adding
neighboring pixels to this region by examining their similarity to the ones
already added; when no further additions are possible according to the
defined similarity criteria, a new region is created and grows accordingly.
The opposite of this approach is split and merge. Starting from a single
initial region spanning the entire image, region homogeneity is evaluated;
if the homogeneity criterion is not satisfied, the region is split according
to a pre-defined pattern and neighboring regions are subsequently
merged, provided this does not violate the homogeneity criterion. The
interchange of split and merge steps continues until the latter is satisfied
for all regions.
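The region-growing procedure described above can be sketched in a few lines of Python. The following is a minimal illustration only: the function name, the 4-connectivity and the fixed intensity-similarity threshold are assumptions of this sketch, not part of any specific published algorithm.

```python
import numpy as np
from collections import deque

def region_growing(image, threshold=10):
    """Label pixels by growing regions from unvisited seed pixels.

    A pixel joins the current region if its intensity differs from the
    region's seed by less than `threshold` (a simple similarity criterion).
    """
    h, w = image.shape
    labels = np.zeros((h, w), dtype=int)        # 0 = not yet assigned
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx]:
                continue
            next_label += 1                     # start a new region at this seed
            seed_val = float(image[sy, sx])
            labels[sy, sx] = next_label
            queue = deque([(sy, sx)])
            while queue:                        # breadth-first growth
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and labels[ny, nx] == 0
                            and abs(float(image[ny, nx]) - seed_val) < threshold):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
    return labels

# Two flat intensity regions separated by a sharp step:
img = np.zeros((4, 8)); img[:, 4:] = 100
print(region_growing(img).max())  # 2 regions found
```

Note that the result depends on the seed order and the similarity criterion, which is one reason why region growing alone tends to need merging or refinement steps.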
Region-based approaches also include the Recursive Shortest Spanning Tree
(RSST) algorithm, which starts from a very fine partitioning of the image
and performs merging of neighboring nodes while considering the minimum of a
cost function; the latter preserves the homogeneity of the generated
regions. To avoid premature termination of the merging process, which results in over-segmentation when the desired final number of regions is not explicitly defined, the introduction of syntactic visual features to RSST has been proposed. The K-means algorithm, an
iterative classification method, has also been used as the basis of several
region-based approaches. For example, the K-Means-with-Connectivity-Constraint variant of K-means has been used to effect segmentation by means of pixel clustering in the combined intensity-texture-position feature space (Fig. 2). Another approach to pixel clustering is based on the
Expectation-Maximization (EM) algorithm, which is a method for finding
maximum likelihood estimates when there is missing or incomplete data; in the application of EM to segmentation, the cluster membership of each pixel plays the role of the missing data. In another approach, image segmentation is treated as a graph
partitioning problem and the normalized cut, a global criterion measuring
both the total dissimilarity between the different groups as well as the
total similarity within the groups, is employed for segmenting the graph. In
contrast to the aforementioned methods, boundary-based methods rely on
detecting the discontinuities present in the feature space. The Canny edge
detector is a popular such scheme, based on the convolution of the image,
over a small window, with the directional derivatives of a Gaussian
function. Another approach to boundary detection is anisotropic diffusion,
which can be seen as a robust procedure for estimating a piecewise smooth
image from a noisy input image. Anisotropic diffusion employs an
edge-stopping function that allows the preservation of edges while diffusing
the rest of the image.
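The pixel-clustering idea underlying the K-means-based approaches above can be illustrated with plain K-means on a combined intensity-position feature space. This is a minimal sketch: the explicit connectivity constraint of the KMCC variant and the texture features are omitted, and the position weighting is an illustrative parameter, not a published value.

```python
import numpy as np

def kmeans_pixels(image, k=2, pos_weight=0.5, iters=10):
    """Plain K-means on per-pixel [intensity, y, x] feature vectors.

    `pos_weight` scales the spatial coordinates, trading feature homogeneity
    against spatial compactness of the resulting regions.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([image.ravel().astype(float),
                      pos_weight * ys.ravel().astype(float),
                      pos_weight * xs.ravel().astype(float)], axis=1)
    # Deterministic init: pick pixels at evenly spaced intensity quantiles.
    order = np.argsort(feats[:, 0])
    centers = feats[order[np.linspace(0, len(feats) - 1, k).astype(int)]].copy()
    for _ in range(iters):
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)           # nearest-center assignment
        for j in range(k):                      # recompute cluster centers
            if np.any(assign == j):
                centers[j] = feats[assign == j].mean(axis=0)
    return assign.reshape(h, w)

img = np.zeros((6, 6)); img[:, 3:] = 50         # two flat intensity halves
labels = kmeans_pixels(img, k=2)
print(labels[0, 0] != labels[0, 5])  # True: the halves fall in different clusters
```

Because each pixel is assigned independently, clusters found this way need not be spatially connected, which is precisely what the connectivity constraint of the KMCC variant addresses.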
Mathematical morphology methods, including in particular the watershed
algorithm, have also received considerable attention for use in image
segmentation. The watershed algorithm determines the minima of the gradients
of the image to be segmented, and associates a segment to each minimum.
Conventional gradient operators generally produce many local minima, which
are caused by noise or quantization errors, and hence, the watershed
transformation with a conventional gradient operator usually results in
over-segmentation. To alleviate this problem, the use of multiscale
morphological gradient operators has been proposed. More recently, the use
of the watershed algorithm to generate an initial over-segmentation and the
subsequent representation of this result as a graph, to which partitioning
via the weighted mean cut criterion is applied, was proposed to combat the
over-segmentation effect.
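The over-segmentation problem noted above is easy to reproduce: on a noisy image, a conventional morphological gradient exhibits many spurious local minima, each of which would seed its own watershed region. The following is a minimal numpy sketch with illustrative image sizes and noise level.

```python
import numpy as np

def morph_gradient(img):
    """3x3 morphological gradient: local max minus local min (edge-padded)."""
    p = np.pad(img, 1, mode='edge')
    views = np.stack([p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
                      for dy in range(3) for dx in range(3)])
    return views.max(axis=0) - views.min(axis=0)

def count_local_minima(g):
    """Count pixels strictly smaller than all 8 neighbours (watershed seeds)."""
    p = np.pad(g, 1, mode='constant', constant_values=np.inf)
    neigh = np.stack([p[dy:dy + g.shape[0], dx:dx + g.shape[1]]
                      for dy in range(3) for dx in range(3)
                      if (dy, dx) != (1, 1)])
    return int((g < neigh.min(axis=0)).sum())

rng = np.random.default_rng(1)
img = np.zeros((32, 32)); img[:, 16:] = 100      # one clean step edge
noisy = img + rng.normal(0, 5, img.shape)
print(count_local_minima(morph_gradient(img)),   # clean: no spurious minima
      count_local_minima(morph_gradient(noisy))) # noisy: many spurious minima
```

Each spurious minimum would become a separate watershed catchment basin, which is why multiscale gradient operators or a subsequent graph-based merging step are used in practice.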
Finally, global energy minimization schemes, also known as snakes or active
contour models, involve the evolution of a curve from an initial position
toward the boundary of an object in such a way that a properly defined
energy functional is minimized. Depending on the definition of the energy
functional, the resulting scheme may be edge-based, region-based or
based on a combination of boundary detection and homogeneity-preserving
criteria.
Temporal segmentation
Temporal video segmentation aims to partition the video into elementary image sequences termed scenes and shots. A shot is defined as a set of
consecutive frames taken without interruption by a single camera. A scene,
on the other hand, is usually defined as the basic story-telling unit of the
video, i.e. as a temporal segment that is elementary in terms of semantic
content and may consist of one or more shots.
Temporal segmentation to shots is performed by detecting the transition from
one shot to the next. Transitions between shots, which are effects generated
at the video editing stage, may be abrupt or gradual, the former being detectable by examining two consecutive frames, the latter spanning more than two frames (Fig. 3) and usually being more difficult to detect,
depending among others on the actual transition type (e.g. fade, dissolve,
wipe, etc.). Temporal segmentation to shots in uncompressed video is often
performed by means of pair-wise pixel comparisons between successive or
distant frames or by comparing the color histograms corresponding to
different frames. Methods for histogram comparison include the comparison of
absolute differences between corresponding bins and histogram intersection.
Other approaches to temporal segmentation include block-wise comparisons,
where the statistics of corresponding blocks in different frames are
compared and the number of “changed” blocks is evaluated by means of
thresholding, and edge-based methods. The latter involve the detection of
edges by application of an edge detector (e.g. Canny) and the subsequent
comparison of edges in different frames to calculate values such as the edge
change ratio. Other methods, involving the comparison of motion features at different time instances, have also been proposed, as has a method for foveated shot detection, based on the generation of patterns of visuomotor behavior (traces) for the sequence and their subsequent use for inferring shot boundary presence.
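Histogram-based cut detection, as described above, reduces to thresholding the bin-wise difference between consecutive frame histograms. The following is a minimal sketch; the bin count and threshold are illustrative values, not ones from the literature.

```python
import numpy as np

def shot_boundaries(frames, bins=16, threshold=0.5):
    """Flag an abrupt cut between consecutive frames when the sum of
    absolute bin-wise histogram differences (normalised by frame size)
    exceeds `threshold`."""
    hists = [np.histogram(f, bins=bins, range=(0, 256))[0] / f.size
             for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        if np.abs(hists[i] - hists[i - 1]).sum() > threshold:
            cuts.append(i)               # shot boundary before frame i
    return cuts

# Synthetic "video": 5 dark frames, then 5 bright frames -> one cut at frame 5.
dark = [np.full((8, 8), 20) for _ in range(5)]
bright = [np.full((8, 8), 200) for _ in range(5)]
print(shot_boundaries(dark + bright))  # [5]
```

A single global threshold like this detects abrupt cuts well but misses gradual transitions, whose frame-to-frame differences stay small throughout the effect; that is why fades, dissolves and wipes require dedicated detectors.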
Other recent efforts on shot detection have focused on avoiding the prior
decompression of the video stream, resulting in significant gains in terms
of efficiency. Such methods consider mostly MPEG video, but also other
compression schemes such as wavelet-based ones. These exploit
compression-specific cues such as macroblock-type ratios to detect points in
the 1D decision space where temporal redundancy, which is inherent in video
and greatly exploited by compression schemes, is reduced.
Temporal segmentation to scenes involves grouping of shots to semantically
coherent clusters. Several methods have been proposed for this, including
clustering algorithms also used to effect homogeneity-based spatial
segmentation. One approach is, starting from one shot, to progressively
group similar shots with it until no similar shots can be found within a
chosen temporal distance; this is a simple application of the previously
discussed region-growing algorithm to the 1D segmentation case. Another
approach, also used for 2D segmentation, involves treating shots as
nodes of a graph and using the normalized cut to segment the graph, thus forming shot clusters corresponding to the scenes. The aforementioned
approaches make use of visual features to describe the different shots and
to estimate the similarity between any two. The use of audio and linguistic
information extracted via speech recognition, although possibly not
applicable to any kind of video, has also been proposed to assist the
detection of semantically similar shots.
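The progressive shot-grouping approach can be sketched as a 1D analogue of region growing over a sequence of shot feature vectors. In this minimal illustration, cosine similarity and the threshold value are assumptions of the sketch.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two shot feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_shots(shot_feats, sim_threshold=0.9):
    """1D analogue of region growing: each shot joins the current scene if
    it is similar enough to the scene's first shot; otherwise a new scene
    starts at that shot."""
    scene_of, scene_start = [0], 0
    for i in range(1, len(shot_feats)):
        if cosine(shot_feats[scene_start], shot_feats[i]) > sim_threshold:
            scene_of.append(scene_of[-1])       # absorb into current scene
        else:
            scene_start = i                     # open a new scene here
            scene_of.append(scene_of[-1] + 1)
    return scene_of

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # toy shot descriptors
print(group_shots([a, a, b, b, a]))  # [0, 0, 1, 1, 2]
```

Note that the last shot, although visually similar to the first two, starts a third scene: purely sequential grouping cannot merge non-adjacent similar shots, which is one motivation for the graph-based formulations mentioned above.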
Spatio-temporal segmentation
Several approaches have been proposed for spatio-temporal video
segmentation (i.e. segmentation in a 3D decision space), both unsupervised,
as is almost always the case in 1D and 2D decision spaces, and supervised.
The latter require human interaction for defining the number of objects
present in the sequence, for defining an initial contour of the objects to
be tracked or more often for grouping homogeneous regions to semantic
objects, while the former require no such interaction. In both types of
approaches, however, it is frequently assumed that the video comprises a
single shot; this assumption dictates that spatio-temporal segmentation is
preceded by temporal segmentation to shots.
Some spatio-temporal segmentation approaches rely on initially applying
spatial segmentation methods to each frame independently. Spatio-temporal
objects are subsequently formed by associating the spatial regions formed in
successive frames using their low-level features. In one such method, this general approach is realized by adopting a seeded region-growing scheme for performing
spatial segmentation; the temporal correspondence of spatial regions between
adjacent frames is established by examining the overlapping of seeds, which
in the case of the selected region growing technique may be quite large,
ranging from 1/3 to 2/3 of the final region size.
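Associating the spatial regions of successive frames, as in the family of approaches just described, can be done by overlap: each region of frame t is matched to the region of frame t+1 that covers the largest fraction of its pixels. This is a minimal sketch with an illustrative overlap threshold.

```python
import numpy as np

def associate_regions(labels_t, labels_t1, min_overlap=0.5):
    """Match each region of frame t to the region of frame t+1 covering the
    largest fraction of its pixels, kept only above `min_overlap`."""
    matches = {}
    for r in np.unique(labels_t):
        mask = labels_t == r
        succ, counts = np.unique(labels_t1[mask], return_counts=True)
        best = counts.argmax()
        if counts[best] / mask.sum() >= min_overlap:
            matches[int(r)] = int(succ[best])
    return matches

# Frame t: region 0 left, region 1 right; in frame t+1 the boundary has
# shifted one column to the right (e.g. due to object motion).
t = np.zeros((4, 6), int); t[:, 3:] = 1
t1 = np.zeros((4, 6), int); t1[:, 4:] = 1
print(associate_regions(t, t1))  # {0: 0, 1: 1}
```

Chaining such matches over all the frames of a shot yields the temporal sequences of spatial regions, i.e. the spatio-temporal objects.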
A different approach is to use motion information to perform motion
projection, i.e. to estimate the position of a region at a future frame,
based on its current position and its estimated motion features. In this
case, a spatial segmentation method need only be applied to the first frame
of the sequence, whereas in subsequent frames only refinement of the motion
projection result is required. In one such approach, a homogeneity-based spatial segmentation
method, namely the watershed algorithm, is applied to the first frame of the
sequence; then, motion projection is used to estimate a segmentation of the
following frame, which is eventually refined by applying a marker-controlled
watershed algorithm. Final results are obtained by merging those of the
resulting 3D watershed volumes that have similar motion characteristics. In a similar approach, the need for motion projection is substituted by a Bayes-based approach to color-homogeneous
region-tracking using color information. Color- and motion-homogeneous
spatio-temporal regions generated by means of spatial segmentation at the
first frame of the sequence and tracking at subsequent frames are eventually
clustered to different objects using again the region long-term motion
trajectories (Figure 4).
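Motion projection itself is simple under a purely translational motion model: a region's mask is shifted by its estimated motion vector to predict the region's position in the next frame. The sketch below assumes such a translational model; real systems use richer motion models and refine the prediction, e.g. with a marker-controlled watershed as described above.

```python
import numpy as np

def project_mask(mask, motion):
    """Project a region's binary mask to the next frame under a
    translational motion model (dy, dx); pixels shifted outside the
    frame are discarded."""
    dy, dx = motion
    h, w = mask.shape
    projected = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)               # pixels belonging to the region
    ys, xs = ys + dy, xs + dx               # apply the estimated motion
    keep = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
    projected[ys[keep], xs[keep]] = True
    return projected

m = np.zeros((5, 5), bool); m[1:3, 1:3] = True
p = project_mask(m, (1, 2))   # estimate: one row down, two columns right
print(p[2:4, 3:5].all())  # True
```

The projected mask is only an estimate; the refinement step then corrects it against the actual content of the new frame, so spatial segmentation never has to be recomputed from scratch.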
Region tracking is an important part of 3D segmentation. This is true both
for approaches of the previous category, where tracking of regions generated
via 2D segmentation at previous frames may be required, as well as for
interactive schemes supporting the manual initialization of the tracked
region. Noisy data and illumination conditions that change with time are two
key factors limiting the efficiency of temporal tracking algorithms. To
address these issues, a color-based deformable model that is robust against noisy data and changing illumination has been developed. Measuring color
constant gradients and estimating the corresponding amount of sensor noise,
this approach uses a weighting term in the deformation process to force
noisy and unstable gradient information to contribute less to the
deformation process than reliable gradient information yielding robust
tracking. In another approach, a geometric prior is introduced to the active contour framework
in order to improve the temporal coherency of the tracking result, and the
application of this approach to interactive segmentation is considered.
Alternatively to the above techniques, one could restrict the problem of
video segmentation to foreground/background separation. The latter can be
achieved either by using primarily motion information, or by performing in
parallel segmentation using other feature spaces as well (e.g. intensity
information) and employing rule-based processing to enhance the motion
segmentation result. Among the methods presented to date is one for foreground/background separation using intensity information, based on the assumption that the background comprises large uniform regions. A fast moving object segmentation algorithm has also been developed, based upon change detection and background registration
techniques; this algorithm also incorporates a shadow cancellation technique
for dealing with light changing and shadow effects.
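Change detection against a registered background, as in the last approach above, amounts to thresholding the per-pixel difference between the current frame and the background image. This is a minimal sketch: the threshold is illustrative, and background registration and shadow cancellation are omitted.

```python
import numpy as np

def foreground_mask(frame, background, threshold=25):
    """Change detection against a registered background image: pixels whose
    absolute intensity difference from the background exceeds `threshold`
    are classified as foreground."""
    return np.abs(frame.astype(int) - background.astype(int)) > threshold

bg = np.full((6, 6), 50)                    # static, registered background
frame = bg.copy(); frame[2:4, 2:4] = 200    # a moving object enters the scene
mask = foreground_mask(frame, bg)
print(int(mask.sum()))  # 4 foreground pixels
```

In practice a shadow cast by the object also differs from the background and would be flagged as foreground by this simple rule, which is precisely what the shadow cancellation technique mentioned above is designed to suppress.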
Finally, as in temporal segmentation, the spatio-temporal segmentation of
compressed video has recently attracted considerable attention. Algorithms
of this category generally employ coarse motion and color information that
can be extracted from the video stream without full decompression, such as
macroblock motion vectors and DC coefficients of DCT-coded image blocks.
Future directions
Segmentation methods for generic visual content have significantly
improved over the last few years. While the further improvement of generic
segmentation techniques will continue to attract attention, particularly
techniques considering the segmentation of compressed data, an important
future direction is the introduction of prior knowledge to the segmentation
procedure, leading to the development of knowledge-assisted analysis
techniques. Although the applicability of the latter is by definition
limited to specific domains, the introduction of prior knowledge to the
segmentation process is the key to transforming the ill-posed segmentation
problem to a well-posed one by restricting the number of admissible
solutions.