M. R. Dobie and P. H. Lewis
October 1991
In recent years there have been significant developments in the field of multimedia systems [1,]. Such systems provide the ability to manage many different forms of information: text, graphics, images, image sequences and sounds can all be stored and manipulated. However, although such systems are viable in terms of hardware, there is a marked lack of software techniques to provide the level of functionality that is available for traditional text and numeric databases. Such techniques should be able to manipulate image and video data directly, for example to perform tasks such as searching and sorting of visual information.
This paper concentrates on the development of techniques for tracking a user selected object through digital image sequences in multimedia systems. The ability to track objects in sequences is useful in situations where the motion of objects is important (for example, to study the motion of a particular type of cell under a microscope). It is also useful when a moving object is difficult to see and needs to be highlighted. An example might be a particular object moving amongst many similar objects. A tracking tool is also of benefit to the developers of a multimedia systems to automatically ``mark up'' a moving object so a user of the system can select it and manipulate it. The tool can also be used by a user to follow motion which has not been specially identified by the developer of the system.
There have been many approaches to object tracking described in the literature. The approaches differ in the goals of the tracking processes, the assumptions that are made about the scene, the types of moving objects and their motion characteristics and the camera and its motion. There are many problems which must be overcome. Objects will rotate and change apparent size. Lighting and background will change. Objects will move behind other objects or move erratically and some objects will actually change shape and colour as they move.
If the 3D geometry of the scene and camera position are known in advance, then some assumptions can be made to allow the calculation of world coordinates from an image plane position [7]. Similarly, if a 3D model of the object is available it can be projected into the image plane in different orientations and matched with each frame of the sequence to facilitate the tracking [4].
In our tracking environment, we have no prior knowledge of the scene or the camera. The only information provided about the object is a two dimensional area selected by the user from the first frame of the sequence by drawing around the area with a mouse. For these reasons, a three dimensional model matching approach is inappropriate for our application.
A simple approach is to derive a two dimensional model from the user selected area and match this against the next frame in the sequence. We assume that the object is located at the place where we find the best match. The matching part of the process need not be restricted to finding just the location of the object, but can also provide a measure of the object's orientation and size relative to the first frame.
The tracking process may be improved by carefully selecting the search area from each frame with which the model is to be matched. The main advantage to be gained is speed. The smaller the area searched, the faster the tracking process will be. Another advantage is accuracy. A smaller search area is less likely to contain other objects similar to the model, so the tracking process is less likely to be misled. However, by reducing the search area the chance of missing the object being tracked is increased.
The ideal search area would contain the object being tracked and little else. As we track the object we model its motion and use the model to predict the position in each new frame. Although predictive models using extrapolation based on simple estimates of the motion parameters are useful, Kalman filters offer a recursive method to calculate more accurate minimum mean-square estimates of the model parameters. The model is used to predict a restricted search area for the object in each frame of the sequence.
We have developed a number of 2D matching methods derived from traditional template matching techniques and from the generalised Hough transform which we use to find the best match in the restricted search area.
We have tested several matching techniques for locating the model in the search area. The first was a basic template matching method in which the sum of the absolute differences between corresponding object and search area pixel intensities is used as the difference measure.
One of the problems with this algorithm is its sensitivity to changes in overall illumination between frames of the sequence. If the average intensity of the search area becomes very different to the average intensity of the template then the difference measure will be too high and false matches occur. We have enhanced the basic template matching algorithm by preprocessing the image data. A Sobel edge detector is applied to the template and to the search area and matching is performed on the edge magnitude data.
The main disadvantage with template matching approaches in general is their inability to cope with object rotation easily. To extend the process for rotation would require storing templates at fixed orientations or rotating the templates dynamically as matching occurs. An alternative approach that addresses this problem is to use a Hough Transform.
The Generalised Hough Transform is a modified Hough Transform which can detect arbitrary 2D shapes in an image. The shape is represented as a set of vectors from feature points to a reference point. The feature points are usually edge points on the object.
Feature points within the search area in subsequent images are used to increment cells in a Hough accumulator space. For each feature point, every vector in the object representation is followed back to the corresponding reference point position, and a cell denoting this position is incremented. If the object is present in the search area, all the feature points comprising the object will contribute to the same cell in the accumulator space.
Cells with a high peak should correspond to occurrences of the object in the image. The position of the peak gives the position of the object reference point. Since we are only tracking one moving object we need only look for the highest peak in the Hough accumulator space. The size of the peak is related to the number of feature points in the object representation. Smaller peaks may indicate that the object is present but also occluded.
The problem of object rotation is addressed by adding an orientation dimension to the Hough accumulator space. We assume that the object may be present at any of several orientations and calculate an accumulator space for each orientation. This is achieved by rotating each vector in the object representation to decide which cell to increment in the Hough accumulator space. The position of the highest peak in the space will indicate the location of the object in the image and its orientation relative to the first frame of the sequence.
A similar approach can be used to find the object at different scales within the image plane. This effect is the result of motion components perpendicular to the image plane. A fourth dimension is added to the accumulator space and the object representation vectors are lengthened and shortened when incrementing accumulator cells.
Both of these enhancements to the generalised Hough transform increase the memory space and computation requirements of the process by several orders of magnitude. However, significant savings can be made by limiting the size of the search area, both spatially and in terms of orientation and scale changes and this is achieved by carefully modelling the object motion.
A Kalman filter is an adaptive filter used to model the state of a discrete dynamic system. The technique was originally developed in 1960 to filter out noise in electronic signals [6], but has since found tracking and modelling applications in computer vision [2,4].
Given a set of measurements taken from the system, the filter can estimate the next state of the system and adjust its model to allow for changes in the system behaviour. In the following analysis familiarity with the basic theory of Kalman filters is assumed (see [3,5] for more details).
The way the problem is formulated depends on the measurements that can be made and the results that are required. For instance, if one is interested in positional coordinates, rotation and scale, each of these can have a value and a first and second derivative with respect to time. This could be formulated as a single twelve dimensional state vector x(n) with a corresponding state transition matrix.
Alternatively, each attribute could be considered independently. There would be four Kalman filters, each with a three dimensional state vector. This greatly simplifies the corresponding calculations for the Kalman gain coefficients. This is the method adopted here.
Considering a single positional coordinate, the system state x(n) is a matrix with elements and . These represent the object position, velocity and acceleration at time n. This is shown in equation 2.
If a model of local ballistic motion is used, the state transition matrix is given in equation 1. This represents the familiar equations of motion of a body moving with constant acceleration. Here, represents the time between state transitions. Its value depends on the units used in the tracking process. If positions are measured in pixels and a velocity in units of pixels per frame is acceptable, then .
A system measurement vector y(n) is a matrix whose element is yn a measurement of the object's coordinate. This implies that the measurement equation matrix C(n) is a matrix and that the innovation matrix is a matrix whose element takes the value of , where is the position component of the state estimate. These matrices are summarised in equations 2 and 3.
We have implemented Kalman filters using this formulation and the results of tests are presented below. The implementation uses two independent filters to model the x and y motion of a test object. At each step the next position of the object is predicted. The actual position of the object is used by the filter to refine its models of the x and y motion components.
The results are shown in figure 1. The two plots show the test trajectory and the trajectory predicted at each step by the Kalman filters. The predicted trajectory follows the real trajectory very closely. The main deviations occur at the start of the trajectory and where the motion changes from one parabola to another.
We use Kalman filters to model the motion of an object in a video sequence in order to predict its location in the next frame. Figure 1 shows that in this example when the object's motion is not ballistic there are about five or six frames where there is a deviation between the real trajectory and the predicted trajectory before the filters correctly model the new motion. Since the search area should be as small as possible, yet be large enough to accommodate these variations, the size of the search area is varied according to the most recent deviation between the real and predicted trajectories.
We calculate the new search area by starting with an area the same size as the object being tracked and expanding it around its centre. The size is controlled by adjusting the expansion factor. Where the deviation is small the search area is just larger than the object (the minimum expansion factor is 1.1). As the deviation increases we increase the expansion factor up to a maximum value of 2.0. The formula used to calculate each expansion factor is
Figure 2 shows the search areas that correspond to the trajectory and predictions from figure 1. Note that there are two expansion factors one for each of the x and ydirections since the two motion components can vary (and are modelled) independently. We also model the rotational velocity of the object and adjust the search area for this separately.
We have developed a system that allows a variety of tracking algorithms to be compared when operating on real or artificial image sequences. The user can draw around an object in the first frame of a sequence and then select one of several matching algorithms to attempt to track the object as it moves. As the tracking proceeds, the object is highlighted in each frame. In addition, the system can track a known object through an artificially generated test sequence and produce statistics which can be used to compare the tracking performance of various algorithms.
We initially implemented basic template matching, template matching with edge preprocessing and a generalised Hough transform. These algorithms were designed to track only translational motion within the image plane. The Hough transform has been extended to track object rotation within a limited range predicted by the motion model.
Test image sequences have been created which contain both translational and rotational motion. The sequences have varying levels of Gaussian noise added to them. Noise is added to individual pixel intensities to simulate transmission and digitisation effects. We also add noise to a whole frame of a sequence at a time to simulate changes in overall illumination.
Figure 3 shows the tracking performance of two varieties of template matching in the presence of noise. Each graph shows the distance error in locating the test object against frame number through the sequence. Each curve on a set of axes represents a different noise level. The levels correspond to Gaussian noise with standard deviation ranging from 0 to 60 units of intensity.
The graphs show that basic template matching copes with pixel based noise very well, with all but one of the curves lying along the zero error axis. When the illumination level changes during the sequence the performance is worse. Template matching with edge preprocessing proves a much better approach when the overall intensity varies, but appears more susceptible to localised pixel noise.
We have also experimented using corner feature points instead of edge points in the template matching and Hough transform algorithms. In our test sequences the corner features proved to be too sparse for detecting the object in the presence of noise.
We have decomposed the tracking process into two independent parts. A two dimensional model matching process locates an arbitrary object in a search area and a motion modelling process uses the object positions to predict its next location and orientation and calculate a suitable search area for the next frame.
We have implemented variations of template matching and the generalised Hough transform to track an object with translational and rotational motion within the image plane. The object's motion is successfully modelled by a series of Kalman filters. Experiments with test sequences and varying amounts of different types of noise show that a robust 2D tracking system can be implemented using this approach.
We intend to extend the generalised Hough transform to allow tracking and modelling of limited scale changes and to use perceptual groups as a basis for a more robust model of the tracked object.
The work of the first author was supported by a SERC studentship.
This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)
Copyright В© 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Mark Dobie