An Efficient Parallel Algorithm for Graph-Based Image Segmentation

Авторы: Jan Wassenberg, Wolfgang Middelmann, Peter Sanders
Источник: http://algo2.iti.kit.edu/wassenberg/wassenberg09parallelSegmentation.pdf

Abstract

1 Introduction and Related Work

Segmentation (automatically partitioning an image into regions) is an impor- tant early stage of some image processing pipelines, e.g. object-based change detection. The final results of such applications are often strongly dependent on the quality of the initial segmentation. Since subsequent processing steps can use higher-level region information instead of having to examine all pixels, the segmentation may also be the limiting factor in terms of performance. Many algorithms have been proposed, but good quality results often come at the price of high computational cost.

One extreme example of this is a multi-scale watershed approach (MSHLK) [1]. Repeated applications of anisotropic diffusion smooth the image and reduce the oversegmentation caused by the watershed transform. The resulting subjec- tive quality is very good, but its computational cost (1 second per kPixel) is unacceptable.

An alternative approach uses the Mean-Shift (MS) [2] procedure to locate clusters within a higher-dimensional representation of the image. This is guar- anteed to converge on the densest regions in this space and yields good results in practice, but the processing rate (0.1 MPixel/s) is still inadequate.

Recent work has shown that Maximally Stable Extremal Regions (MSER) [3] within a gradient image are also suitable for image segmentation. While more efficient (2 MPixel/s), this scheme only detects high-contrast segments and does not provide full coverage of the image. It also seems ill-suited for parallelisation since the stability criterion depends on a global ordering of pixels.

Graph-based segmentation (GBS) [4] increases the amount of data to be handled (multiple edges per pixel) but has several attractive properties. Viewing pixels as nodes of a graph allows the reduction of segmentation to finding cuts in a Minimum Spanning Tree (MST). Defining edge weights as some function of the pixels’ per-band intensity differences enables the use of colour information without having to compute gradients. Finally, an MST can be assembled from partial sub-trees, which provides the possibility of parallelisation.

The remainder of this article is structured as follows: Section 2 develops a new online graph-cutting heuristic for MST-based segmentation. Section 3 shows the promising results obtained on well-known images. Section 4 introduces ‘PHMSF’ (Parallel Heuristic for Minimum Spanning Forests), which we believe to be the first non-trivially-parallel segmentation algorithm. Perhaps most importantly, Section 5 shows it to significantly outperform existing segmentation techniques.

2. Segmentation Algorithm

Segmentation algorithms require (often application-dependent) definitions of ‘image region’. We consider ‘homogeneity’ and high contrast to surrounding pix- els to be reasonable criteria [5]. Homogeneity can be computed as distances be- tween (vector-valued) pixels; we find the L2 norm to yield better results than L1 or pseudo-norms. Note that [4] advocates separate segmentation of the R/G/B component images and intersecting the results. Since object edges are not al- ways visible in all multi-spectral bands [6], it is safer (and certainly faster) to segment once using all bands. Recalling the graph segmentation framework, the above homogeneity measure defines the weight of edges. It remains to be seen how an online graph-cutting heuristic should partition the MST depending on edge weight. A mere threshold is insufficient because it fails to account for noise or the overall homogeneity of a region. [4] suggests an adaptive threshold that is incremented by a linearly decreasing function of the region size3 . The function’s slope is a user-defined parameter that must be determined by experimentation since it has no physical explanation. This scheme also underestimates a region’s homogeneity by defining it as the maximum weight in its MST, thus tending towards oversegmentation. We suggest the adoption of an idea from Canny’s edge detection algorithm [7]. In the context of edge detection, pixels with large gradient magnitudes are likely to correspond to edges, but there is no single level at which this ceases to be the case. Applying a rather high limit finds likely candidates, which can be augmented by nearby pixels that lie above a second, lower threshold. Returning to segmentation terminology, regions connected by low-weight edges represent likely candidates that can subsequently be expanded by following adjoining edges with somewhat higher weights. To avoid poten- tially unbounded growth, we institute a ‘credit’ limit on the sum of edge weights that may be added to a candidate region. Since no shape can be more compact than a circle, the region’s perimeter is bounded from below by the circumference √ 4π · regionSize. Let us also assume additive white Gaussian noise with variance 2 σn , for which several estimators have been proposed [8, 9].s Defining contrast as the smallest edge weight along the border of any ‘interesting’ region minus 2 · σn thus makes it likely that edges of total weight ≤ contrast · minPerimeter can be added to a region without inadvertently expanding beyond its bounds. This property is important because subsequent region merges are trivial, whereas splitting requires re-examination of the pixels or edges. However, the resulting regions are not necessarily too fine because pixels connected by low-weight edges are always merged. We have therefore averted global under- and oversegmenta- tion of the image while using only local information. The algorithm first forms candidate regions by merging the endpoints of low-weight edges and then calls the following simple heuristic in increasing order of the remaining edges’ weights:

3 Results

To demonstrate the usefulness of the new segmentation results, we compare them to the outputs of existing algorithms on standard images [10], the results of which are shown in Fig. 1:

MSHLK [1] is known for high-quality results and provides excellent smooth- ing of the walls (b) but merges the eaves into the sky segment. We also call attention to the oversegmentation of the second image and shock effects [11] in the background (h). MS [2] is more successful at merging the individual objects (i) but also splits some of them (e.g. below the P); spurious segments near edges (c) are its only visible flaws. As with MSHLK, segment borders are delineated by black pixels. MSER [3] produces mostly adequate label images, though the wall is not considered to be a stable region (d); the effects of the gradient filter are clearly visible (j). GBS [4] is satisfactory but results in undersegmentation near the roof lines and oversegmentation of the sky and wall (e). It also merges different-coloured objects (k) but fails to return a uniform background. Our new PHMSF algorithm provides results comparable to MSHLK and MS and requires only 1/4000 and 1/50 the computation time, respectively (cf. Sect. 5). The black pixels (f) indicate surface irregularities that resulted in regions smaller than the minimum size. The segmentation in (l) is quite accurate, correctly separating different-coloured objects without introducing spurious boundaries.

4 Parallel Algorithm

Despite the efficiency of the new segmentation algorithm, a highly-tuned sequen- tial implementation is still far slower than the collection rates of commercial imaging satellites (e.g. IKONOS with up to 90 km2 /s [12]). Since a significant reduction of the algorithm’s constant factors appears unlikely and sequential programs have seen less benefit from recent CPU advances [13], it appears our self-set performance goal of 10 MPixel/s can only be reached by means of paral- lelisation. Note that embarrassingly-parallel schemes that simply split the input into independent tiles are not acceptable because they do not correctly handle objects straddling a border. Nor are overlapping tiles sufficient because there is no upper bound on the size of objects of interest (e.g. rivers or roads). Our first attempt at parallelisation addressed the MST computation. The new Filter- Kruskal scheme [14] combines ideas from Quicksort and Kruskal’s algorithm and discards non-MST edges without having to sort them. This ‘filter’ operation, partitioning and sorting can all be parallelised. However, the total speedup on a quad-core system is only 1.5 – chiefly due to the sequential portion of the algorithm, but also because our eight-connected grid graphs are too sparse to derive much benefit from discarding edges. Our second approach is designed to allow independent processing of image tiles, but still ensures consistent re- sults irrespective of the number of processors P .4 The key observation is that Kruskal’s MST algorithm can run in a data-parallel fashion until encountering an edge that crosses a tile border. From then on, MST components using such edges and in turn their incident edges must be ‘delayed’ until the partial MSTs of both tiles are available. We accomplish this with per-tile edge queues that are processed in a subsequent sequential phase5 . It remains to be seen how many edges are delayed – a long cross-border region of homogeneous pixels could af- fect a large proportion of a tile. However, high-weight edges at the boundary of such regions often serve as a ‘firewall’ because they can be discarded without affecting neighbouring regions. Only about 5 % of edges are delayed in practice, making Amdahl’s argument less of a factor than real-world limits on memory bandwidth and P . The algorithm is described by the following pseudo-code:

To avoid scheduling and locality issues, the (manually partitioned) loops reside in a single OpenMP parallel region. A novel variant of counting sort uses paged virtual memory to simulate bins of unlimited size and thus dispenses with a separate counting phase. An explicit buffering technique further increases per- formance by enabling write-combining without cache pollution. Details are given in App. A.

The algorithm outputs a Union-Find (UF) tree represented as an array of pointers to a parent pixel or region, as well as per-tile lists of regions, which each store size (number of pixels) and credit. Computing features for single- pixel regions would consume too much memory, so we only consider regions of size min..max. This requires relabeling the per-tile regions and replacing them with so-called ‘accumulators’ for the region features, which is accomplished by Alg. 3. Its separate and very efficient count phase seems preferable to updat- ing the per-tile region count when cross-border merges are performed by the Kruskal algorithm. Since the desired output includes a label image, we ‘collapse’ the UF tree once all regions have been re-labeled. With all pieces in place, we can now compute the contribution of each pixel toward its region’s features.

The per-band intensities SUM{Bi} and Bi are required for computing the band av- erages and standard deviations. For pixel coordinates (Y, X), the six moments SUM{Y p · X q} (p, q ∈ IN0 , p + q ≤ 2) are sufficient for estimating an ellipse [15]. Finally, counting the number of neighbour pixels belonging to different regions allows computing the region perimeter. Using 64-bit floating-point accumulators mitigates precision issues while still enabling vectorization via SSE2 instructions.

5 Performance Analysis

We first examine the complexity of the proposed algorithm. Counting sort is O(N ). Region merges via Union-Find are effectively O(1) for all practical in- put sizes6 [17]. All other operations are also constant-time and reside in loops with trip counts in O(N ), so the complexity is (quasi-)linear in the input size. Since this also applies to the MSER and GBS algorithms, we must compare their implementations. Table 1 lists the performance7 of each algorithm for a repre- sentative 8.19 MPixel subset of a 16-bit, 4-component (RGB + NIR) Quickbird image of Karlsruhe. Our PHMSF algorithm does more work (computing region features and pro- cessing the original four-component 16-bit pixels rather than an 8-bit RGB ver- sion), yet significantly outperforms the other algorithms. In this test it is 138 times as fast as MS [19], 28 times as fast as GBS [20] and 5 times as fast as our similarly optimised implementation of MSER. Note that (32-bit) MSHLK exhausted its address space after a single diffusion iteration. Our PHMSF im- plementation requires much less memory: the working set is about 7.1 GB for a 1.97 GB image, which equates to 13.5 bytes/pixel. Table 2 shows measurements from processing large images of up to 527 MPixel. Performance improves with size due to increased parallelism – tile interiors grow faster than their borders. The parallel speedup varies between 2 and 3.2 when using 4 cores. In the latter case, sequential processing only accounts for 2 % of processing time; the limiting factor is memory bandwidth. RightMark Memory Analyzer [21] measures read and write throughputs of roughly 3500 MB/s and 2500 MB/s on this system. Having analysed the elapsed times and minimum amounts of data that must be transferred to/from memory during the credit computation, region compres- sion/counting/relabeling and feature computation phases, we can conclude that each is at least 85 % efficient. Improving their performance or scalability is there- fore contigent on increased bandwidth (e.g. via NUMA architecture or by adding further memory channels)

6 Conclusion

We have presented a new (quasi-)linear-time segmentation algorithm that pro- vides useful results at previously unmatched speeds. Applications include auto- matic wide-area appraisal of the suitability of roofs for solar panels, object-based change detection, environmental monitoring and rapid updates of land-use maps. From an algorithm engineering standpoint, we believe this to be the first non- trivially-parallel segmentation algorithm. Its scalability is chiefly limited by the memory bandwidth of current SMP systems. Future work includes statistical estimation of the edge weight thresholds and efficiently computing a segment neighbourhood graph. We are also interested in applying this algorithm towards segment-based fusion of high-resolution electro-optical and hyperspectral im- agery.

References

1. Vanhamel, I., et al.: Scale Space Segmentation of Color Images Using Watersheds and Fuzzy Region Merging. In: ICIP (1). (2001) 734–737

2. Comaniciu, D., Meer, P.: Mean Shift Analysis and Applications. In: ICCV. (1999) 1197–1203

3. Wassenberg, J., Bulatov, D., Middelmann, W., Sanders, P.: Determination of Max- imally Stable Extremal Regions in Large Images. In: Signal Processing, Pattern Recognition, and Applications. (February 2008)

4. Felzenszwalb, P., Huttenlocher, D.: Efficient Graph-Based Image Segmentation. IJCV 59(2) (September 2004) 167–181

5. Haralick, R., Shapiro, L.: Image Segmentation Techniques. CVGIP 29 (January 1985) 100–132

6. Thomas, C., Ranchin, T., Wald, L., Chanussot, J.: Synthesis of Multispectral Images to High Spatial Resolution: A Critical Review of Fusion Methods Based on Remote Sensing Physics. IEEE Trans. Geoscience and Remote Sensing 46(5) (May 2008) 1301–1312

7. Canny, J.: A Computational Approach to Edge Detection. In: RCV87. (1987) 184–203

8. Shin, D.H., Park, R.H., Yang, S., Jung, J.H.: Block-Based Noise Estimation Using Adaptive Gaussian Filtering. IEEE Trans. Consum. Electron. 51 (2005) 218–226

9. Amer, A., Dubois, E.: Fast and Reliable Structure-Oriented Video Noise Estima- tion. IEEE Trans. Circuits Syst. Video Techn 15(1) (2005) 113–118

10. Weber, A.: The USC-SIPI Image Database. http://sipi.usc.edu/database/ Accessed 2008-10-06.

11. Buades, A., Coll, B., Morel, J.: The Staircasing Effect in Neighborhood Filters and its Solution. IEEE Trans. Image Processing 15(6) (June 2006) 1499–1505

12. Dial, G., Bowen, H., Gerlach, F., Grodecki, J., Oleszczuk, R.: IKONOS satellite, imagery, and products. Remote Sensing of Environment 88(1-2) (November 2003) 23–36

13. Sutter, H.: The Free Lunch is Over: A Fundamental Turn Toward Concurrency. Dr. Dobb’s Journal (March 2005)

14. Osipov, V., Sanders, P., Singler, J.: The Filter-Kruskal Minimum Spanning Tree Algorithm. In Finocchi, I., Hershberger, J., eds.: ALENEX, SIAM (2009) 52–61

15. Zunic, J., Sladoje, N.: Efficiency of Characterizing Ellipses and Ellipsoids by Dis- crete Moments. IEEE Trans. Pattern Anal. Mach. Intell 22(4) (2000) 407–414

16. Nister, D., Stewenius, H.: Linear Time Maximally Stable Extremal Regions. In: ECCV. (2008) II: 183–196

17. Harfst, G., Reingold, E.: A Potential-Based Amortized Analysis of the Union-Find Data Structure. SIGACT 31 (September 2000) 86–95

18. Durand, F., Paris, S.: A Fast Approximation of the Bilateral Filter Using a Signal Processing Approach. In: ECCV. (2006) IV: 568–580

19. Robust Image Understanding Lab: EDISON System. http://www.caip.rutgers. edu/riul/research/code/EDISON/doc/segm.html Accessed 2008-09-23.

20. Felzenszwalb, P.: Efficient Graph-Based Image Segmentation. http://people.cs. uchicago.edu/~pff/segment/ (March 2007) Accessed 2008-01-11.

21. Besedin, D.: RightMark Memory Analyzer. http://cpu.rightmark.org Accessed 2009-01-09.

22. Bender, M., Farach-Colton, M., Mosteiro, M.: Insertion Sort is O(n log n). Math- ematical Systems Theory 39 (2006) 9

23. Navarro, J., Iyer, S., Druschel, P., Cox, A.: Practical, Transparent Operating System Support for Superpages. In: OSDI-02. Operating Systems Review, New York, ACM Press (December 2002) 89–104

24. Mehlhorn, K., Sanders, P.: Scanning Multiple Sequences via Cache Memory. Al- gorithmica 35 (2003)

25. Intel Corporation: Intel 64 and IA-32 Architectures Optimization Reference Man- ual. (November 2007)

A Microarchitecture-aware Sort

The Kruskal algorithm requires visiting edges in increasing order of their weights. A straightforward implementation sorts (i.e. permutes into sorted order) all edges, which is possible in O(N ) time because only edge weights ≤ M need be considered, thus enabling counting sort. This typically involves two passes through the data, first building a histogram of weights and then writing edges to the correct output position. However, a weaker post-condition suffices for it- eration – ‘bins’ (the set of edges with a given weight) need not be immediately adjacent9 . This suggests a simplified algorithm where bins of capacity N are pre-allocated and the initial scan is skipped. Edges are simply written to the next free position in the corresponding bin:

This algorithm only writes and reads each edge once, a feat that comes at the price of N · M space. While this appears problematic under the Random-Access- Machine model, it is easily handled by 64-bit CPUs with paged virtual memory. Physical memory is mapped to pages of size S when they are first accessed,10 thus reducing the actual memory requirements to N + M · S. The remainder of the initial allocation only occupies address space, a plentiful resource on 64- bit systems. Note that using large pages (S = 2 MiB on x86-64 architectures) increases TLB coverage and also reduces the number of page faults at the cost of internal fragmentation [23].

A further microarchitectural consideration concerns the manner in which edges are written to a bin. An ideal cache with M lines could exploit the writes’ spatial locality and entirely avoid noncompulsory misses. However, perfect hit rates are not achievable in practice due to limited set associativity [24]. The common (pseudo-)LRU eviction policy also causes the displacement of previ- ously cached data instead of write-only lines that are not accessed after having been filled. This ‘cache pollution’ can be avoided by writing directly to mem- ory via SSE2 non-temporal streaming stores11 . Since ‘partial writes’ (non-burst transfers) involve significant bus overhead, the architecture attempts to combine neighbouring writes. However, M is likely to exceed the six write-combine (WC) buffers provided by current microarchitectures, so WC degenerates to partial writes because WC buffers are often flushed to memory before being filled. Hav- ing established the drawbacks of cached and non-temporal writes, we now show how to avoid them by means of software write-combining [25]. Instead of writ- ing to each bin directly, edges are first placed into cache-line-sized temporary buffers. When one of these is full, it is copied to the actual destination via non- temporal writes, which can be combined into a single burst transfer. This scheme avoids the excessive cache pollution caused by normal writes and works around the limited number of WC buffers by using L1 cache lines for that purpose. The effect is a further 7 % speedup of the entire edge creation phase and an increase in scalability due to less memory traffic.

While this section has covered highly system-specific details, we believe the basic principles (taking advantage of paged virtual memory and reducing partial bus transfers) to have widespread application. Since these issues can determine the feasibility of algorithms, they must be considered during the design phase and cannot be relegated to a separate optimization phase.