Tareen S.A.K. A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK / S.A.K. Tareen, Z. Saleem // 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET). – Sukkur, Pakistan: IEEE, 2018. [Link to the original]

A Comparative Analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK

Shaharyar Ahmed Khan Tareen1, Zahra Saleem2
1 National University of Sciences and Technology (NUST), Islamabad, Pakistan
2 Fatima Jinnah Women University (FJWU), Rawalpindi, Pakistan

Abstract. Image registration is the process of matching, aligning and overlaying two or more images of a scene captured from different viewpoints. It is extensively used in numerous vision-based applications. Image registration has five main stages: Feature Detection and Description; Feature Matching; Outlier Rejection; Derivation of Transformation Function; and Image Reconstruction. The timing and accuracy of feature-based image registration mainly depend on the computational efficiency and robustness of the selected feature-detector-descriptor, respectively. Therefore, the choice of feature-detector-descriptor is a critical decision in feature-matching applications. This article presents a comprehensive comparison of the SIFT, SURF, KAZE, AKAZE, ORB, and BRISK algorithms. It also elucidates a critical dilemma: which algorithm is more invariant to scale, rotation and viewpoint changes? To investigate this problem, image matching has been performed with these features to match the scaled versions (5% to 500%), rotated versions (0° to 360°), and perspective-transformed versions of standard images with the original ones. Experiments have been conducted on diverse images taken from benchmark datasets: University of OXFORD, MATLAB, VLFeat, and OpenCV. Nearest Neighbor Distance Ratio has been used as the feature-matching strategy, while RANSAC has been applied for rejecting outliers and fitting the transformation models. Results are presented in terms of quantitative comparison, feature-detection-description time, feature-matching time, time of outlier rejection and model fitting, repeatability, and error in recovered results as compared to the ground truths. SIFT and BRISK are found to be the most accurate algorithms, while ORB and BRISK are the most efficient. The article contains rich information that will be useful for making important decisions in vision-based applications, and the main aim of this work is to set a benchmark for researchers, regardless of any particular area.

Keywords: SIFT; SURF; KAZE; AKAZE; ORB; BRISK; RANSAC; nearest neighbor distance ratio; feature detection; feature matching; image registration; scale invariance; rotation invariance; affine invariance; image matching; mosaicing

I. Introduction

Image matching is the cornerstone of many computer and machine vision applications, especially: Robot Navigation, Pose Estimation [1], Visual Odometry [2], [3], Visual Simultaneous Localization and Mapping [4], [5], Object Detection, Object Tracking, Augmented Reality [6], Image Mosaicing, and Panorama Stitching [7]. Pose estimation is performed by deriving the perspective transform of the image captured at the final pose with respect to the image captured at the initial pose.

Visual odometry is the measurement of the distance covered by a mobile system on the basis of visual information. Vision-based Simultaneous Localization and Mapping deals with localizing a mobile agent in an environment while simultaneously mapping its surroundings. Object detection and tracking are based on the detection and matching of common features between the object's image and the scene. Image stitching or mosaicing is the process of generating a large-scale consolidated image from multiple small-scale overlapping images. It is widely applied in aerial, marine and satellite imagery to yield a continuous single picture of the subject environment. Panorama generation software provided in mobile phones and digital cameras is another major application of mosaicing. All these applications require the presence of some overlapping region (typically more than 30%) between any two successive images for matching.

The five main stages of image registration are: Feature Detection and Description; Feature Matching; Outlier Rejection; Derivation of Transformation Function; and Image Reconstruction. A feature-detector is an algorithm that detects feature-points (also called interest-points or key-points) in an image [8]. Features are generally detected in the form of corners, blobs, edges, junctions, lines, etc. The detected features are subsequently described in logically different ways on the basis of the unique patterns possessed by their neighboring pixels. This process is called feature description, as it assigns each feature a distinctive identity that enables its effective recognition during matching. Some feature-detectors come with a designated feature-description algorithm while others exist individually; however, the individual feature-detectors can be paired with several types of pertinent feature-descriptors. SIFT, SURF, KAZE, AKAZE, ORB, and BRISK are among the fundamental scale-, rotation- and affine-invariant feature-detectors, each having a designated feature-descriptor and possessing its own merits and demerits. Therefore, the term feature-detector-descriptor is used for these algorithms in this paper.

After feature-detection-description, feature-matching is performed using the L1-norm or L2-norm for floating-point descriptors (SIFT, SURF, KAZE, etc.) and the Hamming distance for binary descriptors (AKAZE, ORB, BRISK, etc.). Different matching strategies can be adopted, such as threshold-based matching, Nearest Neighbor, and Nearest Neighbor Distance Ratio, each having its own strengths and weaknesses [9]. Incorrect matches (outliers) cannot be completely avoided in the feature-matching stage, so a further outlier-rejection phase is mandatory for accurate fitting of the transformation model. Random Sample Consensus (RANSAC) [10], M-estimator Sample Consensus (MSAC) [11], and Progressive Sample Consensus (PROSAC) [12] are some of the robust probabilistic methods used for removing outliers from the matched features and fitting the transformation function (in terms of a homography matrix). This matrix provides the perspective transform of the second image with respect to the first (reference) image. Image reconstruction is then performed on the basis of the derived transformation function to align the second image with respect to the first. The reconstructed version of the second image is then overlaid on the reference image so that all matched feature-points overlap. This large consolidated version of the smaller images is called a mosaic or stitched image. Fig. 1 shows the generic process of feature-based image registration; a minimal code sketch of the same pipeline follows the figure.

Fig. 1. Generic phases of image registration.
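To make the generic pipeline of Fig. 1 concrete, the following is a minimal sketch of feature-based registration of two overlapping images in MATLAB with the mexopencv wrapper (the toolchain used later in Section III). The image file names are placeholders, and the function and option names are assumed to follow mexopencv's mirroring of the OpenCV API; this is an illustrative sketch, not the authors' code.

```matlab
% Minimal feature-based image registration sketch (MATLAB + mexopencv assumed).
img1 = rgb2gray(imread('reference.jpg'));   % placeholder file names
img2 = rgb2gray(imread('moving.jpg'));

% 1) Feature detection and description (SIFT as an example).
sift = cv.SIFT();
[kp1, desc1] = sift.detectAndCompute(img1);
[kp2, desc2] = sift.detectAndCompute(img2);

% 2) Feature matching with the nearest-neighbor distance ratio (ratio 0.7).
matcher = cv.DescriptorMatcher('BruteForce-L1');   % L1 distance for float descriptors
knn = matcher.knnMatch(desc1, desc2, 2);
good = {};
for i = 1:numel(knn)
    if numel(knn{i}) == 2 && knn{i}(1).distance < 0.7 * knn{i}(2).distance
        good{end+1} = knn{i}(1);                   %#ok<AGROW>
    end
end
good = [good{:}];

% 3) Outlier rejection and homography fitting with RANSAC.
pts1 = cat(1, kp1([good.queryIdx] + 1).pt);        % mexopencv match indices are 0-based
pts2 = cat(1, kp2([good.trainIdx] + 1).pt);
H = cv.findHomography(pts2, pts1, 'Method', 'Ransac');

% 4) Image reconstruction: warp the second image into the reference frame.
mosaicSize = [size(img1, 2) + size(img2, 2), size(img1, 1)];   % [width, height]
mosaic = cv.warpPerspective(img2, H, 'DSize', mosaicSize);
mosaic(1:size(img1, 1), 1:size(img1, 2)) = img1;   % overlay the reference image
```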

In this article, the SIFT (blobs) [13], SURF (blobs) [14], KAZE (blobs) [15], AKAZE (blobs) [16], ORB (corners) [17], and BRISK (corners) [18] algorithms are compared for image matching and registration. The performance of the feature-detector-descriptors is further evaluated to investigate which is more invariant to scale, rotation and affine changes. To inspect this problem, image matching has been performed with these feature-detector-descriptors to match the synthetically scaled versions (5% to 500%) and synthetically rotated versions (0° to 360°) of different images with their original versions. To investigate viewpoint or affine invariance, image matching has been performed for the Graffiti sequence and the Wall sequence. Experiments have been conducted on diverse images taken from the well-known datasets of the University of OXFORD [19], MATLAB, VLFeat, and OpenCV. Nearest Neighbor Distance Ratio has been applied as the feature-matching strategy with a brute-force search algorithm, while RANSAC has been applied for fitting the image transformation models (in the form of homography matrices) and for rejecting the outliers. The experimental results are presented in the form of quantitative comparison, feature-detection-description time, feature-matching time, outlier-rejection and model-fitting time, repeatability, and error in the recovered results as compared to the ground-truth values. This article will act as a stepping stone in the selection of the most suitable feature-detector-descriptor with the required strengths for feature-based applications in the domain of computer and machine vision. To the best of our knowledge, no existing research evaluates these fundamental feature-detector-descriptors so exhaustively over such a diverse set of images for the discussed problems.

II. Literature Review

A. Performance Evaluations of Detectors & Descriptors

The literature review for this article is based on many high-quality research articles that provide comparison results for various types of feature-detectors and feature-descriptors. K. Mikolajczyk and C. Schmid evaluated the performance of local feature-descriptors (SIFT, PCA-SIFT, Steerable Filters, Complex Filters, GLOH, etc.) over a diverse dataset for different image transformations (rotation, rotation combined with zoom, viewpoint changes, image blur, JPEG compression, and light changes) [9]. However, the scale changes evaluated in that work lie only in the range of 200% to 250%. S. Urban and M. Weinmann compared different feature-detector-descriptor combinations (such as SIFT, SURF, ORB, SURF-BinBoost, and AKAZE-MSURF) for the registration of point clouds obtained through terrestrial laser scanning in [20]. In [21], S. Gauglitz et al. evaluated different feature-detectors (Harris Corner Detector, Difference of Gaussians, Good Features to Track, Fast Hessian, FAST, CenSurE) and feature-descriptors (image patch, SIFT, SURF, Randomized Trees, and Ferns) for visual tracking, but no quantitative comparison and no evaluation of the rotation and scale invariance properties is performed. Z. Pusztai and L. Hajder provide a quantitative comparison of various feature-detectors available in OpenCV 3.0 (BRISK, FAST, GFTT, KAZE, MSER, ORB, SIFT, STAR, SURF, AGAST, and AKAZE) for viewpoint changes on four image data sequences in [22], but no comparison of computational timings is present. H. J. Chien et al. compare SIFT, SURF, ORB, and AKAZE features for monocular visual odometry using the KITTI benchmark dataset in [23]; however, there is no explicit in-depth comparison of the rotation, scale and affine invariance properties of the feature-detectors. N. Y. Khan et al. evaluated the performance of SIFT and SURF against various types of image deformations using threshold-based feature-matching [24]. However, no quantitative comparison is performed and an in-depth evaluation of the scale invariance property is missing.

B. SIFT

D. G. Lowe introduced the Scale Invariant Feature Transform (SIFT) in 2004 [13]; it is the most renowned feature-detection-description algorithm. The SIFT detector is based on the Difference-of-Gaussians (DoG) operator, an approximation of the Laplacian-of-Gaussian (LoG). Feature-points are detected by searching for local maxima of the DoG at various scales of the subject images. The description method extracts a 16×16 neighborhood around each detected feature and further segments the region into sub-blocks, rendering a total of 128 bin values. SIFT is robustly invariant to image rotation, scale, and limited affine variations, but its main drawback is high computational cost. Equation (1) shows the convolution of the difference of two Gaussians (computed at different scales) with the image I(x, y).

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y),    (1)

where G(x, y, σ) is the Gaussian function and k is the constant multiplicative factor separating adjacent scales.

C. SURF

H. Bay et al. presented Speeded Up Robust Features (SURF) in 2008 [14], which also relies on Gaussian scale-space analysis of images. The SURF detector is based on the determinant of the Hessian matrix, and it exploits integral images to improve feature-detection speed. The 64-bin descriptor of SURF describes each detected feature with a distribution of Haar wavelet responses within a certain neighborhood. SURF features are invariant to rotation and scale, but they have little affine invariance. However, the descriptor can be extended to 128 bin values in order to deal with larger viewpoint changes. The main advantage of SURF over SIFT is its low computational cost. Equation (2) represents the Hessian matrix at point x = (x, y) and scale σ.

           | Lxx(x, σ)  Lxy(x, σ) |
H(x, σ) =  |                      |    (2)
           | Lxy(x, σ)  Lyy(x, σ) |

where Lxx(x, σ) is the convolution of the second-order Gaussian derivative ∂²g(σ)/∂x² with the image I at point x, and similarly for Lxy(x, σ) and Lyy(x, σ).

D. KAZE

P. F. Alcantarilla et al. put forward KAZE features in 2012; they exploit a non-linear scale space built through non-linear diffusion filtering [15]. This makes blurring locally adaptive to feature-points, thus reducing noise while retaining the boundaries of regions in the subject images. The KAZE detector is based on the scale-normalized determinant of the Hessian matrix, computed at multiple scale levels. The maxima of the detector response are picked up as feature-points using a moving window. Feature description introduces the property of rotation invariance by finding the dominant orientation in a circular neighborhood around each detected feature. KAZE features are invariant to rotation, scale and limited affine changes, and have more distinctiveness at varying scales, at the cost of a moderate increase in computational time. Equation (3) shows the standard nonlinear diffusion formula.

∂L/∂t = div(c(x, y, t) · ∇L),    (3)

where c is the conductivity function, div is the divergence operator, ∇ is the gradient operator and L is the image luminance.
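Equation (3) leaves the conductivity c unspecified. For reference, in nonlinear diffusion filtering c is typically computed from the gradient magnitude of a Gaussian-smoothed image Lσ; one common choice, the Perona-Malik g2 function adopted by default in the KAZE formulation, is sketched below. This is background from the nonlinear-diffusion literature, not a formula given in this article.

```latex
% Conductivity built from the gradient of a Gaussian-smoothed image L_\sigma,
% with contrast parameter k (Perona-Malik g_2 function):
c(x, y, t) = g\big(\lvert \nabla L_{\sigma}(x, y, t) \rvert\big),
\qquad
g_{2}\big(\lvert \nabla L_{\sigma} \rvert\big) = \frac{1}{1 + \lvert \nabla L_{\sigma} \rvert^{2} / k^{2}}
```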

E. AKAZE

P. F. Alcantarilla et al. presented the Accelerated-KAZE (AKAZE) algorithm in 2013 [16], which is also based on non-linear diffusion filtering like KAZE, but its non-linear scale spaces are constructed using a computationally efficient framework called Fast Explicit Diffusion (FED). The AKAZE detector is based on the determinant of the Hessian matrix, and its rotation invariance is improved using Scharr filters. Maxima of the detector responses in spatial locations are picked up as feature-points. The descriptor of AKAZE is based on the Modified Local Difference Binary (MLDB) algorithm, which is also highly efficient. AKAZE features are invariant to scale, rotation and limited affine changes, and have more distinctiveness at varying scales because of the nonlinear scale spaces.

F. ORB

E. Rublee et al. introduced Oriented FAST and Rotated BRIEF (ORB) in 2011 [17]. The ORB algorithm is a blend of the modified FAST (Features from Accelerated Segment Test) [25] detection and direction-normalized BRIEF (Binary Robust Independent Elementary Features) [26] description methods. FAST corners are detected in each layer of the scale pyramid, and the cornerness of the detected points is evaluated with the Harris corner score to filter out the top-quality points. Since the BRIEF description method is highly unstable under rotation, a direction-normalized version of the BRIEF descriptor is employed. ORB features are invariant to scale, rotation and limited affine changes.

G. BRISK

S. Leutenegger et al. put forward Binary Robust Invariant Scalable Keypoints (BRISK) in 2011 [18], which detects corners using the AGAST algorithm and filters them with the FAST corner score while searching for maxima in the scale-space pyramid. BRISK description is based on identifying the characteristic direction of each feature to achieve rotation invariance. To cater for illumination invariance, the results of simple brightness comparison tests are concatenated and the descriptor is constructed as a binary string. BRISK features are invariant to scale, rotation, and limited affine changes.

III. Experiments & Results

A. Experimental Setup

MATLAB R2017a with OpenCV 3.3 has been used for performing the experiments presented in this article. Specifications of the computer system used are: Intel(R) Core(TM) i5-4670 CPU @ 3.40 GHz, 6 MB cache and 8.00 GB RAM. SURF(64D), SURF(128D), ORB(1000), and BRISK(1000) denote SURF with a 64-float descriptor, extended SURF with a 128-float descriptor, and ORB and BRISK detectors bounded to retain only the best 1000 feature-points, respectively. Table I shows the OpenCV objects used for the feature-detector-descriptors; a short usage sketch follows the table. All remaining parameters are kept at OpenCV's defaults.

Table I. OpenCV Settings and Descriptor Sizes of the Feature-Detector-Descriptors Used

Algorithm | OpenCV Object | Descriptor Size
SIFT | cv.SIFT('ContrastThreshold',0.04,'Sigma',1.6) | 128 Bytes
SURF(128D) | cv.SURF('Extended',true,'HessianThreshold',100) | 128 Floats
SURF(64D) | cv.SURF('HessianThreshold',100) | 64 Floats
KAZE | cv.KAZE('NOctaveLayers',3,'Extended',true) | 128 Bytes
AKAZE | cv.AKAZE('NOctaveLayers',3) | 61 Bytes
ORB | cv.ORB('MaxFeatures',100000) | 32 Bytes
ORB(1000) | cv.ORB('MaxFeatures',1000) | 32 Bytes
BRISK | cv.BRISK() | 64 Bytes
BRISK(1000) | cv.BRISK(); cv.KeyPointsFilter.retainBest(features,1000) | 64 Bytes
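To illustrate how the objects in Table I are used, the following mexopencv sketch constructs two representative detector-descriptors with the listed settings and extracts features from a grayscale image. It is an assumed reconstruction of the setup (image name and printout are illustrative), not the authors' published code.

```matlab
% Illustrative use of the Table I objects in MATLAB + mexopencv.
img = rgb2gray(imread('graffiti1.png'));            % placeholder image name

% Floating-point descriptor example: extended SURF (128 floats).
surf128 = cv.SURF('Extended', true, 'HessianThreshold', 100);
[kpSurf, descSurf] = surf128.detectAndCompute(img);

% Binary descriptor example: BRISK capped at the best 1000 key-points,
% mirroring the BRISK(1000) entry of Table I.
brisk = cv.BRISK();
kpBrisk = brisk.detect(img);
kpBrisk = cv.KeyPointsFilter.retainBest(kpBrisk, 1000);
[kpBrisk, descBrisk] = brisk.compute(img, kpBrisk);

fprintf('SURF(128D): %d features, descriptor length %d\n', numel(kpSurf), size(descSurf, 2));
fprintf('BRISK(1000): %d features, descriptor length %d\n', numel(kpBrisk), size(descBrisk, 2));
```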

B. Datasets

Two datasets have been used for this research. Dataset-A (see Fig. 2) is prepared by selecting 6 image pairs of diverse scenes from different benchmark datasets. The Building and Bricks images shown in Fig. 2(a) and Fig. 2(d), respectively, are selected from the vision toolbox of MATLAB. The image pair of Mountains shown in Fig. 2(g) is taken from the test data of OpenCV. Graffiti-1 and Graffiti-4 shown in Fig. 2(j) are picked from the University of OXFORD's Affine Covariant Regions Datasets [19]. The image pairs of Roofs and River shown in Fig. 2(m) and Fig. 2(p), respectively, are taken from the data of the VLFeat library. Dataset-B (see Fig. 3) is based on 5 images chosen from the University of OXFORD's Affine Covariant Regions Datasets. Dataset-A has been used to compare different aspects of the feature-detector-descriptors for the image registration process, while Dataset-B has been used to investigate the scale and rotation invariance of the feature-detector-descriptors. Furthermore, to investigate affine invariance, the Graffiti sequence and Wall sequence from the Affine Covariant Regions Datasets have been exploited.

Fig. 2. Dataset-A: image pairs selected from different benchmark datasets. Image registration and mosaicing have been performed using the SIFT algorithm.
Fig. 3. Dataset-B: Images selected from Affine Covariant Regions Datasets for evaluation of scale and rotation invariance of the feature-detector-descriptors.

C. Ground-truths

Ground-truth values for the image transformations have been used to calculate and demonstrate the error in the results recovered with each feature-detector-descriptor. For evaluating scale and rotation invariance, ground truths have been synthetically generated for each image in Dataset-B by resizing and rotating it to known values of scale (5% to 500%) and rotation (0° to 360°). Bicubic interpolation has been adopted for scaling and rotating the images, since it retains image quality well during the transformation process. To investigate affine invariance, the benchmark ground truths provided in the Affine Covariant Regions Datasets for the Graffiti and Wall sequences have been used.
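The synthetic ground truths described above can be generated with MATLAB's imresize and imrotate using bicubic interpolation; the exact scripts and step sizes are not given in the article, so the values below are illustrative.

```matlab
% Illustrative generation of scaled and rotated ground-truth images
% with bicubic interpolation (step sizes are examples, not the article's).
img = imread('dataset_b_image.png');                 % placeholder Dataset-B image

for s = 0.05:0.05:5.0                                 % 5% to 500% scale
    scaled = imresize(img, s, 'bicubic');             % known scale = ground truth
    imwrite(scaled, sprintf('scaled_%03d.png', round(100 * s)));
end

for a = 0:10:360                                      % 0 to 360 degrees
    rotated = imrotate(img, a, 'bicubic', 'loose');   % known rotation = ground truth
    imwrite(rotated, sprintf('rotated_%03d.png', a));
end
```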

D. Matching Strategy

The feature-matching strategy adopted for the experiments is based on the Nearest Neighbor Distance Ratio (NNDR), which was used by D. G. Lowe for matching SIFT features in [13] and by K. Mikolajczyk in [9]. In this matching scheme, the nearest neighbor and the second nearest neighbor of each feature-descriptor from the first feature set are searched for in the second feature set. Subsequently, the ratio of the distance to the nearest neighbor to the distance to the second nearest neighbor is calculated for each feature-descriptor, and a threshold on this ratio is used to filter out the preferred matches. The threshold ratio is kept at 0.7 for the experimental results presented in this article. The L1-norm (also called Least Absolute Deviations) has been used for matching the descriptors of SIFT, SURF, and KAZE, while the Hamming distance has been used for matching the descriptors of AKAZE, ORB, and BRISK; a short sketch of this matcher selection follows.
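The matcher therefore has to agree with the descriptor type. A possible mexopencv sketch of this selection (OpenCV brute-force matcher names; the 0.7 ratio as stated above) is shown below, assuming desc1 and desc2 come from detectAndCompute as in the earlier sketch.

```matlab
% NNDR matching with the distance metric chosen by descriptor type.
ratio = 0.7;
if isa(desc1, 'uint8')                         % binary descriptors: AKAZE, ORB, BRISK
    matcher = cv.DescriptorMatcher('BruteForce-Hamming');
else                                           % float descriptors: SIFT, SURF, KAZE
    matcher = cv.DescriptorMatcher('BruteForce-L1');
end
knn  = matcher.knnMatch(desc1, desc2, 2);      % nearest and second-nearest neighbors
keep = cellfun(@(m) numel(m) == 2 && m(1).distance < ratio * m(2).distance, knn);
best = cellfun(@(m) m(1), knn(keep), 'UniformOutput', false);
matches = [best{:}];                           % struct array of retained NNDR matches
```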

E. Outlier Rejection & Homography Calculation

The RANSAC algorithm with 2000 iterations and 99.5% confidence has been applied to reject the outliers and to find the homography matrix. The homography is a 3×3 matrix that encodes the projective transformation from the first image to the second; an illustrative call is sketched below.
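In mexopencv terms, this step corresponds roughly to a single findHomography call; the 'MaxIters' and 'Confidence' option names below are assumed from mexopencv's mirroring of the corresponding OpenCV parameters, and the reprojection threshold is an illustrative value.

```matlab
% RANSAC homography fit (pts2 -> pts1) with 2000 iterations and 99.5% confidence.
% pts1, pts2: Nx2 arrays of matched key-point coordinates.
[H, inlierMask] = cv.findHomography(pts2, pts1, ...
    'Method', 'Ransac', ...
    'RansacReprojThreshold', 3, ...    % pixel tolerance (assumed value)
    'MaxIters', 2000, ...
    'Confidence', 0.995);
inliers = logical(inlierMask);         % matches surviving outlier rejection
```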

F. Repeatability

Repeatability of a feature-detector is the percentage of detected features that survive a photometric or geometric transformation of an image [27]. Repeatability is not related to the descriptors; it depends only on the performance of the feature-detection part of the feature-detection-description algorithms. It is calculated on the basis of the overlapping (intersecting) region of the subject image pair. A feature-detector with higher repeatability for a particular transformation is considered more robust than the others, particularly for that type of transformation. A simplified sketch of such a computation is given below.
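The article follows the region-based repeatability criterion of [27]; the following is a simplified, point-based approximation for illustration only (the pixel tolerance and the normalization are assumptions, not the paper's exact procedure).

```matlab
% Simplified point-based repeatability (save as repeatability.m):
% percentage of key-points of image 1 that, after projection by the
% ground-truth homography Hgt, land inside image 2 within 'tol' pixels
% of some detected key-point of image 2.
function r = repeatability(kp1, kp2, Hgt, imsize2, tol)
    p1 = cat(1, kp1.pt);                                   % N1x2 key-points, image 1
    p2 = cat(1, kp2.pt);                                   % N2x2 key-points, image 2
    proj = (Hgt * [p1, ones(size(p1, 1), 1)]')';           % project into image 2
    proj = proj(:, 1:2) ./ proj(:, 3);                     % dehomogenize
    inside = proj(:, 1) >= 1 & proj(:, 1) <= imsize2(2) & ...
             proj(:, 2) >= 1 & proj(:, 2) <= imsize2(1);   % keep overlap region only
    proj = proj(inside, :);
    d = sqrt((proj(:, 1) - p2(:, 1)').^2 + (proj(:, 2) - p2(:, 2)').^2);
    repeated = any(d <= tol, 2);                           % re-detected within tolerance
    r = 100 * sum(repeated) / min(size(proj, 1), size(p2, 1));
end
```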

G. Demonstration of Results

Table II shows the results of the quantitative comparison and the computational costs of the feature-detector-descriptors for image matching. Each timing value presented in the table is the average of 100 measurements (to minimize errors that arise because of processing glitches). The synthetically generated scale / rotation transformations for the images of Dataset-B and the readily available ground-truth affine transformations for the Graffiti / Wall sequences are recovered by performing image matching with each feature-detector-descriptor. Average repeatabilities of each feature-detector (over the 5 images of Dataset-B), used to investigate the trend of their strengths and weaknesses, are presented in Fig. 6(a) and Fig. 6(b) for scale and rotation changes, respectively. Fig. 6(c) and Fig. 6(d) show the repeatability of the feature-detectors for viewpoint changes using the benchmark datasets. Fig. 6(e) and Fig. 6(f) exhibit the error in the recovered scales and rotations for all the feature-detector-descriptors in terms of percentage and degrees, respectively. To compute the errors in the affine transformations, the images of the Graffiti and Wall sequences were first warped according to the ground-truth homographies as well as the recovered homographies. Afterwards, the Euclidean distance has been calculated individually for each corner point of the recovered image with respect to its corresponding corner point in the ground-truth image. Fig. 6(g) and Fig. 6(h) illustrate the projection error in the recovered affine transformations as the sum of the 4 Euclidean distances for the Graffiti and Wall sequences, respectively. The error is expressed in pixels. Equation (4) shows the Euclidean distance D(P, Q) between two corresponding corners P(xp, yp) and Q(xq, yq):

D(P, Q) = √((xp − xq)² + (yp − yq)²)    (4)
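The corner-based projection error described above can be sketched as follows; Hgt and Hrec denote the ground-truth and recovered homographies, img2 is the second image of the pair, and the corner ordering and variable names are illustrative.

```matlab
% Sum of the four corner-wise Euclidean distances (4) between the ground-truth
% and the recovered warps of the second image.
[h, w] = size(img2);
corners = [1 1; w 1; w h; 1 h];                    % the four image corners (x, y)

% Apply a homography H to an Nx2 point list (homogeneous warp + dehomogenization).
warp = @(H, p) (H(1:2, :) * [p, ones(size(p, 1), 1)]')' ./ ...
               ((H(3, :)  * [p, ones(size(p, 1), 1)]')');

Pgt  = warp(Hgt,  corners);                        % ground-truth corner positions
Prec = warp(Hrec, corners);                        % recovered corner positions
projError = sum(sqrt(sum((Pgt - Prec).^2, 2)));    % projection error in pixels
```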
Table II. Quantitative comparison and computational costs of the feature-detector-descriptors for image matching (Dataset-A)

Algorithm | Features Detected, 1st Image | Features Detected, 2nd Image | Features Matched | Outliers Rejected | Detection & Description Time, 1st Image (s) | Detection & Description Time, 2nd Image (s) | Feature Matching Time (s) | Outlier Rejection & Homography Calculation Time (s) | Total Image Matching Time (s)

Building Dataset (Image Pair # 1)
SIFT | 1907 | 2209 | 384 | 51 | 0.1812 | 0.1980 | 0.1337 | 0.0057 | 0.5186
SURF (128D) | 3988 | 4570 | 319 | 58 | 0.1657 | 0.1786 | 0.5439 | 0.0058 | 0.8940
SURF (64D) | 3988 | 4570 | 612 | 73 | 0.1625 | 0.1734 | 0.2956 | 0.0052 | 0.6367
KAZE | 1291 | 1359 | 465 | 26 | 0.2145 | 0.2113 | 0.0613 | 0.0053 | 0.4924
AKAZE | 1458 | 1554 | 475 | 36 | 0.0695 | 0.0715 | 0.0307 | 0.0055 | 0.1772
ORB | 5095 | 5345 | 854 | 149 | 0.0213 | 0.0220 | 0.1586 | 0.0067 | 0.2086
ORB (1000) | 1000 | 1000 | 237 | 21 | 0.0103 | 0.0101 | 0.0138 | 0.0049 | 0.0391
BRISK | 3288 | 3575 | 481 | 47 | 0.0533 | 0.0565 | 0.1236 | 0.0056 | 0.2390
BRISK (1000) | 1000 | 1000 | 190 | 33 | 0.0188 | 0.0191 | 0.0158 | 0.0049 | 0.0586

Bricks Dataset (Image Pair # 2)
SIFT | 1404 | 1405 | 427 | 16 | 0.1571 | 0.1585 | 0.0680 | 0.0053 | 0.3889
SURF (128D) | 2855 | 2332 | 140 | 34 | 0.1337 | 0.1191 | 0.2066 | 0.0051 | 0.4645
SURF (64D) | 2855 | 2332 | 327 | 63 | 0.1307 | 0.1137 | 0.1173 | 0.0055 | 0.3672
KAZE | 366 | 705 | 88 | 6 | 0.1930 | 0.1988 | 0.0105 | 0.0047 | 0.4070
AKAZE | 278 | 289 | 153 | 9 | 0.0541 | 0.0536 | 0.0037 | 0.0049 | 0.1163
ORB | 978 | 942 | 323 | 17 | 0.0075 | 0.0079 | 0.0124 | 0.0049 | 0.0327
ORB (1000) | 731 | 734 | 240 | 32 | 0.0067 | 0.0073 | 0.0088 | 0.0050 | 0.0278
BRISK | 796 | 752 | 285 | 18 | 0.0146 | 0.0144 | 0.0119 | 0.0049 | 0.0458
BRISK (1000) | 796 | 752 | 285 | 18 | 0.0146 | 0.0144 | 0.0119 | 0.0049 | 0.0458

Mountain Dataset (Image Pair # 3)
SIFT | 1867 | 2033 | 170 | 45 | 0.1943 | 0.2017 | 0.1197 | 0.0047 | 0.5204
SURF (128D) | 1890 | 2006 | 175 | 33 | 0.0979 | 0.1130 | 0.1208 | 0.0051 | 0.3368
SURF (64D) | 1890 | 2006 | 227 | 62 | 0.0978 | 0.1113 | 0.0689 | 0.0053 | 0.2833
KAZE | 972 | 971 | 131 | 20 | 0.2787 | 0.2826 | 0.0343 | 0.0050 | 0.6006
AKAZE | 960 | 986 | 187 | 16 | 0.0841 | 0.0842 | 0.0151 | 0.0050 | 0.1884
ORB | 4791 | 5006 | 340 | 78 | 0.0228 | 0.0236 | 0.1397 | 0.0052 | 0.1913
ORB (1000) | 1000 | 1000 | 113 | 29 | 0.0118 | 0.0118 | 0.0117 | 0.0049 | 0.0402
BRISK | 3153 | 3382 | 287 | 22 | 0.0520 | 0.0555 | 0.1083 | 0.0052 | 0.2210
BRISK (1000) | 1000 | 1000 | 143 | 7 | 0.0201 | 0.0213 | 0.0160 | 0.0046 | 0.0620

Graffiti Dataset (Image Pair # 4)
SIFT | 2654 | 3698 | 99 | 51 | 0.2858 | 0.3382 | 0.2940 | 0.0073 | 0.9253
SURF (128D) | 4802 | 5259 | 44 | 31 | 0.2166 | 0.2184 | 0.7399 | 0.0222 | 1.1971
SURF (64D) | 4802 | 5259 | 62 | 42 | 0.2072 | 0.2142 | 0.3951 | 0.0169 | 0.8334
KAZE | 2232 | 2302 | 30 | 9 | 0.3311 | 0.3337 | 0.1594 | 0.0052 | 0.8294
AKAZE | 2064 | 2205 | 23 | 13 | 0.1158 | 0.1185 | 0.0491 | 0.0081 | 0.2915
ORB | 5527 | 7517 | 38 | 22 | 0.0277 | 0.0341 | 0.2129 | 0.0083 | 0.2830
ORB (1000) | 1000 | 1000 | 16 | 7 | 0.0146 | 0.0157 | 0.0114 | 0.0059 | 0.0476
BRISK | 3507 | 5191 | 54 | 22 | 0.0669 | 0.0953 | 0.1687 | 0.0077 | 0.3386
BRISK (1000) | 1000 | 1000 | 18 | 10 | 0.0227 | 0.0239 | 0.0143 | 0.0073 | 0.0682

Roofs Dataset (Image Pair # 5)
SIFT | 2303 | 3550 | 423 | 154 | 0.1983 | 0.2665 | 0.2475 | 0.0063 | 0.7186
SURF (128D) | 2938 | 3830 | 171 | 95 | 0.1173 | 0.1523 | 0.3349 | 0.0084 | 0.6129
SURF (64D) | 2938 | 3830 | 247 | 143 | 0.1165 | 0.1504 | 0.1847 | 0.0090 | 0.4606
KAZE | 1260 | 1736 | 172 | 85 | 0.2119 | 0.2265 | 0.0710 | 0.0068 | 0.5162
AKAZE | 1287 | 1987 | 175 | 59 | 0.0686 | 0.0806 | 0.0294 | 0.0053 | 0.1839
ORB | 7660 | 11040 | 498 | 157 | 0.0296 | 0.0407 | 0.4131 | 0.0065 | 0.4899
ORB (1000) | 1000 | 1000 | 91 | 32 | 0.0106 | 0.0113 | 0.0111 | 0.0055 | 0.0385
BRISK | 5323 | 7683 | 436 | 207 | 0.0899 | 0.1260 | 0.3672 | 0.0074 | 0.5905
BRISK (1000) | 1000 | 1000 | 90 | 44 | 0.0189 | 0.0199 | 0.0156 | 0.0069 | 0.0613

River Dataset (Image Pair # 6)
SIFT | 8619 | 9082 | 1322 | 192 | 0.6795 | 0.7083 | 2.2582 | 0.0092 | 3.6552
SURF (128D) | 9434 | 10471 | 223 | 63 | 0.3768 | 0.4205 | 2.8521 | 0.0055 | 3.6549
SURF (64D) | 9434 | 10471 | 386 | 66 | 0.3686 | 0.4091 | 1.4905 | 0.0049 | 2.2731
KAZE | 2984 | 2891 | 391 | 78 | 0.5119 | 0.5115 | 0.2669 | 0.0059 | 1.2962
AKAZE | 3751 | 3635 | 376 | 65 | 0.2048 | 0.1991 | 0.1341 | 0.0056 | 0.5436
ORB | 34645 | 35118 | 2400 | 553 | 0.1219 | 0.1276 | 5.3813 | 0.0092 | 5.6400
ORB (1000) | 1000 | 1000 | 40 | 6 | 0.0235 | 0.0235 | 0.0109 | 0.0046 | 0.0625
BRISK | 23607 | 24278 | 1725 | 366 | 0.3813 | 0.4040 | 4.8117 | 0.0089 | 5.6059
BRISK (1000) | 1000 | 1000 | 39 | 8 | 0.0251 | 0.0248 | 0.0155 | 0.0040 | 0.0694

Mean Values for All Image Pairs
SIFT | 3125.7 | 3662.8 | 470.8 | 84.8 | 0.2827 | 0.3119 | 0.5202 | 0.0064 | 1.1212
SURF (128D) | 4317.8 | 4744.7 | 178.7 | 52.3 | 0.1847 | 0.2003 | 0.7997 | 0.0087 | 1.1934
SURF (64D) | 4317.8 | 4744.7 | 310.2 | 74.8 | 0.1806 | 0.1954 | 0.4254 | 0.0078 | 0.8092
KAZE | 1517.5 | 1660.7 | 212.8 | 37.3 | 0.2902 | 0.2941 | 0.1006 | 0.0055 | 0.6904
AKAZE | 1633.0 | 1776.0 | 231.5 | 33.0 | 0.0995 | 0.1013 | 0.0437 | 0.0057 | 0.2502
ORB | 9782.7 | 10828.0 | 742.2 | 162.7 | 0.0385 | 0.0427 | 1.0530 | 0.0068 | 1.1410
ORB (1000) | 955.2 | 955.7 | 122.8 | 21.2 | 0.0129 | 0.0133 | 0.0113 | 0.0051 | 0.0426
BRISK | 6612.3 | 7476.8 | 544.7 | 113.7 | 0.1097 | 0.1253 | 0.9319 | 0.0066 | 1.1735
BRISK (1000) | 966.0 | 958.7 | 127.5 | 20.0 | 0.0200 | 0.0206 | 0.0149 | 0.0054 | 0.0609
Fig. 4. Feature-detection, matching, and mosaicing with SIFT, SURF, KAZE, AKAZE, ORB, and BRISK.
Fig. 5. Illustration of image registration error using the feature-detector-descriptors for the Graffiti image pair (1,4). On the basis of observation, BRISK provides the best accuracy for this particular image set. Notice that the accuracy of ORB(1000) and BRISK(1000) is lower than that of ORB and BRISK, respectively.
Fig. 6. Repeatability of feature-detector-descriptors and error in image registration for scale, rotation and viewpoint changes.

IV. Important Findings

A. Quantitative Comparison:

B. Feature-Detection-Description Time

Table III. Computational Cost per Feature-Point Based on Mean Values for All Image Pairs of Dataset-A

Algorithm | Mean Detection-Description Time per Point, 1st Images (μs) | Mean Detection-Description Time per Point, 2nd Images (μs) | Mean Feature Matching Time per Point (μs)
SIFT | 90.44 | 85.15 | 142.02
SURF (128D) | 42.78 | 42.22 | 168.55
SURF (64D) | 41.83 | 41.18 | 89.66
KAZE | 191.24 | 177.09 | 60.58
AKAZE | 60.93 | 57.04 | 24.61
ORB | 3.94 | 3.94 | 97.25
ORB (1000) | 13.51 | 13.92 | 11.82
BRISK | 16.59 | 16.76 | 124.64
BRISK (1000) | 20.70 | 21.49 | 15.42

C. Feature Matching Time:

D. Outlier Rejection and Homography Fitting Time:

E. Total Image Matching Time:

F. Repeatability:

G. Accuracy of Image Matching:

Conclusion

This article presents an exhaustive comparison of the SIFT, SURF, KAZE, AKAZE, ORB, and BRISK feature-detector-descriptors. The experimental results provide rich information and various new insights that are valuable for making critical decisions in vision-based applications. SIFT, SURF, and BRISK are found to be the most scale-invariant feature-detectors (on the basis of repeatability), surviving widespread scale variations, while ORB is found to be the least scale invariant. ORB(1000), BRISK(1000), and AKAZE are more rotation invariant than the others. ORB and BRISK are generally more invariant to affine changes than the others. SIFT, KAZE, AKAZE, and BRISK have higher accuracy for image rotations than the rest. Although ORB and BRISK are the most efficient algorithms and can detect a very large number of features, the matching time for such a large number of features prolongs the total image matching time. On the contrary, ORB(1000) and BRISK(1000) perform the fastest image matching, but their accuracy is compromised. The overall accuracy of SIFT and BRISK is found to be the highest for all types of geometric transformations, and SIFT is concluded to be the most accurate algorithm.

The quantitative comparison has shown that the generic order of feature-detector-descriptors for their ability to detect high quantity of features is:

ORB > BRISK > SURF > SIFT > AKAZE > KAZE

The sequence of algorithms for computational efficiency of feature-detection-description per feature-point is:

ORB > ORB(1000) > BRISK > BRISK(1000) > SURF(64D) > SURF(128D) > AKAZE > SIFT > KAZE

The order of efficient feature-matching per feature-point is:

ORB(1000) > BRISK(1000) > AKAZE > KAZE > SURF(64D) > ORB > BRISK > SIFT > SURF(128D)

The feature-detector-descriptors can be rated for the speed of total image matching as:

ORB(1000) > BRISK(1000) > AKAZE > KAZE > SURF(64D) > SIFT > ORB > BRISK > SURF(128D)

References

  1. D. Fleer and R. Möller, Comparing holistic and feature-based visual methods for estimating the relative pose of mobile robots, Robotics and Autonomous Systems, vol. 89, pp. 51–74, 2017.
  2. A. S. Huang et al., Visual odometry and mapping for autonomous flight using an RGB-D camera, in Robotics Research, Springer International Publishing, 2017, ch. 14, pp. 235–252.
  3. D. Nistér et al., Visual odometry, in Computer Vision and Pattern Recognition, Washington D.C., CVPR, 2004, pp. 1–8.
  4. C. Pirchheim et al., Monocular visual SLAM with general and panorama camera movements, U.S. Patent 9 674 507, June 6, 2017.
  5. A. J. Davison et al., MonoSLAM: Real-time single camera SLAM, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.
  6. E. Marchand, et al., Pose estimation for augmented reality: A hands-on survey, IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 12, pp. 2633–2651, 2016.
  7. M. Brown and D. G. Lowe, Automatic panoramic image stitching using invariant features, International Journal of Computer Vision, vol. 74, no. 1, pp. 59–73, 2007.
  8. M. Hassaballah et al., Image features detection, description and matching, in Image Feature Detectors and Descriptors, Springer International Publishing, 2016, ch. 1, pp. 11–45.
  9. K. Mikolajczyk and C. Schmid, A performance evaluation of local descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.
  10. M. A. Fischler and R. C. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  11. H. Wang et al., A generalized kernel consensus-based robust estimator, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 178–184, 2010.
  12. O. Chum and J. Matas, Matching with PROSAC-progressive sample consensus, in Computer Vision and Pattern Recognition, San Diego, CVPR, 2005, pp. 220–226.
  13. D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
  14. H. Bay et al., Speeded-up robust features (SURF), Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
  15. P. F. Alcantarilla et al., KAZE features, in European Conference on Computer Vision, Berlin, ECCV, 2012, pp. 214–227.
  16. P. F. Alcantarilla et al., Fast explicit diffusion for accelerated features in nonlinear scale spaces, in British Machine Vision Conference, Bristol, BMVC, 2013.
  17. E. Rublee et al., ORB: An efficient alternative to SIFT or SURF, in IEEE International Conference on Computer Vision, Barcelona, ICCV, 2011, pp. 2564–2571.
  18. S. Leutenegger et al., BRISK: Binary robust invariant scalable keypoints, in IEEE International Conference on Computer Vision, Barcelona, ICCV, 2011, pp. 2548–2555.
  19. Visual Geometry Group. (2004). Affine Covariant Regions Datasets [online]. Available: http://www.robots.ox.ac.uk/~vgg/data
  20. S. Urban and M. Weinmann, Finding a good feature detector-descriptor combination for the 2D keypoint-based registration of TLS point clouds, Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, pp. 121–128, 2015.
  21. S. Gauglitz et al., Evaluation of interest point detectors and feature descriptors for visual tracking, International Journal of Computer Vision, vol. 94, no. 3, pp. 335–360, 2011.
  22. Z. Pusztai and L. Hajder, Quantitative comparison of feature matchers implemented in OpenCV3, in Computer Vision Winter Workshop, Rimske Toplice, CVWW, 2016.
  23. H. J. Chien et al., When to use what feature? SIFT, SURF, ORB, or A-KAZE features for monocular visual odometry, in IEEE International Conference on Image and Vision Computing New Zealand, Palmerston North, IVCNZ, 2016, pp. 1–6.
  24. N. Y. Khan et al., SIFT and SURF performance evaluation against various image deformations on benchmark dataset, in IEEE International Conference on Digital Image Computing Techniques and Applications, DICTA, 2011, pp. 501–506.
  25. E. Rosten and T. Drummond, Machine learning for high-speed corner detection, in European Conference on Computer Vision, Graz, ECCV, 2006, pp. 430–443.
  26. M. Calonder et al., BRIEF: Binary robust independent elementary features, in European Conference on Computer Vision, Heraklion, ECCV, 2010, pp. 778–792.
  27. K. Mikolajczyk et al., A comparison of affine region detectors, International Journal of Computer Vision, vol. 65, no. 1–2, pp. 43–72, 2005.