Stereo Algorithms
Problem
The world has three spatial dimensions. Pictures contain only two spatial dimensions. One dimension (i.e. the depth or distance from the camera) is lost when the light from the scene is projected onto the two dimensional image. This depth can be recovered by using two or more images of the same scene from different viewpoints.
This works by first calibrating the camera intrinsically, which is typically done off-line using a calibration method. Next, the precise camera position and orientation needs to be recovered, either on-line by a mono- or stereo-camera method, or off-line using standard photogrammetry software for e.g. aerial images. In all cases, data from other sensors like (differential) GPS and IMU can be used for supporting the recovery.
After recovering the geometry with a total accuracy of less then one pixel error, the recovery of depth requires finding corresponding points, i.e. points in different images that represent the same point in the scene. This allows the reconstruction of the point in three dimensions, by the reconstruction and triangulation of the rays of light that are projected onto the corresponding points.
The knowledge of the camera geometry greatly simplifies the search for corresponding points from two dimensions, i.e. the whole image, to one dimensions, i.e. one line, which is called the epipolar line. If two or three images are uses, the images can be rectified to make the epipolar lines fall onto image rows or columns. This simplifies the stereo method and makes it faster.
Methods
Improved Correlation based Stereo Matching
A classical method for finding corresponding points is stereo correlation. It uses a typically square, symmetric window around a point of interest and compares this window with windows at different positions along the epipolar line in the other images. Using fixed windows is known to cause blurring (e.g. object borders), because this implicitly assumes that the scene depth is constant within the window. This problem can be reduced by using multiple, smaller windows (Hirschmueller et al., 2002; Hirschmueller 2003). The method is complemented by a left-right consistency check that identifies mismatches and occlusions, sub-pixel refinement and further filtering. Finding of correspondences is done for all pixels and stored as disparities (i.e. shift in position of corresponding points) in a disparity image. The disparity can be seen as an encoding of depth, because it is small for points far away and larger for points closer to the camera.
Stereo correlation is a fast, but not very accurate method, mainly due to the blurring problem. The exact working of the algorithm is described in publications (Hirschmueller et al., 2002; Hirschmueller 2003).
Semi-Global Matching (SGM)
The problem of all methods that match windows is blurring of object boundaries, because this violates the implicit assumption about constant depth within the correlation windows. This can be avoided by matching only single pixels. However, a single pixel does not contain enough information for un-ambiguous matching, especially in the presence of radiometric differences (as discussed below).
There is the observation that virtually all real-scenes consist of piecewise smooth surfaces, which are interrupted by sharp depth discontinuities. This means that pixels in the neighborhood of a certain pixel are highly likely to have the same or nearly the same depth, but exceptions (i.e. sudden depth changes) are possible. This observation can be encoded, together with the cost of pixelwise matching, into a cost function that penalizes (but allows) depth discontinuities. This cost function actually connects all pixels of the image and is therefore called a global cost function. It is the base of all global stereo methods (e.g. graph-cuts, belief propagation, dynamic programming, etc). The problem is that it can be shown that finding the minimum of this global cost function (i.e. the disparity image that minimizes the cost), is an NP-complete problem, at least for the class of depth discontinuity preserving functions, which are the most interesting ones. Thus, all existing global stereo methods do some kind of approximation or simplification.
The Semi-Global Matching (SGM) method attempts the minimization of the global cost along one dimensional paths, which go in all directions through the image and are joined at each pixel. This can be done quite efficient, because in contrast to two dimensions, minimizing the global cost in one dimension is not only practically possible, but efficiently computable, e.g. by dynamic programming. Similar to the correlation method, SGM is complemented by a left-right consistency check, sub-pixel disparity estimation and post-filtering. Multi view matching is optionally possible.
The quality of SGM disparity images is equally high as disparity images from global stereo methods, but SGM is several magnitudes faster than most global methods. Its complexity is the same as that of a correlation method (i.e. linear to the number of pixels and the disparity range). The runtime is higher than that of correlation, because each step is more complex and more temporary memory is required. The method has been described in various publications (Hirschmueller, 2008; Hirschmueller 2005).
Matching Costs
The comparison of windows or individual pixels requires to take radiometric differences (e.g. different brightness, noise, etc) into account. These differences are caused by the cameras due to slightly different settings, the vignetting effect, image noise, etc. Further differences may be caused by non-Lambertian surfaces, for which the amount of reflected light depends on the viewing angle. Another source of radiometric differences is that the positions or strength of the light sources may change when images are acquired at different times. For larger scenes, image acquisition will take some time and it may not be possible to control the light source (e.g., the sun outdoors).
It is important to note that some, but not all causes of radiometric differences could be pre-calibrated. Therefore, matching should be done tolerant to radiometric differences. An evaluation of several costs in combination with different stereo methods has already been published (Hirschmueller and Scharstein, 2007). A much larger study is currently under review in a journal.
Results
The improved correlation method with Laplacian-of-Gaussian pre-filtering, 5 partly overlapping windows with sum of absolute difference as cost, a left/right consistency check, small segment filtering and sub-pixel interpolation requires about 160 ms on images of size 640x480 with a disparity range of 128 on a 2.66 MHz PC.
Correlation is used for real-time applications, where speed is more important than image quality like the DLR crawler.
The SGM method using hierarchically calculated mutual information as matching cost, 8 paths with intensity gradient sensitive smoothness costs, left/right consistency checking, small segment filtering and sub-pixel interpolation requires about 2.2 s with fast and 3.2 s with full left/right consistency checking on images of size 640x480 with a disparity range of 128 on a 2.66 MHz PC. However, the method is currently being implemented on graphics hardware as well as an FPGA, for near-framerate and framerate stereo in these resolutions.
SGM is used in applications that demand higher reconstruction quality, like off-line 2.5D terrain modeling from huge aerial images, 3D modeling from flying (e.g. octocopter) or hand-held systems like the 3D modeler.
The images below show a standard test image (i.e. Teddy from the Middlebury evaluation) together with its ground truth disparity and the disparity images calculated by our correlation method and SGM.
Publications
Heiko Hirschmüller (2008), Stereo Processing by Semi-Global Matching and Mutual Information, in IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 30(2), February 2008, pp. 328-341.
Heiko Hirschmüller, Peter R. Innocent and Jon Garibaldi (2002), Real-Time Correlation-Based Stereo Vision with Reduced Border Errors, International Journal of Computer Vision, Volume 47 (1/2/3), April-June 2002, pp. 229-246.
Heiko Hirschmüller and Daniel Scharstein (2007), Evaluation of Cost Functions for Stereo Matching, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 18-23 June 2007, Minneapolis, Minnesota, USA.
Heiko Hirschmüller (2005), Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 20-26 June 2005, San Diego, CA, USA, Volume 2, pp. 807-814.
Heiko Hirschmüller (2003), Stereo Vision Based Mapping and Immediate Virtual Walkthroughs, Ph.D. Thesis, De Montfort University, Leicester, UK, June 2003.