Abstract
Ñîäåðæàíèå
Introduction
Given a single image of an every day object, a sculp- tor can recreate its 3D shape (i.e., produce a statue of the object), even if the particular object has never been seen be- fore. Presumably, it is familiarity with the shapes of similar 3D objects (i.e., objects from the same class) and how they appear in images, which enables the artist to estimate its shape. This might not be the exact shape of the object, but it is often a good enough estimate for many purposes. Mo- tivated by this example, we propose a novel framework for example based reconstruction of shapes from single images. In general, the problem of 3D reconstruction from a sin- gle 2D image is ill posed, since different shapes may give rise to the same intensity patterns. To solve this, additional constraints are required. Here, we constrain the recon- struction process by assuming that similarly looking objects from the same class (e.g., faces, fish), have similar shapes. We maintain a set of 3D objects, selected as examples of a specific class. We use these objects to produce a database of images of the objects in the class (e.g., by standard ren- dering techniques), along with their respective depth maps. These provide examples of feasible mappings from intensi- ties to shapes and are used to estimate the shapes of objects in query images. Our input image often contains a novel object. It is therefore unlikely that the exact same image exists in our database. We therefore devise a method which utilizes the examples in the database to produce novel shapes. To this end we extract portions of the image (i.e., image patches) and seek similar intensity patterns in the example database. Matching database intensity patterns suggest possible re- constructions for different portions of the image. We merge these suggested reconstructions together, to produce a co- herent shape estimate. Thus, novel shapes are produced by composing different parts of example objects. We show how this scheme can be cast as an optimization process, pro- ducing the likeliest reconstruction in a graphical model. A major obstacle for example based approaches is the limited size of the example set. To faithfully represent a class, many example objects might be required to account for variability in posture, texture, etc. In addition, unless the viewing conditions are known in advance, we may need to store for each object, images obtained under many con- ditions. This can lead to impractical storage and time re- quirements. Moreover, as the database becomes larger so does the risk of false matches, leading to degraded recon- structions. We therefore propose a novel example update scheme. As better estimates for the depth are available, we generate better examples for the reconstruction on-the-fly. We are thus able to demonstrate reconstructions under un- known views of objects from rich object classes. In addi- tion, to reduce the number of false matches we encourage the process to use example patches from corresponding se- mantic parts by adding location based constraints. Unlike existing example based reconstruction methods, which are restricted to classes of highly similar shapes (e.g., faces [3]) our method produces reconstructions of objects belonging to a variety of classes (e.g. hands, human figures).
1. Theme urgency
We note that the data sets used in practice do not guarantee the presence of objects sufficiently similar to the query, for accurate reconstructions. Our goal is therefore to produce plausible depth estimates and not necessarily true depths. However, we show that the estimates we obtain are often convincing enough. The method presented here allows for depth reconstruc- tion under very general conditions and requires little, if any, calibration. Our chief requirement is the existence of a 3D object database, representing the object class. We believe this to be a reasonable requirement given the growing avail- ability of such databases. We show depth from single im- age results for a variety of object classes, under a variety of imaging conditions. In addition, we demonstrate how our method can be extended to obtain plausible depth estimates of the back side of an imaged object.
2. Related work
Methods for single image reconstruction commonly use cues such as shading, silhouette shapes, texture, and vanish- ing points [5, 6, 12, 16, 28]. These methods restrict the al- lowable reconstructions by placing constraints on the prop- erties of reconstructed objects (e.g., reflectance properties, viewing conditions, and symmetry). A few approaches ex- plicitly use examples to guide the reconstruction process. One approach [14, 15] reconstructs outdoor scenes assum- ing they can be labelled as "ground," "sky," and "verti- cal" billboards. A second notable approach makes the asble to the state of the art in texture synthesis.
3. Global optimization scheme
We produce depth for a query image in a manner rem- iniscent of example-based texture synthesis methods [10, 25]. Later publications have suggested additional appli- cations for these synthesis schemes [8, 9, 13]. We note in particular, the connection between our method, and Im- age Analogies [13]. Using their jargon, taking the pair A and A' to be the database image and depth, and B to be the query image, B', the synthesized result, would be the query's depth estimate. Their method, however, cannot be used to recover depth under an unknown viewing position, nor handle large data sets. The optimization method we use here is motivated by the method introduced by [26] for im- age and video hole-filling, and [18] for texture synthesis. In [18] this optimization method was shown to be compara- {(Ii , Di )}i=1 , where Ii and Di respectively are the image and the depth map of an object from the class. For simplic- ity we assume first that all the images in the database con- tain objects viewed in the same viewing position as in the query image. We relax this requirement later in Sec. 3.2. Our process attempts to associate a depth map D to the query image I , such that every patch of mappings in M = (I, D) will have a matching counterpart in S. We call such a depth map a plausible depth estimate. Our basic approach to obtaining such a depth is as follows (see also Fig. 1). At every location p in I we consider a k ? k window around p. For each such window, we seek a matching window in the database with a similar intensity pattern in the least squares sense (Fig. 1.(i)). Once such a window is found we extract its corresponding k ? k depths. We do this for all pixels in I , matching overlapping intensity patterns and obtaining k2 best matching depth estimates for every pixel. The depth value at every p is then determined by taking an average of these k2 estimates (Fig. 1.(ii)).
The optimization process is further modified as follows: Multi-scale processing. The optimization is performed in a multi-scale pyramid representation of M . This both speeds convergence and adds global information to the process. Starting at the coarsest scale, the process iterates until con- vergence of the depth component. Final coarse scale selec- tions are then propagated to the next, finer scale (i.e., by multiplying the coordinates of the selected patches by 2), where intensities are then sampled from the finer scale ex- ample mappings. Fig. 3 demonstrates some intermediate depth estimates, from different scales. Approximate nearest neighbor (ANN) search. The most time consuming step in our algorithm is seeking a matching database window for every pixel in getSimilarP atches. We speed this search by using a sub-linear approximate nearest neighbor search [1]. This does not guarantee finding the most similar patches V , however, we have found the op- timization robust to these approximations, and the speedup to be substantial.
Patch examples are now regularly used in many appli- underlying assumption behind these methods is that class variability can be captured by a finite, often small, set of ex- amples. This is often true, but when the class contains non- rigid objects, objects varying in texture, or when viewing conditions are allowed to change, this can become a prob- lem. Adding more examples to allow for more variability (e.g. rotations of the input image in [8]), implies larger stor- age requirements, longer running times, and higher risk of false matches. In this work, we handle non-rigid objects (e.g. hands), objects which vary in texture (e.g. the fish) and can be viewed from any direction. Ideally, we would like our examples to be objects whose shape is similar to that of the object in the input image, viewed under similar con- ditions. This, however, implies a chicken-and-egg problem as reconstruction requires choosing similar objects for our database, but for this we first need a reconstruction. We thus propose the idea of online example set update. Instead of committing to a fixed database at the onset of re- construction, we propose updating the database on-the-fly during processing. We start with an initial seed database of examples. In subsequent iterations of our optimization we drop the least used examples Mi from our database, replac- ing them with ones deemed better for the reconstruction. These are produced by on-the-fly rendering of more suit- able 3D objects with better viewing conditions. In our ex- periments, we applied this idea to search for better example objects and better viewing angles. Other parameters such as lighting conditions can be similarly resolved. Note that this implies a potentially infinite example database (e.g. infinite views), where only a small relevant subset is used at any one time. We next describe the details of our implementation.
Searching for the best views. Fig. 4 demonstrates a re- construction using images from a single incorrect viewing angle (Fig. 4.a) and four fixed widely spaced viewing an- gles (Fig. 4.b). Both are inadequate. It stands to reason that mappings from viewing angles closer to the real one, will contribute more patches to the process than those further away. We thus adopt the following scheme. We start with a small number of pre-selected views, sparsely covering parts of the viewing sphere (the gray cameras in Fig. 4.c). The seed database S is produced by taking the mappings Mi of our objects, rendered from these views, and is used to obtain an initial depth estimate. In subsequent iterations, we re- estimate our views by taking the mean of the currently used angles, weighted by the relative number of patches selected from each angle. We then drop from S mappings originat- ing from the least used angle, and replace them with ones from the new view. If the new view is sufficiently close to one of the remaining angles, we instead increase the num- ber of objects to maintain the size of S. Fig. 4.c presents a result obtained with our angle update scheme.
Conclusion
This paper presents a method for reconstructing depth from a single image by using example mappings from ap- pearance to depth. We show that not only is the method applicable to many diverse object classes, and images acquired under a range of conditions, it can additionally be used for the novel task of estimating the depth of the back of the object in the image. We further address the problems of global structure preservation, handling large databases, and viewing angles estimation. We believe our method can be extended in many ways. First, the process itself can be improved to produce better results, consuming less time and memory. Second, we be- lieve it will be interesting to examine if this approach can be extended to handle other vision related problems, in partic- ular to combine segmentation with the reconstruction (e.g. by adding binary segmentations to the example mappings.)
References
- S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate near- est neighbor searching in fixed dimensions. Journal of the ACM, 45(6), 1998. Source code available from
- J. Atick, P. Griffin, and A. Redlich. Statistical approach to shape from shading: Reconstruction of three-dimensional face surfaces from single two-dimensional images. Neural Computation, 8(6):1321–1340, 1996.
- V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, 1999. 1, 2
- R. Dovgard and R. Basri. Statistical symmetric shape from shading for 3D structure recovery of faces. In ECCV, 2004.
- I. Drori, D. Cohen-Or, and H. Yeshurun. Fragment-based image completion. In SIGGRAPH, 2003. 2, 4
- Avedillo M.J. SMAS: a program for the concurrent state reduction and state assignment of finite state machines / M.J. Avedillo, J.M. Quintana, J.L. Huertas // Proceedings of IEEE International Symposium on Circuits and Systems. – 1991. – vol. 3. – pp. 1781-1784.
- M. Kearns, Y. Mansour, and A. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In UAI, 1997. 3, 8
- Goren S. CHESMIN: a heuristic for state reduction in incompletely specified finite state machines / S. Goren, F. Ferguson // Proceedings of the Conference on Design, Automation and Test in Europe. – 2002. – pp. 248-254.
- Higuchi H. A fast state reduction algorithm for incompletely specified finite state machines / H. Higuchi, Y. Matsunaga // 33rd Annual Conference of Design Automation. – Las Vegas, 1996. – pp. 463-466.
- R. Osadchy, M. Miller, and Y. LeCun. Synergistic face detec- tion and pose estimation with energy-based model. In NIPS, 2004. 4