Learning to be a Depth Camera
for Close-Range Human Capture and Interaction
Sean Ryan Fanello1,2 Cem Keskin1 Shahram Izadi1 Pushmeet Kohli1 David Kim1 David Sweeney1 Antonio Criminisi1 Jamie Shotton1 Sing Bing Kang1 Tim Paek1
1Microsoft Research 2iCub Facility - Istituto Italiano di Tecnologia
Figure 1: (a, b) Our approach turns any 2D camera into a cheap depth sensor for close-range human capture and 3D interaction scenarios. (c, d) Simple hardware modifications allow actively illuminated near infrared images to be captured from the camera. (e, f) This is used as input into our machine learning algorithm for depth estimation. (g, h) Our algorithm outputs dense metric depth maps of hands or faces in real-time.
We present a machine learning technique for estimating absolute, per-pixel depth using any conventional monocular 2D camera, with minor hardware modifications. Our approach targets close-range human capture and interaction where dense 3D estimation of hands and faces is desired. We use hybrid classification-regression forests to learn how to map from near infrared intensity images to absolute, metric depth in real-time. We demonstrate a variety of human-computer interaction and capture scenarios. Experiments show an accuracy that outperforms a conventional light fall-off baseline, and is comparable to high-quality consumer depth cameras, but with a dramatically reduced cost, power consumption, and form-factor.
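The light fall-off baseline mentioned above can be made concrete. Under the assumption of a point light co-located with the camera and a roughly constant-albedo surface, observed near infrared intensity decays with the inverse square of distance, so depth can be recovered as z ≈ k / √I for a calibration constant k. The sketch below illustrates this baseline only (not the paper's learned method); the function name and calibration constant are illustrative assumptions.

```python
import numpy as np

def depth_from_falloff(intensity, k=1.0):
    """Inverse-square fall-off baseline: I ~ 1/z^2 implies z = k / sqrt(I).

    `intensity` is a scalar or array of NIR intensities; `k` is a
    calibration constant fixed from one pixel of known depth.
    """
    intensity = np.clip(np.asarray(intensity, dtype=np.float64), 1e-6, None)
    return k / np.sqrt(intensity)

# Calibrating k from a single known depth: a pixel at 0.5 m that reads
# intensity 4.0 gives k = 0.5 * sqrt(4.0) = 1.0.
```

This baseline ignores surface orientation and albedo variation, which is precisely why a learned per-pixel mapping can outperform it.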
While range sensing technologies have existed for a long time, consumer depth cameras such as the Microsoft Kinect have begun to make real-time depth acquisition a commodity. This in turn has opened up many exciting new applications for gaming, 3D scanning and fabrication, natural user interfaces, augmented reality, and robotics. One important domain where depth cameras have had clear impact is human-computer interaction. In particular, the ability to reason about the 3D geometry of the scene makes the sensing of whole bodies, hands, and faces more tractable than with regular cameras, allowing these modalities to be leveraged for high degree-of-freedom (DoF) input.
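The general idea of learning a per-pixel mapping from near infrared intensity to metric depth can be sketched as follows. This is a hedged illustration only: it uses a generic random forest from scikit-learn on raw intensity patches and synthetic fall-off data, not the paper's hybrid classification-regression forest or its calibrated NIR/depth training pairs; `extract_patches` and all parameters are assumptions made for the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def extract_patches(img, radius=2):
    """Flatten a (2r+1)x(2r+1) intensity patch around each interior pixel."""
    h, w = img.shape
    feats, coords = [], []
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            feats.append(img[y - radius:y + radius + 1,
                             x - radius:x + radius + 1].ravel())
            coords.append((y, x))
    return np.array(feats), coords

rng = np.random.default_rng(0)
# Synthetic stand-in for calibrated NIR/depth pairs: intensity follows
# inverse-square fall-off of the true depth, plus sensor noise.
depth = rng.uniform(0.2, 0.8, size=(16, 16))
nir = 1.0 / depth**2 + rng.normal(0.0, 0.05, size=depth.shape)

X, coords = extract_patches(nir)
y = np.array([depth[c] for c in coords])

# A generic regression forest learns the intensity-to-depth mapping.
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
pred = forest.predict(X)
```

Per-pixel forest evaluation is what makes real-time dense prediction feasible: each pixel's depth is an independent, cheap tree traversal.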
CR Categories: I.3.7 [Computer Graphics]: Digitization and Image Capture—Applications I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Range Data
Whilst depth cameras are becoming more of a commodity, they have yet to (and arguably will never) surpass the ubiquity of regular 2D cameras, which are now used in the majority of our mobile devices and desktop computers. More widespread adoption of depth cameras is limited by considerations including power, cost, and form-factor. Sensor miniaturization is therefore a key recent focus, as demonstrated by Intel, Primesense, PMD, and Pelican Imaging, and exemplified by Google's Project Tango. However, the need for custom sensors, high-power illumination, complex electronics, and other physical constraints (e.g. a baseline between the illumination and sensor) will often limit scenarios of use, particularly when compared to regular cameras. Even if these issues are addressed, there remain many legacy devices which contain only 2D cameras.
Keywords: learning, depth camera, acquisition, interaction
ACM Reference Format
Fanello, S., Keskin, C., Izadi, S., Kohli, P., Kim, D., Sweeney, D., Criminisi, A., Shotton, J., Kang, S., Paek, T. 2014. Learning to be a Depth Camera for Close-Range Human Capture and Interaction. ACM Trans. Graph. 33, 4, Article 86 (July 2014), 11 pages. DOI = 10.1145/2601097.2601223 http://doi.acm.org/10.1145/2601097.2601223.