Instant 3D Vision: Apple’s Depth Pro Delivers High-Precision Depth Maps in 0.3 Seconds
Monocular depth estimation, the task of estimating depth from a single image, holds tremendous potential. It can add a third dimension to any image, regardless of when or how it was captured, without requiring specialized hardware or additional data. In recent years, zero-shot monocular depth estimation has become the foundation for a range of applications, including advanced image editing, view synthesis, and conditional image generation.
In a new paper, Depth Pro: Sharp Monocular Metric Depth in Less Than a Second, an Apple research team introduces Depth Pro, a state-of-the-art foundation model for zero-shot metric monocular depth estimation. The model generates high-resolution depth maps with sharp, finely detailed output, producing a 2.25-megapixel depth map in just 0.3 seconds on a standard GPU.
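To put the headline figures in perspective, the short Python sketch below works out the throughput they imply, roughly 7.5 million depth values per second. It is a back-of-the-envelope calculation, not a benchmark. The helper names are hypothetical, a megapixel is assumed to mean one million pixels, and the square-map side length is purely illustrative, since the article does not state the output's dimensions.

```python
# Back-of-the-envelope throughput implied by the figures quoted in the article.
MEGAPIXELS = 2.25   # reported depth-map size
SECONDS = 0.3       # reported inference time on a standard GPU


def pixels_per_second(megapixels: float, seconds: float) -> float:
    """Depth values produced per second, assuming 1 MP = 1,000,000 pixels."""
    return megapixels * 1_000_000 / seconds


def side_of_square(megapixels: float) -> int:
    """Side length of a hypothetical square map with that many pixels."""
    return round((megapixels * 1_000_000) ** 0.5)


throughput = pixels_per_second(MEGAPIXELS, SECONDS)
side = side_of_square(MEGAPIXELS)

print(f"{throughput:,.0f} depth values per second")
print(f"about {side} x {side} pixels if the map were square")
```

The same two functions can be reused to compare other resolution and latency pairs on equal footing.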
Depth Pro’s architecture is built on plain Vision Transformer (ViT) encoders (Dosovitskiy et al., 2021) that process patches of the image at multiple scales. These patch predictions are then merged into a single, high-resolution depth map within an…