Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

1HKUST(GZ) 2UC San Diego 3HKUST
*Equal contribution. Corresponding author.
arXiv Preprint 2025
[Teaser figure]

We present Lotus-2, a two-stage deterministic framework for monocular geometric dense prediction. Our method leverages a pre-trained generative model as a deterministic world prior to achieve new state-of-the-art accuracy while requiring remarkably little data (trained on only 0.66% of the samples used by MoGe-2). This figure demonstrates Lotus-2's robust zero-shot generalization with sharp geometric details, especially in challenging cases such as oil paintings and transparent objects.

Abstract

Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality, and diversity of available data, as well as by their limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions.

In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaptation protocol that fully exploits the pre-trained generative priors.

Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching.

Using only 59K training samples—less than 1% of existing large-scale datasets—Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.

Methodology

We argue that directly inheriting the stochastic generative formulation—which is optimized for image synthesis—introduces instability and unnecessary complexity for deterministic geometric tasks. Image synthesis aims at diverse and high-fidelity generation through stochastic multi-step sampling, whereas dense prediction requires deterministic and accurate inference. This fundamental misalignment results in high structural variance and significant prediction errors, thereby compromising overall accuracy. To better exploit the generative world priors, we propose a decoupled, two-stage adaptation protocol.

Core Predictor

Adaptation protocol of the core predictor in Lotus-2. It adopts a single-step formulation ($t=1$) with clean-data prediction to efficiently exploit the world priors of the pre-trained FLUX model, where the input latent $\mathbf{z_t}$ is equivalent to the image latent $\mathbf{z^x}$, i.e., $\mathbf{z_t}=\mathbf{z_1}=\mathbf{z^x}$ according to Eq. 11. In addition, since the diffusion Transformer $f_\theta$ inherited from FLUX is wrapped by a pair of Pack-Unpack operations, a local continuity module (LCM) $\Lambda$ is employed to mitigate the grid artifacts caused by the Unpack operation.

In the first stage, a core predictor extracts globally coherent and accurate geometry through a simple yet effective adaptation of the rectified-flow formulation in FLUX. By systematically analyzing the key designs of the stochastic generative formulation, including the stochasticity, multi-step sampling and parameterization type, we identify that a single-step deterministic formulation with clean-data prediction yields substantially more stable and accurate results than the original stochastic multi-step residual-based design. This single-step predictor is further enhanced with a lightweight local continuity module (LCM), which mitigates grid artifacts introduced by the non-parametric Pack-Unpack operations in FLUX while maintaining architectural compatibility and efficiency.
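To make this concrete, below is a minimal mathematical sketch of the single-step clean-data formulation, written only from the description above; the exact interpolation, parameterization and loss weighting in the paper (e.g., Eq. 11), as well as the precise placement of the LCM $\Lambda$, may differ. With the rectified-flow interpolation rewritten so that its endpoint at $t=1$ is the image latent rather than Gaussian noise,

$$\mathbf{z_t} = (1-t)\,\mathbf{z^y} + t\,\mathbf{z^x}, \qquad \mathbf{z_1} = \mathbf{z^x},$$

the core predictor is evaluated once at $t=1$ and trained with a clean-data objective against the ground-truth annotation latent $\mathbf{z^y}$:

$$\mathbf{\hat{z}^y} = \Lambda\big(\mathrm{Unpack}\big(f_\theta(\mathrm{Pack}(\mathbf{z^x}),\, t=1)\big)\big), \qquad \mathcal{L}_{\text{core}} = \big\lVert \mathbf{\hat{z}^y} - \mathbf{z^y} \big\rVert_2^2.$$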

Detail Sharpener

The training pipeline of detail sharpener. Starting from a structurally correct but coarse annotation predicted by the core predictor, the detail sharpener learns the transition from coarse to fine-grained annotation via a constrained multi-step rectified-flow within the manifold defined by the core predictor.

In the second stage, an optional detail sharpener performs detail refinement through a deterministic multi-step rectified-flow process. It operates within the constrained manifold defined by the core predictor and learns the transition from the "accurate" to the "accurate and fine-grained" annotation, progressively enriching geometric details while preserving global structure and accuracy. This design bridges the gap between regression and generative modeling: the former ensures structural stability and correctness, while the latter contributes fine-grained realism. Consequently, Lotus-2 effectively leverages the generative priors in a disciplined and interpretable manner, achieving both geometric consistency and high-frequency detail fidelity without sacrificing efficiency or stability.
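One plausible instantiation of this coarse-to-fine rectified flow, stated as our reading of the description above rather than the paper's exact equations: let $\mathbf{\hat{z}^{y_c}}$ be the coarse latent produced by the core predictor and $\mathbf{z^{y_f}}$ the fine-grained target latent, connected by a noise-free linear interpolation

$$\mathbf{z_\tau} = (1-\tau)\,\mathbf{z^{y_f}} + \tau\,\mathbf{\hat{z}^{y_c}}, \qquad \tau \in [0, 1],$$

whose constant velocity along the path is $\mathbf{\hat{z}^{y_c}} - \mathbf{z^{y_f}}$. The sharpener $v_\phi$ can then be trained with a flow-matching objective

$$\mathcal{L}_{\text{sharp}} = \big\lVert v_\phi(\mathbf{z_\tau}, \tau) - (\mathbf{\hat{z}^{y_c}} - \mathbf{z^{y_f}}) \big\rVert_2^2,$$

and at inference the flow is integrated from $\tau=1$ (coarse) to $\tau=0$ (fine) with a discrete Euler solver, $\mathbf{z_{\tau-\Delta\tau}} = \mathbf{z_\tau} - \Delta\tau\, v_\phi(\mathbf{z_\tau}, \tau)$, which is how we read the refinement referred to as Eq. 5.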

Inference


The inference pipeline of Lotus-2. It is a decoupled, two-stage deterministic pipeline that bridges regression and geometric refinement. First, the core predictor produces a stable and structurally consistent prediction via single-step regression. The detail sharpener then employs a constrained multi-step rectified-flow formulation to iteratively refine the prediction without any stochastic noise. The refinement uses $T_{\text{inf}}' \leq 10$ steps, adjustable based on the desired level of sharpness. This design ensures both structural consistency and fine-grained fidelity in minimal steps.

The complete inference process proceeds as follows (a code sketch of these steps is given after the list):

  1. The input image $\mathbf{x}$ is first encoded into the VAE latent space using the encoder $E$, yielding the image latent $\mathbf{z^x}$.
  2. The image latent $\mathbf{z^x}$ is passed through the core predictor to generate the accurate but coarse prediction $\mathbf{\hat{z}^{y_c}}$. This step guarantees global structural correctness and is performed with maximum efficiency (1 step).
  3. The coarse prediction $\mathbf{\hat{z}^{y_c}}$ is then fed into the detail sharpener to obtain the sharp and high-fidelity result $\mathbf{\hat{z}^{y_f}}$. This iterative refinement is achieved by the discrete Euler solver (Eq. 5). Note that this refinement is optional based on the desired level of sharpness.
  4. The final refined latent $\mathbf{\hat{z}^{y_f}}$ is decoded back to the pixel space using the VAE decoder $D$ to produce the final geometric prediction $\mathbf{\hat{y}}$.
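As referenced above, below is a minimal PyTorch-style sketch of this two-stage inference pipeline. The module names and call signatures (vae, core_predictor, sharpener and its cond argument, num_refine_steps) are hypothetical stand-ins rather than the released API; the sketch only mirrors the four steps listed above.

import torch

@torch.no_grad()
def lotus2_infer(x, vae, core_predictor, sharpener, num_refine_steps=10):
    """Two-stage deterministic inference sketch for Lotus-2 (hypothetical interfaces)."""
    # 1. Encode the input image into the VAE latent space: z^x = E(x).
    z_x = vae.encode(x)

    # 2. Single-step core prediction with clean-data parameterization at t = 1,
    #    where the input latent is the image latent itself (z_t = z_1 = z^x).
    t_one = torch.ones(z_x.shape[0], device=z_x.device)
    z_coarse = core_predictor(z_x, t=t_one)

    # 3. Optional detail sharpening: deterministic multi-step rectified flow
    #    from the coarse latent toward the fine-grained latent via an Euler
    #    solver, with no stochastic noise. Set num_refine_steps = 0 to skip.
    z = z_coarse
    if num_refine_steps > 0:
        taus = torch.linspace(1.0, 0.0, num_refine_steps + 1)
        for k in range(num_refine_steps):
            # Conditioning the sharpener on the image latent is an assumption here.
            v = sharpener(z, tau=taus[k], cond=z_x)       # predicted velocity
            z = z + (taus[k + 1] - taus[k]) * v           # Euler step (d_tau < 0)

    # 4. Decode the refined latent back to pixel space: y_hat = D(z^{y_f}).
    return vae.decode(z)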

Experiments

Quantitative comparison on zero-shot affine-invariant depth estimation between Lotus-2 and SoTA methods. The best and second best performances are highlighted. $^\S$ indicates results re-evaluated by us using the evaluation protocol of Marigold. $^\star$ denotes methods that rely on pre-trained text-to-image generative models. Our Lotus-2 achieves the best overall performance among all methods.

[Table: zero-shot affine-invariant depth estimation results]

Quantitative comparison on zero-shot surface normal estimation between Lotus-2 and SoTA methods. refers to the Marigold normal model as detailed in this link. $^\S$ indicates results re-evaluated by us using the evaluation protocol of DSINE. Our Lotus-2 demonstrates highly competitive quantitative performance, while crucially delivering robust and fine-grained qualitative results, as highlighted in Fig. 1.

[Table: zero-shot surface normal estimation results]

BibTeX

@article{he2025lotus,
    title={Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model},
    author={He, Jing and Li, Haodong and Sheng, Mingzhi and Chen, Ying-Cong},
    journal={arXiv preprint arXiv:2512.01030},
    year={2025}
}