Advances in 3D scene reconstruction have made it possible to turn real-world 2D images into realistic 3D models, given hundreds of input photos. Despite this success in dense-view settings, rendering a detailed scene from only a few captured views remains an ill-posed optimization problem, often producing artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, video frames generated directly by pre-trained models struggle to preserve accurate 3D view consistency. To address this, given limited input views, ReconX first constructs a global point cloud and encodes it into a contextual space that serves as a 3D structure condition. Guided by this condition, the video diffusion model synthesizes video frames that preserve fine detail and exhibit a high degree of 3D consistency, ensuring the coherence of the scene across viewpoints. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets demonstrate the superiority of ReconX over state-of-the-art methods in both quality and generalizability.
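The abstract only sketches the pipeline, so as a rough illustration of the final step, the snippet below shows one plausible form of a confidence-aware photometric loss for the 3D Gaussian Splatting optimization: each pixel of a diffusion-generated frame contributes to the loss in proportion to a per-pixel confidence map. The function name, the L1 form of the loss, and the way confidence enters the loss are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch (not the paper's code): a confidence-weighted photometric
# loss for optimizing 3D Gaussians against diffusion-generated frames.
# Assumption: "confidence-aware" means each pixel's L1 error is scaled by a
# per-pixel confidence in [0, 1] estimated for the generated frame.
import torch

def confidence_weighted_l1(rendered: torch.Tensor,
                           generated: torch.Tensor,
                           confidence: torch.Tensor) -> torch.Tensor:
    """rendered, generated: (B, 3, H, W) images; confidence: (B, 1, H, W) in [0, 1]."""
    per_pixel = torch.abs(rendered - generated)        # photometric error per pixel
    weights = confidence.expand_as(per_pixel)          # broadcast confidence over channels
    # Normalized weighted mean: low-confidence regions contribute less to the loss.
    return (weights * per_pixel).sum() / weights.sum().clamp_min(1e-8)

# Toy usage: gradients flow to the rendered image (stand-in for the 3DGS renderer output).
rendered = torch.rand(1, 3, 64, 64, requires_grad=True)
generated = torch.rand(1, 3, 64, 64)
confidence = torch.rand(1, 1, 64, 64)
loss = confidence_weighted_l1(rendered, generated, confidence)
loss.backward()
```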