We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input -- in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered from 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep L4GM simple for scalability and build directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and utilize a per-timestep multiview rendering loss to train the model. The representation is upsampled to a higher framerate by training an interpolation model that produces intermediate 3D Gaussian representations. We showcase that L4GM, although trained only on synthetic data, generalizes extremely well to in-the-wild videos, producing high-quality animated 3D assets.
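To illustrate the temporal self-attention described above, here is a minimal sketch in PyTorch. The block structure, tensor layout, and all names are assumptions for exposition, not the authors' implementation; the only property it demonstrates is attention restricted to the time axis, so each multiview spatial token exchanges information with itself across frames.

```python
# Illustrative sketch (not the authors' code): a temporal self-attention
# block operating on features shaped (B, T, V, N, C) -- batch, timesteps,
# views, tokens per view, channels. Only the T axis is attended over.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fold every axis except time into the batch dimension.
        B, T, V, N, C = x.shape
        h = x.permute(0, 2, 3, 1, 4).reshape(B * V * N, T, C)
        h = self.norm(h)
        out, _ = self.attn(h, h, h)  # each token attends across timesteps
        out = out.reshape(B, V, N, T, C).permute(0, 3, 1, 2, 4)
        return x + out               # residual connection

block = TemporalSelfAttention(dim=512)
feats = torch.randn(2, 8, 4, 256, 512)  # 2 clips, 8 frames, 4 views, 256 tokens
print(block(feats).shape)               # torch.Size([2, 8, 4, 256, 512])
```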
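The per-timestep multiview rendering loss can likewise be sketched under stated assumptions: the abstract only says that renderings of each frame's Gaussian representation are supervised against ground-truth multiview frames, and LGM trains with pixel and perceptual terms, so the renderer (`render_fn`), the camera layout, and the MSE+LPIPS combination below are illustrative choices, not the paper's exact formulation.

```python
# Hedged sketch of a per-timestep multiview rendering loss. `render_fn` is a
# caller-supplied Gaussian splatting renderer and `lpips_fn` a perceptual
# metric (e.g. from the `lpips` package); both are assumptions here.
import torch
import torch.nn.functional as F

def per_timestep_rendering_loss(gaussians_per_t, cameras, gt_video,
                                render_fn, lpips_fn):
    """gaussians_per_t: list of T per-frame Gaussian parameter sets.
    cameras:  (T, V) camera parameters, V supervision views per timestep.
    gt_video: (T, V, 3, H, W) ground-truth multiview frames in [0, 1]."""
    total = 0.0
    T, V = gt_video.shape[:2]
    for t in range(T):               # loss is applied at every timestep
        for v in range(V):           # and every supervision viewpoint
            pred = render_fn(gaussians_per_t[t], cameras[t, v])  # (3, H, W)
            gt = gt_video[t, v]
            total = total + F.mse_loss(pred, gt) \
                          + lpips_fn(pred[None], gt[None]).mean()
    return total / (T * V)
```

Because every timestep is supervised independently across views, the model is free to produce a distinct set of Gaussians per frame while the temporal self-attention layers keep those sets consistent with one another.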