Latent scene representation plays a significant role in training reinforcement learning (RL) agents. To obtain good latent vectors describing the scenes, recent works incorporate the 3D-aware latent-conditioned NeRF pipeline into scene representation learning. However, these NeRF-based methods struggle to perceive 3D structural information because of the inefficient dense sampling in volumetric rendering. Moreover, their scene representation vectors lack fine-grained semantic information because they treat free and occupied space equally. Both issues can degrade the performance of downstream RL tasks. To address these challenges, we propose a novel framework that, for the first time, adopts the efficient 3D Gaussian Splatting (3DGS) to learn 3D scene representations. In brief, we present the Query-based Generalizable 3DGS to bridge the 3DGS technique and scene representations with stronger geometric awareness than NeRF-based ones. Moreover, we present the Hierarchical Semantics Encoding to ground fine-grained semantic features to the 3D Gaussians and further distill them into the scene representation vectors. We conduct extensive experiments on two RL platforms, Maniskill2 and Robomimic, across 10 different tasks. The results show that our method outperforms the other 5 baselines by a large margin: we achieve the best success rate on 8 tasks and the second best on the remaining two.