We present Normal Field Learning (NFL), a robust yet practical solution for perceiving the 3D layout of transparent objects and grasping them quickly. Conventional input modalities for vision-based grasping do not provide sufficient information for transparent objects. However, with recent advances in datasets and algorithms for transparent objects, we can at least obtain noisy estimates of surface normals and object masks under various real-world conditions. Instead of directly using the RGB images, we propose to use these estimates to train a neural volume, which serves as an intermediate representation agnostic to challenging appearance variations. We formulate the training objective to account for the inherent uncertainty in individual estimates, and together with volumetric aggregation, we can reliably extract useful geometric information for grasping. Our neural volume deploys a voxel-grid-based representation, motivated by acceleration techniques for neural radiance fields. However, we store the normal and density values directly in the grid cells instead of latent features. This modification allows direct access to the geometric values without additional network inference or volume rendering, further enhancing efficiency. Our results show over 85% grasp success rates in cluttered scenes with only 40 seconds of training time.
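Because normals and densities are stored directly in the cells, querying the volume reduces to a single trilinear interpolation with no MLP forward pass or rendering step. Below is a minimal PyTorch sketch of this idea; the class and method names (NormalDensityGrid, query) and the resolution are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

class NormalDensityGrid(torch.nn.Module):
    """Voxel grid storing raw normal and density values per cell, so a
    geometric query is just trilinear interpolation (no network, no rendering)."""

    def __init__(self, resolution=128, bound=0.5):
        super().__init__()
        self.bound = bound  # scene assumed to lie inside [-bound, bound]^3
        # 4 channels per cell: (nx, ny, nz, sigma)
        self.grid = torch.nn.Parameter(
            torch.zeros(1, 4, resolution, resolution, resolution))

    def query(self, xyz):
        """xyz: (N, 3) world points -> (unit normals (N, 3), densities (N,))."""
        # Map points into grid_sample's [-1, 1] range; note grid_sample expects
        # the last coordinate dim ordered as (x, y, z) -> (W, H, D).
        coords = (xyz / self.bound).view(1, -1, 1, 1, 3)
        feats = F.grid_sample(self.grid, coords, align_corners=True)  # trilinear
        feats = feats.view(4, -1).t()                 # (N, 4)
        normals = F.normalize(feats[:, :3], dim=-1)   # renormalize to unit length
        densities = F.softplus(feats[:, 3])           # keep density non-negative
        return normals, densities
```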
The inputs to probabilistic normal field learning are pixel-wise surface-normal estimates, modeled as von Mises-Fisher distributions, and estimated object masks, modeled as Bernoulli distributions. The output is a 3D normal field in which each point is mapped to a normal vector n and a density σ. From the normal field, we sample reliable grasps, among which we select one that induces a collision-free trajectory (green grasps).
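To make the probabilistic modeling concrete, here is a hedged sketch of per-pixel losses consistent with the stated distributions: a von Mises-Fisher negative log-likelihood for normals (so the estimator's concentration κ acts as a per-pixel confidence weight) and a Bernoulli cross-entropy for masks. The function names and the way predictions are paired with estimates are our own assumptions for illustration, not the paper's exact objective.

```python
import math
import torch
import torch.nn.functional as F

def vmf_nll(pred_normal, est_mean, est_kappa):
    """NLL of predicted unit normals under a 3D von Mises-Fisher distribution
    with mean est_mean and concentration est_kappa from the normal estimator.
    For p=3: log C(kappa) = log kappa - log(4*pi) - log sinh(kappa)."""
    dot = (pred_normal * est_mean).sum(dim=-1)  # cosine alignment, weighted by kappa
    # Numerically stable log sinh(k) = k + log1p(-exp(-2k)) - log 2.
    log_sinh = est_kappa + torch.log1p(-torch.exp(-2 * est_kappa)) - math.log(2)
    log_c = torch.log(est_kappa) - math.log(4 * math.pi) - log_sinh
    return -(est_kappa * dot + log_c)

def mask_bce(pred_occupancy, est_mask_prob):
    """Bernoulli cross-entropy between accumulated occupancy along a ray and
    the estimated per-pixel mask probability."""
    return F.binary_cross_entropy(pred_occupancy, est_mask_prob)
```

Note that the data term reduces to a κ-weighted cosine loss: confident estimates (large κ) pull the field strongly toward their direction, while uncertain ones contribute little.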
Qualitative results across various scenes. We visualize the geometric representations that the different methods use for grasping (normal fields for ours, depth images for the baselines). Our method stably produces normal fields for the real-world, Dex-NeRF, and Blender scenes. GraspNeRF recovers the ground plane but occasionally fails to reconstruct the object geometry.
We experiment in a real-world setup with glass-textured objects. For capturing images, we use a RealSense D435i camera mounted on a Franka Emika Panda robot arm. Capturing takes up to 1 second per image.
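For reference, grabbing a single color frame from the D435i with pyrealsense2 looks roughly like the sketch below; the stream resolution and frame rate are illustrative defaults, not the settings used in our experiments.

```python
import numpy as np
import pyrealsense2 as rs

# Configure and start a color stream on the D435i (illustrative settings).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()              # blocks until a frameset arrives
    color = np.asanyarray(frames.get_color_frame().get_data())  # HxWx3 uint8 image
finally:
    pipeline.stop()
```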
The image above shows the quality of the reconstructed geometry on our Blender dataset. The top row depicts the rendered depth and the bottom row the corresponding error map (blue: low error). NFL outperforms the other methods.
The image above shows depth error maps per input modality (blue: low error). For transparent glass-textured objects, the combination of normal and mask inputs produces the best result. Notably, RGB inputs fail for both DVGO and NeRF.
@ARTICLE{10328050,
  author={Lee, Junho and Kim, Sang Min and Lee, Yonghyeon and Kim, Young Min},
  journal={IEEE Robotics and Automation Letters},
  title={NFL: Normal Field Learning for 6-DoF Grasping of Transparent Objects},
  year={2024},
  volume={9},
  number={1},
  pages={819-826},
  keywords={Grasping;Three-dimensional displays;Estimation;Training;Rendering (computer graphics);Cameras;Uncertainty;Deep Learning for Visual Perception;Deep Learning in Grasping and Manipulation;Grasping},
  doi={10.1109/LRA.2023.3336108}
}