We propose a lightweight, self-supervised adaptation method that allows a visual navigation agent to generalize to unseen environments. Given an embodied agent trained in a noiseless environment, our objective is to transfer the agent to a noisy environment where actuation and odometry sensor noise are present. Our method encourages the agent to maximize the consistency between the global maps generated at different time steps of a round-trip trajectory. The proposed task is completely self-supervised, requiring no ground-truth pose data or explicit noise model. In addition, optimizing the task objective is extremely lightweight, as training terminates within a few minutes on a commodity GPU. Our experiments show that the proposed task helps the agent successfully transfer to new, noisy environments. The transferred agent exhibits improved localization and mapping accuracy, which in turn leads to enhanced performance in downstream visual navigation tasks. Moreover, we demonstrate test-time adaptation with our self-supervised task to show its potential applicability in real-world deployment.
We can create a supervision signal by enforcing consistency of the generated global map. While we cannot guarantee that the map is error-free, we can assume that the error accumulates over time: maps generated in the earlier steps of a continuing trajectory incorporate more accurate pose estimates than those generated later. To implement the self-supervised learning task efficiently, we deliberately design overlapping trajectories by driving the embodied agent in round trips. The global map is first generated during the forward path; the agent then resets and generates another map from scratch during the backward path. In a noiseless setting, the agent observes the same area on each one-way trip, so the global map from its forward path should be identical to the global map generated during its backward path. When actuation noise is present, the agent may not step on the same waypoints it traversed during its forward path. Nonetheless, our formulation remains valid, as the agent generates maps from the overlapping area.
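To make the objective concrete, below is a minimal sketch of the round-trip consistency loss. It assumes a hypothetical mapper module exposing build_global_map and reset methods (these names are ours, not the paper's API), and it treats the forward-path map as the pseudo-target since earlier maps accumulate less pose error.

```python
import torch.nn.functional as F

def round_trip_consistency_loss(mapper, forward_obs, backward_obs):
    """Self-supervised loss between the global maps of a round trip.

    `mapper` is a hypothetical mapping-and-localization module whose
    `build_global_map` consumes a sequence of egocentric observations
    and returns a global occupancy map tensor of shape (C, H, W).
    """
    # Build the global map along the forward path.
    map_forward = mapper.build_global_map(forward_obs)

    # Reset the map and pose estimate, then rebuild the map from
    # scratch along the backward path.
    mapper.reset()
    map_backward = mapper.build_global_map(backward_obs)

    # Earlier maps accumulate less pose error, so use the forward map
    # as a fixed pseudo-target (stop gradients through it).
    target = map_forward.detach()

    # Penalize disagreement between the two maps of the same area.
    return F.mse_loss(map_backward, target)
```

In practice the comparison would be restricted to the region observed on both paths; that masking step is omitted here for brevity.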
While our vanilla formulation uses the two halves of a round-trip trajectory, the consistency can be enforced between any subsets of a trajectory with overlapping observations. We therefore introduce a data augmentation method based on random cropping, a technique widely used for data such as images and videos. A round-trip trajectory is augmented by randomly sampling three time steps in chronological order: the initial time t_1, the turn time t_o, and the end time t_2.
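As a sketch of this augmentation, the snippet below samples the three time steps uniformly at random and keeps them in chronological order; the uniform sampling scheme and how the sampled steps relate to the physical turning point of the trajectory are assumptions made here for illustration.

```python
import random

def sample_augmented_crop(episode_length):
    """Sample three time steps t_1 < t_o < t_2 from a round-trip episode.

    The segment [t_1, t_o] then plays the role of the forward path and
    [t_o, t_2] the backward path when enforcing map consistency.
    """
    t1, t_o, t2 = sorted(random.sample(range(episode_length + 1), 3))
    return t1, t_o, t2
```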
We extensively evaluate the performance of a pre-trained embodied agent when adapted to a new environment with our self-supervised learning task. We first investigate whether the proposed task helps agents transfer to a new, noisy environment. We pre-train an agent in a noiseless environment, where the mapping and localization module is trained with ground-truth pose and egocentric maps. We then observe whether our self-supervision helps the agent generalize across various unseen noisy environments without ground-truth supervision. For the unseen noisy environments, we apply odometry and actuation noise models based on real data collected from a LoCoBot. For each episode, we report the localization error and the mean squared error (MSE) of the generated occupancy grid maps with respect to the ground-truth maps. In our paper, we report results on exploration, since it is the fundamental task underlying most navigation agents.
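For reference, here is a minimal sketch of how the two reported metrics could be computed; the exact pose components and map conventions are assumptions and may differ from the paper's evaluation code.

```python
import numpy as np

def localization_error(pred_poses, gt_poses):
    """Mean Euclidean distance between estimated and ground-truth 2D
    positions over an episode (assumes poses given as (x, y, theta) rows)."""
    pred = np.asarray(pred_poses, dtype=np.float32)[:, :2]
    gt = np.asarray(gt_poses, dtype=np.float32)[:, :2]
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def map_mse(pred_map, gt_map):
    """Mean squared error between the predicted occupancy grid and the
    ground-truth map (assumed to share resolution and spatial extent)."""
    pred = np.asarray(pred_map, dtype=np.float32)
    gt = np.asarray(gt_map, dtype=np.float32)
    return float(np.mean((pred - gt) ** 2))
```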
@inproceedings{lee2022self,
  title     = {Self-Supervised Domain Adaptation for Visual Navigation with Global Map Consistency},
  author    = {Lee, Eun Sun and Kim, Junho and Kim, Young Min},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages     = {1707--1716},
  year      = {2022}
}