Oracle-Guided Masked Contrastive Reinforcement Learning for Drone Visuomotor Policies

1Nanyang Technological University, Singapore
2Nanjing University of Aeronautics and Astronautics, China
Framework Overview


The framework of OMC-RL. Upstream training (top): A masked contrastive learning module learns compact, task-relevant visual representations from sequential RGB inputs. The masked and original inputs are processed by CNN encoders and projection layers, and the masked branch is further refined by an auxiliary Transformer module to compute the contrastive loss L_cl. After training, the CNN encoder is frozen and the Transformer module is discarded. Downstream training (bottom): An RL agent is trained on top of the frozen encoder from the upstream stage. An oracle policy trained with privileged depth inputs and full-state information provides expert supervision; the agent policy is optimized via a KL-divergence term against the oracle policy distribution to enable efficient visuomotor policy learning.
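A minimal PyTorch sketch of what this upstream stage could look like is given below. The module names, network sizes, masking scheme, and the InfoNCE-style form of L_cl are illustrative assumptions for exposition, not the paper's exact implementation: masked and original frame sequences pass through a shared CNN encoder and projection head, the masked branch is refined by a Transformer over the temporal dimension, and a contrastive loss pairs each masked-branch embedding with the embedding of its unmasked frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpstreamEncoder(nn.Module):
    """CNN encoder + projection head applied frame-wise to an RGB sequence (hypothetical sizes)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, frames):                         # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        z = self.cnn(frames.flatten(0, 1))             # (B*T, 64)
        return self.proj(z).view(b, t, -1)             # (B, T, embed_dim)

def masked_contrastive_loss(encoder, transformer, frames, masked_frames, temperature=0.1):
    """InfoNCE-style stand-in for L_cl: masked-branch embeddings attract their unmasked counterparts."""
    z_orig = encoder(frames)                           # targets from the original (unmasked) branch
    z_mask = transformer(encoder(masked_frames))       # masked branch refined by the Transformer
    q = F.normalize(z_mask.flatten(0, 1), dim=-1)
    k = F.normalize(z_orig.flatten(0, 1), dim=-1)
    logits = q @ k.t() / temperature                   # positives lie on the diagonal
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Example usage (shapes are illustrative): 8 clips of 4 frames at 64x64 resolution.
encoder = UpstreamEncoder()
transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=2)
frames = torch.randn(8, 4, 3, 64, 64)
masked = frames * (torch.rand_like(frames[:, :, :1]) > 0.5)  # crude random mask as a stand-in
loss = masked_contrastive_loss(encoder, transformer, frames, masked)
```

After this stage, only `encoder` would be kept (frozen) for downstream policy learning; the `transformer` exists solely to shape the representation during pretraining and is discarded, matching the caption above.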

Abstract

A prevailing approach to learning visuomotor policies for drones is to employ reinforcement learning to map high-dimensional visual observations directly to action commands. However, the combination of high-dimensional visual inputs and agile maneuver outputs leads to long-standing challenges, including low sample efficiency and significant sim-to-real gaps. To address these issues, we propose Oracle-Guided Masked Contrastive Reinforcement Learning (OMC-RL), a novel framework designed to improve the sample efficiency and asymptotic performance of visuomotor policy learning. OMC-RL explicitly decouples the learning process into two stages: an upstream representation learning stage and a downstream policy learning stage. In the upstream stage, a masked Transformer module is trained with temporal modeling and contrastive learning to extract temporally aware, task-relevant representations from sequential visual inputs. After training, the learned encoder is frozen and used to extract visual representations from consecutive frames, while the Transformer module is discarded. In the downstream stage, an oracle teacher policy with privileged access to global state information supervises the agent during early training, providing informative guidance that accelerates policy learning. This guidance is gradually reduced to allow independent exploration as training progresses. Extensive experiments in simulated and real-world environments demonstrate that OMC-RL achieves superior sample efficiency and asymptotic policy performance, while also improving generalization across diverse and perceptually complex scenarios.
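The downstream recipe described above, an RL objective plus a KL term that pulls the student toward the oracle and is decayed over training, could be sketched roughly as follows. The linear decay schedule, the `beta_init` coefficient, and the discrete-action (logit-based) form are assumptions made for brevity; a continuous-control drone policy would typically use Gaussian action distributions instead, and OMC-RL's actual schedule is not specified here.

```python
import torch.nn.functional as F

def oracle_guided_policy_loss(agent_logits, oracle_logits, rl_loss,
                              step, total_steps, beta_init=1.0):
    """RL loss plus a KL penalty toward the oracle policy, with guidance annealed over training.

    agent_logits / oracle_logits: action-distribution logits for the same states;
    the oracle is treated as a fixed teacher, so no gradients flow into it.
    """
    beta = beta_init * max(0.0, 1.0 - step / total_steps)        # linear decay of oracle guidance
    log_p_agent = F.log_softmax(agent_logits, dim=-1)
    p_oracle = F.softmax(oracle_logits.detach(), dim=-1)
    kl = F.kl_div(log_p_agent, p_oracle, reduction="batchmean")  # KL(oracle || agent)
    return rl_loss + beta * kl
```

As `beta` decays to zero, the KL term vanishes and the agent relies purely on its own RL objective, mirroring the abstract's transition from oracle supervision to independent exploration.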
