- Train a vision–language–action planner on teleoperation data that mixes successful and failed episodes.
- System 2 builds a scene graph and uses a value head with tree search to select feasible subgoals.
- System 1 executes the chosen subgoal sequence as continuous actions, using a learned done signal to detect completion.
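
The System 2 step above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the method summarized here: the scene graph is a set of `(subject, relation, object)` triples, subgoals are hypothetical STRIPS-style operators, `value_fn` is a hand-written stand-in for the learned value head, and beam search stands in for whatever tree search is actually used. System 1 and the done signal are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Assumption: a scene graph is a set of (subject, relation, object) triples.
Triple = Tuple[str, str, str]
SceneGraph = FrozenSet[Triple]


@dataclass(frozen=True)
class Subgoal:
    """Hypothetical symbolic subgoal with preconditions and effects."""
    name: str
    preconditions: FrozenSet[Triple]
    add: FrozenSet[Triple]
    remove: FrozenSet[Triple]


def feasible(graph: SceneGraph, sg: Subgoal) -> bool:
    # A subgoal is feasible when all of its preconditions hold in the graph.
    return sg.preconditions <= graph


def apply_subgoal(graph: SceneGraph, sg: Subgoal) -> SceneGraph:
    # Predict the scene graph that results from completing the subgoal.
    return (graph - sg.remove) | sg.add


def plan(
    start: SceneGraph,
    subgoals: Sequence[Subgoal],
    value_fn: Callable[[SceneGraph], float],
    depth: int = 3,
    beam_width: int = 2,
) -> Tuple[List[str], float]:
    """Beam search over feasible subgoal sequences, scored by value_fn."""
    beam = [(value_fn(start), start, [])]
    best = beam[0]
    for _ in range(depth):
        candidates = []
        for _, graph, seq in beam:
            for sg in subgoals:
                if feasible(graph, sg):
                    nxt = apply_subgoal(graph, sg)
                    candidates.append((value_fn(nxt), nxt, seq + [sg.name]))
        if not candidates:
            break  # no feasible expansion from any node in the beam
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
        if beam[0][0] > best[0]:  # strict '>' keeps the shorter plan on ties
            best = beam[0]
    return best[2], best[0]


# Toy task: move a cup from the table to the shelf.
GOAL: Triple = ("cup", "on", "shelf")
HOLDING: Triple = ("gripper", "holding", "cup")
start: SceneGraph = frozenset({("cup", "on", "table"), ("gripper", "is", "empty")})

pick = Subgoal(
    name="pick_cup",
    preconditions=frozenset({("cup", "on", "table"), ("gripper", "is", "empty")}),
    add=frozenset({HOLDING}),
    remove=frozenset({("cup", "on", "table"), ("gripper", "is", "empty")}),
)
place = Subgoal(
    name="place_on_shelf",
    preconditions=frozenset({HOLDING}),
    add=frozenset({GOAL, ("gripper", "is", "empty")}),
    remove=frozenset({HOLDING}),
)


def value_fn(graph: SceneGraph) -> float:
    # Stand-in for a learned value head: hand-written progress score.
    if GOAL in graph:
        return 1.0
    if HOLDING in graph:
        return 0.5
    return 0.0


sequence, value = plan(start, [pick, place], value_fn)
print(sequence, value)
```

In this toy setup the feasibility check prunes `place_on_shelf` until the cup has been picked up, so the search only ever scores subgoal sequences whose preconditions hold at every step.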