Supervised Mixture-of-Experts for Surgical Grasping and Retraction

Author Names Omitted for Anonymous Review

Our Supervised MoE policy performing autonomous surgical bowel grasping and retraction using only stereo endoscopic camera feed as input.

Abstract

Imitation learning has achieved remarkable success in robotic manipulation, yet its application to surgical robotics remains challenging due to data scarcity, constrained workspaces, and the need for an exceptional level of safety and predictability.

We present a supervised Mixture-of-Experts (MoE) architecture designed for phase-structured surgical manipulation tasks. Unlike prior surgical robot learning approaches that rely on multi-camera setups or thousands of demonstrations, we show that, when equipped with our architecture, a lightweight action-decoder policy such as the Action Chunking Transformer (ACT) can learn complex, long-horizon manipulation from only 150 demonstrations using solely stereo endoscopic images.

We evaluate our approach on the collaborative surgical task of bowel grasping and retraction, where a robot assistant interprets visual cues from a human surgeon, executes targeted grasping on deformable tissue, and performs sustained retraction.

We benchmark our method against state-of-the-art Vision-Language-Action (VLA) models and the standard ACT baseline. Our results show that generalist VLAs fail to acquire the task entirely, even under standard in-distribution conditions. Furthermore, while standard ACT achieves moderate success in-distribution, adopting the supervised MoE architecture significantly boosts its performance, yielding higher success rates in-distribution and superior robustness in out-of-distribution scenarios, including novel grasp locations, reduced illumination, and partial occlusions. Notably, our policy generalizes to unseen testing viewpoints and transfers zero-shot to ex vivo porcine tissue without additional training, offering a promising pathway toward in vivo deployment. To support this claim, we present preliminary qualitative results of policy roll-outs during in vivo porcine surgery. These results demonstrate that supervised MoE architectures provide a data-efficient approach for learning multi-step dexterous manipulation in visually constrained environments. Code and dataset will be released upon acceptance.

Method Overview

MoE Architecture

We integrate a Phase-Aware Mixture-of-Experts (MoE) block into a standard transformer policy. The block consists of H parallel experts (one for each phase of the surgery) and a gating network. The gating network is supervised with phase labels during training, ensuring that the correct expert specializes in the correct sub-task (e.g., Grasping vs. Retracting).

Experimental Results

SmolVLA: unsafe, pushes against the bowel.

Pi0.5: does not commit to a phase.

ACT: poor grasping, leading to slippage.

Ours (ACT + MoE): successfully executes the task.

Out-of-Distribution Generalization

Our model generalizes to unseen scenarios without additional training.

Scenarios: unseen bowel segment, different lighting, bowel occlusion, unseen viewpoint.

Viewpoint Randomization

Retraining with randomized camera viewpoints further improves robustness to viewpoint variations.

Zero-Shot Transfer to Ex Vivo Tissue

We tested the policy (trained only on phantom data) directly on ex vivo porcine tissue without any fine-tuning. The policy achieved an 80% success rate, demonstrating robust generalization to real tissue appearance and deformation.

Preliminary In-Vivo Results

BibTeX

@inproceedings{Anonymous2024SurgicalMoE,
  author    = {Anonymous Authors},
  title     = {Supervised Mixture-of-Experts for Surgical Grasping and Retraction},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2024},
}