Paper Reading: "CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance"
Abstract
Robot foundation models, particularly Vision-Language-Action (VLA) models, have garnered significant attention for their ability to enhance robot policy learning, greatly improving robots’ generalization and robustness.
OpenAI’s recent model, O1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction?
In this paper, we introduce Chain-of-Affordance (CoA-VLA), a novel approach to scaling robot models by incorporating reasoning in the format of sequential robot affordances to facilitate task completion.
Specifically, we prompt the model to consider the following four types of affordances before taking action: (1) object affordance—what object to manipulate and where it is; (2) grasp affordance—the specific object part to grasp; (3) spatial affordance—the optimal space to place the object; and (4) movement affordance—the collision-free path for movement.
We further transform each affordance into two prompting formats: visual affordance and textual affordance. We introduce a novel vision-language co-injection module that integrates this knowledge into the policy network, allowing the robot to leverage essential contextual information during action inference, resulting in improved precision and robustness.
Our experiments demonstrate that CoA-VLA outperforms state-of-the-art robot foundation models, including OpenVLA and Octo, on a variety of tasks. Furthermore, CoA-VLA exhibits strong generalization capabilities, including recognizing unseen object poses, identifying free space, and avoiding obstacles in novel environments.
Conclusion
Explicit reasoning is essential for language models to handle complex tasks. In this work, we design a reasoning-aware foundation model for robotics, focusing on various affordances: object, grasp, spatial, and movement.
These affordances form an interdependent chain: the robot identifies the target object and its location, determines how to grasp it, decides where to place it, and navigates accordingly.
By structuring this chain of affordances as intermediate language and image outputs and feeding this reasoning into the policy model, our Chain-of-Affordances (CoA-VLA) model outperforms baseline methods on real-world robotic tasks.
CoA-VLA also generalizes well to complex environments, tackling challenges such as grasping objects in unfamiliar orientations, avoiding obstacles, and achieving spatial generalization.
Our approach provides a novel perspective on designing reasoning chains to enhance embodied control.
Let's take a detailed look at this paper on **CoA-VLA (Chain-of-Affordance Vision-Language-Action)**, an ICCV 2025 paper that proposes a new way to strengthen robot VLA models with visual-textual chain-of-affordance reasoning.
Core Overview of the Paper
Research Motivation
Current VLA models typically rely on an external LLM/VLM for high-level planning and lack autonomous reasoning ability. Inspired by how OpenAI's O1 solves complex problems through long reasoning chains, the authors ask: can a robot model perform better in multi-task, complex environments by reviewing prior observations and producing task-specific reasoning to guide action prediction?
Core Innovation: Chain-of-Affordance
The paper defines four types of affordances, which together form an affordance chain:
| Affordance Type | Meaning | Example |
|---|---|---|
| Object Affordance | Identify the object to manipulate and where it is | "The cup is at [0.45, 0.68]" |
| Grasp Affordance | Determine the specific part of the object to grasp | "The grasp point is at [0.73, 0.80]" |
| Spatial Affordance | Identify the optimal free space to place the object | "The placeable region is at [0.45, 0.68]" |
| Movement Affordance | Plan a collision-free path for movement | "The robot can move via [0.74, 0.82] → [0.65, 0.73]" |
These four affordances form a sequentially dependent chain: first decide what to manipulate, then how to grasp it, then where to place it, and finally how to move.
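A minimal Python sketch of how such a chain could be represented as data; the class and field names below are illustrative assumptions for this write-up, not the paper's actual implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # normalized image coordinates in [0, 1]

@dataclass
class ObjectAffordance:
    name: str       # e.g. "cup"
    center: Point   # where the target object is

@dataclass
class GraspAffordance:
    grasp_point: Point  # which part of the object to grasp

@dataclass
class SpatialAffordance:
    place_point: Point  # free space to place the object

@dataclass
class MovementAffordance:
    waypoints: List[Point]  # collision-free path for the gripper

@dataclass
class ChainOfAffordance:
    """Sequential chain: what to manipulate -> how to grasp ->
    where to place -> how to move."""
    obj: ObjectAffordance
    grasp: GraspAffordance
    spatial: SpatialAffordance
    movement: MovementAffordance

    def to_text(self) -> str:
        # Render the whole chain as a textual affordance prompt.
        path = " -> ".join(map(str, self.movement.waypoints))
        return (
            f"The {self.obj.name} is at {self.obj.center}. "
            f"Grasp it at {self.grasp.grasp_point}. "
            f"Place it at {self.spatial.place_point}. "
            f"Move through {path}."
        )
```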
Technical Architecture
1. Dual-Modality Affordance Representation
The paper uses two prompting formats (illustrated below):
- Textual affordance: describes coordinates and attributes in natural language
- Visual affordance: overlays markers on the image (bounding boxes, trajectory points, etc.)
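As a rough illustration of the two formats, the sketch below draws a visual affordance overlay with PIL and expresses the same information as a textual prompt; the marker styles and prompt wording are assumptions, not the paper's exact format:

```python
from PIL import Image, ImageDraw

def render_visual_affordance(image: Image.Image,
                             bbox: tuple,         # (x0, y0, x1, y1) in pixels
                             grasp_point: tuple,  # (x, y) in pixels
                             waypoints: list):    # [(x, y), ...] in pixels
    """Overlay affordance markers on the observation image:
    bounding box, grasp point, and movement trajectory."""
    canvas = image.copy()
    draw = ImageDraw.Draw(canvas)
    draw.rectangle(bbox, outline="red", width=3)              # object affordance
    x, y = grasp_point
    draw.ellipse([x - 5, y - 5, x + 5, y + 5], fill="green")  # grasp affordance
    if len(waypoints) > 1:
        draw.line(waypoints, fill="blue", width=2)            # movement affordance
    return canvas

def render_textual_affordance(obj_name, center, grasp_point, place_point):
    """The same knowledge expressed as a natural-language prompt."""
    return (f"The {obj_name} is at {center}; grasp it at {grasp_point} "
            f"and place it in the free space at {place_point}.")
```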
2. Visual-Textual Co-Injection Module
Textual affordance → VLM encoder → MLP → tokens
Visual affordance → ViT-Small → patch tokens
↓
[Transformer fusion]
↓
FiLM conditioning layer → diffusion policy → action generation
Key design: a FiLM (Feature-wise Linear Modulation) conditioning layer dynamically injects the affordance knowledge into the diffusion model, preserving computational efficiency while improving robustness.
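A minimal PyTorch sketch of what such a co-injection plus FiLM conditioning stage could look like; the dimensions, layer counts, and mean-pooling choice are assumptions made for illustration, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift policy features
    using parameters predicted from the fused affordance embedding."""
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return (1 + gamma) * feats + beta   # identity when gamma = beta = 0

class CoInjection(nn.Module):
    """Fuse textual affordance tokens (from the VLM) and visual affordance
    patch tokens (from a small ViT) with a shallow transformer, then pool
    them into a single conditioning vector for the diffusion policy."""
    def __init__(self, dim: int = 384, n_layers: int = 2, n_heads: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text_tokens, visual_tokens):
        fused = self.fusion(torch.cat([text_tokens, visual_tokens], dim=1))
        return fused.mean(dim=1)   # pooled conditioning vector

if __name__ == "__main__":
    B, T, V, D, F = 2, 8, 196, 384, 512
    text_tokens = torch.randn(B, T, D)     # textual affordance tokens
    visual_tokens = torch.randn(B, V, D)   # visual affordance patch tokens
    cond = CoInjection(dim=D)(text_tokens, visual_tokens)   # (B, D)
    film = FiLM(cond_dim=D, feat_dim=F)
    policy_feats = torch.randn(B, F)       # intermediate diffusion-policy features
    modulated = film(policy_feats, cond)   # (B, F), affordance-conditioned
```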
3. Dynamic Affordance Selection
To avoid the overhead of computing every affordance at every step, the model uses proprioceptive information (joint angles, gripper state, etc.) to select only the affordances needed at the current moment, for example (sketched after this list):
- Gripper already closed with an object detected in hand → skip the object/grasp affordances and focus on the movement/spatial affordances
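A toy sketch of this selection logic; the rule below only mirrors the example above, and the paper's actual conditions on joint angles and gripper state may differ:

```python
from enum import Enum, auto

class Affordance(Enum):
    OBJECT = auto()
    GRASP = auto()
    SPATIAL = auto()
    MOVEMENT = auto()

def select_affordances(gripper_closed: bool, object_in_gripper: bool):
    """Choose which affordances to reason about at the current step,
    based on proprioceptive state (gripper status here; the real model
    also uses joint angles)."""
    if gripper_closed and object_in_gripper:
        # Object already grasped: skip object/grasp reasoning and focus on
        # where to place it and how to move there collision-free.
        return [Affordance.SPATIAL, Affordance.MOVEMENT]
    # Otherwise the robot still needs to locate and grasp the target.
    return [Affordance.OBJECT, Affordance.GRASP, Affordance.MOVEMENT]
```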
Data Generation Pipeline
The paper designs an automated affordance-annotation pipeline (a high-level placeholder sketch follows this list):
- GPT-4o: generates scene descriptions and identifies entities
- Grounding DINOv2 + SAM: produce object bounding boxes
- RoboPoint + GPT-4o: label spatial affordance points
- CoTracker: tracks the gripper's motion trajectory
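A high-level orchestration sketch of how such a pipeline could be wired together; every helper function below is a hypothetical placeholder (stubbed with dummy values) standing in for the named tool, not a real API:

```python
def describe_scene(frame, instruction):      # placeholder for GPT-4o
    return ["cup", "plate"]

def detect_boxes(frame, entities):           # placeholder for Grounding DINOv2
    return {e: (0.0, 0.0, 1.0, 1.0) for e in entities}

def segment(frame, boxes):                   # placeholder for SAM
    return {name: None for name in boxes}

def propose_free_space(frame, instruction):  # placeholder for RoboPoint + GPT-4o
    return [(0.45, 0.68)]

def track_gripper(frames):                   # placeholder for CoTracker
    return [(0.74, 0.82)] * len(frames)

def annotate_episode(frames, instruction):
    """Combine the tool outputs into per-step affordance labels."""
    entities = describe_scene(frames[0], instruction)
    boxes = detect_boxes(frames[0], entities)
    masks = segment(frames[0], boxes)
    place_points = propose_free_space(frames[0], instruction)
    trajectory = track_gripper(frames)
    return [{
        "object_boxes": boxes,
        "object_masks": masks,
        "place_points": place_points,
        "gripper_path": trajectory[t:],  # remaining waypoints from step t
    } for t in range(len(frames))]
```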
Experimental Results
Real-Robot Experiments (Franka arm, 7 tasks)
| Model | Avg. Success Rate (In-Distribution) | Visual Generalization |
|---|---|---|
| Diffusion Policy | 42.93% | 4.76% |
| Octo | 44.13% | 19.05% |
| OpenVLA | 54.89% | 22.22% |
| DiffusionVLA (baseline) | 76.60% | 44.44% |
| CoA-VLA | 85.54% | 57.14% |
Key improvement: compared with the DiffusionVLA baseline, CoA-VLA gains roughly 9 percentage points in-distribution (85.54% vs. 76.60%) and 12.7 points under visual generalization (57.14% vs. 44.44%).
Simulation Benchmark (LIBERO)
| Model | Avg. Success Rate |
|---|---|
| Diffusion Policy | 73.6% |
| ScaleDP | 79.2% |
| Octo | 84.6% |
| OpenVLA | 79.1% |
| CoA-VLA | 79.8% |
Across the four LIBERO task suites, CoA-VLA achieves performance competitive with the strongest baselines.
Qualitative Analysis
Spatial affordance experiment (PlaceBread task):
- Place a piece of bread on a plate that already holds other objects
- CoA-VLA identifies the free region and succeeds in all 3 scenarios
- OpenVLA and DiffusionVLA each succeed in only 1 scenario
Obstacle-avoidance experiment:
- Obstacles such as a vase are placed along the path
- CoA-VLA successfully avoids them in all 3 scenarios
- OpenVLA fails in all of them; DiffusionVLA succeeds only once
Summary of Core Contributions
- Chain-of-Affordance reasoning framework: structures robot reasoning into four sequential affordances, strengthening the VLA model's understanding of the physical context
- Visual-textual co-injection mechanism: fuses the dual-modality affordance prompts and injects them efficiently into the policy network through FiLM conditioning layers
- Dynamic affordance selection: adaptively selects the necessary affordances based on proprioception, balancing computational efficiency and reasoning completeness
- Automated data-generation pipeline: combines multiple vision foundation models to automatically build large-scale affordance-annotated data
Limitations and Outlook
Points the paper does not discuss in depth:
- Error accumulation of the affordance chain on longer-horizon tasks
- How the quality of the automated affordance annotations is guaranteed
- The cost of intermediate supervision for explicit reasoning compared with end-to-end learning
Overall, this is a solid piece of work that successfully transfers the Chain-of-Thought idea to robot control, using structured physical reasoning to deliver clear gains in the precision and generalization of VLA models.