computer-use agent Topic Archive

computer-use-agent.html
Long-term tracking RSS feed for the keyword "computer-use agent", collecting all previously matched papers.
zh-CN
Sun, 28 Jun 2026 05:24:06 +0000

Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets
../papers/arxiv-4357e9fa7bf7.html
https://arxiv.org/abs/2606.25760v1#2026-06-25#computer-use-agent
Thu, 25 Jun 2026 13:11:21 +0800
Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet evidence on post-hoc uncertainty quantification (UQ) for these agents is fragmented across isolated model and dataset pairs, leaving it unclear whether UQ rankings stay stable when the agent, benchmark, or observable interface changes. We present Argus, a cross-regime benchm…

Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation
../papers/arxiv-640fe613ba1c.html
https://arxiv.org/abs/2606.24515v1#2026-06-24#computer-use-agent
Wed, 24 Jun 2026 13:06:49 +0800
Computer-Use Agents (CUAs) execute high-level user goals by perceiving and acting directly within graphical user interfaces. However, reinforcement learning for CUAs remains difficult because open-ended desktop environments rarely provide scalable, machine-readable reward signals: task success is often visually grounded and hard to specify with handcrafted reward functions or dense manual labels. We propose an RL fine-tuning framework that uses autonomous vision-language evaluation as a scalabl…

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?
../papers/arxiv-9dcfdba2a88b.html
https://arxiv.org/abs/2606.23189v1#2026-06-23#computer-use-agent
Tue, 23 Jun 2026 13:10:02 +0800
Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists.
This cross-application access is useful, but it also creates a privacy risk that has been largely overlooked: when an agent works in one context, it can pull in information from another that is inappropriate in that context. Hence, we introduce AgentCIBench, an evaluation harness that turns this risk into executable, deterministically scored scenarios. We target three co…

MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
../papers/arxiv-9ae72ad425f0.html
https://arxiv.org/abs/2606.16748v1#2026-06-16#computer-use-agent
Tue, 16 Jun 2026 14:38:43 +0800
Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench,…

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
../papers/arxiv-2035193c4622.html
https://arxiv.org/abs/2606.13239v1#2026-06-12#computer-use-agent
Fri, 12 Jun 2026 13:55:02 +0800
Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces.
In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program…

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
../papers/arxiv-0c1951bb8319.html
https://arxiv.org/abs/2606.11042v1#2026-06-10#computer-use-agent
Wed, 10 Jun 2026 13:25:04 +0800
Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user…

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
../papers/arxiv-c1a0a781eb5e.html
https://arxiv.org/abs/2606.11176v1#2026-06-10#computer-use-agent
Wed, 10 Jun 2026 13:25:04 +0800
Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end?
We introduce Data Journalist Agent…

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
../papers/arxiv-df62d5981d92.html
https://arxiv.org/abs/2606.09426v1#2026-06-09#computer-use-agent
Tue, 09 Jun 2026 13:12:49 +0800
Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable art…

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
../papers/arxiv-3b7434ccdd83.html
https://arxiv.org/abs/2606.03203#2026-06-03#computer-use-agent
Wed, 03 Jun 2026 14:09:56 +0800
Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark…

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
../papers/arxiv-398b32863eaa.html
https://arxiv.org/abs/2605.28775v1#2026-05-28#computer-use-agent
Thu, 28 May 2026 13:15:52 +0800
Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures.
A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation,…

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
../papers/arxiv-ce117792107e.html
https://arxiv.org/abs/2605.25624v1#2026-05-26#computer-use-agent
Tue, 26 May 2026 13:09:24 +0800
Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instruction, executable environment, and verifiable reward. However, hand-curated benchmarks achieve high reward fidelity but cover few applications and LLM-as-ju…

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions
../papers/arxiv-f49f736fdeba.html
https://arxiv.org/abs/2605.25707v1#2026-05-26#computer-use-agent
Tue, 26 May 2026 13:09:24 +0800
Autonomous computer use agents that are powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control.
We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where the uncertainties in dynamic e…

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling
../papers/arxiv-672c2d56efad.html
https://arxiv.org/abs/2605.21470v1#2026-05-21#computer-use-agent
Thu, 21 May 2026 13:14:24 +0800
Computer-use agents (CUA) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We present agent just-in-time (JIT) compilation, an alternative that compiles task descriptions direc…

OpenComputer: Verifiable Software Worlds for Computer-Use Agents
../papers/arxiv-4d34b4ef6611.html
https://arxiv.org/abs/2605.19769#2026-05-20#computer-use-agent
Wed, 20 May 2026 13:10:58 +0800
We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness…

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
../papers/arxiv-205818b0a17d.html
https://arxiv.org/abs/2605.18859#2026-05-20#computer-use-agent
Wed, 20 May 2026 13:10:58 +0800
LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls.
Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success,…

Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
../papers/arxiv-24f56ae93690.html
https://arxiv.org/abs/2605.06393v1#2026-05-08#computer-use-agent
Fri, 08 May 2026 14:15:32 +0800
Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct access to host-side resources, including browsers, files, scripts, system commands, and external communication channels. While useful for automating real tasks, this capability also creates a host-level abuse surface: a legitimately deployed agent may be steered toward unsafe operations through malicious messages, indirect prompt injection, unsafe skills, or tampering along the host-side…

Exploring Interaction Paradigms for LLM Agents in Scientific Visualization
../papers/arxiv-8273d104aa19.html
https://arxiv.org/abs/2604.27996v1#2026-05-01#computer-use-agent
Fri, 01 May 2026 12:53:56 +0800
This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural-language instructions.
We compare three primary interaction paradigms, including domain-specific agents with structured tool use, computer-use agents, and general-purpose coding agents, by evaluating eight representative agents across 15 benchmark tasks and measuring visualization quality, efficiency, robustnes…

Gym-Anything: Turn any Software into an Agent Environment
../papers/arxiv-09b28b4075b2.html
https://arxiv.org/abs/2604.06126v1#2026-04-08#computer-use-agent
Wed, 08 Apr 2026 17:10:24 +0800
Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any softwa…