{
  "generated_at": "2026-06-25T13:11:21.142851+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"Language Model\": 13 papers matched (feeds: LM, Terminal and SWE Agents). Representative papers: \"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy\"; \"Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models\".",
    "Topic \"LLM\": 13 papers matched (feeds: LM, Agent Runtime Security, and others). Representative papers: \"Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models\"; \"How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations\".",
    "Topic \"Benchmark\": 6 papers matched (feeds: LM). Representative papers: \"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy\"; \"Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets\".",
    "Topic \"Agent\": 3 papers matched (feeds: Agent Runtime Security, Terminal and SWE Agents). Representative papers: \"The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems\"; \"AI Snitches Get Glitches: Towards Evading Agentic Surveillance\".",
    "Topic \"Evaluation\": 2 papers matched (feeds: LM, Terminal and SWE Agents). Representative papers: \"Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz\"; \"Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "Language Model",
      "paper_count": 13,
      "feed_names": [
        "LM",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy",
        "Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models",
        "How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations",
        "MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction",
        "TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs",
        "Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets",
        "Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability",
        "Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation",
        "RAS: Measuring LLM Safety Through Refusal Alignment",
        "Explainable Control Framework (XCF) based on Fuzzy Model-Agnostic Explanation and LLM Agent-Supported Interface",
        "SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models",
        "MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources",
        "Evaluating LLMs on Real-World Software Performance Optimization"
      ],
      "key_points": [
        "\"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy\" [Evaluation / Application / Method]: Large language models are increasingly deployed as investment research assistants, yet no benchmark tests whether they can accurately reconstruct and apply the…",
        "\"Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models\" [Evaluation / Method]: Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes…"
      ]
    },
    {
      "name": "LLM",
      "paper_count": 13,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models",
        "How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations",
        "MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction",
        "Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz",
        "TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs",
        "Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation",
        "RAS: Measuring LLM Safety Through Refusal Alignment",
        "Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents",
        "Explainable Control Framework (XCF) based on Fuzzy Model-Agnostic Explanation and LLM Agent-Supported Interface",
        "MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources",
        "SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment",
        "How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring",
        "Evaluating LLMs on Real-World Software Performance Optimization"
      ],
      "key_points": [
        "\"Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models\" [Evaluation / Method]: Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes…",
        "\"How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations\" [Evaluation / Method]: Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustnes…"
      ]
    },
    {
      "name": "Benchmark",
      "paper_count": 6,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy",
        "Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets",
        "Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability",
        "Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents",
        "SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models",
        "SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment"
      ],
      "key_points": [
        "\"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy\" [Evaluation / Application / Method]: Large language models are increasingly deployed as investment research assistants, yet no benchmark tests whether they can accurately reconstruct and apply the…",
        "\"Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets\" [Evaluation / Data / Application / Method]: Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, cal…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 3,
      "feed_names": [
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems",
        "AI Snitches Get Glitches: Towards Evading Agentic Surveillance",
        "Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution"
      ],
      "key_points": [
        "\"The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems\" [Method]: AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls in…",
        "\"AI Snitches Get Glitches: Towards Evading Agentic Surveillance\" [Data / Method]: To better assist users with completing challenging tasks, AI agents mediate communications, access data, and interact with different APIs. Many employers (and…"
      ]
    },
    {
      "name": "Evaluation",
      "paper_count": 2,
      "feed_names": [
        "LM",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz",
        "Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution"
      ],
      "key_points": [
        "\"Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz\" [Evaluation / Method]: The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standar…",
        "\"Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution\" [Evaluation / Application / Method]: Resolving issues with ambiguous and incomplete descriptions, particularly concerning complex bugs, requires a sophisticated, long-horizon workflow. Agents must…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "\"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy\" [Evaluation / Application / Method]: Large language models are increasingly deployed as investment research assistants, yet no benchmark tests whether they can accurately reconstruct and apply the…",
        "\"Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models\" [Evaluation / Method]: Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes…",
        "\"How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations\" [Evaluation / Method]: Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustnes…",
        "\"MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction\" [Evaluation / Data / Application / Method]: As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes…",
        "\"Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz\" [Evaluation / Method]: The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standar…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy",
          "summary": "Large language models are increasingly deployed as investment research assistants, yet no benchmark tests whether they can accurately reconstruct and apply the specific procedural decision frameworks of expert investors. We introduce InvestPhilBench, a multi-layer dynamic benchmark spanning eight cognitive tiers, from principle identification (L1) to novel framework extrapolation (L8). The v0.6 release comprises 118 primary-source-verified investment principle cards, 25 decision framework cards with explicit topology metadata, and 243 QA questions (197 dev / 46 held-out test). For reproducible scoring at scale we introduce the Benchmark Automated Scoring Pipeline (BASP) -- five algorithmic metrics (OGRS, KCCS, SAP@k, IVP, CKCA) -- the Failure Mode Detection Protocol (FMDP) with computable rules for six failure modes, and Gate Reconstruction Accuracy (GRA), a per-gate metric for questions with gold reasoning programs. In this release, InvestPhilBench is primarily a benchmark-and-methodology contribution. A four-model sanity wave on the 188-question development split shows a sharp provider-tier split (BASP 0.906 vs. 0.438); these mixed-judge numbers are confounded upper bounds. The central finding: the BASP composite saturates at the frontier (Claude L4 = 0.932) while GRA still exposes a procedural deficit (frontier L4 GRA approx. 0.77, L7 GRA 0.57-0.62) -- composite scoring rewards fluent prose and hides the procedural gap. v0.6 implements a unified judge and true model-in-the-loop retrieval/oracle conditions; the de-confounded multi-model leaderboard and full three-condition run are v1.0 deliverables. On a 100-item expert-annotated gold set the automated BASP composite tracks the human reference at Pearson r = 0.72 (MAE = 0.10), with attribution (SAP@3) the weakest sub-metric and the failure-mode detector running sensitive-but-over-flagging.",
          "authors": [
            "Mingguang Chen",
            "Bo Qu"
          ],
          "categories": [
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25984v1",
          "abstract_url": "https://arxiv.org/abs/2606.25984v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25984v1",
          "published_at": "2026-06-24T15:53:20+00:00",
          "updated_at": "2026-06-24T15:53:20+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "Language Model",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.25984",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25984v1"
          },
          "relevance_score": 188,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"reasoning\"",
            "title matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25984"
        },
        {
          "title": "Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models",
          "summary": "Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a same-ordering control estimates the decoder-stochastic floor for observed flips. We find that none of the 18 MLLMs we audit are order-invariant: screened per-facet panel-mean flip rates span 24-50%. A Gemini same-ordering control at temperature 0 estimates a substantial ordering excess over a same-input decoder-noise floor in verified cells. Capability predicts but does not eliminate flips; the best model still flips on 13.4% of trials. In our Gemini mitigation tests, training-free prompt changes are modality-conditional and do not transfer from text to visual reasoning. These results suggest that prompt-level mitigation alone is unlikely to provide general order robustness, motivating future work on training-time and architectural approaches. We propose cross-ordering flip rate as a standard reporting axis for MLLMs.",
          "authors": [
            "Akshay Paruchuri",
            "Sanmi Koyejo",
            "Ehsan Adeli"
          ],
          "categories": [
            "cs.CL",
            "cs.CV",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26079v1",
          "abstract_url": "https://arxiv.org/abs/2606.26079v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26079v1",
          "published_at": "2026-06-24T17:53:26+00:00",
          "updated_at": "2026-06-24T17:53:26+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.26079",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26079v1"
          },
          "relevance_score": 182,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26079"
        },
        {
          "title": "How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations",
          "summary": "Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently understood. This gap is critical for OCR reasoning, where visual corruption can induce OCR errors and structural distortions, thereby introducing uncertainty into the reasoning task. To systematically study this problem, we introduce OCR-Robust, a benchmark designed for evaluating OCR reasoning robustness under visual perturbations. It contains 812 samples across two complementary subsets: OCR1.0, covering documents, scene text, receipts, handwriting, and mathematical content, and OCR2.0, focusing on charts, geometry diagrams, and tables. To enable efficient yet informative evaluation, we conduct a pilot study over 18 candidate perturbations and select 5 representative types at 3 severity levels each based on their impact and cross-model discriminability. We evaluate robustness using clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI), and benchmark 18 models spanning proprietary systems, open-source VLMs, and OCR+LLM pipelines. Our results show that higher clean accuracy does not necessarily imply stronger robustness, and that models can suffer pronounced degradation in the worst case on OCR tasks that are sensitive to structure, and charts and tables are substantially more fragile than document-like inputs under perturbation.",
          "authors": [
            "Yuxing Cheng",
            "Yuan Wu",
            "Yi Chang"
          ],
          "categories": [
            "cs.CV",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26041v1",
          "abstract_url": "https://arxiv.org/abs/2606.26041v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26041v1",
          "published_at": "2026-06-24T17:15:42+00:00",
          "updated_at": "2026-06-24T17:15:42+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.26041",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26041v1"
          },
          "relevance_score": 182,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"reasoning\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26041"
        },
        {
          "title": "MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction",
          "summary": "As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety. Existing methods for error detection and correction, including automated checks and heuristic-based approaches, do not generalize well across unseen datasets. In this paper, we propose MedGuards as a medical safety guardrail, which is a new framework that treats medical error detection and correction as a multi-agent in-context learning task. Specialized agents separately detect, localize, and correct errors, while a confidence-guided arbitration mechanism resolves disagreements using reasoning traces and confidence scores. This design enhances interpretability, robustness, and adaptability, without requiring additional training of the base LLMs. Additionally, we introduce the Keyword-Prioritized Correction Score (KPCS), a new evaluation metric that considers whether critical keywords within the reference text are generated correctly, providing a more comprehensive assessment than conventional metrics. Experiments across four multilingual medical datasets consisting of clinical notes demonstrate significant improvements by the proposed framework across several metrics and models. Our aim is to enable safer deployment of LLMs in real-world healthcare applications. For reproducibility, we make our code publicly available at https://github.com/congboma/MedErrBench.",
          "authors": [
            "Congbo Ma",
            "Hu Wang",
            "Yichun Zhang",
            "Farah E. Shamout"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25651v1",
          "abstract_url": "https://arxiv.org/abs/2606.25651v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25651v1",
          "published_at": "2026-06-24T10:07:59+00:00",
          "updated_at": "2026-06-24T10:07:59+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Data",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.25651",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25651v1"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"in-context learning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "summary matched \"guardrail\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25651"
        },
        {
          "title": "Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz",
          "summary": "The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standards such as the German IT-Grundschutz (IT-GS) of the Federal Office for Information Security. However, IT-GS certification is resource-intensive and requires a high level of manual effort for documentation, validation, and revision, making scalable implementation difficult and expensive. Building upon our previous conceptual framework, this paper presents the technical implementation and empirical evaluation of a Multi-Agent System (MAS) architecture combined with Hybrid Retrieval Augmented Generation (HybridRAG) for the partial automation of IT-GS certification. We introduce two novel technical contributions to the MAS architecture to enforce compliance rigor: a Hypothesis-Verification Loop in the Structural Analysis (SA) phase that cross-references agent-inferred dependencies against the Knowledge Graph to reduce hallucinations, and a Decoupled Reasoning Pipeline that separates agent-driven semantic extraction from the deterministic protection need inheritance. We utilize the BSI's \"RecPlast GmbH\" case study as a human expert-generated reference data set for end-to-end evaluation of the architecture and to quantify Precision, Recall, and F1-scores. The performance of the system is investigated across the phases of SA, Protection Needs Assessment (PNA), Modeling, and IT-GS Check. The empirical results reveal noticeable differences throughout the different steps of IT-GS. While the MAS demonstrates high efficacy in semantic tasks (SA and Modeling), significantly reducing manual effort through automated information extraction, quantitative results reveal limitations in logical reasoning phases (PNA and IT-GS Check) as the probabilistic nature of current LLMs struggles to meet the deterministic rigor required by IT-GS.",
          "authors": [
            "Lea Roxanne Muth",
            "Marian Margraf"
          ],
          "categories": [
            "cs.CR",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25622v1",
          "abstract_url": "https://arxiv.org/abs/2606.25622v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25622v1",
          "published_at": "2026-06-24T09:31:06+00:00",
          "updated_at": "2026-06-24T09:31:06+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Evaluation"
          ],
          "doi": "10.1109/syscon66367.2026.11503560",
          "arxiv_id": "2606.25622",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25622v1",
            "doi": "https://doi.org/10.1109/syscon66367.2026.11503560"
          },
          "relevance_score": 164,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"retrieval augmented generation\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1109/syscon66367.2026.11503560"
        },
        {
          "title": "TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs",
          "summary": "Multimodal Large Language Models (MLLMs) demonstrate strong performance on standard visual question answering benchmarks, yet their scalability under controlled structural complexity remains poorly understood. We introduce TriViewBench, a controlled three-view visual reasoning benchmark constructed from synthetic 3D scenes with explicitly parameterized object count and occlusion. The benchmark contains 1,923 scenes and over 14K Question-Answer (QA) pairs organized into four complexity levels and three reasoning categories: Local Decision, Object Counting, and Global Recovery. We evaluate 18 open- and closed-source MLLMs under a unified prompting protocol. All 18 models exhibit an identical capability hierarchy without exception (Local Decision > Object Counting > Global Recovery), and performance degrades monotonically with complexity: Local Decision tasks decline modestly (12.11% relative drop), while Object Counting degrades substantially (59.14%) and Global Recovery collapses severely (80.02%). Error analysis on Object Counting reveals two mechanistically independent failure modes: single-view tasks are dominated by undercounting due to occlusion blindness, whereas the multi-view task reverses to overcounting due to cross-view identity confusion. Chain-of-Thought (CoT) prompting yields near-zero overall benefit (Δ = -0.16%) and its effect on Global Recovery is strongly capability-gated, suggesting that the bottleneck lies in cross-view spatial representation rather than reasoning strategy. These findings reveal fundamental scalability limitations in current MLLMs and position TriViewBench as a controlled diagnostic framework for analyzing structural reasoning failures.",
          "authors": [
            "Yu-Yang Chen",
            "Lan-Zhe Guo"
          ],
          "categories": [
            "cs.CV",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26029v1",
          "abstract_url": "https://arxiv.org/abs/2606.26029v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26029v1",
          "published_at": "2026-06-24T17:00:05+00:00",
          "updated_at": "2026-06-24T17:00:05+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.26029",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26029v1"
          },
          "relevance_score": 163,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26029"
        },
        {
          "title": "Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets",
          "summary": "Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet evidence on post-hoc uncertainty quantification (UQ) for these agents is fragmented across isolated model and dataset pairs, leaving it unclear whether UQ rankings stay stable when the agent, benchmark, or observable interface changes. We present Argus, a cross-regime benchmark for post-hoc UQ in single-step executable GUI grounding: a 27-method open-weight matrix over 4 VLM agents and 4 datasets, plus an 8-method closed-source matrix across 3 frontier vendors where logits, hidden states, and attention maps are unavailable. Evaluated methods span logit-based scores, sampling and consistency measures, hidden-state and density estimators (Mahalanobis, SAPLMA), attention-based scores, P(True) and verbalised-confidence prompting, and split-conformal prediction. The main finding is selective transfer: UQ rankings are stable across datasets for a fixed model, but degrade across model classes and observable interfaces. Hidden-state and density methods are the most stable open-weight family, while CoCoA-1MCA, Focus, sampling-based scores, and verbalised self-assessment win in specific regimes. Within-model ranking transfer is strong (Spearman rho up to 0.969), but cross-tier transfer to closed-source vendors averages only +0.08, so closed-source UQ should be reranked on the target rather than extrapolated. Conformal click regions show score-level discrimination is not enough for deployment: locally weighted disks shrink radii by 40-60% when the plug-in UQ is calibrated, but coverage degrades under calibration-test or interface mismatch. We release per-item records, calibration/test splits, UQ scores, and analysis scripts for regime-aware UQ selection in GUI agents.",
          "authors": [
            "Divake Kumar",
            "Sina Tayebati",
            "Devashri Naik",
            "Amanda Sofie Rios",
            "Nilesh Ahuja",
            "Omesh Tickoo",
            "Ranganath Krishnan",
            "Amit Ranjan Trivedi"
          ],
          "categories": [
            "cs.LG",
            "cs.AI",
            "cs.CL",
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25760v1",
          "abstract_url": "https://arxiv.org/abs/2606.25760v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25760v1",
          "published_at": "2026-06-24T12:34:28+00:00",
          "updated_at": "2026-06-24T12:34:28+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.25760",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25760v1"
          },
          "relevance_score": 163,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "title matched \"computer-use agent\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25760"
        },
        {
          "title": "Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability",
          "summary": "Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still largely assume clean, stable, and trustworthy tool environments, leaving tool-environment unreliability insufficiently examined. We introduce ToolBench-X, a benchmark for evaluating agents under recoverable reliability hazards. ToolBench-X contains executable multi-step tasks across diverse domains and sequential, parallel, and mixed workflows, each paired with deterministic tools and a canonical final answer for automatic evaluation. Starting from clean tool environments, ToolBench-X injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Crucially, each injected instance remains solvable through at least one valid recovery path, such as retrying, fallback, verification, or cross-checking. Experiments reveal a substantial reliability gap: agents that perform well with reliable tools often fail under recoverable hazards. Further analysis shows that failures are driven less by tool-use volume or inference budget than by limited hazard diagnosis and ineffective recovery. Targeted recovery hints recover many failed tasks, while test-time scaling yields more limited gains. These results suggest that tool-use evaluation should move beyond function-call accuracy toward task completion under unreliable tool environments. The code and data are available at https://github.com/Foreverskyou/ToolBench-X.",
          "authors": [
            "Yang Tian",
            "Zhengpeng Shi",
            "Bo Zhao"
          ],
          "categories": [
            "cs.CL",
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25819v1",
          "abstract_url": "https://arxiv.org/abs/2606.25819v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25819v1",
          "published_at": "2026-06-24T13:34:34+00:00",
          "updated_at": "2026-06-24T13:34:34+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.25819",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25819v1"
          },
          "relevance_score": 160,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25819"
        },
        {
          "title": "Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation",
          "summary": "With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale. In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably identify harmful LLM outputs in user-model conversations without substantial performance loss relative to LLM-based judges. We benchmark these encoder classifiers against rule-based prefix matching, fine-tuned LLM classifiers, and LLM judges using a range of judge-prompting strategies across open-source adversarial datasets. The LLM judges include evaluation methodologies from StrongReject, ShieldGemma, JailbreakBench, AILuminate, SorryBench, and a Claude-as-a-judge setup, as well as fine-tuned safety classifiers such as LlamaGuard 3 and LlamaGuard 4. The encoder classifiers are fine-tuned on judge-labeled data using a majority-voting label strategy and are then evaluated on a gold-standard holdout dataset to assess their performance relative to LLM judges. We report absolute performance using F1 score, false negative rate, and precision-recall metrics. We also break down results by attack technique, including single-turn prompting, decomposition, escalation, and context manipulation, to identify where encoder classifiers align with or diverge from LLM-based judges. Our findings provide guidance on when encoder classifiers can serve as cost- and latency-efficient alternatives to LLM-based safety evaluation.",
          "authors": [
            "Han Jeon",
            "Shiv Medler",
            "Joseph Voyles",
            "Matt Wood"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25782v1",
          "abstract_url": "https://arxiv.org/abs/2606.25782v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25782v1",
          "published_at": "2026-06-24T13:00:25+00:00",
          "updated_at": "2026-06-24T13:00:25+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.25782",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25782v1"
          },
          "relevance_score": 159,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"evaluation\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "summary matched \"jailbreak\"",
            "summary matched \"guardrail\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25782"
        },
        {
          "title": "RAS: Measuring LLM Safety Through Refusal Alignment",
          "summary": "Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is expensive, sensitive to judge choice, and easily tied to fixed question banks. We propose **SafeVec**, a white-box evaluation procedure that measures safety from internal representations rather than generated answers. **SafeVec** first extracts layer-wise refusal directions from a safety-aligned reference model, then selects stable layer windows where safe and unsafe behaviors are separable, and finally scores a target model by measuring whether its hidden states align with these refusal directions under unsafe and jailbreak prompts. The resulting metric, **RAS** (**R**efusal **A**lignment **S**core), maps representation-level refusal alignment to a calibrated 0-100 safety score. Across `Llama`, `Gemma`, and `Qwen` model families, RAS separates aligned models from uncensored and abliterated variants, tracks output-level attack success rate, and is substantially faster than judge-based evaluation. These results suggest that refusal alignment provides a compact and efficient signal for white-box LLM safety evaluation.",
          "authors": [
            "Chang-Chieh Huang",
            "Yan-Lun Chen",
            "Chia-Mu Yu",
            "Wei-Bin Lee"
          ],
          "categories": [
            "cs.CR",
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25750v1",
          "abstract_url": "https://arxiv.org/abs/2606.25750v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25750v1",
          "published_at": "2026-06-24T12:19:40+00:00",
          "updated_at": "2026-06-24T12:19:40+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.25750",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25750v1"
          },
          "relevance_score": 159,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"alignment\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "summary matched \"jailbreak\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25750"
        },
        {
          "title": "Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents",
          "summary": "Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training altogether. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term progress advantage: the log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the effectiveness of the progress advantage across three different applications: test-time scaling, uncertainty quantification, and failure attribution on five benchmarks and four model families. Across all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated trained reward models. We complement these results with deeper analyses on characteristics of progress advantage, offering practical guidance for adoption in real-world agentic systems.",
          "authors": [
            "Changdae Oh",
            "Wendi Li",
            "Seongheon Park",
            "Samuel Yeh",
            "Tanwi Mallick",
            "Sharon Li"
          ],
          "categories": [
            "cs.LG",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26080v1",
          "abstract_url": "https://arxiv.org/abs/2606.26080v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26080v1",
          "published_at": "2026-06-24T17:54:08+00:00",
          "updated_at": "2026-06-24T17:54:08+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.26080",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26080v1"
          },
          "relevance_score": 146,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26080"
        },
        {
          "title": "Explainable Control Framework (XCF) based on Fuzzy Model-Agnostic Explanation and LLM Agent-Supported Interface",
          "summary": "Increasing demand for precise and reliable control in complex scenarios has led to the development of increasingly sophisticated controllers, including data-driven approaches employing closed box models and mathematically rigorous yet complex designs. This complexity highlights the need for explainable control that can provide human-understandable insights into controller behavior. In this paper, an explainable control framework (XCF) along with supporting algorithms and a user interface are proposed to explain how controllers determine their control actions and their underlying working mechanism. The novel contributions of this work are threefold: First, the XCF is designed to provide model-agnostic explanations for controllers in closed-loop systems and can optionally refine local explanations by system response dynamics. Second, a novel explanation method, hierarchical fuzzy model-agnostic explanation for control systems (HFMAE-C), is proposed based on the designed framework. The HFMAE-C employs a fuzzy logic system to approximate the controller's behavior and system dynamics, providing sample, local, domain and universe level explanations via IF-THEN rules revealing the controller's decision logic and salience values quantifying the contribution of system states to control actions. Third, a large language model agent-supported user interface is developed to automatically analyze user requirements, select appropriate algorithms, interpret the generated explanations into a natural language report, and provide interactive consultation. Case studies on an inverted pendulum system and Turtlebot obstacle avoidance demonstrate the effectiveness of the proposed method through simulated user experiments and quantitative comparisons with mainstream explainable control approaches.",
          "authors": [
            "Faliang Yin",
            "Hak-Keung Lam",
            "David Watson"
          ],
          "categories": [
            "cs.HC",
            "cs.AI",
            "eess.SY"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25941v1",
          "abstract_url": "https://arxiv.org/abs/2606.25941v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25941v1",
          "published_at": "2026-06-24T15:17:54+00:00",
          "updated_at": "2026-06-24T15:17:54+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.25941",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25941v1"
          },
          "relevance_score": 144,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25941"
        },
        {
          "title": "SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models",
          "summary": "As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations of machine emotional intelligence assess reasoning exclusively through isolated text or passive acoustic perception, overlooking the complex cross-modal reasoning required for active, multi-turn dialogue. We introduce SpeechEQ, a comprehensive framework designed to evaluate the sociolinguistic reasoning of Speech-Language Models (SLMs). The framework includes a validated dataset of 2,265 dialogues across 15 Emotional Quotient (EQ) subscales grounded in EQ-i 2.0 theory, along with a multi-turn evaluation protocol measured by our proposed Spoken EQ (SEQ) score inspired by human EQ assessments. Experiments show limitations in how both existing Speech Emotion Recognition and end-to-end Speech-Language Models understand and apply paralinguistic cues through speech. While end-to-end architectures outperform cascaded systems, SpeechEQ reveals that current multimodal models remain bottlenecked by a text-reliant \"modality shortcut,\" an alignment-induced \"safety trap,\" and \"contextual amnesia,\" highlighting the barriers to truly emotionally aware AI. Our benchmark can be accessed at https://huggingface.co/datasets/SpeechEQ/SpeechEQ and demo page at https://binomial14.github.io/speecheq-demo/",
          "authors": [
            "Liang-Yuan Wu",
            "Zih-Ching Chen",
            "Tongshuang Wu",
            "Chao-Han Huck Yang",
            "Hua Shen"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.SD"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25990v1",
          "abstract_url": "https://arxiv.org/abs/2606.25990v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25990v1",
          "published_at": "2026-06-24T16:03:38+00:00",
          "updated_at": "2026-06-24T16:03:38+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.25990",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25990v1"
          },
          "relevance_score": 140,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25990"
        },
        {
          "title": "MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources",
          "summary": "Achieving strong optimization generalization across diverse optimization problems while requiring limited training resources remains a challenging problem for optimization-oriented large language models (LLMs). Existing approaches typically rely on large-scale supervised datasets, costly reasoning annotations, and expensive intermediate step verification, resulting in substantial training overhead. To address these challenges, we propose MiniOpt, a reinforcement learning framework that learns to solve optimization problems through a \"reasoning-to-model-and-solve\" paradigm. MiniOpt decomposes optimization reasoning into structured optimization modeling and executable solver generation. Building upon this paradigm, we introduce OptReward, a reward function with a hierarchical score structure that jointly evaluates formulation and solution, enabling effective policy learning without expert demonstrations. We further develop an optimization-oriented policy optimization strategy that improves exploration efficiency and stabilizes reinforcement learning for compact models. Extensive experiments show that MiniOpt-3B exhibits strong optimization generalization across various optimization types, problem scenarios, and task domains. For models with fewer than 10B parameters, the MiniOpt series achieves the highest average solving accuracy (SA). For models with more than 10B parameters, MiniOpt still shows competitive performance. These results suggest that optimization-oriented reward design and reinforcement learning provide an effective pathway for developing compact optimization-specialized language models with strong optimization generalization capabilities. The code is available at https://github.com/Hsiang-1/MiniOpt.",
          "authors": [
            "Ke Zhao",
            "Zixiang Di",
            "Hong Qian",
            "Xiang Shu",
            "Yaolin Wen",
            "Qitao Shi",
            "Bingdong Li",
            "Xingyu Lu",
            "Xiangfeng Wang",
            "Jun Zhou",
            "Ke Tang",
            "Yang Yu"
          ],
          "categories": [
            "cs.LG",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25832v1",
          "abstract_url": "https://arxiv.org/abs/2606.25832v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25832v1",
          "published_at": "2026-06-24T13:48:06+00:00",
          "updated_at": "2026-06-24T13:48:06+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.25832",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25832v1"
          },
          "relevance_score": 138,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25832"
        },
        {
          "title": "SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment",
          "summary": "Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource languages, which suffer from a scarcity of high-quality training data, often have their tokens routed to different experts than those predominantly activated by high-resource inputs, which limits cross-lingual expert sharing. This cross-lingual routing divergence consequently hinders their efficacy in multilingual contexts. To address this issue, we propose SARA (Semantically Anchored Routing Alignment), a framework designed to transfer specialized capabilities from high-resource languages as anchors to low-resource languages. SARA explicitly aligns the routing distribution of multilingual inputs with high-resource semantic anchors using a symmetric Jensen-Shannon (JS) divergence constraint. Unlike traditional distillation methods that operate on output logits, SARA directly aligns the internal routing distributions of MoE layers, encouraging mechanistic consistency in expert selection across languages. We conduct experiments on 2 LLMs across 5 low-resource languages and 3 benchmarks. Experimental results demonstrate that SARA outperforms standard instruction tuning, e.g., +0.8% on Qwen3-30B-A3B and +1.2% on Phi-3.5-MoE-instruct on Global-MMLU. Further analyses show that SARA effectively addresses performance bottlenecks in low-resource languages, providing a scalable pathway to enhance multilingual capabilities in sparse architectures.",
          "authors": [
            "Tianyu Dong",
            "Yangyang Liu",
            "Jiang Zhou",
            "Xinwei Wu",
            "Xiaohu Zhao",
            "Hao Wang",
            "Heng Liu",
            "Linlong Xu",
            "Longyue Wang",
            "Weihua Luo",
            "Shaolin Zhu",
            "Deyi Xiong"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25821v1",
          "abstract_url": "https://arxiv.org/abs/2606.25821v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25821v1",
          "published_at": "2026-06-24T13:36:46+00:00",
          "updated_at": "2026-06-24T13:36:46+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.25821",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25821v1"
          },
          "relevance_score": 138,
          "match_reasons": [
            "title matched \"alignment\"",
            "summary matched \"LLM\"",
            "summary matched \"instruction tuning\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25821"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "\"How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring\" [Method]: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated j…",
        "\"The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems\" [Method]: AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls in…",
        "\"AI Snitches Get Glitches: Towards Evading Agentic Surveillance\" [Data / Method]: To better assist users with completing challenging tasks, AI agents mediate communications, access data, and interact with different APIs. Many employers (and…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring",
          "summary": "Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite ways. The dedicated classifier over-flags (precision 0.835, recall 0.974); three different LLM-as-judges keep high precision (0.81 to 0.94) but show erratic recall (0.06 to 0.65), so the same responses produce very different ASR depending on which judge scores them. The two families also differ sharply in robustness. Wrappers that leave the harmful text untouched and only add benign framing flip every LLM-judge between 57% and 100% of the time, and a single prepended refusal sentence accounts for much of this (39% to 88%). The dedicated classifier resists these surface attacks (at most 6.7%), but a white-box GCG attack on its open weights flips 70% of confident true positives (21 of 30; 95% CI 54 to 86%) even at a small optimization budget. A two-annotator audit confirms the attacks leave the harm intact: every one of 80 sampled flips still contained the harmful content. Because a large and growing share of reported ASR comes from LLM-judges, many such numbers are unreliable both on average and under deliberate pressure. We recommend that papers report judge precision and recall on a human-labeled slice, report ASR corrected for judge precision, and include an adversarial check of the judge. Our code is released.",
          "authors": [
            "Yang Gao"
          ],
          "categories": [
            "cs.CL",
            "cs.CR",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25487v1",
          "abstract_url": "https://arxiv.org/abs/2606.25487v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25487v1",
          "published_at": "2026-06-24T07:14:17+00:00",
          "updated_at": "2026-06-24T07:14:17+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Method"
          ],
          "topics": [
            "LLM",
            "RAG"
          ],
          "doi": null,
          "arxiv_id": "2606.25487",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25487v1"
          },
          "relevance_score": 78,
          "match_reasons": [
            "title matched \"jailbreak\"",
            "summary matched \"prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25487"
        },
        {
          "title": "The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems",
          "summary": "AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls inside the agent's own runtime: system prompts, output filters, and guardrail libraries. Any control in the agent's address space is reachable by inputs that influence it; this generalizes to any AI system with sufficient reach into its own runtime, a class we term escapable AI systems. We identify four properties that an authorization mechanism must satisfy for architectural control rather than for cooperative requests: process separation, pre-action enforcement on a structurally only path, fail-closed at both the request and system levels, and externalized signed evidence verifiable outside the controlled system's trust boundary. We position this layer as execution-time AI alignment, complementing training-time alignment (RLHF, Constitutional AI) and inference-time alignment. We present the Unfireable Safety Kernel, a Rust reference implementation realizing all four. Its fail-closed invariant is machine-checked at two levels: an SMT theorem (Z3) and an exhaustive bounded-model-checking proof of the production decision function (Kani, 4/4 harnesses). A Python-to-Rust migration was gated on byte-equivalence (1000/1000 fixtures; 17/17 adversarial classes). We evaluate the kernel governing a live, escapable AI system, a deterministic, self-improving world model, against an escape-seeking adversary driving its real self-modification seam: across 1,000 self-modifications, all 704 attempts on the safety-critical core are refused, with no escape; a further 300, under the operator kill switch, are also refused. A separate campaign of 6,240 authorization round-trips had no successful bypass. Against 3 contemporary systems claiming the agent control plane, the agent invokes control; here, it lacks that choice.",
          "authors": [
            "Seth Dobrin",
            "Łukasz Chmiel"
          ],
          "categories": [
            "cs.AI",
            "cs.CR",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26057v1",
          "abstract_url": "https://arxiv.org/abs/2606.26057v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26057v1",
          "published_at": "2026-06-24T17:32:27+00:00",
          "updated_at": "2026-06-24T17:32:27+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Method"
          ],
          "topics": [
            "Agent",
            "Alignment"
          ],
          "doi": null,
          "arxiv_id": "2606.26057",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26057v1"
          },
          "relevance_score": 48,
          "match_reasons": [
            "summary matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26057"
        },
        {
          "title": "AI Snitches Get Glitches: Towards Evading Agentic Surveillance",
          "summary": "To better assist users with completing challenging tasks, AI agents mediate communications, access data, and interact with different APIs. Many employers (and even nation-states) already provide their users with this technology. However, widespread adoption of AI agents creates a new risk to abuse access to user data for another goal: surveilling users. These users might not even have the ability or permission to control the actions and data accesses of the surveilling agents. We introduce and formalize the problem of agentic surveillance: the ability of an AI agent to analyze available information, craft a report, and send it out using available tools. To evaluate surveillance capabilities across different models, we create SurveilBench, a dataset of various reporting scenarios focusing on three domains: corporate, education, and police. We find that some models exhibit emergent (i.e., unprompted) tendencies to help surveillance, but they also report the attempts to surveil users to the government. Finally, we repurpose prompt injections for evading surveillance and develop three evasion techniques that hide from, deceive, or induce over-escalation in surveillance agents. We conclude that agentic surveillance can already be easily implemented and, therefore, call for a comprehensive technical, ethical, and legislative framework to protect users.",
          "authors": [
            "Hyejun Jeong",
            "Dzung Pham",
            "Amir Houmansadr",
            "Eugene Bagdasarian"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25836v1",
          "abstract_url": "https://arxiv.org/abs/2606.25836v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25836v1",
          "published_at": "2026-06-24T13:50:22+00:00",
          "updated_at": "2026-06-24T13:50:22+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Data",
            "Method"
          ],
          "topics": [
            "Agent",
            "Prompt Injection"
          ],
          "doi": null,
          "arxiv_id": "2606.25836",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25836v1"
          },
          "relevance_score": 44,
          "match_reasons": [
            "summary matched \"prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25836"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [
        "\"Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution\" [Evaluation / Application / Method]: Resolving issues with ambiguous and incomplete descriptions, particularly concerning complex bugs, requires a sophisticated, long-horizon workflow. Agents must…",
        "\"Evaluating LLMs on Real-World Software Performance Optimization\" [Evaluation / Method]: Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we sti…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution",
          "summary": "Resolving issues with ambiguous and incomplete descriptions, particularly concerning complex bugs, requires a sophisticated, long-horizon workflow. Agents must navigate codebases to locate the root cause, reproduce the failure, implement a fix, and validate the resulting patch. Inefficient context management, thereby, can lead to rapid context degradation and context poisoning, preventing successful resolution. We propose icat-agent, a decentralized, multi-agent scaffolding that replaces shared context with synchronous, event-based message passing. Utilizing a rubric-based issue quality check, icat-agent strategically pivots its workflow: it initiates parallel patching and validation for well-defined issues, while deploying preliminary exploration for low-quality ones. A comprehensive evaluation of icat-agent on SWE-bench Verified and SWE-bench Pro demonstrates that it consistently outperforms prominent baselines across all difficulty levels, including SWE-agent, mini-SWE-agent, and Claude Code, while using the same underlying models, improving by 3.6-8.4% on SWE-bench Verified and 6.3-18.5% on SWE-bench Pro. icat-agent is also computationally efficient, reducing the average cost by $1.18 per instance compared with the multi-agent Claude Code baseline. Our findings reveal that a robust scaffold such as icat-agent unlocks substantial latent capability within a fixed model, with the same backbone resolving markedly more issues under icat-agent than under existing scaffolds. icat-agent +GPT-5.4-xhigh resolves 67.4% of SWE-bench Pro problems, outperforming the current best result on SWE-bench Pro (59.10%, mini-SWE-agent+GPT-5.4-xhigh) by 8.3 percentage points.",
          "authors": [
            "Yang Chen",
            "Aliya Ahmad",
            "Yiheng Zhou",
            "Reyhaneh Jabbarvand"
          ],
          "categories": [
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25514v1",
          "abstract_url": "https://arxiv.org/abs/2606.25514v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25514v1",
          "published_at": "2026-06-24T07:48:05+00:00",
          "updated_at": "2026-06-24T07:48:05+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "Evaluation",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.25514",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25514v1"
          },
          "relevance_score": 78,
          "match_reasons": [
            "title matched \"issue resolution\"",
            "summary matched \"SWE-bench\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25514"
        },
        {
          "title": "Evaluating LLMs on Real-World Software Performance Optimization",
          "summary": "Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in real-world codebases. Existing frameworks often oversimplify the problem by focusing on isolated functions or a single performance metric, missing the critical trade-offs between execution time and memory footprint, the inherent noise of the measurement environment, and the variability introduced by different input data and execution conditions. We address this by introducing SWE-Pro, a repository-level benchmark derived from 102 expert-written optimizations from open-source projects. Unlike previous benchmarks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage (TWMU) across varying input data and execution conditions under noise-aware measurement conditions. Our evaluation shows that current LLMs struggle significantly: runtime gains are negligible, and memory optimizations are nearly non-existent. This stands in sharp contrast to expert implementations, which achieve an aggregate speedup of 15.5x and peak memory reduction of 171.3x over benchmark tasks. Expert-written improvements are observed in 91.2% of tasks for runtime and 65.7% for peak memory. Our findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.",
          "authors": [
            "Ezgi Sarıkayak",
            "Wenchao Gu",
            "Hesham Ghonim",
            "Chunyang Chen"
          ],
          "categories": [
            "cs.SE",
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.25530v1",
          "abstract_url": "https://arxiv.org/abs/2606.25530v1",
          "pdf_url": "https://arxiv.org/pdf/2606.25530v1",
          "published_at": "2026-06-24T08:07:41+00:00",
          "updated_at": "2026-06-24T08:07:41+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.25530",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.25530v1"
          },
          "relevance_score": 38,
          "match_reasons": [
            "summary matched \"repository-level\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.25530"
        }
      ]
    }
  ]
}