{
  "generated_at": "2026-04-15T11:35:50.794093+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LLM",
        "sort_by": "hybrid"
      },
      {
        "name": "Vision",
        "sort_by": "hybrid"
      },
      {
        "name": "PubMed AI",
        "sort_by": "hybrid"
      },
      {
        "name": "OpenAlex AI",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "主题「Benchmark」：命中 15 篇，覆盖 LLM、Vision 等，代表论文包括 《Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents》、《Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss》。",
    "主题「Reasoning」：命中 13 篇，覆盖 LLM、Vision 等，代表论文包括 《Parallax: Why AI Agents That Think Must Never Act》、《Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents》。",
    "主题「Language Model」：命中 10 篇，覆盖 LLM、Vision 等，代表论文包括 《AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance》、《Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration》。"
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "Benchmark",
      "paper_count": 15,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents",
        "Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss",
        "Towards Long-horizon Agentic Multimodal Search",
        "QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence",
        "ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search",
        "Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning",
        "Do VLMs Truly \"Read\" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting",
        "Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training",
        "AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance",
        "Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration",
        "Toward Autonomous Long-Horizon Engineering for ML Research",
        "RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation",
        "Generative Refinement Networks for Visual Synthesis",
        "VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model.",
        "Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment."
      ],
      "key_points": [
        "《Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents》〔评测 / 方法〕：LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session…",
        "《Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss》〔评测 / 方法〕：Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular re…"
      ]
    },
    {
      "name": "Reasoning",
      "paper_count": 13,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "Parallax: Why AI Agents That Think Must Never Act",
        "Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents",
        "Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss",
        "Towards Long-horizon Agentic Multimodal Search",
        "ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search",
        "Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning",
        "Do VLMs Truly \"Read\" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting",
        "Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs",
        "LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems",
        "Toward Autonomous Long-Horizon Engineering for ML Research",
        "All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding",
        "Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks",
        "Comparison of AI-based Chatbot Performance in Analyzing Clinical Scenarios versus Medical Residents: A Novel Approach in Chest Diseases Education."
      ],
      "key_points": [
        "《Parallax: Why AI Agents That Think Must Never Act》〔评测 / 应用 / 方法〕：Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will…",
        "《Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents》〔评测 / 方法〕：LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 10,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI",
        "OpenAlex AI"
      ],
      "paper_titles": [
        "AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance",
        "Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration",
        "All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding",
        "Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks",
        "VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model.",
        "Multimodal large language models in brain tumor imaging: clinical applications and future perspectives.",
        "Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment.",
        "User Experience and Early Clinical Outcomes of a Mental Wellness Chatbot for Depression and Anxiety: Pilot Evaluation Mixed Methods Study.",
        "Comparison of AI-based Chatbot Performance in Analyzing Clinical Scenarios versus Medical Residents: A Novel Approach in Chest Diseases Education.",
        "Demystifying Attitudes and Effects of Usage of Large-Language Models Among College-Aged Students"
      ],
      "key_points": [
        "《AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance》〔评测 / 数据 / 应用 / 方法〕：The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurem…",
        "《Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration》〔评测 / 数据 / 方法〕：The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often e…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 7,
      "feed_names": [
        "LLM",
        "PubMed AI"
      ],
      "paper_titles": [
        "Parallax: Why AI Agents That Think Must Never Act",
        "QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence",
        "Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training",
        "Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs",
        "LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems",
        "EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution",
        "User Experience and Early Clinical Outcomes of a Mental Wellness Chatbot for Depression and Anxiety: Pilot Evaluation Mixed Methods Study."
      ],
      "key_points": [
        "《Parallax: Why AI Agents That Think Must Never Act》〔评测 / 应用 / 方法〕：Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will…",
        "《QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence》〔评测 / 应用 / 方法〕：As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, b…"
      ]
    },
    {
      "name": "Multimodal",
      "paper_count": 3,
      "feed_names": [
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation",
        "Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation",
        "Multimodal large language models in brain tumor imaging: clinical applications and future perspectives."
      ],
      "key_points": [
        "《RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation》〔评测 / 方法〕：Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sen…",
        "《Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation》〔应用 / 方法〕：Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises f…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LLM",
      "key_points": [
        "《Parallax: Why AI Agents That Think Must Never Act》〔评测 / 应用 / 方法〕：Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will…",
        "《Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents》〔评测 / 方法〕：LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session…",
        "《Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss》〔评测 / 方法〕：Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular re…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Parallax: Why AI Agents That Think Must Never Act",
          "summary": "Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.",
          "authors": [
            "Joel Fokou"
          ],
          "categories": [
            "cs.CR",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12986v1",
          "abstract_url": "https://arxiv.org/abs/2604.12986v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12986v1",
          "published_at": "2026-04-14T17:20:48+00:00",
          "updated_at": "2026-04-14T17:20:48+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Reasoning",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.12986",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12986v1"
          },
          "relevance_score": 107,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12986"
        },
        {
          "title": "Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents",
          "summary": "LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspired by the drawing effect [3], we introduce dual-trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating richer, more distinctive memory traces. Using the LongMemEval-S benchmark (4,575 sessions, 100 recall questions), we compare dual-trace encoding against a fact-only control with matched coverage and format over 99 shared questions. Dual-trace achieves 73.7% overall accuracy versus 53.5%, a +20.2 percentage point (pp) gain (95% CI: [+12.1, +29.3], bootstrap p < 0.0001). Gains concentrate in temporal reasoning (+40pp), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp), with no benefit for single-session retrieval, consistent with encoding specificity theory [8]. Token analysis shows dual-trace encoding achieves this gain at no additional cost. We additionally sketch an architectural design for adapting dual-trace encoding to coding agents, with preliminary pilot validation.",
          "authors": [
            "Benjamin Stern",
            "Peter Nadel"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12948v1",
          "abstract_url": "https://arxiv.org/abs/2604.12948v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12948v1",
          "published_at": "2026-04-14T16:45:06+00:00",
          "updated_at": "2026-04-14T16:45:06+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.12948",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12948v1"
          },
          "relevance_score": 107,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12948"
        },
        {
          "title": "Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss",
          "summary": "Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. Round-trip translation correlates almost perfectly (ρ = 0.94) with user ratings on LMArena with our benchmark, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.",
          "authors": [
            "Ronald Skorobogat",
            "Ameya Prabhu",
            "Matthias Bethge"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12911v1",
          "abstract_url": "https://arxiv.org/abs/2604.12911v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12911v1",
          "published_at": "2026-04-14T15:58:21+00:00",
          "updated_at": "2026-04-14T15:58:21+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.12911",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12911v1"
          },
          "relevance_score": 106,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12911"
        },
        {
          "title": "Towards Long-horizon Agentic Multimodal Search",
          "summary": "Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.",
          "authors": [
            "Yifan Du",
            "Zikang Liu",
            "Jinbiao Peng",
            "Jie Wu",
            "Junyi Li",
            "Jinyang Li",
            "Wayne Xin Zhao",
            "Ji-Rong Wen"
          ],
          "categories": [
            "cs.CV",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12890v1",
          "abstract_url": "https://arxiv.org/abs/2604.12890v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12890v1",
          "published_at": "2026-04-14T15:40:28+00:00",
          "updated_at": "2026-04-14T15:40:28+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.12890",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12890v1"
          },
          "relevance_score": 106,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "title matched \"multimodal\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12890"
        },
        {
          "title": "QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence",
          "summary": "As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, building upon Tongyi DeepResearch, a powerful agentic foundation model, we focus on the Chinese medical deep search scenario and propose QuarkMedSearch, systematically exploring a full-pipeline approach spanning medical multi-hop data construction, training strategies, and evaluation benchmarks to further push and assess its performance upper bound in vertical domains. Specifically, for data synthesis, to address the scarcity of deep search training data in the medical domain, we combine a large-scale medical knowledge graph with real-time online exploration to construct long-horizon medical deep search training data; for post-training, we adopt a two-stage SFT and RL training strategy that progressively enhances the model's planning, tool invocation, and reflection capabilities required for deep search, while maintaining search efficiency; for evaluation, we collaborate with medical experts to construct the QuarkMedSearch Benchmark through rigorous manual verification. Experimental results demonstrate that QuarkMedSearch achieves state-of-the-art performance among open-source models of comparable scale on the QuarkMedSearch Benchmark, while also maintaining strong competitiveness on general benchmarks.",
          "authors": [
            "Zhichao Lin",
            "Zhichao Liang",
            "Gaoqiang Liu",
            "Meng Xu",
            "Baoyu Xiang",
            "Jian Xu",
            "Guanjun Jiang"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12867v1",
          "abstract_url": "https://arxiv.org/abs/2604.12867v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12867v1",
          "published_at": "2026-04-14T15:17:21+00:00",
          "updated_at": "2026-04-14T15:17:21+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.12867",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12867v1"
          },
          "relevance_score": 105,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12867"
        },
        {
          "title": "ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search",
          "summary": "We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.",
          "authors": [
            "Myungchul Kim",
            "Kwanyong Park",
            "Junmo Kim",
            "In So Kweon"
          ],
          "categories": [
            "cs.CV",
            "cs.AI",
            "cs.MA"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12762v1",
          "abstract_url": "https://arxiv.org/abs/2604.12762v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12762v1",
          "published_at": "2026-04-14T14:06:19+00:00",
          "updated_at": "2026-04-14T14:06:19+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.12762",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12762v1"
          },
          "relevance_score": 104,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12762"
        },
        {
          "title": "Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning",
          "summary": "LLM-based autonomous agents perform well on general reasoning tasks but still struggle to reliably use task structure, key constraints, and prior experience in complex real-world settings. We propose a case-based learning framework that converts experience from past tasks into reusable knowledge assets, allowing agents to transfer prior case experience to new tasks and perform more structured analysis. Unlike methods based mainly on pretrained knowledge or static prompts, our framework emphasizes extracting and reusing task-relevant knowledge, analytical prompts, and operational skills from real cases. We evaluate the method on a unified benchmark of six complex task categories and compare it with Zero-Shot, Few-Shot, Checklist Prompt, and Rule Memory baselines. Results show that our method achieves consistently strong performance across all tasks and matches or outperforms the best baseline in every case, with especially clear gains on more complex tasks. Further analysis shows that the advantage of case-based learning increases with task complexity, and that practical knowledge acquired by one agent can be reused by others. These findings suggest that case-based learning offers a promising path for building professional agents for real-world work.",
          "authors": [
            "Zhenyu Ma",
            "Yuyang Song",
            "Chunyi Yang",
            "Jingyi Zhu",
            "Letian Yang",
            "Xukai Jiang"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12717v1",
          "abstract_url": "https://arxiv.org/abs/2604.12717v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12717v1",
          "published_at": "2026-04-14T13:31:47+00:00",
          "updated_at": "2026-04-14T13:31:47+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.12717",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12717v1"
          },
          "relevance_score": 103,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12717"
        },
        {
          "title": "Do VLMs Truly \"Read\" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting",
          "summary": "Vision-language models(VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock price in candlestick charts. First, prior studies fail to isolate VLMs' comprehension of visual inputs genuinely improves predictive performance and whether VLMs truly comprehend candlestick patterns. Further, most existing datasets and evaluation setups are designed around single-period or tabular inputs. However, human analysts strongly rely on multi-scale candlestick charts, where longer-term horizons capture trend direction and shorter-term horizons provide cues for inflection points, making it difficult to systematically assess VLMs' ability to integrate short-term and long-term visual market dynamics. To bridge this gap, we construct a multi-scale candlestick charts dataset and a standardized evaluation framework to assess VLMs' ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient(IC) time series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to leverage multi-scale stock price data. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.",
          "authors": [
            "Kaiqi Hu",
            "Linda Xiao",
            "Shiyue Xu",
            "Ziyi Tang",
            "Mingwen Liu"
          ],
          "categories": [
            "cs.LG",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12659v1",
          "abstract_url": "https://arxiv.org/abs/2604.12659v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12659v1",
          "published_at": "2026-04-14T12:26:34+00:00",
          "updated_at": "2026-04-14T12:26:34+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.12659",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12659v1"
          },
          "relevance_score": 102,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12659"
        },
        {
          "title": "Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training",
          "summary": "Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.",
          "authors": [
            "Sohyun An",
            "Shuibenyang Yuan",
            "Hayeon Lee",
            "Cho-Jui Hsieh",
            "Alexander Min"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12967v1",
          "abstract_url": "https://arxiv.org/abs/2604.12967v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12967v1",
          "published_at": "2026-04-14T17:00:18+00:00",
          "updated_at": "2026-04-14T17:00:18+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.12967",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12967v1"
          },
          "relevance_score": 89,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12967"
        },
        {
          "title": "Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs",
          "summary": "Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs, it is how tool outputs are represented. We introduce Perception Programs (P$^2$), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P$^2$ consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P$^2$ raises its accuracy from 41.35\\% to 86.47\\% on multi-view reasoning, from 52.42\\% to 81.45\\% on relative depth, and achieves a 22\\% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40\\% absolute gains from P$^2$, surpassing prior agentic, supervised, and RL-based tool-use methods-without any training or model modifications.",
          "authors": [
            "Muhammad Kamran Janjua",
            "Hugo Silva",
            "Di Niu",
            "Bahador Rashidi"
          ],
          "categories": [
            "cs.CV",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12896v1",
          "abstract_url": "https://arxiv.org/abs/2604.12896v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12896v1",
          "published_at": "2026-04-14T15:45:22+00:00",
          "updated_at": "2026-04-14T15:45:22+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Reasoning",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.12896",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12896v1"
          },
          "relevance_score": 88,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12896"
        },
        {
          "title": "AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance",
          "summary": "The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field's main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.",
          "authors": [
            "Abiodun A. Solanke"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12875v1",
          "abstract_url": "https://arxiv.org/abs/2604.12875v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12875v1",
          "published_at": "2026-04-14T15:26:03+00:00",
          "updated_at": "2026-04-14T15:26:03+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.12875",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12875v1"
          },
          "relevance_score": 87,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12875"
        },
        {
          "title": "LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems",
          "summary": "The rapid advancement of AI has changed the character of HPC usage such as dimensioning, provisioning, and execution. Not only has energy demand been amplified, but existing rudimentary continual learning capabilities limit ability of AI to effectively manage HPCs. This paper reviews emerging directions beyond monolithic transformers, emphasizing agentic AI and brain inspired architectures as complementary paths toward sustainable, adaptive systems. We propose LIFE, a reasoning and Learning framework that is Incremental, Flexible, and Energy efficient that is implemented as an agent centric system rather than a single monolithic model. LIFE uniquely combines four components to realize self evolving network management and operations in HPCs. The components are an orchestrator, Agentic Context Engineering, a novel memory system, and information lattice learning. LIFE can also generalize to enable a variety of orthogonal use cases. We ground LIFE in a specific closed loop HPC operations example for detecting and mitigating latency spikes experienced by critical micro services running on a Kubernetes like cluster.",
          "authors": [
            "Anne Lee",
            "Gurudutt Hosangadi"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12874v1",
          "abstract_url": "https://arxiv.org/abs/2604.12874v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12874v1",
          "published_at": "2026-04-14T15:23:36+00:00",
          "updated_at": "2026-04-14T15:23:36+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Reasoning",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.12874",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12874v1"
          },
          "relevance_score": 87,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12874"
        },
        {
          "title": "Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration",
          "summary": "The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large-scale experiments on more than $400$ models, our framework predicts full-evaluation performance within 2-3 percentage points using only $100$ anchor questions per dataset, with Spearman $ρ\\geq 0.9$ for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at https://github.com/eliyahabba/growing-pains",
          "authors": [
            "Eliya Habba",
            "Itay Itzhak",
            "Asaf Yehudai",
            "Yotam Perlitz",
            "Elron Bandel",
            "Michal Shmueli-Scheuer",
            "Leshem Choshen",
            "Gabriel Stanovsky"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12843v1",
          "abstract_url": "https://arxiv.org/abs/2604.12843v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12843v1",
          "published_at": "2026-04-14T15:01:29+00:00",
          "updated_at": "2026-04-14T15:01:29+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.12843",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12843v1"
          },
          "relevance_score": 87,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12843"
        },
        {
          "title": "Toward Autonomous Long-Horizon Engineering for ML Research",
          "summary": "Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that File-as-Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points when removed. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.",
          "authors": [
            "Guoxin Chen",
            "Jie Chen",
            "Lei Chen",
            "Jiale Zhao",
            "Fanzhe Meng",
            "Wayne Xin Zhao",
            "Ruihua Song",
            "Cheng Chen",
            "Ji-Rong Wen",
            "Kai Jia"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.13018v1",
          "abstract_url": "https://arxiv.org/abs/2604.13018v1",
          "pdf_url": "https://arxiv.org/pdf/2604.13018v1",
          "published_at": "2026-04-14T17:55:16+00:00",
          "updated_at": "2026-04-14T17:55:16+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.13018",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.13018v1"
          },
          "relevance_score": 86,
          "match_reasons": [
            "summary matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.13018"
        },
        {
          "title": "EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution",
          "summary": "Realizing endogenous narrative evolution in LLM-based multi-agent systems is hindered by the inherent stochasticity of generative emergence. In particular, long-horizon simulations suffer from social memory stacking, where conflicting relational states accumulate without resolution, and narrative-spatial dissonance, where spatial logic detaches from the evolving plot. To bridge this gap, we propose EvoSpark, a framework specifically designed to sustain logically coherent long-horizon narratives within Endogenous Interactive Agent Societies. To ensure consistency, the Stratified Narrative Memory employs a Role Socio-Evolutionary Base as living cognition, dynamically metabolizing experiences to resolve historical conflicts. Complementarily, Generative Mise-en-Scène mechanism enforces Role-Location-Plot alignment, synchronizing character presence with the narrative flow. Underpinning these is the Unified Narrative Operation Engine, which integrates an Emergent Character Grounding Protocol to transform stochastic sparking into persistent characters. This engine establishes a substrate that expands a minimal premise into an open-ended, evolving story world. Experiments demonstrate that EvoSpark significantly outperforms baselines across diverse paradigms, enabling the sustained generation of expressive and coherent narrative experiences.",
          "authors": [
            "Shiyu He",
            "Minchi Kuang",
            "Mengxian Wang",
            "Bin Hu",
            "Tingxiang Gu"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12776v1",
          "abstract_url": "https://arxiv.org/abs/2604.12776v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12776v1",
          "published_at": "2026-04-14T14:16:06+00:00",
          "updated_at": "2026-04-14T14:16:06+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Agent",
            "Alignment"
          ],
          "doi": null,
          "arxiv_id": "2604.12776",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12776v1"
          },
          "relevance_score": 86,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"alignment\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12776"
        }
      ]
    },
    {
      "name": "Vision",
      "key_points": [
        "《RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation》〔评测 / 方法〕：Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sen…",
        "《All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding》〔数据 / 应用 / 方法〕：Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, q…",
        "《Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation》〔应用 / 方法〕：Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises f…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation",
          "summary": "Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhancing informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, resulting 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.",
          "authors": [
            "Guoan Xu",
            "Yang Xiao",
            "Guangwei Gao",
            "Dongchen Zhu",
            "Wenjing Jia",
            "Guo-Jun Qi"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12319v1",
          "abstract_url": "https://arxiv.org/abs/2604.12319v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12319v1",
          "published_at": "2026-04-14T05:51:15+00:00",
          "updated_at": "2026-04-14T05:51:15+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Multimodal"
          ],
          "doi": null,
          "arxiv_id": "2604.12319",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12319v1"
          },
          "relevance_score": 100,
          "match_reasons": [
            "title matched \"multimodal\"",
            "title matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12319"
        },
        {
          "title": "All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding",
          "summary": "Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in real-world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach in three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.",
          "authors": [
            "Tanzila Rahman",
            "Renjie Liao",
            "Leonid Sigal"
          ],
          "categories": [
            "cs.CV",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12335v1",
          "abstract_url": "https://arxiv.org/abs/2604.12335v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12335v1",
          "published_at": "2026-04-14T06:17:35+00:00",
          "updated_at": "2026-04-14T06:17:35+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Reasoning",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.12335",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12335v1"
          },
          "relevance_score": 78,
          "match_reasons": [
            "title matched \"multimodal\"",
            "summary matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12335"
        },
        {
          "title": "Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation",
          "summary": "Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterogeneity: many clinical sites possess only a subset of modalities due to resource constraints or workflow variations. Existing approaches address this through feature imputation networks that synthesize missing modality representations, yet these methods produce point estimates without reliability measures, forcing downstream classifiers to treat all imputed features as equally trustworthy. In safety-critical medical applications, this limitation poses significant risks. We propose the Probabilistic Feature Imputation Network (P-FIN), which outputs calibrated uncertainty estimates alongside imputed features. This uncertainty is leveraged at two levels: (1) locally, through sigmoid gating that attenuates unreliable feature dimensions before classification, and (2) globally, through Fed-UQ-Avg, an aggregation strategy that prioritizes updates from clients with reliable imputation. Experiments on federated chest X-ray classification using CheXpert, NIH Open-I, and PadChest demonstrate consistent improvements over deterministic baselines, with +5.36% AUC gain in the most challenging configuration.",
          "authors": [
            "Nafis Fuad Shahid",
            "Maroof Ahmed",
            "Md Akib Haider",
            "Saidur Rahman Sagor",
            "Aashnan Rahman",
            "Md Azam Hossain"
          ],
          "categories": [
            "eess.IV",
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12970v1",
          "abstract_url": "https://arxiv.org/abs/2604.12970v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12970v1",
          "published_at": "2026-04-14T17:03:14+00:00",
          "updated_at": "2026-04-14T17:03:14+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Multimodal",
            "Clinical"
          ],
          "doi": null,
          "arxiv_id": "2604.12970",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12970v1"
          },
          "relevance_score": 71,
          "match_reasons": [
            "title matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12970"
        },
        {
          "title": "AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation",
          "summary": "Computational phantoms are widely used in medical imaging research, yet current systems for generating controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the \\textbf{Volume Control Scalar (VCS)}, a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentially, conditioning on the body mask and previously generated structures to preserve global anatomical coherence while supporting independent, multi-organ control. Across 11 abdominal organs, the proposed framework achieves strong geometric fidelity (e.g., liver Dice $0.83 \\pm 0.05$), stable single-organ calibration over $[-3,+3]$ VCS, and disentangled multi-organ modulation. To showcase clinical utility on a hepatomegaly cohort selected from MERLIN, Wasserstein-based VCS selection reduces the distributional distance of training data by 73.6\\%. These results demonstrate calibrated, distribution-aware anatomical generation suitable for controllable abdominal phantom construction and simulation studies.",
          "authors": [
            "Yubraj Bhandari",
            "Lavsen Dahal",
            "Paul Segars",
            "Joseph Y. Lo"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12969v1",
          "abstract_url": "https://arxiv.org/abs/2604.12969v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12969v1",
          "published_at": "2026-04-14T17:01:41+00:00",
          "updated_at": "2026-04-14T17:01:41+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Clinical",
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.12969",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12969v1"
          },
          "relevance_score": 71,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12969"
        },
        {
          "title": "Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation",
          "summary": "Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose \\textbf{CTAB} (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU on 4 classes while simultaneously providing 3D detection.",
          "authors": [
            "Ahmet İnanç",
            "Özgür Erkent"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12918v1",
          "abstract_url": "https://arxiv.org/abs/2604.12918v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12918v1",
          "published_at": "2026-04-14T16:00:53+00:00",
          "updated_at": "2026-04-14T16:00:53+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Segmentation"
          ],
          "doi": null,
          "arxiv_id": "2604.12918",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12918v1"
          },
          "relevance_score": 70,
          "match_reasons": [
            "title matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12918"
        },
        {
          "title": "Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks",
          "summary": "Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.",
          "authors": [
            "Yingying Zhao",
            "Chengyin Hu",
            "Qike Zhang",
            "Xin Li",
            "Xin Wang",
            "Yiwei Wei",
            "Jiujiang Guo",
            "Jiahuan Long",
            "Tingsong Jiang",
            "Wen Yao"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12833v1",
          "abstract_url": "https://arxiv.org/abs/2604.12833v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12833v1",
          "published_at": "2026-04-14T14:52:15+00:00",
          "updated_at": "2026-04-14T14:52:15+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Reasoning",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.12833",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12833v1"
          },
          "relevance_score": 69,
          "match_reasons": [
            "title matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12833"
        },
        {
          "title": "Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models",
          "summary": "Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.",
          "authors": [
            "Iman Islam",
            "Bram Ruijsink",
            "Andrew J. Reader",
            "Andrew P. King"
          ],
          "categories": [
            "cs.CV",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12832v1",
          "abstract_url": "https://arxiv.org/abs/2604.12832v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12832v1",
          "published_at": "2026-04-14T14:52:00+00:00",
          "updated_at": "2026-04-14T14:52:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Segmentation"
          ],
          "doi": null,
          "arxiv_id": "2604.12832",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12832v1"
          },
          "relevance_score": 69,
          "match_reasons": [
            "title matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12832"
        },
        {
          "title": "Generative Refinement Networks for Visual Synthesis",
          "summary": "While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of content complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these issues. At its core, GRN addresses the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving a reconstruction quality comparable to continuous counterparts. Built upon HBQ's latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively perfects and corrects artworks -- like a human artist painting. Besides, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN establishes new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to more challenging text-to-image and text-to-video generation, delivering superior performance on an equivalent scale. We release all models and code to foster further research on GRN.",
          "authors": [
            "Jian Han",
            "Jinlai Liu",
            "Jiahuan Wang",
            "Bingyue Peng",
            "Zehuan Yuan"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.13030v1",
          "abstract_url": "https://arxiv.org/abs/2604.13030v1",
          "pdf_url": "https://arxiv.org/pdf/2604.13030v1",
          "published_at": "2026-04-14T17:59:03+00:00",
          "updated_at": "2026-04-14T17:59:03+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.13030",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.13030v1"
          },
          "relevance_score": 68,
          "match_reasons": [
            "summary matched \"diffusion\"",
            "summary matched \"video generation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.13030"
        },
        {
          "title": "Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images",
          "summary": "Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks demonstrate transferability; specifically, attacks crafted against one detector can be transferred to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defense methods against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.",
          "authors": [
            "Haoyang Jiang",
            "Mingyang Yi",
            "Shaolei Zhang",
            "Junxian Cai",
            "Qingbin Liu",
            "Xi Chen",
            "Ju Fan"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.12781v1",
          "abstract_url": "https://arxiv.org/abs/2604.12781v1",
          "pdf_url": "https://arxiv.org/pdf/2604.12781v1",
          "published_at": "2026-04-14T14:17:51+00:00",
          "updated_at": "2026-04-14T14:17:51+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Evaluation",
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.12781",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.12781v1"
          },
          "relevance_score": 68,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.12781"
        }
      ]
    },
    {
      "name": "PubMed AI",
      "key_points": [
        "《VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model.》〔评测 / 数据 / 方法〕：The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements…",
        "《Multimodal large language models in brain tumor imaging: clinical applications and future perspectives.》〔应用 / 方法〕：The use of multimodal data is essential for the precise diagnosis and treatment of brain tumors. In this context, multimodal data encompass multisequence magne…",
        "《Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment.》〔评测 / 应用 / 方法〕：Vision-language models in healthcare face a critical limitation, i.e., the modality gap, where image and text embeddings occupy distantly separated regions in…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model.",
          "summary": "The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are accompanied by concerns about biased outputs, a challenge that has yet to be thoroughly explored. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format and narrow sources of bias. To address this problem, we introduce VLBiasBench, a comprehensive benchmark designed to evaluate biases in LVLMs. VLBiasBench features a dataset that covers nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status, as well as two intersectional bias categories: race × gender and race × social economic status. To build a large-scale dataset, we use the Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with various questions to create 128,342 samples. These questions are divided into open-ended and closed-ended types, ensuring thorough consideration of bias sources and a comprehensive evaluation of LVLM biases from multiple perspectives. We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.",
          "authors": [
            "Sibo Wang",
            "Xiangkui Cao",
            "Jie Zhang",
            "Zheng Yuan",
            "Shiguang Shan",
            "Xilin Chen",
            "Wen Gao"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:41979962",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/41979962/",
          "pdf_url": null,
          "published_at": "2026-04-14T12:53:00+00:00",
          "updated_at": "2026-04-14T12:53:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": "10.1109/tpami.2026.3683747",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/41979962/",
            "doi": "https://doi.org/10.1109/TPAMI.2026.3683747"
          },
          "relevance_score": 111,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"benchmark\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1109/tpami.2026.3683747"
        },
        {
          "title": "Multimodal large language models in brain tumor imaging: clinical applications and future perspectives.",
          "summary": "The use of multimodal data is essential for the precise diagnosis and treatment of brain tumors. In this context, multimodal data encompass multisequence magnetic resonance imaging, computed tomography, positron emission tomography, histopathological images, molecular and genomic profiles, structured clinical variables, and radiological reports. With the rapid advancement of artificial intelligence, integrating these heterogeneous data sources has become a central research direction for improving diagnostic accuracy, prognostic assessment, and therapeutic decision-making in neuro-oncology. However, substantial discrepancies exist across data modalities in terms of spatial resolution, semantic representation, and measurement scales, posing significant challenges for effective cross-modal integration. Multimodal large language models (MLLMs) enhance both interpretative and generative capabilities by jointly modeling visual, textual, and structured data, thereby offering a unified framework for addressing these challenges in brain tumor analysis. This review provides a comprehensive overview of MLLMs, covering their methodological foundations, representation learning strategies, and cross-modal alignment mechanisms. We further summarize their applications in both research and emerging clinical settings, including diagnosis support, prognosis prediction, treatment planning assistance, and radiology report generation. Finally, we discuss current limitations, such as data scarcity, interpretability constraints, and clinical deployment barriers, and outline future directions toward robust, explainable, and clinically translatable MLLM systems in neuro-oncology.",
          "authors": [
            "Yixin Wang",
            "Tao Ma",
            "Hongzhi Wang"
          ],
          "categories": [
            "Journal Article",
            "Review"
          ],
          "paper_id": "pubmed:41979660",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/41979660/",
          "pdf_url": null,
          "published_at": "2026-04-14T11:03:00+00:00",
          "updated_at": "2026-04-14T11:03:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Multimodal"
          ],
          "doi": "10.1007/s00117-026-01608-4",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/41979660/",
            "doi": "https://doi.org/10.1007/s00117-026-01608-4"
          },
          "relevance_score": 109,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1007/s00117-026-01608-4"
        },
        {
          "title": "Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment.",
          "summary": "Vision-language models in healthcare face a critical limitation, i.e., the modality gap, where image and text embeddings occupy distantly separated regions in shared representation space. This is reinforced by traditional contrastive learning objectives, and manifests itself through fundamental constraints in cross-modal understanding and downstream task performance. Existing approaches focus on addressing input-level requirements, however, the geometric constraints imposed by multimodal contrastive learning remain largely unexplored. We propose a novel framework that synergistically combines contrastive learning and entropy-regularized optimal transport for medical modality alignment, simultaneously tackling both instance-level and distribution-level alignment. First, a medical condition-driven association strategy is introduced, which defines positive pairs through shared pathologies, rather than rigid image-text correspondence. Next, an intra-modality negative sampling scheme is designed, which constrains intra-modal contrastive pressure to prevent reinforcement of cross-modal separation. These operate in tandem with a lightweight embedding refinement network, which reshapes pretrained BiomedCLIP embeddings into diagnosis-aware clusters, supporting compatibility with clinical pipelines. The approach leads to significant improvements in reducing the modality gap, demonstrated through increases in alignment scores (0.33-0.73), and improving retrieval precision (22%-33%), zero-shot classification accuracy (13%-48%) and a 4.27 times reduction in clustering dispersion metrics on standard benchmarks (CheXpert_200×5, MIMIC_200×5, RSNA, and COVID).",
          "authors": [
            "Chaymaa Lahmar",
            "Hongxin Wang",
            "Xuhui Li",
            "Panos Liatsis"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:41979955",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/41979955/",
          "pdf_url": null,
          "published_at": "2026-04-14T12:53:00+00:00",
          "updated_at": "2026-04-14T12:53:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": "10.1109/jbhi.2026.3683892",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/41979955/",
            "doi": "https://doi.org/10.1109/JBHI.2026.3683892"
          },
          "relevance_score": 107,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"benchmark\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1109/jbhi.2026.3683892"
        },
        {
          "title": "User Experience and Early Clinical Outcomes of a Mental Wellness Chatbot for Depression and Anxiety: Pilot Evaluation Mixed Methods Study.",
          "summary": "BACKGROUND: Artificial intelligence-powered conversational agents (ie, chatbots) are increasingly popular outlets for users seeking psychological support, yet little is known about how users experience early-stage prototypes or which therapeutic processes contribute to clinical improvement. A transparent evaluation of emerging chatbot prototypes is needed to clarify if, how, and why artificial intelligence companions work and to guide their continued development. OBJECTIVE: This mixed methods pilot study evaluated user experience, acceptability, and preliminary clinical signals for an early-stage mental wellness chatbot. We also examined whether baseline symptom severity moderated clinical improvement. METHODS: Three sequential cohorts (n=125) completed a 2-week, incentivized chatbot exposure (approximately 60 min per week). Participants provided first-impression ratings, qualitative feedback, and pre-post assessments of depressive symptoms (PHQ-8 [Patient Health Questionnaire-8]), anxiety symptoms (GAD-7 [Generalized Anxiety Disorder-7]), psychological distress, well-being, and loneliness. Statistical models estimated symptom change and tested interactions with baseline symptom severity. Mixed methods analysis integrated quantitative outcomes with large language model-assisted qualitative content analysis of open-ended responses. RESULTS: Participants described the chatbot as accessible, easy to use, and emotionally validating, while citing limitations in personalization and conversational depth. Qualitative responses consistently highlighted early therapeutic processes such as emotional validation, goal setting, and perceived attunement. Regression models showed significant pre-post reductions in depressive (Hedges g=-0.32) and anxiety (g=-0.32) symptoms, alongside modest improvements in distress and well-being. 
Baseline severity moderated improvement, with marginal effects indicating larger predicted reductions at higher PHQ-8 and GAD-7 baseline scores (eg, PHQ-8=15: g=-0.84; GAD-7=15: g=-0.62). CONCLUSIONS: This pilot provides a comprehensive view of early chatbot development and suggests promising user experiences and preliminary symptom improvements under structured pilot conditions. By integrating experiential and exploratory clinical data, the study identifies candidate process targets to inform ongoing refinement. Findings support continued development and demonstrate procedural feasibility for progression to larger, longer-term trials evaluating engagement and clinical outcomes under more naturalistic conditions.",
          "authors": [
            "Scott Graupensperger",
            "Emily J Ward",
            "Graham Baum",
            "Kate H Bentley",
            "Emily R Dworkin",
            "Millard Brown",
            "Adam Chekroud",
            "Matt Hawrilenko"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:41980262",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/41980262/",
          "pdf_url": null,
          "published_at": "2026-04-14T16:53:00+00:00",
          "updated_at": "2026-04-14T16:53:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Agent",
            "Language Model"
          ],
          "doi": "10.2196/90644",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/41980262/",
            "doi": "https://doi.org/10.2196/90644"
          },
          "relevance_score": 93,
          "match_reasons": [
            "title matched \"clinical\"",
            "summary matched \"language model\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.2196/90644"
        },
        {
          "title": "Comparison of AI-based Chatbot Performance in Analyzing Clinical Scenarios versus Medical Residents: A Novel Approach in Chest Diseases Education.",
          "summary": "OBJECTIVE: Rapid advancements in artificial intelligence (AI) technologies offer new opportunities in medical education. The aim of this study is to compare the performance of large language models, specifically ChatGPT-4 and Gemini, in analyzing clinical scenarios with that of chest diseases research assistants (residents), and to evaluate their potential roles in medical education. MATERIAL AND METHODS: This cross-sectional, comparative study included 28 resident physicians working in the department of chest diseases at a tertiary-care university hospital. Four clinical scenarios involving diagnoses of massive pulmonary embolism, chronic obstructive pulmonary disease, asthma, and severe pneumonia/sepsis were presented to both participants and AI models (ChatGPT-4 and Gemini). Responses were scored by blinded experts based on current guidelines (Global Initiative for Chronic Obstructive Lung Disease, Global Initiative for Asthma, American Thoracic Society). RESULTS: AI models achieved significantly higher scores than residents, particularly on structured questions requiring theoretical knowledge, classification skills, and the listing of contraindications ( P < 0.05). However, it was observed that residents achieved success levels similar to those of AI models in situations requiring emergency intervention (e.g., shock management) through practical, results-oriented approaches. While AI models provided a broader spectrum in differential diagnosis, residents preferred \"telegraphic\" and practice-oriented responses. CONCLUSION: ChatGPT and Gemini have significant potential as clinical decision-support systems and educational assistants. However, rather than replacing human factors in clinical reasoning and emergency management, they should be positioned as complementary tools that accelerate physicians' access to theoretical knowledge.",
          "authors": [
            "Mehmet Hakan Bilgin",
            "Hamit Hakan Alp"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:41979097",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/41979097/",
          "pdf_url": null,
          "published_at": "2026-04-14T06:24:00+00:00",
          "updated_at": "2026-04-14T06:24:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Reasoning",
            "Language Model"
          ],
          "doi": "10.4274/thoracrespract.2026.2026-1-2",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/41979097/",
            "doi": "https://doi.org/10.4274/ThoracResPract.2026.2026-1-2"
          },
          "relevance_score": 82,
          "match_reasons": [
            "title matched \"clinical\"",
            "summary matched \"language model\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.4274/thoracrespract.2026.2026-1-2"
        }
      ]
    },
    {
      "name": "OpenAlex AI",
      "key_points": [
        "《Demystifying Attitudes and Effects of Usage of Large-Language Models Among College-Aged Students》〔方法〕：In compiling literature for my senior seminar on combating hallucinations present within responses from large-language models (LLMs), such as ChatGPT, there ex…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Demystifying Attitudes and Effects of Usage of Large-Language Models Among College-Aged Students",
          "summary": "In compiling literature for my senior seminar on combating hallucinations present within responses from large-language models (LLMs), such as ChatGPT, there exists significant variance of the opinions on the ethics and trustworthiness of LLMs among undergraduate and graduate students. Therefore, for this companion presentation to my more theory-focused senior seminar presentation, I seek to provide an overview of the perception of LLM usage in college students for non-CSci majors. While this presentation aims to briefly overview the process behind the inner workings of LLMs, the crux of this project lies in aggregating the results of various computer science and psychology-related studies to gain a greater understanding of the facets behind college students’ views of artificial intelligence. More specifically, this talk overviews students’ trust and understanding of LLMs, assess students’ ethical views on the usage of LLMs (e.g., sustainability, cheating), and discuss long-term outcomes of LLM usage (e.g., the debate around brain deterioration from LLM usage). While this presentation relies solely on present findings within the literature, it is the hope of the presenter that students with non-CSci backgrounds gain a greater understanding of college LLM usage.",
          "authors": [
            "Anton J Olson"
          ],
          "categories": [
            "Article",
            "Medicine",
            "Health Informatics",
            "Artificial Intelligence in Healthcare and Education",
            "Mental Health via Writing"
          ],
          "paper_id": "openalex:W7151710033",
          "abstract_url": "https://digitalcommons.morris.umn.edu/urs_event/2026/oralpresentations/15",
          "pdf_url": null,
          "published_at": "2026-04-15T00:00:00+00:00",
          "updated_at": "2026-04-15T00:00:00+00:00",
          "source": "openalex",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": null,
          "source_variants": [
            "openalex"
          ],
          "source_urls": {
            "openalex": "https://openalex.org/W7151710033"
          },
          "relevance_score": 70,
          "match_reasons": [
            "title matched \"language model\"",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "title:demystifying attitudes and effects of usage of large language models among college aged students"
        }
      ]
    }
  ]
}