Keyword Tracking

Keyword tracking: language model

This page continuously tracks the keywords in your configuration and archives the matching papers by date.

Recent trend

Most recent hit, from LM: NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

[Trend chart: date axis from 2026-06-15 to 2026-06-28]

Hit details

Review, by date, the titles of papers that matched this keyword, with source feed information preserved.

Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token lev…

Terminal and SWE Agents

Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair

Language Models (LLMs) are powerful tools and have been increasingly adopted for complex software engineering tasks. As the number of parameters increases, results can often be imp…

Achieving strong optimization generalization across diverse optimization problems while requiring limited training resources remains a challenging problem for optimization-oriente…

Terminal and SWE Agents

Evaluating LLMs on Real-World Software Performance Optimization

Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks t…

As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe o…

Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academic work has examined…

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assess…

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR)…

Agent Runtime Security

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We stu…

Enhancing the formal math reasoning capabilities of Large Language Models (LLMs) has become a key focus in both mathematical and computer science communities in recent years. Whil…

Agent Runtime Security

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts

Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and coding-agent environments, creating an indirect prompt-…

MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregress…

Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a sp…

Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings ca…

Terminal and SWE Agents

Recursive Agent Harnesses

Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code…

Software engineering tools increasingly rely on LLM based agents to localize files to change to resolve a software issue. Most AI agents explore repositories linearly, that is, vi…

Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and…

As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models i…

Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG…

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production depl…

Automating C-to-Rust migration is critical for improving software security without sacrificing performance. Traditional rule-based methods struggle with diverse C idioms, often pr…

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents…

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single…

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely…

Agent Runtime Security

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon -- a keylogger, a…

Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persi…

Code review is a critical practice in software engineering, yet the growing scale and frequency of code patches in modern projects, together with the widespread adoption of AI cod…

Merge-Bench: Resolve Merge Conflicts with Large Language Models

This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hu…

Agent Runtime Security

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Autonomous computer use agents that are powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-…

While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub…

The integration of audio modality into Large Audio Language Models (LALMs) significantly expands their attack surface. Existing jailbreak paradigms predominantly treat audio as a…

Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity--efficiency trade…

Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large lan…

Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling

Large Language Models (LLMs) have demonstrated remarkable abilities in reasoning. However, maximizing their potential through inference-time scaling faces challenges in trade-off…

Terminal and SWE Agents

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have…

We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in auto…

Multimodal large language models (MLLMs) have achieved remarkable progress, yet object hallucination remains a critical challenge for reliable deployment. In this paper, we pr…

Agent Runtime Security

Metaphor Is Not All Attention Needs

Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to ma…

This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applicat…

Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to leverage external knowledge, but also exposes valuable RAG databases to leakage attacks. As RAG system…

Automated Machine Learning (AutoML) frameworks increasingly leverage Large Language Models (LLMs) for tasks such as hyperparameter optimization and neural architecture code genera…

With the rapid development of generative artificial intelligence, large language models (LLMs) have gradually integrated into various fields, demonstrating significant potential,…

Multi-label Chest X-ray (CXR) classification faces significant challenges from the inherently imperfect nature of clinical data, particularly the complex interplay of co-occurring…

Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). H…

PubMed AI

Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports.

OBJECTIVES: Coronary computed tomography angiography (CCTA) has become a cornerstone in non-invasive CAD diagnosis and risk stratification. To standardize reporting and improve cl…

Approximately one-fifth of patients with acute pancreatitis (AP) develop severe forms, which are associated with high mortality rates, making early prediction of severity crucial…

BACKGROUND: Large language models (LLMs) are increasingly used to obtain health information, including guidance on child and adolescent mental health. In anorexia nervosa (AN), wh…

2026-04-20

2026-04-20 11:48:52 (Asia/Shanghai)

OpenAlex AI

Artificial Intelligence And The Transformation of Labor Markets

The rapid advancement of artificial intelligence (AI) technologies, particularly generative AI and large language models, has reignited debates about the future of work and the po…

Existing object re-identification (re-ID) and composed image retrieval (CIR) methods capture different aspects of real-world retrieval requirements; re-ID preserves identity but c…

OBJECTIVE: Computable phenotypes derived from electronic health records (EHRs) are central to clinical research and quality reporting. Although large language models (LLMs) can ex…

PubMed AI

Dual perspectives on large language models in rheumatology: physician-rated quality and patient-centered usability of GPT-4o versus DeepSeek-V3.

OBJECTIVES: This study conducted an informatics system evaluation of two LLMs (GPT-4o and DeepSeek-V3) for patient education, combining clinician-rated quality with patient-percei…

BACKGROUND AND OBJECTIVES: Traditional medical board examinations present clinical information in static vignettes with multiple-choices (MC), fundamentally different from how phy…

In compiling literature for my senior seminar on combating hallucinations present within responses from large-language models (LLMs), such as ChatGPT, there exists significant var…

Category fluency tasks involve producing words constrained by a semantic field (animals). Subcategory fluency involves producing words from categories that are semantically relate…

PubMed AI

Sequence Display enables large-scale sequence-activity datasets for rapid protein evolution.

Engineering proteins with desired functions remains challenging and usually requires multiple rounds of screening and selection. Here, we present Sequence Display, a platform that…

PubMed AI

ClinicRealm: Re-evaluating large language models with conventional machine learning for non-generative clinical prediction tasks.

Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility for non-generative clinical prediction is under-evaluated, and they are often assumed to…

PubMed AI

Advancing neurotech justice in youth digital mental health: insights from an interdisciplinary and cross-generational workshop.

Researchers and clinicians are increasingly looking to leverage artificial intelligence (AI) and digital tools to improve psychiatric care. Of particular promise is addressing the…

Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cos…