code generation benchmark Topic Archive

code-generation-benchmark.html
Long-term RSS tracking for the keyword "code generation benchmark", collecting all previously matched papers.
zh-CN
Sun, 28 Jun 2026 05:24:06 +0000

VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination
../papers/arxiv-026cd79abeae.html
https://arxiv.org/abs/2606.17999v1#2026-06-17#code-generation-benchmark
Wed, 17 Jun 2026 14:22:19 +0800
MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPa…

No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages
../papers/arxiv-3ac1fcf1ccb2.html
https://arxiv.org/abs/2606.16827v1#2026-06-16#code-generation-benchmark
Tue, 16 Jun 2026 14:38:43 +0800
Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language based on a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contr…

Closing the Loop on Latent Reasoning via Test-Time Reconstruction
../papers/arxiv-d8f49ccdc82d.html
https://arxiv.org/abs/2606.06252#2026-06-05#code-generation-benchmark
Fri, 05 Jun 2026 13:25:00 +0800
Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantage of textual reasoning: intermediate states are no longer inspectable, making it difficult to determine whether a latent state still preserves the constraints of the original query. As a result, latent reasoning typically operates in an open loop, where a latent stat…

Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language
../papers/arxiv-ddd1be9c8e89.html
https://arxiv.org/abs/2605.15607#2026-05-18#code-generation-benchmark
Mon, 18 May 2026 13:13:17 +0800
Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19%…