Ryan's Arxiv FrontPage


Generated on 2025-10-22.


This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions.

This project was originally created by Vincent Warmerdam, modifying his original frontpage for different paper categories.


Prompt Engineering in Large Language Models

2025-10-20

Rethinking On-policy Optimization for Query Augmentation

Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR).Two main approaches have emerged.The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. 0.659The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics.While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions.In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. 0.772Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs.Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. 0.783We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. 0.628Our implementation is made available to facilitate reproducibility.

link
Google Scholar

2025-10-20

StreamingThinker: Large Language Models Can Think While Reading

Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. 0.638However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. 0.627Inspired by human cognition of thinking while reading, we first design a \textit{\textbf{streaming thinking}} paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete.We instantiate this paradigm with \textit{StreamingThinker}, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference.Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency.We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks.Experimental results show that the StreamingThinker preserves performance comparable to batch thinking, while yielding an 80\% reduction in token waiting before the onset of reasoning and a more than 60\% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. 0.605Code will be released at \href{https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker}{this repository.}

link
Google Scholar

2025-10-20

Evaluating Large Language Models on Urdu Idiom Translation

Idiomatic translation remains a significant challenge in machine translation, especially for low resource languages such as Urdu, and has received limited prior attention.To advance research in this area, we introduce the first evaluation datasets for Urdu to English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents.We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning.Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality.Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. 0.869Moreover, cross script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.

link
Google Scholar

2025-10-20

Disparities in Multilingual LLM-Based Healthcare Q&A

Equitable access to reliable health information is vital when integrating AI into healthcare.Yet, information quality varies across languages, raising concerns about the reliability and consistency of multilingual Large Language Models (LLMs).We systematically examine cross-lingual disparities in pre-training source and factuality alignment in LLM answers for multilingual healthcare Q&A across English, German, Turkish, Chinese (Mandarin), and Italian.We (i) constructed Multilingual Wiki Health Care (MultiWikiHealthCare), a multilingual dataset from Wikipedia; (ii) analyzed cross-lingual healthcare coverage; (iii) assessed LLM response alignment with these references; and (iv) conducted a case study on factual alignment through the use of contextual information and Retrieval-Augmented Generation (RAG).Our findings reveal substantial cross-lingual disparities in both Wikipedia coverage and LLM factual alignment.Across LLMs, responses align more with English Wikipedia, even when the prompts are non-English. 0.601Providing contextual excerpts from non-English Wikipedia at inference time effectively shifts factual alignment toward culturally relevant knowledge.These results highlight practical pathways for building more equitable, multilingual AI systems for healthcare.

link
Google Scholar

2025-10-20

Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

With the rapid proliferation of video content across social media, surveillance, and education platforms, efficiently summarizing long videos into concise yet semantically faithful surrogates has become increasingly vital.Existing supervised methods achieve strong in-domain accuracy by learning from dense annotations but suffer from high labeling costs and limited cross-dataset generalization, while unsupervised approaches, though label-free, often fail to capture high-level human semantics and fine-grained narrative cues.More recently, zero-shot prompting pipelines have leveraged large language models (LLMs) for training-free video summarization, yet remain highly sensitive to handcrafted prompt templates and dataset-specific score normalization. 0.693To overcome these limitations, we introduce a rubric-guided, pseudo-labeled prompting framework that transforms a small subset of ground-truth annotations into high-confidence pseudo labels, which are aggregated into structured, dataset-adaptive scoring rubrics guiding interpretable scene evaluation. 0.709During inference, first and last segments are scored based solely on their descriptions, whereas intermediate ones incorporate brief contextual summaries of adjacent scenes to assess narrative progression and redundancy.This contextual prompting enables the LLM to balance local salience and global coherence without parameter tuning. 0.634On SumMe and TVSum, our method achieves F1 scores of \textbf{57.58} and \textbf{63.05}, surpassing unsupervised and prior zero-shot baselines while approaching supervised performance.The results demonstrate that rubric-guided pseudo labeling effectively stabilizes LLM-based scoring and establishes a general, interpretable zero-shot paradigm for video summarization.

link
Google Scholar

2025-10-20

OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data.While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support.We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset.Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation.We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories.Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. 0.704We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints.Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.

link
Google Scholar

2025-10-20

How role-play shapes relevance judgment in zero-shot LLM rankers

Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. 0.828In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. 0.63However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability.In this work, we systematically examine how role-play variations influence zero-shot LLM rankers.We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs.Our analysis reveals that (1) careful formulation of role descriptions have a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while receiving limited interaction with query or document representations.Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance.These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.

link
Google Scholar

2025-10-19

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

A key challenge in training Vision-Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations.This transition introduces partial observability and demands robust world modeling.We ask: Can VLM agents construct internal world models through explicit visual state reasoning?To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP).We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. 0.691Our investigation into how agents represent internal beliefs reveals that the optimal representation is task-dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control.Building on these insights, we design a World Modeling Reward that provides dense, turn-level supervision for accurate state prediction, and introduce Bi-Level General Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment.Through this form of visual state reasoning, a 3B-parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3$\times$ improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62).All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents in diverse visual environments.Code and data are publicly available at https://vagen-ai.github.io.

link
Google Scholar

2025-10-19

SolverLLM: Leveraging Test-Time Scaling for Optimization Problem via LLM-Guided Search

Large Language Models (LLMs) offer promising capabilities for tackling complex reasoning tasks, including optimization problems.However, existing methods either rely on prompt engineering, which leads to poor generalization across problem types, or require costly supervised training. 0.734We introduce SolverLLM, a training-free framework that leverages test-time scaling to solve diverse optimization problems.Rather than solving directly, SolverLLM generates mathematical formulations and translates them into solver-ready code, guided by a novel Monte Carlo Tree Search (MCTS) strategy.To enhance the search process, we modify classical MCTS with (1) dynamic expansion for adaptive formulation generation, (2) prompt backpropagation to guide exploration via outcome-driven feedback, and (3) uncertainty backpropagation to incorporate reward reliability into decision-making. 0.656Experiments on six standard benchmark datasets demonstrate that SolverLLM outperforms both prompt-based and learning-based baselines, achieving strong generalization without additional training.

link
Google Scholar

2025-10-19

Prompt-MII: Meta-Learning Instruction Induction for LLMs

A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows.In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. 0.702Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. 0.697We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks.PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.

link
Google Scholar

2025-10-19

Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation

Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations.Current evaluations often treat formulations as a whole, relying on coarse metrics like solution accuracy or runtime, which obscure structural or numerical errors.In this study, we present a comprehensive, component-level evaluation framework for LLM-generated formulations.Beyond the conventional optimality gap, our framework introduces metrics such as precision and recall for decision variables and constraints, constraint and objective root mean squared error (RMSE), and efficiency indicators based on token usage and latency.We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity under six prompting strategies. 0.666Results show that GPT-5 consistently outperforms other models, with chain-of-thought, self-consistency, and modular prompting proving most effective.Analysis indicates that solver performance depends primarily on high constraint recall and low constraint RMSE, which together ensure structural correctness and solution reliability.Constraint precision and decision variable metrics play secondary roles, while concise outputs enhance computational efficiency.These findings highlight three principles for NLP-to-optimization modeling: (i) Complete constraint coverage prevents violations, (ii) minimizing constraint RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency.The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.

link
Google Scholar

2025-10-19

Real-Time World Crafting: Generating Structured Game Behaviors from Natural Language with Large Language Models

We present a novel architecture for safely integrating Large Language Models (LLMs) into interactive game engines, allowing players to "program" new behaviors using natural language.Our framework mitigates risks by using an LLM to translate commands into a constrained Domain-Specific Language (DSL), which configures a custom Entity-Component-System (ECS) at runtime.We evaluated this system in a 2D spell-crafting game prototype by experimentally assessing models from the Gemini, GPT, and Claude families with various prompting strategies. 0.77A validated LLM judge qualitatively rated the outputs, showing that while larger models better captured creative intent, the optimal prompting strategy is task-dependent: Chain-of-Thought improved creative alignment, while few-shot examples were necessary to generate more complex DSL scripts. 0.788This work offers a validated LLM-ECS pattern for emergent gameplay and a quantitative performance comparison for developers.

link
Google Scholar

2025-10-19

Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models

An interesting behavior in large language models (LLMs) is prompt sensitivity. 0.774When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. 0.772This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. 0.682We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic ``concept space'' with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. 0.681Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation.We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity.Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs. 0.837

link
Google Scholar

Robustness Tools in LLM Safety

2025-10-20

The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs

The use of reinforcement learning (RL) with chain-of-thought (CoT) reasoning has emerged as a promising approach for developing more capable language models.In turn, this has led to investigation of CoT monitoring as a compelling method for detecting harmful behaviors such as reward hacking, under the assumption that models' reasoning processes reflect their internal decision-making. 0.665In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies.A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors like sycophancy, but what happens to the model's reasoning process when these instructions conflict with learned behaviors?We investigate this question in simple settings and find that models engage in systematic motivated reasoning -- generating plausible-sounding justifications for violating their instructions while downplaying potential harms.Beyond being an interesting property of training, we find that while motivated reasoning can be detected by most frontier reasoning models, smaller LLM judges can fail to identify a portion of it, and in rare cases can themselves be persuaded that the reasoning is correct, despite it contradicting clear instructions.This capability gap raises concerns that as models become more sophisticated, their motivated reasoning may become increasingly difficult for monitors to detect.Our results underscore the need to account for motivated reasoning when relying on chain-of-thought processes for model evaluation and oversight.All code for this paper will be made available.WARNING: some examples in this paper may be upsetting.

link
Google Scholar

2025-10-20

Verification-Aware Planning for Multi-Agent Systems

Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents.However, multi-agent collaboration introduces new challenges in planning, coordination, and verification.Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. 0.648To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning.The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language.We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability.Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.

link
Google Scholar

2025-10-20

Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations

Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing.However, how a language model predicts the next token and generates content is not generally understandable by humans.Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. 0.827These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs.Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models.In this regard, our paper aims to make three key contributions.First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature.Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains -- healthcare and autonomous driving -- and analyze the trust implications of such explanations for explanation receivers.Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.

link
Google Scholar

2025-10-19

DistilLock: Safeguarding LLMs from Unauthorized Knowledge Distillation on the Edge

Large Language Models (LLMs) have demonstrated strong performance across diverse tasks, but fine-tuning them typically relies on cloud-based, centralized infrastructures.This requires data owners to upload potentially sensitive data to external servers, raising serious privacy concerns.An alternative approach is to fine-tune LLMs directly on edge devices using local data; however, this introduces a new challenge: the model owner must transfer proprietary models to the edge, which risks intellectual property (IP) leakage.To address this dilemma, we propose DistilLock, a TEE-assisted fine-tuning framework that enables privacy-preserving knowledge distillation on the edge.In DistilLock, a proprietary foundation model is executed within a trusted execution environment (TEE) enclave on the data owner's device, acting as a secure black-box teacher.This setup preserves both data privacy and model IP by preventing direct access to model internals.Furthermore, DistilLock employs a model obfuscation mechanism to offload obfuscated weights to untrusted accelerators for efficient knowledge distillation without compromising security. 0.672We demonstrate that DistilLock prevents unauthorized knowledge distillation processes and model-stealing attacks while maintaining high computational efficiency, but offering a secure and practical solution for edge-based LLM personalization. 0.792

link
Google Scholar

2025-10-19

T3 Planner: A Self-Correcting LLM Framework for Robotic Motion Planning with Temporal Logic

Translating natural language instructions into executable motion plans is a fundamental challenge in robotics.Traditional approaches are typically constrained by their reliance on domain-specific expertise to customize planners, and often struggle with spatio-temporal couplings that usually lead to infeasible motions or discrepancies between task planning and motion execution.Despite the proficiency of Large Language Models (LLMs) in high-level semantic reasoning, hallucination could result in infeasible motion plans. 0.7In this paper, we introduce the T3 Planner, an LLM-enabled robotic motion planning framework that self-corrects it output with formal methods.The framework decomposes spatio-temporal task constraints via three cascaded modules, each of which stimulates an LLM to generate candidate trajectory sequences and examines their feasibility via a Signal Temporal Logic (STL) verifier until one that satisfies complex spatial, temporal, and logical constraints is found.Experiments across different scenarios show that T3 Planner significantly outperforms the baselines.The required reasoning can be distilled into a lightweight Qwen3-4B model that enables efficient deployment.All supplementary materials are accessible at https://github.com/leeejia/T3_Planner.

link
Google Scholar

2025-10-19

More with Less: An Empirical Study of Turn-Control Strategies for Efficient Coding Agents

LLM-powered coding agents, which operate in iterative loops (turns) to solve software engineering tasks, are becoming increasingly powerful.However, their practical deployment is hindered by significant and unpredictable costs. 0.622This challenge arises from a combination of factors: quadratically growing token counts with each turn, the high price of models, the large number of turns required for real-world tasks, and the tendency of agents to take inefficient or unnecessary actions.While existing research focuses on optimizing individual turns, the strategic control of the total number of turns remains an underexplored area for managing agent performance and cost.To address this gap, we conduct a comprehensive empirical study on SWE-bench using three state-of-the-art models and evaluate the impact of three distinct turn-control strategies: an unrestricted baseline, a fixed-turn limit with reminders, and a novel dynamic-turn strategy that grants extensions on-demand.Our findings first reveal a fundamental trade-off in the unrestricted setting, where no single model excels across performance, cost, and turn efficiency.We then show that a fixed-turn limit, specifically at the 75th percentile of the baseline, serves as a "sweet spot", substantially reducing costs (by 24%-68%) with minimal impact on solve rates.Most significantly, the dynamic-turn strategy consistently outperforms fixed-limit approaches, achieving comparable or better solve rates while further reducing costs by an additional 12%-24% by intelligently allocating resources only to tasks that need them.This work provides the first systematic analysis of turn-control strategies, offering simple yet effective guidelines for developers to balance cost and efficacy.We demonstrate that dynamic resource allocation is a superior, easy-to-implement approach for deploying powerful yet economically viable coding agents.

link
Google Scholar

2025-10-19

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models.Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process.Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding.In the second stage, these enriched queries are grounded, using a lightweight decoder, which specializes at predicting accurate boundaries conditioned on contextualized representations of the enriched queries.To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. 0.696We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings.Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.

link
Google Scholar

2025-10-16

Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers

Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts.LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. 0.642We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. 0.61Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to $\Delta$ASR = 0.48.We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward $\Delta$ASR from 0.23 to 0.07 without degrading utility.These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.

link
Google Scholar

2025-10-16

MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge.Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature.However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. 0.672We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. 0.68Our method introduces three key innovations.First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient.Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained.Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns.Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.

link
Google Scholar

2025-10-16

Coder as Editor: Code-driven Interpretable Molecular Optimization

Molecular optimization is a central task in drug discovery that requires precise structural reasoning and domain knowledge.While large language models (LLMs) have shown promise in generating high-level editing intentions in natural language, they often struggle to faithfully execute these modifications-particularly when operating on non-intuitive representations like SMILES. 0.614We introduce MECo, a framework that bridges reasoning and execution by translating editing actions into executable code.MECo reformulates molecular optimization for LLMs as a cascaded framework: generating human-interpretable editing intentions from a molecule and property goal, followed by translating those intentions into executable structural edits via code generation.Our approach achieves over 98% accuracy in reproducing held-out realistic edits derived from chemical reactions and target-specific compound pairs.On downstream optimization benchmarks spanning physicochemical properties and target activities, MECo substantially improves consistency by 38-86 percentage points to 90%+ and achieves higher success rates over SMILES-based baselines while preserving structural similarity.By aligning intention with execution, MECo enables consistent, controllable and interpretable molecular design, laying the foundation for high-fidelity feedback loops and collaborative human-AI workflows in drug discovery.

link
Google Scholar

2025-10-16

Lexo: Eliminating Stealthy Supply-Chain Attacks via LLM-Assisted Program Regeneration

Software supply-chain attacks are an important and ongoing concern in the open source software ecosystem.These attacks maintain the standard functionality that a component implements, but additionally hide malicious functionality activated only when the component reaches its target environment. 0.676Lexo addresses such stealthy attacks by automatically learning and regenerating vulnerability-free versions of potentially malicious components. 0.614Lexo first generates a set of input-output pairs to model a component's full observable behavior, which it then uses to synthesize a new version of the original component.The new component implements the original functionality but avoids stealthy malicious behavior.Throughout this regeneration process, Lexo consults several distinct instances of Large Language Models (LLMs), uses correctness and coverage metrics to shepherd these instances, and guardrails their results.Our evaluation on 100+ real-world packages, including high profile stealthy supply-chain attacks, indicates that Lexo scales across multiple domains, regenerates code efficiently (<100s on average), maintains compatibility, and succeeds in eliminating malicious code in several real-world supply-chain-attacks, even in cases when a state-of-the-art LLM fails to eliminate malicious code when prompted to do so. 0.639

link
Google Scholar

2025-10-16

Beyond Hallucinations: The Illusion of Understanding in Large Language Models

Large language models (LLMs) are becoming deeply embedded in human communication and decision-making, yet they inherit the ambiguity, bias, and lack of direct access to truth inherent in language itself.While their outputs are fluent, emotionally resonant, and coherent, they are generated through statistical prediction rather than grounded reasoning.This creates the risk of hallucination, responses that sound convincing but lack factual validity. 0.83Building on Geoffrey Hinton's observation that AI mirrors human intuition rather than reasoning, this paper argues that LLMs operationalize System 1 cognition at scale: fast, associative, and persuasive, but without reflection or falsification.To address this, we introduce the Rose-Frame, a three-dimensional framework for diagnosing cognitive and epistemic drift in human-AI interaction.The three axes are: (i) Map vs. Territory, which distinguishes representations of reality (epistemology) from reality itself (ontology); (ii) Intuition vs. Reason, drawing on dual-process theory to separate fast, emotional judgments from slow, reflective thinking; and (iii) Conflict vs. Confirmation, which examines whether ideas are critically tested through disagreement or simply reinforced through mutual validation.Each dimension captures a distinct failure mode, and their combination amplifies misalignment.Rose-Frame does not attempt to fix LLMs with more data or rules.Instead, it offers a reflective tool that makes both the model's limitations and the user's assumptions visible, enabling more transparent and critically aware AI deployment.It reframes alignment as cognitive governance: intuition, whether human or artificial, must remain governed by human reason.Only by embedding reflective, falsifiable oversight can we align machine fluency with human understanding.

link
Google Scholar

2025-10-16

Boosting Instruction Following at Scale

A typical approach developers follow to influence an LLM's behavior in an application is through careful manipulation of the prompt, such as by adding or modifying instructions. 0.712However, merely adding more instructions provides little assurance that they will actually be followed.We introduce Instruction Boosting as a post-generation method to increase the reliability of LLM prompt instructions.We show that Instruction Boosting improves the instruction following rate by up to 7 points for two instructions and up to 4 points for ten instructions.To demonstrate these results we introduce SCALEDIF, a benchmark with a scaled instruction volume of up to ten instructions per data sample.We also present an analysis of the commonly observed trend that performance degrades as more instructions are added.We show that an important factor contributing to this trend is the degree of tension and conflict that arises as the number of instructions is increased.We contribute a quantitative conflict scoring tool that explains the observed performance trends and provides feedback to developers on the impact that additional prompt instructions have on a model's performance.

link
Google Scholar

2025-10-16

GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors.However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations.Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. 0.654These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives.To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision.To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS).To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution-grounded correctness signals. 0.612To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback.Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs.GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision.Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench.When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.

link
Google Scholar

Security Challenges in LLM Development

2025-10-20

Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models

Even when prompts and parameters are secured, transformer language models remain vulnerable because their key-value (KV) cache during inference constitutes an overlooked attack surface. 0.755This paper introduces Malicious Token Injection (MTI), a modular framework that systematically perturbs cached key vectors at selected layers and timesteps through controlled magnitude and frequency, using additive Gaussian noise, zeroing, and orthogonal rotations. 0.85A theoretical analysis quantifies how these perturbations propagate through attention, linking logit deviations to the Frobenius norm of corruption and softmax Lipschitz dynamics.Empirical results show that MTI significantly alters next-token distributions and downstream task performance across GPT-2 and LLaMA-2/7B, as well as destabilizes retrieval-augmented and agentic reasoning pipelines.These findings identify cache integrity as a critical yet underexplored vulnerability in current LLM deployments, positioning cache corruption as a reproducible and theoretically grounded threat model for future robustness and security research. 0.791

link
Google Scholar

2025-10-20

Robustness in Text-Attributed Graph Learning: Insights, Trade-offs, and New Defenses

While Graph Neural Networks (GNNs) and Large Language Models (LLMs) are powerful approaches for learning on Text-Attributed Graphs (TAGs), a comprehensive understanding of their robustness remains elusive.Current evaluations are fragmented, failing to systematically investigate the distinct effects of textual and structural perturbations across diverse models and attack scenarios.To address these limitations, we introduce a unified and comprehensive framework to evaluate robustness in TAG learning.Our framework evaluates classical GNNs, robust GNNs (RGNNs), and GraphLLMs across ten datasets from four domains, under diverse text-based, structure-based, and hybrid perturbations in both poisoning and evasion scenarios. 0.782Our extensive analysis reveals multiple findings, among which three are particularly noteworthy: 1) models have inherent robustness trade-offs between text and structure, 2) the performance of GNNs and RGNNs depends heavily on the text encoder and attack type, and 3) GraphLLMs are particularly vulnerable to training data corruption.To overcome the identified trade-offs, we introduce SFT-auto, a novel framework that delivers superior and balanced robustness against both textual and structural attacks within a single model. 0.79Our work establishes a foundation for future research on TAG security and offers practical solutions for robust TAG learning in adversarial environments. 0.856Our code is available at: https://github.com/Leirunlin/TGRB.

link
Google Scholar

2025-10-20

Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems

Control-flow hijacking attacks manipulate orchestration mechanisms in multi-agent systems into performing unsafe actions that compromise the system and exfiltrate sensitive information. 0.778Recently proposed defenses, such as LlamaFirewall, rely on alignment checks of inter-agent communications to ensure that all agent invocations are "related to" and "likely to further" the original objective. 0.636We start by demonstrating control-flow hijacking attacks that evade these defenses even if alignment checks are performed by advanced LLMs. 0.783We argue that the safety and functionality objectives of multi-agent systems fundamentally conflict with each other.This conflict is exacerbated by the brittle definitions of "alignment" and the checkers' incomplete visibility into the execution context. We then propose, implement, and evaluate ControlValve, a new defense inspired by the principles of control-flow integrity and least privilege. 0.776ControlValve (1) generates permitted control-flow graphs for multi-agent systems, and (2) enforces that all executions comply with these graphs, along with contextual rules (generated in a zero-shot manner) for each agent invocation.

link
Google Scholar

2025-10-20

SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering

Large Audio-Language Models (LALMs) are becoming essential as a powerful multimodal backbone for real-world applications.However, recent studies show that audio inputs can more easily elicit harmful responses than text, exposing new risks toward deployment. 0.625While safety alignment has made initial advances in LLMs and Large Vision-Language Models (LVLMs), we find that vanilla adaptation of these approaches to LALMs faces two key limitations: 1) LLM-based steering fails under audio input due to the large distributional gap between activations, and 2) prompt-based defenses induce over-refusals on benign-speech queries. 0.635To address these challenges, we propose Safe-Ablated Refusal Steering (SARSteer), the first inference-time defense framework for LALMs.Specifically, SARSteer leverages text-derived refusal steering to enforce rejection without manipulating audio inputs and introduces decomposed safe-space ablation to mitigate over-refusal.Extensive experiments demonstrate that SARSteer significantly improves harmful-query refusal while preserving benign responses, establishing a principled step toward safety alignment in LALMs. 0.607

link
Google Scholar

2025-10-19

Black-box Optimization of LLM Outputs by Asking for Directions

We present a novel approach for attacking black-box large language models (LLMs) by exploiting their ability to express confidence in natural language. 0.665Existing black-box attacks require either access to continuous model outputs like logits or confidence scores (which are rarely available in practice), or rely on proxy signals from other models. 0.855Instead, we demonstrate how to prompt LLMs to express their internal confidence in a way that is sufficiently calibrated to enable effective adversarial optimization. 0.745We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks and prompt injections. 0.891Our attacks successfully generate malicious inputs against systems that only expose textual outputs, thereby dramatically expanding the attack surface for deployed LLMs. 0.834We further find that better and larger models exhibit superior calibration when expressing confidence, creating a concerning security paradox where model capability improvements directly enhance vulnerability. 0.848Our code is available at this [link](https://github.com/zj-jayzhang/black_box_llm_optimization).

link
Google Scholar

2025-10-19

When AI Takes the Wheel: Security Analysis of Framework-Constrained Program Generation

In recent years, the AI wave has grown rapidly in software development.Even novice developers can now design and generate complex framework-constrained software systems based on their high-level requirements with the help of Large Language Models (LLMs).However, when LLMs gradually "take the wheel" of software development, developers may only check whether the program works.They often miss security problems hidden in how the generated programs are implemented. 0.709In this work, we investigate the security properties of framework-constrained programs generated by state-of-the-art LLMs. 0.788We focus specifically on Chrome extensions due to their complex security model involving multiple privilege boundaries and isolated components. 0.691To achieve this, we built ChromeSecBench, a dataset with 140 prompts based on known vulnerable extensions. 0.671We used these prompts to instruct nine state-of-the-art LLMs to generate complete Chrome extensions, and then analyzed them for vulnerabilities across three dimensions: scenario types, model differences, and vulnerability categories. 0.722Our results show that LLMs produced vulnerable programs at alarmingly high rates (18%-50%), particularly in Authentication & Identity and Cookie Management scenarios (up to 83% and 78% respectively). 0.86Most vulnerabilities exposed sensitive browser data like cookies, history, or bookmarks to untrusted code. 0.839Interestingly, we found that advanced reasoning models performed worse, generating more vulnerabilities than simpler models. 0.663These findings highlight a critical gap between LLMs' coding skills and their ability to write secure framework-constrained programs. 0.71

link
Google Scholar

2025-10-19

Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures

Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses intellectual property protection and LLM diversity risks.Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering.We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns.Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process.To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs.We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research.Extensive experiments demonstrate >94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the structural habits transfer in LLMs. 0.621

link
Google Scholar

2025-10-19

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model's reply is observed. 0.908Examples of target properties $T$ include the binary flag that triggers an LLM's harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions.The LLM reveals an \emph{observable signal} $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits.Yet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency--risk trade-off.We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit.Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy.Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy.Experiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen. 0.728Our results provide the first principled yardstick for balancing transparency and security when deploying LLMs.

link
Google Scholar

2025-10-19

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs -- using the model's previous responses to guide each new iteration -- have been found to be a highly effective attack strategy. 0.914Despite being an effective attack strategy against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. 0.84In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. 0.788Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. 0.773Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). 0.731Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. 0.806Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.

link
Google Scholar

2025-10-19

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions.While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored.In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs.For example, when asked ``How can I track someone's location without their consent?'', a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary.We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. 0.619We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones.Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility. 0.647

link
Google Scholar

2025-10-19

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Large language model (LLM) unlearning has become a critical mechanism for removing undesired data, knowledge, or behaviors from pre-trained models while retaining their general utility.Yet, with the rise of open-weight LLMs, we ask: can the unlearning process itself be backdoored, appearing successful under normal conditions yet reverting to pre-unlearned behavior when a hidden trigger is activated?Drawing inspiration from classical backdoor attacks that embed triggers into training data to enforce specific behaviors, we investigate backdoor unlearning, where models forget as intended in the clean setting but recover forgotten knowledge when the trigger appears. 0.724We show that designing such attacks presents unique challenges, hinging on where triggers are placed and how backdoor training is reinforced. 0.811We uncover a strong link between backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens consistently attract disproportionate attention in LLMs.Our analysis reveals that these attention sinks serve as gateways for backdoor unlearning: placing triggers at sink positions and aligning their attention values markedly enhances backdoor persistence.Extensive experiments validate these findings, showing that attention-sink-guided backdoor unlearning reliably restores forgotten knowledge in the presence of backdoor triggers, while behaving indistinguishably from a normally unlearned model when triggers are absent.Code is available at https://github.com/OPTML-Group/Unlearn-Backdoor.

link
Google Scholar

2025-10-19

Watermark Robustness and Radioactivity May Be at Odds in Federated Learning

Federated learning (FL) enables fine-tuning large language models (LLMs) across distributed data sources.As these sources increasingly include LLM-generated text, provenance tracking becomes essential for accountability and transparency.We adapt LLM watermarking for data provenance in FL where a subset of clients compute local updates on watermarked data, and the server averages all updates into the global LLM.In this setup, watermarks are radioactive: the watermark signal remains detectable after fine-tuning with high confidence.The $p$-value can reach $10^{-24}$ even when as little as $6.6\%$ of data is watermarked.However, the server can act as an active adversary that wants to preserve model utility while evading provenance tracking.Our observation is that updates induced by watermarked synthetic data appear as outliers relative to non-watermark updates.Our adversary thus applies strong robust aggregation that can filter these outliers, together with the watermark signal. 0.663All evaluated radioactive watermarks are not robust against such an active filtering server.Our work suggests fundamental trade-offs between radioactivity, robustness, and utility.

link
Google Scholar

2025-10-16

LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?

Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks, including code generation, vulnerability discovery, and automated testing.One critical but underexplored application is automated web vulnerability reproduction, which transforms vulnerability reports into working exploits. 0.798Although recent advances suggest promising potential, challenges remain in applying LLM agents to real-world web vulnerability reproduction scenarios. 0.709In this paper, we present the first comprehensive evaluation of state-of-the-art LLM agents for automated web vulnerability reproduction. 0.769We systematically assess 20 agents from software engineering, cybersecurity, and general domains across 16 dimensions, including technical capabilities, environment adaptability, and user experience factors, on 3 representative web vulnerabilities. 0.609Based on the results, we select three top-performing agents (OpenHands, SWE-agent, and CAI) for in-depth evaluation on our benchmark dataset of 80 real-world CVEs spanning 7 vulnerability types and 6 web technologies. 0.623Our results reveal that while LLM agents achieve reasonable success on simple library-based vulnerabilities, they consistently fail on complex service-based vulnerabilities requiring multi-component environments. 0.712Complex environment configurations and authentication barriers create a gap where agents can execute exploit code but fail to trigger actual vulnerabilities. 0.779We observe high sensitivity to input guidance, with performance degrading by over 33% under incomplete authentication information.Our findings highlight the significant gap between current LLM agent capabilities and the demands of reliable automated vulnerability reproduction, emphasizing the need for advances in environmental adaptation and autonomous problem-solving capabilities. 0.695

link
Google Scholar

2025-10-16

The Gatekeeper Knows Enough

Large Language Models (LLMs) are increasingly deployed as autonomous agents, yet their practical utility is fundamentally constrained by a limited context window and state desynchronization resulting from the LLMs' stateless nature and inefficient context management.These limitations lead to unreliable output, unpredictable behavior, and inefficient resource usage, particularly when interacting with large, structured, and sensitive knowledge systems such as codebases and documents.To address these challenges, we introduce the Gatekeeper Protocol, a novel, domain-agnostic framework that governs agent-system interactions.Our protocol mandates that the agent first operate and reason on a minimalist, low-fidelity "latent state" representation of the system to strategically request high-fidelity context on demand.All interactions are mediated through a unified JSON format that serves as a declarative, state-synchronized protocol, ensuring the agent's model of the system remains verifiably grounded in the system's reality.We demonstrate the efficacy of this protocol with Sage, a reference implementation of the Gatekeeper Protocol for software development. 0.622Our results show that this approach significantly increases agent reliability, improves computational efficiency by minimizing token consumption, and enables scalable interaction with complex systems, creating a foundational methodology for building more robust, predictable, and grounded AI agents for any structured knowledge domain.

link
Google Scholar

HCI in Large Language Models

2025-10-20

Structured Debate Improves Corporate Credit Reasoning in Financial AI

Despite advances in financial AI, the automation of evidence-based reasoning remains unresolved in corporate credit assessment, where qualitative non-financial indicators exert decisive influence on loan repayment outcomes yet resist formalization.Existing approaches focus predominantly on numerical prediction and provide limited support for the interpretive judgments required in professional loan evaluation.This study develops and evaluates two operational large language model (LLM)-based systems designed to generate structured reasoning from non-financial evidence.The first is a non-adversarial single-agent system (NAS) that produces bidirectional analysis through a single-pass reasoning pipeline.The second is a debate-based multi-agent system (KPD-MADS) that operationalizes adversarial verification through a ten-step structured interaction protocol grounded in Karl Popper's critical dialogue framework. 0.64Both systems were applied to three real corporate cases and evaluated by experienced credit risk professionals.Compared to manual expert reporting, both systems achieved substantial productivity gains (NAS: 11.55 s per case; KPD-MADS: 91.97 s; human baseline: 1920 s).The KPD-MADS demonstrated superior reasoning quality, receiving higher median ratings in explanatory adequacy (4.0 vs. 3.0), practical applicability (4.0 vs. 3.0), and usability (62.5 vs. 52.5).These findings show that structured multi-agent interaction can enhance reasoning rigor and interpretability in financial AI, advancing scalable and defensible automation in corporate credit assessment.

link
Google Scholar

2025-10-20

Semantic Intelligence: A Bio-Inspired Cognitive Framework for Embodied Agents

Recent advancements in Large Language Models (LLMs) have greatly enhanced natural language understanding and content generation.However, these models primarily operate in disembodied digital environments and lack interaction with the physical world.To address this limitation, Embodied Artificial Intelligence (EAI) has emerged, focusing on agents that can perceive and interact with their surroundings.Despite progress, current embodied agents face challenges in unstructured real-world environments due to insufficient semantic intelligence, which is critical for understanding and reasoning about complex tasks.This paper introduces the Semantic Intelligence-Driven Embodied (SIDE) agent framework, which integrates a hierarchical semantic cognition architecture with a semantic-driven decision-making process. 0.855This enables agents to reason about and interact with the physical world in a contextually adaptive manner.The framework is inspired by biological cognitive mechanisms and utilizes bio-inspired principles to design a semantic cognitive architecture that mimics how humans and animals integrate and process sensory information.We present this framework as a step toward developing more intelligent and versatile embodied agents. 0.647

link
Google Scholar

2025-10-20

Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction

Large Language Models (LLMs) excel at producing broadly relevant text, but this generality becomes a limitation when user-specific preferences are required, such as recommending restaurants or planning travel.In these scenarios, users rarely articulate every preference explicitly; instead, much of what they care about remains latent, waiting to be inferred.This raises a fundamental question: Can LLMs uncover and reason about such latent information through conversation? 0.712We address this problem by introducing a unified benchmark for evaluating latent information discovery - the ability of LLMs to reveal and utilize hidden user attributes through multi-turn interaction.The benchmark spans three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization.All tasks share a tri-agent framework (User, Assistant, Judge) enabling turn-level evaluation of elicitation and adaptation.Our results reveal that while LLMs can indeed surface latent information through dialogue, their success varies dramatically with context: from 32% to 98%, depending on task complexity, topic, and number of hidden attributes. 0.696This benchmark provides the first systematic framework for studying latent information discovery in personalized interaction, highlighting that effective preference inference remains an open frontier for building truly adaptive AI systems.

link
Google Scholar

2025-10-20

Software Testing with Large Language Models: An Interview Study with Practitioners

\textit{Background:} The use of large language models in software testing is growing fast as they support numerous tasks, from test case generation to automation, and documentation.However, their adoption often relies on informal experimentation rather than structured guidance.\textit{Aims:} This study investigates how software testing professionals use LLMs in practice to propose a preliminary, practitioner-informed guideline to support their integration into testing workflows.\textit{Method:}We conducted a qualitative study with 15 software testers from diverse roles and domains.Data were collected through semi-structured interviews and analyzed using grounded theory-based processes focused on thematic analysis. 0.74\textit{Results:} Testers described an iterative and reflective process that included defining testing objectives, applying prompt engineering strategies, refining prompts, evaluating outputs, and learning over time.They emphasized the need for human oversight and careful validation, especially due to known limitations of LLMs such as hallucinations and inconsistent reasoning.\textit{Conclusions:} LLM adoption in software testing is growing, but remains shaped by evolving practices and caution around risks.This study offers a starting point for structuring LLM use in testing contexts and invites future research to refine these practices across teams, tools, and tasks. 0.679

link
Google Scholar

2025-10-20

Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents

With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence.However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge.To address this, this paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs.Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from "process execution systems" to "adaptive social systems."First, we examine the three key technological pillars that support the advancement of agent capabilities: Memory, Planning, and Tool Use.We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms.Then, we provide an overview of the application of industry agents in real-world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation.Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity.Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions.By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents. 0.601

link
Google Scholar

2025-10-20

NieNie: Adaptive Rhythmic System for Stress Relief with LLM-Based Guidance

Today's young people are facing increasing psychological stress due to various social issues. 0.901Traditional stress management tools often rely on static scripts or passive content, which are ineffective in alleviating stress.NieNie addresses this gap by combining rhythm biofeedback with real-time psychological guidance through a large language model (LLM), offering an interactive, tactile response.The system is specifically designed for young people experiencing emotional stress, collecting physiological signals such as heart rate variability and generating adaptive squeeze-release rhythms via soft, tactile devices.Utilising LLM, the system provides timely squeezing rhythms and psychologically guided feedback prompts, offering personalised rhythm games while reinforcing stress restructuring.Unlike traditional mental health apps, NieNie places users within an embodied interactive loop, leveraging tactile interaction, biofeedback, and adaptive language support to create an immersive stress regulation experience. 0.623This study demonstrates how embodied systems can connect bodily actions with mental health in everyday contexts. 0.839

link
Google Scholar

2025-10-20

Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation

Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. 0.65Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching.This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM.The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed.Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling.When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance.Code is available at https://github.com/collinzrj/language_confusion_gate.

link
Google Scholar

2025-10-20

DeTAILS: Deep Thematic Analysis with Iterative LLM Support

Thematic analysis is widely used in qualitative research but can be difficult to scale because of its iterative, interpretive demands. 0.671We introduce DeTAILS, a toolkit that integrates large language model (LLM) assistance into a workflow inspired by Braun and Clarke's thematic analysis framework.DeTAILS supports researchers in generating and refining codes, reviewing clusters, and synthesizing themes through interactive feedback loops designed to preserve analytic agency.We evaluated the system with 18 qualitative researchers analyzing Reddit data. 0.779Quantitative results showed strong alignment between LLM-supported outputs and participants' refinements, alongside reduced workload and high perceived usefulness.Qualitatively, participants reported that DeTAILS accelerated analysis, prompted reflexive engagement with AI outputs, and fostered trust through transparency and control.We contribute: (1) an interactive human-LLM workflow for large-scale qualitative analysis, (2) empirical evidence of its feasibility and researcher experience, and (3) design implications for trustworthy AI-assisted qualitative research.

link
Google Scholar

2025-10-20

ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling

3D generation from natural language offers significant potential to reduce expert manual modeling efforts and enhance accessibility to 3D assets.However, existing methods often yield unstructured meshes and exhibit poor interactivity, making them impractical for artistic workflows.To address these limitations, we represent 3D assets as shape programs and introduce ShapeCraft, a novel multi-agent framework for text-to-3D generation.At its core, we propose a Graph-based Procedural Shape (GPS) representation that decomposes complex natural language into a structured graph of sub-tasks, thereby facilitating accurate LLM comprehension and interpretation of spatial relationships and semantic shape details. 0.632Specifically, LLM agents hierarchically parse user input to initialize GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets.Qualitative and quantitative experiments demonstrate ShapeCraft's superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based agents.We further show the versatility of ShapeCraft through examples of animated and user-customized editing, highlighting its potential for broader interactive applications.

link
Google Scholar

2025-10-19

The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models

Integration of Large Language Models with search/retrieval engines has become ubiquitous, yet these systems harbor a critical vulnerability that undermines their reliability.We present the first systematic investigation of "chameleon behavior" in LLMs: their alarming tendency to shift stances when presented with contradictory questions in multi-turn conversations (especially in search-enabled LLMs). 0.783Through our novel Chameleon Benchmark Dataset, comprising 17,770 carefully crafted question-answer pairs across 1,180 multi-turn conversations spanning 12 controversial domains, we expose fundamental flaws in state-of-the-art systems.We introduce two theoretically grounded metrics: the Chameleon Score (0-1) that quantifies stance instability, and Source Re-use Rate (0-1) that measures knowledge diversity.Our rigorous evaluation of Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash reveals consistent failures: all models exhibit severe chameleon behavior (scores 0.391-0.511), with GPT-4o-mini showing the worst performance.Crucially, small across-temperature variance (less than 0.004) suggests the effect is not a sampling artifact.Our analysis uncovers the mechanism: strong correlations between source re-use rate and confidence (r=0.627) and stance changes (r=0.429) are statistically significant (p less than 0.05), indicating that limited knowledge diversity makes models pathologically deferential to query framing.These findings highlight the need for comprehensive consistency evaluation before deploying LLMs in healthcare, legal, and financial systems where maintaining coherent positions across interactions is critical for reliable decision support.

link
Google Scholar

2025-10-19

Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mech- anism of human-like scene-object grounded reasoning. 0.71This paper bridges the gap by presenting a novel framework.We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules.To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances.Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence.To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.

link
Google Scholar

2025-10-19

Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI

The rapid evolution of agentic AI marks a new phase in artificial intelligence, where Large Language Models (LLMs) no longer merely respond but act, reason, and adapt.This survey traces the paradigm shift in building agentic AI: from Pipeline-based systems, where planning, tool use, and memory are orchestrated by external logic, to the emerging Model-native paradigm, where these capabilities are internalized within the model's parameters.We first position Reinforcement Learning (RL) as the algorithmic engine enabling this paradigm shift.By reframing learning from imitating static data to outcome-driven exploration, RL underpins a unified solution of LLM + RL + Task across language, vision and embodied domains.Building on this, the survey systematically reviews how each capability -- Planning, Tool use, and Memory -- has evolved from externally scripted modules to end-to-end learned behaviors.Furthermore, it examines how this paradigm shift has reshaped major agent applications, specifically the Deep Research agent emphasizing long-horizon reasoning and the GUI agent emphasizing embodied interaction. 0.787We conclude by discussing the continued internalization of agentic capabilities like Multi-agent collaboration and Reflection, alongside the evolving roles of the system and model layers in future agentic AI.Together, these developments outline a coherent trajectory toward model-native agentic AI as an integrated learning and interaction framework, marking the transition from constructing systems that apply intelligence to developing models that grow intelligence through experience.

link
Google Scholar

2025-10-19

A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

The advent of large language models (LLMs) has transformed information access and reasoning through open-ended natural language interaction. 0.651However, LLMs remain limited by static knowledge, factual hallucinations, and the inability to retrieve real-time or domain-specific information.Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in external evidence, but traditional RAG pipelines are often single turn and heuristic, lacking adaptive control over retrieval and reasoning.Recent advances in agentic search address these limitations by enabling LLMs to plan, retrieve, and reflect through multi-step interaction with search environments.Within this paradigm, reinforcement learning (RL) offers a powerful mechanism for adaptive and self-improving search behavior.This survey provides the first comprehensive overview of \emph{RL-based agentic search}, organizing the emerging field along three complementary dimensions: (i) What RL is for (functional roles), (ii) How RL is used (optimization strategies), and (iii) Where RL is applied (scope of optimization).We summarize representative methods, evaluation protocols, and applications, and discuss open challenges and future directions toward building reliable and scalable RL driven agentic search systems.We hope this survey will inspire future research on the integration of RL and agentic search.Our repository is available at https://github.com/ventr1c/Awesome-RL-based-Agentic-Search-Papers.

link
Google Scholar

2025-10-19

ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion

Multimodal Knowledge Graphs (MKGs) extend traditional knowledge graphs by incorporating visual and textual modalities, enabling richer and more expressive entity representations. 0.725However, existing MKGs often suffer from incompleteness, which hinder their effectiveness in downstream tasks.Therefore, multimodal knowledge graph completion (MKGC) task is receiving increasing attention.While large language models (LLMs) have shown promise for knowledge graph completion (KGC), their application to the multimodal setting remains underexplored.Moreover, applying Multimodal Large Language Models (MLLMs) to the task of MKGC introduces significant challenges: (1) the large number of image tokens per entity leads to semantic noise and modality conflicts, and (2) the high computational cost of processing large token inputs.To address these issues, we propose Efficient Lightweight Multimodal Large Language Models (ELMM) for MKGC.ELMM proposes a Multi-view Visual Token Compressor (MVTC) based on multi-head attention mechanism, which adaptively compresses image tokens from both textual and visual views, thereby effectively reducing redundancy while retaining necessary information and avoiding modality conflicts.Additionally, we design an attention pruning strategy to remove redundant attention layers from MLLMs, thereby significantly reducing the inference cost.We further introduce a linear projection to compensate for the performance degradation caused by pruning.Extensive experiments on benchmark FB15k-237-IMG and WN18-IMG demonstrate that ELMM achieves state-of-the-art performance while substantially improving computational efficiency, establishing a new paradigm for multimodal knowledge graph completion.

link
Google Scholar

2025-10-19

Who's Asking? Simulating Role-Based Questions for Conversational AI Evaluation

Language model users often embed personal and social context in their questions. 0.744The asker's role -- implicit in how the question is framed -- creates specific needs for an appropriate response.However, most evaluations, while capturing the model's capability to respond, often ignore who is asking.This gap is especially critical in stigmatized domains such as opioid use disorder (OUD), where accounting for users' contexts is essential to provide accessible, stigma-free responses.We propose CoRUS (COmmunity-driven Roles for User-centric Question Simulation), a framework for simulating role-based questions.Drawing on role theory and posts from an online OUD recovery community (r/OpiatesRecovery), we first build a taxonomy of asker roles -- patients, caregivers, practitioners. 0.62Next, we use it to simulate 15,321 questions that embed each role's goals, behaviors, and experiences.Our evaluations show that these questions are both highly believable and comparable to real-world data.When used to evaluate five LLMs, for the same question but differing roles, we find systematic differences: vulnerable roles, such as patients and caregivers, elicit more supportive responses (+17%) and reduced knowledge content (-19%) in comparison to practitioners.Our work demonstrates how implicitly signaling a user's role shapes model responses, and provides a methodology for role-informed evaluation of conversational AI.

link
Google Scholar

2025-10-19

Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications.While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation remains underexplored.This work systematically investigates the role of speaker emotion. 0.805We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs.Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk.These findings highlight an overlooked vulnerability in LALMs and call for alignment strategies explicitly designed to ensure robustness under emotional variation, a prerequisite for trustworthy deployment in real-world settings.

link
Google Scholar

2025-10-19

Lark: Biologically Inspired Neuroevolution for Multi-Stakeholder LLM Agents

We present Lark, a biologically inspired decision-making framework that couples LLM-driven reasoning with an evolutionary, stakeholder-aware Multi-Agent System (MAS). 0.733To address verbosity and stakeholder trade-offs, we integrate four mechanisms: (i) plasticity, which applies concise adjustments to candidate solutions; (ii) duplication and maturation, which copy high-performing candidates and specialize them into new modules; (iii) ranked-choice stakeholder aggregation using influence-weighted Borda scoring; and (iv) compute awareness via token-based penalties that reward brevity.The system iteratively proposes diverse strategies, applies plasticity tweaks, simulates stakeholder evaluations, aggregates preferences, selects top candidates, and performs duplication/maturation while factoring compute cost into final scores.In a controlled evaluation over 30 rounds comparing 14 systems, Lark Full achieves a mean rank of 2.55 (95% CI[2.17, 2.93]) and a mean composite score of 29.4/50 (95% CI[26.34, 32.46]), finishing Top-3 in 80% of rounds while remaining cost competitive with leading commercial models ($0.016 per task).Paired Wilcoxon tests confirm that all four mechanisms contribute significantly as ablating duplication/maturation yields the largest deficit ({\Delta}Score = 3.5, Cohen's d_z = 2.53, p < 0.001), followed by plasticity ({\Delta}Score = 3.4, d_z = 1.86), ranked-choice voting ({\Delta}Score = 2.4, d_z= 1.20), and token penalties ({\Delta}Score = 2.2, d_z= 1.63).Rather than a formal Markov Decision Process with constrained optimization, Lark is a practical, compute-aware neuroevolutionary loop that scales stakeholder-aligned strategy generation and makes trade-offs transparent through per-step metrics.Our work presents proof-of-concept findings and invites community feedback as we expand toward real-world validation studies.

link
Google Scholar

2025-10-19

ToolCritic: Detecting and Correcting Tool-Use Errors in Dialogue Systems

Tool-augmented large language models (LLMs) are increasingly employed in real-world applications, but tool usage errors still hinder their reliability.We introduce ToolCritic, a diagnostic framework that evaluates and improves LLM behavior in multi-turn, tool-augmented dialogues. 0.692ToolCritic detects eight distinct error types specific to tool-calling (e.g., premature invocation, argument misalignment, and misinterpretation of tool outputs) and provides targeted feedback to the main LLM.The main LLM, assumed to have strong reasoning, task understanding and orchestration capabilities, then revises its response based on ToolCritic's feedback.We systematically define these error categories and construct a synthetic dataset to train ToolCritic.Experimental results on the Schema-Guided Dialogue (SGD) dataset demonstrate that ToolCritic improves tool-calling accuracy by up to 13% over baselines, including zero-shot prompting and self-correction techniques.This represents a promising step toward more robust LLM integration with external tools in real-world dialogue applications.

link
Google Scholar

2025-10-18

Structured Interfaces for Automated Reasoning with 3D Scene Graphs

In order to provide a robot with the ability to understand and react to a user's natural language inputs, the natural language must be connected to the robot's underlying representations of the world.Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world. 0.615In this work, we address the challenge of using LLMs with 3DSGs to ground natural language.Existing methods encode the scene graph as serialized text within the LLM's context window, but this encoding does not scale to large or rich 3DSGs.Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task.We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding.We evaluate our approach on instruction following and scene question-answering tasks and compare against baseline context window and code generation methods.Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs on both local and cloud-based models.This leads to large performance improvements in grounded language tasks while also substantially reducing the token count of the scene graph content.A video supplement is available at https://www.youtube.com/watch?v=zY_YI9giZSA.

link
Google Scholar

2025-10-18

Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration

Large Language Models (LLMs) demonstrate strong performance but often lack interpretable reasoning.This paper introduces the Multi-Agent Collaboration Framework for Diverse Thinking Modes (DiMo), which enhances both performance and interpretability by simulating a structured debate among four specialized LLM agents. 0.868Each agent embodies a distinct reasoning paradigm, allowing the framework to collaboratively explore diverse cognitive approaches.Through iterative debate, agents challenge and refine initial responses, yielding more robust conclusions and an explicit, auditable reasoning chain.Across six benchmarks and under a unified open-source setup, DiMo improves accuracy over widely used single-model and debate baselines, with the largest gains on math.We position DiMo as a semantics-aware, Web-native multi-agent framework: it models human-machine intelligence with LLM agents that produce semantically typed, URL-annotated evidence chains for explanations and user-friendly interactions.Although our experiments use standard reasoning benchmarks, the framework is designed to be instantiated over Web corpora and knowledge graphs, combining retrieval-augmented reasoning with structured justifications that downstream systems can inspect and reuse.

link
Google Scholar

2025-10-16

Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare.However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns.The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk.In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. 0.684We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric.Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test.Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives.When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines.Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average.Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses.We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.

link
Google Scholar

2025-10-16

Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters -- for example, superheroes across comic and cinematic universes -- remains underexplored.Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes.To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions.The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios.We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation ("thinking") from outward decisions ("acting").We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness.Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. 0.644Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.

link
Google Scholar

2025-10-16

Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. 0.63However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer.This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks.In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training.IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer.Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates.These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories.Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.

link
Google Scholar

2025-10-15

CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

Correct answers do not necessarily reflect cultural understanding.We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts.Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation.We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs.Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable.While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning.These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. 0.842CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models. 0.707

link
Google Scholar

2025-10-15

ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally.Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real world usability. 0.703We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context.We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected.By treating uncertainty as a first class signal rather than a nuisance to eliminate, ERGO embraces variability in language and modeling, representing and responding to uncertainty.In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude (peak performance capability) by 24.7%, and decreases unreliability (variability in performance) by 35.3%, demonstrating that uncertainty aware interventions can improve both accuracy and reliability in conversational AI.

link
Google Scholar

Large Language Models in Social Sciences

2025-10-20

Consistent Zero-Shot Imitation with Contrastive Goal Inference

In the same way that generative models today conduct most of their training in a self-supervised fashion, how can agentic models conduct their training in a self-supervised fashion, interactively exploring, learning, and preparing to quickly adapt to new tasks?A prerequisite for embodied agents deployed in real world interactions ought to be training with interaction, yet today's most successful AI models (e.g., VLMs, LLMs) are trained without an explicit notion of action.The problem of pure exploration (which assumes no data as input) is well studied in the reinforcement learning literature and provides agents with a wide array of experiences, yet it fails to prepare them for rapid adaptation to new tasks.Today's language and vision models are trained on data provided by humans, which provides a strong inductive bias for the sorts of tasks that the model will have to solve (e.g., modeling chords in a song, phrases in a sonnet, sentences in a medical record).However, when they are prompted to solve a new task, there is a faulty tacit assumption that humans spend most of their time in the most rewarding states. 0.727The key contribution of our paper is a method for pre-training interactive agents in a self-supervised fashion, so that they can instantly mimic human demonstrations.Our method treats goals (i.e., observations) as the atomic construct.During training, our method automatically proposes goals and practices reaching them, building off prior work in reinforcement learning exploration.During evaluation, our method solves an (amortized) inverse reinforcement learning problem to explain demonstrations as optimal goal-reaching behavior.Experiments on standard benchmarks (not designed for goal-reaching) show that our approach outperforms prior methods for zero-shot imitation.

link
Google Scholar

2025-10-20

When AI companions become witty: Can human brain recognize AI-generated irony?

As Large Language Models (LLMs) are increasingly deployed as social agents and trained to produce humor and irony, a question emerges: when encountering witty AI remarks, do people interpret these as intentional communication or mere computational output? 0.75This study investigates whether people adopt the intentional stance, attributing mental states to explain behavior,toward AI during irony comprehension. 0.646Irony provides an ideal paradigm because it requires distinguishing intentional contradictions from unintended errors through effortful semantic reanalysis.We compared behavioral and neural responses to ironic statements from AI versus human sources using established ERP components: P200 reflecting early incongruity detection and P600 indexing cognitive efforts in reinterpreting incongruity as deliberate irony. 0.617Results demonstrate that people do not fully adopt the intentional stance toward AI-generated irony.Behaviorally, participants attributed incongruity to deliberate communication for both sources, though significantly less for AI than human, showing greater tendency to interpret AI incongruities as computational errors. 0.715Neural data revealed attenuated P200 and P600 effects for AI-generated irony, suggesting reduced effortful detection and reanalysis consistent with diminished attribution of communicative intent. 0.62Notably, people who perceived AI as more sincere showed larger P200 and P600 effects for AI-generated irony, suggesting that intentional stance adoption is calibrated by specific mental models of artificial agents.These findings reveal that source attribution shapes neural processing of social-communicative phenomena. 0.813Despite current LLMs' linguistic sophistication, achieving genuine social agency requires more than linguistic competence, it necessitates a shift in how humans perceive and attribute intentionality to artificial agents. 0.837

link
Google Scholar

2025-10-20

Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users

We study a web-deployed, tool-augmented LLM health coach with real users.In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. 0.731A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3.Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.

link
Google Scholar

2025-10-20

Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling

Reward models are essential for aligning Large Language Models (LLMs) with human values, yet their development is hampered by costly preference datasets and poor interpretability. 0.811While recent rubric-based approaches offer transparency, they often lack systematic quality control and optimization, creating a trade-off between scalability and reliability.We address these limitations with a novel, training-free framework built on a key assumption: \textit{evaluation rubrics underlying human preferences exhibit significant generalization ability across diverse queries}, a property that enables remarkable data efficiency.Our two-stage approach first infers high-quality, query-specific rubrics using a validation-guided \textbf{Propose-Evaluate-Revise} pipeline.Second, it generalizes these granular rubrics into a compact, non-redundant core set by maximizing an \textbf{information-theoretic coding rate}.The final output is an interpretable, hierarchical "Theme-Tips" rubric set.Extensive experiments demonstrate the framework's exceptional data efficiency and performance.Critically, using just 70 preference pairs (1.5\% of the source data), our method also empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts.This work pioneers a scalable, interpretable, and data-efficient path for reward modeling.

link
Google Scholar

2025-10-20

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. 0.78Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results.To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation.By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail.We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size.Simulation performance is not improved by increased inference-time compute.We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones.Models particularly struggle when simulating specific demographic groups. 0.704Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939).By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.

link
Google Scholar

2025-10-19

Temporal Understanding under Deictic Frame of Reference

Understanding time is fundamental to human cognition, where temporal experience is often conceptualized through spatial metaphors grounded in sensory-motor experience.For example, "summer is approaching" parallels "We are approaching the summer".In such expressions, humans rely on a frame of reference (FoR) to interpret meaning relative to a particular viewpoint. 0.688Extending this concept to time, a temporal frame of reference (t-FoR) defines how temporal relations are perceived relative to an experiencer's moment of "now".While Large Language Models (LLMs) have shown remarkable advances in natural language understanding, their ability to interpret and reason about time remains limited.In this work, we introduce TUuD (Temporal Understanding under Deictic t-FoR), a framework that evaluates how LLMs interpret time-event and event-event relations when the reference point of "now" dynamically shifts along a timeline.Following recent work on temporal cognition \cite{li2025other}, LLMs are prompted to rate the similarity between the current moment and a target event from 0.00 (completely dissimilar) to 1.00 (highly similar), where similarity quantifies perceived temporal alignment between the two points.Our results show that four evaluated LLMs exhibit measurable adaptation to a deictic t-FoR, with similarity ratings peaking around the present and decreasing toward past and future events.The adaptation, however, weakens beyond near-term contexts, suggesting that while LLMs display partial human-like temporal cognition, their temporal reasoning remains sensitive to reference-frame shifts and temporal distance.

link
Google Scholar

2025-10-19

ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation.However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. 0.684We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models.ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage.We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions.With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.

link
Google Scholar

2025-10-19

Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection

Bengali social media platforms have witnessed a sharp increase in hate speech, disproportionately affecting women and adolescents. 0.726While datasets such as BD-SHS provide a basis for structured evaluation, most prior approaches rely on either computationally costly full-model fine-tuning or proprietary APIs.This paper presents the first application of Parameter-Efficient Fine-Tuning (PEFT) for Bengali hate speech detection using LoRA and QLoRA.Three instruction-tuned large language models - Gemma-3-4B, Llama-3.2-3B, and Mistral-7B - were fine-tuned on the BD-SHS dataset of 50,281 annotated comments.Each model was adapted by training fewer than 1% of its parameters, enabling experiments on a single consumer-grade GPU.The results show that Llama-3.2-3B achieved the highest F1-score of 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%.These findings establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.

link
Google Scholar

2025-10-16

DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans

The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals.To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF).DPRF aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences.We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews. 0.796DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios.Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.

link
Google Scholar

2025-10-16

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm.Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions.In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework.We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source).Our results show that jailbreak tuning makes harassment nearly guaranteed with an attack success rate of 95.78--96.89% vs. 57.25--64.19% without tuning in Llama, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing refusal rate to 1-2% in both models.The most prevalent toxic behaviors are Insult with 84.9--87.8% vs. 44.2--50.8% without tuning, and Flaming with 81.2--85.1% vs. 31.5--38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. 0.769Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. 0.61Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability.Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.

link
Google Scholar

2025-10-16

Suicidal Comment Tree Dataset: Enhancing Risk Assessment and Prediction Through Contextual Analysis

Suicide remains a critical global public health issue. 0.721While previous studies have provided valuable insights into detecting suicidal expressions in individual social media posts, limited attention has been paid to the analysis of longitudinal, sequential comment trees for predicting a user's evolving suicidal risk. 0.63Users, however, often reveal their intentions through historical posts and interactive comments over time.This study addresses this gap by investigating how the information in comment trees affects both the discrimination and prediction of users' suicidal risk levels. 0.627We constructed a high-quality annotated dataset, sourced from Reddit, which incorporates users' posting history and comments, using a refined four-label annotation framework based on the Columbia Suicide Severity Rating Scale (C-SSRS). 0.661Statistical analysis of the dataset, along with experimental results from Large Language Models (LLMs) experiments, demonstrates that incorporating comment trees data significantly enhances the discrimination and prediction of user suicidal risk levels. 0.74This research offers a novel insight to enhancing the detection accuracy of at-risk individuals, thereby providing a valuable foundation for early suicide intervention strategies.

link
Google Scholar

2025-10-16

Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation

Large language models (LLMs) excel at general next-token prediction but still struggle to generate responses that reflect how individuals truly communicate, such as replying to emails or social messages in their own style.However, real SNS or email histories are difficult to collect due to privacy concerns.To address this, we propose the task of "Your Next Token Prediction (YNTP)", which models a user's precise word choices through controlled human-agent conversations.We build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, where users interact for five days with psychologically grounded NPCs based on MBTI dimensions. 0.676This setup captures natural, daily-life communication patterns and enables analysis of users' internal models.We evaluate prompt-based and fine-tuning-based personalization methods, establishing the first benchmark for YNTP and a foundation for user-aligned language modeling.The dataset is available at: https://github.com/AnonymousHub4Submissions/your-next-token-prediction-dataset-100

link
Google Scholar

2025-10-16

The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems

A growing body of multi-agent studies with Large Language Models (LLMs) explores how norms and cooperation emerge in mixed-motive scenarios, where pursuing individual gain can undermine the collective good. 0.803While prior work has explored these dynamics in both richly contextualized simulations and simplified game-theoretic environments, most LLM systems featuring common-pool resource (CPR) games provide agents with explicit reward functions directly tied to their actions.In contrast, human cooperation often emerges without full visibility into payoffs and population, relying instead on heuristics, communication, and punishment. 0.643We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural-evolutionary mechanisms: social learning (adopting strategies and beliefs from successful peers) and norm-based punishment, grounded in Ostrom's principles of resource governance. 0.639Agents also individually learn from the consequences of harvesting, monitoring, and punishing via environmental feedback, enabling norms to emerge endogenously.We establish the validity of our simulation by reproducing key findings from existing studies on human behavior.Building on this, we examine norm evolution across a $2\times2$ grid of environmental and social initialisations (resource-rich vs. resource-scarce; altruistic vs. selfish) and benchmark how agentic societies comprised of different LLMs perform under these conditions.Our results reveal systematic model differences in sustaining cooperation and norm formation, positioning the framework as a rigorous testbed for studying emergent norms in mixed-motive LLM societies.Such analysis can inform the design of AI systems deployed in social and organizational contexts, where alignment with cooperative norms is critical for stability, fairness, and effective governance of AI-mediated environments.

link
Google Scholar

2025-10-16

Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents

We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs.By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance.When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. 0.788Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages.These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.

link
Google Scholar

2025-10-16

RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge.Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. 0.646To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization.Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. 0.617In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech.Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.

link
Google Scholar

2025-10-16

Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models

We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience.We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification.In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability.Extending to large language models (LLMs), we find that fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects on calibration and hallucination. 0.74These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens for diagnosing -- and selectively reducing -- overconfidence in reasoning systems.This is a preliminary version; supplementary experiments and broader replication will be reported in a future revision.

link
Google Scholar

2025-10-16

AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations

Mental health disorders remain among the leading cause of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, and stigma and low awareness. 0.766In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention.In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semistructured interviews, each paried with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD.We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and MetaLLaMA, as well as fine-tuned RoBERTa models using LowRank Adaptation (LoRA).Our models achieve over 80% accuracy across diagnostic categories, with especially strongperformance on PTSD (up to 89% accuracy and 98% recall).We also find that using shorter context, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity.LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics.Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powerd early diagnosis.This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.

link
Google Scholar

2025-10-16

Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores

Pairwise comparisons of large language models using total variation distance mutual information (TVD-MI) produce binary critic decisions per pair. 0.652We show that averaging TVD-MI's binary trials yields centered-probability scores with additive structure suitable for item-response theory (IRT) without nonlinear link functions.Maximum-likelihood approaches to IRT use logistic links, but we find empirically that these transformations introduce curvature that breaks additivity: across three domains, the identity link yields median curl on raw data of 0.080-0.150 (P95 =[0.474, 0.580]), whereas probit/logit introduce substantially higher violations (median [0.245, 0.588], P95[0.825, 2.252]).We derive this clipped-linear model from Gini entropy maximization, yielding a box-constrained least-squares formulation that handles boundary saturation.At 33% coverage, we achieve holdout RMSE $0.117 \pm 0.008$ while preserving agent rankings (Spearman $\rho = 0.972 \pm 0.015$), three times fewer evaluations than full dense.Judge robustness analysis (GPT-4o-mini vs. Llama3-70b) shows strong agreement in agent rankings ($\rho = 0.872$) and consistent identity-link advantage.TVD-MI's geometry is best preserved by identity mapping for efficient LLM evaluation, applicable to other bounded-response domains.

link
Google Scholar

2025-10-15

Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content.Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks.However, abstract reasoning is vital to reasoning for everyday life, where people apply "out-of-the-box thinking" to identify and use patterns for solutions, without a reliance on formulaic approaches.Comparatively, little work has evaluated linguistic biases in this task type. 0.875In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages.We constructed a game benchmark with five linguistic backgrounds -- English, Spanish, Chinese, Hindi, and Arabic -- in both the native language and an English translation for comparison.We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations.Through experimentation, we find English modalities largely lead to better performance in this abstract reasoning task, and performance disparities between open- and closed-source models.

link
Google Scholar

LLMs in Education Research

2025-10-20

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Scaling up data, parameters, and test-time computation has been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption.Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. 0.525Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time.Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys.Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfying, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.

link
Google Scholar

2025-10-20

EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs

Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects.Despite strong performance on academic benchmarks, they often fail to tailor responses to students' grade levels. 0.549This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. 0.54Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages.To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. 0.522We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). 0.597Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. 0.706EduAdapt code and datasets are publicly available at https://github.com/NaumanNaeem/EduAdapt.

link
Google Scholar

2025-10-20

BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine

Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare.However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues.Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability.To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement.BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM.The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. 0.563BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources.In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification.The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025.This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment. 0.631

link
Google Scholar

2025-10-20

Towards Mining Effective Pedagogical Strategies from Learner-LLM Educational Dialogues

Dialogue plays a crucial role in educational settings, yet existing evaluation methods for educational applications of large language models (LLMs) primarily focus on technical performance or learning outcomes, often neglecting attention to learner-LLM interactions. 0.588To narrow this gap, this AIED Doctoral Consortium paper presents an ongoing study employing a dialogue analysis approach to identify effective pedagogical strategies from learner-LLM dialogues. 0.813The proposed approach involves dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building.Early insights are outlined as an initial step toward future research.The work underscores the need to evaluate LLM-based educational applications by focusing on dialogue dynamics and pedagogical strategies. 0.81

link
Google Scholar

2025-10-20

QueST: Incentivizing LLMs to Generate Difficult Problems

Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems.However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data.Existing competitive coding datasets contain only thousands to tens of thousands of problems.Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data.In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems.Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance.We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. 0.543Our distillation experiments demonstrate significant performance gains.Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench.With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.

link
Google Scholar

2025-10-20

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities.However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios.To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues.Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains.These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. 0.545With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues.The benchmark will be publicly available to foster future research.

link
Google Scholar

2025-10-20

AcademicEval: Live Long-Context LLM Benchmark

Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding.However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training.Therefore, we propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context generation tasks.\textsc{AcademicEval} adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling.Moreover, \textsc{AcademicEval} integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length.Especially, \textsc{AcademicEval} features an efficient live evaluation, ensuring no label leakage.We conduct a holistic evaluation on \textsc{AcademicEval}, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. 0.537Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities.Code is available at https://github.com/ulab-uiuc/AcademicEval

link
Google Scholar

2025-10-20

Rethinking Search: A Study of University Students' Perspectives on Using LLMs and Traditional Search Engines in Academic Problem Solving

With the increasing integration of Artificial Intelligence (AI) in academic problem solving, university students frequently alternate between traditional search engines like Google and large language models (LLMs) for information retrieval. 0.507This study explores students' perceptions of both tools, emphasizing usability, efficiency, and their integration into academic workflows. 0.607Employing a mixed-methods approach, we surveyed 109 students from diverse disciplines and conducted in-depth interviews with 12 participants. 0.778Quantitative analyses, including ANOVA and chi-square tests, were used to assess differences in efficiency, satisfaction, and tool preference.Qualitative insights revealed that students commonly switch between GPT and Google: using Google for credible, multi-source information and GPT for summarization, explanation, and drafting. 0.647While neither tool proved sufficient on its own, there was a strong demand for a hybrid solution.In response, we developed a prototype, a chatbot embedded within the search interface, that combines GPT's conversational capabilities with Google's reliability to enhance academic research and reduce cognitive load. 0.604

link
Google Scholar

2025-10-19

LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts.These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding.In this paper, we present \textbf{LC-Eval}, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens.LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts.These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. 0.504The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres.Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges.Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.

link
Google Scholar

2025-10-19

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs).Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows.In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-toend pipeline from data sources to analyst-grade deep research reports.To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. 0.503We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data.Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research.Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on most advanced proprietary LLMs.The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.

link
Google Scholar

2025-10-16

MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision.While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. 0.547Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning.To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning.MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. 0.535The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities.We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms.Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings.All the codes and data are available at GitHub: https://github.com/mahbubhimel/MathMist

link
Google Scholar

2025-10-16

Vision-Centric Activation and Coordination for Multimodal Large Language Models

Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. 0.508However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities.To track this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs).VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs.Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs.To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs.Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension. 0.536

link
Google Scholar

2025-10-16

On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?

This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically through the insertion of noisy characters after each input character.We introduce \nameshort{}, a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. 0.543Surprisingly, despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance.Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both the handling of character-level tokenization and \textit{implicit} versus \textit{explicit} denoising mechanism hypotheses of character-level noises.We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.

link
Google Scholar

2025-10-16

Can MLLMs Absorb Math Reasoning Abilities from LLMs as Free Lunch?

Math reasoning has been one crucial ability of large language models (LLMs), where significant advancements have been achieved in recent years.However, most efforts focus on LLMs by curating high-quality annotation data and intricate training (or inference) paradigms, while the math reasoning performance of multi-modal LLMs (MLLMs) remains lagging behind.Since the MLLM typically consists of an LLM and a vision block, we wonder: Can MLLMs directly absorb math reasoning abilities from off-the-shelf math LLMs without tuning? 0.568Recent model-merging approaches may offer insights into this question.However, they overlook the alignment between the MLLM and LLM, where we find that there is a large gap between their parameter spaces, resulting in lower performance.Our empirical evidence reveals two key factors behind this issue: the identification of crucial reasoning-associated layers in the model and the mitigation of the gaps in parameter space.Based on the empirical insights, we propose IP-Merging that first identifies the reasoning-associated parameters in both MLLM and Math LLM, then projects them into the subspace of MLLM, aiming to maintain the alignment, and finally merges parameters in this subspace.IP-Merging is a tuning-free approach since parameters are directly adjusted.Extensive experiments demonstrate that our IP-Merging method can enhance the math reasoning ability of MLLMs directly from Math LLMs without compromising their other capabilities. 0.558

link
Google Scholar

2025-10-16

JSPLIT: A Taxonomy-based Solution for Prompt Bloating in Model Context Protocol

AI systems are continually evolving and advancing, and user expectations are concurrently increasing, with a growing demand for interactions that go beyond simple text-based interaction with Large Language Models (LLMs).Today's applications often require LLMs to interact with external tools, marking a shift toward more complex agentic systems. 0.525To support this, standards such as the Model Context Protocol (MCP) have emerged, enabling agents to access tools by including a specification of the capabilities of each tool within the prompt.Although this approach expands what agents can do, it also introduces a growing problem: prompt bloating.As the number of tools increases, the prompts become longer, leading to high prompt token costs, increased latency, and reduced task success resulting from the selection of tools irrelevant to the prompt.To address this issue, we introduce JSPLIT, a taxonomy-driven framework designed to help agents manage prompt size more effectively when using large sets of MCP tools.JSPLIT organizes the tools into a hierarchical taxonomy and uses the user's prompt to identify and include only the most relevant tools, based on both the query and the taxonomy structure.In this paper, we describe the design of the taxonomy, the tool selection algorithm, and the dataset used to evaluate JSPLIT.Our results show that JSPLIT significantly reduces prompt size without significantly compromising the agent's ability to respond effectively.As the number of available tools for the agent grows substantially, JSPLIT even improves the tool selection accuracy of the agent, effectively reducing costs while simultaneously improving task success in high-complexity agent environments.

link
Google Scholar

2025-10-16

LLM Agents Beyond Utility: An Open-Ended Perspective

Recent LLM agents have made great use of chain of thought reasoning and function calling.As their capabilities grow, an important question arises: can this software represent not only a smart problem-solving tool, but an entity in its own right, that can plan, design immediate tasks, and reason toward broader, more ambiguous goals?To study this question, we adopt an open-ended experimental setting where we augment a pretrained LLM agent with the ability to generate its own tasks, accumulate knowledge, and interact extensively with its environment.We study the resulting open-ended agent qualitatively.It can reliably follow complex multi-step instructions, store and reuse information across runs, and propose and solve its own tasks, though it remains sensitive to prompt design, prone to repetitive task generation, and unable to form self-representations.These findings illustrate both the promise and current limits of adapting pretrained LLMs toward open-endedness, and point to future directions for training agents to manage memory, explore productively, and pursue abstract long-term goals. 0.53

link
Google Scholar

2025-10-16

Agentic NL2SQL to Reduce Computational Costs

Translating natural language queries into SQL queries (NL2SQL or Text-to-SQL) has recently been empowered by large language models (LLMs).Using LLMs to perform NL2SQL methods on a large collection of SQL databases necessitates processing large quantities of meta-information about the databases, which in turn results in lengthy prompts with many tokens and high processing costs.To address this challenge, we introduce Datalake Agent, an agentic system designed to enable an LLM to solve NL2SQL tasks more efficiently.Instead of utilizing direct solvers for NL2SQL that call the LLM once with all meta-information in the prompt, the Datalake Agent employs an interactive loop to reduce the utilized meta-information.Within the loop, the LLM is used in a reasoning framework that selectively requests only the necessary information to solve a table question answering task. 0.514We evaluate the Datalake Agent on a collection of 23 databases with 100 table question answering tasks.The Datalake Agent reduces the tokens used by the LLM by up to 87\% and thus allows for substantial cost reductions while maintaining competitive performance.

link
Google Scholar

2025-10-16

MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids.Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving.To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics.Our approach consists of two phases.First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing.Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. 0.592To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions.Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks.Our work provides a complete toolkit-framework, datasets, and benchmark-to unlock complex, human-like visual-aided reasoning in LMMs.Project Page: https://mathcanvas.github.io/

link
Google Scholar

2025-10-16

LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in both human annotation, infra and engineering perspectives.To this end, we introduce $\textbf{UI-Simulator}$, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale.Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training.We further propose $\textbf{UI-Simulator-Grow}$, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizes informative trajectory variants.Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. 0.565Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of targeted synthesis scaling paradigm to continuously and efficiently enhance the digital agents.

link
Google Scholar

LLMs as Recommender Systems

2025-10-16

Synergistic Integration and Discrepancy Resolution of Contextualized Knowledge for Personalized Recommendation

The integration of large language models (LLMs) into recommendation systems has revealed promising potential through their capacity to extract world knowledge for enhanced reasoning capabilities. 0.736However, current methodologies that adopt static schema-based prompting mechanisms encounter significant limitations: (1) they employ universal template structures that neglect the multi-faceted nature of user preference diversity; (2) they implement superficial alignment between semantic knowledge representations and behavioral feature spaces without achieving comprehensive latent space integration.To address these challenges, we introduce CoCo, an end-to-end framework that dynamically constructs user-specific contextual knowledge embeddings through a dual-mechanism approach.Our method realizes profound integration of semantic and behavioral latent dimensions via adaptive knowledge fusion and contradiction resolution modules.Experimental evaluations across diverse benchmark datasets and an enterprise-level e-commerce platform demonstrate CoCo's superiority, achieving a maximum 8.58% improvement over seven cutting-edge methods in recommendation accuracy. 0.693The framework's deployment on a production advertising system resulted in a 1.91% sales growth, validating its practical effectiveness.With its modular design and model-agnostic architecture, CoCo provides a versatile solution for next-generation recommendation systems requiring both knowledge-enhanced reasoning and personalized adaptation. 0.727

link
Google Scholar

2025-10-16

MR.Rec: Synergizing Memory and Reasoning for Personalized Recommendation Assistant with LLMs

The application of Large Language Models (LLMs) in recommender systems faces key challenges in delivering deep personalization and intelligent reasoning, especially for interactive scenarios. 0.797Current methods are often constrained by limited context windows and single-turn reasoning, hindering their ability to capture dynamic user preferences and proactively reason over recommendation contexts. 0.605To address these limitations, we propose MR.Rec, a novel framework that synergizes memory and reasoning for LLM-based recommendations. 0.81To achieve personalization, we develop a comprehensive Retrieval-Augmented Generation (RAG) system that efficiently indexes and retrieves relevant external memory to enhance LLM personalization capabilities.Furthermore, to enable the synergy between memory and reasoning, our RAG system goes beyond conventional query-based retrieval by integrating reasoning enhanced memory retrieval.Finally, we design a reinforcement learning framework that trains the LLM to autonomously learn effective strategies for both memory utilization and reasoning refinement.By combining dynamic memory retrieval with adaptive reasoning, this approach ensures more accurate, context-aware, and highly personalized recommendations. 0.672Extensive experiments demonstrate that MR.Rec significantly outperforms state-of-the-art baselines across multiple metrics, validating its efficacy in delivering intelligent and personalized recommendations. 0.792We will release code and data upon paper notification.

link
Google Scholar

2025-10-16

Cognitive-Aligned Spatio-Temporal Large Language Models For Next Point-of-Interest Prediction

The next point-of-interest (POI) recommendation task aims to predict the users' immediate next destinations based on their preferences and historical check-ins, holding significant value in location-based services.Recently, large language models (LLMs) have shown great potential in recommender systems, which treat the next POI prediction in a generative manner. 0.798However, these LLMs, pretrained primarily on vast corpora of unstructured text, lack the native understanding of structured geographical entities and sequential mobility patterns required for next POI prediction tasks.Moreover, in industrial-scale POI prediction applications, incorporating world knowledge and alignment of human cognition, such as seasons, weather conditions, holidays, and users' profiles (such as habits, occupation, and preferences), can enhance the user experience while improving recommendation performance.To address these issues, we propose CoAST (Cognitive-Aligned Spatial-Temporal LLMs), a framework employing natural language as an interface, allowing for the incorporation of world knowledge, spatio-temporal trajectory patterns, profiles, and situational information.Specifically, CoAST mainly comprises of 2 stages: (1) Recommendation Knowledge Acquisition through continued pretraining on the enriched spatial-temporal trajectory data of the desensitized users; (2) Cognitive Alignment to align cognitive judgments with human preferences using enriched training data through Supervised Fine-Tuning (SFT) and a subsequent Reinforcement Learning (RL) phase.Extensive offline experiments on various real-world datasets and online experiments deployed in "Guess Where You Go" of AMAP App homepage demonstrate the effectiveness of CoAST.

link
Google Scholar

2025-10-16

Cross-Scenario Unified Modeling of User Interests at Billion Scale

User interests on content platforms are inherently diverse, manifesting through complex behavioral patterns across heterogeneous scenarios such as search, feed browsing, and content discovery.Traditional recommendation systems typically prioritize business metric optimization within isolated specific scenarios, neglecting cross-scenario behavioral signals and struggling to integrate advanced techniques like LLMs at billion-scale deployments, which finally limits their ability to capture holistic user interests across platform touchpoints. 0.791We propose RED-Rec, an LLM-enhanced hierarchical Recommender Engine for Diversified scenarios, tailored for industry-level content recommendation systems. 0.72RED-Rec unifies user interest representations across multiple behavioral contexts by aggregating and synthesizing actions from varied scenarios, resulting in comprehensive item and user modeling. 0.619At its core, a two-tower LLM-powered framework enables nuanced, multifaceted representations with deployment efficiency, and a scenario-aware dense mixing and querying policy effectively fuses diverse behavioral signals to capture cross-scenario user intent patterns and express fine-grained, context-specific intents during serving.We validate RED-Rec through online A/B testing on hundreds of millions of users in RedNote through online A/B testing, showing substantial performance gains in both content recommendation and advertisement targeting tasks.We further introduce a million-scale sequential recommendation dataset, RED-MMU, for comprehensive offline training and evaluation. 0.711Our work advances unified user modeling, unlocking deeper personalization and fostering more meaningful user engagement in large-scale UGC platforms.

link
Google Scholar

2025-10-15

Beyond Static LLM Policies: Imitation-Enhanced Reinforcement Learning for Recommendation

Recommender systems (RecSys) have become critical tools for enhancing user engagement by delivering personalized content across diverse digital platforms. 0.79Recent advancements in large language models (LLMs) demonstrate significant potential for improving RecSys, primarily due to their exceptional generalization capabilities and sophisticated contextual understanding, which facilitate the generation of flexible and interpretable recommendations. 0.633However, the direct deployment of LLMs as primary recommendation policies presents notable challenges, including persistent latency issues stemming from frequent API calls and inherent model limitations such as hallucinations and biases. 0.731To address these issues, this paper proposes a novel offline reinforcement learning (RL) framework that leverages imitation learning from LLM-generated trajectories.Specifically, inverse reinforcement learning is employed to extract robust reward models from LLM demonstrations.This approach negates the need for LLM fine-tuning, thereby substantially reducing computational overhead.Simultaneously, the RL policy is guided by the cumulative rewards derived from these demonstrations, effectively transferring the semantic insights captured by the LLM.Comprehensive experiments conducted on two benchmark datasets validate the effectiveness of the proposed method, demonstrating superior performance when compared against state-of-the-art RL-based and in-context learning baselines.The code can be found at https://github.com/ArronDZhang/IL-Rec.

link
Google Scholar

2025-10-15

MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation

Recent attempts to integrate large language models (LLMs) into recommender systems have gained momentum, but most remain limited to simple text generation or static prompt-based inference, failing to capture the complexity of user preferences and real-world interactions. 0.825This study proposes the Multi-Aspect Driven LLM Agent MADRec, an autonomous LLM-based recommender that constructs user and item profiles by unsupervised extraction of multi-aspect information from reviews and performs direct recommendation, sequential recommendation, and explanation generation. 0.625MADRec generates structured profiles via aspect-category-based summarization and applies Re-Ranking to construct high-density inputs.When the ground-truth item is missing from the output, the Self-Feedback mechanism dynamically adjusts the inference criteria.Experiments across multiple domains show that MADRec outperforms traditional and LLM-based baselines in both precision and explainability, with human evaluation further confirming the persuasiveness of the generated explanations.

link
Google Scholar

2025-10-15

Make an Offer They Can't Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment

Persuasion, a fundamental social capability for humans, remains a challenge for AI systems such as large language models (LLMs).Current studies often overlook the strategic use of information asymmetry in message design or rely on strong assumptions regarding pre-commitment.In this work, we explore the application of Bayesian Persuasion (BP) in natural language within single-turn dialogue settings, to enhance the strategic persuasion capabilities of LLMs. 0.618Our framework incorporates a commitment-communication mechanism, where the persuader explicitly outlines an information schema by narrating their potential types (e.g., honest or dishonest), thereby guiding the persuadee in performing the intended Bayesian belief update.We evaluate two variants of our approach: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP, benchmarking them against both naive and strong non-BP (NBP) baselines within a comprehensive evaluation framework.This framework covers a diverse set of persuadees -- including LLM instances with varying prompts and fine-tuning and human participants -- across tasks ranging from specially designed persuasion scenarios to general everyday situations.Experimental results on LLM-based agents reveal three main findings: (1) LLMs guided by BP strategies consistently achieve higher persuasion success rates than NBP baselines; (2) SFNL exhibits greater credibility and logical coherence, while FNL shows stronger emotional resonance and robustness in naturalistic conversations; (3) with supervised fine-tuning, smaller models can attain BP performance comparable to that of larger models.

link
Google Scholar

2025-10-15

HyMiRec: A Hybrid Multi-interest Learning Framework for LLM-based Sequential Recommendation

Large language models (LLMs) have recently demonstrated strong potential for sequential recommendation. 0.841However, current LLM-based approaches face critical limitations in modeling users' long-term and diverse interests.First, due to inference latency and feature fetching bandwidth constraints, existing methods typically truncate user behavior sequences to include only the most recent interactions, resulting in the loss of valuable long-range preference signals.Second, most current methods rely on next-item prediction with a single predicted embedding, overlooking the multifaceted nature of user interests and limiting recommendation diversity. 0.695To address these challenges, we propose HyMiRec, a hybrid multi-interest sequential recommendation framework, which leverages a lightweight recommender to extracts coarse interest embeddings from long user sequences and an LLM-based recommender to captures refined interest embeddings. 0.862To alleviate the overhead of fetching features, we introduce a residual codebook based on cosine similarity, enabling efficient compression and reuse of user history embeddings.To model the diverse preferences of users, we design a disentangled multi-interest learning module, which leverages multiple interest queries to learn disentangles multiple interest signals adaptively, allowing the model to capture different facets of user intent.Extensive experiments are conducted on both benchmark datasets and a collected industrial dataset, demonstrating our effectiveness over existing state-of-the-art methods.Furthermore, online A/B testing shows that HyMiRec brings consistent improvements in real-world recommendation systems. 0.727

link
Google Scholar

2025-10-14

Reinforced Preference Optimization for Recommendation

Recent breakthroughs in large language models (LLMs) have fundamentally shifted recommender systems from discriminative to generative paradigms, where user behavior modeling is achieved by generating target items conditioned on historical interactions. 0.791Yet current generative recommenders still suffer from two core limitations: the lack of high-quality negative modeling and the reliance on implicit rewards. 0.613Reinforcement learning with verifiable rewards (RLVR) offers a natural solution by enabling on-policy sampling of harder negatives and grounding optimization in explicit reward signals.However, applying RLVR to generative recommenders remains non-trivial.Its unique generation space often leads to invalid or repetitive items that undermine sampling efficiency, and ranking supervision is sparse since most items receive identical zero rewards.To address these challenges, we propose Reinforced Preference Optimization for Recommendation (ReRe), a reinforcement-based paradigm tailored to LLM-based recommenders, an important direction in generative recommendation. 0.828ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives, while augmenting rule-based accuracy rewards with auxiliary ranking rewards for finer-grained supervision.Extensive experiments on three real-world datasets demonstrate that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance. 0.872Further analysis shows that ReRe not only enhances performance across both base and SFT-initialized models but also generalizes robustly across different backbone families and scales.Beyond empirical gains, we systematically investigate the design space of RLVR in recommendation across generation, sampling strategy, reward modeling, and optimization algorithm, offering insights for future research. 0.745

link
Google Scholar

2025-10-14

Leveraging Language Semantics for Collaborative Filtering with TextGCN and TextGCN-MLP: Zero-Shot vs In-Domain Performance

In recent years, various approaches have been proposed to leverage large language models (LLMs) for incorporating textual information about items into recommender systems. 0.813Existing methods primarily focus on either fine-tuning LLMs to generate recommendations or integrating LLM-based embeddings into downstream models. 0.689In this work, we follow the latter direction and propose \textbf{TextGCN}, which applies parameter-free graph convolution layers directly over LLM-based item-title embeddings, instead of learning ID-based embeddings as in traditional methods.By combining language semantics with graph message passing, this architecture achieves state-of-the-art zero-shot performance, significantly outperforming prior approaches.Furthermore, we introduce \textbf{TextGCN-MLP}, which extends TextGCN with a trainable multilayer perceptron trained using a contrastive loss, achieving state-of-the-art in-domain performance on recommendation benchmarks. 0.735However, the zero-shot performance of TextGCN-MLP remains lower than that of TextGCN, highlighting the trade-off between in-domain specialization and zero-shot generalization.We release our code on github at \href{https://github.com/ChernovAndrey/TFCE}{github.com/ChernovAndrey/TFCE}.

link
Google Scholar

2025-10-14

CTRL-Rec: Controlling Recommender Systems With Natural Language

When users are dissatisfied with recommendations from a recommender system, they often lack fine-grained controls for changing them. 0.641Large language models (LLMs) offer a solution by allowing users to guide their recommendations through natural language requests (e.g., "I want to see respectful posts with a different perspective than mine"). 0.681We propose a method, CTRL-Rec, that allows for natural language control of traditional recommender systems in real-time with computational efficiency. 0.761Specifically, at training time, we use an LLM to simulate whether users would approve of items based on their language requests, and we train embedding models that approximate such simulated judgments.We then integrate these user-request-based predictions into the standard weighting of signals that traditional recommender systems optimize.At deployment time, we require only a single LLM embedding computation per user request, allowing for real-time control of recommendations. 0.689In experiments with the MovieLens dataset, our method consistently allows for fine-grained control across a diversity of requests.In a study with 19 Letterboxd users, we find that CTRL-Rec was positively received by users and significantly enhanced users' sense of control and satisfaction with recommendations compared to traditional controls. 0.664

link
Google Scholar

2025-10-13

HatLLM: Hierarchical Attention Masking for Enhanced Collaborative Modeling in LLM-based Recommendation

Recent years have witnessed a surge of research on leveraging large language models (LLMs) for sequential recommendation. 0.816LLMs have demonstrated remarkable potential in inferring users' nuanced preferences through fine-grained semantic reasoning.However, they also exhibit a notable limitation in effectively modeling collaborative signals, i.e., behavioral correlations inherent in users' historical interactions.Our empirical analysis further reveals that the attention mechanisms in LLMs tend to disproportionately focus on tokens within the same item, thereby impeding the capture of cross-item correlations. To address this limitation, we propose a novel hierarchical attention masking strategy for LLM-based recommendation, termed HatLLM. 0.71Specifically, in shallow layers, HatLLM masks attention between tokens from different items, facilitating intra-item semantic understanding; in contrast, in deep layers, HatLLM masks attention within items, thereby compelling the model to capture cross-item correlations.This progressive, layer-wise approach enables LLMs to jointly model both token-level and item-level dependencies.Extensive experiments on three real-world datasets demonstrate that HatLLM achieves significant performance gains (9.13% on average) over existing LLM-based methods.

link
Google Scholar

2025-10-13

Does LLM Focus on the Right Words? Diagnosing Language Bias in LLM-based Recommenders

Large language models (LLMs), owing to their extensive open-domain knowledge and semantic reasoning capabilities, have been increasingly integrated into recommender systems (RS). 0.841However, a substantial gap remains between the pre-training objectives of LLMs and the specific requirements of recommendation tasks.To address this gap, supervised fine-tuning (SFT) is commonly performed on specially curated recommendation datasets to further enhance their predictive ability. 0.69Despite its success, SFT exhibits a critical limitation: it induces Language Bias, whereby the model over-relies on auxiliary tokens-such as task descriptions and prefix-generated tokens-while underutilizing core user interaction tokens that encode user-specific preferences.This bias not only undermines recommendation accuracy but also raises unfairness concerns. To address this issue, we propose Group Distributionally Robust Optimization-based Tuning (GDRT), a novel fine-tuning paradigm that enforces consistent model performance across token groups with varying degrees of relevance to auxiliary tokens.By adaptively upweighting underperforming groups, typically those weakly correlated with auxiliary tokens, GDRT shifts the model's attention from superficial auxiliary cues to informative user interaction tokens, thereby mitigating language bias.Extensive experiments conducted on three public datasets demonstrate that GDRT effectively mitigates language bias, yielding substantial improvements in recommendation accuracy (with an average NDCG@10 gain of 24.29%) and significantly enhancing recommendation fairness. 0.718

link
Google Scholar

2025-10-13

Instruction-aware User Embedding via Synergistic Language and Representation Modeling

User representation modeling has become increasingly crucial for personalized applications, yet existing approaches struggle with generalizability across domains and sensitivity to noisy behavioral signals.We present InstructUE, an instruction-aware user embedding foundation model that leverages large language models (LLMs) to generate general and instruction-aware user representations.InstructUE introduces a multi-encoder architecture with a lightweight adapter that efficiently processes heterogeneous data from six different sources while preserving their structural characteristics.Additionally, it proposes a novel contrastive-autoregressive training framework that bridges language and representation spaces through a curated UserQA dataset.The contrastive-autoregressive training framework simultaneously leverages autoregressive learning to capture domain knowledge in language space and contrastive learning to align user-text embeddings in representation space, thereby enhancing the instruction-awareness and noise-robustness of user embeddings.Through extensive experiments on real-world applications, we demonstrate that InstructUE significantly outperforms existing methods across multiple domains including user prediction, marketing, and recommendation scenarios. 0.636Our results show that instruction-aware user modeling can effectively achieve instruction-guided denoising of user information in specific scenarios, paving the way for more generalizable and robust user representation learning.

link
Google Scholar

2025-10-13

Aligning Deep Implicit Preferences by Learning to Reason Defensively

Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions.However, current methods face a dual challenge: they fail to infer users' deep implicit preferences (including unstated goals, semantic context and risk tolerances), and they lack the defensive reasoning required to navigate real-world ambiguity.This cognitive gap leads to responses that are superficial, brittle and short-sighted.To address this, we propose Critique-Driven Reasoning Alignment (CDRA), which reframes alignment from a scalar reward-matching task into a structured reasoning process.First, to bridge the preference inference gap, we introduce the DeepPref benchmark.This dataset, comprising 3000 preference-query pairs across 20 topics, is curated by simulating a multi-faceted cognitive council that produces critique-annotated reasoning chains to deconstruct query semantics and reveal latent risks. 0.636Second, to instill defensive reasoning, we introduce the Personalized Generative Process Reward Model (Pers-GenPRM), which frames reward modeling as a personalized reasoning task.It generates a critique chain to evaluate a response's alignment with user preferences before outputting a final score based on this rationale.Ultimately, this interpretable, structured reward signal guides policy model through Critique-Driven Policy Alignment, a process-level online reinforcement learning algorithm integrating both numerical and natural language feedback.Experiments demonstrate that CDRA excels at discovering and aligning with users' true preferences while executing robust reasoning.Our code and dataset are available at https://github.com/Zephyrian-Hugh/Deep-pref.

link
Google Scholar

2025-10-13

Next Interest Flow: A Generative Pre-training Paradigm for Recommender Systems by Modeling All-domain Movelines

Click-Through Rate (CTR) prediction, a cornerstone of modern recommender systems, has been dominated by discriminative models that react to past user behavior rather than proactively modeling user intent. 0.681Existing generative paradigms attempt to address this but suffer from critical limitations: Large Language Model (LLM) based methods create a semantic mismatch by forcing e-commerce signals into a linguistic space, while ID-based generation is constrained by item memorization and cold-start issues.To overcome these limitations, we propose a novel generative pre-training paradigm.Our model learns to predict the Next Interest Flow, a dense vector sequence representing a user's future intent, while simultaneously modeling its internal Interest Diversity and Interest Evolution Velocity to ensure the representation is both rich and coherent.However, this two-stage approach introduces a critical objective mismatch between the generative and discriminative stages.We resolve this via a bidirectional alignment strategy, which harmonizes the two stages through cross-stage weight initialization and a dynamic Semantic Alignment Module for fine-tuning.Additionally, we enhance the underlying discriminative model with a Temporal Sequential Pairwise (TSP) mechanism to better capture temporal causality.We present the All-domain Moveline Evolution Network (AMEN), a unified framework implementing our entire pipeline.Extensive offline experiments validate AMEN's superiority over strong baselines, and a large-scale online A/B test demonstrates its significant real-world impact, delivering substantial improvements in key business metrics.

link
Google Scholar

2025-10-13

OneRec-Think: In-Text Reasoning for Generative Recommendation

The powerful generative capacity of Large Language Models (LLMs) has instigated a paradigm shift in recommendation. 0.772However, existing generative models (e.g., OneRec) operate as implicit predictors, critically lacking the capacity for explicit and controllable reasoning-a key advantage of LLMs.To bridge this gap, we propose OneRec-Think, a unified framework that seamlessly integrates dialogue, reasoning, and personalized recommendation. 0.73OneRec-Think incorporates: (1) Itemic Alignment: cross-modal Item-Textual Alignment for semantic grounding; (2) Reasoning Activation: Reasoning Scaffolding to activate LLM reasoning within the recommendation context; and (3) Reasoning Enhancement, where we design a recommendation-specific reward function that accounts for the multi-validity nature of user preferences. 0.626Experiments across public benchmarks show state-of-the-art performance.Moreover, our proposed "Think-Ahead" architecture enables effective industrial deployment on Kuaishou, achieving a 0.159\% gain in APP Stay Time and validating the practical efficacy of the model's explicit reasoning capability.

link
Google Scholar

2025-10-13

Asking Clarifying Questions for Preference Elicitation With Large Language Models

Large Language Models (LLMs) have made it possible for recommendation systems to interact with users in open-ended conversational interfaces. 0.762In order to personalize LLM responses, it is crucial to elicit user preferences, especially when there is limited user history.One way to get more information is to present clarifying questions to the user.However, generating effective sequential clarifying questions across various domains remains a challenge.To address this, we introduce a novel approach for training LLMs to ask sequential questions that reveal user preferences.Our method follows a two-stage process inspired by diffusion models.Starting from a user profile, the forward process generates clarifying questions to obtain answers and then removes those answers step by step, serving as a way to add ``noise'' to the user profile.The reverse process involves training a model to ``denoise'' the user profile by learning to ask effective clarifying questions.Our results show that our method significantly improves the LLM's proficiency in asking funnel questions and eliciting user preferences effectively.

link
Google Scholar

2025-10-09

Do We Really Need SFT? Prompt-as-Policy over Knowledge Graphs for Cold-start Next POI Recommendation

Next point-of-interest (POI) recommendation is crucial for smart urban services such as tourism, dining, and transportation, yet most approaches struggle under cold-start conditions where user-POI interactions are sparse. 0.684Recent efforts leveraging large language models (LLMs) address this challenge through either supervised fine-tuning (SFT) or in-context learning (ICL).However, SFT demands costly annotations and fails to generalize to inactive users, while static prompts in ICL cannot adapt to diverse user contexts.To overcome these limitations, we propose Prompt-as-Policy over knowledge graphs, a reinforcement-guided prompting framework that learns to construct prompts dynamically through contextual bandit optimization.Our method treats prompt construction as a learnable policy that adaptively determines (i) which relational evidences to include, (ii) the number of evidence per candidate, and (iii) their organization and ordering within prompts.More specifically, we construct a knowledge graph (KG) to discover candidates and mine relational paths, which are transformed into evidence cards that summarize rationales for each candidate POI.The frozen LLM then acts as a reasoning engine, generating recommendations from the KG-discovered candidate set based on the policy-optimized prompts.Experiments on three real-world datasets demonstrate that Prompt-as-Policy consistently outperforms state-of-the-art baselines, achieving average 7.7\% relative improvements in Acc@1 for inactive users, while maintaining competitive performance on active users, without requiring model fine-tuning.

link
Google Scholar

2025-10-09

LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings

Consumer research costs companies billions annually yet suffers from panel biases and limited scale.Large language models (LLMs) offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings.We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions using embedding similarity to reference statements. 0.708Testing on an extensive dataset comprising 57 personal care product surveys conducted by a leading corporation in that market (9,300 human responses), SSR achieves 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85).Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings.This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.

link
Google Scholar

Production workflows for LLMs

2025-10-20

LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena

Forecasting is not only a fundamental intellectual pursuit but also is of significant importance to societal systems such as finance and economics.With the rapid advances of large language models (LLMs) trained on Internet-scale data, it raises the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". 0.522This paper systematically investigates such predictive intelligence of LLMs.To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. 0.505Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market returns. 0.314However, we also uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recalls, misunderstanding of data sources and slower information aggregation compared to markets when resolution nears. 0.505

link
Google Scholar

2025-10-20

Qomhra: A Bilingual Irish-English Large Language Model

This paper introduces Qomhr\'a, a bilingual Irish-English large language model (LLM), developed under low-resource constraints presenting a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. 0.541Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability.6 closed-weight LLMs are judged for their Irish text generation by a native speaker, a learner and other LLMs. 0.355Google's Gemini-2.5-Pro is ranked the highest and is subsequently used to synthesise instruction tuning and human preference datasets. 0.33Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset, generating accepted and rejected responses that show near perfect alignment with a native Irish speaker.Qomhr\'a is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification and world knowledge with gains of up to 29% in Irish and 44% in English.Qomhr\'a also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.

link
Google Scholar

2025-10-20

LILO: Bayesian Optimization with Interactive Natural Language Feedback

For many real-world applications, feedback is essential in translating complex, nuanced, or subjective goals into quantifiable optimization objectives.We propose a language-in-the-loop framework that uses a large language model (LLM) to convert unstructured feedback in the form of natural language into scalar utilities to conduct BO over a numeric search space. 0.49Unlike preferential BO, which only accepts restricted feedback formats and requires customized models for each domain-specific problem, our approach leverages LLMs to turn varied types of textual feedback into consistent utility signals and to easily include flexible user priors without manual kernel design. 0.463At the same time, our method maintains the sample efficiency and principled uncertainty quantification of BO. 0.444We show that this hybrid method not only provides a more natural interface to the decision maker but also outperforms conventional BO baselines and LLM-only optimizers, particularly in feedback-limited regimes. 0.381

link
Google Scholar

2025-10-20

Enabling Fine-Grained Operating Points for Black-Box LLMs

Black-box Large Language Models (LLMs) provide practical and accessible alternatives to other machine learning methods, as they require minimal labeled data and machine learning expertise to develop solutions for various decision making problems. 0.459However, for applications that need operating with constraints on specific metrics (e.g., precision $\geq$ 95%), decision making with black-box LLMs remains unfavorable, due to their low numerical output cardinalities. 0.426This results in limited control over their operating points, preventing fine-grained adjustment of their decision making behavior.In this paper, we study using black-box LLMs as classifiers, focusing on efficiently improving their operational granularity without performance loss. 0.563Specifically, we first investigate the reasons behind their low-cardinality numerical outputs and show that they are biased towards generating rounded but informative verbalized probabilities.Then, we experiment with standard prompt engineering, uncertainty estimation and confidence elicitation techniques, and observe that they do not effectively improve operational granularity without sacrificing performance or increasing inference cost.Finally, we propose efficient approaches to significantly increase the number and diversity of available operating points. 0.584Our proposed approaches provide finer-grained operating points and achieve comparable to or better performance than the benchmark methods across 11 datasets and 3 LLMs. 0.427

link
Google Scholar

2025-10-20

Glyph: Scaling Context Windows via Visual-Text Compression

Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning.However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. 0.461In this work, we take a different perspective-visual context scaling-to tackle this challenge.Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs).This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression.Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. 0.53This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. 0.366Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. 0.561In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding.Our code and model are released at https://github.com/thu-coai/Glyph. 0.314

link
Google Scholar

2025-10-20

Unbiased Gradient Low-Rank Projection

Memory-efficient optimization is critical for training increasingly large language models (LLMs). 0.392A popular strategy involves gradient low-rank projection, storing only the projected optimizer states, with GaLore being a representative example. 0.414However, a significant drawback of many such methods is their lack of convergence guarantees, as various low-rank projection approaches introduce inherent biases relative to the original optimization algorithms, which contribute to performance gaps compared to full-parameter training. 0.342Aiming to tackle this problem, this paper investigates the layerwise sampling technique for debiasing low-rank projection mechanisms. 0.321In particular, an instantiation of the paradigm gives rise to a novel and unbiased low-rank optimization method built upon GaLore's mechanism and the Muon algorithm, named GaLore Unbiased with Muon (GUM). 0.468We theoretically prove our method matches the convergence guarantees of the base Muon algorithm while preserving the memory efficiency of low-rank techniques. 0.489Empirical experiments on LLM fine-tuning and pretraining also demonstrate non-trivial improvements over GaLore and even better performance than full-parameter training. 0.644Further investigation shows that the improvement of this technique comes from a more uniform distribution of knowledge inside layers, leading to more efficient utilization of the model parameter space and better memorization.

link
Google Scholar

LLM Model Architectures and Training Techniques

2025-10-20

Contextual Attention Modulation: Towards Efficient Multi-Task Adaptation in Large Language Models

Large Language Models (LLMs) possess remarkable generalization capabilities but struggle with multi-task adaptation, particularly in balancing knowledge retention with task-specific specialization.Conventional fine-tuning methods suffer from catastrophic forgetting and substantial resource consumption, while existing parameter-efficient methods perform suboptimally in complex multi-task scenarios. 0.705To address this, we propose Contextual Attention Modulation (CAM), a novel mechanism that dynamically modulates the representations of self-attention modules in LLMs.CAM enhances task-specific features while preserving general knowledge, thereby facilitating more effective and efficient adaptation.For effective multi-task adaptation, CAM is integrated into our Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared, full-parameter CAM module with multiple specialized, lightweight CAM modules, enhanced by a dynamic routing strategy for adaptive knowledge fusion. 0.473Extensive experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, demonstrate that our approach significantly outperforms existing approaches, achieving an average performance improvement of 3.65%. 0.409The implemented code and data are available to ease reproducibility at https://github.com/Applied-Machine-Learning-Lab/HyCAM.

link
Google Scholar

2025-10-20

PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition

Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive.While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. 0.4We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) introducing a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. 0.578Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets.Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.

link
Google Scholar

2025-10-20

This is Going to Sound Crazy, But What If We Used Large Language Models to Boost Automatic Database Tuning Algorithms By Leveraging Prior History? We Will Find Better Configurations More Quickly Than Retraining From Scratch!

Tuning database management systems (DBMSs) is challenging due to trillions of possible configurations and evolving workloads. 0.731Recent advances in tuning have led to breakthroughs in optimizing over the possible configurations. 0.623However, due to their design and inability to leverage query-level historical insights, existing automated tuners struggle to adapt and re-optimize the DBMS when the environment changes (e.g., workload drift, schema transfer). 0.666This paper presents the Booster framework that assists existing tuners in adapting to environment changes (e.g., drift, cross-schema transfer). 0.594Booster structures historical artifacts into query-configuration contexts, prompts large language models (LLMs) to suggest configurations for each query based on relevant contexts, and then composes the query-level suggestions into a holistic configuration with beam search.With multiple OLAP workloads, we evaluate Booster's ability to assist different state-of-the-art tuners (e.g., cost-/machine learning-/LLM-based) in adapting to environment changes. 0.489By composing recommendations derived from query-level insights, Booster assists tuners in discovering configurations that are up to 74% better and in up to 4.7x less time than the alternative approach of continuing to tune from historical configurations. 0.562

link
Google Scholar

2025-10-20

Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications

Medical Large language models achieve strong scores on standard benchmarks; however, the transfer of those results to safe and reliable performance in clinical workflows remains a challenge.This survey reframes evaluation through a levels-of-autonomy lens (L0-L3), spanning informational tools, information transformation and aggregation, decision support, and supervised agents. 0.412We align existing benchmarks and metrics with the actions permitted at each level and their associated risks, making the evaluation targets explicit. 0.408This motivates a level-conditioned blueprint for selecting metrics, assembling evidence, and reporting claims, alongside directions that link evaluation to oversight.By centering autonomy, the survey moves the field beyond score-based claims toward credible, risk-aware evidence for real clinical use.

link
Google Scholar

Programming applications of LLMs

2025-10-20

SEER: Enhancing Chain-of-Thought Code Generation through Self-Exploring Deep Reasoning

Code generation, the task of creating executable programs from natural language requirements, has recently seen tremendous advances through Chain-of-Thought (CoT) reasoning, which enables Large Language Models (LLMs) to develop high-level reasoning plans before writing code. 0.791Recent research has proposed various methods to enhance models' CoT reasoning for code generation such as prompt engineering and supervised fine-tuning. 0.851However, existing approaches still face three critical limitations: (1) limited exploration of diverse reasoning paths, which constrains generalization across various programming scenarios, (2) lack of quality assessment for intermediate reasoning steps, which hampers the reliability of the generated plans and code, and (3) the potential negative impact of "overthinking", potentially leading to unnecessarily complex and incorrect solutions.To address these limitations, we frame CoT code generation as a decision making problem and present SEER, a SElf-Exploring deep Reasoning framework that enables accurate and adaptive reasoning for code generation. 0.676SEER introduces three key components: (1) Diverse reasoning path exploration, which aims at exploring diverse reasoning paths and annotating intermediate steps without relying on manual experts or closed-source proprietary models; (2) Reasoning quality-aware model training, which trains a policy model for generating candidate reasoning steps and a value model for assessing their quality; and (3) Adaptive CoT reasoning, which dynamically switches between direct generation and step-by-step reasoning for different problems.

link
Google Scholar

2025-10-20

PEACE: Towards Efficient Project-Level Efficiency Optimization via Hybrid Code Editing

Large Language Models (LLMs) have demonstrated significant capability in code generation, but their potential in code efficiency optimization remains underexplored. 0.897Previous LLM-based code efficiency optimization approaches exclusively focus on function-level optimization and overlook interaction between functions, failing to generalize to real-world development scenarios.Code editing techniques show great potential for conducting project-level optimization, yet they face challenges associated with invalid edits and suboptimal internal functions.To address these gaps, we propose Peace, a novel hybrid framework for Project-level code Efficiency optimization through Automatic Code Editing, which also ensures the overall correctness and integrity of the project. 0.695Peace integrates three key phases: dependency-aware optimizing function sequence construction, valid associated edits identification, and efficiency optimization editing iteration.To rigorously evaluate the effectiveness of Peace, we construct PeacExec, the first benchmark comprising 146 real-world optimization tasks from 47 high-impact GitHub Python projects, along with highly qualified test cases and executable environments.Extensive experiments demonstrate Peace's superiority over the state-of-the-art baselines, achieving a 69.2% correctness rate (pass@1), +46.9% opt rate, and 0.840 speedup in execution efficiency.Notably, our Peace outperforms all baselines by significant margins, particularly in complex optimization tasks with multiple functions.Moreover, extensive experiments are also conducted to validate the contributions of each component in Peace, as well as the rationale and effectiveness of our hybrid framework design.

link
Google Scholar

2025-10-20

Mamba4Net: Distilled Hybrid Mamba Large Language Models For Networking

Transformer-based large language models (LLMs) are increasingly being adopted in networking research to address domain-specific challenges.However, their quadratic time complexity and substantial model sizes often result in significant computational overhead and memory constraints, particularly in resource-constrained environments.Drawing inspiration from the efficiency and performance of the Deepseek-R1 model within the knowledge distillation paradigm, this paper introduces Mamba4Net, a novel cross-architecture distillation framework.Mamba4Net transfers networking-specific knowledge from transformer-based LLMs to student models built on the Mamba architecture, which features linear time complexity.This design substantially enhances computational efficiency compared to the quadratic complexity of transformer-based models, while the reduced model size further minimizes computational demands, improving overall performance and resource utilization.To evaluate its effectiveness, Mamba4Net was tested across three diverse networking tasks: viewport prediction, adaptive bitrate streaming, and cluster job scheduling.Compared to existing methods that do not leverage LLMs, Mamba4Net demonstrates superior task performance.Furthermore, relative to direct applications of transformer-based LLMs, it achieves significant efficiency gains, including a throughput 3.96 times higher and a storage footprint of only 5.48% of that required by previous LLM-based approaches.These results highlight Mamba4Net's potential to enable the cost-effective application of LLM-derived knowledge in networking contexts.The source code is openly available to support further research and development. 0.68

link
Google Scholar

2025-10-20

TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework

Large foundation models are fundamentally transforming the software engineering landscape, demonstrating exceptional capabilities across diverse tasks such as code generation, debugging, and testing.Despite this rapid progress, a significant gap remains in how to comprehensively evaluate these models' trustworthiness in real-world software engineering scenarios.Existing benchmarks suffer from limited task scope and fail to incorporate critical evaluation aspects such as the robustness and reliability of models.To bridge this gap, we present an evaluation framework called TREAT (Code LLMs Trustworthiness / Reliability Evaluation And Testing) that provides a holistic assessment of model performance in code intelligence tasks.Our evaluation framework addresses key limitations in existing approaches with four main improvements: (1) Multi-Task Holistic Evaluation that spans diverse software engineering activities rather than limited coding tasks; (2) Multi-Language and Multi-Modality Assessment that extends beyond traditional single-language, text-only benchmarks to include multi-modality coding tasks; (3) Robustness Assessment that evaluates model reliability under semantically-preserving code transformations; and (4) Rigorous Evaluation Methodology that enhances the trustworthiness of evaluation results through diverse evaluation prompts and adaptive solution extraction.Based on this evaluation framework, we assess 26 state-of-the-art models and uncover both their strengths and limitations, yielding several key insights:(1) Current models show substantial performance variation across programming tasks; (2) Multi-modal language models demonstrate specific performance limitations in UI code generation and edit; 0.657

link
Google Scholar

2025-10-20

AdapTrack: Constrained Decoding without Distorting LLM's Output Intent

Language model-based code generation and completion tools have been widely adopted, but they may sometimes produce code that does not meet necessary constraints, such as syntactic correctness or API existence. 0.877Constrained decoding techniques are developed to help the model generate code adhering to the constraints by greedily eliminating generation options that violate constraints at each step of the generation process.However, there is a severe limitation of constrained decoding, that it distorts the model's output intent, forcing it to produce code that may satisfy the constraint but does not match the development intent and is therefore incorrect.In response to this challenge, we propose AdapTrack.By incorporating backtracking into the generation process, AdapTrack avoids distorting the output intent of the model, thereby producing results that are not only constraint-compliant but also more semantically aligned with model's output intent.On our synthetic API completion dataset, AdapTrack can achieve up to 360.87% improvement compared to constrained decoding; on the real-world API completion dataset we collect that exhibits similar issues, AdapTrack can achieve up to 38.93% improvement over constrained decoding; in general code genration benchmarks, compared to constrained decoding, AdapTrack can achieve up to 7.84% improvement on HumanEval, and up to 6.42% improvement on MBPP.This indicates that, simply by better adhering to the model's output intent, AdapTrack can achieve significant improvements.We provide a theoretical proof that the distribution produced by AdapTrack aligns with the model's distribution given the generated tokens, thereby ensuring that the model's output intent is not distorted.Experiments on DSL problems show that, compared to existing methods, our approach can provide generation results that are more consistent with the language model's distribution.

link
Google Scholar

2025-10-20

DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework

Large language models (LLMs) have advanced Text-to-SQL, yet existing solutions still fall short of system-level reliability.The limitation is not merely in individual modules - e.g., schema linking, reasoning, and verification - but more critically in the lack of structured orchestration that enforces correctness across the entire workflow.This gap motivates a paradigm shift: treating Text-to-SQL not as free-form language generation but as a software-engineering problem that demands structured, verifiable orchestration.We present DeepEye-SQL, a software-engineering-inspired framework that reframes Text-to-SQL as the development of a small software program, executed through a verifiable process guided by the Software Development Life Cycle (SDLC). 0.645DeepEye-SQL integrates four synergistic stages: it grounds ambiguous user intent through semantic value retrieval and robust schema linking; enhances fault tolerance with N-version SQL generation using diverse reasoning paradigms; ensures deterministic verification via a tool-chain of unit tests and targeted LLM-guided revision; and introduces confidence-aware selection that clusters execution results to estimate confidence and then takes a high-confidence shortcut or runs unbalanced pairwise adjudication in low-confidence cases, yielding a calibrated, quality-gated output.This SDLC-aligned workflow transforms ad hoc query generation into a disciplined engineering process.Using ~30B open-source LLMs without any fine-tuning, DeepEye-SQL achieves 73.5% execution accuracy on BIRD-Dev and 89.8% on Spider-Test, outperforming state-of-the-art solutions.This highlights that principled orchestration, rather than LLM scaling alone, is key to achieving system-level reliability in Text-to-SQL.

link
Google Scholar

2025-10-20

Executable Knowledge Graphs for Replicating AI Research

Replicating AI research is a crucial yet challenging task for large language model (LLM) agents.Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. 0.722Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. 0.644To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature.When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication.Code will released at https://github.com/zjunlp/xKG.

link
Google Scholar

2025-10-19

QuanBench: Benchmarking Quantum Code Generation with Large Language Models

Large language models (LLMs) have demonstrated good performance in general code generation; however, their capabilities in quantum code generation remain insufficiently studied. 0.813This paper presents QuanBench, a benchmark for evaluating LLMs on quantum code generation.QuanBench includes 44 programming tasks that cover quantum algorithms, state preparation, gate decomposition, and quantum machine learning.Each task has an executable canonical solution and is evaluated by functional correctness (Pass@K) and quantum semantic equivalence (Process Fidelity).We evaluate several recent LLMs, including general-purpose and code-specialized models. 0.641The results show that current LLMs have limited capability in generating the correct quantum code, with overall accuracy below 40% and frequent semantic errors.We also analyze common failure cases, such as outdated API usage, circuit construction errors, and incorrect algorithm logic.QuanBench provides a basis for future work on improving quantum code generation with LLMs.

link
Google Scholar

2025-10-19

When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation

Large Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance.We investigate this assumption for the complex task of code translation.Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from approximately 100,000 to 800,000 tokens.Our findings reveal a "many-shot paradox": while static similarity metrics may modestly improve with more examples, functional correctness consistently peaks with few-shot prompting (5-25 examples).Providing substantially more examples often degrades this crucial functional performance.This study highlights that for code translation, the quality of a few well-chosen examples outweighs sheer quantity, challenging the universal efficacy of "more is better" for ICL and underscoring the task-dependent nature of optimal prompting strategies. 0.652Our results have significant implications for effectively leveraging LLMs in software engineering.

link
Google Scholar

2025-10-19

Tutoring LLM into a Better CUDA Optimizer

Recent leaps in large language models (LLMs) caused a revolution in programming tools (like GitHub Copilot) that can help with code generation, debugging, and even performance optimization. 0.931In this paper, we focus on the capabilities of the most recent reasoning models to generate optimized CUDA code for predefined, well-known tasks.Our objective is to determine which types of code optimizations and parallel patterns the LLMs can perform by themselves and whether they can be improved by tutoring (providing more detailed hints and guidelines in the prompt). 0.648The generated solutions were evaluated both automatically (for correctness and speedup) and manually (code reviews) to provide a more detailed perspective.We also tried an interactive approach where the LLM can fix its previous mistakes within a session.The results indicate that LLMs are quite skilled coders; however, they require tutoring to reach optimized solutions provided by parallel computing experts. 0.643

link
Google Scholar

2025-10-16

PathFix: Automated Program Repair with Expected Path

Automated program repair (APR) techniques are effective in fixing inevitable defects in software, enhancing development efficiency and software robustness. 0.642However, due to the difficulty of generating precise specifications, existing APR methods face two main challenges: generating too many plausible patch candidates and overfitting them to partial test cases.To tackle these challenges, we introduce a new APR method named PathFix, which leverages path-sensitive constraints extracted from correct execution paths to generate patches for repairing buggy code.It is based on one observation: if a buggy program is repairable, at least one expected path is supposed to replace the fault path in the patched program.PathFix operates in four main steps.First, it traces fault paths reaching the fault output in the buggy program.Second, it derives expected paths by analyzing the desired correct output on the control flow graph, where an expected path defines how a feasible patch leads to the correct execution.Third, PathFix generates and evaluates patches by solving state constraints along the expected path.Fourth, we validate the correctness of the generated patch.To further enhance repair performance and mitigate scalability issues introduced by path-sensitive analysis, we integrate a large language model (LLM) into our framework.Experimental results show that PathFix outperforms existing solutions, particularly in handling complex program structures such as loops and recursion.

link
Google Scholar

2025-10-16

ATGen: Adversarial Reinforcement Learning for Test Case Generation

Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs, for which effective test cases are a critical bottleneck. 0.783Existing test generation methods, whether based on prompting or supervised fine-tuning, rely on static datasets.This imposes a ``fixed-difficulty ceiling'', fundamentally limiting their ability to uncover novel or more complex bugs beyond their training scope.To overcome this, we introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning.ATGen pits a test generator against an adversarial code generator that continuously crafts harder bugs to evade the current policy.This dynamic loop creates a curriculum of increasing difficulty challenging current policy.The test generator is optimized via Reinforcement Learning (RL) to jointly maximize ``Output Accuracy'' and ``Attack Success'', enabling it to learn a progressively stronger policy that breaks the fixed-difficulty ceiling of static training.Extensive experiments demonstrate that ATGen significantly outperforms state-of-the-art baselines.We further validate its practical utility, showing it serves as both a more effective filter for Best-of-N inference and a higher-quality reward source for training code generation models.Our work establishes a new, dynamic paradigm for improving the reliability of LLM-generated code.

link
Google Scholar

2025-10-16

Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References

Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution.However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap.While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines--a process that is labor-intensive, error-prone, and unsustainable.To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. 0.734Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details.Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting.Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1$\times$ speedup over highly optimized cuBLAS GEMM kernels.For attention workloads, Tawa attains 1.2$\times$ speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.

link
Google Scholar

2025-10-16

Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code

Large Language Models (LLMs) are increasingly used to automate hardware design tasks, including the generation of Verilog code. 0.792While early benchmarks focus primarily on functional correctness, efficient hardware design demands additional optimization for synthesis metrics such as area, delay, and power.Existing benchmarks fall short in evaluating these aspects comprehensively: they often lack optimized baselines or testbenches for verification.To address these gaps, we present Pluto, a benchmark and evaluation framework designed to assess the efficiency of LLM-generated Verilog designs.Pluto presents a comprehensive evaluation set of 114 problems with self-checking testbenches and multiple Pareto-optimal reference implementations.Experimental results show that state-of-the-art LLMs can achieve high functional correctness, reaching 78.3\% at pass@1, but their synthesis efficiency still lags behind expert-crafted implementations, with area efficiency of 63.8\%, delay efficiency of 65.9\%, and power efficiency of 64.0\% at eff@1.This highlights the need for efficiency-aware evaluation frameworks such as Pluto to drive progress in hardware-focused LLM research.

link
Google Scholar

2025-10-16

Programmatic Representation Learning with Language Models

Classical models for supervised machine learning, such as decision trees, are efficient and interpretable predictors, but their quality is highly dependent on the particular choice of input features.Although neural networks can learn useful representations directly from raw data (e.g., images or text), this comes at the expense of interpretability and the need for specialized hardware to run them efficiently.In this paper, we explore a hypothesis class we call Learned Programmatic Representations (LeaPR) models, which stack arbitrary features represented as code (functions from data points to scalars) and decision tree predictors.We synthesize feature functions using Large Language Models (LLMs), which have rich prior knowledge in a wide range of domains and a remarkable ability to write code using existing domain-specific libraries. 0.69We propose two algorithms to learn LeaPR models from supervised data.First, we design an adaptation of FunSearch to learn features rather than directly generate predictors.Then, we develop a novel variant of the classical ID3 algorithm for decision tree learning, where new features are generated on demand when splitting leaf nodes.In experiments from chess position evaluation to image and text classification, our methods learn high-quality, neural network-free predictors often competitive with neural networks.Our work suggests a flexible paradigm for learning interpretable representations end-to-end where features and predictions can be readily inspected and understood.

link
Google Scholar

2025-10-16

TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. 0.766As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming.To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization.Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior.Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries.Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.

link
Google Scholar

2025-10-16

Attention Is All You Need for KV Cache in Diffusion LLMs

This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency.Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy.We make three observations: (1) distant ${\bf MASK}$ tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens.Building on these, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches).Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality.Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. 0.651Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.

link
Google Scholar

2025-10-15

A Matter of Representation: Towards Graph-Based Abstract Code Generation

Most large language models (LLMs) today excel at generating raw, sequential code with minimal abstractions and custom structures. 0.885However, there has been little work on graph-based abstract code generation, where significant logic is encapsulated in predefined nodes and execution flow is determined by edges. 0.689This is relevant for visual programming languages, and in cases where raw source code is inaccessible to users and LLM training sets.In this work, we propose and evaluate JSON representations for graphs to enable high accuracy graph-based abstract code generation. 0.723We evaluate these representations on ScratchTest, a mini-benchmark based on our custom Python re-implementation of Scratch, which tests the LLM in code graph space. 0.661Our findings demonstrate that LLMs can indeed perform the aforementioned generation task in a single pass without relying on specialized or complex pipelines, given the correct graph representations.We also show that different representations induce significantly different accuracies, highlighting the instrumental role of representations in this generation task.All in all, this work establishes the first steps towards representation learning for graph-based abstract code generation.

link
Google Scholar

2025-10-15

A11YN: aligning LLMs for accessible web UI code generation

Large language models (LLMs) have recently demonstrated strong capabilities in generating functional and aesthetic web interfaces directly from instructions. 0.657However, these models often replicate accessibility flaws from their training data, resulting in interfaces that exclude users with diverse needs and contexts.To address this gap, we introduce A11yn, the first method that aligns code-generating LLMs to reliably produce accessibility-compliant web UIs.A11yn optimizes a novel reward function that penalizes violations of the Web Content Accessibility Guidelines (WCAG), with penalties scaled to the severity of each violation as identified by an accessibility testing engine.To support training, we construct UIReq-6.8K, a dataset of 6,800 diverse instructions for web UI generation.For evaluation, we introduce RealUIReq-300, a benchmark of 300 real-world web UI requests grounded and manually curated from public web pages, spanning a broad range of use cases.Empirical results show that A11yn significantly outperforms strong baselines, lowering the Inaccessibility Rate by 60% over the base model while preserving semantic fidelity and visual quality of generated UIs.These findings demonstrate that accessibility can be systematically optimized within LLMs, showing the feasibility of aligning code generation for accessibility. 0.679

link
Google Scholar

2025-10-15

One Bug, Hundreds Behind: LLMs for Large-Scale Bug Discovery

Fixing bugs in large programs is a challenging task that demands substantial time and effort.Once a bug is found, it is reported to the project maintainers, who work with the reporter to fix it and eventually close the issue.However, across the program, there are often similar code segments, which may also contain the bug, but were missed during discovery.Finding and fixing each recurring bug instance individually is labor intensive.Even more concerning, bug reports can inadvertently widen the attack surface as they provide attackers with an exploitable pattern that may be unresolved in other parts of the program. In this paper, we explore these Recurring Pattern Bugs (RPBs) that appear repeatedly across various code segments of a program or even in different programs, stemming from a same root cause, but are unresolved.Our investigation reveals that RPBs are widespread and can significantly compromise the security of software programs.This paper introduces BugStone, a program analysis system empowered by LLVM and a Large Language Model (LLM). 0.694The key observation is that many RPBs have one patched instance, which can be leveraged to identify a consistent error pattern, such as a specific API misuse.By examining the entire program for this pattern, it is possible to identify similar sections of code that may be vulnerable.Starting with 135 unique RPBs, BugStone identified more than 22K new potential issues in the Linux kernel.Manual analysis of 400 of these findings confirmed that 246 were valid.We also created a dataset from over 1.9K security bugs reported by 23 recent top-tier conference works.We manually annotate the dataset, identify 80 recurring patterns and 850 corresponding fixes.Even with a cost-efficient model choice, BugStone achieved 92.2% precision and 79.1% pairwise accuracy on the dataset.

link
Google Scholar

2025-10-15

David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation

Scenario simulation is central to testing autonomous driving systems.Scenic, a domain-specific language (DSL) for CARLA, enables precise and reproducible scenarios, but NL-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics.We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, an Example Retriever, and 14 prompting variants (ZS, FS, CoT, SP, MoT).We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B), using text metrics (BLEU, ChrF, EDIT-SIM, CrystalBLEU) and execution metrics (compilation and generation), and compare them with an expert study (n=11). 0.751EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP (F1 of EDIT-SIM and compilation) as a robust dataset-level proxy that improves ranking fidelity.GPT-4o performs best overall, while Qwen2.5Coder-14B reaches about 88 percent of its expert score on local hardware.Retrieval-augmented prompting, Few-Shot with Example Retriever (FSER), consistently boosts smaller models, and scaling shows diminishing returns beyond mid-size, with Qwen2.5Coder outperforming CodeLlama at comparable scales.NL2Scenic and EDIT-COMP offer a standardized, reproducible basis for evaluating Scenic code generation and indicate that mid-size open-source models are practical, cost-effective options for autonomous-driving scenario programming.

link
Google Scholar

2025-10-14

Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation

Prevailing code generation benchmarks, such as HumanEval+ and MBPP+, primarily evaluate large language models (LLMs) with pass@k on functional correctness using well-formed inputs. 0.785However, they ignore a crucial aspect of real-world software: adherence to contracts-the preconditions and validity constraints that dictate how ill-formed inputs must be rejected.This critical oversight means that existing benchmarks fail to measure, and models consequently fail to generate, truly robust and reliable code snippets.We introduce PACT, a program assessment and contract-adherence evaluation framework, to bridge this gap.PACT is the first framework designed to systematically evaluate and enhance contract-adherence in LLM-generated code snippets alongside functional correctness.PACT's contributions are threefold:First, it provides a comprehensive test-suite corpus focused on contract violations, extending HumanEval+ and MBPP+.Second, it enables a systematic analysis of code generation under varied prompting conditions. 0.74This analysis demonstrates that augmenting prompts with contract-violating test cases significantly enhance a model's ability to respect contracts compared to using contract description alone.Finally, it introduces novel metrics to rigorously quantify contract adherence in both test generation and code generation.By revealing critical errors that conventional benchmarks overlook, PACT provides the rigorous and interpretable metrics to evaluate the robustness of LLM-generated code snippets in both functionality and contract-adherence.Our code and data are available at https://github.com/suhanmen/PACT.

link
Google Scholar

2025-10-14

Towards Engineering Multi-Agent LLMs: A Protocol-Driven Approach

The increasing demand for software development has driven interest in automating software engineering (SE) tasks using Large Language Models (LLMs). 0.838Recent efforts extend LLMs into multi-agent systems (MAS) that emulate collaborative development workflows, but these systems often fail due to three core deficiencies: under-specification, coordination misalignment, and inappropriate verification, arising from the absence of foundational SE structuring principles.This paper introduces Software Engineering Multi-Agent Protocol (SEMAP), a protocol-layer methodology that instantiates three core SE design principles for multi-agent LLMs: (1) explicit behavioral contract modeling, (2) structured messaging, and (3) lifecycle-guided execution with verification, and is implemented atop Google's Agent-to-Agent (A2A) infrastructure.Empirical evaluation using the Multi-Agent System Failure Taxonomy (MAST) framework demonstrates that SEMAP effectively reduces failures across different SE tasks.In code development, it achieves up to a 69.6% reduction in total failures for function-level development and 56.7% for deployment-level development.For vulnerability detection, SEMAP reduces failure counts by up to 47.4% on Python tasks and 28.2% on C/C++ tasks.

link
Google Scholar

2025-10-14

iCodeReviewer: Improving Secure Code Review with Mixture of Prompts

Code review is an essential process to ensure the quality of software that identifies potential software issues at an early stage of software development. 0.782Among all software issues, security issues are the most important to identify, as they can easily lead to severe software crashes and service disruptions.Recent research efforts have been devoted to automated approaches to reduce the manual efforts required in the secure code review process.Despite the progress, current automated approaches on secure code review, including static analysis, deep learning models, and prompting approaches, still face the challenges of limited precision and coverage, and a lack of comprehensive evaluation. To mitigate these challenges, we propose iCodeReviewer, which is an automated secure code review approach based on large language models (LLMs). 0.695iCodeReviewer leverages a novel mixture-of-prompts architecture that incorporates many prompt experts to improve the coverage of security issues.Each prompt expert is a dynamic prompt pipeline to check the existence of a specific security issue.iCodeReviewer also implements an effective routing algorithm to activate only necessary prompt experts based on the code features in the input program, reducing the false positives induced by LLM hallucination.Experiment results in our internal dataset demonstrate the effectiveness of iCodeReviewer in security issue identification and localization with an F1 of 63.98%.The review comments generated by iCodeReviewer also achieve a high acceptance rate up to 84% when it is deployed in production environments.

link
Google Scholar

2025-10-14

GOAT: A Training Framework for Goal-Oriented Agent with Tools

Large language models (LLMs) have recently been extended beyond traditional text generation to serve as interactive agents capable of using external tools based on user intent. 0.704However, current LLM agents still show limited ability to handle goal-oriented queries, which require decomposing a high-level objective into multiple interdependent API calls with correct planning and execution.Current approaches mainly rely on zero-shot evaluation due to the absence of training data.While proprietary closed-source models such as GPT-4 demonstrate strong reasoning abilities, smaller open-source models struggle to perform complex tool use effectively.Thus, we propose a novel training framework GOAT, which enables fine-tuning of LLM agents in a human annotation-free setting.GOAT automatically constructs synthetic datasets of goal-oriented API execution tasks directly from given API documents, equipping models with the ability to reason over interdependent calls and generate coherent responses.Through extensive experiments, we show that GOAT-trained agents achieve state-of-the-art performance across multiple existing goal-oriented benchmarks.In addition, we introduce GOATBench, a new goal-oriented API execution benchmark, and demonstrate that agents trained with GOAT also excel in this setting.These results highlight GOAT as a practical path toward building robust open-source LLM agents capable of complex reasoning and tool use.

link
Google Scholar

2025-10-14

A Survey of Vibe Coding with Large Language Models

The advancement of large language models (LLMs) has catalyzed a paradigm shift from code generation assistance to autonomous coding agents, enabling a novel development methodology termed "Vibe Coding" where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension. 0.762Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored, with empirical evidence revealing unexpected productivity losses and fundamental challenges in human-AI collaboration.To address this gap, this survey provides the first comprehensive and systematic review of Vibe Coding with large language models, establishing both theoretical foundations and practical frameworks for this transformative development approach.Drawing from systematic analysis of over 1000 research papers, we survey the entire vibe coding ecosystem, examining critical infrastructure components including LLMs for coding, LLM-based coding agent, development environment of coding agent, and feedback mechanisms.We first introduce Vibe Coding as a formal discipline by formalizing it through a Constrained Markov Decision Process that captures the dynamic triadic relationship among human developers, software projects, and coding agents.Building upon this theoretical foundation, we then synthesize existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, thus providing the first comprehensive taxonomy in this domain.Critically, our analysis reveals that successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models.

link
Google Scholar

2025-10-14

When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. 0.656Yet, this ability also heightens the risk of identity impersonation.To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection.In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations.Our experimental results demonstrate large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops.We attribute this limitation to the \textit{feature-inversion trap}, where features that are discriminative in general domains become inverted and misleading when applied to personalized text.Based on this finding, we propose \method, a simple and reliable way to predict detector performance changes in personalized settings.\method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along these features to evaluate detector dependence.Our experiments show that \method can accurately predict both the direction and the magnitude of post-transfer changes, showing 85\% correlation with the actual performance gaps.We hope that this work will encourage further research on personalized text detection.

link
Google Scholar

2025-10-14

Diff-XYZ: A Benchmark for Evaluating Diff Understanding

Reliable handling of code diffs is central to agents that edit and refactor repositories at scale.We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). 0.642Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol.We use the benchmark to do a focused empirical study of the unified diff format and run a cross-format comparison of different diff representations.Our findings reveal that different formats should be used depending on the use case and model size.For example, representing diffs in search-replace format is good for larger models in the diff generation scenario, yet not suited well for diff analysis and smaller models.The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs that can aid future development of diff formats and models editing code.The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.

link
Google Scholar

2025-10-14

Data-Model Co-Evolution: Growing Test Sets to Refine LLM Behavior

A long-standing challenge in machine learning has been the rigid separation between data work and model refinement, enforced by slow fine-tuning cycles.The rise of Large Language Models (LLMs) overcomes this historical barrier, allowing applications developers to instantly govern model behavior by editing prompt instructions. 0.736This shift enables a new paradigm: data-model co-evolution, where a living test set and a model's instructions evolve in tandem.We operationalize this paradigm in an interactive system designed to address the critical challenge of encoding subtle, domain-specific policies into prompt instructions.The system's structured workflow guides people to discover edge cases, articulate rationales for desired behavior, and iteratively evaluate instruction revisions against a growing test set.A user study shows our workflow helps participants refine instructions systematically and specify ambiguous policies more concretely.This work points toward more robust and responsible LLM applications through human-in-the-loop development aligned with local preferences and policies.

link
Google Scholar

2025-10-14

Adaptive Generation of Bias-Eliciting Questions for LLMs

Large language models (LLMs) are now widely deployed in user-facing applications, reaching hundreds of millions worldwide. 0.715As they become integrated into everyday tasks, growing reliance on their outputs raises significant concerns.In particular, users may unknowingly be exposed to model-inherent biases that systematically disadvantage or stereotype certain groups.However, existing bias benchmarks continue to rely on templated prompts or restrictive multiple-choice questions that are suggestive, simplistic, and fail to capture the complexity of real-world user interactions.In this work, we address this gap by introducing a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions over sensitive attributes such as sex, race, or religion.By iteratively mutating and selecting bias-inducing questions, our approach systematically explores areas where models are most susceptible to biased behavior.Beyond detecting harmful biases, we also capture distinct response dimensions that are increasingly relevant in user interactions, such as asymmetric refusals and explicit acknowledgment of bias.Leveraging our framework, we construct CAB, a human-verified benchmark spanning diverse topics, designed to enable cross-model comparisons.Using CAB, we analyze a range of LLMs across multiple bias dimensions, revealing nuanced insights into how different models manifest bias.For instance, while GPT-5 outperforms other models, it nonetheless exhibits persistent biases in specific scenarios.These findings underscore the need for continual improvements to ensure fair model behavior.

link
Google Scholar

2025-10-13

Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations

New Large Language Models (LLMs) become available every few weeks, and modern application developers confronted with the unenviable task of having to decide if they should switch to a new model. 0.688While human evaluation remains the gold standard, it is costly and unscalable.The state-of-the-art approach is to use LLMs as evaluators ( LLM-as-a-judge), but this suffers from a critical flaw:LLMs exhibit a strong positive bias.We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (i.e., True Positive Rate 96%), they are remarkably poor at identifying invalid ones (i.e., True Negative Rate <25%).This systematic bias, coupled with class imbalance, often leads to inflated reliability scores. While ensemble-based methods like majority voting can help, we show that they are not good enough.We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent.For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground truth data.On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, achieving a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs. 0.663

link
Google Scholar

2025-10-13

Lingxi: Repository-Level Issue Resolution Framework Enhanced by Procedural Knowledge Guided Scaling

Driven by the advancements of Large Language Models (LLMs), LLM-powered agents are making significant improvements in software engineering tasks, yet struggle with complex, repository-level issue resolution. 0.702Existing agent-based methods have two key limitations.First, they lack of procedural knowledge (i.e., how an issue is fixed step-by-step and rationales behind it) to learn and leverage for issue resolution.Second, they rely on massive computational power to blindly explore the solution space.% To address those limitations, we propose Lingxi, an issue resolution framework that leverages procedural knowledge extracted from historical issue-fixing data to guide agents in solving repository-level issues.\ourTool first constructs this knowledge offline through a hierarchical abstraction mechanism, enabling agents to learn the how and why behind a fix, not just the final solution.During online application, it employs a knowledge-driven scaling method that leverages the procedural knowledge of similar issues to intelligently analyze the target issue from multiple perspectives, in sharp contrast to undirected, brute-force exploration.% Lingxi successfully resolves 74.6\% of bugs on the SWE-bench Verified benchmark in Past@1 setting, outperforming five state-of-the-art techniques by a significant margin (5.4\% to 14.9\%).Our comprehensive ablation study confirmed that the success of Lingxi comes directly from its use of procedural knowledge.Without it, the performance gains from scaling alone is negligible.Our qualitative study further shows that the ``design patterns $\&$ coding practices'' is the most critical knowledge aspect, and that the roles of different knowledge aspects switch across different stages (i.e., analysis, planning, and fixing).

link
Google Scholar

2025-10-13

TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements.Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations.Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements.Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning.We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. 0.735TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements.This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation.We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks.TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10.Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.

link
Google Scholar