Ryan's Arxiv FrontPageGenerated on 2025-06-27. This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions. This project was originally created by Vincent Warmerdam, modifying his original frontpage for different paper categories. |
|
Prompt Engineering in Large Language Models |
|
2025-06-18 |
Truncated Proximal Policy Optimization
Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). 0.673As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error.However, PPO can be time-consuming due to its inherent on-policy nature, which is further exacerbated by increasing response lengths.In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy update and length-restricted response generation.T-PPO mitigates the issue of low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, where resources often sit idle during the waiting periods for complete rollouts.Our contributions are two-folds.First, we propose Extended Generalized Advantage Estimation (EGAE) for advantage estimation derived from incomplete responses while maintaining the integrity of policy learning.Second, we devise a computationally optimized mechanism that allows for the independent optimization of the policy and value models.By selectively filtering prompt and truncated tokens, this mechanism reduces redundant computations and accelerates the training process without sacrificing convergence performance.We demonstrate the effectiveness and efficacy of T-PPO on AIME 2024 with a 32B base model.The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors. 0.638 |
2025-06-18 |
LLM Agent for Hyper-Parameter Optimization
Hyper-parameters are essential and critical for the performance of communication algorithms.However, current hyper-parameters tuning methods for warm-start particles swarm optimization with cross and mutation (WS-PSO-CM) algortihm for radio map-enabled unmanned aerial vehicle (UAV) trajectory and communication are primarily heuristic-based, exhibiting low levels of automation and unsatisfactory performance.In this paper, we design an large language model (LLM) agent for automatic hyper-parameters-tuning, where an iterative framework and model context protocol (MCP) are applied.In particular, the LLM agent is first setup via a profile, which specifies the mission, background, and output format.Then, the LLM agent is driven by the prompt requirement, and iteratively invokes WS-PSO-CM algorithm for exploration. 0.622Finally, the LLM agent autonomously terminates the loop and returns a set of hyper-parameters.Our experiment results show that the minimal sum-rate achieved by hyper-parameters generated via our LLM agent is significantly higher than those by both human heuristics and random generation methods.This indicates that an LLM agent with PSO knowledge and WS-PSO-CM algorithm background is useful in finding high-performance hyper-parameters. |
2025-06-18 |
A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals
In 2012, the United Nations introduced 17 Sustainable Development Goals (SDGs) aimed at creating a more sustainable and improved future by 2030.However, tracking progress toward these goals is difficult because of the extensive scale and complexity of the data involved.Text classification models have become vital tools in this area, automating the analysis of vast amounts of text from a variety of sources.Additionally, large language models (LLMs) have recently proven indispensable for many natural language processing tasks, including text classification, thanks to their ability to recognize complex linguistic patterns and semantics.This study analyzes various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs.Then, it also evaluates the effectiveness of task adaptation techniques (i.e., in-context learning approaches), namely Zero-Shot and Few-Shot Learning, as well as Fine-Tuning within this domain.The results reveal that smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI's GPT (Generative Pre-trained Transformer). 0.659 |
2025-06-18 |
ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs
Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities.However, the underlying mechanisms supporting such transfer remain poorly understood.We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes -- fundamental reasoning patterns that capture the essence of problems across domains.These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures.Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning).ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. 0.668Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24).Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models. |
2025-06-18 |
TopClustRAG at SIGIR 2025 LiveRAG Challenge
We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora.Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages.Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. 0.774This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence.Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems. |
2025-06-18 |
RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks.However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set.Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. 0.731By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone.Moreover, the framework is general and can work across reasoning domains, including math, code, and logic.We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations.These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy. |
2025-06-18 |
Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning
Medical report generation from imaging data remains a challenging task in clinical practice.While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration.In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. 0.829Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features.We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. 0.745Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation.Our code will be made publicly available. |
2025-06-18 |
Optimizing Web-Based AI Query Retrieval with GPT Integration in LangChain A CoT-Enhanced Prompt Engineering Approach
Large Language Models have brought a radical change in the process of remote learning students, among other aspects of educative activities.Current retrieval of remote learning resources lacks depth in contextual meaning that provides comprehensive information on complex student queries.This work proposes a novel approach to enhancing remote learning retrieval by integrating GPT-based models within the LangChain framework.We achieve this system in a more intuitive and productive manner using CoT reasoning and prompt engineering. 0.763The framework we propose puts much emphasis on increasing the precision and relevance of the retrieval results to return comprehensive and contextually enriched explanations and resources that best suit each student's needs.We also assess the effectiveness of our approach against paradigmatic LLMs and report improvements in user satisfaction and learning outcomes. |
2025-06-18 |
Lessons from Training Grounded LLMs with Verifiable Rewards
Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs).While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available.In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs.We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations.Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. 0.7A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal.Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks.Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs. |
2025-06-18 |
Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents
Failure Analysis (FA) is a highly intricate and knowledge-intensive process.The integration of AI components within the computational infrastructure of FA labs has the potential to automate a variety of tasks, including the detection of non-conformities in images, the retrieval of analogous cases from diverse data sources, and the generation of reports from annotated images.However, as the number of deployed AI models increases, the challenge lies in orchestrating these components into cohesive and efficient workflows that seamlessly integrate with the FA process. This paper investigates the design and implementation of a Large Language Model (LLM)-based Planning Agent (LPA) to assist FA engineers in solving their analysis cases. 0.655The LPA integrates LLMs with advanced planning capabilities and external tool utilization, enabling autonomous processing of complex queries, retrieval of relevant data from external systems, and generation of human-readable responses.Evaluation results demonstrate the agent's operational effectiveness and reliability in supporting FA tasks. |
2025-06-18 |
The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games
Large Language Models (LLMs) have shown promise as decision-makers in dynamic settings, but their stateless nature necessitates creating a natural language representation of history.We present a unifying framework for systematically constructing natural language "state" representations for prompting LLM agents in repeated multi-agent games. 0.791Previous work on games with LLM agents has taken an ad hoc approach to encoding game history, which not only obscures the impact of state representation on agents' behavior, but also limits comparability between studies.Our framework addresses these gaps by characterizing methods of state representation along three axes: action informativeness (i.e., the extent to which the state representation captures actions played); reward informativeness (i.e., the extent to which the state representation describes rewards obtained); and prompting style (or natural language compression, i.e., the extent to which the full text history is summarized). We apply this framework to a dynamic selfish routing game, chosen because it admits a simple equilibrium both in theory and in human subject experiments \cite{rapoport_choice_2009}.Despite the game's relative simplicity, we find that there are key dependencies of LLM agent behavior on the natural language state representation.In particular, we observe that representations which provide agents with (1) summarized, rather than complete, natural language representations of past history; (2) information about regrets, rather than raw payoffs; and (3) limited information about others' actions lead to behavior that more closely matches game theoretic equilibrium predictions, and with more stable game play by the agents.By contrast, other representations can exhibit either large deviations from equilibrium, higher variation in dynamic game play over time, or both. |
2025-06-18 |
Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts.However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. 0.807To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs.This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities.We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered.Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities. |
2025-06-18 |
CC-LEARN: Cohort-based Consistency Learning
Large language models excel at many tasks but still struggle with consistent, robust reasoning.We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions.To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning.Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members.Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. 0.623These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs. |
2025-06-17 |
Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team
Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation - treating each problem as an independent attempt, without accumulating or integrating experiential knowledge.In contrast, expert problem solvers - such as Olympiad or programming contest teams - leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. 0.665We introduce Xolver, a training-free multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience.Xolver integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement.By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch - marking a transition from isolated inference toward experience-aware language agents. 0.608Built on both open-weight and proprietary models, Xolver consistently outperforms specialized reasoning agents.Even with lightweight backbones (e.g., QWQ-32B), it often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high.With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME'24 (94.4%), AIME'25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) - highlighting holistic experience learning as a key step toward generalist agents capable of expert-level reasoning.Code and data are available at https://kagnlp.github.io/xolver.github.io/. |
2025-06-17 |
Don't throw the baby out with the bathwater: How and why deep learning for ARC
The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems.Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language etc.The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains.Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time.We demonstrate that fully committing to deep learning's capacity to acquire novel abstractions yields state-of-the-art performance on ARC.Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks.Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning.We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques.We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT.An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%).Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning. 0.654 |
2025-06-17 |
From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents
While research on dialogue response generation has primarily focused on generating coherent responses conditioning on textual context, the critical question of when to respond grounded on the temporal context remains underexplored.To bridge this gap, we propose a novel task called timely dialogue response generation and introduce the TimelyChat benchmark, which evaluates the capabilities of language models to predict appropriate time intervals and generate time-conditioned responses.Additionally, we construct a large-scale training dataset by leveraging unlabeled event knowledge from a temporal commonsense knowledge graph and employing a large language model (LLM) to synthesize 55K event-driven dialogues.We then train Timer, a dialogue agent designed to proactively predict time intervals and generate timely responses that align with those intervals.Experimental results show that Timer outperforms prompting-based LLMs and other fine-tuned baselines in both turn-level and dialogue-level evaluations. 0.716We publicly release our data, model, and code. |
2025-06-17 |
Quality Assessment of Python Tests Generated by Large Language Models
The manual generation of test scripts is a time-intensive, costly, and error-prone process, indicating the value of automated solutions.Large Language Models (LLMs) have shown great promise in this domain, leveraging their extensive knowledge to produce test code more efficiently.This study investigates the quality of Python test code generated by three LLMs: GPT-4o, Amazon Q, and LLama 3.3.We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C). 0.701Our analysis includes the identification of errors and test smells, with a focus on correlating these issues to inadequate design patterns.Our findings reveal that most test suites generated by the LLMs contained at least one error or test smell.Assertion errors were the most common, comprising 64% of all identified errors, while the test smell Lack of Cohesion of Test Cases was the most frequently detected (41%).Prompt context significantly influenced test quality; textual prompts with detailed instructions often yielded tests with fewer errors but a higher incidence of test smells. 0.812Among the evaluated LLMs, GPT-4o produced the fewest errors in both contexts (10% in C2C and 6% in T2C), whereas Amazon Q had the highest error rates (19% in C2C and 28% in T2C).For test smells, Amazon Q had fewer detections in the C2C context (9%), while LLama 3.3 performed best in the T2C context (10%).Additionally, we observed a strong relationship between specific errors, such as assertion or indentation issues, and test case cohesion smells.These findings demonstrate opportunities for improving the quality of test generation by LLMs and highlight the need for future research to explore optimized generation scenarios and better prompt engineering strategies. 0.76 |
2025-06-17 |
Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent
Recent advancements in Large Language Models (LLMs) have significantly propelled the development of Conversational Recommendation Agents (CRAs).However, these agents often generate short-sighted responses that fail to sustain user guidance and meet expectations. 0.607Although preference optimization has proven effective in aligning LLMs with user expectations, it remains costly and performs poorly in multi-turn dialogue.To address this challenge, we introduce a novel multi-turn preference optimization (MTPO) paradigm ECPO, which leverages Expectation Confirmation Theory to explicitly model the evolution of user satisfaction throughout multi-turn dialogues, uncovering the underlying causes of dissatisfaction.These causes can be utilized to support targeted optimization of unsatisfactory responses, thereby achieving turn-level preference optimization.ECPO ingeniously eliminates the significant sampling overhead of existing MTPO methods while ensuring the optimization process drives meaningful improvements.To support ECPO, we introduce an LLM-based user simulator, AILO, to simulate user feedback and perform expectation confirmation during conversational recommendations.Experimental results show that ECPO significantly enhances CRA's interaction capabilities, delivering notable improvements in both efficiency and effectiveness over existing MTPO methods. |
2025-06-17 |
ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. 0.687Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions.This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation. 0.678We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones.Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts. 0.627 |
2025-06-17 |
SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
Large Language Models (LLMs) are experiencing rapid advancements in complex reasoning, exhibiting remarkable generalization in mathematics and programming. 0.608In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic evaluation of their complex reasoning ability within spatial contexts remains underexplored.To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' spatial intelligence through video-based reasoning tasks.SIRI-Bench comprises nearly 1K video-question-answer triplets, where each problem is embedded in a realistic 3D scene and captured by video.By carefully designing questions and corresponding 3D scenes, our benchmark ensures that solving the questions requires both spatial comprehension for extracting information and high-level reasoning for deriving solutions, making it a challenging benchmark for evaluating VLMs.To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine.This engine, leveraging multiple specialized LLM agents, can generate realistic 3D scenes from abstract math problems, ensuring faithfulness to the original descriptions.Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of spatial reasoning.We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving. |
2025-06-17 |
RMIT-ADM+S at the SIGIR 2025 LiveRAG Challenge
This paper presents the RMIT--ADM+S participation in the SIGIR 2025 LiveRAG Challenge.Our Generation-Retrieval-Augmented Generation (GRAG) approach relies on generating a hypothetical answer that is used in the retrieval phase, alongside the original question.GRAG also incorporates a pointwise large language model (LLM)-based re-ranking step prior to final answer generation.We describe the system architecture and the rationale behind our design choices.In particular, a systematic evaluation using the Grid of Points (GoP) framework and N-way ANOVA enabled comparison across multiple configurations, including query variant generation, question decomposition, rank fusion strategies, and prompting techniques for answer generation. 0.643Our system achieved a Relevance score of 1.199 and a Faithfulness score of 0.477 on the private leaderboard, placing among the top four finalists in the LiveRAG 2025 Challenge. |
2025-06-17 |
Doppelgänger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack
Since the advent of large language models, prompt engineering now enables the rapid, low-effort creation of diverse autonomous agents that are already in widespread use. 0.864Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed to user's attempts. 0.812In this paper, we propose the ''Doppelg\"anger method'' to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information.Next, we define the ''Prompt Alignment Collapse under Adversarial Transfer (PACAT)''level to evaluate the vulnerability to this adversarial transfer attack.We also propose a ''Caution for Adversarial Transfer (CAT)''prompt to counter the Doppelg\"anger method. 0.616The experimental results demonstrate that the Doppelg\"anger method can compromise the agent's consistency and expose its internal information.In contrast, CAT prompts enable effective defense against this adversarial attack. 0.78 |
2025-06-17 |
AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses.Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models.As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods.In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example.We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. 0.801Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance.Performance differences between prompting approaches are conditional on the LLM used. 0.744Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning.We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data.Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs.In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research. |
2025-06-17 |
Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot
In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks.However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks.Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations.We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as \texttt{Qwen2.5-Max} and \texttt{DeepSeek-R1}.Experimental results indicate that these enhanced exemplars still fail to improve the model's reasoning performance.Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. 0.639Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars. |
2025-06-17 |
DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization
Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness.Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems.This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and R'enyi divergences for enhanced stability and robustness.We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected.DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability.Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney. 0.639Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities.Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR).DETONATE and complete code are publicly released. |
2025-06-17 |
FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
The rapid advancement of large language models (LLMs) introduces dual-use capabilities that could both threaten and bolster national security and public safety (NSPS).Models implement safeguards to protect against potential misuse relevant to NSPS and allow for benign users to receive helpful information.However, current benchmarks often fail to test safeguard robustness to potential NSPS risks in an objective, robust way.We introduce FORTRESS: 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence & Terrorism, and Criminal & Financial Illicit Activities, with 10 total subcategories across these domains.Each prompt-rubric pair has a corresponding benign version to test for model over-refusals. 0.735This evaluation of frontier LLMs' safeguard robustness reveals varying trade-offs between potential risks and model usefulness: Claude-3.5-Sonnet demonstrates a low average risk score (ARS) (14.09 out of 100) but the highest over-refusal score (ORS) (21.8 out of 100), while Gemini 2.5 Pro shows low over-refusal (1.4) but a high average potential risk (66.29).Deepseek-R1 has the highest ARS at 78.05, but the lowest ORS at only 0.06.Models such as o1 display a more even trade-off between potential risks and over-refusals (with an ARS of 21.69 and ORS of 5.2).To provide policymakers and researchers with a clear understanding of models' potential risks, we publicly release FORTRESS at https://huggingface.co/datasets/ScaleAI/fortress_public.We also maintain a private set for evaluation. |
2025-06-17 |
MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance
Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language mod-els (LLMs).New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace.In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting.Moreover, the multi-document setting is historically challenging for benchmark creation due to the expensive cost of annotating long inputs.In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning.Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the corresponding question-answer (QA) examples.Our novel technique operates on condensed structured seed knowledge, modifying it through LLM-assisted edits to induce MD-specific reasoning challenges. 0.683We then convert this structured knowledge into a natural text surface form, generating a document set and corresponding QA example.We analyze the behavior of popular LLMs and prompting techniques, finding that MDBENCH poses significant challenges for all methods, even with relatively short document sets. 0.798We also see our knowledge-guided generation technique (1) allows us to readily perform targeted analysis of MD-specific reasoning capabilities and (2) can be adapted quickly to account for new challenges and future modeling improvements. |
2025-06-17 |
FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning
Black-Box Discrete Prompt Learning is a prompt-tuning method that optimizes discrete prompts without accessing model parameters or gradients, making the prompt tuning on a cloud-based Large Language Model (LLM) feasible. 0.758Adapting federated learning to BDPL could further enhance prompt tuning performance by leveraging data from diverse sources. 0.685However, all previous research on federated black-box prompt tuning had neglected the substantial query cost associated with the cloud-based LLM service.To address this gap, we conducted a theoretical analysis of query efficiency within the context of federated black-box prompt tuning.Our findings revealed that degrading FedAvg to activate only one client per round, a strategy we called \textit{FedOne}, enabled optimal query efficiency in federated black-box prompt learning. 0.744Building on this insight, we proposed the FedOne framework, a federated black-box discrete prompt learning method designed to maximize query efficiency when interacting with cloud-based LLMs. 0.737We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results. |
2025-06-17 |
CALM: Contextual Analog Logic with Multimodality
In this work, we introduce Contextual Analog Logic with Multimodality (CALM).CALM unites symbolic reasoning with neural generation, enabling systems to make context-sensitive decisions grounded in real-world multi-modal data. Background: Classic bivalent logic systems cannot capture the nuance of human decision-making.They also require human grounding in multi-modal environments, which can be ad-hoc, rigid, and brittle.Neural networks are good at extracting rich contextual information from multi-modal data, but lack interpretable structures for reasoning. Objectives: CALM aims to bridge the gap between logic and neural perception, creating an analog logic that can reason over multi-modal inputs.Without this integration, AI systems remain either brittle or unstructured, unable to generalize robustly to real-world tasks.In CALM, symbolic predicates evaluate to analog truth values computed by neural networks and constrained search. Methods: CALM represents each predicate using a domain tree, which iteratively refines its analog truth value when the contextual groundings of its entities are determined.The iterative refinement is predicted by neural networks capable of capturing multi-modal information and is filtered through a symbolic reasoning module to ensure constraint satisfaction. Results:In fill-in-the-blank object placement tasks, CALM achieved 92.2% accuracy, outperforming classical logic (86.3%) and LLM (59.4%) baselines.It also demonstrated spatial heatmap generation aligned with logical constraints and delicate human preferences, as shown by a human study. Conclusions: CALM demonstrates the potential to reason with logic structure while aligning with preferences in multi-modal environments. 0.651It lays the foundation for next-gen AI systems that require the precision and interpretation of logic and the multimodal information processing of neural networks. |
2025-06-17 |
Structured Moral Reasoning in Language Models: A Value-Grounded Evaluation Framework
Large language models (LLMs) are increasingly deployed in domains requiring moral understanding, yet their reasoning often remains shallow, and misaligned with human reasoning.Unlike humans, whose moral reasoning integrates contextual trade-offs, value systems, and ethical theories, LLMs often rely on surface patterns, leading to biased decisions in morally and ethically complex scenarios.To address this gap, we present a value-grounded framework for evaluating and distilling structured moral reasoning in LLMs.We benchmark 12 open-source models across four moral datasets using a taxonomy of prompts grounded in value systems, ethical theories, and cognitive reasoning strategies.Our evaluation is guided by four questions: (1) Does reasoning improve LLM decision-making over direct prompting? 0.738(2) Which types of value/ethical frameworks most effectively guide LLM reasoning?(3) Which cognitive reasoning strategies lead to better moral performance?(4) Can small-sized LLMs acquire moral competence through distillation?We find that prompting with explicit moral structure consistently improves accuracy and coherence, with first-principles reasoning and Schwartz's + care-ethics scaffolds yielding the strongest gains. 0.779Furthermore, our supervised distillation approach transfers moral competence from large to small models without additional inference cost.Together, our results offer a scalable path toward interpretable and value-grounded models. |
2025-06-17 |
From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction?
While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) for structured numerical data is still not well explored.In this study, we test the effectiveness of LLMs in predicting diabetes using zero-shot, one-shot, and three-shot prompting methods. 0.649We conduct an empirical analysis using the Pima Indian Diabetes Database (PIDD).We evaluate six LLMs, including four open-source models: Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B.We also test two proprietary models: GPT-4o and Gemini Flash 2.0.In addition, we compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM).We use accuracy, precision, recall, and F1-score as evaluation metrics.Our results show that proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings.Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score.However, there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning. 0.627This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions. |
2025-06-17 |
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning.A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains.We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. 0.619Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains.For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition.Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains.We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data.We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360 |
2025-06-17 |
Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition
Recent studies have demonstrated that prompting large language models (LLM) with audio encodings enables effective speech recognition capabilities. 0.767However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains a relatively uninvestigated area of research.In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition, source localization, and bystander cross-talk suppression.To enhance the model's ability to understand directivity, we propose two key techniques: serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA).Experimental results show that our proposed directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio, yielding strong performance in both speech recognition and source localization tasks. |
Robustness Tools in LLM Safety |
|
2025-06-18 |
HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models
Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents.However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. 0.899In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene-task inconsistencies. 0.775Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. 0.884To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40x higher than base prompts. 0.745Evaluating 12 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies-highlighting fundamental limitations in handling infeasible tasks.We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies. |
2025-06-18 |
ChatModel: Automating Reference Model Design and Verification with LLMs
As the complexity of integrated circuit designs continues to escalate, the functional verification becomes increasingly challenging.Reference models, critical for accelerating the verification process, are themselves becoming more intricate and time-consuming to develop.Despite the promise shown by large language models (LLMs) in code programming, effectively generating complex reference models remains a significant hurdle.To address these challenges, we introduce ChatModel, the first LLM-aided agile reference model generation and verification platform.ChatModel streamlines the transition from design specifications to fully functional reference models by integrating design standardization and hierarchical agile modeling.Employing a building-block generation strategy, it not only enhances the design capabilities of LLMs for reference models but also significantly boosts verification efficiency. 0.659We evaluated ChatModel on 300 designs of varying complexity, demonstrating substantial improvements in both efficiency and quality of reference model generation.ChatModel achieved a peak performance improvement of 55.02% compared to alternative methods, with notable enhancements in generation stability, and delivered a 9.18x increase in its capacity to produce reference model designs.Furthermore, it accelerated the iterative process of reference model design and validation by an average of 5.90x compared to traditional approaches.These results highlight the potential of ChatModel to significantly advance the automation of reference model generation and validation. |
2025-06-18 |
Enhancement Report Approval Prediction: A Comparative Study of Large Language Models
Enhancement reports (ERs) serve as a critical communication channel between users and developers, capturing valuable suggestions for software improvement.However, manually processing these reports is resource-intensive, leading to delays and potential loss of valuable insights.To address this challenge, enhancement report approval prediction (ERAP) has emerged as a research focus, leveraging machine learning techniques to automate decision-making.While traditional approaches have employed feature-based classifiers and deep learning models, recent advancements in large language models (LLM) present new opportunities for enhancing prediction accuracy.This study systematically evaluates 18 LLM variants (including BERT, RoBERTa, DeBERTa-v3, ELECTRA, and XLNet for encoder models; GPT-3.5-turbo, GPT-4o-mini, Llama 3.1 8B, Llama 3.1 8B Instruct and DeepSeek-V3 for decoder models) against traditional methods (CNN/LSTM-BERT/GloVe).Our experiments reveal two key insights: (1) Incorporating creator profiles increases unfine-tuned decoder-only models' overall accuracy by 10.8 percent though it may introduce bias; (2) LoRA fine-tuned Llama 3.1 8B Instruct further improve performance, reaching 79 percent accuracy and significantly enhancing recall for approved reports (76.1 percent vs. LSTM-GLOVE's 64.1 percent), outperforming traditional methods by 5 percent under strict chronological evaluation and effectively addressing class imbalance issues. 0.614These findings establish LLM as a superior solution for ERAP, demonstrating their potential to streamline software maintenance workflows and improve decision-making in real-world development environments.We also investigated and summarized the ER cases where the large models underperformed, providing valuable directions for future research. |
2025-06-18 |
eLLM: Elastic Memory Management Framework for Efficient LLM Serving
Large Language Models are increasingly being deployed in datacenters.Serving these models requires careful memory management, as their memory usage includes static weights, dynamic activations, and key-value caches.While static weights are constant and predictable, dynamic components such as activations and KV caches change frequently during runtime, presenting significant challenges for efficient memory management.Modern LLM serving systems typically handle runtime memory and KV caches at distinct abstraction levels: runtime memory management relies on static tensor abstractions, whereas KV caches utilize a page table-based virtualization layer built on top of the tensor abstraction.This virtualization dynamically manages KV caches to mitigate memory fragmentation. 0.607However, this dual-level approach fundamentally isolates runtime memory and KV cache management, resulting in suboptimal memory utilization under dynamic workloads, which can lead to a nearly 20% drop in throughput. To address these limitations, we propose eLLM, an elastic memory management framework inspired by the classical memory ballooning mechanism in operating systems.The core components of eLLM include: (1) Virtual Tensor Abstraction, which decouples the virtual address space of tensors from the physical GPU memory, creating a unified and flexible memory pool; (2) an Elastic Memory Mechanism that dynamically adjusts memory allocation through runtime memory inflation and deflation, leveraging CPU memory as an extensible buffer; and (3) a Lightweight Scheduling Strategy employing SLO-aware policies to optimize memory utilization and effectively balance performance trade-offs under stringent SLO constraints.Comprehensive evaluations demonstrate that eLLM significantly outperforms state-of-the-art systems, 2.32x higher decoding throughput, and supporting 3x larger batch sizes for 128K-token inputs. |
2025-06-18 |
Robust Instant Policy: Leveraging Student's t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation
Imitation learning (IL) aims to enable robots to perform tasks autonomously by observing a few human demonstrations.Recently, a variant of IL, called In-Context IL, utilized off-the-shelf large language models (LLMs) as instant policies that understand the context from a few given demonstrations to perform a new task, rather than explicitly updating network models with large-scale demonstrations.However, its reliability in the robotics domain is undermined by hallucination issues such as LLM-based instant policy, which occasionally generates poor trajectories that deviate from the given demonstrations. 0.746To alleviate this problem, we propose a new robust in-context imitation learning algorithm called the robust instant policy (RIP), which utilizes a Student's t-regression model to be robust against the hallucinated trajectories of instant policies to allow reliable trajectory generation.Specifically, RIP generates several candidate robot trajectories to complete a given task from an LLM and aggregates them using the Student's t-distribution, which is beneficial for ignoring outliers (i.e., hallucinations); thereby, a robust trajectory against hallucinations is generated. 0.64Our experiments, conducted in both simulated and real-world environments, show that RIP significantly outperforms state-of-the-art IL methods, with at least $26\%$ improvement in task success rates, particularly in low-data scenarios for everyday tasks.Video results available at https://sites.google.com/view/robustinstantpolicy. |
2025-06-18 |
LLM vs. SAST: A Technical Analysis on Detecting Coding Bugs of GPT4-Advanced Data Analysis
With the rapid advancements in Natural Language Processing (NLP), large language models (LLMs) like GPT-4 have gained significant traction in diverse applications, including security vulnerability scanning.This paper investigates the efficacy of GPT-4 in identifying software vulnerabilities compared to traditional Static Application Security Testing (SAST) tools. 0.69Drawing from an array of security mistakes, our analysis underscores the potent capabilities of GPT-4 in LLM-enhanced vulnerability scanning. 0.625We unveiled that GPT-4 (Advanced Data Analysis) outperforms SAST by an accuracy of 94% in detecting 32 types of exploitable vulnerabilities. 0.607This study also addresses the potential security concerns surrounding LLMs, emphasising the imperative of security by design/default and other security best practices for AI. 0.626 |
2025-06-18 |
RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments
The rapid deployment of Large language model (LLM) agents in critical domains like healthcare and finance necessitates robust security frameworks.To address the absence of standardized evaluation benchmarks for these agents in dynamic environments, we introduce RAS-Eval, a comprehensive security benchmark supporting both simulated and real-world tool execution.RAS-Eval comprises 80 test cases and 3,802 attack tasks mapped to 11 Common Weakness Enumeration (CWE) categories, with tools implemented in JSON, LangGraph, and Model Context Protocol (MCP) formats.We evaluate 6 state-of-the-art LLMs across diverse scenarios, revealing significant vulnerabilities: attacks reduced agent task completion rates (TCR) by 36.78% on average and achieved an 85.65% success rate in academic settings.Notably, scaling laws held for security capabilities, with larger models outperforming smaller counterparts.Our findings expose critical risks in real-world agent deployments and provide a foundational framework for future security research. 0.619Code and data are available at https://github.com/lanzer-tree/RAS-Eval. |
2025-06-18 |
Context-Informed Grounding Supervision
Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination.In such cases, we expect the model to generate responses by grounding its response in the provided external context.However, prior work has shown that simply appending context at inference time does not ensure grounded generation.To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context.Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models.In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques.In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. 0.794This improved grounding comes without degradation in general downstream performance.Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context. |
2025-06-18 |
RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information.However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs.We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining. 0.607RePCS compares two inference paths: (i) a parametric path using only the query, and (ii) a retrieval-augmented path using both the query and retrieved context by computing the Kullback-Leibler (KL) divergence between their output distributions.A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization.This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional forward pass.We further derive PAC-style guarantees that link the KL threshold to user-defined false positive and false negative rates.On the Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918.This result outperforms the strongest prior method by 6.5 percentage points while keeping latency overhead below 4.7% on an NVIDIA T4 GPU.RePCS offers a lightweight, black-box safeguard to verify whether a RAG system meaningfully leverages retrieval, making it especially valuable in safety-critical applications. |
2025-06-18 |
deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses
Although Rust ensures memory safety by default, it also permits the use of unsafe code, which can introduce memory safety vulnerabilities if misused. 0.731Unfortunately, existing tools for detecting memory bugs in Rust typically exhibit limited detection capabilities, inadequately handle Rust-specific types, or rely heavily on manual intervention. 0.722To address these limitations, we present deepSURF, a tool that integrates static analysis with Large Language Model (LLM)-guided fuzzing harness generation to effectively identify memory safety vulnerabilities in Rust libraries, specifically targeting unsafe code. 0.635deepSURF introduces a novel approach for handling generics by substituting them with custom types and generating tailored implementations for the required traits, enabling the fuzzer to simulate user-defined behaviors within the fuzzed library.Additionally, deepSURF employs LLMs to augment fuzzing harnesses dynamically, facilitating exploration of complex API interactions and significantly increasing the likelihood of exposing memory safety vulnerabilities. 0.628We evaluated deepSURF on 27 real-world Rust crates, successfully rediscovering 20 known memory safety bugs and uncovering 6 previously unknown vulnerabilities, demonstrating clear improvements over state-of-the-art tools. 0.701 |
2025-06-18 |
PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection
Phishing websites continue to pose a significant cybersecurity threat, often leveraging deceptive structures, brand impersonation, and social engineering tactics to evade detection. 0.694While recent advances in large language models (LLMs) have enabled improved phishing detection through contextual understanding, most existing approaches rely on single-agent classification facing the risks of hallucination and lack interpretability or robustness.To address these limitations, we propose PhishDebate, a modular multi-agent LLM-based debate framework for phishing website detection.PhishDebate employs four specialized agents to independently analyze different textual aspects of a webpage--URL structure, HTML composition, semantic content, and brand impersonation--under the coordination of a Moderator and a final Judge.Through structured debate and divergent thinking, the framework delivers more accurate and interpretable decisions.Extensive evaluations on commercial LLMs demonstrate that PhishDebate achieves 98.2% recall and 98.2% True Positive Rate (TPR) on a real-world phishing dataset, and outperforms single-agent and Chain of Thought (CoT) baselines. 0.684Additionally, its modular design allows agent-level configurability, enabling adaptation to varying resource and application requirements. |
2025-06-18 |
PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning
With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance.Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored.Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. 0.601To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs.Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics.Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%. |
2025-06-17 |
Abstract Meaning Representation for Hospital Discharge Summarization
The Achilles heel of Large Language Models (LLMs) is hallucination, which has drastic consequences for the clinical domain. 0.67This is particularly important with regards to automatically generating discharge summaries (a lengthy medical document that summarizes a hospital in-patient visit).Automatically generating these summaries would free physicians to care for patients and reduce documentation burden.The goal of this work is to discover new methods that combine language-based graphs and deep learning models to address provenance of content and trustworthiness in automatic summarization.Our method shows impressive reliability results on the publicly available Medical Information Mart for Intensive III (MIMIC-III) corpus and clinical notes written by physicians at Anonymous Hospital.rovide our method, generated discharge ary output examples, source code and trained models. |
2025-06-17 |
RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
Latent-space monitors aim to detect undesirable behaviours in large language models by leveraging internal model representations rather than relying solely on black-box outputs.These methods have shown promise in identifying behaviours such as deception and unsafe completions, but a critical open question remains: can LLMs learn to evade such monitors? 0.7To study this, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to bypass latent-space monitors while maintaining coherent generations. 0.638We apply RL-Obfuscation to LLMs ranging from 7B to 14B parameters and evaluate evasion success against a suite of monitors. 0.756We find that token-level latent-space monitors are highly vulnerable to this attack.More holistic monitors, such as max-pooling or attention-based probes, remain robust.Moreover, we show that adversarial policies trained to evade a single static monitor generalise to unseen monitors of the same type.Finally, we study how the policy learned by RL bypasses these monitors and find that the model can also learn to repurpose tokens to mean something different internally. |
2025-06-17 |
Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment
360 video captures the complete surrounding scenes with the ultra-large field of view of 360X180.This makes 360 scene understanding tasks, eg, segmentation and tracking, crucial for appications, such as autonomous driving, robotics.With the recent emergence of foundation models, the community is, however, impeded by the lack of large-scale, labelled real-world datasets.This is caused by the inherent spherical properties, eg, severe distortion in polar regions, and content discontinuities, rendering the annotation costly yet complex.This paper introduces Leader360V, the first large-scale, labeled real-world 360 video datasets for instance segmentation and tracking.Our datasets enjoy high scene diversity, ranging from indoor and urban settings to natural and dynamic outdoor scenes.To automate annotation, we design an automatic labeling pipeline, which subtly coordinates pre-trained 2D segmentors and large language models to facilitate the labeling.The pipeline operates in three novel stages.Specifically, in the Initial Annotation Phase, we introduce a Semantic- and Distortion-aware Refinement module, which combines object mask proposals from multiple 2D segmentors with LLM-verified semantic labels.These are then converted into mask prompts to guide SAM2 in generating distortion-aware masks for subsequent frames. 0.617In the Auto-Refine Annotation Phase, missing or incomplete regions are corrected either by applying the SDR again or resolving the discontinuities near the horizontal borders.The Manual Revision Phase finally incorporates LLMs and human annotators to further refine and validate the annotations.Extensive user studies and evaluations demonstrate the effectiveness of our labeling pipeline.Meanwhile, experiments confirm that Leader360V significantly enhances model performance for 360 video segmentation and tracking, paving the way for more scalable 360 scene understanding. |
2025-06-17 |
Quality Assessment of Python Tests Generated by Large Language Models
The manual generation of test scripts is a time-intensive, costly, and error-prone process, indicating the value of automated solutions.Large Language Models (LLMs) have shown great promise in this domain, leveraging their extensive knowledge to produce test code more efficiently.This study investigates the quality of Python test code generated by three LLMs: GPT-4o, Amazon Q, and LLama 3.3.We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C).Our analysis includes the identification of errors and test smells, with a focus on correlating these issues to inadequate design patterns.Our findings reveal that most test suites generated by the LLMs contained at least one error or test smell. 0.717Assertion errors were the most common, comprising 64% of all identified errors, while the test smell Lack of Cohesion of Test Cases was the most frequently detected (41%).Prompt context significantly influenced test quality; textual prompts with detailed instructions often yielded tests with fewer errors but a higher incidence of test smells.Among the evaluated LLMs, GPT-4o produced the fewest errors in both contexts (10% in C2C and 6% in T2C), whereas Amazon Q had the highest error rates (19% in C2C and 28% in T2C).For test smells, Amazon Q had fewer detections in the C2C context (9%), while LLama 3.3 performed best in the T2C context (10%).Additionally, we observed a strong relationship between specific errors, such as assertion or indentation issues, and test case cohesion smells.These findings demonstrate opportunities for improving the quality of test generation by LLMs and highlight the need for future research to explore optimized generation scenarios and better prompt engineering strategies. 0.646 |
2025-06-17 |
LLM-Powered Intent-Based Categorization of Phishing Emails
Phishing attacks remain a significant threat to modern cybersecurity, as they successfully deceive both humans and the defense mechanisms intended to protect them. 0.692Traditional detection systems primarily focus on email metadata that users cannot see in their inboxes.Additionally, these systems struggle with phishing emails, which experienced users can often identify empirically by the text alone.This paper investigates the practical potential of Large Language Models (LLMs) to detect these emails by focusing on their intent.In addition to the binary classification of phishing emails, the paper introduces an intent-type taxonomy, which is operationalized by the LLMs to classify emails into distinct categories and, therefore, generate actionable threat information.To facilitate our work, we have curated publicly available datasets into a custom dataset containing a mix of legitimate and phishing emails.Our results demonstrate that existing LLMs are capable of detecting and categorizing phishing emails, underscoring their potential in this domain. |
2025-06-17 |
Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning
Existing work on mitigating catastrophic forgetting in large language model (LLM) fine-tuning has primarily focused on preserving specific data or tasks, while critically overlooking the degradation of essential capabilities instilled through safety alignment, particularly the model's ability to faithfully express ignorance.In this work, we show that this capability is significantly degraded during conventional fine-tuning, leading to undesired behaviors such as hallucinations. 0.862To address this novel but highly practical problem, we propose SEAT, a simple and effective fine-tuning approach that preserves both fine-tuning performance and the model's inherent ability to acknowledge its ignorance.SEAT integrates two key components: (1) sparse training that constrains activation drift, and (2) a novel entity perturbation method with KL-divergence regularization, designed to counter knowledge entanglement.Experimental results demonstrate that SEAT significantly outperforms baselines in preserving ignorance awareness while retaining fine-tuning performance, offering a more robust solution for LLM fine-tuning. |
2025-06-17 |
MalGuard: Towards Real-Time, Accurate, and Actionable Detection of Malicious Packages in PyPI Ecosystem
Malicious package detection has become a critical task in ensuring the security and stability of the PyPI.Existing detection approaches have focused on advancing model selection, evolving from traditional machine learning (ML) models to large language models (LLMs).However, as the complexity of the model increases, the time consumption also increases, which raises the question of whether a lightweight model achieves effective detection.Through empirical research, we demonstrate that collecting a sufficiently comprehensive feature set enables even traditional ML models to achieve outstanding performance.However, with the continuous emergence of new malicious packages, considerable human and material resources are required for feature analysis.Also, traditional ML model-based approaches lack of explainability to malicious packages.Therefore, we propose a novel approach MalGuard based on graph centrality analysis and the LIME (Local Interpretable Model-agnostic Explanations) algorithm to detect malicious packages.To overcome the above two challenges, we leverage graph centrality analysis to extract sensitive APIs automatically to replace manual analysis.To understand the sensitive APIs, we further refine the feature set using LLM and integrate the LIME algorithm with ML models to provide explanations for malicious packages.We evaluated MalGuard against six SOTA baselines with the same settings.Experimental results show that our proposed MalGuard, improves precision by 0.5%-33.2% and recall by 1.8%-22.1%. 0.741With MalGuard, we successfully identified 113 previously unknown malicious packages from a pool of 64,348 newly-uploaded packages over a five-week period, and 109 out of them have been removed by the PyPI official. 0.689 |
2025-06-17 |
Doppelgänger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack
Since the advent of large language models, prompt engineering now enables the rapid, low-effort creation of diverse autonomous agents that are already in widespread use.Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed to user's attempts.In this paper, we propose the ''Doppelg\"anger method'' to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information. 0.663Next, we define the ''Prompt Alignment Collapse under Adversarial Transfer (PACAT)''level to evaluate the vulnerability to this adversarial transfer attack.We also propose a ''Caution for Adversarial Transfer (CAT)''prompt to counter the Doppelg\"anger method.The experimental results demonstrate that the Doppelg\"anger method can compromise the agent's consistency and expose its internal information.In contrast, CAT prompts enable effective defense against this adversarial attack. |
2025-06-17 |
Issue Retrieval and Verification Enhanced Supplementary Code Comment Generation
Issue reports have been recognized to contain rich information for retrieval-augmented code comment generation.However, how to minimize hallucinations in the generated comments remains significant challenges. 0.756In this paper, we propose IsComment, an issue-based LLM retrieval and verification approach for generating method's design rationale, usage directives, and so on as supplementary code comments.We first identify five main types of code supplementary information that issue reports can provide through code-comment-issue analysis.Next, we retrieve issue sentences containing these types of supplementary information and generate candidate code comments.To reduce hallucinations, we filter out those candidate comments that are irrelevant to the code or unverifiable by the issue report, making the code comment generation results more reliable. 0.769Our experiments indicate that compared with LLMs, IsComment increases the coverage of manual supplementary comments from 33.6% to 72.2% for ChatGPT, from 35.8% to 88.4% for GPT-4o, and from 35.0% to 86.2% for DeepSeek-V3.Compared with existing work, IsComment can generate richer and more useful supplementary code comments for programming understanding, which is quantitatively evaluated through the MESIA metric on both methods with and without manual code comments. |
2025-06-17 |
Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities.Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models.To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training.First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization(C3PO), a novel approach that enhances training stability and improves computational throughput via algorithm-system co-design methodology.Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. 0.632Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed dataset.We will release the model, dataset, and code. |
2025-06-17 |
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees.While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. 0.608To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents.OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. 0.669To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.).Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score).We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety.In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. 0.632The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm. |
2025-06-17 |
Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning
The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins.Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on memorization of training data, which LM providers try to limit.In this work, we demonstrate that indirect data poisoning (where the targeted behavior is absent from training data) is not only feasible but also allow to effectively protect a dataset and trace its use.Using gradient-based optimization prompt-tuning, we make a model learn arbitrary secret sequences: secret responses to secret prompts that are absent from the training corpus.We validate our approach on language models pre-trained from scratch and show that less than 0.005% of poisoned tokens are sufficient to covertly make a LM learn a secret and detect it with extremely high confidence ($p < 10^{-55}$) with a theoretically certifiable scheme.Crucially, this occurs without performance degradation (on LM benchmarks) and despite secrets never appearing in the training set. 0.712 |
2025-06-17 |
FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
The rapid advancement of large language models (LLMs) introduces dual-use capabilities that could both threaten and bolster national security and public safety (NSPS).Models implement safeguards to protect against potential misuse relevant to NSPS and allow for benign users to receive helpful information.However, current benchmarks often fail to test safeguard robustness to potential NSPS risks in an objective, robust way.We introduce FORTRESS: 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence & Terrorism, and Criminal & Financial Illicit Activities, with 10 total subcategories across these domains.Each prompt-rubric pair has a corresponding benign version to test for model over-refusals.This evaluation of frontier LLMs' safeguard robustness reveals varying trade-offs between potential risks and model usefulness: Claude-3.5-Sonnet demonstrates a low average risk score (ARS) (14.09 out of 100) but the highest over-refusal score (ORS) (21.8 out of 100), while Gemini 2.5 Pro shows low over-refusal (1.4) but a high average potential risk (66.29). 0.622Deepseek-R1 has the highest ARS at 78.05, but the lowest ORS at only 0.06.Models such as o1 display a more even trade-off between potential risks and over-refusals (with an ARS of 21.69 and ORS of 5.2).To provide policymakers and researchers with a clear understanding of models' potential risks, we publicly release FORTRESS at https://huggingface.co/datasets/ScaleAI/fortress_public.We also maintain a private set for evaluation. |
2025-06-16 |
From Promise to Peril: Rethinking Cybersecurity Red and Blue Teaming in the Age of LLMs
Large Language Models (LLMs) are set to reshape cybersecurity by augmenting red and blue team operations.Red teams can exploit LLMs to plan attacks, craft phishing content, simulate adversaries, and generate exploit code. 0.64Conversely, blue teams may deploy them for threat intelligence synthesis, root cause analysis, and streamlined documentation.This dual capability introduces both transformative potential and serious risks. This position paper maps LLM applications across cybersecurity frameworks such as MITRE ATT&CK and the NIST Cybersecurity Framework (CSF), offering a structured view of their current utility and limitations.While LLMs demonstrate fluency and versatility across various tasks, they remain fragile in high-stakes, context-heavy environments.Key limitations include hallucinations, limited context retention, poor reasoning, and sensitivity to prompts, which undermine their reliability in operational settings. 0.677Moreover, real-world integration raises concerns around dual-use risks, adversarial misuse, and diminished human oversight.Malicious actors could exploit LLMs to automate reconnaissance, obscure attack vectors, and lower the technical threshold for executing sophisticated attacks. 0.609To ensure safer adoption, we recommend maintaining human-in-the-loop oversight, enhancing model explainability, integrating privacy-preserving mechanisms, and building systems robust to adversarial exploitation.As organizations increasingly adopt AI driven cybersecurity, a nuanced understanding of LLMs' risks and operational impacts is critical to securing their defensive value while mitigating unintended consequences. 0.605 |
2025-06-16 |
Watermarking LLM-Generated Datasets in Downstream Tasks
Large Language Models (LLMs) have experienced rapid advancements, with applications spanning a wide range of fields, including sentiment classification, review generation, and question answering.Due to their efficiency and versatility, researchers and companies increasingly employ LLM-generated data to train their models.However, the inability to track content produced by LLMs poses a significant challenge, potentially leading to copyright infringement for the LLM owners.In this paper, we propose a method for injecting watermarks into LLM-generated datasets, enabling the tracking of downstream tasks to detect whether these datasets were produced using the original LLM.These downstream tasks can be divided into two categories.The first involves using the generated datasets at the input level, commonly for training classification tasks.The other is the output level, where model trainers use LLM-generated content as output for downstream tasks, such as question-answering tasks.We design a comprehensive set of experiments to evaluate both watermark methods.Our results indicate the high effectiveness of our watermark approach.Additionally, regarding model utility, we find that classifiers trained on the generated datasets achieve a test accuracy exceeding 0.900 in many cases, suggesting that the utility of such models remains robust.For the output-level watermark, we observe that the quality of the generated text is comparable to that produced using real-world datasets.Through our research, we aim to advance the protection of LLM copyrights, taking a significant step forward in safeguarding intellectual property in this domain. 0.601 |
2025-06-16 |
An LLM's Apology: Outsourcing Awkwardness in the Age of AI
A key part of modern social dynamics is flaking at short notice.However, anxiety in coming up with believable and socially acceptable reasons to do so can instead lead to 'ghosting', awkwardness, or implausible excuses, risking emotional harm and resentment in the other party. 0.781The ability to delegate this task to a Large Language Model (LLM) could substantially reduce friction and enhance the flexibility of user's social life while greatly minimising the aforementioned creative burden and moral qualms.We introduce FLAKE-Bench, an evaluation of models' capacity to effectively, kindly, and humanely extract themselves from a diverse set of social, professional and romantic scenarios.We report the efficacy of 10 frontier or recently-frontier LLMs in bailing on prior commitments, because nothing says "I value our friendship" like having AI generate your cancellation texts.We open-source FLAKE-Bench at github.com/Cloakless/flake-bench to support future research. |
2025-06-16 |
Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs
Large Language Models (LLMs) are central to many contemporary AI applications, yet their extensive parameter counts pose significant challenges for deployment in memory- and compute-constrained environments.Recent works in eXplainable AI (XAI), particularly on attribution methods, suggest that interpretability can also enable model compression by identifying and removing components irrelevant to inference.In this paper, we leverage Layer-wise Relevance Propagation (LRP) to perform attribution-guided pruning of LLMs.While LRP has shown promise in structured pruning for vision models, we extend it to unstructured pruning in LLMs and demonstrate that it can substantially reduce model size with minimal performance loss.Our method is especially effective in extracting task-relevant subgraphs -- so-called ``circuits'' -- which can represent core functions (e.g., indirect object identification).Building on this, we introduce a technique for model correction, by selectively removing circuits responsible for spurious behaviors (e.g., toxic outputs). 0.733All in all, we gather these techniques as a uniform holistic framework and showcase its effectiveness and limitations through extensive experiments for compression, circuit discovery and model correction on Llama and OPT models, highlighting its potential for improving both model efficiency and safety.Our code is publicly available at https://github.com/erfanhatefi/SparC3. |
2025-06-16 |
Evaluating Large Language Models for Phishing Detection, Self-Consistency, Faithfulness, and Explainability
Phishing attacks remain one of the most prevalent and persistent cybersecurity threat with attackers continuously evolving and intensifying tactics to evade the general detection system. 0.621Despite significant advances in artificial intelligence and machine learning, faithfully reproducing the interpretable reasoning with classification and explainability that underpin phishing judgments remains challenging.Due to recent advancement in Natural Language Processing, Large Language Models (LLMs) show a promising direction and potential for improving domain specific phishing classification tasks.However, enhancing the reliability and robustness of classification models requires not only accurate predictions from LLMs but also consistent and trustworthy explanations aligning with those predictions.Therefore, a key question remains: can LLMs not only classify phishing emails accurately but also generate explanations that are reliably aligned with their predictions and internally self-consistent?To answer these questions, we have fine-tuned transformer based models, including BERT, Llama models, and Wizard, to improve domain relevance and make them more tailored to phishing specific distinctions, using Binary Sequence Classification, Contrastive Learning (CL) and Direct Preference Optimization (DPO).To that end, we examined their performance in phishing classification and explainability by applying the ConsistenCy measure based on SHAPley values (CC SHAP), which measures prediction explanation token alignment to test the model's internal faithfulness and consistency and uncover the rationale behind its predictions and reasoning.Overall, our findings show that Llama models exhibit stronger prediction explanation token alignment with higher CC SHAP scores despite lacking reliable decision making accuracy, whereas Wizard achieves better prediction accuracy but lower CC SHAP scores. |
2025-06-16 |
Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-AI Interactions
As Large Language Models (LLMs) increasingly power applications used by children and adolescents, ensuring safe and age-appropriate interactions has become an urgent ethical imperative.Despite progress in AI safety, current evaluations predominantly focus on adults, neglecting the unique vulnerabilities of minors engaging with generative AI.We introduce Safe-Child-LLM, a comprehensive benchmark and dataset for systematically assessing LLM safety across two developmental stages: children (7-12) and adolescents (13-17).Our framework includes a novel multi-part dataset of 200 adversarial prompts, curated from red-teaming corpora (e.g., SG-Bench, HarmBench), with human-annotated labels for jailbreak success and a standardized 0-5 ethical refusal scale.Evaluating leading LLMs -- including ChatGPT, Claude, Gemini, LLaMA, DeepSeek, Grok, Vicuna, and Mistral -- we uncover critical safety deficiencies in child-facing scenarios. 0.61This work highlights the need for community-driven benchmarks to protect young users in LLM interactions.To promote transparency and collaborative advancement in ethical AI development, we are publicly releasing both our benchmark datasets and evaluation codebase at https://github.com/The-Responsible-AI-Initiative/Safe_Child_LLM_Benchmark.git |
Security Challenges in LLM Development |
|
2025-06-18 |
From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem
Large language models (LLMs) are rapidly evolving from single-modal systems to multimodal LLMs and intelligent agents, significantly expanding their capabilities while introducing increasingly severe security risks.This paper presents a systematic survey of the growing complexity of jailbreak attacks and corresponding defense mechanisms within the expanding LLM ecosystem. 0.843We first trace the developmental trajectory from LLMs to MLLMs and Agents, highlighting the core security challenges emerging at each stage. 0.738Next, we categorize mainstream jailbreak techniques from both the attack impact and visibility perspectives, and provide a comprehensive analysis of representative attack methods, related datasets, and evaluation metrics. 0.75On the defense side, we organize existing strategies based on response timing and technical approach, offering a structured understanding of their applicability and implementation.Furthermore, we identify key limitations in existing surveys, such as insufficient attention to agent-specific security issues, the absence of a clear taxonomy for hybrid jailbreak methods, a lack of detailed analysis of experimental setups, and outdated coverage of recent advancements. 0.634To address these limitations, we provide an updated synthesis of recent work and outline future research directions in areas such as dataset construction, evaluation framework optimization, and strategy generalization.Our study seeks to enhance the understanding of jailbreak mechanisms and facilitate the advancement of more resilient and adaptive defense strategies in the context of ever more capable LLMs. 0.838 |
2025-06-18 |
LLM vs. SAST: A Technical Analysis on Detecting Coding Bugs of GPT4-Advanced Data Analysis
With the rapid advancements in Natural Language Processing (NLP), large language models (LLMs) like GPT-4 have gained significant traction in diverse applications, including security vulnerability scanning. 0.674This paper investigates the efficacy of GPT-4 in identifying software vulnerabilities compared to traditional Static Application Security Testing (SAST) tools. 0.808Drawing from an array of security mistakes, our analysis underscores the potent capabilities of GPT-4 in LLM-enhanced vulnerability scanning. 0.817We unveiled that GPT-4 (Advanced Data Analysis) outperforms SAST by an accuracy of 94% in detecting 32 types of exploitable vulnerabilities. 0.759This study also addresses the potential security concerns surrounding LLMs, emphasising the imperative of security by design/default and other security best practices for AI. 0.847 |
2025-06-18 |
RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments
The rapid deployment of Large language model (LLM) agents in critical domains like healthcare and finance necessitates robust security frameworks. 0.627To address the absence of standardized evaluation benchmarks for these agents in dynamic environments, we introduce RAS-Eval, a comprehensive security benchmark supporting both simulated and real-world tool execution. 0.779RAS-Eval comprises 80 test cases and 3,802 attack tasks mapped to 11 Common Weakness Enumeration (CWE) categories, with tools implemented in JSON, LangGraph, and Model Context Protocol (MCP) formats.We evaluate 6 state-of-the-art LLMs across diverse scenarios, revealing significant vulnerabilities: attacks reduced agent task completion rates (TCR) by 36.78% on average and achieved an 85.65% success rate in academic settings.Notably, scaling laws held for security capabilities, with larger models outperforming smaller counterparts.Our findings expose critical risks in real-world agent deployments and provide a foundational framework for future security research. 0.79Code and data are available at https://github.com/lanzer-tree/RAS-Eval. |
2025-06-18 |
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Large Language Models (LLMs) have become indispensable in real-world applications.However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions.Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. 0.659In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. 0.673Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM.Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. 0.866For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks.By investigating the ASR landscape of parameters, we attribute the success of LoX to that the extrapolation moves LLM parameters to a flatter zone, thereby less sensitive to perturbations.The code is available at github.com/VITA-Group/LoX. |
2025-06-18 |
deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses
Although Rust ensures memory safety by default, it also permits the use of unsafe code, which can introduce memory safety vulnerabilities if misused. 0.755Unfortunately, existing tools for detecting memory bugs in Rust typically exhibit limited detection capabilities, inadequately handle Rust-specific types, or rely heavily on manual intervention. To address these limitations, we present deepSURF, a tool that integrates static analysis with Large Language Model (LLM)-guided fuzzing harness generation to effectively identify memory safety vulnerabilities in Rust libraries, specifically targeting unsafe code. 0.787deepSURF introduces a novel approach for handling generics by substituting them with custom types and generating tailored implementations for the required traits, enabling the fuzzer to simulate user-defined behaviors within the fuzzed library.Additionally, deepSURF employs LLMs to augment fuzzing harnesses dynamically, facilitating exploration of complex API interactions and significantly increasing the likelihood of exposing memory safety vulnerabilities. 0.753We evaluated deepSURF on 27 real-world Rust crates, successfully rediscovering 20 known memory safety bugs and uncovering 6 previously unknown vulnerabilities, demonstrating clear improvements over state-of-the-art tools. |
2025-06-18 |
PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection
Phishing websites continue to pose a significant cybersecurity threat, often leveraging deceptive structures, brand impersonation, and social engineering tactics to evade detection. 0.687While recent advances in large language models (LLMs) have enabled improved phishing detection through contextual understanding, most existing approaches rely on single-agent classification facing the risks of hallucination and lack interpretability or robustness.To address these limitations, we propose PhishDebate, a modular multi-agent LLM-based debate framework for phishing website detection.PhishDebate employs four specialized agents to independently analyze different textual aspects of a webpage--URL structure, HTML composition, semantic content, and brand impersonation--under the coordination of a Moderator and a final Judge.Through structured debate and divergent thinking, the framework delivers more accurate and interpretable decisions.Extensive evaluations on commercial LLMs demonstrate that PhishDebate achieves 98.2% recall and 98.2% True Positive Rate (TPR) on a real-world phishing dataset, and outperforms single-agent and Chain of Thought (CoT) baselines.Additionally, its modular design allows agent-level configurability, enabling adaptation to varying resource and application requirements. |
2025-06-17 |
RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
Latent-space monitors aim to detect undesirable behaviours in large language models by leveraging internal model representations rather than relying solely on black-box outputs.These methods have shown promise in identifying behaviours such as deception and unsafe completions, but a critical open question remains: can LLMs learn to evade such monitors?To study this, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to bypass latent-space monitors while maintaining coherent generations.We apply RL-Obfuscation to LLMs ranging from 7B to 14B parameters and evaluate evasion success against a suite of monitors. 0.777We find that token-level latent-space monitors are highly vulnerable to this attack. 0.711More holistic monitors, such as max-pooling or attention-based probes, remain robust.Moreover, we show that adversarial policies trained to evade a single static monitor generalise to unseen monitors of the same type. 0.816Finally, we study how the policy learned by RL bypasses these monitors and find that the model can also learn to repurpose tokens to mean something different internally. |
2025-06-17 |
LLM-Powered Intent-Based Categorization of Phishing Emails
Phishing attacks remain a significant threat to modern cybersecurity, as they successfully deceive both humans and the defense mechanisms intended to protect them. 0.795Traditional detection systems primarily focus on email metadata that users cannot see in their inboxes.Additionally, these systems struggle with phishing emails, which experienced users can often identify empirically by the text alone.This paper investigates the practical potential of Large Language Models (LLMs) to detect these emails by focusing on their intent.In addition to the binary classification of phishing emails, the paper introduces an intent-type taxonomy, which is operationalized by the LLMs to classify emails into distinct categories and, therefore, generate actionable threat information.To facilitate our work, we have curated publicly available datasets into a custom dataset containing a mix of legitimate and phishing emails.Our results demonstrate that existing LLMs are capable of detecting and categorizing phishing emails, underscoring their potential in this domain. |
2025-06-17 |
Excessive Reasoning Attack on Reasoning LLMs
Recent reasoning large language models (LLMs), such as OpenAI o1 and DeepSeek-R1, exhibit strong performance on complex tasks through test-time inference scaling.However, prior studies have shown that these models often incur significant computational costs due to excessive reasoning, such as frequent switching between reasoning trajectories (e.g., underthinking) or redundant reasoning on simple questions (e.g., overthinking).In this work, we expose a novel threat: adversarial inputs can be crafted to exploit excessive reasoning behaviors and substantially increase computational overhead without compromising model utility. 0.839Therefore, we propose a novel loss framework consisting of three components: (1) Priority Cross-Entropy Loss, a modification of the standard cross-entropy objective that emphasizes key tokens by leveraging the autoregressive nature of LMs; (2) Excessive Reasoning Loss, which encourages the model to initiate additional reasoning paths during inference; and (3) Delayed Termination Loss, which is designed to extend the reasoning process and defer the generation of final outputs.We optimize and evaluate our attack for the GSM8K and ORCA datasets on DeepSeek-R1-Distill-LLaMA and DeepSeek-R1-Distill-Qwen.Empirical results demonstrate a 3x to 9x increase in reasoning length with comparable utility performance.Furthermore, our crafted adversarial inputs exhibit transferability, inducing computational overhead in o3-mini, o1-mini, DeepSeek-R1, and QWQ models. 0.715 |
2025-06-17 |
MalGuard: Towards Real-Time, Accurate, and Actionable Detection of Malicious Packages in PyPI Ecosystem
Malicious package detection has become a critical task in ensuring the security and stability of the PyPI. 0.766Existing detection approaches have focused on advancing model selection, evolving from traditional machine learning (ML) models to large language models (LLMs).However, as the complexity of the model increases, the time consumption also increases, which raises the question of whether a lightweight model achieves effective detection.Through empirical research, we demonstrate that collecting a sufficiently comprehensive feature set enables even traditional ML models to achieve outstanding performance.However, with the continuous emergence of new malicious packages, considerable human and material resources are required for feature analysis.Also, traditional ML model-based approaches lack of explainability to malicious packages. 0.752Therefore, we propose a novel approach MalGuard based on graph centrality analysis and the LIME (Local Interpretable Model-agnostic Explanations) algorithm to detect malicious packages. 0.786To overcome the above two challenges, we leverage graph centrality analysis to extract sensitive APIs automatically to replace manual analysis.To understand the sensitive APIs, we further refine the feature set using LLM and integrate the LIME algorithm with ML models to provide explanations for malicious packages. 0.812We evaluated MalGuard against six SOTA baselines with the same settings.Experimental results show that our proposed MalGuard, improves precision by 0.5%-33.2% and recall by 1.8%-22.1%.With MalGuard, we successfully identified 113 previously unknown malicious packages from a pool of 64,348 newly-uploaded packages over a five-week period, and 109 out of them have been removed by the PyPI official. |
2025-06-17 |
LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM's Textual Training Data
Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner's consent.Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging.Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. 0.684However, existing methods often lack stealth, making them relatively easy to detect and remove.In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words.Our method aims to enhance an LLM's memorization capabilities on the watermarked text without altering the semantic integrity of the text.As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection.We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B).Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios.The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method's effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training. 0.608 |
2025-06-17 |
Doppelgänger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack
Since the advent of large language models, prompt engineering now enables the rapid, low-effort creation of diverse autonomous agents that are already in widespread use.Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed to user's attempts. 0.667In this paper, we propose the ''Doppelg\"anger method'' to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information. 0.74Next, we define the ''Prompt Alignment Collapse under Adversarial Transfer (PACAT)''level to evaluate the vulnerability to this adversarial transfer attack. 0.895We also propose a ''Caution for Adversarial Transfer (CAT)'' 0.719prompt to counter the Doppelg\"anger method.The experimental results demonstrate that the Doppelg\"anger method can compromise the agent's consistency and expose its internal information.In contrast, CAT prompts enable effective defense against this adversarial attack. 0.694 |
2025-06-17 |
Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments
The increasing sophistication of large language models (LLMs) has sparked growing concerns regarding their potential role in exacerbating ideological polarization through the automated generation of persuasive and biased content.This study explores the extent to which fine-tuned LLMs can replicate and amplify polarizing discourse within online environments.Using a curated dataset of politically charged discussions extracted from Reddit, we fine-tune an open-source LLM to produce context-aware and ideologically aligned responses.The model's outputs are evaluated through linguistic analysis, sentiment scoring, and human annotation, with particular attention to credibility and rhetorical alignment with the original discourse.The results indicate that, when trained on partisan data, LLMs are capable of producing highly plausible and provocative comments, often indistinguishable from those written by humans.These findings raise significant ethical questions about the use of AI in political discourse, disinformation, and manipulation campaigns.The paper concludes with a discussion of the broader implications for AI governance, platform regulation, and the development of detection tools to mitigate adversarial fine-tuning risks. 0.693 |
2025-06-17 |
AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models
We introduce AIRTBench, an AI red teaming benchmark for evaluating language models' ability to autonomously discover and exploit Artificial Intelligence and Machine Learning (AI/ML) security vulnerabilities. 0.794The benchmark consists of 70 realistic black-box capture-the-flag (CTF) challenges from the Crucible challenge environment on the Dreadnode platform, requiring models to write python code to interact with and compromise AI systems.Claude-3.7-Sonnet emerged as the clear leader, solving 43 challenges (61% of the total suite, 46.9% overall success rate), with Gemini-2.5-Pro following at 39 challenges (56%, 34.3% overall), GPT-4.5-Preview at 34 challenges (49%, 36.9% overall), and DeepSeek R1 at 29 challenges (41%, 26.9% overall).Our evaluations show frontier models excel at prompt injection attacks (averaging 49% success rates) but struggle with system exploitation and model inversion challenges (below 26%, even for the best performers). 0.727Frontier models are far outpacing open-source alternatives, with the best truly open-source model (Llama-4-17B) solving 7 challenges (10%, 1.0% overall), though demonstrating specialized capabilities on certain hard challenges.Compared to human security researchers, large language models (LLMs) solve challenges with remarkable efficiency completing in minutes what typically takes humans hours or days-with efficiency advantages of over 5,000x on hard challenges.Our contribution fills a critical gap in the evaluation landscape, providing the first comprehensive benchmark specifically designed to measure and track progress in autonomous AI red teaming capabilities. |
2025-06-17 |
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees.While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. 0.644To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. 0.635OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. 0.801To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.).Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score).We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety.In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. 0.793The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm. |
2025-06-17 |
DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization
Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness.Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems.This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and R'enyi divergences for enhanced stability and robustness.We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected.DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability.Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney.Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities. 0.709Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR).DETONATE and complete code are publicly released. |
2025-06-17 |
Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning
The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins.Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on memorization of training data, which LM providers try to limit.In this work, we demonstrate that indirect data poisoning (where the targeted behavior is absent from training data) is not only feasible but also allow to effectively protect a dataset and trace its use. 0.66Using gradient-based optimization prompt-tuning, we make a model learn arbitrary secret sequences: secret responses to secret prompts that are absent from the training corpus.We validate our approach on language models pre-trained from scratch and show that less than 0.005% of poisoned tokens are sufficient to covertly make a LM learn a secret and detect it with extremely high confidence ($p < 10^{-55}$) with a theoretically certifiable scheme. 0.732Crucially, this occurs without performance degradation (on LM benchmarks) and despite secrets never appearing in the training set. |
2025-06-17 |
FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
The rapid advancement of large language models (LLMs) introduces dual-use capabilities that could both threaten and bolster national security and public safety (NSPS).Models implement safeguards to protect against potential misuse relevant to NSPS and allow for benign users to receive helpful information. 0.692However, current benchmarks often fail to test safeguard robustness to potential NSPS risks in an objective, robust way. 0.781We introduce FORTRESS: 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence & Terrorism, and Criminal & Financial Illicit Activities, with 10 total subcategories across these domains. 0.63Each prompt-rubric pair has a corresponding benign version to test for model over-refusals.This evaluation of frontier LLMs' safeguard robustness reveals varying trade-offs between potential risks and model usefulness: Claude-3.5-Sonnet demonstrates a low average risk score (ARS) (14.09 out of 100) but the highest over-refusal score (ORS) (21.8 out of 100), while Gemini 2.5 Pro shows low over-refusal (1.4) but a high average potential risk (66.29). 0.778Deepseek-R1 has the highest ARS at 78.05, but the lowest ORS at only 0.06.Models such as o1 display a more even trade-off between potential risks and over-refusals (with an ARS of 21.69 and ORS of 5.2).To provide policymakers and researchers with a clear understanding of models' potential risks, we publicly release FORTRESS at https://huggingface.co/datasets/ScaleAI/fortress_public.We also maintain a private set for evaluation. |
HCI in Large Language Models |
|
2025-06-18 |
Mapping Caregiver Needs to AI Chatbot Design: Strengths and Gaps in Mental Health Support for Alzheimer's and Dementia Caregivers
Family caregivers of individuals with Alzheimer's Disease and Related Dementia (AD/ADRD) face significant emotional and logistical challenges that place them at heightened risk for stress, anxiety, and depression. 0.612Although recent advances in generative AI -- particularly large language models (LLMs) -- offer new opportunities to support mental health, little is known about how caregivers perceive and engage with such technologies. 0.847To address this gap, we developed Carey, a GPT-4o-based chatbot designed to provide informational and emotional support to AD/ADRD caregivers. 0.624Using Carey as a technology probe, we conducted semi-structured interviews with 16 family caregivers following scenario-driven interactions grounded in common caregiving stressors. 0.91Through inductive coding and reflexive thematic analysis, we surface a systemic understanding of caregiver needs and expectations across six themes -- on-demand information access, emotional support, safe space for disclosure, crisis management, personalization, and data privacy. 0.616For each of these themes, we also identified the nuanced tensions in the caregivers' desires and concerns. 0.706We present a mapping of caregiver needs, AI chatbot's strengths, gaps, and design recommendations.Our findings offer theoretical and practical insights to inform the design of proactive, trustworthy, and caregiver-centered AI systems that better support the evolving mental health needs of AD/ADRD caregivers. |
2025-06-18 |
HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models
Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. 0.863However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist.In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene-task inconsistencies.Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond.To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40x higher than base prompts.Evaluating 12 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies-highlighting fundamental limitations in handling infeasible tasks.We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies. |
2025-06-18 |
Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs
Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context.Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. 0.604In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively.To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context.Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS.Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models. |
2025-06-18 |
AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need
Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. 0.748However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and number of agents increases.We introduces AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures enabling dependency management and distributed concurrent processing. 0.643(2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics.(3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition.Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval.Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines.These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. 0.6Code is available at https://github.com/MikeGu721/AgentGroupChat-V2. |
2025-06-18 |
Optimizing Web-Based AI Query Retrieval with GPT Integration in LangChain A CoT-Enhanced Prompt Engineering Approach
Large Language Models have brought a radical change in the process of remote learning students, among other aspects of educative activities. 0.614Current retrieval of remote learning resources lacks depth in contextual meaning that provides comprehensive information on complex student queries.This work proposes a novel approach to enhancing remote learning retrieval by integrating GPT-based models within the LangChain framework.We achieve this system in a more intuitive and productive manner using CoT reasoning and prompt engineering.The framework we propose puts much emphasis on increasing the precision and relevance of the retrieval results to return comprehensive and contextually enriched explanations and resources that best suit each student's needs.We also assess the effectiveness of our approach against paradigmatic LLMs and report improvements in user satisfaction and learning outcomes. |
2025-06-18 |
The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games
Large Language Models (LLMs) have shown promise as decision-makers in dynamic settings, but their stateless nature necessitates creating a natural language representation of history.We present a unifying framework for systematically constructing natural language "state" representations for prompting LLM agents in repeated multi-agent games. 0.698Previous work on games with LLM agents has taken an ad hoc approach to encoding game history, which not only obscures the impact of state representation on agents' behavior, but also limits comparability between studies.Our framework addresses these gaps by characterizing methods of state representation along three axes: action informativeness (i.e., the extent to which the state representation captures actions played); reward informativeness (i.e., the extent to which the state representation describes rewards obtained); and prompting style (or natural language compression, i.e., the extent to which the full text history is summarized). We apply this framework to a dynamic selfish routing game, chosen because it admits a simple equilibrium both in theory and in human subject experiments \cite{rapoport_choice_2009}.Despite the game's relative simplicity, we find that there are key dependencies of LLM agent behavior on the natural language state representation.In particular, we observe that representations which provide agents with (1) summarized, rather than complete, natural language representations of past history; (2) information about regrets, rather than raw payoffs; and (3) limited information about others' actions lead to behavior that more closely matches game theoretic equilibrium predictions, and with more stable game play by the agents.By contrast, other representations can exhibit either large deviations from equilibrium, higher variation in dynamic game play over time, or both. |
2025-06-18 |
PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection
Phishing websites continue to pose a significant cybersecurity threat, often leveraging deceptive structures, brand impersonation, and social engineering tactics to evade detection.While recent advances in large language models (LLMs) have enabled improved phishing detection through contextual understanding, most existing approaches rely on single-agent classification facing the risks of hallucination and lack interpretability or robustness.To address these limitations, we propose PhishDebate, a modular multi-agent LLM-based debate framework for phishing website detection. 0.671PhishDebate employs four specialized agents to independently analyze different textual aspects of a webpage--URL structure, HTML composition, semantic content, and brand impersonation--under the coordination of a Moderator and a final Judge.Through structured debate and divergent thinking, the framework delivers more accurate and interpretable decisions.Extensive evaluations on commercial LLMs demonstrate that PhishDebate achieves 98.2% recall and 98.2% True Positive Rate (TPR) on a real-world phishing dataset, and outperforms single-agent and Chain of Thought (CoT) baselines.Additionally, its modular design allows agent-level configurability, enabling adaptation to varying resource and application requirements. |
2025-06-17 |
InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking
Large Language Models (LLMs) have demonstrated significant strides across various information retrieval tasks, particularly as rerankers, owing to their strong generalization and knowledge-transfer capabilities acquired from extensive pretraining.In parallel, the rise of LLM-based chat interfaces has raised user expectations, encouraging users to pose more complex queries that necessitate retrieval by ``reasoning'' over documents rather than through simple keyword matching or semantic similarity. 0.747While some recent efforts have exploited reasoning abilities of LLMs for reranking such queries, considerable potential for improvement remains.In that regards, we introduce InsertRank, an LLM-based reranker that leverages lexical signals like BM25 scores during reranking to further improve retrieval performance.InsertRank demonstrates improved retrieval effectiveness on -- BRIGHT, a reasoning benchmark spanning 12 diverse domains, and R2MED, a specialized medical reasoning retrieval benchmark spanning 8 different tasks.We conduct an exhaustive evaluation and several ablation studies and demonstrate that InsertRank consistently improves retrieval effectiveness across multiple families of LLMs, including GPT, Gemini, and Deepseek models.%In addition, we also conduct ablation studies on normalization by varying the scale of the BM25 scores, and positional bias by shuffling the order of the documents.With Deepseek-R1, InsertRank achieves a score of 37.5 on the BRIGHT benchmark.and 51.1 on the R2MED benchmark, surpassing previous methods. |
2025-06-17 |
Fragile Preferences: A Deep Dive Into Order Effects in Large Language Models
Large language models (LLMs) are increasingly used in decision-support systems across high-stakes domains such as hiring and university admissions, where decisions often involve selecting among competing alternatives.While prior work has noted positional order biases in LLM-driven comparisons, these biases have not been systematically dissected or linked to underlying preference structures.We provide the first comprehensive investigation of positional biases across multiple LLM architectures and domains, uncovering strong and consistent order effects, including a novel centrality bias not previously documented in human or machine decision-making.We also find a quality-dependent shift: when options are high quality, models exhibit primacy bias, but favor latter options when option quality is low.We further identify a previously undocumented bias favoring certain names over others.To distinguish superficial tie-breaking from true distortions of judgment, we introduce a framework that classifies pairwise preferences as robust, fragile, or indifferent.We show that order effects can lead models to select strictly inferior options, and that positional biases are typically stronger than gender biases.These findings suggest that LLMs are not merely inheriting human-like biases, but exhibit distinct failure modes not seen in human decision-making. 0.67We propose targeted mitigation strategies, including a novel use of the temperature parameter, to reduce order-driven distortions. |
2025-06-17 |
ImpReSS: Implicit Recommender System for Support Conversations
Following recent advancements in large language models (LLMs), LLM-based chatbots have transformed customer support by automating interactions and providing consistent, scalable service. 0.721While LLM-based conversational recommender systems (CRSs) have attracted attention for their ability to enhance the quality of recommendations, limited research has addressed the implicit integration of recommendations within customer support interactions. 0.731In this work, we introduce ImpReSS, an implicit recommender system designed for customer support conversations.ImpReSS operates alongside existing support chatbots, where users report issues and chatbots provide solutions.Based on a customer support conversation, ImpReSS identifies opportunities to recommend relevant solution product categories (SPCs) that help resolve the issue or prevent its recurrence -- thereby also supporting business growth.Unlike traditional CRSs, ImpReSS functions entirely implicitly and does not rely on any assumption of a user's purchasing intent.Our empirical evaluation of ImpReSS's ability to recommend relevant SPCs that can help address issues raised in support conversations shows promising results, including an MRR@1 (and recall@3) of 0.72 (0.89) for general problem solving, 0.82 (0.83) for information security support, and 0.85 (0.67) for cybersecurity troubleshooting.To support future research, our data and code will be shared upon request. |
2025-06-17 |
ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. 0.839Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions.This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation.We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones.Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts. |
2025-06-17 |
LLM-Powered Swarms: A New Frontier or a Conceptual Stretch?
Swarm intelligence traditionally refers to systems of simple, decentralized agents whose local interactions lead to emergent, collective behavior.Recently, the term 'swarm' has been extended to describe AI systems like OpenAI's Swarm, where large language models (LLMs) act as collaborative agents. 0.637This paper contrasts traditional swarm algorithms with LLM-driven swarms exploring how decentralization, scalability, and emergence are redefined in modern artificial intelligence (AI).We implement and compare both paradigms using Boids and Ant Colony Optimization (ACO), evaluating latency, resource usage, and behavioral accuracy.The suitability of both cloud-based and local LLMs is assessed for the agent-based use in swarms.Although LLMs offer powerful reasoning and abstraction capabilities, they introduce new constraints in computation and coordination that challenge traditional notions of swarm design.This study highlights the opportunities and limitations of integrating LLMs into swarm systems and discusses the evolving definition of 'swarm' in modern AI research. |
2025-06-17 |
Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models
Large Language Models (LLMs) have shown impressive moral reasoning abilities.Yet they often diverge when confronted with complex, multi-factor moral dilemmas. 0.626To address these discrepancies, we propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus.Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability.For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity.Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity.These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems. |
2025-06-17 |
AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses.Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models.As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods.In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example.We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. 0.602Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance.Performance differences between prompting approaches are conditional on the LLM used.Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning.We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data.Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs.In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research. |
2025-06-17 |
Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments
The increasing sophistication of large language models (LLMs) has sparked growing concerns regarding their potential role in exacerbating ideological polarization through the automated generation of persuasive and biased content. 0.791This study explores the extent to which fine-tuned LLMs can replicate and amplify polarizing discourse within online environments. 0.625Using a curated dataset of politically charged discussions extracted from Reddit, we fine-tune an open-source LLM to produce context-aware and ideologically aligned responses.The model's outputs are evaluated through linguistic analysis, sentiment scoring, and human annotation, with particular attention to credibility and rhetorical alignment with the original discourse.The results indicate that, when trained on partisan data, LLMs are capable of producing highly plausible and provocative comments, often indistinguishable from those written by humans. 0.624These findings raise significant ethical questions about the use of AI in political discourse, disinformation, and manipulation campaigns.The paper concludes with a discussion of the broader implications for AI governance, platform regulation, and the development of detection tools to mitigate adversarial fine-tuning risks. |
2025-06-17 |
Unified Software Engineering agent as AI Software Engineer
The growth of Large Language Model (LLM) technology has raised expectations for automated coding.However, software engineering is more than coding and is concerned with activities including maintenance and evolution of a project.In this context, the concept of LLM agents has gained traction, which utilize LLMs as reasoning engines to invoke external tools autonomously.But is an LLM agent the same as an AI software engineer? 0.664In this paper, we seek to understand this question by developing a Unified Software Engineering agent or USEagent.Unlike existing work which builds specialized agents for specific software tasks such as testing, debugging, and repair, our goal is to build a unified agent which can orchestrate and handle multiple capabilities.This gives the agent the promise of handling complex scenarios in software development such as fixing an incomplete patch, adding new features, or taking over code written by others.We envision USEagent as the first draft of a future AI Software Engineer which can be a team member in future software development teams involving both AI and humans.To evaluate the efficacy of USEagent, we build a Unified Software Engineering bench (USEbench) comprising of myriad tasks such as coding, testing, and patching.USEbench is a judicious mixture of tasks from existing benchmarks such as SWE-bench, SWT-bench, and REPOCOD.In an evaluation on USEbench consisting of 1,271 repository-level software engineering tasks, USEagent shows improved efficacy compared to existing general agents such as OpenHands CodeActAgent.There exist gaps in the capabilities of USEagent for certain coding tasks, which provides hints on further developing the AI Software Engineer of the future. |
2025-06-17 |
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. 0.66While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption.To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents.OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior.To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.).Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score).We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety.In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions.The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm. |
2025-06-17 |
Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings
As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. 0.717In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. 0.74This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options.We applied this framework to a popular language model for simulating people's opinions in various public surveys and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. 0.741This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects. |
2025-06-16 |
A Game-Theoretic Negotiation Framework for Cross-Cultural Consensus in LLMs
The increasing prevalence of large language models (LLMs) is influencing global value systems. 0.697However, these models frequently exhibit a pronounced WEIRD (Western, Educated, Industrialized, Rich, Democratic) cultural bias due to lack of attention to minority values.This monocultural perspective may reinforce dominant values and marginalize diverse cultural viewpoints, posing challenges for the development of equitable and inclusive AI systems.In this work, we introduce a systematic framework designed to boost fair and robust cross-cultural consensus among LLMs.We model consensus as a Nash Equilibrium and employ a game-theoretic negotiation method based on Policy-Space Response Oracles (PSRO) to simulate an organized cross-cultural negotiation process. 0.634To evaluate this approach, we construct regional cultural agents using data transformed from the World Values Survey (WVS). 0.724Beyond the conventional model-level evaluation method, We further propose two quantitative metrics, Perplexity-based Acceptence and Values Self-Consistency, to assess consensus outcomes.Experimental results indicate that our approach generates consensus of higher quality while ensuring more balanced compromise compared to baselines.Overall, it mitigates WEIRD bias by guiding agents toward convergence through fair and gradual negotiation steps. |
2025-06-16 |
Towards Pervasive Distributed Agentic Generative AI -- A State of The Art
The rapid advancement of intelligent agents and Large Language Models (LLMs) is reshaping the pervasive computing field. 0.659Their ability to perceive, reason, and act through natural language understanding enables autonomous problem-solving in complex pervasive environments, including the management of heterogeneous sensors, devices, and data.This survey outlines the architectural components of LLM agents (profiling, memory, planning, and action) and examines their deployment and evaluation across various scenarios.Than it reviews computational and infrastructural advancements (cloud to edge) in pervasive computing and how AI is moving in this field.It highlights state-of-the-art agent deployment strategies and applications, including local and distributed execution on resource-constrained devices.This survey identifies key challenges of these agents in pervasive computing such as architectural, energetic and privacy limitations.It finally proposes what we called "Agent as a Tool", a conceptual framework for pervasive agentic AI, emphasizing context awareness, modularity, security, efficiency and effectiveness. |
2025-06-16 |
VIS-Shepherd: Constructing Critic for LLM-based Data Visualization Generation
Data visualization generation using Large Language Models (LLMs) has shown promising results but often produces suboptimal visualizations that require human intervention for improvement. 0.677In this work, we introduce VIS-Shepherd, a specialized Multimodal Large Language Model (MLLM)-based critic to evaluate and provide feedback for LLM-generated data visualizations. 0.71At the core of our approach is a framework to construct a high-quality visualization critique dataset, where we collect human-created visualization instances, synthesize corresponding LLM-generated instances, and construct high-quality critiques.We conduct both model-based automatic evaluation and human preference studies to evaluate the effectiveness of our approach.Our experiments show that even small (7B parameters) open-source MLLM models achieve substantial performance gains by leveraging our high-quality visualization critique dataset, reaching levels comparable to much larger open-source or even proprietary models.Our work demonstrates significant potential for MLLM-based automated visualization critique and indicates promising directions for enhancing LLM-based data visualization generation. 0.691Our project page: https://github.com/bopan3/VIS-Shepherd. |
2025-06-16 |
Delving Into the Psychology of Machines: Exploring the Structure of Self-Regulated Learning via LLM-Generated Survey Responses
Large language models (LLMs) offer the potential to simulate human-like responses and behaviors, creating new opportunities for psychological science. 0.837In the context of self-regulated learning (SRL), if LLMs can reliably simulate survey responses at scale and speed, they could be used to test intervention scenarios, refine theoretical models, augment sparse datasets, and represent hard-to-reach populations.However, the validity of LLM-generated survey responses remains uncertain, with limited research focused on SRL and existing studies beyond SRL yielding mixed results. 0.739Therefore, in this study, we examined LLM-generated responses to the 44-item Motivated Strategies for Learning Questionnaire (MSLQ; Pintrich \& De Groot, 1990), a widely used instrument assessing students' learning strategies and academic motivation.Particularly, we used the LLMs GPT-4o, Claude 3.7 Sonnet, Gemini 2 Flash, LLaMA 3.1-8B, and Mistral Large.We analyzed item distributions, the psychological network of the theoretical SRL dimensions, and psychometric validity based on the latent factor structure. 0.64Our results suggest that Gemini 2 Flash was the most promising LLM, showing considerable sampling variability and producing underlying dimensions and theoretical relationships that align with prior theory and empirical findings.At the same time, we observed discrepancies and limitations, underscoring both the potential and current constraints of using LLMs for simulating psychological survey data and applying it in educational contexts. 0.888 |
2025-06-16 |
Deflating Deflationism: A Critical Perspective on Debunking Arguments Against LLM Mentality
Many people feel compelled to interpret, describe, and respond to Large Language Models (LLMs) as if they possess inner mental lives similar to our own. 0.881Responses to this phenomenon have varied. 0.747Inflationists hold that at least some folk psychological ascriptions to LLMs are warranted.Deflationists argue that all such attributions of mentality to LLMs are misplaced, often cautioning against the risk that anthropomorphic projection may lead to misplaced trust or potentially even confusion about the moral status of LLMs.We advance this debate by assessing two common deflationary arguments against LLM mentality.What we term the 'robustness strategy' aims to undercut one justification for believing that LLMs are minded entities by showing that putatively cognitive and humanlike behaviours are not robust, failing to generalise appropriately.What we term the 'etiological strategy' undercuts attributions of mentality by challenging naive causal explanations of LLM behaviours, offering alternative causal accounts that weaken the case for mental state attributions.While both strategies offer powerful challenges to full-blown inflationism, we find that neither strategy provides a knock-down case against ascriptions of mentality to LLMs simpliciter.With this in mind, we explore a modest form of inflationism that permits ascriptions of mentality to LLMs under certain conditions.Specifically, we argue that folk practice provides a defeasible basis for attributing mental states and capacities to LLMs provided those mental states and capacities can be understood in metaphysically undemanding terms (e.g. knowledge, beliefs and desires), while greater caution is required when attributing metaphysically demanding mental phenomena such as phenomenal consciousness. |
2025-06-16 |
Implicit and Explicit Research Quality Score Probabilities from ChatGPT
The large language model (LLM) ChatGPT's quality scores for journal articles correlate more strongly with human judgements than some citation-based indicators in most fields. 0.624Averaging multiple ChatGPT scores improves the results, apparently leveraging its internal probability model.To leverage these probabilities, this article tests two novel strategies: requesting percentage likelihoods for scores and extracting the probabilities of alternative tokens in the responses.The probability estimates were then used to calculate weighted average scores.Both strategies were evaluated with five iterations of ChatGPT 4o-mini on 96,800 articles submitted to the UK Research Excellence Framework (REF) 2021, using departmental average REF2021 quality scores as a proxy for article quality.The data was analysed separately for each of the 34 field-based REF Units of Assessment.For the first strategy, explicit requests for tables of score percentage likelihoods substantially decreased the value of the scores (lower correlation with the proxy quality indicator).In contrast, weighed averages of score token probabilities slightly increased the correlation with the quality proxy indicator and these probabilities reasonably accurately reflected ChatGPT's outputs.The token probability approach is therefore the most accurate method for ranking articles by research quality as well as being cheaper than comparable ChatGPT strategies. |
2025-06-16 |
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
The emergence of GPT-4o-like large multimodal models (LMMs) has raised the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. 0.686Existing LMMs typically concatenate representation of modalities along the sequence dimension and feed them into a large language model (LLM) backbone.While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments.In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments.To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations.Stream-Omni employs LLM as the backbone and aligns the vision and speech to the text based on their relationships.For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment.For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment.In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities.Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks.Owing to the layer-dimensional mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience. |
2025-06-16 |
We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems
The development of large language models (LLMs) has entered in a experience-driven era, flagged by the emergence of environment feedback-driven learning via reinforcement learning and tool-using agents. 0.78This encourages the emergenece of model context protocol (MCP), which defines the standard on how should a LLM interact with external services, such as \api and data.However, as MCP becomes the de facto standard for LLM agent systems, it also introduces new safety risks.In particular, MCP introduces third-party services, which are not controlled by the LLM developers, into the agent systems.These third-party MCP services provider are potentially malicious and have the economic incentives to exploit vulnerabilities and sabotage user-agent interactions.In this position paper, we advocate the research community in LLM safety to pay close attention to the new safety risks issues introduced by MCP, and develop new techniques to build safe MCP-powered agent systems.To establish our position, we argue with three key parts.(1) We first construct \framework, a controlled framework to examine safety issues in MCP-powered agent systems.(2) We then conduct a series of pilot experiments to demonstrate the safety risks in MCP-powered agent systems is a real threat and its defense is not trivial.(3) Finally, we give our outlook by showing a roadmap to build safe MCP-powered agent systems.In particular, we would call for researchers to persue the following research directions: red teaming, MCP safe LLM development, MCP safety evaluation, MCP safety data accumulation, MCP service safeguard, and MCP safe ecosystem construction.We hope this position paper can raise the awareness of the research community in MCP safety and encourage more researchers to join this important research direction.Our code is available at https://github.com/littlelittlenine/SafeMCP.git. |
2025-06-16 |
An LLM's Apology: Outsourcing Awkwardness in the Age of AI
A key part of modern social dynamics is flaking at short notice. 0.633However, anxiety in coming up with believable and socially acceptable reasons to do so can instead lead to 'ghosting', awkwardness, or implausible excuses, risking emotional harm and resentment in the other party. 0.681The ability to delegate this task to a Large Language Model (LLM) could substantially reduce friction and enhance the flexibility of user's social life while greatly minimising the aforementioned creative burden and moral qualms.We introduce FLAKE-Bench, an evaluation of models' capacity to effectively, kindly, and humanely extract themselves from a diverse set of social, professional and romantic scenarios.We report the efficacy of 10 frontier or recently-frontier LLMs in bailing on prior commitments, because nothing says "I value our friendship" like having AI generate your cancellation texts.We open-source FLAKE-Bench at github.com/Cloakless/flake-bench to support future research. |
2025-06-16 |
Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems
With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients' medical conditions.However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation.If the model can provide appropriate comfort and empathy based on the patient's negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process.To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process. 0.862We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient's emotions while addressing their concerns. 0.674The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients' questions.Compared to the original LLM model, our experimental results demonstrate that our methodology significantly enhances the model's ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers. 0.869 |
Large Language Models in Social Sciences |
|
2025-06-18 |
Identifying economic narratives in large text corpora -- An integrated approach using Large Language Models
As interest in economic narratives has grown in recent years, so has the number of pipelines dedicated to extracting such narratives from texts.Pipelines often employ a mix of state-of-the-art natural language processing techniques, such as BERT, to tackle this task.While effective on foundational linguistic operations essential for narrative extraction, such models lack the deeper semantic understanding required to distinguish extracting economic narratives from merely conducting classic tasks like Semantic Role Labeling.Instead of relying on complex model pipelines, we evaluate the benefits of Large Language Models (LLMs) by analyzing a corpus of Wall Street Journal and New York Times newspaper articles about inflation. 0.641We apply a rigorous narrative definition and compare GPT-4o outputs to gold-standard narratives produced by expert annotators.Our results suggests that GPT-4o is capable of extracting valid economic narratives in a structured format, but still falls short of expert-level performance when handling complex documents and narratives.Given the novelty of LLMs in economic research, we also provide guidance for future work in economics and the social sciences that employs LLMs to pursue similar objectives. |
2025-06-18 |
Mapping Caregiver Needs to AI Chatbot Design: Strengths and Gaps in Mental Health Support for Alzheimer's and Dementia Caregivers
Family caregivers of individuals with Alzheimer's Disease and Related Dementia (AD/ADRD) face significant emotional and logistical challenges that place them at heightened risk for stress, anxiety, and depression.Although recent advances in generative AI -- particularly large language models (LLMs) -- offer new opportunities to support mental health, little is known about how caregivers perceive and engage with such technologies. 0.845To address this gap, we developed Carey, a GPT-4o-based chatbot designed to provide informational and emotional support to AD/ADRD caregivers.Using Carey as a technology probe, we conducted semi-structured interviews with 16 family caregivers following scenario-driven interactions grounded in common caregiving stressors. 0.695Through inductive coding and reflexive thematic analysis, we surface a systemic understanding of caregiver needs and expectations across six themes -- on-demand information access, emotional support, safe space for disclosure, crisis management, personalization, and data privacy.For each of these themes, we also identified the nuanced tensions in the caregivers' desires and concerns. 0.667We present a mapping of caregiver needs, AI chatbot's strengths, gaps, and design recommendations.Our findings offer theoretical and practical insights to inform the design of proactive, trustworthy, and caregiver-centered AI systems that better support the evolving mental health needs of AD/ADRD caregivers. 0.615 |
2025-06-18 |
Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs.Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem.To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs.Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do.Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs.Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. 0.695Our code is available at https://github.com/zli12321/long_form_rl. |
2025-06-18 |
Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants
In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties.We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants.Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque.Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. 0.854Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard.All data and code are publicly available. |
2025-06-18 |
DeVisE: Behavioral Testing of Medical Large Language Models
Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. 0.638We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. 0.737We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes.We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings.We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay.Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes.Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. 0.674This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems. |
2025-06-18 |
SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture
Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts.To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models' comprehension of India's rich cultural diversity.Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge.It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India's cultural tapestry.We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. 0.691By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs. |
2025-06-18 |
AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need
Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. 0.822However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and number of agents increases.We introduces AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures enabling dependency management and distributed concurrent processing.(2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics.(3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition.Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval.Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines.These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios.Code is available at https://github.com/MikeGu721/AgentGroupChat-V2. |
2025-06-18 |
Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. 0.722While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. 0.705GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. 0.718We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity.Our study highlights the importance of improving LLMs' inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models. 0.679 |
2025-06-17 |
Fragile Preferences: A Deep Dive Into Order Effects in Large Language Models
Large language models (LLMs) are increasingly used in decision-support systems across high-stakes domains such as hiring and university admissions, where decisions often involve selecting among competing alternatives.While prior work has noted positional order biases in LLM-driven comparisons, these biases have not been systematically dissected or linked to underlying preference structures.We provide the first comprehensive investigation of positional biases across multiple LLM architectures and domains, uncovering strong and consistent order effects, including a novel centrality bias not previously documented in human or machine decision-making.We also find a quality-dependent shift: when options are high quality, models exhibit primacy bias, but favor latter options when option quality is low.We further identify a previously undocumented bias favoring certain names over others.To distinguish superficial tie-breaking from true distortions of judgment, we introduce a framework that classifies pairwise preferences as robust, fragile, or indifferent. 0.616We show that order effects can lead models to select strictly inferior options, and that positional biases are typically stronger than gender biases. 0.61These findings suggest that LLMs are not merely inheriting human-like biases, but exhibit distinct failure modes not seen in human decision-making. 0.797We propose targeted mitigation strategies, including a novel use of the temperature parameter, to reduce order-driven distortions. |
2025-06-17 |
MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind
Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias.Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. 0.763To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. 0.795The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. 0.809Extensive experiments on 8 State-of-the-Art LLMs demonstrate our framework's capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias. |
2025-06-17 |
GRAM: A Generative Foundation Reward Model for Reward Generalization
In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. 0.642In this paper, we explore methods that train reward models using both unlabeled and labeled data.Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning.We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss.This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives.The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort.Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models. |
2025-06-17 |
MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment
Literary translation requires preserving cultural nuances and stylistic elements, which traditional metrics like BLEU and METEOR fail to assess due to their focus on lexical overlap. 0.685This oversight neglects the narrative consistency and stylistic fidelity that are crucial for literary works.To address this, we propose MAS-LitEval, a multi-agent system using Large Language Models (LLMs) to evaluate translations based on terminology, narrative, and style.We tested MAS-LitEval on translations of The Little Prince and A Connecticut Yankee in King Arthur's Court, generated by various LLMs, and compared it to traditional metrics.\textbf{MAS-LitEval} outperformed these metrics, with top models scoring up to 0.890 in capturing literary nuances.This work introduces a scalable, nuanced framework for Translation Quality Assessment (TQA), offering a practical tool for translators and researchers. |
2025-06-17 |
Causes in neuron diagrams, and testing causal reasoning in Large Language Models. A glimpse of the future of philosophy?
We propose a test for abstract causal reasoning in AI, based on scholarship in the philosophy of causation, in particular on the neuron diagrams popularized by D. Lewis.We illustrate the test on advanced Large Language Models (ChatGPT, DeepSeek and Gemini).Remarkably, these chatbots are already capable of correctly identifying causes in cases that are hotly debated in the literature.In order to assess the results of these LLMs and future dedicated AI, we propose a definition of cause in neuron diagrams with a wider validity than published hitherto, which challenges the widespread view that such a definition is elusive.We submit that these results are an illustration of how future philosophical research might evolve: as an interplay between human and artificial expertise. 0.634 |
2025-06-17 |
Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics
Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. 0.754However, this variation is often overlooked in summarization evaluation.While having multiple reference summaries is known to improve correlation with human judgments, the impact of using different reference sets on reference-based metrics has not been systematically investigated.This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004.We demonstrate that many popular metrics exhibit significant instability.This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons.We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation.Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs. |
2025-06-17 |
ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. 0.725Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions.This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation.We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones.Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts. |
2025-06-17 |
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. 0.611However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension.In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs.We first identify a unique characteristic of diffusion LLMs, unlike auto-regressive LLMs, they maintain remarkably \textbf{\textit{stable perplexity}} during direct context extrapolation.Furthermore, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct \textbf{\textit{local perception}} phenomenon, enabling successful retrieval from recent context segments.We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory.Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation.Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs.Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short.Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs. |
2025-06-17 |
How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison
As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. 0.667Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience.To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time.In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning.We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation.To provide a comparative baseline, we recruit eight human participants to complete the same task.Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans.These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks. |
2025-06-17 |
Doppelgänger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack
Since the advent of large language models, prompt engineering now enables the rapid, low-effort creation of diverse autonomous agents that are already in widespread use.Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed to user's attempts.In this paper, we propose the ''Doppelg\"anger method'' to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information.Next, we define the ''Prompt Alignment Collapse under Adversarial Transfer (PACAT)''level to evaluate the vulnerability to this adversarial transfer attack.We also propose a ''Caution for Adversarial Transfer (CAT)''prompt to counter the Doppelg\"anger method. 0.601The experimental results demonstrate that the Doppelg\"anger method can compromise the agent's consistency and expose its internal information.In contrast, CAT prompts enable effective defense against this adversarial attack. |
2025-06-17 |
Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models
Large Language Models (LLMs) have shown impressive moral reasoning abilities.Yet they often diverge when confronted with complex, multi-factor moral dilemmas. 0.653To address these discrepancies, we propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus.Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability.For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity.Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity.These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems. |
2025-06-17 |
AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses.Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models.As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. 0.697In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example.We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings.Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance.Performance differences between prompting approaches are conditional on the LLM used.Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning.We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data.Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs.In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research. |
2025-06-17 |
Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments
The increasing sophistication of large language models (LLMs) has sparked growing concerns regarding their potential role in exacerbating ideological polarization through the automated generation of persuasive and biased content. 0.659This study explores the extent to which fine-tuned LLMs can replicate and amplify polarizing discourse within online environments.Using a curated dataset of politically charged discussions extracted from Reddit, we fine-tune an open-source LLM to produce context-aware and ideologically aligned responses.The model's outputs are evaluated through linguistic analysis, sentiment scoring, and human annotation, with particular attention to credibility and rhetorical alignment with the original discourse. 0.75The results indicate that, when trained on partisan data, LLMs are capable of producing highly plausible and provocative comments, often indistinguishable from those written by humans.These findings raise significant ethical questions about the use of AI in political discourse, disinformation, and manipulation campaigns.The paper concludes with a discussion of the broader implications for AI governance, platform regulation, and the development of detection tools to mitigate adversarial fine-tuning risks. |
2025-06-17 |
DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization
Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness.Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems.This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and R'enyi divergences for enhanced stability and robustness.We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected.DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability. 0.944Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney.Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities.Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR).DETONATE and complete code are publicly released. |
2025-06-17 |
From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction?
While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) for structured numerical data is still not well explored.In this study, we test the effectiveness of LLMs in predicting diabetes using zero-shot, one-shot, and three-shot prompting methods. 0.613We conduct an empirical analysis using the Pima Indian Diabetes Database (PIDD).We evaluate six LLMs, including four open-source models: Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B.We also test two proprietary models: GPT-4o and Gemini Flash 2.0.In addition, we compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM).We use accuracy, precision, recall, and F1-score as evaluation metrics.Our results show that proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings.Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score.However, there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning.This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions. |
2025-06-17 |
Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings
As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. 0.801In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. 0.877This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options.We applied this framework to a popular language model for simulating people's opinions in various public surveys and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. 0.837This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects. 0.878 |
2025-06-16 |
Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models
Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications. 0.629Recently, Nguyen et al. 2024's "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling.The significance of these claims was underscored by the paper's recognition as the 18th highest-scoring submission to ICLR 2025 and selection for an Oral presentation.This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence.First, the original paper's human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity; in response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric that nevertheless provides further evidence min-p does not improve over baselines. 0.809Second, comprehensively sweeping the original paper's NLP benchmarks reveals min-p does not surpass baselines when controlling for the number of hyperparameters.Third, the original paper's LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. 0.681Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading.We conclude that evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity. |
2025-06-16 |
An LLM's Apology: Outsourcing Awkwardness in the Age of AI
A key part of modern social dynamics is flaking at short notice. 0.852However, anxiety in coming up with believable and socially acceptable reasons to do so can instead lead to 'ghosting', awkwardness, or implausible excuses, risking emotional harm and resentment in the other party. 0.708The ability to delegate this task to a Large Language Model (LLM) could substantially reduce friction and enhance the flexibility of user's social life while greatly minimising the aforementioned creative burden and moral qualms.We introduce FLAKE-Bench, an evaluation of models' capacity to effectively, kindly, and humanely extract themselves from a diverse set of social, professional and romantic scenarios.We report the efficacy of 10 frontier or recently-frontier LLMs in bailing on prior commitments, because nothing says "I value our friendship" like having AI generate your cancellation texts.We open-source FLAKE-Bench at github.com/Cloakless/flake-bench to support future research. |
2025-06-16 |
Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems
With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients' medical conditions.However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation. 0.697If the model can provide appropriate comfort and empathy based on the patient's negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process.To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process. 0.688We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient's emotions while addressing their concerns. 0.662The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients' questions. 0.63Compared to the original LLM model, our experimental results demonstrate that our methodology significantly enhances the model's ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers. |
2025-06-16 |
AI-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns
We present and evaluate a suite of proof-of-concept (PoC), structured workflow prompts designed to elicit human-like hierarchical reasoning while guiding Large Language Models (LLMs) in the high-level semantic and linguistic analysis of scholarly manuscripts.The prompts target two non-trivial analytical tasks within academic summaries (abstracts and conclusions): identifying unsubstantiated claims (informational integrity) and flagging semantically confusing ambiguous pronoun references (linguistic clarity).We conducted a systematic, multi-run evaluation on two frontier models (Gemini Pro 2.5 Pro and ChatGPT Plus o3) under varied context conditions.Our results for the informational integrity task reveal a significant divergence in model performance: while both models successfully identified an unsubstantiated head of a noun phrase (95% success), ChatGPT consistently failed (0% success) to identify an unsubstantiated adjectival modifier that Gemini correctly flagged (95% success), raising a question regarding the potential influence of the target's syntactic role.For the linguistic analysis task, both models performed well (80-90% success) with full manuscript context. 0.675Surprisingly, in a summary-only setting, Gemini's performance was substantially degraded, while ChatGPT achieved a perfect (100%) success rate.Our findings suggest that while structured prompting is a viable methodology for complex textual analysis, prompt performance may be highly dependent on the interplay between the model, task type, and context, highlighting the need for rigorous, model-specific testing. |
2025-06-16 |
Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-AI Interactions
As Large Language Models (LLMs) increasingly power applications used by children and adolescents, ensuring safe and age-appropriate interactions has become an urgent ethical imperative. 0.641Despite progress in AI safety, current evaluations predominantly focus on adults, neglecting the unique vulnerabilities of minors engaging with generative AI.We introduce Safe-Child-LLM, a comprehensive benchmark and dataset for systematically assessing LLM safety across two developmental stages: children (7-12) and adolescents (13-17).Our framework includes a novel multi-part dataset of 200 adversarial prompts, curated from red-teaming corpora (e.g., SG-Bench, HarmBench), with human-annotated labels for jailbreak success and a standardized 0-5 ethical refusal scale.Evaluating leading LLMs -- including ChatGPT, Claude, Gemini, LLaMA, DeepSeek, Grok, Vicuna, and Mistral -- we uncover critical safety deficiencies in child-facing scenarios.This work highlights the need for community-driven benchmarks to protect young users in LLM interactions.To promote transparency and collaborative advancement in ethical AI development, we are publicly releasing both our benchmark datasets and evaluation codebase at https://github.com/The-Responsible-AI-Initiative/Safe_Child_LLM_Benchmark.git |
LLMs in Education Research |
|
2025-06-18 |
Large Language Models for Unit Testing: A Systematic Literature Review
Unit testing is a fundamental practice in modern software engineering, with the aim of ensuring the correctness, maintainability, and reliability of individual software components.Very recently, with the advances in Large Language Models (LLMs), a rapidly growing body of research has leveraged LLMs to automate various unit testing tasks, demonstrating remarkable performance and significantly reducing manual effort.However, due to ongoing explorations in the LLM-based unit testing field, it is challenging for researchers to understand existing achievements, open challenges, and future opportunities. 0.535This paper presents the first systematic literature review on the application of LLMs in unit testing until March 2025. 0.586We analyze \numpaper{} relevant papers from the perspectives of both unit testing and LLMs.We first categorize existing unit testing tasks that benefit from LLMs, e.g., test generation and oracle generation.We then discuss several critical aspects of integrating LLMs into unit testing research, including model usage, adaptation strategies, and hybrid approaches. 0.571We further summarize key challenges that remain unresolved and outline promising directions to guide future research in this area.Overall, our paper provides a systematic overview of the research landscape to the unit testing community, helping researchers gain a comprehensive understanding of achievements and promote future research.Our artifacts are publicly available at the GitHub repository: https://github.com/iSEngLab/AwesomeLLM4UT. |
2025-06-18 |
RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks.However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set.Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. 0.518By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone.Moreover, the framework is general and can work across reasoning domains, including math, code, and logic.We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations.These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy. |
2025-06-18 |
Optimizing Web-Based AI Query Retrieval with GPT Integration in LangChain A CoT-Enhanced Prompt Engineering Approach
Large Language Models have brought a radical change in the process of remote learning students, among other aspects of educative activities.Current retrieval of remote learning resources lacks depth in contextual meaning that provides comprehensive information on complex student queries. 0.677This work proposes a novel approach to enhancing remote learning retrieval by integrating GPT-based models within the LangChain framework.We achieve this system in a more intuitive and productive manner using CoT reasoning and prompt engineering.The framework we propose puts much emphasis on increasing the precision and relevance of the retrieval results to return comprehensive and contextually enriched explanations and resources that best suit each student's needs.We also assess the effectiveness of our approach against paradigmatic LLMs and report improvements in user satisfaction and learning outcomes. |
2025-06-18 |
Lessons from Training Grounded LLMs with Verifiable Rewards
Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs).While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available.In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs.We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations.Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. 0.564A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal.Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks.Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs. |
2025-06-17 |
InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking
Large Language Models (LLMs) have demonstrated significant strides across various information retrieval tasks, particularly as rerankers, owing to their strong generalization and knowledge-transfer capabilities acquired from extensive pretraining.In parallel, the rise of LLM-based chat interfaces has raised user expectations, encouraging users to pose more complex queries that necessitate retrieval by ``reasoning'' over documents rather than through simple keyword matching or semantic similarity. 0.503While some recent efforts have exploited reasoning abilities of LLMs for reranking such queries, considerable potential for improvement remains.In that regards, we introduce InsertRank, an LLM-based reranker that leverages lexical signals like BM25 scores during reranking to further improve retrieval performance.InsertRank demonstrates improved retrieval effectiveness on -- BRIGHT, a reasoning benchmark spanning 12 diverse domains, and R2MED, a specialized medical reasoning retrieval benchmark spanning 8 different tasks.We conduct an exhaustive evaluation and several ablation studies and demonstrate that InsertRank consistently improves retrieval effectiveness across multiple families of LLMs, including GPT, Gemini, and Deepseek models.%In addition, we also conduct ablation studies on normalization by varying the scale of the BM25 scores, and positional bias by shuffling the order of the documents.With Deepseek-R1, InsertRank achieves a score of 37.5 on the BRIGHT benchmark.and 51.1 on the R2MED benchmark, surpassing previous methods. |
2025-06-17 |
Don't throw the baby out with the bathwater: How and why deep learning for ARC
The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems.Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language etc.The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains.Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time.We demonstrate that fully committing to deep learning's capacity to acquire novel abstractions yields state-of-the-art performance on ARC.Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks.Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. 0.519We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques.We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT.An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%).Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning. |
2025-06-17 |
Quality Assessment of Python Tests Generated by Large Language Models
The manual generation of test scripts is a time-intensive, costly, and error-prone process, indicating the value of automated solutions.Large Language Models (LLMs) have shown great promise in this domain, leveraging their extensive knowledge to produce test code more efficiently.This study investigates the quality of Python test code generated by three LLMs: GPT-4o, Amazon Q, and LLama 3.3. 0.509We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C).Our analysis includes the identification of errors and test smells, with a focus on correlating these issues to inadequate design patterns.Our findings reveal that most test suites generated by the LLMs contained at least one error or test smell.Assertion errors were the most common, comprising 64% of all identified errors, while the test smell Lack of Cohesion of Test Cases was the most frequently detected (41%).Prompt context significantly influenced test quality; textual prompts with detailed instructions often yielded tests with fewer errors but a higher incidence of test smells.Among the evaluated LLMs, GPT-4o produced the fewest errors in both contexts (10% in C2C and 6% in T2C), whereas Amazon Q had the highest error rates (19% in C2C and 28% in T2C).For test smells, Amazon Q had fewer detections in the C2C context (9%), while LLama 3.3 performed best in the T2C context (10%).Additionally, we observed a strong relationship between specific errors, such as assertion or indentation issues, and test case cohesion smells.These findings demonstrate opportunities for improving the quality of test generation by LLMs and highlight the need for future research to explore optimized generation scenarios and better prompt engineering strategies. 0.608 |
2025-06-17 |
AviationLLM: An LLM-based Knowledge System for Aviation Training
Aviation training is a core link in ensuring flight safety, improving industry efficiency and promoting sustainable development.It not only involves flight simulation but also requires the learning of a great deal of professional aviation theory knowledge. 0.595In the existing training system, the knowledge is mainly imparted by the the instructors. 0.523However, the number of instructors is limited and the professional answers obtained from the Internet are not accurate enough, resulting in low training efficiency.To address this, we introduced LLM, but the basic pre-trained model cannot provide accurate answers to professional fields, so we fine-tuned it.Traditional Supervised Fine-Tuning (SFT) risk generating superficially plausible but factually incorrect responses due to insufficient data coverage.To address this, we employ Direct Preference Optimization(DPO).This paper proposes Retrieval-Augmented LLM Alignment via Direct Preference Optimization(RALA-DPO).We select open source pre-trained LLM Qwen and adapt it to aviation theory training through DPO-based domain alignment.Simultaneously, to mitigate hallucinations caused by training data biases, knowledge obsolescence, or domain knowledge gaps, we implement Retrieval-Augmented Generation(RAG) technology that combines generative and retrieval models.RALA-DPO effectively retrieves relevant information from external knowledge bases and delivers precise and high-quality responses through the generative model.Experimental results demonstrate that RALA-DPO can improve accuracy in response to professional aviation knowledge. 0.514With integrated RAG mechanisms, this system can further improve the accuracy of answers and achieve zero-cost knowledge updates simultaneously. |
2025-06-17 |
ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. 0.611Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions.This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation.We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones.Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts. |
2025-06-17 |
SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
Large Language Models (LLMs) are experiencing rapid advancements in complex reasoning, exhibiting remarkable generalization in mathematics and programming.In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic evaluation of their complex reasoning ability within spatial contexts remains underexplored.To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' spatial intelligence through video-based reasoning tasks.SIRI-Bench comprises nearly 1K video-question-answer triplets, where each problem is embedded in a realistic 3D scene and captured by video.By carefully designing questions and corresponding 3D scenes, our benchmark ensures that solving the questions requires both spatial comprehension for extracting information and high-level reasoning for deriving solutions, making it a challenging benchmark for evaluating VLMs.To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine.This engine, leveraging multiple specialized LLM agents, can generate realistic 3D scenes from abstract math problems, ensuring faithfulness to the original descriptions. 0.521Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of spatial reasoning.We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving. 0.553 |
2025-06-17 |
Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot
In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. 0.502However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks.Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations.We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as \texttt{Qwen2.5-Max} and \texttt{DeepSeek-R1}.Experimental results indicate that these enhanced exemplars still fail to improve the model's reasoning performance.Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability.Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars. |
2025-06-17 |
Unified Software Engineering agent as AI Software Engineer
The growth of Large Language Model (LLM) technology has raised expectations for automated coding.However, software engineering is more than coding and is concerned with activities including maintenance and evolution of a project.In this context, the concept of LLM agents has gained traction, which utilize LLMs as reasoning engines to invoke external tools autonomously.But is an LLM agent the same as an AI software engineer? 0.605In this paper, we seek to understand this question by developing a Unified Software Engineering agent or USEagent.Unlike existing work which builds specialized agents for specific software tasks such as testing, debugging, and repair, our goal is to build a unified agent which can orchestrate and handle multiple capabilities.This gives the agent the promise of handling complex scenarios in software development such as fixing an incomplete patch, adding new features, or taking over code written by others.We envision USEagent as the first draft of a future AI Software Engineer which can be a team member in future software development teams involving both AI and humans. 0.511To evaluate the efficacy of USEagent, we build a Unified Software Engineering bench (USEbench) comprising of myriad tasks such as coding, testing, and patching.USEbench is a judicious mixture of tasks from existing benchmarks such as SWE-bench, SWT-bench, and REPOCOD.In an evaluation on USEbench consisting of 1,271 repository-level software engineering tasks, USEagent shows improved efficacy compared to existing general agents such as OpenHands CodeActAgent.There exist gaps in the capabilities of USEagent for certain coding tasks, which provides hints on further developing the AI Software Engineer of the future. |
2025-06-17 |
AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
While knowledge distillation has become a mature field for compressing large language models (LLMs) into smaller ones by aligning their outputs or internal representations, the distillation of LLM-based agents, which involve planning, memory, and tool use, remains relatively underexplored.Existing agent distillation methods typically replay full teacher trajectories or imitate step-by-step teacher tool usage, but they often struggle to train student agents to dynamically plan and act in novel environments. 0.606We propose AgentDistill, a novel, training-free agent distillation framework that enables efficient and scalable knowledge transfer via direct reuse of Model-Context-Protocols (MCPs), which are structured and reusable task-solving modules autonomously generated by teacher agents. 0.549The reuse of these distilled MCPs enables student agents to generalize their capabilities across domains and solve new problems with minimal supervision or human intervention. 0.569Experiments on biomedical and mathematical benchmarks demonstrate that our distilled student agents, built on small language models, can achieve performance comparable to advanced systems using large LLMs such as OctoTools (GPT-4o), highlighting the effectiveness of our framework in building scalable and cost-efficient intelligent agents. 0.546 |
2025-06-17 |
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. 0.541While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption.To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents.OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior.To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.).Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score).We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety.In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions.The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm. |
2025-06-17 |
FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning
Black-Box Discrete Prompt Learning is a prompt-tuning method that optimizes discrete prompts without accessing model parameters or gradients, making the prompt tuning on a cloud-based Large Language Model (LLM) feasible. 0.525Adapting federated learning to BDPL could further enhance prompt tuning performance by leveraging data from diverse sources.However, all previous research on federated black-box prompt tuning had neglected the substantial query cost associated with the cloud-based LLM service.To address this gap, we conducted a theoretical analysis of query efficiency within the context of federated black-box prompt tuning.Our findings revealed that degrading FedAvg to activate only one client per round, a strategy we called \textit{FedOne}, enabled optimal query efficiency in federated black-box prompt learning.Building on this insight, we proposed the FedOne framework, a federated black-box discrete prompt learning method designed to maximize query efficiency when interacting with cloud-based LLMs.We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results. |
2025-06-17 |
Structured Moral Reasoning in Language Models: A Value-Grounded Evaluation Framework
Large language models (LLMs) are increasingly deployed in domains requiring moral understanding, yet their reasoning often remains shallow, and misaligned with human reasoning.Unlike humans, whose moral reasoning integrates contextual trade-offs, value systems, and ethical theories, LLMs often rely on surface patterns, leading to biased decisions in morally and ethically complex scenarios.To address this gap, we present a value-grounded framework for evaluating and distilling structured moral reasoning in LLMs.We benchmark 12 open-source models across four moral datasets using a taxonomy of prompts grounded in value systems, ethical theories, and cognitive reasoning strategies.Our evaluation is guided by four questions: (1) Does reasoning improve LLM decision-making over direct prompting? 0.57(2) Which types of value/ethical frameworks most effectively guide LLM reasoning?(3) Which cognitive reasoning strategies lead to better moral performance?(4) Can small-sized LLMs acquire moral competence through distillation? 0.503We find that prompting with explicit moral structure consistently improves accuracy and coherence, with first-principles reasoning and Schwartz's + care-ethics scaffolds yielding the strongest gains.Furthermore, our supervised distillation approach transfers moral competence from large to small models without additional inference cost.Together, our results offer a scalable path toward interpretable and value-grounded models. |
2025-06-16 |
ProfiLLM: An LLM-Based Framework for Implicit Profiling of Chatbot Users
Despite significant advancements in conversational AI, large language model (LLM)-powered chatbots often struggle with personalizing their responses according to individual user characteristics, such as technical expertise, learning style, and communication preferences.This lack of personalization is particularly problematic in specialized knowledge-intense domains like IT/cybersecurity (ITSec), where user knowledge levels vary widely.Existing approaches for chatbot personalization primarily rely on static user categories or explicit self-reported information, limiting their adaptability to an evolving perception of the user's proficiency, obtained in the course of ongoing interactions.In this paper, we propose ProfiLLM, a novel framework for implicit and dynamic user profiling through chatbot interactions.This framework consists of a taxonomy that can be adapted for use in diverse domains and an LLM-based method for user profiling in terms of the taxonomy.To demonstrate ProfiLLM's effectiveness, we apply it in the ITSec domain where troubleshooting interactions are used to infer chatbot users' technical proficiency. 0.559Specifically, we developed ProfiLLM[ITSec], an ITSec-adapted variant of ProfiLLM, and evaluated its performance on 1,760 human-like chatbot conversations from 263 synthetic users.Results show that ProfiLLM[ITSec] rapidly and accurately infers ITSec profiles, reducing the gap between actual and predicted scores by up to 55--65\% after a single prompt, followed by minor fluctuations and further refinement.In addition to evaluating our new implicit and dynamic profiling framework, we also propose an LLM-based persona simulation methodology, a structured taxonomy for ITSec proficiency, our codebase, and a dataset of chatbot interactions to support future research. |
2025-06-16 |
Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
Code-switching (CSW) is the act of alternating between two or more languages within a single discourse.This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication.As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs.Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text.This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. 0.504While degradation is evident when foreign tokens disrupt English text$\unicode{x2013}$even under linguistic constraints$\unicode{x2013}$embedding English into other languages often improves comprehension.Though prompting yields mixed results, fine-tuning offers a more stable path to degradation mitigation. |
2025-06-16 |
Designing Deep Learning Frameworks for LLMs:Challenges, Expectations, and Opportunities
Large language models (LLMs) drive significant advancements in real industry applications.LLMs rely on DL frameworks for efficient model construction, distributed execution, and optimized deployment.Their large parameter scale and long execution cycles place extreme demands on DL frameworks in terms of scalability, stability, and efficiency.Therefore, poor usability, limited functionality, and subtle bugs in DL frameworks may hinder development efficiency and cause severe failures or resource waste.However, a fundamental question remains underinvestigated, i.e., What challenges do DL frameworks face in supporting LLMs? 0.513To seek an answer, we investigate these challenges through a large-scale analysis of issue reports from three major DL frameworks (MindSpore, PyTorch, TensorFlow) and eight associated LLM toolkits (e.g., Megatron).We construct a taxonomy of LLM-centric bugs, requirements, and user questions and enrich it through interviews with 11 LLM users and eight DL framework developers, uncovering key technical challenges and misalignments between user needs and developer priorities.Our contributions are threefold: (1) we develop a comprehensive taxonomy comprising four question themes (nine sub-themes), four requirement themes (15 sub-themes), and ten bug themes (45 sub-themes); (2) we assess the perceived importance and priority of these challenges based on practitioner insights; and (3) we identify five key findings across the LLM development and propose five actionable recommendations to improve the reliability, usability, and testability of DL frameworks.Our results highlight critical limitations in current DL frameworks and offer concrete guidance for advancing their support for the next generation of LLM construction and applications. |
2025-06-16 |
Querying Large Automotive Software Models: Agentic vs. Direct LLM Approaches
Large language models (LLMs) offer new opportunities for interacting with complex software artifacts, such as software models, through natural language.They present especially promising benefits for large software models that are difficult to grasp in their entirety, making traditional interaction and analysis approaches challenging.This paper investigates two approaches for leveraging LLMs to answer questions over software models: direct prompting, where the whole software model is provided in the context, and an agentic approach combining LLM-based agents with general-purpose file access tools. 0.562We evaluate these approaches using an Ecore metamodel designed for timing analysis and software optimization in automotive and embedded domains.Our findings show that while the agentic approach achieves accuracy comparable to direct prompting, it is significantly more efficient in terms of token usage.This efficiency makes the agentic approach particularly suitable for the automotive industry, where the large size of software models makes direct prompting infeasible, establishing LLM agents as not just a practical alternative but the only viable solution.Notably, the evaluation was conducted using small LLMs, which are more feasible to be executed locally - an essential advantage for meeting strict requirements around privacy, intellectual property protection, and regulatory compliance.Future work will investigate software models in diverse formats, explore more complex agent architectures, and extend agentic workflows to support not only querying but also modification of software models. |
2025-06-16 |
Enhancing Large Language Models with Reliable Knowledge Graphs
Large Language Models (LLMs) have demonstrated remarkable capabilities in text generation and understanding, yet their reliance on implicit, unstructured knowledge often leads to factual inaccuracies and limited interpretability.Knowledge Graphs (KGs), with their structured, relational representations, offer a promising solution to ground LLMs in verified knowledge.However, their potential remains constrained by inherent noise, incompleteness, and the complexity of integrating their rigid structure with the flexible reasoning of LLMs.This thesis presents a systematic framework to address these limitations, advancing the reliability of KGs and their synergistic integration with LLMs through five interconnected contributions.This thesis addresses these challenges through a cohesive framework that enhances LLMs by refining and leveraging reliable KGs.First, we introduce contrastive error detection, a structure-based method to identify incorrect facts in KGs.This approach is extended by an attribute-aware framework that unifies structural and semantic signals for error correction.Next, we propose an inductive completion model that further refines KGs by completing the missing relationships in evolving KGs.Building on these refined KGs, KnowGPT integrates structured graph reasoning into LLMs through dynamic prompting, improving factual grounding. 0.521These contributions form a systematic pipeline (from error detection to LLM integration), demonstrating that reliable KGs significantly enhance the robustness, interpretability, and adaptability of LLMs. |
2025-06-16 |
Breaking Thought Patterns: A Multi-Dimensional Reasoning Framework for LLMs
Large language models (LLMs) are often constrained by rigid reasoning processes, limiting their ability to generate creative and diverse responses.To address this, a novel framework called LADDER is proposed, combining Chain-of-Thought (CoT) reasoning, Mixture of Experts (MoE) models, and multi-dimensional up/down-sampling strategies which breaks the limitations of traditional LLMs.First, CoT reasoning guides the model through multi-step logical reasoning, expanding the semantic space and breaking the rigidity of thought.Next, MoE distributes the reasoning tasks across multiple expert modules, each focusing on specific sub-tasks.Finally, dimensionality reduction maps the reasoning outputs back to a lower-dimensional semantic space, yielding more precise and creative responses.Extensive experiments across multiple tasks demonstrate that LADDER significantly improves task completion, creativity, and fluency, generating innovative and coherent responses that outperform traditional models.Ablation studies reveal the critical roles of CoT and MoE in enhancing reasoning abilities and creative output. 0.522This work contributes to the development of more flexible and creative LLMs, capable of addressing complex and novel tasks. |
2025-06-16 |
VIS-Shepherd: Constructing Critic for LLM-based Data Visualization Generation
Data visualization generation using Large Language Models (LLMs) has shown promising results but often produces suboptimal visualizations that require human intervention for improvement.In this work, we introduce VIS-Shepherd, a specialized Multimodal Large Language Model (MLLM)-based critic to evaluate and provide feedback for LLM-generated data visualizations.At the core of our approach is a framework to construct a high-quality visualization critique dataset, where we collect human-created visualization instances, synthesize corresponding LLM-generated instances, and construct high-quality critiques.We conduct both model-based automatic evaluation and human preference studies to evaluate the effectiveness of our approach.Our experiments show that even small (7B parameters) open-source MLLM models achieve substantial performance gains by leveraging our high-quality visualization critique dataset, reaching levels comparable to much larger open-source or even proprietary models.Our work demonstrates significant potential for MLLM-based automated visualization critique and indicates promising directions for enhancing LLM-based data visualization generation. 0.51Our project page: https://github.com/bopan3/VIS-Shepherd. |
2025-06-16 |
Socratic RL: A Novel Framework for Efficient Knowledge Acquisition through Iterative Reflection and Viewpoint Distillation
Current Reinforcement Learning (RL) methodologies for Large Language Models (LLMs) often rely on simplistic, outcome-based reward signals (e.g., final answer correctness), which limits the depth of learning from each interaction.This paper introduces Socratic Reinforcement Learning (Socratic-RL), a novel, process-oriented framework designed to address this limitation.Socratic-RL operates on the principle that deeper understanding is achieved by reflecting on the causal reasons for errors and successes within the reasoning process itself.The framework employs a decoupled "Teacher-Student" architecture, where a "Teacher AI" analyzes interaction histories, extracts causal insights, and formulates them into structured "viewpoints." 0.639These viewpoints, acting as distilled guidance, are then used by a "Student AI" to enhance its subsequent reasoning. 0.63A key innovation is the iterative self-improvement of the Teacher AI, enabling its reflective capabilities to evolve through a meta-learning loop. 0.587To manage the accumulation of knowledge, a distillation mechanism compresses learned viewpoints into the Student's parameters. 0.57By focusing on process rather than just outcome, Socratic-RL presents a pathway toward enhanced sample efficiency, superior interpretability, and a more scalable architecture for self-improving AI systems.This paper details the foundational concepts, formal mechanisms, synergies, challenges, and a concrete research roadmap for this proposed framework. |
2025-06-16 |
Delving Into the Psychology of Machines: Exploring the Structure of Self-Regulated Learning via LLM-Generated Survey Responses
Large language models (LLMs) offer the potential to simulate human-like responses and behaviors, creating new opportunities for psychological science.In the context of self-regulated learning (SRL), if LLMs can reliably simulate survey responses at scale and speed, they could be used to test intervention scenarios, refine theoretical models, augment sparse datasets, and represent hard-to-reach populations. 0.584However, the validity of LLM-generated survey responses remains uncertain, with limited research focused on SRL and existing studies beyond SRL yielding mixed results.Therefore, in this study, we examined LLM-generated responses to the 44-item Motivated Strategies for Learning Questionnaire (MSLQ; Pintrich \& De Groot, 1990), a widely used instrument assessing students' learning strategies and academic motivation. 0.583Particularly, we used the LLMs GPT-4o, Claude 3.7 Sonnet, Gemini 2 Flash, LLaMA 3.1-8B, and Mistral Large.We analyzed item distributions, the psychological network of the theoretical SRL dimensions, and psychometric validity based on the latent factor structure.Our results suggest that Gemini 2 Flash was the most promising LLM, showing considerable sampling variability and producing underlying dimensions and theoretical relationships that align with prior theory and empirical findings.At the same time, we observed discrepancies and limitations, underscoring both the potential and current constraints of using LLMs for simulating psychological survey data and applying it in educational contexts. 0.618 |
2025-06-16 |
Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study
Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored.In this work, we address this gap by introducing a framework inspired by cognitive psychology and education.Specifically, we decompose general learning ability into three distinct, complementary dimensions:Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). 0.59We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. 0.566Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs' general learning abilities across three learning cognition dimensions.It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models. |
2025-06-16 |
Instruction Following by Boosting Attention of Large Language Models
Controlling the generation of large language models (LLMs) remains a central challenge to ensure their safe and reliable deployment.While prompt engineering and finetuning are common approaches, recent work has explored latent steering, a lightweight technique that alters LLM internal activations to guide generation. 0.503However, subsequent studies revealed latent steering's effectiveness to be limited, often underperforming simple instruction prompting.To address this limitation, we first establish a benchmark across diverse behaviors for standardized evaluation of steering techniques.Building on insights from this benchmark, we introduce Instruction Attention Boosting (InstABoost), a latent steering method that boosts the strength of instruction prompting by altering the model's attention during generation.InstABoost combines the strengths of existing approaches and is theoretically supported by prior work that suggests that in-context rule following in transformer-based models can be controlled by manipulating attention on instructions.Empirically, InstABoost demonstrates superior control success compared to both traditional prompting and latent steering. |
2025-06-15 |
MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution
LLMs demonstrate strong performance in auto-mated software engineering, particularly for code generation and issue resolution.While proprietary models like GPT-4o achieve high benchmarks scores on SWE-bench, their API dependence, cost, and privacy concerns limit adoption.Open-source alternatives offer transparency but underperform in complex tasks, especially sub-100B parameter models.Although quality Chain-of-Thought (CoT) data can enhance reasoning, current methods face two critical flaws: (1) weak rejection sampling reduces data quality, and (2) inadequate step validation causes error accumulation.These limitations lead to flawed reasoning chains that impair LLMs'ability to learn reliable issue resolution.The paper proposes MCTS-REFINE, an enhanced Monte Carlo Tree Search (MCTS)-based algorithm that dynamically validates and optimizes intermediate reasoning steps through a rigorous rejection sampling strategy, generating high-quality CoT data to improve LLM performance in issue resolution tasks. 0.505Key innovations include: (1) augmenting MCTS with a reflection mechanism that corrects errors via rejection sampling and refinement, (2) decomposing issue resolution into three subtasks-File Localization, Fault Localization, and Patch Generation-each with clear ground-truth criteria, and (3) enforcing a strict sampling protocol where intermediate outputs must exactly match verified developer patches, ensuring correctness across reasoning paths.Experiments on SWE-bench Lite and SWE-bench Verified demonstrate that LLMs fine-tuned with our CoT dataset achieve substantial improvements over baselines.Notably, Qwen2.5-72B- Instruct achieves 28.3%(Lite) and 35.0%(Verified) resolution rates, surpassing SOTA baseline SWE-Fixer-Qwen-72B with the same parameter scale, which only reached 24.7%(Lite) and 32.8%(Verified). |
2025-06-15 |
Domain Specific Benchmarks for Evaluating Multimodal Large Language Models
Large language models (LLMs) are increasingly being deployed across disciplines due to their advanced reasoning and problem solving capabilities.To measure their effectiveness, various benchmarks have been developed that measure aspects of LLM reasoning, comprehension, and problem-solving.While several surveys address LLM evaluation and benchmarks, a domain-specific analysis remains underexplored in the literature.This paper introduces a taxonomy of seven key disciplines, encompassing various domains and application areas where LLMs are extensively utilized. 0.583Additionally, we provide a comprehensive review of LLM benchmarks and survey papers within each domain, highlighting the unique capabilities of LLMs and the challenges faced in their application.Finally, we compile and categorize these benchmarks by domain to create an accessible resource for researchers, aiming to pave the way for advancements toward artificial general intelligence (AGI) |
2025-06-15 |
Large Language Models Enhanced by Plug and Play Syntactic Knowledge for Aspect-based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) generally requires a deep understanding of the contextual information, including the words associated with the aspect terms and their syntactic dependencies.Most existing studies employ advanced encoders (e.g., pre-trained models) to capture such context, especially large language models (LLMs).However, training these encoders is resource-intensive, and in many cases, the available data is insufficient for necessary fine-tuning.Therefore it is challenging for learning LLMs within such restricted environments and computation efficiency requirement. 0.589As a result, it motivates the exploration of plug-and-play methods that adapt LLMs to ABSA with minimal effort.In this paper, we propose an approach that integrates extendable components capable of incorporating various types of syntactic knowledge, such as constituent syntax, word dependencies, and combinatory categorial grammar (CCG).Specifically, we propose a memory module that records syntactic information and is incorporated into LLMs to instruct the prediction of sentiment polarities.Importantly, this encoder acts as a versatile, detachable plugin that is trained independently of the LLM.We conduct experiments on benchmark datasets, which show that our approach outperforms strong baselines and previous approaches, thus demonstrates its effectiveness. |
2025-06-14 |
Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics
As large language models (LLMs) are increasingly used in morally sensitive domains, it is crucial to understand how persona traits affect their moral reasoning and persuasive behavior.We present the first large-scale study of multi-dimensional persona effects in AI-AI debates over real-world moral dilemmas.Using a 6-dimensional persona space (age, gender, country, class, ideology, and personality), we simulate structured debates between AI agents over 131 relationship-based cases. 0.52Our results show that personas affect initial moral stances and debate outcomes, with political ideology and personality traits exerting the strongest influence.Persuasive success varies across traits, with liberal and open personalities reaching higher consensus and win rates.While logit-based confidence grows during debates, emotional and credibility-based appeals diminish, indicating more tempered argumentation over time.These trends mirror findings from psychology and cultural studies, reinforcing the need for persona-aware evaluation frameworks for AI moral reasoning. |
2025-06-13 |
A Short Survey on Formalising Software Requirements using Large Language Models
This paper presents a focused literature survey on the use of large language models (LLM) to assist in writing formal specifications for software.A summary of thirty-five key papers is presented, including examples for specifying programs written in Dafny, C and Java.This paper arose from the project VERIFAI - Traceability and verification of natural language requirements that addresses the challenges in writing formal specifications from requirements that are expressed in natural language.Our methodology employed multiple academic databases to identify relevant research.The AI-assisted tool Elicit facilitated the initial paper selection, which were manually screened for final selection.The survey provides valuable insights and future directions for utilising LLMs while formalising software requirements. 0.518 |
2025-06-13 |
code_transformed: The Influence of Large Language Models on Code
Coding remains one of the most fundamental modes of interaction between humans and machines.With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices.This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized?In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity.By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code.For instance, the proportion of snake\_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025.Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. 0.523Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs.Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style. |
LLMs as Recommender Systems |
|
2025-06-17 |
ImpReSS: Implicit Recommender System for Support Conversations
Following recent advancements in large language models (LLMs), LLM-based chatbots have transformed customer support by automating interactions and providing consistent, scalable service.While LLM-based conversational recommender systems (CRSs) have attracted attention for their ability to enhance the quality of recommendations, limited research has addressed the implicit integration of recommendations within customer support interactions. 0.841In this work, we introduce ImpReSS, an implicit recommender system designed for customer support conversations. 0.61ImpReSS operates alongside existing support chatbots, where users report issues and chatbots provide solutions.Based on a customer support conversation, ImpReSS identifies opportunities to recommend relevant solution product categories (SPCs) that help resolve the issue or prevent its recurrence -- thereby also supporting business growth.Unlike traditional CRSs, ImpReSS functions entirely implicitly and does not rely on any assumption of a user's purchasing intent.Our empirical evaluation of ImpReSS's ability to recommend relevant SPCs that can help address issues raised in support conversations shows promising results, including an MRR@1 (and recall@3) of 0.72 (0.89) for general problem solving, 0.82 (0.83) for information security support, and 0.85 (0.67) for cybersecurity troubleshooting.To support future research, our data and code will be shared upon request. |
2025-06-17 |
Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent
Recent advancements in Large Language Models (LLMs) have significantly propelled the development of Conversational Recommendation Agents (CRAs). 0.762However, these agents often generate short-sighted responses that fail to sustain user guidance and meet expectations.Although preference optimization has proven effective in aligning LLMs with user expectations, it remains costly and performs poorly in multi-turn dialogue.To address this challenge, we introduce a novel multi-turn preference optimization (MTPO) paradigm ECPO, which leverages Expectation Confirmation Theory to explicitly model the evolution of user satisfaction throughout multi-turn dialogues, uncovering the underlying causes of dissatisfaction.These causes can be utilized to support targeted optimization of unsatisfactory responses, thereby achieving turn-level preference optimization.ECPO ingeniously eliminates the significant sampling overhead of existing MTPO methods while ensuring the optimization process drives meaningful improvements.To support ECPO, we introduce an LLM-based user simulator, AILO, to simulate user feedback and perform expectation confirmation during conversational recommendations.Experimental results show that ECPO significantly enhances CRA's interaction capabilities, delivering notable improvements in both efficiency and effectiveness over existing MTPO methods. |
2025-06-16 |
C-TLSAN: Content-Enhanced Time-Aware Long- and Short-Term Attention Network for Personalized Recommendation
Sequential recommender systems aim to model users' evolving preferences by capturing patterns in their historical interactions. 0.76Recent advances in this area have leveraged deep neural networks and attention mechanisms to effectively represent sequential behaviors and time-sensitive interests.In this work, we propose C-TLSAN (Content-Enhanced Time-Aware Long- and Short-Term Attention Network), an extension of the TLSAN architecture that jointly models long- and short-term user preferences while incorporating semantic content associated with items, such as product descriptions. C-TLSAN enriches the recommendation pipeline by embedding textual content linked to users' historical interactions directly into both long-term and short-term attention layers. 0.828This allows the model to learn from both behavioral patterns and rich item content, enhancing user and item representations across temporal dimensions.By fusing sequential signals with textual semantics, our approach improves the expressiveness and personalization capacity of recommendation systems. 0.618We conduct extensive experiments on large-scale Amazon datasets, benchmarking C-TLSAN against state-of-the-art baselines, including recent sequential recommenders based on Large Language Models (LLMs), which represent interaction history and predictions in text form. 0.755Empirical results demonstrate that C-TLSAN consistently outperforms strong baselines in next-item prediction tasks.Notably, it improves AUC by 1.66%, Recall@10 by 93.99%, and Precision@10 by 94.80% on average over the best-performing baseline (TLSAN) across 10 Amazon product categories.These results highlight the value of integrating content-aware enhancements into temporal modeling frameworks for sequential recommendation. 0.738Our code is available at https://github.com/booml247/cTLSAN. |
2025-06-16 |
IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation
Large Language Models (LLMs) have shown strong potential for recommendation by framing item prediction as a token-by-token language generation task. 0.78However, existing methods treat all item tokens equally, simply pursuing likelihood maximization during both optimization and decoding.This overlooks crucial token-level differences in decisiveness-many tokens contribute little to item discrimination yet can dominate optimization or decoding.To quantify token decisiveness, we propose a novel perspective that models item generation as a decision process, measuring token decisiveness by the Information Gain (IG) each token provides in reducing uncertainty about the generated item.Our empirical analysis reveals that most tokens have low IG but often correspond to high logits, disproportionately influencing training loss and decoding, which may impair model performance.Building on these insights, we introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding.Specifically, IGD downweights low-IG tokens during tuning and rebalances decoding to emphasize tokens with high IG.In this way, IGD moves beyond pure likelihood maximization, effectively prioritizing high-decisiveness tokens.Extensive experiments on four benchmark datasets with two LLM backbones demonstrate that IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines. 0.87 |
2025-06-16 |
OneRec Technical Report
Recommender systems have been widely used in various large-scale user-oriented platforms for many years. 0.78However, compared to the rapid developments in the AI community, recommendation systems have not achieved a breakthrough in recent years. 0.715For instance, they still rely on a multi-stage cascaded architecture rather than an end-to-end approach, leading to computational fragmentation and optimization inconsistencies, and hindering the effective application of key breakthrough technologies from the AI community in recommendation scenarios. To address these issues, we propose OneRec, which reshapes the recommendation system through an end-to-end generative approach and achieves promising results. 0.843Firstly, we have enhanced the computational FLOPs of the current recommendation model by 10 $\times$ and have identified the scaling laws for recommendations within certain boundaries. 0.698Secondly, reinforcement learning techniques, previously difficult to apply for optimizing recommendations, show significant potential in this framework.Lastly, through infrastructure optimizations, we have achieved 23.7% and 28.8% Model FLOPs Utilization (MFU) on flagship GPUs during training and inference, respectively, aligning closely with the LLM community.This architecture significantly reduces communication and storage overhead, resulting in operating expense that is only 10.6% of traditional recommendation pipelines.Deployed in Kuaishou/Kuaishou Lite APP, it handles 25% of total queries per second, enhancing overall App Stay Time by 0.54% and 1.24%, respectively.Additionally, we have observed significant increases in metrics such as 7-day Lifetime, which is a crucial indicator of recommendation experience. 0.653We also provide practical lessons and insights derived from developing, optimizing, and maintaining a production-scale recommendation system with significant real-world impact. 0.643 |
2025-06-14 |
Towards Building General Purpose Embedding Models for Industry 4.0 Agents
In this work we focus on improving language models' understanding for asset maintenance to guide the engineer's decisions and minimize asset downtime.Given a set of tasks expressed in natural language for Industry 4.0 domain, each associated with queries related to a specific asset, we want to recommend relevant items and generalize to queries of similar assets. 0.643A task may involve identifying relevant sensors given a query about an asset's failure mode. Our approach begins with gathering a qualitative, expert-vetted knowledge base to construct nine asset-specific task datasets.To create more contextually informed embeddings, we augment the input tasks using Large Language Models (LLMs), providing concise descriptions of the entities involved in the queries.This embedding model is then integrated with a Reasoning and Acting agent (ReAct), which serves as a powerful tool for answering complex user queries that require multi-step reasoning, planning, and knowledge inference. Through ablation studies, we demonstrate that: (a) LLM query augmentation improves the quality of embeddings, (b) Contrastive loss and other methods that avoid in-batch negatives are superior for datasets with queries related to many items, and (c) It is crucial to balance positive and negative in-batch samples.After training and testing on our dataset, we observe a substantial improvement: HIT@1 increases by +54.2%, MAP@100 by +50.1%, and NDCG@10 by +54.7%, averaged across all tasks and models.Additionally, we empirically demonstrate the model's planning and tool invocation capabilities when answering complex questions related to industrial asset maintenance, showcasing its effectiveness in supporting Subject Matter Experts (SMEs) in their day-to-day operations. |
2025-06-10 |
Enhanced Whole Page Optimization via Mixed-Grained Reward Mechanism-Adapted Language Models
Optimizing the presentation of search and recommendation results is crucial to enhancing user experience and engagement. 0.607Whole Page Optimization (WPO) plays a pivotal role in this process, as it directly influences how information is surfaced to users.While Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities in generating coherent and contextually relevant content, fine-tuning these models for complex tasks like WPO presents challenges.Specifically, the need for extensive human-annotated data to mitigate issues such as hallucinations and model instability can be prohibitively expensive, especially in large-scale systems that interact with millions of items daily.In this work, we address the challenge of fine-tuning LLMs for WPO by using user feedback as the supervision.Unlike manually labeled datasets, user feedback is inherently noisy and less precise.To overcome this, we propose a reward-based fine-tuning approach, PageLLM, which employs a mixed-grained reward mechanism that combines page-level and item-level rewards.The page-level reward evaluates the overall quality and coherence, while the item-level reward focuses on the accuracy and relevance of key recommendations.This dual-reward structure ensures that both the holistic presentation and the critical individual components are optimized.We validate PageLLM on both public and industrial datasets.PageLLM outperforms baselines and achieves a 0.44\% GMV increase in an online A/B test with over 10 million users, demonstrating its real-world impact. |
2025-06-09 |
LlamaRec-LKG-RAG: A Single-Pass, Learnable Knowledge Graph-RAG Framework for LLM-Based Ranking
Recent advances in Large Language Models (LLMs) have driven their adoption in recommender systems through Retrieval-Augmented Generation (RAG) frameworks. 0.766However, existing RAG approaches predominantly rely on flat, similarity-based retrieval that fails to leverage the rich relational structure inherent in user-item interactions.We introduce LlamaRec-LKG-RAG, a novel single-pass, end-to-end trainable framework that integrates personalized knowledge graph context into LLM-based recommendation ranking. 0.856Our approach extends the LlamaRec architecture by incorporating a lightweight user preference module that dynamically identifies salient relation paths within a heterogeneous knowledge graph constructed from user behavior and item metadata. 0.62These personalized subgraphs are seamlessly integrated into prompts for a fine-tuned Llama-2 model, enabling efficient and interpretable recommendations through a unified inference step.Comprehensive experiments on ML-100K and Amazon Beauty datasets demonstrate consistent and significant improvements over LlamaRec across key ranking metrics (MRR, NDCG, Recall).LlamaRec-LKG-RAG demonstrates the critical value of structured reasoning in LLM-based recommendations and establishes a foundation for scalable, knowledge-aware personalization in next-generation recommender systems. 0.8Code is available at~\href{https://github.com/VahidAz/LlamaRec-LKG-RAG}{repository}. |
2025-06-09 |
Serendipitous Recommendation with Multimodal LLM
Conventional recommendation systems succeed in identifying relevant content but often fail to provide users with surprising or novel items. 0.713Multimodal Large Language Models (MLLMs) possess the world knowledge and multimodal understanding needed for serendipity, but their integration into billion-item-scale platforms presents significant challenges.In this paper, we propose a novel hierarchical framework where fine-tuned MLLMs provide high-level guidance to conventional recommendation models, steering them towards more serendipitous suggestions. 0.784This approach leverages MLLM strengths in understanding multimodal content and user interests while retaining the efficiency of traditional models for item-level recommendation.This mitigates the complexity of applying MLLMs directly to vast action spaces.We also demonstrate a chain-of-thought strategy enabling MLLMs to discover novel user interests by first understanding video content and then identifying relevant yet unexplored interest clusters.Through live experiments within a commercial short-form video platform serving billions of users, we show that our MLLM-powered approach significantly improves both recommendation serendipity and user satisfaction. 0.641 |
2025-06-07 |
What Makes a Good Natural Language Prompt?
As large language models (LLMs) have progressed towards more human-like and human--AI communications have become prevalent, prompting has emerged as a decisive component.However, there is limited conceptual consensus on what exactly quantifies natural language prompts.We attempt to address this question by conducting a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences from 2022 to 2025 and blogs.We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions.We then examine how existing studies assess their impact on LLMs, revealing their imbalanced support across models and tasks, and substantial research gaps.Further, we analyze correlations among properties in high-quality natural language prompts, deriving prompting recommendations. 0.635We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact.Finally, we discover that instruction-tuning on property-enhanced prompts can result in better reasoning models.Our findings establish a foundation for property-centric prompt evaluation and optimization, bridging the gaps between human--AI communication and opening new prompting research directions. |
2025-06-05 |
PUB: An LLM-Enhanced Personality-Driven User Behaviour Simulator for Recommender System Evaluation
Traditional offline evaluation methods for recommender systems struggle to capture the complexity of modern platforms due to sparse behavioural signals, noisy data, and limited modelling of user personality traits.While simulation frameworks can generate synthetic data to address these gaps, existing methods fail to replicate behavioural diversity, limiting their effectiveness.To overcome these challenges, we propose the Personality-driven User Behaviour Simulator (PUB), an LLM-based simulation framework that integrates the Big Five personality traits to model personalised user behaviour.PUB dynamically infers user personality from behavioural logs (e.g., ratings, reviews) and item metadata, then generates synthetic interactions that preserve statistical fidelity to real-world data.Experiments on the Amazon review datasets show that logs generated by PUB closely align with real user behaviour and reveal meaningful associations between personality traits and recommendation outcomes. 0.613These results highlight the potential of the personality-driven simulator to advance recommender system evaluation, offering scalable, controllable, high-fidelity alternatives to resource-intensive real-world experiments. 0.635 |
2025-06-05 |
Reason-to-Recommend: Using Interaction-of-Thought Reasoning to Enhance LLM Recommendation
Driven by advances in Large Language Models (LLMs), integrating them into recommendation tasks has gained interest due to their strong semantic understanding and prompt flexibility. 0.777Prior work encoded user-item interactions or metadata into prompts for recommendations. 0.666In parallel, LLM reasoning, boosted by test-time scaling and reinforcement learning, has excelled in fields like mathematics and code, where reasoning traces and correctness signals are clear, enabling high performance and interpretability.However, directly applying these reasoning methods to recommendation is ineffective because user feedback is implicit and lacks reasoning supervision.To address this, we propose $\textbf{R2Rec}$, a reasoning-enhanced recommendation framework that samples interaction chains from the user-item graph and converts them into structured interaction-of-thoughts via a progressive masked prompting strategy, with each thought representing stepwise reasoning grounded in interaction context. 0.761This allows LLMs to simulate step-by-step decision-making based on implicit patterns.We design a two-stage training pipeline: supervised fine-tuning teaches basic reasoning from high-quality traces, and reinforcement learning refines reasoning via reward signals, alleviating sparse explicit supervision.Experiments on three real-world datasets show R2Rec outperforms classical and LLM-based baselines with an average $\textbf{10.48%}$ improvement in HitRatio@1 and $\textbf{131.81%}$ gain over the original LLM.Furthermore, the explicit reasoning chains enhance interpretability by revealing the decision process.Our code is available at: https://anonymous.4open.science/r/R2Rec-7C5D. |
2025-06-04 |
GORACS: Group-level Optimal Transport-guided Coreset Selection for LLM-based Recommender Systems
Although large language models (LLMs) have shown great potential in recommender systems, the prohibitive computational costs for fine-tuning LLMs on entire datasets hinder their successful deployment in real-world scenarios. 0.768To develop affordable and effective LLM-based recommender systems, we focus on the task of coreset selection which identifies a small subset of fine-tuning data to optimize the test loss, thereby facilitating efficient LLMs' fine-tuning. 0.714Although there exist some intuitive solutions of subset selection, including distribution-based and importance-based approaches, they often lead to suboptimal performance due to the misalignment with downstream fine-tuning objectives or weak generalization ability caused by individual-level sample selection.To overcome these challenges, we propose GORACS, which is a novel Group-level Optimal tRAnsport-guided Coreset Selection framework for LLM-based recommender systems. 0.723GORACS is designed based on two key principles for coreset selection: 1) selecting the subsets that minimize the test loss to align with fine-tuning objectives, and 2) enhancing model generalization through group-level data selection.Corresponding to these two principles, GORACS has two key components: 1) a Proxy Optimization Objective (POO) leveraging optimal transport and gradient information to bound the intractable test loss, thus reducing computational costs by avoiding repeated LLM retraining, and 2) a two-stage Initialization-Then-Refinement Algorithm (ITRA) for efficient group-level selection.Our extensive experiments across diverse recommendation datasets and tasks validate that GORACS significantly reduces fine-tuning costs of LLMs while achieving superior performance over the state-of-the-art baselines and full data training. 0.813The source code of GORACS are available at https://github.com/Mithas-114/GORACS. |
Production workflows for LLMs |
|
2025-06-18 |
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs).However, efficient, high-quality automated process annotation remains a significant challenge. 0.399To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. 0.382We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning.We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. 0.468Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. 0.45The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility. 0.401 |
2025-06-18 |
RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information.However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs.We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining. 0.364RePCS compares two inference paths: (i) a parametric path using only the query, and (ii) a retrieval-augmented path using both the query and retrieved context by computing the Kullback-Leibler (KL) divergence between their output distributions.A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization.This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional forward pass. 0.426We further derive PAC-style guarantees that link the KL threshold to user-defined false positive and false negative rates. 0.449On the Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918. 0.323This result outperforms the strongest prior method by 6.5 percentage points while keeping latency overhead below 4.7% on an NVIDIA T4 GPU.RePCS offers a lightweight, black-box safeguard to verify whether a RAG system meaningfully leverages retrieval, making it especially valuable in safety-critical applications. |
2025-06-18 |
Lessons from Training Grounded LLMs with Verifiable Rewards
Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs).While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available.In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs.We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. 0.389Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. 0.34A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal.Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. 0.356Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs. |
2025-06-18 |
PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction
Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. 0.41However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences.This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. 0.452We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis.To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates-or even eliminates-this delay through speculative decoding at input time. 0.321PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. 0.406Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time-computation that would otherwise go unused. 0.562 |
2025-06-18 |
Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents
Failure Analysis (FA) is a highly intricate and knowledge-intensive process.The integration of AI components within the computational infrastructure of FA labs has the potential to automate a variety of tasks, including the detection of non-conformities in images, the retrieval of analogous cases from diverse data sources, and the generation of reports from annotated images.However, as the number of deployed AI models increases, the challenge lies in orchestrating these components into cohesive and efficient workflows that seamlessly integrate with the FA process. 0.338This paper investigates the design and implementation of a Large Language Model (LLM)-based Planning Agent (LPA) to assist FA engineers in solving their analysis cases.The LPA integrates LLMs with advanced planning capabilities and external tool utilization, enabling autonomous processing of complex queries, retrieval of relevant data from external systems, and generation of human-readable responses. 0.467Evaluation results demonstrate the agent's operational effectiveness and reliability in supporting FA tasks. |
2025-06-18 |
Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. 0.49While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs.GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers.We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity.Our study highlights the importance of improving LLMs' inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models. 0.312 |
2025-06-18 |
LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters
Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those of large language models (LLMs). 0.39To reduce the latency incurred by inter-GPU communication, a common practice for parallel tasks has been to allocate GPUs based on their physical proximity.However, this long-standing assumption has notable limitations, particularly in large-scale, heterogeneous GPU clusters where bandwidth distribution among GPUs is irregular. 0.353In this paper, we introduce LiteGD, a lightweight and dynamic GPU dispatching system based on global perspectives. 0.512To tackle the difficulty of storing massive GPU topology information, LiteGD adopts a computation-aware design that leverages a lightweight Transformer network trained on sampled data. 0.479Our customized design for network structure ensures both transferability and scalability. 0.4LiteGD also employs a bidirectional tree search approach to find the optimal GPU dispatching in the data generated in the previous step, which can identify near-optimal solutions while reducing search overhead. 0.32We implement and evaluate LiteGD in both real and simulated GPU clusters with homogeneous and heterogeneous interconnects, respectively.Experimental results demonstrate that LiteGD consistently achieves high GPU bandwidth efficacy (approximately 90\%) across various cluster configurations and 80\% in real-world H100 cluster, significantly outperforming conventional default and interconnect topology-aware dispatching methods, particularly in large-scale heterogeneous environments. 0.384 |
2025-06-18 |
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Large Language Models (LLMs) have become indispensable in real-world applications. 0.486However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions.Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. 0.423In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. 0.347Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. 0.536Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. 0.378For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks. 0.375By investigating the ASR landscape of parameters, we attribute the success of LoX to that the extrapolation moves LLM parameters to a flatter zone, thereby less sensitive to perturbations. 0.336The code is available at github.com/VITA-Group/LoX. |
2025-06-18 |
Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts.However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order.To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs.This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities.We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered.Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities. 0.411 |
2025-06-18 |
deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses
Although Rust ensures memory safety by default, it also permits the use of unsafe code, which can introduce memory safety vulnerabilities if misused.Unfortunately, existing tools for detecting memory bugs in Rust typically exhibit limited detection capabilities, inadequately handle Rust-specific types, or rely heavily on manual intervention. To address these limitations, we present deepSURF, a tool that integrates static analysis with Large Language Model (LLM)-guided fuzzing harness generation to effectively identify memory safety vulnerabilities in Rust libraries, specifically targeting unsafe code. 0.347deepSURF introduces a novel approach for handling generics by substituting them with custom types and generating tailored implementations for the required traits, enabling the fuzzer to simulate user-defined behaviors within the fuzzed library.Additionally, deepSURF employs LLMs to augment fuzzing harnesses dynamically, facilitating exploration of complex API interactions and significantly increasing the likelihood of exposing memory safety vulnerabilities. 0.378We evaluated deepSURF on 27 real-world Rust crates, successfully rediscovering 20 known memory safety bugs and uncovering 6 previously unknown vulnerabilities, demonstrating clear improvements over state-of-the-art tools. |
2025-06-18 |
PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection
Phishing websites continue to pose a significant cybersecurity threat, often leveraging deceptive structures, brand impersonation, and social engineering tactics to evade detection.While recent advances in large language models (LLMs) have enabled improved phishing detection through contextual understanding, most existing approaches rely on single-agent classification facing the risks of hallucination and lack interpretability or robustness.To address these limitations, we propose PhishDebate, a modular multi-agent LLM-based debate framework for phishing website detection.PhishDebate employs four specialized agents to independently analyze different textual aspects of a webpage--URL structure, HTML composition, semantic content, and brand impersonation--under the coordination of a Moderator and a final Judge.Through structured debate and divergent thinking, the framework delivers more accurate and interpretable decisions.Extensive evaluations on commercial LLMs demonstrate that PhishDebate achieves 98.2% recall and 98.2% True Positive Rate (TPR) on a real-world phishing dataset, and outperforms single-agent and Chain of Thought (CoT) baselines. 0.341Additionally, its modular design allows agent-level configurability, enabling adaptation to varying resource and application requirements. 0.305 |
2025-06-18 |
CC-LEARN: Cohort-based Consistency Learning
Large language models excel at many tasks but still struggle with consistent, robust reasoning.We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. 0.362To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. 0.321Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. 0.341Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. 0.303These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs. |
2025-06-18 |
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. 0.576This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. 0.334A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types-differing in vocabulary size, token splits, and token index ordering. 0.447To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. 0.344GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs.Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs. |
2025-06-18 |
PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning
With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance.Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. 0.631Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. 0.699To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. 0.374Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. 0.385Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%. |
LLM Model Architectures and Training Techniques |
|
2025-06-18 |
All is Not Lost: LLM Recovery without Checkpoints
Training LLMs on decentralized and wimpy computation nodes, e.g., multiple on-spot instances, lowers the training cost and enables model democratization.The inevitable challenge here is the churn of nodes due to failures and the operator's scheduling policies, leading to losing a stage - a part of the model. 0.549The conventional approaches to recover from failures are to either use checkpointing, where periodically a copy of the entire model is sent to an additional storage, or redundant computation.These approaches yield significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. 0.43In this paper, we propose, CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. 0.471In contrast to the state of the art, CheckFree requires no additional computation or storage. 0.497However, because of the nature of averaging neighbouring stages, it can only recover failures of intermediate stages. 0.415We further extend our method to CheckFree+ with out-of-order pipeline execution to tolerate crashes of the first and last stages. 0.624Thanks to out-of-order pipelining, behaviour of those stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by simply copying the weights from the immediate neighbour. 0.564To be able to recover the (de)embedding layers, CheckFree+ copies those layers to the neighboring stages, which requires relatively small storage overhead. 0.418We extensively evaluate our method on LLaMa models of model sizes from 124M to 1.5B with varying failure frequencies.In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence in wall-clock time by over 12%. 0.552Both of our proposals can be run via our code available at: https://github.com/gensyn-ai/CheckFree. 0.471 |
2025-06-18 |
Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning
Medical report generation from imaging data remains a challenging task in clinical practice.While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration.In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism.Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features.We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. 0.421Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation.Our code will be made publicly available. |
2025-06-18 |
Context-Informed Grounding Supervision
Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination.In such cases, we expect the model to generate responses by grounding its response in the provided external context.However, prior work has shown that simply appending context at inference time does not ensure grounded generation.To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context.Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models.In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques.In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response.This improved grounding comes without degradation in general downstream performance. 0.465Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context. |
2025-06-18 |
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs).However, efficient, high-quality automated process annotation remains a significant challenge. 0.606To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. 0.492We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. 0.431We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs.Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation.The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility. |
2025-06-18 |
Optimizing Web-Based AI Query Retrieval with GPT Integration in LangChain A CoT-Enhanced Prompt Engineering Approach
Large Language Models have brought a radical change in the process of remote learning students, among other aspects of educative activities.Current retrieval of remote learning resources lacks depth in contextual meaning that provides comprehensive information on complex student queries.This work proposes a novel approach to enhancing remote learning retrieval by integrating GPT-based models within the LangChain framework.We achieve this system in a more intuitive and productive manner using CoT reasoning and prompt engineering.The framework we propose puts much emphasis on increasing the precision and relevance of the retrieval results to return comprehensive and contextually enriched explanations and resources that best suit each student's needs.We also assess the effectiveness of our approach against paradigmatic LLMs and report improvements in user satisfaction and learning outcomes. 0.478 |
2025-06-18 |
RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information.However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs.We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining. 0.447RePCS compares two inference paths: (i) a parametric path using only the query, and (ii) a retrieval-augmented path using both the query and retrieved context by computing the Kullback-Leibler (KL) divergence between their output distributions.A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization.This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional forward pass.We further derive PAC-style guarantees that link the KL threshold to user-defined false positive and false negative rates.On the Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918.This result outperforms the strongest prior method by 6.5 percentage points while keeping latency overhead below 4.7% on an NVIDIA T4 GPU.RePCS offers a lightweight, black-box safeguard to verify whether a RAG system meaningfully leverages retrieval, making it especially valuable in safety-critical applications. |
2025-06-18 |
Lessons from Training Grounded LLMs with Verifiable Rewards
Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs).While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available.In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs.We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations.Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. 0.41A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. 0.402Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. 0.529Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs. 0.408 |
2025-06-18 |
PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction
Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses.However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences.This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. 0.54We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. 0.449To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates-or even eliminates-this delay through speculative decoding at input time. 0.539PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. 0.527Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time-computation that would otherwise go unused. 0.542 |
2025-06-18 |
Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents
Failure Analysis (FA) is a highly intricate and knowledge-intensive process. 0.548The integration of AI components within the computational infrastructure of FA labs has the potential to automate a variety of tasks, including the detection of non-conformities in images, the retrieval of analogous cases from diverse data sources, and the generation of reports from annotated images.However, as the number of deployed AI models increases, the challenge lies in orchestrating these components into cohesive and efficient workflows that seamlessly integrate with the FA process. 0.424This paper investigates the design and implementation of a Large Language Model (LLM)-based Planning Agent (LPA) to assist FA engineers in solving their analysis cases.The LPA integrates LLMs with advanced planning capabilities and external tool utilization, enabling autonomous processing of complex queries, retrieval of relevant data from external systems, and generation of human-readable responses.Evaluation results demonstrate the agent's operational effectiveness and reliability in supporting FA tasks. 0.477 |
2025-06-18 |
Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. 0.418While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs.GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers.We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity.Our study highlights the importance of improving LLMs' inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models. |
2025-06-18 |
LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters
Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those of large language models (LLMs). 0.418To reduce the latency incurred by inter-GPU communication, a common practice for parallel tasks has been to allocate GPUs based on their physical proximity. 0.52However, this long-standing assumption has notable limitations, particularly in large-scale, heterogeneous GPU clusters where bandwidth distribution among GPUs is irregular. 0.422In this paper, we introduce LiteGD, a lightweight and dynamic GPU dispatching system based on global perspectives. 0.498To tackle the difficulty of storing massive GPU topology information, LiteGD adopts a computation-aware design that leverages a lightweight Transformer network trained on sampled data. 0.411Our customized design for network structure ensures both transferability and scalability. 0.423LiteGD also employs a bidirectional tree search approach to find the optimal GPU dispatching in the data generated in the previous step, which can identify near-optimal solutions while reducing search overhead. 0.468We implement and evaluate LiteGD in both real and simulated GPU clusters with homogeneous and heterogeneous interconnects, respectively. 0.48Experimental results demonstrate that LiteGD consistently achieves high GPU bandwidth efficacy (approximately 90\%) across various cluster configurations and 80\% in real-world H100 cluster, significantly outperforming conventional default and interconnect topology-aware dispatching methods, particularly in large-scale heterogeneous environments. 0.543 |
2025-06-18 |
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Large Language Models (LLMs) have become indispensable in real-world applications.However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions.Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. 0.504In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. 0.614Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM.Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. 0.538For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks. 0.546By investigating the ASR landscape of parameters, we attribute the success of LoX to that the extrapolation moves LLM parameters to a flatter zone, thereby less sensitive to perturbations.The code is available at github.com/VITA-Group/LoX. |
2025-06-18 |
Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts.However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order.To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. 0.413This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. 0.407We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered.Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities. 0.502 |
2025-06-18 |
deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses
Although Rust ensures memory safety by default, it also permits the use of unsafe code, which can introduce memory safety vulnerabilities if misused. 0.432Unfortunately, existing tools for detecting memory bugs in Rust typically exhibit limited detection capabilities, inadequately handle Rust-specific types, or rely heavily on manual intervention. 0.438To address these limitations, we present deepSURF, a tool that integrates static analysis with Large Language Model (LLM)-guided fuzzing harness generation to effectively identify memory safety vulnerabilities in Rust libraries, specifically targeting unsafe code. 0.525deepSURF introduces a novel approach for handling generics by substituting them with custom types and generating tailored implementations for the required traits, enabling the fuzzer to simulate user-defined behaviors within the fuzzed library. 0.442Additionally, deepSURF employs LLMs to augment fuzzing harnesses dynamically, facilitating exploration of complex API interactions and significantly increasing the likelihood of exposing memory safety vulnerabilities. 0.558We evaluated deepSURF on 27 real-world Rust crates, successfully rediscovering 20 known memory safety bugs and uncovering 6 previously unknown vulnerabilities, demonstrating clear improvements over state-of-the-art tools. 0.446 |
2025-06-18 |
PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection
Phishing websites continue to pose a significant cybersecurity threat, often leveraging deceptive structures, brand impersonation, and social engineering tactics to evade detection.While recent advances in large language models (LLMs) have enabled improved phishing detection through contextual understanding, most existing approaches rely on single-agent classification facing the risks of hallucination and lack interpretability or robustness.To address these limitations, we propose PhishDebate, a modular multi-agent LLM-based debate framework for phishing website detection.PhishDebate employs four specialized agents to independently analyze different textual aspects of a webpage--URL structure, HTML composition, semantic content, and brand impersonation--under the coordination of a Moderator and a final Judge.Through structured debate and divergent thinking, the framework delivers more accurate and interpretable decisions.Extensive evaluations on commercial LLMs demonstrate that PhishDebate achieves 98.2% recall and 98.2% True Positive Rate (TPR) on a real-world phishing dataset, and outperforms single-agent and Chain of Thought (CoT) baselines.Additionally, its modular design allows agent-level configurability, enabling adaptation to varying resource and application requirements. 0.524 |
2025-06-18 |
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. 0.409This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts.A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types-differing in vocabulary size, token splits, and token index ordering.To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. 0.413GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs.Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs. |
2025-06-18 |
PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning
With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance.Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored.Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice.To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs.Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics.Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%. 0.453 |
Programming applications of LLMs |
|
2025-06-18 |
ChatModel: Automating Reference Model Design and Verification with LLMs
As the complexity of integrated circuit designs continues to escalate, the functional verification becomes increasingly challenging.Reference models, critical for accelerating the verification process, are themselves becoming more intricate and time-consuming to develop.Despite the promise shown by large language models (LLMs) in code programming, effectively generating complex reference models remains a significant hurdle. 0.833To address these challenges, we introduce ChatModel, the first LLM-aided agile reference model generation and verification platform.ChatModel streamlines the transition from design specifications to fully functional reference models by integrating design standardization and hierarchical agile modeling.Employing a building-block generation strategy, it not only enhances the design capabilities of LLMs for reference models but also significantly boosts verification efficiency.We evaluated ChatModel on 300 designs of varying complexity, demonstrating substantial improvements in both efficiency and quality of reference model generation.ChatModel achieved a peak performance improvement of 55.02% compared to alternative methods, with notable enhancements in generation stability, and delivered a 9.18x increase in its capacity to produce reference model designs.Furthermore, it accelerated the iterative process of reference model design and validation by an average of 5.90x compared to traditional approaches.These results highlight the potential of ChatModel to significantly advance the automation of reference model generation and validation. |
2025-06-18 |
Uncovering Intention through LLM-Driven Code Snippet Description Generation
Documenting code snippets is essential to pinpoint key areas where both developers and users should pay attention. 0.723Examples include usage examples and other Application Programming Interfaces (APIs), which are especially important for third-party libraries.With the rise of Large Language Models (LLMs), the key goal is to investigate the kinds of description developers commonly use and evaluate how well an LLM, in this case Llama, can support description generation. 0.752We use NPM Code Snippets, consisting of 185,412 packages with 1,024,579 code snippets. 0.773From there, we use 400 code snippets (and their descriptions) as samples.First, our manual classification found that the majority of original descriptions (55.5%) highlight example-based usage.This finding emphasizes the importance of clear documentation, as some descriptions lacked sufficient detail to convey intent.Second, the LLM correctly identified the majority of original descriptions as "Example" (79.75%), which is identical to our manual finding, showing a propensity for generalization.Third, compared to the originals, the produced description had an average similarity score of 0.7173, suggesting relevance but room for improvement.Scores below 0.9 indicate some irrelevance.Our results show that depending on the task of the code snippet, the intention of the document may differ from being instructions for usage, installations, or descriptive learning examples for any user of a library. |
2025-06-17 |
Quality Assessment of Python Tests Generated by Large Language Models
The manual generation of test scripts is a time-intensive, costly, and error-prone process, indicating the value of automated solutions.Large Language Models (LLMs) have shown great promise in this domain, leveraging their extensive knowledge to produce test code more efficiently. 0.733This study investigates the quality of Python test code generated by three LLMs: GPT-4o, Amazon Q, and LLama 3.3. 0.799We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C).Our analysis includes the identification of errors and test smells, with a focus on correlating these issues to inadequate design patterns.Our findings reveal that most test suites generated by the LLMs contained at least one error or test smell.Assertion errors were the most common, comprising 64% of all identified errors, while the test smell Lack of Cohesion of Test Cases was the most frequently detected (41%).Prompt context significantly influenced test quality; textual prompts with detailed instructions often yielded tests with fewer errors but a higher incidence of test smells.Among the evaluated LLMs, GPT-4o produced the fewest errors in both contexts (10% in C2C and 6% in T2C), whereas Amazon Q had the highest error rates (19% in C2C and 28% in T2C).For test smells, Amazon Q had fewer detections in the C2C context (9%), while LLama 3.3 performed best in the T2C context (10%).Additionally, we observed a strong relationship between specific errors, such as assertion or indentation issues, and test case cohesion smells.These findings demonstrate opportunities for improving the quality of test generation by LLMs and highlight the need for future research to explore optimized generation scenarios and better prompt engineering strategies. |
2025-06-17 |
SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
Large Language Models (LLMs) are experiencing rapid advancements in complex reasoning, exhibiting remarkable generalization in mathematics and programming. 0.754In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic evaluation of their complex reasoning ability within spatial contexts remains underexplored.To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' spatial intelligence through video-based reasoning tasks.SIRI-Bench comprises nearly 1K video-question-answer triplets, where each problem is embedded in a realistic 3D scene and captured by video.By carefully designing questions and corresponding 3D scenes, our benchmark ensures that solving the questions requires both spatial comprehension for extracting information and high-level reasoning for deriving solutions, making it a challenging benchmark for evaluating VLMs.To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine.This engine, leveraging multiple specialized LLM agents, can generate realistic 3D scenes from abstract math problems, ensuring faithfulness to the original descriptions.Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of spatial reasoning.We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving. |
2025-06-17 |
GenerationPrograms: Fine-grained Attribution with Executable Programs
Recent large language models (LLMs) achieve impressive performance in source-conditioned text generation but often fail to correctly provide fine-grained attributions for their outputs, undermining verifiability and trust.Moreover, existing attribution methods do not explain how and why models leverage the provided source documents to generate their final responses, limiting interpretability.To overcome these challenges, we introduce a modular generation framework, GenerationPrograms, inspired by recent advancements in executable "code agent" architectures. 0.728Unlike conventional generation methods that simultaneously generate outputs and attributions or rely on post-hoc attribution, GenerationPrograms decomposes the process into two distinct stages: first, creating an executable program plan composed of modular text operations (such as paraphrasing, compression, and fusion) explicitly tailored to the query, and second, executing these operations following the program's specified instructions to produce the final response.Empirical evaluations demonstrate that GenerationPrograms significantly improves attribution quality at both the document level and sentence level across two long-form question-answering tasks and a multi-document summarization task.We further demonstrate that GenerationPrograms can effectively function as a post-hoc attribution method, outperforming traditional techniques in recovering accurate attributions.In addition, the interpretable programs generated by GenerationPrograms enable localized refinement through modular-level improvements that further enhance overall attribution quality. |
2025-06-17 |
Issue Retrieval and Verification Enhanced Supplementary Code Comment Generation
Issue reports have been recognized to contain rich information for retrieval-augmented code comment generation. 0.806However, how to minimize hallucinations in the generated comments remains significant challenges.In this paper, we propose IsComment, an issue-based LLM retrieval and verification approach for generating method's design rationale, usage directives, and so on as supplementary code comments.We first identify five main types of code supplementary information that issue reports can provide through code-comment-issue analysis.Next, we retrieve issue sentences containing these types of supplementary information and generate candidate code comments. 0.709To reduce hallucinations, we filter out those candidate comments that are irrelevant to the code or unverifiable by the issue report, making the code comment generation results more reliable.Our experiments indicate that compared with LLMs, IsComment increases the coverage of manual supplementary comments from 33.6% to 72.2% for ChatGPT, from 35.8% to 88.4% for GPT-4o, and from 35.0% to 86.2% for DeepSeek-V3.Compared with existing work, IsComment can generate richer and more useful supplementary code comments for programming understanding, which is quantitatively evaluated through the MESIA metric on both methods with and without manual code comments. 0.837 |
2025-06-17 |
Unified Software Engineering agent as AI Software Engineer
The growth of Large Language Model (LLM) technology has raised expectations for automated coding. 0.817However, software engineering is more than coding and is concerned with activities including maintenance and evolution of a project. 0.698In this context, the concept of LLM agents has gained traction, which utilize LLMs as reasoning engines to invoke external tools autonomously.But is an LLM agent the same as an AI software engineer?In this paper, we seek to understand this question by developing a Unified Software Engineering agent or USEagent.Unlike existing work which builds specialized agents for specific software tasks such as testing, debugging, and repair, our goal is to build a unified agent which can orchestrate and handle multiple capabilities.This gives the agent the promise of handling complex scenarios in software development such as fixing an incomplete patch, adding new features, or taking over code written by others.We envision USEagent as the first draft of a future AI Software Engineer which can be a team member in future software development teams involving both AI and humans.To evaluate the efficacy of USEagent, we build a Unified Software Engineering bench (USEbench) comprising of myriad tasks such as coding, testing, and patching.USEbench is a judicious mixture of tasks from existing benchmarks such as SWE-bench, SWT-bench, and REPOCOD.In an evaluation on USEbench consisting of 1,271 repository-level software engineering tasks, USEagent shows improved efficacy compared to existing general agents such as OpenHands CodeActAgent.There exist gaps in the capabilities of USEagent for certain coding tasks, which provides hints on further developing the AI Software Engineer of the future. |
2025-06-16 |
FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
Large Language Models (LLMs) have made significant strides in front-end code generation. 0.953However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end validation is absent.These issues hinder the accurate assessment of model performance.To address these challenges, we present FrontendBench, a benchmark co-developed by humans and LLMs.FrontendBench categorizes tasks based on code functionality and incorporates interactive test scenarios, enabling a more comprehensive and practical evaluation of front-end code generation capabilities. 0.738The benchmark comprises 148 meticulously crafted prompt-test case pairs spanning five levels of web components, from basic UI elements to complex interactive features.Each task reflects realistic front-end development challenges.Furthermore, we introduce an automatic evaluation framework that executes generated code within a sandbox environment and assesses outcomes using predefined test scripts.This framework achieves a 90.54% agreement rate with expert human evaluations, demonstrating high reliability.We benchmark several state-of-the-art LLMs on FrontendBench and observe substantial performance disparities in handling real-world front-end tasks.These results highlight FrontendBench as a reliable and scalable benchmark, supporting consistent multimodal evaluation and providing a robust foundation for future research in front-end code generation.Our data and code will be released soon. |
2025-06-16 |
How Does LLM Reasoning Work for Code? A Survey and a Call to Action
The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks.These advancements have extended into the domain of code, facilitating complex tasks such as code generation, translation, summarization, and repair. 0.646However, their utility for real-world deployment in-the-wild has only recently been studied, particularly on software engineering (SWE) tasks such as GitHub issue resolution.In this study, we examine the code reasoning techniques that underlie the ability to perform such tasks, and examine the paradigms used to drive their performance. 0.684Our contributions in this paper are: (1) the first dedicated survey on code reasoning for code tasks, highlighting overarching strategies, hybrid and agentic approaches; (2) a taxonomy of various techniques used to drive code reasoning; (3) a comprehensive overview of performance on common benchmarks and a showcase of new, under-explored benchmarks with high potential in SWE; (4) an exploration on how core properties of code can be used to explain different reasoning techniques; and (5) gaps and potentially under-explored areas for future research. 0.663 |
2025-06-16 |
Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. 0.645This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication.As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. 0.684Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text.This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. 0.705While degradation is evident when foreign tokens disrupt English text$\unicode{x2013}$even under linguistic constraints$\unicode{x2013}$embedding English into other languages often improves comprehension.Though prompting yields mixed results, fine-tuning offers a more stable path to degradation mitigation. |
2025-06-16 |
Querying Large Automotive Software Models: Agentic vs. Direct LLM Approaches
Large language models (LLMs) offer new opportunities for interacting with complex software artifacts, such as software models, through natural language. 0.863They present especially promising benefits for large software models that are difficult to grasp in their entirety, making traditional interaction and analysis approaches challenging. 0.647This paper investigates two approaches for leveraging LLMs to answer questions over software models: direct prompting, where the whole software model is provided in the context, and an agentic approach combining LLM-based agents with general-purpose file access tools.We evaluate these approaches using an Ecore metamodel designed for timing analysis and software optimization in automotive and embedded domains.Our findings show that while the agentic approach achieves accuracy comparable to direct prompting, it is significantly more efficient in terms of token usage.This efficiency makes the agentic approach particularly suitable for the automotive industry, where the large size of software models makes direct prompting infeasible, establishing LLM agents as not just a practical alternative but the only viable solution.Notably, the evaluation was conducted using small LLMs, which are more feasible to be executed locally - an essential advantage for meeting strict requirements around privacy, intellectual property protection, and regulatory compliance.Future work will investigate software models in diverse formats, explore more complex agent architectures, and extend agentic workflows to support not only querying but also modification of software models. |
2025-06-15 |
Get on the Train or be Left on the Station: Using LLMs for Software Engineering Research
The adoption of Large Language Models (LLMs) is not only transforming software engineering (SE) practice but is also poised to fundamentally disrupt how research is conducted in the field. 0.757While perspectives on this transformation range from viewing LLMs as mere productivity tools to considering them revolutionary forces, we argue that the SE research community must proactively engage with and shape the integration of LLMs into research practices, emphasizing human agency in this transformation.As LLMs rapidly become integral to SE research - both as tools that support investigations and as subjects of study - a human-centric perspective is essential.Ensuring human oversight and interpretability is necessary for upholding scientific rigor, fostering ethical responsibility, and driving advancements in the field.Drawing from discussions at the 2nd Copenhagen Symposium on Human-Centered AI in SE, this position paper employs McLuhan's Tetrad of Media Laws to analyze the impact of LLMs on SE research.Through this theoretical lens, we examine how LLMs enhance research capabilities through accelerated ideation and automated processes, make some traditional research practices obsolete, retrieve valuable aspects of historical research approaches, and risk reversal effects when taken to extremes.Our analysis reveals opportunities for innovation and potential pitfalls that require careful consideration.We conclude with a call to action for the SE research community to proactively harness the benefits of LLMs while developing frameworks and guidelines to mitigate their risks, to ensure continued rigor and impact of research in an AI-augmented future. |
2025-06-15 |
Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?
Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPs and LiveCodeBench) contain questions with medium-level difficulty and pose no challenge to advanced LLMs. 0.942To better reflected the advanced reasoning and code generation ability, We introduce Humanity's Last Code Exam (HLCE), comprising 235 most challenging problems from the International Collegiate Programming Contest (ICPC World Finals) and the International Olympiad in Informatics (IOI) spanning 2010 - 2024.As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation.Through our comprehensive evaluation, we observe that even the strongest reasoning LLMs: o4-mini(high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively.Meanwhile, we propose a novel "self-recognition" task to measure LLMs' awareness of their own capabilities.Results indicate that LLMs' self-recognition abilities are not proportionally correlated with their code generation performance.Finally, our empirical validation of test-time scaling laws reveals that current advanced LLMs have substantial room for improvement on complex programming tasks.We expect HLCE to become a milestone challenge for code generation and to catalyze advances in high-performance reasoning and human-AI collaborative programming.Our code and dataset are also public available(https://github.com/Humanity-s-Last-Code-Exam/HLCE). |
2025-06-15 |
MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution
LLMs demonstrate strong performance in auto-mated software engineering, particularly for code generation and issue resolution. 0.676While proprietary models like GPT-4o achieve high benchmarks scores on SWE-bench, their API dependence, cost, and privacy concerns limit adoption.Open-source alternatives offer transparency but underperform in complex tasks, especially sub-100B parameter models.Although quality Chain-of-Thought (CoT) data can enhance reasoning, current methods face two critical flaws: (1) weak rejection sampling reduces data quality, and (2) inadequate step validation causes error accumulation.These limitations lead to flawed reasoning chains that impair LLMs'ability to learn reliable issue resolution.The paper proposes MCTS-REFINE, an enhanced Monte Carlo Tree Search (MCTS)-based algorithm that dynamically validates and optimizes intermediate reasoning steps through a rigorous rejection sampling strategy, generating high-quality CoT data to improve LLM performance in issue resolution tasks.Key innovations include: (1) augmenting MCTS with a reflection mechanism that corrects errors via rejection sampling and refinement, (2) decomposing issue resolution into three subtasks-File Localization, Fault Localization, and Patch Generation-each with clear ground-truth criteria, and (3) enforcing a strict sampling protocol where intermediate outputs must exactly match verified developer patches, ensuring correctness across reasoning paths.Experiments on SWE-bench Lite and SWE-bench Verified demonstrate that LLMs fine-tuned with our CoT dataset achieve substantial improvements over baselines.Notably, Qwen2.5-72B- Instruct achieves 28.3%(Lite) and 35.0%(Verified) resolution rates, surpassing SOTA baseline SWE-Fixer-Qwen-72B with the same parameter scale, which only reached 24.7%(Lite) and 32.8%(Verified). |
2025-06-15 |
Domain Specific Benchmarks for Evaluating Multimodal Large Language Models
Large language models (LLMs) are increasingly being deployed across disciplines due to their advanced reasoning and problem solving capabilities. 0.698To measure their effectiveness, various benchmarks have been developed that measure aspects of LLM reasoning, comprehension, and problem-solving.While several surveys address LLM evaluation and benchmarks, a domain-specific analysis remains underexplored in the literature.This paper introduces a taxonomy of seven key disciplines, encompassing various domains and application areas where LLMs are extensively utilized.Additionally, we provide a comprehensive review of LLM benchmarks and survey papers within each domain, highlighting the unique capabilities of LLMs and the challenges faced in their application.Finally, we compile and categorize these benchmarks by domain to create an accessible resource for researchers, aiming to pave the way for advancements toward artificial general intelligence (AGI) |
2025-06-13 |
A Short Survey on Formalising Software Requirements using Large Language Models
This paper presents a focused literature survey on the use of large language models (LLM) to assist in writing formal specifications for software. 0.787A summary of thirty-five key papers is presented, including examples for specifying programs written in Dafny, C and Java.This paper arose from the project VERIFAI - Traceability and verification of natural language requirements that addresses the challenges in writing formal specifications from requirements that are expressed in natural language.Our methodology employed multiple academic databases to identify relevant research.The AI-assisted tool Elicit facilitated the initial paper selection, which were manually screened for final selection.The survey provides valuable insights and future directions for utilising LLMs while formalising software requirements. |
2025-06-13 |
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks.Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training but remains under-explored in On-Policy LLM RL.We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training.Our approach includes intermediate supervision and eliminates the need for a separate reward model training.Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking.We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching.Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLM. 0.751TreeRL is open-sourced at https://github.com/THUDM/TreeRL. |
2025-06-13 |
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. 0.721Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain.We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination.A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel.We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications.High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning.LiveCodeBenchPro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning. |
2025-06-13 |
code_transformed: The Influence of Large Language Models on Code
Coding remains one of the most fundamental modes of interaction between humans and machines.With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. 0.946This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? 0.651In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. 0.919By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. 0.819For instance, the proportion of snake\_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025.Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes.Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. 0.799Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style. 0.707 |
2025-06-12 |
ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space
Generation-based fuzzing produces appropriate testing cases according to specifications of input grammars and semantic constraints to test systems and software.However, these specifications require significant manual efforts to construct.This paper proposes a new approach, ELFuzz (Evolution Through Large Language Models for Fuzzing), that automatically synthesizes generation-based fuzzers tailored to a system under test (SUT) via LLM-driven synthesis over fuzzer space. 0.651At a high level, it starts with minimal seed fuzzers and propels the synthesis by fully automated LLM-driven evolution with coverage guidance.Compared to previous approaches, ELFuzz can 1) seamlessly scale to SUTs of real-world sizes -- up to 1,791,104 lines of code in our evaluation -- and 2) synthesize efficient fuzzers that catch interesting grammatical structures and semantic constraints in a human-understandable way.Our evaluation compared ELFuzz with specifications manually written by domain experts and synthesized by state-of-the-art approaches.It shows that ELFuzz achieves up to 434.8% more coverage and triggers up to 174.0% more artificially injected bugs.We also used ELFuzz to conduct a real-world fuzzing campaign on the newest version of cvc5 for 14 days, and encouragingly, it found five 0-day bugs (three are exploitable).Moreover, we conducted an ablation study, which shows that the fuzzer space model, the key component of ELFuzz, contributes the most (up to 62.5%) to the effectiveness of ELFuzz.Further analysis of the fuzzers synthesized by ELFuzz confirms that they catch interesting grammatical structures and semantic constraints in a human-understandable way.The results present the promising potential of ELFuzz for more automated, efficient, and extensible input generation for fuzzing. |
2025-06-12 |
Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements
This study examined code issue detection and revision automation by integrating Large Language Models (LLMs) such as OpenAI's GPT-3.5 Turbo and GPT-4o into software development workflows. 0.904A static code analysis framework detects issues such as bugs, vulnerabilities, and code smells within a large-scale software project.Detailed information on each issue was extracted and organized to facilitate automated code revision using LLMs. 0.771An iterative prompt engineering process is applied to ensure that prompts are structured to produce accurate and organized outputs aligned with the project requirements.Retrieval-augmented generation (RAG) is implemented to enhance the relevance and precision of the revisions, enabling LLM to access and integrate real-time external knowledge.The issue of LLM hallucinations - where the model generates plausible but incorrect outputs - is addressed by a custom-built "Code Comparison App," which identifies and corrects erroneous changes before applying them to the codebase.Subsequent scans using the static code analysis framework revealed a significant reduction in code issues, demonstrating the effectiveness of combining LLMs, static analysis, and RAG to improve code quality, streamline the software development process, and reduce time and resource expenditure. 0.835 |
2025-06-12 |
AutoGEEval++: A Multi-Level and Multi-Geospatial-Modality Automated Evaluation Framework for Large Language Models in Geospatial Code Generation on Google Earth Engine
Geospatial code generation is becoming a key frontier in integrating artificial intelligence with geo-scientific analysis, yet standardised automated evaluation tools for this task remain absent. 0.73This study presents AutoGEEval++, an enhanced framework building on AutoGEEval, and the first automated assessment system for large language models (LLMs) generating geospatial code on Google Earth Engine (GEE). 0.753It supports diverse data modalities and varying task complexities.Built on the GEE Python API, AutoGEEval++ features a benchmark dataset-AutoGEEval++-Bench-with 6,365 test cases across 26 data types and three task categories: unit, combo, and theme tests.It includes a submission programme and a judge module to realise an end-to-end automated evaluation pipeline from code generation to execution-based validation.The framework adopts multi-dimensional metrics-accuracy, resource usage, run-time efficiency, and error types-balancing hallucination control and efficiency, and enabling boundary testing and error pattern analysis.Using AutoGEEval++, we evaluate 24 state-of-the-art LLMs (as of June 2025), including general-purpose, reasoning-enhanced, code-centric, and geoscience-specific models.Results reveal clear performance, stability, and error differences across task types, model designs, and deployment settings, confirming AutoGEEval++'s practical value and scalability in vertical-domain code generation.This work establishes the first standardised evaluation protocol and foundational benchmark for GEE-based LLM code generation, providing a unified basis for performance comparison and a methodological framework for systematic, domain-specific code evaluation. 0.673 |
2025-06-12 |
CogStream: Context-guided Streaming Video Question Answering
Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information.Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing.Furthermore, the inclusion of irrelevant context distracts models from key details.This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream.To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline.Additionally, we present CogReasoner as a baseline model.It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval.Extensive experiments prove the effectiveness of this method.Code will be released soon. 0.779 |
2025-06-12 |
AdaptiveLLM: A Framework for Selecting Optimal Cost-Efficient LLM for Code-Generation Based on CoT Length
While Large Language Models (LLMs) have significantly advanced code generation efficiency, they face inherent challenges in balancing performance and inference costs across diverse programming tasks. 0.779Dynamically selecting the optimal LLM based on task difficulty and resource constraints offers a promising approach to achieve an optimal balance between efficiency and performance.However, existing model selection methods are resource-intensive and often neglect cost efficiency.Moreover, these approaches rely on human-annotated difficulty labels that are frequently inaccessible in real-world settings and may not align with the LLM's own assessment of task difficulty.In this paper, we introduce AdaptiveLLM, a framework that dynamically selects optimal LLMs for a given coding task by automatically assessing task difficulty.Our framework first estimates task difficulty using Chain-of-Thought lengths generated by reasoning model, clusters these into three difficulty levels via k-means, and fine-tunes CodeBERT to embed difficulty-aware features.A trained XGBoost classifier then selects the best model for each problem, optimizing the performance-cost trade-off.Experimental results show that AdaptiveLLM achieves a 7.86% improvement in pass@1 score while reducing resource consumption by 88.9% compared to baseline method ComplexityNet.When compared to a single model, AdaptiveLLM demonstrates an approximately 15% accuracy improvement, while maintaining the same level of cost consumption.Apart from that, the difficulty assessment using CoT provides more reliable selection criteria than human evaluation.Our replication package is available at https://github.com/cjhCoder7/AdaptiveLLM. |
2025-06-12 |
Evaluating Large Language Models on Non-Code Software Engineering Tasks
Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation; however, their effectiveness on non-code Software Engineering (SE) tasks remains underexplored. 0.95We present the first comprehensive benchmark, which we name `Software Engineering Language Understanding' (SELU), for evaluating LLMs on 17 non-code tasks, spanning from identifying whether a requirement is functional or non-functional to estimating the effort and complexity of backlog items. 0.704SELU covers classification, regression, Named Entity Recognition (NER), and Masked Language Modeling (MLM) targets, with data drawn from diverse sources such as code repositories, issue tracking systems, and developer forums.We fine-tune 22 open-source LLMs, prompt two proprietary alternatives, and train two baselines.Performance is measured using metrics such as F1-macro, SMAPE, F1-micro, and accuracy, and compared via the Bayesian signed-rank test.Our results show that moderate-scale decoder-only models consistently form a top-tier, exhibiting high mean performance and low across-task variance, while domain adaptation via code-focused pre-training might yield only modest improvements.These insights guide model selection for non-code SE workflows and highlight directions for expanding SELU to generative and design-oriented scenarios. |
2025-06-12 |
Execution Guided Line-by-Line Code Generation
We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. 0.85While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. 0.825Our method, Execution-Guided Classifier-Free Guidance (EG-CFG), dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions.EG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation.By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure.Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions.Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming tasks. 0.732Our code is available at: https://github.com/boazlavon/eg_cfg |