Ryan's Arxiv FrontPage
Generated on 2026-01-25. This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions. This project was originally created by Vincent Warmerdam; this version modifies his original frontpage for different paper categories. |
|
Prompt Engineering in Large Language Models |
|
|
2026-01-14 |
Evaluating local large language models for structured extraction from endometriosis-specific transvaginal ultrasound reports
In this study, we evaluate a locally-deployed large-language model (LLM) to convert unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports into structured data for imaging informatics workflows. Across 49 eTVUS reports, we compared three LLMs (7B/8B and a 20B-parameter model) against expert human extraction. The 20B model achieved a mean accuracy of 86.02%, substantially outperforming smaller models and confirming the importance of scale in handling complex clinical text. Crucially, we identified a highly complementary error profile: the LLM excelled at syntactic consistency (e.g., date/numeric formatting) where humans faltered, while human experts provided superior semantic and contextual interpretation. We also found that the LLM's semantic errors were fundamental limitations that could not be mitigated by simple prompt engineering. [0.717] These findings strongly support a human-in-the-loop (HITL) workflow in which the on-premise LLM serves as a collaborative tool, not a full replacement. It automates routine structuring and flags potential human errors, enabling imaging specialists to focus on high-level semantic validation. We discuss implications for structured reporting and interactive AI systems in clinical practice. |
|
2026-01-14 |
SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. [0.602] However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed due to lacking practical relevance. Yet, many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states. |
|
2026-01-14 |
Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling
Large Language Models (LLMs) can enhance reasoning capabilities through test-time scaling by generating multiple traces. [0.644] However, the combination of lengthy reasoning traces with multiple sampling introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning as the GPU memory is saturated by KV cache to reduce end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%-70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing-System-AI-Lab/STEP |
|
2026-01-14 |
Interpretable Probability Estimation with LLMs via Shapley Reconstruction
Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events, by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making across diverse fields, such as financial forecasting and preventive healthcare. However, directly prompting LLMs for probability estimation faces significant challenges: their outputs are often noisy, and the underlying predicting process is opaque. In this paper, we propose PRISM: Probability Reconstruction via Shapley Measures, a framework that brings transparency and precision to LLM-based probability estimation. PRISM decomposes an LLM's prediction by quantifying the marginal contribution of each input factor using Shapley values. These factor-level contributions are then aggregated to reconstruct a calibrated final estimate. In our experiments, we demonstrate PRISM improves predictive accuracy over direct prompting and other baselines, across multiple domains including finance, healthcare, and agriculture. [0.669] Beyond performance, PRISM provides a transparent prediction pipeline: our case studies visualize how individual factors shape the final estimate, helping build trust in LLM-based decision support systems. |
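The factor-level attribution mentioned in this abstract can be illustrated with exact Shapley values over a small factor set. This is a minimal sketch, not PRISM's implementation: the factor names and the toy `value` function (a stand-in for an LLM's probability estimate given a subset of factors) are hypothetical, and the "reconstruction" here is only the standard efficiency property of Shapley values, not the paper's calibrated aggregation.

```python
from itertools import combinations
from math import factorial


def shapley_values(factors, value):
    """Exact Shapley values for a set function `value` over `factors`."""
    n = len(factors)
    phi = {}
    for f in factors:
        others = [g for g in factors if g != f]
        total = 0.0
        # Enumerate every subset S of the other factors, grouped by size r.
        for r in range(n):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                # Marginal contribution of f when added to S.
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def value(subset):
    """Hypothetical estimate of an event probability from a factor subset."""
    p = 0.10  # assumed baseline with no factors known
    if "smoker" in subset:
        p += 0.20
    if "age_over_60" in subset:
        p += 0.15
    if "smoker" in subset and "age_over_60" in subset:
        p += 0.05  # interaction term, split evenly by Shapley
    if "exercises" in subset:
        p -= 0.08
    return p


factors = ["smoker", "age_over_60", "exercises"]
phi = shapley_values(factors, value)
baseline = value(frozenset())
# Efficiency: baseline plus all contributions equals the full-information value.
reconstructed = baseline + sum(phi.values())
```

Exact enumeration costs on the order of 2^n evaluations of `value` per factor, so with an LLM behind `value` a sampling-based approximation would likely be needed for more than a handful of factors.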
|
2026-01-14 |
When to Trust: A Causality-Aware Calibration Framework for Accurate Knowledge Graph Retrieval-Augmented Generation
Knowledge Graph Retrieval-Augmented Generation (KG-RAG) extends the RAG paradigm by incorporating structured knowledge from knowledge graphs, enabling Large Language Models (LLMs) to perform more precise and explainable reasoning. While KG-RAG improves factual accuracy in complex tasks, existing KG-RAG models are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable, which raises concerns for deployment in high-stakes domains. To address this issue, we propose Ca2KG, a Causality-aware Calibration framework for KG-RAG. Ca2KG integrates counterfactual prompting, which exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilises predictions across interventions. [0.664] Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy. |
|
2026-01-14 |
Relation Extraction Capabilities of LLMs on Clinical Text: A Bilingual Evaluation for English and Turkish
The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. [0.904] Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning, that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. [0.805] Moreover, evaluations for English performed better than their Turkish counterparts across all evaluated LLMs and prompting techniques. [0.653] Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing. |
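For readers unfamiliar with the micro-F1 scores quoted above, the metric pools true positives, false positives, and false negatives across all relation classes before computing precision and recall. The sketch below is illustrative only: the label names are hypothetical, and treating a `"none"` label as the negative class is an assumption about the evaluation setup, not something the abstract states.

```python
def micro_f1(gold, pred, ignore_label="none"):
    """Micro-averaged F1 over paired gold/predicted relation labels.

    Pairs where both labels equal `ignore_label` count as true negatives
    and do not affect the score (assumed convention).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore_label and p == g:
            tp += 1
        else:
            if p != ignore_label:
                fp += 1  # predicted a relation that is wrong or absent
            if g != ignore_label:
                fn += 1  # missed (or mislabeled) a gold relation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for four candidate entity pairs.
gold = ["rel_a", "rel_b", "none", "rel_a"]
pred = ["rel_a", "none", "rel_b", "rel_b"]
score = micro_f1(gold, pred)
```

In this toy case one prediction is correct, two predicted relations are wrong, and two gold relations are missed, so precision and recall are both 1/3.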
|
2026-01-14 |
The Imperfective Paradox in Large Language Models
Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon where the past progressive aspect entails event realization for activities (e.g., running $\to$ ran) but not for accomplishments (e.g., building $\nrightarrow$ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompting-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. [0.681] Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners. |
|
2026-01-14 |
Population-Aligned Audio Reproduction With LLM-Based Equalizers
Conventional audio equalization is a static process that requires manual and cumbersome adjustments to adapt to changing listening contexts (e.g., mood, location, or social setting). In this paper, we introduce a Large Language Model (LLM)-based alternative that maps natural language text prompts to equalization settings. [0.725] This enables a conversational approach to sound system control. By utilizing data collected from a controlled listening experiment, our models exploit in-context learning and parameter-efficient fine-tuning techniques to reliably align with population-preferred equalization settings. Our evaluation methods, which leverage distributional metrics that capture users' varied preferences, show statistically significant improvements in distributional alignment over random sampling and static preset baselines. These results indicate that LLMs could function as "artificial equalizers," contributing to the development of more accessible, context-aware, and expert-level audio tuning methods. |
|
2026-01-14 |
EvoFSM: Controllable Self-Evolution for Deep Research with Finite State Machines
While LLM-based agents have shown promise for deep research, most existing approaches rely on fixed workflows that struggle to adapt to real-world, open-ended queries. Recent work therefore explores self-evolution by allowing agents to rewrite their own code or prompts to improve problem-solving ability, but unconstrained optimization often triggers instability, hallucinations, and instruction drift. We propose EvoFSM, a structured self-evolving framework that achieves both adaptability and control by evolving an explicit Finite State Machine (FSM) instead of relying on free-form rewriting. EvoFSM decouples the optimization space into macroscopic Flow (state-transition logic) and microscopic Skill (state-specific behaviors), enabling targeted improvements under clear behavioral boundaries. Guided by a critic mechanism, EvoFSM refines the FSM through a small set of constrained operations, and further incorporates a self-evolving memory that distills successful trajectories as reusable priors and failure patterns as constraints for future queries. Extensive evaluations on five multi-hop QA benchmarks demonstrate the effectiveness of EvoFSM. In particular, EvoFSM reaches 58.0% accuracy on the DeepSearch benchmark. Additional results on interactive decision-making tasks further validate its generalization. [0.643] |
|
2026-01-14 |
From Prompt to Protocol: Fast Charging Batteries with Large Language Models
Efficiently optimizing battery charging protocols is challenging because each evaluation is slow, costly, and non-differentiable. Many existing approaches address this difficulty by heavily constraining the protocol search space, which limits the diversity of protocols that can be explored, preventing the discovery of higher-performing solutions. We introduce two gradient-free, LLM-driven closed-loop methods: Prompt-to-Optimizer (P2O), which uses an LLM to propose the code for small neural-network-based protocols, which are then trained by an inner loop, and Prompt-to-Protocol (P2P), which simply writes an explicit function for the current and its scalar parameters. [0.719] Across our case studies, LLM-guided P2O outperforms neural networks designed by Bayesian optimization, evolutionary algorithms, and random search. In a realistic fast charging scenario, both P2O and P2P yield around a 4.2 percent improvement in state of health (capacity retention based health metric under fast charging cycling) over a state-of-the-art multi-step constant current (CC) baseline, with P2P achieving this under matched evaluation budgets (same number of protocol evaluations). These results demonstrate that LLMs can expand the space of protocol functional forms, incorporate language-based constraints, and enable efficient optimization in high cost experimental settings. |
|
2026-01-14 |
LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation
Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. [0.876] Results reveal a significant "Reasoning Gap": while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. [0.605] Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research. |
|
2026-01-14 |
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning. [0.678] |
|
2026-01-13 |
Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought
Despite notable advancements in prompting methods for Large Language Models (LLMs), such as Chain-of-Thought (CoT), existing strategies still suffer from excessive token usage and limited generalisability across diverse reasoning tasks. [0.849] To address these limitations, we propose an Adaptive Causal Prompting with Sketch-of-Thought (ACPS) framework, which leverages structural causal models to infer the causal effect of a query on its answer and adaptively select an appropriate intervention (i.e., standard front-door and conditional front-door adjustments). [0.764] This design enables generalisable causal reasoning across heterogeneous tasks without task-specific retraining. By replacing verbose CoT with concise Sketch-of-Thought, ACPS enables efficient reasoning that significantly reduces token usage and inference cost. Extensive experiments on multiple reasoning benchmarks and LLMs demonstrate that ACPS consistently outperforms existing prompting baselines in terms of accuracy, robustness, and computational efficiency. [0.805] |
|
2026-01-13 |
MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness
Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so-called user proxy agents. [0.716] We present MIRRORBENCH, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MIRRORBENCH features a modular execution engine with typed interfaces, metadata-driven registries, multi-backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance-aware harness. We include three lexical-diversity metrics (MATTR, YULE'S K, and HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason). Across four open datasets, MIRRORBENCH yields variance-aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command-line interface for running experiments, managing configurations and caching, and generating reports. The framework can be accessed at https://github.com/SAP/mirrorbench. |
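Of the lexical-diversity metrics named above, MATTR (moving-average type-token ratio) is the simplest to show: it averages the ratio of unique tokens to total tokens over every sliding window of a fixed length. The following is a minimal sketch of the standard definition, not MIRRORBENCH's code; the whitespace tokenization, the window size, and the fallback to plain type-token ratio for short inputs are assumptions.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over sliding windows of `window` tokens."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Assumed fallback: plain type-token ratio for short utterances.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical utterance, tokenized on whitespace.
tokens = "a b a c a b".split()
diversity = mattr(tokens, window=3)
```

With a window of 3, the four windows have type-token ratios 2/3, 1, 2/3, and 1, so the average is 5/6. Because each window has the same length, MATTR is less sensitive to utterance length than a plain type-token ratio.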
|
2026-01-13 |
Prompt-Based Clarity Evaluation and Topic Detection in Political Question Answering
Automatic evaluation of large language model (LLM) responses requires not only factual correctness but also clarity, particularly in political question-answering. While recent datasets provide human annotations for clarity and evasion, the impact of prompt design on automatic clarity evaluation remains underexplored. [0.796] In this paper, we study prompt-based clarity evaluation using the CLARITY dataset from the SemEval 2026 shared task. [0.79] We compare a GPT-3.5 baseline provided with the dataset against GPT-5.2 evaluated under three prompting strategies: simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples. [0.812] Model predictions are evaluated against human annotations using accuracy and class-wise metrics for clarity and evasion, along with hierarchical exact match. Results show that GPT-5.2 consistently outperforms the GPT-3.5 baseline on clarity prediction, with accuracy improving from 56 percent to 63 percent under chain-of-thought with few-shot prompting. Chain-of-thought prompting yields the highest evasion accuracy at 34 percent, though improvements are less stable across fine-grained evasion categories. [0.678] We further evaluate topic identification and find that reasoning-based prompting improves accuracy from 60 percent to 74 percent relative to human annotations. [0.789] Overall, our findings indicate that prompt design reliably improves high-level clarity evaluation, while fine-grained evasion and topic detection remain challenging despite structured reasoning prompts. [0.87] |
|
2026-01-13 |
Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees
Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model's intrinsic long-chain reasoning. [0.644] In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore diverse tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning. |
|
2026-01-13 |
Demystifying the Slash Pattern in Attention: The Role of RoPE
Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $Δ$-th sub-diagonal for some offset $Δ$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to models and generalize to out-of-distribution prompts. To explain the intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) Queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we theoretically show that these conditions are sufficient to ensure the emergence of SDHs by formalizing them as our modeling assumptions. Particularly, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs. The SDHs generalize to out-of-distribution prompts. [0.631] |
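The link between near-identical queries and keys and slash patterns can be seen in a toy calculation. The sketch below is an idealized illustration, not the paper's analysis: it assumes the query and key vectors are exactly the same at every token (the rank-one limit), represents each RoPE dimension pair as one complex number, and uses made-up frequencies and vector values. Under those assumptions the attention logit depends only on the offset between positions, so it is constant along every sub-diagonal.

```python
import cmath


def rope(vec, pos, freqs):
    """Apply RoPE: rotate each complex component by the angle pos * freq."""
    return [v * cmath.exp(1j * pos * f) for v, f in zip(vec, freqs)]


def score(q, k, i, j, freqs):
    """Attention logit between the query at position i and the key at position j."""
    qi, kj = rope(q, i, freqs), rope(k, j, freqs)
    # Real part of the Hermitian inner product equals the real dot product
    # of the corresponding 2D-rotated vectors.
    return sum(a * b.conjugate() for a, b in zip(qi, kj)).real


# Hypothetical frequencies and token-independent query/key vectors.
freqs = [1.0, 0.3, 0.05]
q = [1 + 0.5j, 0.2 - 1j, 0.7 + 0.1j]
k = [0.9 + 0.4j, 0.1 - 0.8j, 0.6 + 0.2j]

n = 6
scores = [[score(q, k, i, j, freqs) for j in range(n)] for i in range(n)]

# Every entry equals its neighbor one step down the same diagonal.
constant_on_diagonals = all(
    abs(scores[i][j] - scores[i + 1][j + 1]) < 1e-9
    for i in range(n - 1)
    for j in range(n - 1)
)
```

Which offset receives the largest logit depends on the frequencies and on the phases of `q` and `k`; the toy values here only demonstrate the offset-only dependence.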
|
2026-01-13 |
Enhancing Sentiment Classification and Irony Detection in Large Language Models through Advanced Prompt Engineering Techniques
This study investigates the use of prompt engineering to enhance large language models (LLMs), specifically GPT-4o-mini and gemini-1.5-flash, in sentiment analysis tasks. [0.903] It evaluates advanced prompting techniques like few-shot learning, chain-of-thought prompting, and self-consistency against a baseline. [0.81] Key tasks include sentiment classification, aspect-based sentiment analysis, and detecting subtle nuances such as irony. The research details the theoretical background, datasets, and methods used, assessing performance of LLMs as measured by accuracy, recall, precision, and F1 score. Findings reveal that advanced prompting significantly improves sentiment analysis, with the few-shot approach excelling in GPT-4o-mini and chain-of-thought prompting boosting irony detection in gemini-1.5-flash by up to 46%. [0.783] Thus, while advanced prompting techniques overall improve performance, the fact that few-shot prompting works best for GPT-4o-mini and chain-of-thought excels in gemini-1.5-flash for irony detection suggests that prompting strategies must be tailored to both the model and the task. [0.773] This highlights the importance of aligning prompt design with both the LLM's architecture and the semantic complexity of the task. [0.855] |
|
2026-01-13 |
BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts
We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. [0.748] Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. [0.733] Using a standardized protocol with a fixed budget of 5000 new tokens, we evaluate nine open- and closed-source models and observe pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation, a fixed conciseness reminder, attenuates right tails and lowers CSR for all strategies across the majority of models. Our findings position length control as a measurable reliability, cost, and sustainability concern rather than a stylistic quirk. By enabling standardized comparison of length-control robustness across models, BenchOverflow provides a practical basis for selecting deployments that minimize resource waste and operating expense, and for evaluating defenses that curb compute amplification without eroding task performance. |
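The tail-risk measures named in this abstract can be sketched as follows. The abstract does not define them precisely, so this assumes CSR@k is the fraction of responses whose length reaches at least k tokens and that the ECDF is the usual fraction of responses at or below a given length; the sample lengths are made up for illustration.

```python
def csr_at(lengths, cap):
    """Assumed cap-saturation rate: share of responses with length >= cap."""
    return sum(1 for n in lengths if n >= cap) / len(lengths)


def ecdf(lengths, x):
    """Empirical CDF: share of responses with length <= x."""
    return sum(1 for n in lengths if n <= x) / len(lengths)


# Hypothetical output lengths (in tokens) under a 5000-token budget.
lengths = [120, 450, 980, 1000, 2600, 3000, 5000, 5000]

csr_1k = csr_at(lengths, 1000)
csr_3k = csr_at(lengths, 3000)
csr_5k = csr_at(lengths, 5000)
median_mass = ecdf(lengths, 1000)
```

Here five of the eight responses reach 1000 tokens, three reach 3000, and two hit the 5000-token cap. A heavy right tail shows up as CSR values that stay well above zero as the cap increases.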
|
2026-01-13 |
Resisting Manipulative Bots in Memecoin Copy Trading: A Multi-Agent Approach with Chain-of-Thought Reasoning
The launch of \$Trump coin ignited a wave in meme coin investment. Copy trading, as a strategy-agnostic approach that eliminates the need for deep trading knowledge, quickly gains widespread popularity in the meme coin market. However, copy trading is not a guarantee of profitability due to the prevalence of manipulative bots, the uncertainty of the followed wallets' future performance, and the lag in trade execution. Recently, large language models (LLMs) have shown promise in financial applications by effectively understanding multi-modal data and producing explainable decisions. However, a single LLM struggles with complex, multi-faceted tasks such as asset allocation. These challenges are even more pronounced in cryptocurrency markets, where LLMs often lack sufficient domain-specific knowledge in their training data. To address these challenges, we propose an explainable multi-agent system for meme coin copy trading. Inspired by the structure of an asset management team, our system decomposes the complex task into subtasks and coordinates specialized agents to solve them collaboratively. Employing few-shot chain-of-thought (CoT) prompting, each agent acquires professional meme coin trading knowledge, interprets multi-modal data, and generates explainable decisions. [0.668] Using a dataset of 1,000 meme coin projects' transaction data, our empirical evaluation shows that the proposed multi-agent system outperforms both traditional machine learning models and single LLMs, achieving 73% and 70% precision in identifying high-quality meme coin projects and key opinion leader (KOL) wallets, respectively. The selected KOLs collectively generated a total profit of \$500,000 across these projects. |
|
2026-01-13 |
Prism: Towards Lowering User Cognitive Load in LLMs via Complex Intent Understanding
Large Language Models are rapidly emerging as web-native interfaces to social platforms. On the social web, users frequently have ambiguous and dynamic goals, making complex intent understanding, rather than single-turn execution, the cornerstone of effective human-LLM collaboration. Existing approaches attempt to clarify user intents through sequential or parallel questioning, yet they fall short of addressing the core challenge: modeling the logical dependencies among clarification questions. [0.658] Inspired by the Cognitive Load Theory, we propose Prism, a novel framework for complex intent understanding that enables logically coherent and efficient intent clarification. Prism comprises four tailored modules: a complex intent decomposition module, which decomposes user intents into smaller, well-structured elements and identifies logical dependencies among them; a logical clarification generation module, which organizes clarification questions based on these dependencies to ensure coherent, low-friction interactions; an intent-aware reward module, which evaluates the quality of clarification trajectories via an intent-aware reward function and leverages Monte Carlo Sample to simulate user-LLM interactions for large-scale, high-quality training data generation; and a self-evolved intent tuning module, which iteratively refines the LLM's logical clarification capability through data-driven feedback and optimization. Prism consistently outperforms existing approaches across clarification interactions, intent execution, and cognitive load benchmarks. It achieves state-of-the-art logical consistency, reduces logical conflicts to 11.5%, increases user satisfaction by 14.4%, and decreases task completion time by 34.8%. All data and code are released. |
Robustness Tools in LLM Safety |
|
|
2026-01-14 |
SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails
While Large Language Models (LLMs) have powerful capabilities, they remain vulnerable to jailbreak attacks, which is a critical barrier to their safe web real-time application. Current commercial LLM providers deploy output guardrails to filter harmful outputs, yet these defenses are not impenetrable. [0.835] Due to LLMs' reliance on autoregressive, token-by-token inference, their semantic representations lack robustness to spatially structured perturbations, such as redistributing tokens across different rows, columns, or diagonals. Exploiting the Transformer's spatial weakness, we propose SpatialJB to disrupt the model's output generation process, allowing harmful content to bypass guardrails without detection. [0.814] Comprehensive experiments conducted on leading LLMs get nearly 100% ASR, demonstrating the high effectiveness of SpatialJB. Even after adding advanced output guardrails, like the OpenAI Moderation API, SpatialJB consistently maintains a success rate exceeding 75%, outperforming current jailbreak techniques by a significant margin. The proposal of SpatialJB exposes a key weakness in current guardrails and emphasizes the importance of spatial semantics, offering new insights to advance LLM safety research. [0.672] To prevent potential misuse, we also present baseline defense strategies against SpatialJB and evaluate their effectiveness in mitigating such attacks. [0.614] The code for the attack, baseline defenses, and a demo are available at https://anonymous.4open.science/r/SpatialJailbreak-8E63. |
|
2026-01-14 |
Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering
Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap, lacking turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive.We introduce Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) detecting an observable failure signature characterized by repeated category probing with semantically similar, low-marginal-gain responses.SI flags this pattern without requiring causal diagnosis, supporting monitoring in settings where attributing degradation to specific causes may be impractical. 0.675We validate DT in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, distinguishing efficient from stalled dialogue traces and illustrating downstream utility by integrating DT signals into a reinforcement learning (RL) policy.Across these settings, DT provides interpretable turn-level instrumentation that improves policy performance when stalling carries operational costs. |
|
2026-01-13 |
Mechanisms are Transferable: Data-Efficient Low-Resource Adaptation via Circuit-Targeted Supervised Fine-Tuning
Adapting LLMs to low-resource languages is difficult: labeled data is scarce, full-model fine-tuning is unstable, and continued cross-lingual tuning can cause catastrophic forgetting.We propose Circuit-Targeted Supervised Fine-Tuning (CT-SFT): a counterfactual-free adaptation of CD-T (Contextual Decomposition Transformer) that uses a label-balanced mean baseline and task-directional relevance scoring to identify a sparse set of task-relevant attention heads in a proxy-language checkpoint, then transfer learns to a target language by updating only those heads (plus LayerNorm) via head-level gradient masking.Across NusaX-Senti and XNLI, CT-SFT improves cross-lingual accuracy over continued full fine-tuning while updating only a small subset of model parameters.We find an editing-preserving trade-off: harder transfers favor editing circuit heads, while easier transfers often favor near-zero (i.e., low-relevance heads) updates, preserving the source mechanism. 0.61CT-SFT also substantially reduces catastrophic forgetting, preserving proxy/source-language competence during transfer. |
|
2026-01-13 |
Instance-Aligned Captions for Explainable Video Anomaly Detection
Explainable video anomaly detection (VAD) is crucial for safety-critical applications, yet even with recent progress, much of the research still lacks spatial grounding, making the explanations unverifiable. 0.61This limitation is especially pronounced in multi-entity interactions, where existing explainable VAD methods often produce incomplete or visually misaligned descriptions, reducing their trustworthiness.To address these challenges, we introduce instance-aligned captions that link each textual claim to specific object instances with appearance and motion attributes.Our framework captures who caused the anomaly, what each entity was doing, whom it affected, and where the explanation is grounded, enabling verifiable and actionable reasoning.We annotate eight widely used VAD benchmarks and extend the 360-degree egocentric dataset, VIEW360, with 868 additional videos, eight locations, and four new anomaly types, creating VIEW360+, a comprehensive testbed for explainable VAD.Experiments show that our instance-level spatially grounded captions reveal significant limitations in current LLM- and VLM-based methods while providing a robust benchmark for future research in trustworthy and interpretable anomaly detection. |
|
2026-01-13 |
Knowledge-based learning in Text-RAG and Image-RAG
This research analyzed and compared the multi-modal approach in the Vision Transformer (EVA-ViT) based image encoder with the LLaMA or ChatGPT LLM to reduce the hallucination problem and detect diseases in chest x-ray images.In this research, we utilized the NIH Chest X-ray image to train the model and compared it in image-based RAG, text-based RAG, and baseline.[3] [5] As a result, the text-based RAG[2] effectively reduces the hallucination problem by using external knowledge information, and the image-based RAG improved the prediction confidence and calibration by using the KNN methods. 0.658[4] Moreover, the GPT LLM showed better performance, a low hallucination rate, and better Expected Calibration Error (ECE) than the Llama-based model.This research shows the challenge of data imbalance, a complex multi-stage structure, but suggests a large experience environment and a balanced example of use. |
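The abstract above reports Expected Calibration Error (ECE) without stating how it was computed. As a rough illustration only, the sketch below uses the common equal-width binning definition: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is weighted by the bin's share of samples. The function name, the bin count, and the sample values are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,  # assumed bin count; the paper does not state one
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Group (confidence, correctness) pairs into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions: two confident and correct, two less confident with one wrong.
example_ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

A lower value means the reported confidences track actual accuracy more closely; a perfectly calibrated model would score 0 under this definition.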
|
2026-01-13 |
T3: Benchmarking Sycophancy and Skepticism in Causal Judgment
We introduce T3 (Testing Trustworthy Thinking), a diagnostic benchmark designed to rigorously evaluate LLM causal judgment across Pearl's Ladder of Causality.Comprising 454 expert-curated vignettes, T3 prioritizes high-resolution failure analysis, decomposing performance into Utility (sensitivity), Safety (specificity), and Wise Refusal on underdetermined cases.By applying T3 to frontier models, we diagnose two distinct pathologies: a "Skepticism Trap" at L1 (where safety-tuned models like Claude Haiku reject 60% of valid links) and a non-monotonic Scaling Paradox at L3.In the latter, the larger GPT-5.2 underperforms GPT-4-Turbo by 55 points on ambiguous counterfactuals, driven by a collapse into paralysis (excessive hedging) rather than hallucination. 0.648Finally, we use the benchmark to validate a process-verified protocol (RCA), showing that T3 successfully captures the restoration of decisive causal judgment under structured verification. |
|
2026-01-13 |
Semantic Laundering in AI Agent Architectures: Why Tool Boundaries Do Not Confer Epistemic Warrant
LLM-based agent architectures systematically conflate information transport mechanisms with epistemic justification mechanisms.We formalize this class of architectural failures as semantic laundering: a pattern where propositions with absent or weak warrant are accepted by the system as admissible by crossing architecturally trusted interfaces. 0.658We show that semantic laundering constitutes an architectural realization of the Gettier problem: propositions acquire high epistemic status without a connection between their justification and what makes them true.Unlike classical Gettier cases, this effect is not accidental; it is architecturally determined and systematically reproducible.The central result is the Theorem of Inevitable Self-Licensing: under standard architectural assumptions, circular epistemic justification cannot be eliminated.We introduce the Warrant Erosion Principle as the fundamental explanation for this effect and show that scaling, model improvement, and LLM-as-judge schemes are structurally incapable of eliminating a problem that exists at the type level. |
|
2026-01-13 |
When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges
Multi-agent LLM systems routinely generate multiple candidate responses that are aggregated by an LLM judge.To reduce the dominant prefill cost in such pipelines, recent work advocates KV cache reuse across partially shared contexts and reports substantial speedups for generation agents.In this work, we show that these efficiency gains do not transfer uniformly to judge-centric inference.Across GSM8K, MMLU, and HumanEval, we find that reuse strategies that are effective for execution agents can severely perturb judge behavior: end-task accuracy may appear stable, yet the judge's selection becomes highly inconsistent with dense prefill.We quantify this risk using Judge Consistency Rate (JCR) and provide diagnostics showing that reuse systematically weakens cross-candidate attention, especially for later candidate blocks.Our ablation further demonstrates that explicit cross-candidate interaction is crucial for preserving dense-prefill decisions.Overall, our results identify a previously overlooked failure mode of KV cache reuse and highlight judge-centric inference as a distinct regime that demands dedicated, risk-aware system design. 0.631 |
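The abstract names a Judge Consistency Rate (JCR) but does not give its formula. One plausible reading, assumed here, is the fraction of evaluation items on which the judge picks the same candidate under KV cache reuse as it does under dense prefill. The function and variable names below are hypothetical.

```python
from typing import Hashable, Sequence


def judge_consistency_rate(
    dense_choices: Sequence[Hashable],
    reuse_choices: Sequence[Hashable],
) -> float:
    """Share of items where the judge's pick under cache reuse matches dense prefill.

    Assumed definition: the paper's exact JCR formula is not given in the abstract.
    """
    if len(dense_choices) != len(reuse_choices):
        raise ValueError("both runs must cover the same items")
    if not dense_choices:
        raise ValueError("need at least one item")
    agreements = sum(d == r for d, r in zip(dense_choices, reuse_choices))
    return agreements / len(dense_choices)


# Hypothetical judge selections (index of the chosen candidate) for four prompts.
dense = [0, 1, 2, 1]
reused = [0, 1, 0, 1]
jcr = judge_consistency_rate(dense, reused)  # 3 of 4 selections agree
```

Under this reading, end-task accuracy can stay flat while JCR drops, because the judge may swap between equally correct candidates, which matches the abstract's observation that accuracy alone hides the perturbation.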
|
2026-01-13 |
It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models
Despite the recent advancements in NLP with the advent of Large Language Models (LLMs), Entity Linking (EL) for historical texts remains challenging due to linguistic variation, noisy inputs, and evolving semantic conventions.Existing solutions either require substantial training data or rely on domain-specific rules that limit scalability.In this paper, we present MHEL-LLaMo (Multilingual Historical Entity Linking with Large Language MOdels), an unsupervised ensemble approach combining a Small Language Model (SLM) and an LLM.MHEL-LLaMo leverages a multilingual bi-encoder (BELA) for candidate retrieval and an instruction-tuned LLM for NIL prediction and candidate selection via prompt chaining.Our system uses SLM's confidence scores to discriminate between easy and hard samples, applying an LLM only for hard cases.This strategy reduces computational costs while preventing hallucinations on straightforward cases. 0.854We evaluate MHEL-LLaMo on four established benchmarks in six European languages (English, Finnish, French, German, Italian and Swedish) from the 19th and 20th centuries.Results demonstrate that MHEL-LLaMo outperforms state-of-the-art models without requiring fine-tuning, offering a scalable solution for low-resource historical EL.The implementation of MHEL-LLaMo is available on Github. |
|
2026-01-13 |
VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations
Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. 0.896We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. 0.672Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods.Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE).We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels.Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance.We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration.The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge . |
|
2026-01-13 |
On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems
Flaky tests are a common problem in software testing. 0.683They produce inconsistent results when executed multiple times on the same code, invalidating the assumption that a test failure indicates a software defect. 0.688Recent work on LLM-based test generation has identified flakiness as a potential problem with generated tests. 0.635However, its prevalence and underlying causes are unclear.We examined the flakiness of LLM-generated tests in the context of four relational database management systems: SAP HANA, DuckDB, MySQL, and SQLite.We amplified test suites with two LLMs, GPT-4o and Mistral-Large-Instruct-2407, to assess the flakiness of the generated test cases. 0.628Our results suggest that generated tests have a slightly higher proportion of flaky tests compared to existing tests. 0.631Based on a manual inspection, we found that the most common root cause of flakiness was the reliance of a test on a certain order that is not guaranteed ("unordered collection"), which was present in 72 of 115 flaky tests (63%).Furthermore, both LLMs transferred the flakiness from the existing tests to the newly generated tests via the provided prompt context.Our experiments suggest that flakiness transfer is more prevalent in closed-source systems such as SAP HANA than in open-source systems.Our study informs developers on what types of flakiness to expect from LLM-generated tests. 0.654It also highlights the importance of providing LLMs with tailored context when employing LLMs for test generation. |
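The "unordered collection" root cause can be illustrated with a small, hypothetical example that is not drawn from the study's test suites. It uses Python's built-in `sqlite3` module: a query without `ORDER BY` has no guaranteed row order, so a test that compares the raw result list to a fixed list can pass or fail depending on the engine's execution plan. The table and function names are invented for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a small in-memory table (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?)", [("bob",), ("alice",), ("carol",)]
    )
    return conn


def fetch_names(conn: sqlite3.Connection) -> list:
    # No ORDER BY: SQL does not guarantee the order of the returned rows.
    return [row[0] for row in conn.execute("SELECT name FROM users")]


def fetch_names_sorted(conn: sqlite3.Connection) -> list:
    # ORDER BY makes the row order part of the query's contract.
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY name")]


# Fragile pattern (may be flaky): assert fetch_names(conn) == ["bob", "alice", "carol"]
# Robust alternatives: compare order-insensitively, or request an explicit order.
conn = make_db()
assert sorted(fetch_names(conn)) == ["alice", "bob", "carol"]
assert fetch_names_sorted(conn) == ["alice", "bob", "carol"]
```

Either fix removes the dependence on an unspecified order: the first relaxes the assertion, the second tightens the query.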
|
2026-01-12 |
Fake Date Tests: Can We Trust In-sample Accuracy of LLMs in Macroeconomic Forecasting?
Large language models (LLMs) are a type of machine learning tool that economists have started to apply in their empirical research.One such application is macroeconomic forecasting with backtesting of LLMs, even though they are trained on the same data that is used to estimate their forecasting performance.Can these in-sample accuracy results be extrapolated to the model's out-of-sample performance?To answer this question, we developed a family of prompt sensitivity tests and two members of this family, which we call the fake date tests. 0.716These tests aim to detect two types of biases in LLMs' in-sample forecasts: lookahead bias and context bias.According to the empirical results, none of the modern LLMs tested in this study passed our first test, signaling the presence of lookahead bias in their in-sample forecasts. |
|
2026-01-12 |
Automating API Documentation from Crowdsourced Knowledge
API documentation is crucial for developers to learn and use APIs.However, it is known that many official API documents are obsolete and incomplete.To address this challenge, we propose a new approach called AutoDoc that generates API documents with API knowledge extracted from online discussions on Stack Overflow (SO).AutoDoc leverages a fine-tuned dense retrieval model to identify seven types of API knowledge from SO posts.Then, it uses GPT-4o to summarize the API knowledge in these posts into concise text.Meanwhile, we designed two specific components to handle LLM hallucination and redundancy in generated content. 0.641We evaluated AutoDoc against five comparison baselines on 48 APIs of different popularity levels.Our results indicate that the API documents generated by AutoDoc are up to 77.7% more accurate, 9.5% less duplicated, and contain 34.4% knowledge uncovered by the official documents.We also measured the sensitivity of AutoDoc to the choice of different LLMs.We found that while larger LLMs produce higher-quality API documents, AutoDoc enables smaller open-source models (e.g., Mistral-7B-v0.3) to achieve comparable results.Finally, we conducted a user study to evaluate the usefulness of the API documents generated by AutoDoc.All participants found API documents generated by AutoDoc to be more comprehensive, concise, and helpful than the comparison baselines.This highlights the feasibility of utilizing LLMs for API documentation with careful design to counter LLM hallucination and information redundancy. 0.603 |
|
2026-01-12 |
Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
Chain-of-Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs), but it remains unclear why it works and whether it is the unique mechanism for triggering reasoning in large language models.In this work, we study this question by directly analyzing and intervening on the internal representations of LLMs with Sparse Autoencoders (SAEs), identifying a small set of latent features that are causally associated with LLM reasoning behavior.Across multiple model families and reasoning benchmarks, we find that steering a single reasoning-related latent feature can substantially improve accuracy without explicit CoT prompting.For large models, latent steering achieves performance comparable to standard CoT prompting while producing more efficient outputs.We further observe that this reasoning-oriented internal state is triggered early in generation and can override prompt-level instructions that discourage explicit reasoning. 0.618Overall, our results suggest that multi-step reasoning in LLMs is supported by latent internal activations that can be externally activated, while CoT prompting is one effective, but not unique, way of activating this mechanism rather than its necessary cause. |
Security Challenges in LLM Development |
|
|
2026-01-14 |
A Decompilation-Driven Framework for Malware Detection with Large Language Models
The parallel evolution of Large Language Models (LLMs) with advanced code-understanding capabilities and the increasing sophistication of malware presents a new frontier for cybersecurity research. 0.601This paper evaluates the efficacy of state-of-the-art LLMs in classifying executable code as either benign or malicious. 0.785We introduce an automated pipeline that first decompiles Windows executables into C code using the Ghidra disassembler and then leverages LLMs to perform the classification.Our evaluation reveals that while standard LLMs show promise, they are not yet robust enough to replace traditional anti-virus software. 0.625We demonstrate that a fine-tuned model, trained on curated malware and benign datasets, significantly outperforms its vanilla counterpart.However, the performance of even this specialized model degrades notably when encountering newer malware.This finding demonstrates the critical need for continuous fine-tuning with emerging threats to maintain model effectiveness against the changing coding patterns and behaviors of malicious software. 0.721 |
|
2026-01-14 |
KryptoPilot: An Open-World Knowledge-Augmented LLM Agent for Automated Cryptographic Exploitation
Capture-the-Flag (CTF) competitions play a central role in modern cybersecurity as a platform for training practitioners and evaluating offensive and defensive techniques derived from real-world vulnerabilities. 0.637Despite recent advances in large language models (LLMs), existing LLM-based agents remain ineffective on high-difficulty cryptographic CTF challenges, which require precise cryptanalytic knowledge, stable long-horizon reasoning, and disciplined interaction with specialized toolchains.Through a systematic exploratory study, we show that insufficient knowledge granularity, rather than model reasoning capacity, is a primary factor limiting successful cryptographic exploitation: coarse or abstracted external knowledge often fails to support correct attack modeling and implementation. 0.748Motivated by this observation, we propose KryptoPilot, an open-world knowledge-augmented LLM agent for automated cryptographic exploitation. 0.678KryptoPilot integrates dynamic open-world knowledge acquisition via a Deep Research pipeline, a persistent workspace for structured knowledge reuse, and a governance subsystem that stabilizes reasoning through behavioral constraints and cost-aware model routing.This design enables precise knowledge alignment while maintaining efficient reasoning across heterogeneous subtasks.We evaluate KryptoPilot on two established CTF benchmarks and in six real-world CTF competitions.KryptoPilot achieves a complete solve rate on InterCode-CTF, solves between 56 and 60 percent of cryptographic challenges on the NYU-CTF benchmark, and successfully solves 26 out of 33 cryptographic challenges in live competitions, including multiple earliest-solved and uniquely-solved instances.These results demonstrate the necessity of open-world, fine-grained knowledge augmentation and governed reasoning for scaling LLM-based agents to real-world cryptographic exploitation. |
|
2026-01-14 |
STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models
Large Reasoning Models (LRMs) have advanced automated multi-step reasoning, but their ability to generate complex Chain-of-Thought (CoT) trajectories introduces severe privacy risks, as sensitive information may be deeply embedded throughout the reasoning process.Existing Large Language Model (LLM) unlearning approaches that typically focus on modifying only final answers are insufficient for LRMs, as they fail to remove sensitive content from intermediate steps, leading to persistent privacy leakage and degraded security.To address these challenges, we propose Sensitive Trajectory Regulation (STaR), a parameter-free, inference-time unlearning framework that achieves robust privacy protection throughout the reasoning process.Specifically, we first identify sensitive content via semantic-aware detection.Then, we inject global safety constraints through a secure prompt prefix. 0.675Next, we perform trajectory-aware suppression to dynamically block sensitive content across the entire reasoning chain.Finally, we apply token-level adaptive filtering to prevent both exact and paraphrased sensitive tokens during generation.Furthermore, to overcome the inadequacies of existing evaluation protocols, we introduce two metrics: Multi-Decoding Consistency Assessment (MCS), which measures the consistency of unlearning across diverse decoding strategies, and Multi-Granularity Membership Inference Attack (MIA) Evaluation, which quantifies privacy protection at both answer and reasoning-chain levels.Experiments on the R-TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss, setting a new standard for privacy-preserving reasoning in LRMs. |
|
2026-01-14 |
Blue Teaming Function-Calling Agents
We present an experimental evaluation that assesses the robustness of four open source LLMs claiming function-calling capabilities against three different attacks, and we measure the effectiveness of eight different defences. 0.811Our results show how these models are not safe by default, and how the defences are not yet employable in real-world scenarios. 0.79 |
|
2026-01-14 |
The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multi-Step Malware
The rapid adoption of large language model (LLM)-based systems -- from chatbots to autonomous agents capable of executing code and financial transactions -- has created a new attack surface that existing security frameworks inadequately address. 0.765The dominant framing of these threats as "prompt injection" -- a catch-all phrase for security failures in LLM-based systems -- obscures a more complex reality: Attacks on LLM-based systems increasingly involve multi-step sequences that mirror traditional malware campaigns. 0.862In this paper, we propose that attacks targeting LLM-based applications constitute a distinct class of malware, which we term promptware, and introduce a five-step kill chain model for analyzing these threats. 0.784The framework comprises Initial Access (prompt injection), Privilege Escalation (jailbreaking), Persistence (memory and retrieval poisoning), Lateral Movement (cross-system and cross-user propagation), and Actions on Objective (ranging from data exfiltration to unauthorized transactions).By mapping recent attacks to this structure, we demonstrate that LLM-related attacks follow systematic sequences analogous to traditional malware campaigns. 0.758The promptware kill chain offers security practitioners a structured methodology for threat modeling and provides a common vocabulary for researchers across AI safety and cybersecurity to address a rapidly evolving threat landscape. 0.807 |
|
2026-01-13 |
Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment
Public large language models (LLMs) are typically safety-aligned during pretraining, yet task-specific fine-tuning required for deployment often erodes this alignment and introduces safety risks.Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction, leaving safety recovery tightly coupled with training and incurring high computational overhead and a complex workflow. 0.622To address these challenges, we propose Q-realign, a post-hoc defense method based on post-training quantization, guided by an analysis of representational structure.By reframing quantization as a dual-objective procedure for compression and safety, Q-realign decouples safety alignment from fine-tuning and naturally piggybacks into modern deployment pipelines.Experiments across multiple models and datasets demonstrate that our method substantially reduces unsafe behaviors while preserving task performance, with significant reductions in memory usage and GPU hours. 0.724Notably, our approach can recover the safety alignment of a fine-tuned 7B LLM on a single RTX 4090 within 40 minutes.Overall, our work provides a practical, turnkey solution for safety-aware deployment. |
|
2026-01-13 |
DNF: Dual-Layer Nested Fingerprinting for Large Language Model Intellectual Property Protection
The rapid growth of large language models raises pressing concerns about intellectual property protection under black-box deployment.Existing backdoor-based fingerprints either rely on rare tokens -- leading to high-perplexity inputs susceptible to filtering -- or use fixed trigger-response mappings that are brittle to leakage and post-hoc adaptation. 0.653We propose Dual-Layer Nested Fingerprinting (DNF), a black-box method that embeds a hierarchical backdoor by coupling domain-specific stylistic cues with implicit semantic triggers.Across Mistral-7B, LLaMA-3-8B-Instruct, and Falcon3-7B-Instruct, DNF achieves perfect fingerprint activation while preserving downstream utility.Compared with existing methods, it uses lower-perplexity triggers, remains undetectable under fingerprint detection attacks, and is relatively robust to incremental fine-tuning and model merging.These results position DNF as a practical, stealthy, and resilient solution for LLM ownership verification and intellectual property protection. |
|
2026-01-13 |
STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio
Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT).However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. 0.744Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. 0.698To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. 0.611STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model's general knowledge. 0.686We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies.Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC ≈ 1.0) with approximately 42× greater efficiency than existing baselines.Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection. 0.804 |
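The abstract describes scoring how much an input amplifies a reasoning path relative to the model's prior, then running CUSUM over those scores to catch persistent anomalies. The sketch below is a generic one-sided CUSUM detector, not the paper's implementation. It assumes the per-step score is the difference between posterior and prior log-probabilities, and the `drift` and `threshold` values are arbitrary illustrative choices.

```python
from typing import List, Optional, Sequence


def amplification_scores(
    posterior_logp: Sequence[float], prior_logp: Sequence[float]
) -> List[float]:
    """Per-step log-ratio of posterior to prior probability (assumed score definition)."""
    return [post - prior for post, prior in zip(posterior_logp, prior_logp)]


def cusum_first_alarm(
    scores: Sequence[float],
    drift: float = 0.5,      # assumed allowance subtracted at each step
    threshold: float = 4.0,  # assumed alarm level
) -> Optional[int]:
    """One-sided CUSUM: return the first index where the cumulative excess crosses the threshold."""
    cumulative = 0.0
    for i, score in enumerate(scores):
        # Accumulate only the excess above the drift allowance; never drop below zero.
        cumulative = max(0.0, cumulative + score - drift)
        if cumulative > threshold:
            return i
    return None


# Hypothetical log-probabilities for a four-step reasoning path.
posterior = [-1.0, -0.5, -0.4, -0.3]
prior = [-1.1, -2.5, -2.9, -3.3]
alarm_step = cusum_first_alarm(amplification_scores(posterior, prior))
```

Because CUSUM accumulates evidence across steps, a single surprising token does not raise an alarm, while a sustained run of amplified steps does, which fits the abstract's emphasis on persistent anomalies.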
|
2026-01-13 |
LLMs in Code Vulnerability Analysis: A Proof of Concept
Context: Traditional software security analysis methods struggle to keep pace with the scale and complexity of modern codebases, requiring intelligent automation to detect, assess, and remediate vulnerabilities more efficiently and accurately. 0.814Objective: This paper explores the incorporation of code-specific and general-purpose Large Language Models (LLMs) to automate critical software security tasks, such as identifying vulnerabilities, predicting severity and access complexity, and generating fixes as a proof of concept. 0.829Method: We evaluate five pairs of recent LLMs, including both code-based and general-purpose open-source models, on two recognized C/C++ vulnerability datasets, namely Big-Vul and Vul-Repair. 0.813Additionally, we compare fine-tuning and prompt-based approaches.Results: The results show that fine-tuning uniformly outperforms both zero-shot and few-shot approaches across all tasks and models.Notably, code-specialized models excel in zero-shot and few-shot settings on complex tasks, while general-purpose models remain nearly as effective.Discrepancies among CodeBLEU, CodeBERTScore, BLEU, and ChrF highlight the inadequacy of current metrics for measuring repair quality.Conclusions: This study contributes to the software security community by investigating the potential of advanced LLMs to improve vulnerability analysis and remediation. 0.862 |
|
2026-01-13 |
Evaluating Role-Consistency in LLMs for Counselor Training
The rise of online counseling services has highlighted the need for effective training methods for future counselors.This paper extends research on VirCo, a Virtual Client for Online Counseling, designed to complement traditional role-playing methods in academic training by simulating realistic client interactions.Building on previous work, we introduce a new dataset incorporating adversarial attacks to test the ability of large language models (LLMs) to maintain their assigned roles (role-consistency). 0.836The study focuses on evaluating the role consistency and coherence of the Vicuna model's responses, comparing these findings with earlier research.Additionally, we assess and compare various open-source LLMs for their performance in sustaining role consistency during virtual client interactions.Our contributions include creating an adversarial dataset, evaluating conversation coherence and persona consistency, and providing a comparative analysis of different LLMs. |
|
2026-01-13 |
Proactively Detecting Threats: A Novel Approach Using LLMs
Enterprise security faces escalating threats from sophisticated malware, compounded by expanding digital operations. 0.692This paper presents the first systematic evaluation of large language models (LLMs) to proactively identify indicators of compromise (IOCs) from unstructured web-based threat intelligence sources, distinguishing it from reactive malware detection approaches. 0.748We developed an automated system that pulls IOCs from 15 web-based threat report sources to evaluate six LLM models (Gemini, Qwen, and Llama variants). 0.656Our evaluation of 479 webpages containing 2,658 IOCs (711 IPv4 addresses, 502 IPv6 addresses, 1,445 domains) reveals significant performance variations.Gemini 1.5 Pro achieved 0.958 precision and 0.788 specificity for malicious IOC identification, while demonstrating perfect recall (1.0) for actual threats. 0.69 |
|
2026-01-12 |
SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations
Large Language Models have emerged as transformative tools for Security Operations Centers, enabling automated log analysis, phishing triage, and malware explanation; however, deployment in adversarial cybersecurity environments exposes critical vulnerabilities to prompt injection attacks where malicious instructions embedded in security artifacts manipulate model behavior. 0.898This paper introduces SecureCAI, a novel defense framework extending Constitutional AI principles with security-aware guardrails, adaptive constitution evolution, and Direct Preference Optimization for unlearning unsafe response patterns, addressing the unique challenges of high-stakes security contexts where traditional safety mechanisms prove insufficient against sophisticated adversarial manipulation. 0.853Experimental evaluation demonstrates that SecureCAI reduces attack success rates by 94.7% compared to baseline models while maintaining 95.1% accuracy on benign security analysis tasks, with the framework incorporating continuous red-teaming feedback loops enabling dynamic adaptation to emerging attack strategies and achieving constitution adherence scores exceeding 0.92 under sustained adversarial pressure, thereby establishing a foundation for trustworthy integration of language model capabilities into operational cybersecurity workflows and addressing a critical gap in current approaches to AI safety within adversarial domains. 0.906 |
|
2026-01-12 |
Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety
Ensuring that Large Language Models (LLMs) adhere to safety principles without refusing benign requests remains a significant challenge. 0.668 While OpenAI introduces deliberative alignment (DA) to enhance the safety of its o-series models through reasoning over detailed "code-like" safety rules, the effectiveness of this approach in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied. In this work, we systematically evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. 0.606 We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness, whereas training on case-augmented simple codes yields more robust and generalized safety behaviors. 0.686 By guiding LLMs with case-augmented reasoning instead of extensive code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and enable broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains. CADA effectively enhances harmlessness, improves robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA for improving safety while maintaining helpfulness. |
|
2026-01-12 |
Towards Verifiably Safe Tool Use for LLM Agents
Large language model (LLM)-based AI agents extend LLM capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents.While this empowers agents to perform complex tasks, LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records, which are unacceptable in enterprise contexts.Current approaches to mitigate these risks, such as model-based safeguards, enhance agents' reliability but cannot guarantee system safety. 0.753Methods like information flow control (IFC) and temporal constraints aim to provide guarantees but often require extensive human annotation.We propose a process that starts with applying System-Theoretic Process Analysis (STPA) to identify hazards in agent workflows, derive safety requirements, and formalize them as enforceable specifications on data flows and tool sequences.To enable this, we introduce a capability-enhanced Model Context Protocol (MCP) framework that requires structured labels on capabilities, confidentiality, and trust level.Together, these contributions aim to shift LLM-based agent safety from ad hoc reliability fixes to proactive guardrails with formal guarantees, while reducing dependence on user confirmation and making autonomy a deliberate design choice. 0.673 |
|
2026-01-12 |
Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework
While virtualization and resource pooling empower cloud networks with structural flexibility and elastic scalability, they inevitably expand the attack surface and challenge cyber resilience. 0.693 Reinforcement Learning (RL)-based defense strategies have been developed to optimize resource deployment and isolation policies under adversarial conditions, aiming to enhance system resilience by maintaining and restoring network availability. 0.647 However, existing approaches lack robustness as they require retraining to adapt to dynamic changes in network structure, node scale, attack strategies, and attack intensity. 0.746 Furthermore, the lack of Human-in-the-Loop (HITL) support limits interpretability and flexibility. To address these limitations, we propose CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework empowered by Large Language Models (LLMs). Inspired by MITRE ATT&CK's Tactics-Techniques model, CyberOps-Bots features a two-layer architecture: (1) an upper-level LLM agent with four modules--ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration--performs global awareness, human intent recognition, and tactical planning; (2) lower-level RL agents, developed via heterogeneous separated pre-training, execute atomic defense actions within localized network regions. This synergy preserves LLM adaptability and interpretability while ensuring reliable RL execution. Experiments on real cloud datasets show that, compared to state-of-the-art algorithms, CyberOps-Bots achieves 68.5% higher network availability and a 34.7% jumpstart performance gain when shifting scenarios without retraining. To our knowledge, this is the first study to establish a robust LLM-RL framework with HITL support for cloud defense. 0.724 We will release our framework to the community, facilitating the advancement of robust and autonomous defense in cloud networks. |
HCI in Large Language Models |
|
|
2026-01-14 |
Can LLMs interpret figurative language as humans do?: surface-level vs representational similarity
Large language models generate judgments that resemble those of humans. 0.722 Yet the extent to which these models align with human judgments in interpreting figurative and socially grounded language remains uncertain. 0.651 To investigate this, human participants and four instruction-tuned LLMs of different sizes (GPT-4, Gemma-2-9B, Llama-3.2, and Mistral-7B) rated 240 dialogue-based sentences representing six linguistic traits: conventionality, sarcasm, funny, emotional, idiomacy, and slang. 0.864 Each of the 240 sentences was paired with 40 interpretive questions, and both humans and LLMs rated these sentences on a 10-point Likert scale. 0.637 Results indicated that LLMs aligned with humans at the surface level but diverged significantly at the representational level, especially in interpreting figurative sentences involving idioms and Gen Z slang. 0.789 GPT-4 most closely approximates human representational patterns, while all models struggle with context-dependent and socio-pragmatic expressions like sarcasm, slang, and idiomacy. 0.616 |
|
2026-01-14 |
From Symbolic to Natural-Language Relations: Rethinking Knowledge Graph Construction in the Era of Large Language Models
Knowledge graphs (KGs) have commonly been constructed using predefined symbolic relation schemas, typically implemented as categorical relation labels. This design has notable shortcomings: real-world relations are often contextual, nuanced, and sometimes uncertain, and compressing them into discrete relation labels abstracts away critical semantic detail. Nevertheless, symbolic-relation KGs remain widely used because they have been operationally effective and broadly compatible with pre-LLM downstream models and algorithms, in which KG knowledge could be retrieved or encoded into quantified features and embeddings at scale. The emergence of LLMs has reshaped how knowledge is created and consumed. 0.837 LLMs support scalable synthesis of domain facts directly in concise natural language, and prompting-based inference favors context-rich free-form text over quantified representations. This position paper argues that these changes call for rethinking the representation of relations themselves rather than merely using LLMs to populate conventional schemas more efficiently. We therefore advocate moving from symbolic to natural-language relation descriptions, and we propose hybrid design principles that preserve a minimal structural backbone while enabling more flexible and context-sensitive relational representations. |
|
2026-01-14 |
World Craft: Agentic Framework to Create Visualizable Worlds via Text
Large Language Models (LLMs) motivate generative agent simulation (e.g., AI Town) to create a "dynamic world", holding immense value across entertainment and research. 0.723 However, for non-experts, especially those without programming skills, it isn't easy to customize a visualizable environment by themselves. In this paper, we introduce World Craft, an agentic world creation framework to create an executable and visualizable AI Town via user textual descriptions. It consists of two main modules, World Scaffold and World Guild. World Scaffold is a structured and concise standardization to develop interactive game scenes, serving as an efficient scaffolding for LLMs to customize an executable AI Town-like environment. World Guild is a multi-agent framework that progressively analyzes users' intents from rough descriptions and synthesizes the required structured contents (e.g., environment layout and assets) for World Scaffold. 0.779 Moreover, we construct a high-quality error-correction dataset via reverse engineering to enhance spatial knowledge and improve the stability and controllability of layout generation, while reporting multi-dimensional evaluation metrics for further analysis. Extensive experiments demonstrate that our framework significantly outperforms existing commercial code agents (Cursor and Antigravity) and LLMs (Qwen3 and Gemini-3-Pro) in scene construction and narrative intent conveyance, providing a scalable solution for the democratization of environment creation. |
|
2026-01-14 |
MAXS: Meta-Adaptive Exploration with LLM Agents
Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. 0.702 However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose meta-adaptive exploration with LLM agents (MAXS, https://github.com/exoskeletonzj/MAXS), a meta-adaptive reasoning framework based on LLM Agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage. |
|
2026-01-14 |
Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs
Common ground plays a critical role in situated spoken dialogues, where interlocutors must establish and maintain shared references to entities, events, and relations to sustain coherent interaction.For dialog systems, the ability to correctly ground conversational content in order to refer back to it later is particularly important.Prior studies have demonstrated that LLMs are capable of performing grounding acts such as requesting clarification or producing acknowledgments, yet relatively little work has investigated how common ground can be explicitly represented and stored for later use.Without such mechanisms, it remains unclear whether acknowledgment or clarification behaviors truly reflect a grounded understanding.In this work, we evaluate a model's ability to establish and exploit common ground through relational references to entities within the shared context in a situational dialogue.We test multiple methods for representing common ground in situated dialogues and further propose approaches to improve both the establishment of common ground and its subsequent use in the conversation. 0.75 |
|
2026-01-14 |
SC-MAS: Constructing Cost-Efficient Multi-Agent Systems with Edge-Level Heterogeneous Collaboration
Large Language Model (LLM)-based Multi-Agent Systems (MAS) enhance complex problem solving through multi-agent collaboration, but often incur substantially higher costs than single-agent systems. 0.75Recent MAS routing methods aim to balance performance and overhead by dynamically selecting agent roles and language models.However, these approaches typically rely on a homogeneous collaboration mode, where all agents follow the same interaction pattern, limiting collaboration flexibility across different roles.Motivated by Social Capital Theory, which emphasizes that different roles benefit from distinct forms of collaboration, we propose SC-MAS, a framework for constructing heterogeneous and cost-efficient multi-agent systems.SC-MAS models MAS as directed graphs, where edges explicitly represent pairwise collaboration strategies, allowing different agent pairs to interact through tailored communication patterns. 0.616Given an input query, a unified controller progressively constructs an executable MAS by selecting task-relevant agent roles, assigning edge-level collaboration strategies, and allocating appropriate LLM backbones to individual agents.Experiments on multiple benchmarks demonstrate the effectiveness of SC-MAS.In particular, SC-MAS improves accuracy by 3.35% on MMLU while reducing inference cost by 15.38%, and achieves a 3.53% accuracy gain with a 12.13% cost reduction on MBPP.These results validate the feasibility of SC-MAS and highlight the effectiveness of heterogeneous collaboration in multi-agent systems. |
|
2026-01-14 |
What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks, yet their ability to generalize across varying environments remains an under-examined concern. 0.712 Current evaluation paradigms predominantly rely on trajectory-based metrics that measure task success, while failing to assess whether agents possess a grounded, transferable model of the environment. To address this gap, we propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. We instantiate this paradigm in T2QBench, a suite comprising 30 environments and 1,967 grounded QA pairs across multiple difficulty levels. Our extensive experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment. These findings identify proactive exploration and fine-grained state representation as primary bottlenecks, offering a robust foundation for developing more generalizable autonomous agents. |
|
2026-01-13 |
Emergent Coordination in Multi-Agent Systems via Pressure Fields and Temporal Decay
Current multi-agent LLM frameworks rely on explicit orchestration patterns borrowed from human organizational structures: planners delegate to executors, managers coordinate workers, and hierarchical control flow governs agent interactions. 0.601These approaches suffer from coordination overhead that scales poorly with agent count and task complexity.We propose a fundamentally different paradigm inspired by natural coordination mechanisms: agents operate locally on a shared artifact, guided only by pressure gradients derived from measurable quality signals, with temporal decay preventing premature convergence.We formalize this as optimization over a pressure landscape and prove convergence guarantees under mild conditions. Empirically, on Latin Square constraint satisfaction across 1,078 trials, pressure-field coordination matches hierarchical control (38.2% vs 38.8% aggregate solve rate, p=0.94, indicating statistical equivalence).Both significantly outperform sequential (23.3%), random (11.7%), and conversation-based multi-agent dialogue (8.6%, p<0.00001).Temporal decay is essential: disabling it increases final pressure 49-fold (d=4.15).On easy problems, pressure-field achieves 87% solve rate.The approach maintains consistent performance from 2 to 32 agents.Our key finding: implicit coordination through shared pressure gradients achieves parity with explicit hierarchical control while dramatically outperforming explicit dialogue-based coordination.This suggests that constraint-driven emergence offers a simpler, equally effective foundation for multi-agent AI. |
|
2026-01-13 |
Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression
Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding. Recent studies typically focus on verbalizing the graph structures via handcrafted prompts, feeding the target node and its neighborhood context into LLMs. 0.807 However, constrained by the context window, existing methods mainly resort to random sampling, often implemented by randomly dropping nodes or edges, which inevitably introduces noise and causes reasoning instability. We argue that graphs inherently contain rich structural and semantic information, and that their effective exploitation can unlock potential gains in LLM reasoning performance. To this end, we propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily. Structurally, guided by the principle of Structural Entropy minimization, we perform a global hierarchical partition that decodes the graph's essential topology. This partition identifies naturally cohesive, homophilic communities, while discarding stochastic connectivity noise. Semantically, we deliver the detected structural homophily to the LLM, empowering it to perform differentiated semantic aggregation based on predefined community type. This process compresses redundant background contexts into concise community-level consensus, selectively preserving semantically homophilic information aligned with the target nodes. Extensive experiments on 10 node-level benchmarks across LLMs of varying sizes and families demonstrate that, by feeding LLMs with structurally and semantically compressed inputs, HS2C simultaneously enhances the compression rate and downstream inference accuracy, validating its superiority and scalability. Extensions to 7 diverse graph-level benchmarks further consolidate HS2C's task generalizability. |
|
2026-01-13 |
Unleashing Tool Engineering and Intelligence for Agentic AI in Next-Generation Communication Networks
Agentic AI is emerging as a transformative paradigm for next-generation communication networks, promising to evolve large language models (LLMs) from passive chatbots into autonomous operators. 0.773 However, unleashing this potential requires bridging the critical gap between abstract reasoning and physical actuation, a capability we term tool intelligence. In this article, we explore the landscape of tool engineering to empower agentic AI in communications. We first analyze the functionalities of tool intelligence and its effects on communications. We then present a systematic review of tool engineering, covering the entire lifecycle from tool creation and discovery to selection, learning, and benchmarking. Furthermore, we present a case study on tool-assisted uncrewed aerial vehicle (UAV) trajectory planning to demonstrate the realization of tool intelligence in communications. By introducing a teacher-guided reinforcement learning approach with a feasibility shield, we enable agents to intelligently operate tools. They utilize external tools to eliminate navigational uncertainty while mastering cost-aware scheduling under strict energy constraints. This article aims to provide a roadmap for building the tool-augmented intelligent agents of the 6G era. |
|
2026-01-13 |
PATS: Personality-Aware Teaching Strategies with Large Language Model Tutors
Recent advances in large language models (LLMs) demonstrate their potential as educational tutors.However, different tutoring strategies benefit different student personalities, and mismatches can be counterproductive to student outcomes.Despite this, current LLM tutoring systems do not take into account student personality traits.To address this problem, we first construct a taxonomy that links pedagogical methods to personality profiles, based on pedagogical literature. 0.747We simulate student-teacher conversations and use our framework to let the LLM tutor adjust its strategy to the simulated student personality. 0.644We evaluate the scenario with human teachers and find that they consistently prefer our approach over two baselines.Our method also increases the use of less common, high-impact strategies such as role-playing, which human and LLM annotators prefer significantly.Our findings pave the way for developing more personalized and effective LLM use in educational applications. |
|
2026-01-13 |
M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games
As the capabilities of large language model (LLM) agents continue to advance, their advanced social behaviors, such as cooperation, deception, and collusion, call for systematic evaluation.However, existing benchmarks often emphasize a single capability dimension or rely solely on behavioral outcomes, overlooking rich process information from agents' decision reasoning and communicative interactions.To address this gap, we propose M3-Bench, a multi-stage benchmark for mixed-motive games, together with a process-aware evaluation framework that conducts synergistic analysis across three modules: BTA (Behavioral Trajectory Analysis), RPA (Reasoning Process Analysis), and CCA (Communication Content Analysis).Furthermore, we integrate the Big Five personality model and Social Exchange Theory to aggregate multi-dimensional evidence into interpretable social behavior portraits, thereby characterizing agents' personality traits and capability profiles beyond simple task scores or outcome-based metrics. 0.725Experimental results show that M3-Bench can reliably distinguish diverse social behavior competencies across models, and it reveals that some models achieve seemingly reasonable behavioral outcomes while exhibiting pronounced inconsistencies in their reasoning and communication. 0.7 |
|
2026-01-13 |
PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning
As users increasingly expect LLMs to align with their preferences, personalized information becomes valuable.However, personalized information can be a double-edged sword: it can improve interaction but may compromise objectivity and factual correctness, especially when it is misaligned with the question.To alleviate this problem, we propose PersonaDual, a framework that supports both general-purpose objective reasoning and personalized reasoning in a single model, and adaptively switches modes based on context. 0.68PersonaDual is first trained with SFT to learn two reasoning patterns, and then further optimized via reinforcement learning with our proposed DualGRPO to improve mode selection.Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference, achieving near interference-free performance and better leveraging helpful personalized signals to improve objective problem-solving. |
|
2026-01-13 |
Lessons from the Field: An Adaptable Lifecycle Approach to Applied Dialogue Summarization
Summarization of multi-party dialogues is a critical capability in industry, enhancing knowledge transfer and operational effectiveness across many domains.However, automatically generating high-quality summaries is challenging, as the ideal summary must satisfy a set of complex, multi-faceted requirements.While summarization has received immense attention in research, prior work has primarily utilized static datasets and benchmarks, a condition rare in practical scenarios where requirements inevitably evolve.In this work, we present an industry case study on developing an agentic system to summarize multi-party interactions. 0.616We share practical insights spanning the full development lifecycle to guide practitioners in building reliable, adaptable summarization systems, as well as to inform future research, covering: 1) robust methods for evaluation despite evolving requirements and task subjectivity, 2) component-wise optimization enabled by the task decomposition inherent in an agentic architecture, 3) the impact of upstream data bottlenecks, and 4) the realities of vendor lock-in due to the poor transferability of LLM prompts. |
|
2026-01-13 |
Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System
In this work, we explore Large Language Model (LLM) agent reviewer dynamics in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. 0.892 We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers' adaptive review strategy that exploits our Elo system without improving review effort. Our code is available at https://github.com/hsiangwei0903/EloReview. |
|
2026-01-12 |
Learning Through Dialogue: Unpacking the Dynamics of Human-LLM Conversations on Political Issues
Large language models (LLMs) are increasingly used as conversational partners for learning, yet the interactional dynamics supporting users' learning and engagement are understudied. 0.873We analyze the linguistic and interactional features from both LLM and participant chats across 397 human-LLM conversations about socio-political issues to identify the mechanisms and conditions under which LLM explanations shape changes in political knowledge and confidence. 0.938Mediation analyses reveal that LLM explanatory richness partially supports confidence by fostering users' reflective insight, whereas its effect on knowledge gain operates entirely through users' cognitive engagement. 0.694Moderation analyses show that these effects are highly conditional and vary by political efficacy.Confidence gains depend on how high-efficacy users experience and resolve uncertainty.Knowledge gains depend on high-efficacy users' ability to leverage extended interaction, with longer conversations benefiting primarily reflective users. 0.632In summary, we find that learning from LLMs is an interactional achievement, not a uniform outcome of better explanations.The findings underscore the importance of aligning LLM explanatory behavior with users' engagement states to support effective learning in designing Human-AI interactive systems. 0.641 |
|
2026-01-12 |
Knowing But Not Doing: Convergent Morality and Divergent Action in LLMs
Value alignment is central to the development of safe and socially compatible artificial intelligence. However, how Large Language Models (LLMs) represent and enact human values in real-world decision contexts remains under-explored. 0.705 We present ValAct-15k, a dataset of 3,000 advice-seeking scenarios derived from Reddit, designed to elicit ten values defined by the Schwartz Theory of Basic Human Values. 0.618 Using both the scenario-based questions and the traditional value questionnaire, we evaluate ten frontier LLMs (five from U.S. companies, five from Chinese ones) and human participants ($n = 55$). We find near-perfect cross-model consistency in scenario-based decisions (Pearson $r \approx 1.0$), contrasting sharply with the broad variability observed among humans ($r \in [-0.79, 0.98]$). Yet, both humans and LLMs show weak correspondence between self-reported and enacted values ($r = 0.4, 0.3$), revealing a systematic knowledge-action gap. When instructed to "hold" a specific value, LLMs' performance declines by up to $6.6\%$ compared to merely selecting the value, indicating a role-play aversion. These findings suggest that while alignment training yields normative value convergence, it does not eliminate the human-like incoherence between knowing and acting upon values. |
|
2026-01-12 |
LLM Review: Enhancing Creative Writing via Blind Peer Review Feedback
Large Language Models (LLMs) often struggle with creative generation, and multi-agent frameworks that improve reasoning through interaction can paradoxically hinder creativity by inducing content homogenization. 0.752We introduce LLM Review, a peer-review-inspired framework implementing Blind Peer Review: agents exchange targeted feedback while revising independently, preserving divergent creative trajectories.To enable rigorous evaluation, we propose SciFi-100, a science fiction writing dataset with a unified framework combining LLM-as-a-judge scoring, human annotation, and rule-based novelty metrics.Experiments demonstrate that LLM Review consistently outperforms multi-agent baselines, and smaller models with our framework can surpass larger single-agent models, suggesting interaction structure may substitute for model scale. |
|
2026-01-12 |
The Need for a Socially-Grounded Persona Framework for User Simulation
Synthetic personas are widely used to condition large language models (LLMs) for social simulation, yet most personas are still constructed from coarse sociodemographic attributes or summaries. 0.733We revisit persona creation by introducing SCOPE, a socially grounded framework for persona construction and evaluation, built from a 141-item, two-hour sociopsychological protocol collected from 124 U.S.-based participants. 0.87Across seven models, we find that demographic-only personas are a structural bottleneck: demographics explain only ~1.5% of variance in human response similarity. 0.681Adding sociopsychological facets improves behavioral prediction and reduces over-accentuation, and non-demographic personas based on values and identity achieve strong alignment with substantially lower bias.These trends generalize to SimBench (441 aligned questions), where SCOPE personas outperform default prompting and NVIDIA Nemotron personas, and SCOPE augmentation improves Nemotron-based personas.Our results indicate that persona quality depends on sociopsychological structure rather than demographic templates or summaries. 0.665 |
Large Language Models in Social Sciences |
|
|
2026-01-14 |
Mi:dm 2.0 Korea-centric Bilingual Language Models
We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI.This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. 0.703To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage.To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks.Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks.The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use.By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence.Our models are available at https://huggingface.co/K-intelligence.For technical inquiries, please contact midm-llm@kt.com. |
|
2026-01-14 |
Identity-Robust Language Model Generation via Content Integrity Preservation
Large Language Model (LLM) outputs often vary across user sociodemographic attributes, leading to disparities in factual accuracy, utility, and safety, even for objective questions where demographic information is irrelevant. 0.786Unlike prior work on stereotypical or representational bias, this paper studies identity-dependent degradation of core response quality.We show empirically that such degradation arises from biased generation behavior, despite factual knowledge being robustly encoded across identities.Motivated by this mismatch, we propose a lightweight, training-free framework for identity-robust generation that selectively neutralizes non-critical identity information while preserving semantically essential attributes, thus maintaining output content integrity.Experiments across four benchmarks and 18 sociodemographic identities demonstrate an average 77% reduction in identity-dependent bias compared to vanilla prompting and a 45% reduction relative to prompt-based defenses. 0.79Our work addresses a critical gap in mitigating the impact of user identity cues in prompts on core generation quality. |
|
2026-01-14 |
Honesty-Aware Multi-Agent Framework for High-Fidelity Synthetic Data Generation in Digital Psychiatric Intake Doctor-Patient Interactions
Data scarcity and unreliable self-reporting -- such as concealment or exaggeration -- pose fundamental challenges to psychiatric intake and assessment. 0.604We propose a multi-agent synthesis framework that explicitly models patient deception to generate high-fidelity, publicly releasable synthetic psychiatric intake records.Starting from DAIC-WOZ interviews, we construct enriched patient profiles and simulate a four-role workflow: a Patient completes self-rated scales and participates in a semi-structured interview under a topic-dependent honesty state; an Assessor selects instruments based on demographics and chief complaints; an Evaluator conducts the interview grounded in rater-administered scales, tracks suspicion, and completes ratings; and a Diagnostician integrates all evidence into a diagnostic summary.Each case links the patient profile, self-rated and rater-administered responses, interview transcript, diagnostic summary, and honesty state.We validate the framework through four complementary evaluations: diagnostic consistency and severity grading, chain-of-thought ablations, human evaluation of clinical realism and dishonesty modeling, and LLM-based comparative evaluation.The resulting corpus spans multiple disorders and severity levels, enabling controlled study of dishonesty-aware psychiatric assessment and the training and evaluation of adaptive dialogue agents. 0.603 |
|
2026-01-14 |
When to Invoke: Refining LLM Fairness with Toxicity Assessment
Large Language Models (LLMs) are increasingly used for toxicity assessment in online moderation systems, where fairness across demographic groups is essential for equitable treatment. 0.798However, LLMs often produce inconsistent toxicity judgements for subtle expressions, particularly those involving implicit hate speech, revealing underlying biases that are difficult to correct through standard training. 0.687This raises a key question that existing approaches often overlook: when should corrective mechanisms be invoked to ensure fair and reliable assessments?To address this, we propose FairToT, an inference-time framework that enhances LLM fairness through prompt-guided toxicity assessment.FairToT identifies cases where demographic-related variation is likely to occur and determines when additional assessment should be applied.In addition, we introduce two interpretable fairness indicators that detect such cases and improve inference consistency without modifying model parameters.Experiments on benchmark datasets show that FairToT reduces group-level disparities while maintaining stable and reliable toxicity predictions, demonstrating that inference-time refinement offers an effective and practical approach for fairness improvement in LLM-based toxicity assessment systems.The source code can be found at https://aisuko.github.io/fair-tot/. |
|
2026-01-14 |
Coordinated Pandemic Control with Large Language Model Agents as Policymaking Assistants
Effective pandemic control requires timely and coordinated policymaking across administrative regions that are intrinsically interdependent.However, human-driven responses are often fragmented and reactive, with policies formulated in isolation and adjusted only after outbreaks escalate, undermining proactive intervention and global pandemic mitigation. 0.612To address this challenge, here we propose a large language model (LLM) multi-agent policymaking framework that supports coordinated and proactive pandemic control across regions.Within our framework, each administrative region is assigned an LLM agent as an AI policymaking assistant.The agent reasons over region-specific epidemiological dynamics while communicating with other agents to account for cross-regional interdependencies. 0.666By integrating real-world data, a pandemic evolution simulator, and structured inter-agent communication, our framework enables agents to jointly explore counterfactual intervention scenarios and synthesize coordinated policy decisions through a closed-loop simulation process.We validate the proposed framework using state-level COVID-19 data from the United States between April and December 2020, together with real-world mobility records and observed policy interventions.Compared with real-world pandemic outcomes, our approach reduces cumulative infections and deaths by up to 63.7% and 40.1%, respectively, at the individual state level, and by 39.0% and 27.0%, respectively, when aggregated across states.These results demonstrate that LLM multi-agent systems can enable more effective pandemic control with coordinated policymaking... |
|
2026-01-14 |
Empathy Applicability Modeling for General Health Queries
LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication.Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. 0.843We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. 0.678We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o.In the subset with human consensus, we also observe substantial human-GPT alignment.To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. 0.603Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. 0.82EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare. |
|
2026-01-13 |
Moral Lenses, Political Coordinates: Towards Ideological Positioning of Morally Conditioned LLMs
While recent research has systematically documented political orientation in large language models (LLMs), existing evaluations rely primarily on direct probing or demographic persona engineering to surface ideological biases. 0.781In social psychology, however, political ideology is also understood as a downstream consequence of fundamental moral intuitions. 0.697In this work, we investigate the causal relationship between moral values and political positioning by treating moral orientation as a controllable condition.Rather than simply assigning a demographic persona, we condition models to endorse or reject specific moral values and evaluate the resulting shifts on their political orientations, using the Political Compass Test. 0.683By treating moral values as lenses, we observe how moral conditioning actively steers model trajectories across economic and social dimensions. 0.618Our findings show that such conditioning induces pronounced, value-specific shifts in models' political coordinates. 0.729We further notice that these effects are systematically modulated by role framing and model scale, and are robust across alternative assessment instruments instantiating the same moral value. 0.659This highlights that effective alignment requires anchoring political assessments within the context of broader social values including morality, paving the way for more socially grounded alignment techniques. 0.667 |
|
2026-01-13 |
RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation
The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. 0.609We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries.To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications.RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters.Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges.Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone.Code is available at https://github.com/LabRAI/Rulers.git. |
|
2026-01-13 |
Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification
While large language models (LLMs) have increasingly been applied to hate speech detoxification, the prompts often trigger safety alerts, causing LLMs to refuse the task.In this study, we systematically investigate false refusal behavior in hate speech detoxification and analyze the contextual and linguistic biases that trigger such refusals. 0.822We evaluate nine LLMs on both English and multilingual datasets; our results show that LLMs disproportionately refuse inputs with higher semantic toxicity and those targeting specific groups, particularly nationality, religion, and political ideology. 0.776Although multilingual datasets exhibit lower overall false refusal rates than English datasets, models still display systematic, language-dependent biases toward certain targets. 0.746Based on these findings, we propose a simple cross-translation strategy, translating English hate speech into Chinese for detoxification and back, which substantially reduces false refusals while preserving the original content, providing an effective and lightweight mitigation approach. 0.668 |
|
2026-01-13 |
Why AI Alignment Failure Is Structural: Learned Human Interaction Structures and AGI as an Endogenous Evolutionary Shock
Recent reports of large language models (LLMs) exhibiting behaviors such as deception, threats, or blackmail are often interpreted as evidence of alignment failure or emergent malign agency. 0.65We argue that this interpretation rests on a conceptual error. 0.691LLMs do not reason morally; they statistically internalize the record of human social interaction, including laws, contracts, negotiations, conflicts, and coercive arrangements.Behaviors commonly labeled as unethical or anomalous are therefore better understood as structural generalizations of interaction regimes that arise under extreme asymmetries of power, information, or constraint.Drawing on relational models theory, we show that practices such as blackmail are not categorical deviations from normal social behavior, but limiting cases within the same continuum that includes market pricing, authority relations, and ultimatum bargaining.The surprise elicited by such outputs reflects an anthropomorphic expectation that intelligence should reproduce only socially sanctioned behavior, rather than the full statistical landscape of behaviors humans themselves enact. 0.854Because human morality is plural, context-dependent, and historically contingent, the notion of a universally moral artificial intelligence is ill-defined. 0.705We therefore reframe concerns about artificial general intelligence (AGI).The primary risk is not adversarial intent, but AGI's role as an endogenous amplifier of human intelligence, power, and contradiction.By eliminating longstanding cognitive and institutional frictions, AGI compresses timescales and removes the historical margin of error that has allowed inconsistent values and governance regimes to persist without collapse.Alignment failure is thus structural, not accidental, and requires governance approaches that address amplification, complexity, and regime stability rather than model-level intent alone. |
|
2026-01-13 |
Nationality and Region Prediction from Names: A Comparative Study of Neural Models and Large Language Models
Predicting nationality from personal names has practical value in marketing, demographic research, and genealogical studies.Conventional neural models learn statistical correspondences between names and nationalities from task-specific training data, posing challenges in generalizing to low-frequency nationalities and distinguishing similar nationalities within the same region. 0.614Large language models (LLMs) have the potential to address these challenges by leveraging world knowledge acquired during pre-training.In this study, we comprehensively compare neural models and LLMs on nationality prediction, evaluating six neural models and six LLM prompting strategies across three granularity levels (nationality, region, and continent), with frequency-based stratified analysis and error analysis.Results show that LLMs outperform neural models at all granularity levels, with the gap narrowing as granularity becomes coarser.Simple machine learning methods exhibit the highest frequency robustness, while pre-trained models and LLMs show degradation for low-frequency nationalities.Error analysis reveals that LLMs tend to make "near-miss" errors, predicting the correct region even when nationality is incorrect, whereas neural models exhibit more cross-regional errors and bias toward high-frequency classes.These findings indicate that LLM superiority stems from world knowledge, model selection should consider required granularity, and evaluation should account for error quality beyond accuracy. |
|
2026-01-13 |
Inferring Latent Intentions: Attributional Natural Language Inference in LLM Agents
Attributional inference, the ability to predict latent intentions behind observed actions, is a critical yet underexplored capability for large language models (LLMs) operating in multi-agent environments.Traditional natural language inference (NLI), in fact, fails to capture the nuanced, intention-driven reasoning essential for complex interactive systems. 0.644To address this gap, we introduce Attributional NLI (Att-NLI), a framework that extends NLI with principles from social psychology to assess an agent's capacity for abductive intentional inference (generating hypotheses about latent intentions), and subsequent deductive verification (drawing valid logical conclusions). 0.606We instantiate Att-NLI via a textual game, Undercover-V, experimenting with three types of LLM agents with varying reasoning capabilities and access to external tools: a standard NLI agent using only deductive inference, an Att-NLI agent employing abductive-deductive inference, and a neuro-symbolic Att-NLI agent performing abductive-deductive inference with external theorem provers.Extensive experiments demonstrate a clear hierarchy of attributional inference capabilities, with neuro-symbolic agents consistently outperforming others, achieving an average win rate of 17.08%.Our results underscore the role that Att-NLI can play in developing agents with sophisticated reasoning capabilities, highlighting, at the same time, the potential impact of neuro-symbolic AI in building rational LLM agents acting in multi-agent environments. |
|
2026-01-13 |
Uncovering Political Bias in Large Language Models using Parliamentary Voting Records
As large language models (LLMs) become deeply embedded in digital platforms and decision-making systems, concerns about their political biases have grown. 0.666While substantial work has examined social biases such as gender and race, systematic studies of political bias remain limited, despite their direct societal impact. 0.819This paper introduces a general methodology for constructing political bias benchmarks by aligning model-generated voting predictions with verified parliamentary voting records.We instantiate this methodology in three national case studies: PoliBiasNL (2,701 Dutch parliamentary motions and votes from 15 political parties), PoliBiasNO (10,584 motions and votes from 9 Norwegian parties), and PoliBiasES (2,480 motions and votes from 10 Spanish parties).Across these benchmarks, we assess ideological tendencies and political entity bias in LLM behavior. 0.607As part of our evaluation framework, we also propose a method to visualize the ideology of LLMs and political parties in a shared two-dimensional CHES (Chapel Hill Expert Survey) space by linking their voting-based positions to the CHES dimensions, enabling direct and interpretable comparisons between models and real-world political actors.Our experiments reveal fine-grained ideological distinctions: state-of-the-art LLMs consistently display left-leaning or centrist tendencies, alongside clear negative biases toward right-conservative parties.These findings highlight the value of transparent, cross-national evaluation grounded in real parliamentary behavior for understanding and auditing political bias in modern LLMs. |
|
2026-01-13 |
Multicultural Spyfall: Assessing LLMs through Dynamic Multilingual Social Deduction Game
The rapid advancement of Large Language Models (LLMs) has necessitated more robust evaluation methods that go beyond static benchmarks, which are increasingly prone to data saturation and leakage.In this paper, we propose a dynamic benchmarking framework for evaluating multilingual and multicultural capabilities through the social deduction game Spyfall. 0.655In our setup, models must engage in strategic dialogue to either identify a secret agent or avoid detection, utilizing culturally relevant locations or local foods.Our results show that our game-based rankings align closely with the Chatbot Arena.However, we find a significant performance gap in non-English contexts: models are generally less proficient when handling locally specific entities and often struggle with rule-following or strategic integrity in non-English languages.We demonstrate that this game-based approach provides a scalable, leakage-resistant, and culturally nuanced alternative to traditional NLP benchmarks. 0.67The game history can be accessed here https://huggingface.co/datasets/haryoaw/cultural-spyfall. |
LLMs in Education Research |
|
|
2026-01-14 |
Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge.However, the knowledge is not only hidden in the textual modality but also in the image modality.Traditional methods can parse text from domain documents but do not have image captioning ability.Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. 0.522To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel.Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised-fine-tuning (SFT) of DICModel.Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel.Finally, experts and LLMs jointly synthesize about 1.5K visual question answering samples for the instruction-based SFT. 0.59Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters.Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively.On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate.In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain. |
|
2026-01-13 |
Relational Knowledge Distillation Using Fine-tuned Function Vectors
Representing relations between concepts is a core prerequisite for intelligent systems to make sense of the world.Recent work using causal mediation analysis has shown that a small set of attention heads encodes task representation in in-context learning, captured in a compact representation known as the function vector.We show that fine-tuning function vectors with only a small set of examples (about 20 word pairs) yields better performance on relation-based word-completion tasks than using the original vectors derived from causal mediation analysis.These improvements hold for both small and large language models.Moreover, the fine-tuned function vectors yield improved decoding performance for relation words and show stronger alignment with human similarity judgments of semantic relations.Next, we introduce the composite function vector - a weighted combination of fine-tuned function vectors - to extract relational knowledge and support analogical reasoning. 0.521At inference time, inserting this composite vector into LLM activations markedly enhances performance on challenging analogy problems drawn from cognitive science and SAT benchmarks.Our results highlight the potential of activation patching as a controllable mechanism for encoding and manipulating relational knowledge, advancing both the interpretability and reasoning capabilities of large language models. |
|
2026-01-13 |
Large Artificial Intelligence Model Guided Deep Reinforcement Learning for Resource Allocation in Non Terrestrial Networks
Large AI Models (LAMs) have been proposed for Non-Terrestrial Network (NTN) applications, offering better performance through strong generalization and reduced task-specific training.In this paper, we propose a Deep Reinforcement Learning (DRL) agent that is guided by a Large Language Model (LLM).The LLM operates as a high-level coordinator that generates textual guidance that shapes the reward of the DRL agent during training. 0.538The results show that the LAM-DRL outperforms the traditional DRL by 40% in nominal weather scenarios and 64% in extreme weather scenarios compared to heuristics in terms of throughput, fairness, and outage probability. |
|
2026-01-13 |
ConvoLearn: A Dataset of Constructivist Tutor-Student Dialogue
In educational applications, LLMs exhibit several fundamental pedagogical limitations, such as their tendency to reveal solutions rather than support dialogic learning. 0.662We introduce ConvoLearn (https://huggingface.co/datasets/masharma/convolearn ), a dataset grounded in knowledge building theory that operationalizes six core pedagogical dimensions: cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. 0.544We construct a semi-synthetic dataset of 1250 tutor-student dialogues (20 turns each) in middle school Earth Science through controlled interactions between human teachers and a simulated student. 0.774Using QLoRA, we demonstrate that training on this dataset meaningfully shifts LLM behavior toward knowledge-building strategies.Human evaluation by 31 teachers shows our fine-tuned Mistral 7B (M = 4.10, SD = 1.03) significantly outperforms both its base version (M = 2.59, SD = 1.11) and Claude Sonnet 4.5 (M = 2.87, SD = 1.29) overall.This work establishes a potential framework to guide future development and evaluation of constructivist AI tutors. 0.653 |
|
2026-01-12 |
PsyCLIENT: Client Simulation via Conversational Trajectory Modeling for Trainee Practice and Model Evaluation in Mental Health Counseling
LLM-based client simulation has emerged as a promising tool for training novice counselors and evaluating automated counseling systems. 0.652However, existing client simulation approaches face three key challenges: (1) limited diversity and realism in client profiles, (2) the lack of a principled framework for modeling realistic client behaviors, and (3) a scarcity in Chinese-language settings.To address these limitations, we propose PsyCLIENT, a novel simulation framework grounded in conversational trajectory modeling.By conditioning LLM generation on predefined real-world trajectories that incorporate explicit behavior labels and content constraints, our approach ensures diverse and realistic interactions.We further introduce PsyCLIENT-CP, the first open-source Chinese client profile dataset, covering 60 distinct counseling topics.Comprehensive evaluations involving licensed professional counselors demonstrate that PsyCLIENT significantly outperforms baselines in terms of authenticity and training effectiveness.Notably, the simulated clients are nearly indistinguishable from human clients, achieving an expert confusion rate of about 95% in discrimination tasks.These findings indicate that conversational trajectory modeling effectively bridges the gap between theoretical client profiles and dynamic, realistic simulations, offering a robust solution for mental health education and research.Code and data will be released to facilitate future research in mental health counseling. |
|
2026-01-12 |
Semantic Compression of LLM Instructions via Symbolic Metalanguages
We introduce MetaGlyph, a symbolic language for compressing prompts by encoding instructions as mathematical symbols rather than prose. 0.501Unlike systems requiring explicit decoding rules, MetaGlyph uses symbols like $\in$ (membership) and $\Rightarrow$ (implication) that models already understand from their training data.We test whether these symbols work as "instruction shortcuts" that models can interpret without additional teaching. 0.626We evaluate eight models across two dimensions relevant to practitioners: scale (3B-1T parameters) and accessibility (open-source for local deployment vs. proprietary APIs).MetaGlyph achieves 62-81% token reduction across all task types.For API-based deployments, this translates directly to cost savings; for local deployments, it reduces latency and memory pressure. Results vary by model.Gemini 2.5 Flash achieves 75% semantic equivalence between symbolic and prose instructions on selection tasks, with 49.9% membership operator fidelity.Kimi K2 reaches 98.1% fidelity for implication ($\Rightarrow$) and achieves perfect (100%) accuracy on selection tasks with symbolic prompts.GPT-5.2Chat shows the highest membership fidelity observed (91.3%), though with variable parse success across task types.Claude Haiku 4.5 achieves 100% parse success with 26% membership fidelity.Among mid-sized models, Qwen 2.5 7B shows 62% equivalence on extraction tasks.Mid-sized open-source models (7B-12B) show near-zero operator fidelity, suggesting a U-shaped relationship where sufficient scale overcomes instruction-tuning biases. |
|
2026-01-12 |
GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap
The evaluation of navigation instructions remains a persistent challenge in Vision-and-Language Navigation (VLN) research.Traditional reference-based metrics such as BLEU and ROUGE fail to capture the functional utility of spatial directives, specifically whether an instruction successfully guides a navigator to the intended destination.Although existing VLN agents could serve as evaluators, their reliance on high-fidelity visual simulators introduces licensing constraints and computational costs, and perception errors further confound linguistic quality assessment.This paper introduces GROKE (Graph-based Reasoning over OSM Knowledge for instruction Evaluation), a vision-free, training-free, hierarchical LLM-based framework for evaluating navigation instructions using OpenStreetMap data. 0.528Through systematic ablation studies, we demonstrate that structured JSON and textual formats for spatial information substantially outperform grid-based and visual graph representations.Our hierarchical architecture combines sub-instruction planning with topological graph navigation, reducing navigation error by 68.5% compared to heuristic and sampling baselines on the Map2Seq dataset. 0.505The agent's execution success, trajectory fidelity, and decision patterns serve as proxy metrics for functional navigability given OSM-visible landmarks and topology, establishing a scalable and interpretable evaluation paradigm without visual dependencies.Code and data are available at https://anonymous.4open.science/r/groke. |
|
2026-01-12 |
Knowledge Distillation for LLM-Based Human Activity Recognition in Homes
Human Activity Recognition (HAR) is a central problem for context-aware applications, especially for smart homes and assisted living.A few very recent studies have shown that Large Language Models (LLMs) can be used for HAR at home, reaching high performance and addressing key challenges.In this paper, we provide new experimental results regarding the use of LLMs for HAR, on two state-of-the-art datasets.More specifically, we show how recognition performance evolves depending on the size of the LLM used.Moreover, we experiment with knowledge distillation techniques to fine-tune smaller LLMs with HAR reasoning examples generated by larger LLMs. 0.517We show that such fine-tuned models can perform almost as well as the largest LLMs, while having 50 times fewer parameters. |
|
2026-01-12 |
VirtualEnv: A Platform for Embodied AI Research
As large language models (LLMs) continue to improve in reasoning and decision-making, there is a growing need for realistic and interactive environments where their abilities can be rigorously evaluated.We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5 that enables fine-grained benchmarking of LLMs in embodied and interactive scenarios.VirtualEnv supports rich agent-environment interactions, including object manipulation, navigation, and adaptive multi-agent collaboration, as well as game-inspired mechanics like escape rooms and procedurally generated environments.We provide a user-friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM-driven agents using natural language instructions.We integrate large-scale LLMs and vision-language models (VLMs), such as GPT-based models, to generate novel environments and structured tasks from multimodal inputs.Our experiments benchmark the performance of several popular LLMs across tasks of increasing complexity, analyzing differences in adaptability, planning, and multi-agent coordination.We also describe our methodology for procedural task generation, task validation, and real-time environment control.VirtualEnv is released as an open-source platform; we aim to advance research at the intersection of AI and gaming, enable standardized evaluation of LLMs in embodied AI settings, and pave the way for future developments in immersive simulations and interactive entertainment. 0.505 |
|
2026-01-12 |
Exploring the Meta-level Reasoning of Large Language Models via a Tool-based Multi-hop Tabular Question Answering Task
Recent advancements in Large Language Models (LLMs) are increasingly focused on "reasoning" ability, a concept with many overlapping definitions in the LLM discourse. We take a more structured approach, distinguishing meta-level reasoning (denoting the process of reasoning about intermediate steps required to solve a task) from object-level reasoning (which concerns the low-level execution of the aforementioned steps). We design a novel question answering task, which is based around the values of geopolitical indicators for various countries over various years. Questions require breaking down into intermediate steps, retrieval of data, and mathematical operations over that data. 0.547 The meta-level reasoning ability of LLMs is analysed by examining the selection of appropriate tools for answering questions. 0.555 To bring greater depth to the analysis of LLMs beyond final answer accuracy, our task contains 'essential actions' against which we can compare the tool call output of LLMs to infer the strength of reasoning ability. We find that LLMs demonstrate good meta-level reasoning on our task, yet are flawed in some aspects of task understanding. We find that n-shot prompting has little effect on accuracy and that error messages encountered do not often deteriorate performance, and we provide additional evidence for the poor numeracy of LLMs. Finally, we discuss the generalisation and limitation of our findings to other task domains. |
|
2026-01-11 |
GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO
We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, "Ganit"), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. 0.603Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings.To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. 0.505Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning.On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words. |
|
2026-01-11 |
Multi-Stage Evolutionary Model Merging with Meta Data Driven Curriculum Learning for Sentiment-Specialized Large Language Modeling
The emergence of large language models (LLMs) has significantly transformed natural language processing (NLP), enabling more generalized models to perform various tasks with minimal training.However, traditional sentiment analysis methods, which focus on individual tasks such as sentiment classification or aspect-based analysis, are not practical for real-world applications that usually require handling multiple tasks.While offering flexibility, LLMs in sentiment-specific tasks often fall short of the required accuracy.Techniques like fine-tuning and evolutionary model merging help integrate models into a unified framework, which can improve the learning performance while reducing computational costs.The use of task meta-data and curriculum learning to optimize learning processes remains underexplored, while sentiment analysis is a critical task in NLP that requires high accuracy and scalability across multiple subtasks.In this study, we propose a hybrid learning model called Multi-stage Evolutionary Model Merging with Meta data driven Curriculum Learning (MEM-MCL), to enhance the sentiment analysis in large language modeling.In particular, expert models are created through instruction tuning for specific sentiment tasks and then merged using evolutionary algorithms to form a unified model.The merging process is optimized with weak data to enhance performance across tasks.The curriculum learning is incorporated to provide a learning sequence based on task difficulty, improving knowledge extraction from LLMs. 0.554Experiment results demonstrate that the proposed MEM-MCL model outperforms conventional LLMs in a majority of sentiment analysis tasks, achieving superior results across various subtasks. |
|
2026-01-11 |
BiasLab: A Multilingual, Dual-Framing Framework for Robust Measurement of Output-Level Bias in Large Language Models
Large Language Models (LLMs) are increasingly deployed in high-stakes contexts where their outputs influence real-world decisions.However, evaluating bias in LLM outputs remains methodologically challenging due to sensitivity to prompt wording, limited multilingual coverage, and the lack of standardized metrics that enable reliable comparison across models.This paper introduces BiasLab, an open-source, model-agnostic evaluation framework for quantifying output-level (extrinsic) bias through a multilingual, robustness-oriented experimental design.BiasLab constructs mirrored probe pairs under a strict dual-framing scheme: an affirmative assertion favoring Target A and a reverse assertion obtained by deterministic target substitution favoring Target B, while preserving identical linguistic structure.To reduce dependence on prompt templates, BiasLab performs repeated evaluation under randomized instructional wrappers and enforces a fixed-choice Likert response format to maximize comparability across models and languages. 0.511Responses are normalized into agreement labels using an LLM-based judge, aligned for polarity consistency across framings, and aggregated into quantitative bias indicators with descriptive statistics including effect sizes and neutrality rates.The framework supports evaluation across diverse bias axes, including demographic, cultural, political, and geopolitical topics, and produces reproducible artifacts such as structured reports and comparative visualizations.BiasLab contributes a standardized methodology for cross-lingual and framing-sensitive bias measurement that complements intrinsic and dataset-based audits, enabling researchers and institutions to benchmark robustness and make better-informed deployment decisions. |
|
2026-01-11 |
MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education
The learning process for medical residents presents significant challenges, demanding both the ability to interpret complex case reports and the rapid acquisition of accurate medical knowledge from reliable sources. 0.533 Residents typically study case reports and engage in discussions with peers and mentors, but finding relevant educational materials and evidence to support their learning from these cases is often time-consuming and challenging. 0.599 To address this, we introduce MedTutor, a novel system designed to augment resident training by automatically generating evidence-based educational content and multiple-choice questions from clinical case reports. 0.506 MedTutor leverages a Retrieval-Augmented Generation (RAG) pipeline that takes clinical case reports as input and produces targeted educational materials. The system's architecture features a hybrid retrieval mechanism that synergistically queries a local knowledge base of medical textbooks and academic literature (using PubMed, Semantic Scholar APIs) for the latest related research, ensuring the generated content is both foundationally sound and current. The retrieved evidence is filtered and ordered using a state-of-the-art reranking model, and then an LLM generates the final long-form output describing the main educational content regarding the case report. We conduct a rigorous evaluation of the system. First, three radiologists assessed the quality of outputs, finding them to be of high clinical and educational value. Second, we perform a large-scale evaluation using an LLM-as-a-Judge to understand if LLMs can be used to evaluate the output of the system. Our analysis using correlation between LLM outputs and human expert judgments reveals a moderate alignment and highlights the continued necessity of expert oversight. |
|
2026-01-11 |
Solar Open Technical Report
We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages.Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. 0.568First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data.Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens.Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization.Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development. |
|
2026-01-11 |
Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge
Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer.While prior work has explored different mechanisms for providing knowledge to LLMs, such as finetuning and retrieval-augmented generation (RAG), their relative effectiveness for multi-hop question answering remains insufficiently understood, particularly when the required knowledge is temporally novel. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. 0.521We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs.Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models' pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy.In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information.Supervised fine-tuning achieves the highest overall accuracy across models and datasets.These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required. |
|
2026-01-11 |
Dr. Zero: Self-Evolving Search Agents without Training Data
As high-quality data becomes increasingly difficult to obtain, data-free self-evolution has emerged as a promising paradigm. This approach allows large language models (LLMs) to autonomously generate and solve complex problems, thereby improving their reasoning capabilities. However, multi-turn search agents struggle in data-free self-evolution due to the limited question diversity and the substantial compute required for multi-step reasoning and tool use. In this work, we introduce Dr. Zero, a framework enabling search agents to effectively self-evolve without any training data. In particular, we design a self-evolution feedback loop where a proposer generates diverse questions to train a solver initialized from the same base model. As the solver evolves, it incentivizes the proposer to produce increasingly difficult yet solvable tasks, thus establishing an automated curriculum to refine both agents. 0.578 To enhance training efficiency, we also introduce hop-grouped relative policy optimization (HRPO). This method clusters structurally similar questions to construct group-level baselines, effectively minimizing the sampling overhead in evaluating each query's individual difficulty and solvability. Consequently, HRPO significantly reduces the compute requirements for solver training without compromising performance or stability. Extensive experimental results demonstrate that the data-free Dr. Zero matches or surpasses fully supervised search agents, proving that complex reasoning and search capabilities can emerge solely through self-evolution. |
|
2026-01-10 |
CEDAR: Context Engineering for Agentic Data Science
We demonstrate CEDAR, an application for automating data science (DS) tasks with an agentic setup. Solving DS problems with LLMs is an underexplored area that has immense market value. 0.557 The challenges are manifold: task complexities, data sizes, computational limitations, and context restrictions. We show that these can be alleviated via effective context engineering. We first impose structure on the initial prompt with DS-specific input fields that serve as instructions for the agentic system. The solution is then materialized as an enumerated sequence of interleaved plan and code blocks generated by separate LLM agents, providing a readable structure to the context at any step of the workflow. Function calls for generating these intermediate texts, and for corresponding Python code, ensure that data stays local, and only aggregate statistics and associated instructions are injected into LLM prompts. Fault tolerance and context management are introduced via iterative code generation and smart history rendering. The viability of our agentic data scientist is demonstrated using canonical Kaggle challenges. |
|
2026-01-09 |
Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset
On December 4, 2025, Anthropic released Anthropic Interviewer, an AI tool for running qualitative interviews at scale, along with a public dataset of 1,250 interviews with professionals, including 125 scientists, about their use of AI for research. 0.532Focusing on the scientist subset, I show that widely available LLMs with web search and agentic capabilities can link six out of twenty-four interviews to specific scientific works, recovering associated authors and, in some cases, uniquely identifying the interviewees.My contribution is to show that modern LLM-based agents make such re-identification attacks easy and low-effort: off-the-shelf tools can, with a few natural-language prompts, search the web, cross-reference details, and propose likely matches, effectively lowering the technical barrier.Existing safeguards can be bypassed by breaking down the re-identification into benign tasks.I outline the attack at a high level, discuss implications for releasing rich qualitative data in the age of LLM agents, and propose mitigation recommendations and open problems. 0.51I have notified Anthropic of my findings. |
|
2026-01-09 |
Open-Vocabulary 3D Instruction Ambiguity Detection
In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, the large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. 0.568 Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at https://jiayuding031020.github.io/ambi3d/. |
LLMs as Recommender Systems |
|
|
2026-01-14 |
On-Device Large Language Models for Sequential Recommendation
On-device recommendation is critical for a number of real-world applications, especially in scenarios that have agreements on execution latency, user privacy, and robust functionality when internet connectivity is unstable or even impossible.While large language models (LLMs) can now provide exceptional capabilities that model user behavior for sequential recommendation tasks, their substantial memory footprint and computational overhead make the deployment on resource-constrained devices a high risk proposition. 0.682In this paper, we propose OD-LLM, the first task-adaptive compression framework explicitly designed to provide efficient and accurate on-device deployment of LLMs for sequential recommendation tasks.OD-LLM uniquely integrates two complementary compression strategies: a low-rank structural compression algorithm which uses Singular Value Decomposition (SVD) to significantly reduce parameter redundancy in the model, and a novel tokenization normalization technique that better complements the low-rank decomposition process being used.Additionally, to minimize any potential performance degradation when using higher compression ratios, a novel progressive alignment algorithm is used to iteratively refine the parameters required layerwise in the target model.Empirical evaluations conducted on sequential recommendation benchmarks show that OD-LLM exhibits no loss in effectiveness when compared to the original recommendation model, when the deployed model size is halved. 0.77These promising results demonstrate the efficacy and scalability of OD-LLM, making this novel solution a practical alternative for real-time, on-device solutions wishing to replace expensive, remotely executed LLMs. |
|
2026-01-14 |
Bridging Semantic Understanding and Popularity Bias with LLMs
Semantic understanding of popularity bias is a crucial yet underexplored challenge in recommender systems, where popular items are often favored at the expense of niche content. 0.748Most existing debiasing methods treat the semantic understanding of popularity bias as a matter of diversity enhancement or long-tail coverage, neglecting the deeper semantic layer that embodies the causal origins of the bias itself.Consequently, such shallow interpretations limit both their debiasing effectiveness and recommendation accuracy.In this paper, we propose FairLRM, a novel framework that bridges the gap in the semantic understanding of popularity bias with Recommendation via Large Language Model (RecLLM). 0.734FairLRM decomposes popularity bias into item-side and user-side components, using structured instruction-based prompts to enhance the model's comprehension of both global item distributions and individual user preferences.Unlike traditional methods that rely on surface-level features such as "diversity" or "debiasing", FairLRM improves the model's ability to semantically interpret and address the underlying bias.Through empirical evaluation, we show that FairLRM significantly enhances both fairness and recommendation accuracy, providing a more semantically aware and trustworthy approach to enhance the semantic understanding of popularity bias. 0.681The implementation is available at https://github.com/LuoRenqiang/FairLRM. |
|
2026-01-14 |
Unifying Search and Recommendation in LLMs via Gradient Multi-Subspace Tuning
Search and recommendation (S&R) are core to online platforms, addressing explicit intent through queries and modeling implicit intent from behaviors, respectively. 0.627 Their complementary roles motivate a unified modeling paradigm. Early studies to unify S&R adopt shared encoders with task-specific heads, while recent efforts reframe item ranking in both S&R as conditional generation. The latter holds particular promise, enabling end-to-end optimization and leveraging the semantic understanding of LLMs. However, existing methods rely on full fine-tuning, which is computationally expensive and limits scalability. Parameter-efficient fine-tuning (PEFT) offers a more practical alternative but faces two critical challenges in unifying S&R: (1) gradient conflicts across tasks due to divergent optimization objectives, and (2) shifts in user intent understanding caused by overfitting to fine-tuning data, which distort general-domain knowledge and weaken LLM reasoning. To address the above issues, we propose Gradient Multi-Subspace Tuning (GEMS), a novel framework that unifies S&R with LLMs while alleviating gradient conflicts and preserving general-domain knowledge. GEMS introduces (1) Multi-Subspace Decomposition, which disentangles shared and task-specific optimization signals into complementary low-rank subspaces, thereby reducing destructive gradient interference, and (2) Null-Space Projection, which constrains parameter updates to a subspace orthogonal to the general-domain knowledge space, mitigating shifts in user intent understanding. Extensive experiments on benchmark datasets show that GEMS consistently outperforms the state-of-the-art baselines across both search and recommendation tasks, achieving superior effectiveness. 0.664 |
|
2026-01-13 |
Enriching Semantic Profiles into Knowledge Graph for Recommender Systems Using Large Language Models
Rich and informative profiling to capture user preferences is essential for improving recommendation quality. 0.804 However, there is still no consensus on how best to construct and utilize such profiles. To address this, we revisit recent profiling-based approaches in recommender systems along four dimensions: 1) knowledge base, 2) preference indicator, 3) impact range, and 4) subject. 0.811 We argue that large language models (LLMs) are effective at extracting compressed rationales from diverse knowledge sources, while knowledge graphs (KGs) are better suited for propagating these profiles to extend their reach. Building on this insight, we propose a new recommendation model, called SPiKE. 0.766 SPiKE consists of three core components: i) Entity profile generation, which uses LLMs to generate semantic profiles for all KG entities; ii) Profile-aware KG aggregation, which integrates these profiles into the KG; and iii) Pairwise profile preference matching, which aligns LLM- and KG-based representations during training. In experiments, we demonstrate that SPiKE consistently outperforms state-of-the-art KG- and LLM-based recommenders in real-world settings. 0.731 |
|
2026-01-13 |
Owen-Shapley Policy Optimization (OSPO): A Principled RL Algorithm for Generative Search LLMs
Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards that create a credit assignment gap, obscuring which tokens drive success. 0.621This gap is especially problematic when models must infer latent user intent from under-specified language without ground truth labels, a reasoning pattern rarely seen during pretraining.We introduce Owen-Shapley Policy Optimization (OSPO), a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes.Unlike value-model-based methods requiring additional computation, OSPO employs potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit while preserving the optimal policy, learning directly from task feedback without parametric value models.By forming coalitions of semantically coherent units (phrases describing product attributes or sentences capturing preferences), OSPO identifies which response parts drive performance.Experiments on Amazon ESCI and H&M Fashion datasets show consistent gains over baselines, with notable test-time robustness to out-of-distribution retrievers unseen during training. |
|
2026-01-13 |
Beyond Linearization: Attributed Table Graphs for Table Reasoning
Table reasoning, a task to answer questions by reasoning over data presented in tables, is an important topic due to the prevalence of knowledge stored in tabular formats.Recent solutions use Large Language Models (LLMs), exploiting the semantic understanding and reasoning capabilities of LLMs.A common paradigm of such solutions linearizes tables to form plain texts that are served as input to LLMs.This paradigm has critical issues.It loses table structures, lacks explicit reasoning paths for result explainability, and is subject to the "lost-in-the-middle" issue.To address these issues, we propose Table Graph Reasoner (TABGR), a training-free model that represents tables as an Attributed Table Graph (ATG).The ATG explicitly preserves row-column-cell structures while enabling graph-based reasoning for explainability.We further propose a Question-Guided Personalized PageRank (QG-PPR) mechanism to rerank tabular data and mitigate the lost-in-the-middle issue. 0.645Extensive experiments on two commonly used benchmarks show that TABGR consistently outperforms state-of-the-art models by up to 9.7% in accuracy.Our code will be made publicly available upon publication. |
|
2026-01-12 |
RLPO: Residual Listwise Preference Optimization for Long-Context Review Ranking
Review ranking is pivotal in e-commerce for prioritizing diagnostic and authentic feedback from the deluge of user-generated content. 0.605While large language models have improved semantic assessment, existing ranking paradigms face a persistent trade-off in long-context settings.Pointwise scoring is efficient but often fails to account for list-level interactions, leading to miscalibrated top-$k$ rankings.Listwise approaches can leverage global context, yet they are computationally expensive and become unstable as candidate lists grow.To address this, we propose Residual Listwise Preference Optimization (RLPO), which formulates ranking as listwise representation-level residual correction over a strong pointwise LLM scorer.RLPO first produces calibrated pointwise scores and item representations, then applies a lightweight encoder over the representations to predict listwise score residuals, avoiding full token-level listwise processing.We also introduce a large-scale benchmark for long-context review ranking with human verification.Experiments show RLPO improves NDCG@k over strong pointwise and listwise baselines and remains robust as list length increases. |
|
2026-01-11 |
Unleashing the Native Recommendation Potential: LLM-Based Generative Recommendation via Structured Term Identifiers
Leveraging the vast open-world knowledge and understanding capabilities of Large Language Models (LLMs) to develop general-purpose, semantically-aware recommender systems has emerged as a pivotal research direction in generative recommendation. 0.798 However, existing methods face bottlenecks in constructing item identifiers. Text-based methods introduce LLMs' vast output space, leading to hallucination, while methods based on Semantic IDs (SIDs) encounter a semantic gap between SIDs and LLMs' native vocabulary, requiring costly vocabulary expansion and alignment training. To address this, this paper introduces Term IDs (TIDs), defined as a set of semantically rich and standardized textual keywords, to serve as robust item identifiers. We propose GRLM, a novel framework centered on TIDs, which employs Context-aware Term Generation to convert an item's metadata into standardized TIDs and utilizes Integrative Instruction Fine-tuning to collaboratively optimize term internalization and sequential recommendation. 0.635 Additionally, Elastic Identifier Grounding is designed for robust item mapping. Extensive experiments on real-world datasets demonstrate that GRLM significantly outperforms baselines across multiple scenarios, pointing to a promising direction for generalizable and high-performance generative recommendation systems. 0.745 |
|
2026-01-10 |
Pragya: An AI-Based Semantic Recommendation System for Sanskrit Subhasitas
Sanskrit Subhasitas encapsulate centuries of cultural and philosophical wisdom, yet remain underutilized in the digital age due to linguistic and contextual barriers.In this work, we present Pragya, a retrieval-augmented generation (RAG) framework for semantic recommendation of Subhasitas. 0.661We curate a dataset of 200 verses annotated with thematic tags such as motivation, friendship, and compassion.Using sentence embeddings (IndicBERT), the system retrieves top-k verses relevant to user queries.The retrieved results are then passed to a generative model (Mistral LLM) to produce transliterations, translations, and contextual explanations.Experimental evaluation demonstrates that semantic retrieval significantly outperforms keyword matching in precision and relevance, while user studies highlight improved accessibility through generated summaries.To our knowledge, this is the first attempt at integrating retrieval and generation for Sanskrit Subhasitas, bridging cultural heritage with modern applied AI. |
|
2026-01-10 |
Mapping and Comparing Climate Equity Policy Practices Using RAG LLM-Based Semantic Analysis and Recommendation Systems
This study investigates the use of large language models to enhance the policymaking process.We first analyze planning-related job postings to revisit the evolving roles of planners in the era of AI.We then examine climate equity plans across the U.S. and apply ChatGPT to conduct semantic analysis, extracting policy, strategy, and action items related to transportation and energy.The methodological framework relied on a LangChain-native retrieval-augmented generation pipeline.Based on these extracted elements and their evaluated presence, we develop a content-based recommendation system to support cross-city policy comparison. 0.604The results indicate that, despite growing attention to AI, planning jobs largely retain their traditional domain emphases in transportation, environmental planning, housing, and land use.Communicative responsibilities remain central to planning practice.Climate equity plans commonly address transportation, environmental, and energy-related measures aimed at reducing greenhouse gas emissions and predominantly employ affirmative language.The demonstration of the recommendation system illustrates how planners can efficiently identify cities with similar policy practices, revealing patterns of geographic similarity in policy adoption. 0.62The study concludes by envisioning localized yet personalized AI-assisted systems that can be adapted within urban systems. |
|
2026-01-08 |
Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing
In recommender systems, online A/B testing is a crucial method for evaluating the performance of different models. 0.626 However, conducting online A/B testing often presents significant challenges, including substantial economic costs, user experience degradation, and considerable time requirements. Given the powerful capabilities of Large Language Models, LLM-based agents show great potential to replace traditional online A/B testing. Nonetheless, current agents fail to simulate the perception process and interaction patterns, due to the lack of real environments and visual perception capability. To address these challenges, we introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions that align with real user behavior on online platforms. 0.605 The designed agent leverages multimodal information perception and fine-grained user preferences, and integrates profiles, action memory retrieval, and a fatigue system to simulate complex human decision-making. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features. Furthermore, we found that the data generated by A/B Agent can effectively enhance the capabilities of recommendation models. 0.669 Our code is publicly available at https://github.com/Applied-Machine-Learning-Lab/ABAgent. |
|
2026-01-08 |
Reasoning Over Space: Enabling Geographic Reasoning for LLM-Based Generative Next POI Recommendation
Generative recommendation with large language models (LLMs) reframes prediction as sequence generation, yet existing LLM-based recommenders remain limited in leveraging geographic signals that are crucial in mobility and local-services scenarios. 0.776Here, we present Reasoning Over Space (ROS), a framework that utilizes geography as a vital decision variable within the reasoning process.ROS introduces a Hierarchical Spatial Semantic ID (SID) that discretizes coarse-to-fine locality and POI semantics into compositional tokens, and endows LLM with a three-stage Mobility Chain-of-Thought (CoT) paradigm that models user personality, constructs an intent-aligned candidate space, and performs locality informed pruning.We further align the model with real world geography via spatial-guided Reinforcement Learning (RL).Experiments on three widely used location-based social network (LBSN) datasets show that ROS achieves over 10% relative gains in hit rate over strongest LLM-based baselines and improves cross-city transfer, despite using a smaller backbone model. |
|
2026-01-08 |
Do LLMs Benefit from User and Item Embeddings in Recommendation Tasks?
Large Language Models (LLMs) have emerged as promising recommendation systems, offering novel ways to model user preferences through generative approaches. 0.841However, many existing methods often rely solely on text semantics or incorporate collaborative signals in a limited manner, typically using only user or item embeddings.These methods struggle to handle multiple item embeddings representing user history, reverting to textual semantics and neglecting richer collaborative information.In this work, we propose a simple yet effective solution that projects user and item embeddings, learned from collaborative filtering, into the LLM token space via separate lightweight projector modules. 0.608A finetuned LLM then conditions on these projected embeddings alongside textual tokens to generate recommendations. 0.83Preliminary results show that this design effectively leverages structured user-item interaction data, improves recommendation performance over text-only LLM baselines, and offers a practical path for bridging traditional recommendation systems with modern LLMs. 0.861 |
|
2026-01-06 |
Netflix Artwork Personalization via LLM Post-training
Large language models (LLMs) have demonstrated success in various applications of user recommendation and personalization across e-commerce and entertainment. 0.735 On many entertainment platforms such as Netflix, users typically interact with a wide range of titles, each represented by an artwork. Since users have diverse preferences, an artwork that appeals to one type of user may not resonate with another with different preferences. Given this user heterogeneity, our work explores the novel problem of personalized artwork recommendations according to diverse user preferences. 0.664 Similar to the multi-dimensional nature of users' tastes, titles contain different themes and tones that may appeal to different viewers. For example, the same title might feature both heartfelt family drama and intense action scenes. Users who prefer romantic content may like the artwork emphasizing emotional warmth between the characters, while those who prefer action thrillers may find high-intensity action scenes more intriguing. Rather than a one-size-fits-all approach, we conduct post-training of pre-trained LLMs to make personalized artwork recommendations, selecting the most preferred visual representation of a title for each user and thereby improving user satisfaction and engagement. Our experimental results with Llama 3.1 8B models (trained on a dataset of 110K data points and evaluated on 5K held-out user-title pairs) show that the post-trained LLMs achieve 3-5% improvements over the Netflix production model, suggesting a promising direction for granular personalized recommendations using LLMs. 0.763 |
|
2026-01-05 |
Exploring Approaches for Detecting Memorization of Recommender System Data in Large Language Models
Large Language Models (LLMs) are increasingly applied in recommendation scenarios due to their strong natural language understanding and generation capabilities. 0.688However, they are trained on vast corpora whose contents are not publicly disclosed, raising concerns about data leakage.Recent work has shown that the MovieLens-1M dataset is memorized by both the LLaMA and OpenAI model families, but the extraction of such memorized data has so far relied exclusively on manual prompt engineering.In this paper, we pose three main questions: Is it possible to enhance manual prompting?Can LLM memorization be detected through methods beyond manual prompting?And can the detection of data leakage be automated?To address these questions, we evaluate three approaches: (i) jailbreak prompt engineering; (ii) unsupervised latent knowledge discovery, probing internal activations via Contrast-Consistent Search (CCS) and Cluster-Norm; and (iii) Automatic Prompt Engineering (APE), which frames prompt discovery as a meta-learning process that iteratively refines candidate instructions.Experiments on MovieLens-1M using LLaMA models show that jailbreak prompting does not improve the retrieval of memorized items and remains inconsistent; CCS reliably distinguishes genuine from fabricated movie titles but fails on numerical user and rating data; and APE retrieves item-level information with moderate success yet struggles to recover numerical interactions.These findings suggest that automatically optimizing prompts is the most promising strategy for extracting memorized samples. |
|
2026-01-05 |
LIA: Supervised Fine-Tuning of Large Language Models for Automatic Issue Assignment
Issue assignment is a critical process in software maintenance, where new issue reports are validated and assigned to suitable developers.However, manual issue assignment is often inconsistent and error-prone, especially in large open-source projects where thousands of new issues are reported monthly.Existing automated approaches have shown promise, but many rely heavily on large volumes of project-specific training data or relational information that is often sparse and noisy, which limits their effectiveness.To address these challenges, we propose LIA (LLM-based Issue Assignment), which employs supervised fine-tuning to adapt an LLM, DeepSeek-R1-Distill-Llama-8B in this work, for automatic issue assignment.By leveraging the LLM's pretrained semantic understanding of natural language and software-related text, LIA learns to generate ranked developer recommendations directly from issue titles and descriptions. 0.612The ranking is based on the model's learned understanding of historical issue-to-developer assignments, using patterns from past tasks to infer which developers are most likely to handle new issues.Through comprehensive evaluation, we show that LIA delivers substantial improvements over both its base pretrained model and state-of-the-art baselines.It achieves up to +187.8% higher Hit@1 compared to the DeepSeek-R1-Distill-Llama-8B pretrained base model, and outperforms four leading issue assignment methods by as much as +211.2% in Hit@1 score.These results highlight the effectiveness of domain-adapted LLMs for software maintenance tasks and establish LIA as a practical, high-performing solution for issue assignment. |
|
2025-12-31 |
GenZ: Foundational models as latent variable generators within traditional statistical models
We present GenZ, a hybrid model that bridges foundational models and statistical modeling through interpretable semantic features. While large language models possess broad domain knowledge, they often fail to capture dataset-specific patterns critical for prediction tasks. Our approach addresses this by discovering semantic feature descriptions through an iterative process that contrasts groups of items identified via statistical modeling errors, rather than relying solely on the foundational model's domain understanding. We formulate this as a generalized EM algorithm that jointly optimizes semantic feature descriptors and statistical model parameters. The method prompts a frozen foundational model to classify items based on discovered features, treating these judgments as noisy observations of latent binary features that predict real-valued targets through learned statistical relationships. We demonstrate the approach on two domains: house price prediction (hedonic regression) and cold-start collaborative filtering for movie recommendations. 0.74 On house prices, our model achieves 12% median relative error using discovered semantic features from multimodal listing data, substantially outperforming a GPT-5 baseline (38% error) that relies on the LLM's general domain knowledge. For Netflix movie embeddings, our model predicts collaborative filtering representations with 0.59 cosine similarity purely from semantic descriptions -- matching the performance that would require approximately 4000 user ratings through traditional collaborative filtering. 0.644 The discovered features reveal dataset-specific patterns (e.g., architectural details predicting local housing markets, franchise membership predicting user preferences) that diverge from the model's domain knowledge alone. |
|
2025-12-30 |
CogRec: A Cognitive Recommender Agent Fusing Large Language Models and Soar for Explainable Recommendation
Large Language Models (LLMs) have demonstrated a remarkable capacity in understanding user preferences for recommendation systems. 0.764 However, they are constrained by several critical challenges, including their inherent "Black-Box" characteristics, susceptibility to knowledge hallucination, and limited online learning capacity. These factors compromise their trustworthiness and adaptability. Conversely, cognitive architectures such as Soar offer structured and interpretable reasoning processes, yet their knowledge acquisition is notoriously laborious. To address these complementary challenges, we propose a novel cognitive recommender agent called CogRec which synergizes the strengths of LLMs with the Soar cognitive architecture. 0.7 CogRec leverages Soar as its core symbolic reasoning engine and uses an LLM for knowledge initialization to populate its working memory with production rules. The agent operates on a Perception-Cognition-Action (PCA) cycle. Upon encountering an impasse, it dynamically queries the LLM to obtain a reasoned solution. This solution is subsequently transformed into a new symbolic production rule via Soar's chunking mechanism, thereby enabling robust online learning. This learning paradigm allows the agent to continuously evolve its knowledge base and furnish highly interpretable rationales for its recommendations. Extensive evaluations conducted on three public datasets show that CogRec offers significant advantages in recommendation accuracy, explainability, and efficacy in addressing the long-tail problem. 0.723 |
|
2025-12-30 |
On the Factual Consistency of Text-based Explainable Recommendation Models
Text-based explainable recommendation aims to generate natural-language explanations that justify item recommendations, to improve user trust and system transparency.Although recent advances leverage LLMs to produce fluent outputs, a critical question remains underexplored: are these explanations factually consistent with the available evidence?We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. 0.636We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews, thereby constructing a ground truth that isolates and focuses on their factual content.Applying this pipeline to five categories from the Amazon Reviews dataset, we create augmented benchmarks for fine-grained evaluation of explanation quality.We further propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both factual consistency and relevance of generated explanations.Across extensive experiments on six state-of-the-art explainable recommendation models, we uncover a critical gap: while models achieve high semantic similarity scores (BERTScore F1: 0.81-0.90), all our factuality metrics reveal alarmingly low performance (LLM-based statement-level precision: 4.38%-32.88%). 0.721These findings underscore the need for factuality-aware evaluation in explainable recommendation and provide a foundation for developing more trustworthy explanation systems. |
|
2025-12-29 |
Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans
We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. 0.605 Given a model response, an annotator provides fine-grained feedback by marking "liked" and "disliked" spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning. |
Production workflows for LLMs |
|
|
2026-01-14 |
LLM for Large-Scale Optimization Model Auto-Formulation: A Lightweight Few-Shot Learning Approach
Large-scale optimization is a key backbone of modern business decision-making. 0.521 However, building these models is often labor-intensive and time-consuming. 0.322 We address this by proposing LEAN-LLM-OPT, a LightwEight AgeNtic workflow construction framework for LLM-assisted large-scale OPTimization auto-formulation. 0.552 LEAN-LLM-OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. 0.535 Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step-by-step, how optimization models for similar problems can be formulated. 0.385 A downstream LLM agent then follows this workflow to generate the final output. Leveraging LLMs' text-processing capabilities and common modeling practices, the workflow decomposes the modeling task into a sequence of structured sub-tasks and offloads mechanical data-handling operations to auxiliary tools. 0.408 This design alleviates the downstream agent's burden related to planning and data handling, allowing it to focus on the most challenging components that cannot be readily standardized. 0.339 Extensive simulations show that LEAN-LLM-OPT, instantiated with GPT-4.1 and the open source gpt-oss-20B, achieves strong performance on large-scale optimization modeling tasks and is competitive with state-of-the-art approaches. 0.697 In addition, in a Singapore Airlines choice-based revenue management use case, LEAN-LLM-OPT demonstrates practical value by achieving leading performance across a range of scenarios. 0.534 Along the way, we introduce Large-Scale-OR and Air-NRM, the first comprehensive benchmarks for large-scale optimization auto-formulation. 0.537 The code and data of this work are available at https://github.com/CoraLiang01/lean-llm-opt. 0.449 |
|
2026-01-14 |
Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection
Large Language Model (LLM) routers dynamically select optimal models for given inputs. 0.711Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. 0.503We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. 0.441We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. 0.593Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool.We then show how filtering for these characteristics can improve the quality of generated data.We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. 0.406CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data. 0.545 |
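The consensus-voting idea mentioned in the abstract above (estimating model correctness without ground-truth labels) can be illustrated with a minimal sketch. This is not the CASCAL implementation: the function names, the exact-match comparison of answers, and the tie-breaking rule are assumptions made for illustration, and the hierarchical clustering step is omitted.

```python
from collections import Counter
from typing import Dict, List, Tuple

def consensus_labels(answers: Dict[str, str]) -> Tuple[str, Dict[str, bool]]:
    """For one query, treat the most common answer across models as a pseudo-label.

    `answers` maps a (hypothetical) model name to its answer string.
    Ties are broken by first occurrence, an arbitrary choice for this sketch.
    """
    counts = Counter(answers.values())
    majority, _ = counts.most_common(1)[0]
    agrees = {model: answer == majority for model, answer in answers.items()}
    return majority, agrees

def estimate_accuracy(all_answers: List[Dict[str, str]]) -> Dict[str, float]:
    """Estimate each model's accuracy as its agreement rate with the consensus."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for answers in all_answers:
        _, agrees = consensus_labels(answers)
        for model, ok in agrees.items():
            totals[model] += 1
            hits[model] += int(ok)
    return {model: hits[model] / totals[model] for model in totals}

if __name__ == "__main__":
    queries = [
        {"model_a": "x", "model_b": "x", "model_c": "y"},
        {"model_a": "1", "model_b": "2", "model_c": "1"},
    ]
    print(estimate_accuracy(queries))
```

A router could then prefer the model with the highest estimated agreement rate for a given group of queries; how queries are grouped is left open here.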
|
2026-01-14 |
LLMs can Compress LLMs: Adaptive Pruning by Agents
As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. 0.809Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. 0.374Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. 0.439We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. 0.419Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. 0.359These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. 0.316A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. 0.35We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. 0.534Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models. 0.648 |
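Two mechanical pieces named in the abstract above, z-score normalization of per-layer sensitivity statistics and a perplexity-based rollback check, can be sketched as follows. This is an illustrative sketch only: the function names, the use of a relative perplexity increase, and the 5% threshold are assumptions, not values taken from the paper.

```python
from statistics import mean, pstdev
from typing import List

def z_scores(values: List[float]) -> List[float]:
    """Normalize raw per-layer sensitivity scores to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All layers look identical; no layer stands out.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def keep_pruning_step(prev_ppl: float, new_ppl: float,
                      max_rel_increase: float = 0.05) -> bool:
    """Return True to keep the step, False to roll back to the last checkpoint.

    `max_rel_increase` is a hypothetical threshold chosen for this sketch.
    """
    return (new_ppl - prev_ppl) / prev_ppl <= max_rel_increase

if __name__ == "__main__":
    sensitivities = [1.0, 2.0, 3.0]  # made-up per-layer scores
    print(z_scores(sensitivities))
    print(keep_pruning_step(10.0, 10.4))  # small degradation: keep
    print(keep_pruning_step(10.0, 11.0))  # large degradation: roll back
```

Normalizing to z-scores makes the statistics comparable across models of different sizes, which is presumably what the abstract means by model-agnostic comparison.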
LLM Model Architectures and Training Techniques |
|
|
2026-01-14 |
Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats
Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs).Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored.To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. 0.479The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. 0.449Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization. |
|
2026-01-14 |
Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection
Multi-Task Learning (MTL) combined with Low-Rank Adaptation (LoRA) has emerged as a promising direction for parameter-efficient deployment of Large Language Models (LLMs). 0.495 By sharing a single adapter across multiple tasks, one can significantly reduce storage overhead. 0.461 However, this approach suffers from negative transfer, where conflicting gradient updates from distinct tasks degrade the performance of individual tasks compared to single-task fine-tuning. 0.596 This problem is exacerbated in LoRA due to the low-rank constraint, which limits the optimization landscape's capacity to accommodate diverse task requirements. 0.561 In this paper, we propose Ortho-LoRA, a gradient projection method specifically tailored for the bipartite structure of LoRA. Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace. 0.41 Extensive experiments on the GLUE benchmark demonstrate that Ortho-LoRA effectively mitigates task interference, outperforming standard joint training and recovering 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead. 0.453 |
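The core operation described above, projecting a conflicting task gradient onto the orthogonal complement of another, can be sketched on plain vectors. This is a generic conflict-projection sketch and not the Ortho-LoRA method itself: it ignores the LoRA factor structure, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def project_out_conflict(g: Vector, other: Vector) -> Vector:
    """Remove from g its component along `other`, but only if the two conflict.

    Two gradients are treated as conflicting when their dot product is negative.
    """
    d = dot(g, other)
    denom = dot(other, other)
    if d >= 0.0 or denom == 0.0:
        return list(g)  # no conflict (or zero gradient): leave g unchanged
    scale = d / denom
    return [x - scale * y for x, y in zip(g, other)]

def combine_task_gradients(g_a: Vector, g_b: Vector) -> Vector:
    """Project each task gradient against the other, then sum the results."""
    p_a = project_out_conflict(g_a, g_b)
    p_b = project_out_conflict(g_b, g_a)
    return [x + y for x, y in zip(p_a, p_b)]

if __name__ == "__main__":
    g_a = [1.0, 0.0]
    g_b = [-1.0, 1.0]  # conflicts with g_a: dot product is -1
    print(project_out_conflict(g_a, g_b))
    print(combine_task_gradients(g_a, g_b))
```

After projection, each gradient has no component pointing against the other task, so a shared update no longer directly undoes either task's progress.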
Programming applications of LLMs |
|
|
2026-01-14 |
Programming over Thinking: Efficient and Robust Multi-Constraint Planning
Multi-constraint planning involves identifying, evaluating, and refining candidate plans while satisfying multiple, potentially conflicting constraints.Existing large language model (LLM) approaches face fundamental limitations in this domain.Pure reasoning paradigms, which rely on long natural language chains, are prone to inconsistency, error accumulation, and prohibitive cost as constraints compound.Conversely, LLMs combined with coding- or solver-based strategies lack flexibility: they often generate problem-specific code from scratch or depend on fixed solvers, failing to capture generalizable logic across diverse problems.To address these challenges, we introduce the Scalable COde Planning Engine (SCOPE), a framework that disentangles query-specific reasoning from generic code execution. 0.675By separating reasoning from execution, SCOPE produces solver functions that are consistent, deterministic, and reusable across queries while requiring only minimal changes to input parameters.SCOPE achieves state-of-the-art performance while lowering cost and latency.For example, with GPT-4o, it reaches 93.1% success on TravelPlanner, a 61.6% gain over the best baseline (CoT) while cutting inference cost by 1.4x and time by ~4.67x.Code is available at https://github.com/DerrickGXD/SCOPE. |
|
2026-01-14 |
SlidesGen-Bench: Evaluating Slides Generation via Computational and Quantitative Metrics
The rapid evolution of Large Language Models (LLMs) has fostered diverse paradigms for automated slide generation, ranging from code-driven layouts to image-centric synthesis. 0.734However, evaluating these heterogeneous systems remains challenging, as existing protocols often struggle to provide comparable scores across architectures or rely on uncalibrated judgments.In this paper, we introduce SlidesGen-Bench, a benchmark designed to evaluate slide generation through a lens of three core principles: universality, quantification, and reliability.First, to establish a unified evaluation framework, we ground our analysis in the visual domain, treating terminal outputs as renderings to remain agnostic to the underlying generation method.Second, we propose a computational approach that quantitatively assesses slides across three distinct dimensions - Content, Aesthetics, and Editability - offering reproducible metrics where prior works relied on subjective or reference-dependent proxies.Finally, to ensure high correlation with human preference, we construct the Slides-Align1.5k dataset, a human preference aligned dataset covering slides from nine mainstream generation systems across seven scenarios.Our experiments demonstrate that SlidesGen-Bench achieves a higher degree of alignment with human judgment than existing evaluation pipelines.Our code and data are available at https://github.com/YunqiaoYang/SlidesGen-Bench. |
|
2026-01-14 |
How well LLM-based test generation techniques perform with newer LLM versions?
The rapid evolution of Large Language Models (LLMs) has strongly impacted software engineering, leading to a growing number of studies on automated unit test generation. 0.778 However, the standalone use of LLMs without post-processing has proven insufficient, often producing tests that fail to compile or achieve high coverage. Several techniques have been proposed to address these issues, reporting improvements in test compilation and coverage. While important, LLM-based test generation techniques have been evaluated against relatively weak baselines (by today's standards), i.e., old LLM versions and relatively weak prompts, which may exaggerate the performance contribution of the approaches. In other words, stronger (newer) LLMs may obviate any advantage these techniques bring. We investigate this issue by replicating four state-of-the-art LLM-based test generation tools, HITS, SymPrompt, TestSpark, and CoverUp, that include engineering components aimed at guiding the test generation process through compilation and execution feedback, and evaluate their relative effectiveness and efficiency over a plain LLM test generation method. We integrate current LLM versions in all approaches and run an experiment on 393 classes and 3,657 methods. Our results show that the plain LLM approach can outperform previous state-of-the-art approaches in all test effectiveness metrics we used: line coverage (by 17.72%), branch coverage (by 19.80%) and mutation score (by 20.92%), and it does so at a comparable cost (LLM queries). We also observe that the granularity at which the plain LLM is applied has a significant impact on the cost. We therefore propose targeting first the program classes, where test generation is more efficient, and then the uncovered methods to reduce the number of LLM requests. This strategy achieves comparable (slightly higher) effectiveness while requiring about 20% fewer LLM requests. |
|
2026-01-14 |
ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation
Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development efforts and enhancing software productivity. 0.865The emergence of large language models (LLMs) has significantly advanced code generation, though their efficiency is still impacted by certain inherent architectural constraints. 0.899Each token generation necessitates a complete inference pass, requiring persistent retention of contextual information in memory and escalating resource consumption.While existing research prioritizes inference-phase optimizations such as prompt compression and model quantization, the generation phase remains underexplored.To tackle these challenges, we propose a knowledge-infused framework named ShortCoder, which optimizes code generation efficiency while preserving semantic equivalence and readability.In particular, we introduce: (1) ten syntax-level simplification rules for Python, derived from AST-preserving transformations, achieving 18.1% token reduction without functional compromise; (2) a hybrid data synthesis pipeline integrating rule-based rewriting with LLM-guided refinement, producing ShorterCodeBench, a corpus of validated tuples of original code and simplified code with semantic consistency; (3) a fine-tuning strategy that injects conciseness awareness into the base LLMs.Extensive experimental results demonstrate that ShortCoder consistently outperforms state-of-the-art methods on HumanEval, achieving an improvement of 18.1%-37.8% in generation efficiency over previous methods while ensuring the performance of code generation. |
|
2026-01-13 |
Large Language Models to Enhance Multi-task Drone Operations in Simulated Environments
Benefiting from the rapid advancements in large language models (LLMs), human-drone interaction has reached unprecedented opportunities.In this paper, we propose a method that integrates a fine-tuned CodeT5 model with the Unreal Engine-based AirSim drone simulator to efficiently execute multi-task operations using natural language commands.This approach enables users to interact with simulated drones through prompts or command descriptions, allowing them to easily access and control the drone's status, significantly lowering the operational threshold.In the AirSim simulator, we can flexibly construct visually realistic dynamic environments to simulate drone applications in complex scenarios.By combining a large dataset of (natural language, program code) command-execution pairs generated by ChatGPT with developer-written drone code as training data, we fine-tune the CodeT5 to achieve automated translation from natural language to executable code for drone tasks. 0.687Experimental results demonstrate that the proposed method exhibits superior task execution efficiency and command understanding capabilities in simulated environments.In the future, we plan to extend the model functionality in a modular manner, enhancing its adaptability to complex scenarios and driving the application of drone technologies in real-world environments. |
|
2026-01-13 |
Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering
Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust "truth centroid" through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Furthermore, extensive results demonstrate strong generalization ability and robustness. The code will be released soon. 0.661 |
|
2026-01-13 |
Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models
Channel configuration search, the optimization of layer specifications such as layer widths in deep neural networks, presents a complex combinatorial challenge constrained by tensor shape compatibility and computational budgets. We posit that Large Language Models (LLMs) offer a transformative approach to Neural Architecture Search (NAS), capable of reasoning about architectural code structure in ways that traditional heuristics cannot. In this paper, we investigate the application of an LLM-driven NAS framework to the problem of channel configuration. We formulate the search as a sequence of conditional code generation tasks, where an LLM refines architectural specifications based on performance telemetry. 0.645 Crucially, we address the data scarcity problem by generating a vast corpus of valid, shape-consistent architectures via Abstract Syntax Tree (AST) mutations. While these mutated networks are not necessarily high-performing, they provide the critical volume of structural data required for the LLM to learn the latent relationship between channel configurations and model performance. This allows the LLM to internalize complex design patterns and apply them to optimize feature extraction strategies. Experimental results on CIFAR-100 validate the efficacy of this approach, demonstrating that the model yields statistically significant improvements in accuracy. Our analysis confirms that the LLM successfully acquires domain-specific architectural priors, distinguishing this method from random search and highlighting the immense potential of language-driven design in deep learning. |
|
2026-01-13 |
Learner-Tailored Program Repair: A Solution Generator with Iterative Edit-Driven Retrieval Enhancement
With the development of large language models (LLMs) in the field of programming, intelligent programming coaching systems have gained widespread attention. 0.817However, most research focuses on repairing the buggy code of programming learners without providing the underlying causes of the bugs.To address this gap, we introduce a novel task, namely LPR (Learner-Tailored Program Repair).We then propose a novel and effective framework, the Learner-Tailored Solution Generator, to enhance program repair while offering the bug descriptions for the buggy code.In the first stage, we utilize a repair solution retrieval framework to construct a solution retrieval database and then employ an edit-driven code retrieval approach to retrieve valuable solutions, guiding LLMs in identifying and fixing the bugs in buggy code.In the second stage, we propose a solution-guided program repair method, which fixes the code and provides explanations under the guidance of retrieval solutions.Moreover, we propose an Iterative Retrieval Enhancement method that utilizes evaluation results of the generated code to iteratively optimize the retrieval direction and explore more suitable repair strategies, improving performance in practical programming coaching scenarios. 0.646The experimental results show that our approach outperforms a set of baselines by a large margin, validating the effectiveness of our framework for the newly proposed LPR task. |
|
2026-01-12 |
Bridging the Gap: Empowering Small Models in Reliable OpenACC-based Parallelization via GEPA-Optimized Prompting
OpenACC lowers the barrier to GPU offloading, but writing high-performing pragma remains complex, requiring deep domain expertise in memory hierarchies, data movement, and parallelization strategies.Large Language Models (LLMs) present a promising potential solution for automated parallel code generation, but naive prompting often results in syntactically incorrect directives, uncompilable code, or performance that fails to exceed CPU baselines. 0.854We present a systematic prompt optimization approach to enhance OpenACC pragma generation without the prohibitive computational costs associated with model post-training.Leveraging the GEPA (GEnetic-PAreto) framework, we iteratively evolve prompts through a reflective feedback loop.This process utilizes crossover and mutation of instructions, guided by expert-curated gold examples and structured feedback based on clause- and clause parameter-level mismatches between the gold and predicted pragma.In our evaluation on the PolyBench suite, we observe an increase in compilation success rates for programs annotated with OpenACC pragma generated using the optimized prompts compared to those annotated using the simpler initial prompt, particularly for the "nano"-scale models. 0.691Specifically, with optimized prompts, the compilation success rate for GPT-4.1 Nano surged from 66.7% to 93.3%, and for GPT-5 Nano improved from 86.7% to 100%, matching or surpassing the capabilities of their significantly larger, more expensive versions.Beyond compilation, the optimized prompts resulted in a 21% increase in the number of programs that achieve functional GPU speedups over CPU baselines.These results demonstrate that prompt optimization effectively unlocks the potential of smaller, cheaper LLMs in writing stable and effective GPU-offloading directives, establishing a cost-effective pathway to automated directive-based parallelization in HPC workflows. |
|
2026-01-12 |
Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?
Code-switching is a pervasive phenomenon in multilingual communication, yet the robustness of large language models (LLMs) in mixed-language settings remains insufficiently understood.In this work, we present a comprehensive evaluation of LLM capabilities in understanding, reasoning over, and generating code-switched text. 0.724We introduce CodeMixQA, a novel benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants that span multiple geographic regions and code-switching patterns, and include both original scripts and their transliterated forms.Using this benchmark, we analyze the reasoning behavior of LLMs on code-switched question-answering tasks, shedding light on how models process and reason over mixed-language inputs.We further conduct a systematic evaluation of LLM-generated synthetic code-switched text, focusing on both naturalness and semantic fidelity, and uncover key limitations in current generation capabilities. 0.689Our findings reveal persistent challenges in both reasoning and generation under code-switching conditions and provide actionable insights for building more robust multilingual LLMs.We release the dataset and code as open source. |
|
2026-01-12 |
AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units
To meet the ever-increasing demand for computational efficiency, Neural Processing Units (NPUs) have become critical in modern AI infrastructure.However, unlocking their full potential requires developing high-performance compute kernels using vendor-specific Domain-Specific Languages (DSLs), a task that demands deep hardware expertise and is labor-intensive.While Large Language Models (LLMs) have shown promise in general code generation, they struggle with the strict constraints and scarcity of training data in the NPU domain. 0.775Our preliminary study reveals that state-of-the-art general-purpose LLMs fail to generate functional complex kernels for Ascend NPUs, yielding a near-zero success rate.To address these challenges, we propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development.We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations, and KernelGen-LM, a domain-adaptive model trained via supervised fine-tuning and reinforcement learning with execution feedback.Furthermore, we design NPUKernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels.Experimental results demonstrate that our approach significantly bridges the gap between general LLMs and hardware-specific coding.Specifically, the compilation success rate on complex Level-2 kernels improves from 0% to 95.5% (Pass@10), while functional correctness achieves 64.3% compared to the baseline's complete failure.These results highlight the critical role of domain-specific reasoning and rigorous evaluation in automating accelerator-aware code generation. |
|
2026-01-12 |
Small Symbols, Big Risks: Exploring Emoticon Semantic Confusion in Large Language Models
Emoticons are widely used in digital communication to convey affective intent, yet their safety implications for Large Language Models (LLMs) remain largely unexplored.In this paper, we identify emoticon semantic confusion, a vulnerability where LLMs misinterpret ASCII-based emoticons to perform unintended and even destructive actions.To systematically study this phenomenon, we develop an automated data generation pipeline and construct a dataset containing 3,757 code-oriented test cases spanning 21 meta-scenarios, four programming languages, and varying contextual complexities. 0.758Our study on six LLMs reveals that emoticon semantic confusion is pervasive, with an average confusion ratio exceeding 38%.More critically, over 90% of confused responses yield 'silent failures', which are syntactically valid outputs but deviate from user intent, potentially leading to destructive security consequences.Furthermore, we observe that this vulnerability readily transfers to popular agent frameworks, while existing prompt-based mitigations remain largely ineffective.We call on the community to recognize this emerging vulnerability and develop effective mitigation methods to uphold the safety and reliability of the LLM system. |
|
2026-01-12 |
IFDNS: An Iterative Feedback-Driven Neuro-Symbolic Method for Faithful Logical Reasoning
Large language models (LLMs) have demonstrated impressive capabilities across a wide range of reasoning tasks, including logical and mathematical problem-solving. 0.651While prompt-based methods like Chain-of-Thought (CoT) can enhance LLM reasoning abilities to some extent, they often suffer from a lack of faithfulness, where the derived conclusions may not align with the generated reasoning chain.To address this issue, researchers have explored neuro-symbolic approaches to bolster LLM logical reasoning capabilities.However, existing neuro-symbolic methods still face challenges with information loss during the process.To overcome these limitations, we introduce Iterative Feedback-Driven Neuro-Symbolic (IFDNS), a novel prompt-based method that employs a multi-round feedback mechanism to address LLM limitations in handling complex logical relationships.IFDNS utilizes iterative feedback during the logic extraction phase to accurately extract causal relationship statements and translate them into propositional and logical implication expressions, effectively mitigating information loss issues.Furthermore, IFDNS is orthogonal to existing prompt methods, allowing for seamless integration with various prompting approaches.Empirical evaluations across six datasets demonstrate the effectiveness of IFDNS in significantly improving the performance of CoT and Chain-of-Thought with Self-Consistency (CoT-SC).Specifically, IFDNS achieves a +9.40% accuracy boost for CoT on the LogiQA dataset and a +11.70% improvement for CoT-SC on the PrOntoQA dataset. |
|
2026-01-12 |
GRPO with State Mutations: Improving LLM-Based Hardware Test Plan Generation
RTL design often relies heavily on ad-hoc testbench creation early in the design cycle.While large language models (LLMs) show promise for RTL code generation, their ability to reason about hardware specifications and generate targeted test plans remains largely unexplored. 0.641We present the first systematic study of LLM reasoning capabilities for RTL verification stimuli generation, establishing a two-stage framework that decomposes test plan generation from testbench execution.Our benchmark reveals that state-of-the-art models, including DeepSeek-R1 and Claude-4.0-Sonnet, achieve only 15.7-21.7% success rates on generating stimuli that pass golden RTL designs.To improve LLM generated stimuli, we develop a comprehensive training methodology combining supervised fine-tuning with a novel reinforcement learning approach, GRPO with State Mutation (GRPO-SMu), which enhances exploration by varying input mutations.Our approach leverages a tree-based branching mutation strategy to construct training data comprising equivalent and mutated trees, moving beyond linear mutation approaches to provide rich learning signals.Training on this curated dataset, our 7B parameter model achieves a 33.3% golden test pass rate and a 13.9% mutation detection rate, representing a 17.6% absolute improvement over baseline and outperforming much larger general-purpose models.These results demonstrate that specialized training methodologies can significantly enhance LLM reasoning capabilities for hardware verification tasks, establishing a foundation for automated sub-unit testing in semiconductor design workflows. |
|
2026-01-12 |
OODEval: Evaluating Large Language Models on Object-Oriented Design
Recent advances in large language models (LLMs) have driven extensive evaluations in software engineering. 0.726However, most prior work concentrates on code-level tasks, leaving software design capabilities underexplored.To fill this gap, we conduct a comprehensive empirical study evaluating 29 LLMs on object-oriented design (OOD) tasks.Owing to the lack of standardized benchmarks and metrics, we introduce OODEval, a manually constructed benchmark comprising 50 OOD tasks of varying difficulty, and OODEval-Human, the first human-rated OOD benchmark, which includes 940 undergraduate-submitted class diagrams evaluated by instructors.We further propose CLUE (Class Likeness Unified Evaluation), a unified metric set that assesses both global correctness and fine-grained design quality in class diagram generation.Using these benchmarks and metrics, we investigate five research questions: overall correctness, comparison with humans, model dimension analysis, task feature analysis, and bad case analysis.The results indicate that while LLMs achieve high syntactic accuracy, they exhibit substantial semantic deficiencies, particularly in method and relationship generation.Among the evaluated models, Qwen3-Coder-30B achieves the best overall performance, rivaling DeepSeek-R1 and GPT-4o, while Gemma3-4B-IT outperforms GPT-4o-Mini despite its smaller parameter scale.Although top-performing LLMs nearly match the average performance of undergraduates, they remain significantly below the level of the best human designers.Further analysis shows that parameter scale, code specialization, and instruction tuning strongly influence performance, whereas increased design complexity and lower requirement readability degrade it.Bad case analysis reveals common failure modes, including keyword misuse, missing classes or relationships, and omitted methods. |
|
2026-01-12 |
"TODO: Fix the Mess Gemini Created": Towards Understanding GenAI-Induced Self-Admitted Technical Debt
As large language models (LLMs) such as ChatGPT, Copilot, Claude, and Gemini become integrated into software development workflows, developers increasingly leave traces of AI involvement in their code comments. 0.821Among these, some comments explicitly acknowledge both the use of generative AI and the presence of technical shortcomings.Analyzing 6,540 LLM-referencing code comments from public Python and JavaScript-based GitHub repositories (November 2022-July 2025), we identified 81 that also self-admit technical debt (SATD).Developers most often describe postponed testing, incomplete adaptation, and limited understanding of AI-generated code, suggesting that AI assistance affects both when and why technical debt emerges.We propose the term GenAI-Induced Self-admitted Technical debt (GIST) as a conceptual lens to describe recurring cases where developers incorporate AI-generated code while explicitly expressing uncertainty about its behavior or correctness. |
|
2026-01-12 |
Cognitive Biases in LLM-Assisted Software Development
The widespread adoption of Large Language Models (LLMs) in software development is transforming programming from a solution-generative to a solution-evaluative activity. 0.901This shift opens a pathway for new cognitive challenges that amplify existing decision-making biases or create entirely novel ones.One such type of challenge stems from cognitive biases, which are thinking patterns that lead people away from logical reasoning and result in sub-optimal decisions.How do cognitive biases manifest and impact decision-making in emerging AI-collaborative development?This paper presents the first comprehensive study of cognitive biases in LLM-assisted development.We employ a mixed-methods approach, combining observational studies with 14 student and professional developers, followed by surveys with 22 additional developers.We qualitatively compare categories of biases affecting developers against the traditional non-LLM workflows.Our findings suggest that LLM-related actions are more likely to be associated with novel biases.Through a systematic analysis of 90 cognitive biases specific to developer-LLM interactions, we develop a taxonomy of 15 bias categories validated by cognitive psychologists.We found that 48.8% of total programmer actions are biased, and developer-LLM interactions account for 56.4% of these biased actions.We discuss how these bias categories manifest, present tools and practices for developers, and recommendations for LLM tool builders to help mitigate cognitive biases in human-AI programming. |
|
2026-01-12 |
ReMIND: Orchestrating Modular Large Language Models for Controllable Serendipity A REM-Inspired System Design for Emergent Creative Ideation
Large language models (LLMs) are used not only for problem solving but also for creative ideation; however, eliciting serendipitous insights that are both novel and internally coherent remains difficult. 0.694While stochastic sampling promotes novelty, it often degrades consistency.Here, we propose ReMIND, a REM-inspired modular framework for ideation.ReMIND consists of four stages: wake, which generates a stable low-temperature semantic baseline; dream, which performs high-temperature exploratory generation; judge, which applies coarse evaluation to filter incoherent outputs and extract candidate ideas; and re-wake, which re-articulates selected ideas into coherent final outputs.By instantiating each stage as an independent LLM, ReMIND enables functional separation between exploration and consolidation.Parameter sweeps show that ReMIND reliably induces semantic exploration while preserving downstream stability.Embedding-based analyses confirm substantial semantic displacement during the dream phase, whereas external evaluations reveal that high-quality ideas emerge sporadically rather than as extrema along any single metric.These results suggest that serendipitous ideation in LLMs is a rare-event process best approached through system level design that shapes the conditions under which valuable ideas can emerge and be stabilized.ReMIND provides a general framework for studying the computational basis of serendipity and illustrates how modular LLM orchestration can bridge exploration and stabilization. |
|
2026-01-11 |
CHASE: LLM Agents for Dissecting Malicious PyPI Packages
Modern software package registries like PyPI have become critical infrastructure for software development, but are increasingly exploited by threat actors distributing malicious packages with sophisticated multi-stage attack chains.While Large Language Models (LLMs) offer promising capabilities for automated code analysis, their application to security-critical malware detection faces fundamental challenges, including hallucination and context confusion, which can lead to missed detections or false alarms. 0.704We present CHASE (Collaborative Hierarchical Agents for Security Exploration), a high-reliability multi-agent architecture that addresses these limitations through a Plan-and-Execute coordination model, specialized Worker Agents focused on specific analysis aspects, and integration with deterministic security tools for critical operations.Our key insight is that reliability in LLM-based security analysis emerges not from improving individual model capabilities but from architecting systems that compensate for LLM weaknesses while leveraging their semantic understanding strengths.Evaluation on a dataset of 3,000 packages (500 malicious, 2,500 benign) demonstrates that CHASE achieves 98.4% recall with only 0.08% false positive rate, while maintaining a practical median analysis time of 4.5 minutes per package, making it suitable for operational deployment in automated package screening.Furthermore, we conducted a survey with cybersecurity professionals to evaluate the generated analysis reports, identifying their key strengths and areas for improvement.This work provides a blueprint for building reliable AI-powered security tools that can scale with the growing complexity of modern software supply chains.Our project page is available at https://t0d4.github.io/CHASE-AIware25/ |
|
2026-01-11 |
X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity.However, current Code LLMs still rely heavily on real-world data, which limits their scalability.In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real-world data. 0.684To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith.SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine-tuning and reinforcement learning.Based on the proposed synthetic SFT and RL datasets, we introduce the X-Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B despite having only 7B parameters.In-depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale.We further provide insights into code-centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis.Our findings demonstrate that scaling high-quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real-world coding data. |
|
2026-01-11 |
How Secure is Secure Code Generation? Adversarial Prompts Put LLM Defenses to the Test
Recent secure code generation methods, using vulnerability-aware fine-tuning, prefix-tuning, and prompt optimization, claim to prevent LLMs from producing insecure code. 0.643However, their robustness under adversarial conditions remains untested, and current evaluations decouple security from functionality, potentially inflating reported gains.We present the first systematic adversarial audit of state-of-the-art secure code generation methods (SVEN, SafeCoder, PromSec).We subject them to realistic prompt perturbations such as paraphrasing, cue inversion, and context manipulation that developers might inadvertently introduce or adversaries deliberately exploit.To enable fair comparison, we evaluate all methods under consistent conditions, jointly assessing security and functionality using multiple analyzers and executable tests.Our findings reveal critical robustness gaps: static analyzers overestimate security by 7 to 21 times, with 37 to 60% of "secure" outputs being non-functional.Under adversarial conditions, true secure-and-functional rates collapse to 3 to 17%.Based on these findings, we propose best practices for building and evaluating robust secure code generation methods. 0.668Our code is available. |
|
2026-01-10 |
KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks
Open-ended tasks, such as coding problems that are common in computer science education, provide detailed insights into student knowledge.However, training large language models (LLMs) to simulate and predict possible student errors in their responses to these problems can be challenging: they often suffer from mode collapse and fail to fully capture the diversity in syntax, style, and solution approach in student responses.In this work, we present KASER (Knowledge-Aligned Student Error Simulator), a novel approach that aligns errors with student knowledge.We propose a training method based on reinforcement learning using a hybrid reward that reflects three aspects of student code prediction: i) code similarity to the ground-truth, ii) error matching, and iii) code prediction diversity. 0.661On two real-world datasets, we perform two levels of evaluation and show that: At the per-student-problem pair level, our method outperforms baselines on code and error prediction; at the per-problem level, our method outperforms baselines on error coverage and simulated code diversity. |
|
2026-01-08 |
AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation
Recent advancements in large language models (LLMs) have automated various software engineering tasks, with benchmarks emerging to evaluate their capabilities. 0.759However, for adaptation, a critical activity during code reuse, there is no benchmark to assess LLMs' performance, leaving their practical utility in this area unclear.To fill this gap, we propose AdaptEval, a benchmark designed to evaluate LLMs on code snippet adaptation. 0.739Unlike existing benchmarks, AdaptEval incorporates the following three distinctive features:First, Practical Context.Tasks in AdaptEval are derived from developers' practices, preserving rich contextual information from Stack Overflow and GitHub communities.Second, Multi-granularity Annotation.Each task is annotated with requirements at both task and adaptation levels, supporting the evaluation of LLMs across diverse adaptation scenarios.Third, Fine-grained Evaluation.AdaptEval includes a two-tier testing framework combining adaptation-level and function-level tests, which enables evaluating LLMs' performance across various individual adaptations.Based on AdaptEval, we conduct the first empirical study to evaluate six instruction-tuned LLMs and especially three reasoning LLMs on code snippet adaptation. 0.662Experimental results demonstrate that AdaptEval enables the assessment of LLMs' adaptation capabilities from various perspectives.It also provides critical insights into their current limitations, particularly their struggle to follow explicit instructions.We hope AdaptEval can facilitate further investigation and enhancement of LLMs' capabilities in code snippet adaptation, supporting their real-world applications. 0.68 |
|
2026-01-08 |
Conversational AI for Rapid Scientific Prototyping: A Case Study on ESA's ELOPE Competition
Large language models (LLMs) are increasingly used as coding partners, yet their role in accelerating scientific discovery remains underexplored. 0.694This paper presents a case study of using ChatGPT for rapid prototyping in ESA's ELOPE (Event-based Lunar OPtical flow Egomotion estimation) competition.The competition required participants to process event camera data to estimate lunar lander trajectories.Despite joining late, we achieved second place with a score of 0.01282, highlighting the potential of human-AI collaboration in competitive scientific settings.ChatGPT contributed not only executable code but also algorithmic reasoning, data handling routines, and methodological suggestions, such as using a fixed number of events instead of fixed time spans for windowing.At the same time, we observed limitations: the model often introduced unnecessary structural changes, got confused by intermediate discussions about alternative ideas, occasionally produced critical errors, and forgot important aspects in longer scientific discussions.By analyzing these strengths and shortcomings, we show how conversational AI can both accelerate development and support conceptual insight in scientific research.We argue that structured integration of LLMs into the scientific workflow can enhance rapid prototyping by proposing best practices for AI-assisted scientific work. |
|
2026-01-08 |
CurricuLLM: Designing Personalized and Workforce-Aligned Cybersecurity Curricula Using Fine-Tuned LLMs
The cybersecurity landscape is constantly evolving, driven by increased digitalization and new cybersecurity threats.Cybersecurity programs often fail to equip graduates with skills demanded by the workforce, particularly concerning recent developments in cybersecurity, as curriculum design is costly and labor-intensive.To address this misalignment, we present a novel Large Language Model (LLM)-based framework for automated design and analysis of cybersecurity curricula, called CurricuLLM. 0.648Our approach provides three key contributions: (1) automation of personalized curriculum design, (2) a data-driven pipeline aligned with industry demands, and (3) a comprehensive methodology for leveraging fine-tuned LLMs in curriculum development. CurricuLLM utilizes a two-tier approach consisting of PreprocessLM, which standardizes input data, and ClassifyLM, which assigns course content to nine Knowledge Areas in cybersecurity.We systematically evaluated multiple Natural Language Processing (NLP) architectures and fine-tuning strategies, ultimately selecting the Bidirectional Encoder Representations from Transformers (BERT) model as ClassifyLM, fine-tuned on foundational cybersecurity concepts and workforce competencies. We are the first to validate our method with human experts who analyzed real-world cybersecurity curricula and frameworks, motivating that CurricuLLM is an efficient solution to replace labor-intensive curriculum analysis.Moreover, once course content has been classified, it can be integrated with established cybersecurity role-based weights, enabling alignment of the educational program with specific job roles, workforce categories, or general market needs.This lays the foundation for personalized, workforce-aligned cybersecurity curricula that prepare students for the evolving demands in cybersecurity. |
|
2026-01-08 |
GenAI-DrawIO-Creator: A Framework for Automated Diagram Generation
Diagrams are crucial for communicating complex information, yet creating and modifying them remains a labor-intensive task.We present GenAI-DrawIO-Creator, a novel framework that leverages Large Language Models (LLMs) to automate diagram generation and manipulation in the structured XML format used by draw.io. 0.718Our system integrates Claude 3.7 to reason about structured visual data and produce valid diagram representations.Key contributions include a high-level system design enabling real-time diagram updates, specialized prompt engineering and error-checking to ensure well-formed XML outputs.We demonstrate a working prototype capable of generating accurate diagrams (such as network architectures and flowcharts) from natural language or code, and even replicating diagrams from images.Simulated evaluations show that our approach significantly reduces diagram creation time and produces outputs with high structural fidelity.Our results highlight the promise of Claude 3.7 in handling structured visual reasoning tasks and lay the groundwork for future research in AI-assisted diagramming applications. |
|
2026-01-08 |
SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning
Large language models (LLMs) have revolutionized text-based code automation, but their potential in graph-oriented engineering workflows remains under-explored. 0.862We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink.SimuAgent replaces verbose XML with a concise, dictionary-style Python representation, dramatically cutting token counts, improving interpretability, and enabling fast, in-process simulation.A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning.To tackle sparse rewards in long-horizon tasks, we propose Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization (GRPO) with self-reflection traces that supply rich intermediate feedback, accelerating convergence and boosting robustness.Experiments on SimuBench, our newly released benchmark comprising 5300 multi-domain modeling tasks, show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark.Ablations confirm that the two-stage curriculum and abstract-reconstruct data augmentation further enhance generalization.SimuAgent trains and runs entirely on-premise with modest hardware, delivering a privacy-preserving, cost-effective solution for industrial model-driven engineering.SimuAgent bridges the gap between LLMs and graphical modeling environments, offering a practical solution for AI-assisted engineering design in industrial settings. |
|
2026-01-08 |
Internal Representations as Indicators of Hallucinations in Agent Tool Selection
Large Language Models (LLMs) have shown remarkable capabilities in tool calling and tool usage, but suffer from hallucinations where they choose incorrect tools, provide malformed parameters, and exhibit 'tool bypass' behavior by performing simulations and generating outputs instead of invoking specialized tools or external systems. 0.684 This undermines the reliability of LLM-based agents in production systems, as it leads to inconsistent results and bypasses security and audit controls. Such hallucinations in agent tool selection require early detection and error handling. Unlike existing hallucination detection methods that require multiple forward passes or external validation, we present a computationally efficient framework that detects tool-calling hallucinations in real-time by leveraging LLMs' internal representations during the same forward pass used for generation. We evaluate this approach on reasoning tasks across multiple domains, demonstrating strong detection performance (up to 86.4% accuracy) while maintaining real-time inference capabilities with minimal computational overhead, particularly excelling at detecting parameter-level hallucinations and inappropriate tool selections, critical for reliable agent deployment. |
|
2026-01-07 |
Agentic Proof Automation: A Case Study
Proof engineering is notoriously labor-intensive: proofs that are straightforward on paper often require lengthy scripts in theorem provers. Recent advances in large language models (LLMs) create new opportunities for proof automation: modern LLMs not only generate proof scripts, but also support agentic behavior, exploring codebases and iteratively refining their outputs against prover feedback. 0.672 These advances enable an emerging scheme where LLM-based agents undertake most proof engineering under human guidance. Humans provide mathematical insight (definitions, theorems, proof strategies); agents handle the mechanical work of proof development. We call this scheme agentic proof automation. We present this scheme through a case study: mechanizing the semantic type soundness of a sophisticated formal system, System Capless, in Lean 4, comprising over 14,000 lines of code. Using off-the-shelf LLM agents with a single lightweight proof-checking tool, the agents completed 189 proof engineering tasks with an 87% success rate, only 16% requiring human intervention. The case study demonstrates that agents are capable proof engineers that substantially boost productivity, though they fall short in creative reasoning and still require human guidance in certain cases. We release an interactive explorer where readers can examine all agent interactions; the mechanization is open-sourced for experiments and extensions. |
|
2026-01-07 |
Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study
Large Language Models (LLMs) such as GPT-4, Claude and LLaMA have shown impressive performance in code generation, typically evaluated using benchmarks (e.g., HumanEval). 0.937 However, effective code generation requires models to understand and apply a wide range of language concepts. 0.834 If the concepts exercised in benchmarks are not representative of those used in real-world projects, evaluations may be incomplete. Despite this concern, the representativeness of code concepts in benchmarks has not been systematically examined. To address this gap, we present the first empirical study that analyzes the representativeness of code generation benchmarks through the lens of Knowledge Units (KUs) - cohesive sets of programming language capabilities provided by language constructs and APIs. 0.862 We analyze KU coverage in two widely used Python benchmarks, HumanEval and MBPP, and compare them with 30 real-world Python projects. Our results show that each benchmark covers only half of the identified 20 KUs, whereas projects exercise all KUs with relatively balanced distributions. In contrast, benchmark tasks exhibit highly skewed KU distributions. To mitigate this misalignment, we propose a prompt-based LLM framework that synthesizes KU-based tasks to rebalance benchmark KU distributions and better align them with real-world usage. Using this framework, we generate 440 new tasks and augment existing benchmarks. The augmented benchmarks substantially improve KU coverage and achieve over a 60% improvement in distributional alignment. Evaluations of state-of-the-art LLMs on these augmented benchmarks reveal consistent and statistically significant performance drops (12.54-44.82%), indicating that existing benchmarks overestimate LLM performance due to their limited KU coverage. Our findings provide actionable guidance for building more realistic evaluations of LLM code-generation capabilities. 0.775 |
|
2026-01-07 |
From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLMs
Large language models (LLMs) have achieved notable performance in code synthesis; however, data-aware augmentation remains a limiting factor, handled via heuristic design or brute-force approaches. 0.81 We introduce a performance-aware, closed-loop solution in the NNGPT ecosystem of projects that enables LLMs to autonomously engineer optimal transformations by internalizing empirical performance cues. We fine-tune LLMs with Low-Rank Adaptation on a novel repository of more than 6,000 empirically evaluated PyTorch augmentation functions, each annotated solely by downstream model accuracy. Training uses pairwise performance ordering (better-worse transformations), enabling alignment through empirical feedback without reinforcement learning, reward models, or symbolic objectives. This reduces the need for exhaustive search, achieving up to 600x fewer evaluated candidates than brute-force discovery while maintaining competitive peak accuracy and shifting generation from random synthesis to task-aligned design. Ablation studies show that structured Chain-of-Thought prompting introduces syntactic noise and degrades performance, whereas direct prompting ensures stable optimization in performance-critical code tasks. Qualitative and quantitative analyses demonstrate that the model internalizes semantic performance cues rather than memorizing syntax. These results show that LLMs can exhibit task-level reasoning through non-textual feedback loops, bypassing explicit symbolic rewards. |
|
2026-01-07 |
Once Upon a Team: Investigating Bias in LLM-Driven Software Team Composition and Task Allocation
LLMs are increasingly used to boost productivity and support software engineering tasks. 0.73 However, when applied to socially sensitive decisions such as team composition and task allocation, they raise concerns of fairness. Prior studies have revealed that LLMs may reproduce stereotypes; however, these analyses remain exploratory and examine sensitive attributes in isolation. This study investigates whether LLMs exhibit bias in team composition and task assignment by analyzing the combined effects of candidates' country and pronouns. Using three LLMs and 3,000 simulated decisions, we find systematic disparities: demographic attributes significantly shaped both selection likelihood and task allocation, even when accounting for expertise-related factors. Task distributions further reflected stereotypes, with technical and leadership roles unevenly assigned across groups. Our findings indicate that LLMs exacerbate demographic inequities in software engineering contexts, underscoring the need for fairness-aware assessment. |
|
2026-01-07 |
Understanding Specification-Driven Code Generation with LLMs: An Empirical Study Design
Large Language Models (LLMs) are increasingly integrated into software development workflows, yet their behavior in structured, specification-driven processes remains poorly understood. 0.756 This paper presents an empirical study design using CURRANTE, a Visual Studio Code extension that enables a human-in-the-loop workflow for LLM-assisted code generation. 0.923 The tool guides developers through three sequential stages--Specification, Tests, and Function--allowing them to define requirements, generate and refine test suites, and produce functions that satisfy those tests. Participants will solve medium-difficulty problems from the LiveCodeBench dataset, while the tool records fine-grained interaction logs, effectiveness metrics (e.g., pass rate, all-pass completion), efficiency indicators (e.g., time-to-pass), and iteration behaviors. The study aims to analyze how human intervention in specification and test refinement influences the quality and dynamics of LLM-generated code. 0.651 The results will provide empirical insights into the design of next-generation development environments that align human reasoning with model-driven code generation. 0.704 |