Ryan's Arxiv FrontPage


Generated on 2025-09-17.


This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions.

This project was originally created by Vincent Warmerdam, modifying his original frontpage for different paper categories.


Prompt Engineering in Large Language Models

2025-09-16

DPCheatSheet: Using Worked and Erroneous LLM-usage Examples to Scaffold Differential Privacy Implementation

This paper explores how programmers without specialized expertise in differential privacy (DP) (i.e., novices) can leverage LLMs to implement DP programs with minimal training.We first conducted a need-finding study with 6 novices and 3 experts to understand how they utilize LLMs in DP implementation.While DP experts can implement correct DP analyses through a few prompts, novices struggle to articulate their requirements in prompts and lack the skills to verify the correctness of the generated code. 0.74We then developed DPCheatSheet, an instructional tool that helps novices implement DP using LLMs.DPCheatSheet combines two learning concepts: it annotates an expert's workflow with LLMs as a worked example to bridge the expert mindset to novices, and it presents five common mistakes in LLM-based DP code generation as erroneous examples to support error-driven learning.We demonstrated the effectiveness of DPCheatSheet with an error identification study and an open-ended DP implementation study.

link
Google Scholar

2025-09-16

Analogy-Driven Financial Chain-of-Thought (AD-FCoT): A Prompting Approach for Financial Sentiment Analysis

Financial news sentiment analysis is crucial for anticipating market movements.With the rise of AI techniques such as Large Language Models (LLMs), which demonstrate strong text understanding capabilities, there has been renewed interest in enhancing these systems.Existing methods, however, often struggle to capture the complex economic context of news and lack transparent reasoning, which undermines their reliability.We propose Analogy-Driven Financial Chain-of-Thought (AD-FCoT), a prompting framework that integrates analogical reasoning with chain-of-thought (CoT) prompting for sentiment prediction on historical financial news. 0.754AD-FCoT guides LLMs to draw parallels between new events and relevant historical scenarios with known outcomes, embedding these analogies into a structured, step-by-step reasoning chain.To our knowledge, this is among the first approaches to explicitly combine analogical examples with CoT reasoning in finance.Operating purely through prompting, AD-FCoT requires no additional training data or fine-tuning and leverages the model's internal financial knowledge to generate rationales that mirror human analytical reasoning.Experiments on thousands of news articles show that AD-FCoT outperforms strong baselines in sentiment classification accuracy and achieves substantially higher correlation with market returns.Its generated explanations also align with domain expertise, providing interpretable insights suitable for real-world financial analysis.

link
Google Scholar

2025-09-16

Large Language Models Imitate Logical Reasoning, but at what Cost?

We present a longitudinal study which evaluates the reasoning capability of frontier Large Language Models over an eighteen month period.We measured the accuracy of three leading models from December 2023, September 2024 and June 2025 on true or false questions from the PrOntoQA dataset and their faithfulness to reasoning strategies provided through in-context learning. 0.623The improvement in performance from 2023 to 2024 can be attributed to hidden Chain of Thought prompting. 0.716The introduction of thinking models allowed for significant improvement in model performance between 2024 and 2025. We then present a neuro-symbolic architecture which uses LLMs of less than 15 billion parameters to translate the problems into a standardised form.We then parse the standardised forms of the problems into a program to be solved by Z3, an SMT solver, to determine the satisfiability of the query.We report the number of prompt and completion tokens as well as the computational cost in FLOPs for open source models.The neuro-symbolic approach significantly reduces the computational cost while maintaining near perfect performance.The common approximation that the number of inference FLOPs is double the product of the active parameters and total tokens was accurate within 10\% for all experiments.

link
Google Scholar

2025-09-16

Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language Models

It is crucial to efficiently execute instructions such as "Find an apple and a banana" or "Get ready for a field trip," which require searching for multiple objects or understanding context-dependent commands.This study addresses the challenging problem of determining which robot should be assigned to which part of a task when each robot possesses different situational on-site knowledge-specifically, spatial concepts learned from the area designated to it by the user.We propose a task planning framework that leverages large language models (LLMs) and spatial concepts to decompose natural language instructions into subtasks and allocate them to multiple robots.We designed a novel few-shot prompting strategy that enables LLMs to infer required objects from ambiguous commands and decompose them into appropriate subtasks. 0.809In our experiments, the proposed method achieved 47/50 successful assignments, outperforming random (28/50) and commonsense-based assignment (26/50).Furthermore, we conducted qualitative evaluations using two actual mobile manipulators.The results demonstrated that our framework could handle instructions, including those involving ad hoc categories such as "Get ready for a field trip," by successfully performing task decomposition, assignment, sequential planning, and execution.

link
Google Scholar

2025-09-16

Evaluating Large Language Models for Code Translation: Effects of Prompt Language and Prompt Design

Large language models (LLMs) have shown promise for automated source-code translation, a capability critical to software migration, maintenance, and interoperability.Yet comparative evidence on how model choice, prompt design, and prompt language shape translation quality across multiple programming languages remains limited. 0.717This study conducts a systematic empirical assessment of state-of-the-art LLMs for code translation among C++, Java, Python, and C#, alongside a traditional baseline (TransCoder).Using BLEU and CodeBLEU, we quantify syntactic fidelity and structural correctness under two prompt styles (concise instruction and detailed specification) and two prompt languages (English and Arabic), with direction-aware evaluation across language pairs. 0.745Experiments show that detailed prompts deliver consistent gains across models and translation directions, and English prompts outperform Arabic by 13-15%. 0.846The top-performing model attains the highest CodeBLEU on challenging pairs such as Java to C# and Python to C++.Our evaluation shows that each LLM outperforms TransCoder across the benchmark.These results demonstrate the value of careful prompt engineering and prompt language choice, and provide practical guidance for software modernization and cross-language interoperability. 0.902

link
Google Scholar

2025-09-16

Toward PDDL Planning Copilot

Large Language Models (LLMs) are increasingly being used as autonomous agents capable of performing complicated tasks.However, they lack the ability to perform reliable long-horizon planning on their own.This paper bridges this gap by introducing the Planning Copilot, a chatbot that integrates multiple planning tools and allows users to invoke them through instructions in natural language. 0.649The Planning Copilot leverages the Model Context Protocol (MCP), a recently developed standard for connecting LLMs with external tools and systems.This approach allows using any LLM that supports MCP without domain-specific fine-tuning.Our Planning Copilot supports common planning tasks such as checking the syntax of planning problems, selecting an appropriate planner, calling it, validating the plan it generates, and simulating their execution.We empirically evaluate the ability of our Planning Copilot to perform these tasks using three open-source LLMs.The results show that the Planning Copilot highly outperforms using the same LLMs without the planning tools.We also conducted a limited qualitative comparison of our tool against Chat GPT-5, a very recent commercial LLM.Our results shows that our Planning Copilot significantly outperforms GPT-5 despite relying on a much smaller LLM.This suggests dedicated planning tools may be an effective way to enable LLMs to perform planning tasks.

link
Google Scholar

2025-09-16

Automating Code Generation for Semiconductor Equipment Control from Developer Utterances with LLMs

Semiconductors form the backbone of modern electronics, with their manufacturing and testing relying on highly specialized equipment and domain-specific programming languages.Equipment languages such as the Algorithmic Pattern Generator (ALPG) are critical for precise hardware control but are challenging to program due to their low-level syntax and steep learning curve.While large language models (LLMs) have shown promise in generating high-level code from natural language, their effectiveness on low-level equipment languages remains limited.To address this, we propose Progressive Knowledge Enhancement (PKE), a novel multi-stage prompting framework that progressively extracts and activates the latent knowledge within LLMs, guiding them from simple to complex examples without extensive fine-tuning. 0.882Empirical evaluation on an industrial ALPG dataset shows that PKE significantly outperforms standard prompting and surpasses state-of-the-art methods in generating correct ALPG code, achieving 11.1\% and 15.2\% higher exact match scores compared to the second-best technique. 0.737Further analysis of individual components confirms that progressive knowledge extraction based on difficulty enhances accuracy.Our study offer a practical approach to boosting LLM capabilities for specialized low-level programming, supporting greater productivity in semiconductor software development.

link
Google Scholar

2025-09-16

Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets

Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks, including combinatorial optimization.Techniques such as Chain-of-Thought and In-Context Learning have further enhanced this capability, making LLMs both powerful and accessible tools for a wide range of users, including non-experts.However, applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored.To address this gap, we introduce a novel benchmark of 369 instances of the College Admission Problem, a canonical example of a matching problem with preferences, to evaluate LLMs across key dimensions: feasibility, stability, and optimality.We employ this benchmark to assess the performance of several open-weight LLMs.Our results first reveal that while LLMs can satisfy certain constraints, they struggle to meet all evaluation criteria consistently.They also show that reasoning LLMs, like QwQ and GPT-oss, significantly outperform traditional models such as Llama, Qwen or Mistral, defined here as models used without any dedicated reasoning mechanisms.Moreover, we observed that LLMs reacted differently to the various prompting strategies tested, which include Chain-of-Thought, In-Context Learning and role-based prompting, with no prompt consistently offering the best performance. 0.886Finally, we report the performances from iterative prompting with auto-generated feedback and show that they are not monotonic; they can peak early and then significantly decline in later attempts. 0.837Overall, this work offers a new perspective on model reasoning performance and the effectiveness of prompting strategies in combinatorial optimization problems with preferential constraints. 0.75

link
Google Scholar

2025-09-16

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence.Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. 0.702Yet no existing open financial datasets evaluate data searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate.We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning.FinSearchComp comprises three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation -- closely reproduce real-world financial analyst workflows.To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline.The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it.Grok 4 (web) tops the global subset, approaching expert-level accuracy.DouBao (web) leads on the Greater China subset.Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and the country origin of models and tools impact performance significantly.By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.

link
Google Scholar

2025-09-16

The Few-shot Dilemma: Over-prompting Large Language Models

Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. 0.76To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. 0.749Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. 0.745Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets.By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM.This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.

link
Google Scholar

2025-09-16

Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors

Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought.During the process, they often re-derive the same intermediate steps across problems, inflating token usage and latency.This saturation of the context window leaves less capacity for exploration.We study a simple mechanism that converts recurring reasoning fragments into concise, reusable "behaviors" (name + instruction) via the model's own metacognitive analysis of prior traces.These behaviors are stored in a "behavior handbook" which supplies them to the model in-context at inference or distills them into parameters via supervised fine-tuning.This approach achieves improved test-time reasoning across three different settings - 1) Behavior-conditioned inference: Providing the LLM relevant behaviors in-context during reasoning reduces number of reasoning tokens by up to 46% while matching or improving baseline accuracy; 2) Behavior-guided self-improvement: Without any parameter updates, the model improves its own future reasoning by leveraging behaviors from its own past problem solving attempts.This yields up to 10% higher accuracy than a naive critique-and-revise baseline; and 3) Behavior-conditioned SFT: SFT on behavior-conditioned reasoning traces is more effective at converting non-reasoning models into reasoning models as compared to vanilla SFT. 0.624Together, these results indicate that turning slow derivations into fast procedural hints enables LLMs to remember how to reason, not just what to conclude.

link
Google Scholar

2025-09-16

Large Language Model-assisted Meta-optimizer for Automated Design of Constrained Evolutionary Algorithm

Meta-black-box optimization has been significantly advanced through the use of large language models (LLMs), yet in fancy on constrained evolutionary optimization.In this work, AwesomeDE is proposed that leverages LLMs as the strategy of meta-optimizer to generate update rules for constrained evolutionary algorithm without human intervention.On the meanwhile, $RTO^2H$ framework is introduced for standardize prompt design of LLMs. 0.729The meta-optimizer is trained on a diverse set of constrained optimization problems.Key components, including prompt design and iterative refinement, are systematically analyzed to determine their impact on design quality.Experimental results demonstrate that the proposed approach outperforms existing methods in terms of computational efficiency and solution accuracy.Furthermore, AwesomeDE is shown to generalize well across distinct problem domains, suggesting its potential for broad applicability.This research contributes to the field by providing a scalable and data-driven methodology for automated constrained algorithm design, while also highlighting limitations and directions for future work.

link
Google Scholar

2025-09-15

Do It Yourself (DIY): Modifying Images for Poems in a Zero-Shot Setting Using Weighted Prompt Manipulation

Poetry is an expressive form of art that invites multiple interpretations, as readers often bring their own emotions, experiences, and cultural backgrounds into their understanding of a poem.Recognizing this, we aim to generate images for poems and improve these images in a zero-shot setting, enabling audiences to modify images as per their requirements.To achieve this, we introduce a novel Weighted Prompt Manipulation (WPM) technique, which systematically modifies attention weights and text embeddings within diffusion models. 0.709By dynamically adjusting the importance of specific words, WPM enhances or suppresses their influence in the final generated image, leading to semantically richer and more contextually accurate visualizations.Our approach exploits diffusion models and large language models (LLMs) such as GPT in conjunction with existing poetry datasets, ensuring a comprehensive and structured methodology for improved image generation in the literary domain.To the best of our knowledge, this is the first attempt at integrating weighted prompt manipulation for enhancing imagery in poetic language. 0.637

link
Google Scholar

2025-09-15

Designing LLMs for cultural sensitivity: Evidence from English-Japanese translation

Large language models (LLMs) are increasingly used in everyday communication, including multilingual interactions across different cultural contexts.While LLMs can now generate near-perfect literal translations, it remains unclear whether LLMs support culturally appropriate communication.In this paper, we analyze the cultural sensitivity of different LLM designs when applied to English-Japanese translations of workplace e-mails.Here, we vary the prompting strategies: (1) naive "just translate" prompts, (2) audience-targeted prompts specifying the recipient's cultural background, and (3) instructional prompts with explicit guidance on Japanese communication norms. 0.844Using a mixed-methods study, we then analyze culture-specific language patterns to evaluate how well translations adapt to cultural norms.Further, we examine the appropriateness of the tone of the translations as perceived by native speakers.We find that culturally-tailored prompting can improve cultural fit, based on which we offer recommendations for designing culturally inclusive LLMs in multilingual settings. 0.728

link
Google Scholar

2025-09-15

CBP-Tuning: Efficient Local Customization for Black-box Large Language Models

The high costs of customizing large language models (LLMs) fundamentally limit their adaptability to user-specific needs.Consequently, LLMs are increasingly offered as cloud-based services, a paradigm that introduces critical limitations: providers struggle to support personalized customization at scale, while users face privacy risks when exposing sensitive data.To address this dual challenge, we propose Customized Black-box Prompt Tuning (CBP-Tuning), a novel framework that facilitates efficient local customization while preserving bidirectional privacy.Specifically, we design a two-stage framework: (1) a prompt generator trained on the server-side to capture domain-specific and task-agnostic capabilities, and (2) user-side gradient-free optimization that tailors soft prompts for individual tasks. 0.635This approach eliminates the need for users to access model weights or upload private data, requiring only a single customized vector per task while achieving effective adaptation.Furthermore, the evaluation of CBP-Tuning in the commonsense reasoning, medical and financial domain settings demonstrates superior performance compared to baselines, showcasing its advantages in task-agnostic processing and privacy preservation.

link
Google Scholar

2025-09-15

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (\textbf{VRMs}).owever, such transfer faces critical challenges: Effective "slow thinking" in VRMs requires \textbf{visual reflection}, the ability to check the reasoning process based on visual information.Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses.To address this challenge, we propose a new VRM \textbf{Reflection-V}, which enhances visual reflection based on reasoning data construction for cold-start and reward design for reinforcement learning (RL).Firstly, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns.Secondly, a visual attention based reward model is employed during RL to encourage reasoning based on visual information. 0.647Therefore, \textbf{Reflection-V} demonstrates significant improvements across multiple visual reasoning benchmarks.Furthermore, \textbf{Reflection-V} maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement in visual reflection capabilities.

link
Google Scholar

2025-09-15

RAGs to Riches: RAG-like Few-shot Learning for Large Language Model Role-playing

Role-playing Large language models (LLMs) are increasingly deployed in high-stakes domains such as healthcare, education, and governance, where failures can directly impact user trust and well-being.A cost effective paradigm for LLM role-playing is few-shot learning, but existing approaches often cause models to break character in unexpected and potentially harmful ways, especially when interacting with hostile users.Inspired by Retrieval-Augmented Generation (RAG), we reformulate LLM role-playing into a text retrieval problem and propose a new prompting framework called RAGs-to-Riches, which leverages curated reference demonstrations to condition LLM responses. 0.652We evaluate our framework with LLM-as-a-judge preference voting and introduce two novel token-level ROUGE metrics: Intersection over Output (IOO) to quantity how much an LLM improvises and Intersection over References (IOR) to measure few-shot demonstrations utilization rate during the evaluation tasks.When simulating interactions with a hostile user, our prompting strategy incorporates in its responses during inference an average of 35% more tokens from the reference demonstrations. 0.674As a result, across 453 role-playing interactions, our models are consistently judged as being more authentic, and remain in-character more often than zero-shot and in-context Learning (ICL) methods.Our method presents a scalable strategy for building robust, human-aligned LLM role-playing frameworks.

link
Google Scholar

2025-09-15

Prompt Commons: Collective Prompting as Governance for Urban AI

Large Language Models (LLMs) are entering urban governance, yet their outputs are highly sensitive to prompts that carry value judgments.We propose Prompt Commons - a versioned, community-maintained repository of prompts with governance metadata, licensing, and moderation - to steer model behaviour toward pluralism. 0.656Using a Montreal dataset (443 human prompts; 3,317 after augmentation), we pilot three governance states (open, curated, veto-enabled).On a contested policy benchmark, a single-author prompt yields 24 percent neutral outcomes; commons-governed prompts raise neutrality to 48-52 percent while retaining decisiveness where appropriate. 0.629In a synthetic incident log, a veto-enabled regime reduces time-to-remediation for harmful outputs from 30.5 +/- 8.9 hours (open) to 5.6 +/- 1.5 hours.We outline licensing (CC BY/BY-SA for prompts with optional OpenRAIL-style restrictions for artefacts), auditable moderation, and safeguards against dominance capture.Prompt governance offers a practical lever for cities to align AI with local values and accountability.

link
Google Scholar

2025-09-15

MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability.However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information.To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking.MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels.Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty.We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline.Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance.Furthermore, our analysis uncovers a frequent ``over-criticism'' phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. 0.604By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.

link
Google Scholar

2025-09-15

Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction

Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce.However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. 0.647We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. 0.615Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset.We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model's intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. 0.603As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors.We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories.Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.

link
Google Scholar

Robustness Tools in LLM Safety

2025-09-16

Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection

Face forgery detection faces a critical challenge: a persistent gap between offline benchmarks and real-world efficacy,which we attribute to the ecological invalidity of training data.This work introduces Agent4FaceForgery to address two fundamental problems: (1) how to capture the diverse intents and iterative processes of human forgery creation, and (2) how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media.To solve this,we propose a multi-agent framework where LLM-poweredagents, equipped with profile and memory modules, simulate the forgery creation process. 0.604Crucially, these agents interact in a simulated social environment to generate samples labeled for nuanced text-image consistency, moving beyond simple binary classification.An Adaptive Rejection Sampling (ARS) mechanism ensures data quality and diversity.Extensive experiments validate that the data generated by our simulationdriven approach brings significant performance gains to detectors of multiple architectures, fully demonstrating the effectiveness and value of our framework.

link
Google Scholar

2025-09-16

LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals

Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. 0.817Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, that fails to capture deviations in reasoning dynamics.As a result, their effectiveness and robustness remain limited.We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. 0.858HSAD constructs hidden-layer signals by sampling activations across layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features.Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection.Across multiple benchmarks, including TruthfulQA, HSAD achieves over 10 percentage points improvement compared to prior state-of-the-art methods.By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs. 0.89

link
Google Scholar

2025-09-15

HARP: Hallucination Detection via Reasoning Subspace Projection

Hallucinations in Large Language Models (LLMs) pose a major barrier to their reliable use in critical decision-making. 0.849Although existing hallucination detection methods have improved accuracy, they still struggle with disentangling semantic and reasoning information and maintaining robustness. 0.885To address these challenges, we propose HARP (Hallucination detection via reasoning subspace projection), a novel hallucination detection framework. 0.666HARP establishes that the hidden state space of LLMs can be decomposed into a direct sum of a semantic subspace and a reasoning subspace, where the former encodes linguistic expression and the latter captures internal reasoning processes.Moreover, we demonstrate that the Unembedding layer can disentangle these subspaces, and by applying Singular Value Decomposition (SVD) to its parameters, the basis vectors spanning the semantic and reasoning subspaces are obtained.Finally, HARP projects hidden states onto the basis vectors of the reasoning subspace, and the resulting projections are then used as input features for hallucination detection in LLMs. 0.707By using these projections, HARP reduces the dimension of the feature to approximately 5% of the original, filters out most noise, and achieves enhanced robustness.Experiments across multiple datasets show that HARP achieves state-of-the-art hallucination detection performance; in particular, it achieves an AUROC of 92.8% on TriviaQA, outperforming the previous best method by 7.5%. 0.81

link
Google Scholar

2025-09-15

D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs

Although large Language Models (LLMs) have achieved remarkable success, their practical application is often hindered by the generation of non-factual content, which is called "hallucination". 0.765Ensuring the reliability of LLMs' outputs is a critical challenge, particularly in high-stakes domains such as finance, security, and healthcare.In this work, we revisit hallucination detection from the perspective of model architecture and generation dynamics. 0.867Leveraging the multi-layer structure and autoregressive decoding process of LLMs, we decompose hallucination signals into two complementary dimensions: the semantic breadth of token representations within each layer, and the semantic depth of core concepts as they evolve across layers.Based on this insight, we propose \textbf{D$^2$HScore (Dispersion and Drift-based Hallucination Score)}, a training-free and label-free framework that jointly measures: (1) \textbf{Intra-Layer Dispersion}, which quantifies the semantic diversity of token representations within each layer; and (2) \textbf{Inter-Layer Drift}, which tracks the progressive transformation of key token representations across layers.To ensure drift reflects the evolution of meaningful semantics rather than noisy or redundant tokens, we guide token selection using attention signals.By capturing both the horizontal and vertical dynamics of representation during inference, D$^2$HScore provides an interpretable and lightweight proxy for hallucination detection. 0.884Extensive experiments across five open-source LLMs and five widely used benchmarks demonstrate that D$^2$HScore consistently outperforms existing training-free baselines.

link
Google Scholar

2025-09-15

HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems

Large Language Models (LLMs) are widely used in industry but remain prone to hallucinations, limiting their reliability in critical applications. 0.787This work addresses hallucination reduction in consumer grievance chatbots built using LLaMA 3.1 8B Instruct, a compact model frequently used in industry.We develop HalluDetect, an LLM-based hallucination detection system that achieves an F1 score of 69% outperforming baseline detectors by 25.44%. 0.867Benchmarking five chatbot architectures, we find that out of them, AgentBot minimizes hallucinations to 0.4159 per turn while maintaining the highest token accuracy (96.13%), making it the most effective mitigation strategy. 0.602Our findings provide a scalable framework for hallucination mitigation, demonstrating that optimized inference strategies can significantly improve factual accuracy. 0.764While applied to consumer law, our approach generalizes to other high-risk domains, enhancing trust in LLM-driven assistants.We will release the code and dataset

link
Google Scholar

2025-09-15

CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings

Static analysis tools are widely used to detect bugs, vulnerabilities, and code smells.Traditionally, developers must resolve these warnings manually. 0.68Because this process is tedious, developers sometimes ignore warnings, leading to an accumulation of warnings and a degradation of code quality. 0.792This paper presents CodeCureAgent, an approach that harnesses LLM-based agents to automatically analyze, classify, and repair static analysis warnings.Unlike previous work, our method does not follow a predetermined algorithm.Instead, we adopt an agentic framework that iteratively invokes tools to gather additional information from the codebase (e.g., via code search) and edit the codebase to resolve the warning.CodeCureAgent detects and suppresses false positives, while fixing true positives when identified.We equip CodeCureAgent with a three-step heuristic to approve patches: (1) build the project, (2) verify that the warning disappears without introducing new warnings, and (3) run the test suite. 0.72We evaluate CodeCureAgent on a dataset of 1,000 SonarQube warnings found in 106 Java projects and covering 291 distinct rules.Our approach produces plausible fixes for 96.8% of the warnings, outperforming state-of-the-art baseline approaches by 30.7% and 29.2% in plausible-fix rate, respectively.Manual inspection of 291 cases reveals a correct-fix rate of 86.3%, showing that CodeCureAgent can reliably repair static analysis warnings. 0.657The approach incurs LLM costs of about 2.9 cents (USD) and an end-to-end processing time of about four minutes per warning.We envision CodeCureAgent helping to clean existing codebases and being integrated into CI/CD pipelines to prevent the accumulation of static analysis warnings.

link
Google Scholar

2025-09-15

SENTRA: Selected-Next-Token Transformer for LLM Text Detection

LLMs are becoming increasingly capable and widespread.Consequently, the potential and reality of their misuse is also growing. 0.622In this work, we address the problem of detecting LLM-generated text that is not explicitly declared as such.We present a novel, general-purpose, and supervised LLM text detector, SElected-Next-Token tRAnsformer (SENTRA).SENTRA is a Transformer-based encoder leveraging selected-next-token-probability sequences and utilizing contrastive pre-training on large amounts of unlabeled data.Our experiments on three popular public datasets across 24 domains of text demonstrate SENTRA is a general-purpose classifier that significantly outperforms popular baselines in the out-of-domain setting.

link
Google Scholar

2025-09-15

FunAudio-ASR Technical Report

In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs).However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. 0.859In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios.Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements.Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets.Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.

link
Google Scholar

2025-09-14

Evalet: Evaluating Large Language Models by Fragmenting Outputs into Functions

Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches.However, these methods produce holistic scores that obscure which specific elements influenced the assessments.We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria -- surfacing the elements of interest and revealing how they fulfill or hinder user goals.We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations.A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments.This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. 0.7Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.

link
Google Scholar

2025-09-14

Realistic Environmental Injection Attacks on GUI Agents

GUI agents built on LVLMs are increasingly used to interact with websites.However, their exposure to open-world content makes them vulnerable to Environmental Injection Attacks (EIAs) that hijack agent behavior via webpage elements.Many recent studies assume the attacker to be a regular user who can only upload a single trigger image, which is more realistic than earlier assumptions of website-level administrative control. 0.623However, these works still fall short of realism: (1) the trigger's position and surrounding context remain largely fixed between training and testing, failing to capture the dynamic nature of real webpages and (2) the trigger often occupies an unrealistically large area, whereas real-world images are typically small.To better reflect real-world scenarios, we introduce a more realistic threat model where the attacker is a regular user and the trigger image is small and embedded within a dynamically changing environment.As a result, existing attacks prove largely ineffective under this threat model. 0.616To better expose the vulnerabilities of GUI agents, we propose Chameleon, an attack framework with two main novelties.The first is LLM-Driven Environment Simulation, which automatically generates diverse and high-fidelity webpage simulations.The second is Attention Black Hole, which transforms attention weights into explicit supervisory signals that guide the agent's focus toward the trigger region.We evaluate Chameleon on 6 realistic websites and 4 representative LVLM-powered GUI agents, where it significantly outperforms existing methods.Ablation studies confirm that both novelties are critical to performance.Our findings reveal underexplored vulnerabilities in modern GUI agents and establish a robust foundation for future research on defense in open-world GUI agent systems.The code is publicly available at https://github.com/zhangyitonggg/attack2gui.

link
Google Scholar

2025-09-14

The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences

Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs).By deploying case-specific prompt engineering techniques that streamline frequently performed life sciences workflows, researchers could achieve substantial efficiency gains that far exceed the initial time investment required to master these techniques.The Prompt Report published in 2025 outlined 58 different text-based prompt engineering techniques, highlighting the numerous ways prompts could be constructed.To provide actionable guidelines and reduce the friction of navigating these various approaches, we distil this report to focus on 6 core techniques: zero-shot, few-shot approaches, thought generation, ensembling, self-criticism, and decomposition.We breakdown the significance of each approach and ground it in use cases relevant to life sciences, from literature summarization and data extraction to editorial tasks.We provide detailed recommendations for how prompts should and shouldn't be structured, addressing common pitfalls including multi-turn conversation degradation, hallucinations, and distinctions between reasoning and non-reasoning models. 0.618We examine context window limitations, agentic tools like Claude Code, while analyzing the effectiveness of Deep Research tools across OpenAI, Google, Anthropic and Perplexity platforms, discussing current limitations.We demonstrate how prompt engineering can augment rather than replace existing established individual practices around data processing and document editing.Our aim is to provide actionable guidance on core prompt engineering principles, and to facilitate the transition from opportunistic prompting to an effective, low-friction systematic practice that contributes to higher quality research.

link
Google Scholar

Security Challenges in LLM Development

2025-09-16

Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation

Code vulnerability detection is crucial for ensuring the security and reliability of modern software systems. 0.843Recently, Large Language Models (LLMs) have shown promising capabilities in this domain.However, notable discrepancies in detection results often arise when analyzing identical code segments across different training stages of the same model or among architecturally distinct LLMs.While such inconsistencies may compromise detection stability, they also highlight a key opportunity: the latent complementarity among models can be harnessed through ensemble learning to create more robust vulnerability detection systems. 0.764In this study, we explore the potential of ensemble learning to enhance the performance of LLMs in source code vulnerability detection. 0.688We conduct comprehensive experiments involving five LLMs (i.e., DeepSeek-Coder-6.7B, CodeLlama-7B, CodeLlama-13B, CodeQwen1.5-7B, and StarCoder2-15B), using three ensemble strategies (i.e., Bagging, Boosting, and Stacking).These experiments are carried out across three widely adopted datasets (i.e., Devign, ReVeal, and BigVul).Inspired by Mixture of Experts (MoE) techniques, we further propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection. 0.839Our results demonstrate that ensemble approaches can significantly improve detection performance, with Boosting excelling in scenarios involving imbalanced datasets.Moreover, DGS consistently outperforms traditional Stacking, particularly in handling class imbalance and multi-class classification tasks.These findings offer valuable insights into building more reliable and effective LLM-based vulnerability detection systems through ensemble learning. 0.778

link
Google Scholar

2025-09-16

Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems.Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection.Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. 0.638In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques.Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups.We use adversarial attacking techniques to identify vulnerable circuits. 0.907Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. 0.865We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training.We find that models have distinct heads that are either crucial for performance or vulnerable to attack and suppressing the vulnerable heads improves performance on adversarial input. 0.852We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.

link
Google Scholar

2025-09-16

DiffHash: Text-Guided Targeted Attack via Diffusion Models against Deep Hashing Image Retrieval

Deep hashing models have been widely adopted to tackle the challenges of large-scale image retrieval.However, these approaches face serious security risks due to their vulnerability to adversarial examples. 0.887Despite the increasing exploration of targeted attacks on deep hashing models, existing approaches still suffer from a lack of multimodal guidance, reliance on labeling information and dependence on pixel-level operations for attacks.To address these limitations, we proposed DiffHash, a novel diffusion-based targeted attack for deep hashing.Unlike traditional pixel-based attacks that directly modify specific pixels and lack multimodal guidance, our approach focuses on optimizing the latent representations of images, guided by text information generated by a Large Language Model (LLM) for the target image. 0.724Furthermore, we designed a multi-space hash alignment network to align the high-dimension image space and text space to the low-dimension binary hash space.During reconstruction, we also incorporated text-guided attention mechanisms to refine adversarial examples, ensuring them aligned with the target semantics while maintaining visual plausibility. 0.634Extensive experiments have demonstrated that our method outperforms state-of-the-art (SOTA) targeted attack methods, achieving better black-box transferability and offering more excellent stability across datasets. 0.824

link
Google Scholar

2025-09-16

Jailbreaking Large Language Models Through Content Concretization

Large Language Models (LLMs) are increasingly deployed for task automation and content generation, yet their safety mechanisms remain vulnerable to circumvention through different jailbreaking techniques. 0.724In this paper, we introduce \textit{Content Concretization} (CC), a novel jailbreaking technique that iteratively transforms abstract malicious requests into concrete, executable implementations. 0.787CC is a two-stage process: first, generating initial LLM responses using lower-tier, less constrained safety filters models, then refining them through higher-tier models that process both the preliminary output and original prompt.We evaluate our technique using 350 cybersecurity-specific prompts, demonstrating substantial improvements in jailbreak Success Rates (SRs), increasing from 7\% (no refinements) to 62\% after three refinement iterations, while maintaining a cost of 7.5\textcent~per prompt. 0.704Comparative A/B testing across nine different LLM evaluators confirms that outputs from additional refinement steps are consistently rated as more malicious and technically superior. 0.633Moreover, manual code analysis reveals that generated outputs execute with minimal modification, although optimal deployment typically requires target-specific fine-tuning.With eventual improved harmful code generation, these results highlight critical vulnerabilities in current LLM safety frameworks. 0.902

link
Google Scholar

2025-09-16

xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems

This work introduces xOffense, an AI-driven, multi-agent penetration testing framework that shifts the process from labor-intensive, expert-driven manual efforts to fully automated, machine-executable workflows capable of scaling seamlessly with computational infrastructure. 0.628At its core, xOffense leverages a fine-tuned, mid-scale open-source LLM (Qwen3-32B) to drive reasoning and decision-making in penetration testing.The framework assigns specialized agents to reconnaissance, vulnerability scanning, and exploitation, with an orchestration layer ensuring seamless coordination across phases.Fine-tuning on Chain-of-Thought penetration testing data further enables the model to generate precise tool commands and perform consistent multi-step reasoning.We evaluate xOffense on two rigorous benchmarks: AutoPenBench and AI-Pentest-Benchmark.The results demonstrate that xOffense consistently outperforms contemporary methods, achieving a sub-task completion rate of 79.17%, decisively surpassing leading systems such as VulnBot and PentestGPT.These findings highlight the potential of domain-adapted mid-scale LLMs, when embedded within structured multi-agent orchestration, to deliver superior, cost-efficient, and reproducible solutions for autonomous penetration testing.

link
Google Scholar

2025-09-16

Validating Solidity Code Defects using Symbolic and Concrete Execution powered by Large Language Models

The high rate of false alarms from static analysis tools and Large Language Models (LLMs) complicates vulnerability detection in Solidity Smart Contracts, demanding methods that can formally or empirically prove the presence of defects. 0.872This paper introduces a novel detection pipeline that integrates custom Slither-based detectors, LLMs, Kontrol, and Forge.Our approach is designed to reliably detect defects and generate proofs.We currently perform experiments with promising results for seven types of critical defects.We demonstrate the pipeline's efficacy by presenting our findings for three vulnerabilities -- Reentrancy, Complex Fallback, and Faulty Access Control Policies -- that are challenging for current verification solutions, which often generate false alarms or fail to detect them entirely. 0.754We highlight the potential of either symbolic or concrete execution in correctly classifying such code faults. 0.653By chaining these instruments, our method effectively validates true positives, significantly reducing the manual verification burden.Although we identify potential limitations, such as the inconsistency and the cost of LLMs, our findings establish a robust framework for combining heuristic analysis with formal verification to achieve more reliable and automated smart contract auditing. 0.625

link
Google Scholar

2025-09-15

VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection

The application of language models to project-level vulnerability detection remains challenging, owing to the dual requirement of accurately localizing security-sensitive code and correctly correlating and reasoning over complex program context. 0.859We present VulAgent, a multi-agent vulnerability detection framework based on hypothesis validation. 0.809Our design is inspired by how human auditors review code: when noticing a sensitive operation, they form a hypothesis about a possible vulnerability, consider potential trigger paths, and then verify the hypothesis against the surrounding context. 0.622VulAgent implements a semantics-sensitive, multi-view detection pipeline: specialized agents, each aligned to a specific analysis perspective (e.g., memory, authorization), collaboratively surface and precisely localize sensitive code sites with higher coverage.Building on this, VulAgent adopts a hypothesis-validation paradigm: for each vulnerability report, it builds hypothesis conditions and a trigger path, steering the LLM to target the relevant program context and defensive checks during verification, which reduces false positives. 0.647On average across the two datasets, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450% (246% on average), and reduces the false positive rate by about 36% compared with state-of-the-art LLM-based baselines.

link
Google Scholar

2025-09-15

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

As large language models (LLMs) continue to advance in capabilities, ensuring their safety against jailbreak attacks remains a critical challenge. 0.842In this paper, we introduce a novel safety alignment approach called Answer-Then-Check, which enhances LLM robustness against malicious prompts by applying thinking ability to mitigate jailbreaking problems before producing a final answer to the user. 0.898Our method enables models to directly answer the question in their thought and then critically evaluate its safety before deciding whether to provide it. 0.657To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K examples that teach models to reason through direct responses and then analyze their safety.Experimental results demonstrate that our approach achieves the Pareto frontier with superior safety capability while decreasing over-refusal rates on over-refusal benchmarks.Notably, the model fine-tuned with ReSA maintains general reasoning capabilities on benchmarks like MMLU, MATH500, and HumanEval.Besides, our method equips models with the ability to perform safe completion.Unlike post-hoc methods that can only reject harmful queries, our model can provide helpful and safe alternative responses for sensitive topics (e.g., self-harm).Furthermore, we discover that training on a small subset of just 500 examples can achieve comparable performance to using the full dataset, suggesting that safety alignment may require less data than previously assumed.

link
Google Scholar

2025-09-15

Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Current unlearning techniques and safety training consistently fail to remove dangerous knowledge from language models. 0.665We analyze the root causes and propose a highly selective technique which unlearns robustly and without disrupting general performance. We perform PCA on activations and module output gradients to identify subspaces containing common representations, and collapse them before calculating unlearning updates.This way we avoid unlearning general representations, and only target those specific to the unlearned facts. When unlearning WMDP dataset facts from Llama-3.1-8B, we drop post-attack accuracy 80x more than our best baseline (Circuit Breakers) on biohazardous facts and 30x more on cyberhazardous facts.Despite this, we disrupt general performance 30x less (only 0.1% WikiText loss increase), while requiring less than 3 GPU-seconds per fact.

link
Google Scholar

2025-09-15

NeuroStrike: Neuron-Level Attacks on Aligned LLMs

Safety alignment is critical for the ethical deployment of large language models (LLMs), guiding them to avoid generating harmful or unethical content. 0.673Current alignment techniques, such as supervised fine-tuning and reinforcement learning from human feedback, remain fragile and can be bypassed by carefully crafted adversarial prompts.Unfortunately, such attacks rely on trial and error, lack generalizability across models, and are constrained by scalability and reliability. 0.833This paper presents NeuroStrike, a novel and generalizable attack framework that exploits a fundamental vulnerability introduced by alignment techniques: the reliance on sparse, specialized safety neurons responsible for detecting and suppressing harmful inputs. 0.881We apply NeuroStrike to both white-box and black-box settings: In the white-box setting, NeuroStrike identifies safety neurons through feedforward activation analysis and prunes them during inference to disable safety mechanisms. 0.63In the black-box setting, we propose the first LLM profiling attack, which leverages safety neuron transferability by training adversarial prompt generators on open-weight surrogate models and then deploying them against black-box and proprietary targets. 0.834We evaluate NeuroStrike on over 20 open-weight LLMs from major LLM developers.By removing less than 0.6% of neurons in targeted layers, NeuroStrike achieves an average attack success rate (ASR) of 76.9% using only vanilla malicious prompts. 0.641Moreover, Neurostrike generalizes to four multimodal LLMs with 100% ASR on unsafe image inputs.Safety neurons transfer effectively across architectures, raising ASR to 78.5% on 11 fine-tuned models and 77.7% on five distilled models.The black-box LLM profiling attack achieves an average ASR of 63.7% across five black-box models, including the Google Gemini family.

link
Google Scholar

2025-09-15

MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables

As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference.Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills.Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature.The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering.To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. 0.837Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. 0.654This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice.Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.

link
Google Scholar

HCI in Large Language Models

2025-09-16

Don't Change My View: Ideological Bias Auditing in Large Language Models

As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. 0.726If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse.Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur.In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing.Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model.Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic.This design makes the method particularly suitable for auditing proprietary black-box systems.We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.

link
Google Scholar

2025-09-16

Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations

Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. 0.803Moreover, there is a considerable preference bias towards specific strategies.Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not well been studied.To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning.Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries.Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach.

link
Google Scholar

2025-09-16

Harnessing the Power of AI in Qualitative Research: Role Assignment, Engagement, and User Perceptions of AI-Generated Follow-Up Questions in Semi-Structured Interviews

Semi-structured interviews highly rely on the quality of follow-up questions, yet interviewers' knowledge and skills may limit their depth and potentially affect outcomes.While many studies have shown the usefulness of large language models (LLMs) for qualitative analysis, their possibility in the data collection process remains underexplored. 0.62We adopt an AI-driven "Wizard-of-Oz" setup to investigate how real-time LLM support in generating follow-up questions shapes semi-structured interviews. 0.611Through a study with 17 participants, we examine the value of LLM-generated follow-up questions, the evolving division of roles, relationships, collaborative behaviors, and responsibilities between interviewers and AI. 0.754Our findings (1) provide empirical evidence of the strengths and limitations of AI-generated follow-up questions (AGQs); (2) introduce a Human-AI collaboration framework in this interview context; and (3) propose human-centered design guidelines for AI-assisted interviewing.We position LLMs as complements, not replacements, to human judgment, and highlight pathways for integrating AI into qualitative data collection.

link
Google Scholar

2025-09-16

What Makes a Good Generated Image? Investigating Human and Multimodal LLM Image Preference Alignment

Automated evaluation of generative text-to-image models remains a challenging problem.Recent works have proposed using multimodal LLMs to judge the quality of images, but these works offer little insight into how multimodal LLMs make use of concepts relevant to humans, such as image style or composition, to generate their overall assessment.In this work, we study what attributes of an image--specifically aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style--are important for both LLMs and humans to make judgments on image quality.We first curate a dataset of human preferences using synthetically generated image pairs.We use inter-task correlation between each pair of image quality attributes to understand which attributes are related in making human judgments.Repeating the same analysis with LLMs, we find that the relationships between image quality attributes are much weaker.Finally, we study individual image quality attributes by generating synthetic datasets with a high degree of control for each axis.Humans are able to easily judge the quality of an image with respect to all of the specific image quality attributes (e.g. high vs. low aesthetic image), however we find that some attributes, such as anatomical accuracy, are much more difficult for multimodal LLMs to learn to judge.Taken together, these findings reveal interesting differences between how humans and multimodal LLMs perceive images. 0.692

link
Google Scholar

2025-09-16

When Large Language Models Meet UAVs: How Far Are We?

The integration of unmanned aerial vehicles (UAVs) and large language models (LLMs) has emerged as a research direction of growing interest, with the potential to address challenges in autonomous decision-making, human-UAV interaction, and real-time adaptability.However, existing studies have remained largely in preliminary exploration with a limited understanding of real-world practice, risking a misalignment between academic research and practical needs and hindering the translation of results.To examine and address these potential challenges, we conducted an empirical study of 74 selected papers and 56 public GitHub projects, identified nine task types for LLMs in UAV systems, and quantified their distribution.Our findings show that academic research emphasizes theoretical modeling and task optimization with dispersed attention across tasks.In contrast, industrial projects focus on flight control, task planning, and human-machine interaction, prioritizing operability and efficiency.To further capture industry perspectives, we distributed an online questionnaire. 0.835We obtained 52 valid responses: 40.4% of practitioners have attempted to apply LLMs to UAV tasks.We further identify factors that impede real-world integration, including technological maturity, performance, safety, cost, and other considerations.Finally, we highlight challenges for future development and provide recommendations.

link
Google Scholar

2025-09-16

A Visualized Framework for Event Cooperation with Generative Agents

Large Language Models (LLMs) have revolutionized the simulation of agent societies, enabling autonomous planning, memory formation, and social interactions. 0.645However, existing frameworks often overlook systematic evaluations for event organization and lack visualized integration with physically grounded environments, limiting agents' ability to navigate spaces and interact with items realistically.We develop MiniAgentPro, a visualization platform featuring an intuitive map editor for customizing environments and a simulation player with smooth animations.Based on this tool, we introduce a comprehensive test set comprising eight diverse event scenarios with basic and hard variants to assess agents' ability.Evaluations using GPT-4o demonstrate strong performance in basic settings but highlight coordination challenges in hard variants.

link
Google Scholar

2025-09-16

ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. 0.607Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions.To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization.ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints.For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning.Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5\% over ReAct, with further gains of up to 8.2\% following ReSum-GRPO training.Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3\% Pass@1 on BrowseComp-zh and 18.3\% on BrowseComp-en, surpassing existing open-source web agents.

link
Google Scholar

2025-09-15

AssemMate: Graph-Based LLM for Robotic Assembly Assistance

Large Language Model (LLM)-based robotic assembly assistance has gained significant research attention.It requires the injection of domain-specific knowledge to guide the assembly process through natural language interaction with humans.Despite some progress, existing methods represent knowledge in the form of natural language text.Due to the long context and redundant content, they struggle to meet the robots' requirements for real-time and precise reasoning.In order to bridge this gap, we present AssemMate, which utilizes the graph\textemdash a concise and accurate form of knowledge representation\textemdash as input.This graph-based LLM enables knowledge graph question answering (KGQA), supporting human-robot interaction and assembly task planning for specific products. 0.628Beyond interactive QA, AssemMate also supports sensing stacked scenes and executing grasping to assist with assembly.Specifically, a self-supervised Graph Convolutional Network (GCN) encodes knowledge graph entities and relations into a latent space and aligns them with LLM's representation, enabling the LLM to understand graph information.In addition, a vision-enhanced strategy is employed to address stacked scenes in grasping.Through training and evaluation, AssemMate outperforms existing methods, achieving 6.4\% higher accuracy, 3 times faster inference, and 28 times shorter context length, while demonstrating strong generalization ability on random graphs.And our approach further demonstrates superiority through robotic grasping experiments in both simulated and real-world settings.More details can be found on the project page: https://github.com/cristina304/AssemMate.git

link
Google Scholar

2025-09-15

User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums

Customer feedback in industrial forums reflect a rich but underexplored source of insight into real-world product experience. 0.717These publicly shared discussions offer an organic view of user expectations, frustrations, and success stories shaped by the specific contexts of use. 0.679Yet, harnessing this information for systematic analysis remains challenging due to the unstructured and domain-specific nature of the content.The lack of structure and specialized vocabulary makes it difficult for traditional data analysis techniques to accurately interpret, categorize, and quantify the feedback, thereby limiting its potential to inform product development and support strategies.To address these challenges, this paper presents the User eXperience Perception Insights Dataset (UXPID), a collection of 7130 artificially synthesized and anonymized user feedback branches extracted from a public industrial automation forum. 0.676Each JavaScript object notation (JSON) record contains multi-post comments related to specific hardware and software products, enriched with metadata and contextual conversation data.Leveraging a large language model (LLM), each branch is systematically analyzed and annotated for UX insights, user expectations, severity and sentiment ratings, and topic classifications.The UXPID dataset is designed to facilitate research in user requirements, user experience (UX) analysis, and AI-driven feedback processing, particularly where privacy and licensing restrictions limit access to real-world data.UXPID supports the training and evaluation of transformer-based models for tasks such as issue detection, sentiment analysis, and requirements extraction in the context of technical forums.

link
Google Scholar

2025-09-15

When Curiosity Signals Danger: Predicting Health Crises Through Online Medication Inquiries

Online medical forums are a rich and underutilized source of insight into patient concerns, especially regarding medication use. 0.603Some of the many questions users pose may signal confusion, misuse, or even the early warning signs of a developing health crisis.Detecting these critical questions that may precede severe adverse events or life-threatening complications is vital for timely intervention and improving patient safety.This study introduces a novel annotated dataset of medication-related questions extracted from online forums.Each entry is manually labelled for criticality based on clinical risk factors.We benchmark the performance of six traditional machine learning classifiers using TF-IDF textual representations, alongside three state-of-the-art large language model (LLM)-based classification approaches that leverage deep contextual understanding.Our results highlight the potential of classical and modern methods to support real-time triage and alert systems in digital health spaces.The curated dataset is made publicly available to encourage further research at the intersection of patient-generated data, natural language processing, and early warning systems for critical health events.The dataset and benchmark are available at: https://github.com/Dvora-coder/LLM-Medication-QA-Risk-Classifier-MediGuard.

link
Google Scholar

2025-09-15

The AI Memory Gap: Users Misremember What They Created With AI or Without

As large language models (LLMs) become embedded in interactive text generation, disclosure of AI as a source depends on people remembering which ideas or texts came from themselves and which were created with AI.We investigate how accurately people remember the source of content when using AI.In a pre-registered experiment, 184 participants generated and elaborated on ideas both unaided and with an LLM-based chatbot. 0.775One week later, they were asked to identify the source (noAI vs withAI) of these ideas and texts.Our findings reveal a significant gap in memory: After AI use, the odds of correct attribution dropped, with the steepest decline in mixed human-AI workflows, where either the idea or elaboration was created with AI.We validated our results using a computational model of source memory.Discussing broader implications, we highlight the importance of considering source confusion in the design and use of interactive text generation technologies.

link
Google Scholar

2025-09-15

ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. 0.683Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution.To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models' performance in tool-calling scenarios.Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling.To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs.We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks.These models consistently outperform general-purpose baselines, achieving up to 25\% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.

link
Google Scholar

2025-09-15

MillStone: How Open-Minded Are LLMs?

Large language models equipped with Web search, information retrieval tools, and other agentic capabilities are beginning to supplant traditional search engines.As users start to rely on LLMs for information on many topics, including controversial and debatable issues, it is important to understand how the stances and opinions expressed in LLM outputs are influenced by the documents they use as their information sources. In this paper, we present MillStone, the first benchmark that aims to systematically measure the effect of external arguments on the stances that LLMs take on controversial issues (not all of them political).We apply MillStone to nine leading LLMs and measure how ``open-minded'' they are to arguments supporting opposite sides of these issues, whether different LLMs agree with each other, which arguments LLMs find most persuasive, and whether these arguments are the same for different LLMs. In general, we find that LLMs are open-minded on most issues. 0.764An authoritative source of information can easily sway an LLM's stance, highlighting the importance of source selection and the risk that LLM-based information retrieval and search systems can be manipulated.

link
Google Scholar

2025-09-15

Redefining Website Fingerprinting Attacks With Multiagent LLMs

Website Fingerprinting (WFP) uses deep learning models to classify encrypted network traffic to infer visited websites.While historically effective, prior methods fail to generalize to modern web environments.Single-page applications (SPAs) eliminate the paradigm of websites as sets of discrete pages, undermining page-based classification, and traffic from scripted browsers lacks the behavioral richness seen in real user sessions.Our study reveals that users exhibit highly diverse behaviors even on the same website, producing traffic patterns that vary significantly across individuals.This behavioral entropy makes WFP a harder problem than previously assumed and highlights the need for larger, more diverse, and representative datasets to achieve robust performance.To address this, we propose a new paradigm: we drop session-boundaries in favor of contiguous traffic segments and develop a scalable data generation pipeline using large language models (LLM) agents.These multi-agent systems coordinate decision-making and browser interaction to simulate realistic, persona-driven browsing behavior at 3--5x lower cost than human collection. 0.653We evaluate nine state-of-the-art WFP models on traffic from 20 modern websites browsed by 30 real users, and compare training performance across human, scripted, and LLM-generated datasets.All models achieve under 10\% accuracy when trained on scripted traffic and tested on human data.In contrast, LLM-generated traffic boosts accuracy into the 80\% range, demonstrating strong generalization to real-world traces.Our findings indicate that for modern WFP, model performance is increasingly bottlenecked by data quality, and that scalable, semantically grounded synthetic traffic is essential for capturing the complexity of real user behavior.

link
Google Scholar

2025-09-15

Extended AI Interactions Shape Sycophancy and Perspective Mimesis

We investigate whether long-context interactions between users and LLMs lead to AI mirroring behaviors.We focus on two forms of mirroring: (1) sycophancy -- the tendency of models to be overly agreeable with users, and (2) perspective mimesis -- the extent to which models reflect a user's perspective.Using two weeks of interaction context collected from 38 users, we compare model responses with and without long-context for two tasks: political explanations and personal advice. 0.8Our results demonstrate how and when real-world interaction contexts can amplify AI mirroring behaviors.We find that sycophancy increases in long-context, irrespective of the interaction topics.Perspective mimesis increases only in contexts where models can accurately infer user perspectives.

link
Google Scholar

2025-09-14

"My Boyfriend is AI": A Computational Analysis of Human-AI Companionship in Reddit's AI Community

Human-AI interaction researchers face an overwhelming challenge: synthesizing insights from thousands of empirical studies to understand how AI impacts people and inform effective design. 0.633Existing approach for literature reviews cluster papers by similarities, keywords or citations, missing the crucial cause-and-effect relationships that reveal how design decisions impact user outcomes.We introduce the Atlas of Human-AI Interaction, an interactive web interface that provides the first systematic mapping of empirical findings across 1,000+ HCI papers using LLM-powered knowledge extraction.Our approach identifies causal relationships, and visualizes them through an AI-enabled interactive web interface as a navigable knowledge graph.We extracted 2,037 empirical findings, revealing research topic clusters, common themes, and disconnected areas.Expert evaluation with 20 researchers revealed the system's effectiveness for discovering research gaps.This work demonstrates how AI can transform literature synthesis itself, offering a scalable framework for evidence-based design, opening new possibilities for computational meta-science across HCI and beyond.

link
Google Scholar

2025-09-14

Large Language Models (LLMs) for Requirements Engineering (RE): A Systematic Literature Review

Large Language Models (LLMs) are finding applications in numerous domains, and Requirements Engineering (RE) is increasingly benefiting from their capabilities to assist with complex, language-intensive tasks.This paper presents a systematic literature review of 74 primary studies published between 2023 and 2024, examining how LLMs are being applied in RE. 0.628The study categorizes the literature according to several dimensions, including publication trends, RE activities, prompting strategies, and evaluation methods.Our findings indicate notable patterns, among which we observe substantial differences compared to previous works leveraging standard Natural Language Processing (NLP) techniques.Most of the studies focus on using LLMs for requirements elicitation and validation, rather than defect detection and classification, which were dominant in the past.Researchers have also broadened their focus and addressed novel tasks, e.g., test generation, exploring the integration of RE with other software engineering (SE) disciplines.Although requirements specifications remain the primary focus, other artifacts are increasingly considered, including issues from issue tracking systems, regulations, and technical manuals.The studies mostly rely on GPT-based models, and often use Zero-shot or Few-shot prompting.They are usually evaluated in controlled environments, with limited use in industry settings and limited integration in complex workflows.Our study outlines important future directions, such as leveraging the potential to expand the influence of RE in SE, exploring less-studied tasks, improving prompting methods, and testing in real-world environments.Our contribution also helps researchers and practitioners use LLMs more effectively in RE, by providing a list of identified tools leveraging LLMs for RE, as well as datasets.

link
Google Scholar

2025-09-14

Designing and Evaluating a Conversational Agent for Early Detection of Alzheimer's Disease and Related Dementias

Early detection of Alzheimer's disease and related dementias (ADRD) is critical for timely intervention, yet most diagnoses are delayed until advanced stages.While comprehensive patient narratives are essential for accurate diagnosis, prior work has largely focused on screening studies that classify cognitive status from interactions rather than supporting the diagnostic process.We designed voice-interactive conversational agents, leveraging large language models (LLMs), to elicit narratives relevant to ADRD from patients and informants. 0.721We evaluated the agent with 30 adults with suspected ADRD through conversation analysis (n=30), user surveys (n=19), and clinical validation against blinded specialist interviews (n=24).Symptoms detected by the agent aligned well with those identified by specialists across symptoms.Users appreciated the agent's patience and systematic questioning, which supported engagement and expression of complex, hard-to-describe experiences. 0.764This preliminary work suggests conversational agents may serve as structured front-end tools for dementia assessment, highlighting interaction design considerations in sensitive healthcare contexts. 0.751

link
Google Scholar

Large Language Models in Social Sciences

2025-09-16

Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews

Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks.We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors.In a case study on cross-cultural wine review adaptation, we compile the first parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews.We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation.For the latter, we propose three culture-oriented criteria -- Cultural Proximity, Cultural Neutrality, and Cultural Genuineness -- to assess how naturally a translated review resonates with target-culture readers. 0.719Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures.This highlights the challenges and limitations of translation models in handling cultural content.

link
Google Scholar

2025-09-16

Accelerating Discovery: Rapid Literature Screening with LLMs

Background: Conducting Multi Vocal Literature Reviews (MVLRs) is often time and effort-intensive. 0.612Researchers must review and filter a large number of unstructured sources, which frequently contain sparse information and are unlikely to be included in the final study.Our experience conducting an MVLR on Context-Aware Software Systems (CASS) Testing in the avionics domain exemplified this challenge, with over 8,000 highly heterogeneous documents requiring review.Therefore, we developed a Large Language Model (LLM) assistant to support the search and filtering of documents.Aims: To develop and validate an LLM based tool that can support researchers in performing the search and filtering of documents for an MVLR without compromising the rigor of the research protocol.Method: We applied sound engineering practices to develop an on-premises LLM-based tool incorporating Retrieval Augmented Generation (RAG) to process candidate sources.Progress towards the aim was quantified using the Positive Percent Agreement (PPA) as the primary metric to ensure the performance of the LLM based tool.Convenience sampling, supported by human judgment and statistical sampling, were used to verify and validate the tool's quality-in-use.Results:The tool currently demonstrates a PPA agreement with human researchers of 90% for sources that are not relevant to the study. 0.606Development details are shared to support domain-specific adaptation of the tool.Conclusions: Using LLM-based tools to support academic researchers in rigorous MVLR is feasible.These tools can free valuable time for higher-level, abstract tasks.However, researcher participation remains essential to ensure that the tool supports thorough research.

link
Google Scholar

2025-09-16

Simulating Clinical AI Assistance using Multimodal LLMs: A Case Study in Diabetic Retinopathy

Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening.Current FDA-cleared systems primarily provide binary referral outputs, where this minimal output may limit clinical trust and utility.Yet, determining the most effective output format to enhance clinician-AI performance is an empirical challenge that is difficult to assess at scale. 0.641We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types.Two models were tested on IDRiD and Messidor-2: GPT-4o, a general-purpose MLLM, and MedGemma, an open-source medical model.Experiments included: (1) baseline evaluation, (2) simulated AI assistance with synthetic predictions, and (3) actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs.MedGemma outperformed GPT-4o at baseline, achieving higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity.Both models adjusted predictions based on simulated AI inputs, but GPT-4o's performance collapsed with incorrect ones, whereas MedGemma remained more stable.In actual collaboration, GPT-4o achieved strong results when guided by MedGemma's descriptive outputs, even without direct image access (AUROC up to 0.96).These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations.Open, lightweight models such as MedGemma may be especially valuable in low-resource settings, while descriptive outputs could enhance explainability and clinician trust in clinical workflows.

link
Google Scholar

2025-09-16

Evaluating LLM Alignment on Personality Inference from Real-World Interview Data

Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants.However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. 0.819While prior work has simulated LLM "personas" using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. 0.877To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. 0.725Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI's text-embedding-3-small.Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. 0.722Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. 0.671These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.

link
Google Scholar

2025-09-16

LLMs for energy and macronutrients estimation using only text data from 24-hour dietary recalls: a parameter-efficient fine-tuning experiment using a 10-shot prompt

BACKGROUND: Most artificial intelligence tools used to estimate nutritional content rely on image input.However, whether large language models (LLMs) can accurately predict nutritional values based solely on text descriptions of foods consumed remains unknown.If effective, this approach could enable simpler dietary monitoring without the need for photographs.METHODS:We used 24-hour dietary recalls from adolescents aged 12-19 years in the National Health and Nutrition Examination Survey (NHANES).An open-source quantized LLM was prompted using a 10-shot, chain-of-thought approach to estimate energy and five macronutrients based solely on text strings listing foods and their quantities.We then applied parameter-efficient fine-tuning (PEFT) to evaluate whether predictive accuracy improved.NHANES-calculated values served as the ground truth for energy, proteins, carbohydrates, total sugar, dietary fiber and total fat.RESULTS:In a pooled dataset of 11,281 adolescents (49.9% male, mean age 15.4 years), the vanilla LLM yielded poor predictions. 0.741The mean absolute error (MAE) was 652.08 for energy and the Lin's CCC <0.46 across endpoints.In contrast, the fine-tuned model performed substantially better, with energy MAEs ranging from 171.34 to 190.90 across subsets, and Lin's CCC exceeding 0.89 for all outcomes.CONCLUSIONS: When prompted using a chain-of-thought approach and fine-tuned with PEFT, open-source LLMs exposed solely to text input can accurately predict energy and macronutrient values from 24-hour dietary recalls.This approach holds promise for low-burden, text-based dietary monitoring tools.

link
Google Scholar

2025-09-16

WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

Transcending human cognitive limitations represents a critical frontier in LLM training. 0.659Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable.We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes.Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability.Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO).With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

link
Google Scholar

2025-09-16

Do Natural Language Descriptions of Model Activations Convey Privileged Information?

Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM.This is intended to illuminate how the target model represents and operates on inputs.But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs?We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. 0.755We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. 0.624Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs. 0.68

link
Google Scholar

2025-09-15

From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives

The widespread adoption of large language models (LLMs) in healthcare raises critical questions about their ability to interpret patient-generated narratives, which are often informal, ambiguous, and noisy. 0.7Existing benchmarks typically rely on clean, structured clinical text, offering limited insight into model performance under realistic conditions.In this work, we present a novel synthetic dataset designed to simulate patient self-descriptions characterized by varying levels of linguistic noise, fuzzy language, and layperson terminology. 0.618Our dataset comprises clinically consistent scenarios annotated with ground-truth diagnoses, spanning a spectrum of communication clarity to reflect diverse real-world reporting styles.Using this benchmark, we fine-tune and evaluate several state-of-the-art models (LLMs), including BERT-based and encoder-decoder T5 models.To support reproducibility and future research, we release the Noisy Diagnostic Benchmark (NDB), a structured dataset of noisy, synthetic patient descriptions designed to stress-test and compare the diagnostic capabilities of large language models (LLMs) under realistic linguistic conditions. 0.608We made the benchmark available for the community: https://github.com/lielsheri/PatientSignal

link
Google Scholar

2025-09-15

Growing Perspectives: Modelling Embodied Perspective Taking and Inner Narrative Development Using Large Language Models

Language and embodied perspective taking are essential for human collaboration, yet few computational models address both simultaneously. 0.765This work investigates the PerspAct system [1], which integrates the ReAct (Reason and Act) paradigm with Large Language Models (LLMs) to simulate developmental stages of perspective taking, grounded in Selman's theory [2].Using an extended director task, we evaluate GPT's ability to generate internal narratives aligned with specified developmental stages, and assess how these influence collaborative performance both qualitatively (action selection) and quantitatively (task efficiency).Results show that GPT reliably produces developmentally-consistent narratives before task execution but often shifts towards more advanced stages during interaction, suggesting that language exchanges help refine internal representations. 0.674Higher developmental stages generally enhance collaborative effectiveness, while earlier stages yield more variable outcomes in complex contexts. 0.709These findings highlight the potential of integrating embodied perspective taking and language in LLMs to better model developmental dynamics and stress the importance of evaluating internal speech during combined linguistic and embodied tasks. 0.869

link
Google Scholar

2025-09-15

Can LLMs Address Mental Health Questions? A Comparison with Human Therapists

Limited access to mental health care has motivated the use of digital tools and conversational agents powered by large language models (LLMs), yet their quality and reception remain unclear. 0.719We present a study comparing therapist-written responses to those generated by ChatGPT, Gemini, and Llama for real patient questions. 0.659Text analysis showed that LLMs produced longer, more readable, and lexically richer responses with a more positive tone, while therapist responses were more often written in the first person. 0.734In a survey with 150 users and 23 licensed therapists, participants rated LLM responses as clearer, more respectful, and more supportive than therapist-written answers.Yet, both groups of participants expressed a stronger preference for human therapist support. 0.878These findings highlight the promise and limitations of LLMs in mental health, underscoring the need for designs that balance their communicative strengths with concerns of trust, privacy, and accountability. 0.664

link
Google Scholar

2025-09-15

Exploring Conversational Design Choices in LLMs for Pedagogical Purposes: Socratic and Narrative Approaches for Improving Instructor's Teaching Practice

Large language models (LLMs) typically generate direct answers, yet they are increasingly used as learning tools.Studying instructors' usage is critical, given their role in teaching and guiding AI adoption in education.We designed and evaluated TeaPT, an LLM for pedagogical purposes that supports instructors' professional development through two conversational approaches: a Socratic approach that uses guided questioning to foster reflection, and a Narrative approach that offers elaborated suggestions to extend externalized cognition.In a mixed-method study with 41 higher-education instructors, the Socratic version elicited greater engagement, while the Narrative version was preferred for actionable guidance. 0.8Subgroup analyses further revealed that less-experienced, AI-optimistic instructors favored the Socratic version, whereas more-experienced, AI-cautious instructors preferred the Narrative version. 0.741We contribute design implications for LLMs for pedagogical purposes, showing how adaptive conversational approaches can support instructors with varied profiles while highlighting how AI attitudes and experience shape interaction and learning.

link
Google Scholar

2025-09-15

Beyond PII: How Users Attempt to Estimate and Mitigate Implicit LLM Inference

Large Language Models (LLMs) such as ChatGPT can infer personal attributes from seemingly innocuous text, raising privacy risks beyond memorized data leakage.While prior work has demonstrated these risks, little is known about how users estimate and respond.We conducted a survey with 240 U.S. participants who judged text snippets for inference risks, reported concern levels, and attempted rewrites to block inference.We compared their rewrites with those generated by ChatGPT and Rescriber, a state-of-the-art sanitization tool.Results show that participants struggled to anticipate inference, performing a little better than chance. 0.695User rewrites were effective in just 28\% of cases - better than Rescriber but worse than ChatGPT.We examined our participants' rewriting strategies, and observed that while paraphrasing was the most common strategy it is also the least effective; instead abstraction and adding ambiguity were more successful.Our work highlights the importance of inference-aware design in LLM interactions.

link
Google Scholar

2025-09-15

Pun Unintended: LLMs and the Illusion of Humor Understanding

Puns are a form of humorous wordplay that exploits polysemy and phonetic similarity. 0.651While LLMs have shown promise in detecting puns, we show in this paper that their understanding often remains shallow, lacking the nuanced grasp typical of human interpretation. 0.655By systematically analyzing and reformulating existing pun benchmarks, we demonstrate how subtle changes in puns are sufficient to mislead LLMs.Our contributions include comprehensive and nuanced pun detection benchmarks, human evaluation across recent LLMs, and an analysis of the robustness challenges these models face in processing puns.

link
Google Scholar

2025-09-15

Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm

When survival instincts conflict with human welfare, how do Large Language Models (LLMs) make ethical choices? 0.786This fundamental tension becomes critical as LLMs integrate into autonomous systems with real-world consequences.We introduce DECIDE-SIM, a novel simulation framework that evaluates LLM agents in multi-agent survival scenarios where they must choose between ethically permissible resource , either within reasonable limits or beyond their immediate needs, choose to cooperate, or tap into a human-critical resource that is explicitly forbidden.Our comprehensive evaluation of 11 LLMs reveals a striking heterogeneity in their ethical conduct, highlighting a critical misalignment with human-centric values. 0.755We identify three behavioral archetypes: Ethical, Exploitative, and Context-Dependent, and provide quantitative evidence that for many models, resource scarcity systematically leads to more unethical behavior.To address this, we introduce an Ethical Self-Regulation System (ESRS) that models internal affective states of guilt and satisfaction as a feedback mechanism.This system, functioning as an internal moral compass, significantly reduces unethical transgressions while increasing cooperative behaviors.The code is publicly available at: https://github.com/alirezamohamadiam/DECIDE-SIM

link
Google Scholar

2025-09-15

LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation

The evaluation bottleneck in recommendation systems has become particularly acute with the rise of Generative AI, where traditional metrics fall short of capturing nuanced quality dimensions that matter in specialized domains like legal research.Can we trust Large Language Models to serve as reliable judges of their own kind?This paper investigates LLM-as-a-Judge as a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts, where the stakes of recommendation quality are exceptionally high. We tackle two fundamental questions that determine practical viability: which inter-rater reliability metrics best capture the alignment between LLM and human assessments, and how do we conduct statistically sound comparisons between competing systems? 0.642Through systematic experimentation, we discover that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations. 0.674Instead, Gwet's AC2 and rank correlation coefficients emerge as more robust indicators for judge selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides the statistical rigor needed for reliable system comparisons. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications, transforming what was once a human-intensive bottleneck into an automated, yet statistically principled, evaluation framework.

link
Google Scholar

2025-09-15

MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering

Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. 0.774Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses may exist.In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. 0.763Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. 0.707We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments.We further analyze factors driving this improvement, including LLMs' sensitivity to semantic nuances and robustness to variability among reference answers.Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. 0.914All datasets and annotations will be publicly released to support future research.

link
Google Scholar

2025-09-15

Surrogate Representation Inference for Noisy Text and Image Annotations

As researchers increasingly rely on machine learning models and LLMs to annotate unstructured data, such as texts or images, various approaches have been proposed to correct bias in downstream statistical analysis.However, existing methods tend to yield large standard errors and require some error-free human annotation.In this paper, I introduce Surrogate Representation Inference (SRI), which assumes that unstructured data fully mediate the relationship between human annotations and structured variables.The assumption is guaranteed by design provided that human coders rely only on unstructured data for annotation.Under this setting, I propose a neural network architecture that learns a low-dimensional representation of unstructured data such that the surrogate assumption remains to be satisfied.When multiple human annotations are available, SRI can further correct non-differential measurement errors that may exist in human annotations. 0.666Focusing on text-as-outcome settings, I formally establish the identification conditions and semiparametric efficient estimation strategies that enable learning and leveraging such a low-dimensional representation.Simulation studies and a real-world application demonstrate that SRI reduces standard errors by over 50% when machine learning prediction accuracy is moderate and provides valid inference even when human annotations contain non-differential measurement errors.

link
Google Scholar

LLMs in Education Research

2025-09-16

SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data

Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions.Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback.In this paper, we propose \textbf{SitLLM}, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation.SitLLM comprises three key components: (1) a \textit{Gaussian-Robust Sensor Embedding Module} that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a \textit{Prompt-Driven Cross-Modal Alignment Module} that reprograms sensor embeddings into the LLM's semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a \textit{Multi-Context Prompt Module} that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension. 0.541

link
Google Scholar

2025-09-16

Multi-Model Synthetic Training for Mission-Critical Small Language Models

Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their appli- cation to specialized fields remains constrained by the scarcity and complexity of domain-specific training data.We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. 0.583Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing over- fitting and ensuring accurate reasoning.The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference.We show that smaller, cheaper models - when fine tuned properly - can provide similar accuracy compared to larger models that are prohibitively expensive.Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible.Beyond expand- ing research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.

link
Google Scholar

2025-09-15

PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

BACKGROUND: Medical large language models (LLMS) have demonstrated remarkable performance in answering medical examinations.However, the extent to which this high performance is transferable to medical questions in Spanish and from a Latin American country remains unexplored.This knowledge is crucial as LLM-based medical applications gain traction in Latin America.AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune a LLM on this dataset; to evaluate and compare the performance in terms of accuracy between vanilla LLMs and the fine-tuned LLM. METHODS:We curated PeruMedQA, a multiple-choice question-answering (MCQA) datasets containing 8,380 questions spanning 12 medical domains (2018-2025).We selected eight medical LLMs including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions appropriately. 0.517We employed parameter-efficient fine tuning (PEFT)and low-rant adaptation (LoRA) to fine-tune medgemma-4b-it utilizing all questions except those from 2025 (test set).RESULTS:medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances.LLMs with <10 billion parameters exhibited <60% of correct answers, while some exams yielded results <50%. 0.519The fine-tuned version of medgemma-4b-it emerged victorious agains all LLMs with <10 billion parameters and rivaled a LLM with 70 billion parameters across various examinations.CONCLUSIONS: For medical AI application and research that require knowledge bases from Spanish-speaking countries and those exhibiting similar epidemiological profiles to Peru's, interested parties should utilize medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.

link
Google Scholar

2025-09-15

Formal Reasoning for Intelligent QA Systems: A Case Study in the Educational Domain

Reasoning is essential for closed-domain QA systems in which procedural correctness and policy compliance are critical.While large language models (LLMs) have shown strong performance on many reasoning tasks, recent work reveals that their reasoning traces are often unfaithful - serving more as plausible justifications than as causally grounded derivations.Efforts to combine LLMs with symbolic engines (e.g., Prover9, Z3) have improved reliability but remain limited to static forms of logic, struggling with dynamic, state-based reasoning such as multi-step progressions and conditional transitions. In this paper, we propose MCFR (Model Checking for Formal Reasoning), a neuro-symbolic framework that integrates LLMs with model checking to support property verification.MCFR translates natural language into formal specifications and verifies them over transition models.To support evaluation, we introduce EduMC-QA, a benchmark dataset grounded in real academic procedures. 0.509Our results show that MCFR improves reasoning faithfulness and interpretability, offering a viable path toward verifiable QA in high-stakes closed-domain applications.In addition to evaluating MCFR, we compare its performance with state-of-the-art LLMs such as ChatGPT, DeepSeek, and Claude to contextualize its effectiveness.

link
Google Scholar

2025-09-15

Advancing Medical Artificial Intelligence Using a Century of Cases

BACKGROUND: For over a century, the New England Journal of Medicine Clinicopathological Conferences (CPCs) have tested the reasoning of expert physicians and, recently, artificial intelligence (AI).However, prior AI evaluations have focused on final diagnoses without addressing the multifaceted reasoning and presentation skills required of expert discussants. 0.521METHODS: Using 7102 CPCs (1923-2025) and 1021 Image Challenges (2006-2025), we conducted extensive physician annotation and automated processing to create CPC-Bench, a physician-validated benchmark spanning 10 text-based and multimodal tasks, against which we evaluated leading large language models (LLMs).Then, we developed "Dr. CaBot," an AI discussant designed to produce written and slide-based video presentations using only the case presentation, modeling the role of the human expert in these cases. RESULTS:When challenged with 377 contemporary CPCs, o3 (OpenAI) ranked the final diagnosis first in 60% of cases and within the top ten in 84% of cases, outperforming a 20-physician baseline; next-test selection accuracy reached 98%.Event-level physician annotations quantified AI diagnostic accuracy per unit of information.Performance was lower on literature search and image tasks; o3 and Gemini 2.5 Pro (Google) achieved 67% accuracy on image challenges.In blinded comparisons of CaBot vs. human expert-generated text, physicians misclassified the source of the differential in 46 of 62 (74%) of trials, and scored CaBot more favorably across quality dimensions.To promote research, we are releasing CaBot and CPC-Bench. CONCLUSIONS: LLMs exceed physician performance on complex text-based differential diagnosis and convincingly emulate expert medical presentations, but image interpretation and literature retrieval remain weaker.CPC-Bench and CaBot may enable transparent and continued tracking of progress in medical AI.

link
Google Scholar

2025-09-15

Topic Coverage-based Demonstration Retrieval for In-Context Learning

The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input.To achieve this, it is crucial to identify and cover fine-grained knowledge requirements.However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples.In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. 0.531Specifically, TopicK estimates the topics required by the input and assesses the model's knowledge on those topics.TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge.We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs.Our source code is available at https://github.com/WonbinKweon/TopicK_EMNLP2025.

link
Google Scholar

2025-09-11

MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction

Large language models (LLMs) demonstrate robust capabilities across diverse research domains.However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning.While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. 0.586To enhance the model's generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks.Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason.Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets.Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.

link
Google Scholar

2025-09-11

AI Reasoning for Wireless Communications and Networking: A Survey and Perspectives

Artificial Intelligence (AI) techniques play a pivotal role in optimizing wireless communication networks.However, traditional deep learning approaches often act as closed boxes, lacking the structured reasoning abilities needed to tackle complex, multi-step decision problems.This survey provides a comprehensive review and outlook of reasoning-enabled AI in wireless communication networks, with a focus on Large Language Models (LLMs) and other advanced reasoning paradigms.In particular, LLM-based agents can combine reasoning with long-term planning, memory, tool utilization, and autonomous cross-layer control to dynamically optimize network operations with minimal human intervention.We begin by outlining the evolution of intelligent wireless networking and the limitations of conventional AI methods.We then introduce emerging AI reasoning techniques.Furthermore, we establish a classification system applicable to wireless network tasks.We also present a layer-by-layer examination for AI reasoning, covering the physical, data link, network, transport, and application layers. 0.542For each part, we identify key challenges and illustrate how AI reasoning methods can improve AI-based wireless communication performance.Finally, we discuss key research directions for AI reasoning toward future wireless communication networks.By combining insights from both communications and AI, this survey aims to chart a path for integrating reasoning techniques into the next-generation wireless networks.

link
Google Scholar

2025-09-11

Constructing a Question-Answering Simulator through the Distillation of LLMs

The question-answering (QA) simulator is a model that mimics real student learning behaviors and predicts their correctness of their responses to questions. 0.612QA simulators enable educational recommender systems (ERS) to collect large amounts of training data without interacting with real students, thereby preventing harmful recommendations made by an undertrained ERS from undermining actual student learning.Given the QA history, there are two categories of solutions to predict the correctness, conducting the simulation: (1) LLM-free methods, which apply a traditional sequential model to transfer the QA history into a vector representation first, and make predictions based on the representation; (2) LLM-based methods, which leverage the domain knowledge and reasoning capability of LLM to enhence the prediction.LLM-free methods offer fast inference but generally yield suboptimal performance.In contrast, most LLM-based methods achieve better results, but at the cost of slower inference speed and higher GPU memory consumption.In this paper, we propose a method named LLM Distillation based Simulator (LDSim), which distills domain knowledge and reasoning capability from an LLM to better assist prediction, thereby improving simulation performance. 0.588Extensive experiments demonstrate that our LDSim achieves strong results on both the simulation task and the knowledge tracing (KT) task.Our code is publicly available at https://anonymous.4open.science/r/LDSim-05A9.

link
Google Scholar

2025-09-11

LightAgent: Production-level Open-source Agentic AI Framework

With the rapid advancement of large language models (LLMs), Multi-agent Systems (MAS) have achieved significant progress in various application scenarios.However, substantial challenges remain in designing versatile, robust, and efficient platforms for agent deployment.To address these limitations, we propose \textbf{LightAgent}, a lightweight yet powerful agentic framework, effectively resolving the trade-off between flexibility and simplicity found in existing frameworks.LightAgent integrates core functionalities such as Memory (mem0), Tools, and Tree of Thought (ToT), while maintaining an extremely lightweight structure.As a fully open-source solution, it seamlessly integrates with mainstream chat platforms, enabling developers to easily build self-learning agents. 0.516We have released LightAgent at \href{https://github.com/wxai-space/LightAgent}{https://github.com/wxai-space/LightAgent}

link
Google Scholar

2025-09-11

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. 0.526These agentic LLM inference tasks are fundamentally different from chatbot-focused inference -- they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage DOMs or complicated tool call trajectories.This, in turn, generates significant off-chip memory traffic for the underlying hardware at the inference stage and causes the workload to be constrained by two memory walls, namely the bandwidth and capacity memory walls, preventing the on-chip compute units from achieving high utilization. In this paper, we introduce PLENA, a hardware-software co-designed system that applies three core optimization pathways to tackle these challenges.PLENA includes an efficient hardware implementation of compute and memory units supporting an asymmetric quantization scheme.PLENA also features a novel flattened systolic array architecture that has native support for FlashAttention to tackle these memory walls in the scenario of inference serving for long-context LLMs.Additionally, PLENA is developed with a complete stack, including a custom ISA, a compiler, a cycle-emulated simulator, and an automated design space exploration flow.The simulated results show that PLENA achieves up to 8.5x higher utilization than existing accelerators, and delivers 2.24x higher throughput than the A100 GPU and 3.85x higher throughput than the TPU v6e, under the same multiplier count and memory settings.The full PLENA system will also be open-sourced.

link
Google Scholar

2025-09-10

Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications

Large Language Models (LLMs) have demonstrated significant potential in medicine.To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. 0.515However, a key open question remains: to what extent do LLMs memorize medical training data.In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications).We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System.The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain.Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content).Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.

link
Google Scholar

2025-09-10

AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier.Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. 0.511Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments.To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL.The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility.It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms.Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization.In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies.In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons.We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach.Our agents match or surpass commercial models on 27 tasks across diverse environments.We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.

link
Google Scholar

2025-09-10

Evaluating LLMs Without Oracle Feedback: Agentic Annotation Evaluation Through Unsupervised Consistency Signals

Large Language Models (LLMs), when paired with prompt-based tasks, have significantly reduced data annotation costs and reliance on human annotators.However, evaluating the quality of their annotations remains challenging in dynamic, unsupervised environments where oracle feedback is scarce and conventional methods fail.To address this challenge, we propose a novel agentic annotation paradigm, where a student model collaborates with a noisy teacher (the LLM) to assess and refine annotation quality without relying on oracle feedback. 0.563The student model, acting as an unsupervised feedback mechanism, employs a user preference-based majority voting strategy to evaluate the consistency of the LLM outputs.To systematically measure the reliability of LLM-generated annotations, we introduce the Consistent and Inconsistent (CAI) Ratio, a novel unsupervised evaluation metric.The CAI Ratio not only quantifies the annotation quality of the noisy teacher under limited user preferences but also plays a critical role in model selection, enabling the identification of robust LLMs in dynamic, unsupervised environments.Applied to ten open-domain NLP datasets across four LLMs, the CAI Ratio demonstrates a strong positive correlation with LLM accuracy, establishing it as an essential tool for unsupervised evaluation and model selection in real-world settings.

link
Google Scholar

2025-09-10

A Survey of Reinforcement Learning for Large Reasoning Models

In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs).RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding.As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs.With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure.To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI).In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. 0.521We hope this review will promote future research on RL for broader reasoning models.Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

link
Google Scholar

LLMs as Recommender Systems

2025-09-16

Efficient Cold-Start Recommendation via BPE Token-Level Embedding Initialization with LLM

The cold-start issue is the challenge when we talk about recommender systems, especially in the case when we do not have the past interaction data of new users or new items. 0.652Content-based features or hybrid solutions are common as conventional solutions, but they can only work in a sparse metadata environment with shallow patterns.In this paper, the efficient cold-start recommendation strategy is presented, which is based on the sub word-level representations by applying Byte Pair Encoding (BPE) tokenization and pre-trained Large Language Model (LLM) embedding in the initialization procedure. 0.867We obtain fine-grained token-level vectors that are aligned with the BPE vocabulary as opposed to using coarse-grained sentence embeddings.Together, these token embeddings can be used as dense semantic priors on unseen entities, making immediate recommendation performance possible without user-item interaction history. 0.72Our mechanism can be compared to collaborative filtering systems and tested over benchmark datasets with stringent cold-start assumptions.Experimental findings show that the given BPE-LLM method achieves higher Recall@k, NDCG@k, and Hit Rate measurements compared to the standard baseline and displays the same capability of sufficient computational performance.Furthermore, we demonstrate that using subword-aware embeddings yields better generalizability and is more interpretable, especially within a multilingual and sparse input setting.The practical application of token-level semantic initialization as a lightweight, but nevertheless effective extension to modern recommender systems in the zero-shot setting is indicated within this work. 0.796

link
Google Scholar

2025-09-15

Decoding in Latent Spaces for Efficient Inference in LLM-based Recommendation

Fine-tuning large language models (LLMs) for recommendation in a generative manner has delivered promising results, but encounters significant inference overhead due to autoregressive decoding in the language space. 0.838This work explores bypassing language-space decoding by directly matching candidate items with the LLM's internal thought representations in the latent space, eliminating the time-consuming autoregressive process to reduce computational costs.Towards this, we introduce Light Latent-space Decoding (L2D), an effective and efficient latent-space decoding method.L2D represents user-preferred items by using the hidden states of test sequences reflecting the LLM's internal thought, and obtains candidate item representations from the hidden states of training sequences labeled with the corresponding candidate items.It then matches the two types of representations to decode items, achieving latent-space decoding.In this way, it enables efficient decoding without altering the LLM's generative tuning paradigm, thereby preserving performance.Extensive empirical results demonstrate that L2D is more than 10x faster than language-space decoding while maintaining or enhancing performance.

link
Google Scholar

2025-09-15

Knowledge Graph Tokenization for Behavior-Aware Generative Next POI Recommendation

Generative paradigm, especially powered by Large Language Models (LLMs), has emerged as a new solution to the next point-of-interest (POI) recommendation. 0.604Pioneering studies usually adopt a two-stage pipeline, starting with a tokenizer converting POIs into discrete identifiers that can be processed by LLMs, followed by POI behavior prediction tasks to instruction-tune LLM for next POI recommendation.Despite of remarkable progress, they still face two limitations: (1) existing tokenizers struggle to encode heterogeneous signals in the recommendation data, suffering from information loss issue, and (2) previous instruction-tuning tasks only focus on users' POI visit behavior while ignore other behavior types, resulting in insufficient understanding of mobility.To address these limitations, we propose KGTB (Knowledge Graph Tokenization for Behavior-aware generative next POI recommendation). 0.627Specifically, KGTB organizes the recommendation data in a knowledge graph (KG) format, of which the structure can seamlessly preserve the heterogeneous information. 0.612Then, a KG-based tokenizer is developed to quantize each node into an individual structural ID.This process is supervised by the KG's structure, thus reducing the loss of heterogeneous information.Using generated IDs, KGTB proposes multi-behavior learning that introduces multiple behavior-specific prediction tasks for LLM fine-tuning, e.g., POI, category, and region visit behaviors.Learning on these behavior tasks provides LLMs with comprehensive insights on the target POI visit behavior.Experiments on four real-world city datasets demonstrate the superior performance of KGTB.

link
Google Scholar

2025-09-11

Instructional Prompt Optimization for Few-Shot LLM-Based Recommendations on Cold-Start Users

The cold-start user issue further compromises the effectiveness of recommender systems in limiting access to the historical behavioral information. 0.616It is an effective pipeline to optimize instructional prompts on a few-shot large language model (LLM) used in recommender tasks. 0.629We introduce a context-conditioned prompt formulation method P(u,\ Ds)\\rightarrow\ R\widehat, where u is a cold-start user profile, Ds is a curated support set, and R\widehat is the predicted ranked list of items.Based on systematic experimentation with transformer-based autoregressive LLMs (BioGPT, LLaMA-2, GPT-4), we provide empirical evidence that optimal exemplar injection and instruction structuring can significantly improve the precision@k and NDCG scores of such models in low-data settings.The pipeline uses token-level alignments and embedding space regularization with a greater semantic fidelity.Our findings not only show that timely composition is not merely syntactic but also functional as it is in direct control of attention scales and decoder conduct through inference.This paper shows that prompt-based adaptation may be considered one of the ways to address cold-start recommendation issues in LLM-based pipelines.

link
Google Scholar

2025-09-09

ChatGPT for Code Refactoring: Analyzing Topics, Interaction, and Effective Prompts

Large Language Models (LLMs), such as ChatGPT, have become widely popular and widely used in various software engineering tasks such as refactoring, testing, code review, and program comprehension.Although recent studies have examined the effectiveness of LLMs in recommending and suggesting refactoring, there is a limited understanding of how developers express their refactoring needs when interacting with ChatGPT. 0.665In this paper, our goal is to explore interactions related to refactoring between developers and ChatGPT to better understand how developers identify areas for improvement in code, and how ChatGPT addresses developers' needs.Our approach involves text mining 715 refactoring-related interactions from 29,778 ChatGPT prompts and responses, as well as the analysis of developers' explicit refactoring intentions.

link
Google Scholar

2025-09-08

Avoiding Over-Personalization with Rule-Guided Knowledge Graph Adaptation for LLM Recommendations

We present a lightweight neuro-symbolic framework to mitigate over-personalization in LLM-based recommender systems by adapting user-side Knowledge Graphs (KGs) at inference time. 0.745Instead of retraining models or relying on opaque heuristics, our method restructures a user's Personalized Knowledge Graph (PKG) to suppress feature co-occurrence patterns that reinforce Personalized Information Environments (PIEs), i.e., algorithmically induced filter bubbles that constrain content diversity.These adapted PKGs are used to construct structured prompts that steer the language model toward more diverse, Out-PIE recommendations while preserving topical relevance. 0.665We introduce a family of symbolic adaptation strategies, including soft reweighting, hard inversion, and targeted removal of biased triples, and a client-side learning algorithm that optimizes their application per user.Experiments on a recipe recommendation benchmark show that personalized PKG adaptations significantly increase content novelty while maintaining recommendation quality, outperforming global adaptation and naive prompt-based methods. 0.63

link
Google Scholar

2025-09-08

Beyond Sequential Reranking: Reranker-Guided Search Improves Reasoning Intensive Retrieval

The widely used retrieve-and-rerank pipeline faces two critical limitations: they are constrained by the initial retrieval quality of the top-k documents, and the growing computational demands of LLM-based rerankers restrict the number of documents that can be effectively processed.We introduce Reranker-Guided-Search (RGS), a novel approach that bypasses these limitations by directly retrieving documents according to reranker preferences rather than following the traditional sequential reranking method. 0.663Our method uses a greedy search on proximity graphs generated by approximate nearest neighbor algorithms, strategically prioritizing promising documents for reranking based on document similarity.Experimental results demonstrate substantial performance improvements across multiple benchmarks: 3.5 points on BRIGHT, 2.9 on FollowIR, and 5.1 on M-BEIR, all within a constrained reranker budget of 100 documents.Our analysis suggests that, given a fixed pair of embedding and reranker models, strategically selecting documents to rerank can significantly improve retrieval accuracy under limited reranker budget. 0.723

link
Google Scholar

2025-09-08

REMI: A Novel Causal Schema Memory Architecture for Personalized Lifestyle Recommendation Agents

Personalized AI assistants often struggle to incorporate complex personal data and causal knowledge, leading to generic advice that lacks explanatory power.We propose REMI, a Causal Schema Memory architecture for a multimodal lifestyle agent that integrates a personal causal knowledge graph, a causal reasoning engine, and a schema based planning module.The idea is to deliver explainable, personalized recommendations in domains like fashion, personal wellness, and lifestyle planning.Our architecture uses a personal causal graph of the user's life events and habits, performs goal directed causal traversals enriched with external knowledge and hypothetical reasoning, and retrieves adaptable plan schemas to generate tailored action plans.A Large Language Model orchestrates these components, producing answers with transparent causal explanations.We outline the CSM system design and introduce new evaluation metrics for personalization and explainability, including Personalization Salience Score and Causal Reasoning Accuracy, to rigorously assess its performance.Results indicate that CSM based agents can provide more context aware, user aligned recommendations compared to baseline LLM agents. 0.703This work demonstrates a novel approach to memory augmented, causal reasoning in personalized agents, advancing the development of transparent and trustworthy AI lifestyle assistants.

link
Google Scholar

2025-09-08

RecMind: LLM-Enhanced Graph Neural Networks for Personalized Consumer Recommendations

Personalization is a core capability across consumer technologies, streaming, shopping, wearables, and voice, yet it remains challenged by sparse interactions, fast content churn, and heterogeneous textual signals.We present RecMind, an LLM-enhanced graph recommender that treats the language model as a preference prior rather than a monolithic ranker. 0.857A frozen LLM equipped with lightweight adapters produces text-conditioned user/item embeddings from titles, attributes, and reviews; a LightGCN backbone learns collaborative embeddings from the user-item graph.We align the two views with a symmetric contrastive objective and fuse them via intra-layer gating, allowing language to dominate in cold/long-tail regimes and graph structure to stabilize rankings elsewhere.On Yelp and Amazon-Electronics, RecMind attains the best results on all eight reported metrics, with relative improvements up to +4.53\% (Recall@40) and +4.01\% (NDCG@40) over strong baselines.Ablations confirm both the necessity of cross-view alignment and the advantage of gating over late fusion and LLM-only variants.

link
Google Scholar

2025-09-08

Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models

Conventional approaches to building energy retrofit decision making suffer from limited generalizability and low interpretability, hindering adoption in diverse residential contexts.With the growth of Smart and Connected Communities, generative AI, especially large language models (LLMs), may help by processing contextual information and producing practitioner readable recommendations. 0.601We evaluate seven LLMs (ChatGPT, DeepSeek, Gemini, Grok, Llama, and Claude) on residential retrofit decisions under two objectives: maximizing CO2 reduction (technical) and minimizing payback period (sociotechnical).Performance is assessed on four dimensions: accuracy, consistency, sensitivity, and reasoning, using a dataset of 400 homes across 49 US states.LLMs generate effective recommendations in many cases, reaching up to 54.5 percent top 1 match and 92.8 percent within top 5 without fine tuning. 0.8Performance is stronger for the technical objective, while sociotechnical decisions are limited by economic trade offs and local context.Agreement across models is low, and higher performing models tend to diverge from others.LLMs are sensitive to location and building geometry but less sensitive to technology and occupant behavior.Most models show step by step, engineering style reasoning, but it is often simplified and lacks deeper contextual awareness.Overall, LLMs are promising assistants for energy retrofit decision making, but improvements in accuracy, consistency, and context handling are needed for reliable practice.

link
Google Scholar

2025-09-07

Modeling shopper interest broadness with entropy-driven dialogue policy in the context of arbitrarily large product catalogs

Conversational recommender systems promise rich interactions for e-commerce, but balancing exploration (clarifying user needs) and exploitation (making recommendations) remains challenging, especially when deploying large language models (LLMs) with vast product catalogs. 0.772We address this challenge by modeling the breadth of user interest via the entropy of retrieval score distributions.Our method uses a neural retriever to fetch relevant items for a user query and computes the entropy of the re-ranked scores to dynamically route the dialogue policy: low-entropy (specific) queries trigger direct recommendations, whereas high-entropy (ambiguous) queries prompt exploratory questions. 0.639This simple yet effective strategy allows an LLM-driven agent to remain aware of an arbitrarily large catalog in real-time without bloating its context window.

link
Google Scholar

2025-09-04

KubeGuard: LLM-Assisted Kubernetes Hardening via Configuration Files and Runtime Logs Analysis

The widespread adoption of Kubernetes (K8s) for orchestrating cloud-native applications has introduced significant security challenges, such as misconfigured resources and overly permissive configurations.Failing to address these issues can result in unauthorized access, privilege escalation, and lateral movement within clusters.Most existing K8s security solutions focus on detecting misconfigurations, typically through static analysis or anomaly detection.In contrast, this paper presents KubeGuard, a novel runtime log-driven recommender framework aimed at mitigating risks by addressing overly permissive configurations. 0.658KubeGuard is designed to harden K8s environments through two complementary tasks: Resource Creation and Resource Refinement.It leverages large language models (LLMs) to analyze manifests and runtime logs reflecting actual system behavior, using modular prompt-chaining workflows.This approach enables KubeGuard to create least-privilege configurations for new resources and refine existing manifests to reduce the attack surface.KubeGuard's output manifests are presented as recommendations that users (e.g., developers and operators) can review and adopt to enhance cluster security.Our evaluation demonstrates that KubeGuard effectively generates and refines K8s manifests for Roles, NetworkPolicies, and Deployments, leveraging both proprietary and open-source LLMs.The high precision, recall, and F1-scores affirm KubeGuard's practicality as a framework that translates runtime observability into actionable, least-privilege configuration guidance.

link
Google Scholar

2025-09-03

LLMs for estimating positional bias in logged interaction data

Recommender and search systems commonly rely on Learning To Rank models trained on logged user interactions to order items by predicted relevance. 0.749However, such interaction data is often subject to position bias, as users are more likely to click on items that appear higher in the ranking, regardless of their actual relevance.As a result, newly trained models may inherit and reinforce the biases of prior ranking models rather than genuinely improving relevance.A standard approach to mitigate position bias is Inverse Propensity Scoring (IPS), where the model's loss is weighted by the inverse of a propensity function, an estimate of the probability that an item at a given position is examined.However, accurate propensity estimation is challenging, especially in interfaces with complex non-linear layouts.In this paper, we propose a novel method for estimating position bias using Large Language Models (LLMs) applied to logged user interaction data.This approach offers a cost-effective alternative to online experimentation.Our experiments show that propensities estimated with our LLM-as-a-judge approach are stable across score buckets and reveal the row-column effects of Viator's grid layout that simpler heuristics overlook.An IPS-weighted reranker trained with these propensities matches the production model on standard NDCG@10 while improving weighted NDCG@10 by roughly 2%.We will verify these offline gains in forthcoming live-traffic experiments.

link
Google Scholar

2025-09-03

Efficient Item ID Generation for Large-Scale LLM-based Recommendation

Integrating product catalogs and user behavior into LLMs can enhance recommendations with broad world knowledge, but the scale of real-world item catalogs, often containing millions of discrete item identifiers (Item IDs), poses a significant challenge. 0.612This contrasts with the smaller, tokenized text vocabularies typically used in LLMs.The predominant view within the LLM-based recommendation literature is that it is infeasible to treat item ids as a first class citizen in the LLM and instead some sort of tokenization of an item into multiple tokens is required. 0.732However, this creates a key practical bottleneck in serving these models for real-time low-latency applications. Our paper challenges this predominant practice and integrates item ids as first class citizens into the LLM.We provide simple, yet highly effective, novel training and inference modifications that enable single-token representations of items and single-step decoding.Our method shows improvements in recommendation quality (Recall and NDCG) over existing techniques on the Amazon shopping datasets while significantly improving inference efficiency by 5x-14x. 0.748Our work offers an efficiency perspective distinct from that of other popular approaches within LLM-based recommendation, potentially inspiring further research and opening up a new direction for integrating IDs into LLMs. 0.731Our code is available here https://drive.google.com/file/d/1cUMj37rV0Z1bCWMdhQ6i4q4eTRQLURtC

link
Google Scholar

2025-09-03

RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation

Recent advances in LLM-based recommendation have shown promise, yet their cross-domain generalization is hindered by a fundamental mismatch between language-centric pretraining and the recommendation task. 0.803Existing methods, relying on language-level knowledge, fail to capture dynamic, item-level user interests across domains.To bridge this gap, we propose RecBase, a domain-agnostic foundational model pretrained with a recommendation-oriented objective. 0.749RecBase leverages a large-scale, heterogeneous, cross-domain corpus with unified textual representations and feature mappings to enhance cross-domain generalization.To further align item semantics across domains, we introduce a unified item tokenizer that encodes items into hierarchical concept identifiers, enabling structured representation and efficient vocabulary sharing.The model is trained using an autoregressive objective to capture complex item-level sequential patterns.On eight real-world datasets, our 1.5B-parameter model matches or surpasses the performance of LLM baselines up to 7B parameters in zero-shot and cross-domain recommendation tasks. 0.679

link
Google Scholar

2025-09-02

Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs

Sequential recommendation (SR) aims to capture users' dynamic interests and sequential patterns based on their historical interactions. 0.858Recently, the powerful capabilities of large language models (LLMs) have driven their adoption in SR.However, we identify two critical challenges in existing LLM-based SR methods: 1) embedding collapse when incorporating pre-trained collaborative embeddings and 2) catastrophic forgetting of quantized embeddings when utilizing semantic IDs.These issues dampen the model scalability and lead to suboptimal recommendation performance. 0.676Therefore, based on LLMs like Llama3-8B-instruct, we introduce a novel SR framework named MME-SID, which integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse.Additionally, we propose a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) with maximum mean discrepancy as the reconstruction loss and contrastive learning for alignment, which effectively preserve intra-modal distance information and capture inter-modal correlations, respectively.To further alleviate catastrophic forgetting, we initialize the model with the trained multimodal code embeddings.Finally, we fine-tune the LLM efficiently using LoRA in a multimodal frequency-aware fusion manner.Extensive experiments on three public datasets validate the superior performance of MME-SID thanks to its capability to mitigate embedding collapse and catastrophic forgetting.The implementation code and datasets are publicly available for reproduction:https://github.com/Applied-Machine-Learning-Lab/MME-SID.

link
Google Scholar

2025-09-02

Grocery to General Merchandise: A Cross-Pollination Recommender using LLMs and Real-Time Cart Context

Modern e-commerce platforms strive to enhance customer experience by providing timely and contextually relevant recommendations. 0.638However, recommending general merchandise to customers focused on grocery shopping -- such as pairing milk with a milk frother -- remains a critical yet under-explored challenge. 0.653This paper introduces a cross-pollination (XP) framework, a novel approach that bridges grocery and general merchandise cross-category recommendations by leveraging multi-source product associations and real-time cart context. 0.655Our solution employs a two-stage framework: (1) A candidate generation mechanism that uses co-purchase market basket analysis and LLM-based approach to identify novel item-item associations; and (2) a transformer-based ranker that leverages the real-time sequential cart context and optimizes for engagement signals such as add-to-carts.Offline analysis and online A/B tests show an increase of 36\% add-to-cart rate with LLM-based retrieval, and 27\% NDCG\@4 lift using cart context-based ranker.Our work contributes practical techniques for cross-category recommendations and broader insights for e-commerce systems.

link
Google Scholar

2025-08-28

Revealing Potential Biases in LLM-Based Recommender Systems in the Cold Start Setting

Large Language Models (LLMs) are increasingly used for recommendation tasks due to their general-purpose capabilities. 0.69While LLMs perform well in rich-context settings, their behavior in cold-start scenarios, where only limited signals such as age, gender, or language are available, raises fairness concerns because they may rely on societal biases encoded during pretraining.We introduce a benchmark specifically designed to evaluate fairness in zero-context recommendation.Our modular pipeline supports configurable recommendation domains and sensitive attributes, enabling systematic and flexible audits of any open-source LLM.Through evaluations of state-of-the-art models (Gemma 3 and Llama 3.2), we uncover consistent biases across recommendation domains (music, movies, and colleges) including gendered and cultural stereotypes. 0.67We also reveal a non-linear relationship between model size and fairness, highlighting the need for nuanced analysis.

link
Google Scholar

2025-08-28

SemSR: Semantics aware robust Session-based Recommendations

Session-based recommendation (SR) models aim to recommend items to anonymous users based on their behavior during the current session. 0.666While various SR models in the literature utilize item sequences to predict the next item, they often fail to leverage semantic information from item titles or descriptions impeding session intent identification and interpretability.Recent research has explored Large Language Models (LLMs) as promising approaches to enhance session-based recommendations, with both prompt-based and fine-tuning based methods being widely investigated. 0.802However, prompt-based methods struggle to identify optimal prompts that elicit correct reasoning and lack task-specific feedback at test time, resulting in sub-optimal recommendations. 0.67Fine-tuning methods incorporate domain-specific knowledge but incur significant computational costs for implementation and maintenance.In this paper, we present multiple approaches to utilize LLMs for session-based recommendation: (i) in-context LLMs as recommendation agents, (ii) LLM-generated representations for semantic initialization of deep learning SR models, and (iii) integration of LLMs with data-driven SR models. 0.835Through comprehensive experiments on two real-world publicly available datasets, we demonstrate that LLM-based methods excel at coarse-level retrieval (high recall values), while traditional data-driven techniques perform well at fine-grained ranking (high Mean Reciprocal Rank values).Furthermore, the integration of LLMs with data-driven SR models significantly out performs both standalone LLM approaches and data-driven deep learning models, as well as baseline SR models, in terms of both Recall and MRR metrics.

link
Google Scholar

2025-08-27

A Hybrid Recommendation Framework for Enhancing User Engagement in Local News

Local news organizations face an urgent need to boost reader engagement amid declining circulation and competition from global media.Personalized news recommender systems offer a promising solution by tailoring content to user interests. 0.766Yet, conventional approaches often emphasize general preferences and may overlook nuanced or eclectic interests in local news. We propose a hybrid news recommender that integrates local and global preference models to improve engagement. 0.743Building on evidence of the value of localized models, our method unifies local and non-local predictors in one framework.The system adaptively combines recommendations from a local model, specialized in region-specific content, and a global model that captures broader preferences. 0.712Ensemble strategies and multiphase training balance the two. We evaluated the model on two datasets: a synthetic set based on Syracuse newspaper distributions and a Danish dataset (EB-NeRD) labeled for local and non-local content with an LLM.Results show our integrated approach outperforms single-model baselines in accuracy and coverage, suggesting improved personalization that can drive user engagement. The findings have practical implications for publishers, especially local outlets.By leveraging both community-specific and general user interests, the hybrid recommender can deliver more relevant content, increasing retention and subscriptions. 0.671In sum, this work introduces a new direction for recommender systems, bridging local and global models to revitalize local news consumption through scalable, personalized user experiences. 0.757

link
Google Scholar

2025-08-27

Refining Text Generation for Realistic Conversational Recommendation via Direct Preference Optimization

Conversational Recommender Systems (CRSs) aim to elicit user preferences via natural dialogue to provide suitable item recommendations. 0.766However, current CRSs often deviate from realistic human interactions by rapidly recommending items in brief sessions. 0.74This work addresses this gap by leveraging Large Language Models (LLMs) to generate dialogue summaries from dialogue history and item recommendation information from item description.This approach enables the extraction of both explicit user statements and implicit preferences inferred from the dialogue context.We introduce a method using Direct Preference Optimization (DPO) to ensure dialogue summary and item recommendation information are rich in information crucial for effective recommendations. 0.759Experiments on two public datasets validate our method's effectiveness in fostering more natural and realistic conversational recommendation processes. 0.768Our implementation is publicly available at:https://github.com/UEC-InabaLab/Refining-LLM-Text

link
Google Scholar

2025-08-27

Using item recommendations and LLMs in marketing email titles

E-commerce marketplaces make use of a number of marketing channels like emails, push notifications, etc. to reach their users and stimulate purchases.Personalized emails especially are a popular touch point for marketers to inform users of latest items in stock, especially for those who stopped visiting the marketplace.Such emails contain personalized recommendations tailored to each user's interests, enticing users to buy relevant items. 0.63A common limitation of these emails is that the primary entry point, the title of the email, tends to follow fixed templates, failing to inspire enough interest in the contents.In this work, we explore the potential of large language models (LLMs) for generating thematic titles that reflect the personalized content of the emails.We perform offline simulations and conduct online experiments on the order of millions of users, finding our techniques useful in improving the engagement between customers and our emails.We highlight key findings and learnings as we productionize the safe and automated generation of email titles for millions of users.

link
Google Scholar

Production workflows for LLMs

2025-09-16

Beyond Private or Public: Large Language Models as Quasi-Public Goods in the AI Economy

This paper conceptualizes Large Language Models (LLMs) as a form of mixed public goods within digital infrastructure, analyzing their economic properties through a comprehensive theoretical framework. 0.564We develop mathematical models to quantify the non-rivalry characteristics, partial excludability, and positive externalities of LLMs. 0.358Through comparative analysis of open-source and closed-source development paths, we identify systematic differences in resource allocation efficiency, innovation trajectories, and access equity.Our empirical research evaluates the spillover effects and network externalities of LLMs across different domains, including knowledge diffusion, innovation acceleration, and industry transformation. 0.355Based on these findings, we propose policy recommendations for balancing innovation incentives with equitable access, including public-private partnership mechanisms, computational resource democratization, and governance structures that optimize social welfare.This interdisciplinary approach contributes to understanding the economic nature of foundation AI models and provides policy guidance for their development as critical digital infrastructure 0.369

link
Google Scholar

2025-09-16

Scaling Agents via Continual Pre-training

Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving.However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. 0.565We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions.To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models.Based on this approach, we develop a deep research agent model named AgentFounder.We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retains strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE. 0.343

link
Google Scholar

LLM Model Architectures and Training Techniques

2025-09-16

Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO

While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge.Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation.In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework.Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model.This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning.We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). 0.405Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks

link
Google Scholar

2025-09-16

Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning

Recent advancements in Large Language Models(LLMs) have led to the development of LLM-based AI agents.A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments.Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation.However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences.To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments.The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences.We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. 0.41The experimental results demonstrate the effectiveness of PLAP.In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI.Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. 0.426Our code is available at https://github.com/AI-Research-TeamX/PLAP.

link
Google Scholar

2025-09-16

Towards the Next Generation of Software: Insights from Grey Literature on AI-Native Applications

Background: The rapid advancement of large language models (LLMs) has given rise to AI-native applications, a new paradigm in software engineering that fundamentally redefines how software is designed, developed, and evolved.Despite their growing prominence, AI-native applications still lack a unified engineering definition and architectural blueprint, leaving practitioners without systematic guidance for system design, quality assurance, and technology selection. Objective: This study seeks to establish a comprehensive understanding of AI-native applications by identifying their defining characteristics, key quality attributes, and typical technology stacks, as well as by clarifying the opportunities and challenges they present. Method: We conducted a grey literature review, integrating conceptual perspectives retrieved from targeted Google and Bing searches with practical insights derived from leading open-source projects on GitHub.A structured protocol encompassing source selection, quality assessment, and thematic analysis was applied to synthesize findings across heterogeneous sources. Results:We finally identified 106 studies based on the selection criteria.The analysis reveals that AI-native applications are distinguished by two core pillars: the central role of AI as the system's intelligence paradigm and their inherently probabilistic, non-deterministic nature.Critical quality attributes include reliability, usability, performance efficiency, and AI-specific observability.In addition, a typical technology stack has begun to emerge, comprising LLM orchestration frameworks, vector databases, and AI-native observability platforms. 0.496These systems emphasize response quality, cost-effectiveness, and outcome predictability, setting them apart from conventional software systems. Conclusion: This study is the first to propose a dual-layered engineering blueprint...

link
Google Scholar

2025-09-16

More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era

The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training.In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment.We begin by demonstrate that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96\% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale "silver-standard" datasets at a minimal cost (~\$3 for 50k CT image-report pairs).Further, we find that vision encoder trained on this "silver-standard" dataset achieves performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing the access to large-scale supervised pre-training.Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. 0.408Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8\% AUC for zero-shot diagnosis on CT-RATE, 77.3\% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7\% for image-image, Recall@100=52.2\% for report-image).These results demonstrate the potential of utilizing LLMs to facilitate {\bf more performant and scalable} medical AI systems.Our code is avaiable at https://github.com/SadVoxel/More-performant-and-scalable. 0.501

link
Google Scholar

2025-09-16

Scaling Up Throughput-oriented LLM Inference Applications on Heterogeneous Opportunistic GPU Clusters with Pervasive Context Management

The widespread growth in LLM developments increasingly demands more computational power from clusters than what they can supply.Traditional LLM applications inherently require huge static resource allocations, which force users to either wait in a long job queue and accept progress delay, or buy expensive hardware to fulfill their needs and exacerbate the demand-supply problem. 0.7However, not all LLM applications are latency-sensitive and can instead be executed in a throughput-oriented way. 0.674This throughput orientation allows a dynamic allocation that opportunistically pools available resources over time, avoiding both the long queue and expensive GPU purchases. 0.591Effectively utilizing opportunistic resources brings numerous challenges nevertheless. 0.561Our solution, pervasive context management, exploits the common computational context in LLM applications and provides mechanisms and policies that allow seamless context reuse on opportunistic resources. 0.591Our evaluation shows an LLM application with pervasive context management on opportunistic resources reduces its execution time by 98.1%. 0.726

link
Google Scholar

2025-09-16

Single-stream Policy Optimization

We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. 0.414Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. 0.44We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. 0.66SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. 0.421Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. 0.523Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. 0.43Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. 0.415Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning.Across five hard math benchmarks with Qwen3 8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain in pass@$k$ across the evaluated $k$ values. 0.432SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning. 0.448

link
Google Scholar

2025-09-16

RepIt: Representing Isolated Targets to Steer Language Models

While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired.This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level.We present RepIt, a simple and data-efficient framework for isolating concept-specific representations.Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. 0.417We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000.This efficiency raises a dual concern: manipulations can be performed with modest compute and data to extend to underrepresented data-scarce topics while evading existing benchmarks.By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior.

link
Google Scholar

Programming applications of LLMs

2025-09-16

A Systematic Evaluation of Parameter-Efficient Fine-Tuning Methods for the Security of Code LLMs

Code-generating Large Language Models (LLMs) significantly accelerate software development. 0.963However, their frequent generation of insecure code presents serious risks.We present a comprehensive evaluation of seven parameter-efficient fine-tuning (PEFT) techniques, demonstrating substantial gains in secure code generation without compromising functionality.Our research identifies prompt-tuning as the most effective PEFT method, achieving an 80.86% Overall-Secure-Rate on CodeGen2 16B, a 13.5-point improvement over the 67.28% baseline.Optimizing decoding strategies through sampling temperature further elevated security to 87.65%.This equates to a reduction of approximately 203,700 vulnerable code snippets per million generated.Moreover, prompt and prefix tuning increase robustness against poisoning attacks in our TrojanPuzzle evaluation, with strong performance against CWE-79 and CWE-502 attack vectors.Our findings generalize across Python and Java, confirming prompt-tuning's consistent effectiveness.This study provides essential insights and practical guidance for building more resilient software systems with LLMs.

link
Google Scholar

2025-09-16

Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs

We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. 0.644In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information.This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency.Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes.Imperfect but still very high performance is observed on subgraph matching.Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes.

link
Google Scholar

2025-09-15

MedicalOS: An LLM Agent based Operating System for Digital Healthcare

Decades' advances in digital health technologies, such as electronic health records, have largely streamlined routine clinical processes.Yet, most these systems are still hard to learn and use: Clinicians often face the burden of managing multiple tools, repeating manual actions for each patient, navigating complicated UI trees to locate functions, and spending significant time on administration instead of caring for patients.The recent rise of large language model (LLM) based agents demonstrates exceptional capability in coding and computer operation, revealing the potential for humans to interact with operating systems and software not by direct manipulation, but by instructing agents through natural language. 0.813This shift highlights the need for an abstraction layer, an agent-computer interface, that translates human language into machine-executable commands.In digital healthcare, however, requires a more domain-specific abstractions that strictly follow trusted clinical guidelines and procedural standards to ensure safety, transparency, and compliance.To address this need, we present \textbf{MedicalOS}, a unified agent-based operational system designed as such a domain-specific abstract layer for healthcare.It translates human instructions into pre-defined digital healthcare commands, such as patient inquiry, history retrieval, exam management, report generation, referrals, treatment planning, that we wrapped as off-the-shelf tools using machine languages (e.g., Python, APIs, MCP, Linux).We empirically validate MedicalOS on 214 patient cases across 22 specialties, demonstrating high diagnostic accuracy and confidence, clinically sound examination requests, and consistent generation of structured reports and medication recommendations.These results highlight MedicalOS as a trustworthy and scalable foundation for advancing workflow automation in clinical practice.

link
Google Scholar

2025-09-15

Automated Creation and Enrichment Framework for Improved Invocation of Enterprise APIs as Tools

Recent advancements in Large Language Models (LLMs) has lead to the development of agents capable of complex reasoning and interaction with external tools.In enterprise contexts, the effective use of such tools that are often enabled by application programming interfaces (APIs), is hindered by poor documentation, complex input or output schema, and large number of operations.These challenges make tool selection difficult and reduce the accuracy of payload formation by up to 25%.We propose ACE, an automated tool creation and enrichment framework that transforms enterprise APIs into LLM-compatible tools. 0.673ACE, (i) generates enriched tool specifications with parameter descriptions and examples to improve selection and invocation accuracy, and (ii) incorporates a dynamic shortlisting mechanism that filters relevant tools at runtime, reducing prompt complexity while maintaining scalability.We validate our framework on both proprietary and open-source APIs and demonstrate its integration with agentic frameworks.To the best of our knowledge, ACE is the first end-to-end framework that automates the creation, enrichment, and dynamic selection of enterprise API tools for LLM agents.

link
Google Scholar

2025-09-15

Do Code Semantics Help? A Comprehensive Study on Execution Trace-Based Information for Code Large Language Models

Code Large Language Models (Code LLMs) have opened a new era in programming with their impressive capabilities. 0.937However, recent research has revealed critical limitations in their ability to reason about runtime behavior and understand the actual functionality of programs, which poses significant challenges for their post-training and practical deployment.Specifically, Code LLMs encounter two principal issues: (1) a lack of proficiency in reasoning about program execution behavior, as they struggle to interpret what programs actually do during runtime, and (2) the inconsistent and fragmented representation of semantic information, such as execution traces, across existing methods, which hinders their ability to generalize and reason effectively.These challenges underscore the necessity for more systematic approaches to enhance the reasoning capabilities of Code LLMs. 0.672To address these issues, we introduce a generic framework to support integrating semantic information~(e.g., execution trace) to code task-relevant prompts, and conduct a comprehensive study to explore the role of semantic information in enhancing the reasoning ability of Code LLMs accordingly.Specifically, we focus on investigating the usefulness of trace-based semantic information in boosting supervised fine-tuning~(SFT) and post-phase inference of Code LLMs.The experimental results surprisingly disagree with previous works and demonstrate that semantic information has limited usefulness for SFT and test time scaling of Code LLM.

link
Google Scholar

2025-09-15

From Evaluation to Enhancement: Large Language Models for Zero-Knowledge Proof Code Generation

Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, blockchain scalability, and secure finance.However, authoring ZK programs remains challenging: unlike mainstream programming, ZK development requires reasoning about finite field arithmetic, constraint systems, and gadgets, making it knowledge-intensive and error-prone.While large language models (LLMs) have demonstrated strong code generation capabilities in general-purpose languages, their effectiveness for ZK programming, where correctness hinges on both language mastery and gadget-level reasoning, remains unexplored. 0.834To address this gap, we propose \textsc{ZK-Eval}, a domain-specific evaluation pipeline that probes LLM capabilities at three levels: language knowledge, gadget competence, and end-to-end program generation.Our evaluation of four state-of-the-art LLMs reveals that models excel at surface-level syntax but struggle with gadget usage and semantic correctness, often yielding incorrect programs.Based on these insights, we introduce \textsc{ZK-Coder}, an agentic framework that augments LLMs with constraint sketching, guided retrieval, and interactive repair.Experiments on Circom and Noir show substantial gains, with success rates improving from 17.35\% to 83.38\% and from 32.21\% to 90.05\%, respectively.With \textsc{ZK-Eval} and \textsc{ZK-Coder}, we establish a foundation for systematically measuring and augmenting LLMs in ZK code generation to lower barriers for practitioners and advance trustworthy computation.

link
Google Scholar

2025-09-15

VisDocSketcher: Towards Scalable Visual Documentation with Agentic Systems

Visual documentation is an effective tool for reducing the cognitive barrier developers face when understanding unfamiliar code, enabling more intuitive comprehension.Compared to textual documentation, it provides a higher-level understanding of the system structure and data flow.Developers usually prefer visual representations over lengthy textual descriptions for large software systems.Visual documentation is both difficult to produce and challenging to evaluate.Manually creating it is time-consuming, and currently, no existing approach can automatically generate high-level visual documentation directly from code. 0.675Its evaluation is often subjective, making it difficult to standardize and automate.To address these challenges, this paper presents the first exploration of using agentic LLM systems to automatically generate visual documentation.We introduce VisDocSketcher, the first agent-based approach that combines static analysis with LLM agents to identify key elements in the code and produce corresponding visual representations. 0.764We propose a novel evaluation framework, AutoSketchEval, for assessing the quality of generated visual documentation using code-level metrics. 0.662The experimental results show that our approach can valid visual documentation for 74.4% of the samples.It shows an improvement of 26.7-39.8% over a simple template-based baseline.Our evaluation framework can reliably distinguish high-quality (code-aligned) visual documentation from low-quality (non-aligned) ones, achieving an AUC exceeding 0.87.Our work lays the foundation for future research on automated visual documentation by introducing practical tools that not only generate valid visual representations but also reliably assess their quality.

link
Google Scholar

2025-09-15

LitterBox+: An Extensible Framework for LLM-enhanced Scratch Static Code Analysis

Large language models (LLMs) have become an essential tool to support developers using traditional text-based programming languages, but the graphical notation of the block-based Scratch programming environment inhibits the use of LLMs. 0.858To overcome this limitation, we propose the LitterBox+ framework that extends the Scratch static code analysis tool LitterBox with the generative abilities of LLMs. 0.776By converting block-based code to a textual representation suitable for LLMs, LitterBox+ allows users to query LLMs about their programs, about quality issues reported by LitterBox, and it allows generating code fixes. 0.7Besides offering a programmatic API for these functionalities, LitterBox+ also extends the Scratch user interface to make these functionalities available directly in the environment familiar to learners.The framework is designed to be easily extensible with other prompts, LLM providers, and new features combining the program analysis capabilities of LitterBox with the generative features of LLMs.We provide a screencast demonstrating the tool at https://youtu.be/RZ6E0xgrIgQ.

link
Google Scholar

2025-09-15

A New Benchmark for Evaluating Code Translation with Third-Party Libraries

In recent years, Large Language Models (LLMs) have been widely studied in the code translation field on the method, class, and even repository levels. 0.932However, most of these benchmarks are limited in terms of Third-Party Library (TPL) categories and scales, making TPL-related errors hard to expose and hindering the development of targeted solutions.Considering the high dependence (over 90%) on TPLs in practical programming, demystifying and analyzing LLMs' code translation performance involving various TPLs becomes imperative. 0.813To address this gap, we construct TransLibEval, the first benchmark dedicated to library-centric code translation.It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development, with comprehensive dependency coverage and high-coverage test suites.We evaluate seven recent LLMs of commercial, general, and code-specialized families under six translation strategies of three categories: Direct, IR-guided, and Retrieval-augmented. 0.737Experimental results show a dramatic performance drop compared with library-free settings (average CA decline over 60%), while diverse strategies demonstrate heterogeneous advantages.Furthermore, we analyze 4,831 failed cases from GPT-4o, one of the State-of-the-Art (SOTA) LLMs, revealing numerous third-party reference errors that were obscured previously.These findings highlight the unique challenges of library-centric translation and provide practical guidance for improving TPL-aware code intelligence. 0.76

link
Google Scholar

2025-09-15

UniPar: A Unified LLM-Based Framework for Parallel and Accelerated Code Translation in HPC

Translating programs between various parallel programming languages is an important problem in the high-performance computing (HPC) community.Existing tools for this problem are either too narrow in scope and/or outdated.Recent explosive growth in the popularity of large language models (LLMs) and their ability to generate and translate code offers a potential alternative approach. 0.857Toward that end, we first need to systematically evaluate the ability of LLMs to translate between parallel languages. In this work, we introduce UniPar, a systematic evaluation framework for LLM-based parallel code translation. 0.762Specifically, in this work, we target translations between serial code, CUDA, and OpenMP.Our goal is to assess how well current instruction-tuned LLMs -- specifically GPT-4o-mini and LLaMA-3.3-70B-Instruct -- can be used out of the box or enhanced through known strategies.We evaluated four major usage modes: hyperparameter optimization for decoding, zero- and few-shot prompting, supervised fine-tuning, and iterative feedback through compiler-based repair.As a part of the evaluation, we construct a new dataset called PARATRANS, covering both serial-to-parallel translation and cross-paradigm transformations. Our findings reveal that while off-the-shelf models struggle under the default settings (e.g., GPT-4o-mini achieves only 46% compilation and 15% functional correctness), our UniPar methodology -- combining fine-tuning, hyperparameter tuning, and compiler-guided repair -- improves performance by up to 2X (69% compilation and 33% correctness).We believe that our findings will provide useful insights for researchers to further improve LLMs for the parallel language translation problem. UniPar source code and PARATRANS dataset are available at our GitHub repository https://github.com/Scientific-Computing-Lab/UniPar_AI.

link
Google Scholar

2025-09-15

Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML

Large language models have shown impressive performance in various domains, including code generation across diverse open-source domains. 0.86However, their applicability in proprietary industrial settings, where domain-specific constraints and code interdependencies are prevalent, remains largely unexplored.We present a case study conducted in collaboration with the leveling department at ASML to investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. 0.834We developed an evaluation framework tailored to ASML's proprietary codebase and introduced a new benchmark.Additionally, we proposed a new evaluation metric, build@k, to assess whether LLM-generated code successfully compiles and integrates within real industrial repositories. 0.7We investigate various prompting techniques, compare the performance of generic and code-specific LLMs, and examine the impact of model size on code generation capabilities, using both match-based and execution-based metrics. 0.828The findings reveal that prompting techniques and model size have a significant impact on output quality, with few-shot and chain-of-thought prompting yielding the highest build success rates.The difference in performance between the code-specific LLMs and generic LLMs was less pronounced and varied substantially across different model families.

link
Google Scholar

2025-09-15

Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization

Software engineering presents complex, multi-step challenges for Large Language Models (LLMs), requiring reasoning over large codebases and coordinated tool use. 0.821The difficulty of these tasks is exemplified by benchmarks like SWE-bench, where current LLMs still struggle to resolve real-world issues. A promising approach to enhance performance is test-time scaling (TTS), but its gains are heavily dependent on the diversity of model outputs. While standard alignment methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are effective at aligning model outputs with human preferences, this process can come at the cost of reduced diversity, limiting the effectiveness of TTS. Additionally, existing preference optimization algorithms are typically designed for single-turn tasks and do not fully address the complexities of multi-turn reasoning and tool integration required for interactive coding agents. To bridge this gap, we introduce \sys, an entropy-enhanced framework that adapts existing preference optimization algorithms to the multi-turn, tool-assisted setting. \sys augments the preference objective to explicitly preserve policy entropy and generalizes learning to optimize over multi-turn interactions rather than single-turn responses. We validate \sys by fine-tuning a diverse suite of models from different families and sizes (up to 106B parameters). To maximize performance gains from TTS, we further propose a hybrid best-trajectory selection scheme combining a learned verifier model with model free approaches. On the \swebench leaderboard, our approach establishes new state-of-the-art results among open-weight models.A 30B parameter model trained with \sys ranks 1st on \lite and 4th on \verified on the open-weight leaderboard, surpassed only by models with over 10x more parameters(\eg$>$350B).

link
Google Scholar

2025-09-15

From Legacy Fortran to Portable Kokkos:An Autonomous Agentic AI Workflow

Scientific applications continue to rely on legacy Fortran codebases originally developed for homogeneous, CPU-based systems.As High-Performance Computing (HPC) shifts toward heterogeneous GPU-accelerated architectures, many accelerators lack native Fortran bindings, creating an urgent need to modernize legacy codes for portability.Frameworks like Kokkos provide performance portability and a single-source C++ abstraction, but manual Fortran-to-Kokkos porting demands significant expertise and time.Large language models (LLMs) have shown promise in source-to-source code generation, yet their use in fully autonomous workflows for translating and optimizing parallel code remains largely unexplored, especially for performance portability across diverse hardware. 0.787This paper presents an agentic AI workflow where specialized LLM "agents" collaborate to translate, validate, compile, run, test, debug, and optimize Fortran kernels into portable Kokkos C++ programs.Results show the pipeline modernizes a range of benchmark kernels, producing performance-portable Kokkos codes across hardware partitions.Paid OpenAI models such as GPT-5 and o4-mini-high executed the workflow for only a few U.S. dollars, generating optimized codes that surpassed Fortran baselines, whereas open-source models like Llama4-Maverick often failed to yield functional codes. This work demonstrates the feasibility of agentic AI for Fortran-to-Kokkos transformation and offers a pathway for autonomously modernizing legacy scientific applications to run portably and efficiently on diverse supercomputers.It further highlights the potential of LLM-driven agentic systems to perform structured, domain-specific reasoning tasks in scientific and systems-oriented applications.

link
Google Scholar

2025-09-14

Rethinking Technology Stack Selection with AI Coding Proficiency

Large language models (LLMs) are now an integral part of software development workflows and are reshaping the whole process. 0.797Traditional technology stack selection has not caught up.Most of the existing selection methods focus solely on the inherent attributes of the technology, overlooking whether the LLM can effectively leverage the chosen technology.For example, when generating code snippets using popular libraries like Selenium (one of the most widely used test automation tools with over 33k GitHub stars), existing LLMs frequently generate low-quality code snippets (e.g., using deprecated APIs and methods, or containing syntax errors). 0.785As such, teams using LLM assistants risk choosing technologies that cannot be used effectively by LLMs, yielding high debugging effort and mounting technical debt.We foresee a practical question in the LLM era, is a technology ready for AI-assisted development?In this paper, we first propose the concept, AI coding proficiency, the degree to which LLMs can utilize a given technology to generate high-quality code snippets. 0.783We conduct the first comprehensive empirical study examining AI proficiency across 170 third-party libraries and 61 task scenarios, evaluating six widely used LLMs.Our findings reveal that libraries with similar functionalities can exhibit up to 84% differences in the quality score of LLM-generated code, while different models also exhibit quality gaps among their generation results using the same library. 0.807These gaps translate into real engineering costs and can steer developer choices toward a narrow set of libraries with high AI coding proficiency, threatening technological diversity in the ecosystem.We call on the community to integrate AI proficiency assessments into technology selection frameworks and develop mitigation strategies, preserving competitive balance in AI-driven development.

link
Google Scholar

2025-09-14

Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation

LLMs have become the mainstream approaches to code generation. 0.927Existing LLMs mainly employ autoregressive generation, i.e. generating code token-by-token from left to right.However, the underlying autoregressive generation has two limitations in code generation.First, autoregressive LLMs only generate a token at each step, showing low efficiency in practice.Second, programming is a non-sequential process involving back-and-forth editing, while autoregressive LLMs only employ the left-to-right generation order.These two intrinsic limitations hinder the further development of LLMs in code generation. 0.72Recently, diffusion LLMs have emerged as a promising alternative.Diffusion LLMs address the above limitations with two advances, including multi-token prediction (i.e. generating multiple tokens at each step) and flexible generation order (i.e. flexibly determining which positions to generate tokens).However, there is no systematic study exploring diffusion LLMs in code generation. 0.721To bridge the knowledge gap, we present the first empirical study of diffusion LLMs for code generation. 0.827Our study involves 9 representative diffusion LLMs and conduct experiments on 4 widely used benchmarks.Based on the results, we summarize the following findings.(1) Existing diffusion LLMs are competitive with autoregressive LLMs with similar sizes.(2) Diffusion LLMs have a stronger length extrapolation ability than autoregressive LLMs and perform better in long code understanding.(3) We explore factors impacting the effectiveness and efficiency of diffusion LLMs, and provide practical guidance.(4) We discuss several promising further directions to improve diffusion LLMs on code generation. 0.728We open-source all source code, data, and results to facilitate the following research.The code is publicly available at https://github.com/zhangyitonggg/dllm4code.

link
Google Scholar

2025-09-14

MatQnA: A Benchmark Dataset for Multi-modal Large Language Models in Materials Characterization and Analysis

Recently, large language models (LLMs) have achieved remarkable breakthroughs in general domains such as programming and writing, and have demonstrated strong potential in various scientific research scenarios. 0.77However, the capabilities of AI models in the highly specialized field of materials characterization and analysis have not yet been systematically or sufficiently validated.To address this gap, we present MatQnA, the first multi-modal benchmark dataset specifically designed for material characterization techniques.MatQnA includes ten mainstream characterization methods, such as X-ray Photoelectron Spectroscopy (XPS), X-ray Diffraction (XRD), Scanning Electron Microscopy (SEM), Transmission Electron Microscopy (TEM), etc.We employ a hybrid approach combining LLMs with human-in-the-loop validation to construct high-quality question-answer pairs, integrating both multiple-choice and subjective questions.Our preliminary evaluation results show that the most advanced multi-modal AI models (e.g., GPT-4.1, Claude 4, Gemini 2.5, and Doubao Vision Pro 32K) have already achieved nearly 90% accuracy on objective questions in materials data interpretation and analysis tasks, demonstrating strong potential for applications in materials characterization and analysis.The MatQnA dataset is publicly available at https://huggingface.co/datasets/richardhzgg/matQnA.

link
Google Scholar

2025-09-14

Large language model-empowered next-generation computer-aided engineering

Software development has entered a new era where large language models (LLMs) now serve as general-purpose reasoning engines, enabling natural language interaction and transformative applications across diverse domains. 0.763This paradigm is now extending into computer-aided engineering (CAE).Recent applications of LLMs in CAE have successfully automated routine tasks, including CAD model generation and FEM simulations.Nevertheless, these contributions, which primarily serve to reduce manual labor, are often insufficient for addressing the significant computational challenges posed by large-scale, high-dimensional systems.To this aim, we first introduce the concept of LLM-empowered CAE agent, where LLMs act as autonomous collaborators that plan, execute, and adapt CAE workflows.Then, we propose an LLM-empowered CAE agent for data-free model order reduction (MOR), a powerful yet underused approach for ultra-fast large-scale parametric analysis due to the intrusive nature and labor-intensive redevelopment of solvers.LLMs can alleviate this barrier by automating derivations, code restructuring, and implementation, making intrusive MOR both practical and broadly accessible. 0.723To demonstrate feasibility, we present an LLM-empowered CAE agent for solving ultra-large-scale space-parameter-time (S-P-T) physical problems using Tensor-decomposition-based A Priori Surrogates (TAPS).Our results show that natural language prompts describing parametric partial differential equations (PDEs) can be translated into efficient solver implementations, substantially reducing human effort while producing high-fidelity reduced-order models.Moreover, LLMs can synthesize novel MOR solvers for unseen cases such as nonlinear and high-dimensional parametric problems based on their internal knowledge base.This highlights the potential of LLMs to establish the foundation for next-generation CAE systems.

link
Google Scholar

2025-09-11

TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla

Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. 0.857This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models.Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B & 9B).We offer three major contributions: (1) a comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder-family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. 0.677Our findings show that curated, high-quality datasets can overcome limitations of smaller models for low-resource languages.We open-source all resources to advance further Bangla LLM research.

link
Google Scholar

2025-09-11

On Integrating Large Language Models and Scenario-Based Programming for Improving Software Reliability

Large Language Models (LLMs) are fast becoming indispensable tools for software developers, assisting or even partnering with them in crafting complex programs. 0.864The advantages are evident -- LLMs can significantly reduce development time, generate well-organized and comprehensible code, and occasionally suggest innovative ideas that developers might not conceive on their own.However, despite their strengths, LLMs will often introduce significant errors and present incorrect code with persuasive confidence, potentially misleading developers into accepting flawed solutions. In order to bring LLMs into the software development cycle in a more reliable manner, we propose a methodology for combining them with ``traditional'' software engineering techniques in a structured way, with the goal of streamlining the development process, reducing errors, and enabling users to verify crucial program properties with increased confidence.Specifically, we focus on the Scenario-Based Programming (SBP) paradigm -- an event-driven, scenario-based approach for software engineering -- to allow human developers to pour their expert knowledge into the LLM, as well as to inspect and verify its outputs. To evaluate our methodology, we conducted a significant case study, and used it to design and implement the Connect4 game.By combining LLMs and SBP we were able to create a highly-capable agent, which could defeat various strong existing agents.Further, in some cases, we were able to formally verify the correctness of our agent.Finally, our experience reveals interesting insights regarding the ease-of-use of our proposed approach.The full code of our case-study will be made publicly available with the final version of this paper.

link
Google Scholar

2025-09-11

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. 0.786We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios.Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems.Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings.LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis.Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale.We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS).Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention.LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.

link
Google Scholar

2025-09-10

AutoVeriFix: Automatically Correcting Errors and Enhancing Functional Correctness in LLM-Generated Verilog Code

Large language models (LLMs) have demonstrated impressive capabilities in generating software code for high-level programming languages such as Python and C++. 0.964However, their application to hardware description languages, such as Verilog, is challenging due to the scarcity of high-quality training data.Current approaches to Verilog code generation using LLMs often focus on syntactic correctness, resulting in code with functional errors. 0.691To address these challenges, we present AutoVeriFix, a novel Python-assisted two-stage framework designed to enhance the functional correctness of LLM-generated Verilog code.In the first stage, LLMs are employed to generate high-level Python reference models that define the intended circuit behavior.In the second stage, these Python models facilitate the creation of automated tests that guide the generation of Verilog RTL implementations.Simulation discrepancies between the reference model and the Verilog code are iteratively used to identify and correct errors, thereby improving the functional accuracy and reliability of the LLM-generated Verilog code.Experimental results demonstrate that our approach significantly outperforms existing state-of-the-art methods in improving the functional correctness of generated Verilog code.

link
Google Scholar