Ryan's Arxiv FrontPageGenerated on 2025-08-01. This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions. This project was originally created by Vincent Warmerdam, modifying his original frontpage for different paper categories. |
|
Prompt Engineering in Large Language Models |
|
2025-07-31 |
Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia.As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands.Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding.In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels)-a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250.GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms.On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness up to $63$% and execution speed up of up to $2.59$X. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance. 0.619 |
2025-07-31 |
DSBC : Data Science task Benchmarking with Context engineering
Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks.Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce.In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications.We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent.Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. 0.769We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach.Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment.The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents. |
2025-07-31 |
Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation
Large language models (LLMs) with instruction following capabilities have demonstrated impressive problem-solving abilities. 0.625While synthesizing instructional data from unsupervised text has become a common approach for training such models, conventional methods rely heavily on human effort for data annotation.Although existing automated synthesis paradigms have alleviated this constraint, they still exhibit significant limitations in ensuring adequate diversity and difficulty of synthesized instructions.To address these challenges, we propose Self-Foveate, an innovative LLM-driven method for instruction synthesis.This approach introduces a "Micro-Scatter-Macro" multi-level foveation methodology that effectively guides the LLM to deeply excavate fine-grained information embedded in unsupervised text, thereby enhancing both the diversity and difficulty of synthesized instructions.Comprehensive experiments across multiple unsupervised corpora and diverse model architectures validate the effectiveness and superiority of our proposed method.We publicly release our data and codes: https://github.com/Mubuky/Self-Foveate |
2025-07-31 |
DICE: Dynamic In-Context Example Selection in LLM Agents via Efficient Knowledge Transfer
Large language model-based agents, empowered by in-context learning (ICL), have demonstrated strong capabilities in complex reasoning and tool-use tasks.However, existing works have shown that the effectiveness of ICL is highly sensitive to the choice of demonstrations, with suboptimal examples often leading to unstable or degraded performance.While prior work has explored example selection, including in some agentic or multi-step settings, existing approaches typically rely on heuristics or task-specific designs and lack a general, theoretically grounded criterion for what constitutes an effective demonstration across reasoning steps. 0.684Therefore, it is non-trivial to develop a principled, general-purpose method for selecting demonstrations that consistently benefit agent performance.In this paper, we address this challenge with DICE, Dynamic In-Context Example Selection for LLM Agents, a theoretically grounded ICL framework for agentic tasks that selects the most relevant demonstrations at each step of reasoning.Our approach decomposes demonstration knowledge into transferable and non-transferable components through a causal lens, showing how the latter can introduce spurious dependencies that impair generalization.We further propose a stepwise selection criterion with a formal guarantee of improved agent performance.Importantly, DICE is a general, framework-agnostic solution that can be integrated as a plug-in module into existing agentic frameworks without any additional training cost.Extensive experiments across diverse domains demonstrate our method's effectiveness and generality, highlighting the importance of principled, context-aware demo selection for robust and efficient LLM agents. |
2025-07-31 |
CFDagent: A Language-Guided, Zero-Shot Multi-Agent System for Complex Flow Simulation
We introduce CFDagent, a zero-shot, multi-agent system that enables fully autonomous computational fluid dynamics (CFD) simulations from natural language prompts. 0.728CFDagent integrates three specialized LLM-driven agents: (i) the Preprocessing Agent that generates 3D geometries from textual or visual inputs using a hybrid text-to-3D diffusion model (Point-E) and automatically meshes the geometries; (ii) the Solver Agent that configures and executes an immersed boundary flow solver; and (iii) the Postprocessing Agent that analyzes and visualizes the results, including multimodal renderings.These agents are interactively guided by GPT-4o via conversational prompts, enabling intuitive and user-friendly interaction. 0.713We validate CFDagent by reproducing canonical sphere flows at Reynolds numbers of 100 and 300 using three distinct inputs: a simple text prompt (i.e., "sphere"), an image-based input, and a standard sphere model.The computed drag and lift coefficients from meshes produced by each input approach closely match available data.The proposed system enables synthesization of flow simulations and photorealistic visualizations for complex geometries.Through extensive tests on canonical and realistic scenarios, we demonstrate the robustness, versatility, and practical applicability of CFDagent.By bridging generative AI with high-fidelity simulations, CFDagent significantly lowers barriers to expert-level CFD, unlocking broad opportunities in education, scientific research, and practical engineering applications. |
2025-07-31 |
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. 0.638Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning.In this work, we propose \textbf{Seed-Prover}, a lemma-style whole-proof reasoning model.Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization.To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning.Seed-Prover proves $78.1\%$ of formalized past IMO problems, saturates MiniF2F, and achieves over 50\% on PutnamBench, outperforming the previous state-of-the-art by a large margin.To address the lack of geometry support in Lean, we introduce a geometry reasoning engine \textbf{Seed-Geometry}, which outperforms previous formal geometry engines.We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems.This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning. |
2025-07-31 |
CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics.In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, across MATH500, AMC23, AIME24 and GPQA-Diamond.For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard. 0.696 |
2025-07-31 |
Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities
While question-answering~(QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities.Therefore, we propose a holistic and generalizable framework based on \emph{cascaded question disclosure} that provides a more accurate estimate of the models' problem-solving capabilities while maintaining the scalability and automation.This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. 0.623We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm.We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families.Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models.We further validate our findings by extensive ablation studies. |
2025-07-30 |
LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Although large language models (LLMs) demonstrate remarkable capabilities across various tasks, evaluating their capabilities remains a challenging task.Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference.These issues make it difficult to evaluate the LLMs' true capabilities comprehensively.To tackle these challenges, we propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced.It utilizes LLMs to generate questions, answer independently, and evaluate mutually.This method integrates four key evaluation criteria: dynamic, transparent, objective, and professional, which existing evaluation methods cannot satisfy simultaneously.Experiments on eight mainstream LLMs across mathematics and programming verify the advantages of our method in distinguishing LLM performance.Furthermore, our study reveals several novel findings that are difficult for traditional methods to detect, including but not limited to: (1) Gemini demonstrates the highest original and professional question-design capabilities among others; (2) Some LLMs exhibit ''memorization-based answering'' by misrecognizing questions as familiar ones with a similar structure; (3) LLM evaluation results demonstrate high consistency (robustness). 0.766 |
2025-07-30 |
Enhancing Manufacturing Knowledge Access with LLMs and Context-aware Prompting
Knowledge graphs (KGs) have transformed data management within the manufacturing industry, offering effective means for integrating disparate data sources through shared and structured conceptual schemas.However, harnessing the power of KGs can be daunting for non-experts, as it often requires formulating complex SPARQL queries to retrieve specific information.With the advent of Large Language Models (LLMs), there is a growing potential to automatically translate natural language queries into the SPARQL format, thus bridging the gap between user-friendly interfaces and the sophisticated architecture of KGs.The challenge remains in adequately informing LLMs about the relevant context and structure of domain-specific KGs, e.g., in manufacturing, to improve the accuracy of generated queries.In this paper, we evaluate multiple strategies that use LLMs as mediators to facilitate information retrieval from KGs.We focus on the manufacturing domain, particularly on the Bosch Line Information System KG and the I40 Core Information Model.In our evaluation, we compare various approaches for feeding relevant context from the KG to the LLM and analyze their proficiency in transforming real-world questions into SPARQL queries.Our findings show that LLMs can significantly improve their performance on generating correct and complete queries when provided only the adequate context of the KG schema.Such context-aware prompting techniques help LLMs to focus on the relevant parts of the ontology and reduce the risk of hallucination. 0.684We anticipate that the proposed techniques help LLMs to democratize access to complex data repositories and empower informed decision-making in manufacturing settings. |
2025-07-30 |
From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs
Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs).However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. 0.722This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer.We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability.TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors.It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks.Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks.The code and data are available at: https://github.com/probe2/TIRESRAG-R1. |
2025-07-30 |
Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning
Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP), achieving impressive performance in text generation.Their token-level representations capture rich, human-aligned semantics.However, pooling these vectors into a text embedding discards crucial information.Nevertheless, many non-generative downstream tasks, such as clustering, classification, or retrieval, still depend on accurate and controllable sentence- or document-level embeddings.We explore several adaptation strategies for pre-trained, decoder-only LLMs: (i) various aggregation techniques for token embeddings, (ii) task-specific prompt engineering, and (iii) text-level augmentation via contrastive fine-tuning.Combining these components yields state-of-the-art performance on the English clustering track of the Massive Text Embedding Benchmark (MTEB).An analysis of the attention map further shows that fine-tuning shifts focus from prompt tokens to semantically relevant words, indicating more effective compression of meaning into the final hidden state. 0.729Our experiments demonstrate that LLMs can be effectively adapted as text embedding models through a combination of prompt engineering and resource-efficient contrastive fine-tuning on synthetically generated positive pairs. |
2025-07-30 |
Automatically discovering heuristics in a complex SAT solver with large language models
Satisfiability problem (SAT) is a cornerstone of computational complexity with broad industrial applications, and it remains challenging to optimize modern SAT solvers in real-world settings due to their intricate architectures.While automatic configuration frameworks have been developed, they rely on manually constrained search spaces and yield limited performance gains.This work introduces a novel paradigm which effectively optimizes complex SAT solvers via Large Language Models (LLMs), and a tool called AutoModSAT is developed.Three fundamental challenges are addressed in order to achieve superior performance: (1) LLM-friendly solver: Systematic guidelines are proposed for developing a modularized solver to meet LLMs' compatibility, emphasizing code simplification, information share and bug reduction; (2) Automatic prompt optimization: An unsupervised automatic prompt optimization method is introduced to advance the diversity of LLMs' output; (3) Efficient search strategy: We design a presearch strategy and an EA evolutionary algorithm for the final efficient and effective discovery of heuristics. 0.684Extensive experiments across a wide range of datasets demonstrate that AutoModSAT achieves 50% performance improvement over the baseline solver and achieves 30% superiority against the state-of-the-art (SOTA) solvers.Moreover, AutoModSAT attains a 20% speedup on average compared to parameter-tuned alternatives of the SOTA solvers, showcasing the enhanced capability in handling complex problem instances.This work bridges the gap between AI-driven heuristics discovery and mission-critical system optimization, and provides both methodological advancements and empirically validated results for next-generation complex solver development. |
2025-07-30 |
Exploring In-Context Learning for Frame-Semantic Parsing
Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics.This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning.We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. 0.605These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. 0.746Experiments are conducted on a subset of frames related to violent events.The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL.The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks. |
2025-07-30 |
Beyond Rigid AI: Towards Natural Human-Machine Symbiosis for Interoperative Surgical Assistance
Emerging surgical data science and robotics solutions, especially those designed to provide assistance in situ, require natural human-machine interfaces to fully unlock their potential in providing adaptive and intuitive aid.Contemporary AI-driven solutions remain inherently rigid, offering limited flexibility and restricting natural human-machine interaction in dynamic surgical environments.These solutions rely heavily on extensive task-specific pre-training, fixed object categories, and explicit manual-prompting.This work introduces a novel Perception Agent that leverages speech-integrated prompt-engineered large language models (LLMs), segment anything model (SAM), and any-point tracking foundation models to enable a more natural human-machine interaction in real-time intraoperative surgical assistance.Incorporating a memory repository and two novel mechanisms for segmenting unseen elements, Perception Agent offers the flexibility to segment both known and unseen elements in the surgical scene through intuitive interaction.Incorporating the ability to memorize novel elements for use in future surgeries, this work takes a marked step towards human-machine symbiosis in surgical procedures.Through quantitative analysis on a public dataset, we show that the performance of our agent is on par with considerably more labor-intensive manual-prompting strategies. 0.723Qualitatively, we show the flexibility of our agent in segmenting novel elements (instruments, phantom grafts, and gauze) in a custom-curated dataset.By offering natural human-machine interaction and overcoming rigidity, our Perception Agent potentially brings AI-based real-time assistance in dynamic surgical environments closer to reality. |
2025-07-30 |
ChatVis: Large Language Model Agent for Generating Scientific Visualizations
Large language models (LLMs) are rapidly increasing in capability, but they still struggle with highly specialized programming tasks such as scientific visualization.We present an LLM assistant, ChatVis, that aids the LLM to generate Python code for ParaView scientific visualization tasks, without the need for retraining or fine-tuning the LLM.ChatVis employs chain-of-thought prompt simplification, retrieval-augmented prompt generation using a vector database of documentation and code examples, and error checking with iterative prompt feedback to correct errors until a visualization is produced. 0.805An integral part of our approach is a benchmark suite of canonical visualization tasks, ParaView regression tests, and scientific use cases that includes comprehensive evaluation metrics.We evaluate our visualization assistant by comparing results with a variety of top-performing unassisted LLMs.We find that all the metrics are significantly improved with ChatVis. |
2025-07-29 |
DualSG: A Dual-Stream Explicit Semantic-Guided Multivariate Time Series Forecasting Framework
Multivariate Time Series Forecasting plays a key role in many applications.Recent works have explored using Large Language Models for MTSF to take advantage of their reasoning abilities.However, many methods treat LLMs as end-to-end forecasters, which often leads to a loss of numerical precision and forces LLMs to handle patterns beyond their intended design.Alternatively, methods that attempt to align textual and time series modalities within latent space frequently encounter alignment difficulty.In this paper, we propose to treat LLMs not as standalone forecasters, but as semantic guidance modules within a dual-stream framework.We propose DualSG, a dual-stream framework that provides explicit semantic guidance, where LLMs act as Semantic Guides to refine rather than replace traditional predictions.As part of DualSG, we introduce Time Series Caption, an explicit prompt format that summarizes trend patterns in natural language and provides interpretable context for LLMs, rather than relying on implicit alignment between text and time series in the latent space. 0.69We also design a caption-guided fusion module that explicitly models inter-variable relationships while reducing noise and computation.Experiments on real-world datasets from diverse domains show that DualSG consistently outperforms 15 state-of-the-art baselines, demonstrating the value of explicitly combining numerical forecasting with semantic guidance. |
2025-07-29 |
Introducing HALC: A general pipeline for finding optimal prompting strategies for automated coding with LLMs in the computational social sciences
LLMs are seeing widespread use for task automation, including automated coding in the social sciences.However, even though researchers have proposed different prompting strategies, their effectiveness varies across LLMs and tasks. 0.822Often trial and error practices are still widespread.We propose HALC$-$a general pipeline that allows for the systematic and reliable construction of optimal prompts for any given coding task and model, permitting the integration of any prompting strategy deemed relevant. 0.874To investigate LLM coding and validate our pipeline, we sent a total of 1,512 individual prompts to our local LLMs in over two million requests. 0.669We test prompting strategies and LLM task performance based on few expert codings (ground truth). 0.81When compared to these expert codings, we find prompts that code reliably for single variables (${\alpha}$climate = .76; ${\alpha}$movement = .78) and across two variables (${\alpha}$climate = .71; ${\alpha}$movement = .74) using the LLM Mistral NeMo.Our prompting strategies are set up in a way that aligns the LLM to our codebook$-$we are not optimizing our codebook for LLM friendliness. 0.881Our paper provides insights into the effectiveness of different prompting strategies, crucial influencing factors, and the identification of reliable prompts for each coding task and model. 0.918 |
2025-07-29 |
AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning
Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs).Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. 0.618Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies.AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration.Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibits superior generalization in tool-use behavior.These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs.The code and data are available at https://github.com/weiyifan1023/AutoTIR. |
2025-07-29 |
Composable Effect Handling for Programming LLM-integrated Scripts
Implementing LLM-integrated scripts introduces challenges in modularity and performance, as scripts are often coupled to specific LLM implementations and fail to exploit parallelization opportunities.This paper proposes using composable effect handling to separate workflow logic from effectful operations, such as LLM calls, I/O, and concurrency, enabling modularity without sacrificing the opportunity for performance optimization.By treating these operations as abstract interfaces and discharging them via effect handlers, this paper shows that scripts can achieve significant speedups (e.g., 10$\times$ in a Tree-of-Thoughts case study) without compromising modularity.This paper aims to promote composable effect handling as a programming style for LLM scripting. 0.601 |
Robustness Tools in LLM Safety |
|
2025-07-31 |
A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains
Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation.We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures.Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios.Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). 0.617Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861).The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments. |
2025-07-31 |
LLM-Based Identification of Infostealer Infection Vectors from Screenshots: The Case of Aurora
Infostealers exfiltrate credentials, session cookies, and sensitive data from infected systems.With over 29 million stealer logs reported in 2024, manual analysis and mitigation at scale are virtually unfeasible/unpractical. 0.708While most research focuses on proactive malware detection, a significant gap remains in leveraging reactive analysis of stealer logs and their associated artifacts.Specifically, infection artifacts such as screenshots, image captured at the point of compromise, are largely overlooked by the current literature. 0.653This paper introduces a novel approach leveraging Large Language Models (LLMs), more specifically gpt-4o-mini, to analyze infection screenshots to extract potential Indicators of Compromise (IoCs), map infection vectors, and track campaigns.Focusing on the Aurora infostealer, we demonstrate how LLMs can process screenshots to identify infection vectors, such as malicious URLs, installer files, and exploited software themes. 0.646Our method extracted 337 actionable URLs and 246 relevant files from 1000 screenshots, revealing key malware distribution methods and social engineering tactics.By correlating extracted filenames, URLs, and infection themes, we identified three distinct malware campaigns, demonstrating the potential of LLM-driven analysis for uncovering infection workflows and enhancing threat intelligence.By shifting malware analysis from traditional log-based detection methods to a reactive, artifact-driven approach that leverages infection screenshots, this research presents a scalable method for identifying infection vectors and enabling early intervention. |
2025-07-30 |
RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment
Automated Program Repair (APR) seeks to automatically correct software bugs without requiring human intervention.However, existing tools tend to generate patches that satisfy test cases without fixing the underlying bug, those are known as overfitting patches. 0.663To address this issue, Automated Patch Correctness Assessment (APCA) attempts to identify overfitting patches generated by APR tools. 0.664It can be solved as a static approach, meaning that no additional information is needed beyond the original and fixed code snippets.Current static techniques often struggle with reliability, flexibility and transparency. 0.657To address these issues, we introduce RePaCA, a novel static APCA technique that leverages Large Language Models (LLMs) specialized in thinking tasks.Our model is prompted with both buggy and fixed code snippets and guided to generate a Chain of Thought that analyses code differences, reasons about how the patch addresses the root cause, and ultimately provides a binary classification: correct or overfitting.To enhance these reasoning capabilities for the APCA task specifically, the LLM is finetuned using Reinforcement Learning with the Group Relative Policy Optimization algorithm.When evaluated on a standard Defects4J-derived test, our approach achieves state-of-the-art performance, with 83.1% accuracy and an 84.8% F1-score.Furthermore, our model demonstrates superior generalization capabilities when trained on different datasets, outperforming the leading technique.This reasoning capability also provides enhanced explainability for the patch assessment.These findings underscore the considerable promise of finetuned, reasoning LLMs to advance static APCA by enhancing accuracy, generalization, and explainability. |
2025-07-30 |
Investigating Hallucination in Conversations for Low Resource Languages
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resemble human writing.However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. 0.839Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. 0.93While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. 0.86We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3.We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi. 0.871 |
2025-07-30 |
The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach
In real-world scenarios, due to the highly decoupled and flexible nature of microservices, it poses greater challenges to system reliability.The more frequent occurrence of incidents has created a demand for Root Cause Analysis(RCA) methods that enable rapid identification and recovery of incidents.Large language model (LLM) provides a new path for quickly locating and recovering from incidents by leveraging their powerful generalization ability combined with expert experience.Current LLM for RCA frameworks are based on ideas like ReAct and Chain-of-Thought, but the hallucination of LLM and the propagation nature of anomalies often lead to incorrect localization results.Moreover, the massive amount of anomalous information generated in large, complex systems presents a huge challenge for the context window length of LLMs.To address these challenges, we propose KnowledgeMind, an innovative LLM multi-agent system based on Monte Carlo Tree Search and a knowledge base reward mechanism for standardized service-by-service reasoning.Compared to State-Of-The-Art(SOTA) LLM for RCA methods, our service-by-service exploration approach significantly reduces the burden on the maximum context window length, requiring only one-tenth of its size.Additionally, by incorporating a rule-based real-time reward mechanism, our method effectively mitigates hallucinations during the inference process. 0.739Compared to the SOTA LLM for RCA framework, our method achieves a 49.29% to 128.35% improvement in root cause localization accuracy. |
2025-07-30 |
MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention
Vision large language models (VLLMs) are focusing primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models.However, these approaches face high training and inference costs, as well as challenges in extracting visual details, effectively bridging across modalities.In this work, we propose a novel visual framework, MoCHA, to address these issues.Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions.To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features.We train MoCHA on two mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate their performance across various benchmarks.Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks.For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) presents outstanding abilities to mitigate hallucination by showing improvements of 3.25% in POPE and to follow visual instructions by raising 153 points on MME. 0.666Finally, ablation studies further confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA. |
2025-07-30 |
Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity
In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity.We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations.These annotated examples are systematically categorized into 3 main categories and 9 subcategories.Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans.Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. 0.665Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding.The dataset and code are publicly available at this GitHub repository:https://github.com/ictup/LLM-Chinese-Textual-Disambiguation. |
2025-07-29 |
Large Language Model-Based Framework for Explainable Cyberattack Detection in Automatic Generation Control Systems
The increasing digitization of smart grids has improved operational efficiency but also introduced new cybersecurity vulnerabilities, such as False Data Injection Attacks (FDIAs) targeting Automatic Generation Control (AGC) systems. 0.636While machine learning (ML) and deep learning (DL) models have shown promise in detecting such attacks, their opaque decision-making limits operator trust and real-world applicability.This paper proposes a hybrid framework that integrates lightweight ML-based attack detection with natural language explanations generated by Large Language Models (LLMs).Classifiers such as LightGBM achieve up to 95.13% attack detection accuracy with only 0.004 s inference latency.Upon detecting a cyberattack, the system invokes LLMs, including GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o mini, to generate human-readable explanation of the event. 0.618Evaluated on 100 test samples, GPT-4o mini with 20-shot prompting achieved 93% accuracy in identifying the attack target, a mean absolute error of 0.075 pu in estimating attack magnitude, and 2.19 seconds mean absolute error (MAE) in estimating attack onset.These results demonstrate that the proposed framework effectively balances real-time detection with interpretable, high-fidelity explanations, addressing a critical need for actionable AI in smart grid cybersecurity. |
2025-07-29 |
Promoting Online Safety by Simulating Unsafe Conversations with LLMs
Generative AI, including large language models (LLMs) have the potential -- and already are being used -- to increase the speed, scale, and types of unsafe conversations online.LLMs lower the barrier for entry for bad actors to create unsafe conversations in particular because of their ability to generate persuasive and human-like text. 0.629In our current work, we explore ways to promote online safety by teaching people about unsafe conversations that can occur online with and without LLMs.We build on prior work that shows that LLMs can successfully simulate scam conversations.We also leverage research in the learning sciences that shows that providing feedback on one's hypothetical actions can promote learning.In particular, we focus on simulating scam conversations using LLMs.Our work incorporates two LLMs that converse with each other to simulate realistic, unsafe conversations that people may encounter online between a scammer LLM and a target LLM but users of our system are asked provide feedback to the target LLM. |
2025-07-29 |
LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests
The usage of Large Language Models (LLMs) for software and test development has continued to increase since LLMs were first introduced, but only recently have the expectations of LLMs become more realistic.Verifying the correctness of code generated by LLMs is key to improving their usefulness, but there have been no comprehensive and fully autonomous solutions developed yet.Hallucinations are a major concern when LLMs are applied blindly to problems without taking the time and effort to verify their outputs, and an inability to explain the logical reasoning of LLMs leads to issues with trusting their results. 0.968To address these challenges while also aiming to effectively apply LLMs, this paper proposes a dual-LLM system (i.e. a generative LLM and a discriminative LLM) and experiments with the usage of LLMs for the generation of a large volume of compiler tests.We experimented with a number of LLMs possessing varying parameter counts and presented results using ten carefully-chosen metrics that we describe in detail in our narrative.Through our findings, it is evident that LLMs possess the promising potential to generate quality compiler tests and verify them automatically. 0.664 |
2025-07-29 |
HLSDebugger: Identification and Correction of Logic Bugs in HLS Code with LLM Solutions
High-level synthesis (HLS) accelerates hardware design by enabling the automatic translation of high-level descriptions into efficient hardware implementations.However, debugging HLS code is a challenging and labor-intensive task, especially for novice circuit designers or software engineers without sufficient hardware domain knowledge.The recent emergence of Large Language Models (LLMs) is promising in automating the HLS debugging process.Despite the great potential, three key challenges persist when applying LLMs to HLS logic debugging: 1) High-quality circuit data for training LLMs is scarce, posing a significant challenge. 0.6452) Debugging logic bugs in hardware is inherently more complex than identifying software bugs with existing golden test cases. 0.6013) The absence of reliable test cases requires multi-tasking solutions, performing both bug identification and correction.complicates the multi-tasking required for effective HLS debugging.In this work, we propose a customized solution named HLSDebugger to address the challenges.HLSDebugger first generates and releases a large labeled dataset with 300K data samples, targeting HLS logic bugs.The HLSDebugger model adopts an encoder-decoder structure, performing bug location identification, bug type prediction, and bug correction with the same model. 0.666HLSDebugger significantly outperforms advanced LLMs like GPT-4 in bug identification and by more than 3x in bug correction. 0.772It makes a substantial advancement in the exploration of automated debugging of HLS code. |
2025-07-29 |
Can We End the Cat-and-Mouse Game? Simulating Self-Evolving Phishing Attacks with LLMs and Genetic Algorithms
Anticipating emerging attack methodologies is crucial for proactive cybersecurity.Recent advances in Large Language Models (LLMs) have enabled the automated generation of phishing messages and accelerated research into potential attack techniques.However, predicting future threats remains challenging due to reliance on existing training data.To address this limitation, we propose a novel framework that integrates LLM-based phishing attack simulations with a genetic algorithm in a psychological context, enabling phishing strategies to evolve dynamically through adversarial interactions with simulated victims.Through simulations using Llama 3.1, we demonstrate that (1) self-evolving phishing strategies employ increasingly sophisticated psychological manipulation techniques, surpassing naive LLM-generated attacks, (2) variations in a victim's prior knowledge significantly influence the evolution of attack strategies, and (3) adversarial interactions between evolving attacks and adaptive defenses create a cat-and-mouse dynamic, revealing an inherent asymmetry in cybersecurity -- attackers continuously refine their methods, whereas defenders struggle to comprehensively counter all evolving threats. 0.658Our approach provides a scalable, cost-effective method for analyzing the evolution of phishing strategies and defenses, offering insights into future social engineering threats and underscoring the necessity of proactive cybersecurity measures. |
2025-07-29 |
Towards a rigorous evaluation of RAG systems: the challenge of due diligence
The rise of generative AI, has driven significant advancements in high-risk sectors like healthcare and finance.The Retrieval-Augmented Generation (RAG) architecture, combining language models (LLMs) with search engines, is particularly notable for its ability to generate responses from document corpora.Despite its potential, the reliability of RAG systems in critical contexts remains a concern, with issues such as hallucinations persisting. 0.835This study evaluates a RAG system used in due diligence for an investment fund.We propose a robust evaluation protocol combining human annotations and LLM-Judge annotations to identify system failures, like hallucinations, off-topic, failed citations, and abstentions. 0.651Inspired by the Prediction Powered Inference (PPI) method, we achieve precise performance measurements with statistical guarantees.We provide a comprehensive dataset for further analysis.Our contributions aim to enhance the reliability and scalability of RAG systems evaluation protocols in industrial applications. |
2025-07-29 |
HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs
Millions of individuals' well-being are challenged by the harms of substance use.Harm reduction as a public health strategy is designed to improve their health outcomes and reduce safety risks.Some large language models (LLMs) have demonstrated a decent level of medical knowledge, promising to address the information needs of people who use drugs (PWUD).However, their performance in relevant tasks remains largely unexplored.We introduce HRIPBench, a benchmark designed to evaluate LLM's accuracy and safety risks in harm reduction information provision.The benchmark dataset HRIP-Basic has 2,160 question-answer-evidence pairs.The scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks.We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain knowledge.Our results indicate that state-of-the-art LLMs still struggle to provide accurate harm reduction information, and sometimes, carry out severe safety risks to PWUD. 0.686The use of LLMs in harm reduction contexts should be cautiously constrained to avoid inducing negative health outcomes.WARNING:This paper contains illicit content that potentially induces harms. 0.693 |
2025-07-29 |
Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?
Automated vulnerability detection research has made substantial progress, yet its real-world impact remains limited.Current vulnerability datasets suffer from issues including label inaccuracy rates of 20-71%, extensive duplication, and poor coverage of critical CWE types.These issues create a significant "generalization gap" where models achieve misleading self-testing performance (measured on held-out data from same dataset for training) by exploiting spurious correlations rather than learning true vulnerability patterns. 0.66Our analysis reveals that many models experience substantial performance drops of up to 40.6% when evaluated on independent data, sometimes underperforming random guessing. To address these limitations, we present a three-part solution.First, we introduce a manually curated test dataset, BenchVul, covering the MITRE Top 25 Most Dangerous CWEs.Second, we construct a high-quality training dataset, TitanVul, comprising 35,045 functions by aggregating seven public sources and applying deduplication and validation using a novel multi-agent LLM framework.Third, we propose a Realistic Vulnerability Generation (RVG) framework, which synthesizes context-aware vulnerability examples for underrepresented but critical CWE types through simulated development workflows. Our evaluation shows the strengths of each component in closing the generalization gap.First, BenchVul shows the limitations of self-testing: models trained on existing datasets, such as BigVul and PrimeVul, experience performance drops on BenchVul (from 0.776 to 0.519 and from 0.567 to 0.337).Second, training models on TitanVul demonstrates improved generalization, with model performance increasing from 0.584 when evaluated on the same dataset to 0.767 when tested on BenchVul.Third, supplementing TitanVul with RVG-generated data yields further gains, increasing model performance by 14.0% to 0.874. |
2025-07-29 |
Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning
Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. 0.609GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design.To address these challenges, we propose Graph-R1, an agentic GraphRAG framework via end-to-end reinforcement learning (RL).It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism.Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality. |
2025-07-29 |
Libra: Large Chinese-based Safeguard for AI Content
Large language models (LLMs) excel in text understanding and generation but raise significant safety and ethical concerns in high-stakes applications.To mitigate these risks, we present Libra-Guard, a cutting-edge safeguard system designed to enhance the safety of Chinese-based LLMs. 0.794Leveraging a two-stage curriculum training pipeline, Libra-Guard enhances data efficiency by employing guard pretraining on synthetic samples, followed by fine-tuning on high-quality, real-world data, thereby significantly reducing reliance on manual annotations.To enable rigorous safety evaluations, we also introduce Libra-Test, the first benchmark specifically designed to evaluate the effectiveness of safeguard systems for Chinese content.It covers seven critical harm scenarios and includes over 5,700 samples annotated by domain experts.Experiments show that Libra-Guard achieves 86.79% accuracy, outperforming Qwen2.5-14B-Instruct (74.33%) and ShieldLM-Qwen-14B-Chat (65.69%), and nearing closed-source models like Claude-3.5-Sonnet and GPT-4o. 0.693These contributions establish a robust framework for advancing the safety governance of Chinese LLMs and represent a tentative step toward developing safer, more reliable Chinese AI systems. |
2025-07-29 |
MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation
The recent advancement of autonomous agents powered by Large Language Models (LLMs) has demonstrated significant potential for automating tasks on mobile devices through graphical user interfaces (GUIs).Despite initial progress, these agents still face challenges when handling complex real-world tasks.These challenges arise from a lack of knowledge about real-life mobile applications in LLM-based agents, which may lead to ineffective task planning and even cause hallucinations. 0.654To address these challenges, we propose a novel LLM-based agent framework called MapAgent that leverages memory constructed from historical trajectories to augment current task planning.Specifically, we first propose a trajectory-based memory mechanism that transforms task execution trajectories into a reusable and structured page-memory database.Each page within a trajectory is extracted as a compact yet comprehensive snapshot, capturing both its UI layout and functional context.Secondly, we introduce a coarse-to-fine task planning approach that retrieves relevant pages from the memory database based on similarity and injects them into the LLM planner to compensate for potential deficiencies in understanding real-world app scenarios, thereby achieving more informed and context-aware task planning.Finally, planned tasks are transformed into executable actions through a task executor supported by a dual-LLM architecture, ensuring effective tracking of task progress.Experimental results in real-world scenarios demonstrate that MapAgent achieves superior performance to existing methods.The code will be open-sourced to support further research. |
2025-07-29 |
Thou Shalt Not Prompt: Zero-Shot Human Activity Recognition in Smart Homes via Language Modeling of Sensor Data & Activities
Developing zero-shot human activity recognition (HAR) methods is a critical direction in smart home research -- considering its impact on making HAR systems work across smart homes having diverse sensing modalities, layouts, and activities of interest.The state-of-the-art solutions along this direction are based on generating natural language descriptions of the sensor data and feeding it via a carefully crafted prompt to the LLM to perform classification.Despite their performance guarantees, such ``prompt-the-LLM'' approaches carry several risks, including privacy invasion, reliance on an external service, and inconsistent predictions due to version changes, making a case for alternative zero-shot HAR methods that do not require prompting the LLMs. 0.735In this paper, we propose one such solution that models sensor data and activities using natural language, leveraging its embeddings to perform zero-shot classification and thereby bypassing the need to prompt the LLMs for activity predictions.The impact of our work lies in presenting a detailed case study on six datasets, highlighting how language modeling can bolster HAR systems in zero-shot recognition. |
Security Challenges in LLM Development |
|
2025-07-31 |
Adversarial-Guided Diffusion for Multimodal LLM Attacks
This paper addresses the challenge of generating adversarial image using a diffusion model to deceive multimodal large language models (MLLMs) into generating the targeted responses, while avoiding significant distortion of the clean image. 0.73To address the above challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarial attack MLLMs. 0.776We introduce adversarial-guided noise to ensure attack efficacy. 0.871A key observation in our design is that, unlike most traditional adversarial attacks which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. 0.702Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property.Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. 0.679Thus, when applying defenses such as a simple low-pass filtering, which act independently on each component, the adversarial image within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. 0.749This makes AGD inherently robust to variety defenses.Extensive experiments demonstrate that our AGD outperforms state-of-the-art methods in attack performance as well as in model robustness to some defenses. 0.799 |
2025-07-31 |
Fine-Grained Privacy Extraction from Retrieval-Augmented Generation Systems via Knowledge Asymmetry Exploitation
Retrieval-augmented generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge bases, but this advancement introduces significant privacy risks.Existing privacy attacks on RAG systems can trigger data leakage but often fail to accurately isolate knowledge-base-derived sentences within mixed responses.They also lack robustness when applied across multiple domains.This paper addresses these challenges by presenting a novel black-box attack framework that exploits knowledge asymmetry between RAG and standard LLMs to achieve fine-grained privacy extraction across heterogeneous knowledge landscapes. 0.642We propose a chain-of-thought reasoning strategy that creates adaptive prompts to steer RAG systems away from sensitive content.Specifically, we first decompose adversarial queries to maximize information disparity and then apply a semantic relationship scoring to resolve lexical and syntactic ambiguities.We finally train a neural network on these feature scores to precisely identify sentences containing private information.Unlike prior work, our framework generalizes to unseen domains through iterative refinement without pre-defined knowledge.Experimental results show that we achieve over 91% privacy extraction rate in single-domain and 83% in multi-domain scenarios, reducing sensitive sentence exposure by over 65% in case studies.This work bridges the gap between attack and defense in RAG systems, enabling precise extraction of private information while providing a foundation for adaptive mitigation. 0.67 |
2025-07-31 |
Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis
Bengali is an underrepresented language in NLP research.However, it remains a challenge due to its unique linguistic structure and computational constraints.In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks.We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes.Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral.We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. 0.645Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient \& concise tokenization results in improved performance.These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts.This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide.The code and dataset used in this research is publicly available at https://github.com/BengaliAI/bn-llm-benchmark. |
2025-07-31 |
Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
This paper investigates defenses for LLM-based evaluation systems against prompt injection. 0.826We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. 0.743To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. 0.821An attack is detected if the system validates an answer under both standard and counterfactual conditions. 0.795Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs. 0.879 |
2025-07-31 |
Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement.Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. 0.747In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles.We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation.To evaluate these approaches, we construct two complementary datasets.The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios.We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts. 0.778 |
2025-07-30 |
SAEL: Leveraging Large Language Models with Adaptive Mixture-of-Experts for Smart Contract Vulnerability Detection
With the increasing security issues in blockchain, smart contract vulnerability detection has become a research focus. 0.761Existing vulnerability detection methods have their limitations: 1) Static analysis methods struggle with complex scenarios. 0.7922) Methods based on specialized pre-trained models perform well on specific datasets but have limited generalization capabilities.In contrast, general-purpose Large Language Models (LLMs) demonstrate impressive ability in adapting to new vulnerability patterns. 0.637However, they often underperform on specific vulnerability types compared to methods based on specialized pre-trained models. 0.743We also observe that explanations generated by general-purpose LLMs can provide fine-grained code understanding information, contributing to improved detection performance. Inspired by these observations, we propose SAEL, an LLM-based framework for smart contract vulnerability detection. 0.771We first design targeted prompts to guide LLMs in identifying vulnerabilities and generating explanations, which serve as prediction features. 0.782Next, we apply prompt-tuning on CodeT5 and T5 to process contract code and explanations, enhancing task-specific performance.To combine the strengths of each approach, we introduce an Adaptive Mixture-of-Experts architecture.This dynamically adjusts feature weights via a Gating Network, which selects relevant features using TopK filtering and Softmax normalization, and incorporates a Multi-Head Self-Attention mechanism to enhance cross-feature relationships.This design enables effective integration of LLM predictions, explanation features, and code features through gradient optimization.The loss function jointly considers both independent feature performance and overall weighted predictions.Experiments show that SAEL outperforms existing methods across various vulnerabilities. |
2025-07-30 |
Breaking Obfuscation: Cluster-Aware Graph with LLM-Aided Recovery for Malicious JavaScript Detection
With the rapid expansion of web-based applications and cloud services, malicious JavaScript code continues to pose significant threats to user privacy, system integrity, and enterprise security. 0.819But, detecting such threats remains challenging due to sophisticated code obfuscation techniques and JavaScript's inherent language characteristics, particularly its nested closure structures and syntactic flexibility. 0.844In this work, we propose DeCoda, a hybrid defense framework that combines large language model (LLM)-based deobfuscation with code graph learning: (1) We first construct a sophisticated prompt-learning pipeline with multi-stage refinement, where the LLM progressively reconstructs the original code structure from obfuscated inputs and then generates normalized Abstract Syntax Tree (AST) representations; (2) In JavaScript ASTs, dynamic typing scatters semantically similar nodes while deeply nested functions fracture scope capturing, introducing structural noise and semantic ambiguity. 0.783To address these challenges, we then propose to learn hierarchical code graph representations via a Cluster-wise Graph that synergistically integrates graph transformer network, node clustering, and node-to-cluster attention to simultaneously capture both local node-level semantics and global cluster-induced structural relationships from AST graph.Experimental results demonstrate that our method achieves F1-scores of 94.64% and 97.71% on two benchmark datasets, demonstrating absolute improvements of 10.74% and 13.85% over state-of-the-art baselines.In false-positive control evaluation at fixed FPR levels (0.0001, 0.001, 0.01), our approach delivers 4.82, 5.91, and 2.53 higher TPR respectively compared to the best-performing baseline.These results highlight the effectiveness of LLM-based deobfuscation and underscore the importance of modeling cluster-level relationships in detecting malicious code. 0.799 |
2025-07-30 |
Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. 0.742Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. 0.884We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases.By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. 0.82Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. 0.828CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. 0.729These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. 0.69This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems. |
2025-07-30 |
A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models
The increasing adoption of Large Language Models (LLMs) in software engineering has sparked interest in their use for software vulnerability detection. 0.772However, the rapid development of this field has resulted in a fragmented research landscape, with diverse studies that are difficult to compare due to differences in, e.g., system designs and dataset usage.This fragmentation makes it difficult to obtain a clear overview of the state-of-the-art or compare and categorize studies meaningfully.In this work, we present a comprehensive systematic literature review (SLR) of LLM-based software vulnerability detection. 0.785We analyze 227 studies published between January 2020 and June 2025, categorizing them by task formulation, input representation, system architecture, and adaptation techniques.Further, we analyze the datasets used, including their characteristics, vulnerability coverage, and diversity.We present a fine-grained taxonomy of vulnerability detection approaches, identify key limitations, and outline actionable future research opportunities. 0.77By providing a structured overview of the field, this review improves transparency and serves as a practical guide for researchers and practitioners aiming to conduct more comparable and reproducible research.We publicly release all artifacts and maintain a living repository of LLM-based software vulnerability detection studies. 0.731 |
2025-07-29 |
Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is
Despite significant advancements in alignment and content moderation, large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks. 0.874Unlike traditional adversarial examples requiring expert knowledge, many of today's jailbreaks are low-effort, high-impact crafted by everyday users with nothing more than cleverly worded prompts. 0.83This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms through techniques such as multi-turn narrative escalation, lexical camouflage, implication chaining, fictional impersonation, and subtle semantic edits. 0.643We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models, grounded in empirical case studies across popular APIs.Our analysis reveals that every stage of the moderation pipeline, from input filtering to output validation, can be bypassed with accessible strategies.We conclude by highlighting the urgent need for context-aware defenses that reflect the ease with which these jailbreaks can be reproduced in real-world settings. 0.703 |
2025-07-29 |
Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs
Rote learning is a memorization technique based on repetition.It is commonly believed to hinder generalization by encouraging verbatim memorization rather than deeper understanding.This insight holds for even learning factual knowledge that inevitably requires a certain degree of memorization.In this work, we demonstrate that LLMs can be trained to generalize from rote memorized data.We introduce a two-phase memorize-then-generalize framework, where the model first rote memorizes factual subject-object associations using a semantically meaningless token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts.Extensive experiments over 8 LLMs show that the models can reinterpret rote memorized data through the semantically meaningful prompts, as evidenced by the emergence of structured, semantically aligned latent representations between the two.This surprising finding opens the door to both effective and efficient knowledge injection and possible risks of repurposing the memorized data for malicious usage. 0.702 |
HCI in Large Language Models |
|
2025-07-31 |
Text-to-SQL Task-oriented Dialogue Ontology Construction
Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness.In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability.However, building such ontologies requires manual labels or supervised training.We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method.Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt.We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task.Ablation studies demonstrate the key role of dialogue theory. 0.687TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset.We view this as a step towards broader application of ontologies to increase LLM explainability. |
2025-07-31 |
"I made this (sort of)": Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation
I reflect on my experience creating two music albums centered on state-of-the-art prompt-based AI music generation platforms.The first album explicitly poses the question: What happens when I collide my junk mail with these platforms?The second album is a direct response to the first, and toys with the inability of state-of-the-art prompt-based AI music generation platforms to generate music that is not ``practiced'', ``polished'', and ``produced''.I seed a large language model (LLM) with information about these albums and have it interview me, which results in the exploration of several deeper questions: To what extent am I the author?Where am I in the resulting music?How is my musical identity changing as I am faced with machines that are in some ways far more talented than I? 0.612What new musical spaces does my work open, for me or anyone/thing else?I conclude by reflecting on my reflections, as well as LLM-mediated self-reflection as method. |
2025-07-30 |
An Explainable Emotion Alignment Framework for LLM-Empowered Agent in Metaverse Service Ecosystem
Metaverse service is a product of the convergence between Metaverse and service systems, designed to address service-related challenges concerning digital avatars, digital twins, and digital natives within Metaverse.With the rise of large language models (LLMs), agents now play a pivotal role in Metaverse service ecosystem, serving dual functions: as digital avatars representing users in the virtual realm and as service assistants (or NPCs) providing personalized support. 0.622However, during the modeling of Metaverse service ecosystems, existing LLM-based agents face significant challenges in bridging virtual-world services with real-world services, particularly regarding issues such as character data fusion, character knowledge association, and ethical safety concerns.This paper proposes an explainable emotion alignment framework for LLM-based agents in Metaverse Service Ecosystem. 0.664It aims to integrate factual factors into the decision-making loop of LLM-based agents, systematically demonstrating how to achieve more relational fact alignment for these agents.Finally, a simulation experiment in the Offline-to-Offline food delivery scenario is conducted to evaluate the effectiveness of this framework, obtaining more realistic social emergence. 0.821 |
2025-07-30 |
Mitigating Response Delays in Free-Form Conversations with LLM-powered Intelligent Virtual Agents
We investigated the challenges of mitigating response delays in free-form conversations with virtual agents powered by Large Language Models (LLMs) within Virtual Reality (VR).For this, we used conversational fillers, such as gestures and verbal cues, to bridge delays between user input and system responses and evaluate their effectiveness across various latency levels and interaction scenarios.We found that latency above 4 seconds degrades quality of experience, while natural conversational fillers improve perceived response time, especially in high-delay conditions.Our findings provide insights for practitioners and researchers to optimize user engagement whenever conversational systems' responses are delayed by network limitations or slow hardware. 0.605We also contribute an open-source pipeline that streamlines deploying conversational agents in virtual environments. |
2025-07-30 |
Towards Simulating Social Influence Dynamics with LLM-based Multi-agents
Recent advancements in Large Language Models offer promising capabilities to simulate complex human social interactions. 0.722We investigate whether LLM-based multi-agent simulations can reproduce core human social dynamics observed in online forums. 0.908We evaluate conformity dynamics, group polarization, and fragmentation across different model scales and reasoning capabilities using a structured simulation framework.Our findings indicate that smaller models exhibit higher conformity rates, whereas models optimized for reasoning are more resistant to social influence. |
2025-07-30 |
Multilingual Political Views of Large Language Models: Identification and Steering
Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. 0.74While prior research has shown that LLMs often exhibit measurable political biases--frequently skewing toward liberal or progressive positions--key gaps remain.Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings.Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs.We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement.Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families.To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages.Our code is publicly available at https://github.com/d-gurgurov/Political-Ideologies-LLMs. |
2025-07-30 |
The Incomplete Bridge: How AI Research (Mis)Engages with Psychology
Social sciences have accumulated a rich body of theories and methodologies for investigating the human mind and behaviors, while offering valuable insights into the design and understanding of Artificial Intelligence (AI) systems. 0.631Focusing on psychology as a prominent case, this study explores the interdisciplinary synergy between AI and the field by analyzing 1,006 LLM-related papers published in premier AI venues between 2023 and 2025, along with the 2,544 psychology publications they cite. 0.768Through our analysis, we identify key patterns of interdisciplinary integration, locate the psychology domains most frequently referenced, and highlight areas that remain underexplored.We further examine how psychology theories/frameworks are operationalized and interpreted, identify common types of misapplication, and offer guidance for more effective incorporation. 0.626Our work provides a comprehensive map of interdisciplinary engagement between AI and psychology, thereby facilitating deeper collaboration and advancing AI systems. |
2025-07-30 |
Vibe Modeling: Challenges and Opportunities
There is a pressing need for better development methods and tools to keep up with the growing demand and increasing complexity of new software systems.New types of user interfaces, the need for intelligent components, sustainability concerns, ... bring new challenges that we need to handle. 0.61In the last years, model-driven engineering (MDE) has been key to improving the quality and productivity of software development, but models themselves are becoming increasingly complex to specify and manage.At the same time, we are witnessing the growing popularity of vibe coding approaches that rely on Large Language Models (LLMs) to transform natural language descriptions into running code at the expenses of code vulnerabilities, scalability issues and maintainability concerns.In this paper, we introduce the concept of \textit{vibe modeling} as a novel approach to integrate the best of both worlds (AI and MDE) to speed up the development of reliable complex systems.We outline the key concepts of vibe modeling and highlight the opportunities and open challenges it presents for the future of modeling. |
2025-07-30 |
Lightweight Language Models are Prone to Reasoning Errors for Complex Computational Phenotyping Tasks
Objective: Although computational phenotyping is a central informatics activity with resulting cohorts supporting a wide variety of applications, it is time-intensive because of manual data review.We previously assessed the ability of LLMs to perform computational phenotyping tasks using computable phenotypes for ARF respiratory support therapies.They successfully performed concept classification and classification of single-therapy phenotypes, but underperformed on multiple-therapy phenotypes.To understand issues with these complex tasks, we expanded PHEONA, a generalizable framework for evaluation of LLMs, to include methods specifically for evaluating faulty reasoning.Materials and Methods: We assessed the responses of three lightweight LLMs (DeepSeek-r1 32 billion, Mistral Small 24 billion, and Phi-4 14 billion) both with and without prompt modifications to identify explanation correctness and unfaithfulness errors for phenotyping.Results:For experiments without prompt modifications, both errors were present across all models although more responses had explanation correctness errors than unfaithfulness errors.For experiments assessing accuracy impact after prompt modifications, DeepSeek, a reasoning model, had the smallest overall accuracy impact when compared to Mistral and Phi.Discussion: 0.722Since reasoning errors were ubiquitous across models, our enhancement of PHEONA to include a component for assessing faulty reasoning provides critical support for LLM evaluation and evidence for reasoning errors for complex tasks.While insights from reasoning errors can help prompt refinement, a deeper understanding of why LLM reasoning errors occur will likely require further development and refinement of interpretability methods.Conclusion: Reasoning errors were pervasive across LLM responses for computational phenotyping, a complex reasoning task |
2025-07-30 |
User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal
Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. 0.645Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs.We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). 0.766First, we analyze user feedback in the user-LLM conversation trajectory, providing insights into when and why such feedback occurs.Second, we study harvesting learning signals from such implicit user feedback.We find that the contents of user feedback (e.g., user wanted clarification), not just the polarity (e.g., users were unhappy with the previous model response), can improve model performance in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). 0.6We also find that the usefulness of user feedback is largely tied to the quality of the user's initial prompt.Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations. |
2025-07-29 |
Valuing Time in Silicon: Can Large Language Model Replicate Human Value of Travel Time
As a key advancement in artificial intelligence, large language models (LLMs) are set to transform transportation systems.While LLMs offer the potential to simulate human travelers in future mixed-autonomy transportation systems, their behavioral fidelity in complex scenarios remains largely unconfirmed by existing research. 0.651This study addresses this gap by conducting a comprehensive analysis of the value of travel time (VOT) of a popular LLM, GPT-4o.We employ a full factorial experimental design to systematically examine the LLM's sensitivity to various transportation contexts, including the choice setting, travel purpose, income, and socio-demographic factors. 0.609Our results reveal a high degree of behavioral similarity between the LLM and humans. 0.795The LLM exhibits an aggregate VOT similar to that of humans, and demonstrates human-like sensitivity to travel purpose, income, and the time-cost trade-off ratios of the alternatives.Furthermore, the behavioral patterns of LLM are remarkably consistent across varied contexts. 0.743However, we also find that the LLM's context sensitivity is less pronounced than that observed in humans.Overall, this study provides a foundational benchmark for the future development of LLMs as proxies for human travelers, demonstrating their value and robustness while highlighting that their blunted contextual sensitivity requires careful consideration. |
2025-07-29 |
Graph-Augmented Large Language Model Agents: Current Progress and Future Prospects
Autonomous agents based on large language models (LLMs) have demonstrated impressive capabilities in a wide range of applications, including web navigation, software development, and embodied control.While most LLMs are limited in several key agentic procedures, such as reliable planning, long-term memory, tool management, and multi-agent coordination, graphs can serve as a powerful auxiliary structure to enhance structure, continuity, and coordination in complex agent workflows.Given the rapid growth and fragmentation of research on Graph-augmented LLM Agents (GLA), this paper offers a timely and comprehensive overview of recent advances and also highlights key directions for future work.Specifically, we categorize existing GLA methods by their primary functions in LLM agent systems, including planning, memory, and tool usage, and then analyze how graphs and graph learning algorithms contribute to each.For multi-agent systems, we further discuss how GLA solutions facilitate the orchestration, efficiency optimization, and trustworthiness of MAS.Finally, we highlight key future directions to advance this field, from improving structural adaptability to enabling unified, scalable, and multimodal GLA systems.We hope this paper can serve as a roadmap for future research on GLA and foster a deeper understanding of the role of graphs in LLM agent systems. 0.629 |
2025-07-29 |
InSituTale: Enhancing Augmented Data Storytelling with Physical Objects
Augmented data storytelling enhances narrative delivery by integrating visualizations with physical environments and presenter actions. 0.719Existing systems predominantly rely on body gestures or speech to control visualizations, leaving interactions with physical objects largely underexplored.We introduce augmented physical data storytelling, an approach enabling presenters to manipulate visualizations through physical object interactions.To inform this approach, we first conducted a survey of data-driven presentations to identify common visualization commands.We then conducted workshops with nine HCI/VIS researchers to collect mappings between physical manipulations and these commands.Guided by these insights, we developed InSituTale, a prototype that combines object tracking via a depth camera with Vision-LLM for detecting real-world events.Through physical manipulations, presenters can dynamically execute various visualization commands, delivering cohesive data storytelling experiences that blend physical and digital elements.A user study with 12 participants demonstrated that InSituTale enables intuitive interactions, offers high utility, and facilitates an engaging presentation experience. 0.698 |
2025-07-29 |
Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour
This study investigates the adoption of open-access, locally deployable causal large language models (LLMs) for travel mode choice prediction and introduces LiTransMC, the first fine-tuned causal LLM developed for this task.We systematically benchmark eleven LLMs (1-12B parameters) across three stated and revealed preference datasets, testing 396 configurations and generating over 79,000 synthetic commuter predictions.Beyond predictive accuracy, we evaluate models generated reasoning using BERTopic for topic modelling and a novel Explanation Strength Index, providing the first structured analysis of how LLMs articulate decision factors in alignment with behavioural theory.LiTransMC, fine-tuned using parameter efficient and loss masking strategy, achieved a weighted F1 score of 0.6845 and a Jensen-Shannon Divergence of 0.000245, surpassing both untuned local models and larger proprietary systems, including GPT-4o with advanced persona inference and embedding-based loading, while also outperforming classical mode choice methods such as discrete choice models and machine learning classifiers for the same dataset.This dual improvement, i.e., high instant-level accuracy and near-perfect distributional calibration, demonstrates the feasibility of creating specialist, locally deployable LLMs that integrate prediction and interpretability.Through combining structured behavioural prediction with natural language reasoning, this work unlocks the potential for conversational, multi-task transport models capable of supporting agent-based simulations, policy testing, and behavioural insight generation. 0.646These findings establish a pathway for transforming general purpose LLMs into specialized, explainable tools for transportation research and policy formulation, while maintaining privacy, reducing cost, and broadening access through local deployment. |
2025-07-29 |
Large Language Models for Supply Chain Decisions
Supply Chain Management requires addressing a variety of complex decision-making challenges, from sourcing strategies to planning and execution.Over the last few decades, advances in computation and information technologies have enabled the transition from manual, intuition and experience-based decision-making, into more automated and data-driven decisions using a variety of tools that apply optimization techniques.These techniques use mathematical methods to improve decision-making. Unfortunately, business planners and executives still need to spend considerable time and effort to (i) understand and explain the recommendations coming out of these technologies; (ii) analyze various scenarios and answer what-if questions; and (iii) update the mathematical models used in these tools to reflect current business environments.Addressing these challenges requires involving data science teams and/or the technology providers to explain results or make the necessary changes in the technology and hence significantly slows down decision making. Motivated by the recent advances in Large Language Models (LLMs), we report how this disruptive technology can democratize supply chain technology - namely, facilitate the understanding of tools' outcomes, as well as the interaction with supply chain tools without human-in-the-loop. 0.735Specifically, we report how we apply LLMs to address the three challenges described above, thus substantially reducing the time to decision from days and weeks to minutes and hours as well as dramatically increasing planners' and executives' productivity and impact. |
2025-07-29 |
Can large language models assist choice modelling? Insights into prompting strategies and current models capabilities
Large Language Models (LLMs) are widely used to support various workflows across different disciplines, yet their potential in choice modelling remains relatively unexplored. 0.626This work examines the potential of LLMs as assistive agents in the specification and, where technically feasible, estimation of Multinomial Logit models.We implement a systematic experimental framework involving thirteen versions of six leading LLMs (ChatGPT, Claude, DeepSeek, Gemini, Gemma, and Llama) evaluated under five experimental configurations.These configurations vary along three dimensions: modelling goal (suggesting vs. suggesting and estimating MNLs); prompting strategy (Zero-Shot vs. Chain-of-Thoughts); and information availability (full dataset vs. data dictionary only).Each LLM-suggested specification is implemented, estimated, and evaluated based on goodness-of-fit metrics, behavioural plausibility, and model complexity.Findings reveal that proprietary LLMs can generate valid and behaviourally sound utility specifications, particularly when guided by structured prompts.Open-weight models such as Llama and Gemma struggled to produce meaningful specifications.Claude 4 Sonnet consistently produced the best-fitting and most complex models, while GPT models suggested models with robust and stable modelling outcomes.Some LLMs performed better when provided with just data dictionary, suggesting that limiting raw data access may enhance internal reasoning capabilities.Among all LLMs, GPT o3 was uniquely capable of correctly estimating its own specifications by executing self-generated code.Overall, the results demonstrate both the promise and current limitations of LLMs as assistive agents in choice modelling, not only for model specification but also for supporting modelling decision and estimation, and provide practical guidance for integrating these tools into choice modellers' workflows. |
2025-07-29 |
Representations in vision and language converge in a shared, multidimensional space of perceived similarities
Humans can effortlessly describe what they see, yet establishing a shared representational format between vision and language remains a significant challenge.Emerging evidence suggests that human brain representations in both vision and language are well predicted by semantic feature spaces obtained from large language models (LLMs).This raises the possibility that sensory systems converge in their inherent ability to transform their inputs onto shared, embedding-like representational space.However, it remains unclear how such a space manifests in human behaviour.To investigate this, sixty-three participants performed behavioural similarity judgements separately on 100 natural scene images and 100 corresponding sentence captions from the Natural Scenes Dataset.We found that visual and linguistic similarity judgements not only converge at the behavioural level but also predict a remarkably similar network of fMRI brain responses evoked by viewing the natural scene images.Furthermore, computational models trained to map images onto LLM-embeddings outperformed both category-trained and AlexNet controls in explaining the behavioural similarity structure.These findings demonstrate that human visual and linguistic similarity judgements are grounded in a shared, modality-agnostic representational structure that mirrors how the visual system encodes experience. 0.752The convergence between sensory and artificial systems suggests a common capacity of how conceptual representations are formed-not as arbitrary products of first order, modality-specific input, but as structured representations that reflect the stable, relational properties of the external world. |
2025-07-29 |
Leveraging LLMs for Persona-Based Visualization of Election Data
Visualizations are essential tools for disseminating information regarding elections and their outcomes, potentially influencing public perceptions. 0.831Personas, delineating distinctive segments within the populace, furnish a valuable framework for comprehending the nuanced perspectives, requisites, and behaviors of diverse voter demographics. 0.724In this work, we propose making visualizations tailored to these personas to make election information easier to understand and more relevant. 0.805Using data from UK parliamentary elections and new developments in Large Language Models (LLMs), we create personas that encompass the diverse demographics, technological preferences, voting tendencies, and information consumption patterns observed among voters. 0.79Subsequently, we elucidate how these personas can inform the design of visualizations through specific design criteria. 0.71We then provide illustrative examples of visualization prototypes based on these criteria and evaluate these prototypes using these personas and LLMs.We finally propose some actionable insights based upon the framework and the different design artifacts. |
2025-07-29 |
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. 0.609However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models.To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage.Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3)Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format.By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains.For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset's effectiveness in enhancing multimodal reasoning and tool-based capabilities.The dataset is publicly available at https://github.com/VIS-MPU-Agent/MMAT-1M. |
2025-07-29 |
Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation
In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish's essence, but also to provide diverse options for various dietary needs and preferences.Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance.However, it remains unclear whether RAG can generate diverse adaptation results.Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs.This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. 0.631To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization.To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences.Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs. |
2025-07-29 |
Towards Cognitive Synergy in LLM-Based Multi-Agent Systems: Integrating Theory of Mind and Critical Evaluation
Recently, the field of Multi-Agent Systems (MAS) has gained popularity as researchers are trying to develop artificial intelligence capable of efficient collective reasoning.Agents based on Large Language Models (LLMs) perform well in isolated tasks, yet struggle with higher-order cognition required for adaptive collaboration.Human teams achieve synergy not only through knowledge sharing, but also through recursive reasoning, structured critique, and the ability to infer others' mental states. 0.616Current artificial systems lack these essential mechanisms, limiting their ability to engage in sophisticated collective reasoning.This work explores cognitive processes that enable effective collaboration, focusing on adaptive theory of mind (ToM) and systematic critical evaluation.We investigate three key questions.First, how does the ability to model others' perspectives enhance coordination and reduce redundant reasoning?Second, to what extent does structured critique improve reasoning quality by identifying logical gaps and mitigating biases?Third, the interplay of these mechanisms can lead to emergent cognitive synergy, where the collective intelligence of the system exceeds the sum of its parts.Through an empirical case study on complex decision making, we show that the integration of these cognitive mechanisms leads to more coherent, adaptive, and rigorous agent interactions.This article contributes to the field of cognitive science and AI research by presenting a structured framework that emulates human-like collaborative reasoning MAS.It highlights the significance of dynamic ToM and critical evaluation in advancing multi-agent systems' ability to tackle complex, real-world challenges. |
2025-07-29 |
Validating Generative Agent-Based Models of Social Norm Enforcement: From Replication to Novel Predictions
As large language models (LLMs) advance, there is growing interest in using them to simulate human social behavior through generative agent-based modeling (GABM). 0.863However, validating these models remains a key challenge.We present a systematic two-stage validation approach using social dilemma paradigms from psychological literature, first identifying the cognitive components necessary for LLM agents to reproduce known human behaviors in mixed-motive settings from two landmark papers, then using the validated architecture to simulate novel conditions. 0.735Our model comparison of different cognitive architectures shows that both persona-based individual differences and theory of mind capabilities are essential for replicating third-party punishment (TPP) as a costly signal of trustworthiness.For the second study on public goods games, this architecture is able to replicate an increase in cooperation from the spread of reputational information through gossip.However, an additional strategic component is necessary to replicate the additional boost in cooperation rates in the condition that allows both ostracism and gossip.We then test novel predictions for each paper with our validated generative agents.We find that TPP rates significantly drop in settings where punishment is anonymous, yet a substantial amount of TPP persists, suggesting that both reputational and intrinsic moral motivations play a role in this behavior.For the second paper, we introduce a novel intervention and see that open discussion periods before rounds of the public goods game further increase contributions, allowing groups to develop social norms for cooperation.This work provides a framework for validating generative agent models while demonstrating their potential to generate novel and testable insights into human social behavior. 0.677 |
Large Language Models in Social Sciences |
|
2025-07-31 |
P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication
There has been an increase in recent advancements in the explainability and development of personalized chatbots for mental health. 0.67However, the reasoning aspects for explainability and dialogue discourse have not been explored previously for mental health. 0.682Hence, we are investigating the pragmatic reasoning capability of large language models (LLMs) in this domain.We introduce P-ReMe dataset, and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. 0.774Following the definition, we formulate two tasks in implicature and one task in presupposition.To benchmark the dataset and the presented tasks, we consider four models - Llama3.1, Mistral, MentaLLaMa, and Qwen.The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain.In addition, we also propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. 0.721Our evaluated findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs. 0.832 |
2025-07-31 |
What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content
Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation.While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions.This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. 0.612Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. 0.831Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods. |
2025-07-31 |
Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators
The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field.While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models.This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions.This study evaluates three open-source models installed on CPU-based platforms and compared against commercially available online chat-bots.The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. 0.632The platforms assessed were chosen for their accessibility and ease of use across various operating systems.While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling.The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses. |
2025-07-31 |
SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model
AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also suffers from the fundamental limitations of autoregressive LLMs.On the other hand, humans are general agents who reason by mentally simulating the outcomes of their actions and plans. 0.68Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning.Based on a principled formulation of optimal agent in any environment, \modelname overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation.The generalized world model is implemented using LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language.Experiments on difficult web browsing tasks show that \modelname improves the success of flight search from 0\% to 32.2\%.World-model-based planning, in particular, shows consistent advantage of up to 124\% over autoregressive planning, demonstrating the advantage of world model simulation as a reasoning paradigm.We are excited about the possibility for training a single, general agent model based on LLMs that can act superintelligently in all environments.To start, we make SimuRA, a web-browsing agent built on \modelname with pretrained LLMs, available as a research demo for public testing. |
2025-07-30 |
Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors
Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. 0.765Unlike fleeting emotions, personality traits are stable, often subconsciously leaked through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. 0.736It was hard to model personality semantics with traditional superficial features and seemed impossible to achieve effective cross-modal understanding. 0.761To address these challenges, we propose a novel personality assessment framework called \textit{\textbf{Traits Run Deep}}. 0.811It employs \textit{\textbf{psychology-informed prompts}} to elicit high-level personality-relevant semantic representations. 0.827Besides, it devises a \textit{\textbf{Text-Centric Trait Fusion Network}} that anchors rich text semantics to align and integrate asynchronous signals from other modalities.To be specific, such fusion module includes a Chunk-Wise Projector to decrease dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion and an ensemble regression head to improve generalization in data-scarce situations.To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. 0.631Furthermore, extracting and fusing audio-visual apparent behavior features further improves the accuracy.Experimental results on the AVI validation set have demonstrated the effectiveness of the proposed components, i.e., approximately a 45\% reduction in mean squared error (MSE).Final evaluations on the test set of the AVI Challenge 2025 confirm our method's superiority, ranking first in the Personality Assessment track. 0.683The source code will be made available at https://github.com/MSA-LMC/TraitsRunDeep. |
2025-07-30 |
CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records
Large Language Models (LLMs) hold significant promise for improving clinical decision support and reducing physician burnout by synthesizing complex, longitudinal cancer Electronic Health Records (EHRs). 0.644However, their implementation in this critical field faces three primary challenges: the inability to effectively process the extensive length and multilingual nature of patient records for accurate temporal analysis; a heightened risk of clinical hallucination, as conventional grounding techniques such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder the validation of AI systems in oncology.To address these issues, we propose CliCARE, a framework for Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records.The framework operates by transforming unstructured, longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, and then grounding the decision support process by aligning these real-world patient trajectories with a normative guideline knowledge graph.This approach provides oncologists with evidence-grounded decision support by generating a high-fidelity clinical summary and an actionable recommendation.We validated our framework using large-scale, longitudinal data from a private Chinese cancer dataset and the public English MIMIC-IV dataset. 0.753In these diverse settings, CliCARE significantly outperforms strong baselines, including leading long-context LLMs and Knowledge Graph-enhanced RAG methods.The clinical validity of our results is supported by a robust evaluation protocol, which demonstrates a high correlation with assessments made by expert oncologists. |
2025-07-30 |
ControlMed: Adding Reasoning Control to Medical Language Model
Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support.Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency.These limitations hinder their practical deployment in real-world clinical environments.To address these challenges, we introduce \textbf{ControlMed}, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers.ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both \textit{direct} and \textit{reasoning responses}; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality.Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. 0.736Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed.These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis. |
2025-07-30 |
Unveiling the Influence of Amplifying Language-Specific Neurons
Language-specific neurons in LLMs that strongly correlate with individual languages have been shown to influence model behavior by deactivating them.However, their role in amplification remains underexplored.This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages.We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate it on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES).The optimal amplification factors effectively steer output toward nearly all tested languages.Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. 0.689These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer. 0.62 |
2025-07-30 |
Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear.We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior.Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization.Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches.These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. 0.721Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness.We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal "fallback" mechanisms for language selection when neurons are progressively deactivated.Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation. |
2025-07-30 |
How Exposed Are UK Jobs to Generative AI? Developing and Applying a Novel Task-Based Index
We introduce the Generative AI Susceptibility Index (GAISI), a task-based measure of UK job exposure to large language models (LLMs), such as ChatGPT. 0.79GAISI is derived from probabilistic task ratings by LLMs and linked to worker-reported task data from the Skills and Employment Surveys.It reflects the share of job activities where an LLM or LLM-powered system can reduce task completion time by at least 25 per cent beyond existing productivity tools.The index demonstrates high reliability, strong alignment with AI capabilities, and superior predictive power compared to existing exposure measures.By 2023-24, nearly all UK jobs exhibited some exposure, yet only a minority were heavily affected. 0.687Aggregate exposure has risen since 2017, primarily due to occupational shifts rather than changes in task profiles.The price premium for AI-exposed tasks declined relative to 2017, measuring approximately 11 per cent lower in 2023-24.Job postings in high-exposure roles also fell by 6.5 per cent following the release of ChatGPT.GAISI offers a robust framework for assessing generative AI's impact on work, providing early evidence that displacement effects may already outweigh productivity gains. |
2025-07-30 |
CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
We introduce a benchmark for open-ended regional question answering that encompasses both textual and visual modalities.We also provide strong baselines using state-of-the-art large language models (LLMs).Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations.It includes both purely textual questions and those requiring visual understanding.As a baseline, we evaluate state-of-the-art LLMs through prompting and complement this with human judgments of answer correctness.Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. 0.696Our baseline results highlight a significant gap in regional knowledge among current LLMs.Moreover, apart from LLM-based evaluation, there is minimal correlation between automated metrics and human judgment. 0.633We release this dataset as a resource to (1) assess regional knowledge in LLMs, (2) study cross-lingual generation consistency in a challenging setting, and (3) advance the development of evaluation metrics for open-ended question answering. |
2025-07-30 |
Opportunities and Challenges of LLMs in Education: An NLP Perspective
Interest in the role of large language models (LLMs) in education is increasing, considering the new opportunities they offer for teaching, learning, and assessment.In this paper, we examine the impact of LLMs on educational NLP in the context of two main application scenarios: {\em assistance} and {\em assessment}, grounding them along the four dimensions -- reading, writing, speaking, and tutoring. 0.642We then present the new directions enabled by LLMs, and the key challenges to address.We envision that this holistic overview would be useful for NLP researchers and practitioners interested in exploring the role of LLMs in developing language-focused and NLP-enabled educational applications of the future. 0.704 |
2025-07-30 |
C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations
Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries.Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations.This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking.Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. 0.728Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterograph, heteronyms, and stress patterns.Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics.To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese.Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges. |
2025-07-30 |
Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. 0.666While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. 0.741Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. 0.85Applying these tests to non-human subjects without empirical validation, risks mischaracterizing what is being measured. 0.715Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as ``intelligence'', despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. 0.624We argue that interpreting benchmark performance as measurements of human-like traits, lacks sufficient theoretical and empirical justification. 0.649This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. 0.688We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems.Such frameworks might build on existing frameworks for constructing and validating psychometrics tests, or could be created entirely from scratch to fit the unique context of AI. |
2025-07-30 |
Argumentatively Coherent Judgmental Forecasting
Judgmental forecasting employs human opinions to make predictions about future events, rather than exclusively historical data as in quantitative forecasting. 0.788When these opinions form an argumentative structure around forecasts, it is useful to study the properties of the forecasts from an argumentative perspective.In this paper, we advocate and formally define a property of argumentative coherence, which, in essence, requires that a forecaster's reasoning is coherent with their forecast.We then conduct three evaluations with our notion of coherence.First, we assess the impact of enforcing coherence on human forecasters as well as on Large Language Model (LLM)-based forecasters, given that they have recently shown to be competitive with human forecasters. 0.691In both cases, we show that filtering out incoherent predictions improves forecasting accuracy consistently, supporting the practical value of coherence in both human and LLM-based forecasting.Then, via crowd-sourced user experiments, we show that, despite its apparent intuitiveness and usefulness, users do not generally align with this coherence property.This points to the need to integrate, within argumentation-based judgmental forecasting, mechanisms to filter out incoherent opinions before obtaining group forecasting predictions. |
LLMs in Education Research |
|
2025-07-31 |
DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-agent System
Current multi-agent systems (MAS) frameworks often rely on manually designed and static collaboration graph structures, limiting adaptability and performance.To address these limitations, we propose DynaSwarm, a dynamic framework that enhances LLM-based MAS through two key innovations: (1) an actor-critic reinforcement learning (A2C) mechanism to optimize graph structures with improved stability over prior RL methods, and (2) a dynamic graph selector that adaptively chooses the optimal graph structure for each input sample via parameter-efficient LLM fine-tuning.DynaSwarm eliminates the need for rigid, one-fits-all graph architectures, instead leveraging sample-specific idiosyncrasies to dynamically route queries through specialized agent networks.(c) We propose to fine-tune the demonstration retriever to fully exploit the power of in-context learning (ICL).Extensive experiments on question answering, mathematical reasoning, and coding tasks demonstrate that DynaSwarm consistently outperforms state-of-the-art single-agent and MAS baselines across multiple LLM backbones. 0.527Our findings highlight the importance of sample-aware structural flexibility in LLM MAS designs. |
2025-07-31 |
Towards LLM-Enhanced Product Line Scoping
The idea of product line scoping is to identify the set of features and configurations that a product line should include, i.e., offer for configuration purposes.In this context, a major scoping task is to find a balance between commercial relevance and technical feasibility.Traditional product line scoping approaches rely on formal feature models and require a manual analysis which can be quite time-consuming.In this paper, we sketch how Large Language Models (LLMs) can be applied to support product line scoping tasks with a natural language interaction based scoping process.Using a working example from the smarthome domain, we sketch how LLMs can be applied to evaluate different feature model alternatives.We discuss open research challenges regarding the integration of LLMs with product line scoping. 0.503 |
2025-07-31 |
Automated Feedback on Student-Generated UML and ER Diagrams Using Large Language Models
UML and ER diagrams are foundational in computer science education but come with challenges for learners due to the need for abstract thinking, contextual understanding, and mastery of both syntax and semantics. 0.619These complexities are difficult to address through traditional teaching methods, which often struggle to provide scalable, personalized feedback, especially in large classes.We introduce DUET (Diagrammatic UML & ER Tutor), a prototype of an LLM-based tool, which converts a reference diagram and a student-submitted diagram into a textual representation and provides structured feedback based on the differences. 0.697It uses a multi-stage LLM pipeline to compare diagrams and generate reflective feedback.Furthermore, the tool enables analytical insights for educators, aiming to foster self-directed learning and inform instructional strategies. 0.681We evaluated DUET through semi-structured interviews with six participants, including two educators and four teaching assistants. 0.686They identified strengths such as accessibility, scalability, and learning support alongside limitations, including reliability and potential misuse.Participants also suggested potential improvements, such as bulk upload functionality and interactive clarification features.DUET presents a promising direction for integrating LLMs into modeling education and offers a foundation for future classroom integration and empirical evaluation. 0.771 |
2025-07-30 |
Math Natural Language Inference: this should be easy!
We ask whether contemporary LLMs are able to perform natural language inference (NLI) tasks on mathematical texts.We call this the Math NLI problem.We construct a corpus of Math NLI pairs whose premises are from extant mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and also in the NLI field.We also investigate the quality of corpora using the same premises but whose hypotheses are provided by LLMs themselves.We not only investigate the performance but also the inter-group consistency of the diverse group of LLMs.We have both positive and negative findings.Among our positive findings: in some settings, using a majority vote of LLMs is approximately equivalent to using human-labeled data in the Math NLI area.On the negative side: LLMs still struggle with mathematical language. 0.659They occasionally fail at even basic inferences.Current models are not as prone to hypothesis-only "inference" in our data the way the previous generation had been.In addition to our findings, we also provide our corpora as data to support future work on Math NLI. |
2025-07-29 |
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
The computational cost of training multimodal large language models (MLLMs) rapidly increases with the number of tokens involved.Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training.In this paper, we propose ReGATE (Reference$-$Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training.Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student, and a frozen reference large language model (LLM) acts as the teacher.The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student's own difficulty scores. 0.508This adaptive difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead.Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 35% of the tokens.With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%.Code and models will be released soon. |
2025-07-29 |
Overview of ADoBo at IberLEF 2025: Automatic Detection of Anglicisms in Spanish
This paper summarizes the main findings of ADoBo 2025, the shared task on anglicism identification in Spanish proposed in the context of IberLEF 2025.Participants of ADoBo 2025 were asked to detect English lexical borrowings (or anglicisms) from a collection of Spanish journalistic texts.Five teams submitted their solutions for the test phase. 0.525Proposed systems included LLMs, deep learning models, Transformer-based models and rule-based systems.The results range from F1 scores of 0.17 to 0.99, which showcases the variability in performance different systems can have for this task. |
2025-07-29 |
Reasoning Language Models for Root Cause Analysis in 5G Wireless Networks
Root Cause Analysis (RCA) in mobile networks remains a challenging task due to the need for interpretability, domain expertise, and causal reasoning.In this work, we propose a lightweight framework that leverages Large Language Models (LLMs) for RCA.To do so, we introduce TeleLogs, a curated dataset of annotated troubleshooting problems designed to benchmark RCA capabilities.Our evaluation reveals that existing open-source reasoning LLMs struggle with these problems, underscoring the need for domain-specific adaptation.To address this issue, we propose a two-stage training methodology that combines supervised fine-tuning with reinforcement learning to improve the accuracy and reasoning quality of LLMs.The proposed approach fine-tunes a series of RCA models to integrate domain knowledge and generate structured, multi-step diagnostic explanations, improving both interpretability and effectiveness.Extensive experiments across multiple LLM sizes show significant performance gains over state-of-the-art reasoning and non-reasoning models, including strong generalization to randomized test variants.These results demonstrate the promise of domain-adapted, reasoning-enhanced LLMs for practical and explainable RCA in network operation and management. 0.535 |
2025-07-28 |
CoGrader: Transforming Instructors' Assessment of Project Reports through Collaborative LLM Integration
Grading project reports are increasingly significant in today's educational landscape, where they serve as key assessments of students' comprehensive problem-solving abilities. 0.568However, it remains challenging due to the multifaceted evaluation criteria involved, such as creativity and peer-comparative achievement.Meanwhile, instructors often struggle to maintain fairness throughout the time-consuming grading process. 0.504Recent advances in AI, particularly large language models, have demonstrated potential for automating simpler grading tasks, such as assessing quizzes or basic writing quality.However, these tools often fall short when it comes to complex metrics, like design innovation and the practical application of knowledge, that require an instructor's educational insights into the class situation. 0.587To address this challenge, we conducted a formative study with six instructors and developed CoGrader, which introduces a novel grading workflow combining human-LLM collaborative metrics design, benchmarking, and AI-assisted feedback. 0.611CoGrader was found effective in improving grading efficiency and consistency while providing reliable peer-comparative feedback to students. 0.541We also discuss design insights and ethical considerations for the development of human-AI collaborative grading systems. 0.555 |
2025-07-28 |
VArsity: Can Large Language Models Keep Power Engineering Students in Phase?
This paper provides an educational case study regarding our experience in deploying ChatGPT Large Language Models (LLMs) in the Spring 2025 and Fall 2023 offerings of ECE 4320: Power System Analysis and Control at Georgia Tech. 0.554As part of course assessments, students were tasked with identifying, explaining, and correcting errors in the ChatGPT outputs corresponding to power factor correction problems. 0.572While most students successfully identified the errors in the outputs from the GPT-4 version of ChatGPT used in Fall 2023, students found the errors from the ChatGPT o1 version much more difficult to identify in Spring 2025. 0.564As shown in this case study, the role of LLMs in pedagogy, assessment, and learning in power engineering classrooms is an important topic deserving further investigation. 0.726 |
2025-07-28 |
CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting
Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts.Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries.An open question is how systematic this interpretation process is.Toward this question, in this paper, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional.For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization.Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks.This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they "understand" the atomic parts. 0.527We conduct experiments with models of different sizes using both various prompt and few-shot optimization techniques as well as fine-tuning.Our results show that performance in terms of macro $F_1$ degrades from $0.45$ over $0.26$ down to $0.09$ with increasing deviation from the samples optimized on.Even when all necessary information was provided to the model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of lowest complexity.We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries. |
2025-07-28 |
Pushing the Limits of LLMs in Quantum Operations
What is the fastest Artificial Intelligence Large Language Model (AI LLM) for generating quantum operations?To answer this, we present the first benchmarking study comparing popular and publicly available AI models tasked with creating quantum gate designs.The Wolfram Mathematica framework was used to interface with the 4 AI LLMs, including WolframLLM, OpenAI ChatGPT, Google Gemini, and DeepSeek. 0.611This comparison evaluates both the time taken by each AI LLM platform to generate quantum operations (including networking times), as well as the execution time of these operations in Python, within Jupyter Notebook.Our results show that overall, Gemini is the fastest AI LLM in producing quantum gate designs.At the same time, the AI LLMs tested achieved working quantum operations 80% of the time.These findings highlight a promising horizon where publicly available Large Language Models can become fast collaborators with quantum computers, enabling rapid quantum gate synthesis and paving the way for greater interoperability between two remarkable and cutting-edge technologies. |
2025-07-28 |
Games Agents Play: Towards Transactional Analysis in LLM-based Multi-Agent Systems
Multi-Agent Systems (MAS) are increasingly used to simulate social interactions, but most of the frameworks miss the underlying cognitive complexity of human behavior.In this paper, we introduce Trans-ACT (Transactional Analysis Cognitive Toolkit), an approach embedding Transactional Analysis (TA) principles into MAS to generate agents with realistic psychological dynamics.Trans-ACT integrates the Parent, Adult, and Child ego states into an agent's cognitive architecture.Each ego state retrieves context-specific memories and uses them to shape response to new situations.The final answer is chosen according to the underlying life script of the agent.Our experimental simulation, which reproduces the Stupid game scenario, demonstrates that agents grounded in cognitive and TA principles produce deeper and context-aware interactions. 0.54Looking ahead, our research opens a new way for a variety of applications, including conflict resolution, educational support, and advanced social psychology studies. |
LLMs as Recommender Systems |
|
2025-07-31 |
Not Just What, But When: Integrating Irregular Intervals to LLM for Sequential Recommendation
Time intervals between purchasing items are a crucial factor in sequential recommendation tasks, whereas existing approaches focus on item sequences and often overlook by assuming the intervals between items are static. 0.686However, dynamic intervals serve as a dimension that describes user profiling on not only the history within a user but also different users with the same item history.In this work, we propose IntervalLLM, a novel framework that integrates interval information into LLM and incorporates the novel interval-infused attention to jointly consider information of items and intervals.Furthermore, unlike prior studies that address the cold-start scenario only from the perspectives of users and items, we introduce a new viewpoint: the interval perspective to serve as an additional metric for evaluating recommendation methods on the warm and cold scenarios. 0.724Extensive experiments on 3 benchmarks with both traditional- and LLM-based baselines demonstrate that our IntervalLLM achieves not only 4.4% improvements in average but also the best-performing warm and cold scenarios across all users, items, and the proposed interval perspectives.In addition, we observe that the cold scenario from the interval perspective experiences the most significant performance drop among all recommendation methods. 0.608This finding underscores the necessity of further research on interval-based cold challenges and our integration of interval information in the realm of sequential recommendation tasks. 0.608Our code is available here: https://github.com/sony/ds-research-code/tree/master/recsys25-IntervalLLM. |
2025-07-31 |
LLM4Rail: An LLM-Augmented Railway Service Consulting Platform
Large language models (LLMs) have significantly reshaped different walks of business.To meet the increasing demands for individualized railway service, we develop LLM4Rail - a novel LLM-augmented railway service consulting platform.Empowered by LLM, LLM4Rail can provide custom modules for ticketing, railway food & drink recommendations, weather information, and chitchat.In LLM4Rail, we propose the iterative "Question-Thought-Action-Observation (QTAO)" prompting framework.It meticulously integrates verbal reasoning with task-oriented actions, that is, reasoning to guide action selection, to effectively retrieve external observations relevant to railway operation and service to generate accurate responses.To provide personalized onboard dining services, we first construct the Chinese Railway Food and Drink (CRFD-25) - a publicly accessible takeout dataset tailored for railway services.CRFD-25 covers a wide range of signature dishes categorized by cities, cuisines, age groups, and spiciness levels.We further introduce an LLM-based zero-shot conversational recommender for railway catering. 0.763To address the unconstrained nature of open recommendations, the feature similarity-based post-processing step is introduced to ensure all the recommended items are aligned with CRFD-25 dataset. 0.687 |
2025-07-30 |
RecGPT Technical Report
Recommender systems are among the most impactful applications of artificial intelligence, serving as critical infrastructure connecting users, merchants, and platforms. 0.725However, most current industrial systems remain heavily reliant on historical co-occurrence patterns and log-fitting objectives, i.e., optimizing for past user interactions without explicitly modeling user intent.This log-fitting approach often leads to overfitting to narrow historical preferences, failing to capture users' evolving and latent interests.As a result, it reinforces filter bubbles and long-tail phenomena, ultimately harming user experience and threatening the sustainability of the whole recommendation ecosystem. To address these challenges, we rethink the overall design paradigm of recommender systems and propose RecGPT, a next-generation framework that places user intent at the center of the recommendation pipeline. 0.825By integrating large language models (LLMs) into key stages of user interest mining, item retrieval, and explanation generation, RecGPT transforms log-fitting recommendation into an intent-centric process. 0.753To effectively align general-purpose LLMs to the above domain-specific recommendation tasks at scale, RecGPT incorporates a multi-stage training paradigm, which integrates reasoning-enhanced pre-alignment and self-training evolution, guided by a Human-LLM cooperative judge system. 0.756Currently, RecGPT has been fully deployed on the Taobao App.Online experiments demonstrate that RecGPT achieves consistent performance gains across stakeholders: users benefit from increased content diversity and satisfaction, merchants and the platform gain greater exposure and conversions.These comprehensive improvement results across all stakeholders validates that LLM-driven, intent-centric design can foster a more sustainable and mutually beneficial recommendation ecosystem. 0.649 |
2025-07-29 |
Multi-modal Relational Item Representation Learning for Inferring Substitutable and Complementary Items
We introduce a novel self-supervised multi-modal relational item representation learning framework designed to infer substitutable and complementary items.Existing approaches primarily focus on modeling item-item associations deduced from user behaviors using graph neural networks (GNNs) or leveraging item content information.However, these methods often overlook critical challenges, such as noisy user behavior data and data sparsity due to the long-tailed distribution of these behaviors.In this paper, we propose MMSC, a self-supervised multi-modal relational item representation learning framework to address these challenges.Specifically, MMSC consists of three main components: (1) a multi-modal item representation learning module that leverages a multi-modal foundational model and learns from item metadata, (2) a self-supervised behavior-based representation learning module that denoises and learns from user behavior data, and (3) a hierarchical representation aggregation mechanism that integrates item representations at both the semantic and task levels.Additionally, we leverage LLMs to generate augmented training data, further enhancing the denoising process during training.We conduct extensive experiments on five real-world datasets, showing that MMSC outperforms existing baselines by 26.1% for substitutable recommendation and 39.2% for complementary recommendation. 0.716In addition, we empirically show that MMSC is effective in modeling cold-start items. |
2025-07-29 |
VN-MTEB: Vietnamese Massive Text Embedding Benchmark
Vietnam ranks among the top countries in terms of both internet traffic and online toxicity.As a result, implementing embedding models for recommendation and content control duties in applications is crucial. 0.617However, a lack of large-scale test datasets, both in volume and task diversity, makes it tricky for scientists to effectively evaluate AI models before deploying them in real-world, large-scale projects.To solve this important problem, we introduce a Vietnamese benchmark, VN-MTEB for embedding models, which we created by translating a large number of English samples from the Massive Text Embedding Benchmark using our new automated framework.We leverage the strengths of large language models (LLMs) and cutting-edge embedding models to conduct translation and filtering processes to retain high-quality samples, guaranteeing a natural flow of language and semantic fidelity while preserving named entity recognition (NER) and code snippets.Our comprehensive benchmark consists of 41 datasets from six tasks specifically designed for Vietnamese text embeddings.In our analysis, we find that bigger and more complex models using Rotary Positional Embedding outperform those using Absolute Positional Embedding in embedding tasks.Datasets are available at HuggingFace: https://huggingface.co/collections/GreenNode/vn-mteb-68871433f0f7573b8e1a6686 |
2025-07-29 |
Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation
Recommendation systems often suffer from data sparsity caused by limited user-item interactions, which degrade their performance and amplify popularity bias in real-world scenarios. 0.759This paper proposes a novel data augmentation framework that leverages Large Language Models (LLMs) and item textual descriptions to enrich interaction data.By few-shot prompting LLMs multiple times to rerank items and aggregating the results via majority voting, we generate high-confidence synthetic user-item interactions, supported by theoretical guarantees based on the concentration of measure.To effectively leverage the augmented data in the context of a graph recommendation system, we integrate it into a graph contrastive learning framework to mitigate distributional shift and alleviate popularity bias. 0.632Extensive experiments show that our method improves accuracy and reduces popularity bias, outperforming strong baselines. |
2025-07-28 |
Large Language Model-Enhanced Reinforcement Learning for Diverse and Novel Recommendations
In recommendation systems, diversity and novelty are essential for capturing varied user preferences and encouraging exploration, yet many systems prioritize click relevance. 0.633While reinforcement learning (RL) has been explored to improve diversity, it often depends on random exploration that may not align with user interests.We propose LAAC (LLM-guided Adversarial Actor Critic), a novel method that leverages large language models (LLMs) as reference policies to suggest novel items, while training a lightweight policy to refine these suggestions using system-specific data.The method formulates training as a bilevel optimization between actor and critic networks, enabling the critic to selectively favor promising novel actions and the actor to improve its policy beyond LLM recommendations.To mitigate overestimation of unreliable LLM suggestions, we apply regularization that anchors critic values for unexplored items close to well-estimated dataset actions.Experiments on real-world datasets show that LAAC outperforms existing baselines in diversity, novelty, and accuracy, while remaining robust on imbalanced data, effectively integrating LLM knowledge without expensive fine-tuning. |
2025-07-27 |
Integrating LLM-Derived Multi-Semantic Intent into Graph Model for Session-based Recommendation
Session-based recommendation (SBR) is mainly based on anonymous user interaction sequences to recommend the items that the next user is most likely to click. 0.603Currently, the most popular and high-performing SBR methods primarily leverage graph neural networks (GNNs), which model session sequences as graph-structured data to effectively capture user intent.However, most GNNs-based SBR methods primarily focus on modeling the ID sequence information of session sequences, while neglecting the rich semantic information embedded within them.This limitation significantly hampers model's ability to accurately infer users' true intention.To address above challenge, this paper proposes a novel SBR approach called Integrating LLM-Derived Multi-Semantic Intent into Graph Model for Session-based Recommendation (LLM-DMsRec). 0.822The method utilizes a pre-trained GNN model to select the top-k items as candidate item sets and designs prompts along with a large language model (LLM) to infer multi-semantic intents from these candidate items.Specifically, we propose an alignment mechanism that effectively integrates the semantic intent inferred by the LLM with the structural intent captured by GNNs.Extensive experiments conducted on the Beauty and ML-1M datasets demonstrate that the proposed method can be seamlessly integrated into GNNs framework, significantly enhancing its recommendation performance. 0.753 |
2025-07-27 |
TADT-CSA: Temporal Advantage Decision Transformer with Contrastive State Abstraction for Generative Recommendation
With the rapid advancement of Transformer-based Large Language Models (LLMs), generative recommendation has shown great potential in enhancing both the accuracy and semantic understanding of modern recommender systems. 0.789Compared to LLMs, the Decision Transformer (DT) is a lightweight generative model applied to sequential recommendation tasks. 0.671However, DT faces challenges in trajectory stitching, often producing suboptimal trajectories.Moreover, due to the high dimensionality of user states and the vast state space inherent in recommendation scenarios, DT can incur significant computational costs and struggle to learn effective state representations. 0.65To overcome these issues, we propose a novel Temporal Advantage Decision Transformer with Contrastive State Abstraction (TADT-CSA) model.Specifically, we combine the conventional Return-To-Go (RTG) signal with a novel temporal advantage (TA) signal that encourages the model to capture both long-term returns and their sequential trend.Furthermore, we integrate a contrastive state abstraction module into the DT framework to learn more effective and expressive state representations.Within this module, we introduce a TA-conditioned State Vector Quantization (TAC-SVQ) strategy, where the TA score guides the state codebooks to incorporate contextual token information.Additionally, a reward prediction network and a contrastive transition prediction (CTP) network are employed to ensure the state codebook preserves both the reward information of the current state and the transition information between adjacent states.Empirical results on both public datasets and an online recommendation system demonstrate the effectiveness of the TADT-CSA model and its superiority over baseline methods. 0.763 |
2025-07-24 |
How Well Do LLMs Predict Prerequisite Skills? Zero-Shot Comparison to Expert-Defined Concepts
Prerequisite skills - foundational competencies required before mastering more advanced concepts - are important for supporting effective learning, assessment, and skill-gap analysis.Traditionally curated by domain experts, these relationships are costly to maintain and difficult to scale.This paper investigates whether large language models (LLMs) can predict prerequisite skills in a zero-shot setting, using only natural language descriptions and without task-specific fine-tuning.We introduce ESCO-PrereqSkill, a benchmark dataset constructed from the ESCO taxonomy, comprising 3,196 skills and their expert-defined prerequisite links.Using a standardized prompting strategy, we evaluate 13 state-of-the-art LLMs, including GPT-4, Claude 3, Gemini, LLaMA 4, Qwen2, and DeepSeek, across semantic similarity, BERTScore, and inference latency.Our results show that models such as LLaMA4-Maverick, Claude-3-7-Sonnet, and Qwen2-72B generate predictions that closely align with expert ground truth, demonstrating strong semantic reasoning without supervision.These findings highlight the potential of LLMs to support scalable prerequisite skill modeling for applications in personalized learning, intelligent tutoring, and skill-based recommender systems. 0.737 |
2025-07-23 |
R4ec: A Reasoning, Reflection, and Refinement Framework for Recommendation Systems
Harnessing Large Language Models (LLMs) for recommendation systems has emerged as a prominent avenue, drawing substantial research interest. 0.759However, existing approaches primarily involve basic prompt techniques for knowledge acquisition, which resemble System-1 thinking.This makes these methods highly sensitive to errors in the reasoning path, where even a small mistake can lead to an incorrect inference.To this end, in this paper, we propose $R^{4}$ec, a reasoning, reflection and refinement framework that evolves the recommendation system into a weak System-2 model. 0.725Specifically, we introduce two models: an actor model that engages in reasoning, and a reflection model that judges these responses and provides valuable feedback.Then the actor model will refine its response based on the feedback, ultimately leading to improved responses.We employ an iterative reflection and refinement process, enabling LLMs to facilitate slow and deliberate System-2-like thinking.Ultimately, the final refined knowledge will be incorporated into a recommendation backbone for prediction. 0.605We conduct extensive experiments on Amazon-Book and MovieLens-1M datasets to demonstrate the superiority of $R^{4}$ec.We also deploy $R^{4}$ec on a large scale online advertising platform, showing 2.2\% increase of revenue.Furthermore, we investigate the scaling properties of the actor model and reflection model. |
2025-07-23 |
Exploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems
Serendipity plays a pivotal role in enhancing user satisfaction within recommender systems, yet its evaluation poses significant challenges due to its inherently subjective nature and conceptual ambiguity. 0.643Current algorithmic approaches predominantly rely on proxy metrics for indirect assessment, often failing to align with real user perceptions, thus creating a gap.With large language models (LLMs) increasingly revolutionizing evaluation methodologies across various human annotation tasks, we are inspired to explore a core research proposition: Can LLMs effectively simulate human users for serendipity evaluation?To address this question, we conduct a meta-evaluation on two datasets derived from real user studies in the e-commerce and movie domains, focusing on three key aspects: the accuracy of LLMs compared to conventional proxy metrics, the influence of auxiliary data on LLM comprehension, and the efficacy of recently popular multi-LLM techniques.Our findings indicate that even the simplest zero-shot LLMs achieve parity with, or surpass, the performance of conventional metrics.Furthermore, multi-LLM techniques and the incorporation of auxiliary data further enhance alignment with human perspectives.Based on our findings, the optimal evaluation by LLMs yields a Pearson correlation coefficient of 21.5\% when compared to the results of the user study.This research implies that LLMs may serve as potentially accurate and cost-effective evaluators, introducing a new paradigm for serendipity evaluation in recommender systems. 0.696 |
2025-07-22 |
LLM-Enhanced Reranking for Complementary Product Recommendation
Complementary product recommendation, which aims to suggest items that are used together to enhance customer value, is a crucial yet challenging task in e-commerce. 0.665While existing graph neural network (GNN) approaches have made significant progress in capturing complex product relationships, they often struggle with the accuracy-diversity tradeoff, particularly for long-tail items.This paper introduces a model-agnostic approach that leverages Large Language Models (LLMs) to enhance the reranking of complementary product recommendations. 0.761Unlike previous works that use LLMs primarily for data preprocessing and graph augmentation, our method applies LLM-based prompting strategies directly to rerank candidate items retrieved from existing recommendation models, eliminating the need for model retraining. 0.697Through extensive experiments on public datasets, we demonstrate that our approach effectively balances accuracy and diversity in complementary product recommendations, with at least 50% lift in accuracy metrics and 2% lift in diversity metrics on average for the top recommended items across datasets. 0.683 |
2025-07-22 |
Meta-Learning for Cold-Start Personalization in Prompt-Tuned LLMs
Generative, explainable, and flexible recommender systems, derived using Large Language Models (LLM) are promising and poorly adapted to the cold-start user situation, where there is little to no history of interaction. 0.802The current solutions i.e. supervised fine-tuning and collaborative filtering are dense-user-item focused and would be expensive to maintain and update.This paper introduces a meta-learning framework, that can be used to perform parameter-efficient prompt-tuning, to effectively personalize LLM-based recommender systems quickly at cold-start. 0.809The model learns soft prompt embeddings with first-order (Reptile) and second-order (MAML) optimization by treating each of the users as the tasks.As augmentations to the input tokens, these learnable vectors are the differentiable control variables that represent user behavioral priors.The prompts are meta-optimized through episodic sampling, inner-loop adaptation, and outer-loop generalization.On MovieLens-1M, Amazon Reviews, and Recbole, we can see that our adaptive model outperforms strong baselines in NDCG@10, HR@10, and MRR, and it runs in real-time (i.e., below 300 ms) on consumer GPUs.Zero-history personalization is also supported by this scalable solution, and its 275 ms rate of adaptation allows successful real-time risk profiling of financial systems by shortening detection latency and improving payment network stability.Crucially, the 275 ms adaptation capability can enable real-time risk profiling for financial institutions, reducing systemic vulnerability detection latency significantly versus traditional compliance checks.By preventing contagion in payment networks (e.g., Fedwire), the framework strengthens national financial infrastructure resilience. |
2025-07-22 |
Biases in LLM-Generated Musical Taste Profiles for Recommendation
One particularly promising use case of Large Language Models (LLMs) for recommendation is the automatic generation of Natural Language (NL) user taste profiles from consumption data. 0.808These profiles offer interpretable and editable alternatives to opaque collaborative filtering representations, enabling greater transparency and user control.However, it remains unclear whether users consider these profiles to be an accurate representation of their taste, which is crucial for trust and usability.Moreover, because LLMs inherit societal and data-driven biases, profile quality may systematically vary across user and item characteristics.In this paper, we study this issue in the context of music streaming, where personalization is challenged by a large and culturally diverse catalog.We conduct a user study in which participants rate NL profiles generated from their own listening histories.We analyze whether identification with the profiles is biased by user attributes (e.g., mainstreamness, taste diversity) and item features (e.g., genre, country of origin).We also compare these patterns to those observed when using the profiles in a downstream recommendation task.Our findings highlight both the potential and limitations of scrutable, LLM-based profiling in personalized systems. |
2025-07-22 |
LLM4MEA: Data-free Model Extraction Attacks on Sequential Recommenders via Large Language Models
Recent studies have demonstrated the vulnerability of sequential recommender systems to Model Extraction Attacks (MEAs). 0.652MEAs collect responses from recommender systems to replicate their functionality, enabling unauthorized deployments and posing critical privacy and security risks. 0.609Black-box attacks in prior MEAs are ineffective at exposing recommender system vulnerabilities due to random sampling in data selection, which leads to misaligned synthetic and real-world distributions.To overcome this limitation, we propose LLM4MEA, a novel model extraction method that leverages Large Language Models (LLMs) as human-like rankers to generate data.It generates data through interactions between the LLM ranker and target recommender system.In each interaction, the LLM ranker analyzes historical interactions to understand user behavior, and selects items from recommendations with consistent preferences to extend the interaction history, which serves as training data for MEA. 0.687Extensive experiments demonstrate that LLM4MEA significantly outperforms existing approaches in data quality and attack performance, reducing the divergence between synthetic and real-world data by up to 64.98% and improving MEA performance by 44.82% on average.From a defensive perspective, we propose a simple yet effective defense strategy and identify key hyperparameters of recommender systems that can mitigate the risk of MEAs. 0.613 |
2025-07-22 |
VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings
Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding.However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. 0.691To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings.Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions.Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%. 0.695Additional experimental results show that our framework outperforms vision-language models, including CLIP, FashionCLIP, and GCL, in both precision and semantic alignment, demonstrating the potential of combining object-aware visual grounding and LLM-enhanced text representation for robust multimodal recommendations. 0.652 |
2025-07-21 |
RankMixer: Scaling Up Ranking Models in Industrial Recommenders
Recent progress on large language models (LLMs) has spurred interest in scaling up recommendation systems, yet two practical obstacles remain. 0.789First, training and serving cost on industrial Recommenders must respect strict latency bounds and high QPS demands.Second, most human-designed feature-crossing modules in ranking models were inherited from the CPU era and fail to exploit modern GPUs, resulting in low Model Flops Utilization (MFU) and poor scalability.We introduce RankMixer, a hardware-aware model design tailored towards a unified and scalable feature-interaction architecture.RankMixer retains the transformer's high parallelism while replacing quadratic self-attention with multi-head token mixing module for higher efficiency.Besides, RankMixer maintains both the modeling for distinct feature subspaces and cross-feature-space interactions with Per-token FFNs.We further extend it to one billion parameters with a Sparse-MoE variant for higher ROI.A dynamic routing strategy is adapted to address the inadequacy and imbalance of experts training.Experiments show RankMixer's superior scaling abilities on a trillion-scale production dataset.By replacing previously diverse handcrafted low-MFU modules with RankMixer, we boost the model MFU from 4.5% to 45%, and scale our ranking model parameters by 100x while maintaining roughly the same inference latency.We verify RankMixer's universality with online A/B tests across three core application scenarios (Recommendation, Advertisement and Search).Finally, we launch 1B Dense-Parameters RankMixer for full traffic serving without increasing the serving cost, which improves user active days by 0.2% and total in-app usage duration by 0.5%. |
2025-07-21 |
Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation
Natural language interfaces offer a compelling approach for music recommendation, enabling users to express complex preferences conversationally. 0.635While Large Language Models (LLMs) show promise in this direction, their scalability in recommender systems is limited by high costs and latency. 0.712Retrieval-based approaches using smaller language models mitigate these issues but often rely on single-modal item representations, overlook long-term user preferences, and require full model retraining, posing challenges for real-world deployment.In this paper, we present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation. 0.686JAM models user-query-item interactions as vector translations in a shared latent space, inspired by knowledge graph embedding methods like TransE. To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts.We also introduce JAMSessions, a new dataset of over 100k user-query-item triples with anonymized user/item embeddings, uniquely combining conversational queries and user long-term preferences.Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks. 0.732 |
2025-07-16 |
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker
Traditional information extraction systems face challenges with text only language models as it does not consider infographics (visual elements of information) such as tables, charts, images etc. often used to convey complex information to readers.Multimodal LLM (MLLM) face challenges of finding needle in the haystack problem i.e., either longer context length or substantial number of documents as search space.Late interaction mechanism over visual language models has shown state of the art performance in retrieval-based vision augmented Q&A tasks.There are yet few challenges using it for RAG based multi-modal Q&A.Firstly, many popular and widely adopted vector databases do not support native multi-vector retrieval.Secondly, late interaction requires computation which inflates space footprint and can hinder enterprise adoption.Lastly, the current state of late interaction mechanism does not leverage the approximate neighbor search indexing methods for large speed ups in retrieval process.This paper explores a pragmatic approach to make vision retrieval process scalable and efficient without compromising on performance quality.We propose multi-step custom implementation utilizing widely adopted hybrid search (metadata & embedding) and state of the art late interaction re-ranker to retrieve best matching pages. 0.623Finally, MLLM are prompted as reader to generate answers from contextualized best matching pages.Through experiments, we observe that the proposed design is scalable (significant speed up) and stable (without degrading performance quality), hence can be used as production systems at enterprises. |
2025-07-15 |
LLM-Driven Dual-Level Multi-Interest Modeling for Recommendation
Recently, much effort has been devoted to modeling users' multi-interests based on their behaviors or auxiliary signals.However, existing methods often rely on heuristic assumptions, e.g., co-occurring items indicate the same interest of users, failing to capture user multi-interests aligning with real-world scenarios.While large language models (LLMs) show significant potential for multi-interest analysis due to their extensive knowledge and powerful reasoning capabilities, two key challenges remain.First, the granularity of LLM-driven multi-interests is agnostic, possibly leading to overly fine or coarse interest grouping.Second, individual user analysis provides limited insights due to the data sparsity issue.In this paper, we propose an LLM-driven dual-level multi-interest modeling framework for more effective recommendation. 0.849At the user-individual level, we exploit LLMs to flexibly allocate items engaged by users into different semantic clusters, indicating their diverse and distinct interests.To alleviate the agnostic generation of LLMs, we adaptively assign these semantic clusters to users' collaborative multi-interests learned from global user-item interactions, allowing the granularity to be automatically adjusted according to the user's behaviors using an alignment module.To alleviate the limited insights derived from individual users' behaviors, at the user-crowd level, we propose aggregating user cliques into synthesized users with rich behaviors for more comprehensive LLM-driven multi-interest analysis.We formulate a max covering problem to ensure the compactness and representativeness of synthesized users' behaviors, and then conduct contrastive learning based on their LLM-driven multi-interests to disentangle item representations among different interests.Experiments on real-world datasets show the superiority of our approach against state-of-the-art methods. |
Production workflows for LLMs |
|
2025-07-31 |
From LLMs to Edge: Parameter-Efficient Fine-Tuning on Edge Devices
Parameter-efficient fine-tuning (PEFT) methods reduce the computational costs of updating deep learning models by minimizing the number of additional parameters used to adapt a model to a down- stream task. 0.726While extensively researched in large language models (LLMs), their application to smaller models used on edge devices, such as convolutional neural networks, remains underexplored. 0.36This paper benchmarks and analyzes popular PEFT methods on convolutional architectures typically deployed in resource-constrained edge environments. 0.541We evaluate LoRA, DoRA, and GaLore for updating standard and depthwise convolutional architectures to handle distribution shifts and accommodate unseen classes. 0.535We utilize recently proposed PyTorch profilers to compare the updated model performance and computational costs of these PEFT methods with traditional fine-tuning approaches. 0.568With resource efficiency in mind, we investigate their update behavior across different rank dimensions. 0.573We find that the evaluated PEFT methods are only half as memory-efficient when applied to depthwise-separable convolution architectures, compared to their efficiency with LLMs. 0.41Conversely, when targeting convolu- tional architectures optimized for edge deployment, adapter-based PEFT methods can reduce floating point operations (FLOPs) during model updates by up to 95%. 0.648These insights offer valuable guidance for selecting PEFT methods based on hardware constraints, performance requirements, and application needs. 0.424Our code is online. 0.441 |
2025-07-31 |
Improved Algorithms for Kernel Matrix-Vector Multiplication Under Sparsity Assumptions
Motivated by the problem of fast processing of attention matrices, we study fast algorithms for computing matrix-vector products for asymmetric Gaussian Kernel matrices $K\in \mathbb{R}^{n\times n}$. $K$'s columns are indexed by a set of $n$ keys $k_1,k_2\ldots, k_n\in \mathbb{R}^d$, rows by a set of $n$ queries $q_1,q_2,\ldots,q_n\in \mathbb{R}^d $, and its $i,j$ entry is $K_{ij} = e^{-\|q_i-k_j\|_2^2/2\sigma^2}$ for some bandwidth parameter $\sigma>0$. Given a vector $x\in \mathbb{R}^n$ and error parameter $\epsilon>0$, our task is to output a $y\in \mathbb{R}^n$ such that $\|Kx-y\|_2\leq \epsilon \|x\|_2$ in time subquadratic in $n$ and linear in $d$. Our algorithms rely on the following modelling assumption about the matrices $K$: the sum of the entries of $K$ scales linearly in $n$, as opposed to worst case quadratic growth. 0.344We validate this assumption experimentally, for Gaussian kernel matrices encountered in various settings such as fast attention computation in LLMs.We obtain the first subquadratic-time algorithm that works under this assumption, for unrestricted vectors. |
2025-07-31 |
MemoCue: Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying
Agent-assisted memory recall is one critical research problem in the field of human-computer interaction.In conventional methods, the agent can retrieve information from its equipped memory module to help the person recall incomplete or vague memories.The limited size of memory module hinders the acquisition of complete memories and impacts the memory recall performance in practice.Memory theories suggest that the person's relevant memory can be proactively activated through some effective cues.Inspired by this, we propose a novel strategy-guided agent-assisted memory recall method, allowing the agent to transform an original query into a cue-rich one via the judiciously designed strategy to help the person recall memories.To this end, there are two key challenges.(1) How to choose the appropriate recall strategy for diverse forgetting scenarios with distinct memory-recall characteristics?(2) How to obtain the high-quality responses leveraging recall strategies, given only abstract and sparsely annotated strategy patterns?To address the challenges, we propose a Recall Router framework. 0.47Specifically, we design a 5W Recall Map to classify memory queries into five typical scenarios and define fifteen recall strategy patterns across the corresponding scenarios.We then propose a hierarchical recall tree combined with the Monte Carlo Tree Search algorithm to optimize the selection of strategy and the generation of strategy responses.We construct an instruction tuning dataset and fine-tune multiple open-source large language models (LLMs) to develop MemoCue, an agent that excels in providing memory-inspired responses. 0.36Experiments on three representative datasets show that MemoCue surpasses LLM-based methods by 17.74% in recall inspiration.Further human evaluation highlights its advantages in memory-recall applications. |
2025-07-31 |
TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses
Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. 0.491However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search.To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. 0.492Through comprehensive evaluation, including user studies with side-by-side comparisons, satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. 0.306Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience. 0.66 |
2025-07-31 |
DepMicroDiff: Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data
Microbiome data analysis is essential for understanding host health and disease, yet its inherent sparsity and noise pose major challenges for accurate imputation, hindering downstream tasks such as biomarker discovery.Existing imputation methods, including recent diffusion-based models, often fail to capture the complex interdependencies between microbial taxa and overlook contextual metadata that can inform imputation.We introduce DepMicroDiff, a novel framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise dependencies and autoregressive relationships.DepMicroDiff is further enhanced by VAE-based pretraining across diverse cancer datasets and conditioning on patient metadata encoded via a large language model (LLM). 0.477Experiments on TCGA microbiome datasets show that DepMicroDiff substantially outperforms state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer types, demonstrating its robustness and generalizability for microbiome imputation. |
2025-07-31 |
TextQuests: How Good are LLMs at Text-Based Video Games?
Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities.While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent's ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context.To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games.These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks.The benchmark is specifically designed to assess an LLM agent's capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session.We release TextQuests at https://textquests.ai. 0.481 |
2025-07-31 |
DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable, Uncertainty-Aware Redaction
Access to medical imaging and associated text data has the potential to drive major advances in healthcare research and patient outcomes.However, the presence of Protected Health Information (PHI) and Personally Identifiable Information (PII) in Digital Imaging and Communications in Medicine (DICOM) files presents a significant barrier to the ethical and secure sharing of imaging datasets. 0.347This paper presents a hybrid de-identification framework developed by Impact Business Information Solutions (IBIS) that combines rule-based and AI-driven techniques, and rigorous uncertainty quantification for comprehensive PHI/PII removal from both metadata and pixel data. 0.551Our approach begins with a two-tiered rule-based system targeting explicit and inferred metadata elements, further augmented by a large language model (LLM) fine-tuned for Named Entity Recognition (NER), and trained on a suite of synthetic datasets simulating realistic clinical PHI/PII. 0.484For pixel data, we employ an uncertainty-aware Faster R-CNN model to localize embedded text, extract candidate PHI via Optical Character Recognition (OCR), and apply the NER pipeline for final redaction. 0.435Crucially, uncertainty quantification provides confidence measures for AI-based detections to enhance automation reliability and enable informed human-in-the-loop verification to manage residual risks. 0.392This uncertainty-aware deidentification framework achieves robust performance across benchmark datasets and regulatory standards, including DICOM, HIPAA, and TCIA compliance metrics. 0.616By combining scalable automation, uncertainty quantification, and rigorous quality assurance, our solution addresses critical challenges in medical data de-identification and supports the secure, ethical, and trustworthy release of imaging data for research. 0.389 |
LLM Model Architectures and Training Techniques |
|
2025-07-31 |
Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning
In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance.Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes.Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. 0.44Furthermore, while some studies have explored to improve retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain.To address these challenges, we introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning.In this framework, we first develop the model's ability to perform logical reasoning over medical problems.Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of knowledge corpus and external information utilization throughout the reasoning process.Finally, we conduct joint optimization of the model's retrieval and reasoning coordination.Extensive experiments indicate that **Med-R$^3$** could achieve state-of-the-art performances, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-sourced GPT-4o-mini by 3.93\% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53\%. 0.403 |
2025-07-31 |
GraphRAG-R1: Graph Retrieval-Augmented Generation with Process-Constrained Reinforcement Learning
Graph Retrieval-Augmented Generation (GraphRAG) has shown great effectiveness in enhancing the reasoning abilities of LLMs by leveraging graph structures for knowledge representation and modeling complex real-world relationships.However, existing GraphRAG methods still face significant bottlenecks when handling complex problems that require multi-hop reasoning, as their query and retrieval phases are largely based on pre-defined heuristics and do not fully utilize the reasoning potentials of LLMs. 0.43To address this problem, we propose GraphRAG-R1, an adaptive GraphRAG framework by training LLMs with process-constrained outcome-based reinforcement learning (RL) to enhance the multi-hop reasoning ability. 0.511Our method can decompose complex problems, autonomously invoke retrieval tools to acquire necessary information, and perform effective reasoning.Specifically, we utilize a modified version of Group Relative Policy Optimization (GRPO) that supports rollout-with-thinking capability. 0.546Next, we design two process-constrained reward functions. 0.499To handle the shallow retrieval problem, we design a Progressive Retrieval Attenuation (PRA) reward to encourage essential retrievals.Then, to handle the over-thinking problem, we design Cost-Aware F1 (CAF) reward to balance the model performance with computational costs. 0.472We further design a phase-dependent training strategy, containing three training stages corresponding to cold start and these two rewards.Lastly, our method adopts a hybrid graph-textual retrieval to improve the reasoning capacity.Extensive experimental results demonstrate that GraphRAG-R1 boosts LLM capabilities in solving complex reasoning problems compared to state-of-the-art GraphRAG methods on both in-domain and out-of-domain datasets.Furthermore, our framework can be flexibly integrated with various existing retrieval methods, consistently delivering performance improvements. |
2025-07-31 |
Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study
Recent advancements in Large Language Models have sparked interest in their potential for robotic task planning.While these models demonstrate strong generative capabilities, their effectiveness in producing structured and executable plans remains uncertain.This paper presents a systematic evaluation of a broad spectrum of current state of the art language models, each directly prompted using Planning Domain Definition Language domain and problem files, and compares their planning performance with the Fast Downward planner across a variety of benchmarks.In addition to measuring success rates, we assess how faithfully the generated plans translate into sequences of actions that can actually be executed, identifying both strengths and limitations of using these models in this setting.Our findings show that while the models perform well on simpler planning tasks, they continue to struggle with more complex scenarios that require precise resource management, consistent state tracking, and strict constraint compliance. 0.505These results underscore fundamental challenges in applying language models to robotic planning in real world environments.By outlining the gaps that emerge during execution, we aim to guide future research toward combined approaches that integrate language models with classical planners in order to enhance the reliability and scalability of planning in autonomous robotics. |
Programming applications of LLMs |
|
2025-07-31 |
SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution
Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs).Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks.While existing agent-based issue resolution approaches are primarily based on agents' independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase.To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization.SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph.Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace.This structured competition enables agents to collaboratively converge on a consolidated fix plan.Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. 0.683Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin. |
2025-07-31 |
Quality Evaluation of COBOL to Java Code Transformation
We present an automated evaluation system for assessing COBOL-to-Java code translation within IBM's watsonx Code Assistant for Z (WCA4Z). 0.65The system addresses key challenges in evaluating LLM-based translators, including model opacity and the complexity of translation quality assessment.Our approach combines analytic checkers with LLM-as-a-judge (LaaJ) techniques to deliver scalable, multi-faceted evaluations.The system supports continuous integration workflows, enables large-scale benchmarking, and reduces reliance on manual review.We describe the system architecture, evaluation strategies, and reporting mechanisms that provide actionable insights for developers and project managers, facilitating the evolution of high-quality, modernized codebases. |
2025-07-31 |
SWE-Exp: Experience-Driven Software Issue Resolution
Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). 0.671However, current agents act as memoryless explorers - treating each problem separately without retaining or reusing knowledge from previous repair experiences.This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems.To address this problem, we introduce SWE-Exp, an experience - enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues.Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts.Specifically, it extracts reusable issue resolution knowledge at different levels - from high-level problem comprehension to specific code changes.Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks.Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution. |
2025-07-31 |
Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling
Software issue resolution is a critical challenge in software engineering and has garnered increasing attention in recent years.With the rapid advancement of large language models (LLMs), substantial progress has been made in addressing real-world software engineering tasks. 0.803Recent studies have introduced ensemble reasoning techniques to enhance the performance of LLM-based issue resolution.However, existing prompting-based methods still face limitations in effectively exploring large ensemble spaces and lack the capacity for repository-level understanding, both of which constrain their overall effectiveness.In this paper, we propose Trae Agent, the first agent-based ensemble reasoning approach for repository-level issue resolution.Trae Agent formulates our goal as an optimal solution search problem and addresses two key challenges, i.e., large ensemble spaces and repository-level understanding, through modular agents for generation, pruning, and selection.We conduct extensive experiments using three leading LLMs on the widely-adopted SWE-bench benchmark, comparing Trae Agent against four state-of-the-art ensemble reasoning techniques.Experimental results demonstrate that Trae Agent consistently achieves superior performance, with an average improvement of 10.22% over all baselines in terms of Pass@1.Trae Agent has achieved first place on the SWE-bench Verified leaderboard, with a notable Pass@1 score of 75.20%.We are pleased to release Trae Agent as an open-source project to support the research community, with all resources available at https://github.com/bytedance/trae-agent. |
2025-07-30 |
From Articles to Code: On-Demand Generation of Core Algorithms from Scientific Publications
Maintaining software packages imposes significant costs due to dependency management, bug fixes, and versioning.We show that rich method descriptions in scientific publications can serve as standalone specifications for modern large language models (LLMs), enabling on-demand code generation that could supplant human-maintained libraries. 0.868We benchmark state-of-the-art models (GPT-o4-mini-high, Gemini Pro 2.5, Claude Sonnet 4) by tasking them with implementing a diverse set of core algorithms drawn from original publications.Our results demonstrate that current LLMs can reliably reproduce package functionality with performance indistinguishable from conventional libraries.These findings foreshadow a paradigm shift toward flexible, on-demand code generation and away from static, human-maintained packages, which will result in reduced maintenance overhead by leveraging published articles as sufficient context for the automated implementation of analytical workflows. 0.695 |
2025-07-30 |
AutoCodeSherpa: Symbolic Explanations in AI Coding Agents
Large Language Model (LLM) agents autonomously use external tools on top of one or more LLMs to accomplish specific tasks.Lately LLM agents for software engineering tasks have become popular. 0.667These agents can benefit from the use of program analysis tools working on program representations.This is demonstrated by existing agentic AI solutions such as AutoCodeRover or SpecRover which perform automated program repair.Specifically the goal of these works is to use program analysis to improve the patch quality.These agents are currently being used to automatically fix static analysis issues from the widely used SonarQube static analyzer. Nevertheless, for the agents to be deployed in a production environment, agents need to suggest software artifacts, such as patches, with evidence and with high confidence.In this work, we provide a workflow where an agent provides explanations of the bug in the form of symbolic formulae.The explanations are in the form of input conditions, infection conditions and output conditions, implemented as property based tests (PBT) and program-internal symbolic expressions.These can help in human developer cognition of the agent outputs as well as in achieving completely automated agentic workflows for software.The human developer can benefit from the input condition, represented as a PBT, to generate various concrete inputs showing a given issue.Furthermore, since the PBTs are executable, our explanations are executable as well.We can thus also use the explanations in a completely automated issue resolution environment for accepting or rejecting the patches that are suggested by patching agents such as AutoCodeRover.Finally, as agentic AI approaches continue to develop, the program analysis driven explanations can be provided to other LLM-based repair techniques such as Agentless to improve their output. |
2025-07-30 |
IFEvalCode: Controlled Code Generation
Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. 0.9To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines.The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries.Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment.Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between the models' ability to generate correct code versus code that precisely follows instructions. 0.726 |
2025-07-30 |
SLM-SQL: An Exploration of Small Language Models for Text-to-SQL
Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). 0.66In contrast, small language models (SLMs) ranging from 0.5B to 1.5B parameters currently underperform on Text-to-SQL tasks due to their limited logical reasoning capabilities.However, SLMs offer inherent advantages in inference speed and suitability for edge deployment.To explore their potential in Text-to-SQL applications, we leverage recent advancements in post-training techniques.Specifically, we used the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision.We then applied supervised fine-tuning and reinforcement learning-based post-training to the SLM, followed by inference using a corrective self-consistency approach.Experimental results validate the effectiveness and generalizability of our method, SLM-SQL.On the BIRD development set, the five evaluated models achieved an average improvement of 31.4 points.Notably, the 0.5B model reached 56.87\% execution accuracy (EX), while the 1.5B model achieved 67.08\% EX.We will release our dataset, model, and code to github: https://github.com/CycloneBoy/slm_sql. |
2025-07-30 |
aLLoyM: A large language model for alloy phase diagram prediction
Large Language Models (LLMs) are general-purpose tools with wide-ranging applications, including in materials science. 0.656In this work, we introduce aLLoyM, a fine-tuned LLM specifically trained on alloy compositions, temperatures, and their corresponding phase information.To develop aLLoyM, we curated question-and-answer (Q&A) pairs for binary and ternary phase diagrams using the open-source Computational Phase Diagram Database (CPDDB) and assessments based on CALPHAD (CALculation of PHAse Diagrams).We fine-tuned Mistral, an open-source pre-trained LLM, for two distinct Q&A formats: multiple-choice and short-answer.Benchmark evaluations demonstrate that fine-tuning substantially enhances performance on multiple-choice phase diagram questions.Moreover, the short-answer model of aLLoyM exhibits the ability to generate novel phase diagrams from its components alone, underscoring its potential to accelerate the discovery of previously unexplored materials systems.To promote further research and adoption, we have publicly released the short-answer fine-tuned version of aLLoyM, along with the complete benchmarking Q&A dataset, on Hugging Face. |
2025-07-30 |
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. 0.672While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. 0.87In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups.To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation.The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis.This design improves robustness, interpretability, and fidelity over end-to-end black-box methods.Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs.Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality.Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness.Our code is made publicly available at https://github.com/leigest519/ScreenCoder. |
2025-07-30 |
On LLM-Assisted Generation of Smart Contracts from Business Processes
Large language models (LLMs) have changed the reality of how software is produced. 0.75Within the wider software engineering community, among many other purposes, they are explored for code generation use cases from different types of input. 0.766In this work, we present an exploratory study to investigate the use of LLMs for generating smart contract code from business process descriptions, an idea that has emerged in recent literature to overcome the limitations of traditional rule-based code generation approaches. 0.747However, current LLM-based work evaluates generated code on small samples, relying on manual inspection, or testing whether code compiles but ignoring correct execution.With this work, we introduce an automated evaluation framework and provide empirical data from larger data sets of process models.We test LLMs of different types and sizes in their capabilities of achieving important properties of process execution, including enforcing process flow, resource allocation, and data-based conditions.Our results show that LLM performance falls short of the perfect reliability required for smart contract development.We suggest future work to explore responsible LLM integrations in existing tools for code generation to ensure more reliable output. 0.761Our benchmarking framework can serve as a foundation for developing and evaluating such integrations. |
2025-07-29 |
IntentFlow: Interactive Support for Communicating Intent with LLMs in Writing Tasks
While large language models (LLMs) are widely used for writing, users often struggle to express their nuanced and evolving intents through prompt-based interfaces. 0.689Intents -- low-level strategies or preferences for achieving a writing goal -- are often vague, fluid, or even subconscious, making it difficult for users to articulate and adjust them.To address this, we present IntentFlow, which supports the communication of dynamically evolving intents throughout LLM-assisted writing.IntentFlow extracts goals and intents from user prompts and presents them as editable interface components, which users can revise, remove, or refine via direct manipulation or follow-up prompts.Visual links connect each component to the output segments it influences, helping users understand model behavior.In a within-subjects study (N=12), participants using IntentFlow, compared to a chat-based baseline, expressed their intents more easily and in detail, engaged in more meaningful actions to communicate intents, such as adjusting and deleting, and produced outputs that better aligned with their evolving intents.We found that editable intent representations help users refine and consolidate a final set of intents, which can be reused across similar tasks to support consistent and transferable LLM-assisted writing. |
2025-07-29 |
Secure coding for web applications: Frameworks, challenges, and the role of LLMs
Secure coding is a critical yet often overlooked practice in software development.Despite extensive awareness efforts, real-world adoption remains inconsistent due to organizational, educational, and technical barriers.This paper provides a comprehensive review of secure coding practices across major frameworks and domains, including web development, DevSecOps, and cloud security.It introduces a structured framework comparison and categorizes threats aligned with the OWASP Top 10.Additionally, we explore the rising role of Large Language Models (LLMs) in evaluating and recommending secure code, presenting a reproducible case study across four major vulnerability types. 0.734This paper offers practical insights for researchers, developers, and educators on integrating secure coding into real-world development processes. |
2025-07-29 |
MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages, Models,Prompts, and Scenarios
As large language models (LLMs) rapidly advance, their role in code generation has expanded significantly. 0.951While this offers streamlined development, it also creates concerns in areas like education and job interviews.Consequently, developing robust systems to detect AI-generated code is imperative to maintain academic integrity and ensure fairness in hiring processes.In this study, we introduce MultiAIGCD, a dataset for AI-generated code detection for Python, Java, and Go. 0.73From the CodeNet dataset's problem definitions and human-authored codes, we generate several code samples in Java, Python, and Go with six different LLMs and three different prompts. 0.696This generation process covered three key usage scenarios: (i) generating code from problem descriptions, (ii) fixing runtime errors in human-written code, and (iii) correcting incorrect outputs.Overall, MultiAIGCD consists of 121,271 AI-generated and 32,148 human-written code snippets.We also benchmark three state-of-the-art AI-generated code detection models and assess their performance in various test scenarios such as cross-model and cross-language. 0.682We share our dataset and codes to support research in this field. |
2025-07-29 |
LLM-based Content Classification Approach for GitHub Repositories by the README Files
GitHub is the world's most popular platform for storing, sharing, and managing code. 0.733Every GitHub repository has a README file associated with it.The README files should contain project-related information as per the recommendations of GitHub to support the usage and improvement of repositories.However, GitHub repository owners sometimes neglected these recommendations.This prevents a GitHub repository from reaching its full potential.This research posits that the comprehensiveness of a GitHub repository's README file significantly influences its adoption and utilization, with a lack of detail potentially hindering its full potential for widespread engagement and impact within the research community.Large Language Models (LLMs) have shown great performance in many text-based tasks including text classification, text generation, text summarization and text translation.In this study, an approach is developed to fine-tune LLMs for automatically classifying different sections of GitHub README files.Three encoder-only LLMs are utilized, including BERT, DistilBERT and RoBERTa.These pre-trained models are then fine-tuned based on a gold-standard dataset consisting of 4226 README file sections.This approach outperforms current state-of-the-art methods and has achieved an overall F1 score of 0.98.Moreover, we have also investigated the use of Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) and shown an economical alternative to full fine-tuning without compromising much performance.The results demonstrate the potential of using LLMs in designing an automatic classifier for categorizing the content of GitHub README files.Consequently, this study contributes to the development of automated tools for GitHub repositories to improve their identifications and potential usages. |
2025-07-29 |
UserBench: An Interactive Gym Environment for User-Centric Agents
Large Language Models (LLMs)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. 0.674However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored.To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions.UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools.Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment.For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction.These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners.UserBench offers an interactive environment to measure and advance this critical capability. |
2025-07-28 |
LLMs-guided adaptive compensator: Bringing Adaptivity to Automatic Control Systems with Large Language Models
With rapid advances in code generation, reasoning, and problem-solving, Large Language Models (LLMs) are increasingly applied in robotics. 0.659Most existing work focuses on high-level tasks such as task decomposition.A few studies have explored the use of LLMs in feedback controller design; however, these efforts are restricted to overly simplified systems, fixed-structure gain tuning, and lack real-world validation.To further investigate LLMs in automatic control, this work targets a key subfield: adaptive control.Inspired by the framework of model reference adaptive control (MRAC), we propose an LLM-guided adaptive compensator framework that avoids designing controllers from scratch.Instead, the LLMs are prompted using the discrepancies between an unknown system and a reference system to design a compensator that aligns the response of the unknown system with that of the reference, thereby achieving adaptivity.Experiments evaluate five methods: LLM-guided adaptive compensator, LLM-guided adaptive controller, indirect adaptive control, learning-based adaptive control, and MRAC, on soft and humanoid robots in both simulated and real-world environments.Results show that the LLM-guided adaptive compensator outperforms traditional adaptive controllers and significantly reduces reasoning complexity compared to the LLM-guided adaptive controller.The Lyapunov-based analysis and reasoning-path inspection demonstrate that the LLM-guided adaptive compensator enables a more structured design process by transforming mathematical derivation into a reasoning task, while exhibiting strong generalizability, adaptability, and robustness.This study opens a new direction for applying LLMs in the field of automatic control, offering greater deployability and practicality compared to vision-language models. |
2025-07-28 |
GeoJSEval: An Automated Evaluation Framework for Large Language Models on JavaScript-Based Geospatial Computation and Visualization Code Generation
With the widespread adoption of large language models (LLMs) in code generation tasks, geospatial code generation has emerged as a critical frontier in the integration of artificial intelligence and geoscientific analysis. 0.884This trend underscores the urgent need for systematic evaluation methodologies to assess LLMs generation capabilities in geospatial contexts.In particular, geospatial computation and visualization tasks in JavaScript environments rely heavily on orchestrating diverse frontend libraries and ecosystems, placing elevated demands on a model's semantic understanding and code synthesis abilities.To address this challenge, we propose GeoJSEval--the first multimodal, function-level automatic evaluation framework for LLMs in JavaScript-based geospatial code generation. 0.736GeoJSEval comprises three core components: a standardized test suite (GeoJSEval-Bench), a code submission engine, and an evaluation module.It includes 432 function-level tasks and 2,071 structured test cases spanning five widely used JavaScript geospatial libraries and 25 mainstream geospatial data types.GeoJSEval enables multidimensional quantitative evaluation across metrics such as accuracy, output stability, execution efficiency, resource consumption, and error type distribution, and integrates boundary testing mechanisms to enhance robustness and coverage.We conduct a comprehensive evaluation of 18 state-of-the-art LLMs using GeoJSEval, revealing significant performance disparities and bottlenecks in spatial semantic understanding, code reliability, and function invocation accuracy.GeoJSEval provides a foundational methodology, evaluation resource, and practical toolkit for the standardized assessment and optimization of geospatial code generation models, with strong extensibility and applicability in real-world scenarios. 0.812 |
2025-07-28 |
evalSmarT: An LLM-Based Framework for Evaluating Smart Contract Generated Comments
Smart contract comment generation has gained traction as a means to improve code comprehension and maintainability in blockchain systems. 0.65However, evaluating the quality of generated comments remains a challenge.Traditional metrics such as BLEU and ROUGE fail to capture domain-specific nuances, while human evaluation is costly and unscalable.In this paper, we present \texttt{evalSmarT}, a modular and extensible framework that leverages large language models (LLMs) as evaluators.The system supports over 400 evaluator configurations by combining approximately 40 LLMs with 10 prompting strategies.We demonstrate its application in benchmarking comment generation tools and selecting the most informative outputs.Our results show that prompt design significantly impacts alignment with human judgment, and that LLM-based evaluation offers a scalable and semantically rich alternative to existing methods. |
2025-07-28 |
Enhancing Project-Specific Code Completion by Inferring Internal API Information
Project-specific code completion is a critical task that leverages context from a project to generate accurate code.State-of-the-art methods use retrieval-augmented generation (RAG) with large language models (LLMs) and project information for code completion. 0.946However, they often struggle to incorporate internal API information, which is crucial for accuracy, especially when APIs are not explicitly imported in the file. To address this, we propose a method to infer internal API information without relying on imports.Our method extends the representation of APIs by constructing usage examples and semantic descriptions, building a knowledge base for LLMs to generate relevant completions.We also introduce ProjBench, a benchmark that avoids leaked imports and consists of large-scale real-world projects. Experiments on ProjBench and CrossCodeEval show that our approach significantly outperforms existing methods, improving code exact match by 22.72% and identifier exact match by 18.31%. 0.658Additionally, integrating our method with existing baselines boosts code match by 47.80% and identifier match by 35.55%. |
2025-07-28 |
Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs
Background: Automated Vulnerability Repair (AVR) is a fast-growing branch of program repair.Recent studies show that large language models (LLMs) outperform traditional techniques, extending their success beyond code generation and fault detection. 0.827Hypothesis: These gains may be driven by hidden factors -- "invisible hands" such as training-data leakage or perfect fault localization -- that let an LLM reproduce human-authored fixes for the same code. Objective: We replicate prior AVR studies under controlled conditions by deliberately adding errors to the reported vulnerability location in the prompt.If LLMs merely regurgitate memorized fixes, both small and large localization errors should yield the same number of correct patches, because any offset should divert the model from the original fix. Method: Our pipeline repairs vulnerabilities from the Vul4J and VJTrans benchmarks after shifting the fault location by n lines from the ground truth.A first LLM generates a patch, a second LLM reviews it, and we validate the result with regression and proof-of-vulnerability tests.Finally, we manually audit a sample of patches and estimate the error rate with the Agresti-Coull-Wilson method. |
2025-07-28 |
User-Centered Design with AI in the Loop: A Case Study of Rapid User Interface Prototyping with "Vibe Coding"
We present a case study of using generative user interfaces, or ``vibe coding,'' a method leveraging large language models (LLMs) for generating code via natural language prompts, to support rapid prototyping in user-centered design (UCD). 0.84Extending traditional UCD practices, we propose an AI-in-the-loop ideate-prototyping process.We share insights from an empirical experience integrating this process to develop an interactive data analytics interface for highway traffic engineers to effectively retrieve and analyze historical traffic data.With generative UIs, the team was able to elicit rich user feedback and test multiple alternative design ideas from user evaluation interviews and real-time collaborative sessions with domain experts.We discuss the advantages and pitfalls of vibe coding for bridging the gaps between design expertise and domain-specific expertise. |
2025-07-28 |
Curiosity by Design: An LLM-based Coding Assistant Asking Clarification Questions
Large Language Models (LLMs) are increasingly used as coding assistants. 0.778However, the ambiguity of the developer's prompt often leads to incorrect code generation, as current models struggle to infer user intent without extensive prompt engineering or external context.This work aims to build an LLM-based coding assistant that mimics the human code review process by asking clarification questions when faced with ambiguous or under-specified queries. 0.715Our end-to-end system includes (1) a query classifier trained to detect unclear programming-related queries and (2) a fine-tuned LLM that generates clarification questions.Our evaluation shows that the fine-tuned LLM outperforms standard zero-shot prompting in generating useful clarification questions.Furthermore, our user study indicates that users find the clarification questions generated by our model to outperform the baseline, demonstrating that our coding assistant produces more accurate and helpful code responses compared to baseline coding assistants. 0.659 |
2025-07-27 |
Learning to Align Human Code Preferences
Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. 0.871While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human preferences, the optimal training strategy remains unclear across diverse code preference scenarios.This paper systematically investigates the roles of SFT and DPO in aligning LLMs with different code preferences.Through both theoretical analysis and empirical observation, we hypothesize that SFT excels in scenarios with objectively verifiable optimal solutions, while applying SFT followed by DPO (S&D) enables models to explore superior solutions in scenarios without objectively verifiable optimal solutions.Based on the analysis and experimental evidence, we propose Adaptive Preference Optimization (APO), a dynamic integration approach that adaptively amplifies preferred responses, suppresses dispreferred ones, and encourages exploration of potentially superior solutions during training.Extensive experiments across six representative code preference tasks validate our theoretical hypotheses and demonstrate that APO consistently matches or surpasses the performance of existing SFT and S&D strategies.Our work provides both theoretical foundations and practical guidance for selecting appropriate training strategies in different code preference alignment scenarios. |
2025-07-27 |
CIgrate: Automating CI Service Migration with Large Language Models
Continuous Integration (CI) configurations often need to be migrated between services (e.g., Travis CI to GitHub Actions) as projects evolve, due to changes in service capabilities, usage limits, or service deprecation.Previous studies reported that migration across CI services is a recurring need in open-source development.However, manual migration can be time-consuming and error-prone.The state-of-the-art approach, CIMig, addresses this challenge by analyzing past migration examples to create service-specific rules and produce equivalent configurations across CI services.However, its relatively low accuracy raises concerns about the overall feasibility of automated CI migration using rule-based techniques alone.Meanwhile, Large Language Models (LLMs) have demonstrated strong capabilities in code generation and transformation tasks, suggesting potential to improve the automation, usability, and generalizability of CI configuration migration. 0.847This registered report presents a study in which we aim to assess whether CI migration can be improved using LLMs.To this end, we propose CIgrate, an LLM-based framework for automatically migrating CI configurations.We plan to evaluate the performance of CIgrate compared to CIMig as a baseline, in different setups (a) zero-shot/few-shot prompting of LLMs for configuration migration and (b) fine-tuning an LLM on a dataset of already established CI service migrations.We will also seek developer feedback on the quality and usability of the generated configurations.We formulate research questions focusing on the accuracy of LLM-generated migrations versus ground truth and the output of CIMig.The expected contributions include the first LLM-powered approach for CI service migration, a comparative evaluation of its effectiveness compared to rule-based approaches, and insight into leveraging LLMs to support software configuration evolution. |
2025-07-27 |
When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions
Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. 0.924However, in practice, task descriptions frequently exhibit ambiguity, incompleteness, or internal contradictions.In this paper, we present the first empirical study examining the robustness of state-of-the-art code generation models when faced with such unclear task descriptions. 0.86We extend the HumanEval and MBPP benchmarks by systematically introducing realistic task descriptions flaws through guided mutation strategies, producing a dataset that mirrors the messiness of informal developer instructions.We evaluate multiple LLMs of varying sizes and architectures, analyzing their functional correctness and failure modes across task descriptions categories.Our findings reveal that even minor imperfections in task description phrasing can cause significant performance degradation, with contradictory task descriptions resulting in numerous logical errors.Moreover, while larger models tend to be more resilient than smaller variants, they are not immune to the challenges posed by unclear requirements.We further analyze semantic error patterns and identify correlations between description clarity, model behavior, and error types.Our results underscore the critical need for developing LLMs that are not only powerful but also robust to the imperfections inherent in natural user tasks, highlighting important considerations for improving model training strategies, designing more realistic evaluation benchmarks, and ensuring reliable deployment in practical software development environments. |
2025-07-25 |
Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security
As large language models (LLMs) increasingly integrate native code interpreters, they enable powerful real-time execution capabilities, substantially expanding their utility. 0.667However, such integrations introduce potential system-level cybersecurity threats, fundamentally different from prompt-based vulnerabilities.To systematically evaluate these interpreter-specific risks, we propose CIRCLE (Code-Interpreter Resilience Check for LLM Exploits), a simple benchmark comprising 1,260 prompts targeting CPU, memory, and disk resource exhaustion.Each risk category includes explicitly malicious ("direct") and plausibly benign ("indirect") prompt variants.Our automated evaluation framework assesses not only whether LLMs refuse or generates risky code, but also executes the generated code within the interpreter environment to evaluate code correctness, simplifications made by the LLM to make the code safe, or execution timeouts.Evaluating 7 commercially available models from OpenAI and Google, we uncover significant and inconsistent vulnerabilities.For instance, evaluations show substantial disparities even within providers - OpenAI's o4-mini correctly refuses risky requests at 7.1%, notably higher rates compared to GPT-4.1 at 0.5%.Results particularly underscore that indirect, socially-engineered prompts substantially weaken model defenses.This highlights an urgent need for interpreter-specific cybersecurity benchmarks, dedicated mitigation tools (e.g., guardrails), and clear industry standards to guide safe and responsible deployment of LLM interpreter integrations.The benchmark dataset and evaluation code are publicly released to foster further research. |
2025-07-25 |
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks.We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards.To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error.Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts.As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain.Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts.GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization. 0.642 |