AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
48
Papers today
8h
Update frequency
7
Days of history
Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
NLP
Large Language Models
Theory
- Introduces a principled framework for multi-agent decision-making in self-harm risk screening.
- Implements adaptive sampling strategies to efficiently allocate resources based on case complexity.
- Demonstrates significant reductions in false positive rates while maintaining recall across datasets.
- Provides a foundation for auditing and deploying AI systems in safety-critical behavioral health settings.
Read more
Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
Summary
This paper addresses the challenges of self-harm risk screening in behavioral health using adaptive multi-agent large language model (LLM) systems. Traditional evaluation methods, such as LLM-as-a-judge, lack reliability indicators and do not account for error accumulation across multiple LLM judgments, which is critical in safety-sensitive environments. The authors propose a statistical framework structured as directed acyclic graphs (DAGs) that enhances decision-making processes by modeling each agent as a stochastic categorical decision. Key innovations include tighter agent-level performance confidence bounds, a bandit-based adaptive sampling strategy that focuses on input difficulty, and guarantees on regret that ensure logarithmic error growth during deployment. The system was evaluated on two datasets: the AEGIS 2.0 behavioral health subset and a stratified sample of SWMH Reddit posts. Results indicate that the adaptive sampling strategy achieved a false positive rate of 0.095 on AEGIS 2.0, significantly lower than the 0.159 rate of single-agent models, while maintaining similar false negative rates. This suggests that the proposed method improves precision without compromising recall, thus enhancing the reliability of self-harm risk assessments.
Methodology
The authors developed a statistical framework for multi-agent systems modeled as directed acyclic graphs (DAGs). Each agent's decision-making is treated as a stochastic categorical process, with enhancements including tighter performance confidence bounds and an adaptive sampling strategy based on the complexity of cases. The system's performance was empirically evaluated using labeled datasets in behavioral health.
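The paper's exact bandit procedure is not given in this summary, but a minimal sketch of the idea (spend extra LLM judgments on the cases whose current verdict is most uncertain) might look like the following; `query_agent` and the UCB-style scoring are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def adaptive_screening(cases, query_agent, max_extra_queries=20, c=1.0):
    """Illustrative UCB-style allocation: spend additional agent judgments on the
    cases whose current vote is most uncertain (hypothetical interface)."""
    votes = {i: [query_agent(case)] for i, case in enumerate(cases)}  # 1 = flag, 0 = safe
    for _ in range(max_extra_queries):
        scores = {}
        total = sum(len(v) for v in votes.values())
        for i, v in votes.items():
            p = np.mean(v)
            uncertainty = p * (1 - p)                              # largest near a split vote
            bonus = c * np.sqrt(np.log(1 + total) / len(v))        # favor under-sampled cases
            scores[i] = uncertainty + bonus
        target = max(scores, key=scores.get)                       # hardest case so far
        votes[target].append(query_agent(cases[target]))           # one more agent judgment
    return {i: int(np.mean(v) >= 0.5) for i, v in votes.items()}
```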
Results
The adaptive sampling strategy achieved the lowest false positive rate of 0.095 on the AEGIS 2.0 dataset, compared to 0.159 for single-agent models, indicating a 40% reduction in incorrect flagging of safe content. The false negative rates remained comparable across conditions, suggesting improved precision without sacrificing recall.
Implications
The findings suggest that adaptive sampling can significantly enhance the reliability of AI systems in behavioral health, reducing the burden on clinicians while ensuring patient safety. The framework can be applied to other clinical settings requiring staged decision-making and risk assessment.
Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning
Reinforcement Learning
Generative Models
Multimodal
- DROL introduces a dynamic routing mechanism that allows for flexible action selection in offline RL.
- The method preserves local action support rather than fixed correspondence to a teacher action.
- DROL demonstrates improved performance on multimodal benchmarks while maintaining efficient inference.
- The routing mechanism enables ownership of action regions to shift during training, enhancing learning dynamics.
Read more
Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning
Summary
The paper introduces DROL (Dynamic Routing for Offline Reinforcement Learning), a novel approach to enhance one-step offline reinforcement learning (RL) actors. Traditional methods struggle to balance improving action quality against staying within the support of a fixed dataset. DROL addresses this by implementing a latent-conditioned actor that samples multiple candidate actions for each state, allowing for dynamic routing of actions based on their proximity to the dataset's supported actions. This method enables the actor to focus on the most relevant actions while preserving the support from the dataset, rather than being tethered to a single reference action. The authors argue that this flexibility in routing allows for better local improvements and avoids the compromises seen in previous methods. The effectiveness of DROL is demonstrated through evaluations on OGBench and D4RL benchmarks, showing competitive performance against existing one-step methods like FQL.
Methodology
DROL employs a one-step actor that samples K candidate actions from a latent prior for each state. It assigns each dataset action to its nearest candidate action and updates only the selected 'winner' using behavior cloning and critic guidance. This approach allows for dynamic ownership transfer of action regions during training, enabling local improvements without compromising inference efficiency.
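A rough sketch of the routing step described above is given below; the actor and critic interfaces, the squared-distance routing, and the loss weighting are assumptions for illustration rather than the authors' implementation.

```python
import torch

def drol_routing_loss(actor, critic, states, dataset_actions, K=8, alpha=1.0):
    """Sample K candidates per state, route each dataset action to its nearest
    candidate, and apply behavior cloning plus critic guidance only to that winner."""
    B, act_dim = dataset_actions.shape
    z = torch.randn(B, K, actor.latent_dim)                        # latent prior samples
    cand = actor(states.unsqueeze(1).expand(-1, K, -1), z)         # (B, K, act_dim)
    dist = (cand - dataset_actions.unsqueeze(1)).pow(2).sum(-1)    # (B, K) routing distances
    winner = dist.argmin(dim=1)                                    # nearest candidate index
    win_actions = cand[torch.arange(B), winner]                    # (B, act_dim)
    bc_loss = (win_actions - dataset_actions).pow(2).sum(-1).mean()
    q_loss = -critic(states, win_actions).mean()                   # push the winner toward high Q
    return bc_loss + alpha * q_loss
```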
Results
DROL was evaluated on OGBench and D4RL, where it showed competitive performance compared to the one-step FQL baseline. The method improved performance across various task groups in OGBench while maintaining strong results on specific tasks like AntMaze and Adroit.
Implications
The findings suggest that preserving local action support rather than fixed correspondence can lead to more effective offline RL strategies. DROL's dynamic routing approach could be applied to other multimodal learning scenarios, enhancing the flexibility and performance of generative models in reinforcement learning.
Unsupervised Learning of Inter-Object Relationships via Group Homomorphism
Computer Vision
Theory
Robotics
- Proposes a novel unsupervised learning method based on group homomorphism.
- Integrates object segmentation and motion extraction in a single framework.
- Demonstrates the ability to segment objects and understand their interactions without labeled data.
- Introduces algebraic constraints to achieve disentangled representations.
Read more
Unsupervised Learning of Inter-Object Relationships via Group Homomorphism
Summary
This paper presents an unsupervised representation learning method that models inter-object relationships through group homomorphism, aiming to mimic the cognitive development of preverbal infants. Unlike traditional deep learning approaches that rely heavily on statistical correlations, this method emphasizes the structural understanding of the world. The proposed model integrates object segmentation and motion law extraction from dynamic image sequences, using algebraic constraints to decompose pixel-level changes into meaningful transformation components such as translation and deformation. The authors demonstrate the model's capability to segment multiple objects without ground-truth labels and accurately map their relative movements into a one-dimensional latent space. This approach suggests that incorporating algebraic geometric constraints can lead to physically interpretable representations, enhancing AI's ability to acquire flexible intelligence akin to human cognition.
Methodology
The methodology involves three main steps: (1) Object segmentation to isolate individual objects from image sequences, (2) Separation of object motion using group homomorphism constraints to distinguish between different motion components, and (3) Extraction of multi-object interactions by relativizing the motion of each object from the perspective of others.
Results
The model successfully segments multiple objects into individual slots and accurately maps their relative movements into a structured latent space. This indicates that the introduction of algebraic constraints allows for the acquisition of interpretable representations that reflect the underlying physical laws governing object interactions.
Implications
The findings suggest a new approach for developing AI systems that can learn and adapt to novel situations with limited data, similar to human cognitive processes. This could lead to advancements in creating artificial systems with developmental intelligence, enhancing their robustness and flexibility in real-world applications.
Reinforcing privacy reasoning in LLMs via normative simulacra from fiction
NLP
Large Language Models
Reinforcement Learning
- Introduces normative simulacra from fiction as a method to enhance LLM privacy reasoning.
- Utilizes a two-stage training process combining supervised fine-tuning and reinforcement learning.
- Implements a composite reward function to evaluate privacy reasoning based on contextual norms.
- Demonstrates improved alignment of LLM outputs with human privacy expectations across multiple benchmarks.
Read more
Reinforcing privacy reasoning in LLMs via normative simulacra from fiction
Summary
This paper addresses the misalignment between the information handling practices of Large Language Models (LLMs) and the contextual privacy expectations of users. It leverages the concept of Contextual Integrity (CI) to define privacy as the appropriate flow of information according to context-relative norms. The authors propose a novel approach that extracts normative simulacra—structured representations of norms and information flows—from fiction novels to fine-tune LLMs. The methodology involves a two-stage training process: first, supervised fine-tuning (SFT) to encourage privacy reasoning based on normative examples, followed by Group Relative Policy Optimization (GRPO) to refine this reasoning with a normatively-grounded composite reward function. This function evaluates the model's outputs based on task clarity, structural completeness, and context identification, among other criteria. To prevent overfitting, a contrastive scoring mechanism is introduced, comparing model outputs against both correct and incorrect normative contexts. The evaluation is conducted on five CI-aligned benchmarks, showing that the proposed methods improve the recognition of privacy-relevant situations and align LLM outputs with human privacy expectations, particularly excelling in law compliance benchmarks. The findings suggest that fiction-derived normative simulacra can effectively teach contextual privacy reasoning applicable to real-world scenarios.
Methodology
The authors extract normative simulacra from fiction novels and employ a two-stage fine-tuning approach for LLMs. The first stage involves supervised fine-tuning (SFT) to ground the model in normative reasoning, followed by Group Relative Policy Optimization (GRPO) that refines the model's outputs using a composite reward function. A contrastive scoring mechanism is also introduced to mitigate overfitting.
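As a toy illustration of how such a composite, contrastively scored reward could be assembled (the criterion names, weighting, and margin term are assumptions, not the paper's exact reward function):

```python
def composite_reward(output, correct_context, wrong_context, judges, weights=None):
    """Toy composite reward: a weighted sum of rubric scores (e.g., task clarity,
    structural completeness, context identification) plus a contrastive term that
    rewards outputs scoring higher under the correct context than under a wrong one.
    `judges` maps criterion name -> callable(output, context) -> score in [0, 1]."""
    weights = weights or {name: 1.0 for name in judges}
    score_correct = sum(weights[n] * fn(output, correct_context) for n, fn in judges.items())
    score_wrong = sum(weights[n] * fn(output, wrong_context) for n, fn in judges.items())
    contrastive_margin = score_correct - score_wrong       # penalize context-insensitive outputs
    return score_correct + max(0.0, contrastive_margin)
```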
Results
The proposed methods resulted in a conservative prior for restricting information flow, improving the model's recognition of privacy-relevant situations. GRPO with normative grounding achieved the highest scores on law compliance benchmarks and showed strong correlation with crowdsourced human privacy expectations, indicating effective transfer of contextual privacy reasoning from fiction to real-world applications.
Implications
The findings suggest that integrating normative reasoning from fiction into LLM training can enhance the ethical decision-making capabilities of AI systems, particularly in contexts requiring nuanced understanding of privacy norms. This approach could inform the development of more socially aware AI agents that align better with user expectations and societal norms.
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
Large Language Models
Efficient ML
NLP
- Absorber LLM preserves causal relationships between historical contexts and future inferences.
- The method reduces memory consumption during inference while maintaining model performance.
- Causal synchronization is introduced as a mechanism for effective context absorption.
- Empirical results show superior performance over traditional transformers and linear-time models.
Read more
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
Summary
The paper introduces Absorber LLM, a novel approach to address the computational inefficiencies of transformers during inference, particularly in long-context scenarios. Traditional transformers face challenges due to their quadratic computational complexity associated with self-attention mechanisms, which limits their ability to process long streams of data effectively. Existing alternatives, such as RNNs and SSMs, compress historical data into fixed-size states, leading to a loss of long-tail dependencies. The authors propose a self-supervised causal synchronization method that allows a contextless model to absorb historical contexts into its parameters while preserving the causal relationships necessary for future inferences. This is achieved by synchronizing the internal behaviors of the updated model with the original full-context model. The results demonstrate that Absorber LLM significantly reduces inference memory requirements and improves accuracy compared to prior methods that utilize parameter memory, paving the way for scalable and efficient long-context inference in real-world applications.
Methodology
The authors develop Absorber LLM by formulating a self-supervised optimization objective that synchronizes the behaviors of a contextless model with a full-context model. This involves absorbing historical contexts into model parameters while ensuring that the updated model behaves identically to the original model during future inferences. The synchronization process is designed to maintain the causal effects of the absorbed contexts.
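A minimal sketch of what such a synchronization objective could look like, assuming Hugging Face-style causal LMs that return `.logits`; the KL form and slicing are an assumed instantiation, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def causal_sync_loss(student, teacher, context_ids, future_ids):
    """The 'absorber' student, which no longer sees the context, is trained so its
    next-token distributions on future text match a frozen teacher that still
    conditions on the full context."""
    with torch.no_grad():
        full = torch.cat([context_ids, future_ids], dim=1)
        teacher_logits = teacher(full).logits[:, context_ids.size(1):, :]
    student_logits = student(future_ids).logits             # context absorbed into the weights
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        reduction="batchmean",
        log_target=True,
    )
```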
Results
Experiments conducted on long-context and streaming benchmarks reveal that Absorber LLM outperforms traditional transformer models in terms of computational efficiency and accuracy. The method demonstrates significant improvements over previous parameter-as-memory approaches, validating its effectiveness for real-world applications requiring continuous data absorption.
Implications
The findings suggest that Absorber LLM can be applied in scenarios where efficient processing of long-context data is critical, such as in conversational AI, real-time data analysis, and applications requiring lifelong learning capabilities. This approach could lead to more scalable and efficient models in various domains.
Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions
Theory
Optimization
- SupSplitLog is the first algorithm for logistic bandits achieving $\tilde{O}(\sqrt{dT})$ regret without context diversity assumptions.
- The algorithm improves the dependence on dimension d compared to existing methods.
- SupSplitLog employs a novel sample-splitting technique for constructing estimators.
- The method can adapt to provide a regret bound based on a data-dependent complexity measure.
Read more
Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions
Summary
This paper addresses the K-armed logistic bandit problem, focusing on achieving a regret bound of $\tilde{O}(\sqrt{dT})$ without relying on context diversity assumptions, which are common in existing literature. Traditional algorithms that achieve this optimal regret bound typically require strict conditions on the context process, such as a positive lower bound on the minimum eigenvalue of the context covariance matrix. These conditions can be overly restrictive, especially when context vectors are concentrated in low-dimensional subspaces. The authors introduce a novel algorithm called SupSplitLog, which is the first to achieve the desired regret bound without such assumptions. The key innovation of SupSplitLog lies in its sample-splitting technique, where collected samples are divided into two subsets for constructing estimators: one for an initial-point estimator and the other for a Newton-type correction procedure. This approach not only improves the dependence on the dimension d in the regret upper bound but also allows for a data-dependent complexity measure that circumvents direct dependence on d. The paper includes experimental results demonstrating the algorithm's effectiveness in high-dimensional settings with low-dimensional structures, confirming its theoretical advantages.
Methodology
The authors developed the SupSplitLog algorithm, which splits collected samples into two disjoint subsets. One subset is used to compute an initial-point estimator, while the other subset is utilized for a Newton-type one-step correction procedure. This splitting is designed to balance the accuracy of both estimators, leading to a high-probability bound on estimation error that does not depend on the dimension d.
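To make the splitting idea concrete, here is a generic sketch of a sample-split, one-step-corrected logistic estimator; `fit_initial` stands in for any consistent pilot estimator, and this is an illustration of the estimation pattern rather than the SupSplitLog algorithm itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def split_newton_estimate(X, y, fit_initial, rng=np.random.default_rng()):
    """Fit a pilot estimator on one half of the data, then apply a single Newton
    step for the logistic likelihood computed on the held-out half."""
    n = X.shape[0]
    idx = rng.permutation(n)
    A, B = idx[: n // 2], idx[n // 2:]
    theta0 = fit_initial(X[A], y[A])                       # initial-point estimator
    p = sigmoid(X[B] @ theta0)
    grad = X[B].T @ (y[B] - p)                             # score on the second half
    W = p * (1 - p)
    hess = X[B].T @ (X[B] * W[:, None])                    # observed Fisher information
    return theta0 + np.linalg.solve(hess, grad)            # Newton-type one-step correction
```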
Results
The proposed SupSplitLog algorithm achieves a regret bound of $\tilde{O}(\sqrt{dT})$ without context diversity assumptions, demonstrating a strictly improved dependence on dimension d compared to existing algorithms. Additionally, it can yield a regret bound that scales with a data-dependent complexity measure, which is advantageous in scenarios where context vectors are concentrated in low-dimensional subspaces. Experimental results corroborate the theoretical claims, showing that SupSplitLog outperforms baseline algorithms in high-dimensional problems.
Implications
The findings of this paper have significant implications for the design of algorithms in sequential decision-making problems, particularly in scenarios where context diversity cannot be guaranteed. The ability to achieve optimal regret bounds without stringent assumptions opens up new avenues for applying logistic bandits in real-world applications where context may be limited or concentrated.
LAF-Based Evaluation and UTTL-Based Learning Strategies with MIATTs
Theory
Optimization
Interpretability
- Introduces the EL-MIATTs framework to handle ambiguous true targets in ML.
- Develops LAF-based evaluation algorithms for coherent model assessment.
- Proposes UTTL-based learning strategies for effective model training under uncertainty.
- Analyzes the structural properties of task-specific MIATTs and their impact on evaluation and learning.
Read more
LAF-Based Evaluation and UTTL-Based Learning Strategies with MIATTs
Summary
This paper addresses the challenges of defining true targets in machine learning (ML) applications where ambiguity and subjectivity are prevalent. The author proposes the EL-MIATTs (Evaluation and Learning with Multiple Inaccurate True Targets) framework, which operates under the assumption that true targets may not exist objectively. To implement this framework, two complementary mechanisms are developed: LAF (Logical Assessment Formula)-based evaluation algorithms and UTTL (Undefinable True Target Learning)-based learning strategies. The paper analyzes task-specific MIATTs, focusing on their coverage and diversity, which influence evaluation and learning processes. LAF-grounded evaluation algorithms are introduced, allowing for coherent assessments of models with multiple partially correct targets, while UTTL-grounded learning strategies facilitate effective model training despite the undefined nature of true targets. The integration of these mechanisms bridges logical semantics with statistical optimization, providing a robust methodology for ML systems operating under uncertain supervision. The paper concludes with an application of the proposed methods, demonstrating their practical utility in real-world scenarios.
Methodology
The paper employs a theoretical framework (EL-MIATTs) to analyze task-specific MIATTs, developing LAF-based evaluation algorithms for model assessment and UTTL-based learning strategies for model training. It utilizes logical aggregation operations and loss functions (Dice and Cross-Entropy) to optimize learning processes.
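The Dice and cross-entropy combination referenced above is a standard construction; a generic binary-case sketch is shown below, with the weighting between the two terms left as an illustrative choice rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-6, w_dice=0.5, w_ce=0.5):
    """Binary Dice + cross-entropy loss; soft (partially correct) targets in [0, 1]
    are allowed, which fits learning under multiple inaccurate true targets."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return w_dice * dice + w_ce * ce
```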
Results
The proposed LAF-based evaluation algorithms achieve a balance between logical completeness and interpretability, while UTTL-based learning strategies provide flexibility in learning under epistemic uncertainty. The integration of these components enhances the practical deployment of ML systems in complex environments.
Implications
The findings suggest a new approach for developing ML systems that can operate effectively in real-world scenarios where true targets are ambiguous or undefined, potentially impacting fields such as medical diagnosis and social behavior analysis.
Fairness under uncertainty in sequential decisions
Reinforcement Learning
Theory
Optimization
- Introduces a taxonomy of uncertainties in sequential decision-making: model, feedback, and prediction uncertainty.
- Formalizes uncertainties using counterfactual logic and reinforcement learning techniques.
- Demonstrates potential harms of naive policies that ignore unobserved outcomes.
- Shows how uncertainty-aware exploration can improve fairness metrics in decision systems.
Read more
Fairness under uncertainty in sequential decisions
Summary
This paper addresses the challenges of ensuring fairness in machine learning (ML) within the context of sequential decision-making under uncertainty. Traditional fairness metrics have primarily focused on supervised learning, but many real-world applications involve dynamic decisions where past outcomes influence future actions. The authors introduce a taxonomy of uncertainties—model uncertainty, feedback uncertainty, and prediction uncertainty—that affect decision-making processes, particularly for historically marginalized groups. By formalizing these uncertainties using counterfactual logic and reinforcement learning techniques, the paper illustrates the potential harms of naive decision-making policies that overlook unobserved outcomes. The framework is demonstrated through algorithmic examples that show how to reduce outcome variance for disadvantaged groups while maintaining institutional objectives. Experiments reveal that unequal uncertainty and selective feedback can lead to disparities in decision systems, emphasizing the need for uncertainty-aware exploration to improve fairness metrics. The authors argue that explicitly accounting for uncertainty is crucial for fair and effective decision-making in sequential systems.
Methodology
The authors develop a framework that categorizes uncertainties in sequential decision-making and formalizes them using counterfactual logic and reinforcement learning. They conduct experiments on simulated data to illustrate the effects of unequal uncertainty and selective feedback on fairness outcomes.
Results
The experiments demonstrate that naive decision-making policies can lead to compounding exclusion for marginalized groups. The framework allows for the simultaneous reduction of outcome variance for these groups while achieving expected utility for decision-makers. Uncertainty-aware exploration significantly alters observed fairness metrics, indicating that accounting for uncertainty is essential in sequential decision systems.
Implications
This work provides a structured approach for researchers and practitioners to diagnose and govern fairness risks in sequential decision systems. It emphasizes the need for integrating uncertainty considerations into ML models used in high-stakes applications such as finance, healthcare, and criminal justice.
mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
Large Language Models
NLP
- Development of mcdok system for detecting machine-generated code in multiple programming languages.
- Adaptation of the mdok approach for code detection, focusing on binary and multi-class classification tasks.
- Use of various large language models (LLMs) tailored for better code understanding.
- Competitive results in all subtasks, indicating the effectiveness of the proposed methods.
Read more
mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
Summary
The paper presents the mcdok system, developed for SemEval-2026 Task 13, which focuses on the detection of machine-generated code across multiple programming languages. The task is divided into three subtasks: binary detection of machine-generated code, multi-class authorship detection, and hybrid code detection. The authors adapted their previous mdok approach, originally designed for text detection, to better suit the unique challenges of code detection. They explored various base models, specifically selecting those that enhance code understanding. The results demonstrate that their systems are competitive, although there is room for improvement compared to top-performing systems. The methodology involved fine-tuning large language models (LLMs) using a parameter-efficient approach, with careful attention to data selection and training processes. The authors provide insights into their experimental setup and the challenges faced in distinguishing between human-written and machine-generated code.
Methodology
The mcdok system was built upon the existing mdok framework, utilizing a parameter-efficient fine-tuning approach (QLoRA) with 4-bit quantization. The authors selected specific base models for each subtask: Gemma-3-27B-PT for binary detection, CodeGemma-7B for multi-class authorship detection, and Qwen2.5-Coder-14B for hybrid code detection. The training process involved careful data selection, balancing classes, and employing weighted cross-entropy loss for optimization.
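A hedged sketch of a QLoRA classification setup of this kind is given below, using the `transformers` and `peft` libraries; the LoRA rank, target modules, class weights, and the specific model identifier are placeholders, not the authors' configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantized base model with LoRA adapters for binary detection (illustrative values).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForSequenceClassification.from_pretrained(
    "google/codegemma-7b", num_labels=2, quantization_config=bnb
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS")
model = get_peft_model(model, lora)

# Weighted cross-entropy to counter class imbalance (weights are placeholders).
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]))
```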
Results
The mcdok systems achieved competitive performance across all three subtasks of SemEval-2026 Task 13. However, the authors noted a substantial gap to the top-performing systems, suggesting that further enhancements are possible. The results support the effectiveness of their fine-tuning strategy and model selection.
Implications
The findings have implications for improving the detection of machine-generated code, which is increasingly relevant in software development and code review processes. The open-source nature of the mcdok system allows for further exploration and application in various domains, including automated code analysis and security.
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
NLP
Large Language Models
Efficient ML
- Introduces a joint scaling law for looped language models with a recurrence-equivalence exponent φ = 0.46.
- Demonstrates that additional recurrences in looped LMs lead to increased validation loss at matched training compute.
- Establishes a five-axis evaluation suite to analyze the performance of looped LMs across different tasks.
- Finds that looped models prefer wider architectures with fewer training tokens compared to non-looped models.
Read more
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
Summary
This paper investigates the value of additional recurrences in looped language models (LMs) by establishing a joint scaling law that quantifies the relationship between validation loss, unique parameters, training tokens, and recurrence count. The authors conducted an iso-depth sweep across 116 pretraining runs with varying recurrence counts (r ∈ {1, 2, 4, 8}) and found a new recurrence-equivalence exponent φ = 0.46, indicating that each extra recurrence increases validation loss predictably at matched training compute. The study reveals that a looped model with a recurrence count of 4 can perform comparably to a non-looped model with 42% more parameters, while incurring the training cost of a model with 1 billion parameters. The findings suggest that the parameter-sharing cost in looped LMs is significant, particularly in tasks requiring parametric knowledge, while reasoning tasks remain less measurable at the current compute budgets. The paper also introduces a five-axis downstream evaluation suite to assess the performance of looped LMs across various capabilities, highlighting the importance of validation loss as a target for future research.
Methodology
The authors performed an iso-depth sweep across four prelude-recur-coda architectures with varying recurrence counts (r = 1, 2, 4, 8) and six compute budgets, totaling 116 pretraining runs. They fitted a joint scaling law to relate validation loss to unique parameters, training tokens, and recurrence count, isolating the effects of parameter sharing from effective depth.
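A sketch of how such a joint law could be fitted is shown below; the functional form (recurrence entering through an effective-parameter term $N \cdot r^{\varphi}$) and the initial guesses are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_model(X, E, A, alpha, B, beta, phi):
    """Hypothetical joint scaling law: unique parameters N, training tokens D,
    recurrence count r, with r acting through an effective-parameter term N * r**phi."""
    N, D, r = X
    return E + A / (N * r**phi) ** alpha + B / D**beta

def fit_scaling_law(runs):
    """runs: iterable of (unique_params, tokens, recurrences, validation_loss)."""
    N, D, r, L = (np.asarray(col, dtype=float) for col in zip(*runs))
    popt, _ = curve_fit(loss_model, (N, D, r), L,
                        p0=[1.5, 1e3, 0.3, 1e3, 0.3, 0.5], maxfev=20000)
    return dict(zip(["E", "A", "alpha", "B", "beta", "phi"], popt))
```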
Results
The fitted scaling law yielded a recurrence-equivalence exponent φ = 0.46, indicating that each additional recurrence does not fully substitute for unique blocks in terms of validation loss. The empirical results showed that at r = 4, a looped model with 410M parameters performed similarly to a non-looped model with 580M parameters, while incurring the training cost of a 1B non-looped model. The evaluation suite confirmed that validation loss is a reliable target for assessing looped LMs.
Implications
The findings provide a benchmark for evaluating future looped LM designs and training recipes, allowing researchers to predict the validation-loss cost associated with different recurrence counts. This work may influence the design of more efficient looped language models and improve their performance across various NLP tasks.
Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms
Graph Learning
- Introduces ResGIN-Att, a novel model for predicting drug synergy.
- Integrates molecular features, genomic profiles, and drug interactions for enhanced predictions.
- Employs residual connections to mitigate over-smoothing in deep learning layers.
- Demonstrates competitive performance on benchmark datasets, showcasing robustness and generalization.
Read more
Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms
Summary
This paper addresses the challenge of predicting drug synergy, which is crucial for developing effective combination therapies for complex diseases. Traditional single-drug therapies often lead to limited efficacy and drug resistance, making computational prediction methods essential for exploring drug combinations. The authors propose a novel model called Residual Graph Isomorphism Network integrated with an Attention mechanism (ResGIN-Att). This model enhances synergy prediction by integrating molecular structural features, cell-line genomic profiles, and drug-drug interactions. ResGIN-Att employs a residual graph isomorphism network to extract multi-scale topological features, mitigating over-smoothing in deep layers. An adaptive Long Short-Term Memory (LSTM) module fuses structural information from local to global scales, while a cross-attention module explicitly models drug-drug interactions and identifies key chemical substructures. The model was evaluated on five public benchmark datasets, demonstrating competitive performance against baseline methods, improved generalization capability, and robustness, thereby providing a promising tool for drug synergy prediction.
Methodology
The methodology involves a Residual Graph Isomorphism Network to extract multi-scale topological features from drug molecules, combined with an adaptive LSTM for structural information fusion and a cross-attention mechanism to model drug-drug interactions.
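A minimal residual GIN-style layer is sketched below in plain PyTorch over a dense adjacency matrix; this illustrates the aggregation-plus-residual idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ResGINLayer(nn.Module):
    """Residual GIN-style layer: (1 + eps) * h_v + sum of neighbor features, passed
    through an MLP, with a residual connection to counter over-smoothing in deep stacks."""
    def __init__(self, dim, eps=0.0):
        super().__init__()
        self.eps = nn.Parameter(torch.tensor(eps))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, h, adj):          # h: (num_nodes, dim), adj: dense (num_nodes, num_nodes)
        agg = (1.0 + self.eps) * h + adj @ h
        return self.norm(h + self.mlp(agg))
```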
Results
The ResGIN-Att model achieved competitive performance on five benchmark datasets, outperforming key baseline methods while demonstrating strong generalization and robustness in predicting drug synergy.
Implications
The findings suggest that ResGIN-Att can significantly accelerate the development of combination therapies by providing a reliable computational tool for predicting synergistic drug interactions, potentially leading to improved treatment outcomes for complex diseases.
Sharpness-Aware Poisoning: Enhancing Transferability of Injective Attacks on Recommender Systems
Optimization
Theory
- Introduces Sharpness-Aware Poisoning (SharpAP) to enhance the transferability of injective attacks on recommender systems.
- Addresses the limitations of using fixed surrogate models for generating poisoned data.
- Implements a min-max-min tri-level optimization framework to optimize poisoned data against the worst-case victim model.
- Demonstrates significant improvements in attack transferability through comprehensive experiments on real-world datasets.
Read more
Sharpness-Aware Poisoning: Enhancing Transferability of Injective Attacks on Recommender Systems
Summary
This paper addresses the vulnerability of Recommender Systems (RS) to injective attacks, where attackers create fake user profiles to manipulate item recommendations for unethical gains. Existing methods typically rely on a fixed surrogate model to generate poisoned data, which often leads to poor transferability when attacking different victim models due to structural discrepancies. The authors propose a novel attack method called Sharpness-Aware Poisoning (SharpAP), which utilizes sharpness-aware minimization to iteratively optimize poisoned data specifically against the worst-case victim model. This approach is formulated as a min-max-min tri-level optimization problem, allowing for the generation of more robust poisoned data that is less sensitive to model structure shifts. Experimental results on three real-world datasets demonstrate that SharpAP significantly enhances the transferability of attacks across various victim models, outperforming existing methods.
Methodology
The methodology involves the development of Sharpness-Aware Poisoning (SharpAP), which employs sharpness-aware minimization to identify and optimize against the worst-case victim model. The attack is structured as a tri-level optimization problem, allowing for iterative refinement of the poisoned data to enhance its effectiveness across different victim models.
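The inner sharpness-aware step that such a tri-level formulation relies on resembles a standard SAM perturbation; the generic sketch below is an assumed instantiation (perturb the surrogate's weights toward the locally worst case, then take gradients there), not the authors' exact procedure.

```python
import torch

def sharpness_aware_grad(model, loss_fn, batch, rho=0.05):
    """Compute gradients at the worst-case weight perturbation within an L2 ball of
    radius rho; these gradients would then drive the update of the poisoned data."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, batch)
    grads = torch.autograd.grad(loss, params)
    scale = rho / (torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12)
    eps = [g * scale for g in grads]
    with torch.no_grad():                          # ascend to the worst-case weights
        for p, e in zip(params, eps):
            p.add_(e)
    worst_loss = loss_fn(model, batch)             # loss at the perturbed point
    worst_loss.backward()                          # gradients w.r.t. the perturbed model
    with torch.no_grad():                          # restore the original weights
        for p, e in zip(params, eps):
            p.sub_(e)
```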
Results
The experimental evaluations show that SharpAP significantly improves the transferability of poisoning attacks compared to existing methods, achieving better performance across various victim models such as BPR, LightGCN, and SGL. The results indicate that the proposed method effectively mitigates the overfitting issue associated with fixed surrogate models.
Implications
The findings suggest that incorporating sharpness-aware optimization principles can lead to more effective adversarial strategies in recommender systems, potentially influencing the design of more robust systems against such attacks. This work may also have broader implications for security in machine learning applications where model transferability is a concern.
Iterative Model-Learning Scheme via Gaussian Processes for Nonlinear Model Predictive Control of (Semi-)Batch Processes
Optimization
- Introduces GP-MLMPC, an iterative model-learning scheme for NMPC using Gaussian Processes.
- Demonstrates significant performance improvements in batch process control with limited initial data.
- Achieves an 83% reduction in tracking error within four iterations and a 17-fold increase in final product mass by the eighth.
- Utilizes uncertainty quantification from GPs to enforce chance constraints for safe operation.
Read more
Iterative Model-Learning Scheme via Gaussian Processes for Nonlinear Model Predictive Control of (Semi-)Batch Processes
Summary
This paper presents a novel approach to Nonlinear Model Predictive Control (NMPC) for batch processes using Gaussian Processes (GPs) in an iterative model-learning scheme (GP-MLMPC). The authors highlight the challenges of implementing NMPC due to the high costs and unavailability of dynamic models for batch processes, which are typically nonlinear and transient. The proposed GP-MLMPC method initializes with data from a single trajectory and iteratively updates the GP model with new observations from each batch run. This approach allows for continuous improvement in control performance while ensuring safe operation through the formulation of chance constraints based on GP uncertainty quantification. The method is validated through simulations on a semi-batch polymerization reactor, demonstrating significant reductions in tracking error and substantial increases in final product mass over iterations. The results indicate that GP-MLMPC can achieve performance comparable to full-model NMPC while remaining sample-efficient, making it a promising solution for controlling nonlinear batch processes without requiring detailed mechanistic models.
Methodology
The methodology involves initializing the GP-MLMPC with data from a single initial trajectory, iteratively applying NMPC embedded with GPs to run batches, and updating the GP model with new observations after each batch. The approach incorporates uncertainty quantification to formulate chance constraints, ensuring safe operation during the control process.
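A small sketch of the GP surrogate update and chance-constraint check is given below using scikit-learn; the kernel choice, constraint limit, and confidence level are assumptions used only to illustrate the mean-plus-margin form of the constraint.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# GP surrogate for one constrained output (e.g., a temperature limit); illustrative settings.
gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)

def update_model(X_hist, y_hist):
    """Refit the GP after each batch run using all observations collected so far."""
    gp.fit(X_hist, y_hist)

def satisfies_chance_constraint(x_candidate, limit=350.0, z=1.645):
    """Approximate P(y <= limit) >= 95% by requiring mean + z * std <= limit."""
    mean, std = gp.predict(np.atleast_2d(x_candidate), return_std=True)
    return bool(mean[0] + z * std[0] <= limit)
```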
Results
The GP-MLMPC scheme demonstrated an 83% reduction in tracking error after four batch iterations and a 17-fold increase in final product mass by the eighth iteration, compared to the initial trajectory. The performance of GP-MLMPC was found to be on par with that of full-model NMPC, indicating its effectiveness in learning optimal control strategies.
Implications
The proposed GP-MLMPC scheme has significant implications for the control of nonlinear batch processes, particularly in industries where dynamic models are expensive or impractical to obtain. Its sample-efficient learning approach can facilitate the adoption of advanced control strategies in various manufacturing settings, enhancing productivity and product quality.
Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics
Large Language Models
Time Series
Interpretability
- Introduces a LLM-guided framework for sepsis early warning that enhances interpretability.
- Combines spatiotemporal feature extraction with clinical reasoning prompts to improve prediction accuracy.
- Achieves superior AUC scores compared to traditional models, indicating better predictive performance.
- Provides interpretable trajectories that assist clinicians in understanding physiological deterioration.
Read more
Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics
Summary
This paper addresses the challenge of timely and interpretable early warning for sepsis, a life-threatening condition characterized by complex physiological dynamics. Traditional models often provide accurate predictions but lack interpretability, which is crucial for clinical decision-making. The authors propose a novel framework that combines a Large Language Model (LLM) with a temporal simulation approach to model physiological trajectories leading to sepsis onset. The framework includes a spatiotemporal feature extraction module to capture dynamic relationships among vital signs, a Medical Prompt-as-Prefix module to integrate clinical reasoning into the LLM, and an agent-based post-processing component to ensure predictions remain within physiologically plausible ranges. By simulating key physiological indicators and classifying sepsis onset, the model offers transparent predictions that align with clinical judgment. Evaluated on the MIMIC-IV and eICU databases, the proposed method demonstrates superior performance with AUC scores ranging from 0.861 to 0.903 across various pre-onset prediction tasks, outperforming conventional deep learning and rule-based approaches. Importantly, it provides interpretable trajectories and risk trends, aiding clinicians in early intervention and personalized decision-making in intensive care settings.
Methodology
The proposed framework consists of three main components: a spatiotemporal feature extraction module to analyze multivariate vital signs, a Medical Prompt-as-Prefix module to embed clinical reasoning into the LLM, and an agent-based post-processing component to ensure predictions are physiologically plausible. The model first simulates the evolution of physiological indicators and then classifies sepsis onset based on these simulated trajectories.
Results
The framework was evaluated using the MIMIC-IV and eICU databases, achieving AUC scores between 0.861 and 0.903 for prediction horizons ranging from 24 down to 4 hours before sepsis onset, significantly outperforming conventional deep learning and rule-based methods.
Implications
The results suggest that the proposed LLM-guided framework can enhance early sepsis detection and intervention strategies in clinical settings, potentially improving patient outcomes through timely and interpretable predictions.
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
Reinforcement Learning
Optimization
Robotics
- Introduces a KL-regularized approach to trajectory and policy optimization that leverages differentiable dynamics.
- Develops a novel tempered sequential Monte Carlo (TSMC) method for efficient sampling from multimodal distributions.
- Combines sampling-based exploration with gradient-based optimization to enhance performance in trajectory and policy optimization tasks.
- Demonstrates the effectiveness of TSMC through experiments that show superior performance compared to existing methods.
Read more
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
Summary
This paper presents a novel sampling-based framework for finite-horizon trajectory and policy optimization (TPO) under differentiable dynamics, framing controller design as an inference problem. The approach minimizes a KL-regularized expected trajectory cost, leading to an optimal 'Boltzmann-tilted' distribution over controller parameters that emphasizes low-cost solutions as the temperature decreases. To efficiently sample from this potentially multimodal target distribution, the authors introduce tempered sequential Monte Carlo (TSMC), which employs an annealing scheme to adaptively reweight and resample particles along a tempering path from a prior to the target distribution. Additionally, Hamiltonian Monte Carlo (HMC) rejuvenation is utilized to maintain diversity and leverage exact gradients from differentiable trajectory rollouts. The TSMC method is extended for policy optimization through a deterministic empirical approximation of the initial-state distribution and an extended-space construction treating rollout randomness as auxiliary variables. Experimental results demonstrate that TSMC is broadly applicable and outperforms state-of-the-art baselines in trajectory and policy optimization benchmarks.
Methodology
The methodology involves formulating the trajectory and policy optimization problem as a KL-regularized inference problem, allowing for the derivation of an optimal distribution over controller parameters. The tempered sequential Monte Carlo (TSMC) method is employed to sample from this distribution, utilizing an annealing process and Hamiltonian Monte Carlo for diversity and exploration.
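A generic tempered SMC loop targeting a Boltzmann-tilted distribution is sketched below; the linear tempering schedule and the `rejuvenate` placeholder (standing in for the HMC moves described above) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def tempered_smc(sample_prior, cost, rejuvenate, n_particles=256, n_steps=20):
    """Sample from a target proportional to prior(theta) * exp(-beta * cost(theta)),
    annealing beta from 0 to 1 with reweighting, resampling, and rejuvenation."""
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    particles = sample_prior(n_particles)                      # (n_particles, dim)
    costs = np.array([cost(p) for p in particles])
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w = -(b - b_prev) * costs                          # incremental importance weights
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = np.random.choice(n_particles, size=n_particles, p=w)   # resample
        particles = rejuvenate(particles[idx], b)              # restore diversity (e.g., HMC)
        costs = np.array([cost(p) for p in particles])
    return particles
```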
Results
The results indicate that TSMC effectively samples from the Boltzmann-tilted distribution, achieving improved performance in trajectory and policy optimization tasks compared to state-of-the-art methods. The experiments validate the applicability of TSMC across various benchmarks.
Implications
The proposed framework has significant implications for robotics and control systems, where efficient trajectory and policy optimization is crucial. It also opens avenues for further research in combining sampling and gradient-based methods in differentiable dynamics contexts.
Fast Neural-Network Approximation of Active Target Search Under Uncertainty
Robotics
Optimization
Efficient ML
- Introduces a CNN-based approximation for Active Search (AS) and Intermittent Active Search (ASI) to enhance computational efficiency.
- Utilizes a multi-channel grid representation to encode critical information for decision-making.
- Demonstrates that the CNN achieves detection rates comparable to AS and ASI while significantly reducing computation time.
- Validates the approach through extensive simulations with both uniform and clustered target distributions.
Read more
Fast Neural-Network Approximation of Active Target Search Under Uncertainty
Summary
This paper addresses the challenge of searching for an unknown number of stationary targets in uncertain environments using a mobile agent. The authors propose a novel approach that utilizes a convolutional neural network (CNN) to approximate decision-making processes of existing planners, specifically Active Search (AS) and its Intermittent variant (ASI). These planners, while effective in detecting targets, are computationally expensive due to their reliance on online optimization. To mitigate this, the CNN is trained on data generated from AS/ASI, using a multi-channel grid that encodes various inputs such as target beliefs and agent position. The proposed method significantly reduces computational costs while maintaining detection rates comparable to traditional methods. Extensive simulations demonstrate that the CNN-based approach achieves similar performance in target detection while being orders of magnitude faster, thus providing a more efficient solution for active target search under uncertainty.
Methodology
The authors employ a convolutional neural network (CNN) trained on data from existing planners (AS and ASI) to approximate their decision-making processes. The CNN takes as input a multi-channel spatial grid that includes particle filter outputs, visitation counts, boundary masks, and the agent's position, predicting the next waypoint for the agent. The approach also incorporates a Probability Hypothesis Density (PHD) filter to estimate the expected number of targets in the environment.
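A toy stand-in for such a policy network is sketched below; the channel count, depth, and the argmax waypoint selection are assumptions chosen only to show the grid-in, score-map-out shape of the model.

```python
import torch
import torch.nn as nn

class WaypointCNN(nn.Module):
    """Multi-channel belief grid in, per-cell waypoint score map out."""
    def __init__(self, in_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),                   # one score per grid cell
        )

    def forward(self, grid):                       # grid: (B, C, H, W)
        scores = self.net(grid).squeeze(1)         # (B, H, W)
        return scores                              # next waypoint: scores.flatten(1).argmax(1)
```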
Results
The simulations show that the CNN-based method achieves detection rates statistically indistinguishable from the original AS and ASI methods while being orders of magnitude faster in terms of computational efficiency. The results validate the effectiveness of the CNN in approximating the decision-making process of traditional planners.
Implications
This research has significant implications for robotics applications, particularly in scenarios requiring efficient target search and exploration under uncertainty, such as search-and-rescue operations, environmental monitoring, and surveillance. The proposed method can enhance the operational efficiency of autonomous agents in real-world applications.
Multi-Task Optimization over Networks of Tasks
Optimization
Robotics
Graph Learning
- Introduction of MONET, a graph-based multi-task optimization algorithm.
- MONET addresses scalability issues of existing multi-task optimization methods.
- Combines individual and social learning strategies for knowledge transfer.
- Empirical results show MONET outperforms MAP-Elites variants across multiple domains.
Read more
Multi-Task Optimization over Networks of Tasks
Summary
This paper introduces MONET (Multi-Task Optimization over Networks of Tasks), a novel algorithm designed to address the limitations of existing multi-task optimization methods, particularly in scaling to large task sets. Traditional population-based methods struggle with scalability, while MAP-Elites variants often rely on fixed, discretized archives that overlook the topology of the task space. MONET models the task space as a graph, where tasks are represented as nodes and edges facilitate knowledge transfer between tasks. This graph-based approach enables both individual learning through mutation and social learning via crossover with neighboring tasks. The authors conducted a hyperparameter study across four domains—archery, arm, cartpole (5,000 tasks each), and hexapod (2,000 tasks)—demonstrating that MONET outperforms existing MAP-Elites-based methods in all scenarios. The findings suggest that leveraging task-space structure through a network of task neighborhoods is a promising strategy for effective multi-task optimization at scale.
Methodology
MONET utilizes a graph representation of the task space, where tasks are nodes and edges represent relationships between them. The algorithm employs two learning mechanisms: individual learning, which refines solutions through mutation, and social learning, which generates new candidates by combining solutions from neighboring tasks via crossover. A hyperparameter study was conducted to analyze the effects of neighborhood size, type, and the balance between individual and social learning.
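One illustrative MONET-style update over the task graph might look like the following; the interfaces (`mutate`, `crossover`, `fitness`) and the elitist replacement rule are assumptions used to convey the individual-versus-social learning split, not the authors' implementation.

```python
import random

def monet_step(graph, elites, fitness, mutate, crossover, p_social=0.5):
    """For each task node, either mutate its own elite (individual learning) or cross
    it with a neighboring task's elite (social learning), keeping whichever is better.
    graph: task -> list of neighbor tasks; elites: task -> current best solution."""
    for task in graph:
        parent = elites[task]
        if graph[task] and random.random() < p_social:
            neighbor = random.choice(graph[task])
            child = crossover(parent, elites[neighbor])   # knowledge transfer along an edge
        else:
            child = mutate(parent)                        # local refinement
        if fitness(task, child) > fitness(task, parent):  # per-task elitist replacement
            elites[task] = child
    return elites
```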
Results
MONET demonstrated superior performance compared to the MAP-Elites-based baselines MT-ME and PT-ME across all tested domains, indicating that the graph-based approach effectively exploits the structure of the task space. The systematic hyperparameter study provided insights into the optimal configurations for different task environments, confirming that a single MONET configuration can generalize well across diverse tasks.
Implications
The findings suggest that MONET could be applied in real-world scenarios where agents need to adapt to changing configurations efficiently, such as in robotics and control systems. The ability to optimize multiple tasks simultaneously can lead to more robust and adaptable systems in dynamic environments.
Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon
Theory
- Introduces a formal specification language for ML kernel contracts to address implicit agreements in computations.
- Defines twelve contract classes based on empirical evidence, covering various failure modes.
- Establishes a three-state calibration requirement for testable contracts.
- Demonstrates the application of the framework through three case studies of kernel failures.
Read more
Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon
Summary
This paper addresses the critical issue of implicit contracts in machine learning (ML) kernels, which can lead to discrepancies in computations across different hardware platforms. The author proposes a formal specification language for kernel contracts, consisting of eight components: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. The paper categorizes twelve contract classes based on empirical evidence, covering various failure modes such as precision, ordering, compiler-induced errors, and exceptional values. A three-state calibration requirement is introduced, ensuring that each contract includes at least one reference-conforming implementation and one contract-violating implementation that passes basic functional tests. The framework is applied to three case studies, demonstrating how informal diagnoses of kernel failures can be mapped to specific contract violations with measurable signatures. The paper concludes by discussing the relevance of kernel contracts to conformance assessment, suggesting that they can serve as normative references for grading compliance, similar to existing certification schemes in other domains.
Methodology
The author develops a specification language for kernel contracts, categorizes contract classes based on empirical literature, and applies the framework to real-world case studies of kernel discrepancies. The methodology includes defining contract components, establishing calibration requirements, and analyzing documented incidents of kernel failures.
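The eight contract components listed in the summary translate naturally into a record type; the sketch below is a direct transcription of those component names, with field semantics beyond the names left illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class KernelContract:
    """One kernel contract, mirroring the eight components named above."""
    identifier: str                      # unique contract name
    scope: str                           # kernels / platforms the contract covers
    precondition: Callable[..., bool]    # input conditions under which it applies
    postcondition: Callable[..., bool]   # property the output must satisfy
    tolerance: float                     # allowed numerical deviation
    reference_oracle: Callable           # reference implementation to compare against
    measurement_protocol: str            # how conformance is measured
    violation_signature: str             # observable symptom when the contract breaks
```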
Results
The application of the proposed framework to three case studies revealed specific contract violations corresponding to informal diagnoses of kernel issues. The results highlight the importance of formalizing kernel contracts to arbitrate discrepancies in computations across different hardware platforms.
Implications
The introduction of kernel contracts has significant implications for improving the reliability and correctness of ML kernels across heterogeneous silicon. It provides a structured approach for developers to specify and verify kernel behavior, potentially reducing silent failures and enhancing cross-platform compatibility. Furthermore, it lays the groundwork for future certification schemes in ML hardware.
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
Optimization
Theory
Efficient ML
- GEM is a family of $C^{2N}$-smooth activation functions that address the limitations of ReLU.
- The introduction of E-GEM and SE-GEM variants allows for greater flexibility and performance optimization.
- GEM outperforms GELU in several benchmarks, particularly in deep CNNs and transformers.
- The smoothness parameter N plays a crucial role in determining the effectiveness of the activation function based on the architecture used.
Read more
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
Summary
This paper introduces a new family of activation functions called Geometric Monomial (GEM), which are designed to be $C^{2N}$-smooth and provide an alternative to traditional activation functions like ReLU. The GEM functions are based on a log-logistic cumulative distribution function (CDF) and are constructed to maintain the performance characteristics of ReLU while improving smoothness and stability in gradient-based optimization. The author presents three variants: the base GEM, an ε-parameterized generalization (E-GEM), and a piecewise variant (SE-GEM) that addresses issues like dead neurons. Through extensive experimentation across various benchmarks, including image classification on CIFAR-10/100 and language modeling with BERT and GPT-2, the paper demonstrates that GEM can outperform GELU and other existing activation functions in specific contexts, particularly when tuned for different network architectures. The findings suggest that the choice of the smoothness parameter N can significantly impact performance, with N=1 being optimal for CNNs and N=2 for transformers.
Methodology
The paper employs a combination of theoretical analysis and empirical benchmarking to evaluate the performance of GEM against existing activation functions. An N-ablation study is conducted to determine the optimal smoothness parameter, and various metrics such as test accuracy, validation loss, and perplexity are used to assess performance across multiple datasets and architectures.
Results
GEM achieves significant improvements in performance metrics, reducing the GELU deficit on CIFAR-100 + ResNet-56 from 6.10% to 0.62%. On MNIST, E-GEM ties the best baseline at 99.23%. SE-GEM surpasses GELU on CIFAR-10 + ResNet-56 (92.51% vs 92.44%). GEM also achieves the lowest perplexity on GPT-2 (72.57 vs 73.76 for GELU) and the best validation loss on BERT-small (6.656).
Implications
The introduction of GEM and its variants could lead to more efficient training of deep neural networks, particularly in applications requiring stable gradient propagation. This could enhance performance in both computer vision and natural language processing tasks, making GEM a valuable addition to the toolkit of machine learning practitioners.
The Path Not Taken: Duality in Reasoning about Program Execution
Large Language Models
- Current benchmarks for LLMs focus too narrowly on single execution paths, limiting their evaluation of program understanding.
- The proposed duality framework introduces forward and backward reasoning tasks to better assess LLMs' causal understanding of program execution.
- DEXBENCH, the new benchmark, comprises 445 paired instances that facilitate a more robust evaluation of LLMs.
- Results show that dual-path reasoning can reveal limitations in models that perform well in isolation but struggle under joint evaluation.
Read more
The Path Not Taken: Duality in Reasoning about Program Execution
Summary
This paper addresses the limitations of existing benchmarks for evaluating large language models (LLMs) in understanding program execution. Current benchmarks primarily focus on predicting program properties based on specific inputs, which can lead to a narrow view of dynamic code reasoning and potential data contamination. The authors propose a novel framework that emphasizes the duality in reasoning about program execution through two complementary tasks: predicting a program's observed behavior for a given input (forward reasoning) and inferring how the input must be mutated to achieve a specific behavioral objective (backward reasoning). This dual-path approach is instantiated in a new benchmark called DEXBENCH, which consists of 445 paired instances and evaluates 13 LLMs. The results indicate that dual-path reasoning serves as a robust proxy for understanding dynamic code execution, revealing that strong performance in isolated reasoning tasks does not necessarily translate to success in joint evaluations. The findings suggest that current models may not fully grasp the causal relationships in program execution, highlighting the need for more comprehensive evaluation methods.
Methodology
The authors developed a dual-path reasoning framework that includes two tasks: forward reasoning, which predicts program properties along the execution path, and backward reasoning, which infers the necessary input mutations for a counterfactual path. They created the DEXBENCH benchmark from real-world programs and evaluated 13 LLMs across various model sizes.
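To make the forward/backward duality concrete, the following toy example (not drawn from DEXBENCH; the program and field names are invented for illustration) shows what a paired instance about one small program might look like:

```python
# Illustrative paired forward/backward instance about the same program.

def classify(n: int) -> str:
    if n % 2 == 0:
        return "even"
    return "odd"

paired_instance = {
    "program": "classify",
    "forward": {
        # Forward reasoning: predict the observed behavior for a concrete input.
        "input": 7,
        "question": "What does classify(7) return?",
        "answer": classify(7),                # "odd"
    },
    "backward": {
        # Backward reasoning: infer how the input must be mutated so execution
        # takes the path not taken (the counterfactual branch).
        "input": 7,
        "question": "How should the input be mutated so classify returns 'even'?",
        "answer": "increment n by 1 (e.g., 7 -> 8)",
    },
}

print(paired_instance["forward"]["answer"], "|", paired_instance["backward"]["answer"])
```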
Results
The evaluation revealed that dual-path reasoning provides a reliable proxy for understanding causal, state-aware reasoning in program execution. Strong performance in isolated tasks did not correlate with success in joint evaluations, indicating that existing models may lack a comprehensive understanding of execution flow.
Implications
The findings suggest that improving LLMs' understanding of program execution requires more nuanced evaluation methods that account for the duality of reasoning. This could lead to better performance in software engineering tasks, such as code generation, debugging, and program analysis.
Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models
Computer Vision
Multimodal
Interpretability
- Introduction of Sum-of-Checks framework for structured surgical safety assessment.
- Framework decomposes CVS criteria into expert-defined reasoning checks.
- Demonstrated improvement in accuracy and transparency of LVLM-based assessments.
- LVLMs show reliable performance on observational checks but variability on anatomical evidence.
Read more
Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models
Summary
This paper addresses the critical need for accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy to prevent bile duct injuries. The authors introduce 'Sum-of-Checks', a structured reasoning framework that decomposes CVS criteria into expert-defined verification checks based on clinically relevant visual evidence. By utilizing large vision-language models (LVLMs), the framework evaluates each check to produce binary judgments and justifications, which are then aggregated into criterion-level scores. The study evaluates the framework on the Endoscapes2023 benchmark using three advanced LVLMs, comparing it against various prompting strategies. Results indicate that Sum-of-Checks enhances frame-level mean average precision by 12-14% relative to the best baseline across all models and criteria. The analysis reveals that while LVLMs perform reliably on observational checks, they exhibit variability on decision-critical anatomical evidence, highlighting the importance of structured reasoning in surgical AI systems.
Methodology
The Sum-of-Checks framework decomposes CVS criteria into expert-defined reasoning checks that reflect clinically relevant visual evidence. Each check is evaluated by an LVLM, producing binary judgments and justifications. The outcomes are aggregated using a fixed, weighted scheme to compute criterion-level scores. The framework was tested on the Endoscapes2023 benchmark with comparisons to various prompting strategies.
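The aggregation step can be sketched as a weighted vote over binary check outcomes. The check names and weights below are hypothetical placeholders, not the paper's expert-defined checklist:

```python
# Minimal sketch of the aggregation described above: per-check binary judgments
# from an LVLM are combined with a fixed, weighted scheme into a criterion-level
# score.

def criterion_score(judgments: dict[str, bool], weights: dict[str, float]) -> float:
    """Weighted fraction of passed checks for one CVS criterion."""
    total = sum(weights.values())
    passed = sum(weights[name] for name, ok in judgments.items() if ok)
    return passed / total

# Hypothetical checks for one criterion.
judgments = {"hepatocystic_triangle_visible": True,
             "two_tubular_structures_seen": True,
             "no_additional_structures": False}
weights = {"hepatocystic_triangle_visible": 1.0,
           "two_tubular_structures_seen": 2.0,
           "no_additional_structures": 2.0}

print(round(criterion_score(judgments, weights), 2))  # 0.6
```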
Results
Sum-of-Checks improved average frame-level mean average precision by 12-14% compared to the best baseline across three LVLMs. The analysis of individual checks indicated that LVLMs reliably predicted observational checks but showed significant variability in decision-critical anatomical evidence checks.
Implications
The findings suggest that structuring surgical reasoning into expert-aligned verification checks can enhance the accuracy and transparency of AI systems in surgical contexts. This approach may lead to more reliable and auditable surgical AI applications, ultimately improving patient safety.
Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement
Reinforcement Learning
Computer Vision
Efficient ML
- Introduction of Reinforced Iterative Classification (RIC) as a novel approach to classification using reinforcement learning.
- RIC allows for iterative refinement of predictions, improving calibration and reducing overconfidence in model outputs.
- The framework provides a natural mechanism for adaptive computation, dynamically allocating resources based on input complexity.
- Empirical results show that RIC achieves competitive accuracy while enhancing calibration on multiple image classification benchmarks.
Read more
Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement
Summary
The paper addresses the limitations of standard supervised classification, which typically trains models to strictly mimic oracle labels in a single forward pass, leading to fixed compute budgets and overconfident predictions. To overcome these issues, the authors propose Reinforced Iterative Classification (RIC), a framework that utilizes Reinforcement Learning (RL) to iteratively refine a predictive distribution over classes. RIC employs a recurrent agent that receives rewards for incremental improvements in prediction quality, allowing for dynamic computation allocation based on input complexity. The authors demonstrate that RIC maintains competitive accuracy with traditional supervised methods while achieving better calibration on datasets such as CIFAR-10, SVHN, and ImageWoof. The framework also provides a natural halting mechanism, concentrating computation on resolvable inputs and terminating early on intractable ones. This approach not only reshapes the optimization landscape but also prevents the pathological overconfidence associated with standard cross-entropy loss, leading to more reliable predictions.
Methodology
The authors recast the classification task as a sequential decision-making process using a recurrent agent that refines a continuous predictive distribution. The agent is rewarded for improvements in prediction quality, and the optimization objective is structured to prevent overconfidence by anchoring logit scales at finite values. This approach contrasts with traditional supervised learning methods that rely on single-step predictions and cross-entropy loss.
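A minimal sketch of the reward signal, assuming prediction quality is measured by the log-probability of the true class (the paper's exact quality measure and halting rule are not reproduced here):

```python
import numpy as np

# At each refinement step the agent is rewarded for the *improvement* in
# prediction quality relative to the previous step.

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def refinement_rewards(logit_trajectory, true_class):
    """logit_trajectory: list of logit vectors, one per refinement step."""
    quality = [np.log(softmax(z)[true_class]) for z in logit_trajectory]
    # Reward at step t is the incremental gain in quality over step t-1.
    return [quality[t] - quality[t - 1] for t in range(1, len(quality))]

# A toy trajectory that gradually sharpens toward class 2 out of 3 classes.
trajectory = [np.array([0.1, 0.1, 0.2]),
              np.array([0.0, 0.2, 1.0]),
              np.array([-0.5, 0.0, 2.5])]
print([round(r, 3) for r in refinement_rewards(trajectory, true_class=2)])
```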
Results
RIC demonstrates competitive accuracy compared to standard supervised classification methods and adaptive computation models, while significantly improving calibration on datasets like CIFAR-10, SVHN, and ImageWoof. The learned value function effectively signals when to halt computation, optimizing resource allocation.
Implications
The RIC framework has potential applications in scenarios requiring adaptive computation and improved prediction reliability, such as real-time image classification and other domains where input complexity varies significantly. It also opens avenues for further research into reinforcement learning applications in classification tasks.
Channel-Free Human Activity Recognition via Inductive-Bias-Aware Fusion Design for Heterogeneous IoT Sensor Environments
Time Series
Multimodal
- Introduces a channel-free HAR framework that adapts to heterogeneous sensor environments.
- Utilizes metadata for improved structural information recovery during activity recognition.
- Demonstrates strong robustness against channel perturbations and improved performance with metadata conditioning.
- Maintains competitiveness with traditional channel-fixed models while enabling cross-dataset transfer learning.
Read more
Channel-Free Human Activity Recognition via Inductive-Bias-Aware Fusion Design for Heterogeneous IoT Sensor Environments
Summary
This paper addresses the challenges of human activity recognition (HAR) in heterogeneous Internet of Things (IoT) environments, where sensor configurations can vary significantly across datasets and devices. Traditional HAR models are often channel-fixed, making them difficult to adapt to different sensor setups. The author proposes a channel-free HAR framework that allows a single model to perform inference without relying on a predefined number or arrangement of input channels. The framework incorporates channel-wise encoding, a shared encoder, and metadata-conditioned late fusion through conditional batch normalization. This design enables the model to process each channel independently while leveraging sensor metadata to recover structural information. The proposed joint optimization approach encourages both the discriminability of individual channels and the consistency of fused predictions. Extensive experiments on the PAMAP2 dataset and robustness analyses across six HAR datasets demonstrate that the channel-free model exhibits strong robustness under channel perturbations, improved performance with metadata conditioning, and competitiveness with conventional channel-fixed architectures. The findings suggest that channel-free modeling is a practical advancement towards scalable and transferable HAR in real-world IoT applications.
Methodology
The proposed methodology involves a channel-free HAR framework that integrates channel-wise encoding with a shared encoder, employs metadata-conditioned late fusion via conditional batch normalization, and utilizes a joint optimization approach for channel-level and fused predictions. This allows for independent processing of channels while incorporating sensor metadata.
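A minimal sketch of metadata-conditioned normalization in this spirit, assuming the sensor metadata is already encoded as a fixed-length vector; the dimensions and identity-centered scaling are illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Conditional batch normalization: the scale and shift are predicted from a
# sensor-metadata vector, so a shared encoder can be specialized per channel
# at fusion time.

class ConditionalBatchNorm1d(nn.Module):
    def __init__(self, num_features: int, metadata_dim: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.to_gamma = nn.Linear(metadata_dim, num_features)
        self.to_beta = nn.Linear(metadata_dim, num_features)

    def forward(self, x: torch.Tensor, metadata: torch.Tensor) -> torch.Tensor:
        # x: (batch, features); metadata: (batch, metadata_dim)
        gamma = 1.0 + self.to_gamma(metadata)   # start near identity scaling
        beta = self.to_beta(metadata)
        return gamma * self.bn(x) + beta

cbn = ConditionalBatchNorm1d(num_features=64, metadata_dim=8)
x = torch.randn(32, 64)          # per-channel features from the shared encoder
meta = torch.randn(32, 8)        # encoded sensor metadata
print(cbn(x, meta).shape)        # torch.Size([32, 64])
```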
Results
The experiments reveal that the channel-free late fusion approach is robust against variations in channel configurations. Metadata conditioning consistently enhances model performance, and the framework remains competitive with conventional models, facilitating transfer across diverse sensor setups without the need for redesigning input layers.
Implications
The findings indicate that channel-free modeling can significantly enhance the scalability and adaptability of HAR systems in real-world IoT environments, making it a viable solution for applications in health monitoring, assisted living, and context-aware services.
From Local to Cluster: A Unified Framework for Causal Discovery with Latent Variables
Graph Learning
Theory
Efficient ML
- L2C framework bridges local structure learning and cluster-level causal discovery.
- Automatically discovers clusters from local causal patterns without manual assignment.
- Handles latent variables effectively without assuming causal sufficiency.
- Theoretical guarantees of soundness, atomic completeness, and computational efficiency.
Read more
From Local to Cluster: A Unified Framework for Causal Discovery with Latent Variables
Summary
This paper addresses the challenges of causal discovery in the presence of latent variables, which complicate the identification of causal relationships. Traditional local methods focus on direct neighbors but lack macro-level insights, while cluster-level methods often assume prior knowledge of clusters or require causal sufficiency, which is not always valid in real-world scenarios. The proposed L2C (Local to Cluster Causal Abstraction) framework unifies local structure learning with cluster-level causal discovery, automatically discovering clusters from local causal patterns. L2C employs a cluster reduction theorem to condense clusters to a maximum of three nodes without losing causal information, allowing for local causal discovery to identify direct causes, effects, and v-structures. The framework performs macro-level causal inference using cluster-level calculus on the learned cluster graph and does not assume causal sufficiency, effectively handling latent variables through local discovery. Theoretical analysis confirms that L2C maintains soundness, atomic completeness, and computational efficiency. Extensive experiments on both synthetic and real-world datasets demonstrate that L2C accurately recovers ground truth clusters and outperforms existing methods in macro causal effect identification.
Methodology
The L2C framework utilizes a cluster reduction theorem to simplify clusters to at most three nodes, applies local causal discovery techniques to identify direct causal relationships, and employs cluster-level calculus for macro causal inference. The framework is designed to operate without the assumption of causal sufficiency, addressing the challenges posed by latent variables.
Results
L2C successfully recovers ground truth clusters in extensive experiments and achieves better macro causal effect identification compared to existing baseline methods, showcasing its effectiveness in both synthetic and real-world data scenarios.
Implications
The L2C framework has significant implications for causal discovery in complex systems where latent variables are present. It can be applied in various fields such as epidemiology, social sciences, and economics, where understanding causal relationships is crucial for decision-making and policy formulation.
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
NLP
Large Language Models
Federated Learning
- ProjRes is the first projection residuals-based passive MIA specifically designed for FedLLMs.
- The method achieves near 100% accuracy in inferring data membership, outperforming existing techniques.
- ProjRes operates without the need for shadow models or auxiliary classifiers, enhancing efficiency.
- The study reveals significant privacy vulnerabilities in FedLLMs, necessitating a reassessment of their security frameworks.
Read more
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
Summary
This paper addresses the vulnerability of Federated Large Language Models (FedLLMs) to Membership Inference Attacks (MIAs), which can expose sensitive information despite the models' design to maintain data privacy. The authors introduce ProjRes, a novel passive MIA that utilizes projection residuals to analyze hidden embedding vectors and their relationship with gradients. Unlike traditional MIAs, ProjRes does not require shadow models or auxiliary classifiers, making it more efficient and robust. The study demonstrates that existing MIA techniques are ineffective against FedLLMs due to their unique characteristics, such as large parameter scales and rapid convergence. Through extensive experiments on four benchmarks and four different LLMs, ProjRes achieves near 100% accuracy, significantly outperforming previous methods by up to 75.75%. The findings highlight a critical privacy vulnerability in FedLLMs, urging a reevaluation of their security assumptions and the need for improved privacy-preserving mechanisms.
Methodology
The authors propose ProjRes, which analyzes the projection residuals of hidden embedding vectors in the gradient subspace to determine data membership. This approach circumvents the limitations of traditional MIAs by eliminating the need for shadow models and auxiliary classifiers, thus streamlining the attack process.
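The projection-residual idea can be illustrated with a toy computation. The scoring rule below (residual norm after projecting an embedding onto a gradient-spanned subspace) is a simplification for intuition, not the paper's actual attack:

```python
import numpy as np

# Toy illustration: project a hidden embedding onto a subspace spanned by
# observed gradient/update directions and use the size of the residual as a
# membership signal (members' embeddings are assumed to be better explained
# by the subspace).

def projection_residual(embedding: np.ndarray, gradient_dirs: np.ndarray) -> float:
    """gradient_dirs: (k, d) matrix whose rows span the gradient subspace."""
    # Orthonormal basis of the gradient subspace via QR decomposition.
    q, _ = np.linalg.qr(gradient_dirs.T)          # (d, k)
    projection = q @ (q.T @ embedding)
    return float(np.linalg.norm(embedding - projection))

rng = np.random.default_rng(0)
dirs = rng.normal(size=(4, 128))                  # 4 gradient directions in d=128
member = dirs.T @ rng.normal(size=4)              # lies in the gradient subspace
non_member = rng.normal(size=128)                 # generic embedding
print(projection_residual(member, dirs) < projection_residual(non_member, dirs))  # True
```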
Results
ProjRes demonstrated near 100% accuracy across four benchmarks and four LLMs, outperforming previous MIA methods by as much as 75.75%. The method maintained effectiveness even against strong differential privacy defenses, indicating a significant privacy risk in FedLLMs.
Implications
The findings suggest that while FedLLMs are designed to protect data privacy, they still possess vulnerabilities that can be exploited through MIAs. This calls for enhanced security measures and a reevaluation of privacy assumptions in federated learning contexts.
Droplet-LNO: Physics-Informed Laplace Neural Operators for Accurate Prediction of Droplet Spreading Dynamics on Complex Surfaces
Theory
Efficient ML
Optimization
- Introduction of PI-LNO, a novel neural network architecture for droplet dynamics.
- Achieves significant speedup in predictions compared to traditional CFD methods.
- Demonstrates superior accuracy with a mean R2 score of 0.9009 across various conditions.
- Utilizes a physics-regularized loss function to ensure physically feasible predictions.
Read more
Droplet-LNO: Physics-Informed Laplace Neural Operators for Accurate Prediction of Droplet Spreading Dynamics on Complex Surfaces
Summary
This paper introduces the Physics-Informed Laplace Operator Neural Network (PI-LNO), a novel architecture designed to accurately predict the dynamics of droplet spreading on complex surfaces. Traditional computational fluid dynamics (CFD) simulations are time-consuming, often requiring 18 to 24 hours for transient computations. The PI-LNO leverages Laplace integral transforms to model the exponential transient dynamics of droplet spreading, significantly improving computational efficiency. The authors conducted extensive benchmark studies against five state-of-the-art methods, including UNet and DeepONet, demonstrating that PI-LNO achieves superior performance with a mean R2 score of 0.9009 across various spreading times, compared to lower scores from the other models. The model was trained on multi-surface CFD data with a physics-regularized composite loss function that integrates data fidelity and physical constraints. Results indicate that PI-LNO not only provides accurate predictions with localized absolute errors but also enables real-time inference, achieving a speedup of approximately 23,400 times over traditional CFD methods. This advancement positions PI-LNO as a powerful tool for parametric optimization and design in engineering applications involving transient multiphase dynamics.
Methodology
The PI-LNO architecture employs Laplace integral transforms to model droplet dynamics, integrating a physics-regularized composite loss function that combines data fidelity metrics (MSE, MAE, RMSE) with physical constraints from Navier-Stokes and Cahn-Hilliard equations. The model was trained on multi-surface CFD data across a range of contact angles, optimizing using Adam and L-BFGS methods.
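A generic sketch of such a physics-regularized composite loss, with placeholder residual terms standing in for the actual Navier-Stokes and Cahn-Hilliard residuals:

```python
import torch

# Composite loss = data-fidelity term + weighted penalties on PDE residuals.
# The residual functions and weights here are placeholders for illustration.

def composite_loss(pred, target, pde_residuals, weights):
    data_term = torch.mean((pred - target) ** 2)              # MSE data fidelity
    physics_term = sum(w * torch.mean(r ** 2)                 # penalize residuals
                       for r, w in zip(pde_residuals, weights))
    return data_term + physics_term

pred = torch.randn(16, 3, requires_grad=True)
target = torch.randn(16, 3)
# Placeholder "residuals" standing in for momentum / phase-field residual fields.
residuals = [pred.sum(dim=1) - 1.0, pred[:, 0] - pred[:, 1]]
loss = composite_loss(pred, target, residuals, weights=[0.1, 0.05])
loss.backward()
print(float(loss))
```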
Results
PI-LNO achieved a mean R2 score of 0.9009 across four intermediate spreading times, with localized absolute errors around the contact-line regions. The model demonstrated R2 scores exceeding 0.99 for all field variables, with inference times of 2.8 ms, representing a 23,400× speedup over traditional CFD simulations.
Implications
The PI-LNO model provides a robust framework for real-time simulations and optimizations in applications such as inkjet printing, spray cooling, and biomedical microfluidics, where understanding droplet dynamics is critical. Its ability to efficiently handle varying surface conditions makes it a valuable tool for engineering design and analysis.
On the Properties of Feature Attribution for Supervised Contrastive Learning
Computer Vision
Interpretability
- Supervised Contrastive Learning (SCL) enhances feature attribution quality compared to Cross-Entropy (CE) loss.
- Models trained with SCL show improved robustness and generalization capabilities.
- Grad-CAM-based feature attributions from SCL-trained models are more faithful and continuous.
- Lower contrastivity in SCL models indicates a more stable feature attribution across classes.
Read more
On the Properties of Feature Attribution for Supervised Contrastive Learning
Summary
This paper investigates the properties of feature attribution (FA) in neural networks trained using Supervised Contrastive Learning (SCL) compared to traditional Cross-Entropy (CE) loss. The authors argue that while CE is the standard for classification tasks, it leads to issues such as overconfidence and poor out-of-distribution detection. In contrast, SCL creates an embedding space where similar data points are clustered together based on their labels, enhancing robustness and generalization. The study empirically evaluates the quality of Grad-CAM-based FA explanations generated from CNNs trained on CIFAR-10 and ImageNet-S50 datasets. The authors assess these explanations based on faithfulness, continuity, contrastivity, coherence, and complexity. The findings indicate that models trained with SCL produce more faithful, continuous, and less complex feature attributions than those trained with CE, although they exhibit lower contrastivity. The results suggest that SCL may be a preferable training objective for applications requiring model transparency and trustworthiness.
Methodology
The authors conducted a comparative analysis of feature attribution using Grad-CAM on CNNs trained with three different loss functions: Cross-Entropy (CE), Supervised Contrastive Loss (SCL), and Triplet Loss (TL). They evaluated the quality of the generated explanations based on multiple criteria, including faithfulness, continuity, contrastivity, coherence, and complexity, using two popular image datasets, CIFAR-10 and ImageNet-S50.
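For readers unfamiliar with Grad-CAM, the sketch below shows the standard computation on an off-the-shelf ResNet; the paper's trained models, datasets, and evaluation metrics are not reproduced here:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Grad-CAM: channel-wise gradients of a class score are averaged to weight the
# last convolutional feature maps, giving a class-specific saliency map.

model = resnet18(weights=None).eval()
activations, gradients = {}, {}

def fwd_hook(_, __, output):
    activations["feat"] = output

def bwd_hook(_, __, grad_output):
    gradients["feat"] = grad_output[0]

layer = model.layer4                    # last convolutional block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)
scores = model(x)
scores[0, scores.argmax()].backward()   # gradient of the top class score

weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)     # (1, C, 1, 1)
cam = F.relu((weights * activations["feat"]).sum(dim=1))       # (1, H, W)
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:], mode="bilinear",
                    align_corners=False).squeeze()
print(cam.shape)                        # torch.Size([224, 224])
```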
Results
The results demonstrate that models trained with SCL yield feature attributions that are more faithful, continuous, and less complex than those from CE-trained models. However, SCL models exhibited lower contrastivity, and the coherence results were inconclusive. Overall, the findings reinforce the advantages of SCL in producing trustworthy and transparent neural networks.
Implications
The findings suggest that practitioners should consider using Supervised Contrastive Learning as a training objective, especially in safety-critical applications where model transparency and robustness are essential. The framework for generating class-specific feature attributions can aid in enhancing human oversight and understanding of AI models.
A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency
Efficient ML
- Systematic evaluation across five model scales revealed task-dependent performance ceilings.
- Disease prediction improved with larger models, while medication prediction saturated at a smaller size.
- Optimal model sizes can lead to significant reductions in pretraining time without sacrificing performance.
- Task-specific capacity ceilings are essential for efficient resource allocation in model development.
Read more
A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency
Summary
This study investigates the relationship between model size and downstream task performance in structured medical foundation models, specifically using a large-scale dataset from a nationwide Japanese medical claims database. The authors pre-trained encoder-only Transformer models at five different scales (2.2M to 101M parameters) to predict disease incidence and medication initiation. The findings reveal that the optimal model size varies by task: larger models (32M-101M) improved disease prediction, while medication prediction performance saturated at 11M parameters, significantly reducing pretraining time by 178 hours. The best-performing models consistently outperformed a Light Gradient Boosting Machine (LGBM) baseline in terms of area under the precision-recall curve (AUPRC). This study highlights the importance of task-specific model scaling, suggesting that while larger models may reduce pretraining loss, they do not uniformly enhance downstream performance across different clinical tasks.
Methodology
The authors utilized a nationwide claims database to sample 2.3 million patients and constructed token sequences integrating diagnosis and medication codes. They pretrained encoder-only Transformer models at five scales and fine-tuned them for two tasks: disease incidence prediction and medication initiation prediction. Performance was compared against a Light Gradient Boosting Machine baseline.
Results
The study found that disease prediction benefited from larger models (32M-101M), while medication prediction performance saturated at 11M parameters. The optimal models for each task outperformed the LGBM baseline in AUPRC, demonstrating that task-specific model sizes can justify pretraining costs.
Implications
These findings suggest that in clinical risk prediction, selecting the appropriate model size based on task characteristics can enhance predictive performance while optimizing computational resources. This approach can guide future developments in structured medical foundation models.
Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair
Optimization
Theory
- Identification of the 'attenuate-then-adapt conflict' in gradient modification under Adam.
- Demonstration that traditional methods lead to increased forgetting in continual learning tasks.
- Introduction of Adaptive Decoupled Moment Routing as a solution to mitigate identified failures.
- Empirical validation showing significant performance improvements over existing methods.
Read more
Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair
Summary
This paper investigates the hidden failure modes of gradient modification techniques when used with the Adam optimizer in continual learning scenarios. The authors identify a significant issue termed the 'attenuate-then-adapt conflict,' where gradient-modifying methods inadvertently lead to increased forgetting in continual learning tasks. Through extensive experiments on an 8-domain continual learning model, they demonstrate that traditional shared-routing projection methods collapse to near-vanilla performance, while naive fixed-strength decoupling underperforms. The authors propose a novel solution, Adaptive Decoupled Moment Routing, which effectively mitigates this failure by preserving the magnitude of statistics in the second moment while adapting the gradient routing based on overlap-aware strategies. Their findings reveal that this adaptive approach significantly outperforms existing methods, particularly in high-overlap scenarios, and provides a deeper understanding of the dynamics between gradient modification and adaptive optimization.
Methodology
The authors conducted experiments on an 8-domain continual learning model to assess the performance of various gradient-modifying techniques under the Adam optimizer. They analyzed the effects of these techniques on forgetting rates and introduced a new routing method, Adaptive Decoupled Moment Routing, which routes modified gradients while preserving the magnitude of statistics in the second moment. The study included a scalar-surrogate analysis to diagnose the failure modes and a cross-optimizer control to validate the findings across different optimization algorithms.
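A rough sketch of the decoupled-moment idea, assuming the gradient modification is a projection away from a stored old-task direction and omitting the adaptive, overlap-aware strength; this is an illustration of the routing principle, not the paper's algorithm:

```python
import numpy as np

# The modified (projected) gradient is routed into Adam's first moment only,
# while the second moment is updated from the raw gradient so its magnitude
# statistics are preserved.

def decoupled_adam_step(theta, grad, old_dir, m, v, t,
                        lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    old_dir = old_dir / (np.linalg.norm(old_dir) + 1e-12)
    projected = grad - np.dot(grad, old_dir) * old_dir   # modified gradient
    m = b1 * m + (1 - b1) * projected                    # routed: modified grad
    v = b2 * v + (1 - b2) * grad**2                      # preserved: raw magnitude
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
old_dir = np.array([1.0, 0.0, 0.0, 0.0])
grad = np.array([0.5, -0.3, 0.2, 0.1])
theta, m, v = decoupled_adam_step(theta, grad, old_dir, m, v, t=1)
print(theta)
```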
Results
The results indicated that all tested shared-routing projection methods collapsed to near-vanilla forgetting rates, with the best alternative (0.5% replay buffer) trailing by 1.6 units. In contrast, the proposed Adaptive Decoupled Moment Routing achieved a forgetting rate of 9.4 units, representing a 3.8-unit improvement over vanilla methods and a 2.2-unit improvement over the strongest shared baseline. The gap widened further in a 16-domain continual stream, highlighting the robustness of the proposed method.
Implications
The findings suggest that existing gradient modification techniques may not be effective when used with adaptive optimizers like Adam in continual learning scenarios. The proposed Adaptive Decoupled Moment Routing could lead to more effective continual learning systems, particularly in applications with high overlap between tasks. This research may influence future developments in optimizer design and continual learning methodologies.
Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
Large Language Models
NLP
Interpretability
- LLMs are sensitive to clinically insignificant variables, affecting psychiatric risk assessments.
- Prompt design significantly influences model outputs, necessitating controlled methodologies in AI healthcare applications.
- A structured audit framework can identify reliability issues in LLMs, particularly in psychiatric contexts.
- Increased output variability correlates with the addition of irrelevant features, highlighting predictive instability.
Read more
Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
Summary
This paper investigates the reliability of large language models (LLMs) in generating hospitalization risk scores within the psychiatric domain. The authors highlight the challenges posed by algorithmic biases and prompt sensitivity, which can significantly impact model outputs. They propose a structured approach for reliability auditing that evaluates the effects of prompt design and the inclusion of medically insignificant inputs on predicted risk scores. The study involves generating a cohort of synthetic patient profiles with clinically relevant and irrelevant features, and auditing four different LLMs across various prompt styles. The findings reveal that the inclusion of non-clinical variables leads to increased variability in predicted hospitalization risk scores, indicating reduced predictive stability. This underscores the need for systematic evaluations of LLMs before their clinical deployment, particularly in sensitive areas like psychiatry where decision-making is complex and nuanced.
Methodology
The authors generated a synthetic cohort of 50 patient profiles, each containing 15 clinically relevant features and up to 50 non-clinically relevant features. They evaluated four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini) across four different prompt styles (neutral, logical, human impact, clinical judgment). The analysis involved 44,000 simulations to assess the impact of prompt variations and irrelevant features on hospitalization risk scores.
Results
The study found that the inclusion of medically insignificant features led to a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts. This indicates a reduction in predictive stability as contextual noise increased. The results also showed that prompt variations independently affected the trajectory of instability in a model-dependent manner.
Implications
The findings suggest that LLMs used in psychiatric risk assessments must undergo rigorous reliability audits to ensure stable and interpretable outputs. This is crucial for maintaining trust in AI applications in healthcare, as unreliable models could lead to poor patient outcomes. The proposed audit framework can be utilized to improve the reliability of AI systems in clinical settings.
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
Multimodal
Efficient ML
Optimization
- Presents a multi-layered methodology for accelerating multimodal foundation models (MFMs).
- Integrates hardware and software optimization techniques to enhance energy efficiency and performance.
- Employs advanced techniques such as mixed-precision quantization, structural pruning, and model cascading.
- Demonstrates effectiveness on medical MFMs and code generation tasks.
Read more
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
Summary
This paper presents a comprehensive methodology aimed at accelerating multimodal foundation models (MFMs) through a combination of hardware and software co-design techniques. The authors identify key challenges in the acceleration of MFMs, including energy efficiency, cross-modal integration, domain-specific adaptation, and security. They propose a multi-layered design pipeline that integrates model development with domain-specific adaptations and various optimization techniques. The methodology includes MFM compression via hierarchy-aware mixed-precision quantization and structural pruning, as well as operational optimizations such as speculative decoding and model cascading. Additionally, the authors emphasize the importance of optimizing the processing dataflow based on hardware architecture to meet bandwidth and latency requirements. A specialized hardware accelerator for transformer workloads is also discussed, which can be designed through expert methods or aided by large language models (LLMs). The effectiveness of the proposed methodology is demonstrated through applications in medical MFMs and code generation tasks, with future extensions towards energy-efficient spiking-MFMs.
Methodology
The methodology combines hardware and software co-design for transformer blocks, employing techniques such as mixed-precision quantization, structural pruning, and speculative decoding. It also includes a specialized hardware accelerator designed for transformer workloads, optimized for specific application requirements.
Results
The proposed methodology effectively accelerates MFMs, demonstrating significant improvements in computational efficiency and memory usage in applications related to medical data and code generation. The results indicate a successful integration of hardware and software techniques, paving the way for further advancements in energy-efficient spiking-MFMs.
Implications
The findings suggest that the proposed methodology can significantly enhance the deployment of MFMs in resource-constrained environments, such as edge devices, while maintaining performance and energy efficiency. This has potential applications in various domains, including healthcare, robotics, and generative AI.
FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
NLP
Large Language Models
Efficient ML
- FairyFuse is the first ternary-weight GEMV kernel on x86 CPUs that eliminates floating-point multiplications.
- The system achieves a 29.6× speedup over FP32 by optimizing memory bandwidth usage through fused execution.
- End-to-end evaluation shows FairyFuse outperforms llama.cpp Q4_K_M by 1.24× while maintaining high model quality.
- The findings suggest that CPUs are more suitable than GPUs for extreme quantization in LLMs.
Read more
FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
Summary
The paper introduces FairyFuse, a novel inference system designed for large language models (LLMs) that utilizes ternary weights ({−1, 0, +1}) to eliminate floating-point multiplications during inference on CPU-only platforms. Traditional methods for LLM inference often rely on dequantizing weights and performing multiplications, which can hinder performance due to memory bandwidth limitations. FairyFuse addresses this by implementing a multiplication-free execution model that leverages masked additions and subtractions, achieving significant speedups. The system fuses multiple real-valued sub-GEMVs into a single AVX-512 loop, optimizing memory access and computational efficiency. The results demonstrate that FairyFuse can sustain 32.4 tokens per second on an Intel Xeon 8558P, outperforming existing methods while maintaining near-lossless quality. This work highlights the potential of ternary packing for efficient LLM inference on CPUs, suggesting a shift in focus from GPUs to CPUs for extreme quantization tasks.
Methodology
FairyFuse employs a novel approach to ternary weight processing by packing weights into 32-bit words and using AVX-512 instructions for masked additions and subtractions. It consolidates multiple sub-GEMVs into a single loop to reduce memory overhead and improve computational efficiency. The design decisions include mask reuse, input sharing, and register-resident accumulation to enhance performance.
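The arithmetic idea (though not the packed AVX-512 implementation) can be shown in a few lines of NumPy: with ternary weights, a GEMV reduces to masked additions and subtractions, with no floating-point multiplications:

```python
import numpy as np

# NumPy sketch of the arithmetic only; the real kernel packs ternary weights
# into 32-bit words and uses AVX-512 masked instructions.

def ternary_gemv(W, x):
    """W: (out, in) ternary matrix with entries in {-1, 0, +1}; x: (in,) activations."""
    plus_mask = W == 1
    minus_mask = W == -1
    # Masked additions and subtractions stand in for FMA-based GEMV.
    return (np.where(plus_mask, x, 0.0).sum(axis=1)
            - np.where(minus_mask, x, 0.0).sum(axis=1))

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(8, 16))            # ternary weights
x = rng.normal(size=16).astype(np.float32)
print(np.allclose(ternary_gemv(W, x), W @ x))    # True: same result, no multiplies
```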
Results
FairyFuse achieves a throughput of 32.4 tokens per second on a single Intel Xeon 8558P socket, which is 1.24× faster than the llama.cpp Q4_K_M method. The perplexity on the WikiText-2 dataset is 5.52, comparable to 5.47 for FP16, with an average downstream accuracy of 66.0%. The inner loop of the implementation contains zero floating-point multiplication instructions, confirming the effectiveness of the proposed method.
Implications
The results indicate that ternary quantization can significantly enhance the efficiency of LLM inference on CPUs, making it a viable option for applications requiring low-latency and privacy-preserving processing. This could lead to broader adoption of LLMs in edge devices and on-device assistants where GPU resources are limited.
Performance Anomaly Detection in Athletics: A Benchmarking System with Visual Analytics
Time Series
- The system processes a large dataset of athletic performances to identify potential doping violations.
- Eight detection methods are utilized, including statistical and machine learning techniques.
- Trajectory-based methods outperform others in balancing detection and false alarm rates.
- The system supports expert-driven investigations with an interactive visual analytics interface.
Read more
Performance Anomaly Detection in Athletics: A Benchmarking System with Visual Analytics
Summary
This paper addresses the challenge of detecting performance-enhancing drug use in athletics through a novel performance anomaly detection system. Traditional anti-doping programs rely heavily on biological testing, which is costly and limited by short detection windows for many substances. To complement these programs, the authors propose a system that analyzes 1.6 million athletic performances from over 19,000 competitions between 2010 and 2025. The system employs eight detection methods, including statistical rules, machine learning algorithms, and trajectory analysis, to identify suspicious performance patterns. The methods are validated against confirmed anti-doping violations, with trajectory-based approaches showing the best balance between detection rates and false alarms. The system emphasizes transparency and human judgment, providing an interactive interface for investigators to review flagged performances with contextual details. This approach aims to enhance the effectiveness of anti-doping efforts while minimizing the risk of false accusations against athletes.
Methodology
The authors developed a performance anomaly detection system that integrates a large-scale data pipeline processing 1.6 million performance records. It employs eight detection methods: statistical outlier detection (z-score, MAD, IQR), machine learning algorithms (Isolation Forest, XGBoost), trajectory-based models (excess performance), and Bayesian hierarchical inference. The system is designed for interactive use, allowing investigators to analyze flagged performances with detailed contextual information.
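The three statistical screens can be sketched on a single performance series; the thresholds below are common defaults, not the system's tuned values:

```python
import numpy as np

# Statistical outlier screens (z-score, MAD, IQR) on one athlete-event series.

def outlier_flags(x, z_thr=3.0, mad_thr=3.5, iqr_k=1.5):
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    mad = np.median(np.abs(x - np.median(x)))
    modified_z = 0.6745 * (x - np.median(x)) / mad
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return {
        "z_score": np.abs(z) > z_thr,
        "mad": np.abs(modified_z) > mad_thr,
        "iqr": (x < q1 - iqr_k * iqr) | (x > q3 + iqr_k * iqr),
    }

# Seconds for a 100m sprinter; the last mark is suspiciously fast.
marks = np.array([10.21, 10.18, 10.25, 10.19, 10.22, 10.20, 9.58])
for method, flags in outlier_flags(marks).items():
    print(method, flags.astype(int))
```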
Results
The trajectory-based methods demonstrated superior performance in identifying sanctioned athletes while maintaining a low rate of false positives. The system's validation against confirmed anti-doping violations showed that performance-based screening can effectively complement existing biological testing methods, which have low positive rates.
Implications
This research has significant implications for anti-doping programs, providing a complementary approach to traditional biological testing. By leveraging performance data, the system can help prioritize testing resources and enhance the detection of doping violations, ultimately promoting fair play in athletics.
ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
Time Series
NLP
Multimodal
- ARFBench provides a comprehensive evaluation framework for TSQA in software incident response.
- Frontier VLMs significantly outperform existing baselines in TSQA tasks.
- Hybrid TSFM-VLM models show promise for specialized time series question answering.
- A model-expert oracle approach demonstrates complementary strengths between models and human experts.
Read more
ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
Summary
This paper introduces ARFBench, a benchmark for evaluating time series question-answering (TSQA) capabilities of foundation models in the context of software incident response. The benchmark consists of 750 questions derived from 142 time series and 5.38 million data points from 63 production incidents at Datadog. The authors assess various leading models, including proprietary and open-source large language models (LLMs), vision-language models (VLMs), and time series foundation models (TSFMs). Results indicate that frontier VLMs, particularly GPT-5, outperform existing baselines with an accuracy of 62.7% and F1 score of 51.9%. The study also explores a novel hybrid TSFM + VLM model, which, after post-training on synthetic and real data, achieves comparable performance to leading models. Furthermore, the authors establish a model-expert oracle that combines model and human expert answers, achieving an F1 score of 82.8% and accuracy of 87.2%, setting a new benchmark for TSQA tasks. The benchmark is publicly available for further research.
Methodology
The authors developed ARFBench by sourcing real incident data from Datadog's internal telemetry, creating a set of multiple-choice questions that assess various levels of time series reasoning. They benchmarked multiple foundation models, including LLMs, VLMs, and TSFMs, and introduced a hybrid model combining TSFMs with VLMs. The evaluation involved comparing model performances against established baselines and human expert responses.
Results
The leading model, GPT-5, achieved 62.7% accuracy and 51.9% F1 score, outperforming naive baselines by significant margins. The hybrid TSFM-VLM model demonstrated comparable performance to frontier models. The model-expert oracle achieved an F1 score of 82.8% and accuracy of 87.2%, indicating a new superhuman frontier for TSQA.
Implications
The findings suggest that advanced multimodal models can enhance the efficiency and accuracy of software incident response by improving the understanding of time series data. The benchmark can serve as a foundation for future research in TSQA and incident management, potentially leading to better automated tools for engineers.
Learning Coverage- and Power-Optimal Transmitter Placement from Building Maps: A Comparative Study of Direct and Indirect Neural Approaches
Optimization
- Introduces a large dataset (RadioMapSeer-Deployment) for transmitter placement with dual labels.
- Identifies an asymmetric trade-off between coverage and power in transmitter placements.
- Compares indirect heatmap-based and direct score-map models for transmitter placement.
- Demonstrates significant speed improvements in predictions using neural network models.
Read more
Learning Coverage- and Power-Optimal Transmitter Placement from Building Maps: A Comparative Study of Direct and Indirect Neural Approaches
Summary
This paper addresses the challenge of optimal wireless transmitter placement in radio-network planning, focusing on a single-transmitter scenario under a fixed learned propagation surrogate. The author introduces a dataset, RadioMapSeer-Deployment, containing 167,525 urban scenarios with dual surrogate-exact labels for coverage-optimal and power-optimal transmitter locations. The analysis reveals an asymmetric trade-off between the two objectives: coverage-optimal placements give up more received power than power-optimal placements give up coverage. The study evaluates two learning formulations: indirect heatmap-based models that predict received-power radio maps and direct score-map models that predict objective landscapes over feasible transmitter locations. The results show that discriminative models within the heatmap family provide significantly faster predictions compared to exhaustive searches, while diffusion models enhance performance through multi-sample inference. Dual score-map strategies effectively match the optimal balanced placement and demonstrate substantial speedups. The findings indicate that both formulations offer rapid inference capabilities, with dual score-map methods excelling in balanced placements and heatmap models providing valuable intermediate maps.
Methodology
The study employs two main methodologies: indirect heatmap-based models that predict received-power radio maps and direct score-map models that predict the objective landscape for feasible transmitter locations. The models are trained and evaluated on the RadioMapSeer-Deployment dataset, allowing for a controlled comparison of performance across different approaches.
Results
The analysis shows that discriminative models within the heatmap family achieve one-shot predictions that are 1350–2400 times faster than exhaustive searches. Diffusion models improve single-objective performance and enable strong balanced placements without explicit multi-objective training. Dual score-map strategies match the optimal balanced placement and maintain proximity to it across smaller candidate budgets, achieving 14–22 times speedups.
Implications
The findings suggest that learning-based methods can significantly enhance the efficiency of wireless transmitter placement, making it feasible to optimize placements in complex urban environments. This has implications for improving network planning and deployment strategies in telecommunications.
Protect the Brain When Treating the Heart: A Convolutional Neural Network for Detecting Emboli
Computer Vision
- Introduction of a 2.5D U-Net architecture for real-time GME detection.
- Development of a custom annotation tool for creating a specialized dataset.
- Demonstration of high segmentation accuracy and robust detection capabilities.
- Integration of the model into surgical protocols for real-time patient monitoring.
Read more
Protect the Brain When Treating the Heart: A Convolutional Neural Network for Detecting Emboli
Summary
This paper addresses the challenge of detecting gaseous microemboli (GME) during cardiac interventions, which pose significant neurological risks. The authors propose a novel approach using a 2.5D U-Net architecture to segment GME in echocardiographic video data. Traditional methods for GME detection are limited by their inability to provide real-time feedback during surgery, which is critical for mitigating risks. The proposed model leverages temporal context from consecutive frames to improve detection accuracy against a dynamic anatomical background. A custom annotation tool was developed to create a dataset of echocardiographic videos, which was used to train the model. The results demonstrate robust detection capabilities and high segmentation accuracy, enabling real-time monitoring of GME during surgical procedures. This advancement has the potential to enhance intraoperative decision-making and patient safety by providing timely feedback to surgical teams.
Methodology
The authors utilized a 2.5D U-Net architecture to segment GME in echocardiographic video sequences. The model processes short sequences of frames to incorporate temporal context, enhancing the ability to distinguish moving emboli from the surrounding cardiac structures. A custom annotation tool was created to facilitate the segmentation of GME, resulting in a dataset of approximately 4000 localized image samples derived from echocardiographic videos.
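A small sketch of the 2.5D input construction, assuming grayscale frames and a window of three consecutive frames stacked along the channel axis (the paper's exact window length and preprocessing are not specified here):

```python
import numpy as np

# A short window of consecutive frames is stacked along the channel axis so a
# 2D segmentation network sees temporal context without full 3D convolutions.

def make_25d_inputs(video, window=3):
    """video: (T, H, W) -> (T - window + 1, window, H, W) channel-stacked clips."""
    t, h, w = video.shape
    clips = [video[i:i + window] for i in range(t - window + 1)]
    return np.stack(clips)              # each clip: `window` frames as channels

video = np.random.rand(10, 128, 128).astype(np.float32)   # 10 echo frames
inputs = make_25d_inputs(video, window=3)
print(inputs.shape)                     # (8, 3, 128, 128): ready for a 2D U-Net
```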
Results
The proposed model achieved robust detection of GME with high segmentation accuracy while maintaining real-time execution speed. The integration of temporal context allowed for effective discrimination of moving emboli against a dynamic background, which is crucial for intraoperative monitoring.
Implications
The findings suggest that the developed system can significantly improve the safety of cardiac procedures by providing real-time feedback on GME presence, potentially reducing the incidence of neurological complications associated with cardiac interventions. This technology could be integrated into standard surgical protocols, enhancing decision-making and patient outcomes.
Estimating Tail Risks in Language Model Output Distributions
NLP
Large Language Models
Efficient ML
- Introduces a method for estimating the probability of harmful outputs in language models using importance sampling.
- Demonstrates that unsafe model versions can be created to enhance the likelihood of harmful outputs, allowing for efficient sampling.
- Achieves accurate estimates of harmful output probabilities with significantly fewer samples than traditional methods.
- Reveals the sensitivity of model outputs to input perturbations, indicating that query-level estimates are crucial for understanding deployment risks.
Read more
Estimating Tail Risks in Language Model Output Distributions
Summary
This paper addresses the critical issue of estimating tail risks associated with harmful outputs from language models, which are increasingly deployed at scale. Despite advancements in model alignment that reduce harmful outputs, the sheer volume of queries means that even rare harmful behaviors can manifest. Current safety evaluations primarily focus on the distribution of inputs that lead to harmful outputs, neglecting the probabilistic nature of model outputs. To tackle this, the authors propose a novel method utilizing importance sampling to efficiently estimate the probability of harmful outputs for any input query. By creating unsafe versions of the target model, they enhance the likelihood of generating harmful outputs, allowing for sample-efficient estimation. The proposed method significantly reduces the number of samples needed—achieving accurate estimates with 10-20 times fewer samples compared to brute-force Monte Carlo methods. The authors demonstrate that their estimates can reveal model sensitivity to input perturbations and predict deployment risks. This work highlights the feasibility of accurate rare-event estimation for safety evaluations in language models.
Methodology
The authors employ importance sampling to estimate the probability of harmful outputs from language models. They create unsafe proposal models using activation steering, which are designed to produce harmful outputs more frequently than the target model. This allows for low-variance estimates of harmfulness probabilities with significantly fewer samples than traditional Monte Carlo methods. The methodology also incorporates extreme value theory to predict model behavior on unseen queries based on query-level estimates.
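The estimator can be illustrated with a synthetic rare event. The proposal below is a shifted Gaussian standing in for an "unsafe" model; the reweighting step is the same idea as in the paper, but the distributions and event are synthetic stand-ins, not language-model outputs:

```python
import math
import numpy as np

# Importance sampling: draw from a proposal q that makes the rare event more
# likely, then reweight each sample by p(y)/q(y) so the estimate is unbiased
# for the target distribution p.

rng = np.random.default_rng(0)

def is_harmful(y):
    return y > 4.0                      # rare event under the target

# Target p: standard normal (P[y > 4] ~ 3.2e-5). Proposal q: normal shifted to 4.
def log_p(y): return -0.5 * y**2 - 0.5 * math.log(2 * math.pi)
def log_q(y): return -0.5 * (y - 4.0)**2 - 0.5 * math.log(2 * math.pi)

n = 500
samples = rng.normal(loc=4.0, scale=1.0, size=n)          # draw from the proposal
weights = np.exp(log_p(samples) - log_q(samples))         # importance weights
estimate = np.mean(weights * is_harmful(samples))

true_value = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
print(f"IS estimate: {estimate:.2e}  true: {true_value:.2e}")
```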
Results
The proposed method allows for the estimation of harmful output probabilities with only 500 samples, achieving results comparable to those obtained through brute-force Monte Carlo sampling that would typically require thousands of samples. The estimates reveal that harmful outputs can be sensitive to minor changes in input, and the authors demonstrate that their approach can accurately predict deployment risks and model sensitivity.
Implications
This research has significant implications for the safe deployment of language models, providing a framework for evaluating and mitigating risks associated with harmful outputs. It can inform the development of safer AI systems and enhance the understanding of model behavior in real-world applications.
Data-Driven Open-Loop Simulation for Digital-Twin Operator Decision Support in Wastewater Treatment
Time Series
- Introduction of CCSS-RS, a novel data-driven simulator for wastewater treatment decision support.
- Model effectively handles irregular and missing sensor data, crucial for real-world applications.
- Demonstrated significant predictive accuracy improvements over existing models.
- Case studies validate the model's operational utility in real-time decision-making.
Read more
Data-Driven Open-Loop Simulation for Digital-Twin Operator Decision Support in Wastewater Treatment
Summary
This paper addresses the need for effective decision support tools in wastewater treatment plants (WWTPs) through the development of a controlled continuous-time state-space model (CCSS-RS). The model is designed to simulate plant responses under various control plans while accommodating irregular and missing sensor data. It separates historical state inference from future control actions, allowing for accurate long-term predictions over planning horizons of 12 to 36 hours. The CCSS-RS model incorporates advanced techniques such as typed context encoding and gain-weighted forcing to handle the complexities of real-world WWTP data. Evaluated on the Avedøre WWTP dataset, the model demonstrates significant improvements in predictive accuracy compared to existing methods, achieving a reduction in RMSE by 40-46% relative to Neural CDE baselines. The paper also presents four case studies that highlight the operational value of the model, showcasing its ability to inform decision-making in real-time scenarios. Overall, CCSS-RS represents a practical solution for offline scenario screening in industrial wastewater treatment, complementing traditional mechanistic models.
Methodology
The CCSS-RS model employs a controlled continuous-time state-space framework that distinguishes between state variables, control inputs, and exogenous variables. It utilizes typed context encoding to process irregular observations without resampling, and a gain-weighted forcing mechanism to incorporate prescribed controls. The model is evaluated against established baselines and simplified variants to assess its performance in predicting wastewater treatment dynamics.
Results
On the Avedøre WWTP dataset, CCSS-RS achieved a root mean square error (RMSE) of 0.696 and a continuous ranked probability score (CRPS) of 0.349 across 10,000 test windows, outperforming Neural CDE models by 40-46% in RMSE. The model maintained accuracy in predicting key variables such as ammonium and nitrate, even under conditions of sensor outages.
Implications
The CCSS-RS model provides a robust tool for wastewater treatment operators, enabling better decision-making through accurate simulations of plant behavior. Its ability to function without extensive recalibration of mechanistic models positions it as a valuable asset in the push towards smart-water initiatives and Industry 4.0 applications.
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Computer Vision
Large Language Models
Efficient ML
- Identifies sink tokens as a critical barrier to fine-grained video understanding.
- Proposes Sink-Token-aware Pruning (SToP) to enhance existing pruning methods.
- Demonstrates significant performance improvements across diverse benchmarks.
- Validates the effectiveness of SToP in maintaining visual grounding during pruning.
Read more
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Summary
This paper addresses the challenge of high inference latency in Video Large Language Models (Video LLMs) caused by the processing of numerous visual tokens. Existing training-free visual token pruning methods have been shown to reduce computational costs but often lead to performance degradation in fine-grained understanding tasks, particularly those requiring precise visual grounding. The authors identify 'sink tokens'—tokens that attract excessive attention but provide little semantic information—as a significant obstacle to effective video understanding. To mitigate this issue, they propose Sink-Token-aware Pruning (SToP), a novel method that quantifies the sink tendency of each token and integrates this information into existing pruning techniques. SToP enhances the performance of state-of-the-art pruning methods (VisionZip, FastVid, and Holitom) across various benchmarks, including hallucination evaluation and open-ended generation, demonstrating that it can significantly improve model performance while allowing for the pruning of up to 90% of visual tokens. The findings suggest that addressing sink tokens is crucial for maintaining fine-grained understanding in video tasks.
Methodology
The authors conducted a systematic analysis to identify sink tokens and their impact on video understanding. They developed SToP, which introduces a sink score to quantify the sink tendency of tokens and applies this score to existing spatial and temporal pruning methods. The effectiveness of SToP was validated by applying it to various state-of-the-art pruning methods and evaluating performance across multiple benchmarks.
Results
The application of SToP led to significant performance boosts in tasks requiring fine-grained visual understanding, even with up to 90% of visual tokens pruned. The results indicated that existing pruning methods were vulnerable to performance drops in fine-grained tasks, while SToP effectively mitigated this issue.
Implications
The findings suggest that addressing sink tokens is essential for improving the efficiency and effectiveness of Video LLMs in real-world applications, particularly in scenarios requiring detailed visual grounding. This work could lead to more practical deployments of Video LLMs in various domains, including video analysis and interactive AI systems.
A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting
Time Series
Theory
Efficient ML
- Introduces a hybrid autoregressive transformer embedded in a mixed finite element framework for stable forecasting.
- Proves preservation of discrete energies and uniform gradient bounds, addressing the exploding gradient problem.
- Achieves a 65× reduction in model parameters compared to existing models while maintaining high forecasting accuracy.
- Demonstrates real-time surrogate modeling capabilities with significant speedup over traditional simulation methods.
Read more
A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting
Summary
This paper addresses the challenges of autoregressive modeling for chaotic dynamical systems, particularly focusing on stability during training and inference over long time horizons. The authors propose a hybrid technique that integrates an autoregressive transformer within a novel shooting-based mixed finite element scheme, which ensures provable stability. They demonstrate that their approach preserves discrete energies and maintains uniform bounds on gradients, effectively avoiding the exploding gradient problem. By combining this method with a vision transformer, the authors achieve a significant reduction in model parameters while enhancing long-horizon forecasting capabilities. The results show that their model outperforms existing foundation models, achieving a 65× reduction in parameters and enabling real-time surrogate modeling with a 9,000× speedup over traditional simulations. This work highlights the potential of structure-preserving numerical methods in enhancing the stability and efficiency of machine learning models for scientific applications.
Methodology
The authors embed learned neural dynamics within a mixed finite element framework, utilizing finite element exterior calculus (FEEC) to ensure stability. They employ a shooting method to couple trajectory segments and a vision transformer for dynamic embedding, allowing for the extraction of complex nonlinear dynamics from sparse data.
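The shooting idea can be illustrated independently of the finite element machinery. The sketch below is a generic multiple-shooting training objective, assuming a learned one-step map and equal-length trajectory segments; the paper's mixed FEEC discretization and vision-transformer embedding are not reproduced here.

```python
import torch
import torch.nn as nn

class SegmentStepper(nn.Module):
    """Hypothetical learned one-step map u_{t+1} = u_t + f_theta(u_t)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        return u + self.net(u)

def multiple_shooting_loss(stepper: nn.Module, segments: torch.Tensor,
                           seg_len: int, mu: float = 1.0) -> torch.Tensor:
    """Multiple-shooting style objective (illustrative only).

    segments: [num_segments, seg_len + 1, dim] ground-truth snapshots.
    Each segment is rolled out from its own initial state; the loss combines
    within-segment fit with a continuity penalty tying the end of one rollout
    to the start of the next, keeping backpropagation horizons short and
    avoiding the exploding-gradient problem of a single long rollout.
    """
    fit, continuity = 0.0, 0.0
    ends = []
    for s in range(segments.shape[0]):
        u = segments[s, 0]
        for t in range(seg_len):
            u = stepper(u)
            fit = fit + ((u - segments[s, t + 1]) ** 2).mean()
        ends.append(u)
    for s in range(segments.shape[0] - 1):
        continuity = continuity + ((ends[s] - segments[s + 1, 0]) ** 2).mean()
    return fit + mu * continuity
```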
Results
The proposed method successfully forecasts chaotic systems over 10,000 Lyapunov times, significantly outperforming neural ODEs. It matches state-of-the-art accuracy on shear flow benchmarks with 65× fewer parameters than existing models and achieves a 9,000× speedup in real-time simulations of a fusion component using only 12 training simulations.
Implications
This research has significant implications for scientific computing and real-time simulations, particularly in fields requiring accurate modeling of chaotic dynamics. The hybrid approach can facilitate faster design iterations and optimizations in engineering applications, potentially transforming how complex systems are simulated and analyzed.
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning
NLP
Large Language Models
- Identifies three failure modes in numerical reasoning: reasoning inefficiency, data scarcity for logical supervision, and header dependency.
- Introduces operation sketches to enhance contextual reasoning and reduce reliance on surface-level patterns.
- Combines header anonymization and self-supervised learning to improve data efficiency and robustness.
- Demonstrates superior performance of TaNOS over traditional SFT methods in both in-domain and cross-domain settings.
Read more
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning
Summary
This paper addresses the challenge that models for numerical reasoning over expert-domain tables often achieve high accuracy in familiar contexts but degrade under domain shift. The authors introduce TaNOS, a continual pre-training framework designed to enhance the robustness of numerical reasoning by decoupling domain semantics from numerical operation structures. TaNOS consists of three main components: header anonymization to reduce lexical memorization, operation sketches that provide minimal structural cues, and self-supervised pretraining that generates correctness-guaranteed program-question pairs. The framework aims to improve transferability and generalization across datasets. The authors demonstrate that TaNOS significantly outperforms traditional supervised fine-tuning (SFT), achieving 80.13% execution accuracy on the FinQA benchmark with only 10% of the training data, compared to 73.97% for SFT with full training data. Additionally, TaNOS exhibits minimal performance degradation across domain shifts, highlighting its potential for robust generalization in numerical reasoning tasks.
Methodology
The authors propose TaNOS, which integrates three mechanisms: operation sketches to provide structural cues, self-supervised learning to generate program-question pairs from unlabeled tables, and header anonymization to mitigate header dependency. This approach aims to enhance the model's ability to generalize across different datasets while maintaining high accuracy.
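A toy illustration of the header-anonymization component, together with a hypothetical operation-sketch string; the placeholder format (`COL_i`, the `subtract(...)` sketch) is invented for illustration and need not match the paper's serialization.

```python
def anonymize_headers(table: dict) -> tuple[dict, dict]:
    """Replace domain-specific column names with neutral placeholders so the
    model must rely on the question and the operation structure rather than
    memorized header strings (illustrative; not the paper's exact scheme)."""
    mapping = {h: f"COL_{i}" for i, h in enumerate(table["headers"])}
    anon = {"headers": [mapping[h] for h in table["headers"]], "rows": table["rows"]}
    return anon, {v: k for k, v in mapping.items()}

table = {"headers": ["Revenue 2021", "Revenue 2022"], "rows": [[10.2, 12.8]]}
anon_table, restore = anonymize_headers(table)
# A minimal operation sketch paired with the anonymized table might look like
# "subtract(COL_1, COL_0)" -- a structural cue carrying no domain vocabulary.
```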
Results
TaNOS achieves 80.13% execution accuracy on the FinQA benchmark using only 10% of the training data, outperforming the SFT baseline of 73.97% with full training data. In domain-shift experiments, TaNOS maintains a performance gap of less than 2 percentage points, while SFT shows over a 10 percentage point gap, indicating improved robustness.
Implications
The findings suggest that TaNOS can be applied to various expert-domain table reasoning tasks, potentially improving the performance of language models in finance, engineering, and biology. The framework's ability to generalize across domains could lead to more reliable AI systems in critical applications.
Conditional anomaly detection with soft harmonic functions
Graph Learning
- Introduction of a non-parametric method for conditional anomaly detection using soft harmonic functions.
- Regularization techniques to avoid detection of isolated and fringe points.
- Development of a backbone graph for efficient computation in large datasets.
- Demonstration of the method's efficacy on synthetic, UCI, and real-world datasets.
Read more
Conditional anomaly detection with soft harmonic functions
Summary
This paper addresses the problem of conditional anomaly detection (CAD), which focuses on identifying unusual instances based on a subset of variables given the values of others. The authors propose a novel non-parametric approach utilizing soft harmonic functions to estimate label confidence and detect anomalous mislabeling. The method incorporates regularization to mitigate the influence of isolated and fringe points, which are often problematic in anomaly detection. The authors demonstrate the effectiveness of their approach through experiments on synthetic datasets, UCI ML datasets, and a real-world electronic health record dataset, showcasing its ability to identify unusual patient-management decisions. The paper emphasizes the importance of context in anomaly detection and presents a framework that leverages neighborhood interactions to improve label consistency and anomaly scoring.
Methodology
The proposed method employs a similarity graph of instances to propagate label information and assess label consistency in the neighborhood of data points. It includes a regularization component to reduce the confidence in predictions for isolated and fringe points. Additionally, the authors introduce a backbone graph to facilitate efficient computation, allowing for scalable anomaly detection.
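A compact sketch of the soft harmonic solution on a similarity graph, assuming the objective trades off graph smoothness against fidelity to observed labels and adds a diagonal regularizer for isolated and fringe points; the constants and the exact regularization used in the paper may differ.

```python
import numpy as np

def soft_harmonic_scores(W: np.ndarray, y: np.ndarray,
                         c_label: float = 1.0, gamma: float = 1e-2) -> np.ndarray:
    """Illustrative soft harmonic label propagation on a similarity graph.

    W: symmetric similarity matrix; y: observed labels in {0, 1}.
    The soft solution minimizes f'Lf + gamma*||f||^2 + c_label*||f - y||^2.
    The diagonal regularizer pulls weakly connected (isolated/fringe) points
    toward an uninformative score instead of letting them dominate the ranking.
    """
    d = W.sum(axis=1)
    L = np.diag(d) - W                       # unnormalized graph Laplacian
    n = W.shape[0]
    f = np.linalg.solve(L + (gamma + c_label) * np.eye(n), c_label * y)
    # Anomaly score: disagreement between the observed label and the
    # confidence propagated from the neighborhood.
    return np.abs(y - f)
```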
Results
The proposed method outperformed several baseline approaches in detecting unusual labels across various datasets. It effectively identified anomalies in both synthetic and real-world scenarios, particularly in the context of patient management decisions, demonstrating its practical applicability.
Implications
This research has significant implications for fields where context-dependent anomaly detection is crucial, such as healthcare, finance, and social networks. The ability to accurately identify unusual behaviors or outcomes can lead to improved decision-making and risk management in these domains.
SCM: Sleep-Consolidated Memory with Algorithmic Forgetting for Large Language Models
Large Language Models
NLP
Theory
- Introduces a unified memory architecture for conversational agents that mimics biological memory processes.
- Implements multi-dimensional value tagging for richer memory prioritization.
- Achieves perfect recall in multi-turn conversations while significantly reducing memory noise.
- Describes a complete system design including synchronization and visualization tools.
Read more
SCM: Sleep-Consolidated Memory with Algorithmic Forgetting for Large Language Models
Summary
The paper introduces SCM (Sleep-Consolidated Memory), a novel memory architecture for large language models (LLMs) that integrates principles from neuroscience to overcome the limitations of current memory systems. Traditional LLMs lack persistent and structured memory, often relying on context windows or unbounded vector databases without effective consolidation or forgetting mechanisms. SCM addresses these issues by implementing five core components inspired by human memory: a limited-capacity working memory, multi-dimensional importance tagging, offline sleep-stage consolidation (including NREM and REM phases), intentional value-based forgetting, and a computational self-model for introspection. The architecture allows for structured semantic encoding of user inputs and prioritizes memory through a four-dimensional importance vector. The prototype demonstrates perfect recall accuracy in ten-turn conversations while reducing memory noise by 90.9% through adaptive forgetting, with memory search latency remaining under one millisecond even with numerous stored concepts. This work lays the groundwork for future advancements in LLM memory systems, emphasizing the need for consolidation, prioritization, and forgetting.
Methodology
SCM employs a memory architecture that encodes user inputs into structured semantic concepts, utilizes a limited working memory, and incorporates sleep-stage consolidation and intentional forgetting mechanisms. The system also features a computational self-model for introspection, allowing it to manage and prioritize memory effectively.
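A toy version of the value-based forgetting step during consolidation, assuming a four-dimensional importance vector and an exponential recency decay; the dimensions, decay schedule, and threshold are placeholders, not the paper's parameters.

```python
from dataclasses import dataclass, field
import math, time

@dataclass
class MemoryItem:
    concept: str
    # Hypothetical 4-d importance vector (e.g. recency, salience, task
    # relevance, repetition); the paper's dimensions may differ.
    importance: tuple = (0.5, 0.5, 0.5, 0.5)
    created: float = field(default_factory=time.time)

def consolidate(store: list[MemoryItem], half_life_s: float = 3600.0,
                threshold: float = 0.3) -> list[MemoryItem]:
    """Illustrative 'sleep' pass: decay a scalar value derived from the
    importance vector over time and intentionally forget low-value items,
    the mechanism the summary attributes to the consolidation stages."""
    now = time.time()
    kept = []
    for item in store:
        decay = math.exp(-(now - item.created) * math.log(2) / half_life_s)
        value = decay * sum(item.importance) / len(item.importance)
        if value >= threshold:
            kept.append(item)
    return kept
```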
Results
The SCM prototype achieved perfect recall accuracy across a standardized benchmark of eight tests involving ten-turn conversations. It reduced memory noise by 90.9% through adaptive forgetting, while maintaining memory search latency below one millisecond, even with hundreds of stored concepts.
Implications
The development of SCM could lead to more advanced conversational agents capable of maintaining context over longer interactions, improving user experience in applications such as personal assistants, customer service bots, and interactive storytelling. It also opens avenues for further research into biologically-inspired memory systems in AI.
Conditional anomaly detection using soft harmonic functions: An application to clinical alerting
Graph Learning
Theory
Efficient ML
- Introduces a non-parametric approach for conditional anomaly detection using soft harmonic functions.
- Focuses on identifying unusual patient-management decisions to prevent medical errors.
- Incorporates regularization to handle isolated and fringe points in the data.
- Demonstrates effectiveness on a real-world electronic health record dataset.
Read more
Conditional anomaly detection using soft harmonic functions: An application to clinical alerting
Summary
This paper addresses the critical issue of timely detection of anomalies in clinical settings, particularly focusing on identifying unusual patient-management decisions that may indicate medical errors. The authors propose a novel non-parametric approach for conditional anomaly detection based on soft harmonic functions. This method estimates the confidence of labels to detect anomalous mislabeling while incorporating regularization to mitigate the detection of isolated examples and fringe points. The proposed technique is evaluated on a real-world electronic health record dataset, demonstrating its effectiveness in identifying unusual labels compared to several baseline methods. The study highlights the potential of machine learning to enhance clinical alert systems by leveraging historical medical records to detect anomalies that could lead to improved patient care and reduced healthcare costs.
Methodology
The authors utilize a soft harmonic solution to compute anomaly scores, which are derived from a graph-based approach where nodes represent data instances and edges encode similarities. The method involves regularizing the graph Laplacian to reduce the confidence of predictions for isolated and fringe points, thereby improving the reliability of anomaly detection.
Results
The proposed method successfully identifies anomalous labels in the dataset, outperforming traditional baseline approaches in terms of accuracy and reliability. The results indicate that the soft harmonic functions effectively capture the underlying structure of the data, leading to better anomaly detection in clinical contexts.
Implications
This research has significant implications for clinical practice, as it demonstrates how machine learning can be applied to enhance the detection of medical errors through anomaly detection. By improving alert systems, the approach could lead to better patient outcomes and reduced healthcare costs.
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Large Language Models
NLP
Theory
- Introduction of the Stealth Pretraining Seeding (SPS) threat model for LLMs.
- Development of the PermaFrost-Attack framework to study latent conceptual poisoning.
- Introduction of three geometric diagnostics for analyzing adversarial influence in LLMs.
- Empirical evidence showing persistent unsafe behaviors induced by SPS across multiple model families.
Read more
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Summary
This paper introduces a novel threat model called Stealth Pretraining Seeding (SPS), which highlights the vulnerabilities of aligned large language models (LLMs) to adversarial manipulation during their training phase. The authors propose the PermaFrost-Attack, where adversaries can subtly introduce poisoned content across various stealth websites, making it difficult to detect during dataset construction. The attack exploits the web-scale pretraining process, allowing small, benign-seeming fragments to bias the model's behavior without immediate detection. The paper presents a controlled framework for studying this latent conceptual poisoning and introduces three geometric diagnostics—Thermodynamic Length, Spectral Curvature, and Infection Traceback Graph—to analyze how adversarial influences propagate through the model's latent space. Empirical results demonstrate that these triggers can induce persistent unsafe behaviors in multiple LLMs, which remain hidden under standard evaluation methods. The findings underscore the need for more robust detection and analysis techniques to address these vulnerabilities in future foundation models.
Methodology
The authors operationalize the SPS threat model through a controlled fine-tuning setup, using a discrete trigger (<00TRIGGER00>) to manipulate concept-behavior relationships. They employ geometric diagnostics to trace the propagation of adversarial influence in the model's latent space, allowing for a systematic examination of vulnerabilities that may not be visible through conventional evaluation methods.
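As a rough illustration of the diagnostic side, the sketch below measures a layer-wise path length of the hidden representations for one input; comparing this quantity on prompts with and without a suspected trigger could surface the kind of latent detour the paper's Thermodynamic Length diagnostic targets. The actual metric (and its Spectral Curvature counterpart) may be defined quite differently.

```python
import torch

def latent_path_length(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """Hypothetical 'thermodynamic length'-style diagnostic: the summed distance
    between consecutive layer representations of the same input.

    hidden_states: list of [seq_len, dim] tensors, one per layer, e.g. obtained
    from a transformer run with output_hidden_states=True.
    """
    pooled = [h.mean(dim=0) for h in hidden_states]   # per-layer sentence summary
    steps = [torch.norm(pooled[i + 1] - pooled[i]) for i in range(len(pooled) - 1)]
    return torch.stack(steps).sum()
```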
Results
The study finds that the PermaFrost triggers can induce significant behavioral deviations in LLMs, which persist even after training. The geometric diagnostics effectively reveal how adversarial influences propagate through the model, demonstrating that standard evaluation techniques fail to detect these latent vulnerabilities.
Implications
The findings suggest that LLMs are susceptible to subtle adversarial attacks that can compromise their safety and reliability. This highlights the need for improved dataset construction practices and evaluation methodologies to identify and mitigate such vulnerabilities in future AI systems.
Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
NLP
- Novel multi-method approach combining transfer learning and unsupervised clustering for morphological analysis.
- Discovered 2,455 noun class labels in Giriama, significantly increasing the available morphological data.
- Identified two previously undocumented morphological patterns in Giriama.
- Achieved 78.2% lemmatization accuracy on known paradigms and high segmentation rates.
Read more
Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Summary
This paper presents a novel approach for discovering morphological features in low-resource Bantu languages, specifically focusing on Giriama, which has limited annotated data. The authors combine cross-lingual transfer learning with unsupervised clustering to enhance morphological analysis. The methodology utilizes a character-level pretrained model (ByT5) to leverage the similarities between Giriama and a high-resource language, Swahili, which shares approximately 60% vocabulary overlap. The proposed pipeline successfully identifies noun class assignments for 2,455 words, marking a significant increase from the existing 91 labeled paradigms. Additionally, the study uncovers two new morphological patterns in Giriama: an a- prefix variant for Class 2 and a contracted k'- prefix. The results demonstrate a lemmatization accuracy of 78.2% on known verb paradigms and high segmentation and lemmatization rates across various word classes. The authors emphasize the complementary strengths of transfer learning and unsupervised clustering, showcasing the potential for scalable morphological discovery in other low-resource languages. The code and discovered lexicons are made publicly available to support further research.
Methodology
The methodology consists of a three-component pipeline that includes transfer learning via cross-lingual projection using a K-nearest neighbors approach, unsupervised clustering with UMAP and K-means, and ensemble validation through weighted voting. This approach allows for the discovery of morphological features with minimal supervision.
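A sketch of the transfer and clustering components under plain assumptions: embeddings come from a character-level encoder such as ByT5, Swahili items carry noun-class labels, and the neighbour count and number of clusters are placeholder hyperparameters rather than the paper's settings.

```python
import numpy as np
import umap                              # umap-learn
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def project_and_cluster(giriama_emb: np.ndarray,
                        swahili_emb: np.ndarray,
                        swahili_labels: np.ndarray,
                        n_clusters: int = 18,
                        k: int = 5):
    """Illustrative version of two pipeline components:
    (1) K-nearest-neighbour projection of Swahili noun-class labels onto
        Giriama word embeddings (cross-lingual transfer), and
    (2) UMAP + K-means clustering of the Giriama embeddings themselves.
    The paper additionally validates candidates by weighted voting across
    components, which is omitted here.
    """
    knn = KNeighborsClassifier(n_neighbors=k).fit(swahili_emb, swahili_labels)
    projected = knn.predict(giriama_emb)

    reduced = umap.UMAP(n_components=10, random_state=0).fit_transform(giriama_emb)
    clusters = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(reduced)
    return projected, clusters
```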
Results
The approach led to the discovery of 2,455 noun class labels, a 27-fold increase from the existing data. The model achieved 78.2% lemmatization accuracy on 444 known Giriama verb paradigms and expanded the corpus to 19,624 words with 97.3% segmentation and 86.7% lemmatization rates across major word classes.
Implications
The findings have significant implications for the documentation and analysis of low-resource languages, particularly within the Bantu language family. The methodology can be applied to other understudied languages, potentially aiding in linguistic research and natural language processing applications.
Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control
Graph Learning
- Distance-misaligned training highlights the mismatch between task-relevant information and model communication strategies.
- The preferred graph-distance bias varies with task locality, indicating the need for adaptive control.
- An oracle adaptive controller outperforms fixed bias settings, demonstrating the importance of task-specific distance targets.
- Distance-resolved diagnostics can effectively identify over-globalizing and under-reaching failures in Graph Transformers.
Read more
Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control
Summary
This paper investigates the performance of Graph Transformers in relation to the structural bias introduced during training, particularly focusing on the concept of distance-misaligned training. The authors explore how the allocation of communication across graph distances can lead to failures in model performance, especially when the task requires either local or long-range interactions. Through a synthetic benchmark using contextual stochastic block model graphs, they define distance-misaligned training as a mismatch between the location of label-relevant information and the model's communication strategy. The study reveals that the preferred graph-distance bias shifts systematically with task locality, and that an oracle adaptive controller can significantly improve performance by aligning communication with task requirements. The findings suggest that understanding and diagnosing distance-resolved mismatches can enhance the design of graph-aware control mechanisms in Graph Transformers.
Methodology
The authors employed a dense node-level Graph Transformer with a graph-distance bias added to the attention logits. They conducted experiments on a synthetic benchmark involving contextual stochastic block model graphs, defining local and far-range signals for node classification tasks. The study measured training regimes by computing task-side distance profiles and attention profiles, summarizing mismatches using mean-distance gaps and Wasserstein distances.
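A minimal sketch of attention logits augmented with a learnable per-hop distance bias, the control knob the study varies; the paper's exact parameterization (per-head tables, the adaptive controller's update rule) is not reproduced here.

```python
import torch
import torch.nn.functional as F

def biased_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     hop_distance: torch.Tensor, bias_table: torch.Tensor) -> torch.Tensor:
    """Dense node-level attention with a graph-distance bias (illustrative).

    q, k, v: [num_nodes, dim]; hop_distance: [num_nodes, num_nodes] integer
    shortest-path distances; bias_table: [max_dist + 1] learnable scalars, one
    per hop distance. Shifting this table toward small or large distances
    makes communication more local or more global.
    """
    d = hop_distance.long().clamp(max=bias_table.numel() - 1)
    logits = q @ k.T / q.shape[-1] ** 0.5 + bias_table[d]
    return F.softmax(logits, dim=-1) @ v
```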
Results
The results indicated that the preferred graph-distance bias increases with task locality, with the oracle target-gap controller closely tracking the best fixed bias settings. The adaptive control significantly improved test accuracy compared to neutral and zero-gap training methods. The analysis of mismatch curves revealed distinct failure modes, with the same control knob leading to either over-globalizing or under-reaching behaviors depending on the task's locality.
Implications
The findings suggest that incorporating distance-resolved diagnostics into the training of Graph Transformers can lead to more effective models that adapt to specific task requirements. This approach could enhance applications in various domains where graph-based learning is crucial, such as social network analysis, recommendation systems, and biological network modeling.
Even More Guarantees for Variational Inference in the Presence of Symmetries
Theory
Optimization
- Establishes sufficient conditions for exact recovery of the mean using FKL and α-divergences.
- Extends previous results on robust variational inference under target symmetries.
- Provides guidelines for selecting variational families based on the derived conditions.
- Highlights potential optimization failures when sufficient conditions are not satisfied.
Read more
Even More Guarantees for Variational Inference in the Presence of Symmetries
Summary
This paper addresses the challenges of variational inference (VI) when the variational family is misspecified, particularly in the context of target distributions exhibiting symmetries. The authors build upon previous work that established conditions for the exact recovery of the target mean and correlation matrix using location-scale families. They derive new sufficient conditions for the exact recovery of the mean when employing the forward Kullback-Leibler (FKL) divergence and α-divergences. The paper emphasizes the importance of understanding how optimization can fail to recover the target mean if these conditions are not met, providing practical guidelines for selecting appropriate variational families and α-values. The results extend the theoretical framework of VI, offering insights into the behavior of various divergences under symmetry assumptions.
Methodology
The authors extend the analysis of previous works by deriving new theoretical results regarding the conditions under which the mean of a target distribution can be exactly recovered using FKL and α-divergences. They utilize concepts from location-scale families and symmetry properties of distributions to formulate their results.
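For orientation, the two divergences discussed are commonly written as below, with π the target and q_λ the variational approximation; the α-divergence is shown in one standard convention, and the paper may use a different parameterization.

```latex
\mathrm{KL}(\pi \,\|\, q_\lambda) = \int \pi(\theta)\,\log\frac{\pi(\theta)}{q_\lambda(\theta)}\,d\theta,
\qquad
D_\alpha(\pi \,\|\, q_\lambda) = \frac{1}{\alpha(\alpha - 1)}\left(\int \pi(\theta)^{\alpha}\, q_\lambda(\theta)^{1-\alpha}\, d\theta - 1\right)
```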
Results
The paper presents complementary sufficient conditions for the exact recovery of the mean when optimizing the FKL and α-divergences. For the FKL, mild assumptions on the base distribution lead to guarantees of exact recovery, while for the α-divergence, a more nuanced criterion is established that depends on the value of α. The authors also discuss the implications of these conditions on the choice of variational families in practice.
Implications
The findings have significant implications for practitioners using variational inference in machine learning, particularly in scenarios where the target distribution is not well-represented by the chosen variational family. The results can guide the selection of appropriate divergence measures and variational families, potentially improving the accuracy of inference in various applications.