gistml

By James Asher

Daily summaries of the latest Machine Learning research papers from Arxiv.

2026-02-05 • Found 24 papers

Achieving Linear Speedup for Composite Federated Learning

Kun Huang, Shi Pu
  • FedNMap achieves linear speedup for nonconvex composite federated learning under standard assumptions.
  • The method uses a normal map-based update scheme to address the challenges posed by nonsmooth regularizers.
  • A local correction strategy is incorporated to mitigate the effects of data heterogeneity across clients.
  • FedNMap does not require restrictive assumptions such as homogeneous objectives or bounded subgradients.
  • The algorithm demonstrates communication efficiency with a 1/(nQ) dependence in the dominant term, where n is the number of clients and Q is the number of local updates.
Abstract
This paper introduces FedNMap, a novel federated learning (FL) algorithm designed to address composite optimization problems involving a smooth loss function and a potentially nonsmooth regularizer. The proposed method leverages a normal map-based update scheme to handle the nonsmooth regularizer and incorporates a local correction strategy to address data heterogeneity across clients. FedNMap achieves linear speedup in terms of both the number of participating clients and local updates, even for nonconvex objectives, under standard assumptions such as smooth local losses, weak convexity of the regularizer, and bounded stochastic gradient variance. This is the first work to establish linear speedup for nonconvex composite FL problems, filling a significant gap in the literature. The method does not require restrictive assumptions such as homogeneous objectives or bounded subgradients, making it broadly applicable to real-world scenarios like statistical learning with sparsity-inducing norms, constrained optimization, and model pruning.
Methodology
The authors propose FedNMap, which integrates a normal map-based update scheme with a local correction term. The normal map approach ensures unbiasedness in updates, even with nonsmooth regularizers, while the local correction strategy addresses data heterogeneity. The method is analyzed under standard assumptions, including smooth local losses, weak convexity of the regularizer, and bounded stochastic gradient variance. Theoretical analysis demonstrates the algorithm's convergence and linear speedup properties.
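For reference, the normal map for a composite objective min_x f(x) + r(x) with step size λ is commonly defined as below; this is standard background on normal-map schemes rather than FedNMap's exact update, which also includes the local correction term.
```latex
F_{\mathrm{nor}}(z) \;=\; \nabla f\!\bigl(\operatorname{prox}_{\lambda r}(z)\bigr)
  \;+\; \tfrac{1}{\lambda}\bigl(z - \operatorname{prox}_{\lambda r}(z)\bigr),
\qquad
\operatorname{prox}_{\lambda r}(z) \;=\; \arg\min_{x}\; r(x) + \tfrac{1}{2\lambda}\,\|x - z\|^{2}.
```
A zero of F_nor corresponds to a stationary point x* = prox_{λr}(z*) of the composite problem, which is why normal-map residuals serve as the natural stationarity measure for methods of this type.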
Results
FedNMap achieves an ε-solution with a communication complexity that exhibits a 1/(nQ) dependence in the dominant term, demonstrating linear speedup with respect to both the number of clients (n) and the number of local updates (Q). This result matches the performance of state-of-the-art methods for smooth objectives and extends it to the nonconvex composite setting, where prior works have not established similar results.
Implications
The proposed method has significant implications for federated learning applications involving nonsmooth regularizers, such as sparsity-inducing norms, constrained optimization, and model pruning. By achieving linear speedup under standard assumptions, FedNMap enhances the scalability and efficiency of federated learning in real-world, heterogeneous environments.
View on arXiv

BinaryPPO: Efficient Policy Optimization for Binary Classification

Punya Syon Pandey, Zhijing Jin
  • BinaryPPO reframes binary classification as a decision-making problem under uncertainty using offline reinforcement learning.
  • The confidence-weighted reward function penalizes incorrect or uncertain predictions, promoting robust decision policies.
  • BinaryPPO achieves substantial accuracy improvements (40–60 percentage points) across diverse benchmarks, outperforming supervised baselines.
  • The framework operates entirely offline, making it practical for sensitive domains like LLM safety evaluations.
  • Reward shaping, advantage scaling, and policy stability are critical components driving the performance improvements.
Abstract
BinaryPPO introduces a novel offline reinforcement learning framework for binary classification tasks, addressing challenges such as label noise, class imbalance, and sparse supervision that hinder traditional supervised fine-tuning (SFT) methods. By reformulating binary classification as a reward maximization problem, BinaryPPO employs a variant of Proximal Policy Optimization (PPO) with a confidence-weighted reward function. This approach penalizes uncertain or incorrect predictions, enabling robust decision-making under uncertainty. The framework operates entirely offline, making it suitable for sensitive domains such as toxicity detection, factuality verification, and causal inference. Across eight domain-specific benchmarks, BinaryPPO consistently outperforms supervised baselines, achieving accuracy improvements of 40–60 percentage points and reaching up to 99%. The paper also provides an in-depth analysis of reward shaping, advantage scaling, and policy stability, demonstrating the efficacy of confidence-based reward design as an alternative to SFT.
Methodology
BinaryPPO employs a variant of Proximal Policy Optimization (PPO) with a confidence-weighted reward function to optimize binary classification tasks. The framework operates offline, leveraging static datasets without requiring online interaction. The reward function integrates model confidence to penalize uncertain or incorrect predictions, while the composite loss function combines PPO loss, value function regularization, cross-entropy loss, and entropy regularization to stabilize learning.
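As a concrete illustration of the confidence-weighted reward idea, the sketch below rewards correct predictions and penalizes wrong ones in proportion to the model's confidence; the exact reward formula, its scaling, and the composite loss used in BinaryPPO are not reproduced here, so treat this as an assumption-laden toy.
```python
import torch

def confidence_weighted_reward(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Toy confidence-weighted reward for binary classification (illustrative, not the paper's exact formula).

    Correct, confident predictions earn a positive reward; incorrect or
    uncertain predictions are penalized in proportion to the confidence
    placed on the predicted class.
    """
    probs = torch.softmax(logits, dim=-1)      # (batch, 2) class probabilities
    confidence, preds = probs.max(dim=-1)      # confidence in the chosen class
    correct = (preds == labels).float()
    return (2.0 * correct - 1.0) * confidence  # in [-1, 1]: +conf if right, -conf if wrong
```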
Results
BinaryPPO demonstrated consistent performance improvements across eight benchmarks, achieving accuracy gains of 40–60 percentage points and reaching up to 99% accuracy. It outperformed supervised fine-tuning methods and standard PPO across tasks such as toxicity detection, factuality verification, sentiment analysis, and causal inference. The framework also exhibited stable loss convergence and robust decision-making under noisy supervision.
Implications
BinaryPPO provides a robust alternative to supervised fine-tuning for binary classification tasks, particularly in domains with noisy labels, sparse supervision, or class imbalance. Its offline nature makes it suitable for sensitive applications like harmful content detection and LLM safety evaluations. The confidence-based reward design could inspire future reinforcement learning approaches for classification tasks and improve reliability in AI systems operating under uncertainty.
View on arXiv

DeepDFA: Injecting Temporal Logic in Deep Learning for Sequential Subsymbolic Applications

Elena Umili, Francesco Argenziano, Roberto Capobianco
  • DeepDFA integrates temporal logic into neural networks using differentiable layers based on Deterministic Finite Automata (DFA) or Moore Machines.
  • The framework supports symbolic knowledge injection into subsymbolic domains, addressing the symbol grounding problem in sequential tasks.
  • DeepDFA is applied to static sequence classification (e.g., video activity recognition) and non-Markovian reinforcement learning tasks.
  • Experimental results show that DeepDFA outperforms traditional deep learning models and other neurosymbolic systems in temporal knowledge integration.
  • The approach provides a generalizable method for combining perception and reasoning in sequential decision-making tasks.
Abstract
This paper introduces DeepDFA, a neurosymbolic framework designed to integrate temporal logic into deep learning models for sequential tasks involving subsymbolic data. DeepDFA incorporates high-level temporal rules, expressed as Deterministic Finite Automata (DFA) or Moore Machines, into neural architectures by modeling these rules as continuous, differentiable layers. This approach enables the injection of symbolic knowledge into subsymbolic domains while maintaining compatibility with gradient-based optimization. The framework is applied to two key settings: static image sequence classification and policy learning in non-Markovian reinforcement learning environments. Experimental results demonstrate that DeepDFA outperforms traditional sequential models (e.g., LSTMs, GRUs, Transformers) and other neurosymbolic systems, achieving state-of-the-art performance in temporal knowledge integration. The work highlights the potential of DeepDFA to bridge the gap between symbolic reasoning and subsymbolic learning in complex sequential tasks.
Methodology
DeepDFA leverages a continuous and differentiable logic layer to encode temporal rules as neural network components. These rules are represented using Deterministic Finite Automata (DFA) or Moore Machines, which are compatible with gradient-based optimization. The framework enables symbolic knowledge to be injected into the learning process by encoding temporal logic as fixed parameters or by defining loss functions that encourage adherence to these rules. The system also learns the perceptual grounding of symbols from data, making it applicable to subsymbolic domains where grounding functions are not predefined.
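A minimal sketch of the core idea, a DFA whose transitions are continuous and differentiable, is shown below; the state and symbol counts, the softmax parameterization, and the grounding network that produces the soft symbols are placeholders, not DeepDFA's actual architecture.
```python
import torch
import torch.nn as nn

class SoftDFALayer(nn.Module):
    """Minimal differentiable DFA-style layer (illustrative, not DeepDFA's exact parameterization).

    A probabilistic transition tensor T[s, q, q'] gives the probability of moving
    from state q to q' on symbol s. Feeding in soft symbol predictions keeps the
    whole rollout differentiable for gradient-based training.
    """

    def __init__(self, n_symbols: int, n_states: int):
        super().__init__()
        self.logits = nn.Parameter(torch.randn(n_symbols, n_states, n_states))

    def forward(self, symbol_probs: torch.Tensor) -> torch.Tensor:
        # symbol_probs: (T, n_symbols) soft symbols grounded from raw inputs.
        T_mat = torch.softmax(self.logits, dim=-1)   # row-stochastic transitions per symbol
        n_states = T_mat.shape[-1]
        belief = torch.zeros(n_states)
        belief[0] = 1.0                              # start in state 0
        for probs_t in symbol_probs:
            # Mix transition matrices by the soft symbol, then propagate the state belief.
            step = torch.einsum("s,sqr->qr", probs_t, T_mat)
            belief = belief @ step
        return belief                                # distribution over DFA states
```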
Results
DeepDFA achieves state-of-the-art performance in temporal knowledge integration across multiple benchmarks. It outperforms traditional sequential models (e.g., RNNs, Transformers) and other neurosymbolic systems (e.g., FuzzyDFA, NeSyA) in tasks such as static sequence classification and non-Markovian reinforcement learning. These results demonstrate the framework's effectiveness in bridging symbolic reasoning and subsymbolic learning.
Implications
DeepDFA has significant implications for domains requiring both perception and reasoning, such as robotics, healthcare, and business process management. Its ability to integrate temporal logic into neural networks can improve decision-making in sequential tasks, enabling more robust and interpretable AI systems. The framework also advances the field of neurosymbolic AI by addressing the symbol grounding problem in temporally extended domains.
View on arXiv

EVE: Efficient Verification of Data Erasure through Customized Perturbation in Approximate Unlearning

Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Luoyu Chen, Shui Yu
  • EVE eliminates the need for backdoor embedding during the initial model training phase, making it more practical and efficient.
  • Customized perturbations are designed to ensure both the effectiveness of unlearning and the alteration of model predictions for verification.
  • The method formalizes perturbation generation as an adversarial optimization problem, aligning gradients to achieve verification objectives.
  • EVE achieves significant speedup compared to backdoor-based methods while maintaining comparable verification accuracy.
  • Statistical hypothesis testing is introduced to provide theoretical guarantees for the success of unlearning verification.
Abstract
This paper introduces EVE, a novel method for verifying machine unlearning that does not require involvement in the initial training process of the model, addressing inefficiencies in existing backdoor-based verification approaches. EVE leverages customized perturbations applied to unlearning data to induce changes in the model's decision boundary for specified samples, enabling users to observe prediction shifts as verification signals. The perturbation generation is formalized as an adversarial optimization problem, aligning gradients of the unlearning operation with gradients of boundary shifts for target samples. Additionally, statistical hypothesis testing is employed to provide theoretical guarantees for verification. Extensive experiments across multiple benchmarks and datasets demonstrate that EVE significantly outperforms state-of-the-art methods in efficiency while maintaining comparable verification accuracy.
Methodology
EVE applies customized perturbations to unlearning data, ensuring prediction changes for specified samples after unlearning. Perturbation generation is modeled as an adversarial optimization problem, aligning gradients of the unlearning operation with gradients of decision boundary shifts. Statistical hypothesis testing is used to validate the verification results.
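The gradient-alignment idea can be sketched as below: a perturbation on the unlearning data is optimized so that the parameter gradient of unlearning it aligns with the gradient that would shift the decision boundary for the target samples. The loss form, the perturbation constraint, and the choice of shifted labels are assumptions for illustration, not EVE's exact objective.
```python
import torch
import torch.nn.functional as F

def gradient_alignment_loss(model, loss_fn, unlearn_x, delta, unlearn_y, target_x, target_y_shift):
    """Illustrative gradient-alignment objective in the spirit of EVE (details assumed).

    `delta` is the learnable perturbation added to the unlearning samples. We maximize
    the cosine similarity between (i) the parameter gradient of the loss on the perturbed
    unlearning data and (ii) the parameter gradient that would move the model's predictions
    on the target samples toward shifted labels.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    g_unlearn = torch.autograd.grad(
        loss_fn(model(unlearn_x + delta), unlearn_y), params, create_graph=True)
    g_target = torch.autograd.grad(
        loss_fn(model(target_x), target_y_shift), params)
    flat_u = torch.cat([g.reshape(-1) for g in g_unlearn])
    flat_t = torch.cat([g.reshape(-1) for g in g_target]).detach()
    return -F.cosine_similarity(flat_u, flat_t, dim=0)  # minimizing this aligns the two gradients
```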
Results
EVE demonstrated superior efficiency and effectiveness across three approximate unlearning benchmarks and four datasets. It achieved significant speedup compared to backdoor-based methods, as it does not rely on the initial training process. Verification accuracy was comparable to state-of-the-art methods, confirming the reliability of EVE.
Implications
EVE provides a practical and efficient solution for verifying machine unlearning, addressing privacy concerns in compliance with regulations like GDPR. Its independence from the initial training phase makes it suitable for real-world applications, enabling scalable and reliable verification of data erasure in machine learning systems.
View on arXiv

From Inexact Gradients to Byzantine Robustness: Acceleration and Optimization under Similarity

Renaud Gaucher, Aymeric Dieuleveut, Hadrien Hendrikx
  • Byzantine-robust optimization is reformulated as optimization under inexact gradient oracles, enabling the use of established results in this domain.
  • A Nesterov-type accelerated algorithm is proposed, achieving an accelerated linear convergence rate under medium heterogeneity assumptions.
  • The Prox Inexact Gradient method under Similarity (PIGS) leverages an auxiliary loss function to achieve robust optimization with linear convergence rates.
  • The proposed methods significantly reduce communication complexity compared to existing Byzantine-robust algorithms.
  • Theoretical and empirical results confirm the effectiveness of the proposed approaches in achieving robustness and efficiency.
Abstract
This paper addresses the challenge of Byzantine robustness in federated learning, where adversarial nodes can disrupt distributed optimization processes. The authors propose a novel framework that reformulates Byzantine-robust optimization as a special case of optimization with inexact gradient oracles, characterized by additive and multiplicative error terms. This abstraction enables the systematic development of robust algorithms by leveraging existing results in the field of inexact gradient optimization. The paper introduces two key algorithms: a Nesterov-type accelerated method that achieves faster convergence under medium heterogeneity and a Prox Inexact Gradient method under Similarity (PIGS) that leverages an auxiliary loss function to further enhance robustness and efficiency. Both methods significantly reduce communication complexity while maintaining strong theoretical guarantees. The authors validate their approaches through theoretical analysis and empirical evaluations, demonstrating improved convergence rates and robustness compared to existing methods.
Methodology
The authors cast Byzantine-robust optimization as a problem of optimization with inexact gradient oracles, characterized by additive and multiplicative error terms. They then adapt existing results from inexact gradient optimization to develop two robust algorithms: a Nesterov-type accelerated method and the PIGS algorithm, which incorporates an auxiliary loss function for improved robustness. Theoretical analyses are conducted to establish convergence rates, and empirical evaluations are performed to validate the methods' performance.
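For concreteness, an inexact gradient oracle with additive and multiplicative error terms is typically formalized as returning a vector g(x) satisfying the bound below; the specific constants that the robust aggregation induces in this paper are not reproduced here.
```latex
\bigl\| g(x) - \nabla f(x) \bigr\| \;\le\; a \;+\; b\,\bigl\|\nabla f(x)\bigr\|,
```
where a ≥ 0 is the additive error (driven, for instance, by the fraction of Byzantine nodes and gradient noise) and b < 1 is the multiplicative error; gradient-type methods run with such an oracle converge linearly up to a neighborhood whose radius scales with a.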
Results
The Nesterov-type accelerated algorithm converges linearly at an accelerated rate, with per-iteration contraction governed by √(µ/L), where µ is the strong convexity parameter and L is the smoothness parameter. The PIGS algorithm reaches an ε-accurate solution in O((∆/µ) log(1/ε)) iterations, where ∆ measures the similarity between the approximate and true loss functions. Both methods demonstrate reduced communication complexity and improved robustness compared to existing approaches.
Implications
The proposed framework and algorithms provide a systematic approach to designing Byzantine-robust distributed learning methods, enabling the development of more efficient and scalable federated learning systems. These methods are particularly relevant for applications involving heterogeneous and potentially adversarial environments, such as IoT, healthcare, and large-scale collaborative machine learning.
View on arXiv

Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

Bizhe Bai, Xinyue Wang, Peng Ye, Tao Chen
  • The paper identifies an 'exploration ceiling' in RLVR, where models fail to discover new reasoning strategies under large sampling budgets.
  • PSN-RLVR introduces parameter-space noise to induce trajectory-level exploration, improving coherence in long-horizon reasoning tasks.
  • Two lightweight modules—truncated importance sampling (TIS) and a real-time adaptive noise scheduler—address optimization stability and computational efficiency challenges.
  • PSN-RLVR consistently improves pass@k metrics across multiple benchmarks, outperforming existing exploration-focused RLVR methods.
  • The framework is orthogonal and composable with other RLVR techniques, enabling further enhancements in reasoning diversity and performance.
Abstract
This paper addresses the exploration limitations of Reinforcement Learning with Verifiable Rewards (RLVR) in improving reasoning capabilities of Large Language Models (LLMs). RLVR often reweights existing solution traces rather than discovering new strategies, leading to an 'exploration ceiling' under large sampling budgets. The authors propose PSN-RLVR, a novel framework that introduces parameter-space noise (PSN) to perturb policy parameters before rollout generation, enabling temporally consistent, trajectory-level exploration. This approach better preserves long-horizon chain-of-thought reasoning compared to traditional action-space noise methods. To address challenges such as sampling-update mismatch and computational inefficiencies, the paper introduces two modules: truncated importance sampling (TIS) for stable optimization and a real-time adaptive noise scheduler that uses semantic diversity and normalized self-certainty as lightweight surrogates for noise control. Experiments on mathematical reasoning benchmarks demonstrate that PSN-RLVR significantly expands the reasoning capability boundaries of LLMs, achieving higher pass@k metrics under large sampling budgets and outperforming prior exploration-oriented RLVR methods. The framework is orthogonal and composable with other RLVR enhancements, making it a promising tool for improving reasoning diversity and performance in LLMs.
Methodology
The authors propose PSN-RLVR, which perturbs policy parameters in the parameter space before rollout generation to induce trajectory-level exploration. They address optimization challenges with truncated importance sampling (TIS) and computational inefficiencies with a real-time adaptive noise scheduler that combines semantic diversity and normalized self-certainty. The framework is instantiated on GRPO, a widely used RLVR method, and systematically evaluated across mathematical reasoning benchmarks.
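The two mechanical pieces, sampling rollouts from a parameter-perturbed copy of the policy and correcting the resulting off-policy mismatch with a truncated importance weight, can be sketched as follows; the noise scale, the clipping threshold, and the adaptive scheduler logic are placeholders rather than the paper's settings.
```python
import copy
import torch

def perturb_policy(policy: torch.nn.Module, sigma: float) -> torch.nn.Module:
    """Return a behavior policy obtained by adding Gaussian noise to every parameter (illustrative)."""
    noisy = copy.deepcopy(policy)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy

def truncated_importance_weight(logp_target: torch.Tensor,
                                logp_behavior: torch.Tensor,
                                clip: float = 2.0) -> torch.Tensor:
    """Per-token truncated importance weight correcting the sampling-update mismatch (assumed form)."""
    return torch.exp(logp_target - logp_behavior).clamp(max=clip)
```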
Results
PSN-RLVR significantly expands reasoning capability boundaries, achieving higher pass@k metrics under large sampling budgets (e.g., k = 128, 256) compared to baseline methods like GRPO-Train and other exploration-oriented RLVR techniques. It restores and enhances semantic and operational diversity in reasoning tasks, demonstrating superior performance in long-horizon chain-of-thought reasoning.
Implications
PSN-RLVR has the potential to improve the reasoning capabilities of LLMs in domains requiring long-horizon coherence, such as mathematical problem-solving, code generation, and symbolic reasoning. Its composability with other RLVR techniques makes it a versatile tool for enhancing exploration and diversity in reinforcement learning pipelines for LLMs.
View on arXiv

Live or Lie: Action-Aware Capsule Multiple Instance Learning for Risk Assessment in Live Streaming Platforms

Yiran Qiao, Jing Chen, Xiang Ao, Qiwei Zhong, Yang Liu, Qing He
  • The paper introduces a novel Action-aware Capsule Multiple Instance Learning (AC-MIL) framework for risk assessment in live streaming platforms.
  • The problem is formulated as a Multiple Instance Learning (MIL) task, where room-level labels are used to detect fine-grained, coordinated malicious behaviors.
  • AC-MIL captures multi-granular semantics by modeling both individual user actions and group-level coordination patterns through a capsule-based architecture.
  • The framework provides interpretable evidence by identifying risky behavior segments, aiding in actionable interventions.
  • Extensive experiments on Douyin datasets show that AC-MIL achieves state-of-the-art performance, significantly outperforming baseline models.
Abstract
This paper addresses the challenge of risk assessment in live streaming platforms, where malicious behaviors such as fraud are often concealed within normal user activities. The authors propose a novel Action-aware Capsule Multiple Instance Learning (AC-MIL) framework to detect risks in live streaming rooms. The problem is formulated as a Multiple Instance Learning (MIL) task, where each live streaming room is treated as a 'bag' and user-time capsules, representing sequences of user actions within specific time windows, are treated as 'instances.' The AC-MIL framework captures both individual user behaviors and group-level coordination patterns by leveraging a serial and parallel architecture. This design encodes temporal dynamics and cross-user dependencies, enabling robust room-level risk predictions and interpretable evidence at the behavior segment level. Extensive experiments on large-scale datasets from Douyin demonstrate that AC-MIL outperforms existing MIL and sequential models, achieving state-of-the-art performance in detecting risky behaviors in live streaming rooms.
Methodology
The authors propose a capsule-based MIL framework where each live streaming room is treated as a 'bag' and user-time capsules (subsequences of user actions within specific time windows) are treated as 'instances.' The AC-MIL framework employs a serial and parallel architecture to jointly model temporal dynamics and cross-user dependencies. This approach enables the detection of both individual and coordinated malicious behaviors. The model also provides interpretable outputs by identifying specific risky behavior segments.
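The bag/instance formulation can be illustrated with a generic attention-based MIL pooling over user-time capsule embeddings; AC-MIL's serial and parallel capsule architecture is considerably richer, so the snippet below only shows how instance-level saliency and a room-level prediction fall out of the MIL framing.
```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Generic attention-based MIL pooling over instance embeddings (a simplification of AC-MIL's capsule aggregation)."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, 1)

    def forward(self, capsules: torch.Tensor):
        # capsules: (n_instances, dim) user-time capsule embeddings for one room ("bag").
        attn = torch.softmax(self.score(capsules), dim=0)   # instance-level saliency weights
        bag = (attn * capsules).sum(dim=0)                   # room-level representation
        risk_logit = self.head(bag)                          # room-level risk prediction
        return risk_logit, attn.squeeze(-1)                  # saliency doubles as interpretable evidence
```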
Results
AC-MIL significantly outperforms traditional MIL and sequential baselines in room-level risk assessment tasks on large-scale datasets from Douyin. The framework achieves state-of-the-art performance in terms of accuracy, recall, and interpretability. Additionally, the capsule-level interpretability allows for actionable insights into risky behavior segments, facilitating timely interventions.
Implications
The proposed AC-MIL framework has significant implications for improving safety and trust in live streaming platforms. By enabling real-time detection of fraudulent behaviors and providing interpretable evidence, the framework can help platforms mitigate risks, protect users from scams, and enhance the overall user experience. The approach could also be extended to other domains involving weak supervision and complex behavioral patterns, such as e-commerce, social media, and online gaming.
View on arXiv

Membership Inference Attacks from Causal Principles

Mathieu Even, Clément Berenfeld, Linus Bleistein, Tudor Cebere, Julie Josse, Aurélien Bellet
  • The paper introduces a novel causal framework for evaluating Membership Inference Attacks (MIAs), defining memorization as the causal effect of including a data point in the training set.
  • It identifies key sources of bias in existing MIA evaluation methods, including interference in one-run methods and confounding in zero-run methods.
  • The authors propose practical estimators for causal MIA metrics, providing non-asymptotic consistency guarantees and addressing distribution shift biases.
  • The framework leverages causal inference principles and algorithmic stability to ensure robust and interpretable privacy evaluations.
  • Empirical validation on synthetic data and CIFAR-10 demonstrates the effectiveness of the proposed methods in practical settings.
Abstract
This paper reframes the evaluation of Membership Inference Attacks (MIAs) as a causal inference problem to address the limitations of existing evaluation protocols. MIAs are used to assess privacy risks by determining whether a specific data point was part of a model's training set. Traditional multi-run evaluations, which involve retraining models multiple times, are computationally expensive, while one-run and zero-run methods suffer from statistical biases. The authors propose a causal framework that defines memorization as the causal effect of including a data point in the training set. This approach identifies and corrects key biases in existing methods, such as interference in one-run evaluations and confounding in zero-run evaluations. The paper introduces causal analogues of standard MIA metrics and develops practical estimators with non-asymptotic consistency guarantees. Empirical experiments on synthetic data and CIFAR-10 demonstrate that the proposed causal estimators provide reliable and practical solutions for evaluating memorization and privacy risks in machine learning models, even under challenging conditions like distribution shifts.
Methodology
The authors adopt a causal inference approach using the potential outcomes framework to redefine MIA metrics. They analyze multi-run, one-run, and zero-run evaluation protocols through a causal lens, identifying sources of bias such as interference and confounding. They propose new causal estimators that correct these biases and provide theoretical guarantees of consistency. The methodology incorporates learning-theoretic tools like algorithmic stability to address challenges such as random interference and distribution shifts.
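In potential-outcomes terms, memorization of a point z is the causal effect of including z in the training set on a per-example statistic (for instance the loss or an attack score). A natural way to write this, abstracting away the paper's exact estimands, is:
```latex
\mathrm{mem}(z) \;=\; \mathbb{E}\Bigl[\, s\bigl(z;\, \mathcal{A}(D \cup \{z\})\bigr)
  \;-\; s\bigl(z;\, \mathcal{A}(D)\bigr) \,\Bigr],
```
where A is the (randomized) training algorithm, D is a training set drawn without z, and s(z; θ) is the statistic the attack computes on z under model θ; the expectation is over the data and the algorithm's randomness. Multi-run evaluation estimates both terms by retraining, while one-run and zero-run estimators must correct for interference and confounding, respectively, to target the same quantity.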
Results
The proposed causal estimators were validated on synthetic data and the CIFAR-10 dataset. The results demonstrated that the estimators effectively corrected biases in one-run and zero-run MIA evaluations, providing reliable measures of memorization even under distribution shifts. The methods were shown to be practical and computationally efficient, making them suitable for evaluating privacy risks in large-scale machine learning models.
Implications
This work provides a principled foundation for evaluating privacy risks in modern AI systems, particularly in scenarios where retraining models is impractical or infeasible. The causal framework and proposed estimators can be applied to assess memorization and privacy leakage in large-scale models, including language models, under real-world conditions. This has implications for regulatory compliance, data privacy auditing, and the development of more privacy-preserving machine learning systems.
View on arXiv

Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

Hyunji Jung, Sungbin Shin, Namhoon Lee
  • Gradient staleness in asynchronous pipeline parallelism scales with pipeline depth, severely degrading convergence speed and scalability.
  • Misalignment between the Hessian eigenbasis and the standard coordinate basis amplifies the negative effects of delayed gradients, particularly for adaptive optimizers like Adam.
  • The proposed basis rotation framework aligns the optimization space with the Hessian eigenbasis, mitigating the impact of gradient staleness.
  • Empirical results show that basis rotation accelerates convergence, achieving the same training loss in up to 81.6% fewer iterations.
  • The method is particularly effective for large-scale models, addressing a critical bottleneck in asynchronous pipeline parallelism.
Abstract
This paper addresses the challenge of gradient staleness in asynchronous pipeline parallelism, a technique used to improve hardware utilization in large-scale distributed training of models like large language models (LLMs). Gradient staleness arises due to delays in gradient updates caused by the asynchronous nature of the pipeline, which worsens as the pipeline depth increases. The authors identify that this issue is exacerbated by misalignment between the Hessian eigenbasis and the standard coordinate basis, which undermines the effectiveness of adaptive optimizers like Adam. To mitigate this, the paper introduces a novel approach called basis rotation, which aligns the optimization space with the Hessian eigenbasis to reduce oscillations and improve convergence. Theoretical analysis and empirical evaluations demonstrate that basis rotation significantly accelerates convergence, achieving the same training loss in up to 81.6% fewer iterations compared to standard asynchronous pipeline parallel training. This work provides a scalable solution to improve the efficiency of asynchronous training for large-scale models.
Methodology
The authors conduct a theoretical analysis of gradient staleness and its interaction with the optimization process, focusing on the role of basis misalignment. They propose basis rotation as a solution, which involves transforming the optimization space to align the Hessian eigenbasis with the standard coordinate basis. Several practical strategies for basis rotation are introduced, leveraging different Hessian approximations. The approach is evaluated empirically on large-scale LLM pre-training tasks, comparing convergence speed and training loss against standard asynchronous pipeline parallelism.
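A minimal sketch of applying an element-wise adaptive update in a rotated basis is given below; here Q is an (approximate) orthonormal Hessian eigenbasis obtained from some curvature approximation, and the asynchronous pipeline's staleness handling itself is omitted.
```python
import torch

def rotated_adam_step(param, grad, Q, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam-style step applied in a rotated basis (illustrative sketch).

    Q is a (d, d) orthonormal matrix whose columns approximate the Hessian eigenbasis;
    m and v are the first/second moment buffers kept in that basis. The (possibly stale)
    gradient is rotated into the eigenbasis, the element-wise adaptive update is computed
    there, and the resulting step is rotated back.
    """
    g = Q.t() @ grad.reshape(-1)                       # express the gradient in the eigenbasis
    m.mul_(betas[0]).add_(g, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    step = Q @ (m_hat / (v_hat.sqrt() + eps))          # rotate the update back to parameter space
    param.data.add_(step.reshape(param.shape), alpha=-lr)
```
In practice Q would come from a structured or low-rank curvature approximation rather than a full d × d factorization, in line with the paper's use of different Hessian approximations.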
Results
The proposed basis rotation method significantly mitigates the effects of gradient staleness, achieving the same training loss in up to 81.6% fewer iterations compared to the best-performing asynchronous pipeline parallel training baseline. For example, training a 1-billion-parameter LLM with basis rotation required 76.8% fewer iterations to reach the same loss. The method also demonstrated robustness across varying pipeline depths, where standard asynchronous methods suffered from severe slowdowns.
Implications
This work provides a scalable solution to improve the efficiency of asynchronous pipeline parallelism, making it more viable for training large-scale models like LLMs. By addressing gradient staleness, the proposed method can enhance hardware utilization and reduce training time, potentially lowering the computational cost of large-scale distributed training. The approach could be adopted in a wide range of applications, including natural language processing, computer vision, and other domains requiring large-scale model training.
View on arXiv

Mitigating Task-Order Sensitivity and Forgetting via Hierarchical Second-Order Consolidation

Protik Nag, Krishnan Raghavan, Vignesh Narayanan
  • HTCL addresses the NP-hard problem of task-order sensitivity by optimizing intra-group task sequences and consolidating updates using second-order approximations.
  • The framework employs a multi-level hierarchy to balance plasticity and stability across short and long task horizons.
  • HTCL is model-agnostic and can be integrated with existing continual learning methods without modifying their underlying mechanisms.
  • Empirical results show substantial improvements in task-order robustness, reducing accuracy variance by up to 68% and memory forgetting by up to 70.9%.
  • HTCL scales efficiently with near-linear time and memory overhead, making it suitable for real-world applications.
Abstract
This paper introduces Hierarchical Taylor Series-based Continual Learning (HTCL), a novel framework designed to address task-order sensitivity and catastrophic forgetting in continual learning (CL). HTCL combines fast local adaptation with conservative second-order global consolidation, enabling robust performance across random task orderings. The framework partitions tasks into small groups, optimizes intra-group orderings, and integrates updates using a Hessian-regularized Taylor expansion. HTCL further employs a multi-level hierarchy to consolidate knowledge across varying timescales, ensuring both plasticity for recent tasks and stability for long-term knowledge retention. The approach is model-agnostic, enhancing existing CL methods without altering their core algorithms. Empirical evaluations on diverse datasets demonstrate significant reductions in task-order sensitivity (up to 68% variance reduction) and improved memory retention (mean forgetting reduced by up to 70.9%).
Methodology
HTCL partitions tasks into small groups, evaluates all intra-group orderings, and selects the optimal sequence for local adaptation. Updates are integrated using a Hessian-regularized Taylor expansion, leveraging scalable low-rank curvature approximations. A multi-level hierarchy is employed to consolidate knowledge across varying timescales, ensuring robust performance across long task sequences. The framework is designed to be model-agnostic, allowing integration with existing CL methods.
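The consolidation step can be read as minimizing a second-order Taylor surrogate of the previously consolidated loss around the current consolidated parameters θ_c while adapting to the new group; a schematic form (the exact regularization and low-rank curvature treatment in HTCL are not reproduced) is:
```latex
\theta \;\leftarrow\; \arg\min_{\theta}\;
  \mathcal{L}_{\text{group}}(\theta)
  \;+\; g_c^{\top}(\theta - \theta_c)
  \;+\; \tfrac{1}{2}\,(\theta - \theta_c)^{\top} H_c\, (\theta - \theta_c),
```
where g_c and H_c are the gradient and a (low-rank) Hessian approximation of the consolidated objective at θ_c; the quadratic term penalizes movement along high-curvature directions that would erase previously acquired knowledge.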
Results
HTCL consistently reduces task-order sensitivity, achieving up to 68% variance reduction in accuracy across random task permutations on datasets like SplitMNIST, CIFAR-100, CORA, and 20 Newsgroups. It also improves memory retention, with mean forgetting reduced by up to 70.9% and per-task standard deviation lowered by over 30% in longer task sequences. These improvements are observed across replay-based and regularization-based CL baselines.
Implications
HTCL has significant implications for real-world continual learning applications where task order cannot be controlled, such as medical diagnosis, robotics, and adaptive AI systems. By mitigating task-order sensitivity and improving memory retention, HTCL enhances the reliability and robustness of CL models, making them more suitable for deployment in dynamic environments.
View on arXiv

Most Convolutional Networks Suffer from Small Adversarial Perturbations

Amit Daniely, Idan Mehalel
  • Adversarial examples exist in random CNNs with ℓ2-distances as small as ||x||/√d, which is the theoretical lower bound under certain conditions.
  • The study demonstrates that a single step of gradient descent can efficiently find adversarial examples in CNNs.
  • The authors use Fourier decomposition to derive bounds on the singular values of random linear convolutional operators, which may have broader applications in understanding CNN behavior.
  • The results improve upon prior work by providing tighter bounds on perturbation distances and a constructive method for finding adversarial examples in CNNs.
  • The analysis assumes constant depth and limited growth in layer width, which is a limitation compared to some prior studies.
Abstract
This paper investigates the vulnerability of convolutional neural networks (CNNs) to adversarial examples, which are small, carefully crafted perturbations to input data that can cause a model to produce incorrect outputs. While prior work has established the existence of adversarial examples in fully connected neural networks, this study extends the theoretical understanding to CNNs. The authors demonstrate that adversarial examples can be found in random CNNs with input dimension d at an ℓ2-distance of order ||x||/√d from the input x, which is the smallest possible distance under certain assumptions. They also show that a single step of gradient descent is sufficient to identify such adversarial examples. The study employs Fourier decomposition to derive bounds on the singular values of random linear convolutional operators, a key component of CNN layers. These findings improve upon prior work by providing tighter bounds on adversarial perturbation distances and offering a constructive method to find adversarial examples in CNNs.
Methodology
The authors use a theoretical approach based on Fourier decomposition to analyze the singular values of random linear convolutional operators, which are the core components of CNN layers. They derive mathematical bounds on the ℓ2-distances required to create adversarial examples and demonstrate that these examples can be efficiently found using a single step of gradient descent. The analysis is conducted under specific assumptions, including constant network depth and limited growth in layer width.
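The constructive part of the result, that one normalized gradient step suffices, corresponds to the familiar single-step ℓ2 attack sketched below; this is a generic fast-gradient-method step, with the perturbation radius set on the order of ||x||/√d per the theory, not the authors' exact construction.
```python
import torch

def single_step_l2_attack(model, x, y, loss_fn, radius):
    """One ℓ2-normalized gradient step of size `radius` (generic FGM-style step,
    illustrating the 'single step of gradient descent' finding)."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    flat_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12)          # per-example gradient norm
    direction = grad / flat_norm.view(-1, *([1] * (x.dim() - 1)))     # unit ℓ2 direction
    return (x + radius * direction).detach()
```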
Results
The study proves that adversarial examples in random CNNs can be found at an ℓ2-distance of order ||x||/√d from the input, which is the smallest possible under the given assumptions. Additionally, the authors show that these adversarial examples can be identified using a single step of gradient descent, making the process computationally efficient. The results improve upon prior work by providing tighter bounds and a constructive method for finding adversarial examples in CNNs.
Implications
The findings highlight the inherent vulnerability of CNNs to small adversarial perturbations, even under optimal conditions. The results can inform the development of more robust CNN architectures and adversarial defense mechanisms. The mathematical tools and insights, particularly the bounds on singular values of convolutional operators, may be useful for further theoretical research on CNNs and their robustness.
View on arXiv

Neural Probabilistic Amplitude Shaping for Nonlinear Fiber Channels

Mohammad Taha Askari, Lutz Lampe, Amirhossein Ghazisaeidi
  • NPAS introduces a PAS-compatible framework for learning joint symbol distributions using an autoregressive recurrent neural network.
  • The method achieves a 0.5 dB SNR gain over traditional sequence selection for dual-polarized 64-QAM transmission across a 205 km fiber link.
  • NPAS retains compatibility with systematic FEC and standard PAS architectures, making it practical for deployment in real-world optical communication systems.
  • The end-to-end training pipeline incorporates a differentiable optical fiber channel model to account for nonlinear effects during optimization.
  • NPAS matches the performance of NPS while addressing its limitations, such as incompatibility with PAS and FEC.
Abstract
This paper introduces Neural Probabilistic Amplitude Shaping (NPAS), a novel framework for optimizing the joint distribution of transmitted symbols in coherent optical fiber systems. NPAS leverages an autoregressive recurrent neural network to model temporal dependencies in symbol sequences, enabling it to mitigate nonlinear interference noise (NLIN) and improve the achievable information rate (AIR). Unlike previous neural probabilistic shaping (NPS) methods, NPAS is fully compatible with probabilistic amplitude shaping (PAS) and systematic forward error correction (FEC), making it deployable in practical transceivers. The authors demonstrate that NPAS achieves a 0.5 dB signal-to-noise ratio (SNR) gain over traditional sequence selection methods for dual-polarized 64-QAM transmission over a 205 km single-span fiber link. The proposed method is trained end-to-end using a differentiable optical fiber channel model that accounts for nonlinear effects, ensuring optimal performance in real-world scenarios.
Methodology
The NPAS framework employs an autoregressive recurrent neural network (LSTM-based) to model the joint amplitude distribution of unsigned symbols. The network is trained end-to-end using a differentiable optical fiber channel model that incorporates nonlinear effects through an additive-multiplicative perturbative model. The training process uses the Gumbel-Softmax trick for sampling and a binary cross-entropy loss function adjusted to maximize the achievable information rate. The NPAS encoder integrates with an arithmetic distribution matcher (ADM) to map information bits to unsigned symbols, ensuring compatibility with PAS and FEC.
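The differentiable sampling piece can be sketched with a toy autoregressive LSTM that emits unsigned amplitude levels via the Gumbel-Softmax trick; the alphabet size, network sizes, and the interface to the distribution matcher and the channel model are placeholders rather than NPAS's actual configuration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AmplitudeSampler(nn.Module):
    """Toy autoregressive amplitude sampler in the spirit of NPAS (architecture details assumed).

    An LSTM emits a categorical distribution over unsigned amplitude levels at each time
    step; the Gumbel-Softmax trick keeps sampling differentiable so the model can be
    trained end-to-end through a differentiable channel model.
    """

    def __init__(self, n_levels: int = 8, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_levels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_levels)
        self.n_levels = n_levels

    def forward(self, seq_len: int, tau: float = 1.0) -> torch.Tensor:
        symbols, state = [], None
        prev = torch.zeros(1, 1, self.n_levels)                     # start token
        for _ in range(seq_len):
            h, state = self.lstm(prev, state)
            logits = self.out(h)
            sym = F.gumbel_softmax(logits, tau=tau, hard=True)      # one-hot forward, soft backward
            symbols.append(sym)
            prev = sym
        return torch.cat(symbols, dim=1)                            # (1, seq_len, n_levels) one-hot amplitudes
```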
Results
The proposed NPAS method achieves a 0.5 dB SNR gain over sequence selection methods for dual-polarized 64-QAM transmission over a 205 km single-span fiber link. It also matches the performance of NPS in terms of shaping and nonlinear mitigation gains while being deployable in practical PAS transceivers. The end-to-end training approach ensures that the learned joint distributions are optimized for nonlinear fiber propagation.
Implications
The NPAS framework has significant implications for the field of optical communications, as it provides a practical and deployable solution for improving data transmission rates and reliability in nonlinear fiber channels. By addressing the limitations of existing methods and integrating seamlessly with PAS and FEC, NPAS could enhance the performance of next-generation optical networks, enabling higher spectral efficiency and better utilization of fiber infrastructure.
View on arXiv

On the Sample Efficiency of Inverse Dynamics Models for Semi-Supervised Imitation Learning

Sacha Morin, Moonsub Byeon, Alexia Jolicoeur-Martineau, Sébastien Lachapelle
  • IDM-based methods (VM-IDM and IDM labeling) recover the same policy under ideal conditions, termed the IDM-based policy.
  • The sample efficiency of IDM learning arises from the reduced complexity and stochasticity of the ground-truth IDM compared to the expert policy.
  • Extensive experiments across multiple benchmarks (ProcGen, Push-T, and Libero) confirm the statistical advantages of IDM-based methods over behavior cloning.
  • An improved version of the LAPO algorithm is proposed, demonstrating superior performance on the ProcGen benchmark.
  • Unified video-action prediction (UVA) architectures can further enhance IDM-based policy success.
Abstract
This paper investigates the sample efficiency of inverse dynamics models (IDMs) in the context of semi-supervised imitation learning (SSIL), where a small set of action-labeled trajectories is combined with a larger set of action-free trajectories to train policies. The authors unify two IDM-based approaches—VM-IDM (using a video model) and IDM labeling (synthetically labeling action-free data)—and demonstrate that both recover the same policy under ideal conditions. They attribute the superior sample efficiency of IDM-based methods over behavior cloning (BC) to two key factors: (i) the ground-truth IDM resides in a lower complexity hypothesis class compared to the expert policy, and (ii) the ground-truth IDM is less stochastic than the expert policy. Through theoretical insights and extensive experiments on benchmarks like ProcGen, Push-T, and Libero, the authors validate these claims and propose an improved version of the LAPO algorithm for latent action policy learning. They also explore the use of unified video-action prediction (UVA) architectures to enhance IDM-based methods.
Methodology
The authors analyze IDM-based methods using statistical learning theory to explain their sample efficiency advantages. They conduct experiments across 16 ProcGen environments, Push-T, and Libero benchmarks to validate their theoretical claims. Additionally, they propose modifications to the LAPO algorithm and evaluate the impact of using UVA architectures for video-action prediction.
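Schematically, the IDM-labeling variant proceeds in three steps, as in the pseudocode below; the loaders, loss callables, and optimizers are placeholders meant to show the data flow, not the authors' training code.
```python
def idm_labeling_pipeline(labeled, unlabeled, idm, policy, idm_loss, bc_loss, opt_idm, opt_pi, epochs=10):
    """Schematic IDM-labeling pipeline (one of the two unified approaches; training details assumed).

    1) Fit an inverse dynamics model a ~ IDM(s, s') on the small action-labeled set.
    2) Synthetically label the large action-free set with the trained IDM.
    3) Behavior-clone a policy on the union of real and synthetic action labels.
    """
    for _ in range(epochs):                                  # step 1: supervised IDM training
        for s, s_next, a in labeled:
            opt_idm.zero_grad()
            idm_loss(idm(s, s_next), a).backward()
            opt_idm.step()

    pseudo = [(s, idm(s, s_next).detach()) for s, s_next in unlabeled]   # step 2: pseudo-labeling

    dataset = [(s, a) for s, s_next, a in labeled] + pseudo              # step 3: behavior cloning
    for _ in range(epochs):
        for s, a in dataset:
            opt_pi.zero_grad()
            bc_loss(policy(s), a).backward()
            opt_pi.step()
    return policy
```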
Results
The study shows that IDM-based methods outperform behavior cloning in sample efficiency due to the lower complexity and reduced stochasticity of the ground-truth IDM. The improved LAPO algorithm achieves better performance on the ProcGen benchmark, and the integration of UVA architectures further enhances policy success rates.
Implications
The findings provide a deeper understanding of when and why IDM-based methods excel in semi-supervised imitation learning, offering a framework for leveraging action-free data effectively. The proposed improvements to IDM-based methods, including the enhanced LAPO algorithm and UVA integration, have potential applications in robotics, autonomous systems, and other domains where labeled expert demonstrations are scarce.
View on arXiv

PA-MIL: Phenotype-Aware Multiple Instance Learning Guided by Language Prompting and Genotype-to-Phenotype Relationships

Zekang Yang, Hong Liu, Xiangdong Wang
  • PA-MIL is the first framework to explicitly identify cancer-related phenotypes from WSIs and use them for cancer diagnosis, offering ante-hoc interpretability.
  • A phenotype knowledge base is constructed, containing morphological descriptions of phenotypes and their associated genotypes.
  • The framework uses language prompts and genotype-to-phenotype relationships to guide the extraction of phenotype-related features.
  • A Genotype-to-Phenotype Neural Network (GP-NN) provides multi-level supervision, improving the precision of phenotype learning.
  • PA-MIL achieves competitive diagnostic performance while providing clinically valuable, interpretable phenotype-based evidence.
Abstract
This paper introduces PA-MIL, a novel phenotype-aware multiple instance learning framework designed to enhance interpretability in cancer diagnosis from pathology whole-slide images (WSIs). Unlike traditional deep learning models that rely on post-hoc interpretability, PA-MIL provides ante-hoc interpretability by explicitly identifying cancer-related phenotypes and using them as evidence for cancer subtyping. The framework leverages a phenotype knowledge base, language prompts describing phenotypes, and genotype-to-phenotype relationships to guide the learning process. A Genotype-to-Phenotype Neural Network (GP-NN) is introduced to provide multi-level supervision during training, enabling PA-MIL to extract precise phenotype-related features. Experimental results demonstrate that PA-MIL achieves competitive performance compared to state-of-the-art multiple instance learning methods while offering improved interpretability through phenotype-based evidence. The approach has significant potential for clinical applications, as it aligns closely with the diagnostic process used by pathologists.
Methodology
PA-MIL employs a text encoder and an image encoder to extract features from phenotype descriptions and WSI patches, respectively. Cross-attention mechanisms are used to align textual and image features, enabling the identification of phenotype-related features. The clinical saliency of each phenotype is predicted, and cancer diagnosis is performed based on these saliencies. A Genotype-to-Phenotype Neural Network (GP-NN) is used as a teacher model during training, leveraging transcriptomic data and genotype-to-phenotype relationships to provide multi-level supervisory signals.
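The alignment step can be sketched as cross-attention in which phenotype prompt embeddings query the patch embeddings, yielding one phenotype-level feature and saliency score per prompt; the head sizes and the diagnosis head below are hypothetical stand-ins, not PA-MIL's actual modules.
```python
import torch
import torch.nn as nn

class PhenotypeCrossAttention(nn.Module):
    """Simplified cross-attention between phenotype prompt embeddings and WSI patch embeddings (a sketch of the alignment step, not PA-MIL's full architecture)."""

    def __init__(self, dim: int, n_classes: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.saliency = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, n_classes)   # hypothetical diagnosis head

    def forward(self, phenotype_emb: torch.Tensor, patch_emb: torch.Tensor):
        # phenotype_emb: (1, n_phenotypes, dim) text features; patch_emb: (1, n_patches, dim) image features.
        pheno_feat, attn_map = self.attn(phenotype_emb, patch_emb, patch_emb)
        saliency = torch.sigmoid(self.saliency(pheno_feat)).squeeze(-1)   # clinical saliency per phenotype
        logits = self.classifier((saliency.unsqueeze(-1) * pheno_feat).mean(dim=1))
        return logits, saliency, attn_map
```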
Results
PA-MIL demonstrates competitive performance compared to state-of-the-art multiple instance learning methods on multiple datasets. It provides improved interpretability by identifying phenotype saliency as evidence for predictions. The framework also offers reliable cohort-level and case-level interpretability, aligning with clinical diagnostic practices.
Implications
PA-MIL has the potential to significantly enhance the reliability and accountability of AI-based cancer diagnosis systems by providing interpretable, phenotype-based evidence. This approach could improve trust and adoption of AI in clinical settings, aiding pathologists in making more informed decisions. Additionally, the integration of genotype-to-phenotype relationships opens avenues for more precise and personalized cancer diagnostics.
View on arXiv

Prediction of Critical Heat Flux in Rod Bundles Using Tube-Based Hybrid Machine Learning Models in CTF

Aidan Furlong, Robert Salko, Xingang Zhao, Xu Wu
  • Three ML-based CHF models (pure DNN, hybrid Bowring, and hybrid LUT) were trained on tube-based CHF data and tested on rod bundle geometries.
  • The models were implemented in the CTF subchannel code and validated using the CE 5×5 rod bundle test series.
  • Tube-trained ML models showed reduced accuracy in predicting CHF in rod bundles due to complex thermal hydraulic phenomena not present in tube geometries.
  • Hybrid models leverage traditional base models to improve predictions but still face limitations in generalizing to rod bundle environments.
  • The study underscores the need for advanced techniques, such as transfer learning, to adapt ML models for rod bundle CHF prediction.
Abstract
This paper investigates the application of machine learning (ML) models, originally trained on tube-based critical heat flux (CHF) data, to predict CHF in rod bundle geometries. The study evaluates three ML-based models: a purely data-driven deep neural network (DNN) and two hybrid models that combine DNNs with traditional base models (the Bowring correlation and the Groeneveld lookup table). These models were integrated into the CTF subchannel code and tested on the Combustion Engineering (CE) 5×5 rod bundle test series to assess their generalization to more complex geometries. The research highlights the challenges of applying tube-trained ML models to rod bundles, where phenomena such as channel crossflow and spacer grid effects complicate predictions. The study demonstrates that while the models show promise, their performance in rod bundle geometries is less accurate compared to tube-based predictions, emphasizing the need for further refinement or alternative approaches such as transfer learning.
Methodology
The study utilized three ML-based models: a purely data-driven DNN and two hybrid models combining DNNs with the Bowring correlation and Groeneveld lookup table. These models were trained on a filtered subset of the NRC CHF database (24,320 tube-based data points) using a local input feature set (heated equivalent diameter, pressure, mass flux, and local equilibrium quality). Hyperparameters were optimized via Bayesian methods, and the models were integrated into the CTF subchannel code for validation. Performance was assessed on both the NRC tube database and the CE 5×5 rod bundle test series.
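The hybrid structure can be sketched as a DNN that corrects a physics-based base prediction; the multiplicative combination below is purely illustrative, since the paper's exact hybridization with the Bowring correlation and the Groeneveld lookup table is not detailed here.
```python
import torch
import torch.nn as nn

class HybridCHFModel(nn.Module):
    """Hybrid CHF predictor: a DNN corrects a physics-based base model (combination form assumed).

    `base_model` is a callable standing in for the Bowring correlation or the Groeneveld
    lookup table; the DNN sees the local feature set (heated equivalent diameter, pressure,
    mass flux, local equilibrium quality).
    """

    def __init__(self, base_model, n_features: int = 4):
        super().__init__()
        self.base_model = base_model
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        base_chf = self.base_model(features)          # physics-based estimate
        correction = torch.exp(self.net(features))    # positive multiplicative correction factor
        return base_chf * correction
```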
Results
The tube-trained ML models performed well on the NRC tube database, achieving mean absolute percentage errors (MAPE) of approximately 11.63% for the pure ML and hybrid LUT models. However, their performance declined when applied to the CE 5×5 rod bundle test series, highlighting the challenges of generalizing tube-based models to more complex geometries. The hybrid models showed slightly better performance than the pure ML model but still struggled with the additional complexities of rod bundles.
Implications
This study demonstrates the potential of ML models for CHF prediction in nuclear reactor applications but also highlights their limitations when applied to complex geometries like rod bundles. The findings suggest that further research is needed to improve model generalization, potentially through techniques like transfer learning or the incorporation of additional training data that captures rod bundle-specific phenomena. These advancements could enhance the accuracy and reliability of CHF predictions in full-scale reactor simulations, contributing to improved reactor safety and efficiency.
View on arXiv

QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals

Nan Zhang, Eugene Kwek, Yusen Zhang, Muyu Pan, Suhang Wang, Prasenjit Mitra, Rui Zhang
  • QuantLRM leverages fine-tuning signals, specifically the smallest and largest weight updates, to improve quantization performance for Large Reasoning Models.
  • The method uses restricted quadratic functions to compute channel importance, emphasizing both extreme and zero weight updates.
  • QuantLRM outperforms existing PTQ methods, achieving an average improvement of 6.55% on reinforcement learning fine-tuned models and at least 1.65% on supervised fine-tuned models.
  • The approach is compatible with mainstream LRMs and can be applied to non-fine-tuned models using a pseudo-fine-tuning pipeline.
  • QuantLRM achieves state-of-the-art performance on four reasoning benchmarks while using a minimal calibration dataset.
Abstract
The paper introduces QuantLRM, a novel approach for weight-only quantization of Large Reasoning Models (LRMs) that leverages fine-tuning signals to improve quantization performance. Inspired by classical magnitude pruning, the authors hypothesize that both the smallest and largest weight updates during fine-tuning are critical for preserving model performance, a concept termed 'protecting both ends.' QuantLRM computes channel importance by fitting restricted quadratic functions to weight updates and multiplying the average quadratic values with the count of zero weight updates in each channel. This method is shown to outperform existing post-training quantization (PTQ) techniques, which typically rely on activation distributions or second-order information but fail to utilize the rich information encoded in fine-tuning traces. The authors validate their hypothesis through empirical experiments and demonstrate the effectiveness of QuantLRM across various fine-tuned LRMs and reasoning benchmarks. Additionally, they propose a pseudo-fine-tuning pipeline to extend QuantLRM's applicability to non-fine-tuned models. QuantLRM achieves state-of-the-art performance, with significant improvements in quantization accuracy and efficiency, particularly for low-bit quantization.
Methodology
QuantLRM computes channel importance by fitting restricted quadratic functions to weight updates during fine-tuning. The method emphasizes both the smallest and largest weight updates ('protecting both ends') and incorporates these importance scores into the quantization loss function. For non-fine-tuned models, a pseudo-fine-tuning pipeline is introduced to generate effective signals. The approach is evaluated on various fine-tuned LRMs across four reasoning benchmarks, including AIME-120, FOLIO, temporal sequences, and GPQA-Diamond.
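A rough rendering of the channel-importance idea, up-weighting both the largest and the (near-)zero weight updates, is given below; the paper fits restricted quadratic functions to the updates, which this fixed convex-quadratic proxy only approximates, and the +1 offset is an artifact of the sketch.
```python
import torch

def channel_importance(delta_w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Rough channel-importance proxy in the spirit of 'protecting both ends' (illustrative only).

    delta_w: (out_channels, in_channels) weight update W_finetuned - W_base.
    A convex quadratic over normalized update magnitudes is large for both the biggest
    and the smallest updates; the per-channel average is then scaled by the number of
    (near-)zero updates in that channel.
    """
    mag = delta_w.abs()
    norm = mag / (mag.max(dim=1, keepdim=True).values + eps)   # per-channel normalization to [0, 1]
    quad = (norm - 0.5) ** 2                                    # emphasizes both ends of the range
    zero_count = (mag < eps).sum(dim=1).float()
    return quad.mean(dim=1) * (1.0 + zero_count)                # +1 keeps channels without zeros nonzero
```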
Results
QuantLRM consistently outperforms existing PTQ baselines, achieving an average improvement of 6.55% on reinforcement learning fine-tuned models and at least 1.65% on supervised fine-tuned models. It demonstrates state-of-the-art performance on 3-bit quantization for LRMs and achieves comparable speedups to other PTQ methods while using a smaller calibration dataset.
Implications
QuantLRM has significant implications for compressing and optimizing Large Reasoning Models, enabling efficient deployment in resource-constrained environments. Its ability to leverage fine-tuning signals and support non-fine-tuned models enhances its versatility, making it applicable across a wide range of reasoning tasks and model architectures. This approach could pave the way for more efficient and specialized quantization techniques tailored to specific model types and tasks.
View on arXiv

Reasoning with Latent Tokens in Diffusion Language Models

Andre He, Sean Welleck, Daniel Fried
  • Latent tokens in diffusion models act as auxiliary computational states, facilitating joint reasoning over undecoded tokens and improving performance on reasoning tasks.
  • The authors propose latent token modulation, a method to control the number of latent tokens, enabling a trade-off between inference speed and sample quality.
  • Incorporating latent tokens into autoregressive models via a multi-token prediction objective significantly improves their performance on reasoning tasks, closing the gap with diffusion models.
  • Diffusion models outperform autoregressive models on tasks requiring global coherence and constraint satisfaction due to their ability to leverage latent computation.
  • Latent tokens represent a general mechanism for enhancing language model performance across different modeling paradigms.
Abstract
This paper investigates the role of latent tokens in discrete diffusion language models and their impact on reasoning tasks requiring global coherence and planning. Diffusion models, which predict tokens in parallel and in a non-monotonic order, have shown superior performance compared to autoregressive models on tasks like Sudoku and constraint satisfaction. The authors identify that latent tokens—tokens predicted but not immediately decoded—serve as auxiliary computational states that enhance reasoning by enabling joint predictions over undecoded tokens. They introduce a method called latent token modulation, which adjusts the number of latent tokens during generation, allowing a trade-off between inference speed and sample quality. Additionally, the authors demonstrate that latent tokens can be incorporated into autoregressive models through a multi-token prediction objective, significantly improving their performance on reasoning tasks. The study highlights latent tokens as a general mechanism for improving language model performance, particularly on tasks requiring global reasoning and lookahead.
Methodology
The authors analyze the role of latent tokens in discrete diffusion models, which predict distributions over all masked tokens at each generation step. They introduce latent token modulation to control the number of latent tokens during generation and evaluate its impact on reasoning tasks and language modeling. They also extend the concept of latent tokens to autoregressive models by adding a multi-token prediction objective during training. The models are tested on synthetic reasoning tasks (e.g., Sudoku) and real-world language modeling benchmarks.
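One way to picture latent-token modulation is a confidence-based unmasking step in which every masked position is predicted but only the k most confident are committed, leaving the rest as latent tokens that condition the next step; the commitment rule and scheduler here are assumptions, not the paper's exact procedure.
```python
import torch

def decode_step(logits: torch.Tensor, tokens: torch.Tensor, mask_id: int, k: int) -> torch.Tensor:
    """One diffusion-style decoding step with simplified latent-token modulation (illustrative).

    logits: (seq_len, vocab) predictions for every position; tokens: (seq_len,) current
    sequence with `mask_id` at undecoded positions. Only the k most confident masked
    positions are committed; the rest remain masked and act as latent tokens.
    """
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    masked = tokens == mask_id
    conf = conf.masked_fill(~masked, float("-inf"))   # only masked positions compete
    k = min(k, int(masked.sum()))
    if k > 0:
        commit = conf.topk(k).indices
        tokens = tokens.clone()
        tokens[commit] = pred[commit]
    return tokens
```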
Results
The study finds that increasing the number of latent tokens improves accuracy on reasoning tasks and reduces perplexity in language modeling. Latent token modulation provides a controllable trade-off between inference speed and sample quality. Autoregressive models augmented with latent tokens achieve substantial performance gains on reasoning tasks, often surpassing diffusion models under uniform decoding conditions.
Implications
The findings suggest that latent tokens are a powerful mechanism for improving the performance of language models on tasks requiring global reasoning and planning. This approach could be applied to enhance various natural language processing tasks, including constraint satisfaction, text generation, and complex reasoning. Additionally, the ability to trade off inference speed and quality through latent token modulation could make diffusion models more practical for real-world applications.
View on arXiv

Robust Representation Learning in Masked Autoencoders

Anika Shrivastava, Renu Rameshan, Samar Agnihotri
  • MAEs progressively develop class-separable latent spaces across network depth, even without explicit supervision.
  • MAEs exhibit robust classification performance under input perturbations such as Gaussian blur and occlusion.
  • The study introduces two sensitivity indicators—directional invariance and head-wise retention of active features—to measure the robustness of MAE representations.
  • MAEs demonstrate early and persistent global attention across encoder layers, unlike standard Vision Transformers (ViTs).
  • The robustness of MAE representations is a key factor in their strong downstream classification performance.
Read More
Abstract
This paper investigates the internal mechanisms and robustness of representations learned by Masked Autoencoders (MAEs), a self-supervised learning (SSL) approach for visual representation learning. The authors analyze how MAEs progressively develop class-separable latent spaces across network depth, even without explicit supervision. They demonstrate that MAEs exhibit robust classification performance under input perturbations such as Gaussian blur and occlusion. The study introduces two novel sensitivity indicators—directional invariance and head-wise retention of active features—to quantify the robustness of MAE representations. Through a layer-wise analysis of token embeddings, the authors show that MAEs construct class-aware subspaces that become increasingly separable as the network depth increases. Additionally, they observe that MAEs exhibit early and persistent global attention across encoder layers, distinguishing them from standard Vision Transformers (ViTs). The findings suggest that the robust classification performance of MAEs is closely tied to the stability of their latent representations under perturbations.
Methodology
The authors conducted a layer-wise analysis of token embeddings in pretrained MAEs to study the evolution of class-separable latent spaces. They used subspace-based geometric analysis to characterize class-specific structures and introduced two sensitivity indicators to quantify the robustness of representations under input perturbations, such as Gaussian blur and attention-guided occlusion. The classification performance of fine-tuned MAEs was evaluated under varying levels of input degradation.
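The summary does not spell out how the directional-invariance indicator is computed; one plausible proxy is the per-layer cosine alignment between token embeddings of a clean input and its perturbed counterpart, as sketched below. The assumption that `encoder(x)` returns a list of per-layer token embeddings is hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def directional_invariance(encoder, x_clean, x_perturbed):
    """Cosine alignment between clean and perturbed token embeddings, per layer.

    `encoder(x)` is assumed to return a list of per-layer embeddings of shape
    (batch, tokens, dim); an illustrative proxy, not the paper's definition.
    """
    feats_clean = encoder(x_clean)
    feats_pert = encoder(x_perturbed)
    scores = []
    for h_c, h_p in zip(feats_clean, feats_pert):
        cos = F.cosine_similarity(h_c, h_p, dim=-1)   # (batch, tokens)
        scores.append(cos.mean().item())
    return scores  # one invariance score per encoder layer
```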
Results
The study found that MAEs progressively develop class-separable latent spaces across network depth, with class-specific subspaces becoming increasingly distinct. MAEs also demonstrated robust classification performance under significant input perturbations, maintaining stable accuracy. The proposed sensitivity indicators revealed that MAE representations are less sensitive to input degradations compared to standard Vision Transformers. Additionally, MAEs were found to exhibit early and persistent global attention across encoder layers.
Implications
The findings highlight the potential of MAEs for robust visual representation learning, making them suitable for applications in scenarios involving noisy or occluded data, such as autonomous driving, medical imaging, and surveillance. The insights into MAE's internal mechanisms could also guide the development of more efficient and robust self-supervised learning models for computer vision tasks.
View on arXiv

Semantics-Aware Generative Latent Data Augmentation for Learning in Low-Resource Domains

Jae-Sung Bae, Minje Kim
  • GeLDA performs data augmentation in a foundation model-induced latent space, which is more efficient and task-relevant than the input space.
  • The framework uses conditional diffusion models, conditioned on semantic relationships among labels and subdomains, to generate high-quality data.
  • GeLDA improves performance in low-resource and label-imbalanced scenarios, achieving state-of-the-art results in long-tailed image classification and zero-shot speech emotion recognition.
  • The method is lightweight, requiring only a small diffusion model and limited training data, making it practical for real-world applications.
  • Ablation studies show the importance of abstraction levels in the latent space for effective data augmentation.
Read More
Abstract
This paper introduces GeLDA (Generative Latent Data Augmentation), a novel framework designed to address the challenges of learning in low-resource and label-imbalanced domains. GeLDA leverages foundation models (FMs) and conditional diffusion models to perform data augmentation in a task-relevant, low-dimensional latent space rather than the high-dimensional input space. By conditioning the generative process on auxiliary semantic information, such as label relationships and subdomain features, GeLDA improves the diversity and quality of augmented data. The framework is validated on two tasks: zero-shot speech emotion recognition (SER) for low-resource languages and long-tailed image classification on the ImageNet-LT dataset. GeLDA achieves significant performance improvements, including a 6.13% increase in unweighted average recall for SER and a new state-of-the-art tail-class accuracy of 74.7% on ImageNet-LT. These results demonstrate the effectiveness of GeLDA in mitigating data scarcity and label imbalance, making it a promising approach for real-world applications.
Methodology
GeLDA operates in the latent space of pretrained foundation models (FMs) and uses conditional diffusion models to generate synthetic data. The generative process is conditioned on augmented semantic information, such as label relationships and subdomain features, to improve the diversity and relevance of the generated data. The framework is evaluated on two tasks: zero-shot speech emotion recognition using Whisper-large as the FM and long-tailed image classification on ImageNet-LT.
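The general recipe, as described, is conditional diffusion over frozen foundation-model latents. The sketch below shows one DDPM-style training step for a small conditional denoiser on such latents; the network size, conditioning scheme, and noise schedule are all assumptions rather than GeLDA's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDenoiser(nn.Module):
    """Tiny conditional denoiser on foundation-model latents (illustrative)."""

    def __init__(self, latent_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_noisy, cond, t):
        return self.net(torch.cat([z_noisy, cond, t], dim=-1))

def diffusion_step(model, z, cond, alphas_bar):
    """One DDPM-style training step on frozen-FM latents z with condition cond."""
    t = torch.randint(0, len(alphas_bar), (z.size(0),))
    a = alphas_bar[t].unsqueeze(-1)
    noise = torch.randn_like(z)
    z_noisy = a.sqrt() * z + (1 - a).sqrt() * noise      # forward noising
    pred = model(z_noisy, cond, t.unsqueeze(-1).float() / len(alphas_bar))
    return F.mse_loss(pred, noise)                        # predict the noise
```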
Results
On zero-shot speech emotion recognition, GeLDA improves the unweighted average recall by 6.13% over the Whisper-large baseline, using a small diffusion model trained on 83 hours of data. On long-tailed image classification, GeLDA achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state of the art while maintaining strong performance on middle- and head-class accuracies.
Implications
GeLDA provides a practical and effective solution for data augmentation in low-resource and label-imbalanced domains, with potential applications in speech recognition, image classification, and other tasks requiring robust performance under data scarcity. Its lightweight design and reliance on latent spaces make it suitable for real-world scenarios where computational resources and labeled data are limited.
View on arXiv

Sequential Group Composition: A Window into the Mechanics of Deep Learning

Giovanni Luca Marchetti, Daniel Kunin, Adele Myers, Francisco Acosta, Nina Miolane
  • The Sequential Group Composition task requires neural networks to predict the cumulative product of group elements, capturing the essence of many structured computation problems.
  • Two-layer networks learn the task by decomposing it into irreducible representations of the group, but require exponential hidden width as sequence length increases.
  • Recurrent neural networks and multilayer networks exploit the associativity of the task to achieve more efficient scaling, with linear and logarithmic complexity, respectively.
  • The task is order-sensitive and nonlinear, requiring networks to learn nonlinear interactions between inputs.
  • The group-specific Fourier decomposition of the task enables precise analysis of learning dynamics and feature emergence in neural networks.
Read More
Abstract
This paper introduces the Sequential Group Composition (SGC) task as a framework to study how neural networks learn structured operations, such as arithmetic, geometric, and algorithmic computations. The SGC task involves predicting the cumulative product of a sequence of elements from a finite group, encoded in a real vector space. The authors analyze how different neural network architectures (two-layer networks, recurrent neural networks, and multilayer networks) learn to solve this task, focusing on the role of group structure, encoding statistics, and sequence length. They demonstrate that two-layer networks learn the task by decomposing it into irreducible representations of the group, but require exponential hidden width with respect to sequence length. In contrast, deeper architectures exploit the associativity of the task to achieve more efficient scaling, with recurrent networks solving the task in linear steps and multilayer networks in logarithmic layers. The paper provides a mathematical framework for understanding how neural networks learn from sequential data and highlights the importance of architectural biases in shaping learning dynamics.
Methodology
The authors formulate the Sequential Group Composition task as a regression problem where networks predict the cumulative product of group elements encoded in a real vector space. They analyze the learning dynamics of different architectures (two-layer networks, recurrent neural networks, and multilayer networks) using group-specific Fourier decomposition to study how features are learned and in what order. The analysis focuses on the role of group structure, encoding statistics, and sequence length in shaping learning efficiency.
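To make the task concrete, the sketch below generates an SGC dataset for the cyclic group Z_k with one-hot encodings, where the cumulative "product" is addition mod k. The choice of group and encoding is illustrative; the paper studies general finite groups and encoding statistics.

```python
import numpy as np

def make_sgc_dataset(n_samples: int, seq_len: int, group_order: int, seed: int = 0):
    """Sequential Group Composition on the cyclic group Z_k (addition mod k).

    Inputs are one-hot encoded group elements; the target is the cumulative
    composition of the sequence. Cyclic group + one-hot are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    elems = rng.integers(0, group_order, size=(n_samples, seq_len))
    targets = elems.sum(axis=1) % group_order      # cumulative composition in Z_k
    x = np.eye(group_order)[elems]                 # (n, seq_len, k) one-hot inputs
    y = np.eye(group_order)[targets]               # (n, k) one-hot targets
    return x, y
```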
Results
Two-layer networks can perfectly learn the task but require exponential hidden width with respect to sequence length. Recurrent neural networks solve the task in linear steps by composing elements sequentially, while multilayer networks achieve logarithmic scaling by composing adjacent pairs in parallel. The study reveals a universality in feature learning dynamics across architectures and highlights the efficiency gains of deeper models in exploiting task associativity.
Implications
The Sequential Group Composition task provides a principled framework for understanding how neural networks learn structured computations from sequential data. Insights from this work could inform the design of more efficient architectures for tasks involving group-based operations, such as algorithmic planning, geometric transformations, and physical simulations. Additionally, the findings contribute to the broader field of mechanistic interpretability by revealing how neural networks represent and learn composition rules.
View on arXiv

Soft Sensor for Bottom-Hole Pressure Estimation in Petroleum Wells Using Long Short-Term Memory and Transfer Learning

M. A. Fernandes, E. Gildin, M. A. Sampaio
  • The paper proposes a soft sensor for flowing bottom-hole pressure estimation using LSTM networks, addressing the limitations of physical sensors like PDGs.
  • Transfer learning is introduced to adapt models across different operational environments, mitigating issues like concept drift and varying reservoir conditions.
  • The methodology achieves high accuracy with MAPE consistently below 2%, outperforming benchmarks such as MLP and Ridge Regression.
  • The approach is tested on real offshore datasets from Brazil's Pre-salt basin, demonstrating practical applicability.
  • The solution offers a cost-effective alternative to physical sensors and can be integrated into digital twin systems for anomaly detection and error monitoring.
Read More
Abstract
This paper addresses the challenge of estimating flowing bottom-hole pressure (BHP) in petroleum wells, a critical variable for production optimization, safety, and emissions reduction. Physical sensors, such as Permanent Downhole Gauges (PDGs), are often unreliable or economically unfeasible, especially in mature fields or low-productivity wells. The authors propose a machine learning-based soft sensor using Long Short-Term Memory (LSTM) networks to estimate BHP based on wellhead and topside measurements. The study also introduces transfer learning to adapt models across different operational environments, addressing concept drift and varying reservoir conditions. The methodology was tested on real offshore datasets from Brazil's Pre-salt basin, achieving high accuracy with a Mean Absolute Percentage Error (MAPE) consistently below 2%, outperforming traditional approaches like Multi-Layer Perceptron (MLP) and Ridge Regression. This work provides a cost-effective, reliable alternative to physical sensors and demonstrates broad applicability across diverse reservoir and flow conditions.
Methodology
The authors developed a data-driven soft sensor using LSTM networks to estimate flowing BHP based on wellhead and topside measurements. Transfer learning techniques were employed to adapt models across different operational environments. The methodology was benchmarked against MLP and Ridge Regression models using real-world offshore datasets. The focus was on steady-state flow conditions, ensuring reliable predictions despite the complexity of multiphase flow and varying reservoir characteristics.
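A minimal version of such a soft sensor is an LSTM regressor over a sliding window of wellhead and topside measurements, with transfer to a new well done by freezing the recurrent layers and fine-tuning only the output head. Layer sizes, window length, and the choice of frozen layers are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BHPSoftSensor(nn.Module):
    """LSTM regressor mapping a window of wellhead/topside features to BHP."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, window, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # BHP estimate at the window's end

def transfer_to_new_well(model: BHPSoftSensor) -> list:
    """Freeze the recurrent feature extractor; fine-tune only the head."""
    for p in model.lstm.parameters():
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]
```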
Results
The proposed LSTM-based soft sensor achieved a Mean Absolute Percentage Error (MAPE) consistently below 2%, outperforming traditional methods like MLP and Ridge Regression. The model demonstrated robustness across diverse operational environments and reservoir conditions, validating its effectiveness using real offshore datasets from Brazil's Pre-salt basin.
Implications
This work offers a cost-effective and accurate alternative to physical sensors for bottom-hole pressure estimation, reducing reliance on expensive and unreliable PDGs. The methodology can be applied across diverse reservoir and flow conditions, making it suitable for mature fields and low-productivity wells. Additionally, it can be integrated into digital twin systems for continuous monitoring, anomaly detection, and production optimization in the petroleum industry.
View on arXiv

The Label Horizon Paradox: Rethinking Supervision Targets in Financial Forecasting

Chen-Hui Song, Shuoling Liu, Liyuan Chen
  • The paper identifies the 'Label Horizon Paradox,' where the optimal training label horizon often differs from the inference target due to market dynamics.
  • Generalization performance is governed by a trade-off between marginal signal realization (information gain) and marginal noise accumulation (volatility).
  • The authors propose a bi-level optimization framework that dynamically learns the optimal label horizon during training, avoiding manual tuning.
  • Theoretical insights are derived using a linear factor model based on Arbitrage Pricing Theory, unifying empirical observations under a signal-noise trade-off mechanism.
  • Empirical results on large-scale financial datasets demonstrate that the proposed adaptive approach outperforms conventional methods in short-term stock forecasting.
Read More
Abstract
This paper introduces the 'Label Horizon Paradox,' a novel insight in financial forecasting that challenges the conventional assumption that training labels must align strictly with inference targets. The authors demonstrate that the optimal supervision signal often deviates from the prediction goal, shifting to intermediate horizons due to the dynamic interplay between signal realization and noise accumulation in financial markets. They theoretically ground this phenomenon using a signal-noise trade-off framework, where generalization performance depends on balancing marginal information gain and noise accumulation over time. To address this, the authors propose a bi-level optimization framework that adaptively learns the optimal label horizon during training, eliminating the need for manual or brute-force selection. Extensive experiments on large-scale financial datasets validate the approach, showing consistent improvements over traditional methods. This work highlights the importance of rethinking label design in financial forecasting and opens new avenues for research in label-centric machine learning.
Methodology
The authors use a bi-level optimization framework to adaptively learn the optimal label horizon during training. The framework treats the label horizon as a learnable parameter and dynamically balances the trade-off between signal realization and noise accumulation. Theoretical derivations are based on a linear factor model grounded in Arbitrage Pricing Theory, and experiments are conducted on large-scale financial datasets to validate the approach.
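One simplified way to make the label horizon learnable, sketched below, is to keep a softmax weighting over candidate horizons and train the forecaster against the blended target. This single-loop stand-in is not the paper's bi-level inner/outer optimization; it only illustrates the idea of treating the horizon as a learnable quantity.

```python
import torch
import torch.nn as nn

class AdaptiveHorizonLoss(nn.Module):
    """Blend labels over candidate horizons with learnable weights (sketch)."""

    def __init__(self, n_horizons: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_horizons))

    def forward(self, pred: torch.Tensor, returns_by_horizon: torch.Tensor):
        # pred: (batch,), returns_by_horizon: (batch, n_horizons)
        w = torch.softmax(self.logits, dim=0)
        target = (returns_by_horizon * w).sum(dim=-1)   # blended training label
        return torch.mean((pred - target) ** 2)
```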
Results
The proposed adaptive framework consistently outperforms traditional methods in short-term stock forecasting tasks. By optimizing the signal-noise trade-off, the method achieves superior generalization performance, demonstrating the effectiveness of using intermediate label horizons rather than strictly aligning training labels with inference targets.
Implications
This work has significant implications for financial forecasting and machine learning. It challenges the conventional design of supervision signals, suggesting that adaptive label horizons can improve predictive performance in noisy, non-stationary environments like financial markets. The methodology could be extended to other domains with similar signal-noise dynamics, such as weather forecasting or healthcare time-series predictions.
View on arXiv

Toward Ultra-Long-Horizon Sequential Model Editing

Mingda Liu, Zhenghan Zhu, Ze'an Miao, Katsuki Fujisawa
  • Sequential model editing in LLMs often leads to performance degradation due to exponential growth in parameter norms.
  • Norm-Anchor Scaling (NAS) is introduced as a simple yet effective strategy to stabilize long-horizon sequential edits by constraining norm inflation.
  • NAS improves editing success rates by 72.2% on average and delays model collapse by over 4× compared to baseline methods.
  • The method is computationally efficient, requiring only a single additional line of code, and is validated on large-scale datasets like CounterFact and ZsRE.
  • NAS prevents unintended side effects and ensures stable model behavior during extensive sequential editing tasks.
Read More
Abstract
This paper addresses the challenge of sequential model editing in large language models (LLMs), where repeated updates to model parameters often lead to performance degradation and instability. The authors identify a critical issue in the Locate-and-Edit (L&E) paradigm: exponential growth in the norms of edited parameters, which correlates strongly with model collapse during long sequences of edits. To mitigate this, they propose Norm-Anchor Scaling (NAS), a plug-and-play strategy that constrains the norm of parameter updates by anchoring them to the stable scale of the unedited model. NAS ensures numerical stability and prevents norm blow-up, significantly improving the robustness and performance of sequential editing. Extensive experiments demonstrate that NAS delays the degradation point by over 4× and improves editing success rates by 72.2% on average, while requiring minimal computational overhead. The proposed method is validated on large-scale datasets like CounterFact and ZsRE, showcasing its effectiveness in ultra-long-horizon editing scenarios.
Methodology
The authors analyze the Locate-and-Edit (L&E) paradigm, identifying exponential norm growth as a key factor in model collapse during sequential edits. They propose Norm-Anchor Scaling (NAS), which rescales parameter updates to a stable reference magnitude derived from the unedited model. This approach preserves the direction of updates while constraining their scale to prevent instability. The method is theoretically proven to bound norm growth and empirically validated through extensive experiments on LLMs such as LLaMA3 and GPT-J.
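Given that description, the core operation can be sketched in a few lines: rescale the edit update so its norm is anchored to the scale of the unedited weights while keeping its direction. The specific anchor (a fraction of the original weight norm) and the cap that prevents inflating small updates are assumptions, not NAS's exact rule.

```python
import torch

def norm_anchor_scaling(delta_w: torch.Tensor, w_original: torch.Tensor,
                        gamma: float = 0.1) -> torch.Tensor:
    """Rescale an edit update toward a reference magnitude from the unedited weights.

    Preserves the update direction; anchors its norm to gamma * ||W_original||.
    The anchor choice and scaling rule are illustrative assumptions.
    """
    ref = gamma * w_original.norm()
    scale = (ref / (delta_w.norm() + 1e-12)).clamp(max=1.0)  # only shrink, never inflate
    return delta_w * scale
```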
Results
NAS significantly improves the robustness of sequential model editing, delaying degradation by over 4× and increasing average editing success rates from 51.9% to 89.3% (+37.4 percentage points). It effectively suppresses norm drift and hidden representation instability, ensuring stable performance across long edit sequences. Experiments on CounterFact (20,877 edits) and ZsRE (19,086 edits) demonstrate that NAS is the only method among evaluated baselines that avoids clear degradation during ultra-long-horizon editing.
Implications
The proposed NAS method has significant implications for lifelong knowledge updating in LLMs, enabling stable and precise corrections without retraining or introducing global side effects. It can be applied in real-world scenarios requiring frequent updates to factual knowledge, such as dynamic knowledge bases, conversational AI, and systems requiring continuous adaptation to new information.
View on arXiv

medR: Reward Engineering for Clinical Offline Reinforcement Learning via Tri-Drive Potential Functions

Qianyi Xu, Gousia Habib, Feng Wu, Yanrui Du, Zhihui Chen, Swapnil Mishra, Dilruk Perera, Mengling Feng
  • The paper introduces medR, an automated reward engineering pipeline for clinical offline RL, leveraging LLMs for interpretable feature selection and reward design.
  • A novel 'Tri-Drive' reward mechanism is proposed, incorporating survival, confidence, and competence to guide policy learning and align with clinical objectives.
  • Quantitative metrics are developed to verify reward functions offline, addressing the lack of ground truth in clinical RL environments.
  • Empirical results show that medR-trained policies achieve up to 77.3% higher WIS scores compared to clinician baselines across three high-stakes clinical tasks.
  • The framework demonstrates generalizability across diverse diseases, offering a scalable solution for optimizing dynamic treatment regimes.
Read More
Abstract
This paper addresses the critical challenge of reward engineering in clinical offline reinforcement learning (RL), which is essential for optimizing dynamic treatment regimes (DTRs) in high-stakes medical environments. The authors propose an automated pipeline, medR, that leverages Large Language Models (LLMs) to design and verify reward functions tailored to specific clinical tasks. The framework introduces a novel reward structure based on a 'Tri-Drive' mechanism, which incorporates three core components: survival, confidence, and competence. These components aim to align RL policies with clinical goals while addressing challenges such as sparse and delayed outcomes, implicit treatment targets, and the lack of verifiable reward signals in offline settings. The proposed method bridges the gap between medical knowledge and mathematical representation, enabling interpretable and generalizable reward design. Empirical validation on three distinct clinical tasks demonstrates that policies trained with medR significantly outperform those using standard reward functions, showcasing its potential to improve decision-making in critical care.
Methodology
The authors model clinical dynamics as a Partially Observable Markov Decision Process (POMDP) and approximate it as an MDP for offline policy training. They use LLMs to automate the design of reward functions, incorporating domain knowledge into a 'Tri-Drive' potential function framework. This framework includes three components: survival (patient outcomes), confidence (model certainty), and competence (alignment with clinical best practices). Quantitative metrics are introduced to evaluate and verify the reward functions offline before deployment.
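A hedged sketch of a Tri-Drive-style reward is shown below as potential-based shaping over three components. The three potential callables, their weights, and the shaping form are placeholders, since the summary does not give medR's exact potential functions.

```python
def tri_drive_reward(state, next_state,
                     survival_fn, confidence_fn, competence_fn,
                     gamma: float = 0.99, weights=(1.0, 0.5, 0.5)) -> float:
    """Potential-based shaping with survival, confidence, and competence terms.

    r = sum_i w_i * (gamma * phi_i(s') - phi_i(s)). The phi_* callables and
    weights are hypothetical placeholders, not medR's published forms.
    """
    reward = 0.0
    for w, phi in zip(weights, (survival_fn, confidence_fn, competence_fn)):
        reward += w * (gamma * phi(next_state) - phi(state))
    return reward
```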
Results
The proposed medR framework was validated on three distinct clinical tasks, where it outperformed standard reward functions. Policies trained with medR achieved 77.3%, 66.7%, and 60.3% higher WIS scores compared to clinician baselines, demonstrating its effectiveness and generalizability across different diseases and datasets.
Implications
The medR framework has the potential to significantly improve clinical decision-making by automating the design of reward functions for offline RL in healthcare. Its generalizability across diverse diseases suggests it could be applied to optimize treatment strategies in various medical domains, ultimately enhancing patient outcomes and reducing reliance on manual, heuristic-based reward engineering.
View on arXiv