AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 61 | Update frequency: 8h | Days of history: 7
A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification
Theory
Optimization
- Isolates a boundary-layer mechanism for understanding learning dynamics in online softmax classification.
- Demonstrates that the generalization error scales as α^(-1/3) in late training phases.
- Shows that learning-rate schedules can improve generalization error scaling to α^(-1/2).
- Validates theoretical predictions through simulations and controlled experiments.
Summary
This paper investigates the learning dynamics of online softmax classification, focusing on the discrepancies between smooth surrogate losses and discrete labels. The authors isolate a boundary-layer mechanism that explains the emergence of power-law learning curves in a teacher-student model. By analyzing centered variables, they show that at late training times only examples near decision boundaries remain active, which slows the decay of the classification error with training time. Specifically, the generalization error follows a power law of α^(-1/3), slower than the Bayes-optimal rate of α^(-1). The study also shows that learning-rate schedules can improve the scaling of the generalization error towards α^(-1/2). Through simulations and controlled experiments, the authors validate their theoretical predictions and highlight how data structure influences learning dynamics. The findings complement existing spectral explanations of neural scaling laws by emphasizing the role of boundary layers in online classification tasks.
Methodology
The authors employ a one-layer K-class teacher-student model with Gaussian inputs and online stochastic gradient descent (SGD). They utilize order-parameter methodology from statistical physics to derive a centered macroscopic closure for the softmax learning process, focusing on the dynamics of centered variables and boundary layers.
Results
The study finds that the late-time dynamics of online classification are dominated by boundary layers near decision boundaries, leading to a generalization error that scales as α^(-1/3). Learning-rate schedules can improve this scaling towards α^(-1/2). Simulations confirm the theoretical predictions, demonstrating the robustness of the proposed boundary-layer mechanism.
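To make the two reported rates concrete, the sketch below estimates a power-law exponent from a learning curve by a least-squares fit in log-log space. The curves are synthetic stand-ins that follow α^(-1/3) and α^(-1/2) exactly; they are not data from the paper, and the function name is hypothetical.

```python
import math

def fit_power_law_exponent(alphas, errors):
    """Least-squares slope of log(error) against log(alpha)."""
    xs = [math.log(a) for a in alphas]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic curves (assumption: idealized, noise-free power laws).
alphas = [10.0 ** k for k in range(1, 7)]
constant_lr = [a ** (-1.0 / 3.0) for a in alphas]   # reported late-time rate
scheduled_lr = [a ** (-1.0 / 2.0) for a in alphas]  # reported rate with a schedule

slope_constant = fit_power_law_exponent(alphas, constant_lr)
slope_scheduled = fit_power_law_exponent(alphas, scheduled_lr)
print(slope_constant, slope_scheduled)
```

On real simulation curves the fit would normally be restricted to the late-time window, since the power law is described as a late-training regime.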
Implications
The findings suggest that understanding the boundary-layer dynamics can lead to improved strategies for training classifiers, particularly in online settings. The insights into scaling laws could inform the design of more efficient learning algorithms and enhance generalization performance in practical applications.
The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation
Large Language Models
NLP
Theory
- Data contamination in LLMs can create a false impression of reasoning capabilities.
- The Zero-CoT Probe (ZCP) method effectively detects evasive data contamination by truncating the CoT process.
- Contamination Confidence is introduced as a new metric to quantify the severity of data contamination.
- Extensive evaluations reveal significant levels of data contamination in various LLMs.
Summary
This paper addresses the issue of data contamination in large language models (LLMs), which undermines the evaluation of their reasoning capabilities. The authors highlight that malicious publishers often employ evasive strategies, such as paraphrasing benchmark data, to evade detection and inflate performance metrics. To tackle this problem, the authors introduce the Zero-CoT Probe (ZCP), a novel black-box detection method that truncates the Chain-of-Thought (CoT) process to reveal underlying memorization in model outputs. By comparing the model's performance on the original benchmark with an isomorphically perturbed reference dataset, ZCP effectively isolates memorization from genuine problem-solving skills. Additionally, the authors propose a new metric, Contamination Confidence, which quantifies the severity of contamination beyond binary classifications. Extensive experiments demonstrate that ZCP can robustly detect both direct and evasive data contamination across various models, revealing a widespread issue of data contamination in both closed-source and open-source LLMs.
Methodology
The authors developed the Zero-CoT Probe (ZCP), which truncates the Chain-of-Thought (CoT) reasoning process in LLMs to expose latent memorization. The method compares the model's performance on the original benchmark dataset against a perturbed reference dataset to identify discrepancies that indicate contamination. The Contamination Confidence metric is also introduced to quantify the likelihood and severity of contamination.
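The comparison step can be illustrated with a small sketch. It assumes that each benchmark item has already been answered with the reasoning truncated, giving one correct/incorrect flag per item for the original set and for the perturbed reference set. The two-proportion z-score used here is a generic stand-in and is not the paper's Contamination Confidence, whose definition is not given in this summary; all names are hypothetical.

```python
import math

def accuracy(correct_flags):
    """Fraction of items answered correctly (flags are 0 or 1)."""
    return sum(correct_flags) / len(correct_flags)

def zero_cot_gap(original_flags, reference_flags):
    """Return (accuracy gap, pooled two-proportion z-score).

    A large positive gap means the model does much better on the original
    items than on structurally equivalent perturbed ones when it is not
    allowed to reason, which is the pattern memorization would produce.
    """
    n1, n2 = len(original_flags), len(reference_flags)
    p1, p2 = accuracy(original_flags), accuracy(reference_flags)
    pooled = (sum(original_flags) + sum(reference_flags)) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se if se > 0 else 0.0
    return p1 - p2, z

# Made-up flags for illustration: 80/100 correct on the original benchmark,
# 50/100 correct on the perturbed reference.
original_flags = [1] * 80 + [0] * 20
reference_flags = [1] * 50 + [0] * 50
gap, z = zero_cot_gap(original_flags, reference_flags)
print(gap, z)
```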
Results
The experiments conducted demonstrate that ZCP can reliably detect both direct and evasive data contamination in various models. The results indicate a significant presence of data contamination in both closed-source and open-source LLMs, highlighting the need for improved evaluation methods.
Implications
The findings suggest that current evaluation metrics for LLMs may be misleading due to data contamination. The proposed ZCP method and Contamination Confidence metric can help developers and researchers better assess the true capabilities of LLMs, leading to more informed deployment decisions and improved model training practices.
Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL
Reinforcement Learning
Large Language Models
Theory
- Self-play stability is governed more by data gating than by reward design.
- A strict data gate ensures stability across various reward configurations.
- The Grounded Proposer Paradox indicates that access to ground truth can worsen stability.
- A continuous strictness parameter for gating reveals a two-stage phase transition in training dynamics.
Summary
This paper investigates the stability issues in self-play reinforcement learning (RL), particularly focusing on the roles of data gating and reward grounding. The authors argue that stability is influenced more by a data-level gate, which determines which proposer-generated tasks are included in the training pool, than by the reward signal that updates the policy on admitted tasks. Through controlled experiments on a Python output-prediction task and a deterministic DSL twin task, they demonstrate that a strict data gate is sufficient for stability across various reward designs, while the absence of such a gate leads to collapse regardless of the reward structure. The study reveals a counter-intuitive phenomenon termed the Grounded Proposer Paradox, where a proposer with access to ground truth can accelerate collapse when paired with a self-consistency solver. Additionally, the introduction of a continuous strictness parameter for the gate uncovers a two-stage phase transition in training dynamics. Overall, the findings suggest that data-level gating is the critical factor for maintaining stability in self-play RL systems.
Methodology
The authors conducted controlled experiments using a Python output-prediction task and a deterministic DSL twin task to analyze the effects of data gating and reward grounding on self-play RL stability. They varied the strictness of the data gate and tested multiple reward structures to observe their impact on training outcomes.
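As a rough picture of what a data gate with a continuous strictness parameter might look like, the sketch below admits a proposed task into the training pool only when sampled solver answers agree strongly enough. The agreement criterion, the function names, and the example tasks are all assumptions made for illustration; the summary does not specify the paper's actual gate.

```python
from collections import Counter

def agreement(answers):
    """Fraction of sampled answers that match the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def gate_tasks(candidates, strictness):
    """Keep tasks whose answer agreement is at least `strictness` (0 to 1)."""
    return [task for task, answers in candidates if agreement(answers) >= strictness]

# Hypothetical proposer outputs paired with four sampled solver answers each.
candidates = [
    ("print(1 + 1)", ["2", "2", "2", "2"]),
    ("print(0.1 + 0.2)", ["0.3", "0.30000000000000004", "0.3", "0.3"]),
    ("print(undefined_name)", ["0", "None", "error", "1"]),
]

strict_pool = gate_tasks(candidates, strictness=1.0)
loose_pool = gate_tasks(candidates, strictness=0.75)
open_pool = gate_tasks(candidates, strictness=0.0)
print(len(strict_pool), len(loose_pool), len(open_pool))
```

Setting the strictness to zero admits every task, which corresponds to the ungated setting the experiments associate with collapse.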
Results
The experiments showed that a strict data gate is sufficient for stability, while the removal of the gate leads to irreversible collapse regardless of the reward design. The Grounded Proposer Paradox was identified, demonstrating that proposers with ground-truth access can lead to faster collapse when paired with self-consistency solvers. The introduction of a continuous strictness parameter revealed a two-stage phase transition in training metrics.
Implications
These findings suggest that future self-play RL systems should prioritize data gating mechanisms to enhance stability, rather than solely focusing on reward design. This could lead to more robust training methodologies for language models and other AI systems that utilize self-play.
A Posterior-Predictive Variance Decomposition for Epistemic and Aleatoric Uncertainty in Wind Power Forecasting
Time Series
Theory
- Introduces a posterior-predictive variance decomposition framework for wind power forecasting.
- Successfully separates epistemic and aleatoric uncertainties, improving forecasting accuracy.
- Develops an evaluation framework that does not rely on ground-truth uncertainty labels.
- Demonstrates theoretical consistency and operational utility through synthetic and real-world experiments.
Summary
This paper addresses the critical need for accurate uncertainty quantification in wind power forecasting, which is essential for grid reliability due to the inherent variability of wind energy. Traditional methods often conflate epistemic uncertainty (EU) and aleatoric uncertainty (AU) into a single total uncertainty (TU) estimate, which can obscure the distinct sources of uncertainty. The authors propose a novel framework that applies the law of total variance to decompose TU into its AU and EU components using heteroscedastic neural network regression and Bayesian posterior approximation. This decomposition allows for a clearer understanding of the uncertainties involved, enhancing forecasting performance and operational reliability. The paper also introduces a comprehensive evaluation framework tailored for wind power forecasting that does not require ground-truth uncertainty labels. This framework includes controlled synthetic experiments, validation on real-world SCADA datasets, and dataset-size scaling experiments to assess the theoretical consistency of the AU and EU components. The results demonstrate that the decomposed components respond consistently to variations in noise structure, distribution shifts, and training scale, validating the utility of the proposed methods in practical applications.
Methodology
The authors apply the law of total variance to heteroscedastic neural network regression combined with Bayesian posterior inference to derive a decomposition of total uncertainty into aleatoric and epistemic components. They also implement β-NLL training to balance mean and variance learning, ensuring compatibility with standard posterior approximation methods.
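The law of total variance gives Var[y | x] = E_θ[σ²_θ(x)] + Var_θ[μ_θ(x)], where θ ranges over posterior samples. A minimal sketch of that split for a single input is shown below, assuming each posterior sample (for example an ensemble member or a dropout pass) outputs a predictive mean and variance. The numbers are made up and the function name is hypothetical.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(means, variances):
    """Split total predictive variance for one input.

    means[i], variances[i]: predictive mean and variance from posterior sample i.
    Aleatoric = average predicted variance; epistemic = variance of the means.
    """
    aleatoric = fmean(variances)
    epistemic = pvariance(means)
    return aleatoric, epistemic, aleatoric + epistemic

# Illustrative outputs of four posterior samples for one forecast
# (normalized power units; values are not from the paper).
means = [0.50, 0.54, 0.46, 0.50]
variances = [0.010, 0.012, 0.008, 0.010]

au, eu, tu = decompose_uncertainty(means, variances)
print(au, eu, tu)
```

With more training data the means of the posterior samples are expected to agree more closely, so the epistemic term should shrink while the aleatoric term stays tied to the noise level, which is the kind of behavior the dataset-size scaling experiments check.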
Results
The experiments conducted on both synthetic and real-world datasets show that the AU and EU components behave as theoretically predicted in response to variations in noise structure, distribution shifts, and training scale. This supports the effectiveness of the proposed decomposition and evaluation framework.
Implications
The findings have significant implications for improving wind power forecasting accuracy and reliability, which is crucial for the integration of renewable energy sources into power grids. The framework can also be applied to other domains requiring uncertainty quantification and disentanglement.
The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
Optimization
Multimodal
Theory
- Identifies the limitations of linear scalarization in multi-task RRG optimization.
- Introduces the concept of 'Double Dilemma' in gradient dynamics affecting RRG.
- Proposes CAME-Grad, a novel optimizer that enhances multi-task learning performance.
- Demonstrates significant improvements in clinical efficacy across multiple RRG methods.
Summary
This paper addresses the challenges in multi-task learning for automatic radiology report generation (RRG), particularly focusing on the limitations of linear scalarization strategies that fail to balance clinical supervision with report generation smoothness. The authors analyze the failure mechanisms of these strategies using gradient dynamics, identifying a 'Double Dilemma' characterized by drift term deviation and diffusion term decay. To overcome these issues, they propose a new optimizer called Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad), which enhances optimization dynamics through conflict-averse direction rectification, magnitude-enhanced energy injection, and adaptive gradient fusion. CAME-Grad is designed to be backbone-agnostic, allowing it to be integrated into existing RRG methods without altering their architectures. Experimental results demonstrate that CAME-Grad significantly improves clinical efficacy across eight diverse RRG methods, achieving an average performance increase of 2.3% on the MIMIC-CXR dataset and 1.9% on the IU X-Ray dataset.
Methodology
The authors utilize a stochastic differential equation (SDE) framework to analyze the gradient dynamics of multi-task optimization in RRG. They propose the CAME-Grad optimizer, which incorporates three main components: conflict-averse direction rectification to mitigate drift, magnitude-enhanced energy injection to counteract diffusion decay, and adaptive gradient fusion to balance optimal directions with task-specific biases.
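The summary does not give CAME-Grad's update rule, so the sketch below only illustrates the general idea of rectifying a conflicting gradient direction. It uses the well-known PCGrad-style projection as a stand-in: when two task gradients have a negative inner product, the conflicting component is removed. This is not the paper's method, and the gradients are toy values.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_out_conflict(g_task, g_other):
    """Remove from g_task its component along g_other if the two conflict.

    Two gradients conflict when their inner product is negative. After the
    projection, the returned vector is orthogonal to g_other.
    """
    d = dot(g_task, g_other)
    if d >= 0:
        return list(g_task)
    scale = d / dot(g_other, g_other)
    return [a - scale * b for a, b in zip(g_task, g_other)]

# Toy gradients for a report-generation loss and a clinical-supervision loss.
g_generation = [1.0, 0.0]
g_clinical = [-1.0, 1.0]

rectified = project_out_conflict(g_generation, g_clinical)
print(rectified)
```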
Results
CAME-Grad was tested across eight different RRG methods, resulting in an average improvement of 2.3% on the MIMIC-CXR dataset and 1.9% on the IU X-Ray dataset, showcasing its effectiveness as a universal optimizer in enhancing clinical report generation.
Implications
The findings suggest that optimizing gradient dynamics can lead to better performance in multi-task learning scenarios, particularly in clinical applications where accuracy and consistency are critical. The proposed CAME-Grad optimizer can be applied to various multi-task learning frameworks beyond RRG, potentially improving outcomes in other domains that face similar optimization challenges.
Toward Understanding Adversarial Distillation: Why Robust Teachers Fail
Theory
- Identifies the 'Robustly Unlearnable Set' as a key factor in the failure of Adversarial Distillation.
- Develops a theoretical framework explaining how teacher-student dynamics affect robust generalization.
- Demonstrates that a teacher's predictive confidence on unlearnable samples is crucial for student robustness.
- Empirical validation confirms the theoretical predictions across various datasets.
Summary
This paper investigates the inconsistencies in the effectiveness of Adversarial Distillation (AD), which aims to enhance the robustness of student models by using soft labels from robust teacher models within a min-max adversarial training framework. The authors identify a critical issue: the misalignment between the teacher's supervisory confidence and the student's representational limitations, particularly concerning a subset of training data termed the 'Robustly Unlearnable Set.' Through a theoretical framework analyzing the learning dynamics of a two-layer neural network, the authors demonstrate that when a teacher provides confident supervision on unlearnable samples, it leads the student to memorize irrelevant noise patterns, resulting in robust overfitting. Conversely, a teacher exhibiting high uncertainty on these samples helps the student focus on learnable signals, improving generalization. Empirical validation across synthetic simulations and real-image datasets supports the theory, revealing that a teacher's predictive entropy on unlearnable samples is a strong indicator of student robustness. The findings offer a principled guideline for selecting effective robust teachers in AD, bridging theoretical insights with practical applications.
Methodology
The authors develop a theoretical framework to analyze the feature learning dynamics of a two-layer neural network. They empirically validate their findings through experiments on both synthetic data and real-image classification tasks, assessing the impact of teacher supervision on student performance.
Results
The study confirms that robust overfitting is significantly influenced by the teacher's interaction with unlearnable samples. The theoretical framework accurately predicts the learning dynamics, and the empirical results show that a teacher's predictive entropy on unlearnable samples correlates strongly with the robustness of the student model.
Implications
The findings suggest that careful selection of teacher models based on their predictive confidence can enhance the robustness of student models in adversarial settings. This has potential applications in safety-critical domains where model robustness is essential.
Provable Joint Decontamination for Benchmarking Multiple Large Language Models
NLP
Large Language Models
Theory
- Introduces a joint selection framework for benchmark decontamination across multiple LLMs.
- Proposes Joint Envelope Conformal Selection (JECS) to control global contamination rates.
- Establishes theoretical guarantees for GCR control under specified conditions.
- Demonstrates superior performance of JECS in maintaining GCR while improving power over existing methods.
Summary
This paper addresses the challenge of benchmark data contamination in the evaluation of large language models (LLMs), which can inflate reported performance and hinder reliable cross-model comparisons. The authors propose a novel method called Joint Envelope Conformal Selection (JECS) to jointly decontaminate benchmarks across multiple models. JECS formalizes the problem as a joint selection task, aiming to select a benchmark that is free from overlap with the training data of any audited model. The method computes per-model conformal p-values, aggregates them using the maximum, and constructs a conservative envelope for the null distribution. By applying the adaptive Benjamini-Hochberg procedure, JECS ensures control over the global contamination rate (GCR). The authors validate their approach through extensive experiments, demonstrating that JECS not only maintains the target GCR but also improves statistical power compared to existing methods. Overall, this work provides a robust framework for fair benchmarking of LLMs, addressing a critical issue in the field.
Methodology
The methodology involves formulating the benchmark decontamination problem as a joint selection issue, where a sample is considered jointly pure if it is absent from the training data of all audited models. JECS computes per-model conformal p-values, aggregates them using the maximum, and reconstructs a conservative envelope for the null distribution. The adaptive Benjamini-Hochberg procedure is then applied to ensure GCR control.
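The pipeline described above can be sketched in simplified form. The version below makes several assumptions that are not in the summary: each model has calibration scores from samples known to be contaminated, larger scores indicate cleaner samples, and the plain Benjamini-Hochberg procedure is used in place of the adaptive variant. The conservative envelope step is omitted entirely, so this is not JECS itself.

```python
def conformal_p_value(test_score, calibration_scores):
    """Conformal p-value: small when the test score exceeds most calibration scores."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least) / (n + 1)

def joint_p_values(test_scores_by_model, calibration_by_model):
    """Aggregate per-model p-values for each sample with the maximum."""
    num_samples = len(test_scores_by_model[0])
    joint = []
    for i in range(num_samples):
        per_model = [
            conformal_p_value(scores[i], calib)
            for scores, calib in zip(test_scores_by_model, calibration_by_model)
        ]
        joint.append(max(per_model))
    return joint

def benjamini_hochberg(p_values, level):
    """Indices selected by the (non-adaptive) Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= level * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Toy data: two audited models, three benchmark samples.
calibration = [i / 20 for i in range(1, 20)]  # 19 scores from contaminated samples
test_scores_by_model = [
    [0.97, 0.97, 0.33],  # model A
    [0.99, 0.12, 0.98],  # model B
]
joint = joint_p_values(test_scores_by_model, [calibration, calibration])
selected = benjamini_hochberg(joint, level=0.3)
print(joint, selected)
```

Taking the maximum means a sample is selected as jointly pure only if it looks clean for every audited model, which matches the definition of joint purity given above.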
Results
JECS effectively controls the global contamination rate at user-specified levels, achieving a GCR of 0.038 at α = 0.1, compared to significantly higher rates from union and intersection methods. Additionally, JECS shows improved statistical power, increasing from 0.094 to 0.447 in specific experimental setups.
Implications
The findings suggest that JECS can be a valuable tool for researchers and practitioners in the field of LLM evaluation, enabling fairer comparisons and more reliable assessments of model performance. This could lead to a more accurate understanding of model capabilities and generalization.
BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series
Time Series
- Introduces 'spectral drift' as a new perspective to understand subject-specific variability in biomedical time-series.
- Proposes the Frequency-Band Alignment Module (FBAM) to align spectral structures and mitigate variability.
- Implements Sample Conditional Layer Normalization (SCLN) for stabilizing cross-subject representations.
- Demonstrates a 6% absolute improvement in F1-score over 12 baseline models across multiple datasets.
Summary
The paper addresses the challenge of cross-subject generalization in biomedical time-series (BTS) data, where models trained on data from certain subjects perform poorly on unseen subjects due to subject-specific variability. The authors introduce the concept of 'spectral drift' to characterize this variability: BTS signals share consistent oscillatory structures, but their magnitude and phase vary across subjects. To mitigate this issue, they propose BioFormer, which includes a Frequency-Band Alignment Module (FBAM) that generates modulation factors to align spectral structures across subjects. Additionally, they implement Sample Conditional Layer Normalization (SCLN) to stabilize representations by inferring normalization parameters from intrinsic signal statistics rather than subject identity. The proposed methods were validated through extensive experiments on six datasets, demonstrating significant improvements in classification performance compared to 12 baseline models.
Methodology
The methodology involves analyzing subject-specific variability through spectral structural patterns, specifically using FBAM to perform adaptive frequency-band alignment in the Fourier domain. FBAM extracts band-wise statistics and applies cross-attention to derive modulation coefficients for amplitude and phase adjustments. SCLN is used to adaptively calibrate sample-level feature statistics, enhancing model robustness without relying on subject identity.
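The idea of band-wise alignment in the Fourier domain can be shown with a toy sketch. It computes band-wise mean magnitudes and rescales each band toward a reference. FBAM is described as deriving its modulation coefficients with cross-attention and as adjusting phase as well; here a fixed reference ratio on magnitudes replaces both, so this is an assumption-laden simplification with hypothetical names.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitudes of the one-sided discrete Fourier transform (bins 0..N/2)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        mags.append(abs(coeff))
    return mags

def band_means(magnitudes, bands):
    """Mean magnitude inside each half-open bin range [lo, hi)."""
    return [sum(magnitudes[lo:hi]) / (hi - lo) for lo, hi in bands]

def align_to_reference(magnitudes, bands, reference_means, eps=1e-12):
    """Rescale each band so its mean magnitude matches a reference value."""
    aligned = list(magnitudes)
    current_means = band_means(magnitudes, bands)
    for (lo, hi), current, target in zip(bands, current_means, reference_means):
        factor = target / current if current > eps else 1.0
        for k in range(lo, hi):
            aligned[k] = magnitudes[k] * factor
    return aligned

# Two toy "subjects" with the same oscillation but different amplitude.
n = 16
subject_a = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
subject_b = [3.0 * x for x in subject_a]
bands = [(0, 2), (2, 4), (4, 9)]

mags_a = dft_magnitudes(subject_a)
mags_b = dft_magnitudes(subject_b)
aligned_b = align_to_reference(mags_b, bands, band_means(mags_a, bands))
print(band_means(mags_a, bands), band_means(aligned_b, bands))
```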
Results
The experiments conducted on six biomedical time-series datasets showed that BioFormer outperformed 12 baseline models, achieving an absolute F1-score improvement of 6%. This indicates a significant enhancement in the model's ability to generalize across subjects while maintaining task-relevant features.
Implications
The findings suggest that explicitly modeling subject-specific variability can lead to more robust and generalizable models in biomedical applications. This approach could improve the performance of classification tasks in various clinical settings, such as disease screening and monitoring.
What are the Right Symmetries for Formal Theorem Proving?
Theory
Large Language Models
- Introduction of rewriting categories as a framework for modeling theorem statement transformations.
- Formalization of proof equivariance and success invariance as essential symmetry properties for theorem provers.
- Empirical demonstration of LLM-based provers' failure to maintain success invariance across equivalent formulations.
- Proposed test-time aggregation method improves robustness and performance of theorem proving.
Summary
The paper addresses the sensitivity of formal theorem provers, particularly those based on large language models (LLMs), to variations in problem representation. It identifies a critical issue where semantically equivalent statements yield significantly different proof success rates, indicating a failure to respect mathematical symmetries. To tackle this, the authors introduce a category-theoretic framework called rewriting categories, which formalizes two key symmetry notions: proof equivariance and success invariance. They demonstrate that while state-based next-tactic provers naturally satisfy proof equivariance, LLM-based provers do not, leading to performance inconsistencies. To improve robustness, the authors propose a model-agnostic test-time method that aggregates equivalent rewritings of inputs, showing both theoretical and empirical evidence that this approach enhances success invariance and overall performance under fixed inference budgets. The findings suggest that incorporating symmetry as an inductive bias could significantly benefit LLM-based theorem proving.
Methodology
The authors develop a category-theoretic framework to model transformations between theorem statements, formalizing the concepts of proof equivariance and success invariance. They construct a benchmark of semantically equivalent reformulations to analyze the performance of LLM-based provers. A test-time aggregation method is proposed to enhance robustness and success invariance.
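One simple way to picture test-time aggregation under a fixed budget is to spread proof attempts round-robin over equivalent rewritings and stop at the first success. The sketch below does this with a toy stand-in prover. The scheduling rule, the function names, and the prover are assumptions, and mapping a proof of a rewriting back to the original statement is omitted.

```python
def prove_with_rewritings(rewritings, try_prove, budget):
    """Spend `budget` attempts round-robin over equivalent statements.

    try_prove(statement, attempt) returns a proof (any object) or None.
    Returns (statement, proof) for the first success, or None if the
    budget runs out.
    """
    for attempt in range(budget):
        statement = rewritings[attempt % len(rewritings)]
        proof = try_prove(statement, attempt)
        if proof is not None:
            return statement, proof
    return None

def toy_prover(statement, attempt):
    """Stand-in prover that only succeeds on one surface form (illustration only)."""
    if statement.startswith("forall"):
        return "by commutativity"
    return None

# Three hypothetical surface forms of the same claim.
rewritings = [
    "a + b = b + a",
    "forall a b, a + b = b + a",
    "b + a = a + b",
]

result = prove_with_rewritings(rewritings, toy_prover, budget=4)
single_form = prove_with_rewritings(rewritings[:1], toy_prover, budget=4)
print(result, single_form)
```

The toy prover mimics the reported failure mode: success depends on the surface form, so spending the same budget on one formulation fails while spreading it across rewritings succeeds.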
Results
The study reveals that LLM-based theorem provers exhibit significant variability in proof success rates across equivalent formulations, violating the principle of success invariance. The proposed aggregation method is shown to theoretically recover success invariance in the sampling limit and empirically improve performance and robustness under fixed inference budgets.
Implications
The findings suggest that integrating symmetry as an inductive bias could enhance the effectiveness of LLM-based theorem provers, potentially leading to more reliable and consistent performance in formal theorem proving tasks. This could have broader implications for the application of LLMs in mathematical reasoning and related fields.
TONIC: Token-Centric Semantic Communication for Task-Oriented Wireless Systems
Computer Vision
Multimodal
Generative Models
- TONIC framework aligns communication design with token-level task relevance.
- Introduces unequal error protection based on token utility under fixed channel budgets.
- Utilizes confidence gating to manage unreliable token decisions effectively.
- Combines transmitter-side semantic protection with receiver-side completion models.
Summary
The paper introduces TONIC, a novel token-centric semantic communication framework designed for task-oriented wireless systems. Traditional wireless communication focuses on bit-level fidelity, which does not align with the needs of downstream models that operate on semantic tokens. TONIC addresses this mismatch by converting source samples into sequences of tokens, estimating their task relevance, and applying utility-aware unequal error protection within a fixed channel-use budget. At the receiver, the framework employs token-level confidence to manage unreliable decisions, transforming harmful substitutions into recoverable erasures. A Transformer-based completion model is then used to restore these masked tokens for final task inference. This approach combines semantic-aware protection at the transmitter with confidence-aware gating at the receiver, creating a modular and interpretable architecture. The paper also establishes a utility-aware Bayes-risk interpretation for the receiver-side gating rule, exploring its interaction with unequal protection and completion. Experimental results demonstrate that TONIC outperforms existing methods, including separation-based schemes and pixel-domain baselines, across various channel conditions.
Methodology
The methodology involves converting input images into semantic tokens at the transmitter, estimating their relevance for the task, and applying unequal error protection. The receiver uses token-level confidence to gate decisions, transforming harmful substitutions into recoverable erasures, which are then restored by a Transformer-based model for final inference.
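Two of the steps above lend themselves to a short sketch: splitting a fixed channel-use budget across tokens according to their utility, and masking low-confidence decoded tokens so that they become erasures instead of substitutions. The proportional allocation rule, the threshold, and all names are assumptions for illustration. The channel itself and the Transformer-based completion model are not modeled.

```python
MASK = None  # placeholder for an erased token (hypothetical convention)

def allocate_channel_uses(utilities, budget):
    """Split an integer budget across tokens in proportion to utility.

    Uses largest-remainder rounding so the allocations sum to `budget`.
    """
    total = sum(utilities)
    shares = [budget * u / total for u in utilities]
    uses = [int(s) for s in shares]
    leftover = budget - sum(uses)
    by_remainder = sorted(range(len(shares)), key=lambda i: shares[i] - uses[i], reverse=True)
    for i in by_remainder[:leftover]:
        uses[i] += 1
    return uses

def gate_tokens(decoded_tokens, confidences, threshold):
    """Replace decoded tokens whose confidence is below `threshold` with MASK."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(decoded_tokens, confidences)]

# Toy example: three tokens with task-relevance scores 5, 3, 2 and 12 channel uses.
uses = allocate_channel_uses([5, 3, 2], budget=12)
gated = gate_tokens([17, 4, 99], [0.95, 0.40, 0.80], threshold=0.6)
print(uses, gated)
```

A masked position would then be passed to the completion model, which is the step that restores erased tokens before final task inference.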
Results
The experimental results indicate that TONIC consistently outperforms separation-based schemes and other baselines in image classification tasks across various channel conditions, including AWGN, Rayleigh, and Rician channels.
Implications
The TONIC framework has significant implications for enhancing the efficiency and effectiveness of wireless communication systems, particularly in applications requiring task-oriented processing, such as image transmission and multimodal systems.
ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning
Graph Learning
- CONTACT separates contact identification and sequence prediction into distinct stages, improving learning efficiency.
- The architecture includes a contact-gated injection mechanism that selectively routes antigen information to relevant CDR positions.
- CONTACT achieves superior performance metrics compared to existing CDR design methods, including a 7% improvement in RMSD and a 10% increase in F1 score.
- The methodology addresses the architectural limitations of current models by focusing on the sparsity of CDR-antigen interactions.
Summary
The paper introduces CONTACT, a novel architecture for antibody complementarity-determining region (CDR) design that explicitly separates the tasks of identifying contact positions with antigens and selecting appropriate amino acids for those positions. Traditional methods conflate these two tasks, leading to inefficiencies in learning and performance. CONTACT operates in three stages: it first learns surface complementarity fingerprints, then predicts which CDR positions will contact the antigen, and finally injects contact-gated antigen features into the sequence prediction process. This architecture employs a distance-biased cross-attention module to favor spatial neighbors and a contact-weighted cross-entropy loss to focus learning on critical binding positions. The authors demonstrate that CONTACT outperforms existing methods on the CHIMERA-BENCH dataset, achieving significant improvements in structural quality, epitope awareness, and sequence recovery metrics.
Methodology
The CONTACT architecture is structured into three explicit stages: (1) learning surface complementarity fingerprints, (2) predicting CDR-antigen contacts, and (3) injecting contact-gated antigen features into the CDR representation. A distance-biased cross-attention module is utilized to encode geometric priors, while a contact-weighted cross-entropy loss is applied to prioritize learning at binding-critical positions.
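A contact-weighted cross-entropy can be sketched as an ordinary per-position cross-entropy with larger weights on positions predicted to touch the antigen. The weighting form 1 + λ·c used below, the value of λ, and the example probabilities are assumptions; the summary does not state the exact form used by the authors.

```python
import math

def contact_weighted_cross_entropy(probs_of_true, contact_probs, contact_weight=2.0):
    """Weighted mean of per-position negative log-likelihoods.

    probs_of_true[i]: model probability of the correct amino acid at position i.
    contact_probs[i]: predicted probability that position i contacts the antigen.
    Positions with higher contact probability get a larger weight.
    """
    weights = [1.0 + contact_weight * c for c in contact_probs]
    losses = [-math.log(p) for p in probs_of_true]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Two CDR positions: an easy non-contact position and a hard contact position.
probs_of_true = [0.9, 0.5]
plain = contact_weighted_cross_entropy(probs_of_true, [0.0, 0.0])
weighted = contact_weighted_cross_entropy(probs_of_true, [0.0, 1.0])
print(plain, weighted)
```

With all contact probabilities at zero the loss reduces to the plain mean, so the weighting only changes training where contacts are predicted.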
Results
CONTACT achieved the best structural quality with an RMSD of 1.63 Å, a 7% improvement over the next-best baseline. It also demonstrated the highest epitope awareness with an F1 score of 0.79, a 10% increase over GNN baselines, and competitive sequence recovery metrics (AAR of 0.38) among eleven CDR-H3 design baselines.
Implications
The findings suggest that explicitly separating contact prediction from sequence design can lead to more effective antibody design methodologies. This could have significant implications for therapeutic antibody development and vaccine design, where precise binding interactions are crucial.
Position: The Time for Sampling Is Now! Charting a New Course for Bayesian Deep Learning
Theory
Optimization
Efficient ML
- Sampling-based inference (SAI) is as computationally efficient as optimization-based methods for Bayesian neural networks.
- SAI can improve prediction performance and provide better uncertainty quantification.
- Addressing misconceptions about SAI is crucial for its broader acceptance in the community.
- Research should focus on effective exploration of the posterior landscape and management of posterior samples.
Summary
This position paper advocates for the increased adoption of sampling-based inference (SAI) in Bayesian neural networks (BNNs), arguing that SAI has reached a level of computational efficiency comparable to optimization-based methods. The authors highlight that misconceptions about SAI's feasibility and efficiency have hindered its practical use. They assert that SAI not only provides principled uncertainty quantification but also enhances prediction performance through model averaging. To realize the full potential of SAI, the authors propose a shift in research focus towards overcoming existing misconceptions, improving posterior landscape exploration, and developing effective strategies for posterior sample management. The paper emphasizes the need for coordinated efforts to build robust end-to-end workflows that facilitate the practical application of SAI in BNNs, ultimately positioning SAI as a central tool in Bayesian deep learning.
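The model averaging mentioned above amounts to averaging the predictive distributions produced by individual posterior samples. A minimal sketch for one input of a classifier follows. The class probabilities are made up, and how the posterior samples are obtained (the sampler itself) is outside the scope of the example.

```python
def model_average(prob_samples):
    """Average class probabilities over posterior samples for one input."""
    n = len(prob_samples)
    num_classes = len(prob_samples[0])
    return [sum(sample[c] for sample in prob_samples) / n for c in range(num_classes)]

# Class probabilities predicted by three posterior samples (illustrative values).
prob_samples = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.7],
]

averaged = model_average(prob_samples)
print(averaged)
```

The averaged prediction is less confident than the most confident single sample, which is one way disagreement between posterior samples shows up as uncertainty.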
Methodology
The authors review current methodologies in Bayesian deep learning, contrasting sampling-based inference with optimization-based approaches. They discuss recent advancements in SAI that demonstrate its scalability and effectiveness for larger problems, supported by new software frameworks that enhance accessibility and performance.
Results
The paper presents a conceptual overview of the current state of SAI, highlighting its algorithmic successes and the barriers to mainstream adoption. It outlines the potential for SAI to outperform traditional optimization methods in both efficiency and predictive accuracy, while also providing insights into the posterior landscape of BNNs.
Implications
The findings suggest that embracing SAI could lead to significant advancements in Bayesian deep learning, enabling more reliable uncertainty quantification in various applications, including scientific research and industrial systems. By addressing misconceptions and improving methodologies, SAI could become a standard practice in the field.
Winner-Take-All bottlenecks enforce disentangled symbolic representations in multi-task learning
Theory
- WTA bottlenecks can enforce the extraction of categorical latent factors in multi-task learning.
- The representation at the WTA bottleneck is a structured permutation of the original latent factors.
- Symbolic representations allow individual neurons to encode distinct abstract features.
- Empirical validation shows generalization benefits from the symbolic representations.
Summary
This paper investigates the role of Winner-Take-All (WTA) networks in enforcing disentangled symbolic representations within multi-task learning frameworks. The authors demonstrate that a WTA bottleneck in a deep neural network can extract categorical latent factors from highly entangled data under specific conditions. They prove that when the network is trained on multiple linear classification tasks, the representation formed at the WTA bottleneck becomes a structured permutation of the original latent factors. This leads to highly symbolic representations where individual neurons or neuron populations encode distinct abstract features such as objects or colors. Empirical results on two datasets support the theoretical findings, showing that even architectures deviating from the assumptions can yield similar symbolic representations. The study highlights the advantages of these representations for generalization, suggesting a potential bridge between symbolic and subsymbolic AI systems.
Methodology
The authors utilized theoretical proofs to establish the conditions under which WTA bottlenecks can extract categorical latent factors. They conducted empirical experiments on two datasets to validate their theoretical claims, demonstrating the emergence of symbolic representations even in architectures that do not fully comply with the theoretical assumptions.
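As a rough illustration of the kind of bottleneck described above, the sketch below implements a hard winner-take-all layer in plain Python: within each group of units, only the largest activation survives, so each group emits a one-hot code. The function name `wta_bottleneck`, the grouping scheme, and the example values are assumptions made for illustration; the summary does not specify the paper's exact architecture or training setup.

```python
from typing import List


def wta_bottleneck(activations: List[float], group_size: int) -> List[float]:
    """Hard winner-take-all: one-hot code per group of `group_size` units.

    Hypothetical helper; the grouping is an illustrative assumption.
    """
    if group_size <= 0 or len(activations) % group_size != 0:
        raise ValueError("group_size must be positive and divide len(activations)")
    out: List[float] = []
    for start in range(0, len(activations), group_size):
        group = activations[start:start + group_size]
        # Index of the largest activation; the first maximum wins ties.
        winner = max(range(group_size), key=lambda i: group[i])
        out.extend(1.0 if i == winner else 0.0 for i in range(group_size))
    return out


# Two groups of three units: each group emits a one-hot "symbol".
code = wta_bottleneck([0.2, 1.5, -0.3, 0.9, 0.1, 0.4], group_size=3)
```

Each one-hot group can be read as a categorical variable, which is the sense in which such a layer yields symbolic codes.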
Results
The study found that a WTA bottleneck leads to the emergence of highly symbolic representations that are disentangled and facilitate generalization across various tasks. The empirical results confirmed that these representations could be achieved even when the network architecture deviated from the ideal conditions outlined in the theoretical framework.
Implications
The findings have significant implications for the development of neurosymbolic AI systems, as they provide insights into how deep learning models can leverage WTA-like components to create symbolic representations that enhance generalization and interpretability in machine learning tasks.
When to Switch, Not Just What: Transition Quality Prediction in Clash Royale
Reinforcement Learning
- Frequent strategy switching is inversely related to win rates in Clash Royale.
- Existing recommendation systems often ignore the behavioral costs of switching strategies.
- The Transition Quality Predictor (TQP) reformulates strategy recommendations as a transition-level decision problem.
- The TQP pipeline includes components for identifying suitable players and timing for strategy switches.
Summary
This paper investigates the phenomenon of strategy switching in competitive online games, specifically Clash Royale, where players often change strategies after losing streaks. Analyzing 926,334 match records from 34,619 players, the authors found a counterintuitive trend: higher switching frequency correlates with lower win rates. This suggests that players may not always benefit from switching strategies, as the costs associated with transitioning are often overlooked in existing recommendation systems. The authors introduce the Transition Quality Predictor (TQP), a three-stage pipeline designed to improve strategy recommendations by considering who should switch, when they should switch, and what strategies to adopt. The pipeline includes PersonaGate, which suppresses recommendations for players with consistent performance, TimingGate, which identifies optimal switching moments, and ScoreFusion, which ranks strategies based on predicted transition quality. The study also introduces the SwitchGap metric to evaluate the effectiveness of recommendations without assuming observed choices are optimal. The TQP pipeline demonstrates a significant improvement in recommendation quality, particularly benefiting players who frequently switch strategies despite their lower performance.
Methodology
The authors developed the Transition Quality Predictor (TQP) as a three-stage pipeline consisting of PersonaGate, TimingGate, and ScoreFusion. PersonaGate filters recommendations based on player consistency, TimingGate identifies optimal switching moments, and ScoreFusion ranks strategies by combining adoptability signals with predicted transition quality. The methodology also includes the introduction of the SwitchGap metric for evaluating recommendation quality.
Results
The TQP pipeline achieved a SwitchGap improvement of +10.4 percentage points at a recommendation rate of 5.4%. It was particularly effective for loss-triggered switchers, who, despite being the lowest-performing group, benefited significantly from subtype-conditioned guidance.
Implications
This research has implications for designing more effective strategy recommendation systems in competitive gaming and other domains where decision-making under uncertainty is crucial. By considering individual player behaviors and the timing of recommendations, systems can enhance player performance and satisfaction.
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
Reinforcement Learning
Large Language Models
Optimization
- Identified hard clipping as a key source of instability in RLVR.
- Proposed Near-boundary Stochastic Rescue (NSR) to recover near-boundary signals.
- NSR improves training stability and convergence over traditional methods.
- Demonstrated effectiveness across various model sizes and architectures.
Summary
This paper addresses the challenges of Reinforcement Learning with Verifiable Rewards (RLVR), particularly focusing on the instability and suboptimal convergence associated with clipping-based GRPO-style objectives. The authors identify hard clipping as a significant bottleneck that discards potentially informative signals located just beyond the clipping threshold. To mitigate this issue, they propose a novel approach called Near-boundary Stochastic Rescue (NSR), which stochastically retains tokens that fall slightly outside the clipping boundary. This method transforms the rigid binary clipping decision into a probabilistic admission process, allowing for the recovery of valuable learning signals. The authors demonstrate that NSR not only improves training stability but also consistently outperforms strong baselines such as DAPO and GSPO across various model sizes (7B to 30B parameters) and architectures (dense and MoE). The findings suggest that addressing the clipping decision can lead to significant enhancements in RLVR performance, making NSR a practical and effective plug-and-play solution for stabilizing training in RLVR setups.
Methodology
The authors conducted a systematic analysis of GRPO-style clipping objectives to diagnose the issues caused by hard clipping. They introduced NSR, which employs stochastic sampling to retain tokens near the clipping boundary, thus transforming the binary clipping decision into a probabilistic process. Extensive experiments validated the effectiveness of NSR, comparing it against established baselines.
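The idea of replacing a hard keep/drop decision with probabilistic admission can be sketched as follows. The summary does not give NSR's actual admission rule, so the linear decay over a band of width `margin`, the default values of `eps` and `margin`, and the function names are all assumptions. The sketch also ignores the sign of the advantage, which real GRPO-style clipping takes into account.

```python
import random


def admit_probability(ratio: float, eps: float = 0.2, margin: float = 0.05) -> float:
    """Probability of keeping a token's gradient signal.

    Assumed rule (not from the paper): 1 inside [1 - eps, 1 + eps], decaying
    linearly to 0 over a band of width `margin` outside the boundary.
    """
    # Distance by which the importance ratio exceeds the clipping range.
    overshoot = max(ratio - (1.0 + eps), (1.0 - eps) - ratio, 0.0)
    if overshoot >= margin:
        return 0.0
    return 1.0 - overshoot / margin


def keep_token(ratio: float, rng: random.Random,
               eps: float = 0.2, margin: float = 0.05) -> bool:
    """Bernoulli admission replacing the hard keep/drop decision."""
    return rng.random() < admit_probability(ratio, eps, margin)
```

With these assumed defaults, a token at ratio 1.225 sits halfway through the band and is kept about half the time, whereas hard clipping would always discard it.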
Results
The implementation of NSR led to substantial improvements in training stability and performance, consistently outperforming strong baselines like DAPO and GSPO. The experiments demonstrated that even minor stochastic perturbations at the clipping boundary could yield significant performance gains, confirming the importance of the clipping decision in RLVR setups.
Implications
The findings suggest that refining the clipping mechanism in RLVR can lead to more robust training processes, potentially enhancing the reasoning capabilities of large language models. NSR could be widely applicable in various RLVR scenarios, improving model performance and stability.
Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks
Reinforcement Learning
Theory
Efficient ML
- Analytical solution to the Mountain Car problem reveals a simple optimal control strategy.
- Chebyshev policies significantly reduce the number of parameters and improve sample efficiency.
- Chebyshev policies outperform traditional neural network approaches in various RL tasks.
- The study highlights a substantial gap in performance between current RL agents and the optimal solution.
Summary
This paper presents an analytical solution to the Mountain Car problem, a well-known reinforcement learning (RL) benchmark for which no analytical solution had been found in 36 years. The derived optimal control solution reveals a significant performance gap between modern RL agents and the optimum. The authors introduce Chebyshev policies, a new class of universal approximators based on Chebyshev polynomials, which can be used as alternatives to neural networks. These policies require 277 times fewer parameters and reduce regret by a factor of 4.18 compared to state-of-the-art RL methods. The authors evaluate Chebyshev policies across various RL tasks, including the Pendulum environment and a real-world helicopter-like testbed, where they consistently outperform traditional multilayer perceptron (MLP) based policies. The findings suggest that Chebyshev policies could serve as a lightweight and effective alternative for low-dimensional control tasks, addressing key challenges in RL such as sample efficiency and interpretability.
Methodology
The authors analytically derive the optimal control solution for the Mountain Car problem and introduce Chebyshev policies as a new class of stochastic policies. They evaluate these policies using Proximal Policy Optimization (PPO), Augmented Random Search (ARS), and REINFORCE algorithms across multiple low-dimensional control tasks.
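A minimal sketch of what a polynomial policy of this kind might look like: the mean action is a weighted sum of tensor-product Chebyshev features of a two-dimensional state rescaled to [-1, 1]. The function names, the tensor-product construction, and the weight layout are assumptions, not the paper's parameterization. A stochastic policy would additionally sample around this mean, for example with Gaussian noise.

```python
from typing import List, Sequence


def chebyshev_basis(x: float, degree: int) -> List[float]:
    """T_0(x)..T_degree(x) via T_{n+1} = 2x T_n - T_{n-1}, for x in [-1, 1]."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def rescale(value: float, low: float, high: float) -> float:
    """Map [low, high] linearly onto [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


def policy_mean(state: Sequence[float], bounds: Sequence[Sequence[float]],
                weights: List[List[float]]) -> float:
    """Mean action: sum_ij w[i][j] * T_i(s0) * T_j(s1) for a 2-D state."""
    degree = len(weights) - 1
    t0 = chebyshev_basis(rescale(state[0], *bounds[0]), degree)
    t1 = chebyshev_basis(rescale(state[1], *bounds[1]), degree)
    return sum(weights[i][j] * t0[i] * t1[j]
               for i in range(degree + 1) for j in range(degree + 1))


# Example bounds: position and velocity limits of the Gymnasium Mountain Car task.
bounds = [(-1.2, 0.6), (-0.07, 0.07)]
# Degree-1 weights that make the mean action equal to the rescaled velocity.
weights = [[0.0, 1.0], [0.0, 0.0]]
action = policy_mean((-0.5, 0.07), bounds, weights)
```

A degree-d policy of this form has (d + 1)^2 weights, which illustrates how a polynomial policy can stay far smaller than an MLP.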
Results
Chebyshev policies reduced the regret by a factor of 4.18 and required 277 times fewer parameters compared to existing state-of-the-art RL agents. They consistently improved performance in various environments, including the Pendulum and a helicopter-like testbed, demonstrating superior sample efficiency and control dynamics.
Implications
The introduction of Chebyshev policies could lead to more efficient and interpretable RL solutions in both simulated and real-world applications, potentially transforming approaches to low-dimensional control tasks and addressing key challenges in the field.
Relational Linear Properties in Language Models: An Empirical Investigation
NLP
Large Language Models
Interpretability
- Introduces a novel probing method based on Kullback-Leibler divergence to evaluate relational linearity in language models.
- Demonstrates that relational linearity varies across different models and layers, with specific relations exhibiting stronger linearity.
- Finds that the phrasing of queries significantly affects the linear probing results, highlighting the complexity of relational representations.
Summary
This paper investigates the concept of relational linearity in language models, which posits that the unembedding of an object can be predicted from the embedding of its subject through a linear transformation. The authors introduce a novel probing method based on Kullback-Leibler divergence to empirically test this hypothesis across different layers of language models. They analyze two models, Llama-3.1 and Gemma-2, and find that relational linearity is present but varies across models and layers. The study reveals that linearity is particularly strong for certain relations like tense and truthfulness, while it is weaker for others like language and subjectivity. Additionally, the phrasing of queries impacts the observed linearity, indicating that the representation of relational properties is complex and model-dependent. This work fills a gap in the experimental evaluation of relational linearity and provides insights into how language models encode knowledge about entities and their relationships.
Methodology
The authors developed a probing procedure that evaluates relational linear properties by comparing model embeddings from contexts with and without relational queries. This method utilizes Kullback-Leibler divergence to assess the linearity of representations efficiently, avoiding the crude Jacobian approximations used in previous studies.
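The central quantity in such a probe is the Kullback-Leibler divergence between two next-token distributions. The sketch below shows only that computation. How the paper builds the two distributions (for instance, one from the model and one from a linear map applied to a subject embedding) is not detailed in the summary, so the helper names and example numbers are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) in nats; terms with p_i = 0 contribute 0.

    Assumes q_i > 0 wherever p_i > 0, which holds for softmax outputs
    unless a probability underflows to zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits over a three-token vocabulary.
model_dist = softmax([2.0, 0.5, -1.0])
probe_dist = softmax([1.8, 0.7, -1.2])
gap = kl_divergence(model_dist, probe_dist)
```

A small `gap` means the probe's prediction closely reproduces the model's distribution, which is the sense in which a relation would count as linearly recoverable.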
Results
The experiments revealed that relational linearity is present in language models, particularly for tense and truthfulness, while showing weaker effects for language and subjectivity. The analysis indicated that linearity varies across layers, with surface features emerging in early layers and more abstract properties in middle layers. Additionally, the way queries are phrased significantly influences the results of linear probing.
Implications
The findings suggest that understanding relational linearity can enhance the interpretability of language models and improve methods for controlling model behavior through activation steering. This research could inform future developments in language model architectures and their applications in tasks requiring relational reasoning.
The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning
Theory
- The Neural Compiler translates symbolic physics expressions into exact, differentiable PyTorch modules.
- Compiled modules achieve zero approximation error in their safe domain, unlike traditional neural networks.
- The system supports 51 primitive operations, enabling complex physics computations and PDE discretizations.
- Experimental results show significant improvements in parameter recovery and extrapolation compared to PINNs and other baselines.
Summary
This paper introduces The Neural Compiler, a novel system designed to translate programs written in a first-order expression language with Scheme syntax into frozen, differentiable PyTorch modules. The primary motivation behind this work is to address the challenges faced in scientific machine learning, where known physics must be integrated with unknown components. Traditional methods either discard known structures or require extensive manual coding, leading to inefficiencies and potential inaccuracies. The Neural Compiler compiles symbolic expressions into modules that compute exactly and provide exact gradients, ensuring zero approximation error within a safe domain. The paper evaluates the compiler across six experiments, demonstrating its effectiveness in recovering physical constants and maintaining composability without accumulating errors, unlike traditional neural network approaches. The results indicate that the compiled models outperform physics-informed neural networks (PINNs) and other baseline methods in terms of accuracy and efficiency, highlighting the potential of the compiler for systematic composability in scientific modeling.
Methodology
The Neural Compiler utilizes a first-order expression language with Scheme syntax to compile symbolic physics programs into differentiable PyTorch modules. It guarantees correctness in computation and gradients, ensuring that the compiled modules produce identical outputs to hand-coded implementations. The methodology includes formal proofs of compilation correctness and error bounds for compositionality.
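To make the program-to-function idea concrete, here is a toy analogue in plain Python: a Scheme-style expression is parsed and compiled into a closure that evaluates it exactly. This is not the paper's compiler. It targets Python callables rather than PyTorch modules, provides no gradients, and its six-entry `PRIMITIVES` table is a made-up subset of operations.

```python
import math
import operator
from typing import Callable, Dict, List, Union

Expr = Union[str, float, list]

# Illustrative primitive table (assumption), far smaller than a real system's.
PRIMITIVES: Dict[str, Callable[..., float]] = {
    "+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv,
    "sin": math.sin, "exp": math.exp,
}


def parse(source: str) -> Expr:
    """Parse one S-expression into nested lists, floats, and symbol names."""
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos: int):
        token = tokens[pos]
        if token == "(":
            items: List[Expr] = []
            pos += 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        try:
            return float(token), pos + 1
        except ValueError:
            return token, pos + 1  # not a number, so treat it as a symbol

    expr, _ = read(0)
    return expr


def compile_expr(expr: Expr) -> Callable[[Dict[str, float]], float]:
    """Turn a parsed expression into a closure over a variable environment."""
    if isinstance(expr, float):
        return lambda env: expr
    if isinstance(expr, str):
        return lambda env: env[expr]
    op = PRIMITIVES[expr[0]]
    args = [compile_expr(a) for a in expr[1:]]
    return lambda env: op(*(arg(env) for arg in args))


# Kinetic energy 0.5 * m * v^2 written in prefix form.
kinetic_energy = compile_expr(parse("(* 0.5 (* m (* v v)))"))
energy = kinetic_energy({"m": 2.0, "v": 3.0})
```

Because the closure applies the primitives directly, it introduces no approximation beyond floating-point arithmetic, which mirrors in miniature the exactness claim made for compiled modules.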
Results
The evaluation of The Neural Compiler across six experiments revealed that compiled models produced numerically identical results to hand-coded PyTorch implementations. Compiled models recovered physical constants with less than 1% error using only 1-4 trainable parameters, while PINNs exhibited errors ranging from 7% to 93%. Additionally, the compiled modules maintained zero error at arbitrary depths, contrasting with neural approximations that accumulated significant errors.
Implications
The Neural Compiler has the potential to revolutionize scientific machine learning by providing a robust framework for integrating known physics with learnable components. Its ability to generate correct, differentiable modules from symbolic specifications could streamline the modeling process and enhance the interpretability of scientific models. Furthermore, the integration of large language models could facilitate the translation of natural language physics descriptions into compilable programs, paving the way for automated scientific modeling.
The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning
Theory
Optimization
Computer Vision
- Introduces the Matching Principle, unifying various robustness challenges in machine learning.
- Establishes that existing methods are different estimators of the same statistical object, Σ_task.
- Presents theoretical results proving the necessity of covering the range of Σ_task for effective regularization.
- Introduces the Trajectory Deviation Index (tdi) as a new metric for assessing embedding sensitivity.
Summary
This paper introduces the Matching Principle, a geometric framework for understanding loss functions in the context of nuisance-robust representation learning. It posits that various challenges in machine learning, such as robustness, domain adaptation, and invariance, can be unified under the concept of estimating the covariance of label-preserving deployment nuisances, denoted as Σ_task. The author argues that existing methods like CORAL, adversarial training, and others are not independent techniques but rather different estimators of the same underlying statistical object. The paper provides theoretical results, including closed-form optimality in a linear-Gaussian model and necessary conditions for effective regularization of the encoder Jacobian. A new metric, the Trajectory Deviation Index (tdi), is introduced to assess embedding sensitivity. Empirical evaluations across various tasks demonstrate the effectiveness of the proposed framework, with most tests confirming the predicted outcomes regarding deployment drift. The paper emphasizes the importance of matching the regularization matrix Σ′ to cover the range of Σ_task to prevent drift in representations during deployment.
Methodology
The paper employs a theoretical approach to derive the Matching Principle, supported by closed-form proofs in a linear-Gaussian model. It introduces the Trajectory Deviation Index (tdi) for empirical evaluation and conducts experiments across thirteen pre-registered blocks to test the framework's predictions regarding deployment drift.
Results
The empirical evaluations showed that the matched approach outperformed isotropic and wrong-direction methods in most cases, confirming the theoretical predictions of the Matching Principle. The only exception was in the Office-31 dataset, which was identified as a failure mode in advance. Overall, the results support the framework's claims regarding the necessity of matching the regularization matrix to the covariance of deployment nuisances.
Implications
The Matching Principle could lead to more robust machine learning models by providing a unified framework for addressing various nuisances in representation learning. It encourages the design of loss functions that explicitly account for deployment variations, potentially improving performance in real-world applications.
Correcting Class Imbalance in Prior-Data Fitted Networks for Tabular Classification
Theory
Efficient ML
Optimization
- PFNs excel in tabular classification but suffer from class imbalance issues.
- Thresholding is identified as the most effective method for improving minority class performance.
- Downsampling provides a balance between performance and computational efficiency.
- Classical imbalance correction techniques can be adapted for PFNs despite their unique learning dynamics.
Summary
This paper addresses the challenge of class imbalance in Prior-Data Fitted Networks (PFNs), which have shown remarkable performance in tabular classification tasks. PFNs utilize in-context learning (ICL) and are pretrained on large synthetic datasets, allowing them to predict outcomes without weight updates. However, like other classifiers, PFNs struggle with class imbalance, particularly affecting the performance of minority classes. The authors explore various techniques to mitigate this issue, including thresholding, downsampling, oversampling, and synthetic upsampling. Their empirical analysis reveals that thresholding significantly enhances minority class performance while maintaining majority class accuracy, and downsampling also yields competitive results with reduced computational costs. The findings suggest that adapting classical imbalance correction methods to the unique characteristics of PFNs can lead to improved classification outcomes, especially for rare classes.
Methodology
The authors adapted classical class imbalance correction techniques, including thresholding, downsampling, oversampling, and synthetic upsampling, to the context of PFNs. They conducted empirical evaluations to assess the performance of these methods on binary classification tasks, focusing on their impact on minority class performance and overall classification accuracy.
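Thresholding, the first of the techniques listed above, can be sketched as choosing the decision threshold on the predicted minority-class probability using held-out data. The selection criterion used here (balanced accuracy), the candidate grid, and the function names are assumptions; the summary does not say how the paper picks its thresholds.

```python
from typing import List, Sequence


def balanced_accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Mean of per-class recall for binary labels in {0, 1}."""
    recalls: List[float] = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        if idx:
            recalls.append(sum(preds[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def pick_threshold(probs: Sequence[float], labels: Sequence[int],
                   candidates: Sequence[float]) -> float:
    """Return the candidate with the best balanced accuracy (first wins ties)."""
    def score(t: float) -> float:
        return balanced_accuracy(labels, [int(p >= t) for p in probs])
    return max(candidates, key=score)


# Hypothetical validation set: the minority class (label 1) gets low probabilities.
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35]
labels = [0, 0, 0, 0, 1, 1]
best = pick_threshold(probs, labels, candidates=[0.5, 0.25])
```

With the default cut-off of 0.5 every example here is assigned to the majority class, while the lower threshold recovers both minority examples without any change to the model itself.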
Results
The results indicated that thresholding significantly improved the performance of minority classes with minimal impact on majority class accuracy. Downsampling also performed well, achieving the highest worst-class accuracy while reducing computational costs. These findings demonstrate that specific adaptations of traditional methods can effectively address class imbalance in PFNs.
Implications
The study's findings have implications for various applications where class imbalance is prevalent, such as rare disease detection and cybersecurity. By improving the performance of PFNs on minority classes, the proposed methods can enhance the reliability of predictions in critical domains.
From Snapshots to Trajectories: Learning Single-Cell Gene Expression Dynamics via Conditional Flow Matching
Generative Models
Time Series
- scFM addresses the challenges of unpaired snapshots in scRNA-seq by integrating optimal transport and flow matching.
- The framework improves temporal coherence and reduces distribution drift in long-horizon predictions.
- Experimental results show enhanced performance in trajectory reconstruction and gene expression dynamics recovery.
Summary
This paper introduces Single-Cell Flow Matching (scFM), a novel framework designed to infer continuous gene expression dynamics from sparse single-cell RNA sequencing (scRNA-seq) snapshots. Traditional scRNA-seq methods yield unpaired snapshots at discrete time points, which complicates the modeling of temporal dynamics due to the absence of intermediate states. The authors identify two primary challenges: the ambiguity of local transitions between snapshots and the compounding errors in long-horizon predictions. To tackle these issues, scFM employs a coupling-conditioned flow matching approach that integrates optimal transport (OT) for aligning adjacent snapshots and learning time-dependent velocity fields. The framework computes entropically regularized OT couplings to create soft flow-matching targets, learns bidirectional velocity fields for improved temporal coherence, and incorporates distribution-level alignment to anchor long-term predictions. Experimental results demonstrate that scFM significantly enhances distributional prediction performance for both temporal interpolation and extrapolation, yielding more accurate trajectory reconstructions and coherent visualizations of gene expression dynamics, even in the absence of intermediate time points.
Methodology
The scFM framework utilizes entropically regularized optimal transport to compute couplings between adjacent snapshots, which are then used to create soft flow-matching targets for learning velocity fields. It learns bidirectional velocity fields to refine couplings and improve temporal coherence, while also employing distribution-level alignment and latent dynamic regularization to mitigate drift during long-term predictions.
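The first step described above, an entropically regularized OT coupling between two snapshots, is commonly computed with Sinkhorn iterations. The sketch below is a generic pure-Python version with uniform marginals. The cost matrix, the regularization strength `reg`, and the iteration count are illustrative assumptions, and nothing here reproduces scFM's specific pipeline.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], reg: float = 0.5,
             iters: int = 200) -> List[List[float]]:
    """Entropic OT coupling between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a, b = 1.0 / n, 1.0 / m  # uniform mass per cell in each snapshot
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two cells per snapshot; matching cell i to cell i is cheap, crossing is costly.
coupling = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

Each row of `coupling` gives soft weights over candidate partner cells in the next snapshot, which is the sense in which the targets are soft rather than hard one-to-one matches.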
Results
The experiments conducted on real-world time-series scRNA-seq datasets reveal that scFM consistently outperforms existing methods in terms of distributional prediction accuracy for both interpolation and extrapolation tasks. Additionally, it provides more accurate trajectory reconstructions and temporally coherent visualizations, indicating a better recovery of the underlying temporal gene expression dynamics.
Implications
The proposed scFM framework has significant implications for biological research, particularly in understanding cellular dynamics and transitions in complex biological systems. It can facilitate more accurate modeling of gene expression over time, potentially leading to new insights in developmental biology and disease progression.
Decomposing Ensemble Spread in Lorenz '96 With Learned Stochastic Parameterizations
Time Series
Theory
- The paper rigorously defines and decomposes sources of uncertainty in ensemble forecasting.
- It systematically compares various parameterization strategies, including novel machine learning approaches.
- Stochastic parameterizations with temporally persistent structures significantly improve spread growth and error consistency.
- The study enhances understanding of how different uncertainties interact in chaotic systems.
Summary
This paper addresses the inherent uncertainties in weather and climate forecasting, which arise from chaotic dynamics, imperfect initial conditions, and model deficiencies. The authors utilize the Lorenz 1996 (L96) system as a controlled testbed to systematically decompose the sources of uncertainty affecting ensemble spread. They identify three main contributors: internal variability, initial-condition uncertainty, and model uncertainty. The study compares various ensemble configurations and parameterization strategies, including deterministic, autoregressive, Bayesian, and novel machine learning approaches. The findings reveal that ensemble perturbations do not increase long-term variance but regulate the decorrelation of trajectories. Stochastic parameterizations, especially those with persistent structures, enhance early spread growth and improve the consistency between spread and forecast errors. The work clarifies the interactions between different uncertainty sources and provides guidance for designing effective stochastic parameterizations in operational weather and climate models.
Methodology
The authors employed the two-scale Lorenz 1996 system to evaluate the contributions of internal variability, initial-condition uncertainty, and model uncertainty to ensemble spread. They analyzed multiple ensemble configurations and parameterization strategies, using both stationary and dynamical diagnostics to assess spread growth, forecast skill, and ensemble calibration.
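For orientation, the sketch below integrates the simpler single-scale Lorenz '96 system and computes an ensemble spread diagnostic. It is a simplification: the study uses the two-scale system, and the forcing value, the RK4 integrator, and the spread definition chosen here are assumptions for illustration.

```python
from typing import List


def l96_tendency(x: List[float], forcing: float = 8.0) -> List[float]:
    """Single-scale Lorenz '96: dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F."""
    n = len(x)
    # Negative indices wrap around, giving the cyclic boundary condition.
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
            for i in range(n)]


def rk4_step(x: List[float], dt: float, forcing: float = 8.0) -> List[float]:
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base: List[float], k: List[float], scale: float) -> List[float]:
        return [b + scale * ki for b, ki in zip(base, k)]

    k1 = l96_tendency(x, forcing)
    k2 = l96_tendency(shifted(x, k1, dt / 2), forcing)
    k3 = l96_tendency(shifted(x, k2, dt / 2), forcing)
    k4 = l96_tendency(shifted(x, k3, dt), forcing)
    return [xi + dt / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def ensemble_spread(members: List[List[float]]) -> float:
    """Root-mean-square deviation of members from the ensemble mean."""
    n_mem, n_var = len(members), len(members[0])
    mean = [sum(m[i] for m in members) / n_mem for i in range(n_var)]
    sq = sum((m[i] - mean[i]) ** 2 for m in members for i in range(n_var))
    return (sq / (n_mem * n_var)) ** 0.5
```

Tracking `ensemble_spread` over successive `rk4_step` calls for slightly perturbed initial states gives a spread-growth curve, one of the diagnostics the summary mentions.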
Results
The study found that ensemble perturbations regulate the rate of trajectory decorrelation without increasing long-term variance. Stochastic parameterizations, particularly those with temporally persistent structures, were shown to enhance early spread growth and improve the spread-error consistency, addressing the issue of underdispersion in operational forecasting systems.
Implications
The findings have significant implications for improving ensemble forecasting systems in meteorology and climate science. By clarifying the interactions of various uncertainty sources, the research provides a framework for developing more reliable stochastic parameterizations, potentially leading to better weather and climate predictions.
Disentanglement Beyond Generative Models with Riemannian ICA
Theory
Interpretability
- Introduces Riemannian ICA (RICA) as a local geometric approach to disentanglement.
- Develops the disentanglement tensor to quantify pointwise disentanglement.
- Demonstrates that RICA can recover sources effectively across different manifolds.
- Challenges the reliance on global generative models in traditional disentanglement methods.
Summary
This paper addresses the gap between the theoretical foundations of disentanglement and the practical applications in modern representation learning. Traditional frameworks like Independent Component Analysis (ICA) rely on generative models with independent latent variables, which can be limiting in practice. The author introduces Riemannian ICA (RICA), which shifts the focus from a global generative model to local geometric structures. RICA utilizes Riemannian geometry to interpret factors of variation as radial curves in the latent space, allowing for a more flexible understanding of disentanglement. The main contribution is the disentanglement tensor, which captures a second-order notion of disentanglement termed pointwise disentanglement. This tensor is derived from the Hessian of the data log likelihood and the Ricci curvature of the model. In experiments involving controlled source recovery, RICA demonstrates superior performance in recovering sources across various manifolds compared to traditional ICA methods, which are sensitive to the choice of coordinates. The work provides a theoretical framework for analyzing local disentanglement without relying on a global generative model.
Methodology
The paper employs Riemannian geometry to redefine the factors of variation in data through local geometric structures. It introduces the disentanglement tensor, which is computed from the Hessian of the log likelihood and Ricci curvature. The methodology involves diagonalizing this tensor to assess pointwise disentanglement, validated through a controlled source recovery setting.
Results
RICA outperformed traditional ICA methods in recovering sources across multiple manifolds, demonstrating that local geometric interpretations can yield better disentangled representations without the constraints of global generative assumptions.
Implications
The findings suggest that modern representation learning can benefit from a local geometric perspective, potentially leading to more interpretable and effective models in various applications such as controllable editing, transfer learning, and scientific analysis.
PeakFocus: Bridging Peak Localization and Intensity Regression via a Unified Multi-Scale Framework for Electricity Load Forecasting
Time Series
- PeakFocus unifies peak localization and intensity regression into a single framework.
- The framework employs a triple hybrid loss for joint supervision of peak timing and intensity.
- Multi-Scale Mixing Peak Locator resolves misjudgment and timing misalignment using coarse and fine-grained features.
- Location-Aware Decoder enhances intensity estimation by incorporating peak timing context.
Summary
The paper presents PeakFocus, a novel framework for Electricity Load Peak Forecasting (ELPF) that addresses significant limitations in existing forecasting methods. Traditional approaches often separate peak timing localization from intensity regression, leading to inefficiencies and inaccuracies. PeakFocus introduces a Unified Peak-Aware Pipeline (UPAP) that employs a triple hybrid loss to jointly supervise both tasks, enhancing their interdependence. Additionally, it features a Multi-Scale Mixing Peak Locator (MSM-PL) that utilizes coarse-grained features to mitigate peak misjudgment and timing misalignment, and a Location-Aware Decoder (LAD) that incorporates peak timing context into intensity regression to counteract smoothing effects. The framework is validated through extensive experiments on both public and industrial-scale datasets, demonstrating superior performance in timing precision and intensity estimation compared to existing methods.
Methodology
PeakFocus utilizes a unified approach that combines temporal localization and intensity regression through a series of specialized modules. The Unified Peak-Aware Pipeline (UPAP) integrates a triple hybrid loss for joint training, while the Multi-Scale Mixing Peak Locator (MSM-PL) and Location-Aware Decoder (LAD) address specific challenges related to peak misjudgment, timing misalignment, and intensity smoothing.
Results
The results indicate that PeakFocus significantly outperforms baseline models in both timing precision and intensity estimation on the public Electricity (ELC) dataset and the industrial-scale World Large-scale Electricity Load (WLEL) dataset, confirming the effectiveness of the proposed framework.
Implications
The advancements presented in PeakFocus have significant implications for electricity grid management, enabling more accurate forecasting of load peaks which is crucial for resource allocation and risk management in power systems.
On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents
Reinforcement Learning
Theory
Optimization
- Introduces a model-based algorithm, Model-Based OCE Value Iteration (MB-OCE-VI), for risk-sensitive RL.
- Establishes PAC-type bounds on sample complexity for both value and policy learning under recursive OCE.
- Proves that OCEs defined by utility functions outside a specific class are not PAC learnable.
- Provides worst-case lower bounds on sample complexity, improving existing results for CVaR.
Summary
This paper investigates risk-sensitive reinforcement learning (RL) within finite discounted Markov Decision Processes (MDPs), specifically focusing on the sample complexities associated with learning optimal state-action value functions and policies under a framework of Optimized Certainty Equivalents (OCE). The authors characterize utility functions for which the corresponding OCE is PAC-learnable and analyze a model-based approach, deriving PAC sample complexity bounds. They demonstrate that if the utility function does not have a full domain, the learning problem is not PAC learnable. The paper also establishes lower bounds for both value and policy learning, revealing the tightness of sample complexity in relation to the state-action space size and the effective horizon. Notably, the authors improve existing bounds for Conditional Value-at-Risk (CVaR) by providing a more precise dependence on the risk parameter τ. This work represents a significant advancement in understanding the sample complexity of recursive OCE in discounted MDPs and presents the first impossibility results for RL under OCEs.
Methodology
The authors propose a model-based algorithm called Model-Based OCE Value Iteration (MB-OCE-VI) and derive PAC sample complexity bounds for value and policy learning. They analyze the conditions under which OCEs are PAC-learnable and establish lower bounds for sample complexity, focusing on the dependence on the state-action space size and the effective horizon.
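To make the OCE/CVaR relationship concrete, here is a minimal sketch using the standard definition from the OCE literature, OCE_u(X) = sup over λ of { λ + E[u(X − λ)] }, evaluated on an empirical sample. The function names are hypothetical, rewards (not costs) are assumed, and the paper's exact convention is not stated in this summary, so treat this as an illustration rather than the MB-OCE-VI algorithm.

```python
from typing import Callable, Sequence


def empirical_oce(
    samples: Sequence[float],
    utility: Callable[[float], float],
    candidates: Sequence[float],
) -> float:
    """Approximate sup_lam { lam + mean(u(x - lam)) } over a finite candidate set."""
    n = len(samples)
    return max(
        lam + sum(utility(x - lam) for x in samples) / n
        for lam in candidates
    )


def cvar_utility(tau: float) -> Callable[[float], float]:
    """Utility u(t) = min(t, 0) / tau, which recovers CVaR at level tau (reward convention)."""
    return lambda t: min(t, 0.0) / tau


def empirical_cvar(samples: Sequence[float], tau: float) -> float:
    # For this piecewise-linear concave objective the supremum is attained
    # at one of the sample points, so the samples serve as candidates.
    return empirical_oce(samples, cvar_utility(tau), samples)
```

With equally weighted samples 1, 2, 3, 4 and τ = 0.5, the value is the mean of the worst half of outcomes; with τ = 1 it reduces to the plain mean.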
Results
The paper provides exact characterizations of OCE measures that are PAC learnable, establishes necessary conditions for learnability, and presents both upper and lower bounds on sample complexity. The results indicate that the sample complexity is tightly linked to the size of the state-action space and the effective horizon, with improved bounds for CVaR compared to existing literature.
Implications
The findings have significant implications for risk-sensitive applications in fields such as finance, healthcare, and operations research, where understanding the variability of returns is crucial. The results can guide the development of more efficient RL algorithms that incorporate risk measures, enhancing decision-making in high-stakes environments.
Learning Causal Orderings for In-Context Tabular Prediction
Theory
- Introduces TABORDER, a model that incorporates causal orderings into tabular prediction.
- Uses causal order-constrained attention to ensure predictions are based on causal relationships.
- Learns optimal variable orderings in an unsupervised manner through a likelihood-based objective.
- Addresses the challenge of missing data in tabular datasets while identifying causal directions.
Summary
This paper addresses the limitations of in-context learning for tabular data, which often relies on correlational structures that can fail under distribution shifts or interventions. The authors propose a novel model, TABORDER, which integrates causal orderings into tabular prediction by using causal order-constrained attention. This model learns the optimal variable ordering in an unsupervised manner through a likelihood-based objective, allowing it to make predictions based solely on features that precede a target variable according to the learned causal order. The paper also explores how missing data interacts with causal direction identification, demonstrating that TABORDER can effectively recover accurate causal orderings while performing well in prediction and imputation tasks. Empirical results show that TABORDER maintains robust predictive performance even when faced with interventions, outperforming traditional models that do not account for causal relationships.
Methodology
The authors developed TABORDER, a transformer-based model that employs causal order-constrained attention to model the joint distribution of tabular data. The model infers causal orderings through a likelihood-based objective, allowing it to condition predictions on preceding variables in a learned causal order. It also incorporates mechanisms to handle missing data effectively, facilitating causal direction identification.
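The idea of conditioning only on variables that precede a target in a causal order can be sketched as an attention mask. This is an illustrative construction under the assumption that the order is given as a permutation of column indices; `causal_order_mask` is a hypothetical name, and how TABORDER actually parameterizes or learns the order is not described in this summary.

```python
from typing import List, Sequence


def causal_order_mask(order: Sequence[int]) -> List[List[bool]]:
    """Build a boolean mask from a variable ordering.

    order[k] is the index of the variable placed at position k.
    mask[i][j] is True when variable j strictly precedes variable i,
    i.e. when variable i is allowed to attend to variable j.
    """
    position = {var: k for k, var in enumerate(order)}
    n = len(order)
    return [
        [position[j] < position[i] for j in range(n)]
        for i in range(n)
    ]


# Example: variable 2 comes first, then 0, then 1.
example_mask = causal_order_mask([2, 0, 1])
```

In this example variable 1, last in the order, may attend to both other variables, while variable 2, first in the order, attends to none.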
Results
Empirical evaluations indicate that TABORDER accurately infers causal orderings and performs competitively in downstream tasks such as prediction and imputation. The model demonstrates resilience to distribution shifts and interventions, maintaining predictive accuracy where traditional models fail.
Implications
The findings suggest that integrating causal structures into predictive models can enhance robustness against distribution shifts and improve performance in real-world applications, particularly in fields like biology where interventions are common. This approach could lead to more reliable decision-making tools in various domains reliant on tabular data.
Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos
Theory
Optimization
- Dropout acts as a relevant perturbation that shifts the critical fixed point in deep networks.
- Smooth and kinked activations lead to different universality classes with distinct critical scaling behaviors.
- A two-parameter scaling collapse is established for dropout strength and distance to criticality.
- Optimal dropout scheduling can significantly reduce test loss without increasing computational costs.
Summary
This paper presents a mean-field theory of dropout as a perturbation affecting critical signal propagation in deep neural networks, particularly at the edge of chaos. The author demonstrates that dropout modifies the perfect-alignment fixed point, resulting in a finite depth scale for information propagation even under critical initialization. The study derives critical and crossover scaling laws for correlation decay and identifies two universality classes based on activation types: smooth activations and kinked, ReLU-like activations. Each class exhibits distinct critical exponents and a universal scaling collapse in relation to dropout strength and detuning. The findings suggest that dropout can be treated as a depth-dependent dynamical field, leading to optimal scheduling strategies that reduce test loss without additional computational costs. The predictions are validated through experiments on Multi-Layer Perceptrons (MLPs) and Vision Transformers, with discussions on potential extensions to Convolutional Neural Networks (CNNs) and ResNets.
Methodology
The author employs mean-field theory to analyze the effects of dropout in randomly initialized deep networks. The study utilizes a renormalization-group approach to derive scaling laws and examines how dropout deforms the correlation map. Experiments are conducted on MLPs and Vision Transformers to validate theoretical predictions.
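The shift of the perfect-alignment fixed point can be illustrated with the textbook mean-field correlation map for a bias-free ReLU network. The sketch below assumes inverted-dropout scaling and independent dropout masks for the two inputs, in which case the layer-to-layer correlation map is simply multiplied by the keep probability. The function names are hypothetical and this is not the paper's derivation.

```python
import math


def relu_corr_map(c: float) -> float:
    """Mean-field correlation map for a bias-free ReLU layer (no dropout)."""
    c = max(-1.0, min(1.0, c))  # guard against floating-point drift outside [-1, 1]
    return (math.sqrt(1.0 - c * c) + c * (math.pi - math.acos(c))) / math.pi


def iterate_correlation(c0: float, keep_prob: float, depth: int) -> float:
    """Propagate the correlation between two inputs through `depth` layers.

    Assumption: dropout rescales the variance by 1/keep_prob but leaves the
    cross-input covariance unchanged, so the correlation picks up a factor
    keep_prob at every layer.
    """
    c = c0
    for _ in range(depth):
        c = keep_prob * relu_corr_map(c)
    return c


no_dropout = iterate_correlation(0.5, keep_prob=1.0, depth=2000)
with_dropout = iterate_correlation(0.5, keep_prob=0.9, depth=2000)
```

Without dropout the correlation creeps toward 1, slowly, as expected at criticality. With a keep probability of 0.9 it settles at a fixed point strictly below 1, which is the sense in which dropout imposes a finite depth scale.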
Results
The analysis reveals that dropout retains a non-trivial fixed point for correlation, even as it disrupts the order-to-chaos critical point. The distinct scaling behaviors of smooth and kinked activations are characterized, and optimal dropout profiles are derived, leading to reduced test loss and improved accuracy in neural networks.
Implications
The findings suggest that dropout can be strategically implemented to enhance the performance of deep learning models without incurring additional computational costs. The insights into scaling laws and universality classes may inform future research on regularization techniques and hyperparameter optimization in deep learning.
Posterior Collapse as Automatic Spectral Pruning
Generative Models
Theory
Interpretability
- Posterior collapse in beta-VAEs is shown to act as automatic spectral pruning.
- A latent-rescaling-invariant order parameter is introduced to rank active latent modes.
- The collapse spectrum and utility spectrum coincide in the linear Gaussian case.
- The findings suggest that posterior collapse can be beneficial for feature learning and interpretability.
Summary
This paper explores the phenomenon of posterior collapse in beta-Variational Autoencoders (beta-VAEs), proposing that it functions as an automatic spectral pruning mechanism. The author demonstrates that when the contribution of a latent mode to reconstruction falls below a certain threshold determined by the regularization strength beta, that mode collapses. This results in a cascade of collapses, revealing a systematic decoupling of latent modes from least to most useful. Through a Landau stability analysis, the paper derives a latent-rescaling-invariant order parameter that ranks active latent modes and identifies collapse thresholds, which can guide the inspection of effective variables. The study shows that in the linear Gaussian case, the collapse spectrum, utility spectrum, and normalized PCA spectrum align, with each collapse adhering to a mean-field law. The findings are empirically tested on the WorldClim dataset, providing insights into the relationship between latent representations and their utility in reconstruction tasks. The paper argues that posterior collapse should not be viewed solely as a failure but as a feature that can be exploited for effective dimensionality reduction and interpretability in representation learning.
Methodology
The paper employs a Landau stability analysis to derive the relationship between latent mode collapse and the regularization strength beta in beta-VAEs. It defines a latent-rescaling-invariant order parameter to rank latent modes and identifies collapse thresholds. The empirical analysis is conducted on the WorldClim dataset, using geography-aware train/validation/test splits to minimize spatial leakage.
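A rough picture of mode-by-mode pruning, using the simplified linear Gaussian setting only: assume each latent mode corresponds to a PCA eigenvalue and stays active while that eigenvalue exceeds beta times the decoder noise variance. This thresholding rule is a textbook-style assumption for illustration; it is not the paper's order parameter, and `active_modes` and `collapse_order` are hypothetical names.

```python
from typing import List, Sequence


def active_modes(
    eigenvalues: Sequence[float], beta: float, noise_var: float = 1.0
) -> List[int]:
    """Indices of modes whose eigenvalue exceeds the assumed threshold beta * noise_var."""
    return [i for i, lam in enumerate(eigenvalues) if lam > beta * noise_var]


def collapse_order(eigenvalues: Sequence[float]) -> List[int]:
    """Modes sorted by when they would collapse as beta grows (smallest eigenvalue first)."""
    return sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i])


spectrum = [5.0, 0.4, 2.0, 1.1]
```

Sweeping beta upward over this toy spectrum removes modes one at a time, least useful first, which mirrors the cascade described above.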
Results
The study finds that as the regularization strength beta increases, latent modes collapse sequentially at characteristic thresholds, leading to a collapse spectrum that aligns with the utility spectrum. This indicates that the latent representation can be effectively pruned mode by mode, similar to PCA, enhancing reconstruction performance.
Implications
The insights from this paper could influence the design of variational autoencoders and other generative models, particularly in applications where interpretability and dimensionality reduction are crucial. Understanding posterior collapse as a feature rather than a failure could lead to more effective representation learning strategies.
Beyond Single Slot: Joint Optimization for Multi-Slot Guaranteed Display Advertising
Optimization
- Introduces a joint optimization framework for multi-slot guaranteed display advertising.
- Addresses key challenges such as slot-level redundancy and contract imbalance.
- Utilizes an offline bipartite matching approach for coordinated ad allocation.
- Implements Page View constraints and a Contract Roulette mechanism to enhance user experience.
Summary
This paper addresses the limitations of existing guaranteed display (GD) advertising methods that typically operate under a single-slot assumption. The authors propose a novel joint optimization framework for multi-slot GD allocation, which tackles issues such as slot-level redundancy, contract imbalance, and exposure concentration. By formulating the allocation as an offline bipartite matching problem, the framework incorporates a contract roulette mechanism for slot exclusivity and Page View constraints for impression control. The proposed method enables coordinated decision-making across multiple slots, enhancing fairness and diversity in ad placements. Extensive online tests conducted on the Meituan advertising platform demonstrate significant improvements in merchant ROI, platform revenue efficiency, and contract fulfillment robustness. The results indicate a 28.99% increase in Average Revenue Per User under 70% traffic, showcasing the framework's strong applicability in real-world advertising scenarios.
Methodology
The authors propose a joint optimization framework that models the allocation problem as an offline bipartite matching task at the page view level. This approach allows for coordinated decisions across multiple ad slots. The framework includes two key modules: Page View constraints for managing traffic balance and a Contract Roulette mechanism to ensure exclusivity and reduce redundant exposures.
Results
The proposed framework was tested on the Meituan advertising platform, resulting in a 28.99% increase in Average Revenue Per User under 70% traffic and a 2.12% improvement in contract fulfillment rates compared to previous methods. These results highlight the effectiveness and efficiency of the new approach in real-world applications.
Implications
The findings suggest that the joint optimization framework can significantly enhance the performance of GD advertising systems, leading to better revenue generation and improved user experience. This approach could be applied to other advertising platforms facing similar multi-slot allocation challenges.
A note on convergence of Wasserstein policy optimization
Reinforcement Learning
Theory
Optimization
- Establishes linear convergence of Wasserstein Policy Optimization (WPO) in entropy-regularized MDPs.
- Utilizes mean-field analysis and log-Sobolev inequalities to prove convergence properties.
- Demonstrates monotonic energy dissipation along the gradient flow.
- Concludes that the value function converges exponentially fast to the global optimum.
Summary
This paper addresses the theoretical convergence properties of Wasserstein Policy Optimization (WPO), a reinforcement learning algorithm that optimizes stochastic policies in continuous action spaces using Wasserstein gradient flows. Despite WPO's empirical success, its theoretical foundation was lacking. The authors demonstrate that WPO converges linearly within the framework of entropy-regularized Markov Decision Processes (MDPs). They utilize recent advancements in mean-field analysis and log-Sobolev inequalities to establish monotonic energy dissipation along the gradient flow, leading to the conclusion that the value function converges exponentially fast to the global optimum. The analysis is primarily conducted in continuous time, although the authors suggest that similar results could be obtained for discrete stepping schemes. This work provides a rigorous mathematical foundation for WPO, enhancing its applicability in reinforcement learning tasks.
Methodology
The authors analyze WPO within the framework of entropy-regularized MDPs, assuming the existence of sufficiently regular solutions to the gradient flow equation. They apply techniques from mean-field analysis and convex analysis to derive energy dissipation properties and establish a local log-Sobolev inequality, which are crucial for proving convergence.
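The step from energy dissipation to an exponential rate is a standard Grönwall-type argument. In the sketch below, \mathcal{E}(t) stands for a generic nonnegative optimality gap and c > 0 for a generic constant of the kind a log-Sobolev inequality supplies; both symbols are placeholders, not the paper's notation.

```latex
\frac{d}{dt}\,\mathcal{E}(t) \le -c\,\mathcal{E}(t)
\;\Longrightarrow\;
\frac{d}{dt}\Bigl(e^{ct}\,\mathcal{E}(t)\Bigr) \le 0
\;\Longrightarrow\;
\mathcal{E}(t) \le e^{-ct}\,\mathcal{E}(0), \qquad t \ge 0 .
```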
Results
The main results indicate that WPO converges linearly in the context of entropy-regularized MDPs, with the value function exhibiting exponential convergence to the global optimum. The analysis confirms that energy dissipates monotonically along the gradient flow, supporting the theoretical claims.
Implications
The findings provide a solid theoretical basis for WPO, which can enhance its use in various reinforcement learning applications, particularly in environments with continuous state and action spaces. This could lead to more reliable and efficient policy optimization methods in complex control problems.
Beyond Scalar Objectives: Expert-Feedback-Driven Autonomous Experimentation for Scientific Discovery at the Nanoscale
Robotics
Optimization
Theory
- Introduces deep-kernel pairwise learning (DKPL) for autonomous experimentation.
- DKPL incorporates expert feedback to evaluate experimental outputs beyond scalar metrics.
- Demonstrated effectiveness in learning nanoscale structures and analyzing ferroelectric domain walls.
- Addresses limitations of traditional Bayesian optimization in capturing complex scientific phenomena.
Summary
This paper presents a novel approach to autonomous experimentation in scientific discovery, particularly at the nanoscale, by introducing deep-kernel pairwise learning (DKPL). Traditional Bayesian optimization methods in self-driving laboratories rely on predefined scalar descriptors, which can limit their effectiveness in capturing complex phenomena. The authors argue that many important scientific insights are not easily quantifiable and can be better assessed through expert feedback. DKPL allows experts to evaluate experimental outputs directly, using their interdisciplinary knowledge to inform the learning process. The method learns a latent utility function from these evaluations, guiding subsequent experiments without the constraints of scalar metrics. The authors demonstrate DKPL's capabilities in two case studies: learning nanoscale structures and analyzing ferroelectric domain walls, showcasing its ability to prioritize high-information measurement regions and distinguish between different domain-wall characteristics. This work highlights the potential for integrating expert knowledge into autonomous systems, paving the way for more effective self-driving laboratories that can tackle complex scientific challenges.
Methodology
The authors developed a preference-driven active learning framework called deep-kernel pairwise learning (DKPL), which integrates expert evaluations into the experimental process. Instead of relying on predefined scalar metrics, DKPL allows experts to compare experimental outputs directly, learning a latent utility function from these pairwise comparisons to guide future experiments.
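Learning a latent utility from pairwise comparisons can be sketched with a Bradley-Terry model. To keep the example small, a linear utility stands in for the deep-kernel model, so this shows the preference-learning principle only and is not DKPL itself; `fit_pairwise_utility` and `utility` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (preferred item, other item)


def utility(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear latent utility u(x) = w . x (a stand-in for a learned deep kernel)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_pairwise_utility(
    pairs: Sequence[Pair], dim: int, lr: float = 0.1, steps: int = 500
) -> List[float]:
    """Gradient ascent on the Bradley-Terry log-likelihood sum log sigmoid(u(win) - u(lose))."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for winner, loser in pairs:
            diff = [a - b for a, b in zip(winner, loser)]
            margin = utility(w, diff)
            coeff = 1.0 / (1.0 + math.exp(margin))  # equals 1 - sigmoid(margin)
            for k in range(dim):
                grad[k] += coeff * diff[k]
        w = [wi + lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w


# Toy comparisons in which the expert always prefers a larger first feature.
toy_pairs = [
    ((1.0, 0.0), (0.0, 0.0)),
    ((2.0, 1.0), (1.0, 1.0)),
    ((3.0, 0.5), (0.5, 0.5)),
]
weights = fit_pairwise_utility(toy_pairs, dim=2)
```

The fitted utility can then score unseen candidates, which is how a preference model could steer the choice of the next measurement.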
Results
DKPL successfully learned physically meaningful nanoscale structures and effectively prioritized measurement regions in an experimental model dataset. In the analysis of ferroelectric domain walls, DKPL distinguished between high and low characteristic angles in bismuth ferrite and identified different domain-wall characters in erbium manganite, demonstrating its capability to handle complex, multidimensional data.
Implications
This research suggests that integrating human expertise into autonomous experimentation can enhance the discovery process in materials science and other fields. It opens avenues for developing self-driving laboratories that can explore complex scientific questions that are not easily quantifiable, potentially leading to significant advancements in various scientific domains.
Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification
NLP
Large Language Models
Interpretability
- The audit pipeline is model-agnostic and can be reused to analyze language model failures.
- Feature 17,491 correlates strongly with task failure but does not serve as a sufficient cause.
- The study highlights the importance of conducting controls to distinguish between robust behavioral effects and incidental feature correlations.
- The findings reveal a significant lexical confound affecting the model's performance on the IOI task.
Summary
This paper presents a reproducible audit of the GPT-2 Small model's performance on the Indirect Object Identification (IOI) task, focusing on the relationship between model activations and task failures. The authors analyze 300 prompts and find that GPT-2 Small achieves an accuracy of 79.7%. They identify 146 out of 24,576 features from a sparse-autoencoder (SAE) that significantly correlate with task performance, with feature 17,491, labeled 'cryptographic keys,' showing the strongest correlation with failure. This feature activates predominantly when the transferred object is 'the keys,' leading to a failure rate of 93.3% on those prompts. The authors conduct three controls to validate their findings: a causal ablation test, a representation baseline comparison, and a seed-robustness check. The results indicate that while the feature correlates with failure, it is not a sufficient cause. The paper emphasizes the importance of the audit pipeline developed for this analysis, which is model-agnostic and provides insights into the interpretability of language models. The authors also release their code and data for reproducibility.
Methodology
The authors employed a sparse-autoencoder (SAE) to analyze the activations of GPT-2 Small on the IOI task. They conducted an audit on 300 prompts, measuring feature activations and performing statistical analyses to identify significant correlations with task performance. Three controls were implemented to validate the mechanistic interpretations of the findings.
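The correlational step of such an audit can be sketched as ranking features by the Pearson (point-biserial) correlation between activation and a binary failure label. The data layout and function names below are assumptions for illustration, not the released pipeline. As the paper's own controls stress, a high score here says nothing about causation.

```python
import math
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_features(
    activations: Sequence[Sequence[float]], failed: Sequence[bool]
) -> List[Tuple[int, float]]:
    """Rank feature indices by |correlation| with failure.

    activations[p][f] is the activation of feature f on prompt p.
    """
    labels = [1.0 if flag else 0.0 for flag in failed]
    scores = []
    for f in range(len(activations[0])):
        column = [row[f] for row in activations]
        scores.append((f, pearson(column, labels)))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


toy_activations = [[0.0, 1.0], [0.1, 0.0], [2.0, 1.0], [1.9, 0.0]]
toy_failed = [False, False, True, True]
ranking = rank_features(toy_activations, toy_failed)
```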
Results
GPT-2 Small achieved 79.7% accuracy on the IOI task. Feature 17,491 was identified as the strongest correlate of failure, activating significantly more during failed trials. However, ablation of this feature did not restore accuracy, indicating it is a correlate rather than a cause. The logistic regression analysis showed that the raw residual stream's predictive power matched that of the top SAE features, and the failure rate remained consistent across different random seeds.
Implications
The findings suggest that while certain features may correlate with model failures, they do not necessarily indicate causation. This has implications for the interpretability of language models and the methodologies used in auditing their performance. The developed audit pipeline can be utilized for further investigations into model behavior and failure analysis.
Bandit Convex Optimization with Gradient Prediction Adaptivity
Optimization
Theory
- Introduces Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for BCO.
- Establishes a negative result indicating fundamental limitations of single-point feedback in BCO.
- Achieves improved regret bounds that scale with cumulative prediction error.
- Develops adaptive variants of the algorithm that do not require prior knowledge of prediction error or time horizon.
Summary
This paper explores the problem of Bandit Convex Optimization (BCO), where a learner receives partial feedback in the form of loss values at chosen decision points. The authors investigate whether optimistic gradient predictions can enhance worst-case regret guarantees in a prediction-adaptive manner. They establish a negative result showing that under single-point feedback, a lower bound of Ω(√T) regret persists, even when the cumulative prediction error is small. To address this, they propose a new algorithm called Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for the two-point feedback setting. This algorithm uses a novel variance-reduced gradient estimator, which reduces variance based on prediction error rather than gradient norm, achieving a regret bound of O(√(d·E[S_T])), where S_T is the cumulative prediction error. The authors also provide an information-theoretic lower bound that characterizes the best achievable prediction-adaptive regret, demonstrating that TP-VR-OPT is optimal up to a factor of √d. Additionally, they develop adaptive variants that do not require prior knowledge of E[S_T] or the horizon T, and extend their framework to non-stationary environments, establishing dynamic regret guarantees that adapt to both cumulative prediction error and comparator path length.
Methodology
The authors propose the TP-VR-OPT algorithm, which utilizes a variance-reduced gradient estimator that adapts based on the prediction error. They analyze the regret bounds in both static and dynamic settings, providing theoretical guarantees for their approach and comparing it with existing methods.
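One way a gradient prediction can reduce the variance of a two-point estimator is as a control variate: subtract the predicted directional derivative before rescaling, then add the prediction back. The sketch below follows that generic construction. It is an assumption about the general shape of such estimators, not a reproduction of TP-VR-OPT, and all names are hypothetical.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Sample a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def two_point_estimate(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    delta: float,
    rng: random.Random,
    prediction: Optional[Sequence[float]] = None,
) -> List[float]:
    """Two-point gradient estimate, optionally centered on a gradient prediction.

    With prediction=None this is the classical estimator
    d * (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u.
    """
    d = len(x)
    m = list(prediction) if prediction is not None else [0.0] * d
    u = random_unit_vector(d, rng)
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    directional = (f(x_plus) - f(x_minus)) / (2.0 * delta)
    predicted = sum(mi * ui for mi, ui in zip(m, u))
    # The residual shrinks as the prediction improves, and so does the variance.
    return [mi + d * (directional - predicted) * ui for mi, ui in zip(m, u)]


true_grad = [1.0, -2.0, 0.5]
linear_loss = lambda z: sum(a * zi for a, zi in zip(true_grad, z))
estimate = two_point_estimate(
    linear_loss, [0.0, 0.0, 0.0], 0.1, random.Random(0), prediction=true_grad
)
```

For a linear loss and a perfect prediction the residual vanishes, so the estimate equals the true gradient regardless of the sampled direction.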
Results
The TP-VR-OPT algorithm achieves a regret bound of O(√(d·E[S_T])), which is optimal up to a factor of √d. The authors also establish an information-theoretic lower bound of Ω(√(E[S_T])), demonstrating the effectiveness of their approach in leveraging gradient predictions. Adaptive variants of the algorithm are shown to maintain performance without prior knowledge of prediction error or time horizon.
Implications
The findings suggest that incorporating gradient predictions can significantly enhance the performance of BCO algorithms, particularly in environments where the loss functions exhibit predictable patterns. This has potential applications in various online learning scenarios, such as recommendation systems and adaptive control.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
NLP
Large Language Models
Reinforcement Learning
- OPPO improves token-level credit assignment in LLM reasoning by using Bayesian updates.
- The method accumulates oracle signals to provide a running estimate of success probability for each token.
- OPPO eliminates the need for a learned value network and additional rollouts.
- The framework includes two estimators: self-oracle and teacher-oracle.
Summary
The paper introduces Oracle-Prompted Policy Optimization (OPPO), a novel approach to improve token-level credit assignment in large language model (LLM) reasoning. Traditional reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), assign a single trajectory-level advantage to all tokens, which dilutes the learning signal for pivotal tokens and introduces noise for uninformative ones. OPPO addresses this by utilizing a Bayesian framework that accumulates oracle signals along a trajectory, providing a running estimate of success probability for each token. This method eliminates the need for a learned value network and additional rollouts, allowing for a more focused credit assignment to pivotal tokens. The framework supports two estimators: a self-oracle that reuses the student model and a teacher-oracle that employs a stronger frozen model. The authors demonstrate that OPPO consistently outperforms existing methods across various reasoning benchmarks, achieving significant improvements in performance metrics.
Methodology
The OPPO framework employs Bayesian value recursion to derive a token-level advantage from oracle signals. It integrates this recursion into the GRPO pipeline, allowing for efficient computation with only one additional forward pass per trajectory. The method factors the advantage into a per-token discrimination signal modulated by a state weight, concentrating credit on pivotal tokens.
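A running success-probability estimate of this kind can be illustrated with a plain Bayes update in odds form. The sketch assumes each token supplies a likelihood ratio (evidence for success versus failure) and takes the change in the running estimate as that token's credit. How OPPO actually derives its oracle signal and advantage is not specified in this summary, so every name and modeling choice here is hypothetical.

```python
from typing import List, Sequence


def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior success probability after one piece of evidence."""
    numerator = prior * likelihood_ratio
    return numerator / (numerator + (1.0 - prior))


def running_success_estimates(
    prior: float, likelihood_ratios: Sequence[float]
) -> List[float]:
    """Estimate before any token, then after each token in turn."""
    estimates = [prior]
    for ratio in likelihood_ratios:
        estimates.append(bayes_update(estimates[-1], ratio))
    return estimates


def token_credits(estimates: Sequence[float]) -> List[float]:
    """Per-token credit as the change in the running estimate."""
    return [after - before for before, after in zip(estimates, estimates[1:])]


# A ratio of 1.0 is uninformative; the second token carries the evidence.
estimates = running_success_estimates(0.5, [1.0, 4.0, 1.0])
credits = token_credits(estimates)
```

Uninformative tokens receive zero credit and the pivotal token receives all of it, in contrast to spreading one trajectory-level advantage evenly across tokens.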
Results
OPPO was evaluated on two base LLMs across seven reasoning benchmarks, showing improvements of up to +6.0 points on AMC'23 and +5.2 points on AIME'24 compared to GRPO, DAPO, and SDPO. The performance gains were observed to widen with the response length, indicating the effectiveness of the method in longer reasoning tasks.
Implications
The OPPO framework has the potential to enhance the reasoning capabilities of large language models, making them more effective in tasks requiring complex reasoning and decision-making. This could lead to advancements in applications such as automated reasoning, question answering, and other NLP tasks that rely on LLMs.
Three Costs of Amortizing Gaussian Process Inference with Neural Processes
Theory
Efficient ML
Generative Models
- Decomposes KL divergence between GP and LNP predictives into three interpretable components.
- Establishes bounds on approximation errors related to representation dimension and kernel smoothness.
- Identifies label contamination as a persistent cost in neural process predictions.
- Provides architectural recommendations to enhance predictive variance estimation.
Summary
This paper investigates the costs associated with amortizing Gaussian process (GP) inference using latent neural processes (LNPs). Traditional GP inference is computationally expensive, scaling cubically with the number of context points, which limits its applicability in real-time scenarios. Neural processes provide a solution by learning a mapping from context sets to predictive distributions in linear time. However, this amortization introduces approximation errors that have not been quantitatively characterized until now. The author decomposes the Kullback-Leibler (KL) divergence between the GP and LNP predictives into three distinct components: label contamination, an information bottleneck due to finite-dimensional representation, and amortization error from a shared encoder network. The paper provides bounds on these errors, linking them to architectural choices and kernel properties. Specifically, it shows that the bottleneck term decays exponentially with the representation dimension for squared-exponential kernels and polynomially for Matérn kernels. The label contamination term is identified as a persistent cost, while architectural recommendations are made to improve predictive variance estimation. Overall, this work offers a comprehensive understanding of the trade-offs involved in using neural processes for GP inference.
Methodology
The paper employs a theoretical analysis of the KL divergence between Gaussian process and latent neural process predictives. It derives bounds on the divergence components, focusing on label contamination, information bottleneck, and amortization error. The analysis connects these components to architectural choices, such as representation dimension and kernel type, providing a framework for understanding the trade-offs in neural process design.
Results
The analysis reveals that the bottleneck truncation term decays exponentially with the representation dimension for squared-exponential kernels and polynomially for Matérn kernels. The label contamination term is generally constant, with only the observation noise component decaying with the number of context points. The paper also offers architectural recommendations to improve predictive variance estimation by utilizing second-order pooling instead of mean aggregation.
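The pooling recommendation can be illustrated with a small sketch. It assumes per-context-point embeddings stored as an `(n, d)` array; the function names and the choice to concatenate the mean with the upper triangle of the averaged outer product are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def mean_pool(r):
    """Average per-context-point embeddings r of shape (n, d) into one (d,) vector."""
    return r.mean(axis=0)

def second_order_pool(r):
    """Concatenate the mean with the upper triangle of the averaged outer products.

    The second-moment part keeps spread information that the mean alone discards.
    """
    mean = r.mean(axis=0)
    second = (r[:, :, None] * r[:, None, :]).mean(axis=0)  # shape (d, d)
    iu = np.triu_indices(r.shape[1])
    return np.concatenate([mean, second[iu]])

# Two toy context sets with the same mean embedding but different spread.
tight = np.array([[1.0, 0.0], [1.0, 0.0]])
spread = np.array([[0.0, 0.0], [2.0, 0.0]])

same_mean = np.allclose(mean_pool(tight), mean_pool(spread))
same_second = np.allclose(second_order_pool(tight), second_order_pool(spread))
```

In this toy case mean aggregation cannot distinguish the two context sets, while the second-order summary can, which is the kind of information a variance head would need.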
Implications
This work has significant implications for the design of neural processes in applications requiring efficient GP inference, such as robotics, real-time predictions, and sequential experimental design. By understanding the costs associated with amortization, practitioners can make informed decisions about architecture and kernel choices to optimize performance.
Aerodynamic force reconstruction using physics-informed Gaussian processes
Theory
Optimization
Time Series
- Introduces a physics-informed machine learning approach for aerodynamic force reconstruction.
- Utilizes Gaussian processes to avoid overfitting and eliminate the need for regularization.
- Demonstrates effectiveness through a case study on the Great Belt East Bridge.
- Achieves strong agreement between true and predicted aerodynamic loads.
Summary
This paper presents a novel probabilistic physics-informed machine learning approach aimed at accurately reconstructing aerodynamic loads from noisy measurements of structural dynamic responses. Traditional aerodynamic load models often rely on simplifications that can compromise their accuracy, especially when faced with noisy or incomplete data. The proposed method utilizes Gaussian processes (GP) to create covariance kernels informed by the physical behavior of the system, allowing for the integration of heterogeneous and multi-fidelity data without the need for regularization schemes. The effectiveness of this approach is demonstrated through a case study involving the Great Belt East Bridge, where the model successfully reconstructed aerodynamic loads under a linear unsteady assumption. The results indicate close agreement between the true and predicted loads in terms of root mean squared error, magnitude, phase angle, and peak values. This method not only enhances model validation but also has broader applications in future load estimation and structural damage prognosis.
Methodology
The methodology involves using a physics-informed Gaussian process regression model that constructs covariance kernels based on the physical behavior of the system. The model is trained on noisy measurements of structural responses, allowing for automatic balancing of data fitting and model complexity. The parameters of the covariance kernels are identified using maximum likelihood estimation from the training data.
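Maximum likelihood estimation of kernel parameters can be sketched as follows. This is a generic illustration only: it assumes a plain squared-exponential kernel and a coarse grid over the length scale, whereas the paper builds physics-informed kernels whose form is not given in this summary.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale, variance=1.0):
    """Squared-exponential covariance between two 1-D input arrays."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def log_marginal_likelihood(x, y, length_scale, noise_std):
    """GP log marginal likelihood of noisy observations y at inputs x."""
    n = len(x)
    K = rbf_kernel(x, x, length_scale) + noise_std**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

# Synthetic noisy response (assumed data, not from the paper).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

# Pick the length scale that maximizes the marginal likelihood.
candidates = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
scores = np.array([log_marginal_likelihood(x, y, ls, 0.1) for ls in candidates])
best_length_scale = candidates[np.argmax(scores)]
```

The marginal likelihood trades data fit against the log-determinant complexity penalty, which is the automatic balancing the methodology refers to. A gradient-based optimizer would normally replace the grid.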
Results
The reconstructed aerodynamic loads agree closely with the true loads, as measured by root mean squared error, magnitude, phase angle, and peak values of the signals. The method effectively reconstructs the underlying aerodynamic forces even in the presence of noise.
Implications
The proposed method holds significant implications for improving the accuracy of aerodynamic load modeling in structural engineering. It can be used for model validation, predicting future loads on structures, and assessing potential structural damage, thereby enhancing the safety and reliability of engineering designs.
Can Transformers Learn to Verify During Backtracking Search?
Theory
Large Language Models
Optimization
- Transformers struggle with verification during backtracking due to scattered retrieval and history entanglement.
- Selective State Attention (SSA) is introduced as a structural fix to enforce state-based decision-making.
- SSA allows transformers to produce consistent outputs for same-state pairs, improving their reliability in search tasks.
- The study highlights the importance of structural adjustments in transformer models for effective reasoning.
Summary
This paper investigates the ability of transformer models to perform verification during backtracking search, a fundamental algorithmic technique used in constraint solvers and planners. The authors identify two main issues with decoder-only transformers trained on cumulative traces: scattered retrieval, where state features are dispersed across multiple positions, and history entanglement, where the model's predictions depend on the trajectory rather than the current state. To address these issues, the authors propose Selective State Attention (SSA), a fixed attention mechanism that ensures decisions are made based solely on the current state and relevant features, rather than past decisions. The study focuses on reactive verification scenarios, where contradictions are identified after propagation. The effectiveness of SSA is tested on various problems, including 3-SAT and graph coloring, demonstrating that SSA can produce consistent decisions across same-state pairs, unlike traditional cumulative-trained models. The findings suggest that transformer-based reasoning systems may require structural adjustments to improve their performance in search tasks, and the analysis opens avenues for inference-time context clearing as a potential solution.
Methodology
The authors conducted experiments using decoder-only transformers trained on cumulative solver traces, identifying failures in state-local decision-making. They introduced SSA as a fixed attention mask to isolate state features and tested its effectiveness across various problem domains, including 3-SAT and graph coloring.
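One way to picture a fixed attention mask that isolates state features is sketched below. It assumes each token carries a hypothetical `state_id` marking the search state it belongs to; the function name and this tagging scheme are assumptions for illustration and may differ from how SSA is actually constructed.

```python
import numpy as np

def state_local_mask(state_ids):
    """Boolean mask where query i may attend to key j only if j is not in the
    future (j <= i) and both tokens belong to the same search state."""
    ids = np.asarray(state_ids)
    same_state = ids[:, None] == ids[None, :]
    causal = np.tril(np.ones((ids.size, ids.size), dtype=bool))
    return same_state & causal

# Tokens 0-1 describe an earlier state, tokens 2-4 the current state.
mask = state_local_mask([0, 0, 1, 1, 1])
```

Under such a mask, tokens of the current state never read from earlier states, so two traces that reach the same state through different histories present the same visible context to the decision token.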
Results
The implementation of SSA resulted in transformers emitting identical decisions for same-state pairs that differed only in prior history, demonstrating a significant improvement over traditional cumulative-trained models, which failed to achieve this consistency.
Implications
The findings imply that transformer models may need to incorporate structural changes to enhance their reasoning capabilities in search tasks. The proposed SSA mechanism could be applied to improve the performance of pretrained language models in similar contexts, and the concept of inference-time context clearing may offer additional benefits.
SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis
Generative Models
Time Series
Theory
- SDPM offers a generative approach to continuous-time survival analysis without fixed discretization.
- The model effectively estimates survival functions using a denoising diffusion probabilistic framework.
- SDPM demonstrates superior performance in survival function estimation, particularly in integrated Brier score.
- The approach allows for controllable accuracy in survival estimates through sample generation.
Summary
The paper introduces the Survival Diffusion Probabilistic Model (SDPM), a novel generative approach for continuous-time survival analysis. Traditional survival analysis methods often impose structural assumptions on hazard functions or discretize the time axis, which can limit flexibility and introduce errors. SDPM addresses these issues by modeling the conditional distribution of observed survival outcomes using a denoising diffusion probabilistic model. This approach allows for the generation of conditional samples that can be transformed into survival function estimates via the Kaplan-Meier estimator, without requiring explicit parametric assumptions or fixed time discretization. The model operates in a transformed target space, enhancing numerical stability and sample validity. Evaluations on ten real-world survival datasets demonstrate that SDPM achieves competitive performance compared to five strong baselines, particularly excelling in integrated Brier score (IBS). Additional experiments reveal that increasing sample generation improves Kaplan-Meier reconstruction stability and fidelity. An ablation study highlights the significance of target-space transformations in enhancing event-rate calibration and predictive discrimination. The authors provide publicly available code for the implementation of SDPM.
Methodology
The SDPM utilizes a denoising diffusion probabilistic model to generate samples of the joint distribution of event time and censoring indicator. It operates in a transformed target space using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator, which improves numerical stability and sample validity. The model's performance is evaluated using the Kaplan-Meier estimator for survival function estimation.
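The step from generated samples to a survival curve can be sketched with a plain Kaplan-Meier estimator. The sketch assumes the samples have already been mapped back to raw times and binary event indicators; the sample values below are made up for illustration.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return distinct event times and the Kaplan-Meier survival estimate at each.

    times: observed times; events: 1 if the event occurred, 0 if censored.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)           # still under observation at t
        died = np.sum((times == t) & events)   # events exactly at t
        s *= 1.0 - died / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical samples for one covariate vector.
times = [2.0, 3.0, 3.0, 5.0, 8.0]
events = [1, 1, 0, 1, 0]
grid, surv = kaplan_meier(times, events)
# surv is [0.8, 0.6, 0.3] at times [2, 3, 5]
```

Drawing more samples per covariate vector gives the estimator more points to work with, which is consistent with the reported gain in reconstruction stability as sample counts grow.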
Results
SDPM was evaluated on ten real-world survival datasets and compared against five baseline models. It achieved competitive predictive performance across various metrics, including Harrell’s C-index, integrated time-dependent AUC, and integrated Brier score, with the best average rank among the approaches. The model showed particularly strong performance in IBS, indicating accurate survival function estimation. Additional experiments confirmed the benefits of increased sample generation and target-space transformations.
Implications
The SDPM has potential applications in various fields requiring survival analysis, such as medicine, reliability engineering, and financial modeling. Its ability to model survival outcomes without strict parametric assumptions and fixed time discretization can lead to more accurate predictions and insights in these domains.
Riemannian geometry meets fMRI: the advantages of modeling correlation manifolds and eigenvector subspaces
Theory
Time Series
Graph Learning
- Introduces a scalable geometric framework for analyzing correlation matrices in fMRI data.
- Develops the Off–log metric for closed-form statistical modeling of correlation matrices.
- Utilizes Grassmannian subspace discrimination to resolve ambiguities in eigenvector comparisons.
- Demonstrates improved sensitivity and predictive performance in clinical and aging datasets.
Summary
This paper introduces a novel geometric framework for analyzing functional brain networks using correlation matrices, which are often treated independently in standard analyses. The authors propose two main components: the Off–log metric, which transforms correlation matrices into symmetric zero-diagonal matrices, allowing for closed-form statistical operations without complex manifold optimization, and Grassmannian subspace discrimination, which compares subjects based on principal-angle distances between eigenvector subspaces. This approach addresses the limitations of existing geometric methods, such as lack of scalability and dependence on arbitrary region ordering. The framework is validated across two clinical cohorts (Parkinson’s and psychosis) and three aging fMRI datasets, demonstrating increased sensitivity in permutation tests and competitive performance in classification tasks compared to Riemannian and Euclidean baselines. The results indicate that geometry-aware representations enhance predictive performance while being straightforward to implement at scale, making them suitable for large neuroimaging datasets.
Methodology
The authors developed a two-component framework: (i) the Off–log metric, which maps correlation matrices to symmetric zero-diagonal matrices for closed-form statistical operations, and (ii) Grassmannian subspace discrimination, which compares subjects using principal-angle distances between eigenvector subspaces. The framework was integrated into standard machine-learning workflows for inference, regression, and classification.
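Both components can be sketched in a few lines of NumPy. The sketch assumes the Off-log map is the matrix logarithm followed by zeroing the diagonal, and that subspaces are given as matrices whose columns span them; function names are illustrative, and the paper's full pipeline may include steps not shown here.

```python
import numpy as np

def off_log(corr):
    """Map a full-rank correlation matrix to a symmetric zero-diagonal matrix
    by taking the matrix logarithm and removing its diagonal."""
    w, v = np.linalg.eigh(corr)
    log_c = (v * np.log(w)) @ v.T
    return log_c - np.diag(np.diag(log_c))

def principal_angles(a, b):
    """Principal angles (radians) between the column spaces of a and b."""
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    s = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

corr = np.array([[1.0, 0.5], [0.5, 1.0]])
m = off_log(corr)

# Sign flips of basis vectors do not change the subspace.
basis = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
angles_same = principal_angles(basis, -basis)
angles_orth = principal_angles(basis[:, :1], np.array([[0.0], [0.0], [1.0]]))
```

Because the images are ordinary symmetric matrices with zero diagonal, means and distances can be computed entrywise in closed form. The principal angles depend only on the subspaces, so the sign and rotation ambiguity of individual eigenvectors does not affect the comparison.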
Results
The Off–log metric showed increased sensitivity in permutation tests and matched or exceeded Riemannian and Euclidean baselines in classification tasks. Brain-age prediction performance was comparable, with Riemannian metrics performing better in two out of three cohorts. The Grassmannian method consistently outperformed Euclidean baselines, effectively highlighting disease-relevant networks.
Implications
The proposed geometric framework enhances the analysis of functional brain networks, potentially leading to better understanding and prediction of neurological conditions. Its scalability makes it suitable for large neuroimaging datasets, facilitating broader applications in clinical research and neuroscience.
PEARL: Unbiased Percentile Estimation via Contrastive Learning for Industrial-Scale Livestream Recommendation
Theory
Optimization
Reinforcement Learning
- PEARL addresses behavioral intensity imbalance in recommender systems.
- The framework uses nonparametric contrastive learning to estimate relative preference signals.
- It eliminates the need for auxiliary models for distribution estimation.
- Theoretical justification supports the unbiased nature of the proposed method.
Summary
The paper introduces PEARL, a novel framework designed to address the issue of behavioral intensity imbalance in recommender systems, particularly in the context of industrial-scale livestream recommendations. Traditional recommender systems often suffer from skewed feedback signals due to heterogeneous engagement patterns among users, leading to a disproportionate influence of highly active users on model training. PEARL aims to mitigate this bias by employing a nonparametric contrastive percentile approximation approach that focuses on modeling relative preference signals instead of absolute engagement levels. The framework utilizes real contrastive interaction samples to directly approximate percentile relationships, eliminating the need for auxiliary distribution estimation models. The authors provide theoretical justification for the unbiased nature of these pairwise comparisons in estimating preference signals. Additionally, PEARL incorporates a prediction-based bootstrapping mechanism for smoothing percentiles and a co-training strategy to enhance representation learning. Extensive offline experiments demonstrate that PEARL effectively reduces behavioral bias and improves recommendation performance across various ranking targets. The framework has been deployed in a production environment with billions of users, yielding significant real-world improvements in key metrics such as Watch Duration, Consumption Amount, Interaction Rate, and Report Rate.
Methodology
PEARL employs a nonparametric contrastive learning approach to model relative preference signals through pairwise comparisons of user interactions. It leverages real contrastive samples to approximate percentile relationships without relying on auxiliary distribution estimation. The framework also includes a prediction-based bootstrapping mechanism for handling sparse feedback and a co-training strategy to enhance model flexibility.
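The idea that pairwise comparisons against real samples estimate a percentile without bias can be illustrated with a Monte Carlo sketch. It assumes contrast samples are drawn uniformly from a reference pool of engagement values (for example, one user's history); the pool, the sampling scheme, and the function name are assumptions rather than PEARL's actual pipeline.

```python
import numpy as np

def pairwise_percentile(value, reference_values, num_pairs, rng):
    """Estimate the percentile of `value` within `reference_values` by comparing
    it against randomly drawn real samples. Each comparison is a 0/1 indicator
    whose expectation equals the empirical percentile."""
    refs = rng.choice(reference_values, size=num_pairs, replace=True)
    return np.mean(value > refs)

rng = np.random.default_rng(0)
history = rng.exponential(scale=30.0, size=10_000)  # skewed watch durations (synthetic)
target = np.median(history)
estimate = pairwise_percentile(target, history, 5_000, rng)
```

No density or CDF model is fitted here: the comparisons themselves carry the percentile information, which is why no auxiliary distribution estimator is required.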
Results
The offline experiments indicate that PEARL significantly mitigates behavioral bias and enhances recommendation performance across multiple ranking targets. In online A/B testing on a production livestream platform, PEARL achieved a +2.10% increase in Watch Duration, +0.80% in Consumption Amount, and +1.49% in Interaction Rate, along with a 6.91% reduction in Report Rate.
Implications
The findings suggest that PEARL can be effectively utilized in large-scale recommender systems to improve user experience by providing more accurate and fair recommendations, particularly in environments with diverse user engagement patterns. This approach may also inspire further research into nonparametric methods for debiasing in machine learning applications.
Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration
Reinforcement Learning
Computer Vision
Robotics
- Curiosity-driven exploration can be enhanced by integrating a persistent world model with episodic context.
- The proposed method utilizes online 3D reconstruction to maintain spatial persistence.
- The agent's policy is based on a transformer model that processes RGB observations to retain episodic history.
- The approach outperforms traditional active-mapping methods and generalizes to unseen environments.
Summary
This paper addresses the challenges of curiosity-driven exploration in 3D environments, particularly in sparse-reward, long-horizon tasks. The authors argue that traditional methods often fail due to agents getting trapped in local loops and receiving misleading rewards for revisiting states. They propose a novel approach that combines a persistent world model with episodic context to enhance exploration. The persistent model is realized through an online 3D reconstruction method, while the agent's policy is structured as a sequence model over RGB observations, allowing it to maintain an episodic history. This design enables effective exploration during training and allows the agent to navigate using only RGB frames during deployment. The proposed method outperforms active-mapping baselines and demonstrates zero-shot generalization to new environments, showcasing its adaptability to downstream tasks such as apple picking and image-goal navigation. The findings highlight the importance of spatial persistence and episodic context in scaling curiosity-driven exploration in complex environments.
Methodology
The authors developed a curiosity-driven reinforcement learning framework that employs an online 3D reconstruction method for spatial persistence and a transformer-based policy that processes RGB image sequences to maintain episodic context. This allows the agent to explore effectively without explicit mapping during deployment.
Results
The proposed agent, trained solely on curiosity in the Habitat Matterport 3D (HM3D) environment, outperformed active-mapping baselines and demonstrated zero-shot generalization to the Gibson dataset and AI-generated 3D worlds. Additionally, it showed superior performance in downstream tasks after minimal fine-tuning compared to policies trained from scratch.
Implications
This research has significant implications for the development of autonomous agents in complex environments, particularly in applications requiring exploration and navigation without explicit mapping. It suggests that enhancing curiosity through persistent models and episodic context can lead to more effective learning in sparse-reward scenarios.
When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks
Theory
- Stronger training triggers can enhance clean test accuracy in high-dimensional models.
- Attack success rates peak at a finite trigger strength before declining.
- The most damaging trigger direction aligns with the minimum eigenvector of the data covariance.
- The study provides a rigorous theoretical framework for analyzing backdoor poisoning attacks.
Summary
This paper investigates the counterintuitive behavior of backdoor poisoning attacks in high-dimensional settings, specifically focusing on regularized generalized linear models (GLMs) trained on Gaussian mixture data. The authors demonstrate that stronger training triggers can paradoxically increase clean test accuracy while also exhibiting a peak in attack success rates at a finite trigger strength. They identify that the most effective trigger direction corresponds to the minimum eigenvector of the data covariance. The study provides a rigorous theoretical framework for understanding these phenomena, proving three key results: (1) clean test accuracy increases with trigger strength, (2) attack success peaks at a finite strength before declining, and (3) the optimal attack direction is linked to low-variance data directions. These findings are validated through experiments with logistic regression and on CIFAR-10, and they hold consistently across models, including deep architectures such as ResNet-18. The research contributes to the understanding of backdoor attacks by offering a high-dimensional perspective that challenges classical assumptions based on lower-dimensional analyses.
Methodology
The authors analyze backdoor poisoning attacks using regularized generalized linear models (GLMs) in a high-dimensional setting, specifically in the proportional regime where the number of features and samples are of the same order. They derive closed-form expressions for key quantities related to clean accuracy and attack success, and extend their results to general convex loss functions. The theoretical analysis is complemented by empirical validation using logistic regression and deep learning models on datasets like CIFAR-10.
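The poisoning setup can be sketched as follows. The sketch assumes a two-class Gaussian mixture with labels in {-1, +1}, an additive trigger, and label flipping to a target class; the helper names and all numeric values are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

def min_variance_direction(cov):
    """Unit eigenvector of the covariance with the smallest eigenvalue."""
    w, v = np.linalg.eigh(cov)  # eigenvalues are returned in ascending order
    return v[:, 0]

def poison(x, y, direction, strength, fraction, target_label, rng):
    """Add `strength * direction` to a random fraction of samples and relabel them."""
    x, y = x.copy(), y.copy()
    idx = rng.choice(len(x), size=int(fraction * len(x)), replace=False)
    x[idx] += strength * direction
    y[idx] = target_label
    return x, y, idx

rng = np.random.default_rng(0)
cov = np.diag([4.0, 1.0, 0.25])
mu = np.array([1.0, 0.0, 0.0])
y = rng.choice([-1, 1], size=200)
x = y[:, None] * mu + rng.multivariate_normal(np.zeros(3), cov, size=200)

direction = min_variance_direction(cov)
x_p, y_p, idx = poison(x, y, direction, strength=2.0, fraction=0.1,
                       target_label=1, rng=rng)
```

A regularized GLM fitted to `(x_p, y_p)` could then be scored on clean test points and on triggered test points while sweeping `strength`, which is the experiment the theory describes in closed form.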
Results
The study proves that clean test accuracy increases with training trigger strength, that attack success rates peak at a specific trigger strength, and that the optimal attack direction is the minimum eigenvector of the data covariance. These results hold across various settings, including empirical risk minimization and population-risk minimization, and are supported by experimental results that align with the theoretical predictions.
Implications
The findings have significant implications for the security of machine learning models in safety-critical applications, suggesting that understanding the dynamics of backdoor attacks in high dimensions can inform better defense strategies. The insights gained could lead to the development of more robust models that are less susceptible to adversarial manipulation.
From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning
Reinforcement Learning
Large Language Models
Theory
- SCRL effectively decomposes hard problems into verifiable subproblems, enhancing learning signals.
- The framework allows for finer-grained credit assignment through subproblem-level normalization.
- SCRL outperforms traditional RLVR methods and strong curriculum-learning baselines on multiple benchmarks.
- The approach leads to improved exploration in challenging reasoning tasks.
Summary
This paper introduces SCRL (Subproblem Curriculum Reinforcement Learning), a novel framework designed to enhance the efficiency of reinforcement learning from verifiable rewards (RLVR) in large language models (LLMs) for mathematical reasoning tasks. Traditional RLVR methods often struggle with hard problems due to sparse rewards and ineffective credit assignment. SCRL addresses these issues by decomposing complex problems into a series of verifiable subproblems, creating a structured curriculum that allows for better learning signals from partial progress. The framework employs subproblem-level normalization to assign rewards based on the model's performance at each subproblem stage, facilitating finer-grained credit assignment. Theoretical analysis suggests that this approach helps lift hard problems out of gradient dead zones, making them more learnable. Empirical results demonstrate that SCRL significantly outperforms existing curriculum-learning baselines across various mathematical reasoning benchmarks, indicating improved exploration and learning efficiency.
Methodology
SCRL constructs a curriculum of verifiable subproblems from a reference solution to a hard problem. It employs subproblem-level normalization to independently normalize rewards at each subproblem position, allowing for detailed credit assignment based on the model's intermediate reasoning progress. The model is trained using a mixed group of rollouts that include both original problems and subproblems.
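Subproblem-level normalization can be sketched as a per-position version of group-relative advantage computation. The sketch assumes rewards are arranged as a `(rollouts, positions)` array and normalized by the per-position mean and standard deviation; this layout and the `eps` constant are assumptions, and the paper's exact formula may differ.

```python
import numpy as np

def subproblem_normalize(rewards, eps=1e-8):
    """Normalize rewards independently at each subproblem position.

    rewards: array of shape (num_rollouts, num_positions), where the last
    column can be read as the original hard problem.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)

# Four rollouts, two subproblems plus the full problem (never solved).
rewards = [[1, 1, 0],
           [1, 0, 0],
           [1, 1, 0],
           [0, 0, 0]]
adv = subproblem_normalize(rewards)
```

In this toy group the full-problem column is all zeros, so it yields zero advantage and no gradient signal, while the subproblem columns still separate better rollouts from worse ones. That is the sense in which partial progress supplies a learning signal on hard problems.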
Results
SCRL achieved average gains of +4.1 and +1.9 points over GRPO on Qwen3-4B-Base and Qwen3-14B-Base, respectively. On challenging benchmarks like AIME24, AIME25, and IMO-Bench, SCRL gained +3.7 points in pass@1 and +4.6 points in pass@64 on Qwen3-4B-Base, indicating enhanced performance in hard reasoning tasks.
Implications
The SCRL framework has the potential to significantly improve the training of LLMs on complex reasoning tasks, making it easier to tackle previously unsolved problems. Its ability to provide fine-grained credit assignment could lead to more robust and capable models in various applications requiring advanced reasoning.
Manifold-Guided Attention Steering
NLP
Large Language Models
Interpretability
- MAGS introduces a dynamic, trajectory-aware correction mechanism for attention heads in LLMs.
- The method is grounded in the observation that reasoning errors manifest as deviations from a low-dimensional correctness manifold.
- MAGS outperforms static steering approaches by up to 10.8% across multiple reasoning and generation benchmarks.
- The approach is validated through diagnostic experiments confirming the separability of correct and incorrect reasoning trajectories.
Summary
This paper addresses the issue of reasoning errors in large language models (LLMs) during multi-step tasks, despite the models having the necessary knowledge. The authors propose a novel approach called Manifold-Guided Attention Steering (MAGS), which is an adaptive intervention that dynamically corrects the attention outputs of individual heads based on their proximity to a low-dimensional correctness manifold. Unlike existing methods that apply fixed correction vectors, MAGS learns a low-dimensional subspace from contrastive pairs of correct and incorrect reasoning traces. During inference, it monitors the attention heads for deviations from this manifold and applies targeted corrections only when necessary. The authors validate their hypothesis that reasoning errors correspond to structured drift in attention outputs and demonstrate that MAGS significantly outperforms static steering methods across various benchmarks, including mathematical reasoning, code generation, and molecular generation, while maintaining low inference overhead.
Methodology
The authors developed MAGS by first hypothesizing that reasoning errors can be characterized as structured drifts in the output space of attention heads. They conducted diagnostic experiments to confirm the separability of correct and incorrect trajectories, then designed MAGS to monitor attention heads during inference and apply corrections only when deviations from the correctness manifold exceed a learned threshold.
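The monitor-then-correct loop can be sketched for a single attention head. This is a simplified stand-in: it fits a linear subspace by PCA on outputs from correct traces only and corrects by projection, whereas MAGS learns its subspace from contrastive pairs and uses a learned threshold. All names and values are hypothetical.

```python
import numpy as np

def fit_subspace(correct_outputs, rank):
    """Fit a rank-`rank` affine subspace (mean plus orthonormal basis) by PCA."""
    mean = correct_outputs.mean(axis=0)
    _, _, vt = np.linalg.svd(correct_outputs - mean, full_matrices=False)
    return mean, vt[:rank]

def steer(h, mean, basis, threshold):
    """Leave h unchanged if it lies within `threshold` of the subspace;
    otherwise replace it with its projection onto the subspace."""
    centered = h - mean
    projected = basis.T @ (basis @ centered)
    residual = centered - projected
    if np.linalg.norm(residual) <= threshold:
        return h
    return mean + projected

# Toy head outputs from correct traces lying along the first axis.
correct = np.outer(np.linspace(-2.0, 2.0, 9), [1.0, 0.0, 0.0])
mean, basis = fit_subspace(correct, rank=1)

near = steer(np.array([1.0, 0.01, 0.0]), mean, basis, threshold=0.1)
far = steer(np.array([1.0, 1.0, 0.0]), mean, basis, threshold=0.1)
```

Gating the correction on the residual norm is what makes the intervention trajectory-aware: heads that stay close to the subspace are left untouched, so on-manifold steps pay only the cost of the distance check.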
Results
MAGS consistently outperformed both unsteered baselines and static steering methods across various benchmarks, achieving improvements of up to 10.8% in performance while incurring negligible inference overhead. The method demonstrated effectiveness in mathematical reasoning tasks (MATH-500, GSM8K), code generation (HumanEval, MBPP), and molecular generation (SMILES).
Implications
The findings suggest that reasoning errors in LLMs can be effectively mitigated through adaptive interventions that leverage the geometric structure of attention outputs. This could lead to more reliable applications of LLMs in complex reasoning tasks and enhance their overall performance in real-world scenarios.
No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation
Time Series
- Climate emulation is fundamentally an out-of-distribution prediction task.
- Seasonal variations can serve as effective proxies for long-term climate shifts.
- Current state-of-the-art hybrid-ML emulators show significant performance degradation under realistic distribution shifts.
- Compositional generalization is crucial for improving the robustness of climate emulators.
Summary
This paper addresses the challenges of climate emulation using machine learning (ML) methods, particularly focusing on the issue of out-of-distribution (OOD) generalization. The authors highlight that while current ML emulators perform well on present climate data, their reliability under future climate shifts remains uncertain. They provide empirical evidence that climate change induces significant shifts in atmospheric state distributions, which standard evaluation protocols fail to capture. To tackle this, the authors propose a novel evaluation framework that utilizes seasonal variations as proxies for long-term climate shifts, allowing for a realistic assessment of emulator robustness without additional data collection. Their systematic analysis reveals that existing hybrid-ML emulators experience significant performance degradation under these realistic distribution shifts. The authors advocate for compositional generalization as a key strategy for enhancing robustness, demonstrating that physically motivated model decompositions can improve OOD performance with minimal trade-offs against in-distribution accuracy. This work emphasizes the need for a paradigm shift in the design and evaluation of climate emulators to ensure their reliability in the face of an uncertain future.
Methodology
The authors analyze 40 years of observation-constrained reanalysis data to characterize climate emulation as an OOD task. They introduce a zero-overhead evaluation framework that leverages seasonal shifts to assess emulator robustness. This framework allows for the quantification of performance degradation under realistic distribution shifts, isolating it from inherent prediction difficulties.
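A seasonal hold-out of the kind described can be sketched as a simple mask over calendar months. The month choices, the error values, and the definition of degradation as a plain difference are assumptions made for illustration; the paper's protocol for separating shift effects from inherent difficulty is not detailed in this summary.

```python
import numpy as np

def seasonal_split(months, held_out_months):
    """Return boolean (train, test) masks that hold out whole calendar months."""
    months = np.asarray(months)
    test = np.isin(months, held_out_months)
    return ~test, test

def degradation(in_dist_error, out_dist_error):
    """Extra error incurred on the held-out season relative to the training seasons."""
    return out_dist_error - in_dist_error

# Month index (1-12) of each sample in an existing dataset.
months = np.array([1, 2, 6, 7, 12])
train_mask, test_mask = seasonal_split(months, held_out_months=[6, 7, 8])
gap = degradation(in_dist_error=0.25, out_dist_error=0.75)
```

Because the split only re-partitions data that already exists, it adds no collection cost, which matches the "zero-overhead" description.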
Results
The study confirms that current hybrid-ML emulators degrade significantly when faced with realistic seasonal distribution shifts. The introduction of compositional generalization and physically motivated decompositions leads to substantial improvements in OOD performance, indicating a promising direction for future climate emulation efforts.
Implications
The findings suggest that improving the robustness of climate emulators is essential for reliable climate projections, which are critical for informing climate policy and risk management. The proposed evaluation framework and methodologies could be applied to enhance the performance of ML models in other domains facing similar distribution shift challenges.
Understanding Multimodal Failure in Action-Chunking Behavioral Cloning
Robotics
Generative Models
Theory
- Multimodal action distributions pose significant challenges in behavioral cloning.
- Posterior-prior regularization can enhance reliability but may lead to loss of multimodal information.
- The Lipschitz constant of the base-to-action mapping affects the ability to capture multiple modes.
- The paper provides a formal definition of multimodality and identifies key factors for its preservation.
Summary
This paper addresses the challenges of behavioral cloning in the context of multimodal action distributions, particularly when the same observation can lead to multiple valid actions. The authors investigate how different parameterizations of action-chunking policies can fail in various ways due to multimodality. They focus on two main approaches: latent-variable policies and action-space generative policies. For latent-variable methods, the study reveals that posterior-prior regularization can enhance deployment-time sampling reliability but may also suppress essential action-conditioned information, leading to a collapse of multimodality. Conversely, for action-space generative methods, the smoothness of the mapping from base to action space affects the ability to represent multiple modes. The paper provides a precise definition of multimodality and identifies critical factors that influence it, such as regularization strength and Lipschitz continuity. Empirical experiments on synthetic tasks and robotic simulations validate the theoretical findings, demonstrating the practical implications of multimodal failures in behavioral cloning.
Methodology
The authors conducted a theoretical analysis of multimodal behavioral cloning, focusing on latent-variable and action-space generative methods. They defined multimodality in the context of action distributions and explored the effects of regularization and Lipschitz continuity on the preservation of multimodal information. The study included empirical validation through experiments on synthetic multimodal tasks and robotic simulation benchmarks.
Results
The findings indicate that excessive regularization in latent-variable policies can suppress necessary action-conditioned information, leading to a collapse of multimodality. For action-space generative policies, a small Lipschitz constant can hinder the model's ability to assign substantial probability to well-separated modes. The experiments confirmed these theoretical insights, illustrating the practical implications of multimodal failures in behavioral cloning.
Implications
The insights from this study can inform the design of more robust behavioral cloning methods that effectively handle multimodal action distributions. By understanding the trade-offs associated with regularization and mapping smoothness, practitioners can improve the performance of imitation learning systems in complex environments, particularly in robotics.
Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws
NLP
Large Language Models
Optimization
- Optimizers significantly influence the spectral scaling laws of Transformer architectures, affecting how model capacity is utilized.
- Different optimizers can yield markedly different scaling behaviors, particularly in rare-token representation scenarios.
- Matched validation loss does not imply similar representation structures across different optimizers.
- Optimizer-induced spectral shifts can surpass the effects of architectural changes, emphasizing the importance of optimizer choice in model design.
Summary
This paper investigates the impact of different optimizers on the spectral scaling laws of feed-forward networks (FFNs) within Transformer architectures. While traditional scaling laws focus on model size, data, and compute, this work emphasizes the role of optimizers in shaping the learned representations. The authors analyze the eigenspectra of FFN representations, revealing that optimizers like AdamW and Muon yield significantly different spectral scaling behaviors, particularly in challenging token regimes. AdamW shows weak hard-rank scaling on rare-token representations, whereas Muon achieves near-linear scaling, indicating that the optimizer not only affects convergence speed but also the structure of learned representations. The study further demonstrates that matched validation loss does not guarantee similar representation structures, highlighting the necessity of considering optimizer effects in model design. The findings suggest that optimization should be treated as a critical axis of representation scaling, advocating for co-design of optimizers and architectures.
Methodology
The authors conducted experiments comparing various optimizers (AdamW, Muon, NorMuon, and rank-constrained Dion variants) while holding the Transformer architecture and feed-forward network width constant. They measured the eigenspectra of FFN representations using soft and hard spectral ranks to analyze how effectively added width translates into utilized spectral capacity. The study also stratified token representations by frequency to assess optimizer performance across different token regimes.
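The measurement described above can be sketched as follows. The summary does not define the soft and hard spectral ranks, so this sketch assumes two common choices: the exponential of the Shannon entropy of the normalized eigenvalues (soft) and the participation ratio (hard). The function names and the log-log fit for the scaling exponent are illustrative, not taken from the paper.

```python
import numpy as np

def spectral_ranks(features):
    """Return (soft_rank, hard_rank) of the feature covariance spectrum.

    Assumed definitions (not confirmed by the summary):
      soft rank = exp(Shannon entropy of the normalized eigenvalues)
      hard rank = participation ratio, (sum lam)^2 / sum lam^2
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(features) - 1, 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negatives
    p = eig / eig.sum()
    nz = p[p > 0]
    soft = float(np.exp(-(nz * np.log(nz)).sum()))
    hard = float(eig.sum() ** 2 / (eig ** 2).sum())
    return soft, hard

def scaling_exponent(widths, ranks):
    """Least-squares slope beta of log(rank) against log(width)."""
    slope, _ = np.polyfit(np.log(widths), np.log(ranks), 1)
    return float(slope)
```

Under these assumed definitions, both ranks equal the feature dimension for an isotropic spectrum and drop to 1 when a single direction carries all the variance. A slope near 1 means added FFN width is converted almost fully into used spectral capacity.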
Results
The results indicate that AdamW exhibits weak hard-rank scaling (β=0.29) on rare-token representations, while Muon achieves near-linear scaling (β=0.82). The spectral scaling exponents varied significantly between optimizers, with AdamW showing the largest hard-soft asymmetry. The findings reveal that optimizer choice critically shapes the geometry of learned representations, with implications for how capacity is allocated across token frequencies.
Implications
These findings suggest that the choice of optimizer is a fundamental aspect of model design that can influence not only convergence speed but also the effectiveness of representation learning. This could lead to more effective strategies for training large language models and other neural architectures by optimizing both the architecture and the optimizer in tandem.
IKNO: Infinite-order Kernel Neural Operators
Theory
Efficient ML
- Introduction of Infinite-order Kernel Neural Operator (IKNO) for enhanced expressivity in neural operators.
- Development of two constructions: IKNO-Vanilla and IKNO-TP, both optimized for computational efficiency.
- Empirical results show IKNO consistently achieves state-of-the-art accuracy across multiple PDE benchmarks.
- Significant improvements in scalability to large point clouds compared to existing methods.
Summary
The paper introduces the Infinite-order Kernel Neural Operator (IKNO), a framework designed to enhance the expressivity of neural operators used in solving partial differential equations (PDEs). Traditional neural operators rely on first-order kernel integral approximations, which limit their ability to aggregate global information effectively. IKNO addresses this limitation by using infinite-order kernel integrals, allowing for better information aggregation and propagation. The authors present two complementary constructions: IKNO-Vanilla, which computes a full-kernel resolvent via Kronecker eigendecomposition, and IKNO-TP, a tensor-product operator that composes per-axis resolvents. Both constructions are optimized for computational efficiency and significantly reduce the preprocessing cost. Empirical evaluations demonstrate that IKNO achieves state-of-the-art accuracy across various benchmarks and scales particularly well to large point clouds. The findings suggest that incorporating higher-order kernel integrals can substantially improve the performance of neural operators in scientific computing applications.
Methodology
The authors propose the IKNO framework, which constructs neural operators using infinite-order kernel integrals. They develop two variants: IKNO-Vanilla, which uses Kronecker eigendecomposition to compute the full-kernel resolvent, and IKNO-TP, which employs a tensor-product approach. Both methods include fast computation schemes that reduce preprocessing costs and maintain high efficiency.
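A rough sketch of the idea behind a Kronecker-structured resolvent, under assumptions the summary does not confirm: the kernel matrix factors as K = kron(A, B) with symmetric A and B, and "infinite order" is read as summing the Neumann series I + K + K² + ..., which equals (I - K)^{-1} when the series converges. The function names are hypothetical and this is not the paper's implementation.

```python
import numpy as np

def kron_resolvent_apply(A, B, x):
    """Apply (I - kron(A, B))^{-1} to x using only per-axis eigendecompositions.

    Assumes A and B are symmetric and that no product of their eigenvalues
    equals 1, so the resolvent exists.
    """
    lam_a, U_a = np.linalg.eigh(A)
    lam_b, U_b = np.linalg.eigh(B)
    X = x.reshape(A.shape[0], B.shape[0])      # row-major: kron(A, B) x <-> A X B^T
    Y = U_a.T @ X @ U_b                        # rotate into the joint eigenbasis
    Y = Y / (1.0 - np.outer(lam_a, lam_b))     # invert 1 - lam_a[i] * lam_b[j]
    return (U_a @ Y @ U_b.T).reshape(-1)

def neumann_apply(K, x, order):
    """Truncated series sum_{k=0}^{order} K^k x, the finite-order counterpart."""
    out, term = x.copy(), x.copy()
    for _ in range(order):
        term = K @ term
        out = out + term
    return out
```

The point of the factored form is cost: the dense kernel of size (n_a n_b) × (n_a n_b) is never formed, and only the two small per-axis matrices are decomposed.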
Results
IKNO demonstrates superior performance across 15 benchmark datasets, achieving state-of-the-art accuracy and significant improvements over previous models. IKNO-TP outperforms IKNO-Vanilla in most cases, while both variants maintain scalability to large datasets.
Implications
The IKNO framework has potential applications in various fields requiring the solution of PDEs, such as physics, finance, and engineering design. Its ability to efficiently handle large-scale datasets and improve model performance could lead to advancements in computational methods for scientific simulations.
Physics-Informed Generative Solver: Bridging Data-Driven Priors and Conservation Laws for Stable Spatiotemporal Field Reconstruction
Generative Models
Theory
Time Series
- Introduces a physics-informed generative framework for spatiotemporal field reconstruction.
- Decouples training and inference processes to enhance stability and physical consistency.
- Demonstrates effectiveness in acoustic systems and generalizes to chaotic flows and meteorological fields.
- Addresses the limitations of traditional data-driven methods in the context of sparse measurements.
Summary
This paper presents a novel framework called the Physics-Informed Generative Solver (PIGS) designed to address the challenge of reconstructing spatiotemporal fields from sparse measurements, a common issue in physical sciences. Traditional data-driven methods often fail to respect the governing dynamics of the systems being modeled. The authors propose a two-step approach: during training, they utilize Martingale-Regularized Score Matching (MRSM) to create a stable generative prior by coupling denoising score matching with a Score Fokker–Planck Equation constraint. This ensures that the generative model adheres to a reverse martingale property, enhancing stability. During inference, Physics-Informed Implicit Score Sampling (PI-ISS) is employed to project samples towards the physical manifold by back-propagating conservation-law residuals, allowing for flexible reconstruction from extremely sparse and incomplete data while maintaining physical consistency. The framework is demonstrated in acoustic systems, where it effectively generates coupled pressure and particle velocity fields, and is also shown to generalize to chaotic Kolmogorov flows and large-scale meteorological fields. This work establishes a robust paradigm for solving high-dimensional inverse problems by integrating generative AI with first-principles science.
Methodology
The methodology involves two main components: Martingale-Regularized Score Matching (MRSM) for training, which combines denoising score matching with a Score Fokker–Planck Equation constraint to stabilize the generative prior, and Physics-Informed Implicit Score Sampling (PI-ISS) for inference, which projects samples towards the physical manifold by incorporating conservation-law residuals.
Results
The proposed framework successfully reconstructs coupled pressure and particle velocity fields from sparse measurements in acoustic systems, effectively transforming sparse physical arrays into dense virtual arrays and mitigating spatial aliasing. Additionally, it demonstrates generalizability to chaotic Kolmogorov flows and large-scale meteorological fields under extreme data sparsity.
Implications
This work has significant implications for various fields in physical sciences where spatiotemporal field reconstruction is critical, such as acoustics, meteorology, and fluid dynamics. It provides a robust method for dealing with sparse data, potentially leading to advancements in real-time monitoring and predictive modeling.
Value-Gradient Hypothesis of RL for LLMs
Reinforcement Learning
Large Language Models
Theory
- Critic-free RL methods like PPO and GRPO can effectively improve LLMs despite traditional RL concerns about credit assignment.
- The actor update in critic-free RL is shown to be value-gradient-like in expectation.
- Empirical costates in discrete transformers approximate the value gradient, with controlled error margins.
- A predictive decomposition of RL impact into value-gradient signals and reward headroom is developed.
Summary
This paper investigates the effectiveness of critic-free reinforcement learning (RL) methods, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), in the context of post-training large language models (LLMs). The authors propose a value-gradient perspective to explain why these methods perform well despite traditional RL theories suggesting they should struggle with long-horizon credit assignment. They demonstrate that the actor update in critic-free RL is value-gradient-like in expectation, particularly under a differentiable rollout and additive-noise parameterization. Furthermore, they show that in discrete transformer policies, the empirical costates derived from autodifferentiation approximate the value gradient, with errors influenced by sampling gaps and policy entropy. This leads to a decomposition of RL impact into usable value-gradient signals and reachable reward headroom, providing a criterion for identifying when RL is most beneficial during pretraining. The findings suggest that RL is most effective at checkpoints that are close enough to the value-gradient regime while being far enough from saturation to allow for reward-enhancing trajectories.
Methodology
The authors utilize a theoretical framework based on differentiable rollouts and additive-noise parameterization to analyze the actor updates in critic-free RL. They employ autodifferentiation techniques to derive empirical costates in discrete transformer architectures, allowing them to approximate the value gradient. The methodology includes a decomposition of RL impact into usable signals and reward headroom, which is empirically validated.
Results
The study confirms that the actor updates in critic-free RL methods carry a value-gradient-like signal, and that the empirical costates computed through autodifferentiation effectively approximate the expected value gradient. The proposed RL-impact decomposition provides a reliable criterion for determining the effectiveness of RL at various checkpoints during pretraining.
Implications
The findings have significant implications for the development of more effective RL strategies in LLMs, particularly in optimizing training processes and improving model performance. The insights into when RL is most beneficial can guide practitioners in selecting appropriate checkpoints for training, potentially leading to more efficient use of computational resources.
The Value of Covariance Matching in Gaussian DDPMs and the Lanczos Sampler
Generative Models
Theory
Efficient ML
- Full covariance matching reduces path-KL error from Ω(1/T) to O(1/T²).
- The Lanczos Gaussian Sampler (LGS) enables practical sampling from optimal covariance without dense storage.
- LGS achieves improved sample quality over strong diagonal-covariance baselines with minimal computational overhead.
- The method leverages Jacobian-vector products to compute covariance-vector products efficiently.
Summary
This paper addresses a significant limitation in Gaussian Denoising Diffusion Probabilistic Models (DDPMs) related to the path-space KL divergence between the exact reverse chain and the learned Gaussian reverse process. The authors demonstrate that traditional isotropic reverse covariances incur a path-KL error of Ω(1/T), meaning it cannot decay faster than 1/T as the number of denoising steps T increases. They show that matching the full posterior covariance breaks this barrier, improving the path-KL error to O(1/T²). To make full covariance matching practical, the authors introduce the Lanczos Gaussian Sampler (LGS), a method that samples from the optimal reverse covariance using only covariance-vector products derived from Jacobian-vector products of the posterior mean. LGS avoids the need for dense covariance storage and auxiliary models, and the authors prove that its approximation error decreases exponentially with the number of Lanczos steps. Empirical results show that using just three Lanczos steps significantly enhances sample quality compared to existing methods, establishing full covariance matching as both theoretically beneficial and practically feasible for efficient DDPM sampling.
Methodology
The authors develop the Lanczos Gaussian Sampler (LGS), which samples from the optimal reverse covariance using covariance-vector products derived from Jacobian-vector products. This matrix-free approach avoids the need for dense covariance storage and auxiliary models, focusing on the full posterior covariance through a limited number of Lanczos iterations.
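The matrix-free idea can be illustrated with the standard Lanczos approximation of a matrix square root applied to a vector: after k steps started from z, Σ^{1/2} z ≈ ‖z‖ V_k T_k^{1/2} e_1, where V_k is the Lanczos basis and T_k the tridiagonal matrix. This is a generic sketch, not the paper's sampler; `matvec` is a hypothetical stand-in for the covariance-vector product that the paper obtains from Jacobian-vector products.

```python
import numpy as np

def lanczos_sqrt_apply(matvec, z, num_steps):
    """Approximate Sigma^{1/2} z with at most `num_steps` Lanczos iterations.

    `matvec(v)` must return Sigma @ v for a symmetric positive semi-definite
    Sigma; the matrix itself is never formed or stored.
    """
    beta0 = np.linalg.norm(z)
    basis = [z / beta0]
    alphas, betas = [], []
    for j in range(num_steps):
        w = matvec(basis[j])
        alpha = float(basis[j] @ w)
        alphas.append(alpha)
        w = w - alpha * basis[j]
        if j > 0:
            w = w - betas[-1] * basis[j - 1]
        # Full reorthogonalization keeps the small basis numerically orthogonal.
        for q in basis:
            w = w - (q @ w) * q
        beta = np.linalg.norm(w)
        if j == num_steps - 1 or beta < 1e-12:
            break
        betas.append(float(beta))
        basis.append(w / beta)
    # Tridiagonal T_k; its square root is taken through a small eigendecomposition.
    T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    theta, S = np.linalg.eigh(T)
    sqrt_T_e1 = S @ (np.sqrt(np.clip(theta, 0.0, None)) * S[0])
    return beta0 * (np.column_stack(basis) @ sqrt_T_e1)
```

With z drawn from a standard normal, `mean + lanczos_sqrt_apply(matvec, z, k)` is an approximate sample from a Gaussian with covariance Σ, at the cost of k covariance-vector products per sample.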
Results
LGS significantly reduces path-KL error and improves sample quality across standard image benchmarks, outperforming existing methods such as OCM-DDPM. The empirical results confirm that as few as three Lanczos steps are enough to improve sampling efficiency and sample quality.
Implications
The findings suggest that full covariance matching can be a viable strategy for improving the performance of Gaussian DDPMs, particularly in applications requiring accurate reverse trajectories, such as classifier-guided sampling. The LGS method offers a practical solution for high-dimensional data scenarios where traditional covariance modeling is infeasible.
Hierarchical Variational Policies for Reward-Guided Diffusion
Generative Models
Computer Vision
Efficient ML
- Introduces a unified framework for test-time guidance in diffusion models using hierarchical variational policies.
- Develops Amortized HVP (AHVP) for efficient generation of high-quality reward-aligned samples.
- Presents Semi-Amortized HVP (SHVP) that combines amortized proposals with test-time refinement for improved quality.
- Achieves over 5× faster inference with better perceptual quality compared to leading methods on inverse problems.
Summary
This paper presents a novel framework for adapting pretrained diffusion models to various downstream tasks, particularly inverse problems, by utilizing hierarchical variational policies. The authors propose a method that significantly reduces the computational cost associated with test-time adaptation, which typically involves expensive gradient evaluations or optimizations. Their approach introduces a lightweight stochastic policy that amortizes control, allowing for efficient few-step diffusion sampling. The framework is designed to maintain high sample quality while achieving faster inference times. The authors introduce two key methods: Amortized HVP (AHVP), which learns an initial noise distribution and per-step stochastic policies, and Semi-Amortized HVP (SHVP), which combines cheap amortized proposals with limited test-time optimization. Empirical results demonstrate that these methods outperform existing baselines in terms of perceptual quality and inference speed across multiple challenging inverse problems.
Methodology
The authors formulate test-time adaptation as a hierarchical variational model, where a lightweight stochastic policy is trained to guide the denoising process. This involves learning an initial noise distribution and per-step stochastic controllers through variational inference. The method allows for a single forward pass during inference, replacing the need for repeated optimizations.
Results
The proposed methods, AHVP and SHVP, achieve superior perceptual quality and significantly faster inference times compared to existing test-time scaling baselines. For instance, on 4× super-resolution tasks, AHVP provides better perceptual quality with over 5× faster inference than the best-performing baseline.
Implications
This work has significant implications for real-time applications in computer vision, where efficient and high-quality image generation is critical. The framework can be applied to various inverse problems, enhancing the usability of diffusion models in practical scenarios.
Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework
Reinforcement Learning
Large Language Models
- Introduction of a multi-reward RLIF framework that combines answer-level and completion-level rewards.
- Implementation of GDPO normalization to mitigate reward-scale imbalance.
- Use of KL-Cov regularization to prevent entropy collapse and maintain exploration.
- Demonstrated improved performance and stability over single-reward RLIF methods.
Summary
This paper introduces a novel multi-reward framework for Reinforcement Learning from Internal Feedback (RLIF) aimed at enhancing the reasoning capabilities of large language models (LLMs) without relying on external supervision. Traditional RLIF methods often depend on a single internal reward, which can lead to issues such as reward hacking and entropy collapse, ultimately degrading the model's reasoning structure. The authors propose a multi-reward approach that combines two distinct reward signals: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To address the imbalance in reward scales, they utilize GDPO-based normalization. Additionally, they introduce KL-Cov regularization to target low-entropy token distributions, preserving exploration and preventing model collapse during training. The proposed framework demonstrates improved stability and robustness across various mathematical reasoning and code-generation benchmarks, achieving performance levels comparable to supervised RL methods. The findings indicate that using complementary internal rewards, along with targeted regularization, can effectively support stable long-horizon reasoning in LLMs.
Methodology
The authors developed a multi-reward RLIF framework that integrates two types of intrinsic rewards: an answer-level reward derived from cluster voting and a completion-level reward based on token-wise self-certainty. They applied GDPO normalization to balance the influence of each reward and introduced KL-Cov regularization to target tokens that contribute to entropy collapse while preserving diversity in the token distribution.
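To make the reward-scale issue concrete, here is a minimal sketch of decoupled normalization. It assumes that GDPO-style normalization standardizes each reward separately within a group of completions sampled for the same prompt and then sums the results; the summary does not spell out the exact rule, and the function name and array layout are hypothetical.

```python
import numpy as np

def decoupled_group_advantages(rewards, eps=1e-8):
    """Combine several rewards for one prompt's group of sampled completions.

    rewards: array of shape (group_size, num_rewards).
    Each reward column is standardized within the group before summing, so a
    reward with a large raw scale cannot drown out the others.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    per_reward = (rewards - mean) / (std + eps)
    return per_reward.sum(axis=1)
```

For example, if column 0 holds a 0/1 cluster-voting reward and column 1 a self-certainty score, rescaling column 1 by any positive constant leaves the combined advantages essentially unchanged. Summing the raw rewards first would let the larger-scale signal dominate.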
Results
The proposed framework outperformed several baseline methods, including single-reward RLIF approaches, across five mathematical reasoning benchmarks and two coding benchmarks. The results indicated enhanced stability and robustness during continued training, which is critical for maintaining effective reasoning capabilities in LLMs.
Implications
This research suggests that multi-reward systems can significantly improve the training of LLMs in unsupervised settings, making them more scalable and applicable across various domains where external supervision is limited or unavailable. The findings could lead to advancements in AI reasoning capabilities and broader applications in areas requiring complex problem-solving.
TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes
Theory
Optimization
Efficient ML
- Introduces TBP and RTBP for exact doubly stochastic mixing matrices.
- Achieves minimal parameterization and full expressivity without iterative normalization.
- Demonstrates improved stability and scalability in empirical evaluations.
- Addresses trade-offs between exactness, expressivity, memory efficiency, and speed.
Summary
This paper introduces the Transportation Birkhoff Polytope (TBP) parameterization and its recursive variant (RTBP) to enhance the expressivity and stability of Hyper-Connections (HC) in deep neural networks. Traditional HC methods improve residual networks by enabling learnable mixing across multiple streams, but they often suffer from training instability due to unconstrained mixing. The proposed TBP and RTBP methods ensure exact doubly stochasticity without iterative normalization, thus avoiding the combinatorial explosion associated with previous methods. TBP achieves minimal parameterization with (n-1)² degrees of freedom while covering the entire Birkhoff polytope, making it optimal in terms of exactness, expressivity, and memory efficiency. The RTBP variant allows for partial parallelization, addressing speed limitations inherent in TBP. Empirical evaluations demonstrate that TBP and RTBP outperform existing methods like mHC, mHC-lite, and KromHC in terms of stability and scalability, particularly in language model pre-training tasks.
Methodology
The paper proposes a new parameterization of the Birkhoff polytope through transportation polytopes, utilizing a greedy algorithm for construction. The recursive variant decomposes the problem into smaller subproblems to enhance computational efficiency while maintaining exactness in the mixing matrices.
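For readers unfamiliar with transportation polytopes: they are the sets of nonnegative matrices with prescribed row and column sums, and the Birkhoff polytope is the special case where every sum equals 1. The summary does not describe the paper's greedy step, so the sketch below shows only the classical northwest-corner rule, a textbook greedy construction of one point in a transportation polytope. It is an illustration, not the TBP parameterization.

```python
def northwest_corner(row_sums, col_sums, tol=1e-12):
    """Greedy construction of a nonnegative matrix with the given marginals."""
    rows = [float(r) for r in row_sums]
    cols = [float(c) for c in col_sums]
    if abs(sum(rows) - sum(cols)) > 1e-9:
        raise ValueError("row and column sums must have the same total")
    plan = [[0.0] * len(cols) for _ in rows]
    i = j = 0
    while i < len(rows) and j < len(cols):
        moved = min(rows[i], cols[j])   # place as much mass as both margins allow
        plan[i][j] += moved
        rows[i] -= moved
        cols[j] -= moved
        if rows[i] <= tol:              # row i is exhausted: move down
            i += 1
        else:                           # otherwise column j is exhausted: move right
            j += 1
    return plan
```

With all marginals equal to 1 the rule returns the identity matrix, which is a vertex of the Birkhoff polytope. A full parameterization such as the one the paper proposes has to reach interior points as well, which this rule alone does not do.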
Results
Numerical experiments show that TBP and RTBP outperform existing methods in terms of stability and scalability, particularly in language model pre-training. The results indicate that these new methods provide qualitative advantages over previous approaches.
Implications
The findings suggest that TBP and RTBP can be effectively utilized in various deep learning applications, particularly in scenarios requiring stable and scalable training of models with complex architectures. The public availability of the code encourages further exploration and application in the field.
Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization
Optimization
Efficient ML
Interpretability
- Proposes a machine learning framework for anomaly detection in smart grids using PMU/IED measurements.
- Implements a genetic algorithm for feature selection, significantly reducing the feature space.
- Demonstrates that tree-based ensemble models, especially Extra Trees, outperform other baseline models.
- Achieves improved detection metrics while maintaining a reduced set of informative features.
Summary
This paper addresses the challenge of detecting cyber-physical anomalies in IoT-enabled smart grids, which are increasingly vulnerable to both physical incidents and cyber-attacks. The authors propose a framework that combines machine learning techniques with genetic algorithm-based feature selection to enhance the detection of anomalies. Utilizing the MSU/ORNL Power System Attack Dataset, the study formulates the problem as a binary classification task to differentiate between natural events and malicious attacks. Various machine learning models, including logistic regression, RBF-SVM, XGBoost, Random Forest, and Extra Trees, are evaluated. The results indicate that tree-based ensemble models, particularly Extra Trees, perform best on the dataset. The proposed feature selection method reduces the feature space from 112 to an average of 27.4 attributes, while improving detection metrics such as macro-F1 and ROC-AUC. This demonstrates that a compact set of phasor-based features can effectively support reliable anomaly detection, highlighting the potential for reducing computational overhead and enhancing interpretability in smart grid applications.
Methodology
The methodology involves formulating the anomaly detection problem as a binary classification task using the MSU/ORNL dataset. A genetic algorithm is employed for feature selection to identify a smaller subset of PMU/IED features. Several machine learning models, including logistic regression, RBF-SVM, XGBoost, Random Forest, and Extra Trees, are evaluated based on their performance metrics suitable for imbalanced data.
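A genetic algorithm over feature masks can be sketched in a few lines. The summary does not give the paper's operators or fitness function, so everything below is an assumption: tournament selection, one-point crossover, bit-flip mutation, elitism, and a toy fitness that stands in for a validation score (such as macro-F1 of a classifier trained on the selected features).

```python
import random

def ga_select(num_features, fitness, pop_size=30, generations=40,
              mutation_rate=0.02, seed=0):
    """Evolve boolean feature masks that maximize `fitness(mask)`."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(num_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]                           # elitism: keep the best mask
        while len(children) < pop_size:
            a = max(rng.sample(pop, 3), key=fitness)   # tournament selection
            b = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, num_features)       # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation: each bit flips with probability mutation_rate.
            child = [bit != (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best

# Hypothetical toy fitness: reward three "informative" features and charge
# a small cost per selected feature, mimicking accuracy versus subset size.
INFORMATIVE = {0, 3, 7}

def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)

best_mask = ga_select(12, toy_fitness)
```

In a real pipeline the fitness call dominates the cost, since each evaluation trains and validates a model, so caching scores per mask is a common refinement.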
Results
The study found that GA-based feature selection reduced the feature set from 112 to an average of 27.4 attributes, while the GA + Extra Trees model improved the macro-F1 score from 0.9118 to 0.9212 and the ROC-AUC from 0.9791 to 0.9837. This indicates that many features in the original dataset were redundant and that a smaller subset can still support accurate anomaly detection.
Implications
The findings suggest that implementing a compact feature set can lead to more efficient anomaly detection systems in smart grids, which is crucial for enhancing cybersecurity measures. This approach can be particularly beneficial for edge devices and IoT applications, where computational resources are limited.
Evolutionary Multi-Task Optimization for LLM-Guided Program Discovery
Optimization
Large Language Models
Generative Models
- Introduction of EMO-STA framework for efficient multi-task program discovery.
- Demonstrated improvement over single-task evolutionary methods in various settings.
- Adaptation strategies enhance performance for both seen and unseen tasks.
- Shared evolution reduces overfitting by promoting generalizable solutions.
Summary
This paper introduces Evolutionary Multi-Task Optimization (EMO) for LLM-guided program discovery, addressing the inefficiencies of existing methods that optimize tasks independently despite their structural similarities. The authors propose a two-stage framework, EMO-STA (Shared-Then-Adapt), which first evolves a shared archive of executable programs across a family of related tasks and then adapts selected candidates to individual tasks. The framework explores various adaptation strategies, including warm-starting from the shared archive, adapting the best average shared program, and adapting the best program for each target task. The authors demonstrate that EMO-STA outperforms single-task evolution across eight task families, enhancing both in-distribution adaptation and transfer to unseen tasks. The study also highlights the compute efficiency of shared evolution and its ability to mitigate overfitting in low-evidence scenarios, suggesting that a balanced allocation of resources between shared and task-specific evolution is optimal for performance.
Methodology
The methodology involves a two-stage evolutionary process: first, a shared evolution phase that aggregates performance across related tasks to create a reusable program archive, followed by a task-specific adaptation phase where selected programs are refined for individual tasks using one of three initialization strategies (Warmstart, Best-Shared, Best-Local).
Results
EMO-STA consistently outperformed matched-compute single-task evolution across diverse task families, with the Best-Local strategy yielding the best in-distribution adaptation and the Best-Shared strategy providing robust performance on unseen tasks. The framework also demonstrated compute efficiency and reduced overfitting in low-data scenarios.
Implications
The findings suggest that evolutionary multi-task optimization can significantly enhance the efficiency and effectiveness of program discovery in various domains, including scientific computing and algorithmic optimization. The approach may be applied to improve LLM-guided tasks in agentic AI systems and other areas requiring adaptive program generation.
Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection
Optimization
Theory
- Objective design critically influences the performance of multiobjective unsupervised feature selection.
- Silhouette-based formulations often lead to low-cardinality, less informative solutions.
- The PCA reconstruction loss objective provides a better balance between subset compactness and predictive performance.
- Subset-size regularization and initial population strategies significantly shape the Pareto front structure.
Summary
This paper investigates the impact of objective design on multiobjective unsupervised feature selection (UFS), focusing on how different evaluation objectives and subset-size regularization strategies influence search dynamics and the quality of selected feature subsets. The authors conduct experiments using a synthetic dataset with known feature types to analyze the behavior of six formulations combining three evaluation objectives: accuracy, silhouette score, and PCA reconstruction loss, with either subset-size minimization or maximization. The findings reveal that the choice of formulation significantly affects both the search dynamics and the resulting Pareto front. Specifically, silhouette-based formulations tend to favor trivial low-cardinality solutions, while the proposed PCA loss objective yields compact subsets with test accuracy comparable to those obtained through supervised accuracy optimization. The study emphasizes the importance of objective design in effective multiobjective UFS, highlighting how different objectives can lead to varying trade-offs and biases in feature selection.
Methodology
The authors employed a controlled experimental framework using a synthetic dataset designed to include informative, redundant, and irrelevant features. They compared six formulations of multiobjective UFS by varying evaluation objectives (accuracy, silhouette score, PCA reconstruction loss) and subset-size strategies (minimization vs. maximization). The analysis focused on how these factors influenced the structure of the Pareto front and the composition of selected feature subsets.
Results
The study found that the formulation used for multiobjective UFS significantly impacts search dynamics and the quality of the resulting feature subsets. Silhouette-based approaches were biased towards trivial solutions, while the PCA loss objective produced compact subsets with competitive accuracy. The analysis also revealed how different objectives affect the exploration of the search space and the attainable trade-offs in feature selection.
Implications
The findings suggest that careful design of evaluation objectives is essential for improving the effectiveness of unsupervised feature selection methods. The insights gained from this study can inform the development of more robust feature selection algorithms that are less prone to bias and better suited for real-world applications where feature relevance is uncertain.
Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation
NLP
Large Language Models
Reinforcement Learning
- Introduced a large-scale dataset of 11,488 research idea pairs for comparative forecasting.
- Achieved a significant accuracy improvement from 30% to 77.1% using Supervised Fine-Tuning.
- Outperformed GPT-5 by over 10 percentage points while being more compute-efficient.
- Demonstrated robustness against superficial heuristics and effective reasoning capabilities.
Summary
This paper addresses the challenge of evaluating and filtering numerous AI-generated research ideas, which is crucial as language models (LMs) increasingly automate hypothesis generation. The authors propose a method for comparative empirical forecasting, where LMs predict which of two candidate ideas will achieve better performance on a specific benchmark. They constructed a dataset comprising 11,488 idea pairs based on objective outcomes from PapersWithCode. Initial experiments with off-the-shelf 8B-parameter models yielded low accuracy (30%). However, by employing Supervised Fine-Tuning (SFT), they significantly improved performance to 77.1%, surpassing GPT-5's 61.1%. The authors further enhanced the model's reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR), achieving 71.35% accuracy with interpretable justifications. The study demonstrates that small, compute-efficient LMs can effectively serve as objective verifiers for research ideas, providing a scalable approach to scientific discovery. The findings suggest that LMs can internalize prior research patterns to discriminate between competing ideas before experimental validation.
Methodology
The authors constructed a dataset of research idea pairs and framed the problem as a preference-prediction task. They first applied Supervised Fine-Tuning (SFT) on two curated datasets, then Reinforcement Learning with Verifiable Rewards (RLVR) to capture intermediate reasoning paths. The models were trained to output binary winner labels for the idea pairs, with an emphasis on interpretability.
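The preference-prediction setup can be sketched as a scoring loop over labeled pairs. The example below is a sketch only: the `"A"`/`"B"` label encoding, the toy predictor, and the order-swap check are assumptions chosen for illustration, and the summary does not say which stress tests the authors actually ran.

```python
from typing import Callable, List, Tuple

# A pair is (idea_a, idea_b, winner), where winner is "A" or "B".
Pair = Tuple[str, str, str]
Predictor = Callable[[str, str], str]


def pairwise_accuracy(predict: Predictor, pairs: List[Pair]) -> float:
    """Fraction of pairs where the predicted winner matches the label."""
    correct = sum(predict(a, b) == winner for a, b, winner in pairs)
    return correct / len(pairs)


def swap_consistency(predict: Predictor, pairs: List[Pair]) -> float:
    """Fraction of pairs where the same idea wins after the presentation order is swapped."""
    flip = {"A": "B", "B": "A"}
    stable = sum(predict(a, b) == flip[predict(b, a)] for a, b, _ in pairs)
    return stable / len(pairs)


def longer_idea_wins(idea_a: str, idea_b: str) -> str:
    """Toy stand-in for a model: a superficial length heuristic."""
    return "A" if len(idea_a) >= len(idea_b) else "B"


# Hypothetical idea pairs, invented for illustration only.
toy_pairs = [
    ("add a retrieval module", "tune lr", "A"),
    ("use dropout", "pretrain on a larger multilingual corpus", "B"),
    ("swap the optimizer for a second-order method", "add data", "B"),
]
acc = pairwise_accuracy(longer_idea_wins, toy_pairs)
consistency = swap_consistency(longer_idea_wins, toy_pairs)
```

In a real evaluation, `longer_idea_wins` would be replaced by a call to the fine-tuned model; keeping a heuristic like this as a baseline is one way to check whether a model's accuracy could be explained by surface features alone.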
Results
The study found that while base models struggled with low accuracy (20.13%), the application of SFT led to a dramatic increase in performance to 77.1%. Models trained with reasoning capabilities achieved 71.35% accuracy, outperforming GPT-5. The models showed robustness in various stress tests, indicating a genuine understanding of the task rather than reliance on superficial heuristics.
Implications
The findings suggest that language models can effectively assist in the evaluation of research ideas, potentially streamlining the scientific discovery process. This could lead to more efficient hypothesis testing and a reduction in the number of experiments needed, ultimately accelerating research advancements.
Can Breath Biomarkers Causally Influence Blood Glucose? Investigating VOC-Mediated Modulation in Diabetes
Theory
Interpretability
- Establishes a causal framework linking breath VOCs to blood glucose levels.
- Develops a classifier to differentiate between diabetic and non-diabetic individuals.
- Introduces a risk-ranking system for individuals at risk of diabetes.
- Utilizes Gaussian Mixture Models for population clustering.
Summary
This study investigates the potential of breath volatile organic compounds (VOCs) as non-invasive biomarkers for early detection of diabetes. The authors employ causal inference techniques to explore the relationship between four specific VOCs (acetone, isopropanol, isoprene, and ethanol) and blood glucose levels. A structured causal framework is established to assess whether these VOCs causally influence glucose concentration, addressing a gap in the existing literature, which focuses primarily on correlations. The research includes a classifier to distinguish between diabetic and non-diabetic individuals, as well as a risk-ranking system for those in the 'gray zone' of diabetes risk. The study uses a dataset of 94 participants, including healthy individuals and those with varying degrees of diabetes, and applies Gaussian Mixture Models for clustering analysis. The findings indicate that certain VOCs have a significant causal impact on glucose levels, supporting their use in non-invasive diabetes screening and highlighting the potential for machine learning models to classify and stratify individuals at risk.
Methodology
The study employs a structured causal inference framework to analyze the effects of individual and combined VOCs on blood glucose levels. It includes data preprocessing, causal graph construction, and the use of the DoWhy library to estimate average treatment effects. The methodology also incorporates SHapley Additive exPlanations (SHAP) for interpretability, along with robustness checks against confounding factors.
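As a rough illustration of what estimating an average treatment effect (ATE) under confounding involves, the sketch below uses simple stratification on one discrete confounder. It is a simplified stand-in and not the DoWhy workflow the study used; the binary "treatment", the age-group confounder, and every number are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record is (treated, confounder_stratum, outcome).
Record = Tuple[int, str, float]


def stratified_ate(records: List[Record]) -> float:
    """Backdoor-adjusted ATE: per-stratum treated-minus-control mean, weighted by stratum size."""
    groups: Dict[str, Dict[int, List[float]]] = defaultdict(lambda: {0: [], 1: []})
    for treated, stratum, outcome in records:
        groups[stratum][treated].append(outcome)

    total = 0
    weighted_effect = 0.0
    for arms in groups.values():
        if not arms[0] or not arms[1]:
            continue  # no overlap in this stratum, so it cannot be compared
        n = len(arms[0]) + len(arms[1])
        effect = sum(arms[1]) / len(arms[1]) - sum(arms[0]) / len(arms[0])
        weighted_effect += n * effect
        total += n
    if total == 0:
        raise ValueError("no stratum contains both treated and control records")
    return weighted_effect / total


def naive_difference(records: List[Record]) -> float:
    """Unadjusted treated-minus-control difference in mean outcome."""
    treated = [y for t, _, y in records if t == 1]
    control = [y for t, _, y in records if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical records: (high VOC level?, age group, glucose reading). Invented values.
toy = [
    (1, "older", 140.0), (1, "older", 138.0), (1, "older", 142.0), (0, "older", 130.0),
    (1, "younger", 110.0), (0, "younger", 100.0), (0, "younger", 102.0), (0, "younger", 98.0),
]
adjusted = stratified_ate(toy)
unadjusted = naive_difference(toy)
```

In this toy data the unadjusted difference is larger than the adjusted one because the "older" group is both more often treated and has higher readings, which is the kind of confounding the robustness checks are meant to guard against.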
Results
The results indicate that specific VOCs exhibit a strong causal influence on blood glucose levels. The machine learning models developed can reliably classify individuals as diabetic or non-diabetic and stratify those at high risk. The clustering analysis reveals natural groupings within the population, enhancing understanding of diabetes risk.
Implications
The findings suggest that breath VOCs can serve as effective non-invasive biomarkers for early diabetes detection, potentially leading to improved screening methods and timely interventions. This research could pave the way for further studies on metabolic monitoring and personalized healthcare strategies.
Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Reinforcement Learning
Large Language Models
Optimization
- VPO focuses on generating diverse, competent solutions rather than converging on a single optimal response.
- The method leverages vector-valued rewards to encourage exploration of the Pareto frontier of multiple objectives.
- VPO consistently outperforms scalar RL baselines in test-time search scenarios, especially with larger candidate budgets.
- The approach allows for the resolution of complex problems that traditional methods fail to solve.
Summary
The paper introduces Vector Policy Optimization (VPO), a novel reinforcement learning (RL) algorithm designed to enhance the diversity of solutions generated by large language models (LLMs) during test-time search. Traditional post-training methods for LLMs often focus on optimizing a single scalar reward, which can lead to low-entropy response distributions and a lack of diversity in generated solutions. VPO addresses this by explicitly training policies to anticipate diverse downstream reward functions, allowing for the generation of multiple solutions that cater to different trade-offs in a vector reward space. The authors demonstrate that VPO outperforms existing scalar RL baselines across four diverse tasks, particularly as the search budget increases. This approach not only improves performance metrics like pass@k and best@k but also enables the solving of problems that previous methods could not address. The findings suggest that optimizing for diversity should become a standard objective in post-training for LLMs, especially in systems that utilize test-time search procedures.
Methodology
VPO combines multi-answer generation with stochastic reward scalarizations, training the model to produce sets of candidates that span the Pareto frontier of different reward dimensions. This allows the model to maintain a rich distribution of candidate solutions, which can be effectively utilized during test-time search.
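A small sketch can show why stochastic scalarization rewards a diverse candidate set. It assumes a linear weighted-sum scalarization with weights drawn uniformly from the simplex; the summary does not specify VPO's actual scalarization or sampling scheme, and the reward vectors and function names below are made up.

```python
import random
from typing import List, Sequence


def sample_simplex_weights(dim: int, rng: random.Random) -> List[float]:
    """Draw weights uniformly from the probability simplex (a flat Dirichlet)."""
    draws = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse a vector reward to a scalar with a weighted sum."""
    return sum(r * w for r, w in zip(reward, weights))


def best_candidate(rewards: List[Sequence[float]], weights: Sequence[float]) -> int:
    """Index of the candidate with the highest scalarized reward."""
    return max(range(len(rewards)), key=lambda i: scalarize(rewards[i], weights))


# Hypothetical two-dimensional rewards for three candidates (made up).
candidate_rewards = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9)]

rng = random.Random(0)
winners = {
    best_candidate(candidate_rewards, sample_simplex_weights(2, rng))
    for _ in range(200)
}
```

Each candidate is the best choice under some weighting, so a set that contains all three scores well in expectation over random scalarizations, whereas a set of three near-copies of one candidate would not.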
Results
VPO was evaluated across four tasks, where it matches or exceeds the strongest scalar RL baselines on metrics like pass@k and best@k. The performance gap widens as the search budget increases, and VPO solves problems that models trained with GRPO (Group Relative Policy Optimization) cannot, indicating an advantage in generating diverse solutions.
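For readers unfamiliar with pass@k, the sketch below implements a commonly used combinatorial estimator: given n generated samples of which c are correct, it returns the probability that a random size-k subset contains at least one correct sample. Whether the paper computes pass@k this way is an assumption; the summary does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a fixed number of correct samples, the value rises as k grows, which is why methods that spread probability over distinct solutions tend to benefit more from larger candidate budgets.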
Implications
The introduction of VPO has significant implications for the design of RL algorithms in AI systems, particularly those that incorporate search mechanisms. By prioritizing diversity in solution generation, VPO can enhance the effectiveness of LLMs in real-world applications, making them more adaptable to various tasks and user requirements.
Discovering Entity-Conditioned Lag Heterogeneity: A Lag-Gated Neural Audit Framework for Panel Time Series
Time Series
- Formulates entity-conditioned heterogeneous lag discovery as a testable panel time-series mining task.
- Introduces AC-GATE, which generates entity-level effective lags through an explicit lag gating structure.
- Proposes a layered audit protocol for evaluating forecast calibration and lag discovery.
- Demonstrates the ability of AC-GATE to recover true heterogeneous lag structures in synthetic data.
Summary
This paper addresses the challenge of auditing how different entities respond to historical signals over varying time horizons in panel time series data. Traditional methods often fail to provide auditable entity-specific lag summaries. The author formulates entity-conditioned heterogeneous lag discovery as a temporal panel mining task and introduces AC-GATE, an Adaptive-Conditioning Encoder with a Scale-Invariant Lag Gate. This framework enables the modeling of Conditional Moderated Distributed Lag (CMDL) by utilizing observable entity-level proxies to condition lag-weight distributions, making effective lags structural outputs rather than post-hoc explanations. The evaluation of AC-GATE employs a layered audit protocol that distinguishes between predictive calibration and lag discovery, using both synthetic data for mechanism recovery testing and real-world country-level panels for external audit and stress testing. The results demonstrate that AC-GATE successfully recovers heterogeneous lag structures in synthetic datasets and generates meaningful effective lags in real-world applications, thereby enhancing the interpretability and usability of panel time series models.
Methodology
The methodology involves the development of AC-GATE, which utilizes an Adaptive-Conditioning Encoder and a Scale-Invariant Lag Gate to model entity-conditioned heterogeneous lags. The evaluation is conducted through a layered audit protocol that separates predictive calibration from lag discovery, including structural ablations and proxy-shuffle controls to validate the model's effectiveness.
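The idea of an effective lag as a structural output can be sketched with a generic softmax gate over candidate lags. This is not AC-GATE's architecture, which the summary does not detail: the affine logits, the single scalar proxy, and the parameter values are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lag_weights(proxy: float, base: Sequence[float], slope: Sequence[float]) -> List[float]:
    """Gate over lags 1..L: each lag's logit is an affine function of the entity proxy."""
    return softmax([b + s * proxy for b, s in zip(base, slope)])


def effective_lag(weights: Sequence[float]) -> float:
    """Weight-averaged lag, with lags indexed from 1."""
    return sum((i + 1) * w for i, w in enumerate(weights))


# Hypothetical gate parameters for four lags (not fitted to any data).
BASE = [0.0, 0.0, 0.0, 0.0]
SLOPE = [-1.5, -0.5, 0.5, 1.5]  # a larger proxy shifts weight toward longer lags

fast_entity = effective_lag(lag_weights(-2.0, BASE, SLOPE))
neutral_entity = effective_lag(lag_weights(0.0, BASE, SLOPE))
slow_entity = effective_lag(lag_weights(2.0, BASE, SLOPE))
```

Because the effective lag is computed directly from the gate weights, it can be read off per entity without a separate attribution step; shuffling the proxy across entities would then be expected to remove the entity-specific variation if the gate truly depends on it.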
Results
The results indicate that AC-GATE can accurately recover heterogeneous lag structures in synthetic datasets and produce non-degenerate, externally structured effective lags in real-world country-level panels, demonstrating its effectiveness in addressing the limitations of existing models.
Implications
The findings suggest that AC-GATE can significantly improve the interpretability and usability of panel time series models, making it easier for researchers to understand entity-specific responses to historical signals. This has potential applications in various fields such as economics, environmental studies, and organizational behavior where understanding lag effects is crucial.