AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 55
Update frequency: 8h
Days of history: 7
Finding Stationary Points by Comparisons
Optimization
Theory
- Developed an algorithm for finding ϵ-stationary points using a comparison oracle with improved query complexity.
- Introduced a quantum algorithm for the same problem, showcasing the potential of quantum methods in optimization.
- Demonstrated that the algorithm's ϵ-dependence matches optimal rates of second-order methods, although with a higher dimension dependence.
- Identified the challenge of accessing gradient norms in the comparison oracle model, limiting the ability to confirm stationary points directly.
Summary
This paper addresses the challenge of locating stationary points of non-convex functions using a comparison oracle, which only provides information on which of two points has a larger function value. The authors propose an algorithm that can find an ϵ-stationary point with Õ(n²/ϵ^1.5) queries, leveraging a subroutine that estimates the normalized Hessian with Õ(n² log(1/δ)) queries. Additionally, they explore a quantum comparison oracle model, introducing the first quantum algorithm for this problem, which requires Õ(n/ϵ^1.5) queries. The work builds on existing optimization techniques and highlights the potential of comparison-based methods in non-convex optimization, particularly in scenarios where traditional gradient information is unavailable. The results indicate that the proposed algorithm achieves a better dependence on ϵ compared to previous methods, although it incurs a higher dimensionality cost. The authors also identify the need for further research into lower bounds for comparison-based stationary point finding.
Methodology
The authors developed an algorithm that utilizes a comparison oracle to determine the relative function values of pairs of points. The algorithm estimates the normalized Hessian and employs a structured approach to ensure that one of the queried points is an ϵ-stationary point. They also explored a quantum version of the algorithm, leveraging superposition to enhance query efficiency.
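To make the oracle model concrete, the following is a minimal sketch of a comparison oracle together with a toy coordinate search that uses comparisons only. It is not the paper's algorithm (which relies on normalized Hessian estimation); the names `make_oracle` and `comparison_search`, the step schedule, and the test function are all assumptions chosen for illustration.

```python
import numpy as np

def make_oracle(f):
    """Wrap f so callers learn only whether f(x) < f(y), never the values."""
    def is_smaller(x, y):
        return f(x) < f(y)
    return is_smaller

def comparison_search(is_smaller, x0, step=0.5, tol=1e-6, max_queries=10_000):
    """Toy coordinate search driven by comparisons only (illustrative, not the paper's method)."""
    x = np.asarray(x0, dtype=float).copy()
    queries = 0
    while step > tol and queries < max_queries:
        moved = False
        for i in range(x.size):
            for sign in (1.0, -1.0):
                cand = x.copy()
                cand[i] += sign * step
                queries += 1  # one oracle query per candidate
                if is_smaller(cand, x):
                    x, moved = cand, True
                    break
        if not moved:
            step *= 0.5  # no neighbour is better at this scale, so refine
    return x, queries

# Hypothetical non-convex test function with minima at x_i = +1 or -1.
f = lambda x: float(np.sum((x**2 - 1.0) ** 2))
x_hat, n_queries = comparison_search(make_oracle(f), [0.3, -0.4])

# The gradient is used here only to check the result; the search never sees it.
grad_norm = float(np.linalg.norm(4.0 * x_hat * (x_hat**2 - 1.0)))
```

The check at the end reflects the difficulty noted in the summary: inside the oracle model the gradient norm is not observable, so stationarity cannot be confirmed from comparisons alone.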
Results
The proposed algorithm guarantees finding an ϵ-stationary point with Õ(n²/ϵ^1.5) queries in the classical setting and Õ(n/ϵ^1.5) queries in the quantum setting. The results indicate that the algorithm's performance is competitive with existing methods, particularly in terms of ϵ-dependence, although it raises questions about the optimality of its dimensionality dependence.
Implications
This research has significant implications for optimization in machine learning, particularly in scenarios where function evaluations are costly or infeasible. The findings may enhance optimization techniques in various applications, including neural network training and preference-based reinforcement learning, where only comparative feedback is available.
GEOALIGN: Geometric Rollout Curation for Robust LLM Reinforcement Learning
Reinforcement Learning
Large Language Models
- Identification of directional inconsistency as a failure mode in online RL for LLMs.
- Development of GEOALIGN, a lightweight module for rollout curation that enhances training stability.
- GEOALIGN operates on-the-fly, requiring only forward passes and minimal overhead.
- Demonstrated improvements in performance and stability over strong baselines in various tasks.
Summary
The paper introduces GEOALIGN, a novel approach to enhance the stability of online reinforcement learning (RL) for aligning large language models (LLMs) with reward signals. The authors identify a critical issue termed 'directional inconsistency,' where a small number of high-reward rollouts can lead to conflicting update directions that destabilize training. GEOALIGN addresses this by implementing a lightweight rollout curation module that operates during policy optimization. It forms preference pairs from rollouts, learns a projector to concentrate reward-ordered directions, and identifies directionally inconsistent rollouts for rectification. The method is designed to be efficient, requiring only forward passes and adding minimal computational overhead. The authors demonstrate the effectiveness of GEOALIGN through experiments on dialogue alignment and mathematical reasoning tasks, showing significant improvements in both performance and training stability compared to existing robust RL methods.
Methodology
GEOALIGN employs a series of steps: it forms within-prompt preference pairs, learns an online projector to distill reward-ordered directions, builds a consensus prototype for the batch, and identifies directionally inconsistent rollouts for rectification. This process is executed without requiring per-rollout policy gradients, making it efficient and lightweight.
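The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not GEOALIGN itself: the learned projector is replaced by the identity, rollout embeddings are assumed to be given as vectors, and the function name `flag_inconsistent_rollouts` and the cosine threshold are hypothetical.

```python
import numpy as np

def flag_inconsistent_rollouts(emb, rewards, prompt_ids, threshold=0.0):
    """Flag rollouts whose reward-ordered direction opposes the batch consensus.

    Assumption: the learned projector is omitted (identity), so directions are
    plain differences of rollout embeddings.
    """
    emb = np.asarray(emb, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    prompt_ids = np.asarray(prompt_ids)

    winners, directions = [], []
    for p in np.unique(prompt_ids):
        idx = np.flatnonzero(prompt_ids == p)
        for i in idx:
            for j in idx:
                if rewards[i] > rewards[j]:  # within-prompt preference pair
                    d = emb[i] - emb[j]
                    norm = np.linalg.norm(d)
                    if norm > 0:
                        winners.append(i)
                        directions.append(d / norm)

    flagged = np.zeros(len(rewards), dtype=bool)
    if not directions:
        return flagged

    directions = np.stack(directions)
    prototype = directions.mean(axis=0)  # consensus direction for the batch
    prototype /= np.linalg.norm(prototype) + 1e-12
    cosines = directions @ prototype
    for w, c in zip(winners, cosines):
        if c < threshold:
            flagged[w] = True
    return flagged

# Made-up batch: three prompts, two rollouts each; the third pair points the other way.
emb = [[1.0, 0.0], [0.0, 0.0], [2.0, 0.1], [0.0, 0.0], [-1.0, 0.0], [0.0, 0.0]]
rewards = [1, 0, 1, 0, 1, 0]
prompt_ids = [0, 0, 1, 1, 2, 2]
flags = flag_inconsistent_rollouts(emb, rewards, prompt_ids)
```

Only embedding differences and dot products are involved, which is in the spirit of the forward-pass-only claim; how flagged rollouts are then rectified is not specified here.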
Results
The evaluation of GEOALIGN on dialogue alignment and mathematical reasoning tasks showed that it outperformed existing robust RL baselines such as PF-PPO, PAR, PODS, and Seed-GRPO. The method not only improved final performance but also reduced training oscillation, demonstrating resilience under controlled reward corruption.
Implications
The findings suggest that addressing directional inconsistency can significantly enhance the reliability of online RL for LLMs. GEOALIGN's approach could be applied to various tasks requiring alignment of LLMs with human preferences, potentially leading to more stable and effective training processes.
Does Aurora Encode Atmospheric Structure? Latent Regime Analysis and Attribution
Time Series
Interpretability
Efficient ML
- Aurora's latent space is organized by seasonal cycles rather than distinct storm events.
- Layer-wise relevance propagation (LRP) reveals that the model captures the 3D vertical structure of significant weather events.
- Perturbation tests show that masking high-relevance regions degrades forecast accuracy 3.31 times more than random masking.
- The study demonstrates that Aurora learns meteorological coherence without explicit guidance.
Summary
This paper investigates the internal representations of the Aurora model, a deep learning foundation model designed for weather forecasting. The authors employ spatially pooled PCA and layer-wise relevance propagation (LRP) to analyze how Aurora organizes atmospheric data in its latent space. The findings reveal that Aurora's latent space is primarily structured around seasonal cycles, while extreme storm events do not form distinct clusters. The LRP analysis indicates that the model captures features consistent with the vertical structure of significant weather events, such as the Great Storm of 1987. Perturbation tests demonstrate that masking relevant regions significantly degrades forecast accuracy, suggesting that Aurora learns meteorological coherence and vertical structure without explicit instruction. The study highlights the potential of deep learning models in weather forecasting while addressing the challenge of interpretability in such complex systems.
Methodology
The authors utilized spatially pooled PCA to analyze the latent space organization of the Aurora model, focusing on seasonal cycles and extreme weather events. They applied layer-wise relevance propagation (LRP) to trace information flow and validate the model's attributions to specific storm systems. Perturbation tests were conducted to assess the impact of masking relevant regions on forecast accuracy.
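As an illustration of what spatially pooled PCA involves, here is a minimal sketch on synthetic data. The latent shape `(time, height, width, channels)`, the mean-pooling choice, and the synthetic seasonal signal are assumptions; Aurora's actual latents and the authors' pooling details may differ.

```python
import numpy as np

def spatially_pooled_pca(latents, n_components=2):
    """Average latents over the spatial axes, then run PCA via SVD.

    latents: array of shape (time, height, width, channels) (assumed layout).
    Returns per-timestep scores and the explained-variance ratio per component.
    """
    pooled = latents.mean(axis=(1, 2))  # (time, channels)
    centered = pooled - pooled.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    variance = s**2
    ratio = variance / variance.sum()
    scores = u[:, :n_components] * s[:n_components]
    return scores, ratio[:n_components]

# Synthetic stand-in for model latents: one seasonal pattern plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
season = np.cos(2 * np.pi * days / 365)
pattern = rng.normal(size=8)
latents = season[:, None, None, None] * pattern + 0.1 * rng.normal(size=(365, 4, 4, 8))

scores, ratio = spatially_pooled_pca(latents)
corr = abs(np.corrcoef(scores[:, 0], season)[0, 1])
```

In this toy setup the first component tracks the seasonal cycle almost perfectly by construction; the 24.1% figure reported for Aurora comes from the paper, not from this sketch.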
Results
The analysis revealed that Aurora's latent space is primarily organized by seasonal cycles, with the first principal component capturing 24.1% of the total variance and effectively separating winter and summer regimes. Extreme weather events were found to lack distinct clustering in the latent space. The LRP analysis confirmed that the model accurately reflects the vertical structure of significant storms, and perturbation tests showed that masking relevant regions led to a 3.31 times greater degradation in forecast accuracy compared to random masking.
Implications
The findings suggest that deep learning models like Aurora can effectively learn and represent complex atmospheric dynamics, enhancing the accuracy of weather forecasts. The study also emphasizes the importance of interpretability in AI models, which is crucial for operational trust in meteorological applications.
Fast LeWorldModel
Robotics
Reinforcement Learning
Efficient ML
- Fast-LeWM replaces autoregressive rollout with action-prefix prediction, enhancing planning efficiency.
- The model allows for parallel prediction of future latents, reducing accumulated errors.
- Fast-LeWM achieves a 3.9× acceleration in dynamics module time and improves success rates from 85.8% to 90.5%.
- The method lowers full CEM solve time by 48.0%, demonstrating significant improvements in planning tasks.
Summary
The paper introduces Fast LeWorldModel (Fast-LeWM), an advancement over the existing LeWorldModel (LeWM) for visual planning tasks. The authors identify the limitations of LeWM, particularly its reliance on autoregressive rollouts which lead to slow planning times and accumulated prediction errors over long horizons. Fast-LeWM addresses these issues by employing action-prefix prediction, allowing the model to predict future latents based on prefixes of action sequences in parallel, rather than sequentially. This approach not only accelerates the planning process but also reduces the impact of prediction errors. The model is trained to understand how states evolve under different action prefixes, enhancing its predictive capabilities. The authors demonstrate that Fast-LeWM significantly improves planning efficiency and success rates across various tasks, achieving a notable reduction in planning time and open-loop prediction error growth.
Methodology
Fast-LeWM reformulates the latent dynamics modeling from single-step transitions to action-prefix prediction. It employs an action-prefix encoder and a parallel latent predictor, allowing the model to process multiple action prefixes simultaneously. This design minimizes the recursive accumulation of prediction errors and enhances the model's ability to predict state evolution over multiple horizons.
Results
Fast-LeWM improves the average success rate from 85.8% to 90.5%, accelerates the dynamics module from 31.4 seconds to 8.0 seconds, and reduces the full CEM solve time from 54.4 seconds to 28.3 seconds. Additionally, it significantly lowers open-loop prediction error and its growth over longer horizons.
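The headline figures (3.9× and 48.0%) follow directly from the timings reported above; a short check of the arithmetic, with variable names chosen here for illustration:

```python
# Timings and success rates as reported in the summary above.
dynamics_before_s, dynamics_after_s = 31.4, 8.0
cem_before_s, cem_after_s = 54.4, 28.3
success_before_pct, success_after_pct = 85.8, 90.5

speedup = dynamics_before_s / dynamics_after_s
cem_reduction_pct = 100.0 * (cem_before_s - cem_after_s) / cem_before_s
success_gain_points = success_after_pct - success_before_pct

print(f"dynamics speedup: {speedup:.1f}x")
print(f"CEM solve time reduction: {cem_reduction_pct:.1f}%")
print(f"success rate gain: {success_gain_points:.1f} points")
```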
Implications
The advancements presented in Fast-LeWM could lead to more efficient planning in robotics and other applications requiring real-time decision-making based on visual inputs. The reduction in planning time and errors can enhance the performance of agents in dynamic environments.
Error-Conditioned Neural Solvers
Optimization
Theory
Efficient ML
- ENS uses the PDE residual as an input to improve prediction accuracy rather than as an optimization target.
- The framework demonstrates significant improvements in reconstruction accuracy across diverse PDE families.
- ENS achieves up to a 10x improvement in accuracy for turbulent Kolmogorov flow problems.
- The method exhibits robustness to initialization and generalizes well under distribution shifts.
Summary
This paper introduces Error-Conditioned Neural Solvers (ENS), a novel framework for solving partial differential equations (PDEs) that addresses the limitations of traditional neural surrogate models and existing hybrid methods. Traditional neural solvers often treat PDE solving as a statistical regression problem, leading to issues with constraint violations and poor extrapolation beyond training distributions. Hybrid methods attempt to incorporate physical correctness by minimizing the PDE residual but can be computationally expensive and unstable. The authors demonstrate that minimizing the PDE residual can be an unreliable measure of reconstruction accuracy, particularly in ill-conditioned systems. ENS improves upon these methods by using the PDE residual as an input to the network rather than an optimization target, allowing the model to learn an update policy that corrects its predictions iteratively. The framework shows significant improvements in prediction accuracy across various PDE families, achieving up to a 10x improvement in turbulent Kolmogorov flow scenarios while maintaining lower computational costs than hybrid methods. ENS also exhibits robustness to initialization and generalizes well under distribution shifts, making it particularly effective in ill-conditioned regimes where traditional methods struggle.
Methodology
The ENS framework incorporates the PDE residual field as a direct input to the neural network at each iteration, allowing the network to learn to correct its predictions based on the spatial structure of its errors. This approach avoids the pitfalls of minimizing the residual directly, which can lead to inaccurate solutions in ill-conditioned systems. The model is trained under reconstruction supervision, refining solutions iteratively without explicit numerical optimization.
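The shape of such a residual-conditioned refinement loop can be sketched on a small linear problem. This is only a structural illustration: the function `update_policy` is a fixed damped-Jacobi rule standing in for the trained network, and the 1D Poisson discretization, grid size, and step count are assumptions rather than anything from the paper.

```python
import numpy as np

def poisson_matrix(n):
    """Second-order finite-difference matrix for -u'' on n interior grid points."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def update_policy(u, residual, diag, omega=0.8):
    """Stand-in for the learned network: maps (state, residual) to a correction.

    A trained model would sit here; a damped Jacobi step is used instead so the
    example runs without any training.
    """
    return omega * residual / diag

def refine(A, f, u0, steps=2000):
    u = u0.copy()
    diag = np.diag(A)
    for _ in range(steps):
        residual = f - A @ u  # residual field, fed to the policy as an input
        u = u + update_policy(u, residual, diag)
    return u

n = 16
A = poisson_matrix(n)
x = np.linspace(0.0, 1.0, n + 2)[1:-1]
u_true = np.sin(np.pi * x)
f = A @ u_true

u = refine(A, f, np.zeros(n))
rel_err = np.linalg.norm(u - u_true) / np.linalg.norm(u_true)
```

The point of the sketch is the data flow: the residual is recomputed each iteration and passed in as an input, and no loss on the residual is minimized at inference time.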
Results
ENS outperforms existing methods in terms of prediction accuracy across four PDE families, particularly in challenging scenarios such as super-resolution and coefficient extrapolation. The method achieves low PDE residuals alongside low reconstruction errors, demonstrating its effectiveness in ill-conditioned settings. The results indicate that ENS's learned correction policy generalizes well, maintaining performance even with unseen parameter changes and across different equations.
Implications
The development of ENS has significant implications for computational physics and engineering, where accurate and efficient PDE solving is crucial. The ability to generalize under distribution shifts and improve accuracy in ill-conditioned scenarios could enhance simulations in various applications, including fluid dynamics and material science.
Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations
Large Language Models
NLP
Theory
- Grader configurations in LLM-as-judge evaluations often default to a temperature of 1.0, leading to non-deterministic outcomes.
- Setting temperature to zero reduces variability but does not eliminate it, with persistent non-reproducibility observed in several test cases.
- Grader disagreement should be considered a critical metric in evaluating the reliability of LLM judgments.
- The study reveals that some major LLM providers are moving away from temperature control, complicating reproducibility efforts.
Summary
This paper investigates the reliability of Large Language Model (LLM) evaluators, specifically those used as judges in safety evaluations. The author challenges the common assumption that setting the sampling temperature of these models to zero ensures deterministic grading outcomes. Through an empirical study using Japan AISI's open-source evaluation framework (aisev), the author shows that grader temperature is often left at a default of 1.0 rather than set to zero, leading to significant variability in grading outcomes. Even when the temperature is explicitly set to zero, the study finds that non-determinism persists across multiple configurations and providers. The paper highlights a critical gap in evaluation harnesses that report single-run verdicts without accounting for grader disagreement, which can misrepresent the reliability of safety evaluations. The author proposes that grader disagreement should be treated as an essential metric in evaluation processes. Additionally, the paper discusses the implications of these findings for the design of reproducibility mitigations in LLM evaluations, especially as some providers are deprecating temperature control altogether.
Methodology
The author conducted a controlled empirical study involving 690 API calls across seven conditions, testing the grading outcomes of borderline items under various configurations, including default settings and explicit temperature control. The study aimed to expose the non-determinism of LLM evaluators by analyzing their responses to carefully crafted question/answer pairs.
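One simple way to report grader disagreement is the share of repeated runs that deviate from the majority verdict for each item. The sketch below assumes that definition and uses made-up verdicts; the paper may define its metric differently.

```python
from collections import Counter

def disagreement_rate(verdicts):
    """Fraction of repeated runs that differ from the majority verdict."""
    counts = Counter(verdicts)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(verdicts)

# Made-up verdicts from grading each item ten times under one configuration.
runs = {
    "item_a": ["pass"] * 10,
    "item_b": ["pass"] * 7 + ["fail"] * 3,
    "item_c": ["pass"] * 5 + ["fail"] * 5,
}

report = {item: disagreement_rate(v) for item, v in runs.items()}
unstable = [item for item, rate in report.items() if rate > 0.0]
```

A harness that reports only a single run would show `item_b` and `item_c` as a clean pass or fail, which is the gap the paper points to.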
Results
The results indicated that under default conditions, 4 out of 7 items exhibited non-reproducibility, with two items showing strong disagreement. Setting the temperature to zero stabilized some outcomes but left others unstable, confirming that temperature control is necessary but not sufficient for reproducibility.
Implications
The findings suggest that safety evaluations relying on LLM judges may be misleading if they do not account for variability in grading outcomes. This has significant implications for the deployment of AI systems in safety-critical applications, where pass/fail decisions can have substantial consequences. The paper advocates for improved evaluation practices that incorporate grader disagreement metrics.
AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing
NLP
Large Language Models
Reinforcement Learning
- AIGP integrates LLM-based reasoning with long-term business value alignment for dynamic pricing.
- The Long-Term Value Estimator (LTVE) automates preference pair selection for Direct Preference Optimization (DPO).
- AIGP achieves significant improvements in GMV (+13.21%), ROI (+7.59%), and milestone achievement rate (+8.20%) over traditional methods.
- The framework provides interpretable pricing rationales, enhancing decision transparency.
Summary
The paper introduces AIGP, a novel framework designed to enhance dynamic pricing in e-commerce by leveraging Large Language Models (LLMs) for long-term value alignment. Traditional dynamic pricing methods often struggle with interpretability, underutilization of unstructured data, and misalignment with long-term business goals such as Gross Merchandise Value (GMV) and Return on Investment (ROI). AIGP addresses these issues by integrating LLMs with a Long-Term Value Estimator (LTVE) that is trained using offline reinforcement learning on historical data. This allows for the generation of interpretable pricing decisions that align with long-term objectives. The framework employs supervised fine-tuning for knowledge distillation, ensuring efficient deployment while maintaining high-quality outputs. Extensive evaluations and A/B tests on Tao Factory demonstrate that AIGP significantly outperforms traditional and reinforcement learning-based pricing models, achieving notable increases in GMV, ROI, and milestone achievement rates over a 14-day period. The results indicate that AIGP not only enhances pricing effectiveness but also provides transparent rationales for pricing decisions.
Methodology
AIGP employs a framework that combines LLMs with a Long-Term Value Estimator (LTVE) trained via offline reinforcement learning. It utilizes supervised fine-tuning for knowledge distillation, allowing for the generation of interpretable pricing decisions based on structured data, domain knowledge, and textual context. The LTVE scores candidate pricing actions and selects preference pairs for Direct Preference Optimization (DPO), aligning pricing strategies with long-term business goals.
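The pair-selection step can be sketched as follows. The value function here is a toy stand-in for the LTVE (a price-times-linear-demand formula invented for illustration), and `select_preference_pair` with its best-versus-worst rule and `min_gap` parameter is one plausible selection scheme, not necessarily the one used in AIGP.

```python
def select_preference_pair(candidates, value_fn, min_gap=0.0):
    """Pick the highest- and lowest-valued candidates as a (chosen, rejected) pair.

    Returns None when the value gap is too small to give a useful training signal.
    """
    scored = sorted(candidates, key=value_fn, reverse=True)
    chosen, rejected = scored[0], scored[-1]
    if value_fn(chosen) - value_fn(rejected) <= min_gap:
        return None
    return {"chosen": chosen, "rejected": rejected}

def toy_long_term_value(price):
    """Hypothetical stand-in for the LTVE: revenue under a made-up linear demand curve."""
    demand = max(0.0, 100.0 - 5.0 * price)
    return price * demand

pair = select_preference_pair([8.0, 10.0, 12.0, 14.0], toy_long_term_value)
```

Pairs produced this way would then serve as the chosen and rejected responses in a DPO training set.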
Results
The implementation of AIGP on Tao Factory resulted in a +13.21% increase in GMV, +7.59% increase in ROI, and +8.20% improvement in milestone achievement rate over a 14-day period compared to the production baseline. The framework also provided interpretable pricing rationales.
Implications
AIGP's approach can be applied to various e-commerce platforms seeking to optimize dynamic pricing strategies while ensuring alignment with long-term business objectives. The integration of LLMs for decision-making can enhance transparency and interpretability in pricing, potentially leading to better customer satisfaction and improved business outcomes.
Topology-Informed Neural Networks for Flood Detection in Optical and Synthetic Aperture Radar Imagery
Computer Vision
Time Series
Interpretability
- Introduces topology-informed neural networks for enhanced flood detection.
- Utilizes the SEN12-FLOOD dataset for comprehensive evaluation.
- Demonstrates the effectiveness of combining topological features with CNNs.
- Achieves a detection accuracy of 98.9%, up from a 95.7% baseline.
Summary
This paper addresses the challenge of flood detection using optical and synthetic aperture radar (SAR) imagery, particularly in scenarios where cloud cover obscures optical data. The authors leverage the SEN12-FLOOD dataset, which provides coregistered time series data from Sentinel-1 SAR and Sentinel-2 multispectral imagery. They introduce a novel approach that incorporates topological data analysis (TDA) into neural networks to enhance flood detection accuracy. By extracting topological features from images and integrating them with traditional convolutional neural network (CNN) features, the authors demonstrate that these topological descriptors not only carry significant flood-related information but also improve the interpretability of the models. The study shows that combining topological and convolutional features leads to a substantial increase in detection accuracy, achieving 98.9% compared to a baseline of 95.7%. This work highlights the potential of TDA in improving the robustness and interpretability of machine learning models in safety-critical applications like flood detection.
Methodology
The authors systematically evaluate topological descriptors for flood detection by extracting topological features from images in the SEN12-FLOOD dataset. They improve upon existing temporal classification methods using gated recurrent units (GRUs) and introduce a lightweight Gaussian topological embedding for stable training. The study combines these topological features with convolutional features to enhance model performance.
Results
The integration of topological features with traditional CNN features resulted in a detection accuracy of 98.9%, surpassing the baseline accuracy of 95.7%. This improvement demonstrates the effectiveness of TDA in capturing meaningful flood signals and enhancing model robustness.
Implications
The findings suggest that topology-informed approaches can significantly improve the accuracy and interpretability of flood detection systems, which is crucial for emergency response and disaster management. This methodology could be extended to other environmental monitoring tasks where interpretability and robustness are essential.
EVOM: Agentic Meta-Evolution of Actor-Critic Architectures for Reinforcement Learning
Reinforcement Learning
Large Language Models
Optimization
- EVOM automates the architecture design process for actor-critic reinforcement learning models.
- The framework utilizes a bi-level optimization approach with an inner loop for training and an outer loop for architecture evolution.
- An LLM-based design agent generates and refines architectures, reducing reliance on predefined search spaces.
- Experimental results show significant performance improvements over traditional methods and other automated search techniques.
Summary
The paper introduces EVOM, an innovative framework designed to automate the architecture search process for actor-critic reinforcement learning models. Traditional methods rely on manually designed architectures, which can be inefficient and overlook potential improvements. EVOM addresses this by framing architecture search as a bi-level optimization problem. The inner loop utilizes a low-fidelity proximal policy optimization (PPO) to train candidate architectures, while the outer loop employs a large language model (LLM) as a design agent to iteratively refine these architectures. This approach allows for a more efficient exploration of the architecture space, which is inherently open-ended and complex. The authors demonstrate that EVOM significantly outperforms existing methods, including manually designed baselines and other LLM-guided search techniques, on benchmark environments such as Ant-v4 and HalfCheetah-v4. Ablation studies confirm the necessity of both the meta-evolution loop and the LLM design agent for achieving superior performance, highlighting the framework's effectiveness in discovering high-performance actor-critic architectures.
Methodology
The methodology involves a bi-level optimization framework where the inner loop uses low-fidelity PPO to evaluate candidate architectures, while the outer loop employs an LLM design agent to evolve these architectures through initialization, mutation, and crossover processes. This allows for efficient exploration of the architecture space without the need for extensive manual design.
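The bi-level structure can be sketched with stand-ins for both loops. Everything below is illustrative: `low_fidelity_score` is a made-up function replacing a short PPO run, random `mutate` and `crossover` operators replace the LLM design agent, and architectures are reduced to tuples of hidden-layer widths.

```python
import random

def low_fidelity_score(arch):
    """Stand-in for the inner loop (a short PPO run): a made-up score peaking at two 64-wide layers."""
    return -sum((w - 64) ** 2 for w in arch) - 10 * abs(len(arch) - 2)

def mutate(arch, rng):
    """Stand-in for an LLM-proposed edit: nudge one layer width."""
    arch = list(arch)
    i = rng.randrange(len(arch))
    arch[i] = max(8, arch[i] + rng.choice([-16, 16]))
    return tuple(arch)

def crossover(a, b, rng):
    """Combine the front of one architecture with the tail of another."""
    cut = rng.randint(1, min(len(a), len(b)))
    return tuple(a[:cut] + b[cut:])

def evolve(population, generations=20, seed=0):
    """Outer loop: keep the two best candidates, then refill with mutated and crossed children."""
    rng = random.Random(seed)
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=low_fidelity_score, reverse=True)
        parents = ranked[:2]
        history.append(low_fidelity_score(parents[0]))
        children = [mutate(rng.choice(parents), rng) for _ in range(3)]
        children.append(crossover(parents[0], parents[1], rng))
        population = parents + children
    return max(population, key=low_fidelity_score), history

best, history = evolve([(16, 16), (128,), (32, 96, 32)])
```

Because the two best candidates always survive, the best score per generation never decreases; whether EVOM uses this kind of elitism is not stated in the summary.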
Results
EVOM demonstrated superior performance compared to manually designed architectures, LLM-guided random searches, and the MLES programmatic policy search method. The framework achieved better results on benchmark tasks such as Ant-v4 and HalfCheetah-v4, confirming its effectiveness in discovering high-performing actor-critic architectures.
Implications
The findings suggest that automated architecture design can significantly enhance the performance of reinforcement learning models, paving the way for more efficient and effective RL systems. The use of LLMs in this context opens new avenues for research in automated algorithm design and architecture optimization.
DualEval: Joint Model-Item Calibration for Unified LLM Evaluation
NLP
Large Language Models
Interpretability
- DUALEVAL unifies static benchmark correctness and open-ended preference signals for LLM evaluation.
- The framework jointly estimates model abilities and item properties, enhancing evaluation stability and interpretability.
- Empirical results show DUALEVAL achieves high accuracy in reconstructing evaluation signals and produces balanced model rankings.
- The framework supports diagnostic applications like benchmark compression and anomaly detection.
Summary
The paper introduces DUALEVAL, a novel framework for evaluating large language models (LLMs) that integrates static benchmark correctness and arena-style preference signals into a unified latent model-item calibration approach. Traditional LLM evaluation methods often rely on either static benchmarks, which provide objective correctness labels, or preference data from user interactions, which can be subjective and inconsistent. DUALEVAL addresses this by jointly estimating model abilities and item properties, allowing for a more comprehensive evaluation that compensates for the weaknesses of each method. The framework is applied across four domains: coding, math, miscellaneous domain knowledge tasks, and everyday user queries, utilizing 18 frontier LLMs. The results demonstrate that DUALEVAL produces reliable model rankings and item-level diagnostics, enabling applications such as benchmark compression and anomaly detection. Overall, DUALEVAL reframes LLM evaluation as a joint calibration process, enhancing interpretability and efficiency in evaluation pipelines.
Methodology
DUALEVAL employs a latent model-item calibration framework inspired by Item Response Theory (IRT). It uses a two-parameter logistic IRT model for static benchmarks and constructs soft pairwise preference targets from arena-style data, fitting them into a shared latent scale for model abilities and item difficulties. This joint formulation allows for mutual updates of model and item parameters based on both correctness and preference signals.
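For reference, the standard two-parameter logistic (2PL) IRT model gives the probability of a correct answer as a sigmoid of discrimination times the gap between ability and difficulty. The sketch below shows that formula, plus a Bradley-Terry-style pairwise link on the same ability scale; the pairwise form is an assumption made here for illustration, since the summary does not spell out how the preference targets are modelled.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return sigmoid(a * (theta - b))

def p_prefer(theta_i, theta_j, a):
    """Assumed pairwise link (not from the source): probability that model i
    is preferred over model j on an item with discrimination a."""
    return sigmoid(a * (theta_i - theta_j))

# A model whose ability equals the item difficulty succeeds half the time.
at_difficulty = p_correct(theta=1.0, a=1.5, b=1.0)
stronger = p_correct(theta=2.0, a=1.0, b=0.0)
weaker = p_correct(theta=1.0, a=1.0, b=0.0)
```

Sharing `theta` between the two links is what allows correctness labels and preference comparisons to update the same model abilities.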
Results
The framework achieved 88-92% accuracy in static-label reconstruction across static-anchored domains and 68-81% agreement in arena comparisons. Compared to static-only and arena-only baselines, DUALEVAL produced more balanced rankings and demonstrated robustness to variations in reward models. The learned item profiles enabled effective benchmark compression and anomaly detection, with high AUROC scores for identifying contamination.
Implications
DUALEVAL's approach to joint model-item calibration can lead to more interpretable and efficient evaluation pipelines for LLMs. Its diagnostic capabilities can enhance the understanding of model performance and item characteristics, potentially influencing future evaluation methodologies and practices in the field of NLP.
BetXplain: An Explanation-Annotated Dataset for Detecting Manipulative Betting Advertisements on Social Media
NLP
Interpretability
- Introduction of BetXplain, the first dataset for detecting manipulative betting advertisements with explanations.
- Manual annotation of advertisements for manipulative and deceptive practices, enhancing the dataset's utility for research.
- Analysis of persuasive strategies in betting ads and their implications for mental health.
- Potential applications for user protection and regulatory monitoring in the context of online betting.
Summary
The paper introduces BetXplain, a novel dataset aimed at detecting manipulative betting advertisements on social media platforms, specifically Instagram and Reddit. The dataset addresses the growing concern over the misleading nature of these advertisements, which often employ persuasive techniques that can lead to risky behaviors and mental health issues among users. BetXplain is unique in that it not only provides classification labels for advertisements as manipulative, deceptive, or responsible but also includes human-written explanations for each annotation. This feature enables researchers to explore explainable approaches in the detection of such advertisements. The authors analyze common persuasive strategies used in betting ads and their potential impact on users' mental health, highlighting the societal risks associated with aggressive betting promotion. The dataset aims to fill a significant gap in existing research, as there are currently no publicly available datasets that focus specifically on manipulative betting advertisements with explanatory annotations. The paper also discusses the potential for practical applications, such as browser plugins that alert users to manipulative ads and automated systems for regulatory monitoring.
Methodology
The authors collected betting-related advertisements from Instagram and Reddit, manually annotating them for manipulative and deceptive practices. They also provided human explanations for each annotation. The dataset was then used to evaluate transformer-based and large language models for the detection of manipulative advertisements.
Results
The study successfully created a comprehensive dataset that allows for the classification of betting advertisements and provides insights into the persuasive techniques used. Initial evaluations of machine learning models demonstrated the feasibility of automated detection of manipulative content, paving the way for future research in this area.
Implications
The BetXplain dataset can facilitate the development of tools to detect and warn users about manipulative betting advertisements, potentially leading to better consumer protection. It also serves as a resource for researchers studying the psychological impacts of gambling advertisements and the effectiveness of various detection methodologies.
KG-TRACE: A Neuro-Symbolic Framework for Mechanistic Grounding in Antimicrobial Resistance Prediction
Graph Learning
Interpretability
- KG-TRACE integrates genomic data with a WHO knowledge graph to improve AMR prediction.
- The framework uses a learned epistemic trust gate to balance neural and symbolic evidence.
- Achieved an AUROC of 0.9760 for isoniazid resistance with 92.5% symbolic coverage.
- Introduced the Biological Grounding Ratio (BGR) to measure alignment with biological knowledge.
Summary
The paper introduces KG-TRACE, a neuro-symbolic framework designed to enhance antimicrobial resistance (AMR) prediction by integrating genomic data with a structured biological knowledge graph (KG) from the WHO. Traditional models have achieved high accuracy in predicting AMR but often lack a mechanism for grounding predictions in established biological pathways. KG-TRACE addresses this gap by combining genomic features with RotatE-based KG embeddings through a learned epistemic trust gate, which dynamically adjusts the weight of neural evidence against symbolic biological knowledge. The framework was evaluated on the CRyPTIC M. tuberculosis cohort, achieving an AUROC of 0.9760 for isoniazid resistance. A key innovation is the Biological Grounding Ratio (BGR), a metric that quantifies the alignment between neural attributions and biological knowledge; the framework reports 92.5% symbolic coverage for isoniazid-resistant predictions. The system also flags uncertain predictions when neural evidence lacks a documented biological pathway, thereby providing a verifiable audit trail for clinicians. This approach aims to pair predictive accuracy with clinical trust by ensuring that high-confidence predictions are backed by known causal mechanisms.
Methodology
KG-TRACE employs a neuro-symbolic approach that fuses genomic features with knowledge graph embeddings using a learned cross-attention gate. This gate dynamically allocates trust between neural and symbolic components on a per-sample basis, allowing the model to adapt its reliance on biological knowledge based on the evidence presented. The framework also implements a dual-level mechanistic grounding protocol to issue uncertainty flags for predictions lacking a documented biological pathway.
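The per-sample trust allocation described above can be illustrated with a much simpler stand-in. The sketch below assumes a single scalar sigmoid gate over the concatenated embeddings rather than the cross-attention gate the summary mentions; `trust_gate_fuse`, `gate_weights`, and `gate_bias` are hypothetical names, not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def trust_gate_fuse(neural, symbolic, gate_weights, gate_bias=0.0):
    """Blend two equal-length embeddings with one scalar gate per sample.

    Assumption: a linear layer plus sigmoid stands in for the learned gate.
    gate near 1 trusts the neural embedding, gate near 0 trusts the symbolic one.
    """
    concat = list(neural) + list(symbolic)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * n + (1.0 - gate) * s for n, s in zip(neural, symbolic)]
    return gate, fused

# With all-zero gate weights the gate is exactly 0.5, so both sources count equally.
gate, fused = trust_gate_fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```

Because the gate is computed from the sample itself, two inputs with different evidence profiles can end up with different neural/symbolic weightings, which is the behavior the methodology describes.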
Results
KG-TRACE achieved an AUROC of 0.9760 for predicting isoniazid resistance in M. tuberculosis on the CRyPTIC cohort. The framework also reported 92.5% symbolic coverage of isoniazid-resistant predictions and identified cases with ambiguous resistance mechanisms by issuing laboratory follow-up flags for uncertain predictions.
Implications
The neuro-symbolic grounding provided by KG-TRACE enhances the interpretability and reliability of AMR predictions, bridging the gap between machine learning accuracy and clinical applicability. This framework could improve clinical decision support systems by ensuring that predictions are not only accurate but also grounded in established biological knowledge, potentially leading to better patient outcomes.
Discovering Millions of Interpretable Features with Sparse Autoencoders
NLP
Large Language Models
Interpretability
- Introduction of Qwen3-Instruct SAE, a comprehensive suite of Sparse Autoencoders for Qwen3 models.
- Layer-wise SAEs are provided for key activation sites, enhancing interpretability of model features.
- Evaluation reveals distinct sparsity-fidelity trade-offs, contributing to understanding feature representations.
- Demonstration of SAE utility through a refusal-steering case study, influencing model behavior.
Summary
This paper presents Qwen3-Instruct SAE, a suite of Sparse Autoencoders (SAEs) designed to extract interpretable features from the Qwen3 instruction-tuned model family, which includes models of varying sizes (1.7B, 4B, and 8B parameters). The authors address the computational challenges associated with training SAEs and the limited availability of open-source models by providing a comprehensive release that includes layer-wise SAEs across key activation sites: residual streams, MLP outputs, and attention outputs for the smaller models, and a subset of residual stream layers for the largest model. The evaluation of these SAEs employs both activation-level reconstruction metrics and model-level recovery metrics, revealing a trade-off between sparsity and fidelity across different layers. The practical utility of the Qwen3-Instruct SAE is demonstrated through a case study focused on refusal steering, showcasing how selected features can influence the behavior of instruction-tuned models. This work aims to enhance the understanding of sparse representations and facilitate future research in mechanistic interpretability and behavioral interventions in large language models.
Methodology
The authors trained Sparse Autoencoders on the Qwen3 instruction-tuned model family, focusing on three key activation sites: residual streams, MLP outputs, and attention outputs. They systematically evaluated the SAEs using reconstruction and model recovery metrics to assess their performance and interpretability.
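To make the sparsity and fidelity quantities concrete, here is a minimal sketch of a sparse autoencoder forward pass. It assumes a plain ReLU encoder and linear decoder with toy weights; the released SAEs may use a different architecture or sparsity mechanism, and all names here are illustrative.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into sparse latents, then reconstruct it."""
    pre = [h + b for h, b in zip(matvec(w_enc, x), b_enc)]
    latent = [max(0.0, h) for h in pre]                      # ReLU keeps few latents active
    recon = [r + b for r, b in zip(matvec(w_dec, latent), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)  # fidelity
    l0 = sum(1 for z in latent if z > 0.0)                   # sparsity (active latents)
    return latent, recon, mse, l0

# Toy dictionary: 4 latents encode the positive and negative part of each input axis.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
w_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
latent, recon, mse, l0 = sae_forward([1.0, -2.0], w_enc, [0.0] * 4, w_dec, [0.0] * 2)
```

Reporting `l0` alongside `mse` per layer is one simple way to read off the kind of sparsity-fidelity trade-off the evaluation discusses.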
Results
The evaluation of the Qwen3-Instruct SAE suite showed distinct trade-offs between sparsity and fidelity across different layers and components. The case study on refusal steering demonstrated that selected features from the SAEs could effectively influence the behavior of instruction-tuned Qwen3 models.
Implications
The release of Qwen3-Instruct SAE provides a valuable resource for researchers studying sparse representations and mechanistic interpretability in large language models. It opens avenues for further exploration of feature-level mechanisms and behavioral interventions, potentially enhancing the safety and alignment of AI systems.
Federated Hash Projected Latent Factor Learning
Federated Learning
Efficient ML
Optimization
- FHPLF reduces communication and computation costs by using binary gradient-like matrices instead of real-valued gradients.
- The model incorporates Projected Hamming Distance to enhance the representation capability of binary codes.
- SBG-PEU strategy minimizes the risk of privacy leakage during data transmission.
- FHPLF consistently outperforms state-of-the-art methods in terms of accuracy and efficiency across multiple datasets.
Summary
The paper introduces the Federated Hash Projected Latent Factor (FHPLF) model, which integrates Hash Learning (HL) into Federated Learning (FL) to address privacy concerns and communication overhead associated with traditional methods. FHPLF innovatively replaces real-valued gradient matrices with binary gradient-like matrices, significantly reducing computation, storage, and communication costs while enhancing privacy. It employs Projected Hamming Distance for improved similarity modeling, allowing for a more nuanced representation of binary codes by emphasizing the importance of individual bits. Additionally, the Secure Binary Gradient Reassembly and Privacy-Enhanced Upload (SBG-PEU) strategy is proposed to mitigate risks of user interaction leakage during data transmission. Extensive experiments on four real-world datasets demonstrate that FHPLF outperforms existing HL and FL methods, achieving a favorable balance between accuracy, efficiency, and privacy preservation.
Methodology
The FHPLF model utilizes a federated learning framework to enable decentralized training without centralizing user data. It replaces traditional real-valued gradients with binary representations and employs Projected Hamming Distance for similarity modeling. The SBG-PEU mechanism is implemented to enhance privacy during data transmission.
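As a rough illustration of the binary representations involved, the sketch below binarizes real values by sign and compares codes with Hamming distance. The summary does not define Projected Hamming Distance, so `weighted_hamming` is only a hypothetical stand-in that gives each bit its own importance; it should not be read as the paper's formula.

```python
def binarize(values):
    """Map real values to a {-1, +1} code by sign (zero maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]

def hamming(code_a, code_b):
    """Number of positions where two binary codes disagree."""
    return sum(1 for x, y in zip(code_a, code_b) if x != y)

def weighted_hamming(code_a, code_b, bit_weights):
    """Hypothetical variant: disagreeing bits contribute their own weight."""
    return sum(w for x, y, w in zip(code_a, code_b, bit_weights) if x != y)

user_code = binarize([0.3, -1.2, 0.0])
item_code = [1, 1, -1]
plain = hamming(user_code, item_code)
weighted = weighted_hamming(user_code, item_code, [0.5, 2.0, 1.0])
```

Sending sign codes instead of real-valued gradients cuts each transmitted entry to one bit, which is the source of the communication savings the summary reports.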
Results
The experimental results indicate that FHPLF achieves superior performance compared to existing HL and FL methods, demonstrating significant improvements in privacy preservation, binary representation learning, and communication efficiency across four real-world datasets.
Implications
The FHPLF model has potential applications in recommendation systems and other domains requiring privacy-preserving machine learning techniques. Its approach can be extended to various federated learning scenarios where data privacy and communication efficiency are critical.
A Multi-Fidelity Convolutional Autoencoder-Transfer Learning Framework for Guided-Wave-Based Damage Diagnosis Using Large Simulated and Limited Experimental Datasets
Efficient ML
- Introduces a multi-fidelity transfer learning framework for GWSHM.
- Utilizes a convolutional autoencoder for deep feature learning from limited experimental data.
- Achieves high accuracy in damage localization and sizing with R² scores above 0.93 and 0.99, respectively.
- Demonstrates strong generalization capabilities on unseen data.
Summary
This paper presents a novel multi-fidelity transfer learning framework designed for guided wave-based structural health monitoring (GWSHM) to enhance damage diagnosis in engineering structures. The framework integrates lightweight physics-based simulations, convolutional autoencoder (CAE) deep feature learning, and limited experimental data to achieve accurate damage localization and sizing in plate-like structures equipped with piezoelectric transducers. A one-dimensional time-domain spectral element model is employed to generate a large synthetic dataset for pretraining the model. Subsequently, transfer learning is utilized to adapt the model to experimental conditions using a minimal amount of labeled data. The proposed CAE-based framework demonstrates superior performance compared to traditional CNN approaches, achieving R² scores exceeding 0.93 for damage localization and 0.99 for damage sizing. The framework's robustness is validated through its ability to generalize to unseen data, maintaining high prediction accuracy for damage scenarios not included during pretraining or fine-tuning. This work establishes the framework as an efficient and practical solution for real-world GWSHM applications, addressing the challenges posed by limited experimental data and high computational costs associated with generating large-scale simulation datasets.
Methodology
The methodology involves using a one-dimensional time-domain spectral element model to create a large synthetic dataset for pretraining a convolutional autoencoder. The model is then fine-tuned using limited labeled experimental data through transfer learning, enhancing its ability to accurately localize and size damage in structures.
Results
The framework outperforms traditional CNN models in damage localization accuracy, achieving R² scores greater than 0.93 for localization and 0.99 for sizing. It also exhibits excellent generalization to previously unseen data, indicating its robustness and practical applicability.
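For readers unfamiliar with the metric, the R² scores quoted above follow the standard coefficient-of-determination formula. A minimal sketch, with made-up numbers rather than the paper's data:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative damage positions (arbitrary units), not experimental values.
true_positions = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(true_positions, true_positions)
mean_only = r2_score(true_positions, [2.5, 2.5, 2.5, 2.5])
```

A perfect predictor scores 1 and a predictor that always returns the mean scores 0, so values above 0.93 and 0.99 indicate that most of the variance in damage location and size is explained.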
Implications
The proposed framework has significant implications for structural health monitoring in various engineering applications, enabling early damage detection and characterization while minimizing the reliance on extensive labeled datasets and computational resources.
A Generalization Theory for JEPA-Based World Models
Theory
Graph Learning
Robotics
- Establishment of a spectral graph-based theoretical framework for JEPA-based world models.
- Demonstration of the equivalence between JEPA risk and matrix factorization of the co-occurrence matrix.
- Derivation of a generalization error bound linking JEPA pretraining risk to downstream planning performance.
- Identification of a trade-off between approximation and sample errors concerning latent dimensions.
Summary
This paper presents the first generalization theory for Joint Embedding Predictive Architectures (JEPAs) in the context of world modeling. JEPAs have gained attention for their ability to learn predictive dynamics in a latent space, but their theoretical foundations have not been well established. The authors formulate JEPA pretraining as a conditional spectral graph learning problem, demonstrating that the JEPA objective can be viewed as a low-rank factorization of an action-conditioned co-occurrence matrix. They establish a connection between the error in JEPA pretraining and the regret in downstream planning tasks, leading to a finite-sample generalization bound for JEPA-based world models. The analysis reveals a trade-off between approximation and sample errors in relation to the latent dimension, providing insights into the strengths and weaknesses of latent predictive models compared to traditional input-level predictive approaches.
Methodology
The authors utilize a conditional spectral graph formulation to analyze JEPA pretraining, establishing its equivalence to matrix factorization of an action-conditioned co-occurrence matrix. They derive a generalization error bound by connecting the pretraining risk to downstream action planning regret, allowing for a theoretical comparison between latent and input-level predictive models.
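The factorization view can be written schematically as follows. The symbols are illustrative assumptions (M_a for the co-occurrence matrix under action a, F and G for the two embedding maps, d for the latent dimension) and are not the paper's notation. The second identity is the standard Eckart-Young result, which shows why the approximation error shrinks as d grows; the bound in the paper balances this against a sample error that depends on d.

```latex
% Schematic only: M_a, F, G, and d are illustrative symbols.
\min_{F,\, G \in \mathbb{R}^{n \times d}} \; \bigl\| M_a - F G^{\top} \bigr\|_F^2,
\qquad
\min_{\operatorname{rank}(B) \le d} \bigl\| M_a - B \bigr\|_F^2 \;=\; \sum_{i > d} \sigma_i(M_a)^2 .
```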
Results
The paper derives a finite-sample generalization error bound for JEPA-based world models, which exposes a trade-off between approximation and sample errors as the latent dimension varies. The framework clarifies how JEPAs generalize to downstream action planning and when latent predictive modeling is preferable to input-level prediction.
Implications
The findings have significant implications for the development of more robust and efficient world models in machine learning, particularly in applications involving action planning and decision-making in latent spaces. This theoretical foundation can guide future research in enhancing the performance and generalization capabilities of JEPA-based models.
Mesh-RL: Coupled subgrid reinforcement learning
Reinforcement Learning
- Mesh-RL introduces a spatial domain-decomposition framework for reinforcement learning.
- It enforces boundary-consistent TD updates to enhance value propagation.
- The framework improves convergence speed, cumulative reward, and learning stability.
- Higher mesh resolutions lead to better exploration and prevent premature convergence.
Summary
The paper introduces Mesh-RL, a novel reinforcement learning framework designed to address the challenges of slow temporal-difference (TD) reward propagation in large or sparse-reward environments. By employing a spatial domain-decomposition approach inspired by the finite element method, Mesh-RL partitions the environment into overlapping subgrids and enforces boundary-consistent TD updates. This method allows for localized learning while ensuring coherent value propagation across the entire state space. Unlike existing hierarchical or model-based methods, Mesh-RL accelerates long-range credit assignment without altering the reward function or introducing explicit planning mechanisms. The authors evaluate Mesh-RL on various hazard-dense grid-world environments, demonstrating that it consistently enhances convergence speed, cumulative reward, and learning stability across Q-learning, SARSA, and Dyna-Q algorithms. The results indicate that higher mesh resolutions improve exploration, prevent premature convergence, and significantly accelerate value propagation to distant states. Overall, Mesh-RL provides a principled mechanism for enhancing sample efficiency in sparse-reward settings by integrating techniques from scientific computing with reinforcement learning.
Methodology
The methodology involves partitioning the environment into overlapping subgrids and applying boundary-aware TD updates. This structured spatial decomposition allows for efficient local computation while approximating a global solution, enhancing the learning process without modifying standard TD algorithms.
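The decomposition can be sketched on a one-dimensional state space. This is a simplified illustration under stated assumptions: state values instead of Q-values, and boundary consistency enforced by averaging the shared cells, which may differ from the paper's coupling rule. All function names are hypothetical.

```python
def overlapping_subgrids(size, tile, overlap):
    """Split cells 0..size-1 into tiles of width `tile` sharing `overlap` cells."""
    step = tile - overlap
    return [(s, min(s + tile, size)) for s in range(0, max(size - overlap, 1), step)]

def td_update(values, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Standard TD(0) update applied inside one subgrid's value table."""
    values[state] += alpha * (reward + gamma * values[next_state] - values[state])

def sync_boundaries(tables):
    """Assumed coupling rule: average every cell that appears in several tables."""
    owners = {}
    for table in tables:
        for cell in table:
            owners.setdefault(cell, []).append(table)
    for cell, sharing in owners.items():
        if len(sharing) > 1:
            mean = sum(t[cell] for t in sharing) / len(sharing)
            for t in sharing:
                t[cell] = mean

tiles = overlapping_subgrids(10, 4, 1)            # [(0, 4), (3, 7), (6, 10)]
tables = [{c: 0.0 for c in range(a, b)} for a, b in tiles]
td_update(tables[2], 8, 1.0, 9, alpha=0.5, gamma=1.0)  # local learning near the goal
tables[2][6] = 1.0                                # pretend value reached the tile edge
sync_boundaries(tables)                           # neighbor tile now sees it too
```

The shared cells act as the channel through which value information crosses from one subgrid to the next, which is how local updates can still produce coherent long-range propagation.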
Results
The experimental results show that Mesh-RL consistently outperforms traditional methods in terms of convergence speed, cumulative rewards, and overall learning stability across various algorithms (Q-learning, SARSA, Dyna-Q) in hazard-dense environments.
Implications
Mesh-RL could improve reinforcement learning in environments where rewards are sparse or delayed, potentially benefiting applications in robotics, game AI, and other domains requiring efficient learning in complex state spaces.
SOLAR: AI-Powered Speed-of-Light Performance Analysis
Optimization
Efficient ML
Robotics
- SOLAR is the first tool to automatically derive validated SOL bounds from PyTorch and JAX source code.
- The framework utilizes a three-stage pipeline that separates generative translation from deterministic analysis.
- SOLAR achieves 100% operator and language coverage on KernelBench, outperforming existing tools.
- The analysis reveals substantial optimization opportunities, with headroom improvements of up to 7.8x.
Summary
The paper introduces SOLAR, a novel framework designed to automate the derivation of Speed-of-Light (SOL) performance bounds for deep learning models implemented in PyTorch and JAX. SOLAR addresses the challenges of manual and error-prone SOL analysis by employing a three-stage pipeline that combines generative and deterministic components. The framework translates source code into an executable Affine Loop Intermediate Representation (IR) using a large language model (LLM), validates this translation through output comparison, and then lifts the IR into an einsum graph for analytical performance evaluation. This process allows SOLAR to compute unfused, fused, and cache-aware SOL bounds, achieving comprehensive operator and language coverage with zero observed SOL violations. The authors evaluate SOLAR across various workloads, including KernelBench and robotics applications, demonstrating its utility in headroom analysis, optimization identification, cross-platform exploration, and hardware provisioning. The results indicate significant performance headroom and optimization opportunities, showcasing SOLAR's potential to enhance model efficiency and inform hardware resource allocation.
Methodology
SOLAR employs a three-stage pipeline: (1) a large language model translates source code into an executable Affine Loop IR, (2) a deterministic compiler lifts the IR into an einsum graph, and (3) an analytical backend computes SOL bounds, including unfused, fused, and cache-aware variants. This approach ensures both generative and deterministic reasoning, allowing for validated performance analysis.
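The analytical stage can be pictured with a roofline-style lower bound. The sketch below is an assumption about the general form of such a bound, not SOLAR's actual backend: it takes the larger of compute time and memory time, sums per-operator bounds for the unfused case, and drops intermediate traffic for the fused case. The hardware peaks are hypothetical.

```python
def sol_time(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Lower bound on runtime: limited by compute or by memory traffic."""
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)

def headroom(measured_time, sol):
    """How many times slower the measured kernel is than its SOL bound."""
    return measured_time / sol

PEAK_FLOPS = 100e12      # hypothetical accelerator, FLOP/s
PEAK_BANDWIDTH = 1e12    # hypothetical accelerator, bytes/s
M = N = K = 1024
BYTES = 2                # fp16 element size

matmul_flops = 2 * M * N * K
matmul_bytes = BYTES * (M * K + K * N + M * N)
relu_flops = M * N
relu_bytes = BYTES * (2 * M * N)            # read input, write output

# Unfused: each operator pays its own memory round trip.
unfused_bound = (sol_time(matmul_flops, matmul_bytes, PEAK_FLOPS, PEAK_BANDWIDTH)
                 + sol_time(relu_flops, relu_bytes, PEAK_FLOPS, PEAK_BANDWIDTH))
# Fused: the matmul output never leaves the chip before the ReLU.
fused_bound = sol_time(matmul_flops + relu_flops, matmul_bytes, PEAK_FLOPS, PEAK_BANDWIDTH)
```

Dividing a measured runtime by the relevant bound with `headroom` gives the kind of ratio the summary reports as optimization headroom.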
Results
Across the evaluated workloads, SOLAR's analysis exposed substantial optimization headroom, up to 7.8x on KernelBench, and cache-aware analysis tightened SOL bounds by 2.06x. The framework also identified a 2.04x reduction in FLOPs for specific models and provided insights for cross-platform performance projections.
Implications
SOLAR's automated SOL analysis can significantly enhance the optimization of deep learning models, enabling developers to identify performance bottlenecks and optimize resource allocation effectively. Its open-source nature encourages further research and development in performance analysis tools for diverse AI architectures.
Neural Architecture Search for Generative Adversarial Networks: A Comprehensive Review and Critical Analysis
Generative Models
Optimization
Computer Vision
- NAS significantly optimizes GAN architecture design, improving performance and stability.
- Evolutionary algorithms and gradient-based methods are particularly effective in certain scenarios.
- Robust evaluation metrics are crucial for accurately assessing GAN performance.
- Diverse datasets are essential for comprehensive evaluation of GANs.
Summary
This paper presents a comprehensive review of Neural Architecture Search (NAS) methods applied to Generative Adversarial Networks (GANs), emphasizing the automation of architecture design to enhance GAN performance. The authors categorize and compare various NAS approaches based on search strategies, evaluation metrics, and performance outcomes. Key findings indicate that evolutionary algorithms and gradient-based methods often outperform others in specific contexts. The review also highlights the necessity for robust evaluation metrics beyond traditional scores like Inception Score (IS) and Fréchet Inception Distance (FID), as well as the importance of diverse datasets for assessing GAN performance. The paper aims to guide future research by identifying limitations in current methods and suggesting areas for improvement in NAS techniques for GANs.
Methodology
The authors developed a framework to categorize and compare existing NAS-GAN techniques based on key criteria derived from an extensive literature review. This framework facilitated a critical analysis of various approaches, focusing on their search strategies and performance metrics.
Results
The review reveals that while NAS can enhance GAN performance, challenges remain, particularly in achieving stability during training. The analysis shows that certain NAS methods, especially those utilizing evolutionary algorithms, yield superior results. Additionally, the need for improved evaluation metrics and diverse datasets is emphasized to better assess GAN capabilities.
Implications
The findings of this review have significant implications for researchers and practitioners in the field of machine learning, particularly in optimizing GAN architectures. By highlighting effective NAS strategies and the importance of robust evaluation, the paper provides a roadmap for future research aimed at advancing GAN technology and its applications in various domains.
Recovering Governing Equations from Solution Data: Identifiability Bounds for Linear and Nonlinear ODEs
Theory
- Introduces Hausdorff distance as a metric for comparing differential equations.
- Establishes identifiability bounds for a wide range of ODE classes.
- Quantifies sample complexity needed for reliable recovery of governing equations.
- Addresses theoretical gaps in the uniqueness and stability of ODE identification.
Summary
This paper addresses the challenge of recovering governing ordinary differential equations (ODEs) from observed solution data, a critical issue in scientific machine learning. The authors identify a significant gap in the theoretical understanding of when a ground-truth ODE can be uniquely and stably identified from multiple observations. To tackle this, they introduce the Hausdorff distance as a metric for comparing differential equations, which captures the worst-case separation between solution trajectories across all admissible initial conditions. The paper establishes identifiability bounds for various classes of ODEs, including linear and nonlinear equations with Lipschitz continuous vector fields. By deriving metric entropy estimates, the authors analyze the sample complexity required to reliably recover governing equations, providing a quantitative framework for understanding the learning task's complexity. This work lays the groundwork for future research in identifying governing equations from data, particularly in the context of partial differential equations (PDEs).
Methodology
The authors formalize the identification problem by measuring the distance between ODEs using the Hausdorff distance of their solution sets. They derive upper and lower bounds on this distance for various ODE classes and analyze the implications for sample complexity and metric entropy, quantifying the number of solution observations required for reliable identification.
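A finite-sample version of this distance is easy to compute and may help build intuition. The sketch below is an approximation under stated assumptions: the paper's definition ranges over all admissible initial conditions, whereas here each ODE is represented by a handful of sampled trajectories compared in the sup norm. The linear test equations and all names are illustrative.

```python
import math

def sup_distance(u, v):
    """Largest pointwise gap between two trajectories sampled at the same times."""
    return max(abs(a - b) for a, b in zip(u, v))

def hausdorff(set_a, set_b, dist=sup_distance):
    """Symmetric Hausdorff distance between two finite sets of trajectories."""
    def directed(xs, ys):
        return max(min(dist(x, y) for y in ys) for x in xs)
    return max(directed(set_a, set_b), directed(set_b, set_a))

def trajectory(rate, x0, times):
    """Closed-form solution of x' = rate * x with x(0) = x0."""
    return [x0 * math.exp(rate * t) for t in times]

times = [0.0, 0.5, 1.0]
initial_conditions = [1.0, 2.0]
ode_a = [trajectory(-1.0, x0, times) for x0 in initial_conditions]
ode_b = [trajectory(-1.5, x0, times) for x0 in initial_conditions]
gap = hausdorff(ode_a, ode_b)
```

Two equations are hard to tell apart from data exactly when this kind of gap is small relative to the observation noise, which is the intuition behind linking the distance to sample complexity.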
Results
The paper presents specific identifiability bounds for linear and Lipschitz ODEs, demonstrating how the Hausdorff distance can be used to distinguish between different governing equations. The authors provide examples of lower and upper bounds on the Hausdorff distance, illustrating the theoretical framework developed for understanding the sample complexity involved in learning these equations.
Implications
The findings have significant implications for scientific machine learning, particularly in fields where understanding the governing dynamics of systems is crucial. The established bounds and metrics can guide the development of algorithms for identifying governing equations from data, potentially enhancing predictive modeling in various scientific and engineering applications.
EMA-FS: Accelerating GBDT Training via Gain-Informed Feature Screening
Efficient ML
- EMA-FS optimizes GBDT training by focusing histogram construction on high-gain features.
- The method achieves significant speedups (up to 2.61x) while maintaining model accuracy.
- S-EMA-FS offers a flexible framework that combines deterministic and stochastic feature selection.
- EMA-FS is implemented in a compact manner, ensuring compatibility with existing LightGBM functionalities.
Summary
This paper introduces EMA-based Feature Screening (EMA-FS), a novel optimization technique for accelerating the training of Gradient Boosted Decision Trees (GBDT), particularly in the context of LightGBM. The primary bottleneck in GBDT training is the time spent on constructing per-feature histograms, which accounts for 65-70% of the total training time. Existing methods like random feature subsampling discard features without considering their predictive power. EMA-FS addresses this by maintaining an exponential moving average (EMA) of per-feature split gains across boosting iterations. After a brief warmup period, it restricts histogram construction to the top-K features ranked by historical gain, thus retaining high-gain features while discarding low-gain ones. This method is compatible with LightGBM's existing routines and requires no changes to core algorithms. The paper also introduces Stochastic EMA-FS (S-EMA-FS), which generalizes EMA-FS by incorporating gain-weighted random sampling, allowing for a balance between informed feature selection and ensemble diversity. The authors evaluate EMA-FS on five datasets, demonstrating significant speedups and improved model performance, especially in dense feature scenarios.
Methodology
The authors propose EMA-FS, which tracks per-feature split gains using an exponential moving average. After a warmup period, it restricts histogram construction to the top-K features based on historical gain. S-EMA-FS extends this by allowing gain-weighted random sampling, parameterized by a concentration parameter, unifying deterministic and random feature selection methods.
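The screening logic can be sketched independently of LightGBM. The code below is a standalone illustration, not the authors' implementation: the decay `beta`, the function names, and the choice of weights `ema ** concentration` for the stochastic variant are assumptions made for this example.

```python
import random

def update_ema(ema, gains, beta=0.9):
    """Blend this iteration's per-feature split gains into the running average."""
    return [beta * e + (1.0 - beta) * g for e, g in zip(ema, gains)]

def top_k_features(ema, k):
    """Deterministic screening: keep the k features with the highest EMA gain."""
    ranked = sorted(range(len(ema)), key=lambda i: ema[i], reverse=True)
    return sorted(ranked[:k])

def sample_features(ema, k, concentration=1.0, rng=random):
    """Stochastic screening: draw k distinct features with gain-based weights.

    Assumed parameterization: concentration 0 gives uniform subsampling,
    large values approach the deterministic top-k rule.
    """
    weights = [max(e, 0.0) ** concentration + 1e-12 for e in ema]
    candidates = list(range(len(ema)))
    chosen = []
    for _ in range(min(k, len(candidates))):
        pick = rng.choices(range(len(candidates)), weights=weights)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return sorted(chosen)

ema = update_ema([0.0, 0.0, 0.0, 0.0], [4.0, 0.0, 2.0, 1.0])
kept = top_k_features(ema, 2)                       # features 0 and 2
drawn = sample_features(ema, 2, concentration=1.0, rng=random.Random(0))
```

Only the features in `kept` (or `drawn`) would have histograms built in the next boosting iteration, which is where the training-time savings come from.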
Results
EMA-FS demonstrated significant speed improvements, achieving a 2.61x speedup on a synthetic benchmark with 500 features and a 1.45x speedup on the IEEE-CIS Fraud Detection dataset with 432 features at 30% feature retention. At 70% retention on the synthetic benchmark, it improved AUC by 0.11 points while providing a 1.34x speedup. However, it showed no measurable speedup on extremely sparse datasets.
Implications
EMA-FS and S-EMA-FS can significantly enhance the efficiency of GBDT training, making it more feasible for large-scale applications in various domains, including financial fraud detection and advertising. The methods can lead to faster model training times without sacrificing accuracy, which is crucial for real-time applications.
SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference
Large Language Models
Efficient ML
Multimodal
- SharQ combines activation sparsity and FP4 quantization without requiring training or calibration data.
- The method uses an online N:M mask to create an outlier-dominated sparse backbone for quantization.
- It defines a dense residual relative to the quantized sparse backbone, improving accuracy and efficiency.
- SharQ achieves significant reductions in latency and improvements in throughput for LLM inference.
Summary
The paper introduces SharQ, a training-free method designed to make large language model (LLM) inference more efficient by combining activation sparsity with FP4 quantization. Traditional methods struggle to apply FP4 quantization to activations due to input-dependent outliers that dominate block scales, leading to significant quantization errors. SharQ addresses this with an online sparse-dense decomposition. For each activation tensor, it generates an input-adaptive N:M mask that identifies an outlier-dominated sparse backbone, which is then quantized to FP4. A dense residual is defined relative to this quantized sparse backbone, allowing for a more accurate representation of the activation values. The method utilizes a sparse FP4 GEMM for processing the backbone and a dense FP4 GEMM to compensate for both the activation loss from the mask and the quantization error. This dual-path approach shares a single FP4 weight payload and integrates multiple operations into a single kernel, enhancing computational efficiency. SharQ does not require calibration data or model-specific tuning, making it versatile across various LLM architectures. It recovers 43-63% of the accuracy gap between NVFP4 and FP16 across multiple tasks, while reducing latency, improving throughput, and providing substantial speedups in video generation tasks.
Methodology
SharQ employs an online sparse-dense decomposition method that generates an input-adaptive N:M mask for each activation tensor. This mask identifies a sparse backbone of outlier-dominated values, which is quantized to FP4. A dense residual is constructed relative to this quantized backbone, allowing for compensation of both sparsification loss and quantization error. The method integrates multiple operations into a single fused kernel, optimizing performance on modern hardware.
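The decomposition can be illustrated on a short vector. This sketch makes several simplifying assumptions that are not from the paper: the scale is computed per M-element group rather than per hardware block, the scale itself is kept in full precision, and the residual is left unquantized so the identity `backbone + residual == x` is easy to check. The FP4 levels are the standard E2M1 magnitudes.

```python
FP4_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 representable magnitudes

def nm_mask(block, n):
    """Keep the n largest-magnitude entries of one length-M block."""
    keep = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)[:n]
    return [1 if i in keep else 0 for i in range(len(block))]

def quantize_fp4(values):
    """Scale so the peak maps to 6.0, snap to the nearest FP4 level, scale back."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / 6.0
    out = []
    for v in values:
        level = min(FP4_LEVELS, key=lambda q: abs(abs(v) / scale - q))
        out.append((level if v >= 0 else -level) * scale)
    return out

def sparse_dense_split(x, n=2, m=4):
    """Return (quantized sparse backbone, dense residual) for a flat activation."""
    backbone, residual = [], []
    for start in range(0, len(x), m):
        block = x[start:start + m]
        mask = nm_mask(block, n)
        sparse_q = quantize_fp4([v * k for v, k in zip(block, mask)])
        backbone.extend(sparse_q)
        # Residual covers both the masked-out entries and the backbone's rounding error.
        residual.extend(v - q for v, q in zip(block, sparse_q))
    return backbone, residual

x = [12.0, 0.1, -0.2, 3.0, 0.5, -8.0, 0.25, 1.0]
backbone, residual = sparse_dense_split(x)
```

Once the outliers sit in the backbone, the residual has a much smaller dynamic range, so quantizing it to FP4 in a second pass loses far less than quantizing the original tensor directly.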
Results
SharQ was evaluated on various models, including Llama-3.1-8B and Qwen3-30B-A3B, recovering 43-63% of the accuracy gap between NVFP4 and FP16 across language and vision-language tasks. It achieved 2.2-2.4x latency reduction over FP16 and 1.2-1.4x throughput improvement over FP8 in language model serving, with up to 1.58x speedup in video generation tasks.
Implications
SharQ's approach could significantly enhance the efficiency of LLM inference, making it feasible to deploy larger models in real-time applications. Its ability to generalize across different architectures and formats suggests potential for widespread adoption in both language and multimodal tasks.
The Geometry of Updates: Fisher Alignment at Vocabulary Scale
Large Language Models
Theory
Optimization
- Fisher alignment is shown to be non-identifiable using representation-only metrics without assumptions on error geometry.
- FisherSketch provides a practical method for estimating head Fisher alignment at vocabulary scale with minimal computational overhead.
- The method allows for effective source selection and diagnostic analysis of task similarity in LLMs.
- FisherSketch outperforms traditional activation-only metrics in various experimental settings, demonstrating its utility in transfer learning.
Summary
This paper addresses the challenge of training-free source selection for large language models (LLMs) that share vocabularies but differ in prediction targets, particularly in scientific domains like SMILES, protein, and genomic sequences. The author identifies an 'activation-dark regime' where traditional representation-similarity metrics fail to provide informative insights without assumptions about label-conditioned error geometry. The paper introduces FisherSketch, a novel method that estimates head Fisher alignment as a cosine between kernel mean embeddings in the joint activation-error space, allowing for practical computation of Fisher alignment at vocabulary scale. FisherSketch produces compact task signatures that facilitate source selection and diagnostic analysis of task similarity in LLMs. The results demonstrate that FisherSketch outperforms activation-only baselines in various settings, including verbalizer shifts and cross-domain evaluations, highlighting its effectiveness in capturing update geometry that representation-only metrics cannot. The findings suggest that understanding the geometry of updates is crucial for improving transfer learning in LLMs.
Methodology
The paper develops FisherSketch, a one-pass streaming random-feature estimator that calculates head Fisher alignment as a cosine between joint activation and error mean embeddings. This approach avoids the need to materialize large error moments, producing compact task signatures that can be used for source selection and diagnostic purposes.
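A minimal sketch of the streaming idea, not the paper's estimator: it assumes a product linear kernel on (activation, error) pairs and uses Gaussian random projections, so each feature is unbiased for the kernel value (h·h')(e·e'). The names make_projections, task_signature, and cosine are hypothetical.

```python
import math
import random

def make_projections(dim_h, dim_e, num_features, seed=0):
    """Draw Gaussian projection vectors for activations (R) and errors (S)."""
    rng = random.Random(seed)
    R = [[rng.gauss(0.0, 1.0) for _ in range(dim_h)] for _ in range(num_features)]
    S = [[rng.gauss(0.0, 1.0) for _ in range(dim_e)] for _ in range(num_features)]
    return R, S

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def task_signature(stream, R, S):
    """One pass over (activation, error) pairs; returns the mean random feature.

    Feature j is (R[j].h) * (S[j].e), so the V x d outer product of error and
    activation is never materialized (assumed product linear kernel).
    """
    totals = [0.0] * len(R)
    count = 0
    for h, e in stream:
        for j in range(len(R)):
            totals[j] += dot(R[j], h) * dot(S[j], e)
        count += 1
    return [t / count for t in totals]

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
```

Two tasks are compared by building one signature each with the same projections and taking the cosine; a task whose errors are sign-flipped yields a cosine of -1 with the original.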
Results
FisherSketch achieves a top-1 accuracy of 66.7% in verbalizer shift tasks, significantly outperforming activation-only baselines that collapse to random performance. In evaluations across 100 domains with Llama-3.1-8B, FisherSketch remains competitive with activation-only methods, and in molecular SMILES domains, it correlates with cross-domain perplexity reduction, while representation-only similarity does not.
Implications
The findings suggest that FisherSketch can enhance the transfer learning capabilities of LLMs by providing a more nuanced understanding of task similarity and update geometry. This has potential applications in various scientific domains where LLMs are applied, improving model adaptation and performance.
Cross-Head Attention Uplift Network with Inverse Propensity Score under Unobserved Confounding
Theory
- Introduction of CHAUN, which leverages cross-head attention for better inter-group correlation modeling.
- Theoretical proof that true propensity scores ensure ITE identifiability despite unobserved confounders.
- RA-IPS method optimizes propensity weights to mitigate bias from unobserved variables.
- CHAUN shows significant performance improvements over state-of-the-art uplift models.
Summary
This paper addresses the challenges in uplift modeling, particularly in estimating individual treatment effects (ITE) under unobserved confounding. The authors propose the Cross-Head Attention Uplift Network (CHAUN) and the Robust Adversarial Inverse Propensity Score (RA-IPS) method. CHAUN utilizes shared feature embeddings and cross-head attention mechanisms to integrate treatment-specific and control-specific representations, enhancing the modeling of inter-group correlations. The paper theoretically establishes that true propensity scores can ensure ITE identifiability despite unobserved confounders. In scenarios lacking true propensity scores, RA-IPS optimizes propensity weights adversarially within uncertainty sets to reduce bias from unobserved variables. Experimental results on public datasets (CRITEO-UPLIFT, LAZADA) and a production e-commerce dataset demonstrate that CHAUN outperforms existing uplift models, achieving up to 25.6% improvement in QINI scores. RA-IPS further enhances robustness, outperforming standard IPS by 5.4% in the presence of unobserved confounding, validating the effectiveness of the proposed methods in real-world causal inference tasks.
Methodology
The methodology involves the development of CHAUN, which employs shared feature embeddings and attention mechanisms to create treatment-specific and control-specific representations. The RA-IPS method is introduced to optimize propensity weights adversarially, addressing selection bias due to unobserved confounders.
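For orientation, the standard inverse propensity score estimate of average uplift can be sketched as follows. This is the textbook IPS baseline, not RA-IPS: the adversarial reweighting within uncertainty sets is not modeled, and the function name ips_uplift is hypothetical.

```python
def ips_uplift(treatments, outcomes, propensities):
    """Standard IPS estimate of the average treatment effect.

    treatments: 0/1 indicators, outcomes: observed responses,
    propensities: P(treatment = 1 | features), each strictly between 0 and 1.
    """
    total = 0.0
    for t, y, p in zip(treatments, outcomes, propensities):
        if not 0.0 < p < 1.0:
            raise ValueError("propensities must lie strictly between 0 and 1")
        # Treated units are up-weighted by 1/p, control units by 1/(1 - p).
        total += t * y / p - (1 - t) * y / (1.0 - p)
    return total / len(outcomes)
```

With a constant propensity of 0.5 and balanced groups, the estimate reduces to the difference between the treated and control mean outcomes.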
Results
CHAUN achieved up to 25.6% improvement in QINI scores compared to existing uplift models. RA-IPS demonstrated a 5.4% performance enhancement over standard inverse propensity score methods under conditions of unobserved confounding.
Implications
The proposed methods can significantly improve causal inference in various applications, including marketing and treatment effect estimation, where unobserved confounding is a common challenge. This could lead to more effective precision intervention strategies in real-world scenarios.
Learning Probabilistic Filters with Strictly Proper Scoring Rules
Theory
Time Series
Optimization
- Introduction of the Proper Scoring Ensemble Filter (PSEF) for Bayesian filtering.
- Training based on strictly proper scoring rules to enhance probabilistic accuracy.
- Theoretical foundation linking the population objective to the true Bayesian filtering distribution.
- Numerical experiments demonstrate superior performance in challenging filtering scenarios.
Summary
This paper introduces the Proper Scoring Ensemble Filter (PSEF), a novel ensemble data assimilation method aimed at approximating the Bayesian filtering distribution of partially and noisily observed dynamical systems. The authors address the challenge of learning the true filtering distribution, which is often unavailable for supervised learning, by leveraging synthetic state-observation trajectories generated from a forecast model. The PSEF employs a permutation-invariant, transformer-based analysis map that processes forecast ensembles and observations to produce an analysis ensemble. The training is grounded in strictly proper scoring rules, specifically utilizing the energy score, which encourages probabilistic accuracy across the entire distribution rather than focusing solely on the ensemble mean. The theoretical foundation of the PSEF is established under a realizability assumption, demonstrating that the population objective is minimized by the true Bayesian filtering distribution. The paper also derives the finite-ensemble empirical objective used in training and connects it to the population objective through a mean-field consistency argument. Numerical experiments validate the effectiveness of the learned filter in accurately approximating complex filtering distributions, including nonlinear, non-Gaussian, and multi-modal posteriors, outperforming classical and other learning-based methods. The results indicate that for Gaussian problems, a correction to the Ensemble Kalman Filter (EnKF) is optimal, while for highly non-Gaussian scenarios, an end-to-end approach without inductive bias yields superior results.
Methodology
The PSEF utilizes a transformer-based, permutation-invariant analysis map to process forecast ensembles and observations. Training is conducted using synthetic state-observation trajectories, with a focus on strictly proper scoring rules to ensure accurate probabilistic representations.
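The energy score named in the summary can be estimated from a finite ensemble. The sketch below uses the common 1/m² form of the spread term; whether the paper uses this or a bias-corrected variant is an assumption, and the function name energy_score is hypothetical.

```python
import math

def energy_score(ensemble, truth):
    """Empirical energy score: mean ||x_i - y|| minus half the mean ||x_i - x_j||.

    Lower is better. The second term rewards ensemble spread, so the score
    assesses the whole distribution rather than only the ensemble mean.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    m = len(ensemble)
    accuracy = sum(dist(x, truth) for x in ensemble) / m
    spread = sum(dist(a, b) for a in ensemble for b in ensemble) / (2 * m * m)
    return accuracy - spread
```

For example, a two-member ensemble at 0 and 2 scored against a true state of 1 gives 1 - 0.5 = 0.5.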
Results
The PSEF effectively approximates complex filtering distributions, achieving better performance than classical methods and other learning-based approaches. It shows particular strength in handling nonlinear, non-Gaussian, and multi-modal posteriors, with distinct strategies for Gaussian and non-Gaussian problems.
Implications
The PSEF framework has significant implications for data assimilation tasks in various fields, including meteorology and other dynamical systems, where accurate uncertainty quantification is crucial. Its ability to learn from synthetic data could lead to advancements in real-time filtering applications.
Symplectic Neural Networks for learning Generalized Hamiltonians
Theory
Efficient ML
Robotics
- Development of a neural framework for learning generalized Hamiltonians from noisy trajectory observations without structural bias.
- Demonstration of the Hamiltonian Neural Network's (HNN's) ability to generalize to out-of-distribution data and learn governing Hamiltonians for chaotic systems.
- Introduction of an efficient gradient computation method using adjoint sensitivity equations derived from symplectic discretizations.
- Application of backward error analysis to improve the accuracy of the learned Hamiltonian.
Summary
This paper presents a novel approach to learning generalized Hamiltonians using Hamiltonian Neural Networks (HNNs) that integrate physical principles into neural models. The authors address the challenge of identifying Hamiltonians from noisy observations of state variables, emphasizing the importance of symplectic integrators for preserving the geometric structure and energy conservation in Hamiltonian systems. The proposed method leverages symplectic discretizations of the adjoint system to efficiently compute gradients for training the neural network, circumventing the computational difficulties associated with implicit symplectic integrators. The authors demonstrate the effectiveness of their approach through numerical experiments on various chaotic systems, showcasing improvements in system identification and energy preservation. Additionally, they introduce a backward error analysis technique that enhances the accuracy of the learned Hamiltonian without requiring more precise discretizations. Overall, the work highlights the potential of HNNs in capturing the dynamics of complex systems while maintaining computational efficiency.
Methodology
The authors propose a Hamiltonian Neural Network architecture that incorporates an implicit symplectic integrator, utilizing a predictor-corrector method for the forward pass. For the backward pass, they employ adjoint sensitivity equations to compute gradients efficiently, allowing for effective training of the neural network parameters despite the challenges posed by implicit integration methods.
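To illustrate the forward pass, here is a rough sketch of one implicit symplectic step solved in predictor-corrector style. It assumes the implicit midpoint rule, an explicit Euler predictor, fixed-point correction, and a one-dimensional system; the paper's actual integrator and solver may differ, and the function names are hypothetical.

```python
def oscillator_grad(q, p):
    """Gradient (dH/dq, dH/dp) of the example Hamiltonian H = (q^2 + p^2) / 2."""
    return q, p

def implicit_midpoint_step(q, p, h, grad_H, max_iters=50, tol=1e-14):
    """One implicit midpoint step for dq/dt = dH/dp, dp/dt = -dH/dq."""
    # Predictor: explicit Euler guess for the new state.
    dHdq, dHdp = grad_H(q, p)
    q_new, p_new = q + h * dHdp, p - h * dHdq
    # Corrector: fixed-point iteration on the implicit midpoint equations.
    for _ in range(max_iters):
        dHdq, dHdp = grad_H(0.5 * (q + q_new), 0.5 * (p + p_new))
        q_next, p_next = q + h * dHdp, p - h * dHdq
        converged = abs(q_next - q_new) + abs(p_next - p_new) < tol
        q_new, p_new = q_next, p_next
        if converged:
            break
    return q_new, p_new
```

On the harmonic oscillator the energy stays at its initial value up to solver tolerance, since the implicit midpoint rule preserves quadratic invariants.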
Results
The experiments conducted demonstrate significant improvements in identifying Hamiltonians from noisy data, with the HNN achieving better performance in energy conservation and system dynamics representation compared to traditional methods. The backward error analysis further enhances the learned Hamiltonian's accuracy, providing a more reliable approximation of the true Hamiltonian.
Implications
This research has potential applications in physics-informed machine learning, particularly in fields requiring accurate modeling of dynamical systems, such as robotics, control systems, and simulations of physical phenomena. The ability to learn Hamiltonians from noisy data could lead to advancements in understanding complex systems and improving predictive models.
Epiphany-Aware KV Cache Eviction Without the Attention Matrix
Large Language Models
Efficient ML
NLP
- EPIKV scores tokens based on internal representation changes rather than attention weights, leading to better eviction decisions.
- The method scales to contexts 16 times longer than traditional attention-based methods without exhausting GPU memory.
- EPIKV matches or exceeds the performance of existing attention-based eviction methods on benchmark datasets.
- The approach runs up to 2.8 times faster than traditional methods while maintaining comparable eviction quality.
Summary
This paper addresses the challenges posed by key-value (KV) cache eviction in reasoning models that generate extensive chains of thought, often leading to memory bottlenecks during deployment. Traditional methods rely on attention weights to rank token importance, which can be noisy and computationally expensive due to the need to materialize the attention matrix. The authors propose a novel approach called epiphany-aware KV cache eviction (EPIKV), which scores tokens based on the change in the model's internal representation during the forward pass, eliminating the need for the attention matrix. This method not only scales to longer contexts but also operates without requiring additional training or custom kernels, making it suitable for production environments. The paper demonstrates that EPIKV achieves competitive performance on benchmark tasks while significantly improving speed and memory efficiency compared to attention-based methods.
Methodology
The authors introduce EPIKV, which evaluates token importance by measuring changes in the model's hidden states during the forward pass. They identify two bands in the model's layers that correlate with token importance and apply a causal rolling z-score to enhance eviction quality. The method is designed to be compatible with existing FlashAttention inference stacks, allowing for seamless integration into production systems.
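The causal rolling z-score can be sketched as below. This assumes each token already has a scalar score (for example, the size of its hidden-state change) and that normalization uses only the preceding window of tokens; the window size, the epsilon, and the function names are hypothetical choices, not details from the paper.

```python
import math

def causal_rolling_zscore(scores, window, eps=1e-6):
    """Z-score each token's score against the previous `window` scores only."""
    z = []
    for t, s in enumerate(scores):
        past = scores[max(0, t - window):t]
        if not past:
            z.append(0.0)  # no history yet, so no evidence either way
            continue
        mean = sum(past) / len(past)
        var = sum((x - mean) ** 2 for x in past) / len(past)
        z.append((s - mean) / (math.sqrt(var) + eps))
    return z

def tokens_to_keep(z, budget):
    """Indices of the `budget` highest-scoring tokens, in position order."""
    ranked = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    return sorted(ranked[:budget])
```

Because the statistics look only backward, the score for a token never depends on tokens generated after it, which suits streaming eviction during decoding.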
Results
EPIKV achieves 72% accuracy on the MATH-500 benchmark with a 4096-token cache, comparable to the best attention-based method (ThinKV at 71%). Additionally, a lag-normalized variant of EPIKV reaches 37% on the AIME-2024 benchmark with an 8192-token cache, outperforming the best attention-based method (33%) while operating at up to 2.8 times the speed.
Implications
The findings suggest that EPIKV can significantly enhance the efficiency of reasoning models in real-world applications, particularly in scenarios requiring long context processing. This could lead to broader adoption of reasoning models in various domains, including education, problem-solving, and automated reasoning tasks.
Can Large Language Models Reliably Code Qualitative Humanitarian Data? A Benchmark Study Against Human Expert Adjudication
NLP
Large Language Models
- First direct evaluation of LLM coding reliability on humanitarian-context data.
- Multi-stage evaluation framework extending conventional inter-coder reliability testing.
- Performance variation among LLMs based on humanitarian themes and reasoning effort.
- LLMs can enhance humanitarian analytical capacity but should not replace human judgment.
Summary
This paper presents a benchmark study evaluating the reliability of 46 large language models (LLMs) in coding qualitative humanitarian data, compared to human expert adjudication. The study addresses the critical need for timely and consistent interpretation of qualitative data from crisis-affected populations, which is essential for effective humanitarian response. Given the resource constraints in humanitarian organizations, LLMs are explored as a potential solution for scaling qualitative data analysis. The evaluation utilized a dataset of 150 synthetic humanitarian transcripts, designed to reflect real-world complexities, and involved inter-rater reliability testing, discrepancy analysis, and qualitative assessments based on humanitarian-specific criteria. The findings indicate that several LLMs can achieve reliability levels comparable to experienced human coders, particularly when structured prompts and reasoning-enabled configurations are employed. However, the study also highlights that aggregate reliability metrics alone are insufficient for deployment decisions, as models exhibited variability in recognizing nuanced needs and protection-related concerns. The authors conclude that while LLMs can enhance analytical capacity in humanitarian contexts, they should not replace human judgment, and their deployment requires careful consideration of structured codebooks and oversight mechanisms.
Methodology
The study employed a systematic evaluation framework involving inter-rater reliability testing using Krippendorff’s alpha, discrepancy analysis to classify coding accuracy, and qualitative assessments tailored to humanitarian contexts. A dataset of 150 synthetic transcripts was used for rigorous evaluation.
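Krippendorff's alpha for nominal codes can be computed from a coincidence matrix. The sketch below assumes nominal categories and skips units with fewer than two ratings; the function name is hypothetical and the study's exact configuration is not stated in the summary.

```python
from collections import defaultdict

def krippendorff_alpha_nominal(units):
    """units: list of lists, each holding the codes assigned to one unit."""
    coincidence = defaultdict(float)
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit coded once carries no agreement information
        for i, c in enumerate(ratings):
            for j, k in enumerate(ratings):
                if i != j:
                    coincidence[(c, k)] += 1.0 / (m - 1)
    totals = defaultdict(float)
    for (c, _), value in coincidence.items():
        totals[c] += value
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one unit with two or more ratings")
    observed = sum(v for (c, k), v in coincidence.items() if c != k) / n
    expected = sum(totals[c] * totals[k]
                   for c in totals for k in totals if c != k) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - observed / expected
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance; the same function can score a model against a human coder by treating the model as one rater.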
Results
The results demonstrated that multiple LLMs achieved reliability levels comparable to human coders in deductive coding tasks. However, performance varied significantly across different models, particularly in recognizing indirect needs and protection-related issues. Aggregate reliability metrics were found to be insufficient for making deployment decisions.
Implications
The findings suggest that LLMs can significantly augment the analytical capacity of humanitarian organizations, but careful implementation is necessary to ensure accountability and accuracy in coding. Structured codebooks and oversight are essential to mitigate risks associated with misclassification, especially in sensitive contexts.
Asymptotically Optimal Learning for Parametric Prophet Inequalities
Theory
Optimization
- Characterization of optimal full-information asymptotic competitive ratios for exponential-type parametric families.
- Development of a confidence-based dynamic programming policy for online learning that does not require offline samples.
- Derivation of distribution-specific convergence guarantees for various reward distributions.
- Numerical experiments demonstrate the effectiveness of the proposed algorithm in practical scenarios.
Summary
This paper investigates the learning of prophet inequalities with independent and identically distributed (i.i.d.) rewards drawn from an exponential-type parametric family with an unknown parameter θ. The authors first characterize the optimal full-information asymptotic competitive ratio for this family, revealing that in the unbounded-support case, the limit is governed by the endpoint-growth parameter c+, while in the bounded-support case, the limit is 1. They propose a confidence-based dynamic programming policy for online learning that estimates the unknown distribution parameter from online observations alone, achieving the same optimal asymptotic competitive ratio as the full-information case without requiring offline samples. The paper also derives distribution-specific convergence rates for canonical examples, demonstrating that the proposed policy matches the convergence rates of full-information optimal policies for exponential and bounded-support distributions, and provides a Pareto-specific convergence guarantee for heavy-tailed rewards. Numerical experiments validate the performance of the algorithm, highlighting the influence of tail or endpoint structure on convergence speed.
Methodology
The authors propose a confidence-based dynamic programming policy that first estimates the unknown distribution parameter through an initial exploration phase. It constructs an upper confidence bound and applies corresponding plug-in dynamic programming thresholds to achieve optimal competitive ratios using only online observations.
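As an illustration of plug-in dynamic programming thresholds, the sketch below handles exponential rewards with a given rate, using the identity E[max(X, v)] = v + exp(-rate·v)/rate. The paper's exploration phase and upper confidence bound are not modeled; the rate here is a plain maximum-likelihood plug-in, and both function names are hypothetical.

```python
import math

def plug_in_rate(samples):
    """Maximum-likelihood rate of an exponential distribution (1 / sample mean)."""
    return len(samples) / sum(samples)

def exponential_dp_thresholds(horizon, rate):
    """Acceptance thresholds for i.i.d. Exp(rate) rewards via backward induction.

    values[k] is the expected reward of the optimal policy with k draws left,
    so the threshold at step t is the value of continuing with horizon - t draws.
    """
    values = [0.0]
    for _ in range(horizon - 1):
        v = values[-1]
        values.append(v + math.exp(-rate * v) / rate)
    return [values[horizon - t] for t in range(1, horizon + 1)]
```

The policy accepts the reward at step t if it is at least the t-th threshold; the final threshold is 0, so the last reward is always accepted.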
Results
The paper establishes that the proposed policy achieves the same asymptotic competitive ratio as the optimal known-parameter policy. It also provides refined convergence guarantees for exponential, Pareto, and bounded-support power-family distributions, showing that the policy matches the convergence rates of full-information optimal policies in specific cases.
Implications
The findings suggest that it is possible to effectively learn and make optimal decisions in online settings with unknown reward distributions, which has significant implications for applications in areas such as online advertising, labor market hiring processes, and other decision-making scenarios where full distributional knowledge is not available.
Deterministic Pareto-Optimal Policy Synthesis for Multi-Objective Reinforcement Learning
Reinforcement Learning
Optimization
Theory
- Introduction of a novel preference-conditioned Bellman operator for MORL.
- Proof of convergence to the Pareto-optimal values for deterministic policies.
- Extraction of deterministic policies from converged Q-estimates covering the Pareto frontier.
- Empirical validation showing the algorithm's ability to recover complex trade-offs.
Summary
This paper addresses the challenge of balancing multiple conflicting objectives in real-world decision-making through a novel approach in Multi-Objective Reinforcement Learning (MORL). Traditional methods often rely on aggregating rewards into a single scalar signal, which can overlook the full spectrum of optimal trade-offs represented by the Pareto frontier. The authors introduce a preference-conditioned Bellman operator based on Chebyshev scalarization, which computes deterministic Pareto-optimal policies for Multi-Objective Markov Decision Processes (MOMDPs). They prove that this operator satisfies an enveloping property, ensuring that the estimated value functions upper-bound the true Pareto frontier and converge to a coverage set of this frontier. The methodology allows for the extraction of deterministic policies from the converged Q-estimates, ensuring that the agent can recover a policy for any given preference while maintaining approximate Pareto-optimality. Experimental results demonstrate the algorithm's effectiveness in recovering complex trade-offs and synthesizing deterministic Pareto-optimal policies, thus providing a comprehensive solution for multi-objective decision-making in reinforcement learning contexts.
Methodology
The authors develop a model-based Bellman operator parameterized by preference weights, leveraging Chebyshev scalarization. They derive error bounds for this operator and prove its convergence for deterministic, non-stationary policies in MOMDPs. The approach requires only a single-step transition memory to extract policies that cover the Pareto frontier.
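The scalarization step alone can be sketched as follows; this is not the paper's Bellman operator. It assumes a maximization setting with a reference (utopia) point and selects the action minimizing the largest weighted shortfall. A linear scalarization is included for contrast, and all names are hypothetical.

```python
def chebyshev_greedy_action(q_vectors, weights, reference):
    """Pick the action minimizing max_i w_i * (reference_i - Q_i(s, a))."""
    def cost(q):
        return max(w * (z - v) for w, z, v in zip(weights, reference, q))
    return min(range(len(q_vectors)), key=lambda a: cost(q_vectors[a]))

def linear_greedy_action(q_vectors, weights):
    """Pick the action maximizing the weighted sum of objectives."""
    def value(q):
        return sum(w * v for w, v in zip(weights, q))
    return max(range(len(q_vectors)), key=lambda a: value(q_vectors[a]))

# Three actions; the third is a balanced trade-off lying in a concave part
# of the Pareto front, which no weighted sum can select.
Q_EXAMPLE = [[1.0, 0.0], [0.0, 1.0], [0.45, 0.45]]
```

With equal weights and reference point (1, 1), the Chebyshev rule selects the balanced third action, while the weighted sum prefers one of the extremes. This is the usual motivation for Chebyshev scalarization over linear aggregation.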
Results
The experimental results validate that the proposed algorithm successfully converges to the Pareto frontier, capturing all trade-offs and recovering a set of deterministic Pareto-optimal policies for varying preferences. The results demonstrate that the synthesized policies are approximately non-dominated within the space of all non-stationary deterministic policies.
Implications
This work has significant implications for various domains requiring multi-objective decision-making, such as robotics, circuit design, and drug design, where balancing competing criteria is crucial. The ability to synthesize deterministic Pareto-optimal policies can enhance decision-making processes in complex environments.
Heavy-Ball Q-Learning with Residual Weighting Correction
Reinforcement Learning
Theory
Optimization
- Introduces a corrected heavy-ball Q-learning method for faster convergence.
- Establishes theoretical conditions for acceleration compared to standard Q-learning.
- Utilizes switched linear system (SLS) representation and joint spectral radius (JSR) for analysis.
- Extends findings to Q-learning with linear function approximation.
Summary
This paper introduces a corrected heavy-ball Q-learning method aimed at enhancing the convergence speed of reinforcement learning (RL) algorithms. The author establishes theoretical conditions under which this method converges faster than standard Q-learning, particularly in scenarios where the active greedy policy may change during the learning process. The proposed method modifies the basic heavy-ball Q-learning recursion to ensure that the mean mappings share a common eigenvector, allowing for a tractable analysis of convergence rates. The analysis employs a switched linear system (SLS) representation of Q-learning algorithms and utilizes the joint spectral radius (JSR) to derive insights into the acceleration of Q-learning through heavy-ball momentum. The findings extend to Q-learning with linear function approximation, maintaining analogous convergence and acceleration guarantees. The paper emphasizes the utility of the SLS framework in providing new perspectives on the dynamics of heavy-ball Q-learning and its potential for improved performance in RL tasks.
Methodology
The methodology involves modifying the heavy-ball Q-learning recursion to ensure that the mean mappings share a common eigenvector. The analysis is conducted using a switched linear system (SLS) framework, where the greedy action in the Bellman maximum determines the active mode of the SLS. The joint spectral radius (JSR) of the associated switching families is analyzed to derive convergence and acceleration results.
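For reference, the basic (uncorrected) heavy-ball recursion that the paper modifies can be sketched on a small deterministic MDP. The residual weighting correction is not shown, the update is synchronous and model-based for simplicity, and the MDP, step sizes, and function names are hypothetical.

```python
def bellman_optimality(Q, rewards, next_state, gamma):
    """Apply the Bellman optimality operator for deterministic transitions."""
    return [[rewards[s][a] + gamma * max(Q[next_state[s][a]])
             for a in range(len(Q[s]))] for s in range(len(Q))]

def heavy_ball_q_iteration(rewards, next_state, gamma, alpha, beta, steps):
    """Q_{k+1} = Q_k + alpha * (T Q_k - Q_k) + beta * (Q_k - Q_{k-1})."""
    Q_prev = [[0.0] * len(row) for row in rewards]
    Q = [row[:] for row in Q_prev]
    for _ in range(steps):
        TQ = bellman_optimality(Q, rewards, next_state, gamma)
        Q_next = [[Q[s][a] + alpha * (TQ[s][a] - Q[s][a])
                   + beta * (Q[s][a] - Q_prev[s][a])
                   for a in range(len(Q[s]))] for s in range(len(Q))]
        Q_prev, Q = Q, Q_next
    return Q

# Two states; action 0 stays, action 1 switches. Staying in state 1 pays 1.
REWARDS = [[0.0, 0.0], [1.0, 0.0]]
NEXT_STATE = [[0, 1], [1, 0]]
```

Here the momentum weight is kept small so that 2·beta < alpha·(1 - gamma), which is enough for a simple contraction argument to guarantee convergence; the paper's analysis of when momentum actually accelerates learning is more refined.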
Results
The paper demonstrates that the corrected heavy-ball Q-learning method converges faster than standard Q-learning under specific conditions. It provides a theoretical certificate for acceleration in one analytically identifiable direction, offering insights into the dynamics of Q-learning and the role of heavy-ball momentum.
Implications
The findings suggest that the corrected heavy-ball Q-learning method could be applied to improve the efficiency of RL algorithms, particularly in environments where the policy changes dynamically. This could lead to faster learning and better performance in various RL applications.
CascadeFormer: Depth-Tapered Transformers Motivated by Gradient Fan-in Asymmetry
NLP
Large Language Models
Efficient ML
- CascadeFormer architecture tapers width with depth to optimize information flow.
- Gradient Fan-in Asymmetry (GFA) is proposed as a structural explanation for layer redundancy in deep transformers.
- CascadeFlow Pruning (CFP) effectively prunes layers based on training gradients, outperforming traditional heuristics.
- Empirical tests validate the GFA hypothesis, showing that structural factors, not just gradient magnitude, affect layer importance.
Summary
The paper introduces CascadeFormer, a novel transformer architecture that addresses the inefficiencies of deep transformers by tapering their width with depth, inspired by the concept of Gradient Fan-in Asymmetry (GFA). The authors argue that deeper layers in traditional transformers contribute less due to a structural bottleneck in gradient diversity rather than merely gradient magnitude. They propose two methods: CascadeFormer, which optimizes the model's architecture to align with the natural flow of information, and CascadeFlow Pruning (CFP), which prunes layers based on accumulated training gradients without requiring extensive post hoc analysis. Empirical evidence supports the GFA hypothesis, demonstrating that early layers receive richer gradients while deeper layers suffer from diminishing returns. The proposed methods achieve comparable perplexity to uniform baselines while improving latency and throughput, indicating a significant advancement in transformer efficiency.
Methodology
The authors conducted empirical studies on transformer models trained from scratch, analyzing the relationship between per-layer gradient norms and functional importance. They performed interventional tests to assess the impact of gradient magnitude and structural changes on layer contributions. The CascadeFormer architecture was designed to taper width with depth, while CascadeFlow Pruning utilized accumulated training gradients for layer pruning.
Results
The CascadeFormer achieved an 8.6% reduction in latency and a 9.4% increase in throughput compared to a uniform baseline, while maintaining comparable perplexity. CascadeFlow Pruning outperformed standard pruning heuristics in terms of perplexity and rank-stability, demonstrating its effectiveness in enhancing model efficiency.
Implications
The findings suggest that transformer architectures can be significantly optimized by considering the structural dynamics of gradients, leading to more efficient models suitable for large-scale applications in natural language processing and beyond. The proposed methods could facilitate the development of lighter, faster models without sacrificing performance.
A Causal Foundation Model for Structure and Outcome Prediction
Graph Learning
Theory
Interpretability
- TabPFN-CFM predicts causal structures and outcomes from observational data.
- Supports all three levels of Pearl's Causal Hierarchy.
- Utilizes known graph structures to improve prediction accuracy.
- Employs a refined training procedure that enhances efficiency.
Summary
The paper introduces TabPFN-CFM, a causal foundation model designed to predict both causal structures and outcomes from observational data. It addresses the challenge of determining causal relationships and outcomes by supporting queries across all three levels of Pearl's Causal Hierarchy. The model is trained on synthetic datasets and demonstrates improved performance on real datasets compared to existing baselines. TabPFN-CFM leverages known graph structures to enhance predictions and employs a refined training procedure that increases training efficiency by nearly four times. The model's architecture incorporates Bayesian principles and uses Acyclic Directed Mixed Graphs (ADMGs) to account for unobserved confounding variables, thereby improving the accuracy of causal structure predictions. The training process involves generating observational data from a variety of structural causal models (SCMs), ensuring the model's robustness and generalization capabilities.
Methodology
The model follows a Bayesian PFN framework, estimating causal queries and graph structures by training on synthetic data generated from a diverse set of SCMs. It uses a transformer architecture that incorporates graph conditioning and employs row-wise and column-wise attention mechanisms to enhance prediction accuracy.
Results
TabPFN-CFM shows significant improvements in predicting both causal structures and outcomes compared to existing methods. It effectively generalizes from synthetic training data to real-world datasets, demonstrating enhanced performance metrics across various causal prediction tasks.
Implications
The model has potential applications in fields requiring causal inference, such as epidemiology, economics, and social sciences. Its ability to predict unobserved confounding can lead to better decision-making and policy formulation based on causal insights.
How Good Can Linear Models Be for Time-Series Forecasting?
Time Series
- Ridge regression, when carefully tuned, outperforms prior linear forecasting models and matches or exceeds transformer and MLP architectures on several benchmarks.
- The optimal lookback period for forecasting is highly specific to the dataset and often non-monotonic with respect to the forecast horizon.
- Normalizing over a learned trailing fraction of the context improves forecasting accuracy compared to using the entire context.
- Different time series within the same dataset may require distinct hyperparameters, indicating the importance of series heterogeneity.
Summary
This paper challenges the prevailing notion that larger model architectures, such as transformers, are necessary for effective time-series forecasting. Instead, the authors argue that significant improvements can be achieved through careful preprocessing and hyperparameter tuning of simpler models, specifically Ridge regression. The study systematically investigates the effects of context length, normalization strategies, regularization, and data augmentation across eight standard benchmarks. The findings reveal that optimal lookback periods are often series-specific and non-monotonic, suggesting that longer forecasting horizons do not always require longer historical data. Additionally, normalizing over a learned trailing fraction of the context is generally more effective than using the entire context. The results indicate that the optimal hyperparameters provide insights into the underlying data structures, which can inform the development of more complex forecasting models. The authors also introduce SearchCast, a reproducible pipeline for hyperparameter optimization in time-series forecasting.
Methodology
The authors employed Ridge regression as a testbed for their experiments, conducting a systematic search over various hyperparameters including context length, normalization strategies, regularization strength, and data augmentation. They analyzed the performance across eight standard benchmarks, focusing on per-horizon and per-series granularity.
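A bare-bones version of a Ridge forecaster with a tunable lookback might look like this. It is a one-step-ahead sketch with no intercept, normalization, or augmentation, so it covers only two of the hyperparameters the study searches over; it is not the SearchCast pipeline, and all function names are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_ridge_forecaster(series, lookback, ridge):
    """Fit weights w minimizing ||X w - y||^2 + ridge * ||w||^2 on lagged windows."""
    X = [series[t - lookback:t] for t in range(lookback, len(series))]
    y = [series[t] for t in range(lookback, len(series))]
    gram = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
             for j in range(lookback)] for i in range(lookback)]
    rhs = [sum(row[i] * target for row, target in zip(X, y))
           for i in range(lookback)]
    return solve_linear_system(gram, rhs)

def forecast_next(series, weights):
    """Predict the next value from the last len(weights) observations."""
    context = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, context))
```

On a linear trend with a lookback of 2 and a very small ridge penalty, the fitted weights approach (-1, 2), the exact extrapolation rule for that series. A hyperparameter search would repeat the fit over a grid of lookback and ridge values per series and horizon.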
Results
The optimized Ridge regression models outperformed previous linear forecasters on most dataset-horizon combinations and exceeded transformer, MLP, and CNN baselines on six out of eight benchmarks. The study revealed that optimal hyperparameters are dataset-specific and can provide diagnostic insights into the data's structural properties.
Implications
The findings suggest that simpler models, when properly tuned, can be highly effective for time-series forecasting, potentially reducing the need for complex architectures. This could lead to more efficient forecasting solutions that are easier to interpret and implement. The insights gained from hyperparameter tuning may also guide the design of future forecasting models.
PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs
Large Language Models
Efficient ML
Optimization
- PersistentKV enhances long-context LLM serving by optimizing KV-cache management and decode scheduling.
- The system employs a calibrated adaptive policy to select the most efficient decoding strategy based on active batch sizes.
- A native block-table decode engine is introduced, which improves memory utilization and reduces overhead.
- Benchmarked against FlashInfer, PersistentKV improves synchronized wall throughput by 1.063–1.265× across workloads.
Summary
The paper introduces PersistentKV, a novel decode scheduling engine designed to optimize the serving of autoregressive large language models (LLMs) on commodity GPUs, specifically targeting the challenges posed by key-value (KV) cache management during long-context decoding. Traditional methods often suffer from inefficiencies due to KV-cache fragmentation and suboptimal scheduling, particularly when dealing with varying sequence lengths. PersistentKV addresses these issues by implementing a native block-table decode engine that enhances memory utilization and reduces overhead associated with KV-cache management. The proposed system employs a calibrated adaptive policy that dynamically selects between different decoding strategies based on the active batch size, thereby improving throughput. The methodology includes a detailed comparison of PersistentKV against existing systems like FlashInfer, focusing on grouped-query attention (GQA) and utilizing a compact workqueue for efficient task execution. The results demonstrate significant improvements in throughput across various workloads, highlighting the importance of adaptive scheduling in optimizing LLM serving.
Methodology
The methodology involves the development of a native block-table decode engine for grouped-query attention (GQA), which allows for efficient KV-cache management. The system utilizes a compact workqueue to execute only non-empty tasks, thereby optimizing resource usage. An adaptive policy is implemented to select between PersistentKV and FlashInfer based on the characteristics of the active batch, ensuring optimal performance across different workloads. The evaluation includes synchronized wall timing, CUDA-event timing, and workload counters to assess the effectiveness of the scheduling strategies.
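To make the block-table and compact-workqueue ideas concrete, here is a small CPU-side sketch. It is not PersistentKV's implementation: the page size, the task tuple layout, the `Sequence` type, and the throughput numbers in the calibration table are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 16  # tokens per KV page (assumed value)

@dataclass
class Sequence:
    seq_id: int
    num_tokens: int
    pages: list  # physical page ids, in logical order

def build_block_table(seqs, max_pages):
    """One row per sequence; entries are physical page ids, padded with -1."""
    return [s.pages + [-1] * (max_pages - len(s.pages)) for s in seqs]

def build_workqueue(seqs, pages_per_task):
    """Emit one task per non-empty chunk of pages; padding produces no work."""
    queue = []
    for row, s in enumerate(seqs):
        used = -(-s.num_tokens // PAGE_SIZE)  # ceil division
        for start in range(0, used, pages_per_task):
            queue.append((row, start, min(start + pages_per_task, used)))
    return queue

def choose_engine(batch_size, calibrated_tps):
    """Pick whichever engine had the higher calibrated throughput."""
    per_engine = calibrated_tps[batch_size]
    return max(per_engine, key=per_engine.get)

seqs = [
    Sequence(0, 40, [7, 2, 9]),  # 40 tokens -> 3 pages
    Sequence(1, 5, [4]),         # 5 tokens  -> 1 page
    Sequence(2, 0, []),          # empty sequence -> no tasks at all
]
table = build_block_table(seqs, max_pages=4)
queue = build_workqueue(seqs, pages_per_task=2)

# Made-up calibration numbers, only to exercise the policy function.
calibrated_tps = {
    1: {"persistentkv": 90.0, "flashinfer": 100.0},
    8: {"persistentkv": 130.0, "flashinfer": 110.0},
}
print(table)
print(queue)
print(choose_engine(8, calibrated_tps))
```

A dense launch grid over this batch would cover 3 sequences × 2 chunks = 6 slots, while the compact queue holds only the 3 chunks that actually contain KV pages.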
Results
The results show that PersistentKV improves synchronized wall throughput by 1.063–1.265× on various workloads, with a notable 1.399× improvement on a specific bucketed trace. The workqueue scheduling significantly reduces launch fan-out from 16.00 to 2.00 launches per decode step, indicating enhanced efficiency in task execution. The adaptive policy effectively balances the use of PersistentKV and FlashInfer, optimizing performance based on workload characteristics.
Implications
The findings suggest that adaptive page-aware decode scheduling can significantly enhance the efficiency of LLM serving on commodity GPUs. This has implications for the deployment of LLMs in real-world applications, where optimizing resource utilization and reducing latency are critical. The approach could be applied to various LLM architectures and serving scenarios, potentially improving the performance of AI-driven applications.
Kolmogorov Arnold networks (KAN) for aerodynamic prediction: a comparison with MLPs and GNNs
Theory
Efficient ML
Graph Learning
- KANs offer a novel approach to neural network architecture by adapting activation functions.
- In aerodynamic prediction tasks, KANs are competitive but rank third, marginally behind MLPs and further behind GNNs.
- KANs converge faster during training and have lower complexity than MLPs and GNNs.
- Training instabilities and hyperparameter sensitivity are significant challenges for KANs.
Summary
This paper investigates the performance of Kolmogorov Arnold Networks (KANs) in the context of aerodynamic prediction, specifically for estimating surface pressure distributions over airfoils. KANs are a novel neural network architecture that adapts activation functions rather than the coefficients of affine transformations, leveraging the Kolmogorov-Arnold theorem for universal approximation. The authors compare KANs against traditional multilayer perceptrons (MLPs) and graph neural networks (GNNs) in surrogate modeling for fluid dynamics. The study focuses on predicting pressure coefficients across varying Mach numbers and angles of attack, a critical task in aerodynamics. Results indicate that while KANs demonstrate competitive performance, they are marginally inferior to MLPs and less effective than GNNs, which achieve the best results but require longer training times. KANs exhibit faster convergence and lower complexity, yet they face training instabilities and sensitivity to hyperparameter optimization. This research contributes to the ongoing debate regarding the efficacy of KANs compared to established neural network architectures in surrogate modeling for physical processes.
Methodology
The authors developed a KAN-based surrogate model for predicting aerodynamic pressure coefficients and compared its performance with MLPs and GNNs. They replicated previous studies to ensure a fair comparison, focusing on the interpolation capabilities across different physical parameters and flight conditions.
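The core KAN idea, learning a univariate function on every input-output edge instead of a scalar weight, can be sketched as below. This is an assumption-laden illustration: it uses Gaussian radial basis functions as a stand-in for the B-splines of the original KAN formulation, and the layer sizes and input range are arbitrary rather than taken from the paper.

```python
import numpy as np

class KANLayer:
    """y_j = sum_i phi_ij(x_i), with phi_ij(x) = sum_g coef[i, j, g] * basis_g(x)."""

    def __init__(self, in_dim, out_dim, num_basis=8, x_range=(-1.0, 1.0), seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(x_range[0], x_range[1], num_basis)
        self.width = (x_range[1] - x_range[0]) / (num_basis - 1)
        # Learnable parameters: one coefficient vector per (input, output) edge.
        self.coef = 0.1 * rng.standard_normal((in_dim, out_dim, num_basis))

    def basis(self, x):
        # (batch, in_dim) -> (batch, in_dim, num_basis); Gaussian RBF stand-in.
        return np.exp(-((x[..., None] - self.centers) / self.width) ** 2)

    def __call__(self, x):
        # Sum the per-edge univariate functions over inputs for each output.
        return np.einsum("big,iog->bo", self.basis(x), self.coef)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(4, 3))  # e.g. three scaled flow parameters (assumed)
layer1 = KANLayer(3, 5, seed=0)
layer2 = KANLayer(5, 1, seed=1)
y = layer2(layer1(x))                    # one scalar prediction per sample
print(y.shape)
```

There is no separate activation function between the layers: the nonlinearity lives entirely in the learned edge functions, which is the contrast with an MLP that the summary draws.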
Results
The KAN model achieved good performance in predicting pressure coefficients but ranked third behind MLPs and GNNs. While KANs showed faster training convergence, they suffered from training instabilities and had a higher generalization error in regions with strong gradients. MLPs provided marginally better performance, while GNNs achieved the highest accuracy at the cost of longer training times.
Implications
The findings suggest that KANs could be a viable option for surrogate modeling in fluid dynamics, particularly where training speed is critical. However, their limitations in stability and performance indicate that further research is needed to enhance their robustness and effectiveness in practical applications.
Theory-Scale Auto-Formalization of Logics for Computer Science
Theory
- Introduction of LCS-Bench, a theory-scale benchmark for auto-formalization.
- Development of a semi-automated pipeline for constructing the benchmark.
- Creation of five evaluation tracks with 1,271 benchmark instances.
- Demonstration of the benchmark's challenging nature with state-of-the-art models achieving only 20.1% success.
Summary
This paper addresses the challenge of theory-scale auto-formalization, which involves translating complex interdependent mathematical theories into machine-verifiable statements. The authors introduce LCS-Bench, a comprehensive benchmark designed for evaluating auto-formalization in the context of Logics for Computer Science. LCS-Bench is constructed using a semi-automated pipeline that integrates concept graphs, formal signature planning, issue tracking, and human expert reviews to ensure the quality and faithfulness of the formalizations. The benchmark encompasses 327 textbook items, over 4,076 Lean declarations, and more than 85,000 lines of Lean code, providing a rich dataset for evaluation. The authors also propose a novel evaluation protocol that includes definitional equivalence checkers for a more nuanced assessment of auto-formalization tasks. Through evaluations on 14 models, the paper demonstrates that LCS-Bench is a high-quality, challenging benchmark, revealing that state-of-the-art models only achieve 20.1% success in auto-formalization tasks. The findings highlight the complexities of theory-scale auto-formalization and suggest avenues for future research in this domain.
Methodology
The authors employed a semi-automated agentic pipeline that combines concept graph construction, formal signature planning, automated issue tracking, and expert review to create LCS-Bench. This pipeline ensures the formalization of a comprehensive set of definitions, lemmas, and theorems from the textbook 'Logics for Computer Science'. The benchmark is evaluated through a novel protocol that includes definitional equivalence checkers.
Results
LCS-Bench was successfully constructed, covering 327 textbook items and yielding a challenging dataset with 1,271 evaluation instances. The evaluation of 14 models revealed that the best-performing models achieved only 20.1% success in auto-formalization tasks, indicating the benchmark's difficulty and the need for further advancements in the field.
Implications
The introduction of LCS-Bench provides a valuable resource for researchers working on auto-formalization and formal verification, facilitating the development of more robust models. The findings underscore the complexities involved in theory-scale formalization, guiding future research efforts towards improving auto-formalization techniques.
Reasoning Quality Emerges Early: Data Curation for Reasoning Models
NLP
Large Language Models
Efficient ML
- Introduces a new method for data curation in reasoning models that relies on initial reasoning tokens.
- Demonstrates that the first 100 tokens can effectively indicate problem difficulty.
- Establishes that similar loss patterns in initial tokens lead to similar gradients during training.
- Achieves up to a 1.7% performance improvement while being 91% more token-efficient than existing methods.
Summary
This paper presents a novel approach to data curation for supervised fine-tuning (SFT) of large language models (LLMs) aimed at enhancing reasoning capabilities. The authors argue that traditional methods for curating high-quality SFT data are inefficient and often yield suboptimal results due to their reliance on strong reasoning models for filtering based on diversity and difficulty. Instead, they propose a method that identifies diverse and challenging reasoning examples using only the initial reasoning tokens. Specifically, they demonstrate that the loss of the first 100 reasoning tokens can reliably indicate the difficulty of problems when evaluated at a randomly perturbed checkpoint of the pretrained model. Furthermore, they establish that examples with similar loss patterns over their first 1,000 tokens across a few perturbed checkpoints induce similar gradients during training. This leads to the development of the Token-Efficient Model Perturbation (TEMP) method, which efficiently curates an SFT dataset that is both diverse and challenging, thereby facilitating effective training of reasoning models. The approach is validated through extensive experiments on the Qwen2.5-7B and Llama3.1-8B models using the M23K medical reasoning and OpenThoughts-Math datasets, showing significant improvements in performance and token efficiency compared to existing baselines.
Methodology
The authors developed the Token-Efficient Model Perturbation (TEMP) method, which identifies challenging reasoning examples based on the loss of the first 100 reasoning tokens at a perturbed checkpoint. They then cluster examples based on their loss over the first 1,000 tokens across several checkpoints to ensure diversity in the curated dataset.
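A rough sketch of the two-stage selection follows, operating on precomputed per-token losses. Everything beyond the two prefix lengths is an assumption: the synthetic loss tensor, the per-checkpoint mean as the cluster feature, the tiny k-means, and the rule of keeping the hardest examples per cluster are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def difficulty_scores(token_losses, checkpoint=0, prefix=100):
    """Mean loss over the first `prefix` reasoning tokens at one perturbed checkpoint."""
    return token_losses[checkpoint, :, :prefix].mean(axis=1)

def loss_profiles(token_losses, prefix=1000):
    """One feature per checkpoint: mean loss over the first `prefix` tokens."""
    return token_losses[:, :, :prefix].mean(axis=2).T  # (examples, checkpoints)

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm; returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def select(token_losses, k, per_cluster):
    """Keep the hardest `per_cluster` examples from each loss-profile cluster."""
    scores = difficulty_scores(token_losses)
    labels = kmeans(loss_profiles(token_losses), k)
    chosen = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        hardest = members[np.argsort(-scores[members])[:per_cluster]]
        chosen.extend(hardest.tolist())
    return sorted(chosen)

# Synthetic losses (assumption): 3 perturbed checkpoints, 200 examples, 1000 tokens.
rng = np.random.default_rng(0)
token_losses = rng.gamma(2.0, 1.0, size=(3, 200, 1000))
selected = select(token_losses, k=4, per_cluster=5)
print(len(selected), "examples selected")
```

In a real run the loss tensor would come from forward passes of the perturbed checkpoints over only the first tokens of each reasoning trace, which is where the token savings reported in the summary would arise.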
Results
The TEMP method outperformed existing baselines by up to 1.7% in reasoning tasks while being 91% more token efficient. This was validated through experiments on the Qwen2.5-7B and Llama3.1-8B models using the M23K medical reasoning and OpenThoughts-Math datasets.
Implications
This research has significant implications for the efficient training of reasoning models in NLP, potentially leading to better performance in reasoning-intensive applications such as medical diagnosis, mathematical problem-solving, and programming tasks.
Multipath Adaptive Gated Bottleneck Latent ODE with Raman Data Fusion for Cell Culture Process Forecasting
Time Series
Multimodal
- Introduction of a novel adaptive framework for bioprocess forecasting.
- Development of the Gated Bottleneck Latent ODE to improve learning from sparse data.
- Implementation of Multi-Path Just-In-Time Fine-Tuning for generating multiple plausible forecasts.
- Fusion of Raman spectroscopy data to enhance the observability of bioprocess runs.
Summary
This paper addresses the challenges of forecasting mammalian cell culture processes, which are critical for biopharmaceutical production. The authors propose an innovative framework that combines a Gated Bottleneck Latent Ordinary Differential Equation (GB-Latent ODE) with Multi-Path Just-In-Time Fine-Tuning (MP-JIT-FT). The GB-Latent ODE enhances the standard Latent ODE by incorporating learnable gating mechanisms and a bottleneck structure to effectively handle high-dimensional sparse data. MP-JIT-FT retrieves similar historical trajectories, clusters them into candidate regimes, and fine-tunes separate models for each regime, allowing for multiple plausible forecasts rather than a single average prediction. Additionally, the framework integrates Raman spectroscopy data through a machine-learning soft sensor, which generates pseudo-observations to enrich sparse offline measurements. The proposed method is evaluated on 38 fed-batch bioreactor runs across 14 conditions, demonstrating superior performance compared to a global Latent ODE baseline, particularly in scenarios where early dynamics predict later behavior. The study highlights the importance of adaptive forecasting in bioprocess management, enabling timely interventions to maintain production quality.
Methodology
The methodology involves a combination of a Gated Bottleneck Latent ODE for modeling the dynamics of cell culture processes and Multi-Path Just-In-Time Fine-Tuning to adaptively retrieve and cluster historical data. The framework also incorporates Raman spectroscopy data through a soft sensor to augment the training process.
Results
The proposed framework achieved the best average rank across 38 fed-batch bioreactor runs and outperformed a global Latent ODE baseline on 8 out of 9 target variables. The multi-path forecasting approach showed significant gains in scenarios where similar early trajectories diverged later, while Raman data fusion improved the robustness of predictions.
Implications
The findings suggest that adaptive forecasting methods can significantly enhance the management of biopharmaceutical production processes, allowing for timely adjustments that can prevent off-specification batches. This approach could be applied to other complex, dynamic systems requiring real-time monitoring and control.
Decision-Aligned Evaluation of Uncertainty Quantification
Theory
- Introduces decision-alignment as a formal criterion for evaluating UQ metrics.
- Identifies misalignment in widely used UQ metrics with practical decision-making.
- Proposes prior-weighted utility metrics that better capture decision utility.
- Demonstrates the effectiveness of the new metrics through experiments and case studies.
Summary
This paper addresses the evaluation of uncertainty quantification (UQ) in machine learning, highlighting that traditional metrics like negative log-likelihood (NLL) and expected calibration error (ECE) do not necessarily correlate with the utility of decisions made based on these uncertainties. The authors introduce a novel criterion called decision-alignment, which assesses how well UQ metrics align with downstream decision-making utilities. They demonstrate that many commonly used UQ metrics are either misaligned with practical decision problems or reflect flawed prior beliefs about the tasks. To address these issues, the authors propose prior-weighted utility metrics, a new class of scoring rules designed to provide a more accurate evaluation of uncertainty in relation to decision-making. Through extensive benchmark experiments and real-world case studies, they show that these new metrics consistently align with actual decision utility, unlike traditional metrics, thereby revealing significant flaws in current UQ evaluation protocols and suggesting a principled extension towards decision-relevant UQ evaluation.
Methodology
The authors conducted a systematic analysis of common UQ metrics using the decision-alignment framework. They proposed prior-weighted utility metrics and evaluated their performance through benchmark experiments in classification and regression tasks, comparing them against traditional metrics like NLL and ECE.
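For reference, the two traditional metrics named above can be computed next to a simple downstream utility, as in the sketch below. The utility matrix and toy predictions are assumptions, and `decision_utility` is a generic expected-utility evaluation, not the paper's prior-weighted utility metric, whose definition the summary does not give.

```python
import numpy as np

def nll(probs, labels):
    """Mean negative log-likelihood of the true class."""
    p_true = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(p_true, 1e-12, 1.0))))

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)

def decision_utility(probs, labels, utility):
    """Average realized utility when each action maximizes expected utility
    under `probs`. utility[a, y] is the payoff of action a when the truth is y."""
    actions = (probs @ utility.T).argmax(axis=1)
    return float(utility[actions, labels].mean())

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]])
labels = np.array([0, 1, 1, 0])
# Hypothetical decision problem: action 0 commits to class 0 (costly if wrong),
# action 1 abstains for zero payoff.
utility = np.array([[1.0, -5.0], [0.0, 0.0]])

print("NLL:", nll(probs, labels))
print("ECE:", ece(probs, labels))
print("utility:", decision_utility(probs, labels, utility))
```

Because the utility depends on the payoff matrix while NLL and ECE do not, two predictors can be ranked differently by the three numbers, which is the kind of misalignment the paper examines.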
Results
The results indicate that prior-weighted utility metrics reliably align with real downstream utilities, while conventional metrics often do not. The analysis revealed that many traditional UQ metrics either encode pathological prior beliefs or are not aligned with decision-making processes.
Implications
The findings suggest that adopting decision-aligned evaluation metrics can improve the development and deployment of UQ methods in machine learning, particularly in safety-critical applications where accurate uncertainty quantification is essential for reliable decision-making.
Equivariance and Augmentation for Bayesian Neural Networks
Theory
- The paper establishes a theoretical framework for understanding how data augmentation induces equivariance in BNNs.
- Three novel symmetrization techniques are introduced to enhance the equivariance of BNNs trained on augmented data.
- The orbit expansion method outperforms baseline models in terms of both equivariance and overall performance.
- The study provides bounds on the equivariance error and conditions for maintaining equivariance during training.
Summary
This paper investigates the role of symmetries in Bayesian Neural Networks (BNNs) and explores the effectiveness of data augmentation as a means to achieve equivariance. The authors highlight the ongoing debate between imposing symmetry constraints through equivariant network architectures and learning symmetries from augmented data. They derive conditions under which exact equivariance can be achieved when training BNNs with variational inference on augmented data, particularly focusing on variational distributions in the exponential family. The study introduces three novel symmetrization techniques—geometric averaging, projection, and orbit expansion—to enhance the equivariance properties of BNNs. Extensive numerical experiments demonstrate that the orbit expansion method significantly outperforms baseline models in both equivariance and overall performance, suggesting that data augmentation can be a powerful tool for improving the robustness of BNNs in symmetric learning tasks.
Methodology
The authors analyze BNNs trained with variational inference on augmented data, focusing on variational distributions from the exponential family. They derive theoretical results regarding equivariance and introduce symmetrization techniques to improve model performance. Numerical experiments are conducted to validate the theoretical findings.
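As background for the symmetrization idea, the sketch below shows plain group averaging of a deterministic predictor over the four 90° rotations. It is a generic construction used here only for intuition: the paper's geometric averaging, projection, and orbit expansion act on variational distributions of a BNN, which this toy example does not model. The predictor `f` and its weights are made up.

```python
import numpy as np

def c4_orbit(image):
    """All four 90-degree rotations of a square image."""
    return [np.rot90(image, k) for k in range(4)]

def invariance_error(f, image):
    """Largest deviation of f across the orbit; 0 means exactly invariant."""
    outs = np.array([f(g) for g in c4_orbit(image)])
    return float(np.max(np.abs(outs - outs[0])))

def orbit_average(f):
    """Symmetrize f by averaging its output over the group orbit."""
    return lambda image: float(np.mean([f(g) for g in c4_orbit(image)]))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))  # weights of a toy, non-invariant predictor

def f(image):
    return float(np.tanh(np.sum(W * image) / 8.0))

f_sym = orbit_average(f)
image = rng.standard_normal((8, 8))

print("raw invariance error:", invariance_error(f, image))
print("symmetrized invariance error:", invariance_error(f_sym, image))
```

Rotating the input only permutes the terms being averaged, so the symmetrized predictor is invariant up to floating-point rounding regardless of the weights.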
Results
The study shows that starting from an invariant prior allows the variational distribution to remain invariant throughout training under certain conditions. The introduced symmetrization techniques, particularly orbit expansion, lead to improved equivariance and overall model performance compared to baseline approaches.
Implications
The findings have significant implications for the design of neural networks in applications requiring symmetry, such as medical imaging and scientific computing. The ability to achieve equivariance through data augmentation can enhance model robustness and performance in various domains.
Implementation of reinforcement learning in chemical reaction networks: application to phototaxis as curiosity-driven exploration
Reinforcement Learning
Robotics
Theory
- Integration of reinforcement learning with biochemical reaction networks for modeling phototaxis.
- Formulation of phototaxis as a subjective POMDP, highlighting the role of sensory ambiguity.
- Use of Inverse Reinforcement Learning to derive a data-driven phototactic policy from experimental data.
- Demonstration of how tumbling behavior aids in resolving sensory ambiguity and supports adaptive navigation.
Summary
This paper presents a novel framework that integrates reinforcement learning (RL) with biochemical reaction networks to model phototaxis in unicellular algae, specifically Chlamydomonas. The authors argue that traditional models of phototaxis, which often rely on mechanistic run-tumble processes, fail to account for the active sampling behavior of organisms in uncertain environments. By framing phototaxis as a Partially Observable Markov Decision Process (POMDP), the study emphasizes the role of internal state updates based on sensory observations. The proposed model incorporates a memoryless Bayesian approach to balance light orientation with exploratory behavior, implemented through Chemical-Reaction-Network Ordinary Differential Equations (CRN–ODEs). The authors utilize Inverse Reinforcement Learning (IRL) on 30 recorded trajectories to derive a phototactic policy and reward structure, comparing it against standard stochastic simulation baselines. The results demonstrate that the model effectively reproduces empirical alignment-to-light distributions and illustrates how tumbling serves as an information-acquisition strategy, enhancing the understanding of adaptive behavior in cellular navigation.
Methodology
The authors formulated the navigation problem as a POMDP and implemented the internal dynamics using CRN–ODEs. They employed Inverse Reinforcement Learning on experimental trajectories to infer the behavioral objectives and compared the results with standard stochastic simulation baselines.
Results
The model successfully reproduced the empirical alignment-to-light distribution observed in Chlamydomonas, and the dynamics derived from the proposed framework were comparable to those produced by standard stochastic simulation baselines. The findings support the hypothesis that tumbling acts as a strategy for information acquisition in uncertain environments.
Implications
This research provides insights into how simple biochemical processes can underpin complex adaptive behaviors in organisms, potentially influencing the design of bio-inspired algorithms in robotics and artificial intelligence. It also opens avenues for further exploration of the relationship between cellular dynamics and cognitive processes.
Transformer-Based Classification of Bacterial Raman Spectra with LOOCV
Theory
- Transformer models outperform conventional machine learning methods in classifying bacterial Raman spectra.
- The study employs a nested leave-one-replicate-out cross-validation framework for robust evaluation.
- Transformers demonstrate superior class separation and maintain performance on raw spectra without preprocessing.
- The findings highlight the significance of replicate-aware validation in model evaluation.
Summary
This study investigates the application of transformer-based models for the classification of bacterial Raman spectra, utilizing a nested leave-one-replicate-out cross-validation (LOOCV) framework. The research compares the performance of the transformer model against traditional machine-learning methods that integrate Principal Component Analysis (PCA) or Independent Component Analysis (ICA) with classifiers such as Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), and Random Forests (RF). The dataset comprises 5,417 single-cell Raman spectra from six bacterial species, collected across nine independent measurement replicates. The results indicate that the transformer model consistently outperformed conventional methods, achieving superior classification accuracy and demonstrating improved class separation in the learned latent feature space. Notably, the transformer maintained robust performance when applied directly to raw Raman spectra without preprocessing, underscoring its potential for effective Raman spectral classification. The study emphasizes the importance of replicate-aware validation to ensure realistic assessments of model performance, particularly in applications where variations in measurement conditions can significantly impact results.
Methodology
The study utilized a nested leave-one-replicate-out cross-validation framework to evaluate the performance of a transformer-based model for classifying bacterial Raman spectra. The dataset included 5,417 single-cell spectra from six bacterial species, and the performance of the transformer was compared with conventional machine learning methods that included PCA or ICA combined with LDA, SVM, and RF classifiers.
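The nested leave-one-replicate-out scheme can be sketched as a split generator. This is a minimal illustration with assumed names and a toy replicate assignment; it shows only how indices are partitioned, not the model training or the hyperparameter search run inside the inner loop.

```python
def nested_loro_splits(replicate_ids):
    """Yield (test_rep, inner_folds, test_idx).

    Outer loop: hold out one replicate for testing.
    Inner loop: among the remaining replicates, hold out one at a time for
    validation, so tuning never sees spectra from the test replicate.
    """
    reps = sorted(set(replicate_ids))
    for test_rep in reps:
        test_idx = [i for i, r in enumerate(replicate_ids) if r == test_rep]
        inner_folds = []
        for val_rep in reps:
            if val_rep == test_rep:
                continue
            train_idx = [i for i, r in enumerate(replicate_ids)
                         if r not in (test_rep, val_rep)]
            val_idx = [i for i, r in enumerate(replicate_ids) if r == val_rep]
            inner_folds.append((train_idx, val_idx))
        yield test_rep, inner_folds, test_idx

# Toy assignment (assumption): 45 spectra spread evenly over 9 replicates.
replicate_ids = [i % 9 for i in range(45)]
folds = list(nested_loro_splits(replicate_ids))
print(len(folds), "outer folds,", len(folds[0][1]), "inner folds each")
```

Splitting by replicate rather than by individual spectrum keeps spectra from the same measurement run out of both sides of a split, which is the replicate-aware property the study emphasizes.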
Results
The transformer model achieved the highest classification performance across independent test replicates, significantly outperforming all conventional approaches. The analysis of the latent feature space revealed improved class separation compared to PCA- and ICA-based representations. The transformer also demonstrated robust performance when applied to raw Raman spectra without preprocessing.
Implications
The findings suggest that transformer-based models can be effectively utilized for Raman spectral classification, offering a robust alternative to traditional methods. This has potential applications in various fields, including microbiology, biomedical research, and chemical analysis, where accurate classification of spectral data is crucial.
Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders
Computer Vision
Interpretability
- Introduction of two sparsity regularizers for Top-k Sparse Autoencoders.
- Regularizers improve monosemanticity and interpretability of latent representations.
- The ℓ1/ℓ2 penalty enhances robustness to variations in the sparsity budget.
- Evaluation across multiple datasets and vision foundation models shows consistent improvements.
Summary
This paper addresses the limitations of Top-k Sparse Autoencoders (SAEs) in interpreting vision foundation models by introducing two novel sparsity regularizers. The Top-k SAE retains only the k most active latent units per input, which can lead to fixed sparsity budgets that do not adapt to input complexity and potential overfitting to the training value of k. The authors propose an ℓ1 penalty on off-support activations and a scale-invariant ℓ1/ℓ2-ratio penalty to enhance the interpretability of the latent representations. These regularizers encourage the model to concentrate on fewer effective units, improving monosemanticity without compromising reconstruction quality. The evaluation on ImageNet-1K and Open Images V7 datasets using three vision foundation models demonstrates that these regularizers significantly enhance the interpretability of the learned representations, making them more robust to variations in the sparsity budget.
Methodology
The authors developed two sparsity regularizers compatible with the Top-k SAE architecture. The first regularizer applies an ℓ1 penalty to off-support activations, encouraging latent units to activate strongly only for relevant inputs. The second regularizer penalizes the ratio of ℓ1 to ℓ2 norms of activations, promoting concentration of information into fewer effective units. The performance of these regularizers was evaluated on two datasets using embeddings from three frozen vision foundation models.
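One way to read the two penalties is sketched below on a toy Top-k SAE forward pass. Several details are assumptions: that "off-support" means pre-activations outside the top-k set, that the ℓ1/ℓ2 ratio is taken over the kept activations, and the shapes, weights, and penalty coefficients, none of which come from the paper.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, k):
    """ReLU encoder, keep the k largest latents per input, linear decoder."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)            # (batch, latents)
    idx = np.argpartition(-pre, k - 1, axis=1)[:, :k]   # indices of the top-k
    mask = np.zeros_like(pre, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    z = np.where(mask, pre, 0.0)
    return pre, mask, z, z @ W_dec

def off_support_l1(pre, mask):
    """L1 norm of activations that fall outside the top-k support."""
    return float(np.where(mask, 0.0, np.abs(pre)).sum(axis=1).mean())

def l1_l2_ratio(z, eps=1e-8):
    """Scale-invariant sparsity measure; smaller means more concentrated."""
    l1 = np.abs(z).sum(axis=1)
    l2 = np.sqrt((z ** 2).sum(axis=1))
    return float((l1 / (l2 + eps)).mean())

rng = np.random.default_rng(0)
d, m, k = 16, 32, 4
x = rng.standard_normal((5, d))
W_enc = rng.standard_normal((d, m)) / np.sqrt(d)
b_enc = np.zeros(m)
W_dec = rng.standard_normal((m, d)) / np.sqrt(m)

pre, mask, z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, k)
recon = float(np.mean((x_hat - x) ** 2))
lam_off, lam_ratio = 1e-3, 1e-3  # hypothetical penalty weights
loss = recon + lam_off * off_support_l1(pre, mask) + lam_ratio * l1_l2_ratio(z)
print("loss:", loss)
```

With k non-zero entries the ratio lies between 1 (all mass on one latent) and √k (mass spread evenly), so pushing it down concentrates information into fewer units without changing the hard budget.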
Results
The introduction of the sparsity regularizers led to consistent improvements in the monosemanticity of the representations across various values of k, without sacrificing reconstruction quality. The ℓ1/ℓ2 penalty further concentrated information into fewer latents, enhancing robustness to the choice of k during inference and improving performance in small-budget linear probing.
Implications
The findings suggest that incorporating soft sparsity regularization can significantly enhance the interpretability of representations learned by vision foundation models, which is crucial for applications requiring transparency and accountability in AI systems.
Revisiting Action Factorization for Complex Action Spaces
Reinforcement Learning
Robotics
Optimization
- Introduces a cross-sectional study of action factorization methods across multiple RL algorithms and action spaces.
- Presents two new environments to isolate challenges in hybrid action spaces.
- VDN-PPO and PPO-MIX outperform other tested PPO factorizations by effectively assigning credit to action heads.
- Shared encoder architectures offer the best compute-performance trade-off in most scenarios.
Summary
This paper addresses the challenges of hybrid discrete-continuous action spaces in reinforcement learning (RL), which are prevalent in real-world control problems such as autonomous driving and robotics. The authors conduct a comprehensive study of various action factorization methods (independent networks, shared encoder, VDN, QPLEX, Joint, Auto-Regressive) across three algorithm families (PPO, SAC, DQN) and three action space types (discretized, hybrid, continuous) using four lightweight environments. They introduce two new environments, CoopPush and Hybrid-Shoot, to facilitate the analysis of state-dependent inter-action dependence. The study reveals that branching dueling architectures, particularly VDN-PPO and PPO-MIX, significantly outperform other PPO factorizations by effectively redistributing credit among action heads. The findings indicate that shared encoder architectures generally provide the best balance of computational efficiency and performance, while Auto-Regressive actions achieve the highest performance overall. However, the latter comes with increased computational costs. The paper emphasizes the need for principled benchmarks for factorization methods and provides insights for future research in model selection.
Methodology
The authors analyze 220 configurations of action factorization methods across three algorithm families (PPO, SAC, DQN) and three action space types (discretized, hybrid, continuous) using four benchmark environments. They introduce new environments to isolate specific challenges and evaluate the performance of various factorization strategies through experimental setups.
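The additive decomposition behind VDN-style factorization can be illustrated in a few lines. This is the generic VDN assumption (the joint value is a sum of per-head values) shown on random numbers; how VDN-PPO or PPO-MIX adapt it to PPO is not described in the summary, so nothing here should be read as their implementation.

```python
import itertools
import numpy as np

def vdn_q_total(head_qs, joint_action):
    """VDN assumption: the joint value is the sum of per-head values."""
    return float(sum(q[a] for q, a in zip(head_qs, joint_action)))

def greedy_factored(head_qs):
    """Each head picks its own argmax: cost grows with the sum of head sizes."""
    return tuple(int(np.argmax(q)) for q in head_qs)

def greedy_joint(head_qs):
    """Brute force over all joint actions: cost grows with the product."""
    sizes = [len(q) for q in head_qs]
    return max(itertools.product(*map(range, sizes)),
               key=lambda a: vdn_q_total(head_qs, a))

rng = np.random.default_rng(0)
head_qs = [rng.standard_normal(n) for n in (3, 4, 5)]  # three action heads

print("factored argmax:", greedy_factored(head_qs))
print("joint argmax:   ", greedy_joint(head_qs))
print("outputs needed: factored", 3 + 4 + 5, "vs joint", 3 * 4 * 5)
```

Under the additive assumption the per-head argmax coincides with the joint argmax, which is what makes the factorization cheap; the price is that it cannot represent values that depend on interactions between heads.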
Results
The study finds that shared encoder architectures generally provide the best trade-off between computational efficiency and performance. VDN-PPO and PPO-MIX significantly outperform other PPO variants in discrete action spaces. Auto-Regressive actions yield the highest performance overall, although they require more computational resources.
Implications
The findings suggest that researchers should consider the balance between computational efficiency and performance when selecting action factorization methods for complex action spaces. The introduction of new benchmark environments can aid in the development and evaluation of future RL algorithms.
High-Probability PL-SGD with Markovian Noise: Optimal Mixing and Tail Dependence
Optimization
Theory
- Establishes linear dependence on mixing time for high-probability PL-SGD under Markovian noise.
- Closes the gap between expectation and high-probability bounds in previous analyses.
- Introduces a clipped block method for heavy-tailed Markovian gradients.
- Provides matching lower bounds for both light-tailed and heavy-tailed scenarios.
Summary
This paper investigates first-order optimization methods for smooth objectives that satisfy the Polyak-Łojasiewicz (PL) condition, particularly when gradient samples are generated by an exogenous Markov chain. The authors address a gap in earlier high-probability bounds for Stochastic Gradient Descent (SGD) under Markovian noise, which exhibited a quadratic dependence on mixing time. They establish a uniform high-probability guarantee with a linear dependence on mixing time, and prove that this is optimal through a lower bound on a quadratic objective driven by a persistent two-state Markov chain. The study extends to heavy-tailed Markovian gradients, proposing a clipped block method that utilizes all Markov transitions while controlling bias. The authors derive high-probability stochastic error bounds and establish matching lower bounds, effectively characterizing the optimal polynomial dependence on mixing time for light-tailed PL-SGD and the heavy-tail exponent in robust regimes.
Methodology
The authors utilize a lag-blocking argument to derive uniform high-probability guarantees for SGD under geometric mixing conditions. They analyze the stochastic error through a combination of martingale concentration arguments and Poisson equation solutions, addressing the challenges posed by Markovian noise. The framework is extended to heavy-tailed gradients with a focus on mitigating bias through a clipped block method.
Results
The main results include a high-probability bound for PL-SGD that scales as $\tilde{O}(t_{\mathrm{mix}}/(k + K_0))$, demonstrating optimal linear dependence on mixing time. Additionally, for heavy-tailed gradients, the proposed method achieves a high-probability stochastic error of $\tilde{O}(\sigma_p^2 (t_{\mathrm{mix}}/T)^{2(p-1)/p})$, with matching lower bounds established for both light-tailed and heavy-tailed scenarios.
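The effect of mixing time on the error can be seen in a toy simulation. The sketch below is not the authors' construction or analysis; it only assumes a one-dimensional quadratic f(x) = x²/2 whose gradient samples are x − ξ_k, where ξ_k is a ±1 chain that flips state with a hypothetical probability `flip_prob` (the mixing time is roughly proportional to 1/`flip_prob`). The step size 1/(k + K0) and all names are illustrative choices.

```python
import random


def run_sgd(flip_prob, steps, k0=10, seed=0):
    """Run SGD on f(x) = x**2 / 2 with gradient samples x - xi_k.

    xi_k is a +/-1 Markov chain that flips with probability flip_prob,
    so its stationary mean is 0 and the minimiser of f is x = 0.
    """
    rng = random.Random(seed)
    xi = rng.choice((-1.0, 1.0))  # start from the stationary distribution
    x = 0.0
    for k in range(steps):
        step_size = 1.0 / (k + k0)
        x -= step_size * (x - xi)      # noisy gradient step
        if rng.random() < flip_prob:   # Markov transition of the noise
            xi = -xi
    return x


def mean_squared_error(flip_prob, steps=2000, runs=200):
    """Average x_T**2 over independent runs (proportional to f(x_T))."""
    errors = [run_sgd(flip_prob, steps, seed=s) ** 2 for s in range(runs)]
    return sum(errors) / runs


fast_error = mean_squared_error(flip_prob=0.5)     # behaves like i.i.d. noise
sticky_error = mean_squared_error(flip_prob=0.02)  # persistent, slow mixing
print(f"fast-mixing chain: {fast_error:.5f}")
print(f"slow-mixing chain: {sticky_error:.5f}")
```

With the same number of steps, the persistent chain should leave a visibly larger error, which is the qualitative behaviour that a bound growing with mixing time describes.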
Implications
The findings have significant implications for optimization in machine learning, particularly in scenarios where gradient samples are generated through Markov processes. This work enhances the understanding of convergence properties in stochastic optimization, which is crucial for applications in decentralized optimization, reinforcement learning, and other areas where Markovian sampling is prevalent.
At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization
NLP
Large Language Models
Interpretability
- LLMs exhibit increased reliance on spurious concepts when encountering OOD inputs.
- Minor distribution shifts in input prompts can lead to significant performance drops.
- SAE-derived indicators can effectively identify per-sample distribution shifts.
- The approach allows for targeted fine-tuning to enhance LLM robustness against adversarial inputs.
Summary
This paper investigates the generalization abilities of pre-trained transformers, particularly in out-of-distribution (OOD) settings where model performance can degrade due to unexpected data shifts. The authors present a mechanistic framework that utilizes Sparse Autoencoders (SAEs) to analyze the internal representations of large language models (LLMs). Through systematic experiments, they demonstrate that LLMs tend to infer spurious concepts when faced with OOD inputs, such as minor typos or adversarial prompts. The study reveals that even slight distribution shifts can significantly impact model performance on standard benchmarks. By employing SAEs, the authors provide a method to quantify these shifts and propose a fine-tuning strategy that enhances model robustness. The findings suggest that understanding the internal representation space of transformers is crucial for improving their reliability and safety in real-world applications.
Methodology
The authors utilize Sparse Autoencoders to analyze the internal representation of LLMs, focusing on how these models handle OOD inputs. They conduct systematic experiments to identify the effects of minor distribution shifts on model performance and develop a fine-tuning strategy based on the insights gained from SAE analysis.
Results
The study finds that LLMs are prone to generating spurious concepts when faced with OOD data, leading to performance degradation. The use of SAEs allows for the detection of subtle distribution shifts, which can be leveraged to improve model robustness through targeted fine-tuning. Additionally, the authors demonstrate that SAEs can flag successful jailbreak attempts, providing a mechanism to safeguard LLMs against such vulnerabilities.
Implications
The findings have significant implications for the deployment of AI systems in safety-critical environments, suggesting that a deeper understanding of model internals can enhance reliability and trustworthiness. The proposed methodologies could be applied to improve the robustness of LLMs in various domains, including science, business, and government.
Blackwell Approachability and Gradient Equilibrium are Equivalent
Theory
Optimization
- GEQ is algorithmically equivalent to Blackwell Approachability, allowing for the use of BA algorithms to solve GEQ problems.
- The paper provides efficient reductions that facilitate the transfer of guarantees from regret minimization to GEQ.
- Necessary and sufficient conditions for achieving GEQ are identified, enhancing the theoretical understanding of the framework.
- The equivalence between GEQ and other frameworks like regret minimization and calibration clarifies their interconnections.
Summary
This paper establishes a fundamental equivalence between Gradient Equilibrium (GEQ) and Blackwell Approachability (BA) within the context of online optimization. GEQ is a novel framework that generalizes first-order stationarity from offline optimization, addressing problems such as online conformal prediction. The authors demonstrate that any BA problem can be solved using a GEQ oracle without significant loss in error rate, and vice versa. This equivalence clarifies the relationship between GEQ and other online learning paradigms, such as regret minimization and calibration, which have been previously shown to be interconnected. The authors provide efficient reductions that allow the transfer of guarantees, such as optimism and strong adaptivity, from regret minimization to GEQ. They also identify necessary and sufficient conditions for achieving GEQ and explore reductions between various GEQ formulations with constrained and unconstrained decision sets. Overall, the findings enhance the understanding of how GEQ fits into the broader online learning landscape, revealing its algorithmic power comparable to classical methods.
Methodology
The authors utilize black-box oracle reductions to demonstrate the equivalence between GEQ and BA. They analyze the structural connections between various online learning frameworks, leveraging existing results on approachability, regret minimization, and calibration. The methodology includes establishing necessary and sufficient conditions for GEQ and exploring algorithmic implementations that link GEQ with BA.
Results
The main results include the establishment of a black-box oracle reduction that allows any BA algorithm to solve GEQ problems and vice versa. The authors also show that GEQ is equivalent to regret minimization and calibration, providing a comprehensive understanding of the relationships among these frameworks. The reductions are efficient and facilitate the transfer of refined guarantees from one framework to another.
Implications
The equivalence of GEQ and BA has significant implications for online learning, particularly in applications where traditional regret minimization is not suitable. This work opens avenues for using GEQ in various statistical problems, enhancing decision-making processes against adaptive adversaries. The findings may influence future research in online optimization and machine learning frameworks.
Automating Potential-based Reward Shaping with Vision Language Model Guidance
Reinforcement Learning
Multimodal
Robotics
- Introduction of VLM-PBRS framework for automating potential-based reward shaping.
- Utilization of smaller, cost-effective vision language models to generate potential functions.
- Empirical validation showing improved sample efficiency and robustness to reward hacking.
- Demonstration of the connection between VLM preference label accuracy and learning efficiency.
Summary
This paper addresses the challenges of sparse rewards in reinforcement learning (RL), which hinder effective exploration and learning. The authors propose a novel framework called VLM-PBRS that automates potential-based reward shaping (PBRS) using feedback from vision language models (VLMs). Traditional PBRS requires a carefully designed potential function, often necessitating expert knowledge. In contrast, VLM-PBRS learns this potential function directly from VLM-generated preferences over image pairs, thus eliminating the need for expert-designed reward shaping terms. The framework utilizes smaller, computationally efficient VLMs to reduce costs while still providing sufficient guidance for learning. The authors validate their approach through empirical experiments in the Meta-World and Franka Kitchen environments, demonstrating that even with less accurate preference labels, the method significantly improves sample efficiency and robustness against reward hacking. The study highlights the connection between the accuracy of VLM preference labels and the efficiency of the learning process, marking a significant advancement in automating reward shaping in RL.
Methodology
The VLM-PBRS framework queries a lightweight vision language model to obtain preferences over image pairs, which are then used to train a model of the potential function for PBRS. This approach leverages the policy invariance property of PBRS, allowing the use of less accurate preference labels without compromising the optimality of the learned policies.
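The policy invariance property mentioned above comes from the standard potential-based shaping term γΦ(s') − Φ(s). The sketch below illustrates only that term; the `potential` function is a hand-written stand-in for the model that VLM-PBRS would learn from VLM preferences, and treating the potential of a terminal state as zero is an assumption made for this episodic example.

```python
def shaped_reward(reward, state, next_state, potential, gamma, done=False):
    """Return r + gamma * phi(s') - phi(s), with phi(terminal) taken as 0."""
    next_phi = 0.0 if done else potential(next_state)
    return reward + gamma * next_phi - potential(state)


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical potential: higher for states closer to the goal state 3.
def potential(state):
    return 0.5 * state + 1.0


gamma = 0.9
states = [0, 1, 2, 3]      # a sparse-reward trajectory ending at the goal
rewards = [0.0, 0.0, 1.0]  # reward only on the final transition

shaped = [
    shaped_reward(r, s, s_next, potential, gamma, done=(i == len(rewards) - 1))
    for i, (r, s, s_next) in enumerate(zip(rewards, states, states[1:]))
]

original_return = discounted_return(rewards, gamma)
shaped_return = discounted_return(shaped, gamma)
print(original_return, shaped_return, potential(states[0]))
```

The shaping terms telescope, so the shaped return differs from the original return only by Φ(s₀), a constant that does not depend on the actions taken. This is why an imperfect potential can slow or speed learning but does not change which policy is optimal.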
Results
The empirical results indicate that VLM-PBRS significantly enhances sample efficiency in reinforcement learning tasks within the Meta-World and Franka Kitchen environments. The framework successfully demonstrates that even with less accurate preference labels from smaller VLMs, the learning process is accelerated, and the risk of reward hacking is mitigated.
Implications
The findings suggest that VLM-PBRS can streamline the reward shaping process in reinforcement learning, making it more accessible and efficient. This could lead to broader applications in various RL domains, reducing the reliance on expert knowledge and manual reward design, thereby facilitating the development of more robust RL agents.
Dataset Usage Inference without Shadow Models or Held-out Data
Generative Models
Computer Vision
Efficient ML
- Introduces NU-DUI, a framework for Dataset Usage Inference that does not require shadow models or held-out data.
- Generates synthetic non-member samples to enhance the accuracy of dataset usage estimates.
- Recasts DUI as a mixture proportion estimation problem, making it computationally efficient.
- Empirical results show NU-DUI provides accurate member-ratio estimates across multiple large-scale generative models.
Summary
This paper addresses the challenge of Dataset Usage Inference (DUI), which estimates the fraction of a dataset used to train a machine learning model. Traditional DUI methods rely on shadow models and held-out data, which are often impractical for large models and real-world data ownership disputes. The authors propose a novel framework called Negative Unlabeled Dataset Usage Inference (NU-DUI) that eliminates these constraints. NU-DUI generates synthetic non-member samples and extracts membership signals, framing DUI as a mixture proportion estimation problem. The method is validated through experiments on large image generative models, demonstrating its effectiveness in quantifying dataset usage without the need for shadow models or real held-out data. This advancement provides a practical tool for data owners to assess how much of their data has been utilized in model training, addressing legal and ethical concerns surrounding data ownership.
Methodology
The NU-DUI framework constructs synthetic non-member references using image-to-image paraphrasing and autoencodes the suspect dataset to mitigate distribution shifts. It extracts membership inference attack features tailored to the target model and applies mixture proportion estimation to derive the estimated member ratio.
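To make the mixture proportion estimation step concrete, here is a minimal sketch of one simple estimator, not the one used in NU-DUI. It assumes each sample already has a scalar membership score (higher means more member-like), that the suspect set is a mixture of members and non-members, and that the reference non-members follow the same score distribution as the non-members in the suspect set. The function name and the `min_mass` parameter are hypothetical.

```python
from bisect import bisect_right


def estimate_member_ratio(nonmember_scores, suspect_scores, min_mass=0.2):
    """Estimate the member fraction of the suspect set.

    For every threshold t, the suspect set contains at least
    (1 - alpha) * F_nonmember(t) mass below t, so the smallest ratio
    F_suspect(t) / F_nonmember(t) approximates the non-member
    fraction 1 - alpha.
    """
    negatives = sorted(nonmember_scores)
    suspects = sorted(suspect_scores)
    best_ratio = 1.0
    for threshold in negatives:
        neg_mass = bisect_right(negatives, threshold) / len(negatives)
        if neg_mass < min_mass:  # skip thresholds with too few samples
            continue
        sus_mass = bisect_right(suspects, threshold) / len(suspects)
        best_ratio = min(best_ratio, sus_mass / neg_mass)
    return 1.0 - best_ratio


# Synthetic scores: non-members lie in [0, 1), members in [2, 3).
reference_nonmembers = [i / 1000 for i in range(1000)]
suspect_set = [(i + 0.5) / 700 for i in range(700)]  # 70% non-members
suspect_set += [2.0 + i / 300 for i in range(300)]   # 30% members

member_ratio = estimate_member_ratio(reference_nonmembers, suspect_set)
print(f"estimated member ratio: {member_ratio:.3f}")
```

Taking a minimum over noisy empirical ratios tends to overestimate the member fraction on small samples, which is why the sketch ignores thresholds that cover less than `min_mass` of the reference set.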
Results
NU-DUI demonstrated accurate member-ratio estimates with a mean estimation error below 0.1 across five large-scale image generative models. The computational efficiency of NU-DUI was highlighted, taking approximately 42.5 A100 minutes compared to over 1,500 A100 hours required for traditional shadow model-based methods.
Implications
The proposed method has significant implications for data ownership disputes, allowing data owners to quantify the extent of their data's usage in model training. It also addresses legal and ethical concerns related to copyright and consent in the context of large generative models.
Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization
Optimization
Large Language Models
Efficient ML
- KernelPro integrates LLM code generation with expert heuristics for GPU kernel optimization.
- The system achieves state-of-the-art performance with significant speedups on the KernelBench benchmark.
- KernelPro incorporates energy efficiency as a secondary objective, reducing energy consumption while maintaining speed.
- Ablation studies confirm the effectiveness of each design component in improving optimization quality.
Summary
This paper introduces KernelPro, a closed-loop multi-agent system designed to optimize GPU kernel code by integrating large language model (LLM) code generation with hardware profiler feedback and micro-profiling tools. KernelPro's key contributions include a semantic feedback operator that transforms hardware metrics into actionable guidance, a two-stage tool invocation architecture for efficient bottleneck classification, a domain-adapted Monte Carlo Tree Search (MCTS) for optimization, and the ability to generate source-level code autonomously. The system achieves significant performance improvements on the KernelBench benchmark, with geometric mean speedups of 2.42×, 4.69×, and 5.30× across different difficulty levels, outperforming existing state-of-the-art methods. Additionally, KernelPro is the first system to incorporate energy efficiency into CUDA kernel optimization, achieving an 11.6% energy reduction at matched speed. The paper validates each component's effectiveness through rigorous ablation studies, demonstrating that KernelPro's design choices lead to substantial improvements in optimization quality.
Methodology
KernelPro employs a closed-loop multi-agent system that utilizes micro-profiling tools to analyze hardware metrics and provide actionable feedback to an LLM for code generation. It features a two-stage tool invocation architecture for efficient bottleneck classification and a domain-adapted MCTS for optimization, allowing for systematic exploration of optimization paths.
Results
KernelPro achieves geometric mean speedups of 2.42×, 4.69×, and 5.30× on Levels 1, 2, and 3 of the KernelBench benchmark, respectively. It also outperforms hand-tuned Triton on expert-optimized MoE training kernels by 1.23×. The system demonstrates an 11.6% energy reduction at matched speed compared to previous methods.
Implications
KernelPro's approach to GPU kernel optimization could lead to more efficient and interpretable optimization processes in high-performance computing, potentially benefiting a wide range of applications that rely on GPU acceleration. Its focus on energy efficiency also aligns with growing concerns about energy consumption in computing.
Statistical and Structural Approaches to Algorithmic Fairness
Theory
- Identifies limitations in current algorithmic fairness paradigms, particularly deterministic auditing metrics.
- Proposes statistical hypothesis testing as a more robust method for assessing fairness.
- Emphasizes the importance of structural context in understanding algorithmic fairness.
- Advocates for reshaping network structures to promote fairness in opportunities.
Summary
This doctoral thesis addresses the pressing issue of algorithmic fairness in modern machine learning systems, which have evolved into complex socio-technical architectures influencing human opportunities. The author identifies two significant limitations in current fairness paradigms: the reliance on deterministic point estimates for auditing and the treatment of individuals as isolated entities without structural context. The thesis critiques traditional scalar metrics for diagnosing algorithmic unfairness, which often overlook the statistical variance in small intersectional groups, leading to inaccuracies in bias detection. To enhance fairness assessments, the author proposes a shift towards statistical hypothesis testing, ensuring robustness and causal validity in evaluating model decisions. Furthermore, the thesis emphasizes the importance of examining algorithmic fairness through a structural lens, recognizing that fairness emerges from the interactions within networked and hierarchical systems. By focusing on structural dependencies, the work advocates for reshaping how opportunities flow through networks and how merit is aggregated. The thesis culminates in comprehensive frameworks that integrate statistical reliability with structural awareness, providing operational safeguards for the deployment of trustworthy AI.
Methodology
The thesis employs a combination of statistical hypothesis testing and structural analysis to evaluate algorithmic fairness. It critiques existing auditing methods and proposes new frameworks that account for the complexities of socio-technical systems.
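As an illustration of why a hypothesis test can be more informative than a point estimate, the sketch below runs a permutation test on the gap in positive-prediction rates between two groups. It is a generic example, not one of the frameworks proposed in the thesis; the data and function names are made up.

```python
import random


def rate_gap(preds_a, preds_b):
    """Difference in positive-prediction rates between two groups."""
    return sum(preds_a) / len(preds_a) - sum(preds_b) / len(preds_b)


def permutation_p_value(preds_a, preds_b, n_perm=2000, seed=0):
    """Two-sided permutation test of the null 'both groups share one rate'."""
    rng = random.Random(seed)
    observed = abs(rate_gap(preds_a, preds_b))
    pooled = list(preds_a) + list(preds_b)
    n_a = len(preds_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs(rate_gap(pooled[:n_a], pooled[n_a:])) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Same observed gap (0.4) in a tiny intersectional group and a larger one.
small_a, small_b = [1, 1, 1, 0, 0], [1, 0, 0, 0, 0]
large_a = [1] * 120 + [0] * 80
large_b = [1] * 40 + [0] * 160

p_small = permutation_p_value(small_a, small_b)
p_large = permutation_p_value(large_a, large_b)
print(f"gap={rate_gap(small_a, small_b):.2f}, n=5 per group:   p={p_small:.3f}")
print(f"gap={rate_gap(large_a, large_b):.2f}, n=200 per group: p={p_large:.4f}")
```

Both comparisons report the same scalar gap, but only the larger sample gives strong evidence against equal rates. A deterministic audit metric cannot express that difference.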
Results
The proposed frameworks demonstrate that transitioning to statistical hypothesis testing improves the reliability of fairness assessments. Additionally, the structural approach reveals how network topologies and hierarchical systems can perpetuate biases, suggesting that deliberate restructuring is necessary to achieve fairness.
Implications
The findings have significant implications for the design and deployment of AI systems, particularly in ensuring that they do not reinforce existing inequalities. The proposed frameworks can guide policymakers and practitioners in creating fairer algorithms that consider both statistical and structural dimensions.
Sketched Linear Contrastive Learning: Approximation, Optimization, and Statistical Scaling
Theory
Optimization
Multimodal
- Introduces a theoretical framework for scaling laws in contrastive learning.
- Derives a risk decomposition that highlights the effects of approximation, optimization, and sampling.
- Demonstrates that contrastive learning requires learning interactions between two views, affecting scaling behavior.
- Provides an explicit scaling law that connects sketch dimension, sample size, and optimization horizon.
Summary
This paper investigates the scaling laws of contrastive representation learning, specifically through a sketched linear model under a paired Gaussian latent-variable framework. The authors derive a risk decomposition that includes irreducible risk, approximation error, gradient descent (GD) bias, GD variance, and a cross term influenced by bias and variance. The main theorem presents an explicit scaling law concerning sketch dimension, sample size, and effective optimization horizon. The findings indicate that contrastive learning, which requires learning interactions between two views, alters the scaling behavior of optimization and finite-sample noise compared to standard linear regression. This work provides a theoretical foundation for understanding scaling behavior in contrastive learning, offering insights into balancing model size, data, and computational resources.
Methodology
The authors utilize a Gaussian paired-view model and a bilinear contrastive score trained on sketched inputs. They analyze a Gaussian-negative quadratic contrastive surrogate instead of the full nonlinear InfoNCE loss, allowing for analytical tractability while preserving the core contrastive structure.
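A bilinear score on sketched inputs can be written in a few lines. The sketch below is only an illustration of that structure under assumptions that the summary does not confirm: both views share one random Gaussian sketch matrix S of shape m × d, and the trainable parameter is an m × m matrix W, giving the score (Sx)ᵀ W (Sy). All names are hypothetical.

```python
import random


def gaussian_sketch(m, d, seed=0):
    """Random m x d sketch with i.i.d. N(0, 1/m) entries."""
    rng = random.Random(seed)
    scale = 1.0 / m ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(m)]


def matvec(matrix, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


def bilinear_score(x, y, sketch, weight):
    """Score (S x)^T W (S y) between two views x and y."""
    sx = matvec(sketch, x)
    sy = matvec(sketch, y)
    return sum(a * b for a, b in zip(sx, matvec(weight, sy)))


d, m = 8, 3
sketch = gaussian_sketch(m, d)
identity = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

view_x = [1.0, 0.5, -0.5, 0.0, 2.0, -1.0, 0.25, 0.0]
view_y = [0.9, 0.6, -0.4, 0.1, 1.8, -1.1, 0.20, 0.1]

score_xy = bilinear_score(view_x, view_y, sketch, identity)
score_yx = bilinear_score(view_y, view_x, sketch, identity)
score_xx = bilinear_score(view_x, view_x, sketch, identity)
print(score_xy, score_xx)
```

Under these assumptions the learner fits m² weights rather than d², so the sketch dimension m acts as the model-size knob in a scaling analysis.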
Results
The main result is a scaling law that reveals how the risk in contrastive learning decomposes into stable power-law terms influenced by model size, data size, and compute. The analysis shows that optimization and variance terms are shaped by the interactions between two views, which is a distinct feature of contrastive learning compared to standard regression.
Implications
This work lays the groundwork for further theoretical exploration of contrastive learning, potentially guiding the design of more efficient models and training protocols. It emphasizes the importance of understanding the interplay between model size, data, and computational resources in achieving optimal performance in contrastive learning tasks.
fTNN: a tensor neural network for fractional PDEs
Theory
Optimization
Efficient ML
- Introduction of fTNN, a tensor neural network for fractional PDEs.
- Development of a deterministic integration framework for the fractional Laplacian.
- Use of boundary-singularity-aware trial functions to improve solution accuracy.
- Design of a spatiotemporally separable neural network for time-dependent PDEs.
Summary
This paper introduces fTNN, a deterministic tensor neural network method for solving fractional partial differential equations (PDEs), focusing on the fractional Laplacian in bounded domains. The authors tackle the fractional Poisson equation and time-dependent fractional advection-diffusion equations, which are challenging due to their nonlocal nature and boundary singularities. The fTNN employs a geometry-adapted integration split that decomposes the fractional Laplacian into three components: a singular near field, a regular interior far field, and an analytical exterior far field. These components are integrated with different quadrature rules, including Gauss-Jacobi and deterministic angular quadrature, yielding a fully deterministic framework for the fractional Laplacian operator. To address low-regularity solutions, the authors propose boundary-singularity-aware trial functions and strategies for selecting leading exponents based on the singularity structure. For time-dependent PDEs, a spatiotemporally separable neural network is designed, which separates temporal and spatial integrals and improves training efficiency through an alternating optimization strategy. Numerical experiments demonstrate that the fTNN achieves high accuracy compared to existing methods like fPINN and Monte Carlo approaches, particularly in scenarios with significant boundary singularities and long-time simulations.
Methodology
The fTNN employs a geometry-adapted integration split for the fractional Laplacian, decomposing it into singular, regular, and analytical components. Various quadrature methods are used for integration, and boundary-singularity-aware trial functions are constructed to enhance solution accuracy. A spatiotemporally separable neural network is designed for time-dependent PDEs, integrating an alternating optimization strategy for training.
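For orientation, the three-way split can be written against the standard singular-integral definition of the fractional Laplacian. The decomposition below is a sketch under stated assumptions, not the paper's exact formulation: 0 < s < 1, the point x lies in the domain Ω with a ball B_r(x) contained in Ω, u vanishes outside Ω, and C_{d,s} denotes the usual normalizing constant.

```latex
\[
(-\Delta)^{s} u(x)
  = C_{d,s}\,\mathrm{P.V.}\!\int_{\mathbb{R}^d}
      \frac{u(x)-u(y)}{|x-y|^{d+2s}}\,dy
  = C_{d,s}\Bigg[
      \underbrace{\mathrm{P.V.}\!\int_{B_r(x)}
        \frac{u(x)-u(y)}{|x-y|^{d+2s}}\,dy}_{\text{singular near field}}
    + \underbrace{\int_{\Omega\setminus B_r(x)}
        \frac{u(x)-u(y)}{|x-y|^{d+2s}}\,dy}_{\text{regular interior far field}}
    + \underbrace{u(x)\int_{\mathbb{R}^d\setminus\Omega}
        \frac{dy}{|x-y|^{d+2s}}}_{\text{exterior far field}}
    \Bigg]
\]
```

Under the zero exterior condition the last integral depends only on the geometry of Ω and on x, which is consistent with the summary's description of the exterior far field as analytical.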
Results
The numerical experiments indicate that the fTNN framework achieves high accuracy on benchmark problems, outperforming existing methods such as fPINN and Monte Carlo baselines, especially in cases with strong boundary singularities and during long-time simulations.
Implications
The fTNN framework has potential applications in various fields that require solving fractional PDEs, such as anomalous transport modeling, nonlocal diffusion processes, and other scientific computing scenarios where traditional methods struggle with boundary singularities and nonlocality.
Effective Covariance Dynamics in Solvable High-Dimensional GANs
Generative Models
Theory
Optimization
- Introduces effective covariance dynamics for multi-feature GANs with complex latent structures.
- Derives a high-dimensional ODE that captures the training dynamics of GANs with correlated latent variables.
- Identifies a spectral solvable region that governs learning stability and recovery in GAN training.
- Demonstrates a signal-boosting mechanism where weak latent directions can be enhanced through low-rank correlations.
Summary
This paper investigates a solvable high-dimensional model of Generative Adversarial Networks (GANs) where a linear generator learns from data characterized by structured latent covariance. The authors extend previous analyses that assumed unconditional signals with diagonal latent covariance to include class-dependent, correlated, and non-zero-mean latent structures. They focus on the dynamics introduced by a quadratic energy discriminator, showing that the training process converges to deterministic ordinary differential equations (ODEs) in the high-dimensional limit, governed by an effective covariance derived from the data distribution. The study reveals that learning initiation is determined by the leading effective eigenvalue, with a specific stability region defined by learning rates and noise levels. The findings highlight a signal-boosting mechanism where low-rank correlations can enhance weak directions, while strong correlations may destabilize recovery. Numerical simulations and experiments on datasets like MNIST, FashionMNIST, and CIFAR-10 validate the theoretical predictions, demonstrating that informed generator covariance can improve alignment with the data-driven reference subspace.
Methodology
The authors derive a high-dimensional ordinary differential equation (ODE) for the training dynamics of multi-feature GANs, incorporating class-dependent and correlated latent structures. They analyze the stability of learning and recovery fixed points based on effective covariance, using numerical simulations to validate their theoretical findings.
Results
The study proves that the stochastic training process converges to deterministic dynamics characterized by an effective covariance matrix. It establishes a spectral region for stability and learning initiation, revealing that low-rank correlations can enhance learning while strong correlations may lead to instability. Experiments confirm that using informed generator covariance leads to better alignment with the reference subspace and coherent class-conditional samples.
Implications
The findings suggest that incorporating structured latent covariance in GAN training can significantly enhance the model's performance, particularly in generating high-quality, class-conditional samples. This has potential applications in various fields requiring generative modeling, such as image synthesis and representation learning.