AI-generated summaries

Today's ML research, without the noise.

Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.

48 Papers today
8h Update frequency
7 Days of history
Survival Reinforcement Learning: Toward Scalable Self-Supervised RL
Franki Nguimatsia-Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier
Reinforcement Learning · Robotics
  • Introduction of Survival Reinforcement Learning (SRL) as a scalable self-supervised RL method.
  • SRL maximizes dwell time at goals, addressing limitations of existing contrastive methods.
  • Demonstrated superior performance of SRL over state-of-the-art contrastive RL (CRL) on long-horizon locomotion tasks.
  • Empirical evidence supports the effectiveness of classification-based objectives in scaling RL.
Latent Diffusion Pretraining for Crystal Property Prediction
Shrimon Mukherjee, Kishalay Das, Partha Basuchowdhuri, Pawan Goyal, Niloy Ganguly
Generative Models · Graph Learning · Efficient ML
  • Introduction of CrysLDNet, a latent diffusion-based pretraining framework for crystal property prediction.
  • Integration of a Variational Autoencoder with a latent diffusion model to effectively learn from unlabeled data.
  • Significant performance improvements over existing models, particularly in low-data regimes.
  • Backbone-agnostic design allows for future model enhancements without retraining.
DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning
Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu, Xupeng Miao, Fangcheng Fu, Bin Cui
Reinforcement Learning · Large Language Models · Efficient ML
  • Identifies intra-prompt long tails as a significant source of inefficiency in RL for LLMs.
  • Introduces DARTS, a novel framework for active distribution shaping to improve rollout efficiency.
  • Employs a dual-end length sampling strategy and adaptive redundancy allocation to optimize trajectory selection.
  • Demonstrates significant acceleration in RL training processes without degrading model performance.
CoMem: Context Management with A Decoupled Long-Context Model
Yuwei Zhang, Chengyu Dong, Shuowei Jin, Changlong Yu, Hejie Cui, Hongye Jin, Xinyang Zhang, Hamed Bonab, Colin Lockard, Jianshu Chen, Zhenyu Shi, Jingbo Shang, Xian Li, Bing Yin
NLP · Large Language Models · Efficient ML
  • CoMem decouples memory management from reasoning, allowing a specialized model to compress history efficiently.
  • The k-step-off asynchronous pipeline significantly reduces decoding overhead by overlapping memory summarization with agent execution.
  • A novel reward-driven training methodology aligns the memory model to ensure effective decision-making.
  • CoMem achieves a 1.4x latency improvement over traditional long-context solutions while preserving performance.
DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation
Zining Liu, Yunhai Hu, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang
Multimodal · Generative Models · Efficient ML
  • DREAM-S integrates a neural architecture search (NAS) framework to optimize draft model configurations for speedup.
  • The framework employs dynamic selection of intermediate features from the target model to enhance draft model accuracy.
  • DREAM-S achieves up to a 3.85× speedup over conventional decoding methods.
  • The approach significantly outperforms existing speculative decoding techniques in various multimodal tasks.
Density-Guided Robust Counterfactual Explanations on Tabular Data under Model Multiplicity
Jun Tan, Qing Guo, Zicheng Xu, Jinglin Li, Qi Fang, Ning Gui
Generative Models · Interpretability · Optimization
  • DensityFlow provides a novel approach to generating robust counterfactual explanations by focusing on high-density data regions.
  • The framework utilizes Neural ODEs and a density score learned via Noise Contrastive Estimation to guide counterfactual generation.
  • A local proxy distillation mechanism enhances efficiency in black-box settings by minimizing redundant queries.
  • Experimental results show significant improvements in robustness and validity compared to traditional ensemble methods.
Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment
Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang, Gang Li, Rongsheng Li, Ning Li, Zhen Xu, Weiqing Huang, Ming Liu
NLP · Large Language Models · Efficient ML
  • AdvCL repurposes adversarial perturbations for stable continual learning.
  • The framework includes three modules: Intra-Smooth, Proto-Clip, and Inter-Align.
  • Experiments show improvements in performance, robustness, and reduced forgetting.
  • The modules can be integrated into various continual learning paradigms.
Drift Q-Learning
Anas Houssaini, Mohamad H. Danesh, Amin Abyaneh, Scott Fujimoto, Hsiu-Chin Lin, David Meger
Reinforcement Learning · Generative Models · Efficient ML
  • DriftQL combines a drift-based behavioral regularizer with critic-driven policy improvement.
  • The method generates actions in a single forward pass, avoiding the need for iterative denoising.
  • DriftQL outperforms existing diffusion and flow-based methods on standard benchmarks.
  • It maintains performance under degraded data quality, unlike many baseline methods.
Scalable Inference-Time Annealing with Surrogate Likelihood Estimators
Daniel Peñaherrera, Rishal Aggarwal, David Ryan Koes
Generative Models · Efficient ML
  • Introduction of SITA, a scalable method for inference-time annealing in molecular sampling.
  • Utilization of surrogate likelihood estimators to bypass expensive divergence calculations.
  • Demonstration of state-of-the-art performance on alanine dipeptide and alanine tripeptide.
  • Integration of a BoltzNCE-style surrogate into a temperature annealing framework.
Learning to Construct Practical Agentic Systems
Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo, Lauhitya Reddy, Rafael Enrique Cabrera Jimenez, Cassandra A. Cohen, Arthur Kajiyama, William W. Cohen
Large Language Models · Optimization · Efficient ML
  • Introduction of a modular framework for designing agentic systems using pseudo-tools.
  • Demonstration that hand-constructed fixed workflows are faster and more accurate than dynamically planned workflows.
  • Development of novel learning methods that outperform traditional hand-engineered agents.
  • Application of multi-objective optimization to jointly enhance cost efficiency and response quality.
Spatio-temporal stochastic graph-based learning for infectious disease forecasting
Luz Stefani Sotomayor Valenzuela, Susanna Cramb, Darren Wraith
Graph Learning · Time Series
  • Introduces a spatio-temporal stochastic graph-based model for infectious disease forecasting.
  • Addresses the limitations of traditional models by incorporating stochastic processes.
  • Demonstrates improved forecasting accuracy using real-world datasets for COVID-19 and chickenpox.
  • Shows the model's adaptability to various geographical scales and population sizes.
Inner Product Aware Quantization: Provably Fast, Accurate, and Adaptive Algorithms
Nathan White, Krish Singal
Optimization · Efficient ML · Theory
  • Introduction of inner product aware quantization objectives (MDV and ADV).
  • Development of adaptive and unbiased quantization methods that outperform traditional approaches.
  • The algorithms are provably fast and reach solutions within a (1 + ε) factor of the optimal cost.
  • Empirical results show 2-10x speed improvements over state-of-the-art methods.
A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL
Lei Yang, Siyu Ding, Deyi Xiong
Reinforcement Learning · Large Language Models · Theory
  • Cross-domain degradation is driven by sparse RL edits interacting along shared computation routes, not just by global gradient conflict.
  • A local perturbation model reveals that degradation is concentrated in a low-dimensional shared conflict subspace.
  • Short domain refresh can selectively recover performance in earlier domains with limited impact on others.
  • The study provides empirical validation through task-level recovery and a training-free rollback method.
Learning Multi-Agent Coordination via Sheaf-ADMM
Jeffrey Seely, Bartłomiej Cupiał, Llion Jones
Optimization · Graph Learning · Robotics
  • Introduces Sheaf-ADMM for multi-agent coordination with limited local views.
  • Utilizes cellular sheaf theory to define inter-agent constraints for heterogeneous consensus.
  • Demonstrates improved performance on tasks like maze pathfinding, image classification, and Sudoku.
  • Enhances robustness to distribution shifts in MNIST classification compared to standard CNNs.
Convergence of Steepest Descent and Adam under Non-Uniform Smoothness
Sharan Vaswani, Yifan Sun, Reza Babanezhad
Optimization · Theory
  • Generalizes non-uniform smoothness assumptions for better modeling of loss landscapes.
  • Establishes convergence rates for steepest descent and adaptive methods like Adam and RMSProp.
  • Demonstrates that Sign GD converges faster than traditional gradient descent for logistic regression.
  • Shows that RMSProp and Adam can achieve linear convergence rates for certain neural networks.
Auditing Near-Optimal Policies Can Be Exponentially Hard: Conditional Query Lower Bounds via Occupancy Rashomon Capacity
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Reinforcement Learning · Theory · Interpretability
  • Introduces occupancy Rashomon capacity to quantify the complexity of auditing near-optimal RL policies.
  • Establishes conditional lower bounds for exact local-query auditing, indicating potential exponential complexity.
  • Demonstrates the significance of occupancy-class level auditing in distinguishing between behaviorally distinct policies.
  • Provides a finite discounted hidden-branch MDP to illustrate theoretical findings and prove the exact Bayes success law.
What changes after deployment? A survey on On-device Learning in TinyML
Massimo Pavan, Luca Pezzarossa, Fabrizio Pittorino, Manuel Roveri, Xenofon Fafoutis
Efficient ML
  • On-device learning (ODL) lets machine learning models adapt to distribution changes after deployment, directly on the device.
  • The survey categorizes distribution changes into three regimes: single-change, concept drift, and continual learning.
  • There is a significant gap between theoretical benchmarks and real-world applications in ODL.
  • Understanding the nature of distribution changes is crucial for developing effective ODL solutions.
Automating Formal Verification with Reinforcement Learning and Recursive Inference
Max Tan
Reinforcement Learning · Large Language Models · Theory
  • Introduces reinforcement learning with verifiable rewards (RLVR) to improve LLM generation of verified programs and proofs.
  • Achieves significant increases in verified rewards and pass rates through structured training.
  • Identifies and addresses issues of specification hacking in model training.
  • Develops a verifier-guided inference scaffold that enhances proof generation.
RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji
Reinforcement Learning · Large Language Models · Optimization
  • Introduces Group Prioritized Off-Policy Optimization (POPO) to enhance RLVR for LLMs.
  • Addresses the issue of ineffective training samples that lead to zero-variance rewards.
  • Combines prioritized group replay and decoupled off-policy optimization for efficient learning.
  • Empirical results show substantial improvements in reasoning tasks with fewer rollouts.
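The zero-variance problem named above can be illustrated with a group-normalized advantage of the kind used in GRPO-style RLVR training. This is a minimal sketch under that assumption, not the paper's method: `group_advantages` and the toy reward groups are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A prompt whose rollouts disagree yields a usable learning signal.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A prompt whose rollouts all fail (or all succeed) has zero reward variance,
# so every advantage is zero and the rollouts contribute no gradient.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_advantages([1.0, 1.0, 1.0, 1.0])
```

Under this normalization, the compute spent on the `all_wrong` and `all_right` groups produces no policy update, which is the inefficiency the summary refers to.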
Quantized Reasoning Models Think They Need to Think Longer, but They Do Not
Sanae Lotfi, Polina Kirichenko, Steven Li, Zechun Liu
NLP · Large Language Models · Efficient ML
  • Post-training quantization (PTQ) exacerbates overthinking in reasoning models, leading to failures in 52% of cases where correct intermediate answers are abandoned.
  • High-KL-divergence tokens, particularly hesitation and branching markers, are identified as key contributors to overthinking errors.
  • A training-free logit penalty on overthinking markers reduces chain-of-thought (CoT) length by 12-23% while maintaining or improving accuracy.
  • Controlled ablations confirm that targeting overthinking markers yields the best efficiency-performance balance.
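A training-free logit penalty can be sketched as a fixed subtraction applied to selected token logits before sampling. The snippet below is an illustration only: the toy vocabulary, the marker id, and the penalty value are hypothetical, and the paper's actual marker set and penalty schedule are not given in this summary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalize_markers(logits, marker_ids, penalty):
    """Subtract a fixed penalty from the logits of the marker tokens."""
    return [x - penalty if i in marker_ids else x for i, x in enumerate(logits)]

logits = [2.0, 1.0, 0.5, 1.5]  # toy vocabulary of four tokens
marker_ids = {3}               # hypothetical id of a hesitation marker
before = softmax(logits)
after = softmax(penalize_markers(logits, marker_ids, penalty=2.0))
```

Because the penalty only shifts the marker logits, the marker becomes less likely while the relative probabilities of all other tokens stay unchanged.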
idSCD: Identifying Training Datasets through Semantic Correlation Descriptors
Andrada Gobeaja, Ionut Hodoroaga, Elena Burceanu, Marius Leordeanu
NLP · Theory · Interpretability
  • Introduces a semantic approach to dataset-level membership inference, moving beyond behavioral evidence.
  • Develops Semantic Correlation Descriptors (SCDs) to capture and compare semantic correlation structures across datasets.
  • Proposes a practical membership score that does not require leave-one-dataset-out models.
  • Achieves superior performance compared to existing black-box and white-box methods in various experimental settings.
Parallel Tempering Initial Sampling in Inference-Time Reward Alignment
Myeongjun Oh, Gwangho Kim, Sungyoon Lee
Generative Models
  • PATHS improves initialization for inference-time reward alignment in generative models.
  • The method utilizes parallel tempering to explore complex reward landscapes effectively.
  • Periodic Metropolis swaps between chains enhance the sampling of high-reward states.
  • Experiments show consistent performance gains over existing SMC-based methods.
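The Metropolis swap step mentioned above follows the standard parallel tempering rule. The sketch below assumes each chain k targets a density proportional to exp(beta_k * reward(x)); the function names and that tempering form are assumptions for illustration, not details taken from the paper.

```python
import math
import random

def swap_accept_prob(beta_i, beta_j, reward_i, reward_j):
    """Metropolis acceptance probability for exchanging the states of chains i and j.

    With chain k targeting exp(beta_k * reward(x)), the acceptance ratio is
    exp((beta_i - beta_j) * (reward_j - reward_i)), capped at 1.
    """
    log_ratio = (beta_i - beta_j) * (reward_j - reward_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def maybe_swap(states, rewards, betas, i, j, rng=random):
    """Swap the states (and cached rewards) of chains i and j with the Metropolis probability."""
    if rng.random() < swap_accept_prob(betas[i], betas[j], rewards[i], rewards[j]):
        states[i], states[j] = states[j], states[i]
        rewards[i], rewards[j] = rewards[j], rewards[i]
        return True
    return False
```

A swap that moves a high-reward state from a hot chain (small beta) into a cold chain (large beta) is always accepted, which is how exploration by the hot chains feeds the chain that is actually sampled from.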
Chem-PerturBridge: a harmonized compendium of small molecule perturbation transcriptomic effects
Artur Szałata, Olga Novitskaia, Maiia Shulman, Matthew Mella, Altynbek Zhubanchaliyev, Fabian J. Theis
Theory
  • Chem-PerturBridge integrates a vast amount of transcriptomic data from diverse sources, providing a unified resource for small-molecule perturbation studies.
  • The study reveals that while fine-grained agreement in log fold change (logFC) across datasets is weak, the direction of logFC is more consistent.
  • Embeddings pretrained on Chem-PerturBridge significantly improve performance in compound representation learning compared to existing methods.
  • The resource supports both diagnostic evaluations of cross-dataset agreement and model-oriented reuse of heterogeneous data.
Balanced LoRA: Removing Parameter Invariance to Accelerate Convergence
Valérie Castin, Kimia Nadjahi, Pierre Ablin, Gabriel Peyré
NLP · Large Language Models · Efficient ML
  • BaLoRA improves convergence rates by enforcing balanced low-rank adapters during optimization.
  • Theoretical analysis shows that balanced minimizers have optimal conditioning, leading to faster convergence.
  • Empirical results demonstrate that BaLoRA outperforms standard LoRA and matches or exceeds state-of-the-art LoRA variants.
  • The method is computationally efficient and compatible with existing fine-tuning frameworks.
Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments
Nishant Kumar, Enrique Areyan Viqueira, Amy Greenwald
Reinforcement Learning · Optimization · Theory
  • Identification of 'zero collapse' as a failure mode in policy gradient methods due to discontinuous reward landscapes.
  • Mechanistic explanation of how flat zero-reward regions lead to vanishing gradient signals and sample inefficiency.
  • Empirical demonstration of zero collapse across multiple policy gradient methods.
  • Proposed mitigation strategies to enhance stability and learning speed in reinforcement learning.
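The vanishing-signal mechanism can be reproduced with a score-function (REINFORCE) gradient estimate on a step reward. This is a toy sketch, not the paper's experimental setup: the Gaussian policy, the threshold reward, and all names are hypothetical.

```python
import random

def reinforce_grad_mean(mu, sigma, reward_fn, n_samples, rng):
    """Score-function estimate of d E[reward] / d mu for a Gaussian policy N(mu, sigma^2)."""
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(mu, sigma)
        score = (a - mu) / sigma**2  # d log N(a; mu, sigma^2) / d mu
        total += reward_fn(a) * score
    return total / n_samples

def step_reward(a, threshold=10.0):
    """Discontinuous reward: zero everywhere below the threshold."""
    return 1.0 if a > threshold else 0.0

rng = random.Random(0)
# Far from the threshold, every sampled reward is zero, so the estimate is exactly zero.
grad_far = reinforce_grad_mean(0.0, 1.0, step_reward, 1000, rng)
# Close to the threshold, some samples cross it and the estimate points toward higher mu.
grad_near = reinforce_grad_mean(9.5, 1.0, step_reward, 1000, rng)
```

In the flat region the estimator returns no direction at all, however many samples are drawn, which matches the sample inefficiency described in the bullets above.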
Improving Selective Classification with Pairwise Queries for Binary Classification
Harsh Vardhan, Sunav Choudhary, Natwar Modani, Arya Mazumdar
NLP · Large Language Models · Theory
  • Selective classification can waste expert resources if confidence estimates are unreliable.
  • Pairwise queries provide a more accurate measure of sample quality than confidence estimates.
  • The proposed method improves accuracy on non-rejected samples while reducing costs.
  • Theoretical conditions for the effectiveness of pairwise queries are established.
UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures
Triet M. Le
Computer Vision · Theory · Optimization
  • UR-JEPA introduces a new regularization method based on uniform rectifiability to prevent representation collapse in JEPAs.
  • The method targets a uniformly n-rectifiable measure, contrasting with the isotropic Gaussian target of LeJEPA.
  • Empirical results show UR-JEPA achieves higher accuracy and lower seed variance than LeJEPA across multiple datasets.
  • The geometric properties of the embeddings produced by UR-JEPA are significantly different from those of LeJEPA, indicating a more structured representation.
Calibrated Preference Learning: The Case of Label Ranking
Santo M. A. R. Thies, Viktor Bengs, Timo Kaufmann, Sebastian J. Vollmer, Eyke Hüllermeier
Theory · Reinforcement Learning
  • Introduces calibration notions specifically for probabilistic label ranking, extending beyond multi-class classification.
  • Establishes a theoretical framework showing the relationships between different calibration notions.
  • Empirically evaluates the calibration properties of popular label ranking models, revealing significant calibration issues.
  • Finds a strong correlation between calibration and benchmark accuracy in RLHF reward models.
Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation
Jostein Barry-Straume, Changmin Son, Adrian Sandu, Gavan Burke, Rekha Sundararajan, Andrew Rimell, James G. Steinrock
Time Series
  • The paper benchmarks five uncertainty quantification (UQ) methodologies for turbine gas temperature (TGT) prediction in engine health management.
  • A unified experimental framework is used for hyperparameter selection and performance evaluation.
  • Distinct trade-offs in interval coverage, width, and stability are identified among the methods.
  • The results provide practical guidance for selecting UQ methods in real-world applications.
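Interval coverage and width are commonly computed as shown below. The definitions are the usual ones (fraction of targets inside the interval, mean interval length); the paper's exact metrics are not stated in this summary, and the toy numbers are made up for illustration.

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of targets that fall inside their prediction interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)

def mean_interval_width(lower, upper):
    """Average length of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Toy targets and intervals (hypothetical values, not turbine data).
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 2.5, 2.0, 3.0]
upper = [1.5, 3.5, 4.0, 5.0]

coverage = interval_coverage(y_true, lower, upper)  # 3 of 4 targets are covered
width = mean_interval_width(lower, upper)
```

The two numbers pull against each other: widening every interval raises coverage but makes the prediction less informative, which is the trade-off the benchmark compares across methods.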
Rethinking the Role of Temperature in Large Language Model Distillation
Hoang-Chau Luong, Lingwei Chen
NLP · Large Language Models · Theory
  • Temperature plays a crucial role in the effectiveness of distillation objectives in LLMs.
  • Forward KL (FKL) can outperform reverse KL (RKL) at higher temperatures, contradicting the common belief that RKL is inherently superior.
  • Temperature enhances knowledge transfer by enriching non-dominant token signals in FKL.
  • The impact of temperature extends beyond FKL to improve various KL-based distillation objectives.
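The role of temperature can be seen in a small sketch of temperature-scaled softmax with forward and reverse KL between teacher and student. It assumes the common convention of dividing both sets of logits by the same temperature; the function names and toy logits are hypothetical, and the paper's exact objectives may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support on q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_losses(teacher_logits, student_logits, temperature):
    """Return (forward KL, reverse KL) at the given temperature."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return kl(p, q), kl(q, p)

teacher = [4.0, 1.0, 0.0]  # toy logits over three tokens
student = [3.0, 2.0, 0.0]
sharp = softmax(teacher, temperature=1.0)
soft = softmax(teacher, temperature=4.0)
fkl, rkl = distill_losses(teacher, student, temperature=4.0)
```

At the higher temperature the teacher places more probability on its non-dominant tokens, so those tokens carry more weight in the forward KL sum.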
Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't
Anej Svete, William Merrill, Ryan Cotterell, Ashish Sabharwal
Theory
  • Padded transformers are robust to changes in attention type, model width, and uniformity.
  • Numeric precision and model depth are the main factors affecting expressivity.
  • Polynomially padded L-uniform constant-precision transformers are equivalent to L-uniform AC^0.
  • Increasing width or precision beyond logarithmic levels does not enhance expressivity.
When Data Is Scarce: Scaling Sparse Language Models with Repeated Training
Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Mykola Pechenizkiy, Elena Mocanu, Torsten Hoefler, Decebal Constantin Mocanu
NLP · Large Language Models · Efficient ML
  • Introduction of a scaling law for sparse training under data constraints.
  • Demonstration of delayed data saturation, making multi-epoch training more effective.
  • Identification of resource trade-offs between loss-optimal and compute-optimal sparsity.
  • Sparsity improves both data utilization and parameter efficiency.
Regularized Large Neighborhood Search
Germain Vivier-Ardisson, Laurent Demonet, Axel Parmentier, Mathieu Blondel
Optimization · Theory · Efficient ML
  • Introduces Regularized Large Neighborhood Search (RLNS), bridging large neighborhood search (LNS) with Gibbs sampling for combinatorial optimization in neural networks.
  • Proves that RLNS under entropic regularization performs exact block Gibbs sampling.
  • Demonstrates the ability to interpolate between pseudolikelihood and exact maximum likelihood estimation.
  • Evaluates RLNS on multiple NP-hard combinatorial problems, showing its practical applicability.
A Unifying View of Variational Generative Wasserstein Flows
Paul Caucheteux, Clément Bonet, Anna Korba
Generative Models · Optimization · Theory
  • Introduction of Generative Wasserstein Flows (GWF) as a unified framework for generative modeling.
  • Derivation of various generative methods as instances of parametric Jordan-Kinderlehrer-Otto (JKO) schemes for f-divergences.
  • Extension of the JKO framework to Integral Probability Metrics and squared Maximum Mean Discrepancy.
  • Empirical analysis of JKO regularization effects on generative model training.
Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
Amirhossein Ghaffari, Saeid Sheikhi, Ekaterina Gilman
Graph Learning · Time Series
  • GC-MoE introduces a dual-pathway router that combines static topology features with dynamic input representations for expert selection.
  • The framework leverages frozen pretrained experts, allowing for low-parameter training while utilizing a diverse set of models.
  • An optional output refinement layer can enhance performance at minimal additional parameter cost.
  • The study includes an ablation analysis to evaluate the effectiveness of lightweight extensions and their interaction with routing mechanisms.
Minimax-Optimal Policy Regret in Partially Observable Markov Games
Raman Arora
Reinforcement Learning · Theory · Efficient ML
  • Introduces a unified framework for learning in partially observable Markov games against adaptive adversaries.
  • Proves that an epoch-based optimistic maximum-likelihood algorithm achieves O(√T) policy regret.
  • Establishes a matching lower bound for policy regret, confirming the optimality of the upper bound.
  • Extends the framework to handle horizon-adaptive guarantees and adversaries with fading memory.
Policy and World Modeling Co-Training for Language Agents
Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang
NLP · Large Language Models · Reinforcement Learning
  • Identifies next observations in on-policy rollouts as a valuable source of action-conditioned world modeling supervision.
  • Introduces PaW, the first framework for joint policy optimization and world modeling supervision during RL training.
  • Incorporates innovative techniques for data selection, loss management, and balancing to enhance stability and informativeness of WM supervision.
  • Demonstrates consistent performance improvements across multiple agentic tasks and RL algorithms.
Quantifying the Energy Floor: Direct Measurement and Replay Buffer Bias in SAC-Based HVAC Control on sbsim
Bo Li, Chen Zhang
Reinforcement Learning · Optimization
  • The energy floor for SAC-based HVAC control is measured at $35.51/day, dominated by electrical loads.
  • Replay buffer initialization is identified as the main cause of sub-optimal performance, accounting for a significant cost gap.
  • Expanding the supply water temperature range has negligible impact on cost savings and may violate physical constraints.
  • A discount factor coupling reduces the effective planning horizon, highlighting a potential issue in benchmark configurations.
A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation
Zefeng Li, Evan Shelhamer
Computer Vision
  • Current open-set test-time adaptation (TTA) methods inadequately balance in-distribution (InD) and out-of-distribution (OOD) accuracy.
  • A new baseline method using sigmoid outputs improves the trade-off between InD recognition and OOD rejection.
  • The proportion of OOD data in batches significantly affects the performance of TTA methods.
  • Existing evaluations of TTA methods often overlook OOD performance metrics.
Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies
Hikmet Simsir, Ozgur S. Oguz
Reinforcement Learning · Generative Models · Robotics
  • LP-DS is a lightweight adaptation framework that enhances frozen generative policies without the need for full decoder fine-tuning.
  • The method employs a Lagrangian trust-region objective to dynamically constrain perturbation magnitudes, balancing reward maximization and preservation of the pretrained latent prior.
  • LP-DS effectively mitigates mode collapse, preserving action-space diversity while achieving strong performance across various benchmarks.
  • The framework is validated beyond compact diffusion policies, demonstrating effectiveness in diverse robotics applications and physical robot experiments.
Bounded Behavioral Indistinguishability for Black-Box LLM Distillation
Munawar Hasan
Large Language Models · NLP · Theory
  • Introduction of bounded behavioral indistinguishability for black-box LLM distillation.
  • Development of an empirical evaluation methodology combining various tests to assess behavioral indistinguishability.
  • Demonstration that LoRA distillation improves semantic similarity but does not fully eliminate distinguishability.
  • Identification of residual behavioral artifacts in style, format, and domain-specific prompts.
Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek, Ingmar Posner, Jan Peters
Reinforcement Learning · Robotics · Generative Models
  • Introduces coherent soft imitation learning (CSIL) as a method for fine-tuning large behavior models using learned dense rewards.
  • Demonstrates that CSIL outperforms traditional RL approaches in terms of sample efficiency and performance retention.
  • Achieves a success rate of ≥90% on five out of six complex manipulation tasks, showcasing the effectiveness of the proposed method.
  • Addresses the issue of performance degradation during RL finetuning by ensuring optimal initial policies.
TASER: Task-Aware Stein Regularisation for Geometry-Driven Robustness
Michał Kozyra, Gesine Reinert
Theory
  • TASER introduces a geometry-aware regularisation framework that penalises model sensitivity based on the data distribution.
  • The method provides a principled alternative to isotropic gradient regularisation by aligning sensitivity with the structure of the data.
  • Theoretical insights link Stein residual minimisation to reduced sensitivity under distributional perturbations.
  • TASER enhances adversarial robustness by controlling sensitivity in directions that diverge from high-density regions.
Fixed Universal Transformers
Jingwen Liu, Alexandr Andoni, Daniel Hsu
Theory
  • Introduces the notion of universal transformers that can simulate any transformer in a class via input embeddings.
  • Provides explicit constructions of sparse universal transformers and shows that randomly initialized transformers are universally capable.
  • Establishes lower bounds on the embedding dimensions required for universality, particularly for transformers with multiple heads.
  • Empirical evaluations demonstrate the effectiveness of universal transformers in specific algorithmic tasks.
VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting
Xudong Zhang, Jierui Lei, Jiacheng Li, Lingdong Shen, Jian Cui, Haina Tang
Time Series
  • VLBM effectively separates stable dynamics from OOD deviations in multivariate time series forecasting.
  • The model uses a latent basis to capture stable in-distribution (ID) dynamics and decomposes inputs into relevant components.
  • VLBM achieves state-of-the-art performance in OOD robustness and ID accuracy across multiple real-world tasks.
  • The framework addresses a critical reliability issue in forecasting under mixed ID/OOD conditions.
Fast Generalization after Interpolation via Critically Damped Momentum Optimization
Luca Muscarnera, Silas Ruhrberg Estévez, Yuanzhang Xiao, Mihaela Van der Schaar
Optimization · Theory · Efficient ML
  • GROKtimizer is introduced as a biphasic optimization strategy that enhances generalization in high-dimensional settings.
  • The paper links the post-interpolation phase of training to damped dynamics, providing a theoretical foundation for the proposed method.
  • Critically Damped Momentum (CDM) is shown to accelerate convergence towards low-norm solutions, which are associated with better generalization.
  • The method demonstrates a quadratic speedup over traditional gradient descent, making it more efficient.
Perturbative methods for non-parametric instrumental variable
Wei Bu, Arthur Gretton
Theory
  • Introduces a perturbative approach to non-parametric instrumental variable (NPIV) estimation that improves accuracy in high-dimensional settings.
  • Demonstrates significant reduction in prediction error (up to 99%) compared to standard kernel ridge regression.
  • Addresses the curse of dimensionality by systematically correcting kernel ridge solutions with higher-order perturbations.
  • Shows that the method is particularly effective when the dimensionality grows rapidly with sample size.
Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete
Qian Li, Xinyu Mao, Shang-Hua Teng
Theory · Large Language Models
  • Positional encoding is not necessary for Turing completeness in sliding-window transformers.
  • The sliding window mechanism introduces temporal asymmetry that breaks permutation symmetry.
  • The HIST model demonstrates that a finite control state and token-count histogram can achieve universal computation.
  • The paper provides a theoretical foundation for understanding the expressiveness of transformers without relying on positional encodings.