AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
24
Papers today
8h
Update frequency
7
Days of history
Feature Dimensionality Outweighs Model Complexity in Breast Cancer Subtype Classification Using TCGA-BRCA Gene Expression Data
Theory
Efficient ML
Interpretability
- Model complexity does not necessarily lead to better classification performance in high-dimensional datasets.
- Logistic regression showed the most balanced performance across breast cancer subtypes, especially for rare classes.
- Random forest and SVM models exhibited limitations in minority subtype detection despite high overall accuracy.
- Macro F1 score is a more informative metric than accuracy for evaluating performance in imbalanced datasets.
Summary
This study investigates the classification of breast cancer subtypes using gene expression data from the TCGA-BRCA dataset, focusing on the interplay between model complexity and feature dimensionality. The research highlights the challenges posed by high-dimensional data and limited sample sizes, which can lead to overfitting, particularly for complex machine learning models. Three classifiers—logistic regression, random forest, and support vector machines (SVM)—were evaluated across varying numbers of highly variable genes (from 50 to 20,518) using stratified 5-fold cross-validation. Performance was assessed using accuracy and macro F1 score, revealing that while all models achieved high accuracy, their performance varied significantly at the subtype level. Logistic regression provided the most stable performance, particularly in detecting rare subtypes, whereas random forest struggled with minority classes despite overall accuracy. SVM's performance was sensitive to the dimensionality of features. The findings underscore the importance of model simplicity, appropriate evaluation metrics, and feature selection in high-dimensional biological classification tasks, suggesting that simpler models may outperform complex ones in certain contexts.
Methodology
The study utilized logistic regression, random forest, and support vector machine classifiers on TCGA-BRCA gene expression data, varying the number of selected features. Performance was evaluated using stratified 5-fold cross-validation, with metrics including accuracy and macro F1 score to assess subtype-level performance.
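A minimal sketch of this evaluation protocol (not the authors' code), assuming scikit-learn and using synthetic data in place of the TCGA-BRCA matrix:

```python
# Select the k most variable genes, then score classifiers with
# stratified 5-fold cross-validation on macro F1.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5000))      # samples x genes (placeholder data)
y = rng.integers(0, 5, size=1000)      # 5 subtype labels

k = 500                                # number of highly variable genes
top_k = np.argsort(X.var(axis=0))[-k:] # rank genes by variance, keep top k
X_k = X[:, top_k]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("rf", RandomForestClassifier()),
                  ("svm", SVC())]:
    scores = cross_val_score(clf, X_k, y, cv=cv, scoring="f1_macro")
    print(f"{name}: macro F1 = {scores.mean():.3f}")
```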
Results
All models achieved high accuracy; however, macro F1 score analysis revealed significant differences in performance across subtypes. Logistic regression outperformed others in stability and rare subtype detection, while random forest and SVM faced challenges with minority subtypes. The results emphasized the importance of feature selection and appropriate evaluation metrics in high-dimensional classification tasks.
Implications
The findings suggest that simpler models may be more effective for classifying high-dimensional biological data, particularly in clinical settings where accurate detection of all subtypes is critical. This research can inform future studies on machine learning applications in genomics and personalized medicine.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
NLP
Large Language Models
Generative Models
- INTRA framework unifies retrieval and generation within a single attention-based model.
- Eliminates the retriever-generator mismatch typical of traditional RAG systems.
- Empirical results show INTRA outperforms engineered retrieval pipelines in question answering.
- The model reuses pre-encoded evidence, reducing computational overhead.
Summary
This paper introduces INTRA (INTrinsic Retrieval via Attention), a novel framework that allows attention-based encoder-decoder models to retrieve information directly from their own internal representations, rather than relying on separate retrieval systems. The authors argue that traditional retrieval-augmented generation (RAG) approaches, which treat retrieval and generation as distinct processes, can be unified within a single model. INTRA leverages the attention mechanism to score pre-encoded evidence chunks, enabling the model to reuse these representations for both retrieval and generation tasks. The framework is particularly effective in question-answering scenarios, where it outperforms conventional retrieval pipelines in terms of evidence recall and answer quality. By demonstrating that attention-based models inherently possess retrieval capabilities, the authors highlight the potential for more efficient and integrated systems in natural language processing tasks.
Methodology
The authors formulated the INTRA framework, which utilizes a pretrained encoder-decoder model to perform both retrieval and generation. The model employs cross-attention queries to score and select relevant evidence from its own encoded representations, allowing it to generate answers based on this internal retrieval without needing an external retriever. The methodology includes a detailed analysis of how attention mechanisms can be adapted for retrieval tasks, focusing on the shared representation space between evidence selection and answer generation.
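The core scoring step can be illustrated with a small numpy sketch; the pooling scheme and names below are our own simplifications, not the paper's implementation:

```python
# A decoder query vector is dotted against pooled chunk encodings, and the
# top-scoring chunks are "retrieved" for generation from the same
# representations the decoder will attend to.
import numpy as np

rng = np.random.default_rng(0)
d = 64
chunk_encodings = rng.normal(size=(100, 10, d))  # 100 chunks x 10 tokens x d
query = rng.normal(size=(d,))                    # cross-attention query

chunk_keys = chunk_encodings.mean(axis=1)        # pool tokens -> one key/chunk
logits = chunk_keys @ query / np.sqrt(d)         # scaled dot-product scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax over chunks

top = np.argsort(probs)[-5:][::-1]               # retrieve the 5 best chunks
print("retrieved chunk ids:", top)
```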
Results
The INTRA framework was tested on various question-answering benchmarks, where it consistently outperformed traditional retrieval-augmented generation systems. The results indicated superior evidence recall and end-to-end answer quality, showcasing the effectiveness of using a unified model for both retrieval and generation tasks. The computational profile analysis revealed that the model benefits from reusing static evidence, leading to efficiency gains.
Implications
The findings suggest that future natural language processing systems could benefit from integrating retrieval capabilities directly into language models, potentially leading to more efficient architectures that reduce the need for separate retrieval components. This could enhance the performance of applications in question answering, dialogue systems, and other areas where information retrieval is critical.
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
NLP
Large Language Models
- Memory Inception (MI) is a training-free method for steering LLMs using latent KV banks at selected layers.
- MI provides a better control-drift trade-off compared to traditional prompting and outperforms CAA.
- The method allows for mid-conversation behavior shifts without rewriting the visible transcript.
- MI significantly reduces KV storage requirements, achieving reductions of up to 118×.
Summary
The paper introduces Memory Inception (MI), a novel method for steering large language models (LLMs) that operates in latent attention space. Unlike traditional prompting, which clutters the prompt cache with guidance tokens, or activation steering, which lacks text fidelity, MI selectively injects text-derived key-value (KV) banks at specific layers during model inference. This approach allows for more efficient memory usage and improved control over the model's behavior. The authors evaluate MI across three tasks: personality steering, mid-dialogue behavior shifts, and structured reasoning tasks (HARDMath and PHYSICS). MI demonstrates a favorable control-drift trade-off compared to prompting and consistently outperforms CAA (Contrastive Activation Addition) in various scenarios. Additionally, MI supports mid-conversation updates without altering the visible transcript and significantly reduces KV storage requirements, making it a powerful tool for persistent and structured guidance in LLMs.
Methodology
The authors propose MI as a method that encodes reminder content into latent KV banks, which are then selectively attached to specific attention layers during decoding. This selective allocation of KV slots allows the model to access guidance only where necessary, rather than storing it across all layers. The implementation includes a canonical-key RoPE (Rotary Position Embedding) that supports various attention mechanisms.
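A toy numpy sketch of the KV-injection idea, with illustrative shapes and no claim to match the paper's canonical-key RoPE details:

```python
# At a selected layer, extra key/value rows (the "bank") are appended to
# the cache before attention, so queries can attend to guidance that never
# appears in the visible transcript.
import numpy as np

def attend(Q, K, V):
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d, T, B = 32, 16, 4                      # head dim, seq len, bank size
Q = rng.normal(size=(T, d))
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
bank_K, bank_V = rng.normal(size=(B, d)), rng.normal(size=(B, d))

plain = attend(Q, K, V)                            # ordinary attention
steered = attend(Q, np.vstack([K, bank_K]),        # attention with the
                 np.vstack([V, bank_V]))           # injected KV bank
print(np.abs(steered - plain).mean())              # steering effect size
```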
Results
MI was evaluated on personality-steering tasks, mid-dialogue behavior shifts, and structured reasoning tasks. It showed competitive performance with prompting while consistently outperforming CAA. In reasoning domains, MI improved performance on PHYSICS and HARDMath, serving as a training-free alternative to fine-tuning. The method also demonstrated a significant reduction in KV storage requirements.
Implications
The introduction of MI could enhance the usability of LLMs in applications requiring persistent guidance, such as virtual assistants, educational tools, and interactive storytelling. Its efficiency in memory usage could lead to more scalable and responsive AI systems.
BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification
Large Language Models
Efficient ML
- BoostLLM applies boosting as a training principle for LLM fine-tuning in tabular prediction.
- The framework uses sequential PEFT adapters as weak learners to correct residual errors.
- Empirical results show BoostLLM outperforms standard fine-tuning and matches or exceeds XGBoost.
- Incorporating decision-tree paths enhances the model's learning efficiency in low-data settings.
Summary
This paper introduces BoostLLM, a novel framework that applies the boosting paradigm to fine-tune large language models (LLMs) for few-shot tabular classification tasks. Traditional approaches have shown that LLMs can struggle in low-data environments compared to gradient-boosted decision trees (GBDTs). BoostLLM innovatively transforms parameter-efficient fine-tuning into a multi-round residual optimization process, where sequentially trained PEFT adapters act as weak learners. The framework incorporates decision-tree paths as an auxiliary input alongside raw features, allowing the model to leverage structured knowledge during early training phases. Empirical evaluations demonstrate that BoostLLM consistently outperforms standard fine-tuning methods across various LLM architectures and datasets, achieving performance on par with or exceeding XGBoost across different shot counts. The results indicate that BoostLLM not only enhances the robustness of LLMs in data-scarce scenarios but also scales effectively when paired with stronger tree models, suggesting that boosting can serve as a general training principle for LLM fine-tuning in structured data contexts.
Methodology
BoostLLM employs a multi-round residual optimization approach, training a sequence of parameter-efficient fine-tuning (PEFT) adapters as weak learners. Each adapter focuses on correcting the errors from the previous ensemble predictions. The framework integrates decision-tree paths as a second input view, allowing the model to learn from structured data while progressively reducing reliance on this auxiliary information as training advances.
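The boosting loop can be sketched with simple regressors standing in for PEFT adapters; this shows only the residual-correction principle, under an assumed logistic loss:

```python
# Each round fits a weak learner to the residual between current ensemble
# probabilities and the targets, then adds it to the ensemble logits.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # binary labels

F = np.zeros(len(y))                              # ensemble logits
lr, rounds = 0.5, 5
for t in range(rounds):
    p = 1 / (1 + np.exp(-F))                      # current probabilities
    residual = y - p                              # what the ensemble misses
    weak = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += lr * weak.predict(X)                     # adapter-style correction
    acc = ((F > 0) == y).mean()
    print(f"round {t}: train acc = {acc:.3f}")
```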
Results
BoostLLM consistently outperformed standard fine-tuning methods across multiple LLM backbones and datasets, achieving results comparable to or better than XGBoost across various shot counts. The framework demonstrated robustness and efficiency, with gains attributed to the boosting mechanism rather than simply the number of parameters.
Implications
The findings suggest that BoostLLM can significantly enhance the performance of LLMs in few-shot tabular classification tasks, making it a valuable tool for applications in data-constrained environments such as finance, healthcare, and e-commerce. The approach also opens avenues for further research into integrating boosting principles into other machine learning frameworks.
Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML
NLP
Large Language Models
Theory
- Global LLM rankings are often misleading due to the cancellation of votes and heterogeneous user preferences.
- Grouping models by language significantly improves the consistency of rankings and user agreement.
- The proposed (λ, ν)-portfolio framework effectively covers a larger fraction of user votes with fewer models.
- The study highlights the importance of considering language and task-specific preferences in model evaluation.
Summary
This paper critiques the current methodology of ranking large language models (LLMs) using global leaderboards based on pairwise human feedback. The authors analyze approximately 89,000 comparisons across 116 languages and 52 LLMs, revealing that the global Bradley-Terry (BT) ranking is misleading due to significant heterogeneity in user preferences across different languages, tasks, and time. They find that nearly two-thirds of decisive votes cancel each other out, leading to indistinguishable rankings among the top models. The authors propose a new framework called (λ, ν)-portfolios, which consists of small collections of models that can achieve a specified prediction error while covering a significant fraction of users. Their approach, formulated as a variant of the set cover problem, demonstrates that using these portfolios can substantially improve coverage of user votes compared to global rankings. The paper also illustrates the application of small portfolios in a classification context, highlighting their potential to identify blind spots in data, which is particularly relevant for policymakers.
Methodology
The authors conducted a comprehensive analysis of pairwise comparisons of LLMs using data from the Arena platform. They formulated the problem of creating effective model portfolios as a variant of the set cover problem, leveraging the VC dimension of the underlying set system to provide guarantees on their approach. They also illustrated the effectiveness of their portfolios in a classification context using the COMPAS dataset.
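Since the portfolio construction is a set-cover variant, a plain greedy sketch (our stand-in, without the paper's VC-dimension guarantees) conveys the objective:

```python
# Each candidate model "covers" the users whose votes it predicts within
# the error threshold; repeatedly pick the model covering the most
# still-uncovered users until the target coverage is reached.
def greedy_portfolio(covers, n_users, target=0.96):
    """covers: dict model -> set of user ids it serves within error lambda."""
    chosen, covered = [], set()
    while len(covered) / n_users < target:
        best = max(covers, key=lambda m: len(covers[m] - covered))
        gain = covers[best] - covered
        if not gain:
            break                          # no model adds coverage
        chosen.append(best)
        covered |= gain
    return chosen, len(covered) / n_users

covers = {"m1": {1, 2, 3, 4}, "m2": {3, 4, 5, 6}, "m3": {7, 8}}
portfolio, frac = greedy_portfolio(covers, n_users=8, target=0.9)
print(portfolio, frac)
```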
Results
The analysis revealed that the global BT ranking was largely ineffective, with only 10.3% of votes being accurately predicted. In contrast, the (λ, ν)-portfolio framework was able to recover just 5 distinct BT rankings that covered over 96% of votes at a modest error threshold, significantly outperforming the global ranking's 21% coverage. Additionally, a portfolio of 6 LLMs was shown to cover twice as many votes as the top 6 models from the global ranking.
Implications
The findings suggest that relying on global rankings can obscure important user preferences and lead to suboptimal model selection. The proposed portfolio approach offers a more nuanced evaluation method that can enhance the performance of LLMs in diverse applications, particularly in contexts where fairness and representation are critical.
From Drops to Grid: Noise-Aware Spatio-Temporal Neural Process for Rainfall Estimation
Time Series
Multimodal
- DropsToGrid is the first Neural Process-based approach for rainfall densification from PWS data.
- The model integrates temporal sequences from PWS with spatial radar context to enhance rainfall estimation.
- It employs a multi-modal attention mechanism for capturing spatial and temporal dependencies.
- Extensive empirical evaluations show superior performance compared to traditional and deep learning methods.
Summary
The paper introduces DropsToGrid, a novel Neural Process-based method designed to enhance rainfall estimation by integrating sparse observations from private weather stations (PWS) with spatial context from radar data. Traditional rainfall measurement methods often suffer from low resolution and biases, making it challenging to capture localized rainfall dynamics. DropsToGrid addresses these issues by employing a multi-scale feature extraction approach, temporal attention mechanisms, and multi-modal fusion to produce dense rainfall fields. The model is capable of generating stochastic, continuous rainfall estimates while explicitly quantifying uncertainty. Evaluations on real-world datasets demonstrate that DropsToGrid outperforms existing operational and deep learning baselines, providing accurate high-resolution rainfall maps even in scenarios with limited station availability and across different regions. The method represents a significant advancement in rainfall estimation, combining temporal sequences and radar context in a single probabilistic framework.
Methodology
DropsToGrid utilizes Neural Processes to learn a stochastic representation of rainfall from sparse, noisy PWS observations, guided by dense radar context. The methodology includes multi-scale feature extraction, temporal attention over station sequences, and translation-equivariant spatial fusion to handle the complexities of rainfall data.
Results
The results indicate that DropsToGrid generates accurate high-resolution rainfall estimates and well-calibrated uncertainty maps, outperforming both operational estimators and deep learning baselines. The model's performance is robust even with limited station data and in cross-regional evaluations across Europe.
Implications
The advancements presented in DropsToGrid have significant implications for weather forecasting, water management, and disaster mitigation by providing more accurate and reliable rainfall estimates. This can enhance decision-making processes in agriculture, infrastructure planning, and emergency response.
Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
NLP
Large Language Models
Interpretability
- Introduces Prompt-Only Steering Vector (PrOSV) to improve steering effectiveness without degrading model quality.
- Proposes joint training of steering factors and directions to eliminate the need for post-hoc factor selection.
- Demonstrates that PrOSV outperforms traditional full-sequence SVs (FSSVs) in concept-based steering.
- Finds optimal initialization sizes and learning rates are crucial for effective joint training.
Summary
This paper addresses the limitations of current steering vector (SV) methods used to control large language models (LLMs). Traditional fine-tuned SVs require careful selection of steering factors and often operate as full-sequence SVs (FSSVs), which can degrade generation quality. The authors propose a novel approach called Prompt-Only Steering Vector (PrOSV), which intervenes only on a few prompt tokens, thus preserving model utility while achieving effective steering. They introduce a joint training method for steering factors and directions, eliminating the need for post-hoc factor selection. The methodology is grounded in neural network scaling theory, emphasizing the importance of initialization sizes and learning rates for stability. Empirical results demonstrate that PrOSV outperforms FSSVs on the AXBENCH dataset, achieving a better balance between model utility and adversarial robustness. This work provides a more principled framework for training SVs, making them a more viable tool for controlling LLMs without sacrificing performance.
Methodology
The authors utilize a joint training approach for steering factors and directions, informed by neural network scaling theory. They derive optimal learning rates and initialization strategies to enhance training stability and efficiency. The PrOSV method is designed to intervene only at the prompt stage, minimizing interference with the model's decoding process.
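A minimal numpy sketch of a prompt-only intervention, with illustrative shapes and a hypothetical steering factor:

```python
# A learned vector v is added to hidden states only at prompt token
# positions at one layer, leaving generated-token activations untouched.
import numpy as np

rng = np.random.default_rng(0)
T, d = 24, 64
hidden = rng.normal(size=(T, d))        # hidden states at the target layer
v = rng.normal(size=(d,))
v /= np.linalg.norm(v)                  # unit steering direction
alpha = 4.0                             # jointly trained steering factor

prompt_len = 8                          # only the prompt is intervened on
steered = hidden.copy()
steered[:prompt_len] += alpha * v       # a full-sequence SV would hit all T

print(np.abs(steered - hidden).sum(axis=1)[:10])  # nonzero only on prompt
```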
Results
Empirical evaluations reveal that PrOSV significantly outperforms FSSVs on the AXBENCH benchmark, achieving superior concept-based steering while maintaining general model utility. Additionally, PrOSV demonstrates enhanced robustness against adversarial attacks compared to traditional methods.
Implications
The findings suggest that PrOSV could be a more effective tool for steering LLMs in various applications, potentially leading to improved control mechanisms in AI systems. This approach may facilitate the development of more reliable and interpretable AI models, enhancing their usability in real-world scenarios.
OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination
Optimization
Theory
Efficient ML
- OpenG2G is an open-source simulation platform for AI datacenter-grid coordination.
- The platform allows for the comparison of various control paradigms, including classical and learning-based methods.
- OpenG2G captures metrics from both AI datacenters and power systems, facilitating standardized evaluations.
- The simulation reveals trade-offs between AI operational metrics and grid performance, informing design decisions.
Summary
The paper presents OpenG2G, a simulation platform designed to address the challenges posed by the increasing energy demands of AI datacenters on electricity grids. As AI workloads grow, they can significantly impact grid stability, leading to delays in datacenter construction and operational bottlenecks. OpenG2G enables users to simulate AI datacenter-grid runtime coordination by allowing the implementation and comparison of various control strategies, including classical and learning-based controllers. The platform features a modular architecture that integrates real-world AI service measurements, high-fidelity grid simulations, and a flexible controller interface. The authors demonstrate OpenG2G's capabilities through realistic scenarios, revealing how different AI model choices and deployment configurations can influence datacenter flexibility and grid coordination outcomes. The findings highlight the potential for improved coordination strategies that can enhance both AI service performance and grid reliability.
Methodology
OpenG2G employs a modular architecture that consists of three main components: a datacenter backend based on real measurements, a grid backend utilizing high-fidelity simulators, and a generic controller interface. Users can implement different control strategies and evaluate their performance in a simulated environment, allowing for head-to-head comparisons of various AI workloads and grid configurations.
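One plausible shape for such a controller interface is sketched below; the class and method names are our invention, not the platform's actual API:

```python
# A pluggable controller maps joint datacenter/grid observations to a
# power setpoint; classical and learning-based controllers share the
# same interface so they can be compared head to head.
from abc import ABC, abstractmethod

class Controller(ABC):
    @abstractmethod
    def act(self, dc_state: dict, grid_state: dict) -> float:
        """Return the datacenter power target (MW) for the next step."""

class ThresholdController(Controller):
    """Classical baseline: shed load when grid frequency sags."""

    def __init__(self, nominal_mw: float, floor_mw: float):
        self.nominal_mw, self.floor_mw = nominal_mw, floor_mw

    def act(self, dc_state, grid_state):
        if grid_state["frequency_hz"] < 59.9:  # under-frequency event
            return self.floor_mw               # curtail to the floor
        return self.nominal_mw                 # otherwise run at nominal

ctrl = ThresholdController(nominal_mw=50.0, floor_mw=20.0)
print(ctrl.act({"queue_len": 120}, {"frequency_hz": 59.85}))  # -> 20.0
```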
Results
The authors successfully demonstrated OpenG2G's utility by simulating coordination scenarios involving modern AI workloads. They found that different controller designs and AI model choices significantly affect the datacenter's power flexibility and overall coordination outcomes, revealing favorable trade-offs between AI performance metrics and grid operational constraints.
Implications
OpenG2G has the potential to inform actionable design decisions for future AI datacenter projects, enabling better integration with electricity grids. By facilitating the exploration of various control strategies, it can help mitigate the energy challenges posed by growing AI workloads, ultimately supporting more sustainable AI infrastructure development.
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on ℓ1-norm Lower Bounds
Optimization
Theory
Efficient ML
- SignSGD outperforms SGD under specific conditions characterized by ℓ1-norm and ℓ∞-smoothness.
- The paper establishes tight bounds for SignSGD, demonstrating its superior convergence rates compared to SGD.
- The authors show that SignSGD's complexity can be significantly better than SGD when noise is sparse.
- The theoretical framework is extended to matrix optimization, providing insights into the Muon optimizer.
Summary
This paper investigates the theoretical underpinnings of sign-based optimization algorithms, particularly SignSGD, and their performance compared to traditional Stochastic Gradient Descent (SGD). While empirical evidence suggests that sign-based methods excel in training large models, a theoretical framework to explain this advantage has been lacking. The authors propose a shift from the conventional ℓ2-norm based analysis to an ℓ1-norm and ℓ∞-smoothness framework, which aligns better with the characteristics of sign-based updates. They derive matched upper and lower bounds for SignSGD, demonstrating that it can significantly reduce complexity under sparse noise conditions. The study also extends these findings to matrix optimization, providing a lower bound for the Muon optimizer. The theoretical insights are validated through practical experiments, showing faster convergence during the pretraining of a 124M parameter GPT-2 model, thus bridging theory and practice.
Methodology
The authors analyze sign-based optimizers using a new problem class defined by ℓ1-norm stationarity and ℓ∞-smoothness. They employ a separable noise model to capture coordinate-wise noise characteristics, allowing for a more accurate assessment of convergence rates. Rigorous upper and lower bounds are derived for both SignSGD and SGD, facilitating a comparative analysis of their performance.
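For concreteness, the two update rules under comparison, written out in numpy:

```python
# SGD steps along the stochastic gradient; SignSGD steps along its
# coordinate-wise sign, which ties its geometry to the l-infinity norm.
import numpy as np

def sgd_step(w, grad, lr=1e-2):
    return w - lr * grad

def signsgd_step(w, grad, lr=1e-2):
    return w - lr * np.sign(grad)      # magnitude-free update

w = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, -10.0, 0.01])       # heavy-tailed / sparse-noise gradient
print(sgd_step(w, g))                  # dominated by the large coordinate
print(signsgd_step(w, g))              # every coordinate moves equally
```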
Results
The study establishes that SignSGD achieves a convergence rate characterized by a tight ℓ1-norm lower bound, which matches its upper bound. It is shown that under sparse noise conditions, SignSGD can reduce complexity by a factor of the problem dimension (d) compared to SGD. The findings are corroborated by empirical results demonstrating faster convergence during the pretraining of a GPT-2 model.
Implications
The theoretical insights provided in this paper could lead to the development of more efficient optimization algorithms for training large-scale models, particularly in scenarios where noise characteristics are sparse or heterogeneous. This could enhance the performance of various machine learning applications, especially in deep learning and large language models.
Mean Mode Screaming: Mean–Variance Split Residuals for 1000-Layer Diffusion Transformers
Generative Models
Multimodal
Theory
- Characterization of a mean-dominated collapse state in ultra-deep DiTs.
- Introduction of Mean Mode Screaming (MMS) as a critical trigger for collapse.
- Development of Mean–Variance Split (MV-Split) Residuals to stabilize training.
- MV-Split Residuals outperform traditional gating methods like LayerScale.
Summary
This paper addresses the challenges of scaling Diffusion Transformers (DiTs) to extreme depths, specifically the emergence of a structural vulnerability termed Mean Mode Screaming (MMS). MMS leads to a mean-dominated collapse state where token representations become homogenized, suppressing variation and hindering model performance. The authors identify the mechanisms behind MMS, highlighting the role of mean-coherent gradient components and the suppression of attention-logit gradients due to value homogenization. To mitigate this issue, they propose a novel architecture called Mean–Variance Split (MV-Split) Residuals, which separates the updates for centered and mean components of the residuals. This approach stabilizes training without the convergence penalties associated with existing isotropic gating methods. The paper demonstrates that MV-Split Residuals effectively prevent collapse events in a 400-layer DiT and maintains stable training in a 1000-layer configuration, validating the architecture's robustness at extreme depths.
Methodology
The authors conducted mechanistic audits to identify the collapse dynamics in ultra-deep DiTs. They proposed MV-Split Residuals, which involve a separate update for centered residuals combined with a leaky trunk-mean replacement. This method was tested on both a 400-layer and a 1000-layer DiT, comparing performance against baseline models and existing stabilization techniques.
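A schematic numpy version of the split-residual idea, under assumed shapes and a hypothetical leak coefficient; the paper's exact formulation may differ:

```python
# The block update is decomposed into its token-mean and centered parts,
# combined with separate gains: a leaky replacement for the trunk mean and
# a standard residual for the centered component.
import numpy as np

def mv_split_residual(x, update, beta=0.9):
    """x, update: (tokens, dim). beta: leak on the trunk mean."""
    mean_x = x.mean(axis=0, keepdims=True)           # trunk mean component
    mean_u = update.mean(axis=0, keepdims=True)
    centered_x = x - mean_x
    centered_u = update - mean_u
    new_mean = beta * mean_x + (1 - beta) * mean_u   # leaky mean replacement
    return (centered_x + centered_u) + new_mean      # standard centered path

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))
u = rng.normal(size=(16, 64)) + 5.0    # mean-coherent update (MMS-like)
out = mv_split_residual(x, u)
print(out.mean(), (x + u).mean())      # split path damps the mean growth
```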
Results
The MV-Split Residuals successfully prevented collapse events in the 400-layer DiT, allowing it to converge faster than the baseline model. In the 1000-layer configuration, the architecture remained stably trainable, validating its effectiveness at extreme depths.
Implications
The findings suggest that MV-Split Residuals could enhance the training stability and performance of ultra-deep generative models, potentially leading to advancements in various applications of Diffusion Transformers in generative modeling and multimodal tasks.
Federated Cross-Client Subgraph Pattern Detection
Graph Learning
Federated Learning
- Introduces a novel framework for federated subgraph pattern detection that addresses the representation-equivalence gap.
- Proposes a per-step, layer-wise embedding exchange mechanism to synchronize node representations across clients.
- Demonstrates that embedding exchange and federated parameter aggregation are complementary techniques.
- Empirical results show significant improvements in detection accuracy when using fresh embeddings at each training step.
Summary
This paper addresses the challenge of subgraph pattern detection in a federated learning context, where graphs are distributed across multiple clients. Traditional graph neural networks (GNNs) rely on centralized data, which leads to a representation-equivalence gap when clients can only access partial graph information. The authors formalize this issue as a structural observability problem, where subgraph patterns that span multiple clients become locally unidentifiable. To mitigate this, they propose a per-step, layer-wise embedding exchange framework that allows clients to synchronize intermediate node representations during the GNN forward pass without sharing raw features or labels. This approach, under the assumptions of extended subgraph and shared model parameters, ensures that clients can compute the same node representations as a centralized model. The paper empirically demonstrates that combining embedding exchange with federated parameter aggregation significantly reduces the representation gap, particularly when embeddings are updated at each training step rather than at the end of an epoch. The experiments conducted on synthetic directed multigraphs reveal the complementary nature of these mechanisms, highlighting the importance of fresh embeddings for effective subgraph pattern detection.
Methodology
The authors developed a per-step, layer-wise embedding exchange framework that synchronizes intermediate node representations during the GNN forward pass. This method operates under the assumptions of extended subgraph and shared model parameters, allowing clients to compute equivalent representations to a centralized GNN. The framework was evaluated using synthetic directed multigraphs with various patterns, comparing the performance of embedding exchange alone versus its combination with federated parameter aggregation.
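A toy sketch of the exchange step for a single mean-aggregation layer, with invented shapes; it only illustrates why fresh remote embeddings recover the centralized computation:

```python
# Before each GNN layer, a client fetches fresh embeddings for nodes owned
# by the other client. With the exchange it reproduces the centralized
# layer; without it, the computation diverges.
import numpy as np

def gnn_layer(h, adj, W):
    deg = np.maximum(adj.sum(1, keepdims=True), 1)
    return np.tanh((adj @ h / deg) @ W)             # mean aggregation

rng = np.random.default_rng(0)
n, d = 6, 8
adj = (rng.random((n, n)) < 0.5).astype(float)      # global graph
W = rng.normal(size=(d, d)) / np.sqrt(d)            # shared parameters
h = rng.normal(size=(n, d))                         # layer input
owner = np.array([0, 0, 0, 1, 1, 1])                # node -> owning client

central = gnn_layer(h, adj, W)
h_exchanged = h.copy()                              # client 0 after exchange:
                                                    # remote rows are fresh
h_stale = h.copy()
h_stale[owner == 1] = 0.0                           # no exchange: remote rows
                                                    # missing at client 0
print(np.allclose(gnn_layer(h_exchanged, adj, W), central))  # True
print(np.allclose(gnn_layer(h_stale, adj, W), central))      # False
```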
Results
The experiments indicated that while embedding exchange alone closes only a limited portion of the representation gap, the combination of embedding exchange and federated parameter aggregation recovers most of it. The results emphasized the necessity of refreshing exchanged embeddings at each training step to achieve optimal performance.
Implications
This work has significant implications for applications requiring collaborative analysis of distributed graph data, such as financial crime detection, cybersecurity, and bioinformatics. By enabling effective subgraph pattern detection in a federated setting, it addresses privacy concerns while maintaining analytical accuracy.
Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven
Theory
Efficient ML
Federated Learning
- Composing two RHTs provides a uniform O(1/√d) approximation to Gaussian distributions for scalar quantization.
- The paper establishes formal guarantees for existing quantization methods (DRIVE and QUIC-FL) using the derived bounds.
- Three RHTs are necessary for effective decorrelation in Vector Quantization, addressing limitations of using only two.
- A linear-time check for input moments allows dynamic adaptation of RHT usage, improving efficiency.
Summary
This paper addresses the use of Randomized Hadamard Transforms (RHTs) as a preprocessing step in quantization techniques, which are crucial for various applications such as gradient compression and model weight quantization. The authors demonstrate that composing two RHTs on a d-dimensional input vector results in a marginal distribution that closely approximates a standard Gaussian distribution, achieving performance comparable to Uniform Random Rotations (URRs) in terms of both Kolmogorov and 1-Wasserstein distances. Furthermore, the paper establishes that while two RHTs suffice for scalar quantization, three RHTs are necessary for effective Vector Quantization (VQ) to ensure weak correlation among coordinates. The authors also propose a linear-time method to dynamically determine the number of RHTs needed based on the input's statistical properties, enhancing efficiency without compromising theoretical guarantees. This modular approach allows for easier application of their findings to existing quantization frameworks.
Methodology
The authors analyze the performance of RHTs by establishing bounds on the distribution of coordinates after applying one or more RHTs. They utilize Berry-Esseen inequalities to derive these bounds and apply them to existing quantization algorithms. The paper also introduces a method for dynamically assessing the input vector's properties to determine the optimal number of RHTs required.
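A sketch of one randomized Hadamard transform and the composition of two, using a standard fast Walsh-Hadamard implementation (d must be a power of two):

```python
# An RHT multiplies the input by random signs, then applies a normalized
# fast Walsh-Hadamard transform; composing two spreads even adversarially
# spiky inputs into near-Gaussian coordinates.
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(d log d)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))           # orthonormal scaling

def rht(x, rng):
    signs = rng.choice([-1.0, 1.0], size=len(x))
    return fwht(signs * x)

rng = np.random.default_rng(0)
d = 1024
x = np.zeros(d); x[0] = np.sqrt(d)       # adversarially spiky unit-scale input
y = rht(rht(x, rng), rng)                # two composed RHTs
print(y.std(), np.abs(y).max())          # coordinates look near-Gaussian
```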
Results
The study proves that two RHTs yield a distribution close to Gaussian for scalar quantization, matching the performance of URRs. For VQ, three RHTs are shown to effectively reduce coordinate covariance, ensuring expected error remains consistent with URR-based methods. The proposed linear-time check for input moments allows for runtime adjustments in RHT application, enhancing practical efficiency.
Implications
The findings have significant implications for improving quantization techniques in machine learning, particularly in federated learning and distributed systems, where efficient data representation is crucial. The ability to dynamically adjust the number of RHTs based on input characteristics can lead to faster and more efficient algorithms in various applications.
Optimal Counterfactual Search in Tree Ensembles: A Study Across Modeling and Solution Paradigms
Optimization
Interpretability
- CPCF provides a compact and efficient formulation for optimal counterfactual search in tree ensembles.
- The study reveals that no single optimization paradigm is superior across all scenarios; each has its strengths.
- CPCF outperforms existing methods in terms of scalability and performance across various datasets and ensemble types.
- The research emphasizes the importance of generating minimal and actionable counterfactuals to enhance trust in machine learning explanations.
Summary
This paper addresses the challenge of generating optimal counterfactual explanations for tree ensembles, which are crucial for providing interpretable recourse in various domains. The authors highlight that suboptimal counterfactuals can lead to unnecessary costs and uneven distribution of recommendations among users. To tackle this, they propose a novel constraint programming formulation, CPCF, which efficiently encodes numerical features as interval domains and retains discrete features in their native form. This approach allows for the exploration of multiple distance objectives without the need for continuous split-boundary search. The authors also compare CPCF against other mathematical programming paradigms, including a modified MaxSAT formulation and a mixed-integer linear programming (MILP) approach, across ten datasets and various tree ensemble types. Their empirical analysis reveals that CPCF outperforms the other methods overall, while also identifying specific strengths of each paradigm in different scenarios. The findings suggest that CPCF is the most versatile, MaxSAT excels with hard-voting ensembles, and MILP is competitive in settings with a moderate number of split levels.
Methodology
The authors introduce CPCF, a constraint programming formulation that encodes numerical features as interval domains and discrete features in finite-domain representations. They conduct an empirical comparison of CPCF with MaxSAT and MILP approaches, analyzing performance across multiple datasets and tree ensemble types.
Results
CPCF achieved the best overall performance in generating optimal counterfactuals, outperforming both MaxSAT and MILP methods. The analysis indicated that while CPCF is the most versatile, MaxSAT is particularly effective for hard-voting ensembles, and MILP remains competitive in specific settings.
Implications
The findings suggest that CPCF can significantly improve the quality of counterfactual explanations, which is essential for applications in sensitive areas such as credit scoring, healthcare, and public policy. This work can lead to more reliable and interpretable machine learning models, fostering greater trust among users.
Hyperbolic Concept Bottleneck Models
Interpretability
- HypCBM embeds concepts in hyperbolic space to better represent hierarchical relationships.
- The framework allows for sparse, hierarchy-aware concept activations without additional supervision.
- An adaptive scaling law is introduced for coherent user interventions across the concept tree.
- HypCBM rivals traditional Euclidean models trained on 20× more data in terms of accuracy and interpretability.
Summary
The paper introduces Hyperbolic Concept Bottleneck Models (HypCBM), a novel framework that enhances the interpretability of neural networks by embedding concepts in hyperbolic space rather than traditional Euclidean space. This approach addresses the limitations of existing Concept Bottleneck Models (CBMs), which treat concepts as independent dimensions, failing to capture their hierarchical relationships. HypCBM reformulates concept activation as asymmetric geometric containment, allowing for more natural and interpretable activations that reflect the semantic structure of concepts. The authors propose a hyperbolic entailment activation mechanism and an adaptive scaling law for hierarchical interventions, enabling coherent propagation of user corrections through the concept tree. Empirical results demonstrate that HypCBM outperforms existing post-hoc CBMs, achieving competitive accuracy with significantly less training data while improving robustness and hierarchical consistency.
Methodology
The authors developed a post-hoc framework that utilizes hyperbolic geometry to represent concepts in a structured manner. They introduced a hyperbolic entailment activation mechanism that maps image representations to concept activations based on geometric containment. Additionally, an adaptive scaling law was formulated to adjust the strictness of concept activation thresholds based on the hierarchical depth of concepts.
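The geometric primitives such a model builds on can be sketched directly; the cone test below is a crude illustration, not the paper's exact activation:

```python
# Poincare-ball distance plus a simple "is y inside the cone at x" angle
# test: general concepts sit near the origin, specific ones near the
# boundary, and containment-like relations show up as small cone angles.
import numpy as np

def poincare_dist(u, v):
    du = 1 - np.dot(u, u)
    dv = 1 - np.dot(v, v)
    delta = np.dot(u - v, u - v)
    return np.arccosh(1 + 2 * delta / (du * dv))

def cone_angle(x, y):
    """Angle at x between the ray x->y and the ray origin->x."""
    ray = y - x
    return np.arccos(np.dot(x, ray) /
                     (np.linalg.norm(x) * np.linalg.norm(ray) + 1e-12))

parent = np.array([0.3, 0.0])            # general concept, near the origin
child = np.array([0.7, 0.05])            # specific concept, near the boundary
print(poincare_dist(parent, child))      # hyperbolic distance
print(cone_angle(parent, child))         # small angle -> containment-like
```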
Results
HypCBM consistently outperformed existing post-hoc CBMs, achieving comparable accuracy to Euclidean models trained on significantly larger datasets. The model demonstrated improved data efficiency, robustness to input corruptions, and maintained a high level of hierarchical consistency in concept activations.
Implications
The proposed HypCBM framework has significant implications for interpretable machine learning, particularly in high-stakes domains such as healthcare and finance, where understanding model predictions is crucial. By enabling more accurate and interpretable models, HypCBM can facilitate better human-in-the-loop interactions and interventions, ultimately leading to more reliable AI systems.
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Large Language Models
Optimization
Efficient ML
- Introduction of sparse prefix caching for recurrent LLMs, which cuts recomputation latency by storing exact recurrent states at selected checkpoints.
- Formalization of the caching problem as a one-sided weighted k-median problem with an O(NM) dynamic program.
- Demonstrated improvements over traditional caching methods, particularly in scenarios with shared prefixes among requests.
- Validation of the method on real-world datasets, showing significant recomputation savings and performance enhancements.
Summary
This paper introduces a novel approach to prefix caching in the context of serving large language models (LLMs), particularly focusing on hybrid and recurrent architectures. Traditional prefix caching methods rely on dense per-token key/value reuse, which can be inefficient for recurrent layers that maintain a fixed-size hidden state. The authors propose a sparse prefix caching strategy that stores exact recurrent states at selected checkpoint positions, allowing for efficient recomputation from the deepest stored checkpoint when a cache hit occurs. This method is formalized as a checkpoint placement problem under a distribution of overlap depths, leading to an exact O(NM) dynamic programming solution. The authors demonstrate that their approach significantly improves performance in scenarios where requests share a common prefix, outperforming standard heuristics and fixed-budget baselines in terms of recomputation savings. The results indicate that the method is particularly effective when many requests share substantial but not identical prefixes, preserving exact outputs without altering the recurrent computation process. The paper validates its findings through experiments on real-world datasets, showcasing the advantages of the proposed caching strategy in practical applications.
Methodology
The authors formalize the sparse prefix caching problem as a checkpoint placement optimization under a distribution of overlap depths. They develop an exact dynamic programming solution to determine optimal checkpoint placements, leveraging empirical overlap distributions from previous requests. The methodology includes theoretical proofs regarding optimal spacing and suboptimality bounds based on distribution variations.
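A direct (unoptimized) dynamic program over checkpoint positions conveys the objective; the paper's O(NM) algorithm exploits further structure that this sketch omits:

```python
# Given a distribution p[d] over request overlap depths 0..N, place M
# checkpoints to minimize expected recomputation: the hit depth minus the
# deepest stored checkpoint at or before it (depth 0 is free).
from functools import lru_cache
import numpy as np

N, M = 12, 2
rng = np.random.default_rng(0)
p = rng.random(N + 1); p /= p.sum()      # toy overlap-depth distribution

def seg_cost(a, b):
    """Cost of depths in [a, b) served by a checkpoint stored at depth a."""
    return sum(p[d] * (d - a) for d in range(a, min(b, N + 1)))

@lru_cache(maxsize=None)
def dp(a, m):
    """Min cost for depths >= a, checkpoint at a, m checkpoints left."""
    if m == 0:
        return seg_cost(a, N + 1), ()
    best, best_plan = seg_cost(a, N + 1), ()
    for b in range(a + 1, N + 1):        # position of the next checkpoint
        tail, plan = dp(b, m - 1)
        cand = seg_cost(a, b) + tail
        if cand < best:
            best, best_plan = cand, (b,) + plan
    return best, best_plan

cost, checkpoints = dp(0, M)
print(f"expected recomputation: {cost:.3f}, checkpoints at {checkpoints}")
```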
Results
The proposed sparse prefix caching method consistently outperforms fixed-budget baselines and standard heuristics on datasets such as QuALITY and System Prompts. It achieves significant recomputation savings, particularly at low checkpoint budgets where overlap distributions are non-uniform. The method yields prototype wall-clock speedups in real-world applications, demonstrating its practical effectiveness.
Implications
This research has significant implications for optimizing the serving of hybrid and recurrent LLMs, particularly in applications where requests frequently share common prefixes. The proposed caching strategy can enhance the efficiency of LLM serving systems, reduce latency, and improve resource utilization, making it valuable for real-time applications such as chatbots and interactive AI systems.
Geometry-Aware Simplicial Message Passing
Graph Learning
Theory
- Introduction of the GSWL test, which incorporates geometry into simplicial message passing.
- Establishment of bounds on the expressivity of geometry-aware simplicial message passing schemes.
- Use of the Euler Characteristic Transform (ECT) as a complete invariant for geometric simplicial complexes.
- Experimental validation showing improved performance of geometry-aware models over traditional combinatorial models.
Summary
This paper introduces the Geometric Simplicial Weisfeiler–Lehman (GSWL) test, which enhances the expressivity of message passing networks by incorporating geometric information from vertex coordinates into color refinement for geometric simplicial complexes. The authors demonstrate that the expressivity of geometry-aware simplicial message passing schemes is bounded by GSWL and can match it for fixed finite families of geometric simplicial complexes. They also leverage the Euler Characteristic Transform (ECT) as a complete invariant for geometric simplicial complexes, establishing a framework for geometric expressivity characterization and approximation. Experimental validation on synthetic and mesh datasets reveals a clear hierarchy of performance from combinatorial to geometry-aware models, confirming the theoretical predictions and the role of ECT in this context.
Methodology
The authors develop the GSWL test by refining vertex colors based on boundary and coboundary adjacencies, integrating vertex coordinates into the process. They utilize the ECT to characterize the expressivity of simplicial message passing for embedded simplicial complexes, demonstrating theoretical bounds and approximations through rigorous proofs and stability results.
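One round of WL-style refinement with geometry hashed into the seed colors can be sketched on a plain graph; the paper's version runs over simplicial boundary and coboundary adjacencies instead:

```python
# Classic WL color refinement: each new color hashes the old color together
# with the multiset of neighbor colors. Seeding the colors with vertex
# coordinates is the basic move that makes the refinement geometry-aware.
def wl_refine(colors, neighbors, rounds=3):
    for _ in range(rounds):
        colors = {v: hash((colors[v],
                           tuple(sorted(colors[u] for u in nbrs))))
                  for v, nbrs in neighbors.items()}
    return colors

coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0)}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}         # a triangle
init = {v: hash(coords[v]) for v in coords}            # geometry in the seed
print(wl_refine(init, neighbors))
```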
Results
The results indicate that geometry-aware simplicial message passing can achieve expressivity that matches the GSWL test for fixed finite families of geometric simplicial complexes. The experiments validate the theoretical framework, showing that geometry-aware models outperform traditional combinatorial models across various datasets.
Implications
This work has significant implications for the design of more expressive neural network architectures that can leverage geometric information in higher-order structures, potentially enhancing applications in fields such as computer graphics, shape analysis, and topological data analysis.
Criticality and Saturation in Orthogonal Neural Networks
Theory
- Derivation of recursion relations for multiple tensors under orthogonal initializations.
- Extension of Feynman diagram techniques to simplify computations for orthogonal networks.
- Empirical validation of theoretical results through numerical simulations.
- Demonstration of stability in finite-width tensors for networks initialized with orthogonal weights.
Summary
This paper investigates the advantages of orthogonal weight initialization in neural networks, particularly focusing on the stability of training dynamics in finite-width networks. The authors derive explicit layer-wise recursion relations for the tensors that govern network statistics under orthogonal initializations, extending previous work that primarily addressed linear networks or relied on mean-field theory. They introduce a novel framework using Feynman diagrams to simplify the computation of these recursions, demonstrating that the derived relations reproduce the stability of finite-width tensors observed empirically. The results indicate that orthogonal initializations lead to improved performance and stability in nonlinear networks, addressing a significant gap in the existing literature. The authors validate their theoretical findings through numerical solutions and show that these solutions align well with Monte-Carlo estimates from network ensembles, particularly for networks using the tanh activation function.
Methodology
The authors employed a theoretical framework that incorporates finite-width corrections to neural network statistics, deriving layer-wise recursion relations for various tensors. They utilized Feynman diagrams to facilitate the computation of these relations and validated their theoretical predictions through numerical simulations and Monte-Carlo estimates.
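The standard recipe for drawing such orthogonal weights (QR decomposition with a sign fix) is short:

```python
# QR-decompose a Gaussian matrix and fix the signs so the result is
# uniformly (Haar) distributed over the orthogonal group.
import numpy as np

def orthogonal_init(n, rng):
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))       # column sign fix -> Haar measure

rng = np.random.default_rng(0)
W = orthogonal_init(256, rng)
print(np.allclose(W @ W.T, np.eye(256), atol=1e-10))  # True: W is orthogonal
x = rng.normal(size=256)
print(np.linalg.norm(W @ x) / np.linalg.norm(x))      # 1.0: norms preserved
```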
Results
The paper successfully derives recursion relations for ten fundamental tensors and demonstrates that these relations reproduce the stability of finite-width tensors observed in practice. The authors also provide analytical solutions that support the saturation behavior seen in empirical results, particularly for networks with tanh activation functions.
Implications
The findings suggest that orthogonal weight initialization can significantly enhance the training stability and performance of neural networks, which could lead to more efficient training protocols and better generalization in various applications of deep learning.
Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
Reinforcement Learning
Theory
Optimization
- Introduces Approximate Next Policy Sampling (ANPS) as a method to align training distribution with future policy state visitation.
- Presents Stable Value Approximate Policy Iteration (SV-API) as a lightweight modification to standard policy iteration algorithms.
- Demonstrates that SV-API can achieve better or comparable performance to existing methods while allowing for larger policy updates.
- Establishes theoretical bounds that highlight the importance of the next policy's distribution in ensuring policy improvement.
Summary
This paper addresses a fundamental challenge in reinforcement learning (RL) known as the 'chicken-and-egg' problem, where accurate value function estimates are necessary for safe policy improvements. Traditional methods, such as conservative policy updates, limit policy changes to ensure that the state visitation distribution remains similar to the training data, which can restrict the effectiveness of policy updates. The authors propose a novel approach called Approximate Next Policy Sampling (ANPS), which shifts the training data to align with the future policy's state distribution instead of constraining the policy update. This method aims to enhance the value function estimate at states crucial for the policy update. The authors introduce Stable Value Approximate Policy Iteration (SV-API), a modification of standard approximate policy iteration algorithms that keeps the target policy fixed while an updated behavioral policy gathers relevant experience. The policy update is only committed once value estimates stabilize. The empirical results demonstrate that SV-API, when applied to existing algorithms like Proximal Policy Optimization (PPO), achieves comparable or superior performance on high-dimensional discrete (Atari) and continuous control tasks while allowing for larger policy updates. This work illustrates the potential of ANPS as a viable alternative to conservative updates in deep RL.
Methodology
The authors developed ANPS to modify the training distribution to match the next policy's state visitation. They introduced SV-API, which holds the target policy fixed while an iteratively updated behavioral policy collects relevant experience. The algorithm commits to a new policy only after ensuring that value estimates have stabilized, thus addressing the chicken-and-egg problem without constraining policy updates.
Results
The empirical evaluation of SV-API, particularly in its application to PPO (resulting in SV-PPO), showed that it matches or exceeds the performance of existing RL methods on both Atari and continuous control benchmarks. The approach allows for significantly larger target policy updates, demonstrating its effectiveness in overcoming the limitations of conservative policy updates.
Implications
The findings suggest that ANPS could lead to more efficient training in deep RL by enabling larger policy updates without sacrificing safety. This could enhance the performance of RL algorithms in complex environments, potentially broadening their applicability in real-world scenarios.
Trade-off Functions for DP-SGD with Subsampling based on Random Shuffling: Tight Upper and Lower Bounds
Optimization
Federated Learning
Theory
- Derivation of tight closed-form lower bounds for DP-SGD with random shuffling.
- Introduction of a new proof technique based on a generalized law of large numbers.
- Demonstration of parameter settings that achieve meaningful differential privacy in practical scenarios.
- Comparison of results with previous analyses focused on Poisson subsampling.
Summary
This paper presents a comprehensive analysis of the trade-off function for Differentially Private Stochastic Gradient Descent (DP-SGD) using subsampling based on random shuffling, framed within the f-DP (f-differential privacy) framework. The authors derive tight closed-form bounds for the trade-off function, particularly in the regime where the noise multiplier satisfies σ ≥ √(3/ln M), where M is the number of rounds in a single epoch. Unlike previous analyses that focused on Poisson subsampling, which resulted in non-transparent implicit formulas, this work provides clear and interpretable bounds. The authors utilize the Berry-Esseen theorem to establish these bounds, demonstrating that for specific parameter settings, a dataset of size approximately 1.14 × 10^7 and M ≈ 1.14 × 10^6 rounds can achieve meaningful differential privacy. Additionally, they introduce a new proof technique based on a generalized law of large numbers, leading to an asymptotic result for the trade-off function as the number of epochs increases. The findings highlight the limitations of noise levels below a certain threshold and provide a clearer understanding of the privacy guarantees of shuffled DP-SGD, making the results particularly relevant for applications in federated learning.
Methodology
The authors employ the Berry-Esseen theorem to derive tight closed-form bounds for the trade-off function of DP-SGD with random shuffling. They also introduce a new proof technique that generalizes the law of large numbers to establish asymptotic results for the trade-off function as the number of epochs increases. The analysis is conducted within the f-DP framework, allowing for a precise characterization of privacy guarantees.
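For reference, the Gaussian trade-off function that f-DP statements of this kind are measured against can be computed directly (assuming scipy):

```python
# G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu). A curve near the diagonal
# f(alpha) = 1 - alpha means the adversary can do little better than
# random guessing, i.e. near-perfect privacy.
import numpy as np
from scipy.stats import norm

def gaussian_tradeoff(alpha, mu):
    return norm.cdf(norm.ppf(1 - alpha) - mu)

alpha = np.linspace(0.01, 0.99, 5)
for mu in (0.1, 1.0, 3.0):
    print(mu, np.round(gaussian_tradeoff(alpha, mu), 3))
# small mu: close to 1 - alpha (strong privacy); large mu: weak privacy
```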
Results
The paper establishes that for a single epoch with M rounds and a noise multiplier σ ≥ √(3/ln M), the trade-off function satisfies a lower bound that approaches the ideal random guessing diagonal. The results indicate that datasets of size approximately 1.14 × 10^7 and M ≈ 1.14 × 10^6 rounds are sufficient to achieve meaningful differential privacy. The authors also provide asymptotic results indicating that as the number of epochs increases, the trade-off function converges uniformly to the ideal privacy guarantee.
Implications
The findings have significant implications for the deployment of DP-SGD in practical machine learning applications, particularly in federated learning scenarios where privacy is paramount. The clear and interpretable bounds derived in this work can guide practitioners in selecting appropriate parameters to ensure robust privacy guarantees.
Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant
Theory
Optimization
Computer Vision
- Cumulative-goodness free-riding is identified as a significant issue in FF networks.
- Three local remedies are proposed to mitigate the effects of free-riding without backpropagation.
- Layer-separation improvements are substantial, yet they do not translate into significant accuracy gains.
- Architecture and augmentation choices have a more pronounced effect on accuracy than the proposed training modifications.
Summary
This paper investigates the phenomenon of cumulative-goodness free-riding in Forward-Forward (FF) networks, where later layers can inherit tasks that earlier layers have partially separated. The authors formalize this issue as layer free-riding, demonstrating that under the softplus FF criterion, the class-discrimination gradient diminishes exponentially with the positive margin accumulated by preceding layers. They propose three local remedies—per-block, hardness-gated, and depth-scaled approaches—that aim to restore current-layer separation without relying on backpropagated gradients. Experiments on CIFAR-10 and CIFAR-100 show significant improvements in layer-separation statistics, with gains of 4×–45× in deeper layers, while accuracy remains largely unchanged (less than one percentage point) for non-degenerate training procedures. The study also includes a cross-dataset evaluation on Tiny ImageNet, confirming the qualitative gap between layer-health diagnostics and final accuracy. Calibration experiments reveal that architectural choices and data augmentation have a more substantial impact on accuracy than the training-rule modifications explored in this work. The findings indicate that cumulative free-riding is a real and repairable optimization issue, but it is not the primary factor limiting accuracy in the FF training context.
Methodology
The authors formalized the concept of cumulative-goodness free-riding and tested three local remedies to improve layer separation in FF networks. They conducted experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet to evaluate the effectiveness of these remedies and analyzed the relationship between layer health and final accuracy.
Results
The proposed remedies led to significant improvements in layer-separation metrics, achieving 4×–45× gains in deeper layers. However, these improvements resulted in less than one percentage point change in accuracy across non-degenerate training procedures. The study also highlighted that architectural and augmentation choices significantly influenced accuracy, overshadowing the effects of the training-rule modifications.
Implications
The findings suggest that while cumulative-goodness free-riding can be addressed, it is not the main bottleneck for accuracy in FF networks. Future research should focus on optimizing objectives, representations, and inference rules that leverage layer-wise evidence more effectively.
Topological Signatures of Grokking
Theory
Interpretability
- Identification of a robust topological signature of grokking using persistent homology.
- Geometric and topological interpretation of grokking related to emergent structure in representation space.
- Consistent behavior of topological signatures across different model architectures.
- Topological transitions are linked to generalization, not memorization.
Read more
Topological Signatures of Grokking
Summary
This paper investigates the phenomenon of grokking in neural networks through the application of persistent homology, a method from topological data analysis. Grokking is characterized by a model's transition from memorizing training data to achieving near-perfect generalization after extended training. The authors analyze point clouds derived from embedding matrices of various models trained on modular arithmetic tasks with different primes. They identify a significant topological signature of grokking, marked by a sharp increase in the maximum and total persistence of first homology (H1). Persistence diagrams reveal the emergence of a dominant long-lived topological feature alongside increasingly structured secondary features, which reflect the cyclic nature of the task. The study demonstrates that persistent homology provides a comprehensive geometric and topological characterization of representation learning, capturing both local and global structures. The authors conduct ablation studies to show that these topological transitions correlate with generalization rather than mere memorization. The findings suggest that persistent homology can serve as a principled framework for understanding how neural networks internalize latent structures during training.
Methodology
The authors apply persistent homology to analyze point clouds derived from neural representations during training on modular arithmetic tasks. They utilize the Vietoris–Rips filtration to construct simplicial complexes and generate persistence diagrams that capture the emergence of topological features over different scales.
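A minimal version of this pipeline can be assembled with the open-source ripser.py package. The snippet below is a sketch under assumptions: it takes a point cloud X of embedding rows and computes the H1 statistics the authors track (maximum and total persistence), but the paper's exact preprocessing, filtration parameters, and distance metric are not specified in this summary.

```python
# Sketch of the H1 persistence statistics tracked during training
# (a minimal reconstruction; the paper's exact preprocessing is not
# specified in this summary). Requires: pip install ripser numpy
import numpy as np
from ripser import ripser

def h1_persistence_stats(X: np.ndarray) -> tuple[float, float]:
    """Max and total persistence of first homology for a point cloud X."""
    # Vietoris-Rips filtration up to dimension 1 (components + loops).
    dgms = ripser(X, maxdim=1)["dgms"]
    h1 = dgms[1]                      # (birth, death) pairs for H1
    if len(h1) == 0:
        return 0.0, 0.0
    lifetimes = h1[:, 1] - h1[:, 0]
    lifetimes = lifetimes[np.isfinite(lifetimes)]
    return float(lifetimes.max()), float(lifetimes.sum())

# Example: a noisy circle produces one dominant long-lived 1-cycle,
# the qualitative signature the paper associates with grokking.
theta = np.random.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.randn(200, 2)
print(h1_persistence_stats(X))
```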
Results
The study finds a clear topological signature of grokking, characterized by a significant increase in the maximum and total persistence of first homology (H1) during the generalization phase. Persistence diagrams indicate the formation of a dominant long-lived 1-cycle and more structured secondary features. The results are consistent across various model architectures, including transformers and multilayer perceptrons, and indicate that the observed topological signatures are associated with generalization rather than memorization.
Implications
The findings suggest that persistent homology can be a valuable tool for analyzing neural network training dynamics, particularly in understanding how models internalize complex structures. This could lead to improved interpretability of neural networks and insights into enhancing generalization capabilities in machine learning models.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
NLP
Large Language Models
Efficient ML
- UniPool replaces per-layer expert ownership with a globally shared expert pool, allowing for cross-layer expert reuse.
- Introduces a pool-level auxiliary loss to balance expert utilization across the shared pool.
- Employs NormRouter for stable and effective routing into the global expert budget.
- Demonstrates that reduced-pool variants can match or outperform traditional layer-wise MoE models.
Read more
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
Summary
The paper introduces UniPool, a novel Mixture-of-Experts (MoE) architecture that utilizes a globally shared expert pool instead of the traditional per-layer expert allocation. This approach addresses the inefficiencies of conventional MoE designs, which tie expert capacity linearly to the depth of the model, resulting in redundant expert parameters across layers. The authors demonstrate that many experts in deeper layers are often redundant, as shown through routing-randomization experiments. UniPool allows for cross-layer expert reuse while maintaining independent routing for each layer. To ensure balanced training and expert utilization, the authors propose a pool-level auxiliary loss and employ NormRouter for stable routing into the shared expert pool. The results show that UniPool consistently outperforms vanilla MoE models across various scales, achieving lower validation loss and perplexity while using a reduced expert-parameter budget. This indicates that expert parameters can grow sublinearly with depth, enhancing efficiency and effectiveness in large language models.
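Architecturally, the core idea reduces to one shared expert list indexed by per-layer routers. The PyTorch sketch below is a hedged reconstruction from this summary: class names such as SharedExpertPool and PooledMoELayer, the expert MLP shape, and the top-k details are illustrative assumptions; the logit normalization merely stands in for the paper's NormRouter, and the pool-level auxiliary loss is omitted.

```python
# Hedged sketch of a globally shared expert pool with per-layer routers
# (reconstructed from the summary; names, expert shape, and top-k details
# are illustrative assumptions, not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertPool(nn.Module):
    """One global pool of expert MLPs, reused by every MoE layer."""
    def __init__(self, n_experts: int, d_model: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

class PooledMoELayer(nn.Module):
    """A layer-local router selecting top-k experts from the shared pool."""
    def __init__(self, pool: SharedExpertPool, d_model: int, k: int = 2):
        super().__init__()
        self.pool, self.k = pool, k
        self.router = nn.Linear(d_model, len(pool.experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). L2-normalized logits stand in for the
        # summary's NormRouter stabilization (our assumption).
        logits = F.normalize(self.router(x), dim=-1)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.pool.experts[int(e)](x[mask])
        return out

pool = SharedExpertPool(n_experts=8, d_model=64, d_ff=256)
layers = [PooledMoELayer(pool, d_model=64) for _ in range(4)]  # 4 layers, 1 pool
x = torch.randn(10, 64)
for layer in layers:
    x = x + layer(x)
```

The point of the sketch is that the four layers above share one parameter budget of eight experts, whereas a vanilla MoE would allocate eight experts per layer.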
Methodology
The authors conducted routing-randomization experiments to analyze expert redundancy in existing MoE architectures. They developed UniPool, which features a shared expert pool accessed by independent per-layer routers. A pool-level auxiliary loss was introduced to balance expert utilization, and NormRouter was adopted to ensure stable routing into the shared pool.
Results
Across five model scales trained on 30 billion tokens, UniPool achieved consistent improvements in validation loss and perplexity compared to vanilla MoE baselines. The reduced-pool variants of UniPool, using only 41.6% to 66.7% of the vanilla expert-parameter budget, matched or outperformed traditional layer-wise MoE models.
Implications
The findings suggest that MoE architectures can be designed more efficiently by sharing expert resources across layers, potentially leading to more scalable and effective large language models. This approach could influence future designs of MoE systems in various applications, including natural language processing and beyond.
MinMax Recurrent Neural Cascades
Theory
NLP
Efficient ML
- MinMax RNCs utilize MinMax algebra to avoid vanishing and exploding gradients.
- The architecture can express all regular languages and is efficient in both sequential and parallel evaluations.
- Empirical results show superior performance on synthetic tasks compared to state-of-the-art RNNs.
- A large-scale MinMax RNC demonstrated competitive performance in next-token prediction tasks.
Read more
MinMax Recurrent Neural Cascades
Summary
This paper introduces MinMax Recurrent Neural Cascades (RNCs), a novel architecture that leverages MinMax algebra to build recurrent neural networks resilient to the vanishing and exploding gradient problems that plague traditional recurrent neural networks (RNNs). The author demonstrates that MinMax RNCs possess significant theoretical advantages, including the ability to express all regular languages, efficient parallel evaluation, and bounded state and activation values. The architecture supports both sequential and parallel processing, with runtime logarithmic in input length under parallel evaluation given sufficient computational resources. Empirical evaluations show that MinMax RNCs perfectly solve various synthetic tasks and outperform existing state-of-the-art models. Additionally, a MinMax RNC with 127 million parameters was trained for next-token prediction, achieving results comparable to GPT-2 and indicating its potential for real-world applications. The paper provides theoretical proofs for the expressivity, complexity, stability, and gradient properties of MinMax RNCs, establishing a solid foundation for future research and applications in sequence processing.
Methodology
The paper employs MinMax algebra to define a recurrence relation for RNCs, replacing traditional addition and multiplication with max and min operations, respectively. The author analyzes the theoretical properties of the architecture, including expressivity, complexity, and stability, and conducts empirical evaluations on various tasks to validate the practical performance of MinMax RNCs.
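The recurrence itself is compact. In MinMax algebra the matrix-vector product replaces the sum with max and the product with min, i.e. (W ⊗ h)_i = max_j min(W_ij, h_j). The sketch below implements that product; how the input enters the state update, and any cascade structure or gating in the actual RNC architecture, are our assumptions rather than the paper's design.

```python
# Sketch of a MinMax recurrence (reconstructed from the summary: matrix
# "multiplication" with sum replaced by max and product by min; how the
# input enters the state update is our assumption, not the paper's design).
import numpy as np

def maxmin_matvec(W: np.ndarray, h: np.ndarray) -> np.ndarray:
    """(W (x) h)_i = max_j min(W_ij, h_j) -- the MinMax-algebra product."""
    return np.max(np.minimum(W, h[None, :]), axis=1)

def minmax_step(W: np.ndarray, h: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Combine recurrent state and input with max; values stay bounded
    # because max/min select rather than accumulate.
    return np.maximum(maxmin_matvec(W, h), x)

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, (4, 4))
h = rng.uniform(-1, 1, 4)
for x in rng.uniform(-1, 1, (6, 4)):   # a length-6 input sequence
    h = minmax_step(W, h, x)
print(h)  # state remains within the range of the inputs and weights
```

Because max and min merely select one of their arguments, the local Jacobian entries are 0 or 1, which is the intuition behind the bounded-state and gradient-stability claims.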
Results
MinMax RNCs were shown to perfectly solve star-free tasks and outperform existing models like Mamba, mLSTM, and sLSTM. The architecture achieved competitive performance in next-token prediction with a model of 127 million parameters, demonstrating its effectiveness and scalability.
Implications
The findings suggest that MinMax RNCs could be a viable alternative to traditional RNNs and Transformers for sequence processing tasks, particularly in scenarios where gradient stability is crucial. Their ability to handle long sequences efficiently opens up possibilities for applications in natural language processing and other sequential data tasks.
FedeKD: Energy-Based Gating for Robust Federated Knowledge Distillation under Heterogeneous Settings
Federated Learning
- FedeKD introduces a reliability-aware FKD framework that estimates sample-wise trust in knowledge transfer.
- The framework employs an energy-based gating mechanism to down-weight unreliable knowledge during model updates.
- Extensive experiments show significant reductions in negative transfer while preserving predictive performance.
- FedeKD operates without the need for additional public datasets, enhancing privacy in federated learning.
Read more
FedeKD: Energy-Based Gating for Robust Federated Knowledge Distillation under Heterogeneous Settings
Summary
The paper introduces FedeKD, a novel framework for Federated Knowledge Distillation (FKD) that addresses the challenges posed by heterogeneous environments in federated learning (FL). Traditional FKD methods often assume uniform reliability of transferred knowledge, which can lead to negative transfer when data distributions vary significantly across clients. FedeKD incorporates a reliability-aware mechanism that estimates sample-wise trust in knowledge transfer without relying on external public datasets. Each client utilizes a high-capacity private model for local learning and a lightweight proxy model for knowledge exchange. The framework employs an energy-based gating mechanism that translates the disagreement between private and proxy models into trust weights, allowing for adaptive weighting of knowledge transfer. This approach enhances the robustness of knowledge sharing and mitigates negative transfer by prioritizing reliable samples during model updates. The authors conducted extensive experiments on six real-world datasets, demonstrating that FedeKD significantly improves both average-case and worst-case performance in heterogeneous settings while maintaining strong predictive accuracy.
Methodology
FedeKD operates in two stages: a forward stage where each client distills knowledge from its private model into a proxy model, which is then aggregated on the server to form a global proxy, and a backward stage where the global proxy guides updates of the private model through an energy-gated mechanism. This mechanism uses task-specific private-proxy disagreement to generate continuous sample-wise trust weights, allowing for adaptive knowledge transfer.
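The gating step can be made concrete, with the caveat that this summary does not pin down FedeKD's exact energy function or gate. The sketch below uses the common free-energy score E(z) = −logsumexp(z) over logits and a sigmoid gate on the absolute private-proxy energy gap; both choices are our assumptions.

```python
# Hedged sketch of energy-based trust gating (the exact energy function
# and gate in FedeKD are not specified in this summary; the free-energy
# score and sigmoid gate below are our assumptions).
import torch
import torch.nn.functional as F

def energy(logits: torch.Tensor) -> torch.Tensor:
    """Free-energy score per sample: lower energy ~ a more confident model."""
    return -torch.logsumexp(logits, dim=-1)

def trust_weights(private_logits, proxy_logits, temperature: float = 1.0):
    # Large private-proxy disagreement in energy -> low trust in the
    # transferred knowledge for that sample.
    gap = (energy(private_logits) - energy(proxy_logits)).abs()
    return torch.sigmoid(-gap / temperature)

def gated_distillation_loss(private_logits, proxy_logits):
    w = trust_weights(private_logits, proxy_logits).detach()
    kl = F.kl_div(
        F.log_softmax(private_logits, dim=-1),
        F.softmax(proxy_logits, dim=-1),
        reduction="none",
    ).sum(dim=-1)
    return (w * kl).mean()  # unreliable samples are down-weighted

private_logits = torch.randn(8, 10)
proxy_logits = torch.randn(8, 10)
print(gated_distillation_loss(private_logits, proxy_logits))
```

Detaching the trust weights keeps the gate from being gamed by the gradient, one plausible reading of how continuous sample-wise weights could be used safely in the backward stage.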
Results
The experiments conducted on six real-world datasets demonstrated that FedeKD significantly reduces negative transfer in heterogeneous federated learning environments. The framework maintained strong predictive performance across various scenarios, outperforming existing FKD methods that do not account for knowledge reliability.
Implications
FedeKD has potential applications in privacy-sensitive federated learning scenarios, particularly in domains where data heterogeneity is prevalent. The framework can enhance the reliability of knowledge sharing among clients, making it suitable for various applications in healthcare, finance, and other fields requiring collaborative learning without data sharing.