@pszemraj
Created April 4, 2025 14:39
deep research report by gpt-4.5

Alternate Attention Mechanisms for Sequence Modeling (2023–2025)

Transformer-style self-attention has been central to recent advances in language modeling, but its $\mathcal{O}(L^2)$ complexity (for sequence length $L$) motivates research into more efficient alternate attention mechanisms. This report surveys state-of-the-art methods from 2023–2025 that replace or augment standard self-attention in language sequence models. We organize methods by broad families – from linear approximations and sparsity-based variants to convolutional, state-space, and recurrent mechanisms – outlining each method’s motivation, technical formulation, empirical performance on language tasks, and efficiency characteristics.

Contents:

  • Introduction
  • Linear & Kernel-Based Attention Approximations
  • Sparse and Memory-Augmented Attention Variants
  • Convolution-Based & Implicit Attention Mechanisms
  • State-Space Models for Sequence Modeling
  • Recurrence and Gating: RNN-Inspired Approaches
  • Comparison of Approaches

Introduction

Transformer self-attention enables powerful context mixing but at quadratic cost in sequence length. This makes long-context processing memory-intensive and slow, prompting a search for alternatives that scale more efficiently. Early subquadratic approximations (e.g. low-rank projections and sparse attention patterns) often had to be hybridized with some full attention layers to maintain performance, underscoring the challenge ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). Recent efforts (2023–2025) have produced attention replacements that close the quality gap with Transformers without any standard attention layers ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). Broadly, these approaches aim to retain the transformer's expressive power (e.g. content-based token interactions) while improving efficiency in one or more ways:

  • Linearizing or approximating the softmax attention computation to achieve linear time/space complexity.
  • Sparsifying attention connections or using hierarchical/segment-based processing to reduce complexity.
  • Replacing attention with implicit long-range convolution or Fourier transforms, injecting strong inductive biases for sequence structure.
  • Employing state-space models or RNN-style recurrences with gating, to capture long dependencies with constant per-step costs and parallelizable training.
  • Augmenting models with memory or retrieval so that not all information must be stored in the attention mechanism.

Each family of methods balances pros and cons in terms of modeling capacity, training difficulty, and hardware utilization. We will survey representative methods in each category, highlighting technical details (often including key equations or architectural changes) and empirical results on language modeling and related tasks. Efficiency aspects – such as asymptotic memory usage, throughput on long sequences, and scalability to long contexts or large model sizes – are emphasized for each approach. Where useful, we include diagrams or pseudocode to illustrate how these mechanisms operate.

Linear & Kernel-Based Attention Approximations

One line of research replaces the quadratic attention pattern with linear computations by leveraging kernel approximations or low-rank projections. The goal is to approximate the $L\times L$ attention matrix with structures that can be computed in $\mathcal{O}(L)$ or $\mathcal{O}(L \log L)$ time. These approaches maintain the general form of “query-key-value” attention but modify how attention scores are computed and applied: rather than explicitly computing $A = \mathrm{softmax}(QK^T)$ (which is $L\times L$), they factor or approximate this operation.

  • Kernel Feature Maps (Linear Transformers): One approach is to find a feature map $\phi(\cdot)$ such that the softmax attention $A_{ij} = \frac{\exp(q_i \cdot k_j)}{\sum_n \exp(q_i \cdot k_n)}$ can be approximated by a dot-product of transformed queries and keys: $\exp(q\cdot k) \approx \phi(q)^T \phi(k)$. If such a $\phi$ exists (even in random-feature form), the attention output $A V \approx (\phi(Q)\phi(K)^T) V$ can be re-associated as $\phi(Q)(\phi(K)^T V)$, so the $L\times L$ matrix is never formed and the computation reduces to a linear pass over sequence positions (a minimal sketch appears after this list). Linear Transformer models (e.g. Katharopoulos et al. 2020) used positive-valued feature maps (like ReLU or exponential kernels) to this end, enabling recurrent computation of attention outputs. However, these methods often struggled to preserve the full modeling power of softmax attention; in practice their performance on language modeling lagged behind standard Transformers. For example, linear attention models have difficulty encoding positional order as effectively, leading to worse perplexities on large-scale text.

  • Low-Rank Projections (Linformer): Another strategy is the Linformer (Wang et al. 2020), which projects $K$ and $V$ along the sequence-length dimension with learned $L\times r$ projection matrices (with $r \ll L$), under the assumption that the attention matrix is approximately low-rank. This yields attention approximations in $\mathcal{O}(Lr)$ time. Linformer maintains quality close to a standard Transformer at moderate compression (e.g. $r$ of a few hundred) on masked-language-model pretraining and downstream classification, but extremely low $r$ degrades accuracy, and on very long contexts or more complex language understanding some loss is observed if $r$ is too small.

  • Random Features (Performer): Performers (Choromanski et al. 2021) approximate the softmax kernel with random feature maps (the FAVOR+ mechanism, based on positive orthogonal random features) that come with variance guarantees. Mapping queries and keys through these features gives an unbiased approximation of softmax attention in roughly $\mathcal{O}(L r d)$ time for $r$ random features and head dimension $d$. This allows probabilistic error bounds and scaling to long sequences. On reasonably sized tasks (e.g. text8 or smaller language models), Performers can match Transformer accuracy, but on larger benchmarks they too showed some quality gap, often necessitating hybrid models (using some exact attention layers) ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models).

  • Nyström and Other Approximations: The Nyströmformer (Xiong et al. 2021) uses the Nyström method to approximate the attention matrix via a subset of landmark points, also achieving $\mathcal{O}(L)$ complexity by sampling keys/queries. Similarly, methods using block-wise clustering or iterative averaging of attention (such as the recurrent fast weight approach by Schlag et al. 2021) recast attention as an RNN with outer-product updates (fast weight memory) (Going Beyond Linear Transformers with Recurrent Fast Weight...) (Linear Transformers Are Secretly Fast Weight Programmers - alphaXiv). These methods can handle longer inputs than vanilla attention and perform solidly on algorithmic or short text tasks, but pure approximations have not consistently matched Transformer state of the art on open-ended language modeling. As one 2023 survey noted, “linear attention struggles to effectively encode position information, rendering the models less performant” than Transformers. Thus, while linear approximations drastically cut complexity, they often sacrifice some accuracy on challenging language benchmarks. Recent work like DeltaNet revisits linear attention with an improved training algorithm (the “delta rule”) and gating, showing promise at smaller scales ([PDF] Parallelizing Linear Transformers with the Delta Rule over ... - arXiv), but scaling these to competitive large language models is still an ongoing challenge.

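To make the re-association above concrete, here is a minimal NumPy sketch contrasting quadratic softmax attention with a kernelized linear variant. The $\mathrm{elu}(x)+1$ feature map follows the linear Transformer line of work, but the shapes, scaling, and lack of causal masking are simplifications chosen for this example, not any particular paper's implementation. The two functions are not numerically equivalent (the feature map only approximates the exponential kernel), but the linear version never forms an $L\times L$ matrix and keeps only running (KV, Z) statistics.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: explicitly forms the (L, L) score matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (L, L)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # (L, d_v)

def linear_attention(Q, K, V, phi=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    """Kernelized attention: phi(Q) (phi(K)^T V), never forming an (L, L) matrix.
    phi here is elu(x) + 1, a positive feature map used by linear Transformers
    (feature dimension r equals d in this toy example)."""
    Qf, Kf = phi(Q), phi(K)                           # (L, r)
    KV = Kf.T @ V                                     # (r, d_v)  -- O(L * r * d_v)
    Z = Kf.sum(axis=0)                                # (r,) normalizer statistics
    return (Qf @ KV) / (Qf @ Z)[:, None]              # (L, d_v)

L, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```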
Empirical Performance: In summary, linear and kernel-based approximations remain attractive for efficiency – some enable recurrent formulation for autoregressive generation with constant memory – but pure implementations saw limited adoption in 2023–2024 for state-of-the-art LLMs due to the performance gap. They laid important groundwork, however, and influenced later designs (e.g. some RNN-inspired models build on linear attention ideas). For instance, the RWKV and RetNet models discussed later both derive their recurrent update rules from linearized attention formulations. In those models, enhancements like gating and multi-scale parameters are added to overcome the weaknesses of basic linear attention.

Efficiency: These approaches achieve time and memory linear in sequence length, enabling training with very long sequences (e.g. >16K tokens) that would be infeasible with vanilla attention. Memory usage per token is constant rather than growing with sequence length. This makes them appealing for streaming or long-document tasks. Some approximations (Performers) can be implemented on GPU/TPU to run faster than softmax attention for moderately long $L$ (when $L$ exceeds a few thousand). That said, efficient implementations can be non-trivial – e.g. one must avoid explicitly materializing $L\times L$ kernels even in intermediate steps. Exact-attention kernels like FlashAttention (Dao et al. 2022) have somewhat reduced the need for approximation by computing exact attention in a memory-optimal way, but linear methods still hold an edge for extreme lengths. In practice, linear attention models offer significant memory savings (often 5–10× less memory for long $L$) and can achieve throughput improvements especially for long-sequence inference. Yet, for a given hardware budget, a Transformer can sometimes be trained with a shorter context but more layers/heads to compensate, which complicates direct comparisons. Overall, linear approximations are a useful tool but are often augmented with other tricks to reach parity with full attention.

Sparse and Memory-Augmented Attention Variants

Instead of fundamentally changing the attention computation, another class of approaches keeps the basic dot-product attention but reduces complexity via sparsity or segmenting the sequence. These methods exploit the observation that not every pair of tokens needs to interact, especially in long texts. By designing patterns or using learnable selectors, they aim to bring attention complexity down closer to linear.

  • Fixed Sparse Patterns: Sparse Transformers (Child et al. 2019) introduced fixed attention masks (e.g. each token attends only to a subset of other positions, such as a local window plus periodic long-range jumps). Models like Longformer (Beltagy et al. 2020) and BigBird (Zaheer et al. 2020) refined this idea: Longformer uses a combination of local sliding-window attention and a few global tokens that attend broadly, achieving $\mathcal{O}(L)$ complexity and reaching BERT-like performance on long documents (e.g. question answering over multi-thousand-token documents) with far less compute ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models); a sketch of such a local-plus-global mask appears after this list. BigBird adds random sparse connections plus global tokens, and it proved theoretically capable of universal approximation and even modeling some combinatorial tasks. Empirically, BigBird matched a dense Transformer on summarization and QA with 8× longer context than the original model. These fixed sparse patterns are efficient and straightforward, but they require expert design or tuning of the pattern for each task (how to choose window size, number of global tokens, etc.). They also may not capture adaptive context – every token gets the same pattern regardless of content. Nonetheless, as of 2023 such approaches remain popular for extending context length in language models; several long-context LLM variants rely on sliding-window attention combined with caches of summary representations.

  • Learnable Sparse and Hybrid Attentions: Other works allow the model to choose which tokens to attend to. The Routing Transformer (Roy et al. 2021) uses $k$-means clustering of query/key vectors to attend only within the same cluster (reducing complexity to $\mathcal{O}(L \sqrt{L})$). The Reformer (Kitaev et al. 2020) uses locality-sensitive hashing (LSH) to group similar queries and keys, achieving $\mathcal{O}(L \log L)$ attention time. These methods are data-dependent, adjusting to content, and can handle varying patterns. However, they introduce additional complexity (e.g. randomization or iterative clustering), and performance can be sensitive to those mechanisms. In practice, models like Reformer performed well on tasks like character-level language modeling (enwik8) and image generation, but were not obviously superior on standard NLP benchmarks. Hybrid approaches such as Combiner (Ren et al. 2021) instead approximate full attention with structured factorizations, attending to nearby tokens directly and to distant tokens only through summary abstractions.

  • Segmented Processing and Recurrence: Another way to break quadratic complexity is to process sequences in blocks or segments and pass information between blocks in a compressed form. The Transformer-XL (Dai et al. 2019) introduced a recurrent memory of past segment representations, enabling effective context beyond a single segment without full attention across segments. This idea was extended by the Compressive Transformer (Rae et al. 2020), which compresses older memories to save space. These models still use standard attention within each segment (so complexity per segment is quadratic in segment length), but if segments are of length $M \ll L$, overall complexity for sequence length $L$ becomes $\mathcal{O}(L M)$. This can be viewed as a memory-augmented attention: the model learns what to carry forward. Recurrent Memory Transformer (RMT) (Bulatov et al. 2022) formalized this by adding a fixed-size learnable memory that gets read/written each segment, allowing essentially infinite context length with constant-time updates between segments ([2207.06881] Recurrent Memory Transformer - arXiv). On experiments with book-length texts, RMT was able to leverage contexts of tens of thousands of tokens, significantly beyond the segment size, while maintaining perplexity comparable to a Transformer on that task. These approaches blur into the “augmentation” category – they don’t replace attention, but reduce its workload via an external memory. We include them here as they are a viable solution to long-context modeling in 2023–2024: e.g. RMT reports processing 2 million tokens (in 4096 segments of 512 tokens) with sustained performance ([PDF] Breaking the Limits of Transformer Context Length with Recurrent ...), something intractable for a vanilla Transformer.

  • Retrieval-Augmented Models: A related augmentation is to use information retrieval from an external database to avoid attending over very long contexts. Models like RETRO (Borgeaud et al. 2022) retrieve nearest-neighbor text chunks from a corpus for each query and attend to those instead of attending to all tokens in a long context. This reduces the need for long-range attention by offloading to a search index. While not an attention mechanism replacement per se, retrieval can be seen as a sparsification in content space – only semantically relevant past tokens are brought into attention. RETRO showed that a 7.5B model with retrieval can rival models an order of magnitude larger (such as the 280B Gopher) on several language modeling benchmarks, with far less attention computation (since context per token was limited to a handful of retrieved chunks). However, retrieval requires a separate infrastructure and doesn’t trivially apply to processing a single long input like a book (it’s more for augmenting knowledge).

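As a concrete illustration of the fixed-pattern idea referenced in the first bullet, the snippet below builds a Longformer-style boolean attention mask that combines a sliding local window with a few designated global tokens. The window size, the set of global positions, and the causal flag are arbitrary choices for the example rather than values from any specific model; real implementations never materialize the dense mask and instead use block-sparse kernels that only compute the allowed entries, which is where the $\mathcal{O}(L w)$ cost comes from.

```python
import numpy as np

def local_global_mask(L, window=4, global_idx=(0,), causal=True):
    """Boolean (L, L) mask: True where query i may attend to key j.
    Combines a sliding window of +/- `window` with a few global positions
    that attend to (and are attended by) everything."""
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    mask = np.abs(i - j) <= window                    # local sliding window
    for g in global_idx:                              # global tokens: full rows/cols
        mask[g, :] = True
        mask[:, g] = True
    if causal:
        mask &= j <= i                                # no attention to the future
    return mask

m = local_global_mask(L=10, window=2, global_idx=(0,))
print(m.astype(int))
print("nonzero entries:", m.sum(), "of", m.size)      # ~O(L * window), not O(L^2)
```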
Performance: Sparse attention models like Longformer and BigBird demonstrated that thoughtfully restricted attention can attain Transformer-level accuracy on many tasks, especially those focused on long inputs (e.g. document QA, long text classification). They have been integrated into HuggingFace and used in applications requiring long context. However, for open-ended generation and general LM tasks, their adoption was limited – in part because training these models at the very large scales of GPT-3 or PaLM was not widely reported. In 2023, we saw context lengths in mainstream LLMs increase (Anthropic’s Claude up to 100k tokens) using a combination of efficient attention implementations and likely some windowing strategy. These engineering advances, like FlashAttention (which uses tiling and recomputation to handle long sequences exactly) (FlashAttention: Fast Transformer Training with Long Sequences), have somewhat reduced the need for approximate sparse methods. Still, sparse patterns remain an important tool. Notably, FlashAttention with block-sparse patterns can combine the best of both: the FlashAttention authors report that block-sparse attention can handle 8k–16k contexts with minimal perplexity loss (e.g. only +0.7 Δ perplexity on GPT-2) (FlashAttention: Fast and Memory-Efficient Exact Attention with IO ...).

Efficiency: The big win of sparse attention is linear or near-linear scaling. For example, Longformer’s attention scales as $\mathcal{O}(L w)$ for window size $w$, which for large $L \gg w$ is a dramatic savings. Memory footprint is likewise reduced (only storing selected attention scores). These methods are often easier to implement on GPU than some more radical changes, since they still use the attention primitives but with masks. Libraries for sparse operations (e.g. using block-sparse kernels) help achieve good speedups. However, one must consider overhead: extremely irregular sparsity (like LSH) can be hard to parallelize. Block-sparse or fixed patterns map better to hardware. In summary, sparse attention and segmented processing can extend a Transformer’s context range by orders of magnitude with modest overhead, at the cost of some architectural complexity and (in adaptive methods) potentially brittle behavior if the pattern fails to capture needed dependencies. They have not entirely replaced full attention, but are a key part of the efficiency toolkit as of 2025.

Convolution-Based & Implicit Attention Mechanisms

A different paradigm eschews explicit token-to-token attention and instead uses long convolution or frequency-domain transformations to mix information across the sequence. Convolutions can achieve similar effects to attention (aggregating information from many tokens) but with structured weight sharing and typically linear complexity (depending on kernel length). Recent models have demonstrated that carefully parameterized convolutions with gating can match transformer performance on language tasks ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). These approaches often draw on signal processing intuition, treating the sequence as a time-series to be filtered.

Figure (from the HazyResearch/safari repository, “Convolutions for Sequence Modeling”): The Hyena convolution-based operator uses a hierarchical recurrence of long convolution filters ($h^n$) and data-dependent gating ($D_x^n$) in place of attention. Each Hyena layer projects input $u$ to an internal sequence $v$, then iteratively applies implicit long convolutions ($S_h$ blocks) and elementwise gated multiplications ($D_x$ blocks) across the sequence to produce output $y$. Filters $h^n$ are generated by a shallow network and modulated by position (windowed), enabling content-sensitive and long-range interactions ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models).

Hyena: Implicit Long Convolutions with Gating

Hyena (Poli et al., ICML 2023) is a prime example in this family. It is introduced as a “drop-in replacement” for attention that achieves subquadratic time (in fact, close to linear) while maintaining “unrestricted context” like attention ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). Hyena’s design is based on two core components: long convolutions and data-controlled gating. Instead of computing attention weights, Hyena processes the sequence via a recurrence of convolutional filtering operations. Each Hyena layer applies an implicit convolution of very large kernel length (e.g. thousands of time-steps) followed by elementwise nonlinear gating controlled by the data. The term implicit here means the convolution kernel is not directly a parameter matrix of that full size; rather, the filter is generated by a small feed-forward network (and modulated by a positional windowing function) at runtime (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium) (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium). This allows the effective convolution length to be very large without storing millions of kernel parameters.

Mathematically, the Hyena operator can be described as a recurrence. At a high level, one can write it as:

$$ y = D_x^N \, S_h^N \cdots D_x^1 \, S_h^1 \,(W u), $$

where $u$ is the input sequence, $W$ is a learned linear projection, $S_h^i$ denotes a convolution operator (with an implicit filter $h^i$ of length $L$, implemented efficiently via FFT), and $D_x^i$ denotes a diagonal gating matrix (elementwise multiplication by a vector derived from $x$). The sequence of $S_h$ and $D_x$ alternate $N$ times (defining the depth of the Hyena block). For short recurrences ($N=1$), this reduces to simpler cases (e.g. gating or a single conv, which are special cases of attention or linear models). But for $N>1$, Hyena composes these operations to build a powerful sequence mixer. Intuitively:

  • The convolution $S_h$ mixes information from distant positions with a learned filter (analogy: a fancy weighted moving average over potentially thousands of tokens).
  • The gating $D_x$ then uses the signal itself to modulate (selectively amplify or dampen) certain features before the next convolution. This gating is element-wise and data-dependent, somewhat analogous to attention focusing on certain tokens, but implemented as a multiplicative interaction rather than a softmax-weighted sum.

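Putting the pieces above together, the toy sketch below implements an order-$N$ Hyena-style operator on a single channel: a projection of the input, then alternating FFT-based causal long convolutions ($S_h$) and elementwise data-dependent gates ($D_x$). The filters here are random explicit kernels and the "projections" are crude stand-ins, so this only illustrates the structure of the recurrence; the real model generates its filters implicitly (sketched after the Technical Details paragraph below).

```python
import numpy as np

def causal_fft_conv(u, h):
    """Causal 1-D convolution of signal u (L,) with kernel h (L,) via FFT: O(L log L)."""
    L = len(u)
    n = 2 * L                                   # zero-pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(h, n), n)
    return y[:L]                                # keep the causal part

def hyena_block(u, order=2, seed=0):
    """Toy order-N Hyena-style operator on a scalar channel:
    alternating long convolution (S_h) and elementwise gating (D_x)."""
    rng = np.random.default_rng(seed)
    L = len(u)
    # Crude linear "projections" of the input: one gate signal per order, plus the value v.
    projections = [u * rng.normal(scale=0.5) + rng.normal(scale=0.1, size=L)
                   for _ in range(order + 1)]
    v, gates = projections[0], projections[1:]
    filters = rng.normal(scale=1.0 / L, size=(order, L))   # stand-in for implicit filters
    y = v
    for n in range(order):
        y = causal_fft_conv(y, filters[n])      # S_h^n : long convolution over the sequence
        y = y * gates[n]                        # D_x^n : data-dependent elementwise gate
    return y

u = np.sin(np.linspace(0, 8 * np.pi, 512))
print(hyena_block(u).shape)                     # (512,)
```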
Motivation: The authors identify three properties that attention provides – data control (content-dependent interactions), sublinear parameter scaling (number of parameters does not grow with sequence length), and unrestricted context (any token can potentially affect any other) (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium). Hyena is explicitly designed to also have these properties. The data-controlled gating gives content sensitivity similar to attention’s query-key mechanism. The implicit long convolution filters are generated from a fixed-size network, so parameters do not depend on $L$ (in fact, Hyena’s parameter count scales sublinearly in $L$) (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium). And because the convolution spans the whole sequence (with causal masking for autoregression), any position can influence any later position – providing unlimited context length. In effect, Hyena attempts to capture the best of attention (flexibility, long-range power) and the best of convolutions (efficiency, locality bias) in one module.

Technical Details: The filters in Hyena are parameterized in the frequency domain and real-space with multiplicative windows. A small MLP produces a basis for the convolution kernel $h^i$ which is then specialized by a positional window (e.g. a decaying envelope or learned curve) (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium). This biases the filter to focus on a certain range but still allows it to cover long contexts. The convolution is implemented via Fast Fourier Transform (FFT) for speed: convolution of length $L$ with kernel length $L$ can be done in $\mathcal{O}(L \log L)$. The authors also use block-wise computation to alleviate the FFT memory bottleneck, achieving close to linear scaling in practice (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium). Gating is implemented as $D_x^i = \mathrm{diag}(\sigma(W_g^i x))$ – essentially an elementwise sigmoid or similar on a projection of the input, which then multiplies the sequence. By stacking multiple convolution-gate steps, Hyena can create complex nonlinear dependencies. One can show that for certain settings (like $N=1$ and linear gating), Hyena reduces to earlier models (like linear RNNs or simplified SSMs) (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium). But in full form, it is a new operator.

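To illustrate what an implicit filter means in practice, here is a minimal sketch in the spirit of the description above: a tiny randomly initialized MLP maps Fourier-style positional features to a length-$L$ kernel, which is then modulated by an exponentially decaying window. The feature construction, layer sizes, and decay rate are assumptions for the example, not Hyena's actual hyperparameters. The point is that the number of parameters (here 144) is independent of $L$ – the "sublinear parameter scaling" property noted above – while the window biases the filter toward recent positions without cutting off long-range taps.

```python
import numpy as np

def implicit_filter(L, n_feats=8, hidden=16, decay=0.005, seed=0):
    """Generate a length-L convolution kernel from O(1) parameters:
    positional features -> tiny MLP -> windowed kernel."""
    rng = np.random.default_rng(seed)
    t = np.arange(L)[:, None] / L                        # normalized positions (L, 1)
    freqs = np.arange(1, n_feats // 2 + 1)[None, :]
    feats = np.concatenate([np.sin(2 * np.pi * freqs * t),
                            np.cos(2 * np.pi * freqs * t)], axis=1)   # (L, n_feats)
    W1 = rng.normal(size=(n_feats, hidden))
    W2 = rng.normal(size=(hidden, 1))
    h = np.tanh(feats @ W1) @ W2                         # (L, 1) raw kernel
    window = np.exp(-decay * np.arange(L))[:, None]      # bias toward recent positions
    return (h * window).ravel()                          # (L,)

h = implicit_filter(L=4096)
print(h.shape, "parameters used:", 8 * 16 + 16)          # kernel length independent of params
```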
Performance: Hyena’s hallmark result was matching Transformer quality on language modeling without any attention ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). On standard benchmarks (WikiText-103 and The Pile), a Hyena-based model achieved the same perplexity as a Transformer of similar size, with 20% less training compute at context length 2048 ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). It also set a new state-of-the-art for models with no dense attention on these datasets ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). In long-range reasoning tasks (memorization, long input dependency tests up to 100K tokens), Hyena outperformed prior explicit and implicit models by over 50 percentage points, reaching parity with attention models ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). These results closed the gap that earlier convolution or state-space models had with Transformers. Another noteworthy experiment: at small scale (125M parameters), Hyena models were compared to Transformers and other efficient models on WikiText103 – Hyena significantly outperformed a state-space model (S4) and was on par with Transformer in perplexity (Hyena Hierarchy: Towards Larger Convolutional Language Models). Downstream, small Hyena models were competitive in zero-shot and few-shot NLP tasks, even slightly outperforming an equivalently trained GPT-Neo 125M on some SuperGLUE tasks (Hyena Hierarchy: Towards Larger Convolutional Language Models) (Hyena Hierarchy: Towards Larger Convolutional Language Models). This demonstrates that Hyena is not only matching perplexity but also learning useful language representations.

Efficiency: Hyena is designed for speed on long sequences. For sequence length $L=8$K, the Hyena operator ran ~2× faster than optimized attention (with FlashAttention), and at $L=64$K it was 100× faster ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). This reflects its subquadratic scaling (roughly $\mathcal{O}(L \log L)$ vs attention’s $\mathcal{O}(L^2)$). Memory-wise, Hyena needs to buffer the convolution operations (which can be done chunkwise). It avoids storing large $L\times L$ weight matrices; its memory grows roughly linearly with $L$. Training Hyena was also reported to utilize compute more efficiently – the paper notes a 20% reduction in total FLOPs to reach the same perplexity (Hyena Hierarchy: Towards Larger Convolutional Language Models). In practice, the FFT-based convolutions can under-utilize GPUs at shorter lengths, but beyond a certain $L$ they become advantageous. Hyena’s authors had to optimize the kernel generation and FFT usage (e.g. using mixed precision and tiling) to realize these gains (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium). An advantage of Hyena is that inference can be done in a streaming fashion: one could convolve incrementally as new tokens arrive, similar to an RNN, since it’s a causal convolution. This means lower-latency generation compared to a Transformer that must compute attention over a growing context. Overall, Hyena demonstrated that convolution with gating is a viable complete replacement for attention in LMs, achieving both speedups and comparable accuracy. It has opened a path to further explore Fourier and convolutional sequence models.

Other Implicit Mixing Approaches

Hyena was the most prominent in 2023, but it builds on prior ideas: e.g. FNet (Lee-Thorp et al. 2021) showed that replacing attention with a simple Fourier transform (mixing each sequence via a global FFT) could attain surprisingly good accuracy (within a few points of a Transformer on GLUE and language modeling) at much lower compute. FNet’s Fourier mixing is fixed (it has no learned mixing parameters at all), so it provides a strong inductive bias of global token mixing. It reached roughly 92–97% of BERT’s GLUE score while training substantially faster. However, it couldn’t match top performance on more complex tasks, indicating that data-dependent mixing is important. Hyena’s gating brings in that data-dependent element which FNet lacked.

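For reference, FNet's token mixing amounts to a couple of lines: a 2-D discrete Fourier transform over the sequence and hidden dimensions, keeping only the real part. The sketch below shows just that mixing step in isolation (the full model interleaves it with ordinary feed-forward sublayers and normalization).

```python
import numpy as np

def fnet_mixing(x):
    """FNet-style mixing: 2-D DFT over the (sequence, hidden) axes, keep the real part.
    The mixing itself has no learned parameters."""
    return np.real(np.fft.fft2(x))                # x: (L, d) -> (L, d)

x = np.random.default_rng(0).normal(size=(128, 64))
print(fnet_mixing(x).shape)
```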
Another related model was the Attention-Free Transformer (AFT) (Zhai et al. 2021) (An Attention Free Transformer - Apple Machine Learning Research). AFT proposed an element-wise attention: each query multiplied by a function of keys and values. Specifically, keys and values were combined with learned positional biases into a single representation, which then multiplicatively interacted with the query (followed by a normalization). This yields linear memory complexity. AFT performed reasonably on medium-scale tasks (e.g. image classification, enwik8 text) (An Attention Free Transformer - Apple Machine Learning Research). It essentially uses a predetermined attention weight (via position bias) instead of computing $\mathrm{softmax}(QK^T)$. AFT did not match Transformer on large language modeling, but influenced later designs – notably, RWKV and some RNN-based methods can be seen as evolving the AFT idea (adding decay factors and gating).

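A minimal rendering of the AFT-full operation described above might look like the following; the learned pairwise position bias $w$ is replaced here by a random causal matrix, and the paper's exact projections and normalization are omitted. Note that this naive form materializes an $(L, L, d)$ tensor; AFT's memory savings come from restricting or factorizing $w$ (local windows, separable biases) or dropping it entirely in the simpler variants.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_full(Q, K, V, w):
    """AFT-style mixing: y_t = sigmoid(q_t) * sum_s exp(k_s + w[t,s]) * v_s
                                            / sum_s exp(k_s + w[t,s]).
    No query-key dot products; pairwise interaction comes from the position bias w."""
    weights = np.exp(K[None, :, :] + w[:, :, None])     # (L, L, d), naive dense version
    num = (weights * V[None, :, :]).sum(axis=1)         # (L, d)
    den = weights.sum(axis=1)                           # (L, d)
    return sigmoid(Q) * num / den

L, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))
# random causal position bias: -1e9 blocks attention to future positions
w = np.where(np.tril(np.ones((L, L))) > 0, rng.normal(scale=0.1, size=(L, L)), -1e9)
print(aft_full(Q, K, V, w).shape)                       # (16, 8)
```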
Synthesizer (Tay et al. 2020) is another approach: it learns static attention weights or generates them from the content of one side only, rather than computing $QK^T$. This also reduces computation (no dot-product, often factorized forms). Synthesizer showed that even without content-based interactions, a transformer can learn decently (within a few BLEU of full attention on translation). However, the full benefit was seen when mixing some learned and some content-based attention. This suggested pure implicit patterns can work, but combining them with content-awareness is best – a philosophy reflected in Hyena (which has content gating) and state-space models like Mamba (which we’ll see adds input-dependent selectivity).

In summary, convolution and implicit transformations offer an attractive alternative to attention: they bring built-in efficiency and often better inductive bias for locality or smoothness. Early attempts that lacked adaptability fell short of Transformer accuracy, but by 2023 hybrids like Hyena that incorporate data-dependent gating showed equal performance is attainable. These methods excel especially in extreme long-range settings (where attention’s cost is prohibitive). They also tend to be friendly to hardware: convolutions map well to CNN accelerators and do not require maintaining large attention matrices. As such, we expect continued development in implicit attention – including even longer FFT-based models or wavelet/transforms – in pursuit of faster and scalable LLMs.

State-Space Models for Sequence Modeling

State-Space Models (SSMs) present another family of attention alternatives, stemming originally from continuous-time dynamical systems. An SSM defines a recurrence in continuous space, often described by a state evolution equation $x'(t) = A x(t) + B u(t)$ and an output $y(t) = C x(t)$. When discretized, this yields something like $x_{n+1} = A x_n + B u_n$; $y_n = C x_n$ – essentially a linear RNN. What makes modern SSMs powerful is parameterizing the matrix $A$ in a special way (often diagonal or low-rank plus diagonal) that allows computing the convolution kernel of the system efficiently. In effect, SSMs produce learned long convolution filters (via the fundamental solution of the state equation) that can span thousands of time-steps, while still being computed with $\mathcal{O}(L \log L)$ or $\mathcal{O}(L)$ complexity using FFT or recursive formulas. They also maintain a hidden state that can be updated recurrently, making them amenable to streaming. SSMs inject a strong inductive bias from control theory – they were initially used for audio and time-series, but recent works applied them to language with significant success.

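The recurrence/convolution duality can be made explicit with a tiny discretized SSM using random (stable) matrices: the same output is obtained either step by step as a linear RNN or all at once as a convolution with kernel $K_k = C A^k B$. The state size and sequence length below are arbitrary toy values; S4-style models compute this kernel in closed form (or with FFT tricks) rather than by explicit matrix powers, which is what makes the convolutional view practical at long $L$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 64                                    # state size, sequence length
A = np.diag(rng.uniform(0.5, 0.95, size=N))     # stable diagonal state matrix
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)

# 1) Recurrent view: x_n = A x_{n-1} + B u_n ;  y_n = C x_n
x = np.zeros((N, 1))
y_rec = np.zeros(L)
for n in range(L):
    x = A @ x + B * u[n]
    y_rec[n] = (C @ x).item()

# 2) Convolutional view: y = K * u with kernel K_k = C A^k B
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_conv = np.array([sum(K[k] * u[n - k] for k in range(n + 1)) for n in range(L)])

print(np.allclose(y_rec, y_conv))               # True: same operator, two computations
```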
The seminal model in this category was S4 (Structured State Spaces) by Gu et al. (ICLR 2022). S4 designed a specific matrix $A$ whose convolution kernel (impulse response) can be computed in closed form as a mixture of exponentials (related to HiPPO orthogonal polynomials). Using clever initialization and parameterization, S4 achieved remarkable long-range learning, solving tasks with dependencies spanning over 10,000 steps that Transformers struggled with. However, S4 was complex to implement and initially slow to train (due to needing FFTs of many small filters). Follow-up works like S4D (simplified S4 with diagonal $A$) and S5 made improvements in speed or stability.

By 2023, the culmination of these was Mamba (“Linear-Time Sequence Modeling with Selective State Spaces”, Gu & Dao, 2023) ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces). Mamba is effectively SSMs 2.0, introducing an important enhancement: input-dependent gating of the state dynamics.

Mamba: Selective State Space Model

Mamba (Gu & Dao, 2024) identifies a key weakness of prior subquadratic models (SSMs, linear attention, etc.): a lack of content-based routing or “reasoning” ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces). Traditional SSMs like S4 are time-invariant linear filters – they treat the sequence in a fixed way regardless of the token values (aside from the linear superposition). This makes it hard for them to do tasks like “if token = X, then propagate info faster/slower,” which attention does naturally. Mamba’s solution is to make the SSM selective: the state update equation’s parameters are functions of the input token. In practice, Mamba incorporates gating functions that allow the model to “selectively propagate or forget information … depending on the current token.” (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview). This is analogous to how an RNN’s forget gate can shut off memory for irrelevant inputs, but here applied along the depth of a deep SSM.

Concretely, Mamba keeps the structured state matrix $A$ of prior SSMs (diagonal in practice), but makes the remaining SSM parameters – the input matrix $B$, output matrix $C$, and the discretization step $\Delta$ – functions of the current token. One can imagine that for each position $n$, how strongly the state absorbs $x_n$ and how quickly it decays is scaled or switched depending on the token itself. This transforms the purely linear, time-invariant convolution into a data-dependent recurrence – essentially introducing a content-based bias akin to attention’s query-key mechanism. Gu & Dao show that this significantly improves the ability to handle discrete modalities like text (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview). The challenge is that making the filter input-dependent naively breaks the efficient convolution trick, since the system is no longer time-invariant. To address this, they compute the model in “recurrent mode” with a work-efficient parallel scan and a hardware-aware, kernel-fused implementation that avoids materializing the expanded state in slow GPU memory (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview), still achieving linear time overall. The result is an architecture that eschews explicit attention or even separate MLP blocks entirely – it is just a stack of these selective SSM layers – yet performs on par with Transformers.

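The following toy recurrence conveys the flavor of selectivity: the per-step decay and write strength are functions of the current token, so the layer can hold or discard state depending on content. It is a deliberately simplified scalar-channel sketch, not Mamba's actual parameterization or its hardware-aware scan; because the decay depends on the input, the update is no longer a time-invariant convolution, and Mamba recovers efficiency by computing this kind of recurrence with an associative parallel scan in a fused GPU kernel.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_ssm(u, W_dt, W_b, a=1.0):
    """Toy selective scan over a scalar channel:
    each token chooses its own step size dt (hence its own decay exp(-dt * a))
    and its own write strength, instead of a fixed time-invariant filter."""
    h, ys = 0.0, []
    for u_n in u:
        dt = softplus(W_dt * u_n)          # input-dependent step size
        decay = np.exp(-dt * a)            # input-dependent forgetting
        gate = sigmoid(W_b * u_n)          # input-dependent write strength
        h = decay * h + (1.0 - decay) * gate * u_n
        ys.append(h)
    return np.array(ys)

u = np.concatenate([np.ones(5), np.zeros(20), -np.ones(5)])
print(np.round(selective_ssm(u, W_dt=2.0, W_b=1.0), 3))
```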
Performance: Mamba’s results are impressive. A Mamba model with 1.4B parameters outperformed a Transformer of the same size and matched the performance of a Transformer twice its size on language modeling (both in pretraining perplexity and downstream tasks) (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview). Specifically, Mamba-1.4B achieved similar perplexity to a 2.8B Transformer on The Pile and showed strong few-shot learning ability. This indicates that the content-based gating had closed the gap – previous SSMs underperformed on language vs. Transformers, but Mamba catches up. Moreover, Mamba set state-of-the-art across multiple modalities: not only language, but also audio (speech classification) and genomics, suggesting the architecture is generally powerful (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview). On very long sequences (up to millions of tokens), Mamba’s performance kept improving, whereas Transformers could not even be run that far – highlighting the ultra-long context capability of state-space models ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces). In some evaluations, Mamba even slightly exceeded Transformer results, likely due to its better handling of long-range dependencies. These results firmly establish SSMs as a viable backbone for LLMs. In fact, contemporaneous work on distilling Transformer knowledge into SSMs (e.g. Mamba) showed that with only 1% of the original training data, a distilled 300M SSM (nicknamed Phi-Mamba) can outperform all previous open-source non-transformer models (NeurIPS Poster Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models), underscoring how far SSMs have come.

Architecture and Training: Mamba’s architecture is quite minimalist. The entire network can be viewed as a stack of identical “SSM blocks” (each with its internal state and gating), with normalization and maybe dropout. It does not even use separate feed-forward (FFN) layers between SSM layers ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces), relying on the SSM block’s internal linear transformations and gating to also do channel mixing. This makes it computationally light. The selective gating is implemented as learned elementwise functions that are applied to the state update or output of the SSM. Because these gates can shift activation statistics, each block is wrapped in standard pre-norm residual normalization (RMSNorm in the reference implementation). Training Mamba required some care in initialization and parameterization of the state dynamics to ensure stability over long sequences (inheriting some of S4’s tricks). But once trained, it offers a great inference advantage.

Efficiency: Mamba enjoys linear time and memory scaling in sequence length (like S4). The authors report it has 5× higher throughput than a Transformer for generation and can be run on very long inputs that Transformers cannot (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview) ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces). For example, at sequence length 1M, a Transformer is infeasible due to roughly $10^{12}$ pairwise operations, but Mamba can handle it with modest compute. Inference in Mamba can be done in a streaming RNN fashion as well, giving O(1) cost per token (constant hidden state size). An interesting point: Mamba’s selectivity introduces some overhead versus a pure time-invariant SSM, but the authors devised a custom hardware-aware scan kernel (in the spirit of FlashAttention-style kernel fusion) to parallelize the gated recurrence efficiently. This suggests that with specialized kernels, even input-dependent SSMs can achieve high hardware utilization. Memory footprint is also linear; e.g. an 8k-token sequence of dimension $d$ in Mamba uses O(8k·d) memory, whereas a Transformer would use O((8k)^2) for attention – huge savings for long context. One tradeoff is that the recurrence is inherently sequential, but the parallel scan spreads that computation across the sequence during training (similar in spirit to how RWKV and RetNet parallelize their recurrences). Indeed, Mamba was trained on GPU clusters in comparable time to a Transformer of similar size, showing that parallelism was not a bottleneck (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview).

Overall, Mamba demonstrates the maturity of state-space models: by adding content-dependent gating, they have reached Transformer-level expressiveness. SSMs offer a principled way to have learned long convolutional memory and now, adaptability. We can view Mamba as doing what attention does (deciding what to remember or forget) but in a distributed, implicit manner through its state dynamics, rather than computing pairwise attention scores. As such, it’s a promising foundation for the next generation of efficient LLMs.

Recurrence and Gating: RNN-Inspired Approaches

A major theme in 2023–2025 is the resurgence of recurrent neural networks (RNNs) as viable competitors to Transformers for language modeling. Classic RNNs (LSTMs, etc.) fell out of favor mainly due to lack of parallelizability and difficulties in capturing very long dependencies. New architectures, however, combine the parallel training ability of Transformers with the efficient, constant-time inference of RNNs – in effect, bridging the two paradigms. These models typically replace the attention mechanism with a recurrent “time-mixing” mechanism equipped with gating (to manage long-term information). They often draw on earlier ideas like the Attention-Free Transformer or gating in SSMs, and in some cases, linear attention, to create an update rule that can be computed recurrently and in parallel.

Two prominent examples are RWKV and RetNet, alongside others like the Mega architecture, H3, and HGRN. We discuss RWKV and RetNet in detail, as they achieved large-scale results.

RWKV: Receptance-Weighted Key-Value RNN

RWKV (Peng et al., EMNLP 2023) stands for Receptance Weighted Key-Value. It is an RNN architecture specifically designed to mimic Transformer performance while keeping the RNN advantages (RWKV: Reinventing RNNs for the Transformer Era | OpenReview). The key idea is to reformulate the transformer block into an RNN form. RWKV takes inspiration from the Attention-Free Transformer and linear attention: it effectively removes the softmax and uses an exponential moving average mechanism to accumulate past information, modulated by a learnable receptance (gate). At each time step, RWKV maintains a hidden state that is analogous to the running key and value summaries in attention. The update can be seen as:

$$ h_n = \sigma(r_n) \odot f(h_{n-1}, x_n) + \bigl(1 - \sigma(r_n)\bigr) \odot h_{n-1}, $$

where $r_n$ (receptance) is a gating signal computed from the new token $x_n$ (and possibly the old $h_{n-1}$), and $f(h_{n-1}, x_n)$ is some function producing a candidate update (related to key-value content). This is reminiscent of an LSTM’s gated update, but $f$ is structured to replicate attention. In the original RWKV implementations, $f$ essentially computes something akin to $K_n \odot (\text{decay} \cdot V_{\text{accumulated}} + V_n)$, where $K_n,V_n$ are key/value projections of $x_n$, and the decay is a fixed factor ensuring older contributions diminish. The receptance gate $\sigma(r_n)$ then decides how much of this new information to blend with the old state. The “KV” in the name highlights that it treats parts of its state as analogous to attention’s Key and Value aggregates, and “Receptance” is the gate controlling receptivity to new info (RWKV Architecture History) (RWKV Architecture History). This design was tuned over multiple versions (v1 used convolution for time-mixing (RWKV Architecture History); v2 introduced a pure RNN form; v3/v4 refined for large scale).

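The recurrence below is a simplified, AFT-flavored time-mixing step in the spirit of RWKV: an exponentially decayed running sum of $\exp(k)$-weighted values, normalized and gated by a receptance signal. It omits RWKV's numerical-stability tricks, the bonus term for the current token, and the channel-mixing sublayer, so it should be read as a cartoon of the mechanism rather than the released architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rwkv_like_time_mix(R, K, V, decay):
    """Recurrent time mixing per channel:
    num_n = decay * num_{n-1} + exp(k_n) * v_n
    den_n = decay * den_{n-1} + exp(k_n)
    y_n   = sigmoid(r_n) * num_n / den_n
    State size is constant regardless of context length."""
    L, d = K.shape
    num = np.zeros(d)
    den = np.zeros(d)
    ys = np.zeros((L, d))
    for n in range(L):
        num = decay * num + np.exp(K[n]) * V[n]
        den = decay * den + np.exp(K[n])
        ys[n] = sigmoid(R[n]) * num / (den + 1e-9)
    return ys

L, d = 32, 8
rng = np.random.default_rng(0)
R, K, V = rng.normal(size=(3, L, d))
print(rwkv_like_time_mix(R, K, V, decay=np.exp(-0.1)).shape)   # (32, 8)
```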
Parallelizability: Normally, an RNN must run sequentially (can’t compute $h_n$ without $h_{n-1}$). RWKV cleverly sidesteps this during training by leveraging the form of linear attention. Specifically, the update equations of RWKV can be viewed as a linear combination of exponentials and prefix sums, which means one can unroll them into a parallel scan operation (prefix-sum style). In fact, RWKV leverages the same associative property that linear attention uses to allow batched matrix operations over the sequence. The authors mention it “allows us to formulate the model as either a Transformer or an RNN” (RWKV: Reinventing RNNs for the Transformer Era | OpenReview). In practice, they train it like a Transformer (parallel over sequence) by using the equivalent formulation to backprop, but for inference, they use the recurrent form to get O(1) stepping. This gives RWKV the best of both worlds: fast parallel training, and fast sequential generation (RWKV: Reinventing RNNs for the Transformer Era | OpenReview).

Performance: RWKV was scaled up to 14 billion parameters – the largest pure RNN LM ever trained – and achieved performance on par with similarly sized Transformers (RWKV: Reinventing RNNs for the Transformer Era | OpenReview). For instance, a 1.5B RWKV model can match GPT-style Transformers of similar size on perplexity, and the 14B RWKV matches 13B-class Transformers (such as OPT or LLaMA variants) on many tasks (RWKV: Reinventing RNNs for the Transformer Era | OpenReview). It also exhibits in-context learning abilities (few-shot prompting) comparable to Transformers. These results were validated on benchmarks such as WikiText, the Pile, and downstream evaluations (the authors released RWKV models which indeed perform strongly in chat and QA tasks). Notably, RWKV is an open-source, community-driven project, and by late 2023 it gained popularity as a more resource-friendly LLM: it can run with less memory and be easily converted to efficient implementations (like RWKV.cpp for CPU). Early versions of RWKV needed some distillation or careful hyperparameters to reach parity, but by v4 the architecture itself proved capable. A third-party study found RWKV’s scaling curve and zero-shot performance very similar to Transformers, albeit needing slightly more training tokens to converge (perhaps due to optimization nuances) (RWKV: Reinventing RNNs for the Transformer Era | OpenReview).

Efficiency: The big advantage of RWKV is at inference time. Since it’s an RNN, it doesn’t need to carry an $O(L^2)$ attention cache. It only needs to store its hidden state (of size equal to the model’s layer width, maybe a few thousand values). This means constant memory for any context length. In a 2023 comparison, a 1.5B RWKV used ~30% of the memory of a Transformer for 1024-token generation, and the gap widens for longer contexts. The inference speed is also faster: each new token for RWKV requires a fixed amount of compute (matrix multiplies of size [hidden × hidden]), whereas a Transformer must do attention with the growing sequence. For long prompts, RWKV can be significantly faster – one report noted multi-query attention (an optimized Transformer variant) was almost as fast as RWKV up to moderate lengths, but beyond a few thousand tokens RWKV pulls ahead ([D] Why isn't everyone using RWKV if it's so much better than ...). RWKV’s authors claim “much faster inference [and] lower memory footprint” than transformers (RWKV: Reinventing RNNs for the Transformer Era - OpenReview). During training, RWKV can utilize GPUs fully thanks to the parallel scan – the team reported comparable training speed to a Transformer of same size, and because each step is linear in $L$, they could train on longer sequences if needed. The tradeoff is a bit more ops per token (the gating and multiple vector accumulations), but that overhead is minor relative to the benefit of no quadratic term. In summary, RWKV demonstrates that RNNs augmented with attention-like mechanisms can equal transformers at scale, with substantial efficiency gains in deployment. This has rekindled interest in RNN LMs after the hiatus since 2017.

RetNet: Retentive Network

RetNet (Sun et al., 2023) (Retentive Network: A Successor to Transformer for Large Language Models - Microsoft Research) is another large-scale model that follows a similar philosophy. It introduces a “multi-scale retention mechanism” to replace multi-head attention (Retentive Network: A Successor to Transformer for Large Language Models - Microsoft Research). The term retention reflects the idea of retaining information over time with decaying weights, reminiscent of leaky integrators. RetNet’s retention mechanism can be seen as a variant of linear attention or an RNN, but carefully designed to address the “impossible triangle” of parallel training, low inference cost, and strong performance (Retentive Networks (RetNet) Explained: The much-awaited Transformers-killer is here | by Shantanu Chandra | AI FUSION LABS | Medium). In fact, the authors explicitly note they aim to achieve all three, which prior methods struggled to do simultaneously (Retentive Networks (RetNet) Explained: The much-awaited Transformers-killer is here | by Shantanu Chandra | AI FUSION LABS | Medium).

Mechanism: At its core, RetNet uses a decay-based attention kernel. Each layer has a fixed number of retention heads, analogous to attention heads. Instead of computing attention via softmax, each retention head applies a decay function to past inputs. One way to express it is: for each new token, the head’s output is

$$ o_n = q_n \odot \Big(\sum_{m=1}^{n} \alpha^{\,n-m}\, k_m \odot v_m \Big), $$

where $q_n, k_n, v_n$ are query, key, value projections (vectors) for token $n$, $\odot$ is elementwise multiplication, and $\alpha^{\,n-m}$ is a decay factor that decreases for more distant past positions $m$ (with $\alpha < 1$). This is similar to an exponential moving average of past $k \odot v$ terms, weighted by a new query. In RetNet, however, $\alpha$ is not a single constant – different heads use different decay rates, and the formulation additionally includes complex-valued rotations (via $e^{-\lambda (n-m)}$ and $e^{i\omega(n-m)}$ factors) that allow oscillatory memory, effectively combining multiple decays. The result is that each retention head has an impulse response like a mixture of decaying exponentials, which can approximate various attention patterns. Crucially, this formulation supports a recurrent form: the partial sum $\sum_{m=1}^{n} \alpha^{\,n-m}(k_m \odot v_m)$ can be updated from $n-1$ to $n$ by multiplying by $\alpha$ and adding the new $k_n \odot v_n$. Thus, one can maintain a state vector that gets updated each time step – that’s the recurrent view. For training, they derive an equivalent parallel formulation (which applies a lower-triangular matrix of decay factors alongside the query–key products) (Retentive Network: A Successor to Transformer for Large Language Models - Microsoft Research). RetNet also includes a multi-scale aspect: it has heads with different decay speeds, some capturing short-term, others long-term dependencies. In practice, they implement this by assigning different decay factors per head and by chunking the sequence (the “chunkwise recurrent” mode) for very long inputs (Retentive Network: A Successor to Transformer for Large Language Models - Microsoft Research). The retention outputs go through a normalization (GroupNorm) and a feed-forward block just as attention outputs would.

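The equivalence between the summation form above and a constant-size recurrent state takes only a few lines to demonstrate; the sketch below uses a single scalar decay and the elementwise key/value interaction of the equation, ignoring RetNet's rotations, per-head decays, and normalization. The parallel form (training mode) is a masked matrix product, while the recurrent form (inference mode) carries only an $\mathcal{O}(d)$ state; RetNet's chunkwise mode interpolates between the two.

```python
import numpy as np

def retention_recurrent(Q, K, V, alpha):
    """Recurrent form: s_n = alpha * s_{n-1} + k_n * v_n ;  o_n = q_n * s_n."""
    L, d = Q.shape
    s = np.zeros(d)
    out = np.zeros((L, d))
    for n in range(L):
        s = alpha * s + K[n] * V[n]
        out[n] = Q[n] * s
    return out

def retention_parallel(Q, K, V, alpha):
    """Parallel form: o_n = q_n * sum_{m<=n} alpha^(n-m) k_m * v_m,
    written with an explicit (L, L) decay matrix for clarity."""
    L = Q.shape[0]
    n, m = np.arange(L)[:, None], np.arange(L)[None, :]
    D = np.where(m <= n, alpha ** (n - m), 0.0)        # lower-triangular decay matrix
    return Q * (D @ (K * V))

L, d = 16, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))
print(np.allclose(retention_recurrent(Q, K, V, 0.9),
                  retention_parallel(Q, K, V, 0.9)))   # True
```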
Performance: RetNet showed that this retention mechanism can indeed match Transformers. On language modeling, RetNet models had nearly identical scaling curves of validation perplexity to Transformers up to billions of parameters. Empirically, they found that for model sizes above ~2B, RetNet slightly outperforms Transformers in perplexity. This suggests that retention may be even more parameter-efficient at large scale. In terms of in-context learning and zero-shot tasks, RetNet was “consistently competitive” with Transformers. An analysis in the paper shows that on tasks like question answering, RetNet retained the Transformer’s ability to utilize long context (they specifically tested “needle in a haystack”-style prompts where a model has to find a hint in a long context – RetNet did as well as a Transformer, and better than an RNN without such a mechanism) ([2503.02130] Forgetting Transformer: Softmax Attention with a Forget Gate). This addresses concerns that the exponential decay might forget too quickly; apparently, multi-scale decays solve this. Overall, RetNet made a strong case that the attention softmax is not essential for LLM performance – a carefully designed linear attention (retention) can do the job.

Efficiency: RetNet’s benefits are mostly at inference. Inference cost is length-invariant, since each new token update is $\mathcal{O}(d)$ per head. They demonstrated that at 8k context, a 7B RetNet decodes 8.4× faster and uses only 30% of the memory compared to a Transformer with standard KV caching. This is a huge win, particularly as memory is often the bottleneck for deployment (especially on GPUs). Even compared to an optimized approach like FlashAttention, RetNet had advantages at long lengths. During training, RetNet also showed gains: it saved 25–50% memory and achieved up to 7× faster training throughput than a standard Transformer (likely measured for long sequences where attention becomes costly). The chunkwise mode means for very long sequences (like 16k or 32k tokens) you can process in chunks (say 1k each) with recurrent summarization between chunks, achieving linear scaling without blowing up memory (Retentive Network: A Successor to Transformer for Large Language Models - Microsoft Research). This was shown to be effective and did not hurt performance much. In summary, RetNet meets its goal of hitting that “triangle”: it trains well in parallel, it runs fast at inference, and it keeps performance at parity. By 2025, RetNet is considered a strong “successor to Transformer” architecture (Retentive Network: A Successor to Transformer for Large Language Models - Microsoft Research), and its ideas are influencing newer models (for example, Microsoft has explored using RetNet for large-scale deployments given its favorable throughput).

Other Gated RNN Variants

It’s worth noting that several other models explored similar recurrent mechanisms:

  • Mega (Ma et al., 2022), introduced earlier, is a single-head gated attention that uses an exponential moving average (EMA) operator with gating ([2209.10655] Mega: Moving Average Equipped Gated Attention - arXiv). In essence, Mega is like a one-head RetNet: it replaces softmax with an EMA of values (with some key interaction) and applies LSTM-style gates. Mega achieved excellent results on Long Range Arena (outperforming both Transformers and S4) and on WikiText103 it slightly outperformed Transformer of the same size ([R] Mega: Moving Average Equipped Gated Attention. By using ...). It also did well on tasks like machine translation and ImageNet classification, showing the approach’s generality (Mega: Moving Average Equipped Gated Attention | OpenReview). Mega’s success foreshadowed the later larger-scale RNNs. However, Mega at large scale was not reported, whereas RWKV and RetNet took the concepts and scaled to billions of params.

  • H3 (Hungry Hungry Hippos, Fu et al. 2023) was an earlier attempt by the Hyena authors to merge SSMs and attention ideas. H3 can be viewed as an RNN with two SSM filters (a diagonal and a shifted one) that tries to mimic key–value storage and retrieval. It performed well on some synthetic tasks but needed hybrid attention for best results (Hyena Hierarchy: Towards Larger Convolutional Language Models). It was a stepping stone toward Hyena and Mamba.

  • HGRN (Qin et al. 2023) stands for Hierarchically Gated Recurrent Network. It introduced a lower-bounded forget gate to ensure gradients flow and used a hierarchical stacking of multiple small RNNs to expand state size (NeurIPS Poster Hierarchically Gated Recurrent Neural Network for ...). HGRN showed very strong training speed and good LM performance at small scales, and the follow-up HGRN2 further improved it by expanding state size efficiently (HGRN2: Gated Linear RNNs with State Expansion - OpenReview). HGRN2 is reported as a “well-performing RNN-based SOTA language model” (HGRN2: Gated Linear RNNs with State Expansion - OpenReview), though it’s relatively new and not as widely benchmarked as RWKV/RetNet.

  • DeltaNet (Schlag et al. 2021; parallelized at scale by Yang et al. 2024) revisits the idea of fast weights (from Schmidhuber’s 1990s work) in a modern way. It treats the weight update in linear attention as an instance of the “delta rule” (error-correcting Hebbian-style updates) and accelerates it over the sequence length. Essentially, DeltaNet is another perspective on linear Transformer training, and a gated version was reported to improve performance further (Yikang Shen on X). DeltaNet has been validated on smaller tasks and shows promise for efficient long-range learning.

  • GLA and FoMo: Other recent acronyms, such as Gated Linear Attention (GLA) and Forgetful MoE (FoMo), combine gating with linear attention ([PDF] Exploring RNNs for Sample-Efficient Training of Language Models), indicating continued interest in this direction.
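As a concrete illustration of the fast-weight view behind DeltaNet (referenced in that bullet above), here is a minimal per-token sketch of a delta-rule memory update, assuming a single head and a per-token write strength produced by a sigmoid; this is a schematic of the idea only, since the actual papers parallelize the update over the sequence for training efficiency.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rule_memory(Q, K, V, beta_logits):
    """Fast-weight memory updated with the delta rule: the state stores an associative
    key-to-value map and is error-corrected, not just accumulated."""
    L, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))                    # fast-weight matrix (the "memory")
    out = np.zeros((L, d_v))
    for t in range(L):
        k, v = K[t], V[t]
        beta = sigmoid(beta_logits[t])          # content-dependent write strength
        v_pred = S.T @ k                        # what the memory currently returns for this key
        S = S + beta * np.outer(k, v - v_pred)  # delta rule: write only the prediction error
        out[t] = S.T @ Q[t]                     # read with the query
    return out
```

Compared with plain linear attention, which only accumulates $k_t v_t^\top$ outer products, the delta rule subtracts the memory’s current prediction for the key before writing, so stale associations can be overwritten rather than piling up.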

Finally, a very recent idea is the Forgetting Transformer (FoX, 2025) ([2503.02130] Forgetting Transformer: Softmax Attention with a Forget Gate) – not an RNN, but it adds a forget gate to standard attention (down-weighting old attention scores in a content-aware way). FoX is interesting because it basically inserts an RNN concept into the Transformer, and it was shown to outperform vanilla Transformers on long-context language modeling and compete well with recurrent models like Mamba-2 ([2503.02130] Forgetting Transformer: Softmax Attention with a Forget Gate). This further reinforces that gating + decay = gains in long sequences.
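To illustrate how FoX grafts this gate onto ordinary softmax attention, here is a rough single-head sketch, assuming the forget gate is a per-token sigmoid applied as a cumulative log-space bias on the causal attention logits; head splitting, the exact gate parameterization, and other details from the paper are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def forget_gated_attention(Q, K, V, forget_logits):
    """Causal softmax attention whose logits are down-weighted by cumulative,
    content-dependent forget gates (older tokens receive progressively larger penalties)."""
    L, d = Q.shape
    log_f = np.log(1.0 / (1.0 + np.exp(-forget_logits)) + 1e-9)  # log of sigmoid gates
    c = np.cumsum(log_f)                       # c[i] = sum_{k<=i} log f_k
    bias = c[:, None] - c[None, :]             # bias[i, j] = sum over log f_k for k in (j, i]
    logits = Q @ K.T / np.sqrt(d) + bias
    causal = np.tril(np.ones((L, L), dtype=bool))
    logits = np.where(causal, logits, -np.inf)  # mask out future positions
    return softmax(logits, axis=-1) @ V
```

Because the bias for a given query–key pair accumulates the log-gates of every token between them, distant keys are penalized more whenever the model emits small gate values, which is exactly the content-aware down-weighting described above.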

In summary, the recurrence and gating family has grown robust in 2023–2025. These models leverage gating (like RNNs) and sometimes explicit decay factors to maintain a form of continuous attention. They have achieved parity with attention-based Transformers on language modeling and offer dramatically improved efficiency for long sequences and deployment. Mega’s EMA, RWKV’s receptance gate, RetNet’s multi-scale decay, and Mamba’s input-dependent state all converge on one paradigm: models that learn to remember or forget dynamically can replace the expensive attention matrix with a far cheaper mechanism. The result is that we now have viable alternatives (RNNs and SSMs) that scale to modern LLM requirements.

Comparison of Approaches

To wrap up, we compile a comparison of these alternate mechanisms versus standard Transformer attention. Each approach has different trade-offs in complexity, performance, and practical usability:

| Approach (Example Models) | Mechanism & Complexity | LM Performance vs. Transformer | Efficiency & Scalability |
|---|---|---|---|
| Full Softmax Attention (baseline Transformer) | All-pairs dot-product + softmax; $\mathcal{O}(L^2)$ time & memory. | Baseline. SOTA language performance; strong in-context learning. | High memory use (attention scales with $L^2$); training and inference slow for $L > 2$k. |
| Linear / Kernel Attention (Performer, Linformer) | Approximate softmax with feature maps or low-rank ops; $\mathcal{O}(L)$–$\mathcal{O}(Ld)$. | Good on small/medium tasks; quality drops on very large corpora. | Low memory (no $L \times L$ matrix); supports $L$ up to 16k+ easily. Some difficulty capturing positional order without help. |
| Sparse Attention (Longformer, BigBird) | Attend only to local or selected tokens (fixed pattern or learned); $\mathcal{O}(L)$. | Matches Transformer on long-text tasks (QA, summarization) ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models); used in long-context LMs. | Efficient for long inputs; memory linear in $L$. Pattern design needed; not a fundamentally new mechanism (still uses attention primitives). |
| Implicit Conv/FFT (Hyena, FNet, AFT) | Long convolution or Fourier transforms instead of attention; $\mathcal{O}(L \log L)$ (often effectively linear). | Hyena reaches parity with Transformer on LM ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). Earlier implicit models slightly lower but close. | Highly efficient for long $L$: 100× faster than attention at 64k context ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models); parameter count constant w.r.t. $L$. Requires careful optimization (FFT overhead). |
| State-Space Models (S4, Mamba) | Learnable linear dynamical system (long filter) with optional input-dependent gates; $\mathcal{O}(L)$ via FFT/recurrence. | Mamba outperforms a same-size Transformer and matches one 2× larger (Mamba: Linear-Time Sequence Modeling with Selective State Spaces - OpenReview). Older SSMs slightly behind on text before gating. | Linear scaling in $L$; constant-size recurrent state at inference; excels at ultra-long contexts. |
| Recurrent w/ Gating (RWKV, RetNet, Mega) | RNN-style state updates with content-based gating or decay; $\mathcal{O}(L)$ (parallelizable training). | RWKV/RetNet on par with Transformers on perplexity and zero-shot ability (RWKV: Reinventing RNNs for the Transformer Era - OpenReview). Mega (small) outperforms prior models on LRA (Mega: Moving Average Equipped Gated Attention - OpenReview). | Length-invariant inference cost with constant memory per token; e.g. RetNet decodes ~8× faster with ~70% less memory than a Transformer at 8k context. |

Key Takeaways: Alternate attention mechanisms have matured to the point that several can serve as drop-in replacements for self-attention in large language models without loss of accuracy. Convolution-based (Hyena) and recurrent (RWKV, RetNet) approaches in particular have demonstrated transformer-equivalent performance on both language modeling and downstream tasks, while offering significant efficiency gains: linear or better scaling in context length, constant or reduced memory use, and faster inference speeds. State-space models (Mamba) similarly match or exceed transformer quality and excel at ultra-long contexts. Linear and sparse attention methods, which were earlier efforts, laid the groundwork and are still useful for moderate gains in efficiency, though by themselves they fell a bit short on performance at the largest scales.

It appears that introducing adaptivity (gating, content-dependent parameters) is a common thread in the most successful recent models – this allows them to capture the “dynamic routing” ability of attention (which pure linear ops lacked). With that solved, we are seeing a convergence where the gap between Transformers and efficient alternatives has essentially closed. Moving forward, one can envision mainstream LLMs adopting these mechanisms (or hybrids thereof) to handle longer contexts and reduce inference costs. Early signs include research on combining Transformer and RNN blocks, and using gating mechanisms in otherwise standard architectures (e.g. the Forgetting Transformer). The period 2023–2025 has thus been pivotal: it delivered practical, state-of-the-art alternatives to the dominance of attention, expanding the design space for sequence models and promising more scalable NLP systems in the future.

References

References to papers, articles, and sources are cited inline above as parenthetical source titles (with links where available). Each citation points to the evidence supporting the summarized content (e.g., performance metrics, complexity claims, or specific technical details), ensuring traceability to the original works.
