Transformer-style self-attention has been central to recent advances in language modeling, but its quadratic cost in sequence length makes long-context processing expensive, which has driven a search for more efficient alternatives. This survey reviews the main families of attention replacements and approximations developed through 2023–2025.
Contents:
- Introduction
- Linear & Kernel-Based Attention Approximations
- Sparse and Memory-Augmented Attention Variants
- Convolution-Based & Implicit Attention Mechanisms
- State-Space Models for Sequence Modeling
- Recurrence and Gating: RNN-Inspired Approaches
- Comparison of Approaches
- References
Transformer self-attention enables powerful context mixing but at quadratic cost in sequence length. This makes long-context processing memory-intensive and slow, prompting a search for alternatives that scale more efficiently. Early subquadratic approximations (e.g. low-rank projections and sparse attention patterns) often had to be hybridized with some full attention layers to maintain performance, underscoring the challenge ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). Recent efforts (2023–2025) have produced attention replacements that close the quality gap with Transformers without any standard attention layers ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). Broadly, these approaches aim to retain the transformer's expressive power (e.g. content-based token interactions) while improving efficiency in one or more ways:
- Linearizing or approximating the softmax attention computation to achieve linear time/space complexity.
- Sparsifying attention connections or using hierarchical/segment-based processing to reduce complexity.
- Replacing attention with implicit long-range convolution or Fourier transforms, injecting strong inductive biases for sequence structure.
- Employing state-space models or RNN-style recurrences with gating, to capture long dependencies with constant per-step costs and parallelizable training.
- Augmenting models with memory or retrieval so that not all information must be stored in the attention mechanism.
Each family of methods balances pros and cons in terms of modeling capacity, training difficulty, and hardware utilization. We will survey representative methods in each category, highlighting technical details (often including key equations or architectural changes) and empirical results on language modeling and related tasks. Efficiency aspects – such as asymptotic memory usage, throughput on long sequences, and scalability to long contexts or large model sizes – are emphasized for each approach. Where useful, we include diagrams or pseudocode to illustrate how these mechanisms operate.
One line of research replaces the quadratic attention pattern with linear computations by leveraging kernel approximations or low-rank projections. The goal is to approximate the softmax attention output without ever materializing the $L \times L$ attention matrix, bringing the cost down from quadratic to (near-)linear in sequence length.
- Kernel Feature Maps (Linear Transformers): One approach is to find a feature map $\phi(\cdot)$ such that the softmax attention $A_{ij} = \frac{\exp(q_i \cdot k_j)}{\sum_n \exp(q_i \cdot k_n)}$ can be approximated by a dot-product of transformed queries and keys: $\exp(q \cdot k) \approx \phi(q)^\top \phi(k)$. If such a $\phi$ exists (even in random Fourier feature form), the attention output $(\phi(Q)\phi(K)^\top)V$ can be re-associated as $\phi(Q)\,(\phi(K)^\top V)$, so the sum over keys is computed once and reused. This yields a linear iterative update over sequence positions instead of explicitly forming the $L \times L$ matrix (a minimal sketch appears after this list). Linear Transformer models (e.g. Katharopoulos et al. 2020) used positive-valued feature maps (such as $\mathrm{elu}(x)+1$ or exponential kernels) to this end, enabling recurrent computation of attention outputs. However, these methods often struggled to preserve the full modeling power of softmax attention; in practice their performance on language modeling lagged behind standard Transformers. For example, linear attention models have difficulty encoding positional order as effectively, leading to worse perplexities on large-scale text.
- Low-Rank Projections (Linformer): Another strategy is the Linformer (Wang et al. 2020), which multiplies $K$ and $V$ by learned projection matrices of size $L \times r$ (with $r \ll L$) to compress the sequence-length dimension, under the assumption that the attention matrix is approximately low-rank. This yields attention approximations in $\mathcal{O}(Lr)$ time. Linformer achieves good performance for moderate compression (e.g. $r = 256$ for $L = 2048$) on tasks like machine translation, but extremely low $r$ degrades accuracy. It maintains comparable quality to Transformers on shorter contexts, but on very long contexts or more complex language understanding, some loss is observed if $r$ is too small.
- Random Projections (Performer): Performers (Choromanski et al. 2021) introduce random feature maps to approximate the softmax kernel with variance guarantees. By mapping queries and keys via random Fourier features, softmax attention can be approximated unbiasedly in $\mathcal{O}(L d^2)$ time (for feature dimension $d$). This allows probabilistic error bounds and scaling to long sequences. On reasonably sized tasks (e.g. text8 or smaller language models), Performers can match Transformer accuracy, but on larger benchmarks they too showed some quality gap, often necessitating hybrid models (using some exact attention layers) ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models).
- Nyström and Other Approximations: The Nyströmformer (Xiong et al. 2021) uses the Nyström method to approximate the attention matrix via a subset of landmark points, also achieving $\mathcal{O}(L)$ complexity by sampling keys/queries. Similarly, methods using block-wise clustering or iterative averaging of attention (such as the recurrent fast weight approach by Schlag et al. 2021) recast attention as an RNN with outer-product updates (fast weight memory) (Going Beyond Linear Transformers with Recurrent Fast Weight...) (Linear Transformers Are Secretly Fast Weight Programmers - alphaXiv). These methods can handle longer inputs than vanilla attention and perform solidly on algorithmic or short text tasks, but pure approximations have not consistently matched Transformer SOTA on open-ended language modeling. As one 2023 survey noted, “linear attention struggles to effectively encode position information, rendering the models less performant” than Transformers. Thus, while linear approximations drastically cut complexity, they often sacrifice some accuracy on challenging language benchmarks. Recent work like DeltaNet (2023) revisits linear attention with improved training algorithms (the “delta rule”) and gating, showing promise on smaller scale experiments (Parallelizing Linear Transformers with the Delta Rule over ... - arXiv), but scaling these to competitive large language models is still an ongoing challenge.
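To make the reordering concrete, below is a minimal NumPy sketch of causal linear attention with a generic positive feature map. The feature map, tolerances, and shapes are illustrative assumptions rather than any specific paper's implementation; real methods differ mainly in the choice of $\phi$ (e.g. elu+1, random features, or learned maps).

```python
import numpy as np

def causal_linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """O(L * d * d_v) causal 'attention' via the kernel trick:
    out_t = phi(q_t) @ S_t / (phi(q_t) @ z_t), with running sums
    S_t = sum_{j<=t} phi(k_j) v_j^T and z_t = sum_{j<=t} phi(k_j)."""
    Qf, Kf = phi(Q), phi(K)               # (L, d) feature-mapped queries and keys
    L, d = Qf.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))                # running sum of outer products phi(k_j) v_j^T
    z = np.zeros(d)                       # running normalizer: sum of phi(k_j)
    out = np.empty((L, d_v))
    for t in range(L):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + 1e-9)
    return out

# Example: 1,000 tokens, 64-dim heads; the 1000 x 1000 attention matrix is never formed.
rng = np.random.default_rng(0)
L, d = 1000, 64
y = causal_linear_attention(rng.normal(size=(L, d)),
                            rng.normal(size=(L, d)),
                            rng.normal(size=(L, d)))
```

The running sums `S` and `z` act as a fixed-size compressed key–value memory, which is exactly the kind of state that the RNN-inspired models discussed later (RWKV, RetNet) build on.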
Empirical Performance: In summary, linear and kernel-based approximations remain attractive for efficiency – some enable a recurrent formulation for autoregressive generation with constant memory – but pure implementations saw limited adoption in 2023–2024 for state-of-the-art LLMs due to the performance gap. They laid important groundwork, however, and influenced later designs (e.g. some RNN-inspired models build on linear attention ideas). For instance, the RWKV and RetNet models discussed later both derive their recurrent update rules from linearized attention formulations. In those models, enhancements like gating and multi-scale parameters are added to overcome the weaknesses of basic linear attention.
Efficiency: These approaches achieve time and memory linear in sequence length, enabling training with very long sequences (e.g. >16K tokens) that would be infeasible with vanilla attention. Memory usage per token is constant rather than growing with sequence length. This makes them appealing for streaming or long-document tasks. Some approximations (e.g. Performers) can be implemented on GPU/TPU to run faster than softmax attention for moderately long sequences.
Instead of fundamentally changing the attention computation, another class of approaches keeps the basic dot-product attention but reduces complexity via sparsity or segmenting the sequence. These methods exploit the observation that not every pair of tokens needs to interact, especially in long texts. By designing patterns or using learnable selectors, they aim to bring attention complexity down closer to linear.
- Fixed Sparse Patterns: Sparse Transformers (Child et al. 2019) introduced fixed attention masks (e.g. each token attends only to a subset of other positions such as a local window + periodic long-range jumps). Models like Longformer (Beltagy et al. 2020) and BigBird (Zaheer et al. 2020) refined this idea: Longformer uses a combination of local sliding-window attention and a few global tokens that attend broadly, achieving $\mathcal{O}(L)$ complexity and reaching BERT-like performance on long documents (e.g. question answering on 8k-token documents) with far less compute ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). BigBird uses random sparse connections plus global tokens, and it proved theoretically capable of universal approximation and even modeling some combinatorial tasks. Empirically, BigBird matched a dense Transformer on summarization and QA with 8× longer context than the original Transformer. These fixed sparse patterns are efficient and straightforward (a sketch of such a mask appears after this list), but they require expert design or tuning of the pattern for each task (how to choose window size, number of global tokens, etc.). They also may not capture adaptive context – every token gets the same pattern regardless of content. Nonetheless, in 2023 such approaches remain popular for extending context length in language models (for example, versions of GPT-3 and LLaMA with 32K+ context often rely on sliding-window attention and a cache of summary embeddings).
- Learnable Sparse and Hybrid Attentions: Other works allow the model to choose which tokens to attend to. The Routing Transformer (Roy et al. 2021) uses $k$-means clustering of query/key vectors to attend only within the same cluster (reducing complexity to $\mathcal{O}(L \sqrt{L})$). The Reformer (Kitaev et al. 2020) uses locality-sensitive hashing (LSH) to group similar queries and keys, achieving $\mathcal{O}(L \log L)$ attention time. These methods are data-dependent, adjusting to content, and can handle varying patterns. However, they introduce additional complexity (e.g. randomization or iterative clustering), and performance can be sensitive to those mechanisms. In practice, models like Reformer performed well on tasks like character-level language modeling (enwik8) and image generation, but were not obviously superior on standard NLP benchmarks. Some hybrid approaches like Combiner (2021) mix attention with convolution to skip interacting with distant tokens unless necessary.
- Segmented Processing and Recurrence: Another way to break quadratic complexity is to process sequences in blocks or segments and pass information between blocks in a compressed form. The Transformer-XL (Dai et al. 2019) introduced a recurrent memory of past segment representations, enabling effective context beyond a single segment without full attention across segments. This idea was extended by the Compressive Transformer (Rae et al. 2020), which compresses older memories to save space. These models still use standard attention within each segment (so complexity per segment is quadratic in segment length), but if segments are of length $M \ll L$, overall complexity for sequence length $L$ becomes $\mathcal{O}(L M)$. This can be viewed as memory-augmented attention: the model learns what to carry forward. The Recurrent Memory Transformer (RMT) (Bulatov et al. 2022) formalized this by adding a fixed-size learnable memory that gets read/written each segment, allowing essentially unbounded context length with constant-time updates between segments ([2207.06881] Recurrent Memory Transformer - arXiv). In experiments with book-length texts, RMT was able to leverage contexts of tens of thousands of tokens, significantly beyond the segment size, while maintaining perplexity comparable to a Transformer on that task. These approaches blur into the “augmentation” category – they don’t replace attention, but reduce its workload via an external memory. We include them here as they are a viable solution to long-context modeling in 2023–2024: e.g. RMT reports processing 2 million tokens (in 4096 segments of 512 tokens) with sustained performance ([PDF] Breaking the Limits of Transformer Context Length with Recurrent ...), something intractable for a vanilla Transformer.
- Retrieval-Augmented Models: A related augmentation is to use information retrieval from an external database to avoid attending over very long contexts. Models like RETRO (Borgeaud et al. 2022) retrieve nearest-neighbor text chunks from a corpus for each query and attend to those instead of attending to all tokens in a long context. This reduces the need for long-range attention by offloading to a search index. While not an attention mechanism replacement per se, retrieval can be seen as a sparsification in content space – only semantically relevant past tokens are brought into attention. RETRO showed that a 7.5B model with retrieval can match much larger models (such as GPT-3 and Jurassic-1, with tens of times more parameters) on several benchmarks, with far less attention computation (since context per token was limited to a handful of retrieved chunks). However, retrieval requires separate infrastructure and doesn’t trivially apply to processing a single long input like a book (it’s more for augmenting knowledge).
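To illustrate the fixed-pattern idea referenced above, here is a small sketch of a Longformer-style mask combining a sliding window with a handful of global tokens. This is a simplified, dense-mask illustration (an assumption made for clarity); efficient implementations use blocked kernels and never materialize the full $L \times L$ mask.

```python
import numpy as np

def sliding_window_global_mask(L, window=2, global_idx=(0,)):
    """Boolean (L, L) mask: True where attention is allowed.
    Local band of width 2*window+1 plus rows/columns for global tokens;
    the number of allowed entries grows as O(L*window + L*n_global), not O(L^2)."""
    idx = np.arange(L)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # sliding local window
    mask[list(global_idx), :] = True                       # global tokens see every position
    mask[:, list(global_idx)] = True                       # every position sees global tokens
    return mask

m = sliding_window_global_mask(8, window=1, global_idx=(0,))
print(m.astype(int))   # band matrix plus a dense first row and column
```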
Performance: Sparse attention models like Longformer and BigBird demonstrated that thoughtfully restricted attention can attain Transformer-level accuracy on many tasks, especially those focused on long inputs (e.g. document QA, long text classification). They have been integrated into HuggingFace and used in applications requiring long context. However, for open-ended generation and general LM tasks, their adoption was limited – in part because training these models at the very large scales of GPT-3 or PaLM was not widely reported. In 2023, we saw context lengths in mainstream LLMs increase (Anthropic’s Claude up to 100k tokens) using a combination of efficient attention implementations and likely some windowing strategy. These engineering advances, like FlashAttention (which uses tiling and recomputation to handle long sequences exactly) (FlashAttention: Fast Transformer Training with Long Sequences), have somewhat reduced the need for approximate sparse methods. Still, sparse patterns remain an important tool. Notably, FlashAttention with block-sparse patterns can combine the best of both: its authors report that block-sparse attention makes 8k–16k contexts practical with little change in model quality (on the order of 0.7 perplexity for GPT-2) (FlashAttention: Fast and Memory-Efficient Exact Attention with IO ...).
Efficiency: The big win of sparse attention is linear or near-linear scaling. For example, Longformer’s attention scales as $\mathcal{O}(L \times w)$ for window size $w$ (plus a small number of global tokens), so compute and memory grow linearly with sequence length rather than quadratically – which is what makes processing documents of tens of thousands of tokens feasible.
A different paradigm eschews explicit token-to-token attention and instead uses long convolution or frequency-domain transformations to mix information across the sequence. Convolutions can achieve similar effects to attention (aggregating information from many tokens) but with structured weight sharing and typically linear complexity (depending on kernel length). Recent models have demonstrated that carefully parameterized convolutions with gating can match transformer performance on language tasks ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). These approaches often draw on signal processing intuition, treating the sequence as a time-series to be filtered.
Figure (source: GitHub – HazyResearch/safari, “Convolutions for Sequence Modeling”): The Hyena convolution-based operator uses a hierarchical recurrence of long convolution filters ($h^n$) and data-dependent gating ($D_x^n$) in place of attention. Each Hyena layer projects input $u$ to an internal sequence $v$, then iteratively applies implicit long convolutions ($S_h$ blocks) and elementwise gated multiplications ($D_x$ blocks) across the sequence to produce output $y$. Filters $h^n$ are generated by a shallow network and modulated by position (windowed), enabling content-sensitive and long-range interactions ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models).
Hyena (Poli et al., ICML 2023) is a prime example in this family. It is introduced as a “drop-in replacement” for attention that achieves subquadratic time (in fact, close to linear) while maintaining “unrestricted context” like attention ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). Hyena’s design is based on two core components: long convolutions and data-controlled gating. Instead of computing attention weights, Hyena processes the sequence via a recurrence of convolutional filtering operations. Each Hyena layer applies an implicit convolution of very large kernel length (e.g. thousands of time-steps) followed by elementwise nonlinear gating controlled by the data. The term implicit here means the convolution kernel is not directly a parameter matrix of that full size; rather, the filter is generated by a small feed-forward network (and modulated by a positional windowing function) at runtime (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium). This allows the effective convolution length to be very large without storing millions of kernel parameters.
Mathematically, the Hyena operator can be described as a recurrence. At a high level, for an order-$N$ operator with input projections $v, x^1, \dots, x^N$ (all linear functions of the input $u$) and implicit filters $h^1, \dots, h^N$, one can write it as:

$$z^1_t = v_t, \qquad z^{n+1}_t = x^n_t \cdot \big(h^n * z^n\big)_t \quad (n = 1, \dots, N), \qquad y_t = z^{N+1}_t,$$

where:
- The convolution $(h^n * z^n)$ (the $S_h$ blocks in the figure) mixes information from distant positions with a learned filter (analogy: a fancy weighted moving average over potentially thousands of tokens).
- The gating $x^n_t \cdot (\cdot)$ (the $D_x$ blocks) uses the signal itself to modulate (selectively amplify or dampen) certain features before the next convolution. This gating is element-wise and data-dependent, somewhat analogous to attention focusing on certain tokens, but implemented as a multiplicative interaction rather than a softmax-weighted sum.
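A minimal sketch of this recurrence is shown below, using FFT-based causal convolutions. The gate signals and filters are passed in precomputed (in Hyena they come from linear projections of the input and from a small filter-generating network), so the function signature and the random stand-ins are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def causal_fft_conv(z, h):
    """Causal long convolution of z (L, D) with per-channel filters h (L, D) in O(L log L)."""
    L = z.shape[0]
    n = 2 * L                                   # zero-pad so the convolution doesn't wrap around
    Z = np.fft.rfft(z, n=n, axis=0)
    H = np.fft.rfft(h, n=n, axis=0)
    return np.fft.irfft(Z * H, n=n, axis=0)[:L]

def hyena_operator(v, gates, filters):
    """z^1 = v;  z^{n+1}_t = x^n_t * (h^n (*) z^n)_t;  y = z^{N+1}.
    v: (L, D) value projection of the input; gates: list of N (L, D) arrays x^n;
    filters: list of N (L, D) implicit filters h^n."""
    z = v
    for x_n, h_n in zip(gates, filters):
        z = x_n * causal_fft_conv(z, h_n)       # long convolution followed by data-controlled gating
    return z

# Tiny usage example with random stand-ins for the learned projections and filters.
rng = np.random.default_rng(0)
L, D, N = 512, 16, 2
y = hyena_operator(rng.normal(size=(L, D)),
                   [rng.normal(size=(L, D)) for _ in range(N)],
                   [rng.normal(size=(L, D)) * 0.01 for _ in range(N)])
```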
Motivation: The authors identify three properties that attention provides – data control (content-dependent interactions), sublinear parameter scaling (number of parameters does not grow with sequence length), and unrestricted context (any token can potentially affect any other) (Paper review: Hyena Hierarchy: Towards Larger Convolutional Language Models | by Andrew Lukyanenko | Medium). Hyena is explicitly designed to also have these properties. The data-controlled gating gives content sensitivity similar to attention’s query-key mechanism. The implicit long convolution filters are generated from a fixed-size network, so parameters do not depend on sequence length (sublinear parameter scaling), and because each filter can span the entire sequence, any token can influence any other (unrestricted context).
Technical Details: The filters in Hyena are parameterized implicitly, combining positional (frequency) features with multiplicative windows. A small MLP, applied to a positional encoding, produces the values of each convolution kernel $h^n$, which are then modulated by a decaying window so that the filters emphasize recent context while still reaching far back; the long convolutions themselves are evaluated with FFTs in $\mathcal{O}(L \log L)$ time.
Performance: Hyena’s hallmark result was matching Transformer quality on language modeling without any attention ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). On standard benchmarks (WikiText-103 and The Pile), a Hyena-based model achieved the same perplexity as a Transformer of similar size, with 20% less training compute at context length 2048 ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). It also set a new state-of-the-art for models with no dense attention on these datasets ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). In long-range reasoning tasks (memorization, long input dependency tests up to 100K tokens), Hyena outperformed prior explicit and implicit models by over 50 percentage points, reaching parity with attention models ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). These results closed the gap that earlier convolution or state-space models had with Transformers. Another noteworthy experiment: at small scale (125M parameters), Hyena models were compared to Transformers and other efficient models on WikiText-103 – Hyena significantly outperformed a state-space model (S4) and was on par with the Transformer in perplexity (Hyena Hierarchy: Towards Larger Convolutional Language Models). Downstream, small Hyena models were competitive in zero-shot and few-shot NLP tasks, even slightly outperforming an equivalently trained GPT-Neo 125M on some SuperGLUE tasks (Hyena Hierarchy: Towards Larger Convolutional Language Models). This demonstrates that Hyena is not only matching perplexity but also learning useful language representations.
Efficiency: Hyena is designed for speed on long sequences. For sequence length $L$ = 8K, the Hyena operator ran ~2× faster than optimized attention (with FlashAttention), and at $L$ = 64K it was 100× faster ([2302.10866] Hyena Hierarchy: Towards Larger Convolutional Language Models). This reflects its near-linear scaling (roughly $\mathcal{O}(L \log L)$ for the FFT-based long convolutions) versus attention’s quadratic growth, and the fact that no $L \times L$ attention matrix is ever materialized, so memory also grows only linearly with sequence length.
Hyena was the most prominent in 2023, but it builds on prior ideas: e.g. FNet (Lee-Thorp et al. 2021) showed that replacing attention with a simple Fourier transform (mixing each sequence via a global FFT) could attain surprisingly good accuracy (within a few points of BERT on GLUE) at much lower compute. FNet’s Fourier mixing is fixed (it has no learned mixing parameters), so it provides a strong inductive bias of global token mixing. It achieved roughly 92–97% of BERT’s GLUE score while training 70–80% faster on accelerators. However, it couldn’t match top performance on more complex tasks, indicating that data-dependent mixing is important. Hyena’s gating brings in that data-dependent element which FNet lacked.
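For reference, FNet's token-mixing step is essentially a one-liner: a parameter-free 2D Fourier transform over the sequence and hidden dimensions, keeping the real part. A minimal sketch:

```python
import numpy as np

def fnet_mixing(x):
    """FNet-style mixing: Fourier transform over both the sequence and hidden
    dimensions of x (L, D), keeping the real part. No learned mixing parameters."""
    return np.fft.fft2(x).real

mixed = fnet_mixing(np.random.default_rng(0).normal(size=(128, 64)))
```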
Another related model was the Attention-Free Transformer (AFT) (Zhai et al. 2021) (An Attention Free Transformer - Apple Machine Learning Research). AFT proposed an element-wise attention: each query is multiplied by a function of the keys and values. Specifically, keys and values are combined with learned positional biases into a single representation, which then interacts multiplicatively with the query (followed by a normalization). This yields linear memory complexity. AFT performed reasonably on medium-scale tasks (e.g. image classification, enwik8 text) (An Attention Free Transformer - Apple Machine Learning Research). It essentially uses predetermined attention weights (via position biases) instead of computing pairwise query–key dot products for every token pair.
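A heavily simplified sketch of the element-wise idea (closest to the position-bias-free "AFT-simple" variant, and written non-causally for brevity – both simplifications are assumptions made here for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_simple(Q, K, V):
    """Element-wise 'attention': values are pooled with softmax(K) weights over the
    sequence (no query-key dot products), and the query only gates the result.
    Q, K, V: (L, D). A causal LM variant would use cumulative sums instead of global ones."""
    w = np.exp(K - K.max(axis=0, keepdims=True))        # stabilized softmax weights over positions
    context = (w * V).sum(axis=0) / w.sum(axis=0)       # (D,) per-channel pooled values
    return sigmoid(Q) * context                         # query acts as an elementwise gate

out = aft_simple(*[np.random.default_rng(i).normal(size=(32, 8)) for i in range(3)])
```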
Synthesizer (Tay et al. 2020) is another approach: it learns static attention weights or generates them from the content of one side only, rather than computing pairwise query–key dot products. It showed that such "synthetic" attention can come surprisingly close to standard attention on some tasks, though it generally does not surpass it.
In summary, convolution and implicit transformations offer an attractive alternative to attention: they bring built-in efficiency and often better inductive bias for locality or smoothness. Early attempts that lacked adaptability fell short of Transformer accuracy, but by 2023 hybrids like Hyena that incorporate data-dependent gating showed equal performance is attainable. These methods excel especially in extreme long-range settings (where attention’s cost is prohibitive). They also tend to be friendly to hardware: convolutions map well to CNN accelerators and do not require maintaining large attention matrices. As such, we expect continued development in implicit attention – including even longer FFT-based models or wavelet/transforms – in pursuit of faster and scalable LLMs.
State-Space Models (SSMs) present another family of attention alternatives, stemming originally from continuous-time dynamical systems. An SSM defines a recurrence in continuous time, often described by a state evolution equation $x'(t) = A\,x(t) + B\,u(t)$ with readout $y(t) = C\,x(t) + D\,u(t)$. Discretizing it gives a linear recurrence $x_t = \bar{A}\,x_{t-1} + \bar{B}\,u_t$, $y_t = C\,x_t$, which can equivalently be unrolled as a long convolution of the input with a kernel determined by $(\bar{A}, \bar{B}, C)$ – so the same model can be run recurrently (for generation) or as a convolution (for parallel training).
The seminal model in this category was S4 (Structured State Spaces) by Gu et al. (NeurIPS 2021). S4 designed a specific structured state matrix $A$ (initialized from the HiPPO theory of online memory compression) so that the recurrence retains information over very long horizons, and showed how to compute the equivalent convolution kernel efficiently. S4 achieved breakthrough results on the Long Range Arena benchmark, though early SSMs still lagged behind Transformers on language modeling; follow-ups such as S4D and S5 simplified the parameterization with diagonal state matrices.
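To make the SSM-as-convolution view concrete, the toy sketch below materializes the kernel of an already-discretized linear SSM and applies it. This naive version is only for intuition; S4's actual contribution is computing such kernels efficiently for its structured state matrices, which this sketch does not attempt.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """Length-L convolution kernel K = (C B, C A B, C A^2 B, ...) of a discretized SSM.
    Naive O(L * N^2) construction for clarity."""
    K = np.empty(L)
    x = B_bar.copy()                 # holds A_bar^j @ B_bar
    for j in range(L):
        K[j] = C @ x
        x = A_bar @ x
    return K

def ssm_as_conv(u, A_bar, B_bar, C):
    """y_t = sum_j K_j u_{t-j}; identical to running x_t = A x_{t-1} + B u_t, y_t = C x_t."""
    K = ssm_kernel(A_bar, B_bar, C, len(u))
    return np.convolve(u, K)[: len(u)]

# Toy example: a stable 4-state SSM filtering a random signal of length 256.
rng = np.random.default_rng(0)
N, L = 4, 256
A_bar = 0.9 * np.eye(N) + 0.01 * rng.normal(size=(N, N))
y = ssm_as_conv(rng.normal(size=L), A_bar, rng.normal(size=N), rng.normal(size=N))
```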
By 2023, the culmination of this line was Mamba (“Mamba: Linear-Time Sequence Modeling with Selective State Spaces”, Gu & Dao, 2023) ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces). Mamba is effectively SSMs 2.0, introducing an important enhancement: input-dependent gating of the state dynamics.
Mamba (Gu & Dao, 2024) identifies a key weakness of prior subquadratic models (SSMs, linear attention, etc.): a lack of content-based routing or “reasoning” ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces). Traditional SSMs like S4 are time-invariant linear filters – they treat the sequence in a fixed way regardless of the token values (aside from the linear superposition). This makes it hard for them to do tasks like “if token = X, then propagate info faster/slower,” which attention does naturally. Mamba’s solution is to make the SSM selective: the state update equation’s parameters are functions of the input token. In practice, Mamba incorporates gating functions that allow the model to “selectively propagate or forget information … depending on the current token.” (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview). This is analogous to how an RNN’s forget gate can shut off memory for irrelevant inputs, but here applied along the depth of a deep SSM.
Concretely, Mamba builds on the structured state-space parameterization of S4 (diagonal, or diagonal-plus-low-rank, state matrices), but makes the discretization step size $\Delta$ and the input/output projections $B$ and $C$ functions of the current token, so the recurrence itself becomes input-dependent. Because the system is no longer time-invariant, the FFT-convolution trick used by S4 no longer applies; Mamba instead computes the recurrence with a hardware-aware parallel scan to keep training efficient.
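The following single-channel sketch conveys the selectivity idea under simplifying assumptions (scalar inputs per channel, illustrative parameter names, a plain Python loop). Mamba itself fuses this scan into a hardware-aware parallel kernel and uses vector-valued projections.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def selective_ssm_channel(u, A, w_dt, b_dt, W_B, W_C):
    """Sequential sketch of a selective diagonal SSM for one channel (Mamba-style idea).
    u: (L,) input; A: (N,) negative diagonal of the state matrix;
    w_dt, b_dt: scalars producing an input-dependent step size;
    W_B, W_C: (N,) vectors producing input-dependent projections."""
    x = np.zeros_like(A)                        # hidden state, size independent of L
    y = np.empty_like(u)
    for t, u_t in enumerate(u):
        dt = softplus(w_dt * u_t + b_dt)        # selectivity: step size depends on the token
        B_t, C_t = W_B * u_t, W_C * u_t         # selectivity: projections depend on the token
        A_bar = np.exp(dt * A)                  # zero-order-hold discretization of diag(A)
        x = A_bar * x + dt * B_t * u_t          # input-dependent linear recurrence
        y[t] = C_t @ x
    return y

# Toy run on a 1,024-step signal with a 16-dimensional state.
rng = np.random.default_rng(0)
y = selective_ssm_channel(rng.normal(size=1024), -np.linspace(0.5, 4.0, 16),
                          w_dt=0.5, b_dt=-1.0,
                          W_B=rng.normal(size=16), W_C=rng.normal(size=16))
```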
Performance: Mamba’s results are impressive. A Mamba model with 1.4B parameters outperformed a Transformer of the same size and matched the performance of a Transformer twice its size on language modeling (both in pretraining perplexity and downstream tasks) (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview). Specifically, Mamba-1.4B achieved similar perplexity to a 2.8B Transformer on The Pile and showed strong few-shot learning ability. This indicates that the content-based gating had closed the gap – previous SSMs underperformed on language vs. Transformers, but Mamba catches up. Moreover, Mamba set state-of-the-art across multiple modalities: not only language, but also audio (speech classification) and genomics, suggesting the architecture is generally powerful (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview). On very long sequences (up to millions of tokens), Mamba’s performance kept improving, whereas Transformers could not even be run that far – highlighting the ultra-long context capability of state-space models ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces). In some evaluations, Mamba even slightly exceeded Transformer results, likely due to its better handling of long-range dependencies. These results firmly establish SSMs as a viable backbone for LLMs. In fact, contemporaneous work on distilling Transformer knowledge into SSMs (e.g. Mamba) showed that with only 1% of the original training data, a distilled 300M SSM (nicknamed Phi-Mamba) can outperform all previous open-source non-transformer models (NeurIPS Poster Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models), underscoring how far SSMs have come.
Architecture and Training: Mamba’s architecture is quite minimalist. The entire network can be viewed as a stack of identical “SSM blocks” (each with its internal state and gating), with normalization and optionally dropout. It does not even use separate feed-forward (FFN) layers between SSM layers ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces), relying on the SSM block’s internal linear transformations to also do channel mixing. This makes it computationally light. The selective gating is implemented as learned elementwise functions applied to the state update or output of the SSM. Because these gates can cause variance changes, normalization inside the block is handled carefully (RetNet, discussed below, adopts GroupNorm in place of LayerNorm for a similar reason). Training Mamba required some care in initialization and parameter tying across long sequences to ensure stability (inheriting some of S4’s tricks). But once trained, it offers a great inference advantage.
Efficiency: Mamba enjoys linear time and memory scaling in sequence length (like S4). The authors report 5× higher generation throughput than a Transformer, and that it can be run on very long inputs that Transformers cannot (Mamba: Linear-Time Sequence Modeling with Selective State Spaces | OpenReview) ([2312.00752] Mamba: Linear-Time Sequence Modeling with Selective State Spaces). For example, at sequence length 1M, a Transformer is infeasible due to the quadratic cost of attention and its ever-growing key–value cache, whereas Mamba’s fixed-size recurrent state lets it keep processing with constant memory per token.
Overall, Mamba demonstrates the maturity of state-space models: by adding content-dependent gating, they have reached Transformer-level expressiveness. SSMs offer a principled way to have learned long convolutional memory and now, adaptability. We can view Mamba as doing what attention does (deciding what to remember or forget) but in a distributed, implicit manner through its state dynamics, rather than computing pairwise attention scores. As such, it’s a promising foundation for the next generation of efficient LLMs.
A major theme in 2023–2025 is the resurgence of recurrent neural networks (RNNs) as viable competitors to Transformers for language modeling. Classic RNNs (LSTMs, etc.) fell out of favor mainly due to lack of parallelizability and difficulties in capturing very long dependencies. New architectures, however, combine the parallel training ability of Transformers with the efficient, constant-time inference of RNNs – in effect, bridging the two paradigms. These models typically replace the attention mechanism with a recurrent “time-mixing” mechanism equipped with gating (to manage long-term information). They often draw on earlier ideas like the Attention-Free Transformer or gating in SSMs, and in some cases, linear attention, to create an update rule that can be computed recurrently and in parallel.
Two prominent examples are RWKV and RetNet, alongside others like the Mega architecture, H3, and HGRN. We discuss RWKV and RetNet in detail, as they achieved large-scale results.
RWKV (Peng et al., EMNLP 2023) stands for Receptance Weighted Key-Value. It is an RNN architecture specifically designed to mimic Transformer performance while keeping the RNN advantages (RWKV: Reinventing RNNs for the Transformer Era | OpenReview). The key idea is to reformulate the transformer block into an RNN form. RWKV takes inspiration from the Attention-Free Transformer and linear attention: it effectively removes the softmax and uses an exponential moving average mechanism to accumulate past information, modulated by a learnable receptance (gate). At each time step, RWKV maintains a hidden state that is analogous to the running key and value summaries in attention. The update can be seen as:

$$\mathrm{wkv}_t \;=\; \frac{\sum_{i<t} e^{-(t-1-i)\,w + k_i}\, v_i \;+\; e^{\,u + k_t}\, v_t}{\sum_{i<t} e^{-(t-1-i)\,w + k_i} \;+\; e^{\,u + k_t}}, \qquad o_t \;=\; W_o\,\big(\sigma(r_t) \odot \mathrm{wkv}_t\big),$$

where $k_i, v_i$ are key/value projections of the inputs, $w > 0$ is a learned per-channel decay, $u$ is a learned “bonus” for the current token, and $r_t$ is the receptance, whose sigmoid gates how much of the accumulated value is exposed. The numerator and denominator can each be carried forward as a running state, so the per-token update is constant-time.
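A numerically naive sketch of this recurrence in its sequential (inference-style) form is shown below; the real implementation additionally tracks a running maximum for numerical stability, and the output would still pass through the block's output projection.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rwkv_wkv(r, k, v, w, u):
    """Sequential form of the WKV recurrence above (numerically naive sketch).
    r, k, v: (L, D) receptance/key/value projections; w: (D,) positive decay; u: (D,) bonus."""
    num = np.zeros_like(w)              # running sum of exp(-(t-1-i) w + k_i) * v_i
    den = np.zeros_like(w)              # running sum of exp(-(t-1-i) w + k_i)
    out = np.empty_like(v)
    for t in range(len(v)):
        e_now = np.exp(u + k[t])
        wkv = (num + e_now * v[t]) / (den + e_now + 1e-9)
        out[t] = sigmoid(r[t]) * wkv    # receptance gate decides what is exposed
        decay = np.exp(-w)
        num = decay * num + np.exp(k[t]) * v[t]
        den = decay * den + np.exp(k[t])
    return out
```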
Parallelizability: Normally, an RNN must run sequentially (step $t$ cannot be computed before step $t-1$), which makes training on long sequences slow. RWKV’s time-mixing, however, can also be evaluated in a parallel form over the whole sequence during training – much as linear attention can be computed either as a cumulative scan or all at once – and then switched to the pure recurrent form for inference. This gives RWKV Transformer-like training throughput with RNN-like generation cost.
Performance: RWKV was scaled up to 14 billion parameters – the largest pure RNN LM trained to date – and achieved performance on par with similarly sized Transformers (RWKV: Reinventing RNNs for the Transformer Era | OpenReview). For instance, a 1.5B RWKV model can match GPT-style Transformers of similar size on perplexity, and the 14B RWKV matches ~13B Transformers (such as OPT or LLaMA variants) on many tasks (RWKV: Reinventing RNNs for the Transformer Era | OpenReview). It also exhibits in-context learning abilities (few-shot prompting) comparable to Transformers. These results were validated on benchmarks such as WikiText, the Pile, and downstream evaluations (the released RWKV models indeed perform strongly in chat and QA settings). Notably, RWKV is an open-source, community-driven project, and by late 2023 it gained popularity as a more resource-friendly LLM: it can run with less memory and be easily converted to efficient implementations (like RWKV.cpp for CPU). Early versions of RWKV needed some distillation or careful hyperparameters to reach parity, but by v4 the architecture itself proved capable. A third-party study found RWKV’s scaling curve and zero-shot performance very similar to Transformers, albeit needing slightly more training tokens to converge (perhaps due to optimization nuances) (RWKV: Reinventing RNNs for the Transformer Era | OpenReview).
Efficiency: The big advantage of RWKV is at inference time. Since it’s an RNN, it doesn’t need to carry an ever-growing key–value cache: generation only requires the fixed-size recurrent state, so memory per token is constant and per-token latency does not increase as the context grows. This makes RWKV well suited to long conversations, streaming inputs, and deployment on constrained hardware (hence ports like RWKV.cpp).
RetNet (Sun et al., 2023) (Retentive Network: A Successor to Transformer for Large Language Models - Microsoft Research) is another large-scale model that follows a similar philosophy. It introduces a “multi-scale retention mechanism” to replace multi-head attention (Retentive Network: A Successor to Transformer for Large Language Models - Microsoft Research). The term retention reflects the idea of retaining information over time with decaying weights, reminiscent of leaky integrators. RetNet’s retention mechanism can be seen as a variant of linear attention or an RNN, but carefully designed to address the “impossible triangle” of parallel training, low inference cost, and strong performance (Retentive Networks (RetNet) Explained: The much-awaited Transformers-killer is here | by Shantanu Chandra | AI FUSION LABS | Medium). In fact, the authors explicitly note they aim to achieve all three, which prior methods struggled to do simultaneously.
Mechanism: At its core, RetNet uses a decay-based attention kernel. Each layer has a fixed number of retention heads, analogous to attention heads. Instead of computing attention via softmax, each retention head applies an exponential decay to past inputs. One way to express it is: for each new token $t$, the head maintains a state $S_t$ and produces an output

$$S_t = \gamma\, S_{t-1} + k_t^{\top} v_t, \qquad o_t = q_t\, S_t,$$

where $q_t, k_t, v_t$ are query/key/value projections and $\gamma \in (0,1)$ is a decay factor; each head uses a different $\gamma$, which is what makes the retention “multi-scale”. Equivalently, in the parallel (training) form, $\mathrm{Retention}(X) = (Q K^{\top} \odot D)\,V$ with $D_{nm} = \gamma^{\,n-m}$ for $n \ge m$ (and 0 otherwise), so the whole sequence can be processed at once like attention, while the recurrent form gives $\mathcal{O}(1)$ per-token inference.
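A minimal sketch of one retention head in its recurrent form (shapes and the example decay schedule are illustrative assumptions):

```python
import numpy as np

def retention_head(q, k, v, gamma):
    """Recurrent (inference) form of one retention head: S_t = gamma*S_{t-1} + k_t^T v_t, o_t = q_t S_t.
    q, k: (L, d); v: (L, d_v); gamma in (0, 1). Constant-size state, O(1) work per token.
    Training can instead use the equivalent parallel form (Q K^T * D) V with D[n, m] = gamma**(n-m)."""
    S = np.zeros((q.shape[1], v.shape[1]))
    out = np.empty_like(v)
    for t in range(len(v)):
        S = gamma * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

# Multi-scale retention assigns each head its own decay, e.g. four heads forgetting at different rates:
gammas = 1.0 - 2.0 ** (-5.0 - np.arange(4))
```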
Performance: RetNet showed that this retention mechanism can indeed match Transformers. On language modeling, RetNet models had nearly identical scaling curves of validation perplexity to Transformers up to billions of parameters. Empirically, the authors found that for model sizes above ~2B, RetNet slightly outperforms Transformers in perplexity, suggesting that retention may even be more parameter-efficient at large scale. In terms of in-context learning and zero-shot tasks, RetNet was “consistently competitive” with Transformers. An analysis in the paper shows that on tasks like question answering, RetNet retained the Transformer's ability to utilize long context (they specifically tested “needle in a haystack” prompts where a model has to find a hint in a long context – RetNet did as well as a Transformer, and better than an RNN without such a mechanism) ([2503.02130] Forgetting Transformer: Softmax Attention with a Forget Gate). This addresses the concern that exponential decay might forget too quickly; apparently, multi-scale decays solve this. Overall, RetNet made a strong case that the attention softmax is not essential for LLM performance – a carefully designed linear attention (retention) can do the job.
Efficiency: RetNet’s benefits are mostly at inference. Inference cost is length-invariant, since each new token update is $\mathcal{O}(1)$: the model keeps only the fixed-size per-head state $S_t$ rather than a growing key–value cache, so memory and latency stay flat as the context grows. For training, RetNet can use the parallel form (or a chunkwise-recurrent hybrid for very long sequences) to remain GPU-efficient, so it does not give up Transformer-style training throughput.
It’s worth noting that several other models explored similar recurrent mechanisms:
- Mega (Ma et al., 2022), introduced earlier than these, is a single-head gated attention that uses an exponential moving average (EMA) operator with gating ([2209.10655] Mega: Moving Average Equipped Gated Attention - arXiv); a sketch of this EMA appears after this list. In essence, Mega is like a one-head RetNet: it replaces softmax with an EMA of values (with some key interaction) and applies LSTM-style gates. Mega achieved excellent results on Long Range Arena (outperforming both Transformers and S4), and on WikiText-103 it slightly outperformed a Transformer of the same size ([R] Mega: Moving Average Equipped Gated Attention. By using ...). It also did well on tasks like machine translation and ImageNet classification, showing the approach’s generality (Mega: Moving Average Equipped Gated Attention | OpenReview). Mega’s success foreshadowed the later larger-scale RNNs. However, Mega at large scale was not reported, whereas RWKV and RetNet took the concepts and scaled to billions of parameters.
- H3 (Hungry Hungry Hippos, Fu et al. 2023) was an earlier attempt by the Hyena authors to merge SSM and attention ideas. H3 can be viewed as an RNN with two SSM filters (a diagonal one and a shifted one) that tries to mimic key–value storage and retrieval. It performed well on some synthetic tasks but needed hybrid attention layers for best results (Hyena Hierarchy: Towards Larger Convolutional Language Models). It was a stepping stone toward Hyena and Mamba.
- HGRN (Qin et al. 2023) stands for Hierarchically Gated Recurrent Network. It introduced a lower-bounded forget gate to ensure gradients flow and used a hierarchical stacking of multiple small RNNs to expand state size (NeurIPS Poster Hierarchically Gated Recurrent Neural Network for ...). HGRN showed very strong training speed and good LM performance at small scales, and the follow-up HGRN2 further improved it by expanding the state size efficiently (HGRN2: Gated Linear RNNs with State Expansion - OpenReview). HGRN2 is reported as a “well-performing RNN-based SOTA language model” (HGRN2: Gated Linear RNNs with State Expansion - OpenReview), though it is relatively new and not as widely benchmarked as RWKV/RetNet.
- DeltaNet (2023) by Shen et al. revisited the idea of fast weights (from Schmidhuber’s 1990s work) in a modern way. It treats the weight update in linear attention as an instance of the “delta rule” (Hebbian-style updates) and accelerates it. Essentially, DeltaNet is another perspective on linear Transformer training, and a gated version was reported to improve performance further (Yikang Shen on X). DeltaNet has been validated on smaller tasks and shows promise for efficient long-range learning.
- FoMo and GLA: Other acronyms like Gated Linear Attention (GLA) and Forgetful MoE (FoMo) appear in recent literature combining gating with linear attention ([PDF] Exploring RNNs for Sample-Efficient Training of Language Models), indicating continued interest in this direction.
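As referenced in the Mega item above, the damped EMA at its core is a short per-channel recurrence; a minimal sketch under assumed shapes:

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Per-channel damped EMA of the kind Mega places before its gated attention:
    h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}, with learned alpha, delta in (0, 1).
    x: (L, D); alpha, delta: (D,). Produces a fixed, decaying summary of the past per channel."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = alpha * x[t] + (1.0 - alpha * delta) * h
        out[t] = h
    return out
```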
Finally, a very recent idea is the Forgetting Transformer (FoX, 2025) ([2503.02130] Forgetting Transformer: Softmax Attention with a Forget Gate) – not an RNN, but it adds a forget gate to standard attention (down-weighting old attention scores in a content-aware way). FoX is interesting because it basically inserts an RNN concept into the Transformer, and it was shown to outperform vanilla Transformers on long-context language modeling and compete well with recurrent models like Mamba-2 ([2503.02130] Forgetting Transformer: Softmax Attention with a Forget Gate). This further reinforces that gating + decay = gains in long sequences.
In summary, the recurrence and gating family has grown robust in 2023–2025. These models leverage gating (like RNNs) and sometimes explicit decay factors to maintain a form of continuous attention. They have achieved parity with attention-based Transformers on language modeling and offer dramatically improved efficiency for long sequences and deployment. The convergence of ideas – from Mega’s EMA, to RWKV’s receptance gate, to RetNet’s multi-scale decay, to Mamba’s input-dependent state – all point to a paradigm: models that learn to remember or forget dynamically can replace the expensive attention matrix with a far cheaper mechanism. The result is that we now have viable alternatives (RNNs and SSMs) that scale to modern LLM requirements.
To wrap up, we compile a comparison of these alternate mechanisms versus standard Transformer attention. Each approach has different trade-offs in complexity, performance, and practical usability:
| Approach (Example Models) | Mechanism & Complexity | LM Performance vs. Transformer | Efficiency & Scalability |
|---|---|---|---|
| Full Softmax Attention (Baseline Transformer) | All-pairs dot-product + softmax; $\mathcal{O}(L^2)$ time & memory. | Baseline; SOTA language performance and strong in-context learning. | High memory use (key–value cache grows with $L$); quadratic compute limits practical context length. |
| Linear / Kernel Attention (Performers, Linformer) | Approximate softmax with feature maps or low-rank projections; $\mathcal{O}(L)$–$\mathcal{O}(Ld)$. | Good on small/medium tasks; quality drops on very large corpora. | Low memory (no $L \times L$ matrix); admits recurrent generation with constant state; often hybridized with some exact attention at scale. |
| Sparse Attention (Longformer, BigBird) | Attend only to local or selected tokens (fixed or learned patterns); $\mathcal{O}(L)$. | Matches Transformers on long-text tasks such as QA and summarization (Hyena Hierarchy, arXiv:2302.10866); used in long-context LMs (e.g. 32k-context GPT variants). | Efficient for long inputs; memory linear in $L$; requires pattern design or tuning per task. |
| Implicit Conv / FFT (Hyena, FNet, AFT) | Long convolutions or Fourier transforms instead of attention; $\mathcal{O}(L \log L)$, often effectively linear. | Hyena reaches parity with Transformers on LM (arXiv:2302.10866); earlier implicit models close but slightly behind. | Highly efficient at long $L$ (Hyena ~100× faster than optimized attention at 64K tokens); hardware-friendly convolution/FFT kernels; no attention matrix to store. |
| State-Space Models (S4, Mamba) | Learnable linear dynamical system (long filter), optionally with input-dependent gating; $\mathcal{O}(L)$ via convolution or scan. | Mamba outperforms a same-size Transformer and matches one ~2× larger (Mamba, OpenReview); older SSMs lagged on text before gating. | Linear scaling; constant-size recurrent state; ~5× generation throughput; handles million-token sequences. |
| Recurrent w/ Gating (RWKV, RetNet, Mega) | RNN-style state updates with content-based gating or decay; $\mathcal{O}(L)$, parallelizable training. | RWKV/RetNet on par with Transformers in perplexity and zero-shot ability (RWKV, OpenReview); Mega (small scale) outperforms prior models on LRA. | Constant memory per token (no KV cache); length-invariant inference cost; scaled to 14B parameters (RWKV). |
Key Takeaways: Alternate attention mechanisms have matured to the point that several can serve as drop-in replacements for self-attention in large language models without loss of accuracy. Convolution-based (Hyena) and recurrent (RWKV, RetNet) approaches in particular have demonstrated transformer-equivalent performance on both language modeling and downstream tasks, while offering significant efficiency gains: linear or better scaling in context length, constant or reduced memory use, and faster inference speeds. State-space models (Mamba) similarly match or exceed transformer quality and excel at ultra-long contexts. Linear and sparse attention methods, which were earlier efforts, laid the groundwork and are still useful for moderate gains in efficiency, though by themselves they fell a bit short on performance at the largest scales.
It appears that introducing adaptivity (gating, content-dependent parameters) is a common thread in the most successful recent models – this allows them to capture the “dynamic routing” ability of attention (which pure linear ops lacked). With that solved, we are seeing a convergence where the gap between Transformers and efficient alternatives has essentially closed. Moving forward, one can envision mainstream LLMs adopting these mechanisms (or hybrids thereof) to handle longer contexts and reduce inference costs. Early signs include research on combining Transformer and RNN blocks, and using gating mechanisms in otherwise standard architectures (e.g. the Forgetting Transformer). The period 2023–2025 has thus been pivotal: it delivered practical, state-of-the-art alternatives to the dominance of attention, expanding the design space for sequence models and promising more scalable NLP systems in the future.
References: sources are cited inline throughout the text, using the paper title, arXiv identifier, or article name in parentheses at the point of each claim, so that performance figures, complexity statements, and other technical details can be traced back to the original works.