
Sequential Attention: Making AI models leaner and faster without sacrificing accuracy


February 4, 2026

Thomas Fu, Principal Engineer, and Kyriakos Axiotis, Senior Scientist, Google Research

We introduce a subset selection algorithm for making large-scale ML models more efficient.


Feature selection is the process of identifying and retaining the most informative subset of input variables while discarding irrelevant or redundant noise. A fundamental challenge in both machine learning and deep learning, feature selection is NP-hard (i.e., no known algorithm can find the optimal subset both exactly and efficiently at scale), and as such, it remains a highly challenging area of research.

In modern deep neural networks, feature selection is further complicated by intricate non-linear feature interactions. A feature may appear statistically insignificant on its own but become critical when combined with others within the network's non-linear layers. Conversely, a feature's contribution may appear significant in isolation but become redundant when other features are taken into account. The core challenge lies in identifying essential features for retention while effectively pruning redundancy within complex model architectures.

More broadly, many ML optimization tasks can be cast as subset selection problems, of which feature selection is a special case. For example, embedding dimension tuning can be viewed as selecting a subset of embedding chunks, and weight pruning as selecting a subset of entries from the weight matrix. Therefore, devising a general solution to the subset selection problem that is applicable to modern deep learning tasks can be highly impactful for building the most efficient models.

Today, we explore our solution to the subset selection problem, called Sequential Attention. Sequential Attention uses a greedy selection mechanism to sequentially and adaptively select the best next component (like a layer, block, or feature) to add to the model. While adaptive greedy algorithms are known to provide strong guarantees for various subset selection problems, such as submodular optimization, naïvely applying them would increase the training cost by many orders of magnitude. To tackle this scalability issue, we integrate selection directly into model training, performing selection within a single training run. This ensures that Sequential Attention can be applied to large-scale ML models with minimal overhead and without sacrificing accuracy. Here we analyze how Sequential Attention works and show how it's being used in real-world scenarios to optimize the structure of deep learning models.

How Sequential Attention works

Sequential Attention leverages the weighting power of the attention mechanism to build a subset step-by-step. In contrast to standard "one-shot" attention, in which all candidates are weighted simultaneously, Sequential Attention addresses the NP-hard nature of subset selection by treating it as a sequential decision process. This is particularly effective for identifying high-order non-linear interactions often missed by "filter methods", which provide the simplest way to pick a subset by focusing only on the merits of each individual item.


Feature Selection in neural networks: Selective pruning of input features to optimize performance. By "switching off" low-utility features, the model simplifies the learning task and reduces the risk of overfitting.

The core idea is to maintain a set of selected candidates and use them as context to find the next most informative candidate to select. This is achieved through two primary methods: greedy selection, which lets the model make a locally optimal decision about which element to include at each step, and importance evaluation, which uses "attention scores" (numerical values indicating the importance or relevance of different input parts) to quantify the importance of every remaining candidate given the currently selected ones. Like the attention mechanism, Sequential Attention uses softmax to rank the importance of different components. Unlike standard attention, however, it works sequentially rather than in one shot, allowing the selection algorithm to adapt to previous selections, a property crucial for high-quality importance ranking.
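To make the mechanism concrete, here is a minimal NumPy sketch of the selection loop for a linear model. This is an illustration, not the production implementation: for clarity it retrains from scratch in each round, whereas the actual algorithm folds selection into a single training run, and the function names and hyperparameters here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequential_attention_select(X, y, k, steps=600, lr=0.1, seed=0):
    """Greedy subset selection with softmax attention gates.

    Each round trains a linear model in which already-selected features
    pass through unchanged while the remaining candidates are scaled by
    trainable softmax attention weights; the candidate with the highest
    attention is then permanently added to the subset.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    selected = []
    for _ in range(k):
        rest = [j for j in range(d) if j not in selected]
        w = rng.normal(scale=0.01, size=d)   # model weights
        logits = np.zeros(len(rest))         # attention logits over candidates
        for _ in range(steps):
            a = softmax(logits)
            gate = np.zeros(d)
            gate[selected] = 1.0             # selected features are fully "on"
            gate[rest] = a                   # candidates are softly gated
            r = X @ (gate * w) - y           # residual of the gated model
            grad_w = 2.0 * gate * (X.T @ r) / n
            dL_da = np.array([2.0 * w[j] * (X[:, j] @ r) / n for j in rest])
            grad_logits = a * (dL_da - a @ dL_da)   # chain rule through softmax
            w -= lr * grad_w
            logits -= lr * grad_logits
        selected.append(rest[int(np.argmax(logits))])
    return selected
```

On synthetic regression data where only a couple of features drive the target, the attention weights concentrate on those features round by round, so the loop recovers the informative subset.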

Sequential Attention benefits

The primary benefits of Sequential Attention are:

- Efficiency: selection happens within a single model training run, avoiding the cost of re-training or re-evaluating the model for every candidate.
- Adaptivity: each selection is made in the context of the candidates already chosen, so redundant components are naturally avoided.
- Generality: the same mechanism applies to features, embedding chunks, and blocks of weight matrices.
- Reliability: for linear regression, the algorithm is equivalent to Orthogonal Matching Pursuit, inheriting its provable guarantees.

Sequential Attention in action

Feature selection

The classic greedy approach to feature selection is computationally expensive, as it requires re-training or re-evaluating the model for every candidate feature at every step. In “Sequential Attention for Feature Selection”, we sought to replace this costly method with a much cheaper proxy: the model’s internal attention weights.

At each step, the Sequential Attention algorithm calculates attention weights for all remaining, unselected features and permanently adds the feature with the highest attention score (the one to which the model is "paying the most attention") to the subset. The algorithm then re-runs the forward pass (feeding input data through the neural network, layer by layer, to generate a prediction) and re-calculates the attention weights for the remaining features. This recalculation naturally reflects each feature's marginal gain (how much it contributes to performance, given the features already selected), allowing the model to effectively identify and avoid adding redundant features.


Feature selection performance. Prediction accuracy of the proposed approach (orange) vs. baselines. Our method achieves competitive or leading results across proteomics, image, and activity recognition benchmarks, confirming its robustness.

The Sequential Attention algorithm achieved state-of-the-art results across several neural network benchmarks. Notably, it drastically improved efficiency, enabling a fast, one-pass implementation of greedy selection without the need for expensive, explicit marginal gain calculations. The study also demonstrated that when applied to a simple linear regression model, the Sequential Attention algorithm is mathematically equivalent to the established Orthogonal Matching Pursuit (OMP) algorithm. This equivalence is critical because OMP comes with provable guarantees of reliability and performance.
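For reference, OMP itself is short. The sketch below is a standard textbook formulation in NumPy (not code from the paper): at each step it selects the feature most correlated with the current residual, then refits least squares on the selected subset. This is the adaptive, marginal-gain-style scoring that Sequential Attention matches on linear regression.

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit: at each step, pick the feature most
    correlated with the current residual, then refit least squares on
    the selected subset."""
    n, d = X.shape
    selected = []
    residual = y.copy()
    for _ in range(k):
        scores = np.abs(X.T @ residual)
        scores[selected] = -np.inf          # never re-pick a feature
        selected.append(int(np.argmax(scores)))
        beta, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ beta
    return selected
```

The refit after each selection is what makes the procedure adaptive: a feature that is merely correlated with an already-selected one contributes little to the residual and will not be picked again.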

Block sparsification

Neural network pruning is essential for deploying large models efficiently because it reduces the model size by removing unnecessary weights. Prior research pursued two largely separate paths: differentiable pruning, which uses trainable parameters as proxies for importance, and combinatorial optimization, which uses algorithms to search for the best sparse structure.


Matrix block sparsification: Identifying and zeroing out non-essential parameter blocks to optimize memory and speed. Unlike unstructured pruning, block-based sparsity leverages hardware acceleration for superior inference performance.

In “SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization”, we sought to unite these two approaches into a coherent framework for structured neural network pruning that removes entire blocks or channels of weights to achieve real-world improvements on hardware accelerators like GPUs and TPUs.

The resulting algorithm, SequentialAttention++, provides a new way to discover the most important blocks of weight matrices, and shows significant gains in model compression and efficiency without sacrificing accuracy in ML tasks, e.g., ImageNet classification.
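To illustrate what block sparsification produces, here is a simple magnitude-based variant in NumPy: it scores each tile of the weight matrix by its Frobenius norm and zeroes out the lowest-scoring tiles. This is only a baseline sketch; SequentialAttention++ scores blocks with learned importances rather than raw magnitudes, and the function and parameter names here are illustrative.

```python
import numpy as np

def block_sparsify(W, block, keep_frac):
    """Magnitude-based block pruning: partition W into (block x block) tiles,
    score each tile by its Frobenius norm, and zero out the lowest-scoring
    tiles so that only a `keep_frac` fraction of them remain."""
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    br, bc = rows // block, cols // block
    tiles = W.reshape(br, block, bc, block)
    scores = np.sqrt((tiles ** 2).sum(axis=(1, 3)))   # Frobenius norm per tile
    k = max(1, int(round(keep_frac * br * bc)))       # number of tiles to keep
    thresh = np.sort(scores.ravel())[-k]              # k-th largest tile norm
    mask = (scores >= thresh).astype(W.dtype)
    return (tiles * mask[:, None, :, None]).reshape(rows, cols)
```

Because whole tiles are zeroed, the resulting sparsity pattern maps directly onto hardware-friendly block operations, unlike unstructured per-weight pruning.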

The future of sequential attention

As the increasing integration of AI models in science, engineering and business makes model efficiency more relevant than ever, model structure optimization is crucial for building highly effective yet efficient models. We have identified subset selection as a fundamental challenge related to model efficiency across various deep learning optimization tasks, and Sequential Attention has emerged as a pivotal technique for addressing these problems. Moving forward, we aim to extend the applications of subset selection to increasingly complex domains.

Feature engineering with real constraints

Sequential Attention has demonstrated significant quality gains and efficiency savings in optimizing the feature embedding layer in large embedding models (LEMs) used in recommender systems. These models typically have a large number of heterogeneous features with large embedding tables, and so the tasks of feature selection/pruning, feature cross search and embedding dimension optimization are highly impactful. In the future, we would like to allow these feature engineering tasks to take real inference constraints into account, enabling fully automated, continual feature engineering.

Large language model (LLM) pruning

The SequentialAttention++ paradigm is a promising direction for LLM pruning. By applying this framework, we can enforce structured sparsity (e.g., block sparsity); prune redundant attention heads, embedding dimensions, or entire transformer blocks; and significantly reduce model footprint and inference latency while preserving predictive performance.

Drug discovery and genomics

Feature selection is vital in the biological sciences. Sequential Attention can be adapted to efficiently extract influential genetic or chemical features from high-dimensional datasets, enhancing both the interpretability and accuracy of models in drug discovery and personalized medicine.

Current research focuses on scaling Sequential Attention to handle massive datasets and highly complex architectures more efficiently. Furthermore, ongoing efforts seek to identify superior pruned model structures and extend rigorous mathematical guarantees to real-world deep learning applications, solidifying the framework’s reliability across industries.

Subset selection is a core problem underlying multiple optimization tasks in deep learning, and Sequential Attention is a key technique for solving it. In the future, we will explore further applications of subset selection to more challenging problems in broader domains.

Conclusion

Sequential Attention is an effective technique for multiple large-scale subset selection problems in deep learning and plays a key role in model architecture optimization. As these techniques evolve, they will help shape the future of machine learning, ensuring that powerful AI remains both accurate and accessible for years to come.

Acknowledgements

We would like to express our gratitude to our research collaborators, Taisuke Yasuda, Lin Chen, Matthew Fahrbach, MohammadHossein Bateni, and Vahab Mirrokni, whose efforts have advanced the development of Sequential Attention. This work builds upon fundamental research in differentiable subset selection and combinatorial optimization to create more efficient and accessible AI models.
