Abstract

Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains including natural language processing and computer vision. To further improve their accuracy, recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-expert (MoE) layers. In this paper, we explore the introduction of MoE layers to optimize a different metric: inference latency. We introduce a novel system named planer that takes an existing Transformer-based network and a user-defined latency target and produces an optimized, sparsely-activated version of the original network that tries to meet the latency target while maintaining baseline accuracy. We evaluate planer on two real-world language modeling tasks using the Transformer-XL network and achieve inference latency reductions of over 2x at iso-accuracy.

Efficient Sparsely Activated Transformers

Salar Latifi (University of Michigan), Saurav Muralidharan (NVIDIA), Michael Garland (NVIDIA)

Affiliations: Department of Computer Science and Engineering, University of Michigan, Ann Arbor, USA; NVIDIA Corporation, Santa Clara, USA

Correspondence: Salar Latifi <salar@umich.edu>, Saurav Muralidharan <sauravm@nvidia.com>

1 Introduction

Attention-based deep neural networks (DNNs) such as Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2018) have been shown to exhibit state-of-the-art performance across a variety of machine learning domains, including natural language processing (Wolf et al., 2020) and computer vision (Dosovitskiy et al., 2020). Due to their size and complexity, they are expensive to train and deploy, especially on resource-constrained hardware. In particular, attention layers, which form the building blocks of such networks, account for the majority of network runtime. Figure 1 illustrates this using the Transformer-XL network (Dai et al., 2019); here, we show the proportion of inference latency that each layer type is responsible for on two different GPUs: the NVIDIA V100 and NVIDIA A100. We notice that on both GPUs, attention layers (shown in red) account for over 80% of total inference latency, with the rest coming from feed-forward (blue) and embedding layers (green). Because attention layers have such an outsize influence on total inference latency, recent work has explored various approaches for runtime performance optimization that specifically target them; this includes work such as PAR Transformer (Mandava et al., 2020), where attention layers are re-distributed within the network to optimize performance, and various papers on pruning attention heads and/or entire attention layers (Wang et al., 2020).

A separate body of work has explored the addition of sparsely activated layers to Transformer models to improve task performance (Shazeer et al., 2017). In particular, mixture-of-expert (MoE) Transformer variants such as Switch Transformer (Fedus et al., 2021) have demonstrated state-of-the-art task performance while simultaneously reducing training and inference costs. While most work in this direction has focused on improving task accuracy, in this paper we attempt to answer the following question: can the addition of sparsely activated layers help preserve accuracy in the face of latency-optimizing network transformations such as skipping/pruning attention layers? And if so, to what extent?

Figure 1: Profiling results for different Transformer-XL layers on NVIDIA V100 and A100 GPUs

Figure 2: Exploration results for Transformer-XL Base model on enwik8 dataset for different latency targets.

To help answer this question, we present planer, a novel system for designing latency-aware sparsely activated Transformer networks. Given a Transformer-based model as input, along with an inference latency target expressed as a percentage of the baseline model’s latency, planer produces a sparsely-activated Transformer model that fulfills the latency objective while preserving baseline accuracy. planer employs an efficient two-phase gradient descent-based neural architecture search (NAS) strategy with a dynamic loss formulation to achieve this. During the search process, planer efficiently explores the large number of alternative architectures arising from different combinations of feed-forward, attention (with varying number of heads), and mixture-of-expert layers; as a concrete example, planer considers over 68 billion unique architectures for the Transformer-XL model in our evaluation. The optimized architecture obtained from NAS is then fine-tuned using a load-balancing loss term to produce the final network. Figure 2 demonstrates how planer infers different architectures depending on the user-provided inference latency targets. Here, each of the inferred architectures matches baseline accuracy, but has different inference latencies. Depending on the latency target, we notice that planer progressively reduces the number of attention layers and their widths, while using additional MoE and/or feed forward layers to compensate for potential accuracy drops.

We evaluate planer on two different Transformer-based networks drawn from language modeling, and demonstrate an inference latency reduction of at least $2 \times$ for each network while maintaining baseline accuracy. We also compare planer with prior work such as PAR Transformer (Mandava et al., 2020) and Sandwich Transformer (Press et al., 2019), and with parameter-matched non-MoE implementations of the final optimized networks.

2 Background and Motivation

Mixture-of-expert (MoE) networks (Masoudnia and Ebrahimpour, 2014) dynamically partition the input domain so that each sub-network or “expert” specializes in one or more input partitions, yielding a sparsely activated network. Recent work has explored the application of MoE layers to efficiently increase the model capacity of Transformer-based architectures (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021; He et al., 2021). These sparsely-activated architectures have been shown to achieve similar accuracy gains without the proportional increase in computation that traditional scaling of network parameters incurs (Raffel et al., 2019). In this work, we focus on applying MoE layers to improve inference latency while maintaining baseline accuracy.

Figure (a) depicts a general implementation of an MoE layer with three experts. The sequence of input tokens is distributed among the experts for processing, where each token is processed by one or more experts. The number of experts per token is denoted as $Top_{K}$ in this work. In Figure (a), $Top_{K}$ is two. A single-layer linear classifier called a Gate (Figure (b)) decides which expert(s) to use to process a specific token. The Gate generates a probability distribution across the experts per token, which is then used to select the $Top_{K}$ experts.
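The gating step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this work: the helper names (`softmax`, `gate`, `select_top_k`) and the toy weights are hypothetical, and a real gate would operate on batched tensors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate(token, gate_weights):
    """Single-layer linear gate: one weight row per expert (hypothetical layout)."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    return softmax(logits)

def select_top_k(probs, top_k):
    """Indices of the top_k experts with the highest gate probability."""
    ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
    return ranked[:top_k]

# Toy values (assumed for illustration): 3 experts, 2-dimensional token.
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
token = [2.0, 1.0]
probs = gate(token, gate_weights)          # distribution over the 3 experts
experts = select_top_k(probs, top_k=2)     # experts that process this token
```

With $Top_{K} = 2$, the token is routed to the two experts with the highest gate probability; the remaining expert is not evaluated for this token.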

Layer-wise Performance Analysis: To better understand the performance behavior of Transformer-based networks, we present layer-wise profiled latencies for the Transformer-XL Base network in Figure 6. Here, each bar represents the latency of a network block normalized to the latency of default multi-head attention with 8 heads. Profiling is performed with a model dimension of 512, $target\_len$ of 64, and batch size of 64 on an NVIDIA A100 GPU. We observe three key points from the figure: (1) the significant cost of the default attention configuration, amounting to a $6.2 \times$ higher runtime compared to the default feed-forward layer (FFL) with an inner dimension of 2048, (2) the approximately linear scaling of the attention cost with respect to the number of heads (pruning attention heads and/or blocks could thus play a significant role in improving network performance), and (3) the compute efficiency of the MoE blocks compared to both attention and iso-parametric FFL blocks (iso-parametric FFL blocks are obtained by scaling up an FFL block to match the number of parameters in a corresponding MoE block), signifying the promise of using MoE blocks as a cost-effective solution to compensate for the potential accuracy loss caused by aggressive attention pruning.

Figure 6: Latency comparison of attention, FFL, and MoE layers normalized w.r.t. attention with 8 heads, profiled on NVIDIA A100 GPU with batch size of 64, sequence length of 192, and half-precision.

3 Searching for Efficient Transformers

In this section, we provide a thorough description of planer’s two-phase NAS methodology for finding optimal latency-aware Transformers.

3.1 Phase 1: Search Space Exploration

Transformer-based models are composed of multiple blocks, where each block consists of multi-head attention (MHA) and feed-forward layers (FFLs) (Vaswani et al., 2017). MoEs could thus be applied to either MHA or FFLs, or both. In this work, we only explore MoE FFLs in the design space; this is primarily due to the runtime overhead introduced by dynamic behavior, which we found to be prohibitively high for the already expensive attention layers. planer’s first phase explores the large design space composed of different configurations of MHAs, FFLs, and MoE layers. The inputs to the first phase are the design space, the backbone of the baseline network architecture, and a target latency, expressed as a ratio w.r.t. the baseline latency.

For real-world networks, the design space of alternative architectures often gets prohibitively large; for instance, the Transformer-XL Base network on the enwik8 dataset yields a search space size of over 68 billion architectures. To keep the search tractable, we deploy a differentiable NAS strategy, which has been shown to be significantly more efficient than reinforcement-learning-based approaches (Zoph and Le, 2016). We follow a NAS algorithm similar to the one proposed by Wu et al. (2019).

Phase 1 first composes a search architecture using the baseline network’s backbone as depicted in Figure 7. The backbone includes details on the number of blocks (MHA or FFLs) and their configuration (number of heads or hidden dimension). Using the input backbone, each of the MHA or FFL blocks in the baseline network is replaced with a Super Block (SB), which includes all the search options in the design space. The goal is to find the best option for each block so that overall accuracy is maximized and the latency target is achieved. Figure 8 depicts the formulation of super blocks. Each of the search options $Block_{i}$ is accompanied by a corresponding architectural weight $\alpha_{i}$, which is trained using gradient descent to represent the benefit factor of the search option (Wu et al., 2019). To make the optimization graph differentiable with respect to the architecture weights, the output of the super block is formulated as:

$$Output = \sum_{i=0}^{n} P_{i} \times Block_{i}(Input) \quad \text{s.t.} \quad P_{i} = GumbelSoftmax(\alpha_{i}, [\alpha_{0}, \ldots, \alpha_{n}]) \tag{1}$$

Here, $GumbelSoftmax$ generates probability values by sampling the Gumbel distribution based on the $\alpha$ weights.
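To make Equation (1) concrete, the following plain-Python sketch mixes the outputs of a few candidate blocks using Gumbel-Softmax probabilities. It is an illustration under stated assumptions, not the implementation used by planer: the function names, the scalar toy blocks, and the `hard` flag (a one-hot selection of the most probable option) are hypothetical, and no gradients are computed here.

```python
import math
import random

def gumbel_softmax(alphas, temperature, rng, hard=False):
    """Sample probabilities P_i from architecture weights alpha_i."""
    noisy = []
    for a in alphas:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((a - math.log(-math.log(u))) / temperature)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    probs = [e / total for e in exps]
    if hard:
        # One-hot selection of the most probable option.
        best = max(range(len(probs)), key=lambda i: probs[i])
        return [1.0 if i == best else 0.0 for i in range(len(probs))]
    return probs

def super_block(x, blocks, alphas, temperature, rng, hard=False):
    """Output = sum_i P_i * Block_i(x); options with P_i == 0 are skipped."""
    probs = gumbel_softmax(alphas, temperature, rng, hard)
    return sum(p * block(x) for p, block in zip(probs, blocks) if p > 0.0)

# Toy scalar "blocks" standing in for skip, FFL, and MHA options (assumed).
blocks = [lambda x: x, lambda x: 2.0 * x, lambda x: x + 1.0]
alphas = [0.1, 0.2, 0.3]
soft_out = super_block(3.0, blocks, alphas, temperature=5.0, rng=random.Random(0))
hard_out = super_block(3.0, blocks, alphas, temperature=5.0, rng=random.Random(0), hard=True)
```

In the soft case every option contributes to the output, whereas in the hard case only the selected option is evaluated, which is what makes hard sampling cheaper.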

Figure 7: Composing the search network from the input network backbone

Figure 8: Formulating super blocks from the search space.

This formulation yields two sets of parameters to be trained in Phase 1. The first group contains the actual network weights (the parameters of each $Block_{i}$), and the second group the architectural weights ($\alpha_{i}$). The two parameter groups are trained sequentially in each epoch, using separate optimizers. Thus each epoch of training in Phase 1 consists of optimizing the network weights using 100% of the training samples, and then training the architecture weights using a randomly sampled 20% of the training data. We use soft sampling for $GumbelSoftmax$ during architecture optimization, and hard sampling while training the network weights to reduce the overheads associated with the super blocks. To ensure that neither of the network weight sets is starved due to the hard sampling of $GumbelSoftmax$, the architecture optimization is initially disabled for 10% of the epochs, and an annealing temperature schedule is used for later epochs. These settings allow the blocks to be randomly sampled for the appropriate number of search epochs.
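The per-epoch schedule described above can be summarized with a small planning function. This is a sketch under assumptions: the text does not give the exact annealing formula, so a multiplicative per-epoch decay is assumed here, and the function and field names are hypothetical. The default initial temperature (5) and annealing rate (0.7) mirror the enwik8 settings listed in Section 4.1.

```python
def phase1_schedule(num_epochs, warmup_fraction=0.10, arch_data_fraction=0.20,
                    initial_temperature=5.0, anneal_rate=0.7):
    """Return a per-epoch plan for alternating weight/architecture training."""
    warmup_epochs = int(round(num_epochs * warmup_fraction))
    plan = []
    temperature = initial_temperature
    for epoch in range(num_epochs):
        # Architecture weights stay frozen during the warm-up epochs.
        train_arch = epoch >= warmup_epochs
        plan.append({
            "epoch": epoch,
            "weight_data_fraction": 1.0,   # network weights see all samples
            "arch_data_fraction": arch_data_fraction if train_arch else 0.0,
            "temperature": temperature,
        })
        if train_arch:
            # Assumed annealing rule: multiply the temperature each epoch.
            temperature *= anneal_rate
    return plan

plan = phase1_schedule(num_epochs=10)
```

For a 10-epoch search, this plan trains only the network weights in the first epoch and enables architecture optimization with a decaying temperature from the second epoch onward.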

3.2 User-defined Latency Optimization

To incorporate latency optimization in the search phase, we formulate an auxiliary loss based on the latencies of the search and baseline networks, as well as the target latency. In Phase 1, we estimate the end-to-end latency of both the search network and the baseline using lookup tables filled with individual block latencies, similar to prior work (Wu et al., 2019). Equation (2) presents the formulation for the estimated latency, which accumulates the latencies of the individual super blocks ($Lat\_SB$).

$$Lat = \sum_{b=0}^{B} {Lat\_SB}_{b} \quad \text{s.t.} \quad {Lat\_SB}_{b} = \sum_{i=0}^{n} P_{bi} \times Lat_{i} \tag{2}$$

Here, $Lat_{i}$ represents the profiled latency of $Block_{i}$ in isolation, and the $P_{bi}$ values correspond to the probability values for super block $b$ as sampled in Equation (1) with respect to the architecture weights.
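As a rough illustration of Equation (2), the estimate is a probability-weighted sum over a latency lookup table, accumulated across super blocks. The function name and the latency values below are hypothetical placeholders, not profiled numbers.

```python
def estimate_latency(probs_per_block, block_latency):
    """Lat = sum_b sum_i P_bi * Lat_i, using a per-option lookup table."""
    total = 0.0
    for probs in probs_per_block:          # one probability vector per super block
        total += sum(p * lat for p, lat in zip(probs, block_latency))
    return total

# Hypothetical lookup table in relative units: [skip, MHA, FFL].
block_latency = [0.0, 1.0, 0.25]
# Two super blocks: the first fully selects MHA, the second splits skip/FFL.
probs_per_block = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]]
estimated = estimate_latency(probs_per_block, block_latency)
```

Because the estimate is a linear function of the $P_{bi}$ values, it stays differentiable with respect to the architecture weights when implemented in an automatic-differentiation framework.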

The latency loss $Lat_{Loss}$ is the ratio of the estimated latency of the search network ($Lat$) to the latency budget, i.e., the baseline latency scaled by the target latency ratio. It is combined with the cross-entropy loss as follows:

$$Loss = CE_{Loss} + \beta \times Lat_{Loss} \quad \text{s.t.} \quad Lat_{Loss} = \frac{Lat}{Lat_{Baseline} \times Target_{Lat}}, \quad \beta = \begin{cases} 1 & \text{if } Lat_{Loss} > 1 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

During the training of the architecture weights, the latency loss is automatically activated depending on whether the estimated latency of the search network meets the target latency requirement. For example, if the target latency is set to 50% of the baseline, the latency loss is only included if the estimated latency is higher than $0.5 \times Lat_{Baseline}$. Otherwise, the scalar factor $\beta$ is 0 in Equation (3), leading the optimizer to adjust the architecture weights solely in the direction of minimizing $CE_{Loss}$. This novel dynamic functionality helps the search progress towards the user latency target without the need for additional hyper-parameter tuning.
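The switching behavior of Equation (3) can be sketched as follows. The function and argument names are hypothetical, and the numeric values are illustrative only; in practice the losses would be tensors in a training framework.

```python
def latency_loss(est_latency, baseline_latency, target_ratio):
    """Lat_Loss = Lat / (Lat_Baseline * Target_Lat)."""
    return est_latency / (baseline_latency * target_ratio)

def total_loss(ce_loss, est_latency, baseline_latency, target_ratio):
    """Loss = CE_Loss + beta * Lat_Loss, with beta = 1 only above the budget."""
    lat_loss = latency_loss(est_latency, baseline_latency, target_ratio)
    beta = 1.0 if lat_loss > 1.0 else 0.0
    return ce_loss + beta * lat_loss

# Target of 50% of a baseline latency of 100 (arbitrary units).
over_budget = total_loss(ce_loss=2.0, est_latency=60.0,
                         baseline_latency=100.0, target_ratio=0.5)
within_budget = total_loss(ce_loss=2.0, est_latency=40.0,
                           baseline_latency=100.0, target_ratio=0.5)
```

In the first call the estimate (60) exceeds the budget (50), so the latency term of 1.2 is added; in the second call the estimate is within budget and only the cross-entropy term remains.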

3.3 Phase 2: Architecture Sampling and Retraining

The optimized architecture obtained from Phase 1 is now instantiated for retraining. Since the weights of this final architecture were shared with other search points during Phase 1, a retraining step is necessary to avoid under-fitting and to obtain optimal accuracy. We construct the optimized architecture by selecting the blocks with the highest architecture weight values in each super block; from our empirical evaluation, this sampling strategy best balances additional training overheads with accuracy compared to other approaches such as the one described in Liu et al. (2018). We retrain the sampled architecture from scratch using the same settings as the baseline.
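The selection rule described above (pick the option with the highest architecture weight in each super block) amounts to a per-block argmax. The sketch below uses hypothetical option names and weight values purely for illustration.

```python
def sample_architecture(alphas_per_block, option_names):
    """Pick, for every super block, the option with the largest alpha."""
    architecture = []
    for alphas in alphas_per_block:
        best = max(range(len(alphas)), key=lambda i: alphas[i])
        architecture.append(option_names[best])
    return architecture

# Hypothetical options and trained architecture weights for three super blocks.
option_names = ["skip", "mha_8", "ffl", "moe_top2"]
alphas_per_block = [
    [0.1, 0.9, 0.3, 0.2],
    [0.7, 0.1, 0.4, 0.2],
    [0.0, 0.2, 0.3, 0.8],
]
final_arch = sample_architecture(alphas_per_block, option_names)
```

The resulting list of options defines the network that is then retrained from scratch in Phase 2.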

3.4 Balancing Load Across Experts in MoE Layers

Since MoE blocks may be part of the final architecture, we incorporate an auxiliary loss during Phase 2 to enforce a balanced load across the experts. We follow the same implementation of the auxiliary loss for load balancing ($Balance_{Loss}$) as Switch Transformer (Fedus et al., 2021). Consider an MoE layer with $E$ experts:

$$Loss = CE_{Loss} + Balance_{Loss} \quad \text{s.t.} \quad Balance_{Loss} = E \times \sum_{e=0}^{E} F_{e} \times G_{e} \tag{4}$$

Here, $F_{e}$ represents the fraction of the tokens processed by expert $e$, and $G_{e}$ measures the average gate score received by expert $e$ across the input tokens.

The $Balance_{Loss}$ provides an approximation for the load balancing score across experts. If the tokens are distributed uniformly across the experts by the gate function, we can expect each expert to process $\frac{1}{E}$ of the input tokens, while receiving an average score of $\frac{1}{E}$ from the gate. This would result in $Balance_{Loss}$ having an ideal value of 1 in a fully-uniform distribution of tokens across the experts. If there is more than one MoE layer in the architecture, the $Balance_{Loss}$ is the average of the individual loss values across the MoE layers.
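A small worked example may help confirm the ideal value of 1. The sketch below computes $Balance_{Loss}$ for a single MoE layer, summing over its $E$ experts, and assumes one expert per token ($Top_{K} = 1$) for simplicity; the function names and toy routing data are hypothetical.

```python
def balance_loss(assignments, gate_probs, num_experts):
    """E * sum_e F_e * G_e for one MoE layer (Top_K = 1 assumed)."""
    num_tokens = len(assignments)
    # F_e: fraction of tokens routed to expert e.
    frac = [0.0] * num_experts
    for e in assignments:
        frac[e] += 1.0 / num_tokens
    # G_e: average gate probability assigned to expert e over all tokens.
    avg_gate = [sum(p[e] for p in gate_probs) / num_tokens
                for e in range(num_experts)]
    return num_experts * sum(f * g for f, g in zip(frac, avg_gate))

def mean_balance_loss(per_layer_losses):
    """Average the per-layer values when several MoE layers are present."""
    return sum(per_layer_losses) / len(per_layer_losses)

# Uniform routing: 4 tokens split evenly over 2 experts.
uniform = balance_loss([0, 1, 0, 1], [[0.5, 0.5]] * 4, num_experts=2)
# Skewed routing: every token goes to expert 0 with high gate probability.
skewed = balance_loss([0, 0, 0, 0], [[0.9, 0.1]] * 4, num_experts=2)
```

The uniform case evaluates to 1, while the skewed case evaluates to roughly 1.8, so minimizing this term pushes the gate towards an even distribution of tokens.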

Figure: (a) Comparison of $CE_{Loss}$ and $Balance_{Loss}$. (b) Comparison of MoE runtime across different batch sizes.

Figure (a) compares the Phase 2 training progress of a Transformer-XL architecture with multiple MoE layers under two scenarios: (1) when the $Balance_{Loss}$ term is excluded from the loss function (Relaxed Load Balancing), and (2) when the loss function includes the $Balance_{Loss}$ term (Enforced Load Balancing). From the figure, we notice that trends for the $CE_{Loss}$ term are similar in both scenarios, highlighting the fact that overall accuracy of the network is unaffected by load balancing constraints. From our experiments, we also notice that a balanced load improves the runtime of MoE layers by reducing tail latency; we illustrate this in Figure (b). Here, we notice a runtime speedup of up to $1.16 \times$ for MoE layers when load balancing is enforced.

4 Evaluation

We evaluate planer on two real-world language modeling tasks and compare the performance of the latency-optimized networks to other state-of-the-art efficient Transformer models. We also provide a detailed analysis of the impact of using our dynamic loss formulation.

4.1 Methodology

We use Transformer-XL (TXL) Base on the WikiText-103 (WT103) and enwik8 datasets as our baseline networks. The backbone architecture for both datasets uses a model dimension of 512 and an interleaved pattern of multi-head attention (MHA) with 8 heads and feed-forward layers (FFLs) with an inner dimension of 2048. The total number of blocks (MHA/FFL) is 24 and 32 for enwik8 and WT103, respectively (the number of MHA/FFL blocks is $2 \times$ the number of Transformer blocks). The search space for Phase 1 includes: (1) Skip connection, (2) MHA with 1, 2, 4, or 8 heads, (3) FFL with inner dimension of 2048, and (4) MoE FFL with inner dimension of 2048, 8 experts, where each token is processed by either 1 or 2 experts ($Top_{K} = 1$ or $2$).
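For readers who want to reproduce the per-block option list, the search space above can be written down as a small configuration. This is only a sketch: the `BlockOption` class and its field names are hypothetical and are not taken from the planer code base.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BlockOption:
    """One candidate for a super block (field names are illustrative)."""
    kind: str
    heads: Optional[int] = None
    inner_dim: Optional[int] = None
    experts: Optional[int] = None
    top_k: Optional[int] = None

def build_search_space():
    """Enumerate the per-block options listed in the methodology."""
    options = [BlockOption("skip")]
    options += [BlockOption("mha", heads=h) for h in (1, 2, 4, 8)]
    options.append(BlockOption("ffl", inner_dim=2048))
    options += [BlockOption("moe_ffl", inner_dim=2048, experts=8, top_k=k)
                for k in (1, 2)]
    return options

search_space = build_search_space()
```

Enumerated this way, each super block chooses among eight candidates: one skip connection, four MHA widths, one FFL, and two MoE FFL variants.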

To evaluate the performance of planer, we compare the latency and accuracy of the optimized models with the baseline TXL model and two prior papers: Sandwich Transformer (Press et al., 2019) and PAR Transformer (Mandava et al., 2020). The design space is explored using planer’s 2-phase methodology (described in more detail in Section 3) with target latencies ranging from 50% to 95% of the baseline latency.

All training is performed on a node with 8 NVIDIA V100 GPUs. We use the hyper-parameter settings published by NVIDIA (NVIDIA). The exact hyper-parameters used for each dataset are:

WikiText-103 - Network Weights (Phase 1 and 2): JITLamb optimizer, learning rate of 0.01, batch size of 256, target and memory length of 192, dropout rate of 0.1 for non-MoE layers and 0.2 for MoE layers, and 40000 iterations.
WikiText-103 - Architecture Weights (Phase 1): Adam optimizer, learning rate of 0.01, initial temperature of 5 for the Gumbel Softmax, and temperature annealing rate of 0.6.
enwik8 - Network Weights (Phase 1 and 2): JITLamb optimizer, learning rate of 0.004, batch size of 64, target and memory length of 512, dropout rate of 0.1 for non-MoE layers and 0.3 for MoE layers, and 120000 iterations.
enwik8 - Architecture Weights (Phase 1): Adam optimizer, learning rate of 0.01, initial temperature of 5 for the Gumbel Softmax, and temperature annealing rate of 0.7.

4.2 Accuracy and Performance Trade-offs

Model	WT103 Dev (PPL)	WT103 Test (PPL)	enwik8 Dev (BPC)	enwik8 Test (BPC)
Transformer-XL Base	22.7	23.4	1.114	1.088
Sandwich Transformer-XL	22.6*	-	1.107	1.083
PAR Transformer-XL	22.7*	-	1.121	1.119
planer Transformer-XL	22.5	23.5	1.109	1.083

Table 1: Accuracy comparison of planer with prior work and baselines (scores marked with * are referenced). Lower is better for both PPL and BPC metrics.

Table 1 lists the accuracy numbers obtained by planer and compares them with the baseline architectures. We notice that all the TXL variants, including ones produced by planer, maintain baseline accuracy levels. We provide a detailed comparison of the different architectures in Appendix A.

Figure 14 shows the speedups obtained by planer and the various baselines (described in Section 4.1) across both datasets and varying batch sizes. From the figure, we notice that planer provides speedups of over $2 \times$ at larger batch sizes. On smaller batch sizes, PAR Transformer outperforms planer; this is primarily due to the unoptimized MoE layers used in our current implementation. Specifically, our current implementation computes the outputs of each expert sequentially, where a batch of sequences with $N$ tokens is processed sequentially in mini-batches of size $\frac{Top_{K} \times N}{Experts}$. This leads to under-utilization of the compute units.
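The sequential execution pattern described above can be illustrated with a toy loop that visits one expert at a time. This is a simplified sketch, not the actual implementation: the function name, the scalar toy experts, and the choice to weight each expert output by its gate score are assumptions made for illustration.

```python
def moe_forward_sequential(tokens, routes, experts):
    """Process experts one after another; routes[t] lists (expert, weight) pairs."""
    outputs = [0.0] * len(tokens)
    mini_batch_sizes = []
    for expert_id, expert in enumerate(experts):
        # Gather the (token index, gate weight) pairs routed to this expert.
        routed = [(t, w) for t, token_routes in enumerate(routes)
                  for e, w in token_routes if e == expert_id]
        mini_batch_sizes.append(len(routed))
        for t, w in routed:
            outputs[t] += w * expert(tokens[t])
    return outputs, mini_batch_sizes

# Toy setup: N = 4 tokens, 4 experts, Top_K = 2, perfectly balanced routing.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [1.0, 2.0, 3.0, 4.0]
routes = [[(0, 0.5), (1, 0.5)], [(2, 0.5), (3, 0.5)],
          [(0, 0.5), (1, 0.5)], [(2, 0.5), (3, 0.5)]]
outputs, mini_batch_sizes = moe_forward_sequential(tokens, routes, experts)
```

With balanced routing, each expert receives $\frac{Top_{K} \times N}{Experts} = \frac{2 \times 4}{4} = 2$ tokens, so every step of the outer loop works on a mini-batch that is much smaller than the full batch.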

Figure 15: Runtime comparison of FFL, MHA, and MoE layers across different batch sizes normalized to FFL runtime.

Figure 15 provides a more detailed overview of the current deficiencies in the sequential implementation of the MoE layers; here, we provide a runtime comparison of the FFL, MHA, and MoE layers across different batch sizes normalized with respect to FFL runtime. At lower batch sizes, MoE layers have an overhead of $7 \times$ over FFL, which is also higher than that of the MHA layers. However, as batch size increases, GPU resource utilization goes up, consequently decreasing the overhead of MoE layers to less than $3 \times$. The oracle implementation (dashed orange line in the figure) shows the theoretically optimal runtime of the MoE layer. Since we use a $Top_{K}$ value of 2 (viz., each input token is processed by 2 experts), we notice a corresponding $2 \times$ runtime overhead over the baseline FFL. Note that the oracle runtime does not take into account overheads related to gate function evaluation and the gathering/scattering of tokens across experts; the real-world runtime is thus likely to be higher. We are currently working on a more optimized parallel implementation of MoE layers, which will help close this performance gap across various batch sizes.

4.3 Comparison to Iso-parametric Setting

We also compare planer to an iso-parameter setup, which replaces the MoE with a scaled FFL in the search space. The scaled FFL has an inner dimension of 16384, which results in the same number of parameters as the MoE with 8 experts. The goal of the iso-parameter experiment is to analyze the effectiveness of different model scaling solutions in compensating for accuracy drops caused by aggressive attention pruning.
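The parameter match can be checked with simple arithmetic. The sketch below assumes a standard two-projection FFL and counts only the projection weights, ignoring biases and the MoE gate; the helper name is hypothetical.

```python
def ffl_weight_params(model_dim, inner_dim):
    """Weights of the two projection matrices of an FFL (biases ignored)."""
    return 2 * model_dim * inner_dim

MODEL_DIM = 512
# MoE FFL: 8 experts, each an FFL with inner dimension 2048.
moe_params = 8 * ffl_weight_params(MODEL_DIM, 2048)
# Scaled FFL: a single FFL with inner dimension 16384 (= 8 * 2048).
scaled_ffl_params = ffl_weight_params(MODEL_DIM, 16384)
```

Under these assumptions both configurations hold 16,777,216 projection weights, but the scaled FFL uses all of them for every token, whereas the MoE layer only activates the $Top_{K}$ selected experts per token.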

Figure 16: Comparison of the Pareto frontiers of the optimized architectures obtained by planer for MoE and Iso-parameter scaled FFL setups.

Figure 16 presents the comparison of the Pareto frontiers of the architectures obtained by planer with different latency targets on the WikiText-103 dataset. From the Figure, we clearly notice that the use of MoE layers results in higher performance architectures across the board at different accuracy levels. Further performance benchmarking reveals that scaled FFL layers are at least $2 \times$ slower than our (relatively unoptimized) MoE layers and actually approach the runtime of the much slower MHA layers with 8 heads. Naively scaling up the size of FFLs is thus not an ideal option for either improving accuracy or performance.

4.4 Validating Estimated and End-to-end Runtime

Figure: (a) Target vs. estimated latency. (b) Estimated vs. end-to-end latency.

In this section, we analyze the performance of the dynamic latency loss used in Phase 1. Figure (a) shows the correlation between input target latency and the estimated latency of the architectures sampled at the end of Phase 1, while Figure (b) shows the correlation between estimated latency and profiled end-to-end latency. We make two important observations from the figures: (1) our dynamic latency loss formulation successfully steers the NAS towards architectures that match the input target latency, and (2) the latency estimated in Equation (2) is highly correlated with real-world latency, making it an appropriate option for planer’s Phase 1 search.

4.5 Repeatability Evaluation

To validate the reproducibility of our experiments and observe any potential variations in the final architectures, we repeat the planer optimization of the architectures evaluated in Section 4.2. For this experiment, we keep all hyper-parameters fixed, but repeat planer’s search process four times. Figure 22 presents the achieved accuracy and speedup numbers from our experiment. We notice that all the accuracy values are within 0.5% of the baseline, with speedups consistently over $2 \times$. The variations in the final architectures across the two datasets are presented in Appendix B. Although the architectures do not match exactly, we notice a strong similarity in the number of heads in the attention layers. We also notice that MoE layers tend to be concentrated towards the end of the networks across both datasets.

5 Related Work

The introduction of the Transformer family of networks has overhauled the domain of NLP. These attention-based architectures have been shown to outperform their LSTM-based counterparts both in terms of effectively capturing time dependencies (Vaswani et al., 2017) and in terms of inference latency (Shi et al., 2021). The general architecture of these models consists of multiple Transformer blocks, where each Transformer block consists of multi-head attention and feed-forward layers.

Recent work has introduced Mixture-of-Expert (MoE) layers within networks to decompose tasks into sub-tasks, where experts can be trained on individual sub-tasks (Masoudnia and Ebrahimpour, 2014). One motivation behind this idea is to dynamically partition the input space, with experts specializing in individual partitions. Recent work has also studied the application of MoE layers to efficiently increase the model capacity of Transformer-based architectures (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021; He et al., 2021). These sparsely-activated architectures have been shown to achieve accuracy gains without the proportional increase in computation that traditional scaling of network parameters incurs (Raffel et al., 2019). While MoE layers have been applied for accuracy improvement and training speed-ups, their use in designing latency-aware architectures has not been explored as thoroughly.

A separate body of work has also focused on optimizing the performance of Transformer models. In particular, Press et al. (2019) show that it is possible to achieve better accuracy by redistributing multi-head attention and FFL layers across the network while maintaining the original runtime. PAR Transformer (Mandava et al., 2020) deploys NAS to explore the number and distribution of attention layers (while keeping the same head count) for improved latency. Wang et al. (2020) prune attention heads and reduce the width of FFLs (while keeping the same backbone as the baseline) using an evolutionary NAS algorithm to design hardware-aware Transformers. While recent work has explored the distribution or configuration of individual (non-MoE) layers in isolation, none of these approaches considers both aspects simultaneously as part of a larger NAS search space. Additionally, as we demonstrate in this paper, the inclusion of MoE layers in the design space can help reduce inference latency further by allowing attention layers to be removed or pruned more aggressively while maintaining baseline accuracy.

6 Conclusion

This paper has presented planer, an automated system for optimizing the inference latency of Transformer-based networks. planer employs a two-phase NAS methodology to systematically introduce sparsely activated layers into the given network, and uses a dynamic loss formulation to achieve user-provided latency targets while preserving accuracy. On two real-world NLP models, planer achieves inference latency reductions of over $2 \times$ at iso-accuracy.

Acknowledgements

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. Distribution Statement “A” (Approved for Public Release, Distribution Unlimited).

References

Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-XL: Attentive Language Models beyond a Fixed-length Context. arXiv preprint arXiv:1901.02860.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
W. Fedus, B. Zoph, and N. Shazeer (2021) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961.
J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang (2021) FastMoE: A Fast Mixture-of-Expert Training System. arXiv preprint arXiv:2103.13262.
D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen (2020) GShard: scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European conference on computer vision (ECCV), pp. 19–34.
S. Mandava, S. Migacz, and A. F. Florea (2020) Pay attention when required. arXiv preprint arXiv:2009.04534.
S. Masoudnia and R. Ebrahimpour (2014) Mixture of experts: a literature survey. Artificial Intelligence Review 42 (2), pp. 275–293.
NVIDIA. Transformer-XL for PyTorch: NVIDIA NGC.
O. Press, N. A. Smith, and O. Levy (2019) Improving Transformer Models by Reordering their Sublayers. arXiv preprint arXiv:1911.03864.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
Y. Shi, Y. Wang, C. Wu, C. Yeh, J. Chan, F. Zhang, D. Le, and M. Seltzer (2021) Emformer: efficient memory transformer based acoustic model for low latency streaming speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6783–6787.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30.
H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han (2020) HAT: hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187.
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp. 38–45.
B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10734–10742.
B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.

Appendix A Evaluated Architectures

Figures 23 and 24 present the detailed architectures of all models evaluated in Section 4.2. We notice that planer aggressively prunes/skips attention layers, while introducing sparsely activated layers where they help recover accuracy.

Figure 23: Evaluated architectures for WT103 dataset.

Figure 24: Evaluated architectures for enwik8 dataset.

Appendix B Repeatability Experiments: Architecture Comparison

Figure 25: Explored architectures through repeatability experiment on WikiText-103 dataset.

Figure 26: Explored architectures through repeatability experiment on enwik8 dataset.