Neural Architecture Search for Improving Latency-Accuracy Trade-off in Split Computing
thanks: This work was supported in part by JST PRESTO under Grant JPMJPR2035 and JPMJPR2133, and a project commissioned by NEDO (JPNP18002).

Shoma Shimizu1, Takayuki Nishio2, Shota Saito1 3, Yoichi Hirose1, Chen Yen-Hsiu1, Shinichi Shirakawa1 1Graduate School of Environment and Informations Sciences
Yokohama National University, Yokohama, Japan
{shimizu-shoma-kr, saito-shota-bt, hirose-youichi-kc, chen-yen-hsiu-tx}@ynu.jp, shirakawa-shinichi-bg@ynu.ac.jp
2School of Engineering, Tokyo Institute of Technology, Tokyo, Japan
nishio@ict.e.titech.ac.jp
3SkillUp AI, Co., Ltd., Tokyo, Japan
Abstract

This paper proposes a neural architecture search (NAS) method for split computing. Split computing is an emerging machine-learning inference technique that addresses the privacy and latency challenges of deploying deep learning in IoT systems. In split computing, neural network models are separated and cooperatively processed using edge servers and IoT devices via networks. Thus, the architecture of the neural network model significantly impacts the communication payload size, model accuracy, and computational load. In this paper, we address the challenge of optimizing neural network architecture for split computing. To this end, we proposed NASC, which jointly explores optimal model architecture and a split point to achieve higher accuracy while meeting latency requirements (i.e., smaller total latency of computation and communication than a certain threshold). NASC employs a one-shot NAS that does not require repeating model training for a computationally efficient architecture search. Our performance evaluation using hardware (HW)-NAS-Bench of benchmark data demonstrates that the proposed NASC can improve the “communication latency and model accuracy” trade-off, i.e., reduce the latency by approximately 40–60% from the baseline, with slight accuracy degradation.

Neural Architecture Search, Split Computing, Machine Learning, Deep Learning, Distributed Inference, Wireless Networks

I Introduction

Combining the physical sensing of IoT devices with deep learning-based data analysis has been gained considerable attention to enable multiple novel applications. However, the strict privacy and latency demands of IoT applications pose challenges. For example, IoT sensors (e.g., visual and audio sensors) in smart home applications obtain privacy-sensitive data that should not be exposed [Lin16], and the latency requirements for the factory automation and smart grids are less than 10 ms and 20 ms, respectively [Schulz17].

Split computing, which provides an intermediate option between edge computing and local computing, has been proposed to address the privacy and latency challenges of deploying deep learning in IoT systems. In the split computing, a well-trained deep neural network (DNN) is divided into sub-DNNs, and the IoT devices and the edge server collaboratively execute the sub-DNNs by exchanging information (e.g., intermediate outputs) via IoT networks. Because the computation to execute the DNN is partially offloaded to the edge server, and the privacy-sensitive data are processed and converted into an intermediate representation, the split computing can reduce the computation latency and data leakage risk in the DNN-based IoT applications.

However, split computing poses a new research challenge: the “communication latency vs. model accuracy” trade-off. The payload size of the intermediate output is typically larger than that of the raw input or output of the original model, which requires introducing a “bottleneck structure” between the head and tail networks. Narrowing the bottleneck allows for a smaller payload size but may result in significant performance degradation during inference. This is the “communication latency vs. model accuracy” trade-off.

Many studies have been conducted to improve the “communication latency vs. model accuracy” trade-off. Eshratifar et al. have studied the bottleneck injection, called BottleNet, and demonstrated that BottleNet can improve end-to-end latency and reduce mobile energy consumption compared with the cloud-based computation without significant degradation of model accuracy [Eshratifar19]. Matsubara et al. have proposed a method to train the head network to maintain the model accuracy, called head network distillation [Matsubara20]. Head network distillation employs knowledge distillation to transfer the knowledge of the head network generated from a well-trained original DNN into a compressed head network employing a bottleneck structure. Itahara et al. have proposed a COMtune, which can improve the model’s robustness against lossy compression of intermediate outputs and packet loss in the IoT networks [Itahara22].

As also mentioned in the survey [Survey_matsubara], although many studies have investigated split computing, many research challenges still need to be addressed. In this study, we focused on optimizing head and tail network design. The previous works used a handmade architecture of the head and tail networks. However, the architecture of the head and tail networks is of considerable importance in split computing because they have a significant impact on the payload size, model accuracy, and computation load. The smaller the bottleneck structure, the less traffic there is, but the lower the model accuracy. Moreover, because IoT devices generally have limited computational resources, the head network must be lightweight enough to satisfy latency constraints. Although few studies have addressed the optimization of the split point and bottleneck placement [Li18], further performance improvements can be still achieved by optimizing the overall model structure.

To this end, we propose a neural architecture search (NAS) [Elsken2019] method for split computing, referred to as NASC. The proposed NASC algorithm is based on one-shot NAS and employs an adaptive stochastic natural gradient (ASNG) [akimoto19] method as the search algorithm to efficiently search for the optimal model architecture and split point jointly that achieves higher accuracy while meeting latency requirements (i.e., smaller total latency of computation and communication than a certain threshold). One-shot NAS including our method trains the weights of a supernetwork that contains all candidate architectures only once during the search process and can drastically reduce the search cost. We conducted a performance evaluation using hardware (HW)-NAS-Bench [HW-NAS], which is benchmark data obtained using resource-constrained devices, and demonstrated that the proposed NASC can reduce the latency to less than a certain threshold with slight accuracy degradation.

Ii NASC: NAS for Split Computing

We propose a method for obtaining a model architecture that achieves high accuracy with low latency in the split computing via lossy wireless links. In the following sections, we first summarize several assumptions regarding our proposal and then present NASC in more detail.

Ii-a Assumptions

We consider a networked computing system consisting of a cloud server, an edge server, and end devices. The cloud server searches for and trains a deep neural network that works well in split computing. The edge server and devices collaboratively conduct inferences with the sensing data obtained by the devices and the model trained by the cloud server in a split computing manner.

A trained neural network model is split into a head and tail networks in split computing. The output of the head network (i.e., the intermediate representation) is transmitted from the device to the edge server via a wireless network. In this study, we assume that the wireless link is stable but unreliable; that is, the throughput for transmitting the model output does not change, but the packet can be dropped problematically, which can cause part of the intermediate representation to be missing.

Moreover, we assume that the computational power of end devices and throughput between the edge server and devices are limited because of resource-constrained IoT devices and networks. This causes non-negligible latency in calculations at the end devices and data transfer between the edge server and devices.

Ii-B Mathematical Formulation

As with existing studies on NAS, we address the following optimization problem: {mini} x∈X, a∈A f(x, a)  , where is an objective function, such as a loss function, and are the weight and architecture parameters of a neural network model, respectively.

Let the neural network model be defined as , where and denote the head and tail network parameterized by and , respectively. denotes the composite function of and . Thus, the optimization problem NASC addresses can be written as {mini} xh, xtX, ah, atA f(x_h, a_h, x_t, a_t)  .

In contrast to the existing NAS problem where model loss is minimized, NASC must consider computation and communication latency due to resource-constrained devices. We define the computational latency of a model processed by computing node as . Because the computational processes of the head and tail networks are independent, the computation latency for the head and tail networks can be written as and , respectively. In split computing, the node processing tail network is often an edge or cloud server with much higher computation power than end devices such as Raspberry Pi and Jetson. Thus, searching that can achieve low latency without significantly degrading the accuracy is essential.

On one hand, we define the communication latency of split computing with and conducted by computing nodes and as . Communication latency mainly depends on the communication throughput between computing node and and the data size of the output of the head network . Because a larger data size increases the communication latency, we need to find to achieve small output size (i.e., bottleneck architecture) without degrading accuracy. Moreover, as reported in [Itahara22], in the same model architecture, model training that incorporates dropout can suppress accuracy degradation when data are dropped on the communication channel in split computing, which enables the communication system to employ less reliable but fewer latency protocols (i.e., UDP).

The end-to-end latency of the split computing with nodes and is then written as

(1)

As mentioned in Sect. I, the objective of NASC is to improve the trade-off between the latency and model accuracy. To reduce communication latency, we must introduce a narrow bottleneck architecture that may induce accuracy degradation. The accuracy degradation caused by the bottleneck may be compensated by the large head and tail networks, but it will increase the computation latency. Moreover, the effect of communications on model accuracy, that is, accuracy degradation due to packet loss, must also be considered.

In this paper, we introduce a penalty that increases when the threshold is exceeded and define the objective function as a weighted sum of the model loss function and latency penalty. Finally, the optimization problem solved in NASC is written as follows: {mini!}—l— xh, xtX,ah, atA ϵ_lossl_SC(x_h, a_h, x_t, a_t)+ϵ_latτ \addConstraintτ= max(0, T-T_th) \addConstraintT= T^comp_i(a_h) + T^comp_j(a_t) + T^comm_i,j where and denote the weights for model loss and latency penalty, respectively, denotes the loss function of the split computing model when causing packet loss, is the threshold for the latency causing the penalty, and is a shorthand notation for .

Ii-C Algorithm

To optimize the weight and architecture parameters in the loss function (II-B), we applied the one-shot NAS method based on an adaptive stochastic natural gradient neural architecture search (ASNG-NAS)[akimoto19]. ASNG-NAS considers the probability distribution that generates the architecture parameters, and we optimize the weight parameters and the distribution parameters by minimizing the expected objective function . More details on the ASNG-NAS can be found in [akimoto19].

For solving NASC using the ASNG-NAS, we consider optimizing the weight and architecture parameters of the large neural network before splitting it, rather than optimizing the head and tail networks separately. Therefore, we introduce a split point and redefine the neural network model as . The split point represents the layer or block number, and the model is divided into the head and tail networks based on . We aim to optimize and based on the ASNG-NAS. The split point is sampled from the probability distribution as well as the architecture parameters. Therefore, we denote as a random variable that follows , and the expected objective function of ASNG-NAS for NASC is written as . In this paper, we use where is a loss function such as cross-entropy by inserting a dropout at the split point using a dropout rate and is a set of dropout rates, and .

1:
2:initialize the weight parameters and the distribution parameters
3:repeat weight pre-training
4:     sample pair of architecture and split point from a uniform distribution and update using (2)
5:until termination conditions are met
6:
7:repeat distribution update
8:     
9:     update with (4), then force by projection
10:     
11:     
12:     
13:     
14:until termination conditions are met
15:sample the most likely architecture and split point , and update the weight parameters weight re-training
Algorithm 1 ASNG-NAS for NASC

Algorithm 1 summarizes the proposed method. To run the algorithm, we set a supernet, a model whose all sub-models are equivalent to all the models in . The architecture and split point are represented by the following dimensional categorical variable: where possesses categories. Note that and are tied to the architecture and split point, respectively. We treat architecture and split point as one-hot vectors where such that . Also, we set a dimensional categorical distribution where is the probability of being such that , and the distribution parameters is represented by where , and the number of distribution parameters is .111We can omit the distribution parameter for the last categorical element owing to the condition of .

The proposed algorithm consists of three stages; weight pre-training, distribution update222The original ASNG-NAS alternates between updating weight and distribution parameters. However, we observed improved results in preliminary experiments when the weights and distribution parameters were updated sequentially, as in [Noda2022]., and weight re-training.

Weight Pre-training: In the weight pre-training stage, we optimize the weight parameters in the supernet to minimize . The gradient of weight parameters is given by . In most cases, it is difficult to compute analytically . In addition, computing requires times forward propagation, which is expensive for computation cost. Therefore, is approximated using Monte-Carlo method with samples sampled from the uniform distribution and to reduce the computation cost of training the weights (the same applies to the re-training stage). The approximated gradient of weight parameters is given by

(2)

where is a time step. We note that the weight parameters can be updated using any stochastic gradient descent (SGD) method with (2). We repeat this process for 30 epochs.

Distribution Update: In the distribution update stage, we in turn optimize the distribution parameters under the trained weight parameters . We use the natural gradient to update the distribution parameters. Here, where is the Fisher information matrix (FIM), and is given by under the categorical distribution [akimoto19]. However, is difficult to compute analytically as with the weight parameters. Therefore, is also approximated using Monte-Carlo method with samples, and the distribution parameters is updated as following:

(3)
(4)

where is the utility value based on the objective value -th sample, and is an adaptive learning rate updated according to line 10 in Algorithm 1. When , the value of the better sample is assigned , and that of the worse sample is . As for the termination condition, this study simply set the end of the update at 90 epochs.

Weight Re-training: In the weight re-training stage, we sample the most likely architecture and split point , and train the weight parameters from scratch to minimize for epochs.

Block type expansion () Kernel () Group
k3_e1 1 3 1
k3_e1_g2 1 3 2
k3_e3 3 3 1
k3_e6 6 3 1
k5_e1 1 5 1
k5_e1_g2 1 5 2
k5_e3 3 5 1
k5_e6 6 5 1
skip - - -
TABLE I: The candidate blocks in the FBNet search space.
Block type expansion () Kernel () Group
k3_e1/2 1/2 3 1
k3_e1/4 1/4 3 1
k3_e1/8 1/8 3 1
k5_e1/2 1/2 5 1
k5_e1/4 1/4 5 1
k5_e1/8 1/8 5 1
TABLE II: The additional candidate blocks in the extended search space.

Iii Experimental Evaluation

Iii-a Setups

Dataset and search space: We conducted experiments using the CIFAR-100 dataset and HW-NAS-Bench [HW-NAS], a hardware performance dataset for hardware-aware NAS. HW-NAS-Bench provides the computation latency and energy cost of all the network architectures in the search spaces of both NAS-Bench-201 [dong2020nasbench201] and FBNet [FBNet] on specific hardware such as Raspberry Pi 4, FPGA, and Edge GPU. In this experiment, we considered the FBNet search space [FBNet] and leveraged the latency performance of Rasberry Pi 4 and Edge GPU as those of an end device and edge server, respectively.

FBNet search space is a layer-wise search space constructed with a fixed macro architecture that defines the number of layers and each layer’s input/output dimensions. For better accuracy and efficiency, each layer of the network can independently choose different blocks from the candidate block, except for the first and last three layers with fixed operators. Each candidate block contains three operations: a 11 convolution, a KK depthwise convolution, where K denotes the kernel size, and another 11 convolution. It is important to note that only the first 11 convolution and the depthwise convolution are followed by ReLU activation functions, and there is no activation function following the last 11 convolution. Furthermore, to control the block, we use an expansion ratio to determine the input/output channel size expansion of the KK depthwise convolution. Each candidate block in the search space can choose a different expansion ratio, kernel size, and the number of groups for the group convolution. In the experiment, there were twenty-two layers that we needed to search for from nine predefined candidate blocks, as listed in Table I. The block “skip” means that there is no operation in that layer. Consequently, it contains possible architectures in FBNet search space.

(a) The candidate block in the FBNet search space.
(b) The additional candidate block in the extended search space.
Fig. 1: The candidate blocks in the FBNet search space and the extended search space. Red lines indicate the split points.

This study also searched for a split point in addition to searching for candidate blocks. We set 23 candidate split points: 22 after each block and one after the first convolution layer. Fig. 1 shows the candidate block’s operations and the split point. The skip connection is inserted when the input and output sizes of the block are the same. We set (ms) in the FBNet search space. Note that the FBNet search space originally supported only the ImageNet dataset; however, the work of [HW-NAS] created a macro-architecture for CIFAR-100.

Additionally, we consider an extension of FBNet search space, where the computation latency is approximated from the amount of computation (FLOPs) using HW-NAS-Bench. The extented search space introduces candidate blocks with an expansion ratio of less than one into the FBNet search space in order to utilize the bottleneck structure. Table II lists the additional candidate blocks in the extended search space. The additional blocks change the split point immediately after the KK depthwise convolution and remove the skip connection to reduce the transfer data size. Fig. 1 shows the additional block’s operations and split positions. We set (ms) in the extended search space.

Because the measured latency cannot be obtained from HW-NAS-Bench owing to the additional blocks, we used the latency estimated using FLOPs instead. We assume that the device has a specific computation power and that the latency is determined by the model’s FLOPs and the device’s computation power. In particular, we assume that the latency is determined as follows:

(5)

Here, is the model’s computation cost (i.e., FLOPs), and is the device’s computation power. We estimated the computation power using the latency in HW-NAS-Bench using the following equation:

(6)

where and denote the measured latency and FLOPs of the -th block in the -th layer, respectively. Table III lists the devices’ estimated computation power.

device computation power (GFLOPS)
EdgeGPU 8.0213
Raspi4 2.3562
TABLE III: Devices’ estimated computation power

Note that HW-NAS-Bench [HW-NAS] points out that the correlation between FLOPs and EdgeGPU latency is small. However, this experiment was conducted to verify that this method can be applied when the latency is obtained in some way and the precision of the latency is not critical.

Fig. 2: Accuracy vs. latency of each model searched in FBNet search space when and ms.

Assumptions on communications: An end device and an edge server were assumed to be connected via a lossy IoT network abstracted as a communication link in which the packets were randomly dropped with the probability . Hence, a proportion of the elements of the intermediate representation (i.e., the output of the head network) transmitted by the device were randomly dropped. We also assumed a stable throughput for the communication link. Thus, the communication latency between the device and server is calculated as

(7)

where denotes the data size of the output of the head network . The data size depends on the number of units in the output layer, the data type, and the compression method. In this evaluation, to calculate the communication latency simply, the data size is calculated as where and represent the quantization bit rate and the number of output units in the head network, respectively. Assuming a data type of 32 bit float, was set to 32. The throughput of the communication link (including MAC and network layer overheads) was set to 8.0 Mbit/s, which is in the throughput range for wireless LANs based on the IEEE 802.11 standards.

Compared methods: We compared NASC with a hardware-aware NAS protocol modified for this split computing scenario. We refer to this protocol as HWNAS. The HWNAS employs a split point optimization method simplified from [Li18] and model tuning leveraging dropout [Itahara22] to reduce latency and improve robustness against packet loss. The HWNAS conducts conventional NAS, model split, and model re-training sequentially. Specifically, HWNAS first performs an architecture search assuming the entire model is on the end device; that is, it calculates the latency assuming and assumes that packet loss does not occur in (II-B). Then, the model is split into head and tail networks at a point that minimizes the same objective function (II-B) as NASC to minimize the latency and accuracy degradation by split computing. Finally, in the HWNAS w/ dropout, the head and tail networks are re-trained using the dropout technique, whereas the HWNAS w/o dropout simply re-trains the networks without dropout.

Fig. 3: Accuracy vs. latency of each model searched in extended search space when and ms.

Iii-B Results

Fig. 2 shows scatter plots of accuracy vs. latency for the 15 models obtained by the proposed method and the compared methods in the FBNet search space when . Five models were obtained for each method. As shown in Fig. 2, only the proposed method can obtain a model that satisfies the latency constraints. This is because that NASC searches a model that minimizes the total latency of computation and communication, while HWNAS does not consider communication latency when searching for model architectures. However, the model accuracy of NASC was slightly lower than HWNAS. Specifically, the medians of the model accuracy for NASC, HWNAS w/ Dropout, and HWNAS w/o Dropout were 73.30%, 73.35%, and 73.63%, respectively. This is because of the trade-off between the model accuracy and latency.

Fig. 3 shows scatter plots of the accuracy vs. latency when considering the extended search space when . As in the FBNet search space, all the models obtained by the proposed method achieved lower latency than the threshold, whereas each baseline obtained only one model that achieves lower latency than the threshold. On the one hand, the median of the accuracy of NASC was slightly lower than baselines, which were 63.20%, 63.69%, and 64.50% for NASC, HWNAS w/ Dropout, and HWNAS w/o Dropout, respectively.

These results demonstrate that NASC can significantly reduce latency while slightly decreasing model accuracy, thereby improving the accuracy-latency trade-off.

Iv Conclusion

In this paper, we propose a NAS for split computing, called NASC. The NASC employs the adaptive stochastic natural gradient method to jointly explore the optimal model architecture and split point to achieve higher accuracy with low end-to-end latency, that is, to minimize the weighted sum of the model loss and total latency on communication and computation in the end device and edge server. The performance evaluation using HW-NAS-Bench demonstrates that the proposed NASC reduces the latency by approximately 40–60% from the baseline with slight accuracy degradation.

References