Anytime-Lidar: Deadline-aware 3D Object Detection

Ahmet Soyyigit Shuochao Yao George Mason University, Fairfax, VA
shuochao@gmu.edu Heechul Yun

Abstract

In this work, we present a novel scheduling framework enabling anytime perception for deep neural network (DNN) based 3D object detection pipelines. We focus on the computationally expensive region proposal network (RPN) and per-category multi-head detector components, which are common in 3D object detection pipelines, and make them deadline-aware. We propose a scheduling algorithm that intelligently selects a subset of the components to make an effective time and accuracy trade-off on the fly. We minimize the accuracy loss of skipping some of the neural network sub-components by projecting previously detected objects onto the current scene through estimations. We apply our approach to a state-of-the-art 3D object detection network, PointPillars, and evaluate its performance on Jetson AGX Xavier using the nuScenes dataset. Compared to the baselines, our approach significantly improves the network’s accuracy under various deadline constraints.

Lidar, 3D object detection, PointPillars, Anytime computing

Fig. 1: Baseline lidar detector architecture.

I Introduction

Real-time object detection is a crucial part of autonomous vehicles. Radars, cameras, and lidars are commonly utilized to provide sensory input for this task. All detection systems aim to detect objects accurately and in a timely manner in order to make local and global path planning safe and efficient.

Deep learning-based object detection systems have gained popularity in recent years due to their excellent performance. As a result, most object detection systems in autonomous vehicles, as in many other AI-related fields, are based on deep neural networks. These networks are composed of a predefined directed acyclic graph of tensor operations, executed layer by layer to obtain the detection results. Therefore, their computational cost and execution time are deterministic and highly predictable. While this predictability could in itself be a virtue in a real-time system, it also means that the object detection system’s timing and computational properties cannot be changed dynamically.

Within an autonomous driving framework, however, the timing (deadlines) and performance (accuracy) requirements of the detection tasks can change dynamically over time. For instance, a fast-moving vehicle on a highway may favor a less accurate but faster detection system over a highly accurate but slow one. On the other hand, a slow-moving vehicle in complex city driving conditions may favor higher accuracy even if it takes longer to detect the objects [6]. Also, hazardous situations such as the sudden appearance of a jaywalker may need to trigger path planning, which may require reallocation of computing resources to process more urgent tasks in time. Therefore, it is crucially important to be able to make time and accuracy trade-offs on the fly so that the system can adapt to a dynamically changing environment and make the best use of the on-board computing resources, which are limited due to size, weight, power, and cost constraints in autonomous vehicles.

There has been a growing number of research efforts to support deadline-aware anytime perception capabilities in the real-time community. Several researchers proposed to apply anytime computing in vision-based image classification by dynamically skipping some layers [6, 10, 2, 31] or prioritizing a subset of neurons [13] in the backbone networks, while others applied similar ideas to vision-based real-time object detection problems [6, 16]. Our work extends the anytime perception problem to 3D object detection utilizing lidar point clouds.

The majority of today’s cutting-edge neural network architectures can be broadly viewed as an encoder-decoder structure [4, 22, 12, 30, 35]. The encoder (i.e., backbone network) extracts features from the input data, while the decoder converts those features into application-specific outputs. Although the backbone network of an image-based neural network is much more complicated than the decoder, the lidar-based backbone module often performs similar or less computation than the decoder. Thus, it is necessary to address significant discrepancies between vision-based and lidar-based detection tasks for anytime inference. On the one hand, this reduces the effectiveness of dynamic layer skipping and subnet prioritization exclusively in the backbone, since the decoder becomes another computational bottleneck. On the other hand, in contrast to the vision-based anytime model, the lidar-based network employs novel architectures such as multiple-scaled backbone blocks [4, 22, 12] and multiple detection heads [12, 30, 35] to improve detection accuracy over a wide range of object sizes and categories, offering a new research challenge for enabling anytime predictions.

To this end, we propose a novel scheduling framework that enables anytime perception for lidar-based 3D neural object detection pipelines. To begin, we enable imprecise computing on the encoder of the neural network (i.e., the backbone). We focus on multi-scaled Region Proposal Networks (RPNs), which are widely employed in lidar and vision-based object detectors [22, 4, 12, 21] but have received little attention in the prior anytime computing literature. To our knowledge, the only exception [6] executes the whole RPN module but dynamically reduces the number of output proposals in order to lower the cost of subsequent processing. However, it still does not perform imprecise computation on the RPN module itself. The multi-scaled RPN is structured as a series of blocks, each of which generates bounding box proposals with increasing spatial resolutions. We exploit this structural characteristic and dynamically adjust the number of blocks to be executed at runtime in order to achieve a trade-off between speed and accuracy.

Secondly, we further apply imprecise computation to the decoder of the neural network (i.e., the detection heads). Given that many state-of-the-art lidar-based object detection networks utilize multiple detection heads [12, 30, 35], we present a head-skipping mechanism that allows for the selective execution of a portion of the detection heads at runtime to further optimize the time and accuracy trade-off.

Thirdly, we introduce a lightweight projection method to compensate for the skipped detection heads in the current lidar frame. Assuming a fine-grained localization algorithm is providing global position of the ego-vehicle at runtime, this method allows estimating the detection results of the skipped heads by using the past detection results, improving the detection performance while avoiding potential dangers associated with skipped detection heads.

Lastly, we propose a two-phase dynamic scheduler, which decides the execution patterns of the imprecise-computation-capable encoder and decoder of the network prior to their execution. The scheduler is designed to maximize the detection accuracy while considering the deadline constraints.

As a case study, we apply our technique to a state-of-the-art lidar object detector, PointPillars [12]. We evaluate our technique on a commercial off-the-shelf embedded computing platform, Jetson AGX Xavier [19]. Our evaluation includes a comparison with a range of methods, including alternative scheduling methods that are capable of utilizing the imprecise-computation-capable encoder and decoder. The results show that Anytime-Lidar can reduce its runtime requirement by 50 percent while meeting all the deadlines and providing better accuracy compared to baseline, multi-path, and alternative scheduling methods.

To the best of our knowledge, this work is the first to tackle the anytime perception problem in the lidar domain by considering the differences of 3D object detection compared to 2D. Our scheduling framework addresses the specific needs of 3D object detection to enable fine-grained time and accuracy trade-off.

II Background

Fig. 2: Proposed anytime lidar detector architecture.

The main goal of object detection using point clouds is to produce the 3D positions, bounding boxes, and optionally velocities of the objects present in the scene. Figure 1 shows the general flow of the deep learning-based techniques, which are our focus in this paper. First, the initial stage transforms the point cloud into a pseudo-image of extracted features at runtime so that the backbone, which is usually a convolutional neural network (CNN), can process it. This is required because, unlike images, point clouds are not represented as a 2D array of pixels that can be directly processed with CNNs. Different transformation techniques exist, all with the goal of producing a bird’s-eye-view representation of the scene.

Afterwards, the generated bird’s-eye-view pseudo-image is processed by the backbone, which is usually a Region Proposal Network (RPN) [22], as can be seen in [23, 28, 34, 24, 20, 4, 12]. The RPN consists of multiple processing blocks, generally two or three, depending on the design choice. Figure 1 shows the inner structure of an RPN with three blocks as in [12]. Each block consists of a set of convolutions followed by a deconvolution operation that performs upsampling. At the end, the outputs of all blocks are concatenated to produce a tensor of region proposals about where the objects are expected to be in the scene.

Lastly, the tensor of region proposals is processed by the detection heads. Each detection head is dedicated to deriving the prediction results for a subset of the object classes through a set of convolutions. These predictions include the classification scores, positions, bounding boxes, and velocities of the objects. Non-maximum suppression is applied individually to the output of each detection head to obtain the final results.

It is possible to use a single detection head to produce the output for all classes, as is the norm in image processing. When detection happens in 2D, the bounding box of a specific class can vary greatly in size depending on the position of the object. For example, a bus within a short distance can be captured with a bounding box that is four times larger than the bounding box of a bus far away from the ego-vehicle. When the detection is performed in 3D, however, the bounding boxes become 3D and their size stays consistent regardless of the positions of the objects. As a result, each class is tightly bound to its bounding box anchor size, in contrast to image processing. In order to exploit this fact, the classes are divided into subsets according to their anchor sizes and an individual detection head is dedicated to each of these subsets [37]. For example, one detection head is responsible for detecting pedestrians and traffic cones while another one does the same for buses and trucks. The utilization of multiple detection heads improves the detection accuracy by allowing each detection head to focus on a subset of classes having similar size. The downside is the increased overhead of running multiple detection heads instead of a single one, which has to be tackled for real-time execution.

III Anytime-Lidar

In this section, we present our method of making fine-grained time and accuracy trade-offs at runtime for lidar-based object detectors. Our goal is to meet tight deadlines that the baseline models fail to meet, while maintaining higher detection accuracy for all deadline cases.

In this work, we base our method on PointPillars [12], one of the state-of-the-art detectors capable of delivering strong detection results with minimal latency compared to other state-of-the-art methods. Our method does not modify any part that is specific to PointPillars; therefore, it is applicable to many other object detection architectures that utilize a Region Proposal Network as the backbone and multiple detection heads, such as [28, 34, 29, 37].

III-A Overview

The general architecture of the proposed method is shown in Figure 2. We focus on making the backbone and the detection heads flexible in their timing, as they are the main source of inference latency. Table I shows the execution time profiling data of the baseline PointPillars network on Jetson AGX Xavier. It can be seen that the processing of the backbone and detection heads accounts for 80% of the total execution time on average.

In a nutshell, our enhancements can be described in four parts. First, we allow the backbone to run a varying number of blocks. Second, we enable skipping a subset of detection heads in favor of others. Third, we provide output for the skipped heads by projecting the past detection results to the current frame. Lastly, the stage/head scheduler manages the other three modules and decides how many blocks of the backbone and which detection heads are to be executed. In the following subsections, we give the details of our enhancements.

* Numbers are in milliseconds except last column.
Stage	Min	Average	99Perc	Percentage
PC Transformation	24.61	26.64	28.40	20%
Backbone	44.81	45.23	45.55	34%
Detection Heads	55.06	58.83	62.72	46%

TABLE I: Execution Timing Analysis of PointPillars on Jetson AGX Xavier.

III-B Imprecise computation on backbone

We take advantage of the RPN’s multi-scale block structure and add two early exits at the ends of the first two blocks, as shown in Figure 3, to allow a time and accuracy trade-off. This minimal change does not involve adding any network layer, yet it allows the backbone to run a reduced number of blocks at runtime, depending on the scheduling decision.

It should be noted that connecting all the exit points of the imprecise backbone to the same set of detection heads would complicate the training process, as the detection heads would not be able to focus on the output of a specific exit. As a solution, we duplicate the detection heads for each exit and train the entire network once with the method proposed in [11], which sets training loss weights for each exit of the RPN that change over epochs. This enables each detection head to be trained to focus on one exit of the backbone to achieve the best detection accuracy.

III-C Detection head skipping

To enable fine-grained execution time flexibility, we allow skipping of detection heads as illustrated in Figure 4. Each skipped head saves us from running a set of convolutions dedicated to output the position, size, classification score, and velocity of the objects. In addition, skipping a head also saves us from running non-maximum suppression on its output.

The reader might be concerned about the impact of head skipping on safety and detection accuracy. When heads are skipped, any object belonging to the subset of classes dedicated to the skipped heads won’t be detected at all. However, this is only the case for a single lidar frame. Not all objects in the scene have to be detected in every single frame, but they should be detected within a range of consecutive frames. This can be ensured by scheduling the heads in a way that every head gets its turn within a time limit. We handle this problem with our head scheduling method, which will be explained later. Also, the projection module mitigates the possible negative effect of skipped heads.

III-D Projection

We integrate a projection module into the proposed method that estimates the current position, orientation, and velocity of the previously detected objects and uses these estimates to provide output in place of the skipped detection heads. This estimation is based on the assumption that the global position of the ego-vehicle, i.e., the ego-pose, is available for each frame. In a typical autonomous driving framework, a localization module calculates ego-poses periodically so that planning can be performed. Therefore, our assumption does not add extra overhead to the overall system.

The detector outputs the bounding boxes in the lidar coordinate system, which is attached to the ego-vehicle. This coordinate system changes as the vehicle moves. Therefore, projecting a bounding box requires applying 3D transformations to move it from the past lidar coordinate system to the current one. Also, the last known velocity of the object should be utilized to make the projection accurate. We therefore follow the steps below to project a bounding box:

Rotate and translate the object to be in the ego-pose coordinate system of the time of detection.
Rotate and translate the object to be in the global coordinate system.
Translate the object using its last known global velocity and the elapsed time since it was detected.
Translate and rotate the object to be in the current ego-pose coordinate system.
Translate and rotate the object to be in the current lidar coordinate system.
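
The five steps above can be sketched as follows. This is a minimal planar illustration rather than the implementation used in the evaluation: it assumes that every pose reduces to (x, y, yaw) in the ground plane, whereas a full pipeline would apply 3D rotations. The names `Pose2D`, `to_parent`, `to_child`, and `project_box` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    """Pose of a child frame expressed in its parent frame (planar assumption)."""
    x: float
    y: float
    yaw: float  # radians


def to_parent(pose, x, y):
    """Map a point from the child frame into the parent frame."""
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    return (pose.x + c * x - s * y, pose.y + s * x + c * y)


def to_child(pose, x, y):
    """Map a point from the parent frame into the child frame (inverse map)."""
    dx, dy = x - pose.x, y - pose.y
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    return (c * dx + s * dy, -s * dx + c * dy)


def project_box(center, yaw, vel_global, dt, lidar_in_ego, ego_then, ego_now):
    """Project a past detection into the current lidar frame.

    center and yaw are given in the lidar frame at detection time,
    vel_global is the last known velocity in the global frame, and dt is
    the time elapsed since the detection.
    """
    x, y = to_parent(lidar_in_ego, *center)                 # step 1: lidar -> ego (then)
    x, y = to_parent(ego_then, x, y)                        # step 2: ego (then) -> global
    x, y = x + vel_global[0] * dt, y + vel_global[1] * dt   # step 3: object motion
    x, y = to_child(ego_now, x, y)                          # step 4: global -> ego (now)
    x, y = to_child(lidar_in_ego, x, y)                     # step 5: ego (now) -> lidar
    # The lidar mounting yaw is added in step 1 and removed in step 5, so it cancels.
    new_yaw = yaw + ego_then.yaw - ego_now.yaw
    return (x, y), new_yaw
```

For example, if the ego-vehicle advances 1 m along its heading and the object is static, a box detected 10 m ahead is projected to 9 m ahead in the current lidar frame.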

The listed operations have to be performed for all bounding boxes to be projected, which can introduce significant computational overhead. We tackle this problem by taking advantage of the available CPU resources while the neural networks execute on the GPU. After we determine all the bounding boxes to be projected, we distribute them to parallel CPU processes. While they are being processed on the CPU cores, we continue with the execution of the backbone and detection heads on the GPU. When the detection heads finish running, we collect the projected bounding boxes and merge them with the detection results.
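
The overlap between projection and network execution can be illustrated with a worker pool. The sketch below is only an approximation of the scheme described above: it uses a thread pool from the Python standard library for portability, whereas the text describes parallel CPU processes, and both `project_chunk` and `run_network` are hypothetical stand-ins for the real projection and the GPU inference.

```python
from concurrent.futures import ThreadPoolExecutor


def project_chunk(boxes, dx):
    """Hypothetical stand-in for the projection: shift each box's x by dx."""
    return [(x - dx, y) for (x, y) in boxes]


def run_network(frame):
    """Hypothetical stand-in for the backbone and detection heads on the GPU."""
    return [("detected", frame)]


def process_frame(frame, past_boxes, dx, pool, num_workers=6):
    # Split the boxes into roughly equal chunks, one per worker.
    chunks = [past_boxes[i::num_workers] for i in range(num_workers)]
    futures = [pool.submit(project_chunk, c, dx) for c in chunks if c]
    # The network runs while the workers are busy projecting.
    detections = run_network(frame)
    # Collect the projected boxes only after the network has finished.
    projected = [box for f in futures for box in f.result()]
    return detections, projected
```

Submitting the projection work before starting the network and collecting it afterwards is what hides most of the projection latency; only distribution and merging remain on the critical path.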

* Numbers are in milliseconds.
RPN	Detection heads
blocks	1	2	3	4	5	6
1	30.9	42.2	52.2	62.1	70.6	78.2
2	46.3	56.8	66.9	76.8	85.4	93.2
3	61.8	71.9	81.8	92.0	100.6	107.9

TABLE II: Post-sync WCET calibration table for Jetson AGX Xavier

RPN	Detection heads
blocks	1	2	3	4	5	6
1	67.0	67.5	70.7	74.4	79.2	80.6
2	75.4	77.5	82.1	88.2	91.9	93.3
3	79.8	84.9	90.7	95.6	98.9	100.0

TABLE III: Normalized accuracy calibration table (%)

III-E Scheduling of RPN blocks and detection heads

The scheduling problem involves deciding the number of backbone blocks and which detection heads to run. A certain deadline can be met by multiple block/head configurations, but how do we know which one would deliver the best detection accuracy? We solve this problem by dividing the scheduling into two phases. First, we determine the number of RPN blocks and detection heads to execute by using offline profiling data, represented in the form of two calibration tables, one for WCET and the other for accuracy. Second, we determine which detection heads to skip by our novel head scheduling algorithm.

The calibration tables hold the required execution time and deliverable detection accuracy for all combinations of the number of RPN blocks and the number of detection heads. Using a small subset of the training data, we conduct the calibration with the procedure shown in Algorithm 1 to reveal how accuracy and execution time change depending on the block/head configuration. Once the calibration is done, the produced tables are saved to disk and loaded into memory whenever a test is conducted. Tables II and III are the ones generated for our method. Please note that Table II holds the execution times required to complete the execution after the point cloud has been transformed into a pseudo-image. In the tables, the green cells are the executable block/head configurations. Other configurations deliver less or equal accuracy with a higher demand of time, so they are not considered for execution.

Input:
    Calibration dataset ($D$)
    Number of RPN blocks ($R$)
    Number of detection heads ($H$)
Output: WCET and Accuracy Tables
function calibrate($D$, $R$, $H$)
    $wcet\_table \leftarrow 2D\_array\_with\_size(R, H)$
    $acc\_table \leftarrow 2D\_array\_with\_size(R, H)$
    $r \leftarrow 1$
    while $r \leq R$ do
        $h \leftarrow 1$
        while $h \leq H$ do
            $fix\_num\_blocks\_and\_heads(r, h)$
            $deadline \leftarrow \infty$
            $w, a \leftarrow process\_samples(D, deadline)$
            $wcet\_table[r, h] \leftarrow w$
            $acc\_table[r, h] \leftarrow a$
            $h \leftarrow h + 1$
        $r \leftarrow r + 1$
    return $wcet\_table, acc\_table$

Algorithm 1 Calibration Procedure
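
As a rough Python rendering of Algorithm 1, the sketch below fills both tables by sweeping all block/head configurations. The callback `process_samples(dataset, r, h)` is a hypothetical placeholder that is assumed to run the detector with `r` RPN blocks and `h` heads without a deadline and to return the measured worst-case execution time and accuracy.

```python
def calibrate(dataset, num_blocks, num_heads, process_samples):
    """Fill the WCET and accuracy tables for every block/head configuration.

    process_samples is a hypothetical callback returning (wcet, accuracy)
    for a fixed configuration; tables are indexed [r - 1][h - 1].
    """
    wcet_table = [[0.0] * num_heads for _ in range(num_blocks)]
    acc_table = [[0.0] * num_heads for _ in range(num_blocks)]
    for r in range(1, num_blocks + 1):
        # The head counter restarts for every block count.
        for h in range(1, num_heads + 1):
            wcet, acc = process_samples(dataset, r, h)
            wcet_table[r - 1][h - 1] = wcet
            acc_table[r - 1][h - 1] = acc
    return wcet_table, acc_table
```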

III-E1 First phase of scheduling

Initially, we wait until the point cloud transformation finishes its execution on the GPU, since GPU operations are executed asynchronously. This synchronization is done to calculate the remaining time with precision. Afterwards, we iterate over the calibration tables to find the block/head configuration that provides the highest possible detection accuracy while meeting the deadline with respect to the remaining time. Once it is found, we move to the second phase.
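
The table lookup of the first phase can be sketched as below, using the values of Tables II and III. The function name `select_config` and the exhaustive scan are illustrative assumptions; the text only states that the tables are iterated to find the most accurate configuration that fits in the remaining time.

```python
# Post-sync WCET in milliseconds (Table II); rows: RPN blocks 1-3, columns: heads 1-6.
WCET_MS = [
    [30.9, 42.2, 52.2, 62.1, 70.6, 78.2],
    [46.3, 56.8, 66.9, 76.8, 85.4, 93.2],
    [61.8, 71.9, 81.8, 92.0, 100.6, 107.9],
]
# Normalized accuracy in percent (Table III), same layout.
ACCURACY = [
    [67.0, 67.5, 70.7, 74.4, 79.2, 80.6],
    [75.4, 77.5, 82.1, 88.2, 91.9, 93.3],
    [79.8, 84.9, 90.7, 95.6, 98.9, 100.0],
]


def select_config(remaining_ms, wcet=WCET_MS, acc=ACCURACY):
    """Return (num_blocks, num_heads) with the highest accuracy whose WCET
    fits in remaining_ms, or None if no configuration fits."""
    best = None
    for r, row in enumerate(wcet):
        for h, time_ms in enumerate(row):
            if time_ms <= remaining_ms and (best is None or acc[r][h] > best[0]):
                best = (acc[r][h], r + 1, h + 1)
    return None if best is None else (best[1], best[2])
```

With 100 ms remaining after synchronization, this lookup selects three RPN blocks and four heads (92.0 ms, 95.6%); with 50 ms remaining, it selects two blocks and one head (46.3 ms, 75.4%).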

III-E2 Second phase of scheduling

We propose to schedule the detection heads by dynamically assigning priorities based on two parameters: their age and aged confidence. The age of a head is defined as the number of frames passed since the last time it was scheduled to run. We keep track of the ages of all heads and update them whenever a frame is processed, resetting the age of scheduled heads and increasing the age of others.

Each detected object has a confidence score alongside its bounding box. The confidence scores of all objects detected by a certain detection head give a clue about its contribution to the overall detection accuracy, which can be used for prioritization. The aged confidence of a detection head is defined as the sum of these confidence scores at the last time it was scheduled to run. Although the utility of aged confidence decreases over time, it still provides a useful hint about which heads would contribute most to the overall detection accuracy, thanks to the temporal locality of the objects in the scene. Since the time between frames is on the order of milliseconds, the objects in the scene do not change significantly over a sequence of consecutive frames. We take advantage of this fact to save ourselves from the overhead of calculating the confidence of all heads for every single frame.

Algorithm 2 summarizes the proposed detection head scheduling algorithm. First, we update the aged confidences of the heads scheduled in the previous frame by summing their detection scores and normalizing the result (lines 8-9). Afterwards, we create an empty max heap of tuples that is ordered by the first element of each tuple (line 11). We calculate the priority of each head by multiplying its age by its aged confidence and push it to the max heap (line 16). We make sure that any head with an age beyond a predefined frame limit is given a special priority by multiplying its age by the maximum possible aged confidence (line 14). Lastly, we pop heads from the max heap and put them in the schedule list until we reach the number of heads determined to run at the first scheduling phase (lines 21-24).

1 Input:
2     Heads Scheduled For the Previous Frame ($H_{prev}$)
3     Aged Head Confidences ($C$)
4     Head Ages ($A$)
5     Number of Heads To Run ($N$)
6 Output: List of heads to run
7 function sched_det_heads($H_{prev}$, $C$, $A$, $N$)
8     for $h \in H_{prev}$ do
9         $C[h] \leftarrow norm(sum(get\_prev\_det\_scores(h)))$
10    end
11    $h \leftarrow 1$
12    $prios \leftarrow max\_heap()$
13    for $c \in C$ do
14        if $A[h] >$ FRAME_LIMIT then
15            $tuple \leftarrow (A[h] \times$ MAX_SCORE$, h)$
16        else
17            $tuple \leftarrow (A[h] \times c, h)$
18        $prios.push(tuple)$
19        $h \leftarrow h + 1$
20    end
21    $i \leftarrow 1$
22    $H \leftarrow ()$
23    while $i \leq N$ do
24        $scr, h \leftarrow prios.pop()$
25        $H \leftarrow H \cup h$
26        $i \leftarrow i + 1$
27    end
28    return $H$

Algorithm 2 Head Selection Heuristic
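
The priority computation of Algorithm 2 can be sketched in Python with the standard `heapq` module. Several details here are assumptions made for illustration: the aged confidences are taken as already normalized inputs, `FRAME_LIMIT = 10` is an arbitrary value, and ages are reset to 1 rather than 0 so that a head that has just run does not receive a zero priority. The helper `update_ages` is hypothetical.

```python
import heapq

FRAME_LIMIT = 10  # assumed value: maximum age before a head is forced to run
MAX_SCORE = 1.0   # assumed maximum of the normalized aged confidence


def sched_det_heads(aged_conf, ages, num_to_run):
    """Return the indices of the heads to run, highest priority first."""
    heap = []
    for h, conf in enumerate(aged_conf):
        # Heads older than the frame limit get the maximum confidence.
        c = MAX_SCORE if ages[h] > FRAME_LIMIT else conf
        # heapq implements a min-heap, so the priority is negated.
        heapq.heappush(heap, (-ages[h] * c, h))
    count = min(num_to_run, len(heap))
    return [heapq.heappop(heap)[1] for _ in range(count)]


def update_ages(ages, scheduled):
    """Reset the age of the scheduled heads and increment the others."""
    return [1 if h in scheduled else age + 1 for h, age in enumerate(ages)]
```

For instance, with aged confidences `[0.9, 0.1, 0.5, 0.2]`, ages `[1, 3, 2, 12]`, and two heads to run, the head with age 12 exceeds the frame limit and is selected first, followed by the head with priority 2 × 0.5.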

Method	Number of model parameters	Number of RPN blocks	RPN stage selection	Detection head scheduling
PointPillars-3	6078K	3	-	-
PointPillars-2	2626K	2	-	-
PointPillars-1	1723K	1	-	-
MultiStage	9235K	3	✓	-
RoundRobin	9235K	3	✓	Circulating
ClsScrSum	9235K	3	✓	Class scores sum
NearOptimal	9235K	3	✓	Aging + Ground Truth
Ours	9235K	3	✓	Aging + Aged confidences

TABLE IV: Methods to compare

IV Evaluation

We evaluate the proposed method by extending the PyTorch implementation of PointPillars in [27] to support our method and others we used for comparison. We present our method’s performance under three contexts. Initially, we show the benefit of head scheduling plus projection over methods not having head-skipping capability. Next, the role of the head scheduling algorithm is highlighted. Last, the improvement that comes by the projection module is demonstrated.

We train the neural network models and test the methods using nuScenes [3], a realistic large-scale dataset for autonomous driving. nuScenes is composed of 1000 scenes, where each scene holds the data of a 20-second drive in an urban environment. The data include periodically collected samples from various sensors and annotations of the objects. In addition, the periodically calculated global positions of the ego-vehicle, i.e., ego-poses, are present in the dataset.

The evaluation procedure involves processing the lidar samples of five different 20-second scenes taken from the evaluation dataset. We execute this procedure for all methods under different deadline constraints and evaluate their performance in terms of detection accuracy and timeliness. We use the nuScenes Detection Score (NDS) [3] as our detection accuracy metric, which incorporates mean Average Precision as one of its components. We normalize the NDS scores for a better presentation of our results. Each deadline miss is counted as a detection with no results. Before running our tests, we perform the required accuracy/WCET calibration using another five scenes from the training dataset. Baseline methods don’t require calibration as they run without scheduling.

One thing to note about the dataset is that the time between consecutive annotated point clouds (i.e., keyframes) is 500 milliseconds, whereas the point clouds were sampled with a period of 50 milliseconds. Since detection accuracy could otherwise be evaluated only on the keyframes, the large time gap would significantly reduce the projection performance. To overcome this problem and make the evaluation more realistic, we interpolated the annotations between the keyframes so that all point clouds have annotations and can be used in the evaluation. For the tested deadlines within the ranges of 110-140 ms and 50-100 ms, we used lidar samples with periods of 150 ms and 100 ms, respectively.
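
The interpolation scheme is not detailed here. One simple possibility, shown purely as an assumption, is to interpolate each annotated box center linearly between two consecutive keyframes; `interpolate_center` is a hypothetical helper, and a complete solution would also have to interpolate orientation.

```python
def interpolate_center(c0, c1, t0, t1, t):
    """Linearly interpolate a box center between keyframes at t0 and t1.

    c0 and c1 are the annotated centers at times t0 and t1 (same units as t).
    Linear interpolation is an assumption, not the documented procedure.
    """
    if t1 <= t0 or not (t0 <= t <= t1):
        raise ValueError("t must lie within [t0, t1] and t1 must exceed t0")
    alpha = (t - t0) / (t1 - t0)
    return tuple(p + alpha * (q - p) for p, q in zip(c0, c1))
```

With keyframes 500 ms apart and a 50 ms sampling period, this would be evaluated at the nine intermediate timestamps between each pair of keyframes.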

An overview of the methods we tested is given in Table IV. The baseline PointPillars methods utilize separately trained models, whereas the others use the same imprecise model. The imprecise model has a higher number of parameters because of the duplicated detection heads that are connected to the early exits of the imprecise backbone. All methods capable of head scheduling follow the processing stages shown in Figure 2. The only difference between them is their head scheduling algorithm, i.e., the second phase of the scheduling. The initial scheduling phase utilizing the calibration tables is the same for all methods, although each method has its own calibration tables.

As the testing platform, we use a Jetson AGX Xavier [19] with 16 GiB of RAM. All hardware clocks were maximized during the tests. The neural network layers and non-maximum suppression were executed on the GPU, while the head/stage scheduling and projections were done on the CPU.

Fig. 5: End-to-end average execution times.

IV-A Comparison With the Baselines and MultiStage

Here, we highlight the main benefits of the proposed method, which enables fine-grained execution time adjustment, by comparing it with the following methods:

PointPillars [12]: These are the baseline models without any dynamic execution time adjustment capability. The numbers 1 to 3 denote the number of RPN blocks used in the backbone. Having fewer blocks reduces the required execution time at the cost of accuracy. PointPillars-3 is the actual baseline model without any modification.
MultiStage[6]: This method can make execution time/accuracy tradeoff by running different number of RPN blocks. However, it executes all detection heads for all frames.

Figure 5 shows how the average execution time changes over deadlines. The baselines are rigid and have no capability of adjusting their execution timing, resulting in deadline misses as shown in Figure 6. On the other hand, the proposed method can reduce its execution time requirement by reducing the number of executed backbone blocks and detection heads at runtime while providing higher accuracy than all baselines, as shown in Figure 7. The MultiStage method can also provide a degree of execution time flexibility and better accuracy results compared to the baselines. However, its flexibility is limited to the backbone, and consequently it falls behind the proposed method.

Fig. 7: Detection accuracy versus baselines.

Fig. 8: Average detection accuracy versus baselines.

Figure 8 shows the average accuracy achieved over all tested deadlines; it is essentially the averaged form of Figure 7. When we look at the results in this figure side by side with the deadline misses, the main takeaway is that meeting the deadlines plays the most crucial role in achieving high overall detection accuracy. This can be seen even by comparing the first three baselines. The proposed method enables fine-grained execution time flexibility that allows meeting tight deadlines, which in turn boosts the average accuracy to twice that of MultiStage.

IV-B Analyzing the impact of head scheduling and projection

In this section, we investigate how head scheduling and projection, as separate components, impact the performance. For that purpose, we compare the proposed method with the alternative head scheduling methods listed below:

RoundRobin: The detection heads are scheduled in a round-robin order, giving equal priority to all heads.
ClsScrSum: This method makes prioritization based on the sum of classification scores calculated on-the-fly. It runs the classification part of all detection heads and individually takes their sums after applying sigmoid function and filtering it with the predefined score threshold. The sums are multiplied with the age of each head similar to the proposed method to obtain final priorities. Afterwards, the remaining part of the chosen detection heads are executed while skipping others.
NearOptimal: This method prioritizes the heads based on their ages, i.e., the time elapsed since their last usage, while avoiding running any head that does not have a corresponding ground truth annotation in the frame being processed. Since this method relies on ground truth data, we use it as an upper bound against which the effectiveness of the proposed method can be judged.

Fig. 9: Detection accuracy of different head scheduling methods without projection.

Fig. 10: Average detection accuracy of different head scheduling methods without projection.

IV-B1 Impact of head scheduling algorithm

Figures 9 and 10 illustrate the detection performance of the aforementioned head scheduling algorithms when projection is disabled. RoundRobin circulates through all heads and provides a simple solution that maintains safety by letting each head take its turn over time. However, it does not have a prioritization mechanism that can boost the accuracy by demoting the heads having low confidence. On the other hand, ClsScrSum and the proposed method utilize the confidence scores to prioritize the heads having high confidence, while maintaining safety through aging. The overhead of ClsScrSum grows larger as the deadline becomes tighter, due to the requirement of running the classification part of all heads at all times. The proposed method avoids this overhead by taking advantage of temporal locality, and gives the best results, which are close to those of the near-optimal solution.

IV-B2 Impact of projection

Figure 11 shows the average accuracy improvement that comes from enabling projection of previously generated bounding boxes. Since projection depends on the recent history of detection head usage, all detection heads should be executed within a time limit so that the projections can be made for the skipped heads. This can be ensured either through round-robin scheduling or through aging. Because each of the four compared algorithms utilizes one of these two, all of them enjoy a similar improvement through projection, as they are capable of keeping the projection data fresh.

Figure 12 shows that the proposed method is still the best performing one among the projection-enabled methods, and that it again comes very close to the near-optimal solution. The takeaway is that the head scheduling algorithm still plays an important role when projection is enabled.

Fig. 11: Average detection accuracy with and without projection.

Fig. 12: Detection accuracy versus other projection capable methods.

* Numbers are in milliseconds.
Method	Synchronization overhead	Scheduling overhead	Projection overhead
RoundRobin	0.50	0.00	1.10
Ours	0.50	1.00	1.10
ClsScrSum	0.50	4.25	2.55

TABLE V: Average overhead.

* Numbers are in milliseconds.
Projection technique	Overhead (ms)
Blocking single process	30.1
Blocking six processes	6.8
Asynchronous six processes	1.1

TABLE VI: Average projection overhead for our method.

IV-C Overhead Analysis

Table V shows the average additional time needed to finalize execution for the three versions of our anytime-lidar approach (which differ only in the head-selection method) compared to the baseline PointPillars network. First, all listed methods incur 0.5 milliseconds of overhead from the synchronization done prior to scheduling. Second, detection head scheduling itself adds overhead that depends on the algorithm used. RoundRobin has the least overhead since its decision can be made in constant time, whereas the proposed method needs to compute priorities. The main source of overhead for the ClsScrSum method is the requirement of running classification convolutions for all detection heads, including the ones that are subsequently skipped. Lastly, there is the overhead of projection: the bounding boxes have to be distributed to the parallel processes, and at the end the results have to be collected and merged with the detection results. These two operations add overhead that depends on the method used. ClsScrSum needs to do more projection because it skips more heads for any given deadline than the other methods, which is why it suffers most from projection. In all, the total added overhead of our proposed approach (Ours) is less than 3% of a 100 ms deadline.
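The totals behind this statement follow directly from Table V. The short script below sums the three overhead components per method and expresses each total as a share of a 100 ms deadline; the variable names are ours, and the numbers are copied from the table.

```python
# Per-stage overheads in milliseconds, copied from Table V.
overhead_ms = {
    "RoundRobin": {"synchronization": 0.50, "scheduling": 0.00, "projection": 1.10},
    "Ours":       {"synchronization": 0.50, "scheduling": 1.00, "projection": 1.10},
    "ClsScrSum":  {"synchronization": 0.50, "scheduling": 4.25, "projection": 2.55},
}
DEADLINE_MS = 100.0

# Total overhead per method and its share of the deadline.
totals = {method: sum(parts.values()) for method, parts in overhead_ms.items()}
for method, total in totals.items():
    share = 100.0 * total / DEADLINE_MS
    print(f"{method:<10} {total:.2f} ms  ({share:.2f}% of a {DEADLINE_MS:.0f} ms deadline)")
```

With the Table V values, the totals are 1.60 ms for RoundRobin, 2.60 ms for Ours and 7.30 ms for ClsScrSum, so the proposed method stays below 3% of a 100 ms deadline while ClsScrSum does not.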

As mentioned in Section III-D, the projection overhead is mitigated through asynchronous parallel processing. Table VI shows the alternative cases, in which the projections are done after the GPU execution finishes. The timings show that asynchronous execution with parallel processes reduces the average projection overhead from 30.1 ms (blocking, single process) and 6.8 ms (blocking, six processes) to 1.1 ms.
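The difference between the blocking and asynchronous variants in Table VI comes down to when the projection work is submitted and when its result is awaited. The sketch below shows the asynchronous pattern with Python's `concurrent.futures`. It is an illustration only: it uses a thread pool instead of the parallel processes described above to stay short and portable (for CPU-bound work in CPython, `ProcessPoolExecutor` would be the closer match), and `project_chunk` and `run_network` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Point = Tuple[float, float]


def project_chunk(chunk: List[Point], shift: Point) -> List[Point]:
    """Stand-in for projecting one group of boxes (here: a plain translation)."""
    return [(x + shift[0], y + shift[1]) for x, y in chunk]


def run_network() -> str:
    """Stand-in for the GPU forward pass that the projection overlaps with."""
    return "detections"


def process_frame(chunks: List[List[Point]], shift: Point,
                  pool: ThreadPoolExecutor) -> Tuple[str, List[Point]]:
    # 1) Hand the projection work to the workers and return immediately.
    futures = [pool.submit(project_chunk, c, shift) for c in chunks]
    # 2) Meanwhile, the main thread runs the detector.
    detections = run_network()
    # 3) Collect the projected boxes; ideally they are already finished.
    projected = [p for f in futures for p in f.result()]
    return detections, projected


with ThreadPoolExecutor(max_workers=6) as pool:
    dets, proj = process_frame([[(0.0, 0.0)], [(1.0, 2.0)]], (0.5, 0.0), pool)
print(dets, proj)
```

In the blocking variants, step 1 would only start after step 2 has finished, so the full projection time is added to the frame latency; in the asynchronous variant only the distribution and collection costs remain visible.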

Our approach does increase the memory the model requires from 185 MiB to 281 MiB (an increase of 96 MiB, or about 52%), as a result of the duplicated detection heads needed to support the multi-exit imprecise backbone (Section III-B). Since the target platform has memory on the order of GiB, we consider this overhead acceptable.

V Related Work

Lidar-based object detection is important for many autonomous driving frameworks [15]. The release of large-scale autonomous driving datasets [3, 26, 9] has allowed researchers to develop lidar-based deep learning techniques that achieve remarkable detection performance. One of the state-of-the-art methods in this field is PointPillars [12], employed by industry and open-source autonomous driving frameworks [8, 1]. It is regarded as capable of real-time performance owing to its high execution speed alongside strong detection accuracy. Besides PointPillars, many other methods promising low execution latency alongside strong detection accuracy have been presented [29, 14, 34, 25, 23, 24, 28]. However, all these networks are still computationally expensive and cannot be dynamically adjusted in a deadline-aware manner.

In the broader AI community, much work has been done on model compression, such as weight quantization and pruning [5, 33, 32, 36, 18]. Although these works help reduce the computational cost of executing neural networks, they do not support a dynamic time and accuracy trade-off, which is needed for autonomous driving and other intelligent real-time systems.

Recently, many researchers have explored anytime perception, which enables deadline-aware neural network execution for real-time systems. For instance, Kim et al. [10] made a time and accuracy trade-off possible for an image classification network by iteratively adding layers and re-training it to have early exits. Lee et al. [13] provided a solution at the neuron level, prioritizing the subset of neurons that contribute most to accuracy while deactivating the others to save time. Bateni et al. [2] used per-layer approximation instead of early exits, and provided a scheduling solution for multiple DNN tasks. Yao et al. [31] also focused on scheduling multiple DNN tasks, but used imprecise computation with early exits. The focus of all these works was image classification, whereas the focus of the proposed method is object detection. Heo et al. [6] proposed a multi-path DNN architecture for anytime perception in vision-based object detection. Hu et al. [7] proposed reducing the resolution of less critical parts of the scene to reduce computational cost. Liu et al. [16, 17] break up individual frames into smaller sub-regions of different criticality with the help of lidar, and batch-process the important sub-regions to meet deadlines. However, all these efforts mainly focus on 2D vision data processing and do not address the unique characteristics of 3D point cloud processing.

In contrast, our work addresses the anytime perception problem in the lidar domain by considering how 3D object detection differs from 2D.

VI Conclusion

In this work, we presented a novel method of providing execution time flexibility to machine learning based object detection pipelines that use point clouds as input. Results have shown that the proposed method delivers satisfactory detection accuracy over a wide range of deadlines compared to baseline methods. This improvement comes from four major contributions: (i) modifying the backbone of the model to be partially executable with minimal effort, (ii) allowing a subset of detection heads to be skipped to enable a fine-grained execution time and accuracy trade-off, (iii) integrating a projection mechanism that compensates for the potential accuracy loss due to skipped heads, and (iv) a two-phase scheduler that manages the execution of the backbone, detection heads, and projection with the goal of maximizing detection accuracy while meeting the deadlines.

Acknowledgments

This research is supported in part by NSF grants CNS-1815959, CPS-2038923, and CPS-2038658.

References

[1] Baidu Apollo team (2017), Apollo: Open Source Autonomous Driving. Note: https://github.com/ApolloAuto/apollo Accessed: 2022-04-29 Cited by: §V.
[2] S. Bateni and C. Liu (2018) ApNet: approximation-aware real-time neural network. In 2018 IEEE Real-Time Systems Symposium (RTSS), Vol. , pp. 67–79. External Links: Document Cited by: §I, §V.
[3] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) NuScenes: a multimodal dataset for autonomous driving. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 11618–11628. External Links: Document Cited by: §IV, §IV, §V.
[4] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li (2020) Voxel R-CNN: towards high performance voxel-based 3d object detection. CoRR abs/2012.15712. External Links: Link, 2012.15712 Cited by: §I, §I, §II.
[5] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §V.
[6] S. Heo, S. Cho, Y. Kim, and H. Kim (2020) Real-time object detection system with multi-path neural networks. In 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Vol. , pp. 174–187. External Links: Document Cited by: §I, §I, §I, 2nd item, §V.
[7] Y. Hu, S. Liu, T. Abdelzaher, M. Wigness, and P. David (2021) On exploring image resizing for optimizing criticality-based machine perception. In 2021 IEEE 27th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), pp. 169–178. Cited by: §V.
[8] S. Kato, S. Tokunaga, Y. Maruyama, S. Maeda, M. Hirabayashi, Y. Kitsukawa, A. Monrroy, T. Ando, Y. Fujii, and T. Azumi (2018) Autoware on board: enabling autonomous vehicles with embedded systems. In 2018 ACM/IEEE 9th International Conference on Cyber-Physical Systems (ICCPS), Vol. , pp. 287–296. External Links: Document Cited by: §V.
[9] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet (2019) Level 5 perception dataset 2020. Note: https://level-5.global/level5/data/ Cited by: §V.
[10] J. Kim, R. Bradford, and Z. Shao (2020) AnytimeNet: controlling time-quality tradeoffs in deep neural network architectures. In 2020 Design, Automation Test in Europe Conference Exhibition (DATE), Vol. , pp. 945–950. External Links: Document Cited by: §I, §V.
[11] J. Kim, R. Bradford, M. Yoon, and Z. Shao (2020) ABC: abstract prediction before concreteness. In 2020 Design, Automation Test in Europe Conference Exhibition (DATE), Vol. , pp. 1103–1108. External Links: Document Cited by: §III-B.
[12] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) PointPillars: fast encoders for object detection from point clouds. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 12689–12697. External Links: Document Cited by: §I, §I, §I, §I, §II, §III, 1st item, §V.
[13] S. Lee and S. Nirjon (2020) SubFlow: a dynamic induced-subgraph strategy toward real-time dnn inference and training. In 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Vol. , pp. 15–29. External Links: Document Cited by: §I, §V.
[14] P. Li, H. Zhao, P. Liu, and F. Cao (2020) RTM3D: real-time monocular 3d detection from object keypoints for autonomous driving. CoRR abs/2001.03343. External Links: Link, 2001.03343 Cited by: §V.
[15] Y. Li and J. Ibanez-Guzman (2020) Lidar for autonomous driving: the principles, challenges, and trends for automotive lidar and perception systems. IEEE Signal Processing Magazine 37 (4), pp. 50–61. External Links: Document Cited by: §V.
[16] S. Liu, S. Yao, X. Fu, H. Shao, R. Tabish, S. Yu, A. Bansal, H. Yun, L. Sha, and T. Abdelzaher (2021) Real-time task scheduling for machine perception in intelligent cyber-physical systems. IEEE Transactions on Computers (), pp. 1–1. External Links: Document Cited by: §I, §V.
[17] S. Liu, S. Yao, X. Fu, R. Tabish, S. Yu, A. Bansal, H. Yun, L. Sha, and T. Abdelzaher (2020) On removing algorithmic priority inversion from mission-critical machine inference pipelines. In 2020 IEEE Real-Time Systems Symposium (RTSS), pp. 319–332. Cited by: §V.
[18] B. Minnehan and A. E. Savakis (2019) Cascaded projection: end-to-end network compression and acceleration. CoRR abs/1903.04988. External Links: Link, 1903.04988 Cited by: §V.
[19] NVIDIA Jetson AGX Xavier Developer Kit. Note: https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit Cited by: §I, §IV.
[20] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander (2021) Categorical depth distribution network for monocular 3d object detection. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 8551–8560. External Links: Document Cited by: §II.
[21] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §I.
[22] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497. External Links: Link, 1506.01497 Cited by: §I, §I, §II.
[23] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li (2020) PV-rcnn: point-voxel feature set abstraction for 3d object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 10526–10535. External Links: Document Cited by: §II, §V.
[24] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li (2021) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (8), pp. 2647–2664. External Links: Document Cited by: §II, §V.
[25] M. Simon, S. Milz, K. Amende, and H. Gross (2018) Complex-yolo: real-time 3d object detection on point clouds. CoRR abs/1803.06199. External Links: Link, 1803.06199 Cited by: §V.
[26] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov (2020) Scalability in perception for autonomous driving: waymo open dataset. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 2443–2451. External Links: Document Cited by: §V.
[27] O. D. Team (2020) OpenPCDet: an open-source toolbox for 3d object detection from point clouds. Note: https://github.com/open-mmlab/OpenPCDet Cited by: §IV.
[28] Y. Yan, Y. Mao, and B. Li (2018) SECOND: sparsely embedded convolutional detection. Sensors (), pp. 3337–3354. External Links: Document Cited by: §II, §III, §V.
[29] B. Yang, W. Luo, and R. Urtasun (2018) PIXOR: real-time 3d object detection from point clouds. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. , pp. 7652–7660. External Links: Document Cited by: §III, §V.
[30] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) Std: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960. Cited by: §I, §I.
[31] S. Yao, Y. Hao, Y. Zhao, H. Shao, D. Liu, S. Liu, T. Wang, J. Li, and T. Abdelzaher (2020) Scheduling real-time deep learning services as imprecise computations. In 2020 IEEE 26th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Vol. , pp. 1–10. External Links: Document Cited by: §I, §V.
[32] S. Yao, Y. Zhao, H. Shao, S. Liu, D. Liu, L. Su, and T. F. Abdelzaher (2018) FastDeepIoT: towards understanding and optimizing neural network execution time on mobile and embedded devices. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems, SenSys 2018, Shenzhen, China, November 4-7, 2018, G. S. Ramachandran and B. Krishnamachari (Eds.), pp. 278–291. External Links: Link, Document Cited by: §V.
[33] S. Yao, Y. Zhao, A. Zhang, L. Su, and T. F. Abdelzaher (2017) Compressing deep neural network structures for sensing systems with a compressor-critic framework. CoRR abs/1706.01215. External Links: Link, 1706.01215 Cited by: §V.
[34] T. Yin, X. Zhou, and P. Krähenbühl (2021) Center-based 3d object detection and tracking. CVPR. Cited by: §II, §III, §V.
[35] X. Zhou, V. Koltun, and P. Krähenbühl (2020) Tracking objects as points. In European Conference on Computer Vision, pp. 474–490. Cited by: §I, §I.
[36] Y. Zhou, S. Moosavi-Dezfooli, N. Cheung, and P. Frossard (2017) Adaptive quantization for deep neural network. CoRR abs/1712.01048. External Links: Link, 1712.01048 Cited by: §V.
[37] B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu (2019) Class-balanced grouping and sampling for point cloud 3d object detection. CoRR abs/1908.09492. External Links: Link, 1908.09492 Cited by: §II, §III.