Effective Multi-User Delay-Constrained Scheduling with
Deep Recurrent Reinforcement Learning

Pihe Hu, Ling Pan, Yu Chen, Zhixuan Fang, Longbo Huang Tsinghua UniversityBeijingChina hph19, pl17, c-y19@mails.tsinghua.edu.cn, zfang, longbohuang@tsinghua.edu.cn

Abstract.

Multi-user delay constrained scheduling is important in many real-world applications including wireless communication, live streaming, and cloud computing. Yet, it poses a critical challenge since the scheduler needs to make real-time decisions to guarantee the delay and resource constraints simultaneously without prior information of system dynamics, which can be time-varying and hard to estimate. Moreover, many practical scenarios suffer from partial observability issues, e.g., due to sensing noise or hidden correlation. To tackle these challenges, we propose a deep reinforcement learning (DRL) algorithm, named Recurrent Softmax Delayed Deep Double Deterministic Policy Gradient ( $R S D 4$ )¹¹1Code available at https://github.com/hupihe/RSD4, which is a data-driven method based on a Partially Observed Markov Decision Process (POMDP) formulation. $R S D 4$ guarantees resource and delay constraints by Lagrangian dual and delay-sensitive queues, respectively. It also efficiently tackles partial observability with a memory mechanism enabled by the recurrent neural network (RNN) and introduces user-level decomposition and node-level merging to ensure scalability. Extensive experiments on simulated/real-world datasets demonstrate that $R S D 4$ is robust to system dynamics and partially observable environments, and achieves superior performances over existing DRL and non-DRL-based methods.

Delay-constrained, Scheduling, Partial Observability, Deep Reinforcement Learning

1. Introduction

Due to the emergence of many real-time interactive applications, e.g., online games, virtual reality (VR), and cloud computing, and the increasingly more rigid user requirements, delay-constrained scheduling has become a central problem in guaranteeing the satisfying quality of experience in many areas. For example, express delivery is a typical delay-sensitive scheduling problem. According to (Ma, 2017), a small increase in the delivery time will significantly impact customers’ perceived ambiguity and risk, and reduce satisfaction. Delay-constrained scheduling is also critical for data communications (Zhang et al., 2019), video streaming (Bhattacharyya et al., 2019), and data center management (Qu et al., 2017).

However, delay-constrained scheduling is challenging in many aspects. First, the scheduler needs to satisfy latency and resource constraints, while the delay metric depends on the overall dynamics and control of the system across time, and the resource constraint further couples scheduling decisions. Second, system dynamics, e.g., user channels in mobile networks, are hard to trace since distributions of underlying random components can be highly dynamic and correlated. Third, practical scheduling systems usually face a large number of users and have complex network structures, and demand highly scalable solutions for large-scale and multi-hop scenarios. For example, live video platforms such as YouTube and Instagram have thousands of millions of daily active users (Dao et al., 2022). Fourth, due to sensing noise and hidden correlation, practical systems can also suffer from partial observability issues, e.g., most IoT devices cannot have perfect knowledge of a dynamic channel environment due to hardware limitation and short sensing time (Xie et al., 2020), and channel states in network systems can also be hard to obtain (Li and Neely, 2013). This universality of partial observability in real-world scheduling problems demands highly robust algorithms.

Many algorithms have been proposed for scheduling problems based on different methods, including queueing-based methods, e.g., (Huang et al., 2015; Scully et al., 2019), optimization-based methods, e.g., (Hou and Kumar, 2010; Li et al., 2019), dynamic programming (DP)-based algorithms, e.g., (Singh and Kumar, 2018; Chen and Huang, 2018), and Lyapunov control-based approaches, e.g., (Pan and Chen, 2018; Lv et al., 2021). However, these approaches either require prior knowledge about system dynamics, or suffer from curse-of-dimensionality due to large state space, or focus on stability constraints rather than delay. To overcome the above limitations, we propose a deep reinforcement learning (DRL)-based algorithm, named Recurrent Softmax Delayed Deep Double Deterministic Policy Gradient ( $R S D 4$ ), whose procedure is depicted in Figure 1. $R S D 4$ builds upon the recurrent deterministic policy gradient (Heess et al., 2015) and softmax deterministic policy gradient (Pan et al., 2020), and introduces several novel components for handling the scheduling problem in partially observed settings. Specifically, $R S D 4$ is an end-to-end method based on a partially observed Markov decision process (POMDP) formulation with a Lagrange dual update. It does not require any prior knowledge of system dynamics and effectively captures hidden system correlation across time slots by a recurrent module. $R S D 4$ makes use of a softmax operator to improve value estimation in training and robustness in handling complex system dynamics. Finally, it introduces unified training for improving training efficiency, and user-level decomposition and node-level merging to support large-scale and multi-hop scenarios.

Figure 1. $R S D 4$ framework for delay-constrained scheduling.

The $R S D 4$ framework has several unique features that make it suitable for handling delay-sensitive scheduling tasks, especially in partially observable systems. Firstly, its base POMDP formulation is general to capture the partial observability issue that often arises in real-world systems. Secondly, the proposed $R S D 4$ algorithm possesses key advantages of existing DRL algorithms, i.e., being a data-driven and end-to-end method that does not require prior knowledge of the system dynamics. Thirdly, the $R S D 4$ algorithm is highly scalable, making it suitable for practical implementation in large-scale resource-constrained scenarios, where the system observation can be partial and even not instantaneous. We conduct extensive experiments to validate the performance of $R S D 4$ on both simulated environments and real-world datasets. The results demonstrate that $R S D 4$ significantly outperforms classical non-DRL methods and existing DRL benchmarks in various scenarios. Our framework and results shed light on designing scalable and efficient DRL algorithms for scheduling complex systems.

The main contributions of this paper are summarized as follows.

(i) We formulate a general POMDP framework for investigating large-scale multi-user delay-constrained scheduling problems with average resource constraint, which is suitable for partially observable systems, and provides two novel functions for scalability, i.e., user-level decomposition and node-level merging.

(ii) We propose a novel DRL-based algorithm, $R S D 4$ , to tackle the partial observability problem and achieve robust value prediction. $R S D 4$ builds upon the recurrent policy gradient method and softmax-based value estimation. It introduces a unified training approach to improve training efficiency. It also adopts a double-branch neural network architecture and delayed policy update to learn a robust policy against different system dynamics.

(iii) We conduct extensive experiments on real datasets, showing that $R S D 4$ efficiently learns system dynamics under both large-scale and multi-hop cases, and significantly outperforms existing scheduling methods, especially in various partially-observable settings.

2. Related Work

Many existing works have studied the scheduling problem. Among the many techniques adopted, four methodologies have received much attention, including queueing theory-based method, e.g., (Huang et al., 2015; Scully et al., 2019), optimization-based method, e.g., (Hou and Kumar, 2010; Li et al., 2019), dynamic programming-based control, e.g., (Singh and Kumar, 2018; Chen and Huang, 2018), Lyapunov-based optimization, e.g., (Pan and Chen, 2018; Lv et al., 2021). However, these approaches either do not explicitly capture the delay constraint or require precise system dynamics which can be highly non-trivial in real systems. Besides, DP-based algorithms often suffer from the curse-of-dimensionality and are not scalable to large-scale scenarios. DRL has been receiving much attention in the scheduling field, due to its generalization and scalability capability, and has been adopted in several scheduling scenarios, e.g., video streaming (Mao et al., 2017), Multipath TCP control (Zhang et al., 2019), network reconfigurability (Bhattacharyya et al., 2019), MAC scheduling (Moon et al., 2021), and resource-constrained scheduling, e.g., (Nasir and Guo, 2019; Meng et al., 2020; Xiao et al., 2017; He et al., 2019). However, the aforementioned works either do not ensure average resource constraints or require a perfectly observable system state as input. In addition, they do not consider large-scale or multi-hop systems.

Our proposed $R S D 4$ is a data-driven end-to-end algorithm, which well handles the partial observability issue under the POMDP formulation. Besides, the delay and resource constraint are handled by delay-sensitive queues and the Lagrangian dual, respectively. It also adopts user-level decomposition and node-level merging to significantly extend the scalability.

3. Problem and Preliminary

In this section, we present our scheduling problem description. For ease of presentation, we focus on the single-hop scheduling problem in Section 3.1 and the corresponding Lagrangian dual in Section 3.2. The multi-hop problem will be presented later in Section 4.3.

3.1. The Scheduling Problem

We consider the scheduling problem illustrated in Figure 2. Time is divided into discrete slots $t \in {0, 1, 2, . . .}$ . At the beginning of each time slot, the scheduler first receives job arrivals, e.g., data packets in a network or parcels in a delivery station. The number of job arrivals for user $i$ at time slot $t$ is denoted as $A_{i} (t)$ . We denote $A (t) = [A_{1} (t), . . ., A_{N} (t)]$ . Each job for user $i$ is associated with a strict delay constraint $τ_{i}$ , i.e., a job needs to be served within $τ_{i}$ slots upon its arrival and will be outdated and discarded otherwise.

Figure 2. A general delay-constrained scheduling problem in a single-hop network. Jobs arrive at the server and need to be delivered to their destinations before deadline.

The buffer model

Jobs arriving at the system are first placed in a buffer, conveniently modeled by a set of delay-sensitive queues. Specifically, the buffer contains $N$ separate delay-sensitive queues of infinite size, one for each user. The state of each queue $i$ at time slot $t$ is denoted by $B_{i} (t) = [B_{i}^{0} (t), B_{i}^{1} (t), . . ., B_{i}^{τ_{i}} (t)]$ , where $B_{i}^{τ} (t)$ stands for the number of jobs for user $i$ with a remaining time of $τ$ timeslots until expiration for $1 \leq τ \leq τ_{i}$ .

The scheduling and service model

At every time slot $t$ , the scheduler makes decision on the resources allocated to jobs in the buffer. The decision for user $i$ is denoted as $e_{i} (t) = [e_{i}^{0} (t), e_{i}^{1} (t), . . ., e_{i}^{τ_{i}} (t)]$ , where $e_{i}^{τ} (t) \in [0, e_{max}]$ denotes the resource, e.g., energy, allocated to serve each job in queue $i$ with deadline $τ$ . Each scheduled job then goes through a service channel, e.g., a wireless channel, whose condition is random and its value at time slot $t$ for user $i$ is denoted by $c_{i} (t)$ . We denote the service channel conditions at timeslot $t$ as $c (t) = [c_{1} (t), c_{2} (t), . . ., c_{N} (t)]$ . For a user $i$ job with allocated resource $e$ and channel condition $c$ , the probability of successful service is denoted as $P_{i} (e, c)$ , where $e = 0$ means that the job will not be served and $P_{i} (0, c) = 0$ . Also, $P_{i} (e, \cdot) > 0$ for all $e > 0$ . If a job is scheduled but fails in service, it remains in the buffer if it is not outdated. The instantaneous resource consumption at time slot $t$ is denoted as $E (t) = \sum_{i = 1}^{N} e_{i}^{⊤} (t) B_{i} (t)$ and the average resource consumption is $¯ ¯¯ ¯ E = {lim}_{T \to \infty} \frac{1}{T} \sum_{t = 1}^{T} E (t)$ .

The system objective

For user $i$ , the number of successfully served jobs at time slot $t$ is denoted as $d_{i} (t)$ . Each user is given a weight $β_{i}$ , and the weighted instantaneous throughput is denoted as $D (t) = \sum_{i = 1}^{N} β_{i} d_{i} (t)$ . The objective of the scheduler is to maximize the weighted average throughput, defined as $¯ ¯¯¯ ¯ D = {lim}_{T \to \infty} \frac{1}{T} \sum_{t = 1}^{T} D (t)$ , subject to the average resource consumption limit, i.e., ²²2We assume w.l.o.g. that all corresponding limits exist. The results can be generalized with $limsup$ and $liminf$ definitions otherwise.

(1)		$P max e_{i} (t) : 1 \leq i \leq N, 1 \leq t \leq T$		$lim T \to \infty \frac{1}{T} T \sum t = 1 N \sum i = 1 β_{i} d_{i} (t)$
	s.t.		$lim T \to \infty \frac{1}{T} T \sum t = 1 N \sum i = 1 e_{i}^{⊤} (t) B_{i} (t) \leq E_{0}$

where $E_{0}$ is the average resource consumption limit. We denote its optimal value as $T^{*}$ .

Note that although our problem description adapts the formulation in (Singh and Kumar, 2018), our goal is to develop practical and scalable solutions to scheduling in real-world environments, i.e., without assumptions on ergodicity of dynamics and allow the randomness to be arbitrarily correlated with hidden factors. Note here we allow resources to be allocated without hard constraint. We will show in Section 6.3 that hard constraints can be easily incorporated with only minimal impact on performance.

3.2. Lagrange Dual

Define the following Lagrangian function of problem (1) for handling the average resource budget $E_{0}$ :

(2)

L (π, λ) = lim T \to \infty \frac{1}{T} T \sum t = 1 N \sum i = 1 [β_{i} d_{i} (t) - λ e_{i}^{⊤} (t) B_{i}^{⊤} (t)] + λ E_{0}

with respect to control policy $π$ and Lagrange multiplier $λ$ . Denote $g (λ)$ the Lagrange dual function for fixed Lagrange multiplier $λ$ :

(3)

g (λ) = max π L (π, λ) = L (π^{*} (λ), λ),

where the maximizer is denoted as $π^{*} (λ)$ . Using Lemma 3 in (Singh and Kumar, 2018), the optimal timely throughput $T^{*}$ equals the optimal value of the dual problem, i.e., $T^{*} = {min}_{λ \geq 0} g (λ) = g (λ^{*})$ , where $λ^{*}$ is the optimal Lagrange multiplier. To find the optimal policy $π^{*} (λ^{*}) = arg {max}_{π} L (π, λ^{*})$ , it suffices to find optimal Lagrangian multiplier $λ^{*}$ . Denote the consumed resource under policy $π$ in time slot $t$ as $E_{π} (t)$ . The Danskin’s Theorem in (Bertsekas et al., 2003) states that $g^{'} (λ) = \frac{\partial L (π^{*}, λ)}{\partial λ} = E_{0} - E_{π^{*} (λ)}$ , where $E_{π^{*} (λ)} = {lim}_{T \to \infty} \frac{1}{T} \sum_{t = 1}^{T} E_{π^{*} (λ)} (t)$ denotes the average resource consumption. Thus, the optimal policy $π^{*} (λ^{*})$ can be obtained by recovering the dual function $g (λ)$ for some $λ$ and taking gradient descent to find the optimal $λ^{*}$ . While the theoretical understanding is clear, finding the optimal $π^{*} (λ)$ for a given $λ$ in a practical system is non-trivial, due to the following main reasons:

System dynamics are hard to trace since distributions can be highly dynamic and correlated in many scenarios.
The system can only be partially observed, e.g., due to sensing limitation and noise and other hidden factors.
Practical scheduling systems usually have a large number of users and complex multi-hop topology, which demand highly scalable solutions.

Prior works (Singh and Kumar, 2018; Chen and Huang, 2018) uses DP to find the maximizer $π^{*} (λ)$ , which requires the prior knowledge of system dynamics and suffers from the curse-of-dimensionality in large systems, and may not directly apply to partially observable systems. These motivate us to design a DRL-based framework with POMDP formulation in Section 4.

4. Overall Framework

In this section, we present our novel POMDP formulation for solving the scheduling problem $P$ in Section 3.1, which supports user-level decomposition and node-level merging for scalability.

4.1. The POMDP Formulation

We now specify the POMDP formulation for the scheduling problem, under which the $R S D 4$ can be applied to find the optimal policy $π^{*} (λ^{*})$ . Specifically, the POMDP is represented by $M = ⟨ S, O, A, r, P, γ ⟩$ , where $S$ is the state space, $O$ is the observation space, $A$ is the action space, $r$ is the reward function, $P$ denotes the transition matrix, and $γ$ stands for the discount factor.

State and Observation

The overall system state $s_{t}$ includes $A (t)$ , $B_{1} (t)$ ,…, $B_{N} (t)$ , $c (t)$ and other information of the underlying MDP unobservable by the scheduler, e.g., random hyperparameters of successful transmission probability, or mobile user position that affects traffic arrival and channel. We consider $s_{t}$ to be partially-observed since partial observable scenarios are common in scheduling problems. This implies that the actual observation $o_{t}$ is a subset of $s_{t}$ and the exact form of $o_{t}$ depends on environment settings. For example, Figure 3(a) shows the observed system of a four-use case, where the observation $o_{t} = [B_{1} (t), . . ., B_{N} (t), c (t)]$ only includes buffer with delay-sensitive queues and channel states but not other factors.

Observations for a four-user system (time index omitted for brevity). — (a) Observed system

Action and Reward

At time slot $t$ , the control action is denoted by $a_{t} = [e_{1} (t), . . ., e_{N} (t)]$ , and the reward is set to

(4)

r_{t} = D (t) - λ E (t),

which is the instantaneous weighted throughput $D (t)$ minus the resource consumption $E (t)$ weighted by the multiplier $λ$ .

Learning Objective

Under POMDP, an optimal agent needs to access the entire history $h_{t} = (o_{1}, a_{1}, o_{2}, a_{2}, . . ., a_{t - 1}, o_{t})$ and learn a deterministic policy $π (\cdot; ϕ)$ parameterized by $ϕ$ , which maps from the history to the action space, with the objective of maximizing the expected long-term rewards

J (π (\cdot; ϕ)) = E [T \sum t = 0 γ^{t} r (s_{t}, a_{t}) ∣ s_{0}, a_{0}, π (\cdot; ϕ)] .

Remark 1 ().

With the reward setting in (4), when $γ = 1$ , the cumulative discounted reward is $R = \sum_{t = 1}^{T} γ^{t} r_{t} = T L (π, λ) - λ T E_{0}$ , which differs from the Lagrangian function value in (2) by $- λ T E_{0}$ . This means that an algorithm maximizing the expected rewards $J (π (\cdot; ϕ))$ is also a maximizer for the Lagrangian function, which is the objective of $R S D 4$ (detailed in Section 5).

4.2. User-Level Decomposition

The system state contains the buffer information whose size is proportional to the user number. This makes learning much harder in large-scale systems, due to the need to train neural networks with more hyperparameters, which largely limits the scalability of DRL-based methods. To overcome this issue, we propose a user-level decomposition technique, such that $R S D 4$ finds the optimal policy with small neural networks even under large-scale scenarios. Our user-level decomposition is different from the packet-level decomposition in (Singh and Kumar, 2018; Chen and Huang, 2018), primarily due to the fact that packet-level decomposition requires a much larger number of neural network parameters, resulting in poor performance.

Specifically, we define the user-level decomposed Lagrangian function for user $i$ as

(5)

L_{i} (π_{i}, λ_{i}) = lim T \to \infty \frac{1}{T} [T \sum t = 1 β_{i} d_{i} (t) - λ_{i} e_{i}^{⊤} (t) B_{i}^{⊤} (t)],

where $π_{i}$ is the decomposed policy for scheduling user $i$ ’s jobs in the buffer. Consequently, maximizing the Lagrangian function in (2) for a fixed $λ$ can be accomplished by maximizing (5) for each user $i$ ’s sub-problem separately with $λ_{i} = λ$ . Denote the optimal policy for user $i$ ’s sub-problem as $π_{i}^{*} (λ_{i})$ with multiplier $λ_{i}$ . After solving Problem (5) for each user with a fixed $λ_{i}$ , the optimal policy $π^{*} (λ)$ that maximizes the Lagrangian (2) can be derived by letting each user take its own optimal scheduling policy $π_{i}^{*} (λ_{i})$ with $λ_{i} = λ$ .

Decomposed POMDP

Based on the above intuition, we decompose the POMDP into user-level subproblems, where for each user $i$ , the action is $a_{t}^{(i)} = e_{i} (t)$ , and the reward becomes $r_{t}^{(i)} = β_{i} d_{i} (t) - λ e_{i}^{⊤} (t) B_{i} (t)$ at time slot $t$ . The observation is also decomposed as well, Figure 3(b) presents $o_{t}^{(i)} = [i, B_{i} (t), c_{i} (t)]$ for a four-user case, where the first index for user $i$ ’s sub-problem is required for algorithm training (explained next). This is a key step in $R S D 4$ . As we will see in Section 6.5, without decomposition, $R S D 4$ and other existing DRL algorithms can fail due to a large size of the state representation.

Unified Training

After decomposition, the state space is compressed, and there are $N$ different sub-POMDPs, whose dynamics of different users may not be the same. To avoid training $N$ different DRL agents, leading to linear growth of computation power, we propose the method of unified training. That is, to train different samples from different sub-environments together, and use an extra user index $i$ as the identifier to samples from the user $i$ ’s sub-problem, as shown in Figure 3(b). Consequently, the dimension of training samples remains the same regardless of the system scale.

Remark 2 ().

With observation decomposition and unified training, a single system observation $o_{t}$ is decomposed into $N$ separate sate $o_{t}^{1}, o_{t}^{2}, . . ., o_{t}^{N}$ , such that one step dynamics in the original environment creates $N$ samples for the replay buffer. Consequently, it does not require heavy parallel computation and greatly enriches the abundance of the replay buffer, such that common knowledge across different users’ sub-POMDPs is learned efficiently. With PODMP decomposition, the number of neural network parameters also remains small even for large-scale systems, because the dimension of observation remains unchanged after decomposition. This useful feature makes our framework highly scalable.

4.3. Multi-Hop and Node-level Merging

We present the multi-hop setting here and describe a technique called node-level merging for reducing complexity in this case. Specifically, in multi-hop networks, each flow can traverse multiple hops and the agent needs to decide which flows to serve and how many resources to allocate at each node. Besides, each node has an average resource constraint. Take Figure 4(a) as an example. There are three flows passing through three paths respectively, i.e., $F_{1} = {1 \to 2 \to 3 \to 5}$ with deadline $τ_{1}$ , $F_{2} = {2 \to 4 \to 6}$ with deadline $τ_{2}$ , and $F_{3} = {2 \to 3}$ with deadline $τ_{3}$ .

(a) A multi-hop network with three paths represented in different colors, each supports one flow. (b) The flows are aligned with starting nodes. — (a) Multi-hop network with three paths.

The idea of node-level merging is to augment the state, observation, action, and reward in the POMDP formulation for multihop scheduling. Figure 5 shows one example with the detailed procedure below.

State and Observation: The system state is obtained by merging buffer and channel states in different hops, as shown in Figure 5. Since jobs in a multi-hop flow differ only in node position and remaining time until expiration, we encode the system state by an aggregation of buffer states at the nodes. Besides, the buffer state at hop $j$ for flow $i$ is denoted as $B_{i}^{(j)} (t) = [B_{i}^{0 (j)} (t), B_{i}^{2 (j)} (t), . . ., B_{i}^{τ_{i} (j)} (t)]$ where $B_{i}^{τ (j)} (t)$ denotes the number of jobs for flow $i$ at hop $j$ with a remaining time of $τ$ timeslots until expiration for $0 \leq τ \leq τ_{i}$ . Denote the path length flow $i$ as $h_{i}$ . The node-level merging idea is to set the overall observation $o_{t}$ to be $[B_{1}^{(1)} (t), . . ., B_{1}^{(h_{1})} (t), . . ., B_{N}^{(h_{N})} (t), c (t)]$ and potentially other factors in the environment. Figure 5 shows a case when $o_{t} = [B_{1}^{(1)} (t), . . ., B_{1}^{(h_{1})} (t), . . ., B_{N}^{(h_{N})} (t), c (t)]$ .
Action: The action is similarly obtained by merging actions at different nodes, i.e., $a_{t} = [e_{1}^{(1)} (t), . . ., e_{1}^{(h_{1})} (t), . . ., e_{N}^{(h_{N})} (t)]$ , where $e_{i}^{(j)} (t)$ represent the resource allocated to user $i$ ’s $j$ -th node for $1 \leq j \leq h_{i}$ .
Reward: The reward is set to $r_{t} = D (t) - \sum_{k = 1}^{M} λ_{k} E^{(k)} (t)$ , where $λ_{k}$ and $E^{(k)} (t)$ correspond to the Lagrangian multiplier and resource consumption at node $i$ for $1 \leq k \leq M$ , and $M$ is the number of nodes involved in the scheduling.

Figure 5. Node-level Merging of a three-flow case. Flow $i$ ’ node $j$ is associated with a buffer $B_{j}^{i}$ . Flows are aligned with starting nodes then concatenated as observation $o_{t}$ .

A common approach for multi-hop scheduling is to address the scheduling problem in each node separately. However, in delay-constrained multi-hop networks, scheduling each node separately cannot satisfy the hard delay constraint. Multi-agent architecture (Xu et al., 2020; Zhou et al., 2021) has been adopted for multi-hop networks but with more computation resource requirements and is hard to train. Node-level merging provides a novel way of concatenating buffers of different hops in flows together, making its training similar to that for a single-hop scheduling problem except with a higher input dimension. This avoids training multiple agents for multi-hop networks, and efficiently reduces the complexity in training. It allows $R S D 4$ to perform well and outperforms other methods, even when the system scale in terms of state dimension is significantly increased in multihop networks (see Section 6.5).

5. Proposed Method: $R S D 4$

We present our novel Recurrent Softmax Delayed Deep Double Deterministic Policy Gradient ( $R S D 4$ ) algorithm in Algorithm 1. $R S D 4$ builds upon the recurrent deterministic policy gradient (Heess et al., 2015) and softmax deterministic policy gradient (Pan et al., 2020), and introduces several novel components for handling the scheduling problem in partially observed settings. Specifically, $R S D 4$ is a model-free DRL algorithm, which takes the popular actor-critic (Konda and Tsitsiklis, 2000) framework. It well handles partial observability issues by memory mechanism enabled by recurrent neural networks (RNNs) and resolves the overestimation problem in existing recurrent DRL methods with the softmax operator. Thus, it owns advantages from recurrent deterministic policy gradient (Heess et al., 2015) and softmax deterministic policy gradient (Pan et al., 2020). $R S D 4$ also adopts a double-branch architecture from (Peng et al., 2018) to better utilize the memory mechanism of the RNN, and implements a delayed policy update frequency to further reduce the variance in value estimate.

Initially, $R S D 4$ makes use of the state-action function $Q (h_{t}, a_{t}; θ)$ parameterized by $θ$ , which is defined as

(6)

\resizebox433.62pt$Q(ht,at;θ)=Est,at,...,st+T,at+T|ht,π(⋅;ϕ)[T∑i=0γir(st+i,at+i)]$,

where the expectation is taken with respect to the conditional probability $p (s_{t}, a_{t}, . . ., s_{t + T}, a_{t + T} | h_{t}, π (\cdot; ϕ))$ of the trajectory distribution associated with history $h_{t}$ and the policy $π (\cdot; ϕ))$ . $R S D 4$ initializes double critic networks and double actor networks, where critic networks $Q (h_{t}, a_{t}; θ)$ estimate the value of state-action pairs, and actor networks $π (\cdot; ϕ)$ are responsible for outputting control actions.

λ_{0}

, resource limit

E_{0}

, learning rate

α

, Precision

δ

, episode number

M

, episode length

T

, batch size

b

, target update rate

τ

λ_{1} \leftarrow λ_{0}, λ_{0} \leftarrow 0, k \leftarrow 1

2: while

| λ_{k} - λ_{k - 1} | > δ

3: Initialize

N

learning environments with

r_{t}^{(i)} = β_{i} d_{i} (t) - λ e_{i}^{⊤} (t) B_{i} (t)

for

i

-th environment.

4: Initialize critic networks

Q_{1}

Q_{2}

, and actor networks

π_{1}

π_{2}

with random parameters

θ_{1}, θ_{2}, ϕ_{1}, ϕ_{2}

. Initialize target network

θ_{1}^{-} \leftarrow θ_{1}, θ_{2}^{-} \leftarrow θ_{2}, ϕ_{1}^{-} \leftarrow ϕ_{1}, ϕ_{2}^{-} \leftarrow ϕ_{2}

5: Initialize replay buffer

D

6: for episodes

= 1

M

// Episodic interaction do

7: for

t = 1

T

8: for

i = 1

N

9: Receive sub observation

o_{t}^{i}

10:

h_{t}^{i} \leftarrow h_{t - 1}^{i}, a_{t - 1}^{i}, o_{t}^{i}

11: Select action

a_{t}^{i}

based on

π_{1}

and

π_{2}

12: end for

13: end for

14: Store

(o_{1}^{i}, a_{1}^{i}, r_{1}^{i}, . . ., o_{T}^{i}, a_{T}^{i}, r_{T}^{i})

D

for

i = 1

N

15: for

i = 1, 2

// Double learning do

16: Randomly sample a batch of

b

episodes:

B = {(o_{1}, a_{1}, r_{1}, . . ., o_{T}, a_{T}, r_{T})}

from

D

17: for

t = 1

T

// Recurrent softmax learning do

18: Sample

K

noise

ϵ \sim N (0, σ^{'})

19:

{^a}_{t} \leftarrow π_{i} (h_{t}; ϕ_{i}^{-}) + c l i p (ϵ, - c, c)

20:

^Q (h_{t}, {^a}_{t}) \leftarrow {min}_{j = 1, 2} (Q_{j} (h_{t}, {^a}_{t}; θ_{j}^{-}))

21: Compute

{softmax}_{β} (^Q (h_{t}, \cdot))

by Eq. (8)

22:

y_{t} \leftarrow r + γ (1 - d) {softmax}_{β} (^Q (h_{t}, \cdot))

23: end for

24: Update

θ_{i}

according to Bellman loss:

\frac{1}{N} \sum_{h \in B} \sum_{t} {(Q_{i} (h_{t}, a; θ_{i}) - y_{i})}_{i}^{2}

25: if episodes mod

d = 0

// Delayed update then

26: Update actor

ϕ_{i}

by recurrent policy gradient:

\begin{matrix} \nabla_{ϕ_{i}} J (ϕ_{i}) = & \frac{1}{N} \sum h \in B \sum t [\nabla_{ϕ_{i}} (π (h_{t}; ϕ_{i})) \nabla_{a} Q_{i} (h_{t}, a; θ_{i}) |_{a = π (h_{t}; ϕ_{i})}] \end{matrix}

27: Update target networks:

θ_{i}^{-} \leftarrow τ θ_{i} + (1 - τ) θ_{i}^{-}

ϕ_{i}^{-} \leftarrow τ ϕ_{i} + (1 - τ) ϕ_{i}^{-}

28: end if

29: end for

30: end for

31: Obtain policy

π_{k}

and evaluate

E_{π_{k}} .

32:

λ_{k} = λ_{k} + α (E_{π_{k}} - E_{0})

// Gradient update

33:

λ_{k - 1} \leftarrow λ_{k}

34: end while

35: Output

π_{k}

Algorithm 1

R S D 4

with decomposition

5.1. The Training Algorithm

Recurrent Network Architecture

Despite the success of RL in a number of challenging tasks, state-of-the-art RL algorithms such as TD3 (Fujimoto et al., 2018) are limited to solving fully-observable tasks. As a result, they can fail when faced with partially observable tasks as the problem considered here. To address this problem, compared to prior non-recurrent policy gradient methods, $R S D 4$ incorporates recurrency in designing the architecture for neural networks rather than simply using feedforward networks in the policy update. This strengthens the memory capability of $R S D 4$ , and enables it to learn hidden factors or temporal correlation of the system dynamics. We then update the policy by RDPG (Heess et al., 2015):

(7)

\nabla_{ϕ} J (π (\cdot; ϕ)) = E_{h_{t}} [\sum t \nabla_{ϕ} (π (h_{t}; ϕ)) \nabla_{a} Q (h_{t}, a; θ) ∣ ∣ a = π (h_{t}; ϕ)] .

Here to compute the RDPG, a sequence of episodes will be stored in the replay buffer $D$ as a training sample (line $14$ ). $R S D 4$ computes target values $(y_{1}, y_{2}, . . . ., y_{T})$ for each sampled episode using the recurrent networks in lines $17 - 23$ . Critics and actors are updated recurrently, as shown in line $24$ and $26$ , respectively.

Softmax Double Learning

Having a good estimate of the value function is critical for RL agents to achieve good performance (Pan et al., 2020). We therefore propose to incorporate the softmax operator for more accurate value function estimation in different scenarios. In lines $21 - 22$ , we compute the Q-function by softmax operator by:

(8)

{softmax}_{β} (Q (h, \cdot)) = \frac{E_{a \sim p} [\frac{exp (β Q (h, a; θ)) Q (h, a; θ)}{} p (a)]}{E_{a \sim p} [\frac{exp (β Q (h, a; θ))}{p (a)}]}

where $β$ denotes the inverse temperature and $p$ is the sampling distribution. The target value for critic $Q_{i}$ is given by $y_{i} = r + γ {softmax}_{β} (^Q (h, \cdot))$ in line 22, where $^Q (h, \cdot)$ denotes the value estimation function in line 20.

$R S D 4$ further adopts a delayed policy update mechanism from (Fujimoto et al., 2018) (line $25$ ) to avoid training divergence due to frequent updates of the policy. Thus, the policy network is updated at a lower frequency than the value network to minimize error before introducing a policy update, with the similar goal of making the scheduling policy more robust in various system dynamics.

5.2. Network Architecture

$R S D 4$ uses double actor-networks and double critic-networks. The architecture of critic networks in Figure 6, and actor networks are similar, with the difference being removing the action $a_{t}$ in the input and changing the output value $Q (s_{t}, a_{t})$ to action $a_{t}$ . The recurrent layers build upon Long-Short-Term-Memory (LSTM) (Hochreiter and Schmidhuber, 1997) to perform RDPG in Eq. (7), and there are two parallel branches, i.e., the fully connected branch and the LSTM branch. This architecture is firstly proposed in (Peng et al., 2018) and is effective for our scheduling problem, as validated in our experiments.

The LSTM branch is designed to strengthen the memory ability of $R S D 4$ algorithm, since it allows the agent to incorporate a large amount of network measurement history into its state space to capture the long-term temporal dependencies of actual system dynamics. The LSTM layer is embedded in the second layer of the multilayer perceptron feature extractor. Consequently, our $R S D 4$ algorithm can well handle various partially observable settings. The fully connected branch is designed to capture more information and improve expressiveness in the current time slot, which provides subsequent layers with more direct access to the current state without requiring information to filter through the LSTM branch. This makes $R S D 4$ more sensitive to the current system state, so that abrupt changes of environmental conditions, e.g., a sudden burst of arrivals or temporal channel condition degradation, can be detected rapidly (shown in Section 6.4.3).

The architecture of the critic network.
There are double parallel branches of the LSTM branch and the fully-connected branch, which are later concatenated together by a fully-connected layer to output Q value. — Figure 6. The architecture of the critic network. There are double parallel branches of the LSTM branch and the fully-connected branch, which are later concatenated together by a fully-connected layer to output Q value.

Remark 1 ().

The architecture in Figure 6 accounts for the ability to resolve partial observability issues. Most non-DRL-based methods and non-recurrent DRL algorithms do not directly handle this case. Our POMDP formulation and $R S D 4$ solution well handle such potentially non-stationary and partially observable dynamics by the memory mechanism enabled by LSTM, which helps the agent to learn from a batch of history and reduce the impact of temporal variability.

6. Experiment Results

We conduct extensive experiments on $R S D 4$ based on real-world datasets and simulated data. We first present an ablation study of $R S D 4$ and then compare performances of $R S D 4$ with existing DRL methods and classical non-DRL-based algorithms. Then, we design various hard settings to validate the ability of $R S D 4$ to handle partially observability and scalability. Each experiment is repeated with $5$ different random seeds, and the average result is presented.

6.1. Environment Setup

Below, we specify the single-hop environment based on the network in Figure 2. The multi-hop environment is constructed in a similar way and will be specified in Section 6.5.

Arrivals

The arrivals of different users are given by a LTE dataset (Loi, 2018), which records the traffic flow of mobile carriers’ 4G LTE network in approximately one year. We construct an environment with four types of arrivals given by the LTE dataset, which are visualized in Figure 7(a). The character of selected data records is given in Table 1, which simulates four representative tasks, i.e., file transmission, online forum, VR gaming, and text communication, according to their rates and delay requirements.

Service Channel Conditions

The channel states are given by a wireless 2.4GHz dataset (Taotao et al., 2021), which samples the received signal strength indicator (RSSI) in the check-in hall of Shenzhen Baoan International Airport. Each channel state is quantized into 4 states, each representing a RSSI level, which is visualized in Figure 7(b), whose average channel states are given in Table 1.

\topruleUser	Rate	$τ_{i}$	Character	Avg. Channel
\midrule1	1.96	6	File Transmission	1.79
2	0.91	6	Online Forum	1.83
3	2.46	1	VR Gaming	1.82
4	0.70	1	Text Communication	1.77
\bottomrule

Table 1. Summary of arrivals and channel states.

Heat maps of arrivals and channel conditions. — (a) Visulization of four arrivals in 5000 time slots.

Service Outcomes

The probability that a transmission for user $i$ is successful under channel state $c$ and allocated resource $e$ is

(9)

P_{i} (e, c) = \frac{2}{1 + exp [- 2 e / (f_{i}^{3} c)]} - 1

as that in (Chen and Huang, 2018), where $f_{i}$ denotes the distance between user $i$ and the server. This models a wireless downlink system.

6.2. Ablation Study

We first present an ablation study on network architectures and important hyperparameters including policy update frequency and learning rate in this subsection. The results will guide us in our algorithm parameter setting. To keep the underlying MDP well-defined with time-invariant distributions, in the ablation study, we use simulated arrivals and channels in the environment by pre-specifying their distributions.

6.2.1. Network Architectures

We compare different network architectures in Figure 8 with our architecture in Figure 6.³³3We only present the first several layers of the critic network, while the last two layers are the same as that in Figure 6. These architectures are tested with $R S D 4$ on the task of maximizing the Lagrangian function in Eq. (2) on $λ = {0, 0.1, . . ., 1}$ , since we find this range of $λ$ is the region of possible optimal multiplier $λ^{*}$ for our environment setting. We normalize the rewards obtained under different $λ$ and results are shown in Figure 9(a), where rewards obtained with user-level decomposition mentioned in Section 4.2 are presented as well. We find that double-branch architecture obtained maximum rewards in both situations, and rewards are further improved by taking the last action $a_{t - 1}$ as input to LSTM (see Figure 8). This validates the advantage of our proposed double-branch architecture and implies that taking action $a_{t - 1}$ in the last timestep into account can better capture hidden correlation across time.

Different Network Architectures (part). — (a) Single-branch with $a_{t - 1}$

Experiments of the ablation study. — (a) Different network architecture

6.2.2. Policy Delay

$R S D 4$ adopts a delayed policy update mechanism (line $25$ in Algorithm 1), to avoid training divergence due to overestimating a poor policy (Fujimoto et al., 2018). The policy network is updated at a lower frequency than the value network to minimize error before introducing a policy update. We evaluate different policy frequencies on the same simulated environment as Section 6.2.1 in Figure 9(b). The results show that a moderate policy frequency of $2$ improves peak reward in both original and decomposed cases.

6.2.3. Learning Rate

The learning rate $α$ in Algorithm 1 is critical. We execute Algorithm 1 under different $α$ values. Figure 10(a) and 10(a) show the iteration of $λ_{k}$ and resource consumptions, respectively. A learning rate of $0.1$ is too large such that ${λ_{k}}$ fluctuates largely, while a small learning rate e.g., $0.01$ does eliminate the large fluctuation of ${λ_{k}}$ but convergences too slow. Thus, we use a decaying learning rate, which halves the learning rate when $λ$ flips in three consecutive time slots, i.e., $α_{k + 1} = α_{k}$ if $λ_{k + 1} \geq λ_{k} \geq λ_{k - 1}$ or $λ_{k + 1} \leq λ_{k} \leq λ_{k - 1}$ , otherwise $α_{k + 1} = α_{k} / 2$ .

Iterations under different learning rates. — (a) $λ_{k}$

6.3. Performance Comparison

We compare performances of $R S D 4$ with existing DRL algorithms and classical non-DRL methods. The benchmark DRL algorithms include Twin Delayed DDPG (TD3) (Fujimoto et al., 2018) and Softmax DDPG (SD3) (Pan et al., 2020), both are state-of-the-art algorithms. We apply them with Lagrangian dual to ensure average resource constraints. The non-DRL algorithms include Programming, Uniform, and Earliest Deadline First (EDF) (Elsayed and Khattab, 2006):

(i) Programming: Solve the following static constrained programming problem $P_{s}$ for each time slot $t$ with resource constraint using convex programming (Shen and Yu, 2018) (time index $t$ omitted):

(10)		$P_{s} : max e$		$N \sum i = 1 β_{i} τ_{i} \sum τ = 1 B_{τ}^{i} P_{i} (e_{τ}^{i}, c_{i, τ})$
	s.t.		$N \sum i = 1 e_{i}^{⊤} B_{i} \leq E_{0}$

The optimal value $T_{s t a t i c}^{*}$ for $P_{s}$ serves as the static optimal throughput for the Problem $P$ in Eq. (1).

(ii) Uniform: Assign resources to different packets uniformly.

(iii) Earliest Deadline First (EDF): Assign all resources to packets in each user’s queue with the shortest deadline equally.

To also evaluate the case when the resource expenditure at each time may be strictly bounded, we consider the hard resource constraint $E_{m a x}$ for each time slot, i.e., scheduling decisions in each time slot cannot exceed $E_{m a x}$ . We introduce two ways to ensure this. The first way is to scale each component equally of $R S D 4$ ’s decision to ensure the total expenditure is bounded by $E_{m a x}$ (denoted as $R S D 4$ - $E_{m a x}$ -1). The second way is to allocate resources to packets with smaller remaining time until the total expenditure is $E_{m a x}$ and set all other allocations to zero (denoted as $R S D 4$ - $E_{m a x}$ -2).

Comparison of different algorithms. — (a) Throughput

The comparison of $R S D 4$ with other algorithms is shown in Figure 11 with $0 \leq E_{0} \leq 10$ , where arrivals and channel conditions are set according to real datasets summarized in Table 1. Here the channels are assumed to be unobservable to the scheduler. This scenario simulates a multi-user resource-constrained wireless base-station providing delay-constrained scheduling services. We find that performances of $R S D 4$ - $E_{m a x}$ -1 and $R S D 4$ - $E_{m a x}$ -2 are quite similar to $R S D 4$ when setting $E_{m a x} = 2 E_{0}$ in each group of experiments, which means that hard resource constraint can be achieved with our methods without much performance degradation. Besides, from Figure 11(a), we see that $R S D 4$ outperforms all benchmarks in each resource limit. In the small-medium resource regime with $0 \leq E_{0} \leq 7$ , the static optimality obtained by Programming ranks only lower than $R S D 4$ . However, in the large resource regime with $8 \leq E_{0} \leq 10$ , $R S D 4$ and TD3 outperform classical methods significantly. From Figure 11(b), we see that all DRL algorithms satisfy the average resource limit, while EDF fails to fully utilize available resources since packets with the shortest deadline are too few to consume all available resources. Results in Figure 11 illustrate the superiority of $R S D 4$ over the benchmarks.

6.4. Partially Observable Systems

We compare the performance of $R S D 4$ with other DRL algorithms to further validate the effectiveness of its recurrent module. Our comparison focuses on maximizing the Lagrangian function in Eq. (2) with the same environment of Section 6.3, which is essentially equivalent to the task of throughput maximization. In fact, each average resource limit $E_{0}$ corresponds to an optimal $λ^{*}$ according to Section 3.2. Thus, maxmizing the Lagrangian function on some $λ$ is equivalent to throughput maximization under some average resource limit of $E_{0} (λ)$ determined by $λ$ . Thus, focusing on maximizing the Lagrangian function in Eq. (2) allows us to examine a wide range of problems parameterized by $λ$ .

6.4.1. Missing Buffer State

We first consider an extreme environment where the buffer state is unobservable by the agent, and only arrivals and channels are given. Specifically, observation $o_{t} = [A (t), c (t)]$ , which implies the agent needs to memorize arrivals and service outcomes across multiple time slots to have accurate estimations of current system states. From Figure 12(a), we find that with buffer state, i.e., the underlying MDP is fully observable, $R S D 4$ and SD3 obtain similar maximum rewards under different $λ$ . When the buffer state is missing, $R S D 4$ still achieves almost the same maximum rewards, whereas non-recurrent DRL algorithms SD3 and TD3 suffer from significant performance loss.

6.4.2. Unobservable Hidden Factors

Another common type of partial observability comes from unobserved hidden factors in the underlying MDP, e.g., vehicle obstruction will influence in-tunnel wireless propagation channel which is hard to trace (Song et al., 2021). We design an environment where service outcomes are related to the time index, i.e., services are available only when the current time slot $t$ is a multiple of some period, e.g., wireless communication interfered by periodic jamming signals. In this case, the underlying MDP is partially observable since the hidden factor, i.e., the period, is unknown to the agent and the observation is $o_{t} = [B^{1} (t), . . ., B^{N} (t), c (t)]$ . From Figure 12(b), we find that $R S D 4$ outperforms TD3 and SD3 in both tested cases with periods $5$ and $10$ , and the performance gain of $R S D 4$ is more significant under large $λ$ case, which corresponds to the small resource regime.

Various partially observable settings. — (a) Environment without buffer state.

6.4.3. Time-varying Environments

We next investigate partial observability coming from the time variability of underlying system dynamics. We design two types of time variability by switching environment dynamics during the experiments, i.e., in one setting we double the rate of arrivals and in the other we change the channel statistics. In both cases the channel states are unobservable. The switch happens at slot $100, 000$ and results are shown in Figure 13 for the task of maximizing Lagrangian function under $λ = 0.5$ , where the evaluated rewards of training processes under the environments with switching in the middle and switching at the very beginning. In both cases, we find that, soon after switching, $R S D 4$ reaches the same rewards one can obtain by training in these environments from the very beginning, while TD3 and SD3 both suffer from reward loss after switching, even if they are trained from the beginning for this new environment. This validates the robustness and performance of $R S D 4$ in time-varying environments.

Experiments on switching environments. — (a) Switch on arrival strength

Remark 1 ().

Experimental results on these three partially observable environments show the superiority of $R S D 4$ at addressing partial observability issues. They also show the benefits of adopting a POMDP formulation in practical scheduling problems, since partial observability can arise from many sources. This comparison also sheds light on understanding the sub-optimality of non-recurrent DRL algorithms or classical non-DRL methods in POMDP.

6.5. Scalability by Decomposition and Merging

We now turn to investigate the scalability of the algorithm by conducting experiments in systems with a large number of users and with multi-hop structures. We focus on the setting where the channel states are assumed to be unobservable. We will show that the user-level decomposition and node-level techniques in $R S D 4$ make it a highly scalable solution.

6.5.1. Large-Scale System

We present the evaluated rewards of $R S D 4$ with other algorithms on the task of maximizing the Lagrangian function with $λ = 0.3$ on different numbers of users ranging from $4$ users to $400$ users in Figure 14(a). Here we generate arrivals and channels according to predefined random processes, so that system dynamics are known and we can get the optimal reward by DP. It is worth noting that computing the optimal induces extremely high computation overhead in large-scale tasks and requires accessing full system dynamics, and is not practical in real-world applications. We examine this case here emphasize the near-optimality and effectiveness of $R S D 4$ . All baselines are implemented with the same set of hyperparameters to ensure fairness.

From Figure 14(a), we find that when user number is less than or equals to $20$ , the performance of different baselines do not differ much, and all of them achieve near-optimal rewards. However, with an increasing number of users, all DRL algorithms, including $R S D 4$ without decomposition fail, as well as Uniform and EDF. Besides, $R S D 4$ always achieves the best and near-optimal performance regardless of the number of users. Figure 14(b) also shows the learning curve under $50$ users (from which the performance of different baselines start to diverge significantly), where only $R S D 4$ achieves the nearly optimal reward. We also see that $R S D 4$ without state-decomposition fails in this case. This demonstrates the necessity of state-decomposition, without which the algorithm parameter amount will become too large and prohibits efficient training.

Remark 2 ().

When the system scale is beyond the hypothesis dimension of the underlying neural network, e.g., when the user number is larger than $20$ in Figure 14(a), the performance of DRL algorithms will degrade rapidly, which means neural networks with more hyperparameters will be required. However, with more hyperparameters, neural networks are much harder to train and require significantly more computational power. With state decomposition, in contrast, $R S D 4$ efficiently controls the state dimension and still retains the near-optimal performance. This validates the effectiveness of our proposed user-level decomposition.

Experiments on scalability. — (a) Rewards on different user scales.

6.5.2. Multi-hop Network

We now turn to the multihop network setting. Recall that we propose node-level merging in Section 4.3 to schedule under multihop networks. We conduct experiments on a multi-hop network depicted in Figure 4(a), where there are three paths, $P a t h_{1} = 1 \to 2 \to 3 \to 5$ , $P a t h_{2} = 2 \to 4 \to 6$ , and $P a t h_{3} = 2 \to 3$ . $P a t h_{1}$ , $P a t h_{2}$ , and $P a t h_{3}$ has $3$ , $4$ , and $5$ flows passing through respectively. The $12$ flows have different deadlines ranging from $3$ to $5$ timeslots and the scheduler needs to make decisions at nodes $1$ , $2$ , $3$ and $4$ simultaneously, with average resource limits $10$ , $30$ , $0.3$ , $3$ , respectively. The arrival and channel states are drawn from real-world datasets specified above, and the throughput obtained by different algorithms are shown in Figure 15(a).

We see that $R S D 4$ achieves the maximum throughput and significantly outperforms other benchmarks, including TD3 and SD3 and non-DRL-based methods Programming, Uniform, and EDF. To further validate the scalability of our $R S D 4$ algorithm, we fix the four multiplier values to $0.5$ and compare it with other benchmarks on the task of maximizing the Lagrangian function for this multi-hop network. We test the algorithm with an increasing number of flows from $12$ flows to $180$ flows. $R S D 4$ shows a larger performance gain over other algorithms as the number of flows increase and saves computation resources by unified training.

Experiments on multihop networks. — (a) Throughput of different algorithms on a multi-hop network with 12 flows.

7. Conclusion

This paper studies the problem of multi-user latency-constrained scheduling with average resource constraints. To tackle partial observability and scalability issues, we propose a novel DRL algorithm, $R S D 4$ , which is a data-driven method based on a POMDP formulation. $R S D 4$ guarantees resource and delay constraints by Lagrangian dual and delay-sensitive queues, respectively. $R S D 4$ successfully tackles partially observable issues with a recurrent network module. It also enables robust value estimation with the softmax operator, and introduces the user-level decomposition and node-level merging techniques to ensure scalability. We conduct extensive experiments on both simulated environments and real-world datasets. Our results show that $R S D 4$ is robust to various system dynamics and partially observable settings, and significantly outperforms existing DRL/non-DRL-based benchmarks.

Acknowledgements

This work is supported in part by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2020AAA0108400 and 2020AAA0108403, the Tsinghua University Initiative Scientific Research Program, and Tsinghua Precision Medicine Foundation 10001020109.

References

D. Bertsekas, A. Nedic, and A. Ozdaglar (2003) Convex analysis and optimization. Vol. 1, Athena Scientific. Cited by: §3.2.
R. Bhattacharyya, A. Bura, D. Rengarajan, M. Rumuly, S. Shakkottai, D. Kalathil, R. K. Mok, and A. Dhamdhere (2019) Qflow: a reinforcement learning approach to high qoe video streaming over wireless networks. In Proceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 251–260. Cited by: §1, §2.
K. Chen and L. Huang (2018) Timely-throughput optimal scheduling with prediction. IEEE/ACM Transactions on Networking 26 (6), pp. 2457–2470. Cited by: §1, §2, §3.2, §4.2, §6.1.
N. Dao, A. Tran, N. H. Tu, T. T. Thanh, V. N. Q. Bao, and S. Cho (2022) A contemporary survey on live video streaming from a computation-driven perspective. ACM Computing Surveys (CSUR). Cited by: §1.
K. M. Elsayed and A. K. Khattab (2006) Channel-aware earliest deadline due fair scheduling for wireless multimedia networks. Wireless Personal Communications 38 (2), pp. 233–252. Cited by: §6.3.
S. Fujimoto, H. Van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. ICML. Cited by: §5.1, §5.1, §6.2.2, §6.3.
C. He, Y. Hu, Y. Chen, and B. Zeng (2019) Joint power allocation and channel assignment for noma with deep reinforcement learning. IEEE Journal on Selected Areas in Communications 37 (10), pp. 2200–2210. Cited by: §2.
N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver (2015) Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455. Cited by: §1, §5.1, §5.
S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §5.2.
I. Hou and P. Kumar (2010) Utility-optimal scheduling in time-varying wireless networks with delay constraints. In ACM MobiHoc, pp. 31–40. Cited by: §1, §2.
L. Huang, S. Zhang, M. Chen, and X. Liu (2015) When backpressure meets predictive scheduling. IEEE/ACM Transactions on Networking 24 (4), pp. 2237–2250. Cited by: §1, §2.
V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §5.
C. Li, P. Hu, Y. Yao, B. Xia, and Z. Chen (2019) Optimal multi-user scheduling for the unbalanced full-duplex buffer-aided relay systems. IEEE Transactions on Wireless Communications 18 (6), pp. 3208–3221. Cited by: §1, §2.
C. Li and M. J. Neely (2013) Network utility maximization over partially observable markovian channels. Performance Evaluation 70 (7-8), pp. 528–548. Cited by: §1.
N. Loi (2018) Predict traffic of lte network — kaggle. Note: https://www.kaggle.com/naebolo/predict-traffic-of-lte-networkAccessed: 2021-07 Cited by: §6.1.
L. Lv, C. Zheng, L. Zhang, C. Shan, Z. Tian, X. Du, and M. Guizani (2021) Contract and lyapunov optimization-based load scheduling and energy management for uav charging stations. IEEE Transactions on Green Communications and Networking 5 (3), pp. 1381–1394. Cited by: §1, §2.
S. Ma (2017) Fast or free shipping options in online and omni-channel retail? the mediating role of uncertainty on satisfaction and purchase intentions. The International Journal of Logistics Management. Cited by: §1.
H. Mao, R. Netravali, and M. Alizadeh (2017) Neural adaptive video streaming with pensieve. In ACM SIGCOMM, pp. 197–210. Cited by: §2.
F. Meng, P. Chen, L. Wu, and J. Cheng (2020) Power allocation in multi-user cellular networks: deep reinforcement learning approaches. IEEE Transactions on Wireless Communications 19 (10), pp. 6255–6267. Cited by: §2.
S. Moon, S. Ahn, K. Son, J. Park, and Y. Yi (2021) Neuro-dcf: design of wireless mac via multi-agent reinforcement learning approach. ACM MobiHoc. Cited by: §2.
Y. S. Nasir and D. Guo (2019) Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks. IEEE Journal on Selected Areas in Communications 37 (10), pp. 2239–2250. Cited by: §2.
L. Pan, Q. Cai, and L. Huang (2020) Softmax deep double deterministic policy gradients. Advances in Neural Information Processing Systems 33. Cited by: §1, §5.1, §5, §6.3.
S. Pan and Y. Chen (2018) Energy-optimal scheduling of mobile cloud computing based on a modified lyapunov optimization method. IEEE Transactions on Green Communications and Networking 3 (1), pp. 227–235. Cited by: §1, §2.
X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 1–8. Cited by: §5.2, §5.
L. Qu, C. Assi, K. Shaban, and M. J. Khabbaz (2017) A reliability-aware network service chain provisioning with delay guarantees in nfv-enabled enterprise datacenter networks. IEEE Transactions on Network and Service Management 14 (3), pp. 554–568. Cited by: §1.
Z. Scully, M. Harchol-Balter, and A. Scheller-Wolf (2019) Simple near-optimal scheduling for the m/g/1. ACM SIGMETRICS Performance Evaluation Review 47 (2), pp. 24–26. Cited by: §1, §2.
K. Shen and W. Yu (2018) Fractional programming for communication systems—part i: power control and beamforming. IEEE Transactions on Signal Processing 66 (10), pp. 2616–2630. Cited by: §6.3.
R. Singh and P. Kumar (2018) Throughput optimal decentralized scheduling of multihop networks with end-to-end deadline constraints: unreliable links. IEEE Transactions on Automatic Control 64 (1), pp. 127–142. Cited by: §1, §2, §3.1, §3.2, §4.2.
J. Song, W. Wang, and R. Ibrahim (2021) The impact of obstruction by vehicle on in-tunnel wireless propagation channel. In 2021 IEEE 4th International Conference on Electronic Information and Communication Technology (ICEICT), pp. 572–576. Cited by: §6.4.2.
W. Taotao, X. Jiantao, X. Wensen, C. Yucheng, and Z. Shengli (2021) Wireless signal strength on 2.4ghz (wss24) dataset.. Note: https://github.com/postman511/Wireless-Signal-Strength-on-2.4GHz-WSS24-dataset[Online; accessed 17-2-2022] Cited by: §6.1.
L. Xiao, Y. Li, C. Dai, H. Dai, and H. V. Poor (2017) Reinforcement learning-based noma power allocation in the presence of smart jamming. IEEE Transactions on Vehicular Technology 67 (4), pp. 3377–3389. Cited by: §2.
R. Xie, Q. Tang, C. Liang, F. R. Yu, and T. Huang (2020) Dynamic computation offloading in iot fog systems with imperfect channel-state information: a pomdp approach. IEEE Internet of Things Journal 8 (1), pp. 345–356. Cited by: §1.
C. Xu, S. Liu, C. Zhang, Y. Huang, and L. Yang (2020) Joint user scheduling and beam selection in mmwave networks based on multi-agent reinforcement learning. In 2020 IEEE 11th Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 1–5. Cited by: §4.3.
H. Zhang, W. Li, S. Gao, X. Wang, and B. Ye (2019) Reles: a neural adaptive multipath scheduler based on deep reinforcement learning. In IEEE INFOCOM, pp. 1648–1656. Cited by: §1, §2.
F. Zhou, L. Feng, M. Kadoch, P. Yu, W. Li, and Z. Wang (2021) Multi-agent rl aided task offloading and resource management in wi-fi 6 and 5g coexisting industrial wireless environment. IEEE Transactions on Industrial Informatics. Cited by: §4.3.

Effective Multi-User Delay-Constrained Scheduling with Deep Recurrent Reinforcement Learning

Abstract.

1. Introduction

2. Related Work

3. Problem and Preliminary

3.1. The Scheduling Problem

The buffer model

The scheduling and service model

The system objective

3.2. Lagrange Dual

4. Overall Framework

4.1. The POMDP Formulation

State and Observation

Action and Reward

Learning Objective

Remark 1 ().

4.2. User-Level Decomposition

Decomposed POMDP

Unified Training

Remark 2 ().

4.3. Multi-Hop and Node-level Merging

5. Proposed Method: RSD4

5.1. The Training Algorithm

Recurrent Network Architecture

Softmax Double Learning

5.2. Network Architecture

Remark 1 ().

6. Experiment Results

6.1. Environment Setup

Arrivals

Service Channel Conditions

Service Outcomes

6.2. Ablation Study

6.2.1. Network Architectures

6.2.2. Policy Delay

6.2.3. Learning Rate

6.3. Performance Comparison

6.4. Partially Observable Systems

6.4.1. Missing Buffer State

6.4.2. Unobservable Hidden Factors

6.4.3. Time-varying Environments

Remark 1 ().

6.5. Scalability by Decomposition and Merging

6.5.1. Large-Scale System

Remark 2 ().

6.5.2. Multi-hop Network

7. Conclusion

Acknowledgements

References

Effective Multi-User Delay-Constrained Scheduling with
Deep Recurrent Reinforcement Learning

5. Proposed Method: $R S D 4$