DRL Enabled Coverage and Capacity Optimization in STAR-RIS Assisted Networks

Xinyu Gao, Wenqiang Yi, Yuanwei Liu, Jianhua Zhang, and Ping Zhang

Part of this work has been submitted to the IEEE Global Communications Conference, Dec. 4-Dec. 8, 2022 [1].
X. Gao, W. Yi, and Y. Liu are with the Queen Mary University of London, London E1 4NS, U.K. (e-mail: {x.gao, w.yi, yuanwei.liu}@qmul.ac.uk).
J. Zhang and P. Zhang are with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: {jhzhang, pzhang}@bupt.edu.cn).

Abstract

Simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs) are promising passive devices that provide full-space coverage by transmitting and reflecting the incident signal simultaneously. As STAR-RISs are a new paradigm in wireless communications, analyzing their coverage and capacity performance is essential but challenging. To solve the coverage and capacity optimization (CCO) problem in STAR-RIS assisted networks, a multi-objective proximal policy optimization (MO-PPO) algorithm is proposed, which handles long-term benefits better than conventional optimization algorithms. To strike a balance between the objectives, the MO-PPO algorithm provides a set of optimal solutions that form a Pareto front (PF), where any solution on the PF is regarded as an optimal result. Moreover, to improve the performance of the MO-PPO algorithm, two update strategies, i.e., an action-value-based update strategy (AVUS) and a loss function-based update strategy (LFUS), are investigated. The AVUS integrates the action values of coverage and capacity and then updates the loss function. The LFUS only assigns dynamic weights to the loss functions of coverage and capacity, where the weights are calculated by a min-norm solver at every update. Numerical results demonstrate that the investigated update strategies outperform fixed-weight MO optimization algorithms in different cases, which include different numbers of sample grids, numbers of STAR-RISs, numbers of elements in the STAR-RISs, and sizes of STAR-RISs. Additionally, STAR-RIS assisted networks achieve better performance than conventional wireless networks without STAR-RISs. Moreover, with the same bandwidth, millimeter wave provides higher capacity than sub-6 GHz, but at the cost of smaller coverage.

Coverage and capacity optimization (CCO), multi-objective proximal policy optimization (MO-PPO), simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs)

I Introduction

To support the increasingly heterogeneous quality-of-service (QoS) requirements of future wireless networks, e.g., high data rate, low latency, high reliability, and massive connectivity, an emerging communication paradigm, i.e., reconfigurable intelligent surfaces (RISs) [2, 3, 4], has been proposed to smartly control the wireless communication environment. RISs are able to offer line-of-sight (LoS) links to users located in blocked areas via reflection, which improves both the coverage and capacity of conventional wireless networks. However, conventional RISs cover at most $180^{\circ}$, so a ‘blind zone’ still exists at the backside of RISs. To overcome this limitation, a new concept named simultaneously transmitting and reflecting RISs (STAR-RISs) [5] has become appealing. In contrast to conventional RISs, STAR-RISs are able to transmit and reflect the incident signal simultaneously, which contributes to full-space coverage [6]. As STAR-RISs are a new communication paradigm, how they perform in terms of coverage and capacity is an open and important question. Note that coverage and capacity optimization (CCO) is one of the typical operational tasks mentioned by the 3rd Generation Partnership Project [7]. Since coverage and capacity conflict in several respects, it is important to optimize them simultaneously. For example, high transmit power contributes to large coverage but also causes high inter-cell interference, which reduces the capacity performance. To this end, multi-objective machine learning (MOML) [8] algorithms are a potential solution. Compared to single-objective algorithms, MOML algorithms are capable of handling the inherent conflict between objectives and achieve a group of optimal solutions by coordinating and compromising among the requirements of the objectives.

I-A Related Works

I-A1 Capacity or Coverage Optimization for STAR-RISs Networks

Conventional performance optimization for STAR-RIS assisted networks focuses on a single objective: capacity or coverage. For capacity performance, there are several initial works. In [9], a partitioning algorithm was proposed to determine the proper number of transmitting/reflecting elements to be assigned to each user and to maximize the system sum-rate while guaranteeing the QoS requirements of individual users. In STAR-RIS assisted non-orthogonal multiple access (NOMA) systems, the authors in [10] proposed a sub-optimal two-layer iterative algorithm to maximize the achievable sum-rate by jointly optimizing the decoding order, power allocation coefficients, active beamforming, and transmission and reflection beamforming. The sum-rate performance of STAR-RIS assisted full-duplex communication systems was investigated in [11], where the successive convex approximation technique was employed to develop efficient algorithms for obtaining sub-optimal solutions. In [12], the authors proposed a sub-optimal block coordinate descent algorithm to maximize the weighted sum-rate of a STAR-RIS assisted multiple-input multiple-output network. The authors in [13] investigated the resource allocation problem in a STAR-RIS assisted multi-carrier communication network and proposed location-based matching and semidefinite programming algorithms to maximize the system sum-rate. To derive the approximated average achievable rates of two users, the authors in [14] investigated the performance of STAR-RIS assisted downlink NOMA networks using a large-array analysis method. For coverage performance, only one recent work has discussed the optimization problem. STAR-RIS assisted two-user communication networks were studied in [15], where search-based algorithms were proposed to obtain the optimal one-dimensional (1D) coverage range.

I-A2 CCO based on MOML algorithms

There are three main CCO solutions based on MOML algorithms: 1) Keep one objective in the objective function and move the remaining objectives to the constraints, in which case the obtained results are sub-optimal [17]. 2) Assign a fixed weight to each objective. This method achieves the optimal results for a single scenario, but the results cannot be reused for other weight combinations, i.e., other network operation designs [18]. 3) Obtain a set of optimal solutions with Pareto-based multi-objective optimization algorithms, where one of these solutions can be selected to meet any specific network operation design [19]. More specifically, for the first method, a reinforcement learning (RL) algorithm-based solution for CCO that optimizes the base station (BS) antenna electrical tilt was proposed in [17], where the coverage objective was considered in the constraint. The proposed sub-optimal solution has the potential to reduce operational costs and complexity, as well as to improve the quality of experience for mobile users. For the second method, in [18], a minimization of drive tests (MDT)-driven deep RL algorithm was investigated to maximize the coverage and capacity by tuning antenna tilts on a cluster of cells from a cellular network, where fixed weights were assigned to coverage and capacity. The results showed that the proposed MDT-driven approaches outperform baseline approaches, i.e., deep Q-network and best-first search, in terms of long-term reward and sample efficiency. For the third method, the authors in [19] developed two RL algorithm-based approaches for maximizing coverage and minimizing interference by jointly optimizing the transmit power and antenna downtilt across cells. The results suggested that data-driven techniques can effectively self-optimize coverage and capacity in cellular networks. There are some other promising MOML methods [20, 21]. A new algorithm was introduced in [20] for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. The authors in [21] proposed an upper bound for the multi-objective loss and showed that it can be optimized efficiently. However, compared to a simple extension of vanilla RL approaches to MOML algorithms, an RL approach named the proximal policy optimization (PPO) algorithm is able to provide a more stable training process (e.g., by implementing small batch updates in multiple training steps) and can be a booster for MOML algorithms.

I-B Motivations and Contributions

As can be seen from related works, the CCO problem of STAR-RIS assisted wireless networks is still at its early stage. In this research direction, there are two main challenges:

Characterizing Coverage in STAR-RIS Assisted Networks: To efficiently characterize the coverage performance metric, it is preferable to consider two-dimensional (2D) coverage instead of a 1D range distance. Additionally, simultaneously characterizing both coverage and capacity is challenging when evaluating the performance of STAR-RIS assisted networks.
Designing MORL Algorithms to Solve the CCO Problem: Conventional Pareto-based MO optimization (MOO) solutions mainly aim to find a Pareto front (PF) of objectives within one time step, which ignores the temporal correlations of long-term wireless communications. PPO is a policy gradient method in which policy updates use a surrogate loss function to avoid catastrophic drops in performance. In addition, the new MOO methods [20, 21] are able to dynamically update the weights of objectives. Therefore, obtaining the Pareto optimal (PO) solution based on the PPO algorithm and these two new MOO methods is challenging.

To solve these challenges and fully reap the advantages of STAR-RISs, in this paper, we propose a new RL approach based on the PPO algorithm, named multi-objective PPO (MO-PPO) algorithm, to provide the maximum coverage and capacity for STAR-RIS assisted networks. The optimal results obtained by the MO-PPO algorithm are different according to the different update strategies. The main contributions of this paper can be summarized as follows:

We propose a new model for a narrow-band downlink STAR-RIS assisted network consisting of two single-antenna BSs, where the serving range is defined as a square region. To quantitatively analyze the coverage and capacity, the serving range is discretized into numerous square grids and the center point of each grid acts as the evaluating sample point. Based on this framework, this work formulates the CCO problem of STAR-RIS assisted networks by jointly optimizing the transmit power, the reflection phase shift matrix, and the transmission phase shift matrix.
We investigate an action value-based update strategy (AVUS) for the MO-PPO algorithm to solve the CCO problem. The core idea of this strategy is to integrate the action values of both coverage and capacity by randomly sampling preferences, and to further invoke a coefficient to update the policy by homotopy optimization. This update strategy achieves high performance and is able to provide the optimal coverage and capacity, but it takes a long time to converge. Therefore, the AVUS has strict requirements on computation resources and is suitable for networks with strong computation capability.
We adopt a loss function-based update strategy (LFUS) for the MO-PPO algorithm to reduce the complexity brought by the AVUS. The improvement is to assign dynamic weights to the loss functions of coverage and capacity, and to update the whole MO-PPO policy with an integrated loss function of coverage and capacity. The dynamic weights are re-calculated by a min-norm solver at every update. Compared to the AVUS, this strategy has slightly worse performance, but it still achieves an acceptable performance gain over existing CCO solutions.
We illustrate that both AVUS and LFUS based MO-PPO algorithms are capable of striking a balance between the conflicting goals of coverage and capacity. In addition, AVUS and LFUS based algorithms are able to provide Pareto optimality, in contrast to conventional fixed-weight MOO algorithms. With the same bandwidth, millimeter wave (mmWave) provides better capacity, while sub-6 GHz provides better coverage. Furthermore, the coverage and capacity are positively correlated with the number of STAR-RISs. Finally, when the number of elements in the STAR-RISs is fixed, the coverage and capacity are negatively correlated with the physical size of the STAR-RISs.

I-C Organization

The rest of this paper is organized as follows. Section II presents the system model for the considered STAR-RIS assisted networks and formulates the coverage and capacity optimization problem. Section III provides the preliminaries, including the principles of the PPO algorithm and the PO solution. In Section IV, we investigate the MO-PPO algorithms based on the two update strategies, i.e., AVUS and LFUS, which update different parts of the algorithm. Section V presents numerical results to verify the effectiveness of the proposed MO-PPO algorithms, considering different numbers of sample grids, different numbers of elements in STAR-RISs, different numbers of STAR-RISs, and different physical sizes of the STAR-RIS module. Finally, Section VI concludes this paper.

Notations: Scalars, vectors, and matrices are denoted by lower-case, bold-face lower-case, and bold-face upper-case letters, respectively. The conjugate transpose of vector $a$ is denoted by $a^{H}$. $\mathrm{diag}(a)$ denotes a diagonal matrix with the elements of vector $a$ on the main diagonal. $| a |$ denotes the norm of vector $a$. $\mathrm{mod}(a, b)$ denotes the modulus operation between values $a$ and $b$. $\lfloor a \rfloor$ denotes the truncated argument of value $a$. $*$ denotes the dot multiplication operation. $E [A]$ is the expectation operator of matrix $A$. $\log_{2} (A)$ represents a logarithmic function with a constant base of 2 for matrix $A$. $\mathrm{tr}(A)$ denotes the trace of matrix $A$.

II System Model and Problem Formulation

Fig. 1: Illustration of the considered narrow-band downlink STAR-RIS assisted networks: (a) The geographic environment and the model of STAR-RISs; and (b) The constraints of height and width of STAR-RISs.

As shown in Fig. 1, we consider a narrow-band downlink STAR-RIS assisted network consisting of two single-antenna BSs and $N_{s}$ STAR-RISs of the same size equipped with $K = K_{H} K_{V}$ reconfigurable elements, where $K_{H}$ and $K_{V}$ denote the number of elements per row and column, respectively. The serving range is defined as a square region with the length of the side $R_{s}$ . The BSs are located at the bottom left and bottom right corners with the same height $h_{b}$ , while STAR-RISs with the height $h_{n_{s}}$ and width $ω_{n_{s}}$ are deployed at designated locations in the square region.

II-A Grid-based Geographic Model

Consider a three-dimensional (3D) Cartesian coordinate system whose origin is set at the top-left corner of the square region. The locations of the two BSs and the $n_{s}$-th STAR-RIS are denoted by $B_{1} = (R_{s}, 0, h_{b})$, $B_{2} = (R_{s}, R_{s}, h_{b})$, and $A_{n_{s}} = (x_{n_{s}}, y_{n_{s}}, h_{n_{s}})$, respectively. Note that $h_{n_{s}}$ is the height of the STAR-RIS module, and the thickness of the STAR-RISs is ignored. The height $h_{n_{s}}$ and width $ω_{n_{s}}$ are depicted in Fig. 1. The indicators $I_{h_{n_{s}}}$ and $I_{ω_{n_{s}}}$ are invoked to characterize $h_{n_{s}}$ and $ω_{n_{s}}$, which can be expressed as follows:

	$I_{h_{n_{s}}} = \begin{cases} 1, & \text{if } h_{n_{s}} \leq \frac{R_{g} h_{b}}{2 R_{s} - R_{g}}, \\ 0, & \text{if } h_{n_{s}} > \frac{R_{g} h_{b}}{2 R_{s} - R_{g}}, \end{cases}$		(1)
			(2)

In order to ensure that there is at least one direct link between BSs and any given sampling point, the indicators $I_{h_{n_{s}}}$ and $I_{ω_{n_{s}}}$ need to satisfy the condition: $I_{h_{n_{s}}} = 1$ and/or $I_{ω_{n_{s}}} = 1$ . Thus, the indicators $I_{h_{n_{s}}}$ and $I_{ω_{n_{s}}}$ can be unified as follows:

I_{n_{s}} = I_{h_{n_{s}}} \lor I_{ω_{n_{s}}}

(3)

where $\lor$ denotes the OR operator. $I_{n_{s}} = 1$ denotes that the BSs are able to establish a direct link with the considered receiver from above and/or from the side of the STAR-RISs.

To characterize the coverage and capacity, the region is discretized into numerous square grids with side length $R_{g}$, and the center point of each grid acts as the sample point. Accordingly, the total number of grids is $N = \lceil R_{s} / R_{g} \rceil^{2}$, and the set of sample points is denoted as $s = \{s_{1}, s_{2}, \dots, s_{N}\}$. In practical networks, to characterize the importance of each grid at each time step $t$, two time-related weights, $w_{cov, s_{i}} (t)$ and $w_{cap, s_{i}} (t)$, are assigned to the coverage and capacity of each sample point $s_{i}$ ($i \in [1, N]$), respectively. Moreover, the weights are normalized, i.e., $\sum_{i = 1}^{N} w_{cov, s_{i}} (t) = 1$ and $\sum_{i = 1}^{N} w_{cap, s_{i}} (t) = 1$. In this system model, we study long-term communication over a time period $T$. For each sample point at any time step, the weights $w_{cov, s_{i}} (t)$ and $w_{cap, s_{i}} (t)$ are influenced by the previous network performance and resource allocation strategy. Therefore, the considered problem can be regarded as a Markov decision process (MDP).
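The grid discretization described above can be illustrated with a short sketch. The function names and the uniform weights are assumptions made for illustration only; in the model the weights vary with the time step $t$.

```python
import math


def grid_sample_points(r_s: float, r_g: float):
    """Return the (x, y) centers of the square grids covering an r_s x r_s region.

    Assumption: grids are laid out row by row starting from the origin.
    """
    per_side = math.ceil(r_s / r_g)  # number of grids along one side
    return [
        ((ix + 0.5) * r_g, (iy + 0.5) * r_g)  # center of grid (ix, iy)
        for ix in range(per_side)
        for iy in range(per_side)
    ]


def uniform_weights(n: int):
    """Hypothetical normalized weights: every sample point is equally important."""
    return [1.0 / n] * n


points = grid_sample_points(100.0, 10.0)  # N = ceil(100 / 10)^2 sample points
w_cov = uniform_weights(len(points))      # sums to one, as the model requires
```

When $R_{s}$ is not an integer multiple of $R_{g}$, the ceiling makes the last row and column of grids extend past the region boundary, so their centers may need to be clipped in practice.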

II-B Spatially Correlated Channel Model

In this section, the fading channels from BSs to STAR-RISs, from STAR-RISs to sample points, and from BSs to sample points are introduced, as well as their spatial channel correlations. Denote $Φ_{δ, n_{s}}$ as the coefficients of the $n_{s}$-th STAR-RIS with mode $δ$, where $δ \in \{Re, Tr\}$ represents the reflection and transmission modes. Due to the high path loss, this work assumes that signals are reflected and transmitted by the STAR-RISs only once. We consider non-ideal STAR-RISs with the same constant amplitude and continuous phase shifters in each mode, where the phase shifters can be expressed as [23]: $ϕ_{δ, n_{s}, k} \in [0, 2 π), \forall k \in \{1, 2, \dots, K\}$. The coefficients of the $n_{s}$-th STAR-RIS are denoted as $Φ_{δ, n_{s}} = \mathrm{diag} (\sqrt{β_{δ, n_{s}}} e^{j ϕ_{δ, n_{s}, 1}}, \dots, \sqrt{β_{δ, n_{s}}} e^{j ϕ_{δ, n_{s}, K}})$, where $\sqrt{β_{δ, n_{s}}} \in (0, 1], β_{Re, n_{s}} + β_{Tr, n_{s}} = 1$ [24]. As shown in Fig. 1, a spherical coordinate system is defined with azimuth angle $ψ$ and elevation angle $θ$ based on the 3D space. Denote the area of each element as $M = M_{H} M_{V}$, where $M_{H}$ and $M_{V}$ are the horizontal width and vertical height, respectively. Thus, the total area of $K$ elements can be expressed as $M_{a} = K M$. For the $k$-th element, its location can be expressed as [25]:

l_{k} = [0, x (k) M_{H}, y (k) M_{V}]^{T},

(4)

where $x (k) = \mathrm{mod} (k - 1, K_{H})$ and $y (k) = \lfloor (k - 1) / K_{H} \rfloor$ are the horizontal and vertical indices of the $k$-th element. Assuming that a plane wave with wavelength $λ$ impinges on the STAR-RISs, the array response vector is given by:

a (ψ, θ) = [e^{j b (ψ, θ)^{T} l_{1}}, e^{j b (ψ, θ)^{T} l_{2}}, \dots, e^{j b (ψ, θ)^{T} l_{K}}]^{T},

(5)

where $b (ψ, θ) \in R^{3 \times 1}$ is the wave vector, which can be expressed as follows:

b (ψ, θ) = \frac{2 π}{λ} [cos (θ) cos (ψ), cos (θ) sin (ψ), sin (θ)]^{T} .

(6)
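Equations (4) to (6) can be combined into a small numerical sketch of the array response vector. The function names and the numeric values in the usage line are illustrative assumptions, not parameters taken from this work.

```python
import cmath
import math


def element_location(k, k_h, m_h, m_v):
    """Eq. (4): location of the k-th element (k is 1-indexed)."""
    x = (k - 1) % k_h   # horizontal index, mod(k - 1, K_H)
    y = (k - 1) // k_h  # vertical index, floor((k - 1) / K_H)
    return (0.0, x * m_h, y * m_v)


def wave_vector(psi, theta, wavelength):
    """Eq. (6): wave vector for azimuth psi and elevation theta (radians)."""
    scale = 2 * math.pi / wavelength
    return (
        scale * math.cos(theta) * math.cos(psi),
        scale * math.cos(theta) * math.sin(psi),
        scale * math.sin(theta),
    )


def array_response(psi, theta, wavelength, k_h, k_v, m_h, m_v):
    """Eq. (5): one unit-modulus phase term per element."""
    b = wave_vector(psi, theta, wavelength)
    response = []
    for k in range(1, k_h * k_v + 1):
        loc = element_location(k, k_h, m_h, m_v)
        phase = sum(b_i * l_i for b_i, l_i in zip(b, loc))  # b^T l_k
        response.append(cmath.exp(1j * phase))
    return response


# Hypothetical 4 x 2 surface with quarter-wavelength elements.
a = array_response(0.3, 0.2, 0.1, 4, 2, 0.025, 0.025)
```

The first element sits at the local origin, so its entry is always 1, and every entry has unit modulus because only the phase depends on the direction of arrival.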

Assume that these channels are independently distributed and that the corresponding channel state information is perfect. Denote $h_{a, n_{s}}$, $h_{δ, n_{s}, s_{i}}$, and $h_{a, s_{i}}$ as the channel from the $a$-th BS to the $n_{s}$-th STAR-RIS, from the $n_{s}$-th STAR-RIS to the $s_{i}$-th sample point with mode $δ$, and from the $a$-th BS to the $s_{i}$-th sample point, respectively. Here, the channels $h_{a, n_{s}}$, $h_{δ, n_{s}, s_{i}}$, and $h_{a, s_{i}}$ are modeled as Rician fading channels, which are expressed as:

	$h_{a, n_{s}} = \sqrt{L_{a R}} (\sqrt{\frac{α_{a R}}{1 + α_{a R}}} h_{a, n_{s}}^{L O S} + \sqrt{\frac{1}{1 + α_{a R}}} h_{a, n_{s}}^{N L O S}),$		(7)
	$h_{δ, n_{s}, s_{i}} = \sqrt{L_{R P}} (\sqrt{\frac{α_{R P}}{1 + α_{R P}}} h_{n_{s}, s_{i}}^{L O S} + \sqrt{\frac{1}{1 + α_{R P}}} h_{n_{s}, s_{i}}^{N L O S}),$		(8)
	$h_{a, s_{i}} = \sqrt{L_{a P}} (\sqrt{\frac{α_{a P}}{1 + α_{a P}}} h_{a, s_{i}}^{L O S} + \sqrt{\frac{1}{1 + α_{a P}}} h_{a, s_{i}}^{N L O S}),$		(9)

where $L_{u}$ and $α_{u}, u \in \{aR, RP, aP\}$ denote the corresponding path loss and Rician factor, respectively. The $h_{a, s_{i}}^{LOS} \sim \mathcal{CN} (0, 1)$ denotes the Rayleigh fading-modeled deterministic LoS component of the channel from the $a$-th BS to the $s_{i}$-th sample point, while $h_{a, n_{s}}^{LOS} = b (ψ^{aR}, θ^{aR}) = b \{\arcsin [(h_{b} - h_{n_{s}}) / d_{a, n_{s}}], \arccos [(R_{s} - x_{n_{s}}) / \bar{d}_{a, n_{s}}]\}$ and $h_{n_{s}, s_{i}}^{LOS} = b (ψ^{RP}, θ^{RP}) = b \{\arcsin (h_{n_{s}} / d_{n_{s}, s_{i}}), \arccos [(x_{n_{s}} - x_{s_{i}}) / \bar{d}_{n_{s}, s_{i}}]\}$ are the deterministic LoS components of the channels from the $a$-th BS to the $n_{s}$-th STAR-RIS, and from the $n_{s}$-th STAR-RIS to the $s_{i}$-th sample point, respectively. Here, $d_{a, n_{s}}$ and $d_{n_{s}, s_{i}}$ denote the 3D distance between the $a$-th BS and the $n_{s}$-th STAR-RIS and the 3D distance between the $n_{s}$-th STAR-RIS and the $s_{i}$-th sample point, while $\bar{d}_{a, n_{s}}$ and $\bar{d}_{n_{s}, s_{i}}$ denote the corresponding 2D distances. The $x_{n_{s}}$ and $x_{s_{i}}$ indicate the x-coordinates of the $n_{s}$-th STAR-RIS and the $s_{i}$-th sample point, respectively. $h_{a, n_{s}}^{NLOS} \sim \mathcal{CN} (0, E [h_{a, n_{s}}^{NLOS} (h_{a, n_{s}}^{NLOS})^{H}])$, $h_{n_{s}, s_{i}}^{NLOS} \sim \mathcal{CN} (0, E [h_{n_{s}, s_{i}}^{NLOS} (h_{n_{s}, s_{i}}^{NLOS})^{H}])$, and $h_{a, s_{i}}^{NLOS} \sim \mathcal{CN} (0, 1)$ are the non-line-of-sight (NLoS) components modeled as Rayleigh fading. Furthermore, the path loss $L_{u}$ is modeled as $L_{u} = C_{0} d_{v}^{- γ_{v}}, v \in \{\{a, n_{s}\}, \{n_{s}, s_{i}\}, \{a, s_{i}\}\}$, where $C_{0} = c / (4 π d_{0} f_{c})$ denotes the path loss at the reference distance $d_{0} = 1$ m under carrier frequency $f_{c}$, $c$ is the velocity of light, and $γ_{v}$ represents the path loss factor.
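As a minimal sketch of how (7) to (9) and the path loss model fit together, the snippet below mixes a LoS and an NLoS component with a given Rician factor. The helper names are hypothetical, the constant $C_{0}$ follows the expression as written in the text, and the NLoS samples are drawn independently for simplicity, which ignores the spatial correlation of the full model.

```python
import math
import random


def path_loss(distance, f_c, gamma, d_0=1.0, c=3.0e8):
    """L = C_0 * d^(-gamma), with C_0 = c / (4 * pi * d_0 * f_c) as in the text."""
    c_0 = c / (4 * math.pi * d_0 * f_c)
    return c_0 * distance ** (-gamma)


def complex_gaussian(rng):
    """One CN(0, 1) sample (assumption: i.i.d., no spatial correlation)."""
    sigma = math.sqrt(0.5)
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def rician_channel(los, nlos, loss, alpha):
    """Eqs. (7)-(9): elementwise mix of LoS and NLoS parts with Rician factor alpha."""
    w_los = math.sqrt(alpha / (1 + alpha))
    w_nlos = math.sqrt(1 / (1 + alpha))
    return [math.sqrt(loss) * (w_los * l + w_nlos * n) for l, n in zip(los, nlos)]


rng = random.Random(0)
los = [1 + 0j, 1j]                                # placeholder LoS entries
nlos = [complex_gaussian(rng) for _ in los]       # Rayleigh-faded NLoS entries
h = rician_channel(los, nlos, path_loss(50.0, 2.4e9, 2.2), alpha=3.0)
```

A large Rician factor makes the channel almost deterministic, while a factor of zero reduces it to pure Rayleigh fading scaled by the path loss.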

II-C Signal Model

Since the size of the STAR-RIS module affects the direct link, the received signal $y_{a, n_{s}, s_{i}} \in C$ from the $a$-th BS to the $s_{i}$-th sample point via the $n_{s}$-th STAR-RIS is determined by $I_{n_{s}}$. Thus, the received signal $y_{a, n_{s}, s_{i}}$ can be written as [26]:

y_{a, n_{s}, s_{i}} = \begin{cases} (h_{δ, n_{s}, s_{i}}^{H} Φ_{δ, n_{s}} h_{a, n_{s}} + h_{a, s_{i}}) x + n_{a, n_{s}, s_{i}}, & \text{if } I_{n_{s}} = 1, I_{ω_{n_{s}}} = 0, \\ (h_{δ, n_{s}, s_{i}}^{H} Φ_{δ, n_{s}} h_{a, n_{s}} + \bar{a} h_{a, s_{i}}) x + n_{a, n_{s}, s_{i}}, & \text{if } I_{n_{s}} = 1, I_{h_{n_{s}}} = 0, I_{ω_{n_{s}}} = 1, \end{cases}

(10)

where the total transmit power is $P_{t} = | x |^{2}$ and $n_{a, n_{s}, s_{i}} \sim \mathcal{CN} (0, σ^{2})$ is the additive white Gaussian noise with variance $σ^{2}$. $\bar{a}$ is an indicator that characterizes the direct link between the $a$-th BS and the $s_{i}$-th sample point: $\bar{a} = 1$ denotes that there is a direct link between the $a$-th BS and the $s_{i}$-th sample point, while $\bar{a} = 0$ denotes that this direct link is blocked. Based on the received signal power, the reference signal receiving power (RSRP) is defined as the maximal useful signal power from all possible sources. The RSRP at the sample point $s_{i}$ is given by:

\mathrm{RSRP}_{s_{i}} = \max_{a \in \{1, 2\}, n_{s} \in \{1, 2, \dots, N_{s}\}} | y_{a, n_{s}, s_{i}} |^{2} .

(11)

The achievable signal-to-interference-plus-noise ratio (SINR) of $s_{i}$ -th sample point is calculated as follows:

\mathrm{SINR}_{a, n_{s}, s_{i}} = \frac{| y_{a, n_{s}, s_{i}} - n_{a, n_{s}, s_{i}} |^{2}}{\sum_{n_{s}^{\prime} = 1, n_{s}^{\prime} \neq n_{s}}^{| s |} | y_{a^{\prime}, n_{s}^{\prime}, s_{i}} - n_{a^{\prime}, n_{s}^{\prime}, s_{i}} |^{2} + n_{a, n_{s}, s_{i}}^{2}},

(12)

where $a^{\prime} = 2$ if $a = 1$, and $a^{\prime} = 1$ if $a = 2$. Assuming that the RSRP threshold for all sample points is $R_{th}$, the weighted coverage ratio at time step $t$ can be written as

\mathrm{Coverage} (t) = \frac{| w_{cov, \check{s}} (t) \cdot \check{s} (t) |}{N},

(13)

where $\check{s} (t) = \{\check{s}_{1} (t), \check{s}_{2} (t), \dots, \check{s}_{\tilde{N}} (t)\}$ is the set of sample points at time $t$ that satisfy the condition $\mathrm{RSRP}_{\check{s}_{\tilde{n}} (t)} \geq R_{th}, \check{s}_{\tilde{n}} (t) \in \check{s} (t)$, and $w_{cov, \check{s}} (t) = \{w_{cov, \check{s}_{1}} (t), w_{cov, \check{s}_{2}} (t), \dots, w_{cov, \check{s}_{\tilde{N}}} (t)\}$ denotes the corresponding normalized coverage weights for the sample points $\check{s} (t)$. The network capacity is mainly determined by the SINR, so at time step $t$ the weighted capacity can be represented by

\mathrm{Capacity} (t) = \sum_{s_{i} = 1}^{N_{s}} w_{cap, s_{i}} (t) \cdot B \log_{2} (1 + \mathrm{SINR}_{a^{*}, n_{s}^{*}, s_{i}} (t)),

(14)

where $B$ is the system bandwidth and $a^{*}, n_{s}^{*} = \arg \max_{a \in \{1, 2\}, n_{s} \in \{1, 2, \dots, N_{s}\}} | y_{a, n_{s}, s_{i}} |^{2}$.
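The two metrics can be sketched numerically once per-point RSRP and SINR values are available. The snippet below is an illustration under stated assumptions: the numerator of (13) is read as the sum of the coverage weights of the covered sample points, the function names are hypothetical, and the input values are made up.

```python
import math


def weighted_coverage(rsrp, w_cov, r_th):
    """Sum of coverage weights over points with RSRP >= r_th, divided by N.

    Assumption: this is one reading of the numerator in (13).
    """
    n = len(rsrp)
    covered = sum(w for p, w in zip(rsrp, w_cov) if p >= r_th)
    return covered / n


def weighted_capacity(sinr, w_cap, bandwidth):
    """Weighted sum of Shannon rates B * log2(1 + SINR) over the sample points."""
    return sum(w * bandwidth * math.log2(1 + s) for s, w in zip(sinr, w_cap))


# Made-up linear-scale values for four sample points.
rsrp = [1.0, 5.0, 3.0, 0.5]
sinr = [1.0, 3.0, 7.0, 0.0]
weights = [0.25, 0.25, 0.25, 0.25]

cov = weighted_coverage(rsrp, weights, r_th=1.0)
cap = weighted_capacity(sinr, weights, bandwidth=2.0)
```

Raising the transmit power typically lifts more points above the RSRP threshold but also raises interference in the SINR, which is the conflict that motivates treating the two metrics as separate objectives.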

II-D Problem Formulation

We focus on maximizing the long-term coverage and capacity by optimizing the transmit power, the reflection phase shift matrix, and the transmission phase shift matrix. The formulated problem can be expressed as follows:

$\max_{P_{t}, Φ_{Re, n_{s}}, Φ_{Tr, n_{s}}}$	$\sum_{t = 1}^{T} [\mathrm{Coverage} (t), \mathrm{Capacity} (t)]$	(15)
$s . t .$	$0 < P_{t} (t) \leq P_{m a x},$	(15a)
	$0 < t r (Φ_{δ, n_{s}}^{H} Φ_{δ, n_{s}}) < 1,$	(15b)
		(15c)

where $P_{max}$ denotes the permitted maximum transmit power. Constraint (15a) limits the range of the transmit power. According to the energy conservation principle, constraints (15b) and (15c) ensure that both the energy of each mode and the sum energy of the reflected and transmitted signals are less than one. However, problem (15) is difficult to solve for the following reasons. Firstly, the NLoS components of STAR-RIS assisted links are hard to determine before the STAR-RISs are deployed, since the number of candidate STAR-RIS locations is infinite and the locations rely on the non-concave distribution of coverage and capacity over the sample points. Secondly, the distribution weights $w_{cov, s_{i}} (t)$ and $w_{cap, s_{i}} (t)$ at time $t$ for calculating the coverage and capacity are not continuous functions. Thirdly, with respect to the continuous time $t$, it is difficult to handle the optimization of infinitely many variables, since any two adjacent time steps are coupled through the Markov chain. Thus, conventional non-convex optimization methods are not suitable for solving this problem. In the next section, the Pareto-based MO-PPO algorithms are invoked to solve this problem.

III Preliminaries

Before introducing the proposed strategies, we first give a brief introduction to the principles of the PPO algorithm and the PO solution.

III-A PPO Algorithm

The PPO algorithm is based on trust region policy optimization [27] and utilizes the typical actor-critic architecture. The actor network determines the action according to the current state and the feedback of the critic network, while the critic network evaluates how well the actor network performs the action. The configuration of the PPO algorithm is shown in Fig. 2. Note that the design of the architecture is modularized to decouple the neural networks, the PPO algorithm, and the environments. The action-value function in the PPO algorithm is replaced by an advantage function $\hat{A}_{t}$ at every $\bar{T}$ time steps, which is expressed as:

\hat{A}_{t} = \sum_{t = 1}^{\bar{T}} Q_{π_{\bar{θ}}} (S_{t}, A_{t}) - V_{π_{\bar{θ}}} (S_{t}),

(16)

where $Q_{π_{\bar{θ}}} (S_{t}, A_{t})$ is the action-value function under policy $π_{\bar{θ}}$ with state $S_{t}$ and action $A_{t}$, and $V_{π_{\bar{θ}}} (S_{t})$ is the state value predicted by the critic network. The basic update of the loss function, i.e., no clipping or penalty (NCP), can be expressed as:

L^{NCP} = \min_{\bar{θ}} E_{t} [\frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})} \hat{A}_{t}],

(17)

where $π_{\bar{θ}^{*}} (\cdot)$ and $π_{\bar{θ}} (\cdot)$ denote the current policy and the old policy, respectively. Since a large difference between the new and old policies often leads to destructively large policy updates [28], two other methods are invoked for the PPO algorithm, i.e., the clipped (CLIP) and Kullback–Leibler (KL) penalty methods. The two methods can be expressed as:

L^{CLIP} = \min_{\bar{θ}} E_{t} [\frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})} \hat{A}_{t}, \mathrm{clip} (\frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})}, 1 - ϵ, 1 + ϵ) \hat{A}_{t}],

(18)

L^{KL} = \min_{\bar{θ}} E_{t} [\frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})} \hat{A}_{t} - \tilde{β} \mathrm{KL} (π_{\bar{θ}^{*}} (S_{t}), π_{\bar{θ}} (S_{t}))],

(19)

where $ϵ$ is the clipping parameter that bounds the probability ratio in the clipped method, and $\tilde{β}$ is an adjustable penalty coefficient for the KL method.
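For intuition, the clipped and KL-penalized surrogates can be evaluated on a batch of probability ratios and advantage estimates. The sketch below follows the commonly used PPO formulation, in which the clipped objective takes the elementwise minimum of the unclipped and clipped terms; the function names and default coefficients are illustrative assumptions, and the sign and argument conventions may differ from those used in (17) to (19).

```python
def clip(x, lo, hi):
    """Restrict x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = [
        min(r * a, clip(r, 1 - eps, 1 + eps) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)


def kl_penalized_surrogate(ratios, advantages, kls, beta=1.0):
    """Batch mean of r * A - beta * KL, with per-sample KL estimates supplied."""
    terms = [r * a - beta * kl for r, a, kl in zip(ratios, advantages, kls)]
    return sum(terms) / len(terms)


# A ratio of 1.5 with positive advantage is capped at 1 + eps = 1.2.
capped = clipped_surrogate([1.5], [1.0], eps=0.2)
```

The clipping removes the incentive to push the probability ratio far from one, which is what keeps each policy update small.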

Fig. 3: The Pareto solutions for two objectives.

III-B PO Solution

In multi-objective optimization problems, each objective function may have an individual optimal solution, and these solutions usually differ significantly. Therefore, multi-objective optimization with such conflicting objective functions provides a set of optimal solutions, namely, PO solutions [29]. As shown in Fig. 3, consider two conflicting objectives, both of which are to be maximized. The point $C_{1}$ represents a solution in which $F_{2}$ is near-maximum but $F_{1}$ is low, while point $C_{4}$ indicates a solution in which $F_{1}$ is near-maximum but $F_{2}$ is small. It is difficult to say whether solution $C_{1}$ is better than $C_{4}$, or vice versa. In fact, there exist many such solutions belonging to the PO set, which forms the primary PF. Additionally, $C_{5}$, $C_{6}$, and $C_{7}$ are feasible solutions: $C_{5}$ belongs to the second PF, while $C_{6}$ and $C_{7}$ are part of the third PF [29].
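The layering of solutions into a primary, second, and third PF can be reproduced with a simple non-dominated sorting sketch. The coordinates below are hypothetical and only mimic the qualitative arrangement of $C_{1}$ to $C_{7}$; they are not read from Fig. 3.

```python
def dominates(p, q):
    """True if p is no worse than q in every objective and strictly better in one.

    Both objectives are maximized.
    """
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def pareto_fronts(points):
    """Peel off successive non-dominated layers (primary PF first)."""
    remaining = list(points)
    fronts = []
    while remaining:
        front = [
            p for p in remaining
            if not any(dominates(q, p) for q in remaining)
        ]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts


# Hypothetical (F1, F2) values for seven candidate solutions.
candidates = [(1, 9), (3, 7), (6, 5), (9, 1), (2, 6), (1, 5), (2, 3)]
fronts = pareto_fronts(candidates)
```

No point on the first layer can be improved in one objective without losing in the other, which is why any of them is an acceptable outcome and the final choice depends on the operator's preference.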

IV PO-based MO-PPO Algorithms

In this section, we first introduce the MDP in the MO-PPO algorithm. Then, the different update strategies of the PO-based MO-PPO algorithm are proposed to obtain the optimal policy for the considered networks.

IV-A MO-PPO Framework

In this work, the locations of STAR-RISs are randomly pre-selected. Note that the locations of STAR-RISs do not overlap. Moreover, STAR-RISs are placed along the y-axis direction to ensure that the transmit signal from any BS is reflected and transmitted by the same plane of each STAR-RIS.

In the MO-PPO algorithm, the MDP can be represented by a tuple $\langle \bar{S}, \bar{A}, p, \bar{R} \rangle$ with state space $\bar{S}$, action space $\bar{A}$, and reward space $\bar{R}$. The $p$ is the transition probability matrix indicating the probability of moving from the current state to the next state. Define a controller as an agent, which controls both BSs, to develop the policy from the BSs to the sample points via STAR-RISs, i.e., the adjustment policies of phase shifts and transmit power. At each time step $t$, the controller observes the state $S_{t}$ from the state space $\bar{S}$ and carries out an action $A_{t}$ from the action space $\bar{A}$. The received reward $R \subseteq \bar{R}$ is calculated from the current state and action, which also determine the transition probability to the next state $S_{t + 1}$. Additionally, since the locations of STAR-RISs are pre-determined, the distance between any BS and the $n_{s}$-th STAR-RIS is fixed. According to (13) and (14), the coverage and capacity are determined by the distance between the STAR-RISs and the $s_{i}$-th sample point and by the corresponding phase shifts of the STAR-RISs. Thus, the state $S_{t}$ can be defined symbolically as follows:

S_{t} = [\begin{matrix} β_{R e, n_{s}} (t), β_{T r, n_{s}} (t), Φ_{R e, n_{s}} (t), Φ_{T r, n_{s}} (t), P_{t} (t) \end{matrix}] .

(20)

For the action $A_{t}$, the transmission amplitude $β_{T r, n_{s}}$ of each STAR-RIS is discretized with a small step $z$ into a finite set of values in $(0, 1)$, while the reflection amplitude $β_{R e, n_{s}}$ is determined by $(1 - β_{T r, n_{s}})$ based on the energy splitting policy mentioned in [4]. In the MO-PPO algorithm, the categorical distributions of the available locations and phase shifts of the STAR-RISs are constructed first. Then, the agent samples phase shifts as an action according to the probabilities determined by the actor network. The action $A_{t}$ can be expressed as follows:

A_{t} = [\begin{matrix} Δ β_{R e, n_{s}}, Δ β_{T r, n_{s}}, Δ ϕ_{R e, n_{s}}, Δ ϕ_{T r, n_{s}}, Δ P_{t} \end{matrix}] .

(21)

where $Δ β_{R e, n_{s}} \in \{z, 2 z, \dots, 1 - z\}$, $Δ β_{T r, n_{s}} \in \{1 - z, 1 - 2 z, \dots, z\}$, and $Δ ϕ_{δ, n_{s}} \in \{ϕ_{δ, n_{s}, 1}, ϕ_{δ, n_{s}, 2}, \dots, ϕ_{δ, n_{s}, K}\}$ denote the possible values of the reflection amplitude, the transmission amplitude, and the phases of the $n_{s}$-th STAR-RIS with mode $δ$, respectively. $Δ P_{t}$ is chosen between 0 and $P_{t, max}$. For the $k$-th element, the phase is randomly selected from $[0, 2 π)$. To obtain the maximum coverage and capacity that the BSs are able to achieve in the time period $T$, the reward is defined as the differences in coverage $Δ {Cov}_{t \to t + 1}$ and capacity $Δ {Cap}_{t \to t + 1}$ between adjacent time steps, which can be expressed as:

R_{t} (S_{t}, A_{t}) = [\begin{matrix} Δ {C o v}_{t \to t + 1}, Δ {C a p}_{t \to t + 1} \end{matrix}] .

(22)
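As an illustration of the discretized amplitudes and the reward in (22), the following Python sketch builds the amplitude grid, applies the energy-splitting rule, and forms the two-component reward. The step `z = 0.1`, the uniform probabilities standing in for the actor output, and all function names are assumptions made for this example only.

```python
import random
from typing import List, Sequence, Tuple


def amplitude_grid(z: float) -> List[float]:
    """Candidate amplitudes z, 2z, ..., 1 - z inside the open interval (0, 1)."""
    n = round(1.0 / z)
    return [k * z for k in range(1, n)]


def split_energy(beta_tr: float) -> float:
    """Reflection amplitude from the energy-splitting rule beta_Re = 1 - beta_Tr."""
    return 1.0 - beta_tr


def sample_amplitude(grid: Sequence[float], probs: Sequence[float],
                     rng: random.Random) -> float:
    """Draw one amplitude from a categorical distribution over the grid."""
    return rng.choices(list(grid), weights=list(probs), k=1)[0]


def reward(cov_prev: float, cap_prev: float,
           cov_next: float, cap_next: float) -> Tuple[float, float]:
    """Two-component reward: change in coverage and change in capacity."""
    return (cov_next - cov_prev, cap_next - cap_prev)


z = 0.1  # hypothetical step, not a value from the paper
grid = amplitude_grid(z)
uniform = [1.0 / len(grid)] * len(grid)  # stand-in for the actor network output
beta_tr = sample_amplitude(grid, uniform, random.Random(0))
beta_re = split_energy(beta_tr)
```

Keeping the reward as a pair, rather than summing its entries, is what allows the later update strategies to weight coverage and capacity separately.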

Additionally, the loss function in the PPO algorithm can be evaluated according to (17), (18), and (19). In this work, we propose a novel framework for the MO-PPO algorithm, where two update strategies, i.e., AVUS and LFUS, are employed for the PO-based MO-PPO algorithms.

IV-B AVUS-based MO-PPO Algorithm

In this subsection, we consider the AVUS-based MO-PPO algorithm, where the MO-MDP can be rewritten as $\langle \bar{S}, \bar{A}, p, \bar{R}, Ω, f_{Ω} \rangle$. Here, $Ω$ and $f_{Ω}$ denote the preference space and the preference functions, respectively. In this case, a linear preference function is employed, i.e., $f_{\bar{ω}} (R_{t} (S_{t}, A_{t})) = \bar{ω}^{T} R_{t} (S_{t}, A_{t})$, $\bar{ω} \in Ω$. All possible returns of the MO-MDP form a PF $F := \{\hat{R} \mid \forall \bar{R} < \hat{R}, \bar{R} \in F^{*}\}$, where $\hat{R}$, $\bar{R}$, and $F^{*}$ denote a PO return, a non-PO return, and the set of non-PO returns, respectively. For $Ω$ in the AVUS, a PF-based convex coverage set $f$ can be defined as:

$f = \{\hat{R} \in F \mid \exists \bar{ω} \in Ω, \forall \bar{R} \in F^{*}, \bar{ω}^{T} \hat{R} \geq \bar{ω}^{T} \bar{R}\}.$

(23)

The agent is able to learn a group of policies $Π = \{π_{\bar{θ}^{1}}, π_{\bar{θ}^{2}}, \dots\}$ by interacting with the environment. Among them, there exists a policy $π_{\bar{θ}^{*}}$ that, for some linear preference vector $\bar{ω}$, satisfies:

$\bar{ω}^{T} V^{π_{\bar{θ}^{*}}} (S_{t}) \geq \bar{ω}^{T} V^{π_{\bar{θ}}} (S_{t}), \exists \bar{ω} \in Ω,$

(24)

where $V^{π_{\bar{θ}^{*}}} (S_{t})$ denotes the state-value function at state $S_{t}$, and $π_{\bar{θ}}$ denotes any policy other than $π_{\bar{θ}^{*}}$. In the AVUS, the output network policy contains two sub-policies, which are optimized for coverage and capacity over different preferences, respectively. The core idea of this strategy is to integrate the action values of all objectives based on the convex envelope of the solution front. A theoretical analysis of the AVUS scheme is provided below.
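To make the role of the linear preference in (23) and (24) concrete, the sketch below picks, for each sampled preference, the policy whose value vector maximizes the weighted sum, and collects the policies that win for at least one preference. The value vectors, the preference grid, and the helper names are hypothetical and serve only as an illustration.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]


def scalarize(omega: Sequence[float], value: Sequence[float]) -> float:
    """Linear preference function: the inner product omega^T V."""
    return sum(w * v for w, v in zip(omega, value))


def best_policy(omega: Sequence[float], values: List[Vec]) -> int:
    """Index of the policy whose value vector maximizes omega^T V."""
    return max(range(len(values)), key=lambda i: scalarize(omega, values[i]))


def convex_coverage_set(values: List[Vec], preferences: List[Vec]) -> List[int]:
    """Policies that are optimal for at least one sampled preference."""
    return sorted({best_policy(w, values) for w in preferences})


# Hypothetical (coverage, capacity) value vectors of four policies.
values = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5)]
# Preferences (w, 1 - w) sampled on a coarse grid.
preferences = [(k / 10, 1 - k / 10) for k in range(11)]
ccs = convex_coverage_set(values, preferences)
```

In this toy example the fourth policy is never selected, because the second policy is better in both objectives and therefore wins under every preference.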

IV-B1 Bellman Operator

The standard single-objective PPO algorithm [28] relies on the Bellman expectation operator. With the Bellman optimality operator $J$, the action-value function $Q_{π_{\bar{θ}}} (S_{t}, A_{t})$ can be expressed as follows:

$(J Q)_{π_{\bar{θ}^{*}}} (S_{t}, A_{t}) = R_{t} (S_{t}, A_{t}) + γ \sum_{S' \in \bar{S}} p (S_{t}, A_{t}, S') (H Q) (S', A'),$

(25)

where $γ$, $p (S_{t}, A_{t}, S')$, and $(H Q) (S', A') = \max_{A' \in \bar{A}} Q_{π_{\bar{θ}^{*}}} (S', A')$ denote the discount factor, the probability of transitioning from $S_{t}$ to $S'$ when choosing $A_{t}$, and the optimality filter for the next state $S'$, respectively. Then, we extend the single-objective PPO algorithm to the MO-PPO algorithm by considering an action-value function space $\mathcal{Q}$ for estimating the expected total rewards under preferences $\bar{ω}$, where $\mathcal{Q}$ contains all bounded action-value functions $Q (S_{t}, A_{t}, \bar{ω})$. The corresponding value metric $D$ can be defined as follows:

$D (Q, Q') := \max_{S_{t} \in \bar{S}, A_{t} \in \bar{A}, \bar{ω} \in Ω} | \bar{ω}^{T} [Q (S_{t}, A_{t}, \bar{ω}) - Q' (S_{t}, A_{t}, \bar{ω})] |.$

(26)

For any given policy $π_{\bar{θ}}$, the evaluation operator of the action-value function in the MO-PPO algorithm can be defined as follows:

$(J Q)_{π_{\bar{θ}}} (S_{t}, A_{t}, Ω) = R_{t} (S_{t}, A_{t}) + γ \sum_{S' \in \bar{S}} p (S_{t}, A_{t}, S') \sum_{A' \in \bar{A}} π_{\bar{θ}} (A' | S') Q_{π_{\bar{θ}}} (S', A', \bar{ω}) .$

(27)

Accordingly, we define the optimality filter $H$ for the MO-PPO action-value function as follows:

$(H Q) (S', A', \bar{ω}) = \max_{A' \in \bar{A}, \bar{ω}' \in Ω} \bar{ω}^{T} Q_{π_{\bar{θ}^{*}}} (S', A', \bar{ω}') .$

(28)

Intuitively, the filter $H$ provides the $Q$ value for a given $S_{t}$ and $\bar{ω}$ while handling the convex envelope of the solution front. The optimality operator $J$ for the MO-PPO action-value function under the optimal policy $π_{\bar{θ}^{*}}$ can be defined as follows:

$(J Q)_{π_{\bar{θ}^{*}}} (S_{t}, A_{t}, Ω) = R_{t} (S_{t}, A_{t}) + γ \sum_{S' \in \bar{S}} p (S_{t}, A_{t}, S') (H Q)_{π_{\bar{θ}^{*}}} (S', A', \bar{ω}) .$

(29)

Compared to (25), (29) integrates all the objectives by invoking $\bar{ω}$ to update the policy of each objective simultaneously.
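The optimality filter and the backup built on it can be illustrated on a tiny tabular example. The sketch below is written under several assumptions: the Q table, transition probabilities, reward, and discount factor are hypothetical; the table is indexed by (action, preference index); and the filter returns the Q vector that attains the maximum of the weighted sum, so that the backup stays vector-valued like the reward.

```python
from typing import Dict, Tuple

Vec = Tuple[float, float]
QState = Dict[Tuple[int, int], Vec]  # (action, preference index) -> Q vector


def dot(w: Vec, q: Vec) -> float:
    """Inner product omega^T Q for two objectives."""
    return w[0] * q[0] + w[1] * q[1]


def envelope_filter(q_next: QState, omega: Vec) -> Vec:
    """Q vector maximizing omega^T Q over next actions and stored preferences."""
    return max(q_next.values(), key=lambda q: dot(omega, q))


def optimality_backup(r: Vec, gamma: float, trans: Dict[int, float],
                      q_table: Dict[int, QState], omega: Vec) -> Vec:
    """Reward plus discounted, transition-weighted filtered next-state value."""
    out = [r[0], r[1]]
    for s_next, prob in trans.items():
        h = envelope_filter(q_table[s_next], omega)
        out[0] += gamma * prob * h[0]
        out[1] += gamma * prob * h[1]
    return (out[0], out[1])


# Hypothetical two-state example with (coverage, capacity) Q vectors.
q_table = {
    0: {(0, 0): (1.0, 0.0), (1, 0): (0.0, 2.0)},
    1: {(0, 0): (0.5, 0.5), (1, 1): (2.0, -0.5)},
}
trans = {0: 0.25, 1: 0.75}  # p(S_t, A_t, S') for the two next states
backup = optimality_backup((1.0, 1.0), 0.9, trans, q_table, (0.5, 0.5))
```

Changing the preference passed to `envelope_filter` changes which stored Q vector is selected, which is how a single table can serve several preferences.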

Remark 1.

Compared to the single-objective optimization problem, the policy in the AVUS involves a preference $\bar{ω}$ from the preference space $Ω$, which is utilized to estimate the total rewards under multiple objectives and to update the whole policy. Note that each objective has its own policy instead of sharing a common policy.

IV-B2 Loss Function

Typically the environment is not known entirely so there is no closed-form solution to obtain optimal action-value and state-value functions. In this case, the advantage estimator can be expressed as follows:

	$\hat{A}_{t}^{π_{\bar{θ}^{*}}} (\bar{ω})$	$= \sum_{t}^{\bar{T}} Q_{π_{\bar{θ}}} (S_{t}, A_{t}, \bar{ω}) - V_{π_{\bar{θ}}} (S_{t}, \bar{ω})$
		$= R_{t} + γ R_{t + 1} + γ^{2} R_{t + 2} + \dots + γ^{\bar{T} - t - 1} R_{\bar{T} - 1} + γ^{\bar{T} - t} V_{π_{\bar{θ}}} (S_{\bar{T}}, \bar{ω}) - V_{π_{\bar{θ}}} (S_{t}, \bar{ω}),$		(30)

where $R_{t}$ and $V_{π_{\bar{θ}}} (\cdot, \cdot)$ denote the reward obtained at each time step and the state value output by the critic network, respectively. In the proposed strategy, the loss function can be calculated based on the NCP method, the CLIP method, and the KL penalty method, which are expressed as follows:

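	$L_{1}^{NCP} (\bar{θ}, \bar{ω}) = E_{t} \{ \min [ \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t}, \bar{ω})}{π_{\bar{θ}} (S_{t}, A_{t}, \bar{ω})} \hat{A}_{t}^{π_{\bar{θ}^{*}}} (\bar{ω}) ] \},$		(31)
	$L_{1}^{CLIP} (\bar{θ}, \bar{ω}) = E_{t} \{ \min [ \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t}, \bar{ω})}{π_{\bar{θ}} (S_{t}, A_{t}, \bar{ω})} \hat{A}_{t}^{π_{\bar{θ}^{*}}} (\bar{ω}), clip ( \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t}, \bar{ω})}{π_{\bar{θ}} (S_{t}, A_{t}, \bar{ω})}, 1 - ϵ, 1 + ϵ ) \hat{A}_{t}^{π_{\bar{θ}^{*}}} (\bar{ω}) ] \},$		(32)
	$L_{1}^{KL} (\bar{θ}, \bar{ω}) = E_{t} \{ \min [ \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t}, \bar{ω})}{π_{\bar{θ}} (S_{t}, A_{t}, \bar{ω})} \hat{A}_{t}^{π_{\bar{θ}^{*}}} (\bar{ω}), \tilde{β} KL ( π_{\bar{θ}^{*}} (S_{t}, \bar{ω}), π_{\bar{θ}} (S_{t}, \bar{ω}) ) ] \}.$		(33)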

At each update, the optimal method will be selected as follows:

$L_{1}^{Optimal} (\bar{θ}, \bar{ω}) = \max \{L_{1}^{NCP} (\bar{θ}, \bar{ω}), L_{1}^{CLIP} (\bar{θ}, \bar{ω}), L_{1}^{KL} (\bar{θ}, \bar{ω})\} .$

(34)
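For reference, the clipped surrogate used by the CLIP method can be written in a few lines. The sketch below follows the standard PPO form with the clipping range $[1 - ϵ, 1 + ϵ]$ for a single objective; the log-probabilities and advantages are hypothetical inputs, the default `eps = 0.2` is an assumed value, and the function names are illustrative.

```python
import math
from typing import Sequence


def clip(x: float, low: float, high: float) -> float:
    """Restrict x to the interval [low, high]."""
    return max(low, min(high, x))


def clipped_surrogate(logp_new: Sequence[float], logp_old: Sequence[float],
                      advantages: Sequence[float], eps: float = 0.2) -> float:
    """Sample mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio new / old
        total += min(ratio * adv, clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
    return total / len(advantages)


# Hypothetical batch of two samples: ratios 1.5 and 0.5.
objective = clipped_surrogate(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

In this example the first sample is clipped from 1.5 to 1.2 and the second from -0.5 to -0.8, so the objective never rewards moving the ratio further outside the trust interval.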

However, owing to the large number of discrete solutions on the optimal PO front, directly optimizing $L_{1}^{NCP / CLIP / KL} (\bar{θ}, \bar{ω})$ is still challenging in practice. To address this difficulty, auxiliary loss functions are invoked, and the optimal selection is expressed as follows:

	$L_{2}^{Optimal} (\bar{θ}, \bar{ω})$	$= \max \{L_{2}^{NCP} (\bar{θ}, \bar{ω}), L_{2}^{CLIP} (\bar{θ}, \bar{ω}), L_{2}^{KL} (\bar{θ}, \bar{ω})\}$
		$= \max \{\bar{ω}^{T} L_{1}^{NCP} (\bar{θ}, \bar{ω}), \bar{ω}^{T} L_{1}^{CLIP} (\bar{θ}, \bar{ω}), \bar{ω}^{T} L_{1}^{KL} (\bar{θ}, \bar{ω})\} .$		(35)

Algorithm 1 PO-based MO-PPO algorithm, AVUS
Input: PPO network structure, preference distribution $B$, path $Δ ϖ$ for coefficient $ϖ$
Output: The optimal MO-PPO policy network.
Initialize: Hyperparameters of PPO network, total epochs $\bar{U}$ in each update, minibatch size $M$, update frequency $U$ for MO-PPO algorithm.
for iteration = 1, 2, $\dots$ do
  Sample a linear preference $\bar{ω}$ from $B$
  for actor = 1, 2, $\dots$, N do
    Run policy $π_{\bar{θ}}$ in environment for $T$ time steps.
    Compute advantage estimates $\hat{A}_{1}, \dots, \hat{A}_{\bar{T}}$ for every $\bar{T}$ updating time.
  end for
  Optimize loss function $L$ wrt $\bar{θ}$, with $\bar{U}$ epochs and minibatch size $M \leq U$, according to equation (36).
  Update parameters $\bar{θ} \leftarrow \bar{θ}^{*}$
  Increase $ϖ$ along the path $Δ ϖ$
end for

$L_{1}^{Optimal} (\bar{θ}, \bar{ω})$ ensures that the predicted action value stays close to the real expected total reward, although it may not yield the optimal results, while $L_{2}^{Optimal} (\bar{θ}, \bar{ω})$ pulls the update along the proper direction with better utility. Therefore, to obtain the optimal results, the final loss function is expressed following homotopy optimization [30]:

$L^{Optimal} (\bar{θ}, \bar{ω}) = ϖ L_{1}^{Optimal} (\bar{θ}, \bar{ω}) + (1 - ϖ) L_{2}^{Optimal} (\bar{θ}, \bar{ω}),$

(36)

where $ϖ$ is a weight that trades off $L_{1}^{Optimal} (\bar{θ}, \bar{ω})$ against $L_{2}^{Optimal} (\bar{θ}, \bar{ω})$. The value of $ϖ$ is increased from 0 to 1 with a step of 0.1. The pseudo code of the AVUS-based MO-PPO algorithm is shown in Algorithm 1. To sum up, the AVUS aims to train an agent that recovers policies for the entire PF. However, different preferences $\bar{ω}$ affect the total rewards obtained for coverage and capacity.
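The homotopy combination in (36) amounts to a convex mix of two scalar losses whose weight is moved along a fixed path. A minimal Python sketch is given below, assuming scalar loss values that are already computed; the step of 0.1 follows the text above, while the loss values and function names are hypothetical.

```python
from typing import List


def homotopy_loss(l1: float, l2: float, varpi: float) -> float:
    """Convex combination varpi * L1 + (1 - varpi) * L2 with varpi in [0, 1]."""
    if not 0.0 <= varpi <= 1.0:
        raise ValueError("varpi must lie in [0, 1]")
    return varpi * l1 + (1.0 - varpi) * l2


def varpi_schedule(step: float = 0.1) -> List[float]:
    """Values of varpi from 0 to 1 in equal steps."""
    n = round(1.0 / step)
    return [k / n for k in range(n + 1)]


schedule = varpi_schedule(0.1)
# Hypothetical loss values L1 = 2.0 and L2 = 4.0 evaluated along the path.
losses = [homotopy_loss(2.0, 4.0, w) for w in schedule]
```

At the start of the path only the second loss contributes, and at the end only the first one does, so the mix moves smoothly from one objective to the other.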

IV-C LFUS-based MO-PPO Algorithm

In this subsection, we consider the LFUS-based MO-PPO algorithm, where the multi-task learning (MTL) method is employed. Different from the AVUS, there are multiple policy gradients that need to be updated simultaneously. The MTL-based MO-PPO problem generally follows the empirical risk minimization formulation:

$\min_{\bar{θ}} \sum_{m = 1}^{M} φ^{m} \hat{L}^{m} (\bar{θ}),$

(37)

where $φ^{m}$ and $\hat{L}^{m} (\bar{θ})$ denote the weight and the empirical loss of the $m$-th task, respectively. Consider two solutions $\bar{θ}_{1}$ and $\bar{θ}_{2}$: if $\hat{L}^{1} (\bar{θ}_{1}) > \hat{L}^{1} (\bar{θ}_{2})$ and $\hat{L}^{2} (\bar{θ}_{1}) < \hat{L}^{2} (\bar{θ}_{2})$, the two solutions are mutually non-dominated and may therefore both belong to the PF. In this case, the MTL problem can be formulated as an MO optimization problem to explore the optimal results for conflicting objectives, where the vector-valued loss $L$ is employed as follows:

$\min_{\bar{θ}'} L (\bar{θ}') = \min_{\bar{θ}} [\hat{L}^{1} (\bar{θ}), \hat{L}^{2} (\bar{θ}), \dots, \hat{L}^{M} (\bar{θ})]^{T} .$

(38)

Hence, optimizing (38) amounts to finding PO solutions. Define $\bar{F} = \{L (\bar{θ})\}, \bar{θ} \in \bar{Θ}$ as the PF, where $\bar{θ}$ and $\bar{Θ}$ denote any one set of optimal parameters and the collection of all possible sets of optimal parameters, respectively. A theoretical analysis of the LFUS scheme is provided below.

IV-C1 Multiple Gradient Descent Algorithm (MGDA)

The MGDA [31] is a suitable method for converging to a Pareto stationary (PS) solution. According to the Karush-Kuhn-Tucker (KKT) conditions, there exist $ν_{1}, ν_{2}, \dots, ν_{M}$ such that:

$ν_{1}, ν_{2}, \dots, ν_{M} \geq 0$.
$\sum_{m = 1}^{M} ν_{m} = 1$ and $\sum_{m = 1}^{M} ν_{m} \nabla_{\bar{θ}} \hat{L}^{m} (\bar{θ}) = 0$.

Before applying the MGDA, note that the objectives may take values on different scales, and the MGDA is sensitive to such differences in range. Thus, the following gradient normalization method is invoked to align the value ranges:

(39)

where $\bar{θ}'$ denotes the initial parameters of the model. Consequently, the range of the loss function is limited to $[0, 1]$.

Definition 1.

A solution $\bar{θ}_{1}$ dominates a solution $\bar{θ}_{2}$ if $\hat{L}^{m} (\bar{θ}_{1}) \leq \hat{L}^{m} (\bar{θ}_{2})$ for all objectives $m \in \{1, 2, \dots, M\}$, and there exists at least one objective $n \in \{1, 2, \dots, M\}$ such that $\hat{L}^{n} (\bar{θ}_{1}) < \hat{L}^{n} (\bar{θ}_{2})$.

Definition 2.

A solution $\bar{θ}_{1}$ is a PO solution if no other solution $\bar{θ}_{2}$ dominates $\bar{θ}_{1}$.

Definition 3.

The set of all non-dominated solutions $\hat{\bar{θ}}$ is the Pareto set.

A solution that satisfies the KKT conditions above is defined as a PS solution, and every PO solution is PS. Thus, the optimization problem can be defined as follows:

$\min_{ν_{1}, ν_{2}, \dots, ν_{M}} \{ \| \sum_{m = 1}^{M} ν_{m} \nabla_{\bar{θ}} \hat{L}^{m} (\bar{θ}) \|_{2}^{2} \mid \sum_{m = 1}^{M} ν_{m} = 1, ν_{m} \geq 0 \},$

(40)

where $\| \cdot \|_{2}^{2}$ and $\nabla_{(\cdot)}$ denote the squared L2 norm and the gradient operator, respectively. Define $\nabla_{\bar{θ}'} L (\bar{θ}') = \sum_{m = 1}^{M} ν_{m} \nabla_{\bar{θ}} \hat{L}^{m} (\bar{θ})$. If $\nabla_{\bar{θ}'} L (\bar{θ}') = 0$, the solution is PS; otherwise, it is not PS and $\nabla_{\bar{θ}'} L (\bar{θ}')$ serves as the general GD vector. Since problem (15) has two objectives, (40) can be simplified as:

$\min_{ν \in [0, 1]} \| ν \nabla_{\bar{θ}} \hat{L}^{1} (\bar{θ}) + (1 - ν) \nabla_{\bar{θ}} \hat{L}^{2} (\bar{θ}) \|_{2}^{2}.$

(41)

The optimization problem defined in (41) is equivalent to finding the minimum-norm point in the convex hull of the two gradients, which is a convex quadratic problem with linear constraints. Thus, an analytical solution to (41) can be expressed as:

$ν = [ \frac{[\nabla_{\bar{θ}} \hat{L}^{2} (\bar{θ}) - \nabla_{\bar{θ}} \hat{L}^{1} (\bar{θ})]^{T} \nabla_{\bar{θ}} \hat{L}^{2} (\bar{θ})}{\| \nabla_{\bar{θ}} \hat{L}^{1} (\bar{θ}) - \nabla_{\bar{θ}} \hat{L}^{2} (\bar{θ}) \|_{2}^{2}} ]_{[0, 1]},$

(42)

where $[\cdot]_{[0, 1]}$ represents clipping to $[0, 1]$. Alternating between the GD update and the update of $ν$ produces different values of $ν$, which cover all PO solutions under the constraints and form the PF. According to the system model, it is suitable to select one PO solution as the optimal result.
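The closed-form weight in (42) is straightforward to implement for two gradient vectors. The sketch below is a minimal illustration in plain Python: the gradients are hypothetical lists of floats, the function names are not from the paper, and the fallback value for identical gradients is an assumption, since (42) is undefined in that case.

```python
from typing import List, Sequence


def min_norm_weight(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Weight nu in [0, 1] minimizing ||nu * g1 + (1 - nu) * g2||^2."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(g1, g2))
    if diff_sq == 0.0:
        return 0.5  # assumption: identical gradients, any nu gives the same vector
    nu = sum((b - a) * b for a, b in zip(g1, g2)) / diff_sq
    return min(1.0, max(0.0, nu))  # clipping to [0, 1]


def common_direction(g1: Sequence[float], g2: Sequence[float]) -> List[float]:
    """Min-norm point of the segment between g1 and g2."""
    nu = min_norm_weight(g1, g2)
    return [nu * a + (1.0 - nu) * b for a, b in zip(g1, g2)]


# Hypothetical gradients of the two losses with respect to two parameters.
orthogonal = common_direction([1.0, 0.0], [0.0, 1.0])
aligned = common_direction([1.0, 0.0], [2.0, 0.0])
opposed = common_direction([1.0, 0.0], [-1.0, 0.0])
```

The three cases show the typical behaviour: orthogonal gradients are averaged, aligned gradients collapse to the shorter one through clipping, and exactly opposed gradients give a zero vector, which corresponds to a PS point.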

IV-C2 Loss Function

Algorithm 2 PO-based MO-PPO algorithm, LFUS
Input: PPO network structure.
Output: The optimal MO-PPO policy network.
Initialize: Hyperparameters of PPO network, total epochs $\bar{U}$ in each update, minibatch size $M$, update frequency $U$ for MO-PPO algorithm.
for iteration = 1, 2, $\dots$ do
  for objective = 1, 2, $\dots$ do
    for actor = 1, 2, $\dots$, N do
      Run policy $π_{\bar{θ}}$ in environment for $T$ time steps for each objective.
      Compute advantage estimates $\hat{A}_{1}, \dots, \hat{A}_{\bar{T}}$ for each objective at every $\bar{T}$ updating time.
    end for
  end for
  Calculate loss function $L$ wrt $\bar{θ}$, with $\bar{U}$ epochs and minibatch size $M \leq U$, according to equation (47).
  Update $\bar{θ}$ by min-norm solver.
end for

Our goal is to train one policy containing two sub-policies, where each objective has its own loss function but all parameters are shared. Thus, combining with the PPO algorithm, the loss functions for the MO-PPO algorithm based on the NCP method, the CLIP method, and the KL penalty method can be expressed as follows:

(43)

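	$L^{CLIP} (\bar{θ}) = \min_{ν \in [0, 1]} \| ν E_{t}^{1} \{ \min [ \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})} \hat{A}_{t}^{π_{\bar{θ}^{*}}}, clip ( \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})}, 1 - ϵ, 1 + ϵ ) \hat{A}_{t}^{π_{\bar{θ}^{*}}} ] \}$
	$+ (1 - ν) E_{t}^{2} \{ \min [ \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})} \hat{A}_{t}^{π_{\bar{θ}^{*}}}, clip ( \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})}, 1 - ϵ, 1 + ϵ ) \hat{A}_{t}^{π_{\bar{θ}^{*}}} ] \} \|_{2}^{2},$		(44)

	$L^{KL} (\bar{θ}) = \min_{ν \in [0, 1]} \| ν E_{t}^{1} \{ \min [ \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})} \hat{A}_{t}^{π_{\bar{θ}^{*}}}, \tilde{β} KL ( π_{\bar{θ}^{*}} (S_{t}), π_{\bar{θ}} (S_{t}) ) ] \}$
	$+ (1 - ν) E_{t}^{2} \{ \min [ \frac{π_{\bar{θ}^{*}} (S_{t}, A_{t})}{π_{\bar{θ}} (S_{t}, A_{t})} \hat{A}_{t}^{π_{\bar{θ}^{*}}}, \tilde{β} KL ( π_{\bar{θ}^{*}} (S_{t}), π_{\bar{θ}} (S_{t}) ) ] \} \|_{2}^{2},$		(45)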

where $\hat{A}_{t}^{π_{\bar{θ}^{*}}}$ is the advantage estimator, which can be expressed as follows:

	$\hat{A}_{t}^{π_{\bar{θ}^{*}}}$	$= \sum_{t}^{\bar{T}} Q_{π_{\bar{θ}}} (S_{t}, A_{t}) - V_{π_{\bar{θ}}} (S_{t})$
		$= R_{t} + γ R_{t + 1} + γ^{2} R_{t + 2} + \dots + γ^{\bar{T} - t - 1} R_{\bar{T} - 1} + γ^{\bar{T} - t} V_{π_{\bar{θ}}} (S_{\bar{T}}) - V_{π_{\bar{θ}}} (S_{t}) .$		(46)
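The truncated advantage estimate can be computed directly from a reward segment and two critic outputs. The following Python sketch discounts each reward by the number of steps elapsed since $t$ and bootstraps with the value of the final state; the reward sequence, the critic values, and the function name are hypothetical, and for two objectives the function would simply be called once per objective.

```python
from typing import Sequence


def truncated_advantage(rewards: Sequence[float], v_start: float,
                        v_end: float, gamma: float = 0.99) -> float:
    """Discounted reward sum plus bootstrapped final value, minus V(S_t).

    rewards holds R_t, ..., R_{T-1}; v_start is V(S_t) and v_end is V(S_T).
    """
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** len(rewards)) * v_end - v_start


# Hypothetical two-step segment with critic outputs V(S_t) = 1 and V(S_T) = 2.
adv = truncated_advantage([1.0, 1.0], v_start=1.0, v_end=2.0, gamma=0.5)
```

A positive result means the sampled trajectory did better than the critic expected from $S_{t}$, which pushes the policy toward the actions that were taken.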

Accordingly, at each update, the optimal method will be selected as follows:

$L^{Optimal} (\bar{θ}) = \max \{L^{NCP} (\bar{θ}), L^{CLIP} (\bar{θ}), L^{KL} (\bar{θ})\} .$

(47)

The pseudo code of LFUS-based algorithm is shown in Algorithm 2.

Remark 2.

According to the theoretical analysis, the LFUS has a simpler structure than the AVUS because it only vectorizes the loss function. For the AVUS, two policies are trained, since the action values are determined in parallel according to the preference in (29). For the LFUS, only one policy is trained, since all the objectives share the same loss function in (38). Therefore, the LFUS is expected to converge faster than the AVUS.

IV-D Empirical Complexity Analysis

As shown in Table I, we analyze the empirical complexity of the AVUS and the LFUS in terms of wall-clock time (time complexity) and memory utilization (space complexity). In terms of wall-clock time, the AVUS takes 9.852 s for 10 episodes and 16.42 min for 1000 episodes, while the LFUS takes 8.934 s for 10 episodes and 14.89 min for 1000 episodes. In terms of memory utilization, the LFUS consumes 108.35 MB in total, which is 7.33 MB less than the AVUS. Therefore, the empirical results indicate that the LFUS achieves lower time complexity and space complexity than the AVUS.

Policy	Time for 10 episodes	Time for 1000 episodes	Memory utilization (MB)
AVUS	~9.852 s	~16.42 min	~115.68
LFUS	~8.934 s	~14.89 min	~108.35

TABLE I: Resource footprint for AVUS and LFUS
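Wall-clock time and memory figures of this kind can be collected with the Python standard library. The sketch below is only one possible measurement setup and is not claimed to be the procedure used for Table I: `profile` and `dummy_training` are hypothetical names, the workload is a placeholder, and `tracemalloc` only tracks memory allocated by the Python interpreter.

```python
import time
import tracemalloc
from typing import Callable, Tuple


def profile(fn: Callable[[], object]) -> Tuple[float, float]:
    """Return (elapsed seconds, peak traced memory in MB) for one call of fn."""
    tracemalloc.start()
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak_bytes / 1e6


def dummy_training(episodes: int = 10) -> None:
    """Placeholder workload standing in for a training loop."""
    buffer = []
    for _ in range(episodes):
        buffer.append([0.0] * 1000)


elapsed_s, peak_mb = profile(lambda: dummy_training(10))
```

Repeating the measurement several times and reporting the average would reduce the influence of background load on the timing.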

V Numerical Results

In this section, we provide numerical results to evaluate the performance of the proposed update strategies for the MO-PPO algorithm. Without loss of generality, a Poisson traffic model is employed to estimate the traffic flows or data sources for the proposed system model. In practical networks, the traffic loads at adjacent time steps are related, i.e., the traffic load at the current time step is determined by that at the previous time step. Based on this traffic model, the normalized coverage probability of observing $\bar{k}_{i}$ events and the normalized capacity probability of observing $\hat{k}_{i}$ events at sample point $s_{i}$ at time step 0 are given by:

$w_{cov, s_{i}} (0) = \bar{P}_{s_{i}} (\bar{k}_{i}) = \frac{e^{- \bar{λ}_{i}} \frac{\bar{λ}_{i}^{\bar{k}_{i}}}{\bar{k}_{i}!}}{\sum_{j = 1}^{N} e^{- \bar{λ}_{j}} \frac{\bar{λ}_{j}^{\bar{k}_{j}}}{\bar{k}_{j}!}}, w_{cap, s_{i}} (0) = \bar{P}_{s_{i}} (\hat{k}_{i}) = \frac{e^{- \hat{λ}_{i}} \frac{\hat{λ}_{i}^{\hat{k}_{i}}}{\hat{k}_{i}!}}{\sum_{j = 1}^{N} e^{- \hat{λ}_{j}} \frac{\hat{λ}_{j}^{\hat{k}_{j}}}{\hat{k}_{j}!}},$

(48)

where $\bar{λ}_{i}$ and $\hat{λ}_{i}$ are the average numbers of coverage and capacity events at sample point $s_{i}$, respectively. The parameters of the MO-PPO network and the communication network are given in Table II and Table III. Additionally, two benchmarks are considered to evaluate the proposed update strategies:

Without STAR-RIS (network performance): In this benchmark, the BSs serve the whole serving area without the assistance of the STAR-RISs.
Fixed weights (algorithm performance): In this benchmark, the weights of coverage and capacity are fixed as two cases: a) BM1: weights 0.3 and 0.7; and b) BM2: weights 0.6 and 0.4.
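The Poisson-based weights of the sample points can be reproduced with a few lines of Python. The sketch below uses the standard Poisson probability mass function $e^{-λ} λ^{k} / k!$ and normalizes over all sample points; the event counts and rates are hypothetical, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing k events when the mean number of events is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def normalized_weights(ks: Sequence[int], lams: Sequence[float]) -> List[float]:
    """Per-point Poisson probabilities, normalized to sum to one over all points."""
    raw = [poisson_pmf(k, lam) for k, lam in zip(ks, lams)]
    total = sum(raw)
    return [p / total for p in raw]


# Hypothetical observations at four sample points with a common mean of 3 events.
weights = normalized_weights(ks=[1, 3, 3, 6], lams=[3.0, 3.0, 3.0, 3.0])
```

Points whose observed count is close to the mean receive larger weights, so they contribute more to the weighted coverage or capacity.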

Parameter	Description	Value
$E$	The maximum number of episodes	10000
$T$	The maximum of time steps in each episode	5000
$U$	Update frequency for MO-PPO algorithm	10
$\bar{U}$	The number of epochs in each update	10
$\bar{E}$	Clipping parameter for MO-PPO algorithm	0.2
$η$	Discount factor	0.99
$ψ_{a}$	Learning rate for actor network	0.0001
$ψ_{c}$	Learning rate for critic network	0.003
$ϖ$	Initial coefficient for updating action-value strategy	0.1
$Δ ϖ$	Step for the coefficient of updating action-value strategy	0.001

TABLE II: Simulation parameters for MO-PPO algorithm

V-A PF by Different Proposed Strategies

As shown in Fig. 4, we provide the PFs under the AVUS and the LFUS. Among them, two PFs are plotted at signal frequencies of 3.5GHz and 26GHz using the AVUS. Note that the capacity and coverage are results accumulated over a time period, where the optimized weights for coverage and capacity are dynamic in the proposed strategies. Compared to BM1 and BM2, the solutions on the two fronts satisfy the PO definition, i.e., at least one of coverage and capacity is better than in the benchmarks. This shows that a dynamic combination for CCO over a time period is better than a fixed assignment of weights to coverage and capacity. Moreover, the performance of STAR-RISs at different frequencies is an interesting question. When the system bandwidth is the same, mmWave provides better capacity due to its channel and frequency characteristics, while sub-6 GHz provides better coverage. Here, the channel model of the mmWave signal only considers the LoS component in (7)-(9). For the following discussion, we randomly select one result from the PF of each of the AVUS and the LFUS.

Parameter	Description	Value
$\bar{λ}_{cov}$	Average number of events for coverage	
$\bar{λ}_{cap}$	Average number of events for capacity	
	Path loss when d = 1m	-30dB
$n^{2}$	Noise power variance	$\times ^{- 6}$ $\approx$ -85.23dBW
$R_{t h}$	Minimal RSRP for all sample points	0.23mW $\approx$ -36.38dBW
$P_{t, max}$	Maximum transmit power	200mW = 23.01dBm
$α_{a R}$	Rician factor for channel from $a$-th BS to $n_{s}$-th STAR-RISs	3dB
$α_{R P}$	Rician factor for channel from $n_{s}$-th STAR-RISs to $s_{i}$-th sample point	3dB
$α_{a P}$	Rician factor for channel from $a$-th BS to $s_{i}$-th sample point	3dB
$γ_{a, n_{s}}$	Path loss factor for channel from $a$-th BS to $n_{s}$-th STAR-RISs	3.5
$γ_{n_{s}, s_{i}}$	Path loss factor for channel from $n_{s}$-th STAR-RISs to $s_{i}$-th sample point	2.8
$γ_{a, s_{i}}$	Path loss factor for channel from $a$-th BS to $s_{i}$-th sample point	2.2
$z$	Discrete step for amplitude of STAR-RISs	
$h_{b}$	Height of BS	
$R_{g}$	Length of each grid	

TABLE III: Simulation parameters for communication networks
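The power values in Table III mix linear and logarithmic units, and the stated conversions can be checked with the standard decibel formulas. The helper names in the following Python sketch are illustrative and not part of the simulation setup.

```python
import math


def mw_to_dbm(p_mw: float) -> float:
    """Power in milliwatts expressed in dBm."""
    return 10.0 * math.log10(p_mw)


def mw_to_dbw(p_mw: float) -> float:
    """Power in milliwatts expressed in dBW (1 W = 1000 mW)."""
    return 10.0 * math.log10(p_mw / 1000.0)


def dbw_to_mw(p_dbw: float) -> float:
    """Power in dBW converted back to milliwatts."""
    return 1000.0 * 10.0 ** (p_dbw / 10.0)


p_max_dbm = mw_to_dbm(200.0)  # maximum transmit power from Table III
r_th_dbw = mw_to_dbw(0.23)    # minimal RSRP from Table III
```

Both results agree with the rounded values listed in the table, i.e., about 23.01dBm and about -36.38dBW.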

V-B Convergence of MO-PPO Algorithm with Proposed Strategies

Fig. 5 demonstrates the convergence of the MO-PPO algorithm under the proposed update strategies. Note that, to evaluate the performance of the proposed algorithms, the learning curves are obtained from ten repeated training runs. It can be observed from Fig. 5 that both the proposed strategies and the benchmarks are capable of converging. Among them, the AVUS converges the slowest, but its cumulative reward is the largest, while the LFUS has a convergence speed comparable to the benchmarks and a slightly higher cumulative reward. Compared to the benchmarks, both proposed algorithms achieve better performance in either cumulative reward or convergence speed. These results also support Remark 2 in practice.

Fig. 4: PF with different strategies, $N_{s} = 3$ , $N = 9$ , $K = 8$ , $I_{h_{n_{s}}}$ = 1.

V-C Optimal Coverage and Capacity with Proposed Strategies

In this subsection, we will discuss the impact of the number of sample grids, the number of STAR-RISs, the number of elements in STAR-RISs, and the size of STAR-RISs on the selected optimal coverage and capacity.

V-C1 Impact of the Number of Sample Grids

Fig. 6 characterizes the optimized coverage and capacity versus the total number of grids. In Fig. 6(a), it is observed that the coverage and capacity gains of all cases decrease as the total number of grids increases. Specifically, the maximum decrease in coverage gain among the proposed algorithms and the fixed-weight solutions is 9.23dB, while that in capacity gain reaches 10.21dB. This is because, compared with other sample points, the fast-fading channel characteristics of the links from the BSs and STAR-RISs to distant grids prevent the received RSRP at those grids from reaching $R_{t h}$. As a result, both the coverage and the capacity of the four cases show a downward trend. Additionally, compared to the "Without STAR-RISs" case, the proposed strategies and the benchmarks show better performance. This is because the STAR-RISs proactively transmit and reflect the signal to the farther grids with less consumption. To sum up, when only the total number of sample points changes, the coverage and capacity decrease as the total number of grids increases. Moreover, the proposed update strategies outperform the benchmarks, while the AVUS performs better than the LFUS.

Fig. 6: The optimized coverage and capacity for the MO-PPO algorithm with fixed weights, AVUS, and LFUS under different numbers of sample grids. (a) The optimized coverage under different numbers of sample grids of the serving area.

Fig. 7: The optimized coverage and capacity for the MO-PPO algorithm with fixed weights, AVUS, and LFUS under different numbers of STAR-RISs. (a) The optimized coverage under different numbers of STAR-RISs.

V-C2 Impact of the Number of STAR-RISs

Fig. 7 depicts the optimized coverage and capacity versus the number of STAR-RISs. As shown in Fig. 7(a), the coverage of all cases grows steadily as the number of STAR-RISs increases. When the number of STAR-RISs $N_{s}$ reaches 4, the coverage of the BM1 and BM2 cases exceeds 0.4, and that of both proposed update strategies exceeds 0.6. This is because, as the number of STAR-RISs increases, the STAR-RISs help to compensate the received RSRP of some sample points so that it reaches $R_{t h}$. For the capacity depicted in Fig. 7(b), the gain of the AVUS over the BM2 case reaches 23.48dB, while the gain of the LFUS over BM1 is only 0.31dB. This is because the STAR-RISs can compensate for the severe attenuation of the channels from the BSs to the sample points, which indicates the effectiveness of STAR-RISs. Also, compared to the "Without STAR-RISs" case, the STAR-RISs improve the coverage and capacity of the whole serving area. The gap between any multi-objective optimization solution (fixed weights or proposed strategies) and the "Without STAR-RISs" case keeps widening as the number of STAR-RISs increases. To sum up, the proposed update strategies also outperform the benchmarks in optimizing coverage and capacity. Since the STAR-RISs have shown their ability to improve spectrum utilization, the "Without STAR-RISs" case will not be discussed in the following subsections.

Fig. 8: The optimized coverage and capacity for the MO-PPO algorithm with fixed weights, AVUS, and LFUS with different numbers of elements in STAR-RISs. (a) The optimized coverage with different numbers of elements in STAR-RISs.

V-C3 Impact of the Number of Elements in STAR-RISs

Fig. 8 describes the optimized coverage and capacity versus the number of elements in STAR-RISs. It can be observed that the coverage changes only slightly in Fig. 8(a). The maximum gain in the optimized capacity among the four cases in Fig. 8(b) reaches 11.01dB when the number of elements in STAR-RISs increases to 36. This shows that the number of elements in STAR-RISs has a large impact on the optimized capacity. The reason is that the role of each element is to transmit the BS signal to each sample point, so increasing the number of elements adds multiple links that reduce the loss. Compared with increasing the number of STAR-RISs, increasing the number of elements does not change the fast-fading channel characteristics of distant sample points. Moreover, for the coverage, the LFUS outperforms the AVUS, which suggests that the LFUS is preferable when only coverage is optimized as the number of elements changes. However, when both coverage and capacity are optimized, the AVUS is better than the LFUS. Also, both proposed update strategies outperform the benchmarks.

V-C4 Impact of the Physical Size of STAR-RISs

Fig. 9: The optimized coverage and capacity for the MO-PPO algorithm with fixed weights, AVUS, and LFUS under different physical sizes of STAR-RISs. (a) Case 1: The optimized coverage and capacity under different heights of STAR-RISs, $ω_{n_{s}} = 2$.

To evaluate the impact of the physical size of STAR-RISs on the optimized coverage and capacity, the height $h_{n_{s}}$ and width $ω_{n_{s}}$ of the STAR-RIS module are considered. In this scenario, the number of STAR-RISs $N_{s}$, the number of total grids $N$, and the number of elements $K$ are set as $N_{s} = 2$, $N = 16$, and $K = 16$. According to $h_{b}$ and $R_{g}$, the thresholds in (1) and (2) can be calculated as 1m and 4m, respectively. Since $I_{h_{n_{s}}}$ = 1 has been discussed before, the other three scenarios are further considered as follows:

Case 1: The width of the STAR-RISs is larger than the threshold, $I_{ω_{n_{s}}}$ = 0
Case 2: The width of the STAR-RISs is smaller than the threshold, $I_{ω_{n_{s}}}$ = 1
Case 3: The height of the STAR-RISs is smaller than the threshold, $I_{h_{n_{s}}}$ = 0

Fig. 9 demonstrates the optimized coverage and capacity for the MO-PPO algorithm with fixed weights, AVUS, and LFUS under different physical sizes of STAR-RISs. Fig. 9(a) shows the results for Case 1. In this case, there is at least one direct link between a BS and any given sample point. When the height is also below the threshold, all sample points have direct links with both BSs. Otherwise, one of the direct links between some sample points and the BSs may be blocked. The coverage and capacity fall sharply once the height of the STAR-RIS module exceeds the threshold of 1m. This is because the number of direct links plays a significant part in determining the received RSRP at each sample point: a larger number of direct links increases the probability of reaching $R_{t h}$ at each sample point. When only capacity is considered, the proposed update strategies perform better than the benchmarks, and the AVUS outperforms the LFUS. When only coverage is considered, the proposed strategies do not perform better than BM1.

Fig. 9(b) shows the results for Case 2. In this case, the number of direct links between the BSs and any given sample point can be 0, 1, or 2, which is determined by the height of the STAR-RIS module. When the height is also below the threshold, all sample points have direct links with both BSs. Otherwise, there is at most one direct link between each sample point and the BSs. The coverage and capacity decrease dramatically once the height of the STAR-RIS module exceeds the threshold of 1m. This is because the locations of the STAR-RISs then limit the number of direct links between the sample points and the BSs to 0 or 1. Different from Case 1, the optimized coverage of the proposed update strategies lies between those of the benchmarks. This may indicate that the direct links play an important part in the received RSRP, which needs to be further explored. Additionally, when only coverage is considered, the proposed strategies perform worse than BM1. However, when both coverage and capacity are considered, the proposed update strategies are acceptable in Case 2.

Fig. 9(c) shows the optimized coverage and capacity for Case 3. In this case, the height of the STAR-RIS module is fixed, and the number of direct links between the sample points and the BSs can be 0, 1, or 2. When the width of the STAR-RIS module is below the threshold, the number of direct links is 2 or 1. Otherwise, the number of direct links is 1 or 0. The capacity also falls sharply once the width exceeds 4m. This is because the locations of the STAR-RISs then limit the number of direct links between the sample points and the BSs to 0 or 1. As in Case 2, when only coverage is considered, the proposed strategies perform worse than BM1. Also, when both coverage and capacity are considered, the proposed update strategies are acceptable.

VI Conclusion

In this paper, the coverage and capacity are modeled by considering the geographic property. Based on this model, we proposed a new framework for CCO in STAR-RIS-assisted wireless networks, which optimizes the transmit power and the phase shift matrix. In order to optimize the coverage and capacity simultaneously, an AVUS for the MO-PPO algorithm is investigated to solve the CCO problem. Its goal is to integrate the action values of coverage and capacity, which then share the same loss function. However, the AVUS places strict requirements on computation resources, which increases the hardware cost. To handle this problem, another update strategy, i.e., the LFUS, is proposed to update the MO-PPO algorithm with an integrated loss function that combines the two loss functions of coverage and capacity. The LFUS dynamically assigns the weights of these loss functions with a min-norm solver at each update of the MO-PPO algorithm. The numerical results show that the investigated update strategies provide more efficient solutions than the fixed-weight MOO algorithms. In addition, the coverage and capacity of wireless networks can be enhanced simultaneously with limited energy consumption, since STAR-RISs perform passive beamforming.
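The dynamic weighting step of the LFUS can be sketched for the two-objective case. The sketch assumes that the min-norm solver takes the closed form commonly used in multiple-gradient descent methods such as [21] and [31], i.e., it finds the convex combination of the coverage and capacity gradients with the smallest norm. The function name `min_norm_weights` and the toy gradients are hypothetical, and the solver used in the simulations may differ in detail.

```python
import numpy as np

def min_norm_weights(g_cov, g_cap):
    """Weights (w_cov, w_cap), with w_cov + w_cap = 1 and both >= 0,
    that minimize ||w_cov * g_cov + w_cap * g_cap||^2.

    Assumption: closed-form two-objective min-norm solution; the inputs
    are the flattened gradients of the coverage and capacity losses.
    """
    g_cov = np.asarray(g_cov, dtype=float).ravel()
    g_cap = np.asarray(g_cap, dtype=float).ravel()
    diff = g_cov - g_cap
    denom = float(diff @ diff)
    if denom == 0.0:
        # Identical gradients: every convex combination has the same norm.
        return 0.5, 0.5
    # Unconstrained minimizer of the quadratic, clipped to [0, 1].
    w_cov = float(np.clip(((g_cap - g_cov) @ g_cap) / denom, 0.0, 1.0))
    return w_cov, 1.0 - w_cov

# Orthogonal gradients of equal norm receive equal weights.
print(min_norm_weights([1.0, 0.0], [0.0, 1.0]))
# Aligned gradients: all weight goes to the gradient with the smaller norm.
print(min_norm_weights([1.0, 0.0], [3.0, 0.0]))
```

Under this assumption, the integrated loss at each update would be `w_cov * loss_cov + w_cap * loss_cap`, with the weights recomputed from the current gradients rather than fixed in advance.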

References

[1] X. Gao, W. Yi, A. Agapitos, H. Wang, and Y. Liu, “Coverage and Capacity Optimization in STAR-RISs Assisted Networks: A Machine Learning Approach,” arXiv preprint arXiv:2204.06390, 2022.
[2] Y. Liu et al., “Reconfigurable Intelligent Surfaces: Principles and Opportunities,” IEEE Commun. Surv. Tutor., vol. 23, no. 3, pp. 1546-1577, thirdquarter 2021.
[3] J. Xu, Y. Liu, X. Mu, and O. A. Dobre, “STAR-RISs: Simultaneous Transmitting and Reflecting Reconfigurable Intelligent Surfaces,” IEEE Commun. Lett., vol. 25, no. 9, pp. 3134-3138, Sept. 2021.
[4] E. Basar, M. Di Renzo, J. De Rosny, M. Debbah, M.-S. Alouini, and R. Zhang, “Wireless Communications Through Reconfigurable Intelligent Surfaces,” IEEE Access, vol. 7, pp. 116753-116773, 2019.
[5] J. Xu et al., “Simultaneously Transmitting and Reflecting Intelligent Omni-Surfaces: Modeling and Implementation,” IEEE Veh. Technol. Mag., vol. 17, no. 2, pp. 46-54, June 2022.
[6] C. Zhang, W. Yi, Y. Liu, Z. Ding, and L. Song, “STAR-IOS Aided NOMA Networks: Channel Model Approximation and Performance Analysis,” IEEE Trans. Wirel. Commun., doi: 10.1109/TWC.2022.3152703.
[7] L. Jorguseski, A. Pais, F. Gunnarsson, A. Centonza, and C. Willcock, “Self-organizing networks in 3GPP: Standardization and future trends,” IEEE Commun. Mag., vol. 52, no. 12, pp. 28–34, Dec. 2014.
[8] E. Balevi and J. G. Andrews, “Online Antenna Tuning in Heterogeneous Cellular Networks With Deep Reinforcement Learning,” IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 4, pp. 1113–1124, 2019.
[9] M. Aldababsa, A. Khaleel, and E. Basar, “Simultaneous Transmitting and Reflecting Intelligent Surfaces-Empowered NOMA Networks,” arXiv preprint arXiv:2110.05311, 2021.
[10] J. Zuo, Y. Liu, Z. Ding, L. Song, and H. Poor, “Joint design for simultaneously transmitting and reflecting (STAR) RIS assisted NOMA systems,” arXiv preprint arXiv:2106.03001, 2021.
[11] P. Perera, V. Warnasooriya, D. Kudathanthirige, and H. Suraweera, “Sum Rate Maximization in STAR-RIS Assisted Full-Duplex Communication Systems,” arXiv preprint arXiv:2203.04709, 2022.
[12] H. Niu, Z. Chu, F. Zhou, P. Xiao, and N. Al-Dhahir, “Weighted Sum Rate Optimization for STAR-RIS-Assisted MIMO System,” IEEE Trans. Veh. Technol., vol. 71, no. 2, pp. 2122-2127, Feb. 2022.
[13] C. Wu, X. Mu, Y. Liu, X. Gu, and X. Wang, “Resource Allocation in STAR-RIS-Aided Networks: OMA and NOMA,” IEEE Trans. Wirel. Commun., doi: 10.1109/TWC.2022.3160151.
[14] T. Wang, M. Badiu, G. Chen, and J. Coon, “Performance Analysis of IOS-Assisted NOMA System with Channel Correlation and Phase Errors,” arXiv preprint arXiv:2112.11512, 2021.
[15] C. Wu, Y. Liu, X. Mu, X. Gu, and O. Dobre, “Coverage characterization of STAR-RIS networks: NOMA and OMA,” IEEE Commun. Lett., vol. 25, no. 9, pp. 3036-3040, 2021.
[16] A. Asghar, H. Farooq, and A. Imran, “Concurrent Optimization of Coverage, Capacity, and Load Balance in HetNets Through Soft and Hard Cell Association Parameters,” IEEE Trans. Veh. Technol., vol. 67, no. 9, pp. 8781-8795, Sept. 2018.
[17] N. Dandanov, H. Al-Shatri, A. Klein, and V. Poulkov, “Dynamic Self-Optimization of the Antenna Tilt for Best Trade-off Between Coverage and Capacity in Mobile Networks,” Wirel. Pers. Commun., vol. 92, pp. 251–278, 2017.
[18] M. Skocaj, L. Amorosa, G. Ghinamo, G. Muratore, D. Micheli, F. Zabini, and R. Verdone, “Cellular Network Capacity and Coverage Enhancement with MDT Data and Deep Reinforcement Learning,” arXiv preprint arXiv:2202.10968, 2022.
[19] R. Dreifuerst et al., “Optimizing Coverage and Capacity in Cellular Networks using Machine Learning,” Proc. of IEEE Int. Conf. Acoust. Speech Signal Process., 2021, pp. 8138-8142.
[20] R. Yang, X. Sun, and K. Narasimhan, “A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation,” Adv. Neural Inf. Process. Syst, pp. 1–27, 2019.
[21] O. Sener and V. Koltun, “Multi-task learning as multi-objective optimization,” Adv. Neural Inf. Process. Syst, pp. 31-45, 2018.
[22] Y. Liu, W. Huangfu, H. Zhang, and K. Long, “An efficient stochastic gradient descent algorithm to maximize the coverage of cellular networks,” IEEE Trans. Wirel. Commun., vol. 18, no. 7, pp. 3424-3436, Jul. 2019.
[23] Y. Liang, R. Long, Q. Zhang, J. Chen, H. V. Cheng, and H. Guo, “Large Intelligent Surface/Antennas (LISA): Making Reflective Radios Smart,” J. Commun. Netw., vol. 4, no. 2, pp. 40-50, Jun. 2019.
[24] X. Mu, Y. Liu, L. Guo, J. Lin, and R. Schober, “Joint Deployment and Multiple Access Design for Intelligent Reflecting Surface Assisted Networks,” IEEE Trans. Wirel. Commun., vol. 20, no. 10, pp. 6648-6664, Oct. 2021.
[25] E. Bjornson and L. Sanguinetti, “Rayleigh fading modeling and channel hardening for reconfigurable intelligent surfaces,” IEEE Wirel. Commun. Lett., vol. 10, no. 4, pp. 830-834, Apr. 2021.
[26] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and C. Yuen, “Reconfigurable intelligent surfaces for energy efficiency in wireless communication,” IEEE Trans. Wirel. Commun., vol. 18, no. 8, pp. 4157-4170, Aug. 2019.
[27] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” Proc. of Int. Conf. Mach. Learn., PMLR, 2015, pp. 1889-1897.
[28] L. Watson and R. Haftka, “Modern homotopy methods in optimization,” Comput. Methods Appl. Mech. Eng., vol. 74, no. 3, pp. 289–305, 1989.
[29] K. Deb, “Multi-objective evolutionary algorithms: Introducing bias among Pareto-optimal solutions,” Adv. Evol. Comput., pp. 263-292. Springer, Berlin, Heidelberg, 2003.
[30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[31] J. Désidéri, “Multiple-gradient descent algorithm (MGDA) for multiobjective optimization,” Comptes Rendus Math., vol. 350, no. 5, pp. 313-318, 2012.