Optimal Regularized Online Convex Allocation by Adaptive Re-Solving

Wanteng Ma$^{1}$, Ying Cao$^{2}$, Danny H.K. Tsang$^{2}$, Dong Xia$^{1}$

$^{1}$ Department of Mathematics, HKUST

$^{2}$ Department of Electronic and Computer Engineering, HKUST

Ma and Cao are co-first authors. Ma’s research was partially supported by Hong Kong PhD Fellowship No. PF20-46281. Tsang’s research was partially supported by Hong Kong RGC GRF 16211220; Xia’s research was partially supported by Hong Kong RGC Grant GRF 16300121 and 16301622.

(September 2, 2022)

Abstract

This paper introduces a dual-based algorithm framework for solving regularized online resource allocation problems, which have cumulative convex rewards, hard resource constraints, and a non-separable regularizer. Under a strategy of adaptively updating the resource constraints, the proposed framework only requires an approximate solution to the empirical dual problem up to a certain accuracy, and yet delivers an optimal logarithmic regret under a locally strongly convex assumption. Surprisingly, a delicate analysis of the dual objective function enables us to eliminate the notorious log log factor in the regret bound. The flexible framework renders renowned and computationally fast algorithms immediately applicable, e.g., dual gradient descent and stochastic gradient descent. A worst-case square-root regret lower bound is established if the resource constraints are not adaptively updated during dual optimization, which underscores the critical role of adaptive dual variable update. Comprehensive numerical experiments and a real data application demonstrate the merits of the proposed algorithm framework.

1 Introduction

Online resource allocation seeks to maximize the total rewards in an online service system that is subject to resource constraints. As an exemplary model for sequential decision making, online allocation has drawn considerable attention in recent decades. Meanwhile, it is strongly connected to other online problems such as revenue management (talluri2004theory), online linear programming (agrawal2014dynamic) and ads bidding problems (lee2013real), to name but a few. Online allocation finds applications in diverse fields, e.g., computer science and operations research. Oftentimes, online allocation problems feature resource constraints that are either hard (mehta2007adwords) or soft (mahdavi2012trading), with different constraint capacities. The goal of a decision maker is to maximize the total reward (revenue, utility) function by a real-time decision policy that enforces each of the resource constraints.

So far, existing literature on online allocation mostly focused on additively separable objectives, i.e., the objective function only involves the total rewards that can be simply described as the cumulative rewards by time (e.g., mehta2007adwords; devanur2009adwords; balseiro2019learning). While a separable objective is favorable for tracking additive total rewards, it falls short of describing globally non-separable quantities such as total resource consumption or average actions. For instance, the average action (agrawal2014fast) in online advertising measures the amount of under-delivery of impressions. Unfortunately, non-separable objectives are considerably under-explored in the literature, and particularly, there is a paucity of work investigating the impact of non-separable regularization on separable cumulative reward functions. Here we are interested in regularized online allocation problems, which add a non-separable regularizer to the objective function as a penalty for various purposes such as resource-saving, load balancing, diversity, and fairness (ghosh2009bidding; balseiro2021regularized). Compared with non-regularized online resource allocation that maximizes an additively separable objective, the non-separable regularization poses new challenges to algorithm design and regret analysis.

In this paper, we study the regularized online allocation problem with a concave reward function and linear resource constraints under the so-called random input model (goel2008online), where i.i.d. requests arrive sequentially and follow an unknown distribution. Decisions must be made sequentially, that is, once a request is received with a known reward function, the decision maker must instantly make a decision based on the current request, previous history and remaining resources. Throughout the paper, we impose hard constraints on the total resource consumption, which shall never be violated, so the decision maker must wisely control the resource consumption at any time. Clearly, the challenges of online allocation problems mainly stem from the dilemma of fulfilling the current request or reserving the resources for possibly more rewarding future ones. The task for a decision maker is to design a strategy that maximizes the regularized total rewards subject to resource constraints. The regularizer is a non-separable function of total resource consumption. A typical application of the problem under study is online advertising (mehta2007adwords; agrawal2018proportional), where a publisher needs to assign each impression to some advertiser and maximize the click-through rate with budget constraints on each advertiser. Oftentimes, other aspects of resource consumption, including fairness among advertisers or load balancing, are taken into consideration. Towards that end, a regularizer on total click-through rates can be added, in which case the objective function turns out to be the regularized cumulative total click-through rates.

Our main goal is to design computationally efficient algorithms for the aforementioned regularized online allocation problems, which, simultaneously, achieve theoretically optimal regrets. In the absence of a non-separable regularizer, it has been well recognized that the lower bound of regret of online allocation problems grows at a logarithmic rate (bray2019does; li2021online). The foregoing works also proposed adaptive policies that achieve logarithmic-order regrets up to an additional log log factor. Moreover, arlotto2019uniformly shows that adaptive policies are, generally, necessary to make a low regret possible. In sharp contrast, to the best of our knowledge, regrets achieved by prior algorithms (balseiro2021regularized) on regularized online allocation problems are of a square-root order. A first natural question is: can a logarithmic-order regret be achieved in the presence of a non-separable regularizer? Actually, we seek an even more ambitious goal: can we achieve a regret of exactly order $O(\log T)$ without the log log factor so that the lower bound is sharply met? The next question is more crucial: is there any computationally efficient algorithm that attains the desired regret? Surprisingly, we give affirmative answers to both questions by designing an adaptive algorithm framework that is flexible, computationally fast, and theoretically guaranteed to achieve the sharply optimal regret. Extensive numerical simulations and real data experiments are presented to corroborate the effectiveness of our algorithms.

1.1 Contributions

To summarize, we make the following contributions in this paper.

Sharp dual convergence in non-linear and regularized cases. We derive the convergence rate of the sample-version dual solution to its population counterpart in the case of additive non-linear reward functions and in the presence of a non-separable regularizer. The convergence rate is $O(T^{-1})$, which improves the known rate $O(T^{-1} \log\log T)$ that was established only for non-regularized linear reward functions (li2021online). The improvement is made possible by, jointly, a local strong convexity assumption on reward functions and a delicate analysis of the local behavior of the sample-version stochastic dual program near the population optimal solution. The observed local behavior and derived convergence performance are also valid for linear or non-regularized cases. This dual convergence crucially motivates our approach to treating a non-separable objective, which converts the non-separable primal problem into a dual one that consists of separable functions. Our analysis establishes a connection between the approximation errors measured by function values and the deviations of approximate solutions, which are determined by both intrinsic randomness and approximation of solutions. It suggests that any approximate solution, up to a certain accuracy, to the dual optimization suffices to guarantee the overall convergence of a primal-dual algorithm, which lays the theoretical foundation for our history-dependent algorithm design. It is noteworthy that, as a stochastic optimization problem, the derived dual convergence sheds new light on the open Sample Average Approximation (SAA) problems and may be of independent interest.

Adaptive algorithm framework. We propose a flexible dual-based and history-dependent, i.e., reliant on past data and actions, algorithm framework for solving the regularized online allocation problem. As a primal-dual algorithm framework, each iteration mainly consists of two routines: primal decision making and dual optimization. At a high level, our adaptive algorithm framework generalizes the history-dependent policy in online linear programming (li2021online), which evolves from the budget-ratio policy (arlotto2019uniformly; balseiro2019learning) and the re-solving heuristic in network revenue management (jasin2012re; wu2015algorithms). There are two key ingredients in the dual optimization of our algorithm framework. First, at each iteration, we adaptively update the average remaining resources in the dual problem. Besides fulfilling the resource constraints, this adaptive resource control plays a critical role in achieving an $O(\log T)$ regret rather than the $O(T^{1/2})$ one attained by balseiro2021regularized. Secondly, at each iteration, our algorithm framework only requires an approximate solution, up to a certain accuracy, to the dual optimization. This allows a flexible choice of computationally efficient algorithms for dual optimization, be they deterministic or stochastic. Paired with first-order methods, our algorithm enjoys an acceptable polynomial-time cost comparable to prior algorithms. More specifically, for strongly convex objectives, it requires $O(t)$ gradient computations at time $t$; for more general convex objectives, it requires $O(t^{3})$ gradient computations. Note that our algorithm framework is also applicable to linear reward functions or non-regularized online allocation problems.

Regret analysis. With the offline optimum as the benchmark, we investigate the regret attained by the adaptive algorithm framework for regularized online allocation problems. Since the regret is characterized by dual convergence, the aforementioned new result on dual convergence allows us to derive a sharp regret bound. More exactly, we show that our adaptive algorithm achieves an $O(\log T)$ regret, which matches the best results in constraint-free and non-regularized online convex optimization (hazan2007logarithmic) and the multi-secretary problem (bray2019does). A matching lower bound is established under our assumptions, demonstrating the optimality of our adaptive algorithm framework. To the best of our knowledge, this is the first theoretical guarantee of an exact $O(\log T)$ regret bound for online non-linear allocation with hard constraints and a non-separable regularizer. The best known regret even for online linear programming (li2021online) contains an additional $\log\log T$ factor. By comparing with existing algorithms, we clarify the critical role played by the adaptive resource control in controlling the stopping time and achieving a logarithmic-order regret. In particular, we establish a worst-case $\Omega(T^{1/2})$ lower bound for dual-based algorithms if the resource constraints are not adaptively updated. Basically, without updating the resource constraints, dual-based algorithms suffer from early stopping.

We then elaborate on the applications of our method and theory to online linear programming, online convex optimization, online welfare maximization and online convex packing. Simulation results are also presented.

1.2 Related Work

1.2.1 Online Linear Allocation

Many online problems with resource constraints can be formulated as online allocation problems. A large proportion of early work focused on linear models. vazirani2005adwords; mehta2007adwords; buchbinder2007online studied the AdWords problem, where a search engine tries to assign keywords to a set of competing bidders, each with a spending limit (i.e., constraint), and the goal is to maximize the revenue generated by these keyword sales. The rewards in the AdWords problem are proportional to the consumed resources, so the problem is a special case of online linear allocation. By viewing AdWords as a generalization of the online bipartite matching problem, mehta2007adwords achieved an optimal $(1 - e^{-1})$-competitive ratio, which is defined as the ratio of the revenue of an online algorithm to the revenue of the best offline algorithm. Under a so-called random permutation model, devanur2009adwords proposed a two-phase dual training algorithm for the AdWords problem and achieved the regret $O(T^{2/3})$. The random permutation model, which assumes that the arrivals are in random order and the order itself is uniformly distributed over all permutations, is more general than the random input model, which assumes i.i.d. arrivals; the random input model can be treated as a special case of the random permutation model (mehta2013online). More discussions on online allocation problems under the random permutation model can be found in babaioff2008online; goel2008online; molinaro2014geometry and references therein.

Apart from AdWords, two major topics related to online linear allocation are the online revenue management problem and the online multi-secretary problem. In online revenue management, a decision maker aims to find a dynamic pricing policy that maximizes a company’s linear total rewards when the number of supplied products is finite, demands for these products arrive sequentially, and the resources for manufacturing the products are limited. Online revenue management finds diverse applications in industry such as rental services, air travel, hospital services (talluri2004theory), etc. The earliest regret analysis of this problem dates back to cooper2002asymptotic, which proposed a static LP-based algorithm and achieved an $O(T^{1/2})$ regret. Later works show that better regret is achievable by a re-solving strategy, i.e., repeatedly solving an optimization program but with updated information. By combining the re-solving strategy and a trigger-and-threshold mechanism, reiman2008asymptotically reduced the regret significantly to $O(T^{1/4})$. Equipped with sufficiently frequent re-solvings, jasin2015performance proposed to re-estimate the parametric distribution of arrivals and proved that an $O(\log^{2} T)$ regret is attained. jasin2012re; wu2015algorithms and bumpensanti2020re investigated the special case when the i.i.d. arrivals obey a discrete distribution with finite support and established $O(1)$ regrets for re-solving style algorithms when the resource constraints are constants. The online multi-secretary problem (kleinberg2005multiple; babaioff2007knapsack) is one of the simplest online allocation problems as it has only one integer constraint. Assuming the arrivals obey a known finite-support discrete distribution, arlotto2019uniformly proposed an online budget-ratio (BR) policy where decisions to fulfil or ignore requests are made by comparing the remaining average budget with some fixed thresholds.
Their BR policy is adaptive and achieved an $O (1)$ regret but is inapplicable to the case of multiple resource constraints. They also established a regret lower bound $Ω (T^{1 / 2})$ for all non-adaptive policies. Conversely, if the arrival distribution is continuous, e.g. a simple uniform distribution over $[0, 1]$ , bray2019does developed a regret lower bound $Ω (log T)$ even when the distribution is known to a decision maker.

Other independent works on online linear programming also contribute greatly to the understanding of online allocation problems. agrawal2014dynamic proposed a history-dependent dual-based algorithm that dynamically updates dual variables and periodically solves linear programs. Their algorithm achieved an $O(T^{1/2})$ regret under the random permutation model. When the arrivals satisfy the random input model, devanur2019near proved that a dual-based algorithm attains an $O(T^{1/2})$ regret. But their algorithm relies on the knowledge of the optimal allocation, which is unrealistic for most applications. Otherwise, their algorithm requires periodically computing the optimal solution to an offline linear program. More recently, li2021online introduced a history-dependent algorithm that adaptively updates the resource constraints, which achieved a regret $O(\log T \log\log T)$ that is almost optimal except for the $\log\log$ factor. But their strategy also requires exact solutions to offline linear programs of growing sizes, which may be computationally intractable for large $T$. An $\Omega(\log T)$ regret lower bound was established, which is consistent with bray2019does.

1.2.2 Online Convex Allocation

Linear objective functions only find limited applications in practice. Online convex allocation moves one step further by allowing convex objective functions. In agrawal2014fast, the authors investigated online convex programming that is equipped with a fixed and convex reward function. The imposed stochastic constraints are soft, meaning that a certain degree of constraint violation is allowed. They proposed a flexible algorithm framework based on online convex optimization, which, for general convex objectives, achieved an $O(T^{1/2})$ regret with $O(T^{1/2})$ constraint violations. Furthermore, if the objective function is smooth, their algorithm achieves an $O(\log T)$ regret with $O(\log T)$ constraint violations. The computational cost of their algorithm can be linear provided that the offline optimum is partially known. Otherwise, their algorithm requires solving a logarithmic number of convex programs to estimate the benchmark information periodically, which can be computationally expensive.

Recently, partly due to its computational efficiency, dual mirror descent has been extensively studied for online convex allocation problems. balseiro2022best; balseiro2020dual focused on a class of online allocation problems with separable reward functions and resource constraints that are proportional to the time horizon $T$. They proposed a dual-based mirror descent algorithm that achieves $O(\sqrt{T})$ regret and was said to be unimprovable under their assumptions. Their algorithm updates dual variables by mirror descent and makes primal decisions by the conjugate functions. Their approach to controlling regret puts less emphasis on stopping time and focuses more on the complementary slackness of dual variables within updates. The rationale behind dual mirror descent is that it presents a self-correcting mechanism that naturally prevents resources from depleting too fast. This self-correcting mechanism relies on dual updates; that is, when a request consumes more resources, the corresponding dual variables will move against the excessive consumption, thus leading to a more conservative subsequent action. The problem we study in this paper is closer to balseiro2021regularized, which is the first to study online convex allocation problems with a non-separable regularizer and hard resource constraints. Their approach is similar to the non-regularized cases (balseiro2022best; balseiro2020dual), except that they define a new separable dual problem and update dual variables using regularized subgradients since they allow non-smooth regularizers. They showed that, for regularized online convex allocation, dual mirror descent can still perform well and attain an $O(T^{1/2})$ regret. While this regret is optimal for general convex reward functions under both stochastic and adversarial input models, it is sub-optimal when the reward functions possess more favourable conditions like strong convexity.
More recently, lobos2021joint extended dual mirror descent to an even more challenging online allocation problem where the separable objective and non-linear constraints are not necessarily convex. They proposed a novel benchmark to measure the regret and concluded that an $O (T^{1 / 2})$ regret is achievable by dual mirror (sub-gradient) descent.

Besides deterministic and hard constraints, a large body of literature on online convex programming focuses on stochastic constraints and allows constraint violations. For instance, yu2017online investigated online convex optimization with stochastic constraints and adversarial rewards. An $O(T^{1/2})$ bound is achieved for both the regret and constraint violations. A closely related problem is the long-term constraint problem, which aims to solve an online convex optimization problem by permitting a small number of cumulative constraint violations. mahdavi2012trading; jenatton2016adaptive designed algorithms achieving $O(T^{1/2})$ regrets and $O(T^{3/4})$ constraint violations. When the objective function is strongly convex, yuan2018online proposed an algorithm that achieves an $O(\log T)$ regret at the cost of $O((T \log T)^{1/2})$ constraint violations.

It is worth briefly mentioning the literature on general online convex optimization, which laid the early foundations of online convex allocation problems. For strongly convex objectives, classical literature on online convex optimization has revealed an optimal logarithmic regret. See zinkevich2003online; hazan2007logarithmic and references therein. It is reasonable to expect a logarithmic-order regret for other online problems in the presence of strong convexity. Nevertheless, achieving a logarithmic-order regret is challenging if an additional non-separable regularizer is imposed. In the literature, regularized online convex programming is commonly solved by follow-the-regularized-leader style algorithms. For instance, xiao2010dual introduced regularized dual averaging (RDA), which is an extension of the simple dual averaging algorithm originally proposed by nesterov2009primal, showing that an $O(T^{1/2})$ regret is achieved for a general convex regularizer and an $O(\log T)$ regret for a strongly convex regularizer. However, their regularizer is separable, and hence their RDA scheme is inapplicable to our problem. A generalized follow-the-regularized-leader framework is summarized in mcmahan2011follow; mcmahan2017survey, which includes many online-mirror-descent style algorithms as special cases. Our dual-based adaptive algorithm differs from the follow-the-regularized-leader algorithms as it exploits more historical information rather than just the gradients and past actions, and it does not follow the leader. A broader introduction to general online convex optimization can be found in hazan2016introduction.

1.3 Notations

Some notations will be used throughout the paper. Define $a \land b := \min\{a, b\}$ and $a \lor b := \max\{a, b\}$. Write $[n]$ as the shorthand of $\{1, \dots, n\}$. Define the non-negative region $\mathbb{R}_{+} := \{x \mid x \geq 0\}$. We always use $i$ to index dimensions: $d_{i}$ denotes the $i$-th entry of a vector $d$, and for a vector sequence $\{d_{t}\}_{t=1}^{T}$, $d_{it}$ stands for the $i$-th entry of the vector $d_{t}$. Denote $(x)^{+} := \max\{x, 0\}$, and write $\|\cdot\|_{2}$ and $\|\cdot\|_{\infty}$ for the vector $\ell_{2}$-norm and $\ell_{\infty}$-norm, respectively.

2 Regularized Online Allocation Problem

We describe the convex regularized online allocation problem with finite time period $T$ as follows:

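$$
\begin{aligned}
\max_{x_{t},\, t \in [T]} \quad & \sum_{t=1}^{T} f_{t}(x_{t}) + T \cdot r\left(\frac{\sum_{t=1}^{T} b_{t} x_{t}}{T}\right) \\
\text{s.t.} \quad & \sum_{t=1}^{T} b_{t} x_{t} \preceq d T, \quad d \in \mathbb{R}_{+}^{m}, \\
& x_{t} \in X, \quad \forall t \in [T],
\end{aligned}
\tag{2.1}
$$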

where $f_{t} : \mathbb{R}^{n} \to \mathbb{R}$ is the concave reward function, $r : \mathbb{R}^{m} \to \mathbb{R}$ is a concave regularizer that penalizes the average resource consumption, and $b_{t} \in \mathbb{R}^{m \times n}$ is the cost matrix, whose entries may be positive or negative (i.e., resources can be replenished). We assume our inputs are stochastic, meaning that the i.i.d. requests $\{(f_{t}, b_{t})\}_{t=1}^{T}$ are sampled from an unknown distribution $P$: $(f_{t}, b_{t}) \sim P$. The decision region $X \subseteq \mathbb{R}_{+}^{n}$ is closed and convex and contains the void action $0 \in X$.
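As a concrete reading of problem (2.1), the following minimal sketch evaluates the regularized objective and checks the hard budget constraint for a given decision sequence. It is only an illustration: the function names, the quadratic toy rewards, and the quadratic regularizer are assumptions made here and are not part of the paper.

```python
import numpy as np

def regularized_reward(rewards, costs, decisions, regularizer):
    """Objective of (2.1): sum_t f_t(x_t) + T * r(sum_t b_t x_t / T)."""
    T = len(decisions)
    total = sum(f(x) for f, x in zip(rewards, decisions))
    consumption = np.sum([b @ x for b, x in zip(costs, decisions)], axis=0)
    return total + T * regularizer(consumption / T)

def respects_budget(costs, decisions, d, tol=1e-9):
    """Check that cumulative consumption never exceeds d * T at any time."""
    T = len(decisions)
    usage = np.cumsum([b @ x for b, x in zip(costs, decisions)], axis=0)
    return bool(np.all(usage <= T * np.asarray(d) + tol))

# Hypothetical toy instance with n = 1 decision variable and m = 1 resource.
def quad_reward(c):
    # Assumed concave reward f(x) = c * x - x^2 / 2 (illustrative only).
    return lambda x: float(c * x[0] - 0.5 * x[0] ** 2)

rewards = [quad_reward(1.0), quad_reward(2.0)]
costs = [np.array([[1.0]]), np.array([[2.0]])]
decisions = [np.array([0.5]), np.array([1.0])]
concave_reg = lambda a: float(-0.5 * a @ a)  # assumed regularizer r(a) = -||a||^2 / 2

value = regularized_reward(rewards, costs, decisions, concave_reg)
```

The budget check is applied to every prefix sum rather than only the final total, since cost entries may be negative and the constraint must hold at all times.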

Following the online sequential learning setting, we assume that at each time $1 \leq t \leq T$, we first receive a request with known reward function and cost $(f_{t}, b_{t})$ and then make the decision $x_{t}$ based on the observation of the $t$-th request and the history $H_{t-1} := \{f_{j}, b_{j}, x_{j}\}_{j=1}^{t-1}$:

$$x_{t} := A(f_{t}, b_{t}, H_{t-1}),$$

by taking the total resource constraints $\sum_{j=1}^{t} b_{j} x_{j} \preceq d T$ into consideration. Here $A$ denotes a history-dependent algorithm. Our goal is to design such an online algorithm $A$ that can maximize the regularized total reward $\sum_{t=1}^{T} f_{t}(x_{t}) + T \cdot r(T^{-1} \cdot \sum_{t=1}^{T} b_{t} x_{t})$. Define the expected reward of algorithm $A$ over a given distribution $P$ as

$$R(A \mid P) := \mathbb{E}_{A, P}\left[\sum_{t=1}^{T} f_{t}(x_{t}) + T \cdot r\left(\frac{\sum_{t=1}^{T} b_{t} x_{t}}{T}\right)\right]. \tag{2.2}$$

Here we take expectation with respect to both the inputs and the algorithm $A$ if $A$ is a stochastic algorithm. To measure the performance of an online algorithm, we compare the algorithm reward with the expected offline optimum (or hindsight optimum) defined by

$$R^{*}(P) := \mathbb{E}_{P}\left[\max_{x_{t} \in X} \ \sum_{t=1}^{T} f_{t}(x_{t}) + T \cdot r\left(\frac{\sum_{t=1}^{T} b_{t} x_{t}}{T}\right) \quad \text{s.t.} \ \sum_{t=1}^{T} b_{t} x_{t} \preceq d T\right], \tag{2.3}$$

which serves as the benchmark performance. For a given $P$ , define the regret as $Regret (A | P) := R^{*} (P) - R (A | P)$ . We then define the worst-case regret of an algorithm $A$ as the worst difference between the expected online reward and offline optimum over all the possible distributions in a certain probability family $Ξ$ :

$$\mathrm{Regret}(A) := \sup_{P \in \Xi} \left\{R^{*}(P) - R(A \mid P)\right\}, \tag{2.4}$$

where the distribution family $Ξ$ will be identified later.

Compared with unconstrained online optimization, the key obstacle to designing algorithms for the online allocation problem is to enforce the total resource constraints, which shall not be violated at any time. However, we can transform the primal problem into a dual one with fewer constraints by the duality theory. This motivates us to investigate the problem (2.1) from the dual perspective.

2.1 The dual problem

We consider the dual problem of online allocation (2.1). The Lagrangian of this problem is

$$L(x, a, \lambda, \mu) := \sum_{t=1}^{T} f_{t}(x_{t}) + T \cdot r(a) + \mu^{\top}\left(a T - \sum_{t=1}^{T} b_{t} x_{t}\right) + \lambda^{\top}\left(d T - \sum_{t=1}^{T} b_{t} x_{t}\right). \tag{2.5}$$

Here we introduce the equality constraint $a = (\sum_{t=1}^{T} b_{t} x_{t}) / T$ in order to separate $r(T^{-1} \cdot \sum_{t=1}^{T} b_{t} x_{t})$ into additive terms. Denote the domain of $r(a)$ as $Z$ with $b \circ X := \mathrm{span}\{b \cdot x \mid \text{for all possible } b \text{ and } x \in X\} \subseteq Z$. Define the conjugate functions

$$
\begin{aligned}
f_{t}^{*}(\lambda) &:= \max_{x \in X} \left\{f_{t}(x) - x^{\top} \lambda\right\}, \\
r^{*}(\mu) &:= \max_{a \in Z} \left\{r(a) - a^{\top} \mu\right\}.
\end{aligned}
\tag{2.6}
$$
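To make the conjugate in (2.6) concrete, here is a small worked case. It is an assumed toy example, not one taken from the paper: take $n = 1$, $X = [0, D]$ and the strictly concave reward $f_{t}(x) = c_{t} x - \frac{1}{2} x^{2}$, where $c_{t}$ is a hypothetical coefficient. The maximizer and the conjugate are then available in closed form:

$$
x_{t}(\lambda) = \min\{\max\{c_{t} - \lambda, 0\}, D\}, \qquad
f_{t}^{*}(\lambda) =
\begin{cases}
0, & c_{t} - \lambda < 0, \\
\frac{1}{2}(c_{t} - \lambda)^{2}, & 0 \leq c_{t} - \lambda \leq D, \\
(c_{t} - \lambda) D - \frac{1}{2} D^{2}, & c_{t} - \lambda > D.
\end{cases}
$$

In this toy case a larger dual price $\lambda$ shrinks the decision $x_{t}(\lambda)$ toward the void action $0$.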

Then, the dual problem of (2.1) can be written as

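$$
\begin{aligned}
\min_{\mu, \lambda} \quad & \bar{D}_{T}(\lambda, \mu, d) := \frac{1}{T} \sum_{t=1}^{T} f_{t}^{*}\left(b_{t}^{\top}(\mu + \lambda)\right) + r^{*}(-\mu) + d^{\top} \lambda \\
\text{s.t.} \quad & \lambda \succeq 0.
\end{aligned}
\tag{2.7}
$$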
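For completeness, the step from the Lagrangian (2.5) to the dual objective (2.7) can be spelled out as follows. The sketch uses only the definitions in (2.5) and (2.6); the multiplier $\mu$ is unrestricted because it belongs to an equality constraint, while $\lambda \succeq 0$ belongs to the inequality constraint.

$$
\begin{aligned}
\max_{x_{t} \in X,\, a \in Z} L(x, a, \lambda, \mu)
&= \sum_{t=1}^{T} \max_{x_{t} \in X} \left\{f_{t}(x_{t}) - x_{t}^{\top} b_{t}^{\top}(\mu + \lambda)\right\} + T \max_{a \in Z} \left\{r(a) + a^{\top} \mu\right\} + T d^{\top} \lambda \\
&= \sum_{t=1}^{T} f_{t}^{*}\left(b_{t}^{\top}(\mu + \lambda)\right) + T r^{*}(-\mu) + T d^{\top} \lambda \\
&= T \cdot \bar{D}_{T}(\lambda, \mu, d).
\end{aligned}
$$

Minimizing over $\mu$ and $\lambda \succeq 0$ and dividing by $T$ gives (2.7).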

Under our stochastic input assumption, (2.7) can be viewed as a sample average approximation (SAA) (shapiro2009lectures) of the following stochastic program:

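$$
\begin{aligned}
\min_{\mu, \lambda} \quad & D(\lambda, \mu, d) := \mathbb{E}\, f_{t}^{*}\left(b_{t}^{\top}(\mu + \lambda)\right) + r^{*}(-\mu) + d^{\top} \lambda \\
\text{s.t.} \quad & \lambda \succeq 0.
\end{aligned}
\tag{2.8}
$$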

In the following discussion, we will sometimes write the dual variable uniformly as $λ := [λ^{⊤}, μ^{⊤}]^{⊤}$ in shorthand. If we knew the exact offline solution to (2.7), denoted by $λ_{T}^{*}$, then we could optimize the primal problem (2.1) by choosing the corresponding primal variables. However, in the online setting it is impossible to find such an exact dual solution before time $T$. Thus at time $t$ we instead solve the $t$-sample average approximation of $D(λ, μ, d)$, i.e.,

$$
\begin{aligned}
\min_{\mu, \lambda} \quad & \bar{D}_{t}(\lambda, \mu, d) := \frac{1}{t} \sum_{j=1}^{t} f_{j}^{*}\left(b_{j}^{\top}(\mu + \lambda)\right) + r^{*}(-\mu) + d^{\top} \lambda \\
\text{s.t.} \quad & \lambda \succeq 0,
\end{aligned}
\tag{2.9}
$$

and then use the dual approximate solution $λ_{t}$ to decide the primal solution $x_{t}$ . Such a re-solving idea can also be found in other contexts (jasin2015performance; ferreira2018online; li2021online) and has shown its merit in controlling the regret both in theory and in practice. Hence we expect that this idea also works in convex online allocation problems. Nevertheless, to discuss how practical this re-solving idea is in our setting, we still have three crucial questions to answer:

What is the behavior of $λ_{T}^{*}$ when $T$ is large? We know that $λ_{T}^{*}$ varies depending on the data we collected. But from the stochastic programming perspective, as $T$ goes large, the optimal solution to the SAA (2.7), $λ_{T}^{*}$ , will converge to the solution to its stochastic program (2.8), denoted by $λ^{*}$ . If we want to establish the theory of dual-based algorithms that rely on the approximation of $λ_{T}^{*}$ , we need to explore the convergence behavior of $λ_{T}^{*}$ toward $λ^{*}$ before we proceed with the study of algorithms.
How will the dual approximate solutions affect our reward and, consequently, the regret? This question is the key for the algorithm design. For online allocation problems, a good approximation of $λ^{*}$ or $λ_{T}^{*}$ does not necessarily mean a good reward because of the restriction imposed by resource depletion and stopping time. As we will show later, simply solving the convex programming (2.9) is not enough to achieve the optimal regret. We attempt to explain the influence of dual approximate solutions on regret in two phases: before and after stopping time, and show that the adaptive strategy of updating constraints is necessary for optimal regret.
How to control the regret as well as make the algorithm computationally efficient? Most of the re-solving techniques require periodically solving potentially large-scale convex programming, which is computationally demanding. Interestingly, we will show that a proper approximation of dual optimal solutions up to certain precisions can significantly reduce the computational costs, while maintaining the optimal order of regret. The influence of our approximation scheme on the regret is, in general, negligible when compared to the exact optimal solutions.
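The plain re-solving idea behind (2.9), which these three questions examine, can be sketched on a toy instance. The sketch below is not the paper's adaptive algorithm: it keeps $d$ fixed, does not track remaining resources or a stopping time, and uses projected gradient descent as one possible approximate dual solver. The scalar setting ($n = m = 1$), the reward $f_{j}(x) = c_{j} x - x^{2}/2$ on $[0, D]$, the regularizer $r(a) = -\kappa a^{2}/2$ (with the maximizer in $r^{*}$ assumed interior to $Z$), and all function names are assumptions made for illustration.

```python
import numpy as np

def primal_response(c, b, price, D):
    """Maximizer of c*x - x**2/2 - x*b*price over [0, D] (assumed toy reward)."""
    return np.clip(c - b * price, 0.0, D)

def dual_gradient(lam, mu, c, b, d, kappa, D):
    """Gradient of the sample dual (2.9) in (lam, mu) for the toy instance."""
    x = primal_response(c, b, lam + mu, D)
    avg_use = np.mean(b * x)           # average consumption at the current prices
    return d - avg_use, mu / kappa - avg_use

def solve_sample_dual(c, b, d, kappa, D, steps=2000, lr=0.1):
    """Approximately minimize (2.9) by projected gradient descent."""
    lam, mu = 0.0, 0.0
    for _ in range(steps):
        g_lam, g_mu = dual_gradient(lam, mu, c, b, d, kappa, D)
        lam = max(lam - lr * g_lam, 0.0)   # projection onto lam >= 0
        mu = mu - lr * g_mu                # mu is unconstrained in this sketch
    return lam, mu

def resolve_policy(c, b, d, kappa, D):
    """At each time t, re-solve the t-sample dual and act on request t."""
    decisions = []
    for t in range(1, len(c) + 1):
        lam, mu = solve_sample_dual(c[:t], b[:t], d, kappa, D)
        decisions.append(float(primal_response(c[t - 1], b[t - 1], lam + mu, D)))
    return decisions

# Hypothetical data: identical requests, so the dual solution is known by hand.
c = np.full(5, 2.0)
b = np.ones(5)
lam_hat, mu_hat = solve_sample_dual(c, b, d=0.5, kappa=1.0, D=10.0)
xs = resolve_policy(c, b, d=0.5, kappa=1.0, D=10.0)
```

For this hand-checkable instance the budget binds, so the stationarity conditions give average consumption $0.5$, $\mu = 0.5$ and $\lambda = 1$. With random requests the non-adaptive rule above may deplete resources early, which is the issue the second question raises.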

We propose an online adaptive algorithm for solving program (2.1), which achieves logarithmic regret based on the following assumptions.

2.2 Assumptions

Assumption 1 (Basic assumptions on arrivals).

The arrival sequences $\{(f_{t}, b_{t})\}$ satisfy:

1. $\{(f_{t}, b_{t})\}_{t=1}^{T}$ are generated i.i.d. from distribution $P$.
2. $f_{t}$ is strictly concave in the closed convex decision region $X \subseteq \mathbb{R}_{+}^{n}$ with $\|x\|_{\infty} \leq D$ for any $x \in X$.
3. There exists $\bar{f} \in \mathbb{R}_{+}$ such that $|f_{t}(x)| \leq \bar{f}$ for all $x \in X$.
4. There exists $\bar{b} \in \mathbb{R}_{+}$ such that $\|b_{t}\|_{2} \leq \bar{b}$ for any $t$.
5. There exist $\underline{d} > 0$ and a large $\bar{d} > 0$ such that $d_{i} \in (\underline{d}, \bar{d})$ for any $i \in [m]$. Denote $\Omega_{d} = \bigotimes_{i=1}^{m} (\underline{d}, \bar{d})$.

The assumptions on the upper bounds $\bar{f}$ and $\bar{b}$ are common and practical in online allocation problems. They help us control the size of the problem and ease our analysis. Item 5 of Assumption 1 follows li2021online. We assume that the average resource constraint $d$ is of a reasonable size, i.e., $d_{i}$ is neither too large nor too small. If $d_{i}$ is too large, then the constraint itself is of no interest because the restriction it imposes on the primal variables is negligible. This assumption is crucial for the subsequent discussion of regret, especially for bounding the stopping time.

Under Assumption 1, we can define the general feasible region of our regularizer $r (a)$ as $Z := \{a \mid {∥ a ∥}_{2} \leq \sqrt{n} D \bar{b}\}$, which satisfies $b \circ X \subseteq Z$. We then describe the necessary assumptions on the regularizer $r$.

In order to study the influence of the average constraint $d$ on adaptive algorithms and how the variation of it affects the solution, we need the following assumptions.

Assumption 2 (Assumptions on the regularizer).

Suppose $(λ^{*}, μ^{*})$ is the optimal solution to the problem (2.8) when $d \in Ω_{d}$ . Then for any $d \in Ω_{d}$ , the concave regularizer $r$ is either $0$ or satisfies:

2.1 $r$ is strictly concave and bounded in $Z$: $| r | \leq \bar{r}$, with bounded (sub)gradient ${∥ \nabla r (a) ∥}_{\infty} \leq G$ for any $a \in Z$.
2.2 The conjugate $r^{*}$ satisfies $⟨ \nabla r^{*} (- μ) - \nabla r^{*} (- μ^{*}), μ^{*} - μ ⟩ \geq \underline{L}_{r} {∥ μ - μ^{*} ∥}_{2}^{2}$ for any $μ$ satisfying ${∥ μ ∥}_{\infty} \leq G$ and some constant $\underline{L}_{r} > 0$.
2.3 The conjugate $r^{*}$ satisfies ${∥ \nabla r^{*} (- μ) - \nabla r^{*} (- μ^{*}) ∥}_{2} \leq \bar{L}_{r} {∥ μ - μ^{*} ∥}_{2}$ for any $μ$ satisfying ${∥ μ ∥}_{\infty} \leq G$ and some constant $\bar{L}_{r} > 0$.

Together with Assumption 1, we can show that both the population-version and sample-version optimal solutions, $λ^{*}$ and $λ_{T}^{*}$ , respectively, are uniformly bounded.

Lemma 1.

Under Assumptions 1 and 2, the optimal solutions to problems (2.7) and (2.8) are bounded by:

(2.10)

By Lemma 1, we define the regions that contain all the possible optimal dual variables as $Ω_{λ}$, the bounded set determined by (2.10), and $Ω_{μ} := \{μ \mid {∥ μ ∥}_{\infty} \leq G\}$. These regions will be the feasible sets of our dual variables, since we do not want them to move far from the optimal solution $λ^{*}$. Assumptions 2.2 and 2.3 require the conjugate of the regularizer to be smooth and to have quadratic growth. This can be achieved if the regularizer $r$ is locally strongly convex and smooth (see kakade2009duality or agrawal2014fast for the conjugate of strongly convex/smooth functions). Our assumption is, however, slightly weaker than directly assuming strong convexity and smoothness of $r$ itself.

Here are several possible regularizers that satisfy our assumptions:

$ℓ_{2}$ -loss: $r (a) := - κ {∥ a ∥}_{2}^{2}$ . This regularizer serves as a tool to directly penalize resource consumption and achieve the goal of resource saving.
Smooth minima: $r (a) := - κ \log (\sum_{i = 1}^{m} \exp (- a_{i}) + \exp (- \sqrt{n} D \bar{b}))$. This LogSumExp regularizer is a smooth approximation of the max-min fairness regularizer $r (a) := κ (\min_{i} a_{i} \land \sqrt{n} D \bar{b})$, which forces us to maximize the minimum resource consumption. Resources after max-min fairness regularization tend to be distributed fairly so that all resources are utilized adequately. See, e.g., nash1950; bertsimas2011price; balseiro2021regularized.
Smooth maxima: $r (a) := - κ \log (\sum_{i = 1}^{m} \exp (a_{i}) + 1)$. This regularizer is a smooth approximation of the negative maximum consumption $r (a) := - κ (\max_{i} a_{i} \lor 0)$. It represents the load-balancing task: we minimize the maximum resource consumption so that all the resources are evenly used and no resource is over-exploited (for example, a balanced load for every computer server).
Entropy loss: $r (a) := - κ [\sum_{i = 1}^{m} a_{i} \log (a_{i}) + (1 - \sum_{i = 1}^{m} a_{i}) \log (1 - \sum_{i = 1}^{m} a_{i})]$ with the corresponding feasible region $Z := \{a \in R_{+}^{m} \mid \sum_{i = 1}^{m} a_{i} \leq 1\}$. We use this entropy loss when our problem is related to random strategies and probabilistic assignment, e.g., in online advertising, where we randomly assign each impression to different advertisers with selected probabilities. This entropy loss regularizer seeks online allocation strategies with high entropy, which may have appealing properties such as diversity, fairness, or robustness (agrawal2018proportional).
Huber loss (huber1964robust): $r (a) := - κ \sum_{i = 1}^{m} [\frac{1}{2} a_{i}^{2} I (| a_{i} | \leq δ) + δ (| a_{i} | - \frac{δ}{2}) I (| a_{i} | > δ)]$ for some $δ > 0$. The conjugate of the Huber loss has a similar form: $r^{*} (μ) := \sum_{i = 1}^{m} [\frac{1}{2 κ} μ_{i}^{2} I (| μ_{i} | \leq κ δ) + (\sqrt{n} D \bar{b} | μ_{i} | - κ δ (\sqrt{n} D \bar{b} - \frac{δ}{2})) I (| μ_{i} | > κ δ)]$. The Huber loss satisfies our assumption if the optimal solution sits in the center of $Ω_{μ}$, i.e., ${∥ μ^{*} ∥}_{\infty} < κ δ$. This depends on the actual problem, since $μ^{*}$ is determined by both $f$ and $r$. The Huber loss shows that our regularizer need not be (globally) strongly convex and smooth. Compared with the $ℓ_{2}$-loss, the Huber loss penalizes extreme resource consumption more mildly.
No regularizer: $r (a) := 0$ . In this case, our problem is reduced to the non-regularized online convex allocation problem. Therefore, the theory developed in this paper is immediately applicable to the non-regularized cases.
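To make the list above concrete, the following is a minimal sketch of three of these regularizers in plain Python. The function names, the consumption vector `a`, the weight `kappa`, and the cap `cap` (standing in for $\sqrt{n} D \bar{b}$) are illustrative assumptions, not part of the paper. The sketch only illustrates that the two LogSumExp forms stay within $κ \log (m + 1)$ of their non-smooth counterparts.

```python
import math

def l2_regularizer(a, kappa):
    """r(a) = -kappa * ||a||_2^2."""
    return -kappa * sum(x * x for x in a)

def smooth_min_regularizer(a, kappa, cap):
    """r(a) = -kappa * log(sum_i exp(-a_i) + exp(-cap)), evaluated stably."""
    terms = [-x for x in a] + [-cap]
    shift = max(terms)
    return -kappa * (shift + math.log(sum(math.exp(v - shift) for v in terms)))

def smooth_max_regularizer(a, kappa):
    """r(a) = -kappa * log(sum_i exp(a_i) + 1), evaluated stably."""
    terms = list(a) + [0.0]
    shift = max(terms)
    return -kappa * (shift + math.log(sum(math.exp(v - shift) for v in terms)))

a = [0.3, 1.2, 0.7]    # hypothetical consumption vector (m = 3)
kappa, cap = 1.0, 5.0  # hypothetical weight and cap

hard_min = kappa * min(min(a), cap)   # max-min fairness regularizer
hard_max = -kappa * max(max(a), 0.0)  # negative maximum consumption
soft_min = smooth_min_regularizer(a, kappa, cap)
soft_max = smooth_max_regularizer(a, kappa)
gap = kappa * math.log(len(a) + 1)    # LogSumExp approximation gap
```

Both smooth versions lie below their hard counterparts by at most `gap`, which follows from the standard bound $\max_i v_i \leq \log \sum_i \exp (v_i) \leq \max_i v_i + \log N$ for $N$ terms.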

In addition, we need the following non-degeneracy assumptions.

Assumption 3 (Non-degeneracy).

We assume that our problem is non-degenerate: suppose $(λ^{*}, μ^{*})$ is the optimal solution to the problem (2.8) when $d \in Ω_{d}$ . For ease of notations, we write $λ^{*}$ and $μ^{*}$ instead of $λ^{*} (d)$ and $μ^{*} (d)$ , respectively. Then for any $d \in Ω_{d}$ ,

3.1 Let $ν := λ + μ$ and $ν^{*} := λ^{*} + μ^{*}$. The conjugate function $f_{t}^{*}$ satisfies

\begin{aligned} {∥ E [\nabla f_{t}^{*} (b_{t}^{⊤} ν) - \nabla f_{t}^{*} (b_{t}^{⊤} ν^{*}) \mid b_{t}] ∥}_{2} &\leq \bar{L}_{f} {∥ b_{t}^{⊤} ν - b_{t}^{⊤} ν^{*} ∥}_{2}, \\ E [⟨ \nabla f_{t}^{*} (b_{t}^{⊤} ν) - \nabla f_{t}^{*} (b_{t}^{⊤} ν^{*}), b_{t}^{⊤} ν - b_{t}^{⊤} ν^{*} ⟩ \mid b_{t}] &\geq \underline{L}_{f} {∥ b_{t}^{⊤} ν - b_{t}^{⊤} ν^{*} ∥}_{2}^{2} \end{aligned}

for any $λ \in Ω_{λ}$, $μ \in Ω_{μ}$ and constants $\bar{L}_{f}, \underline{L}_{f} > 0$, conditioning on $b_{t}$.

3.2 The matrix $M := E [b_{t} b_{t}^{⊤}]$ is positive definite with minimum eigenvalue $σ_{min} > 0$.
3.3 Define the primal variable given $(λ^{*}, μ^{*})$ as $\tilde{x}_{t} (λ^{*}) := \arg\max_{x \in X} \{f_{t} (x) - (λ^{*} + μ^{*})^{⊤} b_{t} x\} = - \nabla f_{t}^{*} (b_{t}^{⊤} (λ^{*} + μ^{*}))$. Then the optimal solution $(λ^{*}, μ^{*})$ satisfies $λ_{i}^{*} = 0$ if and only if $d_{i} - E {(b_{t} \tilde{x}_{t} (λ^{*}))}_{i} > 0$.

Assumption 3.1 requires the expected conjugate of the reward function to exhibit local quadratic growth and smoothness, conditioning on any given $b_{t}$. Combined with the other conditions in Assumptions 1-3, it ensures that the stochastic program (2.8) is locally smooth and locally strongly convex. Assumption 3.1 controls the growth rate of the reward function (and its conjugate) so that it neither grows too fast nor degenerates to a line, which plays a critical role in characterizing dual solutions. Assumption 3.2 is easily satisfied since, oftentimes, the constraints are linearly independent. Assumption 3.3 imposes strong complementary slackness on the resource constraints uniformly over $d \in Ω_{d}$. This suggests that when $d$ changes within a certain region of $Ω_{d}$, the binding or non-binding dimensions (defined below) of the resource constraints at the optimal solution do not change, which is convenient for analyzing adaptive algorithms with frequently updated constraints. Assumption 3 states the non-degeneracy condition for both primal and dual problems with nonlinear objectives, and generalizes the non-degeneracy condition of linear programs (jasin2012re; jasin2015performance; wu2015algorithms; li2021online). Note that Assumption 3 only concerns the deterministic problem (2.8); the empirical problem does not necessarily share these local properties.

In the sequel, all the dimensions that satisfy $d_{i} - E {(b_{t} \tilde{x}_{t} (λ^{*}))}_{i} = 0$ with respect to the original $d$ in (2.1) are referred to as binding dimensions. Denote by $I_{B} = \{i \mid d_{i} - E {(b_{t} \tilde{x}_{t} (λ^{*}))}_{i} = 0\}$ the collection of binding dimensions. Similarly, the non-binding dimensions are written as $I_{NB} = \{i \mid d_{i} - E {(b_{t} \tilde{x}_{t} (λ^{*}))}_{i} > 0\}$. For ease of notation, we omit the dependence of $I_{B}$ and $I_{NB}$ on the resource constraint $d$. Assumption 3 ensures that binding and non-binding dimensions can be uniquely determined by the dual solution $λ^{*}$.

Note that $- \nabla f_{t}^{*} (b_{t}^{⊤} ν)$ represents the primal solution given the dual variable $ν$. Its randomness stems from the stochastic reward function $f_{t}$. From this perspective, Assumption 3 concerns the effect of dual variables on their corresponding expected primal solutions. It turns out that the perturbation behavior of expected primal solutions alone is not sufficient for our analysis; we also need the perturbation behavior of the intrinsically random primal solutions, which can be controlled by the second moment. The following assumption serves this purpose. Equivalently, it describes the variation of the random reward function $f_{t}$. This second-order moment establishes the connection between dual variables and primal performance.

Assumption 4 (Smoothness of the second moment).

Let $ν := λ + μ$ and $ν^{*} := λ^{*} + μ^{*}$ when we choose $d \in Ω_{d}$ in (2.8). The second moment of the gradient $\nabla f_{t}^{*}$ satisfies the following smoothness

for any $d \in Ω_{d}$ , $λ \in Ω_{λ}$ , $μ \in Ω_{μ}$ and given $b_{t}$ , where $L_{2} > 0$ is a constant.

Assumption 4 requires the variation of the reward function given $b_{t}$, $f_{t} \sim P | b_{t}$, to be mild, so that the primal solution $\tilde{x}_{t} (λ) := - \nabla f_{t}^{*} (b_{t}^{⊤} (λ + μ))$ is smooth in the second-moment sense. Note that this does not mean that $\nabla f_{t}^{*}$ must be globally smooth. A similar description of smoothness can be found in gorbunov2020unified. Compared with Assumption 3.1, Assumption 4 states smoothness from a different perspective: Assumption 3.1 only requires smoothness of the expected reward, whereas Assumption 4 focuses on the variation of the random reward function itself. Basically, Assumption 4 claims that no matter how the reward $f_{t}$ varies, the difference of primal variables can be bounded by the difference of dual variables in expectation. Assumption 4 is not necessary for the study of dual convergence in Section 3, but it is indispensable for the theoretical study of adaptive algorithms and the regret analysis. We note that Assumptions 2-4 require the corresponding conditions to hold for all $d \in Ω_{d}$.

3 Dual Convergence

For all dual-based online algorithms, the finite-sample convergence rate of dual variables is of great value, since it reveals the best performance that dual-based algorithms can achieve compared to the deterministic optimum. Recall the optimal solution $λ_{T}^{*}$ to the sample average approximation (SAA) in eq. (2.7). The Law of Large Numbers dictates that $λ_{T}^{*}$ converges to $λ^{*}$ in probability as $T \to \infty$. While the asymptotic behavior of optimal solutions to SAA has been intensively studied in the literature (kleywegt2002sample; shapiro2009lectures; kim2015guide), these results are not enough to develop non-asymptotic dual convergence in the case of regularized online convex programming. In this section, we establish dual convergence bounds under local strong convexity, i.e., Assumptions 1-3, for the regularized online problem (2.1). We emphasize that our assumptions hold uniformly for all $d^{'} \in Ω_{d}$. Consequently, the dual convergence results derived in this section also hold for all $d^{'} \in Ω_{d}$.

Define $D_{t} (λ, d) := f_{t}^{*} (b_{t}^{⊤} (μ + λ)) + r^{*} (- μ) + d^{⊤} λ$ , and the corresponding gradient

ϕ_{t} (λ, d) := \nabla_{λ} D_{t} (λ, d) = \begin{bmatrix} b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (μ + λ)) + d \\ b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (μ + λ)) - \nabla r^{*} (- μ) \end{bmatrix} .

Then we have $\nabla D (λ, d) := \nabla_{λ} D (λ, d) = E ϕ_{t} (λ, d)$. Denote $\bar{ϕ}_{T} (λ, d) := T^{- 1} \sum_{t = 1}^{T} ϕ_{t} (λ, d)$. One crucial idea for bounding the dual convergence is that the confined growth rate of $D (λ, d)$ implies that, with high probability, its sample version $\bar{D}_{T} (λ, d)$ is lower bounded by a quadratic function (li2021online). The confined growth rate of $D (λ, d)$ is guaranteed by the following proposition.

Proposition 1.

Under Assumptions 1-3, the objective function $D (λ, d)$ in stochastic program (2.8) satisfies the following growth condition:

\underline{L}_{D} {∥ λ - λ^{*} ∥}_{2}^{2} \leq D (λ, d) - D (λ^{*}, d) - \nabla D (λ^{*}, d)^{⊤} (λ - λ^{*}) \leq \bar{L}_{D} {∥ λ - λ^{*} ∥}_{2}^{2},

(3.1)

where the constants are $\underline{L}_{D} := \frac{\underline{L}_{r}}{4} \land \frac{1}{2} \frac{\underline{L}_{r}}{\underline{L}_{r} + 2 \underline{L}_{f} σ_{min}}$ and $\bar{L}_{D} := \bar{b}^{2} \bar{L}_{f} + \bar{L}_{r} / 2$.

Using Proposition 1, we now derive an upper bound for the dual convergence ${∥ λ_{T}^{*} - λ^{*} ∥}_{2}^{2}$ by capturing the shape of the dual objective $\bar{D}_{T} (λ, d)$. While this idea is typical in the literature (li2021online), we seek a more delicate analysis, which enables us to reach a sharper result. Basically, we focus on the local behavior of $\bar{D}_{T} (λ, d)$ around $λ^{*}$. The rationale is as follows. Since $\bar{D}_{T} (λ, d)$ is always convex and converges to a deterministic convex function, the shape of $\bar{D}_{T} (λ, d)$ in a small neighborhood of $λ^{*}$ will mimic that of $D (λ, d)$ as long as $T$ is large enough. Consequently, its optimal solution $λ_{T}^{*}$ will lie in a small neighborhood of $λ^{*}$.

Consider the first order and second order term of ${¯ D}_{T} (λ, d)$ separately. Decompose the convex function ${¯ D}_{T} (λ, d) - {¯ D}_{T} (λ^{*}, d)$ into two parts:

\bar{D}_{T} (λ, d) - \bar{D}_{T} (λ^{*}, d) = \underbrace{⟨ \bar{ϕ}_{T} (λ^{*}, d), λ - λ^{*} ⟩}_{\text{first order term}} + \underbrace{\bar{D}_{T} (λ, d) - \bar{D}_{T} (λ^{*}, d) - ⟨ \bar{ϕ}_{T} (λ^{*}, d), λ - λ^{*} ⟩}_{\text{second order term}} .

(3.2)

It suffices to show that, with high probability, the first order term is lower bounded by a linear function and the second order term is lower bounded by a quadratic function, within a small neighborhood of $λ^{*}$ . For the first order term, we need the concentration of gradients.

Lemma 2.

Under Assumptions 1-3, the concentration of the gradient in the first order term ${¯ ϕ}_{T} (λ^{*}, d)$ satisfies

P ({∥ \bar{ϕ}_{T} (λ^{*}, d) - \nabla D (λ^{*}, d) ∥}_{2} > ε) \leq 4 m \exp (- \frac{T ε^{2}}{4 m c_{1}}),

(3.3)

for any $ε > 0$, where the constant $c_{1} := \sqrt{n} \bar{b} D + \bar{d} \lor G$.

By Lemma 2, we conclude that, with high probability, the first order term in (3.2) is lower bounded by . For the second order term, we focus on a small neighborhood of $λ^{*}$ . For a constant $H > 0$ (to be clarified soon), define $Ω_{λ} (ε)$ as

(3.4)

Actually, it suffices to control the second order term for all dual variables in $Ω_{λ} (ε)$, since we shall show that $λ_{T}^{*}$ belongs to $Ω_{λ} (ε)$ with a high probability depending on the value of $ε$. In order to control the shape of the second order term for all dual variables in $Ω_{λ} (ε)$, we systematically split the region $Ω_{λ} (ε)$ to derive a uniform concentration of the second order term. The motivation for choosing this small neighborhood rather than a fixed region is that, for a convex function, the local behavior near the deterministic optimal solution is enough to guarantee the global properties of empirical optimal solutions. Consequently, for this small neighborhood, the size of its covering under our splitting scheme can be bounded by a constant. The benefit of this local analysis is that we can eliminate the $O (\log \log T)$ factor and achieve a sharper dual convergence bound. The uniform concentration of the second order term in (3.2), together with Lemma 2, enables us to derive the following result.

Proposition 2.

Under Assumptions 1-3, given any $ε > 0$ , the dual problem satisfies that for $\forall λ \in Ω_{λ} (ε)$ and ${∥ λ - λ^{*} ∥}_{2}^{*} > 2 H ε$ , there exists a corresponding $λ^{'} \in Ω_{λ} (ε)$ such that and

\bar{D}_{T} (λ, d) - \bar{D}_{T} (λ^{*}, d) \geq \frac{\underline{L}_{D}}{2} {∥ λ^{'} - λ^{*} ∥}_{2}^{2} - \underline{L}_{D} H ε {∥ λ^{'} - λ^{*} ∥}_{2}

with probability at least $1 - 4 m exp (- \frac{T ε^{2}}{4 m c_{1}}) - 2 (2 ⌈ {log}_{q} (\frac{1}{2 \sqrt{2 m}}) ⌉)^{2 m} exp (- \frac{T ε^{2}}{2})$ , where

\begin{aligned} H &:= \Big(1 + 2 \sqrt{2} (\sqrt{n} \bar{b} D + \sqrt{m} G) \Big(1 + \frac{\sqrt{m} (1 - q)}{q}\Big)\Big) / \underline{L}_{D}, \quad \text{and} \\ q &:= \frac{\sqrt{m}}{\sqrt{m} + 1 \land \frac{\underline{L}_{D}}{8 \sqrt{L_{2}} \bar{b}^{2} \lor \bar{L}_{r}}} \lor \frac{1}{2} . \end{aligned}

The detailed proof is deferred to Appendix A.4. Proposition 2 delivers a strong message: under the event where the inequality holds, the dual optimal solution $λ_{T}^{*}$ must be close to $λ^{*}$ in the sense that ${∥ λ_{T}^{*} - λ^{*} ∥}_{2} \leq 2 H ε$. Otherwise:

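If $2 H ε < {∥ λ_{T}^{*} - λ^{*} ∥}_{2} \leq 4 H ε$, then there is a ${λ_{T}^{*}}^{'}$ such that ${∥ {λ_{T}^{*}}^{'} - λ^{*} ∥}_{2} \geq {∥ λ_{T}^{*} - λ^{*} ∥}_{2} > 2 H ε$, and

$\bar{D}_{T} (λ_{T}^{*}, d) - \bar{D}_{T} (λ^{*}, d) \geq \frac{\underline{L}_{D}}{2} {∥ {λ_{T}^{*}}^{'} - λ^{*} ∥}_{2}^{2} - \underline{L}_{D} H ε {∥ {λ_{T}^{*}}^{'} - λ^{*} ∥}_{2} > 0,$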

which contradicts the optimality of $λ_{T}^{*}$ .
If ${∥ λ_{T}^{*} - λ^{*} ∥}_{2} > 4 H ε$, since $\bar{D}_{T} (λ^{*}, d) - \bar{D}_{T} (λ^{*}, d) = 0$ and $\bar{D}_{T} (λ_{T}^{*}, d) - \bar{D}_{T} (λ^{*}, d) \leq 0$, by the convexity of $\bar{D}_{T}$ we have $\bar{D}_{T} (\tilde{λ}, d) - \bar{D}_{T} (λ^{*}, d) \leq 0$ for any $\tilde{λ} = λ^{*} + α (λ_{T}^{*} - λ^{*})$ with $0 \leq α \leq 1$. Then we can always find an $α$ such that $2 H ε < {∥ \tilde{λ} - λ^{*} ∥}_{2} \leq 4 H ε$ and $\bar{D}_{T} (\tilde{λ}, d) - \bar{D}_{T} (λ^{*}, d) \leq 0$. However, according to Proposition 2, we have

$\bar{D}_{T} (\tilde{λ}, d) - \bar{D}_{T} (λ^{*}, d) \geq \frac{\underline{L}_{D}}{2} {∥ \tilde{λ}^{'} - λ^{*} ∥}_{2}^{2} - \underline{L}_{D} H ε {∥ \tilde{λ}^{'} - λ^{*} ∥}_{2} > 0,$

which also ends up with a contradiction.

Thus, under the event in Proposition 2, we get ${∥ λ_{T}^{*} - λ^{*} ∥}_{2} \leq 2 H ε$. Since $2 H ε$ is smaller than the radius of $Ω_{λ} (ε)$, we can safely conclude that the optimal solution $λ_{T}^{*}$ lies in the small neighborhood $Ω_{λ} (ε)$. Consequently, we derive the following $O (T^{- 1})$ bound for dual convergence.

Theorem 1 (Dual convergence).

Under Assumptions 1-3, the dual optimal solution $λ_{T}^{*}$ satisfies

E {∥ λ_{T}^{*} - λ^{*} ∥}_{2}^{2} \leq C_{1} \cdot \frac{1}{T},

(3.5)

where

C_{1} := 4 H^{2} (16 m^{2} c_{1} + 4 (2 ⌈ {log}_{q} (\frac{1}{2 \sqrt{2 m}}) ⌉)^{2 m})

Proof.

By the tail expectation formula, for constant $H > 0$ , we have

E {∥ λ_{T}^{*} - λ^{*} ∥}_{2}^{2} = 4 H^{2} \int_{0}^{\infty} P ({∥ λ_{T}^{*} - λ^{*} ∥}_{2}^{2} > 4 H^{2} z) d z .

According to the probabilistic bound in Proposition 2, for any $z > 0$ ,

P ({∥ λ_{T}^{*} - λ^{*} ∥}_{2}^{2} > 4 H^{2} z) \leq 4 m \exp (- \frac{T z}{4 m c_{1}}) + 2 (2 ⌈ \log_{q} (\frac{1}{2 \sqrt{2 m}}) ⌉)^{2 m} \exp (- \frac{T z}{2}) .

Then, calculating the integral, we get

\begin{aligned} E {∥ λ_{T}^{*} - λ^{*} ∥}_{2}^{2} &= 4 H^{2} \int_{0}^{\infty} P ({∥ λ_{T}^{*} - λ^{*} ∥}_{2}^{2} \geq 4 H^{2} z) d z \\ &\leq 4 H^{2} \int_{0}^{\infty} [4 m \exp (- \frac{T z}{4 m c_{1}}) + 2 (2 ⌈ \log_{q} (\frac{1}{2 \sqrt{2 m}}) ⌉)^{2 m} \exp (- \frac{T z}{2})] d z \\ &= \frac{C_{1}}{T} . \end{aligned}

∎
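As a check on the constant $C_{1}$, the two exponential tails in the proof can be integrated term by term. The shorthand $N$ below is introduced only for this worked step and is not notation from the paper.

```latex
\begin{aligned}
N &:= \Big(2 \Big\lceil \log_{q} \Big(\tfrac{1}{2\sqrt{2m}}\Big) \Big\rceil\Big)^{2m}, \\
\int_{0}^{\infty} 4m \exp\Big(-\frac{Tz}{4mc_{1}}\Big)\,dz &= 4m \cdot \frac{4mc_{1}}{T} = \frac{16m^{2}c_{1}}{T}, \\
\int_{0}^{\infty} 2N \exp\Big(-\frac{Tz}{2}\Big)\,dz &= 2N \cdot \frac{2}{T} = \frac{4N}{T}, \\
4H^{2}\Big(\frac{16m^{2}c_{1}}{T} + \frac{4N}{T}\Big) &= \frac{4H^{2}\,(16m^{2}c_{1} + 4N)}{T} = \frac{C_{1}}{T}.
\end{aligned}
```

The factor $4 H^{2}$ in front of the integral is what produces the leading $4 H^{2}$ in the definition of $C_{1}$.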

Remark 1.

Our dual convergence bound is sharper than that in li2021online. Under our assumption, the $O (T^{- 1})$ rate is unimprovable because we can find a distribution $P \in Ξ$ that incurs an $Ω (T^{- 1})$ dual convergence rate. Let us consider a non-regularized case when $x \in [0, 1]$ and $f_{t} (x) := f (x, ξ_{t}) := - (x - 2 ξ_{t})^{2} / 4 + ξ_{t}^{2}$ , with the single constraint $d = 1 / 2$ and cost $b_{t} = 1$ . The dual problem is

D_{t} (λ) = \begin{cases} \frac{1}{2} λ & \text{if } λ > ξ_{t}, \\ - \frac{1}{4} + ξ_{t} - \frac{1}{2} λ & \text{if } λ < ξ_{t} - \frac{1}{2}, \\ λ^{2} - 2 (ξ_{t} - \frac{1}{4}) λ + ξ_{t}^{2} & \text{if } ξ_{t} - \frac{1}{2} \leq λ \leq ξ_{t} . \end{cases}

Let $ξ_{t}$ follow any distribution supported on $[1 / 2, 3 / 4]$ with variance $σ_{ξ}^{2} > 0$. Then, for any $t$, we have $ξ_{t} - 1 / 4 \in [1 / 4, 1 / 2] \subseteq [ξ_{t} - 1 / 2, ξ_{t}]$. Thus, for the sample average $\bar{D}_{T} (λ) := T^{- 1} \sum_{t = 1}^{T} D_{t} (λ)$, when $λ \in [1 / 4, 1 / 2]$, we have $\bar{D}_{T} (λ) = λ^{2} - 2 (\bar{ξ}_{T} - 1 / 4) λ + \overline{ξ^{2}}_{T}$, where $\bar{ξ}_{T} := T^{- 1} \sum_{t = 1}^{T} ξ_{t}$ and $\overline{ξ^{2}}_{T} := T^{- 1} \sum_{t = 1}^{T} ξ_{t}^{2}$, with the optimal solution being $λ_{T}^{*} = \bar{ξ}_{T} - 1 / 4$. We have $E (λ_{T}^{*} - λ^{*})^{2} \geq Var (\bar{ξ}_{T}) = σ_{ξ}^{2} / T$. This shows that our $O (T^{- 1})$ dual convergence rate is indeed optimal.
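The example in this remark can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the paper: it takes $ξ_{t}$ uniform on $[1/2, 3/4]$ (one admissible choice), compares the closed-form dual with a brute-force maximization over a grid of $x \in [0, 1]$, and estimates $T \cdot E (λ_{T}^{*} - λ^{*})^{2}$ by Monte Carlo. All function names are hypothetical.

```python
import random

D_BAR = 0.5  # average budget d = 1/2 from the remark

def dual_piecewise(lam, xi):
    """Closed-form D_t(lambda) from the remark (cost b_t = 1, d = 1/2)."""
    if lam > xi:
        return 0.5 * lam
    if lam < xi - 0.5:
        return -0.25 + xi - 0.5 * lam
    return lam * lam - 2.0 * (xi - 0.25) * lam + xi * xi

def dual_bruteforce(lam, xi, grid=2000):
    """max over a grid of x in [0, 1] of f(x) - lam * x, plus d * lam."""
    best = max(
        -((x - 2.0 * xi) ** 2) / 4.0 + xi * xi - lam * x
        for x in (i / grid for i in range(grid + 1))
    )
    return best + D_BAR * lam

def scaled_dual_error(T, reps, rng):
    """Monte Carlo estimate of T * E(lambda_T^* - lambda^*)^2 for xi ~ U[1/2, 3/4]."""
    lam_star = 0.625 - 0.25  # E[xi] - 1/4
    total = 0.0
    for _ in range(reps):
        xi_bar = sum(rng.uniform(0.5, 0.75) for _ in range(T)) / T
        total += (xi_bar - 0.25 - lam_star) ** 2
    return T * total / reps

rng = random.Random(0)
estimate = scaled_dual_error(T=200, reps=2000, rng=rng)
sigma2 = (0.25 ** 2) / 12.0  # variance of U[1/2, 3/4]
```

Under these assumptions `estimate` should be close to `sigma2`, in line with the $σ_{ξ}^{2} / T$ rate stated above.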

Note that our Proposition 2 holds uniformly for all $d^{'} \in Ω_{d}$ . Denote the optimal solutions to problem (2.7) and (2.8), given a certain $d^{'}$ , by $λ_{T}^{*} (d^{'})$ and $λ^{*} (d^{'})$ , respectively. Then, we actually have

E \sup_{d^{'} \in Ω_{d}} {∥ λ_{T}^{*} (d^{'}) - λ^{*} (d^{'}) ∥}_{2}^{2} \leq C_{1} \cdot \frac{1}{T} .

(3.6)

Bound (3.6) plays a critical role in our regret analysis, since the re-solving strategy of our adaptive algorithm framework needs to frequently update the resource constraints.

We then discuss $ϵ$ -optimal solutions of dual problem (2.7). Our following finite-sample convergence result of $ϵ$ -optimal solution can be viewed as a non-parametric version of SAA convergence developed by large deviation theory (ruszczynski2003stochastic). Notably, we only make assumptions on the deterministic problem $D (λ, d)$ , and our result does not rely on restricted tail conditions such as the moment generating function in ruszczynski2003stochastic; shapiro2009lectures. Therefore our result allows more flexible distributions.

Theorem 2 (Convergence of dual approximate solution).

Under Assumptions 1-3, suppose $λ_{T}^{ϵ}$ is an $ϵ$ -optimal solution that satisfies ${¯ D}_{T} (λ_{T}^{ϵ}, d) - {¯ D}_{T} (λ_{T}^{*}, d) \leq ϵ$ . Then we have the following convergence of $ϵ$ -optimal solution:

E {∥ λ_{T}^{ϵ} - λ^{*} ∥}_{2}^{2} \leq \frac{C_{1}}{T} + \frac{4 ϵ}{\underline{L}_{D}} .

Proof.

Recall that, by Proposition 2, the convex function $\bar{D}_{T}$ is larger than a quadratic function in a neighborhood of $λ^{*}$ with the high probability claimed there. Then, for any $ε$ satisfying $ϵ < 4 H^{2} ε^{2} \underline{L}_{D}$, with the same high probability, the $ϵ$-optimal solution must belong to $Ω_{λ} (ε)$, because, for all the points on the border ${∥ λ - λ^{*} ∥}_{2} = 4 H ε$, we already have $\bar{D}_{T} (λ, d) - \bar{D}_{T} (λ^{*}, d) \geq 4 H^{2} ε^{2} \underline{L}_{D}$. Then, with the same high probability, it follows that

which suggests that .

Still, applying the tail expectation formula, we get

E {∥ λ_{T}^{ϵ} - λ^{*} ∥}_{2}^{2} \leq \frac{4 ϵ}{\underline{L}_{D}} + 4 H^{2} \int_{\frac{ϵ}{H^{2} \underline{L}_{D}}}^{\infty} P ({∥ λ_{T}^{ϵ} - λ^{*} ∥}_{2} \geq 2 H \sqrt{z}) d z .

Let $2 H \sqrt{z} = H ε + \sqrt{H^{2} ε^{2} + \frac{2 ϵ}{\underline{L}_{D}}}$. When $z > \frac{ϵ}{H^{2} \underline{L}_{D}}$, we have $ϵ < 4 H^{2} ε^{2} \underline{L}_{D}$, thus $P ({∥ λ_{T}^{ϵ} - λ^{*} ∥}_{2} \geq 2 H \sqrt{z})$ can be bounded by $4 m \exp (- \frac{T ε^{2}}{4 m c_{1}}) + 2 (2 ⌈ \log_{q} (\frac{1}{2 \sqrt{2 m}}) ⌉)^{2 m} \exp (- \frac{T ε^{2}}{2})$. Also, when $2 H \sqrt{z} = H ε + \sqrt{H^{2} ε^{2} + \frac{2 ϵ}{\underline{L}_{D}}}$, we have $ε^{2} \geq z - \frac{ϵ}{H^{2} \underline{L}_{D}}$. Integrating over $z$, we get the second part of the bound. ∎

Theorem 2 explains how the approximation of dual solutions affects the dual convergence. The accuracy remains valid as we directly optimize the deterministic dual function $D (λ, d)$ . Moreover, this theorem reveals that even if the empirical dual function ${¯ D}_{T} (λ, d)$ is not strongly convex or smooth, the dual convergence of approximate solution also holds as long as we choose an appropriate accuracy. We can further show that this property is preserved with a slightly different accuracy if we run stochastic optimization algorithms on ${¯ D}_{T}$ . We describe the convergence of stochastic approximate solution in the following corollary:

Corollary 1 (Convergence of stochastic dual approximate solution).

Under Assumptions 1-3, suppose $λ_{T}^{ϵ}$ is a stochastic $ϵ$ -optimal solution generated by stochastic optimization algorithm $B$ that satisfies

E_{B} [\bar{D}_{T} (λ_{T}^{ϵ}, d) - \bar{D}_{T} (λ_{T}^{*}, d) \mid \bar{D}_{T}] \leq ϵ

for any given ${¯ D}_{T}$ . Then we have the following convergence of the stochastic $ϵ$ -optimal solution:

where the expectation is taken with respect to $B$ and $P$ .

Corollary 1 points out that the impact of stochastic optimization on the dual convergence is limited, and that the order of dual convergence can still be controlled by $ϵ$. Compared to Theorem 2, the smaller order $ϵ^{\frac{2}{3}}$ can be viewed as the accuracy loss due to randomness. Even if we do not assume $\bar{D}_{T}$ to be strongly convex, the difference between the stochastic solutions and the deterministic one, $E {∥ λ_{T}^{ϵ} - λ_{T}^{*} ∥}_{2}^{2}$, is still under control, just as if we were optimizing a strongly convex function. This inspires us to apply stochastic approximate solutions in the re-solving heuristic because, in many contexts, the benefits of stochastic algorithms greatly outweigh the lower order of convergence $ϵ^{\frac{2}{3}}$. With the theory of dual convergence in hand, we are ready to describe our dual-based algorithm framework for online allocation.

4 Algorithm Framework

Our algorithm extends the linear adaptive re-solving strategy in li2021online to convex objective functions. The key idea is similar in spirit to the frequent re-solving strategy in network revenue management (e.g., jasin2012re; bumpensanti2020re). We keep re-solving dual problems with the updated average remaining capacity, inspired by the budget-ratio policy (arlotto2019uniformly). As in the re-solving strategy for network revenue management, we need to keep updating the constraints and re-solving the associated optimization programs. The difference is that our strategy is dual-based, and the size of our optimization problems grows with time. Fortunately, the optimization in our algorithm can be easier, as we only need approximate solutions. The resource control in our algorithm is handled more carefully than in simple dual mirror descent. We show that non-adaptive policies are too greedy and cannot keep the remaining budget balanced in the long run. It is noteworthy that the idea of the budget-ratio policy (arlotto2019uniformly), featuring the average remaining capacity update, has been implicitly conceived in the frequent re-solving heuristic (jasin2012re; wu2015algorithms). If we rescale the variables in the frequent re-solving heuristic in jasin2012re by the remaining time, we get a constraint update strategy very similar to that in bumpensanti2020re.

Our dual-based online allocation algorithm is in line with other dual-based online algorithms in spirit: we keep maintaining a dual variable $λ_{t}$ and every time when a request comes, we instantly give a response based on the dual variable and the request just received. We choose the primal action $x_{t}$ , given the dual variable $λ$ , by:

\tilde{x}_{t} (λ) := \arg\max_{x \in X} \{f_{t} (x) - (λ + μ)^{⊤} b_{t} x\} = - \nabla f_{t}^{*} (b_{t}^{⊤} (λ + μ)),

and the primal variable $a$ is set by

\tilde{a} (μ) := \arg\max_{a \in Z} \{r (a) + μ^{⊤} a\} = - \nabla r^{*} (- μ) .

Note that the primal solution $a$ may not explicitly affect our action $x_{t}$ , but it is helpful for our theoretical analysis of dual-based policies and for algorithm implementation.

We outline our dual-based and history-dependent algorithm framework in Algorithm 1. The algorithm updates dual variables by solving a $t$-sample SAA as shown in equation (4.1). Each $λ_{t}$ is an $ϵ_{t}$-optimal solution of the $t$-sample SAA with adaptive resource constraints $d_{t}$. We emphasize that two ingredients in our algorithm framework are crucial to guarantee an $O (\log T)$ regret: (1) the adaptive update of the resource constraints $d_{t}$; (2) the careful choice of the accuracy $ϵ_{t}$ for approximate dual solutions. Without the adaptive update of $d_{t}$, the worst-case regret will never be optimal for some extreme cases (see Section 5 for more discussion). The accuracy of the dual solution can be either increasing over time, with tolerance $ϵ_{t} = Θ (t^{- 1})$, or decreasing, with tolerance $ϵ_{t} = Θ ((T - t)^{- 1})$ (or $ϵ_{t} = Θ (t^{- 3 / 2})$ and $ϵ_{t} = Θ ((T - t)^{- 3 / 2})$, respectively, for stochastic optimization algorithms). Approximate solutions significantly reduce the total computational cost. Our algorithm is history-dependent, meaning that we exploit all the information collected up to time $t$. This is the essence of our adaptive strategy. This history-dependent policy makes our algorithm learn more efficiently than other dual-based algorithms that do not learn from history (devanur2019near; balseiro2022best), at the cost of acceptable extra computation. As is common in the literature on dual-based online algorithms, we assume that both the conjugate $f_{t}^{*}$ and the corresponding primal variable $\tilde{x}_{t}$ are easily attainable.

Input: regularizer $r$, iteration number $T$, start point $λ_{0} := 0$, and initial resource $B_{0} := d T$

for all $t = 1, \dots, T$ do

Receive $(f_{t}, b_{t}) \sim P$.

Calculate $\tilde{x}_{t} := \tilde{x}_{t} (λ_{t - 1}) := \arg\max_{x \in X} \{f_{t} (x) - (λ_{t - 1} + μ_{t - 1})^{⊤} b_{t} x\} = - \nabla f_{t}^{*} (b_{t}^{⊤} (λ_{t - 1} + μ_{t - 1}))$.

Select $x_{t} := \tilde{x}_{t}$ if $B_{t - 1} \geq b_{t} \tilde{x}_{t}$, and $x_{t} := 0$ otherwise.

Update remaining resources: $B_{t} := B_{t - 1} - b_{t} x_{t}$.

Update average remaining resources: $d_{t} := \frac{B_{t}}{T - t}$.

Update the dual variable $λ_{t}$ by solving the following dual problem with any approximation algorithm $B_{t}$ to accuracy $ϵ_{t}$:

\min_{λ \in Ω_{λ} \times Ω_{μ}} \{ \bar{D}_{t} (λ, d_{t}) := \frac{1}{t} \sum_{j = 1}^{t} f_{j}^{*} (b_{j}^{⊤} (μ + λ)) + r^{*} (- μ) + d_{t}^{⊤} λ \}

(4.1)

end for

Algorithm 1 History-based resolving algorithm framework
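The loop in Algorithm 1 can be sketched on a scalar toy problem. The following is a minimal illustration under stated assumptions, not the authors' implementation: it uses the non-regularized example $f_{t} (x) = - (x - 2 ξ_{t})^{2} / 4 + ξ_{t}^{2}$ on $x \in [0, 1]$ with $b_{t} = 1$, takes $ξ_{t}$ uniform on $[1/2, 3/4]$, and replaces the generic approximation algorithm by bisection on the dual derivative. The names `primal_response`, `solve_dual`, and `run_resolving` are hypothetical, and the sketch does not enforce $d_{t} \in Ω_{d}$.

```python
import random

def primal_response(lam, xi):
    """x~(lam) = argmax over x in [0, 1] of f(x) - lam * x for the toy reward."""
    return min(1.0, max(0.0, 2.0 * (xi - lam)))

def solve_dual(xis, d_t, tol=1e-6):
    """Approximately minimise the t-sample dual over lam >= 0 by bisection.

    The dual derivative d_t - mean_j x~_j(lam) is nondecreasing in lam.
    """
    def grad(lam):
        return d_t - sum(primal_response(lam, xi) for xi in xis) / len(xis)

    if grad(0.0) >= 0.0:
        return 0.0  # constraint is slack: the minimiser sits at the boundary
    lo, hi = 0.0, max(xis)  # every response is 0 at hi, so grad(hi) = d_t >= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grad(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi

def run_resolving(T, d, rng):
    """Adaptive re-solving on the toy problem; returns (total consumption, budget)."""
    budget = d * T
    remaining, lam, history, consumed = budget, 0.0, [], 0.0
    for t in range(1, T + 1):
        xi = rng.uniform(0.5, 0.75)
        history.append(xi)
        x_tilde = primal_response(lam, xi)
        x_t = x_tilde if remaining >= x_tilde else 0.0  # hard resource constraint
        remaining -= x_t
        consumed += x_t
        if t < T:
            d_t = remaining / (T - t)       # average remaining resource
            lam = solve_dual(history, d_t)  # re-solve the t-sample dual
    return consumed, budget

consumed, budget = run_resolving(T=200, d=0.5, rng=random.Random(1))
```

The dual update is skipped at $t = T$ because $d_{t} = B_{t} / (T - t)$ is undefined there. By construction the total consumption never exceeds the initial budget.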

Our algorithm framework is optimizer-agnostic: we can select any optimizer to obtain the $ϵ_{t}$-optimal solution to the dual program (4.1). Since the dual problem $\bar{D}_{t} (λ, d_{t})$ is generally convex with respect to $λ$, one favourable choice is stochastic gradient descent, which is a first-order method (recall that we assume the gradient of the dual problem, i.e., the primal variable, is easily attainable) and whose computational complexity can be independent of the sample size $t$. This makes it possible to handle large-scale dual optimization when the total running time $T$ is large.

More specifically, if the dual optimizer is selected as stochastic gradient descent with accuracy $ϵ_{t} := c t^{- 3 / 2}$, our framework yields Algorithm 2. Basically, it requires computing $O (t^{3})$ stochastic gradients at time $t$. Moreover, if the dual problem $\bar{D}_{t}$ is further strongly convex or smooth, we can reduce the computational cost to $O (t)$ for each time $t$. See Section 6.1 for more discussion of the case of strongly convex objectives. In Section 5, we demonstrate that any optimization algorithm $B_{t}$ that achieves the rate of dual convergence $E {∥ λ_{t} - λ^{*} (d_{t}) ∥}_{2}^{2} = O (t^{- 1})$ or $O ((T - t)^{- 1})$ suffices to guarantee the optimal logarithmic regret.

Input: regularizer $r$, iteration number $T$, start point $λ_{0} := 0$, and initial resource $B_{0} := d T$.

for all $t = 1, \dots, T$ do

Receive $(f_{t}, b_{t}) \sim P$.

Calculate $\tilde{x}_{t} := \tilde{x}_{t} (λ_{t - 1}) := \arg\max_{x \in X} \{f_{t} (x) - (λ_{t - 1} + μ_{t - 1})^{⊤} b_{t} x\} = - \nabla f_{t}^{*} (b_{t}^{⊤} (λ_{t - 1} + μ_{t - 1}))$.

Select $x_{t} := \tilde{x}_{t}$ if $B_{t - 1} \geq b_{t} \tilde{x}_{t}$, and $x_{t} := 0$ otherwise.

Update remaining resources: $B_{t} := B_{t - 1} - b_{t} x_{t}$.

Update average remaining resources: $d_{t} := \frac{B_{t}}{T - t}$.

Set $R := \sqrt{m (2 \frac{\bar{f} + \bar{r}}{\underline{d}} + G)}$, $L := \sqrt{m \bar{d}^{2} + 2 n \bar{b}^{2} D^{2} + n G^{2}}$, $K := t^{3}$, and $η_{t} := \frac{\sqrt{2} R}{L \sqrt{K}}$. Define $λ_{t}^{0} := λ_{t - 1}$.

for all $k = 1, \dots, K$ do

Randomly pick $ζ$ from $[t] := \{1, \dots, t\}$ with uniform distribution.

Calculate the stochastic gradient $\nabla D_{ζ} (λ_{t}^{k - 1}) := \begin{bmatrix} - b_{ζ} \tilde{x}_{ζ} (λ_{t}^{k - 1}) + d_{t} \\ - b_{ζ} \tilde{x}_{ζ} (λ_{t}^{k - 1}) + \tilde{a} (λ_{t}^{k - 1}) \end{bmatrix}$ (4.2)

Update dual variable via stochastic gradient descent: $λ_{t}^{k} := \arg\min_{λ \in Ω_{λ} \times Ω_{μ}} \{⟨ λ, \nabla D_{ζ} (λ_{t}^{k - 1}) ⟩ + \frac{1}{2 η_{t}} ∥ λ - λ_{t}^{k - 1} ∥_{2}^{2}\}$ (4.3)

end for

Update dual variable by averaging: $λ_{t} := \frac{1}{K} \sum_{k = 1}^{K} λ_{t}^{k}$.

end for

Algorithm 2 Resolving with Stochastic Gradient Descent
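As a concrete illustration of Algorithm 2, the following sketch runs the re-solving loop on a one-resource toy instance. Everything specific to the toy is an assumption made for illustration and is not part of the algorithm statement: the reward $f_{t} (x) = - x^{2} / 4 + ξ_{t} x$ with $ξ_{t} \in \{1/2, 3/4\}$, $b_{t} = 1$, $d = 1/2$, no regularizer (so the $μ$ block is dropped), the decision set $X = [0, 2]$, the dual domain $[0, 1]$, the simplified step size $1 / \sqrt{K}$ in place of $\sqrt{2} R / (L \sqrt{K})$, and the cap `max_inner` on $K = t^{3}$ that keeps the run short. The function names are hypothetical.

```python
import random


def primal_response(xi, lam, x_max=2.0):
    """Maximizer of -x^2/4 + (xi - lam) * x over [0, x_max] (toy reward)."""
    return min(max(2.0 * (xi - lam), 0.0), x_max)


def resolve_with_sgd(T, d=0.5, lam_max=1.0, max_inner=2000, seed=0):
    """Toy version of the re-solving loop with an SGD inner solver."""
    rng = random.Random(seed)
    lam, budget = 0.0, d * T          # lambda_0 := 0, B_0 := d * T
    history, decisions = [], []
    for t in range(1, T + 1):
        xi = rng.choice([0.5, 0.75])  # receive the t-th arrival
        history.append(xi)
        x_tilde = primal_response(xi, lam)
        x = x_tilde if budget >= x_tilde else 0.0  # reject if infeasible
        budget -= x
        decisions.append(x)
        if t == T:
            break                     # no remaining periods to plan for
        d_t = budget / (T - t)        # updated average remaining resource
        K = min(t ** 3, max_inner)    # K := t^3, capped for the demo (assumption)
        eta = 1.0 / K ** 0.5          # assumed stand-in for sqrt(2) R / (L sqrt(K))
        lam_k, lam_sum = lam, 0.0
        for _ in range(K):
            zeta = rng.choice(history)                  # uniform over past arrivals
            grad = -primal_response(zeta, lam_k) + d_t  # stochastic dual gradient
            lam_k = min(max(lam_k - eta * grad, 0.0), lam_max)  # projected step
            lam_sum += lam_k
        lam = lam_sum / K             # averaged iterate
    return decisions, budget, lam
```

Calling `resolve_with_sgd(T=30)` returns the accepted decisions, the leftover budget and the final dual iterate; the projected step is the closed-form solution of the proximal update (4.3) when the dual domain is an interval.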

5 Regret Analysis

5.1 Regret upper bound

In this section, we apply the dual convergence established in Section 3 to derive an upper bound on the regret. The result holds for our algorithm framework Algorithm 1 with any dual optimizer. Without loss of generality, we focus on stochastic optimizers $B_{t}$ that are independent of future arrivals $\{(f_{j}, b_{j})\}_{j \geq t + 1}$. As long as $B_{t}$ delivers a reasonably accurate dual solution $λ_{t}$ based on the past history $H_{t - 1} = \{f_{j}, b_{j}, x_{j}\}_{j = 1}^{t - 1}$, the new arrival $(f_{t}, b_{t})$ and the updated constraint $d_{t} \in Ω_{d}$, our adaptive framework Algorithm 1 achieves a logarithmic-order regret. Precisely, the accuracy of the dual solutions shall satisfy the following condition.

Condition 1.

(Accuracy of dual solutions). Suppose the updated constraints satisfy $\{d_{j} \mid 1 \leq j \leq t\} \subseteq Ω_{d}$. We say the algorithm $\{B_{t}\}_{t \geq 1}$ satisfies the dual convergence Condition 1 if

E_{B, P} ∥ λ_{t} - λ^{*} (d_{t}) ∥_{2}^{2} \leq C_{2} \frac{1}{t + 1}, or E_{B, P} ∥ λ_{t} - λ^{*} (d_{t}) ∥_{2}^{2} \leq C_{2} (\frac{1}{t + 1} + \frac{1}{T - t})

(5.1)

for some constant $C_{2} > 0$. The expectation is taken with respect to both $\{B_{t}\}_{t \geq 1}$ and $P$.

Recall that the dual convergence established in Section 3 holds uniformly for any $d \in Ω_{d}$. Therefore, any dual optimizer ensuring a dual solution error $ϵ_{t} = Θ (t^{- 1})$ or $ϵ_{t} = Θ ((T - t)^{- 1})$ ($ϵ_{t} = Θ (t^{- 3 / 2})$ or $ϵ_{t} = Θ ((T - t)^{- 3 / 2})$ for stochastic dual optimizers) satisfies Condition 1. If Condition 1 holds, our adaptive framework Algorithm 1 achieves the following optimal regret.

Theorem 3.

Under Assumptions 1-4, if the chosen algorithm $\{B_{t}\}_{t \geq 1}$ satisfies Condition 1, then the regret of Algorithm 1 has the following upper bound:

Regret (A) \leq \mathring{C} \cdot \log T

for some constant $\mathring{C} > 0$ depending on the values in Assumptions 1-4.

Clearly, solving the SAA program (4.1) exactly is a theoretically valid choice for $\{B_{t}\}_{t \geq 1}$, which is the classic re-solving heuristic. However, finding an exact solution can be computationally expensive. Fortunately, by Theorem 3, it suffices to solve the SAA program (4.1) approximately, as long as the accuracy meets condition (5.1). We shall show in Section 5.2 that the rate $O (\log T)$ is optimal.
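A back-of-the-envelope calculation, offered as intuition rather than as part of the proof, indicates why an accuracy of order $1 / t$ in condition (5.1) is the natural threshold for a logarithmic rate: summing the per-period bounds over the horizon produces harmonic sums. The sum below stops at $T - 1$ so that the $1 / (T - t)$ term is well defined.

```latex
\sum_{t=1}^{T-1} C_2 \left( \frac{1}{t+1} + \frac{1}{T-t} \right)
  = C_2 \left( \sum_{s=2}^{T} \frac{1}{s} + \sum_{s=1}^{T-1} \frac{1}{s} \right)
  \le 2 C_2 \left( 1 + \log T \right)
  = O(\log T).
```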

We now briefly sketch the proof of Theorem 3. The proof begins with the decomposition of regret, which shows that regret can be controlled by the cumulative error of dual solutions $λ_{t} - λ^{*} (d_{t})$ and by $E [T - τ]$ for some stopping time $τ$ . Recall, given a certain distribution $P$ , the definition of regret:

Regret (A | P) = R^{*} (P) - R (A | P),

where $R^{*} (P)$ and $R (A | P)$ are defined in (2.3) and (2.2), respectively. To upper bound the regret, we need an upper bound of offline maximum reward. To that end, we define

g (λ) := E [f_{t} (\tilde{x}_{t} (λ)) + r (\tilde{a} (μ)) + (\tilde{a} (μ) - b_{t} \tilde{x}_{t} (λ))^{⊤} μ^{*} + (d - b_{t} \tilde{x}_{t} (λ))^{⊤} λ^{*}] .

Here $g (λ)$ serves as an upper bound for $R^{*} (P)$ , characterized by the following lemma.

Lemma 3.

The offline maximum reward $R^{*} (P)$ satisfies $R^{*} (P) \leq T \cdot g (λ^{*}) .$

Proof.

Recall the Lagrangian of program (2.5). By duality, we have

	$R^{*} (P)$	$:= E_{P} [\max_{x_{t} \in X} \sum_{t = 1}^{T} f_{t} (x_{t}) + T \cdot r (\frac{\sum_{t = 1}^{T} b_{t} x_{t}}{T}), \text{ s.t. } \sum_{t = 1}^{T} b_{t} x_{t} ⪯ d T]$
		$\leq E \sum_{t = 1}^{T} [f_{t} (\tilde{x}_{t} (λ^{*})) + r (\tilde{a} (μ^{*})) + (\tilde{a} (μ^{*}) - b_{t} \tilde{x}_{t} (λ^{*}))^{⊤} μ^{*} + (d - b_{t} \tilde{x}_{t} (λ^{*}))^{⊤} λ^{*}]$
		$= T \cdot g (λ^{*}) .$

∎

Since $f_{t}$ and $r (a)$ have trivial upper bounds, we get $R^{*} (P) \leq T (\bar{f} + \bar{r})$. Thus, for a proper stopping time $τ$, we have

R^{*} (P) \leq E [τ g (λ^{*}) + (T - τ) (\bar{f} + \bar{r})] .

(5.2)

Note that our strategy for dealing with the non-separable regularizer is to introduce an additional primal variable $a$ (i.e., variable splitting). While its actual value does not directly affect our algorithm framework, it is vital for our theoretical investigation. To this end, denote by

a_{t} := \arg\max_{a \in Z} \{r (a) + μ_{t}^{⊤} a\}

the value of $a$ at the $t$-th iteration. By the Fenchel conjugate, we actually have $a_{T} = - \nabla r^{*} (- μ_{T})$. The second impact of variable splitting is an equality constraint between $a_{T}$ and $T^{- 1} \sum_{t = 1}^{T} b_{t} x_{t}$. It turns out that their difference can be measured by the difference between $μ_{T}$ and the following quantity

\hat{μ}_{T} := \arg\max_{μ} \{r^{*} (- μ) - μ^{⊤} \frac{\sum_{t = 1}^{T} b_{t} x_{t}}{T}\} .

(5.3)

The above maximization is taken without constraints, implying that $T^{- 1} \sum_{t = 1}^{T} b_{t} x_{t} = - \nabla r^{*} (- \hat{μ}_{T})$. Due to the property of $\nabla r^{*}$ (Assumption 3), we have

∥ a_{T} - T^{- 1} \sum_{t = 1}^{T} b_{t} x_{t} ∥_{2} \leq \bar{L}_{r} ∥ μ_{T} - \hat{μ}_{T} ∥_{2} .

We are now in a position to state the following regret decomposition for a general stopping time.

Proposition 3.

Under Assumptions 1-3, for a proper stopping time $τ$ ensuring that the resource is not depleted at any time $t \leq τ$, the regret of our dual-based adaptive framework Algorithm 1 admits the following upper bound:

	$Regret (A | P) \leq$	$\underbrace{E [\sum_{t = 1}^{τ} (g (λ^{*}) - g (λ_{t}))]}_{\text{R.1}} + \underbrace{E [2 (\bar{f} + \bar{r} + C_{3}) (T - τ) + ⟨ λ^{*}, \sum_{t = 1}^{τ} (d - b_{t} x_{t}) ⟩]}_{\text{R.2}}$
		$+ \underbrace{E [⟨ μ^{*} - \hat{μ}_{T}, \sum_{t = 1}^{τ} (a_{t} - b_{t} x_{t}) ⟩]}_{\text{R.3}},$		(5.4)

where $C_{3} := \sqrt{m n} G D \bar{b}$.

It remains to bound the three terms in Proposition 3. The key point is to carefully choose a stopping time that (1) avoids early stopping and (2) enforces the total resource constraints. The first term R.1 is contributed by the algorithm before the stopping time, and can be controlled by the cumulative dual error $E \sum_{t = 1}^{τ} ∥ λ_{t - 1} - λ^{*} ∥_{2}^{2}$. The second term R.2 concerns the rewards lost due to resource depletion, and can be controlled by $E (T - τ)$. To achieve an $O (\log T)$ regret, the stopping time shall be designed so that $E (T - τ) = O (\log T)$. The term R.3 is contributed mainly by the variable splitting, and can be controlled jointly by the cumulative dual error and $E (T - τ)$. The three terms capture different sources of regret induced by our adaptive framework Algorithm 1. It therefore suffices to bound $E \sum_{t = 1}^{τ} ∥ λ_{t - 1} - λ^{*} ∥_{2}^{2}$ and $E (T - τ)$, for which a smart design of the stopping time becomes crucial.

Our design of the stopping time is inspired by the budget-ratio stopping time introduced and investigated by arlotto2019uniformly and li2021online for online linear allocation problems. At the core of this design is a strategy that ensures that, as the updated constraint $d_{t}$ varies within a region $D \subset Ω_{d}$, the binding and non-binding dimensions of the problem $D (λ, d_{t})$ remain unchanged. The region $D$ is usually a small neighbourhood of the original budget $d$. The following lemma shows that such a region $D$ exists for our regularized online convex allocation problem. Recall that $λ^{*} (d^{'})$ denotes the optimal dual solution to $D (λ, d^{'})$, and $I_{B}$ and $I_{NB}$ stand for the binding and non-binding dimensions of $D (λ, d)$.

Lemma 4.

Under Assumptions 1-3, there exists a constant $δ_{d} > 0$ such that for any $d^{'} \in Ω_{d}$, if

- δ_{d} \leq d_{i}^{'} - d_{i} \leq δ_{d} if i \in I_{B}, and d_{i}^{'} - d_{i} \geq - δ_{d} if i \in I_{NB},

then the dual problems $D (λ, d^{'})$ and $D (λ, d)$ share the same binding and non-binding dimensions.

For technical convenience, we assume that, for each non-binding dimension $i \in I_{NB}$, the updated constraint $d_{i t}$ never exceeds the threshold $\bar{d}$ (the uniform bound defined in Assumption 1) at any iteration. This is a mild assumption both in theory and in practice. Indeed, if $d_{i t}$ is larger than $\bar{d}$, the constraint $d_{i t}$ is so loose that its impact on the optimization problem is negligible. In this case, such a constraint can essentially be discarded.

With Lemma 4, we define the required region where binding and non-binding dimensions remain unchanged during iterations by

D := \{d^{'} \in Ω_{d} \mid - δ_{d} \leq d_{i}^{'} - d_{i} \leq δ_{d} if i \in I_{B}, and d_{i}^{'} - d_{i} \geq - δ_{d} if i \in I_{NB}\} .
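The membership test for this region is simple to state in code. The sketch below is only an illustration: the function name `in_region` and the boolean mask `binding` (marking the indices in $I_{B}$) are hypothetical, and the additional requirement $d^{'} \in Ω_{d}$ is not checked.

```python
def in_region(d_new, d, binding, delta):
    """Check the two-sided band on binding dimensions and the
    one-sided lower band on non-binding dimensions."""
    for d_new_i, d_i, is_binding in zip(d_new, d, binding):
        diff = d_new_i - d_i
        if diff < -delta:                 # lower bound applies to every dimension
            return False
        if is_binding and diff > delta:   # upper bound only on binding dimensions
            return False
    return True
```

For instance, with `d = [0.5, 0.5]`, `binding = [True, False]` and `delta = 0.1`, the point `[0.5, 0.9]` lies in the region because the second dimension is non-binding and may grow freely, whereas `[0.7, 0.5]` does not.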

We thereby design the following stopping time.

(5.5)

This stopping time also guarantees that resource depletion will not happen before $τ$. We show that $τ$ rules out early stopping so that $E [T - τ] = O (\log T)$. The following lemmas bound the cumulative dual error and $E [T - τ]$.

Lemma 5.

Under Assumptions 1-4, Algorithm 1 with a selected dual optimizer $\{B_{t}\}_{t \geq 1}$ satisfying Condition 1 achieves

E [\sum_{t = 1}^{τ} ∥ λ_{t - 1} - λ^{*} ∥_{2}^{2}] \leq O (\log T)

(5.6)

Lemma 6.

Under Assumptions 1-4, the stopping time (5.5) of Algorithm 1 with a selected dual optimizer $\{B_{t}\}_{t \geq 1}$ satisfying Condition 1 has

E (T - τ) \leq O (\log T)

(5.7)

These two lemmas play a key role in our regret analysis. They are proved by investigating the dynamic behavior of the constraints $d_{i t}$ for binding and non-binding dimensions separately. For binding dimensions, we investigate the recurrence relation of $d_{i t}$ by leveraging the binding relations. For non-binding dimensions, we exploit the $δ_{d}$ gap between $d_{i t}$ and the average resource consumption. Equipped with Lemmas 5 and 6, we now continue sketching the proof of Theorem 3. By Proposition 3, it suffices to bound the three terms there.

Proof of Theorem 3.

The proof continues from Proposition 3.

Step 1: bounding R.1. Using the Fenchel conjugate, we rewrite the bridging function $g (λ)$ as

	$g (λ) =$	$E [f_{t} (\tilde{x}_{t} (λ)) + r (\tilde{a} (μ)) + (\tilde{a} (μ) - b_{t} \tilde{x}_{t} (λ))^{⊤} μ^{*} + (d - b_{t} \tilde{x}_{t} (λ))^{⊤} λ^{*}]$
	$=$	$E [f_{t}^{*} (b_{t}^{⊤} (λ + μ)) + r^{*} (- μ)] + d^{⊤} λ + E (μ^{*} - μ)^{⊤} (\tilde{a} (μ) - b_{t} \tilde{x}_{t} (λ)) + E (λ^{*} - λ)^{⊤} (d - b_{t} \tilde{x}_{t} (λ))$
	$=$	$E [f_{t}^{*} (b_{t}^{⊤} (λ + μ)) + r^{*} (- μ)] + d^{⊤} λ - E [\nabla f_{t}^{*} (b_{t}^{⊤} (λ + μ))^{⊤} b_{t}^{⊤} (λ + μ - λ^{*} - μ^{*})$
		$- \nabla r^{*} (- μ)^{⊤} (μ - μ^{*}) + d^{⊤} (λ - λ^{*})]$
	$=$	$D (λ, d) - \nabla D (λ, d)^{⊤} (λ - λ^{*}) .$

By Assumptions 2 and 3, we get

	$g (λ^{*}) - g (λ) =$	$D (λ^{*}, d) - D (λ, d) - ⟨ \nabla D (λ^{*}, d), λ^{*} - λ ⟩$
		$+ ⟨ \nabla D (λ, d) - \nabla D (λ^{*}, d), λ - λ^{*} ⟩ \leq (2 \bar{L}_{D} - \underline{L}_{D}) ∥ λ - λ^{*} ∥_{2}^{2} .$

Then Lemma 5 gives rise to the following bound.

E [\sum_{t = 1}^{τ} (g (λ^{*}) - g (λ_{t - 1}))] \leq O (\log T) .

Step 2: bounding R.2. This term can be controlled using the definition of the stopping time and Lemma 6.

	$E$	$[2 (\bar{f} + \bar{r} + C_{3}) (T - τ) + ⟨ λ^{*}, \sum_{t = 1}^{τ} (d - b_{t} x_{t}) ⟩]$
		$= E [2 (\bar{f} + \bar{r} + C_{3}) (T - τ) + ⟨ λ^{*}, d_{τ} (T - τ) - d (T - τ) ⟩]$
		$\leq E [2 (\bar{f} + \bar{r} + C_{3}) (T - τ) + \sum_{i \in I_{B}} λ_{i}^{*} (d_{i} + δ_{d}) (T - τ)]$
		$\leq (2 \bar{f} + 2 \bar{r} + 2 C_{3} + (∥ d ∥ + \sqrt{m} δ_{d}) \frac{2 (\bar{f} + \bar{r})}{\underline{d}}) E (T - τ) = O (\log T) .$
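The first equality in Step 2 is a bookkeeping identity: since $B_{τ} = d T - \sum_{t \leq τ} b_{t} x_{t}$ and $d_{τ} = B_{τ} / (T - τ)$, we have $\sum_{t \leq τ} (d - b_{t} x_{t}) = τ d - d T + B_{τ} = (d_{τ} - d) (T - τ)$. The sketch below checks this numerically for a scalar resource; the uniformly drawn consumptions and the function name are assumptions made purely for illustration.

```python
import random


def budget_identity_gap(T=20, tau=12, d=0.5, seed=0):
    """Return |sum_{t<=tau}(d - b_t x_t) - (d_tau - d)(T - tau)| for scalar data."""
    rng = random.Random(seed)
    budget = d * T                      # B_0 := d * T
    lhs = 0.0
    for _ in range(tau):
        use = min(rng.uniform(0.0, 1.0), budget)  # consumption b_t x_t, never overdrawn
        budget -= use
        lhs += d - use
    d_tau = budget / (T - tau)          # d_tau := B_tau / (T - tau), requires tau < T
    rhs = (d_tau - d) * (T - tau)
    return abs(lhs - rhs)
```

The returned gap is zero up to floating-point rounding for any consumption path with $τ < T$.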

Step 3: bounding R.3. This term requires the most effort. It concerns the combined effects of variable splitting and complementary slackness. The following lemma is important for bounding this term.

Lemma 7.

Under Assumptions 1-4, Algorithm 1 with a selected dual optimizer $\{B_{t}\}_{t \geq 1}$ satisfying Condition 1 and stopping time (5.5) ensures

E ∥ \hat{μ}_{T} - μ^{*} ∥_{2}^{2} \leq O (\frac{\log T}{T}), and E ∥ \sum_{t = 1}^{τ} (a_{t} - b_{t} x_{t}) ∥_{2}^{2} \leq O (T \log T) .

The proof of Lemma 7 exploits the local smoothness of $r$ and $\tilde{x}_{t}$ together with the optimality of $μ^{*}$, i.e., $\tilde{a} (μ^{*}) = E [b_{t} \tilde{x}_{t} (λ^{*})]$. By the Cauchy–Schwarz inequality, we get

E [⟨ μ^{*} - \hat{μ}_{T}, \sum_{t = 1}^{τ} (a_{t} - b_{t} x_{t}) ⟩] \leq (E ∥ \hat{μ}_{T} - μ^{*} ∥_{2}^{2} \cdot E ∥ \sum_{t = 1}^{τ} (a_{t} - b_{t} x_{t}) ∥_{2}^{2})^{1 / 2} .

(5.8)

Thus R.3 is of order $O (\log T)$, which concludes the proof.

∎
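For completeness, the arithmetic behind the last step is spelled out here: plugging the two rates of Lemma 7 into the right-hand side of (5.8) gives

```latex
\left( O\!\left(\frac{\log T}{T}\right) \cdot O\!\left(T \log T\right) \right)^{1/2}
  = \left( O\!\left(\log^{2} T\right) \right)^{1/2}
  = O(\log T).
```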

5.2 Lower bound and algorithms without constraint update

bray2019does and li2021online have established logarithmic regret lower bounds for online multi-secretary problems and online linear programming, respectively. To show the optimality of Theorem 3, we establish a matching lower bound in this section. We note that there always exists a regularizer that makes our regularized online allocation problem more challenging than the non-regularized one. For example, suppose that $f_{t} (x)$ and $r$ are both monotonically increasing, and let $\{x_{t}^{'}\}_{t = 1}^{T}$ be the hindsight optimal strategy that solves $\max_{x_{t} \in X} \{\sum_{t = 1}^{T} f_{t} (x_{t}) \text{ s.t. } \sum_{t = 1}^{T} b_{t} x_{t} \leq d T\}$. Then it holds that $\sum_{t = 1}^{T} b_{t} x_{t}^{'} = d T$ and thus $r (T^{- 1} \sum_{t = 1}^{T} b_{t} x_{t}^{'}) \geq r (T^{- 1} \cdot \sum_{t = 1}^{T} b_{t} x_{t})$ for any other $\{x_{t}\}_{t = 1}^{T}$. This renders the regret lower bound of the regularized problem larger than that of the non-regularized one. Therefore, for the regret lower bound, we focus only on non-regularized problems.

Theorem 4 (Regret lower bound).

For any dual-based algorithm $A$ , we have the worst-case regret lower bound:

Regret (A) \geq Ω (log T) .

Theorem 4 justifies the optimality of our algorithm in terms of worst-case regret. The logarithmic regret also matches that of classic unrestricted online convex optimization (hazan2007logarithmic). Nevertheless, one may wonder how important the adaptive constraint update is in our adaptive framework Algorithm 1, and whether an optimal regret can be achieved without it. Here we give a partial negative answer for two specific but renowned algorithms. For concreteness, we investigate two similar algorithms without constraint update (Algorithms 3 and 4) that have been discussed in the literature: online dual gradient (mirror) descent (balseiro2021regularized; balseiro2022best) and dual SAA (li2021online).

Input: regularizer $r$, iteration number $T$, step size $η_{t} := Θ (\frac{1}{t})$ for $t \in [T]$, start point $λ_{0} := 0$, and initial resource $B_{0} := d T$.

for all $t = 1, \dots, T$ do

Receive $(f_{t}, b_{t}) \sim P$.

Calculate $\tilde{x}_{t} := \tilde{x}_{t} (λ_{t - 1}) := \arg\max_{x \in X} \{f_{t} (x) - (λ_{t - 1} + μ_{t - 1})^{⊤} b_{t} x\} = - \nabla f_{t}^{*} (b_{t}^{⊤} (λ_{t - 1} + μ_{t - 1}))$ and $\tilde{a}_{t} := \arg\max_{a \in Z} \{r (a) + μ_{t - 1}^{⊤} a\} = - \nabla r^{*} (- μ_{t - 1})$.

Select $x_{t} := \tilde{x}_{t}$ if $B_{t - 1} \geq b_{t} \tilde{x}_{t}$, and $x_{t} := 0$ otherwise.

Update remaining resources: $B_{t} := B_{t - 1} - b_{t} x_{t}$.

Calculate the stochastic gradient $\nabla D_{t} (λ_{t - 1}) := \begin{bmatrix} - b_{t} \tilde{x}_{t} + d \\ - b_{t} \tilde{x}_{t} + \tilde{a}_{t} \end{bmatrix}$.

Update dual variable via online gradient descent: $λ_{t} := \arg\min_{λ \in Ω_{λ} \times Ω_{μ}} \{⟨ λ, \nabla D_{t} (λ_{t - 1}) ⟩ + \frac{1}{2 η_{t}} ∥ λ - λ_{t - 1} ∥_{2}^{2}\}$.

end for

Algorithm 3 Online dual gradient (mirror) descent without constraint update
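To make the structure of Algorithm 3 concrete, the following sketch runs it on a one-resource toy instance. The toy is an assumption made for illustration only: reward $f_{t} (x) = - x^{2} / 4 + ξ_{t} x$ with $ξ_{t} \in \{1/2, 3/4\}$, $b_{t} = 1$, $d = 1/2$, no regularizer (so the $μ$ block is dropped), decision set $[0, 2]$, dual domain $[0, 1]$ and step size `step / t`. The function names are hypothetical.

```python
import random


def primal_response(xi, lam, x_max=2.0):
    """Maximizer of -x^2/4 + (xi - lam) * x over [0, x_max] (toy reward)."""
    return min(max(2.0 * (xi - lam), 0.0), x_max)


def dual_ogd_fixed_budget(T, d=0.5, lam_max=1.0, step=1.0, seed=0):
    """Online dual gradient descent that always targets the initial rate d."""
    rng = random.Random(seed)
    lam, budget = 0.0, d * T              # lambda_0 := 0, B_0 := d * T
    decisions = []
    for t in range(1, T + 1):
        xi = rng.choice([0.5, 0.75])      # receive the t-th arrival
        x_tilde = primal_response(xi, lam)
        x = x_tilde if budget >= x_tilde else 0.0
        budget -= x
        decisions.append(x)
        grad = -x_tilde + d               # gradient uses the fixed d, not d_t
        eta = step / t                    # eta_t = Theta(1/t)
        lam = min(max(lam - eta * grad, 0.0), lam_max)  # projected gradient step
    return decisions, budget, lam
```

The only difference from a constraint-updating scheme is the line computing `grad`: the remaining budget never enters the dual update.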

Input: regularizer $r$, iteration number $T$, start point $λ_{0} := 0$, and initial resource $B_{0} := d T$.

for all $t = 1, \dots, T$ do

Receive $(f_{t}, b_{t}) \sim P$.

Calculate $\tilde{x}_{t} := \tilde{x}_{t} (λ_{t - 1}) := \arg\max_{x \in X} \{f_{t} (x) - (λ_{t - 1} + μ_{t - 1})^{⊤} b_{t} x\} = - \nabla f_{t}^{*} (b_{t}^{⊤} (λ_{t - 1} + μ_{t - 1}))$.

Select $x_{t} := \tilde{x}_{t}$ if $B_{t - 1} \geq b_{t} \tilde{x}_{t}$, and $x_{t} := 0$ otherwise.

Update remaining resources: $B_{t} := B_{t - 1} - b_{t} x_{t}$.

Update dual variable via solving the $t$-sample SAA: $λ_{t} := \arg\min_{λ \in Ω_{λ} \times Ω_{μ}} \{\frac{1}{t} \sum_{j = 1}^{t} f_{j}^{*} (b_{j}^{⊤} (μ + λ)) + r^{*} (- μ) + d^{⊤} λ\}$.

end for

Algorithm 4 Dual SAA without constraint update
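The sketch below illustrates Algorithm 4 on the same kind of one-resource toy instance (an assumption for illustration: reward $- x^{2} / 4 + ξ_{t} x$ with $ξ_{t} \in \{1/2, 3/4\}$, $b_{t} = 1$, no regularizer, decision set $[0, 2]$, dual domain $[0, 1]$). Because the empirical dual is convex in one dimension, its minimizer is found here by bisection on the derivative; this solver choice and all function names are hypothetical. The flag `adaptive=True` replaces $d$ by $d_{t} = B_{t} / (T - t)$, which corresponds to exactly re-solving (4.1) and is included only to show where the constraint update enters.

```python
import random


def primal_response(xi, lam, x_max=2.0):
    """Maximizer of -x^2/4 + (xi - lam) * x over [0, x_max] (toy reward)."""
    return min(max(2.0 * (xi - lam), 0.0), x_max)


def solve_saa_dual(history, d, lam_max=1.0, iters=60):
    """Minimize (1/t) sum_j f_j^*(lam) + d * lam over [0, lam_max] by bisection."""
    def grad(lam):  # derivative of the empirical dual; nondecreasing in lam
        return d - sum(primal_response(xi, lam) for xi in history) / len(history)
    if grad(0.0) >= 0.0:
        return 0.0
    if grad(lam_max) <= 0.0:
        return lam_max
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def dual_saa(T, d=0.5, adaptive=False, seed=0):
    """Dual SAA loop; adaptive=False keeps the target rate fixed at d."""
    rng = random.Random(seed)
    lam, budget = 0.0, d * T
    history, decisions = [], []
    for t in range(1, T + 1):
        xi = rng.choice([0.5, 0.75])
        history.append(xi)
        x_tilde = primal_response(xi, lam)
        x = x_tilde if budget >= x_tilde else 0.0
        budget -= x
        decisions.append(x)
        if t < T:
            target = budget / (T - t) if adaptive else d
            lam = solve_saa_dual(history, target)
    return decisions, budget, lam
```

With history `[0.5, 0.75]` and `d = 0.5`, the empirical dual derivative is $2 λ - 3/4$ near the optimum, so `solve_saa_dual` returns approximately $3/8$.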

Algorithms 3 and 4 both approximate the dual solution, i.e., they iteratively update $λ_{t}$ to approach $λ^{*}$ by Stochastic Approximation (SA) or SAA, but they do so in different ways. Algorithm 3 is not history-dependent because it updates the dual variable using only the $t$-th sample, while Algorithm 4 is history-dependent because it gathers all the information up to time $t$ to update the dual variable. The following theorem establishes an $Ω (T^{1 / 2})$ regret lower bound for these two algorithms equipped with a typical stopping time.

Theorem 5.

Under Assumptions 1-3, there exists a constant $c_{2} > 0$ such that any dual-based algorithm $A$ attempting to approximate $λ^{*}$ with incurs a worst-case regret lower bound:

Regret (A) \geq Ω (T^{1 / 2})

We prove this theorem by constructing a one-dimensional strongly convex reward and bounding the regret via a probability estimate for the Binomial distribution. Note that the lower bound can also be controlled by both the dual approximation error $E \sum_{t = 1}^{τ} ∥ λ_{t - 1} - λ^{*} ∥_{2}^{2}$ and the early stopping effect $E (T - τ)$. Here $λ^{*}$ is the deterministic dual solution when the resource constraint is fixed at $d$. In sharp contrast, the dual solution $λ_{t}$ in our adaptive framework Algorithm 1 aims to approximate $λ^{*} (d_{t})$, where $d_{t}$ is the updated constraint at time $t$. Intuitively, the rationale behind the constraint update is that, at time $t$, the decision should be made in consideration of the remaining resources $d_{t}$ at hand instead of the initial resource $d$.

Remark 2.

Theorem 5 suggests that Algorithms 3 and 4 fail to reach the optimal regret under our assumptions because they both seek to approximate a deterministic $λ^{*}$. In fact, even if we know the exact distribution $P$ and its optimal solution $λ^{*}$, we are still unable to make our dual-based algorithm optimal by simply choosing $λ_{t} = λ^{*}$. Theorem 5 gives rigorous evidence that our constraint-update algorithm outperforms prior algorithms without constraint update, such as the online gradient descent studied by balseiro2021regularized; balseiro2022best.

Finally, we remark that our theorem pushes forward the understanding of adaptiveness for online algorithms to the dual-based ones. In arlotto2019uniformly, the authors established an $Ω (\sqrt{T})$ regret lower bound only for non-adaptive strategies (without adaptively updating the dual solutions). However, our proof demonstrates that, even when the strategy is adaptive, it might still not be sufficient to deliver an optimal regret if the algorithm only focuses on dual updates but neglects the constraint update. Actually, focusing on fixed constraints leads to a sub-optimal early stopping.

6 Applications

6.1 Strongly convex dual problems

We consider a special but practical setting, in which our empirical dual problem $\bar{D}_{t} (λ, d_{t})$ in (4.1) is always $\underline{L}_{D}$-strongly convex. This assumption can be met if $f_{t}^{*}$ and $r$ are almost surely strongly convex. In this case, $O (t)$ stochastic gradient descent steps at time $t$ suffice to make our algorithm theoretically optimal. The detailed implementation is given in Algorithm 5.

Input: regularizer $r$, iteration number $T$, start point $λ_{0} := 0$, and initial resource $B_{0} := d T$.

for all $t = 1, \dots, T$ do

Receive $(f_{t}, b_{t}) \sim P$.

Calculate $\tilde{x}_{t} := \tilde{x}_{t} (λ_{t - 1}) := \arg\max_{x \in X} \{f_{t} (x) - (λ_{t - 1} + μ_{t - 1})^{⊤} b_{t} x\} = - \nabla f_{t}^{*} (b_{t}^{⊤} (λ_{t - 1} + μ_{t - 1}))$.

Select $x_{t} := \tilde{x}_{t}$ if $B_{t - 1} \geq b_{t} \tilde{x}_{t}$, and $x_{t} := 0$ otherwise.

Update remaining resources: $B_{t} := B_{t - 1} - b_{t} x_{t}$.

Update average remaining resources: $d_{t} := \frac{B_{t}}{T - t}$.

Set $K := t$, and $η_{k} := \frac{\underline{L}_{D}}{k}$. Define $λ_{t}^{0} := λ_{t - 1}$.

for all $k = 1, \dots, K$ do

Randomly pick $ζ$ from $[t] := \{1, \dots, t\}$ with uniform distribution.

Calculate the stochastic gradient $\nabla D_{ζ} (λ_{t}^{k - 1}) := \begin{bmatrix} - b_{ζ} \tilde{x}_{ζ} (λ_{t}^{k - 1}) + d_{t} \\ - b_{ζ} \tilde{x}_{ζ} (λ_{t}^{k - 1}) + \tilde{a} (λ_{t}^{k - 1}) \end{bmatrix}$.

Update dual variable via stochastic gradient descent: $λ_{t}^{k} := \arg\min_{λ \in Ω_{λ} \times Ω_{μ}} \{⟨ λ, \nabla D_{ζ} (λ_{t}^{k - 1}) ⟩ + \frac{1}{2 η_{k}} ∥ λ - λ_{t}^{k - 1} ∥_{2}^{2}\}$.

end for

Update the dual variable by $λ_{t} := λ_{t}^{K}$.

end for

Algorithm 5 Resolving with SGD for strongly convex dual objective

Algorithm 5 satisfies Condition 1, but it does not rely on Corollary 1. Notice that

E ∥ λ_{t} - λ^{*} (d_{t}) ∥_{2}^{2} \leq 2 E ∥ λ_{t} - λ_{t}^{*} (d_{t}) ∥_{2}^{2} + 2 E ∥ λ_{t}^{*} (d_{t}) - λ^{*} (d_{t}) ∥_{2}^{2},

where $λ_{t}^{*} (d_{t})$ is the optimal solution to the empirical dual problem $\bar{D}_{t} (λ, d_{t})$. The second term $E ∥ λ_{t}^{*} (d_{t}) - λ^{*} (d_{t}) ∥_{2}^{2}$ represents the dual convergence and can be bounded by $O (t^{- 1})$ by Theorem 1, while the first term accounts for the optimization error and can also be bounded by $O (t^{- 1})$ (see rakhlin2012making). If $\bar{D}_{t} (λ, d_{t})$ is additionally smooth, we can also ensure Condition 1 by running batch gradient descent for a constant number of steps at each time $t$ to get an $O (t^{- 1} + (T - t)^{- 1})$-approximate solution, which still requires computing $O (t)$ gradients at the $t$-th time.
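The inner loop of the strongly convex variant can be sketched as last-iterate SGD with a decaying step. The sketch uses the same one-resource toy as an assumption (reward $- x^{2} / 4 + ξ x$, $b = 1$, no regularizer, decision set $[0, 2]$, dual domain $[0, 1]$), for which the empirical dual has curvature `mu = 2`. The step size $1 / (\mathtt{mu} \cdot k)$ is the textbook choice for strongly convex SGD and is an assumption of this sketch; Algorithm 5 states its own step-size rule. The function names are hypothetical.

```python
import random


def primal_response(xi, lam, x_max=2.0):
    """Maximizer of -x^2/4 + (xi - lam) * x over [0, x_max] (toy reward)."""
    return min(max(2.0 * (xi - lam), 0.0), x_max)


def sgd_last_iterate(history, d_t, lam0, K, mu=2.0, lam_max=1.0, seed=0):
    """Run K projected SGD steps on the empirical dual and return the last iterate."""
    rng = random.Random(seed)
    lam = lam0
    for k in range(1, K + 1):
        zeta = rng.choice(history)                  # uniform draw from past arrivals
        grad = -primal_response(zeta, lam) + d_t    # stochastic dual gradient
        lam = min(max(lam - grad / (mu * k), 0.0), lam_max)  # step 1 / (mu * k)
    return lam
```

On the two-point history `[0.5, 0.75]` with `d_t = 0.5`, the iterate is a running average of $ζ_{k} - 1/4$ and therefore concentrates around the empirical dual minimizer $3/8$.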

6.2 Online linear programming

Our algorithm framework and theoretical results are also applicable to classical non-regularized online linear allocation problems, which find applications in online ad auctions (buchbinder2007online), network revenue management (jasin2012re), the multi-secretary problem (kleinberg2005multiple), etc. At time $t$, we make a decision $x_{t} \in X = [0, D]^{n}$ that returns a linear reward with coefficient $v_{t}$ and bears a random cost $b_{t} \in R^{m \times n}$ per unit. Online linear programming can be formalized as:

		$\max_{x_{t}}$		$\sum_{t = 1}^{T} v_{t}^{⊤} x_{t}$
		s.t.		$\sum_{t = 1}^{T} b_{t} x_{t} ⪯ d T, d \in R_{+}^{m}$
		$x_{t} \in [0, D]^{n}, \forall t \in [T] .$

The empirical dual problem and its population version can be explicitly written as

\bar{D}_{T} (λ, d) := \frac{1}{T} \sum_{t = 1}^{T} \sum_{i = 1}^{n} (v_{i t} - b_{i t}^{⊤} λ)^{+} + d^{⊤} λ, and D (λ, d) := E \sum_{i = 1}^{n} (v_{i t} - b_{i t}^{⊤} λ)^{+} + d^{⊤} λ,

which is in line with li2021online. Here $b_{i t}$ denotes the $i$-th column of $b_{t}$. For a given dual variable $λ$, we make the primal decision $x_{i t} := D \cdot I (v_{i t} - b_{i t}^{⊤} λ > 0)$ if the resource constraints are not violated. Then, under the same locally strongly convex and non-degeneracy assumptions, we can make optimal decisions by choosing $λ_{t}$ as an $O (t^{- 1})$-optimal solution (or an $O (t^{- 3 / 2})$-optimal solution for a stochastic optimizer) of $\bar{D}_{t} (λ, d_{t})$. Consequently, an $O (\log T)$ regret is attainable, which improves the prior result of li2021online.
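The threshold rule and the empirical dual objective translate directly into code. The sketch below is illustrative: the function names and the data layout (`b_cols[i]` holding the $i$-th column of $b_{t}$ as a list) are assumptions, and the feasibility check against the remaining budget is deliberately left out.

```python
def lp_decision(v, b_cols, lam, D=1.0):
    """x_i = D * 1(v_i - b_i^T lam > 0); b_cols[i] is the i-th column of b_t."""
    x = []
    for v_i, b_i in zip(v, b_cols):
        reduced = v_i - sum(b_ij * l_j for b_ij, l_j in zip(b_i, lam))
        x.append(D if reduced > 0 else 0.0)
    return x


def lp_empirical_dual(samples, lam, d):
    """(1/T) sum_t sum_i (v_it - b_it^T lam)^+ + d^T lam for samples = [(v, b_cols), ...]."""
    total = 0.0
    for v, b_cols in samples:
        for v_i, b_i in zip(v, b_cols):
            reduced = v_i - sum(b_ij * l_j for b_ij, l_j in zip(b_i, lam))
            total += max(reduced, 0.0)   # positive part of the reduced reward
    return total / len(samples) + sum(d_j * l_j for d_j, l_j in zip(d, lam))
```

For example, with one resource, `v = [1.0, 0.2]`, columns `[[0.5], [0.5]]` and `lam = [1.0]`, only the first coordinate has a positive reduced reward and is selected.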

6.3 Online welfare maximization with costs

Our algorithm framework is also applicable to combinatorial auctions in the presence of production costs and resource constraints (blum2011welfare; huang2014welfare; tan2020online). Imagine that we run an online service system where each customer arrives requesting a bundle of resources out of $n$ available bundles. The customer arriving at time $t$ has a private valuation $v_{t} = [v_{1 t}, v_{2 t}, \dots, v_{n t}]^{⊤}$ over the bundles, and each bundle $i$ consists of $m$ types of resources $b_{i} \in R^{m}$. Denote $b = [b_{1}, b_{2}, \dots, b_{n}]$. At every time $t$, we make our decision $x_{t}$ by choosing which bundle to provide. Here the decision variable satisfies $x_{t} \in \{0, 1\}^{n}$ and $\sum_{i = 1}^{n} x_{i t} \leq 1$. The cost of consuming resources is given by a convex function $h_{T} (x)$. Our goal is to maximize the total social welfare through the following mixed-integer program:

		$\max_{x_{t}}$		$\sum_{t = 1}^{T} v_{t}^{⊤} x_{t} - h_{T} (\sum_{t = 1}^{T} b x_{t})$
		s.t.		$\sum_{t = 1}^{T} b x_{t} ⪯ d T, d \in R_{+}^{m}$
		$∥ x_{t} ∥_{1} \leq 1$
		$x_{t} \in \{0, 1\}^{n}, \forall t \in [T] .$

This online program usually formulates practical problems involved in networking and cloud computing, e.g., cloud resource allocation (dayarathna2015data) and 5G network slicing (rost2017network). If the convex cost is in the form $h_{T} (x) := - T \cdot r (x / T)$ for some strongly convex function $r$ , we can write the corresponding empirical dual problem and its population version as

	$\bar{D}_{T} (λ, d)$	$:= \frac{1}{T} \sum_{t = 1}^{T} \sup_{i \in [n]} (v_{i t} - b_{i}^{⊤} (λ + μ))^{+} + r^{*} (- μ) + d^{⊤} λ$
	$D (λ, d)$	$:= E \sup_{i \in [n]} (v_{i t} - b_{i}^{⊤} (λ + μ))^{+} + r^{*} (- μ) + d^{⊤} λ .$

For a given $λ$, our decision is made by $x_{i t} := I (i = \arg\max_{j \in [n]} \{v_{j t} - b_{j}^{⊤} (λ + μ)\})$. Under similar locally strongly convex and non-degeneracy assumptions, our algorithm framework achieves an $O (\log T)$ regret. The scale of our problem differs from that in tan2020online because our algorithm focuses on the regret given linear resources rather than on the competitive ratio under highly restricted resource constraints. Here the regularizer can be interpreted as the cost function of resources, which has an increasing marginal cost.
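The bundle-selection rule can be sketched as follows. The function name and the list-based data layout are assumptions for illustration. The code follows the displayed rule literally, so it always returns a one-hot vector (ties go to the smallest index); declining to serve when every score is negative, as well as the budget check, are left to the surrounding logic.

```python
def welfare_decision(v, bundles, lam, mu):
    """One-hot x_t selecting argmax_i v_i - b_i^T (lam + mu)."""
    price = [l + u for l, u in zip(lam, mu)]          # combined dual price per resource
    scores = [v_i - sum(b_ij * p_j for b_ij, p_j in zip(b_i, price))
              for v_i, b_i in zip(v, bundles)]
    best = max(range(len(scores)), key=scores.__getitem__)  # first maximizer
    return [1 if i == best else 0 for i in range(len(scores))]
```

With valuations `[3.0, 2.0]`, bundles `[[2.0, 0.0], [0.0, 1.0]]`, `lam = [1.0, 0.5]` and `mu = [0.5, 0.0]`, the priced scores are `0.0` and `1.5`, so the second bundle is chosen.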

6.4 Online convex covering and packing problem

We apply our algorithm framework to online covering and packing problems with convex objective functions, which have been discussed in azar2013online; azar2016online. Consider an online setting in which $T$ groups of clients of fixed size arrive sequentially and, at each time $t$, we serve the $t$-th group by assigning clients to $n$ different facilities, with an increasing convex assignment cost $f_{i t}$ and a demand $b_{i t} > 0$ for each facility $i \in [n]$. Define $x_{i t}$ as the number of clients assigned to facility $i$ at time $t$; then $f_{i t} (x_{i t})$ is the corresponding assignment cost and $b_{i t} x_{i t}$ is the demand for facility $i$. At each time, the total service must be at least the group size $1$, i.e., $\sum_{i = 1}^{n} x_{i t} \geq 1$. The average maintenance cost $h_{i}$ of each facility is an increasing convex function of its congestion, which is the ratio of the total demand of clients assigned to the facility to its total capacity. Our goal is to minimize the sum of assignment costs and maintenance costs:

		$\min_{x_{t}}$		$\sum_{t = 1}^{T} \sum_{i = 1}^{n} f_{i t} (x_{i t}) + T \cdot \sum_{i = 1}^{n} h_{i} (y_{i})$
		s.t.		$\sum_{t = 1}^{T} b_{i t} x_{i t} \leq T y_{i}, \forall i \in [n]$
		$\sum_{i = 1}^{n} x_{i t} \geq 1, \forall t \in [T]$
		$x_{i t} \in [0, 1]^{n}, \forall t \in [T], i \in [n]$
		$0 \leq y_{i} \leq 1, \forall i \in [n] .$

This is a convex and continuous variant of the Capacity Constrained Facility Location (CCFL) problem (azar2013online; azar2016online) featuring non-negative covering and packing constraints. Here the covering constraint $\sum_{i = 1}^{n} x_{i t} \geq 1$ represents the minimum service requirement, and thus we cannot take void actions; the packing constraints $\sum_{t = 1}^{T} b_{i t} x_{i t} \leq T y_{i}$ require that the congestion of each facility $i$ is bounded by 1. Denoting $x_{t} := [x_{1 t}, x_{2 t}, \dots, x_{n t}]^{⊤}$ and $X := \{x \in R_{+}^{n} \mid ∥ x ∥ = 1\}$, we can write our convex covering and packing problem as:

		$\max_{x_{t} \in X}$		$\sum_{t = 1}^{T} f_{t} (x_{t}) + T r (\frac{\sum_{t = 1}^{T} b_{t} x_{t}}{T})$
		s.t.		$\sum_{t = 1}^{T} b_{t} x_{t} ⪯ T \cdot 1,$

where $f_{t} (x_{t}) := - \sum_{i = 1}^{n} f_{i t} (x_{i t})$ , $r (y) := - \sum_{i = 1}^{n} h_{i} (y_{i})$ , and $b_{t} := diag (b_{1 t}, b_{2 t}, \dots, b_{n t})$ . Thus, we can similarly apply our Algorithm 1 to the CCFL problem and achieve optimal regret control.

7 Numerical Experiments

In this section, we implement Resolving with SGD as a showcase for our proposed algorithmic framework. The performance is assessed under 4 different input models. The implementation details on multiple input models are as follows: the dual update is calculated by closed-form solutions to Equation (4.3) under input I-III and by cvxpy (diamond2016cvxpy) under input IV. See Table 1 for more information. For each $T$ , we randomly sample $T$ observations from datasets, implement our algorithm, and calculate the regret. Only the average regret over 10 repetitions is reported. Note that we use the dual objective evaluated at the average gradient $D_{T} (\frac{1}{T} \sum_{t = 1}^{T} λ_{t})$ as the benchmark to compute the regret.

Input	$f_{t} (x)$	$r (x)$	$b_{i t}$	$d_{i}$
I	$a_{t}^{⊤} x$	$- κ {∥ x - d / 2 ∥}_{2}^{2}$	U $(0, 1)$	0.1
II	$a_{t}^{⊤} x$	$- κ {∥ x - d / 2 ∥}_{2}^{2}$	Bernoulli( $p_{i}$ )	U $(0.25, 0.75)$
III	$- \frac{1}{4} x^{2} + ξ_{t} x$	0	1	0.5
IV	$a_{t}^{⊤} x$	$- κ \sum_{i = 1}^{m} \frac{x_{i}}{\sum_{i = 1}^{m} d_{i}} log \frac{x_{i}}{\sum_{i = 1}^{m} d_{i}}$	1	From dataset

Table 1: Parameter Settings of Inputs

Input model I: Online welfare maximization with costs, independent reward and resource consumption. The reward functions are linear as $f_{t} (x) = a_{t}^{⊤} x$ . The regularization function is the $ℓ_{2}$ loss $r (x) = - κ {∥ x - d / 2 ∥}_{2}^{2}$ , which corresponds to the application of online welfare maximization with square costs. The reward coefficients $a_{t}$ ’s and the constraint coefficients $b_{t}$ ’s are i.i.d. random variables. More exactly, $a_{i t}$ is generated from the uniform distribution $U (0, 10)$ , and $b_{i t}$ is generated from the uniform distribution $U (0, 1)$ .

Figure 1: Regret versus horizon (T) under Input I. OGD stands for online gradient descent in balseiro2020dual; resolving with SGD is our Algorithm 2; nonadaptive resolving with SGD is the nonadaptive version (i.e., without updating the constraints) of Algorithm 2.

To illustrate how the regret scales with the time horizon $T$, we evaluate the algorithms with different $T$ chosen from $\{256, 512, 1024, 1536, 2048, 2560\}$. Here $m = 6$. We find that Resolving with SGD (Algorithm 2) shows logarithmic regret, while its counterpart without constraint update ($d_{t} \equiv d$ in Equation 4.2) shows a much worse regret. We refer to the latter algorithm as “Nonadaptive resolving with SGD”. The online gradient descent (OGD) method in balseiro2020dual exhibits an $O (\sqrt{T})$ regret, as indicated by their theoretical findings. The regret comparison between the algorithms can be found in Figure 1. In Figure 2, we plot the dynamics of resource consumption for one binding dimension under the aforementioned algorithms. Ten curves are displayed, each of which corresponds to one simulation. Being adaptive to the level of remaining resources, Algorithm 2 carefully controls the resource consumption so that the resources are consumed at a steady rate until they are used up. In comparison, both OGD and the nonadaptive version of Algorithm 2 stop allocating resources too early, demonstrating the benefits of the constraint updates, which exploit the history of past actions.

Input model II: Online welfare maximization with costs, dependent reward and resource consumption. The parameter setting below is based on balseiro2022best. The reward functions and the regularization function are the same as in Input I, whereas Input II considers the case in which the reward coefficients $a_{t}$ depend on the constraint coefficients $b_{t}$ . We set $a_{t} = \mathrm{Proj}_{[0, 10]} \{θ_{t}^{⊤} b_{t} + δ_{t} 1\}$ , where $θ_{t}$ is generated from a multivariate Gaussian distribution $N (0, \mathrm{diag} (1))$ , and $δ_{t}$ is generated from the standard Gaussian distribution $N (0, 1)$ . The constraint coefficients $b_{i t}$ are generated from a Bernoulli distribution with parameter $p_{i} = (1 + α) / 2$ , where $α$ is generated from the beta distribution $\mathrm{Beta} (1, 3)$ . The average resource constraints $d_{i}$ are generated from the uniform distribution $U (0.25, 0.75)$ .

Figure 3: Regret versus horizon (T) under Input II. OGD stands for online gradient descent in balseiro2020dual; resolving with SGD is Algorithm 2; nonadaptive resolving with SGD is the nonadaptive version of Algorithm 2.

As in the setting of Input I, we evaluate the algorithms under Input II with different values of $T$ and fix $m = 6$ . The regret and resource consumption are displayed in Figure 3 and Figure 4, respectively. Among the three algorithms (Algorithm 2, the nonadaptive version of Algorithm 2, and the OGD method in balseiro2020dual), Algorithm 2 achieves logarithmic regret, the nonadaptive version of Algorithm 2 suffers a higher regret, and the regret of OGD grows much faster.

Input model III: Non-regularized online convex resource allocation with one resource. In this model, we assess the algorithms’ performance in a non-regularized special case with a single resource, reward function $f_{t} (x) = f_{t} (x, ξ_{t}) = - \frac{1}{4} x^{2} + ξ_{t} x$ , constraint $d = \frac{1}{2}$ and cost $b_{t} = 1$ . The random variable $ξ_{t}$ follows a two-point distribution that takes values in $\{\frac{1}{2}, \frac{3}{4}\}$ with equal probability, i.e., $P [ξ_{t} = \frac{1}{2}] = P [ξ_{t} = \frac{3}{4}] = 0.5$ . This special case is used in the proof of Theorem 5.

For Input III, the optimal solution to Problem (2.8) admits a closed form owing to the simple distribution. We further compare with two algorithms, “No learning” and “SAA”, which are the convex versions of Algorithms 1 and 2 in li2021online, respectively. Both require computing optimal dual solutions, whereas neither Resolving with SGD (Algorithm 2) nor OGD needs this step. The regret comparison is shown in Figure 4(a). All benchmark algorithms show a regret that increases in $T$ , while the regret of Resolving with SGD gradually stabilizes as $T$ increases. This corroborates the theoretical results that our proposed algorithmic framework achieves $O (\log T)$ regret and that any algorithm without constraint updates incurs $Ω (\sqrt{T})$ regret. We further explain the performance advantage by plotting the remaining time before stopping in Figure 4(b). All benchmark algorithms stop allocating resources $O (\sqrt{T})$ steps earlier than Resolving with SGD (Algorithm 2), which leads to their poor regret performance.
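To make the closed form concrete, the following minimal sketch computes a dual price for Input III by bisection. It rests on assumptions that are not stated in this section: the decision $x$ ranges over the nonnegative reals, and the dual price is the value of $λ$ at which the expected per-period consumption of the best response equals $d$ . All function names are hypothetical.

```python
def best_response(xi, lam):
    """Maximizer over x >= 0 of -x^2/4 + xi*x - lam*x (cost b_t = 1).

    Assumption for this sketch: the feasible set is x >= 0.
    """
    return max(0.0, 2.0 * (xi - lam))


def expected_consumption(lam):
    """E[x(lam)] when xi is 1/2 or 3/4 with equal probability."""
    return 0.5 * best_response(0.5, lam) + 0.5 * best_response(0.75, lam)


def fluid_dual_price(d=0.5, lo=0.0, hi=0.75, iters=60):
    """Bisection on lam: expected consumption is nonincreasing in lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_consumption(mid) > d:
            lo = mid  # price too low: resource is over-consumed
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam_star = fluid_dual_price()
```

Under these assumptions the expected consumption is $2 (\frac{5}{8} - λ)$ for $λ \leq \frac{1}{2}$ , so the price solves $2 (\frac{5}{8} - λ) = \frac{1}{2}$ , i.e., $λ = \frac{3}{8}$ , with best responses $\frac{1}{4}$ and $\frac{3}{4}$ for the two values of $ξ_{t}$ .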

(a) Regret versus horizon ( $T$ ) under input III.

Input model IV: Display advertisement allocation with entropy regularization. We use the display advertisement dataset in balseiro2021regularized as the last input model. Consider $m$ advertisers. In this model, $f_{t} (x) = q_{t}^{⊤} x$ , where $q_{t} = \{q_{1 t}, \dots, q_{m t}\}$ and $q_{i t}$ is the expected click-through rate of the $i$ th advertiser for impression $t$ . The regularization function is $r (x) = - κ \sum_{i = 1}^{m} \frac{x_{i}}{\sum_{i = 1}^{m} d_{i}} \log \frac{x_{i}}{\sum_{i = 1}^{m} d_{i}}$ , which imposes diversity and fairness requirements on the allocation. The per-time-slot budget $d_{i}$ of the $i$ th advertiser is also given in the dataset. The consumption cost is $b_{t} = 1$ . At time $t$ , the impression can be assigned to at most one advertiser, i.e., $x_{t} \in \{0, 1\}^{m}$ and $\sum_{i = 1}^{m} x_{i t} \leq 1$ .
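The entropy regularizer of Input IV can be evaluated with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and the convention $0 \log 0 = 0$ is an assumption (it is the usual one for entropy, but the text does not state it).

```python
import math


def entropy_regularizer(x, d, kappa):
    """r(x) = -kappa * sum_i (x_i / S) * log(x_i / S), with S = sum_i d_i.

    Assumption for this sketch: terms with x_i = 0 contribute zero
    (the convention 0 * log 0 = 0).
    """
    total_budget = sum(d)
    value = 0.0
    for xi in x:
        share = xi / total_budget
        if share > 0.0:
            value += share * math.log(share)
    return -kappa * value
```

For example, with $d = (0.5, 0.5)$ and an evenly spread allocation $x = (0.5, 0.5)$ , the value is $κ \log 2$ , whereas concentrating the whole unit on one advertiser gives zero, which is how the term rewards diversity.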

(a) Regret versus horizon ( $T$ ) under Input IV with different regularization levels ( $κ =$ 0, 0.001, 0.005).

In Figure 5(a), the regret curves of Algorithm 2 and the OGD algorithm are plotted for different values of $κ$ . The regret of Algorithm 2 grows more slowly than that of OGD, which shows the advantage of the proposed algorithm in this setting. We also observe that the regret is very close across regularization levels ( $κ = 0, 0.001, 0.005$ ), with $κ = 0.001$ incurring the lowest regret. The trade-off between the reward (average click-through rate) and the regularization term is plotted in Figure 5(b).

8 Discussion

In this paper, we investigated regularized online convex allocation problems with a non-separable regularizer. While a polynomial-time adaptive algorithmic framework is proved to be optimal in controlling regret, several interesting yet challenging questions remain open. One is the necessity of the non-degeneracy assumption. Recently, bumpensanti2020re showed that the non-degeneracy assumption is not necessary for the re-solving heuristic to reach a low regret under linear settings. Can a similar optimal result be achieved without the non-degeneracy assumption on constraints in online convex allocation? Another question concerns algorithm implementation. Although our algorithms have polynomial complexity, we still wonder whether there exists another adaptive strategy with linear computational cost that achieves the (sub)optimal logarithmic regret. We note that, in our adaptive strategy, most of the computational cost comes from the frequent updating of dual solutions. One possible way to reduce this cost is to lower the updating frequency. Lastly, throughout this paper, we only discussed online convex allocation problems under the stochastic input model. The behavior of re-solving algorithms under other input models, such as random permutation inputs or adversarial inputs, remains largely unknown.

References

Supplement to “Optimal Regularized Online Convex Allocation by Adaptive Re-Solving”

Appendix A Proofs of Main Results

A.1 Proof of Lemma 1

We first bound the deterministic optimal solution. Consider $Ω_{μ}^{'} = \{- \nabla r (a) \mid a \in Z\}$ . The bounded subgradient condition in Assumption 1 implies that the dual variable region $Ω_{μ}^{'}$ defined here is bounded by $G$ . We justify this definition through the optimality conditions of stochastic programming. Note that for problem (2.8), $μ$ is unconstrained. The optimality condition implies that

\nabla r^{*} (- μ^{*}) = E b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (λ^{*} + μ^{*}))

provided that Fubini’s theorem applies. Then by Fenchel conjugacy, we have $μ^{*} \in - \nabla r (E b_{t} \tilde{x}_{t})$ . This shows that the region $Ω_{μ}^{'}$ indeed contains the optimal solution $μ^{*}$ , i.e., $μ^{*} \in Ω_{μ}$ . Thus we have ${∥ μ ∥}_{\infty} \leq G$ .

For the second bound, on $\| λ^{*} \|_{\infty}$ , we only need to check that $d^{⊤} λ^{*} \leq 2 (\bar{f} + \bar{r})$ always holds. Otherwise, if $d^{⊤} λ^{*} > 2 (\bar{f} + \bar{r})$ , we have

$$\begin{aligned} D (λ^{*}, d) &= E \sup_{x} \{f_{t} (x) - (λ^{*} + μ^{*})^{⊤} b_{t} x\} + \sup_{a} \{r (a) + a^{⊤} μ^{*}\} + d^{⊤} λ^{*} \geq E f_{t} (0) + r (0) + d^{⊤} λ^{*} \\ &> (\bar{f} + \bar{r}) \geq D (0, d), \end{aligned}$$

which contradicts the optimality of $λ^{*}$ . Thus we have $d^{⊤} λ^{*} \leq 2 (\bar{f} + \bar{r})$ , i.e., $\| λ^{*} \|_{\infty} \leq \frac{2 (\bar{f} + \bar{r})}{\underline{d}}$ . The bound for the empirical optimal solution $λ_{T}^{*}$ follows from exactly the same argument.

A.2 Proof of Proposition 1

We consider

D (λ, d) - D (λ^{*}, d) - \nabla D (λ^{*}, d)^{⊤} (λ - λ^{*}) = \int_{0}^{1} {[\nabla D (z (λ - λ^{*}) + λ^{*}, d) - \nabla D (λ^{*}, d)]}^{⊤} (λ - λ^{*}) d z,

where $\nabla D (λ, d) = \begin{bmatrix} E b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (μ + λ)) + d \\ E b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (μ + λ)) - \nabla r^{*} (- μ) \end{bmatrix}$ . Then for any $z$ , we have

$$\begin{aligned} & [\nabla D (z (λ - λ^{*}) + λ^{*}, d) - \nabla D (λ^{*}, d)]^{⊤} (λ - λ^{*}) \\ \leq{} & \| E b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (z (μ + λ - μ^{*} - λ^{*}) + μ^{*} + λ^{*})) - E b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (μ^{*} + λ^{*})) \|_{2} (\| λ - λ^{*} \|_{2} + \| μ - μ^{*} \|_{2}) \\ & + \| \nabla r^{*} (- μ) - \nabla r^{*} (- μ^{*}) \|_{2} \| μ - μ^{*} \|_{2} \\ \leq{} & \| z \bar{L}_{f} \bar{b} E [b_{t}^{⊤} (μ + λ - μ^{*} - λ^{*})] \|_{2} (\| λ - λ^{*} \|_{2} + \| μ - μ^{*} \|_{2}) + \bar{L}_{r} z \| μ - μ^{*} \|_{2}^{2} \\ \leq{} & z \bar{L}_{f} \bar{b}^{2} (\| λ - λ^{*} \|_{2} + \| μ - μ^{*} \|_{2})^{2} + \bar{L}_{r} z \| μ - μ^{*} \|_{2}^{2} \leq z (2 \bar{b}^{2} \bar{L}_{f} + \bar{L}_{r}) \| λ - λ^{*} \|_{2}^{2}, \end{aligned}$$

where the second inequality follows from Assumption 1, conditioned on $b_{t}$ , and Assumption 3. Integrating over $z \in [0, 1]$ , we have

D (λ, d) - D (λ^{*}, d) - \nabla D (λ^{*}, d)^{⊤} (λ - λ^{*}) \leq (\bar{b}^{2} \bar{L}_{f} + \bar{L}_{r} / 2) \| λ - λ^{*} \|_{2}^{2} .

For the other direction, we have

$$\begin{aligned} & [\nabla D (z (λ - λ^{*}) + λ^{*}, d) - \nabla D (λ^{*}, d)]^{⊤} (λ - λ^{*}) \\ ={} & E [E [\langle \nabla f_{t}^{*} (b_{t}^{⊤} (z (μ + λ - μ^{*} - λ^{*}) + μ^{*} + λ^{*})) - \nabla f_{t}^{*} (b_{t}^{⊤} (μ^{*} + λ^{*})), b_{t}^{⊤} (μ + λ - μ^{*} - λ^{*}) \rangle \mid b_{t}]] \\ & + \langle \nabla r^{*} (- (z (μ - μ^{*}) + μ^{*})) - \nabla r^{*} (- μ^{*}), μ^{*} - μ \rangle \\ \geq{} & z \underline{L}_{f} E \| b_{t}^{⊤} (μ + λ - μ^{*} - λ^{*}) \|_{2}^{2} + z \underline{L}_{r} \| μ - μ^{*} \|_{2}^{2} \geq z \underline{L}_{f} σ_{\min} \| μ + λ - μ^{*} - λ^{*} \|_{2}^{2} + z \underline{L}_{r} \| μ - μ^{*} \|_{2}^{2} . \end{aligned}$$

Here the first inequality follows from Assumption 1, conditioned on $b_{t}$ , and Assumption 2. With the inequality for any positive $δ$ , we choose $δ = \frac{\underline{L}_{r}}{2 \underline{L}_{f} σ_{\min}}$ and $a = μ - μ^{*}$ , $b = λ - λ^{*}$ . Then we have

	$[\nabla D (z (λ - λ^{*}) + λ^{*}, d) - \nabla D (λ^{*}, d)]^{⊤} (λ - λ^{*}) \geq$	$z (\frac{\underline{L}_{r}}{\underline{L}_{r} + 2 \underline{L}_{f} σ_{\min}} \| λ - λ^{*} \|_{2}^{2} + \frac{\underline{L}_{r}}{2} \| μ - μ^{*} \|_{2}^{2})$
	$\geq$

Integrating over $z$ yields the corresponding lower bound on the growth of $D (λ, d)$ . Thus we have

\underline{L}_{D} \| λ - λ^{*} \|_{2}^{2} \leq D (λ, d) - D (λ^{*}, d) - \nabla D (λ^{*}, d)^{⊤} (λ - λ^{*}) \leq \bar{L}_{D} \| λ - λ^{*} \|_{2}^{2},

where the constants are $\underline{L}_{D} = \frac{\underline{L}_{r}}{4} \land \frac{1}{2} \frac{\underline{L}_{r}}{\underline{L}_{r} + 2 \underline{L}_{f} σ_{\min}}$ and $\bar{L}_{D} = \bar{b}^{2} \bar{L}_{f} + \bar{L}_{r} / 2$ .

A.3 Proof of Lemma 2

Since $ϕ_{t} (λ^{*}, d) = \begin{bmatrix} b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (μ^{*} + λ^{*})) + d \\ b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (μ^{*} + λ^{*})) - \nabla r^{*} (- μ^{*}) \end{bmatrix} = \begin{bmatrix} \partial_{λ} D_{t} (λ^{*}, d) \\ \partial_{μ} D_{t} (λ^{*}, d) \end{bmatrix}$ , we consider the partial gradients with respect to $λ$ and $μ$ separately.

For any dimension $i \in [m]$ , we have $| (\partial_{λ} D_{t} (λ^{*}, d))_{i} - d_{i} | = | (b_{t} \nabla f_{t}^{*} (b_{t}^{⊤} (μ^{*} + λ^{*})))_{i} | \leq \sqrt{n} \bar{b} D$ , and we also have

| (\partial_{μ} D_{t} (λ^{*}, d))_{i} | \leq \sqrt{n} \bar{b} D + G.

According to Hoeffding’s inequality, we have

P (| (\bar{ϕ}_{T} (λ^{*}, d))_{i} - (\nabla D (λ^{*}, d))_{i} | > ε / \sqrt{2 m}) \leq 2 \exp (- \frac{T ε^{2}}{4 m c_{1}})

for all $i \in [2 m]$ .

Combining all $2 m$ dimensions with a union bound, we conclude that

P (\| \frac{1}{T} \sum_{t = 1}^{T} ϕ_{t} (λ^{*}, d) - \nabla D (λ^{*}, d) \|_{2} > ε) \leq 2 m \cdot 2 \exp (- \frac{T ε^{2}}{4 m c_{1}}) = 4 m \exp (- \frac{T ε^{2}}{4 m c_{1}}) .
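To get a feel for the tail bound $4 m \exp (- \frac{T ε^{2}}{4 m c_{1}})$ , the sketch below evaluates it and inverts it for the horizon needed to reach a target failure probability. The function names are hypothetical, and $c_{1}$ is treated as a user-supplied constant since its value is not given in this section.

```python
import math


def gradient_deviation_bound(horizon, eps, m, c1):
    """Upper bound 4m * exp(-T * eps^2 / (4 m c1)) on the deviation probability."""
    return 4.0 * m * math.exp(-horizon * eps ** 2 / (4.0 * m * c1))


def horizon_for_confidence(delta, eps, m, c1):
    """Smallest integer T for which the bound above is at most delta.

    Obtained by solving 4m * exp(-T eps^2 / (4 m c1)) <= delta for T.
    """
    return math.ceil(4.0 * m * c1 * math.log(4.0 * m / delta) / eps ** 2)
```

For instance, `horizon_for_confidence(0.05, 0.1, 6, 1.0)` returns a horizon at which `gradient_deviation_bound` is no larger than 0.05; the required horizon scales as $ε^{- 2}$ and only logarithmically in $1 / δ$ .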

A.4 Proof of Proposition 2

For any given $ε > 0$ , we define the neighbourhood of $λ^{*}$ as in (3.4). We then construct a good event $E (ε)$ , depending only on $ε$ , under which the convex function $\bar{D}_{T} (λ, d) - \bar{D}_{T} (λ^{*}, d)$ is bounded below by a quadratic function in $Ω_{λ} (ε)$ ; this quadratic serves as a lower bound on the dual function. The construction of $E (ε)$ is based on the following splitting scheme and on concentration of the objective function:

First, we control the first-order term so that $\bar{ϕ}_{T} (λ, d) = \frac{1}{T} \sum_{t = 1}^{T} ϕ_{t} (λ, d)$ does not vary too much. This is guaranteed by the concentration inequality in Lemma 2.
Then, we split $Ω_{λ} (ε)$ into multiple cubes, layer by layer, and in each cube we control the difference in the second-order terms between every $λ$ in the cube and the center of the cube.
Finally, we uniformly control the deviation of the second-order terms over all cube centers.

For the first-order term, denote the event $E_{0} (ε) = \{\| \bar{ϕ}_{T} (λ^{*}, d) - \nabla D (λ^{*}, d) \|_{2} > ε\}$ . Then by Lemma 2, we have $P (E_{0} (ε)) \leq 4 m \exp (- \frac{T ε^{2}}{4 m c_{1}})$ . Under the event $E_{0}^{c} (ε)$ , we have

(A.1)

We now discuss the second-order term of $\bar{D}_{T} (λ, d)$ . Define the second-order term

$$\begin{aligned} s_{t} (λ, d) &= D_{t} (λ, d) - D_{t} (λ^{*}, d) - \langle ϕ_{t} (λ^{*}, d), λ - λ^{*} \rangle, \\ \bar{s}_{T} (λ, d) &= \frac{1}{T} \sum_{t = 1}^{T} s_{t} (λ, d) . \end{aligned}$$

To derive a uniform lower bound on $\bar{s}_{T} (λ, d)$ , we split $Ω_{λ} (ε)$ as follows, according to huber1967under.

Define set , where $q \in (0, 1)$ and $N \in \mathbb{N}_{+}$ will be specified later. This split divides $Ω_{λ} (ε)$ into $N$ layers $\{Ω_{λ}^{k - 1} (ε) \setminus Ω_{λ}^{k} (ε)\}_{k = 1}^{N}$ and a center cube $Ω_{λ}^{N} (ε)$ . We then split each layer into disjoint cubes $\{\bar{Ω}^{k l} (ε)\}_{l = 1}^{l_{k}}$ with edges of length $(1 - q) q^{k - 1} 4 H ε$ , and denote the center cube by $\bar{Ω}^{N 1} (ε)$ . huber1967under shows that there are at most $(2 N)^{2 m}$ cubes. This is not the only split that yields the desired convergence order, but it makes our result tighter. The center of each cube $\bar{Ω}^{k l} (ε)$ is $λ_{k l} = (λ_{k l}, μ_{k l})$ . Define $\bar{λ}_{k l} = \arg \max_{λ \in \bar{Ω}^{k l} (ε)} \| λ - λ^{*} \|_{2}$ , and

$$Γ_{t}^{k l} = \max_{λ \in \bar{Ω}^{k l} (ε)} [s_{t} (λ_{k l}, d) - s_{t} (λ, d)] . \tag{A.2}$$

Then for $k \in \{0, \dots, N - 1\}$ and all $λ \in \bar{Ω}^{k l} (ε)$ , $\bar{s}_{T}$ can be decomposed as

$$\begin{aligned} \bar{s}_{T} (λ, d) ={} & \frac{1}{T} \sum_{t = 1}^{T} s_{t} (λ, d) - \frac{1}{T} \sum_{t = 1}^{T} s_{t} (λ_{k l}, d) + \frac{1}{T} \sum_{t = 1}^{T} s_{t} (λ_{k l}, d) \\ \geq{} & \underbrace{E s_{t} (λ_{k l}, d) - E Γ_{t}^{k l}}_{\text{A.3.1}} + \underbrace{- \frac{1}{T} \sum_{t = 1}^{T} Γ_{t}^{k l} + E Γ_{t}^{k l}}_{\text{A.3.2}} + \underbrace{\frac{1}{T} \sum_{t = 1}^{T} s_{t} (λ_{k l}, d) - E s_{t} (λ_{k l}, d)}_{\text{A.3.3}} . \end{aligned} \tag{A.3}$$

We study lower bounds of these 3 terms in (A.3) respectively.

Lower bound of A.3.1:

$E Γ_{t}^{k l} =$	$E max λ \in {¯ Ω}^{k l} (ε) [f_{t}^{} (b_{t}^{⊤} (λ_{k l} + μ_{k l})) + r^{} (- μ_{k l}) - f_{t}^{} (b_{t}^{⊤} (λ + μ)) - r^{} (- μ)$	(A.4)
	$- \nabla f_{t}^{} (λ^{} + μ^{})^{⊤} b_{t}^{⊤} (λ_{k l} + μ_{k l} - λ - μ) + \nabla r^{} (- μ^{*})^{⊤} (μ_{k l} - μ)]$
$=$	$E max λ \in {¯ Ω}^{k l} (ε) [\int_{0}^{1} v_{1}^{⊤} (λ) [\nabla f_{t}^{} (b_{t}^{⊤} (λ + μ) + v_{1} \cdot z) - \nabla f_{t}^{} (b_{t}^{⊤} (λ^{} + μ^{}))] d z$
	$+ \int_{0}^{1} (μ_{k l} - μ)^{⊤} [- \nabla r^{} (- μ - (μ_{k l} - μ) \cdot z) + \nabla r^{} (- μ^{*})]] d z$
$=$	$E [\int_{0}^{1} v_{1}^{⊤} (~ λ, ~ μ) [\nabla f_{t}^{} (b_{t}^{⊤} (~ λ + ~ μ) + v_{1} \cdot z) - \nabla f_{t}^{} (b_{t}^{⊤} (λ^{} + μ^{})))] d z$
	$+ \int_{0}^{1} (μ_{k l} - ~ μ)^{⊤} [- \nabla r^{} (- ~ μ - (μ_{k l} - ~ μ) \cdot z) + \nabla r^{} (- μ^{*})]] d z for some (~ λ, ~ μ) \in σ (f_{t}, b_{t})$
$\leq$
$\leq$

where $v_{1} (λ) = b_{t}^{⊤} (λ_{k l} + μ_{k l} - λ - μ)$ is the direction vector, and the first inequality is obtained by Assumption 2 and Assumption 4.

According to Proposition 1, we have

	$E s_{t} (λ_{k l}, d) =$	$D (λ_{k l}, d) - D (λ^{*}, d) - \nabla D (λ^{*}, d)^{⊤} \begin{bmatrix} λ_{k l} - λ^{*} \\ μ_{k l} - μ^{*} \end{bmatrix}$

So for the first term we have

		$- E Γ_{t}^{k l} + E s_{t} (λ_{k l}, d)$		(A.5)

Lower bound of A.3.2: Since the gradient norms $\| \nabla f_{t}^{*} \|_{\infty}$ and $\| \nabla r^{*} \|_{\infty}$ are bounded by $D$ and $G$ , respectively, the integral form of $Γ_{t}^{k l}$ in the second equality of (A.4) also gives:

$$\begin{aligned} \| Γ_{t}^{k l} \|_{2} &\leq 2 \sqrt{n} \bar{b} D \max_{λ \in \bar{Ω}^{k l} (ε)} \| μ + λ - μ_{k l} - λ_{k l} \| + 2 \sqrt{m} G \max_{λ \in \bar{Ω}^{k l} (ε)} \| μ - μ_{k l} \|_{2} \\ &\leq 2 (\sqrt{n} \bar{b} D + \sqrt{m} G) (\max_{λ \in \bar{Ω}^{k l} (ε)} \| λ - λ_{k l} \|_{2} + \| μ - μ_{k l} \|_{2}), \end{aligned}$$

for any $t \in [T]$ .

Define the event

$$E_{k l, 1} (ε) = \left\{ - \frac{1}{T} \sum_{t = 1}^{T} Γ_{t}^{k l} + E Γ_{t}^{k l} < - 2 ε (\sqrt{n} \bar{b} D + \sqrt{m} G) (\max_{λ \in \bar{Ω}^{k l} (ε)} \| λ - λ_{k l} \|_{2} + \| μ - μ_{k l} \|_{2}) \right\} . \tag{A.6}$$

Then, according to Hoeffding’s inequality, $P (E_{k l, 1} (ε)) \leq \exp (- \frac{T ε^{2}}{2})$ .

Lower bound of A.3.3: We bound the norm of each $s_{t} (λ_{k l}, d)$ :

$$\begin{aligned} \| s_{t} (λ_{k l}, d) \|_{2} ={} & \Big\| \int_{0}^{1} v_{2}^{⊤} [\nabla f_{t}^{*} (b_{t}^{⊤} (λ^{*} + μ^{*}) + v_{2} \cdot z) - \nabla f_{t}^{*} (b_{t}^{⊤} (λ^{*} + μ^{*}))] d z \\ & + \int_{0}^{1} (μ_{k l} - μ^{*})^{⊤} [- \nabla r^{*} (- μ^{*} - (μ_{k l} - μ^{*}) \cdot z) + \nabla r^{*} (- μ^{*})] d z \Big\|_{2} \\ \leq{} & 2 \sqrt{n} \bar{b} D (\| \bar{λ}_{k l} - λ^{*} \|_{2} + \| \bar{μ}_{k l} - μ^{*} \|_{2}) + 2 \sqrt{m} G \| \bar{μ}_{k l} - μ^{*} \|_{2} \\ \leq{} & 2 (\sqrt{n} \bar{b} D + \sqrt{m} G) (\| \bar{λ}_{k l} - λ^{*} \|_{2} + \| \bar{μ}_{k l} - μ^{*} \|_{2}), \end{aligned} \tag{A.7}$$

for any $t \in [T]$ , where $v_{2} = b_{t}^{⊤} (λ_{k l} + μ_{k l} - λ^{*} - μ^{*})$ is the direction vector.

Define the event

$$E_{k l, 2} = \left\{ \frac{1}{T} \sum_{t = 1}^{T} s_{t} (λ_{k l}, d) - E s_{t} (λ_{k l}, d) < - 2 ε (\sqrt{n} \bar{b} D + \sqrt{m} G) (\| \bar{λ}_{k l} - λ^{*} \|_{2} + \| \bar{μ}_{k l} - μ^{*} \|_{2}) \right\} . \tag{A.8}$$

Then we have $P (E_{k l, 2}) \leq \exp (- \frac{T ε^{2}}{2})$ by Hoeffding’s inequality.

We now make all the quantities in the lower bound uniform by leveraging the splitting scheme. From the split, we have

$$\begin{aligned} \max_{λ \in \bar{Ω}^{k l} (ε)} \| λ - λ_{k l} \|_{2} &= \sqrt{m} (1 - q) q^{k - 1} 4 H ε, \\ \max_{μ \in \bar{Ω}^{k l} (ε)} \| μ - μ_{k l} \|_{2} &= \sqrt{m} (1 - q) q^{k - 1} 4 H ε, \\ \| λ^{*} - λ_{k l} \|_{2} &\geq q^{k} 4 H ε, \\ \| μ^{*} - μ_{k l} \|_{2} &\geq q^{k} 4 H ε . \end{aligned}$$

Moreover,

$$\begin{aligned} \| λ^{*} - \bar{λ}_{k l} \|_{2} &\leq \| λ^{*} - λ_{k l} \|_{2} + \max_{λ \in \bar{Ω}^{k l} (ε)} \| λ - λ_{k l} \|_{2} \\ &\leq (1 + \frac{\sqrt{m} (1 - q)}{q}) \| λ^{*} - λ_{k l} \|_{2}, \\ \| μ^{*} - \bar{μ}_{k l} \|_{2} &\leq (1 + \frac{\sqrt{m} (1 - q)}{q}) \| μ^{*} - μ_{k l} \|_{2}, \\ \max_{λ \in \bar{Ω}^{k l} (ε)} \| λ - λ_{k l} \|_{2} &\leq \frac{\sqrt{m} (1 - q)}{q} \| λ^{*} - λ_{k l} \|_{2} \leq \frac{\sqrt{m} (1 - q)}{q} \| λ^{*} - \bar{λ}_{k l} \|_{2} \quad \text{(and similarly for } μ \text{)} . \end{aligned}$$

Thus we have the following result for the A.3.1 term in (A.5).

	$- E Γ_{t}^{k l} + E s_{t} (λ_{k l}, d) \geq$

	$\geq$	$\frac{\underline{L}_{D}}{(1 + \frac{\sqrt{m} (1 - q)}{q})^{2}} \| \bar{λ}_{k l} - λ^{*} \|_{2}^{2} - 2 \frac{\sqrt{m} (1 - q)}{q} \cdot \sqrt{L_{2}} \bar{b}^{2} \lor \bar{L}_{r} \| \bar{λ}_{k l} - λ^{*} \|_{2}^{2}$

So there exists $\underline{q} = \frac{\sqrt{m}}{\sqrt{m} + 1 \land \frac{\underline{L}_{D}}{8 \sqrt{L_{2}} \bar{b}^{2} \lor \bar{L}_{r}}}$ such that when $q \geq \underline{q}$ , $\frac{\sqrt{m} (1 - q)}{q} \leq 1 \land \frac{\underline{L}_{D}}{8 \sqrt{L_{2}} \bar{b}^{2} \lor \bar{L}_{r}}$ , and

\frac{\underline{L}_{D}}{(1 + \frac{\sqrt{m} (1 - q)}{q})^{2}} - 2 \frac{\sqrt{m} (1 - q)}{q} \cdot \sqrt{L_{2}} \bar{b}^{2} \lor \bar{L}_{r} \geq \underline{L}_{D} / 2.

Choose $q = \underline{q} \lor \frac{1}{2}$ . Then for the A.3.1 term in (A.5) we have

(A.9)

For A.3.2, under the event $E_{k l, 1}^{c} (ε)$ in (A.6) we have

$$\begin{aligned} - \frac{1}{T} \sum_{t = 1}^{T} Γ_{t}^{k l} + E Γ_{t}^{k l} &\geq - 2 ε (\sqrt{n} \bar{b} D + \sqrt{m} G) (\max_{λ \in \bar{Ω}^{k l} (ε)} \| λ - λ_{k l} \|_{2} + \| μ - μ_{k l} \|_{2}) \\ &\geq - 2 \sqrt{2} ε (\sqrt{n} \bar{b} D + \sqrt{m} G) \frac{\sqrt{m} (1 - q)}{q} \| \bar{λ}_{k l} - λ^{*} \|_{2} . \end{aligned} \tag{A.10}$$

For A.3.3, under event $E_{k l, 2}^{c} (ε)$ in (A.8) we have

(A.11)

We now combine the first-order lower bound in (A.1) with the second-order lower bounds in (A.9), (A.10) and (A.11) under the desired good event

E (ε) = E_{0}^{c} (ε) \cap \bigcap_{k = 1}^{N} \bigcap_{l} (E_{k l, 1}^{c} (ε) \cap E_{k l, 2}^{c} (ε)),

where we choose $N$ by bounding the radius of $\bar{Ω}^{N 1} (ε)$ : $\sqrt{2 m} q^{N} 4 H ε \leq 2 H ε$ , i.e., $N = \lceil \log_{q} (\frac{1}{2 \sqrt{2 m}}) \rceil$ . Under $E (ε)$ , for any $λ \in Ω_{λ} (ε)$ satisfying $\| λ - λ^{*} \|_{2} > 2 H ε$ , there exist $k \in \{0, \dots, N - 1\}$ and $l$ such that $λ \in \bar{Ω}^{k l} (ε)$ , and

$$\begin{aligned} \bar{D}_{T} (λ, d) - \bar{D}_{T} (λ^{*}, d) ={} & \langle \bar{ϕ}_{T} (λ^{*}, d), λ - λ^{*} \rangle + \bar{s}_{T} (λ, d) \\ \geq{} & \frac{\underline{L}_{D}}{2} \| \bar{λ}_{k l} - λ^{*} \|_{2}^{2} - 2 \sqrt{2} ε (\sqrt{n} \bar{b} D + \sqrt{m} G) \frac{\sqrt{m} (1 - q)}{q} \| \bar{λ}_{k l} - λ^{*} \|_{2} \\ & - 2 \sqrt{2} ε (\sqrt{n} \bar{b} D + \sqrt{m} G) \| \bar{λ}_{k l} - λ^{*} \|_{2} - ε \| \bar{λ}_{k l} - λ^{*} \|_{2} \\ ={} & \frac{\underline{L}_{D}}{2} \| \bar{λ}_{k l} - λ^{*} \|_{2}^{2} - H ε \cdot \underline{L}_{D} \| \bar{λ}_{k l} - λ^{*} \|_{2}, \end{aligned}$$

where $H = (1 + 2 \sqrt{2} (\sqrt{n} \bar{b} D + \sqrt{m} G) (1 + \frac{\sqrt{m} (1 - q)}{q})) / \underline{L}_{D}$ .

Computing the probability of $E (ε)$ , we obtain

$$\begin{aligned} P (E (ε)) &\geq 1 - P (E_{0} (ε)) - \sum_{0 \leq k \leq N - 1, \, l} (P (E_{k l, 1} (ε)) + P (E_{k l, 2} (ε))) \\ &\geq 1 - 4 m \exp (- \frac{T ε^{2}}{4 m c_{1}}) - 2 (2 \lceil \log_{q} (\frac{1}{2 \sqrt{2 m}}) \rceil)^{2 m} \exp (- \frac{T ε^{2}}{2}) . \end{aligned}$$
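The number of layers and the cube-count factor in the bound above are easy to evaluate numerically. The following is a small sketch with hypothetical function names; it only uses the relation $\sqrt{2 m} q^{N} 4 H ε \leq 2 H ε$ (in which $H ε$ cancels) and the $(2 N)^{2 m}$ cube count quoted from huber1967under.

```python
import math


def number_of_layers(m, q):
    """N = ceil(log_q(1 / (2 * sqrt(2m)))) for a shrink factor q in (0, 1)."""
    return math.ceil(math.log(1.0 / (2.0 * math.sqrt(2.0 * m)), q))


def cube_count_bound(m, q):
    """Upper bound (2N)^(2m) on the number of cubes in the split."""
    return (2 * number_of_layers(m, q)) ** (2 * m)


def center_cube_is_small_enough(m, q, n_layers):
    """Check sqrt(2m) * q^N * 4 <= 2, i.e. the radius condition with H*eps cancelled."""
    return math.sqrt(2.0 * m) * q ** n_layers * 4.0 <= 2.0
```

With $m = 6$ and $q = \frac{1}{2}$ , for example, three layers are needed: the radius condition holds for $N = 3$ but fails for $N = 2$ .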

A.5 Proof of Corollary 1

Recall from the proof of Theorem 2 that when $ε$ satisfies $ϵ < 4 H^{2} ε^{2} \underline{L}_{D}$ , with high probability the deterministic $ϵ$ -optimal solution must lie in $Ω_{λ} (ε)$ . Similarly, for the stochastic $ϵ$ -optimal solution, we confine it to a larger region so that, with high probability, $E [\| λ_{T}^{ϵ} - λ_{T}^{*} \|_{2}^{2} \mid \bar{D}_{T}]$ can still be bounded by $ε$ . Notice that although Proposition 2 only focuses on $Ω_{λ} (ε)$ , it also provides information outside $Ω_{λ} (ε)$ . For any $ε$ and $ϵ$ , under the event on which Proposition 2 holds, for any $\bar{D}_{T}$ we have:

If $\bar{D}_{T} (λ_{T}^{ϵ}, d) - \bar{D}_{T} (λ^{*}, d) \leq 4 H^{2} ε^{2} \underline{L}_{D}$ , then $\| λ_{T}^{ϵ} - λ_{T}^{*} \|_{2} \leq 4 H ε$ .
If $\bar{D}_{T} (λ_{T}^{ϵ}, d) - \bar{D}_{T} (λ^{*}, d) > 4 H^{2} ε^{2} \underline{L}_{D}$ , then we have $\| λ_{T}^{ϵ} - λ_{T}^{*} \|_{2} \leq \frac{1}{H ε \underline{L}_{D}} (\bar{D}_{T} (λ_{T}^{ϵ}, d) - \bar{D}_{T} (λ^{*}, d))$ . This holds because the convex function $\bar{D}_{T} (λ, d) - \bar{D}_{T} (λ^{*}, d)$ equals $0$ at $λ = λ^{*}$ and is at least $4 H^{2} ε^{2} \underline{L}_{D}$ when $\| λ - λ^{*} \|_{2} = 4 H ε$ .

We conclude that, under the event on which Proposition 2 holds, for any $ϵ < 4 H^{2} ε^{2} \underline{L}_{D}$ ,

E_{B} [\| λ_{T}^{ϵ} - λ^{*} \|_{2}^{2} \mid \bar{D}_{T}] \leq 16 H^{2} ε^{2} + \frac{2 \sqrt{m (2 \frac{\bar{f} + \bar{r}}{\underline{d}} + G)}}{H ε \underline{L}_{D}} \cdot ϵ

because . The RHS term has a minimum value

z_{0} = 3 \cdot 2^{\frac{4}{3}} ϵ^{\frac{2}{3}} (m (2 \frac{\bar{f} + \bar{r}}{\underline{d}} + G))^{\frac{1}{3}} / \underline{L}_{D}^{\frac{2}{3}}

when $ε_{0} = ϵ^{\frac{1}{3}} (m (2 \frac{\bar{f} + \bar{r}}{\underline{d}} + G))^{\frac{1}{6}} / (2^{\frac{4}{3}} H \underline{L}_{D}^{\frac{1}{3}})$ . When the RHS term is larger than its minimum value, we can always take the corresponding $ε$ on the right branch, where $ε > ε_{0}$ , and it follows that

z = 16 H^{2} ε^{2} + \frac{2 \sqrt{m (2 \frac{\bar{f} + \bar{r}}{\underline{d}} + G)}}{H ε \underline{L}_{D}} \cdot ϵ \leq 48 H^{2} ε^{2} .
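The minimizer $ε_{0}$ and the minimum value $z_{0}$ can be checked numerically. In the sketch below, all names are hypothetical: `tol` stands for the accuracy $ϵ$ , `L` for $\underline{L}_{D}$ , and `M` abbreviates the constant $m (2 \frac{\bar{f} + \bar{r}}{\underline{d}} + G)$ ; the numerical values used in any call are arbitrary and not taken from the paper.

```python
def rhs_bound(eps, tol, H, L, M):
    """RHS as a function of eps: 16 H^2 eps^2 + 2 sqrt(M) tol / (H eps L)."""
    return 16.0 * H ** 2 * eps ** 2 + 2.0 * M ** 0.5 * tol / (H * eps * L)


def minimizing_eps(tol, H, L, M):
    """eps_0 = tol^(1/3) M^(1/6) / (2^(4/3) H L^(1/3))."""
    return tol ** (1.0 / 3.0) * M ** (1.0 / 6.0) / (2.0 ** (4.0 / 3.0) * H * L ** (1.0 / 3.0))


def minimum_value(tol, L, M):
    """z_0 = 3 * 2^(4/3) tol^(2/3) M^(1/3) / L^(2/3)."""
    return 3.0 * 2.0 ** (4.0 / 3.0) * tol ** (2.0 / 3.0) * M ** (1.0 / 3.0) / L ** (2.0 / 3.0)
```

Setting the derivative of the RHS with respect to $ε$ to zero gives $32 H^{3} ε^{3} \underline{L}_{D} = 2 \sqrt{M} ϵ$ , so at $ε_{0}$ the second term equals $32 H^{2} ε_{0}^{2}$ and the total is $48 H^{2} ε_{0}^{2} = z_{0}$ ; for $ε > ε_{0}$ the second term is smaller than $32 H^{2} ε^{2}$ , which is the inequality used above.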

Then, by the tail expectation formula, we have

$$\begin{aligned} E_{B, P} \| λ_{T}^{ϵ} - λ^{*} \|_{2}^{2} &= \int_{0}^{z_{0}} P (E_{B} [\| λ_{T}^{ϵ} - λ^{*} \|_{2}^{2} \mid \bar{D}_{T}] \geq z) d z + \int_{z_{0}}^{\infty} P (E_{B} [\| λ_{T}^{ϵ} - λ^{*} \|_{2}^{2} \mid \bar{D}_{T}] \geq z) d z \\ &\leq z_{0} + \int_{z_{0}}^{\infty} \left[ 4 m \exp (- \frac{T \frac{z}{48 H^{2}}}{4 m c_{1}}) + 2 (2 \lceil \log_{q} (\frac{1}{2 \sqrt{2 m}}) \rceil)^{2 m} \exp (- \frac{T \frac{z}{48 H^{2}}}{2}) \right] d z \\ &\leq z_{0} + 12 \frac{C_{1}}{T} . \end{aligned}$$

A.6 Proof of Proposition 3

By Fenchel conjugacy, the definition of $\hat{μ}_{T}$ implies

$$\begin{aligned} r (\frac{\sum_{t = 1}^{T} b_{t} x_{t}}{T}) + \hat{μ}_{T}^{⊤} \frac{\sum_{t = 1}^{T} b_{t} x_{t}}{T} &= \max_{a} r (a) + \hat{μ}_{T}^{⊤} a = r^{*} (- \hat{μ}_{T}) \geq r (\frac{\sum_{t = 1}^{T} a_{t}}{T}) + \hat{μ}_{T}^{⊤} \frac{\sum_{t = 1}^{T} a_{t}}{T} \\ &\geq \frac{\sum_{t = 1}^{T} r (a_{t})}{T} + \hat{μ}_{T}^{⊤} \frac{\sum_{t = 1}^{T} a_{t}}{T} . \end{aligned}$$

Combined with $R (A | P) = E_{A, P} [\sum_{t = 1}^{T} f_{t} (x_{t}) + T \cdot r (\frac{\sum_{t = 1}^{T} b_{t} x_{t}}{T})]$ , we have

R (A | P) \geq E [\sum_{t = 1}^{T} f_{t} (x_{t}) + \sum_{t = 1}^{T} r (a_{t}) + \hat{μ}_{T}^{⊤} (\sum_{t = 1}^{T} (a_{t} - b_{t} x_{t}))] .

Assumption 2 implies that

\| \hat{μ}_{T} \|_{2} \leq \sqrt{m} G, \quad \text{and} \quad \| a_{t} \|_{2} = \| \nabla r^{*} (- μ_{t}) \|_{2} \leq \sqrt{n} D \bar{b} .

Thus

	$R (A | P) \geq E$	$[\sum_{t = 1}^{τ} [f_{t} (\tilde{x}_{t} (λ_{t - 1})) + r (a_{t})] - (\bar{f} + \bar{r} + 2 \sqrt{m n} G D \bar{b}) (T - τ) + \langle \hat{μ}_{T}, \sum_{t = 1}^{τ} (a_{t} - b_{t} x_{t}) \rangle]$
	$=$
	$-$	$E [(\bar{f} + \bar{r} + 2 C_{3}) (T - τ) - \langle λ^{*}, \sum_{t = 1}^{τ} (d - b_{t} x_{t}) \rangle]$

Combined with (5.2), we conclude the proof.

A.7 Proof of Lemma 4

We start with a lemma on the continuity of dual optimal solution to prove Lemma 4.

Lemma 8 (Continuity of dual optimal solution (li2021online)).

Under Assumptions 1, 2 and 3, consider the stochastic program $\min_{μ, λ \succeq 0} D (λ, d^{'}) = E f_{t}^{*} (b_{t}^{⊤} (μ + λ)) + r^{*} (- μ) + {d^{'}}^{⊤} λ$ . For $d^{'}$ equal to $d_{1}^{'}, d_{2}^{'} \in Ω_{d}$ respectively, the corresponding optimal solutions $λ^{*} (d_{1}^{'})$ and $λ^{*} (d_{2}^{'})$ satisfy

\| λ^{*} (d_{1}^{'}) - λ^{*} (d_{2}^{'}) \|_{2}^{2} \leq \frac{1}{4 \underline{L}_{D}^{2}} \| d_{1}^{'} - d_{2}^{'} \|_{2}^{2} .

If further $d_{1}^{'}, d_{2}^{'}$ identify the same binding/non-binding dimensions, then

\| λ^{*} (d_{1}^{'}) - λ^{*} (d_{2}^{'}) \|_{2}^{2} \leq \frac{1}{4 \underline{L}_{D}^{2}} \sum_{i \in I_{B}} (d_{1 i}^{'} - d_{2 i}^{'})^{2},

where the binding dimension $I_{B}$ is with respect to $d_{1}^{'}$ and $d_{2}^{'}$ .

Proof (Lemma 8).

By Proposition 1 and the uniform assumption on $d$ , we have

$$\begin{aligned} D (λ^{*} (d_{2}^{'}), d_{1}^{'}) - D (λ^{*} (d_{1}^{'}), d_{1}^{'}) &\geq \underline{L}_{D} \| λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'}) \|_{2}^{2}, \\ D (λ^{*} (d_{1}^{'}), d_{2}^{'}) - D (λ^{*} (d_{2}^{'}), d_{2}^{'}) &\geq \underline{L}_{D} \| λ^{*} (d_{1}^{'}) - λ^{*} (d_{2}^{'}) \|_{2}^{2} . \end{aligned}$$

Summing the two inequalities, we have

$$(d_{1}^{'} - d_{2}^{'})^{⊤} (λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'})) \geq 2 \underline{L}_{D} \| λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'}) \|_{2}^{2}, \tag{A.12}$$

or equivalently, $\sum_{i \in I_{B}} (d_{1 i}^{'} - d_{2 i}^{'}) (λ_{i}^{*} (d_{2}^{'}) - λ_{i}^{*} (d_{1}^{'})) \geq 2 \underline{L}_{D} \| λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'}) \|_{2}^{2}$ if, in addition, $d_{1}^{'}, d_{2}^{'}$ share the same binding/non-binding dimensions. From (A.12), the Cauchy-Schwarz inequality gives

$$\begin{aligned} 2 \underline{L}_{D} \| λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'}) \|_{2}^{2} &\leq (d_{1}^{'} - d_{2}^{'})^{⊤} (λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'})) \leq \| d_{1}^{'} - d_{2}^{'} \|_{2} \| λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'}) \|_{2}, \\ \| λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'}) \|_{2} &\leq \frac{1}{2 \underline{L}_{D}} \| d_{1}^{'} - d_{2}^{'} \|_{2} . \end{aligned}$$

This gives the first statement. For the second statement, we focus on the binding dimensions:

$$\begin{aligned} 2 \underline{L}_{D} \| λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'}) \|_{2}^{2} &\leq \sum_{i \in I_{B}} (d_{1 i}^{'} - d_{2 i}^{'}) (λ_{i}^{*} (d_{2}^{'}) - λ_{i}^{*} (d_{1}^{'})) \\ &\leq \sqrt{\sum_{i \in I_{B}} (d_{1 i}^{'} - d_{2 i}^{'})^{2}} \sqrt{\sum_{i \in I_{B}} (λ_{i}^{*} (d_{2}^{'}) - λ_{i}^{*} (d_{1}^{'}))^{2}} \\ &\leq \sqrt{\sum_{i \in I_{B}} (d_{1 i}^{'} - d_{2 i}^{'})^{2}} \, \| λ^{*} (d_{2}^{'}) - λ^{*} (d_{1}^{'}) \|_{2}, \end{aligned}$$

which completes the proof of Lemma 8.

We now return to Lemma 4 and consider the original constraints $d$ and their binding/non-binding dimensions: $I_{B} = \{i \mid d_{i} - E (b_{t} \tilde{x}_{t} (λ^{*}))_{i} = 0\}$ and $I_{NB} = \{i \mid d_{i} - E (b_{t} \tilde{x}_{t} (λ^{*}))_{i} > 0\}$ . Here we write the optimal solution to $\min_{μ, λ \succeq 0} D (λ, d)$ as $λ^{*}$ , and write $λ^{*} (d^{'})$ when $d$ is changed to $d^{'}$ . If $i \in I_{B}$ and $i$ becomes a non-binding dimension under $d^{'}$ , then by Lemma 8 we have

(A.13)

where $\underline{λ} = \min \{λ_{i}^{*} \mid i \in I_{B}\}$ . If, on the other hand, $i \in I_{NB}$ and $i$ becomes a binding dimension under $d^{'}$ , then by Assumption 1 we have

$$\begin{aligned} E \| λ^{*} (d^{'}) - λ^{*} \|_{2} &\geq \frac{1}{2 \bar{b}^{2} \bar{L}_{f}} | E (b_{t} \tilde{x}_{t} (λ^{*} (d^{'})))_{i} - E (b_{t} \tilde{x}_{t} (λ^{*}))_{i} | = \frac{1}{2 \bar{b}^{2} \bar{L}_{f}} | d_{i}^{'} - E (b_{t} \tilde{x}_{t} (λ^{*}))_{i} | \\ &\geq \frac{1}{2 \bar{b}^{2} \bar{L}_{f}} (| d_{i} - E (b_{t} \tilde{x}_{t} (λ^{*}))_{i} | - | d_{i}^{'} - d_{i} |) . \end{aligned}$$

Denote the minimum remaining resource over the non-binding dimensions by

γ = \min_{i \in I_{NB}} \{d_{i} - E (b_{t} \tilde{x}_{t} (λ^{*}))_{i}\} .

By Lemma 8 we have

\geq \frac{\underline{L}_{D}}{\bar{b}^{2} \bar{L}_{f}} (γ - | d_{i}^{'} - d_{i} |) \geq \frac{\underline{L}_{D}}{\bar{b}^{2} \bar{L}_{f}} (γ - \| d - d^{'} \|_{2}),

i.e., $\| d - d^{'} \|_{2} \geq \frac{γ \underline{L}_{D}}{\underline{L}_{D} + \bar{b}^{2} \bar{L}_{f}}$ . Combined with (A.13), taking $δ_{d} = \frac{1}{\sqrt{m}} \cdot (\frac{γ \underline{L}_{D}}{\underline{L}_{D} + \bar{b}^{2} \bar{L}_{f}}) \land (2 \underline{L}_{D} \underline{λ})$ , we conclude that when $| d_{i} - d_{i}^{'} | \leq δ_{d}$ , the binding/non-binding dimensions never change. Moreover, enlarging the constraint in a non-binding dimension never turns it into a binding dimension. So, for the non-binding dimensions, $d_{i}^{'} - d_{i}$ can be arbitrarily large. This finishes the proof.

A.8 Proof of Lemma 5

By the definition of the stopping time $τ$ and Condition 1, the first term on the RHS satisfies

E [\sum_{t = 1}^{τ} \| λ_{t - 1} - λ^{*} (d_{t - 1}) \|_{2}^{2}] \leq \sum_{t = 1}^{T} 2 C_{2} \frac{1}{t} \text{ or } C_{2} \frac{1}{T - t + 1} \leq 2 C_{2} (\log T + 1) .

For the second term, we apply Lemma 8:

2 E [\sum_{t = 1}^{τ} \| λ^{*} (d_{t - 1}) - λ^{*} \|_{2}^{2}] \leq \frac{1}{2 \underline{L}_{D}^{2}} E [\sum_{t = 1}^{τ} \sum_{i \in I_{B}} (d_{i t} - d_{i})^{2}] .

Thus we transform the perturbation of $λ^{*} (d_{t})$ into the deviation of $d_{t}$ in the binding dimensions.

To ease our analysis, we define a new sequence $d_{t}^{'}$

d_{t}^{'} = \begin{cases} d_{t}, & \text{if } t \leq τ, \\ d_{t - 1}, & \text{if } t > τ, \end{cases}

which shares the same stopping time with $d_{t}$ and define for $i \in [m]$ as the stopping time on each dimension with $τ = min {τ_{1}, . . ., τ_{m}}$ .

We first consider the binding dimensions. For any $i \in I_{B}$ , we follow a similar procedure in li2021online to derive:

	$d_{i, t + 1}^{'}$	$= d_{i t}^{'} + \frac{d_{i t}^{'} - {(b_{t + 1} {~ x}_{t + 1} (λ_{t}))}_{i}}{T - t - 1} I (τ > t)$
	$E {(d_{i, t + 1}^{'} - d_{i})}_{i}^{2}$	$= E {(d_{i t}^{'} - d_{i})}_{i}^{2} + E \frac{{(d_{i t}^{'} - {(b_{t + 1} {~ x}_{t + 1} (λ_{t - 1}))}_{i})}_{i}^{2}}{(T - t - 1)^{2}} I (τ > t)      A^{'}$
		$+ 2 E \frac{(d_{i t}^{'} - d_{i}) (d_{i t}^{'} - {(b_{t + 1} {~ x}_{t + 1} (λ^{} (d_{t})))}_{i}^{})}{T - t - 1} I (τ > t)      B^{'}$
		$+ 2 E \frac{(d_{i t}^{'} - d_{i}) ({(b_{t + 1} {~ x}_{t + 1} (λ^{} (d_{t})) - b_{t + 1} {~ x}_{t + 1} (λ_{t}))}_{i}^{})}{T - t - 1} I (τ > t)      C^{'}$

For the term $A^{'}$ we have $A^{'} \leq \frac{{(¯ d + \sqrt{n} D ¯ b)}^{2}}{(T - t - 1)^{2}}$ . For the term $B^{'}$ , since $i \in I_{B}$ and $d_{t} \in σ (H_{t})$ , conditioned on past history $H_{t}$ , we always have $E [(d_{i t}^{'} - d_{i}) (d_{i t}^{'} - {(b_{t + 1} {~ x}_{t + 1} (λ^{*} (d_{t})))}_{i}^{*}) I (τ > t) ∣ ∣ H_{t}] = 0$ , thus $B^{'} = 0$ . For the term $C^{'}$ , we apply Assumption 4 and Condition 1:

$$
\begin{aligned}
C^{'} &\leq \frac{2 \sqrt{E (d_{it}^{'} - d_{i})^{2}} \sqrt{E \|b_{t+1} \tilde{x}_{t+1}(\lambda^{*}(d_{t})) - b_{t+1} \tilde{x}_{t+1}(\lambda_{t})\|_{2}^{2}}}{T-t-1} \\
&\leq \frac{2 \sqrt{E (d_{it}^{'} - d_{i})^{2}} \sqrt{2 \bar{b}^{4} L_{2} E \|\lambda_{t} - \lambda^{*}(d_{t})\|_{2}^{2}}}{T-t-1} \leq \frac{2 \sqrt{2 C_{2} L_{2}}\, \bar{b}^{2} \sqrt{\frac{1}{t+1} + \frac{1}{T-t}} \sqrt{E (d_{it}^{'} - d_{i})^{2}}}{T-t-1}.
\end{aligned}
$$

Here the second inequality follows from Assumption 4, and the third from Condition 1. In the derivation, we can treat $\{\lambda_{t}\}$ as a new sequence generated by $\{d_{t}^{'}\}$, which coincides with the original one when $t \leq \tau$ and takes $\lambda_{t} = B_{t}(H_{t}, d_{t}^{'})$ when $t > \tau$. We then obtain the recurrence relation for $d_{it}^{'} - d_{i}$:

$$E (d_{i,t+1}^{'} - d_{i})^{2} \leq E (d_{it}^{'} - d_{i})^{2} + \frac{(\bar{d} + \sqrt{n} D \bar{b})^{2}}{(T-t-1)^{2}} + \frac{2 \sqrt{2 C_{2} L_{2}}\, \bar{b}^{2} \sqrt{\frac{1}{t+1} + \frac{1}{T-t}} \sqrt{E (d_{it}^{'} - d_{i})^{2}}}{T-t-1}.$$

Since $d_{0} = d$, by induction we have $E (d_{it}^{'} - d_{i})^{2} \leq C_{3} \frac{t+1}{(T+1)(T-t)}$, where

$$C_{3} = \left(2 \cdot (\bar{d} + \sqrt{n} D \bar{b})^{2} \lor (2 \sqrt{2 C_{2} L_{2}}\, \bar{b}^{2}) + 1\right)^{2}.$$

So, we have

$$
\begin{aligned}
2 E\left[\sum_{t=1}^{\tau} \|\lambda^{*}(d_{t-1}) - \lambda^{*}\|_{2}^{2}\right] &\leq 2 E\left[\sum_{t=1}^{\tau} \sum_{i \in I_{B}} (d_{i,t-1} - d_{i})^{2}\right] \\
&\leq 2 m E \sum_{t=1}^{T} \left[(d_{i,t+1}^{'} - d_{i})^{2}\right] \leq 2 m C_{3} \log T, \quad \text{and} \\
E\left[\sum_{t=1}^{\tau} \|\lambda_{t-1} - \lambda^{*}\|_{2}^{2}\right] &\leq
\end{aligned}
$$

which completes the proof.

A.9 Proof of Lemma 6

Since $\tau = \min\{\tau_{1}, \ldots, \tau_{m}\}$, we only need to show $E(T - \tau_{i}) \leq C \log T$ for any $i$, in both the binding and the non-binding dimensions.

For the binding dimensions, applying Chebyshev’s inequality, we have

$$
\begin{aligned}
E(T - \tau_{i}) \leq \sum_{t=1}^{T} P(\tau_{i} \leq t) &\leq 1 + \frac{\sqrt{n} D \bar{b}}{\underline{d}} + \sum_{t=1}^{T} P(|d_{it}^{'} - d_{i}| \geq \delta_{d}) \\
&\leq 1 + \frac{\sqrt{n} D \bar{b}}{\underline{d}} + \sum_{t=1}^{T} \frac{E (d_{it}^{'} - d_{i})^{2}}{\delta_{d}^{2}} \\
&\leq 1 + \frac{\sqrt{n} D \bar{b}}{\underline{d}} + \frac{C_{3}}{\delta_{d}^{2}} \log T.
\end{aligned} \tag{A.14}
$$

For the non-binding dimensions, $\mathcal{D}$ ensures that the binding/non-binding dimensions remain unchanged when $d^{'} \in \mathcal{D}$. Then for $d^{'} \in \mathcal{D}$, we define

$$\tilde{d}_{i}^{'} = \begin{cases} d_{i}^{'}, & \text{if } i \in I_{B}, \\ d_{i} - \delta_{d}, & \text{if } i \in I_{NB}. \end{cases}$$

We know that $λ^{*} (d^{'}) = λ^{*} ({~ d}^{'})$ because the non-binding constraints are loose, then

Recall that $\tilde{x}_{t}(\cdot) \perp\!\!\!\perp H_{t-1}$, thus $E[(b_{t} \tilde{x}_{t}(\lambda^{*}(d_{t-1})))_{i} \mid H_{t-1}] < d_{i} - \delta_{d}$ for $i \in I_{NB}$ and $d_{t-1} \in \mathcal{D}$. This implies that

$$
\begin{aligned}
P(\tau_{i} \leq t) &= P\Big(\sum_{j=1}^{t^{'}} (b_{j} \tilde{x}_{j}(\lambda_{j-1}))_{i} \geq t^{'} (d_{i} - \delta_{d}) + T \delta_{d} \ \text{for some } 1 \leq t^{'} \leq t\Big) \\
&\leq P\Big(\sum_{j=1}^{t^{'}} \big[(b_{j} \tilde{x}_{j}(\lambda_{j-1}))_{i} - E[(b_{j} \tilde{x}_{j}(\lambda^{*}(d_{j-1})))_{i} \mid H_{j-1}]\big] \geq T \delta_{d} \ \text{for some } 1 \leq t^{'} \leq t\Big) \\
&\leq P\Big(\sum_{j=1}^{t^{'}} \big[(b_{j} \tilde{x}_{j}(\lambda_{j-1}))_{i} - E[(b_{j} \tilde{x}_{j}(\lambda_{j-1}))_{i} \mid H_{j-1}]\big] \\
&\qquad + \sum_{j=1}^{t^{'}} \big| E[(b_{j} \tilde{x}_{j}(\lambda^{*}(d_{j-1})))_{i} \mid H_{j-1}] - E[(b_{j} \tilde{x}_{j}(\lambda_{j-1}))_{i} \mid H_{j-1}] \big| \geq T \delta_{d} \ \text{for some } 1 \leq t^{'} \leq t\Big) \\
&\leq P\Big(\sum_{j=1}^{t^{'}} \big[(b_{j} \tilde{x}_{j}(\lambda_{j-1}))_{i} - E[(b_{j} \tilde{x}_{j}(\lambda_{j-1}))_{i} \mid H_{j-1}]\big] \geq \frac{T \delta_{d}}{2} \ \text{for some } 1 \leq t^{'} \leq t\Big).
\end{aligned}
$$

Since the sequences in the last two lines are martingales/sub-martingales, Doob’s martingale inequality gives:

$$
\begin{aligned}
P(\tau_{i} \leq t) &\leq \frac{4}{T^{2} \delta_{d}^{2}} \sum_{j=1}^{t} E \big[(b_{j} \tilde{x}_{j}(\lambda_{j-1}))_{i} - E[(b_{j} \tilde{x}_{j}(\lambda_{j-1}))_{i} \mid H_{j-1}]\big]^{2} \\
&\leq \frac{16 n \bar{b}^{2} D^{2} t}{T^{2} \delta_{d}^{2}} + \frac{16 n \bar{b}^{2} D^{2} C_{2} t}{T^{2} \delta_{d}^{2}} (\log t + \log T - \log(T-t+1) + 2).
\end{aligned}
$$

We now return to bounding $E(T - \tau_{i})$:

$$
\begin{aligned}
E(T - \tau_{i}) &\leq 1 + \frac{\sqrt{n} D \bar{b}}{\underline{d}} + \sum_{t=1}^{T} P(\tau_{i} \leq t) \\
&\leq 1 + \frac{\sqrt{n} D \bar{b}}{\underline{d}} + \sum_{t=1}^{T} \left[\frac{16 n \bar{b}^{2} D^{2} t}{T^{2} \delta_{d}^{2}} + \frac{16 n \bar{b}^{2} D^{2} C_{2} t}{T^{2} \delta_{d}^{2}} (\log t + \log T + 2)\right] \\
&\leq 1 + \frac{\sqrt{n} D \bar{b}}{\underline{d}} + \frac{8 n \bar{b}^{2} D^{2} (1 + 2 C_{2})}{\delta_{d}^{2}} + \frac{16 n \bar{b}^{2} D^{2} \log T}{\delta_{d}^{2}}.
\end{aligned} \tag{A.15}
$$

Putting together (A.14) and (A.15), we conclude the proof of Lemma 6.

A.10 Proof of Lemma 7

For $E \|\hat{\mu}_{T} - \mu^{*}\|_{2}^{2}$, the optimality of $\mu^{*}$ implies $\tilde{a}(\mu^{*}) = E b_{t} \tilde{x}_{t}(\lambda^{*})$; thus by conjugacy we have

$$
\begin{aligned}
E \|\hat{\mu}_{T} - \mu^{*}\|_{2}^{2} &= E \left\| \nabla r(\tilde{a}(\mu^{*})) - \nabla r\Big(\frac{\sum_{t=1}^{T} b_{t} x_{t}}{T}\Big) \right\|_{2}^{2} \leq E \Big(\frac{1}{\underline{L}_{r}}\Big)^{2} \left\| \tilde{a}(\mu^{*}) - \frac{\sum_{t=1}^{T} b_{t} x_{t}}{T} \right\|_{2}^{2} \\
&= \Big(\frac{1}{\underline{L}_{r}}\Big)^{2} E \left\| \frac{\sum_{t=1}^{T} b_{t} x_{t}}{T} - \frac{\sum_{t=1}^{T} b_{t} \tilde{x}_{t}(\lambda^{*})}{T} + \frac{\sum_{t=1}^{T} b_{t} \tilde{x}_{t}(\lambda^{*})}{T} - E b_{t} \tilde{x}_{t}(\lambda^{*}) \right\|_{2}^{2} \\
&\leq
\end{aligned}
$$

For part A.10.1, applying Assumption 4 yields

	$E$
$$
\begin{aligned}
&\leq \frac{E \sum_{t=1}^{\tau} \|b_{t} \tilde{x}_{t}(\lambda_{t-1}) - b_{t} \tilde{x}_{t}(\lambda^{*})\|_{2}^{2}}{T} = \frac{E \sum_{t=1}^{T} \|b_{t} \tilde{x}_{t}(\lambda_{t-1}) - b_{t} \tilde{x}_{t}(\lambda^{*})\|_{2}^{2} I(t \leq \tau)}{T} \\
&\overset{(a)}{\leq} L_{2} \bar{b}^{2} \frac{\sum_{t=1}^{T} E[\|\lambda_{t-1} - \lambda^{*}\|_{2}^{2} I(t \leq \tau)]}{T} = L_{2} \bar{b}^{2} \frac{E[\sum_{t=1}^{\tau} \|\lambda_{t-1} - \lambda^{*}\|_{2}^{2}]}{T} \\
&\overset{(b)}{\leq} O\Big(\frac{\log T}{T}\Big).
\end{aligned}
$$

Here (a) is by Assumption 4 together with the facts that $\{t \leq \tau\} \in \sigma(H_{t-1})$ and $\tilde{x}_{t}(\cdot) \perp\!\!\!\perp \lambda_{t-1}$, and (b) is by Lemma 5.

For part A.10.2, since $b_{t} \tilde{x}_{t}(\lambda^{*})$ is bounded by $|(b_{t} \tilde{x}_{t}(\lambda^{*}))_{i}| \leq \sqrt{n} D \bar{b}$, using Hoeffding’s inequality on each dimension we have

$$P\left( \left\| \frac{\sum_{t=1}^{T} b_{t} \tilde{x}_{t}(\lambda^{*})}{T} - E b_{t} \tilde{x}_{t}(\lambda^{*}) \right\|_{2} > \varepsilon \right) \leq m \exp\Big(- \frac{2 \varepsilon^{2} T}{m n D^{2} \bar{b}^{2}}\Big).$$

Thus

$$
\begin{aligned}
E \left\| \frac{\sum_{t=1}^{T} b_{t} \tilde{x}_{t}(\lambda^{*})}{T} - E b_{t} \tilde{x}_{t}(\lambda^{*}) \right\|_{2}^{2} &= \int_{0}^{\infty} P\left( \left\| \frac{\sum_{t=1}^{T} b_{t} \tilde{x}_{t}(\lambda^{*})}{T} - E b_{t} \tilde{x}_{t}(\lambda^{*}) \right\|_{2}^{2} > \varepsilon \right) d\varepsilon \\
&\leq \int_{0}^{\infty} m \exp\Big(- \frac{2 \varepsilon T}{m n D^{2} \bar{b}^{2}}\Big) d\varepsilon \leq \frac{m^{2} n D^{2} \bar{b}^{2}}{2 T}.
\end{aligned}
$$
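For completeness, the final integral can be evaluated in closed form. Writing $\alpha = \frac{2T}{m n D^{2} \bar{b}^{2}}$ (a shorthand introduced only for this remark), the step reads:

$$\int_{0}^{\infty} m e^{-\alpha \varepsilon}\, d\varepsilon = \frac{m}{\alpha} = m \cdot \frac{m n D^{2} \bar{b}^{2}}{2T} = \frac{m^{2} n D^{2} \bar{b}^{2}}{2T},$$

so this part contributes a term of order $O(1/T)$, which is dominated by the $O(\log T / T)$ terms from the other two parts.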

For part A.10.3, since $\|b_{t} x_{t} - E b_{t} \tilde{x}_{t}(\lambda^{*})\|_{2}^{2} \leq n D^{2} \bar{b}^{2}$, by Lemma 6 we have

$$
\begin{aligned}
E \left\| \frac{\sum_{t=\tau+1}^{T} (b_{t} x_{t} - b_{t} \tilde{x}_{t}(\lambda^{*}))}{T} \right\|_{2}^{2} &\leq \frac{E\big[(T - \tau) \sum_{t=\tau+1}^{T} \|b_{t} x_{t} - b_{t} \tilde{x}_{t}(\lambda^{*})\|_{2}^{2}\big]}{T^{2}} \\
&\leq n D^{2} \bar{b}^{2} \frac{E(T - \tau)}{T} \leq O\Big(\frac{\log T}{T}\Big).
\end{aligned}
$$

We then return to control the next term $E \big\| \sum_{t=1}^{\tau} (a_{t} - b_{t} x_{t}) \big\|_{2}^{2}$:

$$
\begin{aligned}
& E \left\| \sum_{t=1}^{\tau} \big(a_{t} - \tilde{a}(\mu^{*}) + E b_{t} \tilde{x}_{t}(\lambda^{*}) - b_{t} x_{t}\big) \right\|_{2}^{2} \\
&\leq 2 E\, \tau \sum_{t=1}^{\tau} \bar{L}_{r} \|\mu_{t} - \mu^{*}\|_{2}^{2} + 2 T^{2} E \left\| \frac{1}{T^{2}} \sum_{t=1}^{\tau} \big(E b_{t} \tilde{x}_{t}(\lambda^{*}) - b_{t} x_{t}\big) \right\|_{2}^{2}.
\end{aligned}
$$

From the argument above, the first term is controlled by $O(T \log T)$, and the second term can likewise be controlled by $O(T \log T)$ (the proof follows the derivation of parts A.10.1 to A.10.3). This completes the proof.

A.11 Proof of Theorems 4 and 5

We specify a non-regularized case where $f_{t}(x) = - \frac{1}{4} (x - 2 \xi_{t})^{2} + \xi_{t}^{2}$, with fixed cost $b_{t} = 1$, average resource capacity $d = \frac{1}{2} D$, and $\xi_{t}$ following the two-point distribution $P(\xi_{t} = \frac{1}{2} D) = P(\xi_{t} = \frac{3}{4} D) = \frac{1}{2}$. Then the dual problem is

$$D_{t}(\lambda) = \begin{cases} \frac{1}{2} D \lambda, & \text{if } \lambda > \xi_{t}, \\ - \frac{1}{4} D^{2} + \xi_{t} D - \frac{1}{2} D \lambda, & \text{if } \lambda < \xi_{t} - \frac{1}{2} D, \\ \lambda^{2} - 2 (\xi_{t} - \frac{1}{4} D) \lambda + \xi_{t}^{2}, & \text{if } \xi_{t} - \frac{1}{2} D \leq \lambda \leq \xi_{t}. \end{cases}$$

Suppose $\lambda^{*}$ is the optimal solution to the deterministic problem $\min_{\lambda \geq 0} D(\lambda) = E D_{t}(\lambda)$. Without loss of generality, we assume that the dual variable $\lambda$ is taken within $[\frac{1}{4} D, \frac{1}{2} D]$, since we know that $\lambda^{*} = E \xi_{t} - \frac{1}{4} D = \frac{3}{8} D \in [\frac{1}{4} D, \frac{1}{2} D]$. On this interval the quadratic branch is active for both values of $\xi_{t}$, so that

D_{t} (λ) = f_{t}^{*} (λ) + d^{⊤} λ = λ^{2} - 2 (ξ_{t} - \frac{1}{4} D) λ + ξ_{t}^{2} .
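As a sanity check on the dual function of this example, the sketch below compares a brute-force evaluation of $\max_{x \in [0, D]} \{f_{t}(x) - \lambda x\} + \frac{1}{2} D \lambda$ with a piecewise closed form. The closed form used in the code (in particular the branch $-\frac{1}{4} D^{2} + \xi_{t} D - \frac{1}{2} D \lambda$ for $\lambda < \xi_{t} - \frac{1}{2} D$) is derived here by substituting $x = D$ and is an assumption of this illustration; the function names and the grid size are likewise illustrative.

```python
def f(x: float, xi: float) -> float:
    """Reward f_t(x) = -(1/4) (x - 2 xi)^2 + xi^2 from the example."""
    return -0.25 * (x - 2.0 * xi) ** 2 + xi ** 2


def dual_numeric(lam: float, xi: float, D: float, grid: int = 20001) -> float:
    """Brute-force D_t(lam) = max_{x in [0, D]} {f(x) - lam * x} + (D / 2) * lam."""
    best = max(
        f(D * k / (grid - 1), xi) - lam * (D * k / (grid - 1)) for k in range(grid)
    )
    return best + 0.5 * D * lam


def dual_closed_form(lam: float, xi: float, D: float) -> float:
    """Piecewise closed form: x = 0, x = D, or the interior maximizer 2 xi - 2 lam."""
    if lam > xi:                      # maximizer clipped at x = 0
        return 0.5 * D * lam
    if lam < xi - 0.5 * D:            # maximizer clipped at x = D
        return -0.25 * D ** 2 + xi * D - 0.5 * D * lam
    return lam ** 2 - 2.0 * (xi - 0.25 * D) * lam + xi ** 2


def max_gap(D: float = 1.0) -> float:
    """Largest discrepancy over both values of xi and a grid of lam in [0, D]."""
    gaps = []
    for xi in (0.5 * D, 0.75 * D):
        for k in range(21):
            lam = D * k / 20
            gaps.append(abs(dual_numeric(lam, xi, D) - dual_closed_form(lam, xi, D)))
    return max(gaps)


if __name__ == "__main__":
    print(max_gap())
```

For $\lambda \in [\frac{1}{4} D, \frac{1}{2} D]$ only the interior branch is hit for either value of $\xi_{t}$, which is why the proof can work with the quadratic expression alone.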

For the dual-based policy $\{\lambda_{t}\}_{t=0}^{T-1}$, the corresponding primal variable is $x_{t} = \tilde{x}_{t}(\lambda_{t-1}) = 2 \xi_{t} - 2 \lambda_{t-1}$, or void if the resource is depleted. We have the following regret:

$$
\begin{aligned}
\text{Regret}(A) &= R^{*}(P) - R(A \mid P) \\
&= E\left[\max_{x_{t} \in [0, D]} \Big\{\sum_{t=1}^{T} f_{t}(x_{t}) \ \text{s.t.} \ \sum_{t=1}^{T} x_{t} \leq \frac{1}{2} D T\Big\}\right] - E\left[\sum_{t=1}^{T} f_{t}(x_{t})\right] \\
&= E\left[\min_{\lambda \geq 0} \Big\{\sum_{t=1}^{T} D_{t}(\lambda)\Big\}\right] - E\left[\sum_{t=1}^{T} f_{t}(x_{t})\right] \\
&= E\left[\sum_{t=1}^{T} D_{t}(\lambda_{T}^{*})\right] - E\left[\sum_{t=1}^{T} f_{t}(x_{t})\right].
\end{aligned}
$$

Define the corresponding function

$$g(\lambda) = E\big[f_{t}(\tilde{x}_{t}(\lambda)) + \langle d - b_{t} \tilde{x}_{t}(\lambda), \lambda^{*} \rangle\big] = D(\lambda) - \langle \nabla D(\lambda), \lambda - \lambda^{*} \rangle.$$

We have $g (λ^{*}) = D (λ^{*})$ and $g (λ^{*}) - g (λ) = (λ^{*} - λ)^{2}$ . For the quadratic function $D_{t}$ , we always have $D_{t} (λ_{1}) - D_{t} (λ_{2}) = \nabla D_{t} (λ_{2}) (λ_{1} - λ_{2}) + (λ_{1} - λ_{2})^{2}$ . Thus it follows that
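The identity $g(\lambda^{*}) - g(\lambda) = (\lambda^{*} - \lambda)^{2}$ can be checked directly in this example. Assuming $\lambda$ stays in $[\frac{1}{4} D, \frac{1}{2} D]$, so that $D(\lambda) = \lambda^{2} - 2 \lambda^{*} \lambda + E \xi_{t}^{2}$ with $\lambda^{*} = E \xi_{t} - \frac{1}{4} D$, a short worked step is:

$$
\begin{aligned}
\nabla D(\lambda) &= 2 (\lambda - \lambda^{*}), \qquad D(\lambda) - D(\lambda^{*}) = (\lambda - \lambda^{*})^{2}, \\
g(\lambda) &= D(\lambda) - 2 (\lambda - \lambda^{*})^{2} = D(\lambda^{*}) - (\lambda - \lambda^{*})^{2}, \\
g(\lambda^{*}) - g(\lambda) &= (\lambda^{*} - \lambda)^{2}.
\end{aligned}
$$

In words, $g$ is the mirror image of $D$ around its minimum value, so any deviation of the dual variable from $\lambda^{*}$ is charged quadratically.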

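$$
\begin{aligned}
\text{Regret}(A) &= E\left[\sum_{t=1}^{T} D_{t}(\lambda_{T}^{*})\right] - T D(\lambda^{*}) + T D(\lambda^{*}) - E\left[\sum_{t=1}^{T} f_{t}(x_{t})\right] \\
&= E\left[\sum_{t=1}^{T} D_{t}(\lambda_{T}^{*}) - D_{t}(\lambda^{*})\right] + T g(\lambda^{*}) - E\left[\sum_{t=1}^{T} f_{t}(x_{t})\right] \\
&= - E\left[\sum_{t=1}^{T} \big[\nabla D_{t}(\lambda_{T}^{*}) (\lambda^{*} - \lambda_{T}^{*})\big] + T (\lambda^{*} - \lambda_{T}^{*})^{2}\right] + T g(\lambda^{*}) - E\left[\sum_{t=1}^{T} f_{t}(x_{t})\right] \\
&= - T E (\lambda^{*} - \lambda_{T}^{*})^{2} + T g(\lambda^{*}) - E\left[\sum_{t=1}^{T} f_{t}(x_{t})\right].
\end{aligned}
$$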

By the dual convergence in Theorem 1, we know that the first term $T E (\lambda^{*} - \lambda_{T}^{*})^{2}$ can be bounded by a constant. Now we handle the second term by controlling the stopping time. Define the stopping time $\tau_{0} = \min\big(\{t \in [T] \mid \sum_{i=1}^{t} x_{i} \geq \frac{1}{2} D T - D\} \cup \{T\}\big)$. Then when $t \leq \tau_{0}$, we always have $x_{t} = \tilde{x}_{t}(\lambda_{t-1}) = 2 \xi_{t} - 2 \lambda_{t-1}$, and the consumption after $\tau_{0}$ satisfies $0 \leq \sum_{t=\tau_{0}+1}^{T} x_{t} \leq D$. Then we have

$$
\begin{aligned}
E\left[\sum_{t=1}^{T} f_{t}(x_{t})\right] &\leq E\left[\sum_{t=1}^{\tau_{0}} f_{t}(\tilde{x}_{t}(\lambda_{t-1})) + \Big\langle \frac{1}{2} D - \tilde{x}_{t}(\lambda_{t-1}), \lambda^{*} \Big\rangle\right] + E\left[\sum_{t=\tau_{0}+1}^{T} f_{t}(x_{t}) + \Big\langle \frac{1}{2} D - x_{t}, \lambda^{*} \Big\rangle\right] \\
&\leq E \sum_{t=1}^{\tau_{0}} g(\lambda_{t-1}) + E\left[\sum_{t=\tau_{0}+1}^{T} \frac{3}{4} D x_{t} + \frac{1}{2} D \lambda^{*}\right] \\
&\leq E \sum_{t=1}^{\tau_{0}} g(\lambda_{t-1}) + \frac{3}{16} D^{2} E[T - \tau_{0}] + \frac{3}{4} D^{2}.
\end{aligned}
$$

The first inequality is because of the resource constraint, and the second one is because $f_{t} (x) \leq f_{t}^{'} (0) (x - 0) \leq \frac{3}{4} D x$ . If we specify $λ_{t - 1} = ξ_{t}$ when the resource constraints are violated, we also have $E [\sum_{t = 1}^{T} f_{t} (x_{t})] \leq E \sum_{t = 1}^{T} g (λ_{t - 1})$ . Then

$$
\begin{aligned}
T g(\lambda^{*}) - E\left[\sum_{t=1}^{T} f_{t}(x_{t})\right] &\geq E\left[\sum_{t=1}^{\tau_{0}} g(\lambda^{*}) - g(\lambda_{t-1})\right] + \Big(g(\lambda^{*}) - \frac{3}{16} D^{2}\Big) E[T - \tau_{0}] - \frac{3}{4} D^{2} \\
&= E\left[\sum_{t=1}^{\tau_{0}} (\lambda^{*} - \lambda_{t-1})^{2}\right] + \frac{5}{64} D^{2} E[T - \tau_{0}] - \frac{3}{4} D^{2},
\end{aligned} \tag{A.16}
$$
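The constants in (A.16) follow from the two-point distribution of $\xi_{t}$. The sketch below recomputes them with exact rational arithmetic, normalising $D = 1$ (an assumption made for convenience, since every quantity scales as $D$ or $D^{2}$); the helper names are illustrative.

```python
from fractions import Fraction as F


def example_constants():
    """Recompute lambda*, g(lambda*) and the coefficient of E[T - tau_0] for D = 1."""
    D = F(1)
    xis = [F(1, 2) * D, F(3, 4) * D]          # the two equally likely values of xi_t
    e_xi = sum(xis) / 2                       # E xi_t
    e_xi2 = sum(x * x for x in xis) / 2       # E xi_t^2
    lam_star = e_xi - F(1, 4) * D             # minimizer of D(lambda) on the interval

    def dual(lam):
        # D(lambda) = lambda^2 - 2 (E xi_t - D/4) lambda + E xi_t^2
        return lam ** 2 - 2 * (e_xi - F(1, 4) * D) * lam + e_xi2

    g_star = dual(lam_star)                   # g(lambda*) = D(lambda*)
    half_d_lam = F(1, 2) * D * lam_star       # the term (1/2) D lambda*
    coef = g_star - half_d_lam                # coefficient of E[T - tau_0]
    return lam_star, g_star, half_d_lam, coef


if __name__ == "__main__":
    print(example_constants())
```

The values obtained are $\lambda^{*} = \frac{3}{8}$, $g(\lambda^{*}) = \frac{17}{64}$, $\frac{1}{2} D \lambda^{*} = \frac{3}{16}$ and $g(\lambda^{*}) - \frac{3}{16} = \frac{5}{64}$, in units of $D$ and $D^{2}$ respectively.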

or $T g(\lambda^{*}) - E[\sum_{t=1}^{T} f_{t}(x_{t})] \geq E[\sum_{t=1}^{T} (\lambda^{*} - \lambda_{t-1})^{2}]$. Applying the van Trees inequality to the estimation of $\lambda^{*}$ (li2021online), we can prove Theorem 4. To prove Theorem 5, we only need to show that the stopping time satisfies $E[T - \tau_{0}] \geq \Omega(\sqrt{T})$ given the convergence condition. This proof is inspired by arlotto2019uniformly. Denote $t^{'} = \lfloor T - \sqrt{T} \rfloor$. We show that $P(\tau_{0} \leq t^{'})$ is larger than a constant $c$, so that $E \tau_{0} \leq (1 - c) T + c (T - \sqrt{T}) \leq T - c \sqrt{T}$.

$$
\begin{aligned}
P(\tau_{0} \leq t^{'}) &= P\Big(\sum_{t=1}^{t^{'}} 2 (\xi_{t} - \lambda_{t-1}) \geq \frac{D T}{2} - D\Big) \\
&\geq P\Big(\Big\{\sum_{t=1}^{t^{'}} 2 (\xi_{t} - \lambda^{*}) \geq \frac{D T}{2} - D + \varepsilon D \sqrt{t^{'}}\Big\} \cap \Big\{\sum_{t=1}^{t^{'}} |\lambda_{t-1} - \lambda^{*}| < \varepsilon D \sqrt{t^{'}}\Big\}\Big) \\
&\geq P\Big(\Big\{\sum_{t=1}^{t^{'}} 2 (\xi_{t} - \lambda^{*}) \geq \frac{D T}{2} - D + \varepsilon D \sqrt{t^{'}}\Big\}\Big) - P\Big(\sum_{t=1}^{t^{'}} |\lambda_{t-1} - \lambda^{*}| \geq \varepsilon D \sqrt{t^{'}}\Big).
\end{aligned}
$$

With the condition $E |\lambda_{t} - \lambda^{*}| \leq c_{2} D / \sqrt{t+1}$, we have $P(\sum_{t=1}^{t^{'}} |\lambda_{t-1} - \lambda^{*}| \geq \varepsilon D \sqrt{t^{'}}) \leq \frac{2 c_{2}}{\varepsilon}$ by Markov’s inequality. Then it holds that
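As a hedged sketch of this step (assuming the stated condition holds for every $t \geq 0$), the tail bound follows from a first-moment bound together with the elementary estimate $\sum_{t=1}^{n} t^{-1/2} \leq 2 \sqrt{n}$:

$$P\Big(\sum_{t=1}^{t^{'}} |\lambda_{t-1} - \lambda^{*}| \geq \varepsilon D \sqrt{t^{'}}\Big) \leq \frac{\sum_{t=1}^{t^{'}} E |\lambda_{t-1} - \lambda^{*}|}{\varepsilon D \sqrt{t^{'}}} \leq \frac{c_{2} D \sum_{t=1}^{t^{'}} t^{-1/2}}{\varepsilon D \sqrt{t^{'}}} \leq \frac{2 c_{2} D \sqrt{t^{'}}}{\varepsilon D \sqrt{t^{'}}} = \frac{2 c_{2}}{\varepsilon}.$$

Only the first moment of $|\lambda_{t-1} - \lambda^{*}|$ is used, which is why a $1/\sqrt{t}$ rate on the expected error is enough here.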

$$
\begin{aligned}
P(\tau_{0} \leq t^{'}) &\geq P\Big(\Big\{\sum_{t=1}^{t^{'}} 2 (\xi_{t} - \lambda^{*}) \geq \frac{D T}{2} - D + \varepsilon D \sqrt{t^{'}}\Big\}\Big) - \frac{2 c_{2}}{\varepsilon} \\
&= P\Big(\Big\{\sum_{t=1}^{t^{'}} \frac{4}{D} \Big(\xi_{t} - \frac{D}{2}\Big) \geq \frac{t^{'}}{2} + (T - t^{'}) - 2 + 2 \varepsilon \sqrt{t^{'}}\Big\}\Big) - \frac{2 c_{2}}{\varepsilon} \\
&\geq P\Big(\Big\{\sum_{t=1}^{t^{'}} \frac{4}{D} \Big(\xi_{t} - \frac{D}{2}\Big) \geq \frac{t^{'}}{2} + (1 + 2 \varepsilon) \sqrt{t^{'}}\Big\}\Big) - \frac{2 c_{2}}{\varepsilon},
\end{aligned}
$$

where $\sum_{t=1}^{t^{'}} \frac{4}{D} (\xi_{t} - \frac{D}{2})$ follows the binomial distribution $B(t^{'}, \frac{1}{2})$, with mean $\mu = \frac{t^{'}}{2}$ and standard deviation $\sigma = \frac{\sqrt{t^{'}}}{2}$. The second inequality holds because $T - t^{'} \leq \sqrt{T} + 1$ and $\sqrt{T} - \sqrt{t^{'}} \leq \sqrt{T} - \sqrt{T - \sqrt{T}} = \frac{\sqrt{T}}{\sqrt{T - \sqrt{T}} + \sqrt{T}} \leq 1$. For the binomial distribution, $P(X \geq \mu + x \sigma)$ converges to $\Phi(-x)$ for any $x$ at the known rate $O(\frac{1}{\sqrt{n}})$ by the Berry-Esseen CLT, where $\Phi(x)$ is the distribution function of the standard normal distribution. We let $c_{2} = \sup_{\varepsilon > 0} \varepsilon \Phi(-2 - 4 \varepsilon) / 4$. Then there exists $\varepsilon_{0} > 0$ such that, when $T$ is large enough, $P(\{\sum_{t=1}^{t^{'}} \frac{4}{D} (\xi_{t} - \frac{D}{2}) \geq \frac{t^{'}}{2} + (1 + 2 \varepsilon_{0}) \sqrt{t^{'}}\}) \geq \frac{3 c_{2}}{\varepsilon_{0}}$, which indicates that $P(\tau_{0} \leq t^{'}) \geq \frac{c_{2}}{\varepsilon_{0}}$. This completes the proof.
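To give a feel for the size of the constant $c_{2} = \sup_{\varepsilon > 0} \varepsilon \Phi(-2 - 4 \varepsilon) / 4$, the following sketch approximates the supremum on a grid. The grid range and resolution are arbitrary choices for illustration, and the normal distribution function is computed from the standard library's `math.erfc`.

```python
import math


def Phi(x: float) -> float:
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def c2_on_grid(n: int = 20000, eps_max: float = 2.0):
    """Approximate sup_{eps > 0} eps * Phi(-2 - 4 eps) / 4 on a uniform grid.

    Returns the approximate supremum and the grid point attaining it.
    """
    best_val, best_eps = 0.0, None
    for k in range(1, n + 1):
        eps = eps_max * k / n
        val = eps * Phi(-2.0 - 4.0 * eps) / 4.0
        if val > best_val:
            best_val, best_eps = val, eps
    return best_val, best_eps


if __name__ == "__main__":
    c2, eps0 = c2_on_grid()
    # At the maximizer, Phi(-2 - 4 eps0) = 4 c2 / eps0, which exceeds 3 c2 / eps0.
    print(c2, eps0, Phi(-2.0 - 4.0 * eps0), 3.0 * c2 / eps0)
```

At the maximizing $\varepsilon_{0}$ the normal tail equals $4 c_{2} / \varepsilon_{0}$, which leaves room for the Berry-Esseen error when passing from the normal tail to the binomial tail and is consistent with the requirement $\geq 3 c_{2} / \varepsilon_{0}$ for large $T$.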

$$
\begin{aligned}
& [\nabla D(z (\lambda - \lambda^{*}) + \lambda^{*}, d) - \nabla D(\lambda^{*}, d)]^{\top} (\lambda - \lambda^{*}) \\
&\leq \big\| E b_{t} \nabla f_{t}^{*}(b_{t}^{\top} (z (\mu + \lambda - \mu^{*} - \lambda^{*}) + \mu^{*} + \lambda^{*})) - E b_{t} \nabla f_{t}^{*}(b_{t}^{\top} (\mu^{*} + \lambda^{*})) \big\|_{2} \big(\|\lambda - \lambda^{*}\|_{2} + \|\mu - \mu^{*}\|_{2}\big) \\
&\quad + \|\nabla r^{*}(- \mu) - \nabla r^{*}(- \mu^{*})\|_{2} \|\mu - \mu^{*}\|_{2} \\
&\leq \big\| z \bar{L}_{f} \bar{b} E[b_{t}^{\top} (\mu + \lambda - \mu^{*} - \lambda^{*})] \big\|_{2} \big(\|\lambda - \lambda^{*}\|_{2} + \|\mu - \mu^{*}\|_{2}\big) + \bar{L}_{r} z \|\mu - \mu^{*}\|_{2}^{2} \\
&\leq z \bar{L}_{f} \bar{b}^{2} \big(\|\lambda - \lambda^{*}\|_{2} + \|\mu - \mu^{*}\|_{2}\big)^{2} + \bar{L}_{r} z \|\mu - \mu^{*}\|_{2}^{2} \leq z (2 \bar{b}^{2} \bar{L}_{f} + \bar{L}_{r}) \|\lambda - \lambda^{*}\|_{2}^{2},
\end{aligned}
$$

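$$
\begin{aligned}
& [\nabla D(z (\lambda - \lambda^{*}) + \lambda^{*}, d) - \nabla D(\lambda^{*}, d)]^{\top} (\lambda - \lambda^{*}) \\
&= E\big[E[\langle \nabla f_{t}^{*}(b_{t}^{\top} (z (\mu + \lambda - \mu^{*} - \lambda^{*}) + \mu^{*} + \lambda^{*})) - \nabla f_{t}^{*}(b_{t}^{\top} (\mu^{*} + \lambda^{*})),\ b_{t}^{\top} (\mu + \lambda - \mu^{*} - \lambda^{*}) \rangle \mid b_{t}]\big] \\
&\quad + \langle \nabla r^{*}(- (z (\mu - \mu^{*}) + \mu^{*})) - \nabla r^{*}(- \mu^{*}),\ \mu^{*} - \mu \rangle \\
&\geq z \underline{L}_{f} E \big\| b_{t}^{\top} (\mu + \lambda - \mu^{*} - \lambda^{*}) \big\|_{2}^{2} + z \underline{L}_{r} \|\mu - \mu^{*}\|_{2}^{2} \geq z \underline{L}_{f} \sigma_{\min} \|\mu + \lambda - \mu^{*} - \lambda^{*}\|_{2}^{2} + z \underline{L}_{r} \|\mu - \mu^{*}\|_{2}^{2}.
\end{aligned}
$$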

$$
\begin{aligned}
\|\lambda^{*} - \bar{\lambda}_{kl}\|_{2} &\leq \|\lambda^{*} - \lambda_{kl}\|_{2} + \max_{\lambda \in \bar{\Omega}^{kl}(\varepsilon)} \|\lambda - \lambda_{kl}\|_{2} \\
&\leq \Big(1 + \frac{\sqrt{m} (1 - q)}{q}\Big) \|\lambda - \lambda_{kl}\|_{2}, \\
\|\mu^{*} - \bar{\mu}_{kl}\|_{2} &\leq \Big(1 + \frac{\sqrt{m} (1 - q)}{q}\Big) \|\mu - \mu_{kl}\|_{2}, \\
\max_{\lambda \in \bar{\Omega}^{kl}(\varepsilon)} \|\lambda - \lambda_{kl}\|_{2} &\leq \frac{\sqrt{m} (1 - q)}{q} \|\lambda^{*} - \lambda_{kl}\|_{2} \leq \frac{\sqrt{m} (1 - q)}{q} \|\lambda^{*} - \bar{\lambda}_{kl}\|_{2} \quad \text{(and likewise for } \mu\text{)}.
\end{aligned}
$$

$$
\begin{aligned}
\bar{D}_{T}(\lambda, d) - \bar{D}_{T}(\lambda^{*}, d) &= \langle \bar{\phi}_{T}(\lambda^{*}, d), \lambda - \lambda^{*} \rangle + \bar{s}_{T}(\lambda, d) \\
&\geq \frac{\underline{L}_{D}}{2} \|\bar{\lambda}_{kl} - \lambda^{*}\|_{2}^{2} - 2 \sqrt{2} \varepsilon (\sqrt{n} \bar{b} D + \sqrt{m} G) \frac{\sqrt{m} (1 - q)}{q} \|\bar{\lambda}_{kl} - \lambda^{*}\|_{2} \\
&\quad - 2 \sqrt{2} \varepsilon (\sqrt{n} \bar{b} D + \sqrt{m} G) \|\bar{\lambda}_{kl} - \lambda^{*}\|_{2} - \varepsilon \|\bar{\lambda}_{kl} - \lambda^{*}\|_{2} \\
&= \frac{\underline{L}_{D}}{2} \|\bar{\lambda}_{kl} - \lambda^{*}\|_{2}^{2} - H \varepsilon \cdot \underline{L}_{D} \|\bar{\lambda}_{kl} - \lambda^{*}\|_{2}
\end{aligned}
$$

$$
\begin{aligned}
D(\lambda^{*}(d_{2}^{'}), d_{1}^{'}) - D(\lambda^{*}(d_{1}^{'}), d_{1}^{'}) &\geq \underline{L}_{D} \|\lambda^{*}(d_{2}^{'}) - \lambda^{*}(d_{1}^{'})\|_{2}^{2}, \\
D(\lambda^{*}(d_{1}^{'}), d_{2}^{'}) - D(\lambda^{*}(d_{2}^{'}), d_{2}^{'}) &\geq \underline{L}_{D} \|\lambda^{*}(d_{1}^{'}) - \lambda^{*}(d_{2}^{'})\|_{2}^{2}.
\end{aligned}
$$