Smooth Monotone Stochastic Variational Inequalities and Saddle Point Problems – Survey

Aleksandr Beznosikov (MIPT, Moscow, Russia), Boris Polyak (ICS RAS, Moscow, Russia), Eduard Gorbunov (MIPT, Moscow, Russia), Dmitry Kovalev (KAUST, Thuwal, Saudi Arabia), Alexander Gasnikov (MIPT, IITP RAS, Moscow, Russia; Caucasus Mathematical Center, Adyghe State University, Maikop, Russia)

Abstract

This paper is a survey of methods for solving smooth (strongly) monotone stochastic variational inequalities. To begin with, we give the deterministic foundation from which the stochastic methods eventually evolved. Then we review methods for the general stochastic formulation, and look at the finite sum setup. The last parts of the paper are devoted to various recent (not necessarily stochastic) advances in algorithms for variational inequalities.

1 Introduction

Over their more than half-century history of study [112], variational inequalities have become one of the most popular and universal optimization formulations. They are used in various areas of applied mathematics. Here we can highlight both classic examples from game theory, economics, operator theory and convex analysis [6, 18, 106, 109, 112], and newer applications in optimization and machine learning: non-smooth optimization [92], unsupervised learning [8, 35, 24], robust/adversarial optimization [10], GANs [46] and reinforcement learning [99, 56]. Modern times present new challenges to the community. The increase in the scale of problems and the desire to speed up solution processes have led to a huge interest in stochastic formulations of applied tasks, including variational inequalities. A survey of stochastic methods for solving variational inequalities is the subject of this paper.

Structure of the paper. In Section 2, we give a formal statement of the variational inequality problem, basic examples, and main assumptions. Section 3 deals with deterministic methods, from which stochastic methods have been developed. Section 4 covers the stochastic methods. Section 5 is devoted to the recent advances in (not necessarily stochastic) variational inequalities and saddle point problems.

2 Problem: setting and assumptions

Notation. We use $⟨ x, y ⟩ := \sum_{i = 1}^{d} x_{i} y_{i}$ to denote the standard inner product of $x, y \in R^{d}$ , where $x_{i}$ corresponds to the $i$ -th component of $x$ in the standard basis in $R^{d}$ . It induces the $ℓ_{2}$ -norm in $R^{d}$ in the following way: $∥ x ∥_{2} := \sqrt{⟨ x, x ⟩}$ . We denote $ℓ_{p}$ -norms as $∥ x ∥_{p} := (\sum_{i = 1}^{d} | x_{i} |^{p})^{1 / p}$ for $p \in [1, \infty)$ and for $p = \infty$ we use $∥ x ∥_{\infty} := \max_{1 \leq i \leq d} | x_{i} |$ . The dual norm $∥ \cdot ∥_{*}$ for the norm $∥ \cdot ∥$ is defined in the following way: $∥ y ∥_{*} := \max \{ ⟨ x, y ⟩ \mid ∥ x ∥ \leq 1 \}$ . Operator $E [\cdot]$ denotes full mathematical expectation. Finally, we need to introduce $O, Ω$ -notation to hide numerical constants which do not depend on any problem parameter, and notation $\tilde{O}, \tilde{Ω}$ to hide numerical constants and logarithmic factors.

We study variational inequalities (VI) of the form

Find z^{*} \in Z such that ⟨ F (z^{*}), z - z^{*} ⟩ \geq 0, \forall z \in Z,

(1)

where $F : Z \to R^{d}$ is an operator, and $Z \subseteq R^{d}$ is a convex set.

To emphasise the extensiveness of the formalism (1), we give some examples of variational inequalities arising in applied science.

Example 1 (Minimization).

Consider the minimization problem:

\min_{z \in Z} f (z) .

(2)

Suppose that $F (z) := \nabla f (z)$ . Then, if $f$ is convex, it can be proved that $z^{*} \in Z$ is a solution for (1) if and only if $z^{*} \in Z$ is a solution for (2).

Example 2 (Saddle point problem).

Consider the saddle point problem (SPP):

\min_{x \in X} \max_{y \in Y} g (x, y) .

(3)

Suppose that $F (z) := F (x, y) = [\nabla_{x} g (x, y), - \nabla_{y} g (x, y)]$ and $Z = X \times Y$ with $X \subseteq R^{d_{x}}$ , $Y \subseteq R^{d_{y}}$ . Then, if $g$ is convex-concave, it can be proved that $z^{*} \in Z$ is a solution for (1) if and only if $z^{*} \in Z$ is a solution for (3).

Minimization problems are widely researched separately from variational inequalities. The study of saddle point problems is often associated with variational inequalities.

Example 3 (Fixed point problem).

Consider the fixed point problem:

Find z^{*} \in R^{d} such that T (z^{*}) = z^{*},

(4)

where $T : R^{d} \to R^{d}$ is an operator. With $F (z) = z - T (z)$ , it can be proved that $z^{*} \in Z = R^{d}$ is a solution for (1) if and only if $F (z^{*}) = 0$ , i.e. $z^{*} \in R^{d}$ is a solution for (4).

For the operator $F$ from (1) we assume the following.

Assumption 1 (Lipschitzness).

The operator $F$ is $L$ -Lipschitz continuous, i.e. for all $u, v \in Z$ we have $∥ F (u) - F (v) ∥_{*} \leq L ∥ u - v ∥ .$

In the context of (2) and (3), $L$ -Lipschitzness of the operator means that the functions $f (z)$ and $g (x, y)$ are $L$ -smooth.

Assumption 2 (Strong monotonicity).

The operator $F$ is $μ$ -strongly monotone, i.e., for all $u, v \in Z$ we have $⟨ F (u) - F (v), u - v ⟩ \geq μ ∥ u - v ∥_{2}^{2} .$ If $μ = 0$ , then the operator $F$ is monotone.

In the context of (2) and (3), strong monotonicity of $F$ means strong convexity of $f (z)$ and strong convexity-strong concavity of $g (x, y)$ . In this paper we first concentrate on the strongly monotone and monotone cases. But there are also various assumptions relaxing monotonicity and strong monotonicity (e.g., see [54] and references therein).

One can point out that Assumptions 1 and 2 are sufficient for the existence of a solution to (1) [36].

Since we work on the set $Z$ , let us introduce the Euclidean projection on this set:

P_{Z} (z) = \arg\min_{v \in Z} ∥ z - v ∥_{2} .

To characterise the convergence of the methods for monotone variational inequalities we introduce the gap function:

{Gap}_{V I} (z) := \sup_{u \in Z} [⟨ F (u), z - u ⟩] .

(5)

Such a gap function, as a convergence criterion, is more suitable for the following variational inequality problem: find $z^{*} \in Z$ such that $⟨ F (z), z^{*} - z ⟩ \leq 0$ for all $z \in Z$ . Such a solution is also called weak or Minty (the solution of (1) is called strong or Stampacchia). However, by Assumption 1, $F$ is single-valued and continuous on $Z$ , so for monotone $F$ both formulations of the variational inequality problem are equivalent [36].

For the minimization problem (2), the functional distance to the solution: $f (z) - f (z^{*})$ , can be used instead of (5). For saddle point problems (3), the gap function is also used, but it is slightly different:

{Gap}_{S P P} (z) := gap (x, y) = \max_{y^{'} \in Y} g (x, y^{'}) - \min_{x^{'} \in X} g (x^{'}, y) .

(6)

For both functions (5) and (6), it is critical that the set over which this function is calculated is bounded (in fact it is not necessary to take the whole set $Z$ which can be unbounded, it is enough to take a bounded convex subset $C$ which contains some solution – see [94]). Therefore it is necessary to define a distance on the set $Z$ . Since in this survey we will encounter methods not only in the Euclidean setup, let us introduce a more general notion of distance.

Definition (Bregman divergence).

Assume that the function $ν (z)$ is $1$ -strongly convex w.r.t. the norm $∥ \cdot ∥$ and differentiable on $Z$ . Then for any two points $z, z^{'} \in Z$ we define the Bregman divergence $V (z, z^{'})$ associated with $ν (z)$ as follows:

V (z, z^{'}) := ν (z^{'}) - ν (z) - ⟨ \nabla ν (z), z^{'} - z ⟩ .

We denote the Bregman-diameter of the set $Z$ w.r.t. $V (z, z^{'})$ as $D_{Z, V} := \max \{ \sqrt{2 V (z, z^{'})} \mid z, z^{'} \in Z \}$ . In the Euclidean case, we use $D_{Z}$ instead of $D_{Z, V}$ . Using the definition of $V$ , we denote the proximal operator as follows:

{prox}_{x} (y) = \arg\min_{z \in Z} \{ ⟨ y, z ⟩ + V (z, x) \} .

3 Deterministic foundation: Extragradient and others

The first and simplest method for solving the variational inequality (1) is the following iterative scheme (also known as the Gradient method):

z^{k + 1} = P_{Z} (z^{k} - γ F (z^{k})),

(7)

where $γ > 0$ is a step size. Note that this method can be rewritten using the proximal operator with the Euclidean Bregman divergence:

z^{k + 1} = {prox}_{z^{k}} (γ F (z^{k})) .

The basic result is the convergence of the method to the unique solution of (1) for strongly monotone and $L$ -Lipschitz operator $F$ ; it was obtained in the papers [18, 106, 109].

Theorem.

If Assumptions 1, 2 hold and $0 < γ < 2 μ / L^{2}$ , then after $k$ iterations the method (7) converges to $z^{*}$ with linear rate:

∥ z^{k} - z^{*} ∥_{2}^{2} = O (R_{0}^{2} q^{k}), with q = (1 - 2 γ μ + γ^{2} L^{2})

and $R_{0} = ∥ z^{0} - z^{*} ∥_{2}$ (here and below). For $γ = μ / L^{2}$ we have $q = (1 - 1 / κ^{2}), κ = L / μ$ , thus the upper bound on the number of iterations to achieve the $ε$ -solution (i.e., $∥ z^{k} - z^{*} ∥_{2}^{2} \leq ε$ ) is $O (κ^{2} log (R_{0}^{2} / ε))$ .
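The contraction factor $q$ can be checked numerically. The following is a minimal sketch, not taken from the surveyed papers: the operator $F (z) = A z$ on $Z = R^{2}$ and the matrix `A` are assumptions chosen for illustration, so that $μ = 1$ , $L = \sqrt{5}$ and $z^{*} = 0$ . For this particular (normal) matrix the squared distance shrinks by exactly $q$ at every step, so the bound of the theorem is attained.

```python
import numpy as np

# Hypothetical test problem: F(z) = A z on Z = R^2, so P_Z is the identity and z* = 0.
A = np.array([[1.0, 2.0], [-2.0, 1.0]])
mu = 1.0                      # <A z, z> = ||z||^2, so F is 1-strongly monotone
L = np.linalg.norm(A, 2)      # spectral norm of A, equal to sqrt(5)
gamma = mu / L**2             # step size from the theorem
q = 1.0 - 2.0 * gamma * mu + gamma**2 * L**2   # equals 1 - 1 / kappa^2


def gradient_method(z0, steps):
    """Iterate z^{k+1} = z^k - gamma * F(z^k) and record ||z^k - z*||_2^2."""
    z = np.array(z0, dtype=float)
    sq_dist = [float(z @ z)]
    for _ in range(steps):
        z = z - gamma * (A @ z)
        sq_dist.append(float(z @ z))
    return sq_dist


sq_dist = gradient_method([1.0, 1.0], 50)
ratios = [sq_dist[k + 1] / sq_dist[k] for k in range(50)]
print(f"q = {q:.3f}, largest observed ratio = {max(ratios):.3f}")
```

For general operators the observed ratio can be smaller than $q$ ; the sketch only illustrates that the factor cannot be improved without further assumptions.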

Various extensions of this statement (for $F$ being non-Lipschitz but with linear growth bounds, or for values of $F$ corrupted by noise) can be found in Theorem 1 from [9].

If $F$ is a potential operator (see Example 1), the method (7) coincides with the gradient projection algorithm. It converges for strongly monotone $F$ . Moreover, the bounds on the admissible step size are less restrictive ( $0 < γ < 2 / L$ ) and the complexity estimates are better ( $O (κ log (R_{0}^{2} / ε))$ ) than in the theorem above.

However, in the general monotone but not strongly monotone case (for instance, for the convex-concave SPP, Example 2) convergence is lacking. The original statements on the convergence of the Uzawa method (a version of (7)) for saddle point problems [6] were wrong; numerous examples of divergence of the method (7) for $F$ corresponding to a bilinear SPP are well known, see e.g. Figure 39 from [103].

There were many other attempts to recover convergence of gradient-like methods, not for VIs but for saddle point problems. One of them is based on the transition to modified Lagrangians when $g (x, y)$ is a Lagrange function, see [44, 103]. However, we focus on the general VI case. A possible approach is based on the idea of regularization. Instead of the monotone variational inequality (1) one can deal with a regularized one, where the monotone $F$ is replaced with the strongly monotone $F + ε_{k} T$ , where $T (z)$ is a strongly monotone operator and $ε_{k} > 0$ is a regularization parameter. If we denote by $z^{k}$ the solution of the regularized VI, then it is possible to prove [9] that $z^{k}$ converges to $z^{*}$ as $ε_{k} \to 0$ . However, the solution $z^{k}$ is usually not easily available. To address this problem, the iterative regularization technique is proposed in [9], where one step of the basic method (7) is applied to the regularized problem. Step sizes and regularization parameters can be adjusted to guarantee convergence.

Another technique is based on the Proximal Point method proposed independently by B. Martinet in [83] and by T. Rockafellar in [105]. At each iteration it requires the solution of a VI with the operator $F + c I$ , where $c > 0$ and $I$ is the identity operator. This is an implicit method (similar to the regularization method); however, there exist numerous implementable versions of Proximal Point. For instance, some of the methods below can be considered from this point of view.

The breakthrough for solving (non-strongly) monotone variational inequalities was made by Galina Korpelevich [63]. She exploited the idea of extrapolation for the gradient method. It can be explained using the simplest example of a two-dimensional min-max problem with $g (x, y) = x y, Z = R^{2}$ . It has the unique saddle point $z = 0$ , and at any point $z^{k}$ the direction $F (z^{k})$ is orthogonal to $z^{k}$ ; thus the iteration (7) enlarges the distance to the saddle point. However, if we make the step (7) and get the extrapolated point $z^{k + 1 / 2}$ , the direction $- F (z^{k + 1 / 2})$ is attracting to the saddle point. Thus, the Extragradient method for solving (1) reads:

\begin{matrix} z^{k + 1 / 2} = P_{Z} (z^{k} - γ F (z^{k})), \\ z^{k + 1} = P_{Z} (z^{k} - γ F (z^{k + 1 / 2})) . \end{matrix}

(8)

Theorem.

Let $F$ satisfy Assumptions 1, 2 (with $μ = 0$ ) and $0 < γ < 1 / L$ , then for the Extragradient method $z^{k} \to z^{*}$ .
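The two-dimensional example $g (x, y) = x y$ discussed before (8) can be reproduced in a few lines. This is an illustrative sketch only: the step size `gamma = 0.5`, the starting point and the number of iterations are arbitrary choices, and the projection is the identity because $Z = R^{2}$ .

```python
import numpy as np

# g(x, y) = x * y on Z = R^2, so F(z) = (y, -x) = J z and the projection is the identity.
J = np.array([[0.0, 1.0], [-1.0, 0.0]])


def F(z):
    return J @ z


def gradient_step(z, gamma):
    """One step of the Gradient method (7)."""
    return z - gamma * F(z)


def extragradient_step(z, gamma):
    """One step of the Extragradient method (8)."""
    z_half = z - gamma * F(z)       # extrapolation step
    return z - gamma * F(z_half)    # update from z^k using F at z^{k+1/2}


def run(step, z0, gamma, steps):
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z = step(z, gamma)
    return float(np.linalg.norm(z))


gamma = 0.5   # L = 1 for this operator, so 0 < gamma < 1 / L
gd_norm = run(gradient_step, [1.0, 1.0], gamma, 100)
eg_norm = run(extragradient_step, [1.0, 1.0], gamma, 100)
print(f"Gradient method: {gd_norm:.2e}, Extragradient: {eg_norm:.2e}")
```

For this operator one step of (7) multiplies $∥ z ∥_{2}^{2}$ by $1 + γ^{2}$ , while one step of (8) multiplies it by $1 - γ^{2} + γ^{4} < 1$ , which explains the opposite behaviour of the two runs.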

For the particular case of the zero-sum matrix game the method converges linearly, provided that the optimal solution is unique (see Theorem 3 from [63]). For $g (x, y) = y^{⊤} A x - b^{⊤} x + c^{⊤} y$ the following estimate holds: $O (κ log (R_{0}^{2} / ε))$ with $κ = λ_{max} (A A^{⊤}) / λ_{min} (A A^{⊤})$ . More general upper bounds for the Extragradient method can be found in [118] and in the recent paper [86]. In particular, for the strongly monotone case the estimate $O (κ log (R_{0}^{2} / ε))$ with $κ = L / μ$ holds true (compare with the much worse bound $O (κ^{2} log (R_{0}^{2} / ε))$ for the Gradient method). An adaptive version of the Extragradient method (no knowledge of $L$ required) was proposed in [60].

Another version of the Extragradient method for finding saddle points is provided in [64]. Considering the setup of Example 2, we can exploit just one extrapolating step for variables $y$ :

\begin{matrix} y^{k + 1 / 2} = P_{Y} (y^{k} + γ \nabla_{y} g (x^{k}, y^{k})), \\ x^{k + 1} = P_{X} (x^{k} - γ \nabla_{x} g (x^{k}, y^{k + 1 / 2})), \\ y^{k + 1} = y^{k} + q (y^{k + 1 / 2} - y^{k}), \end{matrix}

(9)

with $0 < γ < 1 / (2 L)$ and $0 < q < 1$ . This method converges to the solution, and if $g (x, y)$ is linear in $y$ , then the convergence rate is linear. If we put $q = 1$ in the method (9), then $y^{k + 1} = y^{k + 1 / 2}$ and we get the so-called Alternating Gradient method (Alternating descent-ascent). In [122], it was proved that this method has local linear convergence with complexity $O (κ log (R_{0}^{2} / ε))$ , where $κ = L / μ$ .

L. Popov [104] proposed a version of extrapolation scheme (sometimes this type of scheme is called optimistic or single-call):

\begin{matrix} z^{k + 1 / 2} = P_{Z} (z^{k} - γ F (z^{k - 1 / 2})), \\ z^{k + 1} = P_{Z} (z^{k} - γ F (z^{k + 1 / 2})) . \end{matrix}

(10)

It requires a single calculation of $F$ at each iteration versus two calculations in the Extragradient method. As shown in [104], the method (10) converges for $0 < γ < 1 / (3 L)$ . Convergence rates for this method were obtained recently [40, 86]: the complexity is $O (κ log (R_{0}^{2} / ε))$ with $κ = L / μ$ for the strongly monotone case and $κ = λ_{max} (A A^{⊤}) / λ_{min} (A A^{⊤})$ for the bilinear case. It is important to note that in the general strongly monotone case this estimate is optimal [123].
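The single-call structure of (10) is easy to see in code. The sketch below is an illustration under stated assumptions: it uses the bilinear example $g (x, y) = x y$ on $Z = R^{2}$ (so the projection is the identity and $L = 1$ ), the step size `gamma = 0.3` is an arbitrary value below $1 / (3 L)$ , and the initialisation $z^{- 1 / 2} = z^{0}$ is a common convention that is not specified in the text.

```python
import numpy as np

J = np.array([[0.0, 1.0], [-1.0, 0.0]])   # F(z) = J z for g(x, y) = x * y, L = 1


def popov(z0, gamma, steps):
    """Popov's scheme (10) on Z = R^2; returns the final point and the number of F calls."""
    z = np.array(z0, dtype=float)
    f_prev = J @ z                  # F(z^{-1/2}), assuming z^{-1/2} = z^0
    calls = 1
    for _ in range(steps):
        z_half = z - gamma * f_prev     # reuses the operator value from the previous iteration
        f_prev = J @ z_half             # the only new evaluation of F in this iteration
        calls += 1
        z = z - gamma * f_prev
    return z, calls


z_final, calls = popov([1.0, 1.0], gamma=0.3, steps=400)
final_norm = float(np.linalg.norm(z_final))
print(f"||z^k|| = {final_norm:.2e} after {calls} operator evaluations")
```

The Extragradient method (8) would need two evaluations of $F$ per iteration for the same number of steps; here the stored value `f_prev` replaces one of them.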

The extension of the above schemes to an arbitrary proximal setup was obtained in the work of A. Nemirovsky [91]. He proposed the Mirror-Prox method for VIs, exploiting Bregman divergence:

\begin{matrix} z^{k + 1 / 2} & = {prox}_{z^{k}} (γ F (z^{k})), \\ z^{k + 1} & = {prox}_{z^{k}} (γ F (z^{k + 1 / 2})) . \end{matrix}

(11)

It implies the following result on the convergence rate.

Theorem.

Let $F$ satisfy Assumptions 1, 2 (with $μ = 0$ ) and

\hat{z}^{k} = \frac{1}{k} \sum_{i = 1}^{k} z^{i + 1 / 2},

(12)

where $z^{i + 1 / 2}$ are generated by algorithm (11) with $γ = 1 / (\sqrt{2} L)$ . Then, after $k$ iterations

{Gap}_{V I} (\hat{z}^{k}) = O (\frac{L D_{Z, V}^{2}}{k}) .

(13)

Numerous extensions of these original versions of iterative methods for solving variational inequalities were published later. One can highlight Tseng’s Forward-Backward Splitting [119], Nesterov’s Dual Extrapolation [94], and Malitsky and Tam’s Forward-Reflected-Backward [82]. All these methods have the convergence guarantee (13). It turns out that this rate is optimal [100].
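To make the non-Euclidean setup concrete, here is a minimal sketch of Mirror-Prox (11) with the ergodic average (12) for a matrix game $\min_{x} \max_{y} x^{⊤} A y$ over two probability simplices. Several ingredients are assumptions of the sketch rather than statements from the text: the payoff matrix `A` is arbitrary, the prox-function is the negative entropy on each simplex (for which the prox step has the well-known multiplicative closed form used in `entropic_prox`), and the Lipschitz constant is taken as $\max_{i, j} | A_{i j} |$ , which corresponds to measuring each block in the $ℓ_{1}$ -norm.

```python
import numpy as np

# Hypothetical payoff matrix of the game min_x max_y x^T A y over two simplices.
A = np.array([[0.5, -1.0, 0.2],
              [-0.3, 0.4, -0.8],
              [1.0, -0.6, 0.1]])
L = np.abs(A).max()                  # Lipschitz constant of F in the l1 / l_inf setup
gamma = 1.0 / (np.sqrt(2.0) * L)     # step size from the theorem


def operator(x, y):
    """F(x, y) = [grad_x g, -grad_y g] for g(x, y) = x^T A y."""
    return A @ y, -A.T @ x


def entropic_prox(center, direction):
    """Entropic prox step on the simplex: weights proportional to center * exp(-direction)."""
    w = center * np.exp(-(direction - direction.min()))   # shift for numerical stability
    return w / w.sum()


def gap_spp(x, y):
    """Gap (6) for the matrix game: max_{y'} x^T A y' - min_{x'} x'^T A y."""
    return float((A.T @ x).max() - (A @ y).min())


def mirror_prox(steps):
    x = np.full(3, 1.0 / 3.0)
    y = np.full(3, 1.0 / 3.0)
    x_sum, y_sum = np.zeros(3), np.zeros(3)
    for _ in range(steps):
        fx, fy = operator(x, y)
        x_half, y_half = entropic_prox(x, gamma * fx), entropic_prox(y, gamma * fy)
        fx, fy = operator(x_half, y_half)
        x, y = entropic_prox(x, gamma * fx), entropic_prox(y, gamma * fy)
        x_sum += x_half              # accumulate extrapolated points for the average (12)
        y_sum += y_half
    return x_sum / steps, y_sum / steps


gap_initial = gap_spp(np.full(3, 1.0 / 3.0), np.full(3, 1.0 / 3.0))
x_hat, y_hat = mirror_prox(2000)
gap_final = gap_spp(x_hat, y_hat)
print(f"gap at the uniform start: {gap_initial:.3f}, gap of the averaged point: {gap_final:.5f}")
```

The gap is evaluated at the averaged point, not at the last iterate, in line with the statement of the theorem.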

4 Stochastic methods: different setups and assumptions

In this section, we move from deterministic to stochastic methods, i.e., we consider (1) with the following operator:

F (z) = E_{ξ \sim D} [F_{ξ} (z)],

(14)

where $ξ$ is a random variable, $D$ is some (typically unknown) distribution, and $F_{ξ} : Z \to R^{d}$ is a stochastic operator. In this setup, calculating the value of the full operator $F$ is computationally expensive or impossible. Therefore, one has to work mainly with stochastic realizations $F_{ξ}$ .

4.1 General case

The stochastic formulation (14) for the problem (1) was first considered by the authors of [59] and they proposed a natural stochastic generalization of Extragradient (more precisely, of Mirror-Prox):

\begin{matrix} z^{k + 1 / 2} & = {prox}_{z^{k}} (γ F_{ξ^{k}} (z^{k})), \\ z^{k + 1} & = {prox}_{z^{k}} (γ F_{ξ^{k + 1 / 2}} (z^{k + 1 / 2})) . \end{matrix}

(15)

Here it is important to note that $ξ^{k}$ and $ξ^{k + 1 / 2}$ are independent and $F_{ξ} (z)$ is an unbiased estimator of $F (z)$ . Moreover, $F_{ξ} (z)$ is assumed to satisfy the following condition.

Assumption 3 (Bounded variance).

The unbiased operator $F_{ξ}$ has uniformly bounded variance, i.e., for all $ξ \sim D$ and $u \in Z$ we have $E ∥ F_{ξ} (u) - F (u) ∥_{*}^{2} \leq σ^{2} .$

Under this assumption, the next result was derived in [59].

Theorem.

Let $F_{ξ}$ satisfy Assumptions 1, 2 (with $μ = 0$ ), 3 and $\hat{z}^{k}$ be defined as in (12), where $z^{i + 1 / 2}$ are generated by the algorithm (15) with $γ = \min \{ \frac{1}{\sqrt{3} L}, D_{Z, V} \sqrt{\frac{1}{7 k σ^{2}}} \}$ . Then, after $k$ iterations we can guarantee that

E [{Gap}_{V I} (\hat{z}^{k})] = O (\frac{L D_{Z, V}^{2}}{k} + D_{Z, V} \sqrt{\frac{σ^{2}}{k}}) .

(16)

In [16], the authors gave an analysis of the algorithm (15) for strongly monotone VIs in the Euclidean case. In more detail, under Assumptions 1, 2, 3 one can guarantee that after $k$ iterations of the method (15) it holds (here and below we omit numerical constants in the exponent multiplier)

E ∥ z^{k} - z^{*} ∥_{2}^{2} = \tilde{O} (R_{0}^{2} exp (- \frac{μ k}{L}) + \frac{σ^{2}}{μ^{2} k}) .

(17)

Also, in [16], the authors obtained lower complexity bounds for solving VIs satisfying Assumptions 1, 2, 3 with stochastic methods. It turns out that the estimate (16) in the monotone case and the estimate (17) in the strongly monotone case are optimal and meet the lower bounds up to numerical constants.
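The trade-off between the two terms in (17) can be observed with a constant step size: a larger step forgets the starting point faster but stalls at a higher noise floor. The following sketch of the scheme (15) in the Euclidean unconstrained case is purely illustrative. The operator $F (z) = A z$ , the additive Gaussian noise with per-coordinate level `sigma`, the two step sizes and the tail-averaging used to estimate the noise floor are all assumptions of the sketch.

```python
import numpy as np

A = np.array([[1.0, 2.0], [-2.0, 1.0]])   # F(z) = A z, mu = 1, z* = 0
sigma = 0.1                                # per-coordinate noise level (hypothetical)


def i_seg_tail_error(gamma, steps, seed=0):
    """Run (15) on Z = R^2 and average ||z^k - z*||^2 over the last half of the run."""
    rng = np.random.default_rng(seed)
    z = np.array([1.0, 1.0])
    tail = []
    for k in range(steps):
        noise_1 = sigma * rng.standard_normal(2)   # sample xi^k
        noise_2 = sigma * rng.standard_normal(2)   # independent sample xi^{k+1/2}
        z_half = z - gamma * (A @ z + noise_1)
        z = z - gamma * (A @ z_half + noise_2)
        if k >= steps // 2:
            tail.append(float(z @ z))
    return float(np.mean(tail))


floor_large_step = i_seg_tail_error(gamma=0.1, steps=5000)
floor_small_step = i_seg_tail_error(gamma=0.01, steps=5000)
print(f"noise floor with gamma=0.1: {floor_large_step:.1e}, with gamma=0.01: {floor_small_step:.1e}")
```

Decreasing step sizes, as used to obtain (17), combine the fast initial phase of the large step with the low floor of the small one.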

Optimistic-like (or single-call) methods were also investigated in the stochastic setting. The work [40] applies the following update scheme:

\begin{matrix} z^{k + 1 / 2} & = P_{Z} (z^{k} - γ F_{ξ^{k - 1 / 2}} (z^{k - 1 / 2})), \\ z^{k + 1} & = P_{Z} (z^{k} - γ F_{ξ^{k + 1 / 2}} (z^{k + 1 / 2})) . \end{matrix}

(18)

For this method, in the monotone Euclidean case, the authors proved an estimate similar to (16). In the strongly monotone case, (18) was investigated in the paper [53], but their estimates do not meet the lower bounds. The optimal estimates for this scheme were obtained later in the work [11].

The work [65] deals with a slightly different single-call approach in the non-Euclidean case:

\begin{matrix} z^{k + 1} & = {prox}_{z^{k}} (γ_{k} F_{ξ^{k}} (z^{k}) + γ_{k} α_{k} [F_{ξ^{k}} (z^{k}) - F_{ξ^{k - 1}} (z^{k - 1})]) . \end{matrix}

(19)

This update is a modification of the Forward-Reflected-Backward approach, namely here $α_{k}$ is a parameter, while in [82], $α_{k} \equiv 1$ . The analysis of the method (19) gives optimal estimates in both the strongly monotone and monotone cases.

The theory and guarantees of the previous results significantly rely on the bounded variance assumption (Assumption 3). This assumption is quite restrictive (especially when the domain is unbounded) and does not hold for many popular machine learning problems. Moreover, one can even design a strongly monotone variational inequality on an unbounded domain such that the method (15) diverges exponentially fast [25]. The authors of [54, 47] consider a relaxation of the bounded variance condition and assume that $E ∥ F_{ξ} (u) - F (u) ∥_{2}^{2} \leq σ^{2} + δ ∥ u - z^{*} ∥_{2}^{2}$ with $δ \geq 0$ in the Euclidean case. Under this condition and Assumptions 1, 2, the authors of [47] proved that after $k$ iterations of the algorithm (15) (when $Z = R^{d}$ ) it holds that

E ∥ z^{k} - z^{*} ∥_{2}^{2} = O (κ R_{0}^{2} exp (- \frac{k}{κ}) + \frac{σ^{2}}{μ^{2} k}),

(20)

where $κ = max {\frac{δ}{μ^{2}}; \frac{L + \sqrt{δ}}{μ}}$ . The same assumption on stochastic realizations was considered in [66]. The authors used the method (19) and provided the following estimate:

E ∥ z^{k} - z^{*} ∥_{2}^{2} = O (R_{0}^{2} exp (- \frac{μ k}{L}) + \frac{σ^{2} + δ^{2} D_{Z}^{2}}{μ^{2} k}),

(21)

Results (20) and (21) are competitive. The first estimate is better in terms of the stochastic term (second term), while the second result is more competitive in terms of the deterministic term (first term). However, neither of these results fully resolves the issue of bounded noise, because the condition considered above is not general. The key to avoiding the assumption of bounded variance of $F_{ξ}$ lies in the way stochasticity is generated in the method (15). The method (15) is sometimes called Independent Samples Stochastic Extragradient (I-SEG). To address the bounded variance issue, K. Mishchenko et al. [85] proposed another stochastic modification of the Extragradient algorithm called Same Sample Stochastic Extragradient (S-SEG):

\begin{matrix} z^{k + 1 / 2} = & z^{k} - γ F_{ξ^{k}} (z^{k}), \\ z^{k + 1} = & z^{k} - γ F_{ξ^{k}} (z^{k + 1 / 2}) . \end{matrix}

For simplicity we present the above method for the case when $Z = R^{d}$ ( $F (z^{*}) = 0$ ), while [85] contains a more general case of regularized VIs. In contrast to I-SEG, S-SEG uses the same sample $ξ^{k}$ for both steps at iteration $k$ . Although such a strategy cannot be implemented in some scenarios (streaming oracle), it can be applied to finite-sum problems, which have been gaining increasing attention in recent years. Moreover, S-SEG significantly relies on the following assumption.

Assumption 4.

The operator $F_{ξ} (z)$ is $L$ -Lipschitz and $μ$ -strongly monotone almost surely in $ξ$ , i.e., $∥ F_{ξ} (z) - F_{ξ} (z^{'}) ∥_{2} \leq L ∥ z - z^{'} ∥_{2}$ and $⟨ F_{ξ} (z) - F_{ξ} (z^{'}), z - z^{'} ⟩ \geq μ ∥ z - z^{'} ∥_{2}^{2}$ for all $z, z^{'} \in R^{d}$ almost surely in $ξ$ .

The evident difference between the setups for I-SEG and S-SEG can be explained through the connection between Extragradient and the Proximal Point method (PP) [83, 105]. We assume that $Z = R^{d}$ ( $F (z^{*}) = 0$ ) in all results discussed further in Section 4.1. In this setup, PP has the following update rule

z^{k + 1} = z^{k} - γ F (z^{k + 1}) .

The method converges for any monotone operator $F$ and any $γ > 0$ . However, the update rule of PP is implicit and cannot be computed efficiently in many situations. The Extragradient method can be seen as a natural approximation of PP that substitutes $z^{k + 1}$ in the right-hand side by one gradient step from $z^{k}$ :

z^{k + 1} = z^{k} - γ F (z^{k} - γ F (z^{k})) .

In addition, when $F$ is $L$ -Lipschitz, one can estimate how good the approximation is. Consider $z^{k + 1} = z^{k} - γ F (z^{k} - γ F (z^{k}))$ (the Extragradient step) and $\tilde{z}^{k + 1} = z^{k} - γ F (\tilde{z}^{k + 1})$ (the PP step). Then, $∥ z^{k + 1} - \tilde{z}^{k + 1} ∥_{2}$ can be estimated as follows [85]:

	$∥ z^{k + 1} - \tilde{z}^{k + 1} ∥_{2} = γ ∥ F (z^{k} - γ F (z^{k})) - F (\tilde{z}^{k + 1}) ∥_{2}$
	$\leq γ L ∥ z^{k} - γ F (z^{k}) - \tilde{z}^{k + 1} ∥_{2} = γ^{2} L ∥ F (z^{k}) - F (\tilde{z}^{k + 1}) ∥_{2}$
	$\leq γ^{2} L^{2} ∥ z^{k} - \tilde{z}^{k + 1} ∥_{2} = γ^{3} L^{2} ∥ F (\tilde{z}^{k + 1}) ∥_{2}$
	$\leq γ^{3} L^{3} ∥ \tilde{z}^{k + 1} - z^{*} ∥_{2} .$

That is, the difference between the Extragradient and PP steps is of the order $O (γ^{3})$ rather than $O (γ^{2})$ . Since the latter corresponds to the difference between PP and the simple gradient step (7), Extragradient approximates PP better than gradient steps, which are known to be non-convergent for general monotone Lipschitz variational inequalities. This approximation feature of Extragradient is crucial for its convergence and, as the above derivation implies, the approximation argument significantly relies on the Lipschitzness of the operator $F$ .
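The orders $O (γ^{2})$ and $O (γ^{3})$ can be verified numerically for a linear operator, where the implicit PP step reduces to solving a linear system. The sketch below is an illustration under assumptions: the operator $F (z) = A z$ with $z^{*} = 0$ , the matrix `A`, the point `z` and the two step sizes are arbitrary choices.

```python
import numpy as np

A = np.array([[1.0, 2.0], [-2.0, 1.0]])   # F(z) = A z, so z* = 0
L = float(np.linalg.norm(A, 2))
z = np.array([1.0, -0.5])
I = np.eye(2)


def step_errors(gamma):
    """Distances from the gradient and Extragradient steps to the Proximal Point step."""
    pp = np.linalg.solve(I + gamma * A, z)   # implicit step: pp = z - gamma * A @ pp
    gd = z - gamma * (A @ z)                 # gradient step (7)
    eg = z - gamma * (A @ gd)                # Extragradient step
    return (float(np.linalg.norm(gd - pp)),
            float(np.linalg.norm(eg - pp)),
            float(np.linalg.norm(pp)))


gd_err, eg_err, pp_norm = step_errors(0.01)
gd_err_half, eg_err_half, _ = step_errors(0.005)
gd_ratio = gd_err / gd_err_half   # close to 2^2 for an O(gamma^2) error
eg_ratio = eg_err / eg_err_half   # close to 2^3 for an O(gamma^3) error
print(f"halving gamma divides the errors by {gd_ratio:.2f} (gradient) and {eg_ratio:.2f} (Extragradient)")
```

The test value `eg_err` can also be compared with the final bound $γ^{3} L^{3} ∥ \tilde{z}^{k + 1} - z^{*} ∥_{2}$ of the derivation, which for this matrix holds with equality up to rounding.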

Let us go back to the differences between I-SEG and S-SEG. In S-SEG, iteration $k$ can be considered as a single Extragradient step for operator $F_{ξ^{k}} (z)$ . Therefore, Lipschitzness and monotonicity of $F_{ξ^{k}} (z)$ (Assumption 4) are important for the analysis of S-SEG. In contrast, I-SEG uses different operators for the extrapolation and update steps. In this case, there is no effect from the Lipschitzness/monotonicity of individual $F_{ξ} (z)$ . Therefore, the analysis of I-SEG naturally relies on the Lipschitzness and monotonicity of $F (z)$ and closeness (on average) of $F_{ξ} (z)$ and $F (z)$ (Assumption 3).

The convergence of I-SEG was discussed earlier in this section. Regarding S-SEG, the following result holds [85].

Theorem.

Let Assumption 4 hold. Then, there exists a choice of step size $γ$ [47] such that the output of S-SEG after $k$ iterations satisfies

E ∥ z^{k} - z^{*} ∥_{2}^{2} = O (\frac{L R_{0}^{2}}{μ} exp (- \frac{μ k}{L}) + \frac{σ_{*}^{2}}{μ^{2} k}),

where $σ_{*}^{2} = E ∥ F_{ξ} (z^{*}) ∥_{2}^{2}$ .
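The role of $σ_{*}^{2}$ is clearest when it vanishes. In the sketch below (an illustration, not an experiment from [85] or [47]) all $F_{i}$ share the solution $z^{*}$ , so $σ_{*}^{2} = 0$ , while the variance of $F_{ξ} (z)$ grows with $∥ z - z^{*} ∥_{2}$ and is therefore not uniformly bounded on $R^{d}$ . The matrices `mats`, the point `z_star` and the constant step size are assumptions chosen so that Assumption 4 holds with $μ = 1$ and $L = \sqrt{10}$ .

```python
import numpy as np

J = np.array([[0.0, 1.0], [-1.0, 0.0]])
mu = 1.0
c = np.array([0.5, 1.0, 2.0, 3.0])
z_star = np.array([1.0, -2.0])
mats = [mu * np.eye(2) + ci * J for ci in c]   # F_i(z) = (mu I + c_i J)(z - z*)


def F_i(i, z):
    return mats[i] @ (z - z_star)


def s_seg(z0, gamma, steps, seed=0):
    """Same Sample Stochastic Extragradient on Z = R^2 with uniform sampling."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        i = rng.integers(len(mats))       # one sample per iteration ...
        z_half = z - gamma * F_i(i, z)    # ... used for the extrapolation
        z = z - gamma * F_i(i, z_half)    # ... and for the update
    return z


gamma = 0.1   # below 1 / (2 L) with L = sqrt(10)
s_seg_error = float(np.linalg.norm(s_seg([0.0, 0.0], gamma, 500) - z_star))
print(f"||z^k - z*|| after 500 iterations of S-SEG: {s_seg_error:.1e}")
```

Each iteration is an exact Extragradient step for the sampled operator, so with $σ_{*}^{2} = 0$ only the exponentially decaying term of the bound remains and a constant step size suffices.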

The rate is similar to the one known for I-SEG up to the following differences. First, instead of the uniform bound on the variance $σ^{2}$ , the rate depends on $σ_{*}^{2}$ , which is the variance of $F_{ξ}$ measured at the solution. In many cases, $σ^{2} = \infty$ while $σ_{*}^{2}$ is finite. From this perspective, S-SEG has a better rate than I-SEG. However, it comes with a price: while the rate of I-SEG depends on the Lipschitz and strong-monotonicity constants of $F$ , the rate of S-SEG depends on the worst constants of $F_{ξ}$ , which can be much worse than those for $F$ . In particular, consider the finite-sum setup with uniform sampling of $ξ$ : $F (x) = \frac{1}{n} \sum_{i = 1}^{n} F_{i} (x)$ where $F_{i}$ is $L_{i}$ -Lipschitz and $μ_{i}$ -strongly monotone, $P \{ ξ = i \} = 1 / n$ . Then, Assumption 4 holds with $L = {max}_{i \in [n]} L_{i}$ , $μ = {min}_{i \in [n]} μ_{i}$ and these constants appear in the rate from the theorem above. The authors of [47] tighten the rate and prove for S-SEG that uses different step sizes for extrapolation and update steps that

E ∥ z^{k} - z^{*} ∥_{2}^{2} = O (\frac{L R_{0}^{2}}{\bar{μ}} exp (- \frac{\bar{μ} k}{L}) + \frac{σ_{*}^{2}}{\bar{μ}^{2} k}),

where $σ_{*}^{2} = \frac{1}{n} \sum_{i = 1}^{n} ∥ F_{i} (z^{*}) ∥_{2}^{2}$ and $\bar{μ} = \frac{1}{n} \sum_{i = 1}^{n} μ_{i}$ . Since $\bar{μ}$ is (sometimes significantly) larger than $μ$ , the improvement is noticeable. Moreover, when $\{ L_{i} \}_{i = 1}^{n}$ are known, one can consider so-called importance sampling [51]: $P \{ ξ = i \} = L_{i} / (n \bar{L})$ , where $\bar{L} = \frac{1}{n} \sum_{i = 1}^{n} L_{i}$ . As the authors of [47] show, importance sampling can be combined with S-SEG via allowing the extrapolation and update step sizes at iteration $k$ to depend on the sample $ξ^{k}$ . In particular, for the proposed modification of S-SEG they derive

E ∥ z^{k} - z^{*} ∥_{2}^{2} = O (\frac{\bar{L} R_{0}^{2}}{\bar{μ}} exp (- \frac{\bar{μ} k}{\bar{L}}) + \frac{\hat{σ}_{*}^{2}}{\bar{μ}^{2} k}),

where $\hat{σ}_{*}^{2} = \frac{1}{n} \sum_{i = 1}^{n} \frac{\bar{L}}{L_{i}} ∥ F_{i} (z^{*}) ∥_{2}^{2}$ . The exponentially decaying term is always better than the corresponding one for S-SEG with uniform sampling. This usually implies faster convergence during the initial stage. Next, typically, a larger norm of $F_{i} (z^{*})$ implies a larger $L_{i}$ , e.g., $∥ F_{i} (z^{*}) ∥_{2}^{2} \sim L_{i}^{2}$ . In this case, $\hat{σ}_{*}^{2} \leq σ_{*}^{2}$ , because $\hat{σ}_{*}^{2} \sim (\bar{L})^{2}$ and $σ_{*}^{2} \sim \frac{1}{n} \sum_{i = 1}^{n} L_{i}^{2} \geq (\bar{L})^{2}$ . Moreover, one can allow other sampling strategies and cover the case when some $μ_{i}$ are negative, see [47] for the details.
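The comparison of the two variance terms is simple arithmetic and can be checked directly. In the sketch below the Lipschitz constants `L_i` are hypothetical, and the relation $∥ F_{i} (z^{*}) ∥_{2}^{2} = L_{i}^{2}$ is the modelling assumption mentioned above, not a general fact.

```python
import numpy as np

L_i = np.array([1.0, 2.0, 5.0, 20.0])   # hypothetical Lipschitz constants
n = len(L_i)
L_bar = float(L_i.mean())
norms_sq = L_i**2                        # modelling assumption: ||F_i(z*)||^2 = L_i^2

p_uniform = np.full(n, 1.0 / n)
p_importance = L_i / (n * L_bar)         # P{xi = i} = L_i / (n * L_bar)

sigma_star_sq = float(norms_sq.mean())                  # variance term for uniform sampling
sigma_hat_sq = float(np.mean(L_bar / L_i * norms_sq))   # variance term for importance sampling

print(f"L_bar = {L_bar}, sigma_*^2 = {sigma_star_sq}, hat sigma_*^2 = {sigma_hat_sq}")
```

Under this assumption $\hat{σ}_{*}^{2}$ equals $(\bar{L})^{2}$ exactly, and the gap to $σ_{*}^{2}$ grows with the spread of the constants $L_{i}$ .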

4.2 Finite-sum case

As noted earlier, when we deal with problem (14), it is often the case (especially in practical problems) that the distribution $D$ is unknown, but we have some samples from $D$ . Then, one can replace problem (14) by a finite-sum approximation:

F (z) = \frac{1}{n} \sum_{i = 1}^{n} F_{i} (z) .

(22)

This approximation is sometimes also called the Monte Carlo approximation, and for machine learning problems the term empirical risk is often encountered. As before, calls of the full operator $F$ are expensive, but now they are possible. Therefore, it is worth avoiding frequent computation of $F$ and mainly using calls to single operators $F_{i}$ or small batches of them.

Before presenting the results, let us introduce the analogue of the Lipschitzness assumption.

Assumption 5 (Lipschitzness in mean).

The operator $F$ is $L_{avg}$ -Lipschitz continuous in mean, i.e. for all $u, v \in Z$ we have $E [∥ F_{ξ} (u) - F_{ξ} (v) ∥_{*}^{2}] \leq L_{avg}^{2} ∥ u - v ∥^{2} .$

For example, if $F_{i}$ is $L_{i}$ -Lipschitz for all $i$ and we draw index $ξ = i$ with probability $p_{i} = L_{i} / \sum_{j} L_{j}$ , then $L_{avg} = \frac{1}{n} \sum_{j} L_{j}$ .

The study of finite-sum problems in stochastic optimization is connected, first of all, with classical methods for minimization problems such as SVRG [58] and SAGA [28]. For saddle point problems, these methods were adapted in [101] (in fact, these results are also valid for variational inequalities). The authors considered strongly convex–strongly concave saddle point problems in the Euclidean case and proved the following estimates for SVRG and SAGA:

E ∥ z^{k} - z^{*} ∥_{2}^{2} = O (R_{0}^{2} exp (- min {\frac{1}{n}; \frac{μ^{2}}{L_{avg}^{2}}} k)) .

Since the bound above is not tight in terms of $L_{avg} / μ$ , the authors proposed accelerating SVRG and SAGA via the Catalyst envelope [75]. In this case, they have the following bound:

E ∥ z^{k} - z^{*} ∥_{2}^{2} = O (R_{0}^{2} exp (- min {\frac{1}{n}; \frac{μ}{\sqrt{n} L_{avg}}} \frac{k}{log [L_{avg} / μ]})) .

(23)

The same estimates for saddle point problems methods based on accelerating envelopes were also presented in [117].

An important step in the study of the finite-sum stochastic setup was the work [22]. It is primarily focused on bilinear games. For this class of problems, the authors improved the estimate (23) and removed the additional logarithmic factor. For general problems (saddle point and variational inequalities) the results of this paper are very similar to (23) and also have an additional logarithmic factor. Meanwhile, the authors also considered the convex-concave/monotone case in the non-Euclidean setting and got that for their method after $k$ iterations it holds

E [{Gap}_{V I} (\hat{z}^{k})] = \tilde{O} (\frac{\sqrt{n} L_{avg} D_{Z, V}^{2}}{k}) .

(24)

The problem of the additional logarithmic factor was resolved in [2]. The authors proposed a modification of Extragradient:

\begin{matrix} z^{k + 1 / 2} = & P_{Z} (z^{k} + τ (w^{k} - z^{k}) - γ F (w^{k})), \\ Δ^{k} = & F_{ξ^{k}} (z^{k + 1 / 2}) - F_{ξ^{k}} (w^{k}) + F (w^{k}), \\ z^{k + 1} = & P_{Z} (z^{k} + τ (w^{k} - z^{k}) - γ Δ^{k}), \\ w^{k + 1} = & \left\{ \begin{matrix} z^{k + 1}, & \text{with probability } p, \\ w^{k}, & \text{with probability } 1 - p . \end{matrix} \right. \end{matrix}

(25)

This algorithm is a combination of the extra step technique from the VIs theory and the loopless approach [72] for finite-sum problems. An interesting detail of the method is the randomized negative momentum: $τ (w^{k} - z^{k})$ . While for minimization problems it is usual to apply positive/heavy-ball momentum, the opposite approach turns out to be useful for saddle point problems and variational inequalities. This effect was noticed earlier [41, 121, 3] and appeared now in the theory of stochastic methods for VIs. Also, in [2], the authors presented modifications for Forward-Backward, Forward-Reflected-Backward as well as for Extragradient in the non-Euclidean case.
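A minimal sketch of the scheme (25) in the unconstrained Euclidean case may help to see how the pieces interact. Everything specific in it is an assumption made for illustration: the affine operators $F_{i} (z) = A_{i} z - b_{i}$ , the choices $p = 1 / n$ and $τ = p$ , and the step size $γ = \sqrt{p} / (2 L_{avg})$ , which is meant to mimic the parameter regime analysed in [2] but should not be read as the authors' exact prescription. The stochastic difference is evaluated at the extrapolated point $z^{k + 1 / 2}$ .

```python
import numpy as np

J = np.array([[0.0, 1.0], [-1.0, 0.0]])
c = [0.5, 1.0, 1.5, 2.0, 2.5]
n = len(c)
mats = [np.eye(2) + ci * J for ci in c]                         # each F_i is 1-strongly monotone
shifts = [np.array([np.cos(i), np.sin(i)]) for i in range(n)]   # hypothetical vectors b_i
A_bar, b_bar = sum(mats) / n, sum(shifts) / n
z_star = np.linalg.solve(A_bar, b_bar)                          # solution of F(z) = 0


def F_i(i, z):
    return mats[i] @ z - shifts[i]


def F(z):
    return A_bar @ z - b_bar


# Constant from Assumption 5 under uniform sampling.
L_avg = float(np.sqrt(np.mean([np.linalg.norm(M, 2) ** 2 for M in mats])))
p = 1.0 / n
tau = p
gamma = 0.5 * np.sqrt(p) / L_avg


def vr_extragradient(steps, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(2)
    w = z.copy()
    full_w = F(w)                        # F(w^k) is recomputed only when w changes
    for _ in range(steps):
        z_bar = z + tau * (w - z)        # momentum term tau * (w^k - z^k)
        z_half = z_bar - gamma * full_w
        i = rng.integers(n)
        delta = F_i(i, z_half) - F_i(i, w) + full_w
        z = z_bar - gamma * delta
        if rng.random() < p:             # loopless update of the reference point
            w = z.copy()
            full_w = F(w)
    return z


error = float(np.linalg.norm(vr_extragradient(5000) - z_star))
print(f"||z^k - z*|| = {error:.1e}")
```

On average the full operator is evaluated once every $1 / p$ iterations, so with $p = 1 / n$ the expected cost per iteration stays at a constant number of calls to single operators $F_{i}$ .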

As we noted earlier, the results of [2] give estimates (23) and (24), but without additional logarithmic factors. That is, to achieve $E ∥ z^{k} - z^{*} ∥_{2}^{2} \leq ε$ in the strongly monotone case and $E [{Gap}_{V I} (\hat{z}^{k})] \leq ε$ in the monotone case the methods from [2] require

O (max {n; \frac{\sqrt{n} L_{avg}}{μ}} log \frac{R_{0}^{2}}{ε})

(26)

and

O (\frac{\sqrt{n} L_{avg} D_{Z, V}^{2}}{ε})

(27)

stochastic oracle calls, respectively. It remains to discuss the effect of batching on the method from (25), i.e., how the oracle complexity bounds change if at each iteration we use not a single sample $F_{ξ^{k}}$ but a mini-batch of size $b$: $\frac{1}{b} \sum_{i \in S^{k}} F_{i}$, where $S^{k} \subseteq [n]$ is the $b$-element set of indices in the mini-batch. In this case, the methods from [2] give estimates (26) and (27), but multiplied by an additional factor $\sqrt{b}$. This extra multiplier issue was resolved in [68] using the following method:

\begin{aligned}
Δ^{k} &= \frac{1}{b} \sum_{i \in S^{k}} [F_{i} (z^{k}) - F_{i} (w^{k - 1}) + α (F_{i} (z^{k}) - F_{i} (z^{k - 1}))] + F (w^{k - 1}), \\
z^{k + 1} &= P_{Z} (z^{k} + τ (w^{k} - z^{k}) - γ Δ^{k}), \\
w^{k + 1} &= \begin{cases} z^{k + 1}, & \text{with probability } p, \\ w^{k}, & \text{with probability } 1 - p. \end{cases}
\end{aligned}

(28)

The authors proved that in the strongly monotone case this method attains the estimate (26), i.e., without additional logarithmic factors and without factors depending on $b$.

The only issue that remains to be understood is whether the current state-of-the-art methods with the best complexities from [2, 68] are optimal. The lower bounds from [52] show that under Assumptions 5 and 2, the methods above are optimal. However, under $L_{max}$-Lipschitzness of all $F_{i}$, $i \in [n]$, and Assumption 2, the lower bound from [52] is

E ∥ z^{k} - z^{*} ∥_{2}^{2} = Ω (R_{0}^{2} exp (- min {\frac{1}{n}; \frac{μ}{L_{max}}} k)) .

The question whether this lower bound is tight remains open.

4.3 Cocoercivity assumption

In some papers, the following assumption is used instead of Assumption 1.

Assumption 6 (Cocoercivity).

The operator $F$ is $l$ -cocoercive, i.e., for all $u, v \in Z$ we have $∥ F (u) - F (v) ∥_{2}^{2} \leq l ⟨ F (u) - F (v), u - v ⟩ .$

This assumption is stronger than monotonicity + Lipschitzness, i.e., not all monotone Lipschitz operators are cocoercive. One can note that the operator for the bilinear SPP ( ${min}_{x} {max}_{y} x^{⊤} A y$ ) is not cocoercive. But if it is known that $F$ is $L$ -Lipschitz and $μ$ -strongly monotone, then it is $L^{2} / μ$ -cocoercive. Moreover, if we consider a convex $L$ -smooth minimization problem, then the corresponding operator is $L$ -cocoercive.
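The two claims about the bilinear and the strongly monotone cases can be checked numerically on two-dimensional examples. The sketch below is only an illustration: the operators, the helper name `cocoercivity_ratio` and the test points are assumptions chosen so that the arithmetic is exact. For a strongly monotone Lipschitz operator the bound follows from $∥ F (u) - F (v) ∥_{2}^{2} \leq L^{2} ∥ u - v ∥_{2}^{2} \leq (L^{2} / μ) ⟨ F (u) - F (v), u - v ⟩$.

```python
import math


def bilinear_operator(z, a=1.0):
    """Operator F(x, y) = (a*y, -a*x) of the bilinear SPP min_x max_y a*x*y."""
    x, y = z
    return (a * y, -a * x)


def strongly_monotone_operator(z, mu=1.0, a=2.0):
    """Operator F(x, y) = (mu*x + a*y, mu*y - a*x).

    It is mu-strongly monotone and L-Lipschitz with L^2 = mu^2 + a^2.
    """
    x, y = z
    return (mu * x + a * y, mu * y - a * x)


def cocoercivity_ratio(F, u, v):
    """Return ||F(u) - F(v)||^2 / <F(u) - F(v), u - v>.

    An l-cocoercive operator keeps this ratio at most l for all pairs (u, v).
    The function returns math.inf when the inner product is not positive
    while F(u) != F(v), which rules out cocoercivity for every finite l.
    """
    Fu, Fv = F(u), F(v)
    d_f = (Fu[0] - Fv[0], Fu[1] - Fv[1])
    d_z = (u[0] - v[0], u[1] - v[1])
    norm_sq = d_f[0] ** 2 + d_f[1] ** 2
    inner = d_f[0] * d_z[0] + d_f[1] * d_z[1]
    if norm_sq == 0.0:
        return 0.0
    return math.inf if inner <= 0.0 else norm_sq / inner


if __name__ == "__main__":
    u, v = (1.0, 2.0), (0.0, 0.0)
    print(cocoercivity_ratio(bilinear_operator, u, v))           # inner product is zero
    print(cocoercivity_ratio(strongly_monotone_operator, u, v))  # compare with L^2 / mu = 5
```

For the second operator the ratio equals $L^{2} / μ$ at every pair of distinct points, so the constant $L^{2} / μ$ cannot be improved in general.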

There is no need to use Extragradient for cocoercive operators. One can apply the iterative scheme (7) and its modifications for the stochastic case. In spite of this, the first work on cocoercive operators in the stochastic case used Extragradient as the basic method [25]. In that paper, the authors investigated methods for finite-sum problems. The later results from [80, 14] give an almost complete picture of stochastic algorithms based on method (7) for operators under Assumption 6. In particular, the work [14] gives a unified analysis for a large number of popular stochastic methods previously known only for minimization problems [50].

4.4 High-probability convergence

Before this section, we focused on in-expectation convergence guarantees for the stochastic methods, i.e., bounds on $E [{Gap}_{VI} (\hat{z}^{k})]$ and/or $E ∥ z^{k} - z^{*} ∥_{2}^{2}$. However, high-probability convergence guarantees, i.e., bounds on ${Gap}_{VI} (\hat{z}^{k})$ and/or $∥ z^{k} - z^{*} ∥_{2}^{2}$ that hold with probability at least $1 - β$ for a given confidence level $β \in (0, 1)$, reflect the real behavior of the methods more accurately [49]. Despite this fact, high-probability convergence of stochastic methods for solving VIs is studied only in a couple of works.

It is worth mentioning that one can always deduce a high-probability bound from an in-expectation one via Markov’s inequality. However, in this case, the derived rate of convergence depends polynomially on $β^{- 1}$. Such guarantees are not desirable, and the goal is to derive rates that have (poly-)logarithmic dependence on the confidence level, i.e., $β$ should appear only in an $O (poly (\log (1 / β)))$ factor.
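To see where the polynomial dependence comes from, let $X_{k} \geq 0$ denote the error measure after $k$ iterations (the gap or the squared distance to the solution). The constant $C$ and the rate $C / \sqrt{k}$ below are illustrative assumptions, not a statement about a particular method.

```latex
\mathbb{P}\{X_k \geq \varepsilon\} \;\leq\; \frac{\mathbb{E}[X_k]}{\varepsilon} \;\leq\; \beta
\quad \text{whenever} \quad \mathbb{E}[X_k] \leq \beta \varepsilon .
% Example: if E[X_k] <= C / sqrt(k), the condition above requires
k \;\geq\; \frac{C^2}{\beta^2 \varepsilon^2} ,
% so the iteration count scales as beta^{-2} rather than as poly(log(1/beta)).
```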

The first and for many years the only high-probability guarantees of this type for solving stochastic VIs were derived in [59]. The authors assume that $F$ is monotone and $L$-Lipschitz, the domain is bounded, and $F_{ξ}$ is an unbiased estimator with sub-Gaussian (light) tails of the distribution:

E \left[\exp \left(\frac{∥ F_{ξ} (z) - F (z) ∥_{2}^{2}}{σ^{2}}\right)\right] \leq \exp (1) \quad \text{for all } z \in Z .

The above condition is much stronger than Assumption 3. Under these assumptions, the authors of [59] prove that after $k$ iterations of Mirror-Prox with probability at least $1 - β$ (for any $β \in (0, 1)$) the following inequality holds:

{Gap}_{VI} (\hat{z}^{k}) = O \left(\frac{L D_{Z}^{2}}{k} + \frac{σ D_{Z} \log (1 / β)}{\sqrt{k}}\right) .

Up to the logarithmic factor, this result coincides with the in-expectation one and, thus, it is optimal (up to the logarithms). However, the result is derived under the restrictive light-tails assumption.

This limitation was recently addressed in [48], where the authors derived high-probability rates for the considered problem under only the bounded variance assumption. In particular, they consider clipped-SEG for problems with $Z = R^{d}$:

\begin{aligned}
z^{k + 1 / 2} &= z^{k} - γ \cdot clip (F_{ξ^{k}} (z^{k}), λ_{k}), \\
z^{k + 1} &= z^{k} - γ \cdot clip (F_{ξ^{k + 1 / 2}} (z^{k + 1 / 2}), λ_{k + 1 / 2}),
\end{aligned}

where $clip (x, λ) = \min \{1, λ / ∥ x ∥_{2}\} x$ is the clipping operator, a popular tool in deep learning [102, 45]. In the setup when $F$ is monotone and $L$-Lipschitz and Assumption 3 holds, the authors of [48] prove that after $k$ iterations of clipped-SEG with probability at least $1 - β$ (for any $β \in (0, 1)$) the following inequality holds:

{Gap}_{VI} (\hat{z}^{k}) = O \left(\frac{L R_{0}^{2} \log (k / β)}{k} + \frac{σ R_{0} \sqrt{\log (k / β)}}{\sqrt{k}}\right) .

Up to the differences in logarithmic factors, the definition of $σ$, and the difference between $D_{Z}$ and $R_{0}$, the rate coincides with the one from [59] while being derived without the light-tails assumption. The key algorithmic tool that allows removing the light-tails assumption is clipping: with a proper choice of the clipping level $λ$ the authors cut heavy tails without making the bias too large. It is worth mentioning that the result for clipped-SEG is derived for the unconstrained case and the rate depends on $R_{0}$, while in [59] the analysis relies on the boundedness of the domain, whose diameter explicitly appears in the rate. To remove the dependence on the diameter of the domain, the authors of [48] show that with high probability the iterates produced by clipped-SEG stay in the ball around $z^{*}$ with radius proportional to $R_{0}$. Using this trick, the authors of [48] also show that it is sufficient to assume everything (monotonicity and Lipschitzness of $F$ and bounded variance) only on this ball. Such generality allows them to cover problems that are non-Lipschitz on $R^{d}$ (e.g., some monotone polynomially-growing operators) and also the situation when the variance is bounded only on a compact set, which is common for many finite-sum problems. Finally, [48] contains high-probability convergence results for strongly monotone VIs and VIs with structured non-monotonicity.
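A single clipped-SEG iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `oracle` is a hypothetical callable that returns a fresh stochastic sample $F_{ξ} (z)$ on every call, vectors are plain lists, and the choice of clipping levels and step size, which is the substance of the analysis in [48], is left to the caller.

```python
import math


def clip(x, lam):
    """Clipping operator clip(x, lam) = min{1, lam / ||x||_2} * x."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm <= lam:  # also covers the zero vector, where no scaling is needed
        return list(x)
    scale = lam / norm
    return [scale * c for c in x]


def clipped_seg_step(z, oracle, gamma, lam, lam_half):
    """One iteration of clipped-SEG for the unconstrained problem Z = R^d.

    `oracle(z)` is assumed to draw an independent sample on each call, so the
    two calls below correspond to xi^k and xi^{k+1/2}.
    """
    # Extrapolation step with the clipped stochastic operator at z^k.
    g = clip(oracle(z), lam)
    z_half = [zi - gamma * gi for zi, gi in zip(z, g)]
    # Update step from z^k with the clipped operator at z^{k+1/2}.
    g_half = clip(oracle(z_half), lam_half)
    return [zi - gamma * gi for zi, gi in zip(z, g_half)]


if __name__ == "__main__":
    # Noise-free bilinear operator F(x, y) = (y, -x), used only to show the call.
    def oracle(z):
        return [z[1], -z[0]]

    print(clipped_seg_step([1.0, 0.0], oracle, gamma=0.5, lam=10.0, lam_half=10.0))
```

Clipping rescales only samples whose norm exceeds the level, so the direction of the sampled operator is preserved while its magnitude is bounded by $λ$.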

5 Recent advances

In this section, we collect a few recent theoretical advances with practical impacts.

5.1 Saddle point problems with different constants of strong convexity and strong concavity

Interest in saddle point problems with different constants of strong convexity and strong concavity appeared a few years ago, see e.g., [4, 76]. However, even for the particular case

\min_{x \in R^{d_{x}}} \max_{y \in R^{d_{y}}} g (x, y) = f (x) + y^{⊤} A x - h (y),

where function $f$ is $μ_{x}$ -strongly convex ( $μ_{x} > 0$ ) and $L_{x}$ -smooth, and function $h$ is $μ_{y}$ -strongly convex ( $μ_{y} > 0$ ) and $L_{y}$ -smooth, optimal algorithms have appeared only recently [71, 115, 57]. These algorithms have the following convergence rates

O \left(\left(\sqrt{\frac{L_{x}}{μ_{x}}} + \sqrt{\frac{λ_{max} (A^{⊤} A)}{μ_{x} μ_{y}}} + \sqrt{\frac{L_{y}}{μ_{y}}}\right) \log \frac{1}{ε}\right),

(29)

and attain the lower bound obtained in [123, 55] (here we need to assume that $λ_{min} (A^{⊤} A) \leq \sqrt{μ_{x} μ_{y}}$; without this assumption optimal methods are unknown).
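As a small worked example of the bound (29), the following Python sketch evaluates its three condition-number terms. The function names and the numerical constants are illustrative assumptions, and the hidden constant in $O (\cdot)$ is ignored.

```python
import math


def condition_term(L_x, mu_x, L_y, mu_y, lam_max):
    """Sum of the three square-root terms in (29).

    `lam_max` stands for the largest eigenvalue of A^T A and is supplied
    by the caller.
    """
    return (
        math.sqrt(L_x / mu_x)
        + math.sqrt(lam_max / (mu_x * mu_y))
        + math.sqrt(L_y / mu_y)
    )


def iteration_bound(L_x, mu_x, L_y, mu_y, lam_max, eps):
    """Value of the bound (29) with the constant in O(.) set to one."""
    return condition_term(L_x, mu_x, L_y, mu_y, lam_max) * math.log(1.0 / eps)


if __name__ == "__main__":
    # Illustrative constants: the terms are sqrt(100) + sqrt(25) + sqrt(4).
    print(condition_term(100.0, 1.0, 4.0, 1.0, 25.0))
    print(iteration_bound(100.0, 1.0, 4.0, 1.0, 25.0, eps=1e-6))
```

With these constants the primal term dominates, which shows how the three sources of ill-conditioning (in $f$, in the coupling matrix $A$, and in $h$) enter the bound additively.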

Note that the algorithm from [57] was built upon the technique related to the analysis of primal-dual Extragradient methods via relative Lipschitzness [114, 27]. As a by-product, this technique makes it possible to obtain Nesterov’s accelerated method as a particular case of primal-dual Extragradient method with relative Lipschitzness [27].

For the non-bilinear SPP, optimal methods, based on the accelerated Monteiro–Svaiter proximal envelope, were developed only in the non-composite case [70, 21]. For the non-bilinear SPP with composite terms, there is a poly-logarithmic gap between the lower bound and the best known upper bounds [117]. A gap also appears for the SPP having the stochastic finite-sum structure [117, 81, 57]. The stochastic setting with bounded variance was considered in [124, 84, 31].

Further deterministic "cutting-plane" improvements rely on additional assumptions about the small dimension of the vectors $x$ and/or $y$ [90, 42, 43] or on different structural (e.g., SPP on balls in $1$/$\infty$-norms) and sparsity assumptions, see e.g., [23, 111, 110] and references therein. Lower bounds here are mostly unknown.

Note that in this subsection we mentioned many works with (sub-)optimal algorithms for different variants of SPP. In contrast to convex optimization, where the oracle call is uniquely associated with the gradient call $\nabla f (x)$, for SPP we have two criteria: the number of $\nabla_{x} g (x, y)$-calls and the number of $\nabla_{y} g (x, y)$-calls (and more variants for SPP with composites). “Optimality” in most of the aforementioned papers means that the method is optimal according to the worst of the criteria. In [4, 117], the authors consider these criteria separately. However, the development of lower bounds and optimal methods for a multi-criterion setup is still an open problem.

5.2 Adaptive methods for VI and SPP

Interest in adaptive algorithms for stochastic convex optimization emerged mainly after the development of AdaGrad [32] in 2011 and, later, Adam [62]. For variational inequalities and saddle point problems, interest in adaptive methods has appeared only in the last few years, see e.g., [39, 7] (see also [60]). By now, this area of research is well developed. One can note works devoted to both adaptive step sizes [5, 34, 33, 113, 116] and adaptive scaling/preconditioning [79, 30, 12]. Approaches from the second group are based on the idea of a proper combination of AdaGrad/Adam with Extragradient or its modifications. None of the mentioned adaptive methods has better theoretical convergence rates than its non-adaptive analogue (typically the rates are the same), but they require less input information or demonstrate better performance in practice.

5.3 Quasi-Newton and tensor methods for VI and SPP

Quasi-Newton methods for solving nonlinear equations (unconstrained VI) and SPP are proposed in [74, 120] and [78], respectively. In these papers, the authors derive local superlinear convergence rates for modifications of Broyden-type methods for solving nonlinear equations with Lipschitz Jacobian and SPP with Lipschitz Hessian. Stochastic versions of these methods for VI and SPP have not yet been developed.

Tensor methods for convex optimization problems are currently quite well developed. In particular, starting with [98] it has been shown that optimal second- and third-order methods can be implemented with almost the same per-iteration complexity as the Newton method [88, 96, 38]. Moreover, optimal $p$-order methods (that use $p$-order derivatives) significantly improve the rate of convergence from $k^{- 2}$ to $k^{- (3 p + 1) / 2}$ [69, 20]. For VI and SPP, the interest was initiated by [93, 87], and optimal $p$-order methods improve the rate of convergence from $k^{- 1}$ to $k^{- (p + 1) / 2}$ [1, 77] (for $k^{- 1}$, see Theorem 3). Since the exponent grows more slowly with $p$ than in convex optimization, the use of tensor methods for sufficiently smooth monotone VIs and convex-concave saddle point problems is not expected to be as effective. Note that in [1, 77] one can also find optimal rates for strongly monotone VIs and strongly convex-concave saddle point problems. Stochastic tensor methods for variational inequalities and saddle point problems have not yet been developed.

5.4 Convergence in terms of the gradient norm for SPP

Several recent advances in the development of optimal algorithms are based on accelerated proximal envelopes with proper stopping rules for inner loop algorithms [70, 69, 67, 108]. Such a rule is built upon the norm of the gradient calculated for the target function of the inner problem.

For smooth convex optimization problems, Yu. Nesterov in 2012 posed the problem of making the gradient norm small with the same rate of convergence as the gap in the function value, i.e., proportional to $k^{- 2}$ [95]. To address this problem, in the same paper, he proposed an optimal (up to a logarithmic factor) algorithm. This question was further investigated, including obtaining optimal results without additional logarithmic factors [61, 97] (see also [29] for explanations and a survey). In the stochastic case, algorithms were presented in [37].

For smooth convex-concave saddle point problems an optimal algorithm with $∥ \nabla_{x, y} f (x^{k}, y^{k}) ∥_{2}$ proportional to $k^{- 1}$ was proposed in [121] (see also [29] and [70] for monotone inclusion). For the stochastic case see [73, 19, 26].

5.5 Decentralized VI and SPP

In practice, distributed methods are usually applied to solve the variational inequality problem more efficiently and quickly. In particular, methods that work on arbitrary (possibly time-varying) decentralized communication networks between computing devices are popular.

While the field of decentralized algorithms for minimization problems has been extensively investigated, results for broader classes of problems have only begun to appear in recent years. Such works are primarily focused on saddle point problems [89, 16, 107, 17, 15], but we note that most of these results can easily be extended to variational inequalities. Let us emphasize two works that were devoted directly to VIs. In the paper [13] the authors proposed a decentralized method with local steps, and [68] gave optimal decentralized methods for stochastic (finite-sum) variational inequalities on fixed and time-varying networks.

Acknowledgement.

The work was supported by the Russian Science Foundation (project No. 21-71-30005).

References

[1] Deeksha Adil, Brian Bullins, Arun Jambulapati, and Sushant Sachdeva. Line search-free methods for higher-order smooth monotone variational inequalities. arXiv preprint arXiv:2205.06167, 2022.
[2] Ahmet Alacaoglu and Yura Malitsky. Stochastic variance reduction for variational inequality methods. arXiv preprint arXiv:2102.08352, 2021.
[3] Ahmet Alacaoglu, Yura Malitsky, and Volkan Cevher. Forward-reflected-backward method with variance reduction. Computational Optimization and Applications, 80, 11 2021.
[4] Mohammad S Alkousa, Alexander Vladimirovich Gasnikov, Darina Mikhailovna Dvinskikh, Dmitry A Kovalev, and Fedor Sergeevich Stonyakin. Accelerated methods for saddle-point problem. Computational Mathematics and Mathematical Physics, 60(11):1787–1809, 2020.
[5] Kimon Antonakopoulos, E. Veronica Belmega, and Panayotis Mertikopoulos. Adaptive extra-gradient methods for min-max optimization and games. arXiv preprint arXiv:2010.12100, 2020.
[6] K.J. Arrow, L. Hurwicz, and H. Uzawa. Studies in linear and nonlinear programming. Stanford University Press, 1958.
[7] Francis Bach and Kfir Y. Levy. A universal algorithm for variational inequalities adaptive to smoothness and noise. In Conference on learning theory, pages 164–194. PMLR, 2019.
[8] Francis Bach, Julien Mairal, and Jean Ponce. Convex sparse matrix factorizations. arXiv preprint arXiv:0812.1869, 2008.
[9] A.B. Bakushinskii and B.T. Polyak. On the solution of variational inequalities. Soviet Math. Doklady, 15(6):1705–1710, 1974.
[10] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, 2009.
[11] A. Beznosikov, A. Gasnikov, K. Zainulina, A. Maslovskiy, and D. Pasechnyuk. A unified analysis of variational inequality methods: Variance reduction, sampling, quantization and coordinate descent. arXiv preprint arXiv:2201.12206, 2022.
[12] Aleksandr Beznosikov, Aibek Alanov, Dmitry Kovalev, Martin Takáč, and Alexander Gasnikov. On scaled methods for saddle point problems. arXiv preprint arXiv:2206.08303, 2022.
[13] Aleksandr Beznosikov, Pavel Dvurechensky, Anastasia Koloskova, Valentin Samokhin, Sebastian U Stich, and Alexander Gasnikov. Decentralized local stochastic extra-gradient for variational inequalities. arXiv preprint arXiv:2106.08315, 2021.
[14] Aleksandr Beznosikov, Eduard Gorbunov, Hugo Berard, and Nicolas Loizou. Stochastic gradient descent-ascent: Unified theory and new efficient methods. arXiv preprint arXiv:2202.07262, 2022.
[15] Aleksandr Beznosikov, Alexander Rogozin, Dmitry Kovalev, and Alexander Gasnikov. Near-optimal decentralized algorithms for saddle point problems over time-varying networks. Lecture Notes in Computer Science, pages 246–257, 2021.
[16] Aleksandr Beznosikov, Valentin Samokhin, and Alexander Gasnikov. Distributed saddle-point problems: Lower bounds, optimal and robust algorithms. arXiv preprint arXiv:2010.13112, 2020.
[17] Aleksandr Beznosikov, Gesualdo Scutari, Alexander Rogozin, and Alexander Gasnikov. Distributed saddle-point problems under data similarity. Advances in Neural Information Processing Systems, 34, 2021.
[18] F. E. Browder. Existence and approximation of solutions of nonlinear variational inequalities. Proceedings of the National Academy of Sciences, 56(4):1080–1086, 1966.
[19] Xufeng Cai, Chaobing Song, Cristóbal Guzmán, and Jelena Diakonikolas. A stochastic halpern iteration with variance reduction for stochastic monotone inclusion problems. arXiv preprint arXiv:2203.09436, 2022.
[20] Yair Carmon, Danielle Hausler, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Optimal and adaptive monteiro-svaiter acceleration. arXiv preprint arXiv:2205.15371, 2022.
[21] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Recapp: Crafting a more efficient catalyst for convex optimization. In International Conference on Machine Learning, pages 2658–2685. PMLR, 2022.
[22] Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. arXiv preprint arXiv:1907.02056, 2019.
[23] Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Coordinate methods for matrix games. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 283–293. IEEE, 2020.
[24] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40(1):120–145, 2011.
[25] Tatjana Chavdarova, Gauthier Gidel, François Fleuret, and Simon Lacoste-Julien. Reducing noise in gan training with variance reduced extragradient. Advances in Neural Information Processing Systems, 32, 2019.
[26] Lesi Chen and Luo Luo. Near-optimal algorithms for making the gradient small in stochastic minimax optimization. arXiv preprint arXiv:2208.05925, 2022.
[27] Michael B. Cohen, Aaron Sidford, and Kevin Tian. Relative lipschitzness in extragradient methods and a direct recipe for acceleration. arXiv preprint arXiv:2011.06572, 2020.
[28] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
[29] Jelena Diakonikolas and Puqian Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668–1697, 2022.
[30] Zehao Dou and Yuanzhi Li. On the one-sided convergence of adam-type algorithms in non-convex non-concave min-max optimization. arXiv preprint arXiv:2109.14213, 2021.
[31] Simon S Du, Gauthier Gidel, Michael I Jordan, and Chris Junchi Li. Optimal extragradient-based bilinearly-coupled saddle-point optimization. arXiv preprint arXiv:2206.08573, 2022.
[32] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
[33] Alina Ene and Huy Lê Nguyen. Adaptive and universal algorithms for variational inequalities with optimal convergence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6559–6567, 2022.
[34] Alina Ene, Huy L Nguyen, and Adrian Vladu. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7314–7321, 2021.
[35] Ernie Esser, Xiaoqun Zhang, and Tony F Chan. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences, 3(4):1015–1046, 2010.
[36] Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Series in Operations Research. Springer, 2003.
[37] Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, and Blake Woodworth. The complexity of making the gradient small in stochastic convex optimization. In Conference on Learning Theory, pages 1319–1345. PMLR, 2019.
[38] Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, César A Uribe, Bo Jiang, Haoyue Wang, Shuzhong Zhang, Sébastien Bubeck, et al. Near optimal methods for minimizing convex functions with lipschitz $p$ -th derivatives. In Conference on Learning Theory, pages 1392–1393. PMLR, 2019.
[39] Alexander Vladimirovich Gasnikov, Pavel Evgenievich Dvurechensky, Fedor Sergeevich Stonyakin, and Aleksandr Aleksandrovich Titov. An adaptive proximal method for variational inequalities. Computational Mathematics and Mathematical Physics, 59(5):836–841, 2019.
[40] Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018.
[41] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1802–1811. PMLR, 2019.
[42] Egor Gladin, Ilya Kuruzov, Fedor Stonyakin, Dmitry Pasechnyuk, Mohammad Alkousa, and Alexander Gasnikov. Solving strongly convex-concave composite saddle point problems with a small dimension of one of the variables. Sbornik: Mathematics, 2022.
[43] Egor Gladin, Abdurakhmon Sadiev, Alexander Gasnikov, Pavel Dvurechensky, Aleksandr Beznosikov, and Mohammad Alkousa. Solving smooth min-min and min-max problems by mixed oracle algorithms. In International Conference on Mathematical Optimization Theory and Operations Research, pages 19–40. Springer, 2021.
[44] E.G. Gol’shtein. On convergence of the gradient method for search of saddle point of modified lagrange functions. Ekonomika Mat. Metody, 13(2):322–329, 1977.
[45] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[46] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[47] Eduard Gorbunov, Hugo Berard, Gauthier Gidel, and Nicolas Loizou. Stochastic extragradient: General analysis and improved rates. In International Conference on Artificial Intelligence and Statistics, pages 7865–7901. PMLR, 2022.
[48] Eduard Gorbunov, Marina Danilova, David Dobre, Pavel Dvurechensky, Alexander Gasnikov, and Gauthier Gidel. Clipped stochastic methods for variational inequalities with heavy-tailed noise. arXiv preprint arXiv:2206.01095, 2022.
[49] Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042–15053, 2020.
[50] Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of sgd: Variance reduction, sampling, quantization and coordinate descent. In International Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020.
[51] Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General Analysis and Improved Rates. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5200–5209, 2019.
[52] Yuze Han, Guangzeng Xie, and Zhihua Zhang. Lower complexity bounds of finite-sum optimization problems: The results and construction. arXiv preprint arXiv:2103.08280, 2021.
[53] Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. On the convergence of single-call stochastic extra-gradient methods. Advances in Neural Information Processing Systems, 32, 2019.
[54] Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. Advances in Neural Information Processing Systems, 33:16223–16234, 2020.
[55] Adam Ibrahim, Waïss Azizian, Gauthier Gidel, and Ioannis Mitliagkas. Linear lower bounds and conditioning of differentiable games. In International conference on machine learning, pages 4583–4593. PMLR, 2020.
[56] Yujia Jin and Aaron Sidford. Efficiently solving MDPs with stochastic mirror descent. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119, pages 4890–4900. PMLR, 2020.
[57] Yujia Jin, Aaron Sidford, and Kevin Tian. Sharper rates for separable minimax and finite sum optimization via primal-dual extragradient methods. arXiv preprint arXiv:2202.04640, 2022.
[58] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
[59] Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
[60] Evgenii Nikolaevich Khobotov. Modification of the extra-gradient method for solving variational inequalities and certain optimization problems. USSR Computational Mathematics and Mathematical Physics, 27(5):120–127, 1987.
[61] Donghwan Kim and Jeffrey A Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of optimization theory and applications, 188(1):192–219, 2021.
[62] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, pages 2305–2313, 2015.
[63] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:35–49, 1977; Russian original: Ekonomika Mat. Metody, 12(4):747–756, 1976.
[64] G. M. Korpelevich. Extrapolation gradient methods and relation to modified lagrangeans. Ekonomika Mat. Metody, 19(4):694–703, 1983.
[65] Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, i: operator extrapolation. arXiv preprint arXiv:2011.02987, 2020.
[66] Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, ii: Markovian noise and policy evaluation in reinforcement learning. arXiv preprint arXiv:2011.08434, 2020.
[67] Dmitry Kovalev, Aleksandr Beznosikov, Ekaterina Borodich, Alexander Gasnikov, and Gesualdo Scutari. Optimal gradient sliding and its application to distributed optimization under similarity. arXiv preprint arXiv:2205.15136, 2022.
[68] Dmitry Kovalev, Aleksandr Beznosikov, Abdurakhmon Sadiev, Michael Persiianov, Peter Richtárik, and Alexander Gasnikov. Optimal algorithms for decentralized stochastic variational inequalities. arXiv preprint arXiv:2202.02771, 2022.
[69] Dmitry Kovalev and Alexander Gasnikov. The first optimal acceleration of high-order methods in smooth convex optimization. arXiv preprint arXiv:2205.09647, 2022.
[70] Dmitry Kovalev and Alexander Gasnikov. The first optimal algorithm for smooth and strongly-convex-strongly-concave minimax optimization. arXiv preprint arXiv:2205.05653, 2022.
[71] Dmitry Kovalev, Alexander Gasnikov, and Peter Richtárik. Accelerated primal-dual gradient method for smooth and convex-concave saddle-point problems with bilinear coupling. arXiv preprint arXiv:2112.15199, 2021.
[72] Dmitry Kovalev, Samuel Horváth, and Peter Richtárik. Don’t jump through hoops and remove those loops: Svrg and katyusha are better without the outer loop. In Aryeh Kontorovich and Gergely Neu, editors, Proceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117 of Proceedings of Machine Learning Research, pages 451–467. PMLR, 08 Feb–11 Feb 2020.
[73] Sucheol Lee and Donghwan Kim. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 34:22588–22600, 2021.
[74] Dachao Lin, Haishan Ye, and Zhihua Zhang. Explicit superlinear convergence rates of broyden’s methods in nonlinear equations. arXiv preprint arXiv:2109.01974, 2021.
[75] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. Advances in neural information processing systems, 28, 2015.
[76] Tianyi Lin, Chi Jin, and Michael I Jordan. Near-optimal algorithms for minimax optimization. In Conference on Learning Theory, pages 2738–2779. PMLR, 2020.
[77] Tianyi Lin, Michael Jordan, et al. Perseus: A simple high-order regularization method for variational inequalities. arXiv preprint arXiv:2205.03202, 2022.
[78] Chengchang Liu and Luo Luo. Quasi-newton methods for saddle point problems. arXiv preprint arXiv:2111.02708, 2021.
[79] Mingrui Liu, Youssef Mroueh, Jerret Ross, Wei Zhang, Xiaodong Cui, Payel Das, and Tianbao Yang. Towards better understanding of adaptive gradient algorithms in generative adversarial nets. arXiv preprint arXiv:1912.11940, 2019.
[80] Nicolas Loizou, Hugo Berard, Gauthier Gidel, Ioannis Mitliagkas, and Simon Lacoste-Julien. Stochastic gradient descent-ascent and consensus optimization for smooth games: Convergence analysis under expected co-coercivity. Advances in Neural Information Processing Systems, 34:19095–19108, 2021.
[81] Luo Luo, Guangzeng Xie, Tong Zhang, and Zhihua Zhang. Near optimal stochastic algorithms for finite-sum unbalanced convex-concave minimax optimization. arXiv preprint arXiv:2106.01761, 2021.
[82] Yura Malitsky and Matthew K. Tam. A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM Journal on Optimization, 30(2):1451–1472, 2020.
[83] Bernard Martinet. Régularisation d’inéquations variationnelles par approximations successives. Revue française d’informatique et de recherche opérationnelle, 1970.
[84] Dmitriy Metelev, Alexander Rogozin, Alexander Gasnikov, and Dmitry Kovalev. Decentralized saddle-point problems with different constants of strong convexity and strong concavity. arXiv preprint arXiv:2206.00090, 2022.
[85] Konstantin Mishchenko, Dmitry Kovalev, Egor Shulgin, Peter Richtárik, and Yura Malitsky. Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, pages 4573–4582. PMLR, 2020.
[86] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pages 1497–1507. PMLR, 2020.
[87] Renato DC Monteiro and Benar F Svaiter. Iteration-complexity of a Newton proximal extragradient method for monotone variational inequalities and inclusion problems. SIAM Journal on Optimization, 22(3):914–935, 2012.
[88] Renato DC Monteiro and Benar Fux Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013.
[89] Soham Mukherjee and Mrityunjoy Chakraborty. A decentralized algorithm for large scale min-max problems. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 2967–2972, 2020.
[90] Arkadi Nemirovski. Efficient methods in convex programming. Lecture notes, 1994.
[91] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
[92] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[93] Yurii Nesterov. Cubic regularization of Newton’s method for convex problems with constraints. 2006.
[94] Yurii Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2):319–344, 2007.
[95] Yurii Nesterov. How to make the gradients small. Optima. Mathematical Optimization Society Newsletter, (88):10–11, 2012.
[96] Yurii Nesterov. Implementable tensor methods in unconstrained convex optimization. Mathematical Programming, 186(1):157–183, 2021.
[97] Yurii Nesterov, Alexander Gasnikov, Sergey Guminov, and Pavel Dvurechensky. Primal–dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):773–810, 2021.
[98] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
[99] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 2681–2690. PMLR, 2017.
[100] Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, 185(1):1–35, 2021.
[101] Balamurugan Palaniappan and Francis Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424, 2016.
[102] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
[103] B. Polyak. Introduction to optimization. Optimization Software, 1987.
[104] L. D. Popov. A modification of the Arrow-Hurwicz method for search of saddle points. Mathematical notes of the Academy of Sciences of the USSR, 28:845–848, 1980.
[105] R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
[106] R.T. Rockafellar. Convex functions, monotone operators and variational inequalities. Theory and Applications of Monotone Operators, pages 13–65, 1969.
[107] Alexander Rogozin, Alexander Beznosikov, Darina Dvinskikh, Dmitry Kovalev, Pavel Dvurechensky, and Alexander Gasnikov. Decentralized distributed optimization for saddle point problems. arXiv preprint arXiv:2102.07758, 2021.
[108] Abdurakhmon Sadiev, Dmitry Kovalev, and Peter Richtárik. Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact prox. arXiv preprint arXiv:2207.03957, 2022.
[109] M. Sibony. Méthodes itératives pour les équations et inéquations aux dérivées partielles non linéaires de type monotone. Calcolo, 7(1):65–183, 1970.
[110] Chaobing Song, Cheuk Yin Lin, Stephen J Wright, and Jelena Diakonikolas. Coordinate linear variance reduction for generalized linear programming. arXiv preprint arXiv:2111.01842, 2021.
[111] Chaobing Song, Stephen J Wright, and Jelena Diakonikolas. Variance reduction via primal-dual accelerated dual averaging for nonsmooth convex finite-sums. In International Conference on Machine Learning, pages 9824–9834. PMLR, 2021.
[112] G. Stampacchia. Formes bilinéaires coercitives sur les ensembles convexes. Académie des Sciences de Paris, 258:4413–4416, 1964.
[113] Fedor Stonyakin, Alexander Gasnikov, Pavel Dvurechensky, Alexander Titov, and Mohammad Alkousa. Generalized mirror prox algorithm for monotone variational inequalities: Universality and inexact oracle. Journal of Optimization Theory and Applications, pages 1–26, 2022.
[114] Fedor Stonyakin, Alexander Tyurin, Alexander Gasnikov, Pavel Dvurechensky, Artem Agafonov, Darina Dvinskikh, Mohammad Alkousa, Dmitry Pasechnyuk, Sergei Artamonov, and Victorya Piskunova. Inexact model: A framework for optimization and variational inequalities. Optimization Methods and Software, pages 1–47, 2021.
[115] Kiran Koshy Thekumparampil, Niao He, and Sewoong Oh. Lifted primal-dual method for bilinearly coupled smooth minimax optimization. arXiv preprint arXiv:2201.07427, 2022.
[116] A. A. Titov, S. S. Ablaev, M. S. Alkousa, F. S. Stonyakin, and A. V. Gasnikov. Some adaptive first-order methods for variational inequalities with relatively strongly monotone operators and generalized smoothness. arXiv preprint arXiv:2207.09544, 2022.
[117] Vladislav Tominin, Yaroslav Tominin, Ekaterina Borodich, Dmitry Kovalev, Alexander Gasnikov, and Pavel Dvurechensky. On accelerated methods for saddle-point problems with composite structure. arXiv preprint arXiv:2103.09344, 2021.
[118] Paul Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60(1-2):237–252, 1995.
[119] Paul Tseng. A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization, 38(2):431–446, 2000.
[120] Haishan Ye, Dachao Lin, and Zhihua Zhang. Greedy and random Broyden’s methods with explicit superlinear convergence rates in nonlinear equations. arXiv preprint arXiv:2110.08572, 2021.
[121] TaeHo Yoon and Ernest K Ryu. Accelerated algorithms for smooth convex-concave minimax problems with O(1/k^2) rate on squared gradient norm. In International Conference on Machine Learning, pages 12098–12109. PMLR, 2021.
[122] Guodong Zhang, Yuanhao Wang, Laurent Lessard, and Roger B Grosse. Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization. In International Conference on Artificial Intelligence and Statistics, pages 7659–7679. PMLR, 2022.
[123] Junyu Zhang, Mingyi Hong, and Shuzhong Zhang. On lower iteration complexity bounds for the convex concave saddle point problems. Mathematical Programming, pages 1–35, 2021.
[124] Xuan Zhang, Necdet Serhat Aybat, and Mert Gurbuzbalaban. Robust accelerated primal-dual methods for computing saddle points. arXiv preprint arXiv:2111.12743, 2021.