Smooth Monotone Stochastic Variational Inequalities and Saddle Point Problems: A Survey

This paper is a survey of methods for solving smooth (strongly) monotone stochastic variational inequalities. To begin with, we give the deterministic foundation from which the stochastic methods eventually evolved. Then we review methods for the general stochastic formulation and look at the finite-sum setup. The final parts of the paper are devoted to various recent (not necessarily stochastic) advances in algorithms for variational inequalities.


Introduction
In their long, more than half-century history of study [113], variational inequalities have become one of the most popular and universal optimization formulations. Variational inequalities are used in various areas of applied mathematics. Here we can highlight both classic examples from game theory, economics, operator theory, and convex analysis [6,19,107,110,113], and newer and even more recent applications in optimization and machine learning: non-smooth optimization [93], unsupervised learning [9,36,25], robust/adversarial optimization [11], GANs [47], and reinforcement learning [100,57]. Modern times present new challenges to the community. The increase in the scale of problems and the desire to speed up solution processes have led to huge interest in stochastic formulations of applied tasks, including variational inequalities. A survey of stochastic methods for solving variational inequalities is the subject of this paper.

Structure of the paper. In Section 2, we give a formal statement of the variational inequality problem, basic examples, and main assumptions. Section 3 deals with deterministic methods, from which the stochastic methods have been developed. Section 4 covers the stochastic methods. Section 5 is devoted to recent advances in (not necessarily stochastic) variational inequalities and saddle point problems.
2 Problem: setting and assumptions

Notation. We use ⟨x, y⟩ := Σ_{i=1}^d x_i y_i to denote the standard inner product of x, y ∈ R^d, where x_i corresponds to the i-th component of x in the standard basis of R^d. It induces the ℓ_2-norm in R^d via ‖x‖_2 := √⟨x, x⟩. We denote the ℓ_p-norm as ‖x‖_p := (Σ_{i=1}^d |x_i|^p)^{1/p} for p ∈ [1, ∞), and for p = ∞ we use ‖x‖_∞ := max_{1≤i≤d} |x_i|. The dual norm ‖·‖_* of a norm ‖·‖ is defined in the following way: ‖y‖_* := max{⟨x, y⟩ | ‖x‖ ≤ 1}. The operator E[·] denotes full mathematical expectation. Finally, we use the O, Ω-notation to hide numerical constants which do not depend on any problem parameter, and the Õ, Ω̃-notation to hide numerical constants and logarithmic factors.
We study variational inequalities (VI) of the form

Find z^* ∈ Z such that ⟨F(z^*), z − z^*⟩ ≥ 0 for all z ∈ Z, (1)

where F : Z → R^d is an operator and Z ⊆ R^d is a convex set.
To emphasise the extensiveness of the formalism (1), we give some examples of variational inequalities arising in applied science.
Example 1 (Minimization). Consider the minimization problem:

min_{z∈Z} f(z). (2)

Suppose that F(z) := ∇f(z). Then, if f is convex, it can be proved that z^* ∈ Z is a solution of (1) if and only if z^* ∈ Z is a solution of (2).
Example 2 (Saddle point problem). Consider the saddle point problem (SPP):

min_{x∈X} max_{y∈Y} g(x, y). (3)

Suppose that F(z) := F(x, y) = [∇_x g(x, y), −∇_y g(x, y)] and Z = X × Y with X ⊆ R^{d_x}, Y ⊆ R^{d_y}. Then, if g is convex-concave, it can be proved that z^* ∈ Z is a solution of (1) if and only if z^* ∈ Z is a solution of (3).
The study of saddle point problems is often associated with variational inequalities.
Example 3 (Fixed point problem). Consider the fixed point problem:

Find z^* ∈ R^d such that T(z^*) = z^*, (4)

where T : R^d → R^d is an operator. With F(z) = z − T(z), it can be proved that z^* ∈ Z = R^d is a solution of (1) if and only if F(z^*) = 0, i.e., z^* ∈ R^d is a solution of (4).
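The three reductions above can be checked numerically. Below is a toy sketch (the concrete operators are made up for illustration) on Z = R^d, where the VI condition (1) reduces to F(z^*) = 0.

```python
import numpy as np

# Toy instances of the three reductions; on Z = R^d the VI condition
# <F(z*), z - z*> >= 0 for all z reduces to F(z*) = 0.

F_min = lambda z: z - 1.0                    # Example 1: f(z) = ||z - 1||^2 / 2
F_spp = lambda z: np.array([z[1], -z[0]])    # Example 2: g(x, y) = x * y
F_fix = lambda z: z - 0.5 * z                # Example 3: T(z) = z / 2

# The known solutions are exactly the zeros of the corresponding operators.
print(F_min(np.array([1.0])))        # zero vector: z* = 1
print(F_spp(np.array([0.0, 0.0])))   # zero vector: (x*, y*) = (0, 0)
print(F_fix(np.array([0.0])))        # zero vector: z* = 0
```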
For the operator F from (1) we assume the following.
Assumption 1 (Lipschitzness). The operator F is L-Lipschitz continuous, i.e., for all u, v ∈ Z we have ‖F(u) − F(v)‖_* ≤ L‖u − v‖.
In the context of (2) and (3), L-Lipschitzness of the operator means that the functions f(z) and g(x, y) are L-smooth.
Assumption 2 (Strong monotonicity). The operator F is µ-strongly monotone, i.e., for all u, v ∈ Z we have ⟨F(u) − F(v), u − v⟩ ≥ µ‖u − v‖^2. If µ = 0, then the operator F is monotone.
In the context of (2) and (3), strong monotonicity of F means strong convexity of f(z) and strong convexity-strong concavity of g(x, y). In this paper we first concentrate on the strongly monotone and monotone cases. But there are also various assumptions relaxing monotonicity and strong monotonicity (e.g., see [55] and references therein).
One can point out that Assumptions 1 and 2 are sufficient for the existence of a solution of (1) [37].
Since we work on the set Z, let us introduce the Euclidean projection onto this set:

proj_Z(z) := argmin_{u∈Z} ‖u − z‖_2.

To characterise the convergence of the methods for monotone variational inequalities we introduce the gap function:

Gap_VI(z) := max_{u∈Z} ⟨F(u), z − u⟩. (5)

Such a gap function, as a convergence criterion, is more suitable for the following formulation of the variational inequality problem: find z^* ∈ Z such that ⟨F(z), z^* − z⟩ ≤ 0 for all z ∈ Z. Such a solution is also called weak or Minty (the solution of (1) is called strong or Stampacchia). However, according to Assumption 1, F is single-valued and continuous on Z, meaning that both formulations of the variational inequality problem are equivalent [37]. For the minimization problem (2), the functional distance to the solution, f(z) − f(z^*), can be used instead of (5). For saddle point problems (3), the gap function is also used, but it is slightly different:

Gap(x, y) := max_{y'∈Y} g(x, y') − min_{x'∈X} g(x', y). (6)

For both functions (5) and (6), it is crucial that the feasible set is bounded (in fact, it is not necessary to take the whole set Z, which can be unbounded; it is enough to take a bounded convex subset C which contains some solution, see [95]). Therefore it is necessary to define a distance on the set Z. Since this survey covers methods not only in the Euclidean setup, let us introduce a more general notion of distance.
Definition 4 (Bregman divergence). Assume that the function ν(z) is 1-strongly convex w.r.t. the ‖·‖-norm and differentiable on Z. Then for any two points z, z′ ∈ Z we define the Bregman divergence V(z, z′) associated with ν(z) as follows:

V(z, z′) := ν(z) − ν(z′) − ⟨∇ν(z′), z − z′⟩.

We denote the Bregman diameter of the set Z w.r.t. V as D_{Z,V} := max_{z,z′∈Z} √(2V(z, z′)).
In the Euclidean case, we use D_Z instead of D_{Z,V}. Using the definition of V, we denote the proximal operator as follows:

prox_z(ξ) := argmin_{u∈Z} { ⟨ξ, u⟩ + V(u, z) }.
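Two classic proximal setups can be sketched in a few lines. The sketch below assumes the definition prox_z(ξ) := argmin_{u∈Z} { ⟨ξ, u⟩ + V(u, z) } above; the Euclidean case gives a plain gradient step, while the entropy setup on the probability simplex gives the multiplicative-weights update.

```python
import numpy as np

# Two standard proximal setups for the Bregman prox above.

def prox_euclidean(z, xi):
    # V(u, z) = ||u - z||_2^2 / 2 on Z = R^d: the prox is a plain gradient step.
    return z - xi

def prox_entropy(z, xi):
    # V(u, z) = sum_i u_i log(u_i / z_i) on the probability simplex:
    # the prox has the closed-form multiplicative-weights update.
    w = z * np.exp(-xi)
    return w / w.sum()

z = np.array([0.5, 0.5])
xi = np.array([1.0, 0.0])
u = prox_entropy(z, xi)
print(u.sum())      # sums to 1: the update stays on the simplex
print(u[1] > u[0])  # True: mass moves away from the penalized coordinate
```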

Deterministic foundation: Extragradient and others
The first and the simplest method for solving the variational inequality (1) is the iterative scheme (also known as the Gradient method)

z^{k+1} = proj_Z(z^k − γF(z^k)), (7)

where γ > 0 is a step size. Note that this method can be rewritten using the proximal operator with the Euclidean Bregman divergence: z^{k+1} = prox_{z^k}(γF(z^k)). The basic result is the convergence of the method to the unique solution of (1) for a strongly monotone and L-Lipschitz operator F; it was obtained in the papers [19,107,110].
If F is a potential operator (see Example 1), the method (7) coincides with the gradient projection algorithm. It converges for strongly monotone F. Moreover, the bounds on the admissible step size are less restrictive (0 < γ < 2/L) and the complexity estimates are better (O(κ log(R_0^2/ε)), where R_0 = ‖z^0 − z^*‖_2) than in Theorem 5; see Theorem 2 in Section 1.4.2 of [104].
However, in the general monotone but non-strongly monotone case (for instance, for convex-concave SPPs, Example 2), convergence is lacking. The original statements on the convergence of the Uzawa method (a version of (7)) for saddle point problems [6] were wrong; numerous examples of divergence of the method (7) for F corresponding to bilinear SPPs are well known, see, e.g., Figure 39 from [104].
There were many other attempts to recover the convergence of gradient-like methods, not for VIs, but for saddle point problems. One of them is based on the transition to modified Lagrangians when g(x, y) is a Lagrange function, see [45,104]. However, we focus on the general VI case. A possible approach is based on the regularization idea. Instead of the monotone variational inequality (1), one can deal with a regularized one, where the monotone F is replaced with the strongly monotone F + ε_k T, where T(z) is a strongly monotone operator and ε_k > 0 is a regularization parameter. If we denote by z_k the solution of the regularized VI, then it is possible to prove [10] that z_k converges to z^* as ε_k → 0. However, the solution z_k is usually not easily available. To address this problem, the iterative regularization technique is proposed in [10], where one step of the basic method (7) is applied to the regularized problem.
Step sizes and regularization parameters can be adjusted to guarantee the convergence.
Another technique is based on the Proximal Point method, proposed independently by B. Martinet in [84] and by R. T. Rockafellar in [106]. At each iteration it requires the solution of a VI with the operator F + cI, where c > 0 and I is the identity operator. This is an implicit method (similar to the regularization method); however, there exist numerous implementable versions of Proximal Point. For instance, some of the methods discussed further can be considered from this point of view.
The breakthrough for solving (non-strongly) monotone variational inequalities was made by Galina Korpelevich [64]. She exploited the idea of extrapolation for the gradient method. It can be explained with the simplest example of a two-dimensional min-max problem with g(x, y) = xy, Z = R^2. It has the unique saddle point z^* = 0, and at any point z^k the direction F(z^k) is orthogonal to z^k; thus the iteration (7) enlarges the distance to the saddle point. However, if we make the step (7) and get the extrapolated point z^{k+1/2}, the direction −F(z^{k+1/2}) attracts towards the saddle point. Thus, the Extragradient method for solving (1) reads:

z^{k+1/2} = proj_Z(z^k − γF(z^k)),
z^{k+1} = proj_Z(z^k − γF(z^{k+1/2})). (8)

Theorem 6. Let F satisfy Assumptions 1, 2 (with µ = 0) and 0 < γ < 1/L. Then the Extragradient method generates a sequence of iterates z^k that converges to z^*.
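Korpelevich's two-dimensional example is easy to reproduce numerically. The sketch below (step size chosen for illustration) contrasts the plain iteration (7) with the Extragradient step on g(x, y) = xy.

```python
import numpy as np

# For g(x, y) = x*y the operator is F(x, y) = (y, -x), a 90-degree rotation,
# so the plain iteration (7) spirals away from the saddle point 0 while the
# Extragradient step contracts to it.

F = lambda z: np.array([z[1], -z[0]])
gamma = 0.1

def gradient_step(z):
    return z - gamma * F(z)

def extragradient_step(z):
    z_half = z - gamma * F(z)       # extrapolation
    return z - gamma * F(z_half)    # update at the extrapolated point

z_gd = np.array([1.0, 1.0])
z_eg = np.array([1.0, 1.0])
for _ in range(100):
    z_gd = gradient_step(z_gd)
    z_eg = extragradient_step(z_eg)

print(np.linalg.norm(z_gd))  # grows: each step multiplies the norm by sqrt(1 + gamma^2)
print(np.linalg.norm(z_eg))  # shrinks toward the saddle point (0, 0)
```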
For the particular cases of zero-sum matrix games and the general bilinear problem g(x, y) = y^⊤Ax − b^⊤x + c^⊤y, the method converges linearly, provided that the optimal solution is unique (see Theorem 3 from [64]). In this case, the complexity is O(κ log(R_0^2/ε)) with κ = λ_max(AA^⊤)/λ_min(AA^⊤). More general upper bounds for the Extragradient method can be found in [119] and in the recent paper [87]. In particular, for the strongly monotone case the estimate O(κ log(R_0^2/ε)) with κ = L/µ holds true (compare with the much worse bound O(κ^2 log(R_0^2/ε)) for the Gradient method). An adaptive version of the Extragradient method (requiring no knowledge of L) is proposed in [61].
Another version of the Extragradient method for finding saddle points is provided in [65]. Considering the setup of Example 2, we can exploit just one extrapolating step for the variables y: y^{k+1} = y^k + q(y^{k+1/2} − y^k), with 0 < γ < 1/(2L) and 0 < q < 1. This method converges to the solution, and if g(x, y) is linear in y, then the convergence rate is linear. If we put q = 1 in the method (9), then y^{k+1} = y^{k+1/2} and we get the so-called Alternating Gradient method (alternating descent-ascent). In [123], it was proved that this method has local linear convergence with complexity O(κ log(R_0^2/ε)), where κ = L/µ. L. Popov [105] proposed a version of the extrapolation scheme (sometimes this type of scheme is called optimistic or single-call):

z^{k+1/2} = proj_Z(z^k − γF(z^{k−1/2})),
z^{k+1} = proj_Z(z^k − γF(z^{k+1/2})). (10)

It requires a single computation of F at each iteration versus two computations in the Extragradient method. As shown in [105], the method (10) converges for 0 < γ < 1/(3L). Convergence rates for this method were obtained recently [41,87]: O(κ log(R_0^2/ε)) with κ = L/µ in the strongly monotone case and κ = λ_max(AA^⊤)/λ_min(AA^⊤) in the bilinear case. It is important to note that in the general strongly monotone case this estimate is optimal [124], but for the bilinear problem the upper bounds available in the literature for both the Extragradient and optimistic methods are not tight [56]. Meanwhile, optimal estimates O(√κ log(R_0^2/ε)) with κ = λ_max(AA^⊤)/λ_min(AA^⊤) can be achieved using the approaches from [7] and [4].
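Popov's single-call idea can be sketched as follows (parameters illustrative): the operator value at the previous leading point is reused, so each iteration makes only one new evaluation of F.

```python
import numpy as np

# Popov's single-call scheme (10) on the bilinear example g(x, y) = x*y:
# the value F(z^{k-1/2}) from the previous iteration is reused, so each
# step needs only one new operator evaluation instead of Extragradient's two.

F = lambda z: np.array([z[1], -z[0]])    # operator of g(x, y) = x*y
gamma = 0.1                              # step size, gamma < 1/(3L) with L = 1

z = np.array([1.0, 1.0])
F_prev = F(z)                            # plays the role of F(z^{-1/2})
for _ in range(500):
    z_half = z - gamma * F_prev          # leading (extrapolated) point
    F_prev = F(z_half)                   # the single new operator call
    z = z - gamma * F_prev               # update with the same value

print(np.linalg.norm(z))  # converges toward the saddle point (0, 0)
```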
The extension of the above schemes to an arbitrary proximal setup was obtained in the work of A. Nemirovski [92]. He proposed the Mirror-Prox method for VIs, exploiting the Bregman divergence:

z^{k+1/2} = prox_{z^k}(γF(z^k)),
z^{k+1} = prox_{z^k}(γF(z^{k+1/2})). (11)

For the averaged iterate

ẑ^k := (1/k) Σ_{i=1}^k z^{i+1/2}, (12)

this method guarantees the following convergence rate in the monotone case [92]:

Gap_VI(ẑ^k) = O( L D_{Z,V}^2 / k ). (13)

Stochastic methods: different setups and assumptions
In this section, we move from deterministic to stochastic methods, i.e., we consider (1) with the following operator:

F(z) := E_{ξ∼D}[F_ξ(z)], (14)

where ξ is a random variable, D is some (typically unknown) distribution, and F_ξ : Z → R^d is a stochastic operator. In this setup, calculating the value of the full operator F is computationally expensive or intractable. Therefore, one has to work mainly with the stochastic realizations F_ξ.

General case
The stochastic formulation (14) of the problem (1) was first considered by the authors of [60]. They proposed a natural stochastic generalization of Extragradient (more precisely, of Mirror-Prox):

z^{k+1/2} = prox_{z^k}(γF_{ξ_k}(z^k)),
z^{k+1} = prox_{z^k}(γF_{ξ_{k+1/2}}(z^{k+1/2})). (15)

Here it is important to note that ξ_k and ξ_{k+1/2} are independent and F_ξ(z) is an unbiased estimator of F(z). Moreover, F_ξ(z) is assumed to satisfy the following condition.
Assumption 3 (Bounded variance). The unbiased operator F_ξ has uniformly bounded variance, i.e., for all z ∈ Z we have

E‖F_ξ(z) − F(z)‖_*^2 ≤ σ^2.

Under this assumption, the next result was derived in [60].
Theorem 8. Let F_ξ satisfy Assumptions 1, 2 (with µ = 0) and 3, and let ẑ^k be defined as in (12), where the z^{i+1/2} are generated by the method (15). Then, after k iterations we can guarantee that

E[Gap_VI(ẑ^k)] = O( L D_{Z,V}^2/k + σ D_{Z,V}/√k ). (16)

In [17], the authors gave an analysis of the algorithm (15) for strongly monotone VIs in the Euclidean case. In particular, under Assumptions 1, 2, 3 one can guarantee that after k iterations of the method (15) it holds (here and below we omit numerical constants in the exponent multiplier) that

E‖z^k − z^*‖_2^2 = O( R_0^2 exp(−kµ/L) + σ^2/(µ^2 k) ). (17)

Also, in [17], the authors obtained lower complexity bounds for solving VIs satisfying Assumptions 1, 2, 3 with stochastic methods. It turns out that the results of Theorem 8 in the monotone case and those from (17) are optimal and meet the lower bounds up to numerical constants. Optimistic-like (or single-call) methods were also investigated in the stochastic setting. The work [41] applies the following update scheme:

z^{k+1/2} = proj_Z(z^k − γF_{ξ_{k−1}}(z^{k−1/2})),
z^{k+1} = proj_Z(z^k − γF_{ξ_k}(z^{k+1/2})). (18)

For this method, in the monotone Euclidean case, the authors proved an estimate similar to (16). In the strongly monotone case, (18) was investigated in the paper [54], but their estimates do not meet the lower bounds. The optimal estimates for this scheme were obtained later in the work [12].
The work [66] deals with a slightly different single-call approach in the non-Euclidean case:

z^{k+1} = prox_{z^k}( γF_{ξ_k}(z^k) + γα_k (F_{ξ_k}(z^k) − F_{ξ_{k−1}}(z^{k−1})) ). (19)

This update is a modification of the Forward-Reflected-Backward approach: here α_k is a parameter, while in [83], α_k ≡ 1. The analysis of the method (19) gives optimal estimates in both the strongly monotone and monotone cases.

The theoretical results and guarantees discussed above significantly rely on the bounded variance assumption (Assumption 3). This assumption is quite restrictive (especially when the domain is unbounded) and does not hold for many popular machine learning problems. Moreover, one can even design a strongly monotone variational inequality on an unbounded domain such that the method (15) diverges exponentially fast [26]. The authors of [55,48] consider a relaxation of the bounded variance condition and assume that

E‖F_ξ(u) − F(u)‖_2^2 ≤ σ^2 + δ‖u − z^*‖_2^2

with δ ≥ 0 in the Euclidean case. Under this condition and Assumptions 1, 2, the authors of [48] proved that after k iterations of the algorithm (15) it holds that

E‖z^k − z^*‖_2^2 = O( R_0^2 exp(−k/κ) + σ^2/(µ^2 k) ), (20)

where κ = max{ δ/µ^2, (L + √δ)/µ }. The same assumption on stochastic realizations was considered in [67], where the authors analyzed the method (19) and provided an analogous estimate (21). Results (20) and (21) are competitive: the first estimate is better in terms of the stochastic term (the second term), while the second result is more competitive in terms of the deterministic term (the first term). However, both of these results do not fully resolve the bounded variance issue, because the condition considered above is not general. The key to avoiding the bounded variance assumption on F_ξ lies in the way stochasticity is generated in the method (15). The method (15) is sometimes called Independent Samples Stochastic Extragradient (I-SEG). To address the bounded variance issue, K. Mishchenko et al.
[86] proposed another stochastic modification of the Extragradient algorithm, called Same Sample Stochastic Extragradient (S-SEG):

z^{k+1/2} = z^k − γF_{ξ_k}(z^k),
z^{k+1} = z^k − γF_{ξ_k}(z^{k+1/2}).

For simplicity, we present the above method for the case when Z = R^d (F(z^*) = 0), while [86] contains the more general case of regularized VIs. In contrast to I-SEG, S-SEG uses the same sample ξ_k for both steps of iteration k. Although such a strategy cannot be implemented in some scenarios (e.g., with a streaming oracle), it can be applied to finite-sum problems, which have been gaining increasing attention in recent years. Moreover, S-SEG significantly relies on the following assumption.

Assumption 4 (Lipschitzness and strong monotonicity of F_ξ). For every ξ, the operator F_ξ is L-Lipschitz continuous and µ-strongly monotone.
The evident difference between the setups of I-SEG and S-SEG can be explained through the connection between Extragradient and the Proximal Point method (PP) [84,106]. We assume that Z = R^d (F(z^*) = 0) in all results discussed further in Section 4.1. In this setup, PP has the following update rule: z^{k+1} = z^k − γF(z^{k+1}).
The method converges for any monotone operator F and any γ > 0. However, the update rule of PP is implicit and cannot be computed efficiently in many situations. The Extragradient method can be seen as a natural approximation of PP that substitutes z^{k+1} on the right-hand side by one gradient step from z^k:

z^{k+1} = z^k − γF(z^k − γF(z^k)).

In addition, when F is L-Lipschitz, one can estimate how good this approximation is. Consider z^{k+1} = z^k − γF(z^k − γF(z^k)) (the Extragradient step) and z̃^{k+1} = z^k − γF(z̃^{k+1}) (the PP step). Then, ‖z^{k+1} − z̃^{k+1}‖_2 can be estimated as follows [86]:

‖z^{k+1} − z̃^{k+1}‖_2 = γ‖F(z^k − γF(z^k)) − F(z̃^{k+1})‖_2 ≤ γ^2 L‖F(z̃^{k+1}) − F(z^k)‖_2 ≤ γ^3 L^2 ‖F(z̃^{k+1})‖_2.

That is, the difference between the Extragradient and PP steps is of the order O(γ^3) rather than O(γ^2). Since the latter corresponds to the difference between PP and the simple gradient step (7), Extragradient approximates PP better than gradient steps do, and gradient steps are known to be non-convergent for general monotone Lipschitz variational inequalities. This approximation feature of Extragradient is crucial for its convergence, and, as the above derivation implies, the approximation argument significantly relies on the Lipschitzness of the operator F. Let us go back to the differences between I-SEG and S-SEG. In S-SEG, iteration k can be considered as a single Extragradient step for the operator F_{ξ_k}(z). Therefore, Lipschitzness and monotonicity of F_{ξ_k}(z) (Assumption 4) are important for the analysis of S-SEG. In contrast, I-SEG uses different operators for the extrapolation and update steps. In this case, there is no effect from the Lipschitzness/monotonicity of the individual F_ξ(z). Therefore, the analysis of I-SEG naturally relies on the Lipschitzness and monotonicity of F(z) and the closeness (on average) of F_ξ(z) to F(z) (Assumption 3).
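The O(γ^3) versus O(γ^2) approximation property can be verified numerically. The sketch below uses an assumed linear monotone operator F(z) = Az, for which the implicit PP step has a closed form, and compares the errors at two step sizes.

```python
import numpy as np

# Approximation quality of Extragradient vs. the gradient step relative to
# the (implicit) Proximal Point step, for a linear monotone operator.

A = np.array([[1.0, 1.0], [-1.0, 1.0]])   # F(z) = A z, monotone and Lipschitz
F = lambda z: A @ z
z = np.array([1.0, 2.0])

def pp_step(z, gamma):
    # z+ = z - gamma F(z+)  <=>  (I + gamma A) z+ = z for a linear operator.
    return np.linalg.solve(np.eye(2) + gamma * A, z)

def eg_step(z, gamma):
    return z - gamma * F(z - gamma * F(z))

def gd_step(z, gamma):
    return z - gamma * F(z)

for gamma in (0.1, 0.01):
    pp = pp_step(z, gamma)
    print(np.linalg.norm(eg_step(z, gamma) - pp),  # shrinks like gamma^3
          np.linalg.norm(gd_step(z, gamma) - pp))  # shrinks like gamma^2
```

Shrinking γ by a factor of 10 reduces the Extragradient-to-PP error by roughly 1000x, but the gradient-to-PP error only by roughly 100x.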
The convergence of I-SEG was discussed earlier in this section. Regarding S-SEG, the following result holds [86].
Theorem 9. Let Assumption 4 hold. Then, there exists a choice of the step size γ [48] such that the output of S-SEG after k iterations satisfies

E‖z^k − z^*‖_2^2 = O( R_0^2 exp(−kµ/L) + σ_*^2/(µ^2 k) ),

where σ_*^2 = E‖F_ξ(z^*)‖_2^2.

The rate is similar to the one known for I-SEG up to the following differences. First, instead of the uniform bound σ^2 on the variance, the rate depends on σ_*^2, the variance of F_ξ measured at the solution. In many cases, σ^2 = ∞ while σ_*^2 is finite. From this perspective, S-SEG has a better rate than I-SEG. However, this comes at a price: while the rate of I-SEG depends on the Lipschitz and strong monotonicity constants of F, the rate of S-SEG depends on the worst constants of the F_ξ, which can be much worse than those of F. In particular, consider the finite-sum setup with uniform sampling of ξ:

F(z) = (1/n) Σ_{i=1}^n F_i(z),

where F_i is L_i-Lipschitz and µ_i-strongly monotone, and P{ξ = i} = 1/n. Then Assumption 4 holds with L = max_{i∈[n]} L_i and µ = min_{i∈[n]} µ_i, and these constants appear in the rate from Theorem 9. The authors of [48] tighten this rate. In particular, they prove that S-SEG with different step sizes for the extrapolation and update steps achieves the rate above with µ replaced by µ̄ = (1/n) Σ_{i=1}^n µ_i and with σ_*^2 = (1/n) Σ_{i=1}^n ‖F_i(z^*)‖_2^2. Since µ̄ is (sometimes significantly) larger than µ, the improvement is noticeable. Moreover, when {L_i}_{i=1}^n are known, one can consider so-called importance sampling [52]: P{ξ = i} = L_i/(nL̄), where L̄ = (1/n) Σ_{i=1}^n L_i. As the authors of [48] show, importance sampling can be combined with S-SEG by allowing the extrapolation and update step sizes at iteration k to depend on the sample ξ_k. In particular, for the proposed modification of S-SEG they derive a rate whose exponentially decaying term depends on the average constant L̄ instead of max_{i∈[n]} L_i and whose stochastic term depends on a reweighted variance σ̄_*^2. The exponentially decaying term is always better than the corresponding one for S-SEG with uniform sampling; this usually implies faster convergence during the initial stage. Next, typically, a larger norm of F_i(z^*) implies a larger L_i, e.g., ‖F_i(z^*)‖_2^2 ∼ L_i^2.
In this case, σ̄_*^2 ≤ σ_*^2, because σ̄_*^2 ∼ (L̄)^2 while σ_*^2 ∼ (1/n) Σ_{i=1}^n L_i^2 ≥ (L̄)^2. Moreover, one can allow other sampling strategies and cover the case when some of the µ_i are negative; see [48] for the details.
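The same-sample idea is easy to sketch on a toy finite sum (the concrete operators below are made up for illustration): each F_i is 1-strongly monotone and 1-Lipschitz, so Assumption 4 holds, and S-SEG converges to a noise-dominated neighborhood of z^*.

```python
import numpy as np

# S-SEG on the toy finite sum F(z) = (1/n) sum_i F_i(z), F_i(z) = z - c_i,
# with uniform sampling; the solution is z* = mean(c_i). Both steps of each
# iteration reuse the SAME sample i, as required by S-SEG.

rng = np.random.default_rng(0)
n, gamma, steps = 10, 0.05, 2000
c = rng.normal(size=n)
F_i = lambda z, i: z - c[i]
z_star = c.mean()

z = 5.0
for _ in range(steps):
    i = rng.integers(n)
    z_half = z - gamma * F_i(z, i)   # extrapolation with sample i
    z = z - gamma * F_i(z_half, i)   # update reuses the same sample i

print(abs(z - z_star))  # small: the iterate settles near z* up to sampling noise
```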

Finite-sum case
As noted earlier, when we deal with the problem (14), it is often the case (especially in practical problems) that the distribution D is unknown, but we have some samples from D. Then, one can replace the problem (14) by a finite-sum approximation:

F(z) = (1/n) Σ_{i=1}^n F_i(z).

This approximation is sometimes also called the Monte Carlo approximation; in machine learning, the term empirical risk is often encountered. Although calls of the full operator F are now tractable, they remain expensive in practice. Therefore, it is worth avoiding frequent computations of F and mainly using calls of single operators F_i or small batches of them. Before presenting the results, let us introduce an analogue of the Lipschitzness assumption.
Assumption 5 (Lipschitzness in mean). The operator F is L_avg-Lipschitz continuous in mean, i.e., for all u, v ∈ Z we have E‖F_ξ(u) − F_ξ(v)‖_*^2 ≤ L_avg^2 ‖u − v‖^2.

For example, if F_i is L_i-Lipschitz for all i and we draw the index ξ = i with probability p_i = L_i/Σ_j L_j, then L_avg = (1/n) Σ_j L_j. The study of finite-sum problems in stochastic optimization is connected, first of all, with classical methods for minimization problems such as SVRG [59] and SAGA [29]. For saddle point problems, these methods were adopted in [102] (in fact, these results are also valid for variational inequalities). The authors considered strongly convex-strongly concave saddles in the Euclidean case and proved estimates for SVRG and SAGA of the order O( (n + L_avg^2/µ^2) log(1/ε) ) stochastic oracle calls. Since this bound is not tight in terms of L_avg/µ, the authors proposed accelerating SVRG and SAGA via the Catalyst envelope [76]. In this case, they have the following bound:

Õ( (n + √n L_avg/µ) log(1/ε) ). (23)

The same estimates for saddle point problem methods based on accelerating envelopes were also presented in [118]. An important step in the study of the finite-sum stochastic setup was the work [23]. It is primarily focused on bilinear games. For this class of problems, the authors improved the estimate (23) and removed the additional logarithmic factor. For general problems (saddle point problems and variational inequalities), the results of this paper are very similar to (23) and also carry an additional logarithmic factor. Meanwhile, the authors also considered the convex-concave/monotone case in the non-Euclidean setting and showed that for their method after k iterations it holds that

E[Gap_VI(ẑ^k)] = Õ( √n L_avg D_{Z,V}^2 / k ). (24)

The problem of the additional logarithmic factor was resolved in [2]. The authors proposed a modification of Extragradient:

z̄^k = z^k + τ(w^k − z^k),
z^{k+1/2} = proj_Z( z̄^k − γF(w^k) ),
z^{k+1} = proj_Z( z̄^k − γ[F(w^k) + F_{ξ_k}(z^{k+1/2}) − F_{ξ_k}(w^k)] ),
w^{k+1} = z^{k+1} with probability p, and w^{k+1} = w^k otherwise. (25)

This algorithm is a combination of the extra-step technique from VI theory and the loopless approach [73] for finite-sum problems. An interesting detail of the method is the randomized negative momentum term τ(w^k − z^k).
While for minimization problems it is usual to apply positive (heavy-ball) momentum, the opposite approach turns out to be useful for saddle point problems and variational inequalities. This effect was noticed earlier [42,122,3] and has now appeared in the theory of stochastic methods for VIs. Also, in [2], the authors presented modifications of Forward-Backward and Forward-Reflected-Backward, as well as of Extragradient in the non-Euclidean case. As we noted earlier, the results of [2] give the estimates (23) and (24), but without additional logarithmic factors. That is, to achieve E‖z^k − z^*‖_2^2 ≤ ε in the strongly monotone case and E[Gap_VI(ẑ^k)] ≤ ε in the monotone case, the methods from [2] require

O( (n + √n L_avg/µ) log(1/ε) ) (26)

and

O( n + √n L_avg D_{Z,V}^2/ε ) (27)

stochastic oracle calls, respectively. It remains to discuss the effect of batching on the method from (25), i.e., how the oracle complexity bounds change if we use not a single sample F_{ξ_k} at each iteration, but a batch of size b: (1/b) Σ_{i∈S_k} F_i, where S_k ⊆ [n] is the b-element set of indices in the mini-batch. In this case, the methods from [2] give the estimates (26) and (27) multiplied by an additional factor √b. This extra multiplier issue was resolved in [69]: the authors proved that in the strongly monotone case their method achieves the estimate (26), i.e., without additional logarithmic factors and without factors depending on b.
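The loopless variance-reduction idea behind the method from (25) can be sketched as follows. This is an illustration of the ingredients under simplifying assumptions, not the exact algorithm of [2]: a snapshot point w stores one full-operator value F(w), is refreshed with probability p, and each iteration runs an Extragradient step with the estimator g(u) = F_i(u) − F_i(w) + F(w), which is unbiased and has zero variance at u = w.

```python
import numpy as np

# Loopless variance-reduced Extragradient sketch on a toy finite sum
# F_i(z) = A z - b_i (strongly monotone, shared Jacobian A).

rng = np.random.default_rng(1)
n = 20
A = np.array([[1.0, 1.0], [-1.0, 1.0]])
b = rng.normal(size=(n, 2))
F_i = lambda u, i: A @ u - b[i]
F_full = lambda u: A @ u - b.mean(axis=0)
z_star = np.linalg.solve(A, b.mean(axis=0))

gamma, p = 0.1, 0.2
z = np.array([3.0, -3.0])
w, Fw = z.copy(), F_full(z)
for _ in range(500):
    i = rng.integers(n)
    g = lambda u: F_i(u, i) - F_i(w, i) + Fw  # variance-reduced estimator
    z_half = z - gamma * g(z)                 # extrapolation step
    z = z - gamma * g(z_half)                 # update step
    if rng.random() < p:                      # loopless snapshot refresh
        w, Fw = z.copy(), F_full(z)

# With identical Jacobians the estimator is exact, so this run converges
# linearly and deterministically; with varying F_i the convergence is
# linear in expectation.
print(np.linalg.norm(z - z_star))
```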
The only issue that remains to be understood is whether the current state-of-the-art methods with the best complexities, from [2,69], are optimal. The lower bounds from [53] show that, under Assumptions 5 and 2, the methods above are optimal. However, under L_max-Lipschitzness of each F_i, i ∈ [n], and Assumption 2, the lower bound from [53] does not match the known upper bounds, and the question of whether this lower bound is tight remains open.

Cocoercivity assumption
In some papers, the following assumption is used instead of Assumption 1.
Assumption 6 (Cocoercivity). The operator F is ℓ-cocoercive, i.e., for all u, v ∈ Z we have

⟨F(u) − F(v), u − v⟩ ≥ (1/ℓ) ‖F(u) − F(v)‖_2^2.

This assumption is stronger than monotonicity plus Lipschitzness, i.e., not all monotone Lipschitz operators are cocoercive. One can note that the operator of the bilinear SPP (min_x max_y x^⊤Ay) is not cocoercive. But if it is known that F is L-Lipschitz and µ-strongly monotone, then it is (L^2/µ)-cocoercive. Moreover, if we consider a convex L-smooth minimization problem, then the corresponding operator is L-cocoercive.
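These claims are easy to probe numerically. A sketch (with made-up test operators): the gradient of a convex smooth function satisfies the cocoercivity inequality, while the bilinear operator violates it for every finite ℓ.

```python
import numpy as np

# Numerical check of Assumption 6:
# <F(u) - F(v), u - v> >= (1/l) ||F(u) - F(v)||_2^2.

rng = np.random.default_rng(2)

def inner(F, u, v):
    # Left-hand side <F(u) - F(v), u - v> of the cocoercivity inequality.
    return float(np.dot(F(u) - F(v), u - v))

F_grad = lambda z: 2.0 * z                   # gradient of the 2-smooth f(z) = ||z||^2
F_bilin = lambda z: np.array([z[1], -z[0]])  # operator of g(x, y) = x * y

u, v = rng.normal(size=2), rng.normal(size=2)

# Gradient case: the inequality holds with l = 2 (here with equality).
print(inner(F_grad, u, v) >= 0.5 * np.sum((F_grad(u) - F_grad(v)) ** 2) - 1e-9)  # True
# Bilinear case: <F(u) - F(v), u - v> = 0 while F(u) != F(v): no finite l works.
print(inner(F_bilin, u, v))  # 0.0
```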
There is no need to use Extragradient for cocoercive operators: one can apply the iterative scheme (7) and its modifications in the stochastic case. In spite of this, the first work on cocoercive operators in the stochastic case used Extragradient as the base method [26]. In this paper, the authors investigated methods for finite-sum problems. The more recent results from [81,15] give an almost complete picture of stochastic algorithms based on the method (7) for operators under Assumption 6. In particular, the work [15] gives a unified analysis of a large number of popular stochastic methods previously known for minimization problems [51].

High-probability convergence
Before this section, we focused on in-expectation convergence guarantees for the stochastic methods, i.e., bounds on E[Gap_VI(ẑ^k)] and/or E‖z^k − z^*‖_2^2. However, high-probability convergence guarantees, i.e., bounds on Gap_VI(ẑ^k) and/or ‖z^k − z^*‖_2^2 that hold with probability at least 1 − β for a given confidence level β ∈ (0, 1), reflect the real behavior of the methods more accurately [50]. Despite this fact, the high-probability convergence of stochastic methods for solving VIs has been studied in only a couple of works.
It is worth mentioning that one can always deduce a high-probability bound from an in-expectation one via Markov's inequality. However, in this case, the derived convergence rate will depend polynomially on β^{-1}. Such guarantees are not desirable, and the goal is to derive rates that have a (poly-)logarithmic dependence on the confidence level, i.e., β should appear only in O(poly(log(1/β))) factors.
The first, and for many years the only, high-probability guarantees of this type for solving stochastic VIs were derived in [60]. The authors assume that F is monotone and L-Lipschitz, the domain is bounded, and F_ξ is an unbiased estimator with sub-Gaussian (light) tails of the distribution:

E[ exp( ‖F_ξ(z) − F(z)‖_*^2 / σ^2 ) ] ≤ exp(1) for all z ∈ Z.

The above condition is much stronger than Assumption 3. Under these assumptions, the authors of [60] prove that after k iterations of Mirror-Prox, with probability at least 1 − β (for any β ∈ (0, 1)), the following inequality holds:

Gap_VI(ẑ^k) = O( L D_{Z,V}^2/k + σ D_{Z,V} √(log(1/β))/√k ).

Up to the logarithmic factor, this result coincides with the in-expectation one, and thus it is optimal (up to logarithms). However, the result is derived under the restrictive light-tails assumption. This limitation was recently addressed in [49], where the authors derived high-probability rates for the considered problem under just the bounded variance assumption. In particular, they consider clipped-SEG for problems with Z = R^d:

z^{k+1/2} = z^k − γ clip(F_{ξ_k}(z^k), λ),
z^{k+1} = z^k − γ clip(F_{ξ_{k+1/2}}(z^{k+1/2}), λ),

where clip(x, λ) := min{1, λ/‖x‖_2} x is the clipping operator, a popular tool in deep learning [103,46]. In the setup where F is monotone and L-Lipschitz and Assumption 3 holds, the authors of [49] prove that after k iterations of clipped-SEG, with probability at least 1 − β (for any β ∈ (0, 1)), the following inequality holds:

Gap_VI(ẑ^k) = O( L R_0^2 log(k/β)/k + σ R_0 log(k/β)/√k ).

Up to the differences in logarithmic factors, the definition of σ, and the difference between D_Z and R_0, this rate coincides with the one from [60], while it is derived without the light-tails assumption.
The key algorithmic tool that allows removing the light-tails assumption is clipping: with a proper choice of the clipping level λ, the authors cut heavy tails without making the bias too large. It is worth mentioning that the result for clipped-SEG is derived in the unconstrained case and the rate depends on R_0, while in [60] the analysis relies on the boundedness of the domain, whose diameter explicitly appears in the rate. To remove the dependence on the diameter of the domain, the authors of [48] show that, with high probability, the iterates produced by clipped-SEG stay in a ball around z^* with radius proportional to R_0. Using this trick, the authors of [48] also show that it is sufficient to assume everything (monotonicity and Lipschitzness of F and bounded variance) only on this ball. Such generality allows them to cover problems that are non-Lipschitz on R^d (e.g., some monotone polynomially-growing operators) and also the situation when the variance is bounded only on a compact set, which is common for many finite-sum problems. Finally, [48] contains high-probability convergence results for strongly monotone VIs and VIs with structured non-monotonicity.
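A sketch of clipped-SEG with an assumed toy strongly monotone operator and illustrative parameters; only the clipping rule clip(x, λ) = min{1, λ/‖x‖_2} x is taken from the text above.

```python
import numpy as np

# Clipped-SEG on Z = R^d under heavy-tailed noise: stochastic estimates
# are clipped before each of the two SEG steps.

def clip(x, lam):
    norm = np.linalg.norm(x)
    return x if norm <= lam else (lam / norm) * x

F = lambda z: z + np.array([z[1], -z[0]])  # toy strongly monotone operator, z* = 0
gamma, lam = 0.1, 1.0

rng = np.random.default_rng(3)
heavy_noise = lambda: rng.standard_t(df=2, size=2)  # heavy tails: infinite variance

z = np.array([2.0, 2.0])
for _ in range(300):
    g = clip(F(z) + heavy_noise(), lam)
    z_half = z - gamma * g
    g_half = clip(F(z_half) + heavy_noise(), lam)
    z = z - gamma * g_half

print(np.linalg.norm(z))  # stays bounded near z* = 0 despite the heavy-tailed noise
```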

Recent advances
In this section, we collect a few recent theoretical advances with practical impacts.

Saddle point problems with different constants of strong convexity and strong concavity
Interest in saddle point problems with different constants of strong convexity and strong concavity appeared a few years ago, see, e.g., [4,77]. However, even for the particular bilinearly-coupled case g(x, y) = f(x) + x^⊤Ay − h(y), where the function f is µ_x-strongly convex (µ_x > 0) and L_x-smooth and the function h is µ_y-strongly convex (µ_y > 0) and L_y-smooth, optimal algorithms have appeared only recently [72,116,58]. These algorithms attain the lower bound obtained in [124,56] (here we need to assume that λ_min(A^⊤A) ≤ √(µ_x µ_y); without this assumption, optimal methods are unknown). Note that the algorithm from [58] was built upon a technique related to the analysis of primal-dual Extragradient methods via relative Lipschitzness [115,28]. As a by-product, this technique makes it possible to obtain Nesterov's accelerated method as a particular case of the primal-dual Extragradient method with relative Lipschitzness [28].
For non-bilinear SPPs, optimal methods, based on the accelerated Monteiro-Svaiter proximal envelope, were developed only in the non-composite case [71,22]. For non-bilinear SPPs with composite terms, there is a poly-logarithmic gap between the lower bound and the best known upper bounds [118]. A gap also appears for SPPs with stochastic finite-sum structure [118,82,58]. The stochastic setting with bounded variance was considered in [125,85,32]. Further deterministic "cutting-plane" improvements are related to additional assumptions about the small dimension of the vectors x and/or y [91,43,44], or to different structural (e.g., SPPs on balls in the 1/∞-norms) and sparsity assumptions, see, e.g., [24,112,111] and references therein. Lower bounds here are mostly unknown.
Note that in this subsection we mentioned many works with (sub-)optimal algorithms for different variants of SPP. In contrast to convex optimization, where an oracle call is uniquely associated with a gradient call ∇f(x), for SPP we have two criteria: the number of ∇_x g(x, y)-calls and the number of ∇_y g(x, y)-calls (and more variants for SPP with composites). "Optimality" in most of the aforementioned papers means that the method is optimal according to the worst of these criteria. In [4,118], the authors consider these criteria separately. However, the development of lower bounds and optimal methods for the multi-criteria setup is still an open problem.

Adaptive methods for VI and SPP
Interest in adaptive algorithms for stochastic convex optimization emerged mainly after the development of AdaGrad [33] in 2011 and, later, Adam [63]. For variational inequalities and saddle point problems, interest in adaptive methods has appeared only in the last few years, see, e.g., [40,8] (see also [61]). Currently, this area of research is well developed. One can note works devoted both to adaptive step sizes [5,35,34,114,117] and to adaptive scaling/preconditioning [80,31,13]. Approaches from the second group are based on the idea of a proper combination of AdaGrad/Adam with Extragradient or its modifications. None of the mentioned adaptive methods has better theoretical convergence rates than its non-adaptive analogue (typically the rates are the same), but adaptive methods require less input information or demonstrate better performance in practice.
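As an illustration of the second group of approaches, here is a minimal sketch of Extragradient with a diagonal AdaGrad-style preconditioner; the accumulation rule, the step size gamma, and the stabilizer eps are illustrative choices, not the exact schemes from the cited works:

```python
import numpy as np

def adagrad_extragradient(F, x0, gamma=0.5, eps=1e-8, n_iters=300):
    """Extragradient where each step is rescaled coordinate-wise by an
    AdaGrad-style diagonal preconditioner 1/sqrt(sum of squared operator values).
    F is a monotone operator, e.g., the gradient field of a convex-concave SPP."""
    x = np.asarray(x0, dtype=float).copy()
    s = np.zeros_like(x)                    # running sum of squared operator values
    for _ in range(n_iters):
        g = F(x)
        s += g * g
        d = 1.0 / (np.sqrt(s) + eps)        # diagonal scaling
        x_half = x - gamma * d * g          # extrapolation step
        g_half = F(x_half)
        s += g_half * g_half
        d = 1.0 / (np.sqrt(s) + eps)
        x = x - gamma * d * g_half          # update step
    return x
```

As in AdaGrad for minimization, the effective step sizes shrink automatically in coordinates where the operator has been large, so no Lipschitz constant needs to be supplied in advance.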

Quasi-Newton and tensor methods for VI and SPP
Quasi-Newton methods for solving nonlinear equations (unconstrained VIs) and SPP are proposed in [75,121] and [79], respectively. In these papers, the authors derive local superlinear convergence rates for modifications of Broyden-type methods for solving nonlinear equations with Lipschitz Jacobian and SPP with Lipschitz Hessian. Stochastic versions of these methods for VIs and SPP have not yet been developed.
Tensor methods for convex optimization problems are currently quite well developed. In particular, starting with [99], it has been shown that optimal second- and third-order methods can be implemented with almost the same per-iteration complexity as Newton's method [89,97,39]. Moreover, optimal p-order methods (that use p-order derivatives) significantly improve the convergence rate from k^{-2} to k^{-(3p+1)/2} [70,21]. For VIs and SPP, the interest was initiated by [94,88], and optimal p-order methods improve the convergence rate from k^{-1} to k^{-(p+1)/2} [1,78] (for k^{-1}, see Theorem 7). However, in contrast to convex optimization, the use of tensor methods for sufficiently smooth monotone VIs and convex-concave saddle point problems is not expected to be as effective. Note that in [1,78] one can also find optimal rates for strongly monotone VIs and strongly convex-concave saddle point problems. Stochastic tensor methods for variational inequalities and saddle point problems have not yet been developed.

Convergence in terms of the gradient norm for SPP
Several recent advances in the development of optimal algorithms are based on accelerated proximal envelopes with proper stopping rules for inner-loop algorithms [71,70,68,109]. This rule is built upon the norm of the gradient of the inner problem's target function.
For smooth convex optimization problems, Yu. Nesterov posed in 2012 the problem of making the gradient norm small at the same rate of convergence as the gap in the function value, i.e., proportional to k^{-2} [96]. To address this problem, in the same paper, he proposed an optimal (up to a logarithmic factor) algorithm. This question was further investigated, including obtaining optimal results without additional logarithmic factors [62,98] (see also [30] for explanations and a survey). In the stochastic case, algorithms were presented in [38].
For smooth convex-concave saddle point problems, an optimal algorithm with ‖∇_{x,y} f(x^k, y^k)‖_2 proportional to k^{-1} was proposed in [122] (see also [30] and [71] for monotone inclusions). For the stochastic case, see [74,20,27].

Decentralized VI and SPP
In practice, in order to solve variational inequality problems more efficiently and quickly, distributed methods are usually applied. In particular, methods that work on arbitrary (possibly time-varying) decentralized communication networks between computing devices are popular.
While the field of decentralized algorithms for minimization problems has been extensively investigated, results for broader classes of problems have begun to appear only in recent years. Such works are primarily focused on saddle point problems [90,17,108,18,16], but we note that most of these results can easily be extended to variational inequalities. Let us emphasize two works devoted directly to VIs: in [14], the authors proposed a decentralized method with local steps, and [69] gave optimal decentralized methods for stochastic (finite-sum) variational inequalities on fixed and time-varying networks.