Convergence Rates of Training Deep Neural Networks via Alternating Minimization Methods

Jintao Xu Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
Email: xujt19@mails.tsinghua.edu.cn Chenglong Bao Yau Mathematical Sciences Center, Tsinghua University, Beijing 100084, China, and Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China.
Email: clbao@mail.tsinghua.edu.cn Wenxun Xing Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
Email: wxing@mail.tsinghua.edu.cn

Abstract

Training deep neural networks (DNNs) is an important and challenging optimization problem in machine learning due to its non-convexity and non-separable structure. The alternating minimization (AM) approaches split the composition structure of DNNs and have drawn great interest in the deep learning and optimization communities. In this paper, we propose a unified framework for analyzing the convergence rate of AM-type network training methods. Our analysis are based on the $j$ -step sufficient decrease conditions and the Kurdyka-Łojasiewicz (KL) property, which relaxes the requirement of designing descent algorithms. We show the detailed local convergence rate if the KL exponent $θ$ varies in $[0, 1)$ . Moreover, the local R-linear convergence is discussed under a stronger $j$ -step sufficient decrease condition.

Keywords Deep neural networks training; Alternating minimization; Kurdyka-Łojasiewicz property;
$j$ -step sufficient decrease; Convergence rate

Mathematics Subject Classification (2020) 49M37 90C26 90C52

1 Introduction

In recent years, deep learning has achieved impressive successes in many areas including computer vision [8, 12], natural language processing [27, 29], and recommender system [9, 10]. For deep neural networks (DNNs) training, the alternating minimization (AM)-type training methods, mainly based on the block coordinate descent (BCD) [25] or the alternating direction method of multipliers (ADMM) [6], have been discussed such as BCD-type algorithms [7, 16, 30, 33, 35] and ADMM-type algorithms [28, 34, 36]. In these methods, auxiliary variables are added for each layer to decouple the nested parameters in DNNs, and the constraints are lifted to the objective functions. Using the above techniques, the vanishing gradient issue [4, 11] is avoided and parallel updates can be achieved.

Let ${min}_{X \in D} f (X)$ be a DNNs training model, where $D \subseteq R^{m \times n}$ , and ${X_{k}}$ be a sequence generated by an AM-type training algorithm. In this paper, we propose a unified framework for establishing the convergence rate of the objective function value ${f (X_{k})}$ generated by various AM-type training algorithms. In particular, let $j$ be a positive integer, our analysis imposes a $j$ -step sufficient decrease condition defined as follows, which relaxes the common descent condition used in [2].

A1. For a certain $j \in N_{+}$ , there exists $c_{1} > 0$ such that

$c_{1} d i s t (0, \partial f (X_{k + j}))^{2} \leq f (X_{k}) - f (X_{k + j})$ (1)

for each $k \geq k_{0}$ .

If $j = 1$ , the above condition can be derived from H1 and H2 of [2]. When $j \geq 2$ , it allows the oscillations of ${f (X_{k})}$ during the consecutive $j$ iterations, which can be classified as non-monotone methods. In Sect. 3, we will show that the existing four AM-type algorithms for training DNNs satisfy A1. Besides, the Kurdyka-Łojasiewicz (KL) property [1, 18] assumption of $f$ , which implies a local sharpness under reparametrization [2], plays a central role in our analysis.

Convergence results have been stated for some AM-type training algorithms [13, 16, 30, 33, 34]. From the theoretical perspective, the convergence rate of ${X_{k}}$ generated by a BCD-type algorithm is analyzed [16], and the convergence of the objective function value ${f (X_{k})}$ is proved for BCD-type and ADMM-type algorithms [33, 34]. Besides, the convergence rate and the sample complexity of an AM-type algorithm for ReLU networks are established in [13]. However, there are few results about the convergence rate of the objective function value, like that in [30], and we address this question.

Based on A1, we establish the local convergence rate of the objective function value sequences in Theorem 4.1, which depends on the different values of the Kurdyka-Łojasiewicz (KL) exponent $θ \in [0, 1)$ [1, 18] of $f$ . Moreover, if we replace the lower bound in (1) with $c_{2} d i s t (0, \partial f (X_{k + j}))^{\frac{1}{α}}$ , where $c_{2}$ and $α \geq θ$ are positive constants, we give the local linear convergence in Theorem 4.2.

The rest of the paper is organized as follows. Notations and definitions used throughout this paper are listed in Sect. 2. Four examples are shown in Sect. 3. Estimations of convergence rate are stated in Sect. 4, and we summarize our results in Sect. 5.

2 Notations and Definitions

We give notations and definitions that are useful in our analysis.
Notations. Throughout this paper, $R$ , $R^{m}$ , $R^{m \times n}$ , and $N_{+}$ denote the set of real numbers, real $m$ -dimensional vectors, real $m \times n$ matrices, and positive integers respectively. 0 denotes the matrix of all zeros whose size varies from the context. $∥ \cdot ∥$ denotes the Euclidean norm for $x \in R^{m}$ and the Frobenius norm for $X \in R^{m \times n}$ , respectively. For $A, B \in R^{m \times n}, ⟨ A, B ⟩ = t r (A B^{T})$ . $dist(x,S)\coloneqqinfs∈S∥x−s∥$ . For any $u \in R$ , $⌊ u ⌋$ is the greatest integer no larger than $u$ . $O (\cdot)$ is the standard big O asymptotic notation.

Definition 2.1.

(Fréchet subdifferential [23, 24])
The Fréchet subdifferential of $f$ at $X \in d o m (f)$ is the set

ˆ \partial f (X) = ⎧ ⎨ ⎩ G ∣ ∣ ∣ ∣ liminf \frac{Y \neq X}{Y \to X} \frac{f (Y) - f (X) - ⟨ G, Y - X ⟩}{∥ Y - X ∥} \geq 0 ⎫ ⎬ ⎭ .

Definition 2.2.

(Limiting subdifferential [23, 24])
For each $X \in d o m (f)$ , the limiting subdifferential of $f$ at $X$ is the set

\partial f (X) = {G ∣ ∣ \exists X_{k} \to X, f (X_{k}) \to f (X), G_{k} \to G, G_{k} \in ˆ \partial f (X_{k})} .

$d o m (\partial f) = {X | \partial f (X) \neq \emptyset}$ .

Definition 2.3.

(Limiting critical point [2])
A point $X \in d o m (f)$ is called a (limiting) critical point of $f$ if $0 \in \partial f (X)$ .

Definition 2.4.

(Kurdyka-Łojasiewicz property [1, 18])
A proper lower semicontinuous function $f$ is called to have the Kurdyka-Łojasiewicz (KL) property at $ˆ X \in d o m (\partial f)$ if there exist $c, τ \in (0, \infty]$ , $θ \in [0, 1)$ , and a neighborhood $U$ of $ˆ X$ such that for all $X \in U \cap {X | f (ˆ X) < f (X) < f (ˆ X) + τ}$ ,

(f (X) - f (ˆ X))^{θ} \leq c d i s t (0, \partial f (X)) .

We call $θ$ as the Kurdyka-Łojasiewicz (KL) exponent [18].

KL property is widely used in nonconvex optimization [1, 2, 5, 31]. The pioneering work on it is credited to Kurdyka [15] and Łojasiewicz [19, 20]. More details about this property can be seen in [1, 2, 5] and the references therein.

Definition 2.5.

(Root (R)-convergence rate [26])
For any convergent sequence ${X_{k}}$ with limit $X^{*}$ , $R1\coloneqqlimsupk→∞∥Xk−X∗∥1k$ . If $0 < R_{1} < 1$ , ${X_{k}}$ is called R-linearly convergent. If $R_{1} = 1$ , ${X_{k}}$ is called R-sublinearly convergent.

3 Typical AM-type algorithms

In this section, we present four examples that apply AM-type algorithms for training DNNs. These examples and their numerical results motivate us to construct a framework for establishing a theoretical analysis of the objective value.

Example 3.1.

(BCD for FNNs [33])
For the feedforward neural networks (FNNs) training, Zeng et al. [33] formulated two optimization models and designed BCD-type algorithms for their unconstrained approximations, respectively. In one model, the objective function is

	$f (X)$	$= \frac{1}{n} n \sum j = 1 ℓ ((V_{N})_{: j}, y_{j}) + N \sum i = 1 r_{i} (W_{i}) + N \sum i = 1 s_{i} (V_{i})$
		$+ \frac{γ}{2} N \sum i = 1 ∥ V_{i} - σ_{i} (W_{i} V_{i - 1}) ∥^{2},$		(2)

where $X = ({W_{i}}_{i = 1}^{N}, {V_{i}}_{i = 1}^{N})$ , $σ_{i}$ denotes the activation function of the $i$ th layer, $i = 1, 2, \dots, N$ , $ℓ$ is a loss function, $r_{i}$ , $s_{i}$ can be seen as regularization terms about $W_{i}$ , $V_{i}$ , $i = 1, 2, \dots, N$ , respectively, $(V_{N})_{: j}$ denotes the $j$ th column of $V_{N}$ , $j = 1, 2, \dots, n$ , and the last term represents a quadratic penalty for constraints $V_{i} = σ_{i} (W_{i} V_{i - 1})$ , $i = 1, 2, \dots, N$ . Under the Assumption 1 of [33], (3.1) satisfies the KL property on any closed set. Moreover, ${f (X_{k})}$ generated by the BCD-type algorithm is convergent, and satisfies

\frac{a}{b^{2}} d i s t (0, \partial f (X_{k + 1}))^{2} \leq f (X_{k}) - f (X_{k + 1})

for certain $a, b > 0$ (see [33] for the values of $a$ and $b$ ).

In the other model,

	$f (X)$	$= \frac{1}{n} n \sum j = 1 ℓ ((V_{N})_{: j}, y_{j}) + N \sum i = 1 r_{i} (W_{i}) + N \sum i = 1 s_{i} (V_{i})$
		$+ \frac{γ}{2} N \sum i = 1 ∥ V_{i} - σ_{i} (U_{i}) ∥^{2} + \frac{γ}{2} N \sum i = 1 ∥ U_{i} - W_{i} V_{i - 1} ∥^{2},$		(3)

where $X = ({W_{i}}_{i = 1}^{N}, {U_{i}}_{i = 1}^{N}, {V_{i}}_{i = 1}^{N})$ , and the last two terms are quadratic penalties for constraints $V_{i} = σ_{i} (U_{i})$ and $U_{i} = W_{i} V_{i - 1}$ , $i = 1, 2, \dots, N$ , respectively. Under the same assumptions, (3.1) satisfies the KL property on any closed set. Similarly, ${f (X_{k})}$ is convergent, and satisfies

\frac{a}{b^{2}} d i s t (0, \partial f (X_{k + 1}))^{2} \leq f (X_{k}) - f (X_{k + 1})

for certain $a, b > 0$ (see [33] for the values of $a$ and $b$ ).

Example 3.2.

(ADMM for FNNs [34])
Zeng et al. [34] considered the augmented Lagrangian function of an FNNs training model and solved it via an ADMM-type algorithm. Technically, they gave

$f (X)$	$= \frac{1}{2} ∥ V_{N} - Y ∥^{2} + \frac{λ}{2} N \sum i = 1 ∥ W_{i} ∥^{2} + N - 1 \sum i = 1 ⟨ Λ_{i}, σ (W_{i} V_{i - 1}) - V_{i} ⟩$
	$+ N - 1 \sum i = 1 \frac{β_{i}}{2} ∥ σ (W_{i} V_{i - 1}) - V_{i} ∥^{2} + ⟨ Λ_{N}, W_{N} V_{N - 1} - V_{N} ⟩$
	$+ \frac{β_{N}}{2} ∥ W_{N} V_{N - 1} - V_{N} ∥^{2} + N \sum i = 1 ξ_{i} ∥ V_{i} - {¯ ¯¯ ¯ V}_{i} ∥^{2},$	(4)

where $X = ({W_{i}}_{i = 1}^{N}, {V_{i}}_{i = 1}^{N}, {Λ_{i}}_{i = 1}^{N}, {{¯ ¯¯ ¯ V}_{i}}_{i = 1}^{N})$ and $σ$ denotes the activation function. Under the Assumption 1 of [34], (3.2) satisfies the KL property. Moreover, ${f (X_{k})}$ generated by the ADMM-type algorithm is convergent. Suppose that there exist $χ > 0$ and $k_{0} \in N$ such that $∥ V_{i}^{k - 1} - V_{i}^{k - 2} ∥ \leq χ ∥ V_{i}^{k} - V_{i}^{k - 1} ∥$ for each $k \geq k_{0}$ . Then, there exist $a, b > 0$ (see [34] for the values of $a$ and $b$ ) such that

\frac{a}{2 b^{2} (1 + χ)^{2} N} d i s t (0, \partial f (X_{k + 1}))^{2} \leq f (X_{k}) - f (X_{k + 1})

for each $k \geq k_{0}$ .

Example 3.3.

(mDLAM for FNNs [30])
Wang et al. [30] formulated an FNNs training model and designed an AM-type algorithm called mDLAM to solve it. The objective function of the training model is

f (X) = ℓ (v_{N}, y) + N \sum i = 1 r_{i} (W_{i}) + \frac{γ}{2} N \sum i = 1 ∥ v_{i} - W_{i} u_{i - 1} ∥^{2},

(5)

where $X = ({W_{i}}_{i = 1}^{N}, {u_{i}}_{i = 1}^{N - 1}, {v_{i}}_{i = 1}^{N})$ , and the last term is a quadratic penalty for constraints $v_{i} = W_{i} u_{i - 1}$ , $i = 1, 2, \dots, N$ . If (5) is real analytic [14], it satisfies the KL property [19, 20, 21]. Moreover, ${f (X_{k})}$ is convergent, and satisfies

\frac{a}{6 b^{2}} d i s t (0, \partial f (X_{k + 2}))^{2} \leq f (X_{k}) - f (X_{k + 2})

for certain $a, b > 0$ (see [30] for the values of $a$ and $b$ ), which is a 2-step sufficient decrease condition.

Example 3.4.

(BCD for ResNets [33])
For the residual networks (ResNets) [12] training, Zeng et al. [33] formulated a simplified model and its unconstrained approximation. The objective function is

	$f (X)$	$= \frac{1}{n} n \sum j = 1 ℓ ((V_{N})_{: j}, y_{j}) + N \sum i = 1 r_{i} (W_{i}) + N \sum i = 1 s_{i} (V_{i})$
		$+ \frac{γ}{2} N \sum i = 1 ∥ V_{i} - V_{i - 1} - σ_{i} (U_{i}) ∥^{2} + \frac{γ}{2} N \sum i = 1 ∥ U_{i} - W_{i} V_{i - 1} ∥^{2},$		(6)

where $X = ({W_{i}}_{i = 1}^{N}, {U_{i}}_{i = 1}^{N}, {V_{i}}_{i = 1}^{N})$ , and the last two terms represent quadratic penalties for constraints $V_{i} - V_{i - 1} = σ_{i} (U_{i})$ and $U_{i} = W_{i} V_{i - 1}$ , $i = 1, 2, \dots, N$ , respectively. Under the Assumption 1 of [33], (3.4) satisfies the KL property on any closed set. Moreover, ${f (X_{k})}$ generated by the BCD-type algorithm is convergent, and satisfies

\frac{a}{3 N b^{2}} d i s t (0, \partial f (X_{k + 1}))^{2} \leq f (X_{k}) - f (X_{k + 1})

for certain $a, b > 0$ (see [33] for the values of $a$ and $b$ ).

4 Theoretical analysis

In this section, we give the convergence rate estimations based on A1 and the KL property.

Theorem 4.1.

For a proper lower semicontinuous objective function $f$ and a sequence ${X_{k}}$ generated by an AM-type training algorithm, suppose that there exists $ˆ X \in d o m (\partial f)$ such that $f$ satisfies the KL property at $ˆ X$ with a neighborhood $U_{ˆ X}$ and the KL exponent $θ$ , $f (X_{k}) \to f (ˆ X)$ as $k \to \infty$ , $X_{k} \in U_{ˆ X}$ for each $k \geq k_{0}$ , and A1 is satisfied. Then the following conclusions hold:

If $θ = 0$ , then ${f (X_{k})}$ converges in a finite number of steps;
If $θ \in (0, \frac{1}{2}]$ , then there exist $k_{1} \in N$ , $η \in (0, 1)$ , and $C > 0$ such that $f (X_{k}) - f (ˆ X) \leq C η^{⌊ \frac{k - k_{1}}{j} ⌋ + 1}$ for each $k \geq k_{1}$ ;
If $θ \in (\frac{1}{2}, 1)$ , then there exist $k_{1} \in N$ and $C > 0$ such that $f (X_{k}) - f (ˆ X) \leq C (⌊ \frac{k - k_{1}}{j} ⌋ + 1)^{- \frac{1}{2 θ - 1}}$ for each $k \geq k_{1}$ .

Additionally, for each accumulation point (if any) $˜ X \in U_{ˆ X}$ of ${X_{k}}$ , it is a critical point of $f$ if and only if $f (˜ X) = f (ˆ X)$ .

The following Lemma 4.1 is needed to prove Theorem 4.1.

Lemma 4.1 ([2]).

Suppose that $X_{k} \to X$ , $G_{k} \to G$ , and $f (X_{k}) \to f (X)$ as $k \to \infty$ , of which $G_{k} \in \partial f (X_{k})$ for each $k$ . Then $G \in \partial f (X)$ .

With the similar arguments as in the proof of Theorem 3 of [17], Theorem 2 of [3] for convergence rate estimation and Theorem 1 of [30] for the analysis of accumulation points, we give the following proof.

Proof.

(proof of Theorem 4.1) First of all, the lower boundness of ${f (X_{k})}$ is proved, and a key inequality is constructed as follows. They are the bases of our analyses.

By A1, $f (X_{k}) \geq f (X_{k + j}) \geq f (X_{k + 2 j}) \geq \dots \geq f (X_{k + n j})$ holds for each $k \geq k_{0}$ . Letting $n \to \infty$ , we have $f (X_{k}) \geq f (ˆ X), k \geq k_{0}$ .

If $f (X_{k}) \to f (ˆ X)$ infinitely, there exists a $k_{1} \in N, k_{1} \geq k_{0} + j$ such that $f (X_{k}) - f (ˆ X) < 1$ , and for each $f (X_{k}) - f (ˆ X) > 0, k \geq k_{1}$ ,

	$(f (X_{k}) - f (ˆ X))^{2 θ}$	$\leq c^{2} d i s t (0, \partial f (X_{k}))^{2}$
		$\leq \frac{c^{2}}{c_{1}} (f (X_{k - j}) - f (X_{k})),$		(7)

where the first inequality follows from the KL property of $f$ at $ˆ X$ , and the second inequality follows from A1.

We now analyze the convergence rate as follows. (i) is proved by contradiction. If the conclusion is not true, there exists a subsequence ${k_{l}} \subseteq {k_{1}, k_{1} + 1, \dots}$ such that

1 \leq c d i s t (0, \partial f (X_{k_{l}})) .

Letting $l \to \infty$ , $1 \leq 0$ , a contradiction. Therefore, there exists $k_{2} \in N$ such that $f (X_{k}) \equiv f (ˆ X)$ for each $k \geq k_{2}$ .

If ${f (X_{k})}$ converges finitely, (ii) and (iii) hold trivially. Then we only need to prove them in the case of infinite convergence.

When $θ \in (0, \frac{1}{2}]$ , according to (7),

f (X_{k}) - f (ˆ X) \leq (f (X_{k}) - f (ˆ X))^{2 θ} \leq \frac{c^{2}}{} c_{1} (f (X_{k - j}) - f (X_{k}))

holds for each $k \geq k_{1}$ , which then implies that

f (X_{k}) - f (ˆ X) \leq \frac{c^{2}}{c_{1} + c^{2}} (f (X_{k - j}) - f (ˆ X)), k \geq k_{1} .

Hence, we have

for each $k \geq k_{1}$ , where

C_{1} = max {f (X_{k_{1} - j}) - f (ˆ X), f (X_{k_{1} - j + 1}) - f (ˆ X), \dots, f (X_{k_{1} - 1}) - f (ˆ X)} .

It follows from A1 and the infinite convergence assumption of ${f (X_{k})}$ that $C_{1} > 0$ . Thus (ii) holds with $C = C_{1}, η = \frac{c^{2}}{c_{1} + c^{2}}$ .

When $θ \in (\frac{1}{2}, 1)$ , given a constant $ω \in [2, \infty)$ , for each $f (X_{k}) - f (ˆ X) > 0, k \geq k_{1}$ , if $(f (X_{k}) - f (ˆ X))^{- 2 θ} \leq ω (f (X_{k - j}) - f (ˆ X))^{- 2 θ}$ , then,

	$\frac{c_{1}}{c^{2}}$	$\leq (f (X_{k}) - f (ˆ X))^{- 2 θ} ((f (X_{k - j}) - f (ˆ X)) - (f (X_{k}) - f (ˆ X)))$
		$\leq ω (f (X_{k - j}) - f (ˆ X))^{- 2 θ} ((f (X_{k - j}) - f (ˆ X)) - (f (X_{k}) - f (ˆ X)))$
		$\leq ω \int_{f (X_{k}) - f (ˆ X)}^{f (X_{k - j}) - f (ˆ X)} x^{- 2 θ} d x$
		$= \frac{ω}{2 θ - 1} ((f (X_{k}) - f (ˆ X))^{- 2 θ + 1} - (f (X_{k - j}) - f (ˆ X))^{- 2 θ + 1}),$

where the first inequality follows from (7). Hence

0 < \frac{c_{1} (2 θ - 1)}{c^{2} ω} \leq (f (X_{k}) - f (ˆ X))^{- 2 θ + 1} - (f (X_{k - j}) - f (ˆ X))^{- 2 θ + 1} .

(8)

If $(f (X_{k}) - f (ˆ X))^{- 2 θ} > ω (f (X_{k - j}) - f (ˆ X))^{- 2 θ}$ , then,

(f (X_{k}) - f (ˆ X))^{- 2 θ + 1} \geq ω^{\frac{2 θ - 1}{2 θ}} (f (X_{k - j}) - f (ˆ X))^{- 2 θ + 1} .

Hence

$0$	$< (ω^{\frac{2 θ - 1}{2 θ}} - 1) C_{1}^{- 2 θ + 1}$
	$\leq (ω^{\frac{2 θ - 1}{2 θ}} - 1) (f (X_{k - j}) - f (ˆ X))^{- 2 θ + 1}$
	$\leq (f (X_{k}) - f (ˆ X))^{- 2 θ + 1} - (f (X_{k - j}) - f (ˆ X))^{- 2 θ + 1} .$	(9)

According to (8) and (9),

(f (X_{k - j}) - f (ˆ X))^{- 2 θ + 1} + L \leq (f (X_{k}) - f (ˆ X))^{- 2 θ + 1}

holds for each $f (X_{k}) - f (ˆ X) > 0, k \geq k_{1}$ , where

L = min {\frac{c_{1} (2 θ - 1)}{c^{2} ω}, (ω^{\frac{2 θ - 1}{2 θ}} - 1) C_{1}^{- 2 θ + 1}} > 0.

Then we have

	$f (X_{k}) - f (ˆ X)$	$\leq {(C_{1}^{- 2 θ + 1} + L (⌊ \frac{k - k_{1}}{j} ⌋ + 1))}^{- \frac{1}{2 θ - 1}}$
		$\leq L^{- \frac{1}{2 θ - 1}} {(⌊ \frac{k - k_{1}}{j} ⌋ + 1)}^{- \frac{1}{2 θ - 1}} .$		(10)

Clearly, for each $f (X_{k}) - f (ˆ X) = 0, k \geq k_{1}$ , (10) still holds. Thus we obtain (iii) with $C = L^{- \frac{1}{2 θ - 1}}$ .

For each accumulation point $˜ X$ , there exists a subsequence ${X_{k_{l}}}$ such that ${lim}_{l \to \infty} X_{k_{l}} = ˜ X$ . By A1, ${lim}_{l \to \infty} d i s t (0, \partial f (X_{k_{l}})) = 0$ . For each $k_{l}$ , there exists $G_{k_{l}} \in \partial f (X_{k_{l}})$ such that

d i s t (0, \partial f (X_{k_{l}})) \leq ∥ G_{k_{l}} ∥ \leq d i s t (0, \partial f (X_{k_{l}})) + \frac{1}{k_{l}} .

Letting $l \to \infty$ , we have

0 = lim l \to \infty d i s t (0, \partial f (X_{k_{l}})) \leq lim l \to \infty ∥ G_{k_{l}} ∥ \leq lim l \to \infty (d i s t (0, \partial f (X_{k_{l}})) + \frac{1}{k_{l}}) = 0.

So ${lim}_{l \to \infty} ∥ G_{k_{l}} ∥ = 0$ . Without loss of generality, suppose that $G_{k_{l}} \to ˜ G$ as $l \to \infty$ . Then $∥ ˜ G ∥ = {lim}_{l \to \infty} ∥ G_{k_{l}} ∥ = 0$ , $˜ G = 0$ . According to

X_{k_{l}} \to ˜ X, G_{k_{l}} \to 0, and f (X_{k_{l}}) \to f (˜ X), a s l \to \infty, G_{k_{l}} \in \partial f (X_{k_{l}}),

and Lemma 4.1, we have $0 \in \partial f (˜ X)$ .

Moreover, if $˜ X$ is a critical point, it follows from the Remark 2.5 (d) of [2] that $f (˜ X) = f (ˆ X)$ . ∎

Parts (ii) and (iii) of the above theorem implies that ${f (X_{k})}$ converges at least locally R-linearly and locally R-sublinearly to $f (ˆ X)$ , respectively. For each example in Sect. 3, A1 and the KL property are satisfied. If $f$ is real analytic, then the KL exponent $θ$ is in $[\frac{1}{2}, 1)$ at a critical point [1, 20]. Examples of the real analytic objective functions are provided in [33, 34], of which the linear, polynomial, hyperbolic tangent, and sigmoid activation functions; the squared, logistic, and exponential loss functions; and the squared Frobenius norm regularization terms are considered. Furthermore, ${X_{k}}$ is convergent in Example 1, 2, and 4 [33, 34], so the corresponding ${f (X_{k})}$ has $O (η^{k})$ local convergence rate for $θ = \frac{1}{2}$ and $O (k^{- \frac{1}{2 θ - 1}})$ local convergence rate for $θ \in (\frac{1}{2}, 1)$ by our Theorem 4.1. Besides, if the assumption $X_{k} \in U_{ˆ X}$ for each sufficiently large $k$ is satisfied in Example 3, the aforementioned results also hold. Moreover, the continuity of $f$ is satisfied in each example in Sect. 3 under certain assumptions [30, 33, 34]. Then, each accumulation point of ${X_{k}}$ is a critical point by our Theorem 4.1 (see [30, 33, 34] for the existence of accumulation point for each example in Sect. 3).

Moreover, if the following stronger $j$ -step sufficient decrease condition is satisfied, ${f (X_{k})}$ converges at least locally R-linearly to $f (ˆ X)$ for each $θ \in [0, 1)$ as shown in the following Theorem 4.2.

A2. For a certain $j \in N_{+}$ , there exist positive constants $α \in [θ, \infty) and c_{2}$ such that

$c_{2} d i s t (0, \partial f (X_{k + j}))^{\frac{1}{α}} \leq f (X_{k}) - f (X_{k + j})$

for each $k \geq k_{0}$ , where $θ$ is the KL exponent of $f$ .

When $θ \in (\frac{1}{2}, 1)$ , compared with A1, a larger descent of $j$ steps iteration is guaranteed in A2, and it is a more dedicated estimation for $f (X_{k}) - f (X_{k + j})$ .

Theorem 4.2.

For an objective function $f$ and a sequence ${X_{k}}$ , suppose that $f$ satisfies the KL property at $ˆ X$ with $U_{ˆ X}$ and $θ$ , $f (X_{k}) \to f (ˆ X)$ , $X_{k} \in U_{ˆ X}$ for each $k \geq k_{0}$ , and A2 is satisfied. Then ${f (X_{k})}$ has a local convergence rate of $O (η^{k})$ , where $η \in (0, 1)$ . Additionally, for each accumulation point (if any) $˜ X \in U_{ˆ X}$ of ${X_{k}}$ , it is a critical point if and only if $f (˜ X) = f (ˆ X)$ .

Proof.

When $θ = 0$ , with the similar arguments as in the proof of Theorem 4.1, finite convergence is achieved, and $O (η^{k})$ complexity bound holds trivially. When $θ \in (0, 1)$ , similarly, we only prove it in the case of infinite convergence. By A2 and the KL property of $f$ at $ˆ X$ , there exists a $k_{1} \in N, k_{1} \geq k_{0} + j$ such that for each $k \geq k_{1}$ , $d i s t (0, \partial f (X_{k})) < 1$ , and for each $f (X_{k}) - f (ˆ X) > 0$ ,

	$f (X_{k}) - f (ˆ X)$	$\leq c^{\frac{1}{θ}} d i s t (0, \partial f (X_{k}))^{\frac{1}{θ}}$
		$\leq c^{\frac{1}{θ}} d i s t (0, \partial f (X_{k}))^{\frac{1}{α}}$
		$\leq \frac{c^{\frac{1}{θ}}}{c_{2}} (f (X_{k - j}) - f (X_{k})) .$

Then,

f (X_{k}) - f (ˆ X) \leq \frac{c^{\frac{1}{θ}}}{c_{2} + c^{\frac{1}{θ}}} (f (X_{k - j}) - f (ˆ X)) .

Hence, we have

f (X_{k}) - f (ˆ X) \leq C_{1} {⎛ ⎝ \frac{c^{\frac{1}{θ}}}{c_{2} + c^{\frac{1}{θ}}} ⎞ ⎠}^{⌊ \frac{k - k_{1}}{j} ⌋ + 1}

for each $k \geq k_{1}$ , where $C_{1}$ is the same as that in the proof of Theorem 4.1. So a $O (η^{k})$ local convergence rate is achieved with $η = (c^{\frac{1}{θ}} / (c_{2} + c^{\frac{1}{θ}}))^{\frac{1}{j}}$ . With the similar arguments as in the proof of Theorem 4.1, we obtain the rest of Theorem 4.2. ∎

It is worth noting that although the local R-linear convergence can be achieved in any cases under A2, when $θ \in (\frac{1}{2}, 1)$ , verification whether a training model and algorithm satisfies the stronger decrease condition is a challenging problem [18, 22, 32].

5 Conclusions

In this paper, a unified framework is proposed to analyze the convergence rate of the objective function value generated by the AM-type training algorithm. The $j$ -step sufficient decrease conditions and the KL property play a central role in our analysis. And the requirement of nonincreasing property of function value sequence is relaxed in our framework. Based on the squared norm lower bound estimation of the $j$ -step descent, three kinds of convergence rates are discussed for different values of the KL exponent $θ$ , respectively. Moreover, if a larger descent is guaranteed, we can improve the convergence rate to $O (η^{k})$ for $θ \in (\frac{1}{2}, 1)$ .

Acknowledgements

Bao’s research was supported by the National Natural Science Foundation of China (Grant No. 11901338) and the Tsinghua University Initiative Scientific Research Program. Xing’s research was supported by the National Natural Science Foundation of China (Grant No. 11771243).

References

[1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran (2010) Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35 (2), pp. 438–457. External Links: Document Cited by: §1, §1, Definition 2.4, Definition 2.4, §4.
[2] H. Attouch, J. Bolte, and B. F. Svaiter (2013) Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137, pp. 91–129. External Links: Document Cited by: §1, Definition 2.3, Definition 2.4, Lemma 4.1, §4.
[3] H. Attouch and J. Bolte (2009) On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116, pp. 5–16. External Links: Document Cited by: §4.
[4] Y. Bengio, P. Simard, and P. Frasconi (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5 (2), pp. 157–166. External Links: Document Cited by: §1.
[5] J. Bolte, S. Sabach, and M. Teboulle (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, pp. 459–494. External Links: Document Cited by: Definition 2.4.
[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 (1), pp. 1–122. External Links: Document Cited by: §1.
[7] M. Á. Carreira-Perpiñán and W. Wang (2014) Distributed optimization of deeply nested systems. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, S. Kaski and J. Corander (Eds.), Proceedings of Machine Learning Research, Vol. 33, pp. 10–19. Cited by: §1.
[8] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: §1.
[9] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10. External Links: Document Cited by: §1.
[10] P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198. External Links: Document Cited by: §1.
[11] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. External Links: Link Cited by: §1.
[12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1, Example 3.4.
[13] G. Jagatap and C. Hegde (2018) Learning ReLU networks via alternating minimization. arXiv preprint arXiv: 1806.07863. External Links: Document, 1806.07863 Cited by: §1.
[14] S. G. Krantz and H. R. Parks (2002) A primer of real analytic functions. 2 edition, Birkhäuser, Boston. External Links: Document Cited by: Example 3.3.
[15] K. Kurdyka (1998) On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48 (3), pp. 769–783. External Links: Document Cited by: Definition 2.4.
[16] T. T. Lau, J. Zeng, B. Wu, and Y. Yao (2018) A proximal block coordinate descent algorithm for deep neural network training. arXiv preprint arXiv:1803.09082. External Links: Document, 1803.09082 Cited by: §1, §1.
[17] G. Li and T. K. Pong (2016) Douglas-Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems. Math. Program. 159, pp. 371–401. External Links: Document Cited by: §4.
[18] G. Li and T. K. Pong (2018) Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, pp. 1199–1232. External Links: Document Cited by: §1, §1, Definition 2.4, §4.
[19] S. Łojasiewicz (1993) Sur la géométrie semi- et sous- analytique. Ann. Inst. Fourier 43 (5), pp. 1575–1595. External Links: Document Cited by: Definition 2.4, Example 3.3.
[20] S. Łojasiewicz (1963) Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles, pp. 87–89. Cited by: Definition 2.4, Example 3.3, §4.
[21] S. Łojasiewicz (1984) Sur les trajectoires du gradient d’une fonction analytique. In Seminari di Geometria 1982-1983, Bologna, pp. 115–117. Cited by: Example 3.3.
[22] Z. Luo, J. Pang, and D. Ralph (1996) Mathematical programs with equilibrium constraints. Cambridge University Press, Cambridge. External Links: Document Cited by: §4.
[23] B. S. Mordukhovich (2006) Variational analysis and generalized differentiation I: basic theory. Springer-Verlag, Berlin, Heidelberg. External Links: Document Cited by: Definition 2.1, Definition 2.2.
[24] R. T. Rockafellar and R. J. Wets (1998) Variational analysis. Springer-Verlag, Berlin, Heidelberg. External Links: Document Cited by: Definition 2.1, Definition 2.2.
[25] H. M. Shi, S. Tu, Y. Xu, and W. Yin (2016) A primer on coordinate descent algorithms. arXiv preprint arXiv:1610.00040. External Links: Document, 1610.00040 Cited by: §1.
[26] W. Sun and Y. Yuan (2006) Optimization theory and methods: nonlinear programming. Vol. 1, pp. 1–70. External Links: Document Cited by: Definition 2.5.
[27] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. Cited by: §1.
[28] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein (2016) Training neural networks without gradients: a scalable ADMM approach. In Proceedings of the 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, pp. 2722–2731. Cited by: §1.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §1.
[30] J. Wang, H. Li, and L. Zhao (2022) Accelerated gradient-free neural network training by multi-convex alternating optimization. Neurocomputing 487, pp. 130–143. External Links: Document Cited by: §1, §1, Example 3.3, §4, §4.
[31] Y. Xu and W. Yin (2013) A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6 (3), pp. 1758–1789. External Links: Document Cited by: Definition 2.4.
[32] P. Yu, G. Li, and T. K. Pong (2021) Kurdyka-Łojasiewicz exponent via inf-projection. Found. Comput. Math.. External Links: Document Cited by: §4.
[33] J. Zeng, T. T. Lau, S. Lin, and Y. Yao (2019) Global convergence of block coordinate descent in deep learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 7313–7323. Cited by: §1, §1, Example 3.1, Example 3.1, Example 3.4, §4.
[34] J. Zeng, S. Lin, Y. Yao, and D. Zhou (2021) On ADMM in deep learning: convergence and saturation-avoidance. J. Mach. Learn. Res. 22 (199), pp. 1–67. Cited by: §1, §1, Example 3.2, §4.
[35] Z. Zhang and M. Brand (2017) Convergent block coordinate descent for training Tikhonov regularized deep neural networks. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Cited by: §1.
[36] Z. Zhang, Y. Chen, and V. Saligrama (2016) Efficient training of very deep neural networks for supervised hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1487–1495. Cited by: §1.