Convergence Rates of Training Deep Neural Networks via Alternating Minimization Methods

Jintao Xu Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
Email: xujt19@mails.tsinghua.edu.cn
   Chenglong Bao Yau Mathematical Sciences Center, Tsinghua University, Beijing 100084, China, and Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China.
Email: clbao@mail.tsinghua.edu.cn
   Wenxun Xing Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
Email: wxing@mail.tsinghua.edu.cn
Abstract

Training deep neural networks (DNNs) is an important and challenging optimization problem in machine learning due to its non-convexity and non-separable structure. The alternating minimization (AM) approaches split the composition structure of DNNs and have drawn great interest in the deep learning and optimization communities. In this paper, we propose a unified framework for analyzing the convergence rate of AM-type network training methods. Our analysis are based on the -step sufficient decrease conditions and the Kurdyka-Łojasiewicz (KL) property, which relaxes the requirement of designing descent algorithms. We show the detailed local convergence rate if the KL exponent varies in . Moreover, the local R-linear convergence is discussed under a stronger -step sufficient decrease condition.

Keywords Deep neural networks training; Alternating minimization; Kurdyka-Łojasiewicz property;
-step sufficient decrease; Convergence rate

Mathematics Subject Classification (2020) 49M37 90C26 90C52

1 Introduction

In recent years, deep learning has achieved impressive successes in many areas including computer vision [8, 12], natural language processing [27, 29], and recommender system [9, 10]. For deep neural networks (DNNs) training, the alternating minimization (AM)-type training methods, mainly based on the block coordinate descent (BCD) [25] or the alternating direction method of multipliers (ADMM) [6], have been discussed such as BCD-type algorithms [7, 16, 30, 33, 35] and ADMM-type algorithms [28, 34, 36]. In these methods, auxiliary variables are added for each layer to decouple the nested parameters in DNNs, and the constraints are lifted to the objective functions. Using the above techniques, the vanishing gradient issue [4, 11] is avoided and parallel updates can be achieved.

Let be a DNNs training model, where , and be a sequence generated by an AM-type training algorithm. In this paper, we propose a unified framework for establishing the convergence rate of the objective function value generated by various AM-type training algorithms. In particular, let be a positive integer, our analysis imposes a -step sufficient decrease condition defined as follows, which relaxes the common descent condition used in [2].

  • A1. For a certain , there exists such that

    (1)

    for each .

If , the above condition can be derived from H1 and H2 of [2]. When , it allows the oscillations of during the consecutive iterations, which can be classified as non-monotone methods. In Sect. 3, we will show that the existing four AM-type algorithms for training DNNs satisfy A1. Besides, the Kurdyka-Łojasiewicz (KL) property [1, 18] assumption of , which implies a local sharpness under reparametrization [2], plays a central role in our analysis.

Convergence results have been stated for some AM-type training algorithms [13, 16, 30, 33, 34]. From the theoretical perspective, the convergence rate of generated by a BCD-type algorithm is analyzed [16], and the convergence of the objective function value is proved for BCD-type and ADMM-type algorithms [33, 34]. Besides, the convergence rate and the sample complexity of an AM-type algorithm for ReLU networks are established in [13]. However, there are few results about the convergence rate of the objective function value, like that in [30], and we address this question.

Based on A1, we establish the local convergence rate of the objective function value sequences in Theorem 4.1, which depends on the different values of the Kurdyka-Łojasiewicz (KL) exponent [1, 18] of . Moreover, if we replace the lower bound in (1) with , where and are positive constants, we give the local linear convergence in Theorem 4.2.

The rest of the paper is organized as follows. Notations and definitions used throughout this paper are listed in Sect. 2. Four examples are shown in Sect. 3. Estimations of convergence rate are stated in Sect. 4, and we summarize our results in Sect. 5.

2 Notations and Definitions

We give notations and definitions that are useful in our analysis.
Notations. Throughout this paper, , , , and denote the set of real numbers, real -dimensional vectors, real matrices, and positive integers respectively. 0 denotes the matrix of all zeros whose size varies from the context. denotes the Euclidean norm for and the Frobenius norm for , respectively. For . . For any , is the greatest integer no larger than . is the standard big O asymptotic notation.

Definition 2.1.

(Fréchet subdifferential [23, 24])
The Fréchet subdifferential of at is the set

Definition 2.2.

(Limiting subdifferential [23, 24])
For each , the limiting subdifferential of at is the set

.

Definition 2.3.

(Limiting critical point [2])
A point is called a (limiting) critical point of if .

Definition 2.4.

(Kurdyka-Łojasiewicz property [1, 18])
A proper lower semicontinuous function is called to have the Kurdyka-Łojasiewicz (KL) property at if there exist , , and a neighborhood of such that for all ,

We call as the Kurdyka-Łojasiewicz (KL) exponent [18].

KL property is widely used in nonconvex optimization [1, 2, 5, 31]. The pioneering work on it is credited to Kurdyka [15] and Łojasiewicz [19, 20]. More details about this property can be seen in [1, 2, 5] and the references therein.

Definition 2.5.

(Root (R)-convergence rate [26])
For any convergent sequence with limit , . If , is called R-linearly convergent. If , is called R-sublinearly convergent.

3 Typical AM-type algorithms

In this section, we present four examples that apply AM-type algorithms for training DNNs. These examples and their numerical results motivate us to construct a framework for establishing a theoretical analysis of the objective value.

Example 3.1.

(BCD for FNNs [33])
For the feedforward neural networks (FNNs) training, Zeng et al. [33] formulated two optimization models and designed BCD-type algorithms for their unconstrained approximations, respectively. In one model, the objective function is

(2)

where , denotes the activation function of the th layer, , is a loss function, , can be seen as regularization terms about , , , respectively, denotes the th column of , , and the last term represents a quadratic penalty for constraints , . Under the Assumption 1 of [33], (3.1) satisfies the KL property on any closed set. Moreover, generated by the BCD-type algorithm is convergent, and satisfies

for certain (see [33] for the values of and ).

In the other model,

(3)

where , and the last two terms are quadratic penalties for constraints and , , respectively. Under the same assumptions, (3.1) satisfies the KL property on any closed set. Similarly, is convergent, and satisfies

for certain (see [33] for the values of and ).

Example 3.2.

(ADMM for FNNs [34])
Zeng et al. [34] considered the augmented Lagrangian function of an FNNs training model and solved it via an ADMM-type algorithm. Technically, they gave

(4)

where and denotes the activation function. Under the Assumption 1 of [34], (3.2) satisfies the KL property. Moreover, generated by the ADMM-type algorithm is convergent. Suppose that there exist and such that for each . Then, there exist (see [34] for the values of and ) such that

for each .

Example 3.3.

(mDLAM for FNNs [30])
Wang et al. [30] formulated an FNNs training model and designed an AM-type algorithm called mDLAM to solve it. The objective function of the training model is

(5)

where , and the last term is a quadratic penalty for constraints , . If (5) is real analytic [14], it satisfies the KL property [19, 20, 21]. Moreover, is convergent, and satisfies

for certain (see [30] for the values of and ), which is a 2-step sufficient decrease condition.

Example 3.4.

(BCD for ResNets [33])
For the residual networks (ResNets) [12] training, Zeng et al. [33] formulated a simplified model and its unconstrained approximation. The objective function is

(6)

where , and the last two terms represent quadratic penalties for constraints and , , respectively. Under the Assumption 1 of [33], (3.4) satisfies the KL property on any closed set. Moreover, generated by the BCD-type algorithm is convergent, and satisfies

for certain (see [33] for the values of and ).

4 Theoretical analysis

In this section, we give the convergence rate estimations based on A1 and the KL property.

Theorem 4.1.

For a proper lower semicontinuous objective function and a sequence generated by an AM-type training algorithm, suppose that there exists such that satisfies the KL property at with a neighborhood and the KL exponent , as , for each , and A1 is satisfied. Then the following conclusions hold:

  • If , then converges in a finite number of steps;

  • If , then there exist , , and such that for each ;

  • If , then there exist and such that for each .

Additionally, for each accumulation point (if any) of , it is a critical point of if and only if .

The following Lemma 4.1 is needed to prove Theorem 4.1.

Lemma 4.1 ([2]).

Suppose that , , and as , of which for each . Then .

With the similar arguments as in the proof of Theorem 3 of [17], Theorem 2 of [3] for convergence rate estimation and Theorem 1 of [30] for the analysis of accumulation points, we give the following proof.

Proof.

(proof of Theorem 4.1) First of all, the lower boundness of is proved, and a key inequality is constructed as follows. They are the bases of our analyses.

By A1, holds for each . Letting , we have .

If infinitely, there exists a such that , and for each ,

(7)

where the first inequality follows from the KL property of at , and the second inequality follows from A1.

We now analyze the convergence rate as follows. (i) is proved by contradiction. If the conclusion is not true, there exists a subsequence such that

Letting , , a contradiction. Therefore, there exists such that for each .

If converges finitely, (ii) and (iii) hold trivially. Then we only need to prove them in the case of infinite convergence.

When , according to (7),

holds for each , which then implies that

Hence, we have

for each , where

It follows from A1 and the infinite convergence assumption of that . Thus (ii) holds with .

When , given a constant , for each , if , then,

where the first inequality follows from (7). Hence

(8)

If , then,

Hence

(9)

According to (8) and (9),

holds for each , where

Then we have

(10)

Clearly, for each , (10) still holds. Thus we obtain (iii) with .

For each accumulation point , there exists a subsequence such that . By A1, . For each , there exists such that

Letting , we have

So . Without loss of generality, suppose that as . Then , . According to

and Lemma 4.1, we have .

Moreover, if is a critical point, it follows from the Remark 2.5 (d) of [2] that . ∎

Parts (ii) and (iii) of the above theorem implies that converges at least locally R-linearly and locally R-sublinearly to , respectively. For each example in Sect. 3, A1 and the KL property are satisfied. If is real analytic, then the KL exponent is in at a critical point [1, 20]. Examples of the real analytic objective functions are provided in [33, 34], of which the linear, polynomial, hyperbolic tangent, and sigmoid activation functions; the squared, logistic, and exponential loss functions; and the squared Frobenius norm regularization terms are considered. Furthermore, is convergent in Example 1, 2, and 4 [33, 34], so the corresponding has local convergence rate for and local convergence rate for by our Theorem 4.1. Besides, if the assumption for each sufficiently large is satisfied in Example 3, the aforementioned results also hold. Moreover, the continuity of is satisfied in each example in Sect. 3 under certain assumptions [30, 33, 34]. Then, each accumulation point of is a critical point by our Theorem 4.1 (see [30, 33, 34] for the existence of accumulation point for each example in Sect. 3).

Moreover, if the following stronger -step sufficient decrease condition is satisfied, converges at least locally R-linearly to for each as shown in the following Theorem 4.2.

  • A2. For a certain , there exist positive constants such that

    for each , where is the KL exponent of .

When , compared with A1, a larger descent of steps iteration is guaranteed in A2, and it is a more dedicated estimation for .

Theorem 4.2.

For an objective function and a sequence , suppose that satisfies the KL property at with and , , for each , and A2 is satisfied. Then has a local convergence rate of , where . Additionally, for each accumulation point (if any) of , it is a critical point if and only if .

Proof.

When , with the similar arguments as in the proof of Theorem 4.1, finite convergence is achieved, and complexity bound holds trivially. When , similarly, we only prove it in the case of infinite convergence. By A2 and the KL property of at , there exists a such that for each , , and for each ,

Then,

Hence, we have

for each , where is the same as that in the proof of Theorem 4.1. So a local convergence rate is achieved with . With the similar arguments as in the proof of Theorem 4.1, we obtain the rest of Theorem 4.2. ∎

It is worth noting that although the local R-linear convergence can be achieved in any cases under A2, when , verification whether a training model and algorithm satisfies the stronger decrease condition is a challenging problem [18, 22, 32].

5 Conclusions

In this paper, a unified framework is proposed to analyze the convergence rate of the objective function value generated by the AM-type training algorithm. The -step sufficient decrease conditions and the KL property play a central role in our analysis. And the requirement of nonincreasing property of function value sequence is relaxed in our framework. Based on the squared norm lower bound estimation of the -step descent, three kinds of convergence rates are discussed for different values of the KL exponent , respectively. Moreover, if a larger descent is guaranteed, we can improve the convergence rate to for .

Acknowledgements

Bao’s research was supported by the National Natural Science Foundation of China (Grant No. 11901338) and the Tsinghua University Initiative Scientific Research Program. Xing’s research was supported by the National Natural Science Foundation of China (Grant No. 11771243).

References

  • [1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran (2010) Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35 (2), pp. 438–457. External Links: Document Cited by: §1, §1, Definition 2.4, Definition 2.4, §4.
  • [2] H. Attouch, J. Bolte, and B. F. Svaiter (2013) Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137, pp. 91–129. External Links: Document Cited by: §1, Definition 2.3, Definition 2.4, Lemma 4.1, §4.
  • [3] H. Attouch and J. Bolte (2009) On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116, pp. 5–16. External Links: Document Cited by: §4.
  • [4] Y. Bengio, P. Simard, and P. Frasconi (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5 (2), pp. 157–166. External Links: Document Cited by: §1.
  • [5] J. Bolte, S. Sabach, and M. Teboulle (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, pp. 459–494. External Links: Document Cited by: Definition 2.4.
  • [6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 (1), pp. 1–122. External Links: Document Cited by: §1.
  • [7] M. Á. Carreira-Perpiñán and W. Wang (2014) Distributed optimization of deeply nested systems. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, S. Kaski and J. Corander (Eds.), Proceedings of Machine Learning Research, Vol. 33, pp. 10–19. Cited by: §1.
  • [8] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: §1.
  • [9] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10. External Links: Document Cited by: §1.
  • [10] P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198. External Links: Document Cited by: §1.
  • [11] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. External Links: Link Cited by: §1.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1, Example 3.4.
  • [13] G. Jagatap and C. Hegde (2018) Learning ReLU networks via alternating minimization. arXiv preprint arXiv: 1806.07863. External Links: Document, 1806.07863 Cited by: §1.
  • [14] S. G. Krantz and H. R. Parks (2002) A primer of real analytic functions. 2 edition, Birkhäuser, Boston. External Links: Document Cited by: Example 3.3.
  • [15] K. Kurdyka (1998) On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48 (3), pp. 769–783. External Links: Document Cited by: Definition 2.4.
  • [16] T. T. Lau, J. Zeng, B. Wu, and Y. Yao (2018) A proximal block coordinate descent algorithm for deep neural network training. arXiv preprint arXiv:1803.09082. External Links: Document, 1803.09082 Cited by: §1, §1.
  • [17] G. Li and T. K. Pong (2016) Douglas-Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems. Math. Program. 159, pp. 371–401. External Links: Document Cited by: §4.
  • [18] G. Li and T. K. Pong (2018) Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, pp. 1199–1232. External Links: Document Cited by: §1, §1, Definition 2.4, §4.
  • [19] S. Łojasiewicz (1993) Sur la géométrie semi- et sous- analytique. Ann. Inst. Fourier 43 (5), pp. 1575–1595. External Links: Document Cited by: Definition 2.4, Example 3.3.
  • [20] S. Łojasiewicz (1963) Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles, pp. 87–89. Cited by: Definition 2.4, Example 3.3, §4.
  • [21] S. Łojasiewicz (1984) Sur les trajectoires du gradient d’une fonction analytique. In Seminari di Geometria 1982-1983, Bologna, pp. 115–117. Cited by: Example 3.3.
  • [22] Z. Luo, J. Pang, and D. Ralph (1996) Mathematical programs with equilibrium constraints. Cambridge University Press, Cambridge. External Links: Document Cited by: §4.
  • [23] B. S. Mordukhovich (2006) Variational analysis and generalized differentiation I: basic theory. Springer-Verlag, Berlin, Heidelberg. External Links: Document Cited by: Definition 2.1, Definition 2.2.
  • [24] R. T. Rockafellar and R. J. Wets (1998) Variational analysis. Springer-Verlag, Berlin, Heidelberg. External Links: Document Cited by: Definition 2.1, Definition 2.2.
  • [25] H. M. Shi, S. Tu, Y. Xu, and W. Yin (2016) A primer on coordinate descent algorithms. arXiv preprint arXiv:1610.00040. External Links: Document, 1610.00040 Cited by: §1.
  • [26] W. Sun and Y. Yuan (2006) Optimization theory and methods: nonlinear programming. Vol. 1, pp. 1–70. External Links: Document Cited by: Definition 2.5.
  • [27] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. Cited by: §1.
  • [28] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein (2016) Training neural networks without gradients: a scalable ADMM approach. In Proceedings of the 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, pp. 2722–2731. Cited by: §1.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §1.
  • [30] J. Wang, H. Li, and L. Zhao (2022) Accelerated gradient-free neural network training by multi-convex alternating optimization. Neurocomputing 487, pp. 130–143. External Links: Document Cited by: §1, §1, Example 3.3, §4, §4.
  • [31] Y. Xu and W. Yin (2013) A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6 (3), pp. 1758–1789. External Links: Document Cited by: Definition 2.4.
  • [32] P. Yu, G. Li, and T. K. Pong (2021) Kurdyka-Łojasiewicz exponent via inf-projection. Found. Comput. Math.. External Links: Document Cited by: §4.
  • [33] J. Zeng, T. T. Lau, S. Lin, and Y. Yao (2019) Global convergence of block coordinate descent in deep learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 7313–7323. Cited by: §1, §1, Example 3.1, Example 3.1, Example 3.4, §4.
  • [34] J. Zeng, S. Lin, Y. Yao, and D. Zhou (2021) On ADMM in deep learning: convergence and saturation-avoidance. J. Mach. Learn. Res. 22 (199), pp. 1–67. Cited by: §1, §1, Example 3.2, §4.
  • [35] Z. Zhang and M. Brand (2017) Convergent block coordinate descent for training Tikhonov regularized deep neural networks. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Cited by: §1.
  • [36] Z. Zhang, Y. Chen, and V. Saligrama (2016) Efficient training of very deep neural networks for supervised hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1487–1495. Cited by: §1.