Learning Generative Embeddings using an Optimal Subsampling Policy for Tensor Sketching

Chandrajit Bajaj
Department of Computer Science and Oden Institute
University of Texas at Austin
Austin, TX 78712, USA bajaj@cs.utexas.edu
&Taemin Heo
Department of Civil, Architectural and Environmental Engineering
University of Texas at Austin
Austin, TX 78712, USA taemin@utexas.edu
&Rochan Avlur
Department of Computer Science
University of Texas at Austin
Austin, TX 78712, USA ravlur@utexas.edu

Abstract

Data tensors of orders 3 and greater are routinely being generated. These data collections are increasingly huge and growing. They are either tensor fields (e.g., images, videos, geographic data) in which each location of data contains important information or permutation invariant general tensors (e.g., unsupervised latent space learning, graph network analysis, recommendation systems, etc.). Directly accessing such large data tensor collections for information has become increasingly prohibitive. We learn approximate full-rank and compact tensor sketches with decompositive representations providing compact space, time and spectral embeddings of both tensor fields (P-SCT) and general tensors (P-SCT-Permute). All subsequent information querying with high accuracy is performed on the generative sketches. We produce optimal rank- $r$ Tucker decompositions of arbitrary order data tensors by building tensor sketches from a sample-efficient sub-sampling of tensor slices. Our sample efficient policy is learned via an adaptable stochastic Thompson sampling using Dirichlet distributions with conjugate priors.

1 Introduction

Data tensors of orders 3 and greater are routinely being generated. These data collections are increasingly huge and growing. For instance, a North American Regional Reanalysis (NARR) has been collecting 70 climate variables every 3 hours from 1979 to the present, and it is currently at a total size of 29.4 Terabytes Mesinger et al. (2006). These geographical data tensors are tensor fields in which each location of data contains important information. Images and videos are also tensor fields. On the other hand, there are permutation invariant general tensors with applications in unsupervised latent space learning, community graphs and network analysis, anomaly detection, and recommendation systems. Directly accessing such large data tensor collections for information has become increasingly prohibitive. Approximate low-rank and compact tensor decompositions of tensor sketches provide a space and time efficient alternative as all information querying with high accuracy is performed on the sketches.

Randomized sketching algorithms have been a popular approach for producing a low rank approximation of very large matrices and tensors Woolfe et al. (2008); Halko et al. (2011); Woodruff and others (2014); Cohen et al. (2015); Boutsidis et al. (2016); Tropp et al. (2017). The algorithms are faster than the singular value decomposition while keeping the approximation accuracy at a similar level. A recently proposed SketchyCoreSVD further speeds up the computation and reduces memory requirement by building random sketches only from its subsampled columns and rows Bajaj et al. (2019). It advances SketchySVD Tropp et al. (2019) by exploiting redundancy in the data matrix which can be characterized by incoherence. SketchySVD has also been extended to compute the approximate low-rank Tucker decomposition of tensors Sun et al. (2020). Using the same idea of SketchyCoreSVD Bajaj et al. (2019), approximating only from samples, we derived Randomized SketchyCoreTucker (R-SCT) decomposition in the Appendix.

SketchyCoreTucker is mathematically proven and computationally advantageous, but its performance has an inherent variance due to random sampling protocol and user-defined sampling ratios. SketchyCoreTucker randomly samples data subsets from the original data tensor. It can be suboptimal and inefficient for sparse structured data. In many modern climate data, meaningful information is often concentrated in the small regions — e.g., extreme events such as storms or droughts. Sampling irrelevant or duplicate subsets are critical memory and accuracy loss for the method. Moreover, the performance of the method varies by the user-defined sampling ratios. Additional cumbersome computation is needed to find the optimal sampling ratios for available computational resources.

In addition to the SketchyCoreTucker, other randomized sampling based Tucker decompositions have been studied in last decades. Most of algorithms are sampling columns of matrix unfoldings Ahmadi-Asl et al. (2021). All these algorithms are randomly sampling from the unknown distribution of the data fibers. Simple algorithms use uniform distribution so that they can skip informativeness calculation Drineas and Mahoney (2007). They are theoretically fastest, but mostly suboptimal and sample inefficient. Other algorithms utilize measures such as leverage score Saibaba (2016); Perros et al. (2015); Cheng et al. (2016) to infer informativeness distribution. These algorithms use more information, but they are problematic for higher order and large sized data tensors. This is because the informativeness distributions can be easily large enough to prohibit calculation when the dimension and size of original data tensor is large. Not only the calculating density values of wide distribution but also the storing those values make the algorithm inefficient.

To overcome these challenges, we present a new method called Progressive Sketching that progressively samples row vectors of matrix unfolding of input tensor (i.e., order $K - 1$ tensor slice of order $K$ input tensor) from the actively updated informativeness distribution. By doing so, we actively learn the optimal subsampling policy for tensor sketches. A batch of data subsets will be progressively subsampled at each step. Based on the amount of information embedded in sampled data subsets and its geometric distribution, the agent decides which data subsets to sample next. By learning the optimal policy, we can expect the best low rank approximation with the smallest sample use.

Progressive sketching applied to SketchyCoreTucker, P-SCT, does not need the sampling ratio, and it produces more accurate low rank approximation than SketchyCoreTucker using the same amount of data subsets. In other words, P-SCT is a sample efficient sketching based tensor decomposition. P-SCT is best for tensor fields due to its informativeness inference based on the index-wise location of them. Permutation agnostic P-SCT (named P-SCT-Permute) is also derived to allow for index permutations and learning general tensor sketches. Progressive sketching can be applied to any randomized sketching algorithms and formulated by any sequential decision making framework. This study introduces vanilla Thompson sampling framework, and it can be advanced to Thompson sampling via local uncertainty framework for contextual bandit problems Wang and Zhou (2020). State-of-the-art deep reinforcement learning algorithms, especially for Soft Actor-Critic Haarnoja et al. (2018, 2018); Christodoulou (2019), and Soft Q-learning with mutual information regularization Grau-Moya et al. (2019), can be used for progressive sketching based on the algorithm introduced in this paper. We validate the performance of P-SCT and P-SCT-Permute on various large static and time-varying datasets while quantitatively comparing it to R-SCT.

Overall, our main contributions are the following:

A new method of progressive sketching applicable to both tensor fields and permutation invariant tensors, to further optimize the performance of randomized sketching by learning an optimal Thompson sampling policy for tensor sketching, using Dirichlet distribution;
Introduce formulation of progressive sketching based on a generalization of randomized SketchyCoreSVD, namely Randomized SketchyCoreTucker (R-SCT) and verify through experiments that progressive sketching are able to further optimize R-SCT, producing accurate low rank approximations, with a much smaller number of data fibers than R-SCT and other one-shot algorithms.

This creation of compact latent space decompositive tensor embeddings in generative modeling is universal to all high-dimensional data learning.

2 Background

2.1 Tucker decomposition and higher order SVD (HOSVD)

Tucker decomposition has been considered as a multilinear generalization of SVD, and HOSVD is a constrained Tucker decomposition that ensures the orthogonality of factor matrices and all-orthogonality of the core tensor De Lathauwer et al. (2000). For the decomposition, a way to unfold tensor to matrix is needed to be defined. For order 3 tensor $A \in R^{n_{1} \times n_{2} \times n_{3}}$ , "matrix unfolding" $A_{(1)} \in R^{n_{1} \times n_{2} n_{3}}$ contains the element $a_{i_{1}, i_{2}, i_{3}}$ at the position with row number $i_{1}$ and column number equal to $i_{2} \times n_{3} + n_{2}$ . Other two matrix unfoldings $A_{(2)}$ and $A_{(3)}$ can be defined in the same manner. Then, we define “ $k$ -mode vectors,” defined as the $n_{k}$ -dimensional vectors obtained from $A$ by varying the index $i_{k}$ and keeping the other indices fixed. With this definition, the 1-mode product of a tensor $A \in R^{n_{1} \times n_{2} \times n_{3}}$ by a matrix $U \in R^{m_{1} \times n_{1}}$ , denoted by $A \times_{1} U$ is an $(m_{1} \times n_{2} \times n_{3})$ -tensor of which the entries are given by $\sum_{i_{1}} a_{i_{1}, i_{2}, i_{3}} u_{j_{1}, i_{1}}$ . Other two mode product $A \times_{2} U$ and $A \times_{2} U$ can be also defined accordingly. Finally, order 3 SVD can be stated as following Theorem 2.1.

Theorem 2.1 (see De Lathauwer et al. (2000)).

Every order 3 tensor $A$ can be written as the product

A = S \times_{1} U^{(1)} \times_{2} U^{(2)} \times_{3} U^{(3)},

in which $U^{(k)}$ is a unitary $(n_{k} \times n_{k})$ -matrix; $S$ is a $(n_{1} \times n_{2} \times n_{3})$ -tensor of which subtensors $S_{i_{k} = α}$ , obtained by fixing the kth index to $α$ , have the property of all-orthogonality and ordering for all possible values of k.

The algorithm for calculating core tensor $S$ and scaling matrices $U^{(k)}$ is given in the algorithm 1.

Input:

A \in R^{n_{1} \times n_{2} \times n_{3}}

Output:

S, U^{(1)}, U^{(2)}, U^{(3)}

for

k = 1

3

U^{(k)} \leftarrow r_{k}

leading left singular vectors of

A_{(k)}

end for

Set

S = A \times_{1} U^{(1) T} \times_{2} U^{(2) T} \times_{3} U^{(3) T}

Algorithm 1 HOSVD

2.2 Thompson sampling (TS)

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance Russo et al. (2018); Agrawal and Goyal (2012); Chapelle and Li (2011); Thompson (1933).

Suppose the agent has imperfect knowledge about the reward of each action of available action set $X$ . Since the stochastic environment is assumed, the reward of each action is a random variable following a certain distribution denoted by $q_{θ}$ . Then, the imperfect knowledge about the reward of each action can be characterized by uncertain parameters, $θ$ . The distribution for the parameters, $p$ , is introduced with the hyperparameters. To effectively balance exploitation and exploration, TS samples the parameters $θ$ from $p$ . Then, the expected reward of each action can be predicted by sampled $θ$ . The agent applies action selecting the one that has the highest expected reward. After applying action $x_{t}$ , the agent observes an outcome $y_{t}$ and corresponding reward $r (y_{t})$ from the stochastic environment. Using the observed $y_{t}, r (y_{t})$ , TS updates the hyperparameters using the Bayesian inference. By utilizing conjugated $p$ and $q_{θ}$ , the update can be done with a closed form solution.

2.3 Uniqueness of SketchyCoreTucker and P-SCT

Ahmadi-Asl et al. (2021) categorized randomized algorithms for the Tucker decomposition and HOSVD into four groups: Random projection Che and Wei (2019); Minster et al. (2020); Zhou et al. (2014); Sun et al. (2021); Wolf (2019); Batselier et al. (2018); Che et al. (2021), Sampling Caiafa and Cichocki (2010); Sun et al. (2009); Oseledets et al. (2008); Saibaba (2016); Tsourakakis (2010); Song et al. (2019); Perros et al. (2015); Traoré et al. (2019), Random least-squares Malik and Becker (2018), and Count sketch Wang et al. (2015); Gittens et al. (2020); Malik and Becker (2018, 2020). Numerical experiments showed that sampling group outperform random projection group in computation time with approximately the same results Ahmadi-Asl et al. (2021). The random projection based algorithms are not suitable for huge tensors stored outside-of-cores since they need to pass the original data tensor multiple times. Randomized pass-efficient algorithms have been introduced to overcome this issue by minimizing the number of passes Halko et al. (2011); Woodruff and others (2014); Boutsidis et al. (2016); Upadhyay (2018), but none utilize the sampling approach for the solution. SketchyCoreTucker is a unique algorithm that combines random projection and sampling to utilize the advantage of both approaches and overcome limitations rooted in using them separately.

Many sampling-based algorithms compute each factor matrix by sampling the "columns" of the corresponding unfolding matrix or equivalently sampling the fibers of the original data tensor. However, our P-SCT samples the "rows" of the unfolding matrix or equivalently sampling the tensor slices. By doing so, P-SCT assures all sampled slices are intersected with each other and progressively updates the approximation of the core from the intersection of all sampled slices.

The main differentiation of our P-SCT is sample efficiency earned by active learning the informativeness distribution utilizing the idea of Thompson sampling. Many sampling-based algorithms use different sampling protocols to minimize the number of samples and computation time, but none use active learning for progressiveness. Sampling based on leverage scores are proposed in Saibaba (2016); Perros et al. (2015); Cheng et al. (2016). These algorithms are not sample efficient due to prior scanning entire original data tensor to compute the leverage scores. CUR decomposition can be considered a sampling technique with a difference that the sampling procedure is performed heuristically instead of randomly Caiafa and Cichocki (2010); Oseledets (2011); Saibaba (2016). It is known that CUR approximations are less accurate than the SVD, and the quality of decomposition quite depends on selection of fibers. The optimal selection has been considered an NP-hard problem, and thus heuristic algorithms have been used for computing suboptimal solutions. Our progressive sketching approach can be applied to CUR decomposition to overcome this issue too. Recently, Song et al. (2019) extended matrix CUR decomposition to tensor CURT decomposition and proposed an randomized algorithm to compute a low-rank CURT approximation to a tensor Song et al. (2019). But this method does not use reinforcement learning to acquire sample efficiency.

Random least-squares and count-sketch groups are a randomized algorithms for ALS-based computation of the best mulitilinear rank approximation such as Higher-Order Orthogonal Iteration (HOOI) De Lathauwer et al. (2000). A randomized least-squares HOOI algorithm is proposed in Malik and Becker (2018), but numerical experiment in Ahmadi-Asl et al. (2021) showed that sampling group have better computation efficiency. Count-sketch technique has been implemented in several algorithms, mostly for CANDECOMP/PARAFAC (CP) decomposition Wang et al. (2015); Gittens et al. (2020); Malik and Becker (2020).

There are other notable researches that share the same objective of approximating tensor decomposition with the highest accuracy and minimal computation. Oh et al. (2018) proposed P-TUCKER, a scalable Tucker decomposition that resolves the computational and memory bottleneck of ALS by row-wise updating factor matrices with parallelization Oh et al. (2018). FREDE is a graph embedding based on matrix sketching that applies Frequent Directions — a deterministic matrix sketching in the row-updates model — on a matrix-factorization interpretation of a neural embedding, VERSE Tsitsulin et al. (2021). Others introduced a randomized CP decomposition algorithms Wang et al. (2015); Gittens et al. (2020); Malik and Becker (2020). The CP decomposition factorizes a tensor into a sum of component rank-one tensors Kolda and Bader (2009), and it is a special case of Tucker decomposition with hyper-diagonal core tensor. Another type of tensor decomposition is Tensor-Train (TT) decomposition which requires iterative matrix SVD computation, and thus SketchyCoreSVD or any randomized matrix SVD algorithm can be applied. Tucker, CP and TT decompositions produces different factors of which best for different applications. This study focuses on Tucker decomposition, and still SketchyCoreTucker and P-SCT have unique ability to learn/provide informativeness distributions of each mode to the following computations, generally the main operation of interest.

3 Progressive Sketching

3.1 Progressive SketchyCoreTucker (P-SCT)

P-SCT is for any order $K$ tensor field, so let us assume order 3 tensor $A \in R^{N_{1} \times N_{2} \times N_{3}}$ for explanation purposes, which is huge in size, and its spectrum is decaying fast enough for low rank approximation. Its matricizations along the three dimensions are denoted by $A_{(1)} \in R^{N_{1} \times (N_{2} N_{3})}$ , $A_{(2)} \in R^{N_{2} \times (N_{3} N_{1})}$ , $A_{(3)} \in R^{N_{3} \times (N_{1} N_{2})}$ . Let us denote row space of $A_{(k)}$ as $R o w_{k} = {A_{(k)}^{(1, :)}, \dots, A_{(k)}^{(N_{k}, :)}}$ . Then, $R o w_{k}$ where $k = 1, 2, 3$ are bandit arms of TS. As an action, we choose a $R o w_{k}$ and sample from it without replacement. Note that a row vector of a matrix unfolding is a slice matrix in original geometric dimension of order 3 tensor. Thus, P-SCT samples order $K - 1$ slice tensors to sketch an order $K$ tensor field.

The amount of information contained in the samples ( $\equiv$ order $K - 1$ slice tensors) is the reward of the action. For Tucker decomposition, the amount of information can be measured as accumulated variation in data fibers along with $K - 1$ dimensions. For instance, let us assume there is a sample $A_{(1)}^{(δ_{1}, :)}$ which is equivalent to the slice matrix $A^{(δ_{1}, :, :)}$ . Both rows and columns (i.e., data fibers) of $A^{(δ_{1}, :, :)}$ are used in Tucker decomposition. We define the Sum of Absolute Differences ( $S A D$ ) to capture the accumulated variation in these data fibers. $S A D$ of $A_{(1)}^{(δ_{1}, :)} \equiv A^{(δ_{1}, :, :)}$ is

(1)

where the absolute differences of data in data fibers of $A^{(δ_{1}, :, :)}$ are accumulated and normalized by the size of slice matrix. The normalization is required because what we want to measure is the amount of information per usage of the memory for the sample efficient Tucker decomposition. $S A D$ of samples from other $R o w_{k}$ can be calculated in the same manner.

Dirichlet distribution is used to balance exploitation and exploration of $R o w_{k}$ . We update the concentration parameters of Dirichlet distribution by the entropy of normalized $S A D$ of samples from $R o w_{k}$ . By doing so, P-SCT samples more from $R o w_{k}$ of which information is uniformly distributed rather than concentrated in narrow region. It effectively reduces the mutual information between samples.

At first, we assume concentration parameters of the Dirichlet distribution as unity to make the distribution equivalent to the uniform distribution $α = {α_{1}, α_{2}, α_{3}} = {1, 1, 1}$ . Then, sampling ratios $p = {p_{1}, p_{2}, p_{3}}$ are drawn from the Dirich $(α)$ . By multipling the batch-size per round $N_{b a t c h}$ to $p$ , we get the number of samples $n = {n_{1}, n_{2}, n_{3}}$ that we newly sample from each $R o w_{k}$ .

We utilize $S A D$ once again here for the sampling. Rather than randomly sample, we assign weights by $S A D$ . As a result, P-SCT greedily search samples that contain highest amount of information. Combined two search schemes — 1) TS with Dirichlet distribution that balances $R o w_{k}$ selection minimizing the mutual information between samples and 2) $S A D$ -weighted sampling that maximizes the use of information of samples — is the core of P-SCT algorithm that learns optimal subsampling policy for streaming tensor sketching.

At the first round, we know nothing about the input tensor. Thus, weights are set as unity that can be denoted by $W = {w_{1}, w_{2}, w_{3}}, w_{k} = {1, . . ., 1}, n (w_{k}) = N_{k} .$ where $w_{k}$ is weights for $R o w_{k}$ and $n (w_{k})$ denotes the length of weights. Let us define sampled indices $Δ = {Δ_{1}, Δ_{2}, Δ_{3}}$ and unsampled indices $Ω = {Ω_{1}, Ω_{2}, Ω_{3}}$ and an allowed total number of samples $N_{a l l o w}$ . Then, the main steps of P-SCT are the following. In the algorithm, $⌊ \cdot ⌋$ represents floor function.

Sample from $R o w_{k}$
- Draw sampling ratios $p = {p_{1}, p_{2}, p_{3}}$ from Dirich $(α)$ .
- For $R o w_{k}$ , sample $n_{k} = ⌊ p_{k} N_{b a t c h} ⌋$ indices from unsampled indices $Ω_{k}$ following weights $w_{k}$ . Denote the indices of the sampled rows to be $Δ_{k}$ .
Compute $S A D$ and update TS parameters
- Compute $S A D$ of $A_{(k)}^{(Δ_{k}, :)}$ . Denote $S A D$ of $A_{(k)}^{(δ_{k}, :)}, δ_{k} \in Δ_{k}$ as $S A D (δ_{k})$ .
- Compute $S A D$ entropy $H_{k} = - \sum_{δ_{k} \in Δ_{k}} \frac{S A D (δ_{k})}{\sum_{δ_{k}} S A D (δ_{k})} log (\frac{S A D (δ_{k})}{\sum_{δ_{k}} S A D (δ_{k})}) .$
- Update the concentration parameter $α_{k} \leftarrow H_{k}$ .
- Update $Ω_{k} \leftarrow Ω_{k} ∖ Δ_{k}$ .
- Linearly interpolate $S A D$ of unsampled rows $S A D (Ω_{k})$ based on the index differences using $S A D (Δ_{k})$ .
- Set $S A D (Δ_{k}) = 0$ to avoid sampling duplicates, and update $w_{k}$ by normalized $S A D$ .
- Repeat Steps 1 and 2 until $n (Δ_{1}) + n (Δ_{2}) + n (Δ_{3}) > N_{a l l o w}$ .
Final SketchyCoreTucker
- Compute rank- $r$ approximation using SketchyCoreTucker (see Appendix).
- Instead of uniformly sampling rows of $A_{(1)}$ , $A_{(2)}$ , $A_{(3)}$ , use $Δ_{1}$ , $Δ_{2}$ , $Δ_{3}$ from Step 2.
- For the columns of $A_{(1)}$ , $A_{(2)}$ , $A_{(3)}$ , use corresponding fibers crossing the intersection $A^{(Δ_{1}, Δ_{2}, Δ_{3})}$ .
- Dimension reduction parameters $k$ and $s$ can be decided as
  
  $k = ⌊ r + \frac{1}{3} (n - r) ⌋, s = ⌊ r + \frac{2}{3} (n - r) ⌋,$
  
  where $n = {n (Δ_{1}), n (Δ_{2}), n (Δ_{3})}$ .

3.2 Permutation Agnostic Progressive SketchyCoreTucker (P-SCT-Permute)

Permutation agnostic P-SCT is achieved by considering the entropy of sample variance of latent embeddings based on J-L projected and $Q R$ orthogonalized fibers of each tensor mode matrix unfolding. The entropy is used for the concentration/balancing parameter of Thompson sampling with a Dirichlet prior. P-SCT-Permute also substitutes $S A D$ -based adoptive sampling with uniform sampling to avoid spatial interpolation. Steps 1 and 3 of the P-SCT algorithm are the same, and only Step 2 is different in P-SCT-Permute.

Sample from $R o w_{k}$
Compute sample variance and update TS parameters
- Denote $Θ_{1}$ , $Θ_{2}$ , and $Θ_{3}$ as the corresponding columns of $A_{(1)}$ , $A_{(2)}$ , $A_{(3)}$ that crossing the intersection $A^{(Δ_{1}, Δ_{2}, Δ_{3})}$ .
- Compute latent embeddings by the map $Ω_{1} \in R^{k_{1} \times n (Θ_{1})}$ applied to $A_{(1)}^{(:, Θ_{1})}$ from the right and QR decomposition, $Q_{1} R_{1} = Y_{1} = A_{(1)}^{(:, Θ_{1})} Ω_{1}^{*}$ , and $Q_{2}$ and $Q_{3}$ accordingly.
- Compute entropy of sample variance of $Q_{k}$
  
  $H_{k} = {Entropy}_{j} ⎛ ⎜ ⎝ \frac{1}{N_{k} - 1} N_{k} \sum i = 1 {(Q_{k}^{(i, j)} - \frac{1}{N_{k}} N_{k} \sum i = 1 Q_{k}^{(i, j)})}^{2} ⎞ ⎟ ⎠$
- Update the concentration parameter $α_{k} \leftarrow H_{k}$ .
- Update $Ω_{k} \leftarrow Ω_{k} ∖ Δ_{k}$ .
- Repeat Steps 1 and 2 until $n (Δ_{1}) + n (Δ_{2}) + n (Δ_{3}) > N_{a l l o w}$ .
Final SketchyCoreTucker

3.3 Complexity analysis

Table 1 shows the complexity of R-SCT, P-SCT, P-SCT-Permute and one-shot algorithms (RP-HOSVD,R-STHOSVD,R-PET,R-ST, and R-HOID Ahmadi-Asl et al. (2021)) and an ALS-based algorithm (RP-HOOI Ibid.) for the order 3 tensor case. For simple comparison, we assume that $N_{1} = N_{2} = N_{3} = N$ , $r_{1} = r_{2} = r_{3} = r$ , $m_{1} = m_{2} = m_{3} = m$ , $s_{1} = s_{2} = s_{3} = s$ , and $k_{1} = k_{2} = k_{3} = k$ . Also, it is assumed that the same $b = N_{b a t c h} / 3$ rows are sampled from each $R o w_{k}$ every iteration for P-SCT and P-SCT-Permute. Thus, a product of total number of iteration $I$ and $b$ is $m$ . Also, note that $N > m > r > s > k$ .

Algorithm	Projection/Sampling	QR	Core	Total
R-SCT	$O (N m k + m^{3} s)$	$O (N k^{2})$	$O (m s k + r^{3} k)$	$O (N m k + m^{3} s)$
P-SCT	$O (N^{2} m + m^{3} s)$	$O (N k^{2})$	$O (m s k + r^{3} k)$	$O (N^{2} m + m^{3} s)$
P-SCT-Permute	$O (I N m k + m^{3} s)$	$O (N k^{2})$	$O (m s k + r^{3} k)$	$O (I N m k + m^{3} s)$
RP-HOSVD	$O (N^{3} r)$	$O (N r^{2})$	$O (N^{3} r)$	$O (N^{3} r)$
R-STHOSVD	$O (N^{4})$		$O (N^{3} r)$	$O (N^{4})$
R-PET	$O (N^{3} s)$	$O (N^{2} k)$	$O (N s k + s^{3} k)$	$O (N^{3} s)$
R-ST	$O (r)$		$O (N^{3} r)$	$O (N^{3} r)$
R-HOID	$O (N^{3} k)$		$O (N^{3} r)$	$O (N^{3} r)$
RP-HOOI	$O (I N^{3} r)$	$O (I N r^{2})$	$O (I N^{3} r)$	$O (I N^{3} r)$

Table 1: The computational complexity for decomposing order 3 data tensor. The computational benefits of SCT depend on how much

m

is smaller than

N

. If required

m

for target accuracy is close to

N

, SCT has similar computational complexity to other algorithms. The main performance of proposed algorithms is reducing

m

for target accuracy to sustain computational complexity of decomposition as 1 or 2 power of

N

4 Experiments

The proper multilinear ranks, $r$ , for each dataset are identified from the scree plots for the modes calculated by HOSVD (see Algorithm 1). There are no strong guarantees on the performance of HOSVD; however, it is widely believed to produce an approximation usually quite close to the best multilinear rank approximation. At rank $r$ for mode $k$ , scree plot can be computed by $s c r e e_{k} (r) = \frac{1}{∥ A_{(k)} ∥_{F}^{2}} \sum_{i = r + 1} σ_{i}^{2} (A_{(k)})$ . Scree plots are shown in Appendix.

For a rank $r = (r_{1}, r_{2}, r_{3})$ approximation $[[A]]_{r}$ to $A$ , we measure its approximation error as

e r r = \frac{∥ [[A]]_{r} - A ∥_{F}^{2}}{∥ A ∥_{F}^{2}} .

To the best of our knowledge, P-SCT and P-SCT-Permute are the first algorithms that introduce active learning in the streaming Tucker decomposition of data tensors. Thus, the performance comparison has done two-fold. First, we compare the low-rank approximation accuracy with several one-shot algorithms with the same target rank (RP-HOSVD,RP-HOOI,R-STHOSVD,R-PET,R-ST, and R-HOID Ahmadi-Asl et al. (2021)) and an ALS-based algorithm (RP-HOOI Ibid.). Other randomized tensor decompositions (e.g., TT and CP) are not compared since their factors all have different applications. We represent distribution and mean of errors and computation times evaluated over 100 trials. The computation time reported does not include computing matrix unfoldings and the final approximation $[[A]]_{r}$ since we can avoid unfolding using index-wise access to the data tensor and, in practice, normally use the method only for calculating decompositions.

Second, we compare the sample efficiency of our progressive sketching with its variant using uniform sampling at each update. To fairly compare the performance of P-SCT and P-SCT-Permute by using the same size of partial data tensor and undecided parameters, a parameter random generation subroutine is added to SketchyCoreTucker. We will call this R-SCT. Sampling numbers of rows for R-SCT are randomly generated, and other parameters are decided in the same way described in P-SCT algorithm Step 3. We compare how low rank approximation error decreases as the number of samples increases. We represent this in the learning curves evaluated 30 trials. We also provide visual comparisons on the few snapshots of data tensors in each case. The experiments are executed from Python on a 64-bit Microsoft machine with 6 Intel i7-9750H CPUs at 2.60 GHz and 32 GB of RAM.

4.1 Cardiac Magnetic Resonance Imaging

A collection of time-varying, 2D slice stacks of Cardiac MRI is tested. $A \in R^{256 \times 176 \times 160}$ consists of 160 time snapshots 2D images, each of size 256 $\times$ 176. Based on the scree plots shown in Appendix, we choose $r = (20, 20, 5)$ . We also choose $N_{a l l o w} = 300$ . Table 2 and Figure 1 show that P-SCT and P-SCT-Permute produce good performance as other one-shot algorithms, although they use less than half of the data tensor. Figures 2 and 3 compares progressive performance to show the sample efficiency of P-SCT and P-SCT-Permute. P-SCT-Permute’s learning curve is slightly decaying faster, but its computation is slower than P-SCT.

	RP-HOSVD	RP-STHOSVD	R-PET	R-ST	R-HOID	RP-HOOI	R-SCT	P-SCT	P-SCT-Permute
err	0.04	0.0081	2.4	0.089	0.026	0.022	0.866	0.025	0.022
time	0.096	0.42	0.23	0.043	0.216	0.7	0.23	0.55	2.06

Table 2: Performance comparison for Cardiac MRI dataset. Displayed values are averaged over 100 trials. One-shot algorithms (RP-HOSVD, RP-STHOSVD, R-PET, R-ST,R-HOID) use all entries of the data tensor, but P-SCT uses only less than half of the data tensor. Nevertheless, P-SCT produces good approximation results with acceptable computation time (sec). An ALS-based algorithm, RP-HOOI, also uses the original data tensor. Its accuracy can be adjusted by setting smaller error tolerance, but it substantially increases iteration and computational complexity. For this Cardiac MRI dataset, RP-STHOSVD show slightly better averaged performance. However, RP-STHOSVD is easily prohibitive as the size of the data tensor grows since its complexity is

O (N^{4})

. Whereas, P-SCT is more scalable with a small sample size since its complexity is

O (N^{2} m + m^{3} s)

where

N > m > s

\includegraphics

[width=]figures/cardiac_error_ArXiv.png

Figure 1: Performance comparison for Cardiac MRI dataset. Violin plots of error and computation time (sec) from 100 trials. (see Table 2 for more information).

\includegraphics[width=]figures/supp_cardiac_lc1.png

Figure 2: Learning curve comparison. P-SCT and P-SCT-Permute converge faster than R-SCT. P-SCT and P-SCT-Permute show sample efficient behavior.

\includegraphics[width=0.9]figures/cardiac_Ar_permutation.png

Figure 3: Progressive visual comparison of the first snapshot of datasets. Low rank approximation result from R-SCT, P-SCT, P-SCT-Permute using the number of samples indicated in each subtitle. P-SCT reveals the Cardiac data structure fastest.

4.2 NARR air temperature

NARR air temperature dataset is spatiotemporal series covering the North American continent Mesinger et al. (2006). Its grid resolution is 349 $\times$ 277, which is approximately 0.3 degrees (32km) resolution at the lowest latitude. It provides 3 hourly series from 1979/01/01 to the present. Air temperature is also collected from 29 pressure levels (i.e., 29 different altitudes). Therefore, it can be represented as a order 5 data tensor. Due to the limitation on the computational resource, only one altitude at 500 hpa pressure level is selected. Also, 1 month-long series on 2020 October is used for the experiment. Spatial coverage is also clipped to discard null values around boundaries. As a result, $A \in R^{(248, 252, 336)}$ . Based on scree plots shown in Appendix, we choose $r = (50, 10, 20)$ . $N_{a l l o w}$ is assumed to be 300. Figures 5 and 6 compares progressive performance showing the sample efficiency of P-SCT and P-SCT-Permute. Interestingly, P-SCT-Permute performs better than P-SCT for the air temperature dataset. We can also see that the spatiotemporal structure of air temperature has been revealed by PS-SCT-Permute fastest.

	RP-HOSVD	RP-STHOSVD	R-PET	R-ST	R-HOID	RP-HOOI	R-SCT	P-SCT	P-SCT-Permute
err	3.3e-5	9e-5	3.9	4.1e-4	2.1e-5	2.3e-5	2.2e-1	5e-5	3.6e-5
time	0.57	1.8	1.1	0.14	1.2	1.4	0.32	0.7	3.2

Table 3: Performance comparison for NARR air temperature dataset. Displayed values are averaged over 100 trials. Although P-SCT uses less than half of the data tensor, it produces good approximation results with acceptable computation time (sec) as other one-shot algorithms.

\includegraphics

[width=]figures/air_error_ArXiv.png

Figure 4: Performance comparison for NARR air temperature dataset. Violin plots of error and computation time (sec) from 100 trials. (see Table 3 for more information).

\includegraphics[width=]figures/supp_AirTemp_lc1.png

Figure 5: Learning curve comparison. P-SCT and P-SCT-Permute converge faster than R-SCT. P-SCT and P-SCT-Permute shows sample efficient behavior.

\includegraphics[width=]figures/air_Ar_permutation.png

Figure 6: Progressive visual comparison of the first snapshot of NARR air temperature dataset. Low rank approximation result from R-SCT, P-SCT, P-SCT-Permute using the number of samples indicated in each subtitle. P-SCT-Permute reveals the air temperature data structure fastest.

4.3 Discussion

The computation time of P-SCT and P-SCT-Permute is longer than R-SCT. This delay is inevitable since the progressive sketching approach adds more computations. The delay will be greater for higher order tensor or bigger sized data tensor. However, considering a combinatorial number of possible sampling ratio parameters, the amount of computation time we trade off for getting optimal sample efficient low rank approximation is sufficiently negligible. The optimally balanced performance of the progressive sketching approach is well addressed in Figure 1 that P-SCT maintains a good accuracy and computation time compared to fast-but-erroneous (RP-HOSVD, R-ST), and accurate-but-slow (RP-STHOSVD, R-HOID) one-shot algorithms. The computation time of PS-SCT-Permute should be improved in the future.

5 Conclusion

Our paper introduces P-SCT (Progressive SketchyCoreTucker) and P-SCT-Permute (Permutation agnostic P-SCT), a learned sub-sampling policy with tensor fiber and space efficiency, for progressively sketching of higher order tensors. P-SCT actively learns the optimal subsampling sketch streaming protocol so as to maximize the accuracy while minimizing the space requirements of the final tensor sketch. Our numerical experiments show that P-SCT and P-SCT-Permute can get better performance with the limited number of samples than a random sampling based SketchyCoreTucker (R-SCT) and other one-shot algorithms (RP-HOSVD,R-STHOSVD,R-PET,R-ST,R-HOID from Ahmadi-Asl et al. (2021)). R-SCT was derived from a generalization of a previously published SketchyCoreSVD method. While R-SCT relies on a preset sampling ratio of the tensor fibers, P-SCT does not need this preset since the algorithm actively balances the ratio of tensor fibers from each dimension to get the maximum reward. P-SCT also avoids further fine tuning that was required for SketchCoreTucker and its generalization to R-SCT. Progressive sketching is a straightforward approach that can be applied to any randomized sketching algorithm and also can be coupled with many other state-of-the-art deep reinforcement learning algorithms to optimize generation of compact latent spaces of large tensor fields and permutation invariant general tensors.

References

S. Agrawal and N. Goyal (2012) Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp. 39–1. Cited by: §2.2.
S. Ahmadi-Asl, S. Abukhovich, M. G. Asante-Mensah, A. Cichocki, A. H. Phan, T. Tanaka, and I. Oseledets (2021) Randomized algorithms for computation of Tucker decomposition and higher order SVD (HOSVD). IEEE Access 9, pp. 28684–28706. Cited by: §1, §2.3, §2.3, §3.3, §4, §5.
C. Bajaj, Y. Wang, and T. Wang (2019) SketchyCoreSVD: sketchysvd from random subsampling of the data matrix. In 2019 IEEE International Conference on Big Data (Big Data), pp. 26–35. Cited by: §1.
K. Batselier, W. Yu, L. Daniel, and N. Wong (2018) Computing low-rank approximations of large-scale matrices with the tensor network randomized SVD. SIAM Journal on Matrix Analysis and Applications 39 (3), pp. 1221–1244. Cited by: §2.3.
C. Boutsidis, D. P. Woodruff, and P. Zhong (2016) Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 236–249. Cited by: §1, §2.3.
C. F. Caiafa and A. Cichocki (2010) Generalizing the column–row matrix decomposition to multi-way arrays. Linear Algebra and its Applications 433 (3), pp. 557–573. Cited by: §2.3, §2.3.
O. Chapelle and L. Li (2011) An empirical evaluation of Thompson sampling. Advances in neural information processing systems 24, pp. 2249–2257. Cited by: §2.2.
M. Che, Y. Wei, and H. Yan (2021) Randomized algorithms for the low multilinear rank approximations of tensors. Journal of Computational and Applied Mathematics 390, pp. 113380. Cited by: §2.3.
M. Che and Y. Wei (2019) Randomized algorithms for the approximations of Tucker and the tensor train decompositions. Advances in Computational Mathematics 45 (1), pp. 395–428. Cited by: §2.3.
D. Cheng, R. Peng, Y. Liu, and I. Perros (2016) SPALS: fast alternating least squares via implicit leverage scores sampling. Advances in neural information processing systems 29, pp. 721–729. Cited by: §1, §2.3.
P. Christodoulou (2019) Soft actor-critic for discrete action settings. arXiv preprint arXiv:1910.07207. Cited by: §1.
M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu (2015) Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 163–172. Cited by: §1.
L. De Lathauwer, B. De Moor, and J. Vandewalle (2000) A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications 21 (4), pp. 1253–1278. Cited by: §2.1, §2.3, Theorem 2.1.
P. Drineas and M. W. Mahoney (2007) A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear algebra and its applications 420 (2-3), pp. 553–571. Cited by: §1.
A. Gittens, K. Aggour, and B. Yener (2020) Adaptive sketching for fast and convergent canonical polyadic decomposition. In International Conference on Machine Learning, pp. 3566–3575. Cited by: §2.3, §2.3, §2.3.
J. Grau-Moya, F. Leibfried, and P. Vrancx (2019) Soft q-learning with mutual-information regularization. In International Conference on Learning Representations, External Links: Link Cited by: §1.
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 1861–1870. Cited by: §1.
T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine (2018) Soft actor-critic algorithms and applications. CoRR abs/1812.05905. External Links: Link, 1812.05905 Cited by: §1.
N. Halko, P. Martinsson, and J. A. Tropp (2011) Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53 (2), pp. 217–288. Cited by: §1, §2.3.
T. G. Kolda and B. W. Bader (2009) Tensor decompositions and applications. SIAM review 51 (3), pp. 455–500. Cited by: §2.3.
O. A. Malik and S. Becker (2018) Low-rank Tucker decomposition of large tensors using TensorSketch. In Advances in Neural Information Processing Systems, pp. 10096–10106. Cited by: §2.3, §2.3.
O. A. Malik and S. Becker (2020) Fast randomized matrix and tensor interpolative decomposition using countsketch. Advances in Computational Mathematics 46 (6), pp. 1–28. Cited by: §2.3, §2.3, §2.3.
F. Mesinger, G. DiMego, E. Kalnay, K. Mitchell, P. C. Shafran, W. Ebisuzaki, D. Jović, J. Woollen, E. Rogers, E. H. Berbery, et al. (2006) North american regional reanalysis. Bulletin of the American Meteorological Society 87 (3), pp. 343–360. Cited by: §1, §4.2.
R. Minster, A. K. Saibaba, and M. E. Kilmer (2020) Randomized algorithms for low-rank tensor decompositions in the Tucker format. SIAM Journal on Mathematics of Data Science 2 (1), pp. 189–215. Cited by: §2.3.
S. Oh, N. Park, S. Lee, and U. Kang (2018) Scalable tucker factorization for sparse tensors-algorithms and discoveries. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 1120–1131. Cited by: §2.3.
I. V. Oseledets, D. Savostianov, and E. E. Tyrtyshnikov (2008) Tucker dimensionality reduction of three-dimensional arrays in linear time. SIAM Journal on Matrix Analysis and Applications 30 (3), pp. 939–956. Cited by: §2.3.
I. V. Oseledets (2011) Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317. Cited by: §2.3.
I. Perros, R. Chen, R. Vuduc, and J. Sun (2015) Sparse hierarchical Tucker factorization and its application to healthcare. In 2015 IEEE International Conference on Data Mining, pp. 943–948. Cited by: §1, §2.3, §2.3.
D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen (2018) A Tutorial on Thompson Sampling. Found. Trends Mach. Learn. 11 (1), pp. 1–96. External Links: ISSN 1935-8237 Cited by: §2.2.
A. K. Saibaba (2016) HOID: higher order interpolatory decomposition for tensors based on Tucker representation. SIAM Journal on Matrix Analysis and Applications 37 (3), pp. 1223–1249. Cited by: §1, §2.3, §2.3.
Z. Song, D. P. Woodruff, and P. Zhong (2019) Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2772–2789. Cited by: §2.3, §2.3.
J. Sun, S. Papadimitriou, C. Lin, N. Cao, S. Liu, and W. Qian (2009) Multivis: content-based social network exploration through multi-way visual analysis. In Proceedings of the 2009 SIAM International Conference on Data Mining, pp. 1064–1075. Cited by: §2.3.
Y. Sun, Y. Guo, C. Luo, J. Tropp, and M. Udell (2020) Low-rank Tucker approximation of a tensor from streaming data. SIAM Journal on Mathematics of Data Science 2 (4), pp. 1123–1150. Cited by: §1.
Y. Sun, Y. Guo, J. A. Tropp, and M. Udell (2021) Tensor random projection for low memory dimension reduction. arXiv preprint arXiv:2105.00105. Cited by: §2.3.
W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §2.2.
A. Traoré, M. Berar, and A. Rakotomamonjy (2019) Singleshot: a scalable Tucker tensor decomposition. Advances in Neural Information Processing Systems 32. Cited by: §2.3.
J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher (2017) Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications 38 (4), pp. 1454–1485. Cited by: §1.
J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher (2019) Streaming low-rank matrix approximation with an application to scientific simulation. SIAM Journal on Scientific Computing 41 (4), pp. A2430–A2463. Cited by: §1.
A. Tsitsulin, M. Munkhoeva, D. Mottin, P. Karras, I. Oseledets, and E. Müller (2021) FREDE: anytime graph embeddings. Proceedings of the VLDB Endowment 14 (6), pp. 1102–1110. Cited by: §2.3.
C. E. Tsourakakis (2010) Mach: fast randomized tensor decompositions. In Proceedings of the 2010 SIAM international conference on data mining, pp. 689–700. Cited by: §2.3.
J. Upadhyay (2018) The price of privacy for low-rank factorization. Advances in Neural Information Processing Systems 31. Cited by: §2.3.
Y. Wang, H. Tung, A. J. Smola, and A. Anandkumar (2015) Fast and guaranteed tensor decomposition via sketching. In Advances in Neural Information Processing Systems, pp. 991–999. Cited by: §2.3, §2.3, §2.3.
Z. Wang and M. Zhou (2020) Thompson sampling via local uncertainty. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 10115–10125. Cited by: §1.
A. S. J. W. Wolf (2019) Low rank tensor decompositions for high dimensional data approximation, recovery and prediction. Cited by: §2.3.
D. P. Woodruff et al. (2014) Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science 10 (1–2), pp. 1–157. Cited by: §1, §2.3.
F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert (2008) A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis 25 (3), pp. 335–366. Cited by: §1.
G. Zhou, A. Cichocki, and S. Xie (2014) Decomposition of big tensors with low multilinear rank. arXiv preprint arXiv:1412.1885. Cited by: §2.3.

Appendix A Randomized SketchyCoreTucker (R-SCT)

SkechySVD has been extended to compute low-rank Tucker decomposition of tensors. One can call this method SketchyTucker. Using the same idea once has for constructing approximation only from random samples, one has a Randomized SketchyCoreTucker (R-SCT) method. Without loss of generality, we assume $A \in R^{N_{1} \times N_{2} \times N_{3}}$ . Its matricizations along the three dimensions are denoted by $A_{(1)} \in R^{N_{1} \times (N_{2} N_{3})}$ , $A_{(2)} \in R^{N_{2} \times (N_{3} N_{1})}$ , and $A_{(3)} \in R^{N_{3} \times (N_{1} N_{2})}$ , respectively.

Define

r = ⎡ ⎢ ⎣ \begin{matrix} r_{1} r_{2} r_{3} \end{matrix} ⎤ ⎥ ⎦, k = ⎡ ⎢ ⎣ \begin{matrix} k_{1} k_{2} k_{3} \end{matrix} ⎤ ⎥ ⎦, s = ⎡ ⎢ ⎣ \begin{matrix} s_{1} s_{2} s_{3} \end{matrix} ⎤ ⎥ ⎦, m = ⎡ ⎢ ⎣ \begin{matrix} m_{1} m_{2} m_{3} \end{matrix} ⎤ ⎥ ⎦, n = ⎡ ⎢ ⎣ \begin{matrix} n_{1} n_{2} n_{3} \end{matrix} ⎤ ⎥ ⎦,

where $r ⪯ k ⪯ s ⪯ min {m, n}$ and “ $⪯$ ” means “ $\leq$ ” for each entry.

Define

p_{1} = \frac{m_{1}}{N_{2} N_{3}} \in (0, 1), p_{2} = \frac{m_{2}}{N_{1} N_{3}} \in (0, 1), p_{3} = \frac{m_{3}}{N_{1} N_{2}} \in (0, 1),

and

q_{1} = \frac{n_{1}}{N_{1}} \in (0, 1), q_{2} = \frac{n_{2}}{N_{2}} \in (0, 1), q_{3} = \frac{n_{3}}{N_{3}} \in (0, 1) .

The main steps of RS-SCT are the following.

Building randomized sketches.
- Uniformly sample $m_{1}$ columns of $A_{(1)}$ . Denote the indices of the sampled columns to be $Θ_{1}$ . The map $Ω_{1} \in R^{k_{1} \times m_{1}}$ is applied to $A_{(1)}^{(:, Θ_{1})}$ from the right,
  
  $Y_{1} = A_{(1)}^{(:, Θ_{1})} Ω_{1}^{*} \in R^{N_{1} \times k_{1}};$
- Uniformly sample $m_{2}$ columns of $A_{(2)}$ . Denote the indices of the sampled columns to be $Θ_{2}$ . The map $Ω_{2} \in R^{k_{2} \times m_{2}}$ is applied to $A_{(2)}^{(:, Θ_{2})}$ from the right,
  
  $Y_{2} = A_{(2)}^{(:, Θ_{2})} Ω_{2}^{*} \in R^{N_{2} \times k_{2}};$
- Uniformly sample $m_{3}$ columns of $A_{(3)}$ . Denote the indices of the sampled columns to be $Θ_{3}$ . The map $Ω_{3} \in R^{k_{3} \times m_{3}}$ is applied to $A_{(3)}^{(:, Θ_{3})}$ from the right,
  
  $Y_{3} = A_{(3)}^{(:, Θ_{3})} Ω_{3}^{*} \in R^{N_{3} \times k_{3}};$
- Uniformly sample $n_{1}$ , $n_{2}$ , and $n_{3}$ rows of $A_{(1)}$ , $A_{(2)}$ , and $A_{(3)}$ . Denote the indices of the sampled rows to be $Δ_{1}$ , $Δ_{2}$ , and $Δ_{3}$ , respectively. Apply maps $Φ_{1} \in R^{s_{1} \times n_{1}}$ , $Φ_{2} \in R^{s_{2} \times n_{2}}$ , and $Φ_{3} \in R^{s_{3} \times n_{3}}$ to the intersection $A^{(Δ_{1}, Δ_{2}, Δ_{3})}$ ,
  
  $Z = A^{(Δ_{1}, Δ_{2}, Δ_{3})} \times_{1} Φ_{1} \times_{2} Φ_{2} \times_{3} Φ_{3} \in R^{s_{1} \times s_{2} \times s_{3}};$
Computations.
- We compute the QR decomposition of $Y_{1}$ ,
  
  $Y_{1} = Q_{1} R_{1},$
  
  where $Q_{1} \in R^{N_{1} \times k_{1}}$ , and $R_{1} \in R^{k_{1} \times k_{1}}$ ;
- We compute the QR decomposition of $Y_{2}$ ,
  
  $Y_{2} = Q_{2} R_{2},$
  
  where $Q_{2} \in R^{N_{2} \times k_{2}}$ , and $R_{2} \in R^{k_{2} \times k_{2}}$ ;
- We compute the QR decomposition of $Y_{3}$ ,
  
  $Y_{3} = Q_{3} R_{3},$
  
  where $Q_{3} \in R^{N_{3} \times k_{3}}$ , and $R_{3} \in R^{k_{3} \times k_{3}}$ ;
- We compute the core approximation
  
  $C = Z \times_{1} (Φ_{1} Q_{1}^{(Δ_{1}, :)})^{†} \times_{2} (Φ_{2} Q_{2}^{(Δ_{2}, :)})^{†} \times_{3} (Φ_{3} Q_{3}^{(Δ_{3}, :)})^{†} \in R^{k_{1} \times k_{2} \times k_{3}};$
- The final near-optimal multi-linear $r$ approximation to $A$ , denoted by $[[A]]_{r}$ , is computed by
  
  $[[C]]_{r} \times_{1} Q_{1} \times_{2} Q_{2} \times_{3} Q_{3},$
  
  where $[[C]]_{r}$ is the best multi-linear rank $r$ approximation of $C$ . Empirically, $[[C]]_{r}$ can be computed by algorithm such as HOSVD or HOOI.

Appendix B Supplementary figures

b.1 Progressive SketchyCoreTucker (P-SCT)

$A \in R^{N_{1} \times N_{2} \times N_{3}}$ is the input order 3 tensor; its matricizations along the three dimensions are denoted by $A_{(1)} \in R^{N_{1} \times (N_{2} N_{3})}$ , $A_{(2)} \in R^{N_{2} \times (N_{3} N_{1})}$ , $A_{(3)} \in R^{N_{3} \times (N_{1} N_{2})}$ ; row space of $A_{(k)}$ is denoted by $R o w_{k} = {A_{(k)}^{(1, :)}, \dots, A_{(k)}^{(N_{k}, :)}}$ . We define the Sum of Absolute Differences ( $S A D$ ) to capture the accumulated variation in these data fibers. $S A D$ of $A_{(1)}^{(δ_{1}, :)} \equiv A^{(δ_{1}, :, :)}$ is

(2)

where the absolute differences of data in data fibers of $A^{(δ_{1}, :, :)}$ are accumulated and normalized by the size of slice matrix. $S A D$ of samples from other $R o w_{k}$ can be calculated in the same manner. Dirichlet distribution is used to balance exploitation and exploration of $R o w_{k}$ .

We utilize $S A D$ once again here for the sampling. Rather than randomly sample, we assign weights by $S A D$ . At the first round, we know nothing about the input tensor. Thus, weights are set as unity that can be denoted by $W = {w_{1}, w_{2}, w_{3}}, w_{k} = {1, . . ., 1}, n (w_{k}) = N_{k} .$ where $w_{k}$ is weights for $R o w_{k}$ and $n (w_{k})$ denotes the length of weights. Let us define sampled indices $Δ = {Δ_{1}, Δ_{2}, Δ_{3}}$ and unsampled indices $Ω = {Ω_{1}, Ω_{2}, Ω_{3}}$ and an allowed total number of samples $N_{a l l o w}$ . Then, the main steps of P-SCT are the following. In the algorithm, $⌊ \cdot ⌋$ represents floor function.

Sample from $R o w_{k}$
- Draw sampling ratios $p = {p_{1}, p_{2}, p_{3}}$ from Dirich $(α)$ .
- For $R o w_{k}$ , sample $n_{k} = ⌊ p_{k} N_{b a t c h} ⌋$ indices from unsampled indices $Ω_{k}$ following weights $w_{k}$ . Denote the indices of the sampled rows to be $Δ_{k}$ .
Compute $S A D$ and update TS parameters
- Compute $S A D$ of $A_{(k)}^{(Δ_{k}, :)}$ . Denote $S A D$ of $A_{(k)}^{(δ_{k}, :)}, δ_{k} \in Δ_{k}$ as $S A D (δ_{k})$ .
- Compute $S A D$ entropy $H_{k} = - \sum_{δ_{k} \in Δ_{k}} \frac{S A D (δ_{k})}{\sum_{δ_{k}} S A D (δ_{k})} log (\frac{S A D (δ_{k})}{\sum_{δ_{k}} S A D (δ_{k})}) .$
- Update the concentration parameter $α_{k} \leftarrow H_{k}$ .
- Update $Ω_{k} \leftarrow Ω_{k} ∖ Δ_{k}$ .
- Linearly interpolate $S A D$ of unsampled rows $S A D (Ω_{k})$ based on the index differences using $S A D (Δ_{k})$ .
- Set $S A D (Δ_{k}) = 0$ to avoid sampling duplicates, and update $w_{k}$ by normalized $S A D$ .
- Repeat Steps 1 and 2 until $n (Δ_{1}) + n (Δ_{2}) + n (Δ_{3}) > N_{a l l o w}$ .
Final SketchyCoreTucker
- Compute rank- $r$ approximation using SketchyCoreTucker (see Appendix).
- Instead of uniformly sampling rows of $A_{(1)}$ , $A_{(2)}$ , $A_{(3)}$ , use $Δ_{1}$ , $Δ_{2}$ , $Δ_{3}$ from Step 2.
- For the columns of $A_{(1)}$ , $A_{(2)}$ , $A_{(3)}$ , use corresponding fibers crossing the intersection $A^{(Δ_{1}, Δ_{2}, Δ_{3})}$ .
- Dimension reduction parameters $k$ and $s$ can be decided as
  
  $k = ⌊ r + \frac{1}{3} (n - r) ⌋, s = ⌊ r + \frac{2}{3} (n - r) ⌋,$
  
  where $n = {n (Δ_{1}), n (Δ_{2}), n (Δ_{3})}$ .

Figure S7 illustrates P-SCT algorithm.

\includegraphics[width=]figures/flowchart.png

Figure S7: Flowchart of P-SCT algorithm.

b.2 Scree plots

The proper multilinear ranks, $r$ , for each dataset are identified from the scree plots for the modes calculated by HOSVD. At rank $r$ for mode $k$ , scree plot can be computed by

s c r e e_{k} (r) = \frac{1}{∥ A_{(k)} ∥_{F}^{2}} \sum i = r + 1 σ_{i}^{2} (A_{(k)}) .

(3)

\includegraphics[width=]figures/cardiac_screes.png

Figure S8: Scree plots for the modes of Cardiac MRI dataset.

r = (20, 20, 5)

has been selected for the low rank approximation.

\includegraphics[width=]figures/air_screes.png

Figure S9: Scree plots for the modes of NARR air temperature dataset.

r = (50, 10, 20)

has been selected for the low rank approximation.