Index Tracking via Learning to Predict Market Sensitivities

Yoonsik Hong (ys.hong@miraeasset.com), Yanghoon Kim (yanghoon.kim@miraeasset.com), Jeonghun Kim (jeonghun_kim@miraeasset.com), and Yongmin Choi (yongmin.choi@miraeasset.com); Mirae Asset Global Investments, Seoul, Republic of Korea

2022

Abstract.

Index funds now account for a significant share of equity funds, and market sensitivities are instrumental in managing them. Index funds could replicate the index fully, but doing so is cost-ineffective and impractical. Moreover, to use market sensitivities for partial replication, they must be predicted or estimated accurately. Accordingly, we first examine deep-learning models for predicting market sensitivities. We also present pragmatic applications of data processing methods that aid training and generate target data for the prediction. Then, we propose a partial-index-tracking optimization model that constrains the net predicted market sensitivities of the portfolio and the index to be equal. We corroborate the efficacy of these processes on the Korea Stock Price Index 200. Our experiments show a significant reduction in prediction errors compared with historical estimations, and competitive tracking errors when replicating the index with fewer than half of its constituents. Therefore, we show that applying deep learning to predict market sensitivities is promising and that our portfolio construction methods are practically effective. Additionally, to our knowledge, this is the first study that addresses market sensitivities with a focus on deep learning.

market sensitivity, index tracking, deep learning, portfolio optimization

CCS Concepts: Computing methodologies → Machine learning approaches

1. Introduction

The assets under management (AUM) of index funds have increased steadily, accounting for a significant share (more than 30%) of the AUM of U.S. equity funds in recent years (heath2022index). Index funds are a type of mutual fund whose returns closely resemble those of a predefined market index; the action of tracking such an index is referred to as index tracking (oh2005using). Index-tracking strategies are instrumental to the management of index funds.

Depending on the methodology, index tracking can be categorized into full and partial replication (kim2020index). Full replication builds a portfolio containing every constituent of the market index, each weighted by market capitalization. Partial replication aims to deliver the market index’s performance with a portfolio that contains only a subset of the index constituents. Theoretically, full replication creates a complete portfolio with zero tracking error. However, many hurdles prevent perfect replication, such as transaction costs, the low trading volume of small-cap stocks, wide bid-ask spreads for less liquid assets, and investment constraints. The fewer the constituents in the tracking portfolio, the harder it is for portfolio managers to track the index. Nevertheless, partial replication methods have a higher potential to reduce transaction costs because the portfolio holds comparatively fewer “smaller and less liquid stocks.” Furthermore, partial replication methods are more flexible when building portfolios under strict constraints. Therefore, it is important to develop new approaches that replicate indexes with comparatively few stocks while keeping tracking errors low.

While developing a strategy to replicate indexes, considering portfolio constituents’ market sensitivities to the indexes is paramount (oh2005using; chang2004evaluating; keim1999analysis). A financial instrument’s market sensitivity to an index indicates how strongly it reacts when the index fluctuates (sharpe1972risk; sharpe1995risk). Market sensitivity is also called the $β$ of an instrument. Generally, if a portfolio contains many high-sensitivity financial instruments, its return tends to rise (fall) by more than the index does when the index rises (falls). The portfolio’s sensitivity to the index can, therefore, affect how closely the portfolio follows the index, making it crucial to manage this sensitivity while building the portfolio.

However, $β$ is not directly observable and must be estimated or predicted from historical data, which makes it difficult for fund managers to control the $β$ of funds. Several approaches (blume1975timevarying; ferson1993risk; pagan1980kalman; das2010market; engle2002dcc; engle2016garch) have been proposed to handle this problem, but they have limitations: some can estimate $β$ for only a limited number of stocks (siegel1995option; buss2012option), and others lack out-of-sample tests (hollstein2016beta; faff2000ts). To address these issues, we propose a deep-learning (DL) approach that predicts the sensitivities from historical price return data, and we substantiate it using the sliding window method (hota2017time), which provides pseudo out-of-sample tests. Then, we suggest a mixed-integer linear programming (MILP) model that constructs index portfolios from the predicted sensitivities to handle the problem of partial replication. The efficacy of these methods for prediction and portfolio construction is substantiated on the Korea Stock Price Index 200 (KOSPI 200).

This study contributes as follows:

To our knowledge, this is the first study to harness deep-learning models to predict market sensitivities.
We propose a method to construct partial-replication portfolios for index funds utilizing the predicted market sensitivities.
We present a cumulative distribution function (CDF) transformation and a novel method to create target data, both of which are instrumental in training the prediction models.
Using KOSPI 200, we corroborate our methods’ efficacy in prediction and portfolio construction.
Our prediction models and portfolio construction methods, respectively, reduced historical estimations’ prediction errors to about $57 %$ and the required number of stocks by half.

The remainder of this paper is structured as follows. First, we review the related works and background knowledge in sections 2 and 3, respectively. Then, section 4 elucidates our methods. Section 5 substantiates the proposed methods with experiments. Finally, section 6 concludes the paper.

2. Related Work

Since our study focuses on index tracking and prediction of market sensitivity, the literature review was conducted from this perspective.

2.1. Index Tracking via Machine Learning

Recently, machine-learning-based methods have become prominent ways to conduct partial replication for index tracking. In (oh2005using), the authors proposed using a genetic algorithm (GA) for index tracking. The algorithm first uses some fundamental variables, such as average market capitalization, average trading amount, and standard error of the portfolio beta, to select tracking portfolio constituents. Then, through the GA, the algorithm optimizes the portfolio weights of the selected stocks. A heuristic approach utilizing Hopfield neural networks (hopfield1984neurons) was presented by (fernandez2007portfolio) to solve the generalized Markowitz mean-variance optimization with cardinality and bounding constraints.

Particularly, adoption of deep neural networks has broken new ground for many applications of traditional machine-learning methods and index tracking. (kwak2021neural) introduced portfolio weights based on the output of a neural network that takes a fixed noise as input. The goal was to keep the portfolio weight unchanged until the portfolio-update period arrives. However, portfolio weight determined by a fixed noise has nothing to do with the stocks and index, and it cannot reflect the market situation.

Given time series data of index constituents, (ouyang2019index) used a deep autoencoder to reconstruct the input and selected the stocks with minimum reconstruction losses as the tracking portfolio constituents. The portfolio weights were then calculated through a feed-forward neural network given the time series data. Similar to (ouyang2019index), (kim2020index) used a deep autoencoder to reconstruct returns of index constituents. Stocks with the largest correlation coefficients or mutual information with the latent variable were selected as tracking portfolio constituents. The tracking portfolio’s weights were calculated from the correlation coefficient, for example, in proportion to the exponential of the correlation coefficient. While both of these works used the deep autoencoder to connect individual stock information to market value, the interpretation of the market information carried by the latent variable has an obscure theoretical foundation.

To our knowledge, no previous work has considered both the application and the prediction of market sensitivities in index-tracking portfolio construction, although market sensitivities are instrumental to index tracking (oh2005using). This study develops DL models to predict market sensitivities and utilizes them directly by setting the net $β$ of the portfolio to be equal to the $β$ of the index.

2.2. Estimation of Market Sensitivities

A simple approach to estimating $β$ takes the slope coefficient of a linear regression fitted to historical time series of return data, which is referred to as a historical estimation in this paper. However, much evidence has shown that $β$ has time-varying properties (blume1975timevarying; ferson1993risk), which led to many other approaches based on generalized autoregressive conditional heteroskedasticity (engle2002dcc; engle2016garch), the Kalman filter (pagan1980kalman; das2010market), etc. According to (hollstein2016beta; faff2000ts), many of these intricate models attempting to capture the time variation of $β$ perform better on training data but do not provide evidence from test data. Recently, a machine-learning-based approach (wolfgang2022ml) has been studied that estimates $β$ with three different forecast model families: linear regressions, tree-based models, and neural networks. Because that study focused on comparing machine-learning methods, no DL method other than multi-layer perceptrons (MLP) was reviewed. This study focuses on deep learning, utilizes other DL architectures, and examines their practical efficacy in terms of index tracking, using the sliding window method (hota2017time) for proper evaluation.
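As a minimal sketch of the historical estimation described above, the slope and intercept can be obtained by ordinary least squares of an instrument's returns on the index returns. The function name `historical_beta` and the sample returns are illustrative assumptions and are not data from this study.

```python
from statistics import fmean

def historical_beta(r_i, r_m):
    """OLS intercept (alpha) and slope (beta) of r_i regressed on r_m."""
    mean_i, mean_m = fmean(r_i), fmean(r_m)
    cov = sum((x - mean_m) * (y - mean_i) for x, y in zip(r_m, r_i))
    var = sum((x - mean_m) ** 2 for x in r_m)
    beta = cov / var
    alpha = mean_i - beta * mean_m
    return alpha, beta

# Hypothetical returns of the index and of one constituent.
r_m = [0.010, -0.020, 0.005, 0.015, -0.010]
r_i = [0.001 + 1.2 * x for x in r_m]  # noiseless line, so OLS recovers it

alpha, beta = historical_beta(r_i, r_m)
```

Because the example data lie exactly on a line, the estimate reproduces the assumed slope and intercept; with real returns the estimate depends on the chosen estimation period.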

Meanwhile, option-implied estimation approaches have also been widely studied (siegel1995option; buss2012option). These approaches have the major advantage of incorporating forward-looking information from options markets, but they can analyze only the limited set of stocks that have option derivatives. This limitation could be critical for the main purpose of this study, because our final goal is not to estimate $β$ itself but to create a portfolio that tracks the underlying index well using the estimated $β$ . Moreover, (hollstein2016beta) states that fully option-implied approaches (chang2011option; sr2005option; kks2014option) have a substantial drawback in that they cannot produce negative values.

3. Preliminaries

This section provides an overview of a single-factor model and introduces basics of several DL models.

3.1. Single-Factor Model

Pursuant to the single-factor model, we assume that the return of each financial instrument $r_{i}$ in an investment universe $S$ can be expressed as (1) (luenberger2009investment), where $f$ is a factor. $β_{i}, α_{i}, ϵ_{i}$ are the market sensitivity (or slope), intercept, and error of the factor $f$ for a financial instrument $i \in S$ , respectively.

(1)

r_{i} = α_{i} + β_{i} f + ϵ_{i}

In this study, we consider only the return of a given index $r_{m}$ as a factor, and assume that $β_{i}, α_{i}, ϵ_{i}$ vary as time $t$ passes, as in (2):

(2)

r_{i t} = α_{i t} + β_{i t} r_{m t} + ϵ_{i t} .

Then, the return of a portfolio $π$ whose financial instruments are each weighted by $w_{i t}$ becomes (3):

(3)

r_{π t} = \sum_{i \in S} w_{i t} r_{i t} = \sum_{i \in S} (w_{i t} α_{i t} + w_{i t} β_{i t} r_{m t} + w_{i t} ϵ_{i t}) .

3.2. Deep-Learning Models

According to (zhang2018definition), deep learning is a method for discovering the relationship between multiple variables and the knowledge behind the relationship. To extract the relationship and knowledge, several well-known DL models have been suggested, and we utilize four of them: MLP, long short-term memory (LSTM) (hochreiter1997long), gated recurrent unit (GRU) (cho2014learning), and Transformer (vaswani2017attention).

MLP is composed of several connected layers of artificial neurons that take inputs, multiply them by weights, sum them up, and output the sum after it passes through a nonlinear activation (gardner1998artificial). Because it has nonlinear activation functions, it can extract nonlinear features from the data. LSTM and GRU were suggested to handle sequential data. In LSTM, the cells utilize long-term information as well as short-term or new inputs. Similarly, GRU utilizes long-term memory but has a simpler design and therefore fewer learnable parameters (fu2016using). In place of sequence-to-sequence structures such as LSTM and GRU, Transformer utilizes a multi-head attention mechanism to extract features. This study employs the encoder layers of Transformer, but, for simplicity, we refer to them as Transformer or Trans.

4. Proposed Methods

4.1. Overview of Our Methods

In this study, our proposed methods solve two problems: (i) prediction of $α_{i t}, β_{i t} \in R$ of the single-factor models, and (ii) construction of a partially replicated portfolio $π_{t}$ . The DL models are applied to the former, and MILP is employed for the latter.

4.1.1. Prediction of Market Sensitivities

Let $t \in Z$ be a time step of trading dates. We assume that a portfolio is first constructed at $t_{0}$ and updated periodically at close prices with a given trading period $T_{A} \in N$ (in the experiment, $T_{A} = 21$ trading days, assumed to be a month). That is, on $\{t_{0}, t_{0} + T_{A}, t_{0} + 2 T_{A}, \ldots\}$ , portfolios are updated to $\{w_{t_{0}}, w_{t_{0} + T_{A}}, w_{t_{0} + 2 T_{A}}, \ldots\}$ . To do so, as shown in Figure 1, the data in the time period $[t - T_{B}, t - 1]$ , denoted by $X_{i t}$ , are fed to the DL models to predict $β_{i t}, α_{i t} \in R$ for $r_{i t}, r_{m t}$ . $T_{B}$ is the length of the history used as input. $r_{i t}, r_{m t}$ are the returns of an instrument $i$ and index $m$ , respectively, from their close prices at $t$ to $t + T_{A}$ . The objective of the prediction is to minimize the prediction error (PE) of the one-factor models, which is formulated as follows.

(4)

\text{minimize} \quad P E = E [(r_{i t} - β_{i t} r_{m t} - α_{i t})^{2}]

4.1.2. Construction of Partially Replicated Portfolio

To construct the constituent weights of a portfolio, $w_{t} = [w_{i t}] \in [0, 1]^{| S |}$ , we minimize the tracking error (TE) in Equation (5), where $\| \cdot \|$ is a norm. Although there are many definitions of TE (pope1994discovering; roll1992mean), the one used in this study is (5). Additionally, we want the portfolio to contain fewer financial instruments than the index $m$ . Let $N^{*} \in N$ be the limit on the number of financial instruments in $π_{t}$ . Also, we set $u_{i t} \in \{0, 1\}$ as an indicator variable that shows whether an instrument $i$ is included in $π_{t}$ . Then, constraint (6) restricts the number of financial instruments in $π_{t}$ (canakgoz2009mixed).

(5)		$\text{minimize}$	$T E (r_{π}, r_{m}) = E [\| r_{π t} - r_{m t} \|]$
(6)		$\text{where}$	$\sum_{i \in S} u_{i t} \leq N^{*}$

4.2. Prediction of Market Sensitivities via Deep Learning Models

To predict $α_{i t}, β_{i t}$ of the single-factor models used in the portfolio construction step, we train the DL model in a supervised manner. Specifically, we additionally predict $\hat{ρ}_{i t}$ , a residual, in place of the error $ϵ_{i t}$ . By formulating the random error $ϵ_{i t}$ as a deterministic variable $\hat{ρ}_{i t}$ , the model learns better representations. The overall architecture of our models is depicted in Figure 2; it consists of a CDF transformation ( $ξ_{X_{t}}$ ), a feature extractor ( $ϕ_{t}$ ), fully-connected (FC) layers ( $ψ_{\cdot}$ ), and an inverse CDF transformation ( $ξ_{\cdot}^{- 1}$ ). Note that the subscripts of the four components do not include $i$ , which means that a single model is trained and used to predict all financial instruments.

4.2.1. Architecture of Deep Learning Models

First, when input $X_{i t}$ passes into the neural network, its distribution is reshaped by the CDF transformation to expedite the learning process of the models. Details of the transformation are explicated in the next subsection. Second, the feature extractor receives the transformed input data. The feature extractor’s primary function is finding appropriate representations for prediction, so any kind of (sub-)differentiable neural network architecture can be used for it. Inside the feature extractors, dropout layers are interposed to attenuate over-fitting. The retrieved representations are then flattened before entering the FC layers, which convert the representations into appropriate outputs of $\hat{α}_{i t}, \hat{β}_{i t}, \hat{ρ}_{i t}$ . Note that the last activation function of each $ψ_{\cdot}$ is a sigmoid function. Finally, at the inverse CDF transformation, the outputs of $ψ_{\cdot}$ are turned into actual values of $\hat{β}_{i t}, \hat{α}_{i t}, \hat{ρ}_{i t}$ .

4.2.2. CDF Transformation

To expedite the optimization of DL models, we apply CDF transformations to the input and output data. A CDF transformation maps any continuous distribution on the real line to the uniform distribution on $(0, 1)$ (casella2021statistical). The input and output data in our study have widely ranging distributions, which makes it challenging for the models to learn. The CDF transformation addresses this issue by rescaling the data to the uniform distribution.

This CDF transformation is applied variable-wise, as in Figure 3. To avoid look-ahead bias (zhou2014active; isichenko2021quantitative), we exclude the validation and test sets when approximating the CDF. Then, the fitted CDF transformations are applied to the whole input and output data, including the validation and test data sets. Moreover, note that the approximated CDF’s inverse exists because the approximated CDF is a strictly increasing function. Hence, we train the DL models in the CDF-transformed space and apply the inverse CDF transformation to the outputs of the DL model.
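The variable-wise transformation described above can be sketched as follows: the CDF is fitted on training samples only by sorting and counting, interpolated linearly, and then inverted by swapping the two interpolation grids. The function names and the clamping of values that fall outside the training range are assumptions made for this sketch; the text does not specify how out-of-range values are handled.

```python
from bisect import bisect_right

def fit_cdf(train_values):
    """Approximate the CDF of one variable from training samples only."""
    ordered = sorted(train_values)
    n = len(ordered)
    xs = sorted(set(ordered))                        # distinct sample values
    ps = [bisect_right(ordered, x) / n for x in xs]  # fraction of samples <= x
    return xs, ps

def _interp(v, grid_in, grid_out):
    """Piecewise-linear interpolation; out-of-grid values are clamped (assumption)."""
    if v <= grid_in[0]:
        return grid_out[0]
    if v >= grid_in[-1]:
        return grid_out[-1]
    j = bisect_right(grid_in, v)                     # grid_in[j-1] <= v < grid_in[j]
    w = (v - grid_in[j - 1]) / (grid_in[j] - grid_in[j - 1])
    return grid_out[j - 1] + w * (grid_out[j] - grid_out[j - 1])

def cdf_transform(v, cdf):
    xs, ps = cdf
    return _interp(v, xs, ps)

def inverse_cdf_transform(u, cdf):
    xs, ps = cdf
    return _interp(u, ps, xs)                        # both grids are strictly increasing

# Hypothetical training samples of a single feature.
train = [0.3, -1.2, 0.0, 2.5, 0.3, -0.4]
cdf = fit_cdf(train)
u = cdf_transform(-0.8, cdf)             # also usable on validation/test values
x_back = inverse_cdf_transform(u, cdf)   # round trip back to the original scale
```

Fitting on the training block alone and reusing the fitted grids for later data mirrors the look-ahead precaution stated in the text.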

4.2.3. Supervised Learning

We design a supervised-learning problem with the target data of $\hat{β}_{i t}, \hat{α}_{i t}, \hat{ρ}_{i t}$ , generated by Theil-Sen linear regression (wang2005asymptotics; dang2008theil), to train the aforementioned DL models, as in Figure 4. If $r_{i t}, r_{m t}$ of a single time step $t$ were used to estimate $\hat{β}_{i t}, \hat{α}_{i t}, \hat{ρ}_{i t}$ with (2), there would be infinitely many solutions, because there are three unknowns but only one equation. To address this, we utilize $r_{i τ}, r_{m τ}$ ( $t - T_{C} \leq τ \leq t + T_{C}$ ) to estimate $\hat{β}_{i t}, \hat{α}_{i t}, \hat{ρ}_{i t}$ ; then, there are ( $2 T_{C} + 1$ ) equations with three variables. Now, a linear regression method is applied to estimate $\hat{β}_{i t}, \hat{α}_{i t}, \hat{ρ}_{i t}$ . Because stock market data are noisy (pafka2003noisy), outliers can distort the parameter estimates if a simple linear regression is applied. To alleviate the negative effects of outliers, we employ Theil-Sen linear regression, which is a robust linear regression method.
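A minimal sketch of this target generation is given below. It takes the slope as the median of all pairwise slopes in the window $[t - T_{C}, t + T_{C}]$ and then reads off the residual at $t$ . The helper names `theil_sen` and `make_target`, the intercept convention (median of $r_{i τ} - β r_{m τ}$ ), and the sample data are assumptions for illustration; other Theil-Sen intercept conventions exist.

```python
from itertools import combinations
from statistics import median

def theil_sen(x, y):
    """Theil-Sen fit: median pairwise slope, then median intercept (assumed convention)."""
    slopes = [(y[b] - y[a]) / (x[b] - x[a])
              for a, b in combinations(range(len(x)), 2) if x[b] != x[a]]
    beta = median(slopes)
    alpha = median(yi - beta * xi for xi, yi in zip(x, y))
    return alpha, beta

def make_target(r_i, r_m, t, t_c):
    """Targets (alpha, beta, rho) at step t from the window [t - t_c, t + t_c]."""
    lo, hi = t - t_c, t + t_c + 1
    alpha, beta = theil_sen(r_m[lo:hi], r_i[lo:hi])
    rho = r_i[t] - alpha - beta * r_m[t]   # residual of the fitted line at t
    return alpha, beta, rho

# Hypothetical returns: a clean line with one outlier at the last point.
r_m = [-0.02, -0.01, 0.0, 0.01, 0.02]
r_i = [0.001 + 1.5 * x for x in r_m]
r_i[4] += 0.05

alpha, beta, rho = make_target(r_i, r_m, t=2, t_c=2)
```

In this example the single outlier changes four of the ten pairwise slopes, so the median slope stays at the value of the clean line, which is the robustness property motivating the choice of Theil-Sen regression.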

Figure 3. Cumulative Distribution Function (CDF) Transformation. Suppose input data $X_{t}^{train} = [x_{1 t} \; x_{2 t} \; \ldots \; x_{p t} \; \ldots]$ , where $x_{p t}$ is the $p$ -th feature vector whose elements are samples. We find the cumulative mass function (CMF) of the samples of $x_{p t}$ by sorting and counting. Then, the CMF of $x_{p t}$ is linearly interpolated to approximate the CDF of $x_{p t}$ . Denote the approximated CDF of $x_{p t}$ as $\hat{F}_{x_{p t}} (\cdot)$ . Then, $\hat{F}_{x_{p t}} (x_{p t})$ becomes uniformly distributed.

4.3. Portfolio Construction

To replicate, rather than predict, the movement of a given index, our portfolios set the net beta and alpha to be the same as those of the index. Let $w_{i t}^{m}$ be the weight of a financial instrument $i$ in the index $m$ , and let $\bar{α}_{m t} = \sum_{i \in S} w_{i t}^{m} α_{i t}, \bar{β}_{m t} = \sum_{i \in S} w_{i t}^{m} β_{i t}, \bar{ϵ}_{m t} = \sum_{i \in S} w_{i t}^{m} ϵ_{i t}$ . Then, the difference of the returns in (5) becomes (7)–(10) by (3). If we assume $α_{\cdot t}, β_{\cdot t}$ are deterministic at time $t$ , constructing a portfolio with the constraints $\sum_{i \in S} w_{i t} β_{i t} = \bar{β}_{m t}, \sum_{i \in S} w_{i t} α_{i t} = \bar{α}_{m t}$ removes the need to predict the return of the index $r_{m}$ because its coefficient in (8) becomes zero. Note that $\bar{α}_{m t}, \bar{β}_{m t}$ can be set to zero and one, respectively, as in (canakgoz2009mixed), but we instead set them to the weighted averages. This is because setting them to zero and one can give infeasible or unrealistic solutions for the predicted values $α_{i t}, β_{i t}$ . Moreover, (9) becomes zero, so our objective function becomes $E [\| \sum_{i \in S} w_{i t} ϵ_{i t} - \bar{ϵ}_{m t} \|]$ .

(7)	$r_{π t} - r_{m t}$	$= \sum_{i \in S} w_{i t} r_{i t} - \sum_{i \in S} w_{i t}^{m} r_{i t}$
(8)		$= (\sum_{i \in S} w_{i t} β_{i t} - \bar{β}_{m t}) r_{m t}$
(9)		$+ \sum_{i \in S} w_{i t} α_{i t} - \bar{α}_{m t}$
(10)		$+ \sum_{i \in S} w_{i t} ϵ_{i t} - \bar{ϵ}_{m t}$
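The decomposition (7)–(10) can be checked numerically. The sketch below uses a hypothetical three-stock universe with made-up weights, returns, and coefficients; the errors are defined as the residuals of the single-factor model so that (2) holds exactly.

```python
# Hypothetical three-stock universe for one period t (all numbers are illustrative).
w_m = [0.5, 0.3, 0.2]            # index weights w^m_{it}
w = [0.6, 0.4, 0.0]              # tracking-portfolio weights w_{it}
r_i = [0.020, -0.010, 0.030]     # constituent returns
alpha = [0.001, 0.000, 0.002]    # assumed alpha_{it}
beta = [1.1, 0.8, 1.3]           # assumed beta_{it}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

r_m = dot(w_m, r_i)              # index return as the weighted constituent return
eps = [r - a - b * r_m for r, a, b in zip(r_i, alpha, beta)]  # residuals of (2)

diff = dot(w, r_i) - r_m                              # left-hand side of (7)
beta_term = (dot(w, beta) - dot(w_m, beta)) * r_m     # (8)
alpha_term = dot(w, alpha) - dot(w_m, alpha)          # (9)
eps_term = dot(w, eps) - dot(w_m, eps)                # (10)
```

When the portfolio matches the net beta and alpha of the index, the first two terms vanish and only the error term remains, which is the reasoning behind the constraints introduced above.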

Figure 4. Theil-Sen Regression and Estimation of $\hat{β}_{i t}, \hat{α}_{i t}, \hat{ρ}_{i t}$ . If $(r_{m, t + 1}, r_{i, t + 1})$ is an outlier in this figure, the simple linear regression would lean toward the outlier. However, if the Theil-Sen linear regression is applied, the regressed line leans less toward the outlier.

However, it is difficult to predict the errors accurately since they can combine the influence of other factors with random noise. Hence, in lieu of minimizing $E [\| \sum_{i \in S} w_{i t} ϵ_{i t} - \bar{ϵ}_{m t} \|]$ , we make $w_{t}$ close to $w_{t}^{m}$ at each $t$ , as in (11). Here, $w_{t}, w_{t}^{m} \in [0, 1]^{| S |}$ are vectors whose components are $w_{i t}, w_{i t}^{m}$ , respectively. The rationale is that if $w_{t}$ is similar to $w_{t}^{m}$ , the portfolio will share similar unexplained factors or errors with the index.

(11)

\text{minimize} \quad \| w_{t} - w_{t}^{m} \|

For the norm in the objective function $\| w_{t} - w_{t}^{m} \|$ , we use the sum of the 1-norm scaled by $1 / | S |$ and the infinity-norm, as in (12).

(12)

\text{minimize} \quad \frac{1}{| S |} \sum_{i \in S} | w_{i t} - w_{i t}^{m} | + \max_{i \in S} | w_{i t} - w_{i t}^{m} |

To transform the nonlinear objective function (12) into a linear one, we introduce dummy decision variables $z_{i t}, Z_{t}$ , convert (12) into (13), and add constraints (14)–(16) (shanno1971linear; rardin1998optimization). $z_{i t}$ and $Z_{t}$ act as upper bounds for $| w_{i t} - w_{i t}^{m} |$ and $z_{i t}$ , respectively. Because the objective function (13) is minimized, $z_{i t}$ and $Z_{t}$ become $| w_{i t} - w_{i t}^{m} |$ and $\max_{i} z_{i t}$ , respectively, at the optimum:

(13)

\text{minimize} \quad \frac{1}{| S |} \sum_{i \in S} z_{i t} + Z_{t}

\text{subject to}

(14)	$w_{i t} - w_{i t}^{m}$	$\leq z_{i t}$	$, \forall i \in S$
(15)	$w_{i t} - w_{i t}^{m}$	$\geq - z_{i t}$	$, \forall i \in S$
(16)	$z_{i t}$	$\leq Z_{t}$	$, \forall i \in S$
(17)	$\sum_{i \in S \setminus S^{*}} w_{i t} β_{i t}$	$= \bar{β}_{m t}$
(18)	$\sum_{i \in S \setminus S^{*}} w_{i t} α_{i t}$	$= \bar{α}_{m t}$
(19)	$\sum_{i \in S} w_{i t}$	$= 1$
(20)	$w_{i t}$	$\leq u_{i t}$	$, \forall i \in S$
(21)	$\sum_{i \in S} u_{i t}$	$\leq N^{*}$
(22)	$w_{i t}$	$= w_{i t}^{m}$	$, \forall i \in S^{*}$
(23)	$0 \leq w_{i t} \leq 1, \; u_{i t} \in \{0, 1\}, \; z_{i t} \geq 0, \; Z_{t} \geq 0$

Figure 5. The Scheme of Experiments. Each row, composed of the four blocks, is an episode, similar to the sliding window method (hota2017time). To prevent look-ahead bias (zhou2014active; isichenko2021quantitative), an idle block is inserted whose length is the sum of $T_{A}$ and $T_{C}$ , which are required to calculate $r_{i t}, r_{m t}$ and $α_{i t}, β_{i t}$ , respectively.

Our portfolio construction scheme is expressed as the MILP model (13)–(23) at each $t$ . Constraints (17) and (18) allow us to avoid predicting the return of index $m$ and to focus on minimizing the errors, as mentioned above. Because of a lack of time-series input data, for example for newly listed stocks, $α_{i t}, β_{i t}$ cannot be estimated at time $t$ for some financial instruments, denoted by $S^{*}$ . These instruments are removed from the left-hand-side summations and right-hand-side calculations in (17) and (18). Instead, their weights are set to their index weights $w_{i t}^{m}$ , as in (22). Equality (19) enforces the sum of the weights to be one. We introduce the binary decision variable $u_{i t}$ , which indicates whether instrument $i$ is included, as defined in section 4.1. If instrument $i$ is in the portfolio, that is, $w_{i t} > 0$ , then $u_{i t}$ must be one to satisfy (20). Inequality (21) limits the number of financial instruments in the portfolio to at most a given constant $N^{*} \in N$ (canakgoz2009mixed), as explained in section 4.1. Note that if, for some reason (e.g., sin stocks under stewardship codes), a financial instrument $i$ must have a weight no greater than some value $w_{i t}^{max}$ or must be excluded, adding the constraint $w_{i t} \leq w_{i t}^{max}$ or $w_{i t} = 0$ , respectively, achieves this. In the implementation, the Python library PuLP is used to formulate and solve our MILP model.
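The model (13)–(23) could be written in PuLP roughly as follows. This is a sketch under stated assumptions and not the implementation used in this study: the function name `build_tracking_portfolio`, the six-stock example data, and the choice of the CBC solver bundled with PuLP are illustrative, and the argument `fixed` stands in for the set $S^{*}$ .

```python
import pulp

def build_tracking_portfolio(w_m, alpha, beta, n_max, fixed=()):
    """Sketch of (13)-(23) for one rebalancing date; `fixed` plays the role of S*."""
    S = list(range(len(w_m)))
    free = [i for i in S if i not in fixed]
    alpha_bar = sum(w_m[i] * alpha[i] for i in free)   # right-hand side of (18)
    beta_bar = sum(w_m[i] * beta[i] for i in free)     # right-hand side of (17)

    prob = pulp.LpProblem("partial_index_tracking", pulp.LpMinimize)
    w = pulp.LpVariable.dicts("w", S, lowBound=0, upBound=1)
    u = pulp.LpVariable.dicts("u", S, cat="Binary")
    z = pulp.LpVariable.dicts("z", S, lowBound=0)
    Z = pulp.LpVariable("Z", lowBound=0)

    prob += (1.0 / len(S)) * pulp.lpSum(z[i] for i in S) + Z        # objective (13)
    for i in S:
        prob += w[i] - w_m[i] <= z[i]                               # (14)
        prob += w[i] - w_m[i] >= -z[i]                              # (15)
        prob += z[i] <= Z                                           # (16)
        prob += w[i] <= u[i]                                        # (20)
    prob += pulp.lpSum(beta[i] * w[i] for i in free) == beta_bar    # (17)
    prob += pulp.lpSum(alpha[i] * w[i] for i in free) == alpha_bar  # (18)
    prob += pulp.lpSum(w[i] for i in S) == 1                        # (19)
    prob += pulp.lpSum(u[i] for i in S) <= n_max                    # (21)
    for i in fixed:
        prob += w[i] == w_m[i]                                      # (22)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.LpStatus[prob.status], [w[i].varValue for i in S]

# Hypothetical index weights and predicted coefficients for six stocks.
w_m = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05]
beta = [1.10, 0.90, 1.00, 1.30, 0.70, 1.20]
alpha = [0.001, -0.001, 0.000, 0.002, -0.002, 0.001]

status, weights = build_tracking_portfolio(w_m, alpha, beta, n_max=4)
index_beta = sum(a * b for a, b in zip(w_m, beta))
net_beta = sum(a * b for a, b in zip(weights, beta))
held = sum(1 for x in weights if x > 1e-5)
```

The variable bounds in (23) are expressed through `lowBound`, `upBound`, and `cat="Binary"` rather than as separate constraints. An additional cap such as $w_{i t} \leq w_{i t}^{max}$ would be one more `prob += ...` line inside the loop.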

5. Experiments

5.1. Input Data Structure

Figure 6. Architecture of Input Data $X_{i t}$

First, we introduce the structure of the input data to the DL model, which is depicted in Figure 6. The input data used at trading date $t$ to predict $r_{i t}$ for a financial instrument $i$ , a constituent of index $m$ , is defined as a three-dimensional tensor $X_{i t} = [A_{t}, B_{t}, L_{i t}, S_{i t}, L_{m t}, S_{m t}] \in R^{K \times T_{D} \times E}$ . $K$ denotes the feature size; we use $K = 6$ features, and the shape of each is $T_{D} \times E$ . Here,

$A_{t}$ : intercepts of the linearly regressed line with returns of $m$ and $i$ as the regressor and response variables, respectively;
$B_{t}$ : slopes of the linearly regressed line with returns of $m$ and $i$ as the regressor and response variables, respectively;
$L_{i t}$ : averages of the excess returns of $i$ ;
$S_{i t}$ : standard deviations of the excess returns of $i$ ;
$L_{m t}$ : averages of the returns of index $m$ ; and
$S_{m t}$ : standard deviations of the returns of index $m$ .

$T_{D}$ and $E$ are the numbers of end-date offsets $τ$ and of window lengths $I_{η}$ , respectively, used to estimate the above six features. Take $A_{t} = [a_{t, (τ, I_{η})}]$ as an example. $a_{t, (τ, I_{η})}$ is the intercept of the linear regression line estimated on the data between $t - τ - I_{η}$ and $t - τ$ . The row index $τ$ determines the end date of the data used to estimate the above statistics (intercepts, slopes, averages, and standard deviations). The column index $I_{η}$ , on the other hand, determines the length of the estimation window, which is varied to capture the time-varying property of $β$ . For instance, when $τ = 1, η = 1$ (the upper right in Figure 6), $a_{t, (1, I_{1})}$ is the intercept of the linear regression line estimated on the data between $t - 1 - I_{1}$ and $t - 1$ . Similarly, $l_{i t, (1, I_{1})}$ , an element of $L_{i t}$ , is the average of the excess returns of $i$ from $t - 1 - I_{1}$ to $t - 1$ .
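The construction of $X_{i t}$ can be sketched as a double loop over end-date offsets and window lengths. Several details below are assumptions made only for illustration, since the text does not fix them: excess returns are taken relative to the index, both window endpoints are included, population standard deviations are used, and the names `ols` and `build_input` are hypothetical.

```python
from statistics import fmean, pstdev

def ols(x, y):
    """Intercept and slope of y regressed on x."""
    mx, my = fmean(x), fmean(y)
    var = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / var
    return my - slope * mx, slope

def build_input(r_i, r_m, t, taus, lengths):
    """Return X as nested lists with shape K x T_D x E, where K = 6."""
    X = [[[0.0] * len(lengths) for _ in taus] for _ in range(6)]
    for row, tau in enumerate(taus):
        for col, length in enumerate(lengths):
            lo, hi = t - tau - length, t - tau + 1    # window [t-tau-I, t-tau], inclusive (assumption)
            ri, rm = r_i[lo:hi], r_m[lo:hi]
            excess = [a - b for a, b in zip(ri, rm)]  # assumed: excess over the index return
            a, b = ols(rm, ri)
            feats = (a, b, fmean(excess), pstdev(excess), fmean(rm), pstdev(rm))
            for k, value in enumerate(feats):         # order: A, B, L_i, S_i, L_m, S_m
                X[k][row][col] = value
    return X

# Hypothetical return series of length 30; the prediction date is t = 30.
r_m = [0.01 * ((k % 5) - 2) for k in range(30)]
r_i = [0.002 + 0.9 * x for x in r_m]

X = build_input(r_i, r_m, t=30, taus=[1, 2, 3], lengths=[5, 10])
```

Every window ends no later than $t - 1$ , so the tensor uses only information available before the prediction date.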

5.2. Temporal Data Splitting To Preclude Look-Ahead Bias

The whole data set is divided into train, validation, idle, and test blocks, as in the sliding window method (hota2017time), to prevent look-ahead bias (zhou2014active; isichenko2021quantitative), as shown in Figure 5. We refer to each row composed of four blocks in Figure 5 as episode $t^{*}$ , where $t^{*}$ is the first trading date of the test data. In episode $t^{*}$ , the models predict $α_{i t^{*}}, β_{i t^{*}}$ and construct $w_{t^{*}}$ using the data until $t^{*} - 1$ . As defined in Figure 1, the predicted and constructed values are evaluated by the returns $r_{i t^{*}}, r_{m t^{*}}$ , which are calculated from the close prices of the first and last trading dates of the test data. For the validation data, $r_{i, t^{*} - 1 - T_{A}}, r_{m, t^{*} - 1 - T_{A}}$ are the last available returns in episode $t^{*}$ . This is because the close prices on $t^{*} - 1$ are the last available close prices, and returns require close prices $T_{A}$ trading days ahead, as defined in subsection 4.1 and Figure 1. Similarly, $\hat{α}_{i, t^{*} - 1 - T_{A} - T_{C}}, \hat{β}_{i, t^{*} - 1 - T_{A} - T_{C}}, \hat{ρ}_{i, t^{*} - 1 - T_{A} - T_{C}}$ are the last target values for the validation because estimating them requires returns up to $T_{C}$ trading days ahead, that is, $r_{i, t^{*} - 1 - T_{A}}, r_{m, t^{*} - 1 - T_{A}}$ , as explained in subsection 4.2.3. Hence, idle blocks of length $T_{A} + T_{C}$ are inserted between the validation and test blocks. In the experiments, the length of the validation block is set to three months. Moreover, note that this scheme implies monthly portfolio updates because $T_{A} = 21$ , as mentioned in subsection 4.1. Additionally, we assume that all transaction costs are zero.
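The block boundaries of one episode can be sketched with integer time-step indices. The function name `episode_blocks`, the validation length of 63 trading days (three months of 21 days), and the convention that the test block spans the closes at $t^{*}$ and $t^{*} + T_{A}$ are assumptions made for this illustration.

```python
def episode_blocks(t_star, t_a=21, t_c=2, val_len=63, first=0):
    """Index ranges (inclusive start, exclusive end) of the four blocks of episode t*."""
    idle_start = t_star - (t_a + t_c)   # idle block has length T_A + T_C
    val_start = idle_start - val_len
    if val_start <= first:
        raise ValueError("not enough history before t_star")
    return {
        "train": (first, val_start),
        "validation": (val_start, idle_start),
        "idle": (idle_start, t_star),
        "test": (t_star, t_star + t_a + 1),  # covers the closes at t* and t* + T_A
    }

blocks = episode_blocks(t_star=1000)
last_validation_target = blocks["validation"][1] - 1   # equals t* - 1 - T_A - T_C
```

Successive episodes would be obtained by advancing `t_star` by `t_a` steps, which matches the monthly portfolio update implied by $T_{A} = 21$ .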

5.3. Experimental Settings

We used data on the daily price returns of KOSPI 200 and its constituents, as well as the weight of each constituent. All data, covering January 2000 to June 2022, were acquired from the Korea Stock Exchange. We employed canonical neural networks as the feature extractor: MLP, LSTM, GRU, and Transformer. $X_{i t}$ is transposed or flattened as appropriate for each feature extractor, as in Figure 2. $T_{C}$ is set to two. Before training the neural networks, the CDFs of the first and last transformations in Figure 2 are approximated, and all the input and output data are CDF-transformed in advance to avoid repeatedly transforming the same data point during training. Then, the model is optimized by mini-batch gradient descent with a batch size of 512, momentum of 0.1, and L2 regularization of 1.0e-04. Cosine annealing is employed as the learning rate scheduler with an initial learning rate of 1.0e-02 and a maximum of 100 epochs. Moreover, early stopping is applied. Mean squared error is chosen as the loss function.
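For reference, the standard cosine annealing schedule without restarts can be sketched as below, using the initial learning rate and maximum number of epochs stated above. The minimum learning rate of zero and the per-epoch (rather than per-batch) update are assumptions, since the text does not specify them.

```python
import math

def cosine_annealing_lr(epoch, max_epochs=100, lr_init=1.0e-2, lr_min=0.0):
    """Learning rate at a given epoch; lr_min = 0 is an assumption."""
    cosine = math.cos(math.pi * epoch / max_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)

schedule = [cosine_annealing_lr(e) for e in range(101)]
```

Under these assumptions the rate starts at 1.0e-02, reaches half of that value at epoch 50, and decays to zero at epoch 100.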

5.4. Evaluation Metrics

We utilize PE (24) and TE (25) to evaluate the performance of the predictions of $\hat{α}_{i t}, \hat{β}_{i t}$ and of the portfolio construction, respectively, for a given set of time steps $T$ and $V = \{(i, t) : i \in S \text{ and } t \in T\}$ . These equations are derived from (4) and (5) by replacing $E$ with the average.

(24)		$\widehat{PE} = \frac{1}{| V |} \sum_{(i, t) \in V} (r_{i t} - \hat{β}_{i t} r_{m t} - \hat{α}_{i t})^{2}$
(25)		$\widehat{TE} = \frac{1}{| T |} \sum_{t \in T} \| r_{π t} - r_{m t} \|$
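As a small worked example of the sample-average prediction error (24), the sketch below evaluates two hypothetical $(i, t)$ pairs. The function name `prediction_error` and all numbers are illustrative assumptions.

```python
def prediction_error(samples):
    """Mean squared residual over (r_i, r_m, alpha_hat, beta_hat) tuples, one per (i, t) in V."""
    squared = [(r_i - b * r_m - a) ** 2 for r_i, r_m, a, b in samples]
    return sum(squared) / len(squared)

samples = [
    (0.030, 0.020, 0.001, 1.2),    # residual = 0.030 - 0.024 - 0.001 = 0.005
    (-0.010, -0.015, 0.000, 0.8),  # residual = -0.010 + 0.012 - 0.000 = 0.002
]
pe = prediction_error(samples)     # (0.005**2 + 0.002**2) / 2
```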

5.5. Evaluation of the Historical Estimation

We first evaluate the performance ( $\widehat{PE}$ ) of historical estimations, defined in subsection 2.2, of $\hat{β}_{i t}, \hat{α}_{i t}$ in Table 1. The table shows that the best data period for estimation varies over time. Because asset managers cannot know in advance which estimation period will be the best, they need a systematic method to integrate these historical estimations into one. This need can be met by our prediction method, which integrates the multi-period historical data and outputs a single pair of $\hat{β}_{i t}, \hat{α}_{i t}$ . Note that various estimation periods were tested, and the ones that performed best in each year are shown in Table 1.

Another notable result in Table 1 is that every estimation period performs worst in 2020, the year when the COVID-19 pandemic severely struck the global economy. A likely reason is the unprecedented government restrictions on society that year, as suggested by (baker2020unprecedented).

Year\Days	504	756	1,008	1,260	1,512	2,520
2016	0.0195	0.0162	0.0155	0.0152	0.0152	0.0168
2017	0.0148	0.0148	0.0139	0.0136	0.0136	0.0149
2018	0.0184	0.0192	0.0199	0.0196	0.0195	0.0211
2019	0.0161	0.0144	0.0145	0.0151	0.0149	0.0152
2020	0.0261	0.0255	0.0251	0.0253	0.0253	0.0253
2021	0.0237	0.0232	0.0229	0.0221	0.0219	0.0213
2022	0.0230	0.0211	0.0211	0.0205	0.0202	0.0199

Table 1. Yearly Performance ($\widehat{PE}$) of Historical Estimations for Various Estimation Periods (in Days)
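The observation that the best estimation period changes over time can be checked directly from the values in Table 1. The sketch below is illustrative only: the variable names are hypothetical, and ties are broken in favor of the shorter period, which is an arbitrary choice.

```python
# Estimation periods (days) and yearly errors, copied from Table 1.
PERIODS = [504, 756, 1008, 1260, 1512, 2520]
TABLE_1 = {
    2016: [0.0195, 0.0162, 0.0155, 0.0152, 0.0152, 0.0168],
    2017: [0.0148, 0.0148, 0.0139, 0.0136, 0.0136, 0.0149],
    2018: [0.0184, 0.0192, 0.0199, 0.0196, 0.0195, 0.0211],
    2019: [0.0161, 0.0144, 0.0145, 0.0151, 0.0149, 0.0152],
    2020: [0.0261, 0.0255, 0.0251, 0.0253, 0.0253, 0.0253],
    2021: [0.0237, 0.0232, 0.0229, 0.0221, 0.0219, 0.0213],
    2022: [0.0230, 0.0211, 0.0211, 0.0205, 0.0202, 0.0199],
}


def best_period(year):
    """Period with the lowest error in a year (first one wins on ties)."""
    row = TABLE_1[year]
    return PERIODS[row.index(min(row))]


best_by_year = {year: best_period(year) for year in TABLE_1}
for year, period in best_by_year.items():
    print(year, period)
```

The shortest period is best in 2018 while the longest is best in 2021 and 2022, so no single fixed window dominates.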

5.6. Prediction of Market Exposures and Alphas

We now compare the performance of our models with the historical estimations in Table 2. The table shows that every one of our models outperforms the best historical estimation (column Historical) in every year. Moreover, the improvement is substantial: the average $\widehat{PE}$ of our methods is around 57% of that of the best historical estimation, indicating that our models are effective.

In Table 2, we observe that our models also performed worst in 2020, as did the historical estimations. We conjecture that our models were vulnerable to the unprecedented government restrictions (baker2020unprecedented); developing a methodology that reflects them is a future research topic for improving our models. In addition, the best model differs from year to year in Table 2. Integrating the models into a single one with ensemble techniques (dong2020survey; ganaie2021ensemble) is left for future work. Finally, we use a single-factor model, which can be extended to multi-factor models, for example with the factors defined by (fama2015five).

Year	GRU	LSTM	Trans	MLP	Historical
2016	0.00708	0.00690	0.00780	0.00683	0.01519
2017	0.00892	0.00890	0.00917	0.00896	0.01362
2018	0.00919	0.00886	0.00941	0.00891	0.01845
2019	0.00615	0.00596	0.00673	0.00592	0.01435
2020	0.01791	0.01785	0.01776	0.01792	0.02511
2021	0.01244	0.01223	0.01329	0.01237	0.02126
2022	0.01084	0.01038	0.01105	0.01033	0.01988
Average	0.01036	0.01015	0.01075	0.01018	0.01826

Table 2. Yearly Performances of the Deep-Learning Models and Best Historical Estimation
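The summary statistics quoted for Table 2 can be reproduced from its entries. The following sketch recomputes the column averages, the ratio of the models' average error to the historical average, and the best model per year; the variable names are hypothetical and the numbers are copied from the table.

```python
MODELS = ["GRU", "LSTM", "Trans", "MLP"]
# Each row: GRU, LSTM, Trans, MLP, Historical (copied from Table 2).
TABLE_2 = {
    2016: [0.00708, 0.00690, 0.00780, 0.00683, 0.01519],
    2017: [0.00892, 0.00890, 0.00917, 0.00896, 0.01362],
    2018: [0.00919, 0.00886, 0.00941, 0.00891, 0.01845],
    2019: [0.00615, 0.00596, 0.00673, 0.00592, 0.01435],
    2020: [0.01791, 0.01785, 0.01776, 0.01792, 0.02511],
    2021: [0.01244, 0.01223, 0.01329, 0.01237, 0.02126],
    2022: [0.01084, 0.01038, 0.01105, 0.01033, 0.01988],
}


def column_mean(column):
    """Average of one column of TABLE_2 across all years."""
    values = [row[column] for row in TABLE_2.values()]
    return sum(values) / len(values)


model_means = [column_mean(k) for k in range(len(MODELS))]
historical_mean = column_mean(len(MODELS))

# Ratio of the models' average error to the best historical estimation's average.
ratio = (sum(model_means) / len(model_means)) / historical_mean

# Best deep-learning model in each year (lowest error among the four).
best_model = {
    year: MODELS[row[:4].index(min(row[:4]))] for year, row in TABLE_2.items()
}

# True if every model beats the historical estimation in every year.
all_beat_historical = all(
    value < row[4] for row in TABLE_2.values() for value in row[:4]
)
print(round(ratio, 3), best_model, all_beat_historical)
```

The ratio comes out near 0.57, and the best model alternates among MLP, LSTM, and Transformer across years, which is what motivates the ensemble idea mentioned above.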

5.7. Performance of the Portfolios

So far, we have shown that all our prediction models outperform the historical estimations. In this subsection, we evaluate our portfolio construction method, which uses our predictions of $\hat{\beta}_{i t}, \hat{\alpha}_{i t}$.

To evaluate our models, we compare their $\widehat{TE}$ with that of the full replication, labeled Full in Table 3 and Figure 7. $\widehat{TE}$ is measured with the portfolio updated according to the market-capitalization ratios of the day before the update, because subsection 4.1 assumes that only data up to the day before the execution date are available. Because the full replication does not depend on the limit on the number of stocks, its tracking error is a single value, shown as "Full" in Table 3 and Figure 7.

When $N^{*} \geq 90$, the $\widehat{TE}$ of our methods is similar to that of the full replication in Figure 7, which implies that 90 stocks selected and weighted by our method are enough to replicate the KOSPI 200. Note that 90 is fewer than half of the constituents of the KOSPI 200. In Table 3, 75% of the bold numbers of our methods show better tracking errors than the full replication when $N^{*} \geq 90$. These results indicate that our portfolio construction methodology with the predicted market sensitivities is effective.

Additionally, we surmise that performance deteriorates below 90 stocks because the idiosyncratic errors $\epsilon_{i t}$ cannot be offset sufficiently with fewer stocks. Hence, taking the correlation between the errors into account may allow us to lower the bound of 90 stocks; this is left for future work. A smaller number of stocks may also have increased the difference in exposure to other factors between the portfolios and the index. This might be addressed by considering multi-factor models such as (fama2015five), adding outputs for the new factors to the prediction models, and introducing constraints that make the net sensitivity to them the same as that of the market. Despite this room for future work, replicating an index with fewer than half of its constituents is significant because holding fewer stocks can reduce the management costs of funds. Moreover, our method can flexibly accommodate other constraints that arise in actual deployment.
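The intuition that idiosyncratic errors offset less with fewer stocks can be illustrated with a deliberately simplified calculation. The sketch below assumes mutually uncorrelated errors with equal variance and an equally weighted portfolio; none of these assumptions comes from the paper, whose portfolios are weighted by the optimization model, and the function names are hypothetical.

```python
def residual_variance(weights, idio_variances):
    """Variance of sum_i w_i * eps_i, assuming the eps_i are mutually uncorrelated."""
    return sum(w * w * v for w, v in zip(weights, idio_variances))


def equal_weight_residual_variance(n_stocks, idio_variance=1.0):
    """Residual variance of an equally weighted portfolio of n_stocks stocks.

    Equal weights and equal variances are illustrative assumptions only.
    """
    weights = [1.0 / n_stocks] * n_stocks
    return residual_variance(weights, [idio_variance] * n_stocks)


# Residual variance (relative to a single stock) for a few portfolio sizes.
for n in (30, 90, 200):
    print(n, equal_weight_residual_variance(n))
```

Under these assumptions the residual variance scales as 1/N, so it grows quickly as the number of stocks shrinks. Correlated errors would change this relationship, which is why modeling the correlations is suggested above as a way to lower the bound.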

6. Conclusion

This study has proposed a novel two-step approach to partial replication of a market index. In the first step, we examine several deep-learning models for predicting the market sensitivities of index constituents, and we present applications of practical data handling methods for the prediction. In the second step, we design a mixed-integer linear programming model that constructs an index portfolio from the market sensitivities predicted in the first step. Experiments on the KOSPI 200 indicated that our prediction models had only 57% of the errors of the historical estimations. Also, fewer than half of the constituents were enough for our portfolio construction method to mimic the KOSPI 200. To our knowledge, this is the first study to demonstrate the efficacy of deep-learning architectures in predicting market sensitivities, together with a pragmatic partial-index-tracking method that controls the predicted market sensitivities.

The Number of Stocks	MLP	LSTM	GRU	Trans	Full
	7.548	7.522	8.415	7.082	1.797
	4.701	4.259	4.570	3.522	1.797
	2.951	3.013	3.153	2.995	1.797
	2.383	2.170	2.257	2.307	1.797
	2.042	2.095	1.947	1.778	1.797
	1.861	1.797	1.867	1.836	1.797
	1.741	1.768	1.831	1.812	1.797
100	1.747	1.664	1.797	1.782	1.797
110	1.770	1.764	1.787	1.732	1.797
120	1.835	1.745	1.794	1.821	1.797
130	1.756	1.680	1.733	1.726	1.797
140	1.727	1.751	1.776	1.797
150	1.738	1.700	1.722	1.820	1.797
160	1.740	1.794	1.779	1.809	1.797
170	1.766	1.812	1.782	1.804	1.797
180	1.763	1.762	1.769	1.801	1.797
190	1.774	1.788	1.807	1.797

Table 3. Tracking Errors Using Our Method and Full Replication (Unit: 1.0e-5)

Note: The stock-count labels of the first seven rows are not available. Rows 140 and 190 each lack one model value whose column cannot be identified; their remaining values are listed in order, ending with Full.

Figure 7. Comparison of Tracking Errors: the change in $\widehat{TE}$ as the limit on the number of stocks $N^{*}$ changes.
