A Framework for Inherently Interpretable Optimization Models

Marc Goerigk Network and Data Science Management, University of Siegen,
Unteres Schloß 3, 57072 Siegen, Germany Michael Hartisch Corresponding author. Email: michael.hartisch@uni-siegen.de Network and Data Science Management, University of Siegen,
Unteres Schloß 3, 57072 Siegen, Germany

Abstract

With dramatic improvements in optimization software, the solution of large-scale problems that seemed intractable decades ago are now a routine task. This puts even more real-world applications into the reach of optimizers. At the same time, solving optimization problems often turns out to be one of the smaller difficulties when putting solutions into practice. One major barrier is that the optimization software can be perceived as a black box, which may produce solutions of high quality, but can create completely different solutions when circumstances change leading to low acceptance of optimized solutions. Such issues of interpretability and explainability have seen significant attention in other areas, such as machine learning, but less so in optimization. In this paper we propose an optimization framework to derive solutions that inherently come with an easily comprehensible explanatory rule, under which circumstances which solution should be chosen. Focussing on decision trees to represent explanatory rules, we propose integer programming formulations as well as a heuristic method that ensure applicability of our approach even for large-scale problems. Computational experiments using random and real-world data indicate that the costs of inherent interpretability can be very small.

Keywords: data science; interpretable optimization; explainability; decision making under uncertainty; decision trees

1 Introduction

1.1 Motivation

Due to the existence of explicitly stated models and well defined solution processes, researchers in operations research and optimization can have high confidence in the correctness and usefulness of found solutions. If discrepancies between the real-world application and the found solution occur, the underlying model is usually adapted within a feedback cycle. Overall, this design process gives confidence in the validity of the solution. After obtaining a satisfactory model that reflects all conditions, solutions are expected to be employed in various settings, i.e. parameters are adapted, yielding new solutions for every special case.

However, explaining why a solution (e.g. a production plan) of type A is implemented today, while type B was used yesterday is not an easy task. While it is possible to explain the model that is being used, the function that maps a model to a solution is complex and not necessarily explainable. Furthermore, solvers are able to provide necessary and sufficient optimality conditions, and sensitivity analysis can yield a better understanding of the specifics of an optimal solution. But most of such checking and validation tools—often communicated in a very formal, mathematical language—are only accessible to experts, that already have a high confidence in the process. Indeed, even understanding a (linear) programming model can quickly become a challenging task, especially for people not familiar with modeling techniques. Users with less mathematical and computer background, e.g. the planner using the optimization software and the workers in charge of implementing the result, may consider the solver a black box. This raises the question whether a solution is acceptable for the end user, or the object of the optimization process (the worker). Workers directly affected by the optimized decision may wish to understand them and obtain knowledge about future outcomes in order to asses when one has to work overtime or on weekend shifts. Not being able to answer such questions may lead to discontent and loss of trust and result in weak adoptions of optimized decisions.

We argue that non-theoretical and easily comprehensible explanations have to be available in order motivate actions obtained from optimization programs. In particular, justifications of why a system recommends certain actions are more relevant for non-experts than explanations of how the solution is obtained [MB00, YJ95]. Providing the underlying optimization model might not be useful here, as, “disclosure alone does not guarantee that anyone is paying attention or is able to accurately interpret the information; more complex information is more likely to be unexamined or misunderstood” [Pra05]. In [SPVR01] the authors argue that employees’ identification with their company is stronger, the more adequate information they receive about their company and their personal role within the company. Furthermore, effective and transparent communication and explanations play a vital role in fostering trust and positive employee-organization relationships [Raw08, MS14, YMF19, WXW18]. Decision models that are complex and difficult to understand can result in skepticism and reluctance to adopt or use the system that was actually designed to support decision making, even if the models have been shown to improve decision-making performance [ACC $^{+}$ 06, KDBL $^{+}$ 09]. In [KDBL $^{+}$ 09] the authors argue that a decision-support system “must be designed to induce an alignment of a decision maker’s mental model with the decision model”.

Therefore, providing insights into decisions obtained from optimization processes is an important task for managers. Incorporating workers into the feedback cycle of the modeling process can be one important part and providing explanations during this process improves the feedback loop [CSGK19]. After a model is created workers still are in need of more easily comprehensible explanations of when and why different solutions are considered. Furthermore, end users often have the need of being able to explain decisions obtained from an AI, e.g. for loan [SYX $^{+}$ 20, SMF21] or medical consultants [SCDS77, Swa85, LSG $^{+}$ 19]. Even the European “right to explanation” tries to ensure that users can ask for an explanation of an algorithmic decision that was made about them [GF17].

Our Approach.

In addition to modeling the underlying problem in a suitable manner we impose the task of selecting a well suited explanatory rule on the modeler: Rather than trying to post-hoc find an explanation of solutions, we want to create inherently interpretable models [Rud19] that provide an explanatory rule explaining when to use which solution. This explanatory rule can be used when the data basis changes, making it unnecessary to resolve the model, and giving end users foreseeable and explainable solutions. Originating from a set of anticipated scenarios, we want to provide an explanatory rule that maps any (future) scenario to a solution. Due to the desired comprehensible nature of the rule, the set of possible solutions has to be restricted. We show that this places our line of research between so-called $K$ -adaptability and decision rule approaches: The former demands the selection of a fixed number of solutions that can be chosen after the disclosure of uncertain parameters, without having a (comprehensible) mapping from scenario to solution. The latter is frequently used in a robust optimization setting, where decision rules approximate the feasible region in order to alleviate the curse of dimensionality, but without focus on comprehensibility.

1.2 Related Literature

Explainable and Interpretable AI.

In AI and optimization the main focus has been the predictive accuracy of models and the quality of found solutions, respectively. However, inexplicable black-box solutions that cannot offer an explanation or justification lack acceptance, and in particular for machine learning approaches, the novel field of explainable AI arose in recent years, as the necessity to comprehend the obtained solution gained importance [Pau93, ARL18]. For a broad overview we recommend the recent surveys on explainable AI [GMR $^{+}$ 18, GBY $^{+}$ 18, ČRA $^{+}$ 21].

There are various ways of trying to post-hoc explain AI decisions and making them more comprehensible, e.g. counterfactual explanations [MP14, WMR17], explaining model predictions by weighting features according to their importance [RSG16, LL17, SNF21], extracting rules explaining the underlying concept [CS95, MBVGV07], argumentation [MIBD02, AP09, ABG $^{+}$ 17, ČLMT19], or qualitative abstractions [Bou94, BG96]. In [Fre14] the author argues that rather than focussing on accuracy, the comprehensibility of (classification) models should also be considered and in her seminal work [Rud19] Rudin stresses that models should be inherently interpretable rather than post-hoc explainable. Several studies exist that research what representation types are well-suited to be comprehensible for human users [Mil19, Arz21]. They suggest that decision trees or classification rules [BSH $^{+}$ 10, HDM $^{+}$ 11, Fre14] and linear models [SL97, PSLB15, UR16] are quite comprehensible, where sparsity is assumed to be an important factor [Gle16]. Furthermore, the type of explanation needed highly depends on the targeted audience [ADRDS $^{+}$ 20].

Explainable and Interpretable Optimization.

For problems in the NP complexity class, the feasibility and quality of any solution can be checked in polynomial time. This makes it easy to compare any two solutions, which is known as contrastive explanation [Mil21]. However, this explanation type demands that an alternative solution has to be present in the mind of the inquirer. In particular for complex domains this is unlikely the case and hence the question of why a solution is chosen has to be answered. This question can be answered by operations researchers by pointing at the specific model and the applied algorithm or heuristic, which is very unsatisfying for non-experts.

The possibility of contrastive explanation may have hindered further research into more interpretable models and only few such approaches exist, e.g. interpretable policies for optimal stopping in stochastic systems [CM22] or a mixed integer programming approach to obtain interpretable tree-based models for clustering [BOW21]. Here, an action (stop or continue) or a class label (cluster membership), respectively, is assigned in an interpretable manner. In contrast, our approach assigns solutions from an exponentially large solution space, which means that optimization and classification are considered holistically.

In the field of operations research and management science, for scheduling and planning a few recent approaches regarding explainability were published. In [CMP19] progress for planning problems towards extracting approximate post-hoc argumentation based explanations is made, where causal relationships are extracted and then abstracted. A tool to explain plans to non-technical users is presented in [ODV20] that uses formal argumentation and dialogue theory. Explainability in scheduling is discussed in [ČLMT19], where a makespan scheduling problem is translated into an abstract argumentation to extract explanations regarding the feasibility and efficiency. Building upon this approach, a tool that provides interactive explanations in makespan scheduling is presented in [ČLL21]. In a more general approach, for solutions obtained via dynamic programming an explaining method was introduces in [EK21], which focuses on the explainability of the program itself. Furthermore, in recent years the importance to provide comprehensible insight gained importance for multi-objective optimization. For a planning environment verbal explanations are generated in [SSG18], that explain the tradeoff made to reconcile competing objectives. In a different perspective [CGMS21] use simple decision rules to record the decision makers preference used to iteratively converge towards the part of the Pareto front containing the best compromise solution, providing insights into the impact of the given answers.

Decision Rules and $K$ -Adaptability.

In robust and stochastic multistage optimization, decision variables of later stages—that depend on realizations of uncertain parameters of previous stages—can be approximated using parametric classes of linear or nonlinear functions of the uncertain parameters [GKW19]. Decision rules specify the reaction to a scenario and therefore can be viewed as an explanatory rule for how a scenario affects a decision. However, the motivation to use decision rules is not to make decisions of later stages more comprehensible, but to obtain an approximate decision in reasonable time. The selection of a class of functions for the decision rules solely depends on its computational performance and the approximation quality. More recently, predictive prescriptions [BK20] were introduced that prescribe a decision given the observation of covariate variables, but again not with the focus on obtaining comprehensible prescriptions.

$K$ -adaptability allows the decision maker to select $K$ different solutions from which one can be chosen after the disclosure of uncertain parameters. It is used in stochastic [BP19, MMP22] and robust optimization [BC10, HKW15, BK17] as an approximation scheme with the goal of obtaining good solutions in reasonable time. The focus lies on the selection of the best $K$ solutions, while the decision which solution should be chosen in a specific scenario is postponed to when the scenario is revealed. In particular, no rule is given for the mapping between scenarios and the $K$ solutions. Only in recent works, a connection between $K$ -adaptability and decision rules was made [SGW20, VGY20], but not with a focus on obtaining more comprehensible solutions but better computational properties.

Decision Trees.

In this paper we focus on univariate decision trees in order to provide easily interpretable and comprehensible rules. While finding optimal decision trees is an NP-hard problem [LR76], several heuristics exist that recursively split the feature space in a greedy manner, e.g. in [BFOS17, Qui86]. Mixed-integer programming formulations [BD17] can be stated, which can be used to provide good solutions, while they fail to prove optimality for large instances in reasonable time. Other heuristics that are mostly based on pruning include local search [BDM19] and evolutionary algorithms [BBDCF11]. In these methods for obtaining decision trees the classification of the training data is available a-priori. We, on the other hand, want to build an inherently interpretable model [Rud19], i.e. rather than trying to post-hoc find an explanation, we incorporate the discovery of an explanation into the optimization process by explicitly demanding that a solution needs to have an explanatory rule. Hence, by predefining the structure of the intended explanatory rule we restrict the solution space instead of dealing with classification errors.

Our Contributions.

We briefly summarize our main contributions and give an overview of the paper structure.

We present a framework for inherently interpretable optimization with uncertainty in the objective function. This model offers flexibility regarding the explanatory rule which can be chosen to suitably address the target audience. Additionally, choosing the decision criterion is left to the end user in order to maintain a generally applicable approach (see Section 2).
We discuss this framework with respect to an explanatory rule representable via a univariate binary decision tree and focus on the Laplace criterion, which optimizes the set of solutions and the explanatory rule with regard to the average outcome. We examine the computational complexity of the problem and present a greedy approach for such models. A distinct advantage of the greedy method is that in some settings, any solution algorithm for the nominal optimization problem can be utilized (see Section 3).
We provide insights into extensions of the model that can incorporate meta data, uncertainty in the constraints rather than the objective, a multivariate decision tree as explanatory rule, as well as a variant where additional costs are associated with the provision of potential realizations (see Section 4).
In a computational study, utilizing a shortest path setting, we show that the proposed greedy heuristics performs well compared to solving the optimization problem and we provide insights into the quality loss for demanding the existence of an explanatory rule. In the experiments we utilize both artificial as well as real-world data (see Section 5).

2 An Inherently Interpretable Optimization Model

We consider some (deterministic) optimization problem of the form

min x \in X c^{⊤} x

where $X \subseteq R^{n}$ denotes the set of feasible solutions and $n \in N$ is the number of variables. Throughout this paper, we use the notation $[n] := {1, \dots, n}$ to denote index sets. The basic optimization model that we propose is applicable if similar problems are solved repeatedly. We would like to provide an easily comprehensible rule under which circumstances which candidate solution should be used. Finding such an explanatory rule is incorporated into the model rather than post-hoc finding a rule that fits best to the obtained solution. Therefore, the comprehensibility of a solution is inherently required, potentially to the cost of no longer being able to obtain globally optimal solutions which might not have an explanatory rule of the demanded form.

So let us assume that the cost vector $c \in R^{n}$ is uncertain and a discrete set of candidate cost scenarios ${c_{1}, \dots, c_{N}}$ with probabilities $p \in [0, 1]^{N}$ , $\sum_{j \in [N]} p_{j} = 1$ , is given. Consider the set of all functions $f : R^{n} \to X$ that map cost scenarios to solutions. Let us denote $A$ as some subset of such functions that we consider as being comprehensible explanations. Our aim is to find an explanatory rule $a \in A$ that assigns a solution $x \in X$ to any cost scenario $c \in R^{n}$ , where in the case of the $j$ -th observed cost scenarios we write $a (j)$ instead of $a (c_{j})$ for simplicity. The decision-dependent random variable $C (x)$ takes a value $C_{j} (x) = c_{j}^{⊤} x$ with probability $p_{j}$ for each scenario index $j \in [N]$ . Let some decision criterion $μ_{p} : R^{N} \to R$ be given that aggregates a vector of results to a single objective value (e.g., $μ_{p}$ can be a risk measure). We can hence evaluate the quality of an explanatory rule $a$ by calculating

F_{p} (a) = μ_{p} (C_{1} (a (1)), \dots, C_{N} (a (N))) .

The goal is thus to find a rule $a$ that minimizes the function $F_{p}$ , i.e., to solve

min a \in A F_{p} (a) .

Note that depending on the selected decision criterion, the probability vector $p$ might actually not be needed.

As an example, if the decision criterion $μ_{p}$ is the variance, function $F_{p}$ is given by

F_{p} (a) = \sum j \in [N] p_{j} {⎛ ⎝ C_{j} (a (j)) - \sum k \in [N] p_{k} C_{k} (a (k)) ⎞ ⎠}_{j}^{2} .

If we apply the worst-case criterion, $F_{p}$ is given by

F_{p} (a) = max j \in [N] C_{j} (a (j)),

while in the following we often focus on the Laplace criterion, i.e. $F_{p} (a) = L (a) = \sum_{j \in [N]} C_{j} (a (j))$ where we assume that all scenarios have the same probability (note that minimizing the sum of values is equivalent to minimizing the expected value in this case).

As the main goal is to obtain a comprehensible explanatory rule, the structure of allowed rules in $A$ is crucial. One key aspect in this paper is to restrict the image of the explanatory rule, i.e. ${a (j) ∣ j \in [N]} \subseteq X$ , to contain only a small number of potential solutions. Furthermore, the mapping of a scenario to a solution has to be easily comprehensible, given by a straightforward rule.

This marks the central difference between our approach and $K$ -adaptability as in [BP19]. While the $K$ -adaptability approach also restricts the cardinality of the image of the explanatory rule (hence the $K$ in the name), it does not put any further constraints on it to ensure comprehensibility, i.e., any mapping from $[N]$ to $[K]$ is allowed. This means that usually, the best out of the $K$ candidate solutions is assigned to each scenario. In the following, we refer to the $K$ -adaptability approach with the Laplace criterion $L (a)$ as the min-sum-min approach, in analogy to [BK17].

For calculating an explanatory rule, some data set of scenarios ${c_{1}, \dots, c_{N}}$ is required for planning. In practice, it may happen that different—not yet observed—scenarios are encountered, and we still expect the rule to perform well. For this reason, we usually distinguish between training data (i.e., the $N$ scenarios available for optimization) and test data (i.e., a set of scenarios not known during optimization). In particular our approach is suitable whenever the true scenario is disclosed and observable before a decision has to be made and the used solution is desired to be obtainable via an easily comprehensible rule. Finding such an explanatory rule is therefore a long-term planning decision, where the explanatory rule should be applicable in several future scenarios. Application examples include routing problems, disaster management and workforce scheduling.

We present an example for our model. Consider a type of selection problem, where $n = 5$ projects are given and we would like to create a portfolio consisting of exactly $p = 2$ of these projects. More formally, we have $X = {x \in {0, 1}^{n} : \sum_{i \in [n]} x_{i} = p}$ . Each project $i \in [n]$ has an associated cost, which we would like to minimize. We would like to prepare an explanatory rule that allocates up to $K = 4$ different solutions $x^{k}$ to $N = 10$ different scenarios, i.e., circumstances under which this choice may have to be made. In Table 1, we present example costs for this setting.

	Project
Scenario	1	2	3	4	5
1	4	7	8	6	4
2	6	7	3	10	2
3	3	8	1	10	4
4	7	1	7	3	7
5	2	9	8	10	3
6	1	3	4	8	6
7	10	3	3	9	10
8	8	4	7	2	3
9	10	4	2	10	5
10	8	9	5	6	1

Table 1: Example costs for a selection problem.

We assume that each scenario has equal probability and we would like to minimize the expected costs. In order to obtain an explanatory rule matching scenarios to a solution we identify two item costs; in this case, we choose costs $c_{2}$ and $c_{3}$ of items 2 and 3, respectively. We then define a threshold for each value (in this case, $5.5$ and $6$ , respectively), which creates four categories. Each scenario belongs to exactly one such category, see Figure 1, where we show the projections of the ten given scenarios onto the $(c_{2}, c_{3})$ plane. We associate a solution $x^{k}$ with each category, see the blue numbers in the four corners of the plot. For example, if $c_{2} > 5.5$ and $c_{3} > 6$ , we make use of solution $x^{2}$ . These are rules that are easy to communicate and to check without further calculations.

Example explanatory rule. — (a) Splits in $(c_{2}, c_{3})$ plane.

In set notation, the four solutions we use are $x^{1} = {2, 5}$ , $x^{2} = {1, 5}$ , $x^{3} = {3, 5}$ , and $x^{4} = {2, 3}$ . The corresponding costs are given in Table 2, along with the allocation choice made by the explanatory rule. Note that in scenarios 6 and 8 none of the four solutions is optimal.

	Costs of Solution				Allocated
Scenario	1	2	3	4	Solution
1	11	8	12	15	2
2	9	8	5	10	3
3	12	7	5	9	3
4	8	14	14	8	1
5	12	5	11	17	2
6	9	7	10	7	4
7	13	20	13	6	4
8	7	11	10	11	1
9	9	15	7	6	4
10	10	9	6	14	3

Table 2: Solution costs and choice.

This comprehensible explanatory rule that only queries the true cost of projects 2 and 3 is in this example even able to allocate the best of the four available solutions in each scenario. This is not necessarily always the case. In comparison to a min-sum-min solution, there are two ways how our approach may result in higher costs. On the one hand, the rule may not be able to allocate optimal solutions to every scenario due to its simple (but comprehensible) structure. On the other hand, even if we allocate the best of the $K$ solutions to each scenario, the choice of solutions may still result in an increase in objective value compared to selecting an individual solution in every scenario. Furthermore, note that this only holds when we consider the training data. On additional testing data, which was not used during the optimization, the performance of a solution obtained from an interpretable model may even be better than that of a min-sum-min solution.

3 Model Discussion

3.1 Mixed-Integer Programming Formulation for Decision Trees

We first introduce a mixed-integer programming formulation for the problem of minimizing $F_{p} (a)$ when we restrict the explanatory rule to the form of a univariate, binary decision tree of fixed depth. The application of such a tree can be seen as a series of yes/no queries, to reach a leaf that corresponds to a solution. Let $Q \in N$ be the number of queries, called splits, and thus $K = 2^{Q}$ is the number of leaves of the decision tree. The $k$ -th leaf is associated with solution $x^{k} \in X$ . To ensure comprehensible rules, we let the split at each level $q$ of the decision tree be the same in each subtree and furthermore only a single entry of the cost scenario vector $c_{j} \in R^{n}$ of the materialized scenario $j \in [N]$ is queried and compared to a threshold value $b^{q} \in R$ . Binary variables $d_{i}^{q}$ indicate whether entry $i$ of the cost vector is queried in split $q$ of the decision tree. Let $a_{j}^{k}$ be the indicator whether in scenario $j \in [N]$ the traversal of the decision tree reaches leaf $k \in [K]$ .

Each leaf can be described via a binary representation of the split outcomes. To this end let $S_{q} \subseteq [K]$ contain the leaves $k$ where in binary representation of $k - 1$ , the $q$ th digit is 1, e.g. for $Q = 2$ splits and $K = 4$ leaves, we have $S_{1} = {2, 3}$ and $S_{2} = {1, 3}$ using the binary representation as shown in Table 3.

$k - 1$	$q = 1$	$q = 2$	contained in
0	0	0	-
1	0	1	$S_{2}$
2	1	0	$S_{1}$
3	1	1	$S_{1}$ , $S_{2}$

Table 3: Binary representations

Using this notation, we propose the following mixed-integer formulation.


$min$	$F_{p} (a) = μ_{p} (\sum k \in [K] a_{1}^{k} c_{1}^{⊤} x^{k}, \dots, \sum k \in [K] a_{N}^{k} c_{N}^{⊤} x^{k})$		(1a)
s.t.	$\sum k \in [K] a_{j}^{k} = 1$	$\forall j \in [N]$	(1b)
	$\sum i \in [n] d_{i}^{q} = 1$	$\forall q \in [Q]$	(1c)
	$\sum i \in [n] c_{j i} d_{i}^{q} \leq b^{q} + M \sum k \in S_{q} a_{j}^{k}$	$\forall q \in [Q], j \in [N]$	(1d)
	$b^{q} + ϵ - M (1 - \sum k \in S_{q} a_{j}^{k}) \leq \sum i \in [n] c_{j i} d_{i}^{q}$	$\forall q \in [Q], j \in [N]$	(1e)
	$x^{k} \in X$	$\forall k \in [K]$	(1f)
	$b^{q} \in R$		(1g)
	$d_{i}^{q} \in {0, 1}$	$\forall i \in [n], q \in [Q]$	(1h)
	$a_{j}^{k} \in {0, 1}$	$\forall k \in [K], j \in [N]$	(1i)

While Constraints (1b) demand that each scenario is mapped to exactly one solution, Constraints (1c) ensure comprehensibility by restricting each split in the decision tree to query only a single parameter. Constraints (1d) and (1e) describe each split of the decision tree and mirror the traversal of a scenario through the different levels: Assume $b^{q}$ and $d^{q}$ are already fixed. Then for each scenario $j \in [N]$ and split $q \in [Q]$ either $\sum_{i \in [n]} c_{j i} d_{i}^{q} \leq b^{q}$ or $\sum_{i \in [n]} c_{j i} d_{i}^{q} > b^{q}$ holds. Depending on the outcome the decision tree is traversed to the right or to the left, and hence either one of the leaves to the right or to the left of the current node has to be reached which is indicated via $\sum_{k \in S_{q}} a_{j}^{k}$ for sufficiently large $M$ and sufficiently small $ϵ$ .

We can add bounds to the threshold values $b^{q}$ using ${min}_{i, j} c_{j i} \leq b^{q} \leq {max}_{i, j} c_{j i}$ for each $q$ to improve numerical stability. The optimization criterion (1a) as well as the optimization domain (1f) remain general and the type (linear, quadratic, etc.) and thus the tractability of this model vary from case to case.

As a special case, we consider the Laplace criterion, i.e., we are interested in explanatory rules yielding solutions that perform well on average. This optimization criterion can be realized within the scope of Model (1) by using the quadratic objective function $L (a) = \sum_{k \in [K]} \sum_{j \in [N]} a_{j}^{k} \sum_{i \in [n]} c_{j i} x_{i}^{k}$ that adds up the costs of the solutions chosen in every scenario. If a linear model is of interest, linearization options are possible, e.g. by introducing auxiliary variables $z_{j} \in R_{\geq 0}$ , using objective function $F_{p} (a) = \sum_{j \in [N]} z_{j}$ , and adding constraints

\sum i \in [n] c_{j i} x_{i}^{k} \leq z_{j} + M (1 - a_{j}^{k}) \forall k \in [K], j \in [N]

(2)

which force $z_{j}$ to take the cost of the unique solution $k \in [K]$ associated with scenario $j \in [N]$ according to the decision tree.

3.2 Notes on Complexity

In the proposed model, Restrictions (1c) ensure that each split within the decision tree only considers a single dimension of the cost vector $c$ . While this feature is used for the sake of comprehensibility, it also gives a computational benefit. Given a constant tree depth $Q$ , there are $O (n^{Q})$ possibilities to set variables $d_{i}^{q}$ . Once we fix which dimension $i$ to query in split $q$ , we can enumerate all possible values for $b^{q}$ . Let us assume that $d_{i}^{q} = 1$ for some fixed $q$ and $i$ . Consider the set ${c_{j i} : j \in [N]}$ of values in dimension $i$ . Let us sort the values of this set, and consider all midpoints between two subsequent sorted values. Then, this set of midpoints contains an optimal choice for $b^{q}$ , as we can enumerate all possible separations of values in this dimension. Hence, there are $O (N)$ possibilities to set $b^{q}$ .

If all $d_{i}^{q}$ and $b^{q}$ variables are fixed, variables $a_{j}^{k}$ become fixed as well. Hence, the remaining optimization problem is to minimize the criterion $μ_{p} (\sum_{k \in [K]} a_{1}^{k} c_{1}^{⊤} x^{k}, \dots, \sum_{k \in [K]} a_{N}^{k} c_{N}^{⊤} x^{k})$ with a fixed assignment of solutions to scenarios. We refer to these problems as subproblems. If subproblems can be solved in $O (T)$ , this means that the interpretable optimization problem can be solved in $O (n^{Q} N^{Q} T)$ .

In the special case of the Laplace decision criterion, we have that

	$μ_{p} (\sum k \in [K] a_{1}^{k} c_{1}^{⊤} x^{k}, \dots, \sum k \in [K] a_{N}^{k} c_{N}^{⊤} x^{k})$	$= \sum j \in [N] \sum k \in [K] a_{j}^{k} c_{j}^{⊤} x^{k}$
		$= \sum k \in [K] \sum j \in [N] a_{j}^{k} c_{j}^{⊤} x^{k}$
		$= \sum k \in [K] ⎛ ⎜ ⎜ ⎝ \sum i \in [n] ⎛ ⎜ ⎜ ⎝ \sum j \in [N] : a_{j}^{k} = 1 c_{j i} ⎞ ⎟ ⎟ ⎠ x_{i}^{k} ⎞ ⎟ ⎟ ⎠ .$

This means that we need to solve $K$ separate nominal problems, where the costs in each such problem are given by the sum of costs of scenarios that have been assigned to this $k \in [K]$ . Hence, the interpretable optimization problem can be decomposed to $O (n^{Q} N^{Q} K)$ nominal problems. In general, if $μ$ is decomposable (see [JBA14]), it is possible to consider $K$ separate problems to solve the subproblem.

3.3 Greedy Heuristic

The connection to decision trees as they are used in classification problems suggests the use of a greedy heuristic to solve the interpretable optimization problem. The main idea is that instead of considering all $n^{Q} N^{Q}$ possible combinations to determine splits in the decision tree, we determine the splits one level after the other.

We first enumerate all possible splits on the first level. That is, we solve a problem with $Q = 1$ by enumerating all possible values for variables $d_{i}^{1}$ and for the threshold $b^{1}$ . We hence solve $O (n N)$ subproblems in this first step.

We then fix the best of these candidate splits and consider the next level. That is, we solve a problem with $Q = 2$ , where we only need to enumerate the possible values for variable $d_{i}^{2}$ and $b^{2}$ . Again, there are $O (n N)$ subproblems to be solved.

By repeating this procedure until a desired depth $Q$ is reached, a feasible solution has been found in $O (Q n N T)$ . Note that this means the depth $Q$ has been moved from being an exponent in the full enumeration approach, to being a factor in the heuristic. Further note that by construction, the greedy heuristic is optimal for $Q = 1$ .

In case of the Laplace criterion, the first subproblem consists of solving two nominal problems, the second subproblem consists of solving four nominal subproblems, and so on. In total, we need to solve $O (n N \sum_{q = 1}^{Q} 2^{q}) = O (n N K)$ nominal problems to construct a greedy decision tree.

4 Model Extensions

In this section, we discuss possible extensions of the basic interpretable optimization formulation presented in Section 3.1.

4.1 Using Meta Data

In order to obtain explanatory rules that are applicable in practice, the input needed to evaluate the rule and retrieve the suggested solution has to be available. So far, the cost value in each dimension of the realized scenario functions as the input for the rule. The ability to retrieve this information at any point in time can be questionable and highly depends on the use case itself. But even if the cost parameters cannot be observed, scenarios—in particular in the case of real world data—often contain more information than seemingly useful for the problem at hand. For example traffic data might not only contain travel times, but also information regarding time and date and even could be augmented to contain weather data or other information that potentially has an impact on the decision: In a shortest path setting it is intuitively clear that an optimal path from the eastern to the western part of a town can depend on the daytime, the day of the week or the weather conditions. Hence, when trying to find easily comprehensible, high quality rules for future scenarios this kind of information should not be dismissed.

In order to utilize meta data, the general model proposed in Section 3 can be used by expanding the problem domain $X$ by $m$ dummy variables, where $m \in N$ is the dimension of the available meta data. Each dummy variable $x_{n + i}$ for $i \in [m]$ should be forced to zero by demanding $x_{n + i} = 0$ in $X$ in order to stop these non-cost information to affect the objective function. The entry $c_{j, n + i}$ for meta information $i$ in scenario $j$ does not contain the “cost”, but the realized value in this observed scenario, e.g. if for each scenario the day of the week is known, then $c_{j, n + i}$ is a value between 1 and 7.

While implementing this augmentation can be done in a straight-forward manner, expert knowledge of the underlying problem is useful in order to obtain concise explanatory rules. Picking up on the shortest path setting, one useful split in the decision tree could try to separate rush hour times. Let us assume that rush hour is from 6am to 9am and the time stamp of each scenario is stored in a minute-by-minute precision starting at midnight represented by the value 0 until one minute before midnight represented by the value 1439. In order to determine whether a scenario occurs within the rush hour, two decision tree levels are needed, where the first one checks whether it is after 6am, i.e. $c_{j, n + i} \geq 360$ and a second one checks $c_{j, n + i} \leq 540$ . However, if the relevance and characteristics of the rush hour are known, the separation of such scenarios could be achieved by a single split at the cost of having to pre-process the data. In the described case, one option could be to shift the time-stamp interpretation such that time stamp 0 represents scenarios at 6am, making the split $c_{j, n + i} \leq 180$ sufficient in order to identify a rush hour scenario. In fact, it might even be replaced by a 0/1 attribute to indicate if this scenario takes place within the rush hour or not. Another way could be to partition the time stamp information into eight clusters with each three hour duration. This way, instead of having a single data point “time stamp”, eight indicator values can be used to represent the clusters. Thus, 6am to 9am scenarios could be determined by splitting on the third entry, only.

4.2 Uncertainty in Constraints

So far the changing parameters only affected the quality and not the feasibility of a solution, i.e. the feasible set of solutions $X$ remained the same in all scenarios. If a scenario also affects the feasible region, each scenario $j$ induces a different feasibility set $X^{j}$ , where the $k$ -th solution $x^{k}$ only has to satisfy the feasibility sets of the scenarios $j$ it is assigned to via $a_{j}^{k}$ . We denote

X^{k, j} (a) = {\begin{matrix} R^{n} & , if% a_{j}^{k} = 0 X^{j} & , if a_{j}^{k} = 1 \end{matrix}

and demand that $x^{k} \in X^{k, j} (a)$ . Assuming that $X^{j}$ can be stated via a linear constraint system $A^{j} x \leq b^{j}$ , $x^{k} \in X^{k, j} (a)$ can be enforced via

A^{j} x^{k} \leq b^{j} + (1 - a_{j}^{k}) M \forall k \in [K], j \in [N]

with $M$ being a vector with sufficiently large entries, such that in case of $a_{j}^{k} = 0$ these constraints are trivially fulfilled. If with probability $ϵ \in [0, 1]$ a solution is allowed to not obey the respective feasibility set, additional auxiliary variables $g_{j} \in {0, 1}$ can be used with $\sum_{j \in [N]} p_{j} g_{j} \leq ϵ$ and the feasible region is given by

A^{j} x^{k} \leq b^{j} + (1 - a_{j}^{k} + g_{j}) M \forall k \in [K], j \in [N] .

4.3 Multivariate Decision Trees

With the goal of obtaining easily comprehensible explanatory rules we restricted ourselves to univariate decision trees (cf. Constraints (1c) and (1h)). Depending on the use case, multivariate decision trees might also be appropriate. A model that can be used to obtain a binary decision tree that allows more general linear splits can be formulated using the following constraints to replace (1c-1e).


$\sum i \in [n] d_{i}^{q} \leq P$	$\forall q \in [Q]$	(3a)
$\| ℓ_{i}^{q} \| \leq M_{ℓ} d_{i}^{q}$	$\forall q \in [Q]$	(3b)
$\sum i \in [n] \| ℓ_{i}^{q} \| = 1$	$\forall q \in [Q]$	(3c)
$\sum i \in [n] c_{j i} ℓ_{i}^{q} \leq b^{q} + M \sum k \in S_{q} a_{j}^{k}$	$\forall q \in [Q], j \in [N]$	(3d)
$b^{q} + ϵ - M (1 - \sum k \in S_{q} a_{j}^{k}) \leq \sum i \in [n] c_{j i} ℓ_{i}^{q}$	$\forall q \in [Q], j \in [N]$	(3e)
$ℓ_{i}^{q} \in R$	$\forall i \in [n], q \in [Q]$	(3f)

In order to ensure comprehensibility of the resulting decision tree we uphold in (3a) and (3b) that at most $P \in N$ non-zero entries are allowed in each $ℓ^{q}$ , i.e., we still aim for sparse rules. Furthermore, we use a normalization constraint $\sum_{i \in [n]} | ℓ_{i}^{q} | = 1$ to mitigate numerical issue arising from the use of $ϵ$ and $M$ in the same constraints. Only allowing integer coefficients can further reduce numerical issues, to the expense of the model’s computability.

When using the enumeration approach or the greedy heuristic for multivariate decision trees, more combinations need to be taken into account. If $P$ is a constant, we can still enumerate all possibilities for binary variables $d_{i}^{q}$ , of which at most $P$ can be active. Values for $ℓ_{i}^{q}$ and $b^{q}$ can then be enumerated by considering all $P$ -dimensional hyperplanes defined through the costs $c_{j i}$ . However, such approaches become computationally too expensive with higher values of $P$ .

4.4 Decision Trees with Varying Splitting Criteria depending on the Subtrees

So far the splits used in the decision tree were the same at every level of the tree. This was done in order to maintain the highest possible level of comprehensibility as in this setting the same $Q$ split criteria have to be checked regardless of the scenario. However, having different splits depending on the current region reached while traversing the decision tree can also be of interest, as solutions with higher quality are obtainable with only a slight decrease of comprehensibility. Having the number $K$ of considered solutions and thus the number of leaves of the decision tree fixed, we now have to decide on the splitting criterion in each of the $K - 1$ internal nodes. In order for one leaf to be reached, the splits of the $Q$ nodes traversed to reach this leaf have to be set accordingly.

Following the notation as in [BD17] we denote as $A_{L} (k)$ the set of ancestors of the leaf representing solution $k$ whose left branch has to be chosen in order to reach this leaf, and $A_{R} (k)$ are the remaining nodes on the path from the root to this leaf. By keeping the notation that $a_{j}^{k}$ indicates whether solution $k$ is chosen in scenario $j$ , but changing $q$ to the index of an internal node rather than the level of the splitting node, Constraints (1d) and (1e) (or for the multivariate case Constraints (3d) and (3e)) have to be replaced by the following two constraints


$\sum i \in [n] c_{i}^{j} p_{i}^{q} \leq b^{q} + M (1 - a_{j}^{k})$	$\forall k \in [K], q \in A_{L} (k), j \in [N]$	(4a)
$b^{q} + ϵ \leq \sum i \in [n] c_{i}^{j} p_{i}^{q} + M (1 - a_{j}^{k})$	$\forall k \in [K], q \in A_{R} (k), j \in [N] .$	(4b)

With Constraint (1b) each scenario is associated with exactly one leaf and with the above constraints the splitting criteria along the path leading from the root to this leaf have to be met.

Using such decision trees in the enumeration method or the greedy heuristic is directly possible, but means that more variable combinations have to be enumerated. In case of the enumeration method, $O (Q^{Q} n^{Q} N^{Q})$ subproblems need to be solved, while the greedy heuristic needs to consider $O (K n N)$ subproblems.

4.5 Preparation Costs

In the proposed model it is assumed that the costs of implementing solution $k$ in scenario $j$ are represented via $c_{j}^{⊤} x^{k}$ . However, providing the option of being able to implement solution $x^{k}$ might also be associated with costs that are independent of the scenario realization. The existence of first-stage costs is a common feature in multi-stage robust and stochastic optimization. For example, in disaster management preparatory actions are necessary in order to be able to eventually deploy a specific emergency response plan. In order to reflect this option in the model we enlarge the domain to contain a variable vector $y \in R^{π}$ that is associated to preparatory actions that have to be adequate for all potential future solutions $x^{k}$ . In particular for all solutions $k$ the pair $(y, x^{k})$ has to part of the new domain $¯ X \subseteq R^{π} \times X$ , ensuring that the performed preparations enable the realization of solution $x^{k}$ . By adding the cost term ${^c}^{⊤} y$ to the objective function and modeling the connection between preparation and implementation in $¯ X$ the original model can easily be adapted to deal with such a factor.

5 Computational Experiments

We present three experiments to address the following questions:

How does the performance of the greedy heuristic compare to an optimal solution for the interpretable optimization problem?
How large is the loss of performance when using the explanatory rule instead of unrestricted solution assignments in the min-sum-min setting?
How does our interpretable approach perform on real-data?

All code is implemented in C++ using Cplex version 12.8 to solve mixed-integer programs. We use a compute server running Ubuntu 18.04 with ten Intel Xeon Gold 5220 CPUs running at 2.20GHz. Each process was restricted to a single thread. In all experiments, we focus on the sum of objective values as the decision criterion. Code and data are available online on GitHub¹¹1https://github.com/goerigk/explainable-opt-code.git.

5.1 Experiment 1

To compare the performance of the optimal interpretable solution found by solving the mixed-integer programming formulation proposed in Section 3.1 with the performance of the greedy heuristic from Section 3.3, we generate shortest path problems on grid graphs. We create small quadratic grids of $5 \times 5$ nodes. Each node is connected to its up to four horizontal and vertical neighbors, i.e., there are 40 edges. Edges are oriented upwards and to the right. The origin is placed on the bottom left corner, while the destination is placed on the top right corner. To create scenarios, we generate fives types $t$ , each given by its own midpoint $μ_{e}^{t}$ for each edge $e$ and a deviation range $δ_{e}^{t}$ . Midpoints are sampled uniformly i.i.d. from $[10, 30]$ and deviations from $[0, 0.25]$ . To generate a scenario, we first choose a type $t$ and then sample each edge costs uniformly i.i.d. from $[(1 - δ_{e}^{t}) μ_{e}^{t}, (1 + δ_{e}^{t}) μ_{e}^{t}]$ . This means that each problem instance contains five types of scenarios, which enables us to test the degree to which optimization models are able to make use of this structure. We consider instances with ${5, 6, \dots, 20}$ scenarios for training and 1000 scenarios for testing. For each number of scenarios, 100 instances are created (a total of $16 \times 100 = 1600$ instances).

For each instance, we solve the optimization model (1) (denoted as IP2 and IP4 when using $K = 2$ and $K = 4$ solutions, respectively) and we run the greedy heuristic (denoted as G2 and G4). We compare the solution time of each approach, as well as the performance on training and test instances of the resulting explanatory rules. We measure performance in the following way. Let $f (ξ)$ denote the objective value of the solution found by some method in scenario $ξ$ . Let $f^{n o m} (ξ)$ denote the objective value of the nominal solution, i.e., the solution that minimizes the sum (or average) of objective values over all training scenarios. Finally, let $f^{o p t} (ξ)$ denote the optimal objective value for scenario $ξ$ . The performance of $f$ in scenario $ξ$ is then calculated as

1 - \frac{f (ξ) - f^{o p t} (ξ)}{f^{n o m} (ξ) - f^{o p t} (ξ)}

which means that a value of 100% gives an optimal solution in scenario $ξ$ , whereas a value of 0% means that there is no improvement over the nominal solution. We consider the average of these performance values over all scenarios.

Figure 2 shows the average performance for training scenarios and test scenarios, for different numbers of training scenarios. Note that it can be expected that the performance on training scenarios has a tendency to decrease with more training scenarios, as more scenarios need to be considered simultaneously. At the same time, more training scenarios should lead to a better performance on test scenarios. Our results confirm this behavior.

Experiment 1, average performance values for varying numbers of training scenarios. — (a) Training instances.

In Figure 2(a), G2 and IP2 have identical performance, and thus both lines are on top of each other. This is to be expected, as the greedy heuristic places a single decision in an optimal way. Comparing G4 and IP4, there is a performance difference of about 5 percentage points on the training data.

On test instances (see Figure 2(b)), it is possible that G2 and IP2 have different performance values, as optimal solutions are not necessarily unique. There is no clear preference of one method over the other. We note that IP4 again outperforms G4, but the difference between both methods has become smaller. In some cases, G4 even shows a slightly better average performance than IP4. Overall, the greedy heuristic gives a reasonable approximation of the optimal solution on these instances.

Compare this relatively small loss in performance with the improvements in median computation times, as shown in Figure 3. Note the logarithmic vertical axis.

Figure 3: Experiment 1, median computation times for varying numbers of training scenarios.

While the greedy method scales well with an increased number of scenarios, this is not the case for the integer program. For 20 scenarios (the largest problems considered in this experiment), IP4 is around two orders of magnitude slower than G4. Also note that plotting median times benefits the IP methods, as they are more prone to outliers than the greedy methods. In summary, the greedy method gives a good trade-off between small computation times and larger performance values, and will thus be the method we consider in our subsequent experiments.

5.2 Experiment 2

In this second experiment, we consider the possible disadvantage resulting from using explanatory rules over formulations that do not consider this restriction. To this end, we use the greedy heuristic to find explanatory rules, and compare with the performance of the min-sum-min model with two (MSM2) and with four solutions (MSM4). Performance is measured in the same way as in Experiment 1. We again use grid graphs, but consider larger instances with 10 times 10 nodes (180 edges) and a number of training scenarios in the range ${10, 15, \dots, 50}$ . We generate 100 instances for each number of training scenarios (a total of $9 \times 100 = 900$ instances). For each instance, we first run the greedy heuristics G2 and G4 and then assign the computation times as time limits for MSM2 and MSM4, respectively. While this may create a bias towards the greedy methods (which have exactly as much computation time as they need), this approach avoids setting an arbitrary time limit for the MSM methods.

In Figure 4, we present the average performance on training and on test instances. G4 achieves nearly the same performance of MSM4 on training instances (see Figure 4(a)), with a performance difference of about one to two percentage points. Interestingly, G2 even outperforms MSM2, which is only possible due to the time limit applied to the latter method.

Experiment 2, average performance values for varying numbers of training scenarios. — (a) Training instances.

On test instances (see Figure 4(b)), we note that the usage of our explanatory rules show a drop in performance for small sizes of training sets. This indicates that the rules learned on these sets do not yet generalize well to other scenarios. As the number of scenarios in the training set increases, the performance of greedy again comes close to that of MSM. While G2 reaches the same performance as MSM2, the difference between G4 and MSM4 remains at around three percentage points.

We show median computation times of the greedy methods in Figure 5. As predicted in our analysis, there is a clear linear relationship between the number of scenarios in the training set and computation time. Also note that running G4 takes about three times as long as running G2 (note that G4 needs to first run G2, and then run an additional iteration where four nominal problems need to be solved for every split candidate instead of two problems, which results in three times the effort of G2).

Figure 5: Experiment 2, median computation times for varying numbers of training scenarios.

Summarizing these results, we find that explanatory rules only lead to a small decrease in performance compared to general decision rules on the instances we considered, in particular, if sufficient training scenarios are available. At the same time, the linear solution time of the greedy method means that our approach scales well in problem size.

5.3 Experiment 3

In this third experiment, we consider real-world data collected in [CDG19]. The graph models the road network of Chicago with 538 nodes and 1308 edges. Travel speeds were taken from a live traffic interface over a 46-day time horizon. The dataset we use contains all measurements that were made (a total of 4363 scenarios). These are split by using one half for training, and one half for testing. We generate 100 random source-sink pairs. To avoid path problems where both nodes are too close to each other, we generate a nominal path for each source-sink candidate, and repeat the sample while the number of edges in the nominal path is less or equal to 50.

To reduce computation times, we randomly skip checking each split candidate on the existing 1308 edges with 95% probability, which means that the greedy heuristic is sped up by factor 20. Additionally, we extend the instance by two artificial edges to capture meta data on scenarios. These edges loop into an additional, separate node, so that they are never part of a path. On the first edge, we use the day of the week of a scenarios as its lengths, encoded as integer numbers from one (Monday) to seven (Sunday). On the second edge, we use the time within the day, encoded as seconds since midnight, i.e., an integer from 0 to 86399.

We use the greedy heuristic to find explanatory rules using all 1310 edges, and also using only the two edges carrying meta data. The median computation time for the first setting is 1209.5 seconds (two solutions) and 3450.5 seconds (four solutions), while the second setting results in median computation times of 126.4 seconds (two solutions) and 363.1 seconds, respectively.

	all edges		meta only
Dataset	G2	G4	G2	G4
Training	14.0%	20.8%	2.8%	3.8%
Test	12.1%	17.6%	1.8%	2.7%

Table 4: Experiment 3, average performance values.

In Table 4, we show the average performance for both settings. While it is possible to slightly improve upon the nominal solution using only meta data, this improvement is small. Significantly larger improvements are possible when using the full data of the graph. Indeed, if we consider the first splitting decisions of this approach (i.e., the rule that G2 finds), only 1 in 100 runs use the day of the week (split between Monday to Saturday and Sunday), and 8 in 100 runs use the time of the day. In Figure 6, we highlight which edges were split on in the remaining 91 cases. We note that in particular the edges along the shore of Lake Michigan in the north-east are used for this purpose. A possible explanation is that these edges see continuous high-density traffic, which makes them reliable predictors to identify the traffic scenario.

Figure 6: Experiment 3, edges that were used in explanatory rules.

If we only consider the meta data, there are five cases where G2 splits on the day (once between Friday and Saturday, four times between Saturday and Sunday). The remaining splits are made on the time of the day. In Figure 7, we present a histogram of these decisions, where each bin represents an hour of the day. The majority of splits cut off few scenarios on the boundary of the day, in particular in the first and in the last hour. This underlines the small improvements that these rules give, see Table 4. Traffic speed is highly volatile, and the data represents snapshots at particular points in time, rather than averages over larger time horizons. Hence, explanatory rules such as ”before 7am” and ”after 7am” do not perform as well as one might intuitively expect.

Figure 7: Experiment 3, histogram of threshold values for splits on the time dimension.

5.4 Discussion

Our computational experiments have been designed to answer three questions. The first is concerned with a comparison of the greedy heuristic and the mixed-integer programming formulation to find explanatory rules in form of a decision tree. Recent research on decision trees for classification models (see [BD17]) reports promising results when constructing optimal decision trees. In our setting, computation times quickly increase to become prohibitive for larger problems. The greedy heuristic scales well (linearly in the number of scenarios) and is a viable choice also for larger optimization problems. At the same time, the quality of solutions found by the heuristic is close to that of exact solutions, which makes the heuristic the preferred solution method for the remaining experiments.

The second question considers the loss that occurs when we use an explanatory rule instead of an arbitrary rule to make decisions. Our experiments show that on training data, there is little difference between both approaches. This is different on test data, where an insufficient training size leads to interpretable solutions that do not perform well. With more training data available, we can again observe that explanatory rules show a performance that is very close to more general rules. Hence, these results seem to encourage the use of explanatory rules.

Finally, we consider the performance of our approach on real-world data. While the previous experiments constructed scenarios is a way that there exists inherent structure that can be exploited, this is not necessarily the case for the considered data set. Indeed, improvements over nominal solutions towards the lower bound are significantly smaller, but still useful with up to around 20%. It seems that particularly high-density edges give reliable predictions which of the candidate solutions to use.

The results presented here are, naturally, only valid within the scope of the setup we used. At the same time, none of our methods were specifically designed for the shortest path problem, which underlies all experiments. Hence, we can expect to find a similar performance of our approach for different optimization problems.

6 Conclusions

While immense progress has been made towards finding optimal solutions to decision making problems efficiently, even the best solution is only useful if it can be applied in practice and is accepted by the stakeholders. A major barrier in this process is the black-box nature of optimization solvers. Interpretability and explainability, concepts that have become central in artificial intelligence, have seen significantly less attention in operations research and management science.

In this paper, we propose an inherently interpretable optimization model to derive solutions that come with an explanatory rule, under which circumstances which solution should be implemented. We focused on decision tree based rules, which are easily comprehensible and applicable.

To find such solutions and explanatory rules, a mixed-integer programming formulation can be used. For decision trees with small depth, it is also viable to enumerate candidate solutions. Finally, it is possible to construct heuristic rules greedily. This basic model can be easily extended to include a range of further features.

Our experiments indicate that the greedy heuristic can be used to find high-quality solution in reasonable time. Moreover, it is possible to include any available nominal solution algorithm in the process, which improves the applicability of our method. We found that the loss of using explanatory rules instead of complex, but incomprehensible rules is in the range of only a few percentage points.

As few methods for interpretable optimization exist, there are many avenues for further research that should be explored. In this paper, we focussed on decision trees as explanatory rules. Other models can be used instead, such as point-based systems, where a decision maker only needs to calculate a score by summing up points for each feature that is checked (i.e., using a preferably sparse perceptron). Furthermore, these methods should be tested in practice with decision makers to see which rules find acceptance. Finally, we considered a setting where a set of candidate solutions is found to cover any potential optimization instance that we may encounter. Fundamentally different approaches are conceivable, for example, we might be interested to produce just one solution for one particular problem, but this solution is required to have a simple and communicable structure. Imagine a car driver who asks for directions. He or she would be interested not necessarily in the fastest path to reach the destination, but rather in a path that is easy to find, i.e., a path with few turns that mostly uses major roads. e)

References

[ABG $^{+}$ 17] Katie Atkinson, Pietro Baroni, Massimiliano Giacomin, Anthony Hunter, Henry Prakken, Chris Reed, Guillermo Simari, Matthias Thimm, and Serena Villata. Towards artificial argumentation. AI Magazine, 38(3):25–36, 2017.
[ACC $^{+}$ 06] Vicky Arnold, Nicole Clark, Philip A Collier, Stewart A Leech, and Steve G Sutton. The differential use and effect of knowledge-based system explanations in novice and expert judgment decisions. MIS Quarterly: Management Information Systems, pages 79–97, 2006.
[ADRDS $^{+}$ 20] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, 2020.
[AP09] Leila Amgoud and Henri Prade. Using arguments for making and explaining decisions. Artificial Intelligence, 173(3-4):413–436, 2009.
[ARL18] Janna Anderson, Lee Rainie, and Alex Luchsinger. Artificial intelligence and the future of humans. Pew Research Center, 10:12, 2018.
[Arz21] Vadim Arzamasov. Comprehensible and Robust Knowledge Discovery from Small Datasets. PhD thesis, Karlsruher Institut für Technologie (KIT), 2021.
[BBDCF11] Rodrigo Coelho Barros, Márcio Porto Basgalupp, Andre CPLF De Carvalho, and Alex A Freitas. A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(3):291–312, 2011.
[BC10] Dimitris Bertsimas and Constantine Caramanis. Finite adaptability in multistage linear optimization. IEEE Transactions on Automatic Control, 55(12):2751–2766, 2010.
[BD17] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.
[BDM19] Dimitris Bertsimas, Jack Dunn, and Nishanth Mundru. Optimal prescriptive trees. INFORMS Journal on Optimization, 1(2):164–183, 2019.
[BFOS17] Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and Regression Trees. Routledge, 2017.
[BG96] Blai Bonet and Hector Geffner. Arguing for decisions: a qualitative model of decision making. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pages 98–105, 1996.
[BK17] Christoph Buchheim and Jannis Kurtz. Min–max–min robust combinatorial optimization. Mathematical Programming, 163(1):1–23, 2017.
[BK20] Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. Management Science, 66(3):1025–1044, 2020.
[Bou94] Craig Boutilier. Toward a logic for qualitative decision theory. In Principles of Knowledge Representation and Reasoning, pages 75–86. Elsevier, 1994.
[BOW21] Dimitris Bertsimas, Agni Orfanoudaki, and Holly Wiberg. Interpretable clustering: an optimization approach. Machine Learning, 110(1):89–138, 2021.
[BP19] Christoph Buchheim and Jonas Pruente. K-adaptability in stochastic combinatorial optimization under objective uncertainty. European Journal of Operational Research, 277(3):953–963, 2019.
[BSH $^{+}$ 10] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. The Journal of Machine Learning Research, 11:1803–1831, 2010.
[CDG19] André Chassein, Trivikram Dokka, and Marc Goerigk. Algorithms and uncertainty sets for data-driven robust shortest path problems. European Journal of Operational Research, 274(2):671–686, 2019.
[CGMS21] Salvatore Corrente, Salvatore Greco, Benedetto Matarazzo, and Roman Slowinski. Explainable interactive evolutionary multiobjective optimization. Available at SSRN 3792994, 2021.
[ČLL21] Kristijonas Čyras, Myles Lee, and Dimitrios Letsios. Schedule explainer: An argumentation-supported tool for interactive explanations in makespan scheduling. In International Workshop on Explainable, Transparent Autonomous Agents and Multi-Agent Systems, pages 243–259. Springer, 2021.
[ČLMT19] Kristijonas Čyras, Dimitrios Letsios, Ruth Misener, and Francesca Toni. Argumentation for explainable scheduling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2752–2759, 2019.
[CM22] Dragos Florin Ciocan and Velibor V. Mišić. Interpretable optimal stopping. Management Science, 68(3):1616–1638, 2022.
[CMP19] Anna Collins, Daniele Magazzeni, and Simon Parsons. Towards an argumentation-based approach to explainable planning. In ICAPS 2019 Workshop XAIP Program Chairs, 2019.
[ČRA $^{+}$ 21] Kristijonas Čyras, Antonio Rago, Emanuele Albini, Pietro Baroni, and Francesca Toni. Argumentative XAI: a survey. arXiv preprint arXiv:2105.11266, 2021.
[CS95] Mark Craven and Jude Shavlik. Extracting tree-structured representations of trained networks. Advances in Neural Information Processing Systems, 8, 1995.
[CSGK19] Tathagata Chakraborti, Sarath Sreedharan, Sachin Grover, and Subbarao Kambhampati. Plan explanations as model reconciliation–an empirical study. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 258–266. IEEE, 2019.
[EK21] Martin Erwig and Prashant Kumar. Explainable dynamic programming. Journal of Functional Programming, 31, 2021.
[Fre14] Alex A Freitas. Comprehensible classification models: a position paper. ACM SIGKDD Explorations Newsletter, 15(1):1–10, 2014.
[GBY $^{+}$ 18] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE, 2018.
[GF17] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a ”right to explanation”. AI Magazine, 38(3):50–57, 2017.
[GKW19] Angelos Georghiou, Daniel Kuhn, and Wolfram Wiesemann. The decision rule approach to optimization under uncertainty: methodology and applications. Computational Management Science, 16(4):545–576, 2019.
[Gle16] Michael Gleicher. A framework for considering comprehensibility in modeling. Big Data, 4(2):75–88, 2016.
[GMR $^{+}$ 18] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):1–42, 2018.
[HDM $^{+}$ 11] Johan Huysmans, Karel Dejaeger, Christophe Mues, Jan Vanthienen, and Bart Baesens. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems, 51(1):141–154, 2011.
[HKW15] Grani A Hanasusanto, Daniel Kuhn, and Wolfram Wiesemann. K-adaptability in two-stage robust binary programming. Operations Research, 63(4):877–891, 2015.
[JBA14] Paulo Jesus, Carlos Baquero, and Paulo Sérgio Almeida. A survey of distributed data aggregation algorithms. IEEE Communications Surveys & Tutorials, 17(1):381–404, 2014.
[KDBL $^{+}$ 09] Ujwal Kayande, Arnaud De Bruyn, Gary L Lilien, Arvind Rangaswamy, and Gerrit H Van Bruggen. How incorporating feedback mechanisms in a DSS affects DSS evaluations. Information Systems Research, 20(4):527–546, 2009.
[LL17] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.
[LR76] Hyafil Laurent and Ronald L Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.
[LSG $^{+}$ 19] Jean-Baptiste Lamy, Boomadevi Sekar, Gilles Guezennec, Jacques Bouaud, and Brigitte Séroussi. Explainable artificial intelligence for breast cancer: A visual case-based reasoning approach. Artificial Intelligence in Medicine, 94:42–53, 2019.
[MB00] Ji-Ye Mao and Izak Benbasat. The use of explanations in knowledge-based systems: Cognitive perspectives and a process-tracing analysis. Journal of Management Information Systems, 17(2):153–179, 2000.
[MBVGV07] David Martens, Bart Baesens, Tony Van Gestel, and Jan Vanthienen. Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research, 183(3):1466–1476, 2007.
[MIBD02] Bernard Moulin, Hengameh Irandoust, Micheline Bélanger, and Gaëlle Desbordes. Explanation and argumentation capabilities: Towards the creation of more persuasive agents. Artificial Intelligence Review, 17(3):169–222, 2002.
[Mil19] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.
[Mil21] Tim Miller. Contrastive explanation: A structural-model approach. The Knowledge Engineering Review, 36, 2021.
[MMP22] Enrico Malaguti, Michele Monaci, and Jonas Pruente. K-adaptability in stochastic optimization. Mathematical Programming, pages 1–29, 2022.
[MP14] David Martens and Foster Provost. Explaining data-driven document classifications. MIS Quarterly: Management Information Systems, 38(1):73–100, 2014.
[MS14] Linjuan Rita Men and Don Stacks. The effects of authentic leadership on strategic internal communication and employee-organization relationships. Journal of Public Relations Research, 26(4):301–324, 2014.
[ODV20] Nir Oren, Kees van Deemter, and Wamberto W Vasconcelos. Argument-based plan explanation. In Knowledge Engineering Tools and Techniques for AI Planning, pages 173–188. Springer, 2020.
[Pau93] Gabriele Paul. Approaches to abductive reasoning: an overview. Artificial Intelligence Review, 7(2):109–152, 1993.
[Pra05] Andrea Prat. The wrong kind of transparency. American Economic Review, 95(3):862–877, 2005.
[PSLB15] Andrew M Parker, Sinduja V Srinivasan, Robert J Lempert, and Sandra H Berry. Evaluating simulation-derived scenarios for effective decision support. Technological Forecasting and Social Change, 91:64–77, 2015.
[Qui86] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[Raw08] Brad L Rawlins. Measuring the relationship between organizational transparency and employee trust. Public Relations Journal, 2(2), 2008.
[RSG16] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[Rud19] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
[SCDS77] A Carlisle Scott, William J Clancey, Randall Davis, and Edward H Shortliffe. Explanation capabilities of production-based consultation systems. American Journal of Computational Linguistics, 62, 1977.
[SGW20] Anirudh Subramanyam, Chrysanthos E Gounaris, and Wolfram Wiesemann. K-adaptability in two-stage mixed-integer robust optimization. Mathematical Programming Computation, 12(2):193–224, 2020.
[SL97] Rudy Setiono and Huan Liu. Neurolinear: From neural networks to oblique decision rules. Neurocomputing, 17(1):1–24, 1997.
[SMF21] Franz Strich, Anne-Sophie Mayer, and Marina Fiedler. What do I do in a world of artificial intelligence? Investigating the impact of substitutive decision-making AI systems on employees’ professional role identity. Journal of the Association for Information Systems, 22(2):9, 2021.
[SNF21] Julian Senoner, Torbjørn Netland, and Stefan Feuerriegel. Using explainable artificial intelligence to improve process quality: Evidence from semiconductor manufacturing. Management Science, 2021.
[SPVR01] Ale Smidts, Ad Th H Pruyn, and Cees BM Van Riel. The impact of employee communication and perceived external prestige on organizational identification. Academy of Management Journal, 44(5):1051–1062, 2001.
[SSG18] Roykrong Sukkerd, Reid Simmons, and David Garlan. Toward explainable multi-objective probabilistic planning. In 2018 IEEE/ACM 4th International Workshop on Software Engineering for Smart Cyber-Physical Systems (SEsCPS), pages 19–25. IEEE, 2018.
[Swa85] William R Swartout. Explaining and justifying expert consulting programs. In Computer-Assisted Medical Decision Making, pages 254–271. Springer, 1985.
[SYX $^{+}$ 20] Swati Sachan, Jian-Bo Yang, Dong-Ling Xu, David Eraso Benavides, and Yang Li. An explainable AI decision-support-system to automate loan underwriting. Expert Systems with Applications, 144:113100, 2020.
[UR16] Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, 2016.
[VGY20] Phebe Vayanos, Angelos Georghiou, and Han Yu. Robust optimization with decision-dependent information discovery. arXiv preprint arXiv:2004.08490, 2020.
[WMR17] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.
[WXW18] Weiquan Wang, Jingjun (David) Xu, and May Wang. Effects of recommendation neutrality and sponsorship disclosure on trust vs. distrust in online recommendation agents: Moderating role of explanations for organic recommendations. Management Science, 64(11):5198–5219, 2018.
[YJ95] L Richard Ye and Paul E Johnson. The impact of explanation facilities on user acceptance of expert systems advice. MIS Quarterly: Management Information Systems, 19(2):157–172, 1995.
[YMF19] Cen April Yue, Linjuan Rita Men, and Mary Ann Ferguson. Bridging transformational leadership, transparent communication, and employee openness to change: The mediating role of trust. Public Relations Review, 45(3):101779, 2019.