Classifying with Uncertain Data Envelopment Analysis

Casey Garner

^{a †}

and Allen Holder

^{b^{*} †}

^{a}

Department of Mathematics, University of Minnesota, Minneapolis, MN, USA, (garne214@umn.edu)

^{b}

Department of Mathematics, Rose-Hulman Institute of Technology, Terre Haute, IN, USA (holder@rose-hulman.edu)

^{*}

Corresponding author

^{†}

authors listed alphabetically and contributed equally, research conducted at Rose-Hulman Institute of Technology

Abstract

Classifications organize entities into categories that identify similarities within a category and discern dissimilarities among categories, and they powerfully classify information in support of analysis. We propose a new classification scheme premised on the reality of imperfect data. Our computational model uses uncertain data envelopment analysis to define a classification’s proximity to equitable efficiency, which is an aggregate measure of intra-similarity within a classification’s categories. Our classification process has two overriding computational challenges, those being a loss of convexity and a combinatorially explosive search space. We overcome the first by establishing lower and upper bounds on the proximity value, and then by searching this range with a first-order algorithm. We overcome the second by adapting the p-median problem to initiate our exploration, and by then employing an iterative neighborhood search to finalize a classification. We conclude by classifying the thirty stocks in the Dow Jones Industrial average into performant tiers and by classifying prostate treatments into clinically effectual categories.

Keywords: Data Envelopment Analysis, Robust Optimization,
Classification, Clustering

1 Motivation and Introduction

Classifications are ubiquitous and essential constructs across the human experience, a reality motivated by the fact that we learn through association and disassociation. Indeed, our thoughts are replete with comparisons and contrasts among known and novel aspects of our understanding, and we routinely toil to organize and group information into like collections for myriad purposes. This overriding narrative to categorize information manifests itself in the joys of life, say by classifying literature, art, food, and sport; it promotes scientific progress by delineating the fields of study and by interpreting the results of experimentation; and it aids innovation in fields like engineering, healthcare, and politics as we study how the practical benefits of basic research can promote the general welfare.

The classification concept has not escaped mathematical and practical study, and the process of dividing entities into categories has a long and fruitful history. The literature on how to classify is consequently substantial, well studied, broad in application, and methodologically diverse as illustrated by the following compendious list.

decision trees [34]	integer programming [3]
neural networks [35]	robust statistics [11]
Bayesian networks [25]	discriminant analysis [15]
graph clustering [30]	data mining [23]

[]

Additional classification techniques, along with numerous citations, are reviewed in [6, 17, 19, 29]. This disciplinary range challenges having a categorical language that would promote shared efforts across disciplines, and while all of mathematics, computer science, statistics, probability, data science, operations research, and machine learning dabble with similar classification problems, they each scope their problems in terms apropos for their specific tasks, see [17, 20] for related comments. We use the term classification diffusely to simply mean that entities are partitioned into categories, and we refrain from related terms like clustering and grouping that have more refined definitions in some settings.

We propose a classification technique motivated by uncertain data envelopment analysis (uDEA) that scores a classification by how intra-equitable its categories are. Our fundamental assumptions are that:

classifications of the same entities are to be compared on a
common set of characteristics, and
the characteristics are imperfect.

The assumption of imperfect characteristics is surely an anticipated general rule and not a rare oddity; after all, a lack of preciseness taints everything from scientific experimentation, to expert opinion, and to rule based delineation. We do not make any stochastic assumption, and hence, we bypass all hassle of verifying distributional suppositions. That said, we can interpret our model stochastically if such an interpretation appropriately aids analysis. We instead use the modern concept of uncertainty associated with robust optimization to account for imperfect information, and our model builds from the work in [8] to leverage uncertainty toward the goal of establishing equity within a classification’s categories. We comment that data envelopment analysis (DEA) is not new to the processes of clustering and classification, see [4, 26, 32] as illustrative examples.

Sections 2 and 3 define the proximity to equitable efficiency and give an algorithm to approximate its value. We present our classification scheme in Section 4, and we review examples in finance and medicine in Section 5. Note that all norms are $2$ -norms unless otherwise noted, although we explicitly denote all norms in the proof of Theorem 1 to avoid confusion.

2 Proximity to Equitable Efficiency

We define an assessment metric that models the quality of a classification. Assume $O = {o^{t} : t = 1, 2, . . . T}$ is a finite set of objects that are to be partitioned into the nonempty disjoint categories $C^{s}$ , for $s = 1, 2, . . ., S$ . So each $o^{t}$ is in some $C^{s}$ , and $C = {C^{s} : s = 1, 2, . . ., S}$ partitions $O$ . We assess a classification $C$ by defining a measure of the intra-similarity of the elements within the individual categories $C^{s}$ , and the classification’s overall quality then equates to the sum of its categorical values.

Characteristics describe individual objects, and we assume that a classification’s quality coincides with some of these characteristics being small and some of them being large. Our language here agrees with the DEA literature, see [5, 10, 16, 22] as reviews, and characteristics whose decrease signifies improvement are called input-characteristics, or more simply inputs, and those whose increase signifies improvement are called output-characteristics, or more simply outputs. We let

\begin{matrix} X_{i t}^{s} & be the value of the i -th % input-characteristic of object t in category C^{s}, and Y_{i t}^{s} & be the value of the i -th output-% characteristic of object t in category C^{s} . \end{matrix}

We assume there are $N$ inputs and $M$ outputs, making $X^{s}$ an $N \times | C^{s} |$ matrix and $Y^{s}$ an $M \times | C^{s} |$ matrix.

We employ the concept of efficiency from DEA, and the objects within a category are our decision making units (DMUs). An object’s efficiency within its category is measured as a ratio of weighted sums, with the numerator being an aggregate of the output-characteristics and the denominator being an aggregate of the input-characteristics. The primary goal of DEA is to decide nonnegative weighting parameters to maximize an object’s efficiency while maintaining that each object’s efficiency is no greater than one with the same weights. There are several standard variants of this central theme, and we select the canonical input oriented model with variable returns to scale [5]. We specifically chose to calculate efficiency with the dual of the LP that maximizes the efficiency score for object $t$ , which is

E^{t} = min {c^{T} η : A^{t} η \leq 0, B η = e, η \geq 0},

(1)

provided that

A^{t} = [\begin{matrix} - Y^{s} & y^{t} & 0 X^{s} & 0 & - x^{t} \end{matrix}], B = [\begin{matrix} e^{T} & 0 & 0 0 & 1 & 0 \end{matrix}], and c = ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ \begin{matrix} 00 ⋮ 01 \end{matrix} ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠,

with $o^{t} \in C^{s}$ and $x^{t}$ and $y^{t}$ being the respective $t$ -columns of $X^{s}$ and $Y^{s}$ . The all ones column vector is $e$ , with its length being decided by the context of its use. Note that

^ηT=(0,0,…,0,1,0,…,0,1,1)position t \raisebox4.0pt$↑$

(2)

is always feasible, and hence, $E^{t} \leq 1$ . The value of $E^{t}$ is the efficiency score of object $t$ relative to its category, and if this score is less than one, then object $t$ is inefficient in its category. Inefficiency means that another object’s efficiency score within the category bests the efficiency score of object $t$ no matter how the inputs and outputs are weighted.

Our assumption of uncertain information indicates that $E^{t}$ is likely imprecise and that an inefficiency could be the result of inaccurate data. The uDEA problem in [8] addresses this concern by acknowledging and modeling uncertainty and by recasting the efficiency score as a robust linear program. The solution of the robust linear program maximizes an object’s efficiency score by selecting a best collection of data from among the uncertain possibilities. Thus an inefficient object based on the original data might become efficient with a prudent and reasonable selection of new data from a set of uncertain options. We note that several modes of robust DEA are now proposed and that applications are increasing, see [24] and its bibliography as a review, as well as the more recent publications [28, 33].

We use an ellipsoidal model of uncertainty for each row of $A^{t}$ , and denoting the $i$ -th row of $A^{t}$ by $A_{i}^{t}$ , we set

U_{i}^{t} (σ_{i}) = {A_{i}^{t} + σ_{i} u^{T} R_{i} : | | u | |_{2} \leq 1} .

The scalar $σ_{i}$ is assumed to be nonnegative, and the matrix $R_{i}$ has $| C^{s} | + 2$ columns along with a compatible number of rows to align with the dimension of the vector $u$ . The matrix $R_{i}$ defines the structure of uncertainty, whereas $σ_{i}$ scales this structure to shrink or enlarge the amount of uncertainty.

The format of $A^{t}$ imposes a concomitant format on $R_{i}$ . The last two columns of $A^{t}$ duplicate the input- and output-characteristics of object $t$ , and $R_{i}$ must maintain this format. Assume for illustrative purposes that $t = 2 \in C^{s}$ corresponds with the second column of data in $X^{s}$ and $Y^{s}$ . Then, if

X^{s} = [\begin{matrix} 1.2, & 0.8 \end{matrix}] and Y^{s} = [\begin{matrix} 0.9 & 0.5 0.6 & 0.7 \end{matrix}],

we have

A2=⎡⎢ ⎢⎣−0.9−0.50.5\omit0−0.6−0.70.7\omit01.20.8\omit0−0.8⎤⎥ ⎥⎦.

Suppose our model of uncertainty for the inputs assumes a $2 \times 2$ identity with regard to $X^{s}$ . Such diagonal models are commonly motivated by probabilistic assumptions such as the characteristics being uncorrelated random variables. We have in this case that

R_{3} = [\begin{matrix} 1 & 0 & 0 & 0 0 & 1 & 0 & - 1 \end{matrix}],

(3)

from which

A_{3}^{2} + σ_{3} u^{T} R_{3} = [\begin{matrix} 1.2 + σ_{3} u_{1}, & 0.8 + σ_{3} u_{2}, & 0, & - 0.8 - σ_{3} u_{2} \end{matrix}] .

The last column of $R_{3}$ must be the negative of the second column for the structures of $A_{3}^{2}$ and $A_{3}^{2} + σ_{3} u^{T} R_{3}$ to agree as $u$ varies over the unit disc. Violations to this structural agreement would essentially, and erroneously, equate to the second input-characteristic of the second object having two values as the data ranged over the uncertainty set. Similar requirements for the output-characteristics are necessary, with illustrative (non-identity) options for $A^{2}$ being

R_{1} = [\begin{matrix} - 0.2 & 0.1 & - 0.1 & 0 0.3 & - 0.2 & 0.2 & 0 \end{matrix}] and R_{2} = [\begin{matrix} 0.7, & 0.6 & - 0.6 & 0 0.1, & - 0.3 & 0.3 & 0 \end{matrix}] .

We express $R_{i}$ as $[R_{i}^{'} | R_{i}^{''}]$ in the forthcoming discussion, with $R_{i}^{''}$ having two columns and being decided by $R_{i}^{'}$ as illustrated.

We let $σ$ be the $M + N$ vector whose $i$ -th component is $σ_{i}$ , and we define the robust efficiency score for object $t$ with respect to $σ$ as,

	$E^{t} (σ)$	$=$	$min{cTη:^Atiη≤0,∀^Ati∈Uti(σi),∀i,Bη=e,η≥0}$
		$=$	$min{cTη:^Atiη+σi∥Riη∥≤0,∀i,Bη=e,η≥0},$

where the second math program is the standard second-order cone expression of the robust problem. The structure of each $R_{i}$ mandates that $E^{t} (σ) \leq 1$ for the same reason that $E^{t} \leq 1$ , i.e. because $^η$ in (2) remains feasible.

We require that $E^{t} (σ) = 1$ for sufficiently large $σ$ , a property that ensures object $t$ is capable of efficiency, or more succinctly capable, if enough uncertainty is assumed. A mathematically contrived example in [8] without ellipsoidal uncertainty shows that $sup {E^{t} (σ) : σ \geq 0}$ can be strictly less than one, and hence, capability is not necessarily ensured and requires applicable justification. Theorem 1 gives serviceable assurance.

Theorem 1.

If $R_{i}^{'}$ has full column rank for each $i$ , then $E^{t} (σ) = 1$ for all sufficiently large $σ$ .

Proof.

Proposition 1 in [8] establishes the following general monotonicity property. If two collections of uncertainty satisfy $U_{i}^{'} \subseteq U_{i}^{''}$ for each $i$ , then

\omit\span\omit\span\omit\@ADDCLASSltxeqnlefteqn$min{cTη:^Atiη≤0,∀^Ati∈U′i(σi),∀i,Bη=e,η≥0}$≤min{cTη:^Atiη≤0,∀^Ati∈U′′i(σi),∀i,Bη=e,η≥0}≤1,⎫⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎬⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎭

where the last inequality follows from an assumption that $^η$ in (2) is feasible in both problems.

Theorem 2 in [31] guarantees the existence of $^σ > 0$ such that $E^{t} (σ) = 1$ if $σ \geq^σ$ and if

U_{i} = {A_{i} : A_{i}^{t} + σ u^{T} [I | R_{i}^{'}] : ∥ u ∥_{\infty} \leq 1} .

So the inequalities in (2) will complete the proof upon showing that for each $i$ and for sufficiently large $σ$ ,

{Ai:Ati+^σiuT[I|R′′i]:∥u∥∞≤1}⊆{Ai:Ati+σiuTRi:∥u∥2≤1}.

The required format of $R_{i}^{''}$ in both cases reduces this argument to showing that

{^σiuT:∥u∥∞≤1}⊆{σiuTR′i:∥u∥2≤1},

for each $i$ and sufficiently large $σ$ .

Fix $i$ and select $v∈{^σiuT:∥u∥∞≤1}$ . Then $∥ v ∥_{2} \leq {^σ}_{i}^{| C^{s} |}$ . Further select $σ_{i}$ so that its components satisfy

σ_{i} \geq {^σ}_{i}^{| C^{s} |} {∥ ∥ {(R_{i}^{'})}^{+} ∥ ∥}_{2}^{+},

where ${(R_{i}^{'})}^{+}$ is the Moore-Penrose pseudo-inverse so that

u^{T} = \frac{1}{σ_{i}} v^{T} {(R_{i}^{'})}^{+} % solves u^{T} (σ_{i} R_{i}^{'}) = v^{T} .

We now have

∥ u ∥_{2} \leq \frac{1}{σ_{i}} ∥ v ∥_{2} {∥ ∥ {(R_{i}^{'})}^{+} ∥ ∥}_{2}^{+} \leq \frac{1}{σ_{i}} {^σ}_{i}^{| C^{s} |} {∥ ∥ {(R_{i}^{'})}^{+} ∥ ∥}_{2}^{+} \leq 1,

which establishes the result. $■$

An immediate consequence of Theorem 1 is that invertible $R_{i}^{'}$ matrices ensure capability, and since common examples set $R_{i}^{'}$ to be positive diagonal matrices or covariance matrices, our assumption of capability is routine. Note also that the rank condition is sufficient but not necessary, and that examples not satisfying the rank condition can still result in each object being capable. Indeed, all but cleverly constructed examples lead to the capability of each object in the authors’ experience, making the assumption of capability prevalent and not inimical to practical application.

We classify objects into categories so that the objects within a category require a minimum amount of uncertainty for each of the objects to have a claim to efficiency. So our motivating question is, how much uncertainty does a category need for all of its objects to have a robust efficiency score of one? We define the proximity to equitable efficiency for category $s$ to answer this question, which is

P^{s} = min {∥ σ ∥ : E^{t} (σ) = 1, \forall t \in C^{s}} .

(4)

The constraints of this problem necessitate that each object reaches its maximum efficiency score by selecting characteristics from the uncertainty sets $U_{i}^{t} (σ_{i})$ , a requirement that imparts equity among the objects within the category relative to efficiency. Note that our proximity model does not require each object to be assessed in the same fashion because each object can combine their characteristic values differently to reach an efficiency score of one. A straightforward interpretation of $P^{s}$ is that it defines a minimum vicinity over which the characteristic data can vary so that each object is equally and maximally efficient if it were allowed to select values from the uncertain possibilities. We comment that a different sense of equity appears in the multiple objective literature [1, 18].

We say that a nonnegative $σ$ is feasible-for-classification for category $s$ if $E^{t} (σ) = 1$ for all $t \in C^{s}$ , that is, if $σ$ is feasible for the optimization problem defining $P^{s}$ . Note that we have two properties from (2), those being:

Property 1:	if $σ^{''} \geq σ^{'}$ and $E^{t} (σ^{'}) = 1$ , then $E^{t} (σ^{''}) = 1$ , and
Property 2:	if $σ^{'}$ is feasible-for-classification and $σ^{''} \geq σ^{'}$ ,
	then $σ^{''}$ is also feasible-for-classification.

Property 1 shows that object $t$ maintains an efficiency score of one if $σ$ surpasses what is required to satisfy $E^{t} (σ) = 1$ , which suggests that we can construct a feasible-for-classification $σ$ by computing the componentwise maximums of solutions to

min {∥ σ ∥ : E^{t} (σ) = 1} .

Theorem 2 confirms this suggestion and establishes both upper and lower bounds on $P^{s}$ based on this construction.

Let

Σ = (σ^{1}, σ^{2}, \dots, σ^{| C^{s} |}) \in \prod t \in C^{s} arg min {∥ σ ∥ : E^{t} (σ) = 1},

and let $^σ (Σ)$ be the nonnegative vector whose components are

^σi(Σ)=maxt{σti}.

Theorem 2 establishes upper and lower bounds on $P^{s}$ in terms of $^σ (Σ)$ .

Theorem 2.

We have for any

Σ \in \prod t \in C^{s} arg min {∥ σ ∥ : E^{t} (σ) = 1}

that the proximity to equitable efficiency for category $s$ satisfies

\frac{1}{\sqrt{M + N}} ∥^σ (Σ) ∥ \leq P^{s} \leq ∥^σ (Σ) ∥ .

Proof.

Select

Σ = (σ^{1}, σ^{2}, \dots, σ^{| C^{s} |}) \in \prod t \in C^{s} arg min {∥ σ ∥ : E^{t} (σ) = 1} .

Then $^σ (Σ) \geq σ^{t}$ for all $t$ by definition, and we have from Property 1 above that $E^{t} (^σ) = 1$ for all $t$ . So $^σ (Σ)$ is feasible-for-classification for category $s$ and

P^{s} \leq ∥^σ (Σ) ∥ .

The lower bound is established by first noticing that

σ_{i}^{t} \leq ∥ σ^{t} ∥ = min {∥ σ ∥ : E^{t} (σ) = 1} \leq P^{s} for all i and t .

So,

{^σ}_{i} (Σ) = max t σ_{i}^{t} \leq P^{s} for all i .

We conclude that $^σ (Σ) \leq P^{s} e$ , and hence,

∥^σ (Σ) ∥ \leq \sqrt{M + N} P^{s},

which completes the proof. $■$

Two corollaries follow, the first of which is an immediate consequence of Theorem 2.

Corollary 1.

The proximity to equitable efficiency for category $s$ satisfies,

1√M+N(supΣ{∥^σ(Σ)∥})≤Ps≤infΣ{∥^σ(Σ)∥},

where the supremum and infemum are expressed over the collection

Σ \in \prod t \in C^{s} arg min {∥ σ ∥ : E^{t} (σ) = 1} .

Theorem 2 can also be strengthened to show that $P^{s}$ achieves its upper bound should $^σ (Σ)$ be decided by a single object.

Corollary 2.

If $^σ(Σ)∈argmin{∥σ∥:Et(σ)=1}$ for some $t \in C^{s}$ , then

P^{s} = ∥^σ (Σ) ∥ .

Proof.

Let $^t \in C^{s}$ have the property that $^σ(Σ)∈argmin{∥σ∥:E^t(σ)=1}$ . Then

∥^σ(Σ)∥=min{∥σ∥:E^t(σ)=1}≤min{∥σ∥:Et(σ)=1,∀t∈Cs}=Ps,

and the result follows because Theorem 2 gives the reverse inequality. $■$

We address how these mathematical results guide our computational effort to compute $P^{s}$ in the next section. However, we first note that we assess a classification by summing the proximity values of the individual categories, giving an overall proximity evaluation for the entire classification. So we appraise a classification $C$ with

P (C) = S \sum s = 1 P^{s},

and its value indicates how much uncertainty is needed so that all objects have a claim to efficiency within their categories. Classifications with low total proximity values are preferred because they partition objects into sets that are intra-category efficient relative to low amounts of uncertainty.

3 Computing the Proximity to Equitable
Efficiency

Solving uDEA problems is generally difficult due to a loss of convexity, see Example 3 in [8] as an illustration, and calculating the proximity to equitable efficiency inherits this difficulty because Corollary 2 equates the value of $P^{s}$ to solving a uDEA problem under appropriate conditions. A first order algorithm in [8] successfully solves uDEA problems, and we use this algorithm to compute a $Σ$ by iteratively solving for each $t \in C^{s}$ ,

min {∥ σ ∥ : E^{t} (σ) = 1} .

The resulting $^σ (Σ)$ then bounds $P^{s}$ according to Theorem 2, and it even computes $P^{s}$ if $^σ (Σ)$ is defined by a single object, in which case $P^{s} = ∥^σ (Σ) ∥$ .

We work to calculate $P^{s}$ by reducing $^σ (Σ)$ if $P^{s}$ is not decided by a single object. Let $Γ (σ)$ be the total sum of robust efficiency scores within the category,

Γ (σ) = \sum t \in C^{s} E^{t} (σ) .

Then,

P^{s} = min {∥ σ ∥ : E^{t} (σ) = 1, \forall t \in C^{s}} = min {∥ σ ∥ : Γ (σ) = | C^{s} |},

where the last equality follows from the fact that $Γ (σ)$ is equal to the number of objects in $C^{s}$ if and only if $σ$ is feasible-for-classification. A unit-length improving direction $d$ from $σ$ is calculated by solving,

min {d^{T} σ : d^{T} \nabla Γ (σ) \geq 0, d^{T} d \leq 1} .

(5)

The first constraint ensures that $σ + α d$ is marginally feasible-for-classification, and the objective minimizes the directional derivative of $∥ σ ∥^{2}$ along $d$ – note that we have removed the superfluous scalar of two.

We search along the solution $d$ if the optimal value of (5) is negative, and a straightforward calculation in this case shows that the step size $α$ must satisfy

0 \leq α \leq - σ^{T} d - \sqrt{{(σ^{T} d)}^{2} - (∥ σ ∥^{2} - \frac{1}{M + N} ∥^σ (Σ) ∥^{2})}

(6)

to ensure that $σ + α d$ does not violate the lower bound in Theorem 2. We search this interval with the method of bisection to find the largest value of $α$ guaranteeing $Γ (σ + α d) \geq | C^{s} | - ε$ , where $ε$ is a suitably small tolerance required by our computational study. We then update $σ$ , solve (5), and repeat our search for $α$ . The process starts with $σ =^σ (Σ)$ and terminates once the solution to (5) is nonnegative.

We assumed the existence of $\nabla Γ$ in (5), but this vector is approximated componentwise with the finite differences

\frac{\partial Γ}{\partial σ_{i}} \approx \frac{Γ (σ) - Γ (σ - δ e_{i})}{δ},

(7)

where $e_{i}$ is the vector of zeros except for a one in the $i$ -th position. We note that this approximation exists even if the gradient doesn’t. The use of a backward difference is important since the more common forward difference would instead use $Γ (σ + δ e_{i})$ and $Γ (σ)$ . However, $σ$ being feasible-for-classification implies that $σ + δ e_{i}$ would also be feasible-for-classification, and hence, both $Γ (σ + δ e_{i})$ and $Γ (σ)$ would have the same value and our approximate gradient would be zero. The search direction from (5) would then be $d = - σ / ∥ σ ∥$ , making

σ + α d = (1 - \frac{α}{∥ σ ∥}) σ .

The misguided computational interpretation would be that we could approximate $P^{s}$ by calculating any $^σ (Σ)$ and then scaling this vector down while satisfying

Γ ((1 - \frac{α}{∥^σ (Σ) ∥})^σ (Σ)) \geq | C^{s} | - ε .

So the use of a forward difference would limit the search to compute $P^{s}$ from an initial $^σ (Σ)$ to a single search direction due to the erroneous assumption that $Γ (σ)$ is constant if $σ$ is in some small neighborhood of $^σ (Σ)$ . The use of the backward differences removes this concern and provides a more accurate search strategy.

Our algorithmic approach to compute $P^{s}$ is:

Step 0:	Initialize	Use the algorithm in [8] to compute a $Σ$ and a resulting $^σ (Σ)$ . If $^σ (Σ)$ is defined by a single object, then set $P^{s} = ∥^σ (Σ) ∥$ and stop. Otherwise, initialize $σ$ to $^σ (Σ)$ .
Step 1:	Search Direction	Compute an approximate $\nabla Γ (σ)$ with (7) and then use this approximation to calculate a search direction $d$ by solving (5). Stop if $d^{T} σ \geq 0$ .
Step 2:	Line Search	Use the method of bisection to calculate the largest $α$ satisfying $Γ (σ + α d) \geq S - ε$ , where $α$ satisfies (6). Update $σ$ to $σ + α d$ and return to Step 1.

[8pt]

We could add a termination criterion to stop if $Γ (σ + α d)$ reached its lower bound in Step 2; however, theory indicates that the directional derivative from (5) would be nonnegative in this case, and we use the termination criterion in Step 1 to validate our computational scheme’s agreement with theory. This approach has been computationally successful.

We comment that each evaluation of $Γ (σ)$ requires $| C^{s} |$ robust optimization problems to be solved, and while the algorithmic structure above is standard, assembling the computational effort is less so. Robust solvers are increasingly trustworthy, but still, they lack some of the dependability associated with other problem classes like linear programming [2, 14]. We use Gurobi with many of the numeric options set to their most stringent possibilities, and we validate the optimal status of each solve. Success is routine with the increased numeric focus but is more questionable with default settings. If a problem fails our stringent expectations, then we iteratively relax convergent tolerances while guaranteeing optimal values, which are efficiency scores, to at least six decimal places – so an efficiency score of at least 0.999999 equates to 1.0. The vast majority of solves more stringently reaches 16 decimal places of accuracy.

4 Classification with Uncertain Data

Our classification problem partitions the $T$ objects in $O$ into the $S$ categories in $C$ in a way that (approximately) minimizes the total proximity $P (C)$ , i.e. we seek to solve

min {P (C) : | C | = S} .

(8)

The search space of this problem is combinatorially problematic for even modest problems because its size is the Sterling number of the second kind that counts the number of ways to disjointly partition $T$ objects into $S$ nonempty sets. To illustrate, the search spaces of the examples in the next section are

{\begin{matrix} 303 \end{matrix}} \approx 3.4 \times 10^{13} and {\begin{matrix} 423 \end{matrix}} \approx 1.8 \times 10^{19} .

The immensity of these spaces fundamentally challenges a classification scheme independent of how we assess individual classifications, which prompts the need for heuristics to navigate a search. We divide our search for a (near) optimal classification into an initialization procedure, which efficiently solves several adapted $p$ -median problems, and an iterative improvement strategy, which is a neighborhood search that reassigns one object per iteration.

The $p$ -median problem partitions objects into sets so that the aggregate distance from the objects within the sets to their representative medians is as small as possible. We begin by considering the entire collection of objects as a single classification, i.e. we do not assume a partition and originally set $C = {O}$ . The $p$ -median problem depends on a distance between the objects, and we let the distance between objects $t_{i}$ and $t_{j}$ be

d (t_{i}, t_{j}) = | ∥ σ^{i} ∥ - ∥ σ^{j} ∥ |,

where $σ^{i}$ and $σ^{j}$ respectively solve

min {∥ σ ∥ : E^{t^{i}} (σ) = 1} and min {∥ σ ∥ : E^{t^{j}} (σ) = 1} .

Again, these $σ$ vectors are relative to the entire collection of objects and not to a proposed partition of $S$ categories; so there is no intra-category consideration in the initialization process. The idea is that calculating the minimum amount of uncertainty for each object against the whole set of objects should be a reasonable proxy to seed our search to minimize the sum of the intra-category uncertainties comprising $P (C)$ .

We adapt the $p$ -median problem to include constraints that mandate the sizes of the sets. For instance, if we seek to classify $30$ objects into three categories, then we solve an adapted $p$ -median problem for sets whose sizes are $(2, 2, 26)$ , $(2, 3, 25)$ , $(2, 4, 24)$ , etc. If the set sizes are $(p_{1}, p_{2}, \dots, p_{S})$ , then we solve

min\omit\span\omit\span\omit∑ijd(ti,tj)ζijsubject to∑jζij=1,∀i∑jζjj=S,ζjj≥ζij,∀i,j,i≠j∑kωjk≤1,∀j∑iζij=∑kp′kωjk,∀j∑jωjk=|{pq:pq=p′k}|,∀kζij∈{0,1},∀i,jωjk∈{0,1},∀j,k,

where $i$ and $j$ index over ${1, 2, \dots, S}$ , $(p_{1}^{'}, p_{2}^{'}, \dots, p_{S^{'}}^{'})$ lists the unique elements of $(p_{1}, p_{2}, \dots, p_{S})$ , and $k$ indexes over ${1, 2, \dots, S^{'}}$ . The first three constraints model the traditional $p$ -median problem, whereas the next three ensure that the partitioning sets have the specified cardinalities. We note that we do not need to consider singletons because any set with only one or two elements trivially has a proximity value of zero. So we solve the adapted $p$ -median for all other summative partitions of $S$ , and we calculate $P (C)$ for each. Our initial classification is the one with the smallest value of $P (C)$ .

Suppose $C = {C^{1}, C^{2}, \dots, C^{S}}$ is our initial incumbent classification. We define a neighbor of $C$ to be any classification that moves a single object from one category to another. So $C^{'}$ is a neighbor of $C$ if for some $s^{'}$ and $s^{''}$ , and for some object $o^{t} \in C^{s^{'}}$ ,

C^{'} = (C ∖ {C^{s^{'}}, C^{s^{''}}}) ⋃ {C^{s^{'}} ∖ {o^{t}}, C^{s^{''}} \cup {o^{t}}} .

From the current incumbent classification $C$ , we solve

min {P (C^{'}) : C^{'} is a % neighbor of C}

(9)

by iteratively swapping each object from its current category to the other categories. We accept a solution to (9) as the next incumbent if its value is less than $P (C)$ , but we otherwise terminate with a final classification. This process repeats with the updated incumbent if improvement is found.

We conclude with a few comments about our computational process. Our initialization and subsequent neighborhood search does not guarantee a solution to (8). However, our adapted $p$ -median problem efficiently and efficaciously seeds our neighborhood search algorithm so that it converges in a few iterations. Moreover, and beyond combinatorial concerns, shrewd and judicious navigation through the collection of all possible classifications is prudent due to the computational burden of calculating $P (C)$ , which requires thousands of SOCP solves even on modestly sized problems. Our two phase approach demonstrates persuasively on the examples of the next section, and we advocate it as a reasonable and practically justified approximation algorithm. Lastly, we have experimented with numerous minor variations, e.g. by adjusting the distances in our adapted $p$ -median problem to $d (t_{1}, t_{2}) = ∥ σ^{1} - σ^{2} ∥_{p}$ or by tweaking the definition of a classification’s neighborhood. We considered such adjustments in an attempt to expediting our computational task without degrading fidelity to the goal of identifying an optimal classification, but all adjustments either diminished outcomes or lacked computational improvement.

5 Illustrative Examples

We apply our classification scheme to two problems to demonstrate its employ. The first example partitions the 30 stocks in the Down Jones Industrial Average into three performance tiers, and the second categorizes 42 prostrate treatments that were originally analyzed with DEA in [22] and subsequently with uDEA in [8, 31]. The objects, i.e. stocks and prostate treatments, have one input and one output characteristic in each case. Single input and output examples lend themselves to graphical inspection to help verify efficacy, and they guarantee that the upper and lower bounds on $P (C)$ are within $\sqrt{2}$ of each other from Theorem 2. Both examples set $R_{i}^{'} = I$ , so a feasible-for-classification $σ$ exists from Theorem 1.

A few comments about the computational burden are worthwhile. Our code is written in python and is available as a supplement to this article. We use Pyomo [12] to model all optimization problems and Gurobi to solve all models. Our initial computational testing suggested that both examples would take about a week of serial calculation on a modern laptop, which seemed extreme. A parallel version of the code that distributes the calculation over twelve nodes has significantly reduced the calculation time to several hours, with the stock example solving in about six hours and the prostate example solving in about eight. The computational burden of classifying with regard to uncertain data is nonetheless significant and requires careful consideration.

The input and output characteristics of each stock are respectively the semi-deviation and the average annual return, both of which are uncertain estimates of future values. Semi-deviation is a common measure of risk and is the standard deviation of downside returns, whereas the average annual return is a routine measure of reward. The risk reward trade-off is steeped in investing, with risk adverse individuals accepting lower average returns to help safeguard principal and risk accepting individuals jeopardizing principal as they pursue higher returns. The input and output characteristics are calculated from the twelve annual returns ending at the beginning of each month in 2021.

Classifying stocks relative to efficiency with regard to risk and reward does not preference risk or reward but instead considers them as uncertain co-equal trade-offs, and hence, the categories tier stocks into performant classes independent of risk tolerance. A reasonable interpretation is that Tier 1 stocks should be considered over Tier 2 or 3 stocks regardless of one’s palatability toward risk. Tier 2 stocks should likewise be considered over Tier 3 stocks.

Figures 4 through 4 show the progression of our classification process. Tier 1 stocks are green squares, Tier 2 stocks are blue squares, and Tier 3 stocks are red diamonds. The outcome from the initialization process is in Figure 4, and we note that Tier 3 can obviously add stocks from Tier 2 while maintaining a proximity value of zero. The first two improvements identify this fact and reduce the uncertainty of Tier 2 by moving two of its stocks to Tier 3. The third and final improvement moves a Tier 1 stock to Tier 2, and the algorithm identifies no subsequent improvement. Table 1 lists the categories throughout the classification process along with the proximity values. Stocks are listed by their ticker symbol.

	Tier 1	Tier 2	Tier 3	$P (C)$
Initial	$P^{1} = 0.1361$	$P^{2} = 0.0889$	$P^{3} = 0.000$	$0.2250$
	AXP,AMGN,AAPL,CAT,	AXP,CVX,CSCO,KO,	BA,WBA
	DIS,GS,HD,HON,INTC,	DOW,IBM,JPM,TRV
	JNJ,MCD,MRK,MSFT,
	NKE,PG,CRM,UNH,VZ,
	V,WMT
Improvement 1	$P^{1} = 0.1361$	$P^{2} = 0.0646$	$P^{3} = 0.000$	$0.2001$
	AXP,AMGN,AAPL,CAT,	AXP,CSCO,KO,DOW	BA,CVX,
	DIS,GS,HD,HON,INTC,	IBM,JPM,TRV	WBA
	JNJ,MCD,MRK,MSFT,
	NKE,PG,CRM,UNH,VZ,
	V,WMT
Improvement 2	$P^{1} = 0.1361$	$P^{2} = 0.0172$	$P^{3} = 0.000$	$0.1532$
	AXP,AMGN,AAPL,CAT,	AXP,CSCO,KO,DOW	BA,CVX,
	DIS,GS,HD,HON,INTC,	JPM,TRV	IBM,WBA
	JNJ,MCD,MRK,MSFT,
	NKE,PG,CRM,UNH,VZ,
	V,WMT
Final	$P^{1} = 0.1153$	$P^{2} = 0.0209$	$P^{3} = 0.000$	$0.1361$
	AXP,AMGN,AAPL,CAT,	AXP,CSCO,KO,DOW	BA,CVX,
	DIS,GS,HD,HON,INTC,	JPM,TRV,CRM	IBM,WBA
	JNJ,MCD,MRK,MSFT,V
	NKE,PG,UNH,VZ,WMT

Table 1: Stock tiers through the classification process

The second example is a collection of 42 prostate cases that were previously evaluated with DEA and uDEA, see [8, 21, 31]. These treatments were delivered to patients in New Zealand and were approved for observational studies. Treatment planning has an established history in operations research, and we refer readers to [7, 9, 13, 27] as summary reviews. The overriding goal of treatment planning is to design a treatment that delivers a uniform, tumoricidal dose to the targeted region while sparing nearby organs from harmful effects. The targeted region is the planning treatment volume (PTV), which is a clinically delineated portion of the anatomy that encompasses both cancerous tumors and their perceived microscopic extensions. The organs at risk in a prostate case are generally the rectum, the bladder, and the femoral heads, with the rectum and bladder being particularly concerning because they abut the prostate.

Each treatment is patient specific, and although trained experts tailor treatments with sophisticated optimization software to clinical standards, treatment quality remains nebulous and varied for many reasons. There are preference and software differences among clinics and clinicians, there are mathematical and computational assumptions that only approximate what is actually delivered, and there are uncertainties in treatment delivery. All 42 cases were deemed clinically acceptable, but that does not mean that they were bereft of improvement. Indeed, one of the outcomes of [21] was that treatments identified for possible improvement actually benefited from reconsideration. So the practical importance of reviewing previously delivered treatments is to learn from those that were outstanding so that clinical acceptability improves.

The original work in [21] posited the use of DEA to analyze treatments, but this effort assumed certain characteristic data. The subsequent efforts in [8] and [31] recognized the uncertainty of the characteristics and provided alternate assessments that benefited from this recognition. The computational processes in [8, 31] evaluated individual treatments on the amount of uncertainty required to be efficient within the entire collection of treatments, and those that necessitated larger amounts of uncertainty were recommended for review and possible improvement.

Our classification scheme advances the work in [8, 21, 31]. First, it recognizes and models uncertainty like [8, 31], so it gains fidelity over [21] with regard to the uncertain clinical situation. Second, the analysis methods in [8, 31] are not directly classification processes, but rather, they ask clinicians to discern which treatments might benefit from additional consideration post calculation. For instance, the authors of [8] suggest segmenting the cases with a threshold on the amount of uncertainty, but this threshold can only be decided after the computational effort. Our classification process instead identifies (near) optimal categories and partitions the cases into a fixed number of categories without an after-the-fact cutoff. We partition the 42 cases into three categories: those deemed to be clinical exemplars, those deemed to be clinical standards, and those that should be considered for additional improvement. Classifying into three categories is clinically reasonable because it identifies clinical exemplars that illustrate best practices, which is unlike the previous dichotomies that primarily distinguish the treatments that should be reconsidered. Our trichotomy does both, i.e. it identifies clinical exemplars and it distinguishes those that should be reconsidered.

The input characteristic is the rectal generalized equivalent uniform dose (gEUD), which measures the average homogeneous dose delivered to the rectum, and the output characteristic is the dose received by 95% of the PTV ( $D_{95}$ ). The unit of dose is a Gray (Gy). These are the same input and output characteristics in [8, 21, 31], and all three of these articles review additional clinical details. The categories throughout the classification process are in Figures 6 and 6. The category of green squares contains exemplar treatments, the category of blue circles contains clinical standards, and the category of red diamonds contains treatments that should be reconsidered. Note that the search following the initialization process only identifies a single improvement, with one of the treatments moving into the category of clinical standards from those that would have been recommended for continued review.

The initial and improved proximity values are:

	$P^{1}$	$P^{2}$	$P^{3}$	$C (P)$
Initial	$0.1851$	$0.1693$	$0.8049$	$1.1593$
Final	$0.1851$	$0.1693$	$0.7997$	$1.1541$ .

Our partitioning suggests 15 cases would have likely benefited from additional planning, which is fewer than the 20 cases shown to exceed the proposed threshold in [8]. Some of this decrease is due, at least in part, to the fact that we partition into three categories instead of two. Three final observations are worthwhile. First, the interplay between the second and third categories suggests that it would be difficult to discern a clear separation between these categories without a tool that optimizes the classification. Second, just because a treatment is classified as warranting a review does not immediately suggest that the treatment fell below clinical guidelines. Remember that trained experts tailor each treatment to a patient’s specific needs, and sometimes this imposes considerations other than standard performance metrics. So being classified as warranting a review would simply alert an expert to the fact that continued design might be worthwhile. Third, the categories visually segment themselves more vertically than horizontally, which suggests that the $D_{95}$ value is somewhat more important as a classification parameter. This observation is less clear between the second and third categories, but it is the type of visual acuity provided by the classification.

6 Conclusion

Uncertain data envelopment analysis can be useful within a classification process, and we have advanced the employ of the proximity to equitable efficiency as a sound tool to classify in the presence of uncertain data. We overcome two computational issues, those being the innate difficulty of calculating our proximity value due to a loss of convexity and the combinatorially fraught problem of searching over all possible partitions. The downside to our classification process is that it necessitates a protracted calculation in serial, a fact that we sidestep on our examples by parallelizing our algorithm. Future work could seek to streamline the computational effort to better address larger problems with more characteristics. Applying the classification process to additional problems should also be fruitful.

Acknowledgements

The authors are grateful for thoughtful comments made by Matthias Ehrgott.

References

[1] D. Baatar and M.M. Wiecek. Advancing Equitability in Multiobjective Programming. Computers & Mathematics with Applications, 52(1):225–234, 2006.
[2] A. Ben-Tal and A. Nemirovski. On Polyhedral Approximations of the Second-Order Cone. Mathematics of Operations Research, 26(2):193–205, 2001.
[3] D. Bertsimas and R. Shioda. Classification and Regression via Integer Optimization. Operations Research, 55(2):252–271, 2007.
[4] S. Cinaroglu. Integrated k-means clustering with data envelopment analysis of public hospital efficiency. Health Care Management Science, 23:325–338, 2020.
[5] W. Cooper, L. Seiford, and K. Tone. Data Envelopment Analysis: A Comprehensive Text with Models, Applications, References and DEA-solver Software. Springer, New York, 2007.
[6] B. Duran and P. Odell. Cluster Analysis: A Survey, volume 100. Springer Science & Business Media, 2013.
[7] M. Ehrgott, Ç. Güler, H. W. Hamacher, and L. Shao. Mathematical optimization in intensity modulated radiation therapy. 4OR, 6(3):199–262, 2008.
[8] M. Ehrgott, A. Holder, and O. Nohadani. Uncertain Data Envelopment Analysis. European Journal of Operational Research, 268:231–242, 2018.
[9] Matthias Ehrgott and Allen Holder. Operations research methods for optimization in radiation oncology. Journal of Radiation Oncology Informatics, 6(1):1–41, Oct. 2017.
[10] A. Emrouznejad, B. Parker, and G. Tavares. Evaluation of Research in Efficiency and Productivity: A Survey and Analysis of the First 30 Years of Scholarly Literature in DEA. Socio-economic Planning Sciences, 42:151–157, 2008.
[11] L. García-Escudero, A. Gordaliza, C. Matrán, and A. Mayo-Iscar. A Review of Robust Clustering Methods. Advances in Data Analysis and Classification, 4(2-3):89–109, 2010.
[12] W. Hart, C. Laird, J.-P. Watson, D. Woodruff, G. Hackebeil, B. Nicholson, and J. Siirola. Pyomo - Optmization Modeling in Python. Springer, 2017.
[13] A. Holder. Radiotherapy treatment design and linear programming. In M.L. Brandeau, F. Sainfort, and W.P. Pierskalla, editors, Operations Research and Health Care, pages 741–774. Kluwer Academic Publishers, Norwell MA, 2004.
[14] A. Holder and B. Lyu. A Simplex Approach to Solving Robust Metabolic Models with Low-Dimensional Uncertainty. In Harvey J. Greenberg: A Legacy Bridging Operations Research and Computing. Springer, 2020.
[15] C. Huberty and S. Olejnik. Applied MANOVA and Discriminant Analysis, volume 297. John Wiley & Sons, 2nd edition, 2006.
[16] S. Hwang, H. Lee, and J. Zhu, editors. Handbook of Decision Making Using Data Envelopment Analysis. Springer, New York, 2016.
[17] A. Jain, M. Murty, and P. Flynn. Data Clustering: A Review. ACM Comput. Surv., 31(3):264–323, 1999.
[18] M. Kostreva and W. Ogryczak. Linear Optimization with Multiple Equitable Criteria. RAIRO Operations Research, 33:275–297, 1999.
[19] S. Kotsiantis. Supervised Machine Learning: A Review of Classification Techniques. In Proceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in EHealth, HCI, Information Retrieval and Pervasive Technologies, page 3–24. IOS Press, Neatherlands, 2007.
[20] J. Lee and S. Olafsson. Data Clustering by Minimizing Disconnectivity. Information Sciences, 181(4):732–746, 2011.
[21] Kuan-Min Lin, John Simpson, Giuseppe Sasso, Andrea Raith, and Matthias Ehrgott. Quality assessment for VMAT prostate radiotherapy planning based on data envelopment analysis. Physics in Medicine and Biology, 58(16):5753, 2013.
[22] J. Liu, L. Lu, W. Lu, and B. Lin. A Survey of DEA Applications. Omega, 41:893–902, 2013.
[23] S. Olafsson, X. Li, and S. Wu. Operations Research and Data Mining. European Journal of Operational Research, 187(3):1429–1448, 2008.
[24] P. Peykani, E. Mohammadi, R. F. Saen, S. J. Sadjadi, and M. Rostamy-Malkhalifeh. Data envelopment analysis and robust optimization: A review. Expert Systems, 37(4):e12534, 2020.
[25] T. Phyu. Survey of Classification Techniques in Data Mining. In Proceedings of the International MultiConference of Engineers and Computer Scientists, volume 1, pages 18–20, 2009.
[26] Rung-Wei Po, Yuh-Yuan Guh, and Miin-Shen Yang. A new clustering approach using data envelopment analysis. European Journal of Operational Research, 199(1):276–284, 2009.
[27] H.E. Romeijn and J.F. Dempsey. Intensity modulated radiation therapy treatment plan optimization. Top, 16:215–243, 2008.
[28] Maziar Salahi, Mehdi Toloo, and Narges Torabi. A new robust optimization approach to common weights formulation in dea. Journal of the Operational Research Society, 72(6):1390–1402, 2021.
[29] A. Saxena, M. Prasad, A. Gupta, N. Bharill, O. Patel, A. Tiwari, M. Er, W. Ding, and C. Lin. A Review of Clustering Techniques and Developments. Neurocomputing, 267:664–681, 2017.
[30] S. Schaeffer. Graph Clustering. Computer Science Review, 1(1):27–64, 2007.
[31] E. Stubington, M. Ehrgott, and O. Nohadani. Uncertain Data Envelopment Analysis: Box Uncertainty. preprint, 2020.
[32] E. Thanassoulis. A data envelopment analysis approach to clustering operating units for resource allocation purposes. Omega, 24(4):463–476, 1996.
[33] Mehdi Toloo, Emmanuel Kwasi Mensah, and Maziar Salahi. Robust optimization and its duality in data envelopment analysis. Omega, 108:102583, 2022.
[34] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel. Decision Trees for Hierarchical Multi-Label Classification. Machine Learning, 73, 2008.
[35] G. Zhang. Neural Networks for Classification: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 30(4):451–462, 2000.

Classifying with Uncertain Data Envelopment Analysis

Abstract

1 Motivation and Introduction

2 Proximity to Equitable Efficiency

Theorem 1.

Proof.

Theorem 2.

Proof.

Corollary 1.

Corollary 2.

Proof.

3 Computing the Proximity to Equitable Efficiency

4 Classification with Uncertain Data

5 Illustrative Examples

6 Conclusion

Acknowledgements

References

3 Computing the Proximity to Equitable
Efficiency