Universal Mini-Batch Consistency for
Set Encoding Functions

Jeffrey Willette^{1}, Andreis Bruno^{1}, Juho Lee^{1, 2}, Sung Ju Hwang^{1, 2}

^{1}KAIST, South Korea
^{2}AITRICS, South Korea
{jwillette, andries, juholee, sjhwang82}@kaist.ac.kr

Abstract

Previous works have established solid foundations for neural set functions, as well as effective architectures which preserve the necessary properties for operating on sets, such as being invariant to permutations of the set elements. Subsequently, Mini-Batch Consistency (MBC), the ability to sequentially process any permutation of any random set partition scheme while maintaining consistency guarantees on the output, has been established but with limited options for network architectures. We further study the MBC property in neural set encoding functions, establishing a method for converting arbitrary non-MBC models to satisfy MBC. In doing so, we provide a framework for a universally-MBC (UMBC) class of set functions. Additionally, we explore an interesting dropout strategy made possible by our framework, and investigate its effects on probabilistic calibration under test-time distributional shifts. We validate UMBC with proofs backed by unit tests, also providing qualitative/quantitative experiments on toy data, clean and corrupted point cloud classification, and amortized clustering on ImageNet. The results demonstrate the utility of UMBC, and we further discover that our dropout strategy improves uncertainty calibration.

1 Introduction

Set encoding functions (Zaheer et al., 2017; Bruno et al., 2021; Lee et al., 2019; Kim, 2021) are becoming a broadly researched and cited topic in recent literature. This popularity can be partly attributed to natural set structures in data such as point clouds or even datasets themselves. Given a set of cardinality $N$ , one may desire to group the elements (clustering), identify them (classification), or find likely elements to complete the set (completion/extension). Different from vanilla neural networks working on fixed input sizes, neural set functions must be able to handle dynamic set cardinalities for each input set. Additionally, sets are inherently unordered, so the function must make consistent predictions for any permutation of set elements.

Figure 1: Non-MBC set functions are inconsistent when sequentially processing random set partitions. MBC set functions are consistent, but with limited valid architectures. Universal MBC (UMBC) allows leveraging MBC+non-MBC set functions, widening the field of available MBC architectures.

Deep Sets (Zaheer et al., 2017) is a canonical work providing an in-depth investigation of the requirements and valid structures of neural set functions. Deep Sets utilizes traditional, permutation equivariant (Property 3.2) linear and convolutional neural network layers in conjunction with featurewise permutation invariant (Property 3.1) set-pooling functions (e.g. {min, max, sum, mean}) in order to satisfy the permutation consistency requirements and perform inference on sets. The Set Transformer (Lee et al., 2019) utilizes the power of Multi-Headed Attention (Vaswani et al., 2017) to construct multiple set-capable attention blocks, as well as an attentive pooling function. The previously mentioned works never explicitly consider the case where it may be required to process a set in multiple partitions, which can happen for a variety of reasons including resource constraints, prohibitively large or even infinite set sizes, and streaming set data.

Figure 2: $σ^{2}$ between encoded features of 100 different partitions of the same set. Set Transformer is not MBC and produces nonzero variance between different partitions. UMBC+Set Transformer is MBC and gives effectively zero variance between partitions.

Model	MBC	Cross-Attn.	Self-Attn.
Deep Sets (Zaheer et al., 2017)	✓	✗	✗
SSE (Bruno et al., 2021)	✓	✓	✗
Set Transformer (Lee et al., 2019)	✗	✓	✓
UMBC+Set Transformer	✓	✓	✓

Powerful models such as the Set Transformer cannot make consistency guarantees when updating pooled set representations, as self-attention blocks require all $N$ elements in one pass, and therefore do not satisfy MBC (i.e. batch processing of set partitions yields a different output than processing the whole set at once). Naively using such non-MBC set encoders in an MBC setting causes under-performance, as depicted in the Set Transformer panels of Figure 3 (single point stream through chunk stream), where Set Transformer exhibits poor likelihood and inconsistent predictions. With an MBC guarantee, as in the UMBC+Set Transformer panels of Figure 3 (single point stream through chunk stream), UMBC+Set Transformer gives consistent results and vastly better likelihood (see Appendices B and 5 for details of the experiment). The quantitative effect of MBC vs. non-MBC encoding on pooled set representations can be seen in Figure 2, which shows the variance between final representations of 100 random partitions of the same set (see Appendix C for details).

The MBC property of set functions was identified by Bruno et al. (2021), who also proposed the Slot Set Encoder (SSE), a specific, constrained version of an attentive pooling mechanism which eliminates the need to store all $N$ set elements during the computation. The introduction of the MBC property naturally adds a new dimension to the taxonomy of set functions, namely, those which satisfy MBC and those which do not. The main limitation of the SSE is that it restricts the set of valid MBC architectures, excluding powerful models such as the Set Transformer, which can be the best choice for tasks which require leveraging pairwise set element relationships and self-attention. This is shown in Tables 2 and 4, where the Set Transformer outperforms SSE. In this work, we identify, prove, and verify that there is a universal way to convert arbitrary non-MBC set functions to MBC functions which provably produce the same result for random partitioning schemes, allowing any set encoder to be used in an MBC setting. This result has large implications for all current and future set functions which do not natively satisfy MBC, as it unifies all set functions into the MBC class of set functions. This unification allows models to scale to sets larger than they could otherwise handle, and also allows them to be used in a wider range of settings (e.g. streaming data). Animations, code, and unit tests can be found in the supplementary file and also at: https://github.com/anonymous-subm1t/umbc

Our contributions in this work are as follows:

In Theorem 4.1 we show that with a change in architecture, any arbitrary non-MBC set encoder can become MBC, guaranteeing that minibatch processing of sets gives the same result as processing the full set at once.
We loosen the constraints of the SSE attention activations by showing many functions can be used. By factorizing the activation, we can maintain the MBC property and still normalize over $N$ (i.e. like traditional attention (Vaswani et al., 2017)).
We uncover a connection between the pooling mechanism of the Set Transformer and SSE layers, which we show only differ in the attention matrix activation. We explore the effect of 5 different activation approaches.
We explore an interesting dropout approach which arises as a consequence of UMBC’s structure, delivering improvements in calibration for both in-distribution and corrupted test sets.

2 Related Work

Processing, pooling, and making predictions for set-structured data has been an active topic since the introduction of Deep Sets (Zaheer et al., 2017). Attention has been shown to be powerful in these tasks (Lee et al., 2019), as simple independent row-wise operations may fail to capture pairwise interactions between the set elements. There have been subsequent works and variations of set attention which draw connections to optimal transport (Mialon et al., 2020) and expectation maximization (Kim, 2021). Likewise, an efficient version of set attention has been proposed which incorporates cross-attention with lower-dimensional self-attention in an iterative process (Jaegle et al., 2021). Outside of attention, other approaches to set pooling functions include featurewise sorting (Zhang et al., 2019) and canonical orderings for permutation-sensitive functions (Murphy et al., 2018).

Bruno et al. (2021) provide an especially important lens through which to view our work. Prior to the proposal of the MBC property, set function research never explicitly considered the mini-batched setting, which will likely become important with the ever increasing scales of models and data (Brown et al., 2020). Indeed, most set functions do not satisfy Property 3.3 (e.g. Lee et al., 2019; Kim, 2021; Mialon et al., 2020; Jaegle et al., 2021; Zhang et al., 2019; Murphy et al., 2018). Our work builds on the concepts established by Bruno et al. (2021), and ensures that any set function proposed in the future can be evaluated in terms of its MBC performance by incorporating UMBC.

Numerous prior works (Ovadia et al., 2019; Guo et al., 2017) focus on uncertainty quantification and improving probabilistic calibration, which can be crucial for tasks such as autonomous driving (Chen et al., 2017) and medical diagnosis (Zhou et al., 2021) where decisions can impact human well-being. Guo et al. (2017) proposed quantifying uncertainty with the expected calibration error (ECE) metric, which measures the mismatch between accuracy and confidence. Ovadia et al. (2019) used corrupted datasets made by Hendrycks and Dietterich (2019) (similar in form to ModelNet40-C (Ren et al., 2022) used in our experiments) to survey the landscape of neural network calibration. Guo et al. (2017) and Ovadia et al. (2019) analyze variants of deep convolutional models, while Minderer et al. (2021) evaluate large Vision Transformers. To our knowledge, our work is the first to analyze set function calibration specifically, as most other works focus on general purpose classifiers.

3 Preliminaries on Set Functions

For our setting, we define a neural set function $f$ which operates on a set $X = \{x_{i}\}_{i = 1}^{N}$ with each set element $x_{i} \in R^{d}$ . A dataset of set-structured data $D = \{(X_{i}, Y_{i})\}_{i = 1}^{M}$ is itself a set composed of input sets $X_{i}$ and output sets $Y_{i}$ , whose mapping can be learned by an appropriate function via mini-batch stochastic gradient descent. A set function $f : X \mapsto Y$ has a set-structured input space $X$ and output space $Y$ , which may be discrete or continuous. An input to a set function, $X_{i}$ , is a set. Because the function must process any valid set, any element of the powerset $P (X_{i})$ also represents a valid input.

Deep Sets (Zaheer et al., 2017) provided crucial groundwork for neural set functions, formalizing the requirements of permutation equivariant architectures and invariant pooling mechanisms necessary for feature extraction and pooling of sets. Following these requirements, a function can assign a single output for each valid subset $X_{i} \in P (X_{i})$ which is invariant to the permutations of the elements $x_{j \in N} \in X_{i \in M}$ .

Property 3.1 (Permutation Invariance).

A function $f : P (X) \mapsto Y$ acting on sets is permutation invariant to the order of objects in the set iff for any permutation function $π : f ([x_{π (1)}, \dots, x_{π (N)}]) = f ([x_{1}, \dots, x_{N}])$

Permutation invariant layers are commonly referred to as set pooling functions, and have a stationary, fixed size output given any permutation, or cardinality of the input set, respectively. This stationary output can be seen as a Set to Vector function.

Definition 3.1 (Set 2 Vector Function).

A Set to Vector Function (S2V) is a pooling function which satisfies Property 3.1, and projects a set of cardinality $N$ to one or more vectors $\{z_{i}\}_{i = 1}^{K}$ with $z_{i} \in R^{d}$ .

Additionally, Zaheer et al. (2017) prescribes that prior to any permutation invariant pooling, any composition of permutation equivariant layers may be used for feature extraction. Common linear and convolutional neural network layers are permutation equivariant layers when considering a batch of inputs as a set. For the remainder we assume $f$ contains both equivariant and invariant layers.

Property 3.2 (Permutation Equivariance).

A function $f : P (X) \mapsto Y$ acting on sets is permutation equivariant to the order of objects in the set iff for any permutation function $π : f ([x_{π (1)}, \dots, x_{π (N)}])^{⊤} = [f_{π (1)} (x_{1}), \dots, f_{π (N)} (x_{N})]^{⊤}$
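As a small illustration of Properties 3.1 and 3.2, the following sketch checks both on toy data. The feature map `phi`, the helper names, and the integer-valued inputs (chosen so that floating-point sums are exact) are assumptions made for this example and do not come from the paper.

```python
import random

def phi(x):
    """Elementwise feature map applied to each set element independently."""
    return [2.0 * v + 1.0 for v in x]

def equivariant_layer(X):
    """Apply phi row by row; permuting the rows of X permutes the output rows."""
    return [phi(x) for x in X]

def sum_pool(X):
    """Featurewise sum over set elements; the result ignores row order."""
    return [sum(col) for col in zip(*X)]

random.seed(0)
# Integer-valued floats keep every sum exact, so == comparisons are safe here.
X = [[float(random.randint(-5, 5)) for _ in range(3)] for _ in range(6)]
perm = list(range(len(X)))
random.shuffle(perm)
X_perm = [X[i] for i in perm]

H, H_perm = equivariant_layer(X), equivariant_layer(X_perm)
equivariant_ok = H_perm == [H[i] for i in perm]   # Property 3.2
invariant_ok = sum_pool(H) == sum_pool(H_perm)    # Property 3.1
print(equivariant_ok, invariant_ok)
```

Composing the equivariant layer with the invariant pooling gives a Deep Sets-style encoder whose output does not depend on element order.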

Lee et al. (2019) identified that Self-Attention (Vaswani et al., 2017) blocks satisfy Property 3.2 and thus can be used as equivariant feature extractors for set functions, proposing the Set Transformer. SABs are defined as $SAB (X, X) = Attention (X, X)$ . Additionally, the permutation invariant pooling layer of the Set Transformer (PMA) performs the attention operation between a learnable seed parameter $S \in R^{K \times d}$ and the input set, $PMA (X) = SAB (S, X) \in R^{K \times d}$ .

Bruno et al. (2021) identified and formalized the MBC property, proposing the MBC Slot Set Encoder (SSE), adding a new dimension to the original view of Property 3.1 from Zaheer et al. (2017). Instead of merely requiring that $f$ be permutation invariant for any permutation of the indices of a specific subset $X_{i} \in 2^{X_{i}}$ , the MBC property also requires that sequential, mini-batched extraction/pooling, and subsequent aggregation of any partition of $X_{i}$ is also permutation invariant.

Property 3.3 (Mini-Batch Consistency).

Let $X \in R^{N \times d}$ be partitioned such that $X = X_{1} \cup X_{2} \cup \dots \cup X_{P}$ and $f : R^{n_{i} \times d} \mapsto R^{d^{'}}$ be an S2V set encoding function such that $f (X) = Z$ . Given an aggregation function $g : \{f (X_{j}) \in R^{d^{'}}\}_{j = 1}^{P} \mapsto R^{d^{'}}$ , $g$ and $f$ are Mini-Batch Consistent if and only if

g (f (X_{1}), \dots, f (X_{P})) = f (X)
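To make Property 3.3 concrete, here is a minimal sketch that checks $g (f (X_{1}), \dots, f (X_{P})) = f (X)$ for a Deep Sets-style encoder with sum pooling. The feature map inside `f`, the helper `random_partition`, and the toy data are illustrative assumptions, not the paper's models.

```python
import random

def f(X):
    """Deep Sets-style S2V encoder: elementwise features, then sum pooling."""
    feats = [[x[0] * x[1], x[0] + x[1], x[0] ** 2] for x in X]
    return [sum(col) for col in zip(*feats)]

def g(encodings):
    """Aggregate partition encodings by summing them featurewise."""
    return [sum(col) for col in zip(*encodings)]

def random_partition(X, num_parts, rng):
    """Shuffle X and deal its elements into num_parts non-empty chunks."""
    order = list(range(len(X)))
    rng.shuffle(order)
    return [[X[i] for i in order[p::num_parts]] for p in range(num_parts)]

rng = random.Random(0)
# Integer-valued floats make all sums exact, so exact equality is meaningful.
X = [[float(rng.randint(-4, 4)), float(rng.randint(-4, 4))] for _ in range(10)]
full = f(X)
consistent = all(
    g([f(part) for part in random_partition(X, num_parts, rng)]) == full
    for num_parts in (1, 2, 3, 5, 10)
)
print(consistent)
```

Sum pooling is MBC because the sum over the full set decomposes into a sum of per-partition sums, whatever the partition or its order.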

An SSE (Bruno et al., 2021) layer works in a similar fashion to the PMA layers of the Set Transformer, using parameterized slots $S \in R^{K \times d}$ as queries $Q$ , and partition $X_{j \in P}$ as keys $K$ and values $V$ , with an attention activation for a single $z_{i \in N}$ which does not depend on the other $N - 1$ elements within the set. SSE uses a sigmoid activation with normalization $σ$ over the slot dimension $K$ (Bruno et al., 2021; Locatello et al., 2020) in the attention matrix $A = σ (Q K^{⊤})$ . Then with $A_{j} \in R^{K \times | X_{j} |}$ and $X_{j} \in R^{| X_{j} | \times d}$

Attention (S, X) = σ (S X^{⊤}) X = \sum_{j = 1}^{P} σ (S X_{j}^{⊤}) X_{j}

thereby allowing any partition scheme, and satisfying Property 3.3. With the prior description in mind, an SSE can be phrased in terms of a PMA such that $SSE (S, X) = \widetilde{SAB} (S, X)$ , with $\widetilde{SAB}$ signifying the use of the slot-normalized sigmoid activation in order to satisfy Property 3.3.
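The decomposition over partitions can be checked numerically. The sketch below assumes the slot-sigmoid activation is a sigmoid of the slot/element dot product divided by its sum over the $K$ slots (our reading of the description above); the function names, sizes, and random data are hypothetical.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sse_attention(S, X):
    """Slot-normalized sigmoid attention: returns a K x d pooled matrix."""
    K, d = len(S), len(X[0])
    Z = [[0.0] * d for _ in range(K)]
    for x in X:
        acts = [sigmoid(sum(a * b for a, b in zip(s, x))) for s in S]
        k_norm = sum(acts)  # normalization over the K slots only, never over N
        for k in range(K):
            for c in range(d):
                Z[k][c] += acts[k] / k_norm * x[c]
    return Z

def add(Z1, Z2):
    """Elementwise sum of two K x d matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(Z1, Z2)]

rng = random.Random(0)
d, K, N = 4, 3, 12
S = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(K)]
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]

Z_full = sse_attention(S, X)
parts = [X[0:5], X[5:6], X[6:12]]
Z_stream = [[0.0] * d for _ in range(K)]
for part in reversed(parts):  # partitions may arrive in any order
    Z_stream = add(Z_stream, sse_attention(S, part))

max_diff = max(abs(a - b) for ra, rb in zip(Z_full, Z_stream) for a, b in zip(ra, rb))
print(max_diff)
```

Each element's attention weight depends only on that element and the slots, so summing per-partition results reproduces the full-set result up to floating-point rounding.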

4 Building a Universally MBC Set Function

Originally, SSE acts as an S2V function, creating an encoded set representation for downstream handling by a task specific decoder. Decoders make different predictions given different representations, therefore Property 3.3 need only be satisfied up to and including the invariant S2V pooling function.

Lemma 4.1.

Let $f^{*}$ and $f$ be arbitrary neural set functions, and let $g$ be an MBC aggregation function in the functional composition $F = f^{*} \circ g \circ f$ . For $F$ to satisfy Property 3.3, it is sufficient to require that the representation $Z = g (f (X_{1}), \dots, f (X_{P}))$ given as input to $f^{*}$ satisfies Property 3.3.

Proof.

Assume that $g \circ f$ satisfies Property 3.3 and the composition $F$ does not satisfy Property 3.3. $g \circ f$ updates $Z$ as new partitions $X_{j}$ arrive, yielding the same input to $f^{*}$ , and therefore the same output of $F$ for any permutation of a random partition of $X$ , contradicting the statement that $F$ does not satisfy Property 3.3. ∎

Put simply, Lemma 4.1 states that every module $f^{*}$ coming after a module which satisfies Property 3.3 will continue to satisfy Property 3.3, even though $f^{*}$ itself may not satisfy Property 3.3. With this established, we can therefore use Lemma 4.1 in order to build a universally MBC set function.

Theorem 4.1 (Universal MBC Set Function (UMBC)).

Let $f$ be a neural S2V function satisfying Property 3.3, $f^{*}$ be an arbitrary unconstrained neural S2V function, and $g$ be an MBC aggregation function. By Lemma 4.1, the composition of functions $F = f^{*} \circ g \circ f$ satisfies Property 3.3.
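A minimal sketch of the composition $F = f^{*} \circ g \circ f$ may help fix ideas. Everything concrete here is an assumption made for illustration: the base `f` is a sigmoid-weighted sum per slot, `g` is an elementwise sum, and the head `f_star` is a single softmax self-attention step over the $K$ slots followed by mean pooling, which on its own would not be MBC.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(S, X):
    """MBC base: per-slot sigmoid-weighted sum of elements -> K x d slots."""
    return [[sum(x[c] / (1.0 + math.exp(-dot(s, x))) for x in X)
             for c in range(len(X[0]))] for s in S]

def g(encodings):
    """MBC aggregation: elementwise sum of the K x d partition encodings."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*encodings)]

def f_star(Z):
    """Unconstrained head: softmax self-attention over slots, then mean pooling."""
    out = []
    for q in Z:
        scores = [dot(q, k) for k in Z]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wi * z[c] for wi, z in zip(w, Z)) / total
                    for c in range(len(q))])
    return [sum(col) / len(out) for col in zip(*out)]

def F(S, partitions):
    return f_star(g([f(S, part) for part in partitions]))

rng = random.Random(0)
d, K, N = 3, 4, 12
S = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(K)]
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]

y_full = F(S, [X])
y_parts = F(S, [X[8:], X[:3], X[3:8]])
agree = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
            for a, b in zip(y_full, y_parts))
print(agree)
```

The head only ever sees the aggregated slots $Z$, so its output matches between full-set and partitioned processing even though it uses self-attention.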

In the simplest setting where the pooled representation is a vector $z$ , $f^{*}$ receives a singleton set as input, which is valid, but may provide limited utility over a Deep Sets style encoder, as $f^{*}$ sees only a single element. SSEs and PMAs, however, output a set $Z \in R^{K \times d}$ . Therefore, using an SSE/PMA layer with $K > 1$ as the base module $f$ , we can view the SSE/PMA layer as a type of invariant feature extractor which takes a set of cardinality $N$ and maps it to a set of cardinality $K$ . Our UMBC has features flowing through the model from the input space $X$ to the output space $Y$ as,

X \to X \in R^{N \times d} \to f (X) \to Φ \in R^{K \times d} \to f^{*} (Φ) \to Y

Maintaining attention normalization over $N$

We now turn to the question of whether or not the constrained attention operation (i.e. avoiding normalization over $N$ in the attention activation) described for $\widetilde{SAB}$ is necessary in order to satisfy Property 3.3.

Proposition 4.1.

By factorizing the normalization constant from the attention matrix softmax, normalization over $N$ can be performed across mini-batched partitions while still satisfying Property 3.3.

Proof.

With $\hat{σ}$ as a softmax, and $exp$ as the elementwise exponential,

PMA (X) = Attention (S, X) = \hat{σ} (S X^{⊤}) X = diag (ζ)^{- 1} exp (S X^{⊤}) X    (1)

where $diag (ζ)$ is a diagonal matrix containing the normalization constants of the softmax function, $ζ_{k} = \sum_{i = 1}^{N} exp (x_{i}^{⊤} s_{k})$ , with $N$ the set cardinality and $s_{k}$ a single slot. Outside of $exp (\cdot)$ , the final multiplication can occur in any order, so we may simply evaluate $exp (S X^{⊤}) X$ , keeping a vector $ζ$ with the sum of the rows of $exp (S X^{⊤})$ . Factoring the attention in this way, we can update $ζ$ and $exp (S X^{⊤}) X$ at the arrival of every partition $X_{j}$ , normalize over $N$ , and still satisfy Property 3.3.

Attention (S, X) = diag (ζ)^{- 1} exp (S X^{⊤}) X = diag (\sum_{j = 1}^{P} ζ_{j})^{- 1} \sum_{j = 1}^{P} exp (S X_{j}^{⊤}) X_{j}    (2)

∎
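The factorization in Proposition 4.1 can be sketched as a streaming update that keeps two running quantities: the unnormalized numerator $exp (S X^{⊤}) X$ and the normalizer $ζ$ . The state layout and function names below are hypothetical, and this plain version does not include any numerical-stability safeguards.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_attention(S, X):
    """Full-batch pooling: softmax over the N elements for each slot."""
    Z = []
    for s in S:
        e = [math.exp(dot(s, x)) for x in X]
        zeta = sum(e)
        Z.append([sum(ei * x[c] for ei, x in zip(e, X)) / zeta
                  for c in range(len(s))])
    return Z

def init_state(K, d):
    """Running numerator exp(S X^T) X and running normalizer zeta."""
    return {"numer": [[0.0] * d for _ in range(K)], "zeta": [0.0] * K}

def update(state, S, X_j):
    """Fold one partition X_j into the running sums."""
    for k, s in enumerate(S):
        for x in X_j:
            e = math.exp(dot(s, x))
            state["zeta"][k] += e
            for c, v in enumerate(x):
                state["numer"][k][c] += e * v
    return state

def finalize(state):
    """Apply diag(zeta)^-1 once, after all partitions have been seen."""
    return [[v / z for v in row] for row, z in zip(state["numer"], state["zeta"])]

rng = random.Random(0)
d, K, N = 4, 3, 20
S = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(K)]
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]

state = init_state(K, d)
for part in (X[13:], X[:6], X[6:13]):  # any partition, any order
    state = update(state, S, part)
Z_stream, Z_full = finalize(state), softmax_attention(S, X)
max_diff = max(abs(a - b) for ra, rb in zip(Z_stream, Z_full) for a, b in zip(ra, rb))
print(max_diff)
```

Only $K \times d + K$ numbers are stored between partitions, independent of $N$ , and the division by $ζ$ is deferred until the pooled output is needed.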

Interestingly, in our ablation study (Figure 5), we find the softmax most effective, which requires the normalization over $N$ as described above. For a note about the numerical stability of the softmax calculated this way, see Appendix F.

function ( $σ$ )	$K$ norm.	$N$ norm.	name	used in
$\frac{1}{1 + e^{- x_{i j}}}$	$\sum_{i = 1}^{\| K \|} σ (x_{i j})$	$1$	slot-sigmoid	Slot Set Encoder Bruno et al. (2021)
$e^{x_{i j}}$	$\sum_{i = 1}^{\| K \|} σ (x_{i j})$	$\sum_{j = 1}^{N} σ (x_{i j})$	slot-softmax	Slot Attention Locatello et al. (2020)
$e^{x_{i j}}$	$1$	$\sum_{j = 1}^{N} σ (x_{i j})$	softmax	Set Transformer Lee et al. (2019)
$e^{x_{i j}}$	$σ (x_{j} - {max}_{i} (x_{j}))$	$\sum_{j = 1}^{N} σ (x_{i j})$	slot-exp	-
$\frac{1}{1 + e^{- x_{i j}}}$	$1$	$\sum_{j = 1}^{N} σ (x_{i j})$	sigmoid	-

Table 1: Various attention activation functions for UMBC layers. $K$ norm. and $N$ norm. refer to the normalization constant over the slots $K$ and instances $N$ , respectively. In cases where there is both a $K$ norm and an $N$ norm, the $K$ norm is performed first.

SSE’s Connection to PMA’s

With the introduction of Proposition 4.1, it is easy to see that the only difference between an SSE and a PMA is the choice of the attention activation function. Indeed any deterministic elementwise function which 1) maps the pre-activation attention matrix to strictly positive values, and 2) has an optional normalization constant over $N$ which can be factored as in Proposition 4.1 is valid and will satisfy Property 3.3. With this in mind, we identify five functions, and explore their performance effects in Figure 5. For the remainder of this work, we will refer to UMBC layers as $UMBC (X)$ , and likewise, models with a base UMBC module are prefixed with UMBC+. A diagram of a UMBC attention can be seen in Figure 9.

Slot Dropout

As outlined in Section 4, our UMBC framework projects a set of cardinality $N$ to a fixed cardinality $K$ . In doing so, there is a unique opportunity where we can treat each slot $k_{i} \in K$ as a Bernoulli random variable, dropping it with probability $p$ (i.e. dropout (Hinton et al., 2012; Gal and Ghahramani, 2016)). This strategy could be useful for faster training due to a reduced set size as input to $f^{*}$ (Figure 18), for combatting overfitting (Appendices J and 7), or for achieving test time ensembling of set representations by sampling multiple dropout masks and averaging the predictions via MC integration (Figure 7) as done by Gal and Ghahramani (2016).
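As a rough sketch of slot dropout, the code below draws one Bernoulli keep/drop decision per slot and removes the dropped rows from the pooled set before it would be passed to $f^{*}$ . The helper names are hypothetical, and the fallback that always keeps at least one slot is our own assumption; the text above does not specify how an all-dropped mask is handled.

```python
import random

def sample_slot_mask(num_slots, p, rng):
    """Draw one keep/drop decision per slot; each slot is dropped with probability p."""
    mask = [rng.random() >= p for _ in range(num_slots)]
    if not any(mask):
        # Assumption for this sketch: never hand an empty set to the head.
        mask[rng.randrange(num_slots)] = True
    return mask

def apply_slot_mask(Z, mask):
    """Keep only the rows (slots) of the K x d pooled set selected by mask."""
    return [z for z, keep in zip(Z, mask) if keep]

rng = random.Random(0)
K, d = 8, 4
Z = [[float(k)] * d for k in range(K)]  # stand-in for a pooled UMBC output

mask = sample_slot_mask(K, p=0.5, rng=rng)
Z_kept = apply_slot_mask(Z, mask)
print(len(Z_kept), "of", K, "slots kept")
```

Because the head receives a set of slots, dropping rows simply shrinks the cardinality of its input, which is where the reduced training cost mentioned above comes from.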

Multiheaded and Parallel Universal Blocks

In addition to multiheaded attention in UMBC, we can also consider multiple parallel layers $UMBC (.) = [{UMBC}_{1} (.), \dots, {UMBC}_{L} (.)]$ , each with independent multiheaded projections, allowing for independent representations of the same input set.

5 Experiments

Metrics & Model Setup

In the following experiments, our aim is to compare the overall effect of the composition $f^{*} (UMBC (.))$ . In these experiments, we could place arbitrarily hard MBC settings on baselines (e.g. streaming settings in Figures 8 and 3). Instead, we compare performance in the full batch setting where a standalone, non-MBC $f^{*}$ performs well in order to analyze any possible downsides and highlight the benefits in choosing a UMBC model over an MBC model like Deep Sets or SSE. In addition to accuracy, we report negative log likelihood (NLL), expected calibration error (ECE) (Guo et al., 2017), and Adjusted Rand Index (ARI) (Hubert and Arabie, 1985; Vinh et al., 2010). Standard settings of all UMBC models follow those shown in Table 5 unless otherwise specified. All models are trained for 5 runs with random initializations, with error bars corresponding to one standard deviation. We use open source code provided by Zaheer et al. (2017); Lee et al. (2019); Kim (2021) where applicable.
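For reference, the ECE of Guo et al. (2017) bins predictions by confidence and takes the weighted average of the gap between accuracy and mean confidence in each bin. The sketch below is one common way to compute it; the function name, the equal-width binning, and the default of 15 bins are assumptions, since the bin count used in these experiments is not stated here. It returns a fraction, whereas the tables report percentages.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE = sum over bins of (bin size / n) * |bin accuracy - bin mean confidence|."""
    n = len(confidences)
    bin_conf = [0.0] * n_bins
    bin_acc = [0.0] * n_bins
    bin_count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bin_conf[b] += conf
        bin_acc[b] += float(hit)
        bin_count[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if bin_count[b]:
            gap = abs(bin_acc[b] / bin_count[b] - bin_conf[b] / bin_count[b])
            ece += bin_count[b] / n * gap
    return ece

# Two predictions at 0.9 confidence (both right) and two at 0.6 (one right):
# each bin has a gap of about 0.1 and holds half of the data.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0], n_bins=10)
print(round(ece, 6))
```

A perfectly calibrated model has zero gap in every bin, so lower values are better.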

Amortized clustering

We perform amortized clustering on a similar Mixture of Gaussians dataset as Lee et al. (2019) (see Appendix B for dataset details). The goal is to maximize the likelihood (Equation 3) of a set with $K$ Gaussian components by predicting the component prior, mean, and variance $f (X) = \{π (X), \{μ_{j} (X), σ_{j} (X)\}_{j = 1}^{K}\}$ . Figures 8 and 3 contain a qualitative example of the task as well as a demonstration of how non-MBC models can fail when used in an MBC setting, considering 4 different streaming settings for the inputs:

\log p (X; θ) = \sum_{i = 1}^{N} \log \sum_{j = 1}^{K} π_{j} \mathcal{N} (x_{i}; μ_{j}, diag (σ_{j}^{2}))    (3)

single point stream $\to$ streams each point in the set one by one. This causes the most severe underperformance by the Set Transformer.
class stream $\to$ streams an entire class at once. The attention modules within Set Transformer cannot compare the input class with any other clusters, thereby degrading performance of Set Transformer.
chunk stream $\to$ streams 8 random points at a time from the dataset, providing limited information to the Set Transformer’s attention.
one each stream $\to$ streams a set consisting of a single instance from each class. Set Transformer can see examples of each class, but with a limited sample size, the encoding fails to make accurate predictions.
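The four streaming settings above can be written as simple generators over a labeled point set. This is only a sketch: the function names, the class ordering, the handling of classes with unequal sizes, and the toy data are our assumptions rather than details taken from the experiment code.

```python
import random
from collections import defaultdict

def group_by_class(points, labels):
    by_class = defaultdict(list)
    for x, y in zip(points, labels):
        by_class[y].append(x)
    return by_class

def single_point_stream(points, labels):
    """Yield each point in the set one by one."""
    for x in points:
        yield [x]

def class_stream(points, labels):
    """Yield all points of one class at a time."""
    by_class = group_by_class(points, labels)
    for y in sorted(by_class):
        yield by_class[y]

def chunk_stream(points, labels, chunk_size=8, seed=0):
    """Yield chunk_size random points at a time."""
    order = list(points)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), chunk_size):
        yield order[start:start + chunk_size]

def one_each_stream(points, labels):
    """Yield chunks holding a single instance from each class."""
    by_class = group_by_class(points, labels)
    longest = max(len(v) for v in by_class.values())
    for i in range(longest):
        yield [v[i] for _, v in sorted(by_class.items()) if i < len(v)]

rng = random.Random(0)
labels = [c for c in range(4) for _ in range(6)]  # 4 classes, 6 points each
points = [(c + rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)) for c in labels]
streams = {
    "single point": list(single_point_stream(points, labels)),
    "class": list(class_stream(points, labels)),
    "chunk": list(chunk_stream(points, labels)),
    "one each": list(one_each_stream(points, labels)),
}
for name, chunks in streams.items():
    print(name, [len(chunk) for chunk in chunks])
```

Each generator yields a partition of the same set, so an MBC encoder should produce an identical pooled representation under all four, while a non-MBC encoder generally will not.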

Figure 4: Top Row: Unmodified set functions. Bottom Row: UMBC+set function. UMBC layers act as a bottleneck and may decrease bottom-line performance over the unmodified function. Note: The best performing model which satisfies Property 3.3 is UMBC+Set Transformer.

We show the effect of different train/test set sizes in Figure 4. Interestingly, the best performing MBC models in Figure 4 are UMBC+(Diff. EM, Set Transformer), giving a concrete example of a task where UMBC can leverage the added power of the Set Transformer to outperform existing MBC models. Note that this is the same task as depicted in Figures 8 and 3, showing that the marginally better bottom line performance of the Set Transformer in Figure 4 disappears in the MBC setting.

Model	MBC	NLL $↓$	ARI $↑$
Oracle	-	1028.22 $\pm$ 1.24	44.09 $\pm$ 0.11
Deep Sets (Zaheer et al., 2017)	✓	531.44 $\pm$ 0.15	6.18 $\pm$ 0.08
SSE (Bruno et al., 2021)	✓	520.29 $\pm$ 0.63	22.91 $\pm$ 1.85
Diff. EM	✗	524.74 $\pm$ 0.38	13.22 $\pm$ 0.16
Set Transformer (Lee et al., 2019)	✗	512.59 $\pm$ 0.33	17.13 $\pm$ 3.67
UMBC+Diff. EM	✓	518.56 $\pm$ 0.92	13.04 $\pm$ 0.45
UMBC+Set Transformer	✓	503.89 $\pm$ 0.87	23.68 $\pm$ 1.85

Table 2: Amortized Clustering on ImageNet features extracted with a pre-trained ResNet50. Diff. EM showed some instability on the ImageNet clustering task and failed to converge for one run; its variance is therefore reported on 4/5 runs.

We extended the amortized clustering to ImageNet (Deng et al., 2009). We used features extracted from a pretrained, frozen ResNet50 (He et al., 2016) model, and then projected them to a lower dimension via a random matrix (see details in Appendix E). Results for ImageNet clustering can be seen in Table 2. The Oracle entry in Table 2 is the NLL and ARI, using the actual prior, empirical mean, and diagonal covariance of each class cluster. As in the toy clustering task, UMBC+Set Transformer performs well, and even outperforms all other models. To account for UMBC’s added parameters, we included UMBC on the baseline MBC models in Table 6, and UMBC+Set Transformer still shows the best performance.

Ablation Study

Using the mixture of Gaussians dataset, we evaluate various aspects of UMBC layers in Figure 5. Of the activation functions identified as valid in Section 4, we found that the traditional softmax used in attention performs the best. We use this activation in all other experiments. In agreement with Bruno et al. (2021), we find that treating the slots as a Gaussian random variable leads to a better overall result. We learn the slots with reparameterization, outlined in Appendix H. We find that layernorm on the post-attention linear layer and residual connections on the slots, before the FF layer (like the PMA layers of Lee et al. (2019)), are beneficial. Increasing the number of slots (the cardinality of the input to $f^{*}$ ) helps up to a point and then shows an overfitting effect from overparameterization. We used these settings to inform our base settings given in Table 5.

The effect of test time MC sampling of these Bernoulli slots can be seen in Figure 5 (bottom right). Empirically, on the MoG task, we found that using no slot dropout at test time ultimately led to the best performance, which we think is likely due to the fact that the MoG dataset has an infinite number of instances and is therefore extremely resistant to overfitting. Using dropout on the ModelNet40 dataset (Figure 7), which is prone to overfitting, led to better results on all metrics. Note that Monte Carlo sampling the slots at test time does not violate Property 3.3 as long as dropout noise is pre-sampled at the beginning of a mini-batch sequence, and applied in the same way to each partition.
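The condition on pre-sampled dropout noise can be illustrated with a toy check: a mask is drawn once per mini-batch sequence, reused for every partition, and the streamed prediction is compared against the full-set prediction under the same mask. The pooling function, the scalar `head`, and the keep-at-least-one-slot fallback are all assumptions made for this sketch.

```python
import math
import random

def pool(S, X, mask):
    """Sigmoid-weighted sum pooling using only the slots kept by mask."""
    kept = [s for s, keep in zip(S, mask) if keep]
    Z = [[0.0] * len(X[0]) for _ in kept]
    for x in X:
        for k, s in enumerate(kept):
            w = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(s, x))))
            for c, v in enumerate(x):
                Z[k][c] += w * v
    return Z

def head(Z):
    """Toy decoder: mean over all slot features, giving a scalar prediction."""
    return sum(sum(z) for z in Z) / (len(Z) * len(Z[0]))

def sample_mask(K, p, rng):
    mask = [rng.random() >= p for _ in range(K)]
    if not any(mask):  # assumption: always keep at least one slot
        mask[rng.randrange(K)] = True
    return mask

rng = random.Random(0)
K, d, N, p = 6, 3, 12, 0.5
S = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(K)]
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
parts = [X[:4], X[4:9], X[9:]]

preds, max_gap = [], 0.0
for _ in range(5):                      # 5 Monte Carlo dropout samples
    mask = sample_mask(K, p, rng)       # sampled once per mini-batch sequence
    Z = None
    for part in parts:                  # the same mask is reused for every partition
        Z_j = pool(S, part, mask)
        Z = Z_j if Z is None else [[a + b for a, b in zip(r, rj)]
                                   for r, rj in zip(Z, Z_j)]
    pred_stream = head(Z)
    max_gap = max(max_gap, abs(pred_stream - head(pool(S, X, mask))))
    preds.append(pred_stream)
mc_prediction = sum(preds) / len(preds)
print(max_gap, mc_prediction)
```

If a fresh mask were drawn for each partition instead, different partitions would update different slots and the aggregated representation would no longer match the full-set result.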

Figure 5: Ablation study, analyzing the effects of different settings within the UMBC module. These experiments were performed on the MoG dataset with the UMBC+Set Transformer model.

		Accuracy $↑$			NLL $↓$			ECE $↓$
Model	MBC	100	1000	2048	100	1000	2048	100	1000	2048
Deep Sets (Zaheer et al., 2017)	✓	65.37 $\pm$ 1.07	88.35 $\pm$ 0.32	88.72 $\pm$ 0.21	1.57 $\pm$ 0.03	0.40 $\pm$ 0.01	0.40 $\pm$ 0.01	17.38 $\pm$ 0.95	4.21 $\pm$ 0.27	4.02 $\pm$ 0.16
SSE (Bruno et al., 2021)	✓	71.09 $\pm$ 0.51	87.85 $\pm$ 0.39	87.92 $\pm$ 0.42	1.42 $\pm$ 0.10	0.52 $\pm$ 0.05	0.51 $\pm$ 0.06	16.69 $\pm$ 1.11	5.93 $\pm$ 1.06	5.88 $\pm$ 1.17
Diff-EM (Kim, 2021)	✗	62.67 $\pm$ 1.21	86.08 $\pm$ 0.12	86.86 $\pm$ 0.36	2.40 $\pm$ 0.11	0.71 $\pm$ 0.02	0.69 $\pm$ 0.03	22.16 $\pm$ 0.93	5.15 $\pm$ 0.11	4.96 $\pm$ 0.28
Set Transformer (Lee et al., 2019)	✗	74.21 $\pm$ 1.67	87.81 $\pm$ 0.44	88.17 $\pm$ 0.32	1.76 $\pm$ 0.08	0.79 $\pm$ 0.08	0.78 $\pm$ 0.08	17.12 $\pm$ 0.46	7.48 $\pm$ 0.62	7.37 $\pm$ 0.54
UMBC+Diff-EM	✓	67.07 $\pm$ 1.67	86.22 $\pm$ 1.23	86.37 $\pm$ 1.03	1.61 $\pm$ 0.12	0.58 $\pm$ 0.06	0.57 $\pm$ 0.05	13.97 $\pm$ 1.51	4.32 $\pm$ 1.37	4.38 $\pm$ 1.27
UMBC+Set Transformer	✓	71.18 $\pm$ 1.52	86.56 $\pm$ 0.49	86.77 $\pm$ 0.29	1.23 $\pm$ 0.15	0.53 $\pm$ 0.03	0.51 $\pm$ 0.03	10.37 $\pm$ 2.24	2.60 $\pm$ 0.19	2.35 $\pm$ 0.24

Table 3: Point cloud classification on ModelNet40. All models are trained on a set size of 1000 randomly sampled points, and evaluated on 100, 1000, and 2048 (max) test set sizes.

Figure 6: Expected Calibration Error on ModelNet40-C which contains 15 corruptions at 5 different intensity levels. ‘Test’ corresponds to the uncorrupted test set. See Figures 14 and 13 for Accuracy and NLL and Figures 16, 17 and 15 for results on individual corruptions.

Point cloud classification

We perform set classification experiments on ModelNet40 (Wu et al., 2015) and analyze the robustness of different set encoders to dataset shifts and varying test-time set sizes using ModelNet40-C (Ren et al., 2022), which contains 15 corruptions at 5 levels of intensity. Our experiments use the version of ModelNet40 and ModelNet40-C used by Ren et al. (2022), which contains 2048 points sampled from the original ModelNet40 (Wu et al., 2015) CAD models. Results are presented in Table 3. Overall, compared with MBC baseline models, we witnessed a marginal $\approx 1\%$ decrease in accuracy for the 1000 and 2048 test set sizes, and mixed increases/decreases in accuracy for the 100 set size experiments. In terms of ECE, ‘UMBC+’ models outperform all baseline models. This improvement in ECE can be partly attributed to MC sampling slots at test time (see Figures 12, 11 and 7) and partly to Slot Dropout at train time (see Table 7).

Figure 7: Performing Monte Carlo Dropout on UMBC+Set Transformer slots leads to improvements in accuracy, NLL, and ECE. The top row corresponds to a 0% dropout rate and is constant over dropout sample sizes. Experiment uses ModelNet40 with test set size of 100. Figures for set sizes 1000 and 2048 can be found in Figures 12 and 11.

ModelNet40-C results can be seen in Figure 6. UMBC+ models give strong ECE performance at all test set sizes, improving over baselines. The improvement is most pronounced for UMBC+Set Transformer at test set size 100, where the baseline models are most miscalibrated.

6 Conclusion

In this work, we have shown that by composing a set function from a mini-batch consistent base $f$ and an arbitrary set function head $f^{*}$ , we can make the composition $F = f^{*} \circ f$ universally mini-batch consistent. We have provided a proof (Theorem 4.1), experiments (Figure 2), and unit tests (included in the supplementary file) which support our assertions. Likewise, we have loosened the known constraints on the structure of the SSE (Proposition 4.1), establishing an equivalency to the PMA layers of the Set Transformer. We have demonstrated that there are cases where a UMBC $F$ outperforms previous simpler MBC models, and explored an interesting dropout strategy which is made possible by our architecture and which improved the calibration and NLL of UMBC. As the field of set functions continues to widen, we look forward to seeing future research in the area of MBC set functions.

References

C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In International conference on machine learning, pp. 1613–1622. Cited by: Appendix M, Appendix H.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901. Cited by: §2.
A. Bruno, J. Willette, J. Lee, and S. J. Hwang (2021) Mini-batch consistent slot set encoder for scalable set encoding. Advances in Neural Information Processing Systems 34. Cited by: Table 7, Appendix M, Appendix G, Appendix H, Figure 2, §1, §1, §2, §3, §3, Table 1, §5, Table 2, Table 3.
X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1907–1915. Cited by: §2.
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §5.
Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §4.
Y. Gal, J. Hron, and A. Kendall (2017) Concrete dropout. Advances in neural information processing systems 30. Cited by: Appendix M.
C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §2, §5.
K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.
D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261. Cited by: §2.
G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §4.
L. Hubert and P. Arabie (1985) Comparing partitions. Journal of Classification 2, pp. 193–218. Cited by: §5.
A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira (2021) Perceiver: general perception with iterative attention. In International conference on machine learning, pp. 4651–4664. Cited by: §2, §2.
M. Kim (2021) Differentiable expectation-maximization for set representation learning. In International Conference on Learning Representations. Cited by: Table 5, §1, §2, §2, §5, Table 3.
J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh (2019) Set transformer: a framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753. Cited by: Appendix J, Table 7, Appendix B, Appendix G, Appendix G, Table 5, Appendix I, Figure 2, §1, §1, §2, §2, §3, Table 1, §5, §5, §5, Table 2, Table 3.
F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf (2020) Object-centric learning with slot attention. Advances in Neural Information Processing Systems 33, pp. 11525–11538. Cited by: Appendix N, Appendix H, §3, Table 1.
G. Mialon, D. Chen, A. d’Aspremont, and J. Mairal (2020) A trainable optimal transport embedding for feature aggregation and its relationship to attention. arXiv preprint arXiv:2006.12065. Cited by: §2, §2.
M. Minderer, J. Djolonga, R. Romijnders, F. Hubis, X. Zhai, N. Houlsby, D. Tran, and M. Lucic (2021) Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems 34. Cited by: §2.
R. L. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro (2018) Janossy pooling: learning deep permutation-invariant functions for variable-size inputs. arXiv preprint arXiv:1811.01900. Cited by: §2, §2.
Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, and J. Snoek (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems 32. Cited by: §2.
A. Rahimi and B. Recht (2007) Random features for large-scale kernel machines. Advances in neural information processing systems 20. Cited by: Appendix E.
J. Ren, L. Pan, and Z. Liu (2022) Benchmarking and analyzing point cloud classification under corruptions. arXiv preprint arXiv:2202.03377. Cited by: Appendix J, §2, §5.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: 2nd item, §1, §3.
N. X. Vinh, J. Epps, and J. Bailey (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. The Journal of Machine Learning Research 11, pp. 2837–2854. Cited by: §5.
Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §5.
M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. Advances in neural information processing systems 30. Cited by: Appendix J, Table 7, Appendix G, Appendix G, Table 5, Figure 2, §1, §1, §2, §3, §3, §3, §5, Table 2, Table 3.
Y. Zhang, J. Hare, and A. Prügel-Bennett (2019) Fspool: learning set representations with featurewise sort pooling. arXiv preprint arXiv:1906.02795. Cited by: §2, §2.
S. K. Zhou, H. Greenspan, C. Davatzikos, J. S. Duncan, B. Van Ginneken, A. Madabhushi, J. L. Prince, D. Rueckert, and R. M. Summers (2021) A review of deep learning in medical imaging: imaging traits, technology trends, case studies with progress highlights, and future promises. Proceedings of the IEEE. Cited by: §2.

Appendix A Appendix

We will briefly describe the contents of each section of this appendix below:

Appendix B: Extra information and results related to MoG Amortized Clustering.
Appendix C: Details of the experiment depicted in Figure 2.
Appendix D: A note on MBC testing of the Set Transformer.
Appendix E: Details on ImageNet amortized clustering.
Appendix F: A note on the UMBC attention softmax stability.
Appendix G: Training parameters/setup.
Appendix H: Model hyperparameters/setup.
Appendix I: Additional ablation study results/discussion.
Appendix J: Additional results/discussion for ModelNet40 experiments.
Appendix K: Extra results augmenting MBC models with UMBC.
Appendix L: Societal Impacts.
Appendix M: Limitations and Future Work.
Appendix N: Attention Activation Effects on Calibration.

Appendix B Details on the Mixture of Gaussians Amortized Clustering Experiment

We used a modified version of the MoG amortized clustering dataset which was used by Lee et al. [2019]. We modified the experiment, adding random variance into the procedure in order to make a more difficult dataset. Specifically, to sample a single task for a problem with $K$ classes:

Sample the set size for the batch $N \sim U (\text{train set size} / 2, \text{train set size})$.
Sample class priors $π \sim \text{Dirichlet} ([1_{1}, \ldots, 1_{K}])$.
Sample class labels $z_{i} \sim \text{Categorical} (π)$ for $i = 1, \ldots, N$.
Generate cluster centers $μ_{i, j} \sim U (- 4, 4)$ for $i = 1, \ldots, K$ and $j = 1, 2$.
Generate cluster covariances $σ_{i, j} \sim U (0.3, 0.6)$ for $i = 1, \ldots, K$ and $j = 1, 2$. Then make a covariance matrix $Σ_{i}$ for each class with $σ_{i}$ as the diagonal.
Sample data $x_{i j} \sim N (μ_{i}, Σ_{i})$.

In our MoG experiments, we set $K = 4$ .
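The sampling steps listed above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions, not the code used for the experiments: the function name `sample_mog_task`, the default `train_set_size=512`, and the discrete uniform over set sizes are hypothetical choices. We also assume that each point is drawn from the Gaussian of its sampled label $z_{i}$, and that $σ$ holds the diagonal variances, so the standard deviation is its square root.

```python
import math
import random


def sample_mog_task(train_set_size=512, num_classes=4, rng=None):
    """Sample one MoG clustering task following the listed steps (illustrative sketch)."""
    rng = rng or random.Random()

    # Set size N ~ U(train_set_size / 2, train_set_size); a discrete uniform is assumed.
    n = rng.randint(train_set_size // 2, train_set_size)

    # Class priors pi ~ Dirichlet(1, ..., 1), obtained by normalizing Gamma(1, 1) draws.
    gammas = [rng.gammavariate(1.0, 1.0) for _ in range(num_classes)]
    total = sum(gammas)
    priors = [g / total for g in gammas]

    # Class labels z_i ~ Categorical(pi) for i = 1, ..., N.
    labels = rng.choices(range(num_classes), weights=priors, k=n)

    # Cluster centers mu ~ U(-4, 4) and diagonal covariance entries sigma ~ U(0.3, 0.6).
    means = [[rng.uniform(-4.0, 4.0) for _ in range(2)] for _ in range(num_classes)]
    variances = [[rng.uniform(0.3, 0.6) for _ in range(2)] for _ in range(num_classes)]

    # Each point is drawn from the Gaussian of its label (assumed reading of the last step).
    points = [
        [rng.gauss(means[z][j], math.sqrt(variances[z][j])) for j in range(2)]
        for z in labels
    ]
    return points, labels, priors, means, variances
```

Passing a seeded `random.Random` instance as `rng` makes a sampled task reproducible.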

The Motivational Example in Figure 3 also used the MoG dataset, and utilized mini-batch testing of the Set Transformer corresponding to the procedure outlined in Appendix D.

Appendix C Measuring the Variance of Pooled Features

In Figure 2, we show the direct quantitative effect on the pooled representation when using the original Set Transformer and when using the Set Transformer with our UMBC module added (UMBC+Set Transformer). The UMBC model variance is always effectively 0, while the Set Transformer gives different results for different set partition chunk sizes. The downward slope of the Set Transformer line can be explained by the fact that, as the chunk size gets larger, the pooled representation becomes closer to that of the full set. The procedure for mini-batch testing of the Set Transformer is outlined in Appendix D.

To perform this experiment, we used a randomly initialized model with 128 hidden units, and sampled a random normal input with a set size of 1024, $X \in R^{1 \times 1024 \times 128}$. We then created 100 random permutations of the set elements of the input and split each permutation into partitions with various chunk sizes $C_{i}$, where the cardinality $| C_{i} | \in \{2^{i}\}_{i = 1}^{6}$. We then encode the whole set for each chunk size and report the observed variance between the 100 different random partitions at the various chunk sizes in Figure 2. Note that the encoded set representation is a vector, whereas Figure 2 shows a scalar value. To obtain this scalar, we take the feature-wise variance over the 100 encodings and report the mean over the features. Specifically, with $Z \in R^{100 \times 128}$ representing all 100 encodings, $z = \text{var} (Z)$, with $z \in R^{128}$. We then obtain the $y$ values in Figure 2 by a simple mean over the feature dimension,

$$y = \frac{1}{128} \sum_{i = 1}^{128} z_{i} \tag{4}$$
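As a minimal sketch of the reduction in Equation 4, the function below takes a list of encodings (one vector per random partition), computes the variance of each feature across the encodings, and averages over the features. The name `pooled_feature_variance` is hypothetical, and the use of the population variance (dividing by the number of encodings rather than by one less) is an assumption, since the normalizer of $\text{var} (\cdot)$ is not specified above.

```python
def pooled_feature_variance(encodings):
    """Mean over features of the feature-wise variance across encodings.

    encodings: list of equal-length vectors, one per random partition.
    """
    num_runs = len(encodings)
    num_features = len(encodings[0])
    total = 0.0
    for f in range(num_features):
        column = [enc[f] for enc in encodings]
        mean = sum(column) / num_runs
        # Population variance of this feature (assumed normalizer: num_runs).
        total += sum((v - mean) ** 2 for v in column) / num_runs
    # Simple mean over the feature dimension, as in Equation 4.
    return total / num_features
```

A model whose encoding does not depend on the partition produces identical rows, for which this quantity is exactly zero.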

Appendix D A Note on MBC Testing of the Set Transformer

In some illustrative experiments (Figures 2 and 3), we apply mini-batch testing to the Set Transformer to study the effects of using a non-MBC model in an MBC setting. The original work does not prescribe a way to do this, so we took the approach of processing each chunk up until the pooled representation that results from the PMA layer. We then performed a mean pooling operation over the chunks in the following way, with $Z$ representing the final mini-batch pooled features,

$$Z = \frac{1}{N} \sum_{j = 1}^{P} \text{PMA} (X_{j}) \tag{5}$$
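To illustrate why this kind of chunk-wise pooling is not mini-batch consistent, the sketch below pools each chunk with a softmax attention over a single seed vector and then averages the pooled chunk features. It is a deliberately simplified stand-in, not the Set Transformer's PMA layer: `attention_pool` and `minibatch_pool` are hypothetical names, there are no learned projections, and dividing by the number of chunks is an assumption about the normalizer of the mean pooling.

```python
import math


def attention_pool(seed, chunk):
    """Softmax attention pooling of `chunk` with one seed vector (simplified PMA stand-in)."""
    logits = [sum(s * x for s, x in zip(seed, item)) for item in chunk]
    top = max(logits)  # the max is only available within the current chunk
    weights = [math.exp(logit - top) for logit in logits]
    norm = sum(weights)
    dim = len(chunk[0])
    return [sum(w * item[d] for w, item in zip(weights, chunk)) / norm for d in range(dim)]


def minibatch_pool(seed, items, chunk_size):
    """Pool each chunk independently, then average the pooled chunk features."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    pooled = [attention_pool(seed, chunk) for chunk in chunks]
    dim = len(pooled[0])
    # Assumed normalizer: the number of chunks.
    return [sum(p[d] for p in pooled) / len(pooled) for d in range(dim)]
```

With a single chunk covering the whole set, `minibatch_pool` coincides with `attention_pool`. With smaller chunks, the softmax is normalized within each chunk, so the result changes with both the chunk size and the order of the elements.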

Appendix E Details on the ImageNet Amortized Clustering Experiment

Version	ARI
$z_{i} \in R^{2048}$	45.93 $\pm$ 0.12
$g (z_{i}) \in R^{512}$	44.09 $\pm$ 0.11

For the ImageNet amortized clustering experiment outlined in Section 5, we first extracted the features from the last hidden representation, before the final linear classifier layer, of the pretrained and frozen ResNet50. These features $x_{i} \in R^{2048}$ are of a large dimension, which would create excessively large linear layers for this experiment. Therefore, we projected the features down to a lower dimension $\hat{x} \in R^{512}$ using a random orthogonal Gaussian matrix. As this random Gaussian projection is suitable for random feature kernels [Rahimi and Recht, 2007], it should preserve the distances between points required for effective clustering, with only a marginal effect on overall clustering performance. To validate this assumption, we ran the Oracle model (which computes the empirical cluster mean and diagonal covariance) on both the original features $x$ and the projected features $g (\hat{x})$ and present the results in the table above.
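One possible way to build such a projection is sketched below: draw Gaussian rows and orthonormalize them with Gram-Schmidt. This construction is an assumption made for illustration, since the exact procedure used to generate the matrix is not described here, and the names `random_orthogonal_projection` and `project` are hypothetical. The experiment would correspond to `in_dim=2048` and `out_dim=512`.

```python
import math
import random


def random_orthogonal_projection(in_dim, out_dim, rng=None):
    """Return out_dim orthonormal rows of length in_dim built from Gaussian draws."""
    if out_dim > in_dim:
        raise ValueError("out_dim must not exceed in_dim")
    rng = rng or random.Random()
    rows = []
    while len(rows) < out_dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(in_dim)]
        # Gram-Schmidt: remove the components along the rows accepted so far.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def project(matrix, x):
    """Apply the projection: one dot product per output dimension."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]
```

Generating the matrix once from a seeded generator and reusing it for every feature vector mirrors the idea of saving the projection before processing the dataset.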

To construct the ImageNet dataset, we first initialized and saved the random Gaussian projection matrix, and proceeded to process the entire ImageNet1k training set with the saved matrix. From these extracted and projected features, we chose a fixed 80/20 split for our train/test sets. Class indices for the train/test sets can be found in the supplementary file.

Appendix F Numerical Stability of MBC Softmax Attention Activation

Numerical stability of the softmax requires that the values are not allowed to overflow. Generally this is done by subtracting the maximum value from all softmax logits which allows a stable and equivalent computation.

$$\frac{e^{x - \max (x)}}{\sum_{x^{'} \in x} e^{x^{'} - \max (x)}} = \frac{e^{x} e^{- \max (x)}}{e^{- \max (x)} \sum_{x^{'} \in x} e^{x^{'}}} = \frac{e^{x}}{\sum_{x^{'} \in x} e^{x^{'}}} \tag{6}$$

This poses a problem when using the plain softmax attention activation, as the $\max (\cdot)$ in Equation 6 requires a max over the whole set of $N$ items, which cannot be known from the current mini-batch alone.

Originally, we had devised a special conditional update rule which would maintain the same form as in Equation 6 by tracking the overall max of each row of the attention matrix and then conditionally updating either the current $A$ and $ζ$ or the previously stored values from the last processed partition. Those updates needed to be calculated in the exponential space, which caused a propagation of numerical errors through the network that became large enough to interfere with inference. In our experiments, we found it sufficient to calculate the softmax as a simple exponential activation with a subsequent sum over $N$, with no consideration for numerical stability. If numerical stability is a concern, one could also set a hyperparameter $λ$ for the model such that the softmax is calculated with an exponential function such as $e^{z_{i} - λ}$, which should provide a reasonable solution.
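The exponential-then-sum computation, including the optional fixed shift $λ$, can be sketched as follows. This is a simplified single-seed illustration rather than the UMBC layer itself: `streaming_exp_attention` is a hypothetical name, there are no learned projections, and the argument `shift` plays the role of $λ$. Because the shift is the same constant for every element, it cancels in the final ratio, so it does not need to be recomputed per mini-batch.

```python
import math


def streaming_exp_attention(seed, chunks, shift=0.0):
    """Accumulate exp-weighted sums over chunks and normalize once at the end.

    Every logit z is mapped to exp(z - shift); `shift` is a fixed constant
    (the hyperparameter lambda), not a max computed from the data.
    """
    dim = len(seed)
    numerator = [0.0] * dim  # running sum of exp(z_i - shift) * x_i
    denominator = 0.0        # running sum of exp(z_i - shift)
    for chunk in chunks:
        for item in chunk:
            logit = sum(s * x for s, x in zip(seed, item))
            weight = math.exp(logit - shift)
            denominator += weight
            for d in range(dim):
                numerator[d] += weight * item[d]
    # The normalization over all N elements happens only after the last chunk.
    return [n / denominator for n in numerator]
```

Since only two running sums are carried between chunks, the output is the same for any partition of the set, and changing `shift` alters the result only through floating-point rounding.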

Appendix G Training Specification

We use no L2 regularization, except for the ModelNet40 experiments, which use a small weight decay of 1e-7. This setting was taken from previous experiments by Lee et al. [2019] and Zaheer et al. [2017], which used dropout before and after the pooling layers and other regularization strategies such as gradient clipping to avoid overfitting.

The only experiments which utilized any kind of data augmentation were the ModelNet40 experiments, which used random rotations of the point cloud, as is common in preceding work [Zaheer et al., 2017, Lee et al., 2019, Bruno et al., 2021].

All single runs of all of our experiments were able to fit on a single GPU with 12GB of memory.

	Experiments
Setting	MoG	ImageNet	ModelNet40
Optimizer	Adam	Adam	Adam
Learning Rate	1e-3	1e-3	1e-3
Data Augmentation	✗	✗	✓
Epochs	50	50	1000
Iters/Epoch	1000	1000	9840

Table 4: The training setup for all of our experiments involving UMBC modules.

Appendix H Universal Model Specification

Figure 9: The architecture of a UMBC layer. $A^{*}$ represents the unnormalized attention matrix $σ (S X_{i}^{⊤}) X_{i}$ discussed in Proposition 4.1 and ‘MBC Sum’ represents the summation in Equation 2.

Unless otherwise specified, all universal modules were run with the model hyperparameter settings in Table 5. The settings for the MoG dataset apply to the experiments in Figure 4, while Figure 5 studies the effects of changing individual settings.

	Experiments
Setting	MoG	ImageNet	ModelNet40
Embedder	✓	✗	✓
Hidden dim	128	256	256
Num. Slots Per Parallel UMBC	128	32	64
Slot-type	random	random	random
Slot LayerNorm	✓	✓	✓
FF LayerNorm	✓	✓	✓
Heads	4	4	4
Slot Dropout Prob.	0%	50%	50%
Attention Activation	softmax	softmax	softmax
Slot Residual	✓	✓	✓
UMBC Num. Parallel	1	4	4
Test MC Samples	10	100	10

Table 5: The model hyperparameter setup for all of our experiments involving UMBC modules. The hyperparameters were chosen as sensible defaults based on previous architectures in Lee et al. [2019], Zaheer et al. [2017], and Kim [2021].

Slots

Different from both Locatello et al. [2020] and Bruno et al. [2021], we use unique initial slot parameters for each slot, such that the set of slots $S \in R^{K \times d}$ has a separate parameter for each of the $K$ slots. We do this because the original Slot Attention of Locatello et al. [2020] used a GRU in an inner loop to adapt the single general slot into specific slots for a given task, forcing them to ‘compete’ to capture different parts of the input. We cannot use a GRU, as it violates Property 3.3, so we instead let each slot learn to adapt to the overall data distribution. We always used the same dimension for inputs $X$ and slots $S$.

Random Slots

To initialize the random Gaussian slots, we use an initialization strategy similar to that of Blundell et al. [2015] and initialize $μ \sim U [- 0.2, 0.2]$ and $\log σ \sim U [- 5.0, - 4.0]$. During training, we sample the distribution with reparameterization $s_{k} = μ_{k} + σ_{k} \odot ϵ_{k}$ with $ϵ_{k} \sim N (0, I_{d})$.
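The initialization and reparameterized sampling described above can be sketched as follows. The function names `init_slot_params` and `sample_slots` are hypothetical, and plain Python lists are used in place of trainable parameters, so this only illustrates the sampling arithmetic and is not a drop-in training component.

```python
import math
import random


def init_slot_params(num_slots, dim, rng):
    """Initialize per-slot mean and log standard deviation with the stated uniform ranges."""
    mu = [[rng.uniform(-0.2, 0.2) for _ in range(dim)] for _ in range(num_slots)]
    log_sigma = [[rng.uniform(-5.0, -4.0) for _ in range(dim)] for _ in range(num_slots)]
    return mu, log_sigma


def sample_slots(mu, log_sigma, rng):
    """Reparameterized sample s_k = mu_k + sigma_k * eps_k with eps_k ~ N(0, I)."""
    return [
        [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu_k, ls_k)]
        for mu_k, ls_k in zip(mu, log_sigma)
    ]
```

With $\log σ$ initialized in $[- 5.0, - 4.0]$, the standard deviation starts below roughly 0.02, so the sampled slots initially stay close to their means.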

Embedder

We found it useful to place a single layer embedding function at the base of UMBC modules which consists of a single linear layer and a ReLU activation function. We used this embedder in all experiments except the ImageNet amortized clustering, as the ResNet feature extractor acted as the embedding function in this case.

Appendix I Additional Ablation Results

Figure 10: Ablation study on the number of attention heads in UMBC layers.

In addition to the results in Figure 5, we also ran an experiment looking at the effect of the number of attention heads in the UMBC layer (Figure 10). This result was uninformative, so we chose a stock setting of 4 attention heads in our experiments, as was common in the experiments performed by Lee et al. [2019].

Appendix J Additional ModelNet/ModelNet-C Results

Table 7 shows extra results from the ModelNet point cloud classification task. In this table, we include results for ‘UMBC+SSE’ and ‘UMBC+Deep Sets’ for completeness. While there is a slight decrease in accuracy for both ‘UMBC+SSE’ and ‘UMBC+Deep Sets,’ UMBC improves SSE in terms of NLL and ECE while lowering the performance of Deep Sets. This seems to generally agree with the results in Figure 4, indicating that it is likely unhelpful to add a UMBC $f$ to an already MBC $f^{*}$ , and instead the model $f^{*}$ should be chosen according to the given task first, and then UMBC considered if MBC treatment will be necessary.

Figure 11: Performing Monte Carlo Dropout on UMBC+Set Transformer slots leads to improvements in accuracy, NLL, and ECE. The top row corresponds to a 0% dropout rate and is constant over dropout sample sizes. The experiment uses ModelNet40 with a test set size of 1000.

Figure 12: Performing Monte Carlo Dropout on UMBC+Set Transformer slots leads to improvements in accuracy, NLL, and ECE. The top row corresponds to a 0% dropout rate and is constant over dropout sample sizes. The experiment uses ModelNet40 with a test set size of 2048.

ModelNet40 is prone to overfitting, and previous experiments in Deep Sets [Zaheer et al., 2017] and Set Transformer [Lee et al., 2019] have used Dropout layers both before and after the pooling function in their encoders. To evaluate the regularization effect of our dropout strategy, the last block of Table 7 includes UMBC models trained without dropout. Training without dropout generally lowers test set performance in all metric categories.

For examples of the corrupted point clouds, we refer the reader to the original work which proposed ModelNet40-C [Ren et al., 2022]. In Figures 13 and 14 we provide additional boxplots for the accuracy and NLL metrics, which correspond to the ECE metric reported in Figure 6. In Figures 15, 16 and 17 we provide boxplots for each individual corruption on accuracy, ECE, and NLL, respectively. The aggregate of all of these datapoints forms the boxplots seen in Figures 6, 13 and 14. Size is reduced to avoid excessive page length. Best viewed on screen with a high zoom.

Figure 13: Accuracy across all corruptions in the ModelNet40-C dataset. This figure corresponds to the NLL and ECE results presented in Figures 14 and 6, respectively.

Figure 14: NLL across all corruptions in the ModelNet40-C dataset. This figure corresponds to the accuracy and ECE results presented in Figures 13 and 6, respectively.

Figure 15: Accuracy boxplots for individual ModelNet40-C test results. Size is minimized to avoid excessive page length. Best viewed on screen with high zoom.

Figure 16: ECE boxplots for individual ModelNet40-C test results. Size is minimized to avoid excessive page length. Best viewed on screen with high zoom.

Figure 17: NLL boxplots for individual ModelNet40-C test results. Size is minimized to avoid excessive page length. Best viewed on screen with high zoom.

Figure 18: Because of the model structure, higher slot dropout rates correspond to faster training times, and smaller set sizes as input to the subsequent set encoder modules. A dropout rate of $p = 0.5$ in the function $f$ will, in expectation, deliver a set size of $K / 2$ to the subsequent function $f^{*}$ . This figure was generated from a UMBC+Set Transformer model with 128 hidden units and input $x \in R^{32 \times 200 \times d}$ . The plotted line shows mean and standard deviation for 250 iterations at each $p \in [1, 99]$ . As a safeguard against unstable training, we ensure that at least one slot remains after dropout is applied.
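The behaviour described in the caption of Figure 18 can be illustrated with a short sketch. We assume here that each slot is dropped independently with probability $p$, and that the safeguard keeps one randomly chosen slot when every slot would otherwise be dropped. Both choices, as well as the name `slot_dropout`, are assumptions made for illustration and may differ from the actual implementation.

```python
import random


def slot_dropout(slots, p, rng):
    """Drop each slot independently with probability p, keeping at least one slot."""
    kept = [s for s in slots if rng.random() >= p]
    if not kept:
        # Safeguard (assumed form): retain one randomly chosen slot if all were dropped.
        kept = [rng.choice(slots)]
    return kept
```

Under this assumption, $K = 64$ slots with $p = 0.5$ pass on average about 32 slots to the subsequent set encoder, which is why higher dropout rates reduce the cost of the layers that follow.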

Appendix K Adding the UMBC Module to Existing MBC Functions

Model	NLL $↓$	ARI $↑$
Oracle	1028.22 $\pm$ 1.24	44.09 $\pm$ 0.11
Deep Sets	531.44 $\pm$ 0.15	6.18 $\pm$ 0.08
SSE	520.29 $\pm$ 0.63	22.91 $\pm$ 1.85
Set Transformer	512.59 $\pm$ 0.33	17.13 $\pm$ 3.67
UMBC+Deep Sets	532.87 $\pm$ 0.69	6.22 $\pm$ 0.18
UMBC+SSE	544.67 $\pm$ 3.64	16.59 $\pm$ 1.26
UMBC+Set Transformer	503.89 $\pm$ 0.87	23.68 $\pm$ 1.85

Table 6: Amortized Clustering on ImageNet features extracted with a pre-trained ResNet50.

	Accuracy $↑$			NLL $↓$			ECE $↓$
Model	100	1000	2048	100	1000	2048	100	1000	2048
Deep Sets [Zaheer et al., 2017]	65.37 $\pm$ 1.07	88.35 $\pm$ 0.32	88.72 $\pm$ 0.21	1.57 $\pm$ 0.03	0.40 $\pm$ 0.01	0.40 $\pm$ 0.01	17.38 $\pm$ 0.95	4.21 $\pm$ 0.27	4.02 $\pm$ 0.16
SSE [Bruno et al., 2021]	71.09 $\pm$ 0.51	87.85 $\pm$ 0.39	87.92 $\pm$ 0.42	1.42 $\pm$ 0.10	0.52 $\pm$ 0.05	0.51 $\pm$ 0.06	16.69 $\pm$ 1.11	5.93 $\pm$ 1.06	5.88 $\pm$ 1.17
Set Transformer [Lee et al., 2019]	74.21 $\pm$ 1.67	87.81 $\pm$ 0.44	88.17 $\pm$ 0.32	1.76 $\pm$ 0.08	0.79 $\pm$ 0.08	0.78 $\pm$ 0.08	17.12 $\pm$ 0.46	7.48 $\pm$ 0.62	7.37 $\pm$ 0.54
UMBC+Deep Sets	71.53 $\pm$ 1.03	87.52 $\pm$ 0.25	87.74 $\pm$ 0.45	1.48 $\pm$ 0.09	0.61 $\pm$ 0.03	0.62 $\pm$ 0.03	16.39 $\pm$ 1.52	7.53 $\pm$ 0.38	7.49 $\pm$ 0.50
UMBC+SSE	71.03 $\pm$ 0.73	86.19 $\pm$ 0.62	86.36 $\pm$ 0.46	1.11 $\pm$ 0.09	0.50 $\pm$ 0.01	0.49 $\pm$ 0.01	9.67 $\pm$ 2.03	2.42 $\pm$ 0.77	2.37 $\pm$ 1.10
UMBC+Set Transformer	71.18 $\pm$ 1.52	86.56 $\pm$ 0.49	86.77 $\pm$ 0.29	1.23 $\pm$ 0.15	0.53 $\pm$ 0.03	0.51 $\pm$ 0.03	10.37 $\pm$ 2.24	2.60 $\pm$ 0.19	2.35 $\pm$ 0.24
UMBC+Deep Sets (No Dropout train)	69.96 $\pm$ 0.64	87.50 $\pm$ 0.21	87.58 $\pm$ 0.16	1.82 $\pm$ 0.06	0.66 $\pm$ 0.02	0.64 $\pm$ 0.02	21.25 $\pm$ 0.54	8.59 $\pm$ 0.32	8.51 $\pm$ 0.26
UMBC+SSE (No Dropout train)	68.80 $\pm$ 1.00	84.81 $\pm$ 1.17	84.89 $\pm$ 1.39	1.19 $\pm$ 0.06	0.55 $\pm$ 0.04	0.54 $\pm$ 0.04	11.80 $\pm$ 2.07	3.05 $\pm$ 0.64	3.02 $\pm$ 0.82
UMBC+Set Transformer (No Dropout train)	71.52 $\pm$ 0.75	86.56 $\pm$ 0.47	86.61 $\pm$ 0.45	1.50 $\pm$ 0.43	0.63 $\pm$ 0.14	0.62 $\pm$ 0.15	13.36 $\pm$ 4.66	4.22 $\pm$ 2.04	4.28 $\pm$ 2.09

Table 7: Point cloud classification on ModelNet40. All models are trained on a set size of 1000 randomly sampled points, and evaluated on 100, 1000, and 2048 (max) test set sizes. UMBC models in the second block are trained and tested with our slot dropout technique outlined in Section 4. Models in the last block are trained without Slot Dropout, and use all available slots output from $f$ at both train and test time.

Appendix L Potential Societal Impacts

We are not aware of any potential negative societal impacts of MBC processing of sets. Generally speaking, however, sets are a natural choice for estimating quantities such as population statistics, as in our amortized clustering experiments. In this setting, fairness to all involved groups is an important factor to consider, especially if human well-being is at stake.

Appendix M Limitations & Future Work

UMBC is a bottleneck

UMBC projects the input set to a fixed size, and can therefore be a bottleneck, causing possible loss of information from the input set. An interesting line of research could be an exploration of methods to maximize mutual information between the input set of cardinality $N$ and the projected set of cardinality $K$, or an exploration of other forms which a UMBC may take. We look forward to seeing future research in this area.

Train/Test Set Size Variability

In Figure 4, Deep Sets shows the tightest grouping between training set sizes, while giving the lowest overall performance. This indicates that more complicated set functions which make pairwise comparisons may be less robust to varying training set sizes, which may provide an interesting topic for future research.

Bayesian Slots

In our experiments, we used a similar random slot parameter initialization as Blundell et al. [2015]. Following Bruno et al. [2021], we use no Bayesian prior on these random slots, so the increased performance of random slots is likely due to randomness aiding in exploration of the parameter space rather than learning a proper Bayesian posterior. Future work could explore the effects of incorporating a prior distribution over slots or slot dropout rates (e.g. Concrete Dropout [Gal et al., 2017]). This could lead to further increases in robustness to corruptions and varying set sizes.

Appendix N Attention Activations & Calibration

To test the effect of training with different attention activation functions on calibration, we train and evaluate the UMBC+Set Transformer model on all corruptions of ModelNet40-C (Figures 19, 20 and 21) and on individual corruptions (Figures 22, 23 and 24). Besides the change in attention activation, each model was trained with the same settings as the UMBC+Set Transformer from the corresponding experiments in Figures 6, 13 and 14. Surprisingly, we find that the slot-softmax, originally used by Locatello et al. [2020], delivers strong performance in terms of NLL and ECE, although it gives slightly lower accuracy on the natural, uncorrupted test set.

Figure 19: Accuracy across all corruptions on the ModelNet40-C dataset for UMBC+Set Transformer with different attention activation functions. This figure corresponds to the results presented in Figures 20 and 21.

Figure 20: NLL across all corruptions on the ModelNet40-C dataset for UMBC+Set Transformer with different attention activation functions. This figure corresponds to the results presented in Figures 19 and 21.

Figure 21: ECE across all corruptions on the ModelNet40-C dataset for UMBC+Set Transformer with different attention activation functions. This figure corresponds to the results presented in Figures 19 and 20.

Figure 22: Accuracy boxplots for individual ModelNet40-C tests with UMBC+Set Transformer and different attention activation functions. Size is minimized to avoid excessive page length. Best viewed on screen with high zoom.

Figure 23: NLL boxplots for individual ModelNet40-C tests with UMBC+Set Transformer and different attention activation functions. Size is minimized to avoid excessive page length. Best viewed on screen with high zoom.

Figure 24: ECE boxplots for individual ModelNet40-C tests with UMBC+Set Transformer and different attention activation functions. Size is minimized to avoid excessive page length. Best viewed on screen with high zoom.
