Detecting the unknown in Object Detection

Dario Fontanel$^{1}$, Matteo Tarantino$^{1}$, Fabio Cermelli$^{1,2}$, Barbara Caputo$^{1}$

$^{1}$Politecnico di Torino, $^{2}$Italian Institute of Technology
dario.fontanel@polito.it

Abstract

Object detection methods have improved markedly in recent years thanks to novel neural network architectures and the availability of large-scale datasets. However, current methods have a significant limitation: they can detect only the classes observed during training, which are only a subset of all the classes that a detector may encounter in the real world. Furthermore, the presence of unknown classes is often not considered at training time, so these methods cannot even detect that an unknown object is present in the image. In this work, we address the problem of detecting unknown objects, known as open-set object detection. We propose a novel training strategy, called UNKAD, that predicts unknown objects without requiring any annotation of them, exploiting non-annotated objects that are already present in the background of training images. In particular, exploiting the four-step training strategy of Faster R-CNN, UNKAD first identifies and pseudo-labels unknown objects and then uses the pseudo-annotations to train an additional unknown class. While UNKAD can directly detect unknown objects, we further combine it with previous unknown detection techniques, showing that it improves their performance at no additional cost.

1 Introduction

Figure 1: Open-set detection aims to detect previously unseen objects. While previous works consider the unknown objects as background, we are interested in evaluating the capability of detectors to predict unseen objects as unknown. From left to right, the first row shows an MS-COCO [lin2014microsoft] test image and the wrong prediction of Faster R-CNN [ren2015faster]. The second row shows the background prediction of [dhamija2020overlooked] and the unknown detection made by our framework.

For autonomous systems acting in the real world, it is essential to have a clear understanding of their surroundings. To this end, multiple works have concentrated on the task of object detection [lin2017feature, girshick2014rich, girshick2015fast, he2017mask, ren2015faster, wang2019learning, liu2016ssd, lin2017focal, redmon2017yolo9000, redmon2016you, cao2019hierarchical], where the goal is to locate the objects inside an image and to assign them a category. Despite the outstanding performance demonstrated by state-of-the-art detection models [ren2015faster, he2017mask, liu2016ssd, redmon2016you], they still have a critical limitation: they can predict only the classes that they have observed during training, which are defined a priori and annotated in the training dataset. Regardless of how extensive the training dataset is, it is practically impossible to capture all the possible objects that a system might ever encounter. Thus, in the real world, when current detection models are presented with an unknown object they will not detect it as such, treating it either as a known class or as background. For example, consider an autonomous car that has been trained on classes that are likely to be encountered in the city, such as pedestrians, other vehicles, trees, etc. At some point, the car is driven to a rural area and a wild animal appears on the road. If we do not enable detection models to recognize unknown objects, the detector will not be able to spot the unknown animal and will label it as background, putting the safety of the passengers at risk since the car cannot recognize the obstacle and may crash into it. Ideally, we would like to have object detection models able to detect all the objects in the world and possibly understand whether an object corresponds to one of the training categories or is an unknown object that they have never seen before.

In this work, we focus on this task, known in the literature as open-set object detection, which is illustrated in Figure 1. Previous works [dhamija2020overlooked] addressed the problem only partially, focusing on limiting performance degradation on known classes when unknown data is encountered. We believe, as the previous example shows, that treating unknown objects as background is not sufficient to produce systems able to act in the real world, since they could not detect obstacles, introducing safety risks. Instead, we propose to explicitly detect unknown objects at test time using a novel approach called UNKAD. It exploits the fact that, in object detection, multiple objects are present in an image but only a few of them are annotated, since the others are not considered relevant. By exploiting the multi-step training strategy of Faster R-CNN [ren2015faster], UNKAD is able to extract pseudo-supervision for the unknown objects, identifying them in the background areas of the training images. Moreover, it exploits the pseudo-supervision on the classification head, introducing an additional unknown class that can be predicted, which helps to learn sharper decision boundaries between known and unknown objects. While this training strategy can already detect unknown objects by predicting them directly in the classifier, we combine it with standard out-of-distribution methods and show that it improves their performance with respect to a standard Faster R-CNN training strategy. We demonstrate the performance of UNKAD on the popular Pascal VOC 2007 [Everingham2009ThePV] and MS-COCO [lin2014microsoft] benchmarks.

Contributions. To summarize, our contributions are:

We propose a novel perspective on the open-set detection problem, developing a simple yet effective training strategy;
We introduce standard out-of-distribution approaches into the object detection framework, leading to a novel analysis with respect to previous benchmarks;
Experiments on the widely adopted Pascal VOC and MS-COCO datasets show that standard out-of-distribution methods benefit from our approach.

2 Related work

In this section, we review the foundations of our work, i.e. object detection, open-set recognition and open-set object detection.

Object detection. Modern object detection approaches [lin2017feature, girshick2014rich, girshick2015fast, he2017mask, ren2015faster, wang2019learning, liu2016ssd, lin2017focal, redmon2017yolo9000, redmon2016you, cao2019hierarchical] are dominated by architectures based on convolutional neural networks that differ in whether or not candidate object proposals are used. We can group these works into two categories: two-stage approaches [lin2017feature, girshick2014rich, girshick2015fast, he2017mask, ren2015faster], which generate object proposals that are classified and regressed by a region-of-interest (RoI) head module, and single-stage approaches [wang2019learning, liu2016ssd, lin2017focal, redmon2017yolo9000, redmon2016you, cao2019hierarchical], which simultaneously output both classification scores and regressed bounding boxes without the need for any object proposal. Despite the outstanding performance achieved on popular benchmarks [Everingham2009ThePV, lin2014microsoft], all of these architectures operate solely in an offline fashion, which means that after the model has been trained, additional knowledge cannot be added. Despite recent efforts to deal with the inclusion of novel classes [shmelkov2017incremental, peng2020faster, joseph2021towards], none of these techniques strictly focuses on open-set object detection.

Open-set. In recent years, open-set recognition has sparked a lot of interest in the machine learning community [bendale2015towards, perera2020generative, zhou2021learning]. The seminal work of [bendale2015towards] formalized the open-set recognition task, investigating for the first time what might happen when a model is asked to recognize data from categories that it has never seen before. One of the most popular open-set sub-fields is out-of-distribution (OOD) detection [hendrycks2016baseline, liang2018enhancing, hsu2020generalized], which aims at discriminating between samples that belong to the training distribution (also known as in-distribution or known data) and samples that do not (also referred to as out-of-distribution or unknown data).

[hendrycks2016baseline] established the standard baseline for out-of-distribution detection by applying a threshold over the maximum softmax probability (MSP) to categorize a sample as belonging to the known classes or as an unknown one. [gal2016dropout, kendall2017uncertainties] used Monte Carlo Dropout (MC-Dropout) to estimate the model uncertainty by forwarding the same image through the network multiple times, each time with a different randomly sampled dropout mask. [corbiere2019addressing] trains an additional neural network to output high confidence values when the prediction of the main model on in-distribution data is correct. This additional branch is then used to detect whether the network prediction is reliable or not. [liu2020energy] computes, for each sample, a temperature-scaled energy score that is higher for known samples than for unseen ones. ODIN [liang2018enhancing] further enhanced MSP by introducing a temperature scaling factor in the softmax function and small perturbations over the test images. Both these hyperparameters are learned on an OOD validation set available during training. As collecting OOD data is not always feasible, in this work we avoid relying on it, leveraging background objects to model unknown properties.

Open-set object detection. The open-set framework introduces additional challenges once applied to the object detection task. [miller2018dropout] was the first to bring open-set object detection to light, and [dhamija2020overlooked] further investigated the problem, assessing how detector performance on known classes varies when evaluated on both known and unknown objects. In this paper, instead, we believe it is critical to also evaluate the capability of object detection models to recognize objects as unknown. As a result, we need to introduce into the object detection framework out-of-distribution approaches able to distinguish between known and unknown categories, and to evaluate model performance on both. Recently, a few works [gupta2021ow, joseph2021towards] went a step further by introducing the rejection capability into object detection models, i.e. the ability to recognize an object as unknown. To detect unknown samples, [joseph2021towards] employed a contrastive approach while [gupta2021ow] adopted a transformer-based architecture. Despite the introduction of the rejection capability, we do not provide comparisons with [joseph2021towards] and [gupta2021ow] in this work, as they are primarily concerned with building models capable of expanding their knowledge over time, hence de facto focusing on another task and objective.

3 Method

In this section we first formalize the problem and the importance of distinguishing between known and unknown categories (Section 3.1). We then describe UNKAD, showing how to detect unknown objects during training and how to involve them in the learning process (Section 3.2). Finally, in Section 3.3 we analyze and compare different rejection strategies.

3.1 Preliminaries and problem formulation

The goal of open-set object detection is to detect objects that have not been seen during the training phase [dhamija2020overlooked]. To perform this task, the model is provided with a training dataset $T_{train} = \{(x, y)\}$, where $x$ is an image and $y$ is its corresponding ground-truth label. As in standard object detection, $y$ contains bounding box annotations for a set of classes, which we refer to as the known classes $Y$. We note that, in object detection, the training images contain multiple objects, but not all of them are annotated. We denote all the non-annotated objects as unknown objects. During testing, we are provided with a dataset $T_{test}$ containing objects of both $Y$ and never-seen categories (unknown).

Focusing on R-CNN architectures [girshick2015fast, ren2015faster], our aim is to learn a function $F$ mapping an image $x$ to its known and unknown predictions at bounding box level. We consider $F$ as built upon four key components. The first one is the feature extractor $F_{FE}$, which is responsible for producing a feature map for each image $x$. The second one is a region proposal network $F_{RPN}$ that is fed with the feature map of the input image and produces a set of rectangular object proposals, each associated with a binary objectness score. In particular, it first projects the input feature map into a lower-dimensional space by means of an intermediate projection layer; the projected features are then fed into two separate convolutional layers, one responsible for regressing the bounding box and the other for outputting an objectness score $ω$. The set of object proposals is then applied to the feature map and pooled, producing $R$ regions of interest (RoIs). The third component is the classification head, which aims to classify the objects and regress the correct bounding box. It is composed of two functions: a class-scoring function $F_{ψ} : R \to \mathbb{R}^{R \times |Y|}$ that maps the $R$ RoIs to box-level class scores, and a function $F_{ρ}$ that regresses, for each RoI and class, the precise bounding box coordinates, i.e. $F_{ρ} : R \to \mathbb{R}^{R \times 4 |Y|}$. Finally, the last component is the unknown detection function $F_{ϕ} : [0, 1]^{R \times |K_{known}|} \to \mathbb{R}^{R}$ that outputs a binary score indicating whether the RoI is unknown ($F_{ϕ} = 1$) or not ($F_{ϕ} = 0$).

3.2 Detecting the unknown

In this work we take inspiration from out-of-distribution detection approaches [lee2018simple, hendrycks2018deep, vyas2018out, dhamija2018reducing, li2020background] that leverage external OOD data as extra supervision to learn richer decision boundaries between known and unknown samples. As collecting external unknown datasets is not always feasible, and most of the time human intervention is required to at least partially annotate them, our intuition is that we can use a portion of the objects labelled as background as OOD data. This is possible specifically because of the object detection framework, which provides labels only for the classes of interest in $Y$ and leaves all the other objects, which are not meant to be learned, unannotated.

To this end, we train our detector $F$ with a four-step alternating strategy, adapting the methodology proposed in Faster R-CNN [ren2015faster] and proposing UNKAD (UNKnown Aware Detection), which extends the classification ability of the model towards unknown objects. In particular, in the first and third steps $F_{RPN}$ learns to extract class-agnostic RoIs, while in the second and fourth steps the detector $F$ learns to classify the known classes $Y$ and the unknown class. An illustration of the four-step UNKAD training strategy is reported in Fig. 2.

Discovering the unknown. The intuition behind UNKAD is that unknown objects are already present in the training images and we can learn to recognize them by using the class-agnostic ability of the RPN to detect objects. In particular, the RPN exploits the ground-truth labels to find and predict the RoIs, i.e. regions where an object is likely to be found. During preliminary experiments (see Section 4.3), we found that the RPN was able to detect with high confidence not only the annotated objects but also the objects in the image background. Arguing that its class-agnostic nature helps to find any object in the image, we propose to pseudo-label the objects detected in the background as unknown objects. To generate pseudo-labels during training, we use a simple thresholding mechanism. In particular, we define a threshold to establish whether a RoI is a region containing an object (known or unknown) or not. The threshold $τ_{obj}$ is derived from data, and it is computed as:

$$τ_{obj} = μ + λ \cdot σ, \tag{1}$$

where

$$μ = \frac{\sum_{i} ω_{i}}{|R_{FG}|}, \tag{2}$$

$$σ = \sqrt{\frac{\sum_{i} (ω_{i} - μ)^{2}}{|R_{FG}|}}, \tag{3}$$

and $μ$ and $σ$ represent respectively the mean and the standard deviation of the foreground-layer activations, and $R_{FG}$ is the set of RoIs that either (i) have the highest IoU with a ground-truth box, or (ii) have an IoU overlap higher than $0.7$ with any ground-truth box. We recall that $ω_{i}$ represents the objectness score for the $i^{th}$ region fed to $F_{RPN}$. $λ$ is a hyperparameter set to 1.

To obtain the pseudo-labels, we first select all the RoIs having an objectness score greater than $τ_{obj}$. Then, among them, we select all the RoIs that do not match a ground-truth annotation (i.e. they do not overlap with it by more than 0.3 IoU) and we assign them to the unknown class $u$.
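The thresholding and pseudo-labelling steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are assumed to be in `(x1, y1, x2, y2)` format, and the helper names `box_iou`, `objectness_threshold` and `pseudo_label_unknown` are hypothetical.

```python
import numpy as np


def box_iou(a, b):
    """Pairwise IoU between boxes a (N, 4) and b (M, 4), format (x1, y1, x2, y2)."""
    a = np.asarray(a, dtype=float).reshape(-1, 4)
    b = np.asarray(b, dtype=float).reshape(-1, 4)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    top_left = np.maximum(a[:, None, :2], b[None, :, :2])
    bottom_right = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(bottom_right - top_left, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)


def objectness_threshold(fg_scores, lam=1.0):
    """Eq. (1)-(3): tau_obj = mu + lambda * sigma over the foreground set R_FG."""
    fg_scores = np.asarray(fg_scores, dtype=float)
    mu = fg_scores.mean()
    sigma = fg_scores.std()  # population std: divides by |R_FG|, as in Eq. (3)
    return mu + lam * sigma


def pseudo_label_unknown(boxes, scores, gt_boxes, tau_obj, max_iou=0.3):
    """Boolean mask of RoIs to pseudo-label as unknown.

    A RoI is kept if its objectness exceeds tau_obj and it overlaps
    no ground-truth box by more than max_iou.
    """
    ious = box_iou(boxes, gt_boxes)
    # With no annotations in the image, every RoI has zero overlap.
    best_iou = ious.max(axis=1) if ious.shape[1] > 0 else np.zeros(len(ious))
    return (np.asarray(scores, dtype=float) > tau_obj) & (best_iou <= max_iou)


# Toy example with made-up numbers: one annotated box, three RoIs.
tau = objectness_threshold([0.9, 0.8])  # foreground objectness scores
mask = pseudo_label_unknown(
    boxes=[[0, 0, 10, 10], [20, 20, 30, 30], [40, 40, 50, 50]],
    scores=[0.95, 0.99, 0.20],
    gt_boxes=[[0, 0, 10, 10]],
    tau_obj=tau,
)
print(tau, mask)
```

In the toy example only the second RoI is pseudo-labelled: the first matches the annotation, and the third has a low objectness score.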

Once we obtain pseudo-labels on the unknown objects that are present in the training images, we exploit them to train the classification head to detect every unknown object.

Figure 2: Illustration of the UNKAD training procedure. In the first step, the RPN learns to predict class-agnostic RoIs (white boxes) containing objects. It then pseudo-labels unknown objects and, in step 2, trains the classifier with the additional unknown class. In step 3, the RPN is trained again, considering both known and unknown objects. Finally, in step 4 the final model able to detect both known and unknown objects is obtained.

Learning to predict unknown. As our primary goal is to learn an unknown detector through pseudo-supervision, we add to the final classification layer $F_{ψ}$ an additional class that leverages the unknown pseudo-labels generated by $F_{RPN}$. More precisely, following the alternating training strategy of Faster R-CNN [ren2015faster, girshick2015fast], we perform the pseudo-labelling phase at the end of the first and third steps. In the second and fourth steps, $F_{ρ}$ and $F_{ψ}$ are learned, leveraging both annotated and pseudo-annotated data. By using the pseudo-annotations as ground truth, $F$ learns at training time to separate the unknown category from all the known classes.

3.3 Rejection strategies

In this section we present different strategies to predict a RoI as an unknown object, i.e. different implementations of the $F_{ϕ}$ function. In particular, we assume that, at inference time, we obtain $R$ RoIs from the RPN, which are passed to the classification head to obtain the final classification scores $s$ using $F_{ψ}$.

Direct prediction. The standard approach in closed-set scenarios is to localize the object and assign it the class with the highest probability. Since UNKAD extends the classification logits to the unknown class as well, we may apply the same principle and predict a sample as unknown if the unknown class has the highest score. Formally, given an image and a RoI $r$, we obtain the set $s$ of class scores, including the score for the $u$ class, and we compute $F_{ϕ}$ as:

$$F_{ϕ}(r) = \begin{cases} 1 & \text{if } \hat{y} = u, \\ 1 & \text{if } \hat{y} = b \text{ and } ω_{r} > τ_{obj}, \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where $\hat{y} = \arg\max_{c \in Y \cup \{u, b\}} s_{c}$ is the class with the highest score, $ω_{r}$ is the RPN objectness score for $r$, and $b$ indicates the background class. We note that we added the second case to avoid missing potential unknown objects.
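A minimal sketch of the direct-prediction rule in Eq. (4) for a single RoI is given below. It assumes that the argmax ranges over the known classes plus the unknown and background classes; the label strings `"unknown"` and `"background"` and the function name `direct_prediction` are hypothetical choices made for illustration.

```python
import numpy as np


def direct_prediction(scores, class_names, objectness, tau_obj):
    """Eq. (4): return 1 if the RoI is rejected as unknown, 0 otherwise.

    scores      -- class scores s for one RoI, aligned with class_names
    class_names -- labels, assumed to include "unknown" and "background"
    objectness  -- RPN objectness score omega_r of the RoI
    tau_obj     -- objectness threshold from Eq. (1)
    """
    y_hat = class_names[int(np.argmax(scores))]
    if y_hat == "unknown":
        return 1  # first case: the classifier predicts the unknown class
    if y_hat == "background" and objectness > tau_obj:
        return 1  # second case: background, but the RPN is confident it is an object
    return 0


names = ["cat", "dog", "unknown", "background"]
print(direct_prediction([0.1, 0.2, 2.0, 0.3], names, objectness=0.1, tau_obj=0.9))
print(direct_prediction([0.1, 0.2, 0.3, 2.0], names, objectness=0.95, tau_obj=0.9))
```

The second call illustrates why the extra case exists: the classifier says background, yet the high objectness score still flags the RoI as a potential unknown object.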

Maximum Softmax Probability (MSP). Maximum Softmax Probability [hendrycks2016baseline] leverages the highest probability value assigned to any known class in $Y$ as a measure of uncertainty. For an image, given a RoI $r$ and the class scores $s$ for it, MSP computes the value for the unknown class $F_{ϕ} (r)$ as:

$$F_{ϕ}(r) = \max_{c \in Y \cup b} \frac{e^{s_{c}}}{\sum_{k \in Y \cup b} e^{s_{k}}} \leq τ_{MSP}, \tag{5}$$

where $τ_{MSP}$ is a user-defined threshold that we set to $0.5$ and $b$ indicates the background class. Intuitively, if the most likely class (or the background) is predicted with a probability lower than $τ_{MSP}$, MSP identifies the RoI as an unknown object.
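The MSP rule in Eq. (5) can be illustrated with a few lines of NumPy. This is a sketch rather than the authors' code; the function names `softmax` and `msp_reject` are hypothetical, and the scores are assumed to cover the known classes and the background.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array of class scores."""
    s = np.asarray(scores, dtype=float)
    z = np.exp(s - s.max())  # subtracting the max avoids overflow
    return z / z.sum()


def msp_reject(scores, tau_msp=0.5):
    """Eq. (5): 1 (unknown) if the maximum softmax probability is <= tau_msp."""
    return int(softmax(scores).max() <= tau_msp)


print(msp_reject([5.0, 0.0, 0.0]))  # confident prediction, kept as known
print(msp_reject([1.0, 1.0, 1.0]))  # uniform scores, rejected as unknown
```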

Energy-based classifier (Energy). The energy-based scoring function [liu2020energy] adopts a completely different perspective from traditional classifiers. Instead of computing class probabilities, it maps each sample to a scalar value, i.e. the energy. Given a RoI $r$ of an image, $F_{ϕ}(r)$ is computed as:

$$F_{ϕ}(r) = -T \cdot \log \sum_{c=1}^{|Y \cup b|} e^{\frac{s_{c}}{T}} \leq τ_{EN}, \tag{6}$$

where $T$ is the temperature hyperparameter, $τ_{EN}$ is a user-defined threshold set to $-3$, and $s_{c}$ is the classification score for class $c$. To align with the convention introduced by [liu2020energy], we keep the negative notation for the energy score, which is higher for known classes and lower for unknown ones.
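As an illustration, the quantity on the left-hand side of Eq. (6), $-T \cdot \log \sum_{c} e^{s_{c}/T}$, can be computed in a numerically stable way as sketched below. The function name `energy` is hypothetical, and the sketch only evaluates the scalar; comparing it against $τ_{EN}$ is left to the caller.

```python
import numpy as np


def energy(scores, temperature=1.0):
    """Return -T * log(sum_c exp(s_c / T)) for one RoI.

    Uses the log-sum-exp trick so that large scores do not overflow.
    """
    z = np.asarray(scores, dtype=float) / temperature
    m = z.max()
    log_sum_exp = m + np.log(np.exp(z - m).sum())
    return float(-temperature * log_sum_exp)


print(energy([0.0, 0.0]))        # equals -log(2)
print(energy([0.0, 0.0], 2.0))   # equals -2 * log(2)
print(energy([1000.0, 0.0]))     # finite even with very large scores
```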

ODIN. To enhance the performance of MSP, ODIN [liang2018enhancing] adopts temperature scaling and input perturbation. In detail, before feeding the scores to the softmax function, ODIN scales each class score by a temperature parameter $T$. In addition, it pre-processes each input image $x$ by introducing a small perturbation, which we adapt to the context of object detection as:

$$\tilde{x} = x - ϵ \cdot \mathrm{sgn}\left(-\nabla_{x} \frac{1}{R} \sum_{r=1}^{R} \log \left(\max_{c \in Y \cup b} p_{c}\right)\right), \tag{7}$$

where $ϵ$ is the perturbation magnitude, $\nabla_{x}$ is the gradient vector with respect to $x$ , and $p$ is the softmax of the scores $s$ , i.e. $p = softmax (s)$ .

As for MSP [hendrycks2016baseline], given a perturbed image $\tilde{x}$ and the class scores $s$ computed on the RoI $r$, ODIN computes $F_{ϕ}(r)$ as:

$$F_{ϕ}(r) = \max_{c \in Y} \frac{e^{s_{c}}}{\sum_{k \in Y} e^{s_{k}}} \leq τ_{ODIN}, \tag{8}$$

where $τ_{ODIN}$ is a user-defined threshold set to $0.4$.

4 Experiments

Figure 3: Extract of unknown images taken from [lin2014microsoft].

4.1 Experimental Protocol

Datasets. We conduct our experiments on the popular Pascal VOC 2007 [Everingham2009ThePV] and MS-COCO [lin2014microsoft] datasets. Pascal VOC 2007 is a widely used benchmark that includes 20 foreground object classes and consists of 5K images for training, considering both the training and the validation splits, and 5K for testing. MS-COCO is a large-scale dataset that provides 80K images for training, 20K for validation and for the test. It contains annotations for 80 object classes, of which 20 are in common with Pascal VOC 2007. For the open-set evaluation, we follow the split defined as WR$_{1}$ in [dhamija2020overlooked]. In particular, the test set of Pascal VOC 2007 is used for the evaluation on the $Y$ classes, while 4952 images from the MS-COCO training set that do not contain any Pascal VOC class are selected for the evaluation on the $u$ class (see Figure 3), resulting in a total of nearly 10K test images.

Baseline. We compare our proposed training strategy with the four-step Faster R-CNN [ren2015faster] training procedure. We pair both training strategies with multiple open-set and out-of-distribution detection strategies. In particular, we implement the rejection strategies described in Sec. 3.3: direct prediction, MSP [hendrycks2016baseline], Energy [liu2020energy] and ODIN [liang2018enhancing]. We note that it is not possible to use direct prediction with the standard Faster R-CNN training since it does not provide the unknown class in the classification head.

Architectures. All the evaluated methods share the same Faster R-CNN architecture with a ResNet-50 backbone initialized with ImageNet pretrained weights [deng2009imagenet]. We train the network using the SGD optimizer with a batch size of $4$ and a learning rate of $10^{-3}$, decreased by a factor of $0.1$ after $75\%$ of the iterations. The weight decay is set to $10^{-4}$ and the SGD momentum to $0.9$. The number of training iterations is set to $40k$ for both the $1^{st}$ and the $2^{nd}$ training steps, while it is decreased to $10k$ in the $3^{rd}$ and $4^{th}$ steps. We resize images to $800 \times 1333$ in both training and testing, while during training we also perform random crop and random horizontal and vertical flip operations.
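The learning-rate schedule described above can be written as a small step function. The sketch assumes, since the text does not state it explicitly, that the decay is applied independently within each of the four training steps, at 75% of that step's iterations; the names `learning_rate` and `STEP_ITERATIONS` are hypothetical.

```python
# Iterations per training step, as reported in the text.
STEP_ITERATIONS = {1: 40_000, 2: 40_000, 3: 10_000, 4: 10_000}


def learning_rate(iteration, total_iterations, base_lr=1e-3, gamma=0.1, decay_at=0.75):
    """Base learning rate, multiplied by gamma once decay_at of the iterations have run."""
    if iteration >= decay_at * total_iterations:
        return base_lr * gamma
    return base_lr


for step, total in STEP_ITERATIONS.items():
    decay_iter = int(0.75 * total)
    print(step, decay_iter, learning_rate(decay_iter - 1, total), learning_rate(decay_iter, total))
```

Under this assumption the rate drops at iteration 30,000 in the first two steps and at iteration 7,500 in the last two.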

Training	Rejection	mAP% $↑$	$WI_{no\_rej}$ $↓$	$WI$ $↓$	$U_{Recall}$ $↑$	$U_{Precision}$ $↑$	$U_{F1}$ $↑$
Standard	Without Rejection [ren2015faster]	67.29	1.63	1.63	0.00	0.00	0.00
	MSP [hendrycks2016baseline]	67.36	1.58	-17.52	2.46	4.20	3.10
	Energy [liu2020energy]	51.01	0.78	-30.39	2.40	0.41	0.70
	ODIN [liang2018enhancing]	67.22	1.58	-20.43	3.81	1.38	2.02
UNKAD	Without Rejection [ren2015faster]	67.75	1.50	1.50	0.00	0.00	0.00
	MSP [hendrycks2016baseline]	67.74	1.50	-19.22	3.22	4.57	3.78
	Energy [liu2020energy]	51.75	0.68	-29.82	2.34	0.30	0.53
	ODIN [liang2018enhancing]	67.63	1.49	-21.56	4.85	1.62	2.43
	Direct Prediction	67.75	1.48	-21.91	5.19	2.83	3.67

Table 1: Evaluation of out-of-distribution detection methods adopting standard and UNKAD (ours) training strategies on Pascal VOC [Everingham2009ThePV] and MS-COCO [lin2014microsoft] datasets. Results on $Y$ are evaluated through mAP and $WI_{no\_rej}$ [dhamija2020overlooked] metrics, while results on $u$ are evaluated computing WI [dhamija2020overlooked] and the recall, precision and F1 scores on unknown objects.

Metrics. To assess the impact of the unknown objects on the performance of standard object detection models, [dhamija2020overlooked] introduced the metric called Wilderness Impact (WI), defined as:

$$WI = \frac{\text{Precision}_{\text{Closed-Set}}}{\text{Precision}_{\text{Open-Set}}} - 1 = \frac{TP_{c}}{TP_{c} + FP_{c}} \cdot \frac{TP_{c} + FP_{c} + TP_{o} + FP_{o}}{TP_{c} + TP_{o}} - 1, \tag{9}$$

where $TP_{c}$ and $FP_{c}$ indicate the true positives and false positives on known classes, while $TP_{o}$ and $FP_{o}$ indicate the true and false positives on unknowns. However, in [dhamija2020overlooked] the metric was simplified since it did not consider the rejection option, i.e. the models cannot predict the unknown class, thus $TP_{o} = 0$. They considered the following metric:

$$WI_{no\_rej} = \frac{FP_{o}}{TP_{c} + FP_{c}}. \tag{10}$$

We note that $WI_{no\_rej}$ does not consider the detection performance on unknown classes, but only the performance on the known classes. In particular, classifying all the unknown objects as background would obtain the optimal score, since $FP_{o} = 0$. Although the second metric has been used by recent works [joseph2021towards, dhamija2020overlooked], we consider the $WI$ metric better suited to our task since it also takes into account the true positives on unknown objects.
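The relation between Eq. (9) and Eq. (10) can be checked numerically. The sketch below implements both formulas as written; the function names and the detection counts are hypothetical values chosen only for illustration.

```python
def wilderness_impact(tp_c, fp_c, tp_o, fp_o):
    """Eq. (9): closed-set precision divided by open-set precision, minus 1."""
    precision_closed = tp_c / (tp_c + fp_c)
    precision_open = (tp_c + tp_o) / (tp_c + fp_c + tp_o + fp_o)
    return precision_closed / precision_open - 1.0


def wilderness_impact_no_rej(tp_c, fp_c, fp_o):
    """Eq. (10): the simplified metric, obtained from Eq. (9) with TP_o = 0."""
    return fp_o / (tp_c + fp_c)


# Made-up counts: 80 TP and 20 FP on known classes, 5 FP on unknown objects.
print(wilderness_impact(80, 20, 0, 5))       # no rejection: matches Eq. (10)
print(wilderness_impact_no_rej(80, 20, 5))
print(wilderness_impact(80, 20, 30, 5))      # with 30 correctly rejected unknowns
```

With $TP_{o} = 0$ the two metrics coincide, while a model that correctly detects unknown objects raises the open-set precision and can push $WI$ below zero, which is consistent with the negative $WI$ values reported in Table 1 for the rejection strategies.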

Moreover, since in our work we are explicitly interested in evaluating the models on unknown objects, we report the recall, precision and F1 metrics considering only the unknown class. We define them as:

$$U_{Recall} = \frac{TP_{o}}{TP_{o} + FN_{o}}, \tag{11}$$

$$U_{Precision} = \frac{TP_{o}}{TP_{o} + FP_{o}}. \tag{12}$$

The more the model tends to predict each sample as belonging to the unknown category, the higher the recall will be, but the lower the precision will be, as the number of false positives tends to increase. For this reason, we introduce the $U_{F1}$ metric, which summarizes both $U_{Recall}$ and $U_{Precision}$. $U_{F1}$ is maximized if and only if both $U_{Recall}$ and $U_{Precision}$ are maximized. It is defined as:

$$U_{F1} = 2 \cdot \frac{U_{Recall} \cdot U_{Precision}}{U_{Recall} + U_{Precision}}. \tag{13}$$
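As a worked check of Eq. (13), the harmonic mean can be evaluated on the recall and precision values reported for MSP in Table 1. The function name `unknown_f1` is hypothetical, and returning zero when both inputs are zero is an assumption made to match the "Without Rejection" rows.

```python
def unknown_f1(recall, precision):
    """Eq. (13): harmonic mean of U_Recall and U_Precision (0 if both are 0)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Values (in %) taken from the MSP rows of Table 1.
print(round(unknown_f1(2.46, 4.20), 2))  # standard training, table reports 3.10
print(round(unknown_f1(3.22, 4.57), 2))  # UNKAD training, table reports 3.78
```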

Finally, in addition to the open-set metrics, we also report the mAP to evaluate the model performances on $Y$ classes, defined as:

$$mAP = \frac{1}{|Y|} \sum_{c \in Y} AP_{c}, \tag{14}$$

where $A P_{c}$ is the average precision for the class $c$ computed at different recall levels.

4.2 Unknown detection results

Tab. 1 reports the comparison between the standard Faster R-CNN training and UNKAD, considering multiple unknown detection strategies. As the table shows, our approach improves detection of both known and unknown objects. Starting from the former, the standard Faster R-CNN architecture with no rejection capability benefits from our approach, reaching 67.75% mAP (from 67.29%). This behaviour is also confirmed by the WI improvement from 1.63 to 1.50 (the lower the better), indicating that UNKAD better distinguishes known from unknown objects. The same behaviour on the known classes is evident when employing a rejection strategy: considering the WI metric, MSP improves from -17.52 to -19.22 and ODIN from -20.43 to -21.56, while Energy obtains similar results (-30.39 vs -29.82). We note that, while the Energy approach is the best on the WI and $WI_{no\_rej}$ metrics, it obtains a very low mAP. We ascribe this behaviour to the high unknown score that Energy assigns to samples, leading to the rejection of most of the known objects. Moreover, the contrast between the WI and mAP metrics reveals that WI is not well suited to evaluating methods on the open-set task, since it does not consider the overall model performance but only the ratio between closed- and open-set precision.

While improving the results on known classes is important, our main goal is to detect unknown objects. Considering the $U_{F1}$ score, we note that UNKAD increases the results of MSP from $3.10\%$ to $3.78\%$ and those of ODIN [liang2018enhancing] from $2.02\%$ to $2.43\%$. This highlights the benefit of our training strategy, which improves the detection ability of the model without introducing additional costs, by pseudo-labelling unknown objects in the background of the images. We acknowledge that our training procedure slightly hampers the Energy performance. However, we remark that Energy already assigns a very high unknown score to the objects, as demonstrated by its very low $U_{Precision}$. We also note that only Energy shows this behaviour, while both MSP and ODIN, when using UNKAD, improve on both the $U_{Precision}$ and $U_{Recall}$ metrics. Finally, comparing the simple direct prediction approach with the others, we see that it achieves comparable or even better performance. In particular, it achieves the best $U_{Recall}$ without suffering a performance drop on known classes, as indicated by the mAP. However, its $U_{Precision}$ is lower than that of other approaches (-1.74% w.r.t. MSP).

No rejection	$\hat{y} = u$	$τ_{obj}$	mAP $↑$	$U_{F1}$ $↑$
✓			67.75	0.00
	✓		67.73	3.46
		✓	67.72	3.41
	✓	✓	67.75	3.67

Table 2: Ablation study of the direct approach rejection strategy when adopting UNKAD. We compute mAP and $U_{F1}$ metrics without rejection, using the direct prediction of the final classifier, exploiting the $τ_{obj}$ for unknown predictions and combining the two strategies.

4.3 Ablation studies

Direct prediction rejection strategy. In Table 2 we report a detailed analysis of the direct prediction rejection strategy available when using our UNKAD approach. It is composed of two key components: the additional unknown class in the final classification layer, able to predict the unknown class for each RoI, and the objectness threshold $τ_{obj}$. As we can see from the second row, the additional unknown class maintains the same $67.75\%$ mAP performance over the known classes of the standard Faster R-CNN with no rejection capability, while also achieving a $U_{F1}$ of $3.46\%$. In the third row, we report instead the results achieved by directly applying $τ_{obj}$ to background RoIs. In particular, each RoI for which background has the highest probability among all classes is considered unknown if $F_{RPN}$ emits an objectness score higher than $τ_{obj}$. Combining the two strategies achieves the best results, as shown in the last row of the table. The overall rejection procedure maintains the highest mAP on known classes, while also increasing the $U_{F1}$ up to $3.67\%$ and decreasing the WI down to $1.48$.

Figure 4: Qualitative results of $F_{R P N}$ detections on Pascal VOC [Everingham2009ThePV]. The cyan boxes indicate a detection on unknown objects, while the other ones indicate a detection on known classes. Best viewed in color.

RPN unknown detection. Although $F_{RPN}$ is a class-agnostic detector by design [ren2015faster, girshick2015fast], an open question is whether it is able to recognize even objects on which it has not been explicitly trained, which is essential for the unknown pseudo-labelling procedure. To this end, we formulate the $AVG_{obj}$ metric as a quantitative measure of the ability of $F_{RPN}$ to identify objects within an image in both closed- and open-set scenarios. It is formulated as follows:

$$AVG_{obj} = \frac{1}{|P_{fg}|} \sum_{p \in P_{fg}} f_{fg}(p), \tag{15}$$

where $P_{fg}$ is the set of ground-truth foreground proposals and $f_{fg}(p)$ is the probability that the proposal $p$ is actually considered foreground.

We report in Tab. 3 the evaluation of the standard training procedure and UNKAD under the $AVG_{obj}$ metric, computed on both known and unknown objects. Our approach achieves $0.99$ on known and $0.98$ on unknown objects (1 being the upper bound), surpassing the standard procedure in both evaluations. Achieving comparable performance in both closed- and open-set scenarios supports our intuition that $F_{RPN}$ is able to precisely detect objects regardless of whether they belong to the training distribution or not. It is worth noting that UNKAD also increases the confidence with which proposals on the known classes are considered foreground, as shown by the comparison between the first row and the third one.

Standard	UNKAD	known	unknown	$AVG_{obj}$
✓		✓		0.98
✓			✓	0.97
	✓	✓		0.99
	✓		✓	0.98

Table 3: Ablation study of the $F_{RPN}$ ability to identify objects in closed- and open-set scenarios.

5 Conclusions

In this work, we proposed a novel training strategy, called UNKAD, to improve open-set object detection performance. UNKAD relies on the assumption that, during training, the images contain multiple non-annotated objects. Instead of requiring an explicit annotation for them, it automatically detects and pseudo-labels them, exploiting the four-step Faster R-CNN training procedure. In particular, in the first step, it trains the class-agnostic RPN to detect objects using the ground-truth annotations. Then, it pseudo-labels as unknown all the objects in the dataset with a high objectness score that do not match a ground-truth annotation. The pseudo-labels are then used as pseudo ground truth to train the classification head. In the third and fourth training steps, the knowledge of the unknowns is further consolidated, obtaining the final model.

We demonstrate that UNKAD is able to directly detect the unknown classes and that it also improves the performance of previous rejection strategies at no additional cost on the Pascal VOC and MS-COCO datasets. Nevertheless, the unknown detection performance is still far from that of a system ready to operate in the wild, and we hope that our work establishes a new baseline to push forward the state of the art in this research field.