Weakly and Semi-Supervised Detection, Segmentation and Tracking of Table Grapes with Limited and Noisy Data¹

¹ This work extends the one titled “Pseudo-label Generation for Agricultural Robotics Applications” presented at the 3rd International Workshop on Agriculture-Vision, CVPR 2022, New Orleans.

Thomas A. Ciarfuglia¹, Ionut M. Motoi¹, Leonardo Saraceni¹, Mulham Fawakherji, Alberto Sanfeliu, Daniele Nardi
Department of Information, Management and Automation Engineering (DIAG), Sapienza University of Rome, Italy
Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain

² The authors contributed equally to the work.

Abstract

Detection, segmentation and tracking of fruits and vegetables are three fundamental tasks for precision agriculture, enabling robotic harvesting and yield estimation applications. However, modern algorithms are data hungry and it is not always possible to gather enough data to apply the best performing supervised approaches. Since data collection is an expensive and cumbersome task, the enabling technologies for using computer vision in agriculture are often out of reach for small businesses. Following previous work in this context Ciarfuglia et al. (2022), where we proposed an initial weakly supervised solution to reduce the data needed to get state-of-the-art detection and segmentation in precision agriculture applications, here we improve that system and explore the problem of tracking fruits in orchards. We present the case of vineyards of table grapes in southern Lazio (Italy) since grapes are a difficult fruit to segment due to occlusion, color and general illumination conditions. We consider the case when there is some initial labelled data that could work as source data (e.g. wine grape data), but it is considerably different from the target data (e.g. table grape data). To improve detection and segmentation on the target data, we propose to train the segmentation algorithm with a weak bounding box label, while for tracking we leverage 3D Structure from Motion algorithms to generate new labels from already labelled samples. Finally, the two systems are combined in a full semi-supervised approach. Comparisons with SotA supervised solutions show how our methods are able to train new models that achieve high performances with few labelled images and with very simple labelling.

keywords:

Fruit detection and segmentation, Yield prediction, Computer vision, Deep learning, Self-supervised learning

1 Introduction

Detection and tracking of fruits and vegetables are two fundamental tasks for precision agriculture, enabling robotic harvesting and yield estimation applications. As with any other automation task, detection of vegetables benefits from controlled environments and well-known field conditions. Some examples of this approach can be found in Nuske et al. (2014); Potena et al. (2016). Reducing variability and uncertainty in fruit position, occlusions, variety and illumination, to cite a few aspects of the problem, has a huge impact on the successful implementation of a learning-based detection system. It is clear that an integrated approach to precision agriculture will push towards solving the detection problem more holistically, by engineering the field to ease the task of the detection algorithms while at the same time refining the algorithms to their best performance and specializing them to the exact context. However, this approach requires considerable economic investment, effort and vision. In reality, agricultural businesses are often small and family-based, and while the economic problems they face are the same as those of bigger companies (e.g. lack of manpower to harvest vegetables), they do not have the economic strength or knowledge to engineer the cultivation from the ground up. In this context, it is interesting to consider the challenges that detection and tracking algorithms face when the field is not prepared for automation. Some of these challenges are: uneven distribution of vegetables in the field, intra-species variability, illumination, occlusion and clutter. From a technical point of view, all these aspects translate to covariate shifts and a lack of labelled samples. In this setting it is difficult to collect enough labelled images to capture the actual variability of the distribution. In addition, labelling images for specific tasks, such as pest or disease detection, requires expert advice, which increases the costs.

The aforementioned considerations inspired us to explore methods that could help in training detection and instance segmentation algorithms with little labelled data. We explicitly consider the case where a small amount of data from a similar cultivation has been collected and labelled (Source Dataset, SD), but is not enough to reach acceptable detection and segmentation performance on a different orchard with a substantial covariate distribution shift (Target Dataset, TD). As our target data we use table grape vineyards cultivated in Aprilia, southern Lazio, while our source dataset is the wine grape dataset presented in Santos et al. (2020). We present a combination of weakly and semi-supervised techniques that significantly increase the performance of the algorithms, and we compare the newly trained algorithms with the state of the art on the example application of tracking fruits for yield estimation.

In the following we give an overview of the work done in detection for agricultural applications, in particular for grape detection and segmentation, and at the end of the Section we summarize the contribution of this work.

1.1 Detection and Segmentation in Agriculture Applications

Computer Vision techniques applied to Agriculture have been a central subject of recent research. Often these systems are focused on yield estimation, but pest and disease detection are also relevant applications of detection and segmentation in Agriculture. In this Section we mainly outline the methods that rely on vision sensors, such as simple RGB cameras, since they are readily available and relatively cheap. In this context, a number of studies are relevant to this work. Bellocchio et al. (2019) present an olive counting solution that is explicitly trained with weak labels and consistency losses. This work is close to ours in its focus on data with minimal labelling; however, it is based on simple direct fruit counting, which can lead to large errors in cases where self-occlusion is typical. While many early works Skrabanek and Majerík (2016); Nuske et al. (2014), and some recent ones Pérez-Zavala et al. (2018), use hand-crafted features, most of the recent approaches use representation learning Bargoti and Underwood (2017); Wan and Goudos (2020); Bellocchio et al. (2019). With this kind of technique, data availability is a key issue. In this respect, Koirala et al. (2019) present an overview of Deep Learning methods applied to fruit detection, pointing out in particular the critical role of data availability and recommending the use of public data and benchmarks to compare results.

Often, agricultural tasks are cast as activities that could be performed by an autonomous robot. In this line of research there are numerous contributions, both in single and multi robot scenarios. Halstead et al. (2018) present one such system, in which they detect red peppers with a camera mounted on an AGV. The authors also estimate ripeness by considering ripeness stages as different classes. Among the robotic solutions for yield estimation, a few use a multi robot setup, in particular mixing different types of robots, such as Unmanned Ground Vehicles (UGVs) and Unmanned Aerial Vehicles (UAVs), as in Rahnemoonfar and Sheppard (2017). A more recent example of this kind of system is given by Pretto et al. (2021), where the multi robot approach is explored in depth together with the use of multi spectral cameras to detect and monitor both crops and weeds. These approaches, while effective, require a good deal of time from experienced technicians and are not applicable to crops such as table grapes, where the foliage covers the fruit from above. However, Ballesteros et al. (2020) use UGVs equipped with multi spectral equipment to compute Vegetation Indexes and then train a yield estimator leveraging the correlation between plant vigour and yield quantity. This approach, using different sensors and indirect measures, is valuable and could be considered complementary to the one we propose, where yield can be estimated by directly tracking and counting the fruit instances.

1.1.1 Grape Detection and Segmentation

An early approach to grape detection for yield estimation has been presented by Nuske et al. (2014). In their work a robotic solution able to work at night with controlled lighting is presented. The trained classifier uses color, texture and shape cues to detect berries, and then bunches. The experiments are conducted across four seasons with good results. This holistic approach, however, requires a significant investment in the robotic platform and is not suited to small and medium business owners. Another early approach to grape detection is presented by Skrabanek and Majerík (2016). Here the authors use HOG descriptors together with a Support Vector Machine to build a white wine grape detector. An approach that builds on these early results and data is the one presented by Pérez-Zavala et al. (2018). The authors again use hand-crafted features (HOG, FRST and LBP) to feed an SVM-based detector, and use geometrical considerations to separate self-occluding grape bunches; the resulting system shows some robustness to color and illumination variability. The yield estimate is then derived from the number of berries detected. Another approach to grape yield estimation that is based on geometrical considerations is that of Liu et al. (2017), where the detection is done on the early stage buds that shoot from the branches, in an unsupervised fashion, using only Gaussian fitting. The advantage of this method is its independence from labelled data and its reliance on a simple camera as input. In this sense this work is close to ours, but the approach is usable only early in the season and does not take into account the yield loss from malformed and diseased grape bunches. We consider this kind of approach complementary to ours, since it can be used to obtain early season predictions that could be refined later by proper detection-based methods.

1.2 Contribution

As we have shown in the related works, supervised solutions to detection and tracking have been proposed many times and show interesting performance, but the generality of these solutions is naturally impaired by the limited data each solution is trained on. Even if for most of the common fruits a supervised solution with a reasonable amount of labelled data exists (e.g. apples Bargoti and Underwood (2017), grapes Santos et al. (2020), tomatoes Liu et al. (2020a)), every time we want to train a new detector for the same fruit in a different field, there is considerable covariate shift due to different sensor and environmental characteristics, illumination conditions and fruit intra-species variability. Put in simpler terms, given the limited variety of labelled data for this kind of application, overfitting of the algorithm to the specific data collected is inevitable. We therefore propose a different approach to the problem: collect data for the target application and find easy and robust ways to bridge the covariate shift between any existing data and the target data. We work on this problem using the case of table grapes, for which there is some labelled data available (e.g. the wine grape dataset from Santos et al. (2020)), but which is not enough to work on a different variety (e.g. Pizzutello instead of Cabernet), cultivated with slightly different techniques (tendone structure instead of standard trellis structure). We propose algorithmic techniques to produce pseudo labelled data in order to bridge the covariate shifts that occur whenever a new specific crop becomes the target of a computer vision system for precision agriculture. We explicitly tackle the problem of doing so with limited hardware and software resources, to address the needs of small and medium businesses. For this reason all the pseudo labelling strategies presented are based on simple videos collected with a cellphone camera.
With this in mind, the specific pseudo labelling strategies we propose are of two kinds:

Automatic bounding boxes generation for objects contained in consecutive video frames, based on a starting estimate and 3D structure geometrical considerations. We show that, leveraging a simple initial labelling - which could be manual or automatic - and the information that we can get from feature matching and structure from motion, we are able to generate new labelled data that greatly increases the performance of the detector.
Pseudo mask generation for instance segmentation: we show how, starting from a simple bounding box - which could be the one automatically generated in the previous step - it is possible to use a segmentation network together with a refining strategy to generate new mask labels.

Figure 1: This figure shows the complete system architecture. The inputs are a source dataset (red cylinder) and a video collected on the field by the robot or a farmer (first green cylinder, TVid, expanded as a sequence of frames to show the keyframe selection process). SDet and SSeg (light red blocks) are the initial detection and segmentation networks trained only on the source dataset. All the intermediate computing blocks are depicted in orange, while the intermediate outputs are in blue circles. Both the pseudo bounding boxes and pseudo masks produced are depicted in yellow, while the detection and segmentation networks trained on these new labels (TDet and TSeg) are depicted in light green. The data flow in the system is also color coded as per legend.

2 Materials and Methods

In this Section we discuss the data, the general architecture of the system, and the algorithms on which it is based. Section 2.1 starts by describing the global system architecture and introducing its components. Then, in Section 2.2 we describe the experimental field where the target data has been collected, while in Section 2.3 we describe how it was collected and how it compares to the source data that was already available. Section 2.4 introduces the metrics used for our experimental evaluation. The following Sections give more details of each subsystem (Sections 2.5, 2.6, 2.7).

2.1 System Overview

An overview of the system is depicted in Figure 1. The guiding principle of this work is economy of data labelling and data reuse. For this reason, the only two sources of data are the source dataset (data available from a similar task) and a video collected on the target field. The source dataset is used to train the initial detection and segmentation models, namely the Source Detector Network (SDet) and the Source Segmentation Network (SSeg). SDet is not perfectly tuned to the target environment, but it can still be used on selected frames of the input video, which we call keyframes. To keep this solution simple, we consider equally spaced keyframes starting from the first frame, but other strategies could be devised. A set of initial bounding boxes is extracted from each keyframe, using a high confidence threshold to limit false positives. Then, the whole video is passed through a Geometric Consistency block (GC block) that extracts features from each frame and associates them. We tested two different options for this block, as will be shown in the following Sections. Using this geometric information, together with the initial bounding boxes extracted from the keyframes, it is possible to interpolate the bounding box positions for the remaining frames with high accuracy. These new bounding boxes are our pseudo labels for training the detector on the target environment, which we call Target Detector (TDet).
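The keyframe selection and confidence filtering described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the detection tuple layout and the threshold value used in the usage note are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed detection layout: (x_min, y_min, x_max, y_max, confidence).
Detection = Tuple[float, float, float, float, float]


def select_keyframes(num_frames: int, step: int) -> List[int]:
    """Return equally spaced keyframe indices, starting from the first frame."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(range(0, num_frames, step))


def keep_confident(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep only the detections whose confidence reaches the threshold.

    A high threshold trades recall for fewer false positives, which is the
    behaviour wanted for the initial boxes extracted from a keyframe.
    """
    return [d for d in detections if d[4] >= threshold]
```

For instance, `select_keyframes(10, 4)` yields frames 0, 4 and 8, and `keep_confident(boxes, 0.9)` would discard every box scored below 0.9; the value 0.9 is purely illustrative, since no threshold is given in this Section.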

The Detection Pseudo-Labels Generation (DPLG) sub-system can be used independently of the Segmentation Pseudo-Labels Generation (SPLG) sub-system. To prove the effectiveness of the approach, we evaluate the performance of TDet on the bunch tracking problem, i.e. counting the number of grape bunches by counting the instances tracked along a video. This problem is relevant, since it can be used for yield estimation purposes. We test two different tracking algorithms, which are described in Section 2.6.3 and evaluated in Section 3.2.

The goal of the second part of the system is to generate pseudo masks for training an instance segmentation network. This sub-system can be seen both as an independent pseudo label generator, or as part of a bigger system such as the one we describe here. As mentioned before, the SSeg is trained only on source data and is not able to produce good segmentation masks on the Target Data. However, it is possible to give the network some information cues that can greatly improve the mask estimates. The first one is the bounding box region in which the instance should be segmented. This cue comes easily from the previous step of pseudo bounding box generation, but it could be produced otherwise. This generates the initial pseudo masks. It is possible to use these pseudo masks for training the Target Segmentation Network (TSeg) but this would lead to poor performances due to confirmation bias. We need therefore to inject external information from other cues that we have. In our system, this is the role of the pseudo masks Refining block. In Section 2.7.1 three different solutions for refinement will be described. Thanks to these refined pseudo masks it is finally possible to train the TSeg Network. Section 3.3 reports the results of the refining strategies and compares the performance of TSeg with SSeg.

Finally, the experiments of the whole system, trained only on videos and tested for instance segmentation performance are given in Section 3.4.

2.2 Experimental Field

The experimental field is located in southern Lazio (Italy). The vineyard is composed of two plots of approximately 114 m $\times$ 51 m (0.58 ha) and 122 m $\times$ 48 m (0.58 ha). Vineyards are structured as a traditional trellis system called Tendone, with a wide distance between plants (3 m $\times$ 3 m). Plantations are all older than 3 years and so in full production and health, thus representing a typical working condition for the validation of agronomic activities such as fruit harvesting or vine pruning. All structures are traditionally covered with plastic and net to protect grapes from rain and hail. The average extension of each plot is around 1 hectare and dimensions (length and width) are on average between 25 m and 50 m according to plot extension and geometry. The selected vineyard in Aprilia currently has four different table grape varieties: White Pizzutello, Black Pizzutello, Red Globe and Black Magic. Figure 3 shows some examples of these grape varieties, while Figure 2 shows images of the experimental field as well as the approximate extension of each grape variety in the vineyard.

Of the four varieties present in the vineyard, Black Magic was of very low quality and thus left untended by the field owner. White Pizzutello is identical in shape to Black Pizzutello, and the latter has the same color as the former when not ripe. Together, White and Black Pizzutello are a peculiar variety of the Lazio Region and present the highest variability in shape and color with respect to standard rounded berry variants. For these reasons, while we collected images of all the varieties, we concentrated our data labelling effort only on Black Pizzutello.

2.3 Dataset

The proposed system deals with the covariate shift from a generic source dataset to a target dataset that is representative of the images that could be collected on the field. We assembled our target dataset from two different kinds of data. The first are videos recorded using a mid-range cellphone camera (MotoG8 Plus), which simulates a data collection operation that could be performed by a farmer with ease. We collected videos moving along the vineyard (i.e. tangential to the rows), without any requirement on distance from the fruits or height from the ground. In this work we use HD (1280 $\times$ 720) videos at 10 Hz with a total of 1469 frames. Examples of these frames are shown in Figure 4. A short segment of 10 seconds has been labelled for test use in the evaluation of the tracking algorithms, while the rest has been used without labelling thanks to the semi-supervised nature of the system. We briefly call this target video dataset TVid.

The second kind of data is composed of static images of Black Pizzutello. This data simulates the images a farmer, or a robot, could collect to perform some agricultural action on specific grape bunches (e.g. quality estimation, disease detection, automatic harvesting). The dataset consists of 134 images of 3000 $\times$ 4000 resolution, collected with the same cellphone used for the videos. However, the optics and chip used for video and still images are different, as is often the case with cellphones. This is intentional, since it adds a very common source of covariate shift related to the device and capturing mode (motion vs still images). All the images in this case have been labelled for detection (bounding boxes), while a small subset (70 images) has also been labelled for instance segmentation, using the Innotescus labelling application Innotescus LLC (2022). All these labels are used for validation and testing of the algorithms described in this Section. We call this still images dataset TImg.

Together these datasets (TVid and TImg) constitute our Target dataset (TD).

As mentioned before, we work under the hypothesis that a small amount of labelled data of the same fruit exists, but that it has considerable covariate shift with respect to the TD distribution. In this work, our Source Data is the one presented by Santos et al. (2020). For the details about these data the reader can check the cited work; here we give a short summary to underline the differences between SD and TD:

the grape varieties (wine vs table, berry shape and color)
the illumination conditions (full sun vs shadows)
the camera device (Reflex vs cellphone camera)
scale of the images (standard scale vs variable scale)

To quantify the covariate shift gap, Section 3 reports the performance drop for detectors and instance segmentation networks trained on SD and tested on TD.

2.4 Metrics

In this Section we describe the metrics used to evaluate and compare the detectors, the trackers and the instance segmentation algorithms. To evaluate the detectors and instance segmentation algorithms, the standard metrics of Precision, Recall and Intersection over Union (IoU) have been used. In addition, for instance segmentation, the Average Precision, as defined in the MS COCO challenges Lin et al. (2014) has been used. Usually AP is computed for each class and then averaged to obtain the mean average precision (mAP). In this work, since there is only one class (grape), the AP coincides with the mAP. In the MS COCO metrics, the AP is calculated by computing the precision at every recall level from 0 to 1 with a step size of 0.01. The mAP is then computed by averaging the AP over all the object categories and ten IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
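As an illustration of the detection metrics just described, the sketch below computes the IoU of two boxes and an AP value sampled at the 101 recall levels from 0 to 1. It is a simplified sketch, not the official MS COCO evaluation code: the box layout and function names are assumptions, and the step that matches detections to ground-truth boxes at a given IoU threshold is taken as already done (each detection arrives flagged as a hit or a miss).

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]

# The ten IoU thresholds from 0.5 to 0.95 with a step of 0.05.
IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(scored_hits: Sequence[Tuple[float, bool]], num_gt: int) -> float:
    """AP at one IoU threshold from (confidence, is_true_positive) pairs."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    ranked = sorted(scored_hits, key=lambda s: s[0], reverse=True)
    tp_counts: List[int] = []
    precisions: List[float] = []
    tp = 0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        tp += int(is_hit)
        tp_counts.append(tp)
        precisions.append(tp / rank)
    # Interpolate: precision at a recall level is the best precision
    # obtained at that recall or any higher one.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    total = 0.0
    for k in range(101):  # recall levels k / 100
        for tp_k, prec in zip(tp_counts, precisions):
            if tp_k * 100 >= k * num_gt:  # recall tp_k / num_gt reaches k / 100
                total += prec
                break
    return total / 101
```

Averaging `average_precision` over the thresholds in `IOU_THRESHOLDS`, with the hit flags recomputed at each threshold, gives the single-class figure that the text refers to as mAP.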

To evaluate the trackers we follow the common practice of Multiple Object Tracking (MOT) as defined by Wu and Nevatia Wu and Nevatia (2006) and the CLEAR MOT metrics Bernardin and Stiefelhagen (2008). MOT is a difficult task to evaluate, since the performance metrics should capture both the precision in detecting individual instances and the accuracy in tracking each instance across multiple frames, without losing track, or switching between instances. Given a number of objects $o_{j}, j \in [0.. m]$ , the tracker produces a number of hypotheses $h_{i}, i \in [0.. n]$ . The performance of association of hypothesis and objects can be measured frame by frame by using the classic True Positive, True Negative, False Positive and False Negative figures, together with their direct descendants Precision and Recall. However, recently a number of compound indexes have been proposed to better capture the general tracker’s performance. The first one is the Multiple Object Tracking Accuracy (MOTA) defined as follows:

$$\mathrm{MOTA} = 1 - \frac{FN + FP + ID_{sw}}{GT} \in (-\infty, 1] \qquad (1)$$

where $FN$ and $FP$ are False Negatives and False Positives, $ID_{sw}$ represents the number of instances whose ID has been erroneously switched, and $GT$ is the real number of instances in the video. This index accounts for three sources of error, namely the false positive ratio, the false negative ratio and the mismatch ratio. Together, they give an idea of the general tracking accuracy. To evaluate the precision a second index was proposed, the Multiple Object Tracking Precision (MOTP):

$$\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_{t} c_{t}} \qquad (2)$$

where $c_{t}$ denotes the total number of matches in frame $t$, and $d_{t,i}$ in general represents the distance between hypothesis and object $i$ in frame $t$, but in our case can be computed as the overlap of the ground truth and hypothesis bounding boxes. This second index only measures the precision in detecting the instances, without giving any information on the tracking and association capability.
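The two indexes can be computed directly from the quantities in Equations (1) and (2). The following is a minimal sketch; the function names and the input layout (one list of overlap values per frame) are assumptions made for illustration, and the error counts are assumed to have been accumulated over the whole sequence beforehand.

```python
from typing import Sequence


def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Equation (1): 1 - (FN + FP + ID_sw) / GT, which lies in (-inf, 1]."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def motp(overlaps_per_frame: Sequence[Sequence[float]]) -> float:
    """Equation (2): sum of d_{t,i} over all matches divided by sum of c_t.

    Each inner sequence holds the overlaps d_{t,i} of the matched pairs in
    frame t, so its length is c_t.
    """
    total_matches = sum(len(frame) for frame in overlaps_per_frame)
    if total_matches == 0:
        raise ValueError("no matches to average over")
    return sum(sum(frame) for frame in overlaps_per_frame) / total_matches
```

With 2 false negatives, 1 false positive and 1 ID switch over 10 ground-truth instances, `mota(2, 1, 1, 10)` gives 0.6. The index becomes negative as soon as the errors outnumber the ground-truth instances, which is why its range is unbounded below.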

2.5 Detection and Segmentation Network Architectures

As explained in Section 2.1, the general pseudo label generation system is based on pre-trained detection and segmentation networks (SDet and SSeg), and is meant to produce the pseudo labelled data needed to train new networks that perform better on TD (TDet and TSeg).

The main parameters that influence the choice of the architectures are speed and accuracy. It is well known (Liu et al. (2020b)) that SotA detection networks can be divided into two main categories: two stage and single stage. The first kind separates detection into two phases: the first, called region proposal, gives candidate object bounding boxes, while the second filters and refines these candidates to produce the final bounding boxes and classifies the objects. The second kind instead extracts both region proposals and class predictions in one pass. The main advantage of single stage detectors is speed, which is much higher than that of two stage detectors, at the cost of a generally reduced accuracy. The main examples of single stage detectors are the YOLO variants, in particular the recent YOLOv5 Redmon and Farhadi (2018). One of the best known and best performing two stage architectures is Mask R-CNN He et al. (2017), which is also a segmentation network, more accurate than any YOLO variant, but slower and difficult to tweak for real-time use.

In this work we use the single stage YOLOv5s architecture for the experiments on tracking, since real-time detection is needed for this kind of application. In addition, some of the variants have a small number of parameters, which makes them viable for embedded applications, such as robotic harvesting. The pseudo bounding box generation could be performed offline, thus allowing use of the better performing Mask R-CNN, but we decided to use the YOLO detector to keep this sub-system self-contained. In addition, using an architecture with lower detection performance stresses and tests the robustness of the generation process. For segmentation and pseudo mask generation a segmentation network is needed, so the choice falls on Mask R-CNN. The details of the pretraining and fine tuning of the detection and segmentation networks are given in Sections 3.1.1 and 3.3.1, respectively.

Figure 5: This figure shows the detection pseudo-label generation (DPLG) sub-system alone. The source dataset is used to train an initial coarse bounding box detector (SDet) that is then used together with the SfM system, to generate a large number of new labelled images from the frames of continuous videos of the vineyard. This same system can be applied to other fruits with relative simplicity.

2.6 Detection Pseudo-Label Generation Sub-System and Tracking application

In this Section we detail the elements of the DPLG system depicted in Figure 5, i.e. the pseudo bounding box generation system, together with the tracking algorithm used for yield estimation as a possible application.

2.6.1 Geometric Consistency Block

The purpose of this block is to use geometrical correspondences extracted through epipolar geometry to associate grape instances in different frames of a video stream. We use this strategy in two ways in this work, first to extract pseudo bounding boxes, and then for tracking. In this Section we describe the general functional principles of SfM algorithms and their computational costs.

We experimented with two approaches. The first is the same used in Santos et al. (2020), which leverages an SfM software application, namely COLMAP Schönberger et al. (2016); Schönberger and Frahm (2016). Since SfM is a well-known problem, the interested reader can find details of the solutions in Hartley and Zisserman (2006); Szeliski (2022). In brief, we used the COLMAP modality that extracts sparse features from each frame and then runs a sequential all-versus-all search and matching of the features extracted from the video. These correspondences are then used to triangulate the 3D points by minimizing the 3D to 2D reprojection error. However, the nature of the problem is such that, even with the sparse setting, the computational cost grows steeply with the number of frames. Our experiments required 5 hours of computation for videos of 500 to 600 full HD frames, on a computer equipped with an Intel Core i7 at 3.4 GHz, an Nvidia GTX 950M and 16 GB of memory.

The second approach we experimented with addresses this aspect, in order to obtain a real-time solution that can be run in an online tracker, such as the one described in Section 2.6.3. The idea is that in our context a full SfM solution is not needed, since the videos collected in the vineyard are simple walks without closed loops. This means that each table grape bunch is present in at most a few consecutive frames, except for the occasional occlusion. For this reason we found that extracting 2D features from a frame $i$ and from a small number of subsequent frames $i + 1, \dots, i + n$, and then matching them, was enough to map the grape instance correspondences along the video stream. The features and descriptors used are SURF Bay et al. (2006), while the initial matching was corrected by geometrical verification using RANSAC, following common practice. An example of this process is given in Figure 6. Using this approach we were able to reach, without particular optimizations and working only on CPU, 3 frames per second on the aforementioned video and hardware.
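To make the windowed matching and the RANSAC verification step concrete, the sketch below is a deliberately simplified, dependency-free illustration. It does not use SURF and it swaps the geometric model for a single 2D translation, which is an assumption chosen only to keep the example short; the actual verification model is not specified here. All names, the tolerance and the iteration count are hypothetical.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]
Match = Tuple[Point, Point]  # (feature position in frame i, position in a later frame)


def window_pairs(num_frames: int, window: int) -> List[Tuple[int, int]]:
    """Frame pairs (i, j) with i < j <= i + window, instead of all-versus-all."""
    return [(i, j)
            for i in range(num_frames)
            for j in range(i + 1, min(i + window, num_frames - 1) + 1)]


def ransac_translation(matches: List[Match], tolerance: float = 3.0,
                       iterations: int = 100, seed: int = 0) -> List[Match]:
    """Keep the matches that agree with the best single 2D translation.

    Simplified stand-in for geometric verification: each iteration takes one
    random match as the translation hypothesis and counts the matches whose
    displacement agrees with it within the tolerance (in pixels).
    """
    rng = random.Random(seed)
    best: List[Match] = []
    if not matches:
        return best
    for _ in range(iterations):
        (sx, sy), (tx, ty) = rng.choice(matches)
        dx, dy = tx - sx, ty - sy
        inliers = [m for m in matches
                   if abs(m[1][0] - m[0][0] - dx) <= tolerance
                   and abs(m[1][1] - m[0][1] - dy) <= tolerance]
        if len(inliers) > len(best):
            best = inliers
    return best
```

The number of pairs returned by `window_pairs` grows linearly with the number of frames for a fixed window, whereas exhaustive pairing grows quadratically, which is the reason a short look-ahead window suits walks without closed loops.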

2.6.2 Bounding Box Interpolation and Pseudo Label Generation

Bounding box interpolation can be better understood by looking at Figure 7.

Starting from a bounding box found by SDet at frame $i$ , thanks to the GC Block, it is possible to have an association between the 2D features contained inside the box with some features in frame $i + n$ . Since the camera is moving, both the position of the grapes and the illumination conditions in frame $i + n$ will be different, consequently the features matched will have a different position. The question is then how to draw the new bounding box in frame $i + n$ . We use the hypothesis that the camera is slowly moving, and that the motion is tangential to the direction of the vineyard. Thanks to this hypothesis we can assume that the new bounding box will have the same size as the one found in frame $i$ .

The position of the new bounding box is computed by setting the center of the box to coincide with the centre of gravity of the features in frame $i + n$, as depicted in Figure 6(a). Another aspect to consider in evaluating the pseudo bounding box generation scheme is the effect of camera velocity combined with frame rate. If the frame rate of the video is high, or the camera velocity is low, the change in view will be minimal, and consequently the information added by such a sample will be minor. For this reason we considered it useful to explore the effect of the ratio between keyframes and other frames. We call this parameter the skip value, since it is the number of frames over which the bounding boxes predicted in frame $i$ are interpolated before taking a new prediction by SDet. Our ablation experiments showed that using skip 1 (i.e. using only SDet to produce pseudo labels) gave lower performance than using skip 2. However, increasing the skip value further does not seem to give additional advantages. This aspect is explored in Section 3.2, where we show the results of using different skip values on the tracker application.
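The interpolation rule described in this subsection can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the names `interpolate_box` and `keyframe_indices` are hypothetical, boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and the skip value is read as "one detector prediction every `skip` frames", which is consistent with skip 1 meaning that only SDet is used.

```python
def interpolate_box(box, matched_points):
    """Move `box` so that it is centred on the centroid of `matched_points`.

    box: (x_min, y_min, x_max, y_max) detected in keyframe i.
    matched_points: (x, y) positions, in frame i + n, of the features that
    were matched to features lying inside `box` in frame i.
    """
    if not matched_points:
        return None  # no correspondences, so no pseudo label for this frame
    width, height = box[2] - box[0], box[3] - box[1]
    cx = sum(p[0] for p in matched_points) / len(matched_points)
    cy = sum(p[1] for p in matched_points) / len(matched_points)
    # Same size as in the keyframe (slow, tangential camera motion is assumed).
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


def keyframe_indices(num_frames, skip):
    """Frames on which the detector is run; the others get interpolated boxes."""
    return list(range(0, num_frames, skip))


new_box = interpolate_box((100, 50, 180, 170), [(120, 100), (140, 120), (130, 110)])
detector_frames = keyframe_indices(10, 2)
```

In this toy case the matched features have their centroid at (130, 110), so the 80 by 120 pixel box is re-centred there without changing its size.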

2.6.3 Tracking for Yield Estimation

Multi-object tracking of the grape bunch instances is a preliminary step in yield estimation, as it is possible to estimate the number of bunches by counting the number of trajectories tracked by the algorithm. The main tracking-by-detection approach we consider is the one presented in Santos et al. (2020), which is based on detection and SfM. However, no metrics were given there to formally describe the performance of the approach. Therefore, we replicated the experiments and computed the metrics using as a target the test sequence of TVid, described in Section 2.3 and depicted in Figure 4. In addition, we tested another detection-based state-of-the-art tracker, DeepSORT Wojke et al. (2017), designed to work in real time using a deep association metric. We chose this tracker since the computation involved in estimating even sparse correspondences between the frames using COLMAP Schönberger and Frahm (2016) requires considerable time and is not feasible for edge or robotic devices. In addition, the computation of the full SfM solution takes a long time and limits the length of the video to a few hundred frames, while for the second approach there is no such limit. In Section 3 we compare the tracking solutions using the MOT metrics together with the tracking graphs, to gain more insight into what the tracker does and how to improve it further. An example of these graphs is given in Figure 15.

Figure 8: This figure shows the segmentation pseudo-label generation (SPLG) sub-system.

2.7 Segmentation Pseudo-Label Generation Sub-System

While detection is enough for counting tasks, segmentation, and in particular instance segmentation, is required for quantitative yield estimation or for tasks that involve a physical interaction with the volumes of the grapes (e.g. harvesting). Instance segmentation requires labels that are ideally pixel-perfect masks; however, Bellocchio et al. (2019) showed that, even with a minimal labelling signal (e.g. presence or absence of an object in an image), the task network is able to learn representations that are close to masks of the object of interest. For this reason we again adopt a pseudo-labelling approach to this problem, starting with a network pretrained on the WGISD source dataset and then using simple external cues as the information signal that helps in refining the label. The overview of this sub-system is depicted in Figure 8.

Our SSeg network is Mask R-CNN trained on WGISD, as usual. Mask R-CNN in its basic form extracts region proposals and uses them to predict bounding boxes and instance segmentation masks. However, it is possible to use the segmentation subnetwork of Mask R-CNN as the pseudo mask initial generator. In particular, Mask R-CNN is wired differently at inference time than at training time, since the bounding boxes predicted by the detection head are directly fed to the mask head. The network will use this bounding box as a cue, or as an attention mechanism, which helps the segmentation sub network to output a useful pseudo mask. This is depicted in Figure 9. This strategy will mitigate the problem of confirmation bias, since the box comes from an external information source. In our system the bounding boxes could come from the output of the DPLG sub-system. At the same time, in Section 3 we show the performances of the pseudo mask generation starting from ground truth bounding boxes so as to better isolate the performance contribution of the mask generation process. In this way, the number of pseudo-masks will coincide with the actual number of grape clusters in the image, and the measured error will only be due to the mask generation process. The qualitative difference of segmentation mask between the standard wiring of the Mask R-CNN network, and the one with an external attention mechanism is shown in Figure 10.

Figure 9: Mask R-CNN internal wiring at training and inference times. At training time, the mask prediction head uses the same inputs as the other two heads, i.e. the RoI cropped features. At inference time the cropping is done using only the bounding box proposals of the bounding box prediction head. Our system uses only the feature extraction part and drops the bounding box regression, using instead either the bounding boxes coming from ground truth or the bounding box pseudo-labels generated by the DPLG sub-system. The dashed blue line in the lower diagram shows where the wire is cut and our bounding box proposals are injected.

Figure 10: Left image: pseudo mask produced by Mask R-CNN trained only on the Source Dataset and without a bounding box cue. Right image: same image showing the effect of giving a bounding box cue at inference time.

2.7.1 Pseudo Mask Refining Block

To refine the pseudo masks and reduce or remove confirmation bias, an external source of information is needed. Earlier works have addressed this aspect, such as Khoreva et al. (2017). We tried three different strategies to refine the initial masks, using simple computer vision techniques that work on different principles from the convolutional filters contained in SSeg and that rely on simple geometrical considerations.

Dilation: the first method originates from the observation that SSeg tends to underestimate the masks on the target data. For this reason, a simple morphological dilation that expands the mask until it touches the reference bounding box is able to add valuable information to the label. The dilation is applied with a 5x5 circular-shaped kernel. An example of the result is given in Figure 10(a).
SLIC: Simple Linear Iterative Clustering (SLIC) Achanta et al. (2012) is a method for super-pixel segmentation of the image. Super-pixels are contiguous regions of the image that are clustered together by a k-means algorithm running on both color and space (5-dimensional). We apply this super-pixel division to the entire image and compare it with each pseudo mask. The SLIC algorithm used was the one implemented in the Python scikit-image library van der Walt et al. (2014), with 2000 segments and compactness 0.1. All the super-pixels that are covered by the pseudo mask for more than an upper threshold $t_{u} = 70\%$ are added to the mask, while all the super-pixels that are covered for less than a lower threshold $t_{l} = 30\%$ are removed from the mask. The rationale is that in this way we should also be able to remove the background pixels erroneously contained in the initial pseudo mask. An example of the result is given in Figure 10(b).
GrabCut: this is an iterative segmentation technique introduced by Rother et al. (2004). It represents the image as a graph where foreground and background pixels are modeled as Gaussian Mixture Models and have to be separated iteratively by cuts to the graph edges. We used the OpenCV Bradski (2000) implementation, where it is possible to initialize the algorithm with the pseudo mask by defining four pixel categories, i.e. sure foreground, sure background, probable foreground, and probable background. The pseudo mask is used as probable foreground. Dilation is applied to the pseudo mask for a number of iterations proportional to the smallest dimension of the reference bounding box to obtain the probable background. Erosion is applied for the same number of iterations to obtain the sure foreground, while the rest is set to sure background. A sample of the effects of GrabCut is shown in Figure 12.
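The SLIC thresholding rule from the list above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `refine_with_superpixels` is hypothetical, masks and super-pixel labels are plain nested lists, and super-pixels whose coverage falls between the two thresholds are assumed to keep their original pixels, a case the text does not spell out. In practice the `labels` grid would come from a super-pixel routine such as `skimage.segmentation.slic`.

```python
def refine_with_superpixels(mask, labels, t_upper=0.70, t_lower=0.30):
    """Apply the coverage thresholds to a binary mask.

    mask: 2D list of 0/1 values (the initial pseudo mask).
    labels: 2D list of the same shape holding a super-pixel id per pixel.
    """
    covered, total = {}, {}
    for mask_row, label_row in zip(mask, labels):
        for m, s in zip(mask_row, label_row):
            total[s] = total.get(s, 0) + 1
            covered[s] = covered.get(s, 0) + m
    refined = []
    for mask_row, label_row in zip(mask, labels):
        new_row = []
        for m, s in zip(mask_row, label_row):
            ratio = covered[s] / total[s]
            if ratio > t_upper:
                new_row.append(1)  # mostly covered: add the whole super-pixel
            elif ratio < t_lower:
                new_row.append(0)  # barely covered: drop the whole super-pixel
            else:
                new_row.append(m)  # in between: keep the original pixels
        refined.append(new_row)
    return refined


# Three super-pixels (ids 0, 1, 2), each 4 pixels wide and 2 pixels tall.
labels = [[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]] * 2
mask = [[1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
        [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]]
refined = refine_with_superpixels(mask, labels)
```

Here super-pixel 0 is 87.5% covered and gets filled, super-pixel 2 is 12.5% covered and gets cleared, and super-pixel 1 (50% covered) is left untouched.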

Figure 12: Examples of images refined with GrabCut. The color of the overlay defines the pixel as sure foreground (blue), probable foreground (yellow), probable background (green) or sure background (purple).
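The four-category initialisation used for GrabCut can be sketched without any image library. The code below is a minimal illustration under assumptions of ours: the name `build_trimap`, the 3x3 structuring element, and the proportionality constant `scale` are hypothetical, since the text only states that the number of iterations is proportional to the smallest box dimension. The four integer labels mirror the values of OpenCV's `cv2.GC_BGD`, `cv2.GC_FGD`, `cv2.GC_PR_BGD` and `cv2.GC_PR_FGD`.

```python
# Label values mirroring OpenCV's GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD.
SURE_BG, SURE_FG, PROB_BG, PROB_FG = 0, 1, 2, 3


def _morph(mask, grow):
    """One 3x3 dilation (grow=True) or erosion (grow=False) step on a 0/1 grid."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = max(window) if grow else min(window)
    return out


def build_trimap(mask, box, scale=0.1):
    """Four-class GrabCut initialisation from a pseudo mask and its box.

    box: (x_min, y_min, x_max, y_max). `scale` is a hypothetical constant:
    the iteration count is scale * min(box width, box height), at least 1.
    """
    iterations = max(1, round(scale * min(box[2] - box[0], box[3] - box[1])))
    dilated, eroded = mask, mask
    for _ in range(iterations):
        dilated, eroded = _morph(dilated, True), _morph(eroded, False)
    trimap = []
    for y, row in enumerate(mask):
        trimap.append([
            SURE_FG if eroded[y][x] else
            PROB_FG if m else
            PROB_BG if dilated[y][x] else
            SURE_BG
            for x, m in enumerate(row)
        ])
    return trimap


# A 3x3 pseudo mask in the middle of a 7x7 image.
mask = [[1 if 2 <= x <= 4 and 2 <= y <= 4 else 0 for x in range(7)] for y in range(7)]
trimap = build_trimap(mask, box=(2, 2, 5, 5))
```

With one iteration, only the central pixel stays sure foreground, the rest of the pseudo mask becomes probable foreground, a one-pixel ring around it becomes probable background, and the image border is sure background. A grid like this could then be passed as the initial mask to a GrabCut implementation.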

In Section 3 we show how each of these refinement methods performs compared to the baseline (pseudo mask with no refinement).

3 Results and Discussion

3.1 Detection Experiments

In this section we describe the results of the detection experiments. Table 1 shows our preliminary experiments comparing different versions of the detector. All models in this initial comparison are trained and tested on WGISD. The results show that the models with a large number of parameters offer a minimal performance increase on basic detection compared to the lightweight versions S and N. This can be explained by the general homogeneity of the distribution of the grape images in WGISD, which does not require a huge number of parameters to learn a good estimator. This is expected, since in agriculture we do not work with huge amounts of data. For this reason we decided to base all the trackers on the S and N variants to reduce overfitting.

Model	mAP_0.5:0.95	mAP_0.5	Speed (ms)	Params (M)
YOLOv5n	58.2	89.4	6.3	1.9
YOLOv5s	62.5	89.7	6.4	7.2
YOLOv5m	61.9	89.5	8.2	21.2
YOLOv5l	64.0	90.5	10.1	46.5
YOLOv5x	61.5	87.5	12.1	86.7

Table 1: Comparison of the YOLOv5 models tested to be the tracker engine, trained and tested on WGISD. The models with the highest number of parameters do not have a significant performance advantage over the lightweight versions S and N.

3.1.1 Training Details

All the training experiments on YOLO were run on an Nvidia DGX-1 Station, which offers appropriate computational power for the training. All the training runs consist of 300 epochs with a batch size of 4 and the patience parameter for early stopping set to 30 epochs. The learning rate ($lr$) strategy used is "one cycle" Smith and Topin (2018), with initial $lr = 0.01$ and final $lr = 0.001$. The optimizer is SGD, with momentum 0.937 and weight decay $5 \times 10^{-4}$. The time required for training was highly influenced by the specific version of YOLO and by the number of images used: using 3368 images for training, the smallest version (YOLOv5n) required just under 3 hours for all the epochs, while the largest one (YOLOv5x) took more than 24 hours to converge. All the detection networks were pretrained on the MS COCO dataset Lin et al. (2014) and then fine-tuned on the source and target datasets. The baseline model was trained only on the source dataset, while the proposed models were trained on the target using the methods described in Section 2.6.2. In order to help generalization to different conditions, the 242 training images of the source set were augmented with random crop, random contrast, Gaussian blur, Gaussian noise and horizontal flip. During our experiments, these random augmentations have been applied offline four times, generating 726 augmented images.
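To give a feel for the learning rate range quoted above, the sketch below decays the rate from 0.01 to 0.001 over 300 epochs along a half cosine. The cosine shape is an assumption made for illustration, since the text does not state the exact shape of the schedule, and the warm-up phase of the one-cycle policy is omitted. The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=300, lr_start=0.01, lr_end=0.001):
    """Learning rate at `epoch`, decaying from lr_start to lr_end along a half cosine."""
    progress = epoch / total_epochs
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + math.cos(math.pi * progress))


# Learning rate at the start, at each quarter, and at the end of training.
schedule = [cosine_lr(e) for e in range(0, 301, 75)]
```

Under this assumption the rate stays close to its initial value for the first epochs, passes through 0.0055 at epoch 150, and flattens out near 0.001 at the end.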

Table 2 shows the difference on the images of the target set ($T_{img}$) between the detector trained only on the source data (SDet) and the one trained also on the pseudo-labels generated from the videos (TDet). The $mAP_{0.5}$ increased by 8 percentage points (from 0.69 to 0.77), despite the fact that the videos have a different distribution with respect to the images, due to the different process followed to collect them.

Detectors Performance on the TImg dataset
Model	Precision	Recall	$mAP_{0.5}$	$mAP_{0.95}$
SDet	0.90	0.56	0.69	0.46
TDet	0.98	0.68	0.77	0.47

Table 2: Comparison of the source detector (SDet) with the target one (TDet) that has been trained also using the pseudo-labels generated from the videos. The numbers shown are the performance on the test set of the TImg data.

An example of how the detection changes is given in Figure 13. The upper row shows the detections made by SDet, while the lower row shows the predictions made by TDet. Not only are the bounding boxes tighter around the instances, but more grapes are also detected, meaning that both precision and recall have improved.

Since TImg and TVid have different distributions, we also applied TDet to the test data from TVid. The results are shown in Table 3, where the network trained with the pseudo labels gained 10 percentage points in $mAP_{0.5}$ (from 0.55 to 0.65) compared to the one trained without them. In this case the increase is higher due to the minimal covariate shift between TVid test and training data.

Detectors Performance on the TVid dataset
Model	Precision	Recall	$mAP_{0.5}$	$mAP_{0.95}$
SDet	0.62	0.59	0.55	0.21
TDet	0.74	0.60	0.65	0.23

Table 3: Comparison of the source detector (SDet) with the target one (TDet) that has been trained also using the pseudo-labels generated from the videos. The numbers shown are the performance on the test set of the TVid data.

3.2 Tracking Experiments

In this section we describe the results of the tracking experiments. Each tracker is built on a YOLOv5 detector version. As explained in Section 2.6.3, we compare two tracking schemes combined with two pseudo-label generation strategies. Pseudo labels depend, among other intrinsic and extrinsic variables, on the frame rate combined with the skip value described in Section 2.6.2. As mentioned in that section, with the bounding box interpolation system we use, the best performing skip value is 2, as shown in Figure 14. This is expected: if too many frames are interpolated, the motion becomes too large to be compensated. In addition, the same results make it clear that the SfM approach has higher MOTA than DeepSORT at most skip values, due to its use of the geometrical representation of the scene. However, it is not meant for real-time computation. Table 4 summarizes the MOT metrics for the best models.

Figure 14: Comparison of tracker performance as the skip value changes. The skip value is the number of frames skipped between a keyframe and the next in the process of generating pseudo-labels. The extracted pseudo-labels influence the detector performance, due to both the quantity and the quality of the labels, and consequently the tracker is influenced as well. In this chart we show the degradation of performance as the hyper-parameter is increased. While the best performance is obtained with skip 2, the degradation with skip 5 could be tolerable considering that it requires only 20% of labelled frames instead of 50%.

Among the MOT metrics described in Section 2.4, MOTA and MOTP give a general idea of the tracking performance. However, to use trackers as yield estimators, one of the figures of interest is the number of IDs that the tracker finds, which can be considered an estimate of the number of bunches found. During the tracking process, some of the bunch IDs are switched. This can happen, for example, when two bunches occlude each other and the IDs are inverted after the occlusion ends, or when there are errors in feature association in the SfM block (Figure 15). Whatever the reason, this situation is captured by the $ID_{sw}$ metric, together with the MOTA score. More details on these challenges of MOT can be found in Bernardin and Stiefelhagen (2008). For all these reasons we focus our attention on MOTA for the tracking accuracy, and on the number of IDs for the yield estimation.

Method	MOTA $↑$	MOTP $↑$	MT $↑$	ML $↓$	$ID_{sw} ↓$	FM $↓$	Pr $↑$	Re $↑$
SfMTrack WGISD	46.741	72.545	9	9	5	29	91.304	52.427
SfMTrack Pseudo labels	55.756	74.557	11	8	9	22	89.143	64.91
DeepSort WGISD	40.499	72.229	8	8	16	19	87.198	50.069
DeepSort Pseudo labels	50.624	72.941	9	6	17	18	85.634	63.662
Method	TP $↑$	FP $↓$	FN $↓$	Dets	GT Dets	IDs	GT IDs	Yield est. Err
SfMTrack WGISD	344	70	377	414	721	19	31	38%
SfMTrack Pseudo labels	431	94	290	525	721	28	31	9%
DeepSort WGISD	299	115	422	414	721	46	31	48%
DeepSort Pseudo labels	380	156	341	536	721	39	31	26%

Table 4: The upper half of this table shows some performance metrics for two types of trackers. Both trackers are based on a YOLOv5s detector, trained with two different datasets: WGISD is the baseline, while Pseudo labels is trained on pseudo labels generated as described in Section 2.6.2. The lower half shows other common MOT metrics; notably, IDs is the one used to compute the yield estimation. It can be seen that for both tracking strategies there is a consistent advantage in using the pseudo labels. In particular, the error in yield estimation for the SfM tracker drops by 29 percentage points (from 38% to 9%).

From Table 4 it is evident that the pseudo-label generation is highly beneficial, with MOTA increasing by roughly 9 to 10 points and recall by more than 12 points compared to training on the source dataset WGISD alone. Looking closely at the performance metrics, it can be seen that the capacity to track the same IDs for the entire trajectory is stronger in DeepSORT. This is probably due to Kalman filtering, since taking into account the bounding box movement dynamics implicitly avoids errors such as the one shown in Figure 15.
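The "Yield est. Err" column of Table 4 appears consistent with the relative difference between the number of IDs found by the tracker and the ground-truth number of IDs. The formula below is our reading of the table, not one stated in the text, and the function name `yield_error` is hypothetical.

```python
def yield_error(num_ids, num_gt_ids):
    """Relative error of the bunch count, as a percentage of the ground truth."""
    return 100.0 * abs(num_ids - num_gt_ids) / num_gt_ids


# (IDs, GT IDs) copied from the lower half of Table 4.
table4 = {
    "SfMTrack WGISD": (19, 31),
    "SfMTrack Pseudo labels": (28, 31),
    "DeepSort WGISD": (46, 31),
    "DeepSort Pseudo labels": (39, 31),
}
errors = {name: yield_error(ids, gt) for name, (ids, gt) in table4.items()}
```

The computed values agree with the reported column to within one percentage point. Note that the error is symmetric: DeepSORT overestimates the bunch count while the SfM tracker underestimates it.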

3.3 Segmentation Experiments

In this Section we show the results of the experiments concerning the performance of the SPLG sub-system. The pseudo-mask generator can be seen as an independent system, or in conjunction with the DPLG sub-system. In the following the experiments of the SPLG sub-system are described, while the results of the whole system are described in Section 3.4.

3.3.1 Training Details

The implementation of Mask R-CNN we chose is Detectron2 Wu et al. (2019), using ResNet 101 as the backbone network. Again, the experiments were performed on the Nvidia DGX cluster. The training started from the MS COCO weights, and the network was then fine-tuned on the source and target datasets. For all the training runs, common data augmentation was performed by applying Gaussian blur, Gaussian noise, random changes in brightness and contrast, pixel dropout, random flip, and random crop. In addition, the training runs used a learning rate of 0.001, a weight decay of 0.0001 and a momentum of 0.9. Each training run proceeded for a maximum of 100 epochs, with early stopping monitoring the segmentation AP on the validation set of the table grape dataset, with a patience value of 20.

As for the detector, Table 5 gives an idea of the initial performance gap of SSeg when directly applied to the target dataset TImg, using the MS COCO Lin et al. (2014) metrics described in Section 2.4. We performed data augmentation on the WGISD dataset, in particular crop and resize, to mitigate the difference in scale with the TD; nonetheless, all metrics show a drop of more than 20 points in Average Precision.

Test data	Task	$AP$	$AP_{50}$	$AP_{75}$
WGISD	Detection	53.40	87.02	57.36
WGISD	Segmentation	53.60	89.44	55.41
TImg	Detection	32.65	60.40	30.37
TImg	Segmentation	32.88	65.40	34.77

Table 5: Evaluation of covariate shift for SSeg: SSeg is a Mask R-CNN model trained on the Source dataset (WGISD) and in this table we show the performance comparison when it is tested on WGISD test set (27 images) and on the test set of our TImg dataset (20 images) using some of the COCO metrics.

3.3.2 Pseudo-mask generation experiments

Training Data	mAP@0.5:0.95	mAP@0.5	mAP@0.75
wgisd (baseline)	32.88	65.40	34.77
wgisd + TImg	48.43	83.06	53.12
wgisd + TImg w/ Dilation	48.67	81.54	54.87
wgisd + TImg w/ SLIC	47.78	80.41	53.54
wgisd + TImg w/ Grabcut	49.56	81.03	57.70

Table 6: Comparison of the Mask R-CNN model trained with different pseudo-mask processing methods.

The first set of experiments is an ablation study to evaluate the performance of TSeg in isolation from TDet, in order to quantify the effectiveness of generating pseudo labels when no other mask labels on target data are provided. As before, SSeg is our baseline, and in this case TSeg is trained on both the source dataset and the training set of TImg, whose labels were generated as pseudo masks by SSeg with the successive refinement.

We performed comparison experiments between the three refinement strategies presented in Section 2.7.1, namely dilation, SLIC and GrabCut. Table 6 shows the average performance of five trials for each experiment, as evaluated on the TImg test set. In the same table, we show the results obtained by TSeg trained with and without the Refining Block. The additional pseudo masks considerably improve the performance on the target data in terms of AP, with a relative improvement of almost 50% on the baseline performance. Moreover, our results show that the best performing refining method is GrabCut. This additional refinement increases the $mAP_{0.5:0.95}$ by $1.13$ and the $mAP_{0.75}$ by $4.58$ with respect to TSeg trained without refinement, but decreases the $mAP_{0.5}$, showing that the refinement process is more effective at higher IoU levels.
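The figures quoted in this paragraph can be re-derived from Table 6 with a few lines of arithmetic. The helper name `relative_gain` is hypothetical; the numbers are copied from the table.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# mAP@0.5:0.95, mAP@0.5 and mAP@0.75 copied from Table 6.
baseline = (32.88, 65.40, 34.77)   # WGISD only
no_refine = (48.43, 83.06, 53.12)  # WGISD + TImg, no refinement
grabcut = (49.56, 81.03, 57.70)    # WGISD + TImg with GrabCut

gain_over_baseline = relative_gain(no_refine[0], baseline[0])
grabcut_deltas = tuple(round(g - n, 2) for g, n in zip(grabcut, no_refine))
```

The relative gain over the baseline comes out at about 47%, and the GrabCut differences are +1.13, -2.03 and +4.58 points for the three metrics, matching the behaviour described above.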

3.4 Complete System Experiments

In this section we describe the results obtained by using the detector described in Section 3.1 to generate the bounding boxes required by the SPLG sub-system described in Section 3.3. First, the best YOLOv5 detector, namely the one obtained with the pseudo-label generation method, was used to predict the bounding boxes from which Mask R-CNN generates the pseudo-masks. This was done both for the TImg training set and for TVid. The test data for this experiment is the TImg test set, so the training and test distributions, although both target data, are different. Table 7 again shows the comparison of TSeg with and without the Refining Block. In the case of refined TImg the improvement is still substantial, at more than 40% over the baseline. Moreover, TVid is able to give an even greater improvement thanks to the greater number of images in the training set. Despite the fact that the video frames present many differences with respect to the target dataset, TSeg still manages to increase the performance by 42% with respect to the baseline. Finally, the increase in mAP is even greater when considering both TImg and TVid as training data. Also in those experiments, the Refining Block gives an improvement over the non-refined counterpart. From the values of $mAP_{0.50}$ and $mAP_{0.75}$ we deduce that the increase is mainly due to an improvement at IoU higher than $0.75$.

Training Data	mAP@0.5:0.95	mAP@0.5	mAP@0.75
wgisd (baseline) (88)	32.88	65.40	34.77
wgisd + TImg (182)	44.45	75.68	49.78
wgisd + TImg (182) w/ Grabcut	46.41	78.09	51.74
wgisd + TVid (687)	45.99	78.30	52.59
wgisd + TVid (687) w/ Grabcut	46.66	77.38	52.22
wgisd + TImg + TVid (781)	47.44	76.63	56.06
wgisd + TImg + TVid (781) w/ Grabcut	47.81	77.23	53.27

Table 7: Comparison of the Mask R-CNN model trained on different training sets with bounding boxes generated by YOLO (TDet). The number of images in each training set is shown in parentheses.

4 Conclusions

In this work a system to produce pseudo-labels for detection and segmentation tasks has been presented. This system is particularly aimed at agricultural applications, where data scarcity is a common challenge. The system has two components, the Detection Pseudo-Label Generator and the Segmentation Pseudo-Label Generator. Both sub-systems require a starting coarse detection or segmentation learning algorithm, respectively, to find the initial label estimates. This is not a difficult requirement to fulfil, since the initial performance does not have to be high, and a limited amount of data, even from a different dataset, has been shown to be sufficient. The detection PLG is able to label any data collected from simple continuous videos of the target objects, by leveraging the 3D structure extracted from the video motion. The segmentation PLG is able to work on any image, and uses other segmentation strategies to refine the pseudo-labels produced by the initial segmentation algorithm. The two sub-systems can be chained in a single PLG system, able to extract both bounding boxes and segmentation masks from the video provided. New detection and segmentation algorithms can be trained on the pseudo-labels, and the experiments show that their performance surpasses that of the initial algorithms by a large margin.

While demonstrated on the problem of table grape labelling with covariate shift, the system can be applied to other fruits. This approach could be used also as part of more sophisticated and expensive agronomic solutions, such as robotic harvesting systems, leading to savings in the labelling costs and in development time. Future development will address iterative pseudo-label refinement and the removal of initial requirements to make the system fully unsupervised.

CRediT authorship contribution statement

Thomas A. Ciarfuglia: Conceptualization, Methodology, Validation, Investigation, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration; Ionut M. Motoi: Methodology, Software, Validation, Formal analysis, Investigation, Data Curation, Writing - Original Draft; Leonardo Saraceni: Methodology, Software, Validation, Formal analysis, Investigation, Data Curation, Writing - Original Draft; Mulham Fawakherji: Methodology; Alberto Sanfeliu: Supervision; Daniele Nardi: Supervision, Project administration, Resources, Funding acquisition, Writing - Review & Editing;

Acknowledgment

This work has been supported by the European Commission under the grant agreement number 101016906 – Project CANOPIES.

References

R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk (2012) SLIC superpixels compared to state-of-the-art superpixel methods. 34 (11), pp. 2274–2282. External Links: Document Cited by: item 2.
R. Ballesteros, D. S. Intrigliolo, J. F. Ortega, J. M. Ramírez-Cuesta, I. Buesa, and M. A. Moreno (2020) Vineyard yield estimation by combining remote sensing, computer vision and artificial neural network techniques. 21 (6), pp. 1242–1262. Cited by: §1.1.
S. Bargoti and J. P. Underwood (2017) Image segmentation for fruit detection and yield estimation in apple orchards. 34 (6), pp. 1039–1060. Cited by: §1.1, §1.2.
H. Bay, T. Tuytelaars, and L. Van Gool (2006) SURF: speeded up robust features. In Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz (Eds.), Berlin, Heidelberg, pp. 404–417. External Links: ISBN 978-3-540-33833-8 Cited by: §2.6.1.
E. Bellocchio, T. A. Ciarfuglia, G. Costante, and P. Valigi (2019) Weakly supervised fruit counting for yield estimation using spatial consistency. 4 (3), pp. 2348–2355. External Links: Document Cited by: §1.1, §2.7.
K. Bernardin and R. Stiefelhagen (2008) Evaluating multiple object tracking performance: the clear mot metrics. 2008, pp. . External Links: Document Cited by: §2.4, §3.2.
G. Bradski (2000) The OpenCV Library. Cited by: item 3.
T. A. Ciarfuglia, I. M. Motoi, L. Saraceni, and D. Nardi (2022) Pseudo-label generation for agricultural robotics applications. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2022, New Orleans, LA, USA, June 19-20, 2022, Cited by: Abstract.
M. Halstead, C. McCool, S. Denman, T. Perez, and C. Fookes (2018) Fruit quantity and ripeness estimation using a robotic vision system. 3 (4), pp. 2995–3002. Cited by: §1.1.
R. Hartley and A. Zisserman (2006) Multiple view geometry in computer vision (2. ed.). Cambridge University Press. External Links: ISBN 978-0-521-54051-3 Cited by: §2.6.1.
K. He, G. Gkioxari, P. Dollar, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.5.
Innotescus LLC (2022) Innotescus External Links: Link Cited by: §2.3.
A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele (2017) Simple does it: weakly supervised instance and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.7.1.
A. Koirala, K. B. Walsh, Z. Wang, and C. McCarthy (2019) Deep learning – method overview and review of use for fruit detection and yield estimation. 162, pp. 219–234. External Links: ISSN 0168-1699, Document, Link Cited by: §1.1.
T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §2.4, §3.1.1, §3.3.1.
G. Liu, J. C. Nouaze, P. L. Touko Mbouembe, and J. H. Kim (2020a) YOLO-tomato: a robust algorithm for tomato detection based on yolov3. 20 (7). External Links: Link, ISSN 1424-8220, Document Cited by: §1.2.
L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen (2020b) Deep learning for generic object detection: a survey. 128 (2), pp. 261–318. Cited by: §2.5.
S. Liu, S. Cossell, J. Tang, G. Dunn, and M. Whitty (2017) A computer vision system for early stage grape yield estimation based on shoot detection. 137, pp. 88–101. External Links: ISSN 0168-1699, Document, Link Cited by: §1.1.1.
S. Nuske, K. Wilshusen, S. Achar, L. Yoder, S. Narasimhan, and S. Singh (2014) Automated visual yield estimation in vineyards. 31 (5), pp. 837–860. Cited by: §1.1.1, §1.1, §1.
R. Pérez-Zavala, M. Torres-Torriti, F. A. Cheein, and G. Troni (2018) A pattern recognition strategy for visual grape bunch detection in vineyards. Computers and Electronics in Agriculture 151, pp. 136–149. External Links: ISSN 0168-1699, Document, Link Cited by: §1.1.1, §1.1.
C. Potena, D. Nardi, and A. Pretto (2016) Fast and accurate crop and weed identification with summarized train sets for precision agriculture. In Proc. of the 14th International Conference on Intelligent Autonomous Systems (IAS-14), Cited by: §1.
A. Pretto, S. Aravecchia, W. Burgard, N. Chebrolu, C. Dornhege, T. Falck, F. V. Fleckenstein, A. Fontenla, M. Imperoli, R. Khanna, F. Liebisch, P. Lottes, A. Milioto, D. Nardi, S. Nardi, J. Pfeifer, M. Popovic, C. Potena, C. Pradalier, E. Rothacker-Feder, I. Sa, A. Schaefer, R. Siegwart, C. Stachniss, A. Walter, W. Winterhalter, X. Wu, and J. Nieto (2021) Building an aerial–ground robotics system for precision farming: an adaptable solution. 28 (3), pp. 29–49. External Links: Document Cited by: §1.1.
M. Rahnemoonfar and C. Sheppard (2017) Real-time yield estimation based on deep learning. In Autonomous Air and Ground Sensing Systems for Agricultural Optimization and Phenotyping II, J. A. Thomasson, M. McKee, and R. J. Moorhead (Eds.), Vol. 10218, pp. 59 – 65. External Links: Document, Link Cited by: §1.1.
J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. External Links: 1804.02767 Cited by: §2.5.
C. Rother, V. Kolmogorov, and A. Blake (2004) “GrabCut” interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG) 23 (3), pp. 309–314. Cited by: item 3.
T. T. Santos, L. L. de Souza, A. A. dos Santos, and S. Avila (2020) Grape detection, segmentation, and tracking using deep neural networks and three-dimensional association. Computers and Electronics in Agriculture 170, pp. 105247. External Links: ISSN 0168-1699, Document, Link Cited by: §1.2, §1, Figure 4, §2.3, §2.6.1, §2.6.3.
J. L. Schönberger and J. Frahm (2016) Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.6.1, §2.6.3.
J. L. Schönberger, E. Zheng, M. Pollefeys, and J. Frahm (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), Cited by: §2.6.1.
P. Skrabanek and F. Majerík (2016) Simplified version of white wine grape berries detector based on SVM and HOG features. In Artificial Intelligence Perspectives in Intelligent Systems, R. Silhavy, R. Senkerik, Z. K. Oplatkova, P. Silhavy, and Z. Prokopova (Eds.), Cham, pp. 35–45. External Links: ISBN 978-3-319-33625-1 Cited by: §1.1.1, §1.1.
L. N. Smith and N. Topin (2018) Super-convergence: very fast training of neural networks using large learning rates. External Links: 1708.07120 Cited by: §3.1.1.
R. Szeliski (2022) Computer vision - algorithms and applications, second edition. Texts in Computer Science, Springer. Cited by: §2.6.1.
S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, T. Yu, and the scikit-image contributors (2014) Scikit-image: image processing in Python. PeerJ 2, pp. e453. External Links: ISSN 2167-8359, Link, Document Cited by: item 2.
S. Wan and S. Goudos (2020) Faster R-CNN for multi-class fruit detection using a robotic vision system. Computer Networks 168, pp. 107036. External Links: ISSN 1389-1286, Document, Link Cited by: §1.1.
N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. Cited by: §2.6.3.
B. Wu and R. Nevatia (2006) Tracking of multiple, partially occluded humans based on static body part detection. Vol. 1, pp. 951–958. External Links: ISBN 0-7695-2597-0, Document Cited by: §2.4.
Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: §3.3.1.