Learning to Construct 3D Building Wireframes from 3D Line Clouds

Abstract

Line clouds, though under-investigated in previous work, potentially encode more compact structural information of buildings than point clouds extracted from multi-view images. In this work, we propose the first network to process line clouds for building wireframe abstraction. The network takes a line cloud as input, i.e., a nonstructural and unordered set of 3D line segments extracted from multi-view images, and outputs a 3D wireframe of the underlying building, which consists of a sparse set of 3D junctions connected by line segments. We observe that a line patch, i.e., a group of neighboring line segments, encodes sufficient contour information to predict the existence and even the 3D position of a potential junction, as well as the likelihood of connectivity between two query junctions. We therefore introduce a two-layer Line-Patch Transformer to extract junctions and connectivities from sampled line patches to form a 3D building wireframe model. We also introduce a synthetic dataset of multi-view images with ground-truth 3D wireframes. Extensive experiments show that our reconstructed 3D wireframe models significantly improve upon multiple baseline building reconstruction methods. The code and data will be released $^{‡}$.

Yicheng Luo $^{*,1,2}$ (luoyicheng@bupt.edu.cn), Jing Ren $^{*,1,3}$ (jing.ren@inf.ethz.ch), Xuefei Zhe $^{1}$ (xuefeizhe@outlook.com), Di Kang $^{1}$ (di.kang@outlook.com), Yajing Xu $^{2}$ (xyj@bupt.edu.cn), Peter Wonka $^{4}$ (pwonka@gmail.com), Linchao Bao $^{†,1}$ (linbaochao@gmail.com)

$^{1}$ Tencent AI Lab, $^{2}$ Beijing University of Posts and Telecommunications, $^{3}$ ETH Zurich, $^{4}$ KAUST

$^{*}$ These authors contributed equally to this work.

$^{†}$ Corresponding author.

$^{‡}$ https://github.com/Luo1Cheng/LC2WF

1 Introduction

Recent advances in photogrammetry make it possible to obtain city-scale 3D data from drone images. Traditional point-based methods for 3D surface reconstruction from images, such as multi-view stereo [campbell2008using, furukawa2009accurate, schonberger2016pixelwise], rely on accurate key point matching, which usually becomes challenging when facing texture-less surfaces (such as glass curtain walls) or large viewpoint changes. To tackle this challenge, line segment-based methods have been proposed as a promising solution to camera pose estimation [miraldo2018minimal, salaun2016robustlinesfm] and surface reconstruction [surfaceLine, hofer2013incremental]. It has been shown to be easier and more robust to extract reliable line segments than points from multi-view images, especially when texture is lacking [Line3Dpp]. Moreover, to alleviate the computational cost of downstream geometry processing applications and to reduce the storage cost of city-scale data, there is an increasing demand for urban reconstruction with lightweight models such as 3D wireframe models or low-resolution polygonal meshes. Besides, wireframe models are also widely used in creating virtual cities or building information models.


Figure 1: Method Overview. Top: our method takes multi-view images (a) as input and outputs a high-quality 3D wireframe (g). Specifically, we first extract a line cloud (b) from the images, from which we sample line patches (c) and (e) to predict wireframe junctions (d) and connectivities (f) respectively. Bottom: we compare to four baselines: Line3Dpp [Line3Dpp] (B1) produces abstracted line clouds from the input noisy line clouds (b). Line2Surf [surfaceLine] (B2) takes (B1) as input and outputs a triangle mesh. From the point cloud (k), PolyFit [polyfit] (B3) produces a polygonal mesh, while PC2WF [PC2WF] (B4) outputs a 3D wireframe.

To obtain lightweight building models, existing methods can be roughly categorized into two groups: (1) fit multiple simple primitives such as planes or boxes to the input point cloud to obtain a building abstraction as a polygonal mesh [polyfit, fangCVPR2020, lafarge2013surface, holzmann2017plane, holzmann2018semantically, vanegas2012automatic, li2016fitting, he2021manhattan]; (2) first construct a dense triangle mesh from the input point cloud using standard surface reconstruction techniques, e.g. [kazhdan2006poisson, labatut2009robustRecon, schonberger2016colmap]; then apply mesh simplification or decimation techniques to obtain an abstracted building model based on planar shape priors [chauve2010robustPlanar, salinas2015structure, bauchet2020kinetic, lafarge2012creating]. However, both types of solutions rely on discrete operations (such as RANSAC-based fitting [polyfit] or region-growing for mesh decimation [salinas2015structure]), which makes it hard to adapt existing solutions for learning-based frameworks. To close this gap, we present the first learning-based solution for 3D building wireframe reconstruction. We choose wireframe models as output since they are best suited for piece-wise planar objects such as urban buildings [PC2WF]. A wireframe is a graph representation of an object described by a set of junctions connected by line segments. Wireframe models have become popular for characterizing the contours of objects in both 2D [LCNN, HAWP, PPGNET, Kong2021hole] and 3D [PC2WF]. However, learning a 3D building wireframe from a point or line cloud is a challenging and under-explored task, which still remains an open problem.

In this work, we propose a solution to extract the 3D building wireframe from a line cloud. As observed in [holzmann2018semantically, he2021manhattan], line clouds potentially provide more structural information, such as corner points and boundary edges of buildings, which is much harder to extract from point clouds. Moreover, a line cloud characterizes a building more compactly than a point cloud. For example, our method can output a reasonable building wireframe from a line cloud containing around 1K line segments. To achieve a comparable result, baseline methods such as PolyFit [polyfit] require a dense point cloud containing 50K-100K points.

To summarize, our main contributions are: (1) a novel learning-based solution to reconstruct 3D building wireframes from multi-view images; (2) LC2WF: a transformer-based network, and the first to process line clouds based on line patches; (3) an adapted synthetic dataset with annotated multi-view images and ground-truth 3D wireframe models.

2 Related Work

There is relatively limited work that focuses on building wireframe reconstruction from either multi-view images or point clouds. We mainly review related work of building reconstruction, wireframe reconstruction, as well as existing datasets for building reconstruction.

3D Point/Line Reconstruction Structure-from-Motion [snavely2006photo, schonberger2016colmap, agarwal2011buildingRome, crandall2011discrete, snavely2008skeletal, sweeney2015optimizing] is an effective method to acquire 3D point clouds or line clouds for surface reconstruction from multi-view images. Corresponding feature points extracted from multi-view images are used to estimate camera parameters and generate 3D point clouds. Similarly, 3D line clouds can be generated from corresponding 2D line segments detected from multi-view images [Line3Dpp]. In our work, we focus on line clouds since the building shapes can be easily characterized by line structures.

Building Reconstruction

Multiple optimization-based algorithms have been proposed for building reconstruction from point clouds. Some works [he2021manhattan, manhattan2016, li2016fitting, 3Dwireframe] use the Manhattan-world assumption to further regularize the building reconstruction. The reconstructed building meshes are usually dense and noisy, and thus different methods have been proposed for simplification or abstraction [verdie2015lod, BigSur, li2021feature]. Primitive-based building reconstruction is another popular direction to get abstracted polygonal mesh by exploiting high-level primitives such as cubes [vanegas2012automatic, li2016fitting, he2021manhattan], planes [polyfit, fangCVPR2020, lafarge2013surface, holzmann2017plane, holzmann2018semantically], or general 3D templates [nan2015template, lin2013semantic] to fit input point clouds of buildings. However, building reconstruction from a 3D line cloud has been rarely investigated. Existing works [sugiura20153d, holzmann2018semantically, surfaceLine, he2021manhattan] take 3D lines into consideration to fit planes first, instead of directly extracting the building structure from the lines. Sugiura et al. [sugiura20153d] extend the tetrahedra-carving method to the 3D point-and-line cloud setting, while Holzmann et al. [holzmann2018semantically] use additional semantic labels from image segmentation to cluster lines for plane fitting. Langlois et al. [surfaceLine] propose a RANSAC-based method to extract planes from the input line cloud, which are fused to form a watertight mesh. He et al. [he2021manhattan] estimate planes and corners from a line cloud for box fitting. Some other works [hofer2015line3d, Line3Dpp] provide heuristics for line cloud abstraction. In our work, we propose the first learning-based solution to process line clouds for 3D building wireframe reconstruction.

Wireframe Extraction As a special case of 2D edge detection [AFM, semanticLine, HT-HAWP, LETR, TP-LSD, ELSD, SOLD2, LSD, gu2021line], 2D wireframe detection from a single image [LCNN, HAWP, PPGNET, Kong2021hole] is much more explored compared to the 3D wireframe reconstruction setting. A recent work PC2WF [PC2WF] proposes a CNN-based method to extract 3D wireframe models from point clouds. Zhou et al. [3Dwireframe] provide a method to reconstruct partial 3D wireframe models from a single image, from which depth maps, junction heatmaps, edge maps, and vanishing points are estimated independently for wireframe prediction. In our work, a complete 3D wireframe model is reconstructed from a noisy line cloud extracted from multi-view images.

Dataset

There are multiple datasets that contain ground-truth 2D lines in images with semantically meaningful annotations [semanticLine, Lee2017SLNet]. Here we mainly review datasets that can be potentially used for either wireframe or building reconstruction. [wireframeDataset] and [YorkUrbanDataset] provide ground-truth 2D wireframe annotations for single images of indoor or outdoor scenes. [3Dwireframe] proposes a synthetic city dataset that contains 2D synthetic images with ground-truth depth and partial 3D wireframe annotations that are visible from a single view. There are also some datasets consisting of CAD models [abcdataset] or polygonal meshes [ren2021intuitive] that can be potentially adapted to wireframes. The ABC dataset [abcdataset] is a recent dataset consisting of one million CAD models, most of which are mechanical parts. In this work, we build on [ren2021intuitive] to create a synthetic dataset with complete ground-truth 3D wireframe models paired with multi-view images.

3 Background & Training Dataset

Notation Our method takes a set of multi-view images $I = \{I_{i}\}_{i = 1}^{m}$ as input, from which we extract a line cloud [Line3Dpp, he2021manhattan] that consists of a set of line segments $L = \{l_{i}\}_{i = 1}^{N}$, where each line segment $l_{i}$ is denoted by its two 3D endpoints, i.e., $l_{i} = (p_{i}, q_{i})$ with $p_{i}, q_{i} \in \mathbb{R}^{3}$. We denote $G$ as a group of line segments belonging to $L$, i.e., $G \subset L$. The underlying 3D wireframe model of the line cloud $L$ is denoted as $W = (V, E)$, which is defined by a set of 3D vertices (junctions) $V$ and a set of edges (connectivities) $E$ that connect those vertices. Specifically, we have $V = \{v_{i}\}_{i = 1}^{n_{v}}$ with $v_{i} \in \mathbb{R}^{3}$, and $E \subset V \times V$.

Overall Pipeline The goal of our method is to output an accurate and clean wireframe model $W$ from a set of input images $I$ of a building. Our method contains the following major building blocks (see Fig. 1): (1) a line cloud extraction step where a dense line cloud is extracted from the input images (Sec. 3.1); (2) a junction predictor which classifies if there exists a junction in a group of lines and regresses the junction position accordingly (Sec. 4.3); (3) a connectivity predictor that instantiates edges between the predicted junctions (Sec. 4.4).

3.1 Line Cloud Extraction

[Inset figure: histogram of the lengths of the line segments in the line cloud of Fig. 1 (b).]

There are roughly two groups of methods to extract a line cloud from multi-view images: (1) reconstruct 3D lines and estimate camera parameters simultaneously [PLVIO, lineBasedSLAM, salaun2017line]; (2) reconstruct 3D lines with fixed camera parameters estimated from standard structure-from-motion (SfM) methods [Line3Dpp, jain2010exploiting]. In this work, we follow the latter approach to reconstruct a line cloud, which is also adopted in Line3Dpp [Line3Dpp], the current state-of-the-art line cloud abstraction method. Specifically, the camera parameters are estimated from the multi-view images using SfM. Correspondences between the 2D line segments detected in each image (using any existing line detector) are established based on epipolar constraints, and are then used to solve for 3D line segments based on the camera parameters. We use the line cloud extractor provided in [Line3Dpp]. Note that the extracted line cloud is potentially dense, noisy, and incomplete. The inset figure shows the histogram of the lengths of the line segments in the line cloud shown in Fig. 1 (b). Around 85% of the extracted line segments are shorter than the average edge length of the underlying building (Fig. 1 (h)). This suggests that the extracted line clouds contain a large portion of short line segments (with potentially noisy orientations), which makes it challenging to extract a clean wireframe.
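
The length statistic quoted above can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming the line cloud is stored as an array of shape `(N, 2, 3)` (two endpoints per segment) and the ground-truth wireframe as a vertex array plus an index list of edges; the function name `short_segment_fraction` and this data layout are illustrative assumptions, not part of the released code.

```python
import numpy as np

def short_segment_fraction(lines, gt_vertices, gt_edges):
    """Fraction of line segments shorter than the mean ground-truth edge length.

    lines:       (N, 2, 3) array, two 3D endpoints per segment (assumed layout)
    gt_vertices: (n_v, 3) array of wireframe junctions
    gt_edges:    (n_e, 2) integer array of vertex index pairs
    """
    lines = np.asarray(lines, dtype=float)
    v = np.asarray(gt_vertices, dtype=float)
    e = np.asarray(gt_edges, dtype=int)
    seg_len = np.linalg.norm(lines[:, 1] - lines[:, 0], axis=1)
    mean_edge = np.linalg.norm(v[e[:, 0]] - v[e[:, 1]], axis=1).mean()
    return float(np.mean(seg_len < mean_edge))
```

A value close to 0.85 on a given building would correspond to the observation reported in the text.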

3.2 BuildingWF Dataset: Training Dataset

Challenges To design a data-driven solution for building wireframe reconstruction, we need large-scale datasets with ground-truth 3D wireframe annotations paired with either multi-view images or point clouds. However, it is quite challenging to obtain such datasets. Existing building datasets can be roughly categorized as follows: (1) single image with ground-truth 2D line segments [semanticLine]; (2) single image with ground-truth 2D wireframe [wireframeDataset, YorkUrbanDataset]; (3) single depth image with ground-truth partial 3D wireframe that is visible in the image [3Dwireframe]. On the other hand, the datasets used in PC2WF [PC2WF] are indeed large-scale but only contain ground-truth 3D wireframes for man-made objects such as mechanical objects [abcdataset] and furniture.

BuildingWF Dataset In this work, we introduce a synthetic dataset with ground-truth 3D building wireframe models based on the Roof-Image dataset proposed in [ren2021intuitive], which contains around 3.6K polygon meshes of residential buildings (denoted as $M_{i}$ ). See supplementary materials for some examples. We first extract the ground-truth wireframe $W^{gt} = (V^{gt}, E^{gt})$ from the provided building mesh $M$ . We then synthesize multi-view images $I$ of the building $M$ in Blender with synthetic textures based on the provided face labels. A 3D line cloud $L$ is extracted from synthetic images $I$ as mentioned in Sec. 3.1. We then use the ground-truth wireframe $W^{gt}$ and camera parameters to label each 3D line segment in the line cloud $L$ .

Specifically, we first project the 3D ground-truth wireframe $W^{gt}$ to the image planes using the corresponding camera parameters to get the ground-truth 2D wireframe for each image $I_{i} \in I$, which allows us to check whether a 3D line $l_{i} \in L$ is part of the wireframe $W^{gt}$ or not. If the 2D line segments used to reconstruct the 3D line $l_{i}$ are close enough to the ground-truth 2D wireframes, $l_{i}$ is classified as part of $W^{gt}$ and labeled as 1. For a line $l_{i}$ with label 1, we further associate it with two ground-truth junction vertices, namely the endpoints of the edge in $W^{gt}$ that $l_{i}$ belongs to. In summary, each line $l_{i}$ has a 5-dimensional label $(f, i_{1}, d_{1}, i_{2}, d_{2})$, where $f$ is a binary flag indicating whether this line is part of the wireframe, $i_{1}, i_{2}$ are the junction indices in $V^{gt}$, and $d_{1}, d_{2}$ are the distances from $l_{i}$ to the two ground-truth junctions respectively if $f = 1$. Note that the label $f$ is used to supervise our junction classifier, and the remaining labels are used to supervise our junction regressor.
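
To make the 5-dimensional label concrete, the following sketch assembles $(f, i_{1}, d_{1}, i_{2}, d_{2})$ for one line once its matching ground-truth edge is known. Several choices here are assumptions for illustration only: the 2D projection test that decides the match is not shown, the distance from a line to a junction is taken as the point-to-segment distance, and the placeholder values returned for $f = 0$ are arbitrary. The names `point_to_segment_distance` and `make_line_label` are hypothetical.

```python
import numpy as np

def point_to_segment_distance(x, p, q):
    """Euclidean distance from point x to the finite segment (p, q)."""
    d = q - p
    denom = float(d @ d)
    # Clamp the projection parameter so the closest point stays on the segment.
    t = 0.0 if denom == 0.0 else float(np.clip((x - p) @ d / denom, 0.0, 1.0))
    return float(np.linalg.norm(x - (p + t * d)))

def make_line_label(line, gt_vertices, matched_edge):
    """Return (f, i1, d1, i2, d2) for one 3D line segment.

    matched_edge is None for a noise line, or a pair (i1, i2) of indices
    into gt_vertices for the ground-truth edge the line was matched to.
    The matching itself (the 2D projection test) is not shown here.
    """
    if matched_edge is None:
        # Placeholder values for f = 0; what is stored here is an assumption.
        return (0, -1, 0.0, -1, 0.0)
    p, q = np.asarray(line, dtype=float)
    v = np.asarray(gt_vertices, dtype=float)
    i1, i2 = matched_edge
    d1 = point_to_segment_distance(v[i1], p, q)
    d2 = point_to_segment_distance(v[i2], p, q)
    return (1, i1, d1, i2, d2)
```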

4 LC2WF: Line Cloud to Wireframe

In this section, we present the key component of our framework, the LC2WF network, which reconstructs a 3D wireframe from a line cloud. Before we dive into the architecture details, we first motivate our design choices. The core problem is how to correctly predict the junction positions and the connectivities between junctions from a line cloud. Similar to a point cloud, a line cloud is nonstructural, dense, noisy, and potentially incomplete. On the other hand, orientation and length are well defined for line segments, which makes the neighborhood in a line cloud potentially more informative than the neighborhood in a point cloud, where only the distance between points is defined.

In the following, we first introduce line patches to define the neighborhood in a line cloud. We then propose our line-patch transformer [transformer], LPT, which is designed to process line patches and extract information for junction and connectivity prediction; the two predictions are combined to produce the final 3D wireframe.


Figure 2: Example line patches (red lines) w.r.t. the sampling points (blue). Top: we report the probability for each line patch to have a junction. Bottom: we report two probabilities for a pair of sampling points, i.e., (1) two points are connected, and (2) two points that potentially have graph distance of 2. Note all the line patches have the same number of lines.

4.1 Line Patches

In our setting, a line patch is defined as a group of 3D line segments collected w.r.t. sampling points. Specifically, given an arbitrary point $x \in \mathbb{R}^{3}$, the corresponding line patch $G (x)$ is defined as $G (x) = \{l \in L \mid \mathrm{dist} (x, l) \leq ϵ\}$, where $\mathrm{dist} (x, l)$ measures the point-to-line distance between a point $x$ and the 3D line on which the line segment $l$ lies. We can similarly define the line patch w.r.t. a pair of sampling points as $G (x, y) = G (x) \cup G (y)$. We observe that the line patches encode sufficient information to predict junction positions and connectivity between junctions. Specifically, the line patch $G (x)$ of point $x$ can be used to estimate the probability of having a junction located around point $x$, while the line patch $G (x, y)$ can be used to estimate the probability of having an edge connecting the point $x$ and point $y$.
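
The patch definition above translates directly into a distance query. The sketch below is one possible NumPy realization, assuming the line cloud is an array of shape `(N, 2, 3)`; the function names and this layout are assumptions made for illustration. Because the caption of Fig. 2 notes that all patches contain the same number of lines, a fixed-size variant that keeps the `k` nearest lines is included as well; how a fixed patch size is actually obtained is not stated in the text, so that variant is a guess.

```python
import numpy as np

def point_to_line_distance(x, lines):
    """Distance from point x (3,) to the infinite line through each segment.

    lines has shape (N, 2, 3): the two endpoints (p, q) of every segment.
    """
    x = np.asarray(x, dtype=float)
    lines = np.asarray(lines, dtype=float)
    p, q = lines[:, 0], lines[:, 1]
    d = q - p
    length = np.linalg.norm(d, axis=1)
    # |(x - p) x d| / |d| is the distance to the supporting line.
    cross = np.linalg.norm(np.cross(x - p, d), axis=1)
    # Degenerate segments (p == q) fall back to the point-to-point distance.
    safe = np.where(length > 0, length, 1.0)
    return np.where(length > 0, cross / safe, np.linalg.norm(x - p, axis=1))

def line_patch(x, lines, eps):
    """Indices of G(x) = {l in L | dist(x, l) <= eps}."""
    return np.flatnonzero(point_to_line_distance(x, lines) <= eps)

def line_patch_knn(x, lines, k):
    """Fixed-size variant: indices of the k lines closest to x (assumption)."""
    return np.argsort(point_to_line_distance(x, lines))[:k]

def pair_patch(x, y, lines, eps):
    """Indices of G(x, y) = G(x) united with G(y)."""
    return np.union1d(line_patch(x, lines, eps), line_patch(y, lines, eps))
```

Returning indices rather than copies of the segments keeps the per-line labels and features addressable with the same index.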

See Fig. 2 for an illustration. In examples (a1) and (a2), the lines in the red patch have multiple dominant orientations, suggesting that the blue sampling point is indeed close to a wireframe junction. This aligns with the fact that a 3D junction is formed by the intersection of at least three planes, and the corresponding intersecting lines shape the contour of the underlying building, which leads to dominant line clusters in images. The blue sampling point in example (a5) is located on a roof plane, where the roof texture (see Fig. 1) contains structural lines. In this case there exists only one dominant direction, which is not enough to support a junction. Example (a6) shows the line patch of an outlier point, where the lines in the patch are extremely unstructured. Similarly, for the examples in (b1-b6), we can see that if two sampling points are likely to be connected to each other, the corresponding line patch reveals a strong pattern (e.g., duplicated lines) to support it. All these observations align with the properties of the wireframe models of planar objects: the junctions are formed by the intersection of planes with at least three dominant directions determined by the intersecting lines. In comparison, other points such as corners in textures or noisy points do not have comparably strong signals.

4.2 Line-Patch Transformer (LPT)

Given a line patch $G (x)$ or $G (x, y)$, how can we tell if there exists a junction or an edge? We propose a line-patch transformer, LPT, to extract features from line patches, which can then be used to predict the junctions/edges. Specifically, a line patch $G (x)$ can be represented as a 2D tensor $(N, F^{in})$ that stores the $N$ neighboring lines in $G (x)$, where each line has $F^{in}$ features including the coordinates of the two endpoints and the distance between the line and the sampling point $x$. We then collect $G$ line patches (overloading notation, $G$ here denotes the number of patches) in a 3D tensor $(G, N, F^{in})$. LPT contains two transformers (see inset figure): (1) the first transformer attends to the $N$ neighbors within each line patch, to potentially find the most prominent lines for junction prediction; (2) the second transformer attends over the $G$ groups, to potentially attend to the junctions that are co-planar. We can similarly use LPT to process line patches $G (x, y)$, where the initial features can be obtained by concatenating the features of $G (x)$ and $G (y)$.

[Inset figure: the two transformers of LPT.]
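
The two attention stages can be illustrated with plain NumPy to make the tensor shapes explicit. This is a heavily simplified sketch and not the authors' architecture: it uses single-head attention without residual connections, layer normalization or feed-forward layers, random weights, a feature size of 7 (six endpoint coordinates plus one distance), and max pooling between the stages, all of which are assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the second-to-last axis."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def line_patch_features(patches, line_weights, patch_weights):
    """Map patches of shape (G, N, F_in) to per-patch features of shape (G, F)."""
    # Stage 1: within every patch, the N lines attend to each other.
    h = self_attention(patches, *line_weights)      # (G, N, F)
    # Pool over the lines of each patch (max pooling is an assumption).
    pooled = h.max(axis=1)                          # (G, F)
    # Stage 2: the G patches attend to each other.
    return self_attention(pooled, *patch_weights)   # (G, F)

def random_weights(rng, f_in, f_out):
    """Three random projection matrices (query, key, value)."""
    return tuple(rng.normal(scale=0.1, size=(f_in, f_out)) for _ in range(3))
```

For example, with `G = 4` patches of `N = 5` lines and `F_in = 7` input features, `line_patch_features` returns a `(4, F)` array that a classification or regression head could consume.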

4.3 Junction Predictor

We sample $G$ points $\{x_{k}\}_{k = 1}^{G}$ from all the line endpoints of the line cloud $L$ to construct line patches for junction prediction. Specifically, we first sample a smaller set of points (around 25%) according to the endpoint density and then sample the remaining points via Farthest Point Sampling (FPS) [fps]. We then obtain the corresponding line patch $G (x_{k})$ for each sample $x_{k}$ as discussed in Sec. 4.1. The line patches $\{G (x_{k})\}$ are fed into LPT to extract patch features, which are used to classify whether there exists a junction close to $x_{k}$, and to regress the potential junction position $p_{k}$. We then collect the predicted junctions $p_{k}$ in $V^{pred}$. The classifier helps to filter out junctions with a low confidence. During the training phase, we first draw samples that are close to the ground-truth junctions, then sample via density and FPS to get $G$ sampling points for constructing the line patches. This guarantees that we draw both positive line patches (containing a junction) and negative line patches. Specifically, a line patch is considered a negative sample if more than half of the line segments in the patch are labeled as noise (introduced in Sec. 3.2). The loss function for the classifier is a binary cross-entropy $E_{v-clf}$. The loss function for the regressor, $E_{v-reg}$, is the $L_{2}$ distance between the predicted position and the ground-truth position.
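
Farthest Point Sampling itself is a standard greedy procedure. The sketch below shows a NumPy version that accepts optional seed indices, so that a subset drawn in an earlier step (for instance the density-based 25%) can be extended by FPS; the parameter name `seed_indices` and the choice of index 0 as the default start are assumptions, and the density-based step is not shown.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, seed_indices=()):
    """Greedy FPS: repeatedly add the point farthest from the chosen set.

    points:       (M, 3) array of candidate points (e.g. line endpoints)
    num_samples:  total number of indices to return, seeds included
    seed_indices: indices that are already selected (assumed to be distinct)
    """
    points = np.asarray(points, dtype=float)
    chosen = list(seed_indices) if len(seed_indices) > 0 else [0]
    # min_dist[i] is the distance from points[i] to its nearest chosen point.
    min_dist = np.full(len(points), np.inf)
    for idx in chosen:
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[idx], axis=1))
    target = min(num_samples, len(points))
    while len(chosen) < target:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)
```

Each iteration costs one pass over the points, so sampling $G$ points from $M$ candidates takes $O(GM)$ distance evaluations.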

4.4 Connectivity Predictor

We first sample $G$ pairs of predicted junctions $(p_{k}, q_{k}) \in V^{pred} \times V^{pred}$ w.r.t. the predicted probability. We then construct the line patches $\{G (p_{k}, q_{k})\}$ and feed them into LPT to extract patch features, which are used to classify the junction pair $(p_{k}, q_{k})$ into five groups: (A) labeled as -1 if at least one of $(p_{k}, q_{k})$ is a false positive junction (i.e., does not belong to the underlying wireframe); (B) both vertices are true positive junctions and the pair is labeled by the graph distance in the underlying wireframe, i.e., (B.0) with graph distance 0 ($p_{k}$ is identical to $q_{k}$), (B.1) with graph distance 1 ($p_{k}$ is connected to $q_{k}$), (B.2) with graph distance 2 ($p_{k}, q_{k}$ are adjacent to the same vertex), or (B.3) with graph distance larger than 2. Note that the edge labels can help to further prune the false positive junctions, in addition to the probability produced by the junction classifier. During training, we sample from $V^{gt} \times V^{gt}$ and $V^{pred} \times V^{pred}$ to learn junction connectivity. A vertex from $V^{pred}$ is regarded as a false positive junction if its distance to the nearest ground-truth junction is larger than a threshold $ϵ$. Any vertex pair that contains a false positive junction is labeled as -1. The remaining vertex pairs are labeled according to the graph distance in the ground-truth wireframe, where for a pair of predicted junctions, we use the graph distance between their nearest ground-truth junctions. The loss function for the classifier is the standard cross-entropy $E_{e-clf}$.
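
The five-way training label for a junction pair can be computed with a nearest-neighbor lookup followed by a breadth-first search on the ground-truth graph. The following is a minimal sketch under stated assumptions: the class "graph distance larger than 2" is encoded as the integer 3, vertices in different connected components are also assigned to that class, and the function names are hypothetical.

```python
from collections import deque
import numpy as np

def graph_distances(num_vertices, edges, source):
    """Hop counts from source via BFS; unreachable vertices get -1."""
    adjacency = [[] for _ in range(num_vertices)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    dist = [-1] * num_vertices
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if dist[w] < 0:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def pair_label(p, q, gt_vertices, gt_edges, eps):
    """Label in {-1, 0, 1, 2, 3}; 3 encodes 'graph distance larger than 2'."""
    gt = np.asarray(gt_vertices, dtype=float)
    dist_p = np.linalg.norm(gt - np.asarray(p, dtype=float), axis=1)
    dist_q = np.linalg.norm(gt - np.asarray(q, dtype=float), axis=1)
    ip, iq = int(np.argmin(dist_p)), int(np.argmin(dist_q))
    # A junction farther than eps from every ground-truth junction is a false positive.
    if dist_p[ip] > eps or dist_q[iq] > eps:
        return -1
    hops = graph_distances(len(gt), gt_edges, ip)[iq]
    # Unreachable pairs are treated like distant pairs (an assumption).
    return 3 if hops < 0 or hops > 2 else hops
```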

4.5 Training Loss & Post-processing

The total training loss for our complete network is $E_{total} = E_{v-clf} + λ_{v} E_{v-reg} + λ_{e} E_{e-clf}$, where $λ_{v}, λ_{e}$ are balancing weights. For post-processing, we first use non-maximum suppression (NMS) to remove duplicated vertices and redundant edges that are close to each other. We then use the connectivity predictor to further prune the predicted junctions that tend to be false positives. Specifically, if a vertex pair is categorized as identical (i.e., with label 0), the junction with the lower confidence is removed. For two vertices with similar Hamming distance in adjacency and small Euclidean distance, the vertex with the lower confidence is removed. We also remove the isolated edges from the final wireframe.
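
The vertex part of the NMS step can be sketched as a greedy loop over junctions sorted by confidence. The suppression radius and the function name `junction_nms` are assumptions, since the text does not specify how closeness is measured, and edge NMS is not shown.

```python
import numpy as np

def junction_nms(positions, scores, radius):
    """Greedy NMS: keep the most confident junction, drop others within radius.

    positions: (n, 3) predicted junction positions
    scores:    (n,) classifier confidences
    radius:    suppression radius (a hypothetical parameter)
    Returns the indices of the kept junctions, most confident first.
    """
    positions = np.asarray(positions, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)
    suppressed = np.zeros(len(positions), dtype=bool)
    keep = []
    for i in order:
        if suppressed[i]:
            continue
        keep.append(int(i))
        # Suppress every junction (including i itself) inside the radius.
        suppressed |= np.linalg.norm(positions - positions[i], axis=1) <= radius
    return keep
```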

5 Experiments

We compare different methods for building mesh/wireframe reconstruction on our BuildingWF dataset with ground-truth annotations (introduced in Sec. 3.2). We briefly introduce the baselines and the metrics for evaluation in Sec. 5.1. In Sec. 5.2 we show quantitative and qualitative results on building wireframe reconstruction. See supplementary materials for ablation study, more results and discussions. Code and data will be released.

5.1 Baselines & Evaluation Metrics

Method	Input type	Input size	Output type	Runtime (sec)
line3Dpp	lines	1,388	lines	33.1
line2Surf.	lines	120	mesh	220.8
PolyFit	points	86,396	mesh	45.6
PC2WF	points	86,396	wireframe	31.7
Ours	lines	1,388	wireframe	0.9

Table 1: Baselines

To the best of our knowledge, there is no existing baseline for reconstructing building wireframes from multi-view images directly. We therefore mainly compare to the state-of-the-art 3D line cloud abstraction method, line3Dpp [Line3Dpp], the building reconstruction methods Line2Surface [surfaceLine] and PolyFit [polyfit], and the 3D wireframe reconstruction method PC2WF [PC2WF]. Specifically, line3Dpp [Line3Dpp] outputs an abstracted line cloud from a dense line cloud based on heuristics for line clustering. Line2Surface [surfaceLine] is an optimization-based method that extracts planes from a line cloud via RANSAC to form a building mesh. PolyFit [polyfit] is the state-of-the-art optimization-based method for building mesh reconstruction from a potentially noisy point cloud. PC2WF [PC2WF] is a learning-based method to reconstruct a 3D wireframe from a point cloud, which achieves plausible results on man-made objects such as mechanical objects and furniture. For evaluation, we follow PC2WF [PC2WF] and measure the precision and recall on both predicted junctions and wireframes, as well as the Wireframe Edit Distance (WED): (1) $v {AP}_{η}$ and $v {Recall}_{η}$ show the precision/recall on the predicted junctions. (2) $s {AP}_{η}$ and $s {Recall}_{η}$ report the structural quality of the predicted wireframes. Specifically, they check whether a predicted edge is a true positive, or whether a ground-truth edge is retrieved, according to the distances between the edge endpoints. (3) WED reports the number of operations and the editing distances of adding/removing predicted junctions/edges that are needed to transform the graph structure of the predicted wireframe into the ground-truth wireframe.
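
A thresholded precision/recall on junctions, in the spirit of the junction metrics described above, can be sketched as follows. This is a simplified illustration rather than the exact PC2WF protocol: a predicted junction counts as correct if any ground-truth junction lies within $η$, a ground-truth junction counts as retrieved if any prediction lies within $η$, and no one-to-one matching or confidence ranking is applied. The function name is hypothetical.

```python
import numpy as np

def junction_precision_recall(pred, gt, eta):
    """Precision and recall of predicted junctions at distance threshold eta."""
    pred = np.asarray(pred, dtype=float).reshape(-1, 3)
    gt = np.asarray(gt, dtype=float).reshape(-1, 3)
    if len(pred) == 0 or len(gt) == 0:
        return 0.0, 0.0
    # Pairwise distances between predictions (rows) and ground truth (columns).
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    precision = float(np.mean(dist.min(axis=1) <= eta))
    recall = float(np.mean(dist.min(axis=0) <= eta))
    return precision, recall
```

The structural metrics on edges follow the same pattern, with the test applied to both endpoints of an edge instead of a single point.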

5.2 Results and Comparisons

We compare to baseline methods on 757 test buildings. The line clouds (for line3Dpp, line2Surf, and our method) and the point clouds (for PolyFit and PC2WF) are extracted using the same camera parameters. Note that we use commercial software to extract high-quality point clouds (see Fig. 1 (k) and Fig. 3 (a) for some examples). Moreover, to make a fair comparison to line2Surface [surfaceLine] and PolyFit [polyfit], we post-process the output meshes into wireframes by merging co-planar faces and parallel adjacent edges, removing interior edges and isolated vertices, etc. For PC2WF we use the provided NMS for post-processing. We report the evaluations on the results after post-processing (Tab. 2).

	$v {AP}_{η}$ / $v {Recall}_{η}$ (%)				$s {AP}_{η}$ / $s {Recall}_{η}$ (%)
Method	$η = 0.15$	$η = 0.25$	$η = 0.35$	avg.	$η = 0.25$	$η = 0.35$	$η = 0.50$	avg.
line2Surf.	26.7/83.9	27.4/85.8	27.6/86.6	27.2/85.4	24.2/58.8	25.1/61.0	25.8/62.6	25.0/60.8
PolyFit	52.1/70.8	62.0/84.3	64.3/87.4	59.5/80.8	45.5/53.8	58.7/69.5	65.5/77.5	56.6/66.9
PC2WF	11.9/26.8	43.2/54.3	58.5/65.2	37.9/48.8	0.84/7.61	7.68/23.3	23.0/40.4	10.5/23.8
Ours	91.3/92.2	93.4/93.9	94.4/94.8	93.0/93.6	76.8/84.7	80.6/87.1	83.9/89.5	80.4/87.1

(a) Precision/Recall of the predicted junctions ($v {AP}_{η}$ / $v {Recall}_{η}$) and the predicted wireframe models ($s {AP}_{η}$ / $s {Recall}_{η}$) on results after post-processing. We highlight the best and the second best results.

	(WED) +vertex		(WED) +edge		(WED) -edge		(WED) Total
Method	Num	Dist	Num	Dist	Num	Dist	Num	Dist
line2Surf.	1.012	13.78	6.223	35.76	9.427	48.77	16.66	98.31
PolyFit	1.681	3.170	4.811	21.41	0.969	5.285	7.463	29.86
PC2WF	5.216	3.445	17.01	87.94	4.622	33.38	26.84	124.8
Ours	0.766	1.810	2.880	11.49	1.655	14.03	5.301	27.33

(b) Wireframe Edit Distance (WED) of the reconstructed wireframes. We report the number of operations (Num) and the editing distances in meters (Dist).

Table 2: Precision/Recall and Wireframe Edit Distance results after post-processing

Tab. 2(a) shows a fair comparison to line2Surf and PolyFit, where all the output meshes are post-processed into cleaner wireframes (see Fig. 3 for some examples of the post-processed results; see Fig. 1 (B2,B3) and supplementary materials for examples of direct outputs from different methods). We report the number of vertices/edges and the precision/recall of the predicted junctions/wireframes on the results after post-processing. We do not compare to Line3Dpp in this case since it outputs a nonstructural line cloud. The results show that our method outperforms all the baselines on building wireframe reconstruction. Tab. 2(b) shows the wireframe edit distance for different methods, where our method achieves the fewest editing operations and the smallest editing distance in total; only for edge removal does PolyFit require fewer operations (0.969 vs. 1.655).


Figure 3: Comparison to baselines on building wireframe reconstruction.

Fig. 3 shows a qualitative comparison to the baselines. We can see that line3Dpp can indeed provide more abstracted line clouds, but they are still far from clean wireframe models. Line2Surf can recover planes from the line cloud (from Line3Dpp), but it is sensitive to noise. PC2WF is trained on point clouds from mechanical objects and furniture, which are likely to have a large domain gap to the rooftop structures. Therefore, the wireframes constructed by PC2WF can only recover the walls in building point clouds. Moreover, we observe that the point clouds stemming from our scenes are noisier (though they are accurate enough) than the point clouds that PC2WF is trained on. This can also contribute to the less satisfactory results that PC2WF obtains. PolyFit is a powerful method for building reconstruction that is robust to noisy point clouds. However, PolyFit can be computationally costly when the input point cloud is too dense, since the algorithm involves integer linear programming. In comparison, our method achieves visually comparable and quantitatively better results in a much more efficient way. For example, on average it takes our method 0.9s to infer a building wireframe while it takes PolyFit 45.6s to optimize a building mesh (see Tab. 1). We show more results of our reconstructed wireframes in the supplementary materials.

6 Conclusion, Limitation & Future Work

In this work, we present the first learning-based solution for building wireframe reconstruction from line clouds, which can be efficiently extracted from multi-view images. We construct a synthetic dataset, BuildingWF, containing multi-view images of 3.6K buildings and the corresponding ground-truth wireframe models. The key component of our method is a Line-Patch Transformer which can be used for junction and connectivity prediction from line patches, a group of neighboring line segments that potentially encode the contour information of the underlying building. Our method outperforms multiple state-of-the-art building reconstruction methods on both accuracy and efficiency.

Our method still has some limitations. For example, we assume that the input multi-view images cover the whole region of the underlying building, and we expect to extract a building wireframe from a noisy but relatively complete line cloud. Therefore, no prior knowledge or extra regularizers are investigated to complete a wireframe from a partial line cloud with large missing regions. We leave wireframe reconstruction from partial line clouds as future work. Moreover, in this work we do not investigate how to convert a wireframe into a watertight mesh. We believe it would be interesting to learn face information from line patches as well, which we also leave as future work.
