TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers

Zengyuan Guo, Yuechen Yu, Pengyuan Lv, Chengquan Zhang, Haojie Li, Zhihui Wang, Kun Yao, Jingtuo Liu, Jingdong Wang

This paper was produced by Baidu Inc. Haojie Li and Zhihui Wang are professors at Dalian University of Technology.

Abstract

Table structure recognition is a crucial part of the document image analysis domain [1]. Its difficulty lies in the need to parse the physical coordinates and logical indices of each cell at the same time. However, existing methods struggle to achieve both goals, especially when the table splitting lines are blurred or tilted. In this paper, we propose an accurate and end-to-end transformer-based table structure recognition method, referred to as TRUST. Transformers are suitable for table structure recognition because of their global computations, perfect memory, and parallel computation. By introducing a novel Transformer-based Query-based Splitting Module and Vertex-based Merging Module, the table structure recognition problem is decoupled into two jointly optimized sub-tasks: multi-oriented table row/column splitting and table grid merging. The Query-based Splitting Module learns strong context information from long dependencies via Transformer networks [2], accurately predicts the multi-oriented table row/column separators, and obtains the basic grids of the table accordingly. The Vertex-based Merging Module aggregates local contextual information between adjacent basic grids, which allows it to accurately merge basic grids that belong to the same spanning cell. We conduct experiments on several popular benchmarks, including PubTabNet [3] and SynthTable [4], on which our method achieves new state-of-the-art results. In particular, TRUST runs at 10 FPS on PubTabNet, surpassing previous methods by a large margin.


I Introduction

Table Structure Recognition aims to recognize the internal structure of a table. It is a fundamental task in document understanding and has numerous practical applications [5], such as question answering, dialogue systems, table-to-text, etc. With the increasing number of documents containing tables, automated reading of tables within these images has become an urgent task.

Though studied for years, table structure recognition is still a very open research problem. The main difficulty lies in the need to parse the exact bounding box and logical index of each cell at the same time. In particular, four types of degradation and variation cause problems in most current table structure recognition systems, as illustrated in Fig. 1. First, spanning cells, which occupy at least two rows or columns, are more important than simple cells because they are more likely to be table headers [6]. Second, parsing unlined or partially lined tables is more difficult than parsing lined tables, because there are no explicit visual cues that delimit cells, columns, and rows. Third, empty cells are easier to overlook and more difficult to locate than non-empty cells. Fourth, rotation and linear perspective transformation may strongly degrade the performance of table structure recognition.

Fig. 1: Overview of different types of table structure recognition methods.

Recent efforts to improve the performance of table structure recognition can be summarized into three categories: (1) Component-based Approaches, (2) Sequence-based Approaches, and (3) Splitting-based Approaches. Unfortunately, Component-based Approaches such as DeepDeSRT [7], TableNet [8] and LGPMA [9] still suffer from boundary ambiguity in unlined tables and cannot achieve decent performance in complex scenarios such as tables with empty cells. Sequence-based Approaches such as EDD [3] depend strongly on a large amount of data for end-to-end training, and their generalization drops sharply on unseen data. Moreover, they often fail to regress accurate cell boundaries. In contrast, Splitting-based Approaches generalize well across different kinds of table images because they mainly focus on capturing global and local visual context in tables, such as the row or column separators or the linking relationships between a pair of adjacent basic cells, which do not change much among different kinds of table images. Splitting-based Approaches can also attain more accurate cell locations than Component-based and Sequence-based Approaches. Our proposed TRUST follows the Splitting-based Approaches.

However, recent Splitting-based Approaches such as SPLERGE [10] and SEM [11] may suffer from the following disadvantages: (1) The pipeline of SEM is inefficient, as it may involve time-consuming Region of Interest (RoI) [12] operations and context feature extraction via BERT [13]. (2) SPLERGE trains two isolated models, a split model and a merge model, which may make optimization more difficult than training in an end-to-end fashion. (3) Existing Splitting-based Approaches cannot handle tables with rotation and linear perspective transformation well.

In this paper, we propose an end-to-end Transformer-based table structure recognition method. Our method addresses the challenges in table structure recognition via an innovative encoder-decoder architecture, as illustrated in Fig. 2. A convolutional neural network with FPN is used as the backbone feature extractor. We perform table splitting with a Query-based Splitting Module, which introduces angle classification and starting point prediction for multi-oriented row/column separators. From these predicted separators, a fine grid structure of the table is generated.

In addition, we design a novel Vertex-based Merging Module to calculate features of all intersections of row separators and column separators, a.k.a. vertices. With these vertex features, a self-attention mechanism is built to scan all vertices and predict which basic grid pairs should be merged in four directions around each vertex: (top-left, top-right), (top-right, down-right), (down-left, down-right) and (top-left, down-left). The Vertex-based Merging Module helps to merge adjacent grids to recover the spanning table cells more accurately, even for unlined tables or tables with empty cells. Our model is trained in an end-to-end fashion and the results show the effectiveness of our method.

The major contributions of this work can be summarized in the following three points:

We present a novel end-to-end framework named TRUST to tackle the tasks of table structure recognition, which leverages multi-headed self- and cross-attention mechanisms between the visual feature maps and row/column features to capture contextual information from long dependencies more efficiently and effectively. Furthermore, we design a novel Query-based Splitting Module and Vertex-based Merging Module to extract semantic features of the row/column separators and vertices, leading to more accurate table structure recognition in a split-merge manner.
Our Splitting-based TRUST handles most categories of tables well, including those that are unlined or partially lined, and those with empty cells or spanning cells. Moreover, TRUST can recognize the structure of rotated tables, which previous Splitting-based methods do not solve well.
We develop an end-to-end trainable table structure recognition method that demonstrates superior performance on public datasets including PubTabNet and SynthTable.

II Related Works

Quite a number of table recognition techniques have been reported in recent years [14, 15, 16], and most of them can be broadly classified into three categories. The first follows a bottom-up approach, which first detects text parts or basic cell parts and then links them up to form a table structure through graph neural networks or post-processing. The second follows a sequence decoding framework, which treats table recognition as an image-to-sequence problem. The third follows a split-merge approach, which obtains the basic table grids through dense splitting line prediction and then merges some of them to form spanning cells.

Component-based Approaches. Many conventional methods follow a bottom-up approach that first detects text or basic cell parts and then connects them to form a table structure. Popular table structure recognition methods include DeepDeSRT [7], ReS2Tim [15], DeepTabStR [16], etc. More recent methods explore Graph Neural Networks to link the basic components. For example, TIES [4] combines CNN [17] and GNN [14] to construct a bottom-up model to recognize the table structure. TabStruct-Net [18] first detects individual cells and then links them with graphs to obtain the table structure. Similarly, NCGM [19] leverages graphs and modality interaction to boost the multi-modal representation for complex scenarios. Although Component-based Approaches are efficient in principle, they often fail to detect spanning cells and require extra cell detection networks, which reduces their efficiency.

Sequence-based Approaches. Methods that directly reconstruct table structure in an image-to-sequence manner have recently become popular, as reconstructing the table structure in one shot avoids the extra linking process. EDD [3] utilizes an attention-based encoder-dual-decoder architecture to convert images of tables into HTML code. Its structure decoder reconstructs the table structure while its cell decoder recognizes the cell content at the same time. Though direct recognition of table structure is efficient, the main challenge is that these methods depend heavily on the amount of training data and often fail to regress accurate cell locations.

Splitting-based Approaches. Splitting-based methods divide table structure recognition into two phases. They split the table into basic grid elements in which adjacent ones are then merged to recover spanning cells. For example, SPLERGE[10] first predicts the basic table grid pattern using Row Projection Networks and Column Projection Networks with novel projection pooling and then combines them to get table structure. Similarly, in SEM[11], a splitter is applied to obtain the fine grid structure of the table by predicting the potential regions of the table row/column separators. It also enhances the representational power of each table cell by modeling the textual information via transformer networks and merging these table cells through the attention mechanism.

Our proposed TRUST follows the splitting-based approaches. Different from existing techniques, we predict row/column separators using a transformer decoder, namely the Query-based Splitting Module, which deals with unconstrained tables more effectively and efficiently. Additionally, a novel Vertex-based Merging Module, in which the vertex representation is efficiently constructed from the learnable row/column representations of the Query-based Splitting Module, is introduced to merge table grids. Compared with the training of two independent modules in SPLERGE [10], the whole framework of TRUST can be trained in an end-to-end manner and achieves better performance.

III Proposed Method

III-A Overview

This section describes TRUST in detail. As shown in Fig. 2, it consists of three main components: a CNN backbone, a Query-Based Splitting Module, and a Vertex-based Merging Module.

We use a ResNet18[20] as the visual feature encoder of TRUST which computes increasingly high-level visual features as the layers become deeper. To alleviate the size problem of table and text, we adopt the FPN[21] strategy to merge features of different resolutions. Then in the Query-based Splitting Module, a transformer network is used as the feature decoder, in which visual features and learnable row/column position embedding features [22] are jointly used to capture features in a horizontal direction and vertical direction, respectively. We apply FFNs to the learnable row/column representations in the previous stage, generating the row and column separators of arbitrary orientations in the table. Through these predicted separators, a fine grid structure of the table is generated and each cell in this grid is a basic element of the table. Finally, those generated basic grids are further merged if they belong to the same spanning cells by a Vertex-based Merging Module. The feature representation of each vertex can be efficiently constructed by fusing the associated row and column features extracted by the Query-based Splitting Module. After further enhancing the vertex representation via FFNs, they are used to predict the merging results between adjacent basic grids.

Fig. 2: An overview of the proposed TRUST. It consists of a CNN backbone, a Query-Based Splitting Module, and a Vertex-based Merging Module. The Query-Based Splitting Module extracts the features of row/column separators and uses them to generate row splitting lines and column splitting lines, forming a fine grid structure. The row/column features are further fed into the Vertex-based Merging Module to predict the linking relations between adjacent basic cells.

III-B Query-Based Splitting Module

As illustrated in Fig. 2, the proposed Query-Based Splitting Module takes visual features and row/column embedding features as inputs. The visual features $F^{V} \in R^{H \times W \times d}$ obtained by the CNN encoder are first flattened to $R^{(H \times W) \times d}$ and fed to the Transformer decoder as the keys and values of the attention mechanism. Meanwhile, the position indexes $(0, 1, \ldots, N - 1)$ of rows and $(0, 1, \ldots, M - 1)$ of columns are fed to embedding layers to obtain the embedding features $F^{embed} \in R^{N / M \times d}$ (that is, $N \times d$ for rows and $M \times d$ for columns), which are used as the queries of the attention mechanism. $N$ and $M$ denote the predefined maximum numbers of horizontal and vertical separators in the table.

Following the row/column Transformer decoder, three fully connected layers produce the final prediction $(c_{i/j}, o_{i/j}, a_{i/j})$ for each row/column query. $c_{i/j}$ indicates whether the row/column query is classified as a horizontal/vertical separator, $o_{i/j}$ is the offset value at which each predicted horizontal/vertical separator intersects the left/top boundary of the table, and $a_{i/j}$ is the predicted rotation angle of each row/column separator. Note that the offset value and rotation angle are only meaningful when the row/column query is classified as positive. Based on these predictions, a fine grid structure of the table can be generated, as shown in the bottom-right of Fig. 2.
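The data flow of this module can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses a single attention head without learned projections, random weights in place of trained ones, and toy sizes (`H`, `W`, `d`, `N`, `NUM_ANGLES` are all hypothetical). It only shows how $N$ row queries attend over the flattened visual features and how three linear heads yield $(c, o, a)$ per query.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attend(queries, visual):
    """Single-head cross-attention: queries (Q, d) attend over visual (H, W, d)."""
    h, w, d = visual.shape
    kv = visual.reshape(h * w, d)                      # flatten to (H*W, d): keys and values
    attn = softmax(queries @ kv.T / np.sqrt(d), axis=-1)  # (Q, H*W) attention weights
    return attn @ kv                                   # (Q, d) query features

rng = np.random.default_rng(0)
H, W, d, N, NUM_ANGLES = 8, 8, 16, 10, 91              # toy sizes, assumptions only
visual = rng.normal(size=(H, W, d))                    # stands in for the CNN+FPN output
row_embed = rng.normal(size=(N, d))                    # stands in for the learned row embeddings
row_feat = cross_attend(row_embed, visual)             # row separator features

# Three linear heads (random weights here): class score, offset, angle distribution.
W_cls, W_off, W_ang = (rng.normal(size=(d, k)) for k in (1, 1, NUM_ANGLES))
c = sigmoid(row_feat @ W_cls)[:, 0]                    # (N,) separator probability
o = (row_feat @ W_off)[:, 0]                           # (N,) offset regression
a = softmax(row_feat @ W_ang, axis=-1)                 # (N, NUM_ANGLES) angle class probabilities
```

The column branch would follow the same pattern with $M$ queries.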

The proposed Query-Based Splitting Module addresses the unconstrained tables better from two aspects. First, the use of the self-attention mechanism in Transformers helps to capture contextual information from global long dependencies, which is very helpful for cases with blurred splitting lines and plentiful empty cells. Second, the prediction output of the separator with rich attributes can well describe the scene of tilted table lines.

III-C Vertex-based Merging Module

A simple table without spanning cells can be represented accurately using only the fine grid structure generated by the Query-Based Splitting Module. However, when a table contains spanning cells, the basic cells belonging to the same spanning cell need to be merged. To solve this problem, we introduce the Vertex-based Merging Module to model cell merging.

First, the intersection of each horizontal and vertical separator represents a table vertex, whose features can be efficiently obtained by the fusion of horizontal and vertical separator features. We have previously obtained the horizontal separator feature $F^{r} \in R^{N \times d}$ and vertical separator feature $F^{c} \in R^{M \times d}$ in Query-Based Splitting Module. We then expand them into a feature with a shape of $N \times M \times d$ and add them together, getting the feature representation of $N \times M$ vertices, where N and M represent the number of horizontal and vertical separators. To improve the perception of vertex features to the context information of table rows and columns, before the horizontal and vertical separator features are fused, we perform cross-feature enhancement on them. The enhancement method is shown at the bottom of Fig.2. When enhancing the horizontal separators, we use the horizontal separator feature $F^{r} \in R^{N \times d}$ as query, the vertical separator feature $F^{c} \in R^{M \times d}$ as key and value, and feed them into Transformer decoder layers to get the enhanced horizontal separator features. Using a similar operation, we can get the enhanced features of the vertical separators. The features of all vertices in the table are enhanced by the cross-attention mechanism between row and column separators. Each vertex will predict 4 attribute values, which are used to predict whether the 4 grid pairs around it should be merged. Specifically, 4 grid pairs include (top-left, top-right), (down-left, down-right), (top-left, down-left) and (top-right, down-right).
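The expand-and-add fusion described above maps directly onto array broadcasting. The following NumPy sketch assumes random features and a random linear head (`W_link` is hypothetical), and it omits the cross-feature enhancement step; it only illustrates how $N \times M$ vertex features and four link scores per vertex arise from the row and column features.

```python
import numpy as np

def vertex_features(row_feat, col_feat):
    """Broadcast-add (N, d) row features and (M, d) column features into (N, M, d)."""
    return row_feat[:, None, :] + col_feat[None, :, :]

rng = np.random.default_rng(0)
N, M, d = 4, 5, 8                       # toy sizes, assumptions only
F_r = rng.normal(size=(N, d))           # horizontal separator features
F_c = rng.normal(size=(M, d))           # vertical separator features
V = vertex_features(F_r, F_c)           # one feature vector per vertex

W_link = rng.normal(size=(d, 4))        # hypothetical linear head, one output per grid pair
link_prob = 1.0 / (1.0 + np.exp(-(V @ W_link)))  # (N, M, 4) merge probabilities
```

Each vertex feature is simply the sum of the features of the row and column separators that intersect at it, so no per-vertex pooling over the image is needed.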

From the Query-based Splitting Module and Vertex-based Merging Module mentioned above, we can get the horizontal and vertical separators and vertices of the table, and further determine the information of the basic table grids and the merged information between the grids. From this information, we can form a variety of complex table structures.
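One way to turn pairwise merge decisions into spanning cells is a union-find pass over the basic grid. The sketch below is an illustration under assumptions, not the procedure stated in the paper: it takes a hypothetical list of adjacent cell pairs to merge (as would be read off the vertex attributes) and returns each resulting cell as a `(row_start, col_start, row_end, col_end)` span.

```python
def merge_cells(n_rows, n_cols, merge_pairs):
    """Group basic cells into spanning cells given pairs of cells to merge."""
    parent = {(r, c): (r, c) for r in range(n_rows) for c in range(n_cols)}

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in merge_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for cell in parent:
        groups.setdefault(find(cell), []).append(cell)

    spans = []
    for cells in groups.values():
        rows = [r for r, _ in cells]
        cols = [c for _, c in cells]
        spans.append((min(rows), min(cols), max(rows), max(cols)))
    return sorted(spans)

# A 2x3 grid where the first two cells of the top row form one spanning cell.
spans = merge_cells(2, 3, [((0, 0), (0, 1))])
```

Reporting the bounding span of each group assumes that merged groups are rectangular, which holds for well-formed tables.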

III-D Ground Truth and Loss Function

The sections above introduced the two components of the table, the splitting lines and the vertices, together with their respective attributes. This section describes how the attribute labels are derived and how the loss function is designed.

Ground Truth for Query-based Splitting Module. We extend the definition of the splitter in SPLERGE [10] to support inclined separators. As illustrated in Fig. 3, we use parallelograms to represent the column separators. The parallelogram format maximizes the area of the separator regions without intersecting non-spanning cell content, and it is especially suitable for inclined tables.

As described earlier, we predefine $M$ queries for column separators. To ease learning in query mode, we distribute the column queries evenly along the horizontal direction of the table image according to the query index. Therefore, the $j$-th index corresponds to the horizontal position $(j * w / M)$ in the image, where $w$ is the width of the table image.

Next, we use this predefined position to determine whether the query falls inside the region of a column separator. If so, the query is labeled as positive, and its angle label is set to $θ$, the angle between the corresponding quadrilateral and the vertical direction. In addition, by drawing a vertical line at horizontal coordinate $(j * w / M)$, the point where it intersects the top boundary of the table is easily obtained, which gives the vertical offset value $x$. With the predefined horizontal position, the vertical offset value, and the rotation angle, we can draw an accurate column splitting line. The labels of the $N$ row splitting-line queries are obtained in a similar way.
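A small geometric sketch may help make the three quantities concrete. It rests on one reading of the text: the start point of a column splitting line is `(j * w / M, y_offset)`, and the line leaves that point at angle `theta` measured from the vertical. The sign convention (positive angles tilt toward increasing horizontal coordinate, image `y` growing downward), the `length` parameter, and the value `M = 64` are assumptions made for illustration.

```python
import math

def query_x(j, width, num_queries):
    """Horizontal position of the j-th column query, evenly spread over the image width."""
    return j * width / num_queries

def column_line(j, width, num_queries, y_offset, theta_deg, length):
    """Return the start and end points of a column splitting line.

    The line starts at (query_x, y_offset) and extends `length` pixels at
    `theta_deg` degrees from the vertical direction (assumed sign convention).
    """
    x0 = query_x(j, width, num_queries)
    t = math.radians(theta_deg)
    x1 = x0 + length * math.sin(t)
    y1 = y_offset + length * math.cos(t)
    return (x0, y_offset), (x1, y1)

# Upright separator: the line runs straight down from the start point.
start, end = column_line(j=5, width=640, num_queries=64,
                         y_offset=10.0, theta_deg=0.0, length=100.0)
```

With `theta_deg=0.0` the end point lies directly below the start point; a non-zero angle shifts it horizontally by `length * sin(theta)`.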

Once the positive horizontal and vertical queries of a table are determined, the basic grids of the table are determined as well.

Fig. 3: The label generation process of columns: the j-th query is a positive column query; red regions represent column separators; red points represent the start point of positive column queries and $θ$ represents the rotation angle of positive column queries. The label generation process of rows is similar.

Ground Truth for Vertex-based Merging Module. In some complex tables, a text region may span more than one basic cell, so some basic cells need to be merged. Merge labels are encoded as attributes of the intersection points. Each intersection point has four attributes, namely the upper, lower, left and right ones, which represent the four merging proposals respectively and indicate whether the adjacent basic cells around the intersection point need to be merged. If two cells need to be merged, the corresponding attributes of their common intersection points are set to positive. For example, suppose two basic cells with index (i, j) and index (i + 1, j) need to be merged, where i denotes the row and j the column. Then the (top-left, bottom-left) attribute of their common intersection point with index (i, j) is set to positive, and the (top-right, bottom-right) attribute of their common intersection point with index (i, j-1) is also set to positive. Here, a cell has the same index as its bottom-right corner point.
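The indexing rule above can be written as a short label-generation sketch. The vertical case follows the example in the text; the horizontal case is derived here by symmetry and is an assumption, as are the channel order, the `(n_rows, n_cols, 4)` array layout with one vertex per cell (its bottom-right corner), and the choice to skip vertices whose index falls outside the grid.

```python
import numpy as np

# Assumed channel order for the four grid pairs around a vertex.
TL_TR, BL_BR, TL_BL, TR_BR = 0, 1, 2, 3

def vertex_labels(n_rows, n_cols, merges):
    """Build merge labels per vertex from pairs of adjacent cells to merge."""
    labels = np.zeros((n_rows, n_cols, 4), dtype=np.int64)

    def mark(i, j, channel):
        if 0 <= i < n_rows and 0 <= j < n_cols:   # ignore out-of-grid vertices
            labels[i, j, channel] = 1

    for cell_a, cell_b in merges:
        (r1, c1), (r2, c2) = sorted([cell_a, cell_b])
        if c1 == c2 and r2 == r1 + 1:             # cells (i, j) and (i + 1, j)
            mark(r1, c1, TL_BL)                   # vertex (i, j): (top-left, bottom-left)
            mark(r1, c1 - 1, TR_BR)               # vertex (i, j-1): (top-right, bottom-right)
        elif r1 == r2 and c2 == c1 + 1:           # cells (i, j) and (i, j + 1), by symmetry
            mark(r1, c1, TL_TR)                   # vertex (i, j): (top-left, top-right)
            mark(r1 - 1, c1, BL_BR)               # vertex (i-1, j): (bottom-left, bottom-right)
        else:
            raise ValueError("cells must be horizontally or vertically adjacent")
    return labels

# Merge cells (1, 1) and (2, 1) in a 3x3 grid, as in the example in the text.
labels = vertex_labels(3, 3, [((1, 1), (2, 1))])
```

Each merge of two adjacent cells therefore marks up to two vertices, namely the two end points of the shared edge.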

Loss Function. Our model is trained in an end-to-end fashion, where the training loss combines multiple loss functions from the Query-Based Splitting Module and the Vertex-based Merging Module. Overall, the loss function is the sum of the following normalized loss terms:

$$
\begin{aligned}
& L(y_{row}, c_{row}, y_{col}, c_{col}, y_{ang}, c_{ang}, \hat{s}, s, y_{lnk}, c_{lnk}) \\
&= \frac{1}{N_{r}} L_{bce}(y_{row}, c_{row}) + \frac{1}{N_{c}} L_{bce}(y_{col}, c_{col}) \\
&\quad + \frac{1}{N_{pos}} L_{ce}(y_{ang}, c_{ang}) + \frac{1}{N_{pos}} L_{loc}(\hat{s}, s) \\
&\quad + \frac{1}{N_{vtx}} L_{bce}(y_{lnk}, c_{lnk})
\end{aligned}
\tag{1}
$$

Here, $y_{row}$ is the label of all row queries: $y_{row}^{i} = 1$ if the $i$-th query is labeled as positive, and 0 otherwise. Likewise, $y_{col}$ is the label of all column queries. $L_{bce}$ is the binary cross-entropy loss [23] over the predicted row and column query scores, $c_{row}$ and $c_{col}$ respectively, given by

$$
L_{bce}(y_{r/c}, c_{r/c}) = -\left( y \log(p_{c}) + (1 - y) \log(1 - p_{c}) \right)
\tag{2}
$$

$L_{loc}$ is the Smooth L1 regression loss [24] over the predicted start point geometries $\hat{s}$ and the ground truth $s$:

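$$
L_{loc}(\hat{s}, s) =
\begin{cases}
0.5 (\hat{s} - s)^{2}, & \text{if } |\hat{s} - s| < 1 \\
|\hat{s} - s| - 0.5, & \text{otherwise}
\end{cases}
\tag{3}
$$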

For rotation angle prediction, we limit the range of rotation angles to $[-45^{\circ}, +45^{\circ}]$, with each degree representing one prediction category. The rotation angle loss is computed as

$$
L_{ce}(y_{ang}, c_{ang}) = - \sum_{ang = -45}^{+45} y_{ang} \log(p_{c_{ang}})
\tag{4}
$$
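Treating each degree in $[-45^{\circ}, +45^{\circ}]$ as one category gives 91 classes if both end points are included. The sketch below shows one possible mapping between angles and class indices together with the cross-entropy for a one-hot label; the rounding and clamping behaviour and the index offset of 45 are assumptions made for illustration.

```python
import math

NUM_ANGLE_CLASSES = 91  # -45..+45 degrees inclusive, one class per degree

def angle_to_class(angle_deg):
    """Round to the nearest degree, clamp to [-45, 45], and shift to a 0-based index."""
    clamped = max(-45, min(45, round(angle_deg)))
    return int(clamped) + 45

def class_to_angle(index):
    """Inverse mapping from class index back to degrees."""
    return index - 45

def angle_cross_entropy(probs, angle_deg):
    """Cross-entropy for a one-hot angle label: -log of the probability of the true class."""
    return -math.log(probs[angle_to_class(angle_deg)])

uniform = [1.0 / NUM_ANGLE_CLASSES] * NUM_ANGLE_CLASSES
loss_uniform = angle_cross_entropy(uniform, 0.0)  # equals log(91) for a uniform prediction
```

Because the label is one-hot, the sum over all angle classes reduces to a single term for the true class.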

For link classification over all vertices, we also use binary cross-entropy, given by

$$
L_{bce}(y_{lnk}, c_{lnk}) = -\left( y \log(p_{c}) + (1 - y) \log(1 - p_{c}) \right)
\tag{5}
$$

Note that during training we only consider the loss of adjacent grids that need to be merged. Online Hard Example Mining (OHEM) [25] is applied to $L_{bce}(y_{r/c}, c_{r/c})$ and $L_{bce}(y_{lnk}, c_{lnk})$ to balance positive and negative samples.

The losses on row/column classification are normalized by $N_{r/c}$, which is the number of positive and hard negative samples. The losses on angle classification and start point regression are normalized by the number of positive samples $N_{pos}$. The loss on link classification is normalized by the number of positive and hard negative samples $N_{vtx}$.
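The individual loss terms and the normalization by positives plus hard negatives can be sketched in NumPy as follows. The binary cross-entropy and Smooth L1 functions follow Eq. (2) and Eq. (3); the hard-negative selection is a simplified stand-in for OHEM, and the `neg_ratio` parameter is hypothetical since the text does not state how many hard negatives are kept.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Element-wise binary cross-entropy between probabilities p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def smooth_l1(pred, target):
    """Element-wise Smooth L1: quadratic below 1, linear otherwise."""
    diff = np.abs(pred - target)
    return np.where(diff < 1, 0.5 * diff ** 2, diff - 0.5)

def ohem_bce(p, y, neg_ratio=3):
    """BCE over all positives and the hardest negatives, normalized by their count."""
    losses = bce(p, y)
    pos = y == 1
    n_pos = int(pos.sum())
    neg_losses = np.sort(losses[~pos])[::-1]            # hardest negatives first
    k = min(len(neg_losses), max(1, neg_ratio * n_pos))  # number of negatives kept
    total = losses[pos].sum() + neg_losses[:k].sum()
    return total / max(n_pos + k, 1)

p = np.array([0.9, 0.8, 0.1, 0.2])   # predicted separator probabilities
y = np.array([1.0, 0.0, 0.0, 0.0])   # one positive query, three negatives
loss_cls = ohem_bce(p, y, neg_ratio=1)
```

In this toy example only the negative with score 0.8 is kept as a hard negative, so the two easy negatives do not dilute the loss.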

III-E Inference Process

The previous sections introduced the structure of the model and the setting of labels. This part describes how the final table structure is obtained from the output of the model. First, as shown in Fig. 2, feeding the table image into the model directly yields the outputs of the Query-Based Splitting Module (QBS) and the Vertex-based Merging Module (VBM). From the output of QBS, we obtain the probability distribution of horizontal and vertical lines in the table image. Given a threshold $α$, we obtain the regions where splitting lines are distributed; each connected region represents the range of one splitting line. Within each range, we take the line unit with the highest score as the final splitting line. Once the splitting lines are obtained, the distribution of basic cells in the table and the positions of the vertices are known, so the vertex information can be read from the output of the VBM. According to the vertex attributes, we merge the basic cells to obtain the final table structure.
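The thresholding step can be sketched as a single pass over the query scores of one direction. The function below groups consecutive queries whose score exceeds the threshold into runs and keeps the highest-scoring query of each run; the default value `alpha=0.5` and the strict comparison are assumptions, since the text does not give the threshold value.

```python
def select_separators(scores, alpha=0.5):
    """Return the index of the best-scoring query in each run of scores above alpha."""
    picks = []
    start = None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s > alpha:
            if start is None:
                start = i                      # a new run begins
        elif start is not None:
            run = range(start, i)              # indices of the finished run
            picks.append(max(run, key=lambda k: scores[k]))
            start = None
    return picks

# Two runs above the threshold: indices 1..3 and index 6.
chosen = select_separators([0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.8, 0.3], alpha=0.5)
```

The selected indices identify the positive queries, whose offset and angle predictions then define the final splitting lines.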

IV Experiments

IV-A Datasets

Quite a number of table structure recognition datasets have been reported in recent years, and most of them can be broadly classified into two categories: standard tables and unconstrained tables. We evaluate our proposed method on the benchmarks listed below, which contain tables of various styles in various scenes. We also provide ablation studies to verify the effect of each proposed component.

PubTabNet [3]. PubTabNet is one of the most commonly used benchmarks for table structure recognition. It is a large-scale collection of complicated tables that contains 500777 training images, 9115 validation images, and 9138 testing images. This dataset contains a large number of three-line tables with multi-row/column cells, empty cells, etc. The images of the benchmark are extracted from scientific documents.

SynthTable. Unlike PubTabNet, which contains mostly standard tables, SynthTable covers unconstrained tables in natural scenes, requiring the table structure recognizer to have both discriminative and generative capabilities. Therefore, we also use SynthTable, proposed in TIES [4]. SynthTable is a synthetic dataset that contains 1000 images for training and 1000 images for testing. The generated tables are harmoniously blended with existing document backgrounds, and the tables have a variety of orientations, sizes, and types of separators.

IV-B Evaluation Protocol

In the evaluation process, we focus on the accuracy of the logical structure of the table. We use Tree-Edit-Distance-based Similarity (TEDS [3]) to evaluate the performance of our model in recognizing the logical structure of tables. In addition to TEDs, which considers both table structure and text content, we also evaluate performance on the Structure TEDs metric, which considers only the accuracy of table structure prediction.
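For reference, the metric as defined in [3] compares the predicted and ground-truth tables as HTML trees $T_{a}$ and $T_{b}$, where $\mathrm{EditDist}$ is the tree edit distance and $|T|$ is the number of nodes in $T$:

$$
\mathrm{TEDS}(T_{a}, T_{b}) = 1 - \frac{\mathrm{EditDist}(T_{a}, T_{b})}{\max(|T_{a}|, |T_{b}|)}
$$

A score of 1 therefore means the two trees are identical, and the structure-only variant applies the same formula to trees from which the cell text has been removed.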

IV-C Implementation Details

We use ResNet-18, pre-trained on ImageNet [17], as the backbone, and the whole network is then fine-tuned end-to-end using the ADAM [26] optimizer on the training sets of PubTabNet [3] and SynthTable. For fine-tuning, images are resized to $640 \times 640$ after random scaling, and the long side is resized to 640. Our model is trained for 20 epochs with an initial learning rate of 0.0001. The batch size is set to 16. TRUST is implemented using PaddlePaddle [27], and we use a Tesla A100 64GB GPU.

IV-D Experimental Results

The proposed technique is evaluated on the PubTabNet and SynthTable datasets. Additionally, it is benchmarked against a number of state-of-the-art techniques such as SPLERGE [10], TabStruct-Net [18], EDD [3], GTE [28], LGPMA [9], FLAG-Net [29], etc. Unlike many state-of-the-art methods that report only TEDs, we also report the Structure TEDs on PubTabNet.

Tab. I shows quantitative results on the PubTabNet dataset, which contains mostly unlined tables. As the table shows, TRUST achieves the best Structure TEDs of 97.1% and TEDs of 96.2% among all published methods for this widely studied dataset. TabStruct-Net [18] has low TEDs because it cannot handle unlined tables. Our method detects row/column separators and accordingly alleviates the unlined table problem. Note that, for a fair comparison, the OCR results on PubTabNet are obtained with the public text detection method PSENet [30] and the text recognition method MASTER [31]. The superior performance of TRUST is largely due to the proposed Query-Based Splitting Module and Vertex-based Merging Module, whose individual contributions are discussed in the Ablation Study.

Method	Str-TEDs	TEDs
EDD[3]	-	88.3%
TabStruct-Net[18]	-	90.1%
GTE[28]	-	93.0%
LGPMA[9]	96.7%	94.6%
FLAG-Net[29]	-	95.1%
Ours	97.1%	96.2%

TABLE I: Comparison of logical structure recognition results on the PubTabNet dataset.

Fig. 4: Illustration of table structure recognition results produced by TRUST. Images in the first row are from PubTabNet, images in the middle row are from SynthTable, and the final row shows failure cases. Blue lines indicate the predicted structure of the tables.

Results on SynthTable. We also evaluate on the SynthTable dataset proposed in TIES [4], which mainly consists of tables in diverse categories. Different from PubTabNet, tables in SynthTable have more diverse styles, such as rotation and linear perspective transformation, and more complex backgrounds. As Tab. V shows, our method achieves 99.2%, 96.9%, 93.6%, and 89.2% TEDs, respectively, outperforming the state-of-the-art methods. Both EDD and SPLERGE have a much lower TEDs because they cannot cope with the tables with rotation or linear perspective transformation in category 4, due to their limited rotation modeling. Our method models these situations through the Query-Based Splitting Module and accordingly alleviates the rotation and linear perspective problems. Besides, the proposed Vertex-based Merging Module explicitly merges adjacent table grids, enabling it to recognize spanning cells. This leads to up to 7.8% and 14.3% TEDs improvement over EDD and SPLERGE, respectively.

Speed analysis. We also evaluate the TRUST efficiency as shown in Tab.III. The runtime of TRUST is evaluated with NVIDIA Tesla A100 64GB. We can see that TRUST achieves 10 FPS, which is much faster than other methods such as EDD and SEM.


	Splitting Model		Merging Model			Performance (PubTabNet / SynthTable (C4))
#	Split[10]	QBS	Heuristic[10]	Merge[10]	VBM	Str-TEDs	TEDs
1	✓				✓	94.8% / 88.2%	93.4% / 85.9%
2		✓	✓			88.3% / 81.7%	85.4% / 76.7%
3		✓		✓		96.2% / 90.8%	95.3% / 86.6%
4		✓			✓	97.1% / 92.4%	96.2% / 89.2%

TABLE II: Effectiveness of Query-Based Splitting Module and Vertex-Based Merging Module on PubTabNet and SynthTable. Split: split model proposed in SPLERGE[10], QBS: Query-Based Splitting Module, Heuristic: Heuristic Post-processing, Merge: merge model proposed in SPLERGE[10], VBM: Vertex-Based Merging Module

IV-E Ablation Study

We conducted several experiments to evaluate the effectiveness of our design. These experiments mainly focus on evaluating two important modules in our TRUST: Query-based Splitting Module and Vertex-based Merging Module. Tab.II summarizes the results of TRUST with different settings on PubTabNet.

Method	FPS
TabStruct-Net[18]	0.77
EDD[3]	1
SEM[11]	1.94
Ours	10

TABLE III: Speed analysis. TRUST is the current fastest table structure recognition method with a speed of 10 FPS. The comparisons with the previous state-of-the-arts demonstrate the efficiency of our method.

The Effectiveness of Query-Based Splitting Module. We designed this module to handle the splitting of row/column separators. To evaluate it, we replace the Query-Based Splitting Module with the Split Model proposed in SPLERGE [10]. As Tab. II shows, the Split Model proposed in SPLERGE only achieves a Structure TEDs of 94.8% and a TEDs of 93.4%. By using the Query-Based Splitting Module, TRUST improves Structure TEDs and TEDs by about 2.3% and 2.8%, respectively. The improvement is largely due to the attention mechanism in the Query-Based Splitting Module, which helps capture contextual information from long dependencies in both horizontal and vertical directions.

The Effectiveness of Vertex-based Merging Module. We also conducted another experiment to evaluate the Vertex-based Merging Module. If we merge the basic cells with heuristic post-processing instead of the Vertex-based Merging Module, the Structure TEDs and TEDs drop from 97.1% $\to$ 88.3% and 96.2% $\to$ 85.4%, respectively. If we instead replace the Vertex-based Merging Module with the merge model proposed in SPLERGE, the Structure TEDs and TEDs drop from 97.1% $\to$ 96.2% and 96.2% $\to$ 95.3%, respectively. This suggests that the proposed Vertex-based Merging Module is critical to the merging results.

The Effectiveness of Cross Feature Enhancement. In this study, we evaluate the impact of cross-feature enhancement in the Vertex-based Merging Module by replacing them with feature enhancement from each own branch. As shown in Tab. IV, this results in substantial performance drops, e.g., 96.2% $\to$ 88.0%TEDS without cross feature enhancement. suggesting that the proposed cross-feature enhancement in the Vertex-based Merging Module is an important contributor to the performance boost.

	Str-TEDs	TEDs
with Cross Feature Enhancement	97.1%	96.2%
w/o Cross Feature Enhancement	90.6%	88.0%

TABLE IV: Effectiveness of Cross Feature Enhancement in Vertex-Based Merging Module on PubTabNet.
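The drops quoted in the ablation paragraphs follow directly from the reported scores. The sketch below recomputes them; the scores are copied from the text and Table IV, while the variant labels and the names `full`, `variants`, and `drop` are illustrative. `Decimal` is used only so that the subtraction of one-decimal percentages is exact.

```python
from decimal import Decimal

# (Structure TEDs, TEDs) in percent for the full TRUST model on PubTabNet.
full = (Decimal("97.1"), Decimal("96.2"))

# Ablated variants, with scores copied from the text and Table IV.
# The labels are illustrative descriptions, not names used in the paper.
variants = {
    "SPLERGE split instead of QBS": (Decimal("94.8"), Decimal("93.4")),
    "heuristic merging instead of VBM": (Decimal("88.3"), Decimal("85.4")),
    "SPLERGE merge instead of VBM": (Decimal("96.2"), Decimal("95.3")),
    "w/o cross feature enhancement": (Decimal("90.6"), Decimal("88.0")),
}


def drop(scores):
    """Percentage-point drop of a variant relative to the full model."""
    return tuple(f - s for f, s in zip(full, scores))


drops = {name: drop(scores) for name, scores in variants.items()}
for name, (d_str, d_teds) in drops.items():
    print(f"{name}: Str-TEDs -{d_str}, TEDs -{d_teds}")
```

Read this way, heuristic merging and the removal of cross feature enhancement cost the most, while swapping in the SPLERGE merge model costs under one point on both metrics.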

IV-F Qualitative Results

Fig. 4 shows a few sample images from the SynthTable dataset and the corresponding structure recognition results produced by TRUST. As Fig. 4 shows, TRUST can recognize most tables with rotation, linear perspective transform, empty cells, spanning cells, invisible separators, etc. Its performance degrades slightly on tables with perspective distortion, as shown in the third row of Fig. 4.

V Conclusion

This paper presents a robust and accurate table structure recognition method built on an encoder-decoder architecture and Transformer networks. The architecture can not only reconstruct the structure of tables in arbitrary orientations but also accurately recognize the structure of complex tables that contain spanning cells. In addition, two modules, the Query-Based Splitting Module and the Vertex-Based Merging Module, are designed to generate feature maps with contextual information from long-range dependencies in a more efficient and effective way. Transformer networks are further introduced to increase the accuracy of table structure recognition. Extensive experiments on several public datasets show that the proposed TRUST achieves superior performance compared with state-of-the-art methods, with a remarkably faster running speed.

	C1		C2		C3		C4
Method	Str-TEDs	TEDs	Str-TEDs	TEDs	Str-TEDs	TEDs	Str-TEDs	TEDs
EDD* [3]	97.8%	96.0%	98.0%	93.4%	96.1%	93.2%	89.9%	81.4%
SPLERGE* [10]	97.8%	97.0%	94.4%	91.6%	95.5%	92.1%	85.6%	74.9%
TRUST w/o[10]	99.6%	99.0%	98.0%	97.0%	96.1%	93.6%	90.7%	81.0%
TRUST	99.7%	99.2%	98.1%	96.9%	96.0%	93.6%	92.4%	89.2%

TABLE V: Comparison of logical structure recognition results on the SynthTable dataset. * denotes models that were trained by us. C1: standard tables with visible lines; C2: standard tables without visible lines; C3: standard tables with spanning cells; C4: unconstrained tables with rotation and linear perspective transform.

References

[1] K. A. Hashmi, M. Liwicki, D. Stricker, M. A. Afzal, M. A. Afzal, and M. Z. Afzal, “Current status and performance analysis of table recognition in document images with deep neural networks,” IEEE Access, vol. 9, pp. 87 663–87 685, 2021.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[3] X. Zhong, E. ShafieiBavani, and A. Jimeno Yepes, “Image-based table recognition: data, model, and evaluation,” in European Conference on Computer Vision. Springer, 2020, pp. 564–580.
[4] S. R. Qasim, H. Mahmood, and F. Shafait, “Rethinking table recognition using graph neural networks,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 142–147.
[5] R. Zanibbi, D. Blostein, and J. R. Cordy, “A survey of table recognition,” Document Analysis and Recognition, vol. 7, no. 1, pp. 1–16, 2004.
[6] Z. Chi, H. Huang, H.-D. Xu, H. Yu, W. Yin, and X.-L. Mao, “Complicated table structure recognition,” arXiv preprint arXiv:1908.04729, 2019.
[7] S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed, “Deepdesrt: Deep learning for detection and structure recognition of tables in document images,” in 2017 14th IAPR international conference on document analysis and recognition (ICDAR), vol. 1. IEEE, 2017, pp. 1162–1167.
[8] S. S. Paliwal, D. Vishwanath, R. Rahul, M. Sharma, and L. Vig, “Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 128–133.
[9] L. Qiao, Z. Li, Z. Cheng, P. Zhang, S. Pu, Y. Niu, W. Ren, W. Tan, and F. Wu, “Lgpma: Complicated table structure recognition with local and global pyramid mask alignment,” in International Conference on Document Analysis and Recognition. Springer, 2021, pp. 99–114.
[10] C. Tensmeyer, V. I. Morariu, B. Price, S. Cohen, and T. Martinez, “Deep splitting and merging for table structure decomposition,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 114–121.
[11] Z. Zhang, J. Zhang, J. Du, and F. Wang, “Split, embed and merge: An accurate table structure recognizer,” Pattern Recognition, p. 108565, 2022.
[12] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
[13] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018. [Online]. Available: http://arxiv.org/abs/1810.04805
[14] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE transactions on neural networks, vol. 20, no. 1, pp. 61–80, 2008.
[15] W. Xue, Q. Li, and D. Tao, “Res2tim: Reconstruct syntactic structures from table images,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 749–755.
[16] S. A. Siddiqui, I. A. Fateh, S. T. R. Rizvi, A. Dengel, and S. Ahmed, “Deeptabstr: deep learning based table structure recognition,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 1403–1409.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
[18] S. Raja, A. Mondal, and C. Jawahar, “Table structure recognition using top-down and bottom-up cues,” in European Conference on Computer Vision. Springer, 2020, pp. 70–86.
[19] H. Liu, X. Li, B. Liu, D. Jiang, Y. Liu, and B. Ren, “Neural collaborative graph machines for table structure recognition,” arXiv preprint arXiv:2111.13359, 2021.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[21] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[22] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213–229.
[23] U. Ruby and V. Yendapalli, “Binary cross entropy with deep learning technique for image classification,” Int. J. Adv. Trends Comput. Sci. Eng, vol. 9, no. 10, 2020.
[24] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[25] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 761–769.
[26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[27] Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle: An open-source deep learning platform from industrial practice,” Frontiers of Data and Computing, vol. 1, no. 1, pp. 105–115, 2019.
[28] X. Zheng, D. Burdick, L. Popa, X. Zhong, and N. X. R. Wang, “Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 697–706.
[29] H. Liu, X. Li, B. Liu, D. Jiang, Y. Liu, B. Ren, and R. Ji, “Show, read and reason: Table structure recognition with flexible context aggregator,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1084–1092.
[30] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao, “Shape robust text detection with progressive scale expansion network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9336–9345.
[31] N. Lu, W. Yu, X. Qi, Y. Chen, P. Gong, R. Xiao, and X. Bai, “Master: Multi-aspect non-local network for scene text recognition,” Pattern Recognition, vol. 117, p. 107980, 2021.
[32] R. Long, W. Wang, N. Xue, F. Gao, Z. Yang, Y. Wang, and G.-S. Xia, “Parsing table structures in the wild,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 944–952.