Introduction
Obtaining 3D atrial geometry from late gadolinium-enhanced magnetic resonance imaging (LGE-MRI) is a crucial task in the structural analysis of patients with atrial fibrillation (AF). In clinical studies, the atrium is commonly segmented directly on 2D images and reconstructed in 3D by exploiting the relationship between adjacent slices. However, this manual segmentation is labor-intensive, structure-specific work, and its accuracy remains challenging because of imaging artifacts, intensities that vary with the extent of fibrosis, and variable imaging quality [1]. Therefore, an intelligent algorithm for fully automated 3D segmentation is urgently needed to ensure precise reconstruction and measurement of atrial geometry, thereby facilitating clinical applications.
In 2018, the Statistical Atlases and Computational Modeling of the Heart workshop held the Left Atrial (LA) Segmentation Challenge: 27 teams participated in the final evaluation phase, and 18 teams attended the conference and proposed diverse approaches, comprising two traditional methods and 16 deep-learning models. The two traditional methods ranked second and third from the bottom, whereas the top ten entries were all deep-learning models, and the Double 3D U-Net from Xia et al. [2] achieved the best score in the challenge [3]. Deep-learning models are therefore highly promising for fully automated LA segmentation.
The objective of this research was to introduce an automated segmentation approach and verify its accuracy.
Methods
Image Acquisition and Pre-Processing
The University of Utah provided 100 3D LGE-MRIs, which were randomly split into training (N = 80) and test (N = 20) sets. Each 3D LGE-MRI scan had a spatial size of either 640 × 640 or 576 × 576 pixels and consisted of 88 slices. A 1.5 Tesla Avanto or 3.0 Tesla Verio clinical whole-body scanner was used to acquire the LGE-MRIs at a spatial resolution of 0.625 × 0.625 × 0.625 mm [3]. To obtain the segmentation masks used as ground truths, experts manually segmented the LA, including the LA appendage, the mitral valve, and the pulmonary vein regions. To increase the performance of the deep-learning models, data augmentation (e.g., elastic deformations, affine transformations, and warping) was used to artificially enlarge the training set and mitigate overfitting [4]. All images were cropped to the same size (576 × 576 × 80 voxels) for input into the different networks.
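The augmentation pipeline is described only at a high level, so the following is a minimal sketch of one of the named transforms, elastic deformation, implemented with NumPy/SciPy in the style of Simard et al.; the `alpha` and `sigma` values are illustrative assumptions, not the study's settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform_3d(volume, mask, alpha=300.0, sigma=12.0, rng=None):
    """Apply the same random elastic deformation to an LGE-MRI volume
    and its segmentation mask (smoothed random displacement field)."""
    rng = rng if rng is not None else np.random.default_rng()
    shape = volume.shape
    # One smooth random displacement field per axis, scaled by alpha.
    displacements = [
        gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
        for _ in range(3)
    ]
    grid = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    coords = [g + d for g, d in zip(grid, displacements)]
    warped_vol = map_coordinates(volume, coords, order=1, mode="reflect")
    # Nearest-neighbour interpolation keeps mask labels discrete.
    warped_mask = map_coordinates(mask, coords, order=0, mode="reflect")
    return warped_vol, warped_mask
```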
Shared 3D U-Net
U-Net consists of an encoder and a decoder connected by long skip connections. Like U-Net, U-Net++ has the same U-shaped encoder-decoder architecture, but its encoder and decoder are connected through a series of nested, dense convolutional blocks. To improve on U-Net, Double U-Net, which stacks two U-Net architectures on top of each other, was proposed. Inspired by Double U-Net, we propose Shared U-Net, which matches the performance of Double U-Net while decreasing the model size.
According to the rules of the convolution operation, a U-Net designed to segment images can also determine the positional characteristics of an image. Therefore, Double U-Net can complete both tasks (i.e., detection and segmentation). First, one U-Net detects the region of interest (RoI), from which the position coordinates are extracted and used to crop the RoI from the original image. Second, the cropped RoI is input into the second U-Net for LA cavity segmentation (Figure 1A). In contrast to Double U-Net, our proposed Shared U-Net achieves LA cavity segmentation with a single network reused in both stages (Figure 1B). As shown in Figure 1C, our method is also a two-stage approach, sketched in code below: 1) In the first stage, a feature map representing the LA is obtained with the shared 3D U-Net, a bounding box is located with a fully connected layer, and an RoI that decreases background predominance is cropped from the 3D LGE-MRI. 2) In the second stage, the same shared 3D U-Net precisely segments the LA cavity from the cropped RoI, and zero-padding reconstructs the final 3D LA at the same size as the original input.
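The paper does not publish reference code, so the following PyTorch sketch only illustrates the two-stage control flow under stated assumptions: `shared_unet` stands for the shared 3D U-Net, the fully connected box head and its output format are hypothetical, batch size 1 is assumed for clarity, and a trained head producing a valid crop compatible with the U-Net's downsampling is taken for granted:

```python
import torch
import torch.nn as nn

class SharedUNetTwoStage(nn.Module):
    """Two-stage pipeline reusing ONE 3D U-Net for both RoI detection
    and LA cavity segmentation (batch size 1 assumed for clarity)."""

    def __init__(self, shared_unet: nn.Module, out_channels: int):
        super().__init__()
        self.shared_unet = shared_unet
        # Fully connected head regressing a normalized box (z1,y1,x1,z2,y2,x2).
        self.box_head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(out_channels, 6), nn.Sigmoid(),
        )

    def forward(self, x):
        d, h, w = x.shape[2:]
        # Stage 1: shared U-Net feature map -> bounding box -> crop the RoI.
        feat = self.shared_unet(x)
        scale = torch.tensor([d, h, w, d, h, w], device=x.device)
        z1, y1, x1, z2, y2, x2 = (self.box_head(feat)[0] * scale).long().tolist()
        roi = x[:, :, z1:z2, y1:y2, x1:x2]  # assumes a valid box after training
        # Stage 2: the SAME U-Net segments the LA cavity inside the RoI.
        seg = self.shared_unet(roi)
        # Zero-pad the RoI prediction back to the original volume size.
        out = seg.new_zeros((x.shape[0], seg.shape[1], d, h, w))
        out[:, :, z1:z2, y1:y2, x1:x2] = seg
        return out
```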
The shared 3D U-Net used in this study follows the U-Net architecture, which can be divided into two main parts. The first part is the encoder, which extracts relevant image information. The second part is the decoder, which uses the extracted information to predict and reconstruct the segmentation of the image.
Our approach involves five encoding and five decoding blocks in the network architecture. Each encoding block is a residual convolution block; that is, the main path is two 3 × 3 × 3 convolution operations, and the branch path is a residual connection. The residual convolution block is used for image downsampling and feature extraction, and the added residual connection alleviates the vanishing gradients, exploding gradients, and overfitting caused by excessive network depth. As the depth of the network increases, the number of feature maps increases, as does the number of channels per feature map. Each decoding block consists of a 3 × 3 × 3 transposed convolution; a feature fusion module, which concatenates feature maps along the channel dimension; and a 3 × 3 × 3 convolution. A skip connection is added between each pair of corresponding encoding and decoding blocks in the U-shaped structure, so that low-level feature information is transmitted directly to the high level, thereby enabling better recovery of the original image by the decoder. To ensure nonlinearity and avoid the vanishing gradient problem, each convolution layer is followed by a rectified linear unit (ReLU) activation function.
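As a concrete illustration of the blocks just described, here is a minimal PyTorch sketch of one residual encoding block and one decoding block; the channel counts and the absence of normalization layers are assumptions rather than the study's exact configuration:

```python
import torch
import torch.nn as nn

class ResidualConvBlock3D(nn.Module):
    """Encoding block sketch: two 3x3x3 convolutions on the main path
    plus a residual branch, each convolution followed by ReLU."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)
        # 1x1x1 projection so the residual matches the channel count.
        self.shortcut = (nn.Conv3d(in_ch, out_ch, kernel_size=1)
                         if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + self.shortcut(x))

class DecoderBlock3D(nn.Module):
    """Decoding block sketch: 3x3x3 transposed convolution for upsampling,
    channel-wise concatenation with the skip feature, then a 3x3x3 conv."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=3,
                                     stride=2, padding=1, output_padding=1)
        self.fuse = nn.Conv3d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, skip):
        x = self.relu(self.up(x))
        x = torch.cat([x, skip], dim=1)  # feature fusion along channels
        return self.relu(self.fuse(x))
```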
Evaluation Metrics
To assess the precision of the various deep-learning models, we evaluated their predictions against the ground truths. The DICE score and Jaccard index were used to verify performance at the volumetric level:

$$\mathrm{DICE}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}, \qquad \mathrm{Jaccard}(P, G) = \frac{|P \cap G|}{|P \cup G|},$$

where $P$ is the 3D prediction, and $G$ is the corresponding 3D ground truth.
In addition, the Hausdorff distance (HD) was used to evaluate the performance of the different models. HD is defined as

$$\mathrm{HD}(A, B) = \max\bigl(h(A, B),\, h(B, A)\bigr),$$

where $h(A, B)$ is the directed HD, given by

$$h(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert,$$

where $\lVert a - b \rVert$ is the Euclidean distance.
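For reference, a minimal NumPy/SciPy sketch of these three metrics for binary 3D masks follows; distances are in voxel units (multiply by the 0.625 mm isotropic spacing for millimeters), and foreground voxel coordinates are used in place of an explicit surface extraction, which is a simplification:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_jaccard(pred: np.ndarray, gt: np.ndarray):
    """Volumetric DICE score and Jaccard index for binary 3D masks."""
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum())
    jaccard = inter / np.logical_or(pred, gt).sum()
    return dice, jaccard

def hausdorff_3d(pred: np.ndarray, gt: np.ndarray):
    """Symmetric HD = max of the two directed distances, computed here
    over all foreground voxel coordinates."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```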
Experimental Settings
Our experiments were based on PyTorch and were run on an NVIDIA RTX 3090 GPU. The Adam optimizer was used, with the initial learning rate set to 0.001, the number of epochs set to 200, and the batch size set to 64. In the training stage, we used five-fold cross-validation and early stopping, continuously monitoring the model's loss on the validation set. The learning rate and other hyperparameters were adjusted accordingly, and training was stopped when the validation loss reached a minimum, to prevent overfitting and ensure the model's generalizability. After training, we used the model weights that achieved the best DICE score on the validation set as the final model weights.
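A minimal PyTorch training-loop sketch consistent with these settings follows; the `patience` value is an illustrative assumption, device transfer is omitted, and checkpointing on validation loss is shown for brevity, whereas the study selected the weights with the best validation DICE score:

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader, loss_fn,
                              max_epochs=200, lr=1e-3, patience=20):
    """Adam training with early stopping on the validation loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_weights, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        # Monitor the validation loss; keep the best weights seen so far.
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        if val < best_val:
            best_val, stale = val, 0
            best_weights = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:  # early stop: no recent improvement
                break
    model.load_state_dict(best_weights)
    return model
```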
Results
Figure 2 shows the segmentation results of our proposed method and other 3D networks on the 2018 and 2013 image data sets. Compared with the 3D U-Net and 3D U-Net++ models, our model segmented a more complete region. Moreover, compared with Double 3D U-Net, our proposed model better handles surface details, thus making the result closer to the ground truth.
Our method achieved a DICE score of 0.918 and a Jaccard index of 0.848, the best performance among 2D U-Net, 3D U-Net, U-Net++, 3D V-Net, and Double 3D U-Net (Figure 3). Specifically, the DICE score increased from 0.916 for Double 3D U-Net to 0.918 for our Shared 3D U-Net, and the Jaccard index increased from 0.845 to 0.848.
Despite only a slight increase in segmentation performance, our model required fewer parameters, lower memory consumption, and shorter training times than the other methods. As the side length of the RoI increased, the memory consumption and training time of Double 3D U-Net grew dramatically, whereas those of our method remained lower and varied little (Figure 4). Similarly, our method had half as many parameters as Double 3D U-Net (88276 KB vs. 176620 KB).
The superior performance of the proposed model is attributable to our modifications to the traditional 2D U-Net (Table 1). In contrast to 2D U-Net with 3 × 3 convolutions, 3D U-Net with 3 × 3 × 3 convolutions considers 3D information and is suitable for segmenting 3D volume data, thus performing better in experiment one (Exp. 1). The 3D U-Net++ model builds on 3D U-Net by adding more skip connections that extract richer shallow features, thereby generating several nested "U-Nets." Compared with 3D U-Net, 3D U-Net++ has more parameters but only slightly affects the DICE score and Jaccard index (Exp. 2). Therefore, using 3D U-Net rather than 3D U-Net++, we adopted a two-stage strategy for 3D LA segmentation with two 3D U-Nets: one for extracting the RoI and the other for segmenting the LA. Compared with those of 3D U-Net, the DICE score and Jaccard index of Double 3D U-Net were significantly greater (by 1.3% and 2.3%, respectively; Exp. 3). However, Double 3D U-Net had twice as many parameters as 3D U-Net (176620 KB vs. 88276 KB). Therefore, we proposed a two-stage method with only one shared 3D U-Net to segment the 3D LA, thus maintaining performance while decreasing the number of parameters (88276 KB; Exp. 4).
| Experiments | Network | DICE | HD | Jaccard |
| --- | --- | --- | --- | --- |
| Exp. 1 (2D vs. 3D) | 2D U-Net | 0.890 | 14.051 | 0.802 |
| | 3D U-Net | **0.903** | **4.234** | **0.823** |
| Exp. 2 (U-Net vs. U-Net++) | 2D U-Net++ | 0.903 | 12.386 | 0.823 |
| | 3D U-Net++ | **0.908** | **3.639** | **0.832** |
| Exp. 3 (Single vs. Double) | 3D U-Net | 0.903 | 4.234 | 0.823 |
| | Double 3D U-Net | **0.916** | **1.300** | **0.845** |
| Exp. 4 (Double vs. Ours) | Double 3D U-Net | 0.916 | 1.300 | 0.845 |
| | Shared 3D U-Net | **0.918** | **1.211** | **0.848** |

Exp. 1: 2D U-Net vs. 3D U-Net; Exp. 2: 2D/3D U-Net vs. 2D/3D U-Net++; Exp. 3: Single 3D U-Net vs. Double 3D U-Net; Exp. 4: Double 3D U-Net vs. Shared 3D U-Net. Bold values indicate better results.
Further evaluation was conducted on the publicly available 2013 image data set. The best performance (according to the DICE score, Jaccard index, and HD) was again obtained with the proposed Shared 3D U-Net (Table 1). However, the DICE score, Jaccard index, and HD were worse than those on the publicly available 2018 image data set (Tables 1 and 2).
Discussion
AF, a prevalent type of sustained irregular heart rhythm, is caused by different substrates that are extensively dispersed in both atrial chambers [5]. AF also produces further structural changes, such as dilatation, fibrosis, and myofiber alterations [6]. Consequently, comprehensive research on the atrial anatomy and its transformations is essential to enhance the understanding and management of AF [7]. Recent studies support the use of LGE-MRI to visualize AF-associated structures, identify AF substrates, and predict AF ablation outcomes [8]. On the one hand, the extent and distribution of atrial fibrosis have been suggested to be reliable predictors of the catheter ablation success rate [9, 10]. On the other hand, the LA diameter and volume have been shown to provide reliable information for clinical diagnosis [11]. However, all the above structural analyses depend on 3D atrial geometry.
Segmenting the LA is a crucial step in quantitatively analyzing the structural characteristics of the atria. However, LA segmentation on LGE-MRI images is difficult owing to variations in intensity. Recently, deep-learning techniques have been proposed for automatically segmenting cardiac structures from 3D medical images [3, 8], and multiple studies have demonstrated that convolutional neural networks outperform conventional techniques [2, 12–25]. Hence, we presented a two-stage method using a shared 3D U-Net for LA segmentation. Several notable findings emerged from this study. First, the 3D U-Net architecture achieved better performance than the traditional 2D U-Net architecture. Second, double sequential U-Net architectures (e.g., Double 3D U-Net) achieved superior segmentation results to those of a single U-Net (e.g., 3D U-Net). Finally, the Double 3D U-Net architecture can be optimized with a shared 3D U-Net to decrease model complexity while maintaining good performance.
In the 2018 LA Segmentation Challenge, 17 teams provided their methods and performance results to the challenge organizer. As shown in Table 2, the top method, based on a Double 3D U-Net architecture, achieved a DICE score of 0.932 and a Jaccard index of 0.874. The Double 3D U-Net design uses the first 3D U-Net for automatic RoI detection and the second 3D U-Net for refined regional segmentation. By adopting a similar two-stage strategy, our method achieved superior results to those of the one-stage networks (Figure 3). In contrast to the Double 3D U-Net architecture reported by Xia et al., our study did not use pre-processing (e.g., down-sampling and contrast-limited adaptive histogram equalization) or residual connections between the stages, and the two 3D U-Nets were replaced by a single shared 3D U-Net. Under the same conditions, we observed similar performance between Double 3D U-Net and Shared 3D U-Net (Table 3), but Shared 3D U-Net had lower memory consumption, shorter training times, and fewer parameters than Double 3D U-Net. In the present study, we also evaluated the performance of Double 3D U-Net, but our results were not consistent with those obtained in the challenge. This inconsistency is attributable to factors including different running devices, pre-processing approaches, development frameworks, hyperparameter settings, and post-processing methods. Nevertheless, several strategies for improving model performance may be considered in future work: (1) pre-processing methods (e.g., image resizing to multiple scales, normalization, cropping, use of de-noising filters, and down-sampling); (2) post-processing methods (e.g., keeping the largest connected component, smoothing, and dilation-erosion); and (3) network components (e.g., dense connections, dilated convolutions, spatial pyramid pooling, attention units, pre-trained networks, and ensemble learning).
| Methods | Train | Test | Networks | DICE | Jaccard |
| --- | --- | --- | --- | --- | --- |
| Xia et al. [2] | 100 | 54 | Double 3D U-Net | 0.932 | 0.874 |
| Huang et al. [3] | 100 | 54 | Double 3D U-Net | 0.931 | 0.872 |
| Bian et al. [14] | 100 | 54 | Dilated 2D ResNet | 0.926 | 0.869 |
| Vesal et al. [24] | 100 | 54 | Dilated 3D U-Net | 0.925 | 0.861 |
| Yang et al. [25] | 100 | 54 | Double 3D U-Net | 0.925 | 0.860 |
| Li et al. [18] | 100 | 54 | Double 3D U-Net | 0.923 | 0.859 |
| Puybareau et al. [21] | 100 | 54 | 2D FCN with VGG-Net | 0.923 | 0.857 |
| Chen et al. [16] | 100 | 54 | Multi-task 2D U-Net | 0.921 | 0.854 |
| Xu et al. [3] | 100 | 54 | Ensemble 2D U-Net | 0.915 | 0.845 |
| Jia et al. [17] | 100 | 54 | Double Ensemble 2D U-Net | 0.907 | 0.832 |
| Liu et al. [19] | 100 | 54 | 2D U-Net | 0.903 | 0.825 |
| Borra et al. [15] | 100 | 54 | 3D U-Net | 0.898 | 0.817 |
| De Vente et al. [23] | 100 | 54 | 2D U-Net | 0.897 | 0.815 |
| Preetha et al. [20] | 100 | 54 | 2D U-Net | 0.887 | 0.799 |
| Qiao et al. [26] | 100 | 54 | Multi-atlas segmentation | 0.861 | 0.758 |
| Nuñez-Garcia et al. [27] | 100 | 54 | Multi-atlas segmentation | 0.859 | 0.758 |
| Savioli et al. [22] | 100 | 54 | 3D FCN | 0.851 | 0.744 |
| Liu et al. [28] | 80 | 20 | V-Net | 0.91 | 0.84 |
| Milletari et al. [13] | 80 | 20 | V-Net | 0.90 | 0.82 |
| Çiçek et al. [12] | 80 | 20 | 3D U-Net | 0.87 | 0.78 |
| Liu et al. [29] | 80 | 20 | UNSMLNet | 0.92 | 0.85 |
| Ours | 80 | 20 | Shared 3D U-Net | 0.918 | 0.848 |
Bold values indicate better results.
Training the shared 3D U-Net on more reliable data could potentially enhance its accuracy. In the present study, our model trained on the 2018 dataset achieved a DICE score of 0.918 but achieved a DICE score of only 0.851 when tested on the 2013 dataset. Therefore, its generalizability is insufficient, an important aspect that must be studied further in the future. Furthermore, we intend to use the shared 3D U-Net to segment both atrial chambers and fibrosis, given that AF is a bi-chamber disease. Our current focus is on developing a dataset that includes manual segmentations of both atrial chamber masks, which may then be used to train the shared 3D U-Net model.
Conclusions
We proposed an efficient algorithm for fully automatic 3D left atrial segmentation. In our method, a 3D U-Net is used to extract 3D features, and a two-stage strategy is used to decrease the segmentation error caused by the class imbalance problem. Our network architecture, with only one shared 3D U-Net, has relatively low complexity, low memory requirements, and a short training time. Our automatic method was highly reproducible and objective, producing a DICE score of 0.918 and a Jaccard index of 0.848, thus outperforming the six comparison methods. Before its application, its performance must be further evaluated on an edge computing platform, and its effectiveness must be assessed in clinical settings.