1. INTRODUCTION
Head and neck cancer (HNC) is the sixth most common type of cancer worldwide, with approximately 600,000 new cases each year, most of which are head and neck squamous cell carcinoma (HNSCC); its mortality rate reaches 40–50% [1]. The main risk factors for HNC include smoking, alcohol consumption, and human papillomavirus infection [2]. Treatments include surgery, radiotherapy, and chemotherapy. Information on the shape, size, and metabolism of head and neck tumors is provided primarily by positron emission tomography (PET), computed tomography (CT), and magnetic resonance imaging (MRI) [3]. Although treatment technology has progressed markedly and staging systems continue to be updated, the recurrence rate of HNC is as high as 40%, and the prognosis remains unsatisfactory [4]. Visual analysis allows for diagnosis, staging, and detection of recurrence, but it is subjective; quantitative analysis appears necessary for disease prognosis [5]. In the past decade, the conversion of standard medical images into analyzable high-dimensional data has been at the forefront of imaging research [6]. Traditional medical image analysis is subjective and qualitative, and the subjectivity of radiologists significantly affects the results. Radiomics and deep learning for quantitative analysis have attracted increasing attention because they avoid this subjectivity [7].
Information in medical images can be obtained both through visualization with the human eye and through quantitative analysis. In recent years, radiomics for quantitative analysis has become a new and rapidly developing research area in medical imaging. Radiomics based on machine learning uses advanced techniques to quantitatively extract information of diagnostic, treatment-response-predictive, and prognostic value from medical images such as PET, CT, and MRI. The information is analyzed, and the results can be applied to clinical decision-making for more accurate diagnosis and prognosis [8–11]. Conventional radiomics workflows include region of interest (ROI) segmentation, feature extraction and selection, model construction, and result analysis (Figure 1). Radiomics is based on manually constructed features, such as histogram, texture, and shape features, and on machine learning classifiers [12]. Although the application of radiomics to head and neck tumor prognosis has matured, traditional machine learning methods and the limited number of variables leave these models with some shortcomings [13].
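To make this workflow concrete, the sketch below assembles a minimal conventional radiomics pipeline with the open-source PyRadiomics and scikit-learn packages; the file paths, labels, and choice of SVM classifier are illustrative assumptions rather than any specific study's method.

```python
# Minimal sketch of the conventional radiomics workflow (hypothetical file
# paths and labels; assumes the pyradiomics and scikit-learn packages).
import numpy as np
from radiomics import featureextractor
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

extractor = featureextractor.RadiomicsFeatureExtractor()  # default settings

def extract_features(image_path, mask_path):
    """Extract hand-crafted features (histogram, shape, texture) for one ROI."""
    result = extractor.execute(image_path, mask_path)
    # Keep numeric feature values; drop the diagnostic metadata entries.
    return np.array([v for k, v in result.items()
                     if not k.startswith("diagnostics")], dtype=float)

# Hypothetical cohort: one image/mask pair and one outcome label per patient.
images = ["pt01_ct.nii.gz", "pt02_ct.nii.gz"]   # ...one entry per patient
masks = ["pt01_roi.nii.gz", "pt02_roi.nii.gz"]
labels = [0, 1]                                  # e.g., recurrence yes/no

X = np.vstack([extract_features(img, msk) for img, msk in zip(images, masks)])
y = np.array(labels)

# Feature selection followed by a machine learning classifier.
model = make_pipeline(StandardScaler(),
                      SelectKBest(f_classif, k=20),
                      SVC(kernel="linear", probability=True))
model.fit(X, y)
```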
Conventional radiomics uses manual or semi-automatic segmentation of tumor ROIs, extracts predefined features, and then applies them to machine learning classification models. In contrast, deep learning radiomics (DLR) applies deep learning in the ROI segmentation, feature extraction, and model building steps, thus providing the advantages of both deep learning and conventional radiomics. Deep convolutional neural network (DCNN) segmentation models, particularly the U-Net architecture, have led to tremendous progress in segmenting ROIs [14]. Deep learning typically extracts deep features from the last or penultimate convolutional layer of a neural network; these features complement those of radiomics in some aspects, and DLR therefore has potential in cancer prognosis [15]. To solve complex computational problems, deep neural networks are more applicable than traditional machine learning approaches, such as logistic regression, owing to their multilayer structure [16].
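As an illustration of this style of deep feature extraction, the hedged sketch below truncates a pre-trained CNN so that the penultimate-layer activations serve as deep features; the ResNet-18 backbone and recent torchvision weights API are assumptions for demonstration only.

```python
# Hedged sketch: a pre-trained CNN as a deep feature extractor, reading
# activations from the penultimate layer (assumes torch and a recent
# torchvision; ResNet-18 is an illustrative choice of backbone).
import torch
import torchvision.models as models

cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.eval()

# Drop the final fully connected layer; the truncated network ends in
# global average pooling, so each image yields a 512-dimensional feature.
feature_extractor = torch.nn.Sequential(*list(cnn.children())[:-1])

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)        # placeholder image batch
    deep_features = feature_extractor(batch)   # shape: (4, 512, 1, 1)
    deep_features = deep_features.flatten(1)   # shape: (4, 512)
```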
DLR has already been applied to other diseases. Shboul et al. used a feature-based random forest segmentation model and a convolutional neural network (CNN)-based featureless segmentation model to segment ROIs, followed by a "union" approach to combine the segmentation outputs; the featureless segmentation combined three network models: DCNN, U-Net, and a fully connected network [17]. Paul et al. used three pre-trained CNNs (Vgg-F, Vgg-M, and Vgg-S) to extract features, which were combined with traditional radiomics features after feature filtering and fed into a classifier to classify benign versus malignant lung nodules [18]. Devinder et al. extracted features directly with a five-layer autoencoder, output the feature vectors of the penultimate network layer, and used binary classifiers to distinguish benign from malignant lung nodules [19]. Similarly to Paul et al., Lao et al. used a deep learning model to extract deep features from the penultimate layers and combined them with other radiomics features to predict survival in patients with glioblastoma multiforme [20]. In contrast to the approach of Lao et al., Wang et al. used a CNN directly for feature extraction and applied a conventional classifier to predict the mutation grading of low-grade glioma [21]. Some studies have used DLR methods to predict the prognosis of HNC. This article systematically reviews the application of deep learning combined with conventional radiomics in the prediction of HNC prognosis.
2. MATERIALS AND METHODS
2.1 Eligibility criteria
2.1.1 Inclusion criteria
The inclusion criteria were as follows: (1) observational studies (e.g., prospective cohort studies, retrospective cohort studies, or case-control studies) or clinical trials (e.g., randomized controlled trials); (2) studies in populations of patients diagnosed with HNC; (3) PET-, CT-, or PET/CT-based patient prognosis studies; (4) studies with methods combining traditional radiomics and deep learning; and (5) studies with complete data available.
2.1.2 Exclusion criteria
The following exclusion criteria were applied: (1) non-human studies; (2) studies lacking sufficient data; (3) studies that did not evaluate deep learning combined with conventional radiomics; (4) case reports, reviews, experimental studies, short communications, personal opinions, letters to the editor, or conference abstracts; and (5) non-English language studies.
2.2 Focus
This article summarizes and discusses the application of DLR in the prediction of HNC prognosis, and compares the performance of DLR with that of conventional radiomics alone.
2.3 Search strategy
Electronic databases (PubMed, Embase, Scopus, Web of Science, and Cochrane) were searched with the following terms: (head and neck neoplasms OR head and neck cancer) AND (machine learning OR deep learning OR artificial intelligence OR radiomics OR DLR). The search period was from 2012 through 2022.
2.4 Study selection and quality assessment
Of the 796 studies evaluated, a total of 788 were excluded, and eight were included (Figure 2). Two investigators (Bingzhen Wang and Yifan Yang) screened the studies.
The radiomics quality score (RQS), a radiomics-specific quality assessment tool, was used to assess the methodological quality of the included studies [22]. Sixteen items were assessed, including imaging protocol, feature extraction, data modeling, model validation, and data sharing. The summed total score ranged from −8 to 36 and was converted into a final 0–100 percentage score [23]. The RQS scoring table is shown in Table 1. Two researchers evaluated the articles and reached a consensus.
Item | Criteria | Points |
---|---|---|
Item 1 | Image protocol quality: well documented image protocols (e.g., contrast, slice thickness, energy, etc.) and/or use of public image protocols enable reproducibility/replicability. | +1 (if protocols are well documented) +1 (if public protocol is used) |
Item 2 | Multiple segmentations: possible actions include segmentation by different physicians/algorithms/software, perturbing segmentations by (random) noise, and segmentation at different breathing cycles. Analysis of feature robustness to segmentation variability. | +1 |
Item 3 | Phantom study on all scanners: detection of inter-scanner differences and vendor-dependent features. Analysis of feature robustness to these sources of variability. | +1 |
Item 4 | Imaging at multiple time points: collection of individuals’ images at additional time points. Analysis of feature robustness to temporal variability (e.g., organ movement, organ expansion/shrinkage). | +1 |
Item 5 | Feature reduction or adjustment for multiple testing decreases the risk of overfitting. Overfitting is inevitable if the number of features exceeds the number of samples. Consideration of feature robustness when selecting features. | -3 (if neither measure is implemented) +3 (if either measure is implemented) |
Item 6 | Multivariable analysis with non-radiomics features (e.g., EGFR mutation) is expected to provide a more holistic model permitting correlations/inferences between radiomics and non-radiomics features. | +1 |
Item 7 | Detecting and discussing biological correlates: demonstration of phenotypic differences (possibly associated with underlying gene–protein expression patterns) deepens understanding of radiomics and biology. | +1 |
Item 8 | Cut-off analyses: determination of risk groups by either the median or a previously published cut-off, or report a continuous risk variable. Decreases the risk of reporting overly optimistic results. | +1 |
Item 9 | Discrimination statistics: reporting of discrimination statistics (e.g., C-statistic, ROC curve, or AUC) and their statistical significance (e.g., p-values or confidence intervals). A resampling method (e.g., bootstrapping or cross-validation) can also be applied. | +1 (if a discrimination statistic and its statistical significance are reported) +1 (if a resampling method technique is also applied) |
Item 10 | Calibration statistics: reporting of calibration statistics (e.g., calibration-in-the-large/slope or calibration plots) and their statistical significance (e.g., p-values or confidence intervals). A resampling method (e.g., bootstrapping or cross-validation) can also be applied. | +1 (if a calibration statistic and its statistical significance are reported) +1 (if a resampling method technique is also applied) |
Item 11 | A prospective study registered in a trial database provides the highest level of evidence supporting the clinical validity and usefulness of the radiomics biomarker. | +7 (for prospective validation of a radiomics signature in an appropriate trial) |
Item 12 | Validation performed without retraining and adaptation of the cut-off value, thus providing crucial information concerning credible clinical performance. | -5 (if validation is missing) +2 (if validation is based on a dataset from the same institute) +3 (if validation is based on a dataset from another institute) +4 (if validation is based on two datasets from two distinct institutes) +4 (if the study validates a previously published signature) +5 (if validation is based on three or more datasets from distinct institutes) *Datasets should be of comparable size and should have at least ten events per model feature. |
Item 13 | Comparison to “gold standard”: assessment of the extent to which the model agrees with/is superior to the current “gold standard” method (e.g., TNM staging for survival prediction). This comparison indicates the added value of radiomics. | +2 |
Item 14 | Potential clinical utility: reporting on the current and potential application of the model in a clinical setting (e.g., decision curve analysis). | +2 |
Item 15 | Cost-effectiveness analysis: reporting on the cost-effectiveness of the clinical application (e.g., quality adjusted life years generated). | +1 |
Item 16 | Open science and data: code and data made publicly available. Open science facilitates knowledge transfer and reproducibility of the study. | +1 (if scans are open source) +1 (if region of interest segmentations are open source) +1 (if code is open source) +1 (if radiomics features are calculated on a set of representative ROIs and the calculated features + representative ROIs are open source) |
Total points | | 36 (= 100%) |
3. RESULTS
3.1 Study selection and characteristics
All eight included articles were retrospective studies: four were published in 2022, one each in 2021 and 2020, and two in 2019. Data from the articles were extracted and analyzed. In total, 2339 patients were included in the studies. The included studies reached a median total RQS of 13 (range, 10–15), corresponding to 36.11% (27.78%–41.67%). The individual literature scores are shown in Table 2. All included articles used deep learning combined with conventional radiomics, and four studies detailed the toolkits used for feature extraction. Details on the toolkits, deep learning networks, and conventional radiomics methods used in the articles are shown in Table 3. The numbers of features extracted, feature selection methods, and final results are shown in Table 4.
Author (year) | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 | Item 9 | Item 10 | Item 11 | Item 12 | Item 13 | Item 14 | Item 15 | Item 16 | Total score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Chen et al. (2019) | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 12 (33.33%)
Bizzego et al. (2019) | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 10 (27.78%)
Salmanpour et al. (2022) | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 1 | 2 | 0 | 0 | 3 | 0 | 2 | 0 | 2 | 15 (41.67%)
Mehdi et al. (2022) | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 12 (33.33%)
Peng et al. (2022) | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 2 | 0 | 2 | 14 (38.89%)
Tang et al. (2021) | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 2 | 0 | 2 | 0 | 1 | 11 (30.56%)
Zhou et al. (2020) | 1 | 0 | 0 | 0 | 3 | 1 | 0 | 1 | 2 | 1 | 0 | 2 | 0 | 2 | 0 | 1 | 15 (41.67%)
Bourigault et al. (2022) | 1 | 0 | 0 | 0 | 3 | 1 | 0 | 1 | 2 | 1 | 0 | 3 | 0 | 0 | 0 | 2 | 14 (38.89%)
Author (year) | Number of patients | Segmentation method | Whether to use deep learning | Package | Deep learning networks | Conventional radiomics methods |
---|---|---|---|---|---|---|
Chen et al. (2019) | 59 | Manual | Yes | NR | 3D-CNN (12 convolutional layers, 2 max-pooling layers, and 2 fully connected layers); each convolutional layer is equipped with ReLU activation and batch normalization | SVM |
Bizzego et al. (2019) | 194 | Manual | Yes | NR | 3D multimodal CNN | LSVM |
Salmanpour et al. (2022) | 325 | Auto | Yes | SERA package for radiomics features | 3D U-Net | CSF; CoxPH; FSSVM; CoxBoost; GlmNet; GlmBoost; GBM; RSF
Mehdi et al. (2022) | 325 | Auto | Yes | SERA package for radiomics features | 3D U-Net; SegResNet | CoxPH; FSSVM; CoxBoost; GlmBoost; GBM
Peng et al. (2022) | 707 | Manual | Yes | Python Keras package with the TensorFlow library | DCNNs (12 or 8 weighted layers) | Cox regression |
Tang et al. (2021) | 188 | Manual | Yes | NR | DL-ANN (3 hidden layers, 107 inputs, and 1 binary output) | 3D slicer (v. 4.10.2) with PyRadiomics |
Zhou et al. (2020) | 188 | Manual | Yes | NR | DL with stacked sparse autoencoder | SVM; DT; KNN |
Bourigault et al. (2022) | 353 | Auto | Yes | PyRadiomics package for radiomics features | 3D NormResSE-UNet3+ (an encoder-decoder architecture with full inter- and intra-skip connections); 3D U-Net | CoxPH regression model |
CNN: convolutional neural network; DL: deep learning; SVM: support vector machine; LSVM: linear support vector machine; CSF: conditional survival forest; CoxPH: Cox proportional hazards; FSSVM: fast survival SVM; CoxBoost: CoxPH model by likelihood-based boosting; GlmNet: LASSO and Elastic-Net regularized generalized linear models; GlmBoost: gradient boosting with component-wise linear models; GBM: gradient boosting machines; RSF: random survival forest; DT: decision tree; KNN: K-nearest neighbor; NR: not reported
Author (year) | Number of radiomics features | Number of deep learning features | Feature selection | Evaluation metrics |
---|---|---|---|---|
Chen et al. (2019) | 257 features extracted from PET and CT images | NR | NR | ACC=0.88 |
Bizzego et al. (2019) | 261 | 239 | Removal of correlated features; UA; RFE | ACC=0.96 |
Salmanpour et al. (2022) | 215 | NR | CFS; FSASL; ILFS; LS; Lasso; LLCFS; MRMR; ReliefA; UDFS; UFSOL; CindexFS; MI; VH; VH.VIMP; MD | Dice score=0.81 C-index=0.75
Mehdi et al. (2022) | 215 | NR | NR | Dice score=0.76 C-index=0.73 |
Peng et al. (2022) | 4 | 14 | Reproducibility measurement; UA; feature grouping; LASSO | C-index=0.72 |
Tang et al. (2021) | 107 | NR | NR | AUC=0.96 ACC=0.72 |
Zhou et al. (2020) | 257 features extracted from PET and CT images | NR | NR | AUC=0.84 ACC=0.83 |
Bourigault et al. (2022) | 14 | 49 | LASSO regression with 5-fold cross-validation | Average DSC=0.75 C-index=0.82 |
UA: univariate analysis; RFE: recursive feature elimination; CFS: correlation-based feature selection; FSASL: feature selection with adaptive structure learning; ILFS: infinite latent feature selection; LS: Laplacian score; LLCFS: local learning based clustering feature selection; MRMR: minimum redundancy maximum relevance; ReliefA: relief algorithm; UDFS: unsupervised discriminative feature selection; UFSOL: unsupervised feature selection with ordinal locality; CindexFS: feature selection based on C-index; C-index: concordance index; MI: mutual information; VH: variable hunting; VH.VIMP: variable hunting variable importance; MD: minimal depth; ACC: accuracy; AUC: area under the curve; DSC: Dice similarity coefficient; NR: not reported.
3.2 Main findings
Traditional radiomics tended to segment ROIs manually or semi-automatically, extract predefined features, and apply machine learning models for prediction. Radiomics combined with deep learning enabled automatic segmentation of ROIs, extraction of deep features, and construction of deep network models. Of the eight included articles, three used deep learning for automatic segmentation of ROIs [24–26], five incorporated deep learning in feature extraction [26–29] or feature fusion [30], and two used deep learning for model building [27, 31].
3.2.1 Deep learning for ROI segmentation
In ROI segmentation, automatic segmentation with deep learning eliminated the subjective influence of manual segmentation and enabled more accurate delineation through the use of both low- and high-level details, thus supporting extraction of more accurate and comprehensive features and improving model predictive performance [24–26]. Salmanpour et al. used 3D U-Net and 3D U-NETR to automatically segment images of HNC; five fusion techniques, seven dimensionality reduction algorithms, and five survival prediction algorithms were then used to predict patient survival. The Laplacian pyramid + sparse representation fusion technique with the 3D U-Net model achieved the highest segmentation accuracy, with a Dice score of 0.81 on the validation set, because 3D U-Net had more layers with a similar number of parameters. On the basis of the optimal fusion and segmentation, 215 radiomics features were extracted from PET and CT images with the Standardized Environment for Radiomics Analysis (SERA) package, and the Gradient Boosting with Component-wise Linear Models (GlmBoost) algorithm was used to predict survival. Finally, the Ensemble Voting technique was used to predict the survival rate, with a C-index of 0.75 on the validation set [24]. Mehdi et al. also used multiple fusion techniques to combine PET/CT images and applied deep learning for automatic ROI segmentation. The Laplacian pyramid fusion technique combined with 3D U-Net had the best segmentation performance, with a Dice score of 0.76 on the validation set. A total of 215 radiomics features were extracted with the SERA radiomics package and input into multiple hybrid machine learning systems containing 13 dimensionality reduction algorithms, 15 feature selection algorithms, and 8 survival prediction algorithms; the final predictions, obtained with the Ensemble Voting technique, reached a C-index of 0.73 on the validation set [25]. Bourigault et al. used an encoder-decoder combining the 3D normalised squeeze-and-excitation residual blocks proposed by Iantsen with the full-scale-connected UNet3+ structure proposed by Huang [32, 33]. The network was capable of downward and upward sampling; focused on the image ROI as well as other relevant regions; combined local and global full-scale information; and leveraged low-level and high-level semantics, thus improving segmentation accuracy. An average Dice similarity coefficient (DSC) of 0.75 was obtained in cross-validation [26]. Tumor segmentation with the deep learning approach thus achieved highly accurate results and laid a good foundation for subsequent steps.
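The Dice score used throughout these studies quantifies voxel-wise overlap between a predicted and a reference segmentation; a direct NumPy implementation for binary masks is sketched below.

```python
# Dice similarity coefficient for binary masks: DSC = 2|A ∩ B| / (|A| + |B|).
import numpy as np

def dice_score(pred, truth, eps=1e-7):
    """Overlap between predicted and reference segmentations of equal shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum() + eps)
```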
3.2.2 Deep learning radiomics for feature extraction and feature fusion
In feature extraction and feature fusion, DLR automatically learned features from images, which differed from the features manually extracted through radiomics. Because these two classes of features capture complementary information, deep learning features and radiomics features can be combined to produce more stable prediction results [26–29].
Chen et al. input PET and CT images into a multi-objective radiomics (MaO-radiomics) model with support vector machine (SVM) classifiers and into a 3D-CNN model consisting of convolutional layers, max-pooling layers, and fully connected layers, which fully used contextual information, to extract features and make predictions, respectively. The outputs were then combined with an evidential reasoning (ER) method. The hybrid model combining the DCNN and MaO-radiomics models had an accuracy (ACC) of 0.88 for classifying lymph nodes (LNs) as normal, suspicious, or involved, thus outperforming the reported ACC values of 0.75 for conventional radiomics alone and 0.82 for MaO-radiomics alone [27].
Bizzego et al. observed that mixing deep learning and hand-crafted radiomics (HCR) features yielded more accurate predictions than models using only one feature type or imaging modality. Their study used two identical parallel CNNs trained on CT and PET simultaneously (including BatchNorm, convolutional, Dropout, and MaxPool3d layers, and an AdaptiveAvgPool3d layer) to extract the deep learning features output by the final AdaptiveAvgPool3d layer, and a predefined feature extractor to extract radiomics features. Subsequently, 239 deep learning features and 261 radiomics features were concatenated and combined in a unified classification pipeline with a linear SVM for prognosis of local recurrence of HNC. The ACC of HCR + DLR was 0.96 on the test set, higher than the ACC obtained with only manual or only deep learning features (0.87 and 0.79, respectively) [28].
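The feature-level fusion step described above reduces, in essence, to concatenating the two feature matrices before classification. The sketch below illustrates this with placeholder arrays sized to the 239 deep and 261 handcrafted features reported by Bizzego et al.; the data and cross-validation setup are illustrative assumptions, not the original pipeline.

```python
# Feature-level fusion sketch: deep and hand-crafted features are simply
# concatenated before a linear SVM (placeholder arrays sized to the
# 239 + 261 features reported above; data and CV setup are illustrative).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
deep_feats = rng.random((194, 239))   # placeholder deep learning features
hcr_feats = rng.random((194, 261))    # placeholder hand-crafted features
y = rng.integers(0, 2, 194)           # placeholder recurrence labels

X_fused = np.hstack([deep_feats, hcr_feats])   # simple concatenation

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
print(cross_val_score(clf, X_fused, y, cv=5).mean())
```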
Peng et al. used ITK-SNAP to manually segment PET and CT images. Four DCNNs (12 or 8 weighted layers) were constructed to extract deep learning features. The DCNN for CT images had three convolutional layer groups, followed by a dropout layer and a fully connected layer with a softmax classifier; each convolutional layer group consisted of four convolutional layers and one max-pooling layer, and the networks for PET images had only two convolutional groups. Deep learning features were output by the last convolutional layer. Finally, 14 deep learning features and four radiomics features were identified and linearly combined, weighted by their coefficients. The optimal combination of prognostic features was screened with least absolute shrinkage and selection operator (LASSO) Cox regression, and the features were input into a Cox proportional hazards model to construct a radiomics nomogram for predicting disease-free survival in HNC. A C-index of 0.722 was obtained in the test set, a significant improvement over the C-index values of 0.634 and 0.655 obtained in the team's previous CT- and PET-based radiomics studies using manual features [34–37]. The combination of manual features and deep features had the best discriminatory ability [29].
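LASSO-penalized Cox screening of the kind Peng et al. describe can be sketched with the lifelines package as below; the synthetic data, penalty strength, and selection threshold are illustrative assumptions rather than the original implementation.

```python
# LASSO-penalized Cox regression sketch (assumes the lifelines package;
# synthetic data, penalty strength, and threshold are illustrative).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 10)),
                  columns=[f"feat_{i}" for i in range(10)])
df["time"] = rng.exponential(24, size=200)    # follow-up time, e.g. months
df["event"] = rng.integers(0, 2, size=200)    # 1 = progression/death observed

# l1_ratio=1.0 gives a pure L1 (LASSO) penalty that shrinks uninformative
# coefficients toward zero; the surviving features form the signature.
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(df, duration_col="time", event_col="event")

print(cph.concordance_index_)                  # apparent C-index
selected = cph.params_[cph.params_.abs() > 1e-6].index.tolist()
```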
Bourigault et al. used automatic ROI segmentation, a 3D U-Net for deep feature extraction from the fifth convolutional layer, the PyRadiomics package for extracting radiomics features, and 5-fold cross-validated LASSO regression for feature selection. They ultimately identified 49 deep learning features and 14 radiomics features, which were combined with 7 clinical features in a Cox proportional hazards regression. The model was used to predict progression-free survival and achieved a C-index of 0.82 on the validation set, a significant improvement over the C-index of 0.72 obtained with radiomics features alone; however, the C-index of 0.62 on the external challenge test set indicated overfitting, which could be mitigated by adding regularization to the model [26].
In feature fusion, different imaging modalities measured different characteristics: for example, PET scans measured glucose metabolism, whereas CT scans provided attenuation coefficient information. When features came from different, complementary sources, feature fusion was required. Instead of simply stitching multimodal features into one long vector, as in conventional radiomics, a stacked autoencoder deep model was used to combine features from different modalities into one feature set, to improve predictive performance. Zhou et al. extracted radiomics features from manually segmented ROIs; used a stacked sparse autoencoder to fuse 257 manual features obtained from PET and CT images, rather than simply concatenating features extracted from different modalities; and input the fused features into a MaO-radiomics model with multiple base classifiers, including SVM, decision tree (DT), and K-nearest neighbor (KNN), for prognostic prediction of HNC with distant metastasis. Evidential reasoning was used to combine the outputs of multiple models at the decision level, yielding an area under the curve (AUC) of 0.84, a significant improvement over feature fusion without stacked autoencoders. The improvement was attributed to the stacked autoencoder extracting more discriminative features and discovering joint information among the different modalities [30].
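A minimal PyTorch sketch of such a stacked sparse autoencoder is given below, assuming 257 concatenated PET+CT handcrafted features as input; the layer widths, sparsity weight, and training loop are illustrative assumptions, and the bottleneck code would be passed to downstream classifiers such as SVM, DT, or KNN.

```python
# Stacked sparse autoencoder sketch for multimodal feature fusion (PyTorch;
# 257 concatenated PET+CT features as input; widths and sparsity weight are
# illustrative). The bottleneck code serves as the fused feature set.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, in_dim=257, hidden=128, code=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, code), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(code, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = SparseAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
x = torch.randn(32, 257)          # placeholder PET+CT feature batch

for _ in range(100):              # unsupervised reconstruction training
    recon, code = model(x)
    # L1 penalty on the code approximates the sparsity constraint.
    loss = mse(recon, x) + 1e-4 * code.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

fused = model.encoder(x).detach()  # fed to SVM/DT/KNN classifiers downstream
```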
3.2.3 Deep learning for model building
In model building, deep learning combined with radiomics had higher predictive performance than radiomics alone [27, 31]. Chen et al. used the ER method to combine the previously introduced deep learning-based 3D-CNN and MaO-radiomics models in a simple, easily implemented process. The input images were normalized to accelerate convergence, and the synthetic minority over-sampling technique was used to balance and expand the minority-class samples to improve model efficiency. The final ACC was 0.88, an improvement over radiomics alone (ACC of 0.75) and MaO-radiomics alone (ACC of 0.82) [27]. Tang et al. used 3D Slicer with the PyRadiomics extension to extract 107 radiomics features from CT images, input them into an artificial neural network with three hidden layers, and used the binary output to predict mortality and cancer recurrence. The imaging data were used as both the training and validation sets. The AUC for cancer recurrence with gross tumor volume was 0.956, and the ACC was 0.724 [31]. Artificial neural networks outperformed other methods for most data predictions and performed well with higher data complexity [38].
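An artificial neural network of the kind Tang et al. describe (107 radiomics inputs, three hidden layers, one binary output) can be sketched with scikit-learn as below; the hidden-layer widths and placeholder data are assumptions for illustration only.

```python
# Sketch of an ANN in the spirit of Tang et al.'s DL-ANN: 107 radiomics
# inputs, three hidden layers, one binary output (hidden widths and data
# are illustrative assumptions).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((188, 107))        # placeholder radiomics feature matrix
y = rng.integers(0, 2, 188)       # placeholder recurrence labels

ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32, 16), activation="relu",
                  max_iter=1000, random_state=0))
ann.fit(X, y)
recurrence_prob = ann.predict_proba(X)[:, 1]
```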
The performance data from the included studies involving DLR, compared with radiomics, are shown in Table 5.
Author (year) | Methods or features | Evaluation metrics |
---|---|---|
Chen et al. (2019) | Radiomics | ACC=0.75
| MaO-radiomics | ACC=0.82
| DCNN + MaO-radiomics | ACC=0.88
Bizzego et al. (2019) | HCR | ACC=0.87
| DLR | ACC=0.79
| HCR + DLR | ACC=0.96
Peng et al. (2022) | CT-HCR | C-index=0.63
| PET-HCR | C-index=0.65
| HCR + DLR | C-index=0.72
Bourigault et al. (2022) | Clinical + PET radiomics | C-index=0.67
| Clinical + CT radiomics | C-index=0.68
| Clinical + PET/CT radiomics | C-index=0.72
| Clinical + CT radiomics + deep learning features | C-index=0.82
4. DISCUSSION
In this systematic review, we evaluated the literature on the application of radiomics combined with deep learning to the prognosis of HNC. In recent years, the application of DLR to the quantitative analysis of medical images has increased significantly [39–41]. Radiomics efforts have focused on manually extracting features, such as texture and histogram features; feature filtering is used to select the optimal feature set, and the features are then fed into machine learning classifiers (e.g., SVM or DT) [7]. Several radiomics studies on HNC have been reported. For example, Wang et al. [42] have used radiomics combined with a machine learning model to predict the T stage of locally advanced laryngeal cancer. Ren et al. [43] have extracted MRI-based radiomics features to distinguish stage III–IV from stage I–II HNSCC. Yuan et al. [44] have found that MRI-based radiomics features are independent prognostic factors in patients with HNSCC. Other studies have combined clinicopathological features with radiomics features to predict overall survival and disease-free survival [45, 46], or have linked radiomics findings to molecular features of HNC [47–49]. However, manually delineating ROIs is time consuming and shows high inter-rater variability. Deep learning can solve this problem through automatic ROI segmentation, and it has also shown great potential in extracting features and in using fully connected layers for classification and prediction. The three articles on automatic segmentation reviewed in this study all used 3D U-Net or U-Net variants. Because the U-Net architecture is relatively simple, combines downsampling and upsampling paths, and aggregates features and spatial information for accurate localization, it is commonly used for segmentation of medical images. For example, Lin et al. [50] have used it for automatic segmentation of cervical cancer, and Moe et al. [51] and Ren et al. [52] have used it for delineation of gross tumor volume.
HCR features are calculated with mathematical formulas based on the pixel values in the ROI [6, 53], whereas deep learning features are obtained automatically from the convolutional layers of a CNN through convolutional kernels sliding over the image [54]. The features learned in the shallow layers are similar to the histogram, shape, and texture features of hand-crafted radiomics; the deeper the layer, the more abstract the extracted features [55]. Because the features in a neural network are generated gradually during model learning, they require no human intervention, and their distribution is consequently more objective, thereby effectively supplementing the manually defined radiomics features [56]. Thus, combining hand-crafted and deep features improves the classification or regression performance of the model [57]. Research on lung cancer by Wang [58], Afshar [59], Astaraki [60], and Liang [61]; on breast cancer by Jiang [62]; on gastric cancer by Sun [63] and Dong [15]; and on glioma by Chen [64] has indicated that deep learning features are complementary to manual features, and that combining HCR and DLR provides more comprehensive features and enables models to achieve better results. However, because deep learning features are learned automatically in a black-box-like process, they are generally nameless [65].
DLR and HCR can be combined through two approaches: decision-level and feature-level fusion [56]. In decision-level fusion, separate classifiers are trained with DLR and HCR features, and the results are then combined with ER to obtain the final classification; Chen et al. [27] have combined CNN and radiomics results in this way. This step can also be performed by voting, such as soft voting. In feature-level fusion, DLR and HCR features are concatenated into one feature vector, and the classifier is then trained, as reported by Bizzego et al. [28].
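The two strategies can be contrasted in a few lines of code. The sketch below implements decision-level fusion by averaging class probabilities (soft voting) from two classifiers trained on separate feature sets; it is a simplified stand-in for evidential reasoning, with placeholder data.

```python
# Decision-level fusion sketch: two classifiers trained on separate feature
# sets, combined by soft voting (averaged probabilities) as a simplified
# stand-in for evidential reasoning; data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
deep_feats = rng.random((100, 64))   # placeholder deep features
hcr_feats = rng.random((100, 40))    # placeholder radiomics features
y = rng.integers(0, 2, 100)

clf_dl = LogisticRegression(max_iter=1000).fit(deep_feats, y)
clf_hcr = SVC(probability=True).fit(hcr_feats, y)

# Average the two probability outputs at the decision level.
p_fused = 0.5 * (clf_dl.predict_proba(deep_feats)[:, 1]
                 + clf_hcr.predict_proba(hcr_feats)[:, 1])
y_pred = (p_fused >= 0.5).astype(int)
```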
Important components of deep learning models include convolutional and fully connected layers, which automatically learn features and can therefore be used to extract deep features from data. In the four articles extracting deep learning features included in this study, the networks contained between 5 and 12 convolutional layers, to avoid overfitting due to overly complex parameters. Bourigault et al. used the output of the fifth convolutional layer of the 3D U-Net as deep features, Bizzego et al. used the output of the AdaptiveAvgPool3d layer after the last convolutional layer, and the other studies used the output of the last convolutional layer. Specific information on the networks is presented in the Results section. Deep learning networks as feature extractors have also been applied in other cancer studies, for example by Lao et al. [20] for survival prediction in patients with glioblastoma multiforme, Wang et al. [66] for coronavirus disease 2019 pneumonia, and Wang et al. [67] for hilar cholangiocarcinoma.
CNNs, a class of deep learning models, can learn deeper features from images and ultimately map an input image to the desired output for prediction. In the two studies reviewed herein that combined radiomics with deep learning models, Chen et al. and Tang et al. made predictions with deep networks, and Chen et al. additionally combined the CNN and radiomics model results at the decision level. Other studies have used deep learning models in HNC, such as Lombardo et al. [68] on distant metastasis, Kann et al. [69] on nodal metastasis and extranodal extension, and Naser et al. on progression-free survival prediction; moreover, Kim et al. [70] have built a deep learning-based survival prediction algorithm named DeepSurv for patients with oral cancer. Notably, large amounts of data are needed to train deep learning networks to achieve good performance.
Most articles included in our study were based on PET/CT imaging data, wherein images from different modalities contained different tumor information. Most current radiomics studies have combined the features of multiple modalities in only a simple linear way and thus have not fully benefited from multimodality. Integrating the features of multiple modalities organically, for example through stacked sparse auto-encoding, can increase model accuracy [71, 72]. ER methods have been used to combine the predictions of multiple classifiers, or the results of radiomics and deep learning, to improve prediction [30]. Overfitting occurs when the number of extracted features exceeds the number of samples; it can be avoided by adding regularization or by using a feature filtering algorithm to obtain the optimal feature set. Multiple algorithms can be compared in pre-experiments, rather than relying on a single algorithm, and the best-performing algorithm can then be applied to filter the features.
In conclusion, combined DLR improves HNC prognosis prediction beyond that achievable with conventional radiomics. However, deep learning also has limitations, and some problems remain to be solved. First, deep learning networks contain many parameters, and massive numbers of high-quality samples are needed to train the networks and avoid overfitting. Small sample size was a common problem and the main limitation of model performance. When the training samples are insufficient, transfer learning with pre-trained networks or data augmentation techniques, such as random addition of noise or rotation or flipping of images, can be used to expand the sample data and further improve model performance. Bourigault et al. randomly flipped the tumor volume left/right, superior/inferior, and anterior/posterior to augment their data [26]. Chen et al. applied data augmentation to their 59 patients by adding samples from the suspicious and normal categories to balance the data [27]. Bizzego et al. pre-trained the architecture on a T-stage task and then adjusted the images with minimal rotation, translation, and Gaussian noise [28]. Moreover, Tang et al. used all data as both the training and validation cohorts: one subset was left out as the validation set while the remainder was used for training, and the process was repeated until all data had served for validation; this method is also effective for training and validation with small sample sizes [31]. Unsupervised learning, for example with deep autoencoders or restricted Boltzmann machines, does not require labels; Chang et al. proposed a multi-scale convolutional sparse coding method to provide an unsupervised solution [73]. In the future, semi-supervised learning could also be applied, self-training on a small amount of labeled data and then learning from unlabeled data, to address the scarcity of labeled samples [74, 75]. Second, the features extracted by radiomics and by deep learning came from the same imaging modality, and redundancy in the feature space negatively affected model performance. Chen et al. [27] and Zhou et al. [30] used ER to combine the probabilistic outputs of multiple models at the decision level, thereby decreasing the influence of feature-space redundancy on model performance. Third, owing to their different principles, radiomics and deep learning have different advantages for specific tasks. Deep neural networks focus on the whole image and are preferred for image fusion, feature extraction, and tumor segmentation tasks [76]. Although the two studies included herein that built predictive models with deep neural networks performed well, training deep learning models requires large amounts of data; when the amount of image data is small, deep neural networks cannot fully outperform traditional machine learning classifiers in model building [77]. Nevertheless, deep neural networks retain great potential in prediction tasks, and with the rapid development of deep learning technology, their use for building prediction models must be explored in future research. Finally, the stability of deep learning features must also be studied. Compared with conventional radiomics, black-box deep learning methods are less interpretable, and the features learned by deep learning are difficult to interpret and conceptualize.
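For small cohorts, the augmentation operations mentioned above (random flips along the three anatomical axes plus Gaussian noise) are straightforward to implement; the NumPy sketch below uses illustrative parameters and a placeholder volume.

```python
# Augmentation sketch for small 3D datasets: random flips along the three
# anatomical axes plus Gaussian noise (NumPy only; parameters and the
# placeholder volume are illustrative).
import numpy as np

rng = np.random.default_rng(42)

def augment(volume, noise_sigma=0.01):
    """Randomly flip a 3D tumor volume on each axis and add mild noise."""
    for axis in range(3):          # left/right, superior/inferior, ant/post
        if rng.random() < 0.5:
            volume = np.flip(volume, axis=axis)
    return volume + rng.normal(0.0, noise_sigma, size=volume.shape)

vol = np.zeros((64, 64, 64), dtype=np.float32)   # placeholder PET/CT volume
augmented = [augment(vol.copy()) for _ in range(8)]
```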
In addition, current radiomics studies have some limitations. First, most studies were based on retrospective single-center data, but models require external validation with multi-center data to improve generalizability and accuracy [78]. Therefore, multi-center prospective studies and clinical trials are necessary [79] to determine whether DLR can be applied in clinical settings. The data in Peng et al. were acquired from one center, and prospective studies with external validation to verify the results were lacking [29]. Although some data platforms have been established, such as The Cancer Imaging Archive, data quality varies; establishing a unified standard to ensure data quality must also be addressed as soon as possible [80]. Second, combining clinical features (e.g., age, sex, and TNM stage) could make models more comprehensive [81]; clinical features were included in studies such as Salmanpour et al. [24], Mehdi et al. [25], and Chen et al. [27]. Consequently, appropriate clinical features should be added during feature construction to improve model performance.
5. CONCLUSION
PET/CT-based DLR has promising prospects in HNC prognosis. Deep learning can be used for ROI segmentation, feature extraction and fusion, and model building. In ROI segmentation, automatic segmentation avoids the subjective influence of manual segmentation and combines global and local information, thus making ROI delineation more accurate. In feature extraction and fusion, images of different modalities yield different features, and conventional radiomics features and deep learning features are complementary; combining deep features with radiomics features, and using deep learning to fuse features from different imaging modalities, greatly improves overall predictive performance. In model building, incorporating deep learning models yields faster predictions and good performance on relatively complex datasets. Accordingly, adding deep learning techniques to conventional radiomics achieves higher evaluation metrics in HNC prognosis than conventional radiomics alone. Of course, prediction results are also affected by the numbers of cases and centers and by data quality, potentially leading to overfitting and other problems. In short, the organic combination of the two methods improves model performance.