The rarity and heterogeneity of sarcomas make appropriately powered studies difficult to perform and magnify the significance of large databases in sarcoma research. Established large tumor registries and population-based databases have become increasingly relevant for answering clinical questions about sarcoma incidence, treatment patterns, and outcomes. However, the validity of large databases has been questioned because of inaccurate and widely variable coding practices and the absence of clinically relevant variables. Additionally, using large databases to study rare cancers such as sarcoma may be particularly challenging because of known limitations of administrative data and poor overall data quality. Several large national cancer databases currently exist, including the Surveillance, Epidemiology, and End Results (SEER) database, the American College of Surgeons’ and American Cancer Society’s National Cancer Database (NCDB), and the Centers for Disease Control and Prevention (CDC) National Program of Cancer Registries (NPCR). These databases are often used for sarcoma research, but they are limited by a dependence on administrative or billing data, a lack of agreement between chart abstractors on diagnosis codes, and the use of preexisting documented hospital diagnosis codes for tumor registries, leading to significant underestimation of sarcomas in large datasets. Current and future initiatives to improve databases and big data applications for sarcoma research include increasing the use of sarcoma-specific registries and encouraging national initiatives to expand real-world evidence datasets.
The main aim of this article is to demonstrate the limitations of these databases specifically for sarcoma research. We also describe current initiatives to improve the application of big data to rare malignancies.