A key challenge in genomics is to identify genetic variants that distinguish patients with different survival times following diagnosis or treatment. While the log-rank test is widely used for this purpose, nearly all implementations of the log-rank test rely on an asymptotic approximation that is not appropriate in many genomics applications, for two reasons: the two populations determined by a genetic variant may have very different sizes, and the evaluation of many possible variants demands highly accurate computation of very small p-values. We demonstrate this problem for cancer genomics data, where the standard log-rank test leads to many false positive associations between somatic mutations and survival time. We develop and analyze a novel algorithm, Exact Log-rank Test (ExaLT), that accurately computes the p-value of the log-rank statistic under an exact distribution that is appropriate for populations of any size. We demonstrate the advantages of ExaLT on data from published cancer genomics studies, finding significant differences from the reported p-values. We analyze somatic mutations in six cancer types from The Cancer Genome Atlas (TCGA), finding mutations with known association to survival as well as several novel associations. In contrast, standard implementations of the log-rank test report dozens to hundreds of likely false positive associations as more significant than these known associations.
The identification of genetic variants associated with survival time is crucial in genomic studies. To this end, a number of methods have been proposed to compute a p-value that summarizes the difference in survival time between two or more populations. The most widely used of these is the log-rank test. Widely used implementations of the log-rank test exhibit a systematic error that emerges in most genome-wide applications, where the two populations have very different sizes and the evaluation of a large number of candidate variants requires accurate computation of very small p-values. Considering cancer genomics applications, we show that this systematic error leads to many false positive associations between somatic variants and survival time. We present and analyze a new algorithm, ExaLT, that accurately computes the p-value for the log-rank test under a distribution that is appropriate for the parameters found in genomics. Unlike previous approaches, ExaLT allows the user to control the accuracy of the computation. We use ExaLT to analyze cancer genomics data from The Cancer Genome Atlas (TCGA), identifying several novel associations in addition to well-known associations. In contrast, the standard implementations of the log-rank test report a huge number of presumably false positive associations.
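
To make the contrast between the asymptotic and an exact p-value concrete, the following minimal sketch compares the two on a small, highly unbalanced toy cohort. It is an illustration only and not the ExaLT algorithm: it assumes that the exact null distribution is the permutation distribution of group labels, it enumerates every relabeling by brute force (feasible only for tiny cohorts), and it assumes no tied survival times. The function names (`logrank_stat`, `asymptotic_p`, `exact_p`) and the toy data are hypothetical.

```python
from itertools import combinations
from math import comb, erfc, sqrt


def logrank_stat(events, small):
    """Log-rank statistic for the smaller group.

    events[j] is True if the patient with the j-th shortest time had an
    observed event (False means censored); times are assumed untied.
    small is the set of ranks belonging to the smaller group.
    Returns (V, var): observed minus expected events in the smaller
    group, and the usual hypergeometric variance estimate.
    """
    n = len(events)
    at_risk_small = len(small)
    v = var = 0.0
    for j in range(n):
        at_risk = n - j  # patients still at risk just before time j
        if events[j]:
            frac = at_risk_small / at_risk
            v += (j in small) - frac
            var += frac * (1.0 - frac)
        if j in small:
            at_risk_small -= 1
    return v, var


def asymptotic_p(events, small):
    """Two-sided p-value from the normal approximation to V / sqrt(var)."""
    v, var = logrank_stat(events, small)
    return erfc(abs(v) / sqrt(2.0 * var))


def exact_p(events, small):
    """Two-sided p-value under the permutation null (assumed here):
    the fraction of relabelings whose |V| is at least the observed |V|."""
    n, n1 = len(events), len(small)
    v_obs = abs(logrank_stat(events, small)[0])
    hits = sum(
        1
        for labels in combinations(range(n), n1)
        if abs(logrank_stat(events, set(labels))[0]) >= v_obs - 1e-12
    )
    return hits / comb(n, n1)


# Toy cohort: 12 patients, all events observed; the 2 carriers of a
# hypothetical variant are the two shortest survivors.
events = [True] * 12
small = {0, 1}
p_asym = asymptotic_p(events, small)
p_exact = exact_p(events, small)
print(f"asymptotic p = {p_asym:.3g}, exact permutation p = {p_exact:.3g}")
```

With 2 carriers among 12 patients there are only 66 possible labelings, so the exact permutation p-value can never fall below 1/66, whereas the normal approximation is free to report a much smaller value. Brute-force enumeration grows combinatorially with cohort size, which is why an efficient algorithm with controllable accuracy is needed in practice.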