We investigate the problem of determining the significance of candidate transient gravitational-wave events in a ground-based interferometer network. Given the presence of non-Gaussian noise artefacts in real data, the noise background must be estimated empirically from the data itself. However, the data may also contain signals, so the background estimate may be overstated due to contributions from those signals. It has been proposed to mitigate this possible bias by removing from the background estimate any single-detector data samples that pass a multi-detector consistency test. We conduct a high-statistics Mock Data Challenge to evaluate the effects of removing such samples, modelling a range of scenarios with plausible detector noise distributions and a range of plausible foreground astrophysical signal rates. We consider two different modes of background estimation: one in which coincident samples are removed, and one in which all samples are retained and used. Three algorithms, each operated in both modes, show good consistency with each other; however, discrepancies arise between the results obtained under the "coincidence removal" and "all samples" modes for false alarm probabilities below a certain value. In most scenarios the median of the false alarm probability (FAP) estimator under the "all samples" mode is consistent with the exact FAP. On the other hand, the "coincidence removal" mode is found to be unbiased for the mean of the estimated FAP. While the numerical values at which discrepancies become apparent are specific to the details of our experiment, we believe that the qualitative differences in the behaviour of the median and mean of the FAP estimator have more general validity. On the basis of our study we suggest that the FAP of candidates for the first detection of gravitational waves should be estimated without removing single-detector samples that form coincidences.
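To make the distinction between the two modes concrete, the following is a minimal toy sketch and not any of the three algorithms used in the challenge. It assumes a two-detector network, a combined ranking statistic equal to the quadrature sum of the single-detector values, and a background built from every pairing of a detector-1 sample with a detector-2 sample. The function names, the choice of statistic, and the sample values are all hypothetical and chosen only for illustration.

```python
import numpy as np


def background_stats(det1, det2):
    """Combined statistic (assumed: sum of squares) for every pairing
    of a detector-1 sample with a detector-2 sample."""
    return np.add.outer(det1 ** 2, det2 ** 2).ravel()


def estimate_fap(candidate, det1, det2, coinc1=None, coinc2=None):
    """Fraction of background pairings at least as loud as the candidate.

    coinc1 / coinc2 are optional boolean masks marking single-detector
    samples that formed a coincidence. If given, those samples are
    removed first ("coincidence removal" mode); if omitted, every
    sample is used ("all samples" mode).
    """
    if coinc1 is not None:
        det1 = det1[~coinc1]
    if coinc2 is not None:
        det2 = det2[~coinc2]
    bg = background_stats(det1, det2)
    return np.count_nonzero(bg >= candidate) / bg.size


# Hypothetical single-detector samples; the last entry in each detector
# is a loud sample that forms the candidate coincidence.
det1 = np.array([1.0, 2.0, 3.0, 8.0])
det2 = np.array([1.0, 2.0, 3.0, 9.0])
coinc1 = np.array([False, False, False, True])
coinc2 = np.array([False, False, False, True])

candidate = det1[-1] ** 2 + det2[-1] ** 2

fap_all = estimate_fap(candidate, det1, det2)
fap_removed = estimate_fap(candidate, det1, det2, coinc1, coinc2)
print(fap_all, fap_removed)
```

In this toy case the "all samples" estimate is nonzero because the loud samples remain in the background, whereas the "coincidence removal" estimate drops to zero, so that only an upper bound could be quoted. A realistic analysis would use far more samples and a pipeline-specific ranking statistic.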