Seismic event detection and phase picking form the basis of many seismological workflows. In recent years, several publications have demonstrated that deep learning approaches significantly outperform classical approaches and even achieve human-like performance under certain circumstances. However, because most studies differ in the datasets and exact evaluation tasks they consider, it remains unclear how the different approaches compare to each other. Furthermore, there are no systematic studies of how the models perform in a cross-domain scenario, i.e., when applied to data with characteristics different from those of the training data. Here, we address these questions by conducting a large-scale benchmark study. We compare six previously published deep learning models on eight datasets covering local to teleseismic distances and on three tasks: event detection, phase identification, and onset time picking. In addition, we compare the results to a classical Baer-Kradolfer picker. Overall, we observe the best performance for EQTransformer, GPD, and PhaseNet, with EQTransformer having a small advantage for teleseismic data. We also conduct a cross-domain study, in which we analyze the performance of models on datasets they were not trained on. We show that trained models can be transferred between regions with only mild performance degradation, but not from regional to teleseismic data or vice versa. As deep learning for detection and picking is a rapidly evolving field, we ensure the extensibility of our benchmark by building our code on standardized frameworks and making it openly accessible. This allows model developers to easily compare new models or evaluate performance on new datasets beyond those presented here. Finally, we make all trained models available through the SeisBench framework, giving end users an easy way to apply these models in seismological analysis.