Abstract

Identifying in vivo transcription factor (TF) binding sites is crucial for understanding gene regulation, but experimental identification scales poorly, which directs researchers towards computational models. These models are often specific to a given TF, which hinders their generalization to held-out TFs. In this work, we analyze different modeling strategies for predicting in vivo TF binding sites from DNA accessibility, TF RNA expression and binding motif features. We present and test a cross-TF transfer learning scheme that allows learning from the entire training set. We show that model ensembling and DNA language model embeddings improve model performance. We provide an analysis of feature importance and show that the quality of the ground-truth ChIP-seq data is an important determinant of model performance. We also test our models on an independent dataset of held-out TFs and report a mean AUPR of 0.36 in a very challenging cross-TF, cross-cell-type and cross-chromosomal setting, providing binding estimates for TFs without available ChIP-seq experiments.
Aksu et al. (Fri,) studied this question.
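
To make the headline metric concrete, the sketch below shows one common way a mean AUPR over held-out TFs can be computed: average precision per TF, then an unweighted mean across TFs. This is an illustration under stated assumptions, not the authors' evaluation code. The abstract does not say which AUPR estimator or averaging scheme was used, and the function names, TF names and toy labels and scores are all hypothetical.

```python
from statistics import mean


def average_precision(labels, scores):
    """Estimate AUPR as average precision for one TF.

    labels: 0/1 ground-truth binding labels (e.g. from ChIP-seq peaks).
    scores: predicted binding scores, higher means more likely bound.
    Assumption: ties in scores are broken by input order.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank sites from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def mean_aupr(per_tf):
    """Unweighted mean of per-TF average precision (hypothetical helper)."""
    return mean(average_precision(y, s) for y, s in per_tf.values())


# Hypothetical toy data for two held-out TFs.
held_out = {
    "TF_A": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "TF_B": ([0, 1, 0, 0], [0.6, 0.4, 0.3, 0.2]),
}
result = mean_aupr(held_out)
```

The expected AUPR of a random predictor is roughly the fraction of positive sites, so a mean AUPR is best interpreted relative to how sparse binding sites are in the evaluation set.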