From raw text to fairseq RoBERTa: A modular snakemake-based framework enabling language-specific BPE tokenization | Synapse