Supplementary data archive for the cgDist paper, containing distance matrices, allelic profiles, and recombination-candidate flagging outputs for four bacterial genomic datasets curated from the BeONE project.

Datasets

| Dataset | Species | Samples (complete) | Samples (98% filtered) |
|---|---|---|---|
| Lm1426 | Listeria monocytogenes | 1,426 | 1,381 |
| Lm1874 | Listeria monocytogenes | 1,874 | 1,824 |
| Se1540 | Salmonella enterica | 1,540 | 1,446 |
| Se1434 | Salmonella enterica | 1,434 | 1,192 |

File Types (per dataset)

Distance matrices (TSV, square matrices with sample IDs as both row and column headers; values are nucleotide-level distances for cgDist or allele differences for Hamming):

- dataset_hamming.tsv / dataset_hamming_98percent.tsv: cgMLST allelic distance (complete / 98%-filtered)
- dataset_cgdist_snps_only.tsv / ..._98percent.tsv: cgDist SNPs-only mode
- dataset_cgdist_snps_indel_contiguous.tsv / ..._98percent.tsv: cgDist SNPs+InDel-contiguous mode
- dataset_cgdist_snps_indel_bases.tsv / ..._98percent.tsv: cgDist SNPs+InDel-bases mode

Allelic profiles:

- dataset_allelic_profiles.tsv: cgMLST allelic profiles used for distance calculation

Recombination-candidate flagging (heuristic flagging by per-locus mutation density; this is not a recombination detector, and confirmation requires phylogeny-aware tools such as Gubbins, ClonalFrameML, or fastGEAR; see the Supplementary Methods of the paper for the full description and caveats):

- dataset_flagging_per_locus_98percent.tsv: per-locus rows for each flagged candidate (filtered dataset)
- dataset_flagging_pairwise_summary_98percent.tsv: per-pair summary

Empirical density distribution (Se-1540 only):

- Se1540_rho_distribution.png: visualization of the empirical per-locus density distribution used to inform selection of the 3.0% flagging threshold

Quality filters

- Sample completeness (98% filter): minimum 98% of loci present per sample.
- Recombination-candidate exclusion: per-locus mutation density threshold of 3.0%; loci flagged at this threshold are excluded from the corrected distance matrices used for downstream clustering (see paper Discussion and Supplementary Methods for the framing of the flagging step).

Data generation

All distance matrices were computed using cgDist v0.1.1 (https://github.com/genpat-it/cgDist):

- Alignment: DNA-strict mode
- Schema: species-specific cgMLST schemes
- Candidate-flagging threshold: 3.0% per-locus mutation density

Reproducibility

All results in the paper can be independently verified using the data in this archive:

- Distance matrices can be recalculated with cgDist v0.1.1 on the provided allelic profiles.
- Clustering results can be reproduced with single-linkage clustering on the distance matrices.
- Statistical analyses (ARI, correlations) can be recalculated with standard Python libraries (pandas, scipy, scikit-learn).

Citation

If you use this data, please cite the paper:

de Ruvo et al. (2025). cgDist: Enhanced Resolution for Bacterial Genomic Surveillance Through Nucleotide-Level Distance Calculation from cgMLST Profiles. bioRxiv 2025.10.16.682749 (manuscript under review). https://doi.org/10.1101/2025.10.16.682749

Data: https://doi.org/10.5281/zenodo.17285517 (concept DOI; resolves to the latest version)

Contact

- Corresponding author: andrea.deruvo@gssi.it (also: a.deruvo@izs.it)
- cgDist tool: https://github.com/genpat-it/cgDist

License

Creative Commons Attribution 4.0 International (CC BY 4.0).
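The clustering and ARI steps named under Reproducibility can be sketched roughly as below. This is a minimal, dependency-free illustration and not the authors' analysis code. The helper names (`read_square_matrix`, `single_linkage`, `adjusted_rand_index`), the toy matrices, and the threshold of 2 are all assumptions made for the example; none of the values come from the archive. It assumes the matrix layout described above: a square TSV with sample IDs in the first row and first column.

```python
import csv
import io
from collections import Counter
from math import comb


def read_square_matrix(handle):
    """Read a square TSV matrix with sample IDs in the first row and column."""
    rows = list(csv.reader(handle, delimiter="\t"))
    ids = rows[0][1:]
    dist = {}
    for row in rows[1:]:
        for col_id, value in zip(ids, row[1:]):
            dist[(row[0], col_id)] = float(value)
    return ids, dist


def single_linkage(ids, dist, threshold):
    """Flat single-linkage clusters: samples connected by any chain of pairs
    with distance <= threshold share a cluster label (union-find)."""
    parent = {s: s for s in ids}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if dist[(a, b)] <= threshold:
                parent[find(a)] = find(b)
    return [find(s) for s in ids]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. all singletons in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy matrices with invented values (hypothetical, not from the archive).
HAMMING = "\tA\tB\tC\tD\nA\t0\t1\t1\t9\nB\t1\t0\t2\t9\nC\t1\t2\t0\t8\nD\t9\t9\t8\t0\n"
CGDIST = "\tA\tB\tC\tD\nA\t0\t1\t6\t40\nB\t1\t0\t7\t41\nC\t6\t7\t0\t38\nD\t40\t41\t38\t0\n"

ids, ham = read_square_matrix(io.StringIO(HAMMING))
_, cgd = read_square_matrix(io.StringIO(CGDIST))

# The threshold of 2 is an arbitrary illustration, not a value from the paper.
ham_clusters = single_linkage(ids, ham, threshold=2)
cgd_clusters = single_linkage(ids, cgd, threshold=2)
ari = adjusted_rand_index(ham_clusters, cgd_clusters)
print(f"ARI between Hamming and cgDist clusterings: {ari:.3f}")
```

To run this on the archive, an open file handle for one of the TSV matrices can be passed to `read_square_matrix` in place of the `io.StringIO` objects. For matrices of the size listed above, `scipy.cluster.hierarchy.linkage` with `method="single"` followed by `fcluster(..., criterion="distance")`, and `sklearn.metrics.adjusted_rand_score`, are likely to be the more practical choice; the pure-Python version is shown only to make the logic explicit.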
Andrea De Ruvo
Pierluigi Castelli
Andrea Bucciacchio
Gran Sasso Science Institute
National Institute of Health Dr. Ricardo Jorge
Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise G. Caporale
www.synapsesocial.com/papers/6a080b4ea487c87a6a40d79d — DOI: https://doi.org/10.5281/zenodo.20040167