In recent years, numerous approaches have emerged for identifying compounds that are absent from databases by combining GC–MS with deep learning methods. Despite significant progress in this area, studies that assess the reliability of GC–MS identification using several state-of-the-art models simultaneously are virtually nonexistent. Such an assessment requires reference mass spectra and retention indices for compounds that are not represented in the NIST database used to train the models, and it is valid only if none of the models used has “seen” the test molecules during training. In this work, such an assessment was performed for 12 nitrogen-containing compounds that are absent from the NIST 23 mass spectral database: 2-methyl-1-pyrroline, N',N'-dimethylformohydrazide, 1-ethylpyrazole, 3,4-dimethyl-1,2-oxazole, 1,4-dimethyl-1,2,3-triazole, 1-ethyl-1,2,4-triazole, 2-amino-5-methylpyrazine, and others. The following models were used: the AIRI model for predicting retention indices, the neims-pytorch model for predicting mass spectra, and the EI2FP model for predicting molecular fingerprints (the presence or absence of certain substructures) from the mass spectrum. For each molecule, the isomeric structures corresponding to its molecular formula were extracted from the PubChem database. Isomers with a low probability of being present in typical samples and isomers whose predicted retention index differed significantly from the observed one were excluded. The remaining isomers were ranked by the similarity between the observed and predicted mass spectra and by the similarity between the molecular fingerprint predicted from the mass spectrum and the fingerprint calculated for the candidate structure. Using both approaches simultaneously identifies the correct structure in 8 out of 12 cases; in the remaining cases, the correct structure is among the top 5 candidates, which is a fairly high result for non-targeted screening. Using either approach alone yields lower accuracy, and satisfactory identification is not possible without the retention index. Outdated models for predicting mass spectra and retention indices (CFM, SVEKLA) also fail to achieve such results.
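The candidate-ranking workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` container, the retention index tolerance `ri_tolerance`, the equal-weight sum of the two similarity scores, and the choice of cosine and Tanimoto similarity are all assumptions made for illustration. The predicted values would in practice come from the retention index, mass spectrum, and fingerprint models, which are not called here.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, FrozenSet, List


@dataclass
class Candidate:
    """Hypothetical container for one candidate isomer and its predictions."""
    name: str
    predicted_ri: float                    # predicted retention index
    predicted_spectrum: Dict[int, float]   # m/z -> predicted intensity
    fingerprint: FrozenSet[int]            # substructure bits computed from the structure


def cosine_similarity(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity between two spectra stored as m/z -> intensity."""
    dot = sum(a[mz] * b[mz] for mz in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_candidates(
    candidates: List[Candidate],
    observed_ri: float,
    observed_spectrum: Dict[int, float],
    fingerprint_from_spectrum: FrozenSet[int],
    ri_tolerance: float = 50.0,   # assumed cutoff, not taken from the paper
    weight: float = 0.5,          # assumed equal weighting of both scores
) -> List[Candidate]:
    """Drop candidates with a distant predicted RI, then sort by combined score."""
    kept = [
        c for c in candidates
        if abs(c.predicted_ri - observed_ri) <= ri_tolerance
    ]

    def score(c: Candidate) -> float:
        spectrum_score = cosine_similarity(observed_spectrum, c.predicted_spectrum)
        fingerprint_score = tanimoto(fingerprint_from_spectrum, c.fingerprint)
        return weight * spectrum_score + (1.0 - weight) * fingerprint_score

    return sorted(kept, key=score, reverse=True)
```

Setting `weight` to 1.0 or 0.0 reduces the ranking to the spectrum-only or fingerprint-only approach, which makes it easy to compare either approach alone against the combination on the same candidate list.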
D. D. Matyushin
M. D. Khrisanfov
S. A. Borovikova
Journal of Analytical Chemistry
Lomonosov Moscow State University
Frumkin Institute of Physical Chemistry and Electrochemistry
Matyushin et al. studied this question.
www.synapsesocial.com/papers/69d896406c1944d70ce0784d — DOI: https://doi.org/10.1134/s1061934826700061