What question did this study set out to answer?

The study aims to create AgriBioNER, a tool designed to extract ncRNA and disease entities from vast agricultural literature.

April 21, 2026

AgriBioNER: A Named Entity Recognition Tool for Identification of ncRNA and Diseases in Agricultural Literature

Key points

The study aims to create AgriBioNER, a tool designed to extract ncRNA and disease entities from vast agricultural literature.
Developed a custom dataset from PubMed abstracts related to ncRNAs and diseases in agriculture.
Performed manual annotation and preprocessing of 2,652 abstracts for entity labeling.
Fine-tuned four transformer models using a specialized pipeline and evaluated their performance on precision, recall, and F1-score.
PubMedBERT-base-uncased-abstract-fulltext achieved the highest precision of 0.8351, recall of 0.8732, and F1-score of 0.8537.
Transformer models generally outperformed the classical spaCy model, which showed higher asymmetric errors.
The system processed abstracts efficiently at approximately 3.56 seconds per abstract, supporting scalability.
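
As a quick consistency check on the figures above, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall, and extrapolates the per-abstract time to a corpus of 2,652 abstracts. The function names are illustrative, and the runtime estimate assumes sequential processing at a constant 3.56 seconds per abstract, which the study does not state.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def total_runtime_hours(n_abstracts: int, seconds_per_abstract: float) -> float:
    """Sequential runtime in hours, assuming a constant per-abstract cost."""
    return n_abstracts * seconds_per_abstract / 3600


# Values reported for PubMedBERT-base-uncased-abstract-fulltext.
reported_precision = 0.8351
reported_recall = 0.8732
reported_f1 = 0.8537

recomputed_f1 = f1_score(reported_precision, reported_recall)
print(f"recomputed F1: {recomputed_f1:.4f} (reported: {reported_f1})")

# Assumption: abstracts are processed one after another at 3.56 s each.
hours = total_runtime_hours(2652, 3.56)
print(f"estimated time for 2,652 abstracts: {hours:.2f} h")
```

The recomputed F1 agrees with the reported value to four decimal places, and under the stated assumption the full corpus would take roughly 2.6 hours. Averaged scores need not satisfy the harmonic-mean identity exactly, so a small mismatch in such a check would not by itself indicate an error.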

Abstract

Objective: The rapid growth of agricultural and genomic research has generated a vast body of literature, making manual extraction of non-coding RNAs (ncRNAs) and agricultural disease entities increasingly challenging. Although pretrained transformer models such as BioBERT v1.1, PubMedBERT-base-uncased-abstract-fulltext, DeBERTa-v3-large, and RoBERTa-base perform well in biomedical NLP, they lack task-specific Named Entity Recognition (NER) layers adapted to agricultural terminology. This study presents AgriBioNER, a domain-adapted NER framework for accurately extracting ncRNA and disease entities from agricultural literature.

Materials and methods: To develop a robust NER system focused on disease-related ncRNAs, relevant scientific literature was systematically retrieved from PubMed. The collected abstracts underwent a comprehensive preprocessing phase in which non-informative components such as author names, institutional affiliations, and digital object identifiers (DOIs) were removed to retain only the core textual content. Manual annotation was then performed to label the key biological entities of interest. The labeled data was exported in two formats: a JSON format compatible with the spaCy NER pipeline and a BIO (Beginning-Inside-Outside) format [22], widely used in fine-tuning transformer-based models such as DeepSeek, BERT, and other LLMs. Based on this labeled corpus, a custom NER dataset was developed specifically for the domain of ncRNA and disease associations. Using this dataset, seven different models were fine-tuned to identify domain-specific entities. These models were then evaluated on a separate unseen test set, allowing for an assessment of their generalization capability and real-world applicability (see Figure 1).

Dataset preparation

Data collection: We collected abstracts related to ncRNAs and diseases within the agriculture domain, as in Figure 1 (raw data). We used an advanced query on PubMed, “Non-coding RNA AND Disease AND Agriculture NOT Human,” which returned 2,652 abstracts in total (as of January 2025) published between 1988 and 2025, as in Figure 2.

Data preprocessing: The raw data contains institute names, author names, DOIs, journal names, and similar metadata that are not required for our objective. We therefore removed them manually from each abstract to make the text more consistent and less noisy, as in Figure 1 (pre-processed data).

Data annotation: We manually pre-processed and annotated 2,652 PubMed abstracts, carefully labeling ncRNAs, including their different types, as well as disease names. We labeled these entities as NON-CODINGRNA and DISEASE using NER annotators for spaCy and Label Studio [23]. We exported the annotated data in two formats: JSON, which works well with the spaCy [24] pipeline, and the BIO CoNLL-2003 [22] format, which is commonly used in other biomedical NLP work. The JSON schema records the text and its entities along with their positions within the text, as in Figure 1 (JSON format). The BIO tagging scheme consists of three labels: Beginning (B), Inside (I), and Outside (O). The B tag marks the first word of an entity, indicating both the start of the entity and its type. If an entity spans multiple words, only the first word is tagged with B, while the subsequent words are tagged with I to show that they are part of the same entity but not at its beginning. Tokens that do not belong to any entity are labeled with O, as in Figure 1 (BIO format). After labeling, the custom dataset (D) was created, as in equation (1).

Methods: A curated dataset of agricultural abstracts related to ncRNAs and diseases was retrieved from PubMed. Four transformer models were fine-tuned using the spaCy transformer pipeline, and a classical spaCy model (en_core_web_lg) was trained separately. Performance was evaluated using precision, recall, F1-score, statistical significance testing, 500-iteration bootstrap validation, error analysis, and assessment of prediction time and scalability.

Results: Transformer models showed consistent performance, while en_core_web_lg exhibited higher asymmetric errors. PubMedBERT-base-uncased-abstract-fulltext achieved the best results with a precision of 0.8351 ± 0.0139, recall of 0.8732 ± 0.0117, and F1-score of 0.8537 ± 0.0111. It processed abstracts in approximately 3.56 seconds per abstract, supporting large-scale deployment.

Discussion: Domain-adapted transformer models substantially improved ncRNA–disease entity recognition in agricultural texts. Enhanced contextual representations enabled a better understanding of domain-specific terminology. These improvements demonstrate the effectiveness of specialized fine-tuning for agricultural NLP tasks.

Conclusion: AgriBioNER offers a reliable framework for automated extraction of ncRNA and disease entities. The system reduces manual curation effort while maintaining high accuracy and computational efficiency. It supports data-driven advancements in agricultural genomics and biotechnology research.
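
The BIO scheme described in the data annotation step can be illustrated with a small sketch that converts character-offset entity spans (the kind of information a spaCy-style JSON export carries) into per-token BIO tags. The whitespace tokenizer, the `spans_to_bio` helper, and the example sentence are assumptions made for illustration and are not the authors' pipeline; only the NON-CODINGRNA and DISEASE labels come from the study.

```python
import re


def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans to (token, tag) pairs.

    Tokens are whitespace-delimited (a simplifying assumption); a token
    is assigned to an entity when it lies fully inside the entity's span.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.start(), match.end()
        tag = "O"  # default: token is outside any entity
        for start, end, label in spans:
            if tok_start >= start and tok_end <= end:
                # First token of the span gets B-, later ones get I-.
                prefix = "B" if tok_start == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


# Made-up sentence and annotations, for illustration only.
text = "miR-21 is linked to bovine mastitis in dairy cattle"
rna = "miR-21"
disease = "bovine mastitis"
spans = [
    (text.index(rna), text.index(rna) + len(rna), "NON-CODINGRNA"),
    (text.index(disease), text.index(disease) + len(disease), "DISEASE"),
]

bio_tags = spans_to_bio(text, spans)
for token, tag in bio_tags:
    print(f"{token}\t{tag}")
```

Here "bovine" receives B-DISEASE and "mastitis" receives I-DISEASE, matching the rule that only the first word of a multi-word entity is tagged B. A real pipeline would use a proper tokenizer, since whitespace splitting leaves trailing punctuation attached to tokens and such tokens would fall outside the annotated span.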

Cite this study

Choubisa et al. studied this question.

www.synapsesocial.com/papers/69e71423cb99343efc98d809 — DOI: https://doi.org/10.2174/0113892029435693260407094051

Authors

Bhavesh Kumar Choubisa

Anu Sharma

K. K. Chaturvedi

Journals

Current Genomics

Institutions

Indian Council of Agricultural Research

Indian Agricultural Statistics Research Institute

National Institute of Medical Statistics
