Objective: The rapid growth of agricultural and genomic research has generated a vast body of literature, making manual extraction of non-coding RNAs (ncRNAs) and agricultural disease entities increasingly challenging. Although pretrained transformer models such as BioBERT v1.1, PubMedBERT-base-uncased-abstract-fulltext, DeBERTa-v3-large, and RoBERTa-base perform well in biomedical NLP, they lack task-specific Named Entity Recognition (NER) layers adapted to agricultural terminology. This study presents AgriBioNER, a domain-adapted NER framework for accurately extracting ncRNA and disease entities from agricultural literature.

Materials and Methods: To develop a robust NER system focused on disease-related ncRNAs, relevant scientific literature was systematically retrieved from PubMed. The collected abstracts were then preprocessed: non-informative components such as author names, institutional affiliations, and digital object identifiers (DOIs) were removed so that only the core text remained. Next, the key biological entities of interest were annotated manually. The labeled data were exported in two formats: a JSON format compatible with the spaCy NER pipeline, and the BIO (Beginning-Inside-Outside) format [22], which is widely used for fine-tuning transformer-based models such as DeepSeek, BERT, and other LLMs. From this labeled corpus, a custom NER dataset was built specifically for ncRNA and disease associations. Using this dataset, seven different models were fine-tuned to identify domain-specific entities. The models were then evaluated on a separate, unseen test set to assess their generalization capability and real-world applicability (see Figure 1).

Dataset Preparation

Data Collection: We collected abstracts related to ncRNAs and diseases within the agriculture domain, as in Figure 1 (raw data).
We used an advanced PubMed query, “Non-coding RNA AND Disease AND Agriculture NOT Human,” which returned a total of 2,652 abstracts (as of January 2025) published between 1988 and 2025, as in Figure 2.

Data Preprocessing: The raw data contain institute names, author names, DOIs, journal names, etc., which are not required for our objective. We therefore removed them manually from each abstract to make the text more consistent and less noisy, as in Figure 1 (pre-processed data).

Data Annotation: We manually pre-processed and annotated 2,652 PubMed abstracts, carefully labeling ncRNAs (including their different types) and disease names. We labeled these entities as NON-CODINGRNA and DISEASE using the NER Annotator for spaCy and Label Studio [23]. We exported the annotated data in two formats: JSON, which works well with the spaCy [24] pipeline, and the BIO CoNLL-2003 [22] format, which is commonly used in other biomedical NLP tasks. The JSON schema records each text together with its entities and their positions within the text, as in Figure 1 (JSON format). The BIO tagging scheme consists of three labels: Beginning (B), Inside (I), and Outside (O). The B tag marks the first word of an entity, indicating both the start of the entity and its type. If an entity spans multiple words, only the first word is tagged with B; the subsequent words are tagged with I to show that they belong to the same entity but are not at its beginning. Tokens that do not belong to any entity are labeled O, as in Figure 1 (BIO format). After labeling, the custom dataset (D) was created, as in equation (1).

Methods: A curated dataset of agricultural abstracts related to ncRNAs and diseases was retrieved from PubMed. Four transformer models were fine-tuned using the spaCy transformer pipeline, and a classical spaCy model (en_core_web_lg) was trained separately.
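The BIO scheme described above (B for the first token of an entity, I for the following tokens of the same entity, O elsewhere) can be illustrated with a minimal sketch that converts character-offset spans, in the style of the JSON export, into token-level tags. The whitespace tokenizer, the helper names `find_span` and `spans_to_bio`, and the example sentence are assumptions made for illustration; they are not the authors' pipeline, and only the labels NON-CODINGRNA and DISEASE come from the text.

```python
import re

def find_span(text, phrase, label):
    """Hypothetical helper: locate a phrase and return (start, end, label), end exclusive."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def spans_to_bio(text, entities):
    """Convert character-offset entity spans to (token, BIO tag) pairs.

    Assumes simple whitespace tokenization and non-overlapping entities.
    """
    tagged = []
    prev_entity = None  # index of the entity the previous token belonged to
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.start(), match.end()
        tag, current = "O", None
        for idx, (ent_start, ent_end, label) in enumerate(entities):
            # A token belongs to an entity when the two character ranges overlap.
            if tok_start < ent_end and tok_end > ent_start:
                prefix = "I" if prev_entity == idx else "B"
                tag, current = f"{prefix}-{label}", idx
                break
        tagged.append((match.group(), tag))
        prev_entity = current
    return tagged

# Hypothetical sentence, not taken from the annotated corpus.
text = "The long non-coding RNA regulates resistance to rice blast in seedlings"
entities = [
    find_span(text, "long non-coding RNA", "NON-CODINGRNA"),
    find_span(text, "rice blast", "DISEASE"),
]

for token, tag in spans_to_bio(text, entities):
    print(f"{token}\t{tag}")
```

A real pipeline would normally rely on the tokenizer of the model being fine-tuned, since whitespace splitting leaves punctuation attached to neighbouring words.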
Performance was evaluated using precision, recall, F1-score, statistical significance testing, 500-iteration bootstrap validation, error analysis, and assessment of prediction time and scalability.

Results: Transformer models showed consistent performance, while en_core_web_lg exhibited higher asymmetric errors. PubMedBERT-base-uncased-abstract-fulltext achieved the best results, with a precision of 0.8351 ± 0.0139, a recall of 0.8732 ± 0.0117, and an F1-score of 0.8537 ± 0.0111. It processed abstracts at approximately 3.56 seconds per abstract, supporting large-scale deployment.

Discussion: Domain-adapted transformer models substantially improved ncRNA–disease entity recognition in agricultural texts. Enhanced contextual representations enabled a better understanding of domain-specific terminology. These improvements demonstrate the effectiveness of specialized fine-tuning for agricultural NLP tasks.

Conclusion: AgriBioNER offers a reliable framework for automated extraction of ncRNA and disease entities. The system reduces manual curation effort while maintaining high accuracy and computational efficiency. It supports data-driven advances in agricultural genomics and biotechnology research.
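The 500-iteration bootstrap validation of precision, recall, and F1-score mentioned in the methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: it assumes that test abstracts are resampled with replacement, that micro-averaged metrics are computed from per-abstract true-positive, false-positive, and false-negative entity counts, and that the reported "±" is a standard deviation across resamples. The function names and the counts are hypothetical.

```python
import random
import statistics

def prf(tp, fp, fn):
    """Precision, recall, and F1 from entity-level counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def bootstrap_prf(doc_counts, n_iter=500, seed=0):
    """Resample abstracts with replacement and summarize the metrics.

    doc_counts: list of (tp, fp, fn) tuples, one per test abstract.
    Returns {metric: (mean, standard deviation)} over n_iter resamples.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_iter):
        resample = rng.choices(doc_counts, k=len(doc_counts))
        tp = sum(c[0] for c in resample)
        fp = sum(c[1] for c in resample)
        fn = sum(c[2] for c in resample)
        samples.append(prf(tp, fp, fn))
    summary = {}
    for name, values in zip(("precision", "recall", "f1"), zip(*samples)):
        summary[name] = (statistics.mean(values), statistics.stdev(values))
    return summary

# Hypothetical per-abstract counts, for illustration only.
doc_counts = [(5, 1, 0), (3, 0, 1), (4, 2, 1), (6, 1, 1), (2, 0, 0)]
summary = bootstrap_prf(doc_counts)
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```

Resampling whole abstracts rather than individual entities keeps entities from the same abstract together, which is one common design choice when predictions within a document are not independent.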
Choubisa et al. studied this question.
www.synapsesocial.com/papers/69e71423cb99343efc98d809 — DOI: https://doi.org/10.2174/0113892029435693260407094051
Bhavesh Kumar Choubisa
Anu Sharma
K. K. Chaturvedi
Current Genomics
Indian Council of Agricultural Research
Indian Agricultural Statistics Research Institute
National Institute of Medical Statistics