This repository contains the implementation materials for a citation-aware large language model (LLM) pipeline designed to enrich sensor metadata extraction from scientific literature. The workflow extends traditional single-document extraction by identifying sensor-related citations within primary research articles, retrieving the referenced studies, and extracting additional metadata from those cited sources. The project was developed within the SMARTER (Sensors and Metadata for Analytics and Research in Exposure Health) initiative to improve the completeness, harmonization, and discoverability of sensor metadata across environmental and exposure health literature.

LLM Configuration
Model: GPT-4o
Prompting strategy: Zero-shot
Structured JSON outputs
Regex-based parsing and normalization

Note that this repository currently contains the implementation for Component 2 of the workflow. The implementation for Component 5 (LLM-Based Metadata Extraction) was developed as part of our previous study and is available at: 10.5281/zenodo.16929793
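As an illustration of the "Structured JSON outputs" and "Regex-based parsing and normalization" items in the LLM configuration, the following is a minimal Python sketch of how a model response might be parsed and cleaned. It is not taken from the repository: the field names in EXPECTED_FIELDS, the placeholder strings treated as missing, and the function names are all hypothetical and chosen only for this example.

```python
import json
import re

# Hypothetical field names; the repository's actual output schema is not
# described in the text above.
EXPECTED_FIELDS = ("sensor_name", "manufacturer", "measurand")

# Assumed set of placeholder strings that are treated as "no value".
MISSING_MARKERS = {"", "n/a", "na", "none", "unknown", "not reported"}


def extract_json_object(response_text: str) -> dict:
    """Find the outermost {...} span in a model response and parse it as JSON."""
    # Greedy match with DOTALL: from the first "{" to the last "}".
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def normalize_value(value):
    """Collapse whitespace and map placeholder strings to None."""
    if value is None:
        return None
    text = re.sub(r"\s+", " ", str(value)).strip()
    return None if text.lower() in MISSING_MARKERS else text


def normalize_record(record: dict) -> dict:
    """Keep only the expected fields, normalizing each value."""
    return {field: normalize_value(record.get(field)) for field in EXPECTED_FIELDS}


if __name__ == "__main__":
    # Made-up response text, used only to exercise the functions.
    response = (
        "Here is the result:\n"
        '{"sensor_name": "  Example   Sensor X1 ", "manufacturer": "N/A"}\n'
        "Done."
    )
    print(normalize_record(extract_json_object(response)))
```

Returning an empty dictionary on a parse failure keeps the sketch simple; a real pipeline might instead log the raw response or retry the request.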
Fatemeh Aleahmad et al. studied this question.