• This study evaluates how effectively open and general-purpose LLMs can determine whether two biomedical concepts are taxonomically related.
• Utilizing SNOMED CT as the primary taxonomy source, we present a procedure for creating a dataset specifically designed to test the taxonomic relationship identification capabilities of various open LLMs.
• We investigate the factors that influence the accuracy of identification by our set of LLMs following a reproducible methodology.

Ontologies serve as semantic blueprints for knowledge management by capturing information in a coherent machine-processable format. They define concepts and relationships, commonly represented through knowledge graphs (KGs) in which taxonomic, or “is-a”, relationships arrange concepts into hierarchical structures. Biomedical applications particularly benefit from these structured representations because of the domain’s inherent complexity and continual evolution. Structured representations of relationships support automated reasoning and inference, which are crucial for clinical decision-making, research hypothesis generation, and data integration tasks. Although a substantial portion of biomedical knowledge remains in natural language, Large Language Models (LLMs) offer new potential to automatically extract and interpret this information. Despite promising results in various natural language processing tasks, few studies have examined how effectively LLMs recognise taxonomic relationships. This study evaluates the ability of general-purpose LLMs to reason about biomedical taxonomies by identifying hierarchical “is-a” relationships between concepts. To operationalise this evaluation, we use the SNOMED CT knowledge graph, one of the most comprehensive clinical terminologies, as a gold-standard reference for determining whether candidate concept pairs are taxonomically linked.
Overall, LLMs often succeed in recognising domain-specific taxonomic links based solely on their generic pre-training, yet they exhibit weaknesses in directional reasoning, particularly in challenging negative cases where true parent–child relations are intentionally reversed. Chain-of-thought prompting significantly improves their performance in interpreting these relationships. Taken together, the results highlight both the benefits of chain-of-thought prompting for hierarchical judgments and the practical feasibility of integrating LLMs into algorithmic knowledge-graph workflows that require structured, machine-interpretable outputs.
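To make the evaluation setup concrete, the following minimal Python sketch shows one way positive “is-a” pairs and reversed hard negatives could be derived from a taxonomy and checked against it. It is an illustration under stated assumptions, not the procedure used in this study: the toy hierarchy, the function names (`ancestors`, `is_a_gold`, `build_pairs`, `make_prompt`), and the prompt wording are all hypothetical, and no real SNOMED CT content or API is used.

```python
# Hypothetical toy "is-a" edges (child -> direct parents).
# These are illustrative only and are NOT taken from SNOMED CT.
TOY_IS_A = {
    "viral pneumonia": ["pneumonia"],
    "pneumonia": ["lung disease"],
    "lung disease": ["disease"],
    "asthma": ["lung disease"],
}


def ancestors(concept, is_a):
    """Return every transitive ancestor of `concept` by walking is-a edges."""
    seen = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_a_gold(child, parent, is_a):
    """Gold label: True if `parent` is a (transitive) ancestor of `child`."""
    return parent in ancestors(child, is_a)


def build_pairs(is_a):
    """Emit each direct edge as a positive and its reversal as a hard negative.

    Reversal yields a true negative only because the hierarchy is acyclic.
    """
    examples = []
    for child, parents in is_a.items():
        for parent in parents:
            examples.append((child, parent, True))    # true parent-child pair
            examples.append((parent, child, False))   # direction reversed
    return examples


def make_prompt(a, b):
    """Assumed chain-of-thought style prompt; the wording is hypothetical."""
    return (
        f'Is "{a}" a kind of "{b}"? '
        'Think step by step, then answer "yes" or "no".'
    )


if __name__ == "__main__":
    for a, b, label in build_pairs(TOY_IS_A):
        print(label, "|", make_prompt(a, b))
```

Pairing each positive with its own reversal keeps the vocabulary identical across the two classes, so a model cannot separate them by term familiarity alone and must judge the direction of the relationship.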
Serrano et al. (Sun,) studied this question.