Advances in the field of natural language processing (NLP) allow for the semi-automated extraction of useful information from large sets of text-based documents and have untapped potential to provide insight into both scientific and stakeholder priorities on groundwater challenges. NLP, a set of techniques within the broader field of artificial intelligence (AI), includes statistical tools such as topic modeling and sentiment analysis. These methods allow the user to identify latent (hidden) topics and overall sentiments (attitudes leaning positive, neutral, or negative) within large sets of text.

These methods have already been applied in hydrogeology and hydrology to deduce temporal priorities in hydrogeologic science, spatial trends in water resource management issues, and influential human decisions in water allocation. Christenson and Cardiff (2024) quantified the evolution of research trends in hydrogeology using topic modeling of scientific abstracts published over the past 60 years, identifying the increased prevalence of treating ground and surface water as a single hydrologic system within the scientific literature. This work also documented the popularity of developing analytical and numerical methods in well hydraulics and groundwater modeling between the 1960s and 1980s, which ultimately gave way to increasingly complex models incorporating uncertainty estimation and model calibration in more recent decades. Additionally, this paper quantified the shift, during the “boom and bust” of contaminant hydrogeology between the 1980s and early 2000s, from assessment and characterization of contaminants toward understanding degradation and remediation techniques.

Recent work by Sweitzer et al. (2023) used topic modeling to analyze 1.8 million water-related narratives in local newspapers throughout the United States.
They identified regional water priorities related to concerns of resource scarcity and the co-dependence of water policy with other economic sectors. Articles on drinking water topics were found to occur in local news more frequently in the western United States, where concerns of drought are persistent. Discussions of pollutants, found across most of the United States, noticeably lacked the water-borne illness concerns common to many areas of the world. Near areas of energy production, opinions varied largely depending on whether the water was used to support renewable or nonrenewable development. This work also used two existing sentiment analysis dictionaries to assess the overall positive or negative sentiments associated with the key topics identified. Renewable-energy articles contained fewer negatively valenced terms than other topics, while articles relating to contaminants, oil and gas, and disease contained more negatively valenced terms (Sweitzer et al. 2023).

Nunes Carvalho et al. (2024) applied topic modeling combined with social network analysis to text records of water basin committee meetings over a 25-year period in Brazil to identify water management topics and influential actors in water allocation decision making. This analysis revealed that “reservoir operation” was the most prevalent topic within these records, and that urban water supply was more of a stakeholder concern than agricultural demand during drought periods. In each of these studies, NLP uncovers the outcomes or perceptions of human priorities as influences on the hydrologic system by converting massive text datasets into structured knowledge that informs broader conclusions.

Topic models are powerful unsupervised statistical models that allow the user to uncover themes within large collections of documents based on patterns of word occurrence.
A topic model can analyze very large sets (thousands to millions) of text documents and return sets of keywords that represent prominent “themes” within the data. These keywords are identified by the model through optimization of model success criteria and iterative feedback from the model user. While several variations on this concept exist, two topic models widely used in the scientific literature are Latent Dirichlet Allocation (LDA) and the Structural Topic Model (STM). Topic models of either sort require iterative human input during data collection, preprocessing, and tokenization (breaking down long text into smaller units known as tokens, which are usable by the model) to adjust and retrain models until interpretable and actionable insights can be gained. Model initialization relies on human expertise to select a predetermined number of topics. Later model discovery phases draw on domain-specific knowledge to iterate on the number of topics, evaluate topic coherence, and validate outcomes for interpretation.

Introduced by Blei et al. (2003), LDA is a generative probabilistic model that represents documents as mixtures of latent topics, where each topic is characterized by a distribution over words. In other words, every word in the vocabulary of the dataset has some probability of occurring in each topic, and the most probable words for each topic are the defining words for that topic. STM is an extension of the LDA method that incorporates metadata into the modeling process, thereby allowing the user to understand how hidden themes within documents vary by factors such as geographic location, publication date, or any other metadata associated with individual documents.
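To make the tokenization step and the LDA description above concrete, the following is a minimal sketch in Python using only the standard library. It is not the implementation used in any of the cited studies: the stopword list, the four toy documents, the hyperparameters `alpha` and `beta`, and the function names are all assumptions chosen for illustration, and the fitting procedure shown is collapsed Gibbs sampling, one common way to estimate LDA rather than the variational method of Blei et al. (2003).

```python
import random
import re

# Hypothetical stopword list; real analyses use a fuller list plus domain terms.
STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "on", "near"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stopwords and very short tokens."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS and len(t) > 2]


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (vocab, doc_topic, topic_word), where doc_topic[d] is the topic
    mixture of document d and topic_word[k] is topic k's word distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_ids = [[index[w] for w in doc] for doc in docs]

    n_dk = [[0] * n_topics for _ in docs]            # topic counts per document
    n_kw = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                             # total tokens per topic

    # Start from a random topic assignment for every token.
    assignments = []
    for d, words in enumerate(doc_ids):
        z_d = []
        for w in words:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, words in enumerate(doc_ids):
            for i, w in enumerate(words):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    doc_topic = [
        [(c + alpha) / (len(words) + n_topics * alpha) for c in n_dk[d]]
        for d, words in enumerate(doc_ids)
    ]
    topic_word = [
        [(c + beta) / (n_k[k] + n_words * beta) for c in n_kw[k]]
        for k in range(n_topics)
    ]
    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=3):
    """Return the n most probable words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Toy corpus, invented for illustration only.
raw_docs = [
    "Nitrate contamination in the drinking water well",
    "Well contamination and nitrate remediation",
    "Drought reduced the reservoir storage",
    "Reservoir operation during the drought",
]
docs = [tokenize(text) for text in raw_docs]
vocab, doc_topic, topic_word = fit_lda(docs, n_topics=2)
keywords = top_words(vocab, topic_word)
```

Here `keywords` holds the defining words of each topic and `doc_topic` holds each document's mixture over topics. On a corpus this small the topics are not meaningful; the point is only to show the two distributions the text describes, and that the number of topics is a choice the user supplies and then revisits.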
A relatively user-friendly package exists for STM in the R programming language, and while significant preprocessing may be required to use certain sets of text documents, the model itself is straightforward to run for anyone with some proficiency in R (Roberts et al. 2019).

While these tools have immense potential for acquiring information and data from unstructured text, the limitations on their use often derive not from the implementation of the tools themselves, but from data accessibility and preprocessing constraints. Large, consistently formatted sets of text documents are not widely available or accessible. For example, social media companies such as X and Reddit, which previously offered free APIs for data access, have sharply limited free public access in recent years and have moved to paid models that are cost prohibitive for most users apart from large corporations. Large sets of newspaper and news media text documents and scientific literature often require paid subscriptions to databases such as LexisNexis or to major journal publishing houses, which may only be accessible to those with university or research organization affiliations.

Despite data access limitations and preprocessing challenges, the potential applications for NLP methods in hydrology are extensive. Hydrogeologists with local knowledge of stakeholder groups and resources may have unique insight into unpublished sources that could be used to temporally or spatially scale analyses. For example, reports from periodic or recurring watershed-based stakeholder meetings over some period could be analyzed for thematic variations over time to assess how stakeholder priorities in the watershed have shifted and evolved.
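As a rough illustration of the stakeholder-meeting example above, the sketch below averages per-document topic proportions within each year to show how topic prevalence might be tracked over time. The proportions, the years, and the function name `prevalence_by_group` are hypothetical, and this is a simple descriptive aggregation applied after a model has been fit, not STM itself, which incorporates metadata directly into the model.

```python
from collections import defaultdict
from statistics import mean


def prevalence_by_group(doc_topic, groups):
    """Average each topic's proportion over the documents in each group."""
    rows = defaultdict(list)
    for proportions, group in zip(doc_topic, groups):
        rows[group].append(proportions)
    # zip(*docs) transposes the rows so each column holds one topic.
    return {
        group: [mean(column) for column in zip(*docs)]
        for group, docs in sorted(rows.items())
    }


# Hypothetical topic proportions for four meeting reports (rows sum to 1);
# column 0 might be a "water quality" topic and column 1 a "drought" topic.
meeting_topics = [
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.1, 0.9],
]
meeting_years = [2005, 2005, 2020, 2020]

trend = prevalence_by_group(meeting_topics, meeting_years)
for year, proportions in trend.items():
    print(year, [round(p, 2) for p in proportions])
```

The same function works for any document-level metadata, such as a county or watershed identifier in place of the year, which corresponds to scaling an analysis spatially rather than temporally.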
The potential exists for near-real-time ingestion of human decisions recorded in policy, meeting minutes, or news, which could inform decision making in hydrologic model design, thus speeding scientific model responses and informing the location of new data observation points.

In the era of proliferation of large language models (LLMs) such as ChatGPT and Microsoft Copilot, among others, we must consider whether similar knowledge can be gained by posing research questions to LLMs. For example, one could ask an LLM a research question such as “What water-quality issues in groundwater are most salient in newspaper publishing in the Midwest?” and receive an answer, but the data and exact analysis techniques used by the LLM are often opaque to the user, and it can be difficult to know how much confidence to place in the response. The value of answering this research question by conducting a content analysis that applies STM to a set of newspaper data is that (1) the data source can be well curated, queried, and known by the researcher, and (2) the exact method, as well as some degree of sensitivity analysis, can be applied to the model, ensuring a higher level of confidence and a better understanding of the limitations of the results.

As vast arrays of textual data discussing hydrogeologic topics and priorities, such as news and social media content, scientific articles, agricultural publications, stakeholder reports, and gray literature, become more broadly accessible online, NLP methods such as topic modeling can be used to derive conclusions about groundwater that would otherwise be difficult to quantify.

The authors would like to thank Colin Livdahl and Carol Luukkonen of the U.S. Geological Survey for their technical reviews. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government. The publishing of this article was funded by the U.S. Geological Survey.
The authors do not have any conflicts of interest or financial disclosures to report. Data sharing is not applicable to this article, as no datasets were generated or analysed during the current study.
Christenson et al. studied this question.
www.synapsesocial.com/papers/69d896566c1944d70ce07bee — DOI: https://doi.org/10.1111/gwat.70067
Catherine Christenson
Kurt J. McCoy
Ground Water
United States Geological Survey