Advances in large language models (LLMs) make it easy to create AI-generated text. However, these tools can also pose a threat, e.g. through the creation of disinformation. In this work, we analysed texts generated by three LLMs (GPT-3.5, LLaMA3, and Qwen) from the CUDRT dataset. We extracted 220 stylistic and statistical features of human and AI-generated text using the LFTK library. First, we analysed the features using the Pearson correlation. Second, we trained five machine learning models and tested the classifiers on detecting texts that were completely AI-generated, polished by AI, or rewritten by AI, as well as summaries created by AI. We obtained an F1-score of 90% or higher for text generated entirely by AI, depending on the LLM used. We found that AI-generated texts, independent of the LLM, can be identified through a high Kuperman age, i.e. high word complexity, whereas human-written texts show higher lexical variation and richness. We provide an explanation for the classification results and a comparison with a fine-tuned RoBERTa model.
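The two measures named in the abstract, the Pearson correlation between a feature and the human/AI label and the F1-score of a classifier, can be illustrated with a minimal sketch. The feature values, labels, and predictions below are made-up toy numbers, not data from the CUDRT dataset or output of the LFTK library, and the helper names `pearson` and `f1_score` are hypothetical; this is not the authors' pipeline.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def f1_score(y_true, y_pred, positive=1):
    """F1-score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (assumption): label 1 = AI-generated, label 0 = human-written.
labels = [0, 0, 0, 1, 1, 1]
# Hypothetical values of a single word-complexity feature per text.
feature = [4.1, 4.3, 4.0, 5.2, 5.5, 5.1]

# Correlation of the feature with the label.
r = pearson(feature, labels)

# Hypothetical classifier predictions scored against the true labels.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
f1 = f1_score(y_true, y_pred)

print(f"Pearson r = {r:.3f}, F1 = {f1:.3f}")
```

In this toy setup, a feature whose values are consistently higher for the AI-labelled texts yields a strongly positive correlation with the label, which is the kind of signal a feature-based classifier can exploit.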
Karla Schäfer
M. Steinebach
Fraunhofer Institute for Secure Information Technology
Schäfer et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69a7665fbadf0bb9e87dcc52 — DOI: https://doi.org/10.1109/trustcom66490.2025.00134