What question did this study set out to answer?

The study aims to improve entity extraction from Chinese Cyber Threat Intelligence (CTI) reports by addressing polysemy, nested-label conflicts, and cross-sentence semantic discontinuity.

April 15, 2026 · Open Access

An improved transformer for entity recognition in Chinese cyber threat intelligence reports

Key Points

Developed a Transformer-based entity recognition method using a pointer network.
Constructed a RoBERTa model with Rotary Positional Embeddings.
Introduced tokenization compensation and positional-parameter compression for boundary sensitivity.
Refined GlobalPointer for 2D head-tail span matching to detect overlapping entities.
Introduced an entity-frequency-aware dynamic threshold and a reweighted zero-boundary log-loss to mitigate long-tail bias and improve recall for rare entities.
Achieved an overall F1 improvement of 6.32% over baselines on Chinese CTI datasets.
Reached absolute gains of up to 19.7% on nested and long entities.
Validated the model’s effectiveness for high-accuracy automated CTI analysis.
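
The key points mention an entity-frequency-aware dynamic threshold but this page does not give its formula. The sketch below is only one hypothetical way such a rule could look: the most frequent entity type keeps a base threshold, and rarer types get a lower one so that more of their candidate spans are accepted. The function name, the `base` and `alpha` parameters, the log-ratio formula, and the entity type names are all assumptions made for illustration, not the authors' method.

```python
import math

def dynamic_thresholds(type_counts, base=0.0, alpha=1.0):
    """Hypothetical rule: lower the decision threshold for rare entity types.

    type_counts maps an entity type to its training-set frequency.
    The most frequent type gets `base`; rarer types approach `base - alpha`.
    """
    max_count = max(type_counts.values())
    if max_count <= 0:
        raise ValueError("at least one entity type must have a positive count")
    max_log = math.log1p(max_count)
    return {
        t: base - alpha * (1.0 - math.log1p(c) / max_log)
        for t, c in type_counts.items()
    }

# Illustrative counts only; these are not figures from the paper.
counts = {"Malware": 1000, "Vulnerability": 10}
thresholds = dynamic_thresholds(counts)
print(thresholds["Malware"] > thresholds["Vulnerability"])
```

Under this assumed rule a span of a rare type is accepted at a lower score than a span of a common type, which trades some precision for recall on the long tail.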

Abstract

Extracting Chinese Cyber Threat Intelligence (CTI) under increasingly complex advanced persistent threat scenarios is crucial, yet challenging due to domain-specific term ambiguity and frequent long, nested entities. To address polysemy, nested-label conflicts, and cross-sentence semantic discontinuity, we propose an enhanced Transformer-based entity recognition method formulated as a pointer network. On the encoder side, we build a RoBERTa model with Rotary Positional Embeddings. To handle complex positions and boundaries of heterogeneous entity types, we introduce tokenization compensation and positional-parameter compression to sharpen boundary sensitivity. In the decoder, we refine GlobalPointer and model recognition as 2D head–tail span matching, enabling direct detection of overlapping and nested entities. To mitigate long-tail bias, we introduce an entity-frequency-aware dynamic threshold and a reweighted zero-boundary log-loss to improve recall for rare entities. Experiments demonstrate an overall F1 improvement of 6.32% over baselines on Chinese CTI datasets, with absolute gains reaching 19.7% specifically on nested and long entities. These results validate the model’s effectiveness in Chinese-specific named entity recognition and its utility for high-accuracy automated CTI analysis.
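
To make the idea of 2D head–tail span matching concrete, here is a minimal pure-Python sketch in the style of the standard GlobalPointer formulation, not the refined variant the abstract describes. It applies rotary positional embeddings to per-token query and key vectors, then fills an n × n matrix in which cell (i, j) scores the span that starts at token i and ends at token j. The function names, the fixed threshold of 0, and the toy numbers are assumptions for illustration; in a real model the query and key vectors would be learned projections of the encoder output, one pair per entity type.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE).

    Assumes len(vec) is even.
    """
    dim = len(vec)
    out = []
    for p in range(0, dim, 2):
        theta = pos * base ** (-p / dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[p], vec[p + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def span_scores(queries, keys):
    """Return an n x n matrix; cell (i, j) scores the span from token i to j.

    `queries` and `keys` hold one vector per token for ONE entity type.
    """
    n = len(queries)
    q = [rope(v, i) for i, v in enumerate(queries)]
    k = [rope(v, j) for j, v in enumerate(keys)]
    scores = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # a span's tail cannot precede its head
            scores[i][j] = sum(a * b for a, b in zip(q[i], k[j]))
    return scores

def decode(scores, threshold=0.0):
    """Each cell above the threshold is read as one entity span (head, tail)."""
    return [(i, j) for i, row in enumerate(scores)
            for j, s in enumerate(row) if s > threshold]

# Hand-written toy scores: span (1, 1) lies inside span (0, 2).
toy = [[-1.0, -1.0, 2.5],
       [-math.inf, 1.2, -0.3],
       [-math.inf, -math.inf, -2.0]]
print(decode(toy))  # [(0, 2), (1, 1)]
```

Because every cell is thresholded independently, a nested span and its enclosing span can both be returned, which a single tag-per-token sequence labeler cannot express directly.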

Authors

Yongwei Wang

Jipeng Tang

Hao Hu

Journals

Cybersecurity
