What question did this study set out to answer?

The aim is to create a library that sanitizes untrusted Unicode text to prevent invisible attacks.

March 8, 2026 | Open Access

navi-sanitize: Deterministic Input Sanitization for Untrusted Unicode Text

Key Points

The aim is to create a library that sanitizes untrusted Unicode text to prevent invisible attacks.
Developed a Python library with no dependencies for input sanitization.
Implemented a six-stage pipeline that processes input text deterministically.
Removes null bytes and strips 411 invisible characters.
Applies NFKC normalization and replaces 54 targeted homoglyphs with Latin equivalents.
Re-normalizes to guarantee idempotency, then runs a pluggable context-specific escaper.
Reports zero false positives on legitimate Unicode text.
Reports a clean-path latency of 2.8 µs per string.
Removes invisible attack vectors at the input boundary, before downstream defenses run.

Abstract

Untrusted text entering AI agent pipelines, template engines, and identity systems carries invisible attacks: homoglyph substitution that bypasses keyword filters, zero-width characters that split delimiters, Unicode Tag block characters that encode instructions tokenizers read but humans cannot see, and bidirectional overrides that reorder displayed text. These attacks operate below the layer where existing defenses (HTML escaping, schema validation, probabilistic detection) are designed to function. navi-sanitize is a zero-dependency Python library that removes these vectors deterministically at the input boundary. A six-stage pipeline removes null bytes, strips 411 invisible characters, applies NFKC normalization, replaces 54 targeted homoglyphs with Latin equivalents, re-normalizes to guarantee idempotency, and runs a pluggable context-specific escaper, producing identical output for identical input, with zero false positives on legitimate Unicode text. Clean-path latency is 2.8 µs per string.
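The six stages listed in the abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the navi-sanitize API: the function name `sanitize`, the `escaper` parameter, and both character tables are hypothetical, and the tables are small stand-ins for the 411 invisible characters and 54 homoglyphs the abstract reports. The sketch makes no claim about how the library avoids false positives on legitimate text.

```python
import html
import re
import unicodedata
from typing import Callable, Optional

# Assumption: a small illustrative subset of invisible characters,
# not the library's actual 411-entry table.
INVISIBLE_RE = re.compile(
    "["
    "\u00ad"                 # soft hyphen
    "\u200b-\u200f"          # zero-width space/joiners, LRM, RLM
    "\u202a-\u202e"          # bidirectional embeddings and overrides
    "\u2060-\u2064"          # word joiner, invisible operators
    "\u2066-\u2069"          # bidirectional isolates
    "\ufeff"                 # zero-width no-break space (BOM)
    "\U000e0000-\U000e007f"  # Unicode Tag block
    "]"
)

# Assumption: a handful of Cyrillic look-alikes, standing in for the
# 54 targeted homoglyphs mentioned in the abstract.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
})


def sanitize(text: str, escaper: Optional[Callable[[str], str]] = None) -> str:
    """Hypothetical six-stage pipeline following the order in the abstract."""
    text = text.replace("\x00", "")             # 1. remove null bytes
    text = INVISIBLE_RE.sub("", text)           # 2. strip invisible characters
    text = unicodedata.normalize("NFKC", text)  # 3. NFKC normalization
    text = text.translate(HOMOGLYPHS)           # 4. replace homoglyphs
    text = unicodedata.normalize("NFKC", text)  # 5. re-normalize
    if escaper is not None:
        text = escaper(text)                    # 6. context-specific escaping
    return text


# A Cyrillic "a" plus a zero-width space hide inside an ordinary-looking word.
print(sanitize("p\u0430ss\u200bword"))               # password
# The optional escaper runs last, on already-normalized text.
print(sanitize("<b>\u200b", escaper=html.escape))    # &lt;b&gt;
```

Every step is a pure function of its input, so the same string always yields the same output. In this sketch, idempotency applies to stages 1 through 5; an escaper such as `html.escape` is not itself idempotent, so it should be applied once at the output boundary.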
