What
I've been working through the audit findings and one of the items (non-ASCII rejection in the lexer) led to a bigger question: should Aster support Unicode identifiers?
We just landed a fix to allow non-ASCII characters inside string literals (emoji, CJK, accented chars, etc.), but identifiers are still ASCII-only. That means let café = 42 is a lexer error, even though most modern languages (Python, Rust, Swift, Go, Java, Kotlin) have supported Unicode identifiers for years.
Why this matters
Non-English-speaking developers are forced to transliterate every identifier into Latin characters. That hurts readability for the people writing and maintaining that code. It also blocks idiomatic mathematical notation (Greek letters for variables, etc.).
RFC
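There's an RFC at unicode-identifiers.md covering the full design:
unicode-ident crate (UAX #31 tables) and unicode-normalization crate (NFC)
Implementation scope
The main work is in the lexer: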
Replace col byte-offset tracking with a proper byte-offset cursor
Use unicode_ident::is_xid_start(ch) for identifier start detection (currently ch.is_ascii_alphabetic() || ch == '_')
Use unicode_ident::is_xid_continue(ch) for identifier continue (currently ch.is_ascii_alphanumeric() || ch == '_')
NFC-normalize collected identifiers before keyword lookup and interning
Update span tracking throughout for multi-byte characters
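As a rough sketch of the lexer change, the scan below walks a byte-offset cursor over `char_indices()`. It is not Aster's actual lexer: `scan_identifier`, `is_ident_start`, and `is_ident_continue` are hypothetical names, and the two predicates use std's `char::is_alphabetic` / `char::is_alphanumeric` as stand-ins for `unicode_ident::is_xid_start` / `is_xid_continue` so the example compiles without dependencies. The stand-ins only approximate UAX #31 (they reject combining marks, for example), and NFC normalization is left out because it needs the `unicode-normalization` crate.

```rust
/// Stand-in for `unicode_ident::is_xid_start` (assumption: std-only approximation).
fn is_ident_start(ch: char) -> bool {
    ch.is_alphabetic() || ch == '_'
}

/// Stand-in for `unicode_ident::is_xid_continue` (assumption: std-only approximation).
fn is_ident_continue(ch: char) -> bool {
    ch.is_alphanumeric() || ch == '_'
}

/// Scan one identifier beginning at byte offset `start`.
/// Returns the half-open byte span `(start, end)`, or `None` if no
/// identifier starts there (or `start` is not on a character boundary).
fn scan_identifier(src: &str, start: usize) -> Option<(usize, usize)> {
    let rest = src.get(start..)?;
    let mut chars = rest.char_indices();

    let (_, first) = chars.next()?;
    if !is_ident_start(first) {
        return None;
    }

    // Advance the cursor by each character's UTF-8 length, not by 1.
    let mut end = start + first.len_utf8();
    for (offset, ch) in chars {
        if !is_ident_continue(ch) {
            break;
        }
        end = start + offset + ch.len_utf8();
    }
    Some((start, end))
}
```

With `src = "let café = 42"` (precomposed é), `scan_identifier(src, 4)` returns the byte span `(4, 9)`: four characters, five bytes. A real implementation would NFC-normalize `&src[4..9]` before keyword lookup and interning.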
The formatter, parser, and type checker should work without changes since they operate on token spans (byte offsets into the source). Codegen emits names as UTF-8 strings into DWARF debug info, which should also work, but needs testing.
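One place where byte offsets and characters diverge is anything that reports a column to the user. The helper below is a minimal sketch of that conversion, assuming spans stay as byte offsets and columns are counted in Unicode scalar values; `line_col` is a hypothetical name, not something from the Aster codebase, and a real implementation might prefer grapheme clusters or display width.

```rust
/// Hypothetical helper: map a byte offset into `src` to a 1-based
/// (line, column) pair, counting columns in `char`s rather than bytes.
/// Returns `None` if the offset is out of range or falls inside a
/// multi-byte character.
fn line_col(src: &str, byte_offset: usize) -> Option<(usize, usize)> {
    let prefix = src.get(..byte_offset)?;
    // Line number: one more than the number of newlines before the offset.
    let line = prefix.matches('\n').count() + 1;
    // Byte index where the current line begins.
    let line_start = prefix.rfind('\n').map_or(0, |i| i + 1);
    // Column: characters (not bytes) between line start and the offset.
    let col = prefix[line_start..].chars().count() + 1;
    Some((line, col))
}
```

For `let café = 42`, the `=` sits at byte offset 10 but column 10 (1-based), where a bytes-as-columns count would report 11.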
Migration
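No breaking changes. ASCII is a subset of UTF-8, and ASCII letters and digits are covered by XID_Start / XID_Continue in UAX #31. One caveat: _ is XID_Continue but not XID_Start, so the start check has to keep the explicit ch == '_' case. With that in place, every existing program stays valid.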
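The compatibility claim is cheap to check mechanically. The sketch below lists every ASCII character the current predicate accepts and a candidate replacement rejects. `ascii_regressions` is a hypothetical helper; `old_start` and `old_continue` are the current checks quoted above. In a real test the second argument would be `unicode_ident::is_xid_start` or `is_xid_continue`; std's `char::is_alphabetic` is used here only as a dependency-free stand-in, and it happens to share the relevant property that `_` is not a start character (in UAX #31, `_` is XID_Continue but not XID_Start).

```rust
/// Return every ASCII character that `old` accepts but `new` rejects.
/// An empty result means `new` is backward compatible for ASCII input.
fn ascii_regressions(
    old: impl Fn(char) -> bool,
    new: impl Fn(char) -> bool,
) -> Vec<char> {
    (0u8..=127)
        .map(char::from)
        .filter(|&ch| old(ch) && !new(ch))
        .collect()
}

/// The current ASCII-only identifier-start check.
fn old_start(ch: char) -> bool {
    ch.is_ascii_alphabetic() || ch == '_'
}

/// The current ASCII-only identifier-continue check.
fn old_continue(ch: char) -> bool {
    ch.is_ascii_alphanumeric() || ch == '_'
}
```

Calling `ascii_regressions(old_start, |ch| ch.is_alphabetic())` returns `['_']`, which is the case to watch for: dropping the explicit underscore check from the start predicate would break identifiers like `_tmp`. Adding `|| ch == '_'` back to the candidate makes the result empty.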