Algorithm Documentation

Enhanced Confix-Stripping (ECS)

This stemmer implements the Enhanced Confix-Stripping (ECS) variant of the Nazief-Adriani algorithm as described in Asian et al., 2007.

Overview

The ECS algorithm improves upon the original Nazief-Adriani (1996) by:

Iterative confix-stripping (up to 4 passes)
Nasal-assimilation restoration
Phonotactic validity guards
Two-path candidate generation

Algorithm Steps

1. Clitic Removal

First, remove the possessive/determiner clitic -nya:

bukunya → buku
merekan → mereka

2. Iterative Confix-Stripping

For up to 4 passes, the algorithm strips:

One prefix family
One derivational suffix

Prefix Families (in priority order)

Family	Examples
`me(N)-`	membaca, menulis, mengambil
`pe(N)-`	pembaca, penulis, pengambil
`ber-`	berjalan, bertanya
`ter-`	terlihat, terkenal
`se-`	selesai, serupa
`ke-`	kesalahan, kebersihan
`di-`	dibaca, ditulis

The N in me(N)- and pe(N)- represents nasal assimilation:

meng- → g before vowels: mengambil → ambil
men- → t before vowels: menulis → tulis
men- → s before vowels: menyapu → sapu
meny- → s before vowels: menyanyi → nyanyi

Derivational Suffixes (in priority order)

-kan
-an
-i

3. Inflectional Suffix Fallback

If no prefix matched, remove inflectional suffixes:

-lah
-kah
-tah
-pun

4. Nasal-Assimilation Restoration

When a prefix with nasal assimilation is stripped, the algorithm reconstructs the dropped consonant:

menulis → (strip men-) → nulis → (restore t) → tulis
mengambil → (strip meng-) → ngambil → (restore g) → ambil

Without a dictionary, the algorithm prefers the longer candidate when ambiguous.

With a dictionary, the first candidate found in the FST wins.

5. Phonotactic Validity Guards

Indonesian phonotactics forbid CC-onset (consonant clusters at word start). The algorithm discards candidates that would create invalid CC-onsets:

Valid: baca, tulis, jalan
Invalid: blajar (from belajar), ktulis (from ketulis)

This prevents over-stemming.

Two-Path Candidate Generation

The algorithm explores both orderings:

Prefix-first then suffix
Suffix-first then prefix

Candidates from both paths are combined and ranked. The longer candidate is preferred when no dictionary is available.

Example Walkthrough

Example: `mempertimbangkan`

Pass 1:

Strip memper- → timbangan
Strip suffix -an → timbang

Pass 2:

No more prefixes/suffixes to strip

Result: timbang

Example: `pembelajaran`

Pass 1:

Strip pem- → belajaran
Strip suffix -an → belajar

Pass 2:

Strip be- → lajar
No suffix

Result: lajar (but with dictionary, ajar might be preferred)

Accuracy

Without dictionary: ~85-90% accuracy on common Indonesian words With dictionary: ~95-98% accuracy

The dictionary helps resolve ambiguous cases, especially with nasal-assimilation prefixes before vowel-initial roots.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Algorithm Documentation

Enhanced Confix-Stripping (ECS)

Overview

Algorithm Steps

1. Clitic Removal

2. Iterative Confix-Stripping

Prefix Families (in priority order)

Derivational Suffixes (in priority order)

3. Inflectional Suffix Fallback

4. Nasal-Assimilation Restoration

5. Phonotactic Validity Guards

Two-Path Candidate Generation

Example Walkthrough

Example: `mempertimbangkan`

Example: `pembelajaran`

Accuracy

FilesExpand file tree

algorithm.md

Latest commit

History

algorithm.md

File metadata and controls

Algorithm Documentation

Enhanced Confix-Stripping (ECS)

Overview

Algorithm Steps

1. Clitic Removal

2. Iterative Confix-Stripping

Prefix Families (in priority order)

Derivational Suffixes (in priority order)

3. Inflectional Suffix Fallback

4. Nasal-Assimilation Restoration

5. Phonotactic Validity Guards

Two-Path Candidate Generation

Example Walkthrough

Example: mempertimbangkan

Example: pembelajaran

Accuracy

Example: `mempertimbangkan`

Example: `pembelajaran`