SomeLang is a lightweight, reasonably accurate natural language detection library. It is designed to be fast and Python-native, with no external dependencies for the main script, and highly customizable through whitelists and blacklists.

    pip install somelang

- Fast Natural Language Detection - Trigram-based approach for accurate results
- Default 158+ language whitelist - The default whitelist provides better accuracy on short texts (3-100 characters)
- Supports 194+ languages - Can detect a wide range of languages in full mode
- Modern Training Data - Trained on OpenLID-v2 & many other modern datasets
- Python-native - No external dependencies for main script
- Customizable - Configurable whitelist/blacklist support
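
The trigram-based approach and the whitelist/blacklist filtering listed above can be illustrated with a minimal sketch. This is not SomeLang's actual implementation: the names `trigrams`, `detect`, and `TOY_PROFILES` are hypothetical, the profiles are built from two toy sentences rather than real training data, and the scoring (a simple count of shared trigrams) is an assumption chosen for brevity.

```python
def trigrams(text):
    """Return the set of character trigrams in `text`, padded with spaces."""
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


# Hypothetical toy profiles; a real library would ship profiles
# trained on large corpora for each supported language.
TOY_PROFILES = {
    "eng": trigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "fra": trigrams("le renard brun saute par dessus le chien paresseux et le chat"),
}


def detect(text, profiles, whitelist=None, blacklist=None):
    """Pick the language whose profile shares the most trigrams with `text`.

    `whitelist` and `blacklist` are optional collections of language codes
    that restrict which profiles are considered (an assumed design, shown
    only to illustrate the idea of candidate filtering).
    """
    sample = trigrams(text)
    candidates = [
        code for code in profiles
        if (whitelist is None or code in whitelist)
        and (blacklist is None or code not in blacklist)
    ]
    if not sample or not candidates:
        return None  # nothing to score, or every language was filtered out
    return max(candidates, key=lambda code: len(sample & profiles[code]))


print(detect("the dog and the fox", TOY_PROFILES))                       # eng
print(detect("le chat et le chien", TOY_PROFILES))                       # fra
print(detect("le chat et le chien", TOY_PROFILES, blacklist={"fra"}))    # eng
```

Restricting the candidate set before scoring is one reason a whitelist can help on short inputs: with fewer languages competing, a handful of trigrams is less likely to match an unrelated language by chance.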

Basic usage:

    from somelang import somelang

    # Basic language detection
    lang = somelang("Bonjour tout le monde")  # Returns: 'fra'

    # Get language name instead of code
    lang = somelang("Hello world", verbose=True)  # Returns: 'English'

From the command line:

    python -m somelang 'text to analyze'

Advanced usage:

    from somelang import somelang_all, somelang_no_whitelist

    # Get all probable languages with confidence scores
    results = somelang_all("Hello world")  # Returns: [['eng', 1.0], ...]

    # Use all 194 languages (no whitelist)
    lang = somelang_no_whitelist("Text in rare language")

Currently, the library expects a minimum text length of 10 characters. Because of the trigram-based approach, it may still produce false positives on texts shorter than 100 characters. This will be remedied in future updates.
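
Until that is addressed, callers may want to guard against short inputs themselves. The following is a minimal sketch under stated assumptions: `classify_length` and `guarded_detect` are hypothetical helpers (not part of SomeLang), and the thresholds simply mirror the 10- and 100-character figures mentioned above. The detector is passed in as a callable, so any function that maps text to a language code can be used.

```python
def classify_length(text, min_chars=10, reliable_chars=100):
    """Label `text` by length: 'too_short', 'uncertain', or 'ok'."""
    n = len(text.strip())
    if n < min_chars:
        return "too_short"   # below the documented minimum length
    if n < reliable_chars:
        return "uncertain"   # detection runs, but false positives are more likely
    return "ok"


def guarded_detect(text, detector, min_chars=10, reliable_chars=100):
    """Run `detector` only when `text` meets the minimum length.

    Returns a (language, status) tuple; language is None when the
    text is too short to attempt detection.
    """
    status = classify_length(text, min_chars, reliable_chars)
    if status == "too_short":
        return None, status
    return detector(text), status
```

With the library installed, `guarded_detect("Bonjour tout le monde", somelang)` would call `somelang` and tag the result as `'uncertain'`, since the input is 21 characters long; the caller can then decide whether to trust it, ask for more text, or fall back to a default.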
Trained mainly on the OpenLID-v2 dataset and a few other datasets (for refinement).
Inspired by franc by Titus Wormer.
See CITATIONS file for more details.
This project is licensed under the MIT license. Authored by SomeAB.