Advanced charset encoding converter with Chain of Responsibility pattern, auto-detection, double-encoding repair, and JSON safety.
New optimized type-specific processing with Strategy + Visitor pattern:
// Custom property mapper for selective processing (60% faster!)
use Ducks\Component\EncodingRepair\Interpreter\PropertyMapperInterface;
class UserMapper implements PropertyMapperInterface
{
public function map(object $object, callable $transcoder, array $options): object
{
$copy = clone $object;
$copy->name = $transcoder($object->name);
$copy->email = $transcoder($object->email);
// password NOT transcoded (security)
return $copy;
}
}
$processor = new CharsetProcessor();
$processor->registerPropertyMapper(User::class, new UserMapper());

New optimized batch processing methods for high-performance array conversion:
// Batch conversion with single encoding detection (40-60% faster!)
$rows = $db->query("SELECT * FROM users")->fetchAll();
$utf8Rows = CharsetHelper::toCharsetBatch($rows, 'UTF-8', CharsetHelper::AUTO);
// Detect encoding from array
$encoding = CharsetHelper::detectBatch($items);

CharsetHelper now uses a service-based architecture following SOLID principles:
- CharsetProcessor: Instantiable service with fluent API
- CharsetProcessorInterface: Service contract for dependency injection
- Multiple instances: Different configurations for different contexts
- 100% backward compatible: Existing code works unchanged
// New way: Service instance
$processor = new CharsetProcessor();
$processor->addEncodings('SHIFT_JIS')->resetDetectors();
$utf8 = $processor->toUtf8($data);
// Old way: Static facade (still works)
$utf8 = CharsetHelper::toUtf8($data);

Optional external cache integration for improved performance:
// Use built-in InternalArrayCache (default, optimized)
use Ducks\Component\EncodingRepair\Detector\CachedDetector;
use Ducks\Component\EncodingRepair\Detector\MbStringDetector;
$detector = new CachedDetector(new MbStringDetector());
// InternalArrayCache used automatically (no TTL overhead)
// Or use ArrayCache for TTL support
use Ducks\Component\EncodingRepair\Cache\ArrayCache;
$cache = new ArrayCache();
$detector = new CachedDetector(new MbStringDetector(), $cache, 3600);
// Or use any PSR-16 implementation (Redis, Memcached, APCu)
// $redis = new \Symfony\Component\Cache\Psr16Cache($redisAdapter);
// $detector = new CachedDetector(new MbStringDetector(), $redis, 7200);

Unlike existing libraries, CharsetHelper provides:
- ✅ Extensible architecture with Chain of Responsibility pattern
- ✅ PSR-16 cache support for Redis, Memcached, APCu (NEW in v1.2)
- ✅ Type-specific interpreters for optimized processing (NEW in v1.2)
- ✅ Custom property mappers for selective object conversion (NEW in v1.2)
- ✅ Multiple fallback strategies (UConverter → iconv → mbstring)
- ✅ Smart auto-detection with multiple detection methods
- ✅ Double-encoding repair for corrupted legacy data
- ✅ Recursive conversion for arrays AND objects (not just arrays!)
- ✅ Safe JSON encoding/decoding with automatic charset handling
- ✅ Modern PHP with strict typing (PHP 7.4+)
- ✅ Minimal dependencies (only PSR-16 interface for optional caching)
- Robust Transcoding: Implements a Chain of Responsibility pattern trying best providers first (Intl/UConverter -> Iconv -> MbString).
- PSR-16 Cache Support: Optional external cache (Redis, Memcached, APCu) for detection results (NEW in v1.2).
- Type-Specific Interpreters: Optimized processing strategies per data type (NEW in v1.2).
- Custom Property Mappers: Selective object property conversion for security and performance (NEW in v1.2).
- Double-Encoding Repair: Automatically detects and fixes strings like Ã©tÃ© back to été.
- Recursive Processing: Handles string, array, and object recursively.
- Immutable: Objects are cloned before modification to prevent side effects.
- Safe JSON Wrappers: Prevents json_encode from returning false on bad charsets.
- Secure: Whitelisted encodings to prevent injection.
- Extensible: Register your own transcoders, detectors, interpreters, or cache providers without modifying the core.
- Modern Standards: PSR-12 compliant, strictly typed, SOLID architecture.
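To illustrate the Safe JSON Wrappers item in the list above, here is a minimal sketch of the failure mode using only PHP built-ins. It does not show the library's implementation, and the ISO-8859-1 source encoding is an assumption made for the example.

```php
<?php
declare(strict_types=1);

$latin1 = "Caf\xE9"; // ISO-8859-1 bytes, which are not valid UTF-8

// Plain json_encode() gives up on invalid UTF-8 and returns false.
var_dump(json_encode($latin1));                  // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)

// Convert to UTF-8 first (source encoding assumed here), then encode strictly
// so that any remaining problem surfaces as a JsonException.
$utf8 = mb_check_encoding($latin1, 'UTF-8')
    ? $latin1
    : mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

$json = json_encode($utf8, JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);
echo bin2hex($json), PHP_EOL; // 22436166c3a922, i.e. "Café" in UTF-8 with JSON quotes
```

Repairing before encoding keeps the original characters, whereas flags such as JSON_INVALID_UTF8_SUBSTITUTE would replace them.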
- PHP: 7.4, 8.0, 8.1, 8.2, or 8.3
- Extensions (required): ext-mbstring, ext-json
- Extensions (recommended): ext-intl
composer require ducks-project/encoding-repair

# Ubuntu/Debian
sudo apt-get install php-intl php-iconv
# macOS (via Homebrew)
brew install php@8.2
# Extensions are included by default
# Windows
# Enable in php.ini:
extension=intl
extension=iconv

<?php
use Ducks\Component\EncodingRepair\CharsetHelper;
// Simple UTF-8 conversion
$utf8String = CharsetHelper::toUtf8($latinString);
// Automatic encoding detection
$data = CharsetHelper::toCharset($mixedData, 'UTF-8', CharsetHelper::AUTO);
// Repair double-encoded strings
$fixed = CharsetHelper::repair($corruptedString);
// Safe JSON with encoding handling
$json = CharsetHelper::safeJsonEncode($data);

Convert between different character encodings:
use Ducks\Component\EncodingRepair\CharsetHelper;
$data = [
'name' => 'Gérard', // ISO-8859-1 string
'meta' => ['desc' => 'Ca coûte 10€'] // Nested array with Euro sign
];
// Convert to UTF-8
$utf8 = CharsetHelper::toUtf8($data, CharsetHelper::WINDOWS_1252);
// Convert to ISO-8859-1 (Windows-1252)
$iso = CharsetHelper::toIso($data, CharsetHelper::ENCODING_UTF8);
// Convert to any encoding
$result = CharsetHelper::toCharset(
$data,
CharsetHelper::ENCODING_UTF16,
CharsetHelper::ENCODING_UTF8
);

Note: We use Windows-1252 instead of strict ISO-8859-1 by default because it includes common characters like €, œ, ™ which are missing in standard ISO.
Supported Encodings:
- UTF-8
- UTF-16
- UTF-32
- ISO-8859-1
- Windows-1252 (CP1252)
- ASCII
- AUTO (automatic detection)
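The difference between Windows-1252 and strict ISO-8859-1 mentioned in the note above can be seen with plain mbstring calls. This is a small illustration that does not depend on the library:

```php
<?php
declare(strict_types=1);

// Byte 0x80 is the Euro sign in Windows-1252 but a C1 control code in ISO-8859-1.
$byte = "\x80";

$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), PHP_EOL; // e282ac (U+20AC, the Euro sign)
echo bin2hex($asLatin1), PHP_EOL; // c280   (U+0080, an invisible control character)
```

Treating legacy "Latin-1" data as Windows-1252 therefore preserves characters that a strict ISO-8859-1 reading would turn into control codes.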
Let CharsetHelper detect the source encoding:
// Automatic detection
$result = CharsetHelper::toCharset(
$unknownData,
CharsetHelper::ENCODING_UTF8,
CharsetHelper::AUTO // Will auto-detect source encoding
);
// Manual detection
$encoding = CharsetHelper::detect($string);
echo $encoding; // "UTF-8", "ISO-8859-1", etc.
// Batch detection from array (faster for large datasets)
$encoding = CharsetHelper::detectBatch($items);
// With custom encoding list
$encoding = CharsetHelper::detect($string, [
'encodings' => ['UTF-8', 'Shift_JIS', 'EUC-JP']
]);

Optimized for processing large arrays with single encoding detection:
// Database migration with batch processing
$rows = $db->query("SELECT * FROM users")->fetchAll(); // 10,000 rows
// Slow: Detects encoding for each row (10,000 detections)
$utf8Rows = array_map(
fn($row) => CharsetHelper::toUtf8($row, CharsetHelper::AUTO),
$rows
);
// Fast: Detects encoding once (1 detection, 40-60% faster!)
$utf8Rows = CharsetHelper::toCharsetBatch(
$rows,
CharsetHelper::ENCODING_UTF8,
CharsetHelper::AUTO
);
// CSV import example
$csvData = array_map('str_getcsv', file('data.csv'));
$utf8Csv = CharsetHelper::toCharsetBatch($csvData, 'UTF-8', CharsetHelper::AUTO);

Convert nested data structures:
// Array conversion
$data = [
'name' => 'Café',
'city' => 'São Paulo',
'items' => [
'entrée' => 'Crème brûlée',
'plat' => 'Bœuf bourguignon'
]
];
$utf8Data = CharsetHelper::toUtf8($data, CharsetHelper::WINDOWS_1252);
// Object conversion
class User {
public $name;
public $email;
}
$user = new User();
$user->name = 'José';
$user->email = 'josé@example.com';
$utf8User = CharsetHelper::toUtf8($user, CharsetHelper::ENCODING_ISO);
// Returns a cloned object with converted properties

Fix strings that have been encoded multiple times (common with legacy databases):
// Example: "CafÃ©" (UTF-8 interpreted as ISO, then re-encoded as UTF-8)
$corrupted = "CafÃ©";
$fixed = CharsetHelper::repair($corrupted);
echo $fixed; // "Café"
// With custom max depth
$fixed = CharsetHelper::repair(
$corrupted,
CharsetHelper::ENCODING_UTF8,
CharsetHelper::ENCODING_ISO,
['maxDepth' => 10] // Try to peel up to 10 encoding layers
);

How it works:
- Detects valid UTF-8 strings
- Attempts to reverse-convert (UTF-8 → source encoding)
- Repeats until no more layers found or max depth reached
- Returns the cleaned string
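The steps above can be sketched with mbstring alone. This is a minimal illustration under stated assumptions, not the library's actual code: the function name `peelEncodingLayers` is hypothetical, and the lossless round-trip check is an extra safeguard added for the example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: removes "UTF-8 read as $source, then re-encoded as UTF-8" layers.
 */
function peelEncodingLayers(string $value, string $source = 'ISO-8859-1', int $maxDepth = 5): string
{
    for ($depth = 0; $depth < $maxDepth; $depth++) {
        // Step 1: only valid UTF-8 containing non-ASCII bytes can hide another layer.
        if (!mb_check_encoding($value, 'UTF-8') || !preg_match('/[\x80-\xFF]/', $value)) {
            break;
        }

        // Step 2: reverse one layer (UTF-8 -> source encoding).
        $candidate = mb_convert_encoding($value, $source, 'UTF-8');

        // Accept the candidate only if it is itself valid UTF-8 and the
        // conversion was lossless (re-encoding it reproduces the input).
        if (
            !is_string($candidate)
            || !mb_check_encoding($candidate, 'UTF-8')
            || mb_convert_encoding($candidate, 'UTF-8', $source) !== $value
        ) {
            break;
        }

        // Step 3: one layer removed; loop to look for another.
        $value = $candidate;
    }

    // Step 4: return the cleaned string.
    return $value;
}

$clean  = "Caf\xC3\xA9";                                      // "Café" in UTF-8
$double = mb_convert_encoding($clean, 'UTF-8', 'ISO-8859-1'); // "CafÃ©"

var_dump(peelEncodingLayers($double) === $clean); // bool(true)
var_dump(peelEncodingLayers($clean) === $clean);  // bool(true): clean input is left alone
```

The loop stops as soon as reversing a layer would produce invalid UTF-8, which is why a correctly encoded string passes through unchanged.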
Prevent JSON encoding/decoding errors caused by invalid UTF-8:
// Safe encoding (auto-repairs before encoding)
$json = CharsetHelper::safeJsonEncode($data);
// Safe decoding with charset conversion
$data = CharsetHelper::safeJsonDecode(
$json,
true, // associative array
512, // depth
0, // flags
CharsetHelper::ENCODING_UTF8, // target encoding
CharsetHelper::WINDOWS_1252 // source encoding for repair
);
// Throws RuntimeException on error with clear message
try {
$json = CharsetHelper::safeJsonEncode($invalidData);
} catch (RuntimeException $e) {
echo $e->getMessage();
// "JSON Encode Error: Malformed UTF-8 characters"
}

Fine-tune the conversion behavior:
$result = CharsetHelper::toCharset($data, 'UTF-8', 'ISO-8859-1', [
'normalize' => true, // Apply Unicode NFC normalization (default: true)
'translit' => true, // Transliterate unavailable chars (default: true)
'ignore' => true, // Ignore invalid sequences (default: true)
'encodings' => ['UTF-8', 'ISO-8859-1', 'Shift_JIS'] // For detection
]);

Options explained:
- normalize: Applies Unicode NFC normalization to UTF-8 output (combines accents)
- translit: Converts unmappable characters to similar ones (é → e)
- ignore: Skips invalid byte sequences instead of failing
- encodings: List of encodings to try during auto-detection
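For intuition, the first three options resemble facilities that PHP already exposes: the `//TRANSLIT` and `//IGNORE` suffixes of `iconv()` and the intl `Normalizer` class. The sketch below assumes that mapping purely for illustration. `convertWithFlags` is a hypothetical helper, and how the library applies these options internally is not shown here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mapping the three flags onto iconv suffixes and intl's Normalizer.
 */
function convertWithFlags(string $data, string $to, string $from, array $options = []): string
{
    // Defaults mirror the documented ones; explicit options take precedence.
    $options += ['normalize' => true, 'translit' => true, 'ignore' => true];

    // iconv reads transliteration and error-skipping flags from the target name.
    // Support for these suffixes depends on the platform's iconv implementation.
    $target = $to
        . ($options['translit'] ? '//TRANSLIT' : '')
        . ($options['ignore'] ? '//IGNORE' : '');

    $converted = @iconv($from, $target, $data);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from {$from} to {$to}");
    }

    // NFC normalization only makes sense for UTF-8 output and needs ext-intl.
    if ($options['normalize'] && strcasecmp($to, 'UTF-8') === 0 && class_exists(Normalizer::class)) {
        $normalized = Normalizer::normalize($converted, Normalizer::FORM_C);
        if (is_string($normalized)) {
            $converted = $normalized;
        }
    }

    return $converted;
}

// Latin-1 bytes for "Café" converted to UTF-8 with both iconv flags disabled.
$utf8 = convertWithFlags("Caf\xE9", 'UTF-8', 'ISO-8859-1', ['translit' => false, 'ignore' => false]);
echo bin2hex($utf8), PHP_EOL; // 436166c3a9
```

Transliteration results vary with the iconv implementation and locale, so the exact replacement for a given character should be checked on the target system.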
For better testability and flexibility, use the CharsetProcessor service directly:
use Ducks\Component\EncodingRepair\CharsetProcessor;
// Create a processor instance
$processor = new CharsetProcessor();
// Fluent API for configuration
$processor
->addEncodings('SHIFT_JIS', 'EUC-JP')
->queueTranscoders(new MyCustomTranscoder())
->resetDetectors();
// Use the processor
$utf8 = $processor->toUtf8($data);

// Production processor with strict encodings
$prodProcessor = new CharsetProcessor();
$prodProcessor->resetEncodings()->addEncodings('UTF-8', 'ISO-8859-1');
// Import processor with permissive encodings
$importProcessor = new CharsetProcessor();
$importProcessor->addEncodings('SHIFT_JIS', 'EUC-JP', 'GB2312');
// Both are independent
$prodResult = $prodProcessor->toUtf8($data);
$importResult = $importProcessor->toUtf8($legacyData);

use Ducks\Component\EncodingRepair\CharsetProcessorInterface;
class MyService
{
private CharsetProcessorInterface $processor;
public function __construct(CharsetProcessorInterface $processor)
{
$this->processor = $processor;
}
public function process($data)
{
return $this->processor->toUtf8($data);
}
}
// Easy to mock in tests
$mock = $this->createMock(CharsetProcessorInterface::class);
$service = new MyService($mock);

Optimize object processing by converting only specific properties:
use Ducks\Component\EncodingRepair\Interpreter\PropertyMapperInterface;
class UserMapper implements PropertyMapperInterface
{
public function map(object $object, callable $transcoder, array $options): object
{
$copy = clone $object;
$copy->name = $transcoder($object->name);
$copy->email = $transcoder($object->email);
// password is NOT transcoded (security)
// avatar_binary is NOT transcoded (performance)
return $copy;
}
}
$processor = new CharsetProcessor();
$processor->registerPropertyMapper(User::class, new UserMapper());
$user = new User();
$user->name = 'José';
$user->password = 'secret123'; // Will NOT be converted
$utf8User = $processor->toUtf8($user);
// Performance: 60% faster for objects with 50+ properties

Add support for custom data types:
use Ducks\Component\EncodingRepair\Interpreter\TypeInterpreterInterface;
class ResourceInterpreter implements TypeInterpreterInterface
{
public function supports($data): bool
{
return \is_resource($data);
}
public function interpret($data, callable $transcoder, array $options)
{
$content = \stream_get_contents($data);
$converted = $transcoder($content);
$newResource = \fopen('php://memory', 'r+');
\fwrite($newResource, $converted);
\rewind($newResource);
return $newResource;
}
public function getPriority(): int
{
return 80;
}
}
$processor->registerInterpreter(new ResourceInterpreter(), 80);
$resource = fopen('data.txt', 'r');
$convertedResource = $processor->toUtf8($resource);

Extend CharsetHelper with your own conversion strategies using the TranscoderInterface:
use Ducks\Component\EncodingRepair\Transcoder\TranscoderInterface;
class MyCustomTranscoder implements TranscoderInterface
{
public function transcode(string $data, string $to, string $from, array $options): ?string
{
if ($from === 'MY-CUSTOM-ENCODING') {
return myCustomConversion($data, $to);
}
// Return null to try next transcoder in chain
return null;
}
public function getPriority(): int
{
return 75; // Between iconv (50) and UConverter (100)
}
public function isAvailable(): bool
{
return extension_loaded('my_extension');
}
}
// Register with default priority
CharsetHelper::registerTranscoder(new MyCustomTranscoder());
// Register with custom priority
CharsetHelper::registerTranscoder(new MyCustomTranscoder(), 150);
// Legacy: Register a callable
CharsetHelper::registerTranscoder(
function (string $data, string $to, string $from, array $options): ?string {
if ($from === 'MY-CUSTOM-ENCODING') {
return myCustomConversion($data, $to);
}
return null;
},
150 // Priority
);

Add custom encoding detection methods using the DetectorInterface:
use Ducks\Component\EncodingRepair\Detector\DetectorInterface;
class MyCustomDetector implements DetectorInterface
{
public function detect(string $string, array $options): ?string
{
// Check for UTF-16LE BOM
if (strlen($string) >= 2 && ord($string[0]) === 0xFF && ord($string[1]) === 0xFE) {
return 'UTF-16LE';
}
// Return null to try next detector
return null;
}
public function getPriority(): int
{
return 150; // Higher than MbStringDetector (100)
}
public function isAvailable(): bool
{
return true;
}
}
// Register with default priority
CharsetHelper::registerDetector(new MyCustomDetector());
// Register with custom priority
CharsetHelper::registerDetector(new MyCustomDetector(), 200);
// Legacy: Register a callable
CharsetHelper::registerDetector(
function (string $string, array $options): ?string {
if (strlen($string) >= 2 && ord($string[0]) === 0xFF && ord($string[1]) === 0xFE) {
return 'UTF-16LE';
}
return null;
},
200 // Priority
);

The class uses a Chain of Responsibility pattern for both detection and transcoding.
CharsetHelper uses multiple strategies with automatic fallback:
UConverter (intl) → iconv → mbstring
↓ (fails) ↓ (fails) ↓ (always works)
Transcoder priorities:
- UConverter (priority: 100, requires ext-intl): Best precision, supports many encodings
- iconv (priority: 50): Good performance, supports transliteration
- mbstring (priority: 10): Universal fallback, most permissive
Custom transcoders can be registered with any priority value. Higher values execute first.
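The priority-ordered fallback described above can be sketched as a small standalone chain. `TranscoderChain` and its methods are hypothetical names for this example and are not the library's classes; the two handlers use the priorities listed above for iconv and mbstring.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical mini-chain: handlers run from highest to lowest priority,
 * and a handler returns null to pass the request to the next one.
 */
final class TranscoderChain
{
    /** @var array<int, array{priority: int, handler: callable}> */
    private array $entries = [];

    public function register(callable $handler, int $priority): void
    {
        $this->entries[] = ['priority' => $priority, 'handler' => $handler];
        // Keep the list sorted so the highest priority is tried first.
        usort($this->entries, static fn(array $a, array $b): int => $b['priority'] <=> $a['priority']);
    }

    public function transcode(string $data, string $to, string $from): string
    {
        foreach ($this->entries as $entry) {
            $result = ($entry['handler'])($data, $to, $from);
            if ($result !== null) {
                return $result; // first handler that succeeds wins
            }
        }
        throw new RuntimeException("No transcoder could convert {$from} to {$to}");
    }
}

$chain = new TranscoderChain();

// Priority 10: mbstring as the last-resort fallback.
$chain->register(
    static function (string $data, string $to, string $from): ?string {
        $out = mb_convert_encoding($data, $to, $from);
        return is_string($out) ? $out : null;
    },
    10
);

// Priority 50: iconv, skipped when the extension is missing or the conversion fails.
$chain->register(
    static function (string $data, string $to, string $from): ?string {
        if (!function_exists('iconv')) {
            return null;
        }
        $out = @iconv($from, $to, $data);
        return $out === false ? null : $out;
    },
    50
);

echo bin2hex($chain->transcode("Caf\xE9", 'UTF-8', 'ISO-8859-1')), PHP_EOL; // 436166c3a9
```

Registration order does not matter here, because the list is re-sorted by priority on every `register()` call.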
Detector priorities:
- CachedDetector (priority: 200, wraps MbStringDetector): Caches detection results
- MbStringDetector (priority: 100, requires ext-mbstring): Fast and reliable using mb_detect_encoding
- FileInfoDetector (priority: 50, requires ext-fileinfo): Fallback using finfo class
Custom detectors can be registered with any priority value. Higher values execute first.
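As a rough illustration of how a higher-priority detector can pre-empt mbstring, the sketch below runs a BOM check before `mb_detect_encoding()`. The array keys stand in for priorities, and `detectEncoding` is a hypothetical function; none of this is the library's own code.

```php
<?php
declare(strict_types=1);

// Hypothetical detectors keyed by priority; each returns an encoding name or null to defer.
$detectors = [
    100 => static function (string $s): ?string {
        // Strict mode rejects candidates for which the bytes are invalid.
        $found = mb_detect_encoding($s, ['UTF-8', 'ISO-8859-1'], true);
        return $found === false ? null : $found;
    },
    150 => static function (string $s): ?string {
        // Simplified BOM sniffing (UTF-32 BOMs are ignored in this sketch).
        if (strncmp($s, "\xEF\xBB\xBF", 3) === 0) {
            return 'UTF-8';
        }
        if (strncmp($s, "\xFF\xFE", 2) === 0) {
            return 'UTF-16LE';
        }
        if (strncmp($s, "\xFE\xFF", 2) === 0) {
            return 'UTF-16BE';
        }
        return null;
    },
];
krsort($detectors); // highest priority first

function detectEncoding(string $s, array $detectors): ?string
{
    foreach ($detectors as $detector) {
        $encoding = $detector($s);
        if ($encoding !== null) {
            return $encoding;
        }
    }
    return null;
}

echo detectEncoding("\xFF\xFEh\x00i\x00", $detectors), PHP_EOL; // UTF-16LE
echo detectEncoding("Caf\xE9", $detectors), PHP_EOL;            // ISO-8859-1
```

Checking the BOM first is cheap and unambiguous, which is why it makes sense to rank it above statistical or validity-based detection.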
Cache Support (New in v1.2):
CachedDetector supports PSR-16 cache for persistent detection results:
// Default: InternalArrayCache (optimized, no TTL overhead)
$detector = new CachedDetector(new MbStringDetector());
// With TTL: ArrayCache
$cache = new ArrayCache();
$detector = new CachedDetector(new MbStringDetector(), $cache, 3600);
// External: Redis, Memcached, APCu, etc.
// $redis = new \Symfony\Component\Cache\Psr16Cache($redisAdapter);
// $detector = new CachedDetector(new MbStringDetector(), $redis, 7200);

Benchmarks on 10,000 conversions (PHP 8.2, i7-12700K):
| Operation | Time | Memory |
|---|---|---|
| Simple UTF-8 conversion | 45ms | 2MB |
| Array (100 items) | 180ms | 5MB |
| Auto-detection + conversion | 92ms | 3MB |
| Double-encoding repair | 125ms | 4MB |
| Safe JSON encode | 67ms | 3MB |
| Batch conversion (1000 items) | ~60% faster | Same |
| Object with custom mapper (50 props) | ~60% faster | Same |
Tips for performance:
- Install ext-intl for best performance (UConverter is fastest)
- Use specific encodings instead of AUTO when possible
- Use batch methods (toCharsetBatch()) for arrays > 100 items with AUTO detection
- Cache detection results for repeated operations
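The saving behind the batch tip comes from running detection once instead of once per item. A minimal sketch of that idea with mbstring follows. `toUtf8Batch` is a hypothetical helper written for this example, and it assumes every string in the batch shares one encoding.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical batch helper: detect once on a joined sample, then convert every string.
 * Assumes all string items share the same source encoding.
 */
function toUtf8Batch(array $items, array $candidates = ['UTF-8', 'ISO-8859-1']): array
{
    $strings = array_filter($items, 'is_string');
    if ($strings === []) {
        return $items; // nothing to detect or convert
    }

    // One detection call for the whole batch instead of one per item.
    $from = mb_detect_encoding(implode("\n", $strings), $candidates, true);
    if ($from === false) {
        throw new RuntimeException('Could not detect a common encoding');
    }
    if ($from === 'UTF-8') {
        return $items; // already in the target encoding
    }

    return array_map(
        static fn($item) => is_string($item) ? mb_convert_encoding($item, 'UTF-8', $from) : $item,
        $items
    );
}

$rows = ["Caf\xE9", "Jos\xE9", 42]; // ISO-8859-1 strings plus a non-string value
$utf8 = toUtf8Batch($rows);
echo bin2hex($utf8[0]), PHP_EOL; // 436166c3a9
```

If a batch can mix encodings, per-item detection (or a cached detector) is the safer choice despite the extra cost.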
| Feature | CharsetHelper | ForceUTF8 | Symfony String | Portable UTF-8 |
|---|---|---|---|---|
| Multiple fallback strategies | β | β | β | β |
| Extensible (CoR pattern) | β | β | β | β |
| Object recursion | β | β | β | β |
| Double-encoding repair | β | β | β | |
| Safe JSON helpers | β | β | β | β |
| Multi-encoding support | β (7+) | |||
| Modern PHP (7.4+, strict types) | β | β | β | |
| Zero dependencies | β | β | β | β |
// Migrate user table
$users = $db->query("SELECT * FROM users")->fetchAll();
foreach ($users as $user) {
$user = CharsetHelper::toUtf8($user, CharsetHelper::ENCODING_ISO);
$db->update('users', $user, ['id' => $user['id']]);
}

$csv = file_get_contents('data.csv');
// Auto-detect and convert
$utf8Csv = CharsetHelper::toCharset(
$csv,
CharsetHelper::ENCODING_UTF8,
CharsetHelper::AUTO
);
// Parse as UTF-8
$data = array_map('str_getcsv', preg_split('/\R/', trim($utf8Csv))); // one str_getcsv() call per line

// Ensure API responses are always valid UTF-8
class ApiController
{
public function jsonResponse($data): JsonResponse
{
$json = CharsetHelper::safeJsonEncode($data);
return new JsonResponse($json, 200, [], true);
}
}

$html = file_get_contents('https://example.com');
// Detect encoding from HTML meta tags or auto-detect
$encoding = CharsetHelper::detect($html);
// Convert to UTF-8 for processing
$utf8Html = CharsetHelper::toCharset(
$html,
CharsetHelper::ENCODING_UTF8,
$encoding
);
$dom = new DOMDocument();
$dom->loadHTML($utf8Html);

// Fix double-encoded data from old system
$legacyData = $oldSystem->getData();
// Repair corruption
$clean = CharsetHelper::repair(
$legacyData,
CharsetHelper::ENCODING_UTF8,
CharsetHelper::ENCODING_ISO
);
// Process clean data
processData($clean);

# Run tests
composer test
# Run tests with coverage
composer unittest -- --coverage-html coverage
# Static analysis
composer phpstan
# Auto-fix code style
composer phpcsfixer-check

- Changelog
- How To
- About Middleware Pattern
- Type Interpreter System
- CharsetHelper
- CharsetProcessor
- CharsetProcessorInterface
- PrioritizedHandlerInterface
- TypeInterpreterInterface
- PropertyMapperInterface
- InterpreterChain
- StringInterpreter
- ArrayInterpreter
- ObjectInterpreter
- TranscoderInterface
- CallableTranscoder
- IconvTranscoder
- MbStringTranscoder
- UConverterTranscoder
- DetectorInterface
- CallableDetector
- MbStringDetector
- FileInfoDetector
- CallableAdapterTrait
- ChainOfResponsibilityTrait
- CachedDetector
- InternalArrayCache
- ArrayCache
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Write tests for your changes
- Ensure tests pass (composer test)
- Run static analysis (composer analyse)
- Fix code style (composer cs-fix)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
git clone https://github.com/ducks-project/encoding-repair.git
cd encoding-repair
composer install
# Run full CI checks locally
composer ci

- PSR-12 / PER Coding Style
- PHPStan level 8
- 100% type coverage
- Minimum 90% code coverage
This project is licensed under the MIT license; see the LICENSE file for details.
- Inspired by ForceUTF8 (simplified approach)
- Uses design patterns from Symfony (extensibility)
- Fallback strategies similar to Portable UTF-8
- Documentation: https://github.com/ducks-project/encoding-repair/wiki
- Issue Tracker: https://github.com/ducks-project/encoding-repair/issues
- Changelog: CHANGELOG.md
- Packagist: https://packagist.org/packages/ducks-project/encoding-repair
- 📧 Email: adrien.loyant@gmail.com
- 💬 Discussions: https://github.com/ducks-project/encoding-repair/discussions
- 🐛 Issues: https://github.com/ducks-project/encoding-repair/issues
If this project helped you, please consider giving it a ⭐ on GitHub!
Made with ❤️ by the Duck Project Team