Conversation

@tobil4sk (Member) commented Feb 9, 2026

Hopefully this counteracts the performance regression seen after these encoding routines started being used for haxe.io.Bytes, see: #1294 (comment)

This is lighter than #1302 since it doesn't introduce an external library. It also addresses the problem from a different angle, by avoiding inefficiencies in the surrounding code, rather than making the encoding/decoding itself as fast as possible. Even if we decide to use simdutf I think it would be worth making some of these changes anyway, as some of them would also benefit the simdutf implementation.

These are the changes:

  • Merge encodings into a single file - having the encodings in separate files made it impossible for the compiler to inline simple function calls without link-time optimisation (which is off by default).
  • Avoid duplicate string iteration for ascii checks - we can perform a single iteration that both counts the length and checks for ascii.
  • Add a utf8 encode function that allocates its own output - currently Haxe has to iterate to get the utf8 length, and then we iterate again to verify that the length Haxe gave is correct, which is wasteful.
  • Pass char32_t by value instead of by reference - not sure if this makes a big difference, but it is better practice anyway.
  • Estimate utf8 length instead of iterating - this removes the need to iterate to find the length. Instead, we overallocate and truncate when done.
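The duplicate-iteration fix in the second bullet can be sketched roughly like this (a minimal illustrative sketch, not the actual hxcpp code; the function name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: one pass over a null-terminated string that
// both counts the length and records whether every byte is ASCII,
// instead of iterating once for the length and again for the check.
static std::size_t scanLengthAndAscii(const char* s, bool& isAscii) {
    isAscii = true;
    std::size_t len = 0;
    for (; s[len] != '\0'; ++len) {
        if ((unsigned char)s[len] >= 0x80)
            isAscii = false;
    }
    return len;
}
```

If the scan reports pure ASCII, the conversion can take a fast path (a straight byte copy) without re-walking the string.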

With this I see an improvement locally in the dox benchmark compared to master (1.3% vs 5.5% time spent in Bytes.ofString).

Flamegraphs (download as SVG and use the search bar to search for "Bytes_obj::ofString"):

[flamegraph: unpatched]
[flamegraph: patched]

The decode methods use methods from other encodings. For example,
Utf8::decode calls Utf16::getCharCount and Utf16::encode in a loop.
Placing them in the same file makes it easier for them to be inlined
which improves performance.
The existing Utf8::encode function takes in a buffer, but we don't know
what size is required, so we have to iterate through the string before
writing to make sure the buffer is big enough.

If the caller already ran getByteCount, this means we have duplicated
their work just to cover the case where they did not do it properly.

This new method allocates its own buffer that is guaranteed to be the
right size, which avoids the need for these unnecessary checks.
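The allocate-then-truncate approach could be sketched like this (an illustrative UTF-16 to UTF-8 encoder using std::u16string/std::string in place of hxcpp's own string types; this is a sketch of the idea, not the actual implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Worst case: every UTF-16 code unit expands to 3 UTF-8 bytes.
// (A surrogate pair is 2 units and encodes to 4 bytes, so 3 bytes
// per unit is an upper bound for all inputs.)
static std::size_t utf8LengthEstimate(std::size_t utf16Units) {
    return utf16Units * 3;
}

// Overallocate using the estimate, encode, then shrink to the real
// size, instead of pre-scanning the string for the exact length.
static std::string utf8Encode(const std::u16string& in) {
    std::string out;
    out.resize(utf8LengthEstimate(in.size()));
    std::size_t o = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        uint32_t c = in[i];
        // combine a surrogate pair into one code point
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size()) {
            uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (c < 0x80) {
            out[o++] = (char)c;
        } else if (c < 0x800) {
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (c >> 18));
            out[o++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out.resize(o); // truncate to the actual encoded length
    return out;
}
```

The single final resize trades a possible overallocation for never having to walk the input twice and never reallocating mid-encode.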
@Aidan63 (Contributor) commented Feb 9, 2026

Just did a quick bit of benchmarking with a "lorem ipsum"-like file full of Chinese characters. With these changes, Bytes.toString is now faster than in Haxe 4.3.7.

10k iterations:
latest: 0.09s
these: 0.07s

Bytes.ofString is still a fair bit slower than 4.3.7 though.

10k iterations:
latest: 0.05s
these: 0.12s

This was my benchmark code for reference.

import haxe.Timer;
import haxe.io.Bytes;
import sys.io.File;

class Main {
    static function main() {
        // final bytes = File.getBytes("C:\\Users\\AidanLee\\Desktop\\test\\lorem.txt");
        final string = File.getContent("C:\\Users\\AidanLee\\Desktop\\test\\lorem.txt");

        Timer.measure(() -> {
            for (_ in 0...10000) {
                final _ = Bytes.ofString(string);
                // final _ = bytes.toString();
            }
        });
    }
}

(attachment: lorem.txt)

@Aidan63 (Contributor) commented Feb 9, 2026

On length estimation, the one downside of this would be strings whose actual size is just under the "large object" size threshold of 2k bytes but whose estimate is higher. In this case those strings would go through the large object allocator rather than the thread-local one.

@tobil4sk (Member, Author) commented Feb 9, 2026

> On length estimation, the one downside of this would be strings which are actually just under the "large object" size threshold of 2k bytes but are estimated to be higher. In this case those strings would go through the large object allocator rather than the thread local one.

Yeah, this is where the tradeoff lies. __hxcpp_bytes_of_string loops and pushes to an array, so it does constant reallocations. This patch avoids reallocations at the expense of overallocating.
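To put rough numbers on that tradeoff, here is a small sketch. The 2048-byte threshold and the 3-bytes-per-UTF-16-unit estimate are assumptions taken from this discussion, not hxcpp constants verified here:

```cpp
#include <cassert>
#include <cstddef>

// Assumed values for this sketch: a 2048-byte "large object"
// threshold and a worst-case estimate of 3 UTF-8 bytes per
// UTF-16 code unit. The real hxcpp constants may differ.
const std::size_t kLargeObjectThreshold = 2048;

std::size_t utf8Estimate(std::size_t utf16Units) {
    return utf16Units * 3;
}
```

For example, a 700-character pure-ASCII string actually needs 700 UTF-8 bytes (well under the threshold), but its estimate of 2100 bytes would route the allocation through the large object allocator instead of the thread-local one.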

> Bytes.ofString is still a fair bit slower than 4.3.7 though.

For some reason, it's the other way around for me:

ofString:

                        10,000      1,000,000
patched haxe/hxcpp      0.06568s    5.673s
haxe 4.3.7              0.07428s    7.339s

toString:

                        10,000      1,000,000
patched haxe/hxcpp      0.0674s     6.107s
haxe 4.3.7              0.05319s    4.535s

@skial mentioned this pull request Feb 10, 2026