Skip to content

fix(grpc): percent-decode target URI path and reject non-UTF-8 characters#2677

Merged
arjan-bal merged 13 commits into
grpc:masterfrom
arjan-bal:non-utf-strings
Jun 18, 2026
Merged

fix(grpc): percent-decode target URI path and reject non-UTF-8 characters#2677
arjan-bal merged 13 commits into
grpc:masterfrom
arjan-bal:non-utf-strings

Conversation

@arjan-bal

@arjan-bal arjan-bal commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

This PR ensures that non-UTF-8 strings in the target URI path are rejected.

Key Changes:

  • Target::from_str: Updated to fail for targets with non-UTF-8 symbols in the percent-decoded path.
  • Target::path(): Modified to return the percent-decoded path.
  • ResolverBuilder::default_authority(): percent-encode path characters disallowed by RFC 3986 section-3.2.

@arjan-bal arjan-bal requested a review from dfawley June 5, 2026 17:44
@arjan-bal arjan-bal changed the title fix(grpc): Fix handling of non-UTF-8 path strings in target URI fix(grpc): Fix handling of non-UTF-8 strings in target URI path Jun 5, 2026
@arjan-bal arjan-bal requested a review from ejona86 June 8, 2026 05:10
@ejona86

ejona86 commented Jun 8, 2026

Copy link
Copy Markdown
Member

This seems weird to me. We should be assuming UTF-8-encoded URIs in general, so I could understand adding a way to get the un-decoded path or a path-as-bytes. And then only this code would do something different.

But I'd expect we'd decode URIs as UTF-8 in general, as the encoding of the URI can be different than the encoding of the on-disk path. For example, if the system is using the Latin-1 codepage (LANG=en_US.ISO-8859-1, not the now-typical LANG=en_US.UTF-8), then we'd decode the URI as UTF-8 and then (someone would) convert that to Latin-1 when opening the socket. Historically that was libc that would do that conversion. I see Java do that conversion. It looks like Go assumes the Linux system uses utf-8, which is actually fair these days. Go probably converts the UTF-8 to UTF-16 for Windows.

RFC 8089 encourages UTF-8 as we do represent them as strings in every other language to my knowledge. https://encoding.spec.whatwg.org/#encoding assumes Unicode, and we'd definitely be using UTF-8.

So, yeah... While it is true that the files don't even need to be any encoding at all (you can store raw binary, except for \0 and /), and rust allows you to do that, I don't think we want to deal with that in general.

@arjan-bal

Copy link
Copy Markdown
Contributor Author

@ejona86, to clarify, is the recommendation that gRPC should validate that the target URI contains only valid UTF-8 in the percent-decoded path?

@ejona86 ejona86 closed this Jun 8, 2026
@ejona86 ejona86 reopened this Jun 8, 2026
@ejona86

ejona86 commented Jun 8, 2026

Copy link
Copy Markdown
Member

Yeah, we should validate it. But either PercentDecode.decode_utf8() or decode_utf8_lossy() might work. Java uses a replacement character.. Go fails URL parsing doesn't validate. Dunno what C++ does.

Edit: C++ doesn't validate. But C++ doesn't care if you store raw binary in strings, and Go is tolerant to invalid UTF-8 in strings. Java doesn't allow that, and neither does Rust to my knowledge.

@arjan-bal arjan-bal assigned arjan-bal and unassigned ejona86 and dfawley Jun 9, 2026
@arjan-bal arjan-bal changed the title fix(grpc): Fix handling of non-UTF-8 strings in target URI path fix(grpc): percent-decode target URI path and reject non-UTF-8 characters Jun 9, 2026
@arjan-bal

arjan-bal commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

I've reworked the approach to use PercentDecode.decode_utf8() during target URL parsing, which will now fail if it encounters non-UTF-8 characters. I prefer this to using replacement characters because it prevents unexpected behaviour and fails fast to ensures users are immediately aware of the invalid target so they can fix it.

@arjan-bal arjan-bal assigned ejona86 and unassigned arjan-bal Jun 9, 2026

@ejona86 ejona86 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: I need to check callers to understand how this impacts other code.

Comment thread grpc/src/client/name_resolution/mod.rs Outdated
@arjan-bal arjan-bal requested a review from ejona86 June 12, 2026 08:20

@ejona86 ejona86 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All callsites except for default_authority look fine. default_authority does need some percent encoding done on it. We can potentially say that things like XDS would need to implement their own, but in that case having a broken default doesn't seem good. If we don't want to do the percent encoding by default we could instead have the channel or whatever calls default_authority check that only permitted characters are present.

@arjan-bal arjan-bal assigned arjan-bal and unassigned ejona86 Jun 18, 2026
@arjan-bal

Copy link
Copy Markdown
Contributor Author

I overlooked authority encoding while implementing the default_authority() method. I've added percent-encoding to match the behavior in Go and Java.

@arjan-bal arjan-bal requested a review from ejona86 June 18, 2026 10:26
@arjan-bal arjan-bal assigned ejona86 and unassigned arjan-bal Jun 18, 2026
@arjan-bal arjan-bal assigned arjan-bal and unassigned ejona86 Jun 18, 2026
@arjan-bal arjan-bal merged commit 36c2802 into grpc:master Jun 18, 2026
26 checks passed
@arjan-bal arjan-bal deleted the non-utf-strings branch June 18, 2026 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants