-
Notifications
You must be signed in to change notification settings - Fork 3.2k
MIME: Introduce MIME type parser #10640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
Various applications require making decisions based on a MIME type from an HTTP header, such as is found in the “Content-Type” header. This patch introduces a new utility class for parsing MIME media types compliant with the WHATWG MIME Sniffing algorithm. Co-authored-by: Jon Surrell <[email protected]>
304205a to
75854fd
Compare
|
Once again WPCS thinks vertical alignment is a crime… Thoughts after letting this stew for a bit:
|
Various applications require making decisions based on a MIME type from an HTTP header, such as is found in the “Content-Type” header. This patch introduces a new utility class for parsing MIME media types compliant with the WHATWG MIME Sniffing algorithm. Co-authored-by: Jon Surrell <[email protected]>
e2f6863 to
8f46374
Compare
|
I’ve made some large updates to the interface and I’m overall much happier with the new primary methods. There are still some parts of the sniffing algorithm left to implement ( Still need to think about |
| return $serialization; | ||
| } | ||
|
|
||
| public function essence(): string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about get_essence()? While “essence” is still maybe esoteric, I also see it as a verb could mean “to perfume or scent”. Since class methods are normally verbs, I would not immediately recognize what an essence function is supposed to do, but I would understand get_essence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m really not happy overall with this term, as it was brand new to me, and I think to @sirreal, and I don’t know about you, but were it not for the spec I would not give it this name.
It seems worth continuing to review the spec (which I’m doing) and trying to understand how we want to use this in which contexts. Given the test failures there are things I still don’t understand and I think some of them stem from other specifications, such as the Fetch and HTTP Structured Fields specs.
For now I will leave this as-is not because I disagree with your advice, but mostly as a questionable artifact of implementing the spec as-is.
This discussion is rather insightful, and I think we will need to also implement Content-Type parsing from Fetch if we want to handle the full web-platform-tests suite. This addresses questions you and I have discussed around seeing multiple Content-Type headers, as even though the web servers we tested limit them to one, apparently not all do, plus there might be cases where the header is duplicated in order to split long lines. Various browsers still disagree around the edges, so we have some leeway.
I’m hoping this doesn’t get too complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scratch that: I had $position .= strcspn() instead of $position += strcspn() 🤦♂️
The tests pass, but I did implement the Content-Type parsing and think it might be useful. The same tests pass and fail from the mime-types.json file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding essence, I was not familiar with the term before researching MIME.
It's obvious looking at the function, but the essence is type/subtype (without any parameters). I think most folks think of the essence when they think of a MIME type, things like text/plain or application/json.
It's redundant, but maybe mime_type_essence would be a good name. It includes mime_type so it should be visible to search and autocomplete when folks are looking want the essence and look for something about mime type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "media type" not the most common alternative term. So get_media_type() would make sense to me as a method name. The phpdoc can also mention that this is also the essence. Or would media type this entail that parameters may also be present?
| } | ||
| } | ||
|
|
||
| public function serialize(): string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about also implementing a __toString method which is just an alias for this serialize method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the past, people have been pretty averse to adding these implicit magic methods. I don’t have an opinion, though I do know that there are multiple legitimate ways to represent one of these as a string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall that aversion. I see that WP_HTML_Tag_Processor includes it. I don't feel strongly about this, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It came up in WordPress/gutenberg#42485 which led to the creation of get_updated_html() in WordPress/gutenberg#44597, which was a long time ago. It also is the reason (string) casting the HTML API is never used in documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I’m going to let this stew a bit without adding an implicit choice of what representation to use. I think serialize() makes sense, but the difference between the serialized and minimized form could matter in context-dependent ways.
For example, are we attempting to pass-through a normalized form of the input value? Are we trying to return the decoded MIME type? Are we trying to return the MIME type for an upload?
I think these three contexts warrant different returns:
serialize()will return forms of non-sensical types like!#$%&/!#$%&andx/xminimize()does not.
Whereas minimize() is supposed to filter the media types by whether they are “supported by the user agent,” meaning whether the user agent knows how to display the type. In the case of WordPress I don’t have a great answer for what this means, so I have tossed this out as a start:
- use
wp_get_mime_types()for reporting decoded MIME type when support is relevant - use
get_allowed_mime_types()for uploads or where user-permissions are relevant
My fear is that if we declare one canonical form of string value for the MIME type instance it will over-simplify this context-dependence and accidentally encourage developers to overlook it.
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
|
Finishing up today’s work I have made substantial-enough changes that I am marking this ready for review:
The tests cover a lot of edge cases. A few related specifications integrate here and it makes me think about creating a
What should be reviewed?Please focus on design review: how will this be used? how should it be used? Are there better names? Are there different constructions that will lead to more responsible use? For example, I renamed
Another example is the split in Anyway, this is a bit of rambling but there is a risk in implementing specs like this that things are simpler than they appear. The spec provides us instructions on how to look at something and give the same answer that a browser would to the question, “what is this?” But the specs are general and not built for our unique or special use cases. We may choose to diverge where something better-meets the needs of WordPress developers, as we collectively see fit, as long as it doesn’t violate spec-compliance. Thank you for your time and consideration on this! |
Could this PR (or a new sub-PR) implement this new MIME type parser to replace the ad-hoc parsers listed in the ticket? This would more easily demonstrate that it satisfies the current use cases. |
Trac ticket: Core-64427
Introduces the
WP_Mime_Snifferclass for parsing MIME types from sources such as HTTPContent-Typeheaders, unknown binary files, and more.WP_Mime_Sniffer::from_declaration( $supplied_type )for decoding HTTPContent-Typeheaders, HTML<meta http-equiv>and<script type>tags, RFC 822 headers, and more, where the string is an affirmation of the type of content that should be contained within some associated resource.WP_Mime_Sniffer::from_file( $file_path )for inferring MIME type from the “resource header” of a file at the given path where harmonizing server and browser behaviors is warranted, largely to eliminate security vulnerabilities.WP_Mime_Sniffer::from_binary_file_contents( $file_contents )for the same, but when the file data has already been loaded, e.g. on media file upload or via HTTP GET.$mime_type->serialize()to produce a normalized version of a potentially-malformed input.$mime_type->minimize()to produce a privacy-sensitive stripped-down version of the MIME type suitable for use in APIs likePerformanceResourceTiming.$mime_type->get_indicated_charset()to return a canonical character encoding referenced by the MIME type, if included and recognized.$mime_type->is_json()and$mime_type->is_javascript().The
::declaring_javascript()and::declaring_json()methods are interesting and might be worth emphasizing overfrom_declaration()if they stay in the patch. They only return a parsed MIME type if given something that matches those classes.::from_http_headers_string( string $headers )?::from_http_headers_array( array<string> $headers )?These two methods could ease code attempting to infer content type without needing to know the details surrounding
Content-typeparsing: indownload_url(), inSimplePie, indiscover_pingback_server_uri(), inwp_staticize_emoji_for_email()even! It would updateWP_REST_Request::get_content_type()andwp_finalize_template_enhancement_output_buffer().The
Encodingpart unlocks non-UTF-8 inputs in the HTML API for$this->bail( 'Cannot yet process META tags with http-equiv Content-Type to determine encoding.' );Of the labeled encodings, they are mostly supported by the version of PHP running on my computer with
mbstringandiconvextensions. Of the unsupported ones:ISO-8859-8-Iis a variant ofISO-8859-8which might be textually identical and possibly only specified meta sequences based on the C0/C1 controls.replacementgroups security-risky encoding labels into a decoder that always fails. when decoded, the output is always''(empty string).x-user-definedis a mapping of non-US-ASCII bytes up by 0x4780 into the private-use area.