Description
I encountered a case where text content such as <...> inside a tag is handled incorrectly by the parser. The parser tries to interpret it as a tag name and breaks the tree structure. The same happens with the symbol U+2026 (Horizontal Ellipsis).
What does <...> mean?
In text, <...> usually means that something has been intentionally left out. It is a placeholder for omitted words, or for text that is not being shown, quoted, or repeated.
However, the . character can indeed be used in an XML tag name, but only if it is not the first character (NameStartChar in the spec). The ellipsis character appears to be invalid even as a NameChar. Please see the Names and Tokens section here: https://www.w3.org/TR/xml/#sec-common-syn
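The name rules referenced above can be checked mechanically. The following is a minimal sketch that transcribes the NameStartChar and NameChar ranges from the XML 1.0 specification into a regular expression. The helper name isXmlName is made up for illustration and is not part of any parser API.

```javascript
// Character ranges transcribed from the XML 1.0 "Names and Tokens" productions.
// isXmlName is a hypothetical helper for illustration, not a library function.
const NAME_START_CHAR = String.raw`:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\u{10000}-\u{EFFFF}`;

// NameChar = NameStartChar plus "-", ".", digits, U+00B7 and two extra ranges.
const NAME_CHAR = NAME_START_CHAR + String.raw`\-.0-9\u00B7\u0300-\u036F\u203F\u2040`;

// A Name is one NameStartChar followed by any number of NameChar.
const XML_NAME = new RegExp(`^[${NAME_START_CHAR}][${NAME_CHAR}]*$`, "u");

function isXmlName(candidate) {
  return XML_NAME.test(candidate);
}

console.log(isXmlName("p"));      // true
console.log(isXmlName("a.b"));    // true: "." is allowed after the first character
console.log(isXmlName("..."));    // false: "." is not a NameStartChar
console.log(isXmlName("\u2026")); // false: U+2026 is in neither production
```

Under these rules neither ... nor the single ellipsis character qualifies as a tag name, which is why <...> cannot be the start of a valid element.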
Also, while writing this issue, I tried using only the angle bracket characters individually and saw strange behavior. I've described this below as a second case.
I assume that, when the angle brackets do not form a valid tag name according to the XML specification, the correct parser behavior would be to leave the angle bracket characters in the #text element or to escape them as entities.
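Until the parser behaves this way, the idea can be approximated with a pre-processing pass. This is only a sketch under stated assumptions: the helper escapeStrayLessThan is hypothetical, and it escapes every < that is not followed by a NameStartChar, /, ! or ?. It does not understand CDATA sections or comments, so a literal < inside those would be escaped as well.

```javascript
// NameStartChar ranges transcribed from the XML 1.0 specification.
const TAG_NAME_START = String.raw`:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\u{10000}-\u{EFFFF}`;

// A "<" counts as markup only when followed by a name start character,
// "/" (closing tag), "!" (comment, CDATA, DOCTYPE) or "?" (declaration, PI).
const STRAY_LT = new RegExp(`<(?![${TAG_NAME_START}/!?])`, "gu");

// Hypothetical helper, not part of any parser API.
// Assumption: the input has no CDATA sections or comments containing "<".
function escapeStrayLessThan(xml) {
  return xml.replace(STRAY_LT, "&lt;");
}

console.log(escapeStrayLessThan("<p>if (1 < 3) return text;</p>"));
// <p>if (1 &lt; 3) return text;</p>
console.log(escapeStrayLessThan("<p>ultricies <...> Sed</p>"));
// <p>ultricies &lt;...> Sed</p>
```

The escaped string can then be handed to the parser. Whether &lt; is decoded back to < in the #text value depends on the parser's entity handling.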
Input
Code
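// Both classes come from the fast-xml-parser package.
const { XMLParser, XMLBuilder } = require("fast-xml-parser");

const xmlParser = new XMLParser({
  preserveOrder: true,
  allowBooleanAttributes: true,
  ignoreAttributes: false,
  ignoreDeclaration: true,
});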
const firstCase = `<?xml version="1.0"?>
<root>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pretium odio non ex hendrerit, eu convallis sapien ultricies <...> Sed sagittis at est auctor varius. Donec sit amet nibh sodales, varius nunc eu, tempus turpis. <...> Nulla gravida erat a tortor sollicitudin laoreet.</p>
<foo></foo>
</root>`;
const secondCase = `<?xml version="1.0"?>
<root>
<p>if (1 < 3) return text;</p>
</root>
`;
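// Parse firstCase here; swap in secondCase to reproduce the second case.
const jObj = xmlParser.parse(firstCase);
const xmlBuilder = new XMLBuilder({
  ignoreAttributes: false,
  preserveOrder: true,
});
// Rebuild XML from the parsed tree to see how the structure changed.
const xmlContent = xmlBuilder.build(jObj);
console.log(xmlContent);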
Output
In the first case:
jObj
[
  {
    "root": [
      {
        "p": [
          {
            "#text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pretium odio non ex hendrerit, eu convallis sapien ultricies"
          },
          {
            "...": [
              {
                "#text": "Sed sagittis at est auctor varius. Donec sit amet nibh sodales, varius nunc eu, tempus turpis."
              },
              {
                "...": [
                  {
                    "#text": "Nulla gravida erat a tortor sollicitudin laoreet."
                  }
                ]
              },
              {
                "foo": []
              }
            ]
          }
        ]
      }
    ]
  }
]
xmlContent
<root><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pretium odio non ex hendrerit, eu convallis sapien ultricies<...>Sed sagittis at est auctor varius. Donec sit amet nibh sodales, varius nunc eu, tempus turpis.<...>Nulla gravida erat a tortor sollicitudin laoreet.</...><foo></foo></...></p></root>
In the second case:
jObj
[
  {
    "root": [
      {
        "p": [
          {
            "#text": "if (1"
          },
          {
            "": [],
            ":@": {
              "@_3)": true,
              "@_return": true,
              "@_text;": true,
              "@_</p": true
            }
          }
        ]
      }
    ]
  }
]
xmlContent
<root><p>if (1< 3) return text; </p></></p></root>
Expected data
In the first case:
<root><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pretium odio non ex hendrerit, eu convallis sapien ultricies <...> Sed sagittis at est auctor varius. Donec sit amet nibh sodales, varius nunc eu, tempus turpis. <...> Nulla gravida erat a tortor sollicitudin laoreet.</p><foo></foo></root>
In the second case:
<root><p>if (1 < 3) return text;</p></root>