Skip to content

proposal: encoding/xml: round-trip safe mode [freeze exception] #43168

Closed
@FiloSottile

Description

@FiloSottile

Background

Juho Nurminen of Mattermost reported multiple parsing issues in encoding/xml that lead to round-trip mismatches: parsing the output of encoding a list of tokens does not produce the same list of tokens.

Some of those issues are fixed by CLs linked to this issue, but encoding/xml does not provide such a security property, was not designed to, and we don’t believe that fixing the issues we know about is sufficient to guarantee it (nor that the round-trip guarantee is necessarily sufficient for all applications).

Indeed, the investigation kept surfacing different issues impacting round-trip stability, and we are aware of at least one variant affecting the Token API that is neither fixed by an open CL nor detected by the researchers’ validator (which we are told only aims to protect RawToken uses).

Unfortunately, XML-DSig and SAML implementations came to rely on round-trip stability, if not on perfect spec alignment, for their security. This is not entirely surprising, because XML-DSig and SAML are fundamentally fragile designs, but they are also critical enough in the ecosystem that we should try to find a way to let them operate safely.

Regardless of the outcome of this proposal, applications should avoid relying on complex parsers agreeing on how to interpret inputs for their security when possible, and should minimize transfers and modifications between the time of data validation and the time of use.

We’d like to thank Mattermost for reporting these issues, for working with us through the investigation, and for coordinating the disclosure of the issues to the downstream libraries so they could mitigate in advance.

If you use an affected SAML library, please refer to the following security advisories:

Proposal

Hopefully, applications need this kind of security property only when doing limited modifications of a document (for example, extracting a part to forward). If that’s the case, we can try to find a limited subset of the encoding/xml functionality for which we can effectively test that the outputs parse unambiguously.

(We tried fuzzing the full featureset, but the fuzzer found a lot of round-trip changes which are semantically correct, but hard to validate automatically, and changing those behaviors would probably be a breaking change.)

As a first suggestion, I propose adding an IgnoreNamespaces field to Decoder, which when set disables all namespace processing. Names with colons are not split and are copied whole into Name.Local.

We have a prototype of this and it survived extensive fuzzing (with the only caveat that CDATA sections end up merged with surrounding text, as the encoding/xml data structures don’t preserve escaping information).

Open questions

Is this API actually usable by the applications that need it?

Is round-trip stability sufficient, or do all those applications actually need perfect alignment with the spec on how the output is parsed? In other words, if the output is generated by Go but parsed by libxml and they disagree on the list of tokens, is that an issue? If yes, this is a much harder property to provide. HTTP/1 parsers have been struggling with it for decades, as shown by the constant stream of request smuggling vulnerabilities across the ecosystem, and HTTP/1 is arguably a much simpler protocol.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions