Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creation PackageURL objects for invalid PURLs / Possible improper encoding of colons (":") in PURL fields #152

Open
njv299 opened this issue Mar 27, 2024 · 16 comments · May be fixed by #178

Comments

@njv299
Copy link

njv299 commented Mar 27, 2024

It is possible to create PackageURL objects that contain invalid fields, specifically by using the PackageURL kwarg constructor and passing in values that contain colons.

Simple example:

>>> from packageurl import PackageURL
>>> p = PackageURL(type="generic", name="Foo: <Bar>", version="1.2.3")
>>> p.to_string()
'pkg:generic/Foo:%20%3CBar%[email protected]'
>>> PackageURL.from_string(p.to_string())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/vossn/finitestate/finite-state-sip/venv/lib/python3.10/site-packages/packageurl/__init__.py", line 514, in from_string
    raise ValueError(msg)
ValueError: Invalid purl 'pkg:generic/Foo:%20%3CBar%[email protected]' cannot contain a "user:pass@host:port" URL Authority component: ''.

On closer inspection, it looks like the problem might be that colons (:) are not being percent-encoded correctly? I would expect the colon in the name to be encoded to %3A, but it looks like it is being left as a literal : in the to_string() function:

>>> p = PackageURL(type="generic", name="Foo: <Bar>", version="1.2.3")
>>> p
PackageURL(type='generic', namespace=None, name='Foo: <Bar>', version='1.2.3', qualifiers={}, subpath=None)
>>> p.to_string()
'pkg:generic/Foo:%20%3CBar%[email protected]'

I'm not sure I'm interpreting the PURL spec correctly with regards to the treatment of colon characters, but on the surface it sounds like any colon appearing within an individual field value should simply be percent-encoded if the field itself calls for it.

@njv299 njv299 changed the title PackageURL objects can represent invalid PURLs. Creation PackageURL objects for invalid PURLs / Possible improper encoding of colons (":") in PURL fields Mar 27, 2024
@itookyourboo
Copy link

Totally agree. Quote from PURL specification:

A name must be a percent-encoded string

@mprpic
Copy link

mprpic commented Jul 25, 2024

Colons are explicitly not encoded for an unknown reason:

https://github.com/package-url/packageurl-python/blob/f98abf0f3c295873e18f968ebd00138a02d63b25/src/packageurl/__init__.py#L71C40-L71C40

This line was added as part of commit d7be020, but there is no explanation as to why. This applies to all fields, not just name, fwiw.

@pombredanne as the author of this particular piece of the code, can you share why colons are treated special in this context? :-)

@jkugler
Copy link

jkugler commented Feb 5, 2025

I just had this long discussion with the folks at Syft to fix a bug in their tool which caused improper PURLs to be generated. Colons in names, namespaces, in versions (probably) need to be escaped.

See this bug:

>>> p = PackageURL.from_string('pkg:nuget/libiconv%3A%20character%20set%20conversion%[email protected]?package-id=e11a609df352e292')
>>> p.to_string()
'pkg:nuget/libiconv:%20character%20set%20conversion%[email protected]?package-id=e11a609df352e292'
>>> PackageURL.from_string("pkg:nuget/libiconv:%20character%20set%20conversion%[email protected]?package-id=e11a609df352e292")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    PackageURL.from_string("pkg:nuget/libiconv:%20character%20set%20conversion%[email protected]?package-id=e11a609df352e292")
  File "/Users/tek30584/repos/asset/kodiak-orchestrator/.venv.python312/lib/python3.12/site-packages/packageurl/__init__.py", line 508, in from_string
    raise ValueError(msg)
ValueError: Invalid purl 'pkg:nuget/libiconv:%20character%20set%20conversion%[email protected]?package-id=e11a609df352e292' cannot contain a "user:pass@host:port" URL Authority component: ''.

Also:

>>> p = PackageURL(type="nuget", name="libiconv: character set conversion library", version="1.9", qualifiers={'package-id': 'e11a609df352e292'})
>>> p
PackageURL(type='nuget', namespace=None, name='libiconv: character set conversion library', version='1.9', qualifiers={'package-id': 'e11a609df352e292'}, subpath=None)
>>> p.to_string()
'pkg:nuget/libiconv:%20character%20set%20conversion%[email protected]?package-id=e11a609df352e292'

This is a huge blocker for us, so I'm going to work on a PR.

@jkugler
Copy link

jkugler commented Feb 5, 2025

OK, so let's talk about the standard, from here: https://github.com/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst

Namespace:

Each namespace segment must be a percent-encoded string

Name:

A name must be a percent-encoded string

Version:

A version must be a percent-encoded string

Qualifiers:

A key must NOT be percent-encoded

A value must be a percent-encoded string

Subpath:

Each subpath segment must be a percent-encoded string

So, it would appear there are incorrect tests. For example:

E       AssertionError: assert 'pkg:docker/c...ry_url=gcr.io' == 'pkg:docker/c...ry_url=gcr.io'
E
E         - pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io
E         ?                                       ^^^
E         + pkg:docker/customer/dockerimage@sha256:244fd47e07d1004f0aed9c?repository_url=gcr.io
E         ?

Hmm...does this mean the test data is wrong?

  {
    "description": "docker uses qualifiers and hash image id as versions",
    "purl": "pkg:docker/customer/dockerimage@sha256:244fd47e07d1004f0aed9c?repository_url=gcr.io",
    "canonical_purl": "pkg:docker/customer/dockerimage@sha256:244fd47e07d1004f0aed9c?repository_url=gcr.io",
    "type": "docker",
    "namespace": "customer",
    "name": "dockerimage",
    "version": "sha256:244fd47e07d1004f0aed9c",
    "qualifiers": {"repository_url": "gcr.io"},
    "subpath": null,
    "is_invalid": false
  },

If the version is supposed to be percent-encoded, then a version of sha256:244fd47e07d1004f0aed9c should end up as sha256%3A244fd47e07d1004f0aed9c

How do we go forward with this?

This is also incorrect:

E       AssertionError: 'pkg:[42 chars]https://github.com/z3APA3A/3proxy/releases/dow[26 chars].zip' != 'pkg:[42 chars]https%3A//github.com/z3APA3A/3proxy/releases/d[28 chars].zip'
E       - pkg:github/z3apa3a/[email protected]?download_url=https://github.com/z3APA3A/3proxy/releases/download/0.8.11/3proxy-0.8.11.zip
E       ?                                                    ^
E       + pkg:github/z3apa3a/[email protected]?download_url=https%3A//github.com/z3APA3A/3proxy/releases/download/0.8.11/3proxy-0.8.11.zip

The spec says the value of a qualifier must be percent-encoded. Thus, the : and / characters in the URL should be percent encoded.

Where do I go from here?

@matt-phylum
Copy link

The spec says the value of a qualifier must be percent-encoded. Thus, the : and / characters in the URL should be percent encoded.

This is a misunderstanding about what "percent-encoded" means. : and / are not supposed to be percent encoded in qualifiers: https://github.com/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#character-encoding

@mprpic
Copy link

mprpic commented Feb 5, 2025

@matt-phylum I don't think that's right, the spec says : and / are not supposed to be encoded only when used as separators of the individual fields, for example the colon between pkg:deb/... or the slash in pkg:pypi/django.... When used in the values of individual fields, they must be encoded.

@matt-phylum
Copy link

That is what the spec says, but it's not very clear about it. "It is unambiguous unencoded everywhere" means it is never percent encoded. However, it is actually wrong in some cases. For example, / in the name is ambiguous and must be encoded, but I don't know of any current types that do this.

@jkugler
Copy link

jkugler commented Feb 5, 2025

Please see the comment above: #152 (comment)

package-url python can't de-serialize a string it generates, instead throwing an error. I'm pretty sure : in a name or namespace should be encoded. A : or a / in a qualifier value could be left up for debate.

@matt-phylum
Copy link

That's a bug in packageurl-python's parsing code. No other PURL implementation that I am aware of fails in this way.

althonos/packageurl.rs, anchore/packageurl-go, giterlizzi/perl-URI-PackageURL, package-url/packageurl-dotnet, package-url/packageurl-go, package-url/packageurl-java, package-url/packageurl-js, package-url/packageurl-php, package-url/packageurl-ruby, package-url/packageurl-swift, phylum-dev/purl, sonatype/package-url-java all decode the PURL with the unencoded colon correctly.

maennchen/purl successfully parses but doesn't decode the name correctly: maennchen/purl#10

@jkugler
Copy link

jkugler commented Feb 5, 2025

That's really interesting. I mean, the error message isn't wrong as a PURL cannot contain a "user:pass@host:port" URL Authority component but does that mean there should be no un-encoded : in the name to avoid ambiguity? The Spec does say a PURL is a URL, so a : in the name makes it look like there is "login" information in the URL.

<long list of libraries> all decode the PURL with the unencoded colon correctly.

Just curious: are there tests that show it decoding a PURL with a : in the name or namespace correctly? Or did you run examples for all those languages as a test?

@matt-phylum
Copy link

I ran everything. I found that I often wanted to know how all the different PURL implementations would handle things so I wrote https://github.com/phylum-dev/purl-survey which provides a purl->json and json->purl for every implementation that I know of. It's been really useful for things like the /+/%2F/%2B problem.

@jkugler
Copy link

jkugler commented Feb 5, 2025

@matt-phylum Awesome!

So...sounds like the PR needs to fix the decoding...not the encoding.

@matt-phylum
Copy link

Is there really a package named "libiconv: character set conversion library"? I thought NuGet IDs could not contain spaces.

@jkugler
Copy link

jkugler commented Feb 5, 2025

I believe the "name" comes from the binary found by Syft. We recently closed a bug with syft where they now encode : found in names and namespaces. But we still have old SBOMs around with : in names, so we need to be able to handle them, especially if packageurl-python is also generating names with : in them. :(

jkugler added a commit to jkugler/packageurl-python that referenced this issue Feb 5, 2025
This attempts to comply with the specification.

Closes package-url#152
(I think)

See discussion at package-url#152
jkugler added a commit to jkugler/packageurl-python that referenced this issue Feb 5, 2025
This attempts to comply with the specification.

Closes package-url#152
(I think)

See discussion at package-url#152

Signed-off-by: Joshua Kugler <[email protected]>
jkugler added a commit to jkugler/packageurl-python that referenced this issue Feb 5, 2025
This attempts to comply with the specification.

Closes package-url#152
(I think)

See discussion at package-url#152

Signed-off-by: Joshua Kugler <[email protected]>
@jkugler jkugler linked a pull request Feb 5, 2025 that will close this issue
@jkugler
Copy link

jkugler commented Feb 5, 2025

I've opened a PR. I hope it can be merged quickly and unblock my team. :)

@matt-phylum
Copy link

Syft is making pkg:nuget PURLs in this file: https://github.com/anchore/syft/blob/e584c9f416028051fbdc80fffda4e624aaed528d/syft/pkg/cataloger/dotnet/parse_dotnet_portable_executable.go

Get the name of a DLL from the resources (this is wrong):
https://github.com/anchore/syft/blob/e584c9f416028051fbdc80fffda4e624aaed528d/syft/pkg/cataloger/dotnet/parse_dotnet_portable_executable.go#L57

Look up the name from the ProductName resource:
https://github.com/anchore/syft/blob/e584c9f416028051fbdc80fffda4e624aaed528d/syft/pkg/cataloger/dotnet/parse_dotnet_portable_executable.go#L202-L222

Generate a PURL using that name:
https://github.com/anchore/syft/blob/e584c9f416028051fbdc80fffda4e624aaed528d/syft/pkg/cataloger/dotnet/parse_dotnet_portable_executable.go#L83

The specific iconv string likely comes from some variant of libiconv.rc:

https://github.com/winlibs/libiconv/blob/880a1fa8b5581e37e136a7b051947d3ea39097b6/MSVC14/libiconv.rc#L35

It might work for cases where the file comes from a NuGet package and the project producing the NuGet package is using the .Net SDK to build the package and neither the name of the package nor the product name have been customized. Syft is making a big assumption and will often be wrong. In this case it's producing a PURL that refers to a package that AFAIK cannot exist because its name contains characters forbidden by the nuspec spec, but it could produce the name of a different NuGet package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants