Mapping resolution strategy: discussion and implementation. #256

mknorps · 2023-03-21T16:44:45Z

Execution plan:

Extra (related to the mapping strategy, but we decided to exclude them from the milestone):

The question is how to combine different ways of mapping dependencies to imports:

user-defined mapping
identity mapping
mapping from a local environment via automatic discovery (basepath, sys.path)
mapping added from local virtual environments via --pyenv option
installing packages from PyPI to an isolated local env
download packages from PyPI and extract import names from toplevel.txt or other (to be investigated)
local cache

This sparked from various discussions and conversation in a #252.

3 rounds of visualization of the flow of this translation funnel and we have following diagram, by @jherland

We should also decide what customization options should be given to a user:

switching on/off connecting to PyPI
use/not use/clean cache
use/not use sys.path, basepath
other?

The text was updated successfully, but these errors were encountered:

mknorps · 2023-03-21T16:50:26Z

What we also discussed is the cache should have a common format and Package objects serialized to toml or json could be good candidates. It would also simplify writing a user-defined mapping.

Nour-Mws · 2023-03-21T17:28:10Z

Caches

If this is not complex enough, I'd like to also note that we can have two different types of caches: per-project and system-wide. The latter would include cached mappings from various previous runs of fawltydeps (not necessarily related to the project currently checked).

Another consideration is that this central cache can be populated not only from PyPI installations but also from FawtyDeps looking at various local Python envs in previous runs.

That may create privacy concerns so we might want to have this cache be user-based rather than system-wide?

Order of mappings

My understanding of the diagram above is that it dictates the default order the resolver will take through mappings.

We have already established that some of those can be turned off (i.e. skipped):

The user-provided mapping and mapping via --pyenv are turned off by default.
Identity mapping is turned on by default, but I'm partial to allowing the user to turn it off.

Furthermore, I would expect a user that passed the user-provided mapping or a --pyenv option to want the resolver to be exclusively based on those? The diagram above seems to imply that passing those will "turn them on" but that the resolver would still proceed through the depicted pipeline and attempt to resolve via other means if those failed?

While this makes for a more powerful default approach, I would still like to:

provide the user with appropriate visibility over the origin of respective mappings.
allow them to rely solely on the mapping(s) they provided or pointed to (or, and that makes things a tad more complex, choose what other mappings the resolvers can have access to).

The point is that, if I provided a Python environment via --pyenv, I would prefer to be told when a dependency is missing rather than having FawltyDeps find missing dependencies elsewhere on my system.

jherland · 2023-03-22T01:20:04Z

@mknorps:

What we also discussed is the cache should have a common format and Package objects serialized to toml or json could be good candidates. It would also simplify writing a user-defined mapping.

Format of cache

Currently our Package objects end up looking like this in our JSON output (as part of the .resolved_deps member):

  "resolved_deps": {
    "importlib_metadata": {
      "package_name": "importlib-metadata",
      "mappings": {"local_env": ["importlib_metadata"]}
    },
    "isort": {
      "package_name": "isort",
      "mappings": {"local_env": ["isort"]}
    },
    "pip-requirements-parser": {
      "package_name": "pip-requirements-parser",
      "mappings": {"local_env": ["packaging_legacy_version", "pip_requirements_parser"]}
    },
    ...

This format is sufficient to decode back into Package objects without loss of information. Given that a LocalPackageLookup (after having interacted with importlib_metadata) is nothing more than a dict of Package objects, I think this is a good candidate for the format we should use for the cache.

Format of user-defined mapping

The above JSON format is too verbose for a user to provide a custom mapping. We could go with a simpler, more dict-like JSON representation:

    {
      "importlib-metadata": ["importlib_metadata"],
      "isort": ["isort"],
      "pip-requirements-parser": ["packaging_legacy_version", "pip_requirements_parser"]
    },

Still, I don't think JSON is the best format for a user-defined mapping. Instead, I think it makes most sense to put the user-defined mapping into pyproject.toml, next to the existing [tool.fawltydeps] section, for example:

[tool.fawltydeps.custom_mapping]
importlib-metadata = ["importlib_metadata"]
isort = ["isort"]
pip-requirements-parser = ["packaging_legacy_version", "pip_requirements_parser"]

Caches

@Nour-Mws:

If this is not complex enough, I'd like to also note that we can have two different types of caches: per-project and system-wide. The latter would include cached mappings from various previous runs of fawltydeps (not necessarily related to the project currently checked).
[...]
That may create privacy concerns so we might want to have this cache be user-based rather than system-wide?

Yes! A "per-user" cache stored under e.g. ~/.cache/fawltydeps is the longest I think we should stretch. I certainly don't think it's worth going wider than per-user, just trying to imagine the privacy/security aspects alone is bad enough, not to mention how you deal with a common/shared cache that needs to be writable by multiple users. Soon you end up with a fawltydeps daemon that needs to run in the background to manage access across users. I don't really see any big upside at all, outside of very specialized environments where you have multiple users accounts on the same system that works on overlapping projects.

As for "per-user" vs "per-project" I'm undecided, but I think it's better to start small and grow its use once we (a) see that it works well without major downsides, and (b) we identify more areas where caching bring a demonstrable benefit.

Another consideration is that this central cache can be populated not only from PyPI installations but also from FawtyDeps looking at various local Python envs in previous runs.

My opinion here is the same as above: Let's take one step at a time. Although the cache might end up encompassing "everything" in the end, I'd like to start with caching the PyPI parts where we already know that building a mapping can be very expensive. As for local envs, I'd like to first establish how slow the local mapping really is.

A small test on my laptop in my FD dev env with 43 packages installed shows that LocalPackageLookup can iterate over all the available packages in ~30ms. The cache we're contemplating here will also have to be read from disk, so I suspect it cannot be many orders of magnitude faster. Let's not solve problems we don't yet have...

Order and customization of mappings

My understanding of the diagram above is that it dictates the default order the resolver will take through mappings.

Yes, default order, and the ones that are disabled (either by default, by omission, or by explicit option) are simply skipped/removed from the diagram.

Furthermore, I would expect a user that passed the user-provided mapping or a --pyenv option to want the resolver to be exclusively based on those?

I'm not so sure: I don't this it's very user-friendly to tell the user: "If you want to provide your own mapping, you better remember to include everything in it...". For --pyenv the situation is a bit more nuanced. You could have a very diligent user that wants to make sure that everything is covered by their --pyenv , and they would therefore like to disable all other mappiings, including the identity mapping. But I'm equally sure that those diligent users are not in the majority...

The diagram above seems to imply that passing those will "turn them on" but that the resolver would still proceed through the depicted pipeline and attempt to resolve via other means if those failed?

Correct. That is my current POV when it comes to our default mapping.

While this makes for a more powerful default approach, I would still like to:

provide the user with appropriate visibility over the origin of respective mappings.

Agreed. As this mapping can become quite complex, we do need to "tag" our results with which mappings were involved. AFAICS, Analysis.resolved_deps is somewhat prepared for this already, as it maps each declared dependency into the corresponding Package object that was matched for that dependency, and by looking at the mappings dict you can see which mapping was used.

allow them to rely solely on the mapping(s) they provided or pointed to (or, and that makes things a tad more complex, choose what other mappings the resolvers can have access to).

The point is that, if I provided a Python environment via --pyenv, I would prefer to be told when a dependency is missing rather than having FawltyDeps find missing dependencies elsewhere on my system.

I agree that this should be possible, but I'm not sure it makes sense as the default behavior when you pass --pyenv.

Here is an idea that has presented itself while thinking about this:

Add a new option: --use-mappings that takes a sequence of "magic" words that correspond to the boxes in the diagram:
- custom: [tool.fawltydeps.custom_mapping] from config file (if it exists). ¹
- pyenv: Any --pyenv option(s) given on the command or in the config file.
- basepath: Any Python environments discovered under basepath i.e. in the current "project' (for lack of a better word). ²
- current: The Python environment that FD is currently running in (aka sys.path).
- pypi: The various components that make up the PyPI logic.
- id: The identity mapping.
The user can pass --use-mappingswith their own list to reorder and skip any of the mappings corresponding to the above words. In other words, advanced users have full control.
The default value of --use-mappings (based on the above diagram) is: --use-mappings=custom,pyenv,basepath,current,id (pypi is off by default).
Add a --use-pypi option to insert pypi before the id mapping in the default list, i.e. equivalent to --use-mappings=custom,pyenv,basepath,current,pypi,id
Add a --use-pyenv-only option to reduce the list to --use-mappings=pyenv
Consider adding more options to satisfy common user scenarios.
Make all of these options mutually exclusive to prevent confusing scenarios with option combinations that don't make sense. (e.g. --use-pyenv-only --use-pypi).

We might also consider exposing custom_mapping on the command-line, although I pity the fool that attempts to type in a correctly formatted/quoted JSON expression on the command line to encode an extensive custom mapping. ↩
I am not sure how 'basepath' and 'pyenv' will combine: If we allow --pyenv to override basepath in the same way that --code and --deps do, then the pyenv and basepath "magic" words migth collapse into one pyenv whose default value is basepath, ultimately Path("."). ↩

This does not change current behavior, but does a preliminary refactoring on the way towards having multiple package-to-import resolvers as discussed in #256 and associated issues. Introduce a BasePackageResolver abstract base class to define the interface for package resolvers, and provide, for now, two subclasses: - LocalPackageResolver (renamed from LocalPackageLookup) - IdentityMapping (refactored from resolve_dependencies()) Also refactor resolved_dependencies() to operate on a list of BasePackageResolver objects: For each dependency name we go through the list of resolvers until a resolver returns a Package object for this name. We then add this Package to the returned mapping. For now this list of resolvers is hardcoded to exactly two entries, the LocalPackageResolver we were already using, and the IdentityMapping that we were effectively falling back on for packages not found in the former. Ideas for future work that build on this: - Introduce more/different BasePackageResolver subclasses to do different kinds of package resolving. E.g. user-defined mapping from a config file. Resolving by looking up packages remotely on PyPI, etc. - Make the list of resolvers user-configurable instead of hardcoded. - Properly handle the case where none of the resolver find a package mapping, i.e. _unmapped_ depenedencies.

jherland · 2023-03-27T09:00:51Z

Idea: Only provide options to enable/disable resolvers, not reorder them.

Does not make sense to have ID mapping anywhere other than at the bottom of the stack
Does not make sense to reorder PyPI vs --pyenv
A user-provided mapping should always have the highest priority.

jherland · 2023-03-27T09:46:02Z

Now version of the resolve_dependencies() diagram after today's meeting:

This does not change current behavior, but does a preliminary refactoring on the way towards having multiple package-to-import resolvers as discussed in #256 and associated issues. Introduce a BasePackageResolver abstract base class to define the interface for package resolvers, and provide, for now, two subclasses: - LocalPackageResolver (renamed from LocalPackageLookup) - IdentityMapping (refactored from resolve_dependencies()) Also refactor resolved_dependencies() to operate on a list of BasePackageResolver objects: For each dependency name we go through the list of resolvers until a resolver returns a Package object for this name. We then add this Package to the returned mapping. For now this list of resolvers is hardcoded to exactly two entries, the LocalPackageResolver we were already using, and the IdentityMapping that we were effectively falling back on for packages not found in the former. Ideas for future work that build on this: - Introduce more/different BasePackageResolver subclasses to do different kinds of package resolving. E.g. user-defined mapping from a config file. Resolving by looking up packages remotely on PyPI, etc. - Make the list of resolvers user-configurable instead of hardcoded. - Properly handle the case where none of the resolver find a package mapping, i.e. _unmapped_ depenedencies.

mknorps · 2023-04-17T10:27:22Z

Visualization to the team meeting discussion with @Nour-Mws and @jherland.
How to handle user interface to the resolver options. We can gather all resolver options in the table below. We keep the order of resolvers unchangeable. We can provide a user with two levels of control over the resolvers:

one is a list of boolean flags that turn on and of a particular resolver
the second is a "preset" - option or list of options that is governing a combination of resolvers.

	X	R	R	X	R	DEFAULT		X	R			X	RC			X
user-provided mapping	-	-	-	-	-	-	-	-	+	+	+	+	+	+	+	+
local package resolver	-	-	-	-	+	+	+	+	-	-	-	-	+	+	+	+
install deps	-	-	+	+	-	-	+	+	-	-	+	+	-	-	+	+
identity mapping	-	+	-	+	-	+	-	+	-	+	-	+	-	+	-	+

About 1) Each of the four listed above should have a way to exclusively switch on and off. This option should be considered a power-user option and thus not easily available, for example binary flags:

--resolver-custom-mapping
--resolver-local-packages
--resolver-install-deps
--resolver-identity

As for what we called "presets" these are combinations of the resolvers above.

X - a combination that is uninteresting to a user
R - combination that should be resolved by --resolver-... option
RC - combination that we do not anticipate will be used often, thus achievable only by combining --resolve-.. options

Assuming that --install-deps option suppress identity mapping and presence of --custom-mapping-file does not change rest of the stack, for the rest of the table we have:

	`--install-deps`	`--custom-mapping-file PATH --resolver-local-packages=false`	`--custom-mapping-file PATH --resolver-local-packages=false --install-deps`	`--custom-mapping-file PATH`	`--custom-mapping-file PATH --install-deps`
user-provided mapping	-	+	+	+	+
local package resolver	+	-	-	+	+
install deps	+	-	+	-	+
identity mapping	-	+	-	+	-

I think this is quite elastic setup, and will not require more options for now. Unless you think that some of the combinations I marked as X is interesting.
What do you think?

jherland · 2023-04-17T15:43:10Z

Thank you for creating this overview @mknorps. It really helps organize the thought process around these options.

I agree with mostly everything you're writing, and I only comment on the things that I feel need more nuance or clarification:

About 1) Each of the four listed above should have a way to exclusively switch on and off. This option should be considered a power-user option and thus not easily available, for example binary flags:

--resolver-custom-mapping

--resolver-local-packages

--resolver-install-deps

--resolver-identity

I agree that these options should be considered for power users only, and they are meant for when the presets are not sufficient to configure the desired resolver stack.

As you say, they are binary/boolean flags, but I think it's worth pointing out that they also have an unspecified state (the default state), where we defer to the underlying presets (or default preset) to determine whether they are enabled/disabled.

Do you think these option should take a value, like =on, =off (defaulting to =auto)? Or should we provide separate options, like --enable-custom-mapping / --disable-custom-mapping, etc.?

On to the presets and how they map onto enabling/disabling the 4 resolvers:

About the user-provided mapping

There is one issue I want to raise with your DEFAULT preset above: When a user does not provide any options, but they do have a [tool.fawltydeps.custom_mapping] section in their pyproject.toml, I think we should enable the UserDefinedMapping without them having to specify --resolver-custom-mapping=on.

Therefore I would consider that the user-provided mapping should actually be ON by default. (When there are no actual values in the mapping, it does behave the same as if it was disabled). When it's on by default, the [tool.fawltydeps.custom_mapping] section or the --custom-mapping-file option will Just Work™️ out of the box (i.e. they serve as modifier options and not as presets), and a power user can explicitly override/disable them with --resolver-custom-mapping=off.

In other words, I'd like to relabel the DEFAULT column above to be an R column, and move the DEFAULT label to the third-last column (+ / + / - / +).

(We might want to issue a warning if --resolver-custom-mapping=on is given, but the custom mapping is empty.)

Otherwise, I think the user-provided mapping should be independent of the other resolvers. We discussed briefly in the meeting whether a user-provided mapping should disable the identity mapping (e.g. when a user wants full control via their own mapping), but I think we can consider this a power user scenario for now (solved it with --custom-mapping-file=... --resolver-local-packages=off --resolver-install-deps=off --resolver-identity=off).

About the install-deps mapping

I think we all agree that the this preset (--install-deps) should disable the identity mapping, but otherwise not interfere with the other mappings. In other words: --install-deps is equivalent to --resolver-install-deps=on --resolver-identity=off (albeit with lower priority than other --resolver-* options on the command line).

Summary

If I interpret this correctly, it seems we really only have one preset option for now: --install-deps, and the other use cases are either covered by the default column, or we consider them power user cases solved by the more explicit options. I think this is actually really good, as number of options we need to introduce is very low, and most of them are not even necessary for regular users.

Forgive me for inverting your rows/columns:

	custom-mapping	local-packages	install-deps	identity
DEFAULT	+ (empty)	+	-	+
`--install-deps`	+ (empty)	+	+	-
`--custom-mapping-file PATH` or `[tool.fawltydeps.custom_mapping]`	+	+	-	+
`--install-deps` AND `--custom-mapping-file PATH` or `[tool.fawltydeps.custom_mapping]`	+	+	+	-
`--custom-mapping-file PATH` AND `--resolver-local-packaged=off`	+	-	-	+
`--install-deps` AND `--custom-mapping-file PATH` AND `--resolver-local-packaged=off`	+	-	+	-

Nour-Mws · 2023-04-18T09:05:29Z

Thanks a lot @mknorps for taking the time to do this useful analysis.
Like @jherland, I'll only comment on points where I think more clarification is needed.

I disagree in general on the need to provide a boolean --resolver-x option for all mappings.
For the specific case of user-defined mappings, I would consider the resolver to be off by default unless any of the user-defined mapping settings is populated. In the case where [tool.fawltydeps.custom_mapping] is provided, but no mapping is actually specified, this will "turn on" the user-specified mapping but won't have an impact on the resolving result.
I would even argue the distinction between on and off for this specific resolver isn't that important. We can consider it's always on, but it would simply not resolve any package if not initialized with a non-empty mapping.

I must admit I don't see the benefit of specifying an additional boolean switch on top of this. The only case I can imagine where it might be potentially useful is when you have defined some user-defined mapping(s) but want to suppress them. And in that case, I would argue it's cleaner to remove/unset these settings rather than turning them off with a switch.

Nour-Mws · 2023-04-18T09:53:09Z

For the list of boolean switches:

--resolver-custom-mapping: I don't think this is needed based on the analysis above.
--resolver-local-packages: We need this to suppress any local env mapping (even the default one), but see below for more discussion.
--resolver-install-deps: can be replaced by --install-deps.
--resolver-identity: needed.

If we remove --resolver-custom-mapping and consider that --resolver-install-deps is covered by the --install-deps preset, this will leave us with:

--resolver-local-packages
--resolver-identity
And I'm not sure I agree these should be power-user options. I would expect a normal user to have the possibility to disable identity mapping and/or local packages mapping.

Local packages mapping

For the --resolver-local-packages, if this is set to off, and --pyenv is set, the user should be warned that --pyenv will be ignored.

`--install-deps`: a preset or a boolean switch?

Finally, I agree that using --install-deps would have the result of turning off identity mapping, but I still think we do not need to explicitly remove identity mapping from the stack. As --install-deps should resolve all remaining dependencies, no dependency should need to be resolved by identity mapping even if it's still on. In that way, --install-deps can still be used as a boolean switch rather than a preset that controls two different resolvers.

jherland · 2023-04-18T12:02:16Z

For the --resolver-local-packages, if this is set to off, and --pyenv is set, the user should be warned that --pyenv will be ignored.

Whether or not we try to minimize the amount of nonsense combination of command-line options, we should at least try to warn the user when these nonsense combinations occur in practice.

As --install-deps should resolve all remaining dependencies, no dependency should need to be resolved by identity mapping even if it's still on.

Although I expect this to be true most of the time, we do need to consider the case where a dependency fails to be installed with --install-deps. Is this an exceptional circumstance that should abort FD entirely, or should it be handled like an unresolved dependency (i.e. drop down to the next resolver). I would argue for the latter, and at that point, whether or not the ID mapping is there becomes significant.

jherland · 2023-04-18T12:44:05Z

Summary of discussion in today's status meeting: At this point in time, we need:

A good default: custom_mapping is on but empty, local packages and identity is on and install_deps is off.
Provide --install-deps as an option to normal-level users: turns install_deps on, and identity mapping off.
Defer the discussion on whether normal users need to be able to disable the identity mapping as a separate option.
Defer all other power user options till later.

jherland · 2023-08-28T11:51:24Z

Closing this now, as we're ready to release v0.13 with the new mapping strategy implemented.

mknorps mentioned this issue Mar 21, 2023

Install declared deps into a temp venv to infer mappings #252

Merged

Nour-Mws mentioned this issue Mar 21, 2023

Report dependencies not found via mappings available to FawltyDeps #257

Closed

Nour-Mws added the umbrella label Mar 22, 2023

This was referenced Mar 22, 2023

Allow users to provide their own dependency to imports mapping #260

Closed

Resolve packages from the Python environment before installing dependencies #263

Closed

Add mapping options to CLI (e.g. --install-deps) #266

Closed

jherland mentioned this issue Mar 23, 2023

package: Introduce a list of package resolvers #268

Merged

Nour-Mws mentioned this issue Mar 27, 2023

Make the list of resolvers user-configurable #273

Closed

This was referenced Apr 24, 2023

CLI considerations for the case when users can disable identity mapping #314

Open

Make identity mapping optional and introduce functionality to handle unresolved dependencies #298

Closed

Nour-Mws added this to the Mapping strategy milestone Aug 1, 2023

jherland closed this as completed Aug 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mapping resolution strategy: discussion and implementation. #256

Mapping resolution strategy: discussion and implementation. #256

mknorps commented Mar 21, 2023 •

edited by Nour-Mws

Loading

mknorps commented Mar 21, 2023

Nour-Mws commented Mar 21, 2023

jherland commented Mar 22, 2023

jherland commented Mar 27, 2023

jherland commented Mar 27, 2023

mknorps commented Apr 17, 2023

jherland commented Apr 17, 2023

Nour-Mws commented Apr 18, 2023

Nour-Mws commented Apr 18, 2023

jherland commented Apr 18, 2023

jherland commented Apr 18, 2023

jherland commented Aug 28, 2023

Mapping resolution strategy: discussion and implementation. #256

Mapping resolution strategy: discussion and implementation. #256

Comments

mknorps commented Mar 21, 2023 • edited by Nour-Mws Loading

mknorps commented Mar 21, 2023

Nour-Mws commented Mar 21, 2023

Caches

Order of mappings

jherland commented Mar 22, 2023

Format of cache

Format of user-defined mapping

Caches

Order and customization of mappings

Footnotes

jherland commented Mar 27, 2023

jherland commented Mar 27, 2023

mknorps commented Apr 17, 2023

jherland commented Apr 17, 2023

About the user-provided mapping

About the install-deps mapping

Summary

Nour-Mws commented Apr 18, 2023

Nour-Mws commented Apr 18, 2023

Local packages mapping

--install-deps: a preset or a boolean switch?

jherland commented Apr 18, 2023

jherland commented Apr 18, 2023

jherland commented Aug 28, 2023

mknorps commented Mar 21, 2023 •

edited by Nour-Mws

Loading

`--install-deps`: a preset or a boolean switch?