Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mapping resolution strategy: discussion and implementation. #256

Closed
10 tasks done
mknorps opened this issue Mar 21, 2023 · 12 comments
Closed
10 tasks done

Mapping resolution strategy: discussion and implementation. #256

mknorps opened this issue Mar 21, 2023 · 12 comments

Comments

@mknorps
Copy link
Collaborator

mknorps commented Mar 21, 2023

Execution plan:

Extra (related to the mapping strategy, but we decided to exclude them from the milestone):

The question is how to combine different ways of mapping dependencies to imports:

  • user-defined mapping
  • identity mapping
  • mapping from a local environment via automatic discovery (basepath, sys.path)
  • mapping added from local virtual environments via --pyenv option
  • installing packages from PyPI to an isolated local env
  • download packages from PyPI and extract import names from toplevel.txt or other (to be investigated)
  • local cache

This sparked from various discussions and conversation in a #252.

3 rounds of visualization of the flow of this translation funnel and we have following diagram, by @jherland
image

We should also decide what customization options should be given to a user:

  • switching on/off connecting to PyPI
  • use/not use/clean cache
  • use/not use sys.path, basepath
  • other?
@mknorps
Copy link
Collaborator Author

mknorps commented Mar 21, 2023

What we also discussed is the cache should have a common format and Package objects serialized to toml or json could be good candidates. It would also simplify writing a user-defined mapping.

@Nour-Mws
Copy link
Collaborator

Caches

If this is not complex enough, I'd like to also note that we can have two different types of caches: per-project and system-wide. The latter would include cached mappings from various previous runs of fawltydeps (not necessarily related to the project currently checked).

Another consideration is that this central cache can be populated not only from PyPI installations but also from FawtyDeps looking at various local Python envs in previous runs.

That may create privacy concerns so we might want to have this cache be user-based rather than system-wide?

Order of mappings

My understanding of the diagram above is that it dictates the default order the resolver will take through mappings.

We have already established that some of those can be turned off (i.e. skipped):

  • The user-provided mapping and mapping via --pyenv are turned off by default.
  • Identity mapping is turned on by default, but I'm partial to allowing the user to turn it off.

Furthermore, I would expect a user that passed the user-provided mapping or a --pyenv option to want the resolver to be exclusively based on those? The diagram above seems to imply that passing those will "turn them on" but that the resolver would still proceed through the depicted pipeline and attempt to resolve via other means if those failed?

While this makes for a more powerful default approach, I would still like to:

  • provide the user with appropriate visibility over the origin of respective mappings.
  • allow them to rely solely on the mapping(s) they provided or pointed to (or, and that makes things a tad more complex, choose what other mappings the resolvers can have access to).

The point is that, if I provided a Python environment via --pyenv, I would prefer to be told when a dependency is missing rather than having FawltyDeps find missing dependencies elsewhere on my system.

@jherland
Copy link
Member

@mknorps:

What we also discussed is the cache should have a common format and Package objects serialized to toml or json could be good candidates. It would also simplify writing a user-defined mapping.

Format of cache

Currently our Package objects end up looking like this in our JSON output (as part of the .resolved_deps member):

  "resolved_deps": {
    "importlib_metadata": {
      "package_name": "importlib-metadata",
      "mappings": {"local_env": ["importlib_metadata"]}
    },
    "isort": {
      "package_name": "isort",
      "mappings": {"local_env": ["isort"]}
    },
    "pip-requirements-parser": {
      "package_name": "pip-requirements-parser",
      "mappings": {"local_env": ["packaging_legacy_version", "pip_requirements_parser"]}
    },
    ...

This format is sufficient to decode back into Package objects without loss of information. Given that a LocalPackageLookup (after having interacted with importlib_metadata) is nothing more than a dict of Package objects, I think this is a good candidate for the format we should use for the cache.

Format of user-defined mapping

The above JSON format is too verbose for a user to provide a custom mapping. We could go with a simpler, more dict-like JSON representation:

    {
      "importlib-metadata": ["importlib_metadata"],
      "isort": ["isort"],
      "pip-requirements-parser": ["packaging_legacy_version", "pip_requirements_parser"]
    },

Still, I don't think JSON is the best format for a user-defined mapping. Instead, I think it makes most sense to put the user-defined mapping into pyproject.toml, next to the existing [tool.fawltydeps] section, for example:

[tool.fawltydeps.custom_mapping]
importlib-metadata = ["importlib_metadata"]
isort = ["isort"]
pip-requirements-parser = ["packaging_legacy_version", "pip_requirements_parser"]

Caches

@Nour-Mws:

If this is not complex enough, I'd like to also note that we can have two different types of caches: per-project and system-wide. The latter would include cached mappings from various previous runs of fawltydeps (not necessarily related to the project currently checked).
[...]
That may create privacy concerns so we might want to have this cache be user-based rather than system-wide?

Yes! A "per-user" cache stored under e.g. ~/.cache/fawltydeps is the longest I think we should stretch. I certainly don't think it's worth going wider than per-user, just trying to imagine the privacy/security aspects alone is bad enough, not to mention how you deal with a common/shared cache that needs to be writable by multiple users. Soon you end up with a fawltydeps daemon that needs to run in the background to manage access across users. I don't really see any big upside at all, outside of very specialized environments where you have multiple users accounts on the same system that works on overlapping projects.

As for "per-user" vs "per-project" I'm undecided, but I think it's better to start small and grow its use once we (a) see that it works well without major downsides, and (b) we identify more areas where caching bring a demonstrable benefit.

Another consideration is that this central cache can be populated not only from PyPI installations but also from FawtyDeps looking at various local Python envs in previous runs.

My opinion here is the same as above: Let's take one step at a time. Although the cache might end up encompassing "everything" in the end, I'd like to start with caching the PyPI parts where we already know that building a mapping can be very expensive. As for local envs, I'd like to first establish how slow the local mapping really is.

A small test on my laptop in my FD dev env with 43 packages installed shows that LocalPackageLookup can iterate over all the available packages in ~30ms. The cache we're contemplating here will also have to be read from disk, so I suspect it cannot be many orders of magnitude faster. Let's not solve problems we don't yet have...

Order and customization of mappings

My understanding of the diagram above is that it dictates the default order the resolver will take through mappings.

Yes, default order, and the ones that are disabled (either by default, by omission, or by explicit option) are simply skipped/removed from the diagram.

Furthermore, I would expect a user that passed the user-provided mapping or a --pyenv option to want the resolver to be exclusively based on those?

I'm not so sure: I don't this it's very user-friendly to tell the user: "If you want to provide your own mapping, you better remember to include everything in it...". For --pyenv the situation is a bit more nuanced. You could have a very diligent user that wants to make sure that everything is covered by their --pyenv , and they would therefore like to disable all other mappiings, including the identity mapping. But I'm equally sure that those diligent users are not in the majority...

The diagram above seems to imply that passing those will "turn them on" but that the resolver would still proceed through the depicted pipeline and attempt to resolve via other means if those failed?

Correct. That is my current POV when it comes to our default mapping.

While this makes for a more powerful default approach, I would still like to:

  • provide the user with appropriate visibility over the origin of respective mappings.

Agreed. As this mapping can become quite complex, we do need to "tag" our results with which mappings were involved. AFAICS, Analysis.resolved_deps is somewhat prepared for this already, as it maps each declared dependency into the corresponding Package object that was matched for that dependency, and by looking at the mappings dict you can see which mapping was used.

  • allow them to rely solely on the mapping(s) they provided or pointed to (or, and that makes things a tad more complex, choose what other mappings the resolvers can have access to).

The point is that, if I provided a Python environment via --pyenv, I would prefer to be told when a dependency is missing rather than having FawltyDeps find missing dependencies elsewhere on my system.

I agree that this should be possible, but I'm not sure it makes sense as the default behavior when you pass --pyenv.

Here is an idea that has presented itself while thinking about this:

  • Add a new option: --use-mappings that takes a sequence of "magic" words that correspond to the boxes in the diagram:
    • custom: [tool.fawltydeps.custom_mapping] from config file (if it exists). 1
    • pyenv: Any --pyenv option(s) given on the command or in the config file.
    • basepath: Any Python environments discovered under basepath i.e. in the current "project' (for lack of a better word). 2
    • current: The Python environment that FD is currently running in (aka sys.path).
    • pypi: The various components that make up the PyPI logic.
    • id: The identity mapping.
  • The user can pass --use-mappingswith their own list to reorder and skip any of the mappings corresponding to the above words. In other words, advanced users have full control.
  • The default value of --use-mappings (based on the above diagram) is: --use-mappings=custom,pyenv,basepath,current,id (pypi is off by default).
  • Add a --use-pypi option to insert pypi before the id mapping in the default list, i.e. equivalent to --use-mappings=custom,pyenv,basepath,current,pypi,id
  • Add a --use-pyenv-only option to reduce the list to --use-mappings=pyenv
  • Consider adding more options to satisfy common user scenarios.
  • Make all of these options mutually exclusive to prevent confusing scenarios with option combinations that don't make sense. (e.g. --use-pyenv-only --use-pypi).

Footnotes

  1. We might also consider exposing custom_mapping on the command-line, although I pity the fool that attempts to type in a correctly formatted/quoted JSON expression on the command line to encode an extensive custom mapping.

  2. I am not sure how 'basepath' and 'pyenv' will combine: If we allow --pyenv to override basepath in the same way that --code and --deps do, then the pyenv and basepath "magic" words migth collapse into one pyenv whose default value is basepath, ultimately Path(".").

jherland added a commit that referenced this issue Mar 23, 2023
This does not change current behavior, but does a preliminary
refactoring on the way towards having multiple package-to-import
resolvers as discussed in #256 and associated issues.

Introduce a BasePackageResolver abstract base class to define the
interface for package resolvers, and provide, for now, two subclasses:

- LocalPackageResolver (renamed from LocalPackageLookup)
- IdentityMapping (refactored from resolve_dependencies())

Also refactor resolved_dependencies() to operate on a list of
BasePackageResolver objects: For each dependency name we go through the
list of resolvers until a resolver returns a Package object for this
name. We then add this Package to the returned mapping.

For now this list of resolvers is hardcoded to exactly two entries, the
LocalPackageResolver we were already using, and the IdentityMapping that
we were effectively falling back on for packages not found in the former.

Ideas for future work that build on this:
- Introduce more/different BasePackageResolver subclasses to do
  different kinds of package resolving. E.g. user-defined mapping from a
  config file. Resolving by looking up packages remotely on PyPI, etc.
- Make the list of resolvers user-configurable instead of hardcoded.
- Properly handle the case where none of the resolver find a package
  mapping, i.e. _unmapped_ depenedencies.
jherland added a commit that referenced this issue Mar 24, 2023
This does not change current behavior, but does a preliminary
refactoring on the way towards having multiple package-to-import
resolvers as discussed in #256 and associated issues.

Introduce a BasePackageResolver abstract base class to define the
interface for package resolvers, and provide, for now, two subclasses:

- LocalPackageResolver (renamed from LocalPackageLookup)
- IdentityMapping (refactored from resolve_dependencies())

Also refactor resolved_dependencies() to operate on a list of
BasePackageResolver objects: For each dependency name we go through the
list of resolvers until a resolver returns a Package object for this
name. We then add this Package to the returned mapping.

For now this list of resolvers is hardcoded to exactly two entries, the
LocalPackageResolver we were already using, and the IdentityMapping that
we were effectively falling back on for packages not found in the former.

Ideas for future work that build on this:
- Introduce more/different BasePackageResolver subclasses to do
  different kinds of package resolving. E.g. user-defined mapping from a
  config file. Resolving by looking up packages remotely on PyPI, etc.
- Make the list of resolvers user-configurable instead of hardcoded.
- Properly handle the case where none of the resolver find a package
  mapping, i.e. _unmapped_ depenedencies.
@jherland
Copy link
Member

Idea: Only provide options to enable/disable resolvers, not reorder them.

  • Does not make sense to have ID mapping anywhere other than at the bottom of the stack
  • Does not make sense to reorder PyPI vs --pyenv
  • A user-provided mapping should always have the highest priority.

@jherland
Copy link
Member

Now version of the resolve_dependencies() diagram after today's meeting:
fd_matching 3

jherland added a commit that referenced this issue Mar 28, 2023
This does not change current behavior, but does a preliminary
refactoring on the way towards having multiple package-to-import
resolvers as discussed in #256 and associated issues.

Introduce a BasePackageResolver abstract base class to define the
interface for package resolvers, and provide, for now, two subclasses:

- LocalPackageResolver (renamed from LocalPackageLookup)
- IdentityMapping (refactored from resolve_dependencies())

Also refactor resolved_dependencies() to operate on a list of
BasePackageResolver objects: For each dependency name we go through the
list of resolvers until a resolver returns a Package object for this
name. We then add this Package to the returned mapping.

For now this list of resolvers is hardcoded to exactly two entries, the
LocalPackageResolver we were already using, and the IdentityMapping that
we were effectively falling back on for packages not found in the former.

Ideas for future work that build on this:
- Introduce more/different BasePackageResolver subclasses to do
  different kinds of package resolving. E.g. user-defined mapping from a
  config file. Resolving by looking up packages remotely on PyPI, etc.
- Make the list of resolvers user-configurable instead of hardcoded.
- Properly handle the case where none of the resolver find a package
  mapping, i.e. _unmapped_ depenedencies.
@mknorps
Copy link
Collaborator Author

mknorps commented Apr 17, 2023

Visualization to the team meeting discussion with @Nour-Mws and @jherland.
How to handle user interface to the resolver options. We can gather all resolver options in the table below. We keep the order of resolvers unchangeable. We can provide a user with two levels of control over the resolvers:

  1. one is a list of boolean flags that turn on and of a particular resolver
  2. the second is a "preset" - option or list of options that is governing a combination of resolvers.
X R R X R DEFAULT X R X RC X
user-provided mapping - - - - - - - - + + + + + + + +
local package resolver - - - - + + + + - - - - + + + +
install deps - - + + - - + + - - + + - - + +
identity mapping - + - + - + - + - + - + - + - +

About 1) Each of the four listed above should have a way to exclusively switch on and off. This option should be considered a power-user option and thus not easily available, for example binary flags:

  • --resolver-custom-mapping
  • --resolver-local-packages
  • --resolver-install-deps
  • --resolver-identity

As for what we called "presets" these are combinations of the resolvers above.

  • X - a combination that is uninteresting to a user
  • R - combination that should be resolved by --resolver-... option
  • RC - combination that we do not anticipate will be used often, thus achievable only by combining --resolve-.. options
  • Assuming that --install-deps option suppress identity mapping and presence of --custom-mapping-file does not change rest of the stack, for the rest of the table we have:
    --install-deps --custom-mapping-file PATH --resolver-local-packages=false --custom-mapping-file PATH --resolver-local-packages=false --install-deps --custom-mapping-file PATH --custom-mapping-file PATH --install-deps
    user-provided mapping - + + + +
    local package resolver + - - + +
    install deps + - + - +
    identity mapping - + - + -

I think this is quite elastic setup, and will not require more options for now. Unless you think that some of the combinations I marked as X is interesting.
What do you think?

@jherland
Copy link
Member

Thank you for creating this overview @mknorps. It really helps organize the thought process around these options.

I agree with mostly everything you're writing, and I only comment on the things that I feel need more nuance or clarification:

About 1) Each of the four listed above should have a way to exclusively switch on and off. This option should be considered a power-user option and thus not easily available, for example binary flags:

  • --resolver-custom-mapping
  • --resolver-local-packages
  • --resolver-install-deps
  • --resolver-identity

I agree that these options should be considered for power users only, and they are meant for when the presets are not sufficient to configure the desired resolver stack.

As you say, they are binary/boolean flags, but I think it's worth pointing out that they also have an unspecified state (the default state), where we defer to the underlying presets (or default preset) to determine whether they are enabled/disabled.

Do you think these option should take a value, like =on, =off (defaulting to =auto)? Or should we provide separate options, like --enable-custom-mapping / --disable-custom-mapping, etc.?

On to the presets and how they map onto enabling/disabling the 4 resolvers:

About the user-provided mapping

There is one issue I want to raise with your DEFAULT preset above: When a user does not provide any options, but they do have a [tool.fawltydeps.custom_mapping] section in their pyproject.toml, I think we should enable the UserDefinedMapping without them having to specify --resolver-custom-mapping=on.

Therefore I would consider that the user-provided mapping should actually be ON by default. (When there are no actual values in the mapping, it does behave the same as if it was disabled). When it's on by default, the [tool.fawltydeps.custom_mapping] section or the --custom-mapping-file option will Just Work™️ out of the box (i.e. they serve as modifier options and not as presets), and a power user can explicitly override/disable them with --resolver-custom-mapping=off.

In other words, I'd like to relabel the DEFAULT column above to be an R column, and move the DEFAULT label to the third-last column (+ / + / - / +).

(We might want to issue a warning if --resolver-custom-mapping=on is given, but the custom mapping is empty.)

Otherwise, I think the user-provided mapping should be independent of the other resolvers. We discussed briefly in the meeting whether a user-provided mapping should disable the identity mapping (e.g. when a user wants full control via their own mapping), but I think we can consider this a power user scenario for now (solved it with --custom-mapping-file=... --resolver-local-packages=off --resolver-install-deps=off --resolver-identity=off).

About the install-deps mapping

I think we all agree that the this preset (--install-deps) should disable the identity mapping, but otherwise not interfere with the other mappings. In other words: --install-deps is equivalent to --resolver-install-deps=on --resolver-identity=off (albeit with lower priority than other --resolver-* options on the command line).

Summary

If I interpret this correctly, it seems we really only have one preset option for now: --install-deps, and the other use cases are either covered by the default column, or we consider them power user cases solved by the more explicit options. I think this is actually really good, as number of options we need to introduce is very low, and most of them are not even necessary for regular users.

Forgive me for inverting your rows/columns:

custom-mapping local-packages install-deps identity
DEFAULT + (empty) + - +
--install-deps + (empty) + + -
--custom-mapping-file PATH or [tool.fawltydeps.custom_mapping] + + - +
--install-deps AND --custom-mapping-file PATH or [tool.fawltydeps.custom_mapping] + + + -
--custom-mapping-file PATH AND --resolver-local-packaged=off + - - +
--install-deps AND --custom-mapping-file PATH AND --resolver-local-packaged=off + - + -

@Nour-Mws
Copy link
Collaborator

Thanks a lot @mknorps for taking the time to do this useful analysis.
Like @jherland, I'll only comment on points where I think more clarification is needed.

I disagree in general on the need to provide a boolean --resolver-x option for all mappings.
For the specific case of user-defined mappings, I would consider the resolver to be off by default unless any of the user-defined mapping settings is populated. In the case where [tool.fawltydeps.custom_mapping] is provided, but no mapping is actually specified, this will "turn on" the user-specified mapping but won't have an impact on the resolving result.
I would even argue the distinction between on and off for this specific resolver isn't that important. We can consider it's always on, but it would simply not resolve any package if not initialized with a non-empty mapping.

I must admit I don't see the benefit of specifying an additional boolean switch on top of this. The only case I can imagine where it might be potentially useful is when you have defined some user-defined mapping(s) but want to suppress them. And in that case, I would argue it's cleaner to remove/unset these settings rather than turning them off with a switch.

@Nour-Mws
Copy link
Collaborator

For the list of boolean switches:

  • --resolver-custom-mapping: I don't think this is needed based on the analysis above.
  • --resolver-local-packages: We need this to suppress any local env mapping (even the default one), but see below for more discussion.
  • --resolver-install-deps: can be replaced by --install-deps.
  • --resolver-identity: needed.

If we remove --resolver-custom-mapping and consider that --resolver-install-deps is covered by the --install-deps preset, this will leave us with:

  • --resolver-local-packages
  • --resolver-identity
    And I'm not sure I agree these should be power-user options. I would expect a normal user to have the possibility to disable identity mapping and/or local packages mapping.

Local packages mapping

For the --resolver-local-packages, if this is set to off, and --pyenv is set, the user should be warned that --pyenv will be ignored.

--install-deps: a preset or a boolean switch?

Finally, I agree that using --install-deps would have the result of turning off identity mapping, but I still think we do not need to explicitly remove identity mapping from the stack. As --install-deps should resolve all remaining dependencies, no dependency should need to be resolved by identity mapping even if it's still on. In that way, --install-deps can still be used as a boolean switch rather than a preset that controls two different resolvers.

@jherland
Copy link
Member

For the --resolver-local-packages, if this is set to off, and --pyenv is set, the user should be warned that --pyenv will be ignored.

Whether or not we try to minimize the amount of nonsense combination of command-line options, we should at least try to warn the user when these nonsense combinations occur in practice.

As --install-deps should resolve all remaining dependencies, no dependency should need to be resolved by identity mapping even if it's still on.

Although I expect this to be true most of the time, we do need to consider the case where a dependency fails to be installed with --install-deps. Is this an exceptional circumstance that should abort FD entirely, or should it be handled like an unresolved dependency (i.e. drop down to the next resolver). I would argue for the latter, and at that point, whether or not the ID mapping is there becomes significant.

@jherland
Copy link
Member

Summary of discussion in today's status meeting: At this point in time, we need:

  • A good default: custom_mapping is on but empty, local packages and identity is on and install_deps is off.
  • Provide --install-deps as an option to normal-level users: turns install_deps on, and identity mapping off.
  • Defer the discussion on whether normal users need to be able to disable the identity mapping as a separate option.
  • Defer all other power user options till later.

@jherland
Copy link
Member

Closing this now, as we're ready to release v0.13 with the new mapping strategy implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants