
Prepare for supporting multiple --pyenv options #313

Merged
merged 8 commits into main from jherland/multiple-pyenvs
May 8, 2023

Conversation

jherland
Member

@jherland jherland commented Apr 24, 2023

This was too long in the making, but I kept bumping into corners that delayed progress. (For example: I have bumped the integration between --pyenv and traverse_project to a later PR.)

Best reviewed one commit at a time. The first three commits are pure preparation for later changes, the next four commits are related fixes that I found had to be made before tackling the main topic. The following two commits have the actual changes, and the last commit fixes another (unrelated) issue that was found while adding new tests for multiple pyenvs.

Although I haven't looked at this yet, I suspect there will be conflicts with Nour's ongoing PRs (#296, #297, #298). I'll be happy to rebase on top of those, and resolve the conflicts, but I wanted to get this PR up for review without further delay. Edit: Rebased now on top of origin/main after #296 was merged. Conflicts were fortunately smaller than I had feared. 🎉

Commits:

  • noxfile: Silence unhelpful pylint warning in tests
  • packages.LocalPackageResolver.packages: Use @calculated_once decorator
  • conftest: Return site_dir in addition to venv_dir from fake_venv()
  • conftest: Partially isolate the default LocalPackageResolver
  • packages: Replace DependenciesMapping with string descriptions
  • LocalPackageResolver: Resolve packages using first match in sys.path
  • package.UserDefinedMapping: Include source of mapping in Package objects
  • LocalPackageResolver: Accept multiple pyenvs
  • Add support for multiple --pyenv options
  • settings.print_toml_config(): Represent sets as sorted lists

Still TODO in future PRs:

Collaborator

@mknorps mknorps left a comment


There is a lot of good stuff here! I see you had to do a lot of refactoring to arrive at a point where it is possible to implement the multiple --pyenv option.

I would probably be happier to see it split into two or three separate PRs; the thing that keeps me focused in the current setup is the logical and perfectly described individual commits ❤️.

So I will just start with the first few commits that are in the preparation phase of the multiple --pyenv option implementation.

First three are straightforward and I have no comments.

11ef514 and 94ca571 introduce better separation of the environment that may be used in tests. Here I wonder if we should not use this feature also for the sample and real-world projects integration tests.

I do not agree with 8ace2c1.

The only purpose of the DependenciesMapping enum is to differentiate the various types of mappings stored inside a Package object,

Yes. And this is an important purpose. If you want to know what different mappings we use, you check the DependenciesMapping object. If you want to add yet another mapping, you would do it there. In this PR you exchange a strict definition provided in a type for a string-based description. This is not a good idea in my opinion: for debugging you need to parse the string back into the correct mapping type, and for refactoring you do not have support from a type checker.

I will post some ideas on how to tackle the need to have a location of a mapping in the next comment :)

Collaborator

@mknorps mknorps left a comment


To have a way to include location in the dependency mapping, we could use an object with location included for example:

@dataclass(eq=True, frozen=True, order=True)
class DependencyMappingEntry:
    mapping_type: DependenciesMapping
    source: Location
    items: Set[str]

and then for a package:

  Package(
    "foo",
    {
      DependencyMappingEntry(DependenciesMapping.LOCAL_ENV, Location("path1"), {"bar", "baz"}),
      DependencyMappingEntry(DependenciesMapping.LOCAL_ENV, Location("path2"), {"blob"}),
    }
  )

Second idea. Given that:

However, the way we are currently structuring our resolver stack, we
will never mix mappings from different resolvers inside a single
Package object.

we could add resolved_with to package object:

class Package:
    """Encapsulate an installable Python package.

    This encapsulates the mapping between a package name (i.e. something you can
    pass to `pip install`) and the import names that it provides once it is
    installed.
    """

    package_name: str
    mappings: Dict[Location, Set[str]] = field(default_factory=dict) # here instead of Location we can have a string representing path
    import_names: Set[str] = field(default_factory=set)
    resolved_with: Optional[DependenciesMapping] = None

@jherland
Member Author

jherland commented May 2, 2023

I would probably be happier to see it split into two or three separate PRs; the thing that keeps me focused in the current setup is the logical and perfectly described individual commits.

It looks like I should probably split the PR right before ec726e6 (LocalPackageResolver: Accept multiple pyenvs), so that we can properly discuss and land the preparatory patches before introducing the actual feature (multiple --pyenvs)?

11ef514 and 94ca571 introduce better separation of the environment that may be used in tests. Here I wonder if we should not use this feature also for the sample and real-world projects integration tests.

That's a good idea. We already do isolate these sample/real project tests from the current environment by using the requirements = [ ... ] directive in the TOML files and the CachedExperimentVenv helper, and using a fake_venv instead of a (cached) real venv would probably be a bit faster. That said, I like having the real_projects run in a scenario that is as close to a real user scenario as we can get it, i.e. with limited use of test fakes.

With 8ace2c1 I do not agree.

The only purpose of the DependenciesMapping enum is to differentiate the various types of mappings stored inside a Package object,

Yes. And this is an important purpose.

Ok, we can certainly revisit the design here.

To have a way to include location in the dependency mapping, we could use an object with location included for example:

@dataclass(eq=True, frozen=True, order=True)
class DependencyMappingEntry:
    mapping_type: DependenciesMapping
    source: Location
    items: Set[str]

and then for a package:

  Package(
    "foo",
    {
      DependencyMappingEntry(DependenciesMapping.LOCAL_ENV, Location("path1"), {"bar", "baz"}),
      DependencyMappingEntry(DependenciesMapping.LOCAL_ENV, Location("path2"), {"blob"}),
    }
  )

I'm not sure that source: Location makes sense for all mappings (e.g. "User-defined mapping from settings" which is not always from pyproject.toml but could also be from the $fawltydeps_custom_mapping environment variable, or a LocalPackageResolver pointing at a sys.path with many entries).

I also wanted to keep the option open for future resolvers to do things differently (although I could easily be persuaded to leave that as a problem for future-us, and rather solve the problem that is in front of us right now).

Second idea. Given that:

the way we are currently structuring our resolver stack, we
will never mix mappings from different resolvers inside a single
Package object.

we could add resolved_with to package object:

class Package:
    package_name: str
    mappings: Dict[Location, Set[str]] = field(default_factory=dict) # here instead of Location we can have a string representing path
    import_names: Set[str] = field(default_factory=set)
    resolved_with: Optional[DependenciesMapping] = None

I think I like this idea better. But I'd still want to get rid of the DependenciesMapping enum, as I struggle to see any real value added by it (it functions only as a proxy for the resolver type itself, AFAICS). Could we use the relevant class/type directly to identify its source (i.e. change Optional[DependenciesMapping] into Type[BasePackageResolver])?

Also, I'm growing slightly weary of the duplication between .mappings and .import_names. The .import_names is what we use in practice, and the extra information in .mappings is there mostly for debugging/documentation. Maybe remove .mappings and convert .import_names into a "set-like" container that optionally allows mapping to a resolver-specific source or description? By "set-like", I mean a container that provides quick lookup ("foobar" in package.import_names), set intersection (set.intersection(imported_names, package.import_names)), and iteration (for name in package.import_names). This can be satisfied both by a proper set, and by a dict where the keys are the import names and the values are the associated source/description.

In other words:

class Package:
    package_name: str
    resolved_with: Type[BasePackageResolver]  # The serialization of this in JSON output would be the class' .__name__
    import_names: Union[Set[str], Dict[str, Something]]  # Set-like collection of import names with resolver-specific metadata.

(The only limitation of the Something dict-values is that they would have to be serializable to JSON, hence Any is not a good choice...)
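The "set-like" requirement above is easy to check in isolation: a plain dict whose keys are the import names already supports membership tests, intersection (via its keys view), and iteration. A minimal sketch (the example names and sources are made up for illustration):

```python
# A dict keyed by import names behaves "set-like" for all three needs:
import_names = {"foo": "mapping.toml", "bar": "from settings"}  # name -> source/description

print("foo" in import_names)                         # quick lookup
print(sorted({"foo", "baz"} & import_names.keys()))  # set intersection via the keys view
print(sorted(name for name in import_names))         # iteration over names
```

Note that `set.intersection()` also accepts any iterable, so `set.intersection(imported_names, import_names)` works unchanged whether `import_names` is a set or a dict.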

Resolvers such as IdentityMapping don't really have any useful metadata to attach here, so they would use the simple set:

class IdentityMapping(BasePackageResolver):
    def lookup_package(self, package_name: str) -> Package:
        return Package(
            package_name,
            resolved_with=IdentityMapping,
            import_names={Package.normalize_name(package_name)},
        )

But the more complex resolvers could themselves choose how/what metadata to attach to each import name...

@mknorps
Collaborator

mknorps commented May 5, 2023

Second idea. Given that:

the way we are currently structuring our resolver stack, we
will never mix mappings from different resolvers inside a single
Package object.

we could add resolved_with to package object:

class Package:
    package_name: str
    mappings: Dict[Location, Set[str]] = field(default_factory=dict) # here instead of Location we can have a string representing path
    import_names: Set[str] = field(default_factory=set)
    resolved_with: Optional[DependenciesMapping] = None

I think I like this idea better. But I'd still want to get rid of the DependenciesMapping enum, as I struggle to see any real value added by it (it functions only as a proxy for the resolver type itself, AFAICS). Could we use the relevant class/type directly to identify its source (i.e. change Optional[DependenciesMapping] into Type[BasePackageResolver])?

I like this idea! Theoretically a mapping is a more general concept, and I can imagine that a single mapping may be resolved with several resolvers, but this is a very unlikely abstraction. I think changing from Optional[DependenciesMapping] to Type[BasePackageResolver] is enough.
It is good that we can get rid of the Optional that way.


Consider everything written below (along with your suggestion to get rid of the duplication between .mappings and .import_names) as a good improvement that may be done in a separate PR.

In fact, the condition that we may use only one BasePackageResolver in the process, together with what you wrote:

Also, I'm growing slightly weary of the duplication between .mappings and .import_names. The .import_names is what we use in practice, and the extra information in .mappings is there mostly for debugging/documentation. Maybe remove .mappings and convert .import_names into a "set-like" container that optionally allows mapping to a resolver-specific source or description? By "set-like", I mean a container that provides quick lookup ("foobar" in package.import_names), set intersection (set.intersection(imported_names, package.import_names)), and iteration (for name in package.import_names). This can be satisfied both by a proper set, and by a dict where the keys are the import names and the values are the associated source/description.

In other words:

class Package:
    package_name: str
    resolved_with: Type[BasePackageResolver]  # The serialization of this in JSON output would be the class' .__name__
    import_names: Union[Set[str], Dict[str, Something]]  # Set-like collection of import names with resolver-specific metadata.

(The only limitation of the Something dict-values is that they would have to be serializable to JSON, hence Any is not a good choice...)

makes me think that we could go one step further and not mutate Package with add_import_names, but instead give those import names at the construction of the Package object. In fact, this is what we do in every test. We could remove the add_import_names method so there is no easy way to modify a Package after it has been resolved.
Then we would need to handle accumulate_mappings a bit differently: first gather all the dictionary elements, then create the Package object.
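Constructing Package fully formed and removing add_import_names() could be enforced with a frozen dataclass — a sketch of the idea, not the actual FawltyDeps code:

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class Package:
    package_name: str
    # A frozenset also guards the contents, not just the attribute binding.
    import_names: frozenset = frozenset()

pkg = Package("apache-airflow", frozenset({"airflow"}))
try:
    pkg.import_names = frozenset({"other"})  # mutation after construction is rejected
except FrozenInstanceError:
    print("Package is immutable")
```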

As for the deduplication of .mappings and .import_names,

 import_names: Set[str]

would probably be the best option. But if we need to take into account that each name in import_names may have come from a different source (for example, a file in a custom mapping), then we may just do something like:

class Package:
    package_name: str
    resolved_with: Type[BasePackageResolver]  # The serialization of this in JSON output would be the class' .__name__
    mappings: Set[Tuple[str, str]]  # first element of the tuple is the import name, the second is the description of its source

    @property
    def import_names(self):
        return {name for name,_ in self.mappings}
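A quick sketch of that last variant in action (dropping resolved_with to keep it self-contained; the package and source names are from the earlier example, and this is an illustration, not the actual implementation):

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple

@dataclass(frozen=True)
class Package:
    package_name: str
    # (import name, description of its source) pairs
    mappings: FrozenSet[Tuple[str, str]] = frozenset()

    @property
    def import_names(self) -> Set[str]:
        # Derive the plain set of import names from the annotated pairs.
        return {name for name, _ in self.mappings}

pkg = Package(
    "apache-airflow",
    frozenset({("airflow", "mapping.toml"), ("foo", "from settings")}),
)
print(sorted(pkg.import_names))  # ['airflow', 'foo']
```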

@jherland
Member Author

jherland commented May 5, 2023

I think we're converging on the same (or very similar) designs here 😄. As of right now, this is the Package definition I'm working with:

ExtraImportInfo = Union[None, Location, str]

class Package:
    package_name: str
    resolved_with: Type["BasePackageResolver"]
    import_names: Dict[str, ExtraImportInfo] = field(default_factory=dict)

IMHO, an import_names dictionary where the keys are the import names is functionally equivalent to an .import_names set.

Examples of Package objects would then look like:

Package(
    package_name="foo",
    resolved_with=UserDefinedMapping,
    import_names={
        "foo": Location(Path("mappings.toml")),
        "bar": Location(Path("pyproject.toml")),
        "baz": Location(Path("pyproject.toml")),
    },
)
Package(
    package_name="foo",
    resolved_with=LocalPackageResolver,
    import_names={
        "foo": Location(Path(".venv/lib/python3.7/site-packages")),
        "bar": Location(Path(".venv/lib/python3.7/site-packages")),
        "baz": Location(Path("__pypackages__/3.10/lib")),
    },
)
Package(
    package_name="foo",
    resolved_with=IdentityMapping,
    import_names={
        "foo": None,
    },
)

I'm only 95% happy with this design, though, as there is often duplication between the ExtraImportInfo values in a Package object with multiple import names. Furthermore, for a Package with zero import names, there is no good place to attach any meta-information at all.

Maybe separating the meta-information into a separate member is cleaner? This also allows each resolver more control over how to organize it (for better or worse?):

PackageDebugInfo = Union[None, ...]

class Package:
    package_name: str
    resolved_with: Type["BasePackageResolver"]
    import_names: Set[str] = field(default_factory=set)
    debug_info: PackageDebugInfo = None

# Examples:
Package(
    package_name="foo",
    resolved_with=UserDefinedMapping,
    import_names={"foo", "bar", "baz"},
    debug_info={
        Location(Path("mappings.toml")): {"foo"},
        "from configured settings": {"bar", "baz"},
    },
)
Package(
    package_name="foo",
    resolved_with=LocalPackageResolver,
    import_names={"foo", "bar", "baz"},
    debug_info={
        Location(Path(".venv/lib/python3.7/site-packages")): {"foo", "bar"},
        Location(Path("__pypackages__/3.10/lib")): {"baz"},
    },
)
Package(
    package_name="foo",
    resolved_with=IdentityMapping,
    import_names={"foo"},
    debug_info=None,
)

What do you think?


Otherwise, I agree with your point on making Package objects immutable. I'll try to get that done in my ongoing refactoring.

@jherland jherland force-pushed the jherland/multiple-pyenvs branch from 5e5904a to d03225a Compare May 5, 2023 23:44
@jherland
Member Author

jherland commented May 6, 2023

Finished refactoring for now, and split up the PR as discussed above. Here is a summary of the changes:

Initial commits, unchanged:

  • noxfile: Silence unhelpful pylint warning in tests
  • packages.LocalPackageResolver.packages: Use @calculated_once decorator
  • conftest: Return site_dir in addition to venv_dir from fake_venv()
  • conftest: Partially isolate the default LocalPackageResolver

Refactored/changed/new commits:

  • LocalPackageResolver: Resolve packages using first match in sys.path
    • Unchanged in spirit from above, but moved before some other refactoring to keep tests green throughout the series
  • packages: Refactor Package to add more debugging information
    • Completely rewritten according to the above discussion
  • packages: Remove unused Package.add_import_names() method
    • New commit, based on suggestion from @mknorps above.
  • package.UserDefinedMapping: Include source of mapping in Package objects
    • Unchanged in spirit, but affected by above rework.

Dropped commits, these will be moved to a new PR:

  • LocalPackageResolver: Accept multiple pyenvs
  • Add support for multiple --pyenv options
  • settings.print_toml_config(): Represent sets as sorted lists

@jherland jherland requested a review from mknorps May 6, 2023 00:01
@jherland jherland changed the title Support multiple --pyenv options Prepare for supporting multiple --pyenv options May 6, 2023
@jherland jherland force-pushed the jherland/multiple-pyenvs branch from d03225a to 1ade7d9 Compare May 6, 2023 01:02
Collaborator

@mknorps mknorps left a comment


Thank you for splitting the PR :)

This one is ready to 🚀 . I really like the place it converged to 💟

I left some small comments that are non-blocking.

Maybe separating the meta-information into a separate member is cleaner? This also allows each resolver more control over how to organize it (for better or worse?):

PackageDebugInfo = Union[None, ...]

class Package:
    package_name: str
    resolved_with: Type["BasePackageResolver"]
    import_names: Set[str] = field(default_factory=set)
    debug_info: PackageDebugInfo = None

# Examples:
Package(
    package_name="foo",
    resolved_with=UserDefinedMapping,
    import_names={"foo", "bar", "baz"},
    debug_info={
        Location(Path("mappings.toml")): {"foo"},
        "from configured settings": {"bar", "baz"},
    },
)
Package(
    package_name="foo",
    resolved_with=LocalPackageResolver,
    import_names={"foo", "bar", "baz"},
    debug_info={
        Location(Path(".venv/lib/python3.7/site-packages")): {"foo", "bar"},
        Location(Path("__pypackages__/3.10/lib")): {"baz"},
    },
)
Package(
    package_name="foo",
    resolved_with=IdentityMapping,
    import_names={"foo"},
    debug_info=None,
)

What do you think?

I really like this design. Keeping the core functionality and the debug info separate makes the code cleaner, since for comparing imports and dependencies, package_name and import_names are the most important fields.

jherland added 8 commits May 8, 2023 15:19
The fake_venv() fixture is about to be put into service to attempt
further isolation of tests that use a default LocalPackageResolver
(i.e. one that looks at sys.path) from the virtualenv in which the
test happens to be running.

This will be done by prepending a fake_venv to sys.path, and for that
we need to know the site-packages directory created by the fake_venv.

LocalPackageResolver without a given pyenv_path defaults to looking up
packages in sys.path. This is problematic in tests, as we do not fully
control the Python environment in which the tests are run (e.g. consider
a developer running pytest directly in their dev environment where they
happen to have installed some packages that we don't expect).

Until now, we've skirted around the problem by adding a couple of TODO
comments near the relevant tests, but it's now time to solve this in a
better way. However, short of running the tests in a subprocess, we
cannot _fully_ control the sys.path. What we can do is use a
combination of our fake_venv fixture and pytest's
monkeypatch.syspath_prepend() function to insert an environment that we
_do_ control at the start of sys.path. This will allow us to place
entries into that fake_venv that we can then expect to be resolved
exactly as we expect in tests.

We introduce a new isolate_default_resolver fixture to apply this
partial isolation for a given dict of package-to-imports mappings.
At the same time we add a default_sys_path_env_for_tests dict that
encodes what we previously considered the minimal/default package
mappings that we expected (but could not guarantee) to be present in
the test environment.

With this, we can now call

  isolate_default_resolver(default_sys_path_env_for_tests)

from tests that previously relied on the surrounding test environment,
and their expectation is now guaranteed by this new fake_venv inserted at
the start of sys.path.

Note that for tests that rely on packages NOT being installed into the
test environment (e.g. testing the fallback to the identity mapping), we
must still take care to use package names that we can ~guarantee never
to be present in the environment in which the tests are run. Thus, we
are only _partially_ isolated from the surrounding test environment.
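The isolation technique described above can be illustrated outside of pytest: create a minimal site-packages directory containing only a *.dist-info entry, prepend it to sys.path (which is essentially what monkeypatch.syspath_prepend() does in the fixture), and importlib.metadata can then resolve the fake package. A sketch with an invented package name:

```python
import sys
import tempfile
from importlib.metadata import distributions
from pathlib import Path

# Build a fake site-packages dir containing only a *.dist-info directory,
# which is enough for importlib.metadata to "see" the package.
site_dir = Path(tempfile.mkdtemp())
dist_info = site_dir / "fakepkg-1.0.dist-info"
dist_info.mkdir()
(dist_info / "METADATA").write_text(
    "Metadata-Version: 2.1\nName: fakepkg\nVersion: 1.0\n"
)
(dist_info / "RECORD").write_text("")
(dist_info / "top_level.txt").write_text("fakepkg\n")

# Equivalent of monkeypatch.syspath_prepend(site_dir) in a test:
sys.path.insert(0, str(site_dir))

# Restricting the metadata search to our entry shows the fake package resolves:
found = {dist.metadata["Name"] for dist in distributions(path=[str(site_dir)])}
print("fakepkg" in found)
```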

When LocalPackageResolver is used to look up packages in the _current_
Python environment, it should take care to use the same package that
Python itself would end up using when importing a package of that name.

Until now, the way we have used importlib_metadata (specifically the way
we iterate over MetadataPathFinder().find_distributions()), if a
package occurs multiple times in sys.path, we would end up using the
_last_ instance, whereas Python would use the _first_. Fix our iteration
to be in line with Python.
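The fix described here can be sketched by iterating sys.path-style entries one at a time, so earlier entries shadow later ones — the same order Python's import system uses. (The helper and package names below are invented for the demo; this is not the actual FawltyDeps code.)

```python
import tempfile
from importlib.metadata import distributions
from pathlib import Path

def make_site_dir(version: str) -> str:
    """Create a fake site-packages dir providing package 'dupe' at the given version."""
    site = Path(tempfile.mkdtemp())
    info = site / f"dupe-{version}.dist-info"
    info.mkdir()
    (info / "METADATA").write_text(
        f"Metadata-Version: 2.1\nName: dupe\nVersion: {version}\n"
    )
    (info / "RECORD").write_text("")
    return str(site)

# Two copies of the same package, in sys.path-like priority order:
search_path = [make_site_dir("1.0"), make_site_dir("2.0")]

def first_match(name: str, path_entries):
    for entry in path_entries:  # earlier entries win, like Python's import order
        for dist in distributions(path=[entry]):
            if dist.metadata["Name"] == name:
                return dist
    return None

print(first_match("dupe", search_path).version)  # 1.0 -- the FIRST entry, not the last
```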

Until now we have designed the Package object to be able to accept
package-name-to-import-names mappings from _multiple_ resolvers
(signified with the DependenciesMapping enum). However, our current
design relies on each Package object being built by a _single_ resolver,
so the multiple mappings are never used in practice.

What would be nice, however, is to be able to attach some more debug
information to each Package, specifically pointing to where we found
the relevant package-name-to-import-names mappings represented in this
object. (This will soon become more important when we add support for
multiple --pyenv options.)

In this refactoring we reshape the Package object as follows:
- We remove the DependenciesMapping enum. It has served as a simple
  proxy for the resolver class/type used to create the Package object.
- Instead, we record the responsible class itself directly into the
  Package object, by adding a .resolved_with member of type
  Type[BasePackageResolver]. When serializing Package objects to JSON,
  this member is serialized to a simple string: the class name.
- We remove the .mappings member (which only ever has a single member in
  practice)
- Instead, we add a .debug_info member which each resolver is free to
  populate with any information (but defaults to None). The only
  real requirement is that the information be JSON-serializable.
  For now, we use these kinds of debug information:
    - UserDefinedMapping adds nothing (None) for now, but this will soon
      change to indicate from where the relevant user-defined mapping
      was read
    - LocalPackageResolver stores a dictionary mapping the site-packages
      path where a Package was found to the import names found in that
      install location. This will soon become even more useful when we
      add support for multiple --pyenv arguments, which will potentially
      merge Package definitions from multiple site-packages directories.
    - TemporaryPipInstallResolver overrides the debug_info from the
      underlying LocalPackageResolver to a simple string indicating that
      packages were temporarily installed with `pip install`.
    - IdentityMapping uses None as it has no extra information to give.

Most of the diff is occupied by associated test updates to deal with the
reshaped Package objects. Except for a very few unit tests, we don't
bother verifying the exact contents of the .debug_info member, instead
we liberally use a ignore_package_debug_info() helper to remove
.debug_info details before comparing Package objects.

All Package objects are now constructed in their fully developed form
and can be treated as frozen/immutable.

When accumulating a user-defined Package object from multiple files
(e.g. a [tool.fawltydeps.custom_mapping] section in pyproject.toml
together with a --custom-mapping-file=mapping.toml), include the
ultimate source of the mapping into the .debug_info member of the
Package object. This makes it easier to debug the source of these
user-defined mappings.

In other words, given a mapping.toml with:

  apache-airflow = ["airflow"]

and a pyproject.toml with:

  [tool.fawltydeps.custom_mapping]
  apache-airflow = ["foo", "bar"]

we used to generate this Package object:

  Package(
    package_name="apache_airflow",
    import_names={"airflow", "foo", "bar"},
    resolved_with=UserDefinedMapping,
    debug_info=None,
  )

but with this commit we instead get:

  Package(
    package_name="apache_airflow",
    import_names={"airflow", "foo", "bar"},
    resolved_with=UserDefinedMapping,
    debug_info={
      "mapping.toml": {"airflow"},
      "from settings": {"foo", "bar"},
    },
  )

Also refactor the accumulate_mappings() helper out of the
UserDefinedMapping class as we will soon start to use it from
LocalPackageResolver as well.
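The accumulate_mappings() helper mentioned in this commit message could look roughly like the following sketch (the signature and return shape are my guesses, not the actual FawltyDeps API), using the apache-airflow example above:

```python
from typing import Dict, Iterable, Set, Tuple

def accumulate_mappings(
    sources: Iterable[Tuple[str, Dict[str, Set[str]]]]
) -> Tuple[Dict[str, Set[str]], Dict[str, Dict[str, Set[str]]]]:
    """Merge {package: import_names} mappings from several sources,
    recording which source contributed which names (for .debug_info)."""
    import_names: Dict[str, Set[str]] = {}
    debug_info: Dict[str, Dict[str, Set[str]]] = {}
    for source, mapping in sources:
        for package, names in mapping.items():
            import_names.setdefault(package, set()).update(names)
            debug_info.setdefault(package, {})[source] = set(names)
    return import_names, debug_info

names, debug = accumulate_mappings([
    ("mapping.toml", {"apache-airflow": {"airflow"}}),
    ("from settings", {"apache-airflow": {"foo", "bar"}}),
])
print(sorted(names["apache-airflow"]))  # ['airflow', 'bar', 'foo']
```

Gathering everything first and only then building the (now immutable) Package object is what makes the removal of add_import_names() possible.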
@jherland jherland force-pushed the jherland/multiple-pyenvs branch from 517775d to a10e43a Compare May 8, 2023 13:25
@jherland jherland merged commit 0580d66 into main May 8, 2023
@jherland jherland deleted the jherland/multiple-pyenvs branch May 8, 2023 13:50