
Prepare for supporting multiple --pyenv options #313

Merged
merged 8 commits into main from jherland/multiple-pyenvs
May 8, 2023

Conversation

jherland
Member

@jherland jherland commented Apr 24, 2023

This was too long in the making, but I kept bumping into corners that delayed progress. (For example: I have bumped the integration between --pyenv and traverse_project to a later PR.)

Best reviewed one commit at a time. The first three commits are pure preparation for later changes, the next four commits are related fixes that I found had to be made before tackling the main topic. The following two commits have the actual changes, and the last commit fixes another (unrelated) issue that was found while adding new tests for multiple pyenvs.

Although I haven't looked at this yet, I suspect there will be conflicts with Nour's ongoing PRs (#296, #297, #298). I'll be happy to rebase on top of those, and resolve the conflicts, but I wanted to get this PR up for review without further delay. Edit: Rebased now on top of origin/main after #296 was merged. Conflicts were fortunately smaller than I had feared. 🎉

Commits:

  • noxfile: Silence unhelpful pylint warning in tests
  • packages.LocalPackageResolver.packages: Use @calculated_once decorator
  • conftest: Return site_dir in addition to venv_dir from fake_venv()
  • conftest: Partially isolate the default LocalPackageResolver
  • packages: Replace DependenciesMapping with string descriptions
  • LocalPackageResolver: Resolve packages using first match in sys.path
  • package.UserDefinedMapping: Include source of mapping in Package objects
  • LocalPackageResolver: Accept multiple pyenvs
  • Add support for multiple --pyenv options
  • settings.print_toml_config(): Represent sets as sorted lists

Still TODO in future PRs:

Collaborator

@mknorps mknorps left a comment


There is a lot of good stuff here! I see you had to do a lot of refactoring to arrive at a point where it is possible to implement the multiple --pyenv option.

I would probably be happier to see it split into two or three separate PRs; the thing that keeps me focused in the current setup is the logical and perfectly described individual commits ❤️.

So I will just start with the first few commits that are in the preparation phase of the multiple --pyenv option implementation.

First three are straightforward and I have no comments.

11ef514 and 94ca571 introduce better separation of the environment that may be used in tests. Here I wonder if we should not use this feature also for the sample and real-world projects integration tests.

I do not agree with 8ace2c1.

The only purpose of the DependenciesMapping enum is to differentiate the various types of mappings stored inside a Package object,

Yes. And this is an important purpose. If you want to know what different mappings we use, you check the DependenciesMapping object. If you want to add yet another mapping, you would do it there. In this PR you exchange a strict definition provided in a type for a string-based description. This is not a good idea in my opinion: for debugging you need to parse the string back into the correct mapping type, and for refactoring you do not have support from a type checker.

I will post some ideas on how to tackle the need to have a location of a mapping in the next comment :)

Collaborator

@mknorps mknorps left a comment


To have a way to include location in the dependency mapping, we could use an object with location included for example:

@dataclass(eq=True, frozen=True, order=True)
class DependencyMappingEntry:
    mapping_type: DependenciesMapping
    source: Location
    items: Set[str]

and then for a package:

  Package(
    "foo",
    {
      DependencyMappingEntry(DependenciesMapping.LOCAL_ENV, Location("path1"), {"bar", "baz"}),
      DependencyMappingEntry(DependenciesMapping.LOCAL_ENV, Location("path2"), {"blob"}),
    }
  )

Second idea. Given that:

However, the way we are currently structuring our resolver stack, we
will never mix mappings from different resolvers inside a single
Package object.

we could add resolved_with to package object:

class Package:
    """Encapsulate an installable Python package.

    This encapsulates the mapping between a package name (i.e. something you can
    pass to `pip install`) and the import names that it provides once it is
    installed.
    """

    package_name: str
    mappings: Dict[Location, Set[str]] = field(default_factory=dict) # here instead of Location we can have a string representing path
    import_names: Set[str] = field(default_factory=set)
    resolved_with: Optional[DependenciesMapping] = None

@jherland
Member Author

jherland commented May 2, 2023

I would probably be happier to see it split into two or three separate PRs; the thing that keeps me focused in the current setup is the logical and perfectly described individual commits.

It looks like I should probably split the PR right before ec726e6 (LocalPackageResolver: Accept multiple pyenvs), so that we can properly discuss and land the preparatory patches before introducing the actual feature (multiple --pyenvs)?

11ef514 and 94ca571 introduce better separation of the environment that may be used in tests. Here I wonder if we should not use this feature also for the sample and real-world projects integration tests.

That's a good idea. We already do isolate these sample/real project tests from the current environment by using the requirements = [ ... ] directive in the TOML files and the CachedExperimentVenv helper, and using a fake_venv instead of a (cached) real venv would probably be a bit faster. That said, I like having the real_projects run in a scenario that is as close to a real user scenario as we can get it, i.e. with limited use of test fakes.

With 8ace2c1 I do not agree.

The only purpose of the DependenciesMapping enum is to differentiate the various types of mappings stored inside a Package object,

Yes. And this is an important purpose.

Ok, we can certainly revisit the design here.

To have a way to include location in the dependency mapping, we could use an object with location included for example:

@dataclass(eq=True, frozen=True, order=True)
class DependencyMappingEntry:
    mapping_type: DependenciesMapping
    source: Location
    items: Set[str]

and then for a package:

  Package(
    "foo",
    {
      DependencyMappingEntry(DependenciesMapping.LOCAL_ENV, Location("path1"), {"bar", "baz"}),
      DependencyMappingEntry(DependenciesMapping.LOCAL_ENV, Location("path2"), {"blob"}),
    }
  )

I'm not sure that source: Location makes sense for all mappings (e.g. "User-defined mapping from settings" which is not always from pyproject.toml but could also be from the $fawltydeps_custom_mapping environment variable, or a LocalPackageResolver pointing at a sys.path with many entries).

I also wanted to keep the option open for future resolvers to do things differently (although I could easily be persuaded to leave that as a problem for future-us, and rather solve the problem that is in front of us right now).

Second idea. Given that:

the way we are currently structuring our resolver stack, we
will never mix mappings from different resolvers inside a single
Package object.

we could add resolved_with to package object:

class Package:
    package_name: str
    mappings: Dict[Location, Set[str]] = field(default_factory=dict) # here instead of Location we can have a string representing path
    import_names: Set[str] = field(default_factory=set)
    resolved_with: Optional[DependenciesMapping] = None

I think I like this idea better. But I'd still want to get rid of the DependenciesMapping enum, as I struggle to see any real value added by it (it functions only as a proxy for the resolver type itself, AFAICS). Could we use the relevant class/type directly to identify its source (i.e. change Optional[DependenciesMapping] into Type[BasePackageResolver])?

Also, I'm growing slightly weary of the duplication between .mappings and .import_names. The .import_names is what we use in practice, and the extra information in .mappings is there mostly for debugging/documentation. Maybe remove .mappings and convert .import_names into a "set-like" container that optionally allows mapping to a resolver-specific source or description? By "set-like", I mean a container that provides quick lookup ("foobar" in package.import_names), set intersection (set.intersection(imported_names, package.import_names)), and iteration (for name in package.import_names). This can be satisfied both by a proper set, and by a dict where the keys are the import names and the values are the associated source/description.

In other words:

class Package:
    package_name: str
    resolved_with: Type[BasePackageResolver]  # The serialization of this in JSON output would be the class' .__name__
    import_names: Union[Set[str], Dict[str, Something]]  # Set-like collection of import names with resolver-specific metadata.

(The only limitation of the Something dict-values is that they would have to be serializable to JSON, hence Any is not a good choice...)
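The "set-like" requirement above is easy to check in isolation: a plain dict whose keys are the import names already supports membership tests, intersection (via its keys view), and iteration. A minimal sketch (the example names and sources are made up for illustration):

```python
# A dict keyed by import names behaves "set-like" for all three needs:
import_names = {"foo": "mapping.toml", "bar": "from settings"}  # name -> source/description

print("foo" in import_names)                         # quick lookup
print(sorted({"foo", "baz"} & import_names.keys()))  # set intersection via the keys view
print(sorted(name for name in import_names))         # iteration over names
```

Note that `set.intersection()` also accepts any iterable, so `set.intersection(imported_names, import_names)` works unchanged whether `import_names` is a set or a dict.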

Resolvers such as IdentityMapping don't really have any useful metadata to attach here, so they would use the simple set:

class IdentityMapping(BasePackageResolver):
    def lookup_package(self, package_name: str) -> Package:
        return Package(
            package_name,
            resolved_with=IdentityMapping,
            import_names={Package.normalize_name(package_name)},
        )

But the more complex resolvers could themselves choose how/what metadata to attach to each import name...

@mknorps
Collaborator

mknorps commented May 5, 2023

Second idea. Given that:

the way we are currently structuring our resolver stack, we
will never mix mappings from different resolvers inside a single
Package object.

we could add resolved_with to package object:

class Package:
    package_name: str
    mappings: Dict[Location, Set[str]] = field(default_factory=dict) # here instead of Location we can have a string representing path
    import_names: Set[str] = field(default_factory=set)
    resolved_with: Optional[DependenciesMapping] = None

I think I like this idea better. But I'd still want to get rid of the DependenciesMapping enum, as I struggle to see any real value added by it (it functions only as a proxy for the resolver type itself, AFAICS). Could we use the relevant class/type directly to identify its source (i.e. change Optional[DependenciesMapping] into Type[BasePackageResolver])?

I like this idea! Theoretically a mapping is a more general concept, and I can imagine that a single mapping may be resolved with several resolvers, but this is a very unlikely abstraction. I think changing from Optional[DependenciesMapping] to Type[BasePackageResolver] is enough.
It is good that we can get rid of the Optional that way.


Consider everything written below (along with your suggestion to get rid of the duplication between .mappings and .import_names) as a good improvement that may be done in a separate PR.

In fact, the condition that we may use only one BasePackageResolver in the process, together with what you wrote:

Also, I'm growing slightly weary of the duplication between .mappings and .import_names. The .import_names is what we use in practice, and the extra information in .mappings is there mostly for debugging/documentation. Maybe remove .mappings and convert .import_names into a "set-like" container that optionally allows mapping to a resolver-specific source or description? By "set-like", I mean a container that provides quick lookup ("foobar" in package.import_names), set intersection (set.intersection(imported_names, package.import_names)), and iteration (for name in package.import_names). This can be satisfied both by a proper set, and by a dict where the keys are the import names and the values are the associated source/description.

In other words:

class Package:
    package_name: str
    resolved_with: Type[BasePackageResolver]  # The serialization of this in JSON output would be the class' .__name__
    import_names: Union[Set[str], Dict[str, Something]]  # Set-like collection of import names with resolver-specific metadata.

(The only limitation of the Something dict-values is that they would have to be serializable to JSON, hence Any is not a good choice...)

makes me think that we could go one step further and not mutate Package with add_import_names, but instead give those import names at the construction of the Package object. In fact, this is what we do in every test. We could remove the add_import_names method so there is no easy way to modify a Package after it has been resolved.
Then we would need to handle accumulate_mappings a bit differently: first gather all the dictionary elements, then create the Package object.
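Constructing Package fully formed and removing add_import_names() could be enforced with a frozen dataclass — a sketch of the idea, not the actual FawltyDeps code:

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class Package:
    package_name: str
    # A frozenset also guards the contents, not just the attribute binding.
    import_names: frozenset = frozenset()

pkg = Package("apache-airflow", frozenset({"airflow"}))
try:
    pkg.import_names = frozenset({"other"})  # mutation after construction is rejected
except FrozenInstanceError:
    print("Package is immutable")
```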

As for the deduplication of .mappings and .import_names,

 import_names: Set[str]

would probably be the best option. But if we need to take into account that each name in import_names may have come from a different source (for example, a file in a custom mapping), then we may just do something like:

class Package:
    package_name: str
    resolved_with: Type[BasePackageResolver]  # The serialization of this in JSON output would be the class' .__name__
    mappings: Set[Tuple[str, str]]  # first element of the tuple is the import name, the second is the description of its source

    @property
    def import_names(self):
        return {name for name,_ in self.mappings}
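A quick sketch of that last variant in action (dropping resolved_with to keep it self-contained; the package and source names are from the earlier example, and this is an illustration, not the actual implementation):

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple

@dataclass(frozen=True)
class Package:
    package_name: str
    # (import name, description of its source) pairs
    mappings: FrozenSet[Tuple[str, str]] = frozenset()

    @property
    def import_names(self) -> Set[str]:
        # Derive the plain set of import names from the annotated pairs.
        return {name for name, _ in self.mappings}

pkg = Package(
    "apache-airflow",
    frozenset({("airflow", "mapping.toml"), ("foo", "from settings")}),
)
print(sorted(pkg.import_names))  # ['airflow', 'foo']
```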

@jherland
Member Author

jherland commented May 5, 2023

I think we're converging on the same (or very similar) designs here 😄. As of right now, this is the Package definition I'm working with:

ExtraImportInfo = Union[None, Location, str]

class Package:
    package_name: str
    resolved_with: Type["BasePackageResolver"]
    import_names: Dict[str, ExtraImportInfo] = field(default_factory=dict)

IMHO, an import_names dictionary where the keys are the import names is functionally equivalent to an .import_names set.

Examples of Package objects would then look like:

Package(
    package_name="foo",
    resolved_with=UserDefinedMapping,
    import_names={
        "foo": Location(Path("mappings.toml")),
        "bar": Location(Path("pyproject.toml")),
        "baz": Location(Path("pyproject.toml")),
    },
)
Package(
    package_name="foo",
    resolved_with=LocalPackageResolver,
    import_names={
        "foo": Location(Path(".venv/lib/python3.7/site-packages")),
        "bar": Location(Path(".venv/lib/python3.7/site-packages")),
        "baz": Location(Path("__pypackages__/3.10/lib")),
    },
)
Package(
    package_name="foo",
    resolved_with=IdentityMapping,
    import_names={
        "foo": None,
    },
)

I'm only 95% happy with this design, though, as there is often duplication between the ExtraImportInfo values in a Package object with multiple import names. Furthermore, for a Package with zero import names, there is no good place to attach any meta-information at all.

Maybe separating the meta-information into a separate member is cleaner? This also allows each resolver more control over how to organize it (for better or worse?):

PackageDebugInfo = Union[None, ...]

class Package:
    package_name: str
    resolved_with: Type["BasePackageResolver"]
    import_names: Set[str] = field(default_factory=set)
    debug_info: PackageDebugInfo = None

# Examples:
Package(
    package_name="foo",
    resolved_with=UserDefinedMapping,
    import_names={"foo", "bar", "baz"},
    debug_info={
        Location(Path("mappings.toml")): {"foo"},
        "from configured settings": {"bar", "baz"},
    },
)
Package(
    package_name="foo",
    resolved_with=LocalPackageResolver,
    import_names={"foo", "bar", "baz"},
    debug_info={
        Location(Path(".venv/lib/python3.7/site-packages")): {"foo", "bar"},
        Location(Path("__pypackages__/3.10/lib")): {"baz"},
    },
)
Package(
    package_name="foo",
    resolved_with=IdentityMapping,
    import_names={"foo"},
    debug_info=None,
)

What do you think?


Otherwise, I agree with your point on making Package objects immutable. I'll try to get that done in my ongoing refactoring.

@jherland jherland force-pushed the jherland/multiple-pyenvs branch from 5e5904a to d03225a Compare May 5, 2023 23:44
@jherland
Member Author

jherland commented May 6, 2023

Finished refactoring for now, and split up the PR as discussed above. Here is a summary of the changes:

Initial commits, unchanged:

  • noxfile: Silence unhelpful pylint warning in tests
  • packages.LocalPackageResolver.packages: Use @calculated_once decorator
  • conftest: Return site_dir in addition to venv_dir from fake_venv()
  • conftest: Partially isolate the default LocalPackageResolver

Refactored/changed/new commits:

  • LocalPackageResolver: Resolve packages using first match in sys.path
    • Unchanged in spirit from above, but moved before some other refactoring to keep tests green throughout the series
  • packages: Refactor Package to add more debugging information
    • Completely rewritten according to the above discussion
  • packages: Remove unused Package.add_import_names() method
    • New commit, based on suggestion from @mknorps above.
  • package.UserDefinedMapping: Include source of mapping in Package objects
    • Unchanged in spirit, but affected by above rework.

Dropped commits, these will be moved to a new PR:

  • LocalPackageResolver: Accept multiple pyenvs
  • Add support for multiple --pyenv options
  • settings.print_toml_config(): Represent sets as sorted lists

@jherland jherland requested a review from mknorps May 6, 2023 00:01
@jherland jherland changed the title Support multiple --pyenv options Prepare for supporting multiple --pyenv options May 6, 2023
@jherland jherland force-pushed the jherland/multiple-pyenvs branch from d03225a to 1ade7d9 Compare May 6, 2023 01:02
Collaborator

@mknorps mknorps left a comment


Thank you for splitting the PR :)

This one is ready to 🚀 . I really like the place it converged to 💟

I left some small comments that are non-blocking.

Maybe separating the meta-information into a separate member is cleaner? This also allows each resolver more control over how to organize it (for better or worse?):

PackageDebugInfo = Union[None, ...]

class Package:
    package_name: str
    resolved_with: Type["BasePackageResolver"]
    import_names: Set[str] = field(default_factory=set)
    debug_info: PackageDebugInfo = None

# Examples:
Package(
    package_name="foo",
    resolved_with=UserDefinedMapping,
    import_names={"foo", "bar", "baz"},
    debug_info={
        Location(Path("mappings.toml")): {"foo"},
        "from configured settings": {"bar", "baz"},
    },
)
Package(
    package_name="foo",
    resolved_with=LocalPackageResolver,
    import_names={"foo", "bar", "baz"},
    debug_info={
        Location(Path(".venv/lib/python3.7/site-packages")): {"foo", "bar"},
        Location(Path("__pypackages__/3.10/lib")): {"baz"},
    },
)
Package(
    package_name="foo",
    resolved_with=IdentityMapping,
    import_names={"foo"},
    debug_info=None,
)

What do you think?

I really like this design. Keeping the core functionality and the debug info separate makes the code cleaner, since for comparing imports and dependencies, package_name and import_names are the most important fields.

jherland added 8 commits May 8, 2023 15:19
The fake_venv() fixture is about to be put into service to attempt
further isolation of tests that use a default LocalPackageResolver
(i.e. one that looks at sys.path) from the virtualenv in which the
test happens to be running.

This will be done by prepending a fake_venv to sys.path, and for that
we need to know the site-packages directory created by the fake_venv.

LocalPackageResolver without a given pyenv_path defaults to looking up
packages in sys.path. This is problematic in tests, as we do not fully
control the Python environment in which the tests are run (e.g. consider
a developer running pytest directly in their dev environment where they
happen to have installed some packages that we don't expect).

Until now, we've skirted around the problem by adding a couple of TODO
comments near the relevant tests, but it's now time to solve this in a
better way. However, short of running the tests in a subprocess, we
cannot _fully_ control the sys.path. What we can do is use a
combination of our fake_venv fixture and pytest's
monkeypatch.syspath_prepend() function to insert an environment that we
_do_ control at the start of sys.path. This will allow us to place
entries into that fake_venv that we can then expect to be resolved
exactly as we expect in tests.

We introduce a new isolate_default_resolver fixture to apply this
partial isolation for a given dict of package-to-imports mappings.
At the same time we add a default_sys_path_env_for_tests dict that
encodes what we previously considered the minimal/default package
mappings that we expected (but could not guarantee) to be present in
the test environment.

With this, we can now call

  isolate_default_resolver(default_sys_path_env_for_tests)

from tests that previously relied on the surrounding test environment,
and their expectation is now guaranteed by this new fake_venv inserted at
the start of sys.path.

Note that for tests that rely on packages NOT being installed into the
test environment (e.g. testing the fallback to the identity mapping), we
must still take care to use package names that we can ~guarantee never
to be present in the environment in which the tests are run. Thus, we
are only _partially_ isolated from the surrounding test environment.
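The isolation technique described above can be illustrated outside of pytest: create a minimal site-packages directory containing only a *.dist-info entry, prepend it to sys.path (which is essentially what monkeypatch.syspath_prepend() does in the fixture), and importlib.metadata can then resolve the fake package. A sketch with an invented package name:

```python
import sys
import tempfile
from importlib.metadata import distributions
from pathlib import Path

# Build a fake site-packages dir containing only a *.dist-info directory,
# which is enough for importlib.metadata to "see" the package.
site_dir = Path(tempfile.mkdtemp())
dist_info = site_dir / "fakepkg-1.0.dist-info"
dist_info.mkdir()
(dist_info / "METADATA").write_text(
    "Metadata-Version: 2.1\nName: fakepkg\nVersion: 1.0\n"
)
(dist_info / "RECORD").write_text("")
(dist_info / "top_level.txt").write_text("fakepkg\n")

# Equivalent of monkeypatch.syspath_prepend(site_dir) in a test:
sys.path.insert(0, str(site_dir))

# Restricting the metadata search to our entry shows the fake package resolves:
found = {dist.metadata["Name"] for dist in distributions(path=[str(site_dir)])}
print("fakepkg" in found)
```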

When LocalPackageResolver is used to look up packages in the _current_
Python environment, it should take care to use the same package that
Python itself would end up using when importing a package of that name.

Until now, the way we have used importlib_metadata (specifically the way
we iterate over MetadataPathFinder().find_distributions()), if a
package occurs multiple times in sys.path, we would end up using the
_last_ instance, whereas Python would use the _first_. Fix our iteration
to be in line with Python.
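The fix described here can be sketched by iterating sys.path-style entries one at a time, so earlier entries shadow later ones — the same order Python's import system uses. (The helper and package names below are invented for the demo; this is not the actual FawltyDeps code.)

```python
import tempfile
from importlib.metadata import distributions
from pathlib import Path

def make_site_dir(version: str) -> str:
    """Create a fake site-packages dir providing package 'dupe' at the given version."""
    site = Path(tempfile.mkdtemp())
    info = site / f"dupe-{version}.dist-info"
    info.mkdir()
    (info / "METADATA").write_text(
        f"Metadata-Version: 2.1\nName: dupe\nVersion: {version}\n"
    )
    (info / "RECORD").write_text("")
    return str(site)

# Two copies of the same package, in sys.path-like priority order:
search_path = [make_site_dir("1.0"), make_site_dir("2.0")]

def first_match(name: str, path_entries):
    for entry in path_entries:  # earlier entries win, like Python's import order
        for dist in distributions(path=[entry]):
            if dist.metadata["Name"] == name:
                return dist
    return None

print(first_match("dupe", search_path).version)  # 1.0 -- the FIRST entry, not the last
```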

Until now we have designed the Package object to be able to accept
package-name-to-import-names mappings from _multiple_ resolvers
(signified with the DependenciesMapping enum). However, our current
design relies on each Package object being built by a _single_ resolver,
so the multiple mappings are never used in practice.

What would be nice, however, is to be able to attach some more debug
information to each Package, specifically pointing to where we found
the relevant package-name-to-import-names mappings represented in this
object. (This will soon become more important when we add support for
multiple --pyenv options.)

In this refactoring we reshape the Package object as follows:
- We remove the DependenciesMapping enum. It has served as a simple
  proxy for the resolver class/type used to create the Package object.
- Instead, we record the responsible class itself directly into the
  Package object, by adding a .resolved_with member of type
  Type[BasePackageResolver]. When serializing Package objects to JSON,
  this member is serialized to a simple string: the class name.
- We remove the .mappings member (which only ever has a single member in
  practice)
- Instead, we add a .debug_info member which each resolver is free to
  populate with any information (but defaults to None). The only
  real requirement is that the information be JSON-serializable.
  For now, we use these kinds of debug information:
    - UserDefinedMapping adds nothing (None) for now, but this will soon
      change to indicate from where the relevant user-defined mapping
      was read
    - LocalPackageResolver stores a dictionary mapping the site-packages
      path where a Package was found to the import names found in that
      install location. This will soon become even more useful when we
      add support for multiple --pyenv arguments, which will potentially
      merge Package definitions from multiple site-packages directories.
    - TemporaryPipInstallResolver overrides the debug_info from the
      underlying LocalPackageResolver to a simple string indicating that
      packages were temporarily installed with `pip install`.
    - IdentityMapping uses None as it has no extra information to give.

Most of the diff is occupied by associated test updates to deal with the
reshaped Package objects. Except for a very few unit tests, we don't
bother verifying the exact contents of the .debug_info member, instead
we liberally use a ignore_package_debug_info() helper to remove
.debug_info details before comparing Package objects.

All Package objects are now constructed in their fully developed form
and can be treated as frozen/immutable.

When accumulating a user-defined Package object from multiple files
(e.g. a [tool.fawltydeps.custom_mapping] section in pyproject.toml
together with a --custom-mapping-file=mapping.toml), include the
ultimate source of the mapping into the .debug_info member of the
Package object. This makes it easier to debug the source of these
user-defined mappings.

In other words, given a mapping.toml with:

  apache-airflow = ["airflow"]

and a pyproject.toml with:

  [tool.fawltydeps.custom_mapping]
  apache-airflow = ["foo", "bar"]

we used to generate this Package object:

  Package(
    package_name="apache_airflow",
    import_names={"airflow", "foo", "bar"},
    resolved_with=UserDefinedMapping,
    debug_info=None,
  )

but with this commit we instead get:

  Package(
    package_name="apache_airflow",
    import_names={"airflow", "foo", "bar"},
    resolved_with=UserDefinedMapping,
    debug_info={
      "mapping.toml": {"airflow"},
      "from settings": {"foo", "bar"},
    },
  )

Also refactor the accumulate_mappings() helper out of the
UserDefinedMapping class as we will soon start to use it from
LocalPackageResolver as well.
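The accumulate_mappings() helper mentioned in this commit message could look roughly like the following sketch (the signature and return shape are my guesses, not the actual FawltyDeps API), using the apache-airflow example above:

```python
from typing import Dict, Iterable, Set, Tuple

def accumulate_mappings(
    sources: Iterable[Tuple[str, Dict[str, Set[str]]]]
) -> Tuple[Dict[str, Set[str]], Dict[str, Dict[str, Set[str]]]]:
    """Merge {package: import_names} mappings from several sources,
    recording which source contributed which names (for .debug_info)."""
    import_names: Dict[str, Set[str]] = {}
    debug_info: Dict[str, Dict[str, Set[str]]] = {}
    for source, mapping in sources:
        for package, names in mapping.items():
            import_names.setdefault(package, set()).update(names)
            debug_info.setdefault(package, {})[source] = set(names)
    return import_names, debug_info

names, debug = accumulate_mappings([
    ("mapping.toml", {"apache-airflow": {"airflow"}}),
    ("from settings", {"apache-airflow": {"foo", "bar"}}),
])
print(sorted(names["apache-airflow"]))  # ['airflow', 'bar', 'foo']
```

Gathering everything first and only then building the (now immutable) Package object is what makes the removal of add_import_names() possible.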
@jherland jherland force-pushed the jherland/multiple-pyenvs branch from 517775d to a10e43a Compare May 8, 2023 13:25
@jherland jherland merged commit 0580d66 into main May 8, 2023
@jherland jherland deleted the jherland/multiple-pyenvs branch May 8, 2023 13:50