Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Traverse refactor, part 1 #288

Merged
merged 11 commits into from
Apr 14, 2023
Merged

Traverse refactor, part 1 #288

merged 11 commits into from
Apr 14, 2023

Conversation

jherland
Copy link
Member

@jherland jherland commented Mar 31, 2023

This is the first part of several with the ultimate goal to:

This part of the bigger refactoring, however, does not do any of that.

There should in fact be ~zero observable difference before and after this PR (modulo some changed INFO log messages). This PR is all about providing a single traversal structure that will more easily enable the goals above to be done in future PRs.

In graphical terms, this PR tries to implement this restructuring:
traverse_refactor_step1

Probably best viewed on a per-commit basis.

Commits:

  • sample_project/no_issues: Add a dummy package inside the bundled venv
  • Introduce traverse_project module
  • extract_imports: Add new parse functions
  • extract_declared_dependencies: Add new parse functions
  • test_cmdline: Unify testing of log messages
  • main: Use new traversal logic and new parsers
  • test_cmdline: Remove unnecessary asserts for log messages
  • extract_imports: Remove obsolete API for traversing files/dirs
  • extract_declared_dependencies: Remove obsolete API

@jherland jherland added this to the Mapping strategy milestone Mar 31, 2023
@jherland jherland self-assigned this Mar 31, 2023
@jherland jherland requested review from mknorps and Nour-Mws March 31, 2023 22:51
Copy link
Collaborator

@mknorps mknorps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good step towards the traverse logic 😍

Thank you for writing informative commit messages, they are an easy read forming a story :) Also, the graph makes it easy to understand the main idea behind the PR!

Remarks

The thing I would like you to look at closer is the data flow f the CodeSource and DepsSource.

CodeSource

The problem is that the validation step for CodeSource is done twice in is_valid_code_source and then again in parse_source. It would be more readable and less error-prone to pass an already validated source to parse_source and make it on the type level. In that case, we could use pydantic-type validation in the CodeSource object, which has either a path to a particular file or stdin. In pydantic it would look like this:

class CodeSource(BaseModel):
    """A Python code source to be parsed for imports statements."""

    path: PathOrSpecial
    base_dir: Optional[Path] = None

    @validator("base_dir")
    def extension_must_be_in_parsable_set(cls, v)
        if v is not None and v.suffix not in {".py", ".ipynb"+
            raise UnparsablePathException(ctx="Supported formats are .py and .ipynb; Cannot parse code", path=base_dir)
        return v

and the corresponding section in is_valid_code_source will change to:

    if path.is_file():
        return CodeSource(path, base_dir)

That way we will not face redoing the validation in parse_source.

DepsSource

As for DepsSource we could also move the whole validation section for a single file to the DepsSource object.

Other remarks

Also from what I understand, both DepsSource and CodeSource path parts are actually files, never directories. That could also be either checked at the object creation validation stage or informed by a comment.

@jherland
Copy link
Member Author

@mknorps writes:

The thing I would like you to look at closer is the data flow f the CodeSource and DepsSource.

CodeSource

The problem is that the validation step for CodeSource is done twice in is_valid_code_source and then again in parse_source. It would be more readable and less error-prone to pass an already validated source to parse_source and make it on the type level.

I refactored CodeSource in f074e77, although I did not go all the way to pydantic for validation (we can do that later in a separate PR), instead - since it's already a dataclass - I wrote a __post_init__() to validate the invariants. The result does help remove some duplication between validate_code_source() and parse_source(), although mypy insists that some assertions remain...

DepsSource

As for DepsSource we could also move the whole validation section for a single file to the DepsSource object.

Yes, I attempted a similar/symmetrical DepsSource refactoring in 8a36a36. This was a bigger change, as I had to change some APIs, and how we store/communicate the parser choice. However, I think the result is much better/cleaner: The choice of deps parser is now clearly made in one place, and is stored in the DepsSource object for the remainder of its lifetime.

When re-reviewing, it's probably better to review commit-by-commit, starting from
3346cdf, as the overall diff is quite impacted by the new refactoring. I'll wait to rebase/resolve conflicts until the review is done.

@jherland jherland requested a review from mknorps April 13, 2023 10:39
Copy link
Collaborator

@mknorps mknorps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very good. My main concern was double validating sources, and you managed to work around it with __post_init__ 🎉

Kudos for structuring the changes after PR comments, it was an easy read, not overwhelming, and it was easy to continue the review.

🚀 🚀 🚀

jherland added 11 commits April 14, 2023 15:34
This will be used in traverse tests to verify that we don't traverse
into dot directories.
This module will take over the traversal of the project
files/directories specified in the Settings object, and generate sources
that will then be processed other parts of FawltyDeps, primarily
extract_imports and extract_declared_dependencies.

We start by reimplementing the file/directory traversal that currently
lives in the extract_imports and extract_declared_dependencies modules,
except that we focus on traversing the project _once_ and generating
CodeSource and DepsSource objects that will later be passed on to the
extract_imports and extract_declared_dependencies modules, respectively.

For now, the new code is only used by a test suite that exercises
various traversal scenarios on the projects under tests/sample_projects.
The new parse_sources() and parse_source() functions will soon replace
the older functions that operate directly on files/dirs instead of
CodeSource objects (parse_any_args(), parse_any_arg(), parse_dir()).

Rephrase test cases for the functions that are about to go away to use
the new parse functions instead. (Remove some test cases already covered
better by test_traverse_project.)
The new parse_sources() and parse_source() functions will soon replace
the older functions that operate directly on files/dirs instead of
DepsSource objects (extract_declared_dependencies() and
extract_declared_dependencies_from_path()).

Rephrase test cases for the functions that are about to go away to use
the new parse functions instead. (Remove some test cases already covered
better by test_traverse_project.)
The _ordering_ of log messages is not something we should be too strict
about in tests, as it just makes it harder to refactor our main logic
without having to deal with lots of "collateral diff damage" in our
tests.

Establish a common pattern across all our command-line tests for
verifying expected log messages: We declare a list of expected log
message lines, we then split the stderr into lines and use the
assert_unordered_equivalence() helper to communicate that the _ordering_
of log messages is not important.
Use the new traverse_project module to find Source objects from the
given settings, and pass those on to the new parsing interface in
extract_imports and extract_declared_dependencies.
extract_declared_dependencies_from_path() and
extract_declared_dependencies() have now been replaced by
parse_source() and parser_sources(), respectively.
Use CodeSource.__post_init__() to verify invariants about the CodeSource
object, instead of doing this in validate_code_parse() and then
re-asserting the same in parse_source().

Some asserts remain in order to appease Mypy.

Suggested-by: Maria Knorps <[email protected]>
In preparation for storing parser choice inside DepsSource.
The parser choice was passed into both validate_deps_source() (to help
select which files become DepsSource objects), and into parse_source()
(to select the parser to be used for each DepsSource object). This is
redundant when we instead can record the parser_choice as part of the
DepsSource object.
@jherland jherland force-pushed the jherland/traverse-refactor-1 branch from 250d980 to e5835ec Compare April 14, 2023 15:22
@jherland
Copy link
Member Author

When rebasing this onto origin/main I encountered some conflicts with #271, and as part of that I ended up doing a more thorough cleanup of tests/test_cmdline.py than I had at first expected.

The end result was no longer recognizable as 99c2c96 ("test_cmdline: Remove unnecessary asserts for log messages"), so I ended up removing that commit from this PR and spinning it off into a separate PR (to be posted as soon as this is merged).

@jherland jherland merged commit 2204b74 into main Apr 14, 2023
@jherland jherland deleted the jherland/traverse-refactor-1 branch April 14, 2023 16:55
@jherland jherland mentioned this pull request Apr 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants