Skip to content

Use patience diff algorithm #891

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
May 23, 2025
Merged

Use patience diff algorithm #891

merged 11 commits into from
May 23, 2025

Conversation

weswigham
Copy link
Member

@weswigham weswigham commented May 21, 2025

This solves the perf issue with test diffing seen in #791

The patience algorithm is suitable for use for all test diffs, but it makes some subtly different choices in edit order in some cases, and is whitespace-insensitive, so will result in some whitespace changes in the diffs.

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham
@weswigham weswigham requested review from jakebailey and Copilot May 21, 2025 00:12
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR aims to improve performance in test diffing by using an alternate histogram diff algorithm for larger diffs. Changes include:

  • Replacing diff.Text with a custom diffText function that selects the diff algorithm based on the size of the test output.
  • Introducing new helper types and functions (inputPair, intoEditScript, and addRange) to facilitate the new diff algorithm.
  • Adding a new dependency on github.com/octavore/delta in go.mod.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
internal/testutil/baseline/baseline.go Introduces the new diffText function and associated helpers for histogram diffing.
go.mod Adds dependency on github.com/octavore/delta v0.6.1.
Comments suppressed due to low confidence (1)

internal/testutil/baseline/baseline.go:241

  • The function diffText references the 'strings' package without importing it; add 'import "strings"' to resolve the missing dependency.
if math.Abs(float64(strings.Count(expected, "\n")-strings.Count(actual, "\n"))) < 10000 {

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham
@jakebailey
Copy link
Member

I tried this out by disabling the diff.Text bit just to see, and it seems like some of the ranges might be off:

image

Which I can rationalize as "starting at -1 in the original because the first line is missing", but maybe the old algorithm isn't great either.

I'm not attached to the pkg/diff lib if there's another one out there, it's just reasonably popular and in an org I trust (pkg).

@weswigham
Copy link
Member Author

some of the ranges might be off:

In the existing diff.Text one? Yeah. I'm pretty sure they're just wrong? (--1 cannot possibly be right.) But it's not too important because it's just omitted context lines, not actual diff lines - AFAIK nothing uses that information for anything? Certainly not us, where these diffs are only for human readers. I didn't bother figuring out how to replicate that bugginess in the delta->edits mapping code I wrote - matching the context lines (for baselines where both algorithms make the same edit spans) seemed good enough.

@jakebailey
Copy link
Member

FWIW this seems fine to me with the nits I've mentioned fixed, though I wish we could use a more popular dependency to do the same thing.

I've been meaning to send a PR to pkg/diff to fix some of the memory problems, which would also probably help.

@jakebailey
Copy link
Member

Actually, maybe https://pkg.go.dev/github.com/peter-evans/patience would work even better?

@jakebailey
Copy link
Member

Or, technically we also pull in github.com/pmezard/go-difflib via moq, but patience I think is now the "best" differ?

@weswigham
Copy link
Member Author

Actually, maybe https://pkg.go.dev/github.com/peter-evans/patience would work even better?
Or, technically we also pull in github.com/pmezard/go-difflib via moq, but patience I think is now the "best" differ?

The histogram algorithm is supposed to be an improvement over the patience algorithm that uses a heuristic to skip matching high-frequency lines, iirc. Makes it more performant than the original meyers algorithm, even, on most code inputs (generally - no idea about this implementation without doing benchmarking). Same origin. The question for us, assuming that all options fix the quadratic memory issue from diff and have acceptable perf when swapped in, is just which dependency we'd rather take.

@jakebailey
Copy link
Member

is just which dependency we'd rather take.

I'm biased towards whichever one involves less code on our part; that's the only real concern I have, just the code that was added here in order to use the new lib compared to the old one.

And if we do switch, I don't mind biting the bullet and just using it for everything. There's not many PRs open that couldn't just re-accept baselines.

weswigham added 6 commits May 23, 2025 10:28

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham
@weswigham weswigham changed the title Use alternate diff algorithm for larger diffs Use patience diff algorithm May 23, 2025
@jakebailey
Copy link
Member

I've measured this before/after, and this PR halves the time the tests spend diffing, more than halves the memory allocated in bytes, and halves the number of allocs. That and presumably, makes the other PR not time out!

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham
@jakebailey
Copy link
Member

It's impossible to comment on the PR, but we don't need the io.Writer anymore given the output is a []byte, which is what the caller wants anyway.

@jakebailey
Copy link
Member

jakebailey commented May 23, 2025

I said []byte, but I think it's just string.

So:

diff --git a/internal/testutil/baseline/baseline.go b/internal/testutil/baseline/baseline.go
index 3767a2b5b..1f64731ae 100644
--- a/internal/testutil/baseline/baseline.go
+++ b/internal/testutil/baseline/baseline.go
@@ -2,7 +2,6 @@ package baseline
 
 import (
 	"fmt"
-	"io"
 	"os"
 	"path/filepath"
 	"regexp"
@@ -109,17 +108,14 @@ type inputPair struct {
 	b []string
 }
 
-func diffText(oldName string, newName string, expected string, actual string, w io.Writer) error {
+func diffText(oldName string, newName string, expected string, actual string) string {
 	lines := patience.Diff(stringutil.SplitLines(expected), stringutil.SplitLines(actual))
-	result := patience.UnifiedDiffTextWithOptions(lines, patience.UnifiedDiffOptions{
+	return patience.UnifiedDiffTextWithOptions(lines, patience.UnifiedDiffOptions{
 		Precontext:  3,
 		Postcontext: 3,
 		SrcHeader:   oldName,
 		DstHeader:   newName,
 	})
-
-	_, err := w.Write([]byte(result))
-	return err
 }
 
 func getBaselineDiff(t *testing.T, actual string, expected string, fileName string, fixupOld func(string) string) string {
@@ -129,14 +125,11 @@ func getBaselineDiff(t *testing.T, actual string, expected string, fileName stri
 	if actual == expected {
 		return NoContent
 	}
-	var b strings.Builder
-	if err := diffText("old."+fileName, "new."+fileName, expected, actual, &b); err != nil {
-		return fmt.Sprintf("failed to diff the actual and expected content: %v\n", err)
-	}
+
+	s := diffText("old."+fileName, "new."+fileName, expected, actual)
 
 	// Remove line numbers from unified diff headers; this avoids adding/deleting
 	// lines in our baselines from causing knock-on header changes later in the diff.
-	s := b.String()
 
 	aCurLine := 1
 	bCurLine := 1

weswigham added 2 commits May 23, 2025 11:49

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham

Verified

This commit was signed with the committer’s verified signature.
weswigham Wesley Wigham
@weswigham weswigham enabled auto-merge May 23, 2025 19:22
@weswigham weswigham requested a review from jakebailey May 23, 2025 19:22
@weswigham weswigham added this pull request to the merge queue May 23, 2025
@jakebailey jakebailey requested a review from Copilot May 23, 2025 19:40
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR switches the test baseline diffing to use the “patience” algorithm for improved performance and whitespace-insensitive diffs.

  • Introduce a new diffText helper using peter-evans/patience in baseline.go
  • Remove github.com/pkg/diff, add github.com/peter-evans/patience in go.mod
  • Update all affected baseline .diff files under testdata/baselines/... to match the new diff output

Reviewed Changes

Copilot reviewed 20791 out of 20791 changed files in this pull request and generated 1 comment.

File Description
internal/testutil/baseline/baseline.go Replaced diff.Text usage with a new diffText helper leveraging patience
go.mod Removed github.com/pkg/diff, added github.com/peter-evans/patience
testdata/baselines/reference/submodule/compiler/*.errors.txt.diff Updated expected error diff outputs to align with the patience algorithm
testdata/baselines/reference/submodule/compiler/2dArrays.symbols.diff Updated symbol diff ordering/output from patience diff
Comments suppressed due to low confidence (2)

internal/testutil/baseline/baseline.go:106

  • [nitpick] Consider renaming diffText to something more descriptive like generateUnifiedDiff or computeBaselineDiff to clarify its purpose.
func diffText(oldName string, newName string, expected string, actual string) string {

internal/testutil/baseline/baseline.go:14

  • If assert is no longer used in this file after the diff changes, remove the import to clean up unused dependencies.
"gotest.tools/v3/assert"

@@ -102,21 +103,27 @@ func readFileOrNoContent(fileName string) string {
return string(content)
}

Copy link
Preview

Copilot AI May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Add a Go doc comment for diffText, briefly explaining the parameters and the returned unified diff string to improve maintainability.

Suggested change
// diffText generates a unified diff string between the expected and actual text.
// Parameters:
// - oldName: The name of the source file or content being compared.
// - newName: The name of the destination file or content being compared.
// - expected: The expected content as a string.
// - actual: The actual content as a string.
// Returns:
// A unified diff string highlighting the differences between the expected and actual content.

Copilot uses AI. Check for mistakes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why would anyone add docs like this ever again if they can just ask copilot to guess at the docs like this? 🤔

...it's also just self-evident, in this case.

Merged via the queue into microsoft:main with commit 4764672 May 23, 2025
23 checks passed
@weswigham weswigham deleted the new-diff branch May 23, 2025 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants