Gemini 1.5 PRO latest + CEDARScript-G edit format #1897

Closed
elifarley wants to merge 6 commits


@elifarley elifarley commented Oct 3, 2024

The new CEDARScript edit format looks promising, as it allowed Gemini-1.5-Flash to surpass Sonnet 3.5.

Here we're not using architect mode, but you can kinda say that Gemini is acting as an architect, and the edit format itself (CEDARScript) is acting as the editor.

Quick comparisons


Sonnet 3.5 + diff

- dirname: refac-claude-3.5-sonnet-diff-not-lazy
  model: claude-3.5-sonnet (diff)
  edit_format: diff
  pass_rate_1: 64.0
  percent_cases_well_formed: 76.4

Gemini 1.5 PRO + diff-fenced (leaderboard site)

- dirname: refac-gemini
  model: gemini/gemini-1.5-pro-latest
  edit_format: diff-fenced
  pass_rate_1: 49.4
  percent_cases_well_formed: 7.9

Gemini 1.5 PRO + diff-fenced (my own tests)

- dirname: 2024-10-05-00-43-21--diff-fenced-Gemini-Refactoring
  test_cases: 89
  model: gemini/gemini-1.5-pro-latest
  edit_format: diff-fenced
  commit_hash: 772710b-dirty
  pass_rate_1: 18.0
  pass_rate_2: 21.3
  pass_rate_3: 24.7
  percent_cases_well_formed: 34.8
  error_outputs: 180
  num_malformed_responses: 180
  num_with_malformed_responses: 58
  user_asks: 128
  lazy_comments: 2
  syntax_errors: 21
  indentation_errors: 93
  exhausted_context_windows: 0
  test_timeouts: 0
  command: aider --model gemini/gemini-1.5-pro-latest
  date: 2024-10-05
  versions: 0.57.2.dev
  seconds_per_case: 110.1
  total_cost: 28.2515

Gemini 1.5 PRO + CEDARScript

- dirname: 2024-10-19-22-48-07--cedarscript-0.3.1-refactoring-gemini1.5pro
  test_cases: 89
  model: gemini/gemini-1.5-pro-latest
  edit_format: cedarscript-g
  commit_hash: 4da1e9b-dirty
  pass_rate_1: 77.5
  percent_cases_well_formed: 86.5
  error_outputs: 337
  num_malformed_responses: 19
  num_with_malformed_responses: 12
  user_asks: 12
  lazy_comments: 0
  syntax_errors: 4
  indentation_errors: 3
  exhausted_context_windows: 0
  test_timeouts: 0
  command: aider --model gemini/gemini-1.5-pro-latest
  date: 2024-10-19
  versions: 0.59.2.dev
  seconds_per_case: 29.0
  total_cost: 26.2374

Gemini 1.5 Flash + CEDARScript

- dirname: 2024-10-20-00-33-27--cedarscript-0.3.1-refactoring-gemini1.5flash
  test_cases: 89
  model: gemini/gemini-1.5-flash-latest
  edit_format: cedarscript-g
  commit_hash: 4da1e9b-dirty
  pass_rate_1: 76.4
  percent_cases_well_formed: 94.4
  error_outputs: 403
  num_malformed_responses: 13
  num_with_malformed_responses: 5
  user_asks: 21
  lazy_comments: 0
  syntax_errors: 3
  indentation_errors: 5
  exhausted_context_windows: 0
  test_timeouts: 0
  command: aider --model gemini/gemini-1.5-flash-latest
  date: 2024-10-20
  versions: 0.59.2.dev
  seconds_per_case: 14.7
  total_cost: 0.6757

functional_Functional__conform_to_reference_input

diff-fenced

    "cost": 0.33188854999999995,
    "duration": 27.793912172317505,
    "test_timeouts": 0,
    "commit_hash": "772710b-dirty",
    "num_error_outputs": 2,
    "num_user_asks": 3,
    "num_exhausted_context_windows": 0,
    "num_malformed_responses": 2,
    "syntax_errors": 0,
    "indentation_errors": 3,
    "lazy_comments": 0,

cedarscript-g

    "cost": 0.18178265,
    "duration": 11.176445960998535,
    "test_timeouts": 0,
    "commit_hash": "772710b-dirty",
    "num_error_outputs": 0,
    "num_user_asks": 1,
    "num_exhausted_context_windows": 0,
    "num_malformed_responses": 0,
    "syntax_errors": 0,
    "indentation_errors": 0,
    "lazy_comments": 0,

See line count comparisons for some refactoring benchmark tasks.

Analysis: CEDARScript vs. Common Edit Formats in AI-Assisted Code Refactoring

The introduction of CEDARScript as an edit format for AI-assisted code refactoring has demonstrated an important leap in performance, particularly when used with Gemini 1.5 PRO and Gemini 1.5 Flash. This analysis compares CEDARScript against traditional diff-based edit formats, revealing striking improvements across multiple metrics.

Overall Performance:

CEDARScript substantially improved the performance of Gemini models on the refactoring benchmark. Paired with Gemini 1.5 PRO, it reached a 77.5% pass rate with 86.5% well-formed cases. That is ahead of the same model with the diff-fenced format, both on the leaderboard (49.4% pass rate, 7.9% well-formed cases) and in my own run (18.0% first-try pass rate, 34.8% well-formed cases), and ahead of Claude 3.5 Sonnet with the diff format (64.0% pass rate, 76.4% well-formed cases).

Most remarkably, the cost-effective Gemini 1.5 Flash model, when using CEDARScript, not only matched but surpassed the performance of Claude 3.5 Sonnet. With a 76.4% pass rate and an outstanding 94.4% well-formed cases, Gemini 1.5 Flash demonstrates that even a more affordable model can outperform top-tier competitors when equipped with the right tools. This breakthrough suggests that CEDARScript can level the playing field, enabling more accessible AI models to compete with and even exceed the capabilities of more expensive options in complex coding tasks.
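To make the comparison above easier to check, here is a minimal sketch that ranks the runs by the two headline metrics. The numbers are copied from the result blocks in this description; the labels, the `well_formed` key, and the `rank_by` helper are only illustrative names, not part of aider's benchmark tooling.

```python
# Headline figures copied from the result blocks in this PR description.
# "well_formed" stands for percent_cases_well_formed (illustrative short name).
runs = [
    {"label": "Sonnet 3.5 + diff", "pass_rate_1": 64.0, "well_formed": 76.4},
    {"label": "Gemini 1.5 PRO + diff-fenced (leaderboard)", "pass_rate_1": 49.4, "well_formed": 7.9},
    {"label": "Gemini 1.5 PRO + diff-fenced (own run)", "pass_rate_1": 18.0, "well_formed": 34.8},
    {"label": "Gemini 1.5 PRO + CEDARScript", "pass_rate_1": 77.5, "well_formed": 86.5},
    {"label": "Gemini 1.5 Flash + CEDARScript", "pass_rate_1": 76.4, "well_formed": 94.4},
]


def rank_by(rows, key):
    """Return the rows sorted from best to worst on the given metric."""
    return sorted(rows, key=lambda row: row[key], reverse=True)


for row in rank_by(runs, "pass_rate_1"):
    print(f"{row['label']:<45} {row['pass_rate_1']:5.1f} {row['well_formed']:5.1f}")
```

Ranking by `well_formed` instead puts Gemini 1.5 Flash + CEDARScript first, which matches the observation that Flash trades a slightly lower pass rate for more well-formed cases.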

Code Quality and Accuracy:

  • Syntax Errors: Down from 21 (diff-fenced, my own run) to 4 with Gemini 1.5 PRO and 3 with Gemini 1.5 Flash.
  • Indentation Errors: Down from 93 to 3 with Gemini 1.5 PRO and 5 with Gemini 1.5 Flash.
  • Lazy Comments: Down from 2 to 0 in both CEDARScript runs.

These improvements suggest that CEDARScript enables AI models to produce more accurate, syntactically correct, and well-structured code modifications.

Efficiency and Resource Utilization:

Examining the "functional_Functional__conform_to_reference_input" test case:

  • Cost: CEDARScript reduced costs by 45% (from $0.33 to $0.18).
  • Duration: Processing time decreased by 60% (from 27.8s to 11.2s).
  • User Interactions: Required user asks dropped from 3 to 1.

On a larger scale, CEDARScript with Gemini 1.5 PRO reduced the average time per case from 110.1 seconds to 29.0 seconds, a 73.7% improvement. Gemini 1.5 Flash further reduced this to 14.7 seconds, an 86.6% improvement over the original diff-fenced format.
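The percentages quoted in this section can be reproduced with a short sketch like the one below. The inputs are the `cost`, `duration`, and `seconds_per_case` values from the result blocks above; `percent_reduction` is a hypothetical helper name used only for this illustration.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


# Single test case: functional_Functional__conform_to_reference_input
cost_saving = percent_reduction(0.33188854999999995, 0.18178265)
duration_saving = percent_reduction(27.793912172317505, 11.176445960998535)

# Whole benchmark: seconds_per_case, relative to Gemini 1.5 PRO + diff-fenced
pro_time_saving = percent_reduction(110.1, 29.0)
flash_time_saving = percent_reduction(110.1, 14.7)

print(f"cost: {cost_saving:.0f}%  duration: {duration_saving:.0f}%")
print(f"PRO time per case: {pro_time_saving:.1f}%  Flash time per case: {flash_time_saving:.1f}%")
```

Note that the Flash figure compares a different model against the PRO diff-fenced baseline, so it mixes the effect of the edit format with the effect of the model.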

Robustness and Reliability:

While the number of error outputs increased with CEDARScript (from 180 to 337 with Gemini 1.5 PRO and 403 with Gemini 1.5 Flash), the number of malformed responses decreased significantly:

  • Gemini 1.5 PRO: from 180 to 19 malformed responses
  • Gemini 1.5 Flash: further reduced to 13 malformed responses

So although CEDARScript runs log more error outputs, far fewer of them are malformed responses, which may indicate more precise error handling and feedback.
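As a rough illustration of the point above, the sketch below computes what share of error outputs were malformed responses in each run. It also checks an assumption inferred from the numbers, not from the benchmark source: that `percent_cases_well_formed` equals the share of test cases without any malformed response. The helper names are hypothetical.

```python
# Counts copied from the three 89-case runs in this PR description.
runs = {
    "PRO + diff-fenced": {
        "test_cases": 89, "error_outputs": 180,
        "num_malformed_responses": 180, "num_with_malformed_responses": 58,
    },
    "PRO + CEDARScript": {
        "test_cases": 89, "error_outputs": 337,
        "num_malformed_responses": 19, "num_with_malformed_responses": 12,
    },
    "Flash + CEDARScript": {
        "test_cases": 89, "error_outputs": 403,
        "num_malformed_responses": 13, "num_with_malformed_responses": 5,
    },
}


def malformed_share(run):
    """Percentage of error outputs that were malformed responses."""
    return run["num_malformed_responses"] / run["error_outputs"] * 100


def well_formed_percent(run):
    """Assumed definition: share of cases with no malformed response."""
    clean_cases = run["test_cases"] - run["num_with_malformed_responses"]
    return clean_cases / run["test_cases"] * 100


for name, run in runs.items():
    print(f"{name:<20} malformed share {malformed_share(run):5.1f}%  "
          f"well-formed {well_formed_percent(run):5.1f}%")
```

Under that assumption the computed well-formed percentages round to the reported 34.8, 86.5, and 94.4.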

Scalability and Cost-Effectiveness:

CEDARScript runs cost less in total than the diff-fenced run:

  • Gemini 1.5 PRO: Total cost reduced from $28.25 to $26.24 (7.1% savings)
  • Gemini 1.5 Flash: Total cost of $0.68, 97.6% lower than Gemini 1.5 PRO with diff-fenced. Most of this gap comes from the cheaper model rather than from the edit format.

This cost reduction, combined with faster processing times, indicates excellent scalability for larger, more complex refactoring tasks.
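One way to put the totals on a comparable footing is cost per test case, sketched below from the `total_cost` and `test_cases` values reported above. The `cost_per_case` helper and the PRO-to-Flash ratio are illustrative additions, not figures reported by the benchmark.

```python
TEST_CASES = 89  # test_cases reported for all three runs

total_cost = {
    "PRO + diff-fenced": 28.2515,
    "PRO + CEDARScript": 26.2374,
    "Flash + CEDARScript": 0.6757,
}


def cost_per_case(total, cases=TEST_CASES):
    """Average spend in dollars per benchmark test case."""
    return total / cases


for name, total in total_cost.items():
    print(f"{name:<20} ${cost_per_case(total):.4f} per case")

# How many times more the PRO run cost than the Flash run, same edit format.
pro_to_flash_ratio = total_cost["PRO + CEDARScript"] / total_cost["Flash + CEDARScript"]
print(f"PRO / Flash cost ratio with CEDARScript: {pro_to_flash_ratio:.1f}x")
```

Holding the edit format fixed in the last ratio separates the saving that comes from switching models from the saving that comes from switching formats.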

Model Comparison:

Gemini 1.5 Flash with CEDARScript showed slightly lower pass rates (76.4% vs 77.5%) but higher well-formed case percentages (94.4% vs 86.5%) compared to Gemini 1.5 PRO. The Flash model also demonstrated superior cost-effectiveness and speed, making it an attractive option for many use cases.

Conclusion:

CEDARScript has shown significant improvements for AI-assisted code refactoring.

By improving cost, accuracy, efficiency, and reliability across different models, it addresses many of the challenges associated with traditional diff-based formats.
The consistent performance boost across various metrics indicates that CEDARScript could be an important enabler for AI models to handle complex code transformations more effectively.

These results could have positive implications for developer productivity, code quality, and the future of AI-assisted software development.

@elifarley elifarley marked this pull request as draft October 3, 2024 03:37
@elifarley elifarley changed the title Gemini 1.5 PRO latest with CEDARScript-G edit format Gemini 1.5 PRO latest + CEDARScript-G edit format Oct 4, 2024
@elifarley elifarley marked this pull request as ready for review October 5, 2024 00:03
@fry69
Contributor

fry69 commented Oct 5, 2024

What is the point of this PR? The coder does not exist in aider currently.

These numbers are at best for private preview interest, not for public disclosure on the aider website (IMHO).

@elifarley
Author

Ok, I'll make it a draft PR. Once a PR in Aider is created and merged, I can then make this PR ready for review once more.

@elifarley elifarley marked this pull request as draft October 5, 2024 10:47
@fry69
Contributor

fry69 commented Oct 5, 2024

> Once a PR in Aider is created and merged, I can then make this PR ready for review once more.

I'll close this PR until that has happened.

@fry69 fry69 closed this Oct 5, 2024