
Training on n epochs results in 1 more step in n+1 epoch #1793

Closed
Oseltamivir opened this issue Oct 19, 2024 · 4 comments · Fixed by #1794
Labels: bug (Something isn't working)

Comments

@Oseltamivir

Bug description

Training on n epochs results in 1 more step in n+1 epoch

```python
while state["step_count"] < max_steps and train_iterator.epoch < train.epochs:
    state["iter_num"] += 1
    iter_t0 = time.perf_counter()
    batch = next(train_iterator)
```

The problem is that `next(train_iterator)` is called after the `train_iterator.epoch < train.epochs` check, so the iterator can roll over into epoch n+1 and one extra step executes before the condition is re-evaluated.

```
Epoch 1 | iter 1 step 1 | loss train: 2.112, val: n/a | iter time: 3882.13 ms (step)
Epoch 1 | iter 1 step 1 | loss train: 2.167, val: n/a | iter time: 3887.25 ms (step)
Epoch 1 | iter 2 step 2 | loss train: 1.257, val: n/a | iter time: 11192.95 ms (step)
Epoch 1 | iter 2 step 2 | loss train: 1.258, val: n/a | iter time: 11224.30 ms (step)
Epoch 2 | iter 3 step 3 | loss train: 2.108, val: n/a | iter time: 3683.60 ms (step)
Epoch 2 | iter 3 step 3 | loss train: 2.165, val: n/a | iter time: 3726.28 ms (step)
```

The log above is from a run configured with `epochs = 1`, yet an extra step executes under Epoch 2.
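The off-by-one can be reproduced with a minimal sketch. The `CycleIterator` below is a hypothetical stand-in for litgpt's training iterator, assumed to cycle over the data indefinitely and bump a zero-based `epoch` counter each time it wraps:

```python
class CycleIterator:
    """Hypothetical stand-in for litgpt's train iterator: cycles over the
    data forever and increments a zero-based `epoch` counter on each wrap."""
    def __init__(self, data):
        self.data = data
        self.epoch = 0
        self._it = iter(data)

    def __next__(self):
        try:
            return next(self._it)
        except StopIteration:
            self.epoch += 1
            self._it = iter(self.data)
            return next(self._it)


def train_steps(epochs, max_steps=100):
    it = CycleIterator([0, 1])  # 2 batches per epoch
    step_count = 0
    while step_count < max_steps and it.epoch < epochs:
        # `next` is called AFTER the epoch check, so the batch that wraps
        # into epoch n+1 still runs as a full training step.
        batch = next(it)
        step_count += 1
    return step_count


print(train_steps(epochs=1))  # prints 3, although 1 epoch over 2 batches should be 2 steps
```

With this loop shape, training for n epochs over 2 batches always executes 2n + 1 steps instead of 2n.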

What operating system are you using?

Linux

LitGPT Version

Version: 0.4.12
Oseltamivir added the bug label on Oct 19, 2024
@rasbt
Contributor

rasbt commented Oct 19, 2024

Thanks for reporting! And yes, I can confirm that this has been an issue. I have it on my list of things to address in the next few weeks.

rasbt self-assigned this on Oct 19, 2024
@Oseltamivir
Author

Oseltamivir commented Oct 19, 2024

> Thanks for reporting! And yes, I can confirm that this has been an issue. I have it on my list of things to address in the next few weeks.

May I suggest instead:

```python
while state["step_count"] < max_steps:
    state["iter_num"] += 1
    iter_t0 = time.perf_counter()
    batch = next(train_iterator)
    if train_iterator.epoch >= train.epochs:
        break
```

I don't think this is worth a PR, but if you'd like one, I can create it.
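Applied to the same minimal sketch (again with a hypothetical `CycleIterator` standing in for litgpt's train iterator), the suggested reordering removes the extra step, because the batch that rolls into epoch n+1 is fetched but discarded:

```python
class CycleIterator:
    """Hypothetical stand-in for litgpt's train iterator: cycles over the
    data forever and increments a zero-based `epoch` counter on each wrap."""
    def __init__(self, data):
        self.data = data
        self.epoch = 0
        self._it = iter(data)

    def __next__(self):
        try:
            return next(self._it)
        except StopIteration:
            self.epoch += 1
            self._it = iter(self.data)
            return next(self._it)


def train_steps_fixed(epochs, max_steps=100):
    it = CycleIterator([0, 1])  # 2 batches per epoch
    step_count = 0
    while step_count < max_steps:
        batch = next(it)
        # Check AFTER fetching: if the fetch wrapped into epoch n+1,
        # bail out before counting (or training on) this batch.
        if it.epoch >= epochs:
            break
        step_count += 1
    return step_count


print(train_steps_fixed(epochs=1))  # prints 2: exactly one pass over the data
```

Training for n epochs over 2 batches now executes exactly 2n steps.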

@rasbt
Contributor

rasbt commented Oct 19, 2024

Yes, this is the fix I would suggest as well. It needs to be added in several places, and I just want to test this carefully (it probably also requires slight adjustments to some CI tests, because the loss values will change). It's on my list!

@Oseltamivir
Author

Alright, perfect. Thanks for the quick replies. I'll include this in my patch for this code.
