Issues: pytorch/torchtitan
[Feature] Add Multi-Token Prediction module
enhancement
New feature or request
#933 opened Mar 5, 2025 by lessw2020
[TP] RuntimeError: shape '[1, 8192, -1, 128]' is invalid for input of size 524288
module: dtensor
#932 opened Mar 5, 2025 by aahehehe
CheckpointManager.save with async mode is vulnerable to race conditions
module: checkpoint
question
#930 opened Mar 5, 2025 by jamesbraza
Performance regression with FSDP2 due to exposed/non-overlapped comms
module: fsdp
#929 opened Mar 5, 2025 by danielvegamyhre
[Feature] add preflight NCCL and GEMM check to multinode slurm script
#915 opened Mar 3, 2025 by lessw2020
[Feature] expose Torch Nan checker as configurable option in toml for those training at scale
#914 opened Mar 3, 2025 by lessw2020
[Checkpointing] fails out if checkpoint folder does not exist when using keep_latest_k
bug
Something isn't working
module: checkpoint
#911 opened Mar 2, 2025 by lessw2020
[Possible PR discuss] Will a PR of training HF model be welcomed?
community help wanted
huggingface integration
#903 opened Feb 28, 2025 by junjzhang
Question about triton in deepseek implementation
question
Further information is requested
#902 opened Feb 28, 2025 by zqwenn
dcp.load fails on checkpoints prior to AdamW refactor
module: checkpoint
#886 opened Feb 25, 2025 by eminorhan
[Evaluation] Minimal support for downstream tasks
community help wanted
enhancement
New feature or request
#883 opened Feb 24, 2025 by K-H-Ismail
[Float8] Rowwise with AsyncTP runs at roughly same perf as vanilla TP
bug
Something isn't working
module: float8
#866 opened Feb 20, 2025 by lessw2020
How to define Custom Communication Operations for Custom Operators in Distributed Settings
module: dtensor
question
Further information is requested
#852 opened Feb 17, 2025 by Doraemonzzz
"Universal" Checkpointing
module: checkpoint
question
Further information is requested
#850 opened Feb 17, 2025 by jeromeku
Mitigation to HuggingFace Trainer
enhancement
New feature or request
huggingface integration
#824 opened Feb 6, 2025 by huyiwen
HSDP causes loss instability
module: fsdp
question
Further information is requested
#813 opened Jan 31, 2025 by apkumar
debug model training hangs on NVIDIA B200 with >1 GPU
bug
Something isn't working
module: c10d
#810 opened Jan 28, 2025 by vkuzo