Out of memory #2

Closed
xingjici opened this issue Jul 11, 2019 · 7 comments

xingjici commented Jul 11, 2019

Hi, I have 4x12GB GPUs, but it seems only the first one works.
Out-of-memory errors are encountered after a few seconds of training.

NoamRosenberg (Owner) commented Jul 12, 2019

@xingjici a batch size of 2 takes approximately 18GB of memory on Cityscapes, and 2 is the default.
What batch size are you training with? If you are training with 2, consider training with 1, using a larger GPU, shrinking the model input, or reducing the size of the architecture itself.

If I remember correctly, batch size 1 should take approximately 12GB, maybe a bit more. Keep me posted on your progress.

xingjici (Author)

@NoamRosenberg Batch size is 4. If a batch of 2 uses approximately 18GB of memory, each GPU should only need about 9GB when nn.DataParallel is on. I have 4x12GB GPUs, but it doesn't work when batch size equals 4.
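
For reference, a minimal sketch (not code from this repo) of how nn.DataParallel splits a batch, which is where the 9GB-per-GPU intuition comes from:

```python
import torch
import torch.nn as nn

# DataParallel scatters the input along dim 0 across the visible GPUs,
# so batch_size=4 on 4 GPUs puts one sample on each replica. Outputs are
# gathered back on GPU0, which is one reason GPU0 ends up using more
# memory than the others.
model = nn.DataParallel(nn.Conv2d(3, 16, 3, padding=1).cuda())
x = torch.randn(4, 3, 64, 64).cuda()  # batch of 4 -> 1 sample per GPU
y = model(x)                          # gathered back on GPU0
print(y.shape)                        # torch.Size([4, 16, 64, 64])
```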

NoamRosenberg (Owner) commented Jul 12, 2019

@xingjici In practice the scaling is not linear, and GPU0 will take more than 9GB. I suggest shrinking the model input for now as a test. It's easy to do: adjust args.base_size.
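
A quick way to see the effect of that test, as a hedged sketch with a toy model standing in for the real network (args.base_size is the flag from this thread; everything else here is illustrative):

```python
import torch
import torch.nn as nn

# Activation memory grows roughly with the square of the input side,
# so halving base_size should roughly quarter the per-GPU footprint.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).cuda()
for base_size in (512, 256, 128):
    torch.cuda.reset_peak_memory_stats()
    model(torch.randn(2, 3, base_size, base_size, device='cuda'))
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"base_size={base_size}: peak {peak:.0f} MiB")
```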

xingjici (Author)

@NoamRosenberg I found that only GPU0 works during training, and nn.DataParallel may crash. Could you check the memory usage via nvidia-smi? I suspect the reason is that the whole computation burden is carried by GPU0.
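
A Python-side analogue of watching nvidia-smi, to check whether the load is balanced (a minimal sketch, not repo code):

```python
import torch

# Print what this process has allocated on each device; a healthy
# DataParallel run should show comparable numbers across GPUs.
for i in range(torch.cuda.device_count()):
    mib = torch.cuda.memory_allocated(i) / 2**20
    print(f"GPU{i}: {mib:.0f} MiB allocated by this process")
```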

NoamRosenberg (Owner) commented Jul 12, 2019

@xingjici This is very odd. Could you elaborate on what you have tried so far and what errors you get with DataParallel? I won't have access to a computer till Monday; I'll do my best to help you figure this out then. Please keep me updated.

By the way, I'm looking for contributors to this project. Happy to have you join forces.

xingjici (Author)

@NoamRosenberg
Thank you for your reply.
```python
if args.cuda:
    self.model = self.model.cuda()                  # move to GPU0 first
    self.model = torch.nn.DataParallel(self.model)  # then wrap for multi-GPU
    # self.model = self.model.cuda()                # removed: redundant after wrapping
```
I just removed the last cuda() operation. It now works fine in parallel with batch_size 4 and base_size 128.
I found that the architecture-search burden is carried by GPU0 (GPU0: 12088MB, GPU1~3: 4906MB each), while the data-parallel training is distributed well. I wonder: should self.architect() inherit from nn.Module to enable multi-GPU search? See the sketch below.
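
As an illustration of why the search step stays on one device (illustrative DARTS-style names, not the repo's Architect class): DataParallel only fans out the wrapped module's forward(), and gradients are reduced back onto the original parameters on GPU0, so anything computed on model.module directly runs there alone.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.alphas = nn.Parameter(torch.zeros(8))  # DARTS-style arch weights

    def forward(self, x):
        return self.conv(x) * self.alphas.view(1, -1, 1, 1)

model = nn.DataParallel(Net().cuda())
# Forward/backward are split across GPUs...
model(torch.randn(4, 3, 32, 32).cuda()).sum().backward()
# ...but the reduced gradients, and any architect step that uses them,
# live only on GPU0:
print(model.module.alphas.grad.device)  # cuda:0
```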

NoamRosenberg (Owner)

@xingjici, thanks for your ideas. I wonder if you wouldn't mind committing them.

Specifically, self.architect receives the self.model object, which has just been distributed. So I'm not quite sure what you mean, but if you commit this idea I can check more carefully.
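
One hedged sketch of what the unwrapping might look like (illustrative helper, not from the repo):

```python
import torch.nn as nn

def unwrap(model: nn.Module) -> nn.Module:
    """Return the bare network whether or not it is DataParallel-wrapped."""
    return model.module if isinstance(model, nn.DataParallel) else model
```

Since DataParallel broadcasts from the parameters on GPU0 on each forward pass, an architect that updates the unwrapped model's parameters still affects every replica.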
