Out of memory #2

Closed
xingjici opened this issue Jul 11, 2019 · 7 comments

xingjici commented Jul 11, 2019

Hi, I have 4x12GB GPUs, but it seems only the first one works.
Out-of-memory errors are encountered after a few seconds of training.

NoamRosenberg (Owner) commented Jul 12, 2019

@xingjici a batch size of 2 takes approximately 18GB of memory on Cityscapes, and 2 is the default.
What batch size are you training with? If you are training with 2, consider training with 1, using a larger GPU, shrinking the model input, or reducing the size of the architecture itself.

If I remember correctly, batch size 1 should take approximately 12GB, maybe a bit more. Keep me posted on your progress.

xingjici (Author)

@NoamRosenberg Batch size is 4. If a batch of 2 uses approximately 18GB of memory, each GPU should only need about 9GB when nn.DataParallel is on. I have 4x12GB GPUs, but it doesn't work when batch size equals 4.
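
For reference, a minimal sketch (not code from this repo) of how nn.DataParallel splits a batch, which is where the 9GB-per-GPU intuition comes from:

```python
import torch
import torch.nn as nn

# DataParallel scatters the input along dim 0 across the visible GPUs,
# so batch_size=4 on 4 GPUs puts one sample on each replica. Outputs are
# gathered back on GPU0, which is one reason GPU0 ends up using more
# memory than the others.
model = nn.DataParallel(nn.Conv2d(3, 16, 3, padding=1).cuda())
x = torch.randn(4, 3, 64, 64).cuda()  # batch of 4 -> 1 sample per GPU
y = model(x)                          # gathered back on GPU0
print(y.shape)                        # torch.Size([4, 16, 64, 64])
```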

NoamRosenberg (Owner) commented Jul 12, 2019

@xingjici In practice the scaling is not linear, and GPU0 will take more than 9GB. I suggest shrinking the model input for now as a test. It's easy to do: adjust args.base_size.
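
A quick way to see the effect of that test, as a hedged sketch with a toy model standing in for the real network (args.base_size is the flag from this thread; everything else here is illustrative):

```python
import torch
import torch.nn as nn

# Activation memory grows roughly with the square of the input side,
# so halving base_size should roughly quarter the per-GPU footprint.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).cuda()
for base_size in (512, 256, 128):
    torch.cuda.reset_peak_memory_stats()
    model(torch.randn(2, 3, base_size, base_size, device='cuda'))
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"base_size={base_size}: peak {peak:.0f} MiB")
```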

xingjici (Author)

@NoamRosenberg I found that only GPU0 works during training, and nn.DataParallel may crash. Could you check the memory usage via nvidia-smi? I suspect the reason is that the whole computation burden is carried by GPU0.
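
A Python-side analogue of watching nvidia-smi, to check whether the load is balanced (a minimal sketch, not repo code):

```python
import torch

# Print what this process has allocated on each device; a healthy
# DataParallel run should show comparable numbers across GPUs.
for i in range(torch.cuda.device_count()):
    mib = torch.cuda.memory_allocated(i) / 2**20
    print(f"GPU{i}: {mib:.0f} MiB allocated by this process")
```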

NoamRosenberg (Owner) commented Jul 12, 2019

@xingjici This is very odd. Could you elaborate on what you have tried so far and what errors you get with DataParallel? I won't have access to a computer till Monday; I'll do my best to help you figure this out then. Please keep me updated.

By the way, I'm looking for contributors to this project. Happy to have you join forces.

xingjici (Author)

@NoamRosenberg
Thank you for your reply.
```python
if args.cuda:
    self.model = self.model.cuda()                  # move to GPU0 first
    self.model = torch.nn.DataParallel(self.model)  # then wrap for multi-GPU
    # self.model = self.model.cuda()                # removed: redundant after wrapping
```
I just removed the last cuda() operation. It now works fine in parallel with batch_size 4 and base_size 128.
I found that the architecture-search burden is carried by GPU0 (GPU0: 12088MB, GPU1~3: 4906MB each), while the data-parallel training is distributed well. I wonder: should self.architect() inherit from nn.Module to enable multi-GPU search? See the sketch below.
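
As an illustration of why the search step stays on one device (illustrative DARTS-style names, not the repo's Architect class): DataParallel only fans out the wrapped module's forward(), and gradients are reduced back onto the original parameters on GPU0, so anything computed on model.module directly runs there alone.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.alphas = nn.Parameter(torch.zeros(8))  # DARTS-style arch weights

    def forward(self, x):
        return self.conv(x) * self.alphas.view(1, -1, 1, 1)

model = nn.DataParallel(Net().cuda())
# Forward/backward are split across GPUs...
model(torch.randn(4, 3, 32, 32).cuda()).sum().backward()
# ...but the reduced gradients, and any architect step that uses them,
# live only on GPU0:
print(model.module.alphas.grad.device)  # cuda:0
```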

NoamRosenberg (Owner)

@xingjici, thanks for your ideas. I wonder if you wouldn't mind committing them.

Specifically, self.architect receives the self.model object, which has just been distributed. So I'm not quite sure what you mean, but if you commit this idea I can check more carefully.
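
One hedged sketch of what the unwrapping might look like (illustrative helper, not from the repo):

```python
import torch.nn as nn

def unwrap(model: nn.Module) -> nn.Module:
    """Return the bare network whether or not it is DataParallel-wrapped."""
    return model.module if isinstance(model, nn.DataParallel) else model
```

Since DataParallel broadcasts from the parameters on GPU0 on each forward pass, an architect that updates the unwrapped model's parameters still affects every replica.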
