Skip to content

Commit 82168d4

Browse files
coeuvreEdSchoutenShreeM01
authored
Make Bazel more responsive and use less memory when --jobs is high (bazelbuild#17398)
When using Bazel in combination with a larger remote execution cluster, it's not uncommon to call it with something like --jobs=512. We have observed that this is currently problematic for a couple of reasons: 1. It causes Bazel to launch 512 local threads, each being responsible for running one action remotely. All of these local threads may spend a lot of time in buildRemoteAction(), generating input roots in the form of Merkle trees. As the local system tends to have fewer than 512 CPUs, all of these threads will unnecessarily compete with each other. One practical downside of that is that interrupting Bazel using ^C takes a very long time, as it first wants to complete the computation of all 512 Merkle trees. Let's put a semaphore in place, limiting the number of concurrent Merkle tree computations to the number of CPU cores available. By making the semaphore interruptible, Bazel will immediately stop processing the `512 - nCPU` actions that were waiting for the semaphore. 2. Related to the above, Bazel will end up keeping 512 Merkle trees in memory throughout all stages of execution. This makes sense, as we may get cache misses, requiring us to upload the input root afterwards. Or the execution of a remote action may fail, requiring us to upload the input root. That said, generally speaking these cases are fairly uncommon. Most builds have relatively high cache hit rates and execution retries only happen rarely. It is therefore not worth keeping these Merkle trees in memory constantly. We only need it when computing the action digest for GetActionResult(), and while uploading it into the CAS. 3. AbstractSpawnStrategy.getInputMapping() has some smartness to memoize its results. This makes a lot of sense for local execution, where the input mapping is used in a couple of places. For remote caching/execution it is not evident that this is a good idea. Assuming you end up having a remote cache hit, you don't need it. Let's make the memoization optional, only using it in cases where we do local execution (which may also happen when you get a cache miss when doing remote caching without remote exection). Similar changes against Bazel 5.x have allowed me to successfully do builds of a large monorepo using --jobs=512 using the default heap size limits, whereas I would normally see occasional OOM behaviour when providing --host_jvm_args=-Xmx64g. Closes bazelbuild#17120. PiperOrigin-RevId: 500990181 Change-Id: I6d1ba03470b79424ce2e1c2e83abd8fa779dd268 Co-authored-by: Ed Schouten <[email protected]> Co-authored-by: kshyanashree <[email protected]>
1 parent c540ebf commit 82168d4

21 files changed

+234
-110
lines changed

src/main/java/com/google/devtools/build/lib/exec/AbstractSpawnStrategy.java

+28-15
Original file line numberDiff line numberDiff line change
@@ -185,7 +185,7 @@ public ImmutableList<SpawnResult> exec(
185185
spawnLogContext.logSpawn(
186186
spawn,
187187
actionExecutionContext.getMetadataProvider(),
188-
context.getInputMapping(PathFragment.EMPTY_FRAGMENT),
188+
context.getInputMapping(PathFragment.EMPTY_FRAGMENT, /* willAccessRepeatedly= */ false),
189189
context.getTimeout(),
190190
spawnResult);
191191
} catch (IOException | ForbiddenActionInputException e) {
@@ -246,7 +246,9 @@ public ListenableFuture<Void> prefetchInputs()
246246
return actionExecutionContext
247247
.getActionInputPrefetcher()
248248
.prefetchFiles(
249-
getInputMapping(PathFragment.EMPTY_FRAGMENT).values(), getMetadataProvider());
249+
getInputMapping(PathFragment.EMPTY_FRAGMENT, /* willAccessRepeatedly= */ true)
250+
.values(),
251+
getMetadataProvider());
250252
}
251253

252254
return immediateVoidFuture();
@@ -306,22 +308,33 @@ public FileOutErr getFileOutErr() {
306308
}
307309

308310
@Override
309-
public SortedMap<PathFragment, ActionInput> getInputMapping(PathFragment baseDirectory)
311+
public SortedMap<PathFragment, ActionInput> getInputMapping(
312+
PathFragment baseDirectory, boolean willAccessRepeatedly)
310313
throws IOException, ForbiddenActionInputException {
311-
if (lazyInputMapping == null || !inputMappingBaseDirectory.equals(baseDirectory)) {
312-
try (SilentCloseable c =
313-
Profiler.instance().profile("AbstractSpawnStrategy.getInputMapping")) {
314-
inputMappingBaseDirectory = baseDirectory;
315-
lazyInputMapping =
316-
spawnInputExpander.getInputMapping(
317-
spawn,
318-
actionExecutionContext.getArtifactExpander(),
319-
baseDirectory,
320-
actionExecutionContext.getMetadataProvider());
321-
}
314+
// Return previously computed copy if present.
315+
if (lazyInputMapping != null && inputMappingBaseDirectory.equals(baseDirectory)) {
316+
return lazyInputMapping;
317+
}
318+
319+
SortedMap<PathFragment, ActionInput> inputMapping;
320+
try (SilentCloseable c =
321+
Profiler.instance().profile("AbstractSpawnStrategy.getInputMapping")) {
322+
inputMapping =
323+
spawnInputExpander.getInputMapping(
324+
spawn,
325+
actionExecutionContext.getArtifactExpander(),
326+
baseDirectory,
327+
actionExecutionContext.getMetadataProvider());
322328
}
323329

324-
return lazyInputMapping;
330+
// Don't cache the input mapping if it is unlikely that it is used again.
331+
// This reduces memory usage in the case where remote caching/execution is
332+
// used, and the expected cache hit rate is high.
333+
if (willAccessRepeatedly) {
334+
inputMappingBaseDirectory = baseDirectory;
335+
lazyInputMapping = inputMapping;
336+
}
337+
return inputMapping;
325338
}
326339

327340
@Override

src/main/java/com/google/devtools/build/lib/exec/SpawnRunner.java

+2-1
Original file line numberDiff line numberDiff line change
@@ -251,7 +251,8 @@ void lockOutputFiles(int exitCode, String errorMessage, FileOutErr outErr)
251251
* mapping is used in a context where the directory relative to which the keys are interpreted
252252
* is not the same as the execroot.
253253
*/
254-
SortedMap<PathFragment, ActionInput> getInputMapping(PathFragment baseDirectory)
254+
SortedMap<PathFragment, ActionInput> getInputMapping(
255+
PathFragment baseDirectory, boolean willAccessRepeatedly)
255256
throws IOException, ForbiddenActionInputException;
256257

257258
/** Reports a progress update to the Spawn strategy. */

src/main/java/com/google/devtools/build/lib/remote/RemoteAction.java

+14-7
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@
2828
import com.google.devtools.build.lib.vfs.PathFragment;
2929
import java.io.IOException;
3030
import java.util.SortedMap;
31+
import javax.annotation.Nullable;
3132

3233
/** A value class representing an action which can be executed remotely. */
3334
public class RemoteAction {
@@ -36,7 +37,9 @@ public class RemoteAction {
3637
private final SpawnExecutionContext spawnExecutionContext;
3738
private final RemoteActionExecutionContext remoteActionExecutionContext;
3839
private final RemotePathResolver remotePathResolver;
39-
private final MerkleTree merkleTree;
40+
@Nullable private final MerkleTree merkleTree;
41+
private final long inputBytes;
42+
private final long inputFiles;
4043
private final Digest commandHash;
4144
private final Command command;
4245
private final Action action;
@@ -51,12 +54,15 @@ public class RemoteAction {
5154
Digest commandHash,
5255
Command command,
5356
Action action,
54-
ActionKey actionKey) {
57+
ActionKey actionKey,
58+
boolean remoteDiscardMerkleTrees) {
5559
this.spawn = spawn;
5660
this.spawnExecutionContext = spawnExecutionContext;
5761
this.remoteActionExecutionContext = remoteActionExecutionContext;
5862
this.remotePathResolver = remotePathResolver;
59-
this.merkleTree = merkleTree;
63+
this.merkleTree = remoteDiscardMerkleTrees ? null : merkleTree;
64+
this.inputBytes = merkleTree.getInputBytes();
65+
this.inputFiles = merkleTree.getInputFiles();
6066
this.commandHash = commandHash;
6167
this.command = command;
6268
this.action = action;
@@ -80,12 +86,12 @@ public Spawn getSpawn() {
8086
* Returns the sum of file sizes plus protobuf sizes used to represent the inputs of this action.
8187
*/
8288
public long getInputBytes() {
83-
return merkleTree.getInputBytes();
89+
return inputBytes;
8490
}
8591

8692
/** Returns the number of input files of this action. */
8793
public long getInputFiles() {
88-
return merkleTree.getInputFiles();
94+
return inputFiles;
8995
}
9096

9197
/** Returns the id this is action. */
@@ -111,6 +117,7 @@ public Command getCommand() {
111117
return command;
112118
}
113119

120+
@Nullable
114121
public MerkleTree getMerkleTree() {
115122
return merkleTree;
116123
}
@@ -119,9 +126,9 @@ public MerkleTree getMerkleTree() {
119126
* Returns a {@link SortedMap} which maps from input paths for remote action to {@link
120127
* ActionInput}.
121128
*/
122-
public SortedMap<PathFragment, ActionInput> getInputMap()
129+
public SortedMap<PathFragment, ActionInput> getInputMap(boolean willAccessRepeatedly)
123130
throws IOException, ForbiddenActionInputException {
124-
return remotePathResolver.getInputMapping(spawnExecutionContext);
131+
return remotePathResolver.getInputMapping(spawnExecutionContext, willAccessRepeatedly);
125132
}
126133

127134
/**

src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java

+113-63
Original file line numberDiff line numberDiff line change
@@ -142,6 +142,7 @@
142142
import java.util.concurrent.ConcurrentLinkedQueue;
143143
import java.util.concurrent.Executor;
144144
import java.util.concurrent.Phaser;
145+
import java.util.concurrent.Semaphore;
145146
import java.util.concurrent.atomic.AtomicBoolean;
146147
import java.util.function.Predicate;
147148
import javax.annotation.Nullable;
@@ -377,7 +378,9 @@ private MerkleTree buildInputMerkleTree(
377378
}
378379
return MerkleTree.merge(subMerkleTrees, digestUtil);
379380
} else {
380-
SortedMap<PathFragment, ActionInput> inputMap = remotePathResolver.getInputMapping(context);
381+
SortedMap<PathFragment, ActionInput> inputMap =
382+
remotePathResolver.getInputMapping(
383+
context, /* willAccessRepeatedly= */ !remoteOptions.remoteDiscardMerkleTrees);
381384
if (!outputDirMap.isEmpty()) {
382385
// The map returned by getInputMapping is mutable, but must not be mutated here as it is
383386
// shared with all other strategies.
@@ -436,63 +439,90 @@ private static ByteString buildSalt(Spawn spawn) {
436439
return null;
437440
}
438441

442+
/**
443+
* Semaphore for limiting the concurrent number of Merkle tree input roots we compute and keep in
444+
* memory.
445+
*
446+
* <p>When --jobs is set to a high value to let the remote execution service runs many actions in
447+
* parallel, there is no point in letting the local system compute Merkle trees of input roots
448+
* with the same amount of parallelism. Not only does this make Bazel feel sluggish and slow to
449+
* respond to being interrupted, it causes it to exhaust memory.
450+
*
451+
* <p>As there is no point in letting Merkle tree input root computation use a higher concurrency
452+
* than the number of CPUs in the system, use a semaphore to limit the concurrency of
453+
* buildRemoteAction().
454+
*/
455+
private final Semaphore remoteActionBuildingSemaphore =
456+
new Semaphore(Runtime.getRuntime().availableProcessors(), true);
457+
458+
@Nullable
459+
private ToolSignature getToolSignature(Spawn spawn, SpawnExecutionContext context)
460+
throws IOException, ExecException, InterruptedException {
461+
return remoteOptions.markToolInputs
462+
&& Spawns.supportsWorkers(spawn)
463+
&& !spawn.getToolFiles().isEmpty()
464+
? computePersistentWorkerSignature(spawn, context)
465+
: null;
466+
}
467+
439468
/** Creates a new {@link RemoteAction} instance from spawn. */
440469
public RemoteAction buildRemoteAction(Spawn spawn, SpawnExecutionContext context)
441470
throws IOException, ExecException, ForbiddenActionInputException, InterruptedException {
442-
ToolSignature toolSignature =
443-
remoteOptions.markToolInputs
444-
&& Spawns.supportsWorkers(spawn)
445-
&& !spawn.getToolFiles().isEmpty()
446-
? computePersistentWorkerSignature(spawn, context)
447-
: null;
448-
final MerkleTree merkleTree = buildInputMerkleTree(spawn, context, toolSignature);
449-
450-
// Get the remote platform properties.
451-
Platform platform = PlatformUtils.getPlatformProto(spawn, remoteOptions);
452-
if (toolSignature != null) {
453-
platform =
454-
PlatformUtils.getPlatformProto(
455-
spawn, remoteOptions, ImmutableMap.of("persistentWorkerKey", toolSignature.key));
456-
} else {
457-
platform = PlatformUtils.getPlatformProto(spawn, remoteOptions);
458-
}
459-
460-
Command command =
461-
buildCommand(
462-
spawn.getOutputFiles(),
463-
spawn.getArguments(),
464-
spawn.getEnvironment(),
465-
platform,
466-
remotePathResolver);
467-
Digest commandHash = digestUtil.compute(command);
468-
Action action =
469-
Utils.buildAction(
470-
commandHash,
471-
merkleTree.getRootDigest(),
472-
platform,
473-
context.getTimeout(),
474-
Spawns.mayBeCachedRemotely(spawn),
475-
buildSalt(spawn));
476-
477-
ActionKey actionKey = digestUtil.computeActionKey(action);
478-
479-
RequestMetadata metadata =
480-
TracingMetadataUtils.buildMetadata(
481-
buildRequestId, commandId, actionKey.getDigest().getHash(), spawn.getResourceOwner());
482-
RemoteActionExecutionContext remoteActionExecutionContext =
483-
RemoteActionExecutionContext.create(
484-
spawn, metadata, getWriteCachePolicy(spawn), getReadCachePolicy(spawn));
485-
486-
return new RemoteAction(
487-
spawn,
488-
context,
489-
remoteActionExecutionContext,
490-
remotePathResolver,
491-
merkleTree,
492-
commandHash,
493-
command,
494-
action,
495-
actionKey);
471+
remoteActionBuildingSemaphore.acquire();
472+
try {
473+
ToolSignature toolSignature = getToolSignature(spawn, context);
474+
final MerkleTree merkleTree = buildInputMerkleTree(spawn, context, toolSignature);
475+
476+
// Get the remote platform properties.
477+
Platform platform = PlatformUtils.getPlatformProto(spawn, remoteOptions);
478+
if (toolSignature != null) {
479+
platform =
480+
PlatformUtils.getPlatformProto(
481+
spawn, remoteOptions, ImmutableMap.of("persistentWorkerKey", toolSignature.key));
482+
} else {
483+
platform = PlatformUtils.getPlatformProto(spawn, remoteOptions);
484+
}
485+
486+
Command command =
487+
buildCommand(
488+
spawn.getOutputFiles(),
489+
spawn.getArguments(),
490+
spawn.getEnvironment(),
491+
platform,
492+
remotePathResolver);
493+
Digest commandHash = digestUtil.compute(command);
494+
Action action =
495+
Utils.buildAction(
496+
commandHash,
497+
merkleTree.getRootDigest(),
498+
platform,
499+
context.getTimeout(),
500+
Spawns.mayBeCachedRemotely(spawn),
501+
buildSalt(spawn));
502+
503+
ActionKey actionKey = digestUtil.computeActionKey(action);
504+
505+
RequestMetadata metadata =
506+
TracingMetadataUtils.buildMetadata(
507+
buildRequestId, commandId, actionKey.getDigest().getHash(), spawn.getResourceOwner());
508+
RemoteActionExecutionContext remoteActionExecutionContext =
509+
RemoteActionExecutionContext.create(
510+
spawn, metadata, getWriteCachePolicy(spawn), getReadCachePolicy(spawn));
511+
512+
return new RemoteAction(
513+
spawn,
514+
context,
515+
remoteActionExecutionContext,
516+
remotePathResolver,
517+
merkleTree,
518+
commandHash,
519+
command,
520+
action,
521+
actionKey,
522+
remoteOptions.remoteDiscardMerkleTrees);
523+
} finally {
524+
remoteActionBuildingSemaphore.release();
525+
}
496526
}
497527

498528
@Nullable
@@ -1338,7 +1368,7 @@ private void reportUploadError(Throwable error) {
13381368
* <p>Must be called before calling {@link #executeRemotely}.
13391369
*/
13401370
public void uploadInputsIfNotPresent(RemoteAction action, boolean force)
1341-
throws IOException, InterruptedException {
1371+
throws IOException, ExecException, ForbiddenActionInputException, InterruptedException {
13421372
checkState(!shutdown.get(), "shutdown");
13431373
checkState(mayBeExecutedRemotely(action.getSpawn()), "spawn can't be executed remotely");
13441374

@@ -1347,13 +1377,33 @@ public void uploadInputsIfNotPresent(RemoteAction action, boolean force)
13471377
Map<Digest, Message> additionalInputs = Maps.newHashMapWithExpectedSize(2);
13481378
additionalInputs.put(action.getActionKey().getDigest(), action.getAction());
13491379
additionalInputs.put(action.getCommandHash(), action.getCommand());
1350-
remoteExecutionCache.ensureInputsPresent(
1351-
action
1352-
.getRemoteActionExecutionContext()
1353-
.withWriteCachePolicy(CachePolicy.REMOTE_CACHE_ONLY), // Only upload to remote cache
1354-
action.getMerkleTree(),
1355-
additionalInputs,
1356-
force);
1380+
1381+
// As uploading depends on having the full input root in memory, limit
1382+
// concurrency. This prevents memory exhaustion. We assume that
1383+
// ensureInputsPresent() provides enough parallelism to saturate the
1384+
// network connection.
1385+
remoteActionBuildingSemaphore.acquire();
1386+
try {
1387+
MerkleTree merkleTree = action.getMerkleTree();
1388+
if (merkleTree == null) {
1389+
// --experimental_remote_discard_merkle_trees was provided.
1390+
// Recompute the input root.
1391+
Spawn spawn = action.getSpawn();
1392+
SpawnExecutionContext context = action.getSpawnExecutionContext();
1393+
ToolSignature toolSignature = getToolSignature(spawn, context);
1394+
merkleTree = buildInputMerkleTree(spawn, context, toolSignature);
1395+
}
1396+
1397+
remoteExecutionCache.ensureInputsPresent(
1398+
action
1399+
.getRemoteActionExecutionContext()
1400+
.withWriteCachePolicy(CachePolicy.REMOTE_CACHE_ONLY), // Only upload to remote cache
1401+
merkleTree,
1402+
additionalInputs,
1403+
force);
1404+
} finally {
1405+
remoteActionBuildingSemaphore.release();
1406+
}
13571407
}
13581408

13591409
/**

src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnCache.java

+1-1
Original file line numberDiff line numberDiff line change
@@ -202,7 +202,7 @@ public void close() {}
202202

203203
private void checkForConcurrentModifications()
204204
throws IOException, ForbiddenActionInputException {
205-
for (ActionInput input : action.getInputMap().values()) {
205+
for (ActionInput input : action.getInputMap(true).values()) {
206206
if (input instanceof VirtualActionInput) {
207207
continue;
208208
}

src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnRunner.java

+2-2
Original file line numberDiff line numberDiff line change
@@ -621,9 +621,9 @@ private Map<Path, Long> getInputCtimes(SortedMap<PathFragment, ActionInput> inpu
621621
SpawnResult execLocallyAndUpload(
622622
RemoteAction action, Spawn spawn, SpawnExecutionContext context, boolean uploadLocalResults)
623623
throws ExecException, IOException, ForbiddenActionInputException, InterruptedException {
624-
Map<Path, Long> ctimesBefore = getInputCtimes(action.getInputMap());
624+
Map<Path, Long> ctimesBefore = getInputCtimes(action.getInputMap(true));
625625
SpawnResult result = execLocally(spawn, context);
626-
Map<Path, Long> ctimesAfter = getInputCtimes(action.getInputMap());
626+
Map<Path, Long> ctimesAfter = getInputCtimes(action.getInputMap(true));
627627
uploadLocalResults =
628628
uploadLocalResults && Status.SUCCESS.equals(result.status()) && result.exitCode() == 0;
629629
if (!uploadLocalResults) {

0 commit comments

Comments
 (0)