Api Overhead Benchmark is a set of tests aimed at measuring CPU-side execution duration of compute API calls.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| AppendLaunchKernel | measures time spent in zeCommandListAppendLaunchKernel on CPU. | | ✔️ | ❌ |
| CreateCommandList | measures time spent in zeCommandListCreate on CPU. | | ✔️ | ❌ |
| CreateCommandListImmediate | measures time spent in zeCommandListCreateImmediate on CPU. | | ✔️ | ❌ |
| DestroyCommandList | measures time spent in zeCommandListDestroy on CPU. | | ✔️ | ❌ |
| DestroyCommandListImmediate | measures time spent in zeCommandListDestroy on CPU, for immediate cmdlist. | | ✔️ | ❌ |
| EnqueueNdrNullLws | measures time spent in clEnqueueNDRangeKernel on CPU. Null LWS is provided, which causes driver to calculate it | | ❌ | ✔️ |
| EnqueueNdrTime | measures time spent in clEnqueueNDRangeKernel on CPU. | | ❌ | ✔️ |
| EventCreation | measures time spent to create event | | ✔️ | ❌ |
| ExecuteCommandList | measures time spent in zeCommandQueueExecuteCommandLists on CPU. | | ✔️ | ❌ |
| ExecuteCommandListImmediate | measures time spent in appending launch kernel for immediate command list on CPU. | | ✔️ | ❌ |
| ExecuteCommandListImmediateMultiKernel | measures time spent in executing multiple instances of two different kernels with immediate command list on CPU. | | ✔️ | ❌ |
| ExecuteCommandListWithFenceCreate | measures time spent in zeFenceCreate on CPU when fences are used. | | ✔️ | ❌ |
| ExecuteCommandListWithFenceDestroy | measures time spent in zeFenceDestroy on CPU when fences are used. | | ✔️ | ❌ |
| ExecuteCommandListWithFenceUsage | measures time spent in zeCommandQueueExecuteCommandLists and zeFenceSynchronize on CPU when fences are used. | | ✔️ | ❌ |
| ExecuteCommandListWithIndirectAccess | measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are accessed. | | ✔️ | ❌ |
| ExecuteCommandListWithIndirectArguments | measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are used. | | ✔️ | ❌ |
| FlushTime | measures time spent in clEnqueueNDRangeKernel on CPU. | | ❌ | ✔️ |
| KernelSetArgumentValueImmediate | measures time spent in zeKernelSetArgumentValue for immediate arguments on CPU. | | ✔️ | ❌ |
| LifecycleCommandList | measures time spent in zeCommandListCreate + Close + Execute on CPU. | | ✔️ | ❌ |
| SetKernelArgSvmPointer | measures time spent in clSetKernelArgSVMPointer on CPU. | | ✔️ | ✔️ |
| UsmMemoryAllocation | measures time spent in USM memory allocation APIs. | | ✔️ | ❌ |

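
The tests in this group all follow the same idea: take a host timestamp, issue one API call, take another timestamp, and report the difference. The sketch below illustrates that pattern in Python. It is not the benchmark's implementation; `measure_call_overhead` and `fake_api_call` are hypothetical names, and `fake_api_call` is a stand-in for a real driver entry point such as zeCommandListAppendLaunchKernel.

```python
import statistics
import time


def measure_call_overhead(call, iterations=1000, warmup=10):
    """Return per-call CPU-side durations in nanoseconds."""
    # Warm-up calls are not recorded, so one-time setup cost does not skew results.
    for _ in range(warmup):
        call()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        call()  # only the measured call sits between the two timestamps
        samples.append(time.perf_counter_ns() - start)
    return samples


def fake_api_call():
    # Hypothetical stand-in for a driver entry point; it only does trivial host work.
    return sum(range(100))


samples = measure_call_overhead(fake_api_call)
print(f"median: {statistics.median(samples)} ns, min: {min(samples)} ns")
```

Reporting a median alongside the minimum is one reasonable choice here, since single host-side samples are easily disturbed by scheduling noise.
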
Atomic Benchmark is a set of tests aimed at measuring performance of atomic operations inside kernels.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| OneAtomic | enqueues kernel performing an atomic operation on a single address | | ❌ | ✔️ |
| OneAtomicExplicit | enqueues kernel performing an atomic operation on a single address using OpenCL 2.0 Atomics with explicit memory order and scope | | ❌ | ✔️ |
| OneLocalAtomic | enqueues kernel performing an atomic operation on a single location placed in SLM | | ❌ | ✔️ |
| OneLocalAtomicExplicit | enqueues kernel performing an atomic operation on a single location placed in SLM using OpenCL 2.0 Atomics with explicit memory order and scope | | ❌ | ✔️ |
| SeparateAtomics | enqueues kernel performing an atomic operation on different addresses | | ❌ | ✔️ |
| SeparateAtomicsExplicit | enqueues kernel performing an atomic operation on different addresses | | ❌ | ✔️ |

EU Benchmark is a set of tests aimed at measuring performance of calculations performed in kernels.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| DoMathOperation | enqueues kernel performing a math operation | | ❌ | ✔️ |
| ReadAfterAtomicWrite | enqueues kernel which writes to global memory using an atomic and then reads non-atomically | | ❌ | ✔️ |

Gpu Commands Benchmark is a set of tests aimed at measuring GPU-side execution duration of various commands.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| BarrierBetweenKernels | measures time required to run a barrier command between 2 kernels, including potential cache flush commands | | ✔️ | ❌ |
| CopyWithEvent | measures time required to run a copy kernel with various event configurations. | | ✔️ | ❌ |
| EmptyKernel | measures time required to run an empty kernel on GPU. | | ✔️ | ❌ |
| EventCtxtSwitchLatency | measures context switching latency time required to switch between various engine types | | ✔️ | ❌ |
| KernelWithEvent | measures time required to run an empty kernel with various event configurations. | | ✔️ | ❌ |
| KernelWithWork | measures time required to run a GPU kernel which assigns values to elements of a buffer. | | ✔️ | ❌ |
| WaitOnEventCold | measures time required to service a signalled semaphore, that has never been waited for. | | ✔️ | ❌ |
| WaitOnEventFromWalker | measures time required to service a signalled semaphore coming from Walker command | | ✔️ | ❌ |
| WaitOnEventHot | measures time required to service a signalled semaphore, that was previously used | | ✔️ | ❌ |
| WriteTimestamp | measures time required to write a timestamp on GPU. | | ✔️ | ❌ |

Memory Benchmark is a set of tests aimed at measuring bandwidth of memory transfers.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| CopyBuffer | allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available. | | ❌ | ✔️ |
| CopyBufferRect | allocates two OpenCL buffers and measures rectangle copy bandwidth between them. Buffers will be placed in device memory, if it's available. | | ❌ | ✔️ |
| CopyEntireImage | allocates two image objects and measures copy bandwidth between them. Images will be placed in device memory, if it's available. | | ✔️ | ✔️ |
| FillBuffer | allocates an OpenCL buffer and measures fill bandwidth. Buffer will be placed in device memory, if it's available. | | ❌ | ✔️ |
| MapBuffer | allocates an OpenCL buffer and measures map bandwidth. Mapping operation means memory transfer from GPU to CPU or a no-op, depending on map flags. | | ❌ | ✔️ |
| ReadBuffer | allocates an OpenCL buffer and measures read bandwidth. Read operation means transfer from GPU to CPU. | | ❌ | ✔️ |
| ReadBufferMisaligned | allocates an OpenCL buffer and measures read bandwidth. Read operation means transfer from GPU to CPU. Destination pointer passed by the application will be misaligned by the specified amount of bytes. | | ❌ | ✔️ |
| ReadBufferRect | allocates an OpenCL buffer and measures rectangle read bandwidth. Rectangle read operation means transfer from GPU to CPU. | | ❌ | ✔️ |
| ReadDeviceMemBuffer | allocates two OpenCL buffers and measures source buffer read bandwidth. Source buffer resides in device memory. | | ❌ | ✔️ |
| RemoteAccessMemory | Uses stream memory triad to measure bandwidth with different percentages of remote memory access. | | ❌ | ✔️ |
| SLM_DataAccessLatency | generates SLM local memory transactions inside thread group to measure latency between reads (uses the Intel-only private intel_get_cycle_counter()) | | ❌ | ✔️ |
| StreamMemory | Streams memory inside of kernel in a fashion described by 'type'. Copy means one memory location is read from and the second one is written to. Triad means two buffers are read and one is written to. In read and write, memory is only read or only written to, respectively. | | ✔️ | ✔️ |
| UnmapBuffer | allocates an OpenCL buffer and measures unmap bandwidth. Unmapping operation means memory transfer from CPU to GPU or a no-op, depending on map flags. | | ❌ | ✔️ |
| UsmCopy | allocates two unified shared memory buffers and measures copy bandwidth between them. | | ✔️ | ✔️ |
| UsmCopyMultipleBlits | allocates two unified shared memory buffers, divides them into chunks, copies each chunk using a different copy engine and measures bandwidth. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided between the remaining copy engines. Otherwise the division is even. | | ✔️ | ✔️ |
| UsmFill | allocates a unified memory buffer and measures fill bandwidth | | ✔️ | ✔️ |
| UsmFillMultipleBlits | allocates a unified shared memory buffer, divides it into chunks, fills each chunk using a different copy engine and measures bandwidth. Refer to UsmCopyMultipleBlits for more details. | | ✔️ | ✔️ |
| UsmFillSpecificPattern | allocates a unified memory buffer and measures fill bandwidth. Allows specifying an arbitrary pattern. | | ✔️ | ✔️ |
| UsmMemset | allocates a unified memory buffer and measures memset bandwidth | | ❌ | ✔️ |
| UsmSharedMigrateCpu | allocates a unified shared memory buffer and measures time to migrate it from GPU to CPU | | ✔️ | ✔️ |
| UsmSharedMigrateGpu | allocates a unified shared memory buffer and measures time to migrate it from CPU to GPU | | ✔️ | ✔️ |
| WriteBuffer | allocates an OpenCL buffer and measures write bandwidth. Write operation means transfer from CPU to GPU. | | ❌ | ✔️ |
| WriteBufferRect | allocates an OpenCL buffer and measures rectangle write bandwidth. Rectangle write operation means transfer from CPU to GPU. | | ❌ | ✔️ |

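
The work division and total bandwidth rule described for UsmCopyMultipleBlits can be illustrated with a small calculation. This is a sketch under stated assumptions, not the benchmark's code: the function names are hypothetical, bit 0 of the mask is taken to be the main copy engine, "worst result" is read as the longest per-engine time, and buffer sizes are assumed to divide evenly.

```python
def split_among_blitters(buffer_size, blitter_mask):
    """Map engine index -> chunk size in bytes for a bit mask of selected engines."""
    engines = [i for i in range(blitter_mask.bit_length()) if (blitter_mask >> i) & 1]
    if not engines:
        raise ValueError("at least one copy engine must be selected")

    # Assumption: the rightmost bit (index 0) selects the main copy engine.
    main_selected = engines[0] == 0
    if main_selected and len(engines) > 1:
        # Main engine takes half; the other half is shared by the remaining engines.
        others = engines[1:]
        main_chunk = buffer_size // 2
        per_other = (buffer_size - main_chunk) // len(others)
        chunks = {0: main_chunk}
        chunks.update({engine: per_other for engine in others})
    else:
        # Even division (remainder bytes are ignored in this sketch).
        per_engine = buffer_size // len(engines)
        chunks = {engine: per_engine for engine in engines}
    return chunks


def total_bandwidth(buffer_size, engine_times_s):
    """Bytes per second: whole buffer size divided by the slowest engine's time."""
    return buffer_size / max(engine_times_s.values())


print(split_among_blitters(1024, 0b1011))  # {0: 512, 1: 256, 3: 256}
print(split_among_blitters(1024, 0b0110))  # {1: 512, 2: 512}
print(total_bandwidth(1024, {0: 2.0, 1: 1.0, 3: 1.5}))  # 512.0
```

Because the total is bounded by the slowest engine, an uneven split can raise or lower the reported total depending on how fast the main engine is relative to the others.
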
Miscellaneous Benchmark is a set of tests measuring different simple compute scenarios.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| KernelWithWork | measures time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. Benchmark checks the impact of kernel split. | | ❌ | ✔️ |
| Reduction | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
| Reduction2 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
| Reduction3 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
| Reduction4 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
| Reduction5 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
| VectorSum | Performs vector addition | | ❌ | ✔️ |

Multithread Benchmark is a set of tests aimed at measuring how different commands benefit from multithreaded execution.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| SvmCopy | enqueues multiple svm copies on multiple threads concurrently. | | ✔️ | ✔️ |

Multi-tile Memory Benchmark is a set of tests aimed at measuring bandwidth of memory transfers performed on a multi-tile device.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| CopyBuffer | allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available. | | ❌ | ✔️ |
| FillBuffer | allocates an OpenCL buffer and measures fill bandwidth. Buffer will be placed in device memory, if it's available. | | ❌ | ✔️ |
| ReadBuffer | allocates an OpenCL buffer and measures read bandwidth. Read operation means transfer from GPU to CPU. | | ❌ | ✔️ |
| UsmCopy | allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function. | | ✔️ | ✔️ |
| UsmCopyImmediate | allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function appended to an immediate list. | | ✔️ | ❌ |
| UsmCopyKernel | allocates two unified shared memory buffers and measures copy bandwidth between them using a custom kernel. | | ✔️ | ✔️ |
| UsmFill | allocates a unified shared memory buffer and measures fill bandwidth. | | ✔️ | ✔️ |
| UsmSharedMigrateCpu | allocates a unified shared memory buffer and measures time to migrate it from GPU to CPU. | | ✔️ | ✔️ |
| UsmSharedMigrateGpu | allocates a unified shared memory buffer and measures time to migrate it from CPU to GPU. | | ✔️ | ✔️ |
| WriteBuffer | allocates an OpenCL buffer and measures write bandwidth. Write operation means transfer from CPU to GPU. | | ❌ | ✔️ |

Overlap Benchmark is a set of tests aimed at measuring how different commands benefit from simultaneous execution.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| KernelAndCopy | enqueues kernel and copy operation with the ability to perform both tasks on different command queues. | | ❌ | ✔️ |
| MultiProcessCompute | Creates a number of separate processes for each tile specified performing a compute workload and measures average time to complete all of them. Processes will use affinity mask to select specific sub-devices for the execution. | | ✔️ | ❌ |
| MultiProcessComputeSharedBuffer | Creates a number of separate processes for each tile specified performing a compute workload and measures average time to complete all of them. Processes will use affinity mask to select specific sub-devices for the execution. A single buffer for each tile is created by parent process. All processes executing on a given tile will share it via IPC calls. | | ✔️ | ❌ |

Ulls Benchmark is a set of tests aimed at measuring Ultra Low Latency Submission (ULLS) performance impact.

| Test name | Description | Params | L0 | OCL |
|-----------|-------------|--------|----|-----|
| BestSubmission | enqueues a system memory write via PIPE_CONTROL and measures when update becomes visible on the CPU. | | ✔️ | ❌ |
| BestWalkerSubmission | enqueues kernel, which updates system memory location and then busy-loops on CPU until the update becomes visible. | | ✔️ | ✔️ |
| BestWalkerSubmissionImmediate | enqueues kernel, which updates system memory location and then busy-loops on CPU until the update becomes visible. Kernel is enqueued using low-latency immediate command list, so the test is LevelZero-specific. | | ✔️ | ❌ |
| CompletionLatency | enqueues system memory write and measures time between the moment when the update is visible on CPU and the moment when the synchronizing call returns. | | ✔️ | ❌ |
| CopySubmissionEvents | enqueues a 4-byte copy to the copy engine and returns the submission delta, which is the time between the host API call and copy engine start | | ✔️ | ✔️ |
| EmptyKernel | enqueues empty kernel and measures time to launch it and wait for it on CPU, thus measuring walker spawn time. | | ✔️ | ✔️ |
| KernelSwitchLatency | measures time from end of one kernel till start of next kernel | | ✔️ | ✔️ |
| KernelSwitchLatencyImmediate | measures time from end of one kernel till start of next kernel using immediate command lists | | ✔️ | ❌ |
| KernelWithWork | measures time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. | | ✔️ | ✔️ |
| MultiQueueSubmission | enqueues kernel on multiple command queues | | ✔️ | ✔️ |
| NewResourcesSubmissionDevice | enqueues kernel that uses a buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation. | | ✔️ | ✔️ |
| NewResourcesSubmissionHost | enqueues kernel that uses a buffer placed in host memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation. | | ✔️ | ✔️ |
| NewResourcesWithGpuAccess | enqueues kernel that accesses an entire buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation. | | ✔️ | ✔️ |
| QueuePriorities | Uses queues with different priorities to measure submission and context switch latencies | | ❌ | ✔️ |
| ResourceReassign | Enqueues a stress kernel which utilizes the majority of the GPU's execution units, then enqueues the next kernel, measuring its execution time. Shows the overhead related to releasing and assigning GPU resources. | | ❌ | ✔️ |
| RoundTripSubmission | enqueues kernel which updates system memory location and waits for it with a synchronizing API. | | ✔️ | ✔️ |
| UsmSharedFirstCpuAccess | allocates a unified shared memory buffer and measures time to access it on CPU after creation. | | ✔️ | ✔️ |
| UsmSharedFirstGpuAccess | allocates a unified shared memory buffer and measures time to access it on GPU after creation. | | ✔️ | ✔️ |
| WalkerCompletionLatency | enqueues a kernel writing to system memory and measures time between the moment when update is visible on CPU and the moment when synchronizing call returns | | ✔️ | ✔️ |
| WalkerSubmissionEvents | enqueues an empty kernel with GPU-side profiling and checks delta between queue time and start time. | | ✔️ | ✔️ |
| WriteLatency | unblocks event on GPU, then waits for timestamp being written. | | ✔️ | ❌ |

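
The measurement pattern behind BestWalkerSubmission (submit work that writes to a memory location, then busy-loop on the CPU until the write is visible) can be mimicked on the host alone. The sketch below is an analogy, not the benchmark's implementation: a Python thread stands in for the GPU kernel, a one-element list stands in for the system memory location, and all names are hypothetical.

```python
import threading
import time


def best_walker_submission_sketch(simulated_delay_s=0.001):
    """Return nanoseconds from 'submission' until the polled update is visible."""
    flag = [0]  # stands in for the system memory location the kernel writes

    def fake_kernel():
        # Hypothetical stand-in for GPU work: wait briefly, then publish the update.
        time.sleep(simulated_delay_s)
        flag[0] = 1

    worker = threading.Thread(target=fake_kernel)
    start = time.perf_counter_ns()
    worker.start()       # plays the role of enqueueing the kernel
    while flag[0] == 0:  # busy-loop on the CPU until the write becomes visible
        pass
    latency_ns = time.perf_counter_ns() - start
    worker.join()
    return latency_ns


latency_ns = best_walker_submission_sketch()
print(f"update became visible after {latency_ns / 1000:.1f} us")
```

Polling rather than blocking on a synchronizing call keeps wake-up cost out of the measured interval, which is the distinction the RoundTripSubmission and completion latency tests appear to explore from the other side.
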