This behaviour is NVIDIA-specific and comes from limitations on memory transfers in the CUDA implementation (see the CUDA docs and the PyOpenCL issue). Internally, NVIDIA's OpenCL calls into the CUDA API, so the limitations propagate to OpenCL.
Firstly, host-allocated memory can be of two types: pinned (non-paged) and non-pinned (paged), and in CUDA only transfers from pinned memory (which is kept in RAM and never swapped out to disk) can be performed in a non-blocking/asynchronous way.
Secondly, if the memory is paged, CUDA first copies it to an internal pinned staging buffer and only then transfers it to device memory, and the whole operation is blocking, see this StackOverflow answer. Presumably, this explains the long copy time of the first transfer.
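For comparison, here is a minimal sketch of the paged path described above (the shape and variable names are only illustrative): copying straight from a regular Numpy array is staged through the internal pinned buffer, so on NVIDIA the call effectively blocks even with is_blocking=False.
import numpy as np
import pyopencl as cl
from timeit import default_timer as dt

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.random((1000, 1000, 100))  # regular Numpy array: paged host memory
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)

start = dt()
# Paged source: CUDA stages the data through a pinned buffer first,
# so the call returns only once that staging is done
cl.enqueue_copy(queue, buf, a, is_blocking=False)
print(f'Copy from paged source: {dt()-start:0.4f} s')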
In order to use asynchronous memory transfers and kernel execution, only pinned memory must be used, and to get pinned memory it has to be allocated by OpenCL itself.
Arrays created by Numpy normally use paged memory, and Numpy has no functionality to explicitly request pinned memory.
To create an array backed by pinned memory, the Numpy array has to be built on top of a buffer allocated by OpenCL.
The first step is to create a Buffer:
buffer = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.ALLOC_HOST_PTR, size=a.nbytes)
This allocates memory on both the host and the device. The ALLOC_HOST_PTR flag forces OpenCL to allocate pinned memory on the host. Unlike with the COPY_HOST_PTR flag, this memory is created empty and is not tied to an existing Numpy array.
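For contrast, a buffer created with the COPY_HOST_PTR flag is initialised from an existing host array at creation time; this one-liner is only a sketch to show the difference and is not needed for the pinned-memory approach:
b_buff = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)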
Then, the buffer has to be mapped to a Numpy array:
mapped, event = cl.enqueue_map_buffer(queue, buffer, cl.map_flags.WRITE, 0, shape=a.shape, dtype=a.dtype)
mapped is a Numpy array that can then be used conventionally in Python.
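For example (purely illustrative), it supports the usual Numpy operations, and writes go straight into the pinned host allocation:
mapped.fill(0.0)  # writes directly into the pinned host memory
print(mapped.shape, mapped.dtype)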
Finally, the mapped array can be filled with data from the target array:
mapped[...] = a
Now, running the same benchmark shows non-blocking behaviour:
import numpy as np
import pyopencl as cl
from timeit import default_timer as dt

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.random((1000, 1000, 500)).astype(np.float64)
mf = cl.mem_flags

start = dt()
# Allocate a pinned host/device buffer, map it to a Numpy array,
# fill it with the data and enqueue the initial host-to-device copy
size = a.size * a.itemsize
a_buff = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=size)
a_mapped, event = cl.enqueue_map_buffer(queue, a_buff, cl.map_flags.WRITE, 0, shape=a.shape, dtype=a.dtype)
a_mapped[:] = a
cl.enqueue_copy(queue, a_buff, a_mapped, is_blocking=False)
print(f'Buffer creation time: {dt()-start:0.4f} s')

start = dt()
# Blocking copy from the pinned array, for reference
event1 = cl.enqueue_copy(queue, a_buff, a_mapped, is_blocking=True)
print(f'Copy time blocking 1: {dt()-start:0.4f} s')

start = dt()
# Non-blocking copies from/to the pinned array return immediately
event2 = cl.enqueue_copy(queue, a_buff, a_mapped, is_blocking=False)
print(f'Copy time non-blocking (Host to Device): {dt()-start:0.4f} s')

start = dt()
event3 = cl.enqueue_copy(queue, a_mapped, a_buff, is_blocking=False)
print(f'Copy time non-blocking (Device to Host): {dt()-start:0.4f} s')
Result:
Buffer creation time: 1.8355 s
Copy time blocking 1: 0.3096 s
Copy time non-blocking (Host to Device): 0.0001 s
Copy time non-blocking (Device to Host): 0.0000 s
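Note that the non-blocking calls only enqueue the transfers; to know when the data has actually arrived, the returned events have to be waited on, for example:
cl.wait_for_events([event2, event3])  # or event2.wait(); event3.wait()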
PS: as you can see, getting non-blocking behaviour changes how the underlying memory is allocated. Supporting it would require refactoring all array-creation routines, which means it cannot be implemented 'on top' without significantly changing the source code.
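For illustration only, a hypothetical helper (pinned_empty is not an existing PyOpenCL function) sketching what a refactored array-creation routine could look like, reusing the imports from the benchmark above:
def pinned_empty(ctx, queue, shape, dtype=np.float64):
    # Allocate a device buffer backed by pinned host memory and return it
    # together with a Numpy array mapped onto the pinned region
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.ALLOC_HOST_PTR, size=nbytes)
    host, _ = cl.enqueue_map_buffer(queue, buf, cl.map_flags.WRITE, 0, shape=shape, dtype=dtype)
    return host, buf
Arrays would then be created via, e.g., a_mapped, a_buff = pinned_empty(ctx, queue, a.shape, a.dtype) instead of plain Numpy allocation.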