Sorry guys, but I have managed to beat the performance of all your solutions.

First, some details about my computer: CPU Intel(R) Core(TM) i5-4430 @ 3.00GHz, RAM 2 × 8 GiB Kingston DDR3, OS Windows 11 Pro 23H2 x64, CPython 3.12.6 x64, NumPy 2.0.2, and I am using IPython 8.27.0 in Windows Terminal.

I have tried to implement my idea mentioned in my latest edit to the question, and surprisingly I have actually beaten all submitted solutions and even np.unpackbits...

First I will describe the ways I have found to generate sequences of integers with periodic gaps.

Say you want to generate a list of numbers that includes every other group of width numbers. For example, if the width is 5, we include the first five natural numbers 0, 1, 2, 3, 4, then skip the next five numbers and append 10, 11, 12, 13, 14 to the list, then skip another five and append 20, 21, 22, 23, 24, and so on.

Each group of included numbers has a width of 5, and they all start at a multiple of 5. The included numbers are exactly those whose floor quotient by 5 is even, i.e. (n // 5) & 1 == 0.
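
To make the rule concrete, here is a small illustrative check of my own (not part of the original functions):

import numpy as np

width = 5
nums = np.arange(30)
included = nums[(nums // width) & 1 == 0]  # keep numbers whose quotient by width is even
print(included)  # [ 0  1  2  3  4 10 11 12 13 14 20 21 22 23 24]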

So if I want every other group of five integers, starting at 7 and ending at 100, I can do this:

In [35]: nums = np.arange(100)

In [36]: ~((nums - 7) // 5) & 1
Out[36]:
array([1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1])

But this is inefficient (and the first 2 bits should be off). NumPy arrays broadcast when added, so I can instead generate the included indices directly:

In [37]: np.arange(7, 100, 10)[:, None] + np.arange(5)
Out[37]:
array([[  7,   8,   9,  10,  11],
       [ 17,  18,  19,  20,  21],
       [ 27,  28,  29,  30,  31],
       [ 37,  38,  39,  40,  41],
       [ 47,  48,  49,  50,  51],
       [ 57,  58,  59,  60,  61],
       [ 67,  68,  69,  70,  71],
       [ 77,  78,  79,  80,  81],
       [ 87,  88,  89,  90,  91],
       [ 97,  98,  99, 100, 101]])

But both are inefficient:

In [38]: %timeit column = np.arange(65536, dtype=np.uint16); ~((column - 64) >> 7) & 1
219 μs ± 15.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [39]: %timeit np.arange(64, 65536, 256, dtype=np.uint16)[:, None] + np.arange(128)
61.2 μs ± 2.97 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [40]: %timeit np.tile(np.concatenate([np.zeros(64, dtype=bool), np.ones(64, dtype=bool)]), 512)
17.9 μs ± 662 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Though, as you can see, we need np.concatenate to build the tile when the starting position of the pattern isn't a multiple of the tile's length.
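
The new functions below rely on two small helpers from my earlier answers, validate and get_dtype, which are not repeated here. A minimal sketch of what they are assumed to do (my reconstruction, not the original definitions): validate rejects anything that is not a positive integer, and get_dtype picks the smallest unsigned integer dtype that can hold every value in range(1 << n).

import numpy as np

# Assumed helpers (reconstruction, not the original code from the earlier answers).

def validate(n: int) -> None:
    if not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")


def get_dtype(n: int) -> type:
    # Smallest unsigned integer dtype that can hold every value in range(1 << n).
    for dtype in (np.uint8, np.uint16, np.uint32):
        if n <= 8 * dtype().itemsize:
            return dtype
    return np.uint64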

I have implemented 5 new functions. binary_codes_8 implements the above idea, though it isn't efficient:

def binary_codes_8(n: int) -> np.ndarray:
    dtype = get_dtype(n)
    result = np.zeros(((length := 1 << n), n), dtype=bool)
    width = 1
    # Fill the columns from the least significant bit (rightmost) to the most significant:
    # in column i the runs of ones have length `width` and start at odd multiples of `width`.
    for i in range(n - 1, -1, -1):
        result[:, i][
            np.arange(width, length, (step := width << 1), dtype=dtype)[:, None]
            + np.arange(width)
        ] = 1
        width = step

    return result

In [41]: %timeit binary_codes_6(16)
1.14 ms ± 37.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [42]: %timeit binary_codes_8(16)
2.53 ms ± 41.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Instead I have a new idea: rather than filling a two-dimensional array column by column, we can join the columns head to tail into a one-dimensional array and reshape at the end. This keeps memory access contiguous and avoids a broadcast assignment per iteration.
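
To illustrate the layout trick with a tiny, made-up example of my own: concatenating the columns, reshaping to (n, 1 << n) and transposing turns each piece back into a column.

import numpy as np

# Two hypothetical length-4 columns, joined head to tail.
col0 = np.array([0, 0, 1, 1], dtype=bool)
col1 = np.array([0, 1, 0, 1], dtype=bool)

table = np.concatenate([col0, col1]).reshape((2, 4)).T
print(table.astype(int))
# [[0 0]
#  [0 1]
#  [1 0]
#  [1 1]]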

def binary_codes_9(n: int) -> np.ndarray:
    validate(n)
    places = 1 << np.arange(n)
    return (
        np.concatenate(
            [
                # One column per bit, most significant first: a tile of `a` zeros
                # followed by `a` ones, repeated `b` times (a = 2**(n-1), b = 1 for
                # the first column, down to a = 1, b = 2**(n-1) for the last).
                np.tile(
                    np.concatenate([np.zeros(a, dtype=bool), np.ones(a, dtype=bool)]), b
                )
                for a, b in zip(places[::-1], places)
            ]
        )
        .reshape((n, 1 << n))
        .T
    )

In [43]: %timeit binary_codes_9(16)
910 μs ± 26.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

As you can see, it actually beats the np.unpackbits based solution.
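
As a sanity check of my own (assuming the functions above are defined in the session), both new binary-code functions agree with a straightforward bit-extraction reference:

import numpy as np

n = 6
reference = ((np.arange(1 << n)[:, None] >> np.arange(n - 1, -1, -1)) & 1).astype(bool)
assert np.array_equal(binary_codes_8(n), reference)
assert np.array_equal(binary_codes_9(n), reference)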

gray_codes_4 implements the same idea as binary_codes_8, so it is also inefficient:

def gray_codes_4(n: int) -> np.ndarray:
    dtype = get_dtype(n)
    result = np.zeros(((length := 1 << n), n), dtype=bool)
    width = 2
    start = 1
    for i in range(n - 1, 0, -1):
        result[:, i][
            np.arange(start, length, (step := width << 1), dtype=dtype)[:, None]
            + np.arange(width)
        ] = 1
        width = step
        start <<= 1

    result[:, 0][length >> 1 :] = 1

    return result

In [44]: %timeit gray_codes_4(16)
2.52 ms ± 161 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So I tried to apply the idea behind binary_codes_9 to gray_codes_5:

def gray_codes_5(n: int) -> np.ndarray:
    m = n - 1
    half = 1 << m
    column = np.arange(1 << n, dtype=get_dtype(n))
    offsets = (1 << np.arange(m - 1, -1, -1, dtype=np.uint8)).tolist()
    return (
        np.concatenate(
            [
                np.concatenate([np.zeros(half, dtype=bool), np.ones(half, dtype=bool)]),
                *(~((column - a) >> b) & 1 for a, b in zip(offsets, range(m, 0, -1))),
            ]
        )
        .reshape((n, 1 << n))
        .T
    )

In [45]: %timeit gray_codes_5(16)
3.67 ms ± 60.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It is somehow slower; the idea is sound, it's just that the way each column is generated is inefficient.

So I tried again, and this time it even beats binary_codes_9:

def gray_codes_6(n: int) -> np.ndarray:
    validate(n)
    if n == 1:
        return np.array([(0,), (1,)], dtype=bool)

    half = 1 << (n - 1)
    places = (1 << np.arange(n - 1, -1, -1)).tolist()
    return (
        np.concatenate(
            [
                # Most significant column: first half zeros, second half ones.
                np.zeros(half, dtype=bool),
                np.ones(half, dtype=bool),
                # Every other column is a tile of b zeros, a ones, b zeros
                # (with b = a // 2), repeated 2**i times.
                *(
                    np.tile(
                        np.concatenate(
                            [
                                np.zeros(b, dtype=bool),
                                np.ones(a, dtype=bool),
                                np.zeros(b, dtype=bool),
                            ]
                        ),
                        1 << i,
                    )
                    for i, (a, b) in enumerate(zip(places, places[1:]))
                ),
            ]
        )
        .reshape((n, 1 << n))
        .T
    )

In [46]: %timeit gray_codes_6(16)
759 μs ± 19.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
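
Another sanity check of my own (assuming, as with the earlier gray_codes_* functions, that the output is the standard reflected Gray code in MSB-first order): the result should match unpacking i ^ (i >> 1) into bits.

import numpy as np

n = 6
codes = np.arange(1 << n)
gray = codes ^ (codes >> 1)  # standard reflected binary Gray code
reference = ((gray[:, None] >> np.arange(n - 1, -1, -1)) & 1).astype(bool)
assert np.array_equal(gray_codes_6(n), reference)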

This is the benchmark of all the functions on my IPython interpreter:

In [7]: %timeit binary_codes_0(16)
1.62 ms ± 58.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit binary_codes_1(16)
829 μs ± 9.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit binary_codes_2(16)
1.9 ms ± 67.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [10]: %timeit binary_codes_3(16)
1.55 ms ± 9.63 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [11]: %timeit binary_codes_4(16)
1.66 ms ± 40.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [12]: %timeit binary_codes_5(16)
1.8 ms ± 54.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [13]: %timeit binary_codes_6(16)
1.11 ms ± 22.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [14]: %timeit binary_codes_7(16)
7.01 ms ± 46.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [15]: %timeit binary_codes_8(16)
2.5 ms ± 57.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [16]: %timeit binary_codes_9(16)
887 μs ± 5.43 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [17]: %timeit gray_codes_0(16)
1.65 ms ± 11.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [18]: %timeit gray_codes_1(16)
1.79 ms ± 9.98 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [19]: %timeit gray_codes_2(16)
3.97 ms ± 28.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [20]: %timeit gray_codes_3(16)
1.9 ms ± 49.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [21]: %timeit gray_codes_4(16)
2.38 ms ± 33.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: %timeit gray_codes_5(16)
3.95 ms ± 19.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [23]: %timeit gray_codes_6(16)
718 μs ± 4.91 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [24]: %timeit JR_gray_codes_1(16)
1.42 ms ± 10.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [25]: %timeit gray_codes_nocomment(16)
1.03 ms ± 4.88 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [26]: %timeit JR_gray_codes_2()
809 μs ± 12.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

As you can see, my solutions beat all functions from the other answers.

But how do they scale with larger inputs? To find out, I reused code from my last answer:

import colorsys
import matplotlib.pyplot as plt
import numpy as np
import timeit
from math import ceil
from scipy.interpolate import make_interp_spline

def measure_command(func, *args, **kwargs):
    elapsed = timeit.timeit(lambda: func(*args, **kwargs), number=5)
    once = elapsed / 5
    if elapsed >= 1:
        return int(1e9 * once + 0.5)

    times = min(1024, ceil(1 / once))
    return int(
        1e9 * timeit.timeit(lambda: func(*args, **kwargs), number=times) / times + 0.5
    )


def test(func):
    return [measure_command(func, i) for i in range(1, 21)]


bin6 = binary_codes_9(6)
gra6 = gray_codes_6(6)
execution_times = {}
to_test = [
    *((f"binary_codes_{i}", bin6) for i in range(10)),
    *((f"gray_codes_{i}", gra6) for i in range(7)),
    ("JR_gray_codes_1", gra6),
    ("gray_codes_nocomment", gra6),
]
for func_name, correct in to_test:
    func = globals()[func_name]
    assert np.array_equal(func(6), correct)
    execution_times[func_name] = test(func)


cost_per_item = {
    k: [e / (1 << i) for i, e in enumerate(v, start=1)]
    for k, v in execution_times.items()
}

largest_execution = sorted(
    [(k, v[-1]) for k, v in execution_times.items()], key=lambda x: x[1]
)
average_execution = sorted(
    [(k, sum(v) / 20) for k, v in cost_per_item.items()], key=lambda x: x[1]
)
columns = [
    sorted([v[i] for v in execution_times.values()], reverse=True) for i in range(20)
]
overall_performance = [
    (k, [columns[i].index(e) for i, e in enumerate(v)])
    for k, v in execution_times.items()
]
overall_execution = sorted(
    [(a, sum(b) / 36) for a, b in overall_performance], key=lambda x: -x[1]
)

The raw timing data is in execution_times_1.json.
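
A minimal sketch of how such a dump could be written from the execution_times dict above (my assumption; the original post only links the file):

import json

with open("execution_times_1.json", "w") as f:
    json.dump(execution_times, f, indent=4)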

In [14]: average_execution
Out[14]:
[('binary_codes_6', 206.31875624656678),
 ('gray_codes_nocomment', 292.46834869384764),
 ('binary_codes_5', 326.2715059280396),
 ('binary_codes_4', 425.4694920539856),
 ('gray_codes_1', 432.8871788024902),
 ('gray_codes_4', 440.75454263687135),
 ('binary_codes_3', 486.1538872718811),
 ('JR_gray_codes_1', 505.06243762969973),
 ('gray_codes_0', 518.2342618465424),
 ('gray_codes_3', 560.9797175884247),
 ('binary_codes_2', 585.7214835166931),
 ('binary_codes_8', 593.8069385528564),
 ('gray_codes_5', 1012.6498884677887),
 ('gray_codes_6', 1102.2803171157836),
 ('binary_codes_0', 1102.6035027503967),
 ('gray_codes_2', 1152.2696633338928),
 ('binary_codes_1', 1207.228157234192),
 ('binary_codes_7', 1289.8271428585053),
 ('binary_codes_9', 1667.4837736606598)]

In [15]: [(a, b / 1e9) for a, b in largest_execution]
Out[15]:
[('gray_codes_6', 0.01664672),
 ('binary_codes_9', 0.017008553),
 ('gray_codes_nocomment', 0.0200915),
 ('binary_codes_1', 0.025319672),
 ('binary_codes_6', 0.027002585),
 ('gray_codes_3', 0.038912479),
 ('binary_codes_2', 0.045456482),
 ('JR_gray_codes_1', 0.053382224),
 ('binary_codes_3', 0.054410716),
 ('binary_codes_4', 0.0555085),
 ('gray_codes_0', 0.058621065),
 ('binary_codes_0', 0.0718396),
 ('binary_codes_5', 0.084661),
 ('binary_codes_8', 0.085592108),
 ('gray_codes_1', 0.088250008),
 ('gray_codes_4', 0.091165908),
 ('gray_codes_2', 0.093191058),
 ('binary_codes_7', 0.104509167),
 ('gray_codes_5', 0.146622829)]

In [16]: overall_execution
Out[16]:
[('binary_codes_6', 9.055555555555555),
 ('gray_codes_nocomment', 8.88888888888889),
 ('JR_gray_codes_1', 7.5),
 ('binary_codes_5', 6.972222222222222),
 ('binary_codes_4', 6.527777777777778),
 ('gray_codes_1', 6.0),
 ('binary_codes_3', 5.916666666666667),
 ('gray_codes_0', 5.805555555555555),
 ('gray_codes_3', 4.888888888888889),
 ('gray_codes_6', 4.555555555555555),
 ('gray_codes_4', 4.25),
 ('binary_codes_1', 4.083333333333333),
 ('binary_codes_2', 4.0),
 ('binary_codes_9', 3.7222222222222223),
 ('binary_codes_8', 3.6944444444444446),
 ('binary_codes_0', 3.361111111111111),
 ('gray_codes_2', 2.611111111111111),
 ('gray_codes_5', 2.2777777777777777),
 ('binary_codes_7', 0.8888888888888888)]

This is puzzling: clearly my new functions gray_codes_6 and binary_codes_9 perform the best for the largest input (n = 20, which means the output has 1048576 rows), but according to my metrics they somehow score poorly...

Just to sanity check:

In [17]: %timeit binary_codes_1(16)
908 μs ± 24.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [18]: %timeit binary_codes_6(16)
1.12 ms ± 8.18 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [19]: %timeit binary_codes_9(16)
925 μs ± 12.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [20]: %timeit binary_codes_1(20)
17.3 ms ± 205 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: %timeit binary_codes_6(20)
28.6 ms ± 233 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [22]: %timeit binary_codes_9(20)
23 ms ± 753 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [23]: %timeit gray_codes_6(16)
854 μs ± 23.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [24]: %timeit JR_gray_codes_1(16)
1.69 ms ± 34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [25]: %timeit gray_codes_nocomment(16)
1.11 ms ± 31 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [26]: %timeit gray_codes_6(20)
15.9 ms ± 959 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [27]: %timeit gray_codes_nocomment(20)
20.5 ms ± 640 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [28]: %timeit JR_gray_codes_1(20)
48.5 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Indeed, they performed better for larger inputs.

So I tried to graph the performance. The graphs are rather busy, with a lot going on, but I noticed something:

ymax = max(max(v) for v in execution_times.values())

fig, ax = plt.subplots()
ax.axis([0, 20, 0, ymax])
ax.set_xticks(range(1, 21))
ax.set_xticklabels([f"{1<<i}" for i in range(1, 21)])

for k, v in execution_times.items():
    x = np.linspace(1, 20, 256)
    plt.plot(x, make_interp_spline(range(1, 21), v)(x), label=k)

plt.legend()
plt.show()


fig, ax = plt.subplots()
ax.axis([0, 20, 0, 19])
ax.set_xticks(range(1, 21))
ax.set_yticks(range(20))
ax.set_xticklabels([f"{1<<i}" for i in range(1, 21)])


for k, v in overall_performance:
    plt.plot(range(1, 21), v, label=k)

plt.legend()
plt.show()

[Plots of the benchmark results (execution times and relative rankings) across input sizes]

My smart functions have flatter curves than the inefficient ones: they are slower for small inputs, but the inefficient ones grow much faster as the input gets larger.

In cost_per_item, to get a sensible average over the execution times for different inputs, I divided the execution time for each input by the corresponding output size, giving some average number... But this is wrong: the functions don't scale linearly, and the bigger the input, the harder it is to finish the execution quickly.

And in overall_execution the scoring is also wrong: what we care about is how a function scales to ever larger inputs, so a function's performance on smaller inputs should carry less weight:

In [35]: overall_execution1 = overall_execution = sorted(
    ...:     [(a, sum(c << i for i, c in enumerate(b))) for a, b in overall_performance], key=lambda x: -x[1]
    ...: )

In [36]: overall_execution1
Out[36]:
[('gray_codes_6', 18462984),
 ('binary_codes_9', 17880641),
 ('gray_codes_nocomment', 16642805),
 ('binary_codes_1', 15627986),
 ('binary_codes_6', 14995433),
 ('gray_codes_3', 13092872),
 ('binary_codes_3', 11226678),
 ('binary_codes_2', 11189287),
 ('JR_gray_codes_1', 10861503),
 ('binary_codes_4', 9194493),
 ('binary_codes_0', 9184718),
 ('gray_codes_0', 8130547),
 ('binary_codes_5', 6373604),
 ('binary_codes_8', 5019700),
 ('gray_codes_1', 4305913),
 ('gray_codes_4', 3413060),
 ('gray_codes_2', 2567962),
 ('binary_codes_7', 925733),
 ('gray_codes_5', 210406)]

In [37]: average_execution1 = sorted(
    ...:     [(k, sum(v) / 20) for k, v in execution_times.items()], key=lambda x: x[1]
    ...: )

In [38]: [(a, round(b / 1e9, 6)) for a, b in average_execution1]
Out[38]:
[('binary_codes_9', 0.001746),
 ('gray_codes_6', 0.001789),
 ('gray_codes_nocomment', 0.001967),
 ('binary_codes_1', 0.002462),
 ('binary_codes_6', 0.002567),
 ('gray_codes_3', 0.003661),
 ('binary_codes_2', 0.004313),
 ('binary_codes_3', 0.004558),
 ('JR_gray_codes_1', 0.004716),
 ('binary_codes_4', 0.005205),
 ('gray_codes_0', 0.005425),
 ('binary_codes_0', 0.005544),
 ('binary_codes_5', 0.007468),
 ('binary_codes_8', 0.007723),
 ('gray_codes_1', 0.007831),
 ('gray_codes_4', 0.008079),
 ('gray_codes_2', 0.0086),
 ('binary_codes_7', 0.009952),
 ('gray_codes_5', 0.013721)]

According to this new metric, it is undeniable that my binary_codes_9 and gray_codes_6 are much faster than the solutions provided by the other answers.

And I selected the fastest 9 functions to graph as proof:

[Plots of the nine fastest functions: execution times and per-input rankings]

In the last plot, the vertical axis shows how many functions are beaten by the function corresponding to each curve.

My functions aren't beaten, but since the code from nocomment beats my original solutions I will accept that answer.
