
Packing the matrices A and B is indeed necessary.

For a short outline, see the IBM Redbook covering PowerPC: https://www.redbooks.ibm.com/abstracts/redp5612.html (page 35). PowerPC has a blocked matrix-multiply instruction similar to those of VNNI and Arm Neon.

I have written such a packing function inside my matrix-multiply code. The packing did not hurt the throughput of the code.
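To illustrate what packing means here, the following is a minimal sketch and not the code mentioned above. It assumes a row-major `float` matrix B of size K x N and copies it into column panels of width `nr`, so that a micro-kernel can read `nr` contiguous values per k step. The names `pack_b`, `nr` and `ldb` are hypothetical, and the exact layout a real kernel needs depends on the target instruction (the integer dot-product instructions, for example, expect several k values grouped together).

```c
#include <assert.h>
#include <stddef.h>

/* Pack a row-major K x N matrix B into column panels of width nr.
 * Panel p holds columns [p*nr, p*nr + nr). Inside a panel the data is
 * stored k-major: nr values for k = 0, then nr values for k = 1, ...
 * Columns beyond N in the last panel are zero-filled.
 * `packed` must have room for K * ceil(N / nr) * nr floats.
 * (Illustrative layout only; names and element type are assumptions.) */
static void pack_b(const float *b, size_t K, size_t N, size_t ldb,
                   size_t nr, float *packed)
{
    assert(nr > 0 && ldb >= N);
    for (size_t j0 = 0; j0 < N; j0 += nr) {        /* one panel per step */
        for (size_t k = 0; k < K; ++k) {           /* rows of B */
            for (size_t j = 0; j < nr; ++j) {      /* columns in panel */
                size_t col = j0 + j;
                *packed++ = (col < N) ? b[k * ldb + col] : 0.0f;
            }
        }
    }
}
```

For a 2 x 3 matrix with rows (1, 2, 3) and (4, 5, 6) and `nr = 2`, this produces two panels: 1, 2, 4, 5 followed by 3, 0, 6, 0. Matrix A would be packed analogously into row panels. Because the copy touches each element once while the multiply reuses it many times, the packing cost is usually small next to the compute.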
