Packing the matrices A and B is indeed necessary.
For a short outline, see the PowerPC documentation (an IBM Redbook): https://www.redbooks.ibm.com/abstracts/redp5612.html (page 35). PowerPC has a blocked matrix-multiply instruction similar to the ones in VNNI and Arm Neon.
I have written such a packing function inside my matrix multiply code. The packing did not hurt the throughput of the code.
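To illustrate what packing means here, below is a minimal sketch in plain C. It is not the code mentioned above and not taken from the Redbook. The tile sizes `MR` and `NR`, the function names `pack_a`, `pack_b` and `gemm_packed`, and the row-major `float` layout are all assumptions chosen for the example. The idea is only that each panel is rearranged so that the inner kernel reads both operands with unit stride, and that edge tiles are zero-padded so the kernel never needs a special case.

```c
#include <stddef.h>

/* Hypothetical micro-tile size; a real kernel would pick these to match
   the register layout of the target matrix-multiply instruction. */
enum { MR = 4, NR = 4 };

/* Pack row-major A (M x K, leading dimension lda) into panels of MR rows.
   Inside a panel, the MR values belonging to one k are contiguous.
   Rows past M are filled with zeros. Ap must hold ceil(M/MR)*MR*K floats. */
static void pack_a(const float *A, size_t M, size_t K, size_t lda, float *Ap)
{
    for (size_t i0 = 0; i0 < M; i0 += MR)
        for (size_t k = 0; k < K; ++k)
            for (size_t i = 0; i < MR; ++i)
                *Ap++ = (i0 + i < M) ? A[(i0 + i) * lda + k] : 0.0f;
}

/* Pack row-major B (K x N, leading dimension ldb) into panels of NR columns.
   Inside a panel, the NR values belonging to one k are contiguous.
   Columns past N are filled with zeros. Bp must hold ceil(N/NR)*NR*K floats. */
static void pack_b(const float *B, size_t K, size_t N, size_t ldb, float *Bp)
{
    for (size_t j0 = 0; j0 < N; j0 += NR)
        for (size_t k = 0; k < K; ++k)
            for (size_t j = 0; j < NR; ++j)
                *Bp++ = (j0 + j < N) ? B[k * ldb + j0 + j] : 0.0f;
}

/* Scalar stand-in for the micro-kernel: computes C = A * B from the packed
   panels. The k loop walks both panels with unit stride, which is the access
   pattern a blocked matrix-multiply instruction would consume. */
static void gemm_packed(const float *Ap, const float *Bp, float *C,
                        size_t M, size_t N, size_t K, size_t ldc)
{
    for (size_t i0 = 0; i0 < M; i0 += MR) {
        const float *a_panel = Ap + (i0 / MR) * MR * K;
        for (size_t j0 = 0; j0 < N; j0 += NR) {
            const float *b_panel = Bp + (j0 / NR) * NR * K;
            float acc[MR][NR] = {{0.0f}};

            /* Rank-1 update of the MR x NR tile for every k. */
            for (size_t k = 0; k < K; ++k)
                for (size_t i = 0; i < MR; ++i)
                    for (size_t j = 0; j < NR; ++j)
                        acc[i][j] += a_panel[k * MR + i] * b_panel[k * NR + j];

            /* Write back only the part of the tile that lies inside C. */
            for (size_t i = 0; i < MR && i0 + i < M; ++i)
                for (size_t j = 0; j < NR && j0 + j < N; ++j)
                    C[(i0 + i) * ldc + j0 + j] = acc[i][j];
        }
    }
}
```

Packing touches each element of A and B once, while the multiply touches each of them many times, which is one plausible reason the packing cost tends to disappear in the overall throughput for reasonably large matrices.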