np.dot(arr1, arr2.T) seems fast enough; np.tensordot(arr1, arr2, axes=(1,1)) is equivalent (see the NumPy tensordot documentation). If it is faster than your for loop for your array sizes, why use the loop version at all?
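To check the equivalence yourself, here is a minimal sketch. The array shapes and the explicit double loop are assumptions for illustration, since I don't know exactly what your loop version looks like. It only assumes that arr1 and arr2 are 2-D with the same number of columns.

```python
import numpy as np

# Hypothetical example data: two 2-D arrays sharing their second dimension.
rng = np.random.default_rng(0)
arr1 = rng.standard_normal((4, 3))
arr2 = rng.standard_normal((5, 3))

# Matrix product of arr1 with the transpose of arr2, shape (4, 5).
via_dot = np.dot(arr1, arr2.T)

# Same contraction: sum over axis 1 of arr1 and axis 1 of arr2.
via_tensordot = np.tensordot(arr1, arr2, axes=(1, 1))

# Assumed stand-in for a loop version: one row-by-row dot product per entry.
via_loop = np.empty((arr1.shape[0], arr2.shape[0]))
for i in range(arr1.shape[0]):
    for j in range(arr2.shape[0]):
        via_loop[i, j] = np.sum(arr1[i] * arr2[j])

# All three should agree up to floating-point rounding.
same_as_tensordot = np.allclose(via_dot, via_tensordot)
same_as_loop = np.allclose(via_dot, via_loop)
print(same_as_tensordot, same_as_loop)
```

If you want to compare speed, time each variant on arrays of your real sizes (for example with timeit), since the relative cost depends on the shapes.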