I just wrote a package to do this: https://github.com/biona001/sweepystats
Internally, the sweeps are dispatched to level-3 BLAS calls, so performance should be close to optimal.
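
For readers who have not met the operation, here is a minimal NumPy sketch of the sweep operator on a symmetric matrix. It is not the package's code or API: the function names `sweep` and `sweep_block` are made up for illustration, and the sign convention (sweeping every index yields the negative inverse) is an assumption that may differ from the package's. The block version shows why the work can be phrased as matrix-matrix products, which is the kind of operation level-3 BLAS handles.

```python
import numpy as np


def sweep(A, k):
    """Sweep symmetric matrix A on its k-th diagonal entry (returns a copy).

    Convention assumed here: the pivot becomes -1/A[k, k], so sweeping
    on every index produces -inv(A).
    """
    A = np.array(A, dtype=float)  # work on a copy
    d = A[k, k]
    if d == 0.0:
        raise ValueError("cannot sweep on a zero pivot")
    col = A[:, k].copy()
    A -= np.outer(col, col) / d   # A_ij - A_ik * A_kj / A_kk
    A[:, k] = col / d             # A_ik / A_kk
    A[k, :] = col / d             # A_kj / A_kk (same values, by symmetry)
    A[k, k] = -1.0 / d
    return A


def sweep_block(A, idx):
    """Sweep symmetric A on all indices in idx at once (returns a copy).

    Equivalent to sweeping on each index in turn, but expressed with
    matrix-matrix operations on the blocks of A.
    """
    A = np.array(A, dtype=float)
    idx = np.asarray(idx)
    rest = np.setdiff1d(np.arange(A.shape[0]), idx)
    A11 = A[np.ix_(idx, idx)]
    A12 = A[np.ix_(idx, rest)]
    A22 = A[np.ix_(rest, rest)]
    X = np.linalg.solve(A11, A12)              # inv(A11) @ A12
    out = np.empty_like(A)
    out[np.ix_(idx, idx)] = -np.linalg.inv(A11)
    out[np.ix_(idx, rest)] = X
    out[np.ix_(rest, idx)] = X.T
    out[np.ix_(rest, rest)] = A22 - A12.T @ X  # Schur complement
    return out
```

Sweeping a positive definite matrix on every index with `sweep` should agree with `-np.linalg.inv(A)` up to rounding, and `sweep_block(A, [0, 1])` should agree with `sweep(sweep(A, 0), 1)`.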