Credit goes to user @paleonix for spotting this.
The line
A_s[threadIdx.y * tile_size + threadIdx.x] = a_d[row*n + tile*tile_size + threadIdx.x];
should be changed to
A_s[threadIdx.y * tile_size + threadIdx.x] = a_d[row*k + tile*tile_size + threadIdx.x];
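This is because the global-memory matrix A was indexed with the wrong row stride when it was loaded into the shared-memory tile A_s. Assuming A is stored row-major with k columns (which is what the fix implies), the row with index row starts at offset row*k, not row*n. With the row*n stride, the wrong elements of A are loaded into A_s whenever n != k, resulting in an incorrect partial sum.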
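For context, here is a minimal sketch of a tiled kernel with the corrected stride. It is not the original code from the question. It assumes C (m x n) = A (m x k) * B (k x n) with all matrices row-major, and it uses a compile-time TILE in place of the runtime tile_size. The names matmul_tiled, max_abs_error, B_s, b_d and c_d are my own placeholders.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile size (stands in for tile_size)

// Assumed layout: C (m x n) = A (m x k) * B (k x n), all row-major.
__global__ void matmul_tiled(const float* a_d, const float* b_d, float* c_d,
                             int m, int n, int k) {
    __shared__ float A_s[TILE * TILE];
    __shared__ float B_s[TILE * TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    int num_tiles = (k + TILE - 1) / TILE;
    for (int tile = 0; tile < num_tiles; ++tile) {
        int a_col = tile * TILE + threadIdx.x;
        int b_row = tile * TILE + threadIdx.y;
        // A has k columns, so its row stride is k (not n).
        A_s[threadIdx.y * TILE + threadIdx.x] =
            (row < m && a_col < k) ? a_d[row * k + a_col] : 0.0f;
        // B has n columns, so its row stride is n.
        B_s[threadIdx.y * TILE + threadIdx.x] =
            (b_row < k && col < n) ? b_d[b_row * n + col] : 0.0f;
        __syncthreads();

        // Accumulate this tile's contribution to the partial sum.
        for (int i = 0; i < TILE; ++i)
            sum += A_s[threadIdx.y * TILE + i] * B_s[i * TILE + threadIdx.x];
        __syncthreads();
    }
    if (row < m && col < n) c_d[row * n + col] = sum;
}

// Hypothetical helper: runs the kernel on small integer-valued inputs and
// returns the largest absolute difference from a CPU reference.
float max_abs_error(int m, int n, int k) {
    std::vector<float> a(m * k), b(k * n), c(m * n, 0.0f), ref(m * n, 0.0f);
    for (int i = 0; i < m * k; ++i) a[i] = float(i % 7);
    for (int i = 0; i < k * n; ++i) b[i] = float(i % 5);
    for (int r = 0; r < m; ++r)
        for (int cc = 0; cc < n; ++cc)
            for (int i = 0; i < k; ++i)
                ref[r * n + cc] += a[r * k + i] * b[i * n + cc];

    float *a_d, *b_d, *c_d;
    cudaMalloc(&a_d, a.size() * sizeof(float));
    cudaMalloc(&b_d, b.size() * sizeof(float));
    cudaMalloc(&c_d, c.size() * sizeof(float));
    cudaMemcpy(a_d, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b.data(), b.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (m + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(a_d, b_d, c_d, m, n, k);
    cudaMemcpy(c.data(), c_d, c.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);

    float err = 0.0f;
    for (int i = 0; i < m * n; ++i) err = std::fmax(err, std::fabs(c[i] - ref[i]));
    return err;
}

int main() {
    // A non-square shape (n != k) is where a row*n stride on A goes wrong.
    printf("max abs error: %g\n", max_abs_error(33, 20, 50));
    return 0;
}
```

Testing with m, n and k all different is worthwhile, because with square matrices row*n and row*k are the same value and the bug stays hidden.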