As mentioned in the comments, your matrix is way too big to fit in RAM.
You should look into mechanisms for out-of-core computation, which are designed for exactly this.
Classical ones include:
Blocking: instead of allocating the entire matrix, have each thread allocate one row, or whatever block size you choose so that block_size * num_cores is less than your RAM. On completion, each thread writes its block to disk and frees the memory. This is easy if the rows can be distributed independently, but gets more difficult if every thread needs access to the whole matrix at the same time.
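A minimal sketch of the blocking idea, assuming a hypothetical per-row computation (`compute_row`) and made-up sizes; each thread keeps only one block in memory, spills it to its own file, and reuses the buffer:

    /* Blocking sketch: peak memory is roughly BLOCK_ROWS * N * 8 bytes per thread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N           100000L   /* full matrix is N x N doubles: too big for RAM */
    #define BLOCK_ROWS  256       /* rows held in memory per thread at any time    */
    #define NUM_THREADS 4

    static void compute_row(double *row, long i) {   /* placeholder computation */
        for (long j = 0; j < N; j++)
            row[j] = (double)i * j;
    }

    static void *worker(void *arg) {
        long t = (long)arg;
        char name[64];
        snprintf(name, sizeof name, "block_thread_%ld.bin", t);
        FILE *out = fopen(name, "wb");
        if (!out) return NULL;

        double *block = malloc((size_t)BLOCK_ROWS * N * sizeof *block);
        if (!block) { fclose(out); return NULL; }

        /* thread t handles rows [t*BLOCK_ROWS, ...) in interleaved chunks */
        for (long start = t * BLOCK_ROWS; start < N; start += NUM_THREADS * BLOCK_ROWS) {
            long rows = (start + BLOCK_ROWS <= N) ? BLOCK_ROWS : N - start;
            for (long r = 0; r < rows; r++)
                compute_row(block + r * N, start + r);
            fwrite(block, sizeof *block, (size_t)rows * N, out);  /* spill to disk */
        }
        free(block);
        fclose(out);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }

Because each thread writes to its own file and its own rows, no locking is needed; you only pay for it later if a consumer has to stitch the blocks back together in row order.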
The second is memory mapping (mmap) the file. This is a Linux mechanism that gives you something that looks like a block of memory which you can treat as your array; when you write to a[i][j], the OS caches the page for a while and flushes it to disk when memory fills up. This approach can be complicated: the mmapped file is shared, so you need synchronization (locks) to make sure threads don't step on each other when reading and writing the same regions. With that, the algorithm should definitely work, but if your access patterns are very random and you are on a hard drive, this is a well-known way to tank your performance, since HDDs don't handle random I/O well.
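A minimal mmap sketch, assuming a 64-bit Linux box and illustrative sizes (the file name and fill loop are placeholders, not from your code); the matrix lives in a file and the kernel pages it in and out on demand:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define N 100000L   /* N x N doubles ~ 80 GB, far more than RAM */

    int main(void) {
        size_t bytes = (size_t)N * N * sizeof(double);

        int fd = open("matrix.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, (off_t)bytes) != 0) { perror("ftruncate"); return 1; }

        /* Looks like an ordinary array, but dirty pages are written back
         * to the file by the kernel when memory runs low. */
        double *a = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (a == MAP_FAILED) { perror("mmap"); return 1; }

        for (long i = 0; i < N; i++)
            for (long j = 0; j < N; j++)
                a[i * N + j] = (double)i * j;   /* plain writes, no explicit I/O */

        /* Threads writing disjoint rows need no locks; overlapping accesses
         * would need the usual synchronization. */
        msync(a, bytes, MS_SYNC);
        munmap(a, bytes);
        close(fd);
        return 0;
    }

The nice part is that the indexing code stays exactly the same as the in-memory version; only the allocation changes from malloc to open + ftruncate + mmap.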
Other methods exist for this with different tradeoffs or requirements, and they would not require any change to the algorithm.