Reports

I can't comment, so I'm leaving this as an answer.

If I use a single GPU, then it's fine. Below shows a dummy script that results in NaNs after a few steps.

I think this might be due to your batch size. Try increasing it, since a larger batch should make your loss more stable. Also, what batch size did you use for the single-GPU training?

https://www.tensorflow.org/tutorials/distribute/keras#set_up_the_input_pipeline

If you check the link above, you can see the line of code below, which scales the global batch size by the number of replicas.

BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
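
To make the arithmetic concrete, here is a minimal sketch in plain Python. It assumes the global batch is split evenly across replicas. The helper names and the numbers (64 examples per replica, 2 GPUs) are made up for illustration; in TensorFlow the replica count would come from strategy.num_replicas_in_sync rather than a constant.

```python
def global_batch_size(batch_size_per_replica: int, num_replicas: int) -> int:
    """Batch size to give the input pipeline so that each replica
    still processes `batch_size_per_replica` examples per step."""
    if batch_size_per_replica <= 0 or num_replicas <= 0:
        raise ValueError("batch size and replica count must be positive")
    return batch_size_per_replica * num_replicas


def per_replica_batch_size(global_batch: int, num_replicas: int) -> int:
    """Examples each replica sees per step, assuming an even split."""
    return global_batch // num_replicas


# Hypothetical values, not taken from the question.
BATCH_SIZE_PER_REPLICA = 64
NUM_REPLICAS = 2  # stand-in for strategy.num_replicas_in_sync

# Same formula as the tutorial line above.
BATCH_SIZE = global_batch_size(BATCH_SIZE_PER_REPLICA, NUM_REPLICAS)
print(BATCH_SIZE)  # 128

# Reusing the single-GPU batch size unchanged halves what each GPU sees.
print(per_replica_batch_size(64, NUM_REPLICAS))  # 32
```

So if you kept the single-GPU batch size when moving to multiple GPUs, each replica may be training on a much smaller batch than before, which is one possible source of the instability.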

Hope this helps.

Reasons:

Whitelisted phrase (-1): hope this helps
RegEx Blacklisted phrase (1): can't comment
RegEx Blacklisted phrase (1): check the link
No code block (0.5):
Contains question mark (0.5):
Low reputation (1):

Posted by: sccs

79296317