79296317

Date: 2024-12-20 06:01:18
Score: 3
Natty:
Report link

I can't comment, so I'm leaving this as an answer.

If I use a single GPU, then it's fine. Below shows a dummy script that results in NaNs after a few steps.

I think this might be due to your batch size. Try increasing it, since a larger batch averages the gradient over more examples and should make your loss more stable. Also, what batch size did you use for the single-GPU training?

https://www.tensorflow.org/tutorials/distribute/keras#set_up_the_input_pipeline

If you check the link above, you can see the following line of code, which scales the batch size by the number of replicas (GPUs):

BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
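To make the arithmetic behind that line concrete, here is a minimal sketch in plain Python. It assumes the batch given to the input pipeline is a global batch that gets split evenly across replicas, as in the linked tutorial. The helper names and the numbers (64 examples, 4 GPUs) are hypothetical and only for illustration; in a real script the replica count would come from strategy.num_replicas_in_sync.

```python
def global_batch_size(per_replica_batch_size: int, num_replicas: int) -> int:
    """Batch size to give the input pipeline so each replica keeps its own batch size."""
    if per_replica_batch_size <= 0 or num_replicas <= 0:
        raise ValueError("both arguments must be positive")
    return per_replica_batch_size * num_replicas


def examples_per_replica(global_batch: int, num_replicas: int) -> int:
    """Number of examples each replica sees when a global batch is split evenly."""
    return global_batch // num_replicas


# Hypothetical numbers: 64 examples per step on one GPU, then a move to 4 GPUs.
SINGLE_GPU_BATCH = 64
NUM_REPLICAS = 4  # assumption; in TensorFlow: strategy.num_replicas_in_sync

# Leaving the batch size unchanged shrinks each replica's share of the batch.
print(examples_per_replica(SINGLE_GPU_BATCH, NUM_REPLICAS))  # 16

# Scaling by the replica count restores the single-GPU batch size per replica.
BATCH_SIZE = global_batch_size(SINGLE_GPU_BATCH, NUM_REPLICAS)
print(BATCH_SIZE)  # 256
print(examples_per_replica(BATCH_SIZE, NUM_REPLICAS))  # 64
```

If the multi-GPU run reused the single-GPU batch size without this scaling, each GPU would be training on a much smaller batch, which may be one reason the loss becomes unstable.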

Hope this helps.

Reasons:
  • Whitelisted phrase (-1): hope this helps
  • RegEx Blacklisted phrase (1): can't comment
  • RegEx Blacklisted phrase (1): check the link
  • No code block (0.5):
  • Contains question mark (0.5):
  • Low reputation (1):
Posted by: sccs