Hello i am new to AI stuff and also going through this example of using transformer block for time series classification.
Aside from the padding issue, may i ask why it use "channels_first" rather than "channels_last" in GlobalAveragePooling2D layer?
I have a 2D data like yours and reshape it to (batch, height, width, 1). "channel_first" give me a high accuracy to 9X% but not "channel_last".
The Keras example use 1D data, using "channel_last" but also result in a poor accuracy. But according to the definition "channels_last" should be correct