I think you might be mixing up a few concepts.
When working with Convolutional Neural Networks (CNNs), the goal is typically to learn the features and train the model directly from the images. Unlike traditional models, where you hand-engineer predefined columns or features, a CNN learns its features implicitly from the images themselves.
Regarding your question "whether to freeze the model or to extract until Dense(128...)": it's quite common to place one or more dense layers between the convolutional/max-pooling blocks and the final classification output. Feeding the flattened convolutional output straight into the output layer forces the classifier to work on raw spatial feature maps, which is inefficient; dense layers consolidate and interpret the extracted features before the final prediction is made.
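For illustration, here's a minimal Keras sketch of that layout. The input size (32x32 RGB) and the number of classes (10) are just placeholders you'd swap for your own data:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),          # e.g. 32x32 RGB images (placeholder)
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),                        # raw spatial feature maps
    layers.Dense(128, activation="relu"),    # consolidates features before the output
    layers.Dense(10, activation="softmax"),  # e.g. 10 classes (placeholder)
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```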
As for "whether to use a number of folds or if that's only for machine learning," — CNNs can also benefit from techniques like K-Fold cross-validation. In my case, I usually start with a small number of epochs and use something like a 5x10 K-Fold setup to get an initial idea of model performance. Based on that, I adjust hyperparameters and experiment with different fold sizes.
Also, if training from scratch is too slow because your dataset is large, or you simply don't have much labelled data, transfer learning is worth trying: you reuse a pre-trained convolutional base (usually with its weights frozen, which is what "freezing the model" refers to) and train only the new dense head. It can save a lot of time and resources.
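A sketch of what that looks like in Keras; MobileNetV2 is just one possible base, and the input size and class count are again placeholders:

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                        # freeze the pre-trained convolutional base

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),     # new trainable head
    layers.Dense(10, activation="softmax"),   # placeholder class count
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```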
Regarding dropout layers, they are definitely useful and commonly used to prevent overfitting by randomly deactivating neurons during training. However, if dropout is too aggressive (e.g., using very high dropout rates), it can prevent the network from learning effectively. So again, it comes down to experimentation — finding the right balance is key.
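In practice that usually means a moderate rate (often somewhere around 0.2-0.5, though it's data-dependent) placed after the dense layers, e.g.:

```python
from tensorflow.keras import layers

# Dropout sketch: a moderate rate after the dense layer; very high rates
# (e.g. 0.8) often keep the network from learning at all.
head = [
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                  # start moderate, tune from there
    layers.Dense(10, activation="softmax"),
]
```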
In the end, there’s no single “right” way to do this — trial and error, careful tuning, and adapting to your specific dataset and task is what usually leads to the best results.