79594328

Date: 2025-04-26 19:46:36
Score: 0.5
Natty:

This is a common pitfall in machine learning workflows!

Here’s the correct order you should follow:

  1. First split your dataset into training and testing sets.

  2. Then fit the StandardScaler only on the training set (i.e., use scaler.fit(X_train)).

  3. After fitting, transform both the training and testing data (X_train_scaled = scaler.transform(X_train) and X_test_scaled = scaler.transform(X_test)).


Why?
If you scale the full dataset before splitting, information from the test set "leaks" into the training process: the mean and standard deviation used for scaling are computed from all the data, test set included. The reported test score then looks better than it would on genuinely unseen data, so it is no longer a reliable estimate of how the model generalizes.

Scaling after splitting keeps the test data truly unseen, so it behaves like genuinely new data at evaluation time.
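
If you want to see the leak concretely, here is a minimal sketch with synthetic toy data (the numbers are illustrative, not from your dataset) comparing the scaler's learned statistics with and without the test rows:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 100 samples, 1 feature
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

leaky = StandardScaler().fit(X)        # statistics include the test rows
clean = StandardScaler().fit(X_train)  # statistics from training rows only

print(leaky.mean_, leaky.scale_)
print(clean.mean_, clean.scale_)       # differs slightly -- that gap is the leak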


Quick fix for your case:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# First split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then scale: fit on the training set only...
scaler = StandardScaler()
scaler.fit(X_train)  # learns mean and std from X_train alone

# ...and apply those training statistics to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
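
Once this pattern grows (cross-validation, more preprocessing steps), it is easy to reintroduce the leak by hand. A scikit-learn Pipeline does the fit-on-train-only bookkeeping for you; here is a sketch, with LogisticRegression and cv=5 standing in for whatever model and CV scheme you are actually using:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Each CV fold refits the scaler on that fold's training rows only,
# so test-fold statistics never reach the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())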


Reasons:
  • Blacklisted phrase (0.5): Why?
  • Long answer (-1):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Low reputation (1):
Posted by: kreator kreator