The problem is not XGBoost itself; it is how the data is being represented. By one-hot encoding every email address, you have turned each unique email into its own column, which is why your model now expects 1000 inputs. That approach also doesn't generalize: the model is just memorizing specific emails instead of learning patterns.
If the label is truly tied to individual emails (e.g. abc@gmail → high, xyz@yahoo → low), then you don't need ML at all; a lookup table or dictionary is enough. A model will never be able to guess the label for an unseen email in that case.
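For illustration, a plain dictionary with a default already does everything a model could do in that situation (the addresses and labels below are made up):

```python
# Hypothetical lookup table: known email -> label
email_labels = {
    "abc@gmail.com": "high",
    "xyz@yahoo.com": "low",
}

def get_label(email: str, default: str = "unknown") -> str:
    # An unseen email just falls back to the default; no model can do
    # better when the label is tied to the exact address.
    return email_labels.get(email, default)

print(get_label("abc@gmail.com"))    # high
print(get_label("new@outlook.com"))  # unknown
```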
If you want ML to work, you need to extract features from the email that can generalize. For example, use the domain (gmail.com, yahoo.com), the top-level domain (.com, .org), or simple stats about the username (length, numbers, special characters, etc.). That way you only have a few numeric features, and your model input is small and stable.
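A minimal sketch of that kind of feature extraction (the function name and the exact feature set are just illustrative):

```python
import re

def email_features(email: str) -> dict:
    """Turn one email address into a small, fixed set of features."""
    username, _, domain = email.partition("@")
    tld = domain.rsplit(".", 1)[-1] if "." in domain else ""
    return {
        "domain": domain,                                      # e.g. gmail.com
        "tld": tld,                                            # e.g. com
        "username_length": len(username),
        "digit_count": sum(c.isdigit() for c in username),
        "special_char_count": len(re.findall(r"[._\-+]", username)),
    }

print(email_features("john.doe99@gmail.com"))
# {'domain': 'gmail.com', 'tld': 'com', 'username_length': 10,
#  'digit_count': 2, 'special_char_count': 1}
```

The domain and TLD are still categorical, but they have only a handful of common values, so encoding them stays small and stable no matter how many distinct emails you see.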
Another option is to use techniques like hashing (a fixed-size numeric representation) or target encoding instead of one-hot encoding. And when you deploy, make sure your API applies the same preprocessing, so the client can send a raw email string and the server converts it into the right features before calling the model.
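If you go the hashing route, something like scikit-learn's FeatureHasher keeps the input width fixed regardless of how many distinct emails or domains appear; the sketch below reuses the hypothetical `email_features` helper from above:

```python
from sklearn.feature_extraction import FeatureHasher

# 32 hashed columns, no matter how many unique values show up later.
hasher = FeatureHasher(n_features=32, input_type="dict")

emails = ["john.doe99@gmail.com", "xyz@yahoo.com"]
X = hasher.transform(email_features(e) for e in emails)  # scipy sparse matrix
print(X.shape)  # (2, 32)
```

At serving time, the same `email_features` call plus `hasher.transform` runs inside the API handler before the model is invoked, so training and deployment always see identically shaped inputs.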