79566850

Date: 2025-04-10 13:59:12
Score: 0.5

At every node the model (1) takes all available features (p in your notation) and randomly draws a subset of m of them (m in your notation). Then (2) it finds the best threshold for each of those m features using an impurity criterion such as Gini or entropy, and (3) chooses the feature and threshold that give the best split, that is, the one where the samples inside each of the two resulting subsets are most alike. It does this in exactly this order at every node.
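The three steps can be sketched in plain Python. This is only an illustration under simple assumptions (Gini impurity, midpoints between sorted values as candidate thresholds, toy data where only feature 2 is informative). The names `gini` and `best_split_at_node` are made up for this example and are not any library's API.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_at_node(X, y, m, rng):
    """Return (weighted impurity, feature index, threshold) of the best split
    found among m randomly drawn features. Hypothetical helper, for illustration."""
    n, p = len(X), len(X[0])
    # Step 1: draw m of the p features without replacement.
    candidates = rng.sample(range(p), m)
    best = None
    for j in candidates:
        values = sorted({row[j] for row in X})
        # Step 2: try the midpoint between each pair of neighbouring values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            # Step 3: keep the split with the lowest weighted impurity.
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

# Toy data (an assumption for this sketch): 5 features, only feature 2 matters.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [int(row[2] > 0) for row in X]

result = best_split_at_node(X, y, m=5, rng=rng)  # m = p: every feature is a candidate
print(result)
```

With m = p the informative feature is always among the candidates. With a smaller m it is found only at nodes where it happens to be drawn.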

Basically, there are 3 ways to set max_features: all features, only 1 at a time, or something in between. That value is m. What is the difference?

  1. When selecting all features (max_features=None; in recent scikit-learn versions this is the default for RandomForestRegressor, while RandomForestClassifier defaults to "sqrt"), the model has the widest possible set of features on which to perform step (2) and from which to choose the best one in step (3). This is a common approach, and unless you have a giant dataset, heavy tuning, or something else that forces you to be more computationally efficient, this is the best bet.

  2. Choosing 1 feature basically kills the power of the algorithm: there is no set of features to choose the best split from, so the concept of a best split across features does not apply. The model can only split on the one feature randomly drawn at step (1), although it still picks the best threshold for that feature. Performance here comes from averaging over the randomness of that feature selection and of bootstrapping. This is still a way to go if the task is relatively simple, most of the features are heavily multicollinear, and computational efficiency is your priority.

  3. The middle ground (1 < m < p) is for when you want to gain some speed on heavy computations but don't want to kill all the magic of choosing the feature with the best threshold for the split.
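To see the three settings side by side, here is a minimal sketch using scikit-learn. The synthetic dataset from `make_classification`, the number of trees, and the 5-fold cross-validation are all assumptions chosen for illustration. Scores on your own data may rank the settings differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for this sketch): 20 features, 5 informative.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

scores = {}
# None -> all features at every split, "sqrt" -> in between, 1 -> a single feature.
for max_features in (None, "sqrt", 1):
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=0
    )
    # Mean accuracy over 5 cross-validation folds.
    scores[max_features] = cross_val_score(forest, X, y, cv=5).mean()
    print(max_features, round(scores[max_features], 3))
```

Timing each fit (for example with `time.perf_counter`) would also show the computational side of the trade-off.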

So I would like to emphasize that the randomness of the forest always comes from bootstrapping of the data and, whenever max_features is below p, from the random selection of candidate features at each node. Lowering max_features just adds that extra step of randomness, mostly to gain computational performance, though it sometimes also helps with regularization.

Reasons:
  • Long answer (-1):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Starts with a question (0.5): What
  • Low reputation (1):
Posted by: Razguliaev Nikita