
The Machine Learning “Advent Calendar” Day 7: Decision Tree Classifier

How a decision tree classifier picks its first split

Decision Tree Classifier in Excel – image by author

In the previous article, we explored how a Decision Tree Regressor chooses its optimal split by minimizing the Mean Squared Error (MSE).

Today for Day 7 of the Machine Learning “Advent Calendar”, we continue the same approach but with a Decision Tree Classifier, the classification counterpart of yesterday’s model.

Quick intuition experiment with two simple datasets

Let us begin with a very small toy dataset that I generated, with one numerical feature and one target variable with two classes: 0 and 1.

The idea is to cut the dataset into two parts, based on one rule. But the question is: what should this rule be? What is the criterion that tells us which split is better?

Now, even if we do not know the mathematics yet, we can already look at the data and guess possible split points.

And visually, it would be 8 or 12, right?

But the question is which one is more suitable numerically.

Decision Tree Classifier in Excel – image by author

If we think intuitively:

  • With a split at 8:
    • left side: no misclassification
    • right side: one misclassification
  • With a split at 12:
    • right side: no misclassification
    • left side: two misclassifications

So clearly, the split at 8 feels better.

Now, let us look at an example with three classes. I added some more random data, and made 3 classes.

Here I label them 0, 1, 3, and I plot them vertically.

But we must be careful: these numbers are just category names, not numeric values. They should not be interpreted as “ordered”.

So the intuition is always: How homogeneous is each region after the split?

But it is harder to visually determine the best split.

Now, we need a mathematical way to express this idea.

This is exactly the topic of the next chapter.

Impurity measure as the criterion of split

In the Decision Tree Regressor, we already know:

  • The prediction for a region is the average of the target.
  • The quality of a split is measured by MSE.

In the Decision Tree Classifier:

  • The prediction for a region is the majority class of the region.
  • The quality of a split is measured by an impurity measure: Gini impurity or Entropy.

Both are standard in textbooks, and both are available in scikit-learn. Gini is used by default.

BUT, what is this impurity measure, really?

If you look at the curves of Gini and Entropy, they both behave the same way:

  • They are 0 when the node is pure (all samples have the same class).
  • They reach their maximum when the classes are evenly mixed (50 percent / 50 percent).
  • The curve is smooth, symmetric, and increases with disorder.

This is the essential property of any impurity measure:

Impurity is low when groups are clean, and high when groups are mixed.

Decision Tree Classifier in Excel – gini and entropy – image by author

So we will use these measures to decide which split to create.
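The three properties listed above can be checked numerically. The sketch below assumes a two-class node described only by p, the share of class 1; the function names `gini_binary` and `entropy_binary` are illustrative and do not come from the Excel file.

```python
import math

def gini_binary(p):
    """Gini impurity of a two-class node, where p is the share of class 1."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy_binary(p):
    """Entropy (base 2) of a two-class node, where p is the share of class 1."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

# Both measures are 0 at p = 0 and p = 1, and largest at p = 0.5.
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p={p:.2f}  gini={gini_binary(p):.3f}  entropy={entropy_binary(p):.3f}")
```

Note that the two curves do not share the same scale: with two classes, Gini peaks at 0.5 while base-2 entropy peaks at 1.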

Split with One Continuous Feature

Just like for the Decision Tree Regressor, we will follow the same structure.

List of all possible splits

Exactly like the regressor version, with one numerical feature, the only splits we need to test are the midpoints between consecutive sorted x values.
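As a minimal sketch, the list of candidate splits can be built like this. The feature values are hypothetical, and duplicates are removed first so that no midpoint falls exactly on a data point.

```python
def candidate_splits(x_values):
    """Midpoints between consecutive distinct sorted feature values."""
    xs = sorted(set(x_values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

# Hypothetical feature values, not the dataset from the Excel file.
x = [1, 2, 3, 5, 6, 8, 9]
print(candidate_splits(x))
```

With n distinct values there are n - 1 candidates, which is why the Excel table stays small for a toy dataset.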

For each split, compute impurity on each side

Let us take a split value, for example, x = 5.5.

We separate the dataset into two regions:

  • Region L: x < 5.5
  • Region R: x ≥ 5.5

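For each region:

  1. We count the total number of observations, and the number in each class
  2. We compute Gini impurity

Finally, we combine both regions into the weighted impurity of the split, where each side is weighted by its share of the observations.

Decision Tree Classifier in Excel – image by author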
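The same computation for the split x = 5.5 can be sketched in a few lines. The data below are a hypothetical toy example, not the values shown in the Excel screenshot, and `split_impurity` is an illustrative helper name.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty region
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(x, y, threshold):
    """Return (gini_left, gini_right, weighted_gini) for the rule x < threshold."""
    left = [label for value, label in zip(x, y) if value < threshold]
    right = [label for value, label in zip(x, y) if value >= threshold]
    n = len(y)
    g_left, g_right = gini(left), gini(right)
    # Each side contributes in proportion to the number of points it holds.
    weighted = len(left) / n * g_left + len(right) / n * g_right
    return g_left, g_right, weighted

# Hypothetical toy data: one numerical feature, two classes.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

g_left, g_right, weighted = split_impurity(x, y, 5.5)
print(f"left={g_left:.3f}  right={g_right:.3f}  weighted={weighted:.3f}")
```

On this toy data each side holds five points with a 4-to-1 class mix, so both regions have the same Gini impurity and the weighted value equals it.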

Select the split with the lowest impurity

Like in the regressor case:

  • List all possible splits
  • Compute impurity for each
  • The optimal split is the one with the minimum impurity

Decision Tree Classifier in Excel – image by author

Synthetic Table of All Splits

To make everything automatic in Excel,
we organize all calculations in one table, where:

  • each row corresponds to one candidate split,
  • for each row, we compute:
    • Gini of the left region,
    • Gini of the right region,
    • and the overall weighted Gini of the split.

This table gives a clean, compact overview of every possible split,
and the best split is simply the one with the lowest value in the final column.

Decision Tree Classifier in Excel – image by author
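For readers who want to cross-check their spreadsheet, here is a minimal Python sketch of the same table: one row per candidate split, with the Gini of each side and the weighted Gini. The dataset is hypothetical and the tuple layout of each row is just one possible choice.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_table(x, y):
    """One row per candidate split: (threshold, gini_left, gini_right, weighted)."""
    xs = sorted(set(x))
    n = len(y)
    rows = []
    for a, b in zip(xs, xs[1:]):
        threshold = (a + b) / 2  # midpoint between consecutive values
        left = [label for value, label in zip(x, y) if value < threshold]
        right = [label for value, label in zip(x, y) if value >= threshold]
        g_left, g_right = gini(left), gini(right)
        weighted = len(left) / n * g_left + len(right) / n * g_right
        rows.append((threshold, g_left, g_right, weighted))
    return rows

# Hypothetical toy data, not the dataset from the Excel file.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

rows = split_table(x, y)
for threshold, g_left, g_right, weighted in rows:
    print(f"{threshold:>4}  {g_left:.3f}  {g_right:.3f}  {weighted:.3f}")

# The best split is the row with the lowest value in the last column.
best = min(rows, key=lambda row: row[3])
print("best split:", best[0])
```

Here `min` plays the role of scanning the final column of the Excel table for its smallest value.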

Multi-class classification

Until now, we worked with two classes. But the Gini impurity extends naturally to three classes, and the logic of the split stays exactly the same.

Nothing changes in the structure of the algorithm:

  • we list all possible splits,
  • we compute impurity on each side,
  • we take the weighted average,
  • we select the split with the lowest impurity.

Only the formula of the Gini impurity becomes slightly longer.

Gini impurity with three classes

If a region contains proportions p1, p2, p3 for the three classes, then the Gini impurity is:

$$G = 1 - (p_1^2 + p_2^2 + p_3^2)$$

The same idea as before:
a region is “pure” when it contains a single class,
and the impurity grows as the classes become more evenly mixed.

Left and Right regions

For each split:

  • Region L contains some observations of classes 1, 2, and 3
  • Region R contains the remaining observations

For each region:

  1. count how many points belong to each class
  2. compute the proportions p1, p2, p3
  3. compute the Gini impurity using the formula above

Everything is exactly the same as in the binary case, just with one more term.

Summary Table for 3-class splits

Just like before, we collect all computations in one table:

  • each row is one possible split
  • we count class 1, class 2, class 3 on the left
  • we count class 1, class 2, class 3 on the right
  • we compute Gini (Left), Gini (Right), and the weighted Gini

The split with the smallest weighted impurity is the one selected by the decision tree.

Decision Tree Classifier in Excel – image by author
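A possible Python counterpart of this three-class table is sketched below. It keeps the class counts on each side, as in the spreadsheet. The data and the helper names (`gini_from_counts`, `three_class_table`) are hypothetical, and the labels 1, 2, 3 are treated purely as category names.

```python
from collections import Counter

CLASSES = (1, 2, 3)  # category names only, no numeric meaning

def gini_from_counts(counts):
    """Gini impurity computed from a list of per-class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def three_class_table(x, y):
    """Rows of (threshold, left_counts, right_counts, g_left, g_right, weighted)."""
    xs = sorted(set(x))
    n = len(y)
    rows = []
    for a, b in zip(xs, xs[1:]):
        threshold = (a + b) / 2
        left = Counter(label for value, label in zip(x, y) if value < threshold)
        right = Counter(label for value, label in zip(x, y) if value >= threshold)
        left_counts = [left[c] for c in CLASSES]    # a missing class counts as 0
        right_counts = [right[c] for c in CLASSES]
        g_left = gini_from_counts(left_counts)
        g_right = gini_from_counts(right_counts)
        weighted = sum(left_counts) / n * g_left + sum(right_counts) / n * g_right
        rows.append((threshold, left_counts, right_counts, g_left, g_right, weighted))
    return rows

# Hypothetical toy data with three classes.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, 1, 2, 2, 3, 3, 3]

rows = three_class_table(x, y)
for row in rows:
    print(row[0], row[1], row[2], f"{row[5]:.3f}")

best = min(rows, key=lambda row: row[5])
print("best split:", best[0])
```

Working from counts rather than raw labels mirrors the spreadsheet layout, where each Gini cell is a formula over the count columns.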

We can easily generalize the algorithm to K classes, using the following formulas to calculate Gini or Entropy:

Decision Tree Classifier in Excel – image by author
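For reference, the standard K-class definitions can be written as follows, with p_k the proportion of class k in the region. The base-2 logarithm is an assumption here; another base only rescales entropy by a constant factor and does not change which split is selected.

```latex
G = 1 - \sum_{k=1}^{K} p_k^2
\qquad
H = -\sum_{k=1}^{K} p_k \log_2 p_k \quad (\text{with the convention } 0 \log_2 0 = 0)
```

With K = 3, the Gini expression reduces to 1 - (p_1^2 + p_2^2 + p_3^2).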

How Different Are Impurity Measures, Really?

Now, we always mention Gini or Entropy as the criterion, but do they really differ? Looking at the mathematical formulas, one might think so.

The answer is: not that much.

In almost all practical situations:

  • Gini and Entropy choose the same split, or a very close one
  • The tree structure is almost identical
  • The predictions are nearly the same

Why?

Because their curves look extremely similar.

They both peak at 50 percent mixing and drop to zero at purity.

The only difference is the shape of the curve:

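  • Gini is a quadratic function of the class proportions.
  • Entropy is a logarithmic function. Once both curves are rescaled to the same peak, entropy sits slightly above Gini everywhere between a pure node and a 50/50 mix, so it reacts a bit more strongly to small amounts of mixing.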

But in practice the difference is tiny, and you can check it yourself in Excel!
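The same check can be sketched in Python by running the split search once with each criterion. The dataset is a hypothetical toy example, so this illustrates the claim on one case rather than proving it in general.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(x, y, impurity):
    """Return (threshold, weighted impurity) of the best midpoint split."""
    xs = sorted(set(x))
    n = len(y)
    best_threshold, best_score = None, float("inf")
    for a, b in zip(xs, xs[1:]):
        threshold = (a + b) / 2
        left = [label for value, label in zip(x, y) if value < threshold]
        right = [label for value, label in zip(x, y) if value >= threshold]
        score = len(left) / n * impurity(left) + len(right) / n * impurity(right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical toy data, not the dataset from the Excel file.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

gini_threshold, gini_score = best_split(x, y, gini)
entropy_threshold, entropy_score = best_split(x, y, entropy)
print("Gini picks:", gini_threshold)
print("Entropy picks:", entropy_threshold)
```

Passing the impurity function as an argument keeps the search loop identical for both criteria, which matches the point of this section: only the scoring formula changes.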

Other impurity measures?

Another natural question: is it possible to invent/use other measures?

Yes, you could invent your own function, as long as:

  • It is 0 when the node is pure
  • It is maximal when classes are mixed
  • It is smooth and strictly increasing in “disorder”

For example: Impurity = 4*p0*p1

This is another valid impurity measure. And it is actually equal to Gini multiplied by a constant when there are only two classes.

So again, it gives the same splits. If you are not convinced, you can check it yourself in Excel.
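The proportionality is easy to verify. With two classes, p0 + p1 = 1, so Gini = 1 - p0² - p1² = 2*p0*p1, and therefore 4*p0*p1 is exactly twice the Gini impurity. A short numerical sketch, with illustrative function names:

```python
def gini_two_classes(p0):
    """Gini impurity of a two-class node, where p0 is the share of class 0."""
    p1 = 1.0 - p0
    return 1.0 - (p0 ** 2 + p1 ** 2)

def scaled_product(p0):
    """The alternative measure 4 * p0 * p1 from the text."""
    p1 = 1.0 - p0
    return 4.0 * p0 * p1

# The ratio is constant (2), so both measures rank splits identically.
for p0 in (0.0, 0.1, 0.3, 0.5, 0.8, 1.0):
    print(f"p0={p0:.1f}  4*p0*p1={scaled_product(p0):.3f}  2*gini={2 * gini_two_classes(p0):.3f}")
```

Multiplying every weighted impurity by the same positive constant cannot change which row of the table is the smallest.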

Here are some other measures that can also be used.

Decision Tree Classifier in Excel – many impurity measures – image by author

Exercises in Excel

Tests with other parameters and features

Once you build the first split, you can extend your file:

  • Try Entropy instead of Gini
  • Try adding categorical features
  • Try building the next split
  • Try changing max depth and observe under- and over-fitting
  • Try creating a confusion matrix for predictions

These simple tests already give you a good intuition for how real decision trees behave.

Implementations of the rules for Titanic Survival Dataset

A natural follow-up exercise is to recreate decision rules for the famous Titanic Survival Dataset (CC0 / Public Domain).

First, we can start with only two features: sex and age.

Implementing the rules in Excel is long and a bit tedious, but this is exactly the point: it makes you realize what decision rules really look like.

They are nothing more than a sequence of IF / ELSE statements, repeated again and again.

This is the true nature of a decision tree: simple rules, stacked on top of each other.

Decision Tree Classifier in Excel for Titanic Survival Dataset (CC0 / Public Domain) – image by author
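To make the IF / ELSE structure concrete, here is a deliberately small sketch in Python. The rules and the age threshold are hypothetical placeholders chosen for illustration; they are not fitted on the Titanic data and are not the rules from the Excel file.

```python
def predict_survived(sex, age):
    """Toy decision rules on sex and age; returns 1 (survived) or 0 (did not).

    The structure (nested IF / ELSE) is the point here. The threshold of 10
    years is a hypothetical value, not one learned from the data.
    """
    if sex == "female":
        return 1
    else:
        if age < 10:  # hypothetical threshold
            return 1
        else:
            return 0

# Each passenger follows exactly one path through the nested conditions.
passengers = [("female", 29), ("male", 4), ("male", 35)]
for sex, age in passengers:
    print(sex, age, "->", predict_survived(sex, age))
```

In Excel, the same logic would be a nested IF formula, and each additional level of depth adds one more layer of nesting.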

Conclusion

Implementing a Decision Tree Classifier in Excel is surprisingly accessible.

With a few formulas, you uncover the heart of the algorithm:

  • list possible splits
  • compute impurity
  • choose the cleanest split

Decision Tree Classifier in Excel – image by author

This simple mechanism is the foundation of more advanced ensemble models like Gradient Boosted Trees, which we will discuss later in this series.

And stay tuned for Day 8 tomorrow!

All the Excel files are available through this Kofi link. Your support means a lot to me. The price will increase during the month, so early supporters get the best value.

All Excel/Google sheet files for ML and DL
