- Decision trees model predictions via recursive splits chosen to minimize impurity, using measures like Gini, entropy or variance.
- Information Gain guides the choice of feature and threshold at each node, allowing trees to handle both regression and classification.
- Hyperparameters such as max_depth, min_samples_split and min_information_gain control overfitting and tree complexity.
- Understanding single-tree mechanics is essential before moving to ensembles like random forests that stabilize and boost performance.

Decision tree regression from scratch is one of the most eye‑opening exercises you can do if you want to really understand how tree‑based models think and why they’re so popular in machine learning. Instead of treating the tree as a mysterious black box, you’ll see how each split is chosen, how impurity is measured and how predictions are produced at the leaves, for both regression and classification problems.
In this guide we’ll walk through the core ideas behind decision trees, the cost functions they use, how they search for the best splits, and how to code a basic tree that supports both regression and classification, using only fundamental concepts like loops, conditions and simple statistics. Along the way we’ll compare regression vs. classification trees, connect the theory to practical implementations in tools such as Python and R (for example with rpart and tree), and briefly situate decision trees inside larger ensembles like random forests.
What is a decision tree and why is it so intuitive?
A decision tree is essentially a flow of yes/no questions (or simple rules) that guides you from a root decision down to a final prediction in a leaf node. In a typical supervised learning setting, the goal is to predict a target variable Y using multiple predictors (features, covariates), and the tree learns a sequence of questions such as “is weight ≤ 103?” or “is country in {US, UK, CA}?” that gradually partitions the data into more homogeneous groups.
To get some intuition, imagine you want to predict whether someone is obese using only height and weight, and you have a labeled dataset telling you who is obese and who is not. A tree might discover a rule like “if weight > 100 kg, predict obese”, but that rule will not be perfect: some people above 100 kg won’t be obese, and some below that threshold will be. The tree then keeps adding more questions (sub‑splits), for example on height or a refined weight threshold, to “fine‑tune” those initial rough predictions.
Each internal node in the tree corresponds to a decision rule, each branch corresponds to one outcome of that rule, and each leaf node corresponds to a region of the feature space where the predictions are constant. In classification, a leaf returns a class label (or probability distribution over labels); in regression, a leaf typically returns the mean of the target values that fall into that region.
One of the main strengths of decision trees is that they handle both regression and classification naturally, they are easy to interpret, and they work with both quantitative and qualitative (categorical) predictors without needing heavy preprocessing. You don’t need to assume any specific distribution for your features or your target, which makes trees very attractive in real‑world scenarios where classical linear assumptions are often violated.
Classification vs. regression trees
Although the structure of classification and regression trees is the same, the nature of the response variable Y and the cost function used for splitting differ between these two types. When Y is quantitative (for example, sales, life expectancy, fuel consumption), we talk about a regression tree; when Y is qualitative or categorical (for example, survived vs. not survived, obese vs. not obese), we talk about a classification tree.
In a regression tree, the usual objective is to partition the feature space into regions where the response can be approximated by a constant, often the mean of the observations in that region. Typical decision rules have the form “is x_k ≤ c?”, where x_k is one of the covariates and c is a threshold; these rules repeatedly split the space into hyper‑rectangles, and all points in the same hyper‑rectangle share the same predicted value ŷ.
In a classification tree, the splits are still “feature ≤ threshold?” or “category in set S?”, but the quality of a split is measured by how pure the resulting child nodes are in terms of class labels. The leaf prediction is usually the majority class inside that node, and the model tries to create leaves that are as close as possible to containing only a single class.
Despite these differences in the target type, from a coding perspective you can implement a single generic tree structure and simply plug in different impurity or loss measures depending on whether you are doing regression or classification. Later, when we compute Information Gain, you’ll see that the formulas for classification (based on entropy) and regression (based on variance) are parallel in spirit.
Impurity and cost functions in decision trees
At the heart of any decision tree algorithm lies a cost function that evaluates how good a particular split is at separating the data into meaningful groups. This cost function is expressed in terms of impurity: a node is considered pure if all its samples belong to the same class (for classification) or have nearly the same numeric value (for regression).
Whenever you select a candidate split on a feature, the algorithm looks at the child nodes it produces and asks: “how mixed are the labels (or values) in each child?” A good split is one that produces child nodes that are much less impure than the parent, which means that the data within each child are more homogeneous with respect to the target.
In classification trees, impurity is usually measured by criteria such as the Gini index or entropy, both of which capture how likely it is that a randomly chosen observation in that node would be misclassified if we simply predicted the majority class. In regression trees, impurity is commonly measured with squared error or variance, reflecting how spread out the target values are within the node.
Gini index: measuring impurity in classification trees
The Gini index is one of the most commonly used impurity measures for classification trees because it’s simple to compute and works well in practice. Conceptually, it measures the probability that a randomly selected observation from the node would be incorrectly classified if its label were predicted according to the label distribution in that node.
If a node contains classes with probabilities p_1, p_2, …, p_n, the Gini index is computed as Gini = 1 − Σ p_i². When a node is perfectly pure (all observations belong to the same class), one of the probabilities is 1 and the rest are 0, so the sum of squares is 1 and the Gini index is 0, indicating full purity.
On the other hand, the Gini index reaches its maximum when classes are evenly mixed inside the node, for example in a binary problem with p_1 = p_2 = 0.5, which gives Gini = 1 − (0.5² + 0.5²) = 0.5. In that situation, no single‑class prediction is better than a coin flip, because the node contains half of each class.
When you implement Gini in code, you typically take the label vector for the node, compute the frequency of each class, convert frequencies to probabilities, and then apply the formula 1 − Σ p². If you do this for multiple candidate splits, you can compare which split produces children with lower weighted average Gini impurity, which is exactly what the tree needs to decide on the best partition.
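To make this concrete, here is a minimal Python sketch of that computation; the function name gini_impurity and the NumPy usage are our own illustrative choices, not a fixed API:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

print(gini_impurity(["a", "a", "a", "a"]))   # 0.0 -> perfectly pure node
print(gini_impurity(["a", "b", "a", "b"]))   # 0.5 -> maximally mixed binary node
```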
Entropy: another view of classification impurity
Entropy is an alternative impurity measure widely used in information theory and in early tree algorithms like ID3 and C4.5, and it captures the amount of randomness or uncertainty in the node’s class distribution. While Gini focuses on misclassification probability, entropy quantifies the “surprise” associated with observing a particular class when the distribution is mixed.
Given class probabilities p_1, …, p_c for a node S, its entropy is defined as E(S) = − Σ p_i log₂(p_i). If the node is pure, one of the probabilities is 1 and all others are 0, which makes the sum zero (because log₂(1) = 0), so the entropy is 0, indicating no uncertainty.
When the node contains a uniform distribution of classes, entropy is maximized; for a binary problem with p1 = p2 = 0.5, the entropy is 1 bit, which is the highest possible value for two classes. This value corresponds to maximum uncertainty, meaning the node is as impure as it can be under that distribution.
Even though Gini and entropy use different formulas and have different numeric ranges (Gini between 0 and 0.5 for two classes, entropy between 0 and 1), both measure essentially the same concept, so they usually lead to very similar trees in practice. When you compute both on the same node, you’ll find that high Gini corresponds to high entropy and vice versa, which is why many libraries allow you to choose either without drastically changing performance.
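A matching sketch for entropy, under the same assumptions (a list or array of class labels; the name entropy is ours):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over the observed classes."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

print(entropy(["a", "a", "a", "a"]))   # 0.0 -> pure node (NumPy may print -0.0)
print(entropy(["a", "b", "a", "b"]))   # 1.0 -> maximum uncertainty for two classes
```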
Information Gain and choosing the best splits
To pick the best split among many candidates, the tree algorithm uses a metric called Information Gain, which measures how much impurity decreases when we split a node into its children. Intuitively, a split has high Information Gain if the children are much purer than the parent, meaning the rule successfully separated the data into more meaningful groups.
For classification trees using entropy, the Information Gain of a split is defined as IG_classification = E(S_parent) − Σ (|S_child| / |S_parent|) · E(S_child). You first compute the entropy of the parent node, then subtract the weighted average entropy of the child nodes, where the weights are their relative sizes.
For regression trees, an analogous concept uses variance or mean squared error as the impurity measure, giving IG_regression = Var(S_parent) − Σ (|S_child| / |S_parent|) · Var(S_child). In this setting, a good split is one that substantially reduces the variability of the target values inside each child.
The tree training algorithm evaluates this Information Gain for every possible candidate split on every feature, then chooses the split with the highest gain, provided it exceeds some minimum threshold to avoid creating useless, tiny improvements. This process is then repeated recursively on each child node until some stopping criteria are reached.
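The weighted-difference formula translates almost line for line into code. In this sketch, impurity can be any node-level measure: the entropy or gini_impurity helpers above for classification, or np.var for regression:

```python
import numpy as np

def information_gain(y_parent, y_left, y_right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(y_parent)
    weighted_children = ((len(y_left) / n) * impurity(y_left)
                         + (len(y_right) / n) * impurity(y_right))
    return impurity(y_parent) - weighted_children
```

For example, information_gain(y, y[mask], y[~mask], entropy) scores a classification split defined by a boolean mask, while passing np.var scores a regression split.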
How to search for the best split on each feature
Finding the best split on a single feature depends on whether the feature is numeric or categorical, but the underlying idea is always the same: enumerate candidate partitions and compute their Information Gain. For numeric features, a partition is defined by a threshold; for categorical features, it’s defined by grouping levels into subsets.
For a numeric predictor, the usual strategy is to look at all unique values that feature takes in the current node, sort them, and then consider candidate thresholds between consecutive values. For each candidate threshold c, you create two groups (x ≤ c and x > c), compute the impurity of each group, and then compute the Information Gain; the threshold that yields the highest gain is your best numeric split on that feature.
When dealing with categorical predictors, the search space is more complex because, in principle, any subset of categories could form one side of the split, with the complement on the other side. A feature with K categories admits 2^(K−1) − 1 non‑trivial binary partitions, so in practice implementations often restrict this search or use heuristics, especially when K is large.
Once you’ve computed the best split for each feature, you compare their Information Gains and select the feature and threshold (or category subset) corresponding to the maximum gain. This chosen split becomes the decision at the current node, and the training process then recurses on each child with the corresponding subset of observations.
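A rough version of the numeric threshold scan, assuming x and y are NumPy arrays and reusing the information_gain helper sketched above:

```python
import numpy as np

def best_numeric_split(x, y, impurity):
    """Try a threshold between each pair of consecutive unique values of x
    and return the (threshold, gain) pair with the highest Information Gain."""
    values = np.sort(np.unique(x))
    best_threshold, best_gain = None, -np.inf
    for lo, hi in zip(values[:-1], values[1:]):
        c = (lo + hi) / 2.0            # midpoint candidate threshold
        mask = x <= c                  # left child: x <= c, right child: x > c
        gain = information_gain(y, y[mask], y[~mask], impurity)
        if gain > best_gain:
            best_threshold, best_gain = c, gain
    return best_threshold, best_gain
```

Because candidate thresholds sit strictly between consecutive unique values, both children are always non-empty; a feature with a single unique value simply returns no split.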
Controlling tree growth with hyperparameters
If you allow a decision tree to grow without any constraints, it will keep splitting until every leaf is either perfectly pure or contains very few observations, which almost always leads to severe overfitting. To avoid this, you set a collection of hyperparameters that control the depth and complexity of the tree.
A common hyperparameter is max_depth, which caps the maximum number of levels the tree can grow from the root to any leaf. If max_depth is set to None (or a very large number), the tree can keep growing as long as other constraints are satisfied; if it’s small, the tree remains shallow and more interpretable but might underfit.
Another key hyperparameter is min_samples_split, which specifies the minimum number of observations a node must contain before it is allowed to be split. If a node has fewer samples than this threshold, it is turned into a leaf, preventing the model from chasing noise in very small subsets of data.
You can also enforce a minimum Information Gain (min_information_gain) so that the algorithm only performs a split if it produces a meaningful improvement in impurity reduction. This avoids creating unnecessary branches that hardly change the predictions and merely complicate the tree structure.
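As an illustrative sketch, these three checks often live in one small helper; the name should_stop and the default values below are ours, not recommendations:

```python
def should_stop(depth, n_samples, best_gain,
                max_depth=5, min_samples_split=20, min_information_gain=1e-4):
    """True when any stopping criterion says the current node should be a leaf.
    The default values are illustrative placeholders, not tuned settings."""
    return (depth >= max_depth
            or n_samples < min_samples_split
            or best_gain < min_information_gain)
```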
Building a decision tree from scratch in code
Implementing a decision tree from scratch usually revolves around a small set of core functions that are called recursively. While libraries like scikit‑learn or rpart do all this under the hood, coding these steps yourself makes the logic much clearer and gives you full control over the behavior.
First, you need a routine that, given the current data at a node, evaluates every feature and every candidate split to find the one with the highest Information Gain. This function returns the chosen feature, the split rule (threshold or subset of categories), the gain value, and the boolean mask or index sets that identify which samples go left and which go right.
Second, you need a prediction function for leaf nodes that converts the set of target values in that node into a single prediction. For regression, this is typically the mean of y in that node; for classification, you usually take the mode (most frequent class), possibly storing class probabilities as well if you want probabilistic outputs.
Third, you create a recursive training function that checks stopping criteria, searches for the best split if allowed, and then builds child nodes by calling itself on the left and right subsets. If the minimum sample size, maximum depth, or minimum gain conditions are not met, the function stops splitting and stores a leaf prediction instead of further branches.
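One possible shape for those pieces, reusing the best_numeric_split and should_stop sketches from earlier; the nested-dict tree representation and all names here are our own illustrative choices:

```python
import numpy as np
from collections import Counter

def leaf_prediction(y, task):
    """Mean of y for regression, most frequent class for classification."""
    if task == "regression":
        return float(np.mean(y))
    return Counter(y).most_common(1)[0][0]

def find_best_split(X, y, impurity):
    """Run best_numeric_split on every column of X and keep the overall winner."""
    best_feature, best_threshold, best_gain = None, None, -np.inf
    for j in range(X.shape[1]):
        c, gain = best_numeric_split(X[:, j], y, impurity)
        if c is not None and gain > best_gain:
            best_feature, best_threshold, best_gain = j, c, gain
    return best_feature, best_threshold, best_gain

def build_tree(X, y, impurity, task, depth=0):
    """Recursively grow the tree as nested dicts; leaves hold one constant prediction."""
    feature, threshold, gain = find_best_split(X, y, impurity)
    if feature is None or should_stop(depth, len(y), gain):
        return {"leaf": leaf_prediction(y, task)}
    mask = X[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left":  build_tree(X[mask], y[mask], impurity, task, depth + 1),
            "right": build_tree(X[~mask], y[~mask], impurity, task, depth + 1)}
```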
How prediction works in a trained decision tree
Once your tree has been trained and you’ve stored all the split rules and leaf predictions, making a prediction for a new observation is simply a matter of walking down the tree from the root to a leaf. At each internal node, you inspect the required feature and test whether the observation satisfies the node’s condition.
If the split rule is numeric, you check whether the feature value is less than or equal to the threshold; if the split rule is categorical, you check whether the category is in a particular subset. Depending on the outcome, you follow the appropriate branch (for example, “yes” to the left, “no” to the right) and repeat this process at the next node.
You continue descending the tree until you reach a node without children, which is a leaf that stores a constant output value or a class label. For a regression tree, the prediction will be a number like an estimated life expectancy or fuel efficiency; for a classification tree, the output will be a predicted category such as “survived” or “did not survive”.
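Assuming the nested-dict layout from the build_tree sketch above, the descent is a short loop:

```python
def predict_one(tree, x):
    """Walk from the root to a leaf, testing the stored numeric rule at each node."""
    node = tree
    while "leaf" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

# Predictions for a matrix of new observations:
# y_hat = [predict_one(tree, row) for row in X_new]
```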
If you test this approach on the same data you used for training, you’ll often see quite high accuracy for classification (for example, around 85% in some simple obesity or Titanic‑style examples), but that performance might drop on unseen data if your tree is too deep. This is precisely why controlling tree depth and size is so important, and why ensembles like random forests were invented to stabilize tree predictions.
Working with regression trees in practice
Regression trees are particularly handy when the relationship between predictors and the response is strongly nonlinear and involves interactions that are hard to model with classic linear regression. Instead of trying to fit a single global equation, the tree partitions the feature space into regions and fits a simple constant model within each region.
In R, popular packages such as rpart and tree make it easy to build regression trees with a single function call, specifying a formula like y ~ x1 + x2 + … + x11. These packages were influenced by the original CART methodology described by Breiman and colleagues, and they implement many of the splitting and pruning ideas that are now standard in tree‑based modeling.
For example, you can use the rpart package to model a response y based on eleven covariates x1 to x11, clean the data of missing values, and then visualize the resulting tree with helper functions like prp from the rpart.plot package. The terminal nodes show the predicted y for each region, which you can use directly for new observations.
Given a trained regression tree, you can feed new covariate values such as x9 = 70, x2 = 100 or x9 = 60, x2 = 150 into the predict function to obtain estimated values ŷ (for instance around 20 or 28 in a fuel consumption example). Comparing these predictions against observed values, for example through correlation between y and ŷ, gives you a quick sense of how well the tree is capturing the underlying pattern, even when the dataset is fairly small.
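The example above uses R’s rpart; as a rough Python counterpart (with made-up stand-in data, since the original dataset isn’t shown here), scikit-learn’s DecisionTreeRegressor follows the same recipe:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in data: 11 covariates and a numeric response,
# mirroring the shape of the rpart formula y ~ x1 + ... + x11.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))
y = 2.0 * X[:, 8] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = DecisionTreeRegressor(max_depth=3, min_samples_split=20).fit(X, y)
y_hat = model.predict(X)
print(np.corrcoef(y, y_hat)[0, 1])   # correlation between observed y and predicted ŷ
```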
From single trees to random forests
A single decision tree is powerful but also notoriously sensitive to the particularities of the training data, which can lead to high variance and overfitting. To mitigate this, random forests build many trees on bootstrapped samples of the data and aggregate their predictions, producing a more stable and usually more accurate model.
In a random forest, each tree is trained on a bootstrap sample, which means a new dataset of the same size is drawn from the original training set with replacement. This sampling process makes each tree see a slightly different dataset, so their errors are less correlated and can cancel out when aggregated.
Additionally, random forests introduce randomness in the feature selection process by considering only a random subset of predictors at each split rather than all predictors. This further reduces correlation between trees, enhances diversity in the forest, and tends to reduce variance without increasing bias too much.
The combination of bootstrapping and aggregation of predictions is known as bagging, and in random forests you also get an internal estimate of model error by evaluating each tree on the data points that were not included in its bootstrap sample (the so‑called out‑of‑bag observations). This out‑of‑bag error provides a convenient way to gauge performance without needing a separate validation set.
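A minimal bagging sketch along these lines, reusing the build_tree function from the earlier sketch; note that a full random forest would also subsample features at each split, which this version omits:

```python
import numpy as np

def bagged_forest(X, y, impurity, task, n_trees=25, seed=42):
    """Train each tree on a bootstrap sample and remember the rows it never saw."""
    rng = np.random.default_rng(seed)
    n = len(y)
    forest = []
    for _ in range(n_trees):
        in_bag = rng.integers(0, n, size=n)        # draw n rows with replacement
        oob = np.setdiff1d(np.arange(n), in_bag)   # ~37% of rows stay out-of-bag
        tree = build_tree(X[in_bag], y[in_bag], impurity, task)
        forest.append((tree, oob))
    return forest
```

Averaging predict_one over all trees gives the bagged regression prediction, and evaluating each tree only on its out-of-bag rows yields the internal error estimate described above.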
Although this article focuses on building a single tree from scratch, understanding how that basic component works makes it much easier to grasp how ensembles like random forests, gradient boosting and other tree‑based methods build on the same principles to achieve state‑of‑the‑art results in many applied problems.
Putting everything together, decision tree regression from scratch shows you how a simple set of rules, cost functions and recursive splits can model complex relationships, whether you are predicting a binary outcome such as survival, a categorical label like obesity status, or a numeric target such as life expectancy or fuel consumption. That deep understanding becomes a solid foundation for using more advanced tree‑based techniques in practice.