How to Plot a Decision Tree in R

Decision trees are powerful and widely used models in machine learning and data analysis. They allow us to predict outcomes by making decisions based on rules. One of the primary advantages of decision trees is their interpretability: the model’s decisions can be visualized and understood, making them an excellent tool for exploratory data analysis and communication of results.

This guide shows how to plot a decision tree in R using the rpart package (together with rpart.plot) and the party package. It also covers how to interpret the plotted tree and how to customize its appearance.

1. Building and Plotting a Decision Tree with rpart

The rpart (Recursive Partitioning and Regression Trees) package provides functions to build decision trees in R. It is a recommended package that ships with most standard R distributions, so it is usually already available. If it is missing from your installation, install it first:

install.packages("rpart")

Once installed, load the package:

library(rpart)

For this guide, we will use the built-in iris dataset to predict species based on sepal and petal measurements. Let’s fit a decision tree model:

# Fit a decision tree model
model <- rpart(Species ~ ., data = iris)

Here, rpart(Species ~ ., data = iris) fits a decision tree model with Species as the dependent variable and all other variables in the iris dataset as independent variables.
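As a quick sanity check before plotting, the fitted tree can be printed as text and its predictions compared with the known species. This is a minimal sketch: the names predicted, confusion, and accuracy are illustrative, and accuracy measured on the training data is optimistic.

```r
library(rpart)

# Fit the same classification tree as above
model <- rpart(Species ~ ., data = iris)

# Text summary of the splits, node sizes, and predicted classes
print(model)

# Predicted class for every training observation
predicted <- predict(model, newdata = iris, type = "class")

# Confusion table: rows are predictions, columns are the true species
confusion <- table(predicted = predicted, actual = iris$Species)
print(confusion)

# Share of training observations classified correctly
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```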

To plot this tree, we will also need the rpart.plot package, which provides advanced plotting capabilities for rpart trees:

install.packages("rpart.plot")
library(rpart.plot)

Now you can plot the decision tree:

# Plot the decision tree
rpart.plot(model)
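If rpart.plot is not available, rpart objects can also be drawn with base graphics through their own plot() and text() methods, although the result is plainer. The sketch below also writes the figure to a PDF. The file name out_file is an assumption chosen for illustration.

```r
library(rpart)

model <- rpart(Species ~ ., data = iris)

# Illustrative output path (a temporary file here)
out_file <- file.path(tempdir(), "iris_tree_base.pdf")

pdf(out_file, width = 7, height = 5)
# Draw the tree skeleton; the margin leaves room for the labels
plot(model, margin = 0.1)
# Add the split rules and, with use.n = TRUE, the class counts in each leaf
text(model, use.n = TRUE, cex = 0.8)
dev.off()
```

Wrapping the plotting calls between pdf() and dev.off() works the same way for rpart.plot() if you need to save that figure instead.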

2. Building and Plotting a Decision Tree with party

The party package is another powerful tool for building and plotting decision trees in R. Like rpart.plot, it has to be installed from CRAN before first use:

install.packages("party")
library(party)

To fit a decision tree model with the party package, use the ctree() function, which fits a conditional inference tree. Splits are chosen with statistical significance tests rather than an impurity measure:

# Fit a conditional inference tree (stored under its own name so the rpart model is not overwritten)
ctree_model <- ctree(Species ~ ., data = iris)

Plotting a decision tree with party is as simple as calling the plot() function:

# Plot the decision tree
plot(ctree_model)

In the party plot, internal nodes are drawn as ellipses labeled with the split variable and the p-value of the test that selected it. For a classification tree such as this one, each terminal node (also known as a leaf) is drawn as a panel containing a bar chart of the class proportions, headed by the number of observations in that node.
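To relate the plot to the underlying data, one option is to tabulate the observations by terminal node. This is a minimal sketch, assuming the party package is installed. The names ct, node_id, and node_counts are illustrative, and type = "simple" selects party's more compact plot style.

```r
library(party)

# Fit the conditional inference tree under an illustrative name
ct <- ctree(Species ~ ., data = iris)

# Compact alternative to the default plot: leaves show text summaries
plot(ct, type = "simple")

# Terminal node id assigned to each training observation
node_id <- where(ct)

# Species counts per terminal node
node_counts <- table(node = node_id, species = iris$Species)
print(node_counts)
```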

3. Interpreting a Plotted Decision Tree

Regardless of which package you used, interpreting the plotted decision tree involves understanding the structure of the tree and the information presented at each node.

  • Root Node: The topmost node, which applies to the entire dataset. It represents the first decision that splits the data based on the independent variable that provides the best separation.
  • Branches: Lines connecting the nodes, representing the outcome of a decision. For example, in a tree fitted to iris, a branch may represent the cases where the petal length is below a certain threshold.
  • Internal Nodes: Nodes that split into further branches. Each internal node represents a decision based on one of the independent variables.
  • Leaf Nodes: The terminal nodes at the end of the branches. Each leaf node represents a prediction and does not split further. In a classification tree, the predicted outcome is the most common class among the observations that fall into that leaf; in a regression tree, it is the mean of their response values.

The information presented at each node depends on the package and the type of tree (classification or regression), but it typically includes the decision rule, the number of observations, and the distribution of outcomes.
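For an rpart model, the same information is stored in the model's frame component, with one row per node. The sketch below pulls out the splitting variables and the leaves. The names frame, is_leaf, split_vars, and leaves are illustrative.

```r
library(rpart)

model <- rpart(Species ~ ., data = iris)

# One row per node; row names are the node numbers shown by print(model)
frame <- model$frame

# Leaves are marked with "<leaf>" in the var column
is_leaf <- frame$var == "<leaf>"

# Variables used by the root and internal nodes
split_vars <- unique(as.character(frame$var[!is_leaf]))
print(split_vars)

# For each leaf: number of observations and predicted class label
leaves <- data.frame(
  node = rownames(frame)[is_leaf],
  n = frame$n[is_leaf],
  predicted = attr(model, "ylevels")[frame$yval[is_leaf]]
)
print(leaves)
```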

4. Customizing the Decision Tree Plot

Both the rpart.plot and party packages provide several options to customize the appearance of the decision tree plot. Here are a few examples using rpart.plot:

# Plot the decision tree with custom settings
rpart.plot(model, type = 3, extra = 101, tweak = 1.2)

In this command:

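  • type = 3 draws separate split labels on the left and right branches instead of a single label at the node. Node coloring is controlled separately by the box.palette argument.
  • extra = 101 displays the number of observations in each node (per class for a classification tree) together with the percentage of all observations that fall in the node.
  • tweak = 1.2 scales the text to 1.2 times its automatically chosen size, making it 20% larger.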
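A few other arguments are often useful. The following sketch assumes rpart.plot is installed and writes the figure to a PDF. The title text and the file name out_file are illustrative.

```r
library(rpart)
library(rpart.plot)

model <- rpart(Species ~ ., data = iris)

# Illustrative output path (a temporary file here)
out_file <- file.path(tempdir(), "iris_tree_custom.pdf")

pdf(out_file, width = 7, height = 5)
rpart.plot(
  model,
  type = 4,              # label the branches and all nodes, not just leaves
  extra = 104,           # per-class probabilities plus percentage of observations
  under = TRUE,          # put the extra text under the box rather than inside it
  fallen.leaves = FALSE, # leave leaves where their branch ends, not at the bottom
  main = "Iris species (rpart)"
)
dev.off()
```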

For more information on customizing the plot, you can refer to the package documentation or use the help() function in R:

help(rpart.plot)

In conclusion, decision trees are valuable tools in data analysis and machine learning, and visualizing these trees can provide important insights into the decision-making process of the model. Whether you use rpart or party, R offers robust and versatile tools for building and plotting decision trees. By understanding these plots, you can better interpret your models and communicate your findings.
