Decision trees are powerful and widely used models in machine learning and data analysis. They allow us to predict outcomes by making decisions based on rules. One of the primary advantages of decision trees is their interpretability: the model’s decisions can be visualized and understood, making them an excellent tool for exploratory data analysis and communication of results.
This comprehensive guide will show you how to plot a decision tree in R using both the rpart and party packages. We will also cover how to interpret the plotted tree and how to customize its appearance.
1. Building and Plotting a Decision Tree with rpart
The rpart (Recursive Partitioning and Regression Trees) package in R provides functionality to build and visualize decision trees. The package is not part of the base R installation, so you will need to install it first:

install.packages("rpart")
Once installed, load the package:

library(rpart)
For this guide, we will use the built-in iris dataset to predict species based on sepal and petal measurements. Let’s fit a decision tree model:
# Fit a decision tree model
model <- rpart(Species ~ ., data = iris)
rpart(Species ~ ., data = iris) fits a decision tree model with Species as the dependent variable and all other variables in the iris dataset as independent variables.
To plot this tree, we will also need the rpart.plot package, which provides advanced plotting capabilities for rpart trees:

install.packages("rpart.plot")
library(rpart.plot)
Now you can plot the decision tree:
# Plot the decision tree
rpart.plot(model)
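The fitted model can also be used for prediction. A minimal self-contained sketch (the measurements in new_flower are hypothetical values chosen for illustration):

```r
library(rpart)

# Fit the same tree as above
model <- rpart(Species ~ ., data = iris)

# Hypothetical measurements for a single flower
new_flower <- data.frame(
  Sepal.Length = 5.1, Sepal.Width = 3.5,
  Petal.Length = 1.4, Petal.Width = 0.2
)

# type = "class" returns the predicted species as a factor
predict(model, newdata = new_flower, type = "class")
```
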
2. Building and Plotting a Decision Tree with party
The party package is another powerful tool for building and plotting decision trees in R. Like rpart, it is not part of the base R installation:

install.packages("party")
library(party)
To fit a decision tree model with the party package, you use the ctree() function, which stands for Conditional Inference Tree:
# Fit a decision tree model
model <- ctree(Species ~ ., data = iris)
Plotting a decision tree with party is as simple as calling the plot() function:

# Plot the decision tree
plot(model)
Unlike the rpart plot, the default party plot draws internal nodes as ellipses labeled with the splitting variable and its p-value, and terminal nodes (also known as leaves) as bar charts showing the number of observations and the distribution of outcome classes in each node.
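As with rpart, the fitted ctree object can be used for prediction; for a classification tree, predict() returns the predicted class by default. A self-contained sketch:

```r
library(party)

# Fit a conditional inference tree on the iris data
model <- ctree(Species ~ ., data = iris)

# Predicted classes for the first few training rows
head(predict(model))

# Per-class probability estimates (one list element per observation)
predict(model, type = "prob")[1:3]
```
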
3. Interpreting a Plotted Decision Tree
Regardless of which package you used, interpreting the plotted decision tree involves understanding the structure of the tree and the information presented at each node.
- Root Node: The topmost node, which applies to the entire dataset. It represents the first decision that splits the data based on the independent variable that provides the best separation.
- Branches: Lines connecting the nodes, representing the outcome of a decision. For example, a branch may represent cases where the sepal length is greater than a certain value.
- Internal Nodes: Nodes that split into further branches. Each internal node represents a decision based on one of the independent variables.
- Leaf Nodes: The terminal nodes at the end of the branches. Each leaf node represents a prediction and does not split further. The predicted outcome is the most common outcome of the observations that fall into that leaf.
The information presented at each node depends on the package and the type of tree (classification or regression), but it typically includes the decision rule, the number of observations, and the distribution of outcomes.
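For rpart trees, the per-node information can also be read off programmatically: the rpart.plot package provides rpart.rules(), which prints each leaf's decision path as a plain-text rule. A self-contained sketch:

```r
library(rpart)
library(rpart.plot)

model <- rpart(Species ~ ., data = iris)

# One row per leaf: fitted class, class probability, coverage,
# and the sequence of splits that leads to that leaf
rpart.rules(model)
```
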
4. Customizing the Decision Tree Plot
Both the rpart.plot and party packages provide several options to customize the appearance of the decision tree plot. Here are a few examples using rpart.plot:
# Plot the decision tree with custom settings
rpart.plot(model, type = 3, extra = 101, tweak = 1.2)
In this command:
- type = 3 draws separate split labels for the left and right branches, below the nodes.
- extra = 101 displays the number and percentage of observations in each node.
- tweak = 1.2 enlarges the text in the plot by 20%.
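The party package's plot method can be customized as well. For example, if the default bar-chart terminal nodes take up too much space, the "simple" plot type replaces them with compact text summaries. A self-contained sketch:

```r
library(party)

model <- ctree(Species ~ ., data = iris)

# type = "simple" shows terminal nodes as text summaries
# instead of bar charts
plot(model, type = "simple")
```
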
For more information on customizing the plot, you can refer to the package documentation or use the help() function in R:

help(rpart.plot)
In conclusion, decision trees are valuable tools in data analysis and machine learning, and visualizing these trees can provide important insights into the decision-making process of the model. Whether you use rpart or party, R offers robust and versatile tools for building and plotting decision trees. By understanding these plots, you can better interpret your models and communicate your findings.