Leveraging Decision Trees for Effective Exploratory Data Analysis
Introduction
The Decision Tree (DT) stands out as one of the most straightforward Machine Learning algorithms. This is a subjective view, yet it's a sentiment shared widely in the Data Science community.
A Decision Tree is a machine learning technique that simplifies complex decisions through a series of basic choices. As stated by Brett Lantz in Machine Learning with R, the decision-making process resembles that of human reasoning, structured like a flow chart where each node presents a binary decision based on a specific variable, leading to a final conclusion.
For instance, consider the process of buying a T-shirt:
- If the price exceeds $30, I won’t purchase it; if not, I keep evaluating.
- Upon finding a shirt priced under $30, I check if it’s from a preferred brand. If it is, I proceed.
- Next, I verify if it fits my size; if it does, I continue.
- Ultimately, if the shirt meets all criteria—being under $30, from brand X, size S, and black—I will buy it; otherwise, I may continue searching or decide not to buy.
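Written out as code, this decision process is nothing more than a chain of nested conditions. Here is a minimal sketch, with hypothetical thresholds and attribute values mirroring the example above:

```python
def buy_tshirt(price, brand, size, color):
    """Toy decision process mirroring the T-shirt example."""
    if price > 30:
        return False   # too expensive
    if brand != 'X':
        return False   # not a preferred brand
    if size != 'S':
        return False   # wrong size
    if color != 'black':
        return False   # keep searching
    return True        # all criteria met: buy it

print(buy_tshirt(price=25, brand='X', size='S', color='black'))  # True
```

A Decision Tree learns a structure like this automatically from data, choosing the variables and thresholds that best separate the outcomes.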
This logical and straightforward approach can be applied across various data types. However, the sensitivity of Decision Trees to minor changes in the dataset, particularly with smaller datasets, poses a challenge. They can easily adapt to slight variations in the data, leading to overfitting in the machine learning model.
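One quick way to see this sensitivity is to fit unconstrained trees on slightly different subsamples of the same data and compare their structure. A toy sketch on synthetic data (not the article's dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for a small, noisy dataset
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)

# Fully grown trees fit on three slightly different 90% subsamples
for seed in range(3):
    idx = np.random.RandomState(seed).choice(len(X), size=180, replace=False)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    print(f"subsample {seed}: {tree.tree_.node_count} nodes, depth {tree.get_depth()}")
```

With noisy data and no depth limit, each subsample typically yields a noticeably different tree, which is exactly the instability described above.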
Although this characteristic may jeopardize predictive accuracy, it is precisely what we can leverage during our Exploratory Data Analysis (EDA).
In this article, we will explore how to harness the power of Decision Trees to derive deeper insights from our data. Let’s dive in.
What is EDA?
Exploratory Data Analysis, abbreviated as EDA, is a critical phase in a Data Science project where we examine the dataset and its variables to gain insights regarding what significantly impacts the target variable.
During this phase, data scientists seek to comprehend the data distribution, identify any errors or missing values, extract preliminary insights, and visualize how each explanatory variable influences the target variable.
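In pandas, many of these first checks are one-liners. A minimal sketch, using a tiny placeholder DataFrame in place of a real dataset:

```python
import pandas as pd

# Placeholder data; swap in your own dataset here
df = pd.DataFrame({'hours': [1, 2, None, 4], 'grade': [10, 12, 11, 15]})

df.info()                                   # column types and non-null counts
print(df.describe())                        # distribution summary per numeric column
print(df.isna().sum())                      # missing values per column
print(df.corr(numeric_only=True)['grade'])  # linear relationship to the target
```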
Using Decision Trees in the Process
The ability of Decision Trees to capture subtle variations in data makes them invaluable for understanding variable relationships. Since this phase is exploratory, there is no need for meticulous data splitting or fine-tuning the algorithm; simply running a Decision Tree can yield significant insights.
Let’s explore how to implement this.
The Dataset
For this exercise, we will utilize the Student Performance dataset from the UCI Repository, created by Paulo Cortez. This dataset is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
```python
# Importing libraries
import pandas as pd
import seaborn as sns
sns.set_style()
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree

# Loading a dataset
from ucimlrepo import fetch_ucirepo

# Fetch dataset
student_performance = fetch_ucirepo(id=320)

# Data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets

# Gather X and Y for visualizations
df = pd.concat([X, y], axis=1)

df.head(3)
```
Our goal is to identify which variables significantly influence the final grade G3.
Exploring with a Regression DT
Next, we will construct a Decision Tree to analyze the impact of failures, absences, and studytime on G3.
```python
# Columns to explore
cols = ['failures', 'absences', 'studytime']

# Split X & Y
X = df[cols]
y = df.G3

# Fit Decision Tree
dt = DecisionTreeRegressor().fit(X, y)

# Plot DT
plt.figure(figsize=(20, 10))
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3, fontsize=8);
```
This results in the following Decision Tree.
This visualization provides a clear understanding of the relationships among the selected variables. Here are some insights derived from this tree:
- In each box, the left branch means the condition in the first line is true ("Yes"); the right branch means it is false ("No").
- Students with fewer failures (failures < 0.5, i.e., none) tend to achieve higher grades. Note that the predicted values in the boxes on the left side of the tree are consistently higher than those on the right.
- Among students with no failures, those with a studytime level above 2.5 (on the dataset's 1–4 scale) achieve better grades, by nearly a full point.
- Students with no failures, a studytime level below 1.5, and fewer than 22 absences attain higher final grades than peers with the same low study time but more absences.
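These readings are easy to sanity-check with plain pandas aggregations on the same DataFrame; for instance, for the first two splits:

```python
# Average final grade by number of past failures
print(df.groupby('failures')['G3'].mean())

# Among students with no failures, average grade by studytime level
no_fail = df[df['failures'] == 0]
print(no_fail.groupby('studytime')['G3'].mean())
```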
Free Time and Going Out
If we wish to investigate how free time and social outings affect grades, we can use the following code.
```python
# Columns to explore
cols = ['freetime', 'goout']

# Split X & Y
X = df[cols]
y = df.G3

# Fit Decision Tree
dt = DecisionTreeRegressor().fit(X, y)

# Plot DT
plt.figure(figsize=(20, 10))
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3, fontsize=10);
```
The variables goout and freetime are rated on a scale from 1 (very low) to 5 (very high). Students who rarely go out (goout < 1.5) and have very little free time (freetime < 1.5) tend to score about as low as those who go out very often (goout > 4.5), even when the latter have a fair amount of free time. The highest grades come from students who strike a balance: they do go out (goout > 1.5) but keep their free time moderate (freetime between 1.5 and 2.5).
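As a cross-check, the same relationship can be summarized in one pivot table of average grades over the df built earlier:

```python
# Average final grade for each goout (rows) x freetime (columns) combination
pivot = pd.pivot_table(df, values='G3', index='goout',
                       columns='freetime', aggfunc='mean')
print(pivot.round(1))
```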
Exploring with a Classification DT
Similarly, we can apply a Classification Tree algorithm to conduct the same analysis. The methodology and coding remain consistent; however, the resulting output will indicate the predicted class rather than a numerical value. Let's examine a straightforward example using the Taxis dataset from the Seaborn package (BSD License), which encompasses various taxi rides in New York City.
To explore the relationship between the total fare and the payment method, use the following code.
```python
# Load the dataset
df = sns.load_dataset('taxis').dropna()

# Columns to explore
cols = ['total']

# Split X & Y
X = df[cols]
y = df['payment']

# Fit Decision Tree
dt = DecisionTreeClassifier().fit(X, y)

# Plot Tree
plt.figure(figsize=(21, 10))
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3,
          fontsize=10, class_names=['cash', 'credit_card']);
```
A quick glance at the resulting tree suggests that lower fare amounts are predominantly paid in cash. Generally, totals below $9.32 are likely to be settled in cash.
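That reading is easy to verify with a quick aggregation, using the $9.32 split value taken from the tree above:

```python
# Payment-method shares below vs. above the tree's ~$9.32 split
low_fare = df['total'] < 9.32
print(df.groupby(low_fare)['payment'].value_counts(normalize=True).round(2))
```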
Conclusion
In this tutorial, we explored an efficient way to use Decision Trees for examining variable relationships within our dataset. This algorithm effectively identifies patterns that may not be immediately apparent. By employing Decision Trees, we can uncover valuable insights.
Also, a quick note about the code: the plot_tree() function allows you to specify the desired depth with the max_depth parameter. You can set this hyperparameter in the Decision Tree instance from sklearn as well. Using it in plot_tree() enables quick testing of different depths without needing to retrain the model.
```python
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3);
```
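If you would rather constrain the model itself, set max_depth on the estimator instead. A minimal sketch, reusing the X and y from the regression example above:

```python
# Capping the tree itself changes what is learned,
# not just what is displayed
dt_shallow = DecisionTreeRegressor(max_depth=3).fit(X, y)

plt.figure(figsize=(20, 10))
plot_tree(dt_shallow, filled=True, feature_names=X.columns, fontsize=8);
```

Unlike the plot_tree parameter, which only hides the deeper levels of an already-grown tree, this limits the tree during training.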
If you found this content helpful, feel free to follow me for more insights: you can find my writing on Medium at gustavorsantos.medium.com, or connect with me on LinkedIn. Let’s collaborate!
References
- Decision Tree - Wikipedia: A decision tree is a decision support hierarchical model that utilizes a tree-like structure of decisions and their possible consequences. [Link](https://en.wikipedia.org)
- UCI Machine Learning Repository: Explore datasets from around the globe! [Link](https://archive.ics.uci.edu)
- GitHub - mwaskom/seaborn-data: Data repository for Seaborn examples. [Link](https://github.com)
- DecisionTreeRegressor: Gallery examples for scikit-learn. [Link](https://scikit-learn.org)
I would like to acknowledge a valuable reference: I learned this technique from the talented Brazilian data scientist, Teo Calvo. He offers an excellent free program with daily live sessions on his channel, Teo Me Why. If you speak Portuguese, check out his work.
Teo Me Why: a free education initiative for the data and technology field. [Link](https://teomewhy.org)