Classifying Palmer Penguins

In this blog post, we are going to use a simplified but standard machine learning workflow to determine which three features (two quantitative and one qualitative) will allow us to confidently identify the species of a penguin.

Download Training Data

First, we download our given training data.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Load training data
train_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/train.csv"
train = pd.read_csv(train_url)
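As a quick sanity check (a small addition, not part of the original workflow), we can peek at the shape and first few rows of the raw training data:

# Optional sanity check: inspect the raw training data before any cleaning
print(train.shape)
print(train.head())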
Prepare Training Data
Next, we tidy up our data. We remove any columns that are irrelevant to determining the species of a penguin, and we encode the qualitative features (e.g. sex, clutch completion, island) as numerical values rather than strings, since strings are difficult for models to work with directly.
le = LabelEncoder()
le.fit(train["Species"])

def prepare_data(df):
    """
    Prepare qualitative data and mark species as labels
    """
    df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
    df = df[df["Sex"] != "."]
    df = df.dropna()
    y = le.transform(df["Species"])
    df = df.drop(["Species"], axis = 1)
    df = pd.get_dummies(df)
    return df, y

# Prepare training data
X_train, y_train = prepare_data(train)
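As a brief aside, and assuming the le encoder fitted above, we can check which integer label corresponds to which species, since the models below predict these integers rather than species names. This is just a sketch for orientation:

# Map the LabelEncoder's integer labels back to species names;
# label i corresponds to le.classes_[i]
for label, species in enumerate(le.classes_):
    print(label, species)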
Explore: Feature Selection
Now that we have prepared our training data, we want to figure out which three features of the data (two quantitative and one qualitative) will allow a model to achieve 100% testing accuracy when trained on those features.
The first way we tried to select these features was with SelectKBest and the f_classif score function from the sklearn.feature_selection module.
# Resource: https://www.datatechnotes.com/2021/02/seleckbest-feature-selection-example-in-python.html
from sklearn.feature_selection import SelectKBest, f_classif

all_qual_cols = ["Island_Biscoe", "Island_Dream", "Island_Torgersen", "Clutch Completion_No", "Clutch Completion_Yes", "Sex_FEMALE", "Sex_MALE"]
all_quant_cols = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)']

# Pick quantitative features
X_quant = X_train[all_quant_cols]
quant_select = SelectKBest(f_classif, k=2).fit(X_quant, y_train)
mask = quant_select.get_support()
quant_names = X_quant.columns[mask]

# Pick qualitative features
X_qual = X_train[all_qual_cols]
qual_selected = SelectKBest(f_classif, k=3).fit(X_qual, y_train)
mask = qual_selected.get_support()
qual_names = X_qual.columns[mask]

features = np.concatenate((quant_names, qual_names))

print(f"quant_names: {quant_names}")
print(f"qual_names: {qual_names}")
print(f"features: {features}")
quant_names: Index(['Culmen Length (mm)', 'Flipper Length (mm)'], dtype='object')
qual_names: Index(['Island_Biscoe', 'Island_Dream', 'Island_Torgersen'], dtype='object')
features: ['Culmen Length (mm)' 'Flipper Length (mm)' 'Island_Biscoe' 'Island_Dream'
'Island_Torgersen']
When we inspect the features SelectKBest chose based on the f_classif score function, we see that it found the quantitative Culmen Length (mm) and Flipper Length (mm) features, along with the qualitative Island feature, to be our most useful features with the highest scores.
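If we want to see the scores themselves, the fitted selectors expose the per-feature F-statistics through their scores_ attribute. A minimal sketch, reusing quant_select and qual_selected from above:

# Inspect the ANOVA F-statistics that SelectKBest used to rank the features
print(pd.Series(quant_select.scores_, index = all_quant_cols).sort_values(ascending = False))
print(pd.Series(qual_selected.scores_, index = all_qual_cols).sort_values(ascending = False))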
Since our data doesn't have too many features, another way we could have selected them is through an exhaustive search using the combinations function from the itertools package. To guard against overfitting, we use cross validation throughout this process with LogisticRegression as our model.
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

all_qual_cols = ["Island", "Clutch", "Sex"]
all_quant_cols = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)']

# Create dataframe to better inspect the scores
pd.set_option('max_colwidth', 100)
scores_df = pd.DataFrame(columns=['Columns', 'Score'])

# Go through possible combinations of features and train a model on them,
# using 1 qualitative and 2 quantitative features at a time
for qual in all_qual_cols:
    qual_cols = [col for col in X_train.columns if qual in col]
    for pair in combinations(all_quant_cols, 2):
        cols = list(pair) + qual_cols
        # Using logistic regression for modeling
        LR = LogisticRegression()
        # Incorporating cross validation
        cv_mean_score = cross_val_score(LR, X_train[cols], y_train, cv=10).mean()
        scores_df = scores_df.append({'Columns': cols, 'Score': cv_mean_score.round(3)}, ignore_index=True)

scores_df = scores_df.sort_values(by='Score', ascending=False).reset_index(drop=True)
features = scores_df.iloc[0, 0]
features
['Culmen Length (mm)',
'Culmen Depth (mm)',
'Island_Biscoe',
'Island_Dream',
'Island_Torgersen']
We see that the exhaustive search also found the qualitative Island feature to be most useful. We can further inspect why this qualitative Island feature was chosen over Sex and Clutch Completion using functions like groupby and aggregate from the pandas package.
"""
Resources:
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html
https://www.geeksforgeeks.org/python-pandas-dataframe-reset_index/
https://towardsdatascience.com/interesting-ways-to-select-pandas-dataframe-columns-b29b82bbfb33
https://sites.ualberta.ca/~hadavand/DataAnalysis/notebooks/Reshaping_Pandas.html
"""
# Group the penguins by species and island, and count the number of occurrences
= train.groupby(['Species', 'Island']).size().reset_index(name='count')
counts
# Group the penguins by island and compute the total count for each island
= counts.groupby('Island')['count'].sum().reset_index(name='total')
island_totals
# Merge the counts and island_totals
= pd.merge(counts, island_totals, on='Island')
results
# Compute the percentage
'percentage'] = results['count'] / results['total'] * 100
results[
# Edit results dataframe so that it only contains the Island, Species and percentage
= results[['Island', 'Species', 'percentage']]
results
# Arrange results to have islands as columns and species as rows
= results.pivot(index='Species', columns='Island', values='percentage').round(2)
results
print(results)
Island Biscoe Dream Torgersen
Species
Adelie Penguin (Pygoscelis adeliae) 25.74 42.27 100.0
Chinstrap penguin (Pygoscelis antarctica) NaN 57.73 NaN
Gentoo penguin (Pygoscelis papua) 74.26 NaN NaN
counts = train.groupby(['Species', 'Sex']).size().reset_index(name='count')
sex_totals = counts.groupby('Sex')['count'].sum().reset_index(name='total')
results = pd.merge(counts, sex_totals, on='Sex')
results['percentage'] = results['count'] / results['total'] * 100
results = results[['Sex', 'Species', 'percentage']]
results = results.pivot(index='Species', columns='Sex', values='percentage').round(2)
results = results.drop(columns='.')

print(results)
Sex FEMALE MALE
Species
Adelie Penguin (Pygoscelis adeliae) 44.53 40.44
Chinstrap penguin (Pygoscelis antarctica) 22.66 19.85
Gentoo penguin (Pygoscelis papua) 32.81 39.71
counts = train.groupby(['Species', 'Clutch Completion']).size().reset_index(name='count')
clutch_totals = counts.groupby('Clutch Completion')['count'].sum().reset_index(name='total')
results = pd.merge(counts, clutch_totals, on='Clutch Completion')
results['percentage'] = results['count'] / results['total'] * 100
results = results[['Clutch Completion', 'Species', 'percentage']]
results = results.pivot(index='Species', columns='Clutch Completion', values='percentage').round(2)

print(results)
Clutch Completion No Yes
Species
Adelie Penguin (Pygoscelis adeliae) 37.93 43.50
Chinstrap penguin (Pygoscelis antarctica) 37.93 18.29
Gentoo penguin (Pygoscelis papua) 24.14 38.21
When we inspect the qualitative features in this way, we see that each island is home to at most two different penguin species. For the Sex and Clutch Completion features, however, it is harder to differentiate the species, since a penguin with any given value could belong to any of the three species.
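A more direct way to see this, sketched below with the same groupby tools, is to count how many distinct species appear for each value of each qualitative feature:

# Count the number of distinct species observed for each value of each qualitative feature
for col in ["Island", "Sex", "Clutch Completion"]:
    print(train.groupby(col)["Species"].nunique(), "\n")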
Looking back at our exhaustive search results, however, we also see that it found a different pair of quantitative features with a higher score: it found Culmen Depth (mm) and Culmen Length (mm) to be the more useful features.
To figure out whether the SelectKBest pair (Culmen Length (mm) and Flipper Length (mm)) or the exhaustive search pair (Culmen Length (mm) and Culmen Depth (mm)) is the better pair of quantitative features, let's inspect what they look like when graphed using the seaborn package.
import seaborn as sns
sns.set_theme()

sns.relplot(
    data=train,
    x="Culmen Length (mm)", y="Flipper Length (mm)", hue="Species"
).set(title = "SelectKBest Features")

sns.relplot(
    data=train,
    x="Culmen Length (mm)", y="Culmen Depth (mm)", hue="Species"
).set(title = "Exhaustive Search Features")
Based on these graphs, it looks like Culmen Depth (mm) and Culmen Length (mm) may be the better quantitative options, since they show less overlap among the species. In other words, the species are more easily separable and distinguishable.
To summarize: when training our models, we will use the quantitative Culmen Length (mm) and Culmen Depth (mm) features and the qualitative Island feature.
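For clarity, and so the modeling code below can be run without re-running the exhaustive search, here is the final feature list written out explicitly (it matches the search result shown above):

# Final selected features: two quantitative columns plus the one-hot encoded Island columns
features = ['Culmen Length (mm)', 'Culmen Depth (mm)',
            'Island_Biscoe', 'Island_Dream', 'Island_Torgersen']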
Explore: Modeling
Now that we have chosen our features, we can begin to train different models using our training data. In this blog post, we explore the DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression models and use the plot_regions function below to visualize our decision regions.
from matplotlib.patches import Patch
from mlxtend.plotting import plot_decision_regions
from matplotlib import pyplot as plt
import numpy as np

def plot_regions(model, X, y):

    x0 = X[X.columns[0]]
    x1 = X[X.columns[1]]
    qual_features = X.columns[2:]

    fig, axarr = plt.subplots(1, len(qual_features), figsize = (7, 3))

    # create a grid
    grid_x = np.linspace(x0.min(), x0.max(), 501)
    grid_y = np.linspace(x1.min(), x1.max(), 501)
    xx, yy = np.meshgrid(grid_x, grid_y)

    XX = xx.ravel()
    YY = yy.ravel()

    for i in range(len(qual_features)):
        XY = pd.DataFrame({
            X.columns[0] : XX,
            X.columns[1] : YY
        })

        for j in qual_features:
            XY[j] = 0

        XY[qual_features[i]] = 1

        p = model.predict(XY)
        p = p.reshape(xx.shape)

        # use contour plot to visualize the predictions
        axarr[i].contourf(xx, yy, p, cmap = "jet", alpha = 0.2, vmin = 0, vmax = 2)

        ix = X[qual_features[i]] == 1
        # plot the data
        axarr[i].scatter(x0[ix], x1[ix], c = y[ix], cmap = "jet", vmin = 0, vmax = 2)

        axarr[i].set(xlabel = X.columns[0],
                     ylabel = X.columns[1])

        axarr[i].set_title(qual_features[i])

    patches = []
    for color, spec in zip(["red", "green", "blue"], ["Adelie", "Chinstrap", "Gentoo"]):
        patches.append(Patch(color = color, label = spec))

    plt.suptitle(f"Score {model.score(X, y)}")
    plt.legend(title = "Species", handles = patches, loc = "best")
    plt.tight_layout()
DecisionTreeClassifier
For the DecisionTreeClassifier model, we need to provide a max_depth argument, which helps control the complexity of the model. In order to find a good max_depth value, we use cross validation to help prevent overfitting.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

fig, ax = plt.subplots(1)

max_score = 0
best_depth = 0
for d in range(2, 10):
    T = DecisionTreeClassifier(max_depth = d)
    cv_mean = cross_val_score(T, X_train[features], y_train, cv = 10).mean()
    ax.scatter(d, cv_mean, color = "black")
    if cv_mean > max_score:
        max_score = cv_mean
        best_depth = d

labs = ax.set(xlabel = "Complexity (depth)", ylabel = "Performance (score)")
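It can also help to print the depth that cross validation selected along with its mean score (a small addition to the loop above):

# Report the depth chosen by cross validation and its mean CV accuracy
print(f"best max_depth: {best_depth}, mean CV score: {round(max_score, 3)}")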
Now that we have found the most suitable max_depth, we can train our model with it and look at its decision regions.
DTC = DecisionTreeClassifier(max_depth = best_depth)
DTC.fit(X_train[features], y_train)
plot_regions(DTC, X_train[features], y_train)
Based on these plotted decision regions and the score, it looks like our DecisionTreeClassifier model did a good job of correctly classifying our training data. It also looks like it was able to do so without overfitting, since the decision boundaries do not look too wiggly or overly tailored to our training data.
RandomForestClassifier
For the RandomForestClassifier model, no arguments are required, so we can simply train the model with our selected features.
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier()
RFC.fit(X_train[features], y_train)
plot_regions(RFC, X_train[features], y_train)
Again, based on these plotted decision regions and the score, it looks like our model did a good job of correctly classifying our training data. It does, however, look like it may be overfitting, since some of the decision regions look a bit wiggly and tailored too closely to our training data.
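If that wiggliness were a concern, one option (not something we pursue here) would be to constrain the forest's complexity, for example by limiting max_depth, and to check the effect with cross validation. A rough sketch, where max_depth = 3 is an arbitrary illustrative choice rather than a tuned value:

# Sketch: a shallower random forest as one possible way to smooth the decision regions
RFC_shallow = RandomForestClassifier(max_depth = 3)
print(cross_val_score(RFC_shallow, X_train[features], y_train, cv = 10).mean())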
LogisticRegression
For the LogisticRegression model, no arguments are required either, so we can simply train the model with our selected features.
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(X_train[features], y_train)
plot_regions(LR, X_train[features], y_train)
Again, based on these plotted decision regions and the score, it looks like our model did a great job of correctly classifying our training data, even better than the DecisionTreeClassifier model. It also looks like it was able to do so without overfitting, since the decision boundaries do not look too wiggly or overly tailored to our training data.
Testing and Results
Now that we have models trained using DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression, we can see which one will yield our desired result of 100% testing accuracy.
First, we download and prepare our testing data.
= "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/test.csv"
test_url = pd.read_csv(test_url)
test = prepare_data(test) X_test, y_test
Next, we can inspect the performance of each model on the testing data by plotting their decision regions.
plot_regions(DTC, X_test[features], y_test)
plot_regions(RFC, X_test[features], y_test)
plot_regions(LR, X_test[features], y_test)
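To complement the decision-region plots, we can also print each model's accuracy on the test set directly using scikit-learn's score method:

# Accuracy of each trained model on the held-out test data
for name, model in [("DecisionTreeClassifier", DTC),
                    ("RandomForestClassifier", RFC),
                    ("LogisticRegression", LR)]:
    print(name, model.score(X_test[features], y_test))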
As we can see based on the decision regions and scores, while our DecisionTreeClassifier and RandomForestClassifier models yield promising classifications, our LogisticRegression model trained on the Culmen Length (mm), Culmen Depth (mm), and Island features does even better, achieving 100% testing accuracy.