Convenience Store's Customer Segmentation

Sector: RETAIL. The store owner asks help from a Data Scientist to figure out hidden consumer behaviors and customer segments, to be able to send the right offers, to the right customer segment, and to use the right buying channel.

Stephany Gochuico

4/10/20257 min read

Problems To Solve

The revenues of the local store called M-Store have been decreasing for a few years now. Their business team has done marketing operations by segmenting customers based on demographics, geography, techno-graphics, and values-based criteria, but sales haven't increased. Thus, the store owner calls for a Data Scientist to figure out hidden customer behaviors and segments, and to be able to send the right offers, to the right customer segment, and use the right buying channel.

Objective

The key objective of the Data Analyst is to identify hidden customer clusters more efficiently by applying Machine Learning algorithms. When clusters are identified, the Marketing Expert will build a business solution design to better address the different pain points of each customer group, challenge their conventional way of doing business, enable the design of customized marketing campaign offers, and eventually implement innovative solutions.

Dataset Provided

● ID: Customer ID number
● Year_Birth: Customer’s year of birth
● Education: Customer's level of education
● Marital_Status: Customer's marital status
● Kidhome: Number of small children in the customer's household
● Teenhome: Number of teenagers in customer's household
● Income: Customer's yearly household income
● Recency: Number of days since the last purchase
● Dt_Customer: Date of customer's enrollment with the company
● MntFishProducts: The amount spent on fish products in the last 5 years
● MntMeatProducts: The amount spent on meat products in the last 5 years
● MntFruits: The amount spent on fruit products in the last 5 years
● MntSweetProducts: Amount spent on sweet products in the last 5 years
● MntWines: The amount spent on wine products in the last 5 years
● MntGoldProds: The amount spent on gold products in the last 5 years
● NumDealsPurchases: Number of purchases made with discount
● NumCatalogPurchases: Number of purchases made using a catalog (goods delivered by mail)
● NumStorePurchases: Number of purchases made directly in stores
● NumWebPurchases: Number of purchases made through the company's website
● NumWebVisitsMonth: Number of visits to the company's website in the last month
● AcceptedCmp1: 1 if the customer accepted the offer in the first campaign, 0 otherwise
● AcceptedCmp2: 1 if the customer accepted the offer in the second campaign, 0 otherwise
● AcceptedCmp3: 1 if the customer accepted the offer in the third campaign, 0 otherwise
● AcceptedCmp4: 1 if the customer accepted the offer in the fourth campaign, 0 otherwise
● AcceptedCmp5: 1 if the customer accepted the offer in the fifth campaign, 0 otherwise
● Response: 1 if the customer accepted the offer in the last campaign, 0 otherwise
● Complain: 1 If the customer complained in the last 5 years, 0 otherwise

Loading Libraries in Colab Notebook

Checking the data types and missing values for each column

# Imputing the missing values in INCOME with the median value of all INCOME values
data.Income.fillna(data.Income.median(), inplace = True)

Since there are missing 24 values in Income column. Thus, I'm imputing the missing values with median value.

Feature engineering

After exploring the statistics of all categorical & numeric variables, I found that there are categories that should be combined.

# Replace the category "2n Cycle" with the category "Master" in Education column
data.Education.replace(to_replace = ["2n Cycle"], value = ["Master"], inplace = True)

[ ] data["Education"].unique()
--> array(['Graduation', 'PhD', 'Master', 'Basic'], dtype=object)

# Replace the categories "Alone", "Absurd", "YOLO" with the category "Single" in Marital_Status column
data.Marital_Status.replace(to_replace = ["Alone"], value = ['Single'], inplace = True)
data.Marital_Status.replace(to_replace = ["Absurd"], value = ['Single'], inplace = True)
data.Marital_Status.replace(to_replace = ["YOLO"], value = ['Single'], inplace = True)

[ ] data["Marital_Status"].unique()
--> array(['Single', 'Together', 'Married', 'Divorced', 'Widow'], dtype = object)

# Add Kidhome and Teenhome variables to create the new feature called "Kids"
data["Kids"] = data["Kidhome"] + data["Teenhome"]

# Replace "Married" and "Together" with "Relationship"
data.Marital_Status.replace(to_replace = ["Married"], value = ['Relationship'], inplace = True)
data.Marital_Status.replace(to_replace = ["Together"], value = ['Relationship'], inplace = True)

# Replace "Divorced" and "Widow" with "Single"
data.Marital_Status.replace(to_replace = ["Divorced"], value = ['Single'], inplace = True)
data.Marital_Status.replace(to_replace = ["Widow"], value = ['Single'], inplace = True)

# Create a new feature "Status", replacing "Single"=1 and "Relationship"=2 in Marital_Status
data.Marital_Status.replace(to_replace = ["Single"], value = [1], inplace = True)
data.Marital_Status.replace(to_replace = ["Relationship"], value = [2], inplace = True)

# Create a new feature "Status" after replacing "Single"=1 and "Relationship"=2 in "Marital_Status"
data["Status"] = data["Marital_Status"]

# Create a new feature "Family_Size"="Status"+"Kids" to get the total # of persons in each family
data["Family_Size"] = data["Status"] + data["Kids"]

# Create a new feature "Expenses"
data["Expenses"] = data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"] + data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"]

# Create a new feature "NumTotalPurchases"
data["NumTotalPurchases"] = data["NumDealsPurchases"] + data["NumWebPurchases"] + data["NumCatalogPurchases"] + data["NumStorePurchases"]

# Converting Dt_customer variable to Python date time object
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"])

# Create a new column "TotalAcceptedCmp" to get the total # of accepted campaigns by a customer
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"]

Checking for outliers

Outliers can greatly affect the machine learning model, so it's recommended to drop rows with Income > Upper whisker.

# Calculating the upper whisker for the Income variable
Q1 = data.Income.quantile(q=0.25)
Q3 = data.Income.quantile(q=0.75)
IQR = Q3 - Q1
upper_whisker = (Q3 + 1.5 * IQR)

# Checking the rows with extreme values for the Income variable
data[data.Income > upper_whisker]

# Dropping rows identified as outliers (Income > upper_whisker)
data.drop(index=[164,617,655,687,1300,1653,2132,2233], inplace=True)

Distribution of the income variable now becomes normal after dropping the outliers (Income > upper_whisker):

Data preparation for model building

# Dropping all irrelevant columns and storing in data_model

data_model = data.drop(
columns=[
"Year_Birth","Dt_Customer","day","Complain","Response","AcceptedCmp1","AcceptedCmp2",
"AcceptedCmp3","AcceptedCmp4","AcceptedCmp5","Marital_Status","Status","Kids","Education",
"Kidhome","Teenhome","Income","Age","Family_Size"
],
axis=1,
)

Scaling the Data for Unsupervised Learning

Feature scaling is quite relevant in machine learning models that compute distance metrics as we do in most clustering algorithms, for example, K-Means. Scaling should be done to avoid the problem of one feature dominating over others because the unsupervised learning algorithm uses distance to find the similarity between data points.

# Applying Standard Scaler on the new data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(data_model)
df_scaled = pd.DataFrame(df_scaled, columns = data_model.columns)

Applying PCA to the data to visualize data distributed in 2 dimensions

# Uploading the PCA library
from sklearn.decomposition import PCA

# Storing the number of variables in the data
n = df_scaled.shape[1]

# Initialize PCA with n_components = n and random_state=1
pca = PCA(n_components = n, random_state = 1)

# Fit_transform PCA on the scaled data
data_pca = pd.DataFrame(pca.fit_transform(df_scaled))

I've applied T-SNE and PCA to the data. With T-SNE, I see no pattern in the data which is scattered all over the graph. However, with PCA, the data is well grouped as seen in the graph above.
Therefore, I'm applying the clustering algorithms (K-Means, etc.) to the PCA data generated.

Selecting K (Clusters) with the Elbow Method via K-Means

In the plot below, the elbow is seen for K=3 and K=5 as there is some drop in distortion at K=3 and K=5. After several days of model building and trials to see the best customer clusters, I conclude that the optimal K value is 5.