Bank's Customer Segmentation

Sector: BANKING. C-Bank wants to expand its business in a new city by focusing on its credit card customer base. The team wants to improve market penetration, run personalized campaigns to upsell new products to existing customers, upgrade its service delivery model, and target new customers.

Stephany Gochuico

1/25/2024 · 5 min read

Problems To Solve

C-Bank, a world-leading bank, wants to expand its business in a new city and focus on its credit card customer base in the next financial year. Its Marketing team wants to improve market penetration, run personalized campaigns to upsell new products to existing customers, and target new customers.

The bank has also received feedback from existing customers that its customer support services are poor. Based on the survey report, the Operations team wants to upgrade the service delivery model so that customers' queries are resolved more efficiently and effectively.


Identify different customer segments in the existing customer base, taking into account their spending patterns, past interactions with the bank, and hidden behaviors.

The dataset contains the following features:

● Sl_No: Customer serial number
● Customer Key: Customer identification
● Avg_Credit_Limit: Average credit limit, in euros
● Total_Credit_Cards: Total number of credit cards
● Total_visits_bank: Total bank visits
● Total_visits_online: Total online visits
● Total_calls_made: Total calls made

Loading the libraries in Colab Notebook
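
The import cell is not shown in this post. Here is a minimal sketch of what the code in this post appears to rely on, inferred from the names it uses (`pd`, `plt`, `sns`, `StandardScaler`, `PCA`, `KMeans`, `GaussianMixture`). `KMedoids` is not part of scikit-learn itself; it would come from the separate scikit-learn-extra package via `from sklearn_extra.cluster import KMedoids`.

```python
import pandas as pd                                # data frames
import matplotlib.pyplot as plt                    # plotting
import seaborn as sns                              # correlation heatmap

from sklearn.preprocessing import StandardScaler   # standardization
from sklearn.decomposition import PCA              # principal components
from sklearn.cluster import KMeans                 # K-Means clustering
from sklearn.mixture import GaussianMixture        # Gaussian Mixture Model
```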
Data Preprocessing and Exploratory Data Analysis

First, I checked for duplicates in "Customer Key" (the identifier). "Sl_No" has 660 unique values but "Customer Key" has only 655, which means that 5 customer keys appear more than once.

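Duplicates have to be dealt with before applying any algorithm. So I locate the duplicated Customer Keys, drop those rows from the dataset, and reset the index.

# Flag rows whose Customer Key already appeared earlier, then inspect them
duplicate_keys = data['Customer Key'].duplicated()
data[duplicate_keys]
# Drop the rows chosen during inspection (by serial number) and reset the index
data = data[~data['Sl_No'].isin([333, 399, 433, 542, 633])]
data = data.reset_index(drop=True)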
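
To make the duplicate check concrete, here is a small sketch on a made-up table (the values are hypothetical, not taken from the bank's dataset). It compares the row count with the number of unique keys, then uses `duplicated(keep=False)` to show every row that shares a key.

```python
import pandas as pd

# Hypothetical toy data: customer key 101 appears twice
toy = pd.DataFrame({
    "Sl_No": [1, 2, 3, 4],
    "Customer Key": [101, 102, 101, 103],
    "Avg_Credit_Limit": [5000, 12000, 8000, 20000],
})

# Fewer unique keys than rows means some keys repeat
n_repeats = len(toy) - toy["Customer Key"].nunique()

# keep=False flags every occurrence of a repeated key, not only the later ones
repeated_rows = toy[toy["Customer Key"].duplicated(keep=False)]

print(n_repeats)
print(repeated_rows)
```

Looking at all rows that share a key, side by side, makes it easier to decide which records to drop.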

After exploring all the data, I'm dropping "Sl_No" and "Customer Key" prior to building the Machine Learning model, since both are identifiers and carry no behavioral information.

data.drop(columns=['Sl_No', 'Customer Key'], inplace = True)

Checking correlation between all columns.

# Correlation heatmap of all features
plt.figure(figsize = (10, 6))
sns.heatmap(data.corr(), annot = True, fmt = '0.2f')
plt.xticks(rotation = 45)
plt.show()


  • "Avg_Credit_Limit" is positively correlated with "Total_Credit_Cards" and "Total_visits_online", which makes sense.

  • "Avg_Credit_Limit" is negatively correlated with "Total_calls_made" and "Total_visits_bank".

  • "Total_visits_bank", "Total_visits_online", and "Total_calls_made" are negatively correlated with one another, which suggests that most customers rely mainly on one of these channels to contact the bank.
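
A heatmap can be hard to scan for the strongest relationships. As a complement, this sketch ranks column pairs by correlation on a small made-up table (hypothetical values; only the column names are borrowed from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the bank's records
toy = pd.DataFrame({
    "Avg_Credit_Limit":    [5, 8, 10, 50, 60, 100],
    "Total_visits_online": [1, 2, 1, 8, 10, 15],
    "Total_calls_made":    [9, 7, 8, 2, 1, 0],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values()

print(pairs)  # most negative pairs first, most positive last
```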

Scaling the data

I'm standardizing the data to have a mean of ~0 and a variance of 1.

# Fit the scaler and transform the data with the same object
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
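
As a quick check of what standardization does, the sketch below (hypothetical numbers) applies `StandardScaler` to a two-column array and compares the result with the formula z = (x - mean) / std computed by hand. Note that `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales (credit limit vs. card count)
X = np.array([[5000.0, 2.0],
              [20000.0, 4.0],
              [100000.0, 9.0]])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean, divide by the column std
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

Without this step, a feature measured in thousands such as "Avg_Credit_Limit" would dominate the distance calculations that the clustering algorithms rely on.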

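Applying PCA on scaled data

# Keep as many components as there are features
n = data.shape[1]
pca = PCA(n_components=n)
principal_components = pca.fit_transform(data_scaled)
# The columns hold principal components; the original feature names are reused only as labels
data_pca = pd.DataFrame(principal_components, columns = data.columns)
data_copy = data_pca.copy(deep = True)

Fitting K-Means algorithm on the PCA components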

# Empty dictionary to store the SSE for each value of K
sse = {}

# Iterate over a range of K values and fit K-Means on the PCA components
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, max_iter = 1000, random_state = 1).fit(data_pca)
    sse[k] = kmeans.inertia_

# Elbow plot
plt.plot(list(sse.keys()), list(sse.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

From the elbow plot, K = 3 appears to be the optimal number of clusters.
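
For readers who want to reproduce the elbow method end to end, here is a self-contained sketch on synthetic data. The data come from `make_blobs` with three centers, an assumption chosen for illustration and not the bank's dataset. Because every principal component is kept, PCA only rotates the standardized data, so the SSE for K = 1 equals the number of rows times the number of columns.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer table: 300 rows, 5 features, 3 groups
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=1)

X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=X.shape[1]).fit_transform(X_scaled)

# SSE (inertia) for each candidate number of clusters
sse = {}
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, max_iter=1000, random_state=1)
    sse[k] = model.fit(X_pca).inertia_

for k, value in sse.items():
    print(k, round(value, 1))
```

The elbow is the value of K after which the SSE stops dropping sharply. With well-separated groups it usually coincides with the true number of groups, but on real data the bend can be less obvious and remains a judgment call.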

Applying the K-Means algorithm

# Fit K-Means with 3 clusters and store the labels in both data frames
kmeans = KMeans(n_clusters = 3, random_state = 1)
kmeans.fit(data_scaled)
data_copy['Labels'] = kmeans.predict(data_scaled)
data['Labels'] = kmeans.predict(data_scaled)

# Number of observations in each cluster
data['Labels'].value_counts()

1 : 374
0 : 221
2 : 49
Name: Labels, dtype: int64

Creating cluster profiles using K-Means model

# Calculating the summary statistics of the original data for each label
mean = data.groupby('Labels').mean()
median = data.groupby('Labels').median()
df_kmeans = pd.concat([mean, median], axis = 0)
df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']

# Visualizing different features with reference to K-means labels
data_copy.boxplot(by = 'Labels', layout = (1, 5), figsize = (20, 7))

Creating cluster profiles using Gaussian Mixture Model

# Applying the Gaussian Mixture algorithm on the PCA components
gmm = GaussianMixture(n_components = 3, random_state = 1)
gmm.fit(data_pca)
data_copy['GmmLabels'] = gmm.predict(data_pca)
data['GmmLabels'] = gmm.predict(data_pca)

# Number of observations in each cluster
data['GmmLabels'].value_counts()

1 : 582
0 : 37
2 : 25
Name: GmmLabels, dtype: int64
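
Unlike K-Means, a Gaussian Mixture Model assigns each observation a probability of belonging to every component. Here is a minimal sketch on synthetic data (`make_blobs`, an assumption for illustration) showing both the hard labels and the soft probabilities.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data: 200 rows, 2 features, 3 groups
X, _ = make_blobs(n_samples=200, n_features=2, centers=3, random_state=1)

gmm = GaussianMixture(n_components=3, random_state=1).fit(X)

hard_labels = gmm.predict(X)        # one component index per row
soft_probs = gmm.predict_proba(X)   # one probability per row and component

# Cluster sizes, comparable to value_counts() on a label column
sizes = np.bincount(hard_labels, minlength=3)
print(sizes)
print(soft_probs[:3].round(3))
```

Rows whose probabilities are spread across several components are the ones the model is least sure about, which can help when one component ends up absorbing most of the data.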

Creating cluster profiles using K-Medoids Model

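# Applying the K-Medoids algorithm on the PCA components
# (KMedoids is provided by the scikit-learn-extra package: sklearn_extra.cluster)
kmedo = KMedoids(n_clusters = 3, random_state = 1)
kmedo.fit(data_pca)
data_copy['kmedoLabels'] = kmedo.predict(data_pca)
data['kmedoLabels'] = kmedo.predict(data_pca)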

# Number of observations in each cluster
data['kmedoLabels'].value_counts()

1 : 289
0 : 222
2 : 133
Name: kmedoLabels, dtype: int64
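
The key difference from K-Means is that each K-Medoids center is an actual observation (the medoid) and not an average. The sketch below uses plain NumPy and made-up one-dimensional values to show how a single outlier pulls the mean but barely moves the medoid. It illustrates the idea only and is not the `KMedoids` implementation.

```python
import numpy as np

# Hypothetical credit limits (in thousands) with one extreme value
values = np.array([10.0, 12.0, 11.0, 13.0, 200.0])

mean_center = values.mean()

# Medoid: the observation with the smallest total distance to all others
pairwise = np.abs(values[:, None] - values[None, :])
medoid_center = values[pairwise.sum(axis=1).argmin()]

print(mean_center)    # 49.2
print(medoid_center)  # 12.0
```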

Conclusion: Customer Insights of 3 Clusters from 3 Machine Learning Models

Recommendations: 3 Potential Customer Segments

After applying 3 unsupervised learning algorithms (K-Means, Gaussian Mixture Model, and K-Medoids), we can conclude that there are 3 main customer segments:


-Findings: Has the lowest credit limits and the fewest credit cards
-Main channel: Call Center
-Insights: This group may have a very low income and a low education level, which may mean that they have no computer or internet access at home.
-Recommendations: (1) C-Bank may consider setting up computer stations at the Bank Agency so that this customer segment can perform bank transactions on their own. A bank staff member could also assist or guide customers in completing their transactions on the bank's platform. (2) Allocate more staff and phone capacity at the Call Center to accommodate the needs of this group.

-Findings: Has the highest number of credit cards and the highest credit limits
-Main channel: Website
-Insights: This group may have a very high income and a high level of education, and likely has computer and internet access at home.
-Recommendations: (1) C-Bank may consider providing premium bank services and, on that basis, upselling the bank's other products and services.

-Findings: Has the lowest number of website visits and higher credit limits
-Main channel: Bank visit
-Insights: The majority of this group may be retired seniors who are not tech-savvy. It is possible that they don't have home computers or internet access. This group may also have good financial means.
-Recommendations: (1) C-Bank may consider setting up computer stations at the Bank Agency specifically for retired seniors. A bank staff member could provide a short course to help them perform the usual basic bank transactions. (2) C-Bank may also consider providing premium bank services and upselling other products in a more personalized way.

Concerning the 3 unsupervised learning models used, the K-Medoids model can be considered more robust to noise and outliers than K-Means. For this particular data analysis, the Gaussian Mixture Model wasn't able to separate the 3 groups effectively: it placed 582 of the 644 customers in a single cluster.
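
One simple way to compare how two models partition the same customers is to cross-tabulate their labels. Here is a sketch on synthetic data (`make_blobs`, an assumption for illustration; the column names mirror the "Labels" and "GmmLabels" columns used earlier).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data: 300 rows, 5 features, 3 groups
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=1)

labels = pd.DataFrame({
    "Labels": KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X),
    "GmmLabels": GaussianMixture(n_components=3, random_state=1).fit_predict(X),
})

# Rows: K-Means clusters; columns: GMM components; cells: shared observations
agreement = pd.crosstab(labels["Labels"], labels["GmmLabels"])
print(agreement)
```

When two models agree, each row has one dominant cell. The label numbers themselves are arbitrary, so cluster 0 in one model need not correspond to cluster 0 in another.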
