How to perform vintage car segmentation to create a target audience?

Sector: AUTOMOBILE. A leading used car dealer in São Paulo is shifting to selling vintage cars. The owner has collected car data. He asks a Data Scientist to extract insights, identify the right groups of vintage cars, and target the right audience.

Stephany Gochuico

1/26/2024 · 5 min read

Problem to solve

Cool Used Cars, a leading used car dealer in São Paulo, Brazil, is shifting its business from modern cars to vintage cars. The business owner has collected car data and asks a Data Scientist to extract insights from it, identify relevant segments of vintage cars that would meet customer needs, and make cost-effective, targeted marketing campaigns possible.


The owner of Cool Used Cars wants to leverage the collected data to extract meaningful insights, and find different segments of vintage cars to target different groups of customers more effectively.


The dataset contains 8 variables:

  • mpg: miles per gallon

  • cyl: number of cylinders

  • disp: engine displacement (cu. inches) or engine size

  • hp: horsepower, a measure of the engine's power output (how fast the engine can do work)

  • wt: vehicle weight (lbs.)

  • acc: time taken to accelerate from 0 to 60 mph (sec.)

  • yr: model year

  • car name: car model name

Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

Loading the dataset

data_orig = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Cool Used Cars/cool-used-cars.csv')
data = data_orig.copy()

Feature engineering

After checking all values in the dataset, I found that the "horsepower" column has 6 values that are not numeric. First, I replace them with NaN. Second, I replace the NaN values with the median of the "horsepower" column and convert the column to the "float" data type so that all values share the same type.
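The horsepower cleanup described in this section can be sketched as follows. This is a minimal sketch rather than the exact code used in the project: the tiny dataframe is a hypothetical stand-in for the real CSV, and the "?" placeholder is an assumption, since the non-numeric entries are not spelled out here.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: "horsepower" was read as text
# because some entries are not numbers (assumed here to be "?").
data = pd.DataFrame({
    "horsepower": ["130", "165", "?", "150", "?", "95"],
    "weight": [3504, 3693, 3436, 3433, 2375, 2372],
})

# Step 1: anything that cannot be parsed as a number becomes NaN.
data["horsepower"] = pd.to_numeric(data["horsepower"], errors="coerce")
n_missing = int(data["horsepower"].isna().sum())

# Step 2: fill the NaN values with the column median, then cast to float.
hp_median = data["horsepower"].median()
data["horsepower"] = data["horsepower"].fillna(hp_median).astype(float)
```

Using `pd.to_numeric(..., errors="coerce")` handles any kind of non-numeric entry in one pass, so the sketch does not depend on knowing the exact placeholder in advance.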

Checking the distribution and outliers for each column in the data

Boxplots and histograms for the "horsepower" and "acceleration" columns are shown below. The green triangle and line mark the mean, while the black line marks the median. Among all variables, only these 2 have outliers.

Preparing data for machine learning

scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns = data.columns)
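One thing to watch at this step: `StandardScaler` only accepts numeric input, so a text column such as "car name" has to be left out before scaling. The sketch below shows one way to do that and to check the result. The mini-dataset is made up for illustration, and whether the original notebook dropped the column this way is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-dataset with one text column, for illustration only.
data = pd.DataFrame({
    "mpg": [18.0, 15.0, 26.0, 32.0],
    "weight": [3504.0, 3693.0, 2372.0, 1985.0],
    "car name": ["car a", "car b", "car c", "car d"],
})

# Keep only the numeric columns before scaling.
numeric = data.select_dtypes(include="number")

scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(numeric), columns=numeric.columns)

# After scaling, each column has mean ~0 and population standard deviation ~1.
col_means = data_scaled.mean()
col_stds = data_scaled.std(ddof=0)
```

Scaling matters here because PCA and t-SNE are both distance-based: without it, a column measured in thousands (weight) would dominate one measured in tens (mpg).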

Applying the PCA algorithm

The first step is to fit PCA on the scaled data and transform it. The next step is to visualize the variance explained by each individual component. As shown below, 3 principal components are enough to explain at least 90% of the variance.

n = data_scaled.shape[1]
pca = PCA(n_components = n, random_state = 1)
data_pca1 = pd.DataFrame(pca.fit_transform(data_scaled))
exp_var = pca.explained_variance_ratio_
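The number of components needed to reach the 90% threshold can be read off the cumulative sum of the explained variance ratios. Here is a minimal sketch on synthetic data (not the car dataset). The names `cum_var` and `n_components_90` are introduced for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, partly correlated data for illustration only.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=X_scaled.shape[1], random_state=1)
pca.fit(X_scaled)

# Running total of the variance explained by PC1, PC1 + PC2, and so on.
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Position of the first running total >= 0.90, plus 1 to turn it into a count.
n_components_90 = int(np.argmax(cum_var >= 0.90)) + 1
```

With the variables from this article, the same two lines would be applied to `exp_var` instead of the synthetic `pca.explained_variance_ratio_`.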

Let's make a new dataframe with the loadings (coefficients) of the first 3 principal components.

pc_comps = ['PC1', 'PC2', 'PC3']
data_pca = pd.DataFrame(np.round(pca.components_[:3,:], 2), index = pc_comps, columns = data_scaled.columns)

Let's explore visualizing the data in 2 dimensions using the first 2 principal components. To illustrate, "horsepower" and "acceleration" variables are shown below.
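A 2-dimensional view of this kind plots each car's scores on PC1 and PC2 and colors the points by one original variable. The sketch below uses synthetic stand-in data, and the names `demo` and `scores` are assumptions for illustration. In this article the scores would be the first two columns of `data_pca1`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration only.
rng = np.random.default_rng(1)
hp = rng.uniform(50, 220, 150)
demo = pd.DataFrame({
    "horsepower": hp,
    "weight": 1500 + 15 * hp + rng.normal(0, 200, 150),
    "acceleration": 22 - 0.05 * hp + rng.normal(0, 1, 150),
})

# Scale, then keep the scores on the first 2 principal components.
scaled = StandardScaler().fit_transform(demo)
scores = pd.DataFrame(
    PCA(n_components=2, random_state=1).fit_transform(scaled),
    columns=["PC1", "PC2"],
)

# One panel per variable, so the two color scales do not overlap.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, col in zip(axes, ["horsepower", "acceleration"]):
    sns.scatterplot(x=scores["PC1"], y=scores["PC2"], hue=demo[col], ax=ax)
    ax.set_title(f"PC1 vs PC2, colored by {col}")
```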

Applying the t-SNE algorithm

t-SNE is another unsupervised machine learning algorithm for dimensionality reduction. Let's fit and transform t-SNE on the scaled data. Then, let's visualize the data in 2 dimensions using the 2 t-SNE components. To illustrate, "horsepower" and "acceleration" variables are shown below.

tsne = TSNE(n_components = 2, random_state = 1)
data_tsne = tsne.fit_transform(data_scaled)
data_tsne = pd.DataFrame(data = data_tsne, columns = ['Component 1', 'Component 2'])
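# Draw each scatterplot in its own figure so the two plots do not overlap.
plt.figure()
sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1], hue = data.horsepower)
plt.figure()
sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1], hue = data.acceleration)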

It looks like t-SNE separates the data into clusters better than PCA, so we'll proceed with t-SNE. When the data points are split into 3 groups, the clusters are clearly visible and distinct.

The following t-SNE graphs show 3 groups based on features:

Conclusion & Recommendations

The multivariate correlation analysis reveals the main characteristics of vintage cars from the dataset:

  1. MPG is negatively correlated with CYLINDERS, DISPLACEMENT, HORSEPOWER, and WEIGHT. The more cylinders and the bigger the displacement, horsepower, and weight a car has, the more fuel it consumes, so MPG decreases accordingly.

  2. When CYLINDERS, DISPLACEMENT, and HORSEPOWER increase, so does the car's weight.

  3. The bigger the HORSEPOWER is, the shorter the ACCELERATION time (0 to 60 mph) is.

  4. As MODEL YEARS increase, car features also increase thanks to innovation. Cars consume "more" fuel to activate all car features.
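Relationships like the ones listed above are typically read from a correlation matrix. Here is a minimal sketch of how such a matrix and its heatmap could be produced. The data is synthetic and its correlation signs are built in by construction, so the output says nothing about the dealer's dataset. On the real data, `data.corr(numeric_only=True)` would play the role of `demo.corr()`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data for illustration only: heavier cars are given more
# horsepower and lower mpg on purpose.
rng = np.random.default_rng(1)
wt = rng.uniform(1600, 5000, 200)
demo = pd.DataFrame({
    "weight": wt,
    "horsepower": 0.04 * wt + rng.normal(0, 10, 200),
    "mpg": 48 - 0.008 * wt + rng.normal(0, 2, 200),
})

# Pearson correlation between every pair of columns.
corr = demo.corr()

# Heatmap with a diverging palette fixed to the range [-1, 1].
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

Fixing `vmin` and `vmax` keeps the color scale symmetric, so positive and negative correlations of the same strength get equally intense colors.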

To determine different groups of vintage cars, two dimensionality reduction techniques are applied: PCA and t-SNE. Both techniques generated similar results, and 3 groups are identified from the dataset:

  • Car group 1: Vintage cars are grouped according to their physical and technical features: cylinders, displacement, horsepower, and weight.

  • Car group 2: Vintage cars are grouped based on model years. Physical and technical features, style, shape, and color indeed differ from one year range to another.

  • Car group 3: Vintage cars are grouped based on acceleration features. t-SNE technique added the MPG feature to this group.

Given the 3 groups, Cool Used Cars teams can develop 3 types of marketing communication, personalized campaigns, and events that will satisfy 4 identified target audiences:

  • Target Audience 1 --> Car groups 1 & 2: Vintage car collectors know by heart the physical and technical features, as well as the car's model year.

  • Target Audience 2 --> Car group 2: Professionals and business owners who would like to follow in the footsteps of car collectors give high importance to the car aesthetics of a particular year. Here, personal image plays a more important role than the technical capacity of the car.

  • Target Audience 3 --> Car group 2: Families with a certain social standing would be highly interested in driving vintage cars for family outings.

  • Target Audience 4 --> Car group 3: Vintage car racers consider the acceleration feature critical to winning the race!
