# How to perform vintage car segmentation to create target audiences?

Sector: AUTOMOBILE. A leading used car dealer in São Paulo is shifting to selling vintage cars. The owner has collected car data. He asks a Data Scientist to extract insights, identify the right groups of vintage cars, and target the right audience.

Stephany Gochuico

1/26/2024 · 5 min read

###### Problem to solve

A leading used car dealer, **Cool Used Cars**, in São Paulo, Brazil, is shifting its business to selling vintage cars instead of modern ones. The business owner has collected car data. He asks a Data Scientist to extract insights from the data, identify segments of vintage cars that meet customer needs, and enable cost-effective, targeted marketing campaigns.

###### Objective

The owner of **Cool Used Cars** wants to leverage the collected data to extract meaningful insights and find distinct segments of vintage cars, so that different groups of customers can be targeted more effectively.

###### Dataset

The dataset contains 8 variables:

- mpg: miles per gallon
- cyl: number of cylinders
- disp: engine displacement (cu. inches), i.e. engine size
- hp: horsepower, a measure of the engine's power output
- wt: vehicle weight (lbs.)
- acc: time taken to accelerate from 0 to 60 mph (sec.)
- yr: model year
- car name: car model name

###### Importing libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

from sklearn.manifold import TSNE

###### Loading the dataset

data_orig = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Cool Used Cars/cool-used-cars.csv')

data = data_orig.copy()

###### Feature engineering

After checking all values in the dataset, I found that the "horsepower" column has 6 values that are __not__ numeric. First, I replace them with NaN. Second, I replace the NaN values with the median of the "horsepower" column, and then convert the column to the "float" data type so that all values share the same data type.
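The cleaning steps for "horsepower" are described but not shown. A minimal sketch of one way to do it follows; the sample values and the "?" placeholder are hypothetical stand-ins for the real CSV, and `pd.to_numeric` with `errors="coerce"` is an assumption about how the replacement could be done, not necessarily the method used in the original notebook.

```python
import pandas as pd

# Hypothetical sample standing in for the real CSV; "?" marks a non-numeric entry.
data = pd.DataFrame({
    "horsepower": ["130", "165", "?", "150", "?", "95"],
    "weight": [3504, 3693, 2046, 3433, 2875, 2372],
})

# Count the entries that are not made up of digits only.
is_numeric = data["horsepower"].str.isdigit()
print("Non-numeric values:", (~is_numeric).sum())

# Step 1: turn non-numeric entries into NaN (errors="coerce" does this).
data["horsepower"] = pd.to_numeric(data["horsepower"], errors="coerce")

# Step 2: fill NaN with the column median, then cast the column to float.
median_hp = data["horsepower"].median()
data["horsepower"] = data["horsepower"].fillna(median_hp).astype(float)

print(data["horsepower"].tolist())
```

The median is used rather than the mean because it is less sensitive to the extreme horsepower values that show up later as outliers.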

###### Checking the distribution and outliers for each column in the data

Boxplots and histograms are shown below for the "horsepower" and "acceleration" columns. The **green** triangle and line indicate the mean, while the **black** line indicates the median. Among all variables, only these 2 have outliers.
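To make the outlier check concrete, here is a small sketch of the 1.5 × IQR rule that a default boxplot uses to draw its whiskers. The numbers are made up for illustration and `iqr_outliers` is a hypothetical helper, not a function from the original notebook.

```python
import pandas as pd

# Hypothetical values; the real columns come from the dealer's CSV.
data = pd.DataFrame({
    "horsepower": [88, 90, 95, 97, 100, 105, 110, 150, 230],
    "acceleration": [14.5, 15.0, 15.5, 15.5, 16.0, 16.0, 16.5, 17.0, 24.8],
})

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values that fall outside the 1.5 * IQR boxplot whiskers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

for col in ["horsepower", "acceleration"]:
    print(col,
          "| mean:", round(data[col].mean(), 2),
          "| median:", data[col].median(),
          "| outliers:", iqr_outliers(data[col]).tolist())
```

A mean that sits noticeably above the median, as in these made-up columns, is a quick hint that a few large values are pulling the distribution to the right.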

###### Preparing data for machine learning

scaler = StandardScaler()

data_scaled = pd.DataFrame(scaler.fit_transform(data), columns = data.columns)
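`StandardScaler` only works on numeric input, so a text column such as the car name has to be set aside before scaling. The article does not show that step; the sketch below assumes it is done with `select_dtypes`, and the rows and column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows; column names are assumed from the dataset description.
data = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 31.0],
    "weight": [3504, 3693, 2372, 1985],
    "car name": ["car a", "car b", "car c", "car d"],
})

# Keep only numeric columns: the scaler cannot handle text such as the car name.
numeric = data.select_dtypes(include="number")

scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(numeric), columns=numeric.columns)

# After scaling, each column has mean ~0 and population standard deviation ~1.
print(data_scaled.mean().round(6))
print(data_scaled.std(ddof=0).round(6))
```

Scaling matters here because PCA and t-SNE are distance-based: without it, a feature measured in thousands (weight) would dominate one measured in tens (mpg).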

###### Applying the PCA algorithm

The first step is to fit and transform the scaled data with PCA, then visualize the variance explained by each component. The **number of principal components needed to explain at least 90% of the variance is 3**, as shown below.

n = data_scaled.shape[1]

pca = PCA(n_components = n, random_state = 1)

data_pca1 = pd.DataFrame(pca.fit_transform(data_scaled))

exp_var = pca.explained_variance_ratio_
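The 90% threshold can be read off the cumulative sum of the explained variance ratios. The following sketch runs on synthetic data (7 features generated from 3 latent factors), so the count it prints is not the article's result; `n_components_90` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled car data (7 numeric features).
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 7))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 7))
data_scaled = pd.DataFrame(StandardScaler().fit_transform(X))

pca = PCA(n_components=data_scaled.shape[1], random_state=1)
pca.fit(data_scaled)

# Cumulative share of variance explained by the first k components.
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 90%.
n_components_90 = int(np.argmax(cum_var >= 0.90)) + 1

print("Cumulative explained variance:", cum_var.round(3))
print("Components needed for 90%:", n_components_90)
```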

Let's make a new dataframe with the coefficients (loadings) of the first 3 principal components.

pc_comps = ['PC1', 'PC2', 'PC3']

data_pca = pd.DataFrame(np.round(pca.components_[:3,:], 2), index = pc_comps, columns = data_scaled.columns)

data_pca.T

Let's explore visualizing the data in 2 dimensions using the first 2 principal components. To illustrate, "horsepower" and "acceleration" variables are shown below.
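The plotting code for the PCA view is not included in the article. One possible version is sketched below with plain matplotlib on synthetic data; the feature relationships are invented for the example and `scores` is a hypothetical variable name.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; real values would come from the cleaned dataset.
rng = np.random.default_rng(1)
horsepower = rng.uniform(50, 230, size=150)
data = pd.DataFrame({
    "horsepower": horsepower,
    "weight": 1500 + 15 * horsepower + rng.normal(0, 200, size=150),
    "acceleration": 22 - 0.05 * horsepower + rng.normal(0, 1, size=150),
})
data_scaled = StandardScaler().fit_transform(data)

# Project every car onto the first 2 principal components.
scores = pd.DataFrame(
    PCA(n_components=2, random_state=1).fit_transform(data_scaled),
    columns=["PC1", "PC2"],
)

# Color each point by horsepower to see how the feature varies across the plane.
fig, ax = plt.subplots()
points = ax.scatter(scores["PC1"], scores["PC2"], c=data["horsepower"], cmap="viridis")
fig.colorbar(points, ax=ax, label="horsepower")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
# In a notebook the figure renders automatically; in a script, call plt.show().
```

Swapping `data["horsepower"]` for `data["acceleration"]` in the `c=` argument gives the second view.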

###### Applying the t-SNE algorithm

t-SNE is another unsupervised machine learning algorithm for dimensionality reduction. Let's fit and transform the scaled data with t-SNE, then visualize the data in 2 dimensions using the 2 t-SNE components. To illustrate, the "horsepower" and "acceleration" variables are shown below.

tsne = TSNE(n_components = 2, random_state = 1)

data_tsne = tsne.fit_transform(data_scaled)

data_tsne = pd.DataFrame(data = data_tsne, columns = ['Component 1', 'Component 2'])

sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1], hue = data.horsepower)

plt.show()

sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1], hue = data.acceleration)

plt.show()

It looks like t-SNE separates the data into clusters better than PCA, so we'll proceed with t-SNE. When the data points are split into 3 groups, the groups appear clearly and distinctly in the plot.
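The article does not state how each point was assigned to one of the 3 groups. As one possible approach, and purely as an assumption, the sketch below labels the t-SNE embedding with k-means (k = 3). The data is synthetic and the `group` column is a name introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three loose blobs in a 4-feature space.
rng = np.random.default_rng(1)
centers = np.array([[0, 0, 0, 0], [6, 6, 0, 0], [0, 6, 6, 6]])
X = np.vstack([c + rng.normal(size=(40, 4)) for c in centers])
data_scaled = StandardScaler().fit_transform(X)

# 2-D t-SNE embedding, with the same parameters as in the article.
tsne = TSNE(n_components=2, random_state=1)
data_tsne = pd.DataFrame(
    tsne.fit_transform(data_scaled),
    columns=["Component 1", "Component 2"],
)

# Assumption: label the groups with k-means (k = 3) on the embedding.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1)
data_tsne["group"] = kmeans.fit_predict(data_tsne[["Component 1", "Component 2"]])

print(data_tsne["group"].value_counts().sort_index())
```

Passing `data_tsne["group"]` as the `hue` of a scatterplot then colors the embedding by group. Note that distances in a t-SNE plot are not faithful to the original space, so groups found this way are best checked against the original features.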

The following t-SNE graphs show the 3 groups based on features:

#### Conclusion & Recommendations

**The multivariate correlation analysis reveals the main characteristics of vintage cars from the dataset:**

MPG is negatively correlated with CYLINDERS, DISPLACEMENT, HORSEPOWER, and WEIGHT. The more cylinders and the bigger the displacement, horsepower, and weight a car has, the more fuel it consumes, so MPG (miles driven per gallon) decreases accordingly.

When CYLINDERS, DISPLACEMENT, and HORSEPOWER increase, so does the car's weight.

The bigger the HORSEPOWER is, the shorter the ACCELERATION time (0 to 60 mph) is.

As MODEL YEARS increase, car features also increase thanks to innovation. Cars consume "more" fuel to activate all car features.
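Relationships like the ones listed above can be checked with a pairwise correlation matrix. The sketch below uses synthetic data in which heavier cars are built to have more horsepower and lower mpg, so the signs it prints follow from that construction and are not results from the dealer's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in; only the column names mirror the dataset description.
rng = np.random.default_rng(1)
weight = rng.uniform(1600, 5000, size=200)
data = pd.DataFrame({
    "weight": weight,
    "horsepower": 0.04 * weight + rng.normal(0, 10, size=200),
    "mpg": 45 - 0.007 * weight + rng.normal(0, 2, size=200),
})

# Pearson correlation between every pair of columns.
corr = data.corr()
print(corr.round(2))

# Correlation of each feature with mpg, strongest first (by absolute value).
mpg_corr = corr["mpg"].drop("mpg")
order = mpg_corr.abs().sort_values(ascending=False).index
print(mpg_corr.reindex(order).round(2))
```

A negative coefficient between a feature and mpg means that as the feature grows, the car covers fewer miles per gallon, that is, it consumes more fuel.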

To determine different groups of vintage cars, two dimensionality reduction techniques are applied: PCA and t-SNE. Both techniques generated similar results, and 3 groups are identified from the dataset:

- **Car group 1**: Vintage cars are grouped according to their physical and technical features: cylinders, displacement, horsepower, and weight.
- **Car group 2**: Vintage cars are grouped based on model years. Physical and technical features, style, shape, and color indeed differ from one year range to another.
- **Car group 3**: Vintage cars are grouped based on acceleration features. The t-SNE technique added the MPG feature to this group.

Given the 3 groups, **Cool Used Cars** teams can develop 3 types of marketing communication, personalized campaigns, and events that will satisfy 4 identified target audiences:

- **Target Audience 1** (--> Car groups 1 & 2): **Vintage car collectors** know by heart the physical and technical features, as well as the car's model year.
- **Target Audience 2** (--> Car group 2): **Professionals and business owners** who would like to follow in the footsteps of car collectors give high importance to the car aesthetics of a particular year. Personal image plays a more important role here than the technical capacity of the car.
- **Target Audience 3** (--> Car group 2): **Families** with a certain social standing would be highly interested in driving vintage cars for family outings.
- **Target Audience 4** (--> Car group 3): **Vintage car racers** consider the acceleration feature critical to winning the race!

## Need help in product segmentation?

###### Contact

Stephany Gochuico

MIT-Certified Business Data Analyst

__stephany@data-augmented.com__

###### Data Augmented

Data Science consulting to transform your complex data into actionable insights, make effective decision-making, and achieve better success in business.

Based in Paris, France. Copyright © 2024.