Customer Segmentation with RFM Analysis — Part 2

Continuing our analysis with the fun part

Previously, we created RFM clusters and assigned each customer to one of them. Now we will use a new metric called Lifetime Value (LTV) to validate our model and classify customers into LTV segments.

To calculate LTV, we first define a time frame, let's say 4 months. This depends on the business in question: for an online retail business, 4 months might be too long, whereas for a manufacturing business it might be too short. LTV for a customer usually follows the formula:

Lifetime Value = Total Gross Revenue − Total Cost

In our case we consider only Total Gross Revenue to keep things simple. Applying this formula to a customer gives us their historical lifetime value. Our end objective is to predict the LTV segment of a customer once we have enough data to calculate their RFM scores.
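As a sketch, the historical LTV calculation can look like the following. The column names, transaction values and the 4-month window dates are illustrative assumptions, not the article's actual dataset:

```python
import pandas as pd

# Hypothetical transaction data; real data would come from the business's records.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-06-05", "2011-07-12", "2011-06-20", "2011-09-01", "2011-10-03"]),
    "Revenue": [50.0, 30.0, 120.0, 80.0, 25.0],
})

# Keep only the 4-month LTV window (assumed here to be June-September 2011).
window = tx[(tx["InvoiceDate"] >= "2011-06-01") & (tx["InvoiceDate"] < "2011-10-01")]

# Historical LTV per customer = total gross revenue inside the window.
ltv = window.groupby("CustomerID")["Revenue"].sum().rename("LTV").reset_index()
print(ltv)
```

Customer 3 transacts only outside the window, so they get no LTV row; a real pipeline would decide whether such customers count as zero-LTV.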

Model Validation

To ensure robustness, we will create two time periods and divide our data between them. We calculate our two metrics from separate time periods to make sure the model works in a real-life scenario, where past behaviour must predict future value.

Here’s the plan for validation:

  • We calculate RFM scores of customers from time period 1 or T1
  • We calculate LTV scores of the same customers from time period 2 or T2
  • We plot them together to see if RFM segments are useful in predicting LTV
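The two-period split described above can be sketched like this (the cutoff date and table layout are illustrative assumptions):

```python
import pandas as pd

# Hypothetical transactions spanning both periods.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-03-01", "2011-07-15", "2011-04-10", "2011-08-20"]),
    "Revenue": [40.0, 60.0, 90.0, 15.0],
})

cutoff = pd.Timestamp("2011-06-01")  # assumed boundary between T1 and T2

t1 = tx[tx["InvoiceDate"] < cutoff]   # T1: used to calculate RFM scores
t2 = tx[tx["InvoiceDate"] >= cutoff]  # T2: used to calculate LTV

print(len(t1), len(t2))
```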

Let's start by calculating RFM segments for T1 and see how they are distributed across the two businesses.

RFM Segments from Time Period 1

Now we will look at LTV for T2. Let’s visualize the Total Revenue from the two businesses.


We are ready to perform a visual model validation. In the scatter plots below, RFM scores are calculated from T1 and LTV from T2.

We see a clear positive correlation between RFM and LTV. This is good news, because it shows that RFM scores are useful in predicting LTV. We also see significant variation between our three RFM segments.

The model seems to be on the right track. We will perform some further checks later in this article. For now, let's apply the model to predict the LTV segment of a customer.


At this stage we have data on the customer level, with each row showing us RFM and LTV scores for one customer.

We want to create clusters to segment our LTV scores just as we clustered the RFM scores. We perform k-means clustering to create LTV segments, which gives us an additional column.
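A minimal sketch of this clustering step with scikit-learn's `KMeans` (the LTV values are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-customer LTV values.
ltv = np.array([20.0, 35.0, 40.0, 300.0, 320.0, 1500.0]).reshape(-1, 1)

# Three clusters: Low, Medium and High LTV, mirroring the RFM segmentation.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(ltv)
labels = km.labels_

# Relabel clusters so 0 = lowest mean LTV, 2 = highest (raw labels are arbitrary).
order = np.argsort(km.cluster_centers_.ravel())
segment = np.array([np.where(order == l)[0][0] for l in labels])
print(segment)
```

The relabelling step matters: k-means assigns cluster IDs arbitrarily, so they must be ordered by cluster mean before being treated as Low/Medium/High segments.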

Let’s look at how the LTV Segments are distributed.

LTV Segments distribution

We define the variable to be predicted, or y, and the features to be used for prediction, or X. Dummy variables need to be created for categorical variables.

  • y — LTV Segments
  • X — RFM Scores, RFM Clusters and RFM Segments
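One common way to build X and y, with one-hot encoding of the categorical segment label via `pandas.get_dummies` (the column names here are illustrative):

```python
import pandas as pd

# Hypothetical customer-level table; "Segment" is the categorical RFM segment label.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Recency": [10, 45, 90],
    "Frequency": [12, 4, 1],
    "Monetary": [500.0, 120.0, 30.0],
    "Segment": ["High-Value", "Mid-Value", "Low-Value"],
    "LTVCluster": [2, 1, 0],  # the y variable
})

# One-hot encode the categorical segment so the classifier can use it.
X = pd.get_dummies(df.drop(columns=["CustomerID", "LTVCluster"]))
y = df["LTVCluster"]
print(list(X.columns))
```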

At this point we split our data into training and testing sets. We randomly sample 30% of the data for testing, and stratify on LTV Segments to preserve class balance in the testing set. We also scale our feature set so that all features are on a comparable range.
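This step can be sketched with scikit-learn's `train_test_split` and `StandardScaler`; synthetic data stands in for the real feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and LTV-segment labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.repeat([0, 1, 2], [70, 20, 10])  # imbalanced segments, like the real data

# 30% test split, stratified on the LTV segment to preserve class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Fit the scaler on training data only, then apply the same transform to test data.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(np.bincount(y_test))
```

Fitting the scaler on the training set only (rather than the full data) avoids leaking test-set statistics into training.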

With our y and X ready, it is time to use a powerful Machine Learning (ML) algorithm called XGBoost to classify the customers in the testing set into LTV Segments. We do this for both company A and B.

That’s a lot of numbers; what does it all mean? Let’s examine the results one by one:

  • Company A: We have an impressive accuracy of 98% on the test set, but keep in mind we have few Medium- and High-Value customers to predict. Looking closely at the confusion matrix, we can see the classifier predicts Low-Value customers with 100% accuracy, but is less confident differentiating between Medium- and High-Value customers.
  • Company B: We observe an accuracy of 82% on the test set. While this drop in accuracy might seem drastic, it is due to two reasons: (1) we have less data (43k rows for Company B vs 226k for Company A); (2) our classifier is unsure between Medium- and High-Value customers, as seen earlier. We see that the classifier correctly identified 95% of the Low-Value customers.

To improve the model in the future we can think of the following:

  • Adding more variables to the feature set
  • Trying alternative ML algorithms
  • Using a different clustering technique
  • Using a larger data source

In the next section we explore some technical considerations and attempt to improve the model.

Cross Validation, GridSearch and Ensemble Voting

Overfitting occurs when a model's results are not replicated in other datasets. It's a good idea to avoid it.

Cross Validation is a good way to avoid overfitting and assess model robustness. The picture below summarizes how it works.

k-fold cross validation

We perform k iterations of training and testing on our data, changing the train and test sets each time. In the end we get an average score of how our model performed over all iterations. In our analysis we choose k=3. The results below show accuracy averages for different ML algorithms with k-fold cross validation.
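With scikit-learn, 3-fold cross validation across several algorithms takes only a few lines (the data here is synthetic, and the two models shown are just examples):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the customer features and LTV segments.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, size=(60, 3)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 60)

# 3-fold cross validation: each model is trained and scored on 3 train/test splits.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=3)
    print(type(model).__name__, scores.mean().round(3))
```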


Looking at the results for Company A, we see that the different ML algorithms perform consistently, with only a small drop from the benchmark accuracy of 98%. Similarly for Company B, our cross validation scores are close to the benchmark accuracy of 82%. Our model has passed this test.

We might be able to improve performance for Company B by fine-tuning our model. To find just the right hyper-parameters, or tuning knobs, we can use GridSearch, which runs many iterations of the model with different tunings and returns the best one. But it doesn't stop there: what if we could use multiple ML algorithms instead of one? With Ensemble Voting, we run different ML algorithms on the same problem and take a vote between them to make predictions.

Our approach to improve performance combines Cross Validation, GridSearch and Ensemble Voting. Here is the summary:

  • Run ML algorithms with Cross Validation and find which ones perform best
  • Fine tune the best models with GridSearch
  • Use the fine tuned models in the Ensemble Voting Classifier

We identified Naive Bayes, XGBoost and Random Forest as our best classifiers. Finally, let's use these three in a voting classifier.


And we have 87.7%! That is better than our benchmark of 82%. We have successfully improved model performance.


In this project we have achieved two goals:

  • Segment customers based on RFM Metrics
  • Predict LTV Segments based on RFM metrics with Machine Learning
Customers transact with a business for some time, and while they do, they show up on this plot

Businesses need to keep a close eye on their customers to drive growth. Among other things, knowing which customers are more profitable helps managers prioritize resources and maximise cross/up-selling opportunities.

Segmentation gives us a mechanism to keep track of our customers with minimal data requirements (remember we started with 4 columns). Knowing which segment a customer belongs to gives the manager a deeper understanding and allows for customized strategies. As demonstrated, our model is robust and can be generalized to other applications.

Please feel free to send me any feedback or questions. Thanks for reading this far!

Aman Prasad is a music, deep space and data science enthusiast.