Continuing our analysis with the fun part
Previously, we created RFM clusters and assigned each customer to one of them. Now we will use a new metric, Lifetime Value (LTV), to validate our model and classify customers into LTV segments.
To calculate LTV, we first define a time frame, say 4 months. The right window depends on the business in question: for an online retail business, 4 months might be too long, whereas for a manufacturing business it might be too short. LTV for a customer usually follows the formula:
Lifetime Value = Total Gross Revenue − Total Cost
In our case we consider only Total Gross Revenue to keep things simple. Applying this formula to a customer gives us their historical lifetime value. Our end objective is to predict the LTV segment of a customer once we have enough data to calculate their RFM scores.
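The simplified formula above reduces to summing revenue per customer. A minimal sketch with pandas, on a toy transaction log (the column names `CustomerID`, `Quantity` and `UnitPrice` are assumptions, not the article's actual schema):

```python
import pandas as pd

# Toy transaction log; the schema is an assumption for illustration.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "Quantity":   [2, 1, 4, 1, 3, 5],
    "UnitPrice":  [10.0, 25.0, 5.0, 40.0, 10.0, 2.0],
})

tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

# Historical LTV per customer: total gross revenue (costs ignored here)
ltv = tx.groupby("CustomerID")["Revenue"].sum().rename("LTV")
print(ltv.to_dict())
```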
To ensure robustness, we will create two time periods and split our data between them. Calculating the two metrics from separate time periods makes sure the model works in a real-life scenario, where future LTV must be predicted from past behaviour.
Here’s the plan for validation:
- We calculate RFM scores of customers from time period 1 (T1)
- We calculate LTV scores of the same customers from time period 2 (T2)
- We plot them together to see whether RFM segments are useful in predicting LTV
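The time-window split above can be sketched as a simple date filter; the cutoff dates below are hypothetical, not the ones used in the analysis:

```python
import pandas as pd

# Hypothetical 4-month windows; the cutoff dates are assumptions.
tx = pd.DataFrame({
    "CustomerID": [1, 2, 1, 3, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-15", "2011-02-20", "2011-05-10", "2011-06-01", "2011-07-30"]),
    "Revenue": [50.0, 20.0, 30.0, 10.0, 60.0],
})

# T1 is used for RFM scores, T2 for the LTV we want to predict
t1 = tx[(tx.InvoiceDate >= "2011-01-01") & (tx.InvoiceDate < "2011-05-01")]
t2 = tx[(tx.InvoiceDate >= "2011-05-01") & (tx.InvoiceDate < "2011-09-01")]
print(len(t1), len(t2))
```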
Let's start by calculating RFM segments for T1 and seeing how they are distributed across the two businesses.
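A minimal sketch of the RFM computation on the T1 window, assuming the same made-up transaction schema as before (recency in days from the latest invoice in the window, frequency as purchase count, monetary as total revenue):

```python
import pandas as pd

# Synthetic T1 transactions; column names are assumptions.
t1 = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-03-01", "2011-02-10",
         "2011-01-20", "2011-02-15", "2011-04-25"]),
    "Revenue": [20.0, 30.0, 15.0, 10.0, 25.0, 40.0],
})

snapshot = t1["InvoiceDate"].max()  # reference date for recency

rfm = t1.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceDate", "count"),
    Monetary=("Revenue", "sum"),
)
print(rfm)
```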
Now we will look at LTV for T2. Let’s visualize the Total Revenue from the two businesses.
We are ready to perform a visual model validation. In the scatter plots below, RFM scores are calculated from T1 and LTV from T2.
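Beyond eyeballing the scatter plots, the relationship can be checked numerically with a correlation coefficient. A sketch with hypothetical per-customer scores (the `OverallScore` column, a summed RFM score, and the values themselves are assumptions):

```python
import pandas as pd

# Hypothetical per-customer scores: OverallScore from T1 RFM, LTV from T2.
scores = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "OverallScore": [1, 2, 3, 4, 5, 6],  # assumed summed R+F+M score
    "LTV": [40.0, 55.0, 90.0, 150.0, 210.0, 400.0],
})

# Pearson correlation between T1 RFM score and T2 LTV
corr = scores["OverallScore"].corr(scores["LTV"])
print(round(corr, 2))
```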
We see a clear positive correlation between RFM and LTV. This is good news: it shows that RFM scores are useful in predicting LTV. We also see significant variation between our three RFM segments.
The model seems to be on the right track. We will perform further checks later in this article; for now, let's apply the model to predict the LTV segment of a customer.
At this stage we have data on the customer level, with each row showing us RFM and LTV scores for one customer.
We want to cluster the LTV scores into segments just as we clustered the RFM scores. We perform k-means clustering to create LTV segments, which gives us an additional column.
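A sketch of the k-means step, assuming three LTV segments as in the rest of the article; the data and the relabeling helper are illustrative, not the article's actual code:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up per-customer LTV values with three obvious groups
df = pd.DataFrame({"LTV": [10.0, 12.0, 15.0, 200.0, 220.0, 1000.0]})

km = KMeans(n_clusters=3, n_init=10, random_state=42)
df["LTVCluster"] = km.fit_predict(df[["LTV"]])

# Relabel so that a higher cluster number means a higher average LTV
order = df.groupby("LTVCluster")["LTV"].mean().sort_values().index
df["LTVCluster"] = df["LTVCluster"].map({old: new for new, old in enumerate(order)})
print(df)
```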
Let’s look at how the LTV Segments are distributed.
We define the variable to be predicted, y, and the features used for prediction, X. Dummy variables need to be created for the categorical variables.
- y — LTV Segments
- X — RFM Scores, RFM Clusters and RFM Segments
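The dummy-variable step can be done with `pd.get_dummies`; the column names below are assumptions standing in for the article's RFM features:

```python
import pandas as pd

# Assumed feature frame; the RFM segment label is categorical
X = pd.DataFrame({
    "Recency": [5, 40, 200],
    "Frequency": [30, 10, 1],
    "Monetary": [500.0, 120.0, 20.0],
    "Segment": ["High-Value", "Mid-Value", "Low-Value"],
})

# One-hot encode the categorical column into dummy variables
X = pd.get_dummies(X, columns=["Segment"])
print(list(X.columns))
```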
At this point we split our data into training and testing sets. We randomly sample 30% of the data for testing, and stratify on LTV Segments to make sure we preserve class balance in the testing set. We will also scale the features so that no single variable dominates the model because of its units.
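A sketch of that split-and-scale step with scikit-learn, on synthetic data; note the scaler is fit on the training set only, then applied to both sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and a two-class LTV segment label
X = pd.DataFrame({"Recency": range(20), "Monetary": range(0, 200, 10)})
y = pd.Series([0] * 10 + [1] * 10, name="LTVCluster")

# 30% test split, stratified on the LTV segment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Fit the scaler on the training set only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_test.shape, y_test.value_counts().to_dict())
```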
With our y and X ready, it is time to use a powerful Machine Learning (ML) algorithm called XGBoost to classify the customers in the testing set into LTV Segments. We do this for both Company A and Company B.
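A minimal sketch of the classification step. To keep the example self-contained it uses scikit-learn's `GradientBoostingClassifier` and synthetic data; `xgboost.XGBClassifier` exposes the same fit/predict interface and can be dropped in directly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the Low/Medium/High LTV segments
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Gradient boosting stand-in for XGBoost; fit, predict, then inspect
clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```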
That's a lot of numbers, so what does it all mean? Let's examine the results one by one:
- Company A: We have an impressive accuracy of 98% on the test set, but keep in mind we have few Medium- and High-Value customers to predict. Looking closely at the confusion matrix, we can see the classifier predicts Low-Value customers with 100% accuracy, but is less sure when differentiating between Medium- and High-Value customers.
- Company B: We observe an accuracy of 82% on the test set. While this drop might seem drastic, it has two causes: 1. We have less data (43k rows for Company B vs. 226k rows for Company A). 2. As seen earlier, the classifier is unsure between Medium- and High-Value customers. It still correctly identified 95% of the Low-Value customers.
To improve the model in the future we can think of the following:
- Add more variables to the feature set
- Try alternative ML algorithms
- Use a different clustering technique
- Use a larger data source
In the next section we explore some technical considerations and attempt to improve the model.
Cross Validation, GridSearch and Ensemble Voting
Overfitting occurs when a model's results do not replicate on other datasets. It's a good idea to avoid it.
Cross Validation is a good way to avoid overfitting and assess model robustness. The picture below summarizes how it works.
We perform k iterations of training and testing on our data, changing the train and test sets each time. In the end we get an average score of how our model performed over all iterations. In our analysis we choose k=3. The results below show average accuracies for different ML algorithms with k-fold cross validation.
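The iteration-and-average procedure is exactly what scikit-learn's `cross_val_score` does. A sketch with k=3, matching the article, on synthetic data and a stand-in classifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data and model
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = RandomForestClassifier(random_state=0)

# 3-fold cross validation: three train/test rotations, then the mean
scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
print(scores, scores.mean())
```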
Looking at the results for Company A, we see that the different ML algorithms perform consistently, with only a small drop from the benchmark accuracy of 98%. Similarly for Company B, our cross validation scores are close to the benchmark accuracy of 82%. Our model has passed this test.
We might be able to improve performance for Company B by fine-tuning our model. To find just the right hyper-parameters, or tuning knobs, for an ML model we can use GridSearch, which runs many iterations of the model with different settings and returns the best one. But why stop there: what if we could use multiple ML algorithms instead of one? With Ensemble Voting, we run different ML algorithms on the same problem and let them vote on each prediction.
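A sketch of the GridSearch step with scikit-learn's `GridSearchCV`; the parameter grid below is illustrative, not the article's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Try every combination in the (illustrative) grid with 3-fold CV
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```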
Our approach to improve performance combines Cross Validation, GridSearch and Ensemble Voting. Here is the summary:
- Run ML algorithms and find which ones are performing best while using Cross Validation
- Fine tune the best models with GridSearch
- Use the fine tuned models in the Ensemble Voting Classifier
We identified Naive Bayes, XGBoost and Random Forest as our best classifiers. Finally, let's combine these three in a voting classifier.
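A sketch of the voting step with scikit-learn's `VotingClassifier`. To stay self-contained it uses `GradientBoostingClassifier` in place of XGBoost (in the real run `xgboost.XGBClassifier` slots into the same position), and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Soft voting averages predicted class probabilities across the models
vote = VotingClassifier(
    estimators=[("nb", GaussianNB()),
                ("gb", GradientBoostingClassifier(random_state=1)),
                ("rf", RandomForestClassifier(random_state=1))],
    voting="soft",
)
vote.fit(X_train, y_train)
print(accuracy_score(y_test, vote.predict(X_test)))
```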
And we get 87.7%! That is better than our benchmark of 82%. We have successfully improved model performance.
In this project we have achieved two goals:
- Segment customers based on RFM Metrics
- Predict LTV Segments based on RFM metrics with Machine Learning
Businesses need to keep a close eye on their customers to drive growth. Among other things, knowing which customers are more profitable helps managers prioritize resources and maximise cross/up-selling opportunities.
Segmentation gives us a mechanism to keep track of our customers with a minimal data requirement (remember we started with 4 columns). Knowing which segment a customer belongs to gives the manager a deeper understanding and allows for customized strategies. As demonstrated, our model is robust and can be generalized to other applications.
Please feel free to send me any feedback or questions. Thanks for reading this far!