
Predicting Customer Churn for a Bank

Acknowledgment: This post uses data obtained from this Kaggle competition, Predicting churn for Bank Customers.

Introduction

No business wants its valuable customers to churn. Retaining existing customers is often the most efficient and cost-effective way to bring in revenue. In this tutorial, we will show you how to build a machine learning model to help predict customer churn for a bank.

Set up a project

To begin using the platform, you first need to set up an Organization and a Project.

Add Data to the platform

Once we have set up a Project to organize the work, we will supply the platform with data on which to build the machine learning models. In this case, we will upload a CSV containing the data. The platform supports many other data ingestion methods, including from traditional SQL databases like PostgreSQL and from NoSQL databases like MongoDB.
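If you would like to inspect the same data locally before uploading it, a minimal Python sketch might look like the following. It assumes the CSV is the Churn_Modelling.csv file from the Kaggle dataset; adjust the path to wherever your copy lives.

```python
import pandas as pd

# Load the Kaggle churn dataset from a local CSV.
# "Churn_Modelling.csv" is the file name used by the Kaggle dataset;
# change the path if your copy is stored elsewhere.
df = pd.read_csv("Churn_Modelling.csv")

print(df.shape)   # roughly 10,000 rows and 14 columns
print(df.head())  # quick visual check of the first few records
```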

Data Wrangling

Once the data has been uploaded, you will see the data wrangling user interface.

The left panel shows which iteration you are at, and the right panel shows insights on the dataset. The first insight you will encounter is schema inference.
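Schema inference suggests a suitable type for each column. As a rough local analogue (not the Engine's implementation), you can check the types pandas infers and mark string columns such as Geography and Gender as categorical before comparing against the Engine's suggestion:

```python
import pandas as pd

df = pd.read_csv("Churn_Modelling.csv")

# Report the type pandas inferred for each column so it can be
# sanity-checked against the schema the Engine proposes.
print(df.dtypes)

# Text columns such as Geography and Gender come in as generic "object"
# strings; treating them explicitly as categorical is the kind of fix a
# schema review step would typically suggest.
for col in ["Geography", "Gender"]:
    df[col] = df[col].astype("category")
```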

Click on the solution and then "Select Solution".

You can now press "Review" and then "Commit Actions" before proceeding to the left panel, which will show the progress bar for iteration 1.

Click "Continue" once the progress bar has finished animating to begin the next iteration; you will then be reminded to set the target column.

The target column is Exited, which you can set from the target column drop-down.

You may now proceed to the second iteration and continue to accept recommendations following the same procedure as with the schema confirmation. After a couple of iterations, the Engine may have no more recommendations for the dataset.

Press "Finalize" to complete the data wrangling step.

Note: The finalization process also creates a recipe consisting of the actions committed so far. This recipe is a reproducible data processing pipeline that can be applied to any other dataset with a compliant schema.
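Outside the platform, a comparable "recipe" could be expressed as a scikit-learn preprocessing pipeline. The sketch below is illustrative only, not the Engine's actual recipe format, and the column lists assume the schema of the Kaggle churn dataset:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A minimal sketch of a reusable "recipe": a preprocessing pipeline fitted
# once, then replayed unchanged on any dataset with a compliant schema.
numeric_cols = ["CreditScore", "Age", "Tenure", "Balance",
                "NumOfProducts", "EstimatedSalary"]
categorical_cols = ["Geography", "Gender"]

recipe = ColumnTransformer([
    ("numeric", StandardScaler(), numeric_cols),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# recipe.fit_transform(train_df) learns the transformations once;
# recipe.transform(new_df) re-applies the same committed actions to new data.
```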

Data Visualization

Once the data wrangling process is complete, every column of data gets categorized as numeric, categorical, or other. The data is then visualized with the appropriate chart type for each category.
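As a rough local equivalent of these per-type visualizations, you could plot a histogram for a numeric column and a bar chart for a categorical one. The column names below assume the Kaggle churn dataset:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("Churn_Modelling.csv")

# Numeric column: histogram of the distribution.
df["Age"].plot(kind="hist", bins=30, title="Age")
plt.show()

# Categorical column: bar chart of value counts.
df["Geography"].value_counts().plot(kind="bar", title="Geography")
plt.show()
```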

Setting up Apps

The concept of an App is central to the AI & Analytics Engine platform. Each app is tied to a dataset, and it can contain multiple machine learning models built on that dataset. One project can contain multiple apps. In this case study, we will create an app for predicting the propensity of churn.

Each App takes care of details like partitioning the data into development and test samples so that we can scientifically assess the model's performance.
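To see what such a partition involves, here is a minimal sketch done by hand with scikit-learn. The exact split ratio and stratification the Engine uses are not stated here; the 80/20 stratified split, the one-hot encoding, and the dropped identifier columns are only illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Churn_Modelling.csv")

# The target column chosen during data wrangling.
y = df["Exited"]

# Drop identifier columns that carry no predictive signal, then one-hot
# encode the categorical features so tree models can consume them directly.
X = df.drop(columns=["Exited", "RowNumber", "CustomerId", "Surname"])
X = pd.get_dummies(X, columns=["Geography", "Gender"])

# An illustrative 80/20 development/test split, stratified on the target
# so both partitions keep the same churn rate.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```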

Creating new models

You can now proceed to the Model Recommender. The Model Recommender will rank our expertly-crafted model templates by their Predictive Performance, Training Time, and Prediction Time metrics! For this use-case, we select the top two models according to predictive performance, which are XGBoost and LightGBM.

We will select our models based on predicted accuracy. Here are the models we have selected:

| Rank | Model    | AI & Analytics Engine Predicted Accuracy | Actual Accuracy | Actual ROC AUC | Actual Gini |
|------|----------|------------------------------------------|-----------------|----------------|-------------|
| 1    | XGBoost  | 0.74                                     | 0.75            | 0.856          | 0.712       |
| 2    | LightGBM | 0.73                                     | 0.74             | 0.857          | 0.714       |
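To get a feel for what the Engine is doing behind the scenes, here is a hedged sketch of training the same two model types locally with default hyperparameters (not the Engine's tuned templates), reusing the X_dev/X_test split from the earlier sketch:

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from xgboost import XGBClassifier

# Train both recommended model types on the development partition and
# score them on the held-out test partition. This assumes the features
# were already numerically encoded (see the split sketch above).
models = {
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "LightGBM": LGBMClassifier(),
}

for name, model in models.items():
    model.fit(X_dev, y_dev)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name,
          "accuracy:", round(accuracy_score(y_test, pred), 3),
          "ROC AUC:", round(roc_auc_score(y_test, proba), 3))
```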

Feature Importance

For each app, you can take a look at the Feature Importance chart and see which features are the most useful at predicting churn. For this app and dataset, we see that Age and NumOfProducts are the most important features for predicting churn.
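A rough local analogue is to read the importances off the fitted XGBoost model from the training sketch above; this is not the Engine's feature importance computation, just a quick approximation:

```python
import pandas as pd

# Importance scores from the fitted XGBoost model, one per input column.
importances = pd.Series(models["XGBoost"].feature_importances_,
                        index=X_dev.columns)

# Show the ten most influential features.
print(importances.sort_values(ascending=False).head(10))
```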

Automated creation of Key Diagnostics

The diagnostic statistics below are computed automatically, taking care of one of the most tedious, albeit important, tasks for you.
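For reference, the headline statistics in the table above can be reproduced from a model's test-set predictions; note that the Gini coefficient is simply 2 × AUC − 1 (for example, 2 × 0.856 − 1 = 0.712). A sketch using the fitted XGBoost model and test split from the earlier sketches:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Predicted churn probabilities and hard 0/1 predictions at a 0.5 threshold.
proba = models["XGBoost"].predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

auc = roc_auc_score(y_test, proba)
print("Accuracy:", accuracy_score(y_test, pred))
print("ROC AUC:", auc)
print("Gini:", 2 * auc - 1)  # Gini coefficient is 2 * AUC - 1
print("Confusion matrix:\n", confusion_matrix(y_test, pred))
```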

Diagnostics

Let’s compare the model built by the AI & Analytics Engine with the models built by the top Kaggle kernels.

Top Kaggle models

As predicted by the AI & Analytics Engine, the XGBoost classifier is the best-performing model, with an AUC of 0.856. This compares very favorably with the results from the top-voted Kaggle kernels above. The Engine also reports many more statistics and measures for you.

If you want to look at a model's performance in-depth, you can go to the model page and get detailed metrics and diagnostics for each model. Below is a list of metrics and visualizations we automatically generate for you.

The page also creates a number of key visualizations for you, including the ROC and PR curves. As an example, here is the visualization of the confusion matrix.
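These plots can also be reproduced locally with scikit-learn's display helpers, again reusing the fitted model and test split from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, PrecisionRecallDisplay,
                             RocCurveDisplay)

# ROC curve, precision-recall curve, and confusion matrix for the
# fitted XGBoost model on the held-out test partition.
RocCurveDisplay.from_estimator(models["XGBoost"], X_test, y_test)
PrecisionRecallDisplay.from_estimator(models["XGBoost"], X_test, y_test)
ConfusionMatrixDisplay.from_estimator(models["XGBoost"], X_test, y_test)
plt.show()
```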

Deploying the model

With a few clicks, the chosen model can be operationalized and deployed, ready to do the grunt work of prediction. Deployment can be performed on the cloud or to an on-premise server of your choosing.

Once deployed, the model can be accessed via a URL endpoint, and code examples for calling the endpoint are provided in Python, R, bash, Java, and NodeJS.
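The exact request format depends on your deployment, but a call from Python typically looks something like the sketch below. The URL, authentication header, and payload layout are placeholders, not the Engine's actual API; the field names simply mirror the Kaggle dataset's columns:

```python
import requests

# Placeholder endpoint and key; substitute the values shown on your
# deployment page.
ENDPOINT_URL = "https://<your-deployment-host>/predict"
API_KEY = "<your-api-key>"

# One customer record with the same feature columns the model was trained on.
payload = {
    "data": [
        {"CreditScore": 650, "Geography": "France", "Gender": "Female",
         "Age": 42, "Tenure": 5, "Balance": 125000.0, "NumOfProducts": 1,
         "HasCrCard": 1, "IsActiveMember": 0, "EstimatedSalary": 60000.0}
    ]
}

response = requests.post(ENDPOINT_URL,
                         json=payload,
                         headers={"Authorization": f"Bearer {API_KEY}"})
print(response.json())
```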

You can also test the endpoint using the API Test service directly in the browser.