How is our World changing?, a Medium series by Sanjeev Arora




Prevent losing customers

How machine learning can help

This post is about big data from a hypothetical music streaming service called Sparkify. The question to be answered is: how can this data be used to predict which individual customers are likely to churn? This information is highly valuable, since it can be used to trigger measures which in turn may prevent those customers from turning their backs.

This post sheds light on the technical process of arriving at this information, including the question of how cloud computing can help to tackle the 12 GB of data provided for this purpose.

The dataset consists of data on user transactions, such as logging in, playing a song or upgrading the service, which occurred on the streaming service’s website. The data comes in two forms: the full set of transactions as well as a subset which accounts for roughly 1 % of the total records.

I have used and compared two different setups for working with this data:

All data transformation and machine learning was performed by using PySpark.
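As a minimal sketch of this setup, loading the event log into a Spark DataFrame looks roughly like the snippet below. The app name and file name are illustrative, not taken from the project:

```python
# Minimal sketch of the PySpark setup assumed in this post.
# The app name and file name are illustrative placeholders.

def load_events(path: str):
    """Create a local Spark session and load the JSON event log."""
    # Imported inside the function so the sketch stays self-contained
    # even where pyspark is not installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("Sparkify-churn")
             .getOrCreate())
    return spark.read.json(path)

# Usage (requires a Spark installation and the event log file):
# events = load_events("sparkify_event_data.json")
# events.printSchema()
```

For the full 12 GB dataset the same call would simply point at the file's location in cloud storage instead of a local path.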

Cleaning the data involved primarily the following steps:

The following plot shows the features which make up the cleaned dataset. They can be seen on the vertical axis. The horizontal axis depicts the underlying transactions for which the data was collected. The red squares indicate missing data.

As can be seen, the dataset was free of NaNs except for the song attributes, which are only relevant for “NextSong” transactions. These attributes were not used in the ensuing engineering and modeling steps.
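The idea behind this missingness check can be sketched in plain Python. The column names below follow the usual Sparkify event schema and are assumptions for illustration:

```python
# Plain-Python sketch of the missing-value check described above.
# Column names ("page", "song", "artist") are illustrative assumptions.

from collections import Counter

def missing_per_column(records):
    """Count None values per column across a list of transaction dicts."""
    counts = Counter()
    for row in records:
        for col, value in row.items():
            if value is None:
                counts[col] += 1
    return counts

sample = [
    {"page": "NextSong", "song": "Imagine", "artist": "John Lennon"},
    {"page": "Home",     "song": None,      "artist": None},
    {"page": "Login",    "song": None,      "artist": None},
]
# Song attributes are missing exactly on the non-"NextSong" rows.
print(missing_per_column(sample))
```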

I have transformed the cleaned data to come up with a dataset suitable for machine learning algorithms. The following table shows the features created for the subsequent modeling.
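To illustrate the idea behind this transformation, here is a hypothetical per-user aggregation in plain Python. The feature names and page values below are assumptions and need not match the table above:

```python
# Hypothetical per-user aggregation illustrating the kind of features used.
# The actual feature set and page names in the post's table may differ.

from collections import defaultdict

def user_features(events):
    """Aggregate raw transactions into per-user counts."""
    feats = defaultdict(lambda: {"songs": 0, "thumbs_up": 0, "thumbs_down": 0})
    for e in events:
        f = feats[e["userId"]]
        page = e["page"]
        if page == "NextSong":
            f["songs"] += 1
        elif page == "Thumbs Up":
            f["thumbs_up"] += 1
        elif page == "Thumbs Down":
            f["thumbs_down"] += 1
    return dict(feats)

events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Thumbs Up"},
    {"userId": "2", "page": "NextSong"},
]
print(user_features(events))
```

In PySpark the same aggregation is expressed as a `groupBy("userId")` with conditional counts, which scales to the full dataset.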

I have vectorized and normalized the above features before feeding them into three different classification algorithms: Logistic regression as well as a Naive Bayes and a Random Forest classifier. In each case the model’s dependent variable is binary: the customer is either predicted to churn or not to churn. Some hyper-parameters of each algorithm were tuned via grid search. The F1 score was used as the primary evaluation metric.
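For reference, the F1 score used for model comparison is the harmonic mean of precision and recall. Computed from scratch on illustrative counts:

```python
# The F1 score, computed from scratch for clarity.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# precision = 0.8, recall = 2/3  ->  F1 = 8/11 (approx. 0.727)
print(f1_score(tp=8, fp=2, fn=4))
```

Because F1 balances precision and recall, it is better suited than plain accuracy for a churn problem, where the churned class is typically the smaller one.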

I have performed the analysis for the smaller subset of data as well as for the full dataset. In the process, both datasets were randomly split into training, testing, and validation data.
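The split itself can be illustrated in plain Python; PySpark's `DataFrame.randomSplit` performs the equivalent operation on distributed data. The 80/10/10 proportions below are illustrative, not necessarily those used in the analysis:

```python
# Illustrative three-way random split, mirroring the
# train/test/validation split described above.

import random

def three_way_split(rows, seed=42, fracs=(0.8, 0.1, 0.1)):
    """Shuffle rows deterministically and split them into three parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    a = int(fracs[0] * n)
    b = a + int(fracs[1] * n)
    return rows[:a], rows[a:b], rows[b:]

train, test, valid = three_way_split(range(100))
print(len(train), len(test), len(valid))  # 80 10 10
```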

Here is a summary of the results for the three tested models:

Building upon the full dataset, the random forest model performed best according to the F1 metric. Its predictive capability was confirmed by processing the validation data. If speed and simplicity are key in the model’s practical application, the logistic regression model may still be preferable. The Naive Bayes classifier didn’t do nearly as well, which might be due to its underlying assumption of feature independence.

There is a lot of fine-tuning which could be done to improve the process and the prediction models. Some examples are:

Thanks to FancyCrave for sharing the above photo via Pixabay!

