Predicting Starbucks Offer Completion with a Machine Learning Model

Capstone project for Udacity Data Scientist Nanodegree

This project is the capstone project for the Udacity Data Scientist Nanodegree. It uses three datasources: Portfolio, which contains the features of the different offers sent by Starbucks; Profile, which holds Starbucks' customers and their socio-economic characteristics; and Transcript, which is the log table recording each offer's status at any given time per customer, along with the transactions they make.

These datasources provide a great opportunity for a variety of studies, such as market segmentation, estimating the amount a person might spend, forecasting future customers, etc. The main objective of the current study is to predict which offers will be completed after being viewed by the customer.

Starbucks sends out different offers to customers at random. Some of these offers are only informational, while others come with a reward upon completion. One marketing question that can help optimize the number of offers to send out is whether an offer will be completed, based on the offer's features and the customer's socio-economic characteristics. Therefore, this study aims to create a classifier with which we can predict whether an offer will be completed.

However, the data cannot be used in its current format and needs to be cleaned first. Thus, the structure of the rest of this paper is as follows:

The target variable in this study is a binary variable that shows whether an offer will be successfully completed. Two machine learning algorithms, i.e. a random forest classifier and logistic regression, are used to estimate the model. Each model is tuned over different hyperparameters via grid search. As the next stage, the best model is picked according to two metrics, namely accuracy and F1.

Accuracy is a measure that determines how often a machine learning algorithm classifies a data point correctly: it is the ratio of correctly classified data points over the total number of predicted data points. The higher the accuracy, the better the model is at classification. The main shortcoming of accuracy is that on an imbalanced dataset it can be misleadingly high, since a model that always predicts the majority class already scores well. F1, on the other hand, is the harmonic mean of precision and recall.
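As a minimal sketch of how these two metrics are computed in practice (the labels here are toy values for illustration, not from the Starbucks data):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth and predictions: 8 data points, 6 classified correctly.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)  # correctly classified / total
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall

print(f"accuracy: {acc:.3f}")  # 6 of 8 correct -> 0.750
print(f"F1: {f1:.3f}")
```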

There are 10 records in this table with three types of offer (i.e. bogo, discount, and informational). However, the values in the channels column are lists. I would like to create a column for each element and use 0 and 1 to indicate which channel(s) were used to send out an offer.
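The expansion of the channels lists into binary columns can be sketched as below. The column names and the two example rows are assumptions for illustration, not the actual portfolio records:

```python
import pandas as pd

# Hypothetical slice of the portfolio table; names and values are assumed.
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b"],
    "offer_type": ["bogo", "informational"],
    "channels": [["email", "web", "mobile"], ["email", "social"]],
})

# One binary column per channel: 1 if the offer uses it, 0 otherwise.
channel_dummies = (
    portfolio["channels"]
    .apply(lambda chs: pd.Series({ch: 1 for ch in chs}))
    .fillna(0)
    .astype(int)
)
portfolio = pd.concat([portfolio.drop(columns="channels"), channel_dummies], axis=1)
print(portfolio)
```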

Cleaning:

2. profile: this table stores 5 data points for each of 17000 customers.

As the first step, I checked the number of records per feature as well as some descriptive statistics about the numerical data:

I can conclude two main issues:

3. transcript: there are 306534 records and 4 features in this datasource, without any missing values. Let's grab the data for one person and explore the structure of the table:

The above sample shows that transcript is a log table in which every action of the person is recorded.

Cleaning:

One issue is that the values in the value column are dictionaries with two possible keys: offer id and amount. So, my goal is to extract offer id and amount from value.
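The extraction can be sketched as below; the three example rows stand in for the real transcript log and their values are assumptions:

```python
import pandas as pd

# Hypothetical slice of the transcript table; `value` holds dicts with
# either an "offer id" or an "amount" key, depending on the event.
transcript = pd.DataFrame({
    "person": ["abc", "abc", "abc"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [{"offer id": "offer_a"}, {"amount": 12.5}, {"offer id": "offer_a"}],
})

# Pull each key into its own column; rows without the key get NaN.
transcript["offer_id"] = transcript["value"].apply(lambda v: v.get("offer id"))
transcript["amount"] = transcript["value"].apply(lambda v: v.get("amount"))
transcript = transcript.drop(columns="value")
print(transcript)
```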

After the above changes, the data for the same person looks as below:

The above result shows that:

After creating my target variable, I join all three datasources to prepare my data for some exploratory data analysis.
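The three-way join can be sketched as a pair of left merges; the key names (`offer_id`, `person`) and the single-row stand-ins are assumptions for illustration:

```python
import pandas as pd

# Minimal stand-ins for the three cleaned tables.
portfolio = pd.DataFrame({"offer_id": ["offer_a"], "offer_type": ["bogo"]})
profile = pd.DataFrame({"person": ["abc"], "age": [35], "income": [60000]})
transcript = pd.DataFrame({"person": ["abc"], "offer_id": ["offer_a"], "completed": [1]})

# Left-join the event log onto the offer features, then the customer features.
df = (
    transcript
    .merge(portfolio, on="offer_id", how="left")
    .merge(profile, on="person", how="left")
)
print(df)
```

Left joins keep every transcript record even when an offer or customer attribute is missing, which makes gaps visible rather than silently dropping rows.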

Here is the information about the cleaned dataframe:

So, it seems that I do not have a class imbalance problem with the target variable.

The next question that I need to answer is whether I need to normalise the numerical features.

The variation between variables is huge, and they need to be normalised. After normalising the numerical variables, I created violin and box plots to get more insight into their distributions:
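The normalisation step can be sketched with a min-max scaler as below; the two columns and their values are illustrative assumptions, and the study may equally have used standardisation:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numerical features with very different scales.
df = pd.DataFrame({"income": [30000, 60000, 120000], "age": [18, 45, 90]})

# Rescale every numerical column to the [0, 1] range.
scaler = MinMaxScaler()
df[df.columns] = scaler.fit_transform(df)
print(df)
```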

The above results indicate:

Based on the above results, I will not use the total number of different offers received as explanatory variables.

As a final check, I want to make sure that there is no high correlation between explanatory variables, which could potentially cause redundancy. To this end, I use pair-wise correlations, which show how strongly two variables are linearly correlated:
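The correlation check can be sketched as below. The toy frame is an assumption, but it reproduces the key behaviour: a zero-variance column such as email yields NaN correlations because Pearson's formula divides by the column's standard deviation:

```python
import pandas as pd

# Toy feature frame; `email` is constant (every offer went out by email).
df = pd.DataFrame({
    "difficulty": [5, 10, 20, 10],
    "reward": [5, 5, 10, 2],
    "email": [1, 1, 1, 1],
})

corr = df.corr()  # pair-wise Pearson correlations
print(corr)       # the `email` row/column is NaN: zero variance
```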

Although there is no high correlation between the variables, no correlation has been estimated for email. This can happen if email does not have any variation. Let's see which values email has got:

email has only the value 1. Therefore, I am not going to use it as an explanatory feature in my modelling stage, as there is no variability within this feature.

Another question I had was what the success rate looks like by gender. The graph below shows that, although there are more men than other genders, females completed about as many offers as they left incomplete, while incomplete offers for men are slightly more numerous than completed ones:

So, can it be related to income? The income distribution across genders indicates that men's income is closer to a normal distribution than females'.

In this section, I use two classifiers, namely random forest and logistic regression, to predict whether an offer will be completed. Then, I compare the results and pick the best model.

Random forest: I used grid search to tune the model by finding the best parameters, which are:
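The tuning procedure can be sketched with scikit-learn's `GridSearchCV` as below. The synthetic data and the grid values are illustrative assumptions, not the exact features or hyperparameters tuned in this study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned offer data.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Example grid; the real study would search its own hyperparameter values.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      scoring="f1", cv=3)
search.fit(X_train, y_train)

y_pred = search.best_estimator_.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(search.best_params_)
print(f"accuracy: {acc:.3f}, F1: {f1:.3f}")
```

The same pattern applies to the logistic regression model, swapping in `LogisticRegression` and its own grid (e.g. the regularisation strength `C`).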

Using these hyperparameters, the model's performance metrics are:

Logistic regression: the grid search suggests the best parameters are:

Using these parameters, the performance of the model is:

Therefore, whether accuracy or F1 is used to choose the best-performing model for predicting whether an offer gets completed, random forest is the better classifier.

The goal of this study was to predict which offers sent by Starbucks will be completed by the customer. To this end, I cleaned three dataframes and created features that could explain the success of an offer. The best ML classifier was the random forest classifier, with an accuracy of 75 percent.

For further improvement, it is recommended to analyse which features contribute most to the model's performance.
