Eduardo Gonçalves (https://github.com/edugvs)
Variable | Description |
---|---|
customerID | customer id |
gender | client gender (male / female) |
SeniorCitizen | is the client retired (1, 0) |
Partner | is the client married (Yes, No) |
Dependents | does the client have dependents (Yes, No) |
tenure | how many months a person has been a client of the company |
PhoneService | is the telephone service connected (Yes, No) |
MultipleLines | are multiple phone lines connected (Yes, No, No phone service) |
InternetService | client's Internet service provider (DSL, Fiber optic, No) |
OnlineSecurity | is the online security service connected (Yes, No, No internet service) |
OnlineBackup | is the online backup service activated (Yes, No, No internet service) |
DeviceProtection | does the client have equipment insurance (Yes, No, No internet service) |
TechSupport | is the technical support service connected (Yes, No, No internet service) |
StreamingTV | is the streaming TV service connected (Yes, No, No internet service) |
StreamingMovies | is the streaming cinema service activated (Yes, No, No internet service) |
Contract | type of customer contract (Month-to-month, One year, Two year) |
PaperlessBilling | whether the client uses paperless billing (Yes, No) |
PaymentMethod | payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)) |
MonthlyCharges | current monthly payment |
TotalCharges | the total amount that the client paid for the services for the entire time |
Churn | whether there was a churn (Yes or No) |
Original database on Kaggle: https://www.kaggle.com/radmirzosimov/telecom-users-dataset
# To hide warning messages
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)
# For dataset manipulation and data exploration
import numpy as np
import pandas as pd
# Importing libraries needed for plotting interactive graphics with plotly
import plotly.offline as py
import plotly.express as px
py.init_notebook_mode(connected = False)
# Importing some stuff to calculate some statistics
from scipy.stats import kurtosis, skew, chisquare
import statsmodels.api as sm
# Loading dataset
data = pd.read_csv("data/telecom_users.csv")
display(data)
# Creating a copy of dataset (a plain assignment would only create a second name for the same object)
df = data.copy()
# Verifying dimensions
df.shape
# Checking duplicate data
df.duplicated().sum()
# Checking missing values
df.isna().sum()
# Counting unique values for each variable
info = df.nunique().sort_values()
# Storing the counts in a DataFrame
info = pd.DataFrame(info.values, index = info.index, columns = ['NUniques'])
# Adding the data type of each variable to the DataFrame
info['dtypes'] = df.dtypes
# Show dataframe.
display(info)
We can see problems with the data types of a few variables. "SeniorCitizen" was loaded as int64, but it is not numeric: it represents a category (Yes/No). "TotalCharges" was loaded as object (string) and must be converted to numeric. "Unnamed: 0" does not appear to represent anything relevant, so we will remove it from the dataset.
# Removing "Unnamed: 0" column
df = df.drop(["Unnamed: 0"], axis=1)
# Changing 1 and 0 in "SeniorCitizen" to Yes and No (assigning back instead of a chained inplace call)
df["SeniorCitizen"] = df["SeniorCitizen"].replace({1: "Yes", 0: "No"})
# Changing "TotalCharges" data type to float
df["TotalCharges"] = pd.to_numeric(df.TotalCharges, errors='coerce')
# Checking data types for each variable
df.dtypes
# Checking duplicate data
df.duplicated().sum()
# Checking missing values
df.isna().sum()
Converting "TotalCharges" to numeric introduced missing values in 10 rows. Since they make up a very small share of the dataset, we will simply remove them. This should not materially affect the analysis that follows.
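As a minimal sketch of why the conversion produces missing values: `pd.to_numeric` with `errors='coerce'` replaces every entry it cannot parse with NaN. The sample values below are hypothetical, and the blank string is only an assumption about what an unparseable entry might look like.

```python
import pandas as pd

# Hypothetical sample: the blank string stands in for an unparseable entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed into NaN
converted = pd.to_numeric(raw, errors="coerce")

# Number of values lost in the conversion
n_missing = int(converted.isna().sum())

print(converted)
print("Missing after conversion:", n_missing)
```

Comparing `isna().sum()` before and after the conversion is a quick way to confirm how many rows were affected.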
# Removing rows with missing values (only "TotalCharges" has any)
df = df.dropna()
# Show dataset
display(df)
Now the dataset is ready to be analyzed.
# Defining function to generate a dataframe with numeric variables statistics.
def varStats(col, data, target = ''):
    if target == '':
        stats = pd.DataFrame({
            'Min'   : data[col].min(),
            'Q1'    : data[col].quantile(.25),
            'Median': data[col].median(),
            'Mean'  : data[col].mean(),
            'Q3'    : data[col].quantile(.75),
            'Max'   : data[col].max(),
            'SD'    : data[col].std(),
            'SK'    : skew(data[col]),
            'KU'    : kurtosis(data[col])
        }, index = [col])
    else:
        # Grouping the dataframe passed as argument (not the global df) by the target
        grouped = data.groupby(target)[col]
        stats = pd.concat([
            grouped.min(),
            grouped.quantile(.25),
            grouped.median(),
            grouped.mean(),
            grouped.quantile(.75),
            grouped.max(),
            grouped.std(),
            grouped.skew(),
            grouped.apply(kurtosis)
        ], axis = 1)
        stats.columns = ['Min', 'Q1', 'Median', 'Mean', 'Q3', 'Max', 'SD', 'SK', 'KU']
    return stats
The skewness (asymmetry) coefficient indicates how asymmetrically the data is distributed around its mean. To interpret its result we can look at the following table:
Skewness | Description |
---|---|
SK ≈ 0 | The data is symmetric. Both the right and left tail of the probability density function are the same. |
SK < 0 | Asymmetry is negative. The tail on the left side of the probability density function is larger than the tail on the right. |
SK > 0 | Asymmetry is positive. The tail on the right side of the probability density function is larger than the tail on the left. |
from IPython.display import Image
Image("img/skew.png")
Img Reference:
Skewness - https://www.assetinsights.net/Glossary/G_Skewness.html - Accessed: 2021-05-02
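The table above can be checked on small samples. The sketch below uses `scipy.stats.skew`, the same function imported at the top of this notebook, on three hypothetical arrays chosen only for illustration.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical samples, chosen only to illustrate the sign of the coefficient
symmetric = np.array([1, 2, 3, 4, 5])
right_tailed = np.array([1, 1, 1, 2, 10])   # one large value stretches the right tail
left_tailed = -right_tailed                 # mirror image: tail on the left

sk_sym = skew(symmetric)
sk_right = skew(right_tailed)
sk_left = skew(left_tailed)

print("symmetric   :", sk_sym)
print("right tailed:", sk_right)
print("left tailed :", sk_left)
```

The symmetric sample gives a coefficient of zero, while the other two differ only in sign.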
The kurtosis coefficient measures how peaked or flat a distribution is, and how heavy its tails are, relative to the normal distribution. The `kurtosis` function from scipy returns excess kurtosis by default, so a normal distribution scores 0. To interpret its result we can look at the following table:
Kurtosis | Description |
---|---|
KU ≈ 0 | The distribution has about the same kurtosis as the normal distribution (mesokurtic). |
KU < 0 | The curve is flatter than the normal, with lighter tails (platykurtic). |
KU > 0 | The curve is more peaked than the normal, with heavier tails (leptokurtic). |
from IPython.display import Image
Image("img/kurtosis.png")
Img reference:
Kurtosis and Skewness Example Question | CFA Level I - AnalystPrep https://analystprep.com/cfa-level-1-exam/quantitative-methods/kurtosis-and-skewness-types-of-distributions/ Accessed: 2021-05-02
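A small sketch of the sign convention, using `scipy.stats.kurtosis` with its default settings (Fisher definition, so the normal distribution corresponds to 0). Both arrays are hypothetical and exist only to show one negative and one positive case.

```python
import numpy as np
from scipy.stats import kurtosis

# Hypothetical samples, for illustration only
flat = np.array([1, 2, 3, 4, 5])            # evenly spread values, no pronounced peak
peaked = np.array([0] * 8 + [-10, 10])      # most mass at the center, two extreme values

ku_flat = kurtosis(flat)       # negative: platykurtic
ku_peaked = kurtosis(peaked)   # positive: leptokurtic

print("flat  :", ku_flat)
print("peaked:", ku_peaked)
```

Passing `fisher=False` would return Pearson's kurtosis instead, where the normal distribution scores 3 rather than 0.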
# Defining a function to plot interactive graphics in a non-standard jupyter environment
def configure_plotly_browser_state():
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.43.1.min.js?noext',
            },
          });
        </script>
        '''))
To create plots offline with plotly in non-standard environments (such as Google Colab, Azure, Kaggle, Nteract, etc.), we need the function defined above, and we call it whenever we are going to generate a graph.
Whether the customer canceled the service or not.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'Churn'
# Defining label
label = 'Customer Churn'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['Churn'].value_counts(normalize=True).map('{:.2%}'.format))
Within this dataset, which contains 5976 customers after cleaning, 26.56% canceled the service.
Customer gender (Male/Female).
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'gender'
# Defining label
label = 'Gender'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['gender'].value_counts(normalize=True).map('{:.2%}'.format))
A chi-squared test, also written as χ2 test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.
Independence is the property that the row and column factors occur independently. Association is the lack of independence. If the joint distribution is independent, it can be written as the outer product of the row and column marginal distributions:
from IPython.display import Image
Image("img/independence.png")
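The outer-product idea can be sketched by hand with numpy. The counts below are made up, and the variable names only mirror the statsmodels attributes used in this notebook; this is an illustration of the arithmetic, not the library's implementation.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 2x2 table of counts (rows: Churn No/Yes, columns: two groups)
observed = np.array([[30.0, 20.0],
                     [10.0, 40.0]])

n = observed.sum()
row_marginal = observed.sum(axis=1) / n   # distribution over rows
col_marginal = observed.sum(axis=0) / n   # distribution over columns

# Expected counts under independence: outer product of the marginals, scaled by n
fitted = n * np.outer(row_marginal, col_marginal)

# Pearson residuals: how far each cell is from independence, in standard units
resid_pearson = (observed - fitted) / np.sqrt(fitted)

# Each cell's contribution to the chi-squared statistic, and their total
chi2_contribs = resid_pearson ** 2
statistic = chi2_contribs.sum()

# Degrees of freedom for an r x c table, and the resulting p-value
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
pvalue = chi2.sf(statistic, dof)

print(fitted)
print(resid_pearson)
print(statistic, pvalue)
```

Cells with the largest absolute residuals are the ones that deviate most from what independence would predict.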
We can obtain the best-fitting independent distribution for our observed data, and then view residuals which identify particular cells that most strongly violate independence:
# Creating A two-way contingency table.
data = df[["Churn", "gender"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
We can see that Churn is balanced between genders, with no sign of an association between the two variables. In other words, the customer's gender does not appear to be relevant to the cancellation of the service.
Whether the customer is retired.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'SeniorCitizen'
# Defining label
label = 'Senior Citizen'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['SeniorCitizen'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "SeniorCitizen"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
The group of customers who are retired and who canceled the service is the one that most violates independence.
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
Only 16% of customers are retired. The cancellation rate was higher among retirees: about 41.61% of 966 records. According to these results, retired customers are about 1.75x as likely to cancel the service.
Looking at the results of the chi-squared test, we can reject the null hypothesis that there are no differences between groups. Customers who are retired and have canceled the service contribute the most to the difference in proportion between groups. However, we cannot indicate with certainty the causes of the differences observed.
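A figure such as "1.75x" is presumably the ratio between the churn rates of the two groups. The sketch below shows one way to compute such a ratio with `pd.crosstab`. It runs on hypothetical toy data, not on the telecom dataset, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, not the telecom dataset
toy = pd.DataFrame({
    "SeniorCitizen": ["Yes"] * 10 + ["No"] * 20,
    "Churn": ["Yes"] * 4 + ["No"] * 6 + ["Yes"] * 4 + ["No"] * 16,
})

# Row-normalized crosstab: share of each Churn value within each group
rates = pd.crosstab(toy["SeniorCitizen"], toy["Churn"], normalize="index")

churn_senior = rates.loc["Yes", "Yes"]      # churn rate among senior citizens
churn_non_senior = rates.loc["No", "Yes"]   # churn rate among everyone else

# Ratio of the two churn rates (how many times as likely one group is to churn)
ratio = churn_senior / churn_non_senior

print(rates)
print("Churn rate ratio:", ratio)
```

The same pattern applies to any of the categorical variables analyzed in this notebook by swapping the column name.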
Whether the customer is married.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'Partner'
# Defining label
label = 'Partner'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['Partner'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "Partner"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
Customers without a partner are about 1.65x as likely to cancel the service: 32.82% of customers without a partner canceled, compared to 19.88% of customers with a partner.
Observing the results of the chi-squared test, we can reject the null hypothesis that there are no differences between the groups. The clients who are married contribute the most to the difference in proportion between the groups. However, we cannot indicate with certainty the causes of the differences observed.
Whether the customer has dependents.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'Dependents'
# Defining label
label = 'Dependents'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['Dependents'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "Dependents"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
Customers who do not have dependents are about 2x more likely to cancel the service. 31.13% of customers who have no dependents canceled, while 15.77% of customers who have dependents canceled the service.
Observing the results of the chi-squared test, we can reject the null hypothesis that there are no differences between the groups, stating that the clients who have dependents are the ones that most contribute to the difference in proportion between the groups. However, we cannot indicate with certainty the causes of the differences observed.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'PaperlessBilling'
# Defining label
label = 'Paperless Billing'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['PaperlessBilling'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "PaperlessBilling"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
Customers with paperless billing are about 2x as likely to cancel the service. 16.56% of customers who receive paper bills canceled, while 33.50% of customers with paperless billing canceled.
Looking at the results of the chi-squared test, we can reject the null hypothesis that there are no differences between groups: the type of bill that customers receive is associated with the difference in proportion between groups. In this data, customers with paperless billing are more likely to cancel the service. However, we cannot indicate with certainty the causes of the differences observed.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'Contract'
# Defining label
label = 'Contract'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['Contract'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "Contract"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
Customers with the Month-to-month contract type have a strong tendency to cancel the service.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'PaymentMethod'
# Defining label
label = 'Payment Method'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['PaymentMethod'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "PaymentMethod"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
Customers who pay for the service by Electronic check have a strong tendency to cancel the service.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'PhoneService'
# Defining label
label = 'Phone Service'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['PhoneService'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "PhoneService"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
90% of customers have phone service. The cancellation rate among customers with phone service was 26.68%, against 25.34% among customers without it. Based on the statistical test, this data gives no evidence that having phone service affects Churn.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'MultipleLines'
# Defining label
label = 'Multiple Lines'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['MultipleLines'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "MultipleLines"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
There is no statistically significant difference in churn between customers with and without multiple lines. However, the group of customers without phone service did show differences in relation to churn.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'InternetService'
# Defining label
label = 'Internet Service'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['InternetService'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "InternetService"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
21.50% of customers do not have internet service. Looking at the results of the chi-squared test, we can reject the null hypothesis that there are no differences between groups. Customers with fiber optic internet service contribute the most to the difference in proportion between groups. We can say that customers with fiber optic internet service are more likely to cancel the service. However, we cannot indicate with certainty the causes of the differences observed.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'StreamingMovies'
# Defining label
label = 'Streaming Movies'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['StreamingMovies'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "StreamingMovies"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
There was no significant difference in churn related to subscribing to the streaming movies service. However, having internet service contributes significantly to the differences observed between the groups. Customers with internet service tend to cancel the service.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'StreamingTV'
# Defining label
label = 'Streaming TV'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['StreamingTV'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "StreamingTV"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
There was no significant difference in churn related to subscribing to the streaming TV service. However, having internet service contributes significantly to the differences observed between the groups. Customers with internet service tend to cancel the service.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'TechSupport'
# Defining label
label = 'Tech Support'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['TechSupport'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "TechSupport"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
41.29% of the customers who did not subscribe to the technical support service canceled, compared with only 15.35% of those who did. Customers without technical support are therefore about 2.69 times as likely to cancel (41.29 / 15.35) as customers with it. However, we cannot indicate with certainty the causes of the differences observed.
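The 2.69 figure is the ratio of the two churn rates quoted above. As a quick check, the sketch below computes that ratio and, for comparison, the ratio of the two retention rates, which is a different and much smaller number. The variable names are illustrative; only the two percentages come from the analysis.

```python
# Churn rates quoted in the text for the TechSupport groups
churn_without_support = 0.4129
churn_with_support = 0.1535

# Ratio of churn rates: how many times more likely a customer
# without tech support is to cancel
rr_churn = churn_without_support / churn_with_support

# Ratio of retention rates: how many times more likely a customer
# with tech support is to stay
rr_retention = (1 - churn_with_support) / (1 - churn_without_support)

print(f"churn ratio:     {rr_churn:.2f}")      # 2.69
print(f"retention ratio: {rr_retention:.2f}")  # 1.44
```

The same arithmetic applies to the other add-on services analyzed in this section.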
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'OnlineBackup'
# Defining label
label = 'Online Backup'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['OnlineBackup'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "OnlineBackup"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
39.86% of the customers who did not subscribe to the online backup service canceled, compared with 21.56% of those who did. According to these results, customers without online backup are about 1.85 times as likely to cancel (39.86 / 21.56) as customers with it. However, we cannot indicate with certainty the causes of the differences observed.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'OnlineSecurity'
# Defining label
label = 'Online Security'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['OnlineSecurity'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "OnlineSecurity"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
41.64% of the customers who did not subscribe to the online security service canceled, compared with 14.39% of those who did. According to these results, customers without online security are about 2.9 times as likely to cancel (41.64 / 14.39) as customers with it. However, we cannot indicate with certainty the causes of the differences observed.
configure_plotly_browser_state()
# Defining variable to be analyzed
col = 'DeviceProtection'
# Defining label
label = 'Device Protection'
# Creating Barplot
fig = px.histogram(df,
x = col,
template = 'plotly_white',
title = 'Absolute frequency of ' + col,
labels = {col: label},
opacity = 0.70,
color = "Churn"
)
# Show figure
fig.show()
# Getting relative frequency
display(df['DeviceProtection'].value_counts(normalize=True).map('{:.2%}'.format))
# Creating A two-way contingency table.
data = df[["Churn", "DeviceProtection"]]
table = sm.stats.Table.from_data(data)
print(table.table_orig)
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and
# columns of the table are independent.
print(table.fittedvalues)
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table
# are independent.
print(table.resid_pearson)
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing.
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
# Chi-squared statistic result
print(rslt.statistic)
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
39.06% of the customers who did not subscribe to the device protection service canceled, compared with 22.27% of those who did. According to these results, customers without device protection are about 1.75 times as likely to cancel (39.06 / 22.27) as customers with it. However, we cannot indicate with certainty the causes of the differences observed.
configure_plotly_browser_state()
fig1 = px.histogram(df,
x = 'tenure',
template = 'plotly_white',
title = 'Tenure distribution',
opacity = 0.70,
color = "Churn"
)
fig2 = px.histogram(df,
x = 'MonthlyCharges',
template = 'plotly_white',
title = 'Monthly Charges distribution',
opacity = 0.70,
nbins=30,
color = "Churn"
)
fig3 = px.histogram(df,
x = 'TotalCharges',
template = 'plotly_white',
title = 'Total Charges distribution',
opacity = 0.70,
color = "Churn"
)
fig1.show()
fig2.show()
fig3.show()
The variables tenure and TotalCharges are directly related and convey the same insight: both reflect customer loyalty. The longer a person has been a customer of the company, the higher the retention rate.
The more expensive the monthly fee for the service, the greater the chance of losing the customer.
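One way to check how closely tenure and TotalCharges move together is a Pearson correlation. The sketch below uses made-up values and assumes TotalCharges is roughly tenure multiplied by the monthly fee; on the real data the same call could be applied to the two DataFrame columns.

```python
import numpy as np

# Hypothetical customers: months of tenure and monthly fee (illustrative values)
tenure = np.array([1, 5, 12, 24, 36, 48, 60, 72], dtype=float)
monthly = np.array([70, 55, 80, 65, 90, 60, 75, 85], dtype=float)

# Assumption for this sketch: total paid is tenure times the monthly fee
total = tenure * monthly

# Pearson correlation between tenure and total charges
r = np.corrcoef(tenure, total)[0, 1]
print(f"correlation: {r:.3f}")
```

A coefficient close to 1 would support treating the two variables as carrying largely the same information.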
configure_plotly_browser_state()
fig1 = px.box(df,
x = 'tenure',
template = 'plotly_white',
title = 'Tenure boxplot',
color = "Churn",
)
fig2 = px.box(df,
x = 'MonthlyCharges',
template = 'plotly_white',
title = 'MonthlyCharges boxplot',
color = "Churn",
)
fig3 = px.box(df,
x = 'TotalCharges',
template = 'plotly_white',
title = 'TotalCharges boxplot',
color = "Churn",
)
fig1.show()
fig2.show()
fig3.show()
We can see outliers in the TotalCharges boxplot within the group of customers who canceled the service. This result shows that it is unusual for a customer to cancel after having spent more than $5,688.05 (the upper fence of the boxplot) on the service, which is closely tied to the number of months the customer has been a subscriber.
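For reference, the usual boxplot rule flags values above Q3 + 1.5 * IQR as outliers. The sketch below applies that rule to made-up numbers with NumPy. Plotly may use a different quartile convention, and the upper fence it displays is typically the largest observation that still falls under this threshold, so the result on the real data may not match the plot exactly.

```python
import numpy as np

# Hypothetical total charges for a handful of customers (illustrative values)
total_charges = np.array([100, 200, 300, 400, 500, 600, 700, 800, 6000], dtype=float)

# Quartiles using NumPy's default linear interpolation
q1, q3 = np.percentile(total_charges, [25, 75])
iqr = q3 - q1

# Standard 1.5 * IQR rule for the upper threshold
upper_fence = q3 + 1.5 * iqr

# Observations beyond the threshold are drawn as individual outlier points
outliers = total_charges[total_charges > upper_fence]

print(q1, q3, upper_fence)
print(outliers)
```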
# Defining variable to be calculated
col = 'tenure'
# Applying varStats function
varStats(col, target = 'Churn', data = df)
The average tenure of customers who canceled the service was about half that of customers who did not, but it still lies within one standard deviation of the latter group's mean. From these descriptive statistics alone, it is not possible to state that the difference is statistically significant. The skewness and kurtosis coefficients for the group that canceled indicate that the data are concentrated toward the left of the distribution and that the curve is slightly flatter than a normal curve.
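Descriptive statistics do not settle significance on their own; a formal test is needed. As one possible approach (not the method used in this notebook), the sketch below runs a two-sided permutation test on the difference of means using only the standard library. The two samples are made up, and `permutation_test` is a hypothetical helper written for this illustration.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)

# Hypothetical tenure values in months (illustrative, not from the dataset)
tenure_churned = [1, 2, 3, 5, 8, 10, 12, 15]
tenure_retained = [20, 24, 30, 35, 40, 48, 60, 72]

p_value = permutation_test(tenure_churned, tenure_retained)
print(f"p-value: {p_value:.4f}")
```

On the real data, the two lists would be replaced by the tenure values of each Churn group. A small p-value would indicate that the difference in means is unlikely under random group assignment.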
# Defining variable to be calculated
col = 'MonthlyCharges'
# Applying varStats function
varStats(col, target = 'Churn', data = df)
From these descriptive statistics alone, it is not possible to state that the differences in the median and mean are statistically significant. The skewness and kurtosis coefficients for the group that canceled the service indicate that the data are concentrated toward the right of the distribution and that the curve is slightly more peaked than a normal curve.
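As a reminder of what the two shape coefficients measure, here is a minimal sketch that computes moment-based skewness and excess kurtosis by hand. It is not the implementation of `varStats`, which is defined earlier in the notebook; the helper name and sample values are illustrative, and library estimators that apply a bias correction can return slightly different numbers.

```python
from statistics import fmean

def skew_and_excess_kurtosis(values):
    """Population (uncorrected) skewness and excess kurtosis."""
    n = len(values)
    mu = fmean(values)
    # Central moments of order 2, 3 and 4
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis

# Symmetric sample: skewness is zero, and negative excess kurtosis means flatter than normal
sym_skew, sym_kurt = skew_and_excess_kurtosis([1, 2, 3, 4, 5])

# Sample with a long right tail: most values sit on the left, so skewness is positive
tail_skew, tail_kurt = skew_and_excess_kurtosis([1, 1, 2, 2, 3, 10])

print(sym_skew, sym_kurt)
print(tail_skew, tail_kurt)
```

Positive skewness means the bulk of the data sits on the left with a tail to the right, and negative skewness the reverse. Excess kurtosis below zero indicates a flatter curve than the normal distribution, and above zero a more peaked one.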
# Defining variable to be calculated
col = 'TotalCharges'
# Applying varStats function
varStats(col, target = 'Churn', data = df)
From these descriptive statistics alone, it is not possible to state that the differences in the median and mean are statistically significant. The skewness and kurtosis coefficients for the group that canceled the service indicate that the data are concentrated toward the left of the distribution and that the curve is flatter than a normal curve.
I would be very happy to receive suggestions to improve the project. Contact me if you have any questions.