Using Python for data analysis - Analysing customer churn¶

version 1.0 - 05.02.21

Eduardo Gonçalves (https://github.com/edugvs)

1. Business problem¶

2. Data dictionary¶

Variable	Description
customerID	customer id
gender	client gender (male / female)
SeniorCitizen	is the client retired (1, 0 in the raw file; recoded to Yes, No during data munging)
Partner	is the client married (Yes, No)
Dependents	does the client have dependents (Yes, No)
tenure	how many months a person has been a client of the company
PhoneService	is the telephone service connected (Yes, No)
MultipleLines	are multiple phone lines connected (Yes, No, No phone service)
InternetService	client's Internet service provider (DSL, Fiber optic, No)
OnlineSecurity	is the online security service connected (Yes, No, No internet service)
OnlineBackup	is the online backup service activated (Yes, No, No internet service)
DeviceProtection	does the client have equipment insurance (Yes, No, No internet service)
TechSupport	is the technical support service connected (Yes, No, No internet service)
StreamingTV	is the streaming TV service connected (Yes, No, No internet service)
StreamingMovies	is the streaming cinema service activated (Yes, No, No internet service)
Contract	type of customer contract (Month-to-month, One year, Two year)
PaperlessBilling	whether the client uses paperless billing (Yes, No)
PaymentMethod	payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges	current monthly payment
TotalCharges	the total amount that the client paid for the services for the entire time
Churn	whether there was a churn (Yes or No)

Original database on Kaggle: https://www.kaggle.com/radmirzosimov/telecom-users-dataset

# To hide warning messages
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)

# For dataset manipulation and data exploration 
import numpy as np
import pandas as pd

# Importing libraries needed for plotting interactive graphics with plotly
import plotly.offline as py
import plotly.express as px

py.init_notebook_mode(connected = False)

# Importing some stuff to calculate some statistics
from scipy.stats import kurtosis, skew, chisquare
import statsmodels.api as sm

3. Understanding the data¶

# Loading dataset
data = pd.read_csv("data/telecom_users.csv")
display(data)

# Creating an independent copy of the dataset (plain assignment would only create an alias)
df = data.copy()

# Verifying dimensions
df.shape

(5986, 22)

# Checking duplicate data
df.duplicated().sum()

0

# Checking missing values
df.isna().sum()

Unnamed: 0          0
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

# Counting unique values
info = df.nunique().sort_values()

# Storing the unique-value counts in a DataFrame
info = pd.DataFrame(info.values, index = info.index, columns = ['NUniques'])

# Assigning information about data type of the variables to a DataFrame.
info['dtypes'] = df.dtypes

# Show dataframe.
display(info)

We can see problems with the data types of a few variables. "SeniorCitizen" was loaded as int64, but it is not really numeric: it represents a category (Yes/No). "TotalCharges" was loaded as an object (string) and must be converted to a numeric type. "Unnamed: 0" does not appear to carry any relevant information, so we will remove it from the dataset.

4. Data munging¶

# Removing "Unnamed: 0" column
df = df.drop(["Unnamed: 0"], axis=1)

# Changing 1 and 0 in "SeniorCitizen" to Yes and No
df["SeniorCitizen"] = df["SeniorCitizen"].replace({1: "Yes", 0: "No"})

# Changing "TotalCharges" data type to float
df["TotalCharges"] = pd.to_numeric(df.TotalCharges, errors='coerce')

# Checking data types for each variable
df.dtypes

customerID           object
gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

# Checking duplicate data
df.duplicated().sum()

0

# Checking missing values
df.isna().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        10
Churn                0
dtype: int64

Converting "TotalCharges" to a numeric type with errors='coerce' turned 10 values that could not be parsed as numbers into missing values. Since these 10 rows are a negligible share of the dataset (10 of 5986), we will simply remove them. This should not materially affect the analysis that follows.
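The behaviour of errors='coerce' can be illustrated on a small made-up Series. The values below are hypothetical and are not taken from the dataset; the sketch only shows that strings which cannot be parsed become NaN, and how such rows could be located before dropping them.

```python
import pandas as pd

# Hypothetical sample: two valid amounts and one blank string
raw = pd.Series(["29.85", " ", "1889.5"])

# Strings that cannot be parsed become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

# Original values of the rows that were turned into NaN by the coercion
unparseable = raw[converted.isna()]

print(converted)
print(unparseable.tolist())
```

Inspecting the unparseable values before calling dropna() is a cheap way to confirm that nothing meaningful is being discarded.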

# Removing rows with missing values in "TotalCharges"
df = df.dropna()

# Show dataset
display(df)

Now the dataset is ready to be analyzed.

5. Exploratory Analysis¶

# Defining function to generate a dataframe with numeric variables statistics.
def varStats(col, data, target = ''):

    if target == '':

        stats = pd.DataFrame({
            'Min'   : data[col].min(),
            'Q1'    : data[col].quantile(.25),
            'Median': data[col].median(),
            'Mean'  : data[col].mean(),
            'Q3'    : data[col].quantile(.75),
            'Max'   : data[col].max(),
            'SD'    : data[col].std(),
            'SK'    : skew(data[col]),
            'KU'    : kurtosis(data[col])
        }, index = [col])

    else:

        # Group the column of the DataFrame passed in `data` (not the global df)
        grouped = data.groupby(target)[col]

        stats = pd.concat([
            grouped.min(),
            grouped.quantile(.25),
            grouped.median(),
            grouped.mean(),
            grouped.quantile(.75),
            grouped.max(),
            grouped.std(),
            # Same scipy estimators as in the ungrouped branch
            grouped.apply(skew),
            grouped.apply(kurtosis)
        ], axis = 1)

        stats.columns = ['Min', 'Q1', 'Median', 'Mean', 'Q3', 'Max', 'SD', 'SK', 'KU']

    return stats

The asymmetry coefficient (skewness) measures how asymmetric the distribution of the data is. To interpret its value we can look at the following table:

Skewness	Description
SK ≈ 0	The data is approximately symmetric. The right and left tails of the probability density function are about the same.
SK < 0	Asymmetry is negative. The tail on the left side of the probability density function is longer than the tail on the right.
SK > 0	Asymmetry is positive. The tail on the right side of the probability density function is longer than the tail on the left.

from IPython.display import Image
Image("img/skew.png")

Img Reference:

Skewness - https://www.assetinsights.net/Glossary/G_Skewness.html - Accessed: 2021-05-02
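The sign convention in the skewness table can be checked on small made-up samples. The sketch below computes the moment-based coefficient m3 / m2^1.5, which should correspond to the default of scipy.stats.skew; the helper name sample_skewness and the sample values are illustrative assumptions.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 10]
left_tailed = [-10, -2, -2, -1, -1]

print(sample_skewness(symmetric))     # zero: symmetric around the mean
print(sample_skewness(right_tailed))  # positive: long right tail
print(sample_skewness(left_tailed))   # negative: long left tail
```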

The kurtosis coefficient characterizes the weight of the tails (often described as the flattening) of the distribution curve relative to a normal distribution. The values below refer to excess kurtosis (kurtosis minus 3), which is what scipy.stats.kurtosis returns by default. To interpret its result we can look at the following table:

Kurtosis	Description
KU ≈ 0	The tails are similar to those of a normal distribution. The distribution is called mesokurtic.
KU < 0	The curve is flatter than normal, with lighter tails. The distribution is called platykurtic.
KU > 0	The curve is more peaked than normal, with heavier tails. The distribution is called leptokurtic.

from IPython.display import Image
Image("img/kurtosis.png")

Img reference:

Kurtosis and Skewness Example Question | CFA Level I - AnalystPrep https://analystprep.com/cfa-level-1-exam/quantitative-methods/kurtosis-and-skewness-types-of-distributions/ Accessed: 2021-05-02
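As a rough illustration of the kurtosis table, the sketch below computes excess kurtosis as m4 / m2^2 - 3 on two made-up samples. The helper name excess_kurtosis and the sample values are assumptions chosen for the example.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: fourth central moment / variance**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6]           # evenly spread values
peaked = [0, 5, 5, 5, 5, 5, 5, 10]  # mass at the centre plus two outliers

print(excess_kurtosis(flat))    # negative: platykurtic
print(excess_kurtosis(peaked))  # positive: leptokurtic
```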

# Defining a function to plot interactive graphics in a non-standard jupyter environment
def configure_plotly_browser_state():
  
  import IPython
  
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.43.1.min.js?noext',
            },
          });
        </script>
        '''))

To create interactive plots with plotly in non-standard Jupyter environments (such as Google Colab, Azure, Kaggle, Nteract, etc.), we define the function above and call it before generating each graph.

Variable: Churn¶

Whether the customer canceled the service or not.

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'Churn'

# Defining label
label = 'Customer Churn'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['Churn'].value_counts(normalize=True).map('{:.2%}'.format))

No     73.44%
Yes    26.56%
Name: Churn, dtype: object

Within this dataset, which encompasses 5976 customers after cleaning, 26.56% canceled the service.

Variable: gender¶

Customer gender (Male/Female).

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'gender'

# Defining label
label = 'Gender'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['gender'].value_counts(normalize=True).map('{:.2%}'.format))

Male      50.94%
Female    49.06%
Name: gender, dtype: object

Chi-squared test¶

A chi-squared test, also written as χ² test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

Independence¶

Independence is the property that the row and column factors occur independently. Association is the lack of independence. If the joint distribution is independent, it can be written as the outer product of the row and column marginal distributions:

from IPython.display import Image
Image("img/independence.png")

We can obtain the best-fitting independent distribution for our observed data, and then view residuals which identify particular cells that most strongly violate independence:

# Creating A two-way contingency table.
data = df[["Churn", "gender"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

gender  Female  Male
Churn               
No        2141  2248
Yes        791   796

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

gender       Female         Male
Churn                           
No      2153.371486  2235.628514
Yes      778.628514   808.371486
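The fitted values above can be reproduced by hand: under independence, the expected count of a cell is its row total times its column total divided by the grand total. A minimal sketch using the observed counts from the Churn x gender table (variable names are illustrative):

```python
# Observed counts copied from the Churn x gender table above
observed = {
    "No":  {"Female": 2141, "Male": 2248},
    "Yes": {"Female": 791,  "Male": 796},
}

# Marginal totals for rows (Churn) and columns (gender)
row_totals = {r: sum(cols.values()) for r, cols in observed.items()}
col_totals = {}
for cols in observed.values():
    for c, count in cols.items():
        col_totals[c] = col_totals.get(c, 0) + count
n = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total
fitted = {
    r: {c: row_totals[r] * col_totals[c] / n for c in col_totals}
    for r in row_totals
}

for r, cols in fitted.items():
    print(r, {c: round(v, 6) for c, v in cols.items()})
```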

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

gender    Female      Male
Churn                     
No     -0.266601  0.261651
Yes     0.443360 -0.435127

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.4685296749782847

# Chi-squared statistic result
print(rslt.statistic)

0.5254415071601595

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

gender    Female      Male
Churn                     
No      0.071076  0.068461
Yes     0.196568  0.189336
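To make the link between residuals, contributions and the test statistic explicit, the sketch below recomputes them from the observed counts. Each cell's contribution is its squared Pearson residual, and the statistic is the sum of the contributions. The p-value uses the identity that, for 1 degree of freedom, the chi-squared upper tail equals erfc(sqrt(x / 2)); this shortcut only applies to 2x2 tables.

```python
from math import erfc, sqrt

observed = [[2141, 2248],   # Churn = No:  Female, Male
            [791, 796]]     # Churn = Yes: Female, Male

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

statistic = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        residual = (obs - expected) / sqrt(expected)  # Pearson residual
        statistic += residual ** 2                    # cell contribution

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
p_value = erfc(sqrt(statistic / 2))

print(round(statistic, 6), round(p_value, 6))
```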

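The p-value (about 0.47) is well above the usual 0.05 threshold, so we cannot reject the null hypothesis that gender and Churn are independent. In other words, this data gives no evidence that the customer's gender is related to the cancellation of the service.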

Variable: SeniorCitizen¶

Whether the customer is retired.

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'SeniorCitizen'

# Defining label
label = 'Senior Citizen'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['SeniorCitizen'].value_counts(normalize=True).map('{:.2%}'.format))

No     83.84%
Yes    16.16%
Name: SeniorCitizen, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "SeniorCitizen"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

SeniorCitizen    No  Yes
Churn                   
No             3825  564
Yes            1185  402

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

SeniorCitizen           No         Yes
Churn                                 
No             3679.533133  709.466867
Yes            1330.466867  256.533133

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

SeniorCitizen        No       Yes
Churn                            
No             2.398102 -5.461325
Yes           -3.988063  9.082227

The group of customers who are retired and who canceled the service is the one that most strongly violates independence.

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

133.96846458355992
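A p-value printed as exactly 0.0 is most likely a floating-point effect rather than a true zero. Assuming the library computes it as 1 minus the cumulative distribution function (an assumption about its internals), any p-value far below machine precision rounds to zero. The sketch below evaluates the upper tail directly for 1 degree of freedom, using the identity erfc(sqrt(x / 2)), and then mimics the complement computation to show the loss of precision.

```python
from math import erfc, sqrt

statistic = 133.96846458355992  # chi-squared statistic printed above

# Upper-tail probability for 1 degree of freedom (2x2 table)
p_value = erfc(sqrt(statistic / 2))

# Same quantity obtained through the complement of the CDF
p_via_complement = 1.0 - (1.0 - p_value)

print(p_value)           # tiny but strictly positive
print(p_via_complement)  # rounds to 0.0
```

In practice it is safer to report such results as "p < 0.001" than as "p = 0".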

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

SeniorCitizen         No        Yes
Churn                              
No              5.750895  29.826072
Yes            15.904650  82.486848

Only 16% of customers are retired. The percentage of customers who canceled the service was higher in the group of retirees: about 41.61% of 966 records, against 23.65% among the other customers. According to these results, retired customers are about 1.76x more likely to cancel the service.
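The churn rates and their ratio follow directly from the contingency table. A minimal sketch with the counts copied from the Churn x SeniorCitizen table (variable names are illustrative):

```python
# Counts copied from the Churn x SeniorCitizen table above
churned = {"No": 1185, "Yes": 402}  # customers who canceled, by SeniorCitizen
stayed = {"No": 3825, "Yes": 564}   # customers who stayed, by SeniorCitizen

# Churn rate within each SeniorCitizen group
churn_rate = {
    group: churned[group] / (churned[group] + stayed[group])
    for group in churned
}

# How many times more likely retired customers are to cancel
relative_risk = churn_rate["Yes"] / churn_rate["No"]

for group, rate in churn_rate.items():
    print(f"SeniorCitizen={group}: {rate:.2%}")
print(f"Relative risk: {relative_risk:.2f}")
```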

Looking at the results of the chi-squared test, we can reject the null hypothesis that Churn and SeniorCitizen are independent. The cell for retired customers who canceled the service contributes the most to the chi-squared statistic. However, the test alone does not tell us what causes the observed difference.

Variable: Partner¶

Whether the customer is married.

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'Partner'

# Defining label
label = 'Partner'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['Partner'].value_counts(normalize=True).map('{:.2%}'.format))

No     51.54%
Yes    48.46%
Name: Partner, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "Partner"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

Partner    No   Yes
Churn              
No       2069  2320
Yes      1011   576

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

Partner           No          Yes
Churn                            
No       2262.068273  2126.931727
Yes       817.931727   769.068273

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

Partner        No       Yes
Churn                      
No      -4.059365  4.186337
Yes      6.750756 -6.961911

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

128.04475953195737

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

Partner         No        Yes
Churn                        
No       16.478441  17.525414
Yes      45.572701  48.468204

Customers without a partner are about 1.65x more likely to cancel the service: 32.82% of customers who do not have a partner canceled, against 19.89% of customers who have one.

Observing the results of the chi-squared test, we can reject the null hypothesis that Churn and Partner are independent. The cell for married customers who canceled the service (fewer than expected under independence) contributes the most to the chi-squared statistic. However, the test alone does not tell us what causes the observed difference.

Variable: Dependents¶

Whether the customer has dependents.

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'Dependents'

# Defining label
label = 'Dependents'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['Dependents'].value_counts(normalize=True).map('{:.2%}'.format))

No     70.20%
Yes    29.80%
Name: Dependents, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "Dependents"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

Dependents    No   Yes
Churn                 
No          2889  1500
Yes         1306   281

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

Dependents           No          Yes
Churn                               
No          3080.966365  1308.033635
Yes         1114.033635   472.966365

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

Dependents        No       Yes
Churn                         
No         -3.458451  5.307814
Yes         5.751432 -8.826937

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

151.12755595173218

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

Dependents         No        Yes
Churn                           
No          11.960885  28.172888
Yes         33.078970  77.914812

Customers who do not have dependents are about 2x more likely to cancel the service: 31.13% of customers who have no dependents canceled, while 15.78% of customers who have dependents canceled the service.

Observing the results of the chi-squared test, we can reject the null hypothesis that Churn and Dependents are independent. The cell for customers who have dependents and canceled the service (fewer than expected under independence) contributes the most to the chi-squared statistic. However, the test alone does not tell us what causes the observed difference.

Variable: PaperlessBilling¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'PaperlessBilling'

# Defining label
label = 'Paperless Billing'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['PaperlessBilling'].value_counts(normalize=True).map('{:.2%}'.format))

Yes    58.99%
No     41.01%
Name: PaperlessBilling, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "PaperlessBilling"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

PaperlessBilling    No   Yes
Churn                       
No                2045  2344
Yes                406  1181

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

PaperlessBilling           No          Yes
Churn                                     
No                1800.106928  2588.893072
Yes                650.893072   936.106928

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

PaperlessBilling        No       Yes
Churn                               
No                5.772014 -4.813040
Yes              -9.598905  8.004123

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

212.68645155316744

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

PaperlessBilling         No        Yes
Churn                                 
No                33.316141  23.165351
Yes               92.138969  64.065990

Customers who do not receive a paper bill are about 2x more likely to cancel the service: 16.56% of the customers who received paper bills canceled, while 33.50% of the customers with paperless billing canceled the service.

Looking at the results of the chi-squared test, we can reject the null hypothesis that Churn and PaperlessBilling are independent: the type of bill that customers receive is associated with the cancellation rate. In this data, customers with paperless billing are more likely to cancel the service. However, the test alone does not tell us what causes the observed difference.

Variable: Contract¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'Contract'

# Defining label
label = 'Contract'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['Contract'].value_counts(normalize=True).map('{:.2%}'.format))

Month-to-month    54.70%
Two year          23.96%
One year          21.34%
Name: Contract, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "Contract"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

Contract  Month-to-month  One year  Two year
Churn                                       
No                  1871      1127      1391
Yes                 1398       148        41

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

Contract  Month-to-month    One year     Two year
Churn                                            
No           2400.877008  936.408133  1051.714859
Yes           868.122992  338.591867   380.285141

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

Contract  Month-to-month   One year   Two year
Churn                                         
No            -10.814093   6.228332  10.462027
Yes            17.983923 -10.357766 -17.398434

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

998.601081701014
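The p-value of 0.0 printed above is best read as "smaller than floating-point precision can show" rather than as exactly zero. With two Churn levels and three Contract levels the test has (2 - 1) * (3 - 1) = 2 degrees of freedom, and for exactly 2 degrees of freedom the chi-squared upper tail has the closed form exp(-x / 2). A minimal sketch:

```python
from math import exp

statistic = 998.601081701014   # chi-squared statistic printed above
rows, cols = 2, 3              # Churn levels x Contract levels
dof = (rows - 1) * (cols - 1)  # degrees of freedom

# Closed form of the upper tail, valid only for 2 degrees of freedom
p_value = exp(-statistic / 2)

print(dof, p_value)
```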

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

Contract  Month-to-month    One year    Two year
Churn                                           
No            116.944618   38.792124  109.454008
Yes           323.421504  107.283321  302.705508

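Customers with the Month-to-month contract type have a strong tendency to cancel the service: 42.77% of them canceled, against 11.61% of customers with a One year contract and 2.86% of customers with a Two year contract.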
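For variables with more than two categories, the churn rate per category can be read from a row-normalized contingency table. The sketch below uses pd.crosstab on a tiny made-up sample that only borrows the column names from the dataset; with the real data, df would take the place of sample.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the dataset
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "Two year"],
    "Churn":    ["Yes", "No", "No", "No", "Yes", "No"],
})

# Share of churned / retained customers within each contract type
rates = pd.crosstab(sample["Contract"], sample["Churn"], normalize="index")

print(rates)
```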

Variable: PaymentMethod¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'PaymentMethod'

# Defining label
label = 'Payment Method'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['PaymentMethod'].value_counts(normalize=True).map('{:.2%}'.format))

Electronic check             33.57%
Mailed check                 22.79%
Bank transfer (automatic)    21.85%
Credit card (automatic)      21.79%
Name: PaymentMethod, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "PaymentMethod"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

PaymentMethod  Bank transfer (automatic)  Credit card (automatic)  \
Churn                                                               
No                                  1082                     1104   
Yes                                  224                      198   

PaymentMethod  Electronic check  Mailed check  
Churn                                          
No                         1104          1099  
Yes                         902           263

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

PaymentMethod  Bank transfer (automatic)  Credit card (automatic)  \
Churn                                                               
No                            959.175703               956.237952   
Yes                           346.824297               345.762048   

PaymentMethod  Electronic check  Mailed check  
Churn                                          
No                  1473.282129   1000.304217  
Yes                  532.717871    361.695783

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

PaymentMethod  Bank transfer (automatic)  Credit card (automatic)  \
Churn                                                               
No                              3.965840                 4.778372   
Yes                            -6.595224                -7.946470   

PaymentMethod  Electronic check  Mailed check  
Churn                                          
No                    -9.620892      3.120560  
Yes                   15.999620     -5.189516

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

530.4224414393452

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

PaymentMethod  Bank transfer (automatic)  Credit card (automatic)  \
Churn                                                               
No                             15.727888                22.832834   
Yes                            43.496976                63.146383   

PaymentMethod  Electronic check  Mailed check  
Churn                                          
No                    92.561559      9.737895  
Yes                  255.987827     26.931079

Customers who pay for the service by Electronic check have a strong tendency to cancel the service.

Variable: PhoneService¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'PhoneService'

# Defining label
label = 'Phone Service'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['PhoneService'].value_counts(normalize=True).map('{:.2%}'.format))

Yes    90.16%
No      9.84%
Name: PhoneService, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "PhoneService"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

PhoneService   No   Yes
Churn                  
No            439  3950
Yes           149  1438

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

PhoneService          No          Yes
Churn                                
No            431.849398  3957.150602
Yes           156.150602  1430.849398

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

PhoneService        No       Yes
Churn                           
No            0.344094 -0.113671
Yes          -0.572230  0.189037

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.48192512451686575

# Chi-squared statistic result
print(rslt.statistic)

0.4945037706856925

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

PhoneService        No       Yes
Churn                           
No            0.118400  0.012921
Yes           0.327447  0.035735

About 90% of customers have phone service. The cancellation rate was 26.69% among customers with phone service and 25.34% among customers without it. With a p-value of 0.48, the chi-squared test does not reject independence, so this data provides no evidence that having phone service is associated with churn.
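The quantities printed above can be recomputed by hand from the observed counts. The following is a minimal sketch in plain Python, not the statsmodels implementation; it assumes a 2x2 table (1 degree of freedom), for which the upper-tail chi-squared probability has the closed form erfc(sqrt(x / 2)). It should land close to the statistic and p-value reported above.

```python
import math

# Observed counts copied from the contingency table above
# (rows: Churn No/Yes, columns: PhoneService No/Yes).
observed = [[439, 3950],
            [149, 1438]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence: row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson residuals: (observed - expected) / sqrt(expected)
residuals = [[(o - e) / math.sqrt(e) for o, e in zip(o_row, e_row)]
             for o_row, e_row in zip(observed, expected)]

# The chi-squared statistic is the sum of the squared residuals
statistic = sum(r ** 2 for row in residuals for r in row)

# 1 degree of freedom: upper-tail probability = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(statistic / 2))

print(f"statistic = {statistic:.4f}, p-value = {p_value:.4f}")
```

Because the p-value is far above 0.05, the observed counts are compatible with independence between phone service and churn.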

Variable: MultipleLines¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'MultipleLines'

# Defining label
label = 'Multiple Lines'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['MultipleLines'].value_counts(normalize=True).map('{:.2%}'.format))

No                  47.57%
Yes                 42.59%
No phone service     9.84%
Name: MultipleLines, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "MultipleLines"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

MultipleLines    No  No phone service   Yes
Churn                                      
No             2128               439  1822
Yes             715               149   723

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

MultipleLines           No  No phone service          Yes
Churn                                                    
No             2088.006526        431.849398  1869.144076
Yes             754.993474        156.150602   675.855924

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

MultipleLines        No  No phone service       Yes
Churn                                              
No             0.875232          0.344094 -1.090450
Yes           -1.455518         -0.572230  1.813427

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.020161009617758352

# Chi-squared statistic result
print(rslt.statistic)

7.808009513573989

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

MultipleLines        No  No phone service       Yes
Churn                                              
No             0.766031          0.118400  1.189081
Yes            2.118532          0.327447  3.288517

The chi-squared test rejects independence at the 5% level (p ≈ 0.020), but the association is weak: every Pearson residual is below 2 in absolute value. The largest deviation comes from customers with multiple lines, who cancel slightly more often (28.41%) than customers with a single line (25.15%). Customers without phone service (25.34%) contribute the least to the statistic.
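To see which cell drives the result, the residuals and the p-value can be recomputed from the counts. This is a rough sketch in plain Python; it assumes a 2x3 table (2 degrees of freedom), where the upper-tail chi-squared probability simplifies to exp(-x / 2). The variable names are illustrative only.

```python
import math

labels = ["No", "No phone service", "Yes"]          # MultipleLines categories
observed = {"No": [2128, 439, 1822],                # keyed by Churn
            "Yes": [715, 149, 723]}

col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(col_totals)

# Pearson residual of every (Churn, MultipleLines) cell
residuals = {}
for churn, row in observed.items():
    row_total = sum(row)
    for label, count, col_total in zip(labels, row, col_totals):
        expected = row_total * col_total / n
        residuals[(churn, label)] = (count - expected) / math.sqrt(expected)

statistic = sum(r ** 2 for r in residuals.values())
p_value = math.exp(-statistic / 2)   # valid for 2 degrees of freedom only

# Cell that deviates most from independence, and churn rate per category
largest_cell = max(residuals, key=lambda cell: abs(residuals[cell]))
churn_rates = {label: observed["Yes"][i] / col_totals[i]
               for i, label in enumerate(labels)}

print(largest_cell, round(residuals[largest_cell], 3))
print({label: f"{rate:.2%}" for label, rate in churn_rates.items()})
print(f"p-value = {p_value:.4f}")
```

A p-value below 0.05 combined with residuals that all stay under 2 in absolute value is the typical signature of a real but small effect in a large sample.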

Variable: InternetService¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'InternetService'

# Defining label
label = 'Internet Service'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['InternetService'].value_counts(normalize=True).map('{:.2%}'.format))

Fiber optic    43.96%
DSL            34.54%
No             21.50%
Name: InternetService, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "InternetService"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

InternetService   DSL  Fiber optic    No
Churn                                   
No               1667         1536  1186
Yes               397         1091    99

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

InternetService          DSL  Fiber optic         No
Churn                                               
No               1515.879518  1929.367972  943.75251
Yes               548.120482   697.632028  341.24749

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

InternetService       DSL  Fiber optic         No
Churn                                            
No               3.881423    -8.955534   7.885518
Yes             -6.454838    14.893124 -13.113679

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

592.8870555895634

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

InternetService        DSL  Fiber optic          No
Churn                                              
No               15.065445    80.201581   62.181394
Yes              41.664927   221.805128  171.968580

21.50% of customers do not have internet service. The chi-squared test rejects the null hypothesis of independence between internet service type and churn, and the fiber optic cells contribute the most to the statistic. Customers with fiber optic internet service are more likely to cancel: 41.53% of them churned, compared with 19.23% of DSL customers and 7.70% of customers without internet service. However, we cannot indicate with certainty the causes of the differences observed.
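The statement about which group contributes most can be quantified by summing the chi-squared contributions per column. The sketch below is plain Python and only illustrative. It also evaluates the upper-tail probability with the closed form exp(-x / 2), which assumes 2 degrees of freedom; the value printed as 0.0 above is presumably a floating-point artifact rather than an exact zero.

```python
import math

services = ["DSL", "Fiber optic", "No"]
observed = [[1667, 1536, 1186],   # Churn = No
            [397, 1091, 99]]      # Churn = Yes

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contribution of each cell: (observed - expected)^2 / expected
contribs = []
for row, row_total in zip(observed, row_totals):
    expected_row = [row_total * c / n for c in col_totals]
    contribs.append([(o - e) ** 2 / e for o, e in zip(row, expected_row)])

statistic = sum(sum(row) for row in contribs)

# Share of the statistic attributable to each internet service type
column_share = {name: sum(col) / statistic
                for name, col in zip(services, zip(*contribs))}

# Upper-tail probability for 2 degrees of freedom
tail_probability = math.exp(-statistic / 2)

for name, share in column_share.items():
    print(f"{name:<12} {share:.1%}")
print(f"statistic = {statistic:.3f}, tail probability = {tail_probability:.3e}")
```

On these counts the fiber optic column accounts for roughly half of the statistic, with the group without internet service close behind.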

Variable: StreamingMovies¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'StreamingMovies'

# Defining label
label = 'Streaming Movies'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['StreamingMovies'].value_counts(normalize=True).map('{:.2%}'.format))

No                     39.37%
Yes                    39.12%
No internet service    21.50%
Name: StreamingMovies, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "StreamingMovies"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

StreamingMovies    No  No internet service   Yes
Churn                                           
No               1561                 1186  1642
Yes               792                   99   696

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

StreamingMovies           No  No internet service          Yes
Churn                                                         
No               1728.132028            943.75251  1717.115462
Yes               624.867972            341.24749   620.884538

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

StreamingMovies        No  No internet service       Yes
Churn                                                   
No              -4.020418             7.885518 -1.812715
Yes              6.685987           -13.113679  3.014560

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

307.3896710297794

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

StreamingMovies         No  No internet service       Yes
Churn                                                    
No               16.163762            62.181394  3.285937
Yes              44.702427           171.968580  9.087571

The difference between customers with and without the streaming movies service is comparatively small (29.77% versus 33.66% churn). The largest contributions to the chi-squared statistic come from the group without internet service, whose churn rate is only 7.70%. In other words, customers who have internet service, with or without streaming movies, are more likely to cancel than customers who do not.
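Because the "No internet service" column dominates the three-column test, one way to look at streaming movies on its own is to repeat the test on internet customers only. This restricted test is not part of the original analysis; it is a sketch under the assumption that dropping that column is acceptable, using the closed form for 1 degree of freedom.

```python
import math

# Counts for customers WITH internet service only; the
# "No internet service" column of the table above is left out.
observed = [[1561, 1642],   # Churn = No:  StreamingMovies No, Yes
            [792, 696]]     # Churn = Yes: StreamingMovies No, Yes

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

statistic = 0.0
for row, row_total in zip(observed, row_totals):
    for count, col_total in zip(row, col_totals):
        expected = row_total * col_total / n
        statistic += (count - expected) ** 2 / expected

# 1 degree of freedom: upper-tail probability = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(statistic / 2))

churn_without = observed[1][0] / col_totals[0]
churn_with = observed[1][1] / col_totals[1]

print(f"churn without streaming movies: {churn_without:.2%}")
print(f"churn with streaming movies:    {churn_with:.2%}")
print(f"statistic = {statistic:.2f}, p-value = {p_value:.4f}")
```

On these counts the restricted statistic is about 8.2, above the 5% critical value of 3.84 for one degree of freedom. This suggests that the gap of roughly four percentage points, although small, is unlikely to be due to chance alone.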

Variable: StreamingTV¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'StreamingTV'

# Defining label
label = 'Streaming TV'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['StreamingTV'].value_counts(normalize=True).map('{:.2%}'.format))

No                     39.96%
Yes                    38.54%
No internet service    21.50%
Name: StreamingTV, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "StreamingTV"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

StreamingTV    No  No internet service   Yes
Churn                                       
No           1589                 1186  1614
Yes           799                   99   689

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

StreamingTV           No  No internet service          Yes
Churn                                                     
No           1753.837349            943.75251  1691.410141
Yes           634.162651            341.24749   611.589859

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

StreamingTV        No  No internet service       Yes
Churn                                               
No          -3.936053             7.885518 -1.882233
Yes          6.545688           -13.113679  3.130169

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

305.82927501100244

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

StreamingTV         No  No internet service       Yes
Churn                                                
No           15.492515            62.181394  3.542801
Yes          42.846030           171.968580  9.797955

The difference between customers with and without the streaming TV service is comparatively small (29.92% versus 33.46% churn). The largest contributions to the chi-squared statistic come from the group without internet service, whose churn rate is only 7.70%. In other words, customers who have internet service, with or without streaming TV, are more likely to cancel than customers who do not.

Variable: TechSupport¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'TechSupport'

# Defining label
label = 'Tech Support'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['TechSupport'].value_counts(normalize=True).map('{:.2%}'.format))

No                     49.51%
Yes                    28.98%
No internet service    21.50%
Name: TechSupport, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "TechSupport"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

TechSupport    No  No internet service   Yes
Churn                                       
No           1737                 1186  1466
Yes          1222                   99   266

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

TechSupport           No  No internet service          Yes
Churn                                                     
No           2173.201305            943.75251  1272.046185
Yes           785.798695            341.24749   459.953815

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

TechSupport         No  No internet service       Yes
Churn                                                
No           -9.357008             7.885518  5.438096
Yes          15.560778           -13.113679 -9.043597

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

675.2009219256641

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

TechSupport          No  No internet service        Yes
Churn                                                  
No            87.553591            62.181394  29.572890
Yes          242.137815           171.968580  81.786652

41.30% of the customers who did not hire the technical support service canceled, compared to only 15.36% of the customers who hired it. Customers without technical support were therefore about 2.69x more likely to cancel than customers with it. However, we cannot indicate with certainty the causes of the differences observed.
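The 2.69x figure is a ratio of churn rates, which is not the same as a ratio of retention rates. The short sketch below computes both from the contingency table above, so the two readings are not confused; the helper `rate` is a hypothetical name introduced here for illustration.

```python
# Counts copied from the TechSupport contingency table above.
no_support = {"stayed": 1737, "churned": 1222}
with_support = {"stayed": 1466, "churned": 266}


def rate(group, outcome):
    """Share of a group's customers that had the given outcome."""
    return group[outcome] / sum(group.values())


# How much more often customers WITHOUT support cancel
churn_ratio = rate(no_support, "churned") / rate(with_support, "churned")

# How much more often customers WITH support stay
retention_ratio = rate(with_support, "stayed") / rate(no_support, "stayed")

print(f"churn ratio:     {churn_ratio:.2f}x")
print(f"retention ratio: {retention_ratio:.2f}x")
```

The churn ratio is the 2.69x quoted in the text. The retention ratio is much smaller, because most customers in both groups stay.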

Variable: OnlineBackup¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'OnlineBackup'

# Defining label
label = 'Online Backup'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['OnlineBackup'].value_counts(normalize=True).map('{:.2%}'.format))

No                     43.57%
Yes                    34.92%
No internet service    21.50%
Name: OnlineBackup, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "OnlineBackup"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

OnlineBackup    No  No internet service   Yes
Churn                                        
No            1566                 1186  1637
Yes           1038                   99   450

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

OnlineBackup           No  No internet service          Yes
Churn                                                      
No            1912.475904            943.75251  1532.771586
Yes            691.524096            341.24749   554.228414

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

OnlineBackup         No  No internet service       Yes
Churn                                                 
No            -7.922734             7.885518  2.662241
Yes           13.175569           -13.113679 -4.427328

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

497.20406224603863

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

OnlineBackup          No  No internet service        Yes
Churn                                                   
No             62.769707            62.181394   7.087528
Yes           173.595616           171.968580  19.601237

39.86% of the customers who did not hire the online backup service canceled, compared to 21.56% of the customers who hired it. Customers without online backup were therefore about 1.85x more likely to cancel than customers with it. However, we cannot indicate with certainty the causes of the differences observed.

Variable: OnlineSecurity¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'OnlineSecurity'

# Defining label
label = 'Online Security'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['OnlineSecurity'].value_counts(normalize=True).map('{:.2%}'.format))

No                     49.90%
Yes                    28.60%
No internet service    21.50%
Name: OnlineSecurity, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "OnlineSecurity"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

OnlineSecurity    No  No internet service   Yes
Churn                                          
No              1740                 1186  1463
Yes             1242                   99   246

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

OnlineSecurity           No  No internet service          Yes
Churn                                                        
No              2190.093373            943.75251  1255.154116
Yes              791.906627            341.24749   453.845884

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

OnlineSecurity         No  No internet service       Yes
Churn                                                   
No              -9.617702             7.885518  5.866687
Yes             15.994314           -13.113679 -9.756347

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

712.0725709915333

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

OnlineSecurity          No  No internet service        Yes
Churn                                                     
No               92.500186            62.181394  34.418013
Yes             255.818095           171.968580  95.186302

41.65% of the customers who did not hire the online security service canceled, compared to 14.39% of the customers who hired it. Customers without online security were therefore about 3x more likely to cancel than customers with it. However, we cannot indicate with certainty the causes of the differences observed.

Variable: DeviceProtection¶

configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'DeviceProtection'

# Defining label
label = 'Device Protection'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()

# Getting relative frequency
display(df['DeviceProtection'].value_counts(normalize=True).map('{:.2%}'.format))

No                     44.16%
Yes                    34.34%
No internet service    21.50%
Name: DeviceProtection, dtype: object

Independence and chi-squared test¶

# Creating A two-way contingency table.
data = df[["Churn", "DeviceProtection"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)

DeviceProtection    No  No internet service   Yes
Churn                                            
No                1608                 1186  1595
Yes               1031                   99   457

# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)

DeviceProtection           No  No internet service          Yes
Churn                                                          
No                1938.181225            943.75251  1507.066265
Yes                700.818775            341.24749   544.933735

# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)

DeviceProtection         No  No internet service       Yes
Churn                                                     
No                -7.499895             7.885518  2.265110
Yes               12.472385           -13.113679 -3.766896

# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result

0.0

# Chi-squared statistic result
print(rslt.statistic)

465.27902078486363

# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)

DeviceProtection          No  No internet service        Yes
Churn                                                       
No                 56.248425            62.181394   5.130724
Yes               155.560389           171.968580  14.189508

39.07% of the customers who did not hire the device protection service canceled, compared to 22.27% of the customers who hired it. Customers without device protection were therefore about 1.75x more likely to cancel than customers with it. However, we cannot indicate with certainty the causes of the differences observed.
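The four add-on services can be compared side by side by computing the same churn ratio for each. This is an illustrative sketch with the counts copied from the contingency tables in this section; the "No internet service" column is ignored, so the comparison is only between internet customers with and without each add-on.

```python
# (stayed, churned) counts for the "No" and "Yes" groups of each add-on.
tables = {
    "TechSupport":      {"No": (1737, 1222), "Yes": (1466, 266)},
    "OnlineBackup":     {"No": (1566, 1038), "Yes": (1637, 450)},
    "OnlineSecurity":   {"No": (1740, 1242), "Yes": (1463, 246)},
    "DeviceProtection": {"No": (1608, 1031), "Yes": (1595, 457)},
}


def churn_rate(stayed, churned):
    """Fraction of a group that canceled the service."""
    return churned / (stayed + churned)


# Ratio of churn rates: customers without the add-on vs. with it
ratios = {name: churn_rate(*groups["No"]) / churn_rate(*groups["Yes"])
          for name, groups in tables.items()}

# Strongest association first
ranking = sorted(ratios, key=ratios.get, reverse=True)
for name in ranking:
    print(f"{name:<17} {ratios[name]:.2f}x")
```

Online security and technical support show the largest ratios, while online backup and device protection show smaller ones. As noted in the text, these ratios describe association only and say nothing about the cause.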

Numeric variables: tenure, MonthlyCharges, TotalCharges¶

configure_plotly_browser_state()


fig1 = px.histogram(df, 
                      x = 'tenure',
                      template = 'plotly_white',
                      title = 'Tenure distribution',
                      opacity = 0.70,
                      color = "Churn"
                     )

fig2 = px.histogram(df, 
                      x = 'MonthlyCharges',
                      template = 'plotly_white',
                      title = 'Monthly Charges distribution',
                      opacity = 0.70,
                      nbins = 30,
                      color = "Churn"
                     )

fig3 = px.histogram(df, 
                      x = 'TotalCharges',
                      template = 'plotly_white',
                      title = 'Total Charges distribution',
                      opacity = 0.70,
                      color = "Churn"
                     )

fig1.show()
fig2.show()
fig3.show()

The variables tenure and TotalCharges are directly related and convey the same insight: both reflect customer loyalty. The longer a person has been a customer of the company, the higher the retention rate.
The more expensive the monthly fee for the service, the greater the chance of losing the customer.
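The link between tenure and TotalCharges can be illustrated by comparing TotalCharges with the product tenure x MonthlyCharges. The sketch below uses only the nine rows visible in the DataFrame preview of this notebook, so it is a toy check rather than a result for the full dataset; the `pearson` helper is a hypothetical name written here for illustration.

```python
import math

# (tenure, MonthlyCharges, TotalCharges) for the nine rows visible in the
# DataFrame preview; a tiny illustrative sample, not the full dataset.
rows = [
    (72, 24.10, 1734.65), (44, 88.15, 3973.20), (38, 74.95, 2869.85),
    (4, 55.90, 238.50), (2, 53.45, 119.50), (1, 95.00, 95.00),
    (23, 91.10, 2198.30), (12, 21.15, 306.05), (12, 99.45, 1200.15),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


approx_total = [tenure * monthly for tenure, monthly, _ in rows]
actual_total = [total for _, _, total in rows]

r = pearson(approx_total, actual_total)
print(f"correlation between tenure * MonthlyCharges and TotalCharges: {r:.4f}")
```

On the full DataFrame, a similar check could be made with the pandas `corr` method on the numeric columns, assuming TotalCharges has already been converted to a numeric dtype.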

configure_plotly_browser_state()

fig1 = px.box(df, 
              x = 'tenure',      
              template = 'plotly_white',
              title = 'Tenure boxplot',
              color = "Churn",
             )

fig2 = px.box(df, 
              x = 'MonthlyCharges', 
              template = 'plotly_white',
              title = 'MonthlyCharges boxplot',
              color = "Churn",
             )

fig3 = px.box(df, 
              x = 'TotalCharges',   
              template = 'plotly_white',
              title = 'TotalCharges boxplot',
              color = "Churn",
             )

fig1.show()
fig2.show()
fig3.show()

We can see outliers in the TotalCharges boxplot within the group of customers who canceled the service. It is unusual for a customer to cancel after having spent more than $5,688.05 (the upper fence of the boxplot) on the service, which is closely tied to the time, in months, that the customer has been a subscriber.
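The outlier threshold of a boxplot is commonly defined by Tukey's rule, Q3 + 1.5 x IQR. The sketch below applies that rule to the quartiles of TotalCharges for churned customers as printed in this notebook's summary statistics (Q1 = 131.925, Q3 = 2366.775). It is an assumption that these quartiles were computed in the same way as the ones Plotly uses.

```python
# Quartiles of TotalCharges for customers who canceled (Churn = Yes),
# taken from the summary statistics printed in this notebook.
q1, q3 = 131.925, 2366.775

iqr = q3 - q1                    # interquartile range
upper_fence = q3 + 1.5 * iqr     # Tukey's upper outlier threshold
lower_fence = q1 - 1.5 * iqr     # negative here, so no low outliers are possible

print(f"IQR = {iqr:.2f}")
print(f"upper fence = {upper_fence:.2f}")
print(f"lower fence = {lower_fence:.2f}")
```

The value obtained this way is slightly higher than the $5,688.05 read from the plot. A plausible explanation, not verified here, is that the plot's hover label reports the largest observation that still lies inside the fence, and that Plotly may interpolate quartiles differently.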

# Defining variable to be calculated
col = 'tenure'

# Applying varStats function
varStats(col, target = 'Churn', data = df)

The average tenure of customers who canceled the service was about 2x lower, but it still lies within one standard deviation of the mean of the group that did not cancel. These descriptive statistics alone do not establish that the difference is statistically significant. The skewness and kurtosis coefficients for the group that canceled indicate that the observations are concentrated on the left (low tenure values) and that the curve is slightly flatter than the normal one.

# Defining variable to be calculated
col = 'MonthlyCharges'

# Applying varStats function
varStats(col, target = 'Churn', data = df)

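Customers who canceled paid more per month (mean 74.16, median 79.50) than customers who stayed (mean 61.48, median 64.80), but these descriptive statistics alone do not establish that the difference is statistically significant. For the group that canceled, the negative skewness (-0.69) indicates that the observations are concentrated on the right (higher charges) with a tail toward lower values, and the negative kurtosis (-0.44) indicates a curve slightly flatter than the normal one.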

# Defining variable to be calculated
col = 'TotalCharges'

# Applying varStats function
varStats(col, target = 'Churn', data = df)

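Customers who canceled had lower total charges (mean 1,550.70, median 706.60) than customers who stayed (mean 2,568.29, median 1,689.45), but these descriptive statistics alone do not establish that the difference is statistically significant. For the group that canceled, the positive skewness (1.48) indicates that the observations are concentrated on the left (low totals) with a long tail to the right, and the positive kurtosis (1.33) indicates a curve more peaked, with heavier tails, than the normal one.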
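The sign conventions behind the skewness and kurtosis readings can be checked on toy data. The sketch below uses simple moment-based estimators and hypothetical samples that are not taken from the dataset. Library routines such as the pandas `skew` and `kurt` methods apply bias corrections, so their values differ slightly, but the signs are read the same way.

```python
def moments(values):
    """Return (skewness, excess kurtosis) using simple moment estimators."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Hypothetical toy samples, not taken from the dataset
right_tailed = [1, 1, 2, 2, 3, 4, 6, 10, 20]   # mass on the left, long right tail
left_tailed = [-v for v in right_tailed]       # mirror image
flat = list(range(1, 11))                      # evenly spread values

for name, sample in [("right-tailed", right_tailed),
                     ("left-tailed", left_tailed),
                     ("flat", flat)]:
    skewness, kurtosis = moments(sample)
    print(f"{name:<13} skewness = {skewness:+.2f}, excess kurtosis = {kurtosis:+.2f}")
```

Positive skewness means a long right tail with most observations on the left, negative skewness means the opposite, and negative excess kurtosis means a flatter curve than the normal distribution.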

Conclusions and Insights¶

Gender of the customer is not a relevant factor.
Only 16% of customers are retired.
Customers with a month-to-month contract are the most likely to cancel the service.
Customers who pay using the electronic check method are much more likely to cancel the service.
90% of customers use the telephone service.
Whether or not a customer has telephone service is not relevant for churn.
21.5% of customers do not contract internet service.
Customers with fiber optic internet service were more likely to cancel, which may point to problems with that service, although the data does not identify the cause.
The more expensive the monthly fee for the service, the greater the chance of losing the customer.
The longer a person is a consumer of the company, the higher the customer retention rate.
It is unusual for a customer to cancel the service after having spent more than $5,688.05 on it, which is closely tied to the time, in months, that the customer has been a subscriber.
The average tenure of customers who canceled the service was 2x lower, but it is not possible to state that this difference is statistically significant.

More likely to cancel the service:¶

Retired customers (1.75x).
Married customers (1.65x).
Customers who do not have dependents (2x).
Customers with paperless billing (2x).

Less likely to cancel the service:¶

Customers who do not contract internet service.
Customers who hire the technical support service (2.69x).
Customers who hire the online backup service (1.85x).
Customers who hire the online security service (3x).
Customers who hire the device protection service (1.75x).

I would be very happy to receive suggestions to improve the project. Contact me if you have any questions.

MonthlyCharges by Churn:
Churn	Min	Q1	Median	Mean	Q3	Max	SD	SK	KU
No	18.25	25.150	64.8	61.477364	88.75	118.75	31.085043	-0.036103	-1.357109
Yes	18.85	55.675	79.5	74.164871	94.40	118.35	24.965002	-0.694174	-0.436118

TotalCharges by Churn:
Churn	Min	Q1	Median	Mean	Q3	Max	SD	SK	KU
No	18.80	579.000	1689.45	2568.294874	4287.200	8672.45	2335.459814	0.798054	-0.575015
Yes	18.85	131.925	706.60	1550.701985	2366.775	8684.80	1905.709839	1.478092	1.325142

	Unnamed: 0	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	...	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	1869	7010-BRBUU	Male	0	Yes	Yes	72	Yes	Yes	No	...	No internet service	No internet service	No internet service	No internet service	Two year	No	Credit card (automatic)	24.10	1734.65	No
1	4528	9688-YGXVR	Female	0	No	No	44	Yes	No	Fiber optic	...	Yes	No	Yes	No	Month-to-month	Yes	Credit card (automatic)	88.15	3973.2	No
2	6344	9286-DOJGF	Female	1	Yes	No	38	Yes	Yes	Fiber optic	...	No	No	No	No	Month-to-month	Yes	Bank transfer (automatic)	74.95	2869.85	Yes
3	6739	6994-KERXL	Male	0	No	No	4	Yes	No	DSL	...	No	No	No	Yes	Month-to-month	Yes	Electronic check	55.90	238.5	No
4	432	2181-UAESM	Male	0	No	No	2	Yes	No	DSL	...	Yes	No	No	No	Month-to-month	No	Electronic check	53.45	119.5	No
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5981	3772	0684-AOSIH	Male	0	Yes	No	1	Yes	No	Fiber optic	...	No	No	Yes	Yes	Month-to-month	Yes	Electronic check	95.00	95	Yes
5982	5191	5982-PSMKW	Female	0	Yes	Yes	23	Yes	Yes	DSL	...	Yes	Yes	Yes	Yes	Two year	Yes	Credit card (automatic)	91.10	2198.3	No
5983	5226	8044-BGWPI	Male	0	Yes	Yes	12	Yes	No	No	...	No internet service	No internet service	No internet service	No internet service	Month-to-month	Yes	Electronic check	21.15	306.05	No
5984	5390	7450-NWRTR	Male	1	No	No	12	Yes	Yes	Fiber optic	...	Yes	No	Yes	Yes	Month-to-month	Yes	Electronic check	99.45	1200.15	Yes
5985	860	4795-UXVCJ	Male	0	No	No	26	Yes	No	No	...	No internet service	No internet service	No internet service	No internet service	One year	No	Credit card (automatic)	19.80	457.3	No

	NUniques	dtypes
Churn	2	object
PaperlessBilling	2	object
gender	2	object
SeniorCitizen	2	int64
Partner	2	object
Dependents	2	object
PhoneService	2	object
Contract	3	object
StreamingMovies	3	object
StreamingTV	3	object
TechSupport	3	object
DeviceProtection	3	object
OnlineSecurity	3	object
InternetService	3	object
MultipleLines	3	object
OnlineBackup	3	object
PaymentMethod	4	object
tenure	73	int64
MonthlyCharges	1526	float64
TotalCharges	5611	object
customerID	5986	object
Unnamed: 0	5986	int64

tenure by Churn:
Churn	Min	Q1	Median	Mean	Q3	Max	SD	SK	KU
No	1	15.0	38	37.685350	61.0	72	24.025427	-0.031064	-1.411587
Yes	1	2.0	10	18.246377	30.0	72	19.667262	1.105829	0.061708