Using Python for data analysis - Analysing customer churn

  • version 1.0 - 05.02.21

Eduardo Gonçalves (https://github.com/edugvs)

1. Business problem

Any business wants to maximize its number of customers. To achieve this, it is important not only to attract new customers but also to retain existing ones, since retaining a client costs the company less than acquiring a new one. In addition, a new client may be only weakly interested in the company's services and difficult to work with, whereas for existing clients we already have data on how they interact with the service. This lets us react in time and try to keep a client who wants to leave: based on the services the client uses, we can make a special offer to change the decision to leave the operator. Retention is therefore easier to implement than attracting new users, about whom we do not know anything yet.

The data come from a telecommunications company and contain information about almost six thousand users: their demographic characteristics, the services they use, how long they have used the operator's services, the payment method, and the amount paid. The task is to analyze the data, extract insights, and identify what is driving customer churn.

2. Data dictionary


Variable Description
customerID customer id
gender client gender (male / female)
SeniorCitizen is the client retired (1 = Yes, 0 = No)
Partner is the client married (Yes, No)
Dependents does the client have dependents (Yes, No)
tenure how many months a person has been a client of the company
PhoneService is the telephone service connected (Yes, No)
MultipleLines are multiple phone lines connected (Yes, No, No phone service)
InternetService client's Internet service provider (DSL, Fiber optic, No)
OnlineSecurity is the online security service connected (Yes, No, No internet service)
OnlineBackup is the online backup service activated (Yes, No, No internet service)
DeviceProtection does the client have equipment insurance (Yes, No, No internet service)
TechSupport is the technical support service connected (Yes, No, No internet service)
StreamingTV is the streaming TV service connected (Yes, No, No internet service)
StreamingMovies is the streaming cinema service activated (Yes, No, No internet service)
Contract type of customer contract (Month-to-month, One year, Two year)
PaperlessBilling whether the client uses paperless billing (Yes, No)
PaymentMethod payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges current monthly payment
TotalCharges the total amount that the client paid for the services for the entire time
Churn whether there was a churn (Yes or No)

Original database on Kaggle: https://www.kaggle.com/radmirzosimov/telecom-users-dataset

In [1]:
# To hide warning messages
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)

# For dataset manipulation and data exploration 
import numpy as np
import pandas as pd

# Importing libraries needed for plotting interactive graphics with plotly
import plotly.offline as py
import plotly.express as px

py.init_notebook_mode(connected = False)

# Importing functions for descriptive statistics and statistical tests
from scipy.stats import kurtosis, skew, chisquare
import statsmodels.api as sm

3. Understanding the data

In [2]:
# Loading dataset
data = pd.read_csv("data/telecom_users.csv")
display(data)
Unnamed: 0 customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 1869 7010-BRBUU Male 0 Yes Yes 72 Yes Yes No ... No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.10 1734.65 No
1 4528 9688-YGXVR Female 0 No No 44 Yes No Fiber optic ... Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.2 No
2 6344 9286-DOJGF Female 1 Yes No 38 Yes Yes Fiber optic ... No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
3 6739 6994-KERXL Male 0 No No 4 Yes No DSL ... No No No Yes Month-to-month Yes Electronic check 55.90 238.5 No
4 432 2181-UAESM Male 0 No No 2 Yes No DSL ... Yes No No No Month-to-month No Electronic check 53.45 119.5 No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5981 3772 0684-AOSIH Male 0 Yes No 1 Yes No Fiber optic ... No No Yes Yes Month-to-month Yes Electronic check 95.00 95 Yes
5982 5191 5982-PSMKW Female 0 Yes Yes 23 Yes Yes DSL ... Yes Yes Yes Yes Two year Yes Credit card (automatic) 91.10 2198.3 No
5983 5226 8044-BGWPI Male 0 Yes Yes 12 Yes No No ... No internet service No internet service No internet service No internet service Month-to-month Yes Electronic check 21.15 306.05 No
5984 5390 7450-NWRTR Male 1 No No 12 Yes Yes Fiber optic ... Yes No Yes Yes Month-to-month Yes Electronic check 99.45 1200.15 Yes
5985 860 4795-UXVCJ Male 0 No No 26 Yes No No ... No internet service No internet service No internet service No internet service One year No Credit card (automatic) 19.80 457.3 No

5986 rows × 22 columns

In [3]:
# Creating a copy of dataset (a plain assignment would only create a second name for the same object)
df = data.copy()
In [4]:
# Verifying dimensions
df.shape
Out[4]:
(5986, 22)
In [5]:
# Checking duplicate data
df.duplicated().sum()
Out[5]:
0
In [6]:
# Checking missing values
df.isna().sum()
Out[6]:
Unnamed: 0          0
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
In [7]:
# Counting unique values per variable
info = df.nunique().sort_values()

# Storing the counts in a DataFrame
info = pd.DataFrame(info.values, index = info.index, columns = ['NUniques'])

# Adding the data type of each variable
info['dtypes'] = df.dtypes

# Show dataframe.
display(info)
NUniques dtypes
Churn 2 object
PaperlessBilling 2 object
gender 2 object
SeniorCitizen 2 int64
Partner 2 object
Dependents 2 object
PhoneService 2 object
Contract 3 object
StreamingMovies 3 object
StreamingTV 3 object
TechSupport 3 object
DeviceProtection 3 object
OnlineSecurity 3 object
InternetService 3 object
MultipleLines 3 object
OnlineBackup 3 object
PaymentMethod 4 object
tenure 73 int64
MonthlyCharges 1526 float64
TotalCharges 5611 object
customerID 5986 object
Unnamed: 0 5986 int64

We can see data type problems in a few variables. "SeniorCitizen" was loaded as int64, but it is not really numeric: it encodes a category (Yes/No). "TotalCharges" was loaded as object (string) and must be converted to a numeric type. "Unnamed: 0" does not appear to carry any relevant information, so we will remove it from the dataset.

4. Data munging

In [8]:
# Removing "Unnamed: 0" column
df = df.drop(["Unnamed: 0"], axis=1)
In [9]:
# Changing 1 and 0 in "SeniorCitizen" to Yes and No
# (assigning the result back avoids relying on an in-place change to a column view)
df["SeniorCitizen"] = df["SeniorCitizen"].replace({1: "Yes", 0: "No"})
In [10]:
# Changing "TotalCharges" data type to float
df["TotalCharges"] = pd.to_numeric(df.TotalCharges, errors='coerce')
In [11]:
# Checking data types for each variable
df.dtypes
Out[11]:
customerID           object
gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
In [12]:
# Checking duplicate data
df.duplicated().sum()
Out[12]:
0
In [13]:
# Checking missing values
df.isna().sum()
Out[13]:
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        10
Churn                0
dtype: int64

Converting "TotalCharges" to a numeric type with errors='coerce' turned 10 values that could not be parsed as numbers into missing values. Since these 10 rows are a very small share of the dataset (10 of 5,986 rows), we will simply remove them. This should not affect the analysis that follows.
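Before dropping such rows, it can be useful to look at the raw values that failed to convert. The sketch below uses a small made-up Series rather than the real column; the blank string and the "unknown" entry are assumptions chosen for illustration, not values confirmed to be in the dataset.

```python
import pandas as pd

# Toy stand-in for the raw "TotalCharges" column (illustrative values only).
raw = pd.Series(["1734.65", " ", "238.5", "unknown"], name="TotalCharges")

# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
converted = pd.to_numeric(raw, errors="coerce")

# Raw values that were present but became NaN after the conversion.
failed = raw[converted.isna() & raw.notna()]

print(failed.tolist())
print(int(converted.isna().sum()))
```

Applying the same mask to the original string column would show whether the 10 missing values come from blanks or from some other unexpected entry.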

In [14]:
# Removing missing values in "Total Charges"
df = df.dropna()
In [15]:
# Show dataset
display(df)
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7010-BRBUU Male No Yes Yes 72 Yes Yes No No internet service ... No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 24.10 1734.65 No
1 9688-YGXVR Female No No No 44 Yes No Fiber optic No ... Yes No Yes No Month-to-month Yes Credit card (automatic) 88.15 3973.20 No
2 9286-DOJGF Female Yes Yes No 38 Yes Yes Fiber optic No ... No No No No Month-to-month Yes Bank transfer (automatic) 74.95 2869.85 Yes
3 6994-KERXL Male No No No 4 Yes No DSL No ... No No No Yes Month-to-month Yes Electronic check 55.90 238.50 No
4 2181-UAESM Male No No No 2 Yes No DSL Yes ... Yes No No No Month-to-month No Electronic check 53.45 119.50 No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5981 0684-AOSIH Male No Yes No 1 Yes No Fiber optic Yes ... No No Yes Yes Month-to-month Yes Electronic check 95.00 95.00 Yes
5982 5982-PSMKW Female No Yes Yes 23 Yes Yes DSL Yes ... Yes Yes Yes Yes Two year Yes Credit card (automatic) 91.10 2198.30 No
5983 8044-BGWPI Male No Yes Yes 12 Yes No No No internet service ... No internet service No internet service No internet service No internet service Month-to-month Yes Electronic check 21.15 306.05 No
5984 7450-NWRTR Male Yes No No 12 Yes Yes Fiber optic No ... Yes No Yes Yes Month-to-month Yes Electronic check 99.45 1200.15 Yes
5985 4795-UXVCJ Male No No No 26 Yes No No No internet service ... No internet service No internet service No internet service No internet service One year No Credit card (automatic) 19.80 457.30 No

5976 rows × 21 columns

Now the dataset is ready to be analyzed.

5. Exploratory Analysis

In [16]:
# Defining function to generate a dataframe with numeric variables statistics.
def varStats(col, data, target = ''):

    if target == '':

        stats = pd.DataFrame({
            'Min'   : data[col].min(),
            'Q1'    : data[col].quantile(.25),
            'Median': data[col].median(),
            'Mean'  : data[col].mean(),
            'Q3'    : data[col].quantile(.75),
            'Max'   : data[col].max(),
            'SD'    : data[col].std(),
            'SK'    : skew(data[col]),
            'KU'    : kurtosis(data[col])
        }, index = [col])

    else:

        # Group the selected column by the target once and reuse the grouping.
        # Uses the "data" argument instead of the global "df".
        grouped = data.groupby(target)[col]

        stats = pd.concat([
            grouped.min(),
            grouped.quantile(.25),
            grouped.median(),
            grouped.mean(),
            grouped.quantile(.75),
            grouped.max(),
            grouped.std(),
            grouped.skew(),
            grouped.apply(kurtosis)

        ], axis = 1)

        stats.columns = ['Min', 'Q1', 'Median', 'Mean', 'Q3', 'Max', 'SD', 'SK', 'KU']

    return stats

The asymmetry coefficient (skewness) indicates how symmetric the distribution of the data is. To interpret its value, we can use the following table:

Skewness Description
SK ≈ 0 The data is symmetric. Both the right and left tail of the probability density function are the same.
SK < 0 Asymmetry is negative. The tail on the left side of the probability density function is larger than the tail on the right.
SK > 0 Asymmetry is positive. The tail on the right side of the probability density function is larger than the tail on the left.
In [17]:
from IPython.display import Image
Image("img/skew.png")
Out[17]:

Img Reference:

Skewness - https://www.assetinsights.net/Glossary/G_Skewness.html - Accessed: 2021-05-02

The kurtosis coefficient characterizes how heavy the tails of a distribution are (often described as how peaked or flat its curve is) relative to a normal distribution. The values below refer to excess kurtosis, which is what scipy.stats.kurtosis returns by default. To interpret its value, we can use the following table:

Kurtosis Description
KU ≈ 0 The tails are similar to those of a normal distribution. The distribution is called mesokurtic.
KU < 0 The curve is flatter than normal, with lighter tails. The distribution is called platykurtic.
KU > 0 The curve is more peaked than normal, with heavier tails. The distribution is called leptokurtic.
In [18]:
from IPython.display import Image
Image("img/kurtosis.png")
Out[18]:

Img reference:

Kurtosis and Skewness Example Question | CFA Level I - AnalystPrep https://analystprep.com/cfa-level-1-exam/quantitative-methods/kurtosis-and-skewness-types-of-distributions/ Accessed: 2021-05-02
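To make the two tables more concrete, here is a minimal sketch that computes both coefficients on synthetic samples. The three distributions, the sample size and the seed are arbitrary choices for illustration and are not part of the churn dataset.

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Fixed seed so the sketch is reproducible.
rng = np.random.default_rng(0)

# Synthetic samples with different shapes.
samples = {
    "normal": rng.normal(size=100_000),            # symmetric, mesokurtic
    "exponential": rng.exponential(size=100_000),  # long right tail
    "uniform": rng.uniform(size=100_000),          # flat, light tails
}

for name, x in samples.items():
    # kurtosis() returns excess kurtosis by default (0 for a normal distribution).
    print(f"{name:12s} SK = {skew(x):6.2f}  KU = {kurtosis(x):6.2f}")
```

The normal sample should give values close to zero for both coefficients, the exponential sample a clearly positive skewness and kurtosis, and the uniform sample a negative kurtosis.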

In [19]:
# Defining a function to plot interactive graphics in a non-standard jupyter environment
def configure_plotly_browser_state():
  
  import IPython
  
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.43.1.min.js?noext',
            },
          });
        </script>
        '''))

To render plotly figures in environments other than a standard Jupyter notebook (such as Google Colab, Azure, Kaggle, Nteract, etc.), we define the function above and call it in every cell that generates a graph.

Variable: Churn

Whether the customer canceled the service or not.

In [20]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'Churn'

# Defining label
label = 'Customer Churn'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [21]:
# Getting relative frequency
display(df['Churn'].value_counts(normalize=True).map('{:.2%}'.format))
No     73.44%
Yes    26.56%
Name: Churn, dtype: object

Of the 5,976 customers in this dataset, 26.56% canceled the service.

Variable: gender

Customer gender (Male/Female).

In [22]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'gender'

# Defining label
label = 'Gender'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [23]:
# Getting relative frequency
display(df['gender'].value_counts(normalize=True).map('{:.2%}'.format))
Male      50.94%
Female    49.06%
Name: gender, dtype: object
Chi-squared test

A chi-squared test, also written as χ² test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

Independence

Independence is the property that the row and column factors occur independently. Association is the lack of independence. If the joint distribution is independent, it can be written as the outer product of the row and column marginal distributions:

In [24]:
from IPython.display import Image
Image("img/independence.png")
Out[24]:

We can obtain the best-fitting independent distribution for our observed data, and then view residuals which identify particular cells that most strongly violate independence:

In [25]:
# Creating A two-way contingency table.
data = df[["Churn", "gender"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
gender  Female  Male
Churn               
No        2141  2248
Yes        791   796
In [26]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
gender       Female         Male
Churn                           
No      2153.371486  2235.628514
Yes      778.628514   808.371486
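The fitted counts printed above can be reproduced by hand from the outer product of the marginal distributions. This is a minimal sketch using only numpy and the observed counts from the Churn by gender table; the variable names are illustrative.

```python
import numpy as np

# Observed counts: rows = Churn (No, Yes), columns = gender (Female, Male).
observed = np.array([[2141, 2248],
                     [791, 796]])

n = observed.sum()

# Marginal distributions of the row and column factors.
row_marginal = observed.sum(axis=1) / n
col_marginal = observed.sum(axis=0) / n

# Under independence the joint distribution is the outer product of the
# marginals; multiplying by n turns probabilities back into expected counts.
fitted = n * np.outer(row_marginal, col_marginal)

print(fitted.round(6))
```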
In [27]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
gender    Female      Male
Churn                     
No     -0.266601  0.261651
Yes     0.443360 -0.435127
In [28]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.4685296749782847
In [29]:
# Chi-squared statistic result
print(rslt.statistic)
0.5254415071601595
In [30]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
gender    Female      Male
Churn                     
No      0.071076  0.068461
Yes     0.196568  0.189336
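The residuals, the per-cell contributions, the test statistic and the p-value are all tied together by a few lines of arithmetic. The sketch below recomputes them with numpy and scipy from the observed Churn by gender counts. It is meant as a cross-check of the printed values, not as a description of how statsmodels implements the test internally.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts: rows = Churn (No, Yes), columns = gender (Female, Male).
observed = np.array([[2141, 2248],
                     [791, 796]], dtype=float)

# Expected counts under independence.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson residuals: (observed - expected) / sqrt(expected).
resid = (observed - expected) / np.sqrt(expected)

# Each cell contributes its squared residual to the chi-squared statistic.
contribs = resid ** 2
statistic = contribs.sum()

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution.
pvalue = chi2.sf(statistic, dof)

print(resid.round(6))
print(round(statistic, 6), round(pvalue, 6))
```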

Churn is distributed almost identically across genders, and the chi-squared test (p-value ≈ 0.47) gives no evidence against independence. In other words, the customer's gender does not appear to be relevant for the cancellation of the service.

Variable: SeniorCitizen

If the customer is retired.

In [31]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'SeniorCitizen'

# Defining label
label = 'Senior Citizen'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [32]:
# Getting relative frequency
display(df['SeniorCitizen'].value_counts(normalize=True).map('{:.2%}'.format))
No     83.84%
Yes    16.16%
Name: SeniorCitizen, dtype: object
Independence and chi-squared test
In [33]:
# Creating A two-way contingency table.
data = df[["Churn", "SeniorCitizen"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
SeniorCitizen    No  Yes
Churn                   
No             3825  564
Yes            1185  402
In [34]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
SeniorCitizen           No         Yes
Churn                                 
No             3679.533133  709.466867
Yes            1330.466867  256.533133
In [35]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
SeniorCitizen        No       Yes
Churn                            
No             2.398102 -5.461325
Yes           -3.988063  9.082227

The group of customers who are retired and who canceled the service is the cell that most strongly violates independence.

In [36]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [37]:
# Chi-squared statistic result
print(rslt.statistic)
133.96846458355992
In [38]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
SeniorCitizen         No        Yes
Churn                              
No              5.750895  29.826072
Yes            15.904650  82.486848

Only about 16% of customers are retired. The percentage of customers who canceled the service was higher among retirees: 41.61% of 966 records, compared with 23.65% of the 5,010 non-retired customers. According to these results, retired customers are about 1.76x more likely to cancel the service.

Looking at the results of the chi-squared test, we can reject the null hypothesis that being retired and canceling the service are independent. The cell for retired customers who canceled the service contributes most to the chi-squared statistic. However, the test alone does not tell us what causes the observed differences.
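The churn percentages per group and the ratio between them follow directly from the contingency table. A minimal sketch, rebuilding the Churn by SeniorCitizen counts by hand (the variable names are illustrative):

```python
import pandas as pd

# Observed counts: index = Churn, columns = SeniorCitizen.
counts = pd.DataFrame(
    {"No": [3825, 1185], "Yes": [564, 402]},
    index=pd.Index(["No", "Yes"], name="Churn"),
)
counts.columns.name = "SeniorCitizen"

# Share of customers who canceled within each SeniorCitizen group.
churn_rate = counts.loc["Yes"] / counts.sum()

# How many times more often retired customers canceled than non-retired ones.
risk_ratio = churn_rate["Yes"] / churn_rate["No"]

print(churn_rate.map("{:.2%}".format))
print(round(risk_ratio, 2))
```

The same two lines of arithmetic apply to any of the two-level variables analyzed in this section.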

Variable: Partner

If the customer is married.

In [39]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'Partner'

# Defining label
label = 'Partner'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [40]:
# Getting relative frequency
display(df['Partner'].value_counts(normalize=True).map('{:.2%}'.format))
No     51.54%
Yes    48.46%
Name: Partner, dtype: object
Independence and chi-squared test
In [41]:
# Creating A two-way contingency table.
data = df[["Churn", "Partner"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
Partner    No   Yes
Churn              
No       2069  2320
Yes      1011   576
In [42]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
Partner           No          Yes
Churn                            
No       2262.068273  2126.931727
Yes       817.931727   769.068273
In [43]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
Partner        No       Yes
Churn                      
No      -4.059365  4.186337
Yes      6.750756 -6.961911
In [44]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [45]:
# Chi-squared statistic result
print(rslt.statistic)
128.04475953195737
In [46]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
Partner         No        Yes
Churn                        
No       16.478441  17.525414
Yes      45.572701  48.468204

Customers without a partner are about 1.65x more likely to cancel the service: 32.82% of customers who do not have a partner canceled, compared with 19.89% of customers who have one.

Observing the results of the chi-squared test, we can reject the null hypothesis that having a partner and canceling the service are independent. The cell for customers with a partner who canceled contributes most to the chi-squared statistic: fewer of them canceled than independence would predict. However, the test alone does not tell us what causes the observed differences.

Variable: Dependents

If the customer has dependents.

In [47]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'Dependents'

# Defining label
label = 'Dependents'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [48]:
# Getting relative frequency
display(df['Dependents'].value_counts(normalize=True).map('{:.2%}'.format))
No     70.20%
Yes    29.80%
Name: Dependents, dtype: object
Independence and chi-squared test
In [49]:
# Creating A two-way contingency table.
data = df[["Churn", "Dependents"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
Dependents    No   Yes
Churn                 
No          2889  1500
Yes         1306   281
In [50]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
Dependents           No          Yes
Churn                               
No          3080.966365  1308.033635
Yes         1114.033635   472.966365
In [51]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
Dependents        No       Yes
Churn                         
No         -3.458451  5.307814
Yes         5.751432 -8.826937
In [52]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [53]:
# Chi-squared statistic result
print(rslt.statistic)
151.12755595173218
In [54]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
Dependents         No        Yes
Churn                           
No          11.960885  28.172888
Yes         33.078970  77.914812

Customers who do not have dependents are about 2x more likely to cancel the service: 31.13% of customers without dependents canceled, compared with 15.78% of customers with dependents.

Observing the results of the chi-squared test, we can reject the null hypothesis that having dependents and canceling the service are independent. The cell for customers with dependents who canceled contributes most to the chi-squared statistic: fewer of them canceled than independence would predict. However, the test alone does not tell us what causes the observed differences.

Variable: PaperlessBilling

In [55]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'PaperlessBilling'

# Defining label
label = 'Paperless Billing'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [56]:
# Getting relative frequency
display(df['PaperlessBilling'].value_counts(normalize=True).map('{:.2%}'.format))
Yes    58.99%
No     41.01%
Name: PaperlessBilling, dtype: object
Independence and chi-squared test
In [57]:
# Creating A two-way contingency table.
data = df[["Churn", "PaperlessBilling"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
PaperlessBilling    No   Yes
Churn                       
No                2045  2344
Yes                406  1181
In [58]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
PaperlessBilling           No          Yes
Churn                                     
No                1800.106928  2588.893072
Yes                650.893072   936.106928
In [59]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
PaperlessBilling        No       Yes
Churn                               
No                5.772014 -4.813040
Yes              -9.598905  8.004123
In [60]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [61]:
# Chi-squared statistic result
print(rslt.statistic)
212.68645155316744
In [62]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
PaperlessBilling         No        Yes
Churn                                 
No                33.316141  23.165351
Yes               92.138969  64.065990

Customers with paperless billing are about 2x more likely to cancel the service: 16.56% of the customers who receive paper bills canceled, compared with 33.50% of the customers with paperless billing.

Looking at the results of the chi-squared test, we can reject the null hypothesis that the type of billing and canceling the service are independent. In this dataset, customers with paperless billing canceled the service at about twice the rate of customers who receive paper bills. However, the test alone does not tell us what causes the observed differences.

Variable: Contract

In [63]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'Contract'

# Defining label
label = 'Contract'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [64]:
# Getting relative frequency
display(df['Contract'].value_counts(normalize=True).map('{:.2%}'.format))
Month-to-month    54.70%
Two year          23.96%
One year          21.34%
Name: Contract, dtype: object
Independence and chi-squared test
In [65]:
# Creating A two-way contingency table.
data = df[["Churn", "Contract"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
Contract  Month-to-month  One year  Two year
Churn                                       
No                  1871      1127      1391
Yes                 1398       148        41
In [66]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
Contract  Month-to-month    One year     Two year
Churn                                            
No           2400.877008  936.408133  1051.714859
Yes           868.122992  338.591867   380.285141
In [67]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
Contract  Month-to-month   One year   Two year
Churn                                         
No            -10.814093   6.228332  10.462027
Yes            17.983923 -10.357766 -17.398434
In [68]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [69]:
# Chi-squared statistic result
print(rslt.statistic)
998.601081701014
In [70]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
Contract  Month-to-month    One year    Two year
Churn                                           
No            116.944618   38.792124  109.454008
Yes           323.421504  107.283321  302.705508

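Customers on a month-to-month contract are far more likely to cancel the service: 42.77% of them churned, compared with 11.61% of customers on a one-year contract and 2.86% of customers on a two-year contract.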
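
The p-value printed above as 0.0 should not be read as exactly zero; it is too small to show up in the printed result. A table with 2 rows and 3 columns has (2 - 1) * (3 - 1) = 2 degrees of freedom, and for 2 degrees of freedom the chi-squared upper-tail probability has the closed form exp(-x / 2). The sketch below uses only the standard library and the statistic copied from the output above; the variable names are illustrative and not part of the original notebook.

```python
import math

# Chi-squared statistic copied from the Contract output above.
statistic = 998.601081701014

# A 2 x 3 table has (rows - 1) * (cols - 1) = 2 degrees of freedom.
n_rows, n_cols = 2, 3
dof = (n_rows - 1) * (n_cols - 1)

# For dof == 2 the chi-squared survival function is exactly exp(-x / 2).
p_value = math.exp(-statistic / 2)

# Base-10 logarithm of the p-value, computed without going through p_value.
log10_p = -statistic / 2 / math.log(10)

print(dof)
print(p_value)            # tiny, but strictly positive
print(round(log10_p, 2))  # order of magnitude of the p-value
```

When reporting such a result, "p < 0.001" is a safer statement than "p = 0".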

Variable: PaymentMethod

In [71]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'PaymentMethod'

# Defining label
label = 'Payment Method'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [72]:
# Getting relative frequency
display(df['PaymentMethod'].value_counts(normalize=True).map('{:.2%}'.format))
Electronic check             33.57%
Mailed check                 22.79%
Bank transfer (automatic)    21.85%
Credit card (automatic)      21.79%
Name: PaymentMethod, dtype: object
Independence and chi-squared test
In [73]:
# Creating a two-way contingency table.
data = df[["Churn", "PaymentMethod"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
PaymentMethod  Bank transfer (automatic)  Credit card (automatic)  \
Churn                                                               
No                                  1082                     1104   
Yes                                  224                      198   

PaymentMethod  Electronic check  Mailed check  
Churn                                          
No                         1104          1099  
Yes                         902           263  
In [74]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
PaymentMethod  Bank transfer (automatic)  Credit card (automatic)  \
Churn                                                               
No                            959.175703               956.237952   
Yes                           346.824297               345.762048   

PaymentMethod  Electronic check  Mailed check  
Churn                                          
No                  1473.282129   1000.304217  
Yes                  532.717871    361.695783  
In [75]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
PaymentMethod  Bank transfer (automatic)  Credit card (automatic)  \
Churn                                                               
No                              3.965840                 4.778372   
Yes                            -6.595224                -7.946470   

PaymentMethod  Electronic check  Mailed check  
Churn                                          
No                    -9.620892      3.120560  
Yes                   15.999620     -5.189516  
In [76]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [77]:
# Chi-squared statistic result
print(rslt.statistic)
530.4224414393452
In [78]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
PaymentMethod  Bank transfer (automatic)  Credit card (automatic)  \
Churn                                                               
No                             15.727888                22.832834   
Yes                            43.496976                63.146383   

PaymentMethod  Electronic check  Mailed check  
Churn                                          
No                    92.561559      9.737895  
Yes                  255.987827     26.931079  

Customers who pay by electronic check are far more likely to cancel the service: 44.97% of them churned, compared with 15% to 19% of the customers using each of the other payment methods.
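
To put numbers on this tendency, the churn rate of each payment method can be read directly off the contingency table. The sketch below is a minimal standard-library version that copies the observed counts from the output above; in the notebook itself the same rates could be derived from `df`, but the dictionary here is only an illustration.

```python
# Observed counts copied from the PaymentMethod contingency table above:
# method -> (customers who stayed, customers who churned).
counts = {
    "Bank transfer (automatic)": (1082, 224),
    "Credit card (automatic)": (1104, 198),
    "Electronic check": (1104, 902),
    "Mailed check": (1099, 263),
}

# Churn rate per payment method = churned / (stayed + churned).
churn_rate = {
    method: churned / (stayed + churned)
    for method, (stayed, churned) in counts.items()
}

# Print the methods from highest to lowest churn rate.
for method, rate in sorted(churn_rate.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{method:<28}{rate:.2%}")
```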

Variable: PhoneService

In [79]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'PhoneService'

# Defining label
label = 'Phone Service'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [80]:
# Getting relative frequency
display(df['PhoneService'].value_counts(normalize=True).map('{:.2%}'.format))
Yes    90.16%
No      9.84%
Name: PhoneService, dtype: object
Independence and chi-squared test
In [81]:
# Creating a two-way contingency table.
data = df[["Churn", "PhoneService"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
PhoneService   No   Yes
Churn                  
No            439  3950
Yes           149  1438
In [82]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
PhoneService          No          Yes
Churn                                
No            431.849398  3957.150602
Yes           156.150602  1430.849398
In [83]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
PhoneService        No       Yes
Churn                           
No            0.344094 -0.113671
Yes          -0.572230  0.189037
In [84]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.48192512451686575
In [85]:
# Chi-squared statistic result
print(rslt.statistic)
0.4945037706856925
In [86]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
PhoneService        No       Yes
Churn                           
No            0.118400  0.012921
Yes           0.327447  0.035735

About 90% of customers have phone service. The cancellation rate was 26.69% among customers with phone service and 25.34% among customers without it. With a p-value of about 0.48, we cannot reject the null hypothesis of independence: this data gives no evidence that having phone service is associated with churn.
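
Because this table is only 2 x 2, it is a convenient place to check by hand what the fitted values, the statistic and the p-value mean. The sketch below recomputes them from the observed counts with the standard library only. It assumes the usual Pearson chi-squared test without continuity correction, which reproduces the figures printed above; for 1 degree of freedom the upper-tail probability equals erfc(sqrt(x / 2)).

```python
import math

# Observed counts from the PhoneService table above.
# Rows: Churn = No, Yes. Columns: PhoneService = No, Yes.
observed = [[439, 3950],
            [149, 1438]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(expected)
print(statistic, p_value)
```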

Variable: MultipleLines

In [87]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'MultipleLines'

# Defining label
label = 'Multiple Lines'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [88]:
# Getting relative frequency
display(df['MultipleLines'].value_counts(normalize=True).map('{:.2%}'.format))
No                  47.57%
Yes                 42.59%
No phone service     9.84%
Name: MultipleLines, dtype: object
Independence and chi-squared test
In [89]:
# Creating a two-way contingency table.
data = df[["Churn", "MultipleLines"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
MultipleLines    No  No phone service   Yes
Churn                                      
No             2128               439  1822
Yes             715               149   723
In [90]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
MultipleLines           No  No phone service          Yes
Churn                                                    
No             2088.006526        431.849398  1869.144076
Yes             754.993474        156.150602   675.855924
In [91]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
MultipleLines        No  No phone service       Yes
Churn                                              
No             0.875232          0.344094 -1.090450
Yes           -1.455518         -0.572230  1.813427
In [92]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.020161009617758352
In [93]:
# Chi-squared statistic result
print(rslt.statistic)
7.808009513573989
In [94]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
MultipleLines        No  No phone service       Yes
Churn                                              
No             0.766031          0.118400  1.189081
Yes            2.118532          0.327447  3.288517

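The chi-squared test is significant at the 5% level (p-value of about 0.020), but the association is weak: no Pearson residual exceeds 2 in absolute value. The churn rate is 28.41% for customers with multiple lines, 25.15% for customers with a single line and 25.34% for customers without phone service, so the small difference comes mainly from the customers with multiple lines, not from those without phone service.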
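
Statistical significance and the strength of an association are different questions, especially with almost six thousand rows. One common effect-size measure for contingency tables is Cramér's V, defined as sqrt(chi2 / (n * min(rows - 1, cols - 1))). It is not used in the original analysis; the sketch below adds it only for illustration, with a hypothetical helper `cramers_v` and the statistics copied from the outputs above.

```python
import math

n = 5976  # total number of customers in the contingency tables above


def cramers_v(statistic, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    return math.sqrt(statistic / (n * min(n_rows - 1, n_cols - 1)))


# Chi-squared statistics copied from the outputs above (both tables are 2 x 3).
v_multiple_lines = cramers_v(7.808009513573989, n, 2, 3)
v_contract = cramers_v(998.601081701014, n, 2, 3)

print(round(v_multiple_lines, 3), round(v_contract, 3))
```

A value close to 0 indicates a negligible association and a value close to 1 a very strong one, so MultipleLines and Contract end up far apart on this scale.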

Variable: InternetService

In [95]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'InternetService'

# Defining label
label = 'Internet Service'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [96]:
# Getting relative frequency
display(df['InternetService'].value_counts(normalize=True).map('{:.2%}'.format))
Fiber optic    43.96%
DSL            34.54%
No             21.50%
Name: InternetService, dtype: object
Independence and chi-squared test
In [97]:
# Creating a two-way contingency table.
data = df[["Churn", "InternetService"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
InternetService   DSL  Fiber optic    No
Churn                                   
No               1667         1536  1186
Yes               397         1091    99
In [98]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
InternetService          DSL  Fiber optic         No
Churn                                               
No               1515.879518  1929.367972  943.75251
Yes               548.120482   697.632028  341.24749
In [99]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
InternetService       DSL  Fiber optic         No
Churn                                            
No               3.881423    -8.955534   7.885518
Yes             -6.454838    14.893124 -13.113679
In [100]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [101]:
# Chi-squared statistic result
print(rslt.statistic)
592.8870555895634
In [102]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
InternetService        DSL  Fiber optic          No
Churn                                              
No               15.065445    80.201581   62.181394
Yes              41.664927   221.805128  171.968580

21.50% of customers do not have internet service. Based on the chi-squared test, we can reject the null hypothesis that churn is independent of the type of internet service, and the fiber optic cells contribute the most to the chi-squared statistic. Customers with fiber optic internet service are more likely to cancel: 41.53% of them churned, compared with 19.23% of DSL customers and 7.70% of customers without internet service. However, we cannot indicate with certainty the causes of the differences observed.
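
The statement about which group contributes most can be made concrete by summing the cell contributions per column and dividing by the total statistic. The sketch below copies the values printed by `table.chi2_contribs` above; the dictionary layout and variable names are illustrative.

```python
# Cell contributions copied from table.chi2_contribs for InternetService above.
# Each tuple is (Churn = No, Churn = Yes).
contribs = {
    "DSL": (15.065445, 41.664927),
    "Fiber optic": (80.201581, 221.805128),
    "No": (62.181394, 171.968580),
}

# The contributions add up to the chi-squared statistic.
total = sum(sum(cells) for cells in contribs.values())

# Share of the chi-squared statistic attributable to each column.
share = {level: sum(cells) / total for level, cells in contribs.items()}

for level, s in share.items():
    print(f"{level:<12}{s:.1%}")
```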

Variable: StreamingMovies

In [103]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'StreamingMovies'

# Defining label
label = 'Streaming Movies'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [104]:
# Getting relative frequency
display(df['StreamingMovies'].value_counts(normalize=True).map('{:.2%}'.format))
No                     39.37%
Yes                    39.12%
No internet service    21.50%
Name: StreamingMovies, dtype: object
Independence and chi-squared test
In [105]:
# Creating a two-way contingency table.
data = df[["Churn", "StreamingMovies"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
StreamingMovies    No  No internet service   Yes
Churn                                           
No               1561                 1186  1642
Yes               792                   99   696
In [106]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
StreamingMovies           No  No internet service          Yes
Churn                                                         
No               1728.132028            943.75251  1717.115462
Yes               624.867972            341.24749   620.884538
In [107]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
StreamingMovies        No  No internet service       Yes
Churn                                                   
No              -4.020418             7.885518 -1.812715
Yes              6.685987           -13.113679  3.014560
In [108]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [109]:
# Chi-squared statistic result
print(rslt.statistic)
307.3896710297794
In [110]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
StreamingMovies         No  No internet service       Yes
Churn                                                    
No               16.163762            62.181394  3.285937
Yes              44.702427           171.968580  9.087571

Among customers with internet service, the churn rate differs only modestly between those with the streaming movies service (29.77%) and those without it (33.66%). Most of the chi-squared statistic comes from the "No internet service" column: customers without internet service churn far less often (7.70%) than customers with it.
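
The three-column test above mixes two questions: whether having internet service matters and whether streaming movies matters. One way to look at the second question on its own is to drop the "No internet service" column and test the remaining 2 x 2 table. This follow-up is not part of the original notebook; the sketch below is a standard-library version of the Pearson chi-squared test without continuity correction, using counts copied from the table above.

```python
import math

# Counts from the StreamingMovies table above, internet customers only
# (the "No internet service" column is dropped).
# Rows: Churn = No, Yes. Columns: StreamingMovies = No, Yes.
observed = [[1561, 1642],
            [792, 696]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence and the Pearson statistic.
expected = [[r * c / n for c in col_totals] for r in row_totals]
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# 2 x 2 table: 1 degree of freedom, so p = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

churn_no_streaming = observed[1][0] / col_totals[0]
churn_streaming = observed[1][1] / col_totals[1]

print(f"{churn_no_streaming:.2%} vs {churn_streaming:.2%}")
print(statistic, p_value)
```

On these counts the statistic is roughly 8 and the p-value is below 0.01, so the gap between the two groups is small in size but probably not due to chance alone.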

Variable: StreamingTV

In [111]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'StreamingTV'

# Defining label
label = 'Streaming TV'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [112]:
# Getting relative frequency
display(df['StreamingTV'].value_counts(normalize=True).map('{:.2%}'.format))
No                     39.96%
Yes                    38.54%
No internet service    21.50%
Name: StreamingTV, dtype: object
Independence and chi-squared test
In [113]:
# Creating a two-way contingency table.
data = df[["Churn", "StreamingTV"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
StreamingTV    No  No internet service   Yes
Churn                                       
No           1589                 1186  1614
Yes           799                   99   689
In [114]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
StreamingTV           No  No internet service          Yes
Churn                                                     
No           1753.837349            943.75251  1691.410141
Yes           634.162651            341.24749   611.589859
In [115]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
StreamingTV        No  No internet service       Yes
Churn                                               
No          -3.936053             7.885518 -1.882233
Yes          6.545688           -13.113679  3.130169
In [116]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [117]:
# Chi-squared statistic result
print(rslt.statistic)
305.82927501100244
In [118]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
StreamingTV         No  No internet service       Yes
Churn                                                
No           15.492515            62.181394  3.542801
Yes          42.846030           171.968580  9.797955

Among customers with internet service, the churn rate differs only modestly between those with the streaming TV service (29.92%) and those without it (33.46%). Most of the chi-squared statistic comes from the "No internet service" column: customers without internet service churn far less often (7.70%) than customers with it.

Variable: TechSupport

In [119]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'TechSupport'

# Defining label
label = 'Tech Support'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [120]:
# Getting relative frequency
display(df['TechSupport'].value_counts(normalize=True).map('{:.2%}'.format))
No                     49.51%
Yes                    28.98%
No internet service    21.50%
Name: TechSupport, dtype: object
Independence and chi-squared test
In [121]:
# Creating a two-way contingency table.
data = df[["Churn", "TechSupport"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
TechSupport    No  No internet service   Yes
Churn                                       
No           1737                 1186  1466
Yes          1222                   99   266
In [122]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
TechSupport           No  No internet service          Yes
Churn                                                     
No           2173.201305            943.75251  1272.046185
Yes           785.798695            341.24749   459.953815
In [123]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
TechSupport         No  No internet service       Yes
Churn                                                
No           -9.357008             7.885518  5.438096
Yes          15.560778           -13.113679 -9.043597
In [124]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [125]:
# Chi-squared statistic result
print(rslt.statistic)
675.2009219256641
In [126]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
TechSupport          No  No internet service        Yes
Churn                                                  
No            87.553591            62.181394  29.572890
Yes          242.137815           171.968580  81.786652

41.30% of the customers without the technical support service canceled, compared with only 15.36% of the customers with it. Customers without technical support are therefore about 2.69 times as likely to cancel the service as customers with it. However, we cannot indicate with certainty the causes of the differences observed.
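
Two different ratios can be formed from these percentages, and they answer different questions: the ratio of churn rates and the ratio of retention rates. The sketch below computes both from the counts in the table above, so that the 2.69 figure can be attached to the right statement; the variable names are illustrative.

```python
# Counts from the TechSupport table above (internet customers only).
stayed_without, churned_without = 1737, 1222   # TechSupport = No
stayed_with, churned_with = 1466, 266          # TechSupport = Yes

n_without = stayed_without + churned_without
n_with = stayed_with + churned_with

churn_without = churned_without / n_without
churn_with = churned_with / n_with

# Ratio of churn rates: how much more often customers without support cancel.
churn_ratio = churn_without / churn_with

# Ratio of retention rates: how much more often customers with support stay.
retention_ratio = (stayed_with / n_with) / (stayed_without / n_without)

print(f"{churn_without:.2%} {churn_with:.2%}")
print(round(churn_ratio, 2), round(retention_ratio, 2))
```

The 2.69 figure is the ratio of churn rates. The ratio of retention rates is much smaller, about 1.44, so "2.69x more likely to not cancel" would overstate the effect.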

Variable: OnlineBackup

In [127]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'OnlineBackup'

# Defining label
label = 'Online Backup'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [128]:
# Getting relative frequency
display(df['OnlineBackup'].value_counts(normalize=True).map('{:.2%}'.format))
No                     43.57%
Yes                    34.92%
No internet service    21.50%
Name: OnlineBackup, dtype: object
Independence and chi-squared test
In [129]:
# Creating a two-way contingency table.
data = df[["Churn", "OnlineBackup"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
OnlineBackup    No  No internet service   Yes
Churn                                        
No            1566                 1186  1637
Yes           1038                   99   450
In [130]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
OnlineBackup           No  No internet service          Yes
Churn                                                      
No            1912.475904            943.75251  1532.771586
Yes            691.524096            341.24749   554.228414
In [131]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
OnlineBackup         No  No internet service       Yes
Churn                                                 
No            -7.922734             7.885518  2.662241
Yes           13.175569           -13.113679 -4.427328
In [132]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [133]:
# Chi-squared statistic result
print(rslt.statistic)
497.20406224603863
In [134]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
OnlineBackup          No  No internet service        Yes
Churn                                                   
No             62.769707            62.181394   7.087528
Yes           173.595616           171.968580  19.601237

39.86% of the customers without the online backup service canceled, compared with 21.56% of the customers with it. Customers without online backup are therefore about 1.85 times as likely to cancel the service as customers with it. However, we cannot indicate with certainty the causes of the differences observed.
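
A single ratio hides how precisely it is estimated. As an optional extension that is not in the original notebook, the sketch below attaches an approximate 95% confidence interval to the churn-rate ratio using the large-sample log method, in which the standard error of log(ratio) is sqrt(1/a - 1/n1 + 1/c - 1/n2). The counts are copied from the table above and the names are illustrative.

```python
import math

# Counts from the OnlineBackup table above.
churned_without, n_without = 1038, 1038 + 1566   # OnlineBackup = No
churned_with, n_with = 450, 450 + 1637           # OnlineBackup = Yes

# Ratio of churn rates (customers without the service vs. with it).
risk_ratio = (churned_without / n_without) / (churned_with / n_with)

# Standard error of log(risk ratio), large-sample log method.
se_log_rr = math.sqrt(
    1 / churned_without - 1 / n_without + 1 / churned_with - 1 / n_with
)

z = 1.96  # approximate 97.5th percentile of the standard normal
ci_low = math.exp(math.log(risk_ratio) - z * se_log_rr)
ci_high = math.exp(math.log(risk_ratio) + z * se_log_rr)

print(round(risk_ratio, 2), round(ci_low, 2), round(ci_high, 2))
```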

Variable: OnlineSecurity

In [135]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'OnlineSecurity'

# Defining label
label = 'Online Security'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [136]:
# Getting relative frequency
display(df['OnlineSecurity'].value_counts(normalize=True).map('{:.2%}'.format))
No                     49.90%
Yes                    28.60%
No internet service    21.50%
Name: OnlineSecurity, dtype: object
Independence and chi-squared test
In [137]:
# Creating a two-way contingency table.
data = df[["Churn", "OnlineSecurity"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
OnlineSecurity    No  No internet service   Yes
Churn                                          
No              1740                 1186  1463
Yes             1242                   99   246
In [138]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
OnlineSecurity           No  No internet service          Yes
Churn                                                        
No              2190.093373            943.75251  1255.154116
Yes              791.906627            341.24749   453.845884
In [139]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
OnlineSecurity         No  No internet service       Yes
Churn                                                   
No              -9.617702             7.885518  5.866687
Yes             15.994314           -13.113679 -9.756347
In [140]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [141]:
# Chi-squared statistic result
print(rslt.statistic)
712.0725709915333
In [142]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
OnlineSecurity          No  No internet service        Yes
Churn                                                     
No               92.500186            62.181394  34.418013
Yes             255.818095           171.968580  95.186302

41.65% of the customers without the online security service canceled, compared with 14.39% of the customers with it. Customers without online security are therefore about 2.9 times as likely to cancel the service as customers with it. However, we cannot indicate with certainty the causes of the differences observed.

Variable: DeviceProtection

In [143]:
configure_plotly_browser_state()

# Defining variable to be analyzed
col = 'DeviceProtection'

# Defining label
label = 'Device Protection'

# Creating Barplot
fig = px.histogram(df, 
             x = col,
             template = 'plotly_white',
             title = 'Absolute frequency of ' + col,
             labels = {col: label},
             opacity = 0.70,
             color = "Churn"
            )
   
# Show figure
fig.show()
In [144]:
# Getting relative frequency
display(df['DeviceProtection'].value_counts(normalize=True).map('{:.2%}'.format))
No                     44.16%
Yes                    34.34%
No internet service    21.50%
Name: DeviceProtection, dtype: object
Independence and chi-squared test
In [145]:
# Creating a two-way contingency table.
data = df[["Churn", "DeviceProtection"]]

table = sm.stats.Table.from_data(data)

print(table.table_orig)
DeviceProtection    No  No internet service   Yes
Churn                                            
No                1608                 1186  1595
Yes               1031                   99   457
In [146]:
# Returns fitted cell counts under independence. The returned cell counts are estimates under a model where the rows and 
# columns of the table are independent.
print(table.fittedvalues)
DeviceProtection           No  No internet service          Yes
Churn                                                          
No                1938.181225            943.75251  1507.066265
Yes                700.818775            341.24749   544.933735
In [147]:
# Returns Pearson residuals. The Pearson residuals are calculated under a model where the rows and columns of the table 
# are independent.
print(table.resid_pearson)
DeviceProtection         No  No internet service       Yes
Churn                                                     
No                -7.499895             7.885518  2.265110
Yes               12.472385           -13.113679 -3.766896
In [148]:
# Assess independence for nominal factors. Assessment of independence between rows and columns using chi^2 testing. 
# The rows and columns are treated as nominal (unordered) categorical variables.
rslt = table.test_nominal_association()
print(rslt.pvalue) # p-value result
0.0
In [149]:
# Chi-squared statistic result
print(rslt.statistic)
465.27902078486363
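
The p-value printed as 0.0 is presumably a floating-point rounding effect rather than an exact zero. A 2 x 3 table has (2 - 1) * (3 - 1) = 2 degrees of freedom, and for exactly 2 degrees of freedom the chi-squared upper-tail probability has the closed form exp(-x / 2). The sketch below uses that identity with the statistic reported above; it is an independent check, not the calculation statsmodels performs.

```python
import math

statistic = 465.27902078486363  # chi-squared statistic printed above
n_rows, n_cols = 2, 3           # Churn levels x DeviceProtection levels
df = (n_rows - 1) * (n_cols - 1)

# The closed form below is valid only for 2 degrees of freedom.
if df != 2:
    raise ValueError("exp(-x / 2) only holds for 2 degrees of freedom")

# Upper-tail probability P(X > statistic) for a chi-squared variable with df = 2.
p_value = math.exp(-statistic / 2)
print(f"df = {df}, p-value = {p_value:.3e}")
```

The result is extremely small but strictly positive, so the hypothesis of independence is rejected at any conventional significance level.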
In [150]:
# Returns the contributions to the chi^2 statistic for independence. The returned table contains the contribution of each cell 
# to the chi^2 test statistic for the null hypothesis that the rows and columns are independent.
print(table.chi2_contribs)
DeviceProtection          No  No internet service        Yes
Churn                                                       
No                 56.248425            62.181394   5.130724
Yes               155.560389           171.968580  14.189508

Among customers without the device protection service, 39.07% canceled, compared with 22.27% of customers with it. Customers without device protection were therefore about 1.75x more likely to cancel than customers with it. The chi-squared test indicates that this association is statistically significant, but it does not by itself establish the cause of the difference.
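
The percentages and the 1.75x ratio follow directly from the contingency table. A minimal sketch, with the counts copied from the table above (the variable names are illustrative):

```python
# Counts copied from the contingency table, keyed by DeviceProtection value.
churned = {"No": 1031, "Yes": 457}
stayed = {"No": 1608, "Yes": 1595}

# Churn rate within each group: churned / (churned + stayed).
churn_rate = {k: churned[k] / (churned[k] + stayed[k]) for k in churned}

# Ratio of churn rates: without device protection vs. with it.
relative_risk = churn_rate["No"] / churn_rate["Yes"]

print(f"Without device protection: {churn_rate['No']:.2%}")
print(f"With device protection:    {churn_rate['Yes']:.2%}")
print(f"Ratio of churn rates:      {relative_risk:.2f}x")
```

The ratio compares churn rates, so it describes how much more often customers without the service cancel, not how much more often customers with the service stay.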

Numeric variables: tenure, MonthlyCharges, TotalCharges

In [151]:
configure_plotly_browser_state()


fig1 = px.histogram(df, 
                      x = 'tenure',
                      template = 'plotly_white',
                      title = 'Tenure distribution',
                      opacity = 0.70,
                      color = "Churn"
                     )

fig2 = px.histogram(df, 
                      x = 'MonthlyCharges',
                      template = 'plotly_white',
                      title = 'Monthly Charges distribution',
                      opacity = 0.70,
                      nbins = 30,
                      color = "Churn"
                     )

fig3 = px.histogram(df, 
                      x = 'TotalCharges',
                      template = 'plotly_white',
                      title = 'Total Charges distribution',
                      opacity = 0.70,
                      color = "Churn"
                     )

fig1.show()
fig2.show()
fig3.show()
  • The variables tenure and TotalCharges are directly related, since total charges accumulate over the months a customer stays, so they convey the same insight. Both reflect customer loyalty: the longer a person has been a customer of the company, the higher the retention rate.

  • The more expensive the monthly fee for the service, the greater the chance of losing the customer.

In [152]:
configure_plotly_browser_state()

fig1 = px.box(df, 
              x = 'tenure',      
              template = 'plotly_white',
              title = 'Tenure boxplot',
              color = "Churn",
             )

fig2 = px.box(df, 
              x = 'MonthlyCharges', 
              template = 'plotly_white',
              title = 'MonthlyCharges boxplot',
              color = "Churn",
             )

fig3 = px.box(df, 
              x = 'TotalCharges',   
              template = 'plotly_white',
              title = 'TotalCharges boxplot',
              color = "Churn",
             )

fig1.show()
fig2.show()
fig3.show()

The TotalCharges boxplot shows outliers in the group of customers who canceled the service. It is unusual for a customer to cancel after having spent more than $5,688.05 (the upper fence of the boxplot) on the service, which is directly related to how many months the customer has been a subscriber.
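
For reference, the conventional (Tukey) outlier limit is Q3 + 1.5 * IQR. The sketch below applies it to the quartiles of TotalCharges for churned customers reported in the summary statistics later in this section. It is an approximation of what the boxplot shows: Plotly reports as the upper fence the largest observation that does not exceed this limit, and it may interpolate quartiles differently, which would explain a value slightly below the one computed here.

```python
# Quartiles of TotalCharges for customers who canceled (Churn = Yes),
# copied from the summary statistics for TotalCharges.
q1, q3 = 131.925, 2366.775

iqr = q3 - q1                    # interquartile range
upper_limit = q3 + 1.5 * iqr     # Tukey's upper outlier limit

print(f"IQR = {iqr:.2f}, upper limit = {upper_limit:.2f}")
```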

In [153]:
# Defining variable to be calculated
col = 'tenure'

# Applying varStats function
varStats(col, target = 'Churn', data = df)
Out[153]:
       Min    Q1  Median       Mean    Q3  Max         SD         SK         KU
Churn
No       1  15.0      38  37.685350  61.0   72  24.025427  -0.031064  -1.411587
Yes      1   2.0      10  18.246377  30.0   72  19.667262   1.105829   0.061708

The mean tenure of customers who canceled the service (18.2 months) is about half that of customers who did not (37.7 months), although it still lies within one standard deviation of the latter group's mean. That overlap alone neither establishes nor rules out statistical significance; a formal test would be needed. For the group that canceled, the positive skewness (1.11) indicates that the distribution is concentrated at low tenure values with a long tail to the right, and the kurtosis close to zero (0.06) indicates tail weight similar to that of a normal distribution.
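
Whether a difference in means is statistically significant depends on the standard error of each mean, which shrinks with sample size, rather than on how much the standard deviations overlap. As a rough check that is not part of the original analysis, the sketch below computes Welch's t statistic from the summary statistics above. It assumes the group sizes equal the Churn row totals of the contingency table shown earlier (4389 and 1587), which holds only if no rows were excluded for tenure.

```python
import math

# Summary statistics copied from the varStats output above (tenure, in months).
mean_no, sd_no = 37.685350, 24.025427
mean_yes, sd_yes = 18.246377, 19.667262

# Assumption: group sizes are the Churn row totals of the contingency table
# (1608 + 1186 + 1595 and 1031 + 99 + 457).
n_no, n_yes = 4389, 1587

# Welch's t statistic: difference in means over its standard error.
standard_error = math.sqrt(sd_no ** 2 / n_no + sd_yes ** 2 / n_yes)
t_statistic = (mean_no - mean_yes) / standard_error

print(f"standard error = {standard_error:.3f}, t = {t_statistic:.1f}")
```

A statistic far above roughly 2 would suggest that the difference is unlikely to be due to chance. Because tenure is clearly not normally distributed, a rank-based test on the raw data, such as Mann-Whitney U, would be a reasonable complement.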

In [154]:
# Defining variable to be calculated
col = 'MonthlyCharges'

# Applying varStats function
varStats(col, target = 'Churn', data = df)
Out[154]:
         Min      Q1  Median       Mean     Q3     Max         SD         SK         KU
Churn
No     18.25  25.150    64.8  61.477364  88.75  118.75  31.085043  -0.036103  -1.357109
Yes    18.85  55.675    79.5  74.164871  94.40  118.35  24.965002  -0.694174  -0.436118

These descriptive statistics alone do not establish whether the differences in the median and mean are statistically significant. For the group that canceled the service, the negative skewness (-0.69) indicates that the distribution is concentrated at higher monthly charges with a tail to the left, and the negative kurtosis (-0.44) indicates a curve somewhat flatter than the normal one.

In [155]:
# Defining variable to be calculated
col = 'TotalCharges'

# Applying varStats function
varStats(col, target = 'Churn', data = df)
Out[155]:
         Min       Q1   Median         Mean        Q3      Max           SD        SK         KU
Churn
No     18.80  579.000  1689.45  2568.294874  4287.200  8672.45  2335.459814  0.798054  -0.575015
Yes    18.85  131.925   706.60  1550.701985  2366.775  8684.80  1905.709839  1.478092   1.325142

These descriptive statistics alone do not establish whether the differences in the median and mean are statistically significant. For the group that canceled the service, the positive skewness (1.48) indicates that the distribution is concentrated at low total charges with a long tail to the right, and the positive kurtosis (1.33) indicates a curve that is more peaked, with heavier tails, than the normal one.
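
The sign conventions used to read the SK and KU columns can be summarised in a small helper. This sketch assumes KU is excess kurtosis (normal distribution = 0), which is consistent with the negative values in the tables, since plain kurtosis can never fall below 1. The function `describe_shape` and the tolerance `tol` are illustrative choices, not part of the notebook or of `varStats`.

```python
def describe_shape(skewness, excess_kurtosis, tol=0.1):
    """Label a distribution from the signs of its skewness and excess kurtosis.

    Values within +/- tol of zero are treated as indistinguishable from the
    normal reference (tol is an arbitrary illustrative threshold).
    """
    if skewness > tol:
        skew_label = "right-skewed (long right tail)"
    elif skewness < -tol:
        skew_label = "left-skewed (long left tail)"
    else:
        skew_label = "roughly symmetric"

    if excess_kurtosis > tol:
        tail_label = "heavier tails than normal"
    elif excess_kurtosis < -tol:
        tail_label = "lighter tails than normal"
    else:
        tail_label = "tails close to normal"
    return skew_label, tail_label

# SK and KU for customers who canceled, copied from the varStats outputs above.
churned_shape = {
    "tenure": (1.105829, 0.061708),
    "MonthlyCharges": (-0.694174, -0.436118),
    "TotalCharges": (1.478092, 1.325142),
}
for name, (sk, ku) in churned_shape.items():
    print(name, describe_shape(sk, ku))
```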

Conclusions and Insights

  • Gender of the customer is not a relevant factor.
  • Only 16% of customers are retired.
  • Customers with a month-to-month contract are the most likely to cancel the service.
  • Customers who pay using the electronic check method are much more likely to cancel the service.
  • 90% of customers use the telephone service.
  • Whether or not a customer has the telephone service is not relevant for churn.
  • 21.5% of customers do not contract internet service.
  • Customers with the fiber optic internet service were more likely to cancel, which may point to problems with that service.
  • The more expensive the monthly fee for the service, the greater the chance of losing the customer.
  • The longer a person is a consumer of the company, the higher the customer retention rate.
  • It is unusual for a customer to cancel the service after having spent more than $5,688.05 on it, which is directly related to the time, in months, that the customer has been a subscriber.
  • The average tenure of customers who canceled the service was about 2x lower, but the descriptive statistics alone do not establish whether this difference is statistically significant.

Customers more likely to cancel the service:

  • Retired customers (1.75x).
  • Married customers (1.65x).
  • Customers who do not have dependents (2x).
  • Customers with paperless billing (2x).

Customers less likely to cancel the service:

  • Customers who do not have internet service.
  • Customers who subscribe to the technical support service (2.69x).
  • Customers who subscribe to the online backup service (1.84x).
  • Customers who subscribe to the online security service (3x).
  • Customers who subscribe to the device protection service (1.75x).

I would be very happy to receive suggestions for improving the project. Please contact me if you have any questions.