Building a Predictive Model for Detecting Click Traffic Fraud in Mobile Application Advertisements

1 - Business problem

The risk of fraud is everywhere, but for companies that advertise online click, fraud can happen in an overwhelming amount, resulting in misleading click data and wasted money. Ad placements can increase costs simply when people or bots click on ads on a large scale, which in practice does not yield the expected result. With more than one billion mobile devices in use every month, China is the largest mobile market in the world and therefore suffers from high volumes of fraudulent traffic.

The TalkingData (https://www.talkingdata.com), China’s largest independent Big Data platform, covers more than 70% of active mobile devices across the country. They handle three billion clicks a day, of which 90% are potentially fraudulent. Your current approach to preventing click fraud for application developers is to measure a user’s click journey across the portfolio and signal IP addresses that produce many clicks, but never end up installing applications. With this information, they created a blacklist of IPs and a blacklist of devices.

Although successful, they always want to be one step ahead of fraudsters and have asked for their help to further develop the solution. In summary, using a dataset available on Kaggle (https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/overview) we have built an algorithm that can predict whether a user will download an app after clicking on a mobile app ad to determine whether a click is fraudulent or not. For the construction of this project, we used R language.

2 - Loading data

# Importing the required libraries
library(data.table)
library(dplyr)
library(ggplot2)
library(dplyr)
library(lubridate)
library(fasttime)
library(Amelia)
library(caret)
library(randomForest)
library(smotefamily)
library(scales)
library(imbalance)
library(caTools)
library(pROC)
#Loading train data
df_original <- fread('train_sample.csv')

To maintain the integrity of the original data, we will make a copy of the dataset for manipulation.

# Copying dataset
data <- df_original

3 - Data munging

# Cheking dataset
head(data)
##        ip app device os channel          click_time attributed_time
## 1:  87540  12      1 13     497 2017-11-07 09:30:38                
## 2: 105560  25      1 17     259 2017-11-07 13:40:27                
## 3: 101424  12      1 19     212 2017-11-07 18:05:24                
## 4:  94584  13      1 13     477 2017-11-07 04:58:08                
## 5:  68413  12      1  1     178 2017-11-09 09:00:09                
## 6:  93663   3      1 17     115 2017-11-09 01:22:13                
##    is_attributed
## 1:             0
## 2:             0
## 3:             0
## 4:             0
## 5:             0
## 6:             0
glimpse(data)
## Rows: 100,000
## Columns: 8
## $ ip              <int> 87540, 105560, 101424, 94584, 68413, 93663, 17059, 121~
## $ app             <int> 12, 25, 12, 13, 12, 3, 1, 9, 2, 3, 3, 3, 3, 6, 2, 25, ~
## $ device          <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, ~
## $ os              <int> 13, 17, 19, 13, 1, 17, 17, 25, 22, 19, 22, 13, 22, 20,~
## $ channel         <int> 497, 259, 212, 477, 178, 115, 135, 442, 364, 135, 489,~
## $ click_time      <chr> "2017-11-07 09:30:38", "2017-11-07 13:40:27", "2017-11~
## $ attributed_time <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ""~
## $ is_attributed   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
summary(data)
##        ip              app             device              os        
##  Min.   :     9   Min.   :  1.00   Min.   :   0.00   Min.   :  0.00  
##  1st Qu.: 40552   1st Qu.:  3.00   1st Qu.:   1.00   1st Qu.: 13.00  
##  Median : 79827   Median : 12.00   Median :   1.00   Median : 18.00  
##  Mean   : 91256   Mean   : 12.05   Mean   :  21.77   Mean   : 22.82  
##  3rd Qu.:118252   3rd Qu.: 15.00   3rd Qu.:   1.00   3rd Qu.: 19.00  
##  Max.   :364757   Max.   :551.00   Max.   :3867.00   Max.   :866.00  
##     channel       click_time        attributed_time    is_attributed    
##  Min.   :  3.0   Length:100000      Length:100000      Min.   :0.00000  
##  1st Qu.:145.0   Class :character   Class :character   1st Qu.:0.00000  
##  Median :258.0   Mode  :character   Mode  :character   Median :0.00000  
##  Mean   :268.8                                         Mean   :0.00227  
##  3rd Qu.:379.0                                         3rd Qu.:0.00000  
##  Max.   :498.0                                         Max.   :1.00000

Data Dictionary

According to the documentation, each line of training data contains a click record and the following variables:

Data fields:

ip: ip address of click. app: app id for marketing. device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.) os: os version id of user mobile phone channel: channel id of mobile ad publisher click_time: timestamp of click (UTC) attributed_time: if user download the app for after clicking an ad, this is the time of the app download is_attributed: the target that is to be predicted, indicating the app was downloaded

Note that ip, app, device, os, and channel are encoded.

# Transforming variables types
# Converting click_time to date type
data$click_time <- fastPOSIXct(data$click_time)

# Converting attributed_time to date type
data$attributed_time <- fastPOSIXct(data$attributed_time)

# Checking for missing values 
missmap(data, 
        main = "Training Data - Missing Values Map", 
        col = c("yellow", "black"), 
        legend = FALSE)

There are several missing values in the variable “attributed_time”, as most records users have NOT downloaded of the app, the time has not been recorded and the column does not fill with any value

# Checking for duplicate records
table(duplicated(data))
## 
## FALSE  TRUE 
## 99999     1
# Eliminating duplicate records from the dataset.
data <- data[!duplicated(data), ]

4 - Exploratory data analysis

Let’s explore some categorical variables.

# Checking the dataset
glimpse(data)
## Rows: 99,999
## Columns: 8
## $ ip              <int> 87540, 105560, 101424, 94584, 68413, 93663, 17059, 121~
## $ app             <int> 12, 25, 12, 13, 12, 3, 1, 9, 2, 3, 3, 3, 3, 6, 2, 25, ~
## $ device          <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, ~
## $ os              <int> 13, 17, 19, 13, 1, 17, 17, 25, 22, 19, 22, 13, 22, 20,~
## $ channel         <int> 497, 259, 212, 477, 178, 115, 135, 442, 364, 135, 489,~
## $ click_time      <dttm> 2017-11-07 07:30:38, 2017-11-07 11:40:27, 2017-11-07 ~
## $ attributed_time <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ is_attributed   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# Checking the unique values for each variable
plot_var <- c("os", "channel", "device", "app", "attributed_time", "click_time", "ip")
data[, lapply(.SD, uniqueN), .SDcols = plot_var] %>%
  melt(variable.name = "features", value.name = "unique_values") %>%
  ggplot(aes(reorder(features, -unique_values), unique_values)) +
  geom_bar(stat = "identity", fill = "steelblue") + 
  scale_y_log10(breaks = c(50,100,250, 500, 10000, 50000)) +
  geom_text(aes(label = unique_values), vjust = 1.6, color = "white", size=3.5) +
  theme_minimal() +
  labs(x = "features", y = "Number of unique values")

# Analysing target variable
table(data$is_attributed)
## 
##     0     1 
## 99772   227
data %>%
  mutate(Downloaded = factor(is_attributed, labels = c('No', 'Yes')))  %>%
  ggplot(aes(x = Downloaded, fill = Downloaded)) + 
  geom_bar() + 
  labs(title = 'Var Target (is_attributed) Balancing') + 
  ylab('rows') + 
  theme_minimal()

We can see that the class is unbalanced so we must correct this before running the predictive modeling.

# Exploring categorical variables
# Var: os
data[, .N, by = os][order(-N)][1:10] %>% 
  ggplot(aes(reorder(os, -N), N)) +
  geom_bar(stat="identity", fill="steelblue") + 
  theme_minimal() +
  geom_text(aes(label = round(N / sum(N), 2)), vjust = 1.6, color = "white", size=2.5) +
  labs(x = "os") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

# Var: channel
data[, .N, by = channel][order(-N)][1:10] %>% 
  ggplot(aes(reorder(channel, -N), N)) +
  geom_bar(stat="identity", fill="steelblue") + 
  theme_minimal() +
  geom_text(aes(label = round(N / sum(N), 2)), vjust = 1.6, color = "white", size=2.5) +
  labs(x = "channel") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

# Var: device
data[, .N, by = device][order(-N)][1:10] %>% 
  ggplot(aes(reorder(device, -N), N)) +
  geom_bar(stat="identity", fill="steelblue") + 
  theme_minimal() +
  geom_text(aes(label = round(N / sum(N), 2)), vjust = 1.6, color = "white", size=2.5) +
  labs(x = "device") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

# Var: app
data[, .N, by = app][order(-N)][1:10] %>% 
  ggplot(aes(reorder(app, -N), N)) +
  geom_bar(stat="identity", fill="steelblue") + 
  theme_minimal() +
  geom_text(aes(label = round(N / sum(N), 2)), vjust = 1.6, color = "white", size=2.5) +
  labs(x = "app") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))    

Visually, we can understand how categorical variables behave and which are the most frequent values.

#Time Series Analysis

# Resume click_time variable
summary(data$click_time)
##                  Min.               1st Qu.                Median 
## "2017-11-06 14:00:00" "2017-11-07 09:34:09" "2017-11-08 05:07:49" 
##                  Mean               3rd Qu.                  Max. 
## "2017-11-08 04:29:52" "2017-11-09 00:06:01" "2017-11-09 13:59:51"

The dataset corresponds to a period of 3 days, between the 6th and the 9th of November 2017.

# Number of clicks that resulted in a download
data %>%
  mutate(dates = floor_date(click_time, unit = 'hour')) %>%
  group_by(dates) %>%
  summarise(downloads = sum(as.numeric(is_attributed==1))) %>%
  ggplot(aes(x = dates, y = downloads)) +
  geom_line() +
  scale_x_datetime(date_breaks = '2 hours', date_labels = '%d %b - %H') +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
  xlab('Time') +
  ylab('Total Downloads') +
  labs(title = 'Downloads per hour')

# Number of clicks that did NOT result in a download
data %>%
  mutate(dates = floor_date(click_time, unit = 'hour')) %>%
  group_by(dates) %>%
  summarise(nodownloads = sum(!is_attributed)) %>%
  ggplot(aes(x = dates, y = nodownloads)) +
  geom_line() +
  scale_x_datetime(date_breaks = '2 hours', date_labels = '%d %b - %H') +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
  xlab('Time') +
  ylab('Unrealized downloads') +
  labs(title = 'Unrealized downloads per hours')

Clearly, we can see a pattern in the time that downloads are made, with values well below the average between 5 pm-7pm.

# Checking the difference, in seconds, between the click on the ad and the time that the download was performed
secs_diff <- difftime(data$attributed_time, data$click_time, units="secs")
secs_diff <- na.omit(secs_diff)
summary(as.numeric(secs_diff))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    52.5   198.0  4499.6  4887.5 46341.0

5 - Creating the Predictive Model

Feature Engineering

# Let's split click_date variable into "day" and "hour" and add the variable "secs_diff".
train_data <- data %>%  
  mutate(day = day(click_time), hour = hour(click_time), min = minute(click_time), sec = second(click_time)) %>% 
  select(-c(click_time, attributed_time))

# Converting the target varbiable to factor
train_data$is_attributed <- as.factor(train_data$is_attributed)
str(train_data)
## Classes 'data.table' and 'data.frame':   99999 obs. of  10 variables:
##  $ ip           : int  87540 105560 101424 94584 68413 93663 17059 121505 192967 143636 ...
##  $ app          : int  12 25 12 13 12 3 1 9 2 3 ...
##  $ device       : int  1 1 1 1 1 1 1 1 2 1 ...
##  $ os           : int  13 17 19 13 1 17 17 25 22 19 ...
##  $ channel      : int  497 259 212 477 178 115 135 442 364 135 ...
##  $ is_attributed: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day          : int  7 7 7 7 9 8 8 7 8 8 ...
##  $ hour         : int  7 11 16 2 7 23 23 8 7 10 ...
##  $ min          : int  30 40 5 58 0 22 17 1 35 35 ...
##  $ sec          : num  38 27 24 8 9 13 58 53 17 26 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Data transformation

# Normalizing data
dados_norm <- as.data.frame(lapply(train_data[, -6], rescale))
train_data <- cbind(train_data[,6], dados_norm)

# Class Balancing
new.sample <- rwo(train_data, numInstances = 99545, classAttr = "is_attributed")
train_data <- rbind(train_data, new.sample)

table(new.sample$is_attributed)
## 
##     0     1 
##     0 99545
table(train_data$is_attributed)
## 
##     0     1 
## 99772 99772
# Split data into train and test datasets
split = sample.split(train_data$is_attributed, SplitRatio = 0.70)
train = subset(train_data, split==TRUE)
test = subset(train_data, split==FALSE)

str(train)
## Classes 'data.table' and 'data.frame':   139680 obs. of  10 variables:
##  $ is_attributed: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ip           : num  0.24 0.289 0.278 0.188 0.257 ...
##  $ app          : num  0.02 0.04364 0.02 0.02 0.00364 ...
##  $ device       : num  0.000259 0.000259 0.000259 0.000259 0.000259 ...
##  $ os           : num  0.01501 0.01963 0.02194 0.00115 0.01963 ...
##  $ channel      : num  0.998 0.517 0.422 0.354 0.226 ...
##  $ day          : num  0.333 0.333 0.333 1 0.667 ...
##  $ hour         : num  0.304 0.478 0.696 0.304 1 ...
##  $ min          : num  0.5085 0.678 0.0847 0 0.3729 ...
##  $ sec          : num  0.644 0.458 0.407 0.153 0.22 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(test)
## Classes 'data.table' and 'data.frame':   59864 obs. of  10 variables:
##  $ is_attributed: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ip           : num  0.2593 0.1022 0.0229 0.2409 0.1463 ...
##  $ app          : num  0.02182 0.02364 0.01091 0.00364 0.02545 ...
##  $ device       : num  0.000259 0.000259 0.000259 0.000259 0.000259 ...
##  $ os           : num  0.015 0.015 0.0219 0.015 0.0219 ...
##  $ channel      : num  0.958 0.699 0.198 0.226 0.218 ...
##  $ day          : num  0.333 0 0.333 0.333 1 ...
##  $ hour         : num  0.087 0.7826 0.3478 0.0435 0.087 ...
##  $ min          : num  0.983 0.119 0.508 0.339 0.39 ...
##  $ sec          : num  0.1356 0 0 1 0.0169 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Now that we have done the class balancing we will create the first version of the predictive model, using the RandomForrest algorithm to try to find the most important variables for the analysis.

# Checking the importance of variables using RandomForrest
# Define a seed to allow the same experiment results to be reproducible
set.seed(100)
    
model1 <- randomForest(is_attributed ~ .,
                      data       = train, 
                      ntree      = 15, 
                      nodesize   = 2, 
                      importance = T)


# print model
model1
## 
## Call:
##  randomForest(formula = is_attributed ~ ., data = train, ntree = 15,      nodesize = 2, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 15
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.13%
## Confusion matrix:
##       0     1  class.error
## 0 69729    53 0.0007595082
## 1   126 69639 0.0018060632
# Plotting a graph to visualize the level of importance of each variable in the prediction process of the target variable
plot_rf <- as.data.frame(varImpPlot(model1))

# Sorting the name of the variables in descending order of importance
names <- rownames(plot_rf[order(plot_rf$MeanDecreaseAccuracy, decreasing = T),])

# Print the result
plot_rf[order(plot_rf$MeanDecreaseAccuracy, decreasing = T),]
##         MeanDecreaseAccuracy MeanDecreaseGini
## device             24.136254       36031.0289
## app                 7.442172       15880.0816
## ip                  7.060039        6843.8605
## channel             6.536286        2711.3583
## day                 6.302842        2789.2955
## hour                5.742821         392.0797
## sec                 5.511840         259.2693
## min                 5.235682         374.3236
## os                  4.510791        4536.3190

Making predictions

# Making Prections
pred1 <- predict(model1, test)
# Computes a confusion matrix generated from the model created
caret::confusionMatrix(test$is_attributed, pred1, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 29922    10
##          1    60 29872
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9985, 0.9991)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9977          
##                                           
##  Mcnemar's Test P-Value : 4.724e-09       
##                                           
##             Sensitivity : 0.9997          
##             Specificity : 0.9980          
##          Pos Pred Value : 0.9980          
##          Neg Pred Value : 0.9997          
##              Prevalence : 0.4992          
##          Detection Rate : 0.4990          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9988          
##                                           
##        'Positive' Class : 1               
## 
# More details about the model
summary(model1)
##                 Length Class  Mode     
## call                 6 -none- call     
## type                 1 -none- character
## predicted       139680 factor numeric  
## err.rate            45 -none- numeric  
## confusion            6 -none- numeric  
## votes           279360 matrix numeric  
## oob.times       139680 -none- numeric  
## classes              2 -none- character
## importance          36 -none- numeric  
## importanceSD        27 -none- numeric  
## localImportance      0 -none- NULL     
## proximity            0 -none- NULL     
## ntree                1 -none- numeric  
## mtry                 1 -none- numeric  
## forest              14 -none- list     
## y               139680 factor numeric  
## test                 0 -none- NULL     
## inbag                0 -none- NULL     
## terms                3 terms  call
# Calculating the AUC (Area Under the Curve) for the model
rfAUC <- auc(roc(as.integer(test$is_attributed), as.integer(pred1)))
rfAUC
## Area under the curve: 0.9988

We achieved great performance in the first version of the model, reaching an accuracy of 0.999. Let’s recreate the model with the most important variables (device, app, channel, ip, and os) to try to further improve the model’s accuracy.

# Building the model with the most relevant variables
# Defining formula to be used by the model.
f <- is_attributed ~ device + app + channel + ip + os

# Creating the final model based on the Random Forest algorithm
model_rf <- randomForest(f, 
                         ntree      = 15, 
                         nodesize   = 2, 
                         data       = train)
# Making predictions with the model based on the new Random Forest algorithm
pred2 <- predict(model_rf, test, type = 'response')
# Creating the Confusion Matrix
confusionMatrix(table(pred = pred2, data = test$is_attributed))
## Confusion Matrix and Statistics
## 
##     data
## pred     0     1
##    0 29919    58
##    1    13 29874
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9985, 0.9991)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9976          
##                                           
##  Mcnemar's Test P-Value : 1.772e-07       
##                                           
##             Sensitivity : 0.9996          
##             Specificity : 0.9981          
##          Pos Pred Value : 0.9981          
##          Neg Pred Value : 0.9996          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4998          
##    Detection Prevalence : 0.5008          
##       Balanced Accuracy : 0.9988          
##                                           
##        'Positive' Class : 0               
## 

Remember: 0 = Unrealized Download and 1 = Realized download

# More details about the model
summary(model_rf)
##                 Length Class  Mode     
## call                 5 -none- call     
## type                 1 -none- character
## predicted       139680 factor numeric  
## err.rate            45 -none- numeric  
## confusion            6 -none- numeric  
## votes           279360 matrix numeric  
## oob.times       139680 -none- numeric  
## classes              2 -none- character
## importance           5 -none- numeric  
## importanceSD         0 -none- NULL     
## localImportance      0 -none- NULL     
## proximity            0 -none- NULL     
## ntree                1 -none- numeric  
## mtry                 1 -none- numeric  
## forest              14 -none- list     
## y               139680 factor numeric  
## test                 0 -none- NULL     
## inbag                0 -none- NULL     
## terms                3 terms  call
# Calculating the AUC (Area Under the Curve) for the model
rfAUC2 <- auc(roc(as.integer(test$is_attributed), as.integer(pred2)))
rfAUC2
## Area under the curve: 0.9988

There was no significant improvement in the accuracy of the model after the selection of the most important variables.