The risk of fraud is everywhere, but for companies that advertise online, click fraud can occur at an overwhelming scale, resulting in misleading click data and wasted money. Ad placements can inflate costs when people or bots click ads in large volumes without ever producing the expected result. With more than one billion mobile devices in use every month, China is the largest mobile market in the world, and it therefore suffers from huge volumes of fraudulent traffic.
TalkingData (https://www.talkingdata.com), China's largest independent big data platform, covers more than 70% of active mobile devices across the country. They handle three billion clicks a day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for application developers is to measure a user's click journey across their portfolio and flag IP addresses that produce many clicks but never end up installing applications. With this information, they created a blacklist of IPs and a blacklist of devices.
Although this approach is successful, they always want to be one step ahead of fraudsters and have asked for help to develop the solution further. In summary, using a dataset available on Kaggle (https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/overview), we built an algorithm that predicts whether a user will download an app after clicking on a mobile app ad, which lets us flag a click as fraudulent or not. The project was built in R.
# Importing the required libraries
library(data.table)
library(dplyr)
library(ggplot2)
library(lubridate)
library(fasttime)
library(Amelia)
library(caret)
library(randomForest)
library(smotefamily)
library(scales)
library(imbalance)
library(caTools)
library(pROC)
# Loading train data
df_original <- fread('train_sample.csv')
To maintain the integrity of the original data, we will make a copy of the dataset for manipulation.
# Copying the dataset (data.table::copy ensures a real copy, not a reference to df_original)
data <- copy(df_original)
# Checking the dataset
head(data)
## ip app device os channel click_time attributed_time
## 1: 87540 12 1 13 497 2017-11-07 09:30:38
## 2: 105560 25 1 17 259 2017-11-07 13:40:27
## 3: 101424 12 1 19 212 2017-11-07 18:05:24
## 4: 94584 13 1 13 477 2017-11-07 04:58:08
## 5: 68413 12 1 1 178 2017-11-09 09:00:09
## 6: 93663 3 1 17 115 2017-11-09 01:22:13
## is_attributed
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## 6: 0
glimpse(data)
## Rows: 100,000
## Columns: 8
## $ ip <int> 87540, 105560, 101424, 94584, 68413, 93663, 17059, 121~
## $ app <int> 12, 25, 12, 13, 12, 3, 1, 9, 2, 3, 3, 3, 3, 6, 2, 25, ~
## $ device <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, ~
## $ os <int> 13, 17, 19, 13, 1, 17, 17, 25, 22, 19, 22, 13, 22, 20,~
## $ channel <int> 497, 259, 212, 477, 178, 115, 135, 442, 364, 135, 489,~
## $ click_time <chr> "2017-11-07 09:30:38", "2017-11-07 13:40:27", "2017-11~
## $ attributed_time <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ""~
## $ is_attributed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
summary(data)
## ip app device os
## Min. : 9 Min. : 1.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 40552 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 13.00
## Median : 79827 Median : 12.00 Median : 1.00 Median : 18.00
## Mean : 91256 Mean : 12.05 Mean : 21.77 Mean : 22.82
## 3rd Qu.:118252 3rd Qu.: 15.00 3rd Qu.: 1.00 3rd Qu.: 19.00
## Max. :364757 Max. :551.00 Max. :3867.00 Max. :866.00
## channel click_time attributed_time is_attributed
## Min. : 3.0 Length:100000 Length:100000 Min. :0.00000
## 1st Qu.:145.0 Class :character Class :character 1st Qu.:0.00000
## Median :258.0 Mode :character Mode :character Median :0.00000
## Mean :268.8 Mean :0.00227
## 3rd Qu.:379.0 3rd Qu.:0.00000
## Max. :498.0 Max. :1.00000
According to the documentation, each row of the training data is a click record containing the following variables:
ip: IP address of the click.
app: app id for marketing.
device: device type id of the user's mobile phone (e.g., iPhone 6 Plus, iPhone 7, Huawei Mate 7, etc.).
os: OS version id of the user's mobile phone.
channel: channel id of the mobile ad publisher.
click_time: timestamp of the click (UTC).
attributed_time: if the user downloaded the app after clicking the ad, the time of the download.
is_attributed: the target to be predicted, indicating whether the app was downloaded.
Note that ip, app, device, os, and channel are encoded.
# Transforming variables types
# Converting click_time to date type
data$click_time <- fastPOSIXct(data$click_time)
# Converting attributed_time to date type (empty strings become NA)
data$attributed_time <- fastPOSIXct(data$attributed_time)
# Checking for missing values
missmap(data,
main = "Training Data - Missing Values Map",
col = c("yellow", "black"),
legend = FALSE)
There are many missing values in attributed_time: for most records the user did NOT download the app, so no download time was recorded and the field is left empty.
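As a quick sanity check (a minimal sketch using the columns already in the dataset), we can confirm that attributed_time is missing exactly when no download occurred:
# Cross-tabulating missingness of attributed_time against the target:
# the NA count should line up exactly with the number of non-downloads
table(missing_time = is.na(data$attributed_time), downloaded = data$is_attributed)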
# Checking for duplicate records
table(duplicated(data))
##
## FALSE TRUE
## 99999 1
# Eliminating duplicate records from the dataset.
data <- data[!duplicated(data), ]
Let’s explore some categorical variables.
# Checking the dataset
glimpse(data)
## Rows: 99,999
## Columns: 8
## $ ip <int> 87540, 105560, 101424, 94584, 68413, 93663, 17059, 121~
## $ app <int> 12, 25, 12, 13, 12, 3, 1, 9, 2, 3, 3, 3, 3, 6, 2, 25, ~
## $ device <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, ~
## $ os <int> 13, 17, 19, 13, 1, 17, 17, 25, 22, 19, 22, 13, 22, 20,~
## $ channel <int> 497, 259, 212, 477, 178, 115, 135, 442, 364, 135, 489,~
## $ click_time <dttm> 2017-11-07 07:30:38, 2017-11-07 11:40:27, 2017-11-07 ~
## $ attributed_time <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ is_attributed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# Checking the unique values for each variable
plot_var <- c("os", "channel", "device", "app", "attributed_time", "click_time", "ip")
data[, lapply(.SD, uniqueN), .SDcols = plot_var] %>%
melt(variable.name = "features", value.name = "unique_values") %>%
ggplot(aes(reorder(features, -unique_values), unique_values)) +
geom_bar(stat = "identity", fill = "steelblue") +
scale_y_log10(breaks = c(50,100,250, 500, 10000, 50000)) +
geom_text(aes(label = unique_values), vjust = 1.6, color = "white", size=3.5) +
theme_minimal() +
labs(x = "features", y = "Number of unique values")
# Analysing target variable
table(data$is_attributed)
##
## 0 1
## 99772 227
data %>%
mutate(Downloaded = factor(is_attributed, labels = c('No', 'Yes'))) %>%
ggplot(aes(x = Downloaded, fill = Downloaded)) +
geom_bar() +
labs(title = 'Target Variable (is_attributed) Balance') +
ylab('rows') +
theme_minimal()
We can see that the classes are heavily imbalanced, so we must correct this before running the predictive modeling.
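Before correcting it, we can quantify the imbalance with a quick proportion check:
# Class proportions: only about 0.2% of clicks result in a download
prop.table(table(data$is_attributed))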
# Exploring categorical variables
# Var: os
data[, .N, by = os][order(-N)][1:10] %>%
ggplot(aes(reorder(os, -N), N)) +
geom_bar(stat="identity", fill="steelblue") +
theme_minimal() +
geom_text(aes(label = round(N / sum(N), 2)), vjust = 1.6, color = "white", size=2.5) +
labs(x = "os") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Var: channel
data[, .N, by = channel][order(-N)][1:10] %>%
ggplot(aes(reorder(channel, -N), N)) +
geom_bar(stat="identity", fill="steelblue") +
theme_minimal() +
geom_text(aes(label = round(N / sum(N), 2)), vjust = 1.6, color = "white", size=2.5) +
labs(x = "channel") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Var: device
data[, .N, by = device][order(-N)][1:10] %>%
ggplot(aes(reorder(device, -N), N)) +
geom_bar(stat="identity", fill="steelblue") +
theme_minimal() +
geom_text(aes(label = round(N / sum(N), 2)), vjust = 1.6, color = "white", size=2.5) +
labs(x = "device") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Var: app
data[, .N, by = app][order(-N)][1:10] %>%
ggplot(aes(reorder(app, -N), N)) +
geom_bar(stat="identity", fill="steelblue") +
theme_minimal() +
geom_text(aes(label = round(N / sum(N), 2)), vjust = 1.6, color = "white", size=2.5) +
labs(x = "app") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Visually, we can see how the categorical variables behave and which values are the most frequent (note that the proportions printed on the bars are relative to the ten values shown, not to the whole dataset).
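As a small illustrative check (reusing the data.table syntax from the plots above), we can also measure how concentrated a variable is in its single most frequent value:
# Share of all clicks accounted for by the most frequent device id
data[, .N, by = device][order(-N)][1, N] / nrow(data)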
# Time Series Analysis
# Summarizing the click_time variable
summary(data$click_time)
## Min. 1st Qu. Median
## "2017-11-06 14:00:00" "2017-11-07 09:34:09" "2017-11-08 05:07:49"
## Mean 3rd Qu. Max.
## "2017-11-08 04:29:52" "2017-11-09 00:06:01" "2017-11-09 13:59:51"
The dataset covers a period of three days, from 6 November to 9 November 2017.
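We can confirm the exact span directly:
# Time span covered by the clicks
difftime(max(data$click_time), min(data$click_time))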
# Number of clicks that resulted in a download
data %>%
mutate(dates = floor_date(click_time, unit = 'hour')) %>%
group_by(dates) %>%
summarise(downloads = sum(as.numeric(is_attributed==1))) %>%
ggplot(aes(x = dates, y = downloads)) +
geom_line() +
scale_x_datetime(date_breaks = '2 hours', date_labels = '%d %b - %H') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
xlab('Time') +
ylab('Total Downloads') +
labs(title = 'Downloads per hour')
# Number of clicks that did NOT result in a download
data %>%
mutate(dates = floor_date(click_time, unit = 'hour')) %>%
group_by(dates) %>%
summarise(nodownloads = sum(!is_attributed)) %>%
ggplot(aes(x = dates, y = nodownloads)) +
geom_line() +
scale_x_datetime(date_breaks = '2 hours', date_labels = '%d %b - %H') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
xlab('Time') +
ylab('Clicks without download') +
labs(title = 'Clicks without download per hour')
We can clearly see a daily pattern in when downloads happen, with volumes well below average between 5 pm and 7 pm.
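Aggregating by hour of day rather than by timestamp makes this daily cycle explicit (a minimal sketch; hour_of_day is just an illustrative helper column):
# Total downloads per hour of day across the whole period,
# sorted so the quietest hours appear first
data %>%
  mutate(hour_of_day = hour(click_time)) %>%
  group_by(hour_of_day) %>%
  summarise(downloads = sum(is_attributed)) %>%
  arrange(downloads)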
# Checking the difference, in seconds, between the click on the ad and the time that the download was performed
secs_diff <- difftime(data$attributed_time, data$click_time, units="secs")
secs_diff <- na.omit(secs_diff)
summary(as.numeric(secs_diff))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 52.5 198.0 4499.6 4887.5 46341.0
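The distribution is strongly right-skewed: half of the downloads happen within roughly 200 seconds of the click, while a long tail pulls the mean up to about 75 minutes. The upper quantiles make this explicit:
# Median and upper quantiles of the click-to-download delay, in seconds
quantile(as.numeric(secs_diff), probs = c(0.5, 0.9, 0.99))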
# Splitting click_time into day, hour, minute, and second components and dropping the raw timestamp columns
train_data <- data %>%
mutate(day = day(click_time), hour = hour(click_time), min = minute(click_time), sec = second(click_time)) %>%
select(-c(click_time, attributed_time))
# Converting the target variable to factor
train_data$is_attributed <- as.factor(train_data$is_attributed)
str(train_data)
## Classes 'data.table' and 'data.frame': 99999 obs. of 10 variables:
## $ ip : int 87540 105560 101424 94584 68413 93663 17059 121505 192967 143636 ...
## $ app : int 12 25 12 13 12 3 1 9 2 3 ...
## $ device : int 1 1 1 1 1 1 1 1 2 1 ...
## $ os : int 13 17 19 13 1 17 17 25 22 19 ...
## $ channel : int 497 259 212 477 178 115 135 442 364 135 ...
## $ is_attributed: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 7 7 7 7 9 8 8 7 8 8 ...
## $ hour : int 7 11 16 2 7 23 23 8 7 10 ...
## $ min : int 30 40 5 58 0 22 17 1 35 35 ...
## $ sec : num 38 27 24 8 9 13 58 53 17 26 ...
## - attr(*, ".internal.selfref")=<externalptr>
# Normalizing data: rescale() from 'scales' maps each feature to the [0, 1] range;
# column 6 (the target, is_attributed) is excluded and re-attached afterwards
dados_norm <- as.data.frame(lapply(train_data[, -6], rescale))
train_data <- cbind(train_data[, 6], dados_norm)
# Class Balancing with Random Walk Oversampling (rwo, from the 'imbalance' package):
# generating 99545 synthetic minority-class rows (99772 - 227) to equalize the classes
new.sample <- rwo(train_data, numInstances = 99545, classAttr = "is_attributed")
train_data <- rbind(train_data, new.sample)
table(new.sample$is_attributed)
##
## 0 1
## 0 99545
table(train_data$is_attributed)
##
## 0 1
## 99772 99772
# Split data into train and test datasets
split = sample.split(train_data$is_attributed, SplitRatio = 0.70)
train = subset(train_data, split==TRUE)
test = subset(train_data, split==FALSE)
str(train)
## Classes 'data.table' and 'data.frame': 139680 obs. of 10 variables:
## $ is_attributed: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ip : num 0.24 0.289 0.278 0.188 0.257 ...
## $ app : num 0.02 0.04364 0.02 0.02 0.00364 ...
## $ device : num 0.000259 0.000259 0.000259 0.000259 0.000259 ...
## $ os : num 0.01501 0.01963 0.02194 0.00115 0.01963 ...
## $ channel : num 0.998 0.517 0.422 0.354 0.226 ...
## $ day : num 0.333 0.333 0.333 1 0.667 ...
## $ hour : num 0.304 0.478 0.696 0.304 1 ...
## $ min : num 0.5085 0.678 0.0847 0 0.3729 ...
## $ sec : num 0.644 0.458 0.407 0.153 0.22 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(test)
## Classes 'data.table' and 'data.frame': 59864 obs. of 10 variables:
## $ is_attributed: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ip : num 0.2593 0.1022 0.0229 0.2409 0.1463 ...
## $ app : num 0.02182 0.02364 0.01091 0.00364 0.02545 ...
## $ device : num 0.000259 0.000259 0.000259 0.000259 0.000259 ...
## $ os : num 0.015 0.015 0.0219 0.015 0.0219 ...
## $ channel : num 0.958 0.699 0.198 0.226 0.218 ...
## $ day : num 0.333 0 0.333 0.333 1 ...
## $ hour : num 0.087 0.7826 0.3478 0.0435 0.087 ...
## $ min : num 0.983 0.119 0.508 0.339 0.39 ...
## $ sec : num 0.1356 0 0 1 0.0169 ...
## - attr(*, ".internal.selfref")=<externalptr>
Now that we have balanced the classes, we will create the first version of the predictive model, using the Random Forest algorithm to find the most important variables for the analysis.
# Checking variable importance with Random Forest
# Define a seed to allow the same experiment results to be reproducible
set.seed(100)
model1 <- randomForest(is_attributed ~ .,
data = train,
ntree = 15,
nodesize = 2,
importance = T)
# print model
model1
##
## Call:
## randomForest(formula = is_attributed ~ ., data = train, ntree = 15, nodesize = 2, importance = T)
## Type of random forest: classification
## Number of trees: 15
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.13%
## Confusion matrix:
## 0 1 class.error
## 0 69729 53 0.0007595082
## 1 126 69639 0.0018060632
# Plotting a graph to visualize the level of importance of each variable in the prediction process of the target variable
plot_rf <- as.data.frame(varImpPlot(model1))
# Sorting the name of the variables in descending order of importance
names <- rownames(plot_rf[order(plot_rf$MeanDecreaseAccuracy, decreasing = T),])
# Print the result
plot_rf[order(plot_rf$MeanDecreaseAccuracy, decreasing = T),]
## MeanDecreaseAccuracy MeanDecreaseGini
## device 24.136254 36031.0289
## app 7.442172 15880.0816
## ip 7.060039 6843.8605
## channel 6.536286 2711.3583
## day 6.302842 2789.2955
## hour 5.742821 392.0797
## sec 5.511840 259.2693
## min 5.235682 374.3236
## os 4.510791 4536.3190
# Making predictions
pred1 <- predict(model1, test)
# Computing the confusion matrix for the predictions
caret::confusionMatrix(test$is_attributed, pred1, positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 29922 10
## 1 60 29872
##
## Accuracy : 0.9988
## 95% CI : (0.9985, 0.9991)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9977
##
## Mcnemar's Test P-Value : 4.724e-09
##
## Sensitivity : 0.9997
## Specificity : 0.9980
## Pos Pred Value : 0.9980
## Neg Pred Value : 0.9997
## Prevalence : 0.4992
## Detection Rate : 0.4990
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9988
##
## 'Positive' Class : 1
##
# More details about the model
summary(model1)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 139680 factor numeric
## err.rate 45 -none- numeric
## confusion 6 -none- numeric
## votes 279360 matrix numeric
## oob.times 139680 -none- numeric
## classes 2 -none- character
## importance 36 -none- numeric
## importanceSD 27 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 139680 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
# Calculating the AUC (Area Under the Curve) for the model
rfAUC <- auc(roc(as.integer(test$is_attributed), as.integer(pred1)))
rfAUC
## Area under the curve: 0.9988
We achieved excellent performance with the first version of the model, reaching an accuracy of 0.9988. Let's rebuild the model with only the most important variables (device, app, channel, ip, and os) to try to improve its accuracy further.
# Building the model with the most relevant variables
# Defining formula to be used by the model.
f <- is_attributed ~ device + app + channel + ip + os
# Creating the final model based on the Random Forest algorithm
model_rf <- randomForest(f,
ntree = 15,
nodesize = 2,
data = train)
# Making predictions with the reduced Random Forest model
pred2 <- predict(model_rf, test, type = 'response')
# Creating the Confusion Matrix
confusionMatrix(table(pred = pred2, data = test$is_attributed))
## Confusion Matrix and Statistics
##
## data
## pred 0 1
## 0 29919 58
## 1 13 29874
##
## Accuracy : 0.9988
## 95% CI : (0.9985, 0.9991)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9976
##
## Mcnemar's Test P-Value : 1.772e-07
##
## Sensitivity : 0.9996
## Specificity : 0.9981
## Pos Pred Value : 0.9981
## Neg Pred Value : 0.9996
## Prevalence : 0.5000
## Detection Rate : 0.4998
## Detection Prevalence : 0.5008
## Balanced Accuracy : 0.9988
##
## 'Positive' Class : 0
##
Remember: 0 = the app was not downloaded and 1 = the app was downloaded.
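Note that caret chose 0 as the 'positive' class here, because it defaults to the first factor level when none is specified. To report sensitivity and specificity for the download class, as in the first matrix, the positive level can be set explicitly:
# Same confusion matrix, but with class 1 (download) as the positive class
confusionMatrix(table(pred = pred2, data = test$is_attributed), positive = '1')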
# More details about the model
summary(model_rf)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 139680 factor numeric
## err.rate 45 -none- numeric
## confusion 6 -none- numeric
## votes 279360 matrix numeric
## oob.times 139680 -none- numeric
## classes 2 -none- character
## importance 5 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 139680 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
# Calculating the AUC (Area Under the Curve) for the model
rfAUC2 <- auc(roc(as.integer(test$is_attributed), as.integer(pred2)))
rfAUC2
## Area under the curve: 0.9988
There was no significant change in performance after selecting the most important variables: the reduced model matches the full model's accuracy (0.9988) and AUC (0.9988) while using only five predictors.
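As a closing illustration (a sketch, recomputing the pROC objects since only the AUC values were stored above), the two ROC curves can be overlaid to confirm how close the models are:
# Overlaying the ROC curves of both models
roc1 <- roc(as.integer(test$is_attributed), as.integer(pred1))
roc2 <- roc(as.integer(test$is_attributed), as.integer(pred2))
plot(roc1, col = "steelblue")
lines(roc2, col = "tomato", lty = 2)
legend("bottomright", legend = c("model1 (all variables)", "model_rf (top variables)"),
       col = c("steelblue", "tomato"), lty = c(1, 2), lwd = 2)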