Background: Fraud risks are everywhere. Due to the overwhelming volume of click fraud, advertisers are forced to pay for a substantial amount of general invalid traffic each year and receive misleading exposure numbers.
Data Frame: The data is provided on Kaggle by TalkingData, an independent big data service platform focusing on data analysis and digital marketing. The data frame contains 184 million observations of advertisement clicks over 4 days and 7 variables: `ip`, `app`, `device`, `os`, `channel`, `click_time`, and `attributed_time`. (Link to Data: https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/data)
Question: In this project, we are interested in which statistical model has the best performance (accuracy) in predicting whether a given advertisement click is fraudulent, meaning the application is NOT actually downloaded when a user clicks the advertisement. We are also interested in which variable combination gives the best performance (accuracy) for this prediction.
Statistical Computation: We first split the data into 2 files (80% training & 20% testing) where each file is less than 4 GB. We then train 5 classification algorithms (Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor) on the training data in parallel. A shell script then merges the accuracy output text files of the algorithms and writes the name of the algorithm with the highest accuracy into a text file. The data is split again into 2 files (80% training & 20% testing) where each file is less than 4 GB, and we train the highest-accuracy algorithm with 5 variable combinations in parallel. Finally, a shell script merges the accuracy output text files of the variable combinations and writes the best algorithm and variable combination into a text file. (See the Directed Acyclic Graph in the body for more information.)
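As a rough illustration of what each parallel job does, the sketch below trains one classifier and writes its accuracy to a text file for the merge step; the file names, the `e1071` package, and the data frame layout are assumptions, not the project's exact scripts.

```r
# Sketch of one parallel algorithm job: fit a model on the training split,
# score the testing split, and write the accuracy so a shell script can later
# compare all algorithms and keep the best one.
library(e1071)  # assumed package providing naiveBayes()

train <- read.csv("train.csv")  # assumed file names produced by the 80/20 split
test  <- read.csv("test.csv")
train$is_attributed <- factor(train$is_attributed)
test$is_attributed  <- factor(test$is_attributed)

fit  <- naiveBayes(is_attributed ~ ip + app + device + os + channel, data = train)
pred <- predict(fit, newdata = test)

acc <- mean(pred == test$is_attributed)  # accuracy: prediction == true value
writeLines(paste("naive_bayes", acc), "naive_bayes_accuracy.txt")
```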
Conclusion: The statistical model Random Forest with the variable combination `ip` + `app` + `device` + `os` has the highest accuracy in predicting whether the application is actually downloaded when a user clicks the advertisement.
Source & Size: The data is from Kaggle, provided by TalkingData, an independent big data service platform focusing on data analysis and digital marketing (Link: https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/data). The data contains 7 variables and 184,903,891 observations (around 184 million). In this project, we use `is_attributed` as the response variable and `ip`, `app`, `channel`, `device`, and `os` as factors. (We choose to ignore `click_time` and `attributed_time` because we decided not to work with time series this time.)
- `ip`: IP address of the click.
- `app`: application id for marketing, which is used to identify the advertisement.
- `channel`: channel id of the mobile ad publisher, which is used to identify the advertisement publisher.
- `device`: device type id of the user's mobile phone (e.g., iPhone 7, Huawei Mate 7, etc.).
- `os`: operating system version id of the user's mobile phone (e.g., iOS, Android).
- `click_time`: the time (UTC) at which the click action is done.
- `attributed_time`: if the user downloads the app after clicking an ad, this is the time of the app download.

Cleaning: The data frame is in good condition since there are no `NA`s in the variables we choose. However, from the graph above, it is obvious that the data is seriously imbalanced. In order to maintain a large data frame with enough observations, we decide to oversample (method = SMOTE) the data (`is_attributed == 0` : `is_attributed == 1` ~ 0.8 : 1).
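As a minimal sketch of this step (handled by `data.R`, described later), the snippet below reads only the columns we use and oversamples the minority class with SMOTE; the `smotefamily` package, file name, and `dup_size` setting are assumptions rather than the project's exact code.

```r
# Read only the needed columns of the large csv, then apply SMOTE so that the
# class ratio moves toward is_attributed == 0 : is_attributed == 1 ~ 0.8 : 1.
library(data.table)   # fread() for fast reading of the large csv
library(smotefamily)  # assumed SMOTE implementation

cols   <- c("ip", "app", "device", "os", "channel", "is_attributed")
clicks <- fread("train.csv", select = cols)  # assumed file name

# SMOTE() takes the numeric predictors and the target vector separately and
# returns the rebalanced data in $data, with the class stored in column "class".
sm <- SMOTE(X = as.data.frame(clicks[, !"is_attributed"]),
            target = clicks$is_attributed,
            K = 5,           # number of nearest neighbours used to synthesize
            dup_size = 0)    # 0 = let SMOTE choose; tune to reach roughly 0.8 : 1

balanced <- sm$data          # original + synthetic rows
table(balanced$class)        # check the resulting class ratio
```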
- `algorithm.sub`: run algorithms in parallel.
- `variable.sub`: run variable combinations in parallel.
- `submitAllJobs.dag`: DAG file to control the execution flow.
- `submit.sh`: submit "submitAllJobs.dag" to CHTC.
- `alg.sh`: executable file for "algorithm.sub".
- `var.sh`: executable file for "variable.sub".
- `initialize.sh`: initialize all required files and folders for execution.
- `merge.sh`: summarize the result of each algorithm and write "alg.txt" to pass on the best algorithm.
- `result.sh`: summarize the result of each variable combination and write "final.txt" to indicate the best algorithm and variable combination.
- `clean.sh`: clean all files for a new submission/test.
- `data.R`: read and oversample the data.
- `"#algorithm".R`: run one algorithm and produce its accuracy (where prediction == true value).
- `"var#".R`: run each variable combination in step-forward selection order (see the sketch after this list).

Considering this is a classification problem, the statistical methods we use are Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor.
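The `"var#".R` scripts referenced above test the combinations in forward order; a condensed sketch is below, collapsing the five parallel jobs into one loop and assuming the `randomForest` package and the `train`/`test` objects from the earlier sketch.

```r
# The five step-forward variable combinations, tested here with Random Forest
# (the winner of the algorithm stage). In the actual workflow each formula is
# its own parallel job writing its own accuracy file.
library(randomForest)

combos <- list(
  is_attributed ~ ip,
  is_attributed ~ ip + app,
  is_attributed ~ ip + app + device,
  is_attributed ~ ip + app + device + os,
  is_attributed ~ ip + app + device + os + channel
)

for (i in seq_along(combos)) {
  fit  <- randomForest(combos[[i]], data = train, ntree = 100)  # is_attributed must be a factor
  pred <- predict(fit, newdata = test)
  acc  <- mean(pred == test$is_attributed)
  writeLines(paste(deparse(combos[[i]]), acc), sprintf("var%d_accuracy.txt", i))
}
```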
The Directed Acyclic Graph below shows the process.
Algorithm | Accuracy |
---|---|
Naive Bayes | 0.867 |
Random Forest | 0.9934 |
SVM | 0.8388 |
Logistic Regression | 0.8798 |
kNN | 0.8413 |
Variable Combination | Accuracy |
---|---|
ip | 0.7823 |
ip + app | 0.9484 |
ip + app + device | 0.9849 |
ip + app + device + os | 0.9897 |
ip + app + device + os + channel | 0.9953 |
The results above suggest that overfitting is not an issue and that adding more variables leads to higher accuracy.
In short, based on sample data (100,000 observations), the best algorithm is Random Forest with the variable combination `ip + app + device + os + channel`.
However, running time and memory become an issue for the whole data frame (184 million observations). The graph below shows Running Time vs. Number of Observations, estimated from the "transfer-file-in" and "transfer-file-out" times for samples of 10k, 100k, 300k, 500k, and 1M observations.
Specifically, we calculate the difference between the "transfer-file-in" time and the "transfer-file-out" time for each job and sum these differences, which gives the total CHTC computing time.
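A sketch of this calculation and the resulting trend fit is below; the `jobs` and `trend` object names and the time unit are assumptions.

```r
# For each job, CHTC computing time is approximated by the gap between its
# "transfer-file-in" and "transfer-file-out" timestamps; summing over all jobs
# gives the total time for one sample size.
total_time <- sum(as.numeric(difftime(jobs$transfer_out, jobs$transfer_in,
                                      units = "mins")))

# `trend` collects one (number, time) pair per sample size (10k ... 1M);
# fit a linear trend of computing time on the number of observations.
fit <- lm(time ~ number, data = trend)
summary(fit)  # produces the output shown below
```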
```
##
## Call:
## lm(formula = time ~ number, data = trend)
##
## Residuals:
##       1       2       3       4       5
##  14.308  -6.994 -14.098   3.998   2.787
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.8080     8.3108  -1.902   0.1533
## number        3.1002     0.1599  19.384   0.0003 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.6 on 3 degrees of freedom
## Multiple R-squared:  0.9921, Adjusted R-squared:  0.9894
## F-statistic: 375.7 on 1 and 3 DF,  p-value: 0.0002999
```
Moreover, 10 GB of memory is not enough for the full data frame. (We tested with 100 GB; after running for 2 days, memory leaks occurred.)
Oversampling does not seem to help Random Forest: even though Random Forest has a high accuracy rate, the confusion matrix suggests it is not the best at predicting whether the application is actually downloaded when a user clicks the advertisement (it is not good at predicting `is_attributed == 1`). A better performance measure is needed.
We implement a new R script based on the `mlr3verse` library, which can use multiple CPUs to train the model so that the overall running time is reduced.
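A minimal sketch of this setup is below; the data frame name, worker count, and learner choice (`classif.ranger`) are assumptions, not the project's exact script.

```r
# Train and evaluate a classifier with mlr3verse while spreading the work over
# several CPUs: the future plan parallelizes the resampling folds, and ranger
# itself uses multiple threads inside each fold.
library(mlr3verse)
library(future)

plan(multisession, workers = 7)  # assumed worker count

train_df$is_attributed <- factor(train_df$is_attributed)  # classification target
task    <- as_task_classif(train_df, target = "is_attributed")
learner <- lrn("classif.ranger", num.trees = 200, num.threads = 7)

rr <- resample(task, learner, rsmp("cv", folds = 5))  # folds run in parallel
rr$aggregate(msr("classif.acc"))                       # overall accuracy
```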
We notice that the Support Vector Machine runs very slowly on a large data frame, so we modify the set of statistical models we plan to use to Naive Bayes, Random Forest, and Logistic Regression. We also remove kNN to reduce the number of parallel jobs (reducing the total memory and CPUs required, and therefore the waiting time).
Model training costs a large amount of memory, so we decide to reduce the training data from 80% of the whole data to 0.1% of the whole data, which improves training speed. The testing data is then 99.9% of the whole data, which also helps reflect the prediction accuracy of the algorithm comprehensively.
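The split itself can be as simple as the sketch below (the object names and seed are assumptions):

```r
# Use 0.1% of the rows for training and the remaining 99.9% for testing.
set.seed(479)                                    # assumed seed for reproducibility
n         <- nrow(clicks)                        # `clicks` = full data frame
train_idx <- sample(n, size = round(0.001 * n))  # 0.1% of observations
train_df  <- clicks[train_idx, ]
test_df   <- clicks[-train_idx, ]
```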
Since 100 GB of memory is not enough, we request 500 GB of memory. We also request 7 CPUs for faster computation.
We are interested in finding "fraud clicks," so it is more appropriate to use recall to evaluate model performance, where **recall** = (number of correct positive predictions) / (number of positive examples).
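For reference, a small helper for this measure (assuming the positive class is `is_attributed == 1`, as in the earlier confusion-matrix discussion):

```r
# recall = correct positive predictions / positive examples
recall <- function(truth, pred, positive = 1) {
  tp <- sum(pred == positive & truth == positive)  # correct positive predictions
  fn <- sum(pred != positive & truth == positive)  # positives that were missed
  tp / (tp + fn)
}

# Example (assumed objects): recall(test_df$is_attributed, pred)
```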
`algorithm.sub` -> 3 jobs; `variable.sub` -> 5 jobs.

Considering this is a classification problem, the statistical methods we use are Naive Bayes, Random Forest, and Logistic Regression.
The Directed Acyclic Graph below shows the process.
- `initialize.sh` prepares all required files for CHTC computing.
- `data.R` randomly splits the whole data frame into 5 small independent CSV files.
- `merge.sh` merges the accuracy output text files of the algorithms and writes the name of the algorithm with the highest accuracy into a text file.
- `result.sh` merges the accuracy output text files of the variable combinations and writes the best algorithm and variable combination into a text file.

Algorithm | Recall |
---|---|
Naive Bayes | 0.5432 |
Random Forest | 0.884 |
Logistic Regression | 0.6484 |
Random Forest has the best performance (recall) among all the algorithms we test.

Variable Combination | Accuracy |
---|---|
ip | 0.534 |
ip + app | 0.7843 |
ip + app + device | 0.878 |
ip + app + device + os | 0.915 |
ip + app + device + os + channel | 0.8552 |
- `ip + app + device + os` would lead to the highest accuracy.
- Random Forest with the variable combination `ip + app + device + os` would be good at predicting whether the application is actually downloaded when a user clicks the advertisement.
- Luck: we use step-forward selection (sequential), so not all variable combinations are tested.

In conclusion, the statistical model Random Forest with the variable combination `ip + app + device + os` has the highest accuracy in predicting whether the application is actually downloaded when a user clicks the advertisement.
Practical Interpretation: Random Forest tends to handle large data frames more efficiently than other algorithms, so when dealing with a large data frame where prediction accuracy matters more than model interpretability, Random Forest should be considered. Moreover, `channel` seems to be less significant than the other variables, which suggests that when tracking fraud clicks in online advertising, investigators should put more effort into analyzing clickers' information (e.g., IP address, device, OS) and advertisement providers' information (e.g., software companies) than the platform's information (e.g., YouTube).
current algorithm's calculation speed.

Github Repository: https://github.com/runxuanli/479project