Background: Fraud risks are everywhere. Due to the overwhelming volume of click fraud, advertisers pay for a substantial amount of invalid traffic each year and receive misleadingly inflated exposure numbers.
Data Frame: The data is provided by TalkingData, an independent big data service platform focusing on data analysis and digital marketing, on Kaggle. The data frame contains roughly 184 million observations of advertisement clicks over 4 days and 7 variables: `ip`, `app`, `device`, `os`, `channel`, `click_time`, and `attributed_time`. (Link to Data: https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/data)
Question: In this project, we are interested in which statistical model performs best (in accuracy) at predicting whether a given advertisement click is fraudulent, i.e., the application is NOT actually downloaded after the user clicks the advertisement. We are also interested in which combination of variables yields the best predictive accuracy.
Statistical Computation: We first split the data into 2 files (80% training & 20% testing), each smaller than 4 GB. We then train 5 classification algorithms (Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor) on the training data in parallel. A shell script then merges the accuracy output text files of the algorithms and writes the name of the algorithm with the highest accuracy into a text file. The data is split again into 2 files (80% training & 20% testing), each smaller than 4 GB, and we train the highest-accuracy algorithm with 5 variable combinations in parallel. Finally, a shell script merges the accuracy output text files of the variable combinations and writes the best algorithm and variable combination into a text file. (See the Directed Acyclic Graph in the body for more information.)
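The merge-and-select step can be sketched as follows. This is an illustrative Python sketch (the project itself does this with merge.sh in shell); the directory layout and the convention that each job writes its accuracy as a single number to a file named after the algorithm are assumptions:

```python
import glob
import os

def pick_best(result_dir):
    """Scan per-job accuracy files (one number per file, named after the
    algorithm, e.g. 'RandomForest.txt') and return the (name, accuracy)
    pair with the highest accuracy."""
    best_name, best_acc = None, float("-inf")
    for path in glob.glob(os.path.join(result_dir, "*.txt")):
        with open(path) as f:
            acc = float(f.read().strip())
        name = os.path.splitext(os.path.basename(path))[0]
        if acc > best_acc:
            best_name, best_acc = name, acc
    return best_name, best_acc
```

The same logic applies unchanged in the second stage, where the per-job files hold the accuracy of each variable combination instead of each algorithm.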
Conclusion: In conclusion, the statistical model Random Forest with the combination of the `ip`, `app`, `device`, `os`, and `channel` variables has the highest accuracy in predicting whether the application is actually downloaded when a user clicks the advertisement.
Source & Size: The data is from Kaggle, provided by TalkingData, an independent big data service platform focusing on data analysis and digital marketing (Link: https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/data). The data contains 7 variables and 184,903,891 observations (around 184 million). In this project, we use `is_attributed` as the response variable and `ip`, `app`, `channel`, `device`, and `os` as predictors. (We ignore `click_time` and `attributed_time` because we decided not to work with time series this time.)
- `ip`: IP address of the click.
- `app`: app ID for marketing, used to identify the advertisement.
- `channel`: channel ID of the mobile ad publisher, used to identify the advertisement publisher.
- `device`: device type ID of the user's mobile phone (e.g., iPhone 7, Huawei Mate 7).
- `os`: operating-system version ID of the user's mobile phone (e.g., iOS, Android).
- `click_time`: timestamp (UTC) of the click.
- `attributed_time`: if the user downloads the app after clicking an ad, the time of the download.

Cleaning: The data frame is in good condition since there are no `NA`s in the variables we chose. However, from the graph above, it is obvious that the data is seriously imbalanced. In order to keep a large data frame with enough observations, we decided to oversample (method = SMOTE) the data until `is_attributed == 0` : `is_attributed == 1` ≈ 0.8 : 1.
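The idea behind SMOTE can be sketched as follows. Our project performs the oversampling in R (in data.R); this is an illustrative, minimal Python version that interpolates between minority-class neighbours, not the implementation we actually ran. To reach the 0.8 : 1 ratio, the number of synthetic samples would be chosen as roughly n_majority / 0.8 − n_minority:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: synthesize n_new samples of the minority
    class by interpolating between a random minority point and one of
    its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)                        # pick a minority sample
        d = np.linalg.norm(X_min - X_min[j], axis=1)
        neighbours = np.argsort(d)[1:k + 1]        # its k nearest neighbours
        nb = X_min[rng.choice(neighbours)]
        gap = rng.random()                         # point on the segment
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic
```

Because each synthetic point lies on a segment between two real minority samples, SMOTE adds plausible new observations rather than exact duplicates, which is why it is preferred over naive random oversampling here.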
- `algorithm.sub`: runs the algorithms in parallel.
- `variable.sub`: runs the variable combinations in parallel.
- `submitAllJobs.dag`: DAG file that controls the execution flow.
- `submit.sh`: submits "submitAllJobs.dag" to CHTC.
- `alg.sh`: executable file for "algorithm.sub".
- `var.sh`: executable file for "variable.sub".
- `initialize.sh`: initializes all required files and folders for execution.
- `merge.sh`: summarizes the result of each algorithm and writes "alg.txt" to pass on the best algorithm.
- `result.sh`: summarizes the result of each variable combination and writes "final.txt" to indicate the best algorithm and variable combination.
- `clean.sh`: cleans all files for a new submission/testing.
- `data.R`: reads and oversamples the data.
- `"algorithm#".R`: runs one algorithm and produces its accuracy (the proportion of predictions equal to the true value).
- `"var#".R`: runs each variable combination in step-forward selection order.

Considering this is a classification problem, the statistical methods we use are Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor.
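Each job computes accuracy as the proportion of predictions that equal the true value, and the jobs are then ranked. A minimal Python sketch of that evaluation logic (the project does this across R scripts and a shell merge step; the `models` mapping of names to predict callables is an assumption for illustration):

```python
def accuracy(y_pred, y_true):
    """Accuracy as used in the project: fraction where prediction == true value."""
    assert len(y_pred) == len(y_true)
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

def rank_models(models, X_test, y_test):
    """Evaluate each candidate model on held-out data and sort by accuracy,
    best first. `models` maps a name to a predict(X) callable."""
    scores = {name: accuracy(predict(X_test), y_test)
              for name, predict in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```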
The Directed Acyclic Graph below shows the process.
Directed Acyclic Graph
| Algorithm | Accuracy |
|---|---|
| Naive Bayes | 0.867 |
| Random Forest | 0.9934 |
| SVM | 0.8388 |
| Logistic Regression | 0.8798 |
| kNN | 0.8413 |
| Variable Combination | Accuracy |
|---|---|
| ip | 0.7823 |
| ip + app | 0.9484 |
| ip + app + device | 0.9849 |
| ip + app + device + os | 0.9897 |
| ip + app + device + os + channel | 0.9953 |
The results above suggest that overfitting is not present and that adding more variables leads to higher accuracy.
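The five combinations were fixed in advance in step-forward order (var1 … var5), but the underlying idea is greedy forward selection, which can be sketched as follows (illustrative Python; the `score` callable standing in for held-out accuracy is an assumption):

```python
def forward_selection(candidates, score):
    """Greedy step-forward selection sketch: starting from an empty set,
    repeatedly add the variable that most improves the score, stopping
    when no addition helps. `score(variables)` returns held-out accuracy."""
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        trial = [(score(selected + [v]), v) for v in remaining]
        s, v = max(trial)               # best single addition this round
        if s <= best:
            break                       # no variable improves the score
        selected.append(v)
        remaining.remove(v)
        best = s
    return selected, best
```

Running all five prefix combinations in parallel, as the project does, trades the early-stopping of this loop for wall-clock speed on CHTC.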
In short, based on sample data (100,000 observations), the best algorithm is Random Forest with the variable combination `ip + app + device + os + channel`.
However, running time and memory become an issue for the whole data frame (184 million observations). The graph below shows Running Time vs. Number of Observations, which we estimate from the "transfer-file-in" and "transfer-file-out" times at sample sizes of 10k, 100k, 300k, 500k, and 1M.
Specifically, for each job we take the difference between its "transfer-file-in" time and its "transfer-file-out" time and sum these differences, which gives the total amount of CHTC computing time.
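That estimate can be sketched as follows (illustrative Python; the timestamp format and the per-job (in, out) pairing are assumptions about how the job-log times are recorded):

```python
from datetime import datetime

def total_compute_seconds(jobs):
    """Sum, over all jobs, the gap between when the input transfer finished
    ('transfer-file-in') and when the output transfer started
    ('transfer-file-out'), approximating total CHTC computing time.
    `jobs` is a list of (in_time, out_time) timestamp strings."""
    fmt = "%Y-%m-%d %H:%M:%S"
    total = 0.0
    for t_in, t_out in jobs:
        delta = datetime.strptime(t_out, fmt) - datetime.strptime(t_in, fmt)
        total += delta.total_seconds()
    return total
```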