Introduction

Background: Fraud risks are everywhere. Due to the overwhelming volume of click fraud, advertisers pay for a substantial amount of invalid traffic each year and receive misleading exposure metrics.

Data Frame: The data is provided on Kaggle by TalkingData, an independent big data service platform focusing on data analysis and digital marketing. The data frame contains 184 million observations of advertisement clicks collected over 4 days, with 7 variables: ip, app, device, os, channel, click_time, and attributed_time. (Link to Data: https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/data)

Question: In this project, we are interested in which statistical model has the best performance (accuracy) in predicting whether a given advertisement click is fraudulent, meaning the application is NOT actually downloaded after a user clicks the advertisement. We are also interested in which variable combination yields the best performance (accuracy) for the same prediction task.

Statistical Computation: We first split the data into 2 files (80% training & 20% testing), each under 4GB. We then train 5 classification algorithms (Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor) on the training data in parallel. A shell script merges the accuracy output text files of the algorithms and writes the name of the most accurate algorithm to a text file. The data is then split again into 2 files (80% training & 20% testing), each under 4GB, and we train the highest-accuracy algorithm on 5 variable combinations in parallel. Finally, a shell script merges the accuracy output text files of the variable combinations and writes the best algorithm and variable combination to a text file. (Check the Directed Acyclic Graph in the body for more information.)

Conclusion: In conclusion, the Random Forest model with the variable combination ip + app + device + os + channel achieves the highest accuracy in predicting whether the application is actually downloaded when a user clicks the advertisement.


Data

Source & Size: The data comes from Kaggle, provided by TalkingData, an independent big data service platform focusing on data analysis and digital marketing (Link: https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/data). The data contains 7 variables and 184,903,891 observations (around 184 million). In this project, we use is_attributed as the response variable and ip, app, channel, device, and os as factors. (We ignore click_time and attributed_time because we chose not to work with time series in this project.)

Cleaning: The data frame is in good condition: there are no NAs in the variables we choose. However, as the graph above shows, the data is seriously imbalanced. In order to maintain a large data frame with enough observations, we oversample the minority class (method = SMOTE) so that is_attributed == 0 : is_attributed == 1 ~ 0.8 : 1.


Statistical Computation

README

Condor File
  • algorithm.sub : run algorithms in parallel.
  • variable.sub : run variable combinations in parallel.
  • submitAllJobs.dag : dag file to control the execution flow.
Executable File
  • submit.sh : submit the “submitAllJobs.dag” to CHTC.
  • alg.sh : executable file for “algorithm.sub”.
  • var.sh : executable file for “variable.sub”.
  • initialize.sh : initialize all required files and folders for execution.
  • merge.sh : summarize the results of each algorithm, write “alg.txt” to pass along the best algorithm.
  • result.sh : summarize the results of each variable combination, write “final.txt” to indicate the best algorithm and variable combination.
  • clean.sh : clean all files for new submission/testing.
R File
  • data.R : read and oversample the data.
  • "#algorithm".R : run each algorithm and output its accuracy (the proportion of predictions equal to the true value).
  • "var#".R : run each variable combination in forward-selection order.
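
The merge step can be sketched in shell as follows. This is a minimal illustration of what merge.sh might do, not the project's actual script: the file names and the one-line "name accuracy" output format are assumptions, and the example accuracies are taken from the Result section below.

```shell
#!/bin/bash
# Hypothetical sketch of the merge step (file names are illustrative).
# Assume each algorithm job wrote one line, "<name> <accuracy>", to <name>_accuracy.txt.
# Example inputs for demonstration:
printf 'NaiveBayes 0.867\n'    > NaiveBayes_accuracy.txt
printf 'RandomForest 0.9934\n' > RandomForest_accuracy.txt
printf 'SVM 0.8388\n'          > SVM_accuracy.txt

# Merge all per-algorithm accuracy files into one summary file.
cat ./*_accuracy.txt > merged.txt

# Sort by the accuracy column (general numeric, descending) and keep the
# algorithm name from the top row; this is what gets passed to the next stage.
sort -k2,2 -gr merged.txt | head -n 1 | cut -d' ' -f1 > alg.txt

cat alg.txt
```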

Statistical Computing Process

CPU, Memory, Disk Usage
  • CPU: 2
  • Memory: 10GB
  • Disk: 10GB
  • #Jobs: algorithm.sub -> 5 jobs; variable.sub -> 5 jobs
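
The resource requests above could appear in a submit file roughly like the following sketch; the universe, argument names, and queue list are assumptions for illustration, not the project's actual algorithm.sub:

```
# Hypothetical sketch of algorithm.sub (argument names are illustrative).
universe       = vanilla
executable     = alg.sh
arguments      = $(method)
request_cpus   = 2
request_memory = 10GB
request_disk   = 10GB
# One job per algorithm: 5 jobs in total.
queue method from (
  naivebayes
  randomforest
  svm
  logistic
  knn
)
```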
Statistical Methods & Process
  • Considering this is a classification problem, the statistical methods we use are Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor.

  • The Directed Acyclic Graph below shows the Process.

Directed Acyclic Graph


  • We first split data into 2 files (80% training & 20% testing) where each file is less than 4GB.
  • Then, we train 5 classification algorithms (Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor) on the training data in parallel.
  • Then, a shell script will run to merge the accuracy output text files of each algorithm and write the name of the algorithm with the highest accuracy into a text file.
  • Then, data is split again into 2 files (80% training & 20% testing) where each file is less than 4GB.
  • We train the highest-accuracy algorithm with 5 variable combinations in parallel.
  • Finally, a shell script will run to merge the accuracy output text files of each variable combination and write the best algorithm and variable combinations into a text file.
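
The steps above could be wired together in a DAG file along the lines of the following sketch; the job names and the placement of the POST scripts are assumptions, not the project's actual submitAllJobs.dag:

```
# Hypothetical sketch of submitAllJobs.dag (job names are illustrative).
JOB  algorithms  algorithm.sub
JOB  variables   variable.sub
# After the 5 algorithm jobs finish, merge.sh writes the best algorithm to alg.txt.
SCRIPT POST algorithms merge.sh
# The 5 variable-combination jobs start only once the best algorithm is known.
PARENT algorithms CHILD variables
# After the 5 variable jobs finish, result.sh writes the final answer to final.txt.
SCRIPT POST variables result.sh
```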

Result

  • The program runs successfully on sample data, which consists of 100,000 randomly selected observations from the original data frame. Each algorithm is trained on the same training data (80%) with the 5 selected factors and tested on the same test data (20%).
Algorithm            Accuracy
Naive Bayes          0.867
Random Forest        0.9934
SVM                  0.8388
Logistic Regression  0.8798
kNN                  0.8413
  • The result above suggests Random Forest has the highest accuracy among all the algorithms we test.
Variable Combination              Accuracy
ip                                0.7823
ip + app                          0.9484
ip + app + device                 0.9849
ip + app + device + os            0.9897
ip + app + device + os + channel  0.9953
  • The result above suggests that adding variables does not cause overfitting here: each additional variable increases test accuracy.

  • In short, based on sample data (100,000 observations), the best algorithm is Random Forest with the variable combination ip + app + device + os + channel.


Problems

  • However, running time and memory become an issue for the whole data frame (184 million observations). The graph below shows Running Time vs. Number of Observations, which we estimate from the “transfer-file-in” and “transfer-file-out” times at sample sizes of 10k, 100k, 300k, 500k, and 1M observations.

  • Specifically, for each job we calculate the difference between its “transfer-file-out” time and its “transfer-file-in” time, and sum these differences to obtain the total CHTC computing time.
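
The estimate described above can be sketched in shell. This is a minimal illustration, assuming the per-job transfer times have already been extracted into a file of epoch-second pairs; the file name and format are hypothetical:

```shell
#!/bin/bash
# Hypothetical sketch of the timing estimate (file name and format are illustrative).
# Assume times.txt holds one line per job:
# "<transfer-file-in epoch seconds> <transfer-file-out epoch seconds>".
printf '100 160\n200 230\n' > times.txt   # two example jobs: 60s and 30s

# Sum (out - in) over all jobs to estimate total CHTC computing time.
awk '{ total += $2 - $1 } END { print total }' times.txt   # prints 90
```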