Introduction

Background: Fraud risks are everywhere. Because of the overwhelming volume of click fraud, advertisers pay for a substantial amount of invalid traffic every year and receive misleading exposure figures.

Data Frame: The data is provided on Kaggle by TalkingData, an independent big data service platform focusing on data analysis and digital marketing. The data frame contains 184 million observations of advertisement clicks over 4 days and 7 variables: ip, app, device, os, channel, click_time, and attributed_time. (Link to Data: https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/data)

Question: In this project, we are interested in which statistical model has the best performance (accuracy) in predicting whether a given advertisement click is fraudulent, meaning the application is NOT actually downloaded when a user clicks the advertisement. We are also interested in which variable combination gives the best performance (accuracy) for this prediction.

Statistical Computation: We first split the data into 2 files (80% training & 20% testing), where each file is less than 4GB. We then train 5 classification algorithms, Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor, on the training data in parallel. A shell script then merges the accuracy output text files of the algorithms and writes the name of the algorithm with the highest accuracy into a text file. The data is then split again into 2 files (80% training & 20% testing), where each file is less than 4GB, and we train the highest-accuracy algorithm with 5 variable combinations in parallel. Finally, a shell script merges the accuracy output text files of the variable combinations and writes the best algorithm and variable combination into a text file. (See the Directed Acyclic Graph in the body of the report for more information.)

Conclusion: In conclusion, the Random Forest model with the variable combination ip + app + device + os has the highest accuracy in predicting whether the application is actually downloaded when a user clicks the advertisement.


Data

Source & Size: The data is from Kaggle, provided by TalkingData, an independent big data service platform focusing on data analysis and digital marketing (Link: https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/data). The data contains 7 variables and 184,903,891 observations (around 184 million). In this project, we use is_attributed as the response variable and ip, app, channel, device, and os as predictors. (We ignore click_time and attributed_time because we decided not to work with time series this time.)

Cleaning: The data frame is in good condition, since there are no NAs in the variables we choose. However, as the graph above shows, the data is severely imbalanced. In order to maintain a large data frame with enough observations, we oversample the minority class with SMOTE so that the class ratio is_attributed == 0 : is_attributed == 1 is roughly 0.8 : 1.
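As a rough illustration of this step, the sketch below uses the smotefamily package; the report only specifies method = SMOTE, so the package choice, the input file name, and the call details are assumptions.

    library(smotefamily)
    df  <- read.csv("train.csv")                          # assumed input file
    X   <- df[, c("ip", "app", "device", "os", "channel")]
    res <- SMOTE(X, target = df$is_attributed, K = 5)     # synthesize minority rows
    balanced <- res$data                                  # features plus a "class" column
    table(balanced$class)                                 # inspect the new class ratio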


Statistical Computation

README

Condor File
  • algorithm.sub : run algorithms in parallel.
  • variable.sub : run variable combinations in parallel.
  • submitAllJobs.dag : dag file to control the execution flow.
Executable File
  • submit.sh : submit the “submitAllJobs.dag” to CHTC.
  • alg.sh : executable file for “algorithm.sub”.
  • var.sh : executable file for “variable.sub”.
  • initialize.sh : initialize all required files and folders for execution.
  • merge.sh : summarize the results of the algorithms, write “alg.txt” to pass on the best algorithm.
  • result.sh : summarize the results of the variable combinations, write “final.txt” to indicate the best algorithm and variable combination.
  • clean.sh : clean all files for new submission/testing.
R File
  • data.R : read and oversample the data.
  • "#algorithm".R : run each algorithm and output its accuracy (the proportion of predictions equal to the true values).
  • "var#".R : run each variable combination in step-forward selection order.

Statistical Computing Process

CPU, Memory, Disk Usage
  • CPU: 2
  • Memory: 10GB
  • Disk: 10GB
  • #Jobs: algorithm.sub -> 5 jobs; variable.sub -> 5 jobs
Statistical Methods & Process
  • Considering this is a classification problem, the statistical methods we use are Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor.

  • The Directed Acyclic Graph below shows the process.

Directed Acyclic Graph

[Figure: Directed Acyclic Graph of the execution flow]

  • We first split the data into 2 files (80% training & 20% testing), where each file is less than 4GB.
  • Then, we train 5 classification algorithms, Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest-Neighbor, on the training data in parallel (a sketch of one such algorithm job follows this list).
  • Then, a shell script will run to merge the accuracy output text files of each algorithm and write the name of the algorithm with the highest accuracy into a text file.
  • Then, the data is split again into 2 files (80% training & 20% testing), where each file is less than 4GB.
  • We train the highest-accuracy algorithm with 5 variable combinations in parallel.
  • Finally, a shell script will run to merge the accuracy output text files of each variable combination and write the best algorithm and variable combinations into a text file.
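As an illustration of one algorithm job, the following sketch trains Naive Bayes via the e1071 package and writes its accuracy to a text file; the actual "#algorithm".R scripts are not reproduced in this report, so the package, file names, and details are assumptions.

    library(e1071)
    df    <- read.csv("train_part.csv")                     # assumed input file
    idx   <- sample(nrow(df), 0.8 * nrow(df))               # 80% training, 20% testing
    train <- df[idx, ]
    test  <- df[-idx, ]
    fit   <- naiveBayes(as.factor(is_attributed) ~ ip + app + device + os + channel,
                        data = train)
    pred  <- predict(fit, test)
    acc   <- mean(pred == test$is_attributed)               # accuracy: prediction == true value
    writeLines(sprintf("naive_bayes %.4f", acc), "naive_bayes_accuracy.txt")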

Results

  • The program runs successfully on sample data: a random sample of 100,000 observations from the original data frame. Each algorithm is trained on the same training data (80%) with the 5 selected predictors and tested on the same test data (20%).
Algorithm             Accuracy
Naive Bayes             0.867
Random Forest           0.9934
SVM                     0.8388
Logistic Regression     0.8798
kNN                     0.8413
  • The result above suggests Random Forest has the highest accuracy among all the algorithms we test.
Variable Combination                 Accuracy
ip                                     0.7823
ip + app                               0.9484
ip + app + device                      0.9849
ip + app + device + os                 0.9897
ip + app + device + os + channel       0.9953
  • The result above suggests no overfitting on the sample data: each added variable increases accuracy.

  • In short, based on sample data (100,000 observations), the best algorithm is Random Forest with the variable combination ip + app + device + os + channel.


Problems

  • However, running time and memory become an issue for the whole data frame (184 million observations). The graph below shows running time vs. number of observations, estimated from the “transfer-file-in” and “transfer-file-out” times for sample sizes of 10k, 100k, 300k, 500k, and 1M.

  • Specifically, for each job we take the difference between the “transfer-file-in” time and the “transfer-file-out” time and sum these differences, which gives the total CHTC computing time.

  • The plot above shows a clear linear relationship between running time and number of observations, so we fit a linear regression model to predict the running time for 184 million observations. The model below implies it would take 57,024.2 minutes = 950.4 hours to process the whole data frame.
## 
## Call:
## lm(formula = time ~ number, data = trend)
## 
## Residuals:
##       1       2       3       4       5 
##  14.308  -6.994 -14.098   3.998   2.787 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.8080     8.3108  -1.902   0.1533    
## number        3.1002     0.1599  19.384   0.0003 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.6 on 3 degrees of freedom
## Multiple R-squared:  0.9921, Adjusted R-squared:  0.9894 
## F-statistic: 375.7 on 1 and 3 DF,  p-value: 0.0002999
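The extrapolation itself is a one-line predict() call; the sketch below assumes number is measured in units of 10,000 observations (so 184 million observations corresponds to number = 18,400, consistent with the coefficients above):

    fit <- lm(time ~ number, data = trend)
    predict(fit, newdata = data.frame(number = 18400))   # ~57,024 minutes
    57024.2 / 60                                         # ~950.4 hours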
  • Moreover, 10GB of memory is not enough for the large data frame. (We also tested with 100GB; after running for 2 days, memory leaks occurred.)

  • Oversampling does not seem to help Random Forest: even though Random Forest has a high accuracy rate, the confusion matrix suggests it is not good at predicting whether the application is actually downloaded when a user clicks the advertisement (it performs poorly on is_attributed == 1). A better performance measure is needed.

Solutions to Problems

  1. We implement a new R script based on the mlr3verse package, which can use multiple CPUs to train the model, so the overall running time decreases (see the first sketch after this list).

  2. We notice that the Support Vector Machine runs very slowly on a large data frame, so we reduce the statistical models to Naive Bayes, Random Forest, and Logistic Regression. We also remove kNN to reduce the number of parallel jobs (less total memory and fewer CPUs required, hence less waiting time in the queue).

  3. Model training consumes a large amount of memory, so we reduce the training data from 80% of the whole data to 0.1% of the whole data, which improves training speed. The testing data is then 99.9% of the whole data, which also helps reflect the prediction accuracy of the algorithm comprehensively.

  4. Since 100GB of memory is not enough, we request 500GB. We also request 7 CPUs for faster computation.

  5. We are interested in finding “fraud clicks,” so it is more appropriate to use recall to measure model performance, where **recall** = (# correct positive predictions) / (# positive examples) = TP / (TP + FN) (see the second sketch after this list).
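For solution 1, here is a minimal sketch of multi-CPU training with mlr3verse, assuming the classif.ranger learner (from mlr3learners, backed by the ranger package) and hypothetical train_df / test_df data frames; the report does not show the actual script.

    library(mlr3verse)
    library(future)
    plan(multisession, workers = 7)                    # matches the 7 requested CPUs
    train_df$is_attributed <- as.factor(train_df$is_attributed)
    test_df$is_attributed  <- as.factor(test_df$is_attributed)
    task    <- as_task_classif(train_df, target = "is_attributed")
    learner <- lrn("classif.ranger", num.threads = 7)  # ranger's own thread pool
    learner$train(task)
    pred <- learner$predict_newdata(test_df)
    pred$score(msr("classif.acc"))

For solution 5, a small helper implementing the recall definition above; treating is_attributed == 1 as the positive class is an assumption based on the confusion-matrix discussion in the Problems section.

    # recall = TP / (TP + FN)
    recall <- function(pred, truth, positive = 1) {
      tp <- sum(pred == positive & truth == positive)  # correct positive predictions
      fn <- sum(pred != positive & truth == positive)  # positives we missed
      tp / (tp + fn)
    }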


New Statistical Computing Process

CPU, Memory, Disk Usage
  • CPU: 7
  • Memory: 500GB
  • Disk: 500GB
  • #Jobs: algorithm.sub -> 3 jobs; variable.sub -> 5 jobs
Statistical Methods & Process
  • Considering this is a classification problem, the statistical methods we use are Naive Bayes, Random Forest, and Logistic Regression.

  • The Directed Acyclic Graph below shows the Process.

Directed Acyclic Graph

[Figure: Directed Acyclic Graph of the new execution flow]

  • initialize.sh will prepare all required files for CHTC computing.
  • data.R randomly splits the whole data frame into 5 small independent csv files (a sketch follows this list).
  • Then, we train 3 classification algorithms, Naive Bayes, Random Forest, and Logistic Regression, in parallel on 3 of the 5 small csv files (each algorithm is trained on 1 csv file). Specifically, each csv file is split into 0.1% training and 99.9% testing for the algorithm.
  • Then, a shell script merge.sh will run to merge the accuracy output text files of each algorithm and write the name of the algorithm with the highest accuracy into a text file.
  • data.R randomly splits the whole data frame into 5 small independent csv files again.
  • We train the highest-accuracy algorithm with 5 variable combinations in parallel on the 5 small csv files (each variable combination is trained on 1 csv file). Specifically, each csv file is split into 0.1% training and 99.9% testing for the algorithm.
  • Finally, a shell script result.sh will run to merge the accuracy output text files of each variable combination and write the best algorithm and variable combination into a text file.
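Here is a hypothetical sketch of data.R's 5-way split and the per-job 0.1%/99.9% split, assuming data.table and the Kaggle file name train.csv (the actual script lives in the repository and is not reproduced in this report):

    library(data.table)
    dt <- fread("train.csv")
    dt[, chunk := sample(5, .N, replace = TRUE)]       # assign each row to 1 of 5 files
    for (i in 1:5) fwrite(dt[chunk == i, !"chunk"], sprintf("part%d.csv", i))

    # inside each job: 0.1% training, 99.9% testing
    part  <- fread("part1.csv")
    idx   <- sample(nrow(part), 0.001 * nrow(part))
    train <- part[idx]
    test  <- part[-idx]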

New Results

Algorithm             Recall
Naive Bayes           0.5432
Random Forest         0.884
Logistic Regression   0.6484
  • The result above suggests Random Forest has the highest recall among all the algorithms we test.
Variable Combination                 Accuracy
ip                                     0.534
ip + app                               0.7843
ip + app + device                      0.878
ip + app + device + os                 0.915
ip + app + device + os + channel       0.8552
  • The whole process takes several days to run.
  • The result above suggests overfitting does exist: adding channel lowers the accuracy, so the variable combination ip + app + device + os gives the highest accuracy.
  • In short, the best model is Random Forest with the variable combination ip + app + device + os for predicting whether the application is actually downloaded when a user clicks the advertisement.

Weakness

  • In order to increase the overall training speed, we divide the whole data into 5 small data frames and train each algorithm and variable combination on one of them. Even though this may not affect the overall result, it is unfair, because some algorithms and variable combinations may get a higher accuracy rate by luck.
  • For the algorithms part, we reduce the number of algorithms we test, so not all possible algorithms are tested.
  • For the variable combinations part, we only perform forward stepwise selection (sequential), so not all variable combinations are tested.
  • Finally, even though we greatly improved the training speed, the overall running time and memory usage are still not ideal.

Conclusion

Future Improvements

  • We may consider exploring other classification models and other variable combinations (for 5 factors, there are 5 + 10 + 10 + 5 + 1 = 31 combinations).
  • We may figure out other ways to measure algorithm performance besides “accuracy” and “recall”. For example, for random forest we can look into “out-of-bag predictions,” and for logistic regression we can look into “AIC” (see the sketch after this list).
  • We can also implement a linear model to estimate the current algorithm’s calculation speed.
  • Also, it is important to figure out a way to reduce the running time and memory usage significantly without sacrificing the fairness of algorithm selection (e.g. each algorithm has a different train & test data).
  • Possibly reduce the word count of the report.
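As a starting point for the alternative measures mentioned above, a hypothetical sketch using the randomForest package and base R's glm (the variable set and the data frame train are assumptions):

    library(randomForest)
    rf <- randomForest(as.factor(is_attributed) ~ ip + app + device + os, data = train)
    rf$err.rate[rf$ntree, "OOB"]       # out-of-bag error of the full forest

    fit <- glm(is_attributed ~ ip + app + device + os, data = train, family = binomial)
    AIC(fit)                           # AIC of the logistic regression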

Github Repository: https://github.com/runxuanli/479project