- Deployed Spark clusters in the cloud (Google Cloud) to scale.
- Dataset: ADFA-LD.
- Architecture: Logistic Regression w/ limited memory BFGS (LR-LBFGS), SVM.
Previous works
- Haider et al.: Proposed Nested-Arc Hidden Semi-Markov Model (NAHSMM) to improve training efficiency of HMM.
- Chen et al.: Achieved high speed incremental learning of data streams by implementing a real-time self-structuring learning framework, AnRAD.
- Used multiple-length n-gram method for preprocessing, feature extraction and testing on ADFA-LD. Generated TF-IDF from the word-vectors.
- AUC is used to evaluate detection accuracy.
- Designed an RDD repartitioning method using Spark API, to transform raw system call RDDs into RDDs with more partitions. When they were loaded and cached, they might not have been enough partitioned. This step is important for even distribution of partitions to the worker nodes.
Possible improvement(s)
- How computationally expensive it will be if we deploy NN architectures in the clusters?