ndg8743/MalwareClassification

Malware Classification using Machine Learning

Live Demo: https://gopee.dev/malware/

An interactive Streamlit dashboard for exploring malware classification with machine learning. It visualizes the CIC-MalMem-2022 dataset, compares models side by side, and analyzes five ML algorithms (Linear Regression, Logistic Regression, Naive Bayes, Random Forest, XGBoost) trained on memory forensics features extracted with Volatility.

Binary classification of malware vs benign samples using memory forensics features from the CIC-MalMem-2022 dataset.

Quick Start

Run everything with a single command:

python run.py

Project Status

Dataset: 57,995 samples (29,226 benign + 28,769 malware)

Features: 52 memory forensics features (3 zero-variance features removed)

Task: Binary classification (Malware vs Benign)

Algorithms: Linear Regression, Logistic Regression, Naive Bayes, Random Forest, XGBoost
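The feature count above mentions that zero-variance features were removed. A minimal sketch of that kind of filter follows, assuming scikit-learn's `VarianceThreshold`; the project may do this differently (for example directly in pandas). The column names and values are made up for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame; the real dataset has many more columns and rows.
df = pd.DataFrame({
    "pslist.nproc": [40, 45, 38, 52],
    "handles.nhandles": [8000, 9100, 7600, 10050],
    "constant_feature": [0, 0, 0, 0],  # same value in every row: zero variance
})

# threshold=0.0 keeps only features whose variance is strictly greater than 0.
selector = VarianceThreshold(threshold=0.0)
selector.fit(df)

kept = df.columns[selector.get_support()].tolist()
dropped = [c for c in df.columns if c not in kept]
reduced = df[kept]

print("dropped:", dropped)
print("remaining shape:", reduced.shape)
```

A constant column carries no information for separating malware from benign samples, so dropping it shrinks the feature matrix without affecting what a model can learn.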

Memory Forensics Features

Features are extracted using the Volatility framework, grouped by plugin:

  • pslist: Process count, threads, handlers
  • handles: File/registry/mutex handles
  • malfind: Memory injection detection
  • ldrmodules: Hidden DLL detection
  • psxview: Process hiding (rootkit detection)
  • svcscan: Service manipulation
  • callbacks: Kernel callback hooks
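To see how many features each plugin in the list above contributes, the columns can be grouped by plugin name. The sketch below assumes the columns follow a `<plugin>.<metric>` naming pattern; the names listed are illustrative and should be checked against the actual CSV header.

```python
from collections import defaultdict

# Hypothetical column names following an assumed "<plugin>.<metric>" pattern.
columns = [
    "pslist.nproc", "pslist.avg_threads",
    "handles.nhandles", "handles.nmutant",
    "malfind.ninjections",
    "psxview.not_in_pslist",
]

def group_by_plugin(names):
    """Map each plugin prefix to the list of metric names under it."""
    groups = defaultdict(list)
    for name in names:
        plugin, _, metric = name.partition(".")
        groups[plugin].append(metric)
    return dict(groups)

feature_groups = group_by_plugin(columns)
for plugin, metrics in feature_groups.items():
    print(f"{plugin}: {len(metrics)} feature(s)")
```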

References

Primary Dataset

Canadian Institute for Cybersecurity. "Malware Memory Analysis | CIC-MalMem-2022." University of New Brunswick. https://www.unb.ca/cic/datasets/malmem-2022.html

Cevallos-Salas, D. et al. "Obfuscated Privacy Malware Classifiers Based on Memory Dumping Analysis." Universidad San Francisco de Quito. https://research.usfq.edu.ec/en/publications/obfuscated-privacy-malware-classifiers-based-on-memory-dumping-an/

"Detecting New Obfuscated Malware Variants: A Lightweight and Interpretable Machine Learning Approach." ScienceDirect, 2024. https://www.sciencedirect.com/science/article/pii/S2667305324001467

"Development of a Deep Stacked Ensemble with Process Based Volatile Memory Forensics for Platform Independent Malware Detection and Classification." ScienceDirect. https://www.sciencedirect.com/science/article/abs/pii/S0957417423004542

"Obfuscated Malware Detection Using Dilated Convolutional Network." IEEE Conference Publication. https://ieeexplore.ieee.org/document/10307598

Shamshad, H. et al. "Obfuscated Malware Detection Using Deep Neural Network with ANOVA Feature Selection on CIC-MalMem-2022 Dataset." Journal of Scientific and Technical Information Technologies, Mechanics and Optics. https://ntv.ifmo.ru/en/article/23095/

"Obfuscated Memory Malware Detection in Resource-Constrained IoT Devices for Smart City Applications." Sensors (MDPI), 2023. https://www.mdpi.com/1424-8220/23/11/5348

Preprints

"Obfuscated Malware Detection: Investigating Real-world Scenarios through Memory Analysis." arXiv, 2404.02372, 2024. https://arxiv.org/abs/2404.02372


Models (scikit-learn)

Linear Regression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Logistic Regression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Gaussian Naive Bayes:
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

XGBoost (Gradient Boosting): https://xgboost.readthedocs.io/en/stable/

XGBoost Python API: https://xgboost.readthedocs.io/en/stable/python/python_api.html

XGBoost GPU Support: https://xgboost.readthedocs.io/en/stable/gpu/index.html

Understanding Gradient Boosting: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/


Evaluation

Confusion Matrix:
https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix

ROC Curve:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

Understanding ROC-AUC:
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

Cross-Validation:
https://scikit-learn.org/stable/modules/cross_validation.html

cross_val_score (K-Fold): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
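As a rough illustration of how the evaluation tools linked above fit together, the sketch below computes a confusion matrix, ROC-AUC, and 5-fold cross-validation scores. It uses synthetic data from `make_classification` and a Logistic Regression model as stand-ins; none of the numbers it prints reflect this project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary data standing in for the real feature matrix.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))

# ROC-AUC needs scores or probabilities for the positive class, not hard labels.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# 5-fold cross-validation on the training split only.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=5, scoring="accuracy"
)

print(cm)
print(f"ROC-AUC: {auc:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```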


Hyperparameter Tuning

RandomizedSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Hyperparameter tuning guide: https://scikit-learn.org/stable/modules/grid_search.html

XGBoost parameter tuning: https://xgboost.readthedocs.io/en/stable/parameter.html


Preprocessing & Feature Selection

RobustScaler (handles outliers): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

LabelEncoder (encode target labels): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

SelectKBest (feature selection): https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

Chi-squared test for feature selection: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html

Train/test split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Understanding data leakage: https://machinelearningmastery.com/data-leakage-machine-learning/
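One way to combine the preprocessing pieces listed above without leaking test data is to wrap them in a scikit-learn `Pipeline`, so that feature selection and scaling are fitted on the training split only. The sketch below is an assumption about how such a pipeline could look, not the project's actual code. It places `SelectKBest(chi2)` before `RobustScaler` because `chi2` requires non-negative inputs and scaled values can be negative. The data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, RobustScaler

rng = np.random.default_rng(0)

# Synthetic non-negative count features with string class labels.
labels = np.array(["Benign", "Malware"] * 100)
X = rng.poisson(lam=5, size=(200, 8)).astype(float)
X[labels == "Malware", 0] += 10  # make column 0 informative

# LabelEncoder sorts classes alphabetically: Benign -> 0, Malware -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Every step is fitted on X_train only when pipeline.fit is called.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=4)),
    ("scale", RobustScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Fitting the scaler or selector on the full dataset before splitting would let test-set statistics influence training, which is the leakage the last link above describes.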


Data

pandas - 10 minutes tutorial: https://pandas.pydata.org/docs/user_guide/10min.html

pandas - Data structures: https://pandas.pydata.org/docs/user_guide/dsintro.html

NumPy reference: https://numpy.org/doc/stable/reference/index.html


Visualization

Matplotlib - pyplot tutorial: https://matplotlib.org/stable/tutorials/introductory/pyplot.html

Matplotlib - Customizing: https://matplotlib.org/stable/tutorials/introductory/customizing.html

Matplotlib - Subplots and layouts: https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html

Matplotlib - Bar labels: https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_label_demo.html

Matplotlib - Scatter plots: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

Seaborn - heatmap: https://seaborn.pydata.org/generated/seaborn.heatmap.html

Seaborn - countplot: https://seaborn.pydata.org/generated/seaborn.countplot.html


Hardware Monitoring (06_testing.ipynb)

psutil (CPU/RAM monitoring): https://psutil.readthedocs.io/en/latest/

psutil - System monitoring: https://psutil.readthedocs.io/en/latest/#system-related-functions

GPUtil (GPU monitoring): https://github.com/anderskm/gputil

Python threading module: https://docs.python.org/3/library/threading.html

Background threads for monitoring: https://realpython.com/intro-to-python-threading/


Tree Visualization (05_xgboost.ipynb)

XGBoost plot_tree: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.plot_tree

Feature importance in XGBoost: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.Booster.get_score

Understanding feature importance: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

Graphviz (tree visualization backend): https://graphviz.org/


Dimensionality Reduction

PCA (Principal Component Analysis): https://scikit-learn.org/stable/modules/decomposition.html#pca

PCA API reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

t-SNE (t-Distributed Stochastic Neighbor Embedding): https://scikit-learn.org/stable/modules/manifold.html

t-SNE API reference: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Understanding t-SNE: https://distill.pub/2016/misread-tsne/

Feature correlation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html
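For orientation, a minimal PCA projection to two components might look like the sketch below. It runs on random synthetic data and standardizes with `StandardScaler`, which is an assumption made for this example; the project may scale differently. t-SNE follows the same `fit_transform` pattern but is much slower on large datasets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data with one strongly correlated pair of columns.
X = rng.normal(size=(300, 10))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=300)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each component, largest first.
explained = pca.explained_variance_ratio_
print(X_2d.shape, explained)
```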


Memory Forensics

Volatility Framework (memory analysis): https://github.com/volatilityfoundation/volatility

Volatility documentation: https://volatility3.readthedocs.io/

Memory forensics techniques: https://www.volatilityfoundation.org/


Learning

Python Data Science Handbook:
https://jakevdp.github.io/PythonDataScienceHandbook/

scikit-learn - Algorithm selection:
https://scikit-learn.org/stable/tutorial/machine_learning_map/

Binary Classification overview:
https://machinelearningmastery.com/types-of-classification-in-machine-learning/


Python

JSON module: https://docs.python.org/3/library/json.html

File I/O: https://docs.python.org/3/tutorial/inputoutput.html


Model Persistence

joblib (saving/loading models): https://joblib.readthedocs.io/en/latest/

scikit-learn model persistence: https://scikit-learn.org/stable/model_persistence.html
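Saving and reloading a trained model with joblib can be sketched as follows. The model, data, and file name are placeholders chosen for the example, not the project's actual artifacts.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset and model standing in for the real ones.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "random_forest.joblib")  # hypothetical file name
    joblib.dump(model, model_path)      # serialize the fitted estimator
    restored = joblib.load(model_path)  # load it back into memory

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("predictions match:", same_predictions)
```

Load joblib files only from trusted sources, and with a scikit-learn version compatible with the one used to save them, as the scikit-learn persistence guide linked above recommends.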
