Malware Classification using Machine Learning

Live Demo: https://gopee.dev/malware/

An interactive Streamlit dashboard for exploring malware classification with machine learning. It provides visualizations of the CIC-MalMem-2022 dataset, model comparison tools, and a detailed analysis of five algorithms (Linear Regression, Logistic Regression, Naive Bayes, Random Forest, XGBoost) trained on memory forensics features extracted with Volatility.

Binary classification of malware vs benign samples using memory forensics features from the CIC-MalMem-2022 dataset.

Quick Start

Run everything with a single command:

python run.py

Project Status

Dataset: 57,995 samples (29,226 benign + 28,769 malware)

Features: 52 memory forensics features (3 zero-variance removed)

Task: Binary classification (Malware vs Benign)

Algorithms: Linear Regression, Logistic Regression, Naive Bayes, Random Forest, XGBoost
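
The zero-variance removal mentioned under Features can be sketched as below. This is a minimal illustration, not the project's actual preprocessing code: the helper name drop_zero_variance and the toy column names are assumptions made for the example.

```python
import pandas as pd


def drop_zero_variance(df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    """Return df without constant columns, plus the names of the dropped columns."""
    # A column with a single distinct value has zero variance and cannot help a classifier.
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    return df.drop(columns=constant), constant


# Toy frame with hypothetical column names; "always_zero" carries no information.
toy = pd.DataFrame({
    "pslist.nproc": [41, 45, 38, 52],
    "always_zero": [0, 0, 0, 0],
    "malfind.ninjections": [0, 3, 0, 7],
})

reduced, dropped = drop_zero_variance(toy)
print(dropped)
print(list(reduced.columns))
```

Applied to the feature columns only (not the label), a step like this would account for the three removed features.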

Memory Forensics Features

Features extracted using the Volatility framework, grouped by plugin:

pslist: Process count, threads, handles
handles: File/registry/mutex handles
malfind: Memory injection detection
ldrmodules: Hidden DLL detection
psxview: Process hiding (rootkit detection)
svcscan: Service manipulation
callbacks: Kernel callback hooks
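
To inspect how many features each plugin contributes, the columns can be grouped by their prefix. The sketch below assumes the CSV columns follow a plugin.metric naming pattern (for example pslist.nproc); the README does not state this, and the column names used here are illustrative.

```python
from collections import defaultdict


def group_by_plugin(columns):
    """Group column names by the text before the first dot (assumed to be the plugin name)."""
    groups = defaultdict(list)
    for name in columns:
        plugin, _, _ = name.partition(".")
        groups[plugin].append(name)
    return dict(groups)


# Hypothetical column names, used only to demonstrate the grouping.
columns = [
    "pslist.nproc",
    "pslist.avg_threads",
    "handles.nfile",
    "malfind.ninjections",
    "psxview.not_in_pslist",
]

groups = group_by_plugin(columns)
for plugin, names in groups.items():
    print(plugin, len(names))
```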

References

Primary Dataset

Canadian Institute for Cybersecurity. "Malware Memory Analysis | CIC-MalMem-2022." University of New Brunswick. https://www.unb.ca/cic/datasets/malmem-2022.html

Cevallos-Salas, D. et al. "Obfuscated Privacy Malware Classifiers Based on Memory Dumping Analysis." Universidad San Francisco de Quito. https://research.usfq.edu.ec/en/publications/obfuscated-privacy-malware-classifiers-based-on-memory-dumping-an/

"Detecting New Obfuscated Malware Variants: A Lightweight and Interpretable Machine Learning Approach." ScienceDirect, 2024. https://www.sciencedirect.com/science/article/pii/S2667305324001467

"Development of a Deep Stacked Ensemble with Process Based Volatile Memory Forensics for Platform Independent Malware Detection and Classification." ScienceDirect. https://www.sciencedirect.com/science/article/abs/pii/S0957417423004542

"Obfuscated Malware Detection Using Dilated Convolutional Network." IEEE Conference Publication. https://ieeexplore.ieee.org/document/10307598

Shamshad, H. et al. "Obfuscated Malware Detection Using Deep Neural Network with ANOVA Feature Selection on CIC-MalMem-2022 Dataset." Journal of Scientific and Technical Information Technologies, Mechanics and Optics. https://ntv.ifmo.ru/en/article/23095/

"Obfuscated Memory Malware Detection in Resource-Constrained IoT Devices for Smart City Applications." Sensors (MDPI), 2023. https://www.mdpi.com/1424-8220/23/11/5348

Preprints

"Obfuscated Malware Detection: Investigating Real-world Scenarios through Memory Analysis." arXiv, 2404.02372, 2024. https://arxiv.org/abs/2404.02372

Models (scikit-learn)

Linear Regression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Logistic Regression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Gaussian Naive Bayes:
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

XGBoost (Gradient Boosting): https://xgboost.readthedocs.io/en/stable/

XGBoost Python API: https://xgboost.readthedocs.io/en/stable/python/python_api.html

XGBoost GPU Support: https://xgboost.readthedocs.io/en/stable/gpu/index.html

Understanding Gradient Boosting: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
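
Linear Regression is a regressor rather than a classifier, so using it for the binary task needs a decision rule. One common option, assumed here and not confirmed by the project, is to fit on 0/1 labels and threshold the continuous prediction at 0.5. The sketch uses synthetic data from make_classification instead of CIC-MalMem-2022.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1 plays the role of "malware", 0 of "benign".
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit a plain least-squares model on the 0/1 labels.
reg = LinearRegression().fit(X_train, y_train)

# Predictions are unbounded real numbers; the 0.5 cut-off is an assumption of this sketch.
scores = reg.predict(X_test)
y_pred = (scores >= 0.5).astype(int)

accuracy = float(np.mean(y_pred == y_test))
print(f"accuracy: {accuracy:.3f}")
```

Unlike Logistic Regression, the raw scores are not probabilities and can fall outside the [0, 1] range.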

Evaluation

Confusion Matrix:
https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix

ROC Curve:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

Understanding ROC-AUC:
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

Cross-Validation:
https://scikit-learn.org/stable/modules/cross_validation.html

cross_val_score (K-Fold): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
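
The evaluation tools linked above fit together roughly as follows. This is a minimal sketch on synthetic data; the model choice, split sizes, and fold count are assumptions for illustration and not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=12, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))

# ROC-AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# 5-fold cross-validation on the training portion only.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=5, scoring="accuracy"
)

print(cm)
print(f"ROC-AUC: {auc:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```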

Hyperparameter Tuning

RandomizedSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Hyperparameter tuning guide: https://scikit-learn.org/stable/modules/grid_search.html

XGBoost parameter tuning: https://xgboost.readthedocs.io/en/stable/parameter.html
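
A randomized search samples a fixed number of parameter combinations instead of trying every one. The sketch below tunes a Random Forest on synthetic data; the parameter ranges, n_iter, and cv values are illustrative assumptions, not the values used in this project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions are sampled; lists are chosen from uniformly.
param_distributions = {
    "n_estimators": randint(20, 80),
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,              # folds per combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```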

Preprocessing & Feature Selection

RobustScaler (handles outliers): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

LabelEncoder (encode target labels): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

SelectKBest (feature selection): https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

Chi-squared test for feature selection: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html

Train/test split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Understanding data leakage: https://machinelearningmastery.com/data-leakage-machine-learning/
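
One way to combine these preprocessing steps without leaking test data is to wrap them in a Pipeline, so the selector and scaler are fitted on the training split only. The sketch below is an assumption about how the pieces could be ordered, not the project's pipeline. Because chi2 requires non-negative input, selection runs before RobustScaler, which can produce negative values. The data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, RobustScaler

rng = np.random.default_rng(0)
labels = np.array(["Benign", "Malware"] * 150)

# Synthetic non-negative count features; the first two columns are shifted for "Malware".
X = rng.poisson(lam=5.0, size=(300, 8)).astype(float)
X[labels == "Malware", :2] += 4

# Classes are sorted alphabetically, so Benign -> 0 and Malware -> 1.
y = LabelEncoder().fit_transform(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=4)),  # fitted on training data only
    ("scale", RobustScaler()),           # median/IQR scaling, less sensitive to outliers
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
selected = pipe.named_steps["select"].get_support()
print(f"test accuracy: {test_accuracy:.3f}")
print(selected)
```

Fitting the scaler or selector on the full dataset before splitting would let test-set statistics influence training, which is the leakage the linked article describes.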

Data

pandas - 10 minutes tutorial: https://pandas.pydata.org/docs/user_guide/10min.html

pandas - Data structures: https://pandas.pydata.org/docs/user_guide/dsintro.html

NumPy reference: https://numpy.org/doc/stable/reference/index.html

Visualization

Matplotlib - pyplot tutorial: https://matplotlib.org/stable/tutorials/introductory/pyplot.html

Matplotlib - Customizing: https://matplotlib.org/stable/tutorials/introductory/customizing.html

Matplotlib - Subplots and layouts: https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html

Matplotlib - Bar labels: https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_label_demo.html

Matplotlib - Scatter plots: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

Seaborn - heatmap: https://seaborn.pydata.org/generated/seaborn.heatmap.html

Seaborn - countplot: https://seaborn.pydata.org/generated/seaborn.countplot.html

Hardware Monitoring (06_testing.ipynb)

psutil (CPU/RAM monitoring): https://psutil.readthedocs.io/en/latest/

psutil - System monitoring: https://psutil.readthedocs.io/en/latest/#system-related-functions

GPUtil (GPU monitoring): https://github.com/anderskm/gputil

Python threading module: https://docs.python.org/3/library/threading.html

Background threads for monitoring: https://realpython.com/intro-to-python-threading/
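
A background thread can sample CPU and RAM usage while a model trains in the main thread. The sketch below is one possible shape for such a monitor; the ResourceMonitor class and read_usage helper are hypothetical names, and the notebook's actual implementation may differ. If psutil is not installed, the sketch falls back to a dummy reader so it still runs.

```python
import threading
import time

try:
    import psutil

    def read_usage():
        # cpu_percent(interval=None) compares against the previous call,
        # so the very first reading is not meaningful.
        return psutil.cpu_percent(interval=None), psutil.virtual_memory().percent
except ImportError:
    def read_usage():
        # Fallback for environments without psutil (assumption for this sketch).
        return 0.0, 0.0


class ResourceMonitor:
    """Collect (cpu_percent, ram_percent) samples in a daemon thread."""

    def __init__(self, interval=0.05):
        self.interval = interval
        self.samples = []
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop_event.is_set():
            self.samples.append(read_usage())
            # wait() returns early when stop is requested, so shutdown is prompt.
            self._stop_event.wait(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop_event.set()
        self._thread.join()


with ResourceMonitor(interval=0.05) as monitor:
    time.sleep(0.2)  # stand-in for model training or inference

print(f"collected {len(monitor.samples)} samples")
```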

Tree Visualization (05_xgboost.ipynb)

XGBoost plot_tree: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.plot_tree

Feature importance in XGBoost: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.Booster.get_score

Understanding feature importance: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

Graphviz (tree visualization backend): https://graphviz.org/

Dimensionality Reduction

PCA (Principal Component Analysis): https://scikit-learn.org/stable/modules/decomposition.html#pca

PCA API reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

t-SNE (t-Distributed Stochastic Neighbor Embedding): https://scikit-learn.org/stable/modules/manifold.html

t-SNE API reference: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Understanding t-SNE: https://distill.pub/2016/misread-tsne/

Feature correlation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html
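
As a rough illustration of the PCA step, the sketch below projects scaled synthetic features onto two components for plotting. The use of RobustScaler before PCA and the choice of two components are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# PCA is sensitive to feature scale, so the features are scaled first.
X_scaled = RobustScaler().fit_transform(X)

pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each of the two components.
explained = pca.explained_variance_ratio_
print(X_2d.shape)
print(explained)
```

The two columns of X_2d can then be passed to a scatter plot colored by class label.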

Memory Forensics

Volatility Framework (memory analysis): https://github.com/volatilityfoundation/volatility

Volatility documentation: https://volatility3.readthedocs.io/

Memory forensics techniques: https://www.volatilityfoundation.org/

Learning

Python Data Science Handbook:
https://jakevdp.github.io/PythonDataScienceHandbook/

scikit-learn - Algorithm selection:
https://scikit-learn.org/stable/tutorial/machine_learning_map/

Binary Classification overview:
https://machinelearningmastery.com/types-of-classification-in-machine-learning/

Python

JSON module: https://docs.python.org/3/library/json.html

File I/O: https://docs.python.org/3/tutorial/inputoutput.html

Model Persistence

joblib (saving/loading models): https://joblib.readthedocs.io/en/latest/

scikit-learn model persistence: https://scikit-learn.org/stable/model_persistence.html
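
Saving and reloading a trained model with joblib can be sketched as follows. The file name model.joblib and the temporary directory are placeholders chosen for this example, not paths used by the project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")  # hypothetical file name
    joblib.dump(model, path)       # serialize the fitted estimator
    restored = joblib.load(path)   # load it back into memory

# The restored model should reproduce the original predictions.
same_predictions = bool(np.array_equal(model.predict(X), restored.predict(X)))
print(same_predictions)
```

As the scikit-learn persistence guide notes, files like this should only be loaded from trusted sources and with compatible library versions.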
