Live Demo: https://gopee.dev/malware/
An interactive Streamlit dashboard for exploring malware classification using machine learning. The site provides visualizations of the CIC-MalMem-2022 dataset, model comparison tools, and detailed analysis of different ML algorithms (Linear Regression, Logistic Regression, Naive Bayes, Random Forest, XGBoost) trained on memory forensics features extracted via Volatility.
Binary classification of malware vs benign samples using memory forensics features from the CIC-MalMem-2022 dataset.
Run everything with a single command:
python run.py
Dataset: 57,995 samples (29,226 benign + 28,769 malware)
Features: 52 memory forensics features (3 zero-variance removed)
Task: Binary classification (Malware vs Benign)
Algorithms: Linear Regression, Logistic Regression, Naive Bayes, Random Forest, XGBoost
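As a rough illustration of how such a comparison might be wired up, the sketch below trains four of the listed algorithms on a synthetic stand-in dataset. It is not the project's actual `run.py`. The synthetic data, the three appended constant columns, the pipeline layout (`VarianceThreshold` then `RobustScaler`), and the 0.5 threshold applied to Linear Regression outputs are all assumptions made for the example. XGBoost is left out to keep the sketch scikit-learn only; an `xgboost.XGBClassifier` could be added to the `models` dictionary in the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the real dataset (assumption: not CIC-MalMem-2022 itself).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    n_clusters_per_class=1, class_sep=1.5, random_state=42,
)
# Append three constant columns to mimic zero-variance features.
X = np.hstack([X, np.zeros((X.shape[0], 3))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
kept_features = {}
for name, model in models.items():
    # Fit preprocessing on the training split only, to avoid data leakage.
    pipe = make_pipeline(VarianceThreshold(), RobustScaler(), model)
    pipe.fit(X_train, y_train)
    raw = pipe.predict(X_test)
    # Linear Regression returns continuous values; threshold at 0.5 (assumption).
    # Classifier outputs are already 0/1, so the threshold leaves them unchanged.
    pred = (np.asarray(raw) >= 0.5).astype(int)
    scores[name] = accuracy_score(y_test, pred)
    kept_features[name] = int(pipe.named_steps["variancethreshold"].get_support().sum())

for name, acc in scores.items():
    print(f"{name}: accuracy={acc:.3f}, features kept={kept_features[name]}")
```

Placing the scaler and variance filter inside the pipeline means they are fitted on the training split only, which is one common way to keep test data out of preprocessing.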
Features extracted using Volatility framework:
- pslist: Process count, threads, handlers
- handles: File/registry/mutex handles
- malfind: Memory injection detection
- ldrmodules: Hidden DLL detection
- psxview: Process hiding (rootkit detection)
- svcscan: Service manipulation
- callbacks: Kernel callback hooks
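To see how the features spread across the plugins above, one option is to group column names by their plugin prefix. The sketch below assumes columns follow a `<plugin>.<metric>` naming pattern. The specific column names are hypothetical placeholders, and the real CSV header may differ.

```python
from collections import defaultdict

# Hypothetical column names in "<plugin>.<metric>" style (assumption);
# check the actual dataset header before relying on any of these.
columns = [
    "pslist.nproc", "pslist.avg_threads",
    "handles.nfile", "handles.nmutant",
    "malfind.ninjections",
    "ldrmodules.not_in_load",
    "psxview.not_in_pslist",
    "svcscan.nservices",
    "callbacks.ncallbacks",
    "Class",  # label column, carries no plugin prefix
]


def group_by_plugin(names):
    """Map each plugin prefix to the list of metric names under it."""
    groups = defaultdict(list)
    for name in names:
        plugin, sep, metric = name.partition(".")
        if sep:  # skip columns without a "<plugin>." prefix, such as the label
            groups[plugin].append(metric)
    return dict(groups)


groups = group_by_plugin(columns)
for plugin, metrics in groups.items():
    print(f"{plugin}: {len(metrics)} feature(s) -> {metrics}")
```

With a pandas DataFrame, the same helper could be called as `group_by_plugin(df.columns)`.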
Canadian Institute for Cybersecurity. "Malware Memory Analysis | CIC-MalMem-2022." University of New Brunswick. https://www.unb.ca/cic/datasets/malmem-2022.html
Cevallos-Salas, D. et al. "Obfuscated Privacy Malware Classifiers Based on Memory Dumping Analysis." Universidad San Francisco de Quito. https://research.usfq.edu.ec/en/publications/obfuscated-privacy-malware-classifiers-based-on-memory-dumping-an/
"Detecting New Obfuscated Malware Variants: A Lightweight and Interpretable Machine Learning Approach." ScienceDirect, 2024. https://www.sciencedirect.com/science/article/pii/S2667305324001467
"Development of a Deep Stacked Ensemble with Process Based Volatile Memory Forensics for Platform Independent Malware Detection and Classification." ScienceDirect. https://www.sciencedirect.com/science/article/abs/pii/S0957417423004542
"Obfuscated Malware Detection Using Dilated Convolutional Network." IEEE Conference Publication. https://ieeexplore.ieee.org/document/10307598
Shamshad, H. et al. "Obfuscated Malware Detection Using Deep Neural Network with ANOVA Feature Selection on CIC-MalMem-2022 Dataset." Scientific and Technical Journal of Information Technologies, Mechanics and Optics. https://ntv.ifmo.ru/en/article/23095/
"Obfuscated Memory Malware Detection in Resource-Constrained IoT Devices for Smart City Applications." Sensors (MDPI), 2023. https://www.mdpi.com/1424-8220/23/11/5348
"Obfuscated Malware Detection: Investigating Real-world Scenarios through Memory Analysis." arXiv, 2404.02372, 2024. https://arxiv.org/abs/2404.02372
Linear Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Logistic Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Gaussian Naive Bayes: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
XGBoost (Gradient Boosting): https://xgboost.readthedocs.io/en/stable/
XGBoost Python API: https://xgboost.readthedocs.io/en/stable/python/python_api.html
XGBoost GPU Support: https://xgboost.readthedocs.io/en/stable/gpu/index.html
Understanding Gradient Boosting: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
Confusion Matrix: https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix
ROC Curve: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
Understanding ROC-AUC: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
Cross-Validation: https://scikit-learn.org/stable/modules/cross_validation.html
cross_val_score (K-Fold): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
RandomizedSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Hyperparameter tuning guide: https://scikit-learn.org/stable/modules/grid_search.html
XGBoost parameter tuning: https://xgboost.readthedocs.io/en/stable/parameter.html
RobustScaler (handles outliers): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
LabelEncoder (encode target labels): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
SelectKBest (feature selection): https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
Chi-squared test for feature selection: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
Train/test split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Understanding data leakage: https://machinelearningmastery.com/data-leakage-machine-learning/
pandas - 10 minutes tutorial: https://pandas.pydata.org/docs/user_guide/10min.html
pandas - Data structures: https://pandas.pydata.org/docs/user_guide/dsintro.html
NumPy reference: https://numpy.org/doc/stable/reference/index.html
Matplotlib - pyplot tutorial: https://matplotlib.org/stable/tutorials/introductory/pyplot.html
Matplotlib - Customizing: https://matplotlib.org/stable/tutorials/introductory/customizing.html
Matplotlib - Subplots and layouts: https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html
Matplotlib - Bar labels: https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_label_demo.html
Matplotlib - Scatter plots: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html
Seaborn - heatmap: https://seaborn.pydata.org/generated/seaborn.heatmap.html
Seaborn - countplot: https://seaborn.pydata.org/generated/seaborn.countplot.html
psutil (CPU/RAM monitoring): https://psutil.readthedocs.io/en/latest/
psutil - System monitoring: https://psutil.readthedocs.io/en/latest/#system-related-functions
GPUtil (GPU monitoring): https://github.com/anderskm/gputil
Python threading module: https://docs.python.org/3/library/threading.html
Background threads for monitoring: https://realpython.com/intro-to-python-threading/
XGBoost plot_tree: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.plot_tree
Feature importance in XGBoost: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.Booster.get_score
Understanding feature importance: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/
Graphviz (tree visualization backend): https://graphviz.org/
PCA (Principal Component Analysis): https://scikit-learn.org/stable/modules/decomposition.html#pca
PCA API reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
t-SNE (t-Distributed Stochastic Neighbor Embedding): https://scikit-learn.org/stable/modules/manifold.html
t-SNE API reference: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Understanding t-SNE: https://distill.pub/2016/misread-tsne/
Feature correlation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html
Volatility Framework (memory analysis): https://github.com/volatilityfoundation/volatility
Volatility documentation: https://volatility3.readthedocs.io/
Memory forensics techniques: https://www.volatilityfoundation.org/
Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/
scikit-learn - Algorithm selection: https://scikit-learn.org/stable/tutorial/machine_learning_map/
Binary Classification overview: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
JSON module: https://docs.python.org/3/library/json.html
File I/O: https://docs.python.org/3/tutorial/inputoutput.html
joblib (saving/loading models): https://joblib.readthedocs.io/en/latest/
scikit-learn model persistence: https://scikit-learn.org/stable/model_persistence.html
