A data science project to predict the probability of a machine encountering malware based on telemetry data collected from Microsoft Defender.
Built using Python, Dask, LightGBM, and essential data science libraries for handling large-scale structured data.
With increasing cyber threats, early detection of malware is crucial for protecting user devices and data.
This project aims to predict the likelihood of a malware detection on a machine using telemetry data, enabling proactive defense mechanisms for organizations and end-users alike.
Description | Value |
---|---|
Source | Microsoft Malware Prediction (Kaggle) |
Training Set Size | 8,920,441 rows × 83 features |
Test Set Size | 7,653,424 rows × 83 features |
File Size | Approx. 8 GB for train.csv |
Target Variable | HasDetections (1 = Malware detected, 0 = No malware detected) |
Data Type | Tabular, mixed categorical & numerical |
Class Balance | Nearly balanced (~50:50 ratio); still needs careful validation |
Category | Tools/Libraries | Reason |
---|---|---|
Language | Python 3.11 | Versatile and widely used for ML workflows |
Data Handling | `pandas`, `dask`, `numpy` | Efficient large dataset processing |
Visualization | `seaborn`, `matplotlib`, `plotly` | EDA and visual storytelling |
Machine Learning | LightGBM | High-speed gradient boosting on large datasets |
Evaluation Metrics | `scikit-learn` | Classification reports, confusion matrices |

`MachineIdentifier`.

- `num_leaves = 64`
- `learning_rate = 0.1`
- `feature_fraction = 0.8`
- `bagging_fraction = 0.8`
- `max_depth = 8`
`result.csv`.

Metric | Validation Set Value |
---|---|
Accuracy | ~0.734 |
AUC Score | ~0.79 |
F1 Score | ~0.73 |
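
The three metrics in the table can be computed with scikit-learn along these lines. This is a minimal sketch: the helper name `evaluate` and the 0.5 decision threshold are assumptions, since the README does not state which threshold was used. AUC is computed from the predicted probabilities directly, while accuracy and F1 depend on the chosen threshold.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def evaluate(y_true, y_prob, threshold=0.5):
    """Return accuracy, AUC, and F1 for predicted probabilities (hypothetical helper)."""
    # Assumption: probabilities at or above the threshold count as detections.
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),  # threshold-independent
        "f1": f1_score(y_true, y_pred),
    }
```
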

`SmartScreen`, `AVProductStatesIdentifier`, and `Platform`.

`Optuna` or `GridSearchCV`.

```bash
git clone https://github.com/yourusername/malware-prediction-ml.git
cd malware-prediction-ml
```