Microsoft-Malware-Prediction

🛡️ Malware Prediction Using Machine Learning

A data science project to predict the probability of a machine encountering malware based on telemetry data collected from Microsoft Defender.
Built using Python, Dask, LightGBM, and essential data science libraries for handling large-scale structured data.


🎥 Demo Video



📊 Problem Statement

With increasing cyber threats, early detection of malware is crucial for protecting user devices and data.
This project aims to predict the likelihood of a malware detection on a machine using telemetry data, enabling proactive defense mechanisms for organizations and end-users alike.


📦 Dataset Details

| Description | Value |
|---|---|
| Source | Microsoft Malware Prediction (Kaggle) |
| Training Set Size | 8,920,441 rows × 83 features |
| Test Set Size | 7,653,424 rows × 83 features |
| File Size | Approx. 8 GB for train.csv |
| Target Variable | HasDetections (1 = Malware detected, 0 = No malware detected) |
| Data Type | Tabular, mixed categorical & numerical |
| Class Balance | Nearly balanced (~50:50 ratio); validation splits still need care |
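
Because train.csv is roughly 8 GB, it helps to read it with narrow dtypes and in pieces rather than all at once. The sketch below is an illustration only: it uses pandas chunked reading as a dependency-light stand-in for Dask, a tiny in-memory CSV instead of the real file, and a hand-picked subset of columns. The `DTYPES` mapping and the `class_balance` helper are hypothetical names, not part of this repository.

```python
import io
from collections import Counter

import pandas as pd

# Tiny stand-in for train.csv (assumption: the real file is far larger
# and has many more columns than the four shown here).
SAMPLE_CSV = io.StringIO(
    "MachineIdentifier,ProductName,Census_ProcessorCoreCount,HasDetections\n"
    "a1,win8defender,4,0\n"
    "a2,win8defender,2,1\n"
    "a3,mse,8,1\n"
    "a4,win8defender,4,0\n"
    "a5,mse,,1\n"
    "a6,win8defender,2,0\n"
)

# Narrow dtypes cut memory use: category for repeated strings,
# float32 for numeric columns that may contain missing values,
# int8 for the binary target.
DTYPES = {
    "ProductName": "category",
    "Census_ProcessorCoreCount": "float32",
    "HasDetections": "int8",
}


def class_balance(csv_source, chunksize=100_000):
    """Return the share of each HasDetections label, reading in chunks."""
    counts = Counter()
    reader = pd.read_csv(
        csv_source, dtype=DTYPES, usecols=list(DTYPES), chunksize=chunksize
    )
    for chunk in reader:
        # Accumulate label counts so only one chunk is in memory at a time.
        counts.update(chunk["HasDetections"].value_counts().to_dict())
    total = sum(counts.values())
    return {int(label): n / total for label, n in sorted(counts.items())}


balance = class_balance(SAMPLE_CSV, chunksize=2)
print(balance)
```

For the real dataset, `csv_source` would be the path to train.csv and the dtype mapping would need to cover whichever columns the analysis actually uses.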

🛠️ Tech Stack

| Category | Tools/Libraries | Reason |
|---|---|---|
| Language | Python 3.11 | Versatile and widely used for ML workflows |
| Data Handling | pandas, dask, numpy | Efficient large dataset processing |
| Visualization | seaborn, matplotlib, plotly | EDA and visual storytelling |
| Machine Learning | LightGBM | High-speed gradient boosting on large datasets |
| Evaluation Metrics | scikit-learn | Classification reports, confusion matrices |

📊 Project Workflow

1️⃣ Data Loading

2️⃣ Data Cleaning & Preprocessing

3️⃣ Exploratory Data Analysis (EDA)

4️⃣ Model Building

5️⃣ Evaluation

6️⃣ Prediction & Output


📊 Results

| Metric | Validation Set Value |
|---|---|
| Accuracy | ~0.734 |
| AUC Score | ~0.79 |
| F1 Score | ~0.73 |
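
For reference, these three metrics can be computed with scikit-learn along the following lines. The labels and scores below are made-up toy values, not the project's validation data, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy ground-truth labels and predicted probabilities (illustrative only).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.3, 0.7, 0.2])

# Accuracy and F1 need hard labels, so apply a threshold first.
# AUC is computed from the raw probabilities and is threshold-free.
y_pred = (y_prob >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)

print(f"Accuracy: {accuracy:.4f}")  # 0.7500
print(f"F1 Score: {f1:.4f}")        # 0.7500
print(f"AUC Score: {auc:.4f}")      # 0.9375
```

Because AUC ignores the threshold while accuracy and F1 depend on it, the three numbers can move independently when the cutoff changes.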

🔮 Future Scope


🚀 Setup Instructions

  1. Clone the repository

```bash
git clone https://github.com/yourusername/malware-prediction-ml.git
cd malware-prediction-ml
```