ExeRay 2.0 :hospital:

Advanced X-ray Vision for Windows Executables

ExeRay detects malicious .exe files using machine learning. It extracts static features (entropy, imports, metadata) and combines ML with heuristic rules for fast, automated classification.

🚀 What's New in v2.0

50+ New Detection Features (VM detection, anti-debugging, API call chains)
Enhanced Prediction Engine with detailed suspicious-behavior reports when malware is detected
Recall-Optimized Training: Custom scorer prioritizing malware detection
Streamlined 3-Script Architecture (faster workflow)
Improved Accuracy (F1-score up to 0.99 in testing)
Dataset Provided
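
The recall-optimized training item above does not say which scorer is used. One common way to prioritize malware detection is an F-beta score with beta > 1, which penalizes missed malware more than false alarms. The sketch below is an assumption for illustration, not ExeRay's actual scorer; `fbeta` and `beta=2.0` are hypothetical choices.

```python
def fbeta(y_true, y_pred, beta=2.0):
    """F-beta score for binary labels (1 = malware). beta > 1 favors recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


y_true = [1, 1, 1, 1, 0, 0]
missed_malware = [1, 1, 1, 0, 0, 0]  # one false negative
false_alarm = [1, 1, 1, 1, 1, 0]     # one false positive

# With beta=2, the missed sample costs more than the false alarm.
print(fbeta(y_true, missed_malware), fbeta(y_true, false_alarm))
```

A function like this could be wrapped as a custom scorer for model selection; whether the project does so is not stated here.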

📊 Dataset Information

- Source & Composition:

<div align="center"> <table> <thead> <tr> <th>Dataset</th> <th>From</th> <th>Examples</th> <th>Total</th> </tr> </thead> <tbody> <tr> <td><strong>Malicious Dataset</strong></td> <td> <a href="https://github.com/iosifache/DikeDataset">DikeDataset</a>, <a href="https://github.com/ytisf/theZoo">theZoo</a>, <a href="https://bazaar.abuse.ch/">MalwareBazaar</a> </td> <td>WannaCry.exe, njRAT.exe</td> <td>10,925</td> </tr> <tr> <td><strong>Benign Dataset</strong></td> <td> Windows Files, <a href="https://ninite.com/">Ninite.com</a>, <a href="https://portableapps.com/">PortableApps.com</a> </td> <td>Putty.exe, notepad.exe, ida.exe</td> <td>3,590</td> </tr> <tr> <td colspan="3" style="text-align:right;"><strong>Total</strong></td> <td><strong>14,515</strong></td> </tr> <tr> <td colspan="3" style="text-align:right;"><strong>Size</strong></td> <td><strong>11.81 GB</strong></td> </tr> </tbody> </table> </div>

Dataset Processing: Of the 10,925 malware samples, 4,200 were processed for feature extraction. Undersampling with RandomUnderSampler (random_state=42) then balanced them against 3,500 benign samples (7,000 total), so the classifier is not biased toward the larger malware class while key patterns are preserved.
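
The balancing step can be illustrated without any third-party library. The sketch below is a plain-Python approximation of random undersampling, not the RandomUnderSampler implementation itself; `undersample` and the toy file names are hypothetical.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=42):
    """Randomly drop samples from larger classes until every class matches the smallest."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    out_samples, out_labels = [], []
    for label, group in sorted(by_label.items()):
        for sample in rng.sample(group, target):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: 6 "malware" (1) and 4 "benign" (0) file names.
names = [f"m{i}.exe" for i in range(6)] + [f"b{i}.exe" for i in range(4)]
labels = [1] * 6 + [0] * 4
balanced_names, balanced_labels = undersample(names, labels)
print(len(balanced_names), balanced_labels.count(0), balanced_labels.count(1))
```

Only the majority class loses samples; the minority class is kept whole.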
Benign and Malicious Dataset Link (11.81 GB on MEGA): https://mega.nz/folder/iAU3iARQ#nKPwCQIW4jZgAEFmRJlR6Q
⚠️ Safety Notice:
- Please exercise caution when downloading and handling this dataset. It contains both benign and malicious files for research purposes.
- Do not execute or open any files unless you're in a secure, isolated environment (e.g., a virtual machine or sandbox). Executing malicious files can harm your system or compromise your data.
⚠️ Important Notice About the Dataset:
- To keep this repository lightweight and easy to download, the full dataset is not included here. Specifically:
  - The data/ folder does not contain any executable files.
  - You will find only two empty directories: benign/ and malware/.
  - If you wish to work with the actual dataset, you need to download it manually from the MEGA link.

:gear: Enhanced Features

Hybrid AI detection (XGBoost + Random Forest)
Detailed Malware Fingerprinting:
- VM/Sandbox detection markers
- Anti-debugging technique identification
- Suspicious API call patterns
Confidence Scoring with threat level classification
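
Confidence scoring with threat-level classification can be pictured as mapping the model's malware probability onto named bands. The cut-offs and labels below are made up for illustration; ExeRay's actual levels and threshold are not listed here, and `threat_level` is a hypothetical helper.

```python
def threat_level(probability, threshold=0.5):
    """Map a malware probability in [0, 1] to a (verdict, level) pair.

    The threshold and band edges are hypothetical values.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < threshold:
        return "benign", "none"
    if probability < 0.7:
        return "malware", "low"
    if probability < 0.9:
        return "malware", "medium"
    return "malware", "high"


for p in (0.2, 0.6, 0.75, 0.95):
    print(p, threat_level(p))
```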

:wrench: Upgraded Tech Stack

New Components:

Advanced PE Analysis: Full directory parsing (TLS, Debug, Resources)
String Analysis: Unicode/ASCII pattern detection
Behavioral Indicators: 15+ new malware behavior signatures

Key Improvements:

- Structural Features:

# PE File Structure
'num_sections', 
'num_unique_sections',
'section_names_entropy',
'avg_section_size',
'min_section_size',
'max_section_size',
'total_section_size',
'avg_entropy',
'min_entropy',
'max_entropy',
'has_packed_sections',
'has_executable_sections',
'writable_executable_sections',
'is_dll',
'is_executable',
'is_system_file',
'has_aslr',
'has_dep',
'is_signed',
'has_rich_header',
'rich_header_entries',
'has_resources',
'num_resources',
'has_embedded_exe',
'has_debug',
'has_tls',
'has_relocations',
'ep_in_first_section',
'ep_in_last_section',
'ep_section_entropy',
'has_suspicious_sections'
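
Several of the structural features above (`avg_entropy`, `max_entropy`, `ep_section_entropy`, `has_packed_sections`) rest on the Shannon entropy of section bytes. A minimal sketch of that calculation follows. The 7.0 cut-off for "packed" is a common rule of thumb assumed here, not a value taken from ExeRay, and `looks_packed` is a hypothetical helper.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def looks_packed(section_bytes: bytes, cutoff: float = 7.0) -> bool:
    """Flag a section whose entropy exceeds the (assumed) cutoff."""
    return shannon_entropy(section_bytes) > cutoff


plain = b"A" * 1024              # constant bytes, lowest possible entropy
uniform = bytes(range(256)) * 4  # every byte value equally often, highest entropy
print(shannon_entropy(plain), shannon_entropy(uniform))
```

Compressed or encrypted sections tend toward the high end of this range, which is why entropy is a popular packing indicator.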

- Behavioral Features:

# API/Import Analysis
'num_imports',
'num_unique_dlls',
'num_unique_imports',
'imports_to_dlls_ratio',
'has_import_name_mismatches',
'suspicious_imports_count',
'num_exports',
'suspicious_exports',
'suspicious_api_chains',
'has_delayed_imports',
'has_vm_detection_imports',
'has_anti_debug_imports',
'has_process_creation_imports',
'has_createprocess',
'has_setwindowshookex',

# String Patterns
'num_strings',
'avg_string_length',
'has_suspicious_strings',
'has_anti_debug',
'has_vm_detection_strings',
'has_vm_mac_addresses',
'has_anti_debug_strings',
'has_nop_sleds'
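
The string-pattern features can be approximated by scanning the raw file bytes for printable runs. The sketch below covers ASCII only and is an assumption about how such features might be derived, not ExeRay's extractor; the minimum run length of 4, the marker subset, and `string_features` are illustrative choices.

```python
import re

# Runs of 4+ printable ASCII bytes; the minimum length is an assumed default.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
VM_MARKERS = (b"vbox", b"vmware", b"qemu")  # small illustrative subset


def string_features(data: bytes) -> dict:
    """Compute a few string-based features from raw file bytes."""
    strings = ASCII_RUN.findall(data)
    lowered = data.lower()  # case-insensitive marker search
    return {
        "num_strings": len(strings),
        "avg_string_length": sum(map(len, strings)) / len(strings) if strings else 0.0,
        "has_vm_detection_strings": int(any(m in lowered for m in VM_MARKERS)),
    }


sample = b"\x00\x01VBoxGuest\x00\xffkernel32.dll\x00ab\x00"
features = string_features(sample)
print(features)
```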

- Detection Signatures:

vm_detection_strings = {
    b'vbox', b'vmware', b'virtualbox', b'qemu', b'xen', b'hypervisor',
    b'virtual machine', b'vmcheck', b'vboxguest', b'vboxsf', b'vboxvideo'
}

vm_mac_prefixes = {
    b'00:0C:29', b'00:1C:14', b'00:05:69', b'00:50:56',  # VMware
    b'08:00:27',  # VirtualBox
    b'00:16:3E',  # Xen
    b'00:1C:42',  # Parallels
    b'00:15:5D'   # Hyper-V
}

anti_debug_strings = {
    b'IsDebuggerPresent', b'CheckRemoteDebuggerPresent', b'OutputDebugString',
    b'NtQueryInformationProcess', b'NtSetInformationThread', b'ZwSetInformationThread'
}

suspicious_patterns = {
    b'payload', b'malware', b'inject', b'virus', b'trojan',
    b'backdoor', b'rat', b'worm', b'spyware', b'keylog',
    b'xored', b'encrypted', b'packed', b'obfus'
}

# API Groups
vm_detection_apis = {
    'cpuid', 'hypervisor', 'vmcheck', 'vbox', 'vmware', 'virtualbox',
    'wine_get_unix_file_name', 'wine_get_dos_file_name'
}

anti_debug_apis = {
    'IsDebuggerPresent', 'CheckRemoteDebuggerPresent', 'OutputDebugStringA',
    'NtQueryInformationProcess', 'NtSetInformationThread', 'NtQuerySystemInformation',
    'GetTickCount', 'QueryPerformanceCounter', 'RDTSC', 'GetProcessHeap',
    'ZwSetInformationThread', 'DbgBreakPoint', 'DbgUiRemoteBreakin'
}

process_creation_apis = {
    'CreateProcessA', 'CreateProcessW', 'CreateProcessAsUserA', 'CreateProcessAsUserW',
    'SetWindowsHookExA', 'SetWindowsHookExW', 'ShellExecuteA', 'ShellExecuteW',
    'WinExec', 'System'
}

# Suspicious API Chains
api_sequences = {
    ('VirtualAlloc', 'WriteProcessMemory', 'CreateRemoteThread'): 'Process Injection',
    ('RegCreateKey', 'RegSetValue', 'RegCloseKey'): 'Registry Persistence',
    ('LoadLibraryA', 'GetProcAddress', 'VirtualProtect'): 'Dynamic API Resolution',
    ('OpenProcess', 'ReadProcessMemory', 'WriteProcessMemory'): 'Process Hollowing',
    ('NtUnmapViewOfSection', 'MapViewOfFile', 'ResumeThread'): 'RunPE Technique',
    ('CreateProcessA', 'WriteProcessMemory', 'ResumeThread'): 'Process Injection',
    ('SetWindowsHookExA', 'GetMessage', 'DispatchMessage'): 'Hook Injection'
}
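
How `api_sequences` is evaluated is not shown here. One plausible approach is to treat a chain as matched when every API in it appears in the import table. The sketch below assumes that approach; `match_api_chains` is a hypothetical helper and only two of the chains are copied in.

```python
API_SEQUENCES = {
    ("VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"): "Process Injection",
    ("LoadLibraryA", "GetProcAddress", "VirtualProtect"): "Dynamic API Resolution",
}


def match_api_chains(imported_names, sequences=API_SEQUENCES):
    """Return sorted labels of chains whose APIs are all present in the imports.

    A static import table carries no call order, so this checks presence only.
    """
    imported = set(imported_names)
    return sorted({label for chain, label in sequences.items() if imported.issuperset(chain)})


imports = ["LoadLibraryA", "GetProcAddress", "VirtualProtect", "VirtualAlloc", "ExitProcess"]
hits = match_api_chains(imports)
print(hits)
```

Because only presence is checked, a partial chain (here `VirtualAlloc` alone) does not trigger a match.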

:file_folder: Directory Structure

ExeRay/
├── assets/                      # Repo Images
├── data/                        # Raw Samples  
│   ├── malware/                 # Malicious Executables  
│   └── benign/                  # Clean Executables
├── dependencies/                # Installation Dependencies
├── models/                      # Saved Models/Thresholds  
│   ├── malware_detector.joblib  
│   └── optimal_threshold.npy  
├── output/                      # Processed Data (CSV/features)
│   └── processed_features_dataset.csv
├── scripts/                     # Core Scripts  
│   ├── extract_features.py  
│   ├── train_model.py  
│   ├── predict.py
│   └── visualize_model.py
├── visualizations/              # Model Feature & Tree Visualizations
│   ├── feature_importances.png  
│   ├── feature_importances_gain.png  
│   ├── xgb_tree_0.png
│   ├── xgb_tree_1.png
│   ├── xgb_tree_2.png
│   └── xgb_tree_99.png  
└── README.md

:computer: Installation and Usage (Commands & Outputs)

1. Clone the repository:

git clone https://github.com/MohamedMostafa010/ExeRay.git
cd ExeRay

2. Install dependencies:

pip install -r dependencies/requirements.txt

3. Extract Features:

> python extract_features.py
[*] Processing benign samples from ../data\benign...
[!] Not a valid PE file: adaminstall.exe
[!] Not a valid PE file: adamsync.exe
[!] Not a valid PE file: AddSuggestedFoldersToLibraryDialog.exe
[!] Not a valid PE file: AgentService.exe
[!] Not a valid PE file: AggregatorHost.exe
[!] Not a valid PE file: appcmd.exe
[!] Not a valid PE file: AppHostRegistrationVerifier.exe
[!] Not a valid PE file: ApplySettingsTemplateCatalog.exe
[!] Not a valid PE file: ApplyTrustOffline.exe
[!] Not a valid PE file: ApproveChildRequest.exe
[!] Not a valid PE file: AppVClient.exe
[!] Not a valid PE file: ARPPRODUCTICON.exe
[!] Not a valid PE file: audit.exe
[!] Not a valid PE file: AuditShD.exe
[!] Not a valid PE file: autofstx.exe
...
[*] Processing malware samples (limited to 3500) from ../data\malware...

[+] Processed Features Dataset saved to ../output/processed_features_dataset.csv
[+] Total samples: 6857
[+] Malware samples: 3500
[+] Benign samples: 3357

4. Train Model (performance metrics are printed after training):

> python train_model.py
Training models:   0%|          | 0/2 [00:00<?, ?it/s]
New best model: XGBoost (Recall=0.990)
Training models: 100%|██████████████
