Machine Learning Spots 94% of Android Malware Using Device Behaviour Patterns

Android malware represents a significant and growing threat to mobile devices and the Internet of Things, necessitating robust detection methods. Diego Ferreira Duarte and Andre Augusto Bortoli, from the Instituto de Informática at Universidade Federal do Rio Grande do Sul, and colleagues present a rigorous empirical evaluation of the Synthetic Minority Oversampling Technique (SMOTE) when applied to machine learning algorithms for Android malware detection using the CICMalDroid 2020 dataset. Their research is significant because it challenges the commonly held assumption that SMOTE consistently improves performance on imbalanced datasets, demonstrating that it frequently degrades results or offers only marginal gains in this specific cybersecurity context. The findings highlight the superior performance of tree-based algorithms like XGBoost and Random Forest, and suggest that alternative data balancing strategies may be more appropriate for effectively identifying malicious Android applications.

SMOTE’s counterintuitive impact on Android malware classification performance

The study takes a close look at how data balancing techniques behave in the critical field of Android malware detection. Analysis of the CICMalDroid2020 dataset, comprising 11,598 samples of Android malware behaviour, showed that in 75% of tested configurations SMOTE led to performance degradation or only marginal improvements, with an average loss of 6.14 percentage points.

This research employed four distinct machine learning algorithms, XGBoost, Naive Bayes, Support Vector Classifier, and Random Forest, to analyse dynamic execution characteristics of Android applications. The study’s core innovation lies in its rigorous, empirical assessment of SMOTE’s impact on each algorithm’s ability to accurately classify malware.

Findings indicate that tree-based algorithms, notably XGBoost and Random Forest, consistently achieved the highest performance, attaining weighted recall exceeding 94%. This suggests an inherent robustness within these models when dealing with the complex and sparse dynamic characteristics of Android malware.

The work infers that simply generating synthetic instances, as SMOTE does, may not be the optimal strategy for balancing data in this specific cybersecurity scenario. Researchers propose that alternative algorithmic data balancing approaches could prove more effective. This discovery challenges conventional wisdom regarding data preprocessing and opens new avenues for enhancing the accuracy and efficiency of Android malware detection systems. The implications extend to securing the vast landscape of mobile devices, including smartphones, smartwatches, tablets, and Internet of Things (IoT) devices, all increasingly vulnerable to sophisticated cyber threats.
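To make "generating synthetic instances" concrete, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' code or the implementation they used; the function name `smote_sketch`, the parameter values, and the toy points are assumptions chosen for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Assumes `minority` holds at least two points of equal dimension.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=5)
```

Every synthetic point lies on a line segment between two real minority samples, so it can take intermediate feature values that no observed sample has.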

Dataset characteristics and feature scaling techniques

The study used the CICMalDroid2020 dataset, a collection of dynamically obtained Android malware behaviour samples, to empirically evaluate the performance of XGBoost, Naive Bayes, Support Vector Classifier, and Random Forest on Android malware detection.

This dataset was chosen for its recent compilation, large volume, robust characteristics, and academic relevance in the field of cybersecurity. Initial data preprocessing involved feature scaling to bring the dataset's variables onto comparable ranges. Three normalization approaches were implemented: standard normalization, transforming attributes to have a mean of zero and a standard deviation of one; Min-Max normalization, rescaling values between zero and one while preserving the shape of the original distribution; and robust normalization, utilising the median and interquartile range to mitigate the impact of outliers.
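The three scaling approaches described above can be sketched for a single feature column using only the Python standard library. This is an illustrative sketch, not the study's preprocessing code; the function names and the toy `feature` values are assumptions, and each function assumes the column is not constant.

```python
import statistics


def standard_scale(values):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical feature column containing one large outlier.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
standardised = standard_scale(feature)
rescaled = min_max_scale(feature)
robust = robust_scale(feature)
```

On this toy column the outlier pushes the four ordinary values close to zero under Min-Max scaling, whereas the robust variant keeps them evenly spread around the median.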

Variable selection was then performed to address the curse of dimensionality, identifying and retaining the most important features while eliminating redundant or irrelevant ones to reduce computational costs. Weighted recall, which averages per-class recall with each class weighted by its number of samples, was used to evaluate model effectiveness, with tree-based algorithms consistently achieving scores above 94%. The research demonstrates that, in 75% of tested configurations, SMOTE application resulted in performance degradation or only marginal improvements, incurring an average loss of 6.14 percentage points.
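As a rough illustration of the evaluation metric, the sketch below computes weighted recall by hand: recall is calculated per class, then averaged with weights proportional to each class's share of the true labels. The class labels and predictions are invented for the example and are not taken from the study.

```python
from collections import Counter


def weighted_recall(y_true, y_pred):
    """Average per-class recall, weighting each class by its support."""
    support = Counter(y_true)  # number of true samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum(
        (support[c] / total) * (correct[c] / support[c]) for c in support
    )


# Hypothetical labels: 6 benign, 3 adware, 1 banking sample.
y_true = ["benign"] * 6 + ["adware"] * 3 + ["banking"]
y_pred = ["benign"] * 5 + ["adware"] + ["adware"] * 2 + ["benign"] + ["banking"]
score = weighted_recall(y_true, y_pred)
```

Because the support weights cancel the per-class denominators, weighted recall in a single-label setting equals overall accuracy, so large classes dominate the score.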

XGBoost and Random Forest demonstrate high malware recall but SMOTE offers limited benefit

Weighted recall exceeded 94% for tree-based algorithms, specifically XGBoost and Random Forest, when applied to the CICMalDroid2020 dataset. Results from 75% of tested configurations indicated that SMOTE application led to performance degradation or only marginal improvements, resulting in an average loss of 6.14 percentage points.

The study utilised a dataset comprising dynamically obtained Android malware behaviour samples, enabling analysis of malicious code from execution characteristics. XGBoost and Random Forest models achieved high weighted recall values, consistently surpassing the 94% threshold, demonstrating robust performance in identifying malware.

Application of SMOTE, intended to address class imbalance within the dataset, frequently diminished performance, suggesting its ineffectiveness in this specific cybersecurity scenario. This outcome may be linked to the complexity and sparsity inherent in dynamic characteristics or the nuanced relationships defining malicious code.

The authors suggest that algorithmic data balancing approaches might prove more effective than generating synthetic instances for Android malware detection. The research highlights the robustness of tree-ensemble models, such as XGBoost, in handling the intricacies of dynamic malware analysis. The CICMalDroid2020 dataset, containing 11,598 samples of Android malware executions, served as the foundation for evaluating these machine learning algorithms. These findings contribute to the ongoing development of automated threat discovery tools and improved cybersecurity measures for mobile devices.

SMOTE’s limited efficacy and tree-ensemble robustness in Android malware classification

Researchers evaluated machine learning algorithms for detecting malicious software on Android devices using dynamic execution characteristics. Tree-based algorithms, specifically XGBoost and Random Forest, consistently achieved high weighted recall, exceeding 94%.

These findings suggest that, for this particular dataset and feature set, SMOTE is not a universally beneficial technique for enhancing Android malware detection. The robustness of tree-ensemble models was highlighted, indicating their suitability for this cybersecurity task. The research implies that alternative data balancing strategies may prove more effective than synthetic data generation when dealing with the complex and sparse dynamic characteristics of Android malware.
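The article does not say which alternative balancing strategies the authors have in mind. One widely used option that avoids synthetic data is class weighting, where errors on rare classes count more during training. The sketch below computes inverse-frequency weights with the common "balanced" heuristic; the function name and the toy labels are assumptions, and this is offered only as an example of the general idea, not as the authors' proposal.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Hypothetical imbalanced label set: 8 benign, 2 malware samples.
labels = ["benign"] * 8 + ["malware"] * 2
weights = balanced_class_weights(labels)
```

The resulting dictionary can be passed to learners that accept per-class weights, leaving the training samples themselves unchanged.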

Acknowledging limitations, the authors suggest the observed ineffectiveness of SMOTE may be linked to the intricacies and sparsity of the dynamic characteristics within the CICMalDroid2020 dataset, or the inherent relationships between malicious samples. Future work could explore alternative data balancing methods or feature engineering techniques tailored to the specific challenges of Android malware detection, potentially improving model performance and generalisation capabilities.

👉 More information
🗞 Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020
🧠 ArXiv: https://arxiv.org/abs/2602.08744

Rohail T.
