Analysis and Improvement of Machine Learning Methods for Detecting Malicious Links on Instagram
DOI:
https://doi.org/10.66571/tsarka-3134-6057-09
Keywords:
Malicious URLs, Cybersecurity, Convolutional Neural Network (CNN), Hybrid Ensemble, DistilBERT, Group Shuffle Split, Data Leakage, Instagram
Abstract
The spread of malicious URLs on social networks, especially Instagram, poses a serious cybersecurity risk to end users. Although machine learning methods have improved threat detection, existing studies often share a critical methodological flaw: they do not isolate domains when evaluating models. Without this isolation, data leaks between the training and test sets and performance metrics are artificially inflated, so the reported results do not generalize to zero-day attacks. This paper addresses that gap by proposing a robust framework for malicious link detection. To keep the evaluation objective, we apply a strict domain-split cross-validation strategy (Group Shuffle Split), which ensures that no domain contributes URLs to both the training and the test set and so removes this source of leakage. The paper also compares two independently developed architectures: a character-level 1D Convolutional Neural Network (CNN) for automated pattern recognition, and a Hybrid Ensemble system. The proposed ensemble combines deep semantic embeddings from a pretrained Transformer language model (DistilBERT) with classical gradient-boosting algorithms through a Soft Voting mechanism, supported by lexical and structural feature engineering. This approach substantially reduces the false-positive rate and prevents trivial memorization of host identifiers. Finally, to demonstrate a realistic application, the proposed methods are implemented as a working prototype, comprising a REST API and a client-side browser extension, intended for proactive, real-time protection of social media users.
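The domain-isolated evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it uses only the Python standard library as a stand-in for a library routine such as scikit-learn's GroupShuffleSplit, it assumes the grouping key is the URL hostname (the paper may group by registered domain instead), and the function names and sample URLs are hypothetical.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase hostname, used here as the grouping key (assumption)."""
    return (urlsplit(url).hostname or "").lower()


def domain_split(urls, test_fraction=0.25, seed=0):
    """Split URL indices so that no hostname appears in both train and test."""
    by_host = defaultdict(list)
    for i, url in enumerate(urls):
        by_host[host_of(url)].append(i)

    # Shuffle whole hosts, not individual URLs, then hold out a share of hosts.
    hosts = sorted(by_host)
    random.Random(seed).shuffle(hosts)
    n_test = max(1, round(len(hosts) * test_fraction))
    held_out = set(hosts[:n_test])

    train = [i for h in hosts if h not in held_out for i in by_host[h]]
    test = [i for h in held_out for i in by_host[h]]
    return sorted(train), sorted(test)


# Hypothetical sample URLs, for illustration only.
urls = [
    "http://example.com/login",
    "http://example.com/reset",
    "http://shop.example.org/cart",
    "http://shop.example.org/pay?id=1",
    "http://example.net/a",
    "http://example.net/b",
    "http://test.example/x",
    "http://test.example/y",
]
train_idx, test_idx = domain_split(urls)
train_hosts = {host_of(urls[i]) for i in train_idx}
test_hosts = {host_of(urls[i]) for i in test_idx}
```

Because every URL of a given host lands on one side only, a classifier cannot score well on the held-out set merely by memorizing host strings seen during training, which is the leakage the abstract refers to.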







