Phishing URL Detection Using Machine Learning

Diya Saxena; Dr. Sheshang Degadwala; Malini Joshi

doi:10.32628/IJSRST2613101

Authors

Diya Saxena, Research Scholar, Department of Computer Engineering, Sigma University, Vadodara, Gujarat, India
Dr. Sheshang Degadwala, Professor & Head, Department of Computer Engineering, Sigma University, Vadodara, Gujarat, India
Malini Joshi, Assistant Professor, Department of Computer Engineering, Sigma University, Vadodara, Gujarat, India

Abstract

Phishing is a common cyber-attack in which attackers create fake websites to trick users into revealing sensitive information such as passwords, credit card numbers, and personal details. Detecting phishing URLs protects users and reduces the risk of identity theft and fraud. This project applies machine learning techniques to identify phishing URLs by analysing their characteristics. The aim is to understand how these methods learn from data and recognize the patterns that distinguish phishing URLs from legitimate ones. The study involves gathering datasets of URLs, extracting features that describe their structure and content, and building models that classify URLs as phishing or legitimate. The research does not assume any particular machine learning approach; it explores several candidates to find effective ones. The goal is to investigate how machine learning can contribute to automated and efficient phishing detection, which could improve online security tools such as browsers and email filters.
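To make the feature-extraction step concrete, the following is a minimal sketch of how structural (lexical) features might be computed from a URL before a classifier is trained. The abstract does not list the features the study uses, so the function name `extract_url_features` and every feature in it are assumptions chosen for illustration, not the authors' feature set.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return a small dictionary of lexical features for one URL.

    Hypothetical feature set for illustration only; the paper does not
    specify which features it extracts.
    """
    # urlparse only fills in the host when a scheme is present,
    # so prepend one for bare inputs such as "example.com/login".
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        # An "@" can hide the real host behind a familiar-looking prefix.
        "has_at_symbol": int("@" in url),
        # Dotted-quad hosts (loosely matched) instead of domain names.
        "host_is_ipv4": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "uses_https": int(parsed.scheme == "https"),
        # Number of non-empty path segments.
        "path_depth": len([part for part in parsed.path.split("/") if part]),
    }
```

Each URL in a labelled dataset would be mapped to such a feature vector, and the resulting table could then be passed to whichever classifier is being evaluated.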
