EFFECT OF SAMPLE SIZE ON THE ACCURACY OF MACHINE LEARNING CLASSIFICATION MODELS
Keywords:
Machine Learning, Classification Algorithms, Sample Size, Predictive Performance, Accuracy
Abstract
The reliability and effectiveness of machine learning classification models are heavily influenced by the size of the training dataset. This study examines the impact of varying sample sizes on the predictive performance of five widely used classification algorithms: Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), and Naïve Bayes. Each model was evaluated on simulated datasets ranging from 50 to 5,000 samples using four key performance metrics: accuracy, precision, recall, and F1-score. The analysis reveals that while all models benefit from additional data, their sensitivity to sample size varies significantly. Logistic Regression and SVM exhibit consistent, robust performance across all sample sizes, and Naïve Bayes performs surprisingly well even with limited data. In contrast, Decision Trees are unstable on smaller datasets but improve notably at larger scales. Random Forests, though slower to improve, achieve competitive results as sample size increases. These findings offer practical guidance for practitioners selecting algorithms under varying data-availability conditions and underscore the importance of aligning model complexity with dataset size to achieve optimal classification performance.
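
The evaluation protocol summarized above can be reproduced in outline with scikit-learn. The following sketch is illustrative only: the synthetic task from make_classification, the 70/30 train-test split, the feature count, and the default model hyperparameters are all assumptions, since the abstract does not specify them.

    # Minimal sketch of the sample-size experiment described above.
    # Assumptions (not from the abstract): synthetic 20-feature binary task,
    # 70/30 split, default hyperparameters, a single random seed.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(random_state=0),
        "SVM": SVC(),
        "Naive Bayes": GaussianNB(),
    }

    for n in [50, 100, 500, 1000, 5000]:  # sample sizes spanning 50 to 5,000
        X, y = make_classification(n_samples=n, n_features=20, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        for name, model in models.items():
            model.fit(X_tr, y_tr)
            pred = model.predict(X_te)
            # The four metrics reported in the study
            print(n, name,
                  accuracy_score(y_te, pred),
                  precision_score(y_te, pred),
                  recall_score(y_te, pred),
                  f1_score(y_te, pred))

Note that at small n the metrics vary heavily with the random seed, so in practice one would average over repeated data draws to obtain stable estimates.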