EVALUATING THE EFFECT OF DATASET SIZE ON PREDICTIVE MODEL USING SUPERVISED LEARNING TECHNIQUE

A. R. Ajiboye, Ruzaini Abdullah-Arshah, H. Qin, H. Isah-Kebbe

Abstract


Learning models used for prediction are mostly developed without much regard for the size of dataset that can produce models of high accuracy and better generalization, although the general belief is that a large dataset is needed to construct a predictive learning model. Whether a dataset counts as large is perhaps circumstance dependent; thus, what makes a dataset big or small is vague. In this paper, the ability of a predictive model to generalize with respect to a particular size of data, when simulated with new input not used in training, is examined. The study experiments on three different sizes of data, using a Matlab program to create predictive models, with a view to establishing whether the size of the data has any effect on the accuracy of a model. The simulated output of each model is measured using the Mean Absolute Error (MAE), and comparisons are made. Findings from this study reveal that the quantity of data partitioned for training must be a good representation of the entire set and sufficient to span the input space. The results of simulating the three network models also show that the learning model with the largest training set appears to be the most accurate and consistently delivers better and more stable results.
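
As a reading aid, the MAE comparison described in the abstract can be sketched as follows. This is a minimal illustration in Python rather than the authors' Matlab program; the function name, the model labels, and all numbers are hypothetical and are not results from the study.

```python
def mean_absolute_error(targets, predictions):
    """Return the average of |target - prediction| over all samples."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical held-out targets and simulated outputs of three models.
# These values are made up for illustration only.
targets = [1.0, 2.0, 3.0, 4.0]
simulated = {
    "model_a": [1.5, 2.5, 2.0, 5.0],
    "model_b": [1.25, 2.25, 2.5, 4.5],
    "model_c": [1.0, 2.25, 3.0, 3.75],
}

# Score each model on the same held-out inputs, then pick the lowest MAE.
mae_by_model = {
    name: mean_absolute_error(targets, outputs)
    for name, outputs in simulated.items()
}
best_model = min(mae_by_model, key=mae_by_model.get)
```

A lower MAE means the simulated outputs lie closer to the targets on average, so the model with the smallest value would be judged the most accurate under this measure.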

 

Keywords: Prediction, Neural Network, Supervised Learning, Data mining, Data size.



