
Data Splitting

Publication at Faculty of Mathematics and Physics |
2010

Abstract

In machine learning, one of the main requirements is to build computational models that generalize the extracted knowledge well. When training, e.g., artificial neural networks, poor generalization is often a consequence of over-training.

A common method to avoid over-training is hold-out cross-validation. The main difficulty of this method, however, lies in splitting the data appropriately.
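The hold-out idea can be sketched as follows: the available data are partitioned once into a training part and a held-out part, and over-training is detected when the error on the held-out part starts to rise. This is a minimal illustration, not code from the paper; the function name and parameters are illustrative.

```python
import random

def holdout_split(data, test_fraction=0.2, seed=42):
    """Randomly partition `data` into a training and a hold-out part.

    Illustrative sketch of hold-out validation: a model would be fit on
    the training part, while the held-out part is used only to monitor
    generalization error.
    """
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)                      # simple random sampling
    cut = int(len(data) * (1 - test_fraction))
    train = [data[i] for i in indices[:cut]]
    heldout = [data[i] for i in indices[cut:]]
    return train, heldout

samples = list(range(100))
train, heldout = holdout_split(samples)
print(len(train), len(heldout))  # → 80 20
```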

In most applications, simple random sampling is used. Nevertheless, several more sophisticated statistical sampling methods exist that are suited to various types of datasets.
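One such alternative to simple random sampling is stratified sampling, which preserves the class proportions of the dataset in both parts of the split. The sketch below is an assumed illustration of this general technique, not the specific methods surveyed in the paper:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split example indices so that each class keeps roughly the same
    proportion in the training and held-out parts.

    Illustrative sketch of stratified sampling; names and parameters
    are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)                 # group indices per class
    train, heldout = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # random within each class
        cut = int(round(len(idx) * (1 - test_fraction)))
        train.extend(idx[:cut])
        heldout.extend(idx[cut:])
    return train, heldout

# An imbalanced toy dataset: 80 examples of class 'a', 20 of class 'b'.
labels = ['a'] * 80 + ['b'] * 20
train, heldout = stratified_split(labels)
```

With simple random sampling, a rare class can by chance end up under-represented in the held-out part; stratification rules this out by sampling each class separately.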

This paper provides a survey of existing sampling methods applicable to the data splitting problem. Supporting experiments evaluating the benefits of selected data splitting techniques use artificial neural networks of the back-propagation type.