Autoencoders for sample size estimation for fully connected neural network classifiers

Data collection is a universal step in the development of machine learning algorithms. The question of how much labeled data is needed to train a generalizable classifier is one that every data scientist working in supervised or semi-supervised learning must grapple with. The paradigm of big data answers with ‘as much data as we can get’. For many tasks, however, this convention is highly problematic. A priori sample size determination is a common practice in almost every field to mitigate some of these issues, and here we introduce a variant on sample size determination: the minimum convergence sample for machine learning. We additionally propose and validate a method to estimate a minimum convergence sample for deep learning algorithms, with a focus on fully connected networks. This study makes several contributions to the field. The first is a simple method for estimating the learnability of a dataset by a given model f. While prior work has described several methodologies by which learnability can be characterized, those works focused on the expressivity of models relative to data rather than the learnability of data relative to a model [14, 15]. Du and colleagues [16, 17] have adapted tools from empirical process theory [18] to estimate the sample complexity of CNNs and RNNs. However, empirical process theory may not extend well to networks that include nonlinear activation functions such as Rectified Linear Units.
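As a concrete illustration of what such a learnability probe can look like, the sketch below trains a small fully connected autoencoder and uses its final reconstruction loss as a proxy for how structured, and hence how learnable, a dataset is. The architecture, training budget, and synthetic data here are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class FCAutoencoder(nn.Module):
    """Small fully connected autoencoder used here purely as a learnability probe."""
    def __init__(self, in_dim: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_loss(x: torch.Tensor, epochs: int = 500) -> float:
    """Train the autoencoder on x and return its final training reconstruction loss."""
    model = FCAutoencoder(x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), x)
        loss.backward()
        opt.step()
    return loss.item()

# Data lying near a low-dimensional subspace should reconstruct more easily,
# and hence look more "learnable", than isotropic noise of comparable scale.
structured = torch.randn(512, 4) @ torch.randn(4, 32)
structured = (structured - structured.mean(0)) / structured.std(0)
unstructured = torch.randn(512, 32)
print(reconstruction_loss(structured), reconstruction_loss(unstructured))
```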

In this work, we primarily focus on estimating the number of labeled samples needed to train fully connected networks. Many developments in deep learning have focused on reducing the number of labeled samples required for training. These approaches fall into two categories: (a) adding structural information about the data into the model or (b) having the model assign labels through semi-supervised learning approaches. An improvement in sample efficiency does not obviate sample size determination, but rather widens the potential gap between the minimum convergence sample and the number of labeled samples collected.

In the first category, improvements to neural network architectures that exploit domain-specific and data-specific knowledge can reduce the minimum convergence sample. For example, convolutional layers take advantage of the spatial relationships among the pixels of an image to learn better representations [19]. Convolutions thereby improve sample efficiency and achieve generalizable performance at smaller sample sizes than fully connected networks. The technique of a priori minimum convergence sample estimation we present should be readily adaptable to architectures such as convolutional neural networks.
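The parameter-count comparison below is one simple, illustrative way to see why this inductive bias improves sample efficiency; the layer sizes are arbitrary choices and are not tied to any model in this study.

```python
import torch.nn as nn

fc = nn.Linear(28 * 28, 28 * 28)                   # dense map over a flattened 28x28 image
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # one 3x3 filter shared across all pixels

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(fc), n_params(conv))                # ~615,440 parameters vs. 10
```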

In the second category, semi-supervised learning methods can infer labels given a large set of unlabeled samples and a smaller subset of labeled samples [20]. Semi-supervised learning relies on a few key assumptions, the most relevant of which is the low-density assumption: the decision boundary of a classifier should preferably pass through low-density regions of the input space. When a dataset is not representative of the population because of systemic inequities, as seen in healthcare, this may result in the classifier providing biased labels for underrepresented classes. For example, there may be biases in data collection based upon race and ethnicity because of inequitable access to care [21]. A minimum convergence sample estimate may help guide data collection to adequately represent low-density subsets of the data.
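The sketch below shows a standard self-training (pseudo-labeling) setup in scikit-learn, in which unlabeled samples are marked with -1; the dataset, base classifier, and confidence threshold are illustrative assumptions. If the labeled subset under-represents certain regions of the input space, the pseudo-labels propagated there can inherit that bias.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.9] = -1   # hide ~90% of labels; -1 marks "unlabeled"

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y_partial)
print(accuracy_score(y, clf.predict(X)))   # accuracy against the full ground truth
```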

Sample size determination and statistical power remain closely related in many important studies, including clinical trials and social and psychological research [22, 23]. The absence of a priori sample size estimation has been recognized as a common mistake in the design of clinical trials, one that occurs more often when statisticians are not involved early in the trial design process [24]. Under-powered studies in neuroscience, and more specifically in Alzheimer’s disease, have led to routine failures in replication [25,26,27]. Given the increasing presence of artificial intelligence in medicine and clinical trials [28], sample size determination for neural networks represents an important opportunity to advance the utility of artificial intelligence in these domains by increasing trial efficiency, efficacy, and power. Most grant applications in medicine require an estimate of sample size, and such an estimate can now reasonably be provided for machine-learning-based grant applications via the deployed Flask application. A sound method for conducting sample size estimation for machine learning models can ensure proper experiment design.
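For contrast, a conventional a priori power calculation for a two-arm study takes only a few lines; the effect size, significance level, and target power below are illustrative values only.

```python
from statsmodels.stats.power import TTestIndPower

# Samples per arm for a two-sample t-test at a medium effect size.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # approximately 64 per arm
```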

Some past work has surveyed the use of sample size determination in machine learning with respect to medical imaging applications [29]. These methods split into pre-hoc and post-hoc approaches. However, pre-hoc methods were not robust in the high-dimensional setting with large intraclass variability [30], and other pre-hoc methods such as empirical process theory did not extend well to nonlinear methods [16]. Post-hoc methods usually involve fitting a learning curve, but a learning curve is of limited use for minimum convergence sample estimation because any additional data should yield a non-zero increase in performance on a training data-set. Moreover, these methods are task-specific, data-specific, and model-specific: a learning curve has no relevance outside the specific task, model, and data-set that produced it. Nevertheless, while our experiments validate minimum convergence sample estimation on toy data-sets, synthetic data, and, owing to data availability, a single real-world medical imaging example, future work should further validate this method across additional tasks and imaging modalities in the healthcare context.
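For completeness, the post-hoc alternative typically looks like the sketch below: fit an inverse power-law learning curve to accuracies observed at a handful of training-set sizes and extrapolate. The observed points are hypothetical values, and, as noted above, the fitted curve applies only to the task, model, and data-set that produced it.

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, b, c):
    return a - b * np.power(n, -c)   # accuracy approaches the asymptote a

n_obs = np.array([100, 250, 500, 1000, 2000], dtype=float)
acc_obs = np.array([0.62, 0.71, 0.77, 0.81, 0.84])   # hypothetical measurements

(a, b, c), _ = curve_fit(learning_curve, n_obs, acc_obs, p0=[0.9, 1.0, 0.5])
print(a, learning_curve(5000.0, a, b, c))  # asymptote and predicted accuracy at n = 5000
```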

Our second contribution is a method to empirically obtain a minimum convergence sample estimate (MCSE) for a given fully connected neural network f. This method allows users to predict the statistical power of a model without needing to train on the entire training set during every trial. It also provides an uncertainty on the estimate, whose variance is inversely correlated with how structured the underlying data are. Our third contribution is a publicly available tool for minimum sample size estimation for fully connected neural networks.
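A hedged sketch of how such an empirical estimate can be obtained is given below: train the same small fully connected autoencoder on resampled subsets of increasing size, take the smallest size at which the held-out reconstruction loss stops improving by more than a tolerance, and repeat over resamples to obtain a spread on the estimate. The layer sizes, subset grid, tolerance, repeat count, and synthetic data are all illustrative assumptions and do not reflect the exact settings behind the deployed tool.

```python
import numpy as np
import torch
import torch.nn as nn

def reconstruction_probe(x_train: torch.Tensor, x_val: torch.Tensor,
                         epochs: int = 200) -> float:
    """Train a small fully connected autoencoder on x_train; return held-out MSE."""
    d = x_train.shape[1]
    model = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 8),
                          nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, d))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x_train), x_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(model(x_val), x_val).item()

def mcs_estimate(x: torch.Tensor, sizes, tol: float = 0.01, repeats: int = 3):
    """Smallest subset size at which the held-out reconstruction loss stops improving."""
    x_val, x_pool = x[:500], x[500:]
    estimates = []
    for _ in range(repeats):
        losses = [reconstruction_probe(x_pool[torch.randperm(len(x_pool))[:n]], x_val)
                  for n in sizes]
        converged = [n for prev, curr, n in zip(losses, losses[1:], sizes[1:])
                     if prev - curr < tol]
        estimates.append(converged[0] if converged else sizes[-1])
    return float(np.mean(estimates)), float(np.std(estimates))  # estimate and its spread

# Synthetic data lying near a low-dimensional subspace, standardized for stability.
x = torch.randn(3500, 5) @ torch.randn(5, 20)
x = (x - x.mean(0)) / x.std(0)
print(mcs_estimate(x, sizes=[100, 250, 500, 1000, 2000]))
```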

Importantly, there are several natural opportunities to extend our work to more complex models, as discussed below. First, our paper only considered a fully connected network with a relatively simple architecture. One natural question arising from this work is how the method fares in estimating the statistical power of convolutional or recurrent neural networks. While adding convolutions would be relatively easy via the addition of another layer, adding attention mechanisms may require additional structural modifications to fully approximate the statistical power of recurrent neural networks or transformers. For our method to be applicable to medical imaging tasks, we anticipate that extending this work to convolutional neural networks is an important next step. Future work can also validate MCSE on more complex architectures that utilize pre-trained networks and skip connections. Second, the loss function used in the current analysis was the reconstruction loss, which is a relatively simple choice. For variational autoencoders, the loss instead incorporates a KL-divergence term, while GANs use the Jensen-Shannon (JS) divergence and WGANs the Wasserstein distance [31,32,33]. Therefore, different autoencoders with various structural representations could also be used to represent a fully connected network with distinct losses and structural features. Future work should examine different reconstruction frameworks to approximate the statistical power of increasingly complex network architectures. Third, we have not yet explored the utility of a similar approach to aid in architecture search or in identifying an optimal set of cases for labeling.
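As an indication of how the loss changes in one of these extensions, the sketch below shows a variational autoencoder objective that augments the reconstruction term with a KL-divergence to a standard normal prior; the network sizes and weighting are illustrative assumptions and are not part of the present method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int = 8):
        super().__init__()
        self.enc = nn.Linear(in_dim, 64)
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, in_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar, beta: float = 1.0):
    recon_term = F.mse_loss(recon, x, reduction="sum")
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    return recon_term + beta * kl_term

x = torch.randn(16, 20)
model = VAE(20)
print(vae_loss(x, *model(x)).item())
```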

In summary, we present a novel method of estimating the minimum sample size required to train a fully connected neural network for a classification task. The distinguishing feature of our approach is that this estimate can be obtained prior to labeling any data, which can be advantageous in real-world settings where labeling is expensive or time-consuming.