An Updated Survey on Statistical Thresholding and Sample Size of fMRI Studies

Associated Data

FILE S1: The complete lists of the 1,020 articles that were screened initially and the 388 articles that entered the analyses.

Abstract

Background: Since the early 2010s, the neuroimaging field has paid more attention to the issue of false positives, and several journals have issued guidelines regarding statistical thresholds. Three papers have surveyed the statistical thresholds used in the fMRI literature, but they were published at least 3 years ago and covered papers published during 2007–2012. This study revisited the topic to evaluate changes in the field.

Methods: The PubMed database was searched to identify task-based (not resting-state) fMRI papers published in 2017 and, for each paper, to record the sample size, inferential method (e.g., voxelwise or clusterwise), theoretical method (e.g., parametric or non-parametric), significance level, cluster-defining primary threshold (CDT), volume of analysis (whole brain or region of interest) and software used.

Results: The majority (95.6%) of the 388 analyzed articles reported statistics corrected for multiple comparisons, and a large proportion (69.6%) reported their main results by clusterwise inference. The analyzed articles mostly used the software packages Statistical Parametric Mapping (SPM), Analysis of Functional NeuroImages (AFNI), or FMRIB Software Library (FSL) for statistical analysis. A CDT of p ≤ 0.001 was used by 70.9%, 37.6%, and 23.1% of the SPM, AFNI, and FSL studies, respectively. Sample sizes across the articles ranged from 7 to 1,299, with a median of 33. Sample size did not correlate significantly with the level of statistical threshold.

Conclusion: Around 53% (142/270) of the studies using clusterwise inference still chose a CDT more liberal than p = 0.001 (n = 121) or did not report their CDT (n = 21), down from around 61% reported by Woo et al. (2014). For FSL studies, CDT practice showed no improvement since the survey by Woo et al. (2014). A few studies chose unconventional CDTs such as p = 0.0125 or 0.0004. Such practice might create the impression that thresholds were altered to show “desired” clusters. The median sample size used in the analyzed articles was similar to those reported in previous surveys. In conclusion, there seemed to be little change in statistical practice since the early 2010s.

Keywords: false-discovery rate, familywise error rate, fMRI, Gaussian random field, literature, Monte Carlo simulation, threshold, threshold-free cluster enhancement

Introduction

Functional magnetic resonance imaging (fMRI) studies, particularly task-based fMRI studies (the most popular type of fMRI study), enable researchers to examine various aspects of human brain function, ranging from sensation to cognition. Findings may bear clinical relevance, such as identifying the neural correlates of diseases or enabling the neuro-functional assessment of clinical treatments.

The reproducibility of a neuroscience report depends on numerous factors, including the methodological details, statistical power and flexibility of the analyses (Carp, 2012). One of the most important factors, and one that can be assessed relatively easily, is the statistical approach used. Every paper may set its own significance level for the statistical tests reported (Hupé, 2015), and therefore significant results from different papers may need to be interpreted differently. Given the mass-univariate analytic approach of popular fMRI data-processing software, such as Statistical Parametric Mapping (SPM) (Penny et al., 2011), Analysis of Functional NeuroImages (AFNI) (Cox, 1996), and FMRIB Software Library (FSL) (Jenkinson et al., 2012), correction for multiple comparisons is crucial when several thousand voxels are tested simultaneously. With regard to proper corrections for multiple comparisons, Carp (2012) revealed that an astonishing 41% of his 241 surveyed studies, published during 2007–2012, did not report formal corrections. As an extension of his work, Guo et al. (2014) reported a much lower rate of 19% for their 100 surveyed studies, published in six leading neuroscience/neuroimaging/multidisciplinary journals during 2010–2011. Similarly, Woo et al. (2014) reported that 6% of their 814 surveyed studies, published in seven leading journals during 2010–2011, did not apply formal statistical corrections. Uncorrected results may contain high false-positive rates, which undermine their reproducibility and clinical relevance. Even for corrected results, improperly set statistical thresholds may lead to inflated false-positive rates. Woo et al. (2014) and Eklund et al. (2016) have repeatedly stated that routine voxelwise correction methods are adequate for controlling false positives, whereas cluster-defining primary thresholds (CDTs) for clusterwise inferences should be set at p = 0.001 or lower, because more liberal thresholds, such as p = 0.01, may cause highly inflated false-positive rates for parametric methods. Clusterwise inference has been the most popular method because it is more sensitive (i.e., more powerful); however, its spatial precision is inferior to that of voxelwise inference: a large significant cluster indicates only that significant activations are contained somewhere within it, and gives no information about which individual voxels are significantly activated (Woo et al., 2014).
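To make the scale of the multiple-comparisons problem concrete, the following minimal simulation (an illustration written for this topic, not code from any surveyed paper; the voxel and subject counts are arbitrary assumptions) counts how many voxels pass an uncorrected threshold under pure noise, and how a familywise correction such as Bonferroni suppresses them:

```python
# Minimal sketch: false positives in a mass-univariate analysis under pure noise.
# All numbers are illustrative assumptions, not values from the surveyed studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_voxels, n_subjects = 100_000, 20                    # hypothetical whole-brain test

noise = rng.standard_normal((n_subjects, n_voxels))   # null data: no true effect anywhere
t, p = stats.ttest_1samp(noise, popmean=0.0, axis=0)  # one t-test per voxel

print("uncorrected p < 0.001:", int(np.sum(p < 0.001)))           # ~100 false positives expected
print("Bonferroni-corrected: ", int(np.sum(p < 0.05 / n_voxels)))  # ~0 expected
```

With 100,000 tests, even a seemingly stringent uncorrected threshold of p < 0.001 is expected to yield about 100 spurious voxels, which is why formal correction procedures are essential.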

In 2016, two journals issued guidelines regarding their stance on the standard statistical thresholds of reported fMRI/neuroimaging results (Carter et al., 2016; Roiser et al., 2016). Table 1 lists the key points of these guidelines and the suggestions of Woo et al. (2014) and Eklund et al. (2016). Moreover, several years have elapsed since 2014, the year when the last survey was published (Guo et al., 2014). It is therefore time to conduct a literature survey on the statistical thresholds used by the most recently published fMRI studies.

Table 1

Recently published recommended statistical practices for controlling false positives.

Woo et al., 2014: 1. Set the default cluster-defining primary threshold (CDT) at p < 0.001. 2. Use a stringent CDT or voxelwise inference for highly powered studies.

Eklund et al., 2016: 1. The parametric method works well for voxelwise inferences but not for clusterwise inferences (unless a stringent CDT is set at p < 0.001). 2. The permutation method works well for both voxelwise and clusterwise inferences.

Roiser et al., 2016: 1. For clusterwise inferences, choose a stringent CDT (e.g., p < 0.001) unless the permutation method was employed. 2. For voxelwise inferences, p-values should be corrected for multiple comparisons. 3. Complementary approaches, such as false-discovery rate or threshold-free cluster enhancement, can be considered. 4. Preregister the proposed studies in which the planned statistical analysis methods are documented clearly.

Carter et al., 2016: 1. Studies investigating very small brain regions should use a high voxel threshold (e.g., p < 0.001). 2. Studies not targeting precise localization may consider a more liberal threshold and focus on controlling false negatives by data reduction (e.g., region-of-interest analyses), as studies with fewer than 50 subjects per group usually have limited power.

Materials and Methods

In accordance with the methods of previous studies (Carp, 2012; Guo et al., 2014), articles published in 2017 and written in English were identified in the PubMed database with the keywords “fMRI,” “BOLD,” and “task.” The search was performed on July 20, 2017 and yielded 1,020 articles (listed in Supplementary File S1). All 1,020 articles were initially included; each was assessed by reading its full text and was excluded if it did not report a task-based human fMRI study with results from statistical parametric mapping (i.e., voxel-based statistical maps). In other words, studies reporting animal studies, resting-state fMRI, connectivity, multi-voxel pattern analysis or percent signal change were excluded. The screening excluded 632 articles, leaving a total of 388 articles for analysis (Supplementary File S1). For these 388 articles, the sample size, inferential method (e.g., voxelwise or clusterwise), theoretical method of correction for multiple comparisons (e.g., parametric or non-parametric), significance level, CDT (if applicable), volume of analysis (whole brain or region of interest; ROI) and software used were recorded manually. For articles that used multiple thresholds, the most stringent one used for the main analyses was chosen (Woo et al., 2014). Pearson’s correlation test was performed to evaluate the relationship between sample size and CDT level among the articles using clusterwise inference.
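As an illustration of the final analysis step, a hedged sketch of such a correlation test is shown below; the arrays hold placeholder values rather than the survey’s data, and quantifying the “level” of a CDT by its log10-transformed p-value is an assumption made here for illustration:

```python
# Hypothetical sketch of a Pearson correlation between sample size and CDT level.
# The data values and the log10 transform are illustrative assumptions only.
import numpy as np
from scipy import stats

sample_sizes = np.array([18, 25, 33, 47, 120])        # placeholder sample sizes
cdts = np.array([0.01, 0.005, 0.001, 0.001, 0.001])   # placeholder CDT p-values

r, p = stats.pearsonr(sample_sizes, np.log10(cdts))
print(f"r^2 = {r**2:.3f}, p = {p:.3f}")
```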

Results

Sample Size and Software Used

The sample size reported in the 388 papers ranged from 7 to 1,299, with a median of 33. One hundred and thirty-eight studies (35.6%) analyzed data from 25 or fewer subjects, 152 studies (39.2%) had 26–50 subjects, 54 studies (13.9%) had 51–75 subjects, 23 studies (5.9%) had 76–100 subjects and 21 studies (5.4%) had 101 or more subjects (Figure 1).

Figure 1. Choices of inferential methods and sample sizes used by the surveyed studies. The majority of the surveyed studies used clusterwise inference and recruited 50 subjects or fewer. For the studies using clusterwise inference, the cluster-defining primary thresholds (CDTs) were recorded. According to Woo et al. (2014) and Eklund et al. (2016), a CDT at or more stringent than p = 0.001 is recommended (indicated by the red portions of the bars in the lower panel). This was achieved by 70.9%, 37.6%, and 23.1% of studies using SPM, AFNI, and FSL, respectively.

The studies were published in 125 journals (Table 2). The studies predominantly used SPM for statistical analyses (202, 52.1%), followed by FSL (79, 20.4%), AFNI (71, 18.3%), BrainVoyager (11, 2.8%), Resting-State fMRI Data Analysis Toolkit (6, 1.5%), Statistical Non-Parametric Mapping (SnPM; 5, 1.3%), and MATLAB toolboxes other than SPM or SnPM (5, 1.3%). One study each used FreeSurfer, MAsks for Region of INterest Analysis, FIDL (developed by Washington University in St. Louis), the TFCE toolbox (University of Jena), and XBAM (developed by King’s College London).

Table 2

The 125 journals that published the 388 analyzed articles.

Journal: count (%) | Journal (continued): count (%)
Neuroimage: 23 (5.9%) | Alzheimers Dement (Amst): 1 (0.3%)
Cortex: 14 (3.6%) | Appl Neuropsychol Child: 1 (0.3%)
Neuropsychologia: 14 (3.6%) | Arch Gerontol Geriatr: 1 (0.3%)
Brain Imaging Behav: 13 (3.4%) | Behav Res Ther: 1 (0.3%)
Cereb Cortex: 13 (3.4%) | BMC Psychiatry: 1 (0.3%)
Hum Brain Mapp: 13 (3.4%) | Br J Psychiatry: 1 (0.3%)
J Neurosci: 12 (3.1%) | Br J Sports Med: 1 (0.3%)
PLoS One: 12 (3.1%) | Cerebellum: 1 (0.3%)
Sci Rep: 12 (3.1%) | Cogn Neurosci: 1 (0.3%)
J Cogn Neurosci: 11 (2.8%) | Cultur Divers Ethnic Minor Psychol: 1 (0.3%)
Behav Brain Res: 10 (2.6%) | Dev Psychol: 1 (0.3%)
Psychiatry Res: 9 (2.3%) | Einstein (Sao Paulo): 1 (0.3%)
J Affect Disord: 8 (2.1%) | Emotion: 1 (0.3%)
Neuroimage Clin: 8 (2.1%) | Epilepsy Behav: 1 (0.3%)
Brain Struct Funct: 6 (1.5%) | Eur Child Adolesc Psychiatry: 1 (0.3%)
Neuroscience: 6 (1.5%) | Eur Eat Disord Rev: 1 (0.3%)
Soc Cogn Affect Neurosci: 6 (1.5%) | Eur J Paediatr Neurol: 1 (0.3%)
Addict Biol: 5 (1.3%) | Eur J Pain: 1 (0.3%)
Biol Psychol: 5 (1.3%) | Eur Neuropsychopharmacol: 1 (0.3%)
Dev Psychopathol: 5 (1.3%) | Eur Radiol: 1 (0.3%)
Neuropsychopharmacology: 5 (1.3%) | Front Aging Neurosci: 1 (0.3%)
Psychol Med: 5 (1.3%) | Front Neuroanat: 1 (0.3%)
Soc Neurosci: 5 (1.3%) | Front Psychol: 1 (0.3%)
Addiction: 4 (1.0%) | Int J Neuropsychopharmacol: 1 (0.3%)
Brain Cogn: 4 (1.0%) | Int J Neurosci: 1 (0.3%)
Brain Res: 4 (1.0%) | Int J Psychophysiol: 1 (0.3%)
Cogn Affect Behav Neurosci: 4 (1.0%) | J Alzheimers Dis: 1 (0.3%)
Dev Sci: 4 (1.0%) | J Am Acad Child Adolesc Psychiatry: 1 (0.3%)
Eur J Neurosci: 4 (1.0%) | J Atten Disord: 1 (0.3%)
Front Behav Neurosci: 4 (1.0%) | J Autism Dev Disord: 1 (0.3%)
Front Hum Neurosci: 4 (1.0%) | J Child Sex Abus: 1 (0.3%)
Mult Scler: 4 (1.0%) | J Clin Exp Neuropsychol: 1 (0.3%)
Transl Psychiatry: 4 (1.0%) | J Hypertens: 1 (0.3%)
Biol Psychiatry: 3 (0.8%) | J Neurol Neurosurg Psychiatry: 1 (0.3%)
Brain Behav: 3 (0.8%) | J Neuropsychol: 1 (0.3%)
Brain Lang: 3 (0.8%) | J Neurotrauma: 1 (0.3%)
Elife: 3 (0.8%) | J Orthop Sports Phys Ther: 1 (0.3%)
Eur Arch Psychiatry Clin Neurosci: 3 (0.8%) | J Physiol Anthropol: 1 (0.3%)
Neural Plast: 3 (0.8%) | J Psycholinguist Res: 1 (0.3%)
Psychopharmacology (Berl): 3 (0.8%) | J Speech Lang Hear Res: 1 (0.3%)
Schizophr Res: 3 (0.8%) | J Vis Exp: 1 (0.3%)
Alcohol Alcohol: 2 (0.5%) | J Voice: 1 (0.3%)
Am J Psychiatry: 2 (0.5%) | JAMA Psychiatry: 1 (0.3%)
Bipolar Disord: 2 (0.5%) | Nat Commun: 1 (0.3%)
Brain Stimul: 2 (0.5%) | Neural Regen Res: 1 (0.3%)
Brain Topogr: 2 (0.5%) | Neurobiol Learn Mem: 1 (0.3%)
Brain: 2 (0.5%) | Neurodegener Dis: 1 (0.3%)
Clin Physiol Funct Imaging: 2 (0.5%) | Neurogastroenterol Motil: 1 (0.3%)
Depress Anxiety: 2 (0.5%) | Neurol Med Chir (Tokyo): 1 (0.3%)
Dev Cogn Neurosci: 2 (0.5%) | Neurology: 1 (0.3%)
Drug Alcohol Depend: 2 (0.5%) | Neuropsychobiology: 1 (0.3%)
Exp Brain Res: 2 (0.5%) | Neuroradiology: 1 (0.3%)
Hippocampus: 2 (0.5%) | Nutr Neurosci: 1 (0.3%)
J Int Neuropsychol Soc: 2 (0.5%) | Obes Res Clin Pract: 1 (0.3%)
J Psychiatr Res: 2 (0.5%) | Physiol Rep: 1 (0.3%)
J Psychopharmacol: 2 (0.5%) | PLoS Biol: 1 (0.3%)
Mol Psychiatry: 2 (0.5%) | Proc IEEE Inst Electr Electron Eng: 1 (0.3%)
Neurobiol Aging: 2 (0.5%) | Psychiatry Clin Neurosci: 1 (0.3%)
Proc Natl Acad Sci USA: 2 (0.5%) | Psychophysiology: 1 (0.3%)
Prog Neuropsychopharmacol Biol Psychiatry: 2 (0.5%) | Res Dev Disabil: 1 (0.3%)
Psychoneuroendocrinology: 2 (0.5%) | Schizophr Bull: 1 (0.3%)
Acta Radiol: 1 (0.3%) | Swiss Med Wkly: 1 (0.3%)
Alcohol Clin Exp Res: 1 (0.3%)

Choice of Inferential Method, Theoretical Method, and Significance Level

The majority of studies (371, 95.6%) reported main results with statistics corrected for multiple comparisons. Of the analyzed studies, 270 (69.6%) reported clusterwise inference for their main analyses, whereas 92 (23.7%) reported using voxelwise inference and nine (2.3%) reported using threshold-free cluster enhancement (TFCE) inference (Figure 1). Most of the studies defined significance at corrected p = 0.05. There were 338 studies (87.1%) that reported whole-brain results for their main analyses, and 244 of them (72.2%) used clusterwise inference (Table 3). Fifty studies (12.9%) reported ROI results, and 17 studies (4.4%) reported uncorrected statistics.

Table 3

Thresholds of statistical significance used by the 338 surveyed studies reporting whole-brain results.

Inferential method and threshold: n (%)

Cluster-level inference (n = 244)
  Corrected p = 0.05: 228 (93.4%)
  Corrected p = 0.025: 1 (0.4%)
  Corrected p = 0.01: 8 (3.3%)
  Corrected p = 0.001: 7 (2.9%)

Voxel-level inference (n = 71)
  Corrected p = 0.05: 67 (94.4%)
  Corrected p = 0.025: 1 (1.4%)
  Corrected p = 0.01: 1 (1.4%)
  Corrected p = 0.005: 1 (1.4%)
  Corrected p = 0.001: 1 (1.4%)

Threshold-free cluster enhancement (n = 7)
  Corrected p = 0.05: 7 (100.0%)

Uncorrected inference (n = 16)
  p = 0.05, k = 40: 1 (6.3%)
  p = 0.005, k = 50: 1 (6.3%)
  p = 0.005, k = 20: 1 (6.3%)
  p = 0.005, k = 10: 1 (6.3%)
  p = 0.005: 1 (6.3%)
  p = 0.001, k = 20: 4 (25.0%)
  p = 0.001, k = 15: 1 (6.3%)
  p = 0.001, k = 10: 3 (18.8%)
  p = 0.001, k = 5: 1 (6.3%)
  p = 0.001: 2 (12.5%)

k denotes the minimal cluster size expressed in number of voxels.
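To clarify how the uncorrected “p-value plus minimal cluster size k” thresholds in Table 3 operate, here is a minimal sketch assuming a 3-D map of voxelwise p-values; the function and toy data are hypothetical, not taken from any surveyed study:

```python
# Minimal sketch of uncorrected thresholding with a cluster-extent cutoff k.
# The p-map here is pure noise, so few (if any) clusters should survive.
import numpy as np
from scipy import ndimage

def extent_threshold(p_map, p_thresh=0.001, k=10):
    """Keep supra-threshold voxels only if their connected cluster has >= k voxels."""
    mask = p_map < p_thresh                            # voxelwise primary threshold
    labels, n_clusters = ndimage.label(mask)           # connected-component clusters
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n_clusters + 1))
    big_labels = np.flatnonzero(sizes >= k) + 1        # labels of clusters with >= k voxels
    return np.isin(labels, big_labels)

rng = np.random.default_rng(1)
p_map = rng.uniform(size=(40, 48, 40))                 # toy null p-values
print("surviving voxels:", int(extent_threshold(p_map).sum()))
```

Note that such extent cutoffs are heuristic; unlike the corrected clusterwise procedures above, they do not formally control the familywise error rate.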

Corrections for multiple comparisons were achieved by various theoretical methods (Table 4), predominantly parametric methods, regardless of whether inference was at the cluster or voxel level. Five studies did not mention their theoretical methods, and all of them used FSL software.

Table 4

Cross-tabulation of the theoretical methods and statistical thresholds of the 371 surveyed studies reporting corrected statistics.

Inferential method | Parametric (FWE) | Parametric (FDR) | Parametric (Monte Carlo) | Permutation | Unknown | Total count
Voxelwise | 72 | 18 | – | 2 | – | 92
Clusterwise | 155 | 12 | 92 | 6 | 5 | 270
TFCE | – | – | – | 9 | – | 9

There were 17 studies reporting uncorrected statistics; thus, only 371 studies were included in the table. TFCE, threshold-free cluster enhancement.

Cluster-Defining Primary Threshold (CDT) of Studies Using the Clusterwise Inferential Method

As mentioned above, 270 studies used clusterwise inference and thus required a CDT. Nearly half of them (128, 47.4%) defined their CDTs at or more stringent than p = 0.001 (Table 5). For studies using SPM, AFNI, and FSL, the proportions of CDTs reaching this standard were 70.9%, 37.6%, and 23.1%, respectively (Figure 1). Twenty-one studies (7.8%) did not report their CDTs. The CDT level did not correlate significantly with sample size (r² = 0.001, p = 0.683). One of the studies had a sample size of 1,299 subjects, much larger than the second-largest sample size of 429. With this outlier excluded, there was still no significant correlation (r² = 0.007, p = 0.180).
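A hedged sketch of this kind of outlier sensitivity check is shown below; the arrays hold placeholder values, not the survey’s recorded data (which are listed in Supplementary File S1):

```python
# Illustrative re-run of the Pearson test with and without the largest study.
# Arrays are placeholders, not the survey's recorded data.
import numpy as np
from scipy import stats

n = np.array([20, 33, 45, 60, 429, 1299])                  # sample sizes (toy values)
cdt = np.array([0.01, 0.001, 0.005, 0.001, 0.001, 0.001])  # CDT p-values (toy values)

for label, keep in [("all studies", np.full(n.size, True)), ("outlier excluded", n < 1299)]:
    r, p = stats.pearsonr(n[keep], np.log10(cdt[keep]))
    print(f"{label}: r^2 = {r**2:.3f}, p = {p:.3f}")
```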

Table 5

Cluster-defining primary thresholds (CDTs) of 270 studies using clusterwise inferences.

CDT (p-value): n (%)
0.05: 9 (3.3%)
0.025: 1 (0.4%)
0.02: 1 (0.4%)
0.0125: 1 (0.4%)
0.01: 60 (22.2%)
0.005: 49 (18.1%)
0.001: 124 (45.9%)
0.0004: 1 (0.4%)
0.0001: 2 (0.7%)
0.00001: 1 (0.4%)
Unknown: 21 (7.8%)

Discussion

The updated literature survey reported in this study reaffirmed that clusterwise inference remains the mainstream approach (270/388, 69.6%) among fMRI studies, compared with the proportions previously reported by Carp (2012) (53.2%), Guo et al. (2014) (63%), and Woo et al. (2014) (75%). Around 53% (142/270) of the studies using clusterwise inference still chose a CDT more liberal than p = 0.001 (n = 121) or did not report their CDT (n = 21), down from around 61% reported by Woo et al. (2014). The proportion of studies reporting uncorrected statistics (4.4%) was much lower than the rates reported by Carp (2012) (40.9%) and Guo et al. (2014) (19%), and slightly lower than that reported by Woo et al. (2014) (6%).

With regard to the sample size used in the surveyed studies, the median was 33. A previous study reported a median sample size of 28.5 for studies published in 2015, based on automated data extraction from the Neurosynth database (Poldrack et al., 2017). It was reassuring that studies using clusterwise inference with smaller sample sizes did not use more liberal CDTs.

In terms of inferential methods, FSL studies still mainly set their CDTs at p = 0.01 (the software’s default setting), which is more liberal than the p = 0.001 strongly recommended by various reports (Woo et al., 2014; Eklund et al., 2016; Roiser et al., 2016). Compared with the articles surveyed by Woo et al. (2014), a similar proportion of the FSL studies surveyed here used p = 0.001 or more stringent thresholds (23.1% vs. around 20%). The false-positive rate may be influenced by multiple factors, such as the degree of spatial smoothing, the experimental paradigm, the statistical test performed and the algorithms implemented in the statistical software. Hence, even if statistical thresholds are set according to recommendations, the rate of false positives can still be high and inhomogeneous across the brain (Eklund et al., 2016). Some therefore advocate the use of the false-discovery rate (FDR) (Genovese et al., 2002) or non-parametric approaches (Nichols and Holmes, 2002). However, few surveyed studies used FDR or non-parametric methods. These methods have potential drawbacks: inference drawn from non-parametric methods can be problematic (Hupé, 2015), and FDR results depend on the probability of non-null effects, which conceptually may not always be valid, while different studies may set different FDR thresholds (Hupé, 2015). Regardless of the theoretical method used, effect sizes should be reported alongside the brain maps of p-values for better comprehension of the results (Wasserstein and Lazar, 2016).
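As a concrete illustration of the difference between familywise-error control and the FDR approach mentioned above (Genovese et al., 2002), the following sketch applies a Bonferroni threshold and the Benjamini-Hochberg step-up procedure to simulated p-values; all numbers are illustrative assumptions:

```python
# Bonferroni (FWE) vs. Benjamini-Hochberg (FDR) on simulated voxelwise p-values.
# 50 voxels carry a strong true effect; 9,950 are null. Values are illustrative.
import numpy as np

rng = np.random.default_rng(2)
p = np.concatenate([rng.uniform(0, 1e-5, 50),    # "true effect" voxels: tiny p-values
                    rng.uniform(0, 1, 9950)])    # null voxels: uniform p-values
m, q = p.size, 0.05

bonferroni = p < q / m                           # FWE control: threshold 0.05 / 10,000

ranked = np.sort(p)                              # BH: largest i with p_(i) <= (i/m) * q
below = ranked <= np.arange(1, m + 1) / m * q
bh_cut = ranked[np.nonzero(below)[0].max()] if below.any() else 0.0

print("Bonferroni detections:", int(bonferroni.sum()))
print("BH-FDR detections:    ", int((p <= bh_cut).sum()))
```

In this toy setting the BH procedure typically recovers more of the true-effect voxels than Bonferroni while keeping the expected proportion of false discoveries near q, which is the trade-off described in the FDR literature.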

The current study has certain limitations. It would be beneficial to evaluate how altering the statistical thresholds would change the outcomes of the surveyed articles, but this is not possible within a literature survey. It should be noted that statistical practice is only one of the important aspects of an article. Readers should also evaluate other aspects, such as methodological details, study power and the flexibility of the analyses, and should note the statistical threshold used for different parts of the results; all of these may influence the quality of an article. Publishing replication studies regardless of statistical significance may help readers better comprehend data quality (Yeung, 2017). Meanwhile, meta-analyses of functional neuroimaging data can also establish consensus on the locations of brain activation to confirm or refute hypotheses (Wager et al., 2007; Zmigrod et al., 2016; Yeung et al., 2017b,d, 2018).

Conclusion

A considerable number of studies still used statistical approaches that might be considered to provide inadequate control over false positives. Around 30% of SPM studies chose a CDT more liberal than p = 0.001 or did not report their CDT, in spite of the current recommendations. For FSL studies, CDT practice showed no sign of improvement since the survey by Woo et al. (2014). A few studies, as noted in Table 5, chose unconventional CDTs such as p = 0.0125 or 0.0004. Such practice might create the impression that thresholds were altered to show “desired” clusters. As the neuroimaging literature is often highly cited and has continued to grow substantially over the years (Yeung et al., 2017a,c,e), there is a need to enforce a high standard of statistical control over false positives. Meanwhile, the median sample size of the analyzed articles did not differ greatly from those of previous surveys, and studies with smaller sample sizes did not use more liberal statistical thresholds. In short, there seemed to be little change in statistical practice since the early 2010s.

Author Contributions

AY is responsible for all parts of the work.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The author sincerely thanks Ms. Natalie Sui Miu Wong from Oral and Maxillofacial Surgery, Faculty of Dentistry, The University of Hong Kong for her critical comments and statistical advice.