What are the most frequently used statistical tests?
---
title: "What are the most frequently used statistical tests?"
author: "Ben Marwick"
date: "March 31, 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
```
I was thinking recently about the history of statistics, and why some methods are popular today while others are not. This led me to ask the question "what are the most popular basic statistical methods?"

I have a pretty good sense of what's popular in my own field, but that's a pretty small group. I wanted to look at scientists generally, across lots of disciplines, and I wanted to do it from my kitchen table using R. Two methods seemed suitable: using rOpenSci's `fulltext` package to search the full text of journal articles, and using the `gtrendsR` package to query Google Trends.

Here's the list of statistical methods that I wanted to know about:
```{r tests}
the_tests <- c("t-test", "chi-square", "chi square", "chi-squared", "ANOVA",
               "Wilcox", "Fisher's exact", "Pearson", "z-test", "f-test",
               "Bayesian", "confidence interval", "Kruskal Wallis",
               "Kruskal-Wallis", "Wilcoxon", "correlation",
               "multiple correlation", "MANOVA", "factor analysis",
               "logistic regression", "multiple regression",
               "Principal component analysis", "bootstrap", "resampling",
               "Mann Whitney", "Mann-Whitney", "cluster analysis", "ANCOVA",
               "linear regression", "Kolmogorov-Smirnov")
```
Here is how we can search the full text of a bunch of journals. We might take this as an indicator of what researchers are actually using in their scientific publications.
```{r fulltext}
library("purrr")
library("fulltext")
library("dplyr")
# the databases to query
sources <- c('plos', 'crossref', 'arxiv', 'europmc', 'bmc')
# run one search per method across all the sources
results <- the_tests %>%
  map(~ ft_search(query = ., from = sources))
# take the second element of each per-source result, stack the methods
# as rows, then sum across sources to get one total per method
results_df <- results %>%
  at_depth(2, 2) %>%
  invoke(rbind, .) %>%
  data.frame %>%
  apply(., 1, unlist) %>%
  data.frame %>%
  colSums %>%
  setNames(., nm = the_tests) %>%
  # braces stop the pipe from also passing the vector as a first column
  { data.frame(test = names(.),
               freq = unname(.)) }
library(ggplot2)
# bar plot of methods, ordered from most to least frequent
ggplot(results_df, aes(reorder(test, -freq), freq)) +
  geom_bar(stat = "identity") +
  xlab("method") +
  ylab("number of articles") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.2))
```
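
The reshaping in the chunk above boils down to one idea: for each method, add up the hit counts from every source, then put the totals in a two-column data frame for plotting. Here is a minimal base R sketch of that step. It runs offline, and the object names (`toy_found`, `totals`, `toy_df`) and all the counts are invented for illustration; they are not real search results.

```r
# Toy hit counts per method and source (made-up numbers, not real results)
toy_found <- list(
  "t-test" = c(plos = 120, crossref = 300, arxiv = 15),
  "ANOVA"  = c(plos = 200, crossref = 450, arxiv = 5)
)

# Sum across sources to get one total per method
totals <- vapply(toy_found, sum, numeric(1))

# Two-column data frame, sorted so the most frequent method comes first
toy_df <- data.frame(test = names(totals), freq = unname(totals))
toy_df <- toy_df[order(-toy_df$freq), ]
toy_df
```

Keeping the per-source counts in a named list makes it easy to check the total for any single method before plotting.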
Here is how we can see how much Google search interest each of the tests has received recently. This is an indication of what people are searching for, and would include students and people in industry whose research might not end up in publications that we could access with the previous method. Google Trends reports relative search interest rather than raw search counts, so the totals are best read as a ranking.
```{r gtrends_hide, echo=FALSE}
library(gtrendsR)
usr <- "[email protected]"
psw <- ""
gconnect(usr, psw)
```
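
The Google Trends chunk that follows sends the terms in batches of five and wraps the request so that one failed batch does not stop the rest. Here is a small offline sketch of that pattern. It is only an illustration: `fake_fetch` is a hypothetical stand-in for the network call, its `hits` values are meaningless, and base R's `tryCatch` is used in place of `purrr::safely` so the example has no dependencies.

```r
terms <- c("t-test", "ANOVA", "Pearson", "z-test", "f-test",
           "Bayesian", "bootstrap")

# ceiling(index / 5) gives group ids 1,1,1,1,1,2,2, so split() makes
# batches of at most five terms
batches <- split(terms, ceiling(seq_along(terms) / 5))

# Hypothetical stand-in for a request that fails for some batches
fake_fetch <- function(x) {
  if ("Bayesian" %in% x) stop("request failed")
  data.frame(term = x, hits = nchar(x))
}

# Return NULL on error instead of stopping, similar in spirit to
# taking the result element of a safely()-wrapped function
fetch_safe <- function(x) tryCatch(fake_fetch(x), error = function(e) NULL)

fetched <- lapply(batches, fetch_safe)

# Drop the batches that failed
ok <- Filter(Negate(is.null), fetched)
length(batches)
length(ok)
```

Because failed batches become `NULL` rather than errors, the loop always finishes, and the missing terms can be spotted afterwards by comparing `batches` with `ok`.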
```{r gtrends}
# we can only search for five terms at a time
the_tests_pieces <- split(the_tests, ceiling(seq_along(the_tests) / 5))
text_trends <- vector("list", length(the_tests_pieces))
all_the_trends <- data.frame(matrix(ncol = length(the_tests),
                                    nrow = 500))
# make a safe version of the function, so a failed request gives NULL
# instead of stopping the loop
gtrends_safe <- safely(gtrends)
# loop to search all the terms in batches of five terms at a time
for (i in seq_along(the_tests_pieces)) {
  # get the data from google, we'll just save the 'trend' bits for plotting;
  # assigning with [i] <- list() keeps a NULL in place for a failed batch
  text_trends[i] <- list(gtrends_safe(the_tests_pieces[[i]])[[1]]$trend)
}
# get the 'trends' and combine for all the stat methods we're interested in
date_time <- text_trends[[1]]$start
# keep the rows that match date_time and the five term columns (3 to 7)
text_trends_1 <- lapply(text_trends, "[", seq_along(date_time), 3:7, drop = FALSE)
# drop the batches where the request failed
text_trends_2 <- text_trends_1[!sapply(text_trends_1, is.null)]
# bind the batches side by side, one column per term
text_trends_df <- data.frame(do.call(cbind, text_trends_2))
text_trends_df$date_time <- date_time
# total search interest, using every column except date_time (the last one)
gtrend_total <- colSums(text_trends_df[, seq_len(ncol(text_trends_df) - 1)])
gtrend_total_df <- data.frame(test = names(gtrend_total),
                              value = unname(gtrend_total))
library(ggplot2)
ggplot(gtrend_total_df, aes(reorder(test, -value), value)) +
  geom_bar(stat = "identity") +
  xlab("method") +
  ylab("number of \nGoogle searches") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
# over time
library(tidyr)
# reshape to long format, gathering every term column and keeping date_time
text_trends_df_long <- gather(text_trends_df, 'test', 'value', -date_time)
library(plotly)
library(ggrepel)
p <- ggplot(text_trends_df_long, aes(date_time, value, colour = test, label = test)) +
  stat_smooth() +
  # extend the x axis past the last date (the offset is in seconds)
  # to leave room for the line labels
  coord_cartesian(xlim = c(min(text_trends_df_long$date_time),
                           max(text_trends_df_long$date_time) + 100000000)) +
  scale_x_datetime(date_breaks = "1 year", date_labels = "%Y") +
  scale_y_log10() +
  # label each line at its most recent date instead of using a legend
  geom_text_repel(
    data = subset(text_trends_df_long, date_time == max(date_time)),
    aes(label = test),
    size = 3,
    nudge_x = 5,
    segment.color = NA
  ) +
  guides(colour = "none") +
  theme_bw()
ggplotly(p)
```