R-Studio
dplyr : Data Wrangling (Filter, Mutate, Summarize, Arrange, Join, Group_by)
knitr : Dynamic Report Generation (provides kable() for readable tables)
tidyr : Converts Data to Tidy Format
ggplot2 : Data Visualization
readr : Importing Data
----------------------------------------------------------------------------- | |
tidyverse : Umbrella Package | |
https://tidyr.tidyverse.org/dev/articles/pivot.html#pew | |
running: | |
library(tidyverse) | |
would be the same as running: | |
library(ggplot2) | |
library(dplyr) | |
library(readr) | |
library(tidyr) | |
library(purrr) | |
library(tibble) | |
library(stringr) | |
library(forcats) |
The fundamental premise of data modeling is to make explicit the relationship between: an outcome variable y, also called a dependent variable or response variable, and an explanatory/predictor variable x, also called an independent variable or covariate | |
Another way to state this is using mathematical terminology: we will model the outcome variable y “as a function” of the explanatory/predictor variable x | |
But, why do we have two different labels, explanatory and predictor, for the variable x? | |
That’s because even though the two terms are often used interchangeably, roughly speaking data modeling serves one of two purposes: | |
Modeling for explanation: When you want to explicitly describe and quantify the relationship between the outcome variable y and a set of explanatory variables x, determine the significance of any relationships, have measures summarizing these relationships, and possibly identify any causal relationships between the variables. | |
Modeling for prediction: When you want to predict an outcome variable y based on the information contained in a set of predictor variables x. Unlike modeling for explanation, however, you don’t care so much about understanding how all the variables relate and interact with one another, but rather only whether you can make good predictions about y using the information in x |
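As a minimal sketch of both purposes (the simulated sim_data data frame and its variable names are hypothetical, not from these notes):
# Hypothetical simulated data: an outcome y modeled as a function of x
sim_data <- data.frame(x = 1:50)
sim_data$y <- 3 + 2 * sim_data$x + rnorm(50)
model <- lm(y ~ x, data = sim_data)              # model y "as a function of" x
summary(model)                                   # modeling for explanation: estimates, significance
predict(model, newdata = data.frame(x = 51))     # modeling for prediction: predict y for a new x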
Factors : these are the categorical variables
Data Frames : these are the rectangular representations of data, where the columns are the variables and the rows are the observations
Conditionals :
Equality : written as ==
Inequality : !=
Logical Operators : And is written as (&) , Or is written as (|)
Functions : a set of commands that take some arguments and return a result
Packages : they are like apps that provide the different functions developed for R; packages must be installed and loaded before they can be used.
A package is loaded with the command library(package name); if a package has not been loaded it simply cannot be used
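A brief sketch of these ideas (the value of x and the add_two() function below are made up for illustration):
x <- 5
x == 5            # equality, returns TRUE
x != 3            # inequality, returns TRUE
x > 0 & x < 10    # logical AND, returns TRUE
x < 0 | x == 5    # logical OR, returns TRUE
add_two <- function(a) {    # a function takes arguments and returns a result
  a + 2
}
add_two(x)        # returns 7
library(dplyr)    # a package must be installed and loaded before it can be used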
Data frames can be explored in the following ways (load the dplyr package first) :
View(Data Frame Name) : lets you see the columns and rows in the data frame
glimpse(Data Frame Name) : lets you inspect the contents of each variable
kable(Data Frame Name) : presents the data more legibly, useful in R Markdown; kable() requires the knitr package
$ operator :
Lets us explore a single variable of a data frame; for example, airlines$name explores the variable name of the airlines data frame
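For example, assuming the airlines data frame from the nycflights13 package introduced later in these notes:
library(dplyr)
library(knitr)
library(nycflights13)
View(airlines)      # opens the data frame in a spreadsheet-like viewer
glimpse(airlines)   # shows every variable and a preview of its contents
kable(airlines)     # renders a more readable table (useful in R Markdown)
airlines$name       # the $ operator extracts the variable name from airlines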
A rectangular spreadsheet-like representation of data where the rows correspond to observations and the columns correspond to variables describing each observation. | |
Importing data : | |
Using the console : | |
library(readr) | |
dem_score <- read_csv("https://moderndive.com/data/dem_score.csv") | |
dem_score | |
Let’s apply some of the data wrangling verbs we learned in Chapter 3 on the drinks data frame: | |
filter() the drinks data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia, then | |
select() all columns except total_litres_of_pure_alcohol by using the - sign, then | |
rename() the variables beer_servings, spirit_servings, and wine_servings to beer, spirit, and wine, respectively. | |
and save the resulting data frame in drinks_smaller: | |
drinks_smaller <- drinks %>% | |
filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>% | |
select(-total_litres_of_pure_alcohol) %>% | |
rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings) | |
drinks_smaller | |
“tidy” in data science using R means that your data follows a standardized format | |
A dataset is a collection of values, usually either numbers (if quantitative) or strings AKA text data (if qualitative/categorical). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a city) across attributes. | |
“Tidy” data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data: | |
- Each variable forms a column. | |
- Each observation forms a row. | |
- Each type of observational unit forms a table. | |
For example, say you have the following table of stock prices in Table 4.1: | |
TABLE 4.1: Stock prices (non-tidy format) : | |
Date Boeing stock price Amazon stock price Google stock price | |
2009-01-01 $173.55 $174.90 $174.34 | |
2009-01-02 $172.61 $171.42 $170.04 | |
Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in “tidy” format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In “tidy” data format, each variable should be its own column, as shown in Table 4.2. Notice that both tables present the same information, but in different formats | |
TABLE 4.2: Stock prices (tidy format) : | |
Date Stock Name Stock Price | |
2009-01-01 Boeing $173.55 | |
2009-01-01 Amazon $174.90 | |
2009-01-01 Google $174.34 | |
2009-01-02 Boeing $172.61 | |
2009-01-02 Amazon $171.42 | |
2009-01-02 Google $170.04 | |
Now we have the requisite three columns Date, Stock Name, and Stock Price. On the other hand, consider the data in Table 4.3 | |
TABLE 4.3: Example of tidy data : | |
Date Boeing Price Weather | |
2009-01-01 $173.55 Sunny | |
2009-01-02 $172.61 Overcast | |
In this case, even though the variable “Boeing Price” occurs just like in our non-“tidy” data in Table 4.1, the data is “tidy” since there are three variables corresponding to three unique pieces of information: Date, Boeing price, and the Weather that particular day | |
If your original data frame is in wide (non-“tidy”) format and you would like to use the ggplot2 or dplyr packages, you will first have to convert it to “tidy” format. To do so, we recommend using the pivot_longer() function in the tidyr package | |
Going back to our drinks_smaller data frame from earlier: | |
drinks_smaller | |
# A tibble: 4 × 4 | |
country beer spirit wine | |
<chr> <int> <int> <int> | |
1 China 79 192 8 | |
2 Italy 85 42 237 | |
3 Saudi Arabia 0 5 0 | |
4 USA 249 158 84 | |
We convert it to “tidy” format by using the pivot_longer() function from the tidyr package as follows: | |
drinks_smaller_tidy <- drinks_smaller %>% | |
pivot_longer(names_to = "type", | |
values_to = "servings", | |
cols = -country) | |
drinks_smaller_tidy | |
# A tibble: 12 × 3 | |
country type servings | |
<chr> <chr> <int> | |
1 China beer 79 | |
2 China spirit 192 | |
3 China wine 8 | |
4 Italy beer 85 | |
5 Italy spirit 42 | |
6 Italy wine 237 | |
7 Saudi Arabia beer 0 | |
8 Saudi Arabia spirit 5 | |
9 Saudi Arabia wine 0 | |
10 USA beer 249 | |
11 USA spirit 158 | |
12 USA wine 84 | |
We set the arguments to pivot_longer() as follows: | |
names_to here corresponds to the name of the variable in the new “tidy”/long data frame that will contain the column names of the original data. Observe how we set names_to = "type". In the resulting drinks_smaller_tidy, the column type contains the three types of alcohol beer, spirit, and wine. Since type is a variable name that doesn’t appear in drinks_smaller, we use quotation marks around it. You’ll receive an error if you just use names_to = type here. | |
values_to here is the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we set values_to = "servings" since each of the numeric values in each of the beer, wine, and spirit columns of the drinks_smaller data corresponds to a value of servings. In the resulting drinks_smaller_tidy, the column servings contains the 4 × 3 = 12 numerical values. Note again that servings doesn’t appear as a variable in drinks_smaller so it again needs quotation marks around it for the values_to argument. | |
The third argument cols is the columns in the drinks_smaller data frame you either want to or don’t want to “tidy.” Observe how we set this to -country indicating that we don’t want to “tidy” the country variable in drinks_smaller and rather only beer, spirit, and wine. Since country is a column that appears in drinks_smaller we don’t put quotation marks around it. | |
The third argument here of cols is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output: | |
drinks_smaller %>% | |
pivot_longer(names_to = "type", | |
values_to = "servings", | |
cols = c(beer, spirit, wine)) | |
Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine), instead of the columns we don’t want to “tidy” using -country. We use the c() function to create a vector of the columns in drinks_smaller that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller data frame, we could also do the following for the cols argument: | |
drinks_smaller %>% | |
pivot_longer(names_to = "type", | |
values_to = "servings", | |
cols = beer:wine) | |
If however you want to convert a “tidy” data frame to “wide” format, you will need to use the pivot_wider() function instead | |
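As a sketch, and assuming the drinks_smaller_tidy data frame created above (with tidyr loaded), pivot_wider() reverses the earlier pivot_longer() call: names_from supplies the new column names and values_from supplies their values.
drinks_smaller_tidy %>%
  pivot_wider(names_from = type,
              values_from = servings)
# Returns the original wide format: one row per country with beer, spirit, and wine as separate columns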
Case study: Democracy in Guatemala : | |
Let’s use the dem_score data frame we imported in Section 4.1, but focus on only data corresponding to Guatemala. | |
guat_dem <- dem_score %>% | |
filter(country == "Guatemala") | |
guat_dem | |
# A tibble: 1 × 10 | |
country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` | |
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> | |
1 Guatemala 2 -6 -5 3 1 -3 -7 3 3 | |
Let’s lay out the grammar of graphics we saw in Section 2.1. | |
First we know we need to set data = guat_dem and use a geom_line() layer, but what is the aesthetic mapping of variables? We’d like to see how the democracy score has changed over the years, so we need to map: | |
year to the x-position aesthetic and | |
democracy_score to the y-position aesthetic | |
Now we are stuck in a predicament, much like with our drinks_smaller example in Section 4.2. We see that we have a variable named country, but its only value is "Guatemala". We have other variables denoted by different year values. Unfortunately, the guat_dem data frame is not “tidy” and hence is not in the appropriate format to apply the grammar of graphics, and thus we cannot use the ggplot2 package just yet | |
We need to take the values of the columns corresponding to years in guat_dem and convert them into a new “names” variable called year. Furthermore, we need to take the democracy score values in the inside of the data frame and turn them into a new “values” variable called democracy_score. Our resulting data frame will have three columns: country, year, and democracy_score. Recall that the pivot_longer() function in the tidyr package does this for us: | |
guat_dem_tidy <- guat_dem %>% | |
pivot_longer(names_to = "year", | |
values_to = "democracy_score", | |
cols = -country, | |
names_transform = list(year = as.integer)) | |
guat_dem_tidy | |
# A tibble: 9 × 3 | |
country year democracy_score | |
<chr> <int> <dbl> | |
1 Guatemala 1952 2 | |
2 Guatemala 1957 -6 | |
3 Guatemala 1962 -5 | |
4 Guatemala 1967 3 | |
5 Guatemala 1972 1 | |
6 Guatemala 1977 -3 | |
7 Guatemala 1982 -7 | |
8 Guatemala 1987 3 | |
9 Guatemala 1992 3 | |
We set the arguments to pivot_longer() as follows: | |
names_to is the name of the variable in the new “tidy” data frame that will contain the column names of the original data. Observe how we set names_to = "year". In the resulting guat_dem_tidy, the column year contains the years where Guatemala’s democracy scores were measured. | |
values_to is the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we set values_to = "democracy_score". In the resulting guat_dem_tidy the column democracy_score contains the 1 × 9 = 9 democracy scores as numeric values. | |
The third argument is the columns you either want to or don’t want to “tidy.” Observe how we set this to cols = -country indicating that we don’t want to “tidy” the country variable in guat_dem and rather only variables 1952 through 1992. | |
The last argument of names_transform tells R what type of variable year should be set to. Without specifying that it is an integer as we’ve done here, pivot_longer() will set it to be a character value by default. | |
We can now create the time-series plot in Figure 4.5 to visualize how democracy scores in Guatemala have changed from 1952 to 1992 using a geom_line(). Furthermore, we’ll use the labs() function in the ggplot2 package to add informative labels to all the aes()thetic attributes of our plot, in this case the x and y positions. | |
ggplot(guat_dem_tidy, aes(x = year, y = democracy_score)) + | |
geom_line() + | |
labs(x = "Year", y = "Democracy Score") | |
The concept of graphics in R is based on the Grammar of Graphics and its components :
In short, the grammar tells us that: | |
A statistical graphic is a MAPPING of DATA variables to AESthetic attributes of GEOMetric objects. | |
Specifically, we can break a graphic into the following three essential components: | |
data: the dataset containing the variables of interest. | |
geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars. | |
aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset | |
For example, if we plot GDP per Capita vs. Life Expectancy, showing the relationship as a scatterplot where the size of the points represents each country's population and the color of the points represents the continent the country belongs to, then in terms of the grammar what we are doing is :
The DATA variable GDP per Capita gets mapped to the x-position AESthetic of the points. | |
The DATA variable Life Expectancy gets mapped to the y-position AESthetic of the points. | |
The DATA variable Population gets mapped to the SIZE AESthetic of the points. | |
The DATA variable Continent gets mapped to the COLOR AESthetic of the points. | |
In other words, we define :
data variable aes geom | |
GDP per Capita x point | |
Life Expectancy y point | |
Population size point | |
Continent color point | |
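In ggplot2 code this mapping would look roughly as follows; the variable names assume the gapminder dataset from the gapminder package, which is an assumption since these notes only describe the mapping:
library(ggplot2)
library(gapminder)    # assumed source of gdpPercap, lifeExp, pop, and continent
gapminder_2007 <- subset(gapminder, year == 2007)   # one observation per country
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp,
                     size = pop, color = continent)) +
  geom_point()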
The grammar could also include :
faceting breaks up a plot into several plots split by the values of another variable | |
position adjustments for barplots | |
-------------------------------------------------------------------------------------------- | |
Los tipos fundamentales de graficas son : | |
-------------------------------------------------------------------------------------------- | |
SCATTERPLOTS : They allow you to visualize the relationship between two numerical variables | |
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + | |
geom_point() | |
The grammar is built up in layers, where each layer is added with a + sign at the end of a line of code; the + sign cannot appear at the beginning of a line of code, it always has to go at the end. Note that after the + you press ENTER to start a new line
When making scatterplots it is possible to run into overplotting, where the points become so concentrated that the resulting mass does not let you see the individual observations clearly. There are two ways to handle this :
- Adjusting the transparency of the points | |
- Adding a little random “jitter”, or random “nudges”, to each of the points | |
Method 1: Changing the transparency | |
The first way of addressing overplotting is to change the transparency/opacity of the points by setting the alpha argument in geom_point(). We can change the alpha argument to be any value between 0 and 1, where 0 sets the points to be 100% transparent and 1 sets the points to be 100% opaque. By default, alpha is set to 1. In other words, if we don’t explicitly set an alpha value, R will use alpha = 1 | |
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + | |
geom_point(alpha = 0.2) | |
Method 2: Jittering the points | |
The second way of addressing overplotting is by jittering all the points. This means giving each point a small “nudge” in a random direction | |
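A sketch of the jitter approach, assuming the same alaska_flights data frame used above; the width and height arguments of geom_jitter() set the maximum horizontal and vertical nudge in the units of the data (here, minutes of delay):
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_jitter(width = 30, height = 30)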
-------------------------------------------------------------------------------------------- | |
LINEGRAPHS : Linegraphs show the relationship between two numerical variables when the variable on the x-axis, also called the explanatory variable, is of a sequential nature. In other words, there is an inherent ordering to the variable | |
ggplot(data = early_january_weather, | |
mapping = aes(x = time_hour, y = temp)) + | |
geom_line() | |
-------------------------------------------------------------------------------------------- | |
HISTOGRAMS : say we don’t care about temp’s relationship with time, but rather only about how the values of temp are distributed. Histograms present information on only a single numerical variable. Specifically, they are visualizations of the distribution of the numerical variable in question
In other words: | |
- What are the smallest and largest values? | |
- What is the “center” or “most typical” value? | |
- How do the values spread out? | |
- What are frequent and infrequent values? | |
A histogram is a plot that visualizes the distribution of a numerical value as follows: | |
- We first cut up the x-axis into a series of bins, where each bin represents a range of values. | |
- For each bin, we count the number of observations that fall in the range corresponding to that bin. | |
- Then for each bin, we draw a bar whose height marks the corresponding count. | |
ggplot(data = weather, mapping = aes(x = temp)) + | |
geom_histogram() | |
ggplot(data = weather, mapping = aes(x = temp)) + | |
geom_histogram(color = "white") | |
ggplot(data = weather, mapping = aes(x = temp)) + | |
geom_histogram(color = "white", fill = "steelblue") | |
Adjusting the bins : | |
- By adjusting the number of bins via the bins argument to geom_histogram(). | |
- By adjusting the width of the bins via the binwidth argument to geom_histogram(). | |
Using the first method, we have the power to specify how many bins we would like to cut the x-axis up in. As mentioned in the previous section, the default number of bins is 30. We can override this default, to say 40 bins, as follows: | |
ggplot(data = weather, mapping = aes(x = temp)) + | |
geom_histogram(bins = 40, color = "white") | |
Using the second method, instead of specifying the number of bins, we specify the width of the bins by using the binwidth argument in the geom_histogram() layer. For example, let’s set the width of each bin to be 10°F. | |
ggplot(data = weather, mapping = aes(x = temp)) + | |
geom_histogram(binwidth = 10, color = "white") | |
-------------------------------------------------------------------------------------------- | |
Facets : | |
Before continuing with the next of the 5NG, let’s briefly introduce a new concept called faceting. Faceting is used when we’d like to split a particular visualization by the values of another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ. | |
ggplot(data = weather, mapping = aes(x = temp)) + | |
geom_histogram(binwidth = 5, color = "white") + | |
facet_wrap(~ month) | |
We can also specify the number of rows and columns in the grid by using the nrow and ncol arguments inside of facet_wrap() | |
ggplot(data = weather, mapping = aes(x = temp)) + | |
geom_histogram(binwidth = 5, color = "white") + | |
facet_wrap(~ month, nrow = 4) | |
-------------------------------------------------------------------------------------------- | |
Boxplots : | |
While faceted histograms are one type of visualization used to compare the distribution of a numerical variable split by the values of another variable, another type of visualization that achieves this same goal is a side-by-side boxplot. | |
- 25% of points fall below the bottom edge of the box, which is the first quartile of 36°F. In other words, 25% of observations were below 36°F. | |
- 25% of points fall between the bottom edge of the box and the solid middle line, which is the median of 45°F. Thus, 25% of observations were between 36°F and 45°F and 50% of observations were below 45°F. | |
- 25% of points fall between the solid middle line and the top edge of the box, which is the third quartile of 52°F. It follows that 25% of observations were between 45°F and 52°F and 75% of observations were below 52°F. | |
- 25% of points fall above the top edge of the box. In other words, 25% of observations were above 52°F. | |
- The middle 50% of points lie within the interquartile range (IQR) between the first and third quartile. Thus, the IQR for this example is 52 - 36 = 16°F. | |
- The interquartile range is a measure of a numerical variable’s spread. | |
- The whiskers stick out from either end of the box all the way to the minimum and maximum | |
- Any observed values outside this range get marked with points called outliers | |
ggplot(data = weather, mapping = aes(x = month, y = temp)) + | |
geom_boxplot() | |
Note that this plot does not provide information about temperature separated by month
We can convert the numerical variable month into a factor categorical variable by using the factor() function. So after applying factor(month), month goes from having numerical values just the 1, 2, …, and 12 to having an associated ordering. With this ordering, ggplot() now knows how to work with this variable to produce the needed plot | |
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) + | |
geom_boxplot() | |
Side-by-side boxplots provide us with a way to compare the distribution of a numerical variable across multiple values of another variable. One can see where the median falls across the different groups by comparing the solid lines in the center of the boxes | |
To study the spread of a numerical variable within one of the boxes, look at both the length of the box and also how far the whiskers extend from either end of the box. Outliers are even more easily identified when looking at a boxplot than when looking at a histogram as they are marked with distinct points | |
-------------------------------------------------------------------------------------------- | |
Barplots : | |
Both histograms and boxplots are tools to visualize the distribution of numerical variables. Another commonly desired task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories within a categorical variable, also known as the levels of the categorical variable. Often the best way to visualize these different counts, also known as frequencies, is with barplots (also called barcharts) | |
One complication, however, is how your data is represented. Is the categorical variable of interest “pre-counted” or not? | |
Depending on how your categorical data is represented, you’ll need to add a different geometric layer type to your ggplot() to create a barplot | |
ggplot(data = fruits, mapping = aes(x = fruit)) + | |
geom_bar() | |
ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) + | |
geom_col() | |
When the categorical variable whose distribution you want to visualize : | |
- Is not pre-counted in your data frame, we use geom_bar(). | |
- Is pre-counted in your data frame, we use geom_col() with the y-position aesthetic mapped to the variable that has the counts. | |
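The fruits and fruits_counted data frames used above are not defined in these notes; a minimal sketch of what they could look like:
library(tibble)
fruits <- tibble(fruit = c("apple", "apple", "orange", "apple", "orange"))   # not pre-counted: one row per observation
fruits_counted <- tibble(fruit = c("apple", "orange"),                       # pre-counted: one row per category
                         number = c(3, 2))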
-------------------------------------------------------------------------------------------- | |
Pie charts : | |
One of the most common plots used to visualize the distribution of categorical data is the pie chart. While they may seem harmless enough, pie charts actually present a problem in that humans are unable to judge angles well | |
We overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another | |
-------------------------------------------------------------------------------------------- | |
BarPlots : | |
Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the joint distribution of two categorical variables at the same time | |
stacked barplot : | |
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + | |
geom_bar() | |
First, the fill aesthetic corresponds to the color used to fill the bars, while the color aesthetic corresponds to the color of the outline of the bars | |
side-by-side barplots, also known as dodged barplots : | |
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + | |
geom_bar(position = "dodge") | |
Note the width of the bars for AS, F9, FL, HA and YV is different than the others. We can make one tweak to the position argument to get them to be the same size in terms of width as the other bars by using the more robust position_dodge() function | |
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + | |
geom_bar(position = position_dodge(preserve = "single")) | |
faceted barplot : | |
ggplot(data = flights, mapping = aes(x = carrier)) + | |
geom_bar() + | |
facet_wrap(~ origin, ncol = 1) | |
Let’s go over some important points about specifying the arguments (i.e., inputs) to functions. Run the following two segments of code: | |
# Segment 1: | |
ggplot(data = flights, mapping = aes(x = carrier)) + | |
geom_bar() | |
# Segment 2: | |
ggplot(flights, aes(x = carrier)) + | |
geom_bar() | |
You’ll notice that both code segments create the same barplot, even though in the second segment we omitted the data = and mapping = code argument names. This is because the ggplot() function by default assumes that the data argument comes first and the mapping argument comes second. As long as you specify the data frame in question first and the aes() mapping second, you can omit the explicit statement of the argument names data = and mapping = | |
Going forward for the rest of this book, all ggplot() code will be like the second segment: with the data = and mapping = explicit naming of the argument omitted with the default ordering of arguments respected. We’ll do this for brevity’s sake; it’s common to see this style when reviewing other R users’ code |
Use a series of functions from the dplyr package for data wrangling that will allow you to take a data frame and “wrangle” it (transform it) to suit your needs :
- filter() a data frame’s existing rows to only pick out a subset of them | |
- summarize() one or more of its columns/variables with a summary statistic. Examples of summary statistics include the median and interquartile range | |
- group_by() its rows. In other words, assign different rows to be part of the same group. We can then combine group_by() with summarize() to report summary statistics for each group separately. For example, say you don’t want a single overall average departure delay dep_delay for all three origin airports combined, but rather three separate average departure delays, one computed for each of the three origin airports | |
- mutate() its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius | |
- arrange() its rows. For example, sort the rows of weather in ascending or descending order of temp | |
- join() it with another data frame by matching along a “key” variable. In other words, merge these two data frames together | |
The pipe operator: %>% | |
The pipe operator allows us to combine multiple operations in R into a single sequential chain of actions | |
Let’s start with a hypothetical example. Say you would like to perform a hypothetical sequence of operations on a hypothetical data frame x using hypothetical functions f(), g(), and h(): | |
Take x then : | |
- Use x as an input to a function f() then | |
- Use the output of f(x) as an input to a function g() then | |
- Use the output of g(f(x)) as an input to a function h() | |
One way to achieve this sequence of operations is by using nesting parentheses as follows: | |
h(g(f(x))) | |
Reading such nested code from the inside out gets hard, however; a second option is the pipe operator %>%. You can obtain the same output as the hypothetical sequence of functions as follows:
x %>% | |
f() %>% | |
g() %>% | |
h() | |
You would read this sequence as: | |
Take x then : | |
- Use this output as the input to the next function f() then | |
- Use this output as the input to the next function g() then | |
- Use this output as the input to the next function h() | |
Much like when adding layers to a ggplot() using the + sign, you form a single chain of data wrangling operations by combining verb-named functions into a single sequence using the pipe operator %>%. Furthermore, much like how the + sign has to come at the end of lines when constructing plots, the pipe operator %>% has to come at the end of lines as well | |
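A concrete sketch using the flights data frame from the nycflights13 package (covered below), chaining two of the wrangling verbs into one pipeline:
library(dplyr)
library(nycflights13)
flights %>%
  filter(dest == "PDX") %>%                               # keep only flights to Portland
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))   # then compute their mean departure delay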
------------------------------------------------------------------------------------------------ | |
filter rows : | |
The filter() function here works much like the “Filter” option in Microsoft Excel; it allows you to specify criteria about the values of a variable in your dataset and then keeps only the rows that match those criteria
portland_flights <- flights %>% | |
filter(dest == "PDX") | |
View(portland_flights) | |
Note the order of the code. First, take the flights data frame flights then filter() the data frame so that only those where the dest equals "PDX" are included | |
You can use other operators beyond just the == operator that tests for equality: | |
> corresponds to “greater than” | |
< corresponds to “less than” | |
>= corresponds to “greater than or equal to” | |
<= corresponds to “less than or equal to” | |
!= corresponds to “not equal to.” The ! is used in many programming languages to indicate “not.” | |
Furthermore, you can combine multiple criteria using operators that make comparisons: | |
| corresponds to “or” | |
& corresponds to “and” | |
To see many of these in action, let’s filter flights for all rows that departed from JFK and were heading to Burlington, Vermont ("BTV") or Seattle, Washington ("SEA") and departed in the months of October, November, or December. Run the following: | |
btv_sea_flights_fall <- flights %>% | |
filter(origin == "JFK" & (dest == "BTV" | dest == "SEA") & month >= 10) | |
View(btv_sea_flights_fall) | |
We can often skip the use of & and just separate our conditions with a comma. The previous code will return the identical output btv_sea_flights_fall as the following code: | |
btv_sea_flights_fall <- flights %>% | |
filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10) | |
View(btv_sea_flights_fall) | |
Let’s present another example that uses the ! “not” operator to pick rows that don’t match a criteria. As mentioned earlier, the ! can be read as “not.” Here we are filtering rows corresponding to flights that didn’t go to Burlington, VT or Seattle, WA. | |
not_BTV_SEA <- flights %>% | |
filter(!(dest == "BTV" | dest == "SEA")) | |
View(not_BTV_SEA) | |
Now say we have a larger number of airports we want to filter for, say "SEA", "SFO", "PDX", "BTV", and "BDL". We could continue to use the | (or) operator: | |
many_airports <- flights %>% | |
filter(dest == "SEA" | dest == "SFO" | dest == "PDX" | | |
dest == "BTV" | dest == "BDL") | |
but as we progressively include more airports, this will get unwieldy to write. A slightly shorter approach uses the %in% operator along with the c() function. Recall from Subsection 1.2.1 that the c() function “combines” or “concatenates” values into a single vector of values. | |
many_airports <- flights %>% | |
filter(dest %in% c("SEA", "SFO", "PDX", "BTV", "BDL")) | |
View(many_airports) | |
What this code is doing is filtering flights for all flights where dest is in the vector of airports c("BTV", "SEA", "PDX", "SFO", "BDL"). Both outputs of many_airports are the same, but as you can see the latter takes much less energy to code. The %in% operator is useful for looking for matches commonly in one vector/variable compared to another | |
------------------------------------------------------------------------------------------------ | |
summarize variables : | |
The next common task when working with data frames is to compute summary statistics. Summary statistics are single numerical values that summarize a large number of values | |
summary_temp <- weather %>% | |
summarize(mean = mean(temp), std_dev = sd(temp)) | |
summary_temp | |
# A tibble: 1 × 2 | |
mean std_dev | |
<dbl> <dbl> | |
1 NA NA | |
NA is how R encodes missing values where NA indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, NA is stored instead | |
Going back to our summary_temp output, by default any time you try to calculate a summary statistic of a variable that has one or more NA missing values in R, NA is returned. To work around this fact, you can set the na.rm argument to TRUE, where rm is short for “remove”; this will ignore any NA missing values and only return the summary value for all non-missing values | |
The code that follows computes the mean and standard deviation of all non-missing values of temp: | |
summary_temp <- weather %>% | |
summarize(mean = mean(temp, na.rm = TRUE), | |
std_dev = sd(temp, na.rm = TRUE)) | |
summary_temp | |
# A tibble: 1 × 2 | |
mean std_dev | |
<dbl> <dbl> | |
1 55.3 17.8 | |
There are other summary functions in R that take many values and return just one. Here are just a few:
mean(): the average | |
sd(): the standard deviation, which is a measure of spread | |
min() and max(): the minimum and maximum values, respectively | |
IQR(): interquartile range | |
sum(): the total amount when adding multiple numbers | |
n(): a count of the number of rows in each group. This particular summary function will make more sense once group_by() is covered in the next section
------------------------------------------------------------------------------------------------ | |
group_by rows : | |
Say instead of a single mean temperature for the whole year, you would like 12 mean temperatures, one for each of the 12 months separately. In other words, we would like to compute the mean temperature split by month. We can do this by “grouping” temperature observations by the values of another variable, in this case by the 12 values of the variable month. Run the following code: | |
summary_monthly_temp <- weather %>% | |
group_by(month) %>% | |
summarize(mean = mean(temp, na.rm = TRUE), | |
std_dev = sd(temp, na.rm = TRUE)) | |
summary_monthly_temp | |
# A tibble: 12 × 3 | |
month mean std_dev | |
<int> <dbl> <dbl> | |
1 1 35.6 10.2 | |
2 2 34.3 6.98 | |
3 3 39.9 6.25 | |
4 4 51.7 8.79 | |
5 5 61.8 9.68 | |
6 6 72.2 7.55 | |
7 7 80.1 7.12 | |
8 8 74.5 5.19 | |
9 9 67.4 8.47 | |
10 10 60.1 8.85 | |
11 11 45.0 10.4 | |
12 12 38.4 9.98 | |
------------------------------------------------------------------------------------------------ | |
Grouping by more than one variable : | |
You are not limited to grouping by one variable. Say you want to know the number of flights leaving each of the three New York City airports for each month. We can also group by a second variable month using group_by(origin, month): | |
by_origin_monthly <- flights %>% | |
group_by(origin, month) %>% | |
summarize(count = n()) | |
by_origin_monthly | |
------------------------------------------------------------------------------------------------ | |
mutate existing variables : | |
Another common transformation of data is to create/compute new variables based on existing ones. For example, say you are more comfortable thinking of temperature in degrees Celsius (°C) instead of degrees Fahrenheit (°F). | |
weather <- weather %>% | |
mutate(temp_in_C = (temp - 32) / 1.8) | |
Let’s now compute monthly average temperatures in both °F and °C using the group_by() and summarize() code we saw in Section 3.4: | |
summary_monthly_temp <- weather %>% | |
group_by(month) %>% | |
summarize(mean_temp_in_F = mean(temp, na.rm = TRUE), | |
mean_temp_in_C = mean(temp_in_C, na.rm = TRUE)) | |
summary_monthly_temp | |
flights <- flights %>% | |
mutate(gain = dep_delay - arr_delay) | |
Let’s look at some summary statistics of the gain variable by considering multiple summary functions at once in the same summarize() code: | |
gain_summary <- flights %>% | |
summarize( | |
min = min(gain, na.rm = TRUE), | |
q1 = quantile(gain, 0.25, na.rm = TRUE), | |
median = quantile(gain, 0.5, na.rm = TRUE), | |
q3 = quantile(gain, 0.75, na.rm = TRUE), | |
max = max(gain, na.rm = TRUE), | |
mean = mean(gain, na.rm = TRUE), | |
sd = sd(gain, na.rm = TRUE), | |
missing = sum(is.na(gain)) | |
) | |
gain_summary | |
To close out our discussion on the mutate() function to create new variables, note that we can create multiple new variables at once in the same mutate() code : | |
flights <- flights %>% | |
mutate( | |
gain = dep_delay - arr_delay, | |
hours = air_time / 60, | |
gain_per_hour = gain / hours | |
) | |
------------------------------------------------------------------------------------------------ | |
arrange and sort rows : | |
arrange() returns rows sorted in ascending order by default; to sort in descending order, wrap the sorting variable in desc() :
freq_dest %>% | |
arrange(desc(num_flights)) | |
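The freq_dest data frame is not created in these notes; a sketch of how it could be built (assuming flights from nycflights13 and dplyr are loaded), sorted first in the default ascending order and then in descending order:
freq_dest <- flights %>%
  group_by(dest) %>%
  summarize(num_flights = n())     # number of flights to each destination
freq_dest %>% arrange(num_flights)         # ascending order (the default)
freq_dest %>% arrange(desc(num_flights))   # descending order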
------------------------------------------------------------------------------------------------ | |
join data frames : | |
Matching “key” variable names : | |
In both the flights and airlines data frames, the key variable we want to join/merge/match the rows by has the same name: carrier. Let’s use the inner_join() function to join the two data frames, where the rows will be matched by the variable carrier, and then compare the resulting data frames: | |
flights_joined <- flights %>% | |
inner_join(airlines, by = "carrier") | |
View(flights) | |
View(flights_joined) | |
Different “key” variable names : | |
if you look at both the airports and flights data frames, you’ll find that the airport codes are in variables that have different names. In airports the airport code is in faa, whereas in flights the airport codes are in origin and dest. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure 3.7. | |
In order to join these two data frames by airport code, our inner_join() operation will use the by = c("dest" = "faa") argument with modified code syntax allowing us to join two data frames where the key variable has a different name: | |
flights_with_airport_names <- flights %>% | |
inner_join(airports, by = c("dest" = "faa")) | |
View(flights_with_airport_names) | |
Let’s construct the chain of pipe operators %>% that computes the number of flights from NYC to each destination, but also includes information about each destination airport: | |
named_dests <- flights %>% | |
group_by(dest) %>% | |
summarize(num_flights = n()) %>% | |
arrange(desc(num_flights)) %>% | |
inner_join(airports, by = c("dest" = "faa")) %>% | |
rename(airport_name = name) | |
named_dests | |
Multiple “key” variables : | |
Say instead we want to join two data frames by multiple key variables. For example, in Figure 3.7, we see that in order to join the flights and weather data frames, we need more than one key variable: year, month, day, hour, and origin. This is because the combination of these 5 variables act to uniquely identify each observational unit in the weather data frame: hourly weather recordings at each of the 3 NYC airports | |
We achieve this by specifying a vector of key variables to join by using the c() function | |
flights_weather_joined <- flights %>% | |
inner_join(weather, by = c("year", "month", "day", "hour", "origin")) | |
View(flights_weather_joined) | |
------------------------------------------------------------------------------------------------ | |
select variables : | |
We’ve seen that the flights data frame in the nycflights13 package contains 19 different variables. You can identify the names of these 19 variables by running the glimpse() function from the dplyr package: | |
glimpse(flights) | |
However, say you only need two of these 19 variables, say carrier and flight. You can select() these two variables: | |
flights %>% | |
select(carrier, flight) | |
Let’s say instead you want to drop, or de-select, certain variables. For example, consider the variable year in the flights data frame. This variable isn’t quite a “variable” because it is always 2013 and hence doesn’t change. Say you want to remove this variable from the data frame. We can deselect year by using the - sign: | |
flights_no_year <- flights %>% select(-year) | |
Another way of selecting columns/variables is by specifying a range of columns: | |
flight_arr_times <- flights %>% select(month:day, arr_time:sched_arr_time) | |
flight_arr_times | |
This will select() all columns between month and day, as well as between arr_time and sched_arr_time, and drop the rest | |
The select() function can also be used to reorder columns when used with the everything() helper function. For example, suppose we want the hour, minute, and time_hour variables to appear immediately after the year, month, and day variables, while not discarding the rest of the variables. In the following code, everything() will pick up all remaining variables: | |
flights_reorder <- flights %>% | |
select(year, month, day, hour, minute, time_hour, everything()) | |
glimpse(flights_reorder) | |
Lastly, the helper functions starts_with(), ends_with(), and contains() can be used to select variables/columns that match those conditions. As examples, | |
flights %>% select(starts_with("a")) | |
flights %>% select(ends_with("delay")) | |
flights %>% select(contains("time")) | |
------------------------------------------------------------------------------------------------ | |
rename variables : | |
Another useful function is rename(), which as you may have guessed changes the names of variables. Suppose we want to focus only on dep_time and arr_time and rename them departure_time and arrival_time:
flights_time_new <- flights %>% | |
select(dep_time, arr_time) %>% | |
rename(departure_time = dep_time, arrival_time = arr_time) | |
glimpse(flights_time_new) | |
------------------------------------------------------------------------------------------------ | |
top_n values of a variable : | |
We can also return the top n values of a variable using the top_n() function | |
named_dests %>% top_n(n = 10, wt = num_flights) | |
Let’s further arrange() these results in descending order of num_flights: | |
named_dests %>% | |
top_n(n = 10, wt = num_flights) %>% | |
arrange(desc(num_flights)) |
nycflights13 : | |
Data related to all domestic flights departing from one of New York City’s three main airports in 2013; this package contains 5 datasets :
flights: Information on all 336,776 flights | |
airlines: A table matching airline names and their two-letter International Air Transport Association (IATA) airline codes (also known as carrier codes) for 16 airline companies. For example, “DL” is the two-letter code for Delta | |
planes: Information about each of the 3,322 physical aircraft used | |
weather: Hourly meteorological data for each of the three NYC airports | |
airports: Names, codes, and locations of the 1,458 domestic destinations | |
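To make these datasets available, load the package (installing it first if needed) and inspect one of them:
# install.packages("nycflights13")   # only needed once
library(nycflights13)
library(dplyr)
glimpse(flights)   # 336,776 rows, one per flight departing NYC in 2013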
------------------------------------------------------------------------------------------------ | |
moderndive : datasets used throughout the ModernDive book; for example, the dem_score data can also be downloaded directly from :
https://moderndive.com/data/dem_score.csv
------------------------------------------------------------------------------------------------ | |
fivethirtyeight : The fivethirtyeight package (Kim, Ismay, and Chunn 2022) provides access to the datasets used in many articles published by the data journalism website, FiveThirtyEight.com | |
For a complete list of all 129 datasets included in the fivethirtyeight package, check out the package webpage by going to: | |
https://fivethirtyeight-r.netlify.app/articles/fivethirtyeight.html | |
------------------------------------------------------------------------------------------------ |
R is case-sensitive, and spaces cannot be used in object or variable names
Everything typed in the console runs immediately but is lost when the working session ends
If I want to keep what I define, I use a script, which is created as a new script from the option New File - Script
When working in a script, the code only runs when I click Run, and the result appears in the console
A script normally has several lines, so to run them I select them and click Run (Ctrl-Enter); only the selected lines are executed, the ones left out of the selection are not. The result is shown in the console
Variables in R are called objects; the assignment operator is <- , so to assign the value 5 to x I would write : x <- 5
When an assignment is made, the variable and its assigned value appear in the Environment pane
Objects are not only variables; they are really any entity in which data can be stored. In fact, an Excel file or a database table can be stored in an object.
For example, to store an Excel table in an object named Calcuta we would have : Calcuta <- read_xlsx("nombre_archivo.xlsx") and the object immediately becomes available in the Environment pane
The main R packages to start any analysis are :
ggplot2, readxl, tidyr, dplyr
They normally do not come pre-installed in R-Studio, so they must be installed manually from the Install option of the Packages pane
They can also be installed manually from the console with the command :
install.packages("ggplot2")
install.packages("readxl")
install.packages("tidyr")
install.packages("dplyr")
Once the packages are installed, they must be activated with the checkbox in the Packages pane; they can also be activated in code with :
library(ggplot2)
library(readxl)
library(tidyr)
library(dplyr)
https://moderndive.netlify.app/1-getting-started.html | |
Typing ? in the console before a function, object, or command brings up the help for that command, object, or function : ?command
R for Data Science : | |
https://r4ds.had.co.nz/data-visualisation.html |
Very important: at the start of each session, set the Working Directory, which is done via Session - Set Working Directory - Choose Directory
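The same can be done from the console; the path below is only a placeholder:
getwd()                           # shows the current working directory
setwd("C:/Users/me/my_project")   # sets it to the given (placeholder) path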
Identifying the Distribution of the Data :
https://statisticsbyjim.com/hypothesis-testing/identify-distribution-data/#:~:text=Using%20Probability%20Plots%20to%20Identify,the%20distribution%20fits%20your%20data. |