@fdzuluaga2020
Last active August 15, 2022 02:49
R-Studio
dplyr : data wrangling (filter, mutate, summarize, arrange, join, group_by)
knitr : report generation (kable() tables)
tidyr : converts data to tidy format
ggplot2 : data visualization
readr : importing data
-----------------------------------------------------------------------------
tidyverse : Umbrella Package
https://tidyr.tidyverse.org/dev/articles/pivot.html#pew
running:
library(tidyverse)
would be the same as running:
library(ggplot2)
library(dplyr)
library(readr)
library(tidyr)
library(purrr)
library(tibble)
library(stringr)
library(forcats)
The fundamental premise of data modeling is to make explicit the relationship between an outcome variable y (also called a dependent variable or response variable) and an explanatory/predictor variable x (also called an independent variable or covariate).
Another way to state this using mathematical terminology: we will model the outcome variable y "as a function" of the explanatory/predictor variable x.
But why do we have two different labels, explanatory and predictor, for the variable x?
That's because even though the two terms are often used interchangeably, roughly speaking data modeling serves one of two purposes:
Modeling for explanation: when you want to explicitly describe and quantify the relationship between the outcome variable y and a set of explanatory variables x, determine the significance of any relationships, have measures summarizing these relationships, and possibly identify any causal relationships between the variables.
Modeling for prediction: when you want to predict an outcome variable y based on the information contained in a set of predictor variables x. Unlike modeling for explanation, however, you don't care so much about understanding how all the variables relate and interact with one another, but rather only whether you can make good predictions about y using the information in x.
Factors : categorical variables
Data Frames : rectangular representations of data in which the columns are variables and the rows are observations
Conditionals :
Equality : tested with ==
Inequality : !=
Logical operators : And is written as (&), Or is written as (|)
Functions : sets of commands that take arguments and return a result
Packages : like apps that provide the various functions developed for R; packages must be installed and loaded before they can be used.
A package is loaded with the command library(package name); if a package has not been loaded it simply cannot be used
A data frame can be explored in the following ways (load the dplyr package first) :
View(data frame name) : shows the rows and columns of the data frame
glimpse(data frame name) : shows the contents of the variables
kable(data frame name) : renders the data more legibly, useful in R Markdown; kable() requires the knitr package
$ operator :
Lets us access a single variable of a data frame; for example, airlines$name accesses the name variable of the airlines data frame
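The exploration commands above can be sketched on a built-in dataset; iris ships with base R and stands in here for the airlines data frame, which lives in the nycflights13 package (this assumes dplyr is installed):

```r
# Exploring a data frame: glimpse() from dplyr, and the $ operator.
library(dplyr)

glimpse(iris)        # column names, types, and the first few values
levels(iris$Species) # $ extracts the Species variable; levels() lists its categories
```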
A rectangular spreadsheet-like representation of data where the rows correspond to observations and the columns correspond to variables describing each observation.
Importing data :
Using the console :
library(readr)
dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
dem_score
Let’s apply some of the data wrangling verbs we learned in Chapter 3 on the drinks data frame:
filter() the drinks data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia, then
select() all columns except total_litres_of_pure_alcohol by using the - sign, then
rename() the variables beer_servings, spirit_servings, and wine_servings to beer, spirit, and wine, respectively.
and save the resulting data frame in drinks_smaller:
drinks_smaller <- drinks %>%
  filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>%
  select(-total_litres_of_pure_alcohol) %>%
  rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
drinks_smaller
“tidy” in data science using R means that your data follows a standardized format
A dataset is a collection of values, usually either numbers (if quantitative) or strings AKA text data (if qualitative/categorical). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a city) across attributes.
“Tidy” data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
For example, say you have the following table of stock prices in Table 4.1:
TABLE 4.1: Stock prices (non-tidy format) :
Date Boeing stock price Amazon stock price Google stock price
2009-01-01 $173.55 $174.90 $174.34
2009-01-02 $172.61 $171.42 $170.04
Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in “tidy” format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In “tidy” data format, each variable should be its own column, as shown in Table 4.2. Notice that both tables present the same information, but in different formats
TABLE 4.2: Stock prices (tidy format) :
Date Stock Name Stock Price
2009-01-01 Boeing $173.55
2009-01-01 Amazon $174.90
2009-01-01 Google $174.34
2009-01-02 Boeing $172.61
2009-01-02 Amazon $171.42
2009-01-02 Google $170.04
Now we have the requisite three columns Date, Stock Name, and Stock Price. On the other hand, consider the data in Table 4.3
TABLE 4.3: Example of tidy data :
Date Boeing Price Weather
2009-01-01 $173.55 Sunny
2009-01-02 $172.61 Overcast
In this case, even though the variable “Boeing Price” occurs just like in our non-“tidy” data in Table 4.1, the data is “tidy” since there are three variables corresponding to three unique pieces of information: Date, Boeing price, and the Weather that particular day
If your original data frame is in wide (non-“tidy”) format and you would like to use the ggplot2 or dplyr packages, you will first have to convert it to “tidy” format. To do so, we recommend using the pivot_longer() function in the tidyr package
Going back to our drinks_smaller data frame from earlier:
drinks_smaller
# A tibble: 4 × 4
country beer spirit wine
<chr> <int> <int> <int>
1 China 79 192 8
2 Italy 85 42 237
3 Saudi Arabia 0 5 0
4 USA 249 158 84
We convert it to “tidy” format by using the pivot_longer() function from the tidyr package as follows:
drinks_smaller_tidy <- drinks_smaller %>%
  pivot_longer(names_to = "type",
               values_to = "servings",
               cols = -country)
drinks_smaller_tidy
# A tibble: 12 × 3
country type servings
<chr> <chr> <int>
1 China beer 79
2 China spirit 192
3 China wine 8
4 Italy beer 85
5 Italy spirit 42
6 Italy wine 237
7 Saudi Arabia beer 0
8 Saudi Arabia spirit 5
9 Saudi Arabia wine 0
10 USA beer 249
11 USA spirit 158
12 USA wine 84
We set the arguments to pivot_longer() as follows:
names_to here corresponds to the name of the variable in the new “tidy”/long data frame that will contain the column names of the original data. Observe how we set names_to = "type". In the resulting drinks_smaller_tidy, the column type contains the three types of alcohol beer, spirit, and wine. Since type is a variable name that doesn’t appear in drinks_smaller, we use quotation marks around it. You’ll receive an error if you just use names_to = type here.
values_to here is the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we set values_to = "servings" since each of the numeric values in each of the beer, wine, and spirit columns of the drinks_smaller data corresponds to a value of servings. In the resulting drinks_smaller_tidy, the column servings contains the 4 × 3 = 12 numerical values. Note again that servings doesn’t appear as a variable in drinks_smaller so it again needs quotation marks around it for the values_to argument.
The third argument cols is the columns in the drinks_smaller data frame you either want to or don’t want to “tidy.” Observe how we set this to -country indicating that we don’t want to “tidy” the country variable in drinks_smaller and rather only beer, spirit, and wine. Since country is a column that appears in drinks_smaller we don’t put quotation marks around it.
The third argument here of cols is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output:
drinks_smaller %>%
  pivot_longer(names_to = "type",
               values_to = "servings",
               cols = c(beer, spirit, wine))
Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine), instead of the columns we don’t want to “tidy” using -country. We use the c() function to create a vector of the columns in drinks_smaller that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller data frame, we could also do the following for the cols argument:
drinks_smaller %>%
  pivot_longer(names_to = "type",
               values_to = "servings",
               cols = beer:wine)
If however you want to convert a “tidy” data frame to “wide” format, you will need to use the pivot_wider() function instead
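A minimal sketch of that reverse operation, assuming tidyr and tibble are installed; the small tibble below recreates part of drinks_smaller_tidy by hand:

```r
library(tidyr)
library(tibble)

# Hand-built fragment of drinks_smaller_tidy (China and Italy only).
drinks_tidy <- tibble(
  country  = rep(c("China", "Italy"), each = 3),
  type     = rep(c("beer", "spirit", "wine"), times = 2),
  servings = c(79, 192, 8, 85, 42, 237)
)

# pivot_wider() is the inverse of pivot_longer(): the new column names
# come from `type`, and the cell values come from `servings`.
drinks_wide <- pivot_wider(drinks_tidy, names_from = type, values_from = servings)
drinks_wide
```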
Case study: Democracy in Guatemala :
Let’s use the dem_score data frame we imported in Section 4.1, but focus on only data corresponding to Guatemala.
guat_dem <- dem_score %>%
  filter(country == "Guatemala")
guat_dem
# A tibble: 1 × 10
country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Guatemala 2 -6 -5 3 1 -3 -7 3 3
Let’s lay out the grammar of graphics we saw in Section 2.1.
First we know we need to set data = guat_dem and use a geom_line() layer, but what is the aesthetic mapping of variables? We’d like to see how the democracy score has changed over the years, so we need to map:
year to the x-position aesthetic and
democracy_score to the y-position aesthetic
Now we are stuck in a predicament, much like with our drinks_smaller example in Section 4.2. We see that we have a variable named country, but its only value is "Guatemala". We have other variables denoted by different year values. Unfortunately, the guat_dem data frame is not “tidy” and hence is not in the appropriate format to apply the grammar of graphics, and thus we cannot use the ggplot2 package just yet
We need to take the values of the columns corresponding to years in guat_dem and convert them into a new “names” variable called year. Furthermore, we need to take the democracy score values in the inside of the data frame and turn them into a new “values” variable called democracy_score. Our resulting data frame will have three columns: country, year, and democracy_score. Recall that the pivot_longer() function in the tidyr package does this for us:
guat_dem_tidy <- guat_dem %>%
  pivot_longer(names_to = "year",
               values_to = "democracy_score",
               cols = -country,
               names_transform = list(year = as.integer))
guat_dem_tidy
# A tibble: 9 × 3
country year democracy_score
<chr> <int> <dbl>
1 Guatemala 1952 2
2 Guatemala 1957 -6
3 Guatemala 1962 -5
4 Guatemala 1967 3
5 Guatemala 1972 1
6 Guatemala 1977 -3
7 Guatemala 1982 -7
8 Guatemala 1987 3
9 Guatemala 1992 3
We set the arguments to pivot_longer() as follows:
names_to is the name of the variable in the new “tidy” data frame that will contain the column names of the original data. Observe how we set names_to = "year". In the resulting guat_dem_tidy, the column year contains the years where Guatemala’s democracy scores were measured.
values_to is the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we set values_to = "democracy_score". In the resulting guat_dem_tidy the column democracy_score contains the 1 × 9 = 9 democracy scores as numeric values.
The third argument is the columns you either want to or don’t want to “tidy.” Observe how we set this to cols = -country indicating that we don’t want to “tidy” the country variable in guat_dem and rather only variables 1952 through 1992.
The last argument of names_transform tells R what type of variable year should be set to. Without specifying that it is an integer as we’ve done here, pivot_longer() will set it to be a character value by default.
We can now create the time-series plot in Figure 4.5 to visualize how democracy scores in Guatemala have changed from 1952 to 1992 using a geom_line(). Furthermore, we’ll use the labs() function in the ggplot2 package to add informative labels to all the aes()thetic attributes of our plot, in this case the x and y positions.
ggplot(guat_dem_tidy, aes(x = year, y = democracy_score)) +
  geom_line() +
  labs(x = "Year", y = "Democracy Score")
The concept of graphics in R is based on a grammar of components :
In short, the grammar tells us that:
A statistical graphic is a MAPPING of DATA variables to AESthetic attributes of GEOMetric objects.
Specifically, we can break a graphic into the following three essential components:
data: the dataset containing the variables of interest.
geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.
aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset
For example, suppose I plot GDP per capita vs. life expectancy as a scatterplot, where the size of the points represents each country's population and the color of the points represents the continent the country belongs to. In grammar terms, what we are doing is :
The DATA variable GDP per Capita gets mapped to the x-position AESthetic of the points.
The DATA variable Life Expectancy gets mapped to the y-position AESthetic of the points.
The DATA variable Population gets mapped to the SIZE AESthetic of the points.
The DATA variable Continent gets mapped to the COLOR AESthetic of the points.
In other words, we define :
data variable     aes    geom
GDP per Capita    x      point
Life Expectancy   y      point
Population        size   point
Continent         color  point
The grammar could also include :
faceting breaks up a plot into several plots split by the values of another variable
position adjustments for barplots
--------------------------------------------------------------------------------------------
The fundamental types of plots are :
--------------------------------------------------------------------------------------------
SCATTERPLOTS : They allow you to visualize the relationship between two numerical variables
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point()
The grammar is built up in layers, where each layer is added with a + sign at the end of a line of code. The + sign cannot appear at the beginning of a line; it must always come at the end, with ENTER pressed after the + to start a new line
When making scatterplots it is possible to run into overplotting, where so many points pile up in one area that the resulting mass makes it hard to see individual observations. There are two ways to handle this :
- Adjusting the transparency of the points
- Adding a little random “jitter”, or random “nudges”, to each of the points
Method 1: Changing the transparency
The first way of addressing overplotting is to change the transparency/opacity of the points by setting the alpha argument in geom_point(). We can change the alpha argument to be any value between 0 and 1, where 0 sets the points to be 100% transparent and 1 sets the points to be 100% opaque. By default, alpha is set to 1. In other words, if we don’t explicitly set an alpha value, R will use alpha = 1
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2)
Method 2: Jittering the points
The second way of addressing overplotting is by jittering all the points. This means giving each point a small “nudge” in a random direction
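Method 2 can be sketched with geom_jitter(), where width and height set the size of the random nudges in data units. The toy data frame below is a hypothetical stand-in for alaska_flights (assumes ggplot2 is installed):

```r
library(ggplot2)
set.seed(76)  # make the random jitter reproducible

# Toy stand-in for alaska_flights (hypothetical delay values in minutes).
toy_flights <- data.frame(dep_delay = round(rnorm(50, mean = 10, sd = 15)),
                          arr_delay = round(rnorm(50, mean = 5, sd = 20)))

p <- ggplot(data = toy_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_jitter(width = 30, height = 30)
p
```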
--------------------------------------------------------------------------------------------
LINEGRAPHS : Linegraphs show the relationship between two numerical variables when the variable on the x-axis, also called the explanatory variable, is of a sequential nature. In other words, there is an inherent ordering to the variable
ggplot(data = early_january_weather,
       mapping = aes(x = time_hour, y = temp)) +
  geom_line()
--------------------------------------------------------------------------------------------
HISTOGRAMS : suppose we don't care about temp's relationship with time, but only about how the values of temp distribute. Histograms present information on a single numerical variable; specifically, they are visualizations of the distribution of the numerical variable in question
In other words:
- What are the smallest and largest values?
- What is the “center” or “most typical” value?
- How do the values spread out?
- What are frequent and infrequent values?
A histogram is a plot that visualizes the distribution of a numerical value as follows:
- We first cut up the x-axis into a series of bins, where each bin represents a range of values.
- For each bin, we count the number of observations that fall in the range corresponding to that bin.
- Then for each bin, we draw a bar whose height marks the corresponding count.
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram()

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(color = "white")

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(color = "white", fill = "steelblue")
Adjusting the bins :
- By adjusting the number of bins via the bins argument to geom_histogram().
- By adjusting the width of the bins via the binwidth argument to geom_histogram().
Using the first method, we have the power to specify how many bins we would like to cut the x-axis up into. As mentioned in the previous section, the default number of bins is 30. We can override this default, to say 40 bins, as follows:
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 40, color = "white")
Using the second method, instead of specifying the number of bins, we specify the width of the bins by using the binwidth argument in the geom_histogram() layer. For example, let’s set the width of each bin to be 10°F.
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 10, color = "white")
--------------------------------------------------------------------------------------------
Facets :
Before continuing with the next of the 5NG, let’s briefly introduce a new concept called faceting. Faceting is used when we’d like to split a particular visualization by the values of another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ month)
We can also specify the number of rows and columns in the grid by using the nrow and ncol arguments inside of facet_wrap()
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ month, nrow = 4)
--------------------------------------------------------------------------------------------
Boxplots :
While faceted histograms are one type of visualization used to compare the distribution of a numerical variable split by the values of another variable, another type of visualization that achieves this same goal is a side-by-side boxplot.
- 25% of points fall below the bottom edge of the box, which is the first quartile of 36°F. In other words, 25% of observations were below 36°F.
- 25% of points fall between the bottom edge of the box and the solid middle line, which is the median of 45°F. Thus, 25% of observations were between 36°F and 45°F and 50% of observations were below 45°F.
- 25% of points fall between the solid middle line and the top edge of the box, which is the third quartile of 52°F. It follows that 25% of observations were between 45°F and 52°F and 75% of observations were below 52°F.
- 25% of points fall above the top edge of the box. In other words, 25% of observations were above 52°F.
- The middle 50% of points lie within the interquartile range (IQR) between the first and third quartile. Thus, the IQR for this example is 52 - 36 = 16°F.
- The interquartile range is a measure of a numerical variable’s spread.
- The whiskers stick out from either end of the box all the way to the minimum and maximum
- Any observed values outside this range get marked with points called outliers
ggplot(data = weather, mapping = aes(x = month, y = temp)) +
  geom_boxplot()
Note that this plot does not provide information about temperature separated by month
We can convert the numerical variable month into a factor (categorical variable) by using the factor() function. After applying factor(month), month goes from having the numerical values 1, 2, …, 12 to having an associated ordering. With this ordering, ggplot() now knows how to work with the variable to produce the needed plot
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot()
Side-by-side boxplots provide us with a way to compare the distribution of a numerical variable across multiple values of another variable. One can see where the median falls across the different groups by comparing the solid lines in the center of the boxes
To study the spread of a numerical variable within one of the boxes, look at both the length of the box and also how far the whiskers extend from either end of the box. Outliers are even more easily identified when looking at a boxplot than when looking at a histogram as they are marked with distinct points
--------------------------------------------------------------------------------------------
Barplots :
Both histograms and boxplots are tools to visualize the distribution of numerical variables. Another commonly desired task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories within a categorical variable, also known as the levels of the categorical variable. Often the best way to visualize these different counts, also known as frequencies, is with barplots (also called barcharts)
One complication, however, is how your data is represented. Is the categorical variable of interest “pre-counted” or not?
Depending on how your categorical data is represented, you’ll need to add a different geometric layer type to your ggplot() to create a barplot
ggplot(data = fruits, mapping = aes(x = fruit)) +
  geom_bar()

ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) +
  geom_col()
When the categorical variable whose distribution you want to visualize :
- Is not pre-counted in your data frame, we use geom_bar().
- Is pre-counted in your data frame, we use geom_col() with the y-position aesthetic mapped to the variable that has the counts.
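The two data representations can be sketched as follows; the fruit values here are hypothetical, built to mirror the book's fruits / fruits_counted example (assumes dplyr and tibble are installed):

```r
library(dplyr)
library(tibble)

# Not pre-counted: one row per observation -> geom_bar() does the counting.
fruits <- tibble(fruit = c("apple", "apple", "orange", "apple", "orange"))

# Pre-counted: one row per category plus a count column -> geom_col().
fruits_counted <- fruits %>%
  group_by(fruit) %>%
  summarize(number = n())
fruits_counted
```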
--------------------------------------------------------------------------------------------
Pie charts :
One of the most common plots used to visualize the distribution of categorical data is the pie chart. While they may seem harmless enough, pie charts actually present a problem in that humans are unable to judge angles well
We overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another
--------------------------------------------------------------------------------------------
BarPlots :
Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the joint distribution of two categorical variables at the same time
stacked barplot :
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar()
First, the fill aesthetic corresponds to the color used to fill the bars, while the color aesthetic corresponds to the color of the outline of the bars
side-by-side barplots, also known as dodged barplots :
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar(position = "dodge")
Note the width of the bars for AS, F9, FL, HA and YV is different than the others. We can make one tweak to the position argument to get them to be the same size in terms of width as the other bars by using the more robust position_dodge() function
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar(position = position_dodge(preserve = "single"))
faceted barplot :
ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar() +
  facet_wrap(~ origin, ncol = 1)
Let’s go over some important points about specifying the arguments (i.e., inputs) to functions. Run the following two segments of code:
# Segment 1:
ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar()

# Segment 2:
ggplot(flights, aes(x = carrier)) +
  geom_bar()
You’ll notice that both code segments create the same barplot, even though in the second segment we omitted the data = and mapping = code argument names. This is because the ggplot() function by default assumes that the data argument comes first and the mapping argument comes second. As long as you specify the data frame in question first and the aes() mapping second, you can omit the explicit statement of the argument names data = and mapping =
Going forward for the rest of this book, all ggplot() code will be like the second segment: with the data = and mapping = explicit naming of the argument omitted with the default ordering of arguments respected. We’ll do this for brevity’s sake; it’s common to see this style when reviewing other R users’ code
Use the following series of functions from the dplyr package for data wrangling; they take a data frame and "wrangle" it (transform it) to suit your needs :
- filter() a data frame’s existing rows to only pick out a subset of them
- summarize() one or more of its columns/variables with a summary statistic. Examples of summary statistics include the median and interquartile range
- group_by() its rows. In other words, assign different rows to be part of the same group. We can then combine group_by() with summarize() to report summary statistics for each group separately. For example, say you don’t want a single overall average departure delay dep_delay for all three origin airports combined, but rather three separate average departure delays, one computed for each of the three origin airports
- mutate() its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius
- arrange() its rows. For example, sort the rows of weather in ascending or descending order of temp
- join() it with another data frame by matching along a “key” variable. In other words, merge these two data frames together
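Two of these verbs, mutate() and arrange(), can be sketched on a toy data frame; the values are hypothetical stand-ins for the weather data (assumes dplyr and tibble are installed):

```r
library(dplyr)
library(tibble)

# Toy stand-in for the weather data (hypothetical Fahrenheit readings).
toy_weather <- tibble(day = 1:3, temp = c(32, 68, 50))

wrangled <- toy_weather %>%
  mutate(temp_c = (temp - 32) * 5 / 9) %>%  # mutate(): add a Celsius column
  arrange(desc(temp_c))                     # arrange(): warmest day first
wrangled
```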
The pipe operator: %>%
The pipe operator allows us to combine multiple operations in R into a single sequential chain of actions
Let’s start with a hypothetical example. Say you would like to perform a hypothetical sequence of operations on a hypothetical data frame x using hypothetical functions f(), g(), and h():
Take x then :
- Use x as an input to a function f() then
- Use the output of f(x) as an input to a function g() then
- Use the output of g(f(x)) as an input to a function h()
One way to achieve this sequence of operations is by using nesting parentheses as follows:
h(g(f(x)))
For example, you can obtain the same output as the hypothetical sequence of functions as follows:
x %>%
  f() %>%
  g() %>%
  h()
You would read this sequence as:
Take x then :
- Use this output as the input to the next function f() then
- Use this output as the input to the next function g() then
- Use this output as the input to the next function h()
Much like when adding layers to a ggplot() using the + sign, you form a single chain of data wrangling operations by combining verb-named functions into a single sequence using the pipe operator %>%. Furthermore, much like how the + sign has to come at the end of lines when constructing plots, the pipe operator %>% has to come at the end of lines as well
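A concrete instance of the hypothetical h(g(f(x))) chain, with base R functions standing in for f(), g(), and h() (my own choices for illustration; %>% comes from the magrittr package, which dplyr loads for you):

```r
library(magrittr)  # provides %>%

x <- c(4, 9, 16)
result <- x %>%
  sqrt() %>%  # f(): 2, 3, 4
  sum() %>%   # g(): 9
  log10()     # h(): log10(9)
result
```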
------------------------------------------------------------------------------------------------
filter rows :
The filter() function here works much like the "Filter" option in Microsoft Excel; it allows you to specify criteria about the values of a variable in your dataset and then keeps only the rows that match those criteria
portland_flights <- flights %>%
  filter(dest == "PDX")
View(portland_flights)
Note the order of the code. First, take the flights data frame flights then filter() the data frame so that only those where the dest equals "PDX" are included
You can use other operators beyond just the == operator that tests for equality:
> corresponds to “greater than”
< corresponds to “less than”
>= corresponds to “greater than or equal to”
<= corresponds to “less than or equal to”
!= corresponds to “not equal to.” The ! is used in many programming languages to indicate “not.”
Furthermore, you can combine multiple criteria using operators that make comparisons:
| corresponds to “or”
& corresponds to “and”
To see many of these in action, let’s filter flights for all rows that departed from JFK and were heading to Burlington, Vermont ("BTV") or Seattle, Washington ("SEA") and departed in the months of October, November, or December. Run the following:
btv_sea_flights_fall <- flights %>%
  filter(origin == "JFK" & (dest == "BTV" | dest == "SEA") & month >= 10)
View(btv_sea_flights_fall)
We can often skip the use of & and just separate our conditions with a comma. The previous code will return the identical output btv_sea_flights_fall as the following code:
btv_sea_flights_fall <- flights %>%
  filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10)
View(btv_sea_flights_fall)
Let’s present another example that uses the ! “not” operator to pick rows that don’t match a criteria. As mentioned earlier, the ! can be read as “not.” Here we are filtering rows corresponding to flights that didn’t go to Burlington, VT or Seattle, WA.
not_BTV_SEA <- flights %>%
  filter(!(dest == "BTV" | dest == "SEA"))
View(not_BTV_SEA)
Now say we have a larger number of airports we want to filter for, say "SEA", "SFO", "PDX", "BTV", and "BDL". We could continue to use the | (or) operator:
many_airports <- flights %>%
  filter(dest == "SEA" | dest == "SFO" | dest == "PDX" |
         dest == "BTV" | dest == "BDL")
but as we progressively include more airports, this will get unwieldy to write. A slightly shorter approach uses the %in% operator along with the c() function. Recall from Subsection 1.2.1 that the c() function “combines” or “concatenates” values into a single vector of values.
many_airports <- flights %>%
  filter(dest %in% c("SEA", "SFO", "PDX", "BTV", "BDL"))
View(many_airports)
What this code is doing is filtering flights for all flights where dest is in the vector of airports c("SEA", "SFO", "PDX", "BTV", "BDL"). Both versions of many_airports produce the same output, but the %in% version takes much less effort to write. The %in% operator checks, for each value in one vector/variable, whether it appears anywhere in another vector.
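The behavior of %in% can be seen on small vectors in base R, independent of dplyr (the destination codes below are just illustrative values):

```r
# %in% returns a logical vector: TRUE where each element of the
# left-hand vector is found anywhere in the right-hand vector
dests <- c("SEA", "ORD", "BTV", "ATL")
dests %in% c("SEA", "SFO", "PDX", "BTV", "BDL")
#> [1]  TRUE FALSE  TRUE FALSE
```

Inside filter(), that logical vector is what decides which rows are kept.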
------------------------------------------------------------------------------------------------
summarize variables :
The next common task when working with data frames is to compute summary statistics. Summary statistics are single numerical values that summarize a large number of values
summary_temp <- weather %>%
summarize(mean = mean(temp), std_dev = sd(temp))
summary_temp
# A tibble: 1 × 2
mean std_dev
<dbl> <dbl>
1 NA NA
NA is how R encodes missing values where NA indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, NA is stored instead
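This propagation of NA is easy to verify on a small vector, with no packages needed:

```r
temps <- c(50, 60, NA)
mean(temps)               # NA: a single missing value makes the mean NA
mean(temps, na.rm = TRUE) # 55: the NA is dropped before averaging
```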
Going back to our summary_temp output, by default any time you try to calculate a summary statistic of a variable that has one or more NA missing values in R, NA is returned. To work around this fact, you can set the na.rm argument to TRUE, where rm is short for “remove”; this will ignore any NA missing values and only return the summary value for all non-missing values
The code that follows computes the mean and standard deviation of all non-missing values of temp:
summary_temp <- weather %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_temp
# A tibble: 1 × 2
mean std_dev
<dbl> <dbl>
1 55.3 17.8
A summary function is a function in R that takes many values and returns just one. Here are just a few:
mean(): the average
sd(): the standard deviation, which is a measure of spread
min() and max(): the minimum and maximum values, respectively
IQR(): interquartile range
sum(): the total amount when adding multiple numbers
n(): a count of the number of rows in each group. This particular summary function will make more sense when combined with group_by()
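All of these except n() (which only works inside dplyr verbs like summarize()) can be tried directly on a numeric vector; the values below are arbitrary:

```r
x <- c(3, 1, 4, 1, 5, 9)
mean(x)          # the average
sd(x)            # the standard deviation
min(x); max(x)   # 1 and 9
IQR(x)           # interquartile range
sum(x)           # 23
```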
------------------------------------------------------------------------------------------------
group_by rows :
Say instead of a single mean temperature for the whole year, you would like 12 mean temperatures, one for each of the 12 months separately. In other words, we would like to compute the mean temperature split by month. We can do this by “grouping” temperature observations by the values of another variable, in this case by the 12 values of the variable month. Run the following code:
summary_monthly_temp <- weather %>%
group_by(month) %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp
# A tibble: 12 × 3
month mean std_dev
<int> <dbl> <dbl>
1 1 35.6 10.2
2 2 34.3 6.98
3 3 39.9 6.25
4 4 51.7 8.79
5 5 61.8 9.68
6 6 72.2 7.55
7 7 80.1 7.12
8 8 74.5 5.19
9 9 67.4 8.47
10 10 60.1 8.85
11 11 45.0 10.4
12 12 38.4 9.98
------------------------------------------------------------------------------------------------
Grouping by more than one variable :
You are not limited to grouping by one variable. Say you want to know the number of flights leaving each of the three New York City airports for each month. We can also group by a second variable month using group_by(origin, month):
by_origin_monthly <- flights %>%
group_by(origin, month) %>%
summarize(count = n())
by_origin_monthly
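dplyr's count() function is a common shorthand for this group_by() + summarize(n()) pattern; it takes the same grouping variables and stores the count in a column named n by default:

```r
library(dplyr)
library(nycflights13)

# Equivalent to: flights %>% group_by(origin, month) %>% summarize(count = n()),
# except the count column is named n
flights %>% count(origin, month)
```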
------------------------------------------------------------------------------------------------
mutate existing variables :
Another common transformation of data is to create/compute new variables based on existing ones. For example, say you are more comfortable thinking of temperature in degrees Celsius (°C) instead of degrees Fahrenheit (°F).
weather <- weather %>%
mutate(temp_in_C = (temp - 32) / 1.8)
Let’s now compute monthly average temperatures in both °F and °C using the group_by() and summarize() code we saw in Section 3.4:
summary_monthly_temp <- weather %>%
group_by(month) %>%
summarize(mean_temp_in_F = mean(temp, na.rm = TRUE),
mean_temp_in_C = mean(temp_in_C, na.rm = TRUE))
summary_monthly_temp
flights <- flights %>%
mutate(gain = dep_delay - arr_delay)
Let’s look at some summary statistics of the gain variable by considering multiple summary functions at once in the same summarize() code:
gain_summary <- flights %>%
summarize(
min = min(gain, na.rm = TRUE),
q1 = quantile(gain, 0.25, na.rm = TRUE),
median = quantile(gain, 0.5, na.rm = TRUE),
q3 = quantile(gain, 0.75, na.rm = TRUE),
max = max(gain, na.rm = TRUE),
mean = mean(gain, na.rm = TRUE),
sd = sd(gain, na.rm = TRUE),
missing = sum(is.na(gain))
)
gain_summary
To close out our discussion on the mutate() function to create new variables, note that we can create multiple new variables at once in the same mutate() code :
flights <- flights %>%
mutate(
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
------------------------------------------------------------------------------------------------
arrange and sort rows :
By default, arrange() returns rows sorted in ascending order; to sort in descending order, wrap the variable in desc():
freq_dest %>%
arrange(desc(num_flights))
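The freq_dest data frame is assumed here to be a frequency table of flights per destination, like the one built below; this self-contained sketch shows both the default ascending sort and the descending sort:

```r
library(dplyr)
library(nycflights13)

# Build a frequency table: number of flights per destination
freq_dest <- flights %>%
  group_by(dest) %>%
  summarize(num_flights = n())

freq_dest %>% arrange(num_flights)        # ascending: least-served airports first
freq_dest %>% arrange(desc(num_flights))  # descending: busiest airports first
```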
------------------------------------------------------------------------------------------------
join data frames :
Matching “key” variable names :
In both the flights and airlines data frames, the key variable we want to join/merge/match the rows by has the same name: carrier. Let’s use the inner_join() function to join the two data frames, where the rows will be matched by the variable carrier, and then compare the resulting data frames:
flights_joined <- flights %>%
inner_join(airlines, by = "carrier")
View(flights)
View(flights_joined)
Different “key” variable names :
if you look at both the airports and flights data frames, you’ll find that the airport codes are in variables that have different names. In airports the airport code is in faa, whereas in flights the airport codes are in origin and dest. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure 3.7.
In order to join these two data frames by airport code, our inner_join() operation will use the by = c("dest" = "faa") argument with modified code syntax allowing us to join two data frames where the key variable has a different name:
flights_with_airport_names <- flights %>%
inner_join(airports, by = c("dest" = "faa"))
View(flights_with_airport_names)
Let’s construct the chain of pipe operators %>% that computes the number of flights from NYC to each destination, but also includes information about each destination airport:
named_dests <- flights %>%
group_by(dest) %>%
summarize(num_flights = n()) %>%
arrange(desc(num_flights)) %>%
inner_join(airports, by = c("dest" = "faa")) %>%
rename(airport_name = name)
named_dests
Multiple “key” variables :
Say instead we want to join two data frames by multiple key variables. For example, in Figure 3.7, we see that in order to join the flights and weather data frames, we need more than one key variable: year, month, day, hour, and origin. This is because the combination of these 5 variables act to uniquely identify each observational unit in the weather data frame: hourly weather recordings at each of the 3 NYC airports
We achieve this by specifying a vector of key variables to join by using the c() function
flights_weather_joined <- flights %>%
inner_join(weather, by = c("year", "month", "day", "hour", "origin"))
View(flights_weather_joined)
------------------------------------------------------------------------------------------------
select variables :
We’ve seen that the flights data frame in the nycflights13 package contains 19 different variables. You can identify the names of these 19 variables by running the glimpse() function from the dplyr package:
glimpse(flights)
However, say you only need two of these 19 variables, say carrier and flight. You can select() these two variables:
flights %>%
select(carrier, flight)
Let’s say instead you want to drop, or de-select, certain variables. For example, consider the variable year in the flights data frame. This variable isn’t quite a “variable” because it is always 2013 and hence doesn’t change. Say you want to remove this variable from the data frame. We can deselect year by using the - sign:
flights_no_year <- flights %>% select(-year)
Another way of selecting columns/variables is by specifying a range of columns:
flight_arr_times <- flights %>% select(month:day, arr_time:sched_arr_time)
flight_arr_times
This will select() all columns between month and day, as well as between arr_time and sched_arr_time, and drop the rest
The select() function can also be used to reorder columns when used with the everything() helper function. For example, suppose we want the hour, minute, and time_hour variables to appear immediately after the year, month, and day variables, while not discarding the rest of the variables. In the following code, everything() will pick up all remaining variables:
flights_reorder <- flights %>%
select(year, month, day, hour, minute, time_hour, everything())
glimpse(flights_reorder)
Lastly, the helper functions starts_with(), ends_with(), and contains() can be used to select variables/columns that match those conditions. As examples,
flights %>% select(starts_with("a"))
flights %>% select(ends_with("delay"))
flights %>% select(contains("time"))
------------------------------------------------------------------------------------------------
rename variables :
Another useful function is rename(), which as you may have guessed changes the name of variables. Suppose we want to only focus on dep_time and arr_time, renaming them to departure_time and arrival_time in a new data frame:
flights_time_new <- flights %>%
select(dep_time, arr_time) %>%
rename(departure_time = dep_time, arrival_time = arr_time)
glimpse(flights_time_new)
------------------------------------------------------------------------------------------------
top_n values of a variable :
We can also return the top n values of a variable using the top_n() function
named_dests %>% top_n(n = 10, wt = num_flights)
Let’s further arrange() these results in descending order of num_flights:
named_dests %>%
top_n(n = 10, wt = num_flights) %>%
arrange(desc(num_flights))
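In newer versions of dplyr, top_n() has been superseded by slice_max(), which returns the rows with the largest values already sorted in descending order, so the separate arrange() step becomes unnecessary (named_dests is the data frame built in the join section above):

```r
library(dplyr)

# The 10 destinations with the most flights, sorted descending by num_flights
named_dests %>% slice_max(num_flights, n = 10)
```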
------------------------------------------------------------------------------------------------
nycflights13 :
Data related to all domestic flights departing from one of New York City’s three main airports in 2013; this package contains 5 datasets :
flights: Information on all 336,776 flights
airlines: A table matching airline names and their two-letter International Air Transport Association (IATA) airline codes (also known as carrier codes) for 16 airline companies. For example, “DL” is the two-letter code for Delta
planes: Information about each of the 3,322 physical aircraft used
weather: Hourly meteorological data for each of the three NYC airports
airports: Names, codes, and locations of the 1,458 domestic destinations
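The counts listed above can be sanity-checked after loading the package (assuming nycflights13 is installed):

```r
library(nycflights13)

dim(flights)   # 336776 rows, 19 columns
nrow(planes)   # 3322
nrow(airports) # 1458
```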
------------------------------------------------------------------------------------------------
moderndive :
------------------------------------------------------------------------------------------------
https://moderndive.com/data/dem_score.csv
------------------------------------------------------------------------------------------------
fivethirtyeight : The fivethirtyeight package (Kim, Ismay, and Chunn 2022) provides access to the datasets used in many articles published by the data journalism website, FiveThirtyEight.com
For a complete list of all 129 datasets included in the fivethirtyeight package, check out the package webpage by going to:
https://fivethirtyeight-r.netlify.app/articles/fivethirtyeight.html
------------------------------------------------------------------------------------------------
R is case-sensitive, and spaces cannot be used in object or variable names
Everything typed in the console runs immediately but is lost when the working session ends
To keep what I define, I use a script, created as a new script from New File - R Script
When working in a script, the code only runs when I click Run, and the result appears in the console
A script normally has several lines, so to run them I select them and click Run (Ctrl-Enter); only the selected lines are executed, lines outside the selection are not. The result is shown in the console
Variables in R are called objects; the assignment operator is <- , so to assign the value 5 to x I would write : x <- 5
When an assignment is made, the variable and its assigned value appear in the Environment pane
Objects are not only variables; they are really any entity that can store something. In fact, an Excel file or a database table can be stored in an object.
For example, to store an Excel sheet in an object named Calcuta : Calcuta <- read_xlsx("nombre_archivo.xlsx") , and the object immediately becomes available in the Environment pane
The main R packages to start any analysis are :
ggplot2, readxl, tidyr, dplyr
They normally do not come pre-installed in RStudio, so they must be installed manually from the Install option in the Packages pane
They can also be installed manually from the console with the command :
install.packages("ggplot2")
install.packages("readxl")
install.packages("tidyr")
install.packages("dplyr")
Once the packages are installed, they must be activated with the checkbox in the Packages pane; they can also be activated in code with :
library(ggplot2)
library(readxl)
library(tidyr)
library(dplyr)
https://moderndive.netlify.app/1-getting-started.html
Typing ? in the console before a function, object, or command opens the help for that command, object, or function : ?command
R for Data Science :
https://r4ds.had.co.nz/data-visualisation.html
Very important: for each session, set the Working Directory via Session - Set Working Directory - Choose Directory
Identifying the Distribution of the Data :
https://statisticsbyjim.com/hypothesis-testing/identify-distribution-data/#:~:text=Using%20Probability%20Plots%20to%20Identify,the%20distribution%20fits%20your%20data.