lluang · May 24, 2020 00:52
diff --git a/SIRSimulationModel.Rmd b/SIRSimulationModel.Rmd
 ---
 title: 'S-I-R Simulation Models'
 author: "IE 0015 Information Systems"
 date: "Spring, 2020"
 output:
  pdf_document: default
  html_document:
    df_print: paged
 ---

 This was assigned as a homework assignment in my "Information Systems Engineering" course for undergraduate engineering in the Department of Industrial Engineering. The course is taught as a data analysis course, and this was a large assignment that went with a module on simulation models.  The students had already been working with COVID-19 data for data analysis examples and labs throughout the semester (we started when COVID-19 was something you heard about in the news about China), so they had many such assignments.

 ```{r loadlibraries, echo=FALSE, message=FALSE, warning=FALSE}
 library(dplyr)
 library(tidyr)
 library(magrittr)
 library(ggplot2)
 ```

 The standard model of spread of infectious disease spread is the S-I-R compartmental model: susceptible, infected, removed.  The population is divided into three compartments, and the model tracks the change in the size of each compartment over time.  The standard depiction of the model is as a system of differential equations.

 $$dS = -\beta *S*I$$
 $$dI =  \beta*S*I - \nu*I$$
 $$dR = \nu*I $$

 Where:

 - $\beta$ is the product of the contact per unit time and the probability of transmission. It is the product of the probability of contact of an infected individual and a susceptible individual and the probability that a contact results in the susceptible person being infected.
 - $\nu$ is the recovery rate 
 - S is the number of susceptible individuals. At the beginning this is the entire population. dS is the change in number of suspectible individuals in a given unit time
 - I is the number of infected individuals. Typically, in a model the initial value is 1. dI is the change in number of infected individuals in a given unit time
 - dR is the change in number of removed individuals in a given unit time.  This can either be due to recovery with future immunity, or through death of the individual.

 This homework assignment will develop this model as a simulation.  We will look at an artificial population of 1000 people, with one being infected.

 ```{r setup}
 nperiods = 365
 population = 1000
 averagecontacts = 15
 pinfect = 0.02
 recoveryrate = 1/14 # 1 / average recovery time
 ```

 1.  [5] Calculate the probability that two individuals meet. Save it to a variable named `pcontact`.

 ```{r}

 pcontact <- 0  # calculate pcontact here
 ```


 ```{r}
 pcontact = averagecontacts/population
 ```


 2. [10] This assignment will model the spread of the disease as a mixing model, meaning that every individual has an equal probability of being in contact with any other individual.  For every time period (a day) write a function named `checkinfection(nS, nI)` that determines given the the starting number of susceptible and infected: 
  (1)  how many instances an infected individual is in contact with a susceptible individual
  (2)  How many of those contacts result in infection.
  (3)  Run this code for the first day (I0 = 1, S0 = 999, R0= 0)
  

 ```{r}
 I0 = 1
 S0 = population - I0
 R0 = 0

 checkinfection <- function(nS, nI){
  newlyinfected <- c()
  if( (nS > 0) & (nI > 0)){
    checkcontacted <- matrix(runif(nS*nI), ncol=nI)
    contacts <- ifelse(checkcontacted < pcontact, 1,0)
    checkinfected <- matrix(runif(nS*nI), ncol=nI)
    infectedifcontacted <- ifelse(checkinfected < pinfect, 1, 0)
    infected <- contacts * infectedifcontacted
    infectednew <- ifelse(rowSums(infected)>=1,1,0) #you can only get infected once
    newlyinfected <- c(newlyinfected, sum(infectednew))

    dI <- min(sum(newlyinfected), nS)
  } else {
    dI <- 0
  }
  dI
 }

 dI = checkinfection(S0, I0)
 I1 = I0 + dI
 s1 = S0 - dI

 print(c(I1, s1))
 ```


 3. [15] For those who are infected, assume that in each time period, the probability of recovery is 1 / (average recovery time).  Write a function named `checkrecover(nI)` that calculates the number of recovered given the number of infected.  Run the function with 100 infected.

 1. Similar to the infected function. Check each infected person individually in each time period using either vectorized functions or a for loop. (Do not use number recovered = number infected * recoveryrate)

 ```{r}
 Istarting = 100

 checkrecover <- function(nI){
  
  checkrecovered <- runif(nI)
  #print(checkrecovered)
  recovered = ifelse(checkrecovered < recoveryrate, 1,0)
  #print(recovered)
  dr <- sum(recovered)
 }

 dr <- checkrecover(Istarting)
 print(dr)
 ```

 4. [10] Simulate one year.
    a. Use the functions you just created to run a simulation of one year with a starting population of 1000 (S0=999, I0 = 1, R0 = 0). You will need to run 365 days of simulation and save each day's S, I and R (hint: create a vector for each).  The output of the simulation will be the maximum number infected and the number of days that the number of infected is greater than **20**.  
    b. Plot Susceptible, Infected, and Recovered vs time (days).

 ```{r}
 #  One day

 onedayepidemic <- function(nS, nI, nR){
  dI <- checkinfection(nS, nI)
  dR <- checkrecover(nI)
  
  S1 = nS - dI
  I1 = nI + dI - dR
  R1 = nR + dR
  return = c(S1, I1, R1)
 }

 oneyearepidemic <- function(S0 = 999, I0 = 1, R0 = 0){
  oneyear <- 365
  nS = c(S0)
  nI = c(I0)
  nR = c(R0)
  for (day in 1:(oneyear-1)){
    nextday <- onedayepidemic(nS[day], nI[day], nR[day])
    nS <- c(nS, nextday[1])
    nI <- c(nI, nextday[2])
    nR <- c(nR, nextday[3])
  }
  
  simulationresults <- data.frame(day = 1:oneyear,
                                  S = nS,
                                  I = nI,
                                  R = nR
                                  )
  simulationresults
 }
 ```

 ```{r}
 oneyearresults <- oneyearepidemic(S0, I0, R0)
 ggplot(oneyearresults) + 
  geom_point(aes(day, S), color = "green") +
  geom_point(aes(day, I), color = "red") +
  geom_point(aes(day, R), color = "blue")
 ```

 ```{r}
 infectedstatistic <- oneyearresults %>% 
  mutate(isinfected = (I>=20)) %>%
  filter(isinfected == TRUE) %>%
  group_by(isinfected) %>%
  summarize(peak=max(I),
         startday = min(day),
         endday = max(day)) %>%
  mutate(duration = (endday - startday))
  
 infectedstatistic
 ```


 5. [10] Now run 100 replications (each replication is 365 days). 
    a.  Represent all 100 replications on a single graph (note: there are many possible answers).  Keep in mind that from the representation of any single replication, it should be possible to determine the maximimum infected and the number of days the number of infected is greater than 20, but your final graph you are more interested in the distributions of the outputs, not the individual values.
    b. Repeat 100 replications, but change the average number of contacts per person to 10 from 15. Discuss the difference in results.

 ```{r}

 epidemicsimulation <- function(S0 = 999, I0 = 1, R0 = 0){
  oneyearresults <- oneyearepidemic(S0, I0, R0)
  infectedstatistic <- oneyearresults %>% 
    mutate(isinfected = (I>=20)) %>%
    filter(isinfected == TRUE) %>%
    group_by(isinfected) %>%
    summarize(peak=max(I),
           startday = min(day),
           endday = max(day)) %>%
    mutate(duration = (endday - startday))
  
  data.frame(infectedstatistic$peak, infectedstatistic$duration)
 }

 infectedresults <- lapply(rep(999, 100), epidemicsimulation)
 epidemicresults <- as.data.frame(do.call(rbind, infectedresults))
 names(epidemicresults) <- c("peak", "duration")
 ```

 ```{r}
 ggplot(epidemicresults) + 
  geom_point(aes(peak, duration)) +
  ggtitle("Epidemic peak and duration for 100 simulations") +
  xlim(1,750) + ylim(1,150)
 ```
 ```{r}

 averagecontacts = 10
 pcontact = averagecontacts/population
 infectedresults10 <- lapply(rep(999, 100), epidemicsimulation)
 epidemicresults10 <- as.data.frame(do.call(rbind, infectedresults))
 names(epidemicresults10) <- c("peak", "duration")
 ```

 ```{r}
 ggplot(epidemicresults10) + 
  geom_point(aes(peak, duration)) +
  ggtitle("Epidemic peak and duration for 100 simulations where average contacts = 10") +
  xlim(1,750) + ylim(1,150)
 ```
 # Postscript

 In real life, after a simulation like this was used, you would change parameters to reflect potential policy changes.  For example, social distancing changes the variable `averagecontacts`.  Protective equipment such as non-medical masks changes the variable `pinfect`.  Treatment methods changes `recoveryrate`.  And you would see how the results shift as you change any of them. (e.g. reducing `averagecontacts` reduces the peak and increases the duration).
	---
	title: 'S-I-R Simulation Models'
	author: "IE 0015 Information Systems"
	date: "Spring, 2020"
	output:
	pdf_document: default
	html_document:
	df_print: paged
	---

	This was assigned as a homework assignment in my "Information Systems Engineering" course for undergraduate engineering in the Department of Industrial Engineering. The course is taught as a data analysis course, and this was a large assignment that went with a module on simulation models. The students had already been working with COVID-19 data for data analysis examples and labs throughout the semester (we started when COVID-19 was something you heard about in the news about China), so they had many such assignments.

	```{r loadlibraries, echo=FALSE, message=FALSE, warning=FALSE}
	library(dplyr)
	library(tidyr)
	library(magrittr)
	library(ggplot2)
	```

	The standard model of spread of infectious disease spread is the S-I-R compartmental model: susceptible, infected, removed. The population is divided into three compartments, and the model tracks the change in the size of each compartment over time. The standard depiction of the model is as a system of differential equations.

	$$dS = -\beta SI$$
	$$dI = \betaSI - \nu*I$$
	$$dR = \nu*I $$

	Where:

	- $\beta$ is the product of the contact per unit time and the probability of transmission. It is the product of the probability of contact of an infected individual and a susceptible individual and the probability that a contact results in the susceptible person being infected.
	- $\nu$ is the recovery rate
	- S is the number of susceptible individuals. At the beginning this is the entire population. dS is the change in number of suspectible individuals in a given unit time
	- I is the number of infected individuals. Typically, in a model the initial value is 1. dI is the change in number of infected individuals in a given unit time
	- dR is the change in number of removed individuals in a given unit time. This can either be due to recovery with future immunity, or through death of the individual.

	This homework assignment will develop this model as a simulation. We will look at an artificial population of 1000 people, with one being infected.

	```{r setup}
	nperiods = 365
	population = 1000
	averagecontacts = 15
	pinfect = 0.02
	recoveryrate = 1/14 # 1 / average recovery time
	```

	1. [5] Calculate the probability that two individuals meet. Save it to a variable named `pcontact`.

	```{r}

	pcontact <- 0 # calculate pcontact here
	```


	```{r}
	pcontact = averagecontacts/population
	```


	2. [10] This assignment will model the spread of the disease as a mixing model, meaning that every individual has an equal probability of being in contact with any other individual. For every time period (a day) write a function named `checkinfection(nS, nI)` that determines given the the starting number of susceptible and infected:
	(1) how many instances an infected individual is in contact with a susceptible individual
	(2) How many of those contacts result in infection.
	(3) Run this code for the first day (I0 = 1, S0 = 999, R0= 0)


	```{r}
	I0 = 1
	S0 = population - I0
	R0 = 0

	checkinfection <- function(nS, nI){
	newlyinfected <- c()
	if( (nS > 0) & (nI > 0)){
	checkcontacted <- matrix(runif(nS*nI), ncol=nI)
	contacts <- ifelse(checkcontacted < pcontact, 1,0)
	checkinfected <- matrix(runif(nS*nI), ncol=nI)
	infectedifcontacted <- ifelse(checkinfected < pinfect, 1, 0)
	infected <- contacts * infectedifcontacted
	infectednew <- ifelse(rowSums(infected)>=1,1,0) #you can only get infected once
	newlyinfected <- c(newlyinfected, sum(infectednew))

	dI <- min(sum(newlyinfected), nS)
	} else {
	dI <- 0
	}
	dI
	}

	dI = checkinfection(S0, I0)
	I1 = I0 + dI
	s1 = S0 - dI

	print(c(I1, s1))
	```


	3. [15] For those who are infected, assume that in each time period, the probability of recovery is 1 / (average recovery time). Write a function named `checkrecover(nI)` that calculates the number of recovered given the number of infected. Run the function with 100 infected.

	1. Similar to the infected function. Check each infected person individually in each time period using either vectorized functions or a for loop. (Do not use number recovered = number infected * recoveryrate)

	```{r}
	Istarting = 100

	checkrecover <- function(nI){

	checkrecovered <- runif(nI)
	#print(checkrecovered)
	recovered = ifelse(checkrecovered < recoveryrate, 1,0)
	#print(recovered)
	dr <- sum(recovered)
	}

	dr <- checkrecover(Istarting)
	print(dr)
	```

	4. [10] Simulate one year.
	a. Use the functions you just created to run a simulation of one year with a starting population of 1000 (S0=999, I0 = 1, R0 = 0). You will need to run 365 days of simulation and save each day's S, I and R (hint: create a vector for each). The output of the simulation will be the maximum number infected and the number of days that the number of infected is greater than 20.
	b. Plot Susceptible, Infected, and Recovered vs time (days).

	```{r}
	# One day

	onedayepidemic <- function(nS, nI, nR){
	dI <- checkinfection(nS, nI)
	dR <- checkrecover(nI)

	S1 = nS - dI
	I1 = nI + dI - dR
	R1 = nR + dR
	return = c(S1, I1, R1)
	}

	oneyearepidemic <- function(S0 = 999, I0 = 1, R0 = 0){
	oneyear <- 365
	nS = c(S0)
	nI = c(I0)
	nR = c(R0)
	for (day in 1:(oneyear-1)){
	nextday <- onedayepidemic(nS[day], nI[day], nR[day])
	nS <- c(nS, nextday[1])
	nI <- c(nI, nextday[2])
	nR <- c(nR, nextday[3])
	}

	simulationresults <- data.frame(day = 1:oneyear,
	S = nS,
	I = nI,
	R = nR
	)
	simulationresults
	}
	```

	```{r}
	oneyearresults <- oneyearepidemic(S0, I0, R0)
	ggplot(oneyearresults) +
	geom_point(aes(day, S), color = "green") +
	geom_point(aes(day, I), color = "red") +
	geom_point(aes(day, R), color = "blue")
	```

	```{r}
	infectedstatistic <- oneyearresults %>%
	mutate(isinfected = (I>=20)) %>%
	filter(isinfected == TRUE) %>%
	group_by(isinfected) %>%
	summarize(peak=max(I),
	startday = min(day),
	endday = max(day)) %>%
	mutate(duration = (endday - startday))

	infectedstatistic
	```


	5. [10] Now run 100 replications (each replication is 365 days).
	a. Represent all 100 replications on a single graph (note: there are many possible answers). Keep in mind that from the representation of any single replication, it should be possible to determine the maximimum infected and the number of days the number of infected is greater than 20, but your final graph you are more interested in the distributions of the outputs, not the individual values.
	b. Repeat 100 replications, but change the average number of contacts per person to 10 from 15. Discuss the difference in results.

	```{r}

	epidemicsimulation <- function(S0 = 999, I0 = 1, R0 = 0){
	oneyearresults <- oneyearepidemic(S0, I0, R0)
	infectedstatistic <- oneyearresults %>%
	mutate(isinfected = (I>=20)) %>%
	filter(isinfected == TRUE) %>%
	group_by(isinfected) %>%
	summarize(peak=max(I),
	startday = min(day),
	endday = max(day)) %>%
	mutate(duration = (endday - startday))

	data.frame(infectedstatistic$peak, infectedstatistic$duration)
	}

	infectedresults <- lapply(rep(999, 100), epidemicsimulation)
	epidemicresults <- as.data.frame(do.call(rbind, infectedresults))
	names(epidemicresults) <- c("peak", "duration")
	```

	```{r}
	ggplot(epidemicresults) +
	geom_point(aes(peak, duration)) +
	ggtitle("Epidemic peak and duration for 100 simulations") +
	xlim(1,750) + ylim(1,150)
	```
	```{r}

	averagecontacts = 10
	pcontact = averagecontacts/population
	infectedresults10 <- lapply(rep(999, 100), epidemicsimulation)
	epidemicresults10 <- as.data.frame(do.call(rbind, infectedresults))
	names(epidemicresults10) <- c("peak", "duration")
	```

	```{r}
	ggplot(epidemicresults10) +
	geom_point(aes(peak, duration)) +
	ggtitle("Epidemic peak and duration for 100 simulations where average contacts = 10") +
	xlim(1,750) + ylim(1,150)
	```
	# Postscript

	In real life, after a simulation like this was used, you would change parameters to reflect potential policy changes. For example, social distancing changes the variable `averagecontacts`. Protective equipment such as non-medical masks changes the variable `pinfect`. Treatment methods changes `recoveryrate`. And you would see how the results shift as you change any of them. (e.g. reducing `averagecontacts` reduces the peak and increases the duration).