Math 321 Statistics for Experimentalists Spring 2024: R Projects

Either create a Posit Cloud account or install RStudio on your computer.

1: Intro to R and data

  1. Create a new RStudio project. An R console should occupy the left side of the page.
  2. R is basically a calculator, but one that specializes in vectors and data frames. Try enough of the following commands that you can tell what the rest do. Some of these commands will create vectors (hint: create starts with c); an annotated sketch of a few of them follows the list.
    • > 2+3
    • > 2*3
    • > 2^3
    • > x
    • > y
    • > x = c(1, 1, 2, 3, 5, 8, 13)
    • > y <- c(7:1)
    • > x
    • > mean(x)
    • > mean(y)
    • > y
    • > sort(y)
    • > median(y)
    • > median(x)
    • > quantile(y)
    • > quantile(x)
    • > x+1
    • > x+y
    • > 2*y
    • > x*y
    • > x < y
    • > x[4]
    • > x[4:6]
    • > x[x < y]
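    For reference, here is an annotated sketch of a few of these commands (the expected output is noted in the comments):
      > x = c(1, 1, 2, 3, 5, 8, 13)  # c() combines values into a vector
      > y <- 7:1                     # the sequence 7 6 5 4 3 2 1 (= and <- both assign)
      > x + y                        # elementwise addition: 8 7 7 7 8 10 14
      > x < y                        # elementwise comparison: TRUE TRUE TRUE TRUE FALSE FALSE FALSE
      > x[x < y]                     # the entries of x where the comparison is TRUE: 1 1 2 3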
  3. Try the following commands:
    • > 1:100
    • > sample(1:100, size=20)
    • > x=sample(1:100, size=20)
    • > x
    • > sort(x)
    • > summary(x)
    • > stem(x, scale=2)
    • > stem(x)
    • > hist(x)
    • > hist(x, breaks=10)
    • > hist(x, breaks=10, freq=F); lines(density(x))
    • > boxplot(x)
  4. Pause to make sure all this makes sense. Ask questions, then try some more:
    • > y=sample(1:100, size=20, replace=T)
    • Repeat the interesting parts of the code above but with y in place of x. How are x and y different? (A quick check is sketched after this list.)
    • > boxplot(x,y)
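    A quick check on the effect of replace=T (table() counts how many times each value appears):
      > table(x)  # sampled without replacement, so every count is 1
      > table(y)  # sampled with replacement, so some counts may exceed 1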
  5. Write a brief summary (at most a paragraph) of what all this did.

2: Loading data and numerical v. categorical data

  1. Download the current USGS data for all earthquakes in the past 30 days in Washington. Pay attention to where you save this file.
  2. Upload the data to Posit Cloud: click "Files" in the bottom-right pane, then "Upload" and select all_month.csv.
  3. Load the data into R: > quakes=read.csv("all_month.csv", header=T). This gives us a data frame with all the quake data.
  4. > str(quakes) tells you the structure of the data. How many earthquakes were there? How many variables are there?
  5. The magnitudes of the earthquakes are in the column mag. The following commands give you different views of the data:
    • > summary(quakes$mag) (a 5-statistic summary)
    • > boxplot(quakes$mag) (a box-and-whisker plot; the box covers the IQR and the whiskers extend to the most extreme data points within 1.5 × IQR of the box)
    • > hist(quakes$mag) (an absolute frequency histogram)
    • > hist(quakes$mag, freq=F) (a relative frequency histogram)
    • > plot(density(quakes$mag)) (an estimated density curve--like a smooth histogram)
    • > hist(quakes$mag, freq=F); lines(density(quakes$mag)). You may need to tell R to skip missing values: lines(density(quakes$mag, na.rm=T))
  6. The number of seismic stations used to determine the earthquake's location is in the column nst. This is numerical data, but it is discrete, not continuous. Try:
    • > boxplot(quakes$nst)
    • > hist(quakes$nst)
    • > barplot(table(quakes$nst))
    Do you have a preference?
  7. What goes wrong with > barplot(table(quakes$mag))?
  8. The source of the seismic activity is in the column type. Explain what happens with the following commands (another way to summarize categorical data is sketched after the list):
    • > barplot(table(quakes$type))
    • > hist(quakes$type)
    • > boxplot(quakes$type)
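    If you get stuck, note that counting is the natural summary for categorical data; a minimal sketch using base R:
      > table(quakes$type)              # how many quakes of each type
      > prop.table(table(quakes$type))  # the same counts as proportions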
  9. Sometimes we'll look at more than one variable at a time. > plot(quakes$mag, quakes$latitude) and > plot(quakes$mag, quakes$nst) both look interesting; what do you see?

3: Estimation and the Central Limit Theorem (optional)

  1. Consider a random sample of size 16 from an exponentially distributed population with mean 4/3 and variance 16/9. This sample is small enough that the CLT may not provide a good estimate for the distribution of the sample mean, but we're going to try anyway: use a normal approximation to estimate the probability that the sample mean is less than 4/3 and the probability that it is less than 2. The R command pnorm(x) is the CDF of the standard normal distribution.
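    One way to set up the approximation: the sample mean has standard error sqrt(16/9)/sqrt(16) = 1/3, so standardize and use pnorm:
      > pnorm((4/3 - 4/3)/(1/3))  # = pnorm(0) = 0.5
      > pnorm((2 - 4/3)/(1/3))    # = pnorm(2), about 0.977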
  2. We can investigate these estimates by getting R to simulate this sampling process. The following code generates a vector of 100 sample means for independent samples of size 16 from a population with mean 4/3 and variance 16/9 (an equivalent one-line version appears after the loop).
    • > x=c()
    • > for(n in c(1:100)){x=c(x,mean(rexp(16,3/4)))}
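    As an aside, growing a vector inside a loop works but is slow at larger scales; replicate() is an equivalent one-liner:
      > x = replicate(100, mean(rexp(16, rate=3/4)))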
  3. Investigate the results:
    • Produce a histogram with a density estimate: > hist(x, freq=F, xlim=c(0,3), ylim=c(0,4), col=rgb(.1,.1,.1,.5)); lines(density(x))
    • Find the proportion of sample means less than 4/3: > mean(x<4/3)
    • Find the proportion of sample means less than 2: > mean(x<2)
  4. Repeat everything (including the CDF estimates), but with a random sample of size 64:
    • > y=c()
    • > for(n in c(1:100)){y=c(y,mean(rexp(64,3/4)))}
    • Add a histogram with a density estimate to our plot: > hist(y, freq=F, col=rgb(.1,.1,.1,.1), add=T); lines(density(y))
    • Find the proportion of sample means less than 4/3: > mean(y<4/3)
    • Find the proportion of sample means less than 2: > mean(y<2)
  5. What do you see happening? Compare and contrast the results of these two sampling processes. If it is helpful, run the code again, possibly with different sample sizes.

4: Hypothesis tests (due 4/12)

  1. Guide to summarizing hypothesis tests
  2. Download heights.csv and load the data set into R:
    • In RStudio you can use the menus: File > Import Dataset > From CSV...
    • On Posit Cloud you'll first need to upload the file heights.csv. Find the Files pane in the bottom-right quadrant and use the Upload button there. After that, load the data set into R using > heights=read.csv("heights.csv")
    • See what you have: > head(heights), > View(heights), and > summary(heights) all show you the data (in different ways)
    • The heights data is stored as a data frame. You can access columns by name using commands like > heights$child
  3. Google says that the mean height of an American male is 69.7 inches. The height data you just loaded was collected from a random sample of Gonzaga students a couple of years ago. We can test the hypothesis that the mean height of a male GU student is 69.7 inches using this data:
    • Extract the heights of male students from the data frame: > m=heights$child[heights$sex=="M"]
    • You can now compute a test statistic manually using built-in functions for the mean > mean(m) and standard deviation > sd(m) and basic math operations: > (mean(m) - 69.7)/(sd(m)/sqrt(length(m)))
    • Calculate the p-value of your test statistic (use > length(m) to find the sample size; a sketch of this calculation appears after this list) and write a summary of your finding: include null and alternative hypotheses, the value of the test statistic, the p-value, and your conclusion
    • R has built-in hypothesis tests, which you can use to check your work: > t.test(m, mu=69.7). Does this match your test?
    • Try the same test with the heights of fathers of students (this time just use the built-in test): > t.test(heights$father, mu=69.7)
    • Write another summary of the result (follow the same guidelines as above)
    • Would it be reasonable to treat the heights of the male students and the heights of fathers as one random sample (with a larger sample size)? Explain your reasoning.
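    For the manual p-value above, a sketch assuming a two-sided alternative (which matches t.test's default):
      > t = (mean(m) - 69.7)/(sd(m)/sqrt(length(m)))  # the test statistic computed above
      > 2*pt(-abs(t), df=length(m) - 1)               # two-sided p-value from the t distribution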
  4. Are male students on average taller than female students? Use the height data to answer this question.
    • What are your null and alternative hypotheses?
    • Extract the heights of female students: > f=heights$child[heights$sex=="F"]
    • One of the following should test your hypotheses; which is most correct?
      • > t.test(m,f, alternative="greater")
      • > t.test(m,f, alternative="less")
      • > t.test(m,f, alternative="greater", var.equal=T)
      • > t.test(m,f, alternative="less", var.equal=T)
    • > var.test(m,f) tests the null hypothesis that the variances are equal against the alternative that the variances are not equal
    • Summarize your test results (state null and alternative hypotheses, the test statistic, the p-value, and your conclusion)
    • Google tells me that the difference in mean heights should be about 6 inches. Test this hypothesis (you'll need to tell R what your null hypothesis is using mu=6) and summarize your results
  5. Are sons taller than their fathers on average?
    • The hypothesis test comparing heights of males and females required independent random samples from the two populations in question. Would this be a reasonable assumption for a hypothesis test comparing the heights of male students with the heights of fathers in the sample?
    • How might you test to see if male students are, on average, taller than their fathers?
    • See if you can implement your idea in R
  6. The t tests you've been using are most accurate when the populations are normally distributed. Heights and other biological measurements are often assumed to be normal, but we can check this assumption using a Q-Q plot. In this case, we'll use qqnorm(), which plots the sample quantiles against the theoretical quantiles of a normal distribution. Data that comes from a normal distribution should fall roughly on a straight line. A consistent pattern of departures from a straight line suggests the data might not be normally distributed. Some nice examples are here (pdf).
    • Run > qqnorm(m), > qqnorm(f), > qqnorm(heights$mother), and > qqnorm(heights$father) and evaluate the normality assumption.
    • > qqline([data name], col=4) adds R's best guess at a line to your Q-Q plot (in color 4, for contrast).
    • Do you see anything that looks like a problem?
  7. I found a claim that on average you shouldn't need to wait more than an hour to see the Old Faithful geyser erupt. R has a built-in data set that includes waiting time between Old Faithful eruptions: faithful$waiting. Use this data set to test the claim. Write a summary of your findings. Include the usual statement of hypotheses, your test statistic, the p-value, and your conclusion. Also address any assumptions (e.g. of normality), possibly using Q-Q plots. Challenge: give a 95% confidence lower bound for the wait time.
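    A hint for the challenge: a one-sided t.test reports a one-sided confidence interval, so the left endpoint of
      > t.test(faithful$waiting, alternative="greater")$conf.int
    is a 95% lower confidence bound for the mean wait time.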

5: ANOVA and linear regression (due 4/22)

  1. Use concrete.csv, which contains data on moisture absorption by concrete with 5 different levels of aggregate in the mixture (with 6 samples of each type of concrete). The following commands get R to do all the calculations for an ANOVA test based on this data.
    • First look at the data: View(concrete)
    • Tell R to do the ANOVA calculations and store the results: concrete.aov=aov(AbsorbtionWeight~as.factor(Aggregate), data=concrete). The command as.factor tells R to group the samples according to the Aggregate number; why is this important?
    • Get a summary: anova(concrete.aov)
    • State your null and alternative hypotheses clearly and draw a conclusion
  2. Use echidna.csv, which gives data on echidna hibernation. Your goal is to investigate the echidna data to see if there's a relationship between the duration of hibernation and the emergence mass:
    • Download echidna.csv and load the data into R
    • The variable names might be hard to work with. You can rename them: names(echidna)=c("name1", "name2", "name3") (fill in names you'll be able to connect to the data, e.g. em for emergence mass; c() creates a vector as usual; keep the quotation marks for your names).
    • Plot the two variables: plot(name1~name3, data=echidna, col="blue", main="Echidna mass (kg) as a function of hibernation duration", xlab="hibernation duration (days)", ylab="emergence mass (kg)") (here ~ tells R to think of the first variable as a function of the third).
    • Fit a linear model to the data: echidna.model=lm(name1~name3, data=echidna).
    • Add the least squares line to the plot: abline(echidna.model).
    • Get a summary of the model: summary(echidna.model).
    • Predict the emergence mass of an echidna that hibernates for 150 days (one way to do this by hand is sketched below).
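    One way to make this prediction by hand, assuming you kept the placeholder names name1 and name3 (coef() extracts the fitted coefficients):
      > b = coef(echidna.model)  # b[1] is the intercept, b[2] is the slope
      > b[1] + b[2]*150          # predicted emergence mass (kg) for 150 days of hibernation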
  3. Continue working with the echidna data
    • There's nothing in the mathematics that says we have to think of emergence mass as a function of hibernation duration. In fact, it's much easier to measure emergence mass than hibernation duration, so we might want to use the emergence mass in an estimate of hibernation duration. Swap the roles of the two variables from the previous problem, fit a new model (echidna.model2), and predict the hibernation duration of an echidna that has an emergence mass of 4 kg.
    • Check your echidna models to see if the residuals are obviously not normally distributed (use plot(echidna.model) and analyze the first two plots). What are the p-values for the slope terms and what do they tell you about the relationship between hibernation duration and emergence mass? Is the correlation positive or negative?
  4. Use the mtcars data set built into R: > ?mtcars will explain the data set.
    • I'd like to think of mpg as the dependent variable and assess its connection with cyl, gear, and qsec. Some of these independent variables are categorical: use ANOVA to analyze them. Other variables are continuous, so regression is appropriate. (A starting point is sketched after this list.)
    • Comment on the validity of the models.
    • Summarize your findings.
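    A hedged starting point (cyl and gear are stored as numbers, so as.factor() matters here, as it did for the concrete data; qsec is continuous):
      > anova(aov(mpg~as.factor(cyl), data=mtcars))  # ANOVA for a categorical input
      > summary(lm(mpg~qsec, data=mtcars))           # regression for a continuous input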

6: Regression II (due 5/1)

  1. Opah!
    • Load this data on opah into R: opah.csv.
    • Use summary to see what you have (fork length is a specific way of measuring the length of the fish).
    • The column names are inconveniently long, but you can rename as you did with the echidna data.
    • Plot mass (y-axis) against fork length (x-axis) and add the least squares line. Does it look like a good fit?
    • Assess the validity of the linear model using (at least) the plot of Residuals vs Fitted and the Normal Q-Q plot.
    • In situations like this a log-log transform sometimes helps: plot the logarithm of the mass against the logarithm of fork length and add a least squares line to this plot. Does the model fit better? Assess the plots of Residuals vs Fitted and Normal Q-Q. (A sketch with hypothetical column names follows this list.)
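    A minimal sketch of the log-log fit, assuming you renamed the relevant columns mass and flen (your names may differ):
      > plot(log(mass)~log(flen), data=opah)             # log-log scatterplot
      > opah.loglog = lm(log(mass)~log(flen), data=opah)
      > abline(opah.loglog)                              # least squares line on the log-log scale
      > plot(opah.loglog)                                # residual diagnostics, as before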
  2. Revisiting echidna regression
    • R has a built-in way to get fitted values from linear models. Compare the results of the following with your prediction of the hibernation duration of an echidna with emergence mass of 4 kg: predict(echidna.model2, newdata=data.frame([name]=c(4.0)), interval="prediction") (replace [name] with the name you gave the emergence mass variable).
    • What is the difference between the prediction interval in the last part and the confidence interval this generates: predict(echidna.model2, newdata=data.frame([name]=c(4.0)), interval="confidence")?
    • Adding more inputs can make our predictions more accurate. Fit a least squares line for the echidna using lm([hibernation length]~[jan/feb mass]+[emergence mass], data=echidna) (replace [names] with the corresponding names you gave the variables).
    • Assess the validity of this model using plot().
    • Predict the hibernation duration of an echidna with a Jan/Feb mass of 4 kg and an emergence mass of 3.75 kg. If you want to use the built-in prediction function, you'll need to give it newdata=data.frame([jan/feb mass]=c(4), [emergence mass]=c(3.75)).
  3. Using ggplot2. Note: I used em and hlen for the emergence mass and hibernation length of echidnas, respectively. You'll need to replace these in the code below with the names you used.
    • Install the ggplot2 package: install.packages("ggplot2") (this may take a few minutes). This package has more sophisticated plotting tools, but it's more complicated.
    • Load the ggplot2 library: library("ggplot2")
    • Make a first plot: ggplot(echidna, aes(em,hlen))+geom_point()+stat_smooth(method=lm)
    • Add the predictions to the echidna data: echidna.plus=cbind(echidna, predict(echidna.model2, interval="prediction"))
    • Start making a new plot: p=ggplot(echidna.plus, aes(em,hlen))+geom_point()+stat_smooth(method=lm). You can enter p anytime you want to see the plot.
    • Add the 95% prediction interval: p+geom_line(aes(y=lwr), color=2, linetype="dashed")+geom_line(aes(y=upr), color=2, linetype="dashed")
  4. (Optional) Try to make a version of the plot you just made, but for the log-log Opah data.

Logan Axon
Department of Mathematics
MSC 2615
Gonzaga University
Spokane, WA 99258
Office: Herak 227A
Phone: 509.313.3897
Email: axon@gonzaga.edu

Last updated 4/23/2024