0: Install RStudio on your home computer (optional)
- Download the appropriate version of R from https://ftp.osuosl.org/pub/cran/ and install it.
- It is important to install R first. After you have done that, download RStudio from https://www.rstudio.com/products/rstudio/download/#download and install it.
- Run RStudio and install ggplot2:
> install.packages("ggplot2")
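You can check that the installation worked by loading the package:
> library(ggplot2)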
1: Intro to R and data
- Log in and launch RStudio. Try the following commands and figure out what they do.
> seq(1,100)
> sample(seq(1,100), size=10)
> x=sample(seq(1,100), size=10)
> x
> stem(x, scale=2, width=100)
> hist(x)
> boxplot(x)
> y=sample(seq(1,100), size=15)
> stem(y, scale=2, width=100)
> boxplot(x,y)
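If a command is mysterious, R's built-in help describes it and gives examples; for instance:
> ?sample
> help(stem)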
2: Loading data and numerical v. categorical data
- Go to the USGS Earthquake Feed and download the data for all earthquakes in the past 7 days. Pay attention to where you save this file.
- Open RStudio and tell it you want to work in the directory with the earthquake data:
> setwd("C:/Users/myname/Documents/421R/")
- Load the data into R:
> quakes=read.csv("all_week.csv", header=T)
> str(quakes)
tells you the structure of the data. Find the number of earthquakes and the names of the columns. Note that some columns have numerical data, others have factors, and others have integer entries. Each type permits the use of different tools.
- The magnitudes of the earthquakes are in the column mag. This data seems to be continuous. The following commands give you different views of the data:
> summary(quakes$mag)
(the five-number summary plus the mean)
> boxplot(quakes$mag)
(a box-and-whisker plot; the box covers the IQR and the whiskers extend to the most extreme points within 1.5 × IQR of the box)
> hist(quakes$mag)
(an absolute frequency histogram)
> hist(quakes$mag, freq=F)
(a relative frequency histogram)
> plot(density(quakes$mag))
(an estimated density curve, like a smooth histogram)
> hist(quakes$mag, freq=F); lines(density(quakes$mag))
(the relative frequency histogram with the density estimate overlaid)
- The type of earthquake is in the column type. This is categorical (discrete) data. The following commands give a picture of this data:
> library(ggplot2)
(loads an additional plotting package)
> ggplot(data=quakes) + geom_bar(mapping=aes(x=type))
(maybe not enough strange earthquakes?)
- The number of seismic stations used to determine the earthquake's location is in the column nst. This is numerical data, but it is discrete, not continuous. Try
> ggplot(data=quakes) + geom_bar(mapping=aes(x=nst))
- What happens if you try
> ggplot(data=quakes) + geom_bar(mapping=aes(x=mag))
and why? (A continuous-data alternative is sketched at the end of this section.)
- Sometimes we'll look at more than one variable at a time.
> plot(quakes$mag, quakes$latitude)
and
> plot(quakes$mag, quakes$nst)
both look interesting; what do you see?
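For continuous data like mag, ggplot2's histogram geom is the usual alternative to geom_bar; a minimal sketch (the bin width here is an arbitrary choice):
> ggplot(data=quakes) + geom_histogram(mapping=aes(x=mag), binwidth=0.25)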
3: Estimation and the Central Limit Theorem
- Consider a random sample of size 16 from a population with mean 4/3 and variance 16/9. This sample is small enough that the CLT may not provide a good estimate for the distribution of the sample mean, but we're going to try anyway: use the CLT to estimate the probability that the sample mean is less than 4/3 and the probability that it is less than 2. The R command pnorm(x) evaluates the CDF of the standard normal distribution at x.
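By the CLT the sample mean is approximately normal with mean 4/3 and standard deviation sqrt(16/9)/sqrt(16) = 1/3, so the two estimates can be computed directly:
> pnorm(4/3, mean=4/3, sd=1/3)
> pnorm(2, mean=4/3, sd=1/3)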
- We can investigate these estimates by getting R to simulate this sampling process. The following code generates a list of 100 sample means for samples of size 16 from a population with mean 4/3 and variance 16/9 (the exponential distribution with rate 3/4 has exactly this mean and variance).
> x=c()
> for(n in c(1:100)){x=c(x,mean(rexp(16,3/4)))}
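Growing a vector inside a loop works, but replicate() is the idiomatic one-liner for the same simulation:
> x = replicate(100, mean(rexp(16, 3/4)))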
- Investigate the results:
- Produce a histogram with a density estimate:
> hist(x, freq=F, xlim=c(0,3), ylim=c(0,4), col=rgb(.1,.1,.1,.5)); lines(density(x))
- Find the proportion of sample means less than 4/3:
> mean(x<4/3)
- Find the proportion of sample means less than 2:
> mean(x<2)
- Repeat everything (including the CDF estimates), but with a random sample of size 64:
> y=c()
> for(n in c(1:100)){y=c(y,mean(rexp(64,3/4)))}
- Add a histogram with a density estimate to our plot:
> hist(y, freq=F, col=rgb(.1,.1,.1,.1), add=T); lines(density(y))
- Find the proportion of sample means less than 4/3:
> mean(y<4/3)
- Find the proportion of sample means less than 2:
> mean(y<2)
- What do you see happening? Compare and contrast the results of these two sampling processes. If it is helpful, run the code again, possibly with different sample sizes.
4: Hypothesis tests
- Download heights.csv and load the data set into R:
- In RStudio you can use the menus: File > Import Dataset > From CSV...
- In RStudio Cloud you'll first need to upload the file heights.csv to your workspace. Find the Files pane in the bottom right quadrant and use the Upload button there. After that, load the data set into R using
> heights=read.csv("heights.csv")
- See what you have:
> summary(heights)
- Google says that the mean height of an American male is 67 inches. We can test the hypothesis that the mean height of a male GU student is 67 inches using this data:
- Extract the heights of male students from the data frame:
> m=heights$child[heights$sex=="M"]
- You can now compute a test statistic manually using
> mean(m)
and
> sd(m)
- Calculate the p-value of your test statistic and write a summary of your finding: include the null and alternative hypotheses, the test statistic, the p-value, and your conclusion
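A sketch of the manual calculation, using the one-sample t statistic (the sample mean minus 67, divided by the standard error):
> tstat = (mean(m) - 67) / (sd(m) / sqrt(length(m)))
> 2 * pt(-abs(tstat), df=length(m) - 1)
(the second line is the two-sided p-value)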
- R has built-in hypothesis tests, which should agree with your work:
> t.test(m, mu=67)
- Try the same test with the heights of fathers of students:
> t.test(heights$father, mu=67)
- Write another summary of the result (follow the same guidelines as above)
- Would it be reasonable to treat the heights of the male students and the heights of fathers as one random sample (with a larger sample size)? Explain your reasoning.
- Are male students taller than female students? Use the height data to test the null hypothesis that the mean difference is 0.
> f=heights$child[heights$sex=="F"]
> t.test(m,f, alternative="greater")
- Summarize your test results (state null and alternative hypotheses, the test statistic, the p-value, and your conclusion)
- Google tells me that the difference in mean heights should be 3 inches. Test this hypothesis and summarize your results
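One way to set up the 3-inch test (a two-sample t-test whose null hypothesis is that the difference of means is 3; two-sided by default):
> t.test(m, f, mu=3)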
- I found a claim that on average you shouldn't need to wait more than an hour to see the Old Faithful geyser erupt. R has a built-in data set that includes waiting times between Old Faithful eruptions: faithful$waiting. Use this data set to test the claim. Write a summary of your findings. Include the usual statement of hypotheses, your test statistic, the p-value, and your conclusion, but also give a 95% lower confidence bound for the wait time.
- Are the digits of pi random? No, but in many ways they might behave as if they followed a uniform discrete distribution. I collected 100 digits of pi (not the first hundred) to use:
p = c(5,3,4,2,1,1,7,0,6,7,9,8,2,1,4,8,0,8,6,5,1,3,2,8,2,3,0,6,6,4,7,0,9,3,8,4,4,6,0,9,5,5,0,5,8,2,2,3,1,7,2,5,3,5,9,4,0,8,1,2,8,4,8,1,1,1,7,4,5,0,2,8,4,1,0,2,7,0,1,9,3,8,5,2,1,1,0,5,5,5,9,6,4,4,6,2,2,9,4,8)
- My null hypothesis is that the digits of pi follow a uniform discrete distribution. What does this tell you about the mean value of a digit of pi?
- Use t.test to test this hypothesis (about the mean value).
- We can also look at a histogram and make a qualitative assessment:
> hist(p, breaks=c(0:10)-.5)
Does the histogram support my hypothesis? Explain your answer.
- There's a way to quantify the extent to which the histogram supports my hypothesis:
> freq = hist(p, breaks=c(0:10)-.5)$counts
then
> chisq.test(freq)
tests the null hypothesis that each digit is equally likely to appear against the alternative that at least one digit appears with a different probability.
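Under the null hypothesis each of the 10 digits has probability 1/10, so each expected count is 100/10 = 10. The chi-squared statistic and p-value can also be computed by hand:
> sum((freq - 10)^2 / 10)
> pchisq(sum((freq - 10)^2 / 10), df=9, lower.tail=F)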
- Are the random numbers generated by R really random? One way to check is to use the same procedure as in the last exercise to see if the first digits of the results of runif(100, 0, 1) are uniformly distributed:
- Extract the first digits:
> q = floor(runif(100,0,1)*10)
- Look at a histogram:
> hist(q, breaks=c(0:10)-.5)
- Run the test:
> chisq.test(hist(q, breaks=c(0:10)-.5, plot=F)$counts)
(plot=F keeps hist from drawing the histogram again.)
5: Linear regression
- I'm always interested in doing less grading; maybe statistics can help me decide if I really need to give 4 exams to my calculus students.
- Download scores.csv and load the data into R.
- Use
> summary(scores)
to see what you have. In this case it's scores on exam 1 and exam 4 for sections A and B. It makes sense to combine all the scores on exam 1 and all the scores on exam 4:
> e1=c(scores$A1, scores$B1)
and similarly for exam 4, for example:
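Assuming the exam 4 columns follow the same naming pattern (A4 and B4; check names(scores) to confirm):
> e4=c(scores$A4, scores$B4)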
- Create a plot:
> plot(e1, e4, col="blue", main="Scores on Exam 1 and Exam 4 in Math 157", xlab="Exam 1 Score", ylab="Exam 4 Score")
- Calculate the least squares line:
> model=lm(e4~e1)
- Add the regression line to the graph:
> abline(model)
- Use
> plot(model)
to check assumptions and then interpret
> summary(model)
- What advice would you give me about reducing the number of exams? Make reference to your regression data, but also feel free to use other information.
- The CIA publishes a World Factbook containing all kinds of interesting information. In 2016 I downloaded some data on Life Expectancy at Birth (LE), Contraceptive Use (CU), and per capita Health Care Expenditure (H).
- Download CIA-health.csv and load the data into R.
- Use summary() on the loaded data frame (called health in the sketch below) to see what you have:
> summary(health)
- Repeat the steps of the previous analysis (plot, regression line, conclusions) for life expectancy as a function of health care expenditure and then for life expectancy as a function of contraceptive use (a sketch of the first regression follows this set of bullets).
- What advice would you give a nation about how best to use their (limited) resources to improve life expectancy?
- Propose an explanation for the results and think of a way you might test your explanation (possibly using new/different data).
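A minimal sketch of the first regression, assuming the data frame is named health and the columns use the abbreviations above (check names(health) to confirm):
> plot(health$H, health$LE, xlab="Health care expenditure per capita", ylab="Life expectancy at birth")
> hmodel=lm(LE~H, data=health)
> abline(hmodel)
> summary(hmodel)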
- Use the built-in data set faithful to see if there's a connection between waiting time and the duration of eruptions at Yellowstone's Old Faithful geyser. One possible starting point is sketched below.
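One way to start (the built-in faithful data frame has columns eruptions and waiting):
> plot(faithful$eruptions, faithful$waiting, xlab="Eruption duration (min)", ylab="Waiting time (min)")
> fmodel=lm(waiting~eruptions, data=faithful)
> abline(fmodel)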
5a: More linear regression
- I found some interesting data on opah fish, which you should download and load into R: opah.csv.
- Use
> summary(opah)
to see what you have (fork length is a specific way of measuring the length of the fish).
- The column names are inconveniently long, but you can rename them using
> colnames(opah) <- c('ID', 'FL', 'M')
- Plot mass (y-axis) against fork length (x-axis) and add the least squares line. Does it look like a good fit?
- Assess the validity of the linear model using the plots of Residuals vs Fitted and Normal Q-Q.
- Does it even look like there's a linear relationship between mass and fork length?
- In situations like this a logarithmic transform sometimes helps: plot the logarithm of the mass against the logarithm of the fork length and add a least squares line to this plot. Does this look better? Also assess the plots of Residuals vs Fitted and Normal Q-Q.
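A minimal sketch of the log-log fit, using the renamed columns FL and M:
> plot(log(opah$FL), log(opah$M))
> lmodel=lm(log(M)~log(FL), data=opah)
> abline(lmodel)
> plot(lmodel)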
Rank-Sum Test
- Use the following data:
> x = c(8, 23, 26, 18, 4, 3, 0, 16, 14, 7, 10, 14, 5, 19, 7, 13, 25, 16)
> y = c(1, 0, 7, 1, 1, 2, 15, 17, 14, 0, 1, 21, 7, 7, 0, 11, 8, 7)
- Compare the results of
> t.test(x, y, conf.int=T)
and
> wilcox.test(x, y, conf.int=T)
- Use hist() and qqnorm() to see if the data are plausibly normal (a sketch appears at the end of this section). What does this tell you about the two tests above?
- Generate your own random sample from a geometric distribution:
> z = rgeom(18, 0.13)
and repeat the Rank-Sum tests with x and z and with y and z.
- Try
> boxplot(x, y, z, names=c("x", "y", "z"))
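A sketch of the normality check for x (repeat for y and z):
> hist(x, freq=F); lines(density(x))
> qqnorm(x); qqline(x)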
ANOVA
- Use concrete.csv, which contains data on moisture absorption by concrete with 5 different levels of aggregate in the mixture (6 samples of each type of concrete). Fit a one-way ANOVA model:
> fit=aov(AbsorbtionWeight~as.factor(Aggregate), data=concrete)
and then
> anova(fit)
prints the ANOVA table, which tests the null hypothesis that all 5 aggregate levels have the same mean absorption.
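A side-by-side picture of the five groups can also help interpret the ANOVA table (column names as in the aov call above):
> boxplot(AbsorbtionWeight~Aggregate, data=concrete)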