0: Install RStudio on your home computer (optional)
- Download the appropriate version of R from https://ftp.osuosl.org/pub/cran/ and install it.
- It is important to install R first. After you have done that, download RStudio from https://www.rstudio.com/products/rstudio/download/#download and install it.
1: Intro to R and data
- Log in to RStudio Cloud.
- A) Try the following commands and then write a brief description of what they did.
> x = c(1, 1, 2, 3, 5, 8, 13)
> y <- c(7:1)
> x
> y
> x+1
> x+y
> 2*y
> x*y
> sort(x*y)
> x < y
> x[4]
> x[4:6]
> x[x < y]
- B) Try the following commands and then write a brief description of what they did.
> seq(1,100)
> sample(seq(1,100), size=10)
> x=sample(seq(1,100), size=10)
> stem(x, scale=2, width=100)
> hist(x)
> boxplot(x)
> y=sample(seq(1,100), size=15)
> stem(y, scale=2, width=100)
> boxplot(x,y)
2: Loading data and numerical v. categorical data
- Download the USGS data for all earthquakes in the past 7 days. Pay attention to where you save this file.
- Upload the data to RStudio Cloud: open the Files pane in the bottom-right, click "Upload", and select all_week.csv
- Load the data into R:
> quakes=read.csv("all_week.csv", header=T)
> str(quakes)
tells you the structure of the data. How many earthquakes were there? How many variables are there?
- The magnitudes of the earthquakes are in the column mag. This data seems to be continuous. The following commands give you different views of the data:
> summary(quakes$mag)
(a 5-statistic summary)
> boxplot(quakes$mag)
(a box-and-whisker plot; the box covers the IQR and the whiskers extend up to 1.5 × IQR from the box)
> hist(quakes$mag)
(an absolute frequency histogram)
> hist(quakes$mag, freq=F)
(a relative frequency histogram)
> plot(density(quakes$mag))
(an estimated density curve, like a smooth histogram)
> hist(quakes$mag, freq=F); lines(density(quakes$mag))
(the relative frequency histogram with the density curve overlaid)
- The number of seismic stations used to determine the earthquake's location is in the column nst. This is numerical data, but it is discrete, not continuous. Try
> boxplot(quakes$nst)
> hist(quakes$nst)
> barplot(table(quakes$nst))
Do you have a preference?
- What goes wrong with
> barplot(table(quakes$mag))
?
- The type of earthquake is in the column type. This is categorical data. What happens when you try
> boxplot(quakes$type)
or
> hist(quakes$type)
? What about
> barplot(table(quakes$type))
?
- Sometimes we'll look at more than one variable at a time.
> plot(quakes$mag, quakes$latitude)
and
> plot(quakes$mag, quakes$nst)
both look interesting; what do you see?
3 (optional): Estimation and the Central Limit Theorem
- Consider a random sample of size 16 from a population with mean 4/3 and variance 16/9. This sample is small enough that the CLT may not provide a good estimate for the distribution of the sample mean, but we're going to try anyway: use the CLT to estimate the probability that the sample mean is less than 4/3 and the probability that it is less than 2. The R command pnorm(x) is the CDF of the standard normal distribution. (A sketch of this calculation appears at the end of this section.)
- We can investigate these estimates by getting R to simulate this sampling process. The following code generates a list of 100 sample means for samples of size 16 from a population with mean 4/3 and variance 16/9.
> x=c()
> for(n in c(1:100)){x=c(x,mean(rexp(16,3/4)))}
- Investigate the results:
- Produce a histogram with a density estimate:
> hist(x, freq=F, xlim=c(0,3), ylim=c(0,4), col=rgb(.1,.1,.1,.5)); lines(density(x))
- Find the proportion of sample means less than 4/3:
> mean(x<4/3)
- Find the proportion of sample means less than 2:
> mean(x<2)
- Repeat everything (including the CDF estimates), but with a random sample of size 64:
> y=c()
> for(n in c(1:100)){y=c(y,mean(rexp(64,3/4)))}
- Add a histogram with a density estimate to our plot:
> hist(y, freq=F, col=rgb(.1,.1,.1,.1), add=T); lines(density(y))
- Find the proportion of sample means less than 4/3:
> mean(y<4/3)
- Find the proportion of sample means less than 2:
> mean(y<2)
- What do you see happening? Compare and contrast the results of these two sampling processes. If it is helpful, run the code again, possibly with different sample sizes.
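Here's a minimal sketch of the CLT estimates from the first bullet: for n = 16 the sample mean is approximately normal with mean 4/3 and standard error (4/3)/sqrt(16) = 1/3.
> se = (4/3)/sqrt(16)      # standard error of the sample mean: sigma/sqrt(n) = 1/3
> pnorm((4/3 - 4/3)/se)    # P(sample mean < 4/3) = pnorm(0) = 0.5
> pnorm((2 - 4/3)/se)      # P(sample mean < 2) = pnorm(2), about 0.977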
4: Hypothesis tests
- Download heights.csv and load the data set into R:
- In RStudio you can use the menus: File > Import Dataset > From CSV...
- In RStudio Cloud you'll first need to upload the file heights.csv to the web. Find the Files pane in the bottom-right quadrant and use the Upload button there. After that, load the data set into R using
> heights=read.csv("heights.csv")
- See what you have:
> summary(heights)
- Google says that the mean height of an American male is 69.7 inches. The height data you just loaded was collected from a random sample of Gonzaga students a couple of years ago. We can test the hypothesis that the mean height of a male GU student is 69.7 inches using this data:
- Extract the heights of male students from the data frame:
> m=heights$child[heights$sex=="M"]
- You can now compute a test statistic manually using built-in functions for the mean
> mean(m)
and standard deviation
> sd(m)
and basic math operations:
> (mean(m) - 69.7)/(sd(m)/sqrt(length(m)))
- Calculate the p-value of your test statistic (use
> length(m)
to find the sample size) and write a summary of your finding: include the null and alternative hypotheses, the value of the test statistic, its p-value, and your conclusion. (A sketch of this calculation appears after the built-in check below.)
- R has built-in hypothesis tests, which you can use to check your work:
> t.test(m, mu=69.7)
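Here's a minimal sketch of the manual p-value calculation from the previous step, assuming a two-sided alternative:
> t = (mean(m) - 69.7)/(sd(m)/sqrt(length(m)))   # the test statistic from above
> 2*pt(-abs(t), df=length(m)-1)                  # two-sided p-value; compare with t.test(m, mu=69.7)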
- Try the same test with the heights of fathers of students (this time just use the built-in test):
> t.test(heights$father, mu=69.7)
- Write another summary of the result (follow the same guidelines as above)
- Would it be reasonable to treat the heights of the male students and the heights of fathers as one random sample (with a larger sample size)? Explain your reasoning.
- Are male students taller than female students? Use the height data to test the null hypothesis that the difference between mean heights for male and female students is 0.
> f=heights$child[heights$sex=="F"]
> t.test(m,f, alternative="greater")
- Summarize your test results (state null and alternative hypotheses, the test statistic, the p-value, and your conclusion)
- Google tells me that the difference in mean heights should be about 6 inches. Test this hypothesis (you'll need to tell R what your null hypothesis is) and summarize your results. (A sketch follows this list.)
- Are sons taller than their fathers on average?
- The hypothesis test comparing heights of males and females assumed that we had independent random samples from the two populations in question. Would this be a reasonable assumption for a hypothesis test comparing the heights of male students with fathers in the sample?
- How might you test to see if male students are, on average, taller than their fathers?
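Here's a minimal sketch of the 6-inch test; the mu argument is how you tell t.test the hypothesized difference in means:
> t.test(m, f, mu=6)   # H0: mean(male) - mean(female) = 6 inches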
- I found a claim that on average you shouldn't need to wait more than an hour to see the Old Faithful geyser erupt. R has a built-in data set that includes waiting times between Old Faithful eruptions: faithful$waiting. Use this data set to test the claim. Write a summary of your findings. Include the usual statement of hypotheses, your test statistic, the p-value, and your conclusion. Challenge: give a 95% confidence lower bound for the wait time (a sketch follows).
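Here's a minimal sketch of one approach to the challenge, assuming the claim translates to a null mean of 60 minutes; with a one-sided alternative, t.test reports a one-sided confidence interval whose finite endpoint is a 95% lower confidence bound.
> t.test(faithful$waiting, mu=60, alternative="greater")   # H0: mu = 60 vs Ha: mu > 60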
5: Goodness-of-fit tests
- Are the digits of pi random? No, but in many ways they might behave as if they follow a uniform discrete distribution (this is related to the longstanding question of the normality of pi).
- Our null hypothesis is that the digits of pi follow a uniform discrete distribution and our alternative hypothesis is that the digits don't follow a uniform distribution.
- Enter the first 500 digits of pi by copying and pasting the following into R:
p = c( 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4, 6, 2, 6, 4, 3, 3, 8, 3, 2, 7, 9, 5, 0, 2, 8, 8, 4, 1, 9, 7, 1, 6, 9, 3, 9, 9, 3, 7, 5, 1, 0, 5, 8, 2, 0, 9, 7, 4, 9, 4, 4, 5, 9, 2, 3, 0, 7, 8, 1, 6, 4, 0, 6, 2, 8, 6, 2, 0, 8, 9, 9, 8, 6, 2, 8, 0, 3, 4, 8, 2, 5, 3, 4, 2, 1, 1, 7, 0, 6, 7, 9, 8, 2, 1, 4, 8, 0, 8, 6, 5, 1, 3, 2, 8, 2, 3, 0, 6, 6, 4, 7, 0, 9, 3, 8, 4, 4, 6, 0, 9, 5, 5, 0, 5, 8, 2, 2, 3, 1, 7, 2, 5, 3, 5, 9, 4, 0, 8, 1, 2, 8, 4, 8, 1, 1, 1, 7, 4, 5, 0, 2, 8, 4, 1, 0, 2, 7, 0, 1, 9, 3, 8, 5, 2, 1, 1, 0, 5, 5, 5, 9, 6, 4, 4, 6, 2, 2, 9, 4, 8, 9, 5, 4, 9, 3, 0, 3, 8, 1, 9, 6, 4, 4, 2, 8, 8, 1, 0, 9, 7, 5, 6, 6, 5, 9, 3, 3, 4, 4, 6, 1, 2, 8, 4, 7, 5, 6, 4, 8, 2, 3, 3, 7, 8, 6, 7, 8, 3, 1, 6, 5, 2, 7, 1, 2, 0, 1, 9, 0, 9, 1, 4, 5, 6, 4, 8, 5, 6, 6, 9, 2, 3, 4, 6, 0, 3, 4, 8, 6, 1, 0, 4, 5, 4, 3, 2, 6, 6, 4, 8, 2, 1, 3, 3, 9, 3, 6, 0, 7, 2, 6, 0, 2, 4, 9, 1, 4, 1, 2, 7, 3, 7, 2, 4, 5, 8, 7, 0, 0, 6, 6, 0, 6, 3, 1, 5, 5, 8, 8, 1, 7, 4, 8, 8, 1, 5, 2, 0, 9, 2, 0, 9, 6, 2, 8, 2, 9, 2, 5, 4, 0, 9, 1, 7, 1, 5, 3, 6, 4, 3, 6, 7, 8, 9, 2, 5, 9, 0, 3, 6, 0, 0, 1, 1, 3, 3, 0, 5, 3, 0, 5, 4, 8, 8, 2, 0, 4, 6, 6, 5, 2, 1, 3, 8, 4, 1, 4, 6, 9, 5, 1, 9, 4, 1, 5, 1, 1, 6, 0, 9, 4, 3, 3, 0, 5, 7, 2, 7, 0, 3, 6, 5, 7, 5, 9, 5, 9, 1, 9, 5, 3, 0, 9, 2, 1, 8, 6, 1, 1, 7, 3, 8, 1, 9, 3, 2, 6, 1, 1, 7, 9, 3, 1, 0, 5, 1, 1, 8, 5, 4, 8, 0, 7, 4, 4, 6, 2, 3, 7, 9, 9, 6, 2, 7, 4, 9, 5, 6, 7, 3, 5, 1, 8, 8, 5, 7, 5, 2, 7, 2, 4, 8, 9, 1, 2, 2, 7, 9, 3, 8, 1, 8, 3, 0, 1, 1, 9, 4, 9, 1 )
- A histogram allows us to make a qualitative assessment:
> hist(p, breaks=c(0:10)-.5)
What do you think the histogram says about our hypotheses? (Try
> hist(p)
to see why it's important to tell R where to divide up the data using breaks=c(0:10)-.5.)
- Use a goodness-of-fit test to quantify the extent to which the data supports the alternative hypothesis:
- Extract the frequency (as a count) at which each digit occurs in our data
> freq = hist(p, breaks=c(0:10)-.5)$counts
- The command
> chisq.test(freq)
tests the null hypothesis that each digit is equally likely to appear against the alternative that at least one digit appears with a different probability.
- What conclusion should we draw from the goodness-of-fit test?
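As a cross-check, here's a sketch of another way to get the same counts; the factor levels guarantee a zero count for any digit that never appears:
> freq2 = table(factor(p, levels=0:9))   # counts of the digits 0 through 9
> chisq.test(freq2)                      # the same goodness-of-fit test against a uniform distribution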
- The data I used in the example in the video was:
6 10 3 3 7 3 3 7 7 6 7 6 9 6 4 6 6 4 5 5 3 4 7 8 3 8 7 4 4 8 8 5 5 6 3 6 6 0 8 5 3 6 4 9 5 7 5 5 8 7 7 7 9 8 9 9 5 3 10 2 7 2 3 4 3 6 8 5 6 7 5 4 3 4 8 3 7 5 5 7 1 5 6 6 6 6 4 5 10 4 6 5 8 7 4 2 5 7 8 9
My null hypothesis was that this data came from a Poisson distribution with a mean of 6. It's essential that this null hypothesis reflect reliable knowledge and not the data you've just collected. To explore this idea, we'll repeat the goodness-of-fit test using different means: 4.5, 5, and 5.5 (a consolidated sketch follows this list).
- Enter the data into R: use the command
x = scan()
, then copy and paste the data in at the prompt. Hit return one more time and you should be set with a new command prompt.
- Get the frequencies:
> xfreq = hist(x, breaks=(-1:max(x))+.5)$counts
- To test the fit of a Poisson distribution with mean 4.5, we'll compare with a list of expected probabilities (which must sum to 1, which is why some of the following is complicated). Run the following commands to generate this list:
> m = 4.5
sets our mean to 4.5
> e1 = dpois(0:(max(x)-1), m)
gets the values of the PMF for the first part of our list of expected probabilities
> expected = c(e1, 1-sum(e1))
makes a list of expected probabilities with a sum of 1 (the last entry collects the leftover upper-tail probability)
> chisq.test(xfreq, p=expected)
runs the goodness-of-fit test
- Should you reject or accept the null hypothesis?
- Repeat the last test with means of 5 and 5.5 (the up arrow cycles through old commands). What can you conclude?
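Here's a minimal sketch that consolidates these steps in a loop over the three candidate means:
> for(m in c(4.5, 5, 5.5)){
+   e1 = dpois(0:(max(x)-1), m)            # PMF values for counts 0 through max(x)-1
+   expected = c(e1, 1-sum(e1))            # last cell absorbs the upper tail so the probabilities sum to 1
+   print(chisq.test(xfreq, p=expected))   # goodness-of-fit test for this mean
+ }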
6: ANOVA
- Use concrete.csv, which contains data on moisture absorption by concrete with 5 different levels of aggregate in the mixture (with 6 samples of each type of concrete). See part 2 above for loading data into R; I'm calling the data concrete in the following instructions. The following commands get R to do all the calculations for ANOVA based on this data.
- First look at the data (see also the boxplot sketch after this list):
> concrete
- Tell R to do the calculations and store the results:
> concrete.aov=aov(AbsorbtionWeight~as.factor(Aggregate), data=concrete)
(the command as.factor tells R to group the samples according to the Aggregate number)
- Get a summary:
> anova(concrete.aov)
- Draw a conclusion: does the aggregate mixture make a difference in mean moisture absorption?
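Side-by-side boxplots can make this conclusion visible; here's a minimal sketch using the column names from above:
> boxplot(AbsorbtionWeight~as.factor(Aggregate), data=concrete)   # one box per aggregate level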
7: Linear regression with echidnas
- Investigate the echidna data to see if there's a relationship between the duration of hibernation and the emergence mass:
- Download echidna.csv and load the data into R
- The command
> summary(echidna)
will show you a summary of the data.
- The variable names are going to be hard to work with. You can rename them:
> names(echidna)=c("name1", "name2", "name3")
(fill in names you'll be able to connect to the data, e.g. em for emergence mass; c() creates a list as usual; keep the quotation marks).
- Plot the two variables:
> plot(name1~name3, data=echidna, col="blue", main="Echidna mass (kg) as a function of hibernation duration", xlab="hibernation duration (days)", ylab="emergence mass (kg)")
(here ~ tells R to think of the first variable as a function of the third).
- Fit a linear model to the data:
> echidna.model=lm(name1~name3, data=echidna)
- Add the least squares line to the plot:
> abline(echidna.model)
- Get a summary of the model:
> summary(echidna.model)
- Predict the emergence mass of an echidna that hibernates for 150 days (a sketch using predict follows this list).
- There's nothing in the mathematics that says we have to think of emergence mass as a function of hibernation duration. In fact, it's much easier to measure emergence mass than hibernation duration, so we might want to use the emergence mass in an estimate of hibernation duration. Reverse the roles of the two variables in the previous problem to find a new model (echidna.model2) and predict the hibernation duration of an echidna that has an emergence mass of 4 kg.
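Here's a minimal sketch of both predictions, assuming the placeholder names name1 (emergence mass) and name3 (hibernation duration) from above:
> predict(echidna.model, newdata=data.frame(name3=150))   # predicted emergence mass (kg) after 150 days of hibernation
> predict(echidna.model2, newdata=data.frame(name1=4))    # predicted hibernation duration (days) for a 4 kg echidna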
8: Linear regression II
- Check your echidna models to see if the residuals are obviously not normally distributed (use
> plot(echidna.model)
and analyze the first two plots). What are the p-values for the slope terms and what do they tell you about the relationship between hibernation duration and emergence mass? Is the correlation positive or negative?
- I'm always interested in doing less grading; maybe statistics can help me decide if I really need to give 4 exams to my calculus students.
- Download scores.csv and load the data into R.
- Use
> summary(scores)
to see what you have. In this case it's scores on exam 1 and exam 4 for sections A and B. It makes sense to combine all the scores on exam 1 and all the scores on exam 4:
> e1=c(scores$A1, scores$B1)
and similarly for exam 4.
and similar for exam 4. - Create a plot:
plot(e4~e1, col="blue", main="Scores on Exam 1 and Exam 4 in Math 157", xlab="Exam 1 Score", ylab="Exam 4 score")
- Fit a linear model to the data (a sketch of these steps follows this list).
- Add the regression line to the graph.
- Use
> plot(model)
to check assumptions and then interpret the output of
> summary(model)
- What advice would you give me about reducing the number of exams? Make reference to your regression data, but also feel free to use other information.
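Here's a minimal sketch of the model-fitting steps; the exam 4 column names A4 and B4 are assumptions about the CSV, so check them against your summary(scores) output:
> e4 = c(scores$A4, scores$B4)   # assumed column names for the exam 4 scores
> model = lm(e4~e1)              # fit the linear model
> abline(model)                  # add the regression line to the existing plot
> summary(model)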
- The CIA publishes a World Factbook containing all kinds of interesting information. In 2016 I downloaded some data on Life Expectancy at Birth (LE), Contraceptive Use (CU), and per capita Health Care Expenditure (H).
- Download CIA-health.csv and load the data into R.
- Use summary() to see what you have.
- Repeat the steps of the previous analysis (plot, regression line, conclusions) for life expectancy as a function of health care expenditures and then for life expectancy as a function of contraceptive use.
- What advice would you give a nation about how best to use their (limited) resources to improve life expectancy?
- Propose an explanation for the results and think of a way you might test your explanation (possibly using new/different data).
- I found some interesting data on opah fish, which you should download and load into R: opah.csv.
- Use summary to see what you have (fork length is a specific way of measuring the length of the fish).
- Plot mass (y-axis) against fork length (x-axis) and add the least squares line. Does it look like a good fit?
- Assess the validity of the linear model using the plot of Residuals vs Fitted and the Normal Q-Q plot.
- Does it even look like there's a linear relationship between mass and fork length in the original graph?
- In situations like this a logarithmic transform sometimes helps: plot the logarithm of the mass against the fork length and add a least squares line to this plot. Does this look better? Also assess the plots of Residuals vs Fitted and Normal Q-Q. (A sketch follows.)
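Here's a minimal sketch of the log-transform fit, assuming you've renamed the columns to mass and fl (any short names will do):
> opah.log=lm(log(mass)~fl, data=opah)   # model log(mass) as a linear function of fork length
> plot(log(mass)~fl, data=opah)          # scatter plot on the log scale
> abline(opah.log)                       # add the least squares line
> plot(opah.log)                         # check the Residuals vs Fitted and Normal Q-Q plots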