0: Akritas' datasets
00: Install RStudio on your home computer (optional)
- Download the appropriate version of R from https://ftp.osuosl.org/pub/cran/ and install it.
- It is important to install R first. After you have done that, download RStudio from https://www.rstudio.com/products/rstudio/download/#download and install it.
1: Intro to R and data
- (Optional) Login to Posit Cloud. Create a new RStudio project. An R console should occupy the left side of the page.
- Launch RStudio on your computer if you aren't using the cloud.
- Try the following commands and then write a brief description of what they did.
> x
> y
> x = c(1, 1, 2, 3, 5, 8, 13)
> y <- c(7:1)
> x
> y
> x+1
> x+y
> 2*y
> x*y
> sort(x*y)
> x < y
> x[4]
> x[4:6]
> x[x < y]
- Try the following commands and then write a brief description of what they did.
> 1:100
> sample(1:100, size=20)
> x=sample(1:100, size=20)
> x
> sort(x)
> stem(x, scale=2)
> stem(x)
> hist(x)
> hist(x, breaks=10)
> hist(x, breaks=10, freq=F); lines(density(x))
> boxplot(x)
> y=sample(1:100, size=20, replace=T)
- Repeat the good parts of the exploration of x above. How are x and y different?
> boxplot(x,y)
2: Loading data and numerical v. categorical data
- Download the USGS data for all earthquakes in the past 30 days in Washington. Pay attention to where you save this file.
- Upload the data to Posit Cloud: click "File" in the bottom-right pane, then "Upload" and select all_month.csv.
- Load the data into R:
> quakes=read.csv("all_month.csv", header=T)
- > str(quakes) tells you the structure of the data. How many earthquakes were there? How many variables are there?
- The magnitudes of the earthquakes are in the column mag. The following commands give you different views of the data:
> summary(quakes$mag) (a 5-statistic summary)
> boxplot(quakes$mag) (a box-and-whisker plot; the box covers the IQR and the whiskers extend up to 1.5 × IQR beyond the box)
> hist(quakes$mag) (an absolute frequency histogram)
> hist(quakes$mag, freq=F) (a relative frequency histogram)
> plot(density(quakes$mag)) (an estimated density curve--like a smooth histogram)
> hist(quakes$mag, freq=F); lines(density(quakes$mag))
You may need to tell R to skip missing values: lines(density(quakes$mag, na.rm=T))
- The number of seismic stations used to determine the earthquake's location is in the column nst. This is numerical data, but it is discrete, not continuous. Try:
> boxplot(quakes$nst)
> hist(quakes$nst)
> barplot(table(quakes$nst))
- What goes wrong with > barplot(table(quakes$mag))?
- The type of earthquake is in the column type. This is categorical data. What happens when you try > boxplot(quakes$type) or > hist(quakes$type)? What about > barplot(table(quakes$type))?
- Sometimes we'll look at more than one variable at a time. > plot(quakes$mag, quakes$latitude) and > plot(quakes$mag, quakes$nst) both look interesting; what do you see?
3: Estimation and the Central Limit Theorem (optional)
- Consider a random sample of size 16 from an exponentially distributed population with mean 4/3 and variance 16/9. This sample is small enough that the CLT may not provide a good estimate for the distribution of the sample mean, but we're going to try anyway: use a normal approximation to estimate the probability that the sample mean is less than 4/3 and the probability that it is less than 2. The R command pnorm(x) is the CDF of the standard normal distribution.
- We can investigate these estimates by getting R to simulate this sampling process. The following code generates a list of 100 sample means for samples of size 16 from a population with mean 4/3 and variance 16/9.
> x=c()
> for(n in c(1:100)){x=c(x,mean(rexp(16,3/4)))}
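For comparison with the simulation, the normal-approximation estimates asked for above can be computed directly, since the standard deviation of the sample mean is sigma/sqrt(n) = (4/3)/4 = 1/3:

```r
# Normal approximation: the sample mean is approximately N(4/3, sd = 1/3)
sd.xbar <- sqrt((16/9)/16)              # sd of the sample mean, sqrt(variance/n) = 1/3
pnorm(4/3, mean=4/3, sd=sd.xbar)        # estimate of P(sample mean < 4/3): exactly 0.5
pnorm(2, mean=4/3, sd=sd.xbar)          # estimate of P(sample mean < 2): about 0.977
```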
- Investigate the results:
- Produce a histogram with a density estimate:
> hist(x, freq=F, xlim=c(0,3), ylim=c(0,4), col=rgb(.1,.1,.1,.5)); lines(density(x))
- Find the proportion of sample means less than 4/3:
> mean(x<4/3)
- Find the proportion of sample means less than 2:
> mean(x<2)
- Repeat everything (including the CDF estimates), but with a random sample of size 64:
> y=c()
> for(n in c(1:100)){y=c(y,mean(rexp(64,3/4)))}
- Add a histogram with a density estimate to our plot:
> hist(y, freq=F, col=rgb(.1,.1,.1,.1), add=T); lines(density(y))
- Find the proportion of sample means less than 4/3:
> mean(y<4/3)
- Find the proportion of sample means less than 2:
> mean(y<2)
- What do you see happening? Compare and contrast the results of these two sampling processes. If it is helpful, run the code again, possibly with different sample sizes.
4: Hypothesis tests (due 4/21)
- Download heights.csv and load the data set into R:
- In RStudio you can use menus: File > Import data set > From CSV...
- In Posit Cloud you'll first need to upload the file heights.csv: find the File pane in the bottom-right quadrant and use the Upload button there. After that, load the data set into R using
> heights=read.csv("heights.csv")
- See what you have:
> summary(heights)
- Open the same file in Excel and compare. R understands this data as a data frame. You can access columns by name using commands like > heights$child.
- Google says that the mean height of an American male is 69.7 inches. The height data you just loaded was collected from a random sample of Gonzaga students a couple of years ago. We can test the hypothesis that the mean height of a male GU student is 69.7 inches using this data:
- Extract the heights of male students from the data frame:
> m=heights$child[heights$sex=="M"]
- You can now compute a test statistic manually using the built-in functions for the mean, mean(m), and standard deviation, sd(m), together with basic math operations:
> (mean(m) - 69.7)/(sd(m)/sqrt(length(m)))
- Calculate the p-value of your test statistic (use length(m) to find the sample size) and write a summary of your findings: include null and alternative hypotheses, the value of the test statistic, its p-value, and your conclusion.
- R has built-in hypothesis tests, which you can use to check your work:
> t.test(m, mu=69.7)
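If you want to check the manual p-value calculation against the built-in test, the pieces fit together like this (the heights below are made up for illustration; use your vector m instead):

```r
# Made-up sample of male heights (inches), standing in for the vector m above
m <- c(70, 72, 68, 71, 69, 74, 66, 70, 73, 69)
t.stat <- (mean(m) - 69.7)/(sd(m)/sqrt(length(m)))  # one-sample t statistic
p.value <- 2*pt(-abs(t.stat), df=length(m)-1)       # two-sided p-value, n-1 degrees of freedom
p.value
t.test(m, mu=69.7)   # the built-in test reports the same t statistic and p-value
```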
- Try the same test with the heights of fathers of students (this time just use the built-in test):
> t.test(heights$father, mu=69.7)
- Write another summary of the result (follow the same guidelines as above)
- Would it be reasonable to treat the heights of the male students and the heights of fathers as one random sample (with a larger sample size)? Explain your reasoning.
- Are male students taller than female students? Use the height data to test the null hypothesis that the difference between mean heights for male and female students is 0.
> f=heights$child[heights$sex=="F"]
- One of the following will give you an answer to the question; which is most correct?
> t.test(m,f, alternative="greater")
> t.test(m,f, alternative="less")
> t.test(m,f, alternative="greater", var.equal=T)
> t.test(m,f, alternative="less", var.equal=T)
- > var.test(m,f) tests the null hypothesis that the variances are equal against the alternative that they are not equal.
- Summarize your test results (state null and alternative hypotheses, the test statistic, the p-value, and your conclusion).
- Google tells me that the difference in mean heights should be about 6 inches. Test this hypothesis (you'll need to tell R what your null hypothesis is using mu=6) and summarize your results.
- Are sons taller than their fathers on average?
- The hypothesis test comparing heights of males and females assumed that we had independent random samples from the two populations in question. Would this be a reasonable assumption for a hypothesis test comparing the heights of male students with fathers in the sample?
- How might you test to see if male students are, on average, taller than their fathers?
- The t tests you've been using are most accurate when the populations are normally distributed. Heights and other biological measurements are often assumed to be normal, but we can check this assumption using a Q-Q plot. In this case, we'll use qqnorm(), which plots the data against the theoretical positions for a normal distribution. Data that comes from a normal distribution should fall roughly on a straight line; a consistent bend away from a line suggests the data might not be normally distributed. Some nice examples are here (pdf) or on page 155 of the textbook.
- Run > qqnorm(m), > qqnorm(f), > qqnorm(heights$mother), and > qqnorm(heights$father) and evaluate the normality assumption. > qqline([data name], col=4) adds R's best guess at a line to your Q-Q plot (in color 4, for contrast).
- Do you see anything that looks like a problem?
- I found a claim that on average you shouldn't need to wait more than an hour to see the Old Faithful geyser erupt. R has a built-in data set that includes waiting times between Old Faithful eruptions: faithful$waiting. Use this data set to test the claim. Write a summary of your findings: include the usual statement of hypotheses, your test statistic, the p-value, and your conclusion. Also address any assumptions (e.g. of normality), possibly using Q-Q plots. Challenge: give a 95% confidence lower bound for the wait time.
- Guide to summarizing hypothesis tests
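The faithful data set ships with R, so no download is needed. One possible framing of the test is sketched below; whether "greater" is the right alternative (and what to conclude) is part of the exercise:

```r
# Claim: the mean wait is at most 60 minutes. One framing: H0: mu = 60 vs. Ha: mu > 60.
t.test(faithful$waiting, mu=60, alternative="greater")
# A one-sided test also reports a one-sided confidence interval; its lower endpoint
# is the 95% confidence lower bound asked for in the challenge:
t.test(faithful$waiting, mu=60, alternative="greater")$conf.int
```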
5: Goodness-of-fit tests (due 4/26)
- Watch an introduction to goodness-of-fit tests. This video refers to this Random1 page on goodness of fit tests.
- Are the digits of pi random? No, but in many ways they might behave as if they follow a discrete uniform distribution (this is related to the longstanding question of the normality of pi).
- Our null hypothesis is that the digits of pi follow a uniform discrete distribution and our alternative hypothesis is that the digits don't follow a uniform distribution.
- Enter the first 500 digits of pi into R by copying and pasting the following into R:
p = c( 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4, 6, 2, 6, 4, 3, 3, 8, 3, 2, 7, 9, 5, 0, 2, 8, 8, 4, 1, 9, 7, 1, 6, 9, 3, 9, 9, 3, 7, 5, 1, 0, 5, 8, 2, 0, 9, 7, 4, 9, 4, 4, 5, 9, 2, 3, 0, 7, 8, 1, 6, 4, 0, 6, 2, 8, 6, 2, 0, 8, 9, 9, 8, 6, 2, 8, 0, 3, 4, 8, 2, 5, 3, 4, 2, 1, 1, 7, 0, 6, 7, 9, 8, 2, 1, 4, 8, 0, 8, 6, 5, 1, 3, 2, 8, 2, 3, 0, 6, 6, 4, 7, 0, 9, 3, 8, 4, 4, 6, 0, 9, 5, 5, 0, 5, 8, 2, 2, 3, 1, 7, 2, 5, 3, 5, 9, 4, 0, 8, 1, 2, 8, 4, 8, 1, 1, 1, 7, 4, 5, 0, 2, 8, 4, 1, 0, 2, 7, 0, 1, 9, 3, 8, 5, 2, 1, 1, 0, 5, 5, 5, 9, 6, 4, 4, 6, 2, 2, 9, 4, 8, 9, 5, 4, 9, 3, 0, 3, 8, 1, 9, 6, 4, 4, 2, 8, 8, 1, 0, 9, 7, 5, 6, 6, 5, 9, 3, 3, 4, 4, 6, 1, 2, 8, 4, 7, 5, 6, 4, 8, 2, 3, 3, 7, 8, 6, 7, 8, 3, 1, 6, 5, 2, 7, 1, 2, 0, 1, 9, 0, 9, 1, 4, 5, 6, 4, 8, 5, 6, 6, 9, 2, 3, 4, 6, 0, 3, 4, 8, 6, 1, 0, 4, 5, 4, 3, 2, 6, 6, 4, 8, 2, 1, 3, 3, 9, 3, 6, 0, 7, 2, 6, 0, 2, 4, 9, 1, 4, 1, 2, 7, 3, 7, 2, 4, 5, 8, 7, 0, 0, 6, 6, 0, 6, 3, 1, 5, 5, 8, 8, 1, 7, 4, 8, 8, 1, 5, 2, 0, 9, 2, 0, 9, 6, 2, 8, 2, 9, 2, 5, 4, 0, 9, 1, 7, 1, 5, 3, 6, 4, 3, 6, 7, 8, 9, 2, 5, 9, 0, 3, 6, 0, 0, 1, 1, 3, 3, 0, 5, 3, 0, 5, 4, 8, 8, 2, 0, 4, 6, 6, 5, 2, 1, 3, 8, 4, 1, 4, 6, 9, 5, 1, 9, 4, 1, 5, 1, 1, 6, 0, 9, 4, 3, 3, 0, 5, 7, 2, 7, 0, 3, 6, 5, 7, 5, 9, 5, 9, 1, 9, 5, 3, 0, 9, 2, 1, 8, 6, 1, 1, 7, 3, 8, 1, 9, 3, 2, 6, 1, 1, 7, 9, 3, 1, 0, 5, 1, 1, 8, 5, 4, 8, 0, 7, 4, 4, 6, 2, 3, 7, 9, 9, 6, 2, 7, 4, 9, 5, 6, 7, 3, 5, 1, 8, 8, 5, 7, 5, 2, 7, 2, 4, 8, 9, 1, 2, 2, 7, 9, 3, 8, 1, 8, 3, 0, 1, 1, 9, 4, 9, 1 )
- A histogram allows us to make a qualitative assessment: hist(p, breaks=c(0:10)-.5). What do you think the histogram says about our hypotheses? (Try hist(p) to see why it's important to tell R where to divide up the data using breaks=c(0:10)-.5.)
- Use a goodness-of-fit test to quantify the extent to which the data supports the alternative hypothesis:
- Extract the frequency (as a count) at which each digit occurs in our data:
> freq = hist(p, breaks=c(0:10)-.5)$counts
- > chisq.test(freq) tests the null hypothesis that each digit is equally likely to appear against the alternative that at least one digit appears with a different probability.
- What conclusion should we draw from the goodness-of-fit test?
- The data I used in the example in the video was:
6 10 3 3 7 3 3 7 7 6 7 6 9 6 4 6 6 4 5 5 3 4 7 8 3 8 7 4 4 8 8 5 5 6 3 6 6 0 8 5 3 6 4 9 5 7 5 5 8 7 7 7 9 8 9 9 5 3 10 2 7 2 3 4 3 6 8 5 6 7 5 4 3 4 8 3 7 5 5 7 1 5 6 6 6 6 4 5 10 4 6 5 8 7 4 2 5 7 8 9
My null hypothesis was that this data came from a Poisson distribution with a mean of 6. It's essential that this null hypothesis reflect reliable knowledge and not the data you've just collected. To explore this idea, we'll repeat the goodness-of-fit test using different means: 4.5, 5, and 5.5.
- Enter the data into R: use the command x = scan(), then copy and paste the data in at the prompt. Hit return one more time and you should be back at a fresh command prompt.
- Get the frequencies:
> xfreq = hist(x, breaks=(-1:max(x))+.5)$count
- To test the fit of a Poisson distribution with mean 4.5, we'll compare with a list of expected probabilities (which must sum to 1, which is why some of the following is complicated). Run the following commands to generate this list:
> m = 4.5 (sets our mean to 4.5)
> e1 = dpois(0:(max(x)-1), m) (gets the values of the PMF for the first part of our list of expected probabilities)
> expected = c(e1, 1-sum(e1)) (makes a list of expected probabilities with a sum of 1)
> chisq.test(xfreq, p=expected) (runs the goodness-of-fit test)
- Should you reject or accept the null hypothesis?
- Repeat the last test with means of 5 and 5.5 (the up arrow cycles through old commands). What can you conclude?
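The steps above can be collected into one self-contained script; the gof wrapper function below is my addition, not part of the lab, but it makes it easy to repeat the test with different means:

```r
# The count data from the video, entered directly (same numbers as above)
x <- c(6,10,3,3,7,3,3,7,7,6, 7,6,9,6,4,6,6,4,5,5, 3,4,7,8,3,8,7,4,4,8,
       8,5,5,6,3,6,6,0,8,5, 3,6,4,9,5,7,5,5,8,7, 7,7,9,8,9,9,5,3,10,2,
       7,2,3,4,3,6,8,5,6,7, 5,4,3,4,8,3,7,5,5,7, 1,5,6,6,6,6,4,5,10,4,
       6,5,8,7,4,2,5,7,8,9)
xfreq <- hist(x, breaks=(-1:max(x))+.5, plot=FALSE)$counts  # counts for values 0..10
gof <- function(m) {                     # goodness-of-fit test against Poisson(mean m)
  e1 <- dpois(0:(max(x)-1), m)           # PMF values for counts 0..max(x)-1
  expected <- c(e1, 1-sum(e1))           # last bin absorbs the upper tail, so the sum is 1
  chisq.test(xfreq, p=expected)
}
gof(4.5); gof(5); gof(5.5); gof(6)       # compare p-values across candidate means
```

(chisq.test may warn about small expected counts here; that warning is worth thinking about when you interpret the results.)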
6: Linear regression with echidnas (due 4/28)
- Investigate the echidna data to see if there's a relationship between the duration of hibernation and the emergence mass:
- Download echidna.csv and load the data into R
- summary(echidna) will show you a summary of the data.
- The variable names are going to be hard to work with. You can rename them: names(echidna)=c("name1", "name2", "name3") (fill in names you'll be able to connect to the data, e.g. em for emergence mass; c() creates a list as usual; keep the quotation marks).
- Plot the two variables: plot(name1~name3, data=echidna, col="blue", main="Echidna mass (kg) as a function of hibernation duration", xlab="hibernation duration (days)", ylab="emergence mass (kg)") (here ~ tells R to think of the first variable as a function of the third).
- Fit a linear model to the data: echidna.model=lm(name1~name3, data=echidna).
- Add the least squares line to the plot: abline(echidna.model).
- Get a summary of the model: summary(echidna.model).
- Predict the emergence mass of an echidna that hibernates for 150 days.
- There's nothing in the mathematics that says we have to think of emergence mass as a function of hibernation duration. In fact, it's much easier to measure emergence mass than hibernation duration, so we might want to use the emergence mass in an estimate of hibernation duration. Reverse the roles of the two variables in the previous problem to fit a new model (echidna.model2) and predict the hibernation duration of an echidna that has an emergence mass of 4 kg.
- Check your echidna models to see if the residuals are obviously not normally distributed (use plot(echidna.model) and analyze the first two plots). What are the p-values for the slope terms and what do they tell you about the relationship between hibernation duration and emergence mass? Is the correlation positive or negative?
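One way to make the predictions above is with predict(). A sketch with made-up numbers: the data frame, the column names hd and em, and all the values below are hypothetical stand-ins for whatever you load and name from echidna.csv:

```r
# Hypothetical echidna data: hd = hibernation duration (days), em = emergence mass (kg)
echidna <- data.frame(hd = c(120, 135, 150, 160, 170, 180),
                      em = c(3.1, 3.4, 3.9, 4.0, 4.4, 4.6))
echidna.model <- lm(em~hd, data=echidna)
# predict() wants the new x-values in a data frame whose column name matches the model
predict(echidna.model, newdata=data.frame(hd=150))
```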
7: Linear regression II (due 5/1)
- I'm always interested in doing less grading; maybe statistics can help me decide if I really need to give 4 exams to my calculus students.
- Download scores.csv and load the data into R.
- Use summary(scores) to see what you have. In this case it's scores on exam 1 and exam 4 for sections A and B. It makes sense to combine the scores on exam 1 for both sections: e1=c(scores$A1, scores$B1), and similarly for exam 4.
- Create a plot: plot(e4~e1, col="blue", main="Scores on Exam 1 and Exam 4 in Math 157", xlab="Exam 1 Score", ylab="Exam 4 score").
. - Fit a linear model to the data.
- Add the regression line to the graph.
- Use plot(model) to check assumptions and then interpret summary(model).
- What advice would you give me about reducing the number of exams? Make reference to your regression data, but also feel free to use other information.
- The CIA publishes a World Factbook containing all kinds of interesting information. In 2016 I downloaded some data on Life Expectancy at Birth (LE; measured in years), Contraceptive Use Rate (CU; percent of females aged 12-49), and per capita Health Care Expenditure (H; percent of GDP).
- Download CIA-health.csv and load the data into R.
- Use summary() to see what you have.
- Repeat the steps of the previous analysis (plot, regression line, conclusions) for life expectancy as a function of health care expenditures and then for life expectancy as a function of contraceptive use.
- What advice would you give a nation about how best to use its (limited) resources to improve life expectancy?
- Propose an explanation for the results and think of a way you might test your explanation (possibly using new/different data).
8: Opah! (due 5/1)
- I found some interesting data on opah fish, which you should download and load into R: opah.csv.
- Use summary to see what you have (fork length is a specific way of measuring the length of the fish).
- The column names are inconveniently long, but you can rename them as you did with the echidna data.
- Plot mass (y-axis) against fork length (x-axis) and add the least squares line. Does it look like a good fit?
- Assess the validity of the linear model using (at least) the plot of Residuals vs Fitted and the Normal Q-Q plot.
- Does it even look like there's a linear relationship between mass and fork length in the original graph?
- In situations like this a logarithmic transform sometimes helps: plot the logarithm of the mass against the fork length and add a least squares line to this plot. Does this look better? Also assess the plots of Residuals vs Fitted and Normal Q-Q.
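The log-transform step might look like the sketch below. The data frame, the column names fl and mass, and the numbers are all made up for illustration; substitute the columns you load from opah.csv:

```r
# Hypothetical opah data: fl = fork length (cm), mass (kg)
opah <- data.frame(fl   = c(60, 70, 80, 90, 100, 110),
                   mass = c(14, 22, 33, 48, 68, 95))
plot(log(mass)~fl, data=opah)                   # plot log(mass) against fork length
opah.log.model <- lm(log(mass)~fl, data=opah)   # fit log(mass) = a + b*fl
abline(opah.log.model)                          # the least squares line on the log scale
plot(opah.log.model)                            # residual diagnostics, as before
```

Note that predictions from this model are on the log scale; use exp() to convert back to mass.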
9: ANOVA (due 5/5)
- Continue working with the heights data
- Run an analysis of variance: h.aov=aov(child~as.factor(sex), data=heights).
- Compare summary(h.aov), anova(h.aov), and t.test(heights$child[heights$sex == "M"], heights$child[heights$sex == "F"], var.equal=T) (the last one extracts the heights of male and female students from the data frame and compares them with a t-test).
- Use the mtcars data set built into R: > ?mtcars will explain the data set.
- I'd like to think of mpg as the dependent variable and assess its connection with cyl, gear, and qsec. Some of these independent variables are categorical: use ANOVA to analyze them. Other variables are continuous, so regression is appropriate.
- Comment on the validity of the models.
- Summarize your findings.
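Since mtcars ships with R, this analysis runs as-is. One possible shape for it is below; deciding which variables to treat as factors (and what the output means) is up to you:

```r
# cyl and gear take only a few values, so treat them as factors and use ANOVA;
# qsec (quarter-mile time) is continuous, so use regression.
summary(aov(mpg~as.factor(cyl), data=mtcars))    # ANOVA: mpg across cylinder counts
summary(aov(mpg~as.factor(gear), data=mtcars))   # ANOVA: mpg across gear counts
qsec.model <- lm(mpg~qsec, data=mtcars)          # regression: mpg on quarter-mile time
summary(qsec.model)
plot(qsec.model)                                 # the usual diagnostic plots
```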
Course resources
Links
- Random!
- Dr. Stover's 321
- Introduction to Probability by Grinstead and Snell
- SticiGui
- Posit cloud (RStudio online)
- Math 321FLO Spring 2020
- Math 321 Fall 2019
- Math 321 Spring 2016
- Math 321 Spring 2011
- Academic Calendar
- Final Exam Schedule
- WolframAlpha
- Desmos
Office hours
- Monday 10:00-11:00
- Tuesday 10:00-11:00
- Wednesday 2:00-3:00 in the MLC (Bollier 218)
- Thursday 10:00-11:00
- Or by appointment
Logan Axon
Department of Mathematics
MSC 2615
Gonzaga University
Spokane, WA 99258
Office: Herak 227A
Phone: 509.313.3897
Email: axon@gonzaga.edu
Last updated 5/2/2023