Hypothesis testing - Your turn#
Note there are five parts!
Part 1#
Repeat the test from Example 1 in this section, but now assuming that the standard deviation for the distribution of the failure times is not known and must be estimated from the data. Explain the difference observed in the \(p\)-values.
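As a rough orientation, the sketch below contrasts a known-sigma \(z\)-test with a \(t\)-test that estimates the standard deviation from the sample. The data, `mu0`, and `sigma_known` are made up for illustration and are not the values from Example 1; the two-sided alternative is also an assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical failure times; NOT the data from Example 1
times = np.array([98.0, 105.0, 92.0, 110.0, 101.0, 95.0, 99.0, 104.0])
mu0 = 95.0          # hypothetical null-hypothesis mean
sigma_known = 6.0   # hypothetical "known" standard deviation

n = times.size
xbar = times.mean()

# z-test: sigma is known, so the statistic is standard normal under H0
z = (xbar - mu0) / (sigma_known / np.sqrt(n))
p_z = 2 * stats.norm.sf(abs(z))

# t-test: sigma is replaced by the sample standard deviation (ddof=1),
# and the statistic follows a t distribution with n - 1 degrees of freedom
t_res = stats.ttest_1samp(times, popmean=mu0)
s = times.std(ddof=1)
t_manual = (xbar - mu0) / (s / np.sqrt(n))

print(f"z = {z:.3f}, p = {p_z:.4f}")
print(f"t = {t_res.statistic:.3f}, p = {t_res.pvalue:.4f}")
```

For the same value of the statistic, the \(t\) distribution puts more probability in its tails than the standard normal, which is one of the two effects to consider when comparing the \(p\)-values (the other being that the estimated standard deviation generally differs from the assumed one).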
# TODO: Write your code below
TODO: Write your explanation below#
Part 2#
Reconsider the fuel consumption problem from Example 2 above. In this problem, the fuel consumption for the two types of engines has been determined pair-wise over different distances traveled; a Type 1 engine and a Type 2 engine were paired and driven the same distance (but different from the other pairs) to determine fuel consumption. Fuel consumption is again assumed to be normally distributed. Mean fuel consumption changes with distance traveled and possibly with engine type. Variability of fuel consumption is the same regardless of distance traveled or engine type. Because of this setup, it is believed that a paired \(t\)-test would be more appropriate. For the paired fuel consumption data provided in the table below, can it be concluded at the \(10\%\) significance level that the two engines consume fuel at different rates?
| | Type 1 usage (gal) | Type 2 usage (gal) |
|---|---|---|
| Pair 1 | 540 | 555 |
| Pair 2 | 520 | 515 |
| Pair 3 | 580 | 585 |
| Pair 4 | 500 | 505 |
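A minimal sketch of how a paired \(t\)-test can be run in SciPy, assuming `scipy.stats.ttest_rel` is the tool of choice. The arrays `x` and `y` are hypothetical measurements, not the fuel data in the table. The sketch also checks that the paired test is the same as a one-sample \(t\)-test on the pairwise differences.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements; NOT the table above
x = np.array([12.1, 15.3, 9.8, 20.4, 17.6])
y = np.array([11.4, 14.9, 9.1, 19.2, 17.0])

# Paired (dependent-samples) t-test, two-sided by default
paired = stats.ttest_rel(x, y)

# Equivalent formulation: one-sample t-test on the differences against 0
diffs = x - y
one_sample = stats.ttest_1samp(diffs, popmean=0.0)

print(f"paired:     t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"one-sample: t = {one_sample.statistic:.3f}, p = {one_sample.pvalue:.4f}")
```

Pairing removes the pair-to-pair variation (here, the effect of distance) from the comparison, since only the within-pair differences enter the test.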
# TODO: Write your code below
TODO: Write your explanation below#
Part 3#
Suppose that because of the lack of historical data, test engineers are not certain that the fuel consumption in the previous exercise is normally distributed. Rather than using a paired \(t\)-test, they are considering using a distribution-free sign test. Using the data in the table above, what conclusion is reached at the \(5\%\) significance level?
Note: There is no built-in Python equivalent for MATLAB’s `signtest()`, so we have written our own below using `binomtest()`, as the sign test is just a special case (see Wikipedia).
import numpy as np
from scipy.stats import binomtest

def my_signtest(data1, data2=None):
    """
    Two-sided sign test.

    Args:
        data1 - array of values
        data2 (opt) - second array of values for differences
    Returns:
        p-value of the test
    """
    # Convert to arrays so that plain lists and tuples also work
    data1 = np.asarray(data1)
    # If a second array is provided, calculate the differences
    if data2 is not None:
        data2 = np.asarray(data2)
        if len(data1) != len(data2):
            raise ValueError("data1 and data2 must have the same length.")
        differences = data1 - data2
    else:
        differences = data1
    # Remove zero differences for the sign test
    nonzero_differences = differences[differences != 0]
    n = len(nonzero_differences)
    num_positive = int(np.sum(nonzero_differences > 0))
    # Two-sided test where p=0.5 under the null hypothesis
    result = binomtest(num_positive, n, p=0.5, alternative='two-sided')
    return result.pvalue
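To make the "special case of the binomial test" remark concrete, here is a small self-contained sketch on made-up differences (not the fuel data). It counts the positive signs, calls `binomtest()`, and compares the result with a by-hand two-sided \(p\)-value obtained by doubling the smaller binomial tail, which is valid here because the null distribution with \(p = 0.5\) is symmetric.

```python
import numpy as np
from scipy.stats import binom, binomtest

# Hypothetical paired differences (made up for illustration)
differences = np.array([1.2, -0.4, 0.8, 2.1, 0.0, 1.5, 0.3])

nonzero = differences[differences != 0]   # ties carry no sign information
n = len(nonzero)                          # 6 nonzero differences
k = int(np.sum(nonzero > 0))              # 5 of them are positive

# Sign test as a binomial test with success probability 0.5 under H0
p_binomtest = binomtest(k, n, p=0.5, alternative="two-sided").pvalue

# By hand: double the smaller tail of Binomial(n, 0.5), capped at 1
lower_tail = binom.cdf(k, n, 0.5)      # P(X <= k)
upper_tail = binom.sf(k - 1, n, 0.5)   # P(X >= k)
p_by_hand = min(1.0, 2 * min(lower_tail, upper_tail))

print(p_binomtest, p_by_hand)
```

With so few nonzero differences, only a handful of \(p\)-values are attainable, which is worth keeping in mind when interpreting a sign test on a small sample.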
# TODO: Write your code below
TODO: Write your explanation below#
Part 4#
In a conversation with a colleague regarding investing, you are told that you should invest in value stocks because they outperform growth stocks in the long run.
Your goal is to test the validity of this statement.
Historically, it’s been shown that the quantity \(\log( P_n / P_{n-1} )\), known as “log-return,” is approximately normally distributed and independent over time.
Here, \(P_{n-1}\) and \(P_n\) are stock prices in consecutive years.
In the file `index_data.txt` (available on Canvas), you will find annual prices of two stock indexes from 1926 until 2008.
Calculate the log-returns for each index and test the hypothesis that the value index has the same mean log-return as the growth index at the \(10\%\) significance level.
Notes:
- Remember you can upload data to Colab by dragging it into the 📁 tab (tutorial).
- You may find it very helpful to add `delimiter=r"\s+"` to `pd.read_csv()` when loading the data. This parses the extra spaces in the data and gives you a neat DataFrame.
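The loading tip can be sketched as follows. The file contents, column names (`Year`, `Value`, `Growth`), and prices here are all hypothetical, since the actual layout of `index_data.txt` is not shown; an in-memory string stands in for the file. The sketch also pairs each year's price with the previous year's using `shift(1)`, which is the alignment needed before forming the log of the price ratio.

```python
import io
import pandas as pd

# Hypothetical file contents with irregular spacing; names and values are made up
raw = """Year   Value   Growth
2001   100.0    50.0
2002   110.0    55.0
2003    99.0    60.5
"""

# r"\s+" treats any run of whitespace as a single column separator
df = pd.read_csv(io.StringIO(raw), delimiter=r"\s+")

# Align each price with the previous year's price (first row has no predecessor)
df["Value_prev"] = df["Value"].shift(1)
df["Growth_prev"] = df["Growth"].shift(1)

print(df)
```

With a real file, the `io.StringIO(raw)` argument would be replaced by the file path. The first row has no previous-year price, so it yields `NaN` and would typically be dropped before testing.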
# TODO: Write your code below
TODO: Write your explanation below#
Part 5#
A claim is made that the amount of energy saved from a facility per week after installing new solar panels is \(550\) kilowatt-hours more than previous attempts to reduce energy usage. Suppose that technicians run weekly experiments to determine energy savings for both the solar panels and previous methods. Suppose further that the energy savings from the experiments are normally distributed. Unknown to the technicians, however, the data of energy savings (in kilowatt-hours) from the new solar panels has mean \(\mu_s = 6500\) and standard deviation \(800\), while the data of energy savings (in kilowatt-hours) from previous methods of reducing energy usage has mean \(\mu_p = 5500\) and standard deviation \(800\).
The hypothesis test the technicians consider is:
\[H_0: \mu_s - \mu_p = 550 \quad \text{versus} \quad H_1: \mu_s - \mu_p > 550\]
Because the alternative hypothesis \(H_1\) is actually true, we would like to know about how many weekly experiments the technicians need to run so that the probability of detection using a one-sided \(t\)-test at the \(5\%\) level of significance is at least \(95\%\), i.e., the probability of rejecting the null hypothesis is at least \(95\%\). Run \(1000\) one-sided \(t\)-tests over a range of sample sizes to estimate the probability of detection for each sample size and to estimate the required sample size to reject the null hypothesis \(95\%\) of the time.
Hint: Many of the functions in `scipy.stats` take a `size` or `axis` parameter that allows for efficient vectorized operations to avoid loops.
But using loops is a totally valid solution.
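The hint about `size` and `axis` can be illustrated with a small Monte Carlo power estimate. This is only a sketch on a simpler, hypothetical one-sample problem with made-up parameters (`mu0`, `true_mean`, `sigma`, `n`), deliberately not the two-sample setup or the numbers of this part.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical one-sample setup; NOT the Part 5 numbers
mu0, true_mean, sigma = 10.0, 11.0, 2.0
n, n_trials, alpha = 20, 1000, 0.05

# size=(n_trials, n): one row per simulated experiment, one column per observation
samples = stats.norm.rvs(loc=true_mean, scale=sigma,
                         size=(n_trials, n), random_state=rng)

# axis=1 runs one one-sided t-test per row, with no explicit loop
res = stats.ttest_1samp(samples, popmean=mu0, axis=1, alternative="greater")

# Fraction of simulated experiments that reject H0 estimates the power
power_hat = np.mean(res.pvalue < alpha)
print(f"estimated power at n = {n}: {power_hat:.3f}")
```

Repeating this for several values of `n` gives an estimated power curve, from which the smallest sample size reaching a target power can be read off. The estimate itself is random, so it varies slightly with the seed and the number of trials.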
# TODO: Write your code below
TODO: Write your answer below#
Exporting your work#
When you’re ready, the easiest way to export the notebook is to File > Print it and save it as a PDF.
Remove any excessively long, unrelated outputs first by clicking the arrow → next to the output box and then Show/hide output.
Obviously don’t obscure any necessary output or graphs!