Parameter estimation - Your turn#
Note
Open this notebook in Colab (via the launch button at the top of the page) to enable interactivity.
Note
To save your progress, make a copy of this notebook in Colab (File > Save a copy in Drive); you'll find it in My Drive > Colab Notebooks.
Part 1#
A traffic engineer would like to make an assessment of traffic flow at a busy intersection during weekday rush hour. The number of arrivals is believed to satisfy the conditions for the Poisson distribution. The engineer records the number of cars arriving at the intersection between the hours of 8am and 9am over a period of two weeks (10 business days). Using the data in the table (in cars/hr), compute the maximum likelihood estimate and the \(95\%\) confidence interval for the average number of car arrivals.
| 240 | 225 | 300 | 280 | 215 | 205 | 275 | 320 | 210 | 240 |
Note: Unfortunately, the poisson
distribution in scipy.stats
does not have a fit()
method.
But is there a property of the Poisson distribution (and its parameter \(\lambda\)) that we can use to our advantage?
See Wikipedia for a hint.
# TODO: Write your code below
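One possible sketch: for a Poisson distribution, the maximum likelihood estimate of \(\lambda\) is the sample mean, and an approximate \(95\%\) confidence interval follows from the normal approximation \(\hat{\lambda} \pm z_{0.975}\sqrt{\hat{\lambda}/n}\). The data is hard-coded from the table above; treat this as one approach, not the only valid one.

```python
import numpy as np
from scipy import stats

# Arrival counts (cars/hr) from the table above
arrivals = np.array([240, 225, 300, 280, 215, 205, 275, 320, 210, 240])
n = len(arrivals)

# For a Poisson distribution, the MLE of lambda is the sample mean
lam_hat = arrivals.mean()

# Approximate 95% CI via the normal approximation:
#   lambda_hat +/- z * sqrt(lambda_hat / n)
z = stats.norm.ppf(0.975)
half_width = z * np.sqrt(lam_hat / n)

print(f"MLE of lambda: {lam_hat:.1f} cars/hr")
print(f"95% CI: ({lam_hat - half_width:.1f}, {lam_hat + half_width:.1f})")
```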
Part 2#
A product is tested for levels of different chemicals \(X_i\), where \(i = 1,2,3,4,5,6\). The value of each level is an integer ranging from \(1\) to \(5\). Depending on the observed levels of the chemicals, the product is either accepted (\(Y = 1\)) or not (\(Y = 0\)).
Given a \(1 \times 6\) input vector of measurements of the levels of chemicals and the output \(Y\), the parameters \(P(X_i = j \mid Y = 1)\) where \(i = 1,2,3,4,5,6\) and \(j = 1,2,3,4,5\) can be estimated analytically using maximum likelihood estimation. The idea behind maximum likelihood parameter estimation is to determine the parameters that maximize the probability (likelihood) of the sample data. In our problem, each event \(X_i = j \mid Y = 1\) has parameter \(p_{ij}\), which can be estimated as the ratio of the number of observations in which both \(Y = 1\) and \(X_i = j\) to the number of observations in which \(Y = 1\). We can do the same for \(Y = 0\).
The given data (see file classify_data.xlsx
) consists of 9000 tests of the kind described above.
Remember you can upload data to Colab by dragging it into the 📁 tab (tutorial).
We suggest you use the pandas package to read in the data with the following syntax:
import pandas as pd
df = pd.read_excel("classify_data.xlsx", header=0, skiprows=[1], usecols="B:H")
(a)#
Use the first 6000 tests to create a model for the given system, i.e., estimate the parameters \(P(X_i = j \mid Y = 1)\) and \(P(X_i = j \mid Y = 0)\) for \(i = 1,2,3,4,5,6\) and \(j = 1,2,3,4,5\).
# TODO: Write your code below
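A minimal sketch of the frequency-counting estimate described above. It assumes the DataFrame columns are named `X1` through `X6` and `Y` (adjust to match your file); the tiny synthetic frame is only there to make the example self-contained, since the real data would come from classify_data.xlsx.

```python
import numpy as np
import pandas as pd

def estimate_params(df, y_value):
    """Estimate P(X_i = j | Y = y_value) for i = 1..6, j = 1..5 as
    conditional relative frequencies. Entry [i-1, j-1] of the returned
    6x5 array is the estimate for P(X_i = j | Y = y_value).
    Column names X1..X6 and Y are assumed -- adjust to your file."""
    subset = df[df["Y"] == y_value]
    params = np.zeros((6, 5))
    for i in range(1, 7):
        for j in range(1, 6):
            # fraction of Y == y_value rows with X_i == j
            params[i - 1, j - 1] = (subset[f"X{i}"] == j).mean()
    return params

# Tiny synthetic example in place of the real 6000 training rows:
toy = pd.DataFrame({
    "X1": [1, 2, 1, 3], "X2": [5, 5, 4, 1], "X3": [2, 2, 2, 2],
    "X4": [1, 1, 3, 4], "X5": [5, 4, 3, 2], "X6": [1, 1, 1, 5],
    "Y":  [1, 1, 0, 0],
})
p1 = estimate_params(toy, y_value=1)
print(p1[0])  # estimates of P(X_1 = j | Y = 1) for j = 1..5
```

Each row of the returned array sums to 1, since the levels \(j = 1,\dots,5\) are exhaustive.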
(b)#
Using the next 3000 test points, we will calculate \(P(Y = y \mid X_1 = x_1, \dots, X_6 = x_6)\).
From Bayes' theorem, we know that:
\[ P(Y = y \mid X_1 = x_1, \dots, X_6 = x_6) = \frac{P(X_1 = x_1, \dots, X_6 = x_6 \mid Y = y)\,P(Y = y)}{P(X_1 = x_1, \dots, X_6 = x_6)}. \]
We will predict an outcome as \(Y = 1\) if:
\[ P(Y = 1 \mid X_1 = x_1, \dots, X_6 = x_6) > P(Y = 0 \mid X_1 = x_1, \dots, X_6 = x_6), \]
which, since the denominator \(P(X_1 = x_1, \dots, X_6 = x_6)\) is the same on both sides, is equivalent to:
\[ P(X_1 = x_1, \dots, X_6 = x_6 \mid Y = 1)\,P(Y = 1) > P(X_1 = x_1, \dots, X_6 = x_6 \mid Y = 0)\,P(Y = 0). \]
If we make the additional "Naive Bayes" assumption that the feature values are independent of each other given the class, namely:
\[ P(X_1 = x_1, \dots, X_6 = x_6 \mid Y = y) = \prod_{i=1}^{6} P(X_i = x_i \mid Y = y), \]
then our formula simplifies further and we can use our calculations from part (a).
For the last 3000 entries of the data (from 6001 to 9000), predict the data as \(Y = 1\) or \(Y = 0\) by computing:
\[ P(Y = y)\,\prod_{i=1}^{6} P(X_i = x_i \mid Y = y) \]
for \(y = 1\) and \(y = 0\), and choosing \(Y = 1\) or \(Y = 0\) based on which value is larger. Evaluate this method of prediction by comparing to the actual values for \(Y\) given in the data set.
Your final output: Compute the number of correct predictions divided by the number of predictions (3000).
# TODO: Write your code below
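A self-contained sketch of the prediction rule: multiply the class prior by the six conditional probabilities and pick the class with the larger score. The function and parameter names here are illustrative, not from the assignment; the made-up parameter arrays stand in for the estimates you computed in part (a).

```python
import numpy as np

def naive_bayes_predict(x, params_y1, params_y0, prior_y1, prior_y0):
    """Predict Y for one 1x6 measurement vector x (values 1..5).
    params_y1[i-1, j-1] holds the estimate of P(X_i = j | Y = 1),
    and likewise params_y0 for Y = 0. Illustrative names only."""
    score1 = prior_y1
    score0 = prior_y0
    for i, xi in enumerate(x):
        score1 *= params_y1[i, xi - 1]
        score0 *= params_y0[i, xi - 1]
    return 1 if score1 > score0 else 0

# Toy check with made-up parameters: a uniform conditional for Y = 0
# and one that favors level 1 for Y = 1.
params_y0 = np.full((6, 5), 0.2)
params_y1 = np.tile([0.6, 0.1, 0.1, 0.1, 0.1], (6, 1))
print(naive_bayes_predict([1, 1, 1, 1, 1, 1], params_y1, params_y0, 0.5, 0.5))
```

For the final output, a line like `(predictions == actual).mean()` over the 3000 held-out rows gives the fraction of correct predictions.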
Exporting your work#
When you're ready, the easiest way to export the notebook is to File > Print
it and save it as a PDF.
First, remove any excessively long, unrelated outputs by clicking the arrow → next to the output box and selecting Show/hide output
.
Obviously don’t obscure any necessary output or graphs!