Parameter estimation - Your turn#
Note
Open this notebook in Colab (via the launch button at the top of the page) to enable interactivity.
Note
To save your progress, make a copy of this notebook in Colab (File > Save a copy in Drive); you'll find it in My Drive > Colab Notebooks.
Part 1#
A traffic engineer would like to make an assessment of traffic flow at a busy intersection during weekday rush hour. The number of arrivals is believed to satisfy the conditions for the Poisson distribution. The engineer records the number of cars arriving at the intersection between the hours of 8am and 9am over a period of two weeks (10 business days). Using the data in the table (in cars/hr), compute the maximum likelihood estimate and the \(95\%\) confidence interval for the average number of car arrivals.
| 240 | 225 | 300 | 280 | 215 | 205 | 275 | 320 | 210 | 240 |
Note: Unfortunately, the poisson
distribution in scipy.stats
does not have a fit()
method.
But is there a property of the Poisson distribution (and its parameter \(\lambda\)) that we can use to our advantage?
See Wikipedia for a hint.
# TODO: Write your code below
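One possible sketch: for a Poisson distribution, the maximum likelihood estimate of \(\lambda\) is the sample mean, and an approximate \(95\%\) confidence interval follows from the normal approximation \(\hat{\lambda} \pm z_{0.975}\sqrt{\hat{\lambda}/n}\). The data is hard-coded from the table above; treat this as one approach, not the only valid one.

```python
import numpy as np
from scipy import stats

# Arrival counts (cars/hr) from the table above
arrivals = np.array([240, 225, 300, 280, 215, 205, 275, 320, 210, 240])
n = len(arrivals)

# For a Poisson distribution, the MLE of lambda is the sample mean
lam_hat = arrivals.mean()

# Approximate 95% CI via the normal approximation:
#   lambda_hat +/- z * sqrt(lambda_hat / n)
z = stats.norm.ppf(0.975)
half_width = z * np.sqrt(lam_hat / n)

print(f"MLE of lambda: {lam_hat:.1f} cars/hr")
print(f"95% CI: ({lam_hat - half_width:.1f}, {lam_hat + half_width:.1f})")
```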
Part 2#
A product is tested for levels of different chemicals \(X_i\), where \(i = 1,2,3,4,5,6\). The value of each level is an integer ranging from \(1\) to \(5\). Depending on the observed levels of the chemicals, the product is either accepted (\(Y = 1\)) or not (\(Y = 0\)).
Given a \(1 \times 6\) input vector of measurements of the levels of chemicals and the output \(Y\), the parameters \(P(X_i = j \mid Y = 1)\) where \(i = 1,2,3,4,5,6\) and \(j = 1,2,3,4,5\) can be estimated analytically using maximum likelihood estimation. The idea behind maximum likelihood parameter estimation is to determine the parameters that maximize the probability (likelihood) of the sample data. In our problem, each event \(X_i = j \mid Y = 1\) has parameter \(p_{ij}\), which can be estimated as the ratio of the number of observations in which both \(Y = 1\) and \(X_i = j\) to the number of observations in which \(Y = 1\). We can do the same for \(Y = 0\).
The given data (see file classify_data.xlsx
) consists of 9000 tests of the kind described above.
Remember you can upload data to Colab by dragging it into the 📁 tab (tutorial).
We suggest you use the pandas package to read in the data with the following syntax:
import pandas as pd
df = pd.read_excel("classify_data.xlsx", header=0, skiprows=[1], usecols="B:H")
(a)#
Use the first 6000 tests to create a model for the given system, i.e., estimate the parameters \(P(X_i = j \mid Y = 1)\) and \(P(X_i = j \mid Y = 0)\) for \(i = 1,2,3,4,5,6\) and \(j = 1,2,3,4,5\).
# TODO: Write your code below
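A minimal sketch of the frequency-counting estimate described above. It assumes the DataFrame columns are named `X1` through `X6` and `Y` (adjust to match your file); the tiny synthetic frame is only there to make the example self-contained, since the real data would come from classify_data.xlsx.

```python
import numpy as np
import pandas as pd

def estimate_params(df, y_value):
    """Estimate P(X_i = j | Y = y_value) for i = 1..6, j = 1..5 as
    conditional relative frequencies. Entry [i-1, j-1] of the returned
    6x5 array is the estimate for P(X_i = j | Y = y_value).
    Column names X1..X6 and Y are assumed -- adjust to your file."""
    subset = df[df["Y"] == y_value]
    params = np.zeros((6, 5))
    for i in range(1, 7):
        for j in range(1, 6):
            # fraction of Y == y_value rows with X_i == j
            params[i - 1, j - 1] = (subset[f"X{i}"] == j).mean()
    return params

# Tiny synthetic example in place of the real 6000 training rows:
toy = pd.DataFrame({
    "X1": [1, 2, 1, 3], "X2": [5, 5, 4, 1], "X3": [2, 2, 2, 2],
    "X4": [1, 1, 3, 4], "X5": [5, 4, 3, 2], "X6": [1, 1, 1, 5],
    "Y":  [1, 1, 0, 0],
})
p1 = estimate_params(toy, y_value=1)
print(p1[0])  # estimates of P(X_1 = j | Y = 1) for j = 1..5
```

Each row of the returned array sums to 1, since the levels \(j = 1,\dots,5\) are exhaustive.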
(b)#
Using the next 3000 test points, we will calculate \(P(Y = y \mid X_1 = x_1, \dots, X_6 = x_6)\).
From Bayes' theorem, we know that:
\[ P(Y = y \mid X_1 = x_1, \dots, X_6 = x_6) = \frac{P(X_1 = x_1, \dots, X_6 = x_6 \mid Y = y)\,P(Y = y)}{P(X_1 = x_1, \dots, X_6 = x_6)}. \]
We will predict an outcome as \(Y = 1\) if:
\[ P(Y = 1 \mid X_1 = x_1, \dots, X_6 = x_6) > P(Y = 0 \mid X_1 = x_1, \dots, X_6 = x_6), \]
which, since the denominator \(P(X_1 = x_1, \dots, X_6 = x_6)\) is the same on both sides, is equivalent to:
\[ P(X_1 = x_1, \dots, X_6 = x_6 \mid Y = 1)\,P(Y = 1) > P(X_1 = x_1, \dots, X_6 = x_6 \mid Y = 0)\,P(Y = 0). \]
If we make the additional "Naive Bayes" assumption that the feature values are independent of each other given the class, namely:
\[ P(X_1 = x_1, \dots, X_6 = x_6 \mid Y = y) = \prod_{i=1}^{6} P(X_i = x_i \mid Y = y), \]
then our formula simplifies further and we can use our calculations from part (a).
For the last 3000 entries of the data (from 6001 to 9000), predict the data as \(Y = 1\) or \(Y = 0\) by computing:
\[ P(Y = y)\,\prod_{i=1}^{6} P(X_i = x_i \mid Y = y) \]
for \(y = 1\) and \(y = 0\), and choosing \(Y = 1\) or \(Y = 0\) based on which value is larger. Evaluate this method of prediction by comparing to the actual values for \(Y\) given in the data set.
Your final output: Compute the number of correct predictions divided by the number of predictions (3000).
# TODO: Write your code below
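A self-contained sketch of the prediction rule: multiply the class prior by the six conditional probabilities and pick the class with the larger score. The function and parameter names here are illustrative, not from the assignment; the made-up parameter arrays stand in for the estimates you computed in part (a).

```python
import numpy as np

def naive_bayes_predict(x, params_y1, params_y0, prior_y1, prior_y0):
    """Predict Y for one 1x6 measurement vector x (values 1..5).
    params_y1[i-1, j-1] holds the estimate of P(X_i = j | Y = 1),
    and likewise params_y0 for Y = 0. Illustrative names only."""
    score1 = prior_y1
    score0 = prior_y0
    for i, xi in enumerate(x):
        score1 *= params_y1[i, xi - 1]
        score0 *= params_y0[i, xi - 1]
    return 1 if score1 > score0 else 0

# Toy check with made-up parameters: a uniform conditional for Y = 0
# and one that favors level 1 for Y = 1.
params_y0 = np.full((6, 5), 0.2)
params_y1 = np.tile([0.6, 0.1, 0.1, 0.1, 0.1], (6, 1))
print(naive_bayes_predict([1, 1, 1, 1, 1, 1], params_y1, params_y0, 0.5, 0.5))
```

For the final output, a line like `(predictions == actual).mean()` over the 3000 held-out rows gives the fraction of correct predictions.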
Exporting your work#
When you're ready, the easiest way to export the notebook is to File > Print
it and save it as a PDF.
First, remove any excessively long, unrelated outputs by clicking the arrow → next to the output box and selecting Show/hide output
.
Obviously don’t obscure any necessary output or graphs!