More practice with ML models

Authors: Enze Chen and Mark Asta (University of California, Berkeley)

Note

This is an interactive exercise, so you will want to click the launch button at the top and open the notebook in DataHub (or Colab for non-UCB students).

Learning objectives

This notebook contains a series of exercises that will give you more practice building ML models in scikit-learn.

Contents

This notebook has the following sections.

  1. Basic ML setup

  2. ML for screening

  3. More practice

Import Python packages

Please remember to run the following cell before continuing!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score

plt.rcParams.update({'figure.figsize':(8,6),       # Increase figure size
                     'font.size':20,               # Increase font size
                     'mathtext.fontset':'cm',      # Change math font to Computer Modern
                     'mathtext.rm':'serif',        # Use a serif font for regular math text
                     'lines.linewidth':5,          # Thicker plot lines
                     'lines.markersize':12,        # Larger plot points
                     'axes.linewidth':2,           # Thicker axes lines (but not too thick)
                     'xtick.major.size':8,         # Make the x-ticks longer (our plot is larger!)
                     'xtick.major.width':2,        # Make the x-ticks wider
                     'ytick.major.size':8,         # Ditto for y-ticks
                     'ytick.major.width':2,        # Ditto for y-ticks
                     'xtick.direction':'in', 
                     'ytick.direction':'in'})

Basic ML setup

Back to top

Gather some data

This is always the first step. For the sake of demonstration, let’s use the hardness-density dataset from the previous tutorials. We will use the pandas package to help us load the data into a DataFrame, taking care to skip the first row.

hd = pd.read_csv('../../assets/data/hardness_density.csv', skiprows=1)
hd.head()
     Element  Number  Mohs hardness  Density (g/cc)
0    lithium       3            0.6           0.534
1  beryllium       4            5.5           1.850
2      boron       5            9.4           2.340
3     carbon       6           10.0           3.513
4     sodium      11            0.5           0.968

Exercise: Construct a linear model that predicts the Mohs hardness using the atomic number

You can use the entire dataset as the training set and the test set. Don’t forget to compute an error (in this case, training error) to assess model performance!

X = hd[['Number']]
y = hd['Mohs hardness']
# -------------   WRITE YOUR CODE IN THE SPACE BELOW   ---------- #
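
One possible solution sketch, reusing the X and y defined above (your approach may differ):

# Fit a linear model on the full dataset (training set == test set here)
model = LinearRegression()
model.fit(X, y)

# Training error: RMSE between the predicted and true hardness values
y_pred = model.predict(X)
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(f'Training RMSE: {rmse:.3f}')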

Exercise: Now compute the ratio of the RMSE to the GTME

Is your model a good model? Is this what you would expect?

# -------------   WRITE YOUR CODE IN THE SPACE BELOW   ---------- #
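
A minimal sketch, assuming GTME refers to the guess-the-mean error from the earlier tutorials (the RMSE of a baseline that always predicts the mean of y) and reusing the rmse from the sketch above:

# GTME: the RMSE of a baseline model that always guesses the mean of y
gtme = np.sqrt(mean_squared_error(y, np.full(len(y), y.mean())))

# A ratio near 1 means the model barely beats guessing the mean
print(f'RMSE/GTME: {rmse / gtme:.3f}')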

Exercise: Repeat the previous two steps, but now use the density AND atomic number

# -------------   WRITE YOUR CODE IN THE SPACE BELOW   ---------- #
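
A sketch with both features; note the double brackets select a two-column DataFrame:

# Now use both atomic number and density as inputs
X2 = hd[['Number', 'Density (g/cc)']]
model2 = LinearRegression()
model2.fit(X2, y)

# Training RMSE and its ratio to the GTME computed above
y_pred2 = model2.predict(X2)
rmse2 = np.sqrt(mean_squared_error(y, y_pred2))
print(f'Training RMSE: {rmse2:.3f}, RMSE/GTME: {rmse2 / gtme:.3f}')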

Exercise: Construct a parity plot of these new results

This will allow you to see which materials the model is performing poorly on.

# -------------   WRITE YOUR CODE IN THE SPACE BELOW   ---------- #
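
A sketch of a parity plot for the two-feature model above; points on the dashed y = x line are predicted perfectly:

# Parity plot: predicted vs. true values
fig, ax = plt.subplots()
ax.scatter(y, y_pred2)

# Dashed y = x reference line spanning the data range
lims = [min(y.min(), y_pred2.min()), max(y.max(), y_pred2.max())]
ax.plot(lims, lims, 'k--', linewidth=2)
ax.set_xlabel('True Mohs hardness')
ax.set_ylabel('Predicted Mohs hardness')
plt.show()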

Use an ML model for screening

Back to top

So far, we haven’t explicitly shown you how a ML model can be used for screening purposes to search for materials with good properties. Let’s do that now, first with a picture that we’ve already shown:

[Figure: ML screening workflow]

The first step is to establish a model between a commonly-available material property and the dielectric constant. For the sake of demonstration, we’ll choose the band gap and use the existing dataset supplied by the Petousis paper.

diel = pd.read_csv('../../assets/data/dielectric_dataset.csv')
diel.head()
       mp-id  formula     n  band_gap  diel_total  diel_elec
0     mp-441    Rb2Te  1.86      1.88        6.23       3.44
1   mp-22881    CdCl2  1.78      3.52        6.73       3.16
2   mp-28013     MnI2  2.23      1.17       10.64       4.97
3  mp-567290      LaN  2.65      1.12       17.99       7.04
4  mp-560902     MnF2  1.53      2.87        7.12       2.35

Exercise: Using the dataset above, train an ML model that maps \(E_{\mathrm{g}}\) to \(\varepsilon\)

But unlike before, don’t make any predictions yet!

# -------------   WRITE YOUR CODE IN THE SPACE BELOW   ---------- #
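
A sketch of the training step, using linear regression with diel_total as the target (other model and target choices are equally valid):

# Inputs: band gap; outputs: total dielectric constant
X_diel = diel[['band_gap']]
y_diel = diel['diel_total']

# Fit the model, but hold off on predictions for now
screen_model = LinearRegression()
screen_model.fit(X_diel, y_diel)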

Exercise: Now gather some more data without values of \(\varepsilon\)

You should be able to do this yourself now! Think about what criteria are important here. Start small.

We’ve also scraped a subset of the Materials Project and saved it in the more_mp_materials.csv file. However, this dataset contains all types of materials, including metals, whose zero band gap would mislead the model. If you use this dataset, be sure to filter those out first.

# -------------   WRITE YOUR CODE IN THE SPACE BELOW   ---------- #
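
If you go the CSV route, a minimal sketch might look like the following; the file path is assumed to mirror the other datasets, and the band_gap column name is assumed to match the dielectric dataset, so verify both first:

# Load the scraped Materials Project data (path assumed to match the other files)
new_data = pd.read_csv('../../assets/data/more_mp_materials.csv')

# Keep only materials with a nonzero band gap, i.e., filter out the metals
# NOTE: the 'band_gap' column name is assumed; check new_data.columns first
candidates = new_data[new_data['band_gap'] > 0]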

Exercise: Make some predictions with the trained model

# -------------   WRITE YOUR CODE IN THE SPACE BELOW   ---------- #
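
A sketch that feeds the screened candidates into the model trained above (assuming the candidates DataFrame from the previous step):

# Predict the dielectric constant for each unlabeled candidate
eps_pred = screen_model.predict(candidates[['band_gap']])

# Rank the candidates by predicted dielectric constant, highest first
candidates = candidates.assign(diel_pred=eps_pred)
print(candidates.sort_values('diel_pred', ascending=False).head())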

Challenge

But now, without a ground truth to compare against, we can’t really calculate an error or make a parity plot. How can we make sense of our predictions and judge whether our model is believable? Is this model any good? 🧐


More practice

Back to top

Exercise: Use the dataset of elemental properties to do some more modeling of your choosing

The file is elem_props.csv located in the same place as the other data. Try to do one regression problem and one classification problem! Also, instead of evaluating training error, do \(k\)-fold CV!

Hints:

  • After loading the data, first narrow it down to a subset of columns comprising your inputs and outputs.

  • From your subset DataFrame, you may have to remove rows with NaN values.

# -------------   WRITE YOUR CODE IN THE SPACE BELOW   ---------- #
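
To get you started, here is a sketch of the k-fold CV workflow for the regression half; the column names below ('atomic_mass' and 'melting_point') are hypothetical placeholders, so swap in columns that actually exist in elem_props.csv:

elem = pd.read_csv('../../assets/data/elem_props.csv')

# Narrow down to inputs and outputs, then drop rows with NaN values
# NOTE: 'atomic_mass' and 'melting_point' are placeholder column names!
sub = elem[['atomic_mass', 'melting_point']].dropna()
X_el = sub[['atomic_mass']]
y_el = sub['melting_point']

# 5-fold CV: scores come back as negative RMSE, so flip the sign to report
kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X_el, y_el,
                         cv=kf, scoring='neg_root_mean_squared_error')
print(f'Mean CV RMSE: {-scores.mean():.3f}')

For the classification half, the same pattern works with LogisticRegression (already imported above) and scoring='accuracy'; the confusion_matrix function from the imports cell is handy for inspecting the results.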

Exercise: Alternatively, to stay on topic, you can try using your own dielectric data and a different input to predict \(\varepsilon\)!
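
For example, the refractive index n in the dataset above is one such input (physically, the electronic dielectric constant satisfies \(\varepsilon_{\mathrm{elec}} \approx n^2\), so a strong relationship is expected). A minimal sketch:

# Use the refractive index n (instead of the band gap) to predict diel_elec
Xn = diel[['n']]
yn = diel['diel_elec']

# 5-fold CV; try n**2 as the feature too, given the expected relationship
scores_n = cross_val_score(LinearRegression(), Xn, yn, cv=5,
                           scoring='neg_root_mean_squared_error')
print(f'Mean CV RMSE from n: {-scores_n.mean():.3f}')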