Databases
Up until now, we’ve focused a lot on files and how different file types are suitable for different kinds of materials data and informatics applications. For an individual user working in a particular domain, files are familiar, friendly to work with, and often the most sensible choice. But in addition to files, another common source of data is a database, which consists of structured information controlled by a management system. Given the rapid development of several materials-related databases and your potential need to pull data from these sources, we’ll spend some time in this lesson discussing their purpose and usage, with a particular focus on a local product: the Materials Project.
Why might we use a database?
It is likely that many of you have never used a database before (that’s OK!), so let’s start by discussing why a database is even necessary. When we talk about structured data in databases, what you should picture in your head is not something like Google Drive that stores a collection of files, but rather a platform that stores a collection of pandas DataFrame-esque objects (the actual DataFrames, not CSV files of that data). This is a crude analogy, but one that relates databases to something you already have experience with. And just as you found DataFrames easy to work with thanks to their indexing capabilities, databases too are indexed in a way that facilitates data retrieval (requests for data are called queries), slicing, and more. The other details can be saved for a dedicated course like CS W186, but some general reasons for using a database include:
You have too much data to sensibly organize into files. Besides, we saw that every file type has limitations in its expressiveness, and we can’t just store everything in binary files because then we have no idea what’s in them until they’re opened by the appropriate software.
You need to scale up your operations to many users accessing multiple pieces of data concurrently.
You need a standardized, flexible, and fast data access pattern that doesn’t just involve combing through files, particularly if the data are spread in bits and pieces across many different “files” uploaded to the database at different times, by different users, etc.
You need to enforce permissions and other security protocols that go beyond “who has access to which files.”
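To make the DataFrame analogy a bit more concrete, here is a minimal sketch of what a query can look like, using Python's built-in `sqlite3` module with a temporary in-memory database. The table name `materials`, the column names, and all of the values are made up for illustration (this is not real materials data, and it is not how the Materials Project is implemented).

```python
import sqlite3

# Toy records with made-up values, purely for illustration.
records = [
    ("sample-1", "AB", 0.0),
    ("sample-2", "AB2", 1.5),
    ("sample-3", "A2B3", 3.2),
    ("sample-4", "ABC3", 2.1),
]

# Create a temporary database that lives only in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE materials (sample_id TEXT PRIMARY KEY, formula TEXT, band_gap REAL)"
)
conn.executemany("INSERT INTO materials VALUES (?, ?, ?)", records)

# An index on a column can let the database locate matching rows
# without scanning the whole table.
conn.execute("CREATE INDEX idx_band_gap ON materials (band_gap)")

# The query: roughly the database counterpart of the pandas expression
# df.loc[df["band_gap"] > 2.0, ["sample_id", "band_gap"]]
db_hits = conn.execute(
    "SELECT sample_id, band_gap FROM materials WHERE band_gap > ? ORDER BY sample_id",
    (2.0,),
).fetchall()
conn.close()

print(db_hits)  # [('sample-3', 3.2), ('sample-4', 2.1)]
```

Notice that the query only describes what data we want, not how to find it. The management system decides how to retrieve the rows, which is part of what lets a database serve many users and large datasets.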
The Materials Project
We’ll begin our discussion of the Materials Project with these slides:
Exploring the UI
This will be done as part of the above discussion.
Exploring the API
Please see the accompanying Jupyter notebook.
References
Here are some more papers if you’re interested in the technical details behind the Materials Project [2, 3, 4].
[1] Lauri Himanen, Amber Geurts, Adam Stuart Foster, and Patrick Rinke. Data-driven materials science: Status, challenges, and perspectives. Advanced Science, 6(21):1900808, 2019. doi:10.1002/advs.201900808.
[2] Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, and Kristin A. Persson. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1):011002, 2013. doi:10.1063/1.4812323.
[3] Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. Python Materials Genomics (Pymatgen): A robust, open-source Python library for materials analysis. Computational Materials Science, 68:314–319, 2013. doi:10.1016/j.commatsci.2012.10.028.
[4] Shyue Ping Ong, Shreyas Cholia, Anubhav Jain, Miriam Brafman, Dan Gunter, Gerbrand Ceder, and Kristin A. Persson. The Materials Application Programming Interface (API): A simple, flexible and efficient API for materials data based on REpresentational State Transfer (REST) principles. Computational Materials Science, 97:209–215, 2015. doi:10.1016/j.commatsci.2014.10.037.