This textbook was written for the clinical research community at Johns Hopkins leveraging the precision medicine analytics platform (PMAP). These notebooks are available in html form on the Precision Medicine portal as well as in computational form in the CAMP-share folder on Crunchr (Crunchr.pm.jh.edu).

Exploratory data analysis and visualization of EMR data

Learning Objectives

  1. Learn how to discover what tables are avaiable in your database
  2. Use pandas to transform a dataset for analysis
  3. Use pandas to visualize a patient's timeline
  4. Use pandas to visualize a population of blood pressure measurements

Table of contents

Connect to PMAP Database server

Map out the tables in the dataset

Relational database servers have a place to store information about the tables such as the names of tables and names and data* of the columns. They store this as a special table name INFORMATION_SCHEMA.TABLES. Here is a simple query to see what tables are available in the dataset.

Note: If you get an error on your first database query, first check to make sure you typed in your password correctly

Exploratory Data Analysis (EDA)

Exploratory data analysis is great way to get a better understanding of a dataset. Visualing data helps you better assess the underlying distribution of data as well as quickly identifiying outliers and data quality isssues.

Here are useful links to learn more about this crucial step in data science.

Blood Pressure Exploratory Data Analysis

We are going to do some exploration of the blood pressure vitals table. There are over 1 Million records in this table alone! Let's find the patient identifier with the most number of blood pressure measurements taken and explore their record. In PMAP we use the osler_id as the participant identifier which is unique which each research projection.

Transforming the blood pressure data

Blood pressure data is stored in the database in a string format "120/80" representing the systolic and diastolic pressures. In the lines below we are going to create a dataset for our data.

Re-indexing the dataset

Pandas provides powerful tools for working with dates. Making a unique datefield as the index of dataset makes for faster temporal analysis as well as provides great formating for graphing functions. Here are the steps to covert the date field to the index.

See how it changes the head function?

indexing a dataframe with a date range

Now you can grab a date range of the dataframe directly with the index

Plotting a timeline of measurements

Pandas is a data manipulation library that is built on top of matplotlib for graphing of data as well as numpy which is an array manipulation library. To plot a column of measurements, it can be done directly in by naming the dataframe and series followed by the plot() command.

Matplotlib library of graphing functions

matplotlib has an entire gallery of graphing objects you can call including.

Plotting multiple series

Each graph has many variables you can set as well as adding additional data sources.

In this example we combine the systolic and diastolic measurements on the same plot along with a title and a legend.

Looking across patients

Let's go back to the original dataframe to look across patients. Here we are grabbing only the first 2K records. Graphing 2K points is an arbitrary limit on how many points to graph at one time.

Histogram of systolic blood pressure values

Scatter plots with regression lines

There is great visualization library built on top of matplotlib call Seaborn. Seaborn was designed to combine statistical analysis with advanced visualization.

Scatter plots of residuals

A great way to look for outliers and non linear effects

Joint plots combining scatter with histograms

A great way to look for outliers and non linear effects