This textbook was written for the clinical research community at Johns Hopkins using the Precision Medicine Analytics Platform (PMAP). These notebooks are available in HTML form on the Precision Medicine portal as well as in computational form in the CAMP-share folder on Crunchr (Crunchr.pm.jh.edu).
#Libraries needed
import os
import sqlalchemy
import urllib.parse
import pandas as pd
import getpass
from SciServer import Authentication
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
%xmode Plain
%matplotlib inline
myUserName = Authentication.getKeystoneUserWithToken(Authentication.getToken()).userName
passwd = getpass.getpass('Password for ' + myUserName + ': ')
user = "win\\" + myUserName
Exception reporting mode: Plain
Password for pnagy2: ········
#SQL Driver
driver="FreeTDS"
tds_ver="8.0"
# Database
host_ip="ESMPMDBPR4.WIN.AD.JHU.EDU" # Update this accordingly
db_port="1433"
db="CAMP_PMCoe_Projection" # Update this accordingly
# Create Connection String
conn_str=("DRIVER={};Server={};PORT={};DATABASE={};UID={};PWD={};TDS_VERSION={}"
.format(driver, host_ip, db_port, db, user, passwd, tds_ver)
)
# Create Engine
engine = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect=' +
urllib.parse.quote(conn_str)
)
Relational database servers have a place to store information about their tables, such as the names of the tables and the names and data types of the columns. They store this in a special table named INFORMATION_SCHEMA.TABLES. Here is a simple query to see what tables are available in the dataset.
Note: If you get an error on your first database query, first check that you typed your password correctly.
# Replace this Query with one that is intended for the "projection" database of interest
query="SELECT * from INFORMATION_SCHEMA.TABLES;"
df_schema = pd.read_sql_query(query, engine)
df_schema
| | TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE |
|---|---|---|---|---|
| 0 | CAMP_PMCoE_Projection | dbo | patients | BASE TABLE |
| 1 | CAMP_PMCoE_Projection | dbo | encounters | BASE TABLE |
| 2 | CAMP_PMCoE_Projection | dbo | labs | BASE TABLE |
| 3 | CAMP_PMCoE_Projection | dbo | meds | BASE TABLE |
| 4 | CAMP_PMCoE_Projection | dbo | problemlist | BASE TABLE |
| 5 | CAMP_PMCoE_Projection | dbo | procedures | BASE TABLE |
| 6 | CAMP_PMCoE_Projection | dbo | symptoms | BASE TABLE |
| 7 | CAMP_PMCoE_Projection | dbo | vitals_BP | BASE TABLE |
| 8 | CAMP_PMCoE_Projection | dbo | vitals_height | BASE TABLE |
| 9 | CAMP_PMCoE_Projection | dbo | vitals_pulse | BASE TABLE |
| 10 | CAMP_PMCoE_Projection | dbo | vitals_respiration | BASE TABLE |
| 11 | CAMP_PMCoE_Projection | dbo | vitals_temperature | BASE TABLE |
| 12 | CAMP_PMCoE_Projection | dbo | vitals_weight | BASE TABLE |
| 13 | CAMP_PMCoE_Projection | dbo | sysdiagrams | BASE TABLE |
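The same approach can drill down from table names to column names and data types through INFORMATION_SCHEMA.COLUMNS. The sketch below builds such a query; running it requires the engine created above, and the table name vitals_BP is taken from the listing.

```python
# INFORMATION_SCHEMA.COLUMNS describes the columns of each table:
# their names, data types, and whether they allow NULLs.
query = ("SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE "
         "FROM INFORMATION_SCHEMA.COLUMNS "
         "WHERE TABLE_NAME = 'vitals_BP';")
# df_cols = pd.read_sql_query(query, engine)
```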
Exploratory data analysis is a great way to get a better understanding of a dataset. Visualizing data helps you better assess the underlying distribution of the data as well as quickly identify outliers and data quality issues. There are many useful resources for learning more about this crucial step in data science.
We are going to do some exploration of the blood pressure vitals table. There are over one million records in this table alone! Let's find the patient identifier with the largest number of blood pressure measurements taken and explore that record. In PMAP we use the osler_id as the participant identifier; it is unique within each research projection.
#Find the patient with the most blood pressure measurements
query="SELECT top 1 osler_id,count(*) FROM vitals_BP group by osler_id order by count(*) DESC;"
df_top = pd.read_sql_query(query, engine)
df_top.head()
| | osler_id | |
|---|---|---|
| 0 | 639afb20-84df-40c2-92f7-3812c2327428 | 2461 |
Blood pressure data is stored in the database as a string such as "120/80", representing the systolic and diastolic pressures. In the lines below we build a dataframe and split this string into numeric columns.
query
- return all the blood pressure measurements for this one patient, ordered by the date they were taken

df = pd.read_sql_query(query, engine)
- return the results into a dataframe object named df

df.iloc[:,1]
- use the pandas iloc function to return all the rows but only the second column of the dataset

df.iloc[:,1].str.split('/')
- take the second column and split the string into multiple strings using the "/" delimiter

df['systolic']=pd.to_numeric(df.iloc[:,1].str.split('/').str.get(0))
- assign the first string of the split to a new column named systolic

df['diastolic']=pd.to_numeric(df.iloc[:,1].str.split('/').str.get(1))
- assign the second string of the split to a new column named diastolic

#Find all the blood pressure measurements for this one patient
query= ("SELECT bp_date,bp_systolic_diastolic FROM vitals_BP"
" where osler_id='639afb20-84df-40c2-92f7-3812c2327428' order by bp_date;")
df = pd.read_sql_query(query, engine)
df['systolic']=pd.to_numeric(df.iloc[:,1].str.split('/').str.get(0))
df['diastolic']=pd.to_numeric(df.iloc[:,1].str.split('/').str.get(1))
df.head()
| | bp_date | bp_systolic_diastolic | systolic | diastolic |
|---|---|---|---|---|
| 0 | 2016-11-12 17:57:00 | 122/60 | 122.0 | 60.0 |
| 1 | 2016-11-12 18:51:00 | 139/69 | 139.0 | 69.0 |
| 2 | 2016-11-12 19:00:00 | 143/61 | 143.0 | 61.0 |
| 3 | 2016-11-12 19:30:00 | 127/61 | 127.0 | 61.0 |
| 4 | 2016-11-12 20:10:00 | 148/67 | 148.0 | 67.0 |
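The split-and-convert step above assumes every reading parses cleanly. Real vitals tables often contain malformed entries, one of the data quality issues exploratory analysis is meant to surface. A minimal sketch on made-up sample values shows how passing errors='coerce' to pd.to_numeric turns unparseable strings into NaN instead of raising an exception:

```python
import pandas as pd

# Hypothetical sample shaped like bp_systolic_diastolic, with one bad entry
bp = pd.Series(['122/60', '139/69', 'refused', '143/61'])

parts = bp.str.split('/')
# errors='coerce' converts unparseable values to NaN rather than raising
systolic = pd.to_numeric(parts.str.get(0), errors='coerce')
diastolic = pd.to_numeric(parts.str.get(1), errors='coerce')
```

The resulting NaN rows can then be counted, inspected, or dropped before plotting.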
Pandas provides powerful tools for working with dates. Making a unique date field the index of the dataset allows for faster temporal analysis and provides great formatting for graphing functions. Here are the steps to convert the date field to the index.
df['bp_date']=pd.to_datetime(df.bp_date)
- in the df dataframe, convert the bp_date column to a datetime type

df.set_index(['bp_date'],inplace=True)
- this changes the index to the bp_date field in place

See how it changes the head function?
df['bp_date']=pd.to_datetime(df.bp_date)
df.set_index(['bp_date'],inplace=True)
df.head()
| bp_date | bp_systolic_diastolic | systolic | diastolic |
|---|---|---|---|
| 2016-11-12 17:57:00 | 122/60 | 122.0 | 60.0 |
| 2016-11-12 18:51:00 | 139/69 | 139.0 | 69.0 |
| 2016-11-12 19:00:00 | 143/61 | 143.0 | 61.0 |
| 2016-11-12 19:30:00 | 127/61 | 127.0 | 61.0 |
| 2016-11-12 20:10:00 | 148/67 | 148.0 | 67.0 |
Now you can select a date range from the dataframe directly with the index.
#Changing the index to date allows you to slice the data by date ranges
df['2016-11-10':'2016-11-15']
| bp_date | bp_systolic_diastolic | systolic | diastolic |
|---|---|---|---|
| 2016-11-12 17:57:00 | 122/60 | 122.0 | 60.0 |
| 2016-11-12 18:51:00 | 139/69 | 139.0 | 69.0 |
| 2016-11-12 19:00:00 | 143/61 | 143.0 | 61.0 |
| 2016-11-12 19:30:00 | 127/61 | 127.0 | 61.0 |
| 2016-11-12 20:10:00 | 148/67 | 148.0 | 67.0 |
| 2016-11-12 20:30:00 | 144/66 | 144.0 | 66.0 |
| 2016-11-12 22:03:00 | 147/62 | 147.0 | 62.0 |
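A datetime index also unlocks pandas' resampling tools, one of the "faster temporal analysis" benefits mentioned above. As a sketch on made-up readings shaped like df, resample() aggregates measurements into calendar bins, here daily averages:

```python
import pandas as pd

# Hypothetical readings spanning two days, shaped like df above
df_demo = pd.DataFrame(
    {'systolic': [122.0, 139.0, 143.0, 127.0]},
    index=pd.to_datetime(['2016-11-12 17:57', '2016-11-12 18:51',
                          '2016-11-13 09:00', '2016-11-13 10:30']),
)
df_demo.index.name = 'bp_date'

# With a datetime index, resample('D') groups readings by calendar day
daily_mean = df_demo.systolic.resample('D').mean()
```

Other frequencies such as 'W' (weekly) or 'M' (monthly) work the same way.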
Pandas is a data manipulation library built on top of numpy, an array manipulation library, and it uses matplotlib for graphing data. To plot a column of measurements, name the dataframe and column followed by the plot() command.
#Simple line graph of the data
df.systolic.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb5f4871eb8>
matplotlib has an entire gallery of graphing objects you can call.
Each graph has many parameters you can set, and you can add additional data sources.
In this example we combine the systolic and diastolic measurements on the same plot along with a title and a legend.
#multiple line graphs
plt.plot(df.systolic)
plt.plot(df.diastolic)
plt.legend(['Systolic','Diastolic'],loc='upper left')
plt.title('Blood pressure measurements over time')
plt.show()
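Beyond extra data series, you can add reference lines to a plot. The sketch below, on made-up readings, uses axhline to mark the commonly cited 140/90 mmHg hypertension thresholds on the combined plot:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical readings standing in for df.systolic and df.diastolic
systolic = pd.Series([122.0, 139.0, 143.0, 127.0])
diastolic = pd.Series([60.0, 69.0, 61.0, 61.0])

fig, ax = plt.subplots()
ax.plot(systolic, label='Systolic')
ax.plot(diastolic, label='Diastolic')
# Horizontal reference lines at the 140/90 mmHg thresholds
ax.axhline(140, color='red', linestyle='--', label='Systolic threshold (140)')
ax.axhline(90, color='orange', linestyle='--', label='Diastolic threshold (90)')
ax.legend(loc='upper left')
ax.set_title('Blood pressure with reference thresholds')
```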
Let's go back to the original table to look across patients. Here we are grabbing only the first 2,000 records; that limit is arbitrary and simply keeps the number of points plotted at one time manageable.
#Grab the first 2,000 blood pressure measurements across all patients
query= ("SELECT TOP 2000 * FROM vitals_BP ;")
df2 = pd.read_sql_query(query, engine)
df2['systolic']=pd.to_numeric(df2.loc[:,'bp_systolic_diastolic'].str.split('/').str.get(0))
df2['diastolic']=pd.to_numeric(df2.loc[:,'bp_systolic_diastolic'].str.split('/').str.get(1))
df2.head()
| | osler_id | encounter_id | encounter_type | admission_date | discharge_date | bp_systolic_diastolic | bp_date | systolic | diastolic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 4225 | Office Visit | 2015-12-28 | NaT | 137/80 | 2015-12-28 13:39:00 | 137.0 | 80.0 |
| 1 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 396743 | Office Visit | 2017-02-24 | NaT | 134/67 | 2017-02-24 11:45:00 | 134.0 | 67.0 |
| 2 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 520283 | Office Visit | 2017-05-13 | NaT | 117/66 | 2017-05-13 10:53:00 | 117.0 | 66.0 |
| 3 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 151206 | Office Visit | 2016-05-02 | NaT | 124/69 | 2016-05-02 15:29:00 | 124.0 | 69.0 |
| 4 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 122484 | Office Visit | 2016-07-02 | NaT | 126/80 | 2016-07-02 13:25:00 | 126.0 | 80.0 |
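Before plotting, it can help to summarize df2 by category. A minimal sketch on made-up rows shaped like df2 (the 'Hospital Encounter' value is hypothetical) shows value_counts() for the mix of encounter types and groupby() for comparing readings across them:

```python
import pandas as pd

# Hypothetical rows shaped like df2; the encounter_type values are made up
df2_demo = pd.DataFrame({
    'encounter_type': ['Office Visit', 'Office Visit', 'Hospital Encounter'],
    'systolic': [137.0, 134.0, 117.0],
})

# How many records of each encounter type are in the sample
counts = df2_demo.encounter_type.value_counts()
# Mean systolic pressure within each encounter type
means = df2_demo.groupby('encounter_type').systolic.mean()
```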
#plotting a histogram of blood pressure measurements
df2.systolic.plot(kind='hist',grid=False,bins=20,title='Systolic Blood Pressure Histogram')
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a993e5940>
#Scatter graph of the systolic vs diastolic pressure
df2.plot(kind='scatter',x='diastolic',y='systolic',alpha=0.5,s=2,title='systolic vs diastolic pressure')
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a98f47cf8>
#A scatter graph with a linear fit of the data plotted
fig,ax=plt.subplots()
ax=sns.regplot(data=df2,x='diastolic',y='systolic',scatter_kws={'alpha':0.1})
#A subtraction of the linear fit for a residual plot
sns.residplot(data=df2,x='diastolic',y='systolic',scatter_kws={'alpha':0.1})
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a98f395f8>
A residual plot is a great way to look for outliers and non-linear effects.
# A joint plot of the scatter graph with a linear fit and side histograms of the data
sns.jointplot(data=df2,x='diastolic',y='systolic',kind='reg',scatter_kws={'alpha':0.1})
<seaborn.axisgrid.JointGrid at 0x7f9a945fe5c0>