This textbook was written for the clinical research community at Johns Hopkins using the Precision Medicine Analytics Platform (PMAP). These notebooks are available in HTML form on the Precision Medicine portal as well as in computational form in the CAMP-share folder on Crunchr (Crunchr.pm.jh.edu).
#Libraries needed
import os
import sqlalchemy
import urllib.parse
import pandas as pd
import getpass
from SciServer import Authentication
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
%xmode Plain
%matplotlib inline
myUserName = Authentication.getKeystoneUserWithToken(Authentication.getToken()).userName
passwd = getpass.getpass('Password for ' + myUserName + ': ')
user = "win\\" + myUserName
Exception reporting mode: Plain
Password for pnagy2: ········
#SQL Driver
driver="FreeTDS"
tds_ver="8.0"
# Database
host_ip="ESMPMDBPR4.WIN.AD.JHU.EDU" # Update this accordingly
db_port="1433"
db="CAMP_PMCoe_Projection" # Update this accordingly
# Create Connection String
conn_str=("DRIVER={};Server={};PORT={};DATABASE={};UID={};PWD={};TDS_VERSION={}"
.format(driver, host_ip, db_port, db, user, passwd, tds_ver)
)
# Create Engine
engine = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect=' +
urllib.parse.quote(conn_str)
)
Relational database servers have a place to store information about their tables, such as the names of the tables and the names and data types of the columns. They store this in a special table named INFORMATION_SCHEMA.TABLES. Here is a simple query to see what tables are available in the dataset.
Note: If you get an error on your first database query, first check that you typed your password correctly.
# Replace this Query with one that is intended for the "projection" database of interest
query="SELECT * from INFORMATION_SCHEMA.TABLES;"
df_schema = pd.read_sql_query(query, engine)
df_schema
| | TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE |
|---|---|---|---|---|
| 0 | CAMP_PMCoE_Projection | dbo | patients | BASE TABLE |
| 1 | CAMP_PMCoE_Projection | dbo | encounters | BASE TABLE |
| 2 | CAMP_PMCoE_Projection | dbo | labs | BASE TABLE |
| 3 | CAMP_PMCoE_Projection | dbo | meds | BASE TABLE |
| 4 | CAMP_PMCoE_Projection | dbo | problemlist | BASE TABLE |
| 5 | CAMP_PMCoE_Projection | dbo | procedures | BASE TABLE |
| 6 | CAMP_PMCoE_Projection | dbo | symptoms | BASE TABLE |
| 7 | CAMP_PMCoE_Projection | dbo | vitals_BP | BASE TABLE |
| 8 | CAMP_PMCoE_Projection | dbo | vitals_height | BASE TABLE |
| 9 | CAMP_PMCoE_Projection | dbo | vitals_pulse | BASE TABLE |
| 10 | CAMP_PMCoE_Projection | dbo | vitals_respiration | BASE TABLE |
| 11 | CAMP_PMCoE_Projection | dbo | vitals_temperature | BASE TABLE |
| 12 | CAMP_PMCoE_Projection | dbo | vitals_weight | BASE TABLE |
| 13 | CAMP_PMCoE_Projection | dbo | sysdiagrams | BASE TABLE |
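The same approach can drill down from table names to column names and data types through INFORMATION_SCHEMA.COLUMNS. The sketch below builds such a query; running it requires the engine created above, and the table name vitals_BP is taken from the listing.

```python
# INFORMATION_SCHEMA.COLUMNS describes the columns of each table:
# their names, data types, and whether they allow NULLs.
query = ("SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE "
         "FROM INFORMATION_SCHEMA.COLUMNS "
         "WHERE TABLE_NAME = 'vitals_BP';")
# df_cols = pd.read_sql_query(query, engine)
```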
Exploratory data analysis is a great way to get a better understanding of a dataset. Visualizing data helps you better assess the underlying distribution of the data as well as quickly identify outliers and data quality issues. There are many useful resources for learning more about this crucial step in data science.
We are going to do some exploration of the blood pressure vitals table. There are over one million records in this table alone! Let's find the patient identifier with the largest number of blood pressure measurements taken and explore that record. In PMAP we use the osler_id as the participant identifier; it is unique within each research projection.
#Find the patient with the most blood pressure measurements
query="SELECT top 1 osler_id,count(*) FROM vitals_BP group by osler_id order by count(*) DESC;"
df_top = pd.read_sql_query(query, engine)
df_top.head()
| | osler_id | |
|---|---|---|
| 0 | 639afb20-84df-40c2-92f7-3812c2327428 | 2461 |
Blood pressure data is stored in the database as a string such as "120/80", representing the systolic and diastolic pressures. In the lines below we build a dataframe and split this string into numeric columns.
query
- return all the blood pressure measurements for this one patient, ordered by the date they were taken

df = pd.read_sql_query(query, engine)
- return the results into a dataframe object named df

df.iloc[:,1]
- use the pandas iloc function to return all the rows but only the second column of the dataset

df.iloc[:,1].str.split('/')
- take the second column and split the string into multiple strings using the "/" delimiter

df['systolic']=pd.to_numeric(df.iloc[:,1].str.split('/').str.get(0))
- assign the first string of the split to a new column named systolic

df['diastolic']=pd.to_numeric(df.iloc[:,1].str.split('/').str.get(1))
- assign the second string of the split to a new column named diastolic

#Find all the blood pressure measurements for this one patient
query= ("SELECT bp_date,bp_systolic_diastolic FROM vitals_BP"
" where osler_id='639afb20-84df-40c2-92f7-3812c2327428' order by bp_date;")
df = pd.read_sql_query(query, engine)
df['systolic']=pd.to_numeric(df.iloc[:,1].str.split('/').str.get(0))
df['diastolic']=pd.to_numeric(df.iloc[:,1].str.split('/').str.get(1))
df.head()
| | bp_date | bp_systolic_diastolic | systolic | diastolic |
|---|---|---|---|---|
| 0 | 2016-11-12 17:57:00 | 122/60 | 122.0 | 60.0 |
| 1 | 2016-11-12 18:51:00 | 139/69 | 139.0 | 69.0 |
| 2 | 2016-11-12 19:00:00 | 143/61 | 143.0 | 61.0 |
| 3 | 2016-11-12 19:30:00 | 127/61 | 127.0 | 61.0 |
| 4 | 2016-11-12 20:10:00 | 148/67 | 148.0 | 67.0 |
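The split-and-convert step above assumes every reading parses cleanly. Real vitals tables often contain malformed entries, one of the data quality issues exploratory analysis is meant to surface. A minimal sketch on made-up sample values shows how passing errors='coerce' to pd.to_numeric turns unparseable strings into NaN instead of raising an exception:

```python
import pandas as pd

# Hypothetical sample shaped like bp_systolic_diastolic, with one bad entry
bp = pd.Series(['122/60', '139/69', 'refused', '143/61'])

parts = bp.str.split('/')
# errors='coerce' converts unparseable values to NaN rather than raising
systolic = pd.to_numeric(parts.str.get(0), errors='coerce')
diastolic = pd.to_numeric(parts.str.get(1), errors='coerce')
```

The resulting NaN rows can then be counted, inspected, or dropped before plotting.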
Pandas provides powerful tools for working with dates. Making a unique date field the index of the dataset allows for faster temporal analysis and provides great formatting for graphing functions. Here are the steps to convert the date field to the index.
df['bp_date']=pd.to_datetime(df.bp_date)
- in the df dataframe, convert the bp_date column to a datetime type

df.set_index(['bp_date'],inplace=True)
- this changes the index to the bp_date field in place

See how it changes the head function?
df['bp_date']=pd.to_datetime(df.bp_date)
df.set_index(['bp_date'],inplace=True)
df.head()
| bp_date | bp_systolic_diastolic | systolic | diastolic |
|---|---|---|---|
| 2016-11-12 17:57:00 | 122/60 | 122.0 | 60.0 |
| 2016-11-12 18:51:00 | 139/69 | 139.0 | 69.0 |
| 2016-11-12 19:00:00 | 143/61 | 143.0 | 61.0 |
| 2016-11-12 19:30:00 | 127/61 | 127.0 | 61.0 |
| 2016-11-12 20:10:00 | 148/67 | 148.0 | 67.0 |
Now you can select a date range from the dataframe directly with the index.
#Changing the index to date allows you to slice the data by date ranges
df['2016-11-10':'2016-11-15']
| bp_date | bp_systolic_diastolic | systolic | diastolic |
|---|---|---|---|
| 2016-11-12 17:57:00 | 122/60 | 122.0 | 60.0 |
| 2016-11-12 18:51:00 | 139/69 | 139.0 | 69.0 |
| 2016-11-12 19:00:00 | 143/61 | 143.0 | 61.0 |
| 2016-11-12 19:30:00 | 127/61 | 127.0 | 61.0 |
| 2016-11-12 20:10:00 | 148/67 | 148.0 | 67.0 |
| 2016-11-12 20:30:00 | 144/66 | 144.0 | 66.0 |
| 2016-11-12 22:03:00 | 147/62 | 147.0 | 62.0 |
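A datetime index also unlocks pandas' resampling tools, one of the "faster temporal analysis" benefits mentioned above. As a sketch on made-up readings shaped like df, resample() aggregates measurements into calendar bins, here daily averages:

```python
import pandas as pd

# Hypothetical readings spanning two days, shaped like df above
df_demo = pd.DataFrame(
    {'systolic': [122.0, 139.0, 143.0, 127.0]},
    index=pd.to_datetime(['2016-11-12 17:57', '2016-11-12 18:51',
                          '2016-11-13 09:00', '2016-11-13 10:30']),
)
df_demo.index.name = 'bp_date'

# With a datetime index, resample('D') groups readings by calendar day
daily_mean = df_demo.systolic.resample('D').mean()
```

Other frequencies such as 'W' (weekly) or 'M' (monthly) work the same way.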
Pandas is a data manipulation library built on top of numpy, an array manipulation library, and it uses matplotlib for graphing data. To plot a column of measurements, name the dataframe and column followed by the plot() command.
#Simple line graph of the data
df.systolic.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb5f4871eb8>
matplotlib has an entire gallery of graphing objects you can call.
Each graph has many parameters you can set, and you can add additional data sources.
In this example we combine the systolic and diastolic measurements on the same plot along with a title and a legend.
#multiple line graphs
plt.plot(df.systolic)
plt.plot(df.diastolic)
plt.legend(['Systolic','Diastolic'],loc='upper left')
plt.title('Blood pressure measurements over time')
plt.show()
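Beyond extra data series, you can add reference lines to a plot. The sketch below, on made-up readings, uses axhline to mark the commonly cited 140/90 mmHg hypertension thresholds on the combined plot:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical readings standing in for df.systolic and df.diastolic
systolic = pd.Series([122.0, 139.0, 143.0, 127.0])
diastolic = pd.Series([60.0, 69.0, 61.0, 61.0])

fig, ax = plt.subplots()
ax.plot(systolic, label='Systolic')
ax.plot(diastolic, label='Diastolic')
# Horizontal reference lines at the 140/90 mmHg thresholds
ax.axhline(140, color='red', linestyle='--', label='Systolic threshold (140)')
ax.axhline(90, color='orange', linestyle='--', label='Diastolic threshold (90)')
ax.legend(loc='upper left')
ax.set_title('Blood pressure with reference thresholds')
```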
Let's go back to the original table to look across patients. Here we are grabbing only the first 2,000 records; that limit is arbitrary and simply keeps the number of points plotted at one time manageable.
#Grab the first 2,000 blood pressure measurements across all patients
query= ("SELECT TOP 2000 * FROM vitals_BP ;")
df2 = pd.read_sql_query(query, engine)
df2['systolic']=pd.to_numeric(df2.loc[:,'bp_systolic_diastolic'].str.split('/').str.get(0))
df2['diastolic']=pd.to_numeric(df2.loc[:,'bp_systolic_diastolic'].str.split('/').str.get(1))
df2.head()
| | osler_id | encounter_id | encounter_type | admission_date | discharge_date | bp_systolic_diastolic | bp_date | systolic | diastolic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 4225 | Office Visit | 2015-12-28 | NaT | 137/80 | 2015-12-28 13:39:00 | 137.0 | 80.0 |
| 1 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 396743 | Office Visit | 2017-02-24 | NaT | 134/67 | 2017-02-24 11:45:00 | 134.0 | 67.0 |
| 2 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 520283 | Office Visit | 2017-05-13 | NaT | 117/66 | 2017-05-13 10:53:00 | 117.0 | 66.0 |
| 3 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 151206 | Office Visit | 2016-05-02 | NaT | 124/69 | 2016-05-02 15:29:00 | 124.0 | 69.0 |
| 4 | 564bd039-e957-4d12-8188-39d9ad158ee6 | 122484 | Office Visit | 2016-07-02 | NaT | 126/80 | 2016-07-02 13:25:00 | 126.0 | 80.0 |
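Before plotting, it can help to summarize df2 by category. A minimal sketch on made-up rows shaped like df2 (the 'Hospital Encounter' value is hypothetical) shows value_counts() for the mix of encounter types and groupby() for comparing readings across them:

```python
import pandas as pd

# Hypothetical rows shaped like df2; the encounter_type values are made up
df2_demo = pd.DataFrame({
    'encounter_type': ['Office Visit', 'Office Visit', 'Hospital Encounter'],
    'systolic': [137.0, 134.0, 117.0],
})

# How many records of each encounter type are in the sample
counts = df2_demo.encounter_type.value_counts()
# Mean systolic pressure within each encounter type
means = df2_demo.groupby('encounter_type').systolic.mean()
```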
#plotting a histogram of blood pressure measurements
df2.systolic.plot(kind='hist',grid=False,bins=20,title='Systolic Blood Pressure Histogram')
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a993e5940>
#Scatter graph of the systolic vs diastolic pressure
df2.plot(kind='scatter',x='diastolic',y='systolic',alpha=0.5,s=2,title='systolic vs diastolic pressure')
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a98f47cf8>
#A scatter graph with a linear fit of the data plotted
fig,ax=plt.subplots()
ax=sns.regplot(data=df2,x='diastolic',y='systolic',scatter_kws={'alpha':0.1})
#A subtraction of the linear fit for a residual plot
sns.residplot(data=df2,x='diastolic',y='systolic',scatter_kws={'alpha':0.1})
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a98f395f8>
A residual plot is a great way to look for outliers and non-linear effects.
# A joint plot of the scatter graph with a linear fit and side histograms of the data
sns.jointplot(data=df2,x='diastolic',y='systolic',kind='reg',scatter_kws={'alpha':0.1})
<seaborn.axisgrid.JointGrid at 0x7f9a945fe5c0>