This textbook was written for the clinical research community at Johns Hopkins leveraging the precision medicine analytics platform (PMAP). These notebooks are available in HTML form on the Precision Medicine portal as well as in computational form in the CAMP-share folder on Crunchr (Crunchr.pm.jh.edu).

Statistical testing EMR Asthma example

Learning Objectives

  1. Learn to prepare data for statistical testing using Python packages and scripting
  2. Learn to perform a statistical test using a Python package

Table of contents

Libraries, Authentication, Database Connection

ANOVA : Are Male and Female Post-Discharge Appointment Timings the Same?

A simple hypothesis: males and females differ in the timing of follow-up office visits after a hospitalization, excluding encounters on the day of discharge.

This requires first extracting the post hospital discharge appointment days and gender from the EMR, and then conducting the Kruskal-Wallis test, a non-parametric alternative to one-way ANOVA. This statistical test is used to determine whether data samples come from the same distribution or not.

The first step after database connection is to explore the tables available. This section will use the patients and encounters tables only. The pandas package has a handy function for passing SQL queries that works readily with the engine object created in the database connection code block.
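As a minimal sketch of passing SQL through pandas, the example below uses an in-memory SQLite database as a stand-in for the PMAP connection. The connection object, the rows, and the exact column lists are assumptions for illustration; in the notebook itself the engine object from the connection step would take the place of conn.

```python
import sqlite3

import pandas as pd

# Stand-in for the PMAP database: an in-memory SQLite database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (osler_id TEXT, gender TEXT)")
conn.execute(
    "CREATE TABLE encounters (osler_id TEXT, encounter_type TEXT, encounter_date TEXT)"
)
# Made-up rows, purely for illustration.
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("A1", "Female"), ("B2", "Male")],
)
conn.executemany(
    "INSERT INTO encounters VALUES (?, ?, ?)",
    [
        ("A1", "Hospital Encounter", "2020-01-01"),
        ("A1", "Office Visit", "2020-01-10"),
        ("B2", "Office Visit", "2020-02-01"),
    ],
)
conn.commit()

# pandas sends the query and returns the result as a dataframe.
patients = pd.read_sql_query("SELECT osler_id, gender FROM patients", conn)
encounters = pd.read_sql_query(
    "SELECT osler_id, encounter_type, encounter_date FROM encounters",
    conn,
    parse_dates=["encounter_date"],  # convert the date strings to datetimes
)
print(patients.head())
print(encounters.head())
```

Parsing the dates at load time keeps the later date arithmetic simple.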

To get the data into a single dataframe, merge the patients and encounters dataframes created above on the osler_id column.
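A merge of this kind might look like the following sketch. The two small dataframes are made-up stand-ins for the query results, and the choice of an inner join is an assumption.

```python
import pandas as pd

# Made-up stand-ins for the patients and encounters query results.
patients = pd.DataFrame({"osler_id": ["A1", "B2"], "gender": ["Female", "Male"]})
encounters = pd.DataFrame(
    {
        "osler_id": ["A1", "A1", "B2"],
        "encounter_type": ["Hospital Encounter", "Office Visit", "Office Visit"],
        "encounter_date": pd.to_datetime(["2020-01-01", "2020-01-10", "2020-02-01"]),
    }
)

# Attach each patient's gender to every one of their encounters.
encdemo = encounters.merge(patients, on="osler_id", how="inner")
print(encdemo)
```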

This example will only be testing a simple hypothesis about the time gap between Office Visit and Hospital Encounter. A fully realized exploration would require filtering and re-classifying all of the encounter_type values in the EMR data, listed below by frequency.
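One way to list the encounter_type values by frequency is value_counts(), sketched here on made-up data. The 'Telephone' value is a hypothetical example and not necessarily a category in the real EMR.

```python
import pandas as pd

# Made-up encounter types; 'Telephone' is a hypothetical category.
encdemo = pd.DataFrame(
    {
        "encounter_type": [
            "Office Visit",
            "Hospital Encounter",
            "Office Visit",
            "Telephone",
            "Office Visit",
            "Hospital Encounter",
        ]
    }
)

# Count how often each encounter type occurs, most frequent first.
counts = encdemo["encounter_type"].value_counts()
print(counts)
```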

Identifying and computing relationship between the 'Office Visit' that follows a 'Hospital Encounter' encounter_type

After all the data has been combined into the encdemo dataframe, we must extract the relationship of interest to the hypothesis - the time gap between a specific patient's Hospital Encounter and the following Office Visit. To do this we need to make sure we're accounting for the patient identity, the ordering of the visits, and the visit types of interest.

First, sort by patients using osler_id and encounter_date and reset the index.

Resetting the index is required for the for loop to function, because it iterates over the row indices - now set to integers starting from 0.
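The sort and index reset can be sketched as follows, using a made-up encdemo dataframe whose rows start out unordered.

```python
import pandas as pd

# Made-up, deliberately unordered stand-in for encdemo.
encdemo = pd.DataFrame(
    {
        "osler_id": ["B2", "A1", "A1", "B2"],
        "encounter_date": pd.to_datetime(
            ["2020-02-01", "2020-01-10", "2020-01-01", "2020-01-15"]
        ),
    }
)

# Order by patient, then by date within each patient, and renumber rows 0..n-1.
encdemo = encdemo.sort_values(["osler_id", "encounter_date"]).reset_index(drop=True)
print(encdemo)
```

Passing drop=True discards the old index instead of keeping it as an extra column.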

The for loop uses the .iterrows() method of a pandas dataframe to iterate row by row. The if i > 0: line causes the body to execute starting with the 2nd row (row index = 1), because the very first row cannot have a relationship to a previous encounter (recall the date sorting above).

The second if clause identifies the relationships of interest to the hypothesis - instances where a Hospital Encounter is followed by an Office Visit for the same patient. It makes use of the row and prev_row objects to evaluate those relationships. Note that at the end of each loop iteration, the current row is stored as the previous row.

Finally, the visit_gap_df dataframe is populated inside this loop - when all the conditions are met, the dates of the two rows are subtracted and stored, along with the identifying information.
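The loop described above might be sketched as follows. This is an illustration rather than the notebook's original cell: the data is made up, the gap_days column name is an assumption, and the one-row dataframes are combined with pd.concat at the end (an assumed construction that leaves every row with index 0).

```python
import pandas as pd

# Made-up stand-in for encdemo, already sorted by osler_id and encounter_date
# with a 0..n-1 integer index.
encdemo = pd.DataFrame(
    {
        "osler_id": ["A1", "A1", "A1", "B2", "B2", "B2"],
        "gender": ["Female"] * 3 + ["Male"] * 3,
        "encounter_type": [
            "Hospital Encounter", "Office Visit", "Hospital Encounter",
            "Office Visit", "Hospital Encounter", "Office Visit",
        ],
        "encounter_date": pd.to_datetime(
            ["2020-01-01", "2020-01-10", "2020-03-01",
             "2020-03-05", "2020-04-01", "2020-04-01"]
        ),
    }
)

frames = []      # one single-row dataframe per matching pair
prev_row = None
for i, row in encdemo.iterrows():
    if i > 0:  # the first row has no previous encounter to compare against
        if (
            row["osler_id"] == prev_row["osler_id"]
            and prev_row["encounter_type"] == "Hospital Encounter"
            and row["encounter_type"] == "Office Visit"
        ):
            gap = (row["encounter_date"] - prev_row["encounter_date"]).days
            frames.append(
                pd.DataFrame(
                    {"osler_id": row["osler_id"], "gender": row["gender"], "gap_days": gap},
                    index=[0],
                )
            )
    prev_row = row  # the current row becomes the previous row for the next pass

visit_gap_df = pd.concat(frames)
print(visit_gap_df)
```

The osler_id comparison keeps the last encounter of one patient from being paired with the first encounter of the next patient.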

The data has now been reduced from ~750,000 rows to the ~30,000 instances of interest. Here is a brief exploration of that data.

This gap data needs further processing to remove occurrences that will diminish our ability to test the hypothesis, specifically 0-day gaps and gaps that are too large to be reasonably related to the prior hospitalization. We will remove 0-day observations and values >180 days.

First, some clean-up on the index of the dataframe (note it's all 0s above, due to how it was constructed from individual records).
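A minimal sketch of that index clean-up, using a made-up visit_gap_df whose index is all 0s; the gap_days column name is an assumption.

```python
import pandas as pd

# Made-up gap data with a repeated index of 0, mimicking record-by-record construction.
visit_gap_df = pd.DataFrame(
    {
        "osler_id": ["A1", "B2", "C3"],
        "gender": ["Female", "Male", "Female"],
        "gap_days": [9, 0, 200],
    },
    index=[0, 0, 0],
)

# Replace the repeated index with unique integers 0..n-1.
visit_gap_df = visit_gap_df.reset_index(drop=True)
print(visit_gap_df)
```

A unique index matters for the next step, because drop() removes every row that shares a given index label.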

Dropping 0-day observations and values >180 days using the pandas drop() method
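The filtering step could look like the sketch below, which finds the index labels of the unwanted rows and passes them to drop(). The data and the gap_days column name are assumptions.

```python
import pandas as pd

# Made-up gap data with a unique integer index.
visit_gap_df = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female", "Male", "Female"],
        "gap_days": [9, 0, 200, 30, 180],
    }
)

# Index labels of rows with a 0-day gap or a gap longer than 180 days.
unwanted = visit_gap_df[
    (visit_gap_df["gap_days"] == 0) | (visit_gap_df["gap_days"] > 180)
].index

# Remove those rows; a gap of exactly 180 days is kept.
filtered_df = visit_gap_df.drop(index=unwanted)
print(len(visit_gap_df) - len(filtered_df), "rows filtered")
```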

As shown below, ~9,100 rows are filtered. A more informative histogram and frequency table are also provided.
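One possible way to build a frequency table of the filtered gaps is to bin them with pd.cut, sketched here on made-up values. The 30-day bin width and the gap_days name are assumptions, not the notebook's choices.

```python
import pandas as pd

# Made-up filtered gaps, all between 1 and 180 days.
gap_days = pd.Series([1, 7, 14, 45, 90, 180], name="gap_days")

# 30-day bins: (0, 30], (30, 60], ..., (150, 180]; right edges are inclusive.
bins = [0, 30, 60, 90, 120, 150, 180]
binned = pd.cut(gap_days, bins=bins)

# Count observations per bin, keeping the bins in order.
freq_table = binned.value_counts(sort=False)
print(freq_table)
```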

The non-parametric one-way ANOVA required for this analysis is available in the scipy package, which simplifies the computation of the test. After importing the kruskal function from scipy.stats, subset the gap column of the dataframe by gender into separate objects for input.
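The test itself might be run as in this sketch. The data is made up, and the gender labels 'Female' and 'Male' and the gap_days column name are assumptions about how the values are coded.

```python
import pandas as pd
from scipy.stats import kruskal

# Made-up filtered gap data; real values would come from the steps above.
visit_gap_df = pd.DataFrame(
    {
        "gender": ["Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"],
        "gap_days": [5, 12, 30, 44, 7, 15, 28, 90],
    }
)

# One sample of gaps per gender.
female_gaps = visit_gap_df.loc[visit_gap_df["gender"] == "Female", "gap_days"]
male_gaps = visit_gap_df.loc[visit_gap_df["gender"] == "Male", "gap_days"]

# Kruskal-Wallis H-test; the null hypothesis is that the samples share a distribution.
statistic, p_value = kruskal(female_gaps, male_gaps)
print(f"H = {statistic:.3f}, p = {p_value:.3f}")
```

A small p-value would be evidence against the null hypothesis that the two samples come from the same distribution.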

A p-value of 0.078 fails to reject the null hypothesis that the return days for males and females come from the same distribution (at the conventional 0.05 significance level). The data therefore provide no strong evidence of a difference between the two groups. However, a more appropriate model for this behavior might be a survival analysis, which can next be explored using the lifelines package.