This textbook was written for the clinical research community at Johns Hopkins that uses the Precision Medicine Analytics Platform (PMAP). These notebooks are available in HTML form on the Precision Medicine portal as well as in computational form in the CAMP-share folder on Crunchr (Crunchr.pm.jh.edu).

Asthma cohort discovery with regular expressions and set operations

In this notebook, we will look at defining patient cohorts by exploring the medications table.

Learning Objectives

  1. Analyze the medications dataset in Python with Pandas
  2. Use regular expressions to define patient cohorts according to asthma guidelines
  3. Use Python set operators to compare patient cohorts

Connect to PMAP Database server

Intro to Pandas

Pandas is a widely used open-source Python library (not part of the Python standard library) for manipulating tabular datasets. It is a good tool for data exploration, data cleaning, and preparing data for model building. Here are some useful links to learn more about pandas.

Load the medications table into a Pandas dataframe

Note that we are running a SQL query that retrieves the entire medications table, which has over 500k rows. This will take a few seconds, and the output displays only the first row as an example.
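As a minimal sketch of this step, the cell below loads a query result into a dataframe with pandas.read_sql_query. The real notebook queries the PMAP server; here an in-memory SQLite database stands in for it. The table name medications, the column names (patient_id, medication_name, pharmaceutical_class), and the three sample rows are all assumptions made for illustration, not the actual PMAP schema or data.

```python
import sqlite3

import pandas as pd

# Stand-in for the PMAP connection: an in-memory SQLite database.
# Table name, column names, and rows are hypothetical, not the real PMAP schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE medications ("
    "patient_id INTEGER, medication_name TEXT, pharmaceutical_class TEXT)"
)
conn.executemany(
    "INSERT INTO medications VALUES (?, ?, ?)",
    [
        (1, "albuterol inhaler", "BETA-ADRENERGIC AGENTS, INHALED, SHORT ACTING"),
        (2, "fluticasone inhaler", "GLUCOCORTICOIDS, ORALLY INHALED"),
        (3, "prednisone tablet", "GLUCOCORTICOIDS"),
    ],
)
conn.commit()

# Pull the whole table into a dataframe, then display only the first row.
df = pd.read_sql_query("SELECT * FROM medications", conn)
print(df.head(1))
```

With the real table, the same read_sql_query call would return all 500k+ rows, which is why only the first row is displayed.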

Note: If this query fails or returns an error, make sure you have requested and been granted access to the asthma training dataset.

Exploring the data with Pandas

For the Pandas dataframe named df, here are some useful ways to explore the dataframe.

Try these commands here to get a feel for this dataset.
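A few commonly used exploration commands are sketched below. The small dataframe is made up so the cell runs on its own; the column names are assumptions and may differ from the real medications table.

```python
import pandas as pd

# Small made-up dataframe standing in for the medications table.
# Column names and values are hypothetical.
df = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3],
        "medication_name": [
            "albuterol inhaler",
            "fluticasone inhaler",
            "albuterol inhaler",
            "prednisone tablet",
        ],
        "pharmaceutical_class": [
            "BETA-ADRENERGIC AGENTS, INHALED, SHORT ACTING",
            "GLUCOCORTICOIDS, ORALLY INHALED",
            "BETA-ADRENERGIC AGENTS, INHALED, SHORT ACTING",
            "GLUCOCORTICOIDS",
        ],
    }
)

print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # column names
print(df.dtypes)   # data type of each column
print(df.head(2))  # first two rows
df.info()          # non-null counts and memory usage

# How often each pharmaceutical class appears, most frequent first.
print(df["pharmaceutical_class"].value_counts())

# Number of distinct patients in the table.
print(df["patient_id"].nunique())
```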

Regular Expressions for Cohort Definition

Regular Expressions are useful tools for matching patterns of characters in text.
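As a small illustration using Python's built-in re module, the sketch below searches a list of class names for a pattern. The class strings are invented for the example and are not taken from the PMAP data.

```python
import re

# Hypothetical pharmaceutical class strings, for illustration only.
classes = [
    "BETA-ADRENERGIC AGENTS, INHALED, SHORT ACTING",
    "GLUCOCORTICOIDS, ORALLY INHALED",
    "GLUCOCORTICOIDS",
    "ANTIHISTAMINES",
]

# Match "inhaled" anywhere in the string, ignoring case.
inhaled = re.compile(r"inhaled", flags=re.IGNORECASE)
inhaled_classes = [c for c in classes if inhaled.search(c)]

# ^ and $ anchor the pattern to the start and end of the string,
# so only the class that is exactly "GLUCOCORTICOIDS" matches.
exact = re.compile(r"^glucocorticoids$", flags=re.IGNORECASE)
exact_classes = [c for c in classes if exact.search(c)]

print(inhaled_classes)
print(exact_classes)
```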

Useful links for Regular Expressions.

Our medications dataset has a pharmaceutical class field. We are going to explore it to see which patients have prescriptions for rescue inhalers, inhaled corticosteroids, and oral corticosteroids.
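One way to build these three cohorts is with the pandas Series.str.contains method, which applies a regular expression to every row of a text column. The sketch below is a rough illustration only: the column names, the class strings, and the patterns are assumptions, and the real class names in the dataset should be inspected before choosing patterns.

```python
import pandas as pd

# Made-up rows; column names and class strings are hypothetical.
df = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3, 4],
        "pharmaceutical_class": [
            "BETA-ADRENERGIC AGENTS, INHALED, SHORT ACTING",
            "GLUCOCORTICOIDS, ORALLY INHALED",
            "BETA-ADRENERGIC AGENTS, INHALED, SHORT ACTING",
            "GLUCOCORTICOIDS",
            "ANTIHISTAMINES",
        ],
    }
)


def cohort(pattern):
    """Return the set of patient IDs whose class matches the regex."""
    mask = df["pharmaceutical_class"].str.contains(
        pattern, case=False, regex=True, na=False
    )
    return set(df.loc[mask, "patient_id"].tolist())


# Illustrative patterns; adjust them to the class names actually present.
rescue = cohort(r"beta-adrenergic.*short acting")
inhaled_steroid = cohort(r"glucocorticoids.*inhaled")
oral_steroid = cohort(r"^glucocorticoids$")

print(rescue, inhaled_steroid, oral_steroid)
```

Returning each cohort as a set of patient IDs, rather than a filtered dataframe, makes the cohorts easy to compare with set operators.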

Exploring Set theory with the Set Operator

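A Python set is an unordered collection of unique elements. Like lists, sets are mutable; unlike lists, they have no order and cannot contain duplicates. Converting a list to a set drops any repeated values, which makes sets a natural fit for holding the patient IDs in a cohort.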
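The set operators can be sketched on a few hypothetical cohorts. The patient IDs below are invented for illustration.

```python
# Building a set from a list drops duplicate patient IDs.
ids = [1, 1, 2, 2, 5]
unique_ids = set(ids)

# Hypothetical cohorts of patient IDs, one per medication group.
rescue = {1, 2, 5}
inhaled_steroid = {1, 5, 7}
oral_steroid = {3, 5}

# Intersection: patients in both cohorts.
both = rescue & inhaled_steroid

# Union: patients in at least one cohort.
any_cohort = rescue | inhaled_steroid | oral_steroid

# Difference: rescue inhaler patients with no steroid prescription.
rescue_only = rescue - inhaled_steroid - oral_steroid

# Symmetric difference: patients in exactly one of the two cohorts.
exactly_one = rescue ^ inhaled_steroid

print(unique_ids, both, any_cohort, rescue_only, exactly_one)
```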

SQL operation LIKE to define Cohort

In case you were wondering whether you could do the string filtering in SQL instead of Pandas, you can. SQL has the LIKE operator, which accepts % as a wildcard matching any sequence of characters. It is not quite as flexible as regex, but it works in this case.
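A minimal sketch of a LIKE filter follows, run against an in-memory SQLite database as a stand-in for the PMAP server. The table name, column names, and rows are assumptions for illustration. Note that whether LIKE is case-sensitive depends on the database and its settings; SQLite ignores case for ASCII letters by default.

```python
import sqlite3

# In-memory SQLite database standing in for the PMAP server.
# Table name, column names, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE medications (patient_id INTEGER, pharmaceutical_class TEXT)"
)
conn.executemany(
    "INSERT INTO medications VALUES (?, ?)",
    [
        (1, "BETA-ADRENERGIC AGENTS, INHALED, SHORT ACTING"),
        (1, "GLUCOCORTICOIDS, ORALLY INHALED"),
        (2, "BETA-ADRENERGIC AGENTS, INHALED, SHORT ACTING"),
        (3, "GLUCOCORTICOIDS"),
    ],
)
conn.commit()

# % matches any run of characters, so this finds classes that contain
# GLUCOCORTICOIDS followed somewhere later by INHALED.
query = (
    "SELECT DISTINCT patient_id FROM medications "
    "WHERE pharmaceutical_class LIKE '%GLUCOCORTICOIDS%INHALED%' "
    "ORDER BY patient_id"
)
inhaled_steroid = {row[0] for row in conn.execute(query)}

print(inhaled_steroid)
```

Filtering in SQL keeps the unneeded rows on the server, which can matter for a table with over 500k rows.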