This textbook was written for the clinical research community at Johns Hopkins leveraging the precision medicine analytics platform (PMAP). These notebooks are available in html form on the Precision Medicine portal as well as in computational form in the CAMP-share folder on Crunchr (Crunchr.pm.jh.edu).

Introduction to SQL for mining EMR data

Learning Objectives

  1. Introduction on how Electronic Medical Records are organized
  2. Learn Structured Querying Language to select records from a database
  3. Explore Structured Querying Language for filtering, counting, and grouping of data

Table of contents

Working with EMR Data

Clinical patient data is entered into forms of the Epic electronic medical record and converted nightly to a relational database (Clarity). The data is stored in tables of related information. There are currently over 20k tables in the Epic Clarity database.

This de-identified dataset contains 60k patients with asthma and pulls commonly used tables as part of clinical care. No narrative fields are included in this dataset. Here is a list of tables that are included in this dataset.

Data Tables

In an EMR, patient records are organized by a patient encounter (such as a office visit or a hospitalization). Each patient may have many encounters, and each encounter may have many observations of vitals as part of each. This relationship (known as an entity relationship) is important for combining many observations around a patient by their encounters.

Data_Schema

Import Libraries and Connect to DB

Below are the code cells to connect to the CAMP Asthma database. You can follow along and run the cells or just scroll down to see there pre-compiled output. You will need to enter you JHED to connect to the PMAP database and have requested access for the Asthma dataset.

Database Schema

Structured Query Language Introduction

There is a useful and standard way to ask questions of relational databases known as SQL (Structured Query Language). There are many resources to learn SQL and database querying online based upon the SQL Server database.

SQL is an open standard query language that dates back to the early 1970's to work with databases using relational algebra. It is still the standard for working with relational databases.

Here are the basics to querying a database.

SELECT Column_Names FROM Table_Name

The two main elements of querying in SQL are the SELECT statement which specifies which columns to return and the FROM statement which specifies which table to pull those columns from.

In this example we use the * wildcard after the SELECT statement to designate that we want all the columns. We specify vitals_temperature after the FROM statement to specify the temperature table. We send a query string to the database server and it returns the restulst in the form of a table which is stored as a Pandas dataframe df3. Pandas is described in greater detail in upcoming pages. For now focus on the SQL statement.

Try modifying the query string to return only the temperature field and selecting the top 20 records.

Note the Top 3 statement before the * and after the SELECT statement. This specifies that we only want the top 3 records to be returned. the vitals temperature table contains over 314k records and we do not want to return them all for this example.

Select Statement

In this example select the encounter_id and encounter_type columns to be returned as fields.

Using functions (min,max,count) on columns

There are several functions that can calculate on an entire column (field0 and return an aggregate result. Let's look at the main demographics table which has a single record for each patient along with their date of birth.

Note: Date of Birth Dates are randomly shifted as part of the de-identification process

Finding distinct values in a column with DISTINCT

Getting a distinct list of values from a given column is a great way of performing a quick quality check. Let's check the * of values reported for gender from the demographics table.

Filtering columns with WHERE clause

SQL comes with a powerful capability to filter data. In this example we are filtering data from the vitals temperature table by temperature values over 100.

In this example we filter the type of encounter to Office Visit. Please note the use of double quotations " for the entire SQL string and single quotations for a string match in the WHERE clause. String matching within fields are case sensitive.

Using the IN command to filter by multiple values

For this example we are going to show how you can enter a multiline string in Python using (). This improves readability.

Filtering by a partial string with the LIKE command

The LIKE function can look through a string field for a pattern match. The % sign indicates there can be any characters before or after the string depending on where you are looking for.

Grouping data with the Group By clause

The GROUP BY command groups data by common valued fields. In this example we have a GROUP BY on the type of encounter tied to an aggregate function count. It is important to know that the GROUP BY clause can only be used on columns which also appear in the SELECT statement, and on aggregate functions like COUNT(*) or SUM(). Also note the ORDER BY command which then sorts the resulting data in descending order to show the most frequent encounter *.