Homework 2: Matching and Regular Expressions

In this assignment you will practice simple methods implementing dictionary based matching as well as regular expressions. You will familiarize yourself with built in python libraries and also external libraries. You will learn to read external documentation to implement methods in your own code. The objective of this assignment is to learn how to implement heuristic matching methods.

Please make sure you have signed and returned the I2B2 data use agreement. As discussed in class much of the electronic medical record is free text. Different sections within the note are often denoted by headings these headings are often differentiated by case or punctuation. These sections may or may not be manually entered and can change over time. Notes like the ones below often include a timeline, conditions, symptoms, list of medications, list of procedures and other phenotypic information about a patient.

We will use the following discharge summary from I2B2 throughout this homework assignment:

Task 1: Dictionary Matching

Tokenization is often a primary first step in many natural language processing applications. It is useful in many dictionary lookup applications. After we tokenize a piece of text we can see if each token is in a data structure called a set. A set can hold words that have some semantic meaning. For example we can create a dictionary of drugs, one of symptoms, another of diseases, etc.

Drug dictionary:

Identify several drugs within the text above such as Lasix and Coumadin. We will create a set called "drugs", to match the tokens from the text above and demonstrate the drugs can be identified correctly. We will iterate over each token in the text and see if it is in the set "drugs" if it is, we will count its' occurrence.

In the box below: How can you reduce variation in the text? For example how can you make sure you corrently match text of different cases like: Lasix vs lasix? How do we handle if the word "Lasix," contains punctuation at the end? Can we handle misspellings? What is difficult?

Drug dictionary different tokenization:

In the last homework we used a different tokenization method to handle punctuation at the end of tokens, for example: "Lasix,", modify the code below to use this different tokenization. Additionally modify the "drug" set to find more than two drugs. Demonstrate in the box below:

Drug dictionary string modifications:

"split" is a bultin string method we've been using to break a string into smaller strings (tokens) based on white space. There are many builtin string functions see: https://www.w3schools.com/python/python_ref_string.asp for an example of some of them. A useful function is "lower" for example we can change the entire text above to lowercase:

In the box below describe how this function might help when using dictionary based methods.

Task 2: Regular Expressions:

Regular expressions are useful when there are too many possibilities but have the same format - for example dates month/day/year, telephone numbers - areacode - 3 digits - 4 digits or email addresses. Let's create a simple date extractor. Looking at the date above we have the format: month/day/year notice that there might be 1 or 2 digits for both day and month. We can write a regular expression to handle this:

In the box modify regular expression above to REQUIRE 2 digits for the month and day and instead of / we now require a -.

We can use regular expressions in a similar fashion to dictionary lookup. We can rewrite our dictionary method as a regular expression, we can even make it case insenstive. We are putting all the variations of drugs into a capturing group.

Now lets try to look for drug doses. This is a little harder because we want to use capturing groups to seperate the amount, space (if any), and unit. It can be useful to seperate information into distinct pieces.

In the box below modify the two regular exressions above to capture: drug, dose and unit.

Task 3: Section Breaking:

In the box below: Now use a regular expresion to break the initial text into sections such as: Principal medical history, discharge medications, discharge date, hospital course.

In the box below: 1: Briefly describe your approach. What were your expressions? What are they supposed to to do?

2: Briefly describe cases when the regular expression you created above would fail. For example by failure we mean unintended actions: it didn't break apart the correct sections or it split a section in half. How could you make the pattern robust? Why is it advantageous to break the text into sections? Describe several use cases where you would to break the text into sections. Hint look at the allergy section.