In this assignment you will practice simple methods for dictionary-based matching and regular expressions. You will familiarize yourself with built-in Python libraries as well as external libraries, and you will learn to read external documentation in order to use those libraries in your own code. The objective of this assignment is to learn how to implement heuristic matching methods.
Please make sure you have signed and returned the I2B2 data use agreement. As discussed in class, much of the electronic medical record is free text. Different sections within a note are often denoted by headings, and these headings are often differentiated by case or punctuation. The sections may or may not be manually entered, and they can change over time. Notes like the one below often include a timeline, conditions, symptoms, a list of medications, a list of procedures, and other phenotypic information about a patient.
We will use the following discharge summary from I2B2 throughout this homework assignment:
!pip install spacy
#Imports that could be useful in this homework:
import re
import nltk
import spacy
from collections import defaultdict
text = '''
197067530 | TWMH | 20930664 | | 908831 | 1/20/1998 12:00:00 AM | INCARCERATED UMBILICAL HERNIA | Signed | DIS | Admission Date: 6/7/1998 Report Status: Signed
Discharge Date: 5/25/1998
PRINCIPAL DIAGNOSIS: INCARCERATED UMBILICAL HERNIA.
HISTORY: Jewell Gauthreaux is a 78 year old woman with a complex past
medical history including coronary artery disease with a
history of MIs times two in the past, a history of DVT back in
1970 , hypertension, rheumatoid arthritis, gout and history of
atrial fibrillation and atrial flutter as well as onset adult
diabetes mellitus. She presented to the Rial Community Hospital on
the day of admission complaining of an umbilical bulge over the
past several weeks. This umbilical bulge had been increasing
somewhat in size, but had not bothered her and was always
reducible. However, over the preceding weekend it became
incarcerated and then became somewhat painful. It was not
associated with any nausea or vomiting and she reported that she
was having normal bowel movements even in the face of this problem.
She presented initially to the Sey Al Skaez County Health Center and was
admitted with the diagnosis of incarcerated umbilical hernia.
PAST MEDICAL HISTORY:
1. Coronary artery disease with a history of MI times two in the
past with a recent echocardiogram on 11/8 showing an EF of
55-60%.
2. History of DVT in 1970.
3. Hypertension.
4. Rheumatoid arthritis.
5. Gout.
6. Atrial fibrillation and atrial flutter on Coumadin.
7. Adult onset diabetes mellitus.
PAST SURGICAL HISTORY:
1. Status post appendectomy.
2. Status post mitral valve replacement with St. Jude valve.
3. Left hip fracture repair.
4. Status post mitral valve commissurotomy in 1955.
MEDICATIONS ON ADMISSION: Lasix 80 mg a day, sublingual
nitroglycerin p.r.n., Propafenone 225 mg
t.i.d., Lopressor 150 mg b.i.d. , Lisinopril 10 mg a day and
Micronase 10 mg b.i.d., Isordil 40 mg t.i.d., Coumadin 5 mg a day
with 2-1/2 mg every Sunday.
ALLERGIES:: She is allergic to aspirin and penicillin.
PHYSICAL EXAMINATION: She is an extremely pleasant elderly woman
in no acute distress. HEENT - showed
extraocular movements intact. Pupils equally round and reactive to
light. NECK - supple. HEART - regular rhythm. LUNGS - clear.
ABDOMEN - soft, nontender, nondistended with approximately 1.5 cm
in diameter umbilical hernia to the left of her umbilicus. This
hernia was somewhat tender to palpation , but showed no overlying
erythema or evidence of necrosis. She had normal bowel sounds.
EXTREMITIES - no clubbing , cyanosis or edema. NEUROLOGIC - intact.
Preoperative laboratory showed BUN of 35, creatinine 1.3,
hematocrit 42.0, white count 6.8, coagulation studies within normal
limits.
HOSPITAL COURSE: Ms. Peick was admitted to the Lum Hospital on the day of admission with the
diagnosis of incarcerated umbilical hernia. Because of her history
of coumadinization for both her mitral valve as well as her atrial
fibrillation it was felt that it would be necessary to hospitalize
her , hold her Coumadin and heparinize her until it was possible to
do her surgery. However, upon arrival here her admission INR was
noted to be subtherapeutic at 1.3 and she was , therefore ,
immediately started on heparin. The Cardiology Service was
consulted regarding her significant past cardiac history and
recommended an echocardiogram to be performed preoperatively. This
echo was done on 9/6/98 demonstrating an EF of 55% with an
abnormal subdural wall motion , trace areas of aortic insufficiency
and mildly increased right ventricular size and the artificial
mitral valve was noted to be functioning well. The cardiology felt
that in the face of this largely unchanged echocardiogram that
showed stable and should go to the operating room for repair of
umbilical hernia. On 7/1/98 the patient was taken to the
operating room and underwent umbilical hernia repair with primary
reapproximation of the fascia. The procedure was done without any
complications. She was extubated and transferred in stable
condition to the postoperative recovery area and observed on the
floor. She was immediately restarted on her Coumadin, as well as
her heparin, which she continued for the next three days
postoperatively. The patient did quite well with gradual up in her
INR to a greater than 2 level and she was discharged to home on
4/23/98 on a regular dose.
DISCHARGE MEDICATIONS: Tylenol 650 mg p.o. q four hours p.r.n.
headache , Estrogen cream topical which is
applied to her vagina because of her atrophic vaginitis. Colace
100 mg p.o. b.i.d. , Lasix 80 mg p.o. q.d. , Micronase 10 mg p.o.
b.i.d. , Isordil 40 mg p.o. t.i.d. , lisinopril 10 mg p.o. q.d. ,
Lopressor 150 mg p.o. b.i.d. , Percocet 1-2 tabs q 3-4 hours p.r.n.
pain , Propafenone 225 mg p.o. t.i.d. and Coumadin 5 mg p.o. q.d.
with 7.5 mg take every Sunday.
DISPOSITION: To home. She will follow-up at the Oxtri- Hospital one week after discharge with follow-up with
primary medical doctor in one week after discharge.
Dictated By: MARK BEALER , M.D. GN70
Attending: DON N. FRITZLER , M.D. AI17 TV067/7242
Batch: 73284 Index No. YSQMAY6DNL D: 1/23/98
T: 6/22/98
'''
Tokenization is often the first step in natural language processing applications, and it is particularly useful for dictionary lookup. After we tokenize a piece of text, we can check whether each token is in a data structure called a set. A set can hold words that share some semantic meaning. For example, we can create one dictionary of drugs, one of symptoms, another of diseases, etc.
Identify several drugs within the text above, such as Lasix and Coumadin. We will create a set called "drugs" to match against the tokens from the text above and demonstrate that the drugs can be identified correctly. We will iterate over each token in the text and check whether it is in the set "drugs"; if it is, we will count its occurrence.
drugs = set(['Coumadin', 'Lasix'])
drug_counts = defaultdict(int)
tokens = text.split()
for t in tokens:
    if t in drugs: #if the current token is a drug
        drug_counts[t]+=1
drug_counts
defaultdict(int, {'Lasix': 2, 'Coumadin': 3})
In the box below: How can you reduce variation in the text? For example, how can you make sure you correctly match text of different cases, like Lasix vs. lasix? How do we handle the case where the token "Lasix," contains punctuation at the end? Can we handle misspellings? What is difficult?
In the last homework we used a different tokenization method to handle punctuation at the end of tokens, for example "Lasix,". Modify the code below to use this different tokenization. Additionally, modify the "drugs" set to find more than two drugs. Demonstrate in the box below:
#MODIFY THIS CODE TO ADD MULTIPLE DRUGS, AND TAKE INTO ACCOUNT PUNCTUATION AT THE END OF A TOKEN: "Lasix,"
drugs = set(['Coumadin', 'Lasix'])
drug_counts = defaultdict(int)
tokens = text.split()
for t in tokens:
    if t in drugs: #if the current token is a drug
        drug_counts[t]+=1
drug_counts
"split" is a built-in string method that we have been using to break a string into smaller strings (tokens) on whitespace. There are many built-in string methods; see https://www.w3schools.com/python/python_ref_string.asp for some of them. A useful one is "lower". For example, we can change the entire text above to lowercase:
print(text.lower())
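As a small illustration of how string methods can reduce variation before a set lookup, the sketch below lowercases each token and strips leading and trailing punctuation. The sentence in `sample` and the names `sample`, `drug_set`, and `counts` are made up for this example and are not part of the assignment data; this is one possible normalization, not the tokenizer from the previous homework.

```python
import string

# Hypothetical sample sentence (not taken from the discharge summary).
sample = "She was restarted on her Coumadin, as well as Lasix 80 mg. lasix was held."

# Store the dictionary entries in lowercase so they match normalized tokens.
drug_set = {"coumadin", "lasix"}

counts = {}
for raw in sample.split():
    # Remove punctuation at either end of the token, then lowercase it.
    token = raw.strip(string.punctuation).lower()
    if token in drug_set:
        counts[token] = counts.get(token, 0) + 1

print(counts)  # {'coumadin': 1, 'lasix': 2}
```

Note that `strip` only removes characters at the ends of a token, so abbreviations with internal periods such as "p.r.n." keep their inner punctuation.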
In the box below, describe how this function might help when using dictionary-based methods.
Regular expressions are useful when there are too many possible values to enumerate but all of them share the same format, for example dates (month/day/year), telephone numbers (area code, 3 digits, 4 digits), or email addresses. Let's create a simple date extractor. Looking at the dates above, we have the format month/day/year; notice that there might be 1 or 2 digits for both the day and the month. We can write a regular expression to handle this:
import re # this is the regular expression library
#\d is the character class for digits - this means all digits 0-9, alternatively we could use [0-9]
#{} indicates the number of repetitions we are looking for. {1,2} means at least one, at most two
re.findall(r'\d{1,2}/\d{1,2}/\d{2}', text)
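One detail worth seeing on a small input: a year written as `\d{2}` matches only the first two digits of a four-digit year. The sketch below compares that pattern with a variant that accepts either four or two digits. The string `sample_dates` and the variable names are invented for this illustration, and the second pattern is only one possible way to handle both year formats.

```python
import re

# Hypothetical sample mixing four-digit and two-digit years.
sample_dates = "Admitted 6/7/1998, echo on 9/6/98, discharged 4/23/98."

# Year fixed at exactly two digits: a four-digit year is cut short.
two_digit_year = re.findall(r"\d{1,2}/\d{1,2}/\d{2}", sample_dates)

# Try four digits first, then fall back to two.
flexible_year = re.findall(r"\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})", sample_dates)

print(two_digit_year)  # ['6/7/19', '9/6/98', '4/23/98']
print(flexible_year)   # ['6/7/1998', '9/6/98', '4/23/98']
```

The `(?:...)` group is non-capturing, so `findall` still returns the whole match rather than just the year.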
text2 = '''On 02-12-1998 the the patient had an appendectomy. The patient did quite well with gradual up in her
inr to a greater than 2 level and she was discharged to home on 4-23-98 on a regular dose.'''
In the box below, modify the regular expression above to REQUIRE 2 digits for the month and day, and to require a - instead of a / as the separator.
re.findall(r'\d{1,2}/\d{1,2}/\d{2}', text2)
We can use regular expressions in a similar fashion to dictionary lookup. We can rewrite our dictionary method as a regular expression, and we can even make it case insensitive. We put all the variations of the drug names into a capturing group.
re.findall('(lasix|coumadin|percocet|tylenol)', text, re.IGNORECASE)
['Coumadin', 'Lasix', 'Coumadin', 'coumadin', 'Coumadin', 'Coumadin', 'Tylenol', 'Lasix', 'Percocet', 'Coumadin']
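Notice that the lowercase 'coumadin' in the output above comes from inside the word "coumadinization". If only whole words should count, one option is to surround the group with `\b` word boundaries. The following is a minimal sketch on a made-up sentence; `sample_note`, `loose`, and `strict` are names introduced only for this example.

```python
import re

# Hypothetical sentence containing a drug name inside a longer word.
sample_note = "History of coumadinization; hold her Coumadin and continue Lasix 80 mg."

# Without boundaries the pattern also matches inside "coumadinization".
loose = re.findall(r"(lasix|coumadin)", sample_note, re.IGNORECASE)

# \b requires a word boundary on both sides of the drug name.
strict = re.findall(r"\b(lasix|coumadin)\b", sample_note, re.IGNORECASE)

print(loose)   # ['coumadin', 'Coumadin', 'Lasix']
print(strict)  # ['Coumadin', 'Lasix']
```

Whether the substring match is wanted depends on the task: "coumadinization" does tell us something about the patient's medication, so the looser pattern is not necessarily wrong.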
Now let's try to look for drug doses. This is a little harder because we want to use capturing groups to separate the amount, the space (if any), and the unit. It can be useful to separate information into distinct pieces.
re.findall(r"([0-9]{1,5})(\s*)(mg|cc|ml)", text)
[('80', ' ', 'mg'), ('225', ' ', 'mg'), ('150', ' ', 'mg'), ('10', ' ', 'mg'), ('10', ' ', 'mg'), ('40', ' ', 'mg'), ('5', ' ', 'mg'), ('2', ' ', 'mg'), ('650', ' ', 'mg'), ('100', ' ', 'mg'), ('80', ' ', 'mg'), ('10', ' ', 'mg'), ('40', ' ', 'mg'), ('10', ' ', 'mg'), ('150', ' ', 'mg'), ('225', ' ', 'mg'), ('5', ' ', 'mg'), ('5', ' ', 'mg')]
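The output above contains ('5', ' ', 'mg') for the dose written as "7.5 mg", because the amount group accepts only digits. A hedged sketch of one way to allow an optional decimal part is shown below; the string `sample_doses` and the name `dose_pattern` are invented for this illustration, and the whitespace is matched without being captured.

```python
import re

# Hypothetical medication line with an integer dose, a decimal dose,
# and a dose written without a space before the unit.
sample_doses = "Coumadin 5 mg p.o. q.d. with 7.5 mg every Sunday, Colace 100mg"

# Amount: digits with an optional ".digits" part; unit: one of mg, cc, ml.
dose_pattern = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|cc|ml)\b")

doses = dose_pattern.findall(sample_doses)
print(doses)  # [('5', 'mg'), ('7.5', 'mg'), ('100', 'mg')]
```

Doses written as fractions, such as "2-1/2 mg" in the admission medications, would still need separate handling.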
In the box below, modify the two regular expressions above to capture drug, dose, and unit.
#REGULAR EXPRESSION HERE
In the box below: Now use a regular expression to break the initial text into sections such as: Principal medical history, discharge medications, discharge date, hospital course.
#REGULAR EXPRESSION HERE
In the box below: 1: Briefly describe your approach. What were your expressions? What are they supposed to do?
2: Briefly describe cases when the regular expression you created above would fail. By failure we mean unintended behavior: for example, it didn't break apart the correct sections, or it split a section in half. How could you make the pattern more robust? Why is it advantageous to break the text into sections? Describe several use cases where you would want to break the text into sections. Hint: look at the allergy section.