A machine learning classifier is an algorithm that takes input (here, transformed documents) and assigns labels. Using a pre-labeled dataset, the algorithm learns a set of parameters such that, when it encounters previously unseen data (a new document), it can label the document appropriately. Ideally, we would create labels for a small subset of documents, perhaps a few hundred, and ask the classifier to label many more: several thousand or even tens of thousands of documents.
In our case, a document is a piece of free text from the EHR: for example, a progress note, a surgical note, or a pathology report.
We are interested in ascribing labels corresponding to phenotypes. So the question is: given a piece of text, can we label the patient as having diabetes, epilepsy, or CHF? Can we determine whether the patient is a smoker, has a previous history of smoking, or has a family history of cancer? Such labels can be useful for identifying candidates for studies.
This is an interactive notebook, in which documentation and code are presented together. You can hit the Run button above to run the code in a code cell. Let's first import some libraries; these are tools, or bits of code, that you will be using.
from gensim.parsing.preprocessing import stem_text, preprocess_string, remove_stopwords, strip_multiple_whitespaces
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_validate
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import numpy as np
import csv
Let's first look at the data we'll be using. It is a version of the Obesity dataset from i2b2 and is the reason you signed the data use agreement. The data consist of discharge summaries from the Partners HealthCare Research Patient Data Repository, and they were the subject of a 2009 challenge to identify obesity-related comorbidities in discharge summaries. Please cite the following if you use this data for research:
Uzuner Ö. (2009). "Recognizing Obesity and Co-morbidities in Sparse Data". Journal of the American Medical Informatics Association. July 2009; 16(4): 561-570. http://jamia.bmj.com/content/16/4/561.full.pdf.
We will be using a subset of this data that has been collected and slightly processed to make using it easier within the allotted time. See the publication for details on the raw data.
The data file we will be using is called "obesity_data.csv". Its columns are listed below in the variable "headers". We will load the file into memory and get labels for the comorbidity corresponding to a column number in headers.
headers = ['text', 'CAD', 'Gout', 'Venous Insufficiency', 'PVD', 'Hypercholesterolemia',
'Hypertension', 'Asthma', 'Hypertriglyceridemia', 'OSA', 'Gallstones',
'Depression', 'Obesity', 'GERD', 'OA', 'Diabetes', 'CHF']
def load_data(filename, label_col):
    labels = []
    text = []
    with open(filename, 'rt') as file:
        # Note: the file is tab-delimited despite the .csv extension
        reader = csv.reader(file, delimiter='\t')
        # Skip the header row
        next(reader)
        for row in reader:
            val = int(row[label_col])
            # Keep only documents explicitly labeled positive (1) or negative (-1)
            if val == 1 or val == -1:
                text.append(row[0])
                if val == -1:
                    labels.append(0)
                else:
                    labels.append(1)
    return text, labels
# Column 1 of headers corresponds to the comorbidity CAD
col = 1
print("Label we are classifying:", headers[col])
text, labels = load_data('obesity_data.csv', col)
Label we are classifying: CAD
Looking at the data, we can see how many documents were loaded, along with the number of positive (patient has CAD) and negative labels. We can also look at an individual patient record.
print("Number of documents:", len(text), "Num CAD:", np.count_nonzero(labels), "Num without CAD:", (len(labels)-np.count_nonzero(labels)))
Number of documents: 619 Num CAD: 343 Num without CAD: 276
print(text[512])
658449513 | AH | 79268417 | | 790454 | 6/23/1998 12:00:00 AM | DEGENERATIVE JOINT DISEASE RT. KNEE | Signed | DIS | Admission Date: 4/18/1998 Report Status: Signed Discharge Date: 9/22/1998 DIAGNOSIS: END-STAGE OSTEOARTHRITIS , RIGHT KNEE. HISTORY OF PRESENT ILLNESS: Ms. Denault is a 75-year-old woman who was scheduled for a right total knee replacement for end-stage osteoarthritis. The patient has had a long-standing history of progressive knee pain , which has become disabling over the last six months. The patient has two units of autologous blood available for this procedure. PAST MEDICAL HISTORY: Significant for osteoarthritis , borderline diabetes mellitus , glucose-6-phosphate dehydrogenase deficiency , and glaucoma. PAST SURGICAL HISTORY: Significant for cesarean section times three. She is also status post dental extractions. ALLERGIES: Glucose-6-phosphate dehydrogenase deficiency and erythromycin. ADMISSION MEDICATIONS: Timoptic-XE one q.h. and q.a.m. in each eye , Indocin 25 mg p.o. p.r.n. , and Tylenol p.r.n. SOCIAL HISTORY: The patient denies smoking currently. She quit 25 years ago. She reports drinking alcohol occasionally. She is on an American Diabetic Association 1 , 300 calorie low sodium diet. The patient is a widow who resides in Lina Rd. , Po Raco REVIEW OF SYSTEMS: Significant for glasses , history of glaucoma , and early cataracts. She has full dentures on uppers and partial dentures on the bottom. She is deaf in the right ear. There are no pulmonary problems. Cardiovascular , the patient denies coronary artery disease , hypertension , chest pain , congestive heart failure , or deep venous thrombosis. The patient has a history of borderline diabetes mellitus with baseline blood sugars in the 140s. PHYSICAL EXAMINATION: Her blood pressure was 140/80. She is 4'10" tall and 168 lb. In general , she ambulated with an antalgic gait on the right with a cane. She required assistance to get onto the examination table. Head , eyes , ears , nose , and throat were significant for pupils being equal and reactive to light with extraocular muscles intact , and normal pharynx. There was no evidence of carotid bruits. Her lungs were clear to auscultation bilaterally. Her heart was regular in rate and rhythm with a normal S1 and S2. There was no murmur. Her abdomen was protuberant , but nontender with no organomegaly. There was a well healed midline incision consistent with cesarean section. There were no focal neurological deficits. Examination of her extremities revealed bilateral varus with the right greater than the left. Active range of motion of the right knee was from a 15 degree extension deficit to approximately 70 degrees. There was palpable crepitus. She had positive medial and lateral joint line tenderness. The knee appeared stable to valgus and varus stresses. She had a palpable dorsalis pedis and posterior fullness with no evidence of ulceration. She was neurovascularly intact. She received 5 mg of Coumadin preoperatively and was instructed to discontinue use of Indocin. HOSPITAL COURSE: The patient was brought to the Operating Room on 2/29/98 where she underwent a right total knee arthroplasty with a Kinemax system. Estimated blood loss was 300 cc. The tourniquet time was 90 minutes. She received perioperative antibiotics and was continued on Coumadin in the postoperative period. Her pain was well controlled with the use of a PCA pump provided by the anesthesiology department. Her postoperative course was , for the most part , uncomplicated. 
She did have some low grade temperatures with a T max up to 101.8 postoperatively. This seemed to be related mostly to atelectasis. Her hematocrit was kept greater than 30 with the use of autologous blood transfusions. On postoperative day two , her white blood cell count was 13.7 , but had decreased by postoperative day three to 12. Her electrolytes were well controlled. She was made therapeutic on Coumadin by postoperative day two with a PT of 16.4 and an INR of 2.0. She worked with the physical therapy department along the total knee arthroplasty pathway. Her skin dressings were taken down and the wound was noted to be clean , dry , and intact with no evidence of erythema or discharge. The patient was felt to be doing well , although a little bit slow in terms of her physical therapy. It was felt that she would benefit from a short stay at a rehabilitation hospital and a consult was placed with the Maldharp Drugties Bamayhost Memorial Hospital . The patient was discharged to Tion Pidesf Medical Center pending bed availability on 10/10/98 . DISCHARGE MEDICATIONS: Colace 100 mg p.o. b.i.d. , iron sulfate 300 mg p.o. t.i.d. times a total of five days , Folate 1 mg p.o. q.day , insulin regular ( human ) sliding scale subcu q.i.d. , multivitamin one tab p.o. q.day , Timoptic 0.25% one drop each eye q.a.m. , Coumadin to keep PT/INR between 1.5 and 2.0 , Tylenol 650-1 , 000 mg p.o. q.4h. p.r.n. , Tylenol no. 3 one to two tabs p.o. q.4h. p.r.n. pain , and Benadryl 25-50 mg p.o. q.h.s. p.r.n. for sleep. DISPOSITION: The patient is discharged to a rehabilitation hospital. DISCHARGE INSTRUCTIONS: She is instructed to continue physical therapy for increased range of motion of her right knee. She is further instructed to continue taking Coumadin for a total of six weeks. She is instructed to follow up with Dr. Danilo Mincks as an outpatient in five weeks , to call his office for an appointment. Dictated By: DION TORGRIMSON , M.D. XN309 Attending: JAMEL SARKAR , M.D. CH.B IB54 EB500/1920 Batch: 8560 Index No. Q7RATAYES D: 10/10/98 T: 10/10/98
print(labels)
[1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
The machine learning classifier does not take in text directly; the text must first be preprocessed into a machine-computable form. Often this is a vector where each dimension is a feature, and for text classification each feature typically corresponds to a word. This can lead to large, unwieldy vectors that are very sparse. For example, if we include all words in all documents, a dimension could correspond to a misspelling or a number that occurs only once.
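To make this concrete, here is a minimal sketch (using toy sentences, not the EHR data) of how text becomes a count vector. Each column of the resulting matrix corresponds to one vocabulary word, and most entries are zero:

docs = ["the cat sat on the mat", "the dog chased the cat"]
v = CountVectorizer()
X = v.fit_transform(docs)  # sparse matrix with one row per document
print(v.vocabulary_)       # maps each word to its column index
print(X.shape)             # (2 documents, 7 unique words)
print(X.toarray())         # dense view: per-document word counts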
We need to perform preprocessing to reduce some of this dimensionality, which can improve classification performance as well as the speed of learning and of classifying new documents. There are many kinds of preprocessing, for example converting text to all lower case.
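As a quick illustration of that first step, the gensim helpers imported above can normalize case and whitespace (a toy string, not from the dataset):

raw = "The  Patient   DENIES chest pain"
print(strip_multiple_whitespaces(raw).lower())  # the patient denies chest pain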
We can remove stopwords: words that are very frequent in English but carry little information. These include word classes such as pronouns (I, me, my, she, he, it, us, you, your) and determiners, i.e., definite/indefinite articles and possessive, interrogative, and demonstrative words (the, a, their, what, whose, this, those, that). Some stopwords, such as "no" or "not", might be worth keeping if you are using ngram features. Ngrams are multiword phrases where n is usually 2 or 3; they can be useful for capturing negation, as in "not have diabetes" or "no CHF symptoms".
sentences = ["An apple a day keeps the doctor away.","The patient doesn't have CHF"]
v = CountVectorizer(stop_words='english')
print('Removing stopwords:', v.fit(sentences).vocabulary_.keys())
print()
v = CountVectorizer(ngram_range=(1, 3), stop_words='english')
print('Removing stopwords and ngrams size 1-3:', v.fit(sentences).vocabulary_.keys())
print()
v = CountVectorizer(ngram_range=(1, 3))
print('Ngrams size 1-3:', v.fit(sentences).vocabulary_.keys())
Removing stopwords: dict_keys(['apple', 'day', 'keeps', 'doctor', 'away', 'patient', 'doesn', 'chf'])

Removing stopwords and ngrams size 1-3: dict_keys(['apple', 'day', 'keeps', 'doctor', 'away', 'apple day', 'day keeps', 'keeps doctor', 'doctor away', 'apple day keeps', 'day keeps doctor', 'keeps doctor away', 'patient', 'doesn', 'chf', 'patient doesn', 'doesn chf', 'patient doesn chf'])

Ngrams size 1-3: dict_keys(['an', 'apple', 'day', 'keeps', 'the', 'doctor', 'away', 'an apple', 'apple day', 'day keeps', 'keeps the', 'the doctor', 'doctor away', 'an apple day', 'apple day keeps', 'day keeps the', 'keeps the doctor', 'the doctor away', 'patient', 'doesn', 'have', 'chf', 'the patient', 'patient doesn', 'doesn have', 'have chf', 'the patient doesn', 'patient doesn have', 'doesn have chf'])
Stemming is often a heuristic process of reducing inflected forms to a common base: for example organize, organizes, organized, and organizing. Similarly, there are derivationally related words with similar meanings that can be reduced, such as democracy, democratic, and democratization.
The goal is to reduce sparsity, which can hurt classifier performance. However, stemming can also increase polysemy: operate, operating, operates, operation, operative, operatives, and operational all stem to "oper" under the Porter stemming algorithm.
print(stem_text("Is it universally true? Some say the university is the best in the Universe."))
is it univers true? some sai the univers is the best in the universe.
Different kinds of preprocessing can be combined to generate a feature vector. These feature vectors are used as input to the classifier. The preprocessing steps, the classification algorithm, and the algorithm's tunable parameters will all impact performance.
There are many kinds of classification algorithms, such as Support Vector Machines (SVMs), Random Forests, and Decision Trees. We will not go into their specifics and will instead treat them as black boxes. It is important to know that each algorithm comes with default parameters, but these can be changed. The crucial takeaway is understanding the metrics a classifier produces in order to judge its performance.
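As a small illustration, every scikit-learn estimator exposes its parameters, so you can inspect the defaults and override them at construction time (the values below are arbitrary examples, not recommendations):

clf = SVC()                          # all default parameters
print(clf.get_params())              # inspect the defaults, e.g. C=1.0
clf = SVC(C=10.0, kernel='linear')   # override selected defaults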
There are many different metrics available for evaluating classifier performance. The appropriate metric depends on the underlying dataset, for example on the number of labels employed (in our case, 2) and the proportion of each label within the dataset. It also depends on the desired application of the classifier.
Consider the confusion matrix of a binary classifier. Because we are using binary labels (positive and negative), the metrics will be described in terms of true positives (a document is classified as positive and its label is positive), false positives (a document is classified as positive but its label is negative), false negatives (a document is classified as negative but its label is positive), and true negatives (both the classification of a document and its label are negative).
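Here is a minimal sketch of these four counts, using sklearn's confusion_matrix on toy labels (not the CAD data):

y_true = [1, 1, 1, 0, 0, 0]  # gold-standard labels
y_pred = [1, 1, 0, 1, 0, 0]  # hypothetical classifier output
# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
print(metrics.confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 2]]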
Accuracy is the number of documents correctly labeled by the classifier divided by the total number of documents the classifier has labeled. For example, accuracy is 90% if the classifier correctly labels 90 out of 100 documents. Another way to think of this is (true positives + true negatives) / (total number of documents classified). This metric is only useful for balanced datasets, where the proportion of each label is roughly equal: a classifier can obtain 90% accuracy by always assigning one label if that label represents 90% of the documents.
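To see that pitfall concretely, here is a small sketch (toy data, not the CAD labels) using scikit-learn's DummyClassifier, which always predicts the majority class and still reaches 90% accuracy without learning anything:

from sklearn.dummy import DummyClassifier

X_toy = np.zeros((100, 1))   # features are irrelevant for this demonstration
y_toy = [1] * 90 + [0] * 10  # 90% positive, 10% negative

clf = DummyClassifier(strategy='most_frequent')
clf.fit(X_toy, y_toy)
print(clf.score(X_toy, y_toy))  # 0.9 accuracy from always predicting 1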
Precision is a metric of how often the classifier is correct when it labels a document as positive. Another way to think of it is true positives / (true positives + false positives).
This metric is important when the goal of the task is to make sure that everything you label as positive really is positive.
Recall is also called the true positive rate or sensitivity. It is true positives / (true positives + false negatives); in other words, the number of truly positive documents you label as positive, out of all positive documents.
This metric is important when the goal of the task is to make sure we don't miss any potential positives.
Precision and recall are not mutually exclusive, but in practice, owing to how the algorithms are implemented, there are often trade-offs between them: optimizing one usually comes at the expense of the other. It is important to understand which matters most for your task.
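One way to see the trade-off is to sweep the decision threshold over classifier confidence scores; this sketch uses sklearn's precision_recall_curve on made-up scores:

from sklearn.metrics import precision_recall_curve

y_true = [1, 1, 1, 0, 0, 0]                # gold-standard labels
y_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]  # hypothetical confidence scores

# Each point corresponds to a different threshold; raising the threshold
# generally increases precision at the cost of recall
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r in zip(precision, recall):
    print("precision=%.2f recall=%.2f" % (p, r))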
The F measure combines the precision and recall scores. It is their harmonic mean: F1 = 2 * ((precision * recall) / (precision + recall)). It condenses precision and recall into a single number of overall performance that can be used to compare classifiers. In F1, precision and recall are evenly weighted; this weighting can be changed if precision or recall is more important than the other.
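Putting the metrics together, this short sketch computes all four on the toy predictions from the confusion matrix example above, and checks F1 against the harmonic-mean formula:

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

acc = metrics.accuracy_score(y_true, y_pred)  # (TP + TN) / total = 4/6
p = metrics.precision_score(y_true, y_pred)   # TP / (TP + FP) = 2/3
r = metrics.recall_score(y_true, y_pred)      # TP / (TP + FN) = 2/3
f1 = metrics.f1_score(y_true, y_pred)
print(acc, p, r, f1)
print(2 * (p * r) / (p + r))  # matches f1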
# Scoring metrics we're interested in
scoring = {'acc': 'accuracy',
           'precision': 'precision',
           'recall': 'recall',
           'f1': 'f1'}
# Initialize the random number generator with the same seed for repeatable experiments
random_state = 198273
# Our preprocessing steps: lowercase, collapse whitespace, remove stopwords, stem
def preprocess(X):
    return preprocess_string(X, filters=[lambda x: x.lower(), strip_multiple_whitespaces,
                                         remove_stopwords, stem_text])
# Notice the count vectorizer/tfidf transformer. We are using a pipeline here: the steps
# in preprocess are applied first (converting the string to lowercase, removing multiple
# spaces, removing stopwords, then stemming). This is followed by the count vectorizer and
# then the tfidf transformer, which reweights raw counts so that words common in a document
# but rare across the corpus receive higher weight.
preprocess_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer=preprocess)),
    ('tfidf', TfidfTransformer()),
])
features = preprocess_pipeline.fit_transform(text)
for clf in [
    RandomForestClassifier(),
    MultinomialNB(),
    SVC(kernel='linear', class_weight='balanced'),
    DecisionTreeClassifier(criterion='entropy', class_weight='balanced', random_state=random_state)
]:
    scores = cross_validate(clf, features, labels, scoring=scoring,
                            cv=5, return_train_score=False)
    print("Classifier:", clf)
    #print(scores)
    print("Accuracy:", scores['test_acc'].mean())
    print("Precision:", scores['test_precision'].mean())
    print("Recall:", scores['test_recall'].mean())
    print("F1:", scores['test_f1'].mean())
    print()
Classifier: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
Accuracy: 0.7011713611329663
Precision: 0.7491675809328949
Recall: 0.699914748508099
F1: 0.7214798001620777

Classifier: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Accuracy: 0.5557448728035668
Precision: 0.5550251606884128
Recall: 1.0
F1: 0.7138446756316068

Classifier: SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
Accuracy: 0.8109183320220298
Precision: 0.8406277772773674
Recall: 0.8193094629156011
F1: 0.8280801096851441

Classifier: DecisionTreeClassifier(class_weight='balanced', criterion='entropy', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=198273, splitter='best')
Accuracy: 0.8529993181222135
Precision: 0.8653339376465888
Recall: 0.8718243819266837
F1: 0.8679300117076794
You have now seen an example of document classification from start to finish. In the box below write a few paragraphs comparing the performance of the various algorithms. Some questions to think about:
Is accuracy an appropriate metric given the number of documents with CAD vs without?
In what context would precision be more important than recall?
When would the opposite be true?
Write your analysis here
Now that you have seen a classification example and have modified the CountVectorizer in a previous homework, use at least a couple of the modifications from that homework, run the classifier, and compare/contrast the effects of your various modifications on classification performance in the boxes below; a starter sketch is provided after the comment line below.
#Copy and modify code from above and the previous homework here.
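As a hedged starting point (your modifications from the homework may differ), one possible variation swaps in different CountVectorizer settings, for example adding bigrams and dropping words that appear in only one document. The ngram_range and min_df values below are illustrative choices, not required settings:

modified_pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2), min_df=2,
                             stop_words='english', lowercase=True)),
    ('tfidf', TfidfTransformer()),
])
modified_features = modified_pipeline.fit_transform(text)

clf = SVC(kernel='linear', class_weight='balanced')
scores = cross_validate(clf, modified_features, labels, scoring=scoring,
                        cv=5, return_train_score=False)
print("F1:", scores['test_f1'].mean())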
Analysis answer here