Classification

A machine learning classifier is an algorithm that takes input (transformed documents) and assigns labels. Using a pre-labeled dataset, the algorithm learns a set of parameters such that when it encounters previously unseen data (a new document), it can label the document appropriately. Ideally, we would create labels for a small subset of documents, perhaps a few hundred, and ask the classifier to ascribe labels to many more: several thousand or even tens of thousands of documents.

In our case, a document is a piece of free text from the EHR, for example a progress note, surgical note, or pathology report.

We are interested in ascribing labels corresponding to things like phenotypes. So the question is: given the piece of text, can we label the patient as having diabetes, epilepsy, or CHF; or can we determine whether the patient is a smoker, has a previous history of smoking, or has a family history of cancer? These labels can be useful for identifying candidates for studies.

This is an interactive notebook. Documentation and code will be presented in it. You can hit the Run button above to run code in a code cell. Let's first import some libraries - these are tools or bits of code that you will be using.
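A minimal set of imports for the sketches in this notebook is shown below. It assumes the csv module and scikit-learn are available in this environment; the exact imports in your cells may differ.

```python
import csv                                                    # read obesity_data.csv
from sklearn.feature_extraction.text import CountVectorizer  # bag-of-words features
from sklearn.svm import LinearSVC                             # one example classifier
from sklearn.model_selection import train_test_split         # hold out a test set
from sklearn import metrics                                   # accuracy, precision, recall, F1
```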

Loading Data

Let's first look at the data we'll be using. It is a version of the Obesity dataset from i2b2 and is the reason you signed the data use agreement. The data consist of discharge summaries from the Partners HealthCare Research Patient Data Repository and were used in a 2009 challenge to identify obesity-related comorbidities in discharge summaries. Please cite the following if you use this data for research:

Uzuner Ö. (2009). "Recognizing Obesity and Co-morbidities in Sparse Data". Journal of the American Medical Informatics Association. July 2009; 16(4): 561-570. http://jamia.bmj.com/content/16/4/561.full.pdf.

We will be using a subset of this data that has been collected and slightly processed to make using it easier within the allotted time. See the publication for details on the raw data.

The data file we will be using is called "obesity_data.csv". The columns in the included file are below in the variable "headers". We will load the CSV into memory and get labels for the comorbidity corresponding to the column number in headers.
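As a rough sketch of this step, assuming the note text sits in the first column and that "CAD" appears in the headers variable (adjust the column name and indices to match your actual file):

```python
documents = []   # free-text discharge summaries
labels = []      # binary labels for the comorbidity of interest

label_column = headers.index("CAD")      # hypothetical column name; match your headers

with open("obesity_data.csv") as f:
    reader = csv.reader(f)
    next(reader)                         # skip the header row
    for row in reader:
        documents.append(row[0])         # assumes the note text is in the first column
        labels.append(int(row[label_column]))
```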

Looking at the data we just loaded, we can see the number of documents, the number of positive (patient has CAD) and negative labels, and the first patient record.
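Continuing the sketch above, a quick way to inspect what was loaded:

```python
print("Number of documents:", len(documents))
print("Positive (has CAD):", labels.count(1))
print("Negative:", labels.count(0))
print("First patient record:\n", documents[0][:500])   # first 500 characters
```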

Preprocessing

The machine learning classifier does not take text in directly; it needs the text to be preprocessed into a machine-computable form. Often this is a vector where each dimension is a feature, and for text classification each feature often corresponds to a word. This can lead to large, unwieldy vectors that are very sparse. For example, if we include all words in all documents, some dimensions could correspond to misspellings or numbers that occur only once.

We need to perform preprocessing to reduce some of the dimensionality, which can help to improve classification performance as well as the speed of learning and of classifying new documents. There are many types of preprocessing, for example converting text to all lowercase.
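As one illustration (CountVectorizer appears again in the homework below), scikit-learn's CountVectorizer lowercases text by default and builds exactly this kind of sparse word-count vector:

```python
vectorizer = CountVectorizer(lowercase=True)   # lowercase=True is the default
X = vectorizer.fit_transform(documents)        # sparse document-term matrix
print("Vocabulary size (vector dimensions):", len(vectorizer.vocabulary_))
print("Matrix shape (documents x features):", X.shape)
```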

Stopwords

We can remove stopwords. These are words that are very frequent in the English language but carry little information, such as pronouns (I, me, my, she, he, it, us, you, your) and determiners (definite/indefinite articles, possessives, interrogatives, demonstratives: the, a, their, what, whose, this, those, that). Some stopwords, such as "no" or "not", might be useful to keep if you are using n-gram features. N-grams are multiword phrases where n is usually 2 or 3; they can be useful for capturing negation such as "not have diabetes" or "no CHF symptoms".
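Both ideas can be sketched with CountVectorizer; the built-in English stop-word list and the (1, 3) n-gram range are just illustrative settings:

```python
# Drop common English stopwords entirely:
vec_stop = CountVectorizer(stop_words="english")
X_stop = vec_stop.fit_transform(documents)

# Alternatively, keep negation words and add bigrams/trigrams so phrases
# like "no chf" or "not have diabetes" become features:
vec_ngram = CountVectorizer(ngram_range=(1, 3))
X_ngram = vec_ngram.fit_transform(documents)

print("Unigram features without stopwords:", X_stop.shape[1])
print("Uni-/bi-/trigram features:", X_ngram.shape[1])
```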

Stemming

Stemming is often a heuristic process of reducing inflected forms to a common base, for example organize, organizes, organized, and organizing. Similarly, there are derivationally related words with similar meanings that can be reduced, such as democracy, democratic, and democratization.

The goal is to reduce sparsity, which can hurt classifier performance. However, stemming can contribute to increased polysemy: operate, operating, operates, operation, operative, operatives and operational all stem to "oper" using the Porter stemming algorithm.
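A quick sketch of the example above using NLTK's Porter stemmer, assuming nltk is installed:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["operate", "operating", "operates", "operation",
         "operative", "operatives", "operational"]
print([stemmer.stem(w) for w in words])   # all collapse to the same stem, "oper"
```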

Classification Algorithms

Different types of preprocessing can be combined to generate a feature vector. These feature vectors are used as input to the classifier. The choice of preprocessing, the classification algorithm, and the algorithm's tunable parameters will all impact performance.

There are many types of classification algorithms, such as Support Vector Machines (SVM), Random Forests, Decision Trees, etc. We will not go into their specifics and will treat them as black boxes. It is important to know that each algorithm has default parameters, but they can be changed. The crucial takeaway is understanding the resulting metrics from a classifier to determine its performance.
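A minimal end-to-end sketch follows: hold out some labeled documents, fit one algorithm with its default parameters, and predict labels for the held-out set. The linear SVM and the 80/20 split are illustrative choices, not necessarily the exact pipeline used in this notebook.

```python
# Hold out 20% of the labeled documents for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

clf = LinearSVC()              # a linear SVM with default parameters
clf.fit(X_train, y_train)      # learn parameters from the labeled training split
y_pred = clf.predict(X_test)   # label previously unseen documents
```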

Metrics

There are many different metrics available to evaluate classifier performance. Each metric depends on the underlying dataset - for example, the number of labels employed (in our case it is 2) and the proportion of labels within the dataset. It also depends on the desired application of the classifier.

See the confusion matrix below. Because we are using binary labels (positive and negative), metrics will be addressed in terms of: true positive - a document is classified as positive and its label is positive; false positive - a document is classified as positive but its label is negative; false negative - a document is classified as negative but its label is positive; true negative - both the classification of a document and its label are negative.

Confusion Matrix
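Continuing the sketch, the four counts can be read off scikit-learn's confusion_matrix, whose rows are true labels and columns are predicted labels:

```python
# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```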

Accuracy

Accuracy is the number of documents correctly labeled by the classifier divided by the total number of documents the classifier has labeled. For example, accuracy is 90% if the classifier correctly labels 90 documents out of 100 that it labels. Another way to think of this is (true positives + true negatives) / (total number of documents classified). This metric is only useful for balanced datasets where the proportion of labels is roughly equal: a classifier can obtain 90% accuracy by only assigning one label if that label represents 90% of the documents.
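The calculation, and the imbalance pitfall, in sketch form:

```python
accuracy = (tp + tn) / (tp + tn + fp + fn)
print("Accuracy:", accuracy)
print("Same value via sklearn:", metrics.accuracy_score(y_test, y_pred))

# The pitfall: always predicting the majority label can still score well
# on accuracy when one label dominates the dataset.
majority = max(set(y_test), key=list(y_test).count)
print("Always-majority accuracy:",
      metrics.accuracy_score(y_test, [majority] * len(y_test)))
```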

Precision

Precision is a metric of how often the classifier is correct when it labels a document as positive. Another way to think of this is true positives/(true positives + false positives).

This metric is important when the goal of the task is to make sure all the things you label as positive are in fact positive.
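Using the confusion-matrix counts from above:

```python
precision = tp / (tp + fp)
print("Precision:", precision)
print("Same via sklearn:", metrics.precision_score(y_test, y_pred))
```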

Recall

Recall is also called the true positive rate or sensitivity. It is true positives/(true positives + false negatives); in other words, how many truly positive documents you label as positive out of all positive documents.

This metric is important when the goal of the task is to make sure we don't miss any potential positives.
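And the corresponding calculation:

```python
recall = tp / (tp + fn)
print("Recall:", recall)
print("Same via sklearn:", metrics.recall_score(y_test, y_pred))
```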

Precision and recall are not mutually exclusive, but in practice, due to algorithm implementations, there are often trade-offs between them. Usually one is optimized at the expense of the other. It is important to understand which is most important for your task.

F1

The F measure combines precision and recall scores. It is the harmonic mean: F1 = 2 * (precision * recall)/(precision + recall). It is a way of combining precision and recall into a single number summarizing overall performance, which can be used to compare various classifiers. In F1, precision and recall are evenly weighted. This weighting can be changed if precision or recall is more important than the other.
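The harmonic mean from the values above, plus scikit-learn's per-class summary of all three metrics:

```python
f1 = 2 * (precision * recall) / (precision + recall)
print("F1:", f1)
print("Same via sklearn:", metrics.f1_score(y_test, y_pred))

# classification_report prints precision, recall, and F1 for each label,
# which is handy when comparing classifiers.
print(metrics.classification_report(y_test, y_pred))
```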

Question 1: Classification Analysis

You have now seen an example of document classification from start to finish. In the box below write a few paragraphs comparing the performance of the various algorithms. Some questions to think about:

Is accuracy an appropriate metric given the number of documents with CAD vs without?
In what context would precision be more important than recall? When would the opposite be true?

Write your analysis here

Question 2: Effects of preprocessing

You have now seen a classification example and have modified the CountVectorizer in a previous homework. Use at least a couple of the modifications from the previous homework, run the classifier, and compare/contrast the effects of your various modifications on classification performance in the boxes below.

Analysis answer here