The code below demonstrates how to vectorize a body of text (a corpus).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
'It was the best of times,',
'it was the worst of times,',
'it was the age of wisdom,',
'it was the age of foolishness,',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())  # renamed to get_feature_names_out() in scikit-learn >= 1.0
print(X.toarray())
['age', 'best', 'foolishness', 'it', 'of', 'the', 'times', 'was', 'wisdom', 'worst']
[[0 1 0 1 1 1 1 1 0 0]
 [0 0 0 1 1 1 1 1 0 1]
 [1 0 0 1 1 1 0 1 1 0]
 [1 0 1 1 1 1 0 1 0 0]]
Check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html Make three modifications to the vectorizer function above, each based on a different parameter of scikit-learn's CountVectorizer. Pay special attention to ngram_range. Explain each modification and interpret the results accordingly. Note that some parameters are True or False by default.
#Make your changes here. Explain what you have done and the results.
#X = vectorizer.fit_transform(corpus)
#print(vectorizer.get_feature_names())
#print(X.toarray())
Some of the parameters in scikit-learn's CountVectorizer can reduce variation. List two and explain what each of these parameters does. Change the vectorizer function accordingly and interpret the results. What changes did you observe?
#Make your changes here. Explain what you have done and the results.
#X = vectorizer.fit_transform(corpus)
#print(vectorizer.get_feature_names())
#print(X.toarray())
Check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Use TfidfVectorizer instead of CountVectorizer. Make three modifications to the vectorizer function, each based on a different parameter of scikit-learn's TfidfVectorizer. Explain each modification and interpret the results accordingly. Note that some parameters are True or False by default.
#Make your changes here. Explain what you have done and the results.
vectorizer2 = TfidfVectorizer()
Y = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names())
print(Y.toarray())
['age', 'best', 'foolishness', 'it', 'of', 'the', 'times', 'was', 'wisdom', 'worst']
[[0.         0.60735961 0.         0.31694544 0.31694544 0.31694544 0.4788493  0.31694544 0.         0.        ]
 [0.         0.         0.         0.31694544 0.31694544 0.31694544 0.4788493  0.31694544 0.         0.60735961]
 [0.4788493  0.         0.         0.31694544 0.31694544 0.31694544 0.         0.31694544 0.60735961 0.        ]
 [0.4788493  0.         0.60735961 0.31694544 0.31694544 0.31694544 0.         0.31694544 0.         0.        ]]
We now want to extract ngram features from the discharge summaries in our obesity_data.csv corpus. Use some of your vectorizer functions and apply them to the "notes" corpus instead of the toy corpus. Explain what you see. Provide some insights.
import pandas as pd
from pandas import read_csv
# Load the obesity dataset
df = read_csv('obesity_data.csv', delimiter='\t')
notes = df['text']
# Working on "notes" instead of the toy corpus
#X = vectorizer.fit_transform(notes)
#print(vectorizer.get_feature_names())
#print(X.toarray())
#vectorizer2 = TfidfVectorizer()
#Y = vectorizer2.fit_transform(notes)
#print(vectorizer2.get_feature_names())
#print(Y.toarray())
# Modify this code as necessary to check different vectorizer functions.
Ideally, these weighted and/or unweighted ngrams will be used as features in a machine learning classifier. Perform a preliminary literature review and identify one paper that used ngram and bag-of-words features, with or without TF-IDF weighting, in the clinical domain (or a similar health-related domain). Provide a one-paragraph summary of the paper; in particular, investigate whether or not the bag-of-words model was helpful. Bonus: find three or more similar papers and provide a one-paragraph summary of each paper, OR select one paper, find its code and data, and replicate the results.