Question 1: ngrams

The code below demonstrates vectorizing a body of text (corpus).

Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html and make three modifications to the vectorizer above, each based on a different parameter of scikit-learn's CountVectorizer. Pay special attention to ngram_range. Explain each modification and interpret the results accordingly. Note that some parameters are boolean and are set to True or False by default.
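
As one illustration of how ngram_range changes the extracted features, the sketch below fits three vectorizers on a hypothetical toy corpus (an assumption, not the course's corpus) and compares vocabulary sizes. It is a starting point for exploration, not a model answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used for illustration only.
corpus = [
    "The patient has diabetes.",
    "The patient has no history of diabetes.",
    "No history of hypertension.",
]

# ngram_range=(min_n, max_n) selects which n-gram lengths become features.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(corpus)  # single words (default)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(corpus)    # words and word pairs
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(corpus)   # word pairs only

for name, vec in [("unigrams", unigrams), ("uni+bi", uni_bi), ("bigrams", bigrams)]:
    print(name, len(vec.vocabulary_), sorted(vec.vocabulary_))
```

Bigrams such as "no history" keep some local context (here, negation) that unigram counts discard, at the cost of a larger and sparser feature space.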

Question 2: Reducing variation

Some of the parameters of scikit-learn's CountVectorizer can reduce variation in the extracted features. List two and explain what each of them does. Change the vectorizer accordingly and interpret the results. What changes did you observe?
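
To see what reducing variation can look like in practice, the sketch below compares a vectorizer with normalization switched off against one that uses lowercase and strip_accents. The two-document corpus is hypothetical and chosen only to make the effect visible. Other parameters are worth exploring as well.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus with case and accent variants of the same words.
corpus = [
    "Obesity obesity OBESITY",
    "Sjögren Sjogren",
]

# No normalization: every surface form becomes its own feature.
raw = CountVectorizer(lowercase=False, strip_accents=None).fit(corpus)

# lowercase=True folds case; strip_accents="unicode" removes diacritics.
normalized = CountVectorizer(lowercase=True, strip_accents="unicode").fit(corpus)

print(sorted(raw.vocabulary_))
print(sorted(normalized.vocabulary_))
```

Collapsing variants into one feature shrinks the vocabulary and pools counts that would otherwise be split across several columns.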

Question 3: Term weighting

Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html and use TfidfVectorizer instead of CountVectorizer. Make three modifications to the vectorizer, each based on a different parameter of scikit-learn's TfidfVectorizer. Explain each modification and interpret the results accordingly. Note that some parameters are boolean and are set to True or False by default.

Question 4: Working on larger datasets

We now want to extract ngram features from the discharge summaries in our obesity_data.csv corpus. Use some of your vectorizers and apply them to the "notes" corpus instead of the toy corpus. Explain what you see and provide some insights.

Question 5: Check previous work

Ideally, these weighted and/or unweighted ngrams will be used as features in a machine learning classifier. Perform a preliminary literature review and identify one paper that used ngrams and bag-of-words, with or without TF-IDF weighting, in the clinical domain (or a similar health-related domain). Provide a one-paragraph summary of the paper, and in particular investigate whether or not the bag-of-words model was helpful. Bonus: Find three or more similar papers and provide a one-paragraph summary of each, OR select one paper, find its code and data, and replicate the results.