This textbook was written for the clinical research community at Johns Hopkins that uses the Precision Medicine Analytics Platform (PMAP). These notebooks are available in HTML form on the Precision Medicine portal and in computational form in the CAMP-share folder on Crunchr (Crunchr.pm.jh.edu).

Radiomics tutorial

Learning Objectives

  1. Load and explore a Radiomics dataset in Python with Pandas
  2. Visualize and analyze the data with Seaborn
  3. Construct and test a machine learning predictive model with scikit-learn

Radiomics

Radiomics is the study of extracting quantitative measurements from medical imaging. An example would be RECIST (Response Evaluation Criteria in Solid Tumours) which measures the long and short axis of solid nodules to assess tumor size and its response to treatment.

In 1995, Dr. Wolberg collected 569 cases to diagnose breast masses from fine needle aspiration (FNA) images. He recorded morphologic characteristics of the cell nuclei, such as radius, texture, perimeter, area, and smoothness.

Fine Needle Aspiration Images

The cell nucleus features recorded in this dataset include

Data Format

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)

Each measure has three features: the mean, the standard error, and the "worst" or largest value (the mean of the three largest values) were computed for each image, giving 30 features in total.

Using Pandas to load data with read_csv

Pandas is a widely used third-party Python library for data manipulation; it is not part of the Python standard library. Its read_csv function loads comma-separated files and accepts a number of useful parameters.
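As a short sketch of a few read_csv parameters that are often handy: the CSV text below is a made-up two-row sample (not real measurements), read from an in-memory string so the example runs without a file. With your own data you would pass a file path instead.

```python
import io
import pandas as pd

# Made-up sample rows for illustration only, not real measurements.
csv_text = """id,diagnosis,radius_mean,texture_mean
101,M,17.5,?
102,B,12.1,18.3
"""

df_small = pd.read_csv(
    io.StringIO(csv_text),            # a file path or URL also works here
    sep=",",                          # field delimiter (comma is the default)
    header=0,                         # row number that holds the column names
    index_col="id",                   # use the id column as the row index
    usecols=["id", "diagnosis", "radius_mean", "texture_mean"],  # columns to keep
    na_values=["?"],                  # extra strings to treat as missing
    dtype={"diagnosis": "category"},  # set a column type explicitly
)
print(df_small)
```

Setting na_values is worth the effort when a file marks missing entries with a placeholder, because the column would otherwise be read as text rather than numbers.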

Loading the Dataset
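The path to the notebook's data file is not given in this text, so the sketch below uses the copy of the Wisconsin Diagnostic Breast Cancer data bundled with scikit-learn as a stand-in. The helper name load_wdbc and the renaming of columns to the radius_mean style are assumptions made for illustration; a real CSV may spell some columns differently.

```python
import re
from sklearn.datasets import load_breast_cancer

def load_wdbc():
    """Return the bundled breast cancer data as a DataFrame (hypothetical helper)."""
    df = load_breast_cancer(as_frame=True).frame
    # "mean radius" -> "radius_mean", "radius error" -> "radius_se",
    # "worst radius" -> "radius_worst" (this naming style is an assumption).
    df.columns = [
        re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
        .replace(" error", "_se")
        .replace(" ", "_")
        for c in df.columns
    ]
    # scikit-learn codes the target as 0 = malignant, 1 = benign.
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})
    return df

df = load_wdbc()
print(df.shape)  # (569, 31): 30 features plus the diagnosis column
```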

Exploring the data with Pandas

For a Pandas dataframe named df, here are some useful ways to explore the dataframe.
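A few commonly used exploration calls are sketched below. The loader is a hypothetical stand-in that reads scikit-learn's bundled copy of the data and renames columns to the radius_mean style; substitute your own df.

```python
import re
from sklearn.datasets import load_breast_cancer

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

df = load_wdbc()
print(df.shape)                  # (number of rows, number of columns)
print(df.columns.tolist())       # column names
print(df.head())                 # first five rows
print(df.dtypes.head())          # data type of the first few columns
print(df.describe().T.head())    # summary statistics for numeric columns
print(df.isna().sum().sum())     # total count of missing values
df.info()                        # overview of types and non-null counts
```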

Filtering the dataset with Pandas

Boolean indexing and method chaining in Pandas provide an effective way to filter datasets.
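Boolean indexing and chaining might look like the sketch below. The loader is a hypothetical stand-in using scikit-learn's bundled data, and the radius cutoff of 15 is an arbitrary value chosen only to show the syntax.

```python
import re
from sklearn.datasets import load_breast_cancer

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

df = load_wdbc()

# Boolean indexing: a True/False mask selects the matching rows.
malignant = df[df["diagnosis"] == "M"]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
large_benign = df[(df["diagnosis"] == "B") & (df["radius_mean"] > 15)]

# Chaining: filter rows, pick columns, sort, and truncate in one expression.
top5 = (
    df.loc[df["diagnosis"] == "M", ["radius_mean", "area_mean"]]
      .sort_values("radius_mean", ascending=False)
      .head(5)
)
print(len(malignant), len(large_benign))
print(top5)
```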

A really useful command for exploring a field is .value_counts(), which counts how many rows hold each distinct value.
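For example, counting the diagnoses could be sketched as follows, again with a hypothetical stand-in loader built on scikit-learn's bundled copy of the data.

```python
import re
from sklearn.datasets import load_breast_cancer

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

df = load_wdbc()
counts = df["diagnosis"].value_counts()                # rows per category
shares = df["diagnosis"].value_counts(normalize=True)  # same, as fractions
print(counts)
print(shares.round(3))
```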

Exploratory Data Analysis

Visualizing data is an important part of exploring data, identifying outliers, and looking for relationships. Seaborn is a visualization library built on top of the matplotlib library.

Seaborn supports many types of plots for graphing data.

We are going to graph two histograms on the same graph looking at the values of radius_mean for the benign and malignant cells.

Visual comparison with paired plotting in seaborn

Seaborn comes with a very useful pairplot function, which draws a scatter plot of every field in a dataset against every other field.

Key things to look for.

  1. Using a Python list comprehension, we select only the columns whose names end with the string "mean". There are over 30 variables, so we want to narrow down the field.
  2. In the pairplot command we restrict the columns to those ending in "mean" and color the points according to diagnosis.

The goal is to look for pairs of variables that can best separate malignant from benign cells.

Visualizing a heatmap of correlation between the fields

Another way is to perform a correlation analysis between each pair of fields. Try running df.corr() separately and look at its output.
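A minimal sketch of such a heatmap follows. The loader is a hypothetical stand-in using scikit-learn's bundled data; the color map and figure size are arbitrary choices. Non-numeric columns are dropped before calling corr() so the call works across pandas versions.

```python
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

df = load_wdbc()

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes("number").corr()

fig, ax = plt.subplots(figsize=(10, 8))
# Fix the color scale to [-1, 1] so zero correlation sits at the midpoint.
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```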

Method Chaining, sorting, and correlating in one command
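One way to chain these steps is sketched below: encode the diagnosis as a number, correlate every feature with it, and sort the result. The loader and the helper column name malignant are assumptions made for illustration.

```python
import re
from sklearn.datasets import load_breast_cancer

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

df = load_wdbc()
ranked = (
    df.assign(malignant=(df["diagnosis"] == "M").astype(int))  # encode M/B as 1/0
      .select_dtypes("number")       # keep numeric columns only
      .corr()["malignant"]           # correlation of each column with the label
      .drop("malignant")             # remove the label's correlation with itself
      .sort_values(ascending=False)  # strongest positive correlation first
)
print(ranked.head())
```

Sorting by ranked.abs() instead would rank features by strength of association regardless of sign.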

Creating a simple classification test

The fractal dimension mean (fractal_dimension_mean) is our highest correlated variable. What is a fractal dimension measurement? Check out "How to measure a coastline with fractal geometry".
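One reading of a simple classification test is a single-feature cutoff: call a case malignant when the feature exceeds a threshold, then report sensitivity and specificity. The sketch below takes that reading as an assumption. The helper threshold_test, the stand-in loader, and the use of the median as cutoff are all hypothetical choices, and any numeric column can be substituted for the feature.

```python
import re
from sklearn.datasets import load_breast_cancer

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

def threshold_test(df, feature, cutoff):
    """Call a case malignant when feature > cutoff; return (sensitivity, specificity)."""
    predicted = df[feature] > cutoff
    actual = df["diagnosis"] == "M"
    tp = (predicted & actual).sum()    # malignant, called malignant
    tn = (~predicted & ~actual).sum()  # benign, called benign
    fp = (predicted & ~actual).sum()   # benign, called malignant
    fn = (~predicted & actual).sum()   # malignant, called benign
    return tp / (tp + fn), tn / (tn + fp)

df = load_wdbc()
feature = "fractal_dimension_mean"  # any numeric column can be tried here
sens, spec = threshold_test(df, feature, df[feature].median())
print(f"{feature}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Trying several features and cutoffs shows how the trade-off between sensitivity and specificity shifts.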

Creating a Receiver Operating Characteristic Curve
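A ROC curve sweeps the cutoff over all possible values and plots the true positive rate against the false positive rate. The sketch below uses scikit-learn's roc_curve and roc_auc_score with a single feature as the score. The choice of radius_mean and the stand-in loader are assumptions made for illustration.

```python
import re
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

df = load_wdbc()
y_true = (df["diagnosis"] == "M").astype(int)  # 1 = malignant (positive class)
score = df["radius_mean"]                      # higher value = more suspicious

fpr, tpr, thresholds = roc_curve(y_true, score)
auc = roc_auc_score(y_true, score)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"radius_mean (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
ax.set_xlabel("False positive rate (1 - specificity)")
ax.set_ylabel("True positive rate (sensitivity)")
ax.legend()
```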

Creating a Logistic Regression Model
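A minimal logistic regression sketch with scikit-learn follows. The feature set (the ten "mean" columns), the 75/25 split, the random seed, and the stand-in loader are all assumptions chosen for illustration.

```python
import re
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

df = load_wdbc()
X = df[[c for c in df.columns if c.endswith("mean")]]
y = (df["diagnosis"] == "M").astype(int)  # 1 = malignant

# Hold out a quarter of the cases; stratify keeps the M/B ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A generous max_iter helps the solver converge on unscaled features.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)             # fraction classified correctly
probabilities = model.predict_proba(X_test)[:, 1]  # estimated P(malignant)
print(f"test accuracy: {accuracy:.3f}")
```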

Creating a Support Vector Machine Model
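A support vector machine can be fitted with the same pattern using scikit-learn's SVC. The RBF kernel, the feature set, the split, and the stand-in loader below are assumptions of this sketch, and the features are deliberately left unscaled.

```python
import re
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

df = load_wdbc()
X = df[[c for c in df.columns if c.endswith("mean")]]
y = (df["diagnosis"] == "M").astype(int)  # 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

svm = SVC(kernel="rbf")  # radial basis function kernel, default settings
svm.fit(X_train, y_train)
svm_accuracy = svm.score(X_test, y_test)
print(f"SVM test accuracy (unscaled features): {svm_accuracy:.3f}")
```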

Normalizing the features of the SVM Model
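SVMs are sensitive to feature scale, because a column with large values such as area_mean can dominate the distance calculation. One common approach, sketched here as an assumption about what normalizing means in this tutorial, is to standardize each feature to zero mean and unit variance inside a pipeline, so that the scaler is fitted on the training data only. The loader, feature set, and split are illustrative choices.

```python
import re
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def load_wdbc():
    # Hypothetical stand-in loader: bundled data, columns renamed to radius_mean style.
    df = load_breast_cancer(as_frame=True).frame
    df.columns = [re.sub(r"^(mean|worst) (.*)$", r"\2_\1", c)
                  .replace(" error", "_se").replace(" ", "_") for c in df.columns]
    df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})  # 0 = malignant
    return df

df = load_wdbc()
X = df[[c for c in df.columns if c.endswith("mean")]]
y = (df["diagnosis"] == "M").astype(int)  # 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training data, then applies it before the SVM.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_train, y_train)
scaled_accuracy = scaled_svm.score(X_test, y_test)

# Same model without scaling, for comparison.
raw_accuracy = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)
print(f"raw: {raw_accuracy:.3f}  scaled: {scaled_accuracy:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking information from the test set into the fitted scaling parameters.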