This textbook was written for the clinical research community at Johns Hopkins that uses the Precision Medicine Analytics Platform (PMAP). The notebooks are available in HTML form on the Precision Medicine portal and in computational form in the CAMP-share folder on Crunchr (Crunchr.pm.jh.edu).

Data quality assessment example

This example revisits blood pressure values: it groups them into clinical classes of blood pressure and checks the data for outliers and improbable values.
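As a minimal sketch of the classification and outlier check described above, the snippet below labels a few made-up readings. The class cut-offs follow the 2017 ACC/AHA categories, which is an assumption: the notebook's own thresholds may differ. The plausibility limits in `is_improbable` and both function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only. Cut-offs follow the 2017 ACC/AHA categories
# (an assumption); plausibility limits below are hypothetical.

def classify_bp(systolic, diastolic):
    """Return a clinical blood pressure class for one reading in mmHg."""
    if systolic > 180 or diastolic > 120:
        return "Hypertensive crisis"
    if systolic >= 140 or diastolic >= 90:
        return "Stage 2 hypertension"
    if systolic >= 130 or diastolic >= 80:
        return "Stage 1 hypertension"
    if systolic >= 120:
        return "Elevated"
    return "Normal"


def is_improbable(systolic, diastolic):
    """Flag readings outside hypothetical plausibility limits."""
    return (
        not 50 <= systolic <= 300
        or not 20 <= diastolic <= 200
        or diastolic >= systolic
    )


# Made-up readings (systolic, diastolic), including two likely data-entry errors.
readings = [(118, 76), (125, 78), (134, 82), (150, 95), (190, 125), (12, 80), (120, 1200)]

for sbp, dbp in readings:
    label = "Improbable" if is_improbable(sbp, dbp) else classify_bp(sbp, dbp)
    print(sbp, dbp, label)
```

Checking for improbable values before classifying keeps likely data-entry errors, such as a diastolic value of 1200, from being counted as a hypertensive crisis.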

Learning Objectives

  1. Classify and visualize clinical classes of blood pressure values
  2. Perform a Principal Component Analysis (PCA) on the values
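To give a feel for the second objective, here is a small, self-contained sketch of PCA on two variables (systolic and diastolic pressure) using only the standard library. It is not the notebook's own code: the function name `pca_2d` and the readings are hypothetical, and in practice a library routine such as scikit-learn's `PCA` would more likely be used.

```python
import math


def pca_2d(points):
    """PCA for two variables.

    Returns ((eigenvalue_1, eigenvalue_2), first_component, explained_ratio).
    Assumes at least two points that are not all identical.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the centred data.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    root = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    lam1, lam2 = trace / 2 + root, trace / 2 - root

    # Eigenvector belonging to the larger eigenvalue (first principal component).
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)

    return (lam1, lam2), (vx / norm, vy / norm), lam1 / trace


# Made-up (systolic, diastolic) readings in mmHg.
readings = [(110, 70), (120, 78), (130, 85), (145, 92), (160, 100)]
eigenvalues, pc1, explained = pca_2d(readings)
print("first component:", pc1)
print("share of variance explained:", round(explained, 3))
```

Because systolic and diastolic pressure tend to rise together, most of the variance usually falls on the first component; readings that sit far from that axis are candidates for a closer look.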

Connect to PMAP Database

Data Exploration

Remove the # symbol from each line below to see what these different commands do.

Tests of data quality

| Dimension | Definition | Example |
| --- | --- | --- |
| Completeness | Degree to which data is populated | Blanks or null values |
| Conformity | Corresponds to expected formats | Dates, ZIP codes |
| Consistency | Relational integrity | Discharge after admission |
| Synchronization | Consistency with external reference data | Stale reference tables |
| Uniqueness | Duplication of records | Are any records identical to each other? |
| Timeliness | When the data was last updated | Data latency |
| Accuracy | Degree to which data is verified | Validating email addresses |
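Four of these dimensions (completeness, conformity, consistency, and uniqueness) can be checked mechanically. The sketch below shows one way to do so on a handful of invented records. The field names and values are hypothetical and do not come from PMAP.

```python
import re
from datetime import date

# Hypothetical encounter records; field names and values are illustrative only.
records = [
    {"id": 1, "zip": "21205", "admit": date(2020, 1, 3), "discharge": date(2020, 1, 7), "sbp": 128},
    {"id": 2, "zip": "2120", "admit": date(2020, 2, 10), "discharge": date(2020, 2, 8), "sbp": None},
    {"id": 3, "zip": "21287", "admit": date(2020, 3, 1), "discharge": date(2020, 3, 2), "sbp": 141},
    {"id": 3, "zip": "21287", "admit": date(2020, 3, 1), "discharge": date(2020, 3, 2), "sbp": 141},
]

# Completeness: share of records with a populated systolic value.
completeness = sum(r["sbp"] is not None for r in records) / len(records)

# Conformity: ZIP codes should be exactly five digits.
bad_zip_ids = [r["id"] for r in records if not re.fullmatch(r"\d{5}", r["zip"])]

# Consistency: discharge must not precede admission.
inconsistent_ids = [r["id"] for r in records if r["discharge"] < r["admit"]]

# Uniqueness: count records identical to one already seen.
seen = set()
duplicate_count = 0
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        duplicate_count += 1
    else:
        seen.add(key)

print("completeness:", completeness)
print("non-conforming ZIP ids:", bad_zip_ids)
print("discharge-before-admit ids:", inconsistent_ids)
print("duplicate records:", duplicate_count)
```

Synchronization, timeliness, and accuracy generally need information from outside the table itself, such as a reference table, a load timestamp, or an independent source, so they are left out of this sketch.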

Principles of Tidy Data (Hadley Wickham)

https://vita.had.co.nz/papers/tidy-data.pdf

  1. Each observation forms a row
  2. Each variable forms a column
  3. Each value is placed in its own cell
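The three principles above can be illustrated by reshaping a small "wide" table, where each visit is its own column, into a tidy one with one row per patient visit. The data and column names below are made up for illustration. With pandas, `DataFrame.melt` performs the same kind of reshaping.

```python
# Hypothetical wide table: one row per patient, one column per visit.
wide = [
    {"patient": "A", "visit_1": 120, "visit_2": 135},
    {"patient": "B", "visit_1": 142, "visit_2": 138},
]

# Tidy form: one row per observation (a patient at a visit),
# one column per variable (patient, visit, systolic).
tidy = [
    {"patient": row["patient"], "visit": int(col.split("_")[1]), "systolic": value}
    for row in wide
    for col, value in row.items()
    if col.startswith("visit_")
]

for row in tidy:
    print(row)
```

In the wide layout the visit number is hidden in the column names. In the tidy layout it becomes an ordinary variable that can be filtered, grouped, or plotted.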