Python Data Cleaning Example

Before data can be analyzed, it has to be put into a usuable format. The following code goes through an example of how a researcher might do that, using the Camp PMCOE data. Keep in mind that your data will be similar, but not exactly the same.

First Steps

The first thing the researcher needs to do is connect to the SQL server, and see what tables are available to them.

This provides the researcher with a list of table names they can conncet to and use. For more detailed information on what each table contains, use the PMAP Data Catalog which must be accessed via the SAFE Desktop environment.

The researcher decides they want to work with the patient, vitals, and symptoms tables. The following code loads those tables into dataframes:

Joining Tables

Now that the researcher has the tables loaded into dataframes, they want to combine, or join, the patient and vitals tables. They do this with the merge() function, and by joining on the unique patient-level identifier, osler_id

The researcher does not need all of those columns, so they select only the ones they want, and they also get rid of duplicate rows:

Next, the researcher only wants hospital encounters where the patient was 18 or older at the time of the encounter.

So first, they calculate the patient's age at the encounter, and put that in a new column, "age_at_encounter"

Then, they filter the dataframe, removing rows where the patient is under 18 at the encounter, or the encounter type is not a hospital encounter.

30-Day Readmission

Now the researcher wants to flag an admission if it occurs within 30 days of the patient's last discharge.

First, they have to sort the dataframe, so it is in order of patient (osler_id), and then admission time (admission_date) for each patient.

Now they are ready to calcualte 30-day readmission. They create a new column, "Readmit30," which will contain a 1 if the admission is within 30 days of the patient's prior discharge, and a 0 otherwise.

Now the researcher adds in symptom data. To do this, they merge their dataframe with the symptoms dataframe created earlier, merging on the unique encounter-level identifier, encounter_id.

Now the researcher has a cleaned dataframe that they are ready to run analysis on. Remember that data cleaning can involve many different steps, depending on what you want to do wiht your data and what you want the final output to look like.