Q1. PhysioNet credential

Q1.1

At this moment, you should already get credentialed on the PhysioNet. Please include a screenshot of your Data Use Agreement for the MIMIC-IV (v0.4).

solution:

Q2. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

There are quite a few utilities in R for reading data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three programs: read.csv in base R, read_csv in tidyverse, and fread in the popular data.table package. Is there any speed difference?

In this homework, we stick to the tidyverse.

solution:

.timer <- function(fun_input){
  timestart <- Sys.time()
  adm <- fun_input(str_c(mimic_path, "/core/admissions.csv.gz")) 
  timeend <- Sys.time()
  return(timeend - timestart)
}
lapply(list("read_csv" = read_csv,
            "read.csv" = read.csv,
            "fread" = fread), .timer)

fread is proved to be the fastest function to read the file. read.csv is the slowest function.

Q3. ICU stays

icustays.csv.gz (https://mimic-iv.mit.edu/docs/datasets/icu/icustays/) contains data about Intensive Care Units (ICU) stays. Summarize following variables using appropriate numerics or graphs:

icustays <- read_csv(str_c(mimic_path, "/icu/icustays.csv.gz"))

Q3.1 how many unique stay_id?

solution:

icustays %>% 
  distinct(stay_id) %>% 
  nrow() %>%
  str_c(., " unique stay_id")  

Q3.2 how many unique subject_id?

solution:

icustays %>%  
  distinct(subject_id) %>% 
  nrow() %>%
  str_c(., " unique subject_id")

Q3.3 length of ICU stay

solution: please note that we took the log scale of x-axis to present data in a more readable way

icustays$los %>% summary
ggplot(data = icustays) +
  geom_histogram(mapping = aes(x = los), bins = 150) +
  scale_x_log10() +
  xlab("length of ICU stay (days)")+
  theme_classic() 

Q3.4 first ICU unit

solution:

icustays %>% 
  group_by(first_careunit) %>% 
  summarise(n = n()) %>%
  arrange(desc(n))
ggplot(data = icustays, 
       mapping = aes(x = factor(1), fill = first_careunit)) +
  scale_fill_discrete(name = "ICUs") +
  xlab(NULL) +
  ylab(NULL) +
  geom_bar() + 
  coord_polar("y") +
  theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank())

Q3.5 last ICU unit

solution:

icustays %>% 
  group_by(last_careunit) %>% 
  summarise(n = n()) %>%
  arrange(desc(n))
ggplot(data = icustays, 
       mapping = aes(x = factor(1), fill = last_careunit)) +
  scale_fill_discrete(name = "ICUs") +
  xlab(NULL) +
  ylab(NULL) +
  geom_bar() + 
  coord_polar("y") +
  theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank())

Q4. admission data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic-iv.mit.edu/docs/datasets/core/admissions/ for details of each field in this file. Summarize following variables using appropriate graphs. Explain any patterns you observe.

adm <- read_csv(str_c(mimic_path, "/core/admissions.csv.gz"))

Note it is possible that one patient (uniquely identified by the subject_id) is admitted into hospital multiple times. When summarizing some demographic information, it makes sense to summarize based on unique patients.

Q4.1 admission year

solution:

adm %<>% mutate(admission_year=year(adm$admittime))
adm$admission_year %>% summary
ggplot(adm, aes(x = admission_year)) + 
  geom_bar(colour = "black", fill = "white") +
  theme_classic()

Q4.2 admission month

solution:

adm %<>% mutate(admission_month=month(adm$admittime))
adm %>% 
  group_by(admission_month) %>% 
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm) + 
  geom_bar(mapping = aes(x = admission_month, 
                         fill = admission_month %>% as.factor)) +
  scale_fill_discrete(name = "month") +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.3 admission month day

solution:

adm %<>% mutate(admission_monthday=day(adm$admittime))
adm$admission_monthday %>% summary
ggplot(adm, aes(x = admission_monthday)) + 
  geom_bar(colour = "black", fill = "white") +
  theme_classic()

Q4.4 admission week day

solution:

adm %<>% mutate(admission_weekday=wday(adm$admittime))
adm %>% 
  group_by(admission_weekday) %>% 
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm) + 
  geom_bar(mapping = aes(x = admission_weekday, 
                         fill = admission_weekday %>% as.factor)) +
  scale_fill_discrete(name = "weekday") +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.5 admission hour (anything unusual?)

solution:

adm %<>% mutate(admission_hour=hour(adm$admittime))
adm$admission_hour %>% summary
ggplot(adm, aes(x = admission_hour)) + 
  geom_bar(colour = "black", fill = "white") +
  theme_classic() 

According to the barplot, we found that there are more admissions during the night than during the day.

Q4.6 number of deaths in each year

Firstly we need check whether the indicators for the death cases are consistent (deathtime,hospital_expire_flag).

adm %>%
  mutate(test = ifelse((is.na(deathtime) == FALSE & 
                          hospital_expire_flag == 0)|
                         (is.na(deathtime) == TRUE & 
                            hospital_expire_flag == 1), 1, 0)) %>%
  filter(test == 1) %>%
  select(subject_id, deathtime, 
         hospital_expire_flag, discharge_location)

Note: We found that there are some cases whose deathtime is missing but hospital_expire_flag equals 1. Referring to the other terms (e.g. discharge_location), hospital_expire_flag is turn out to be the more presise indicator. Thus, we decided to choose hospital_expire_flag as the indicator for death cases.

solution:

adm %>%
  filter(hospital_expire_flag == 1) %>%
  group_by(admission_year) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm %>% filter(is.na(deathtime) == FALSE), 
       aes(x = admission_year)) + 
  geom_bar(colour = "black", fill = "white") +
  theme_classic()

Q4.7 admission type

solution:

adm %>% 
  group_by(admission_type) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm) + 
  geom_bar(mapping = aes(x = admission_type, fill = admission_type)) +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.8 number of admissions per patient

solution:

number_of_ad <- adm %>% 
  group_by(subject_id) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(n) %>%
  mutate(num = ifelse(n>=5, 5, n))
number_of_ad$num %<>% ordered(levels = c("1", "2", "3", "4", "5"))
ggplot(number_of_ad) + 
  geom_bar(mapping = aes(x = num , fill = num)) +
  xlab("number of admissions") +
  scale_fill_discrete(name = "number of admissions", 
                      labels = c("1", "2", "3", "4", ">=5")) +
  theme(panel.background = element_blank(),
        axis.text.x = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.9 admission location

solution:

adm %>% 
  group_by(admission_location) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm) + 
  geom_bar(mapping = aes(x = admission_location, fill = admission_location)) +
  scale_y_sqrt() +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.9 discharge location

solution:

adm %>% 
  group_by(discharge_location) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm) + 
  geom_bar(mapping = aes(x = discharge_location, fill = discharge_location)) +
  scale_y_sqrt() +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.10 insurance

solution: (summarized based on unique patients)

adm %>% 
  distinct(subject_id, .keep_all = TRUE) %>% 
  group_by(insurance) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm %>% 
  distinct(subject_id, .keep_all = TRUE)) + 
  geom_bar(mapping = aes(x = insurance, fill = insurance)) +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.11 language

solution: (summarized based on unique patients)

adm %>% 
  distinct(subject_id, .keep_all = TRUE) %>% 
  group_by(language) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm %>% 
  distinct(subject_id, .keep_all = TRUE)) + 
  geom_bar(mapping = aes(x = language, fill = language)) +
  scale_fill_discrete(label = c("unknown","English")) +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.12 marital status

solution: (summarized based on unique patients)

adm %>% 
  distinct(subject_id, .keep_all = TRUE) %>%
  group_by(marital_status) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm %>% 
  distinct(subject_id, .keep_all = TRUE)) + 
  geom_bar(mapping = aes(x = marital_status, fill = marital_status)) +
  scale_fill_discrete(name = "marital status") +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.13 ethnicity

solution: (summarized based on unique patients)

adm %>% 
  distinct(subject_id, .keep_all = TRUE) %>% 
  group_by(ethnicity) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm %>% 
  distinct(subject_id, .keep_all = TRUE)) + 
  geom_bar(mapping = aes(x = ethnicity, fill = ethnicity)) +
  scale_fill_discrete(name = "ethnicity") +
  scale_y_sqrt() +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q4.14 death

solution:

adm %<>% mutate(death = ifelse(hospital_expire_flag == 1, "Yes", "No"))
adm %>% 
  group_by(death) %>%
  summarise(n = n(), .groups = "keep") %>%
  arrange(desc(n))
ggplot(adm) + 
  geom_bar(mapping = aes(x = death, fill = death)) +
  scale_fill_discrete(name = "death") +
  scale_y_sqrt() +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q5. patient data

Explore patients.csv.gz (https://mimic-iv.mit.edu/docs/datasets/core/patients/) and summarize following variables using appropriate numerics and graphs:

patients <- read_csv(str_c(mimic_path, "/core/patients.csv.gz"))

Q5.1 gender

solution: (summarized based on unique patients)

patients %>% 
  distinct(subject_id, .keep_all = TRUE) %>%
  group_by(gender) %>%
  summarise(n = n(), .groups = "keep")
ggplot(patients %>% 
  distinct(subject_id, .keep_all = TRUE)) + 
  geom_bar(mapping = aes(x = gender, fill = gender)) +
  scale_fill_discrete(name = "gender", 
                      label = c("Female", "Male")) +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Q5.2 anchor_age

(explain pattern you see)

solution: (summarized based on unique patients)

patients %>% 
  distinct(subject_id, .keep_all = TRUE) %>%
  select(anchor_age) %>%
  summary
ggplot(patients %>% 
  distinct(subject_id, .keep_all = TRUE)) + 
  geom_density(mapping = aes(x = anchor_age)) +
  xlab("anchor_age")+
  theme_classic() 

According to the summary statistics, the mean anchor age of patients are 41. And figure presents that the anchor age of patients has two peak at 0 and 25 respectively. Very few patients have anchor age at around 10. Then we presumed that the peak in age 0 should be correspond to the missing data or some corrupted data. Thus, we filter out the anchor_age which equals 0 and plot again.

patients %>% 
  distinct(subject_id, .keep_all = TRUE) %>%
  filter(anchor_age != 0) %>%
  ggplot() +
  geom_density(mapping = aes(x = anchor_age)) +
  xlab("anchor_age (filter out 0)")+
  theme_classic() 

Q6. Lab results

labevents.csv.gz (https://mimic-iv.mit.edu/docs/datasets/hosp/labevents/) contains all laboratory measurements for patients.

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), glucose (50931), magnesium (50960), calcium (50893), and lactate (50813). Find the itemids of these lab measurements from d_labitems.csv.gz and retrieve a subset of labevents.csv.gz only containing these items.

solution:

Quick check the data file

readLines(str_c(mimic_path, "/hosp/labevents.csv.gz"), n = 5L)

Begin data processing

item_list <- c("creatinine", "potassium", "sodium", 
               "chloride", "bicarbonate", "hematocrit", 
               "white blood cell", "glucose", 
               "magnesium", "calcium", "lactate")
lab_name <- c("subject_id", "hadm_id", "itemid",
              "charttime", "valuenum")
#              
#read and save
#
if (!file.exists("labevents_icu.csv.gz")){
  system.time(labevents <- fread(str_c(mimic_path, "/hosp/labevents.csv.gz"),
                     select = lab_name, nThread = getDTthreads()))
  labevents %>%
    semi_join(icustays, by = c("subject_id", "hadm_id")) %>%
    fwrite("labevents_icu.csv.gz", nThread = getDTthreads())
}
labitems <- fread(str_c(mimic_path, "/hosp/d_labitems.csv.gz"))
system.time(labevents_icu <- fread("labevents_icu.csv.gz",
                                   nThread = getDTthreads()))
#
#define lookup function
#
.look_up <- function(x, type = "look_up", input = NULL){
  if(type == "look_up"){
    lookup <- input %>%
      filter(str_detect(label, regex(x, ignore_case = TRUE))) %>%
      select(itemid, label) 
    return(lookup)
  } else if(type == "find_id"){
    count <- input %>%
      filter(itemid %in% as_vector(x$itemid)) %>%
      count(itemid) %>%
      arrange(desc(n)) %>% .[1, 1]
    return(count)
  }
}
#
#find data matching item id and save
#
idkey_all <- lapply(item_list %>% as.list, 
                    .look_up, 
                    input = labitems)
idkey <- sapply(idkey_all, 
                .look_up, 
                type="find_id", 
                input = labevents_icu)
if (!file.exists("labevents_icu_selected.csv.gz")){
  labevents_icu %>%
  filter(itemid %in% as_vector(idkey)) %>%
  left_join(Reduce(rbind, idkey_all), by = "itemid") %>%
  fwrite("labevents_icu_selected.csv.gz", nThread = getDTthreads())
}

Check the content of the processed data

labevents_icu <- fread("labevents_icu_selected.csv.gz",
                         nThread = getDTthreads())
labevents_icu

Q7. Vitals from chartered events

We are interested in the vitals for ICU patients: heart rate, mean and systolic blood pressure (invasive and noninvasive measurements combined), body temperature, SpO2, and respiratory rate. Find the itemids of these vitals from d_items.csv.gz and retrieve a subset of chartevents.csv.gz only containing these items.

chartevents.csv.gz (https://mimic-iv.mit.edu/docs/datasets/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid.

d_items.csv.gz (https://mimic-iv.mit.edu/docs/datasets/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

solution:

Quick check the data file

readLines(str_c(mimic_path, "/icu/chartevents.csv.gz"), n = 5L)

Begin data processing

item_list2 <- c("heart rate", "arterial blood pressure systolic", 
                "arterial blood pressure mean", 
                "invasive blood pressure mean",
                "invasive blood pressure systolic", "temperature", 
                "SpO2", "respiratory rate")
char_name <- c("subject_id", "hadm_id", "itemid",
              "charttime", "valuenum")
#
#read and save
#
if (!file.exists("chartevents_icu.csv.gz")){
  system.time(chartevents <- fread(str_c(mimic_path, 
                                         "/icu/chartevents.csv.gz"),
                     select = char_name, nThread = getDTthreads()))
  chartevents %>%
    semi_join(icustays, by = c("subject_id", "hadm_id")) %>%
    fwrite("chartevents_icu.csv.gz", nThread = getDTthreads())
}
d_items <- fread(str_c(mimic_path, "/icu/d_items.csv.gz"))
system.time(chartevents_icu <- fread("chartevents_icu.csv.gz",
                                     nThread = getDTthreads()))
#
#find data matching item id and save
#
idkey_all2 <- lapply(item_list2 %>% as.list, 
                     .look_up, 
                     input = d_items)
idkey2 <- sapply(idkey_all2, 
                 .look_up, 
                 type="find_id", 
                 input = chartevents_icu)
if (!file.exists("chartevents_icu_selected.csv.gz")){
  chartevents_icu %>%
  filter(itemid %in% as_vector(idkey2)) %>%
  left_join(Reduce(rbind, idkey_all2), by = "itemid") %>%
  fwrite("chartevents_icu_selected.csv.gz", nThread = getDTthreads())
}

Check the content of the processed data

chartevents_icu <- fread("chartevents_icu_selected.csv.gz",
                         nThread = getDTthreads())
chartevents_icu

Q8. Putting things together

Let us create a tibble for all ICU stays, where rows are

  • first ICU stay of each unique patient
  • adults (age at admission > 18)

and columns contain at least following variables

  • all variables in icustays.csv.gz
  • all variables in admission.csv.gz
  • all variables in patients.csv.gz
  • first lab measurements during ICU stay
  • first vitals measurement during ICU stay
  • an indicator variable whether the patient died within 30 days of hospital admission

solution:

which(duplicated(labevents_icu) == TRUE) %>% length
which(duplicated(chartevents_icu) == TRUE) %>% length

After a quick check, we found that there are some patients have more than one record at a single time point (duplicated resords). Thus, we need to only keep one record.

#keep the first lab\vitals measurements
labevents_icu %>%
  mutate_at(vars(charttime), ymd_hms) %>%
  left_join(icustays %>% 
              select(c("subject_id", "hadm_id", "intime")), 
            by = c("subject_id", "hadm_id")) %>%
  filter(charttime >= intime) %>%
  group_by(subject_id, label) %>%
  arrange(charttime, .by_group = FALSE) %>%
  slice_head(n = 1) %>%
  select(-c(charttime, itemid, intime)) %>%
  spread(key = label, value = valuenum) %>%
  fwrite("labevents_icu_final.csv.gz", nThread = getDTthreads())

chartevents_icu %>%
  mutate_at(vars(charttime), ymd_hms) %>%
  left_join(icustays %>% 
              select(c("subject_id", "hadm_id", "intime")), 
            by = c("subject_id", "hadm_id")) %>%
  filter(charttime >= intime) %>%
  group_by(subject_id, label) %>%
  arrange(charttime, .by_group = FALSE) %>%
  slice_head(n = 1) %>%
  select(-c(charttime, itemid, intime)) %>%
  spread(key = label, value = valuenum) %>%
  fwrite("chartevents_icu_final.csv.gz", nThread = getDTthreads())

Then we can prepare the final dataset

#read in data
labevents_icu <- fread("labevents_icu_final.csv.gz",
                         nThread = getDTthreads())
chartevents_icu <- fread("chartevents_icu_final.csv.gz",
                         nThread = getDTthreads())
#all variables in `icustays.csv.gz`
final_icu_dataset <- icustays %>%
  # first ICU stay of each unique patient 
  group_by(subject_id) %>%
  slice_min(intime) %>%
  # add all variables in `admission.csv.gz` 
  left_join(adm, by = c("subject_id", "hadm_id")) %>%
  # add all variables in `patients.csv.gz` 
  left_join(patients, by = "subject_id") %>%
  mutate(age_at_adm = year(admittime) - anchor_year + anchor_age) %>%
  # adults (age at admission > 18)
  filter(age_at_adm >= 18) %>%
  # add first vitals measurement during ICU stay
  left_join(chartevents_icu, by = c("subject_id", "hadm_id")) %>%
  # add first lab measurements during ICU stay
  left_join(labevents_icu, by = c("subject_id", "hadm_id")) %>%
  # filter out all death cases [indicator: "hospital_expire_flag"]
  mutate(death_binary = ifelse(hospital_expire_flag == 1, "Yes", "No")) %>%
  # indicator variable whether the patient died within 30 days of admission
  mutate(death_30 = ifelse(deathtime - intime <= 30, "Yes", "No"))

print(final_icu_dataset, width = Inf)

Set NA as “No” in indicator variable and compare with with death_binary

final_icu_dataset %>%
  ungroup() %>% 
  mutate_at(vars(death_30), 
            function(x){ifelse(is.na(x) == TRUE, "No", x)}) %>% 
  count(death_30, death_binary)