Data science practice 1
Q1. Linux Shell Commands
Q1.1
This exercise (and later in this course) uses the MIMIC-IV data, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic-iv.mit.edu/docs/access/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. (Hint: The CITI training takes a couple hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)
solution: The verification links to the completion report and completion certificate.
Q1.2
The /usr/203b-data/mimic-iv/ folder on the teaching server contains data sets from MIMIC-IV. Refer to https://mimic-iv.mit.edu/docs/datasets/ for details of the data files.
ls -l /usr/203b-data/mimic-iv
Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. Doing so creates unnecessarily large files on storage and is not a big-data-friendly practice. Just read from the data folder /usr/203b-data/mimic-iv directly in the following exercises.
Use Bash commands to answer the following questions.
solution: Done.
Q1.3
Display the contents of the folders core, hosp, and icu. What are the functionalities of the bash commands zcat, zless, zmore, and zgrep?
solution:
ls -l /usr/203b-data/mimic-iv/core
ls -l /usr/203b-data/mimic-iv/hosp
ls -l /usr/203b-data/mimic-iv/icu
The functionalities of the bash commands:
zcat: prints the decompressed contents of a compressed file to standard output, leaving the compressed file on disk unchanged.
zmore: a filter that pages through compressed or plain text files one screenful at a time.
zless: works the same way as zmore, except that the decompressed output is displayed by the less command for additional viewing flexibility.
zgrep: searches for expressions in a file even if it is compressed.
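A minimal sketch of how these tools behave, run on a small throwaway file rather than on the MIMIC-IV data. The file name demo.csv.gz and its contents are made up for illustration only.

```bash
# Build a tiny gzip-compressed CSV in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf 'id,type\n1,URGENT\n2,ELECTIVE\n3,URGENT\n' | gzip > "$tmpdir/demo.csv.gz"

# zcat streams the decompressed text to stdout; the .gz file on disk is untouched.
first_line=$(zcat "$tmpdir/demo.csv.gz" | head -n 1)
echo "$first_line"

# zgrep searches the compressed file directly; -c counts the matching lines.
urgent_count=$(zgrep -c 'URGENT' "$tmpdir/demo.csv.gz")
echo "$urgent_count"

# zless and zmore are interactive pagers, so they are only shown as comments:
#   zless "$tmpdir/demo.csv.gz"
#   zmore "$tmpdir/demo.csv.gz"
```

Because every command reads the compressed file in place, no decompressed copy is ever written to storage.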
Q1.4
What’s the output of the following bash script?
```bash
for datafile in /usr/203b-data/mimic-iv/core/*.gz
do
ls -l $datafile
done
```
solution: The bash script prints the long listing (permissions, owner, size, modification time, and name) of every .gz file in the folder core.
Display the number of lines in each data file using a similar loop.
solution:
for datafile in /usr/203b-data/mimic-iv/core/*.gz
do
  ls -l "$datafile"
  echo "the number of lines:"
  zcat "$datafile" | awk 'END { print NR }'
done
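The same loop can be wrapped in a small function and written with wc -l, which the hint in Q1.5 also suggests. This is only a sketch: the function name count_lines and the two toy files are assumptions made for illustration, and the temporary folder stands in for the course data folder.

```bash
# Hypothetical stand-in for the course data folder.
datadir=$(mktemp -d)
printf 'a,b\n1,2\n3,4\n' | gzip > "$datadir/small.csv.gz"
printf 'x\n' | gzip > "$datadir/tiny.csv.gz"

# Print "<file name>: <line count>" for every .gz file in the folder given as $1.
count_lines() {
  for datafile in "$1"/*.gz
  do
    # Piping into wc keeps the file name out of wc's output.
    n=$(zcat "$datafile" | wc -l)
    # $((n)) strips any padding that some wc implementations add.
    echo "$(basename "$datafile"): $((n))"
  done
}

count_lines "$datadir"
```

Quoting "$datafile" keeps the loop correct even if a file name contains spaces.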
Q1.5
Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? What are the possible values taken by each of the variables admission_type, admission_location, insurance, language, marital_status, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq, wc, and so on.)
solution:
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | head -n 5
echo "the number of rows (header row included):"
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | wc -l
echo "the number of unique patients (header row excluded):"
# Drop the header before sorting; after sorting it is no longer the first line.
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz |
  tail -n +2 | awk -F ',' '{ print $1 }' | sort -u | wc -l
for i in 6 7 9 10 11 12
do
  echo "---------------------------"
  # Print the column name taken from the header row.
  zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz |
    awk -F ',' -v i="$i" 'NR == 1 { printf "%-19s~%-20s\n", $i,
      "(count & values (* NULL/NA included))"; exit }'
  # Count each distinct value, skipping the header row.
  zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz |
    awk -F ',' -v i="$i" 'NR > 1 { print $i }' | sort | uniq -c
done
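The loop above addresses columns by position and decompresses the file twice per column. As a sketch of an alternative, the column can be looked up by its header name and tabulated in a single awk pass. The function name tabulate and the toy file are hypothetical, and the naive comma split assumes that no field contains an embedded comma.

```bash
# Toy compressed CSV (made-up data, not MIMIC-IV).
tmpdir=$(mktemp -d)
printf 'subject_id,insurance\n10,Medicare\n11,Other\n10,Medicare\n12,\n' |
  gzip > "$tmpdir/toy.csv.gz"

# tabulate FILE COLUMN_NAME: print "count value" for each distinct value.
tabulate() {
  zcat "$1" |
    awk -F ',' -v col="$2" '
      # Header row: find the position of the requested column, then skip it.
      NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
      # Data rows: count each value, but only if the column was found.
      c       { counts[$c]++ }
      END     { for (v in counts) print counts[v], (v == "" ? "(empty)" : v) }
    ' | sort -k1,1nr -k2
}

tabulate "$tmpdir/toy.csv.gz" insurance
```

Looking the column up by name avoids hard-coding positions such as 6 or 12, which would silently break if the column order changed.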
Q2. Who’s popular in Pride and Prejudice
Q2.1
You and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save it to your local folder.
```bash
curl http://www.gutenberg.org/cache/epub/42671/pg42671.txt > pride_and_prejudice.txt
```
Do not put this text file pride_and_prejudice.txt in Git. Using a for loop, how would you tabulate the number of times each of the four characters is mentioned?
solution: grep -o prints every string that matches a name on its own line, so piping its output to wc -l counts how many times the name is mentioned.
declare -a name_arry=("Elizabeth" "Jane" "Lydia" "Darcy")
for name_need in "${name_arry[@]}"
do
  grep -o "$name_need" pride_and_prejudice.txt | wc -l |
    awk -v var="$name_need" '{print "---------------"
      printf "%-10s|%-5s\n", var, $1}'
done
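The choice of grep -o matters because a name can appear more than once on a line. The sketch below contrasts counting occurrences with counting lines on a two-line sample; the sample text is invented for illustration and is not taken from the novel.

```bash
# Made-up sample text in which one line mentions Elizabeth twice.
tmpfile=$(mktemp)
printf 'Darcy spoke to Elizabeth. Elizabeth smiled.\nJane wrote to Elizabeth.\n' > "$tmpfile"

# Occurrences: -o prints each match on its own line, so wc -l counts every mention.
mentions=$(grep -o 'Elizabeth' "$tmpfile" | wc -l)

# Lines: -c counts lines containing at least one match, which can be smaller.
lines=$(grep -c 'Elizabeth' "$tmpfile")

echo "mentions: $((mentions)), lines: $((lines))"
```

Note that a plain pattern also matches inside longer words; adding -w would restrict the count to whole-word matches.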
Q2.2
What’s the difference between the following two commands?
```bash
echo 'hello, world' > test1.txt
```
and
```bash
echo 'hello, world' >> test2.txt
```
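solution:
'> test1.txt' redirects output to test1.txt, creating the file if it does not exist and overwriting its contents if it does.
'>> test2.txt' redirects output to test2.txt, creating the file if it does not exist and appending the redirected output at the end if it does.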
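The difference only becomes visible when a command is run more than once. A small sketch, writing into a temporary directory (an assumption made here so that no existing files are touched):

```bash
tmpdir=$(mktemp -d)

# Run each command twice to expose the difference.
echo 'hello, world' > "$tmpdir/test1.txt"
echo 'hello, world' > "$tmpdir/test1.txt"    # second run replaces the first line

echo 'hello, world' >> "$tmpdir/test2.txt"
echo 'hello, world' >> "$tmpdir/test2.txt"   # second run adds another line

n1=$(wc -l < "$tmpdir/test1.txt")
n2=$(wc -l < "$tmpdir/test2.txt")
echo "test1.txt has $((n1)) line(s); test2.txt has $((n2)) line(s)"
```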
Q2.3
Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:
```bash
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
```
Using chmod, make the file executable by the owner, and run
```bash
./middle.sh pride_and_prejudice.txt 20 5
```
Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?
solution:
./middle.sh pride_and_prejudice.txt 20 5
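The script prints lines 16 through 20 of pride_and_prejudice.txt: head -n 20 keeps the first 20 lines of the file, and tail -n 5 keeps the last 5 of those.
The meaning of:
"$1": the first positional argument passed to the script (pride_and_prejudice.txt here), used as the file name.
"$2": the second positional argument (20 here), used as the last line to keep.
"$3": the third positional argument (5 here), used as the number of lines to print.
The first line #!/bin/sh is the shebang. It tells the operating system which interpreter to use when the script is executed directly, as in ./middle.sh. Here that interpreter is /bin/sh, the system’s POSIX shell, which is not necessarily bash. Without this line, the interpreter would depend on whichever shell launches the script.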
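A sketch of the whole exercise on a file whose contents make the selected lines obvious. The input file numbers.txt is a hypothetical stand-in for the novel, and chmod u+x is one way to grant execute permission to the owner only.

```bash
tmpdir=$(mktemp -d)

# Recreate middle.sh exactly as given in the exercise.
cat > "$tmpdir/middle.sh" <<'EOF'
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF

# u+x adds execute permission for the owner (user) and nobody else.
chmod u+x "$tmpdir/middle.sh"

# Hypothetical input: a file whose line k contains the number k.
seq 1 30 > "$tmpdir/numbers.txt"

# $1=numbers.txt, $2=20, $3=5: keep the first 20 lines, then the last 5 of those.
result=$("$tmpdir/middle.sh" "$tmpdir/numbers.txt" 20 5)
echo "$result"
```

With this input the script prints the numbers 16 to 20, one per line, which mirrors lines 16 through 20 of the novel in the original command.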
Q3. More fun with Linux
Try these commands in Bash and interpret the results: cal, cal 2021, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.
solution:
cal
cal displays the calendar of the current month.
cal 2021
cal 2021 displays the calendar for every month of 2021.
cal 9 1752
cal 9 1752 displays an incomplete calendar for September 1752: 2 September is followed directly by 14 September. Reason: the Kingdom of Great Britain adopted the Gregorian calendar reform in September 1752, and 11 days (3 to 13 September) were skipped to realign the calendar. [wiki]
date
date prints the current date and time in the system’s default time zone.
hostname
hostname provides the name of the server.
arch
arch prints the hardware architecture of the machine (for example, x86_64).
uname -a
uname -a prints the kernel name, host name, kernel release and version, and hardware architecture of the current machine.
uptime
uptime reports the current time, how long the system has been running, the number of users with running sessions, and the system load averages for the past 1, 5, and 15 minutes.
who am i
who am i displays the login name, terminal, and login time of the current session only.
who
who displays account information for every logged-in user: login name, terminal, time of login, and the host the user is logged in from.
w
w displays information about currently logged in users and what each user is doing.
id
id prints the user ID (UID), primary group ID (GID), and group memberships of the current user.
last | head
last | head displays the 10 most recent login records. last lists logins and logouts recorded in /var/log/wtmp, most recent first, and head keeps the first 10 lines of that listing.
echo {con,pre}{sent,fer}{s,ed}
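echo {con,pre}{sent,fer}{s,ed} uses brace expansion to generate every combination of one element from each of the three brace groups, giving 2 x 2 x 2 = 8 words: consents consented confers confered presents presented prefers prefered. No variable is involved; the shell expands the braces before echo runs.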
time sleep 5
time sleep 5 runs sleep 5, which pauses for 5 seconds, and then reports how long the command took (real, user, and sys time).
set -o history
echo "zza"
history | tail
history | tail shows the 10 most recently executed commands.
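The echo {con,pre}{sent,fer}{s,ed} command listed above relies on brace expansion, which can be checked with a short sketch. Brace expansion is a Bash feature rather than part of POSIX sh, so the sketch calls bash explicitly; the variable names words and count are introduced here for illustration.

```bash
# Brace expansion is a Bash feature (not POSIX sh), so call bash explicitly.
words=$(bash -c 'echo {con,pre}{sent,fer}{s,ed}')
echo "$words"

# 2 prefixes x 2 stems x 2 suffixes should give 8 words.
count=$(echo "$words" | wc -w)
echo "number of words: $((count))"
```

The words come out in left-to-right order with the rightmost group varying fastest, which is why consents and consented appear first.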