Data science practice 1

Q1. Linux Shell Commands

Q1.1

This exercise (and later in this course) uses the MIMIC-IV data, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic-iv.mit.edu/docs/access/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. (Hint: The CITI training takes a couple hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)

solution: The verification links to the completion report and completion certificate.

Q1.2

The /usr/203b-data/mimic-iv/ folder on teaching server contains data sets from MIMIC-IV. Refer to https://mimic-iv.mit.edu/docs/datasets/ for details of data files.

ls -l /usr/203b-data/mimic-iv

Please, do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These create unnecessary big files on storage and are not big data friendly practices. Just read from the data folder /usr/203b-data/mimic-iv directly in following exercises.

Use Bash commands to answer following questions.

solution: Done.

Q1.3

Display the contents in the folders core, hosp, icu. What are the functionalities of the bash commands zcat, zless, zmore, and zgrep?

solution:

ls -l /usr/203b-data/mimic-iv/core

ls -l /usr/203b-data/mimic-iv/hosp

ls -l /usr/203b-data/mimic-iv/icu

The functionalities of bash commands:

zcat: Line utility for viewing the contents of a compressed file without literally uncompressing it.

zmore: a filter which allows examination of compressed or plain text files one screenful at a time on a soft-copy terminal.

zless: works the same way as zmore, except the decompressed output is displayed by the less command for additional viewing flexibility.

zgrep: Search out expressions from a given a file even if it is compressed.

Q1.4

What’s the output of following bash script?

```bash
for datafile in /usr/203b-data/mimic-iv/core/*.gz
  do
    ls -l $datafile
  done
```

solution: The bash script print out all .gz files in the folder core.

Display the number of lines in each data file using a similar loop.

solution:

for datafile in /usr/203b-data/mimic-iv/core/*.gz
  do
    ls -l $datafile
    echo "the number of lines:" 
    zcat $datafile | awk 'END { print NR }'
  done

Q1.5

Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? What are the possible values taken by each of the variable admission_type, admission_location, insurance, language, marital_status, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq, wc, and so on.)

solution:

zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
awk '(NR<=5)'

echo "the number of rows:"
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
awk 'END { print NR }'

echo "the number of unique patients: (colname row excluded)"
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
awk  -F ',' '{ print $1 }' | sort | uniq |
tail -n +2 | awk 'END { print NR }'

for i in 6 7 9 10 11 12; 
do
echo "---------------------------"
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
awk  -F ',' -v i=$i '{ print $i }' | 
awk '(NR<=1)''{printf "%-19s~%-20s\n", $1,
"(count & values (* NULL/NA included))"}' 
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
awk  -F ',' -v i=$i '{ print $i }' | tail -n +2 | sort | uniq -c 
done

Q2. Who’s popular in Price and Prejudice

Q2.1

You and your friend just have finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save to your local folder.

```bash
curl http://www.gutenberg.org/cache/epub/42671/pg42671.txt > pride_and_prejudice.txt
```

Do not put this text file pride_and_prejudice.txt in Git. Using a for loop, how would you tabulate the number of times each of the four characters is mentioned?

solution: Use grep -o prints strings that match an name and then calculated the times.

declare -a name_arry=("Elizabeth" "Jane" "Lydia" "Darcy")
for name_need in ${name_arry[@]}
do
grep -o $name_need pride_and_prejudice.txt | wc -l | 
awk -v var="$name_need" '{print "---------------" 
printf "%-10s|%-5s\n", var, $1}'
done

Q2.2

What’s the difference between the following two commands?

```bash
echo 'hello, world' > test1.txt
```
and

```bash
echo 'hello, world' >> test2.txt
```

solution: '> test1.txt' redirects output to test1.txt, overwriting the file. '>> test1.txt' redirects output to test1.txt, appending the redirected output at the end.

Q2.3

Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:

```bash
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
```

Using chmod make the file executable by the owner, and run

```bash
./middle.sh pride_and_prejudice.txt 20 5
```

Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?

solution:

./middle.sh pride_and_prejudice.txt 20 5

the meaning of:

"$1": the first column/element of the input (the element pride_and_prejudice.txt here)

"$2": the second column/element of the input (the element 20 here)

"$3": the third column/element of the input (the element 5 here)

The first line #!/bin/sh means that the script should always be run with bash, rather than another shell. It’s a convention for the server to know what program it should use to run the shell script.

Q3. More fun with Linux

Try these commands in Bash and interpret the results: cal, cal 2021, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.

solution:

cal

cal display the calender of current month.

cal 2021

cal 2021 display the calender of all month in 2021.

cal 9 1752

cal 9 1752 seems display a incomplete calender of September 1752. Reason: The Gregorian calendar reform was adopted by the Kingdom of Great Britain in September 1752. As a result, the September 1752 cal shows the adjusted days missing. [wiki]

date

date returns the date in the default system timezone.

hostname

hostname provides the name of the server.

arch

arch provides the computer architecture.

uname -a

uname -a prints the name, version and other details about the current machine and the operating system running on it.

uptime

uptime returns information about how long your system has been running together with the current time, number of users with running sessions, and the system load averages for the past 1, 5, and 15 minutes.

who am i

who am i displays the username of the current user when this command is invoked.

who

who displays account information: user login name, user’s terminal, time of login as well as the host the user is logged in from.

w displays information about currently logged in users and what each user is doing.

id

id print real and effective User ID (UID) and Group ID (GID).

last | head

last | head displays the first 10 users logged in and out since the file /var/log/wtmp was created.

echo {con,pre}{sent,fer}{s,ed}

echo {con,pre}{sent,fer}{s,ed} generates all the permutations possible of a set of elements ({con,pre}{sent,fer}{s,ed}) stored in a variable in groups of 2 elements.

time sleep 5

time sleep 5 pauses execution of shell scripts or commands for a 5-second period on a Linux

set -o history
echo "zza"
history | tail

history | tail shows 10 of the last commands that have been recently used.

Data science practice 1

Q1. Linux Shell Commands

Q1.1

Q1.2

Q1.3

Q1.4

Q1.5

Q2. Who’s popular in Price and Prejudice

Q2.1

Q2.2

Q2.3

Q3. More fun with Linux

Zian ZHUANG