CSCI 241 - Homeworks 7 & 8:
Shell Scripting and Regular Expressions

Due by 11:59:59pm Friday, December 08, 2017

The Github URL for this assignment is https://classroom.github.com/a/XZoT_gyg

Introduction

For this assignment you will be creating a number of shell scripts.

Part 1 - URL Testing

Write a shell script called testurl.sh that accepts a file containing a list of URLs and tests whether each website is up. You might find it useful to check out the curl, wget and tail commands.

rhoyle@clyde$ cat urls
http://cs.oberlin.edu/~ncare/cs241/labs/lab8.html
https://occs.cs.oberlin.edu/~rhoyle/17s-cs241/assignments/hw02.html
http://no.such.url
http://occs.cs.oberlin.edu
rhoyle@clyde$ ./testurl.sh urls
Not found: http://no.such.url

This script should also handle errors. If the user doesn't provide any urls to the script it should print out a usage message.
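
One possible way to structure the argument check and the per-URL probe is sketched below. The function names (usage, url_is_up, check_urls) and the choice of curl flags are assumptions made for illustration, not part of the assignment; wget could be used in much the same way.

```shell
#!/bin/sh
# Sketch only: function names and curl flags are illustrative assumptions.

# Print a usage message on stderr.
usage() {
    echo "Usage: testurl.sh <file-with-one-url-per-line>" >&2
}

# url_is_up URL: succeeds when curl gets a successful response for URL.
# --head asks for headers only; --fail turns HTTP errors (>= 400) into a
# non-zero exit status; --max-time bounds how long one probe may take.
url_is_up() {
    curl --silent --head --fail --max-time 10 --output /dev/null "$1"
}

# check_urls FILE: report every URL in FILE that cannot be reached.
check_urls() {
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        usage
        return 1
    fi
    while read -r url; do
        [ -n "$url" ] || continue      # skip blank lines
        url_is_up "$url" || echo "Not found: $url"
    done < "$1"
}
```

Ending the script with check_urls "$@" would pass the command-line arguments straight through, so running it with no arguments prints the usage message.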

Part 2 - Back it up a step

Next, I want you to create a script called backup.sh. The script should take as arguments a list of one or more files that should be copied to the ".backup" directory.

Your script should only copy a file if its timestamp is more recent than that of the copy already in the backup directory when the script is run. You might find it helpful to check bash's test (i.e. [ ]) syntax. Additionally, you should make your script executable using chmod. That is, the command should be runnable as follows:

$ ./backup.sh
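
The timestamp comparison can be expressed with the -nt ("newer than") operator of test. The following is a minimal sketch under that assumption; the helper names needs_backup and backup_files are hypothetical, and -nt is supported by bash and dash but is not guaranteed by older POSIX versions of test.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# needs_backup SRC DEST_DIR: succeeds when SRC should be copied, i.e. when
# DEST_DIR holds no copy yet or the existing copy is older than SRC.
needs_backup() {
    src=$1
    dest=$2/$(basename "$src")
    [ ! -e "$dest" ] || [ "$src" -nt "$dest" ]
}

# backup_files FILE...: copy each FILE into .backup when it is out of date.
backup_files() {
    dir=.backup
    mkdir -p "$dir"
    for f in "$@"; do
        if needs_backup "$f" "$dir"; then
            cp -p "$f" "$dir/"     # -p keeps the original timestamp
        fi
    done
}
```

Using cp -p means a freshly backed-up file has the same timestamp as its source, so an immediate second run copies nothing.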

Part 3 - Diskhogger

Third, I want you to create a shell script called diskhog.sh that lists the 5 largest items (files or folders) in the current directory in decreasing order of size. You should output the sizes in a human readable format like so:

% cd ~rhoyle/pub/cs241
% ./diskhog.sh
3.9M week03
572K old
348K hw06
152K week06
112K week05

Check out the man pages for du, cut, sort, xargs and head (or tail).
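
As a rough illustration of how those commands can be chained, the sketch below sorts on kilobyte counts and then re-measures the winners in human-readable form. The function name largest_items is hypothetical, and the pipeline assumes file names without whitespace, a non-empty directory, and that hidden files can be ignored.

```shell
#!/bin/sh
# Sketch only: largest_items is a hypothetical name.

# largest_items [N]: list the N (default 5) largest entries in the
# current directory, biggest first, with human-readable sizes.
largest_items() {
    du -sk -- * |              # one line per entry, size in KiB
        sort -rn |             # numeric sort, largest first
        head -n "${1:-5}" |    # keep the top N lines
        cut -f2 |              # keep only the name column
        xargs du -sh           # re-measure those names, human-readable
}
```

Sorting on the plain KiB numbers avoids having to compare suffixed values such as 572K and 3.9M directly.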

Part 4 - linecount

Create a shell script called linecount that by default will report the total number of lines in all of the files in the current working directory (recursively).

If a glob is specified, use it as a filter for the files to scan. So, if you wanted to know how many lines of java code were in your folder, you might run:

% ./linecount '*.java'

You'll want to take a look at wc, cd, find, and test.
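
A minimal sketch of the find-plus-wc idea follows. The function name count_lines is hypothetical, and the sketch counts newline characters, so a final line without a trailing newline is not counted.

```shell
#!/bin/sh
# Sketch only: count_lines is a hypothetical name.

# count_lines [GLOB]: total number of lines in all regular files under the
# current directory whose names match GLOB (default: every file).
count_lines() {
    # The glob stays quoted so that find, not the shell, expands it.
    find . -type f -name "${1:-*}" -exec cat {} + | wc -l
}
```

Concatenating the files before counting gives a single total rather than the per-file lines that wc prints when given many file names.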

Part 5 - Retro-grade Scripting

I want you to write a script called gradeit.sh that will test your pyramid and rot128 submissions for lab 1.

The script should analyze a student's submissions for correctness and warn if the output of the program differs from the reference implementation, which is located in ~rhoyle/pub/cs241/hw01.

You must decide what to test for. You will be graded on how thorough your test is. Explain, in comments in your script, what you are testing for and why you are running that particular test. After your script has finished, you should clean up any temporary files created by the testing process.

You'll want to take a look at wc, pushd (and popd), find, and diff.
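
The core of one such test, comparing two programs on the same input and cleaning up afterwards, might look like the sketch below. The function name same_output is hypothetical, and mktemp is assumed to be available (it is common but not required by POSIX).

```shell
#!/bin/sh
# Sketch only: same_output is a hypothetical name.

# same_output STUDENT REFERENCE INPUT_FILE: run both programs on the same
# input and succeed only if their output is identical.
same_output() {
    tmpdir=$(mktemp -d) || return 2
    "$1" < "$3" > "$tmpdir/student.out" 2>&1
    "$2" < "$3" > "$tmpdir/reference.out" 2>&1
    diff "$tmpdir/student.out" "$tmpdir/reference.out" > /dev/null
    status=$?
    rm -rf "$tmpdir"           # remove temporary files before returning
    return "$status"
}
```

A caller could then write something like same_output ./pyramid "$REF/pyramid" input.txt || echo "pyramid output differs", where $REF and input.txt are placeholders for the reference directory and a test input of your choosing.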

Part 6 - Data file analysis

I often find myself using shell tools to answer questions about a data file that I'm working on. Here is a data file from a machine learning dataset that I'd like you to download and unzip: adult.data.zip. The fields in the data set are described at http://archive.ics.uci.edu/ml/datasets/Adult.

Answer the following questions in your README file (and give the commands used to find the answer):

How many entries are marked "Male" and how many are marked "Female"?
The last column is the label that is applied to the entry. How many of each label type are there?
Give the counts for each label used for "race" in decreasing order
Give the counts for a combined "race"/"sex" attribute in decreasing order

Potentially useful commands to look at include cut, sort, and uniq. If you include the commands you used to generate your answers, it might be possible to give you partial credit. Once you have answered the questions, you should delete the adult.data and adult.data.zip files so that you don't hand them in.
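
The usual counting idiom with these tools is cut, then sort, then uniq -c, then a numeric sort. The sketch below shows it on an arbitrary column; the function name count_column is hypothetical, and the column number is left as a parameter because it depends on the field layout described at the URL above. It assumes comma-separated fields with optional leading spaces.

```shell
#!/bin/sh
# Sketch only: count_column is a hypothetical name.

# count_column FILE N: count how often each distinct value occurs in
# comma-separated column N of FILE, most frequent first.
count_column() {
    cut -d, -f"$2" "$1" |      # extract column N
        sed 's/^ *//' |        # drop the space that follows each comma
        grep . |               # ignore blank lines
        sort |                 # group identical values together
        uniq -c |              # prefix each value with its count
        sort -rn               # largest count first
}
```

uniq only merges adjacent duplicates, which is why the first sort is needed before counting.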

Programming Hints

You should make your programs executable using the chmod command (specifically, chmod u+x)
You may assume that your programs won't have to handle whitespace in their names.
Make sure not to use bash-specific extensions in your implementation (for example, no $(( )) blocks).
For file permissions it might be useful to check out the Grymoire for help.

Part 7 - Regular Expressions

Give a command that will use a single egrep or grep -E on /usr/share/dict/words.241 to find the following. Consider only a, e, i, o, and u as vowels for our purposes. Put your answers in a file called README.

You may want to review the lecture notes, the class readings, and possibly do an online tutorial before beginning. There are useful links for regex visualizers on the course home page.

Protip: you can do export WORDS=/usr/share/dict/words.241 and then use $WORDS as your input file. Also, piping output to wc -l will let you count the lines output.

All words that contain exactly one lowercase vowel (5948 on clyde)
All words that contain the lowercase vowels a, e, i, o, and u in that order (6 on clyde)
All words that are exactly 22 lowercase letters long (2 on clyde)
All words that have a 4-letter sequence repeated (24 on clyde)
All words that start and end with the same 3 letter sequence (32 on clyde)
All lowercase words that are made up of only pairs of consonant-vowels like banana and are at least 6 letters long (545 on clyde)
All words that end with their first 3 letters reversed like detected (14 on clyde)
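
To show the building blocks without answering any of the items above, here are two unrelated patterns that use anchors, an interval, and a backreference. The function names are hypothetical, the patterns are only illustrations, and backreferences with grep -E are assumed to be supported (they are in GNU grep, although POSIX does not require them for extended expressions).

```shell
#!/bin/sh
# Sketch only: function names and patterns are illustrative.
WORDS=${WORDS:-/usr/share/dict/words.241}

# Words whose first and last characters are the same.
# \1 refers back to whatever the first parenthesised group matched.
same_first_last() {
    LC_ALL=C grep -E '^(.).*\1$' "${1:-$WORDS}"
}

# Words that are exactly five lowercase letters long.
# LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
exactly_five_lower() {
    LC_ALL=C grep -E '^[a-z]{5}$' "${1:-$WORDS}"
}
```

Piping either function into wc -l gives a count that can be compared against expected totals in the same way as the numbers quoted for clyde.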

For this portion of the assignment I want you to construct sed commands that will do the following activities. (Don't forget the -E flag!)

Replace all instances of "snow fall" or "wind chill" with "summertime"
Assuming the input is a dictionary file like /usr/share/dict/words.241 (one per line, alpha order), print out all words between "computer" and "science"
Replace all instances of "Teh" with "The" and "teh" with "the", but only in standalone words
Move the last word on a line to the front
Find lines where a word has been repeated on the same line and replace that line with the repeated word. Don't print the other lines.
Convert C block comments that are on one line and at the end into a line comment.
So /* add things up */ would become // add things up
Only print out lines that contain "cs 241", but change that to "CSCI 241"
Take the previous, but modify it to handle "CS" or "CSCI" with or without space and of any type of capitalization (e.g., "cScI241")
Truncate all lines after exactly 20 characters.
Replace all instances of "Thomas B. Wexler" with "T-Wex" (including variations with "Tom" and/or no middle initial)
Assuming that a name is made up of two adjacent words that start with a capital and are followed by one or more lower case letters, anonymize the input by changing every name to be just their initials. So "Roberto Hoyle" becomes "RH". Be sure to handle having multiple names on the line.
If there is a 10 digit number on a line (not part of another word) reformat the number as (123) 456-7890
Assume that the input is being piped from wget --quiet -O- http://xkcd.com/ (which will print the xkcd comic's html page to stdout), print out the Image and Title information as follows:
```
Image: http://imgs.xkcd.com/comics/privacy_opinions.png
Title: I'm the Philosopher until someone hands me a burrito.
```
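
Two sed features come up repeatedly in the tasks above: capture groups with \1 and \2 in the replacement, and the combination of -n with the p flag to print only the lines that were changed. The sketch below illustrates both on examples that are deliberately not among the tasks; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names and patterns are illustrative.

# Swap the first two space-separated words on each line.
# Each parenthesised group is available as \1, \2, ... in the replacement.
swap_first_two() {
    sed -E 's/^([^ ]+) ([^ ]+)/\2 \1/'
}

# Print only lines containing "color" or "colour", rewritten to "COLOR".
# -n suppresses the default output; the p flag prints lines where the
# substitution succeeded.
print_only_changed() {
    sed -E -n 's/colou?r/COLOR/p'
}
```

For example, echo "hello big world" | swap_first_two prints "big hello world".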

Useful links

You may find the following links useful when working on this assignment:

Extra Credit

Modify testurl.sh to output if a file is a valid HTML file according to the W3C validator at https://validator.w3.org/
Modify your backup.sh script to keep a list of the five most recent backup directories and store copies as symlinks.
Modify diskhog.sh to take a flag to change the number of items to display and another to limit it to files or directories.
Have your linecount.sh script support an optional argument that will be used as a file glob pattern for the types of files. The user is responsible for properly quoting things on the command line. For example, to get a sum of all of the lines in your java source files you would use:

% ./linecount '*.java'

Add more testing to gradeit.sh
Delete all shell comment lines, that is lines with optional whitespace followed by a #. Don't delete them if the comments are on lines that have actual instructions (i.e., something other than whitespace before the #)
Check to see if there is a 10 digit number on a line which may have non-letters between the digits and print it out in the format 1234567890

handin

README

Create a file called README that contains

Your name
A description of the programs
Your answers to the "Data File Analysis" questions and commands
An estimate of the amount of time it took to complete each part
Any known bugs or incomplete functions
Any interesting design decisions you'd like to share

Now you should clean up your folder (remove test case detritus, etc.) and hand in your folder containing your scripts and README.

Grading

Here is what I am looking for in this assignment:

A working set of shell scripts as described above
All programs should work on clyde.cs.oberlin.edu and not use BASH evaluation extensions. (e.g., no $(( )) blocks)
Good comments
A README with the information requested above. The listing of known bugs is important.
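
Since $(( )) blocks are ruled out, arithmetic can be done with the external expr command instead. The following is a minimal sketch of that substitution; the function name sum_numbers is hypothetical.

```shell
#!/bin/sh
# Sketch only: sum_numbers is a hypothetical name.

# sum_numbers N...: add the given integers using expr rather than $(( )).
sum_numbers() {
    total=0
    for n in "$@"; do
        # expr prints the result; note that it exits non-zero when the
        # result is 0, which matters only if the exit status is checked.
        total=$(expr "$total" + "$n")
    done
    echo "$total"
}
```

Each operand and operator must be a separate argument to expr, so the spaces around + are required.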

Last Modified: February 12, 2017 - Roberto Hoyle and Nick Care. Some material based on work by Benjamin Kuperman.
