CIS-4250/5250 Homework #1: Average Temperatures (Awk)

Due: Thursday, September 6, 2018

Reading: This assignment requires you to use Awk. There are a number of Awk tutorials online that vary from short and simple to long and complex. Do a Google search for "awk tutorial" and dig around in one or more of them. If you find one that is particularly useful, let me know and I'll add it to the class web page.

In this class we will be using a Linux cluster via the system lemuria.cis.vtc.edu. However, you may find it convenient to do some of your program development on a Linux desktop system on your own computer. To facilitate that, you can use a virtual machine named Jumbo that runs Ubuntu Linux 16.04. Download the OVA file and import it into VirtualBox. You can log into the machine using the account "student" with password "frenchfry".

On Jumbo, in the cis-4250 folder in student's home directory, you will find the file 20130101-20160531.txt. This file contains data collected by a weather station operated by Lyndon State College's meteorology department (thanks to Jason Kaiser for providing this data). The file is a text file with one record of comma-separated values per line. The first four lines are header information. The data covers the range from January 1, 2013 through May 31, 2016.
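
As a starting point, here is a minimal sketch of how Awk can skip a fixed number of header lines and split the remaining records on commas. The sample data and its column layout (a timestamp in field 1, a temperature in field 2) are assumptions made up for illustration. Look at the real file (for example with `head`) to find out which columns it actually contains.

```shell
# Hypothetical sample: four header lines, then "timestamp,temperature" records.
# The real data file may use a different layout; this is only an illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
header line 1
header line 2
header line 3
header line 4
2013-01-01 00:00,20.5
2013-01-01 00:05,21.5
EOF

# -F, makes the comma the field separator; NR > 4 skips the four header lines.
fields=$(awk -F, 'NR > 4 { print $1 " -> " $2 }' "$sample")
echo "$fields"
rm -f "$sample"
```

With this sample the script prints one line per data record, showing the first field, an arrow, and the second field.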

Do the following:

CIS-4250

  1. Write an Awk program that reads the file and outputs the average temperature for each month in the year 2013. Ignore the other years. Each record in the file covers a five-minute interval. For purposes of this assignment you can assume the data is complete in the sense that there is no missing time. You can also assume that the data is in chronological order. Think about how you might handle this problem if these assumptions were not true, but you don't need to implement a fix at this time.

  2. Run your program twice more to find average temperatures by month for the years 2014 and 2015. You can just modify the program each time; you don't need to parameterize it by year (although you can if you'd like to try that). Do you notice any trends? Global warming? Warning: This is a weak analysis because it only considers three years' worth of data at a single location. No real conclusions about global warming can be drawn from this analysis!
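
If you would like to try parameterizing by year, the sketch below shows one possible shape for it. It is not the required solution. It assumes a hypothetical layout in which field 1 is a timestamp of the form "YYYY-MM-DD HH:MM" and field 2 is the temperature; the real file may well differ, so the `substr` positions and field numbers would need adjusting. The function name `monthly_avg` and the variable `year` are names invented here. The year is passed in with Awk's standard `-v` option.

```shell
# Hypothetical sample data: four header lines, then "YYYY-MM-DD HH:MM,temperature".
# This layout is an assumption for illustration; the real file may differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
header line 1
header line 2
header line 3
header line 4
2013-01-01 00:00,20.0
2013-01-01 00:05,22.0
2013-02-01 00:00,30.0
2013-02-01 00:05,31.0
2014-01-01 00:00,50.0
EOF

# monthly_avg YEAR FILE: print the mean of field 2 for each month of YEAR.
monthly_avg() {
    awk -F, -v year="$1" '
        NR <= 4 { next }                   # skip the four header lines
        substr($1, 1, 4) == year {         # keep only records from the chosen year
            month = substr($1, 6, 2)
            sum[month] += $2
            count[month]++
        }
        END {
            for (m = 1; m <= 12; m++) {    # print months in calendar order
                key = sprintf("%02d", m)
                if (count[key] > 0)
                    printf "%s-%s %.2f\n", year, key, sum[key] / count[key]
            }
        }
    ' "$2"
}

avg2013=$(monthly_avg 2013 "$sample")
avg2014=$(monthly_avg 2014 "$sample")
echo "$avg2013"
echo "$avg2014"
rm -f "$sample"
```

On the sample data this prints "2013-01 21.00" and "2013-02 30.50" for 2013, and "2014-01 50.00" for 2014. Because the sums are kept in arrays indexed by month, this particular sketch does not rely on the records being in chronological order, although it still weights every record equally.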

Submit your Awk program to Moodle.

CIS-5250

Write the script described above for CIS-4250. In addition, modify the script so that it prints a warning message whenever it finds missing or out-of-order time. This will let you assess how big a problem that issue might be. You still do not need to deal with the issue, however.
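
One way to approach the check is to convert each timestamp to a count of minutes and compare it with the previous record. The sketch below does that under the same hypothetical layout as before (field 1 is "YYYY-MM-DD HH:MM"), which may not match the real file. The helper functions `day_number` and `minutes` are written here for illustration. They use a standard civil-calendar day-count formula rather than gawk's `mktime`, so the sketch should also run under awk implementations that lack that function.

```shell
# Hypothetical sample: four header lines, then "YYYY-MM-DD HH:MM,temperature".
# Line 8 follows a 15 minute gap; line 9 goes backwards in time.
sample=$(mktemp)
cat > "$sample" <<'EOF'
header line 1
header line 2
header line 3
header line 4
2013-01-31 23:50,20.0
2013-01-31 23:55,20.5
2013-02-01 00:00,21.0
2013-02-01 00:15,21.5
2013-02-01 00:10,22.0
EOF

warnings=$(awk -F, '
    # Days since 1970-01-01 for a Gregorian date (valid for years after 0).
    function day_number(y, m, d,    era, yoe, mp, doy) {
        if (m <= 2) y--
        era = int(y / 400)
        yoe = y - era * 400
        mp  = (m > 2) ? m - 3 : m + 9      # month index with March as 0
        doy = int((153 * mp + 2) / 5) + d - 1
        return era * 146097 + yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy - 719468
    }
    # Minutes since 1970-01-01 00:00 for a "YYYY-MM-DD HH:MM" timestamp.
    function minutes(ts,    days) {
        days = day_number(substr(ts, 1, 4) + 0, substr(ts, 6, 2) + 0, substr(ts, 9, 2) + 0)
        return days * 1440 + substr(ts, 12, 2) * 60 + substr(ts, 15, 2)
    }
    NR <= 4 { next }                       # skip the four header lines
    {
        now = minutes($1)
        if (seen) {
            if (now <= prev)
                print "WARNING line " NR ": out-of-order or duplicate time " $1
            else if (now - prev != 5)
                print "WARNING line " NR ": gap of " (now - prev) " minutes before " $1
        }
        prev = now
        seen = 1
    }
' "$sample")
echo "$warnings"
rm -f "$sample"
```

On the sample data the sketch reports a gap of 15 minutes at line 8 and an out-of-order time at line 9, and it stays silent across the change from January 31 to February 1. Whether a duplicate timestamp should count as out of order is a design choice; here it is reported with the same message.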

Submit your Awk program to Moodle.

Notes

The data we are using in this assignment is well suited for storage in a relational database system because it is highly structured and well typed. There is also not very much of it. In later assignments we will work with some less structured data that would be awkward to manipulate and query in a relational environment.

Awk is a useful tool and worth getting to know. However, it is not at all suitable for processing big data sets. In the next assignment you will redo this assignment using the Hadoop framework. Hadoop can make use of a cluster of computers and scales to vast data sets. The point of using Awk is to create a performance baseline against which more sophisticated methods can be compared.


Last Revised: 2018-08-21
© Copyright 2018 by Peter C. Chapin <pchapin@vtc.edu>