CIS-4250/5250 Homework #2: Average Temperatures (Hadoop)

Due: Friday, September 21, 2018

Reading: Chapter 2 in the text introduces MapReduce on Hadoop. The first half of the chapter works through an example that is similar in concept to what you are doing for this assignment. The second half discusses scaling the example to larger problem sizes and introduces HDFS and distributed execution. That material, while interesting, is less directly relevant to this assignment than the chapter's first half.

Part 2 of the text covers MapReduce in detail. You do not need to read it for this assignment, but I want you to be aware of it.

CIS-4250


  1. Repeat last week's assignment using the Hadoop MapReduce framework in standalone mode instead of Awk. You can work on either Jumbo or Lemuria. For this assignment you can use either Java or Scala, as you choose. In future assignments I will require you to use Scala.

    For this assignment it makes sense to use "baskets," where each basket holds the temperature readings for a particular month. For basket identifiers (the keys) you might use IntWritable to hold month numbers or Text to hold year/month combinations as strings. You do not need to post-process or reformat the results written to the output folder, although in general you might want to do so.
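
To make the basket idea concrete, here is a minimal sketch in plain Java that does not use any Hadoop classes. It assumes, purely for illustration, that each input line looks like "2018-09-21,17.5" (a date, a comma, and a temperature); the class and method names are hypothetical, and your data from last week may be laid out differently. In a real Hadoop job the grouping into baskets is done by the framework between the map and reduce phases, and the key would be an IntWritable or Text as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the "basket" idea; no Hadoop classes are used.
class BasketAverages {

    // "Map" step: derive the basket key (year/month) from one input line.
    // ASSUMPTION: lines look like "2018-09-21,17.5" (date, comma, temperature).
    static String basketKey(String line) {
        return line.substring(0, 7);
    }

    // "Map" step: extract the temperature reading from one input line.
    static double temperature(String line) {
        return Double.parseDouble(line.substring(line.indexOf(',') + 1).trim());
    }

    // "Shuffle" step: collect readings into baskets keyed by year/month.
    // Hadoop performs this grouping for you between map and reduce.
    static Map<String, List<Double>> fillBaskets(List<String> lines) {
        Map<String, List<Double>> baskets = new TreeMap<>();
        for (String line : lines) {
            baskets.computeIfAbsent(basketKey(line), k -> new ArrayList<>())
                   .add(temperature(line));
        }
        return baskets;
    }

    // "Reduce" step: average the readings in a single basket.
    static double average(List<Double> readings) {
        double sum = 0.0;
        for (double reading : readings) {
            sum += reading;
        }
        return sum / readings.size();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "2018-08-30,24.0", "2018-08-31,26.0",
            "2018-09-01,20.0", "2018-09-02,18.0", "2018-09-03,19.0");
        // Print one "key<TAB>average" line per basket.
        fillBaskets(lines).forEach((month, readings) ->
            System.out.println(month + "\t" + average(readings)));
    }
}
```

The three steps correspond roughly to what your mapper class, the framework, and your reducer class each do; only the first and last are code you write yourself.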

Submit your program to Moodle in a zip archive that contains three files: one for the mapper class, one for the reducer class, and one for the main program.

CIS-5250


  1. Complete the CIS-4250 assignment above, except that you must use Scala as the implementation language (Java is not an option).

  2. Write a script or program in whatever language you choose that reads the contents of the output folder and prints the final results in a user-friendly manner. In general the key/value pairs produced by a MapReduce job might need further processing before they are useful; for example, they might be formatted as a table or imported into a spreadsheet or relational database.
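
As one possible starting point for the post-processing step, the sketch below reads every "part-*" file in an output folder and prints the pairs as a simple table. It assumes the job wrote plain text output with one "key<TAB>value" pair per line and that the values are numeric averages; the class name, column headings, and default folder name "output" are hypothetical. Treat it as an illustration, not as the required solution.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a post-processor for a MapReduce output folder.
class ShowResults {

    // Read every "part-*" file in the folder into a map sorted by key.
    // ASSUMPTION: each line is "key<TAB>value" with a numeric value.
    static Map<String, Double> readResults(Path outputDir) throws IOException {
        Map<String, Double> results = new TreeMap<>();
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part)) {
                    String[] fields = line.split("\t", 2);
                    if (fields.length == 2) {   // skip blank or malformed lines
                        results.put(fields[0], Double.parseDouble(fields[1].trim()));
                    }
                }
            }
        }
        return results;
    }

    // Format the results as a two-column table with a header row.
    static String formatTable(Map<String, Double> results) {
        StringBuilder table = new StringBuilder();
        table.append(String.format(Locale.ROOT, "%-10s %12s%n", "Month", "Avg Temp"));
        for (Map.Entry<String, Double> entry : results.entrySet()) {
            table.append(String.format(Locale.ROOT, "%-10s %12.1f%n",
                                       entry.getKey(), entry.getValue()));
        }
        return table.toString();
    }

    public static void main(String[] args) throws IOException {
        // The folder name comes from the command line; "output" is only a default.
        Path outputDir = Paths.get(args.length > 0 ? args[0] : "output");
        System.out.print(formatTable(readResults(outputDir)));
    }
}
```

The glob "part-*" is used so that marker files such as _SUCCESS are ignored. Sorting by key with a TreeMap works well for year/month strings like "2018-09"; if you use bare month numbers as keys you may prefer to parse them as integers before sorting.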

Last Revised: 2018-09-12
© Copyright 2018 by Peter C. Chapin