CIS-4250/5250 Homework #4: Ga with Spark

Due: Friday, October 26, 2018

Reading: Review the Apache Spark Quick Start guide and some of the documentation linked from it (programming guides, Spark API, etc.). You can also find information about Spark in the textbook and in the book Learning Spark, which is available in the Safari library.

Although you can use Jumbo and Hadoop's pseudo-distributed mode for testing, you should ultimately run your program for this assignment on the cluster.


CIS-4250

  1. Write a Spark program (in Scala) that duplicates the computation you did in Homework #3 on the Ga data set. Use the WordCount sample as a guide to setting up your program. Call your project "Hw04" and put it in an "Hw04" folder. Use sbt to control the build of your program, as was done in the WordCount sample.
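As a rough orientation for item 1, the overall shape of a Spark program in Scala might look like the sketch below. This is not the course's WordCount sample and it does not perform the Homework #3 computation: the token-counting pipeline is only a placeholder, and the helper name tokenize, the two command-line arguments, and the output layout are all assumptions made for illustration. Only the object name Hw04 comes from the assignment.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Hw04 {

  // Hypothetical helper (not from the assignment): split a line into
  // whitespace-separated tokens, dropping empty strings.
  def tokenize(line: String): Seq[String] =
    line.trim.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: input and output paths are passed on the command line.
    if (args.length != 2) {
      System.err.println("Usage: Hw04 <input path> <output path>")
      sys.exit(1)
    }
    val Array(inputPath, outputPath) = args

    // The master URL is not set here, so it can be supplied at launch time
    // (local mode for testing, the cluster for the final run).
    val conf = new SparkConf().setAppName("Hw04")
    val sc = new SparkContext(conf)
    try {
      // Placeholder pipeline: count tokens. Replace this with the
      // computation from Homework #3.
      val counts = sc.textFile(inputPath)
        .flatMap(line => tokenize(line))
        .map(token => (token, 1L))
        .reduceByKey(_ + _)

      counts.saveAsTextFile(outputPath)
    } finally {
      sc.stop()
    }
  }
}
```

A build.sbt for a program like this would typically declare spark-core as a "provided" dependency, with Scala and Spark versions matching whatever is installed on the cluster. After sbt package, the resulting jar can be launched with spark-submit --class Hw04, followed by the jar and the two paths.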

Submit your analysis program to Moodle in an archive. Remove the project and target folders, then zip the entire directory structure (use zip -r Hw04 or similar). You can also use tar if you prefer.


CIS-5250

  1. Complete the assignment above, as described for CIS-4250.

  2. Read over the attached paper on the Quantcast File System by Ovsiannikov et al., and write a short description (a couple of paragraphs) summarizing its main points. It is not necessary for you to read every single word of the paper, but you should read enough of it to gain a good idea of what it contains. This paper was given at the 39th International Conference on Very Large Databases in 2013.

Submit to Moodle both your Ga analysis program, as described for CIS-4250, and a short text file with your summary of the paper.

Last Revised: 2018-10-17
© Copyright 2018 by Peter C. Chapin