pchapin's CIS-4250/5250 Big Data Processing, Fall 2018


Peter C. Chapin. Office: BLP-415 on the Williston campus. Office hours are by appointment. Phone: 802-879-2367 (voice mail active). Email: pchapin@vtc.edu. I will usually respond to email within 24 hours, not including weekends or holidays. Email is the best way to contact me. I am also sometimes on the FreeNode IRC network under the nickname pcc.

Course Description

The official course outline (graduate version) lists high-level course objectives and content.

This course is about processing big data sets. For our purposes, a "big data set" is one so large that it cannot readily fit on a single computer. This definition is admittedly vague, but it conveys what "big" means for us. The implication is that our data sets are distributed across multiple machines in a cluster and that our processing methods must use parallel execution units to finish in a timely manner.

The processing environment described above calls for processing methods that differ somewhat from those traditionally used in database applications. The purpose of this course is to explore some of those methods. This is primarily a programming course using Scala (although no previous Scala experience is assumed). Scala has been chosen because some important big data processing libraries are written in Scala (notably Apache Spark, but there are others). We will also explore some other programming environments.
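To make the idea of partitioned, parallel processing a little more concrete, here is a minimal sketch of a word count in which the input is split into chunks, each chunk is counted independently, and the partial results are merged. It is only an illustration and not course material: it is written in Python rather than Scala, the function names are made up for this example, and threads on one machine stand in for the nodes of a cluster (so it shows the shape of the computation, not a real speedup).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n):
    """Split records into n chunks, standing in for data spread across machines."""
    return [records[i::n] for i in range(n)]


def count_words(chunk):
    """Count the words in one chunk; this step needs no data from other chunks."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, workers=4):
    """Count words per chunk concurrently, then merge the partial counts."""
    chunks = partition(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merging is just adding the per-chunk counts
    return total


if __name__ == "__main__":
    lines = [
        "big data is big",
        "data sets are distributed",
        "big clusters process data",
    ]
    print(parallel_word_count(lines, workers=2))
```

The important design point is that the per-chunk step is independent of every other chunk, so only the small partial results have to be brought together at the end.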


This course assumes you are already familiar with the basic concepts of programming and have taken an object-oriented programming course in Java. Although we will work mostly in Scala, no previous Scala experience is necessary or assumed. Some familiarity with Linux is also assumed, although you will have an opportunity in this course to gain more Linux experience. Linux is the operating system for our work.


The text is Hadoop: The Definitive Guide, 4th edition, by Tom White, published by O'Reilly Media Inc. ISBN 9781491901632.

Although a bit older, this is still a good book describing the Hadoop system that we'll be using for a good part of the course (either directly or indirectly). We will be using a significantly newer version of Hadoop so some of the specifics in this text might not apply, but the concepts and discussion are still good.

In addition, I recommend (but do not require) Programming in Scala, third edition, by Martin Odersky, Lex Spoon, and Bill Venners. Please be sure you are using the third edition; the earlier editions are now a bit outdated. I will also provide materials about Scala as needed during the semester. You may also find the Scala web site useful.

I have created an email distribution list for the class. I will use this list to distribute announcements and other supplementary materials. Be sure to check your mail regularly (daily) or you might miss something important. If you email a question directly to me, I may reply to the distribution list if I think that others would benefit from my answer. If you would rather I did not reply to the list, you should say so in your message.

My home page contains various documents of general interest.

Grading Policy

I grade on a point system. Each assignment is worth a certain number of points. At the end of the semester I total all the points you earned and compare that to the total number of possible points. In this course there are two components to your grade.

  1. Homework. Normally 20 pts each, although some assignments might differ. There will be approximately ten assignments during the semester for a total of about 200 points. You will normally have one week to do each assignment. Many, but not all, of the assignments will entail programming.

  2. Final. 50 pts. There will be one take-home exam at the end of the semester that will serve as the final exam for the course.
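As a rough illustration of the point arithmetic described above, the sketch below totals the points earned and compares them with the points possible. The scores are invented for the example, the function name is hypothetical, and expressing the comparison as a percentage is an assumption; how the result maps to a letter grade is not specified here.

```python
def course_percentage(earned, possible):
    """Return total earned points as a percentage of total possible points."""
    if len(earned) != len(possible):
        raise ValueError("each assignment needs both an earned and a possible score")
    total_possible = sum(possible)
    if total_possible <= 0:
        raise ValueError("total possible points must be positive")
    return 100.0 * sum(earned) / total_possible


if __name__ == "__main__":
    # Hypothetical semester: ten 20-point homeworks plus the 50-point final.
    possible = [20] * 10 + [50]
    earned = [18] * 10 + [40]  # made-up scores
    print(course_percentage(earned, possible))
```

With about ten 20-point assignments and a 50-point final, the final carries roughly one fifth of the total points.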

Late Policy

Late submissions are not accepted. If something comes up that prevents you from handing in an assignment on time, contact me before the deadline to discuss your issue. Under some circumstances I may be willing to grant an extension.

Copying Policy

I encourage you to share ideas with your fellow students so I won't be shocked to learn that you've been talking with someone about an assignment. If you worked closely with someone you should make a note on your submission that mentions the name(s) of your associate(s).

However, I do ask you to do your own work in your final submissions. If two submissions exhibit what I feel to be "excessive similarity" I will grade the submissions based on merit and then divide the grade by two, assigning half the grade to each submission. If I receive more than two excessively similar submissions I will divide the grade by the number of such submissions and distribute the result accordingly.

Since "excessive similarity" is a bit subjective, I may give you only a warning if the similarity is mild, especially for a first offense. However, I do keep records of when I find excessive similarity, and I will be much less inclined to be forgiving if I discover it again. If you are concerned about the possibility of submitting something that might be too similar to another student's work, don't hesitate to speak with me first.

If you find material on the Internet or in a book that seems to answer questions I ask in an assignment, you may include such material in your submission provided you properly reference it. If I discover that you have included unreferenced material from such sources, I may not give you any credit for the question(s) answered by such material. You do not need to provide a reference to our textbook or to materials I specifically provide in class.

Other Matters

Students with disabilities may request accommodation as provided within federal law. All such requests should be made by first contacting Robin Goodall, Learning Specialist, in the Center for Academic Success on the Randolph campus. She can be reached by phone at (802) 728-1278 or by email at rgoodall@vtc.edu.

Last Revised: 2018-08-21
© Copyright 2018 by Peter C. Chapin <pchapin@vtc.edu>