CIS-4250/5250 Homework #5: RFCAnalyzer

Due: Friday, December 14, 2018

Reading: ...

CIS-4250

Use the RFCAnalyzer starter zip archive to get set up for this assignment. It contains an SBT project configured for use with IntelliJ along with some skeletal code. Unpack this archive in some suitable place. You can either use IntelliJ to load the project contained in the archive (for example, on Jumbo) or, if you would rather, you can use the command-line sbt program (for example, on Lemuria) and a plain text editor such as Emacs. The starter code is in Scala, but for this assignment you are free to use either Java or Scala as you desire. Likewise, the starter code assumes you will be using Apache Spark, but you are free to use either Spark or MapReduce.

Write a Hadoop-based (MapReduce or Apache Spark) program that analyzes the collection of RFC documents and outputs a list of pairs (2-tuples) where the first element of each pair is the name of an author, and the second element of each pair is a collection of RFC numbers authored by that author. For example, the output might contain the following hypothetical pairs:

      (J. Jones, List(1234, 5678))
      (A. Roberts, List(123))
    

Here the author "J. Jones" was involved in writing RFC-1234 and RFC-5678. The author "A. Roberts" was only involved in writing RFC-123. The precise format of the output is up to you and need not look like what is shown above as long as the information is presented in a reasonably understandable way. Each author should appear in only one pair, however, so separate pairs representing J. Jones' work on RFC-1234 and RFC-5678 are not satisfactory.

Author names need not be formatted as shown here either. You are encouraged (although not required) to use author names as they appear in the RFC documents themselves. For example, in some cases the first name is fully spelled out. In some cases a middle initial is provided.

Note that many (most) RFCs have multiple authors. Every author should be recorded appropriately. For example, if T. Smith was also an author of RFC-1234, there should be a pair containing at least that RFC, like this: (T. Smith, List(1234)). This would be in addition to the mention of RFC-1234 in J. Jones' list.
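The grouping described above amounts to inverting a mapping from RFC number to authors into a mapping from author to RFC numbers. The following is only a sketch using plain Scala collections, not the required solution; the object name AuthorIndex, the method invert, and the input shape (one already-parsed record per RFC) are assumptions made for illustration.

```scala
object AuthorIndex {
  // Hypothetical input: one (rfcNumber, authors) record per RFC,
  // with author names already extracted from the document text.
  def invert(records: Seq[(Int, Seq[String])]): Map[String, List[Int]] =
    records
      // Emit one (author, rfcNumber) pair for every author of every RFC.
      .flatMap { case (rfc, authors) => authors.map(author => (author, rfc)) }
      // Collect all pairs belonging to the same author.
      .groupBy { case (author, _) => author }
      // Keep only the RFC numbers, without duplicates, in ascending order.
      .map { case (author, pairs) => author -> pairs.map(_._2).distinct.sorted.toList }
}
```

Spark's RDD API offers flatMap and, for RDDs of pairs, groupByKey, so the same shape should carry over to a Spark job. In MapReduce, the flatMap step would correspond roughly to the mapper and the grouping to the shuffle and reducer.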

The most difficult part of this problem will be finding and extracting author names from each RFC. This is difficult because the format of the RFCs is not consistent. Furthermore, some RFCs include affiliated institutions along with the author names (you should ignore the affiliated institutions). Do the best you can. Make some effort to create a general program, but do not fret if you don't handle all cases "cleanly."
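As one possible starting point, here is a sketch of a simple heuristic. It assumes (and this will not hold for every RFC) that author names appear in the right-hand column of the first block of non-blank lines and are written as initials followed by a surname. The object name AuthorExtractor, the method extract, and the regular expression are all hypothetical choices for illustration.

```scala
object AuthorExtractor {
  // Assumed name shape: one or more initials, then a surname,
  // for example "J. Jones" or "A. B. Roberts".
  private val NamePattern = """^(?:[A-Z]\.\s?)+[A-Z][A-Za-z'\-]+$""".r

  def extract(text: String): List[String] =
    text.split("\r?\n").toList
      .dropWhile(_.trim.isEmpty)    // skip blank lines before the header
      .takeWhile(_.trim.nonEmpty)   // assume the header ends at the first blank line
      // Treat the text after the last run of two or more spaces as the right-hand column.
      .map(line => line.trim.split("\\s{2,}").last)
      // Keep only candidates that look like names; this drops dates and institutions.
      .filter(candidate => NamePattern.findFirstIn(candidate).isDefined)
}
```

This heuristic misses names with a fully spelled-out first name and any RFC whose header is laid out differently, so treat it as a baseline to refine rather than a complete answer.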

Submit your analysis program to Moodle in an archive. Remove the project and target folders and zip the entire directory structure (use zip -r Hw05.zip Hw05 or similar). You can also use tar if you prefer.

CIS-5250

  1. Complete the assignment above as for CIS-4250.

  2. ...

Submit to Moodle both your analysis program as described for CIS-4250, and include ...


Last Revised: 2018-12-12
© Copyright 2018 by Peter C. Chapin <pchapin@vtc.edu>