CIS-4250/5250 Homework #3: Analyzing the Ga Data Set

Due: Tuesday, October 16, 2018

Reading: ...

CIS-4250

This assignment uses the Ga (imaginary) data set and the IDG imaginary data generator. However, the imaginary data for this assignment has already been generated and can be found in /user/hadoop in HDFS on the VTC cluster. Proceed as follows:

  1. Make sure you understand the format of the imaginary data. To view it, use a command (on lemuria) such as:

            $ hdfs dfs -cat /user/hadoop/observations-1.txt | less
          

    This file contains observations of 10 stars made over a one-year period. All of the stars lie within 1000 light years of the Earth. The format of each line is as follows:

            DAYNUMBER,STARNUMBER,LONGITUDE (degrees),LATITUDE (degrees)
          

    The longitude and latitude are in ecliptic coordinates. It isn't important to understand exactly what that means, but basically the latitude is a measure of how far the star appears to be above (or below) the plane of the Earth's orbit. The longitude is a measure of how far the star is east or west of the "first point of Aries," also called the vernal equinox. In effect, this file gives highly precise measurements of each star's position on the sky. This is the file your program will be reading.

    For reference, each observation file has a corresponding stars file. For example:

            $ hdfs dfs -cat /user/hadoop/stars-1.txt | less
          

    This file contains the 3D coordinates of each star in space as measured along suitable X, Y, and Z axes. Here is the format of that file:

            STARNUMBER,X_COORD (light years),Y_COORD (light years),Z_COORD (light years)
          

    Technically this file is not part of the observation set. It represents "ground truth" that we can compare with our analysis of the observations. In real life the information in the stars file would be unknown.
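
    As a starting point, the two line formats described above could be parsed along the following lines. This is only a sketch: the names GaRecords, Observation, Star, parseObservation, and parseStar are hypothetical, and it assumes that the day and star numbers are integers and that each field holds a bare number with no unit text.

    ```scala
    // Hypothetical helper names; none of these come from the assignment or IDG.
    object GaRecords {
      // One line of an observations file (angles in degrees).
      final case class Observation(day: Int, star: Int, longitudeDeg: Double, latitudeDeg: Double)

      // One line of a stars file (coordinates in light years).
      final case class Star(star: Int, x: Double, y: Double, z: Double)

      // Split a comma-separated line and trim whitespace around each field.
      private def fields(line: String): Array[String] = line.split(",").map(_.trim)

      // DAYNUMBER,STARNUMBER,LONGITUDE,LATITUDE; None if the line is malformed.
      def parseObservation(line: String): Option[Observation] =
        fields(line) match {
          case Array(d, s, lon, lat) =>
            scala.util.Try(Observation(d.toInt, s.toInt, lon.toDouble, lat.toDouble)).toOption
          case _ => None
        }

      // STARNUMBER,X_COORD,Y_COORD,Z_COORD; None if the line is malformed.
      def parseStar(line: String): Option[Star] =
        fields(line) match {
          case Array(s, x, y, z) =>
            scala.util.Try(Star(s.toInt, x.toDouble, y.toDouble, z.toDouble)).toOption
          case _ => None
        }
    }
    ```

    Returning an Option rather than throwing lets a caller skip malformed lines instead of aborting the whole run.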

  2. Write a Hadoop program in Scala (not Java!) that computes the distance to each star based on the observations.

    As the Earth goes around the Sun, the stars appear to move on the sky in small circles (really ovals, but in the imaginary data they are all perfect circles) due to parallax: a nearby object appears to shift against a more distant background when the observer changes position. For each star, your program must go through the ecliptic longitude values and find the extent of this shifting, called delta, which is the difference between the maximum and minimum values. The distance to the star in parsecs is then given by the formula d = 1/(delta/2), where 'd' is the distance and 'delta' is the maximum angular shift of the star in arcseconds. Because the longitudes in the file are in degrees, convert delta to arcseconds first (1 degree = 3600 arcseconds). Convert parsecs to light years in your final output (1 parsec = 3.262 light years).
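
    To make the arithmetic concrete: a star with delta = 0.5 arcseconds is 1/(0.5/2) = 4 parsecs away, or about 13.048 light years. The sketch below applies the same formula using plain Scala collections rather than Hadoop. The names Parallax, distanceLightYears, and distances are hypothetical. It assumes that the longitudes are in degrees, that no star's longitude values wrap around the 0/360 boundary, and that every star has a nonzero shift.

    ```scala
    // Hypothetical names; a plain-collections sketch, not a Hadoop job.
    object Parallax {
      val ArcsecPerDegree = 3600.0
      val LightYearsPerParsec = 3.262 // conversion factor given in the assignment

      // d = 1 / (delta / 2) with delta in arcseconds, converted to light years.
      def distanceLightYears(deltaArcsec: Double): Double =
        (1.0 / (deltaArcsec / 2.0)) * LightYearsPerParsec

      // Input: (star number, ecliptic longitude in degrees) pairs.
      // Output: star number -> distance in light years.
      // Assumes no wrap-around at 0/360 degrees and a nonzero delta per star.
      def distances(observations: Seq[(Int, Double)]): Map[Int, Double] =
        observations
          .groupBy(_._1)
          .map { case (star, obs) =>
            val longitudes = obs.map(_._2)
            val deltaArcsec = (longitudes.max - longitudes.min) * ArcsecPerDegree
            star -> distanceLightYears(deltaArcsec)
          }
    }
    ```

    In a Hadoop job the grouping would presumably be done by the framework: a mapper could emit (star number, longitude) pairs and a reducer could compute the minimum and maximum for each star. That split is one possible design, not a requirement.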

  3. Write a short Awk script that computes the distance to the various stars using information in the stars file. Use the distances from the stars file to check the results of your program.
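
    The check itself is simple arithmetic. The sketch below shows it in Scala purely for illustration; the assignment still asks for an Awk script. The names ReferenceDistance, fromCoordinates, and relativeError are hypothetical, and the code assumes that the observer sits at the origin of the X, Y, Z coordinate system, which the file format above does not state explicitly.

    ```scala
    // Hypothetical names; assumes the observer is at the origin (0, 0, 0).
    object ReferenceDistance {
      // Straight-line distance from the origin to (x, y, z), in light years.
      def fromCoordinates(x: Double, y: Double, z: Double): Double =
        math.sqrt(x * x + y * y + z * z)

      // Relative difference between a computed distance and the reference value.
      def relativeError(computed: Double, reference: Double): Double =
        math.abs(computed - reference) / reference
    }
    ```

    For example, a star at (3, 4, 12) light years lies 13 light years from the origin.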

  4. Once you are convinced your program is working properly, execute it on the larger files. Each step up increases the number of stars by a factor of 10. Go to as large an observation set as you reasonably can. For each size, time how long the program takes to execute.
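
    One way to collect timings from inside a Scala program is a small wall-clock helper like the one sketched below. The names Timing and timed are hypothetical. Timing the whole job from the shell with the time command is an equally reasonable choice.

    ```scala
    // Hypothetical helper; measures elapsed wall-clock time, not CPU time.
    object Timing {
      // Runs `body` once and returns its result together with the elapsed seconds.
      def timed[A](body: => A): (A, Double) = {
        val start = System.nanoTime()
        val result = body
        val elapsedSeconds = (System.nanoTime() - start) / 1e9
        (result, elapsedSeconds)
      }
    }
    ```

    A call such as Timing.timed { runAnalysis() }, where runAnalysis stands in for whatever starts your job, returns both the job's result and the elapsed seconds. On a shared cluster the figure includes scheduling and startup overhead, so note the conditions under which each measurement was taken.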

  5. (Graduate Credit). The IDG source code is on GitHub. Instead of writing observations.txt to the local file system, it should write the file directly into HDFS and thus bypass the need to have the generated data stored locally. Modify the program so it does this.

Submit your analysis program, your Awk script, and a short document that provides the timing information requested above, all in a zip archive to Moodle.

CIS-5250

  1. Complete the assignment above as for CIS-4250.

  2. The IDG source code is on GitHub. Instead of writing observations.txt to the local file system, it should write the file directly into HDFS and thus bypass the need to have the generated data stored locally. Modify the program so it does this.


Last Revised: 2018-10-17
© Copyright 2018 by Peter C. Chapin <pchapin@vtc.edu>