CISC5950: Big Data Programming, HW2

0. Finish the in-class lab exercise

This exercise, especially step 4, shows how one submit a MapReduce job to the Hadoop system. Complete this exercise to gain first-hand experience of working with Hadoop and MapReduce.

1. MapReduce Programming Paradigm

In class, we learnt about the MapReduce programming model, where the computation job is expressed as a map phase followed by a reduce phase, and how input/outputs to both phases are a set of <key,value> pairs. Answer the following questions about MapReduce programming model (to the best of your knowledge at this point):

For the weather data set that you have used in HW1, can you come up 2 or 3 other problems (other than outputting the highest temperature for each year) that can be easily solved using MapReduce model? For each of the problem, write some kind of pseudocode (see the example given in the slides) for map and reduce phase.
Still for the same data set, can you come up with some data set analysis problem that you think cannot be solved using MapReduce model?
More clarification:
One might argue that all problems can be solved using MapReduce framework, as in the worst case, we can have multiple map tasks that perform some kind of pre-processing of data, and then use a single reduce task.
Q.a: Do you agree with the above statement?
Q.b: Suppose we want to sort all temperatures in the data set, discuss how you can do so using MapReduce framework:
1. How do you make sure all <key,value> pairs are shuffled to a single reduce task? How can the reduce task sort the list of values (temperatures) passed to it using any standard sorting algorithm? (Hint: the whole list of values are passed to the reduce function).
2. It's suggested in class that we can have multiple reduce task to speed up sorting, if we do the following in our map function (i.e., to assign key based upon the range that the temperature value falls into... What's the difficulty of the above approach? Hint: what prior information do you need to make it works? How about post-processing needed to finish the sorting of all temperature?
```
if (temperatur<-200)
   output <1,temperatur>
else if (temperature <-100)
   output <2,temperature>
else if (temperature<0)
   output <3;temperature>
... 
```
Q.c Another example given by a student in class is the following:
```
Consider the following problem:

Input: The weather data (as we have seen in lab1)

Output: for each year, the increment of this year's highest temperature over the previous year's highest temperature. 

For example, if the highest temperature measured in year 1920 is 340, and the highest temperature measured in 1921 
is 344, then the output for 1921 should be: 

  <1921, 4>
```
How would you solve the above problem using MapReduce framework?
1. Start by given a simple solution without worrying about parallelism and performance.
2. Suppose you can use two MapReduce program (one to run the MaxTemperature program as shown in textbook, another one to analyze the output of MaxTemperature program to get what we want), how would you design the second MapReduce program so that you can take advantage of the parallelism of multiple mapper or reducer tasks?
For the maximum temperature example, the computation of our lab1 code (One-phase approach (TempStat), Two-phase approach (Filter.java,Max.java)) is very simple, and the usage of memory space is also minimal (no usage of complicated data structure). For this reason, the bottleneck resource is the disk access bandwidth.
Suppose we have a very big data set contains weather data from 80 years, compare the running time of the lab1's approach versus using the MaxTemperature MapReduce program:
1. Lab1 program: processing the whole data set using one program running on one node in the cluster.
2. MapReduce program: assuming 10 map tasks and 10 reduce tasks are dispatched, each running on a different nodes in the cluster.
Can you roughly estimate how many times faster is the second approach compared to the first approach?

2. Is the following statement true or False? Why?

When a MapReduce job is being carried out in Hadoop environment, the reduce tasks can only start after all map tasks have finished execution.

3. (For Business School students)

Can you come up with one or two examples where it's critical to finish a data processing job within a limited time frame? Otherwise, the results will be useless.

4. Open-ended Questions

Describe three potential big data mining problems for potential class projects. For each one, describes the data set and what's the desired output of the problems.