LEI'S DESIGN SERVICES

EXPERIENCED IN CHIP DESIGN VERIFICATION

Chapter 3: Pig to Sort and Extract Data

 

 Figure 4:  Pig script function with input parameters

 

3.1 Dataset

  Use Pig to sort the output reports from the MapReduce Number Count jobs in Chapter 1.

             Source: /output/20*to2005_*/part-r-00000

                    Where * is a wildcard matching any characters

 

            File Size:  < 500 Bytes

            Number of Records: < 75

 

3.2 Problem Statement

a)            By sorting the frequency counts in descending order, use Pig to find

the 6 most frequent winning numbers in various annual periods, based on the output

reports from the MapReduce Number Count in Chapter 1.  The periods are as follows:

                   From 2005 to 2005,

                   From 2006 to 2005,

                   From 2007 to 2005,

                   From 2008 to 2005,

                   From 2009 to 2005,

                   From 2010 to 2005,

                   From 2011 to 2005,

                   From 2012 to 2005,

                   From 2013 to 2005, and

                   From 2014 to 2005.

            b)  Use the unix shell scripting tools sed and awk to find the results of the following methods, based on the output reports from the MapReduce Number Count in Chapter 1 (this could also be done with Pig).  The following methods use unix shell scripts:

 

v)                    the average of the 6 winning numbers, taken over all 6 winning numbers,

vi)                   the median of the 6 winning numbers, taken over all 6 winning numbers,

vii)                 the bottom 6 winning numbers from all 6 winning numbers,

viii)                 the bottom of each individual winning number; for example, the bottom number from winning #1, the bottom number from winning #2, the bottom number from winning #3, the bottom number from winning #4, the bottom number from winning #5, and the bottom number from the winning mega number,

ix)                  method viii) and method iv) in Section 2.2 b), shifted to the left by 1 number,

x)                 method viii) and method iv) in Section 2.2 b), shifted to the left by 2 numbers,

xi)                method viii) and method iv) in Section 2.2 b), shifted to the left by 3 numbers,

xii)               method viii) and method iv) in Section 2.2 b), shifted to the left by 4 numbers, and

xiii)               methods v) to xii) and the methods in Section 2.2 b), repeated in reverse order.
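The period labels above map directly onto the directory names used throughout this chapter (2005to2005_*, 2006to2005_*, and so on).  As a small illustration (a sketch only, not part of the original scripts), the labels can be generated in a loop:

```shell
#!/bin/sh
# Generate the ten period labels used in the /output directory names,
# e.g. 2005to2005, 2006to2005, ..., 2014to2005.
for year in $(seq 2005 2014); do
    echo "${year}to2005"
done
```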

 

3.3  Approach

a)      One approach (for methods i) to iv) in Section 2.2 b)) is to use Pig to sort the numbers in the output files so that the most frequent number is at the top of the list and the least frequent number is at the bottom.  For example, pick the 6 most frequent numbers from the top of the list, and store those 6 numbers in an output file in HDFS.

b)      Another approach (for methods v) to xiii) in Section 3.2 b)) is to use unix shell scripting to do all the calculations and data manipulation.  Use unix commands such as sort, sed, and awk to find the average, median, and bottom numbers, to shift the numbers to the left, and to reverse the order of the numbers.  Figure 4 above shows the functionality of the Pig script.

 

3.4 Code

            a)  The Pig script for approach a) in Section 3.3 takes three parameters: the input directory (a period as described in Section 3.2 a)), the output directory in HDFS where the top winning numbers will be stored, and the count of how many numbers to select.  The code for this Pig script, CaliforniaLottoParam.pig, is listed below:

 

all1nums = LOAD '/output/$indir/part-r-00000' USING PigStorage('\t') AS (WinningNum:long, WinningNumCount:long);
DUMP all1nums;
all1bynum = ORDER all1nums BY $1 DESC;
DUMP all1bynum;
top1nums = LIMIT all1bynum $num;
DUMP top1nums;
STORE top1nums INTO '/output/$outdir';

Where $indir is the sampling number count output directory described in Section 3.1, $outdir is the output directory where Pig stores the sorted data, and $num is the number of top sorted entries to keep.

 

            To run each period described in Section 3.2 a), a unix shell script invokes the Pig script once per report.  For example, for the period of 2006 to 2005, the unix shell script is runCaliforniaLotto2006to2005, listed below:

pig -f CaliforniaLottoParam.pig -param indir=2006to2005_all1 -param outdir=2006to2005_top1 -param num=1
pig -f CaliforniaLottoParam.pig -param indir=2006to2005_all2 -param outdir=2006to2005_top2 -param num=1
pig -f CaliforniaLottoParam.pig -param indir=2006to2005_all3 -param outdir=2006to2005_top3 -param num=1
pig -f CaliforniaLottoParam.pig -param indir=2006to2005_all4 -param outdir=2006to2005_top4 -param num=1
pig -f CaliforniaLottoParam.pig -param indir=2006to2005_all5 -param outdir=2006to2005_top5 -param num=1
pig -f CaliforniaLottoParam.pig -param indir=2006to2005_all5Nmega -param outdir=2006to2005_top5Nmega -param num=6
pig -f CaliforniaLottoParam.pig -param indir=2006to2005_all6 -param outdir=2006to2005_top5nums -param num=5
pig -f CaliforniaLottoParam.pig -param indir=2006to2005_all6 -param outdir=2006to2005_top6 -param num=6
pig -f CaliforniaLottoParam.pig -param indir=2006to2005_allmega -param outdir=2006to2005_topmega -param num=1
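The nine invocations in runCaliforniaLotto2006to2005 differ only in their parameter suffixes, so the same script could be generated from a small table.  A sketch (echo is used so the loop runs without pig or a cluster; removing the echo would actually run the jobs):

```shell
#!/bin/sh
# Sketch: generate the nine pig invocations for one period from a table
# of (input suffix, output suffix, count) triples.
period=2006to2005
while read suffix outsuffix num; do
    echo "pig -f CaliforniaLottoParam.pig -param indir=${period}_${suffix} -param outdir=${period}_${outsuffix} -param num=${num}"
done <<EOF
all1 top1 1
all2 top2 1
all3 top3 1
all4 top4 1
all5 top5 1
all5Nmega top5Nmega 6
all6 top5nums 5
all6 top6 6
allmega topmega 1
EOF
```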

 

            Finally, use a unix shell script to concatenate the individual top winning numbers into one file.  Here is the script, concatefiles2006to2005:

hdfs dfs -copyToLocal /output/2006to2005_top1/part-r-00000
mv part-r-00000 2006to2005_top1.txt
hdfs dfs -copyToLocal /output/2006to2005_top2/part-r-00000
mv part-r-00000 2006to2005_top2.txt
hdfs dfs -copyToLocal /output/2006to2005_top3/part-r-00000
mv part-r-00000 2006to2005_top3.txt
hdfs dfs -copyToLocal /output/2006to2005_top4/part-r-00000
mv part-r-00000 2006to2005_top4.txt
hdfs dfs -copyToLocal /output/2006to2005_top5/part-r-00000
mv part-r-00000 2006to2005_top5.txt
hdfs dfs -copyToLocal /output/2006to2005_topmega/part-r-00000
mv part-r-00000 2006to2005_topmega.txt
cp 2006to2005_topmega.txt 2006to2005_topmega2.txt
hdfs dfs -copyToLocal /output/2006to2005_top5nums/part-r-00000
mv part-r-00000 2006to2005_top5nums.txt
cat 2006to2005_top1.txt 2006to2005_top2.txt 2006to2005_top3.txt 2006to2005_top4.txt 2006to2005_top5.txt 2006to2005_topmega.txt > 2006to2005_individualtop5Nmega.txt
cat 2006to2005_top5nums.txt 2006to2005_topmega2.txt > 2006to2005_top5numsNmega.txt
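The concatefiles2006to2005 script repeats the same copy-and-rename step once per file; the same pattern can be sketched with a loop.  This is only an illustration with local stand-in files (the echo line replaces the hdfs dfs -copyToLocal and mv steps, which need a live cluster):

```shell
#!/bin/sh
# Sketch of the concatefiles pattern with local stand-in files.
period=2006to2005
for pos in top1 top2 top3 top4 top5 topmega; do
    # stand-in for: hdfs dfs -copyToLocal ... ; mv part-r-00000 ...
    echo "data-for-$pos" > "${period}_${pos}.txt"
done
cat ${period}_top1.txt ${period}_top2.txt ${period}_top3.txt \
    ${period}_top4.txt ${period}_top5.txt ${period}_topmega.txt \
    > ${period}_individualtop5Nmega.txt
wc -l ${period}_individualtop5Nmega.txt
```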

 

 

Refer to Section 3.5 a) for the command and output information.

b)      The second approach uses unix shell scripts entirely.

For methods v) to vii), the average, median, and bottom 6 winning numbers are computed in a unix shell script.  For example, runbottom_avg_med20061 is used for the period of 2005 to 2005, runbottom_avg_med20071 is used for the period of 2006 to 2005, …, and finally runbottom_avg_med20151 is used for the period of 2014 to 2005.  Below is the example code for the period of 2005 to 2005:

hdfs dfs -copyToLocal /output/2005to2005_all6/part-r-00000
# general-numeric sort, ascending order on the 2nd (count) field (-k2g)
sort -k2g part-r-00000 > 2005to2005_bottomall6.txt
sed -n 1,6p 2005to2005_bottomall6.txt > temp
mv temp 2005to2005_bottomall6.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_all1/part-r-00000
# general-numeric sort, ascending order on the 2nd (count) field (-k2g)
sort -k2g part-r-00000 > 2005to2005_bottom1.txt
sed -n 1,1p 2005to2005_bottom1.txt > temp
mv temp 2005to2005_bottom1.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_all2/part-r-00000
sort -k2g part-r-00000 > 2005to2005_bottom2.txt
sed -n 1,1p 2005to2005_bottom2.txt > temp
mv temp 2005to2005_bottom2.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_all3/part-r-00000
sort -k2g part-r-00000 > 2005to2005_bottom3.txt
sed -n 1,1p 2005to2005_bottom3.txt > temp
mv temp 2005to2005_bottom3.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_all4/part-r-00000
sort -k2g part-r-00000 > 2005to2005_bottom4.txt
sed -n 1,1p 2005to2005_bottom4.txt > temp
mv temp 2005to2005_bottom4.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_all5/part-r-00000
sort -k2g part-r-00000 > 2005to2005_bottom5.txt
sed -n 1,1p 2005to2005_bottom5.txt > temp
mv temp 2005to2005_bottom5.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_allmega/part-r-00000
sort -k2g part-r-00000 > 2005to2005_bottommega.txt
sed -n 1,1p 2005to2005_bottommega.txt > temp
mv temp 2005to2005_bottommega.txt
rm part-r-00000
cat 2005to2005_bottom1.txt 2005to2005_bottom2.txt 2005to2005_bottom3.txt 2005to2005_bottom4.txt 2005to2005_bottom5.txt 2005to2005_bottommega.txt > 2005to2005_individualbottomall6.txt

hdfs dfs -copyToLocal /output/2005to2005_all1/part-r-00000
awk '{sum += $1; sum2 += $2} END {printf ("%d   %d\n", ((sum*10)/NR + 5)/10, ((sum2*10)/NR + 5)/10)}' part-r-00000 > 2005to2005_avg1.txt
sort -n part-r-00000 > temp
mv temp part-r-00000
awk '{arr[NR] = $1; num[NR] = $2} END {if (NR % 2 == 1) printf("%d   %d\n", arr[(NR+1)/2], num[(NR+1)/2]); else printf("%d   %d\n", (arr[NR/2] + arr[NR/2+1])/2, (num[NR/2] + num[NR/2+1])/2)}' part-r-00000 > 2005to2005_med1.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_all2/part-r-00000
awk '{sum += $1; sum2 += $2} END {printf ("%d   %d\n", ((sum*10)/NR + 5)/10, ((sum2*10)/NR + 5)/10)}' part-r-00000 > 2005to2005_avg2.txt
sort -n part-r-00000 > temp
mv temp part-r-00000
awk '{arr[NR] = $1; num[NR] = $2} END {if (NR % 2 == 1) printf("%d   %d\n", arr[(NR+1)/2], num[(NR+1)/2]); else printf("%d   %d\n", (arr[NR/2] + arr[NR/2+1])/2, (num[NR/2] + num[NR/2+1])/2)}' part-r-00000 > 2005to2005_med2.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_all3/part-r-00000
awk '{sum += $1; sum2 += $2} END {printf ("%d   %d\n", ((sum*10)/NR + 5)/10, ((sum2*10)/NR + 5)/10)}' part-r-00000 > 2005to2005_avg3.txt
sort -n part-r-00000 > temp
mv temp part-r-00000
awk '{arr[NR] = $1; num[NR] = $2} END {if (NR % 2 == 1) printf("%d   %d\n", arr[(NR+1)/2], num[(NR+1)/2]); else printf("%d   %d\n", (arr[NR/2] + arr[NR/2+1])/2, (num[NR/2] + num[NR/2+1])/2)}' part-r-00000 > 2005to2005_med3.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_all4/part-r-00000
awk '{sum += $1; sum2 += $2} END {printf ("%d   %d\n", ((sum*10)/NR + 5)/10, ((sum2*10)/NR + 5)/10)}' part-r-00000 > 2005to2005_avg4.txt
sort -n part-r-00000 > temp
mv temp part-r-00000
awk '{arr[NR] = $1; num[NR] = $2} END {if (NR % 2 == 1) printf("%d   %d\n", arr[(NR+1)/2], num[(NR+1)/2]); else printf("%d   %d\n", (arr[NR/2] + arr[NR/2+1])/2, (num[NR/2] + num[NR/2+1])/2)}' part-r-00000 > 2005to2005_med4.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_all5/part-r-00000
awk '{sum += $1; sum2 += $2} END {printf ("%d   %d\n", ((sum*10)/NR + 5)/10, ((sum2*10)/NR + 5)/10)}' part-r-00000 > 2005to2005_avg5.txt
sort -n part-r-00000 > temp
mv temp part-r-00000
awk '{arr[NR] = $1; num[NR] = $2} END {if (NR % 2 == 1) printf("%d   %d\n", arr[(NR+1)/2], num[(NR+1)/2]); else printf("%d   %d\n", (arr[NR/2] + arr[NR/2+1])/2, (num[NR/2] + num[NR/2+1])/2)}' part-r-00000 > 2005to2005_med5.txt
rm part-r-00000
hdfs dfs -copyToLocal /output/2005to2005_allmega/part-r-00000
awk '{sum += $1; sum2 += $2} END {printf ("%d   %d\n", ((sum*10)/NR + 5)/10, ((sum2*10)/NR + 5)/10)}' part-r-00000 > 2005to2005_avgmega.txt
sort -n part-r-00000 > temp
mv temp part-r-00000
awk '{arr[NR] = $1; num[NR] = $2} END {if (NR % 2 == 1) printf("%d   %d\n", arr[(NR+1)/2], num[(NR+1)/2]); else printf("%d   %d\n", (arr[NR/2] + arr[NR/2+1])/2, (num[NR/2] + num[NR/2+1])/2)}' part-r-00000 > 2005to2005_medmega.txt
rm part-r-00000
cat 2005to2005_avg1.txt 2005to2005_avg2.txt 2005to2005_avg3.txt 2005to2005_avg4.txt 2005to2005_avg5.txt 2005to2005_avgmega.txt > 2005to2005_all6avg.txt
cat 2005to2005_med1.txt 2005to2005_med2.txt 2005to2005_med3.txt 2005to2005_med4.txt 2005to2005_med5.txt 2005to2005_medmega.txt > 2005to2005_all6med.txt
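The averaging expression ((sum*10)/NR + 5)/10 is a rounding trick: awk's %d conversion truncates, so scaling by 10 and adding 5 before dividing back rounds the mean to the nearest integer.  The median command picks the middle row of the numerically sorted file, averaging the two middle rows when the row count is even.  A small check of these building blocks on invented data (report.txt and its values are made up for this illustration):

```shell
#!/bin/sh
# Check the sort/sed/awk building blocks on a made-up frequency report
# (format: number<TAB>count, like the part-r-00000 files).
printf '3\t10\n1\t20\n4\t30\n2\t45\n' > report.txt

# Bottom entry: ascending general-numeric sort on the count field, keep line 1.
sort -k2g report.txt | sed -n 1,1p

# Rounded averages: numbers average 2.5 -> 3, counts average 26.25 -> 26.
awk '{sum += $1; sum2 += $2} END {printf ("%d   %d\n", ((sum*10)/NR + 5)/10, ((sum2*10)/NR + 5)/10)}' report.txt

# Median row of the numerically sorted report: with an even row count the two
# middle rows are averaged (and %d truncates): numbers (2+3)/2 -> 2,
# paired counts (45+10)/2 -> 27.
sort -n report.txt > sorted.txt
awk '{arr[NR] = $1; num[NR] = $2} END {if (NR % 2 == 1) printf("%d   %d\n", arr[(NR+1)/2], num[(NR+1)/2]); else printf("%d   %d\n", (arr[NR/2] + arr[NR/2+1])/2, (num[NR/2] + num[NR/2+1])/2)}' sorted.txt
```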

 

The unix shell script, runallbottom_avg_med, for running all ten periods is listed below:

./runbottom_avg_med20061
./runbottom_avg_med20071
./runbottom_avg_med20081
./runbottom_avg_med20091
./runbottom_avg_med20101
./runbottom_avg_med20111
./runbottom_avg_med20121
./runbottom_avg_med20131
./runbottom_avg_med20141
./runbottom_avg_med20151

 

Refer to Section 3.5 b) for the output files of the average, median, and bottom 6 winning numbers.

To shift the individual numbers by 1 and up to 4 positions (methods ix) to xii) in Section 3.2 b)), a unix shell script per period performs the left shifting.  For example, concatefilesmore20052sl1 is used for the 2005 to 2005 period, concatefilesmore20062sl1 for the 2006 to 2005 period, …, and finally concatefilesmore20142sl1 for the 2014 to 2005 period.  Below is the code for concatefilesmore20052sl1:

hdfs dfs -copyToLocal /output/2005to2005_top12/part-r-00000
mv part-r-00000 2005to2005_top12.txt
sed -n 2,2p 2005to2005_top12.txt > temp
mv temp 2005to2005_top12.txt
hdfs dfs -copyToLocal /output/2005to2005_top22/part-r-00000
mv part-r-00000 2005to2005_top22.txt
sed -n 2,2p 2005to2005_top22.txt > temp
mv temp 2005to2005_top22.txt
hdfs dfs -copyToLocal /output/2005to2005_top32/part-r-00000
mv part-r-00000 2005to2005_top32.txt
sed -n 2,2p 2005to2005_top32.txt > temp
mv temp 2005to2005_top32.txt
hdfs dfs -copyToLocal /output/2005to2005_top42/part-r-00000
mv part-r-00000 2005to2005_top42.txt
sed -n 2,2p 2005to2005_top42.txt > temp
mv temp 2005to2005_top42.txt
hdfs dfs -copyToLocal /output/2005to2005_top52/part-r-00000
mv part-r-00000 2005to2005_top52.txt
sed -n 2,2p 2005to2005_top52.txt > temp
mv temp 2005to2005_top52.txt
hdfs dfs -copyToLocal /output/2005to2005_topmega2/part-r-00000
mv part-r-00000 2005to2005_topmega2.txt
sed -n 2,2p 2005to2005_topmega2.txt > temp
mv temp 2005to2005_topmega2.txt
cat 2005to2005_top22.txt 2005to2005_top32.txt 2005to2005_top42.txt 2005to2005_top52.txt 2005to2005_topmega2.txt 2005to2005_top12.txt > 2005to2005_individualtop5Nmega2sl1.txt

 

The outputs of the shifting results are self-explanatory.
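The shift-by-1 scripts rely on sed -n 2,2p, which prints only line 2 of each top file, i.e. the second-ranked number instead of the first (shifting by 2 would use sed -n 3,3p, and so on).  A quick illustration on invented data:

```shell
#!/bin/sh
# sed -n 2,2p keeps only the 2nd line of a ranked file:
# the 2nd-ranked winning number.
printf '25\t24\n36\t22\n7\t22\n' > top.txt
sed -n 2,2p top.txt
```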

 

Finally, method xiii) in Section 3.2 b) simply reverses the order of the top 6 winning numbers using gawk.  Here is one example of the code, the script reverse2005, for the period of 2005 to 2005:

gawk '{L[n++] = $0 } END {while (n--) print L[n]}' /home/notroot/lab/workspacehadoopHW/hivescript/2005to2005_individualtop5Nmega.txt > 2005to2005_individualtop5NmegaR.txt
gawk '{L[n++] = $0 } END {while (n--) print L[n]}' /home/notroot/lab/workspacehadoopHW/hivescript/2005to2005_top5Nmega.txt > 2005to2005_top5NmegaR.txt
gawk '{L[n++] = $0 } END {while (n--) print L[n]}' /home/notroot/lab/workspacehadoopHW/hivescript/2005to2005_top5numsNmega.txt > 2005to2005_top5numsNmegaR.txt
gawk '{L[n++] = $0 } END {while (n--) print L[n]}' /home/notroot/lab/workspacehadoopHW/hivescript/2005to2005_top6.txt > 2005to2005_top6R.txt

The results of the reverse order are self-explanatory.
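The one-liner buffers every input line in the array L and then prints the array back from last to first.  A quick check on invented data (plain awk is used here; the script's gawk behaves the same for this program):

```shell
#!/bin/sh
# Reverse the lines of a file, as reverse2005 does.
printf 'first\nsecond\nthird\n' > in.txt
awk '{L[n++] = $0} END {while (n--) print L[n]}' in.txt
```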

 

3.5 Execution

a)      For approach a) in Section 3.4, the output files in the above example are:

File 1 name: /output/2006to2005_top6/part-r-00000

Command:

hdfs dfs -cat /output/2006to2005_top6/part-r-00000

 

File content:

OpenJDK Server VM warning: You have loaded library /home/notroot/lab/software/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.

It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

15/09/07 18:02:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

25        24

36        22

7          22

5          20

14        20

48        19

 

 

File 2 name: /output/2006to2005_top5Nmega/part-r-00000

Command:

hdfs dfs -cat /output/2006to2005_top5Nmega/part-r-00000

File content:


36        28

7          27

25        24

20        24

5          24

14        22

 

File 3 name: /output/2006to2005_individualtop5Nmega.txt

Command:

hdfs dfs -cat /output/2006to2005_individualtop5Nmega.txt

 

File content:


5          15

12        10

20        8

42        9

52        16

4          9

 

File 4 name:  /output/2006to2005_top5numsNmega.txt

Command:

hdfs dfs -cat /output/2006to2005_top5numsNmega.txt

 

File content:


25        24

36        22

7          22

5          20

14        20

4          9

 

b)      File name for average calculation:

/input/2005to2005_all6avg.txt

Command:

hdfs dfs -cat /input/2005to2005_all6avg.txt

File content:


12     3

20     2

30     2

37     2

44     3

 23     2

 

File name for median calculation:

     /input/2005to2005_all6med.txt

 

  Command:

     hdfs dfs -cat /input/2005to2005_all6med.txt

 File content:


10        1

18        3

31        2

38        1

45        2

22        1

 

    

File name for bottom calculation:

      /input/2005to2005_bottomall6.txt

 

Command:

      hdfs dfs -cat /input/2005to2005_bottomall6.txt

 

File content:


19        1

11        2

47        2

15        3

26        3

29        3

 

File name for individual winning number bottom calculation:

      /input/2005to2005_individualbottomall6.txt

 

Command:

      hdfs dfs -cat /input/2005to2005_individualbottomall6.txt

 

File content:


19        1

14        1

12        1

13        1

27        1

10        1

 Click here to go to Chapter 4:   www.leisdesignservices.com/mapreducedataanalysis.htm

 Or click here to go to the Table of Contents:  www.leisdesignservices.com/hadoopproofofconcept.htm