Sunday, May 4, 2014

Unix Shell Fields Processing: join

1. join: merge two files
test_1 (note the empty line):
 
 YY Female  
 XX Male  

test_2 (note the empty line):
 XX Engineer  

 YY Lawyer  

test_join:
 #! /bin/bash  
 # Both data files must be sorted on the join field (the first field by
 # default); a plain sort of the whole line does that here
 sort ./test_1 > test_1.tmp  
 sort ./test_2 > test_2.tmp  

 # output the content of sorted result  
 echo =====================  
 cat ./test_1.tmp  
 echo =====================  
 cat ./test_2.tmp  
 echo =====================  

 # join assumes that both files are already sorted on the join field, and
 # its algorithm takes advantage of this. If the files are not sorted, join
 # complains and the result is unreliable
 join ./test_1.tmp ./test_2.tmp  

 # remove the temporary files  
 rm test_1.tmp  
 rm test_2.tmp  
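
The sorted-input requirement mentioned in the script comments can be checked before calling join. The following is a minimal sketch, not part of the original script: it assumes a POSIX sort (whose -c option only checks the order), and the scratch file and the variable names unsorted and status are hypothetical.

```shell
#!/bin/sh
# Hypothetical scratch file, used only for this sketch.
unsorted=$(mktemp)
printf 'YY Female\nXX Male\n' > "$unsorted"

# sort -c only checks the order: it exits 0 if the input is already sorted
# on the given key and non-zero otherwise; it writes no sorted output.
if LC_ALL=C sort -c -k 1,1 "$unsorted" 2>/dev/null; then
    status=sorted
else
    status=unsorted
fi
echo "$status"

rm "$unsorted"
```

Here the file is deliberately left unsorted (YY before XX), so the check fails and the script reports that a sort is needed before join can be used.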

terminal:
sort compares whole lines by default, which here orders them by the first field. For the empty line, the first field is null, which compares lower than anything else, so the empty line always comes first in the sorted result.
join merges the two sorted files into one output; by default, it matches lines on field 1.
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ ./test_join  
 =====================  

 XX Male  
 YY Female  
 =====================  

 XX Engineer  
 YY Lawyer  
 =====================  

 XX Male Engineer  
 YY Female Lawyer  
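
The same merge can be reproduced in a self-contained form. This is a minimal sketch under a few assumptions that are not in the original script: the data is written to a hypothetical scratch directory (the names people and jobs are made up), the empty lines are left out for brevity, and LC_ALL=C is set so that sort and join agree on the collating order.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, used only for this sketch.
tmpdir=$(mktemp -d)
printf 'YY Female\nXX Male\n' > "$tmpdir/people"
printf 'XX Engineer\nYY Lawyer\n' > "$tmpdir/jobs"

# Sort both inputs on the first field, under the same locale join will use.
LC_ALL=C sort -k 1,1 "$tmpdir/people" > "$tmpdir/people.sorted"
LC_ALL=C sort -k 1,1 "$tmpdir/jobs" > "$tmpdir/jobs.sorted"

# Join on field 1 (the default); the common field is printed only once.
joined=$(LC_ALL=C join "$tmpdir/people.sorted" "$tmpdir/jobs.sorted")
printf '%s\n' "$joined"

# Remove the scratch directory.
rm -r "$tmpdir"
```

Pinning the locale for both commands avoids the case where sort orders the keys one way and join expects another.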

2. join: use different delimiter
test_1:

 :YY: Female  
 :XX: Male  

test_2:
 :XX: Engineer  

 :YY: Lawyer  

test_join:
 #! /bin/bash  
 # Sort the data files first  
 sort ./test_1 > test_1.tmp  
 sort ./test_2 > test_2.tmp  

 # output the content of sorted result  
 echo =====================  
 cat ./test_1.tmp  
 echo =====================  
 cat ./test_2.tmp  
 echo =====================  

 # join uses ':' as the delimiter here, so in the data files "XX" and "YY"
 # are the 2nd field; the 1st field is empty on the lines that contain them.
 # -1 2 means: for the first file, join on the 2nd field
 # -2 2 means: for the second file, join on the 2nd field too
 # -o 1.2 means: explicitly output the first file's 2nd field
 # -o 1.3 means: explicitly output the first file's 3rd field
 # By listing the output fields explicitly, we override join's default of
 # printing the common field only once; the output layout is now under
 # our control.
 join -t ':' -1 2 -2 2 -o 1.2 -o 1.3 -o 2.2 -o 2.3 ./test_1.tmp ./test_2.tmp  

 # remove the temporary files  
 rm test_1.tmp  
 rm test_2.tmp  

terminal:
Since the script specifies the output fields explicitly, all four output fields are null for the empty line, so the result is ":::": three delimiters separating 4 null fields.
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ ./test_join  
 =====================  

 :XX: Male  
 :YY: Female  
 =====================  

 :XX: Engineer  
 :YY: Lawyer  
 =====================  
 :::  
 XX: Male:XX: Engineer  
 YY: Female:YY: Lawyer  
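
The repeated -o options can also be written as a single comma-separated list, which is the form POSIX describes. The following is a minimal sketch rather than the original script: the scratch directory and the file names people and jobs are hypothetical, the empty lines are omitted, and the inputs are sorted explicitly on the 2nd ':'-separated field under LC_ALL=C.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, used only for this sketch.
tmpdir=$(mktemp -d)
printf ':YY: Female\n:XX: Male\n' > "$tmpdir/people"
printf ':XX: Engineer\n:YY: Lawyer\n' > "$tmpdir/jobs"

# Sort on the 2nd ':'-separated field, which is the field join will match on.
LC_ALL=C sort -t ':' -k 2,2 "$tmpdir/people" > "$tmpdir/people.sorted"
LC_ALL=C sort -t ':' -k 2,2 "$tmpdir/jobs" > "$tmpdir/jobs.sorted"

# One -o list instead of four separate -o options.
joined=$(LC_ALL=C join -t ':' -1 2 -2 2 -o 1.2,1.3,2.2,2.3 \
    "$tmpdir/people.sorted" "$tmpdir/jobs.sorted")
printf '%s\n' "$joined"

# Remove the scratch directory.
rm -r "$tmpdir"
```

The leading space in " Male" and " Engineer" is preserved because, with -t ':', only the colon separates fields; blanks are ordinary field content.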
