Monday, May 5, 2014

Unix Shell Text Sorting: sort

1. Sort by field
test_1 (note the leading spaces):

   :YY: Female  
 :XX: Male  

terminal:
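-k1 means the sort key starts at the first field. By default, sort uses white space as the field separator. In this run, the leading spaces in front of ":YY:" did not affect the order: ":XX:" and ":YY:" were compared in dictionary order, so the line starting with ":XX:" comes before the line starting with "   :YY:". Note that sort only guarantees that leading blanks are ignored when -b is given. Without -b, the blanks are part of the key, and how they compare depends on the locale (in the C locale a space sorts before ":", so the "   :YY:" line would come first).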
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1 ./test_1  

 :XX: Male  
   :YY: Female  

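-t: means using ":" as the field separator. The key still starts at the first field. With ":" as the separator, the first fields of the 3 lines are: empty, empty, and 3 spaces. Because -k1 extends the key to the end of the line, the remaining fields are compared as well, and we get the following result: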
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1 -t: ./test_1  

 :XX: Male  
   :YY: Female  
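
The effect of leading blanks is easier to see when the locale is pinned. The following is a minimal sketch, not part of the original session: it assumes LC_ALL=C (byte-wise comparison) and feeds the two non-empty test_1 lines through printf instead of reading the file.

```shell
# Assumption: LC_ALL=C forces byte-order comparison, so the result
# does not depend on the locale of the machine running this.
input=$(printf '   :YY: Female\n:XX: Male\n')

# Without -b the leading spaces belong to the key; in the C locale
# a space sorts before ":", so the "   :YY:" line comes first.
with_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -k1)

# With -b the leading blanks are skipped, so ":XX:" is compared with ":YY:".
without_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -b -k1)

printf '%s\n---\n%s\n' "$with_blanks" "$without_blanks"
```

If the goal is to ignore leading blanks in every locale, -b states that intent explicitly.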

2. Sort starting from one field
test_2:
 10:YY:Lawyer  
 10:XX:Engineer  

terminal:
We use ":" as the delimiter and sort starting from the first field. Note: -k1 means "the key starts at the 1st field and runs to the end of the record", not "the 1st field only". That's why sort swapped the two records even though their first fields are equal: the 2nd field "XX" is less than "YY", so the "XX" line goes first.
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1 -t: ./test_2  
 10:XX:Engineer  
 10:YY:Lawyer  

3. Reverse the order
test_2:
 10:YY:Lawyer  
 15:XX:Engineer  

terminal:
With -r, the sort order is reversed, so the line starting with 15 goes first.
Without -r, sort uses the normal ascending order, so the line starting with 10 goes first.
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1r ./test_2  
 15:XX:Engineer  
 10:YY:Lawyer  
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1 ./test_2  
 10:YY:Lawyer  
 15:XX:Engineer  
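
The r modifier can also be attached to a single key while other keys stay ascending. A small sketch under stated assumptions: the third record ("15:AA:Doctor") is made up for illustration, ":" is the delimiter, and LC_ALL=C pins the comparison.

```shell
# Assumption: the "15:AA:Doctor" record is hypothetical sample data.
input=$(printf '10:YY:Lawyer\n15:XX:Engineer\n15:AA:Doctor\n')

# Key 1: field 1 only, numeric, reversed (15 before 10).
# Key 2: field 2 only, ascending, used when key 1 ties ("AA" before "XX").
mixed=$(printf '%s\n' "$input" | LC_ALL=C sort -t: -k1,1nr -k2,2)

printf '%s\n' "$mixed"
```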

4. Sort by number
test_2:
 10:YY:Lawyer  
 9:XX:Engineer  

terminal:
First command: with ":" as the delimiter, we sort starting from the 1st field, which is a number. But a plain sort compares in dictionary order, character by character, so "1" sorts before "9" and the line starting with "10" goes first.
Second command: -n compares the key by numeric value, so 9 < 10 and the order is reversed.
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -t: -k1 ./test_2  
 10:YY:Lawyer  
 9:XX:Engineer  
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -t: -k1n ./test_2  
 9:XX:Engineer  
 10:YY:Lawyer  
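
The difference between dictionary order and numeric order shows up more clearly with a third value. A minimal sketch, assuming an extra made-up record ("100:ZZ:Doctor"), the key limited to field 1 with -k1,1, and LC_ALL=C:

```shell
# Assumption: the "100:ZZ:Doctor" record is hypothetical sample data.
input=$(printf '10:YY:Lawyer\n9:XX:Engineer\n100:ZZ:Doctor\n')

# Dictionary order on field 1: "10" < "100" < "9", compared character by character.
lexical=$(printf '%s\n' "$input" | LC_ALL=C sort -t: -k1,1)

# Numeric order on field 1: 9 < 10 < 100.
numeric=$(printf '%s\n' "$input" | LC_ALL=C sort -t: -k1,1n)

printf '%s\n---\n%s\n' "$lexical" "$numeric"
```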

5. Sort By Range of Fields
test_2:
 10 20 XX   
 10 10 YY  

terminal:
First command: sort based on field 1 to field 3
Second command: sort based on field 2 to field 3
Third command: sort only based on field 3
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1,3 ./test_2  
 10 10 YY  
 10 20 XX   
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k2,3 ./test_2  
 10 10 YY  
 10 20 XX   
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k3,3 ./test_2  
 10 20 XX   
 10 10 YY  

6. Sort by specific position
test_2:
 XX20 Engineer   
 YY10 Lawyer  

terminal:
First command: sort by the 1st field; XX20 comes before YY10.
Second command: -k1.3 means the key starts at the 3rd character of the 1st field and runs to the end of the line, so the comparison starts at "10" and "20" in this case.
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1 ./test_2  
 XX20 Engineer   
 YY10 Lawyer  
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1.3 ./test_2  
 YY10 Lawyer  
 XX20 Engineer   
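
A key can be closed at a character position as well. The sketch below assumes a made-up third record ("ZZ05 Clerk") and uses -k1.3,1.4n to compare only characters 3 and 4 of field 1 as a number; LC_ALL=C is set to keep the run locale-independent.

```shell
# Assumption: the "ZZ05 Clerk" record is hypothetical sample data.
input=$(printf 'XX20 Engineer\nYY10 Lawyer\nZZ05 Clerk\n')

# The key covers characters 3..4 of field 1 ("20", "10", "05"), compared numerically.
by_digits=$(printf '%s\n' "$input" | LC_ALL=C sort -k1.3,1.4n)

printf '%s\n' "$by_digits"
```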

7. Sort and output unique records
test_2:
 10 XX Engineer   
 10 YY Lawyer  

terminal:
First command: the key runs from field 1 to the end of the record, so no two records are duplicates here.
Second command: the key runs from field 1 to field 1, meaning that only field 1 is compared, so -u outputs one record for each distinct value of that field.
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1 -u ./test_2  
 10 XX Engineer   
 10 YY Lawyer  
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ sort -k1,1 -u ./test_2  
 10 XX Engineer   
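
With more distinct keys, -u keeps one line per key value. A short sketch, assuming a made-up third record ("20 ZZ Doctor"); it extracts the first column with cut to show which keys survive.

```shell
# Assumption: the "20 ZZ Doctor" record is hypothetical sample data.
input=$(printf '10 XX Engineer\n10 YY Lawyer\n20 ZZ Doctor\n')

# Only field 1 is the key, so the two "10" lines count as duplicates of each other.
unique=$(printf '%s\n' "$input" | LC_ALL=C sort -k1,1 -u)

# Show just the surviving key values, one per line.
unique_keys=$(printf '%s\n' "$unique" | cut -d' ' -f1)

printf '%s\n' "$unique"
```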

8. Sort Text Block
display the text block:
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ cat ./test  
 #sortkey: Bill  
 Bill Gates  
 President of Microsoft  
 =====  
 #sortkey: Jobs  
 Steve Jobs  
 President of CEO  
 =====  
 #sortkey: Barack  
 Barack Obama  
 President of United States  

put each text block on one line:
RS sets the input record separator; gsub replaces every "\n" in the record with "--". Note that a multi-character RS is not supported by every awk; gawk and mawk accept it.
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ cat ./test |  
 > awk -v RS="=====\n" '{ gsub("\n","--"); print}'  
 #sortkey: Bill--Bill Gates--President of Microsoft--  
 #sortkey: Jobs--Steve Jobs--President of CEO--  
 #sortkey: Barack--Barack Obama--President of United States--  

sort all text blocks:
We use sort to order the blocks. -f folds lowercase letters to uppercase for comparison, so case is ignored. -k2 means the key starts at the 2nd field, which is the name after "#sortkey:".
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ cat ./test |  
 > awk -v RS="=====\n" '{ gsub("\n","--"); print}' |  
 > sort -f -k2  
 #sortkey: Barack--Barack Obama--President of United States--  
 #sortkey: Bill--Bill Gates--President of Microsoft--  
 #sortkey: Jobs--Steve Jobs--President of CEO--  

convert the sorting result to the original format:
ORS sets the output record separator; gsub replaces every "--" with "\n".
 aubinxia@aubinxia-VirtualBox:~/Desktop/xxdev$ cat ./test |  
 > awk -v RS="=====\n" '{ gsub("\n","--"); print}' |  
 > sort -f -k2 |  
 > awk -v ORS="=====\n" '{ gsub("--", "\n"); print}'  
 #sortkey: Barack  
 Barack Obama  
 President of United States  
 =====  
 #sortkey: Bill  
 Bill Gates  
 President of Microsoft  
 =====  
 #sortkey: Jobs  
 Steve Jobs  
 President of CEO  
 =====  
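
The three steps can be wrapped into one function. This is a sketch rather than the pipeline used above: sort_blocks is a hypothetical name, and instead of a multi-character RS it collects lines until a "=====" line, which should work in any POSIX awk. It assumes the text itself never contains "--".

```shell
# sort_blocks FILE: hypothetical helper that sorts "====="-separated blocks
# by the word after "#sortkey:". Assumes no block contains "--".
sort_blocks() {
  awk '
    /^=====$/ { print block; block = ""; next }   # separator line: flush the block
    { block = block $0 "--" }                     # join the lines of a block with "--"
    END { if (block != "") print block }          # flush the last block
  ' "$1" |
  LC_ALL=C sort -f -k2 |
  awk '{ gsub(/--/, "\n"); printf "%s=====\n", $0 }'   # restore lines and separator
}
```

It would be called as sort_blocks ./test. As in the original pipeline, every block in the output ends with a "=====" line, including the last one.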

9. Notes:
sort is efficient: GNU sort uses a merge sort, not a simple bubble sort.
By default, sort is not a stable sort: for two records whose keys compare equal, the original order is not guaranteed. GNU sort breaks such ties with a last-resort comparison of the whole line; the -s option turns that off and keeps the input order.
