Friday, May 9, 2014

Unix Shell Example: Dictionary

1. Dictionary

dict: a dictionary file of American English words, one word per line
 A  
 A's  
 AA's  
 AB's  
 ABM's  
 AC's  
 ACTH's  
 AI's  
 AIDS's  
 AM's  
 AOL  
 AOL's  
 ASCII's  
 ASL's  
 ATM's  
 ATP's  
 AWOL's  
 AZ's  
 ......  

script:
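-h: suppress the file name prefix on output lines when several input files are searched
-i: ignore letter case when matching
sort -u -f: sort the matches case-insensitively and drop duplicate lines (lines that differ only in case count as duplicates)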
 #! /bin/bash  
 pattern="$1"  
 grep -E -h -i "$pattern" ./dict | sort -u -f
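
The effect of -h only shows up when grep searches more than one file; with the single ./dict file, grep prints no file name anyway. A minimal sketch of the difference, assuming two hypothetical word lists (a.txt and b.txt) created in a scratch directory:

```shell
#!/bin/bash
# Two hypothetical word lists in a scratch directory (not part of the post).
dir=$(mktemp -d)
printf '%s\n' apple world > "$dir/a.txt"
printf '%s\n' World zebra > "$dir/b.txt"

# Without -h, each match is prefixed with the name of its file.
with_names=$(cd "$dir" && grep -E -i 'world' a.txt b.txt)

# With -h, only the matching lines themselves are printed.
without_names=$(cd "$dir" && grep -E -h -i 'world' a.txt b.txt)

printf '%s\n' "$with_names"      # a.txt:world and b.txt:World
printf '%s\n' "$without_names"   # world and World
rm -rf "$dir"
```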

terminal:
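1) First command: list all words containing "hello" (ignoring letter case). The pattern is not anchored, so "Othello" matches too.
2) Second command: list all words containing "world" (ignoring letter case).
In practice, quote the pattern (for example 'Hello.*') so the shell cannot expand it as a file name glob before the script sees it.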
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ ./script Hello.*  
 hello  
 hellos  
 hello's  
 Othello  
 Othello's  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ ./script world.*  
 otherworldly  
 underworld  
 underworlds  
 underworld's  
 unworldly  
 world  
 worldlier  
 worldliest  
 worldliness  
 worldliness's  
 worldly  
 worlds  
 world's  
 worldwide  
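
To restrict the output to words that really begin with the pattern, the regular expression can be anchored with ^. A minimal sketch, assuming a hypothetical five-word sample file in place of ./dict:

```shell
#!/bin/bash
# Hypothetical five-word sample standing in for ./dict.
words=$(mktemp)
printf '%s\n' hello hellos Othello world underworld > "$words"

# Unanchored: matches any line that contains "hello".
contains=$(grep -E -i 'hello' "$words")

# Anchored with ^: matches only lines that begin with "hello".
starts=$(grep -E -i '^hello' "$words")

printf 'contains:\n%s\n' "$contains"
printf 'starts with:\n%s\n' "$starts"
rm -f "$words"
```

With the script from this section, the same idea would be ./script '^hello' (single quotes keep the shell from touching the pattern).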

2. Word lists
Given a block of text, list the most frequent words together with the number of times each one occurs.

text:
 Hello world!   
 Hello Hello my great world!  

script:
1) First command in the pipeline: -c complements the character set and -s squeezes runs of repeated output characters, so every run of characters outside the set A-Za-z! becomes a single newline. The result is one word per line.
2) Second command in the pipeline: translate all upper case letters into lower case.
3) sort: with no options it compares whole lines in the collation order of the current locale, which puts identical words next to each other.
4) uniq -c: collapse adjacent identical lines into one and prefix it with the number of occurrences.
5) sort -k1,1nr -k2,2: sort on the first field (the count) numerically in descending order; when counts are equal, sort on the second field (the word).
6) ${1:-3} expands to the first positional parameter, or to the default 3 if it is unset or empty. In sed, -n suppresses the automatic printing of the pattern space and the p flag prints only the lines on which the substitution succeeded. s/ / / replaces a space with a space, which succeeds on every line in the range 1,N because uniq -c output always contains a space. The net effect is to print only the first N lines.
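 #! /bin/bash
 # Sets are quoted so the shell cannot glob them; no brackets, since tr would treat them as literal characters.
 tr -cs 'A-Za-z!' '\n' < ./text \
  | tr 'A-Z' 'a-z' \
  | sort \
  | uniq -c \
  | sort -k1,1nr -k2,2 \
  | sed -n "1,${1:-3} s/ / /p"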
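
The final sed stage only limits the output to the first N lines. A minimal sketch of the same pipeline wrapped in a function that reads standard input, with head -n swapped in for sed. The name top_words is hypothetical, and LC_ALL=C is an assumption added here to make the sort order predictable across locales; neither appears in the script above.

```shell
#!/bin/bash
# top_words is a hypothetical helper: it reads text on stdin and prints the
# N most frequent words, where N is its first argument (default 3).
top_words() {
  tr -cs 'A-Za-z!' '\n' \
    | tr 'A-Z' 'a-z' \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -k1,1nr -k2,2 \
    | head -n "${1:-3}"
}

# Same sample text as above; ask for the two most frequent words.
result=$(printf 'Hello world!\nHello Hello my great world!\n' | top_words 2)
printf '%s\n' "$result"
```

Inside a function, ${1:-3} refers to the function's own first argument, so top_words with no argument falls back to three lines just as the script does.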

terminal:
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ ./script  
    3 hello  
    2 world!  
    1 great  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ ./script 2  
    3 hello  
    2 world!  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ ./script 1  
    3 hello  
