Sunday, July 13, 2014

Unix Shell: Spell Checker

1. comm command
Given two sorted files, the comm command reports the lines common to both files, the lines unique to file1, and the lines unique to file2.
Note: both files must be sorted first.

terminal:
1 - 2) Print the contents of t1 and t2; each contains two lines of text.
3) Run comm on the two files to get the common and unique lines. The --check-order option makes comm verify that its input is sorted as it reads each line. Here it complains that file 1 is not sorted.
4 - 5) Sort t1 and t2, writing the results to sorted_t1 and sorted_t2 respectively.
6 - 7) Print the contents of sorted_t1 and sorted_t2; both files are now sorted.
8) Run comm again to get the common and unique lines.
First column: "Hello Los Angeles!" exists only in the first file, sorted_t1.
Second column: "Hello New York!" exists only in the second file, sorted_t2.
Third column: "Hello world!" exists in both files.
9) The -1 option suppresses the first column (lines unique to file 1).
10) The -2 option suppresses the second column (lines unique to file 2).
11) The -3 option suppresses the third column (lines common to both files).
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ cat t1  
 Hello world!  
 Hello Los Angeles!  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ cat t2  
 Hello world!  
 Hello New York!  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ comm --check-order t1 t2  
         Hello world!  
 comm: file 1 is not in sorted order  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ sort <t1 >sorted_t1  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ sort <t2 >sorted_t2  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ cat sorted_t1  
 Hello Los Angeles!  
 Hello world!  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ cat sorted_t2  
 Hello New York!  
 Hello world!  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ comm --check-order sorted_t1 sorted_t2  
 Hello Los Angeles!  
     Hello New York!  
         Hello world!  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ comm --check-order -1 sorted_t1 sorted_t2  
 Hello New York!  
     Hello world!  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ comm --check-order -2 sorted_t1 sorted_t2  
 Hello Los Angeles!  
     Hello world!  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ comm --check-order -3 sorted_t1 sorted_t2  
 Hello Los Angeles!  
     Hello New York!  
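
The steps above can also be collected into a small script. This is only a sketch: the scratch directory from mktemp and the variable names are made up for illustration, and LC_ALL=C is an assumption added here so that sort and comm agree on ordering whatever the user's locale is. The options are combined (-12, -23, -13) so that each call prints exactly one column.

```shell
#!/bin/sh
# Sketch: compare two small files with comm.
# Assumption: LC_ALL=C so sort and comm use the same byte-wise ordering.
export LC_ALL=C

workdir=$(mktemp -d)   # hypothetical scratch directory
cd "$workdir" || exit 1

printf 'Hello world!\nHello Los Angeles!\n' > t1
printf 'Hello world!\nHello New York!\n'   > t2

# comm needs sorted input, so sort each file first.
sort t1 > sorted_t1
sort t2 > sorted_t2

# -12 suppresses columns 1 and 2, leaving the lines common to both files.
common=$(comm -12 sorted_t1 sorted_t2)
# -23 leaves the lines found only in the first file.
only_t1=$(comm -23 sorted_t1 sorted_t2)
# -13 leaves the lines found only in the second file.
only_t2=$(comm -13 sorted_t1 sorted_t2)

printf 'common:  %s\nonly t1: %s\nonly t2: %s\n' "$common" "$only_t1" "$only_t2"
```

Because only one column is left in each call, the output has no leading tabs and is easy to capture in a variable.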

2. A simple self-made spell checker program with "comm" command
terminal:
1) Print the contents of owndict, a simple homemade "dictionary". The spell checker works by comparing t1 against owndict.
2) Print the contents of t1, which contains a misspelled word: "worlds".
3) Sort owndict and write the result to sorted_dict.
4) Print the contents of sorted_dict.
5) Use comm to list the lines that exist only in t1 and not in sorted_dict. The -13 option suppresses the first and third columns, i.e. the lines found only in sorted_dict and the lines found in both files. Any line that exists in t1 but not in our dictionary sorted_dict is printed and treated as a misspelled word. (t1 happens to be sorted already, which comm requires.)
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ cat owndict  
 Hello  
 world  
 New  
 York  
 Los  
 Angelesaubinxia@aubinxia-fastdev:~/Desktop/xxdev$ cat t1  
 Hello  
 worlds  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ sort <owndict >sorted_dict  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ cat sorted_dict  
 Angeles  
 Hello  
 Los  
 New  
 world  
 York  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ comm -13 sorted_dict t1  
 worlds  
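
The same idea extends to ordinary prose if the text is first split into one word per line. The following is a rough sketch, not a full spell checker: the file names doc.txt and words.dict are hypothetical, the sample sentence is invented, and LC_ALL=C is assumed so that sort and comm agree on ordering. The comparison is case-sensitive, so "hello" would be flagged when only "Hello" is in the dictionary.

```shell
#!/bin/sh
# Sketch: flag words in doc.txt that are absent from a word list.
# Assumptions: doc.txt and words.dict are hypothetical file names;
# LC_ALL=C keeps sort and comm consistent with each other.
export LC_ALL=C

workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Hello\nworld\nNew\nYork\nLos\nAngeles\n' > words.dict
printf 'Hello worlds! Hello New Yorc.\n' > doc.txt

sort -u words.dict > sorted_dict

# tr -cs replaces each run of non-letters with a single newline,
# giving one word per line; sort -u orders and de-duplicates them.
# comm -13 then keeps only the words that appear in the text (file 2,
# read from stdin as "-") but not in the dictionary (file 1).
misspelled=$(tr -cs 'A-Za-z' '\n' < doc.txt | sort -u | comm -13 sorted_dict -)

printf '%s\n' "$misspelled"
```

With the sample input this prints "Yorc" and "worlds", one per line.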

3. Spell command
The spell command can be used to find misspelled words in a text file.
terminal:
1) Print the contents of t1. "color" on the first line is American English; "colour" on the second line is British English.
2) Print the contents of owndict, which is our own "dictionary".
3) Run spell to find misspelled words. Here "colour" on the second line is flagged, because spell uses American English as the standard by default.
4) The -b option tells the spell command to use British English as the standard; with it, "colour" is no longer flagged.
5) The -d option lets the user specify their own dictionary file. Since owndict does not contain "colour", the "colour" in t1 is flagged as misspelled.
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ cat t1  
 I love the blue color!  
 I love the red colour!  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ cat owndict  
 I  
 love  
 the  
 blue  
 red  
 color  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ spell t1  
 colour  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ spell -b t1  
 aubinxia@aubinxia-fastdev:~/Desktop/xxdev$ spell -d owndict t1  
 colour  
Note:
If the locale changes, the dictionary must be re-sorted under the new locale's collation rules; otherwise the comparison results will be unreliable.
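
One way to avoid the locale problem noted above is to pin the same locale for both the sort and the comparison. The sketch below assumes the C locale, which is byte-wise and puts uppercase ASCII letters before lowercase ones, so "York" sorts before "world", unlike the order in the sorted_dict listing in section 2. The scratch directory and variable names are illustrative only.

```shell
#!/bin/sh
# Sketch: pin the locale so sort and comm agree on ordering.
# Assumption: LC_ALL=C (byte-wise order); any locale works as long as
# the same one is used for both commands.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Hello\nworld\nNew\nYork\nLos\nAngeles\n' > owndict
printf 'Hello\nworlds\n' > t1

# Sort the dictionary and the input under the C locale.
LC_ALL=C sort owndict > sorted_dict
LC_ALL=C sort t1 > sorted_t1

# Join the sorted dictionary onto one line to show the resulting order.
c_order=$(paste -s -d ' ' sorted_dict)

# Compare under the same locale that was used for sorting.
unknown=$(LC_ALL=C comm -13 sorted_dict sorted_t1)

printf 'order:   %s\nunknown: %s\n' "$c_order" "$unknown"
```

If the dictionary had been sorted under one locale and comm run under another, comm could see its input as out of order and report wrong results.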
