C++ || Simple Spell Checker Using A Hash Table
The following is another programming assignment which was presented in a C++ Data Structures course. This assignment was used to gain more experience using hash tables.
REQUIRED KNOWLEDGE FOR THIS PROGRAM
Hash Table - What Is It?
How To Create A Spell Checker
How To Read Data From A File
Strtok - Split Strings Into Tokens
#include 'HashTable.h'
The Dictionary File - Download Here
== OVERVIEW ==
This program first reads words from a dictionary file, and inserts them into a hash table.
The dictionary file consists of a list of 62,454 correctly spelled lowercase words, separated by whitespace. The words are inserted into the hash table, with each bucket growing dynamically as necessary to hold all of the incoming data.
After the reading of the dictionary file is complete, the program prompts the user for input. After input is obtained, each word that the user enteres into the program is looked up within the hash table to see if it exists. If the user entered word exists within the hash table, then that word is spelled correctly. If not, a list of possible suggested spelling corrections is displayed to the screen.
== HASH TABLE STATISTICS ==
To better understand how hash tables work, this program reports the following statistics to the screen:
• The total size of the hash table.
• The size of the largest hash table bucket.
• The size of the smallest hash table bucket.
• The total number of buckets used.
• The average hash table bucket size.
A timer is used in this program to time (in seconds) how long it takes to read in the dictionary file. The program also saves each hash table bucket into a separate output .txt file. This is used to further visualize how the hash table data is internally being stored within memory.
== SPELL CHECKING ==
The easiest way to generate corrections for a spell checker is via a trial and error method. If we assume that the misspelled word contains only a single error, we can try all possible corrections and look each up in the dictionary.
Example:
wird: bird gird ward word wild wind wire wiry
Traditionally, spell checkers look for four possible errors: a wrong letter (“wird”), also knows as alteration. An inserted letter (“woprd”), a deleted letter (“wrd”), or a pair of adjacent transposed letters (“wrod”).
The easiest of which is checking for a wrong letter. For example, if a word isnt found in the dictionary, all variants of that word can be looked up by changing one letter. Given the user input “wird,” a one letter variant can be “aird”, “bird”, “cird”, etc. through “zird.” Then “ward”, “wbrd”, “wcrd” through “wzrd”, can be checked, and so forth. Whenever a match is found within the dictionary, the spelling correction should be displayed to the screen.
For a detailed analysis how the other methods can be constructed, click here.
===== SIMPLE SPELL CHECKER =====
This program uses a custom template.h class. To obtain the code for that class, click here.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 |
// ============================================================================ // Author: Kenneth Perkins // Date: Jan 31, 2013 // Taken From: http://programmingnotes.org/ // File: SpellCheck.cpp // Description: This is a simple spell checker which tests the HashTable.h // class. // ============================================================================ #include <iostream> #include <fstream> #include <cctype> #include <cstring> #include <string> #include <iomanip> #include <ctime> #include <limits> #include "HashTable.h" using namespace std; // iterator declaration for hash table typedef HashTable<string>::Iterator iterDec; // hash table size const int TABLE_SIZE = 19000; // strtok delimiters const char* DELIMITERS = " ,.-\':;?()+*/\\%$#!\"@\^&"; // function prototypes void PrintTableStats(HashTable<string>& hashTable); int SpellCheck(HashTable<string>& hashTable, string word); string ToLowerCase(string word); int main() { // declare variables int result = 0; string userInput; string currWord; clock_t beg; // used to time the hashtable load clock_t end; // used to time the hashtable load char response; ifstream infile; HashTable<string> hashTable(TABLE_SIZE); // open the dictionary file infile.open("INPUT_Dictionary_programmingnotes_freeweq_com.txt"); // check if the file exists, EXIT if it doesnt if(infile.fail()) { cout<<"\n\n**ERROR - The dictionary file could not be found...\n"; exit(1); } cerr<<"\nLoading dictionary...."; beg = clock(); // start the timer // get data from file and put into hashtable while(infile >> currWord) { // makes sure duplicate words arent inserted into table if(!hashTable.Count(currWord)) { hashTable.Insert(currWord); } } infile.close(); PrintTableStats(hashTable); end = clock()-beg; // end the timer cout<<"\n\nDictionary loaded in "<< (double)end / ((double)CLOCKS_PER_SEC)<<" secs!"; // creates a line separator cout<<endl; cout.fill('-'); cout<<left<<setw(50)<<""<<endl; do{ // get user input cout<<"\n>> Please enter a sentence: "; getline(cin,userInput); cout<<endl; // split each word from the string into individual words to check if // they are spelled correctly char* splitInput = strtok(const_cast<char*>(userInput.c_str()),DELIMITERS); while(splitInput!=NULL) { currWord = splitInput; currWord = ToLowerCase(currWord); result += SpellCheck(hashTable,currWord); splitInput = strtok(NULL,DELIMITERS); } // display results if(result > 0) { cout<<"Number of words spelled incorrectly: "<<result<<endl; result = 0; } // ask for more data cout<<"\nDo you want to enter another sentence? (y/n): "; cin >> response; cin.ignore(numeric_limits<streamsize>::max(),'\n'); // clear the cin buffer }while(toupper(response)=='Y'); cout<<"\nBYE!!\n"; return 0; }// end of main void PrintTableStats(HashTable<string>& hashTable) { int largestBucket = -9999999; int largestIndex = 0; int smallestBucket = 9999999; int smallestIndex = 0; double numBuckestUsed = 0; ofstream outfile("OUTPUT_HashTable_Stats_programmingnotes_freeweq_com.txt"); for(int x=0; x < hashTable.TableSize(); ++x) { // iterator is used to traverse each hashtable bucket iterDec it = hashTable.begin(x); if(!hashTable.IsEmpty(x)) { if(smallestBucket > hashTable.BucketSize(x)) { smallestBucket = hashTable.BucketSize(x); smallestIndex = x; } if(largestBucket < hashTable.BucketSize(x)) { largestBucket = hashTable.BucketSize(x); largestIndex = x; } ++numBuckestUsed; outfile<<"\nBucket #"<<x<<":\n"; for(int y = 0; y < hashTable.BucketSize(x); ++y) { outfile <<"\t"<< it[y] << endl; } } } cout<<"Complete!\n"; // creates a line separator cout<<endl; cout.fill('-'); cout<<left<<setw(50)<<""<<endl; cout<<"Total dictionary words = "<<hashTable.TotalElems()<<endl <<"Hash table size = "<<hashTable.TableSize()<<endl <<"Largest bucket size = "<<largestBucket<< " items at index #"<<largestIndex<<endl <<"Smallest bucket size = "<<smallestBucket<< " items at index #"<<smallestIndex<<endl <<"Total buckets used = "<<numBuckestUsed<<endl <<"Total percent of hash table used = "<<(numBuckestUsed/hashTable.TableSize())*100<<"%"<<endl <<"Average bucket size = "<<(hashTable.TotalElems()/numBuckestUsed)<<" items"; }// end of PrintTableStats int SpellCheck(HashTable<string>& hashTable, string word) { int result = 0; int suggestion = 0; string remove[256]; int numRemove=0; if(!hashTable.Count(word)) { ++result; cout<<"** "<<word<<": "; // alteration & insertion for(unsigned x = 0; x < word.length(); ++x) { string alteration = word; for(char c = 'a'; c <= 'z'; ++c) { //alteration alteration[x] = c; if(hashTable.Count(alteration)) { cout<<alteration<<", "; remove[numRemove++] = alteration; ++suggestion; // remove the entry so it isnt displayed multiple times hashTable.Remove(alteration); } //insertion string insertion = word.substr(0, x) + c + word.substr(x); if(hashTable.Count(insertion)) { cout<<insertion<<", "; remove[numRemove++] = insertion; ++suggestion; // remove the entry so it isnt displayed multiple times hashTable.Remove(insertion); } } } // transposition & deletion for(unsigned x = 0; x < word.length()-1;++x) { // transposition string transposition = word.substr(0,x) + word[x+1] + word[x] + word.substr(x+2); if(hashTable.Count(transposition)) { cout<<transposition<<", "; remove[numRemove++] = transposition; ++suggestion; // remove the entry so it isnt displayed multiple times hashTable.Remove(transposition); } // deletion string deletion = word.substr(0, x)+ word.substr(x + 1); if(hashTable.Count(deletion)) { cout<<deletion<<", "; remove[numRemove++] = deletion; ++suggestion; // remove the entry so it isnt displayed multiple times hashTable.Remove(deletion); } } // place the removed items back inside the hash table while(numRemove>=0) { hashTable.Insert(remove[numRemove--]); } if(suggestion < 1) { cout<<"No spelling suggestion found..."; } cout<<endl<<endl; } return result; }// end of SpellCheck string ToLowerCase(string word) { for(unsigned x = 0; x < word.length(); ++x) { word[x] = tolower(word[x]); } return word; }// http://programmingnotes.org/ |
QUICK NOTES:
The highlighted lines are sections of interest to look out for.
The code is heavily commented, so no further insight is necessary. If you have any questions, feel free to leave a comment below.
Remember to include the data file.
Once compiled, you should get this as your output
Loading dictionary....Complete!
--------------------------------------------------
Total dictionary words = 61286
Hash table size = 19000
Largest bucket size = 13 items at index #1551
Smallest bucket size = 1 items at index #11
Total buckets used = 18217
Total percent of hash table used = 95.8789%
Average bucket size = 3.36422 itemsDictionary loaded in 1.861 secs!
-------------------------------------------------->> Please enter a sentence: wird
** wird: bird, gird, ward, weird, word, wild, wind, wire, wired, wiry,
Number of words spelled incorrectly: 1
Do you want to enter another sentence? (y/n): y
--------------------------------------------------
>> Please enter a sentence: woprd
** woprd: word,
Number of words spelled incorrectly: 1
Do you want to enter another sentence? (y/n): y
--------------------------------------------------
>> Please enter a sentence: wrd
** wrd: ard, ord, ward, wed, word,
Number of words spelled incorrectly: 1
Do you want to enter another sentence? (y/n): y
--------------------------------------------------
>> Please enter a sentence: wrod
** wrod: brod, trod, wood, rod, word,
Number of words spelled incorrectly: 1
Do you want to enter another sentence? (y/n): y
--------------------------------------------------
>> Please enter a sentence: New! Safe and efective
** efective: defective, effective, elective,
Number of words spelled incorrectly: 1
Do you want to enter another sentence? (y/n): y
--------------------------------------------------
>> Please enter a sentence: This is a sentance with no corections gygyuigigigiug
** sentance: sentence,
** corections: corrections,
** gygyuigigigiug: No spelling suggestion found...
Number of words spelled incorrectly: 3
Do you want to enter another sentence? (y/n): n
BYE!!
What code does the header file hashtable.h consists?
The source code link for the hashtable.h class is listed on this current page
Very nicely done, thanks for posting this.
is there i c version of this code?