C++ || How To Create A Simple Predictive Spell Checker Using C++
The following is a module with functions which demonstrates how to create a simple predictive spell checker using C++.
The module demonstrated on this page is an implementation based on the Norvig Spelling Corrector.
1. Overview
In its simplest form, this process works by first telling the spelling corrector the valid words in our language. This is called “training” the corrector. Once the corrector is trained to know the valid words in our language, it tries to choose the word that is the most likely spelling correction.
This process is done by choosing the candidate with the highest probability to show up in our language. For example, should “lates” be corrected to “late” or “latest” or “lattes” or …? This choice is determined by the word which has the highest probability to appear in our language, according to the training text we provide.
Traditionally, spell checkers look for four possible errors: a wrong letter (“wird”), also knows as alteration. An inserted letter (“woprd”), a deleted letter (“wrd”), or a pair of adjacent transposed letters (“wrod”). This process is used when generating probable spelling correction candidates.
The training text used in the example on this page can be downloaded here. This file consists of 80891 distinct words, which is used to train the spelling corrector.
2. Spell Check – Basic Usage
The example below demonstrates the use of the ‘SpellChecker‘ class to load the training file and correct the spelling of a word.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
// Spell Check - Basic Usage // Declare spell checker SpellChecker spellchecker; // Declare training text path std::string path = "INPUT_Dictionary_programmingnotes_org.txt"; // Add training text from a file spellchecker.trainFromFile(path); // Optional: Add training text from an std::string //spellchecker.train(); // Correct a word and display results std::cout << spellchecker.correct("KENNtH"); // example output: /* KENNetH */ |
3. Spell Check – Unit Tests
The example below demonstrates the use of the ‘SpellChecker‘ class to load the training file and correct the spelling of a word.
In this example, words are added to a set to determine if the most likely spelling correction is returned by the spelling corrector. The first word in the set is the word to test against (some are spelled correctly, some incorrectly). The second word in the set is the expected result that should be returned by the spelling corrector.
If no spelling suggestion is found, an empty string is returned by the spelling corrector. This is true for one of the words in the test cases below. Even though the word is spelled correctly, it is not defined in our training text, and no correction can be found.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |
// Spell Check - Unit Tests // Declare spell checker SpellChecker spellchecker; // Declare training text path std::string path = "INPUT_Dictionary_programmingnotes_org.txt"; // Add training text from a file spellchecker.trainFromFile(path); // Declare words to check spelling. // The first word in the set is the current spelling, the second word // is the expected corrected spell checker result std::pair<std::string, std::string> cases[] = { {"KENNtH", "KENNetH"}, {"Jennierr", "Jennifer"}, {"LYnNn", "Lynn"}, {"Soole", "Sole"}, {"speling", "spelling"}, {"korrectud", "corrected"}, {"bycycle", "bicycle"}, {"inconvient", "inconvenient"}, {"arrainged", "arranged"}, {"peotry", "poetry"}, {"peotryy", "poetry"}, {"word", "word"}, {"quintessential", "quintessential"}, {"transportibility", "transportability"}, {"addresable", "addressable"}, {"auxiliaryy", "auxiliary"}, {"WirD", "WorD"}, {"prplee", "purple"}, {"Succesfuil", "Successful"}, {"AMEIRICUA", "AMERICA"}, {"Langauege", "Language"} }; // Correct the words in the test cases for (const auto& pair : cases) { // Correct word auto correction = spellchecker.correct(pair.first); // Check to see if correction matches the expected result bool casePassed = (strcasecmp(correction.c_str(), pair.second.c_str()) == 0); // Display results if (casePassed) { std::cout << "Passed - Original: " + pair.first + ", Correction: " + correction << std::endl; } else { std::cout << " *** Failed - Original: " + pair.first + ", Correction: " + correction + ", Expected: " + pair.second << std::endl; } } // example output: /* Passed - Original: KENNtH, Correction: KENNetH Passed - Original: Jennierr, Correction: Jennifer Passed - Original: LYnNn, Correction: LYNn Passed - Original: Soole, Correction: Sole Passed - Original: speling, Correction: spelling Passed - Original: korrectud, Correction: corrected Passed - Original: bycycle, Correction: bicycle Passed - Original: inconvient, Correction: inconvenient Passed - Original: arrainged, Correction: arranged Passed - Original: peotry, Correction: poetry Passed - Original: peotryy, Correction: poetry Passed - Original: word, Correction: word *** Failed - Original: quintessential, Correction: , Expected: quintessential Passed - Original: transportibility, Correction: transportability Passed - Original: addresable, Correction: addressable Passed - Original: auxiliaryy, Correction: auxiliary Passed - Original: WirD, Correction: WorD Passed - Original: prplee, Correction: purple Passed - Original: Succesfuil, Correction: Successful Passed - Original: AMEIRICUA, Correction: AMERICA Passed - Original: Langauege, Correction: Language */ |
4. SpellChecker Class
The following is the SpellChecker Class. Include this in your project to start using!
The following is the header file ‘SpellChecker.h‘.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 |
// ============================================================================ // Author: Kenneth Perkins // Date: Dec 18, 2020 // Taken From: http://programmingnotes.org/ // File: SpellChecker.h // Description: Header file for a predictive spell checker class that // generates spelling corrections // ============================================================================ #pragma once #include <string> #include <unordered_set> #include <utility> #include <unordered_map> #include <cctype> #ifdef _MSC_VER #include <string.h> #define strncasecmp _strnicmp #define strcasecmp _stricmp #else #include <strings.h> #endif class SpellChecker { public: SpellChecker() {} /** * FUNCTION: correct * USE: Corrects the spelling of a word. If the word is spelled incorrectly, * a spelling suggestion is returned. If it is spelled correctly, the * initial word is returned. If no spelling suggestion is found, an empty * string is returned * @param word: The word to correct * @return: The spelling correction of the word */ std::string correct(const std::string& word); /** * FUNCTION: train * USE: Adds training text to the spell checker * @param text: The training text * @return: N/A */ void train(const std::string& text); /** * FUNCTION: trainFromFile * USE: Reads a file containing training text and adds it to the spell checker * @param path: The path of the file containing training text * @return: N/A */ void trainFromFile(const std::string& path); /** * FUNCTION: exists * USE: Determines if a word is a valid word in the spell checker * @param word: The word to test against * @return: True if the word exists in the spell checker, false otherwise */ bool exists(const std::string& word); ~SpellChecker() { NWORDS.clear(); } private: // Makes map keys case insensitive struct case_insensitive { struct comp { bool operator() (const std::string& lhs, const std::string& rhs) const { return strcasecmp(lhs.c_str(), rhs.c_str()) == 0; } }; struct hash { std::size_t operator() (std::string str) const { for (std::size_t index = 0; index < str.size(); ++index) { auto ch = static_cast<unsigned char>(str[index]); str[index] = static_cast<unsigned char>(std::tolower(ch)); } return std::hash<std::string>{}(str); } }; }; typedef std::unordered_map<std::string, unsigned , case_insensitive::hash, case_insensitive::comp> dictionary; std::unordered_set<std::string> edit(const std::unordered_set<std::string>& words); std::unordered_set<std::string> known(const std::unordered_set<std::string>& words); std::string max(const std::unordered_set<std::string>& candidates, const std::string& word); std::unordered_set<std::string> getCandidates(const std::string& word); dictionary NWORDS; const std::string alphabet = "abcdefghijklmnopqrstuvwxyz"; };// http://programmingnotes.org/ |
The following is the implementation file ‘SpellChecker.cpp‘.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 |
// ============================================================================ // Author: Kenneth Perkins // Date: Dec 18, 2020 // Taken From: http://programmingnotes.org/ // File: SpellChecker.cpp // Description: Implementation file for a predictive spell checker class // that generates spelling corrections // ============================================================================ #include "SpellChecker.h" #include <string> #include <sstream> #include <unordered_set> #include <fstream> #include <stdexcept> #include <algorithm> #include <cmath> #include <cctype> void clean(std::string& word) { auto isInvalidChar = [](unsigned char c) { return std::ispunct(c) && c != '\'' && c != '-'; }; for (int index = (int)word.size() - 1; index >= 0; --index) { if (isInvalidChar(word[index])) { word.erase(index, 1); } } } std::string SpellChecker::correct(const std::string& word) { std::string suggestion = ""; // Get spelling suggestions for the word auto candidates = getCandidates(word); if (!candidates.empty()) { suggestion = max(candidates, word); } return suggestion; } void SpellChecker::trainFromFile(const std::string& path) { std::ifstream infile; infile.open(path); if (infile.fail()) { throw std::invalid_argument{ "Unable to read file at path: '" + path + "'" }; } infile.seekg(0, std::ios::end); auto size = infile.tellg(); std::string contents((std::size_t)size, ' '); infile.seekg(0); infile.read(&contents[0], size); infile.close(); train(contents); } void SpellChecker::train(const std::string& text) { std::stringstream strStream(text); while (strStream) { std::string word; strStream >> word; clean(word); if (word.empty()) { continue; } NWORDS[word] += 1; } } std::unordered_set<std::string> SpellChecker::getCandidates(const std::string& word) { auto candidates = known({ word }); // Edits 1 std::unordered_set<std::string> firstEdit; if (candidates.empty()) { firstEdit = edit({ word }); candidates = known(firstEdit); } // Edits 2 if (candidates.empty()) { candidates = known(edit(firstEdit)); } return candidates; } std::string SpellChecker::max(const std::unordered_set<std::string>& candidates, const std::string& word) { std::string maxKey; const double nullValue = -1987199120102019; auto maxValue = nullValue; for (const auto& candidate : candidates) { int currentValue = NWORDS[candidate]; // Add a penalty to the candidate based on the length change auto lengthChange = std::abs((int)(candidate.size() - word.size())); currentValue -= lengthChange; if (maxValue == nullValue || currentValue > maxValue) { maxKey = candidate; maxValue = currentValue; } } return maxKey; } bool SpellChecker::exists(const std::string& word) { return !NWORDS.empty() && NWORDS.find(word) != NWORDS.end(); } std::unordered_set<std::string> SpellChecker::edit(const std::unordered_set<std::string>& words) { std::unordered_set<std::string> edits; for (const auto& word : words) { for (std::size_t index = 0; index <= word.size(); ++index) { auto a = word.substr(0, index); auto b = word.substr(index); auto c = b.size() > 1 ? b.substr(1) : ""; auto d = b.size() > 2 ? b.substr(2) : ""; if (!b.empty()) { // Deletes edits.insert(a + c); // Transposes if (b.size() > 1) { edits.insert(a + b[1] + b[0] + d); } // Alteration & Inserts for (const auto& letter : alphabet) { // Alteration edits.insert(a + letter + c); // Inserts edits.insert(a + letter + b); } } else { // Inserts (remaining set at the end) for (const auto& letter : alphabet) { edits.insert(a + letter); } } } } return edits; } std::unordered_set<std::string> SpellChecker::known(const std::unordered_set<std::string>& words) { std::unordered_set<std::string> set; for (const auto& word : words) { if (exists(word)) { set.insert(word); } } return set; }// http://programmingnotes.org/ |
5. More Examples
Below are more examples demonstrating the use of the ‘SpellChecker‘ class. Don’t forget to include the module when running the examples!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 |
// ============================================================================ // Author: Kenneth Perkins // Date: Dec 18, 2020 // Taken From: http://programmingnotes.org/ // File: program.cpp // Description: The following demonstrates the use of the SpellChecker class // ============================================================================ #include <iostream> #include <string> #include <exception> #include <utility> #include "SpellChecker.h" void display(const std::string& message); int main() { try { // Declare spell checker SpellChecker spellchecker; // Declare training text path std::string path = "INPUT_Dictionary_programmingnotes_org.txt"; // Add training text from a file spellchecker.trainFromFile(path); // Optional: Add training text from an std::string //spellchecker.train(); // Correct a word and display results display(spellchecker.correct("KENNtH")); display(""); // Declare words to check spelling. // The first word in the set is the current spelling, the second word // is the expected corrected spell checker result std::pair<std::string, std::string> cases[] = { {"KENNtH", "KENNetH"}, {"Jennierr", "Jennifer"}, {"LYnNn", "Lynn"}, {"Soole", "Sole"}, {"speling", "spelling"}, {"korrectud", "corrected"}, {"bycycle", "bicycle"}, {"inconvient", "inconvenient"}, {"arrainged", "arranged"}, {"peotry", "poetry"}, {"peotryy", "poetry"}, {"word", "word"}, {"quintessential", "quintessential"}, {"transportibility", "transportability"}, {"addresable", "addressable"}, {"auxiliaryy", "auxiliary"}, {"WirD", "WorD"}, {"prplee", "purple"}, {"Succesfuil", "Successful"}, {"AMEIRICUA", "AMERICA"}, {"Langauege", "Language"} }; // Correct the words in the test cases for (const auto& pair : cases) { // Correct word auto correction = spellchecker.correct(pair.first); // Check to see if correction matches the expected result bool casePassed = (strcasecmp(correction.c_str(), pair.second.c_str()) == 0); // Display results if (casePassed) { display("Passed - Original: " + pair.first + ", Correction: " + correction); } else { display(" *** Failed - Original: " + pair.first + ", Correction: " + correction + ", Expected: " + pair.second); } } } catch (std::exception& e) { display("\nAn error occurred: " + std::string(e.what())); } std::cin.get(); return 0; } void display(const std::string& message) { std::cout << message << std::endl; }// http://programmingnotes.org/ |
QUICK NOTES:
The highlighted lines are sections of interest to look out for.
The code is heavily commented, so no further insight is necessary. If you have any questions, feel free to leave a comment below.
Remember to include the training file.
Leave a Reply