用 Python 实现英文单词纠错功能！这样就不担心不会辅导孩子了

单词纠错

在我们平时使用Word或者其他文字编辑软件的时候，常常会遇到单词纠错的功能。比如在Word中：

单词拼写错误

单词“Chinab”有红色的下划线，说明该单词拼写有误，在“拼写检查”中，给出了几个可能的单词来帮助用户纠正拼写。那么，我们是否能够自己来实现这个功能呢？ Why not? 关于单词纠错的思路，可以参考Peter Norvig的鼎鼎大名的网站：http://norvig.com/spell-correct.html 。主要涉及到的相关概念为字符串的编辑距离，读者可以参考文章：用动态规划算法计算字符串的编辑距离

私信小编01 领取完整代码！

单词纠错算法

首先，我们需要一个语料库，基本上所有的NLP任务都会有语料库。单词纠错的语料库为bit.txt，里面包含的内容如下：

Gutenberg语料库数据;
维基词典;
英国国家语料库中的最常用单词列表。

下载的网址为：https://github.com/percent4/-word- 。接着，我们取出里面的所有英语单词，并统计其出现次数。对于一个给定的英语单词（不管其是否拼写有误），依次找到和它编辑距离为0,1,2的单词，这些单词的优先顺序为编辑距离为0的单词（即该单词本身） > 编辑距离为1的单词 > 编辑距离为2的单词。最后按照这些单词是否在语料库中出现及单词的优先顺序及在语料库中的出现次数考虑，考虑的顺序为：是否在语料库中出现，单词的优先顺序，在语料库中的出现次数，最后选取在预料库中出现，优先顺序最高，在语料库中出现次数最多的单词作为该单词的纠正结果。当然，也有可能是它本身，即单词正确。

Python实现

实现单词纠错的完整Python代码（spelling_correcter.py）如下：

# -*- coding: utf-8 -*-import re, collectionsdef tokens(text): ''' Get all words from the corpus ''' return re.findall('[a-z]+', text.lower())with open('E://big.txt', 'r') as f: WORDS = tokens(f.read())WORD_COUNTS = collections.Counter(WORDS)def known(words): ''' Return the subset of words that are actually in our WORD_COUNTS dictionary. ''' return {w for w in words if w in WORD_COUNTS}def edits0(word): ''' Return all strings that are zero edits away from the input word (i.e., the word itself). ''' return {word}def edits1(word): ''' Return all strings that are one edit away from the input word. ''' alphabet = 'abcdefghijklmnopqrstuvwxyz' def splits(word): ''' Return a list of all possible (first, rest) pairs that the input word is made of. ''' return [(word[:i], word[i:]) for i in range(len(word) + 1)] pairs = splits(word) deletes = [a + b[1:] for (a, b) in pairs if b] transposes = [a + b[1] + b[0] + b[2:] for (a, b) in pairs if len(b) > 1] replaces = [a + c + b[1:] for (a, b) in pairs for c in alphabet if b] inserts = [a + c + b for (a, b) in pairs for c in alphabet] return set(deletes + transposes + replaces + inserts)def edits2(word): ''' Return all strings that are two edits away from the input word. ''' return {e2 for e1 in edits1(word) for e2 in edits1(e1)}def correct(word): ''' Get the best correct spelling for the input word ''' # Priority is for edit distance 0, then 1, then 2 # else defaults to the input word itself. candidates = (known(edits0(word)) or known(edits1(word)) or known(edits2(word)) or [word]) return max(candidates, key=WORD_COUNTS.get)def correct_match(match): ''' Spell-correct word in match, and preserve proper upper/lower/title case. ''' word = match.group() def case_of(text): ''' Return the case-function appropriate for text: upper, lower, title, or just str.: ''' return (str.upper if text.isupper() else str.lower if text.islower() else str.title if text.istitle() else str) return case_of(word)(correct(word.lower()))def correct_text_generic(text): ''' Correct all the words within a text, returning the corrected text. ''' return re.sub('[a-zA-Z]+', correct_match, text)

测试

有了上述的单词纠错程序，接下来我们对一些单词或句子做测试。如下：

original_word_list = ['fianlly', 'castel', 'case', 'monutaiyn', 'foresta', \ 'helloa', 'forteen', 'persreve', 'kisss', 'forteen helloa', \ 'phons forteen Doora. This is from Chinab.']for original_word in original_word_list: correct_word = correct_text_generic(original_word) print('Orginial word: %s\nCorrect word: %s'%(original_word, correct_word))

输出结果如下：

Orginial word: fianllyCorrect word: finallyOrginial word: castelCorrect word: castleOrginial word: caseCorrect word: caseOrginial word: monutaiynCorrect word: mountainOrginial word: forestaCorrect word: forestOrginial word: helloaCorrect word: helloOrginial word: forteenCorrect word: fourteenOrginial word: persreveCorrect word: preserveOrginial word: kisssCorrect word: kissOrginial word: forteen helloaCorrect word: fourteen helloOrginial word: phons forteen Doora. This is from Chinab.Correct word: peons fourteen Door. This is from China.

接着，我们对如下的Word文档（Spelling Error.docx）进行测试（下载地址为：https://github.com/percent4/-word-），

有单词错误的Word文档

对该文档进行单词纠错的Python代码如下：

from docx import Documentfrom nltk import sent_tokenize, word_tokenizefrom spelling_correcter import correct_text_genericfrom docx.shared import RGBColor# 文档中修改的单词个数COUNT_CORRECT = 0#获取文档对象file = Document('E://Spelling Error.docx')#print('段落数:'+str(len(file.paragraphs)))punkt_list = r',.?\''!()/\\-<>:@#$%^&*~'document = Document() # word文档句柄def write_correct_paragraph(i): global COUNT_CORRECT # 每一段的内容 paragraph = file.paragraphs[i].text.strip() # 进行句子划分 sentences = sent_tokenize(text=paragraph) # 词语划分 words_list = [word_tokenize(sentence) for sentence in sentences] p = document.add_paragraph(' '*7) # 段落句柄 for word_list in words_list: for word in word_list: # 每一句话第一个单词的第一个字母大写，并空两格 if word_list.index(word) == 0 and words_list.index(word_list) == 0: if word not in punkt_list: p.add_run(' ') # 修改单词，如果单词正确，则返回原单词 correct_word = correct_text_generic(word) # 如果该单词有修改，则颜色为红色 if correct_word != word: colored_word = p.add_run(correct_word[0].upper()+correct_word[1:]) font = colored_word.font font.color.rgb = RGBColor(0x00, 0x00, 0xFF) COUNT_CORRECT += 1 else: p.add_run(correct_word[0].upper() + correct_word[1:]) else: p.add_run(word) else: p.add_run(' ') # 修改单词，如果单词正确，则返回原单词 correct_word = correct_text_generic(word) if word not in punkt_list: # 如果该单词有修改，则颜色为红色 if correct_word != word: colored_word = p.add_run(correct_word) font = colored_word.font font.color.rgb = RGBColor(0xFF, 0x00, 0x00) COUNT_CORRECT += 1 else: p.add_run(correct_word) else: p.add_run(word)for i in range(len(file.paragraphs)): write_correct_paragraph(i)document.save('E://correct_document.docx')print('修改并保存文件完毕！')print('一共修改了%d处。'%COUNT_CORRECT)

输出的结果如下：

修改并保存文件完毕！一共修改了19处。

修改后的Word文档如下：

单词纠错后的Word文档

其中的红色字体部分为原先的单词有拼写错误，进行拼写纠错后的单词，一共修改了19处。

总结

单词纠错实现起来并没有想象中的那么难，但也不是那么容易~

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。