Actually, the finding is that the interesting errors (long s, other misrecognitions, ...) cluster together and surface at the top of the datasets. It goes far beyond long-s words, though the long-s words are the outstanding case here. More interesting and more complex, several phenomena can act in parallel on a single word.
- Effectively gathers the interesting errors (e.g. long-s words) together.
- More interestingly, these words surface at the top of the datasets.
I use my own online visualization service to show this.
Δ I am thinking: if we know how a word is misrecognized, we will know how to correct it in the datasets. Assuming that one token in a thousand is misrecognized, a trillion-token dataset would already contain a billion errors.
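A minimal sketch of this correction idea (not my actual ranking algorithm, and the vocabulary here is just a toy stand-in): flag a token as a likely long-s error when it is out-of-vocabulary but turning every "f" into "s" yields a vocabulary word.

```python
# Toy reference vocabulary; a real run would use a full word list.
VOCAB = {"these", "such", "some", "most", "must", "first", "should"}

def long_s_candidate(token: str) -> bool:
    """True if the token looks like a long-s OCR error (f for s)."""
    if token in VOCAB:            # already a normal word, nothing to fix
        return False
    fixed = token.replace("f", "s")
    return fixed != token and fixed in VOCAB

print([t for t in ["thefe", "fuch", "first", "normal"] if long_s_candidate(t)])
# → ['thefe', 'fuch']
```

Running the same check over a frequency-sorted token list would, by itself, already pull the "thefe"/"fuch" family to the top.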
Example: Top10 by my algorithm
thefe, fuch, fome, moft, muft, firft, fhould, himfelf, againft, themfelves
Example: Top20 by my algorithm
thefe, fuch, fome, moft, muft, firft, fhould, himfelf, againft, themfelves, faid, alfo, prefent, becaufe, feveral, fay, laft, lefs, ufe
Example: Top30 by my algorithm
thefe, fuch, fome, moft, muft, firft, fhould, himfelf, againft, themfelves, faid, alfo, prefent, becaufe, feveral, fay, laft, lefs, ufe, whofe, fo, ftill, reafon, leaft, fet, fince, juft, beft, development
Here the last word, "development", is a different case.
Example: Scanning issue - l to 1
In the top 100 words, there are two words that do not belong to the long-s words at all: development, normal.
And some long-s words are affected by this as well, e.g. class, solution.
Example: Scanning issue - i to 1
In the top 200 words, there are several words that do not belong to the long-s words at all: economic, anything, everything.
Example: Scanning issue - y to g
In the top 200 words, two words, anything and everything, are also affected by this.
Example: Scanning issue - g to y
In the top 200 words, two words, Figure and groups, are also affected by this. But I am not very sure; this still needs to be verified.
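The scanning issues above can be folded into the same check by trying each confusion pair in turn. This is a hypothetical extension, not my actual algorithm; both the pair list and the vocabulary are assumptions for illustration.

```python
# Toy vocabulary and the (ocr_char, true_char) pairs discussed above.
VOCAB = {"development", "normal", "economic", "anything", "figure", "groups"}
PAIRS = [("f", "s"), ("1", "l"), ("1", "i"), ("y", "g"), ("g", "y")]

def confusion_candidates(token: str) -> list:
    """Return the (ocr_char, true_char) pairs whose substitution turns
    an out-of-vocabulary token into a vocabulary word."""
    low = token.lower()
    if low in VOCAB:
        return []
    hits = []
    for ocr, true in PAIRS:
        fixed = low.replace(ocr, true)
        if fixed != low and fixed in VOCAB:
            hits.append((ocr, true))
    return hits

print(confusion_candidates("deve1opment"))  # l misread as 1
print(confusion_candidates("anyth1ng"))     # i misread as 1
print(confusion_candidates("Fiyure"))       # g misread as y
```

One pair per token is obviously too weak for the "several phenomena on one word" cases noted at the top; handling those would need substitutions applied in combination.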