Actually, the finding is that the interesting errors (long s, other misrecognitions, ...) cluster together and surface at the top of the datasets. It goes far beyond long-s words, though the long-s words are the outstanding case here. More interesting and more complex, several phenomena can act in parallel on a single word.
- Effectively gathers the interesting errors (e.g. long-s words) together.
- More interestingly, these words surface at the top of the datasets.
I use my own online visualization service to show this.
Δ I am thinking: if we know how a word is misrecognized, we will know how to correct it in the datasets. Assuming that one token in a thousand is misrecognized, a trillion-token dataset would already contain a billion errors.
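A minimal sketch of this correction idea (not my actual ranking algorithm, and the vocabulary here is just a toy stand-in): flag a token as a likely long-s error when it is out-of-vocabulary but turning every "f" into "s" yields a vocabulary word.

```python
# Toy reference vocabulary; a real run would use a full word list.
VOCAB = {"these", "such", "some", "most", "must", "first", "should"}

def long_s_candidate(token: str) -> bool:
    """True if the token looks like a long-s OCR error (f for s)."""
    if token in VOCAB:            # already a normal word, nothing to fix
        return False
    fixed = token.replace("f", "s")
    return fixed != token and fixed in VOCAB

print([t for t in ["thefe", "fuch", "first", "normal"] if long_s_candidate(t)])
# → ['thefe', 'fuch']
```

Running the same check over a frequency-sorted token list would, by itself, already pull the "thefe"/"fuch" family to the top.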
Example: Top10 by my algorithm
thefe, fuch, fome, moft, muft, firft, fhould, himfelf, againft, themfelves
Example: Top20 by my algorithm
thefe, fuch, fome, moft, muft, firft, fhould, himfelf, againft, themfelves, faid, alfo, prefent, becaufe, feveral, fay, laft, lefs, ufe
Example: Top30 by my algorithm
thefe, fuch, fome, moft, muft, firft, fhould, himfelf, againft, themfelves, faid, alfo, prefent, becaufe, feveral, fay, laft, lefs, ufe, whofe, fo, ftill, reafon, leaft, fet, fince, juft, beft, development
Here the last word, "development", is a different case.
Example: Scanning issue - l to 1
In the top 100 words, there are two words that do not belong to the long-s words at all: development, normal.
And some long-s words are affected by this as well, e.g. class, solution.
Example: Scanning issue - i to 1
In the top 200 words, there are several words that do not belong to the long-s words at all: economic, anything, everything.
Example: Scanning issue - y to g
In the top 200 words, two words, anything and everything, are also affected by this.
Example: Scanning issue - g to y
In the top 200 words, two words, Figure and groups, are also affected by this. But I am not very sure; this still needs to be verified.
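The scanning issues above can be folded into the same check by trying each confusion pair in turn. This is a hypothetical extension, not my actual algorithm; both the pair list and the vocabulary are assumptions for illustration.

```python
# Toy vocabulary and the (ocr_char, true_char) pairs discussed above.
VOCAB = {"development", "normal", "economic", "anything", "figure", "groups"}
PAIRS = [("f", "s"), ("1", "l"), ("1", "i"), ("y", "g"), ("g", "y")]

def confusion_candidates(token: str) -> list:
    """Return the (ocr_char, true_char) pairs whose substitution turns
    an out-of-vocabulary token into a vocabulary word."""
    low = token.lower()
    if low in VOCAB:
        return []
    hits = []
    for ocr, true in PAIRS:
        fixed = low.replace(ocr, true)
        if fixed != low and fixed in VOCAB:
            hits.append((ocr, true))
    return hits

print(confusion_candidates("deve1opment"))  # l misread as 1
print(confusion_candidates("anyth1ng"))     # i misread as 1
print(confusion_candidates("Fiyure"))       # g misread as y
```

One pair per token is obviously too weak for the "several phenomena on one word" cases noted at the top; handling those would need substitutions applied in combination.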