Apple detailed its real-time machine learning engine for recognizing handwritten Chinese characters, supporting an inventory of up to 30,000 characters. The model reportedly degrades in accuracy only asymptotically as the character pool grows: it recognizes characters from large sets such as GB18030-2005 with only slightly lower accuracy than on smaller sets such as GB2312-80.
The Chinese national standard GB18030-2005 contains 27,533 entries, which has made keyboard input challenging over the years, so a translator from handwriting to codified text is especially valuable for Chinese-speaking populations. Several versions of Chinese character sets have been adopted over the years to address variation in frequently used characters across time and geography. The large corpus of potential character values, variation in writing methods, and the nature of each person's unique handwriting style combine to make this a challenging machine learning problem.
Convolutional neural networks (CNNs) are typically used for machine learning problems focused on image recognition and labeling. Earlier research methods outlined in the article evolved through a series of modeling approaches, with stroke order playing a significant part: it was used to subset the pool of candidate characters into smaller groups, in the hope of improving the odds of finding a match.
While early recognition algorithms relied mainly on structural methods based on individual stroke analysis, the need for stroke-order independence later sparked interest in statistical methods using holistic shape information. This complicates large-inventory recognition, as correct character classification tends to get harder as the number of categories to disambiguate grows.
A larger pool of characters exposed underlying problems with the stroke-order-based approach. Ambiguous handwriting styles, along with complexity and computational overhead that grow with the number of strokes per character, led Apple researchers to a more "shape-driven" approach, agnostic of stroke order.
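To make the older pre-filtering idea concrete, here is a minimal sketch (not Apple's implementation; the tiny inventory and stroke counts below are purely illustrative) of bucketing characters by stroke count to shrink the candidate pool before matching:

```python
# Illustrative sketch: stroke-count pre-filtering, as earlier
# stroke-order-based recognizers used. The inventory is a toy example.
from collections import defaultdict

# (character, stroke count) pairs -- a hypothetical, tiny inventory
inventory = [("王", 4), ("五", 4), ("二", 2), ("十", 2), ("口", 3)]

buckets = defaultdict(list)
for char, strokes in inventory:
    buckets[strokes].append(char)

def candidates(observed_strokes, tolerance=1):
    """Return characters whose stroke count is within `tolerance` of
    the observed count -- sloppy handwriting can merge or split
    strokes, so an exact match would be too brittle."""
    result = []
    for s in range(observed_strokes - tolerance, observed_strokes + tolerance + 1):
        result.extend(buckets.get(s, []))
    return result

print(candidates(4))  # ['口', '王', '五']
```

The tolerance parameter hints at why this approach breaks down: cursive writing blurs stroke boundaries, so the filter must widen until it stops filtering much at all.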
The approach Apple employed is similar to what works well for Latin-script recognizers built on MNIST, where CNNs became the industry standard. But the scalability of a real-time CNN to 30,000 or more characters made this challenge different, and collisions and ambiguities between character inventories added further complexity.
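A back-of-the-envelope calculation shows one reason scale matters: the output layer alone grows linearly with the inventory. The numbers below are illustrative assumptions, not Apple's architecture (256 is an assumed penultimate-layer width; 3,755 is the GB2312-80 Level 1 character count):

```python
# Sketch: parameter count of a final fully connected softmax layer.
feature_dim = 256           # assumed penultimate feature width
num_classes_small = 3_755   # GB2312-80 Level 1 characters
num_classes_large = 30_000  # extended inventory from the article

def softmax_params(features, classes):
    # one weight vector plus one bias per output class
    return features * classes + classes

small = softmax_params(feature_dim, num_classes_small)
large = softmax_params(feature_dim, num_classes_large)
print(small, large)  # 965035 7710000
```

Under these assumptions, the classification layer grows roughly eightfold, which is part of why a compact network that still runs in real time on-device is hard to achieve.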
Speedy input tends to drive writers toward cursive styles, which increases ambiguity, e.g. between U+738B (王) and U+4E94 (五). Finally, increased internationalization sometimes introduces unexpected collisions: for example, U+4E8C (二), when cursively written, may conflict with the Latin characters “2” and “Z”.
Each handwritten input is reduced to a 48 x 48 pixel image representing the original character, which is then fed into the feed-forward convolutional network. This preprocessing step minimizes the overall size of the CNN needed to process an image. The finite number of pixels, and of possible values for those pixels, places an upper bound on model complexity and yields a reliable coarse representation of the input character that can be run through a trained network on peripheral devices like the Apple Watch.
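A minimal sketch of that preprocessing step might look like the following. This is an assumed pipeline, not Apple's code: real systems also resample, smooth, and anti-alias the trajectory, while this version only scales raw stroke points into a 48 x 48 binary grid:

```python
# Sketch: normalize a stroke trajectory into a 48x48 bitmap.
SIZE = 48

def rasterize(strokes):
    """strokes: list of strokes, each a list of (x, y) points in
    arbitrary device coordinates. Returns a SIZE x SIZE 0/1 grid."""
    points = [p for stroke in strokes for p in stroke]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # uniform scale preserves aspect ratio; guard against zero span
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    grid = [[0] * SIZE for _ in range(SIZE)]
    for x, y in points:
        col = min(int((x - min_x) / span * (SIZE - 1)), SIZE - 1)
        row = min(int((y - min_y) / span * (SIZE - 1)), SIZE - 1)
        grid[row][col] = 1
    return grid

# A single diagonal stroke lights up pixels along the main diagonal.
img = rasterize([[(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]])
print(img[0][0], img[47][47])  # 1 1
```

Because the grid is fixed at 48 x 48 regardless of how large or fast the user writes, every downstream layer sees input of constant size, which is what bounds the model's complexity.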
The training data set consisted of tens of millions of handwritten characters collected from a wide range of geographies and demographics across Chinese-speaking communities. Researchers noted that the resulting accuracy should constitute good-enough performance for commercial use.