Japanese Text Recognition from Images Using Tesseract OCR [macOS Edition]
Tadashi Shigeoka · Thu, March 23, 2023
This post shows how to recognize Japanese text in images on macOS using Tesseract OCR, an open-source (OSS) tool.
Background: OSS Tool with Japanese OCR Support
While searching for an OSS tool with Japanese OCR support, I read the article 第577回 Tesseract OCRで文字認識をする | gihyo.jp and found that Tesseract OCR looked promising, so I tried it.
Initial Setup for Tesseract
Setup takes two steps, in order: install Tesseract, then download the Japanese trained model files.
Installing Tesseract
brew install tesseract
Download Japanese Trained Model Files
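# $(brew --prefix) is /opt/homebrew on Apple Silicon and /usr/local on Intel Macs
cd "$(brew --prefix)/share/tessdata/"
# wget is not bundled with macOS; install it with "brew install wget", or use "curl -LO <url>" instead
wget https://github.com/tesseract-ocr/tessdata/raw/main/jpn.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/main/jpn_vert.traineddata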
- Download source: tesseract-ocr/tessdata: Trained models with fast variant of the "best" LSTM models + legacy models
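Before running OCR, it can help to confirm that Tesseract actually sees the downloaded models. The following is a minimal sketch, not part of the original setup: the function names `has_tess_lang` and `check_jpn_models` are hypothetical helpers made up for illustration, and the only Tesseract feature they rely on is the `--list-langs` option.

```shell
# Hypothetical helper: succeed if language code $1 is listed by tesseract.
has_tess_lang() {
  # grep -x matches the whole line, so "jpn" does not match "jpn_vert"
  tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Hypothetical helper: report whether both Japanese models are available.
check_jpn_models() {
  for lang in jpn jpn_vert; do
    if has_tess_lang "$lang"; then
      echo "$lang: installed"
    else
      echo "$lang: missing"
    fi
  done
}
```

After defining these in the shell, run `check_jpn_models`. If either model is reported as missing, the `.traineddata` file probably did not land in the tessdata directory that Tesseract reads.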
Japanese OCR with Tesseract
tesseract target.png - -l jpn
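In the command above, the `-` in the output position sends the recognized text to standard output. To save the text to a file or to read vertical writing, a small wrapper like the one below might be convenient. This is only a sketch under assumptions: `ocr_jpn` is a hypothetical function name, and pairing `jpn_vert` with `--psm 5` (a single uniform block of vertically aligned text) is a commonly used combination rather than something tested in this post.

```shell
# Hypothetical wrapper: run Japanese OCR on an image and save the text to a file.
# Usage: ocr_jpn <image> [horizontal|vertical]
ocr_jpn() {
  image="$1"
  mode="${2:-horizontal}"
  # Strip the file extension; tesseract appends ".txt" to this output base
  out="${image%.*}"
  if [ "$mode" = "vertical" ]; then
    # Vertical text: use the jpn_vert model with page segmentation mode 5
    tesseract "$image" "$out" -l jpn_vert --psm 5
  else
    # Horizontal text: same model as the command above, but written to a file
    tesseract "$image" "$out" -l jpn
  fi
}
```

For example, `ocr_jpn target.png` would write the result to `target.txt`, and `ocr_jpn target.png vertical` would do the same with the vertical model.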
That’s all from the Gemba, where I recognized Japanese text from images using Tesseract OCR.