Tesseract OCR is an open-source optical character recognition engine that converts text from images into machine-readable formats. This page provides a high-level overview of the Tesseract OCR system, its architecture, and main components. For information about building and installing Tesseract, see Building and Deployment, and for details about its API, see API Reference.
Tesseract consists of two main components: libtesseract, the OCR engine library, and tesseract, the command-line program.
Tesseract was originally developed at Hewlett-Packard Laboratories between 1985 and 1994, with additional development in 1996 and 1998. HP open-sourced it in 2005, and Google took over development from 2006 until August 2017. Stefan Weil is the current lead developer, with Zdenko Podobny as maintainer.
The current version is 5.5.2 (as of the VERSION file). Major version 5 was released on November 30, 2021, introducing significant improvements to the LSTM neural network engine while maintaining backward compatibility with the legacy engine through OEM_TESSERACT_ONLY mode.
Sources: README.md:46-60, ChangeLog:1-7, VERSION:1-2
Sources: README.md:34-44
Tesseract follows a modular architecture with clear separation of concerns between components.
The following diagram shows the complete architecture of the Tesseract OCR system, with major components and their relationships:
Sources: CMakeLists.txt:1-100, configure.ac:1-100, Makefile.am:1-200, sw.cpp:1-100, src/api/baseapi.cpp:1-200, src/api/capi.cpp:1-100, src/ccmain/tesseractclass.cpp:1-150, src/ccmain/thresholder.cpp:1-100, src/lstm/lstmrecognizer.cpp:1-100, src/classify/adaptmatch.cpp:1-100, src/ccmain/pagesegmain.cpp:1-100, src/textord/linefind.cpp:1-50, src/dict/dict.cpp:1-50, src/ccutil/tessdatamanager.cpp:1-100, src/api/renderer.cpp:1-100, src/arch/simddetect.cpp:1-150
The following diagram shows the complete processing flow from input image to recognized text output, mapping the logical stages to actual code entry points:
Sources: src/api/baseapi.cpp:500-536, src/api/baseapi.cpp:1455-1494, src/ccmain/thresholder.cpp:159-287, src/ccmain/pagesegmain.cpp:1-150, src/textord/linefind.cpp:1-100, src/textord/imagefind.cpp:1-100, src/textord/colfind.cpp:1-150, src/textord/tordmain.cpp:1-100, src/ccmain/linerec.cpp:1-100, src/ccmain/control.cpp:1-150, src/lstm/lstmrecognizer.cpp:424-521, src/lstm/recodebeam.cpp:179-361, src/classify/adaptmatch.cpp:95-207, src/dict/dict.cpp:288-371, src/api/hocrrenderer.cpp:1-100, src/api/pdfrenderer.cpp:451-731, src/api/altorenderer.cpp:1-100
The API layer provides interfaces for applications to interact with the Tesseract OCR engine:
- C++ API (TessBaseAPI): include/tesseract/baseapi.h
- C API: include/tesseract/capi.h and src/api/capi.cpp

Sources: include/tesseract/baseapi.h:76-150, include/tesseract/pageiterator.h:60-120, include/tesseract/resultiterator.h:40-80, include/tesseract/renderer.h:47-90, src/api/baseapi.cpp:183-200
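The Tesseract class is the core engine that implements the OCR functionality. It sits at the top of an inheritance chain in which each base class contributes a different capability: Tesseract derives from Wordrec, which derives from Classify, which derives from CCStruct, which in turn derives from CCUtil.
Sources: src/ccmain/tesseractclass.h:178-317, src/ccmain/tesseractclass.cpp:53-150, src/ccutil/ccutil.h:50-80, src/classify/classify.h:100-150, src/wordrec/wordrec.h:80-120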
Tesseract has two OCR engines: the LSTM engine and the legacy engine.
The LSTM engine uses recurrent neural networks, specifically Long Short-Term Memory networks, to recognize text. It processes text as sequences and is language-independent.
Sources: src/lstm/lstmrecognizer.cpp:50-138, src/lstm/lstmrecognizer.h:51-100
The legacy engine uses a feature-based approach with pattern matching and adaptive classification to recognize text.
Image processing in Tesseract is handled primarily by the ImageThresholder class, which is responsible for converting the input image into the binary image used by the later recognition stages.
Sources: src/ccmain/thresholder.cpp:38-100, src/ccmain/thresholder.h:1-100
Tesseract supports multiple build systems (Autotools, CMake, and sw) and automatically detects and uses hardware-specific SIMD optimizations.
The build system compiles architecture-specific implementations with the appropriate compiler flags (-mavx, -msse4.1, -mfpu=neon), and the SIMDDetect class selects the optimal implementation at runtime based on CPU capabilities detected via cpuid (x86/x64) or getauxval (ARM).
Sources: configure.ac:1-634, Makefile.am:1-880, CMakeLists.txt:1-1108, sw.cpp:1-365, src/arch/simddetect.cpp:75-293, src/arch/simddetect.h:32-95, src/arch/dotproduct.cpp:1-100, src/arch/dotproductavx.cpp:1-50, src/arch/dotproductneon.cpp:1-52
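The runtime selection described above can be illustrated with a minimal Python sketch. It is a conceptual model only, not the SIMDDetect implementation: the feature names, the preference order (AVX, then SSE4.1, then NEON, then a generic fallback), and every function name here are assumptions made for illustration, and the "optimized" variants are placeholders that reuse the portable code.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

DotProduct = Callable[[Sequence[float], Sequence[float]], float]


def dot_product_generic(u: Sequence[float], v: Sequence[float]) -> float:
    """Portable fallback: plain multiply-accumulate."""
    return sum(a * b for a, b in zip(u, v))


# Placeholders standing in for separately compiled SIMD kernels.
# In this sketch they all reuse the portable code (assumption).
dot_product_avx = dot_product_generic
dot_product_sse41 = dot_product_generic
dot_product_neon = dot_product_generic

# (feature name, implementation), ordered by an assumed preference.
CANDIDATES: List[Tuple[str, DotProduct]] = [
    ("avx", dot_product_avx),
    ("sse4.1", dot_product_sse41),
    ("neon", dot_product_neon),
]


def select_dot_product(cpu_features: Iterable[str]) -> Tuple[str, DotProduct]:
    """Return the first candidate whose feature the CPU reports."""
    available = set(cpu_features)
    for feature, implementation in CANDIDATES:
        if feature in available:
            return feature, implementation
    return "generic", dot_product_generic


# The feature set would come from cpuid or getauxval; here it is hard-coded.
name, dot = select_dot_product({"sse4.1"})
print(name, dot([1, 2, 3], [4, 5, 6]))  # prints "sse4.1 32"
```

The point of the pattern is that the choice is made once, from detected capabilities, and callers then use a single function reference without caring which variant sits behind it.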
The main command-line executable is built from src/tesseract.cpp with the following syntax:
tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
| Parameter | Values | Description |
|---|---|---|
| `--oem` | 0-3 | OCR Engine Mode: 0=Legacy only, 1=LSTM only, 2=Combined, 3=Default |
| `--psm` | 0-13 | Page Segmentation Mode: 0=OSD only, 3=Auto (default), 6=Single block, 8=Single word |
| `-l` | lang codes | Language(s): eng, deu+eng, etc. |
| `-c` | var=value | Set configuration variables |
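As a rough illustration of how these parameters combine, the sketch below assembles an argument list that follows the syntax shown above. The helper name `build_tesseract_command` and the file names are hypothetical; only the flags and value ranges come from the table, and the function does not run Tesseract itself.

```python
from typing import List, Mapping, Optional, Sequence


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: Optional[str] = None,
    oem: Optional[int] = None,
    psm: Optional[int] = None,
    variables: Optional[Mapping[str, str]] = None,
    configs: Sequence[str] = (),
) -> List[str]:
    """Assemble a tesseract argument list (hypothetical helper)."""
    # Validate against the ranges listed in the parameter table.
    if oem is not None and not 0 <= oem <= 3:
        raise ValueError("--oem must be in the range 0-3")
    if psm is not None and not 0 <= psm <= 13:
        raise ValueError("--psm must be in the range 0-13")

    cmd = ["tesseract", image, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Each configuration variable becomes its own "-c var=value" pair.
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]
    # Config file names go last, as in the syntax line.
    cmd += list(configs)
    return cmd


cmd = build_tesseract_command("page.png", "out", lang="deu+eng", oem=1, psm=6)
print(" ".join(cmd))  # prints "tesseract page.png out -l deu+eng --oem 1 --psm 6"
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `tesseract` executable is installed and on the PATH.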
The CLI supports multiple output formats through the TessResultRenderer hierarchy:
| Renderer Class | Output Format | File Extension | Description |
|---|---|---|---|
| TessTextRenderer | Plain text | .txt | Simple UTF-8 text output |
| TessHOcrRenderer | hOCR HTML | .hocr | HTML with bounding boxes and confidence |
| TessPDFRenderer | Searchable PDF | .pdf | PDF with invisible text layer |
| TessAltoRenderer | ALTO XML | .xml | ALTO v2.0 library standard |
| TessTsvRenderer | TSV | .tsv | Tab-separated values with coordinates |
| TessPAGERenderer | PAGE XML | .xml | PAGE format for document analysis |
| TessUnlvRenderer | UNLV | .unlv | UNLV text format |
| TessBoxTextRenderer | Box file | .box | Character bounding boxes |
| TessLSTMBoxRenderer | LSTM box | .box | LSTM training format |
| TessWordStrBoxRenderer | WordStr box | .box | Word-level box format |
All renderers inherit from the abstract TessResultRenderer base class (include/tesseract/renderer.h:34-100) and implement the AddImage() method for format-specific output generation.
Sources: include/tesseract/renderer.h:34-180, src/api/renderer.cpp:1-200, src/api/hocrrenderer.cpp:1-100, src/api/pdfrenderer.cpp:1-200, src/api/altorenderer.cpp:1-100, README.md:37-38
Tesseract relies on trained data files (.traineddata) managed by the TessdataManager class in src/ccutil/tessdatamanager.cpp.
Sources: src/ccutil/tessdatamanager.cpp:50-150, src/ccutil/tessdatamanager.h:40-80, src/lstm/lstmrecognizer.cpp:100-150, src/ccutil/unicharset.cpp:200-250, configure.ac:360-361, src/ccutil/ccutil.cpp:50-100
Tesseract searches for .traineddata files in the following order:

1. The directory given by the --tessdata-dir parameter
2. The TESSDATA_PREFIX environment variable location
3. The compile-time default location (@datadir@/tessdata from configure.ac)

The search order is implemented in TessdataManager::Init() and CCUtil::GlobalParams().
Sources: src/ccutil/tessdatamanager.cpp:50-150, src/ccutil/tessdatamanager.h:40-80, configure.ac:360-361, src/api/baseapi.cpp:335-350
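The lookup priority can be modelled with a short Python sketch. This is a simplified illustration, not the library's code: the function names and the `DEFAULT_TESSDATA_DIR` value are hypothetical stand-ins, and the sketch assumes each location is the directory that directly contains the .traineddata files.

```python
import os
from typing import Mapping, Optional

# Hypothetical stand-in for the build-time @datadir@/tessdata value.
DEFAULT_TESSDATA_DIR = "/usr/local/share/tessdata"


def resolve_tessdata_dir(
    cli_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    default_dir: str = DEFAULT_TESSDATA_DIR,
) -> str:
    """Pick the data directory using the priority order listed above."""
    env = os.environ if env is None else env
    if cli_dir:                          # 1. --tessdata-dir
        return cli_dir
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:                           # 2. TESSDATA_PREFIX
        return prefix
    return default_dir                   # 3. compile-time default


def traineddata_path(lang: str, **kwargs) -> str:
    """Build the expected path of <lang>.traineddata in the resolved directory."""
    directory = resolve_tessdata_dir(**kwargs).rstrip("/")
    return f"{directory}/{lang}.traineddata"


print(traineddata_path("eng", cli_dir="/opt/tessdata", env={}))
# prints "/opt/tessdata/eng.traineddata"
```

Passing `env` explicitly keeps the function easy to test; leaving it out falls back to the real process environment.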
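To use Tesseract in your application:

1. Initialize the engine with TessBaseAPI::Init()
2. Provide the input image with TessBaseAPI::SetImage()
3. Run recognition with TessBaseAPI::Recognize()
4. Retrieve the recognized text with TessBaseAPI::GetUTF8Text()
5. Release resources with TessBaseAPI::End()

For more detailed information about the API, see the API Reference section.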
Sources: src/api/baseapi.cpp:297-363, README.md:79-85