Overview

Tesseract OCR is an open-source optical character recognition engine that converts text from images into machine-readable formats. This page provides a high-level overview of the Tesseract OCR system, its architecture, and main components. For information about building and installing Tesseract, see Building and Deployment, and for details about its API, see API Reference.

Tesseract consists of two main components:

- libtesseract: a C++ library that implements the OCR engine
- tesseract: a command-line program for OCR processing

Version and History

Tesseract was originally developed at Hewlett-Packard Laboratories between 1985 and 1994, with further changes in 1996 (a port to Windows) and 1998 (partial conversion to C++). HP open-sourced it in 2005, and Google developed it from 2006 until August 2017. Stefan Weil is the current lead developer, and Zdenko Podobny is the maintainer.

The current version is 5.5.2 (as of the VERSION file). Major version 5 was released on November 30, 2021, introducing significant improvements to the LSTM neural network engine while maintaining backward compatibility with the legacy engine through OEM_TESSERACT_ONLY mode.

Sources: README.md46-60 ChangeLog1-7 VERSION1-2

Core Features

- Universal text recognition supporting 100+ languages out of the box
- Multiple input formats including PNG, JPEG, and TIFF
- Multiple output formats including plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV, ALTO and PAGE
- Two recognition engines:
  - LSTM neural network-based engine (default)
  - Legacy OCR engine (optional)
- Training capabilities to recognize additional languages and character sets

Sources: README.md34-44

System Architecture

Tesseract follows a modular architecture with clear separation of concerns between components.

High-Level System Architecture

The following diagram shows the complete architecture of the Tesseract OCR system, with major components and their relationships:

Sources: CMakeLists.txt1-100 configure.ac1-100 Makefile.am1-200 sw.cpp1-100 src/api/baseapi.cpp1-200 src/api/capi.cpp1-100 src/ccmain/tesseractclass.cpp1-150 src/ccmain/thresholder.cpp1-100 src/lstm/lstmrecognizer.cpp1-100 src/classify/adaptmatch.cpp1-100 src/ccmain/pagesegmain.cpp1-100 src/textord/linefind.cpp1-50 src/dict/dict.cpp1-50 src/ccutil/tessdatamanager.cpp1-100 src/api/renderer.cpp1-100 src/arch/simddetect.cpp1-150

OCR Processing Pipeline

The following diagram shows the complete processing flow from input image to recognized text output, mapping the logical stages to actual code entry points:

Sources: src/api/baseapi.cpp500-536 src/api/baseapi.cpp1455-1494 src/ccmain/thresholder.cpp159-287 src/ccmain/pagesegmain.cpp1-150 src/textord/linefind.cpp1-100 src/textord/imagefind.cpp1-100 src/textord/colfind.cpp1-150 src/textord/tordmain.cpp1-100 src/ccmain/linerec.cpp1-100 src/ccmain/control.cpp1-150 src/lstm/lstmrecognizer.cpp424-521 src/lstm/recodebeam.cpp179-361 src/classify/adaptmatch.cpp95-207 src/dict/dict.cpp288-371 src/api/hocrrenderer.cpp1-100 src/api/pdfrenderer.cpp451-731 src/api/altorenderer.cpp1-100

Key Components

API Layer Components

The API layer provides interfaces for applications to interact with the Tesseract OCR engine:

- TessBaseAPI: the primary C++ API class, declared in include/tesseract/baseapi.h
- C API: a C language wrapper in include/tesseract/capi.h and src/api/capi.cpp
- Iterator classes: for navigating OCR results at different granularities

Sources: include/tesseract/baseapi.h76-150 include/tesseract/pageiterator.h60-120 include/tesseract/resultiterator.h40-80 include/tesseract/renderer.h47-90 src/api/baseapi.cpp183-200

Core Component: Tesseract Class Hierarchy

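The Tesseract class is the core engine that implements the OCR functionality. It sits at the end of an inheritance chain in which each base class contributes a capability: Tesseract derives from Wordrec, which derives from Classify, then CCStruct, then CCUtil.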

Sources: src/ccmain/tesseractclass.h178-317 src/ccmain/tesseractclass.cpp53-150 src/ccutil/ccutil.h50-80 src/classify/classify.h100-150 src/wordrec/wordrec.h80-120

Recognition Engines

Tesseract has two OCR engines:

- LSTM neural network engine: the modern engine, with better accuracy
- Legacy (traditional) engine: the original Tesseract engine

LSTM Engine

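The LSTM engine uses recurrent neural networks, specifically Long Short-Term Memory (LSTM) networks, to recognize text. It recognizes a line of text as a sequence rather than classifying isolated characters. The engine code itself is language-independent; language-specific behavior comes from the trained model loaded from the .traineddata file.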

Sources: src/lstm/lstmrecognizer.cpp50-138 src/lstm/lstmrecognizer.h51-100
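To make "recognizing text as a sequence" concrete, the sketch below decodes per-timestep network scores into a string. It assumes CTC-style outputs in which label 0 is a blank, and it uses a greedy decoder for brevity. Tesseract's own decoder is a beam search (src/lstm/recodebeam.cpp is cited in the pipeline sources), so the function name, the score layout, and the alphabet string here are illustrative assumptions, not the engine's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Greedy CTC-style decoding (a simplified stand-in, not Tesseract's beam
// search): pick the best label at each timestep, merge consecutive repeats,
// and drop the blank label. Index 0 of `alphabet` is assumed to be the blank.
std::string GreedyDecode(const std::vector<std::vector<float>>& scores,
                         const std::string& alphabet) {
  const std::size_t kBlank = 0;
  std::string text;
  std::size_t previous = kBlank;
  for (const std::vector<float>& step : scores) {
    if (step.empty()) continue;
    const std::size_t best = static_cast<std::size_t>(
        std::max_element(step.begin(), step.end()) - step.begin());
    // Emit a character only when the label changes and is not the blank.
    if (best != kBlank && best != previous && best < alphabet.size()) {
      text += alphabet[best];
    }
    previous = best;
  }
  return text;
}
```

A blank between two identical labels is what lets the decoder emit a doubled letter: the label sequence blank, a, a, blank, a, b, b decodes to "aab".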

Legacy Engine

The legacy engine uses a feature-based approach with pattern matching and adaptive classification to recognize text.

Image Processing Components

The image processing in Tesseract is handled primarily by the ImageThresholder class, which is responsible for:

- Converting input images to a binary format for processing
- Providing methods for applying different thresholding techniques
- Managing the image rectangles for processing

Sources: src/ccmain/thresholder.cpp38-100 src/ccmain/thresholder.h1-100
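As an illustration of what "converting to a binary format" involves, here is a minimal sketch of global thresholding with Otsu's method on an 8-bit grayscale buffer. The text above does not say which techniques ImageThresholder applies, so Otsu is chosen here as one common example; the function names and the flat pixel vector are assumptions of this sketch and are unrelated to the class's real interface.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the gray level t that maximizes the between-class variance when
// pixels <= t form one class and pixels > t the other (Otsu's method).
int OtsuThreshold(const std::vector<std::uint8_t>& gray) {
  std::array<std::size_t, 256> histogram{};
  for (std::uint8_t value : gray) ++histogram[value];

  double total_sum = 0.0;
  for (int level = 0; level < 256; ++level) {
    total_sum += level * static_cast<double>(histogram[level]);
  }

  const double total = static_cast<double>(gray.size());
  double below_count = 0.0;
  double below_sum = 0.0;
  double best_variance = -1.0;
  int best_level = 0;
  for (int level = 0; level < 256; ++level) {
    below_count += static_cast<double>(histogram[level]);
    below_sum += level * static_cast<double>(histogram[level]);
    const double above_count = total - below_count;
    if (below_count == 0.0) continue;  // No pixels in the lower class yet.
    if (above_count == 0.0) break;     // Upper class is empty from here on.
    const double mean_below = below_sum / below_count;
    const double mean_above = (total_sum - below_sum) / above_count;
    const double diff = mean_below - mean_above;
    const double variance = below_count * above_count * diff * diff;
    if (variance > best_variance) {
      best_variance = variance;
      best_level = level;
    }
  }
  return best_level;
}

// Maps each pixel to 1 (at or below the threshold, i.e. dark) or 0 (light).
std::vector<std::uint8_t> Binarize(const std::vector<std::uint8_t>& gray) {
  const int threshold = OtsuThreshold(gray);
  std::vector<std::uint8_t> binary;
  binary.reserve(gray.size());
  for (std::uint8_t value : gray) binary.push_back(value <= threshold ? 1 : 0);
  return binary;
}
```

A single global threshold works for evenly lit scans; unevenly lit pages are the usual reason to offer more than one thresholding technique.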

Build System and SIMD Optimization

Tesseract can be built with Autotools (configure.ac, Makefile.am), CMake (CMakeLists.txt), or sw (sw.cpp), and it automatically detects and uses hardware-specific SIMD optimizations.

Build System Components

The build system compiles architecture-specific implementations with the appropriate compiler flags (for example -mavx, -msse4.1, -mfpu=neon). At runtime, the SIMDDetect class selects the best available implementation based on CPU capabilities detected via cpuid (x86/x64) or getauxval (ARM).

Sources: configure.ac1-634 Makefile.am1-880 CMakeLists.txt1-1108 sw.cpp1-365 src/arch/simddetect.cpp75-293 src/arch/simddetect.h32-95 src/arch/dotproduct.cpp1-100 src/arch/dotproductavx.cpp1-50 src/arch/dotproductneon.cpp1-52
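The runtime selection described above is commonly implemented as a function pointer chosen once from detected CPU features. The following sketch shows that pattern in portable C++. The CpuFeatures struct, SelectDotProduct, and the unrolled loop standing in for a vectorized build are all hypothetical; real detection would query cpuid or getauxval, and SIMDDetect's actual interface is not reproduced here.

```cpp
#include <cstddef>

// Signature shared by every dot-product implementation.
using DotProductFn = double (*)(const double* u, const double* v, std::size_t n);

// Portable fallback that works on any CPU.
double DotProductGeneric(const double* u, const double* v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}

// Stand-in for a vectorized build: four independent accumulators, which is
// the shape a SIMD implementation usually takes.
double DotProductUnrolled(const double* u, const double* v, std::size_t n) {
  double acc[4] = {0.0, 0.0, 0.0, 0.0};
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    for (std::size_t k = 0; k < 4; ++k) acc[k] += u[i + k] * v[i + k];
  }
  double total = acc[0] + acc[1] + acc[2] + acc[3];
  for (; i < n; ++i) total += u[i] * v[i];  // Remaining tail elements.
  return total;
}

// Hypothetical feature flags; a real detector would fill these from the CPU.
struct CpuFeatures {
  bool avx = false;
  bool sse41 = false;
  bool neon = false;
};

// Chooses an implementation once; callers then invoke it through the pointer.
DotProductFn SelectDotProduct(const CpuFeatures& cpu) {
  if (cpu.avx || cpu.sse41 || cpu.neon) return DotProductUnrolled;
  return DotProductGeneric;
}
```

Because every implementation shares one signature, the calling code does not change when a new instruction set is added; only the selector does.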

Command-Line Interface

The main command-line executable is built from src/tesseract.cpp with the following syntax:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

Key Parameters

| Parameter | Values | Description |
|---|---|---|
| `--oem` | 0-3 | OCR Engine Mode: 0=Legacy only, 1=LSTM only, 2=Combined, 3=Default |
| `--psm` | 0-13 | Page Segmentation Mode: 0=OSD only, 3=Auto (default), 6=Single block, 8=Single word |
| `-l` | lang codes | Language(s): `eng`, `deu+eng`, etc. |
| `-c` | var=value | Set configuration variables |
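When the CLI is driven from another program, it helps to assemble and validate the argument list in one place. The sketch below builds an argument vector that follows the syntax line above and enforces the value ranges from the table. The helper names (JoinLanguages, BuildTesseractArgs) are hypothetical and not part of Tesseract.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Joins language codes with '+', e.g. {"deu", "eng"} -> "deu+eng".
std::string JoinLanguages(const std::vector<std::string>& languages) {
  std::string joined;
  for (const std::string& code : languages) {
    if (!joined.empty()) joined += '+';
    joined += code;
  }
  return joined;
}

// Builds: tesseract imagename outputbase [-l lang] --oem N --psm N
// Throws std::invalid_argument when a mode is outside the documented range.
std::vector<std::string> BuildTesseractArgs(
    const std::string& image, const std::string& output_base,
    const std::vector<std::string>& languages, int oem, int psm) {
  if (oem < 0 || oem > 3) throw std::invalid_argument("--oem must be 0-3");
  if (psm < 0 || psm > 13) throw std::invalid_argument("--psm must be 0-13");

  std::vector<std::string> args = {"tesseract", image, output_base};
  if (!languages.empty()) {
    args.push_back("-l");
    args.push_back(JoinLanguages(languages));
  }
  args.push_back("--oem");
  args.push_back(std::to_string(oem));
  args.push_back("--psm");
  args.push_back(std::to_string(psm));
  return args;
}
```

Keeping the arguments as a vector rather than one concatenated string avoids shell-quoting problems with file names that contain spaces.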

Available Output Renderers

The CLI supports multiple output formats through the TessResultRenderer hierarchy:

| Renderer Class | Output Format | File Extension | Description |
|---|---|---|---|
| `TessTextRenderer` | Plain text | .txt | Simple UTF-8 text output |
| `TessHOcrRenderer` | hOCR HTML | .hocr | HTML with bounding boxes and confidence |
| `TessPDFRenderer` | Searchable PDF | .pdf | PDF with invisible text layer |
| `TessAltoRenderer` | ALTO XML | .xml | ALTO XML layout format |
| `TessTsvRenderer` | TSV | .tsv | Tab-separated values with coordinates |
| `TessPAGERenderer` | PAGE XML | .xml | PAGE format for document analysis |
| `TessUnlvRenderer` | UNLV | .unlv | UNLV text format |
| `TessBoxTextRenderer` | Box file | .box | Character bounding boxes |
| `TessLSTMBoxRenderer` | LSTM box | .box | LSTM training format |
| `TessWordStrBoxRenderer` | WordStr box | .box | Word-level box format |

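All renderers inherit from the abstract TessResultRenderer base class in include/tesseract/renderer.h34-100. The public AddImage() method is defined once in the base class and calls the protected virtual AddImageHandler(), which each renderer overrides to generate its format-specific output.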

Sources: include/tesseract/renderer.h34-180 src/api/renderer.cpp1-200 src/api/hocrrenderer.cpp1-100 src/api/pdfrenderer.cpp1-200 src/api/altorenderer.cpp1-100 README.md37-38
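The renderer hierarchy is an instance of a familiar pattern: a base class owns the shared bookkeeping and each output format supplies only its own page writer. The sketch below shows that pattern in isolation. Every name in it (PageResult, SketchRenderer, WritePage, and so on) is hypothetical, and the classes are far simpler than the real TessResultRenderer family.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for one page of OCR output.
struct PageResult {
  std::string text;
  int confidence;
};

// Base class: counts pages and delegates the format-specific work.
class SketchRenderer {
 public:
  explicit SketchRenderer(std::ostream& out) : out_(out) {}
  virtual ~SketchRenderer() = default;

  bool AddPage(const PageResult& page) {
    if (!WritePage(page)) return false;
    ++page_count_;
    return true;
  }
  int page_count() const { return page_count_; }

 protected:
  virtual bool WritePage(const PageResult& page) = 0;
  std::ostream& out_;

 private:
  int page_count_ = 0;
};

// Plain-text output: just the recognized text.
class SketchTextRenderer : public SketchRenderer {
 public:
  using SketchRenderer::SketchRenderer;

 protected:
  bool WritePage(const PageResult& page) override {
    out_ << page.text << '\n';
    return static_cast<bool>(out_);
  }
};

// Tab-separated output: page number, confidence, text.
class SketchTsvRenderer : public SketchRenderer {
 public:
  using SketchRenderer::SketchRenderer;

 protected:
  bool WritePage(const PageResult& page) override {
    out_ << page_count() + 1 << '\t' << page.confidence << '\t' << page.text
         << '\n';
    return static_cast<bool>(out_);
  }
};

// Convenience wrapper: renders one page with the given renderer type.
template <typename Renderer>
std::string RenderToString(const PageResult& page) {
  std::ostringstream out;
  Renderer renderer(out);
  renderer.AddPage(page);
  return out.str();
}
```

Adding a new output format in this design means writing one subclass with one overridden method; the code that feeds pages to the renderer does not change.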

Trained Data Architecture

Tesseract relies on trained data files (.traineddata) managed by the TessdataManager class in src/ccutil/tessdatamanager.cpp:

Trained Data Architecture

Sources: src/ccutil/tessdatamanager.cpp50-150 src/ccutil/tessdatamanager.h40-80 src/lstm/lstmrecognizer.cpp100-150 src/ccutil/unicharset.cpp200-250 configure.ac360-361 src/ccutil/ccutil.cpp50-100

Data File Locations

Tesseract searches for .traineddata files in:

- Path specified by the --tessdata-dir parameter
- Location given by the TESSDATA_PREFIX environment variable
- Compiled-in default path (@datadir@/tessdata from configure.ac)

The data directory is resolved in CCUtil::main_setup(), and TessdataManager::Init() then loads the .traineddata file from that directory.

Sources: src/ccutil/tessdatamanager.cpp50-150 src/ccutil/tessdatamanager.h40-80 configure.ac360-361 src/api/baseapi.cpp335-350
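The lookup can be pictured as a simple fallback chain. The sketch below assumes the three locations are tried in the order listed above, with the first non-empty one winning; that precedence, and both function names, are assumptions of this example and do not reproduce the library's code.

```cpp
#include <cstdlib>
#include <string>

// Picks the tessdata directory from three candidates, in assumed priority
// order: explicit command-line value, environment value, compiled-in default.
std::string ResolveTessdataDir(const std::string& cli_dir,
                               const char* env_prefix,
                               const std::string& compiled_default) {
  if (!cli_dir.empty()) return cli_dir;
  if (env_prefix != nullptr && *env_prefix != '\0') return env_prefix;
  return compiled_default;
}

// Same lookup, reading TESSDATA_PREFIX from the process environment.
std::string ResolveTessdataDirFromEnvironment(
    const std::string& cli_dir, const std::string& compiled_default) {
  return ResolveTessdataDir(cli_dir, std::getenv("TESSDATA_PREFIX"),
                            compiled_default);
}
```

Passing the environment value in as a parameter keeps the core function deterministic, so it can be exercised without modifying the real environment.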

Getting Started

To use Tesseract in your application:

1. Initialize the API with TessBaseAPI::Init()
2. Set the image using TessBaseAPI::SetImage()
3. Recognize text with TessBaseAPI::Recognize()
4. Get the results using methods like TessBaseAPI::GetUTF8Text()
5. Clean up with TessBaseAPI::End()

For more detailed information about the API, see the API Reference section.

Sources: src/api/baseapi.cpp297-363 README.md79-85
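The five calls above can be sketched as one helper. To keep the example compilable without Tesseract installed, RunOcr is written as a template over any type that exposes the same five member functions, and FakeApi is a hypothetical stand-in used only for demonstration. With the real library you would pass a tesseract::TessBaseAPI and a Leptonica Pix* instead. The sketch relies on two details of the real API: Init() returns 0 on success, and the buffer returned by GetUTF8Text() must be released with delete[].

```cpp
#include <cstring>
#include <stdexcept>
#include <string>

// Runs the Init -> SetImage -> Recognize -> GetUTF8Text -> End sequence.
// Returns the recognized text, or an empty string if recognition fails.
template <typename Api, typename Image>
std::string RunOcr(Api& api, const char* datapath, const char* language,
                   Image image) {
  if (api.Init(datapath, language) != 0) {
    throw std::runtime_error("could not initialize the OCR engine");
  }
  api.SetImage(image);

  std::string text;
  if (api.Recognize(nullptr) == 0) {
    char* raw = api.GetUTF8Text();  // Caller owns this buffer.
    if (raw != nullptr) {
      text = raw;
      delete[] raw;
    }
  }
  api.End();  // Release engine resources on both paths.
  return text;
}

// Hypothetical stand-in so the sketch runs without Tesseract installed.
struct FakeApi {
  int Init(const char* /*datapath*/, const char* language) {
    return language != nullptr ? 0 : -1;
  }
  void SetImage(const void* image) { has_image = image != nullptr; }
  int Recognize(void* /*monitor*/) { return has_image ? 0 : -1; }
  char* GetUTF8Text() {
    const char kText[] = "sample text\n";
    char* copy = new char[sizeof(kText)];
    std::memcpy(copy, kText, sizeof(kText));
    return copy;
  }
  void End() {
    has_image = false;
    ended = true;
  }
  bool has_image = false;
  bool ended = false;
};
```

Copying the result into a std::string before freeing the raw buffer keeps ownership in one place, so callers never need to remember the delete[].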
