How to Implement In-Memory OCR Without Files Using FileReader Callback in Tesseract

Use the FileReader callback parameter in TessBaseAPI::Init to intercept all file system requests and serve traineddata and auxiliary files directly from memory buffers.

The tesseract-ocr/tesseract repository provides a powerful mechanism for running OCR pipelines entirely in memory. By leveraging the FileReader callback type in the TessBaseAPI::Init overload, you can embed language models in binaries, stream data from network caches, or load resources from custom archive formats without ever writing to disk.

Understanding the FileReader Callback Mechanism

Tesseract’s standard initialization reads .traineddata and auxiliary files (such as unicharset and config files) from disk. However, the API exposes an advanced initialization signature that accepts a custom file reader callback.

The FileReader Type Definition

The callback conforms to the following function signature defined in include/tesseract/baseapi.h:

bool (*)(const char *filename, std::vector<char> *data)

When Tesseract needs to load any file, it invokes this callback with the requested filename. Your implementation must populate the std::vector<char> with the file’s raw bytes and return true on success, or false if the file is unavailable.
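The contract described above can be sketched without the Tesseract headers. In this minimal sketch, the alias FileReaderFn, the function DemoReader, the file name "demo.bin", and the payload are all hypothetical; only the function shape is taken from the signature quoted above.

```cpp
#include <cstring>
#include <vector>

// Local alias mirroring the callback signature quoted above, so the sketch
// compiles on its own. The name FileReaderFn is an assumption.
using FileReaderFn = bool (*)(const char *filename, std::vector<char> *data);

// Hypothetical payload standing in for real file contents.
static const char kDemoBytes[] = {'d', 'e', 'm', 'o'};

// Serves exactly one (hypothetical) file name from memory.
bool DemoReader(const char *filename, std::vector<char> *data) {
    if (filename == nullptr || data == nullptr) return false;
    if (std::strcmp(filename, "demo.bin") != 0) return false;  // file unavailable
    data->assign(kDemoBytes, kDemoBytes + sizeof(kDemoBytes)); // copy raw bytes
    return true;
}
```

Because the type is a plain function pointer, the callback cannot capture state; any data it serves has to be reachable through globals or statics.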

Internal Implementation Flow

According to the source code in src/api/baseapi.cpp (lines 332-335 at the time of writing; line numbers shift between releases), the TessBaseAPI class stores your callback in a private member variable reader_. When the engine initializes, TessdataManager::Init (in src/ccutil/tessdatamanager.cpp, around line 90) checks this member. If reader_ is non-null, the manager bypasses standard file I/O and calls (*reader_)(filename, &data) to retrieve the requested data.
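The dispatch described above (use the callback when one is set, otherwise read from disk) can be illustrated with a small stand-alone sketch. It is not Tesseract's code: FileReaderFn, ReadFromDisk, and LoadBytes are hypothetical names chosen for illustration.

```cpp
#include <fstream>
#include <iterator>
#include <vector>

// Local alias mirroring the callback signature (name is an assumption).
using FileReaderFn = bool (*)(const char *filename, std::vector<char> *data);

// Hypothetical stand-in for the default disk path.
bool ReadFromDisk(const char *filename, std::vector<char> *data) {
    std::ifstream f(filename, std::ios::binary);
    if (!f) return false;
    data->assign(std::istreambuf_iterator<char>(f),
                 std::istreambuf_iterator<char>());
    return true;
}

// Illustrative dispatch: a non-null reader replaces file I/O entirely.
bool LoadBytes(FileReaderFn reader, const char *filename,
               std::vector<char> *data) {
    if (reader != nullptr) return (*reader)(filename, data);
    return ReadFromDisk(filename, data);
}
```

The point of the sketch is that a non-null reader is the only source consulted; there is no implicit fallback to disk unless the callback implements one itself.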

Implementing In-Memory OCR: Three Approaches

The following patterns demonstrate how to use the FileReader callback for different deployment scenarios. All examples assume you have loaded your Tesseract language data into memory buffers.

Approach 1: Single Traineddata File Without Callback

If your language pack requires no auxiliary files, you can initialize directly from a memory buffer without providing a custom callback. Pass nullptr for the FileReader parameter:

#include <tesseract/baseapi.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    // Load traineddata into memory
    std::ifstream fin("eng.traineddata", std::ios::binary);
    if (!fin) {
        fprintf(stderr, "Could not open eng.traineddata.\n");
        return 1;
    }
    std::vector<char> buf((std::istreambuf_iterator<char>(fin)),
                         std::istreambuf_iterator<char>());

    tesseract::TessBaseAPI api;

    // Initialize from memory buffer, no FileReader needed
    if (api.Init(buf.data(), static_cast<int>(buf.size()),
                 "eng", tesseract::OEM_DEFAULT,
                 nullptr, 0, nullptr, nullptr, false, nullptr) != 0) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        return 1;
    }

    // Proceed with image processing...

    api.End();  // Release engine resources when finished
    return 0;
}

Key insight: When data_size is non-zero, Tesseract treats the data pointer as the raw .traineddata content. Since the engine already has the primary language model, it does not need to request that file again, which makes the callback unnecessary for a single-language setup.
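Since the behavior hinges on data_size being non-zero, it is worth validating the buffer before calling Init: an empty buffer (for example, from a failed read) would produce a data_size of 0, and the static_cast<int> in the example would truncate a buffer larger than INT_MAX. A minimal sketch of such a guard follows; the helper name FitsInitSize is hypothetical.

```cpp
#include <climits>
#include <cstddef>
#include <vector>

// Returns true when the buffer is non-empty and its size can be passed
// as an int without truncation. FitsInitSize is a hypothetical helper.
bool FitsInitSize(const std::vector<char> &buf) {
    return !buf.empty() &&
           buf.size() <= static_cast<std::size_t>(INT_MAX);
}
```

A caller would check FitsInitSize(buf) before passing buf.data() and static_cast<int>(buf.size()) to Init, and report an error otherwise.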

Approach 2: Custom FileReader with In-Memory Storage

For languages requiring auxiliary files (configs, unicharsets, DAWG files), implement a FileReader that serves all requests from an in-memory map. This approach enables fully self-contained binaries:

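#include <tesseract/baseapi.h>
#include <cstdio>         // fprintf
#include <fstream>
#include <iterator>       // std::istreambuf_iterator
#include <string>
#include <unordered_map>
#include <utility>        // std::move
#include <vector>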

using FileStore = std::unordered_map<std::string, std::vector<char>>;

// Populate the store from embedded resources or archives
FileStore store;
void LoadResources() {
    auto load_file = [&](const char *path) {
        std::ifstream f(path, std::ios::binary);
        std::vector<char> data((std::istreambuf_iterator<char>(f)),
                               std::istreambuf_iterator<char>());
        store.emplace(path, std::move(data));
    };
    
    load_file("eng.traineddata");
    load_file("eng.unicharset");
    load_file("eng.config");
}

// FileReader callback implementation
bool MemoryFileReader(const char *filename, std::vector<char> *out) {
    auto it = store.find(filename);
    if (it == store.end()) return false;
    *out = it->second;  // Copy the content
    return true;
}

int main() {
    LoadResources();
    const auto &trained = store.at("eng.traineddata");
    
    tesseract::TessBaseAPI api;
    if (api.Init(trained.data(), static_cast<int>(trained.size()),
                 "eng", tesseract::OEM_DEFAULT,
                 nullptr, 0, nullptr, nullptr, false,
                 MemoryFileReader) != 0) {
        fprintf(stderr, "Tesseract init failed\n");
        return 1;
    }
    
    // OCR processing...
}

Critical implementation details:

  • The callback receives exact filenames as requested by the engine (e.g., "eng.unicharset", "eng.punc-dawg").
  • The callback must copy the bytes into the provided std::vector<char>. The caller owns that vector, so your store keeps its own copy and can serve the same file again on a later call.
  • Returning false triggers initialization failure, so ensure your store contains all required auxiliary files for your language pack.
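Because the lookup depends on the exact string the engine passes, a store keyed by bare file names will miss if the request arrives with a directory prefix. Whether a prefix is present depends on the data path and the Tesseract version, so treat that as an assumption to verify by logging the requested names. The sketch below strips any directory part before the lookup; g_store, BaseName, and BaseNameReader are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical store keyed by bare file name (no directory part).
static std::unordered_map<std::string, std::vector<char>> g_store;

// Returns the part of the path after the last '/' or '\\'.
std::string BaseName(const std::string &path) {
    const std::size_t pos = path.find_last_of("/\\");
    return pos == std::string::npos ? path : path.substr(pos + 1);
}

// Callback variant that ignores any directory prefix in the request.
bool BaseNameReader(const char *filename, std::vector<char> *out) {
    if (filename == nullptr || out == nullptr) return false;
    auto it = g_store.find(BaseName(filename));
    if (it == g_store.end()) return false;
    *out = it->second;  // copy into the caller's vector
    return true;
}
```

The trade-off is that two files with the same name in different directories become indistinguishable, which is usually acceptable for an embedded language store.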

Approach 3: Hybrid Reader with Filesystem Fallback

For scenarios where you embed core language data but allow users to install additional resources on disk, implement a hybrid callback that checks memory first, then falls back to standard file I/O:

#include <tesseract/baseapi.h>
#include <fstream>
#include <iterator>       // std::istreambuf_iterator
#include <string>
#include <unordered_map>
#include <vector>

// In-memory store for embedded files
std::unordered_map<std::string, std::vector<char>> mem_store;

// Default filesystem loader
bool DefaultFileReader(const char *filename, std::vector<char> *out) {
    std::ifstream f(filename, std::ios::binary);
    if (!f) return false;
    *out = std::vector<char>((std::istreambuf_iterator<char>(f)),
                             std::istreambuf_iterator<char>());
    return true;
}

// Hybrid reader: memory first, then disk
bool HybridReader(const char *filename, std::vector<char> *out) {
    auto it = mem_store.find(filename);
    if (it != mem_store.end()) {
        *out = it->second;
        return true;
    }
    return DefaultFileReader(filename, out);
}

// Usage: populate mem_store with critical files, then pass HybridReader to api.Init()

This pattern minimizes disk access for frequently used languages while maintaining compatibility with standard Tesseract resource installations.
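For the embedded half of this pattern, the store has to be filled from bytes compiled into the binary rather than from disk. A minimal sketch follows, assuming the language data has been turned into a byte array at build time (for example with a tool such as xxd -i); embedded_store, kEmbeddedBlob, and RegisterEmbedded are hypothetical names, and the three-byte blob is a placeholder, not real traineddata.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory store, same shape as the one used by the reader.
std::unordered_map<std::string, std::vector<char>> embedded_store;

// Placeholder bytes; a real build would generate this array from the file.
static const unsigned char kEmbeddedBlob[] = {0x01, 0x02, 0x03};

// Copies an embedded byte array into the store under the given name.
void RegisterEmbedded(const std::string &name, const unsigned char *bytes,
                      std::size_t size) {
    const char *begin = reinterpret_cast<const char *>(bytes);
    embedded_store[name].assign(begin, begin + size);
}
```

Calling RegisterEmbedded("eng.traineddata", kEmbeddedBlob, sizeof(kEmbeddedBlob)) once at startup would make the file available to a memory-first reader before any Init call.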

Key Source Files and Implementation Details

Understanding the internal architecture helps debug callback issues and optimize memory usage. The following files implement the FileReader mechanism:

File | Role | Key Components
include/tesseract/baseapi.h | Declares the FileReader typedef and the in-memory Init overload. | FileReader function pointer type, TessBaseAPI::Init signature
src/api/baseapi.cpp | Stores the callback in reader_ member and passes it to TessdataManager. | Lines 332-335: reader_ = reader;
src/ccutil/tessdatamanager.h | Defines TessdataManager class that orchestrates file loading. | FileReader member storage
src/ccutil/tessdatamanager.cpp | Invokes the callback when auxiliary files are requested. | Line 90: if (reader_) return (*reader_)(filename, data);
src/ccutil/serialis.h / .cpp | Provides TFile abstraction for reading from memory buffers. | TFile class for vector-based I/O

When tracing execution, start at TessBaseAPI::Init in baseapi.cpp, follow the reader_ assignment, then examine TessdataManager::GetComponent in tessdatamanager.cpp to see how the callback is invoked for each language component.

Summary

  • The FileReader callback enables fully in-memory OCR by intercepting all file system requests from Tesseract's TessdataManager.
  • Initialization signature requires passing a function pointer of type bool (*)(const char*, std::vector<char>*) as the final parameter to TessBaseAPI::Init.
  • Three implementation patterns cover most use cases: direct buffer initialization without callback (simplest), a custom in-memory map of files (fully embedded), and hybrid readers with disk fallback (flexible deployment).
  • Critical source locations include src/api/baseapi.cpp (lines 332-335) for callback storage and src/ccutil/tessdatamanager.cpp (line 90) for callback invocation.

Frequently Asked Questions

Can I use FileReader with multiple language files simultaneously?

Yes. When initializing Tesseract with multiple languages (e.g., "eng+deu"), the engine requests each component file separately through the same callback. Your FileReader implementation must handle lookups for all required filenames, including eng.traineddata, deu.traineddata, and their respective auxiliary files like eng.unicharset and deu.unicharset.
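Before calling Init with a combined language string, it can help to check that the store holds a traineddata entry for every language. The sketch below derives the expected names from a plain '+'-separated string; ExpectedTraineddataNames is a hypothetical helper, and it deliberately ignores any other syntax a language string might carry.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits "eng+deu" into {"eng.traineddata", "deu.traineddata"}.
// Only handles plain '+'-separated language codes.
std::vector<std::string> ExpectedTraineddataNames(const std::string &langs) {
    std::vector<std::string> names;
    std::stringstream ss(langs);
    std::string lang;
    while (std::getline(ss, lang, '+')) {
        if (!lang.empty()) names.push_back(lang + ".traineddata");
    }
    return names;
}
```

Each returned name can then be looked up in the in-memory store, and a missing entry reported before initialization is attempted.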

What happens if the FileReader callback returns false?

If your callback returns false for any file request, TessdataManager::Init propagates the failure upward, causing TessBaseAPI::Init to return a non-zero error code. Your own error handling, such as the fprintf calls in the examples above, is then what reports the failure. Ensure your callback handles all expected filenames and returns true only when the vector is successfully populated with the complete file contents.
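A failed Init does not say which request was refused, so recording the names the callback could not serve makes the failure easier to diagnose. The following is a minimal sketch; diag_store, missing_files, and DiagnosticReader are hypothetical names, and the globals are needed because the callback is a plain function pointer.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical store plus a log of every name that could not be served.
static std::unordered_map<std::string, std::vector<char>> diag_store;
static std::vector<std::string> missing_files;

// Same lookup as a plain in-memory reader, but remembers each miss.
bool DiagnosticReader(const char *filename, std::vector<char> *out) {
    if (filename == nullptr || out == nullptr) return false;
    auto it = diag_store.find(filename);
    if (it == diag_store.end()) {
        missing_files.emplace_back(filename);  // record for later reporting
        return false;
    }
    *out = it->second;
    return true;
}
```

After a failed initialization, printing the contents of missing_files shows exactly which names the engine asked for, including any path prefix.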

Is the FileReader approach thread-safe?

The thread safety of in-memory OCR depends entirely on your callback implementation. A TessBaseAPI instance is not thread-safe by design, so each thread requires its own API object. However, if multiple TessBaseAPI instances share the same FileReader callback, your implementation must handle concurrent access to the underlying memory store. Populate the store before any thread starts and treat it as immutable afterwards, or protect lookups with a mutex, to ensure safe concurrent reads.
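The mutex-protected option can be sketched as a small store class with a free function as the callback. SharedFileStore, g_shared_store, and ThreadSafeReader are hypothetical names; a store that is filled once and never modified afterwards would not need the lock at all.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical store whose reads and writes are serialized by one mutex.
class SharedFileStore {
public:
    void Put(const std::string &name, std::vector<char> bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        files_[name] = std::move(bytes);
    }
    bool Get(const char *name, std::vector<char> *out) const {
        if (name == nullptr || out == nullptr) return false;
        std::lock_guard<std::mutex> lock(mu_);
        auto it = files_.find(name);
        if (it == files_.end()) return false;
        *out = it->second;  // copy while holding the lock
        return true;
    }
private:
    mutable std::mutex mu_;
    std::unordered_map<std::string, std::vector<char>> files_;
};

static SharedFileStore g_shared_store;

// Callback with the FileReader shape that forwards to the locked store.
bool ThreadSafeReader(const char *filename, std::vector<char> *out) {
    return g_shared_store.Get(filename, out);
}
```

The copy happens under the lock, so a concurrent Put cannot invalidate the buffer mid-read; the cost is that large files serialize readers for the duration of the copy.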

How do I handle large traineddata files in memory?

For large language models, avoid copying buffers unnecessarily. Store files in std::vector<char> or memory-mapped regions, and implement copy-on-write semantics or shared pointers if multiple API instances need access. When passing data to Init, ensure the buffer remains valid for the duration of the initialization call. Tesseract copies the traineddata content internally during TessdataManager setup, so you can safely free your buffer after Init returns successfully.
