# How to Implement In-Memory OCR Without Files Using FileReader Callback in Tesseract

> Implement in-memory OCR with Tesseract without files using the FileReader callback. Serve traineddata directly from memory buffers for faster processing.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Use the `FileReader` callback parameter in `TessBaseAPI::Init` to intercept all file system requests and serve traineddata and auxiliary files directly from memory buffers.**

The `tesseract-ocr/tesseract` repository lets you run OCR pipelines entirely in memory. By passing a `FileReader` callback to the corresponding `TessBaseAPI::Init` overload, you can embed language models in binaries, stream data from network caches, or load resources from custom archive formats without writing them to disk.

## Understanding the FileReader Callback Mechanism

Tesseract’s standard initialization reads `.traineddata` and auxiliary files (such as `unicharset` and config files) from disk. However, the API exposes an advanced initialization signature that accepts a custom file reader callback.

### The FileReader Type Definition

The callback conforms to the following function signature defined in [`include/tesseract/baseapi.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h):

```cpp
using FileReader = bool (*)(const char *filename, std::vector<char> *data);
```

When Tesseract needs to load any file, it invokes this callback with the requested filename. Your implementation must populate the `std::vector<char>` with the file’s raw bytes and return `true` on success, or `false` if the file is unavailable.
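
As a minimal sketch of this contract, the callback below serves one hard-coded buffer and rejects every other name. The `FileReader` alias is redeclared locally so the snippet compiles without the Tesseract headers, and `kEmbeddedName`, `kEmbeddedBytes`, and `SingleBufferReader` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Local stand-in mirroring the callback signature shown above, so this
// sketch compiles without the Tesseract headers.
using FileReader = bool (*)(const char *filename, std::vector<char> *data);

// Hypothetical embedded payload; a real build would embed traineddata bytes.
static const std::string kEmbeddedName = "eng.traineddata";
static const std::string kEmbeddedBytes = "placeholder-bytes";

// Fills `data` and returns true only for the one name this reader knows.
bool SingleBufferReader(const char *filename, std::vector<char> *data) {
    assert(data != nullptr);
    if (filename == nullptr || kEmbeddedName != filename) {
        return false;  // Unknown file: report it as unavailable.
    }
    data->assign(kEmbeddedBytes.begin(), kEmbeddedBytes.end());
    return true;
}

// A plain function converts implicitly to the callback type.
FileReader reader = SingleBufferReader;
```

Because the callback type is a plain function pointer, a capturing lambda cannot be passed; any state the reader needs has to live in global or static storage.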

### Internal Implementation Flow

In [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) (lines 332-335 at the time of writing; line numbers shift between revisions), the `TessBaseAPI` class stores your callback in a private member variable `reader_` and hands it to `TessdataManager`. When the engine initializes, `TessdataManager::Init` (in [`src/ccutil/tessdatamanager.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccutil/tessdatamanager.cpp), line 90) checks its copy of the callback. If `reader_` is non-null, the manager bypasses standard file I/O and calls `(*reader_)(filename, &data)` to retrieve every component required for OCR.
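
The dispatch described above can be modeled with a small stand-in class. This is a simplified sketch, not the actual `TessdataManager` source: `ToyDataManager`, `Load`, and `AlwaysABC` are hypothetical names, and the real class does considerably more work after the bytes are loaded.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <vector>

using FileReader = bool (*)(const char *filename, std::vector<char> *data);

// Simplified stand-in for the manager; NOT the real TessdataManager.
class ToyDataManager {
public:
    explicit ToyDataManager(FileReader reader) : reader_(reader) {}

    // Loads `filename` through the callback when one is set, else from disk.
    bool Load(const char *filename, std::vector<char> *data) const {
        assert(filename != nullptr && data != nullptr);
        if (reader_ != nullptr) {
            return (*reader_)(filename, data);  // Custom reader wins.
        }
        std::ifstream f(filename, std::ios::binary);
        if (!f) return false;
        data->assign(std::istreambuf_iterator<char>(f),
                     std::istreambuf_iterator<char>());
        return true;
    }

private:
    FileReader reader_;  // Null means "use the filesystem".
};

// Hypothetical reader that ignores the name and always yields three bytes.
bool AlwaysABC(const char *, std::vector<char> *data) {
    data->assign({'a', 'b', 'c'});
    return true;
}
```

With `AlwaysABC` installed, `Load` succeeds even for a path that does not exist on disk, which is the property that makes disk-free initialization possible.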

## Implementing In-Memory OCR: Three Approaches

The following patterns demonstrate how to use the `FileReader` callback for different deployment scenarios. All examples assume you have loaded your Tesseract language data into memory buffers.

### Approach 1: Single Traineddata File Without Callback

If your language pack requires no auxiliary files, you can initialize directly from a memory buffer without providing a custom callback. Pass `nullptr` for the `FileReader` parameter:

```cpp
#include <tesseract/baseapi.h>
#include <cstdio>    // fprintf
#include <fstream>
#include <iterator>  // std::istreambuf_iterator
#include <vector>

int main() {
    // Load traineddata into memory
    std::ifstream fin("eng.traineddata", std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(fin)),
                         std::istreambuf_iterator<char>());

    tesseract::TessBaseAPI api;
    
    // Initialize from memory buffer, no FileReader needed
    if (api.Init(buf.data(), static_cast<int>(buf.size()),
                 "eng", tesseract::OEM_DEFAULT,
                 nullptr, 0, nullptr, nullptr, false, nullptr) != 0) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        return 1;
    }
    
    // Proceed with image processing...
}
```

**Key insight:** When `data_size` is non-zero, Tesseract treats the `data` pointer as the raw `.traineddata` content. Since the engine already has the primary language model, it never requests additional files, making the callback unnecessary.

### Approach 2: Custom FileReader with In-Memory Storage

For languages requiring auxiliary files (configs, unicharsets, DAWG files), implement a `FileReader` that serves all requests from an in-memory map. This approach enables fully self-contained binaries:

```cpp
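#include <tesseract/baseapi.h>
#include <cstdio>    // fprintf
#include <fstream>
#include <iterator>  // std::istreambuf_iterator
#include <string>
#include <unordered_map>
#include <utility>   // std::move
#include <vector>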

using FileStore = std::unordered_map<std::string, std::vector<char>>;

// Populate the store from embedded resources or archives
FileStore store;
void LoadResources() {
    auto load_file = [&](const char *path) {
        std::ifstream f(path, std::ios::binary);
        std::vector<char> data((std::istreambuf_iterator<char>(f)),
                               std::istreambuf_iterator<char>());
        store.emplace(path, std::move(data));
    };
    
    load_file("eng.traineddata");
    load_file("eng.unicharset");
    load_file("eng.config");
}

// FileReader callback implementation
bool MemoryFileReader(const char *filename, std::vector<char> *out) {
    auto it = store.find(filename);
    if (it == store.end()) return false;
    *out = it->second;  // Copy the content
    return true;
}

int main() {
    LoadResources();
    const auto &trained = store.at("eng.traineddata");
    
    tesseract::TessBaseAPI api;
    if (api.Init(trained.data(), static_cast<int>(trained.size()),
                 "eng", tesseract::OEM_DEFAULT,
                 nullptr, 0, nullptr, nullptr, false,
                 MemoryFileReader) != 0) {
        fprintf(stderr, "Tesseract init failed\n");
        return 1;
    }
    
    // OCR processing...
}
```

**Critical implementation details:**

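* The callback receives each filename exactly as the engine requests it (e.g., `"eng.unicharset"`, `"eng.punc-dawg"`). The exact strings, including any directory prefix, can vary with the Tesseract version and data path, so log them during development and key your store accordingly.
* You must copy the data into the provided `std::vector<char>`. Tesseract owns that vector after the call returns, so later changes to your store do not affect data the engine has already loaded.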
* Returning `false` triggers initialization failure, so ensure your store contains all required auxiliary files for your language pack.
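
One way to find out which names the engine actually asks for is to record every request. The sketch below is illustrative only: `g_store`, `g_requested`, `LoggingReader`, and `MissingFiles` are hypothetical names, and global state is used because the callback is a plain function pointer that cannot capture anything.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

using FileStore = std::unordered_map<std::string, std::vector<char>>;

FileStore g_store;                     // Hypothetical in-memory store.
std::vector<std::string> g_requested;  // Every name the engine asked for.

// Records each requested name, then serves it from g_store if present.
bool LoggingReader(const char *filename, std::vector<char> *out) {
    assert(filename != nullptr && out != nullptr);
    g_requested.emplace_back(filename);
    auto it = g_store.find(filename);
    if (it == g_store.end()) return false;
    *out = it->second;
    return true;
}

// Returns the requested names that the store could not satisfy.
std::vector<std::string> MissingFiles() {
    std::vector<std::string> missing;
    for (const auto &name : g_requested) {
        if (g_store.count(name) == 0) missing.push_back(name);
    }
    return missing;
}
```

Passing `LoggingReader` to `Init` during development and printing `MissingFiles()` after a failed initialization shows which keys the store still needs.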

### Approach 3: Hybrid Reader with Filesystem Fallback

For scenarios where you embed core language data but allow users to install additional resources on disk, implement a hybrid callback that checks memory first, then falls back to standard file I/O:

```cpp
#include <tesseract/baseapi.h>
#include <fstream>
#include <iterator>  // std::istreambuf_iterator
#include <string>
#include <unordered_map>
#include <vector>

// In-memory store for embedded files
std::unordered_map<std::string, std::vector<char>> mem_store;

// Default filesystem loader
bool DefaultFileReader(const char *filename, std::vector<char> *out) {
    std::ifstream f(filename, std::ios::binary);
    if (!f) return false;
    *out = std::vector<char>((std::istreambuf_iterator<char>(f)),
                             std::istreambuf_iterator<char>());
    return true;
}

// Hybrid reader: memory first, then disk
bool HybridReader(const char *filename, std::vector<char> *out) {
    auto it = mem_store.find(filename);
    if (it != mem_store.end()) {
        *out = it->second;
        return true;
    }
    return DefaultFileReader(filename, out);
}

// Usage: populate mem_store with critical files, then pass HybridReader to api.Init()
```

This pattern minimizes disk access for frequently used languages while maintaining compatibility with standard Tesseract resource installations.

## Key Source Files and Implementation Details

Understanding the internal architecture helps debug callback issues and optimize memory usage. The following files implement the `FileReader` mechanism:

| File | Role | Key Components |
|------|------|----------------|
| [`include/tesseract/baseapi.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h) | Declares the `FileReader` typedef and the in-memory `Init` overload. | `FileReader` function pointer type, `TessBaseAPI::Init` signature |
| [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) | Stores the callback in `reader_` member and passes it to `TessdataManager`. | Lines 332-335: `reader_ = reader;` |
| [`src/ccutil/tessdatamanager.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccutil/tessdatamanager.h) | Defines `TessdataManager` class that orchestrates file loading. | `FileReader` member storage |
| [`src/ccutil/tessdatamanager.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccutil/tessdatamanager.cpp) | Invokes the callback when auxiliary files are requested. | Line 90: `if (reader_) return (*reader_)(filename, data);` |
| [`src/ccutil/serialis.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccutil/serialis.h) / `.cpp` | Provides `TFile` abstraction for reading from memory buffers. | `TFile` class for vector-based I/O |

When tracing execution, start at `TessBaseAPI::Init` in [`baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp), follow the `reader_` assignment, then examine `TessdataManager::Init` and `TessdataManager::GetComponent` in [`tessdatamanager.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccutil/tessdatamanager.cpp) to see where the callback is invoked and how each language component is then retrieved.

## Summary

- **The `FileReader` callback** enables fully in-memory OCR by intercepting all file system requests from Tesseract's `TessdataManager`.
- **Initialization signature** requires passing a function pointer of type `bool (*)(const char*, std::vector<char>*)` as the final parameter to `TessBaseAPI::Init`.
- **Three implementation patterns** cover most use cases: direct buffer initialization without callback (simplest), a custom in-memory map (fully embedded), and hybrid readers with disk fallback (flexible deployment).
- **Critical source locations** include [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) (lines 332-335) for callback storage and [`src/ccutil/tessdatamanager.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccutil/tessdatamanager.cpp) (line 90) for callback invocation.

## Frequently Asked Questions

### Can I use FileReader with multiple language files simultaneously?

Yes. When initializing Tesseract with multiple languages (e.g., "eng+deu"), the engine requests each component file separately through the same callback. Your `FileReader` implementation must handle lookups for all required filenames, including `eng.traineddata`, `deu.traineddata`, and their respective auxiliary files like `eng.unicharset` and `deu.unicharset`.
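
A multi-language store might be sketched as follows. The example assumes the requested name may carry a directory prefix and therefore looks files up by basename; that normalization is an assumption for illustration, not documented Tesseract behavior, and `g_lang_store`, `Basename`, and `MultiLangReader` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical store keyed by bare filename, e.g. "deu.traineddata".
std::unordered_map<std::string, std::vector<char>> g_lang_store;

// Strips any directory prefix: "/data/deu.traineddata" -> "deu.traineddata".
std::string Basename(const std::string &path) {
    const auto pos = path.find_last_of("/\\");
    return pos == std::string::npos ? path : path.substr(pos + 1);
}

// Serves any registered language file, regardless of the requested directory.
bool MultiLangReader(const char *filename, std::vector<char> *out) {
    assert(filename != nullptr && out != nullptr);
    auto it = g_lang_store.find(Basename(filename));
    if (it == g_lang_store.end()) return false;
    *out = it->second;
    return true;
}
```

Keying by basename keeps the store independent of whatever data path the engine was configured with, at the cost of not distinguishing two files that share a name in different directories.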

### What happens if the FileReader callback returns false?

If your callback returns `false` for a file the engine needs, `TessdataManager::Init` propagates the failure upward and `TessBaseAPI::Init` returns a non-zero error code. An application that checks this result, as the examples above do, would then report something like "Could not initialize tesseract." Make sure your callback handles all expected filenames and returns `true` only when the vector holds the complete file contents.

### Is the FileReader approach thread-safe?

Thread safety depends on your callback implementation. A `TessBaseAPI` instance is not thread-safe by design, so each thread requires its own API object. If multiple `TessBaseAPI` instances share the same `FileReader` callback, however, your implementation must handle concurrent access to the underlying memory store. Use immutable data structures or mutex-protected lookups to keep concurrent reads safe.
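
A mutex-protected lookup could look like the sketch below. It assumes the store may be modified while readers are active; if the store is fully populated before any thread starts and never changes afterward, the lock is unnecessary. `g_store_mutex`, `g_shared_store`, `PutFile`, and `ThreadSafeReader` are hypothetical names.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::mutex g_store_mutex;  // Guards every access to g_shared_store.
std::unordered_map<std::string, std::vector<char>> g_shared_store;

// Adds or replaces a file while holding the lock.
void PutFile(const std::string &name, std::vector<char> bytes) {
    std::lock_guard<std::mutex> lock(g_store_mutex);
    g_shared_store[name] = std::move(bytes);
}

// Callback safe to share between API instances running on different threads.
bool ThreadSafeReader(const char *filename, std::vector<char> *out) {
    assert(filename != nullptr && out != nullptr);
    std::lock_guard<std::mutex> lock(g_store_mutex);
    auto it = g_shared_store.find(filename);
    if (it == g_shared_store.end()) return false;
    *out = it->second;  // Copy while the lock is held.
    return true;
}
```

The copy happens inside the critical section so a concurrent `PutFile` cannot replace the bytes halfway through a read.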

### How do I handle large traineddata files in memory?

For large language models, avoid copying buffers unnecessarily. Store files in `std::vector<char>` or memory-mapped regions, and use shared pointers or copy-on-write semantics if multiple API instances need access to the same data. The buffer you pass to `Init` must remain valid for the duration of the call. Tesseract copies the traineddata content internally during `TessdataManager` setup, so you can safely free your buffer after `Init` returns successfully.
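
The shared-pointer idea can be sketched as follows, assuming a store that maps names to reference-counted immutable blobs. Several names or components can then refer to one copy of the bytes; the only remaining copy is the one the callback signature requires when it fills the caller's vector. `SharedBytes`, `g_blob_store`, `RegisterBlob`, and `SharedBlobReader` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Immutable, reference-counted file contents.
using SharedBytes = std::shared_ptr<const std::vector<char>>;

std::unordered_map<std::string, SharedBytes> g_blob_store;

// Moves the bytes into a shared blob and registers it under `name`.
SharedBytes RegisterBlob(const std::string &name, std::vector<char> bytes) {
    auto blob = std::make_shared<const std::vector<char>>(std::move(bytes));
    g_blob_store[name] = blob;
    return blob;
}

// Copies the shared blob into the vector supplied by the engine.
bool SharedBlobReader(const char *filename, std::vector<char> *out) {
    assert(filename != nullptr && out != nullptr);
    auto it = g_blob_store.find(filename);
    if (it == g_blob_store.end()) return false;
    out->assign(it->second->begin(), it->second->end());
    return true;
}
```

Registering the same `SharedBytes` under a second key adds a reference rather than a second copy, which matters when a single traineddata file runs to tens of megabytes.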