Models¶

Pydantic models for text extraction outputs.

Defines output types and format enums for text extraction.

OutputFormat ¶

Bases: str, Enum

Supported text extraction output formats.

Each format has different characteristics:

HTML: Structured with div elements, preserves layout semantics
MARKDOWN: Portable, human-readable, good for documentation
JSON: Structured data with layout information (Dots OCR)
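Because `OutputFormat` derives from both `str` and `Enum`, its members can be compared with plain strings. The following is a minimal sketch of that pattern, not the library's source: the member values (`"html"`, `"markdown"`, `"json"`) are assumptions inferred from the `output_format="markdown"` call shown in the `TextOutput` example.

```python
from enum import Enum


class OutputFormat(str, Enum):
    """Stand-in for the documented enum; the member values are assumed."""

    HTML = "html"
    MARKDOWN = "markdown"
    JSON = "json"


# Look a member up by value, as a caller passing a string would.
fmt = OutputFormat("markdown")

# The lookup returns the singleton member itself.
is_markdown = fmt is OutputFormat.MARKDOWN

# The str mixin makes members compare equal to plain strings.
equals_plain_string = fmt == "markdown"
```

The `str` mixin is what makes a string argument and an enum member interchangeable in equality checks; an unknown value such as `OutputFormat("pdf")` raises `ValueError`.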

TextOutput ¶

Bases: BaseModel

Text extraction output from a document image.

Contains the extracted text content in the requested format, along with optional raw output and plain text versions.

Example

# extractor: an already-configured text extractor; image: the document image
result = extractor.extract(image, output_format="markdown")
print(result.content)  # Clean markdown
print(result.plain_text)  # Plain text without formatting

content_length `property` ¶

content_length: int

Length of the extracted content in characters.

word_count `property` ¶

word_count: int

Approximate word count of the plain text.
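The two properties can be pictured as thin wrappers over the stored text. This sketch uses a standard-library dataclass as a hypothetical stand-in for the real Pydantic model, so that it runs without third-party packages. The whitespace-based word split and the fallback to `content` when `plain_text` is missing are assumptions, not confirmed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextOutputSketch:
    """Hypothetical stand-in for TextOutput (the real class is a Pydantic model)."""

    content: str
    plain_text: Optional[str] = None

    @property
    def content_length(self) -> int:
        # Number of characters in the formatted content.
        return len(self.content)

    @property
    def word_count(self) -> int:
        # Approximate count: split on whitespace.
        # Assumption: fall back to `content` when no plain text is stored.
        text = self.plain_text if self.plain_text is not None else self.content
        return len(text.split())


sample = TextOutputSketch(
    content="# Title\n\nHello world",
    plain_text="Title Hello world",
)
```

Here `sample.content_length` counts the Markdown markup and newlines as characters, while `sample.word_count` is computed from the plain text only.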

LayoutElement ¶

Bases: BaseModel

Single layout element from document layout detection.

Represents a detected region in the document with its bounding box, category label, and extracted text content.

ATTRIBUTE	DESCRIPTION
`bbox`	Bounding box coordinates [x1, y1, x2, y2] (normalized to 0-1024) TYPE: `List[int]`
`category`	Layout category (e.g., "Text", "Title", "Table", "Formula") TYPE: `str`
`text`	Extracted text content (None for pictures) TYPE: `Optional[str]`
`confidence`	Detection confidence score (optional) TYPE: `Optional[float]`
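Since `bbox` values are normalized to a 0-1024 range, they usually need rescaling before they can be drawn on the source image. The helper below is a hypothetical sketch (`bbox_to_pixels` is not part of the documented API) and assumes the normalization is a simple linear scaling applied independently to each axis.

```python
from typing import List, Tuple

NORMALIZED_EXTENT = 1024  # upper bound of the documented bbox coordinate range


def bbox_to_pixels(bbox: List[int], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale an [x1, y1, x2, y2] box from the 0-1024 grid to pixel coordinates.

    Assumes linear, per-axis scaling; the library may normalize differently.
    """
    x1, y1, x2, y2 = bbox
    scale_x = width / NORMALIZED_EXTENT
    scale_y = height / NORMALIZED_EXTENT
    return (
        round(x1 * scale_x),
        round(y1 * scale_y),
        round(x2 * scale_x),
        round(y2 * scale_y),
    )


# A box covering the left half and lower three quarters of a 2048x1024 image.
pixel_box = bbox_to_pixels([0, 256, 512, 1024], width=2048, height=1024)
```

If the extractor resizes or pads the image before detection, the mapping back to the original pixels would need to account for that as well.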

DotsOCRTextOutput ¶

Bases: BaseModel

Text extraction output from Dots OCR with layout information.

Dots OCR provides structured output with:

Layout detection (11 categories)
Bounding boxes (normalized to 0-1024)
Multi-format text (Markdown/LaTeX/HTML)
Reading order preservation

Layout Categories

Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title

Text Formatting

Text/Title/Section-header: Markdown
Formula: LaTeX
Table: HTML
Picture: (text omitted)

Example

from omnidocs.tasks.text_extraction import DotsOCRTextExtractor
# extractor: a configured DotsOCRTextExtractor instance (construction not shown)
result = extractor.extract(image, include_layout=True)
print(result.content)  # Full text with formatting
for elem in result.layout:
    print(f"{elem.category}: {elem.bbox}")
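Because each category carries text in a different format (Markdown, LaTeX, or HTML), it can be convenient to group element text by category before post-processing. The sketch below uses a dataclass stand-in that mirrors the documented `LayoutElement` attributes. `LayoutElementSketch`, `text_by_category`, and the sample data are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LayoutElementSketch:
    """Hypothetical stand-in mirroring the documented LayoutElement attributes."""

    bbox: List[int]
    category: str
    text: Optional[str] = None


def text_by_category(elements: List[LayoutElementSketch]) -> Dict[str, List[str]]:
    """Group element text by category, skipping elements that have no text."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for elem in elements:
        if elem.text is not None:  # Picture elements carry no text
            grouped[elem.category].append(elem.text)
    return dict(grouped)


# Illustrative data only; real elements come from `result.layout`.
elements = [
    LayoutElementSketch([64, 32, 960, 96], "Title", "# Report"),
    LayoutElementSketch([64, 128, 960, 300], "Formula", "E = mc^2"),
    LayoutElementSketch([64, 320, 960, 700], "Picture"),
    LayoutElementSketch([64, 720, 960, 900], "Table", "<table><tr><td>1</td></tr></table>"),
]
grouped = text_by_category(elements)
```

With this grouping, `grouped["Table"]` holds HTML fragments and `grouped["Formula"]` holds LaTeX strings, so each list can be passed to a renderer suited to its format.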

num_layout_elements `property` ¶

num_layout_elements: int

Number of detected layout elements.

content_length `property` ¶

content_length: int

Length of extracted content in characters.
