Purpose

Coradoc is a hub-and-spoke document transformation library for Ruby. It provides a canonical CoreModel that serves as the transformation hub: every format is converted to and from the CoreModel, so AsciiDoc, HTML, Markdown, DOCX, and other formats can be converted to one another without a dedicated converter for each pair.

Features

Architecture

Hub-and-Spoke Model

Coradoc uses a hub-and-spoke architecture where all format transformations go through a canonical CoreModel:

                    ┌─────────────────────────────────────┐
                    │           Source Formats            │
                    │                                     │
    ┌─────────┐     │  ┌─────────┐  ┌─────────┐  ┌──────┐ │     ┌─────────┐
    │AsciiDoc │────►│  │ ToCore  │  │  From  │  │ HTML │────►│   HTML  │
    │  .adoc  │     │  │         │  │  Core  │  │Render│ │     │  .html  │
    └─────────┘     │  │         │  │         │  │      │ │     └─────────┘
                    │  │         ▼  ▼         │  │      │ │
    ┌─────────┐     │  │      ┌─────────┐     │  │      │ │     ┌─────────┐
    │Markdown │────►│  │      │CoreModel│     │  │      │────►│Markdown │
    │   .md   │     │  │      │   Hub   │     │  │      │ │     │   .md   │
    └─────────┘     │  │      └─────────┘     │  │      │ │     └─────────┘
                    │  │                     │  │      │ │
    ┌─────────┐     │  │                     │  │      │ │     ┌─────────┐
    │  HTML   │────►│  │                     │  │      │────►│ AsciiDoc│
    │  .html  │     │  └─────────────────────┘  └──────┘ │     │  .adoc  │
    └─────────┘     │                                      │     └─────────┘
                    └──────────────────────────────────────┘
    ┌─────────┐
    │  DOCX   │────► ToCoreModel (via Uniword) ──► CoreModel Hub
    │  .docx  │
    └─────────┘

This architecture means:

  - Adding a new format only requires two transformers (ToCoreModel, FromCoreModel)

  - N formats can interoperate with just 2N transformers (not N*(N-1))

  - The CoreModel provides a canonical, well-defined structure
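To make the transformer count concrete, the short calculation below compares the two approaches for a few values of N. It is plain Ruby arithmetic and not part of Coradoc; the method names `hub_transformers` and `pairwise_transformers` are made up for this illustration.

```ruby
# Transformers needed when every format talks only to the hub:
# one ToCoreModel and one FromCoreModel per format.
def hub_transformers(n)
  2 * n
end

# One-way converters needed when every format converts
# directly to every other format.
def pairwise_transformers(n)
  n * (n - 1)
end

[3, 4, 6, 10].each do |n|
  puts "#{n} formats: hub=#{hub_transformers(n)}, pairwise=#{pairwise_transformers(n)}"
end
```

With three formats the two approaches tie at six transformers; the hub pulls ahead from four formats onward (8 versus 12).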

CoreModel

The CoreModel (Coradoc::CoreModel) is the canonical representation of documents:

StructuralElement

Document structure (document, section)

Block

Content blocks (paragraph, code, quote)

ListBlock

Lists (ordered, unordered, definition)

InlineElement

Inline formatting (bold, italic, link)

Table

Tables with rows and cells

Image

Images with alt text
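The idea of a single canonical model can be sketched without Coradoc at all. In the toy example below, every reader produces the same hash shape and every writer consumes it, so any reader can be paired with any writer. The `READERS` and `WRITERS` tables, the hash layout, and `toy_convert` are hypothetical stand-ins for illustration only; the real CoreModel uses the classes listed above.

```ruby
# Toy hub: readers turn text into a shared hash, writers turn that hash into text.
# Only a single title line followed by a body is handled.
READERS = {
  markdown: ->(text) {
    title, body = text.split("\n\n", 2)
    { title: title.sub(/\A#\s*/, ""), body: body.to_s }
  },
  asciidoc: ->(text) {
    title, body = text.split("\n\n", 2)
    { title: title.sub(/\A=\s*/, ""), body: body.to_s }
  }
}.freeze

WRITERS = {
  markdown: ->(core) { "# #{core[:title]}\n\n#{core[:body]}" },
  asciidoc: ->(core) { "= #{core[:title]}\n\n#{core[:body]}" },
  html:     ->(core) { "<h1>#{core[:title]}</h1>\n<p>#{core[:body]}</p>" }
}.freeze

# Any reader can feed any writer because both agree on the hub shape.
def toy_convert(text, from:, to:)
  WRITERS.fetch(to).call(READERS.fetch(from).call(text))
end

toy_convert("# Title\n\nParagraph", from: :markdown, to: :html)
# => "<h1>Title</h1>\n<p>Paragraph</p>"
toy_convert("# Title\n\nParagraph", from: :markdown, to: :asciidoc)
# => "= Title\n\nParagraph"
```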

Installation

Add this line to your application’s Gemfile:

gem 'coradoc'

# For DOCX support, also add:
gem 'coradoc-docx'
gem 'uniword'        # DOCX reader

And then execute:

bundle install

Or install it yourself as:

gem install coradoc

Quick Start

Using the Developer API

require 'coradoc'
require 'coradoc/html'
require 'coradoc/markdown'

# Convert Markdown to HTML
html = Coradoc.convert("# Title\n\nParagraph", from: :markdown, to: :html)

# Parse to CoreModel
core = Coradoc.parse("# Title\n\nParagraph", format: :markdown)

# Serialize to any format
markdown = Coradoc.serialize(core, to: :markdown)
html = Coradoc.serialize(core, to: :html)

DOCX Conversion

Convert Word documents to AsciiDoc or Markdown:

require 'coradoc'
require 'coradoc/docx'

# Convert DOCX to AsciiDoc
adoc = Coradoc.convert("input.docx", from: :docx, to: :asciidoc)

# Convert DOCX to Markdown
md = Coradoc.convert("input.docx", from: :docx, to: :markdown)

# Parse DOCX to CoreModel for manipulation
core = Coradoc.parse("input.docx", format: :docx)
core.title         # => "Document Title"
core.children      # => Array of sections, paragraphs, tables, etc.

# Serialize to different formats
adoc = Coradoc.serialize(core, to: :asciidoc)
md  = Coradoc.serialize(core, to: :markdown)

Using the CLI

# Convert Markdown to HTML
coradoc convert document.md -o output.html

# Convert with the source format auto-detected from the file extension
coradoc convert document.adoc --to html

# List supported formats
coradoc formats

# Show version
coradoc version

Developer API

Format Conversion

The Coradoc.convert method handles the complete transformation pipeline:

# AsciiDoc to HTML
html = Coradoc.convert(adoc_text, from: :asciidoc, to: :html)

# Markdown to HTML
html = Coradoc.convert(md_text, from: :markdown, to: :html)

# HTML to Markdown
md = Coradoc.convert(html_text, from: :html, to: :markdown)

# DOCX to AsciiDoc (requires coradoc-docx gem)
adoc = Coradoc.convert("document.docx", from: :docx, to: :asciidoc)

# DOCX to Markdown
md = Coradoc.convert("document.docx", from: :docx, to: :markdown)

Parsing

Parse documents to CoreModel for manipulation:

# Parse Markdown
core = Coradoc.parse("# Title\n\nContent", format: :markdown)

# Access the structure
core.element_type  # => "document"
core.title         # => "Title"
core.children      # => Array of child elements

Serialization

Serialize CoreModel to any supported format:

# Create or modify CoreModel
core = Coradoc::CoreModel::StructuralElement.new(
  element_type: "document",
  title: "My Document",
  children: [...]
)

# Serialize to HTML
html = Coradoc.serialize(core, to: :html)

# Serialize to Markdown
md = Coradoc.serialize(core, to: :markdown)

Transform Models

Transform between a format-specific model and the CoreModel:

# Parse Markdown to its native model
md_doc = Coradoc::Markdown.parse("# Title\n\nContent")

# Transform to CoreModel
core = Coradoc.to_core(md_doc)

# Transform back to Markdown model
md_doc2 = Coradoc::Markdown.from_core_model(core)

CLI

The coradoc command-line tool provides quick conversions:

# Basic conversion
coradoc convert input.md -o output.html

# Specify formats explicitly
coradoc convert input.md --from markdown --to html

# Convert DOCX to AsciiDoc (requires coradoc-docx gem)
coradoc convert document.docx -o output.adoc

# Convert DOCX to Markdown
coradoc convert document.docx -o output.md

# Use different HTML themes
coradoc convert input.md -o output.html --theme modern

# Verbose output
coradoc convert input.md -o output.html --verbose

# Show supported formats
coradoc formats

CLI Options

--to, -t FORMAT

Target format (html, md, adoc)

--from, -f FORMAT

Source format (auto-detected from extension)

--output, -o FILE

Output file (default: stdout)

--theme THEME

HTML theme (classic, modern)

--verbose

Enable verbose output

Extensibility

Adding a New Format

To add a new format, create a gem that:

  1. Defines a format module with parse/serialize methods

  2. Provides a ToCoreModel transformer that converts the native model to CoreModel

  3. Provides a FromCoreModel transformer that converts CoreModel to the native model

  4. Registers the format with Coradoc

# lib/coradoc/my_format.rb
module Coradoc
  module MyFormat
    # Parse input to native model
    def self.parse(content)
      # ...
    end

    # Parse directly to CoreModel
    def self.parse_to_core(content)
      Transform::ToCoreModel.transform(parse(content))
    end

    # Transform native model to CoreModel
    def self.to_core(model)
      Transform::ToCoreModel.transform(model)
    end

    # Transform CoreModel to native model
    def self.from_core(core)
      Transform::FromCoreModel.transform(core)
    end

    # Serialize CoreModel to output
    def self.serialize(core, **options)
      model = from_core(core)
      # serialize_native is this format's own writer (not shown here)
      serialize_native(model)
    end
  end
end

# Register the format
Coradoc.register_format(:my_format, Coradoc::MyFormat,
                        extensions: ['.myf', '.myformat'])
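What a registration call of this shape might do can be sketched with a small registry class. The `FormatRegistry` class below, its `Entry` struct, and the `format_for_extension` method are hypothetical; they show one plausible way to map format names and file extensions to a handler module, not how Coradoc's registry is implemented.

```ruby
# Toy format registry: maps a format name to its handler module and extensions.
class FormatRegistry
  Entry = Struct.new(:name, :handler, :extensions)

  def initialize
    @entries = {}
  end

  def register(name, handler, extensions: [])
    @entries[name] = Entry.new(name, handler, extensions)
  end

  # Look up a handler by format name; unknown names raise ArgumentError.
  def fetch(name)
    entry = @entries.fetch(name) { raise ArgumentError, "unknown format: #{name}" }
    entry.handler
  end

  # Find the format name registered for a file extension, or nil.
  def format_for_extension(ext)
    entry = @entries.values.find { |e| e.extensions.include?(ext) }
    entry && entry.name
  end
end

# Stand-in for a format module such as Coradoc::MyFormat.
module MyFormat; end

registry = FormatRegistry.new
registry.register(:my_format, MyFormat, extensions: [".myf", ".myformat"])
registry.fetch(:my_format)             # => MyFormat
registry.format_for_extension(".myf")  # => :my_format
```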

Plugin Lifecycle Hooks

Hook into the transformation pipeline:

# Register hooks via options
Coradoc.convert(text, from: :markdown, to: :html,
  before_parse: ->(content) { content.upcase },
  after_transform: ->(core) { process(core) }
)
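The sketch below shows where two such hooks could sit in a conversion pipeline. It assumes that each hook's return value replaces the value passed to it, which this document does not confirm. `parse_stub`, `render_stub`, and `convert_with_hooks` are hypothetical helpers standing in for real parsing and serialization.

```ruby
# Stand-in parser: wraps the text in a minimal hash "model".
def parse_stub(text)
  { type: "document", text: text }
end

# Stand-in serializer: renders the model as a single HTML paragraph.
def render_stub(core)
  "<p>#{core[:text]}</p>"
end

# Toy pipeline: before_parse sees the raw input, after_transform sees the model.
def convert_with_hooks(text, before_parse: nil, after_transform: nil)
  text = before_parse.call(text) if before_parse
  core = parse_stub(text)
  core = after_transform.call(core) if after_transform
  render_stub(core)
end

convert_with_hooks(
  "hello",
  before_parse: ->(content) { content.upcase },
  after_transform: ->(core) { core.merge(text: core[:text] + "!") }
)
# => "<p>HELLO!</p>"
```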

CoreModel Reference

Structural Elements

# Document
doc = Coradoc::CoreModel::StructuralElement.new(
  element_type: "document",
  title: "My Document",
  children: [...]
)

# Section
section = Coradoc::CoreModel::StructuralElement.new(
  element_type: "section",
  level: 1,
  title: "Section Title",
  children: [...]
)

Block Elements

# Paragraph
para = Coradoc::CoreModel::Block.new(
  element_type: "paragraph",
  content: "Paragraph text"
)

# Code block
code = Coradoc::CoreModel::Block.new(
  element_type: "block",
  delimiter_type: "----",
  content: "def hello; puts 'world'; end",
  language: "ruby"
)

Lists

# Unordered list
list = Coradoc::CoreModel::ListBlock.new(
  marker_type: "unordered",
  items: [
    Coradoc::CoreModel::ListItem.new(content: "Item 1", marker: "*"),
    Coradoc::CoreModel::ListItem.new(content: "Item 2", marker: "*"),
  ]
)

# Definition list
def_list = Coradoc::CoreModel::DefinitionList.new(
  items: [
    Coradoc::CoreModel::DefinitionItem.new(
      term: "API",
      definitions: ["Application Programming Interface"]
    ),
  ]
)

Inline Elements

# Bold
bold = Coradoc::CoreModel::InlineElement.new(
  format_type: "bold",
  content: "bold text"
)

# Link
link = Coradoc::CoreModel::InlineElement.new(
  format_type: "link",
  target: "https://example.com",
  content: "Example"
)

# STEM formula
stem = Coradoc::CoreModel::InlineElement.new(
  format_type: "stem",
  content: "E = mc^2",
  stem_type: "stem"
)

Supported inline format types:

bold

Bold text

italic

Italic/emphasized text

monospace

Code/monospace text

link

Hyperlinks

xref

Cross-references

stem

STEM formulas (mathematical notation)

footnote

Footnotes

term

Term references (glossary terms)

superscript

Superscript text

subscript

Subscript text

Query API

Query documents using CSS-like selectors:

# Parse document
doc = Coradoc.parse(adoc_text, format: :asciidoc)

# Find all sections
sections = doc.query('section')

# Find level-2 sections
doc.query('section.level-2').each do |section|
  puts section.title
end

# Find elements with a specific role
examples = doc.query('[role=example]')

# Complex selectors with pseudo-classes
doc.query('section > paragraph:first-child')

# Query within a specific element
doc.query_within(section, 'paragraph')

# Chain queries
doc.query('section').filter('.important').first

Selector Syntax

element

Element type (section, paragraph, table)

#id

ID selector

.class

Class/role selector

[attr=value]

Attribute selector

:first-child

Pseudo-class selectors

>

Child combinator

(space)

Descendant combinator
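To show how the simple selectors in this table can be matched against a document tree, here is a minimal sketch over a toy node structure. It handles only a single element, #id, .class, or [attr=value] selector (no combinators or pseudo-classes). The `Node` struct and the `node`, `matches?`, and `query` helpers are hypothetical and unrelated to Coradoc's query engine.

```ruby
# Toy tree node; roles play the part of CSS classes.
Node = Struct.new(:type, :id, :roles, :attrs, :children, keyword_init: true)

# Convenience constructor with empty defaults.
def node(type, id: nil, roles: [], attrs: {}, children: [])
  Node.new(type: type, id: id, roles: roles, attrs: attrs, children: children)
end

# Match one simple selector against one node.
def matches?(node, selector)
  case selector
  when /\A#(.+)\z/          then node.id == $1
  when /\A\.(.+)\z/         then node.roles.include?($1)
  when /\A\[(\w+)=(.+)\]\z/ then node.attrs[$1] == $2
  else                           node.type == selector
  end
end

# Depth-first search over all descendants of root, in document order.
def query(root, selector)
  root.children.flat_map do |child|
    hits = matches?(child, selector) ? [child] : []
    hits + query(child, selector)
  end
end

doc = node("document", children: [
  node("section", id: "intro", roles: ["important"], children: [
    node("paragraph", attrs: { "role" => "example" })
  ]),
  node("section", id: "body")
])

query(doc, "section").map(&:id)           # => ["intro", "body"]
query(doc, ".important").map(&:id)        # => ["intro"]
query(doc, "[role=example]").map(&:type)  # => ["paragraph"]
```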

Validation Framework

Validate documents against schemas:

# Define a validation schema
schema = Coradoc::Validation::Schema.define do
  required :title, type: String, min_length: 1
  required :sections, type: Array, min_count: 1
  optional :author, type: String

  rule :check_references do |doc|
    refs = doc.query('xref')
    missing = refs.reject { |r| doc.resolve_reference(r) }
    missing.map { |r| "Unresolved reference: #{r.target}" }
  end
end

# Validate a document
result = schema.validate(document)

if result.valid?
  puts "Document is valid"
else
  result.errors.each { |e| puts "#{e.path}: #{e.message}" }
end

Built-in Validation Rules

required

Field must be present

type

Field must be of the specified type

min_length/max_length

String/collection length bounds

min_count/max_count

Collection count bounds

format

Match against regex pattern

rule

Custom validation block
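As an illustration of how rules like these can be evaluated, the sketch below checks three of them (required, type, and min_length) against a plain hash. The `Rule` struct, the `RULES` list, and the `validate` method are hypothetical and much simpler than a schema DSL; they only show the order in which such checks might run.

```ruby
# One rule per field: presence is always required, type and min_length are optional.
Rule = Struct.new(:field, :type, :min_length)

# Return a list of error messages; an empty list means the document is valid.
def validate(doc, rules)
  rules.each_with_object([]) do |rule, errors|
    value = doc[rule.field]
    if value.nil?
      errors << "#{rule.field}: is required"
    elsif rule.type && !value.is_a?(rule.type)
      errors << "#{rule.field}: must be of type #{rule.type}"
    elsif rule.min_length && value.length < rule.min_length
      errors << "#{rule.field}: is shorter than #{rule.min_length}"
    end
  end
end

RULES = [
  Rule.new(:title, String, 1),
  Rule.new(:sections, Array, 1)
].freeze

validate({ title: "Guide", sections: ["Intro"] }, RULES)
# => []
validate({ title: "", sections: "none" }, RULES)
# => ["title: is shorter than 1", "sections: must be of type Array"]
```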

Streaming Processor

Process large documents without loading everything into memory:

# Stream parse large file
Coradoc::Streaming.parse_large_file("large.adoc", format: :asciidoc,
                                    chunk_size: 100) do |chunk|
  chunk.each { |element| process_element(element) }
end

# Transform in chunks
results = Coradoc::Streaming.transform_in_chunks(elements, chunk_size: 50) do |chunk|
  chunk.map { |el| transform_element(el) }
end

# Incremental serialization
File.open("output.html", "w") do |file|
  Coradoc::Streaming.serialize_incremental(document, format: :html) do |chunk|
    file.write(chunk)
  end
end

# Process with memory constraints
progress = Coradoc::Streaming.process_with_memory_limit(
  "input.adoc", "output.html",
  format: :asciidoc, output_format: :html,
  max_memory: 50 * 1024 * 1024  # 50MB
)
puts progress.to_s  # "100 processed (100.0%) at 10.0/sec ~0.5min remaining"

Streaming Features

ChunkProcessor

Batch operations with configurable chunk size

Progress

Track progress, rate, estimated time remaining

MemoryMonitor

Monitor memory usage during processing

StreamReader/StreamWriter

File I/O streaming
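The chunking idea behind these components can be shown with Ruby's built-in `Enumerable#each_slice`, which yields fixed-size batches one at a time. The `process_in_chunks` helper below is a hypothetical sketch, not the Coradoc::Streaming API.

```ruby
# Yield the elements in batches of chunk_size and return how many were processed.
def process_in_chunks(elements, chunk_size:)
  processed = 0
  elements.each_slice(chunk_size) do |chunk|
    yield chunk
    processed += chunk.size
  end
  processed
end

sizes = []
total = process_in_chunks(1..250, chunk_size: 100) { |chunk| sizes << chunk.size }
sizes  # => [100, 100, 50]
total  # => 250
```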

Lazy Evaluation

Memory-efficient processing using lazy enumerators and on-demand evaluation:

# Wrap document for lazy iteration
wrapper = Coradoc::Lazy.wrap(document)
wrapper.each_section do |section|
  process_section(section)  # Processed on-demand
end

# Lazy transformation pipeline
result = Coradoc::Lazy.transform(sections) do |p|
  p.map { |s| transform_section(s) }
   .select { |s| s.visible? }
   .take(10)
end.to_a  # Only evaluates when to_a is called

# Process in batches
wrapper.each_batch(10) do |batch|
  batch.each { |section| process(section) }
end

# Lazy reference resolution
resolver = Coradoc::Lazy.resolver(document, loader: ->(ref, _) {
  load_include_file(ref)
})
content = resolver.resolve("include::chapter1.adoc[]")

Lazy Evaluation Features

DocumentWrapper

Lazy iteration over document sections

TransformationPipeline

Chain lazy transformations without evaluation

ReferenceResolver

On-demand loading of includes/references

ChunkProcessor

Process large content in memory-safe chunks
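On-demand evaluation of this kind can be observed with `Enumerator::Lazy` from Ruby's core library. In the sketch below, `map` and `select` are deferred and `first` pulls only as many elements as it needs. The `first_visible` helper and the string "sections" are invented for the demonstration and are not part of Coradoc::Lazy.

```ruby
# Return the first `limit` transformed items that pass the filter.
# `log` records which inputs were actually evaluated.
def first_visible(sections, limit, log)
  sections.lazy
          .map    { |s| log << s; s.upcase }
          .select { |s| s.start_with?("S") }
          .first(limit)
end

evaluated = []
result = first_visible(%w[s1 x2 s3 s4 s5 s6], 3, evaluated)
result     # => ["S1", "S3", "S4"]
evaluated  # => ["s1", "x2", "s3", "s4"]
```

The last two inputs are never touched: evaluation stops as soon as three matching items have been produced.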

Development

Running Tests

# Run all tests
bundle exec rspec

# Run specific test file
bundle exec rspec spec/coradoc/developer_experience_spec.rb

# Run with documentation
bundle exec rspec --format documentation

Running Linting

bundle exec rubocop

Project Structure

coradoc/
├── lib/
│   └── coradoc/
│       ├── coradoc.rb      # Main API (parse, convert, serialize)
│       ├── registry.rb     # Format registry
│       ├── core_model/     # CoreModel classes
│       ├── transform/      # Base transformer
│       ├── query.rb        # Document query API
│       ├── validation.rb   # Document validation
│       ├── streaming.rb    # Large document processing
│       ├── hooks.rb        # Plugin lifecycle hooks
│       ├── extensions.rb   # Custom element extensions
│       └── cli.rb          # CLI implementation
├── coradoc-adoc/           # AsciiDoc format gem
├── coradoc-docx/           # DOCX format gem (OOXML → CoreModel via Uniword)
├── coradoc-html/           # HTML format gem
├── coradoc-markdown/       # Markdown format gem
├── spec/                   # Test files
└── exe/
    └── coradoc             # CLI executable

Contributing

  1. Fork the repository

  2. Create your feature branch (git checkout -b feature/amazing-feature)

  3. Commit your changes (git commit -am 'Add amazing feature')

  4. Push to the branch (git push origin feature/amazing-feature)

  5. Open a Pull Request

License

Copyright 2024-2026 Ribose Inc.

Licensed under the Apache License, Version 2.0.