Purpose
Coradoc is a hub-and-spoke document transformation library for Ruby. It provides a canonical CoreModel that serves as the transformation hub, so AsciiDoc, HTML, Markdown, DOCX, and other supported formats convert to one another through a single intermediate model.
Features
- Hub-and-Spoke Architecture - CoreModel as canonical representation
- Format Conversion - Convert between any supported formats
- Developer API - Simple, intuitive Ruby API
- Command-Line Interface - CLI for quick conversions
- Extensibility - Add new formats with minimal code
- Query API - CSS-like selectors for document querying
- Validation Framework - Schema-based document validation
- Streaming Processor - Process large documents efficiently
- Lazy Evaluation - Memory-efficient lazy document processing
Architecture
Hub-and-Spoke Model
Coradoc uses a hub-and-spoke architecture where all format transformations go through a canonical CoreModel:
AsciiDoc (.adoc) ──┐                                                        ┌──► HTML (.html)
Markdown (.md)   ──┼──► ToCoreModel ──► CoreModel (hub) ──► FromCoreModel ──┼──► Markdown (.md)
HTML (.html)     ──┘                                                        └──► AsciiDoc (.adoc)
DOCX (.docx)     ──► ToCoreModel (via Uniword) ──► CoreModel (hub)
This architecture means:
- Adding a new format only requires two transformers (ToCoreModel, FromCoreModel)
- N formats can interoperate with just 2N transformers (not N*(N-1))
- The CoreModel provides a canonical, well-defined structure
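The transformer counts above can be checked with a few lines of arithmetic. This is only an illustration of the 2N versus N*(N-1) comparison; the method names are made up for this example and are not part of Coradoc.

```ruby
# Transformers needed when every format talks only to the hub:
# one ToCoreModel and one FromCoreModel per format.
def hub_and_spoke_transformers(format_count)
  2 * format_count
end

# Transformers needed when every format converts directly to every
# other format (one per ordered pair of formats).
def pairwise_transformers(format_count)
  format_count * (format_count - 1)
end

[3, 4, 6, 10].each do |n|
  puts "#{n} formats: hub=#{hub_and_spoke_transformers(n)}, " \
       "pairwise=#{pairwise_transformers(n)}"
end
```

The two counts tie at three formats (six transformers each); from four formats onward the hub needs fewer, and the gap grows quadratically.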
CoreModel
The CoreModel (Coradoc::CoreModel) is the canonical representation of documents:
StructuralElement | Document structure (document, section)
Block | Content blocks (paragraph, code, quote)
ListBlock | Lists (ordered, unordered, definition)
InlineElement | Inline formatting (bold, italic, link)
Table | Tables with rows and cells
Image | Images with alt text
Installation
Add this line to your application’s Gemfile:
gem 'coradoc'
# For DOCX support, also add:
gem 'coradoc-docx'
gem 'uniword' # DOCX reader
And then execute:
bundle install
Or install it yourself as:
gem install coradoc
Quick Start
Using the Developer API
require 'coradoc'
require 'coradoc/html'
require 'coradoc/markdown'
# Convert Markdown to HTML
html = Coradoc.convert("# Title\n\nParagraph", from: :markdown, to: :html)
# Parse to CoreModel
core = Coradoc.parse("# Title\n\nParagraph", format: :markdown)
# Serialize to any format
markdown = Coradoc.serialize(core, to: :markdown)
html = Coradoc.serialize(core, to: :html)
DOCX Conversion
Convert Word documents to AsciiDoc or Markdown:
require 'coradoc'
require 'coradoc/docx'
# Convert DOCX to AsciiDoc
adoc = Coradoc.convert("input.docx", from: :docx, to: :asciidoc)
# Convert DOCX to Markdown
md = Coradoc.convert("input.docx", from: :docx, to: :markdown)
# Parse DOCX to CoreModel for manipulation
core = Coradoc.parse("input.docx", format: :docx)
core.title # => "Document Title"
core.children # => Array of sections, paragraphs, tables, etc.
# Serialize to different formats
adoc = Coradoc.serialize(core, to: :asciidoc)
md = Coradoc.serialize(core, to: :markdown)
Using the CLI
# Convert Markdown to HTML
coradoc convert document.md -o output.html
# Convert with auto-detection
coradoc convert document.adoc --to html
# List supported formats
coradoc formats
# Show version
coradoc version
Developer API
Format Conversion
The Coradoc.convert method handles the complete transformation pipeline:
# AsciiDoc to HTML
html = Coradoc.convert(adoc_text, from: :asciidoc, to: :html)
# Markdown to HTML
html = Coradoc.convert(md_text, from: :markdown, to: :html)
# HTML to Markdown
md = Coradoc.convert(html_text, from: :html, to: :markdown)
# DOCX to AsciiDoc (requires coradoc-docx gem)
adoc = Coradoc.convert("document.docx", from: :docx, to: :asciidoc)
# DOCX to Markdown
md = Coradoc.convert("document.docx", from: :docx, to: :markdown)
Parsing
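Parse documents to CoreModel for manipulation with Coradoc.parse(content, format: ...), as in the Quick Start examples; the parsed document exposes accessors such as title and children.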
Serialization
Serialize CoreModel to any supported format:
# Create or modify CoreModel
core = Coradoc::CoreModel::StructuralElement.new(
  element_type: "document",
  title: "My Document",
  children: [...]
)
# Serialize to HTML
html = Coradoc.serialize(core, to: :html)
# Serialize to Markdown
md = Coradoc.serialize(core, to: :markdown)
CLI
The coradoc command-line tool provides quick conversions:
# Basic conversion
coradoc convert input.md -o output.html
# Specify formats explicitly
coradoc convert input.md --from markdown --to html
# Convert DOCX to AsciiDoc (requires coradoc-docx gem)
coradoc convert document.docx -o output.adoc
# Convert DOCX to Markdown
coradoc convert document.docx -o output.md
# Use different HTML themes
coradoc convert input.md -o output.html --theme modern
# Verbose output
coradoc convert input.md -o output.html --verbose
# Show supported formats
coradoc formats
CLI Options
--to, -t FORMAT | Target format (html, md, adoc)
--from, -f FORMAT | Source format (auto-detected from extension)
--output, -o FILE | Output file (default: stdout)
--theme THEME | HTML theme (classic, modern)
--verbose | Enable verbose output
Extensibility
Adding a New Format
To add a new format, create a gem with:
- Format module with parse/serialize methods
- ToCoreModel transformer - converts native model to CoreModel
- FromCoreModel transformer - converts CoreModel to native model
- Register with Coradoc
# lib/coradoc/my_format.rb
module Coradoc
  module MyFormat
    # Parse input to native model
    def self.parse(content)
      # ...
    end
    # Parse directly to CoreModel
    def self.parse_to_core(content)
      Transform::ToCoreModel.transform(parse(content))
    end
    # Transform native model to CoreModel
    def self.to_core(model)
      Transform::ToCoreModel.transform(model)
    end
    # Transform CoreModel to native model
    def self.from_core(core)
      Transform::FromCoreModel.transform(core)
    end
    # Serialize CoreModel to output
    def self.serialize(core, **)
      model = from_core(core)
      serialize_native(model)
    end
  end
end
# Register the format
Coradoc.register_format(:my_format, Coradoc::MyFormat,
                        extensions: ['.myf', '.myformat'])
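To make the registration flow concrete, here is a self-contained toy version of a hub-and-spoke registry in plain Ruby. It is a sketch of the general pattern only, not Coradoc's implementation: ToyHub, ToyDoc, PlainFormat and HeadingFormat are hypothetical names invented for this example, and the real registry may work differently.

```ruby
# Toy canonical model: a title plus a list of paragraph strings.
ToyDoc = Struct.new(:title, :paragraphs)

# Toy hub: stores format modules by name and routes every conversion
# through the canonical ToyDoc.
module ToyHub
  @formats = {}

  def self.register_format(name, mod, extensions: [])
    @formats[name] = { mod: mod, extensions: extensions }
  end

  def self.convert(content, from:, to:)
    core = fetch(from).parse_to_core(content)
    fetch(to).serialize(core)
  end

  # Guess a registered format name from a file extension (nil if unknown).
  def self.detect(filename)
    ext = File.extname(filename)
    found = @formats.find { |_name, entry| entry[:extensions].include?(ext) }
    found&.first
  end

  def self.fetch(name)
    entry = @formats.fetch(name) { raise ArgumentError, "unknown format: #{name}" }
    entry[:mod]
  end
end

# Spoke 1: first chunk is the title, remaining chunks are paragraphs.
module PlainFormat
  def self.parse_to_core(content)
    title, *paragraphs = content.split(/\n{2,}/).map(&:strip)
    ToyDoc.new(title, paragraphs)
  end

  def self.serialize(core, **_options)
    [core.title, *core.paragraphs].join("\n\n")
  end
end

# Spoke 2: same layout, but the title is written as a "# " heading.
module HeadingFormat
  def self.parse_to_core(content)
    title, *paragraphs = content.split(/\n{2,}/).map(&:strip)
    ToyDoc.new(title.sub(/\A#\s*/, ""), paragraphs)
  end

  def self.serialize(core, **_options)
    ["# #{core.title}", *core.paragraphs].join("\n\n")
  end
end

ToyHub.register_format(:plain, PlainFormat, extensions: [".txt"])
ToyHub.register_format(:heading, HeadingFormat, extensions: [".md"])

puts ToyHub.convert("Title\n\nParagraph", from: :plain, to: :heading)
```

Neither spoke knows about the other: each one only maps to and from ToyDoc, which is the property that keeps the transformer count at two per format.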
CoreModel Reference
Structural Elements
# Document
doc = Coradoc::CoreModel::StructuralElement.new(
  element_type: "document",
  title: "My Document",
  children: [...]
)
# Section
section = Coradoc::CoreModel::StructuralElement.new(
  element_type: "section",
  level: 1,
  title: "Section Title",
  children: [...]
)
Block Elements
Lists
# Unordered list
list = Coradoc::CoreModel::ListBlock.new(
  marker_type: "unordered",
  items: [
    Coradoc::CoreModel::ListItem.new(content: "Item 1", marker: "*"),
    Coradoc::CoreModel::ListItem.new(content: "Item 2", marker: "*"),
  ]
)
# Definition list
def_list = Coradoc::CoreModel::DefinitionList.new(
  items: [
    Coradoc::CoreModel::DefinitionItem.new(
      term: "API",
      definitions: ["Application Programming Interface"]
    ),
  ]
)
Inline Elements
# Bold
bold = Coradoc::CoreModel::InlineElement.new(
  format_type: "bold",
  content: "bold text"
)
# Link
link = Coradoc::CoreModel::InlineElement.new(
  format_type: "link",
  target: "https://example.com",
  content: "Example"
)
# STEM formula
stem = Coradoc::CoreModel::InlineElement.new(
  format_type: "stem",
  content: "E = mc^2",
  stem_type: "stem"
)
Supported inline format types:
bold | Bold text
italic | Italic/emphasized text
monospace | Code/monospace text
link | Hyperlinks
xref | Cross-references
stem | STEM formulas (mathematical notation)
footnote | Footnotes
term | Term references (glossary terms)
superscript | Superscript text
subscript | Subscript text
Query API
Query documents using CSS-like selectors:
# Parse document
doc = Coradoc.parse(adoc_text, format: :asciidoc)
# Find all sections
sections = doc.query('section')
# Find level-2 sections
doc.query('section.level-2').each do |section|
  puts section.title
end
# Find paragraphs with specific role
examples = doc.query('[role=example]')
# Complex selectors with pseudo-classes
doc.query('section > paragraph:first-child')
# Query within a specific element
doc.query_within(section, 'paragraph')
# Chain queries
doc.query('section').filter('.important').first
Selector Syntax
element | Element type (section, paragraph, table)
#id | ID selector
.class | Class/role selector
[attr=value] | Attribute selector
:first-child | Pseudo-class selectors
> | Child combinator
(space) | Descendant combinator
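For readers curious how such selectors can be matched against a document tree, the following is a deliberately small sketch in plain Ruby. It handles only compound selectors built from an element type, #id, .class and [attr=value]; combinators and pseudo-classes are left out. ToyNode, parse_selector, toy_matches? and toy_query are hypothetical names for this illustration and say nothing about how Coradoc's query engine is written.

```ruby
# Minimal tree node with the fields the toy selectors look at.
class ToyNode
  attr_reader :type, :id, :roles, :attributes, :children

  def initialize(type, id: nil, roles: [], attributes: {}, children: [])
    @type = type
    @id = id
    @roles = roles
    @attributes = attributes
    @children = children
  end

  # Depth-first walk over this node and all of its descendants.
  def each_node(&block)
    return enum_for(:each_node) unless block

    yield self
    children.each { |child| child.each_node(&block) }
  end
end

# Turn "section.level-2[role=example]" into a hash of conditions.
def parse_selector(selector)
  conditions = { type: nil, id: nil, roles: [], attributes: {} }
  selector.scan(/\[([\w-]+)=([^\]]*)\]|([#.]?)([\w-]+)/) do |attr, value, prefix, name|
    if attr
      conditions[:attributes][attr] = value
    elsif prefix == "#"
      conditions[:id] = name
    elsif prefix == "."
      conditions[:roles] << name
    else
      conditions[:type] = name
    end
  end
  conditions
end

# A node matches when every condition present in the selector holds.
def toy_matches?(node, conditions)
  (conditions[:type].nil? || node.type == conditions[:type]) &&
    (conditions[:id].nil? || node.id == conditions[:id]) &&
    conditions[:roles].all? { |role| node.roles.include?(role) } &&
    conditions[:attributes].all? { |key, value| node.attributes[key] == value }
end

# Return every node in the tree that matches the compound selector.
def toy_query(root, selector)
  conditions = parse_selector(selector)
  root.each_node.select { |node| toy_matches?(node, conditions) }
end

doc = ToyNode.new("document", children: [
  ToyNode.new("section", id: "intro", roles: ["level-1"], children: [
    ToyNode.new("paragraph", attributes: { "role" => "example" })
  ]),
  ToyNode.new("section", roles: ["level-2", "important"])
])

puts toy_query(doc, "section").length
puts toy_query(doc, "section.level-2").first.roles.inspect
```

A child or descendant combinator would add a second step on top of this: match the right-hand compound selector first, then check the parent chain against the left-hand one.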
Validation Framework
Validate documents against schemas:
# Define a validation schema
schema = Coradoc::Validation::Schema.define do
  required :title, type: String, min_length: 1
  required :sections, type: Array, min_count: 1
  optional :author, type: String
  rule :check_references do |doc|
    refs = doc.query('xref')
    missing = refs.reject { |r| doc.resolve_reference(r) }
    missing.map { |r| "Unresolved reference: #{r.target}" }
  end
end
# Validate a document
result = schema.validate(document)
if result.valid?
  puts "Document is valid"
else
  result.errors.each { |e| puts "#{e.path}: #{e.message}" }
end
Built-in Validation Rules
required | Field must be present
type | Field must be specific type
min_length/max_length | String/collection length bounds
min_count/max_count | Collection count bounds
format | Match against regex pattern
rule | Custom validation block
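The shape of a schema DSL like the one above can be sketched in plain Ruby with instance_eval. This toy covers only required, optional, type, min_length, min_count and rule, and validates a Hash rather than a CoreModel document. ToySchema and its nested structs are hypothetical names for this example; the actual Coradoc::Validation classes may be structured differently.

```ruby
class ToySchema
  ToyError = Struct.new(:path, :message)
  ToyResult = Struct.new(:errors) do
    def valid?
      errors.empty?
    end
  end

  # Evaluate the block against a new schema so that `required`,
  # `optional` and `rule` resolve to the methods below.
  def self.define(&block)
    schema = new
    schema.instance_eval(&block)
    schema
  end

  def initialize
    @fields = []
    @rules = []
  end

  def required(name, **constraints)
    @fields << [name, true, constraints]
  end

  def optional(name, **constraints)
    @fields << [name, false, constraints]
  end

  def rule(name, &block)
    @rules << [name, block]
  end

  # Check every declared field, then run each custom rule. A rule
  # returns an array of message strings (empty when it passes).
  def validate(document)
    errors = []
    @fields.each do |name, is_required, constraints|
      value = document[name]
      if value.nil?
        errors << ToyError.new(name.to_s, "is required") if is_required
        next
      end
      errors.concat(check(name, value, **constraints))
    end
    @rules.each do |name, block|
      Array(block.call(document)).each do |message|
        errors << ToyError.new(name.to_s, message)
      end
    end
    ToyResult.new(errors)
  end

  private

  def check(name, value, type: nil, min_length: nil, min_count: nil)
    errors = []
    errors << ToyError.new(name.to_s, "must be a #{type}") if type && !value.is_a?(type)
    if min_length && value.respond_to?(:length) && value.length < min_length
      errors << ToyError.new(name.to_s, "length is below #{min_length}")
    end
    if min_count && value.respond_to?(:size) && value.size < min_count
      errors << ToyError.new(name.to_s, "count is below #{min_count}")
    end
    errors
  end
end

schema = ToySchema.define do
  required :title, type: String, min_length: 1
  required :sections, type: Array, min_count: 1
  optional :author, type: String
  rule :title_not_placeholder do |doc|
    doc[:title] == "TODO" ? ["Title is still a placeholder"] : []
  end
end

result = schema.validate({ title: "", sections: [] })
result.errors.each { |e| puts "#{e.path}: #{e.message}" }
```

Collecting errors into a result object instead of raising on the first failure lets one pass report every problem in the document.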
Streaming Processor
Process large documents without loading everything into memory:
# Stream parse large file
Coradoc::Streaming.parse_large_file("large.adoc", format: :asciidoc,
                                    chunk_size: 100) do |chunk|
  chunk.each { |element| process_element(element) }
end
# Transform in chunks
results = Coradoc::Streaming.transform_in_chunks(elements, chunk_size: 50) do |chunk|
  chunk.map { |el| transform_element(el) }
end
# Incremental serialization
File.open("output.html", "w") do |file|
  Coradoc::Streaming.serialize_incremental(document, format: :html) do |chunk|
    file.write(chunk)
  end
end
# Process with memory constraints
progress = Coradoc::Streaming.process_with_memory_limit(
  "input.adoc", "output.html",
  format: :asciidoc, output_format: :html,
  max_memory: 50 * 1024 * 1024 # 50MB
)
puts progress.to_s # "100 processed (100.0%) at 10.0/sec ~0.5min remaining"
Streaming Features
ChunkProcessor | Batch operations with configurable chunk size
Progress | Track progress, rate, estimated time remaining
MemoryMonitor | Monitor memory usage during processing
StreamReader/StreamWriter | File I/O streaming
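The chunking idea itself needs nothing beyond the Ruby standard library. The sketch below reads lines from an IO object in fixed-size batches and writes each transformed batch out before reading the next, so only one chunk is held at a time. It illustrates the general technique with StringIO standing in for files; each_chunk is a hypothetical helper, not a Coradoc method.

```ruby
require "stringio"

# Yield the lines of an IO object in arrays of at most chunk_size lines.
# each_line without a block returns an enumerator, so lines are read
# only as each_slice asks for them.
def each_chunk(io, chunk_size:, &block)
  io.each_line(chomp: true).each_slice(chunk_size, &block)
end

input = StringIO.new("line 1\nline 2\nline 3\nline 4\nline 5\n")
output = StringIO.new
chunk_sizes = []

each_chunk(input, chunk_size: 2) do |chunk|
  chunk_sizes << chunk.size
  # Transform and write this chunk before the next one is read.
  output.puts(chunk.map(&:upcase))
end

puts chunk_sizes.inspect
puts output.string
```

With a real file, `File.open(path) { |file| each_chunk(file, chunk_size: 100) { ... } }` follows the same pattern.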
Lazy Evaluation
Memory-efficient processing using lazy enumerators and on-demand evaluation:
# Wrap document for lazy iteration
wrapper = Coradoc::Lazy.wrap(document)
wrapper.each_section do |section|
  process_section(section) # Processed on-demand
end
# Lazy transformation pipeline
result = Coradoc::Lazy.transform(sections) do |p|
  p.map { |s| transform_section(s) }
   .select { |s| s.visible? }
   .take(10)
end.to_a # Only evaluates when to_a is called
# Process in batches
wrapper.each_batch(10) do |batch|
  batch.each { |section| process(section) }
end
# Lazy reference resolution
resolver = Coradoc::Lazy.resolver(document, loader: ->(ref, _) {
  load_include_file(ref)
})
content = resolver.resolve("include::chapter1.adoc[]")
Lazy Evaluation Features
DocumentWrapper | Lazy iteration over document sections
TransformationPipeline | Chain lazy transformations without evaluation
ReferenceResolver | On-demand loading of includes/references
ChunkProcessor | Process large content in memory-safe chunks
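Ruby's built-in Enumerator::Lazy shows the same on-demand behaviour in isolation. In this sketch a counter records how many sections are actually built when only the first few visible ones are requested. The section hashes and the first_visible_titles helper are invented for the example; whether Coradoc::Lazy builds on Enumerator::Lazy internally is not stated here.

```ruby
# Build up to `count` toy sections lazily and return the titles of the
# first `limit` visible ones, along with how many sections were built.
def first_visible_titles(count, limit)
  built = 0
  titles = (1..count).lazy
    .map do |n|
      built += 1 # runs only for elements that are actually pulled
      { title: "Section #{n}", visible: n.even? }
    end
    .select { |section| section[:visible] }
    .map { |section| section[:title] }
    .first(limit) # forces evaluation and stops once `limit` items are found
  [titles, built]
end

titles, built = first_visible_titles(1_000, 3)
puts titles.inspect
puts "sections built: #{built} of 1000"
```

Without `.lazy`, the first `map` would build all 1000 sections before `select` ever ran.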
Development
Running Tests
# Run all tests
bundle exec rspec
# Run specific test file
bundle exec rspec spec/coradoc/developer_experience_spec.rb
# Run with documentation
bundle exec rspec --format documentation
Running Linting
bundle exec rubocop
Project Structure
coradoc/
├── lib/
│   └── coradoc/
│       ├── coradoc.rb        # Main API (parse, convert, serialize)
│       ├── registry.rb       # Format registry
│       ├── core_model/       # CoreModel classes
│       ├── transform/        # Base transformer
│       ├── query.rb          # Document query API
│       ├── validation.rb     # Document validation
│       ├── streaming.rb      # Large document processing
│       ├── hooks.rb          # Plugin lifecycle hooks
│       ├── extensions.rb     # Custom element extensions
│       └── cli.rb            # CLI implementation
├── coradoc-adoc/             # AsciiDoc format gem
├── coradoc-docx/             # DOCX format gem (OOXML → CoreModel via Uniword)
├── coradoc-html/             # HTML format gem
├── coradoc-markdown/         # Markdown format gem
├── spec/                     # Test files
└── exe/
    └── coradoc               # CLI executable
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -am 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
Copyright 2024-2026 Ribose Inc.
Licensed under the Apache License, Version 2.0.