AcroForge
A Ruby toolkit for working with PDF AcroForms, especially the broken ones.
AcroForge reads, validates, relabels, and fills PDF forms. Its standout feature is the relabeling workflow: when a vendor ships you a fillable PDF whose internal field names look like page0_field6, Text101, or worse, AcroForge runs a spatial heuristic to figure out what each field is actually for, writes its proposal to a human-reviewable YAML file, and then permanently renames the AcroForm fields once you've approved the mapping. The result is a PDF you can fill programmatically without ever again writing pdf.fields["page0_field6"] = "Alice".
It works on any AcroForm PDF: loan applications, school admission forms, government paperwork, internal HR templates. Nothing in the gem is domain-specific.
Requirements
- Ruby
>= 2.7 - HexaPDF
~> 1.0(runtime dependency, installed automatically)
Installation
gem install acroforge
The acroforge command lands on your PATH automatically. RubyGems handles this the same way it does for rails, bundle, or any other Ruby CLI; tools like mise, rbenv, asdf, and rvm pick up the new binary through their shim layer without further configuration.
To use AcroForge as a library inside another Ruby project, add it to that project's Gemfile:
gem "acroforge"
See the Installation guide for troubleshooting PATH issues on non-standard Ruby setups.
Quick start: the relabeling workflow
Given a PDF with garbage-named fields:
# 1. Generate a starter schema (advisory; the heuristic's best guess at canonical keys)
$ acroforge schema infer broken_form.pdf --out schema.yml
# 2. Generate a draft mapping (per-field rename proposals, sorted by page/position)
$ acroforge relabel propose broken_form.pdf --schema schema.yml --out mapping.yml
# 3. Review mapping.yml in your editor: fix wrong proposals, fill in any blanks
# 4. Apply the mapping: permanently renames the AcroForm fields in place
$ acroforge relabel apply broken_form.pdf mapping.yml
After step 4, the PDF's internal field names are semantic (full_name, email, gender, ...) and you can fill the form programmatically with confidence.
The shortcut for the first two steps:
$ acroforge bootstrap broken_form.pdf
# writes schema.yml AND mapping.yml in one pass
CLI
acroforge schema infer <pdf> [--out schema.yml] [--sections a,b,c]
acroforge relabel propose <pdf> [--out mapping.yml] [--schema schema.yml] [--merge|--overwrite]
acroforge relabel apply <pdf> <mapping.yml>
acroforge compile <pdf> [--schema schema.yml]
acroforge bootstrap <pdf> [--schema-out s.yml] [--mapping-out m.yml]
acroforge version
acroforge help
| Subcommand | What it does |
|---|---|
schema infer |
Runs the heuristic on a PDF and writes a starter schema (canonical key → type + variations). Advisory; you review and edit. |
relabel propose |
Writes a YAML mapping file proposing a semantic name for every AcroForm field. Sorted by page → top-to-bottom → left-to-right. Default mode --merge preserves any key/type values you've already edited. |
relabel apply |
Reads a corrected mapping file and rewrites field[:T] / field[:TU] in the source PDF in place. Auto-disambiguates collisions (full_name, full_name_1, ...). |
compile |
Diagnostic: runs the engine and prints mapped/unmapped counts. Useful for checking heuristic coverage without writing any files. |
bootstrap |
Convenience: schema infer + relabel propose in one call. |
Exit codes: 0 success, 1 user error (bad args, missing file), 2 validation error, 3 internal error.
Library API
require "acroforge"
# Compile a PDF and inspect what the heuristic found.
engine = AcroForge::Engine.new(
"form.pdf",
schema: AcroForge::Schema.load("schema.yml"), # or pass a Hash directly
overrides: {}, # optional per-PDF overrides
sections: ["Personal Details", "Loan Details"] # optional section headers for scoping
)
result = engine.compile!
# => { mapped: {...}, unmapped: [...], select_options: {...}, new_fields_detected: [...] }
# Fill a form with a payload.
engine.validate_payload!(full_name: "Alice", email: "alice@example.com")
engine.fill!({ full_name: "Alice", email: "alice@example.com" }, "filled.pdf")
# Generate a starter schema from a PDF.
schema = AcroForge::Schema.infer("form.pdf")
AcroForge::Schema.dump(schema, "schema.yml")
# Run the relabeler programmatically.
AcroForge::Relabeler.propose("form.pdf", out: "mapping.yml", schema: schema)
AcroForge::Relabeler.apply!("form.pdf", "mapping.yml")
# Validate individual values.
AcroForge::Validator.valid?("alice@example.com", :email) # => true
AcroForge::Validator.valid?("not a date", :date) # => false
Errors
AcroForge::ValidationError: raised byEngine#validate_payload!on type mismatch.AcroForge::RelabelError: raised byRelabeler.apply!on malformed mapping YAML, invalid key names, or missing AcroForm.
Schema format
Schemas are YAML or JSON files in the "rich form":
full_name:
type: string
variations:
- Full Name
- First Name
- Surname
gender:
type: select
variations:
- Gender
- Sex
options:
- male
- female
amount_requested:
type: money
variations:
- Amount Requested
- Loan Amount
Field keys (full_name, gender, ...) become Ruby symbols. type is one of string | select | boolean | money | date | email | number. variations are the human-readable label strings to look for on the page. options are the allowed select values (for select and boolean types).
AcroForge also accepts a legacy "shorthand" form where the value is just an array of variations. AcroForge::Schema.normalize upgrades it to rich form on the way in:
{
full_name: ["Full Name", "First Name", "Surname"],
dob: ["Date of Birth", "DOB"]
}
Mapping file format
relabel propose writes one of these. Edit the key: and type: values; the meta: blocks are advisory and get regenerated on the next propose.
_meta:
source_pdf: broken_form.pdf
generated_at: 2026-05-26T14:32:11Z
acroforge_version: 0.1.0
total_fields: 98
page0_field6:
key: full_name
type: string
meta:
raw_label: Full Name
confidence: high
section: personal_details
page: 0
page0_field28:
key: full_name # collision: apply! renames this one to full_name_1
type: string
meta:
raw_label: Customer Name
confidence: medium
section: personal_details
page: 0
page0_field99:
key: ~ # null = skip this field, leave its name unchanged
type: ~
meta:
raw_label: ~
confidence: none
section: ~
page: 3
key must match /\A[a-z][a-z0-9_]*\z/. Invalid keys cause apply! to raise RelabelError before writing anything to the PDF.
How the heuristic works
For each AcroForm field, AcroForge:
- Reads every text chunk on the page along with its bounding box.
- Scores nearby text against the field's widget rectangle using a mode-aware weighted heuristic (Grid-Lock, Inline Paragraph, or Standard Label depending on layout).
- Picks the best-scoring label, sanitises it into a snake-case key.
- If a
schemais supplied, canonicalises the key against itsvariationslists. - For radio groups and checkboxes, also discovers the option export values from the widget appearance states.
You can inspect what it found via engine.field_proposals after compile!. That's the data structure the Relabeler consumes.
Development
bundle install
bundle exec rspec # run the test suite
bundle exec standardrb # lint
Synthetic test fixtures live in spec/fixtures/. To regenerate them, run ruby spec/fixtures/build_fixtures.rb.
License
MIT.