Class: Uniword::Docx::DocumentStatistics

Inherits:
Object
Defined in:
lib/uniword/docx/document_statistics.rb

Overview

Calculates document statistics that appear in docProps/app.xml.

Verified against Microsoft Word 2024. Rules:

- Words: whitespace-separated tokens; each CJK character counts as one word
- Characters: total characters excluding whitespace (paragraph marks are not counted)
- CharactersWithSpaces: total characters including spaces (paragraph marks are not counted)
- Paragraphs: number of non-empty paragraphs
- Lines: set equal to the non-empty paragraph count (no page layout engine)
- Pages: simple approximation based on the paragraph count (no page layout engine)
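
The counting rules above can be sketched in a few lines of Ruby. This is a minimal illustration, not the class's private implementation: the `sketch_*` helper names and `SKETCH_*` constants are hypothetical, and the choice to treat each CJK character as a token boundary for neighbouring Latin text is an assumption.

```ruby
# Hypothetical stand-ins for CJK_REGEX and WHITESPACE_REGEX.
SKETCH_CJK = /[\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\u2F00-\u2FDF\u2E80-\u2EFF]/
SKETCH_WS  = /[ \t\r\n]/

# Words: each CJK character counts as one word; remaining text is
# split on whitespace. Assumption: a CJK character also ends any
# adjacent non-CJK token.
def sketch_count_words(paragraphs)
  paragraphs.sum do |text|
    cjk_words = text.scan(SKETCH_CJK).size
    other_words = text.gsub(SKETCH_CJK, " ").split(SKETCH_WS).reject(&:empty?).size
    cjk_words + other_words
  end
end

# Characters: everything except whitespace; paragraph marks are never
# part of the per-paragraph strings, so they are not counted.
def sketch_count_characters(paragraphs)
  paragraphs.sum { |text| text.gsub(SKETCH_WS, "").length }
end

# CharactersWithSpaces: the raw length of each paragraph's text.
def sketch_count_characters_with_spaces(paragraphs)
  paragraphs.sum(&:length)
end

paragraphs = ["Hello world", "\u4F60\u597D world", "   "]
non_empty  = paragraphs.reject { |t| t.strip.empty? }

non_empty.size                                  # => 2
sketch_count_words(non_empty)                   # => 5
sketch_count_characters(non_empty)              # => 17
sketch_count_characters_with_spaces(non_empty)  # => 19
```

The whitespace-only third paragraph is dropped before counting, so it contributes to none of the totals.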

Known limitations:

- Pages and Lines use paragraph-based approximation (no page layout engine)
- Footnote/endnote text is not included (unclear if Word includes it)
- Header/footer text is not included (Word likely excludes it)

Constant Summary

CJK_REGEX =

CJK Unified Ideographs, Extension A, CJK Compatibility Ideographs, Kangxi Radicals, and CJK Radicals Supplement

/[\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\u2F00-\u2FDF\u2E80-\u2EFF]/
WHITESPACE_REGEX =
/[ \t\r\n]/

Instance Method Summary

  • #calculate ⇒ Hash{Symbol => Integer}
  • #initialize(package) ⇒ DocumentStatistics (constructor)

Constructor Details

#initialize(package) ⇒ DocumentStatistics

Returns a new instance of DocumentStatistics.

# File 'lib/uniword/docx/document_statistics.rb', line 24

def initialize(package)
  @package = package
end

Instance Method Details

#calculate ⇒ Hash{Symbol => Integer}

Returns:

  • (Hash{Symbol => Integer})

# File 'lib/uniword/docx/document_statistics.rb', line 29

def calculate
  body = @package.document&.body
  return empty_stats unless body

  text_per_paragraph = []
  collect_text(body, text_per_paragraph)

  # Word only counts non-empty paragraphs
  non_empty = text_per_paragraph.reject { |t| t.strip.empty? }

  {
    pages: estimate_pages(non_empty.size),
    words: count_words(non_empty),
    characters: count_characters_no_spaces(non_empty),
    characters_with_spaces: count_characters_with_spaces(non_empty),
    paragraphs: non_empty.size,
    lines: estimate_lines(non_empty.size),
  }
end
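
The keys of the returned hash line up with the statistic names listed in the Overview. As a rough sketch of how such a hash might be turned into docProps/app.xml elements, the following assumes a hypothetical `sketch_app_xml_fragment` helper and a hypothetical key-to-element mapping; how Uniword actually serializes these values is not shown in this page.

```ruby
# Hypothetical mapping from #calculate hash keys to app.xml element names.
SKETCH_APP_XML_ELEMENTS = {
  pages: "Pages",
  words: "Words",
  characters: "Characters",
  characters_with_spaces: "CharactersWithSpaces",
  paragraphs: "Paragraphs",
  lines: "Lines",
}.freeze

# Builds one XML element per statistic. Raises KeyError if a
# statistic is missing, rather than silently writing a blank value.
def sketch_app_xml_fragment(stats)
  SKETCH_APP_XML_ELEMENTS.map do |key, tag|
    "<#{tag}>#{stats.fetch(key)}</#{tag}>"
  end.join("\n")
end

# Sample values shaped like the hash returned by #calculate.
stats = {
  pages: 1, words: 5, characters: 17,
  characters_with_spaces: 19, paragraphs: 2, lines: 2,
}
puts sketch_app_xml_fragment(stats)
```

Using `fetch` keeps the sketch strict: an incomplete statistics hash fails loudly instead of producing an empty element.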