Class: Uniword::Docx::DocumentStatistics
- Inherits: Object
- Defined in: lib/uniword/docx/document_statistics.rb
Overview
Calculates document statistics that appear in docProps/app.xml.
Verified against Microsoft Word 2024. Rules:
- Words: whitespace-separated tokens; each CJK char = 1 word
- Characters: total chars minus whitespace (NO paragraph marks)
- CharactersWithSpaces: total chars including spaces (NO paragraph marks)
- Paragraphs: count of non-empty paragraphs (empty paras excluded)
- Lines: same as non-empty paragraph count (no page layout engine)
- Pages: simple approximation (no page layout engine)
Known limitations:
- Pages and Lines use paragraph-based approximation (no page layout engine)
- Footnote/endnote text is not included (unclear if Word includes it)
- Header/footer text is not included (Word likely excludes it)
Constant Summary
- CJK_REGEX =
CJK Unified Ideographs and extension ranges
/[\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\u2F00-\u2FDF\u2E80-\u2EFF]/
- WHITESPACE_REGEX =
/[ \t\r\n]/
Instance Method Summary
- #calculate ⇒ Hash{Symbol => Integer}
- #initialize(package) ⇒ DocumentStatistics (constructor)
A new instance of DocumentStatistics.
Constructor Details
#initialize(package) ⇒ DocumentStatistics
Returns a new instance of DocumentStatistics.
# File 'lib/uniword/docx/document_statistics.rb', line 24
def initialize(package)
  @package = package
end
Instance Method Details
#calculate ⇒ Hash{Symbol => Integer}
# File 'lib/uniword/docx/document_statistics.rb', line 29
def calculate
  body = @package.document&.body
  return empty_stats unless body

  text_per_paragraph = []
  collect_text(body, text_per_paragraph)

  # Word only counts non-empty paragraphs
  non_empty = text_per_paragraph.reject { |t| t.strip.empty? }

  {
    pages: estimate_pages(non_empty.size),
    words: count_words(non_empty),
    characters: count_characters_no_spaces(non_empty),
    characters_with_spaces: count_characters_with_spaces(non_empty),
    paragraphs: non_empty.size,
    lines: estimate_lines(non_empty.size),
  }
end
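The paragraph filtering step in #calculate can be tried in isolation. The sketch below is an illustration only: `sketch_paragraph_stats` is a hypothetical name, and it assumes, per the Overview, that Lines equals the non-empty paragraph count. It leaves out Pages and the word and character counts, whose helpers are not shown on this page.

```ruby
# Hypothetical stand-in for the filtering done inside #calculate.
# Takes one string per paragraph and drops those that are empty or
# contain only whitespace, mirroring the reject block in the listing.
def sketch_paragraph_stats(text_per_paragraph)
  non_empty = text_per_paragraph.reject { |t| t.strip.empty? }

  {
    paragraphs: non_empty.size,
    # Assumption from the Overview: Lines matches the non-empty
    # paragraph count because there is no page layout engine.
    lines: non_empty.size,
  }
end

stats = sketch_paragraph_stats(["Intro", "", "   ", "Body\t"])
puts stats[:paragraphs]
puts stats[:lines]
```

Of the four sample strings, only "Intro" and "Body\t" survive the filter, so whitespace-only paragraphs do not inflate either count.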