I’m a skeptic, and possibly addicted to hard data, i.e. to numbers, when it comes to evaluating advertising claims, etc.
I’ve heard that Adobe Acrobat Pro DC is accurate in preserving pagination and format when converting to .docx, but I am instinctively skeptical.
I want to bring PDFs into Index Manager (more on this embedded indexing software tool in a later post) by way of this conversion, so I wanted to know just how accurate it is. For my particular use, it was important that the index embedded in the Word document generate locators that match the page numbers in the PDF.
Software Used
Adobe Acrobat Pro DC (Creative Cloud subscription) and Microsoft Word 365 were used in this evaluation, so I cannot speak to earlier versions of Word. I did (briefly) check Acrobat Reader DC and one draft PDF.
Method
To assess the accuracy of this process, I converted three final versions of PDFs into Word documents (Word 365 .docx):
- One document of 190 pages that had multiple tables and bulleted lists with complicated formatting
- One 340-page document with multiple headings with some graphics in headings as well as tables, figures, and bulleted lists
- One 97-page scholarly document with footnotes and references
With Adobe Acrobat Pro DC and Word both in read mode and set to matching zoom levels, I viewed the full pages side by side. To assess accuracy, I looked at the page number itself, read the first and last lines of each page, and visually checked the similarity of spacing on the pages. For documents with tables and lists, I checked the table format and verified the location of each table or list and its effect on the preceding and following text. Because the fonts varied in the Word document, I also checked the first and last lines of many paragraphs, as well as the number of lines in those paragraphs, some by actual line count and some by visual scanning.
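The check described above was done by eye. As a rough sketch only, the first-and-last-line comparison could be scripted if the text of each page had already been extracted from both files into lists of lines. The extraction step is not shown, and every name below (`PageMismatch`, `boundary_lines`, `compare_pagination`) is hypothetical, not part of Acrobat, Word, or Index Manager.

```python
from dataclasses import dataclass


@dataclass
class PageMismatch:
    page: int        # 1-based page number
    position: str    # "first" or "last" line of the page
    pdf_line: str
    docx_line: str


def normalize(line):
    """Collapse runs of whitespace so small spacing differences are ignored."""
    return " ".join(line.split())


def boundary_lines(page_lines):
    """Return the (first, last) non-blank lines of a page, or ("", "") if empty."""
    text = [normalize(line) for line in page_lines if line.strip()]
    if not text:
        return "", ""
    return text[0], text[-1]


def compare_pagination(pdf_pages, docx_pages):
    """Compare first and last lines page by page.

    Both arguments are assumed to be lists of pages, where each page is a
    list of text lines already extracted from the PDF and the .docx.
    """
    if len(pdf_pages) != len(docx_pages):
        raise ValueError(
            f"page counts differ: {len(pdf_pages)} vs {len(docx_pages)}"
        )
    mismatches = []
    for number, (pdf, docx) in enumerate(zip(pdf_pages, docx_pages), start=1):
        pairs = zip(("first", "last"), boundary_lines(pdf), boundary_lines(docx))
        for position, pdf_line, docx_line in pairs:
            if pdf_line != docx_line:
                mismatches.append(
                    PageMismatch(number, position, pdf_line, docx_line)
                )
    return mismatches
```

An empty result means every page starts and ends on the same line in both files, which is the property that matters for index locators. It says nothing about fonts, spacing, or graphics, so a visual check is still needed for those.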
Results
- Acrobat Reader DC was close, but the location of text on pages was not maintained even with PDFs in the final format.
- Neither Acrobat Reader DC nor Acrobat Pro DC could convert a draft PDF accurately. The Acrobat Pro DC conversion of the draft was close enough to suggest that the final version might convert accurately. Most of the variation was due to reflow around tables and figures.
- With Adobe Acrobat Pro DC, all page numbers were accurate in the conversions of 627 pages. (This is possibly the most boring thing I’ve ever done but the results are worth the time and effort, since I now have some data to share.) Some minor “failures” are noted below.
- Figures and tables were accurate in location, spacing, and content.
- Hyphenation of words carried over accurately as did the lines per paragraph (with one exception) even with the differences in fonts observed.
- I observed a single instance where the number of lines in a paragraph differed between the two documents. The PDF had a single word on the final line of the paragraph, but the Word document did not, so the paragraph ran to only five lines rather than the six in the PDF. This did not affect the pagination in any way. (This was in the Acrobat Pro DC conversion.)
Differences in Documents
- Where graphics are part of the layout (e.g. a title underlined with a solid bar of color, with the descenders of letters such as “y” brought forward in the PDF), the descenders may not show in the Word conversion and will appear to have been cut off. I saw this frequently in one of my test documents; however, nothing about it affected pagination. All first and last lines were accurate despite this “failure.”
- There were rare instances in one of the PDFs where the spaces between words in the section headings were slightly different but, again, this had no effect on the pagination.
- There were some apparent differences in the fonts used in the conversion, likely because I had not installed all of the fonts used in the PDF for use in Word. Again, there were no effects on pagination.
- With dropped capitals, the spacing of the lines that follow them varied in Word in ways not seen in the PDF (some lines might overlap the capital letter); however, the number of lines in the paragraph was maintained.
If I were to repeat this experiment (which I shall not!), I would install the fonts used in the PDF document so that they would be available to Word in the conversion. These would be easily available with an Adobe Creative Cloud subscription. Even with the variations in fonts, the conversion of the PDF into a Word document (.docx using Word 365) was accurate, although I would certainly do spot comparisons on any conversion.