PDF documents can be generated by:
- Scanning a print document as an image and saving it as a PDF file (unstructured)
- Converting PostScript files, or using Acrobat Distiller with other authoring tools (structured)
- “Printing” directly from word processing, TeX, or other publishing tools using desktop and server-side PDF generators (structured/unstructured)
- “Printing” to a PDF file from many word processing applications or from Mac OS X (unstructured/tagged)
- Exporting from recent versions of MS Office and Adobe authoring tools (tagged)
Currently, MS Office, OpenOffice, StarOffice, and Adobe authoring tools can automatically generate a tagged PDF document. Server-side tools and other PDF editors do not support tagging or accessibility features; they generate the document either as an image (unstructured) or as searchable text (structured).
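The three categories above (unstructured, structured, tagged) can be told apart roughly by looking for a few markers in the file. The sketch below is a heuristic only, not a validator. It relies on two facts about the PDF format: a tagged PDF has a `/StructTreeRoot` entry and a `/MarkInfo` dictionary with `/Marked true` in its catalog. Treating the presence of `/Font` as a sign of real text is an assumption made here for illustration, and the function name `classify_pdf_bytes` is hypothetical. Because it scans raw bytes, it can miss these markers in files that store their objects in compressed object streams; a real PDF parser would be needed there.

```python
import re


def classify_pdf_bytes(data: bytes) -> str:
    """Heuristically label raw PDF bytes as 'tagged', 'structured' or 'unstructured'."""
    # A tagged PDF declares a structure tree and sets /Marked true in /MarkInfo.
    has_struct_tree = b"/StructTreeRoot" in data
    is_marked = re.search(rb"/MarkInfo\s*<<[^>]*/Marked\s+true", data) is not None
    if has_struct_tree and is_marked:
        return "tagged"
    # Assumption: a font resource means the file carries real (searchable) text.
    if b"/Font" in data:
        return "structured"
    # No tags and no fonts: most likely page images only, e.g. a plain scan.
    return "unstructured"


if __name__ == "__main__":
    sample = b"1 0 obj << /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >> endobj"
    print(classify_pdf_bytes(sample))
```

A result of "tagged" only says that tags exist; it says nothing about whether the tags or the reading order are correct.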
Characteristics of a tagged PDF document:
- Semantic structure is encoded, allowing for:
  - precise control over the reading order of text blocks, table cells, and form fields
  - correct interpretation and recognition of structural elements (paragraphs, tables, columns, lists) and their attributes, as required for accurate text reflow
  - a hierarchical, nested tree layout built from a standard set of tags that can be manually edited
- Word boundaries are explicitly defined for easy recognition.
- Fonts are mapped into standard Unicode equivalents, ensuring reliable translation of all text and correct interpretation of ligatures and hyphens. This allows the screen reader to correctly read all characters and words.
- Text descriptions (alt text) provided for non-textual content objects are recognized by screen readers.
- Decorative and non-essential content can be treated as artifacts, which are ignored by assistive technology.
- Interaction with document elements such as form fields and hyperlinks is supported.
- Documents can be exported to other file formats (RTF, TXT, HTML) while maintaining their original formatting.
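The way a tag tree drives what a screen reader announces can be sketched with a small model. This is an illustration under stated assumptions, not how Acrobat or any screen reader is implemented: the `Node` class, its field names, and the `reading_order` function are hypothetical. The tag names ("H1", "P", "Figure") follow the standard PDF tag set. In an actual PDF, artifacts are marked in the page content rather than stored as tree nodes, so the `artifact` flag here is a simplification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str                       # standard tag name, e.g. "H1", "P", "Figure"
    text: str = ""                 # text carried by this element, if any
    alt: Optional[str] = None      # alternate text for non-textual content
    artifact: bool = False         # decorative content that should be skipped
    children: List["Node"] = field(default_factory=list)


def reading_order(node: Node) -> List[str]:
    """Return the strings to announce, in depth-first (tree) order."""
    if node.artifact:
        return []                  # artifacts are not exposed to assistive technology
    if node.alt is not None:
        return [node.alt]          # alt text stands in for the element's content
    spoken = [node.text] if node.text else []
    for child in node.children:
        spoken.extend(reading_order(child))
    return spoken


doc = Node("Document", children=[
    Node("H1", "Report"),
    Node("Figure", alt="Bar chart of sales"),
    Node("P", "Page 3", artifact=True),
    Node("P", "Sales rose."),
])
print(reading_order(doc))
```

Because the order comes from the tree and not from the position of text on the page, rearranging the children is enough to change the reading order without touching the visual layout.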
Accessible (tagged) PDF documents can be created by:
- Automatically converting a well-structured Microsoft Office (version 2000 and above) document into tagged PDF format.
- Exporting a tagged PDF file directly from recent versions of Adobe publishing tools such as FrameMaker, InDesign, and PageMaker.
- Scanning a print document and running OCR (Optical Character Recognition) on it, then editing and formatting the result in Microsoft Word and converting it to PDF format
- Using the Acrobat OCR utility to convert scanned images (unstructured) into a structured format from within the PDF editor, and then adding the tags from within Acrobat Professional.
- Using “Add Tags” from within Acrobat to convert structured PDF documents (Distilled or OCR-ed) into tagged documents. These often require manual edits of the “tags”.
- Manually creating the “tags” and reading order from within Acrobat Professional (very tedious)
Issues to consider:
- Complex graphics take a great deal of processing time
- Complex tables may not be correctly interpreted
- Complex layouts with multiple layers may not be fully recognized or follow the correct reading order
- Complex layouts may not completely capture all information when exported into other formats