ABSTRACT
Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract documents, after many transformations and 'typesetting' processes, often emerge in the popular format of Adobe PDF, either for dissemination or archiving.

Until recently PDF has been a totally display-based document representation, relying on the underlying PostScript semantics of PDF. Early versions of PDF had no mechanism for retaining any form of abstract document structure but recent releases have now introduced an internal structure tree to create the so called 'Tagged PDF'.

This paper describes the development of a plugin for Adobe Acrobat which creates a two-window display. In one window is shown an XML document original and in the other its Tagged PDF counterpart is seen, with an internal structure tree that, in some sense, matches the one seen in XML. If a component is highlighted in either window then the corresponding structured item, with any attendant text, is also highlighted in the other window.

Important applications of correctly Tagged PDF include making PDF documents reflow intelligently on small screen devices and enabling them to be read out in correct reading order, via speech synthesiser software, for the visually impaired. By tracing structure transformation from source document to destination one can implement the repair of damaged PDF structure or the adaptation of an existing structure tree to an incrementally updated document.
Index Terms
- Mapping and displaying structural transformations between XML and PDF