RTF TO XML
User’s Guide
Version 5.2.1
For information on commercial use, support, and version updates please visit:
• Command Line Conversion
• Single Conversion of RTF File
• Batch Conversion of Multiple RTF Files
• Recursive Conversion of RTF files
• Description of RTF TO XML Options
• Template Preparation Options
• Using a Conversion Task
• Conversion of an Individual File
• Conversion to XML Templates
• Description of Data Extraction Algorithm
• Configuring the RTF Parser
• Finding the most matching substitution rule
• Selecting the font name for rendering tabs
• Composing the family attribute while writing to XML FO
• Default font substitution rules
• Configuring Picture Plug-ins
• Configuring Output Plug-ins
• Specifying User Preferences
• List of Parsed RTF Control Words
• Interpreting of continuous section breaks
• Horizontal positioning of tables
The RTF TO XML converter is designed for conversion of Rich Text Format (RTF) files to well-formed Extensible Markup Language (XML) documents according to the Extensible Stylesheet Language Formatting Objects (XSL FO) specification.
RTF TO XML can split an XML FO file into an XSL template that contains the formatting and an XML file that contains the textual data. A range of criteria for data extraction can be applied during conversion, and simple cycles can be inserted.
This document describes version 5.2.1 of RTF TO XML. This version is based on a new RTF parsing solution, the Novosoft RTF DOM Builder.
The converter can be called through the graphical user interface (GUI), from the command line (the ru.novosoft.dc.rtf_to_xml.Convert class provides this functionality), or through API requests. The rest of this document describes how to use RTF TO XML.
Installation guidelines are described in the readme.txt file supplied with the distribution.
Using GUI under Linux requires X11 support.
Different versions of RTF TO XML have different usage limitations:
Evaluation version limitations
· | | Text content is corrupted: characters are occasionally replaced with punctuation marks; |
· | | In prepare-template mode, some text entries are not moved to the XML data file. |
Simple license limitations
· | | No batch or recursive conversion; |
Server and Site license limitations
· | | Conversion to templates is optional. |
Conversion methods
RTF TO XML converts an RTF document (for example, an MS Word 2000 document saved as Rich Text Format) into XSL FO format. The converter supports two conversion types: conversion to well-formed XML according to the XSL FO specification, and conversion to an XML template consisting of two files, an XSL template and XML data (see Section 5). RTF TO XML supports the following RTF features (changes from version 2.2.2 are highlighted):
Page formatting support
· | | Page setup options: margins, page size; |
· | | Page headers and footers of all types; |
· | | Section breaks of all types (continuous section breaks are supported in compatibility with Antenna House XSL Formatter 2.5 and XEP 3.0); |
· | | Custom restart of section page numbers; |
· | | Footnotes: custom and Arabic numbering labels, custom footnote separators; |
· | | Watermarks (document background). Since version 3.0, support for background images has been temporarily disabled; |
· | | Document columns with identical widths and gaps (the XSL FO specification does not support columns with different widths and different gaps between columns); |
· | | Paragraph pagination: widow/orphan control, keep together, keep with next; |
· | | Page breaks before or after a paragraph. |
Text formatting support
· | | Font family and size, superscript and subscript; |
· | | Font style and weight (bold, italic, underline, etc.); |
· | | Font color and background color; |
· | | Cell, paragraph, and text color shading (without patterns); |
· | | Paragraph alignment and margins; |
· | | Paragraph line spacing; |
· | | Space before and after a paragraph; |
· | | Commonly used special RTF symbols; |
· | | Preservation of white spaces; |
Tabs support
· | | Rendering is used to calculate the true position of text with tabs; |
· | | Multiple tabs of the “center” or “left” type in the first line and one tab of the “right” type in the last line of the paragraph are allowed; |
· | | Two tab conversion methods are available (fo:leader with leader-pattern="space" and fo:leader with leader-pattern="use-content"); |
Miscellaneous
· | | Full implementation of tables (except slanted borders); |
· | | Height of table rows and vertical alignment of text in table cells; |
· | | Last page number field; |
· | | Pictures of any graphic format provided with RTF Specification 1.6; |
· | | Picture conversion plug-ins; |
· | | Support of non-grouped textboxes; |
· | | Multilingual support (23 RTF code pages and 17 font character sets are supported); |
· | | Links (e.g. in table of contents) and hyperlinks; |
· | | Track changes support (temporarily turned off pending planned improvements). |
· | | Tab rendering has the following limitations: |
o | | Rendering is applied to standard fonts known in Java; |
o | | A right tab must close a paragraph; |
o | | Only one right tab per paragraph is allowed; |
o | | Mixing of underlined tabs and fonts is forbidden. |
If graphics cannot be initialized, tab rendering is turned off. In particular, if X11 is not supported under Linux, tab rendering is unavailable.
· | | RTF supports different column gaps (e.g. 12 pt between the first and second columns and 24 pt between the second and third columns), but XSL FO does not. Therefore "column-gap" is set equal to the last encountered gap width. |
· | | Watermark / background image cannot be resized in XSL FO format (while this is possible in RTF). |
XSL FO specification:
XSL FO processors:
X11 support for Linux:
ImageMagick:
FOP:
http://www.apache.org/fop/
Using the RTF TO XML GUI requires support for Java graphics components. For example, running the GUI under Linux requires X11 support.
To start the GUI, run
rtf_to_xml.bat
from the RTF TO XML home directory. The main window of the GUI looks as follows:
The Select Input File(s) pane is used to select the RTF file to be converted. Click the Select button and choose an RTF file in the standard file selection dialog. The CONVERT button then becomes enabled and lets you start the conversion to XSL FO.
The Select Output Directory pane lets you select a directory in which to store the results. The <default> selection means that the output is stored next to the input (in the same directory as the input file). The name of the output file is the name of the input file with its extension changed to “.fo” (in XSL+XML mode, two output files are created for every input file: an XML data file with the “.xml” extension and an XSL stylesheet file with the “.xsl” extension).
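The naming rule above can be illustrated with a short Python sketch. This is not part of RTF TO XML; the function name output_paths and its parameters are hypothetical and only model the rule as this guide states it.

```python
from pathlib import PureWindowsPath


def output_paths(rtf_path, output_dir=None, xsl_xml_mode=False):
    """Model the output file names described in this guide (illustrative only)."""
    src = PureWindowsPath(rtf_path)
    # The "<default>" selection stores results next to the input file.
    target_dir = PureWindowsPath(output_dir) if output_dir else src.parent
    # XSL+XML mode yields a data file and a stylesheet; otherwise a single .fo file.
    extensions = [".xml", ".xsl"] if xsl_xml_mode else [".fo"]
    return [str(target_dir / (src.stem + ext)) for ext in extensions]


print(output_paths(r"c:\rtf\sample.rtf"))
print(output_paths(r"c:\rtf\sample.rtf", r"c:\fo", xsl_xml_mode=True))
```

The first call models the default output directory, the second a conversion in XSL+XML mode with an explicit output directory.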
The Select Conversion Options pane shows the most important conversion options. It is divided into two parts: the left part contains common user options, and the right part contains options for splitting the output into an XML data file and an XSL stylesheet file.
The Compatibility combo box lets you select the compatibility model to be used during conversion. The following compatibility models are currently supported:
· | | ahxf-2.1 – compatibility with the XSL Formatter versions 2.1–2.4 from Antenna House, Inc.; |
· | | ahxf-2.5 – compatibility with the XSL Formatter version 2.5 or greater from Antenna House, Inc.; |
· | | fop-0.20.1 – compatibility with the FOP version 0.20.1 or earlier; |
· | | fop-0.20.3 – compatibility with the FOP version 0.20.3; |
· | | fop-0.20.4 – compatibility with the FOP version 0.20.4 or later; |
· | | xep-2.5 – compatibility with the XEP version 2.5 or earlier; |
· | | xep-2.7 – compatibility with the XEP version 2.7; |
· | | xep-3.0 – compatibility with the XEP version 3.0 or later; |
· | | w3c – default compatibility with the XSL FO specification 1.0 by WWW Consortium. |
The Log level combo box lets you select one of three logging levels:
· | | info – logging information messages to the screen log during conversion; |
· | | normal – logging messages to the screen log and to the file with the “.log” extension; |
· | | full – logging more messages than at the normal level. Two log files are created in this case: the “.log” file contains all messages and the “.log0” file contains the RTF commands skipped during conversion. |
For normal or full logging, log files are stored next to the conversion results.
The Output plug-in combo box lets you select a plug-in command to be applied after successful conversion of an RTF file. It is disabled if no active output plug-ins are recognized (see Section 4.e for more details). Other options in the left half of the options pane are described in Section n.
The options on the right of the options pane are used for a special conversion in the “prepare-template” mode. When the XSL+XML box is checked, this mode is turned on and the other options become available for editing (see Section o for more details).
The buttons at the bottom of the main form provide the following actions:
· | | Save Config As … button saves the current GUI configuration (selected options, history of selected files and directories) to a file. It opens a standard file selection dialog. |
· | | Load Config From … button loads a GUI configuration from a file. The loaded configuration is copied to the current configuration. |
· | | CONVERT button starts the conversion task in a separate window. |
· | | Exit button closes the RTF TO XML GUI program. |
When a file is successfully converted, its path is added to the top of the file conversion history list. You can easily select a previously converted file by opening the file history combo box in the Select Input File(s) pane. The selected directories are also saved in the history list. The GUI program remembers the last 10 files and directories used.
The Batch Select button in the Select Input File(s) pane lets you select all RTF files in a directory for conversion. The following dialog appears:
You can select a directory containing the RTF files to be converted and specify the following additional batch conversion options:
· | | Recurse into subdirectories: All RTF files in the selected directory and in all its subdirectories are converted. Conversion results are stored next to the input files or, if an output directory is specified, under the output directory in the same directory tree structure as the input directory tree. |
· | | Skip (do not overwrite) existing files: In this mode, RTF TO XML converts only those files that have not been converted yet; already existing target files remain untouched. |
· | | Stop on error: In this mode, an error during conversion of a file interrupts the run and stops conversion of the remaining files in the list. |
· | | Batch logging to: If this mode is selected, the conversion log is written to a single file with the specified name and the “.log” extension. The log file is stored in the selected input directory, or in the output directory if one is specified. This option takes effect only if the log level is Normal or Full. |
Note: Depending on the type of the RTF TO XML license, batch conversion can be unavailable. In this case, you will be able to convert only a single file per run.
h. | | Single Conversion of RTF File |
To convert an RTF file to XSL FO format from the command line, open the console (DOS) window and run the following command:
rtf_to_xml_cmd rtf-file
Here rtf-file is the path to the RTF file to be converted. The path can be either absolute or relative to the current directory. The destination XML FO file is created with the “.fo” extension in the same directory as the RTF file.
The default conversion rules follow the W3C recommendations, but existing rendering tools differ in how they implement the XSL FO specification. To satisfy the input requirements of a specific renderer, use the corresponding compatibility option when converting to XML FO. For example, to provide compatibility with FOP-0.20.3, run the following command:
rtf_to_xml_cmd -o fop-0.20.3 rtf-file
You can also use the -d option to specify the destination directory in which the FO file is stored. For example, the following command converts “c:\rtf\sample.rtf” and stores the resulting file as “c:\fo\sample.fo”:
rtf_to_xml_cmd -o fop-0.20.3 c:\rtf\sample.rtf -d c:\fo
i. | | Batch Conversion of Multiple RTF Files |
The converter allows batch processing of many RTF files using a single command:
rtf_to_xml_cmd rtf-file1 rtf-file2 …
You can use wildcards in file names. For example, the following command converts all RTF files in the current directory and saves the results with the “.fo” extension in the same directory:
rtf_to_xml_cmd *.rtf
You can also use the -d option to specify the destination directory in which the FO files are stored. For example, the following command converts all RTF files in the current directory and creates the resulting FO files in the subdirectory fo:
rtf_to_xml_cmd *.rtf -d fo
You can also specify that the source directory should be replaced with the destination directory during conversion. The -s option is used for this purpose. For example, the following command converts all RTF files in two subdirectories of the rtf directory and stores the results in the respective subdirectories of the fo directory:
rtf_to_xml_cmd rtf\dir1\*.rtf rtf\dir2\*.rtf -s rtf -d fo
If the source directory is not specified, but the destination directory is specified, the location of the source directory is set to the directory of the first file in the conversion list. If some files in the list do not belong to the source directory, conversion results for these files will be kept in the respective source directories.
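The path mapping rules for the source and destination directories can be modelled with the following Python sketch. It only restates the behavior described in this section; the function destination_for and its parameters are hypothetical, and the actual converter may resolve paths differently in edge cases.

```python
from pathlib import PureWindowsPath


def destination_for(rtf_file, files, dest_root=None, source_root=None):
    """Model where the .fo result of one file in the conversion list is stored."""
    src = PureWindowsPath(rtf_file)
    if dest_root is None:
        # No destination directory: the result is stored next to the input.
        return str(src.with_suffix(".fo"))
    # Without a source root, the directory of the first file in the list is used.
    root = PureWindowsPath(source_root) if source_root else PureWindowsPath(files[0]).parent
    try:
        relative = src.relative_to(root)
    except ValueError:
        # The file does not belong to the source directory: keep the result beside it.
        return str(src.with_suffix(".fo"))
    # Replace the source root with the destination root.
    return str((PureWindowsPath(dest_root) / relative).with_suffix(".fo"))


files = [r"rtf\dir1\a.rtf", r"rtf\dir2\b.rtf"]
for name in files:
    print(destination_for(name, files, dest_root="fo", source_root="rtf"))
```

With both roots given, the two results land in fo\dir1 and fo\dir2. If source_root is omitted, the first file's directory (rtf\dir1) becomes the source root, so the second file falls outside it and its result stays in rtf\dir2.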
j. | | Recursive Conversion of RTF files |
Recursive processing converts RTF files in the specified directories and in all their subdirectories. To enable it, use the -r option. We recommend enclosing the file name pattern for the RTF files to be converted in double quotes to prevent the automatic wildcard expansion performed by some operating systems. In the example below, all RTF files in the current directory and in all of its subdirectories are converted:
rtf_to_xml_cmd -r "*.rtf"
Another example converts all RTF files in the directory tree that starts at the rtf subdirectory of the current directory. The results are stored with the “.fo” extension in a tree of the same structure created in the fo subdirectory:
rtf_to_xml_cmd -r "rtf\*.rtf" -d fo
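To see which files a recursive run with a destination directory would touch, the mirroring of the directory tree can be sketched in Python. The helper plan_recursive_conversion is hypothetical; it only lists source and target paths under the assumptions stated in this section and does not call the converter.

```python
from pathlib import Path
import tempfile


def plan_recursive_conversion(source_root, dest_root, pattern="*.rtf"):
    """List (source, target) pairs, mirroring the source tree under dest_root."""
    source_root, dest_root = Path(source_root), Path(dest_root)
    plan = []
    # rglob matches the pattern in source_root and in all of its subdirectories.
    for rtf in sorted(source_root.rglob(pattern)):
        target = (dest_root / rtf.relative_to(source_root)).with_suffix(".fo")
        plan.append((rtf, target))
    return plan


# Build a small throwaway tree to demonstrate the mapping.
with tempfile.TemporaryDirectory() as work:
    rtf_root = Path(work) / "rtf"
    (rtf_root / "dir1").mkdir(parents=True)
    (rtf_root / "top.rtf").write_text("{\\rtf1}")
    (rtf_root / "dir1" / "nested.rtf").write_text("{\\rtf1}")
    for source, target in plan_recursive_conversion(rtf_root, Path(work) / "fo"):
        print(source.relative_to(work), "->", target.relative_to(work))
```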
k. | | Description of RTF TO XML Options |
RTF TO XML provides many options and several ways to set them. The options fall into the following categories:
· | | Common Options are specified on the command line only; |
· | | Core Options are specified in the RTF TO XML configuration file only. They cannot be modified from the command line (see the "conf/nsdc.properties" file for a more detailed description); |
· | | User Options can be specified in several ways. Their default values can be specified in the RTF TO XML configuration file. You can override the defaults from the command line or by using a file of options loaded with the -o option. |
Common options have a special syntax described in Section l. User options have the following syntax:
· | | Command line syntax: -key:value. If the ":value" part is omitted, the value "true" is assumed. Specifying the option as "-key:" means that an empty string is used as the option value; |
· | | RTF TO XML configuration file syntax: rtf_to_xml.key=value |
· | | Syntax in a file of options: key=value |
Here "key" is the option name and "value" is the option value.
During conversion, the values of options are used in the following order:
1. | | The RTF TO XML configuration file "conf/nsdc.properties" is loaded first. Core options and default values of user options are set here. |
2. | | Command line options are processed from the left to the right. A new entry of an option overrides its previous value. An exception is the "font-substitution" option, for which a new value means that a file of font substitutions will be loaded in addition to already specified font substitutions. |
3. | | When options are loaded from a file, the processing order of these is undefined. |
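The command line syntax and the override order can be illustrated with a small Python sketch. It is a model of the rules stated above, not the converter's own parser: the functions parse_user_option and resolve_options are hypothetical, the defaults are assumed to be passed with the "rtf_to_xml." prefix already removed, and the font substitution file names are invented for the example.

```python
def parse_user_option(arg):
    """Split "-key:value" into (key, value); a bare "-key" means "true"."""
    body = arg[1:]  # drop the leading "-"
    key, separator, value = body.partition(":")
    # "-key:" keeps the separator and yields an empty string as the value.
    return key, (value if separator else "true")


def resolve_options(config_defaults, command_line_args):
    """Apply command line options left to right on top of the defaults."""
    options = dict(config_defaults)
    # font-substitution is the documented exception: values accumulate.
    font_files = []
    if "font-substitution" in options:
        font_files.append(options.pop("font-substitution"))
    for arg in command_line_args:
        key, value = parse_user_option(arg)
        if key == "font-substitution":
            font_files.append(value)
        else:
            options[key] = value  # a later entry overrides an earlier one
    options["font-substitution"] = font_files
    return options


args = ["-model:fop-0.20.3", "-use-content-tabs", "-model:xep-2.7",
        "-font-substitution:fonts-a.xml", "-font-substitution:fonts-b.xml"]
print(resolve_options({"model": "w3c"}, args))
```

In this run the last model entry wins, use-content-tabs receives the implied value "true", and both font substitution files are kept.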
The following common options can be used only on the command line:
· | | -h or --help option prints help information describing the common options and stops processing; |
· | | -i option turns on indentation in the output files. This option has no effect on XML data files in template preparation mode because XML data is always produced with indentation; |
· | | -u option applies GUI preferences to a command line conversion. This option was introduced in version 3.2. It lets you start a command line conversion with the options selected in the RTF TO XML GUI; |
· | | -d directory option specifies the destination directory root in which the output file(s) are stored; |
· | | -s directory option specifies the source directory root. It works in conjunction with the -d option. When both options are specified, the source directory root is replaced with the destination directory root when determining the paths of output files. If the destination directory is specified but the source directory is not, the source directory is taken from the path to the first file in the list of source files. This option is used only during batch conversion; |
· | | -r option enables recursive processing of subdirectories. This option is used only during batch conversion; |
· | | -S option turns Skip Mode on. In skip mode, RTF TO XML converts only those files in the conversion list that have not yet been converted. This option is used only during batch conversion; |
· | | -E option turns Stop on Error mode on. It interrupts the whole conversion procedure if an error occurs during conversion of one of the files. This option is used only during batch conversion; |
· | | -o path option loads options/settings from the file specified by path. This option can be used for loading compatibility options. For example, the -o xep-2.7 option loads compatibility options for XEP-2.7. Files of compatibility options are stored in the “conf” subdirectory of the RTF TO XML home directory and have the “.opt” extension. You can prepare your own file of options and give its path as the value of this option. The default option file extension “.opt” can be omitted. If a file is located in the “conf” subdirectory of the RTF TO XML home directory, its name (with or without the extension) is sufficient to find it: if the options file is not found at the path specified by this option, RTF TO XML looks for it in the “conf” subdirectory. Options in a file are listed using the key=value syntax. Default values of options are specified in the conf/nsdc.properties file; |
· | | -vN option sets the logging level (from 0 to 5): |
o | | -v1 info logging (log information messages to the console), |
o | | -v2 silent normal logging (log information, warning, and error messages to the ".log" file), |
o | | -v3 normal logging (log information, warning, and error messages both to the console and to the ".log" file), |
o | | -v4 silent full logging (log information, warning, error, and debug messages to the ".log" file; messages on unrecognized commands to the ".log0" file), |
o | | -v5 full logging (log information, warning, and error messages to the console; log information, warning, error, and debug messages to the ".log" file; messages on unrecognized commands to the ".log0" file). |
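The five described logging levels can be summarized as a lookup table. The Python sketch below is only a restatement of the list above for quick reference; the names LEVELS, sinks, and writes_log0 are hypothetical, and level 0 is left out because this guide does not describe it.

```python
BASIC = {"info", "warning", "error"}

# For each -vN level: which message kinds go to the console and to the ".log" file.
LEVELS = {
    1: {"console": {"info"}, "log": set()},
    2: {"console": set(), "log": BASIC},
    3: {"console": BASIC, "log": BASIC},
    4: {"console": set(), "log": BASIC | {"debug"}},
    5: {"console": BASIC, "log": BASIC | {"debug"}},
}


def sinks(level, kind):
    """Return the destinations that receive a message of the given kind."""
    return sorted(name for name, accepted in LEVELS[level].items() if kind in accepted)


def writes_log0(level):
    """Levels 4 and 5 also report unrecognized commands to the ".log0" file."""
    return level >= 4


print(sinks(3, "warning"))
print(sinks(4, "debug"), writes_log0(4))
```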
The default logging level is 1. If the logging level is greater than 1, a log file is created for each converted file (with the same name and a ".log" extension) and is saved in the same directory as the conversion results. In batch-logging mode (see Section n) with a logging level greater than 1, a single "rtf_to_xml.log" file is created in the "log" subdirectory of the RTF TO XML home directory. For ordinary use we recommend logging levels 1 to 3.
Since version 3.2, the compatibility options have been reduced to a single option, the model option. The possible values of this option and their meanings are listed in Section f in the description of the Compatibility option. For convenience and for backward compatibility with earlier versions of RTF TO XML, the files of compatibility options remain, but they now contain only the model option. So you can specify a compatibility model in two ways. For example, for FOP-0.20.4 compatibility, use either
-o fop-0.20.4
or
-model:fop-0.20.4
on the command line.
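The lookup of an options file named with the -o option can be sketched as follows. This Python model is an assumption built from the description of the -o option: the function find_options_file is hypothetical, and the exact order in which the converter tries the name with and without the “.opt” extension is not stated in this guide.

```python
from pathlib import Path
import tempfile


def find_options_file(name, home):
    """Look for an options file at the given path first, then in <home>/conf."""
    conf_dir = Path(home) / "conf"
    candidates = []
    for base in (Path(name), conf_dir / name):
        candidates.append(base)
        # The ".opt" extension may be omitted. It is appended to the whole name,
        # because names such as "xep-2.7" already contain a dot.
        candidates.append(base.with_name(base.name + ".opt"))
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


# Demonstration with a throwaway home directory.
with tempfile.TemporaryDirectory() as home:
    conf = Path(home) / "conf"
    conf.mkdir()
    # Compatibility files contain only the model option, in key=value syntax.
    (conf / "xep-2.7.opt").write_text("model=xep-2.7\n")
    print(find_options_file("xep-2.7", home))
```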
The meaning of the compatibility settings is specified in the “data/options.xml” file. Compatibility specifics are described in Section 4.h.
This section describes miscellaneous user options that provide additional conversion capabilities. The command line syntax is used here:
· | | -output-plugin:name option specifies the name of the output plug-in configuration file to be applied after conversion of an RTF file (see Section 4.e for more details); |
· | | -track-changes option generates a document with tracked changes shown in a strikethrough font. The default behavior is “no track changes”, in which case deleted and revised text is ignored [track changes support has been under revision since version 3.0 and is not functioning yet]; |
· | | -use-content-tabs option turns on the use of “use-content” type leaders when converting tabs. Content tabs provide more exact tab alignment, but if the text before a tab extends beyond the tab position, it disappears. For this reason, “space” type leaders are used by default; |
· | | -no-picture-conversion option disables picture conversion plug-ins for the current run. By default, the converter applies picture conversion plug-ins if any are specified (see Section 4.d); |
· | | -compare-pictures option reduces the number of output pictures when a converted file contains many identical pictures; |
· | | -batch-logging option switches log output of the run to the |