Review: RiverDocs Converter
- Published by RiverDocs Limited
- Price: £399
- Available from: http://www.riverdocs.com/
- Version reviewed: 1.1
Disclosure: This is a paid review. RiverDocs Limited have had no influence on the tone or content of this review.
An essential tool for any organisation which publishes Microsoft Word or PDF files online, RiverDocs Converter is vastly superior to any other conversion software currently available. There's now no reason for publishers not to offer accessible, high quality HTML versions of documents previously published only in proprietary formats. The parser even compensates for poorly authored source documents, previously a significant barrier to producing accessible, semantic HTML versions of Word and PDF documents.
It's not a magic bullet though - every conversion requires human-checking, and documents with any degree of complexity require a degree of input from an experienced web editor - but despite a slightly weak editor it's still well worth the price and will only get better in future versions given the publisher's focus on research and development.
RiverDocs Converter is a software package for the Microsoft Windows operating system which claims to convert documents designed for print into structured, accessible HTML documents for online delivery. In short this means it'll take PDFs and Microsoft Word files and attempt to convert them into a format more suitable for delivery and consumption over the web.
PDF and MS Word are beloved of government and corporations who often need to publish large documents quickly, but these formats are primarily designed for printing, not for delivery online, and have serious accessibility issues associated with them. So the potential benefits from effective conversion software are enormous - being able to offer HTML versions of these documents cost effectively is something that hasn't been possible before.
Installation was straightforward, taking a couple of minutes on my workhorse desktop PC.
The software requires the latest version of the Microsoft .NET 2.0 Framework. If this isn't already installed you will be prompted to download and install it.
Starting the software for the first time you are presented with a quick guide to converting your first document, and the clean, functional RiverDocs interface.
Test 1 - my first conversion
To test the software for the first time I used a PDF document regarding chimney stack removal I found on Cambridge City Council's website at:
It's a 4 page document containing a cover sheet, and a mix of different levels of heading, bullets and images. The PDF document was not tagged.
Opening the file displays it in the main RiverDocs window:
Clicking the Convert button started the conversion, which took less than a second using the default settings. The interface changes to a split-screen affair, with the original document in the left pane, and the converted document in the right pane:
To give an idea of the quality of conversion and mark-up the software can produce automatically, I wanted to save the document immediately. Admittedly this is not the intended real-world usage of the product, but it does give an idea of the quality of the baseline conversion prior to manual editing.
Big River had provided me with a one page crib-sheet covering the major interface elements, so I knew that the Save function was for saving a RiverDocs project, and the Publish function was for saving the converted document as HTML, CSS and images.
Clicking the Publish button presents the Publish dialogue box:
In addition to publishing as HTML, the software also supports output in CHM (Microsoft Compiled HTML Help) format.
To keep things tidy I wanted to publish this version into a new folder, but this is not a standard Windows file dialogue box, and doesn't provide the facility to create a new folder, so I had to switch out to Windows Explorer to do this before publishing the document in RiverDocs.
But, it turns out the file name entered into this dialogue box is actually used as a folder name, which will be created for you and into which the document is published. These sorts of interface issues are symptomatic of the software's relative youth, and will no doubt be ironed out as the product matures.
Publishing this document took less than a second. Here are the results:
The default settings produce HTML documents with an XHTML 1.0 Transitional doctype, generating a separate HTML file for each page of the source PDF, an index HTML document containing a generated table of contents, a single CSS file and an images folder containing converted images. The CSS is valid, and attempts to mimic the style of the original document as closely as possible.
As a comparison I ran the same file through Abbyy's PDF Transformer, another PDF to HTML conversion tool. The results were vastly less impressive:
The Abbyy software makes no attempt to produce structured HTML, instead presenting every single line in the document as a paragraph and styling them to appear as closely as possible to the original PDF.
In general the quality of the default output from RiverDocs is extremely impressive. In this case there were just two validation problems: an unclosed list item in the generated table of contents, and missing alt attributes for the images on the final page. Since the default output is "section based" the parser moved the words "GUIDANCE NOTES" onto a page by itself despite displaying it as part of the title page in the preview pane, which was the only deviation from the page layout of the original.
But this isn't a fair test of the software which wasn't designed to be operated in this manner. While the results are good, they aren't good enough to publish without manual editing, so let's try again, only this time using some good old human judgement.
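The missing alt attributes noted above are easy to check for in any published output. The following is a minimal sketch using only the Python standard library; it is not part of RiverDocs, and the function name `find_images_without_alt` is my own invention for illustration.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> element that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        # Self-closing XHTML tags (<img ... />) are also routed here
        # by the default handle_startendtag implementation.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.missing.append(attributes.get("src", "(no src)"))


def find_images_without_alt(html_text):
    """Return the src values of images that lack an alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html_text)
    finder.close()
    return finder.missing
```

An empty `alt=""` is deliberately not reported, since that is the correct mark-up for a purely decorative image. Whether an image really is decorative still needs a human decision.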
Test 2 - getting serious
For the second test I wanted to take the same document but publish it to a single HTML document of the highest quality as close to the original format as possible. The process is the same - open the file to be converted, and click Convert.
Before getting stuck into the document itself I wanted to specify some metadata for it. Fortunately RiverDocs makes this very easy to do (just click the Metadata button), and provides a default set of Dublin Core elements for completion:
It appears that additional user-defined elements can be created, so publishers in UK government for example can easily add eGMS metadata to converted documents:
Unfortunately these additional elements didn't make it to my published document, a bug I've reported to Big River.
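Until that bug is fixed, extra metadata can be added to the published file by hand. As a rough sketch of what that mark-up looks like, the Python below renders metadata as `meta` elements using the common `DC.` naming convention for Dublin Core in HTML. The function name, the sample values and the use of an `eGMS` prefix are assumptions for illustration, not output copied from RiverDocs.

```python
from html import escape

# Conventional declaration of the Dublin Core element set in an HTML head.
SCHEMA_LINK = '<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />'


def render_meta(elements, prefix="DC"):
    """Render (name, value) pairs as XHTML meta elements, one per line."""
    lines = []
    for name, value in elements:
        # Escape the value so ampersands and quotes stay valid in the attribute.
        content = escape(value, quote=True)
        lines.append('<meta name="%s.%s" content="%s" />' % (prefix, name, content))
    return "\n".join(lines)


if __name__ == "__main__":
    print(SCHEMA_LINK)
    # Hypothetical values, for illustration only.
    print(render_meta([("title", "Example guidance note"), ("language", "en")]))
    print(render_meta([("accessibility", "Double-A")], prefix="eGMS"))
```

The output can be pasted into the `head` of the published HTML file.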
RiverDocs offers a number of options to customise the output of the converted document. The most important are:
- Publish mode: Can be single file, section based (default) or page based. Section based splits the document into sections based on a heading level specified by the user.
- HTML Tidy configuration: RiverDocs uses the HTML Tidy library to identify and report issues with converted documents. The checking level can be set to A, AA or AA (Strict). It's not clear from the documentation what the difference between AA and AA (Strict) is.
- CSS Options: Although the software makes a fair attempt to reproduce the style of the original document, it's likely that most publishers will want to use established in-house styles for publishing to the web. RiverDocs has full support for external stylesheets, allowing the specification of a local file (which will allow you to preview and edit the document with your styles applied) and a relative path to be used when the document is published. The option to use a remote stylesheet would be a welcome addition.
- HTML Navigation: Finally, the navigation elements which allow the user to move from page to page or section to section of the published document can be renamed, or disabled.
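To make the section based publish mode concrete, here is a small sketch of the idea: cut the mark-up immediately before each heading of the chosen level. This is my own illustration, not RiverDocs code. It uses a regular expression, which is fragile on real-world HTML, and it assumes Python 3.7 or later, where `re.split` accepts patterns that match an empty string.

```python
import re


def split_into_sections(html_text, level=2):
    """Split an HTML fragment into chunks, each starting at an <hN> of the given level."""
    # Zero-width lookahead: split *before* the heading so it stays with its section.
    boundary = re.compile(r"(?=<h%d\b)" % level, re.IGNORECASE)
    parts = boundary.split(html_text)
    # Anything before the first matching heading becomes its own leading section.
    return [part for part in parts if part.strip()]
```

With `level=2`, a document containing a title, an introduction and two second-level headings yields three chunks: the title and introduction, then one chunk per heading.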
For many users the area of the application where most time will be spent is the HTML editor, where the converted output can be modified and fine-tuned. In most cases this will be to either match the original document or to conform to a house web publishing style.
The editor always presents the output document in a page-by-page format, regardless of the publish mode that's currently set. It would be nice to be able to preview the single page and section-based options.
The editor can be used visually in preview mode, or in source mode which provides a simple text editor view of the document page you're working on. As I wanted a single file output and had set the options accordingly there was something of a disconnection between working on a separate HTML file for each page, and the intended output. As far as I can see there is no way to preview the single file output prior to publishing.
The toolbar provides the standard editing tools you'd expect to find in a simple HTML editor. These generally work as expected, although there are some quirks - for example undo will only remember changes you've made until you switch to source mode: so if you make a change, switch to source mode and back to visual mode, you'll need to correct any errors manually in source mode.
Once you've got used to the way the editor functions it's a reasonably comfortable working environment, but don't expect it to have the functionality of DreamWeaver. I can foresee many users doing the initial conversion in RiverDocs and taking the published output into the editor of their choice to complete the process: indeed if I was using RiverDocs on a daily basis to convert a large number of files this is the way I'd work - the software's value lies in its conversion capabilities, not its editing capabilities.
One of the most common problems that will arise from automatic conversion is that of images and appropriate alt attributes. Editing images is easy - select the image in the editor, and click the image icon:
The id is a temporary value used by the software during conversion and editing, and is removed on publishing.
One very nice feature of RiverDocs is the screen capture tool. On the final page of the original PDF is a diagram showing a cross-section of a wall, with some labels indicating particular features of the diagram. Since the PDF was generated from Adobe PageMaker, the diagram consists of an image object and a series of text objects for the labels. In the automatic conversion RiverDocs quite rightly converted these separately, which can be seen on the last page of the output of test 1.
In my final version I want the image and labels as a single image, and this is where the screen capture tool comes in:
It operates like any screen capture tool you've used before - highlight the area to be captured and click an icon. In RiverDocs the highlighted area will be inserted into your HTML document as an image.
You've got issues
The software provides assistance to help you identify and correct potential issues with the converted document. The Issues icon gives a quick idea of the number of issues identified by the software at any stage after automatic conversion. Clicking the icon opens a third pane with details of the issues:
The potential issues highlighted include missing alt attributes on images. I was disappointed to note that alt text from objects in tagged PDFs wasn't carried across to the converted HTML document. Otherwise the guidance provided by the issues is sound, based as it is on HTML Tidy - those of you familiar with the Tidy extension for Firefox will know what to expect.
For non-expert users this provides an extremely useful indication of where there are potential problems in the converted document, and the separation of current page issues and whole document issues guides such users through the document with ease. Personally I was more comfortable editing the document first before using the issues tool - picking up the issues I could see, modifying structure, adding or correcting alt attributes, generally tidying the document up - but that's probably no more than a reflection of my workflow habits.
Test 2 results
Here's the output:
It took 10 minutes from opening the original PDF to publishing this version - very impressive results in such a short space of time.
Test 3 - getting more complex
To really test the software we need something a little more complex than a single-column, text and images document. On the Clackmannanshire Council website I found a 24 page consultation document laid out in 2 columns, which included multiple levels of headings and a data table:
The untouched output from RiverDocs shows its limitations, but is still an impressive result:
It took me about 30 minutes to tidy the document up in RiverDocs, but I was still left with a lot of redundant classes with names like "font19" and all those named anchors generated for the table of contents. Cleaning up the mark-up in RiverDocs proved to be a bit of a chore, so I tried again, this time dumping the output immediately into DreamWeaver.
15 minutes later I had this clean, structured version of the PDF:
My conclusion - if your document is anything more complex than single-column text then forgo the RiverDocs editor in favour of your favourite HTML editor.
Test 4 - Microsoft Word
So what about Microsoft Word conversion? Well, this review was produced in a simple Word document, so I ran it through the RiverDocs Converter for publishing online. Here is the untouched conversion:
This was a 12 page Word document, and conversion took noticeably longer than PDF conversion, at about 20 seconds. The only real issues with the conversion were the failure to convert Word bullets to HTML lists and the failure to pick up alternative text on images. Other than this the structure was accurately represented and the images correctly positioned.
The converter doesn't appear to parse the styles used in Word documents - I converted a test document which was styled throughout as paragraphs, but with headings made bold with larger font sizes. RiverDocs therefore accommodates poorly authored, unstructured source documents, by analysing the font size and weight and assigning heading levels accordingly. This is a great feature given the preponderance of incorrectly produced Word documents in many organisations.
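RiverDocs doesn't document how its parser does this, so the following is only a guess at the general technique, not the product's algorithm. It treats the most common font size as body text and promotes bold runs in larger sizes to headings, largest first. The input format, a list of `(text, font_size, is_bold)` tuples, is an assumption made for the sketch.

```python
from collections import Counter


def infer_heading_levels(runs):
    """Map (text, font_size, is_bold) runs to (tag, text) pairs."""
    if not runs:
        return []
    # Body size: the font size that covers the most characters.
    characters_per_size = Counter()
    for text, size, _bold in runs:
        characters_per_size[size] += len(text)
    body_size = characters_per_size.most_common(1)[0][0]
    # Bold sizes above body size become headings; the largest is h1.
    heading_sizes = sorted(
        {size for _text, size, bold in runs if bold and size > body_size},
        reverse=True,
    )
    level_for_size = {size: min(i + 1, 6) for i, size in enumerate(heading_sizes)}
    result = []
    for text, size, bold in runs:
        if bold and size in level_for_size:
            result.append(("h%d" % level_for_size[size], text))
        else:
            result.append(("p", text))
    return result
```

Bold text at body size stays a paragraph here, which is one of several places where a real parser would need more context, such as line length or position on the page.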
Given the immaturity of the package there are some inevitable annoyances with the interface and output:
- Table of contents: There appears to be no way to disable the table of contents for a document you wish to publish as a single page. This means superfluous named anchors are scattered throughout the output HTML, and removing them within RiverDocs is only possible in source mode. In a long document this can quickly become tedious. It would also be an improvement if the TOC used ids rather than named anchors.
- Keystrokes in source mode: Some of the standard Windows keystrokes have been hijacked in source mode - for example Ctrl+A should highlight the entire text, but instead pops up an "insert anchor" dialogue - worse still, cancelling that dialogue inserts a single "a" character into the source.
- Give me vanilla output: I'd love to see the option to output plain, vanilla HTML with no ids, named anchors, classes or other generated content. In many organisations the HTML output will be dropped into a template where headings, paragraphs and other elements are already styled by a surrounding div or the document body itself. (Note: I did manage to suppress the proliferation of classes like "font19" by specifying a blank CSS file in the output options.)
- Mark-up issues: The mark-up is sometimes sub-optimal, for example:
<p class="font9"><span class="font9"><strong>NOTE: Some chimneys act as a buttress and provide support to long walls.</strong></span> <strong>Please check with Building Control or a structural engineer</strong><span class="font9"><strong>, before</strong></span> <span class="font9"><strong>proceeding, to determine if this is the case.</strong></span></p>
None of these are major problems though, and I would expect the interface to improve as the software is developed further. The key feature of the product is the conversion algorithm, which is extremely impressive.
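For anyone who wants vanilla output today, the redundant mark-up in the example above can be stripped after publishing. The sketch below is my own post-processing idea, not a RiverDocs feature: it removes `span` elements and `class` and `id` attributes, then merges adjacent `strong` elements. It is intended for body fragments only, since comments and doctypes are discarded, and it leaves named anchors alone.

```python
import re
from html import escape
from html.parser import HTMLParser


class VanillaWriter(HTMLParser):
    """Re-emit HTML without span wrappers or class/id attributes."""

    DROP_TAGS = {"span"}
    DROP_ATTRS = {"class", "id"}

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _open(self, tag, attrs, closer):
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name not in self.DROP_ATTRS
        )
        self.out.append("<%s%s%s" % (tag, kept, closer))

    def handle_starttag(self, tag, attrs):
        if tag not in self.DROP_TAGS:
            self._open(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        # Preserve XHTML-style self-closing tags such as <img ... />.
        if tag not in self.DROP_TAGS:
            self._open(tag, attrs, " />")

    def handle_endtag(self, tag):
        if tag not in self.DROP_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def to_vanilla(html_text):
    """Strip generated wrappers and attributes, then merge adjacent <strong> runs."""
    writer = VanillaWriter()
    writer.feed(html_text)
    writer.close()
    cleaned = "".join(writer.out)
    # "</strong> <strong>" becomes a single space inside one strong element.
    return re.sub(r"</strong>(\s*)<strong>", r"\1", cleaned)
```

Run against the paragraph quoted above, this should reduce it to a single `p` containing one `strong` element.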
RiverDocs is an impressive product and an essential tool for any organisation which has a need to publish more than a small number of PDF and Word documents online. Simple documents take no time at all to convert and tidy using the RiverDocs editor, while I found more complex documents are best converted in RiverDocs and then edited in a more powerful and functional dedicated HTML editor such as DreamWeaver.
The true value of RiverDocs lies in its ability to turn unstructured, multi-column PDF documents into structured HTML documents, whilst maintaining the correct reading order. Critically, the intelligent parsing engine compensates for low-quality source documents, previously a real barrier to producing HTML versions of PDF and Word documents.
Future versions of RiverDocs are very likely to offer significant improvements, both in quality of conversion and in the application interface. RiverDocs Limited is a single-product company, concentrating solely on the development of the RiverDocs Converter, and it also funds applied research at Queen's University Belfast and at other universities engaged in the fields of accessibility, artificial intelligence and character recognition.
About the reviewer
Dan Champion has worked in the web industry since 1995 through his company Champion Internet Solutions Limited, with clients in the private and public sectors. Between 1999 and 2007 he was responsible for Clackmannanshire Council's multi-award winning websites.
He is a regular speaker on the subjects of web accessibility, web standards and web strategy at conferences and workshops throughout the UK, has written on the subjects of e-government and web accessibility for the Guardian, and featured on national BBC Radio in various guises.