13 December 2007

OpenOffice.org localization: an easy way to deal with .sdf files

What are .sdf files?



A few days ago I wrote about the OpenOffice.org 2.4 localization update.

For some reason related to the way SUN manages the UI/Help strings, the translation source file comes in a weird format: all the XML "<", ">", etc. are escaped with "\", and the file is structured as a series of line pairs, the first line being the en-US original and the second a placeholder for the target string.

This placeholder sometimes contains the en-US string and sometimes a close approximation of what the translation of the source string would be in the target language. All this is embedded in a lot of meta information that makes the file impossible to parse with normal human senses...

Here is an example (without the meta information):

String in the .sdf:
\<ahelp hid=\".\" visibility=\"hidden\"\>something in the .sdf\</ahelp\>

(.sdf is the extension SUN has created to name the format)
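
To make the escaping concrete, here is a minimal Python sketch that strips and restores the "\" in front of "<", ">" and the double quote. The helper names are made up for this post, and the sketch assumes that only those three characters are escaped; real .sdf files may escape more than that.

```python
import re

# Assumption: a backslash directly in front of <, > or " is an .sdf escape.
_SDF_ESCAPE = re.compile(r'\\([<>"])')


def sdf_unescape(text: str) -> str:
    """Drop the backslash in front of <, > and " (hypothetical helper)."""
    return _SDF_ESCAPE.sub(r"\1", text)


def sdf_escape(text: str) -> str:
    """Inverse operation: put a backslash in front of <, > and "."""
    return re.sub(r'([<>"])', r"\\\1", text)


sample = r'\<ahelp hid=\".\" visibility=\"hidden\"\>something in the .sdf\</ahelp\>'
print(sdf_unescape(sample))
```

Running the escape helper on the unescaped result gives back the original string, so the round trip does not lose anything for this sample.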

SUN also provides translators with TMX files of the whole UI/Help for a number of languages (de, es, fr, hu, it, ja, ko, nl, pl, pt-BR, pt, ru, sv, zh-CN, zh-TW, at the time of this writing).

The TMX files seem to have been created not from the original XML (with nicely encapsulating TMX 1.4 level 2 tags) but from the funky .sdf file. This means that all the original XML tags are found escaped as per the .sdf, alongside the translatable contents...

So the above string would be exactly the same in the TMX:
\<ahelp hid=\".\" visibility=\"hidden\"\>something in the .tmx\</ahelp\>


How to translate that?



So, how do you practically translate such files while making use of the TMX data?

The no brainer way...



Edit the .sdf file directly, possibly after renaming it to .csv and importing it into OpenOffice.org, where the {tab}-separated meta information fields will each fill their own column and leave the translatable contents in a column of its own...

It is not exactly translator friendly... But with a little playing with the column width you'll manage to have only the translatable parts displayed...

This procedure allows translators to separately (and manually) do searches in the TMX or the glossary (Sun Gloss) and to use the matched contents directly without having to play with the "\" too much.

It is not very practical because the TMX data is embedded in plenty of XML tags and the result is thus not exactly pretty...

The PO way



The PO way is not the best way to leverage the TMX contents. It also requires quite a bit of editing from translators who want to use TMX matches... Still, it seems to be the most common way to localize OpenOffice.org...

PO files are provided by the team coordinators, who create them with the Translate Toolkit's oo2po tool.

The above .sdf contents would be converted like this:

\\<ahelp hid=\\\".\\\" visibility=\\\"hidden\\\"\\>something in the .po\\</ahelp\\>


The reason is that oo2po wants to be smart and adds an extra layer of escape characters (the ugly and ubiquitous "\"). And as you see above, the number of added "\" depends on what has been escaped: a simple [\] will become [\\], but [\"] will become [\\\"] because PO wants to escape both [\] and ["] with another [\]...
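
The doubling described above can be reproduced with a few lines of Python. This is only a sketch of generic PO-style string escaping (backslash first, then double quote); it is not oo2po's actual code, and the function name is made up for this post.

```python
def po_escape(text: str) -> str:
    """Escape backslashes first, then double quotes, PO string style."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


# An .sdf string that already carries its own layer of "\".
sdf_string = r'\<ahelp hid=\".\"\>text\</ahelp\>'
print(po_escape(sdf_string))
# prints: \\<ahelp hid=\\\".\\\"\\>text\\</ahelp\\>
```

The order matters: escaping the quotes first would cause the backslashes just added to be doubled again in the second pass.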

Now, it does not take much to see that matching that against the TMX data will be a problem. Even if the translator uses a smart PO editor to refer to the TMX there will still be a need to add all the ugly extra "\" that oo2po has added to the .sdf contents.

Basically, oo2po adds a useless extra layer of complexity to an already complex process that also happens to render TMX matching pretty much useless.


The smart way that also happens to really ease the translator's work



Here we are. Now, to keep the post to a reasonable length, let me refer you to the mail I just wrote to the OOo-l10n-dev list where everything is explained.

The idea is basically that, since the TMX matches the structure of the .sdf, it is easier to work from the .sdf. But to make the TMX really useful, the .sdf contents must be put in a form that can easily be handled by a tool that also makes full use of the TMX contents.

OmegaT for example...

Within OmegaT you can have automatic TMX and glossary (Sun Gloss export) matching, automatic file encoding handling, automatic file naming handling etc...

So, there is a very small Java utility, sdf2txt.jar, that basically extracts all the translatable contents of the .sdf file and outputs them in a "key=value" format that OmegaT can parse natively.

From there you see what needs to be done...

Basically:

  • put the extracted files in the /source/ folder of your newly created OmegaT translation project,

  • put the TMXs in /tm/,

  • put the glossary files (if any) in /glossary/,

  • load the project...


and enjoy translating in a Nice and Friendly to the translator Professional yet Free Computer Aided Translation tool....
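
For readers who want a picture of what the extraction step amounts to, here is a rough Python sketch. It is not the code of sdf2txt.jar: the key scheme (the line number in the .sdf) is invented for this post, and the column positions (language code in the 10th tab-separated field, text in the 11th) are an assumption that you should check against your own file.

```python
LANG_COL = 9   # assumption: 0-based column holding the language code
TEXT_COL = 10  # assumption: 0-based column holding the translatable text


def extract_source(sdf_lines, lang="en-US"):
    """Yield "key=value" lines for every row written in the given language.

    The key is the 1-based line number in the .sdf, so the row can be found
    again at merge time. Hypothetical scheme, not the one sdf2txt.jar uses.
    """
    for number, line in enumerate(sdf_lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) > TEXT_COL and fields[LANG_COL] == lang:
            yield f"{number}={fields[TEXT_COL]}"


def make_row(lang, text):
    """Build a fake 15-field .sdf row for the demonstration below."""
    fields = [""] * 15
    fields[0] = "helpcontent2"
    fields[LANG_COL] = lang
    fields[TEXT_COL] = text
    return "\t".join(fields)


sample = [make_row("en-US", "Open file"), make_row("fr", "Open file")]
for entry in extract_source(sample):
    print(entry)
```

Only the en-US line of each pair ends up in the output; the target line stays in the .sdf until the translated strings are merged back.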

Another smart but regexpy way...



Before using the CSV trick above, make sure that each pair of lines is joined into a single line.

To do that in a text editor that supports regular expressions, search for:
^(.*)(en-US)(.*)\r^(.*)(fr)(.*)

replace with:
\1\2\3\t\4\5\6
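
The same search and replace can be scripted. The sketch below uses Python's re module with the pattern given above; the only change is "\n" instead of "\r" as the line break, which assumes the file has been read as text with normalized line endings. The function name is made up for this post.

```python
import re

# Same groups as the editor pattern; re.MULTILINE lets ^ match at each line start.
PAIR = re.compile(r"^(.*)(en-US)(.*)\n(.*)(fr)(.*)", re.MULTILINE)


def linearize(text: str) -> str:
    """Join each en-US/fr pair of lines into one tab-separated line."""
    return PAIR.sub(r"\1\2\3\t\4\5\6", text)


pair = "mod\tf.xhp\ten-US\tHello\nmod\tf.xhp\tfr\tBonjour\n"
print(linearize(pair))
```

Replace "fr" with your own language code, as in the editor version.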


Now that your .sdf is "linearized", change its extension to .csv and open it in OpenOffice.org, using "tab" as the field separator and "nothing" as the text delimiter.

The tabs in the original .sdf create a number of columns from where you just need to copy the column with the en-US translatable contents.
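
If you would rather not go through the spreadsheet, the column can also be pulled out with Python's csv module, using the same settings: tab as separator and no text delimiter. This is a sketch; the function name is made up, and the column index is left as a parameter because it depends on your file.

```python
import csv
import io


def column(linearized_text: str, index: int) -> list:
    """Return the given 0-based column of a tab-separated text."""
    reader = csv.reader(
        io.StringIO(linearized_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # "nothing" as text delimiter: quotes stay as-is
    )
    return [row[index] for row in reader if len(row) > index]


text = 'mod\tfile\t\\<b\\>say \\"hi\\"\\</b\\>\nmod\tfile\tplain\n'
for cell in column(text, 2):
    print(cell)
```

Disabling quote handling matters here because the strings are full of escaped double quotes that must reach the output untouched.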

Paste that into a text file with the ".utf8" extension, load into OmegaT... Et voilà !

Once the translation is done, you'll have to paste the contents of the translated file into the target part of the CSV file and convert the file back to a set of line pairs.

The pattern we need to find to revert the 1 line blocks to 2 line blocks is something like:

(something)(followed by lots of en-US stuff)a tab(the same something)(followed by lots of translated stuff)

^([^\t]*)(.*)\t\1(.*)$

and we need to replace it with:
\1\2\r\1\3
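
Here is a Python sketch of that reverse step. It uses a variant of the pattern with three groups, where the first group captures the whole first field (everything up to the first tab) and is assumed to be non-empty. The line break is "\n" rather than "\r", and the function name is made up for this post.

```python
import re

# Group 1: first field; group 2: rest of the en-US part; group 3: rest of the target part.
LINEARIZED = re.compile(r"^([^\t]+)(.*)\t\1(.*)$", re.MULTILINE)


def split_pairs(text: str) -> str:
    """Turn each linearized line back into a pair of lines."""
    return LINEARIZED.sub(r"\1\2\n\1\3", text)


line = "mod\tf.xhp\ten-US\tHello\tmod\tf.xhp\tfr\tBonjour\n"
print(split_pairs(line))
```

Because the middle group is greedy, the split happens at the last place where a tab is followed by the first field, so a string that happens to contain that sequence would be cut in the wrong place.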


Make sure there are no mistakes (if there are any they are likely to appear right in the first lines).

Now you should have your 2-line blocks back.

Rename the file to .sdf and deliver...

Conclusion



There are plenty of ways to deal with OpenOffice.org's localization files. But to make sure that the contents of the TMX can be fully leveraged (and with close to 70,000 segments, it would be a waste if it were not) there is a real need to avoid the PO files created by oo2po. Problem is, anything that involves the .sdf files directly requires a little bit of massaging...

Ideally, SUN would provide XLIFF files that are created directly from the original XML files (and with empty targets), as well as properly encapsulating TMX files...

Credits



sdf2txt.jar was created by Alex Buloichik. The word count included in the output may not be 100% exact, but the extraction/merge works, which is what matters for now. The code is within the jar file and the whole thing is GPLed. Thank you very much, Alex.