Mac For Translators: OpenOffice.org


Excel files with colored non translatables...

Here comes an excel file, with pseudo HTML in the cells.

The HTML tags are red and must not be modified. If you want to see how the segments look, check the follow-up post.

Translating that in OmegaT is relatively straightforward:

save the file as ODS in OpenOffice.org (or NeoOffice)
put that source file in the /source/ folder of your OmegaT project
load the project and translate

The problem is that not only will all the HTML tags be displayed as they are within the translatable text, but you will also have to deal with the red color tags that surround all the HTML...

Not user friendly at all...

Another solution is to do like this:

copy-paste the column into a text file -> no more red color, will deal with that later
insert a visible marker like @@@ at each end of line
save the file as .html -> no more full HTML tags in the segments
put in /source/
go to Options > Segmentation and add 2 rules. One where you segment before @@@ and one where you segment after @@@, that way you'll nicely isolate the marker and it will be translated only once
load, translate
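
The marker step above can also be scripted instead of done by hand. This is only a minimal Python sketch: the function name add_markers is made up for illustration, and it assumes the copied column is plain text with one cell per line.

```python
MARKER = "@@@"  # the visible marker used in the steps above


def add_markers(text: str, marker: str = MARKER) -> str:
    """Append the marker to the end of every line (one line = one cell)."""
    return "\n".join(line + marker for line in text.splitlines())


if __name__ == "__main__":
    cells = "<b>Hello</b> world\nSecond <i>cell</i>"
    # Each line now ends with @@@, ready to be saved as an .html file.
    print(add_markers(cells))
```

The output can then be saved with an .html extension and put in /source/ as described.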

The resulting file should contain all the original tags, without modifications, but some characters in the original may have been converted to HTML references. Replace those with the original character if you think it is better.

Now, open your file in a text editor, remove the @@@ markers and paste the contents into a Write page in OpenOffice.

There, do a "Regular Expression" search for the string: (<[^>]*>)

The string means: a "<", followed by any number of characters other than ">", followed by a ">" (basically any HTML tag). The surrounding parentheses store the matched string for later retrieval!

and replace with "&", using the style "font color=red". "&" stands for the string that was just matched.

All your HTML tags should be colored in red now.
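
To check what the expression catches before running it in Writer, the same pattern can be tried in Python. This is just a sketch: Python cannot apply a Writer character style, so the square brackets below are a stand-in (my assumption) for "color this in red", and the function names are hypothetical.

```python
import re

# Same expression as above: "<", anything but ">", then ">".
TAG = re.compile(r"(<[^>]*>)")


def find_tags(text: str) -> list:
    """Return every pseudo-HTML tag found in the text, in order."""
    return TAG.findall(text)


def mark_tags(text: str, before: str = "[", after: str = "]") -> str:
    """Wrap each tag in visible delimiters (stand-in for the red style)."""
    return TAG.sub(lambda m: before + m.group(1) + after, text)


if __name__ == "__main__":
    sample = "<b>Hello</b> world<br/>"
    print(find_tags(sample))
    print(mark_tags(sample))
```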

Copy-paste the contents into the original file where it needs to be, and deliver !!! Also make sure that one line corresponds to one cell (manipulating the @@@ marker should not change the overall structure but one never knows!)
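
A quick way to do that check is to compare line counts once the markers are gone. This is a minimal Python sketch with hypothetical function names, assuming both the original column and the translated text are available as plain text.

```python
def strip_markers(text: str, marker: str = "@@@") -> str:
    """Remove every occurrence of the marker from the translated text."""
    return text.replace(marker, "")


def same_line_count(original: str, translated: str) -> bool:
    """True when both texts have the same number of lines (cells)."""
    return len(original.splitlines()) == len(translated.splitlines())


if __name__ == "__main__":
    source = "<b>Hello</b>\nSecond cell"
    target = strip_markers("<b>Bonjour</b>@@@\nDeuxième cellule@@@")
    print(same_line_count(source, target))
```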

(There are probably easier ways to deal with such files. Let me know!)

StarOffice 9 beta for Mac

It looks like Sun Microsystems is making good use of the work of all the volunteers who made OpenOffice.org run on OSX without X11 (i.e. in OSX speak: like a normal OSX application...)

The beta version is available, with the StarOffice logo instead of the OpenOffice.org one.

I have downloaded the beast but if everything is business as usual, we should not see many differences between the 2 suites, except for a few proprietary things added to the StarOffice version.

StarOffice:
http://www.sun.com/software/staroffice/get_beta.jsp (185 mb)
What's new:
http://www.sun.com/software/staroffice/docs/StarOffice9_WhatsNew.pdf

OpenOffice.org:
http://download.openoffice.org/3.0beta/ (157 mb)
What's new:
http://marketing.openoffice.org/3.0/featurelistbeta.html

Also, NeoOffice is still running behind in terms of code base but has recently released a patch for its 2.2.4 version:
http://www.neooffice.org/neojava/en/patch.php

OpenOffice.org 3.0 Beta available !

It is official, OpenOffice.org 3.0 Beta version is available for download.

The feature list is here and you'll be glad to know that support for Microsoft 2007 file format (OOXML) is now a reality !

Also, OpenOffice.org for Mac is now an Aqua application that does not require the X11 windowing environment. Those of you who don't know what that means are blessed !

The stable version is planned for release in September. Although the free office suite is still not considered stable, it is stable enough for most of your non-mission critical work. I've been using test versions of 3.0 for a while now and I have been very pleased with it. I've noticed that it is significantly faster than NeoOffice at launch too.

Feel free to download it from here.

Spellchecking in OmegaT 1.8

This (or something similar) will eventually make its way into the user manual. Meanwhile...

Click on Options > Spell checking...

Indicate where you want OmegaT to look for dictionaries.

If there are valid dictionaries in that location, OmegaT will recognize them and will display them. If the dictionary you want to use is already there and visible to OmegaT, you're done. If that is not the case, proceed with the following:

Click on "Install". This takes a while because OmegaT gets a list of dictionaries from the internet.

OmegaT will display a list of dictionaries, click on the dictionaries you want to install (Cmd+click will do multiple selections on Mac, maybe Ctrl+click will do on other platforms).

After you have clicked "Install", the button will change color, OmegaT will get the files from the internet, and nothing noticeable will happen for a while. Just wait until the button reverts to its "normal" state.

Close.

The new dictionaries will be displayed in the dictionary list.

To use the dictionaries, make sure the language code of the target files corresponds to the dictionary's language code: an FR-FR dictionary will not work with an FR target setting. You need to change the setting to FR-FR to have the spellchecker recognize the correct dictionary for your target.

You don't have to use that interface to install new dictionaries.

Go to OpenOffice.org's dictionary download page and get the files you want.

Uncompress them in the dictionary directory you indicated above.

If OmegaT does not notice them after that install, you can try reloading the project or restarting OmegaT.
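
If the dictionaries still do not show up, it can help to check what is actually in the folder. The Python sketch below rests on an assumption not stated above: that each dictionary comes as a pair of files with the same name, one .aff and one .dic (for example fr_FR.aff and fr_FR.dic), as is usual for OpenOffice.org spelling dictionaries. The function name is hypothetical and this is not how OmegaT itself does the detection.

```python
from pathlib import Path


def installed_dictionaries(directory) -> list:
    """Return the codes (file stems) that have both an .aff and a .dic file."""
    folder = Path(directory)
    aff = {p.stem for p in folder.glob("*.aff")}
    dic = {p.stem for p in folder.glob("*.dic")}
    # Only complete pairs are listed.
    return sorted(aff & dic)


if __name__ == "__main__":
    # Lists the complete pairs found in the current directory.
    print(installed_dictionaries("."))
```

The codes it lists are the ones to compare with the project's target language code.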

Once you have started translating, OmegaT will produce a familiar red wavy underlining for words that are not included in the applied dictionary. A right-click on the word should produce a contextual menu that will display a number of candidates as well as a few options.

People who can't "right-click" because they only have one mouse button can use Command+Click to display the contextual menu. Those of you who have a recent Mighty Mouse from Apple should know that it is quite configurable. Check the System Preferences.

It is also possible to configure some touchpads to simulate a right-click when hitting them with 2 fingers at once. Check your preferences...

OpenOffice.org localization: an easy way to deal with .sdf files

What are .sdf files ?

A few days ago I wrote about OpenOffice.org 2.4 localization update.

For some reason related to the way SUN manages the UI/Help strings, the translation source file comes in a weird format: all the XML "<", ">", etc. are escaped with "\" and the file is structured as a series of two-line pairs, the first line being the en-US original and the second line a placeholder for the target string.

This placeholder sometimes contains the en-US string and sometimes a close approximation of what the translation of the source string would be in the target language. All this is nicely embedded into a lot of meta information that makes the file impossible to parse with normal human senses...

Here is an example (without the meta information):

String in the .sdf:

\<ahelp hid=\".\" visibility=\"hidden\"\>something in the .sdf\</ahelp\>

(.sdf is the extension SUN has created to name the format)
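
To make the escaping concrete, here is a small Python sketch that undoes it for the three cases visible in the example above (\<, \> and \"). The real format may escape other characters too. The function name is made up and this is not taken from any SUN tool.

```python
def sdf_unescape(text: str) -> str:
    """Remove the backslash in front of <, > and " (the cases shown above)."""
    for escaped, plain in (('\\<', '<'), ('\\>', '>'), ('\\"', '"')):
        text = text.replace(escaped, plain)
    return text


if __name__ == "__main__":
    raw = r'\<ahelp hid=\".\" visibility=\"hidden\"\>something in the .sdf\</ahelp\>'
    print(sdf_unescape(raw))
```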

SUN also provides translators with TMX files of the whole UI/Help for a number of languages (de, es, fr, hu, it, ja, ko, nl, pl, pt-BR, pt, ru, sv, zh-CN, zh-TW, at the time of this writing).

The TMX seem to have been created not from the original XML (with nicely encapsulating TMX 1.4 level2 tags) but from the funky .sdf file. Which means that all the original XML tags are found escaped as per the .sdf, alongside the translatable contents...

So the above string would be exactly the same in the TMX:

\<ahelp hid=\".\" visibility=\"hidden\"\>something in the .tmx\</ahelp\>

How to translate that ?

So, how to practically translate such files while making use of the TMX data ?

The no brainer way...

Edit the .sdf file directly, possibly after renaming it to .csv and importing it into OpenOffice.org, where all the {tab} separated meta information fields will nicely fill their own column and leave the translatable contents on its own...

It is not exactly translator friendly... But with a little playing with the column width you'll manage to have only the translatable parts displayed...
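
The same tab-separated view can be obtained outside a spreadsheet. The Python sketch below assumes, and it is only an assumption, that the translatable string sits in the field right after the language code. The sample lines are invented and much shorter than real .sdf lines, and the function name is hypothetical.

```python
import csv
import io


def translatable_strings(sdf_text: str, lang: str = "en-US") -> list:
    """Collect the field that follows the language code on each matching line."""
    reader = csv.reader(io.StringIO(sdf_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)  # quotes are plain characters
    strings = []
    for row in reader:
        if lang in row:
            position = row.index(lang)
            if position + 1 < len(row):
                strings.append(row[position + 1])
    return strings


if __name__ == "__main__":
    sample = ("proj\tfile\ten-US\tHello \\<b\\>\textra\n"
              "proj\tfile\tfr\tBonjour\textra\n")
    print(translatable_strings(sample))
```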

This procedure allows translators to separately (and manually) do searches in the TMX or the glossary (Sun Gloss) and to use the matched contents directly without having to play with the "\" too much.

It is not very practical because the TMX data is embedded in plenty of XML tags and the result is thus not exactly pretty...

The PO way

The PO way is not the best way to leverage the TMX contents. It also requires quite a lot of editing from translators who want to use TMX matches... Still, it seems to be the most common way to localize OpenOffice.org...

PO files are provided by the team coordinators; they are created with the Translate Toolkit's oo2po tool.

The above .sdf contents would be converted like this:

\\<ahelp hid=\\\".\\\" visibility=\\\"hidden\\\"\\>something in the .po\\</ahelp\\>

The reason is that oo2po wants to be smart and adds an extra layer of escape characters (the ugly and ubiquitous "\"). And as you see above, the number of added "\" depends on what has been escaped: a simple [\] will become [\\], but [\"] will become [\\\"] because PO wants to escape both [\] and ["] with another [\]...
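
The doubling described above is easy to reproduce. This Python sketch only mimics the rule as stated here (escape every "\" and every '"' with another "\"). It is not oo2po's actual code and the function name is made up.

```python
def po_escape(text: str) -> str:
    """Escape backslashes first, then double quotes, as PO strings require."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


if __name__ == "__main__":
    sdf_string = r'\<ahelp hid=\".\"\>text\</ahelp\>'
    # Every \ becomes \\ and every \" becomes \\\"
    print(po_escape(sdf_string))
```

Going the other way means removing exactly one layer of "\", which is the extra editing a translator faces when reusing TMX matches in a PO file.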

Now, it does not take much to see that matching that against the TMX data will be a problem. Even if the translator uses a smart PO editor to refer to the TMX there will still be a need to add all the ugly extra "\" that oo2po has added to the .sdf contents.

Basically, oo2po adds a useless extra layer of complexity to an already complex process that also happens to render TMX matching pretty much useless.

The smart way that also happens to really ease the translator's work

Here we are. Now, to keep the post to a reasonable length, let me refer you to the mail I just wrote to the OOo-l10n-dev list where everything is explained.

The idea is basically that, since the TMX matches the structure of the .sdf, then it is easier to work from the .sdf. But to make the TMX really useful it is necessary to make the .sdf contents easily handled by a tool that will also make full use of the TMX contents.

OmegaT for example...

Within OmegaT you can have automatic TMX and glossary (Sun Gloss export) matching, automatic file encoding handling, automatic file naming handling etc...

So, there is a very small Java utility sdf2txt.jar that basically extracts all the translatable contents of the .sdf file and outputs it as a "key=value" format that OmegaT can parse natively.

From there you see what needs to be done...

Basically:

put the extracted files in the /source/ folder of your newly created OmegaT translation project,

put the TMXs in /tm/,

put the glossary files (if any) in /glossary/,

load the project...

and enjoy translating in a Nice and Friendly to the translator Professional yet Free Computer Aided Translation tool....

Another smart but regexpy way...

Before using the CSV trick above ensure that the line pairs are converted so that the 2 lines are put on one line.

To do that in a text editor that supports regular expressions, search for:

^(.*)(en-US)(.*)\r^(.*)(fr)(.*)

replace with:

\1\2\3\t\4\5\6
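
For those who prefer a script to a text editor, here is a rough Python equivalent of this search and replace. It is a sketch under a few assumptions: lines end with "\n" rather than "\r", the language codes are surrounded by tabs, and the target language is "fr" as in the expression above. The function name is hypothetical and the sample lines are invented.

```python
import re


def linearize(sdf_text: str, source: str = "en-US", target: str = "fr") -> str:
    """Join each source/target line pair into one line, separated by a tab."""
    pair = re.compile(
        r"^(.*\t" + re.escape(source) + r"\t.*)\n"
        r"(.*\t" + re.escape(target) + r"\t.*)$",
        re.MULTILINE,
    )
    return pair.sub(r"\1\t\2", sdf_text)


if __name__ == "__main__":
    sample = ("proj\tfile\ten-US\tHello\n"
              "proj\tfile\tfr\tBonjour\n")
    print(linearize(sample))
```

Requiring tabs around the language codes keeps a stray "fr" inside the text (as in "from") from being taken for the language field.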

Now that your .sdf is "linearized", change its name to .csv and open it in OpenOffice by using "tab" as field separator and "nothing" as text delimiter.

The tabs in the original .sdf create a number of columns from where you just need to copy the column with the en-US translatable contents.

Paste that into a text file with the ".utf8" extension, load into OmegaT... Et voilà !

You'll then have to paste the contents of the translated file into the target part of the CSV file and convert the result back to a set of two-line pairs.

The pattern we need to find to revert the 1 line blocks to 2 line blocks is something like:

(something)(followed by lots of en-US stuff)a tab(the same something)(followed by lots of translated stuff)

^([^\t]+)(.*)\t(\1)(.*)$

and we need to replace it with:

\1\2\r\1\4
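
The reverse operation can be sketched in Python as well. The expression below has four groups, following the four-part description above. It assumes lines end with "\n" and that the first field (the "something") only reappears, right after a tab, at the start of the translated half. The function name and sample line are made up.

```python
import re

# (first field)(en-US half) TAB (same first field)(translated half)
LINE = re.compile(r"^([^\t]+)(.*)\t(\1)(.*)$", re.MULTILINE)


def delinearize(text: str) -> str:
    """Split each joined line back into a source line and a target line."""
    return LINE.sub(r"\1\2\n\3\4", text)


if __name__ == "__main__":
    joined = "helpcontent2\tfile\ten-US\tHello\thelpcontent2\tfile\tfr\tBonjour"
    print(delinearize(joined))
```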

Make sure there are no mistakes (if there are any they are likely to appear right in the first lines).

Now you should have your 2 lines block.

Rename the file to .sdf and deliver...

Conclusion

There are plenty of ways to deal with OpenOffice.org's localization files. But to make sure that the contents of the TMX can be fully leveraged (and with close to 70,000 segments, it would be a waste if it were not) there is a real need to avoid the PO files created by oo2po. Problem is, anything that involves the .sdf files directly requires a little bit of massaging...

Ideally, SUN would provide XLIFF files that are created directly from the original XML files (and with empty targets), as well as properly encapsulating TMX files...

Credits

sdf2txt.jar has been created by Alex Buloichik. The word count included in the output may not be 100% exact but the extraction/merge works, which is what matters for now. The code is within the Jar file and the whole thing is GPLed. Thank you very much Alex.

Kazunari Hirano interview

Kazunari Hirano is a long time contributor to the OpenOffice.org Japanese community and has recently been involved with Open Solaris and its localization community.

He was recently interviewed by both Reiko Saito, Japanese Language lead at SUN for Solaris, Java and Sun Java Enterprise System and by Jim Grisanzio, Sr. Program Manager, OpenSolaris Engineering at Sun.

Reiko is also very active in the OpenOffice.org Japanese localization community where she helps us a lot.

The interview has been conducted in Japanese and English and is available on both Reiko's blog and Jim's.

OpenOffice.org 2.4 localization

Almost two weeks since the last post. Amazing how 3 kids can suck your energy into nether...

Today's first post is an announcement.

OpenOffice.org is a free office suite that a lot of translators already use for its compatibility with MS Office and the fact that, well, it is a free download and a free use application. OpenOffice.org is developed in part by Sun Microsystems, contributions come from IBM and other major players in the software industry, and there is a very strong community of users and volunteers that exchange in a variety of languages. The "Native Language Confederation" is where all the non-English things take place.

OpenOffice.org is thus localized by this community of communities under a separate project called, obviously, the "Localization Project".

The currently available version of OpenOffice.org is version 2.3. Version 2.4 is expected to be released sometime at the beginning of March and the localization efforts will thus start very soon.

Translators on Mac who do not use OpenOffice.org but prefer NeoOffice should be aware that all the localization work that goes into OpenOffice.org is automatically "recycled" into NeoOffice.

So, the deal is: you're enjoying a wonderful free office suite, and somehow you feel guilty for not having had to pay for it, or you feel that you'd love to "pay something back" but not being a programmer you are not sure where to start...

Well, you are a translator by trade, aren't you ? Localization is where your skills can be used the best. Here is where you'll be able to find all the necessary information for this version's translation.

There are TMX files available for some language communities and since the source files are in the PO format you can translate them in your favorite CAT tool.

First, get in touch with the translation group within your language community (from the Native Project page: click on your language community, go to the relevant page from there, either "contributions" or "participation" or "projects" etc.) and propose your help !

OpenWordFast

Christmas in November !!!

After the Okapi for Mono package 2 days ago, another package useable on the Mac has just been released: OpenWordFast, a macro for OpenOffice.org that accepts WordFast translation memories.

The project was registered on October 8th, which means that it is still a little early to expect feature parity with WordFast: currently it only accepts 100% matches from the TM... But since the project is free software (GPL) I have no doubt that it will find a lot of contributors.

Update:

I received a mail from Oleg, OpenWordFast's developer after congratulating him for his work:


Hi, Jean-Christophe.

Thanks for your post. But OpenWordFast in the raw Beta stage. I'm not tested it on Mac yet.
Its lacks of vital functionality - Glossary, Terminology Recognition, search of not full match TU. 

But I plan to work on this list in the future releases.

Best regards, Oleg Tsygany.

Here we go !

Office 2007 files (.docx, .xlsx, .pptx) on Mac

(updated to reflect the release of StarOffice 9 for Mac and the OOXML conversion software for Office 2004)

Microsoft Office 2007 for Windows (and its Mac counterpart: Microsoft Office 2008) uses a new file format that has been available for a while now as .docx, .xlsx or .pptx ("x" to distinguish them from the standard MS formats).

The file format is commonly known as OOXML or OpenXML, or more simply as Microsoft Office 2007 format.

Even if the new files don't seem to be very widely used, they sometimes end up on a Mac user's desktop, especially since they are the default file format of the two suites (i.e., you need to jump through a number of hoops to save to a different format)...

What to do when you encounter such files ?

Since I do not own Office 2008 and I did not have the OOXML update for Office 2004 at the time of the writing, I had to test access with OOXML files created with NeoOffice, from "real" Microsoft .doc, .ppt and .xls. All of the test files were pretty complex and quite heavy and had all been created originally on various versions of Microsoft for Windows.

Access through proprietary applications

The iWork '08 way

iWork: $79 from Apple

As far as I can tell, the iWork '08 applications Pages and Keynote opened the .docx and .pptx files I had created without any problems.

And the result was as good looking as the original files. Very impressive.

When I tried to open the .xlsx file, Numbers was considered as the default application (even the converter was not listed) but it was unable to open it correctly. I'll need to have a "genuine" .xlsx file to test Numbers' capacities.

The problem with iWork is that it cannot save a file to the new format. It can save it to the iWork default format or to the old Microsoft format, along with a few other more classical formats.

The Microsoft way

Microsoft Office 2008: $399.95 retail, $284.99 online, $239.95 retail upgrade version, $194.99 online upgrade version from Microsoft.
(The prices given correspond to the cheapest available version for professionals, the "Home & Student" package is not available for commercial activity.)

The Mac equivalent of Office 2007, Office 2008, has been available for a few months already. Office 2008 is the quickest way to access the new file format in a relatively smooth and painless way.

If you don't want to acquire Office 2008, you can download Microsoft's "Open XML File Format Converter for Mac". The application is available from here. It is at the bottom of the page, if the URL has not changed...

The converter requires OSX 10.4.8 or later. Microsoft also says that to view the files, you need either Microsoft Office 2004 11.3.4 or later, or Microsoft Office v.X 10.1.9 or later.

If you also install "Microsoft Office 2004 for Mac 11.5.0 Update" (description available here: http://support.microsoft.com/kb/953824) you'll also enable "Office 2004 for Mac to read and to write Office documents that are in Open XML Format".

The StarOffice 9 (beta) way

StarOffice: $69.95 (StarOffice 8 price, 9 is still beta), from Sun Microsystems.

StarOffice 9 beta is available from here:
http://www.sun.com/software/staroffice/get_beta.jsp

It should work pretty much as OpenOffice.org 3.0 beta. See below.

System wide support on Leopard (OSX 10.5)

Leopard: $129, from Apple.

If you don't (plan to) own any recent version of Office for Mac what can you do ?

Leopard users have the free option of using the new TextEdit. It can open and save the new file format.

OOXML support is system wide, which means that the Finder and other applications will also give you a "quicklook" of such files, although not all files are equal under Quicklook. Some are displayed properly, some are displayed as a white icon and no content is shown... The test .pptx worked, the .docx and .xlsx did not.

So, support is not extremely good and I would not rely on it to check the translatable contents of a client file...

Access through free applications

OpenOffice.org and NeoOffice anyone?

Users on Panther (10.3) and above can use NeoOffice 2.2. NeoOffice is a sister application of OpenOffice.org.

The current available version of the standard OpenOffice.org (2.4) does not include OOXML support but NeoOffice includes special goodies, like OOXML support, that are found in Novell's version of OpenOffice.org, which is, sadly, not available for the Mac...

As of May 7th, the beta version of OpenOffice.org 3.0 is available. This version does include support for OOXML.

As written above, I used NeoOffice to create Office 2007/OOXML files with various degrees of success in terms of interoperability. I am pretty sure NeoOffice could open relatively complex files since the files I fed it for OOXML output were fairly complex, although I'd need to test that.

As text ?

An extreme way to access the contents of such files is to handle them as if they were zipped, unzip them and find the document.xml located somewhere in the folder hierarchy that appears (it would be under /word/ for a Word document). This file is standard XML and can be opened in any text editor.
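
For the curious, the unzip step can be done with a few lines of Python. This is a minimal sketch: the function name is made up, the default path word/document.xml is the Word case mentioned above, and the file name in the example is a placeholder.

```python
import zipfile


def read_document_xml(docx, member: str = "word/document.xml") -> str:
    """Open the .docx as a zip archive and return the XML of the main part."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read(member).decode("utf-8")


if __name__ == "__main__":
    import sys
    # Usage: python script.py some_file.docx (the file name is a placeholder)
    if len(sys.argv) > 1:
        print(read_document_xml(sys.argv[1])[:500])
```

What comes out is raw XML, full of formatting tags, which is why a proper filter is still needed for real translation work.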

To properly access the contents of the file, you'd need to use Okapi's Tikal utility, available for the Mono (free) running environment. Tikal should be able to extract the contents of the XML into an XLIFF file that you can later load into a translation tool...

Translation

Once you have access to the file, you can translate it by overwriting it in the application of your choice. Saving the resulting file to .docx will produce results that vary with the application you used. A best bet would be to save the result to .rtf for delivery.

OmegaT and other Java based applications

If you want to use a translation memory tool, the few I know that directly handle OOXML are OmegaT, Swordfish, the newborn from Maxprograms, and Heartsome's Translation Suite.

Appletrans

If you have converted the file to .rtf or HTML before translation, AppleTrans should be able to handle it directly.

Okapi's Tikal for conversion to XLIFF

Or, as written above, you can use Okapi's Tikal command line utility to convert its contents to XLIFF and translate it in any of the above mentioned applications.

Wordfast

The Microsoft converter opens the file in Word in the RTF format and you can then use WordFast to translate it directly (from within Word 2004 / Word v.X).

OpenLanguageTools

With hacks, you can also translate the document.xml file in OpenLanguageTools.

Have I forgotten your favorite tool ?
