27 April 2008

Text alignment on Mac

When you start a translation, it is important to prepare your reference materials so that you can use them in the most efficient way possible.

Computer Aided Translation (CAT) tools have a special way to do that. They can use translation memories (TM) that contain source and target language information that will be matched against the source text to provide translation suggestions.

Using translation memories has two major benefits. The first is that any text present in the memory that also appears in a similar form in the file to translate can have its TM translation recycled in your work. The other benefit is that the translation memory, if properly used, will increase the stylistic consistency of your final work.
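
To make the "similar form" idea concrete, here is a minimal sketch of how a memory lookup could work. It is not how any particular CAT tool computes its matches: the `memory` dictionary, the `suggest` function and the 0.7 threshold are all made up for illustration, and the similarity score simply comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory translation memory: source segment -> target segment.
memory = {
    "Open the file.": "Ouvrez le fichier.",
    "Save the file.": "Enregistrez le fichier.",
}

def suggest(segment, memory, threshold=0.7):
    """Return (score, source, target) tuples for similar TM entries, best first."""
    matches = []
    for source, target in memory.items():
        # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
        score = SequenceMatcher(None, segment, source).ratio()
        if score >= threshold:
            matches.append((score, source, target))
    return sorted(matches, reverse=True)

if __name__ == "__main__":
    for score, source, target in suggest("Open the files.", memory):
        print(f"{score:.2f}  {source}  ->  {target}")
```

Real tools usually work on words and tags rather than raw characters, so take the scores here as a rough illustration only.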

Creating TMs is also called aligning bilingual texts. The end format will depend on your CAT flavor, but the standard today is TMX (Translation Memory eXchange), an XML dialect maintained by LISA.


There are 3 ways to align text.

  1. by hand
  2. with non-free software
  3. with free software (including free of charge)

I used to do it by hand: copy the texts into two text editor windows with line numbers displayed, hack the contents so that strings on the same line number correspond, then paste TMX code all over this.

In a good text editor, you can do all the work with the keyboard using only shortcuts. The "paste TMX code all over this" part is a little tricky, but some smart people have created simple scripts in Perl or Python to ease the pain.
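
For the curious, the "paste TMX code" step could look roughly like the sketch below. This is not one of the scripts mentioned above, just a minimal illustration that assumes the two texts have already been hacked so that line N of the source corresponds to line N of the target. The function name `build_tmx` and the header values (`creationtool` and so on) are placeholders of my own.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced attribute out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(source_lines, target_lines, srclang="en", tgtlang="fr"):
    """Build a TMX 1.4 string from two lists of line-aligned segments."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "align-by-hand-sketch",  # placeholder value
        "creationtoolversion": "0.1",            # placeholder value
        "datatype": "plaintext",
        "segtype": "sentence",
        "adminlang": "en",
        "srclang": srclang,
        "o-tmf": "text",                         # placeholder value
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip pairs where either side is empty
        tu = ET.SubElement(body, "tu")
        for lang, text in ((srclang, src), (tgtlang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text  # text is escaped on output
    return ET.tostring(tmx, encoding="unicode")

if __name__ == "__main__":
    print(build_tmx(["sentence"], ["phrase"]))
```

To use it on real files, read each file with `open(path, encoding="utf-8").read().splitlines()` and pass the two lists to `build_tmx`.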

Then I bought Heartsome's Translation Suite, a set of Java applications for translators. The set includes TMXEditor, a TMX file editor as its name says, and I did most of my alignment there for a while. I've always had mixed feelings about TMXEditor. There are display glitches, it does not seem easy to work only with the keyboard, it uses a lot of memory... TMXEditor does a few things very well (TMX merging and various checks), but on Mac, it is not the best tool for aligning texts.

The best tool (for now) on Mac is a native application called Appletrans, previously known as Alair. Appletrans had been on my hard disk for so long that I had almost forgotten about it, always promising myself that I'd test it to write a blog entry about it.

Appletrans is a text editor for translators. It is available free of charge directly from Apple, from their localization page, and besides being a very nice aligner, it is also a full-fledged CAT tool that a number of people have adopted as their tool of choice.

The following is an introduction to text alignment in Appletrans. I'd like to thank Steven DeWitt for helping me when I was lost in the shortcuts and for confirming that what follows is not merely the product of my feverish imagination.

Aligning text in Appletrans


  1. Prepare the files


  2. (This part is very well explained in the Appletrans manual. Don't hesitate to refer to it.)


    1. Appletrans does not open .doc files.
      → save the files to align in RTF format using TextEdit

      Appletrans can also open a number of other file formats by default, and plugins are available to add even more file formats.



    2. Open the source file and the target file from the Finder or in Appletrans.
      → in the Finder, right-click, Open With, Appletrans should appear in the list.

      The files should be displayed with most of their styling but without images, if any were present in the original files. Also, the file names now come with an .alair extension in place of the .rtf extension (see the title bar).



    3. Segment the two files (repeat the procedure for both files)
      → Do not select any contents in the opened files
      Tool menu, Segment submenu, Segment

      A dialog appears. Select the segmentation type you want in the drop-down menu and press Segment All. You'll see small orange markers at the beginning and end of each segment Appletrans has created for you.



    4. Let Appletrans know that the two files are to be synchronized. Do this for both files.
      Tool menu, Synchronize

      A dialog appears. Enter the language of the file.


      The synchronization causes the display to change a little bit. Use Cmd+1 (or Cmd+2) on the frontmost text and you'll see that the segments defined in that window are somehow linked to segments in the other window.

      By doing that, you can already see that some source segments are not associated with the correct target segment. The alignment process is about correcting such association mistakes.


  3. Correct the default segment associations


  4. (This part is not as clear in the user manual and required a bit of guessing.)

    You now have two windows open:

    1. The segmented source file
    2. The segmented target file


    Here are the Appletrans-specific shortcuts that you will need to modify the alignment:

    Cmd+1 (Tool menu, Segment submenu)

    → selects the next segment and shows the associated segment in the other window

    Cmd+2 (Tool menu, Segment submenu)

    → selects the previous segment and shows the associated segment in the other window

    It is also possible to select any segment in the text by clicking on one of its orange segment markers.

    Opt+Cmd+R (Tool menu, Segment submenu)

    → Restore: removes the segmentation for the selected segments. At least one full segment must be selected for the action to work.

    Opt+Cmd+S (Tool menu, Segment submenu)

    → Segment Selection: makes a segment from the current selection. No need to go through the Segment dialog again!



    Now, here are some practical standard shortcuts that will make your life easier.

    Arrows

    → moves the cursor around the window

    Shift+arrows

    → selects while the cursor is moving

    Delete

    → deletes the selected part (segment or text)

    Cmd+X, Cmd+V

    → standard cut, paste that you can use to move segments or text around



    Merge segments


    • Select the segments to merge.
    • Press Opt+Cmd+R (Restore) to remove their original segmentation.
    • Press Opt+Cmd+S (Segment Selection) to make a segment from the selection.



    Split a segment


    • Select the segment to split.
    • Press Opt+Cmd+R (Restore) to remove its original segmentation.
    • Select the part you want to make a segment out of.
    • Press Opt+Cmd+S (Segment Selection) to make a segment from the selection.
    • Proceed similarly with the remainder of the original segment until every part is a segment.


    It is also possible to cut and paste segment contents around to achieve the same result. You may end up with empty segments that will have to be deleted. Do whatever best fits your workflow.

    In the system shortcuts (see System Preferences, Keyboard Shortcuts), you should find an entry called "Move focus to next window in active application".

    I have set this shortcut to Cmd+Esc, so that I can Cmd+Tab to navigate the running applications and Cmd+Esc to navigate the open windows of the frontmost one.

    Imagine the following scenario:

    Cmd+1: you select the next segment and notice that it is not associated with the right segment in the other window.

    Cmd+Esc: you go to that window and do what you have to do there. When the segments are properly aligned, you don't need to go back to the first window, just proceed with Cmd+1.

    Anyway, with the above indications, you should be able to correct all the segment associations in the files using only the keyboard, thus saving a huge amount of time.

  5. Create the alignment file


  6. The purpose of all this is of course to create an aligned file that you will later use for reference in your favorite CAT tool.

    Appletrans allows you to save such a corpus in the familiar TMX format that most CAT tools support.

    First, you need to create a new corpus that will contain the data you just aligned.

    File menu, New Corpus


    A new dialog should be displayed, but you don't have to worry about it. Click on either of the two text windows that you have just aligned.

    Now, to save your data:

    Tools menu, Build Corpus


    Appletrans will be busy for a few seconds and then will release the focus.

    If you go back to the Corpus dialog, you will notice that the upper left red light now has a black dot in it, which indicates that the corpus has been modified.

    To create the final TMX:

    File menu, Save As


    Enter a relevant file name and select TMX Format from the File Format drop-down menu. Then save.

    The TMX that you have just created is a TMX 1.4 file that contains only textual information. All the styling that was present in the RTF files has been removed. It is thus a TMX 1.4 level 1 file.

    A typical Appletrans-created TMX file will look like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <tmx version="1.4">
    <header creationtool="AppleTrans" creationtoolversion="38" datatype="unknown" segtype="sentence" adminlang="en" srclang="en" o-tmf="AlairCorpus">
    </header>
    <body>
    <tu>
    <tuv xml:lang="en"><seg>sentence</seg></tuv>
    <tuv xml:lang="fr"><seg>phrase</seg></tuv>
    </tu>
    </body>
    </tmx>


    You may want to change the srclang attribute since Appletrans defaults to "en". If you use this TMX in OmegaT, the change won't be necessary as long as the xml:lang attributes of the two tuvs correspond to language variants of the languages you set when you created the project.
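
If you would rather not edit the header by hand, the change can be scripted. The following is only a sketch using Python's standard `xml.etree.ElementTree`, and the function name `set_srclang` is my own. Note that writing the tree back out this way drops the XML declaration and any DOCTYPE line, so you may need to add them back.

```python
import xml.etree.ElementTree as ET

def set_srclang(tmx_text, srclang):
    """Return the TMX text with the header's srclang attribute replaced."""
    root = ET.fromstring(tmx_text)
    header = root.find("header")
    if header is None:
        raise ValueError("no <header> element found")
    header.set("srclang", srclang)
    # xml:lang attributes on the tuv elements are left untouched.
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    sample = (
        '<tmx version="1.4"><header srclang="en"/><body><tu>'
        '<tuv xml:lang="ja"><seg>bun</seg></tuv>'
        '<tuv xml:lang="fr"><seg>phrase</seg></tuv>'
        '</tu></body></tmx>'
    )
    print(set_srclang(sample, "ja"))
```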

  7. Validate the TMX contents

  8. Appletrans (like a number of other tools) does not ensure that the TMX contents perfectly follow the TMX standard. In some cases, the textual contents that you have just aligned and converted will contain characters that should not be included in a TMX file. To ensure that the TMX you have just created does not contain such characters, you are going to need another utility.

    Maxprograms, the creator of a number of translation-related tools, has released a free TMX validation utility that will be put to use here.

    Launch TMXValidator, and instead of using Validate File in the File menu, use Clean Invalid Characters from the same menu.

    TMXValidator will ask you to select your TMX file. After a very short time, the main window should display "File cleaned". No need to save, the file has already been modified.
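
I don't know exactly which characters TMXValidator removes, but my assumption is that the problem characters are mostly control characters that XML 1.0 does not allow. Under that assumption, a rough equivalent of the cleaning step could be sketched as follows. The function name is made up, and this is not TMXValidator's actual code.

```python
import re

# Anything outside the character ranges allowed by the XML 1.0 specification:
# tab, line feed, carriage return, U+0020-U+D7FF, U+E000-U+FFFD,
# U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def clean_invalid_characters(text):
    """Remove characters that are not allowed in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", text)

if __name__ == "__main__":
    dirty = "<seg>sen\x0btence\x00</seg>"
    print(clean_invalid_characters(dirty))
```

To apply it to a file, read the TMX as UTF-8 text, pass it through `clean_invalid_characters` and write the result to a new file, so that the original stays around in case something goes wrong.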

    You can now use the TMX file in any CAT tool that supports TMX files.






Links



About LISA:
TMX, XLIFF, etc...

Perl and Python scripts to create TMX files are available from the OmegaT official page.

Heartsome's page is here. You can download the software set and use it without limitations for 30 days.

Appletrans can be found on Apple's Localization Tools page, here.
There is a very active support group hosted by Yahoo Groups.

Maxprograms has been around for a while but used to limit itself to delivering free utilities, some of them distributed with the Heartsome tool set. Now it has a full-fledged XLIFF editor, Swordfish, along with all the smaller utilities, which are all very useful.

10 April 2008

OSX in Arabic!

I was wondering how much news I'd get from reading the Mac-related French sites, and until now I've been disappointed to see only translations of the English news.

This morning, something that was not reported on the English sites made its way to my RSS page... The Arabic localization of OSX! The site is Mac Génération, and it reported on the release of an Arabic kit for OSX.

The release is available for OSX 10.5.2 as a .dmg package. Looking at the release page, one can see an Intel 10.4.10 localization package is also available.

Very good news for the Arabic OSX users!