tmx2ske.xsl

tmx2ske: converts TMX translation memory format files (generated by the OCTC) into SketchEngine input files for parallel corpora.

The SketchEngine accepts parallel corpora in the form of two separate files with, minimally, <align> tags (there has to be the same number of them in both files!). This script goes through a single TMX file and looks at the first set of equivalents to determine which (and how many) languages are aligned. It then extracts the segments corresponding to those languages and places them in parallel files, one per language. The files carry the '.txt' suffix that the SketchEngine required as of June 2010 (note that a sequence of <align> elements is not a well-formed XML document, because it lacks a single root element).

If part of the TMX file is malformed, i.e., if some alignments are missing (e.g., most of the <tu> elements contain three equivalents, say, for "pl", "sw", "en", but some contain only two, say, "en" and "pl"), then such <tu> elements are by default skipped altogether, because the primary constraint is: "keep the number of <align> elements equal throughout". (An alternative would be to create an empty <align> element, but it is not clear whether the SkE allows for that.) See also the description of the langs parameter below.

If you know that this is the case in your TMX file, and want to extract only two languages from it (e.g., ignoring "sw", because it is not completely aligned), then you have to pass the list of languages to the script as a parameter. Note that the language identification codes are taken mostly from ISO 639-1 (if an ISO 639-1 code for your language is missing, use an ISO 639-3 code).
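The skipping rule and the first-<tu> heuristic described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the XSLT that tmx2ske.xsl actually runs: the function name extract_aligned and the toy TMX fragment are made up for the example, and the TMX is assumed to mark languages with xml:lang (or the older lang attribute) on each <tuv>.

```python
import xml.etree.ElementTree as ET

# ElementTree spells the xml:lang attribute with the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Toy TMX fragment (hypothetical): the second <tu> has no "sw" equivalent.
SAMPLE = """<tmx version="1.4"><body>
<tu><tuv xml:lang="pl"><seg>dom</seg></tuv><tuv xml:lang="sw"><seg>nyumba</seg></tuv><tuv xml:lang="en"><seg>house</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>water</seg></tuv><tuv xml:lang="pl"><seg>woda</seg></tuv></tu>
</body></tmx>"""


def extract_aligned(tmx_text, langs=None):
    """Return {lang: [segment, ...]} with equally long lists for every language.

    If langs is None, the languages are read off the first <tu>.
    Any <tu> that lacks one of the target languages is skipped as a whole.
    """
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is None or seg is None:
                continue
            segs[lang] = "".join(seg.itertext())
        units.append(segs)

    if langs is None:
        # Heuristic: the first <tu> decides which languages are aligned.
        langs = list(units[0]) if units else []

    result = {lang: [] for lang in langs}
    for segs in units:
        # Keep the unit only if every target language is present.
        if all(lang in segs for lang in langs):
            for lang in langs:
                result[lang].append(segs[lang])
    return result


if __name__ == "__main__":
    print(extract_aligned(SAMPLE))
    print(extract_aligned(SAMPLE, ["en", "pl"]))
```

With the sample, the default call keeps only the first <tu> (the second one lacks "sw"), while passing ["en", "pl"] keeps both, which is the effect the langs parameter is meant to achieve.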

In the future, this script might also use a local OCTC tokenizer, if one exists (at the moment, during import to the SkE, the corpora have to be run through the SkE tokenizer, which seems a more sensible option from the point of view of a SkE user).

Soon to come

This script must be rewritten to use the streaming capabilities of XSLT 2.1 (soon to be XSLT 3.0). This is the absolute priority for it now.

Requirements and advice

In order to run this script, you need the Saxon XSLT 2.0 processor (its Home Edition is open-source and free) and Java or .NET. Kernow, a front-end to Saxon, may be an attractive alternative to using the command line.

YouTube has some instructional videos if you feel lost; there is also a page on invoking Saxon from the command line.

If you are a Windows user inexperienced with handling spaces in the path, do not run this from your Desktop -- create e.g. a "saxon" directory in your C: drive, and work from there. If you are a Windows user not accustomed to the command line, there is a MS PowerToy called "open command line here" that may be helpful -- it lets you right-click on a folder and open the command line right in it (no need to navigate to it across the system; note that it will still be the horrible Windows cmd.exe with e.g. abnormal cut&paste -- replacements for that do exist.)

Usage

java -jar saxon9he.jar -s:INPUT-TMX -xsl:tmx2ske.xsl OPTIONS

for example:

java -jar saxon9he.jar -s:en-hr.tmx -xsl:tmx2ske.xsl

This command line may be expanded in various ways. A list of additional options follows.

-Xmx1024m
This is a Java (JVM) option that sets the maximum amount of memory made available to the task. In this example, the maximum is raised to 1 GB (= 1024 MB). This option must come right after java in your command line, before -jar.
!encoding=UTF-16
This is a Saxon-internal option that allows you to regulate some features of the output file (note the leading '!'). In this example, the encoding of that file is changed from the default UTF-8 to UTF-16.
?langs=('en','hr')
Restrict the extracted languages to the specified sequence (in this case, English and Croatian). The parameter name has to be preceded by a '?', because a sequence of language identifiers can only be passed to Saxon as an XPath expression. This is useful only when the TMX file contains more than two languages and you want to extract alignments between a smaller number of them.
?langs='en' or langs=en
Extract a single language from the TMX. In such cases, you probably also want to do tags=0 (see below).
?secure=false() or secure=0
SkE needs parallel corpora to consist of an identical number of <align> tags. There is no TMX-internal guarantee that a TMX file will always contain the same number of equivalent strings in the same set of languages, so this script by default makes sure that your SkE output is uniform. If you know that your TMX file is well-formed, you can disable this security check, so that large files are processed somewhat faster.
?normalize=true() or normalize=1
Remove excess whitespace. This is largely a cosmetic option.
output_prefix=gizmo/
Add a path or a fragment of file name to the default output file name, which is LG.txt (LG stands for a language code, as in en.txt). In this example, the output will be placed in the gizmo/ subdirectory (subfolder) of the directory from which the script is invoked.
?output_prefix=concat(xs:duration(current-dateTime()-xs:dateTime('1970-01-01T00:00:00')),'_')
Be fancy and prefix the output file name with a duration string reflecting the time elapsed between midnight on January 1, 1970 and the moment of invocation. This is useful to guarantee that the result of the current invocation will not overwrite previously created files. Example: P14783DT19H38M58.899S_en.txt.
tags=0
Do not enclose each extracted string between <align> and </align> tags. Useful to extract monolingual strings from a TMX.
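
Several of the options above contain characters ('?', '!', parentheses, quotes) that POSIX shells treat specially, so typing them unquoted may not work outside cmd.exe. One way around this is to assemble the command as an argument list and run it without a shell. The sketch below does that in Python; build_command and its keyword names are invented for this example, and only the option spellings themselves come from the list above.

```python
def build_command(source, stylesheet="tmx2ske.xsl", jar="saxon9he.jar",
                  heap=None, langs=None, secure=None, normalize=None,
                  tags=None, output_prefix=None, encoding=None):
    """Assemble a Saxon invocation of tmx2ske.xsl as an argument list."""
    cmd = ["java"]
    if heap:
        cmd.append("-Xmx" + heap)  # JVM option: has to precede -jar
    cmd += ["-jar", jar, "-s:" + source, "-xsl:" + stylesheet]
    if langs:
        # The leading '?' makes Saxon evaluate the value as an XPath expression.
        sequence = ",".join("'%s'" % lang for lang in langs)
        cmd.append("?langs=(%s)" % sequence)
    for name, value in (("secure", secure), ("normalize", normalize), ("tags", tags)):
        if value is not None:
            cmd.append("%s=%d" % (name, 1 if value else 0))
    if output_prefix is not None:
        cmd.append("output_prefix=" + output_prefix)
    if encoding:
        # The leading '!' marks a serialization (output) property.
        cmd.append("!encoding=" + encoding)
    return cmd


if __name__ == "__main__":
    print(build_command("en-pl-sk.tmx", langs=["pl", "en"]))
```

The returned list can be handed to subprocess.run(cmd, check=True); because no shell is involved, each option reaches Saxon exactly as written.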

Examples of invocation

If you want to explicitly pass your chosen languages as parameters (if the TMX file contains alignments for more than two languages or if the first <tu> element is misaligned), do

java -jar saxon9he.jar -s:en-pl-sk.tmx -xsl:tmx2ske.xsl ?langs=('pl','en')

If you encounter a "java heap space" error, try adding the -Xmx option to the command line: for example, -Xmx1024m raises the maximum heap size to 1 GB. An example command line to tackle a big English-French TMX file called en-fr.tmx is:

java -Xmx1024m -jar saxon9he.jar -s:en-fr.tmx -xsl:../../tools/xsl/tmx2ske.xsl secure=0 normalize=1

In this example, the script is located in a different directory, the security checks are disabled, and whitespace normalization is turned on.
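
When the security checks are off, it may be worth verifying afterwards that the output files really contain the same number of <align> elements. A minimal Python sketch of such a check follows; it is not part of tmx2ske.xsl, and the names count_align and check_parallel are invented here.

```python
import re

# Matches the start of an <align> element, but not </align> or e.g. <alignment>.
ALIGN_OPEN = re.compile(r"<align[\s/>]")


def count_align(text):
    """Count the <align> start tags in the content of one output file."""
    return len(ALIGN_OPEN.findall(text))


def check_parallel(texts):
    """texts maps a label (e.g. a file name) to that file's content.

    Returns the per-file counts, or raises ValueError if they differ.
    """
    counts = {name: count_align(content) for name, content in texts.items()}
    if len(set(counts.values())) > 1:
        raise ValueError("unequal <align> counts: %r" % counts)
    return counts


if __name__ == "__main__":
    print(check_parallel({
        "en.txt": "<align>house</align>\n<align>water</align>\n",
        "fr.txt": "<align>maison</align>\n<align>eau</align>\n",
    }))
```

For real output, read each file first (for instance with open("en.txt", encoding="utf-8").read()) and pass the contents in the same way.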

Credits and further help

Distributor: Open-Content Text Corpus (http://OCTC.sourceforge.net/)

If you find this script useful, we will appreciate (i) learning about that and possibly (ii) a mention and a link in your project documentation. The former will make it possible for us to let you know of any changes to this tool; the latter is one way of getting compensated in the free world.

For bug reports, please use the OCTC Trac. For discussion, you are welcome at the OCTC forum.

Author:

Piotr Bański

the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).

SVN Id:

$Id: tmx2ske.xsl 382 2010-11-06 18:12:41Z bansp $

XSLT Version:

2.0

Namespace Prefix Summary:

xd - http://www.pnp-software.com/XSLTdoc

xs - http://www.w3.org/2001/XMLSchema

xsl - http://www.w3.org/1999/XSL/Transform

Outputs Summary

#default - source

No short description available

Parameters Summary

xs:string+ langs - source

Set the languages explicitly (useful if the TMX file contains equivalents from more than two languages or if the first <tu> element of the TMX, which serves as a heuristic diagnostic for the number and ID of languages involved, is misaligned)

xs:boolean normalize - source

Normalize whitespace? Some TMX files contain lots of it and you might want to get rid of it

xs:string output_prefix - source

Determine the placement of the output files

xs:boolean secure - source

Play insecurely? If this is passed as '0', the TMX-well-formedness checks are switched off (and you may end up with files with different numbers of <align> elements)

xs:boolean tags - source

Set this to 0 to suppress the <align> tags

Variables Summary

xs:integer chunk_size - source

This is an attempt to reduce memory usage

Match Templates Summary

/ - source

No short description available

Outputs Detail

#default - source

No short description available

Attributes

encoding

UTF-8

method

xml

omit-xml-declaration

yes

Parameters Detail

xs:string+ langs - source

xs:boolean normalize - source

Normalize whitespace? Some TMX files contain lots of it and you might want to get rid of it

The default is to skip this step.
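
To give an idea of what whitespace normalization typically does to a segment, here is a small Python sketch. It assumes that the option behaves like the XPath normalize-space() function (trim both ends, collapse runs of spaces, tabs and line breaks into one space); the documentation above does not say how the script implements it, so treat this as an assumption.

```python
import re

# XML whitespace only: space, tab, carriage return, line feed.
XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Collapse whitespace runs to single spaces and trim both ends."""
    return XML_WHITESPACE.sub(" ", text).strip(" ")


if __name__ == "__main__":
    print(repr(normalize_space("  Hello \n\t world  ")))
```

Characters such as the non-breaking space are deliberately left alone in this sketch, since they are not XML whitespace.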

xs:string output_prefix - source

Determine the placement of the output files

The result has the form: output_prefixLG.txt, where "LG" stands for the language code and ".txt" is appended to please the SkE. The default is simply, e.g., "en.txt", in the directory where the source file is.

xs:boolean secure - source

Play insecurely? If this is passed as '0', the TMX-well-formedness checks are switched off (and you may end up with files with different numbers of <align> elements)

xs:boolean tags - source

Set this to 0 to suppress the <align> tags

Useful for extraction of monolingual material from a multilingual TMX.

Variables Detail

xs:integer chunk_size - source

This is an attempt to reduce memory usage

It nicely fails so far, on big files. Apparently, we need some more intensive garbage collection here instead of such tricks.
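
For comparison, the general idea behind streaming, namely handling one <tu> at a time and discarding it before reading the next, can be sketched in Python with an incremental parser. This is not how tmx2ske.xsl works and not XSLT streaming; iter_units is a name invented for the example, and the TMX is assumed to carry xml:lang (or the older lang attribute) on each <tuv>.

```python
import io
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_units(source):
    """Yield one {lang: text} dict per <tu>, emptying each <tu> after use.

    source is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                segs[lang] = "".join(seg.itertext())
        yield segs
        # Drop the children and text of the processed <tu>; the emptied
        # element itself stays attached to its parent.
        elem.clear()


if __name__ == "__main__":
    data = (b'<tmx><body>'
            b'<tu><tuv xml:lang="en"><seg>house</seg></tuv>'
            b'<tuv xml:lang="pl"><seg>dom</seg></tuv></tu>'
            b'</body></tmx>')
    for unit in iter_units(io.BytesIO(data)):
        print(unit)
```

Because each unit is released right after it is yielded, memory use depends mostly on the size of a single <tu> rather than on the size of the whole file.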

Match Templates Detail

/ - source

No short description available