tmx2ske.xsl
The SketchEngine accepts parallel corpora in the form of two separate files with, minimally, <align> tags (there has to be the same number of them in both files!). This script goes through a single TMX file, looks at the first set of equivalents to determine what (and how many) languages are aligned and then extracts segments corresponding to those languages and places them in parallel files with '.txt' suffixes that the SketchEngine required as of June 2010 (note that a sequence of <align> elements is not a well-formed XML file, because it doesn't have a single root).
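The "same number of <align> tags" requirement can be checked on the output files before uploading them. The following is only a minimal sketch, not part of tmx2ske.xsl: the function names and the sample strings are made up for illustration, and it assumes the output uses plain <align> tags as described above.

```python
import re

# Matches the start of an opening <align> tag (hypothetical helper, not part of the script).
ALIGN_OPEN = re.compile(r"<align\b")


def count_align(text: str) -> int:
    """Count opening <align> tags in the text of one output file."""
    return len(ALIGN_OPEN.findall(text))


def same_alignment_count(*texts: str) -> bool:
    """Return True if all given texts contain the same number of <align> tags."""
    return len({count_align(t) for t in texts}) <= 1


# Made-up sample content standing in for en.txt and hr.txt.
en = "<align>Hello.</align>\n<align>Thank you.</align>\n"
hr = "<align>Bok.</align>\n<align>Hvala.</align>\n"

print(same_alignment_count(en, hr))
```

In practice you would read the contents of the generated LG.txt files and pass them to same_alignment_count().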
If part of the TMX file is malformed, i.e., if some alignments are missing (e.g., most of the <tu> elements contain three equivalents, say, for "pl", "sw", and "en", but some contain only two, say, "en" and "pl"), then such <tu> elements are by default skipped altogether, because the primary constraint is: "keep the number of <align> elements equal throughout". (An alternative would be to create an empty <align> element, but it is not clear whether the SkE allows that.) See also the description of the langs parameter below.
If you know that this is the case in your TMX file, and want to extract only two languages from it (e.g., ignoring "sw", because it is not completely aligned), then you have to pass the list of languages to the script as a parameter. Note that the language identification codes are taken mostly from ISO 639-1 (if an ISO 639-1 code for your language is missing, use an ISO 639-3 code).
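The skipping rule can be illustrated outside XSLT. The sketch below is a Python approximation of the behaviour described above, not the script's actual implementation: the extract() function and the tiny TMX sample are invented for illustration, and the sample omits the header attributes that a real TMX file carries.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample: the second <tu> lacks the "sw" equivalent.
TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="pl"><seg>Dzień dobry.</seg></tuv>
      <tuv xml:lang="sw"><seg>Habari za asubuhi.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="pl"><seg>Dziękuję.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract(tmx_text, langs):
    """Collect segments per language, skipping any <tu> that lacks one of langs."""
    root = ET.fromstring(tmx_text)
    out = {lang: [] for lang in langs}
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        # Keep the <tu> only if every requested language is present.
        if all(lang in segs for lang in langs):
            for lang in langs:
                out[lang].append(segs[lang])
    return out


three = extract(TMX, ["en", "pl", "sw"])  # incomplete <tu> is skipped
two = extract(TMX, ["en", "pl"])          # both <tu> elements qualify
print(len(three["en"]), len(two["en"]))
```

With all three languages requested, the incomplete <tu> is dropped for every language, so the per-language lists stay equally long; restricting the request to "en" and "pl" keeps both units.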
In the future, this script might also use a local OCTC tokenizer, if one exists (ATM, during import to the SkE, the corpora have to be run through the SkE tokenizer, which seems a more sensible option from the point of view of a SkE user).
Soon to come
This script must be rewritten to use the streaming capabilities of XSLT 2.1 (soon to be XSLT 3.0). This is the absolute priority for it now.
Requirements and advice
In order to run this script, you need the Saxon XSLT 2.0 processor (its home edition is open-source and free) and Java or .NET. Kernow, the front-end to Saxon, may be an attractive alternative to using the commandline.
YouTube has some instructional videos if you feel lost; there is also a page on invoking Saxon from the command line.
If you are a Windows user inexperienced with handling spaces in the path, do not run this from your Desktop -- create e.g. a "saxon" directory in your C: drive, and work from there. If you are a Windows user not accustomed to the command line, there is a MS PowerToy called "open command line here" that may be helpful -- it lets you right-click on a folder and open the command line right in it (no need to navigate to it across the system; note that it will still be the horrible Windows cmd.exe with e.g. abnormal cut&paste -- replacements for that do exist.)
Usage
java -jar saxon9he.jar -s:INPUT-TMX -xsl:tmx2ske.xsl OPTIONS
for example:
java -jar saxon9he.jar -s:en-hr.tmx -xsl:tmx2ske.xsl
This command line may be expanded in various ways. A list of additional options follows.
- -Xmx1024m
  This is a java-internal option that regulates the maximum amount of memory made available to the task. This particular example means that the memory is increased to 1GB (= 1024 MB). This option should come right after java in your command line.
- !encoding=UTF-16
  This is a Saxon-internal option that allows you to regulate some features of the output file (note the leading '!'). In this example, the encoding of that file is changed from the default UTF-8 to UTF-16.
- ?langs=('en','hr')
  Restrict the extracted languages to the specified sequence (in this case, English and Croatian). The parameter name has to be preceded by a '?', because a sequence of language identifiers can only be passed to Saxon as an XPath statement. This is useful only when the TMX file contains more than two languages and you want to extract alignments between a smaller number of them.
- ?langs='en' or langs=en
  Extract a single language from the TMX. In such cases, you probably also want to do tags=0 (see below).
- ?secure=false() or secure=0
  SkE needs parallel corpora to consist of an identical number of <align> tags. There is no TMX-internal guarantee that a TMX file will always contain the same number of equivalent strings in the same set of languages, so this script by default makes sure that your SkE output is uniform. If you know that your TMX file is well-formed, you can disable this security check, so that large files are processed somewhat faster.
- ?normalize=true() or normalize=1
  Remove excess whitespace. This is largely a cosmetic option.
- output_prefix=gizmo/
  Add a path or a fragment of file name to the default output file name, which is LG.txt (LG stands for a language code, as in en.txt). In this example, the output will be placed in the gizmo/ subdirectory (subfolder) of the directory from which the script is invoked.
- ?output_prefix=concat(xs:duration(current-dateTime()-xs:dateTime('1970-01-01T00:00:00')),'_')
  Be fancy and prefix the output file name with a string reflecting the time distance between January 1, 1970, 00:00 a.m. and the moment of invocation. Useful to guarantee that the result of the current invocation will not overwrite the previously created files. Example: P14783DT19H38M58.899S_en.txt.
- tags=0
  Do not enclose each extracted string between <align> and </align> tags. Useful to extract monolingual strings from a TMX.
Examples of invocation
If you want to explicitly pass your chosen languages as parameters (if the TMX file contains alignments for more than two languages or if the first <tu> element is misaligned), do:
java -jar saxon9he.jar -s:en-pl-sk.tmx -xsl:tmx2ske.xsl ?langs=('pl','en')
If you encounter a "java heap space" error, you may want to add another option to the commandline, e.g. -Xmx1024m might help, by increasing the heap size to 1GB. An example command line to tackle a big English-French TMX file called en-fr.tmx is:
java -Xmx1024m -jar saxon9he.jar -s:en-fr.tmx -xsl:../../tools/xsl/tmx2ske.xsl secure=0 normalize=1
In this example, the script is located in a different directory, the security check is disabled, and whitespace normalization is turned on.
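If you launch the script from another program rather than by hand, it may be convenient to assemble the command line as a list of arguments, which avoids having to quote characters such as '?', '(' and the apostrophes for the shell. The helper below is a hypothetical sketch: build_command() and its parameter names are invented here, and it only uses the options documented above.

```python
def build_command(tmx, xsl="tmx2ske.xsl", heap=None, langs=None, params=()):
    """Assemble a Saxon invocation as an argument list (hypothetical helper).

    heap   -- value for -Xmx, e.g. "1024m" (placed right after "java")
    langs  -- language codes, passed as the XPath sequence ?langs=(...)
    params -- further options, given verbatim, e.g. "secure=0"
    """
    cmd = ["java"]
    if heap:
        cmd.append(f"-Xmx{heap}")
    cmd += ["-jar", "saxon9he.jar", f"-s:{tmx}", f"-xsl:{xsl}"]
    if langs:
        seq = ",".join(f"'{lang}'" for lang in langs)
        cmd.append(f"?langs=({seq})")
    cmd.extend(params)
    return cmd


# The two invocations from this section, rebuilt as argument lists.
multi = build_command("en-pl-sk.tmx", langs=["pl", "en"])
big = build_command(
    "en-fr.tmx",
    xsl="../../tools/xsl/tmx2ske.xsl",
    heap="1024m",
    params=["secure=0", "normalize=1"],
)
print(" ".join(multi))
print(" ".join(big))
```

The resulting list can be handed to subprocess.run() from the Python standard library, provided that Java and saxon9he.jar are available in the working directory.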
Credits and further help
Distributor: Open-Content Text Corpus (http://OCTC.sourceforge.net/)
If you find this script useful, we will appreciate (i) learning about that and possibly (ii) a mention and a link in your project documentation. The former will make it possible for us to let you know of any changes to this tool; the latter is one way of getting compensated in the free world.
For bug reports, please use the OCTC Trac. For discussion, you are welcome at the OCTC forum.