
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xd="http://www.pnp-software.com/XSLTdoc" exclude-result-prefixes="xs xd" version="2.0">

<xsl:output encoding="UTF-8" method="xml" omit-xml-declaration="yes" />

<xd:doc type="stylesheet">

 <xd:short>tmx2ske: converts TMX translation memory format files (generated by the OCTC) into SketchEngine input files for parallel corpora.</xd:short>

 <xd:detail>

 The <a href="http://ca.sketchengine.co.uk/">SketchEngine</a> accepts parallel corpora in the form of two separate files containing, minimally, &lt;align&gt; tags (both files must contain the same number of them!). This script goes through a single TMX file, looks at the first set of equivalents to determine which (and how many) languages are aligned, then extracts the segments corresponding to those languages and places them in parallel files with the '.txt' suffix that the SketchEngine required as of June 2010 (note that a sequence of &lt;align&gt; elements is not a well-formed XML document, because it lacks a single root element).
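
 As a rough illustration only (the sentences below are invented, not taken from any corpus), extracting "en" and "hr" from a TMX file with two translation units would be expected to yield two files of the following shape, where the n-th &lt;align&gt; element of one file corresponds to the n-th &lt;align&gt; element of the other:

 <pre>
 en.txt:
 &lt;align&gt;Good morning.&lt;/align&gt;
 &lt;align&gt;Thank you.&lt;/align&gt;

 hr.txt:
 &lt;align&gt;Dobro jutro.&lt;/align&gt;
 &lt;align&gt;Hvala.&lt;/align&gt;
 </pre>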

 If part of the TMX file is malformed, i.e., if some alignments are missing (e.g., most of the &lt;tu&gt; elements contain three equivalents, say, for "pl", "sw" and "en", but some contain only two, say, "en" and "pl"), then such &lt;tu&gt; elements are by default skipped altogether, because the primary constraint is: "keep the number of &lt;align&gt; elements equal throughout". (An alternative would be to create an empty &lt;align&gt; element, but it is not clear whether the SkE allows for that.) See also the description of the <code>langs</code> parameter below.

 If you know that this is the case in your TMX file, and want to extract only two languages from it (e.g., ignoring "sw", because it is not completely aligned), then you have to pass the list of languages to the script as a parameter. Note that the language identification codes are taken mostly from ISO 639-1 (if an ISO 639-1 code for your language is missing, use an <a href="http://www.sil.org/iso639-3/codes.asp">ISO 639-3 code</a>).
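
 To make this concrete, here is a hypothetical TMX body (the segments are invented for illustration, and the TMX header is left out) in which the second &lt;tu&gt; lacks the "sw" equivalent:

 <pre>
 &lt;body&gt;
   &lt;tu&gt;
     &lt;tuv xml:lang="pl"&gt;&lt;seg&gt;Dzień dobry.&lt;/seg&gt;&lt;/tuv&gt;
     &lt;tuv xml:lang="sw"&gt;&lt;seg&gt;Habari za asubuhi.&lt;/seg&gt;&lt;/tuv&gt;
     &lt;tuv xml:lang="en"&gt;&lt;seg&gt;Good morning.&lt;/seg&gt;&lt;/tuv&gt;
   &lt;/tu&gt;
   &lt;tu&gt;
     &lt;tuv xml:lang="pl"&gt;&lt;seg&gt;Dziękuję.&lt;/seg&gt;&lt;/tuv&gt;
     &lt;tuv xml:lang="en"&gt;&lt;seg&gt;Thank you.&lt;/seg&gt;&lt;/tuv&gt;
   &lt;/tu&gt;
 &lt;/body&gt;
 </pre>

 With the default settings, all three languages are detected from the first &lt;tu&gt;, so the second &lt;tu&gt; is skipped and each of the three output files receives a single &lt;align&gt; element. With <code>?langs=('pl','en')</code>, both &lt;tu&gt; elements qualify, and <code>pl.txt</code> and <code>en.txt</code> receive two &lt;align&gt; elements each.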

 In the future, this script might also use a local OCTC tokenizer, if one exists (at the moment, the corpora have to be run through the SkE tokenizer during import to the SkE, which seems a more sensible option from the point of view of a SkE user).

 <h2>Soon to come</h2>

 This script must be rewritten to use the streaming capabilities of XSLT 2.1 (soon to be XSLT 3.0). This is the absolute priority for it now.

 <h2>Requirements and advice</h2>

 In order to run this script, you need the <a href="http://saxon.sourceforge.net/#F9.2HE">Saxon XSLT 2.0 processor</a> (its home edition is open-source and free) and <a href="http://java.com/">Java</a> or .NET. <a href="http://kernowforsaxon.sourceforge.net/">Kernow</a>, the front-end to Saxon, may be an attractive alternative to using the commandline.

 YouTube has some <a href="http://www.youtube.com/watch?v=he8KiRFmM6o">instruction movies</a> if you feel lost; there is also a page on <a href="http://www.saxonica.com/documentation/using-xsl/commandline.html">invoking Saxon from the commandline</a>.

 If you are a Windows user inexperienced with handling spaces in the path, do not run this from your Desktop -- create e.g. a "saxon" directory in your C: drive, and work from there. If you are a Windows user not accustomed to the command line, there is a <a href="http://www.microsoft.com/windowsxp/downloads/powertoys/xppowertoys.mspx">MS PowerToy</a> called "open command line here" that may be helpful -- it lets you right-click on a folder and open the command line right in it (no need to navigate to it across the system; note that it will still be the horrible Windows cmd.exe with e.g. abnormal cut&amp;paste -- replacements for that do exist).

 <h2>Usage</h2>

 <code>java -jar saxon9he.jar -s:INPUT-TMX -xsl:tmx2ske.xsl OPTIONS</code>, for example: <code>java -jar saxon9he.jar -s:en-hr.tmx -xsl:tmx2ske.xsl</code>

 This command line may be expanded in various ways. A list of additional options follows.

 <ul>

 <li>

 <head>

 <code>-Xmx1024m</code>

 </head>

 This is a <a href="http://java.sun.com/javase/6/docs/technotes/tools/windows/java.html">java-internal option</a> that regulates the maximum amount of memory made available to the task. This particular example means that the memory is increased to 1GB (= 1024 MB). This option should come right after <code>java</code> in your command line.

 </li>

 <li>

 <head>

 <code>!encoding=UTF-16</code>

 </head>

 This is a Saxon-internal option that allows you to regulate some features of the output file (note the leading '!'). In this example, the encoding of that file is changed from the default UTF-8 to UTF-16.

 </li>

 <li>

 <head>

 <code>?langs=('en','hr')</code>

 </head>

 Restrict the extracted languages to the specified sequence (in this case, English and Croatian). The parameter name has to be preceded by a '?', because a sequence of language identifiers can only be passed to Saxon as an XPath statement. This is useful only when the TMX file contains more than two languages and you want to extract alignments between a smaller number of them.

 </li>

 <li>

 <head>

 <code>?langs='en'</code> or <code>langs=en</code>

 </head>

 Extract a single language from the TMX. In such cases, you probably also want to do <code>tags=0</code> (see below).

 </li>

 <li>

 <head>

 <code>?secure=false()</code> or <code>secure=0</code>

 </head>

 SkE needs parallel corpora to consist of an identical number of &lt;align&gt; tags. There is no TMX-internal guarantee that a TMX file will always contain the same number of equivalent strings in the same set of languages, so this script by default makes sure that your SkE output is uniform. If you know that your TMX file is consistently aligned (i.e., that every &lt;tu&gt; contains all the languages you extract), you can disable this security check, so that large files are processed somewhat faster.

 </li>

 <li>

 <head>

 <code>?normalize=true()</code> or <code>normalize=1</code>

 </head>

 Remove excess whitespace. This is largely a cosmetic option.

 </li>

 <li>

 <head>

 <code>output_prefix=gizmo/</code>

 </head>

 Add a path or a fragment of file name to the default output file name, which is <code>LG.txt</code> (LG stands for a language code, as in <code>en.txt</code>). In this example, the output will be placed in the <code>gizmo/</code> subdirectory (subfolder) of the directory from which the script is invoked.

 </li>

 <li>

 <head>

 <code>?output_prefix=concat(xs:duration(current-dateTime()-xs:dateTime('1970-01-01T00:00:00')),'_')</code>

 </head>

 Be fancy and prefix the output file name with a string reflecting the time elapsed between midnight on January 1, 1970 and the moment of invocation. Useful to guarantee that the result of the current invocation will not overwrite previously created files. Example: <code>P14783DT19H38M58.899S_en.txt</code>.

 </li>

 <li>

 <head>

 <code>tags=0</code>

 </head>

 Do not enclose each extracted string between &lt;align&gt; and &lt;/align&gt; tags. Useful for extracting monolingual strings from a TMX.

 </li>

 </ul>

 <h2>Examples of invocation</h2>

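 If you want to explicitly pass your chosen languages as parameters (if the TMX file contains alignments for more than two languages or if the first &lt;tu&gt; element is misaligned), do <code>java -jar saxon9he.jar -s:en-pl-sk.tmx -xsl:tmx2ske.xsl ?langs=('pl','en')</code> (in a Unix shell, the parameter has to be quoted, e.g. <code>"?langs=('pl','en')"</code>, because parentheses are special to the shell).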

 If you encounter a "java heap space" error, you may want to add another <a href="http://java.sun.com/javase/6/docs/technotes/tools/windows/java.html">option</a> to the commandline, e.g. <code>-Xmx1024m</code> might help, by increasing the heap size to 1GB. An example command line to tackle a big English-French TMX file called <code>en-fr.tmx</code> is: 

 <code>java -Xmx1024m -jar saxon9he.jar -s:en-fr.tmx -xsl:../../tools/xsl/tmx2ske.xsl secure=0 normalize=1</code> In this example, the script is located in a different directory, secure checks are disabled, and whitespace normalization is turned on.

 <h2>Credits and further help</h2>

 Distributor: Open-Content Text Corpus (<a href="http://OCTC.sourceforge.net/">http://OCTC.sourceforge.net/</a>)

 If you find this script useful, we will appreciate (i) learning about that and possibly (ii) a mention and a link in your project documentation. The former will make it possible for us to let you know of any changes to this tool; the latter is one way of getting compensated in the free world.

 For bug reports, please use the <a href="https://sourceforge.net/apps/trac/octc/">OCTC Trac</a>. For discussion, you are welcome at the <a href="https://sourceforge.net/apps/phpbb/octc/">OCTC forum</a>.

 </xd:detail>

 <xd:author>Piotr Bański</xd:author>

 <xd:copyright>the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).</xd:copyright>

 <xd:svnId>$Id: tmx2ske.xsl 382 2010-11-06 18:12:41Z bansp $</xd:svnId>

 </xd:doc>

<xd:doc>Set the languages explicitly (useful if the TMX file contains equivalents from more than two languages or if the first &lt;tu&gt; element of the TMX, which serves as a heuristic diagnostic for the number and ID of languages involved, is misaligned).</xd:doc>

<xsl:param name="langs" as="xs:string+" select="/tmx/body/tu[1]/tuv/@xml:lang" />

<xd:doc>Play insecurely? If this is passed as '0', the TMX-well-formedness checks are switched off (and you may end up with files with different numbers of &lt;align&gt; elements).</xd:doc>

<xsl:param name="secure" as="xs:boolean" select="true()" />

<xd:doc>Normalize whitespace? Some TMX files contain lots of it and you might want to get rid of it. The default is to skip this step.</xd:doc>

<xsl:param name="normalize" as="xs:boolean" select="false()" />

<xd:doc>Determine the placement of the output files. The result has the form: output_prefixLG.txt, where "LG" stands for the language code and ".txt" is appended to please the SkE. The default is simply, e.g., "en.txt", in the directory from which the script is invoked.</xd:doc>

<xsl:param name="output_prefix" as="xs:string" select="''" />

<xd:doc>Set this to 0 to suppress the &lt;align&gt; tags. Useful for extraction of monolingual material from a multilingual TMX.</xd:doc>

<xsl:param name="tags" as="xs:boolean" select="true()" />

<xd:doc>This is an attempt to reduce memory usage. It nicely fails so far, on big files. Apparently, we need some more intensive garbage collection here instead of such tricks.</xd:doc>

<xsl:variable name="chunk_size" as="xs:integer" select="500" />



 <xsl:template match="/">
  <!-- keep a handle on the document node: inside the loop over $langs the context item is a language code, not a node -->
  <xsl:variable name="root" select="/" as="document-node()" />
  <!-- one output file per requested language -->
  <xsl:for-each select="$langs">
   <xsl:result-document href="{concat($output_prefix,.,'.txt')}">
    <!-- walk through the <tuv> elements of the current language in chunks of $chunk_size -->
    <xsl:for-each-group select="$root/tmx/body/tu/tuv[@xml:lang = current()]" group-adjacent="position() idiv $chunk_size">
     <xsl:for-each select="current-group()">
      <!-- in secure mode, skip every <tu> that lacks an equivalent in any of the requested languages -->
      <xsl:if test="not($secure) or (every $x in $langs satisfies parent::tu/tuv/@xml:lang = $x)">
       <xsl:choose>
        <xsl:when test="$tags">
         <xsl:element name="align">
          <xsl:value-of select="if ($normalize) then normalize-space(string-join(seg,' ')) else string-join(seg,' ')" />
         </xsl:element>
        </xsl:when>
        <xsl:otherwise>
         <xsl:value-of select="if ($normalize) then normalize-space(string-join(seg,' ')) else string-join(seg,' ')" />
        </xsl:otherwise>
       </xsl:choose>
       <!-- one extracted string per line -->
       <xsl:text>&#10;</xsl:text>
      </xsl:if>
     </xsl:for-each>
    </xsl:for-each-group>
   </xsl:result-document>
  </xsl:for-each>
 </xsl:template>
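
<!--
  A minimal, untested sketch of the alternative mentioned in the documentation
  above: instead of skipping incomplete <tu> elements, emit an empty <align>
  for every missing equivalent, so that each output file still contains exactly
  one <align> per <tu>. Whether the SkE accepts empty <align> elements is not
  known, which is why this is left commented out. The template name
  "pad_missing" and the variable "lang" are hypothetical and are not used
  anywhere else in this stylesheet. For brevity, the sketch ignores the
  "normalize", "tags" and "secure" parameters, and it is meant to be called
  while the context item is the TMX document node.

  <xsl:template name="pad_missing">
    <xsl:variable name="root" select="/" as="document-node()" />
    <xsl:for-each select="$langs">
      <xsl:variable name="lang" select="." as="xs:string" />
      <xsl:result-document href="{concat($output_prefix, $lang, '.txt')}">
        <xsl:for-each select="$root/tmx/body/tu">
          <align>
            <xsl:value-of select="string-join(tuv[@xml:lang = $lang]/seg, ' ')" />
          </align>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
-->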

</xsl:stylesheet>