octc2tmx.xsl

octc2tmx: converts OCTC aligned documents into the TMX translation memory format.

This script creates TMX translation memory documents out of Open-Content Text Corpus align.xml files.

It should probably cache the content of the <text> elements of the monolingual documents to reduce file access, though it is unclear whether the gain would justify the effort.

TO DO: make it dive into divs that do not fully partition the given linkGrp; perhaps differentiate between 1/many:many misalignments and 1/many:0 misalignments (but what if more than two languages are involved?). Make sure that the type of alignment (e.g. paragraph, sentence, etc.) can be retrieved from the align.xml files, so that the appropriate properties can be created in the TMX.

Distributor: Open-Content Text Corpus (http://OCTC.sourceforge.net/). Please report problems in our Trac instance.

The documentation of the input files is provided in the OCTC wiki.

Note that the official namespace for TMX is "http://www.lisa.org/tmx14", but I haven't seen it used even once, so the OCTC tools do not support it, for now. Please let us know if you encounter problems related to the (non-)use of this namespace.

Author:

Piotr Bański

(c) the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).

SVN Id:

$Id: octc2tmx.xsl 312 2010-06-26 12:56:56Z bansp $

XSLT Version:

2.0

Namespace Prefix Summary:

f - func

xd - http://www.pnp-software.com/XSLTdoc

xi - http://www.w3.org/2001/XInclude

xs - http://www.w3.org/2001/XMLSchema

xsl - http://www.w3.org/1999/XSL/Transform

XPath Default Namespace:

http://www.tei-c.org/ns/1.0

Outputs Summary

#default - source

No short description available

Element Space Summary

strip * - source

No short description available

Parameters Summary

xs:boolean cascade - source

Cascade from div/@type="doc" or process div/@type="tu"

xs:string srclang - source

The 'srclang' parameter must be a single string; it is TMX-internal

xs:string+ trglang - source

The 'trglang' parameter is a sequence

Variables Summary

xs:string date - source

The date has to be adjusted to UTC

xs:string my_id - source

Id of the creator

xs:string o-tmf - source

Format of the source

xs:string version - source

The current version of the script, set automatically by SVN

Match Templates Summary

/ - source

The initial template

div - source

Process div elements containing potential translation units

q - source

Turn q elements into double quotes

Named Templates Summary

process_linkGrp (param: node()+ node) - source

Recursively process linkGrp elements

Functions Summary

xs:string+ f:process (param: xs:string+ targ, node() context) - source

Dive into each node

Outputs Detail

#default - source

No short description available

Attributes

doctype-public

-//LISA OSCAR:1998//DTD for Translation Memory eXchange//EN

doctype-system

tmx14.dtd

encoding

UTF-8

indent

yes

method

xml

Element Space Detail

strip * - source

No short description available

Namespace Prefix Summary:

#default -

Parameters Detail

xs:boolean cascade - source

Cascade from div/@type="doc" or process div/@type="tu"

Set it to false only for align.xml documents that are somehow incomplete; the default should generally be safe, provided you remember to include the div/@type="doc" element.

xs:string srclang - source

The 'srclang' parameter must be a single string; it is TMX-internal

xs:string+ trglang - source

The 'trglang' parameter is a sequence

Variables Detail

xs:string date - source

The date has to be adjusted to UTC
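The adjustment can be illustrated outside XSLT. The following is a minimal Python sketch, not the stylesheet's own code: it assumes the target is the TMX 1.4 date format YYYYMMDDThhmmssZ, and the function name to_tmx_utc is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_tmx_utc(local_dt: datetime) -> str:
    """Convert an offset-aware datetime to UTC and format it as YYYYMMDDThhmmssZ."""
    if local_dt.tzinfo is None:
        # Without an offset there is nothing to adjust from.
        raise ValueError("expected an offset-aware datetime")
    return local_dt.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


# Example: 14:56:56 at UTC+02:00 corresponds to 12:56:56 UTC.
stamp = to_tmx_utc(
    datetime(2010, 6, 26, 14, 56, 56, tzinfo=timezone(timedelta(hours=2)))
)
print(stamp)  # 20100626T125656Z
```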

xs:string my_id - source

Id of the creator

This is just a placeholder, though it does carry some informational value.

xs:string o-tmf - source

Format of the source

.. well, that's close enough :-)

xs:string version - source

The current version of the script, set automatically by SVN

Match Templates Detail

/ - source

The initial template

It sets up the TMX document, fills out the header and starts the processing of an aligned OCTC document.
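To show what "setting up the TMX document" amounts to, here is a rough Python sketch of a TMX 1.4 skeleton with its required header attributes. It is an illustration only: the function name build_tmx_skeleton and all attribute values except srclang are placeholders, not the values the stylesheet actually writes.

```python
import xml.etree.ElementTree as ET


def build_tmx_skeleton(srclang: str, creation_date: str) -> ET.Element:
    """Create a <tmx> root with a filled-in <header> and an empty <body>."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "octc2tmx",     # placeholder value
        "creationtoolversion": "0",     # placeholder value
        "segtype": "paragraph",         # assumption: depends on the alignment type
        "o-tmf": "OCTC",                # placeholder for the source format
        "adminlang": "en",              # placeholder value
        "srclang": srclang,             # single string, as the parameter requires
        "datatype": "plaintext",        # placeholder value
        "creationdate": creation_date,  # expected in UTC, YYYYMMDDThhmmssZ
    })
    ET.SubElement(tmx, "body")          # translation units would be added here
    return tmx


root = build_tmx_skeleton("en", "20100626T125656Z")
print(ET.tostring(root, encoding="unicode"))
```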

div - source

Process div elements containing potential translation units

('tu' is a term from the TMX specification). All this template does is redirect to another template that performs recursive processing of linkGrp elements.

q - source

Turn q elements into double quotes
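The effect of this template can be approximated in Python as follows. This is a sketch under assumptions: the helper name flatten is hypothetical, and the real template operates on TEI nodes inside the XSLT processor rather than on ElementTree objects.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def flatten(elem: ET.Element) -> str:
    """Return the text of elem, wrapping the content of every q in double quotes."""
    parts = [elem.text or ""]
    for child in elem:
        inner = flatten(child)
        if child.tag == TEI + "q":
            inner = '"' + inner + '"'
        parts.append(inner)
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


p = ET.fromstring(
    '<p xmlns="http://www.tei-c.org/ns/1.0">She said <q>hello</q> and left.</p>'
)
print(flatten(p))  # She said "hello" and left.
```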

Named Templates Detail

process_linkGrp (param: node()+ node) - source

Recursively process linkGrp elements

If a div completely partitions the linkGrp that is about to be processed, abandon that linkGrp and process the contents of the div instead; repeat at each level. If the processed div has org="uniform" (refer to the wiki for an explanation), look for partitioning divs right after the processed linkGrps; otherwise, look at the ptr/@type="part" elements and work from there.

Parameters:

node()+ node -
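The recursion described for process_linkGrp can be pictured with a toy model. The Python sketch below is not a port of the template: the dictionary layout, the key names (links, div, partitions_fully, linkGrps) and the function collect_links are all invented for illustration, and the distinction between org="uniform" divs and ptr/@type="part" pointers is left out.

```python
def collect_links(link_grp: dict) -> list:
    """Return the links of the finest-grained level that fully partitions link_grp."""
    div = link_grp.get("div")
    if div is not None and div["partitions_fully"]:
        # Abandon this linkGrp and descend into the partitioning div.
        result = []
        for inner in div["linkGrps"]:
            result.extend(collect_links(inner))
        return result
    # No complete partition below this point: keep the links at this level.
    return list(link_grp["links"])


coarse = {
    "links": ["doc-en doc-pl"],
    "div": {
        "partitions_fully": True,
        "linkGrps": [
            {"links": ["p1-en p1-pl", "p2-en p2-pl"]},
            {
                "links": ["p3-en p3-pl"],
                # This div covers only part of its parent, so it is not entered.
                "div": {"partitions_fully": False,
                        "linkGrps": [{"links": ["s1-en s1-pl"]}]},
            },
        ],
    },
}
print(collect_links(coarse))
```

In this toy input the document-level link is replaced by the three paragraph-level links, while the sentence-level link under the incomplete div is ignored.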

Functions Detail

xs:string+ f:process (param: xs:string+ targ, node() context) - source

Dive into each node

The nodes accessed by resolving the URIs may have other nodes embedded within them. The content of such embedded nodes has to be string-joined on its own, separately from the handling of multiple sibling nodes, which are string-joined with a space. This function handles embedded q elements, for example.

Parameters:

xs:string+ targ -

node() context -
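The two kinds of joining can be sketched in Python as follows. This is only an illustration of the idea: the function name join_targets is hypothetical, a plain id attribute stands in for whatever identifiers the real URIs resolve to, and embedded elements are flattened with itertext() without any special treatment of q.

```python
import xml.etree.ElementTree as ET


def join_targets(doc: ET.Element, ids: list) -> str:
    """Resolve each id to a node, flatten it, and join the results with a space."""
    by_id = {el.get("id"): el for el in doc.iter() if el.get("id")}
    pieces = []
    for target in ids:
        node = by_id[target]
        # Embedded nodes: concatenate their text without adding a separator.
        pieces.append("".join(node.itertext()).strip())
    # Sibling nodes: join the flattened strings with a single space.
    return " ".join(pieces)


doc = ET.fromstring(
    '<text><s id="s1">A <hi>bold</hi> claim.</s><s id="s2">Second one.</s></text>'
)
print(join_targets(doc, ["s1", "s2"]))  # A bold claim. Second one.
```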