octc2tmx.xsl

octc2tmx: converts OCTC aligned documents into the TMX translation memory format.

This script creates TMX translation memory documents out of Open-Content Text Corpus align.xml files.

Possible optimization: cache the content of the <text> element of each monolingual document to reduce file access. It is not yet clear whether this would be worth the effort.

TO DO: make it dive into divs that do not fully partition the given linkGrp; perhaps differentiate between 1/many:many misalignments and 1/many:0 misalignments (but what if more than two languages are involved?). Make sure that the type of alignment (e.g. paragraph, sentence, etc.) can be retrieved from the align.xml files, so that the appropriate properties can be created in the TMX.

Distributor: Open-Content Text Corpus (http://OCTC.sourceforge.net/). Please report problems in our Trac instance.

The documentation of the input files is provided in the OCTC wiki.

Note that the official namespace for TMX is "http://www.lisa.org/tmx14", but I haven't seen it used even once, so the OCTC tools do not support it, for now. Please let us know if you encounter problems related to the (non-)use of this namespace.

Author:
Piotr Bański
Copyright:
the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).
SVN Id:
$Id: octc2tmx.xsl 312 2010-06-26 12:56:56Z bansp $
XSLT Version:
2.0
Namespace Prefix Summary:
f - func
xd - http://www.pnp-software.com/XSLTdoc
xi - http://www.w3.org/2001/XInclude
xs - http://www.w3.org/2001/XMLSchema
xsl - http://www.w3.org/1999/XSL/Transform
XPath Default Namespace:
http://www.tei-c.org/ns/1.0

Outputs Summary

No short description available

Element Space Summary

strip *
No short description available

Parameters Summary

xs:boolean cascade
Cascade from div/@type="doc" or process div/@type="tu"
xs:string srclang
The 'srclang' parameter must be a single string; it is TMX-internal
xs:string+ trglang
The 'trglang' parameter is a sequence

Variables Summary

xs:string date
The date has to be adjusted to UTC
xs:string my_id
Id of the creator
xs:string o-tmf
Format of the source
xs:string version
The current version of the script, set automatically by SVN

Match Templates Summary

The initial template
Process div elements containing potential translation units
Turn q elements into double quotes

Named Templates Summary

process_linkGrp (param: node()+ node)
Recursively process linkGrp elements

Functions Summary

xs:string+ f:process (param: xs:string+ targ, node() context)
Dive into each node

Outputs Detail

No short description available
Attributes
doctype-public
-//LISA OSCAR:1998//DTD for Translation Memory eXchange//EN
doctype-system
tmx14.dtd
encoding
UTF-8
indent
yes
method
xml

Element Space Detail

strip *
No short description available
Namespace Prefix Summary:
#default - 

Parameters Detail

xs:boolean cascade
Cascade from div/@type="doc" or process div/@type="tu"
Set it to false only for incomplete align.xml documents; the default should generally be safe as long as you remember to include the div/@type="doc" element.
xs:string srclang
The 'srclang' parameter must be a single string; it is TMX-internal
xs:string+ trglang
The 'trglang' parameter is a sequence

Variables Detail

xs:string date
The date has to be adjusted to UTC
xs:string my_id
Id of the creator
This is just a placeholder, though it does carry some informational value.
xs:string o-tmf
Format of the source
.. well, that's close enough :-)
xs:string version
The current version of the script, set automatically by SVN
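
The UTC adjustment mentioned for the date variable can be illustrated with a minimal sketch. It is written in Python rather than in the stylesheet's XSLT, it assumes the TMX 1.4 date layout YYYYMMDDThhmmssZ, and the helper name tmx_date is hypothetical (it does not appear in octc2tmx.xsl).

```python
from datetime import datetime, timedelta, timezone


def tmx_date(moment: datetime) -> str:
    """Format an aware datetime as a TMX-style UTC timestamp (hypothetical helper)."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Shift to UTC first, then emit the compact ISO 8601 form with a trailing Z.
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


# A local time two hours ahead of UTC is shifted back before formatting.
local = datetime(2010, 6, 26, 14, 56, 56, tzinfo=timezone(timedelta(hours=2)))
print(tmx_date(local))  # 20100626T125656Z
```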

Match Templates Detail

The initial template
It sets up the TMX document, fills out the header and starts the processing of an aligned OCTC document.
Process div elements containing potential translation units
('tu' is a term from the TMX specification). All this template does is redirect to another template that performs recursive processing of linkGrp elements.
Turn q elements into double quotes
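
As a rough picture of what the initial template produces, the following Python sketch builds a minimal TMX 1.4 tree with a header and one translation unit. It is not the stylesheet's code: the function name build_tmx and all header values marked as placeholders are assumptions made for illustration, and only the attribute names come from the TMX 1.4 header definition.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialized as xml:lang


def build_tmx(srclang, units):
    """Build a minimal TMX tree; `units` is a list of {language: segment} dicts."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "octc2tmx",     # placeholder value
        "creationtoolversion": "0",     # placeholder value
        "segtype": "paragraph",         # assumed alignment level
        "o-tmf": "OCTC",                # placeholder value
        "adminlang": "en",              # placeholder value
        "srclang": srclang,
        "datatype": "plaintext",        # placeholder value
    })
    body = ET.SubElement(tmx, "body")
    for unit in units:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned group
        for lang, text in unit.items():
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


tree = build_tmx("en", [{"en": "Hello", "pl": "Cześć"}])
print(ET.tostring(tree, encoding="unicode"))
```

The sketch omits the DOCTYPE declaration listed under Outputs Detail, which ElementTree does not write on its own.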

Named Templates Detail

process_linkGrp (param: node()+ node)
Recursively process linkGrp elements
If a div completely partitions the linkGrp about to be processed, abandon that linkGrp and process the contents of the div instead, then repeat. If the processed div is org="uniform" (refer to the wiki for an explanation), look for partitioning divs right after the processed linkGrps; otherwise, look at the ptr/@type="part" elements and work from there.
Parameters:
node()+ node -
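
The "descend while a complete partition exists" idea can be sketched on a toy model. This is only an illustration in Python, not the template's logic: groups are modelled as dictionaries with a hypothetical "span" list of unit ids and optional "children", and the distinction between org="uniform" divs and ptr/@type="part" elements is ignored entirely.

```python
def finest_groups(group):
    """Return the finest-grained groups reachable through complete partitions."""
    children = group.get("children", [])
    covered = [unit for child in children for unit in child["span"]]
    if children and covered == list(group["span"]):
        # The children fully partition this group: abandon it and recurse.
        result = []
        for child in children:
            result.extend(finest_groups(child))
        return result
    # No complete partition: this group itself becomes a translation unit.
    return [group]


doc = {
    "span": ["s1", "s2", "s3"],
    "children": [
        {"span": ["s1"]},
        # This child's own sub-division covers only s2, so it is kept whole.
        {"span": ["s2", "s3"], "children": [{"span": ["s2"]}]},
    ],
}
print([g["span"] for g in finest_groups(doc)])  # [['s1'], ['s2', 's3']]
```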

Functions Detail

xs:string+ f:process (param: xs:string+ targ, node() context)
Dive into each node
The nodes accessed by resolving the URIs may have other nodes embedded within them. These have to be processed separately (string-joined) from multiple siblings (which are string-joined with a space). This function handles embedded q elements, for example.
Parameters:
xs:string+ targ -
node() context -
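
The two levels of joining described for f:process can be sketched as follows. This is a Python illustration under simplifying assumptions: it starts from already-resolved nodes instead of URIs, it ignores the TEI namespace, and the names flatten and process are hypothetical. Embedded content is concatenated as-is (with q wrapped in double quotes), while sibling targets are joined with a single space.

```python
import xml.etree.ElementTree as ET


def flatten(node):
    """Concatenate a node's text, wrapping embedded q elements in double quotes."""
    parts = [node.text or ""]
    for child in node:
        inner = flatten(child)
        parts.append(f'"{inner}"' if child.tag == "q" else inner)
        parts.append(child.tail or "")  # text that follows the embedded element
    return "".join(parts)


def process(targets):
    """Join the flattened text of sibling target nodes with a single space."""
    return " ".join(flatten(target).strip() for target in targets)


p1 = ET.fromstring("<p>She said <q>hello</q> twice.</p>")
p2 = ET.fromstring("<p>Then she left.</p>")
print(process([p1, p2]))  # She said "hello" twice. Then she left.
```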