<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xd="http://www.pnp-software.com/XSLTdoc"
  xmlns:f="func"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="xs xd f"
  version="2.0">
  <xsl:strip-space elements="*" />
  <xsl:output encoding="UTF-8" method="xml" indent="yes"
    doctype-system="tmx14.dtd"
    doctype-public="-//LISA OSCAR:1998//DTD for Translation Memory eXchange//EN" />
  <xd:doc type="stylesheet">
    <xd:short>octc2tmx: converts OCTC aligned documents into the TMX translation memory format.</xd:short>
    <xd:detail>
      <p>This script creates <a href="http://www.lisa.org/tmx/tmx.htm">TMX translation memory</a> documents out of Open-Content Text Corpus align.xml files.</p>
      <p>It should probably cache the content of the &lt;text&gt; of the monolingual documents, to reduce file access. Not sure whether it's worth it.</p>
      <p>TO DO: make it dive into divs that do not fully partition the given linkGrp; perhaps differentiate between 1/many:many misalignments and 1/many:0 misalignments (what if there are more than two languages involved, though?). Make sure that the type of alignment (e.g. paragraph, sentence, etc.) can be retrieved from the align.xml files for the purpose of creating the appropriate properties in the TMX.</p>
      <p>Distributor: Open-Content Text Corpus (<a href="http://OCTC.sourceforge.net/">http://OCTC.sourceforge.net/</a>). Please report problems in <a href="https://sourceforge.net/apps/trac/octc/report">our Trac instance</a>.</p>
      <p>The documentation of the input files is provided in the <a href="https://sourceforge.net/apps/mediawiki/octc/index.php?title=Align.xml">OCTC wiki</a>.</p>
      <p>Note that the official namespace for TMX is "http://www.lisa.org/tmx14", but I haven't seen it used even once, so the OCTC tools do not support it, for now. Please let us know if you encounter problems related to the (non-)use of this namespace.</p>
    </xd:detail>
    <xd:author>Piotr Bański</xd:author>
    <xd:copyright>the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).</xd:copyright>
    <xd:svnId>$Id: octc2tmx.xsl 312 2010-06-26 12:56:56Z bansp $</xd:svnId>
  </xd:doc>
  <xd:doc>The current version of the script, set automatically by SVN.</xd:doc>
  <xsl:variable name="version" select="'$Rev: 312 $'" as="xs:string" />
  <xd:doc>The 'srclang' parameter must be a single string; it is TMX-internal.</xd:doc>
  <xsl:param name="srclang" select="'pl'" as="xs:string" />
  <xd:doc>The 'trglang' parameter is a sequence of one or more strings, one per target language.</xd:doc>
  <xsl:param name="trglang" select="('sw')" as="xs:string+" />
  <xd:doc>The date has to be adjusted to UTC.</xd:doc>
  <xsl:variable name="date"
    select="format-dateTime(adjust-dateTime-to-timezone(current-dateTime(), xs:dayTimeDuration('PT0S')), '[Y1][M01][D01]T[H01][m01][s01]Z')"
    as="xs:string" />
  <xd:doc>Format of the source... well, that's close enough :-)</xd:doc>
  <xsl:variable name="o-tmf" select="'OCTC_alignment.rng'" as="xs:string" />
  <xd:doc>Id of the creator. This is just a placeholder, well, with some information value.</xd:doc>
  <xsl:variable name="my_id" select="'OCTC'" as="xs:string" />
  <xd:doc>Cascade from div/@type="doc" or process div/@type="tu". Set it to false only in the case of somehow incomplete align.xml documents; the default should be generally safe if you remember about the div/@type="doc" element.</xd:doc>
  <xsl:param name="cascade" as="xs:boolean" select="true()" />
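  <!--
    Usage sketch (not part of the original script): one way to run this
    stylesheet and override the parameters declared above. It assumes an
    XSLT 2.0 processor such as Saxon; the jar name and the input and output
    file names are placeholders.

      java -jar saxon9he.jar -s:align.xml -xsl:octc2tmx.xsl -o:out.tmx srclang=pl trglang=sw

    A single target language fits the xs:string+ type of 'trglang' as is.
    How to supply a sequence of several target languages depends on the
    processor, so check its documentation before relying on that.
  -->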
  <xd:doc>The initial template. It sets up the TMX document, fills out the header and starts the processing of an aligned OCTC document.</xd:doc>
  <xsl:template match="/">
    <xsl:element name="tmx">
      <xsl:attribute name="version" select="1.4" />
      <xsl:element name="header">
        <xsl:attribute name="creationtool" select="'octc2tmx.xsl (OCTC)'" />
        <xsl:attribute name="creationtoolversion" select="$version" />
        <xsl:attribute name="creationdate" select="$date" />
        <xsl:attribute name="changedate" select="$date" />
        <xsl:attribute name="creationid" select="$my_id" />
        <xsl:attribute name="changeid" select="$my_id" />
        <xsl:attribute name="segtype" select="'paragraph'" />
        <xsl:attribute name="datatype" select="'plaintext'" />
        <xsl:attribute name="o-tmf" select="$o-tmf" />
        <xsl:attribute name="adminlang" select="'en'" />
        <xsl:attribute name="srclang" select="$srclang" />
        <xsl:element name="note">
          <xsl:text>This file was extracted from a part of the OCTC (Open-Content Text Corpus, https://sourceforge.net/projects/octc/).</xsl:text>
        </xsl:element>
        <xsl:element name="note">
          <xsl:text>It is available under the terms of the GNU General Public License, version 3 or any later version (http://www.gnu.org/licenses/gpl.html).</xsl:text>
        </xsl:element>
        <xsl:element name="note">
          <xsl:value-of select="concat('Source file (aligned): ',substring-after(document-uri(/),'/align/'),' ver. ', /teiCorpus/teiCorpus/TEI/text/@n)" />
        </xsl:element>
        <!-- One note per monolingual source file; ptr elements of type 'part' are skipped. -->
        <xsl:for-each select="/teiCorpus/teiCorpus/TEI/text/body/div[@type='doc']/linkGrp/ptr except /teiCorpus/teiCorpus/TEI/text/body/div[@type='doc']/linkGrp/ptr[@type='part']">
          <xsl:element name="note">
            <xsl:value-of select="concat('Source file (',current()/@xml:lang,'): ',current()/@target,' ver. ', document(current()/@target)/teiCorpus/teiCorpus/TEI/text/@n)" />
          </xsl:element>
        </xsl:for-each>
      </xsl:element>
      <xsl:element name="body">
        <xsl:choose>
          <xsl:when test="$cascade">
            <xsl:apply-templates select="/teiCorpus/teiCorpus/TEI/text/body/div[@type='doc']" />
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="/teiCorpus/teiCorpus/TEI/text/body/div[@type='tu']" />
          </xsl:otherwise>
        </xsl:choose>
      </xsl:element>
    </xsl:element>
  </xsl:template>
  <xd:doc>Process div elements containing potential translation units. ('tu' is a term from the TMX specification). All this template does is redirect to another template that performs recursive processing of linkGrp elements.</xd:doc>
  <xsl:template match="div">
    <xsl:call-template name="process_linkGrp">
      <xsl:with-param name="node" select="linkGrp" as="node()+" />
    </xsl:call-template>
  </xsl:template>
  <xd:doc>Recursively process linkGrp elements. If there is a div that completely partitions the linkGrp that we are thinking of processing, abandon the linkGrp and process the contents of the div. Repeat. If the processed div is org="uniform" (refer to the wiki for explanation), look for partitioning divs right after the processed linkGrps; if not, look at the ptr/@type="part" elements and work from there.</xd:doc>
  <xsl:template name="process_linkGrp">
    <xsl:param name="node" as="node()+" />
    <xsl:for-each select="$node">
      <xsl:variable name="partitioning_divs" as="element()*">
        <xsl:choose>
          <xsl:when test="ancestor::div[1][count(@org) and @org eq 'uniform']">
            <!-- Partitioning divs sit between this linkGrp and the next sibling linkGrp. -->
            <xsl:variable name="f_lG" select="following-sibling::linkGrp[1]" />
            <xsl:sequence select="following-sibling::div[. &lt;&lt; $f_lG][substring(@prev,2) eq current()/@xml:id][@part eq 'N']" />
          </xsl:when>
          <xsl:otherwise>
            <xsl:sequence select="if (count(ptr[@type = 'part'])) then id(substring(ptr[@type = 'part']/@target,2)) else ()" />
          </xsl:otherwise>
        </xsl:choose>
      </xsl:variable>
      <xsl:variable name="curr" select="." />
      <xsl:choose>
        <xsl:when test="exists($partitioning_divs)">
          <xsl:for-each select="$partitioning_divs">
            <xsl:call-template name="process_linkGrp">
              <xsl:with-param name="node" select="./linkGrp" />
            </xsl:call-template>
          </xsl:for-each>
        </xsl:when>
        <xsl:otherwise>
          <xsl:element name="tu">
            <!-- One tuv per language; inside this loop the context item is the language code. -->
            <xsl:for-each select="($srclang,$trglang)">
              <xsl:element name="tuv">
                <xsl:attribute name="xml:lang" select="." />
                <xsl:attribute name="creationtool" select="'octc2tmx.xsl (OCTC)'" />
                <xsl:attribute name="creationtoolversion" select="$version" />
                <xsl:attribute name="creationdate" select="$date" />
                <xsl:attribute name="changedate" select="$date" />
                <xsl:attribute name="creationid" select="$my_id" />
                <xsl:attribute name="changeid" select="$my_id" />
                <xsl:attribute name="datatype" select="'plaintext'" />
                <xsl:attribute name="o-tmf" select="$o-tmf" />
                <xsl:element name="prop">
                  <xsl:attribute name="type" select="'x-xml:id'" />
                  <xsl:value-of select="string-join($curr/ptr[@xml:lang=current()]/@xml:id, ' ')" />
                </xsl:element>
                <xsl:element name="prop">
                  <xsl:attribute name="type" select="'x-target'" />
                  <xsl:value-of select="string-join($curr/ptr[@xml:lang=current()]/@target, ' ')" />
                </xsl:element>
                <xsl:element name="seg">
                  <xsl:value-of select="normalize-space(string-join(f:process($curr/ptr[@xml:lang=current()]/@target, $curr),' '))" />
                </xsl:element>
              </xsl:element>
            </xsl:for-each>
          </xsl:element>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
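  <!--
    Illustrative sketch, inferred only from the XPath expressions used in
    process_linkGrp; all ids, targets and language codes below are made up.
    The real align.xml layout is documented in the OCTC wiki.

    A leaf linkGrp (one with no partitioning div) such as

      <linkGrp xml:id="lg1">
        <ptr xml:id="lg1.pl" xml:lang="pl" target="../pl/sample/text.xml#p1" />
        <ptr xml:id="lg1.sw" xml:lang="sw" target="../sw/sample/text.xml#p1" />
      </linkGrp>

    would, with srclang='pl' and trglang='sw', be expected to yield one
    translation unit of roughly this shape (the creation and change
    attributes, datatype and o-tmf are left out here for brevity):

      <tu>
        <tuv xml:lang="pl">
          <prop type="x-xml:id">lg1.pl</prop>
          <prop type="x-target">../pl/sample/text.xml#p1</prop>
          <seg>(text of whatever the pl target resolves to)</seg>
        </tuv>
        <tuv xml:lang="sw">
          <prop type="x-xml:id">lg1.sw</prop>
          <prop type="x-target">../sw/sample/text.xml#p1</prop>
          <seg>(text of whatever the sw target resolves to)</seg>
        </tuv>
      </tu>

    When a language has several ptr elements in the same linkGrp, their
    ids, targets and resolved texts are each joined with a single space.
  -->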
  <xd:doc>Turn q elements into double quotes.</xd:doc>
  <xsl:template match="q">
    <xsl:text>"</xsl:text>
    <xsl:apply-templates />
    <xsl:text>"</xsl:text>
  </xsl:template>
  <xd:doc>Dive into each node. The nodes accessed by resolving the URIs may have other nodes embedded within them. These have to be processed separately (string-joined) from multiple siblings (which are string-joined with a space). This function handles embedded q elements, for example.</xd:doc>
  <xsl:function name="f:process" as="xs:string+">
    <xsl:param name="targ" as="xs:string+" />
    <xsl:param name="context" as="node()" />
    <!-- Resolve each target URI relative to the base URI of the context node. -->
    <xsl:variable name="node" as="node()*" select="document($targ, $context)" />
    <xsl:choose>
      <xsl:when test="exists($node)">
        <xsl:for-each select="$node">
          <xsl:variable name="seq" as="xs:string*">
            <xsl:apply-templates select="." />
          </xsl:variable>
          <xsl:value-of select="string-join($seq,'')" />
        </xsl:for-each>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="('[ERROR: NO MATCH]')" />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>