<
xsl:
stylesheet xmlns:xsl="
http://www.w3.org/1999/XSL/Transform"
xmlns:xs="
http://www.w3.org/2001/XMLSchema"
xmlns:xi="
http://www.w3.org/2001/XInclude"
xmlns:xd="
http://www.pnp-software.com/XSLTdoc"
xmlns:f="
func"
xpath-default-namespace="
http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="
xs xd f"
version="
2.0">
<
xsl:
import href="
identity_transform.xsl" />
<
xsl:
output encoding="
UTF-8"
method="
xml"
indent="
no" />
<
xsl:
strip-space elements="
*" />
<
xd:
doc type="
stylesheet">
<
xd:
short>
Token-splitter: creates the segmentation level of annotation.</
xd:
short>
<
xd:
detail>
<
p>
Splits text into tokens at whitespace and punctuation; may apply language-specific settings; is not aware of sentence boundaries.</
p>
<
p>
This is intended as a quick tool for segmenting languages that are easy to segment. Other languages may require dedicated tools. Still, some provision for parametrization is included here.</
p>
<
p>
The output file is a token-segmentation annotation document, referencing the source text document in a variety of ways.</
p>
<
p>
Distributor: Open-Content Text Corpus (<
a href="
http://OCTC.sourceforge.net/">
http://OCTC.sourceforge.net/</
a>
)</
p>
</
xd:
detail>
<
xd:
author>
Piotr Bański</
xd:
author>
<
xd:
copyright>
the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).</
xd:
copyright>
<
xd:
svnId>
$Id: tokenizer.xsl 420 2010-12-18 23:01:47Z bansp $</
xd:
svnId>
</
xd:
doc>
<
xd:
doc>
Should &lt;head&gt; elements be tokenized? There is no single good answer to this, but this information should
definitely be kept in the file somehow, so that it can make it into the header. If this is set, the relevant elements
will receive type="head".</
xd:
doc>
<
xsl:
param name="
tokenize_heads"
as="
xs:boolean"
select="
true()" />
<
xd:
doc>
For debugging. Print the dereferenced string inside a comment after each segment.</
xd:
doc>
<
xsl:
param name="
source_in_comment"
as="
xs:boolean"
select="
false()" />
<
xd:
doc>
Redundantly keep the offset info for the source text inside @corresp attributes.</
xd:
doc>
<
xsl:
param name="
keep_offset"
as="
xs:boolean"
select="
false()" />
<
xd:
doc>
Use TEI pointer syntax.</
xd:
doc>
<
xsl:
param name="
use_tei"
as="
xs:boolean"
select="
false()" />
<
xd:
doc>
Use W3C pointer syntax.</
xd:
doc>
<
xsl:
param name="
use_w3c"
as="
xs:boolean"
select="
true()" />
<
xd:
doc>
Use XPath id() to locate the relevant elements. It should have been in from the beginning -- for some reason, I was sure it didn't work but it does. Because
implementations of XPointers are so scarce and scary, I'm leaving the old code, just in case.</
xd:
doc>
<
xsl:
param name="
use_id"
as="
xs:boolean"
select="
true()" />
<
xd:
doc>
Show extra debugging information as egXML.</
xd:
doc>
<
xsl:
param name="
debug_ranges"
as="
xs:boolean"
select="
false()" />
<
xd:
doc>
Use a kludge around the nasty xmllint bug that spoils edges of strings (https://bugzilla.gnome.org/show_bug.cgi?id=620190). The kludge is just as nasty as the
bug but well, at least it produces valid output!</
xd:
doc>
<
xsl:
param name="
edge_bug_kludge"
as="
xs:boolean"
select="
true()" />
<
xd:
doc>
Might be considered "embedding_bug_kludge". Don't try to happily use the string value of the elements pointed at: dive into them in search of embedded
nodes. This is to compensate for an xmllint bug that doesn't treat embedded nodes properly (https://bugzilla.gnome.org/show_bug.cgi?id=620195).</
xd:
doc>
<
xsl:
param name="
dive_into_content"
as="
xs:boolean"
select="
true()" />
<
xd:
doc>
TEI namespace.</
xd:
doc>
<
xsl:
variable name="
teixmlns"
as="
xs:string"
select="
'xmlns(t=http://www.tei-c.org/ns/1.0)'" />
<
xd:
doc>
Override the variable defined in identity_transform.xsl: we need a different schema here.</
xd:
doc>
<
xsl:
variable name="
my_schema"
select="
'OCTC_segmentation.rng'"
as="
xs:string" />
<
xd:
doc>
Assume you only tokenize under lg/ (see the indexer for a more flexible version).</
xd:
doc>
<
xsl:
variable name="
iso_id"
as="
xs:string"
select="
substring-before(substring-after(document-uri(/),'/lg/'),'/')" />
<
xd:
doc>
Holds specifications of sub-token sequences for each language, space-separated; naturally,
this can only work for simple languages. We can use the $ anchor here because this is matched against
strings that have already been whitespace- and punctuation-separated.</
xd:
doc>
<
xsl:
variable name="
lg_hash"
as="
element()+">
<
lg key="
eng"
seq="
's 'es '$ 'll 'd 've 're n't \-" />
<!---->
<!---->
<!---->
<!---->
<
lg key="
swh"
seq="
\-" />
</
xsl:
variable>
<
xd:
doc>
The complete file name of the file operated on; needed to construct references.</
xd:
doc>
<
xsl:
variable name="
my_fname"
as="
xs:string"
select="
tokenize(document-uri(/),'/')[last()]" />
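<!--
  Illustrative sketch (the URI is hypothetical, not taken from the corpus):
  for a source document at file:/corpus/lg/swh/sample/text.xml
    $iso_id   evaluates to 'swh'      (the path step right after '/lg/')
    $my_fname evaluates to 'text.xml' (the last step of the URI)
  If the URI contains no '/lg/' step, $iso_id is the empty string, no entry
  of $lg_hash matches, and the fallback sub-token pattern is used.
-->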
<
xd:
doc>
The regex character class for punctuation is too coarse-grained, so punctuation marks have to be caught by hand :-/ using just
opening/closing characters and quotes (Ps, Pe, Pi, Pf), plus Sc for currency symbols; '%' is always treated as a separate token.</
xd:
doc>
<
xsl:
variable name="
sub-token_seq_basic"
as="
xs:string+"
select="
('\.{1,}', ',', ';', ':', '%', '!{1,}', '\?{1,}', '\p{Ps}', '\p{Pe}', '\p{Pi}', '\p{Pf}', '\p{Sc}')" />
<
xsl:
variable name="
sub-token_match_basic"
select="
concat('(',string-join($sub-token_seq_basic, '|'), ')')"
as="
xs:string" />
<
xsl:
variable name="
sub-token_match_adv"
as="
xs:string">
<
xsl:
variable name="
sub-token_seq_adv"
as="
xs:string+"
select="
if (count($lg_hash[@key eq $iso_id])) then tokenize(normalize-space($lg_hash[@key eq $iso_id]/@seq),' ') else ('\-')" />
<
xsl:
value-of select="
concat('(\w+)(',string-join($sub-token_seq_adv, '|'), ')')" />
</
xsl:
variable>
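<!--
  Worked illustration, assuming $iso_id is 'eng' (the sample strings are hypothetical):
    $sub-token_match_basic is
      (\.{1,}|,|;|:|%|!{1,}|\?{1,}|\p{Ps}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Sc})
    $sub-token_match_adv is
      (\w+)('s|'es|'$|'ll|'d|'ve|'re|n't|\-)
  The tokenizing template pads each basic match with spaces, so "Hello, world!"
  becomes the tokens  Hello  ,  world  !  and the advanced pattern then splits
  "don't" into  do  n't  and "John's" into  John  's.
  For a language without an entry in $lg_hash the advanced pattern is (\w+)(\-).
-->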
<
xsl:
template match="
div|list">
<
xsl:
apply-templates />
</
xsl:
template>
<
xd:
doc>
The @rend attribute is abused here to store a reference to the source file's SVN Id string.</
xd:
doc>
<
xsl:
template match="
body">
<
xsl:
text>
</
xsl:
text>
<
xsl:
copy>
<
xsl:
attribute name="
rend"
select="
substring-after(parent::text/@n,'$')" />
<
xsl:
apply-templates select="
@* except @xml:id" />
<
xsl:
text>
</
xsl:
text>
<
xsl:
if test="
$source_in_comment">
<
xsl:
comment select="
'Note that the offsets in the comments (start, end, length) need not be identical to the offsets used in the xpointers, which may obey extra rules.'" />
<
xsl:
text>
</
xsl:
text>
</
xsl:
if>
<
xsl:
apply-templates select="
node()" />
</
xsl:
copy>
</
xsl:
template>
<
xsl:
template mode="
dive"
as="
element()*"
match="
text()|*">
<
xsl:
choose>
<
xsl:
when test="
name(.) eq ''">
<
cell text_seq-text="
{.}"
text_seq-segm="
{count(preceding-sibling::text())+1}"
text_seq-id="
{parent::*/@xml:id}" />
</
xsl:
when>
<
xsl:
when test="
count(node()) eq 0" />
<!---->
<
xsl:
when test="
count(node()) eq count(text())">
<
cell text_seq-text="
{.}"
text_seq-segm="
0"
text_seq-id="
{@xml:id}" />
</
xsl:
when>
<
xsl:
otherwise><
xsl:
apply-templates mode="
dive"
select="
text()|*" /></
xsl:
otherwise>
</
xsl:
choose>
</
xsl:
template>
<
xd:
doc>
This template looks at the paragraph/sentence level elements and tokenizes them, either using their string value or processing text() and element() nodes as
they come (the latter is the default due to a bug in xmllint). The structure is flattened: it becomes a sequence of &lt;ab&gt; elements.</
xd:
doc>
<
xsl:
template match="
p|ab|item|head">
<
xsl:
choose>
<
xsl:
when test="
local-name() eq 'head' and not($tokenize_heads)" />
<
xsl:
otherwise>
<
xsl:
element name="
ab"
namespace="
http://www.tei-c.org/ns/1.0">
<
xsl:
attribute name="
type"
select="
if (local-name() eq 'head') then 'head' else 'para'" />
<!---->
<
xsl:
attribute name="
corresp"
select="
concat($my_fname,'#',./@xml:id)" />
<
xsl:
variable name="
text_seq"
as="
element()+">
<!---->
<
xsl:
choose>
<
xsl:
when test="
* and $dive_into_content">
<
xsl:
apply-templates select="
text()|*"
mode="
dive" />
</
xsl:
when>
<
xsl:
otherwise><
cell text_seq-text="
{string(.)}"
text_seq-segm="
0"
text_seq-id="
{@xml:id}" /></
xsl:
otherwise>
</
xsl:
choose>
</
xsl:
variable>
<
xsl:
for-each select="
$text_seq">
<
xsl:
variable name="
current_text_seq"
select="
." />
<
xsl:
variable name="
trans"
select="
translate(@text_seq-text,'@_`',' ''')"
as="
xs:string" />
<
xsl:
variable name="
match_seq"
as="
xs:string"
select="
normalize-space(replace($trans,$sub-token_match_basic,' $1 ','i'))" />
<
xsl:
variable as="
xs:string+"
name="
basic_seq">
<
xsl:
for-each select="
tokenize($match_seq,' ')">
<
xsl:
sequence select="
." />
</
xsl:
for-each>
</
xsl:
variable>
<
xsl:
variable as="
xs:string+"
name="
adv_seq">
<
xsl:
for-each select="
$basic_seq">
<
xsl:
variable name="
match_seq2"
as="
xs:string"
select="
normalize-space(replace(.,$sub-token_match_adv,' $1 $2 ','i'))" />
<
xsl:
sequence select="
tokenize($match_seq2,' ')" />
</
xsl:
for-each>
</
xsl:
variable>
<
xsl:
variable name="
string_map"
as="
element()+"
select="
f:map_string($current_text_seq/@text_seq-text,$adv_seq,1,0)" />
<!---->
<
xsl:
if test="
$debug_ranges">
<
xsl:
text>
</
xsl:
text>
<
xsl:
element name="
egXML"
namespace="
http://www.tei-c.org/ns/Examples">
<
xsl:
for-each select="
$string_map">
<
xsl:
text>
</
xsl:
text>
<
xsl:
copy-of select="
." />
</
xsl:
for-each>
</
xsl:
element>
</
xsl:
if>
<
xsl:
text>
</
xsl:
text>
<
xsl:
for-each select="
$adv_seq">
<
xsl:
variable name="
pos"
select="
position()" />
<
xsl:
element name="
seg"
namespace="
http://www.tei-c.org/ns/1.0">
<
xsl:
variable name="
tei_info"
as="
xs:string"
select="
concat($string_map[$pos]/@point_bef_start,',',$string_map[$pos]/@length)" />
<
xsl:
variable name="
w3c_info"
as="
xs:string"
select="
concat($string_map[$pos]/@point_bef_start+1,',',$string_map[$pos]/@length)" />
<
xsl:
variable name="
node_id-left"
as="
xs:string"
select="
if ($use_id) then 'id(''' else '/t:teiCorpus/t:teiCorpus/t:TEI/t:text/t:body//t:*[@xml:id='''" />
<
xsl:
variable name="
node_id-right"
as="
xs:string"
select="
concat(if ($use_id) then ''')' else ''']',if(number($current_text_seq/@text_seq-segm)) then concat('/text()[',$current_text_seq/@text_seq-segm,']') else '')" />
<
xsl:
if test="
$string_map[$pos]/@glued eq 'true'">
<
xsl:
attribute name="
rend"
select="
'glued'" />
<!---->
</
xsl:
if>
<
xsl:
if test="
$keep_offset or ($edge_bug_kludge and count($string_map[$pos]/@form))">
<!---->
<
xsl:
attribute name="
corresp"
select="
concat($my_fname,'#',if (not($use_id)) then $teixmlns else '','xpointer(string-range(',$node_id-left,$current_text_seq/@text_seq-id,$node_id-right,','''',',$w3c_info,')[1])')" />
</
xsl:
if>
<
xsl:
choose>
<
xsl:
when test="
$edge_bug_kludge and count($string_map[$pos]/@form)">
<
xsl:
value-of select="
$string_map[$pos]/@form" />
</
xsl:
when>
<
xsl:
otherwise>
<
xsl:
element name="
xi:include">
<
xsl:
attribute name="
href"
select="
$my_fname" />
<
xsl:
attribute name="
xpointer"
select="
concat(if (not($use_id)) then $teixmlns else '',if ($use_tei) then concat('string-range(',$node_id-left,$current_text_seq/@text_seq-id,$node_id-right,','''',',$tei_info,')') else '',if ($use_w3c) then concat('xpointer(string-range(',$node_id-left,$current_text_seq/@text_seq-id,$node_id-right,','''',',$w3c_info,')[1])') else '')" />
</
xsl:
element>
</
xsl:
otherwise>
</
xsl:
choose>
</
xsl:
element>
<
xsl:
if test="
$source_in_comment">
<
xsl:
comment select="
concat(.,' (',$string_map[$pos]/@point_bef_start,',',$string_map[$pos]/@point_bef_end,',',$string_map[$pos]/@length,')')" />
</
xsl:
if>
<
xsl:
text>
</
xsl:
text>
</
xsl:
for-each>
</
xsl:
for-each>
</
xsl:
element>
<
xsl:
text>
</
xsl:
text>
</
xsl:
otherwise>
</
xsl:
choose>
</
xsl:
template>
<
xsl:
function name="
f:map_string"
as="
element()+">
<
xsl:
param name="
str"
as="
xs:string" />
<
xsl:
param name="
seq"
as="
xs:string+" />
<
xsl:
param name="
pos"
as="
xs:integer" />
<
xsl:
param name="
offset"
as="
xs:integer" />
<
xsl:
variable name="
head"
as="
xs:string"
select="
$seq[$pos]" />
<
xsl:
variable name="
head_length"
as="
xs:integer"
select="
string-length($head)" />
<
xsl:
variable name="
tail"
as="
xs:string"
select="
substring-after($str,$head)" />
<
xsl:
variable name="
is_first"
as="
xs:boolean"
select="
$pos eq 1" />
<
xsl:
variable name="
is_last"
as="
xs:boolean"
select="
$pos eq count($seq)" />
<
xsl:
variable name="
glued"
as="
xs:boolean"
select="
not(starts-with($str, ' ')) and not($is_first)" />
<
xsl:
variable name="
point_bef_start"
as="
xs:integer"
select="
if ($glued or $is_first) then $offset else $offset+1" />
<
xsl:
variable name="
point_bef_end"
select="
$point_bef_start + $head_length" />
<
xsl:
variable name="
nd"
as="
element()">
<
xsl:
element name="
row">
<
xsl:
attribute name="
pos"
select="
$pos" />
<
xsl:
attribute name="
str"
select="
$head" />
<
xsl:
attribute name="
point_bef_start"
select="
$point_bef_start" />
<
xsl:
attribute name="
point_bef_end"
select="
$point_bef_end" />
<
xsl:
attribute name="
length"
select="
$head_length" />
<
xsl:
attribute name="
glued"
select="
$glued" />
<
xsl:
if test="
$edge_bug_kludge and ($is_first or $is_last)">
<
xsl:
attribute name="
form"
select="
$head" />
</
xsl:
if>
</
xsl:
element>
</
xsl:
variable>
<
xsl:
sequence select="
($nd, if (not($is_last)) then f:map_string($tail,$seq,$pos+1,$point_bef_end) else ())" />
</
xsl:
function>
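<!--
  Hand-traced illustration (the sample input is hypothetical):
  f:map_string('Hello, world!', ('Hello', ',', 'world', '!'), 1, 0) yields four rows:
    pos  str    point_bef_start  point_bef_end  length  glued
    1    Hello  0                5              5       false
    2    ,      5                6              1       true
    3    world  7                12             5       false
    4    !      12               13             1       true
  A token counts as glued when no space separates it from its predecessor.
  With $edge_bug_kludge set, rows 1 and 4 also carry @form, because they are
  the first and last tokens of the string.
-->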
</
xsl:
stylesheet>