tokenizer.xsl

Token-splitter: creates the segmentation level of annotation.

Splits text into tokens at whitespace and punctuation; may apply language-specific settings; does not know about sentence boundaries.

This is intended as a quick tool for segmenting languages that are easy to segment. Other languages may require dedicated tools. Still, some provision for parametrization is included here.

The output file is a token-segmentation annotation document, referencing the source text document in a variety of ways.
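
The splitting and referencing described above can be sketched roughly as follows. This is a minimal Python illustration, not the stylesheet's XSLT: the regular expression, the function names tokenize and dereference, and the (offset, length, token) triple are all assumptions made for the example. It treats every run of word characters as one token and every punctuation character as a token of its own, and records where each token sits in the source string so that it can be pointed back at.

```python
import re

# Assumption: a token is either a run of word characters or one single
# character that is neither a word character nor whitespace (punctuation).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return a list of (offset, length, token) triples, in source order."""
    return [(m.start(), len(m.group()), m.group())
            for m in TOKEN_RE.finditer(text)]


def dereference(text, offset, length):
    """Recover the token string from its zero-based offset and length."""
    return text[offset:offset + length]


if __name__ == "__main__":
    source = "Hello, world!"
    for offset, length, token in tokenize(source):
        # Each token can be recovered from the source by offset and length.
        print(offset, length, token, dereference(source, offset, length))
```

Keeping offsets rather than copies of the text is what lets a segmentation document stay a stand-off layer over an unchanged source file.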

Distributor: Open-Content Text Corpus (http://OCTC.sourceforge.net/)

Author:
Piotr Bański
Copyright:
the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).
SVN Id:
$Id: tokenizer.xsl 420 2010-12-18 23:01:47Z bansp $
XSLT Version:
2.0
Namespace Prefix Summary:
f - func
xd - http://www.pnp-software.com/XSLTdoc
xi - http://www.w3.org/2001/XInclude
xs - http://www.w3.org/2001/XMLSchema
xsl - http://www.w3.org/1999/XSL/Transform
XPath Default Namespace:
http://www.tei-c.org/ns/1.0

Outputs Summary

No short description available

Element Space Summary

strip * - source
No short description available

Parameters Summary

xs:boolean debug_ranges - source
Show extra debugging information as egXML
Might be considered "embedding_bug_kludge"
xs:boolean edge_bug_kludge - source
Use a kludge around the nasty xmllint bug that spoils edges of strings (https://bugzilla.gnome.org/show_bug.cgi?id=620190).
xs:boolean keep_offset - source
Redundantly keep the offset info for the source text inside @corresp attributes
For debugging
xs:boolean tokenize_heads - source
Should <head> elements be tokenized? There is no single good answer to this, but this information should definitely be kept in the file somehow, so that it can make it into the header
xs:boolean use_id - source
Use XPath id() to locate the relevant elements
xs:boolean use_tei - source
Use TEI pointer syntax
Use W3C pointer syntax

Variables Summary

xs:string iso_id - source
Assume you only tokenize under lg/ (see the indexer for a more flexible version)
element()+ lg_hash - source
Holds specifications of sub-token sequences for each language, space-separated; naturally, this can only work for simple languages
xs:string my_fname - source
the complete file name of the file operated on; needed to construct references
xs:string my_schema - source
Override the variable defined in identity_transform.xsl
No short description available
No short description available
The regex character group for punctuation is too coarse-grained, so punctuation has to be caught by hand :-/ using just opening/closing characters and quotes, plus Sc for currency symbols; '%' is always treated as a separate token
xs:string teixmlns - source
TEI namespace

Template Modes Summary

No short description available
No short description available

Match Templates Summary

body (mode: #default) - source
The @rend attribute is abused here to store reference to the source file's SVN Id string
No short description available
This template looks at the paragraph/sentence level elements and tokenizes them, either using their string value or processing text() and element() nodes as they come (the latter is the default due to a bug in xmllint).
element()* text()|* (mode: dive) - source
No short description available

Functions Summary

element()+ f:map_string (param: xs:string str, xs:string+ seq, xs:integer pos, xs:integer offset) - source
No short description available

Outputs Detail

No short description available
Attributes
encoding
UTF-8
indent
no
method
xml

Element Space Detail

strip * - source
No short description available
Namespace Prefix Summary:
#default - 

Parameters Detail

xs:boolean debug_ranges - source
Show extra debugging information as egXML
Might be considered "embedding_bug_kludge"
Don't try to happily use the string value of the elements pointed at: dive into them in search of embedded nodes. This is to compensate for an xmllint bug that doesn't treat embedded nodes properly (https://bugzilla.gnome.org/show_bug.cgi?id=620195).
xs:boolean edge_bug_kludge - source
Use a kludge around the nasty xmllint bug that spoils edges of strings (https://bugzilla.gnome.org/show_bug.cgi?id=620190). The kludge is just as nasty as the bug but well, at least it produces valid output!
xs:boolean keep_offset - source
Redundantly keep the offset info for the source text inside @corresp attributes
For debugging
Print the dereferenced string inside a comment after each segment.
xs:boolean tokenize_heads - source
Should <head> elements be tokenized? There is no single good answer to this, but this information should definitely be kept in the file somehow, so that it can make it into the header
If this is set, the relevant elements will receive type="head".
xs:boolean use_id - source
Use XPath id() to locate the relevant elements
It should have been in from the beginning -- for some reason, I was sure it didn't work but it does. Because implementations of XPointers are so scarce and scary, I'm leaving the old code, just in case.
xs:boolean use_tei - source
Use TEI pointer syntax
Use W3C pointer syntax

Variables Detail

xs:string iso_id - source
Assume you only tokenize under lg/ (see the indexer for a more flexible version)
element()+ lg_hash - source
Holds specifications of sub-token sequences for each language, space-separated; naturally, this can only work for simple languages
We can use the $ anchor here because this is matched against strings that have already been whitespace- and punctuation-separated.
xs:string my_fname - source
the complete file name of the file operated on; needed to construct references
xs:string my_schema - source
Override the variable defined in identity_transform.xsl: we need a different schema here.
No short description available
No short description available
The regex character group for punctuation is too coarse-grained, so punctuation has to be caught by hand :-/ using just opening/closing characters and quotes, plus Sc for currency symbols; '%' is always treated as a separate token
xs:string teixmlns - source
TEI namespace
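
Two of the variables above lend themselves to a small illustration: the per-language sub-token table (lg_hash) and the hand-picked punctuation classes. The Python sketch below is not the stylesheet's code. The table contents, the language code "eng", and the names LG_HASH, split_subtokens and is_separate are hypothetical, and the table holds literal suffixes although the real one may hold regular expressions. It shows why an end-of-string anchor is safe once tokens are already separated, and how opening/closing punctuation, currency symbols and '%' can be singled out by Unicode category.

```python
import re
import unicodedata

# Hypothetical per-language table: space-separated sub-token suffixes.
LG_HASH = {"eng": "n't 's 'll"}


def split_subtokens(token, iso_id):
    """Split one already-separated token at the first matching suffix."""
    for suffix in LG_HASH.get(iso_id, "").split():
        # The $ anchor is safe: the token carries no trailing whitespace
        # or punctuation, so the suffix must sit at its very end.
        m = re.match(r"(.+)(%s)$" % re.escape(suffix), token)
        if m:
            return [m.group(1), m.group(2)]
    return [token]


# Unicode general categories: opening (Ps), closing (Pe), initial quote (Pi),
# final quote (Pf), and currency symbol (Sc).
SEPARATE_CATEGORIES = {"Ps", "Pe", "Pi", "Pf", "Sc"}


def is_separate(ch):
    """True if ch belongs to one of the classes above, or is '%'."""
    return ch == "%" or unicodedata.category(ch) in SEPARATE_CATEGORIES
```

A language missing from the table simply gets its tokens back unchanged, which matches the remark that this approach only suits simple languages.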

Template Modes Detail

No short description available
Templates Using This Mode:
No short description available
Templates Using This Mode:

Match Templates Detail

body (mode: #default) - source
The @rend attribute is abused here to store reference to the source file's SVN Id string
No short description available
This template looks at the paragraph/sentence level elements and tokenizes them, either using their string value or processing text() and element() nodes as they come (the latter is the default due to a bug in xmllint).
The structure is flattened: it becomes a sequence of <ab> elements.
element()* text()|* (mode: dive) - source
No short description available
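
The difference between using an element's string value and diving into its text() and element() nodes can be sketched in Python with the standard xml.etree.ElementTree module. This only illustrates the idea and is not a port of the templates: the function name dive and the (offset, text) pairs are assumptions made for the example. Each text node is visited in document order while a running offset is kept, so the offsets stay aligned with the element's string value even when the text is interrupted by embedded elements.

```python
import xml.etree.ElementTree as ET


def dive(elem, offset=0):
    """Walk elem in document order.

    Returns (chunks, end_offset): chunks is a list of (offset, text) pairs,
    one per text node, with offsets counted within the string value of the
    outermost element; end_offset is the offset just past the last chunk.
    """
    chunks = []
    if elem.text:
        chunks.append((offset, elem.text))
        offset += len(elem.text)
    for child in elem:
        # Descend into the embedded element first, then take the text
        # that follows it (its "tail" in ElementTree terms).
        sub, offset = dive(child, offset)
        chunks.extend(sub)
        if child.tail:
            chunks.append((offset, child.tail))
            offset += len(child.tail)
    return chunks, offset


if __name__ == "__main__":
    p = ET.fromstring("<p>One <hi>two</hi> three</p>")
    chunks, total = dive(p)
    print(chunks, total)
```

Concatenating the chunks gives back the string value of the element, so both strategies address the same character positions; diving merely avoids asking the processor for the string value of an element with embedded nodes.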

Functions Detail

element()+ f:map_string (param: xs:string str, xs:string+ seq, xs:integer pos, xs:integer offset) - source
No short description available
Parameters:
xs:string str -
xs:string+ seq -
xs:integer pos -
xs:integer offset -