tokenizer.xsl
Imports
Token-splitter: creates the segmentation level of annotation.
Splits text into tokens at whitespace and punctuation; may apply language-specific settings; doesn't know about sentence boundaries.
This is intended as a quick tool for segmenting languages that are easy to segment. Other languages may require dedicated tools. Still, some provision for parametrization is included here.
The output file is a token-segmentation annotation document, referencing the source text document in a variety of ways.
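The splitting described above can be illustrated outside XSLT. The following is a minimal Python sketch, not the stylesheet's own code: the regular expression and the `tokenize` name are assumptions made for illustration, and the real tokenizer applies its own hand-picked punctuation classes and language-specific settings.

```python
import re

# Assumption: a token is either a run of word characters or one single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return (offset, length, string) triples, one per token, in text order."""
    return [(m.start(), len(m.group()), m.group()) for m in TOKEN_RE.finditer(text)]

for offset, length, token in tokenize("Hello, world!"):
    print(offset, length, token)
```

Keeping the offset and length next to each token is what lets a segmentation document point back into the source text instead of copying it.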
Distributor: Open-Content Text Corpus (http://OCTC.sourceforge.net/)
Author:
Piotr Bański
Copyright:
the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).
SVN Id:
$Id: tokenizer.xsl 420 2010-12-18 23:01:47Z bansp $
XSLT Version:
2.0
Namespace Prefix Summary:
f - func
xd - http://www.pnp-software.com/XSLTdoc
xi - http://www.w3.org/2001/XInclude
xs - http://www.w3.org/2001/XMLSchema
xsl - http://www.w3.org/1999/XSL/Transform
XPath Default Namespace:
http://www.tei-c.org/ns/1.0
Parameters Summary
Show extra debugging information as egXML
Might be considered "embedding_bug_kludge"
Use a kludge around the nasty xmllint bug that spoils edges of strings
Redundantly keep the offset info for the source text inside @corresp attributes
For debugging
Should <head> elements be tokenized? There is no single good answer to this, but this information should
definitely be kept in the file somehow, so that it can make it into the header
Variables Summary
Assume you only tokenize under lg/ (see the indexer for a more flexible version)
Holds specifications of sub-token sequences for each language, space-separated; naturally,
this can only work for simple languages
the complete file name of the file operated on; needed to construct references
No short description available
No short description available
The regex character group for punctuation is too coarse-grained, so punctuation has to be caught by hand :-/ using just
opening/closing characters/quotes, and Sc for currency symbols; '%' is always treated as a separate token
Match Templates Summary
The @rend attribute is abused here to store a reference to the source file's SVN Id string
This template looks at the paragraph/sentence level elements and tokenizes them, either using their string value or processing text() and element() nodes as
they come (the latter is the default due to a bug in xmllint)
Functions Summary
element()+ f:map_string (param: xs:string str, xs:string+ seq, xs:integer pos, xs:integer offset)
No short description available
Outputs Detail
Element Space Detail
Parameters Detail
Show extra debugging information as egXML
Might be considered "embedding_bug_kludge"
Don't try to happily use the string value of the elements pointed at: dive into them in search of embedded
nodes. This is to compensate for an xmllint bug that doesn't treat embedded nodes properly (https://bugzilla.gnome.org/show_bug.cgi?id=620195).
Use a kludge around the nasty xmllint bug that spoils edges of strings (https://bugzilla.gnome.org/show_bug.cgi?id=620190).
The kludge is just as nasty as the
bug but well, at least it produces valid output!
Redundantly keep the offset info for the source text inside @corresp attributes
For debugging
Print the dereferenced string inside a comment after each segment.
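To make the two debugging aids above concrete, here is a small Python sketch of a segment that carries its offset information in @corresp and, optionally, the dereferenced string in a trailing comment. The `<seg>` element name, the `string-range(...)` pointer layout, and the `segment` function are assumptions for illustration only; the stylesheet's actual output format is not shown in this documentation.

```python
import re
from xml.sax.saxutils import quoteattr

def segment(source_text, target_id, offset, length, debug=False):
    """Render one segment; the pointer layout here is hypothetical."""
    pointer = f"#string-range({target_id},{offset},{length})"
    xml = f"<seg corresp={quoteattr(pointer)}/>"
    if debug:
        # Dereference the pointer: the substring the segment refers to.
        deref = source_text[offset:offset + length]
        # "--" is not allowed inside an XML comment, so break up hyphen runs.
        safe = re.sub(r"-(?=-)", "- ", deref)
        xml += f"<!-- {safe} -->"
    return xml

print(segment("Hello, world!", "p1", 7, 5, debug=True))
```

Seeing the dereferenced string next to each pointer makes off-by-one errors in the offsets easy to spot by eye.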
Should <head> elements be tokenized? There is no single good answer to this, but this information should
definitely be kept in the file somehow, so that it can make it into the header. If this is set, the relevant elements
will receive type="head".
Use XPath id() to locate the relevant elements
It should have been in from the beginning -- for some reason, I was sure it didn't work but it does. Because
implementations of XPointers are so scarce and scary, I'm leaving the old code, just in case.
Variables Detail
Assume you only tokenize under lg/ (see the indexer for a more flexible version)
Holds specifications of sub-token sequences for each language, space-separated; naturally,
this can only work for simple languages
We can use the $ anchor here because this is matched against
strings that have already been whitespace- and punctuation- separated.
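A rough Python illustration of how such a space-separated specification could be turned into an end-anchored pattern follows. The language code, the example sequences, and the `split_subtoken` helper are all hypothetical, and the sketch assumes the ASCII apostrophe has not been split off by the earlier punctuation pass.

```python
import re

# Hypothetical specification: sub-token sequences per language, space-separated.
SUBTOKENS = {"en": "n't 's 're 'll"}

def split_subtoken(token, lang):
    """Split a trailing sub-token off an already isolated token, if one matches."""
    spec = SUBTOKENS.get(lang, "")
    if not spec:
        return [token]
    alternatives = "|".join(re.escape(s) for s in spec.split())
    # The $ anchor is safe only because the token has already been separated
    # from surrounding whitespace and punctuation.
    m = re.match(rf"(.+?)({alternatives})$", token)
    return [m.group(1), m.group(2)] if m else [token]

print(split_subtoken("don't", "en"))
```

The lazy `.+?` prefix requires at least one character before the sub-token, so a token that consists only of a listed sequence is left whole.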
the complete file name of the file operated on; needed to construct references
Override the variable defined in identity_transform.xsl: we need a different schema here.
No short description available
No short description available
The regex character group for punctuation is too coarse-grained, so punctuation has to be caught by hand :-/ using just
opening/closing characters/quotes, and Sc for currency symbols; '%' is always treated as a separate token
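The idea of picking punctuation by fine-grained Unicode category can be sketched in Python as follows. Reading "opening/closing characters/quotes" as the categories Ps, Pe, Pi and Pf is an assumption, as are the function names; only Sc and the special treatment of '%' come from the description above. Other punctuation, such as commas, is deliberately left out of this sketch.

```python
import unicodedata

# Assumed reading: Ps/Pe = opening/closing brackets, Pi/Pf = initial/final
# quotes, Sc = currency symbols.
SEPARATE_CATEGORIES = {"Ps", "Pe", "Pi", "Pf", "Sc"}

def is_separate(ch):
    """True if the character should become a token of its own."""
    return ch == "%" or unicodedata.category(ch) in SEPARATE_CATEGORIES

def split_punctuation(chunk):
    """Split a whitespace-delimited chunk around every separable character."""
    parts, buf = [], ""
    for ch in chunk:
        if is_separate(ch):
            if buf:
                parts.append(buf)
                buf = ""
            parts.append(ch)
        else:
            buf += ch
    if buf:
        parts.append(buf)
    return parts

print(split_punctuation("($5)"))
```

Python's `re` module has no `\p{...}` classes, which is why this sketch consults `unicodedata.category` directly instead of a regular expression.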
Template Modes Detail
No short description available
Templates Using This Mode:
Match Templates Detail
The @rend attribute is abused here to store a reference to the source file's SVN Id string
This template looks at the paragraph/sentence level elements and tokenizes them, either using their string value or processing text() and element() nodes as
they come (the latter is the default due to a bug in xmllint).
The structure is flattened: it becomes a sequence of <ab> elements.
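The flattening can be pictured with a short Python sketch using the standard library's ElementTree. Apart from `<ab>`, every name here is an assumption: the input element names, the `<seg>` and `<body>` wrappers, and the `flatten` function are illustrative, namespaces are ignored, and tokens are split on whitespace only.

```python
import xml.etree.ElementTree as ET

def flatten(root, tags=("p", "l")):
    """Return a flat container with one <ab> per matched element."""
    out = ET.Element("body")
    for elem in root.iter():
        if elem.tag in tags:
            ab = ET.SubElement(out, "ab")
            # itertext() yields text nodes in document order and descends
            # into embedded elements, so inline markup does not hide text.
            for chunk in elem.itertext():
                for token in chunk.split():
                    ET.SubElement(ab, "seg").text = token
    return out

doc = ET.fromstring("<div><lg><l>one <hi>two</hi> three</l><l>four</l></lg></div>")
print(ET.tostring(flatten(doc), encoding="unicode"))
```

Whatever nesting the source had (`div`, `lg`), the result is a plain sequence of `<ab>` siblings.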
Functions Detail
element()+ f:map_string (param: xs:string str, xs:string+ seq, xs:integer pos, xs:integer offset)
No short description available
Parameters:
xs:string str -
xs:string+ seq -
xs:integer pos -
xs:integer offset -
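No description of f:map_string survives here, so the following Python sketch is only a guess at what a function with this signature might do: walk the sequence of token strings recursively, locate each one in the source string from the current offset onward, and emit one record per token. The behaviour, the dictionary output (standing in for the returned elements), and the zero-based `pos` are all assumptions, not the author's implementation.

```python
def map_string(s, seq, pos=0, offset=0):
    """Recursively locate seq[pos:] in s, searching from offset onward."""
    if pos >= len(seq):
        return []
    token = seq[pos]
    # str.index raises ValueError if the token cannot be found.
    start = s.index(token, offset)
    here = {"string": token, "offset": start, "length": len(token)}
    return [here] + map_string(s, seq, pos + 1, start + len(token))

for record in map_string("Hello, world!", ["Hello", ",", "world", "!"]):
    print(record)
```

Carrying `offset` forward keeps repeated tokens from being matched at their first occurrence every time.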