tokenizer.xsl

Imports

identity_transform.xsl

Token-splitter: creates the segmentation level of annotation.

Splits tokens according to whitespace and punctuation; may apply language-specific settings; doesn't know about sentence boundaries.

This is intended as a quick tool for segmenting languages that are easy to segment. Other languages may require dedicated tools. Still, some provision for parametrization is included here.

The output file is a token-segmentation annotation document, referencing the source text document in a variety of ways.
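
As a rough illustration of whitespace-and-punctuation splitting with source offsets, here is a minimal Python sketch. It is not the stylesheet's XSLT 2.0 code: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names chosen for this example, and the real tokenizer may draw its boundaries differently.

```python
import re

# Hypothetical pattern: a run of word characters, or a single character
# that is neither a word character nor whitespace (i.e. punctuation/symbols).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return (token, offset, length) triples; offsets count characters from 0."""
    return [(m.group(), m.start(), len(m.group())) for m in TOKEN_RE.finditer(text)]

tokens = tokenize("Hello, world!")
```

Keeping the offset and length alongside each token is what would let a segmentation document point back into the source text instead of copying it.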

Distributor: Open-Content Text Corpus (http://OCTC.sourceforge.net/)

Author:

Piotr Bański

the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).

SVN Id:

$Id: tokenizer.xsl 420 2010-12-18 23:01:47Z bansp $

XSLT Version:

2.0

Namespace Prefix Summary:

f - func

xd - http://www.pnp-software.com/XSLTdoc

xi - http://www.w3.org/2001/XInclude

xs - http://www.w3.org/2001/XMLSchema

xsl - http://www.w3.org/1999/XSL/Transform

XPath Default Namespace:

http://www.tei-c.org/ns/1.0

Outputs Summary

#default - source

No short description available

Element Space Summary

strip * - source

No short description available

Parameters Summary

xs:boolean debug_ranges - source

Show extra debugging information as egXML

xs:boolean dive_into_content - source

Might be considered "embedding_bug_kludge"

xs:boolean edge_bug_kludge - source

Use a kludge around the nasty xmllint bug that spoils edges of strings (https://bugzilla.gnome.org/show_bug.cgi?id=620190)

xs:boolean keep_offset - source

Redundantly keep the offset info for the source text inside @corresp attributes

xs:boolean source_in_comment - source

For debugging

xs:boolean tokenize_heads - source

Should <head> elements be tokenized? There is no single good answer to this, but this information should definitely be kept in the file somehow, so that it can make it into the header

xs:boolean use_id - source

Use XPath id() to locate the relevant elements

xs:boolean use_tei - source

Use TEI pointer syntax

use_w3c - source

Use W3C pointer syntax

Variables Summary

xs:string iso_id - source

Assume you only tokenize under lg/ (see the indexer for a more flexible version)

element()+ lg_hash - source

Holds specifications of sub-token sequences for each language, space-separated; naturally, this can only work for simple languages

xs:string my_fname - source

The complete file name of the file operated on; needed to construct references

xs:string my_schema - source

Override the variable defined in identity_transform

xs:string sub-token_match_adv - source

No short description available

xs:string sub-token_match_basic - source

No short description available

xs:string+ sub-token_seq_basic - source

The regex character group for punctuation is too coarse-grained, so punctuation has to be caught by hand :-/ using just the opening/closing characters and quotes, plus Sc for currency symbols; '%' is always treated as a separate token

xs:string teixmlns - source

TEI namespace

Template Modes Summary

#default

No short description available

dive

No short description available

Match Templates Summary

body (mode: #default) - source

The @rend attribute is abused here to store a reference to the source file's SVN Id string

div|list (mode: #default) - source

No short description available

p|ab|item|head (mode: #default) - source

This template looks at the paragraph/sentence-level elements and tokenizes them, either using their string value or processing text() and element() nodes as they come (the latter is the default due to a bug in xmllint)

element()* text()|* (mode: dive) - source

No short description available

Functions Summary

element()+ f:map_string (param: xs:string str, xs:string+ seq, xs:integer pos, xs:integer offset) - source

No short description available

Outputs Detail

#default - source

No short description available

Attributes

encoding

UTF-8

indent

method

xml

Element Space Detail

strip * - source

No short description available

Namespace Prefix Summary:

#default -

Parameters Detail

xs:boolean debug_ranges - source

Show extra debugging information as egXML

xs:boolean dive_into_content - source

Might be considered "embedding_bug_kludge"

Don't try to happily use the string value of the elements pointed at: dive into them in search of embedded nodes. This is to compensate for an xmllint bug that doesn't treat embedded nodes properly (https://bugzilla.gnome.org/show_bug.cgi?id=620195).
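
The difference between the two strategies can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an analogy for the XSLT behaviour: `string_value` and `dive` are hypothetical helper names, and the sample paragraph is invented for illustration.

```python
import xml.etree.ElementTree as ET

def string_value(elem):
    """Concatenate all descendant text, like the XPath string value."""
    return "".join(elem.itertext())

def dive(elem):
    """List text chunks and child elements separately, in document order."""
    pieces = []
    if elem.text:
        pieces.append(("text", None, elem.text))
    for child in elem:
        # Each embedded element is reported on its own, with its own text.
        pieces.append(("element", child.tag, string_value(child)))
        if child.tail:
            pieces.append(("text", None, child.tail))
    return pieces

p = ET.fromstring("<p>See <hi>this</hi> example.</p>")
flat = string_value(p)
parts = dive(p)
```

With the flat string value, the boundaries of the embedded `hi` element are lost; diving keeps each node addressable on its own.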

xs:boolean edge_bug_kludge - source

Use a kludge around the nasty xmllint bug that spoils edges of strings (https://bugzilla.gnome.org/show_bug.cgi?id=620190). The kludge is just as nasty as the bug but well, at least it produces valid output!

xs:boolean keep_offset - source

Redundantly keep the offset info for the source text inside @corresp attributes

xs:boolean source_in_comment - source

For debugging

Print the dereferenced string inside a comment after each segment.

xs:boolean tokenize_heads - source

Should <head> elements be tokenized? There is no single good answer to this, but this information should definitely be kept in the file somehow, so that it can make it into the header

If this is set, the relevant elements will receive type="head".

xs:boolean use_id - source

Use XPath id() to locate the relevant elements

It should have been in from the beginning -- for some reason, I was sure it didn't work but it does. Because implementations of XPointers are so scarce and scary, I'm leaving the old code, just in case.

xs:boolean use_tei - source

Use TEI pointer syntax

use_w3c - source

Use W3C pointer syntax

Variables Detail

xs:string iso_id - source

Assume you only tokenize under lg/ (see the indexer for a more flexible version)

element()+ lg_hash - source

Holds specifications of sub-token sequences for each language, space-separated; naturally, this can only work for simple languages

We can use the $ anchor here because this is matched against strings that have already been whitespace- and punctuation- separated.
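
A minimal Python sketch of this idea follows. The contents of `LG_HASH`, the language code `eng`, and the function `split_sub_tokens` are all assumptions made for this example; the actual entries in `lg_hash` are not shown in this documentation.

```python
import re

# Hypothetical table: language code -> space-separated sub-token sequences.
LG_HASH = {"eng": "n't 's 're 've 'll 'd 'm"}

def split_sub_tokens(token, lang):
    """Split a trailing sub-token off an already separated token, if one matches."""
    specs = LG_HASH.get(lang, "").split()
    if not specs:
        return [token]
    alternatives = "|".join(re.escape(s) for s in specs)
    # The $ anchor is safe because the token carries no trailing
    # whitespace or punctuation by the time it gets here.
    match = re.match("(.+?)(" + alternatives + ")$", token)
    return [match.group(1), match.group(2)] if match else [token]

parts = split_sub_tokens("don't", "eng")
```

Because the pattern requires at least one character before the sub-token, a bare `'s` is left intact rather than split into an empty stem.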

xs:string my_fname - source

The complete file name of the file operated on; needed to construct references

xs:string my_schema - source

Override the variable defined in identity_transform

xsl: we need a different schema here.

xs:string sub-token_match_adv - source

No short description available

xs:string sub-token_match_basic - source

No short description available

xs:string+ sub-token_seq_basic - source

The regex character group for punctuation is too coarse-grained, so punctuation has to be caught by hand :-/ using just the opening/closing characters and quotes, plus Sc for currency symbols; '%' is always treated as a separate token
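
The selection by Unicode general category can be sketched in Python as below. This assumes that "opening/closing characters/quotes" corresponds to the categories Ps, Pe, Pi and Pf; that mapping, and the names `SPLIT_CATEGORIES`, `is_separate` and `split_edges`, are assumptions of this example rather than the stylesheet's definitions.

```python
import unicodedata

# Assumed categories: open/close punctuation (Ps, Pe),
# initial/final quotes (Pi, Pf), and currency symbols (Sc).
SPLIT_CATEGORIES = {"Ps", "Pe", "Pi", "Pf", "Sc"}

def is_separate(ch):
    """True if the character should become a token of its own."""
    # '%' belongs to the broad category Po, so it is listed explicitly.
    return ch == "%" or unicodedata.category(ch) in SPLIT_CATEGORIES

def split_edges(chunk):
    """Split a whitespace-delimited chunk around the selected characters."""
    tokens, current = [], ""
    for ch in chunk:
        if is_separate(ch):
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

pieces = split_edges("($50)")
```

Selecting narrow categories rather than all punctuation leaves characters such as the apostrophe or hyphen inside their tokens.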

xs:string teixmlns - source

TEI namespace