<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xd="http://www.pnp-software.com/XSLTdoc" xmlns:f="func" xpath-default-namespace="http://www.tei-c.org/ns/1.0" exclude-result-prefixes="xs xd f" version="2.0">

  
<xsl:import href="identity_transform.xsl" />

  
<xsl:output encoding="UTF-8" method="xml" indent="no" />

  
<xsl:strip-space elements="*" />



  
  <xd:doc type="stylesheet">

    
<xd:short>Token-splitter: creates the segmentation level of annotation.</xd:short>

    
<xd:detail>

      
<p>Splits tokens according to whitespace and punctuation; may apply language-specific settings; doesn't know about sentence boundaries.</p>

      
<p>This is intended as a quick tool for segmenting languages that are easy to segment. Other languages may require dedicated tools. Still, some provision for parametrization is included here.</p>

      
<p>The output file is a token-segmentation annotation document, referencing the source text document in a variety of ways.</p>

      
<p>Distributor: Open-Content Text Corpus (<a href="http://OCTC.sourceforge.net/">http://OCTC.sourceforge.net/</a>)</p>

    
</xd:detail>

    
<xd:author>Piotr Bański</xd:author>

    
<xd:copyright>the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).</xd:copyright>

    
<xd:svnId>$Id: tokenizer.xsl 420 2010-12-18 23:01:47Z bansp $</xd:svnId>

  
</xd:doc>




  
  <xd:doc>Should &lt;head&gt; elements be tokenized? There is no single good answer to this, but this information should 

    definitely be kept in the file somehow, so that it can make it into the header. If this is set, the relevant elements 

    will receive type="head".
</xd:doc>


  
<xsl:param name="tokenize_heads" as="xs:boolean" select="true()" />

  

  
  <xd:doc>For debugging. Print the dereferenced string inside a comment after each segment.</xd:doc>


  
<xsl:param name="source_in_comment" as="xs:boolean" select="false()" />



  
  <xd:doc>Redundantly keep the offset info for the source text inside @corresp attributes.</xd:doc>


  
<xsl:param name="keep_offset" as="xs:boolean" select="false()" />

  

  
  <xd:doc>Use TEI pointer syntax.</xd:doc>


  
<xsl:param name="use_tei" as="xs:boolean" select="false()" />

  

  
  <xd:doc>Use W3C pointer syntax.</xd:doc>


  
<xsl:param name="use_w3c" as="xs:boolean" select="true()" />

  

  
  <xd:doc>Use XPath id() to locate the relevant elements. It should have been in from the beginning -- for some reason, I was sure it didn't work but it does. Because

    implementations of XPointers are so scarce and scary, I'm leaving the old code, just in case.
</xd:doc>


  
<xsl:param name="use_id" as="xs:boolean" select="true()" />
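
<!-- Illustration (the file name and id below are hypothetical): with the default settings
     (use_id and use_w3c true, use_tei false), a non-edge token that starts at character 5
     and is 3 characters long, inside an element with xml:id="p1" in text.xml, should be
     emitted roughly as

       <seg><xi:include href="text.xml" xpointer="xpointer(string-range(id('p1'),'',5,3)[1])"/></seg>

     With use_id set to false, the xmlns(t=...) scheme part is prepended and the id() call
     is replaced by an explicit path ending in //t:*[@xml:id='p1']. -->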

  

  
  <xd:doc>Show extra debugging information as egXML.</xd:doc>


  
<xsl:param name="debug_ranges" as="xs:boolean" select="false()" />



  
  <xd:doc>Use a kludge around the nasty xmllint bug that spoils edges of strings (https://bugzilla.gnome.org/show_bug.cgi?id=620190). The kludge is just as nasty as the

    bug but well, at least it produces valid output!
</xd:doc>


  
<xsl:param name="edge_bug_kludge" as="xs:boolean" select="true()" />



  
  <xd:doc>Might be considered "embedding_bug_kludge". Don't try to happily use the string value of the elements pointed at: dive into them in search of embedded

    nodes. This is to compensate for an xmllint bug that doesn't treat embedded nodes properly (https://bugzilla.gnome.org/show_bug.cgi?id=620195).
</xd:doc>


  
<xsl:param name="dive_into_content" as="xs:boolean" select="true()" />



  
  <xd:doc>TEI namespace.</xd:doc>


  
<xsl:variable name="teixmlns" as="xs:string" select="'xmlns(t=http://www.tei-c.org/ns/1.0)'" />



  
  <xd:doc>Override the variable defined in identity_transform.xsl: we need a different schema here.</xd:doc>


  
<xsl:variable name="my_schema" select="'OCTC_segmentation.rng'" as="xs:string" />

  

  
  <xd:doc>Assume you only tokenize under lg/ (see the indexer for a more flexible version).</xd:doc>


  
<xsl:variable name="iso_id" as="xs:string" select="substring-before(substring-after(document-uri(/),'/lg/'),'/')" />

  

  
  <xd:doc>Holds specifications of sub-token sequences for each language, space-separated; naturally, 

    this can only work for simple languages. We can use the $ anchor here because this is matched against 

    strings that have already been whitespace- and punctuation-separated.
</xd:doc>


  
<xsl:variable name="lg_hash" as="element()+">

    
<lg key="eng" seq="'s 'es '$ 'll 'd 've 're n't \-" /> <!-- this is close to CLAWS, I believe -->

    
<!-- catching also the trailing apostrophe, as in "the reviewers' remarks"; note that at this point, 

    the apostrophe is at the end of the string
-->

    
<!-- for CLAWS, let me put these two links here, until they are implemented:

      http://ucrel.lancs.ac.uk/bnc2sampler/fused.htm

      http://ucrel.lancs.ac.uk/bnc2sampler/ditto.htm

      

      Note that the lg_hash should probably become more complicated to handle forms such as "innit" - sometimes, 

      the entire word-form has to be scanned for, and then the appropriate tokenizing action should take place

    
-->

    

    
<!--<lg key="swh" seq="je$ \-"/>  need to add a list of false matches first (due to Beata):

    

    nije

    uje

    aje

    tuje

    mje

    waje

    nisije

    usije

    asije

    tusije

    msije

    wasije

    nje

    ingawaje

    punje

    
-->

    
<lg key="swh" seq="\-" />

  
</xsl:variable>



  
  <xd:doc>The complete file name of the file being processed; needed to construct references.</xd:doc>


  
<xsl:variable name="my_fname" as="xs:string" select="tokenize(document-uri(/),'/')[last()]" />



  
  <xd:doc>The regex character class for punctuation is too coarse-grained, so punctuation has to be caught by hand :-/ using just 

    opening/closing characters/quotes, and Sc for currency symbols; '%' is always treated as a separate token.
</xd:doc>


  
<xsl:variable name="sub-token_seq_basic" as="xs:string+" select="('\.{1,}', ',', ';', ':', '%', '!{1,}', '\?{1,}', '\p{Ps}', '\p{Pe}', '\p{Pi}', '\p{Pf}', '\p{Sc}')" />



  
<xsl:variable name="sub-token_match_basic" select="concat('(',string-join($sub-token_seq_basic, '|'), ')')" as="xs:string" />
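
<!-- Worked example (a hand-computed sketch; the sample input is made up): with the
     alternatives listed in $sub-token_seq_basic, $sub-token_match_basic evaluates to

       (\.{1,}|,|;|:|%|!{1,}|\?{1,}|\p{Ps}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Sc})

     The main template wraps every match in spaces and then normalizes whitespace, so the
     input "Wait... (really?)" should become "Wait ... ( really ? )".
     Straight quotes (U+0022, U+0027) belong to category Po and are therefore not split off here. -->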

   

  
<xsl:variable name="sub-token_match_adv" as="xs:string">

    
<xsl:variable name="sub-token_seq_adv" as="xs:string+" select="if (count($lg_hash[@key eq $iso_id])) then tokenize(normalize-space($lg_hash[@key eq $iso_id]/@seq),' ') else ('\-')" />

    
<xsl:value-of select="concat('(\w+)(',string-join($sub-token_seq_adv, '|'), ')')" />

  
</xsl:variable>
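
<!-- Worked example (illustrative; the sample words are assumptions, not corpus data):
     for a file under lg/eng/, the 'eng' entry of $lg_hash yields

       (\w+)('s|'es|'$|'ll|'d|'ve|'re|n't|\-)

     Applied to single, already pre-split tokens with the replacement ' $1 $2 ' and
     whitespace normalization, this should give:

       John's      =>  John 's
       don't       =>  do n't
       reviewers'  =>  reviewers '
       well-known  =>  well - known

     For a language without an entry in $lg_hash, only the hyphen is split off: (\w+)(\-) -->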

  

  
<xsl:template match="div|list">

    
<xsl:apply-templates />

  
</xsl:template>

  

  
  <xd:doc>The @rend attribute is abused here to store reference to the source file's SVN Id string.</xd:doc>


  
<xsl:template match="body">

    
<xsl:text>

</xsl:text>

    
<xsl:copy>

      
<xsl:attribute name="rend" select="substring-after(parent::text/@n,'$')" />

      
<xsl:apply-templates select="@* except @xml:id" />

      
<xsl:text>

</xsl:text>

      
<xsl:if test="$source_in_comment">

        
<xsl:comment select="'Note that the offsets in the comments (start, end, length) need not be identical to the offsets used in the xpointers, which may obey extra rules.'" />

        
<xsl:text>

</xsl:text>

      
</xsl:if>

      
<xsl:apply-templates select="node()" />

    
</xsl:copy>

  
</xsl:template>

  

  
<xsl:template mode="dive" as="element()*" match="text()|*">

    
<xsl:choose>

      
<xsl:when test="name(.) eq ''">

        
<cell text_seq-text="{.}" text_seq-segm="{count(preceding-sibling::text())+1}" text_seq-id="{parent::*/@xml:id}" />

      
</xsl:when>

      
<xsl:when test="count(node()) eq 0" /> <!-- return empty seq if this is an empty element -->

      
<xsl:when test="count(node()) eq count(text())">

        
<cell text_seq-text="{.}" text_seq-segm="0" text_seq-id="{@xml:id}" />

      
</xsl:when>

      
<xsl:otherwise><xsl:apply-templates mode="dive" select="text()|*" /></xsl:otherwise>

    
</xsl:choose>

    

  
</xsl:template>

  

  
  <xd:doc>This template looks at the paragraph/sentence level elements and tokenizes them, either using their string value or processing text() and element() nodes as

    they come (the latter is the default, due to a bug in xmllint). The structure is flattened: it becomes a sequence of &lt;ab&gt; elements.
</xd:doc>


  
<xsl:template match="p|ab|item|head">

    
<xsl:choose>

      
<xsl:when test="local-name() eq 'head' and not($tokenize_heads)" />

      
<xsl:otherwise>

        
<xsl:element name="ab" namespace="http://www.tei-c.org/ns/1.0">

          
<xsl:attribute name="type" select="if (local-name() eq 'head') then 'head' else 'para'" />  <!-- note the type here - should it always be 'para'? --> 

          
<xsl:attribute name="corresp" select="concat($my_fname,'#',./@xml:id)" />



          
<xsl:variable name="text_seq" as="element()+">

            
<!-- oh no, this catches elements, but should also catch comments etc. ugh-->

            
<xsl:choose>

              
<xsl:when test="* and $dive_into_content">

                
<xsl:apply-templates select="text()|*" mode="dive" />

              
</xsl:when>

              
<xsl:otherwise><cell text_seq-text="{string(.)}" text_seq-segm="0" text_seq-id="{@xml:id}" /></xsl:otherwise>

            
</xsl:choose>

          
</xsl:variable>



          
<xsl:for-each select="$text_seq">

            
<xsl:variable name="current_text_seq" select="." />

            
<xsl:variable name="trans" select="translate(@text_seq-text,'@_`',' ''')" as="xs:string" />

            
<xsl:variable name="match_seq" as="xs:string" select="normalize-space(replace($trans,$sub-token_match_basic,' $1 ','i'))" />



            
<xsl:variable as="xs:string+" name="basic_seq">

              
<xsl:for-each select="tokenize($match_seq,' ')">

                
<xsl:sequence select="." />

              
</xsl:for-each>

            
</xsl:variable>



            
<xsl:variable as="xs:string+" name="adv_seq">

              
<xsl:for-each select="$basic_seq">

                
<xsl:variable name="match_seq2" as="xs:string" select="normalize-space(replace(.,$sub-token_match_adv,' $1 $2 ','i'))" />

                
<xsl:sequence select="tokenize($match_seq2,' ')" />

              
</xsl:for-each>

            
</xsl:variable>



            
<xsl:variable name="string_map" as="element()+" select="f:map_string($current_text_seq/@text_seq-text,$adv_seq,1,0)" />



            
<!-- dumps ugly debug info -->

            
<xsl:if test="$debug_ranges">

              
<xsl:text>

</xsl:text>

              
<xsl:element name="egXML" namespace="http://www.tei-c.org/ns/Examples">

                
<xsl:for-each select="$string_map">

                  
<xsl:text>

</xsl:text>

                  
<xsl:copy-of select="." />

                
</xsl:for-each>

              
</xsl:element>

            
</xsl:if>



            
<xsl:text>

</xsl:text>

            
<xsl:for-each select="$adv_seq">

              
<xsl:variable name="pos" select="position()" />

              
<xsl:element name="seg" namespace="http://www.tei-c.org/ns/1.0">

                
<xsl:variable name="tei_info" as="xs:string" select="concat($string_map[$pos]/@point_bef_start,',',$string_map[$pos]/@length)" />

                
<xsl:variable name="w3c_info" as="xs:string" select="concat($string_map[$pos]/@point_bef_start+1,',',$string_map[$pos]/@length)" />

                
<xsl:variable name="node_id-left" as="xs:string" select="if ($use_id) then 'id(''' else '/t:teiCorpus/t:teiCorpus/t:TEI/t:text/t:body//t:*[@xml:id='''" />

                
<xsl:variable name="node_id-right" as="xs:string" select="concat(if ($use_id) then ''')' else ''']',if(number($current_text_seq/@text_seq-segm)) then concat('/text()[',$current_text_seq/@text_seq-segm,']') else '')" />

                
<xsl:if test="$string_map[$pos]/@glued eq 'true'">

                  
<xsl:attribute name="rend" select="'glued'" />

                  
<!-- 'glued' for 'continuing the string to the left', sorry about the name -->

                
</xsl:if>

                
<xsl:if test="$keep_offset or ($edge_bug_kludge and count($string_map[$pos]/@form))">

                  
<!-- note that out of laziness, this is only done for w3c info for now -->

                  
<xsl:attribute name="corresp" select="concat($my_fname,'#',if (not($use_id)) then $teixmlns else '','xpointer(string-range(',$node_id-left,$current_text_seq/@text_seq-id,$node_id-right,','''',',$w3c_info,')[1])')" />

                
</xsl:if>

                
<xsl:choose>

                  
<xsl:when test="$edge_bug_kludge and count($string_map[$pos]/@form)">

                    
<xsl:value-of select="$string_map[$pos]/@form" />

                  
</xsl:when>

                  
<xsl:otherwise>

                    
<xsl:element name="xi:include">

                      
<xsl:attribute name="href" select="$my_fname" />

                      
<xsl:attribute name="xpointer" select="concat(if (not($use_id)) then $teixmlns else '',if ($use_tei) then concat('string-range(',$node_id-left,$current_text_seq/@text_seq-id,$node_id-right,','''',',$tei_info,')') else '',if ($use_w3c) then concat('xpointer(string-range(',$node_id-left,$current_text_seq/@text_seq-id,$node_id-right,','''',',$w3c_info,')[1])') else '')" />

                    
</xsl:element>

                  
</xsl:otherwise>

                
</xsl:choose>

              
</xsl:element>

              
<xsl:if test="$source_in_comment">

                
<xsl:comment select="concat(.,' (',$string_map[$pos]/@point_bef_start,',',$string_map[$pos]/@point_bef_end,',',$string_map[$pos]/@length,')')" />

              
</xsl:if>

              
<xsl:text>

</xsl:text>

            
</xsl:for-each>

          
</xsl:for-each>

        
</xsl:element>

        
<xsl:text>

</xsl:text>

      
</xsl:otherwise>

    
</xsl:choose>

  
</xsl:template>
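
<!-- Worked example for f:map_string (a hand-computed sketch; the input string is made up):
     f:map_string('Hi, you.', ('Hi', ',', 'you', '.'), 1, 0) should return four row elements:

       pos  str  point_bef_start  point_bef_end  length  glued  form
       1    Hi   0                2              2       false  Hi
       2    ,    2                3              1       true   (absent)
       3    you  4                7              3       false  (absent)
       4    .    7                8              1       true   .

     @form appears only on the first and last rows, and only while $edge_bug_kludge is true.
     The W3C start index used in the pointers is point_bef_start + 1, so "you" is addressed as 5,3. -->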

  

  
<xsl:function name="f:map_string" as="element()+">

    
<xsl:param name="str" as="xs:string" />

    
<xsl:param name="seq" as="xs:string+" />

    
<xsl:param name="pos" as="xs:integer" />

    
<xsl:param name="offset" as="xs:integer" />

    

    
<xsl:variable name="head" as="xs:string" select="$seq[$pos]" />

    
<xsl:variable name="head_length" as="xs:integer" select="string-length($head)" />

    
<xsl:variable name="tail" as="xs:string" select="substring-after($str,$head)" />

    
<xsl:variable name="is_first" as="xs:boolean" select="$pos eq 1" />

    
<xsl:variable name="is_last" as="xs:boolean" select="$pos eq count($seq)" />

    
<xsl:variable name="glued" as="xs:boolean" select="not(starts-with($str, ' ')) and not($is_first)" />

    
<xsl:variable name="point_bef_start" as="xs:integer" select="if ($glued or $is_first) then $offset else $offset+1" />

    
<xsl:variable name="point_bef_end" as="xs:integer" select="$point_bef_start + $head_length" />

    

    
<xsl:variable name="nd" as="element()">

      
<xsl:element name="row">

        
<xsl:attribute name="pos" select="$pos" />

        
<xsl:attribute name="str" select="$head" />

        
<xsl:attribute name="point_bef_start" select="$point_bef_start" />

        
<xsl:attribute name="point_bef_end" select="$point_bef_end" />

        
<xsl:attribute name="length" select="$head_length" />

        
<xsl:attribute name="glued" select="$glued" />

        
<xsl:if test="$edge_bug_kludge and ($is_first or $is_last)">

          
<xsl:attribute name="form" select="$head" />

        
</xsl:if>

      
</xsl:element>

    
</xsl:variable>

    

    
<xsl:sequence select="($nd, if (not($is_last)) then f:map_string($tail,$seq,$pos+1,$point_bef_end) else ())" />    

  
</xsl:function>

</xsl:stylesheet>












































































