<
xsl:
stylesheet xmlns:xsl="
http://www.w3.org/1999/XSL/Transform"
xmlns:xs="
http://www.w3.org/2001/XMLSchema"
xmlns:xi="
http://www.w3.org/2001/XInclude"
xmlns:xd="
http://www.pnp-software.com/XSLTdoc"
xmlns:f="
func"
xpath-default-namespace="
http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="
xs xd f"
version="
2.0">
<
xsl:
import href="
identity_transform.xsl" />
<
xsl:
output encoding="
UTF-8"
method="
xml"
indent="
no" />
<
xd:
doc type="
stylesheet">
<
xd:
short>
Indexer: creates indices for everything under the body.</
xd:
short>
<
xd:
detail>
<
p>
Builds indices on the basis of the subcorpus, file type, location in the tree and (redundantly) the tag name. The index structure is as follows (linearly):</
p>
<
ol>
<
li>
subcorpus identifier as ISO-639-3 (or a sequence thereof) cut out from the path.</
li>
<
li>
underscore</
li>
<
li>
file type (source text, kind of annotation, etc.), using the classification from the main header (soon)</
li>
<
li>
underscore</
li>
<
li>
position among the siblings, for each subtree below &lt;body&gt;</
li>
<
li>
dash</
li>
<
li>
tag name</
li>
</
ol>
<
p>
&lt;body&gt; is indexed separately: its index skips parts 4 and 5 of the above sequence (the second separator and the sibling position).</
p>
<
p>
TODO: possibly, make it skip the entire file upon text/@rend="noindex" or the individual node and its children when */@rend="noindex" is encountered.</
p>
<
p>
There might need to be two versions, one with indenting, and one with indent set to "no"... (compare the segmentation files); alternatively, indenting may be switched off for good.</
p>
<
p>
Distributor: Open-Content Text Corpus (<
a href="
http://OCTC.sourceforge.net/">
http://OCTC.sourceforge.net/</
a>
)</
p>
</
xd:
detail>
<
xd:
author>
Piotr Bański</
xd:
author>
<
xd:
copyright>
the author(s), 2010; license: GPL v3 or any later version (http://www.gnu.org/licenses/gpl.html).</
xd:
copyright>
<
xd:
svnId>
$Id: indexer.xsl 426 2010-12-19 02:46:42Z bansp $</
xd:
svnId>
</
xd:
doc>
<
xd:
doc>
There is no need to count from the top of the tree; &lt;body&gt; is enough. Note
that, somewhat kludgily, &lt;body&gt; is explicitly indexed below as well. </
xd:
doc>
<
xsl:
variable name="
index_root"
select="
/teiCorpus/teiCorpus/TEI/text/body"
as="
item()" />
<
xsl:
variable name="
index_root_depth"
select="
count($index_root/ancestor::*)"
as="
xs:integer" />
<
xd:
doc>
Not implemented yet: this variable may eventually hold the (bare) names of non-indexed elements (gap, hi, q?); for now, the indexed elements are listed positively in the match template below.</
xd:
doc>
<
xsl:
variable name="
excepted_tags"
select="
()"
as="
xs:string*" />
<
xd:
doc>
Note that this assumes that you only index stuff under lg/ or align/.</
xd:
doc>
<
xsl:
variable name="
iso_id"
as="
xs:string">
<
xsl:
variable name="
lg"
select="
substring-after(document-uri(/),'/lg/')" />
<
xsl:
value-of select="
if (string-length($lg)) then substring-before($lg,'/') else substring-before(substring-after(document-uri(/),'/align/'),'/')" />
</
xsl:
variable>
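<!-- Illustration only (the URIs below are hypothetical, not taken from the corpus):
     for a document at file:/data/octc/lg/pol/sample/text.xml, the text after '/lg/' is
     'pol/sample/text.xml', so $iso_id evaluates to 'pol'.
     For a document at file:/data/octc/align/xx/align.xml there is no '/lg/' segment,
     $lg is the empty string, and the fallback branch yields whatever precedes the
     first '/' after '/align/' (here 'xx'). -->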
<
xd:
doc>
A lookup-table for file types; it will be externalized at some point and only referenced from here. Never use dashes here because they separate modifiers from file names ("morph-2.xml", etc.).</
xd:
doc>
<
xsl:
variable name="
file_types"
as="
item()+">
<
file type="
text"
fname="
text"
idx="
txt" />
<
file type="
segmentation"
fname="
ana_segm"
idx="
sgm" />
<
file type="
sentence_boundaries"
fname="
ana_sent"
idx="
snt" />
<
file type="
morphosyntax"
fname="
ana_morph"
idx="
mph" />
<
file type="
alignment"
fname="
align"
idx="
aln" />
</
xsl:
variable>
<
xd:
doc>
Check the name (and thus the type) of the file being processed; recall that a dash MUST ONLY separate file modifiers from the base name. Additional assumption: file names end in '.xml'.</
xd:
doc>
<
xsl:
variable name="
my_fname"
as="
xs:string">
<
xsl:
variable name="
full_name"
select="
tokenize(document-uri(/),'/')[last()]" />
<
xsl:
variable name="
modified_name"
select="
substring-before($full_name,'-')" />
<
xsl:
value-of select="
if (string-length($modified_name)) then $modified_name else substring-before($full_name,'.xml')" />
</
xsl:
variable>
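<!-- Illustration only (hypothetical file names):
     'text.xml' contains no dash, so $modified_name is empty and $my_fname is 'text';
     'ana_morph-2.xml' is cut at the first dash, so $my_fname is 'ana_morph'.
     Either value can then be matched against @fname in $file_types
     ('text' maps to idx 'txt', 'ana_morph' to idx 'mph'). -->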
<
xd:
doc>
Go through the designated elements and (re)index them.</
xd:
doc>
<
xsl:
template match="
div | head | p | ab | item | list | hi | q | linkGrp | ptr | seg | s">
<
xsl:
copy>
<
xsl:
attribute name="
xml:id"
select="
f:create_index(.)" />
<
xsl:
apply-templates select="
node()|@*" />
</
xsl:
copy>
</
xsl:
template>
<
xd:
doc>
Index the body as well... this actually circumvents the $index_root and should
be done in the general template (above) by matching against that variable, but I don't
want to spuriously check N times for whether something happens (not) to be the $index_root.</
xd:
doc>
<
xsl:
template match="
body">
<
xsl:
copy>
<
xsl:
attribute name="
xml:id"
select="
concat($iso_id,'_',$file_types[@fname = $my_fname]/@idx,'-',local-name(.))" />
<
xsl:
apply-templates select="
node()|@*" />
</
xsl:
copy>
</
xsl:
template>
<
xd:
doc>
Don't copy the old xml:id, if it is in the subtree we are interested in
(otherwise the new one would be overwritten).</
xd:
doc>
<
xsl:
template match="
@xml:id">
<
xsl:
if test="
not(exists(./ancestor::*[. is $index_root]))">
<
xsl:
copy>
<
xsl:
apply-templates select="
." />
</
xsl:
copy>
</
xsl:
if>
</
xsl:
template>
<
xd:
doc>
Create the dot-separated string for the element's position in the tree rooted in &lt;body&gt;.</
xd:
doc>
<
xsl:
function name="
f:calc_pos"
as="
xs:string">
<
xsl:
param name="
my_node"
as="
item()" />
<
xsl:
variable name="
my_depth"
select="
count($my_node/ancestor::*)-$index_root_depth"
as="
xs:integer" />
<
xsl:
variable name="
my_pos"
select="
count($my_node/preceding-sibling::*)+1"
as="
xs:integer" />
<
xsl:
choose>
<
xsl:
when test="
$my_depth eq 1">
<
xsl:
value-of select="
string($my_pos)" />
</
xsl:
when>
<
xsl:
otherwise>
<
xsl:
value-of select="
concat(f:calc_pos($my_node/parent::*),'.',string($my_pos))" />
</
xsl:
otherwise>
</
xsl:
choose>
</
xsl:
function>
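<!-- Illustration only, assuming a hypothetical body with this element structure:
       body
         div          f:calc_pos gives '1'
           head       '1.1'
           p          '1.2'
         div          '2'
           p          '2.1'
     Positions are counted over all preceding element siblings, whatever their names,
     so an element that is not itself indexed still occupies a slot in the numbering. -->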
<
xsl:
function name="
f:create_index"
as="
xs:string">
<
xsl:
param name="
node"
as="
item()" />
<
xsl:
value-of select="
concat($iso_id,'_',$file_types[@fname = $my_fname]/@idx,'_',f:calc_pos($node),'-',local-name($node))" />
</
xsl:
function>
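<!-- Illustration only: with the hypothetical values $iso_id = 'pol' and $my_fname = 'text',
     a p element whose f:calc_pos value is '1.2' would receive xml:id="pol_txt_1.2-p",
     while the body template would produce xml:id="pol_txt-body". -->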
</
xsl:
stylesheet>