nmtoken support for UTF-8 characters

nk4um User
Posts: 111
May 18, 2010 18:44
Yes I understand your point of view and I will use a custom regex.

But I guess you probably don't want to implement the new "unicode-nmtoken" type using:
<regex>[^\p{Punct}]*</regex>

either, because it wouldn't be consistent with "nmtoken", which accepts hyphens:

http://www.w3.org/TR/REC-xml/#NT-Nmtoken

The specification at the URL above defines an Nmtoken as:

(":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040])+


Unfortunately the # notation is not explained there, and I don't know whether these ranges cover the UTF-8 characters or not (I guess they don't; otherwise it would be a bug in the current NetKernel implementation).
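
On the question of what the #x ranges mean: in the XML specification's notation, #xN denotes the character whose Unicode code point is N in hexadecimal, so the ranges do not depend on a byte encoding such as UTF-8. As a rough check, the sketch below transcribes the production quoted above into a java.util.regex character class and tests two town names against it. The class name XmlNmtokenSketch and the helper isNmtoken are hypothetical, and nothing here says how NetKernel's built-in nmtoken type is actually implemented.

```java
import java.util.regex.Pattern;

class XmlNmtokenSketch {
    // Character class transcribed from the Nmtoken production quoted above.
    // Each #xN range in the production is written as a regex code point escape.
    static final String NAME_CHAR =
        "[:A-Z_a-z"
        + "\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02FF"
        + "\\u0370-\\u037D\\u037F-\\u1FFF\\u200C-\\u200D"
        + "\\u2070-\\u218F\\u2C00-\\u2FEF\\u3001-\\uD7FF"
        + "\\uF900-\\uFDCF\\uFDF0-\\uFFFD\\x{10000}-\\x{EFFFF}"
        + "\\-.0-9\\u00B7\\u0300-\\u036F\\u203F-\\u2040]";

    // An Nmtoken is one or more such characters.
    static final Pattern NMTOKEN = Pattern.compile(NAME_CHAR + "+");

    static boolean isNmtoken(String s) {
        return NMTOKEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String accented = "S\u00e8t";                    // "Sèt", with e-grave (U+00E8)
        System.out.println(isNmtoken(accented));         // true
        System.out.println(isNmtoken("Saint-Malo"));     // true
        System.out.println(isNmtoken("a/b"));            // false, "/" is not a name character
    }
}
```

Since è is U+00E8, it falls inside the range [#xD8-#xF6], so the XML production itself accepts it along with the hyphen.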

Thanks for your help!
Gregoire
nk4um Moderator
Posts: 485
May 18, 2010 18:19
It wouldn't be dangerous per se, but I wouldn't be happy to create a built-in type with that loose a definition. It might be confusing if it was used in different situations. I think the best solution, then, is just to use a custom regex that works for you.

Cheers,
Tony
nk4um User
Posts: 111
May 18, 2010 17:40
Would it be dangerous to allow everything but "/"?

<regex>[^/]*</regex>
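
To make "everything but /" concrete, here is a small sketch of what such a pattern accepts, assuming the grammar's regex is evaluated with java.util.regex semantics. The class name AnythingButSlash and the helper accepts are hypothetical.

```java
import java.util.regex.Pattern;

class AnythingButSlash {
    // Zero or more characters, none of which is a forward slash.
    static final Pattern SEGMENT = Pattern.compile("[^/]*");

    static boolean accepts(String s) {
        return SEGMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("Saint-Malo"));   // true
        System.out.println(accepts("S\u00e8t"));     // true, accented characters pass
        System.out.println(accepts(""));             // true, the empty string also passes
        System.out.println(accepts("a b?c#d"));      // true, spaces and other punctuation pass
        System.out.println(accepts("a/b"));          // false
    }
}
```

The pattern is very permissive: it also accepts an empty segment and characters such as "?" or "#". Using [^/]+ instead would at least require a non-empty name.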


Thanks!
nk4um User
Posts: 111
May 18, 2010 17:28
A second test showed that towns with a hyphen, like "Saint-Malo", are no longer recognized. I guess that \p{Punct} treats hyphens as punctuation?
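
In java.util.regex, \p{Punct} is the POSIX punctuation class, and its list of characters includes the hyphen. The sketch below checks that, assuming the grammar uses Java regex semantics. The second pattern, which puts the hyphen back, is only one possible workaround and is not something proposed in this thread; the class and field names are hypothetical.

```java
import java.util.regex.Pattern;

class PunctHyphenSketch {
    // Everything except ASCII punctuation (the regex tried in this thread).
    static final Pattern NOT_PUNCT = Pattern.compile("[^\\p{Punct}]*");

    // Hypothetical variant: as above, but a hyphen is allowed as well.
    static final Pattern NOT_PUNCT_OR_HYPHEN = Pattern.compile("(?:[^\\p{Punct}]|-)*");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("\\p{Punct}", "-"));           // true, hyphen is punctuation
        System.out.println(matches(NOT_PUNCT, "S\u00e8t"));               // true
        System.out.println(matches(NOT_PUNCT, "Saint-Malo"));             // false
        System.out.println(matches(NOT_PUNCT_OR_HYPHEN, "Saint-Malo"));   // true
        System.out.println(matches(NOT_PUNCT_OR_HYPHEN, "a/b"));          // false, "/" is still excluded
    }
}
```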
nk4um Moderator
Posts: 485
May 18, 2010 17:00
Escaping!
Yep, that is what I meant. If I put a double backslash it gets rendered properly:

<regex>\\p{Punct}</regex> -->

<regex>\p{Punct}</regex>

OK, I'll put a regex type into BNF so you can do the following:

<regex type="unicode-nmtoken" />

Cheers, Tony
nk4um User
Posts: 111
May 18, 2010 16:40
Hi Tony,

With a backslash before the p, it works! Thank you!
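
For anyone wondering why the backslash matters: without it, the brackets hold a plain list of literal characters rather than the punctuation class. A minimal sketch, assuming Java regex semantics; the class name is hypothetical and "Lille" is only an illustrative extra town name.

```java
import java.util.regex.Pattern;

class MissingBackslashSketch {
    // Without the backslash: excludes only the literal characters p { P u n c t }
    static final Pattern WITHOUT_BACKSLASH = Pattern.compile("[^p{Punct}]*");

    // With the backslash: excludes the ASCII punctuation class.
    static final Pattern WITH_BACKSLASH = Pattern.compile("[^\\p{Punct}]*");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String town = "S\u00e8t";                                      // "Sèt"
        System.out.println(matches(WITHOUT_BACKSLASH, town));          // false, it contains "t"
        System.out.println(matches(WITH_BACKSLASH, town));             // true
        System.out.println(matches(WITHOUT_BACKSLASH, "Lille"));       // true, none of its letters are excluded
    }
}
```

So the first version rejected or accepted names depending on which letters they happened to contain, not on whether they contained punctuation.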

Gregoire
nk4um Moderator
Posts: 485
May 18, 2010 16:19
I'm glad you are pushing on these capabilities. It's interesting: I've looked through the regex documentation and I can't see that non-ASCII alphanumerics are clearly defined. :-( The best I can see is to say anything that isn't punctuation.

Give this a try in your grammar:
<regex>[^p{Punct}]*</regex>

This will match anything that isn't punctuation. If this works for you, we can add it as a standard regex type.

Cheers, Tony
nk4um User
Posts: 111
May 18, 2010 15:51
nmtoken support for UTF-8 characters
Hi,

It would be useful to have a regex type just like nmtoken, but one that supports UTF-8 characters. It would let us write grammars like this one:

<grammar>res:/services/towns/
  <group name="townName">
    <regex type="nmtoken" />
  </group>
  <optional> /
    <group name="indiceMin">
      <regex type="integer" />
    </group> -
    <group name="indiceMax">
      <regex type="integer" />
    </group>
  </optional>
</grammar>


This grammar can recognize URLs like:
res:/services/towns/Par/1-10

However, it seems impossible to write a grammar that recognizes the "townName" argument when it contains accented characters, as in:
res:/services/towns/Sèt/1-10

So my suggestion is to add a new type that would accept anything up to the next "/" character.
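
As a rough illustration of this suggestion, the sketch below expresses the same URL shape as a single java.util.regex pattern with named groups, using "anything up to the next /" for the town name. It is not NetKernel grammar code: the class name TownUrlSketch and the helper townName are hypothetical, and the "integer" type is assumed to mean one or more digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TownUrlSketch {
    // townName: anything up to the next "/"; the index range is optional.
    static final Pattern TOWN_URL = Pattern.compile(
        "res:/services/towns/(?<townName>[^/]*)"
        + "(?:/(?<indiceMin>\\d+)-(?<indiceMax>\\d+))?");

    // Returns the captured town name, or null when the URL does not match.
    static String townName(String url) {
        Matcher m = TOWN_URL.matcher(url);
        return m.matches() ? m.group("townName") : null;
    }

    public static void main(String[] args) {
        System.out.println(townName("res:/services/towns/Par/1-10"));        // Par
        System.out.println(townName("res:/services/towns/S\u00e8t/1-10"));   // the accented name "Sèt"
        System.out.println(townName("res:/services/towns/Saint-Malo"));      // Saint-Malo
        System.out.println(townName("res:/services/towns/a/b/1-10"));        // null
    }
}
```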

Grégoire