Collation Sequences in SPARQL

by James

The SPARQL query language says relatively little about how to order strings. When we were asked a while back what order to expect for a solution sequence that contains string literals with language tags, we had only the conservative answer: the relation between simple or string literals and plain literals was undefined. This was not a nice situation.

RDF 1.1 ratifies the type rdf:langString, but it defines no relation beyond equality. That leaves plfn:compare to apply, which requires some context from which the collation sequence can be determined. This situation is not quite as unpleasant, but still not satisfactory. Fortunately ‘undefined’ leaves latitude for improvement, by definition.

The obvious improvement is to exercise the option to extend value comparison: define an extended form of plfn:compare which applies to the string values of plain literal terms according to a collation sequence derived from the query run-time context.

When the enquiry continued with the question of how one would define the context from which the collation sequence is chosen, we once again had no recommended answer. Even if we were to add a collation environment to the query run-time settings, it would still not suffice for variant language tags. This convinced us that the most purposeful approach was simply to observe the respective language tag.

The consequent logic is:

  • if neither term has a language tag, the comparison is according to Unicode code points.
  • if the two terms do not agree in language tag, either because just one lacks a tag or because the tags are not identical, then the two terms are not ordered.
  • if both terms share the identical language tag (allowing for canonicalization), then the effective collation sequence is that for the respective indicated language.
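
The three cases above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the implementation behind the query processor. The names PlainLiteral, canonical_tag, compare_literals, and key_for_language are hypothetical, canonicalization is assumed to be nothing more than lower-casing the tag, and the per-language collation is left as a hook which falls back to code point order when none is supplied.

```python
from typing import Callable, NamedTuple, Optional


class PlainLiteral(NamedTuple):
    """Hypothetical stand-in for a literal term: a string plus an optional language tag."""
    value: str
    lang: Optional[str] = None


def canonical_tag(tag: Optional[str]) -> Optional[str]:
    # Assumption: canonicalization is just case folding of the tag.
    return tag.lower() if tag else None


def compare_literals(a, b, key_for_language: Optional[Callable] = None):
    """Return -1, 0 or 1 when the terms are ordered, None when they are not."""
    tag_a, tag_b = canonical_tag(a.lang), canonical_tag(b.lang)
    if tag_a is None and tag_b is None:
        # Neither term has a tag: Python compares str values by code point.
        return (a.value > b.value) - (a.value < b.value)
    if tag_a != tag_b:
        # One tag is missing or the tags differ: the terms are not ordered.
        return None
    # Same tag: use the collation key for that language.
    # key_for_language is a hypothetical hook; without it we fall back
    # to code point order, which is NOT a real language collation.
    key = key_for_language(tag_a) if key_for_language else (lambda s: s)
    key_a, key_b = key(a.value), key(b.value)
    return (key_a > key_b) - (key_a < key_b)


print(compare_literals(PlainLiteral("abc"), PlainLiteral("abd")))
print(compare_literals(PlainLiteral("Paris", "fr"), PlainLiteral("Париж", "ru")))
print(compare_literals(PlainLiteral("Lyon", "FR"), PlainLiteral("Paris", "fr")))
```

Returning None for the unordered case keeps it distinct from equality, so a caller has to decide explicitly where such pairs land in a solution sequence.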

In order to determine the effective collation sequence we extract the initial ISO 639-1 code from the language tag and use it to designate a locale, which in turn determines the collation.
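
As a rough sketch of that extraction step, the following Python function takes the leading subtag of a language tag and accepts it only if it looks like a two-letter ISO 639-1 code. The function name primary_language is hypothetical, and the sketch assumes that subtags are separated by hyphens and that case is not significant. It does not check the code against the ISO 639-1 registry, and the mapping from code to locale and collation is left to whatever facility the host platform provides.

```python
from typing import Optional


def primary_language(tag: str) -> Optional[str]:
    """Return the leading two-letter subtag of a language tag, or None."""
    # Assumption: subtags are separated by "-" and case is not significant.
    first = tag.split("-", 1)[0].lower()
    # ISO 639-1 codes are exactly two ASCII letters; anything else is rejected.
    if len(first) == 2 and first.isascii() and first.isalpha():
        return first
    return None


for tag in ("fr", "fr-CA", "RU", "x-private"):
    print(tag, "->", primary_language(tag))
```

Tags whose first subtag is not a two-letter code yield None here, so a caller would need its own policy for them, for example treating them as having no known collation.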

The benefit of this approach is that, without any configuration or declarations, it is possible to order collections of terms with varied language tags according to the order which the data itself specifies. For example, DBpedia demonstrates distinct ordering in French, Russian, or any of the other dozen available languages:

select *
where {
  ?city <> ?labelRussian .
  ?city <> ?labelFrench .
  filter ( lang(?labelRussian) = 'ru' && lang(?labelFrench) = 'fr')
} order by ?labelFrench