Parsers in Raptor (syntax to triples)

Introduction

This section describes the parsers that can be compiled into Raptor and their features. The exact set of parsers supported may vary between builds of Raptor and can be queried at run-time with the raptor_parsers_enumerate and raptor_syntaxes_enumerate functions. The optional features that may be set on parsers can also be queried at run-time with the raptor_features_enumerate function.
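The run-time query described above is typically a loop over an increasing counter that stops at the first non-0 return. The following is a minimal sketch of that loop only. It assumes the enumerate call takes a counter plus output pointers for the name and label. To keep the sketch buildable without Raptor installed, demo_parsers_enumerate() is a hypothetical stand-in with a fixed three-entry table, and demo_parser_available() is likewise a name invented for illustration. Check raptor.h for the real prototypes before relying on this shape.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in, NOT part of Raptor: mimics the calling shape
 * assumed for raptor_parsers_enumerate(). Fills in the name and label of
 * entry `counter` and returns non-0 once counter is out of range. */
int
demo_parsers_enumerate(unsigned int counter,
                       const char **name, const char **label)
{
  static const char * const names[]  = { "rdfxml", "ntriples", "turtle" };
  static const char * const labels[] = { "RDF/XML", "N-Triples", "Turtle" };

  if(counter >= sizeof(names) / sizeof(names[0]))
    return 1;

  if(name)
    *name = names[counter];
  if(label)
    *label = labels[counter];
  return 0;
}

/* Walks the enumeration until it reports "out of range".
 * Returns 1 if a parser called `wanted` is listed, 0 otherwise. */
int
demo_parser_available(const char *wanted)
{
  unsigned int i;
  const char *name = NULL;

  for(i = 0; !demo_parsers_enumerate(i, &name, NULL); i++) {
    if(!strcmp(name, wanted))
      return 1;
  }
  return 0;
}
```

With the real library the same loop would call the Raptor function in place of the stand-in, so an application can check for an optional parser such as grddl before asking for it by name.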

GRDDL parser (name <literal>grddl</literal>)

A parser for Gleaning Resource Descriptions from Dialects of Languages (GRDDL), W3C Proposed Recommendation of 2007-07-16. GRDDL allows reading XHTML and XML as RDF triples by using profiles in the document that declare XSLT transforms from the XHTML or XML content into RDF/XML or another RDF syntax, which can then be parsed.

The GRDDL parser is rather complex and differs from the other parsers in that it retrieves URIs, reads HTML documents (possibly with errors), transforms the documents with XSLT and turns the result into a single graph. The default configuration of the GRDDL parser also reads microformats (hcard, hcalendar) and follows <link> tags that point to RDF/XML. Parts of the GRDDL process can be altered by configuration, as described in the following paragraphs.

The URIs that are processed during GRDDL operations can be checked and skipped if required using a handler set with the raptor_parser_set_uri_filter() function. If the handler returns non-0, the URI is rejected. This uses raptor_www_set_uri_filter() internally.

If the value of feature RAPTOR_FEATURE_WWW_TIMEOUT is set to a number >0, it is used as the timeout in seconds for retrieving URIs during GRDDL processing. This uses raptor_www_set_connection_timeout() internally.

The hardcoded support for the hcard and hcalendar microformats can be disabled by setting parser feature RAPTOR_FEATURE_MICROFORMATS to 0 or by using raptor_set_parser_strict() with a value of 1.

By default the GRDDL parser tries an XML parser on the content, followed by a lax HTML parser. The lax HTML fallback can be disabled by setting parser feature RAPTOR_FEATURE_HTML_TAG_SOUP to 0 or by using raptor_set_parser_strict() with a value of 1.

By default the GRDDL parser looks for an HTML <link> tag that points to RDF/XML. This can be disabled by setting parser feature RAPTOR_FEATURE_HTML_LINK to 0 or by using raptor_set_parser_strict() with a value of 1.
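The URI filter convention described in this section (return 0 to accept, non-0 to reject) can be sketched as a small predicate. This is an illustration under stated assumptions, not Raptor's handler type. The function takes the URI as a plain C string so that it compiles without Raptor, and the uri_policy structure and the name filter_uri_string() are hypothetical. The exact signature expected by raptor_parser_set_uri_filter() should be taken from raptor.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical policy data, passed to the filter through user_data. */
typedef struct {
  const char *allowed_prefix;   /* for example "http://example.org/" */
} uri_policy;

/* Returns 0 to accept the URI and non-0 to reject it, following the
 * convention described for GRDDL URI filters. The URI is a plain string
 * here (an assumption made so the sketch needs no Raptor headers). */
int
filter_uri_string(void *user_data, const char *uri_string)
{
  const uri_policy *policy = (const uri_policy *)user_data;
  size_t prefix_len;

  /* Reject when there is no policy or no URI to compare. */
  if(!policy || !policy->allowed_prefix || !uri_string)
    return 1;

  /* Accept only URIs that start with the allowed prefix. */
  prefix_len = strlen(policy->allowed_prefix);
  return strncmp(uri_string, policy->allowed_prefix, prefix_len) != 0;
}
```

Rejecting by default when no policy is supplied is a design choice of this sketch. It keeps a misconfigured application from retrieving arbitrary URIs during GRDDL processing.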

Guess parser (name <literal>guess</literal>)

This is a special parser that picks the actual parser to use based on the content type, the content bytes or the content identifier. The content identifier can be either a local file name or a URI.

If the protocol that delivered the content (such as HTTP) provided a Content Type (aka MIME Type), this is the primary means of identifying the content. The secondary means of identifying the content is the bytes of the content (if available); otherwise the content identifier is used, which is the least reliable.
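The priority order described above (content type first, then content bytes, then the identifier) can be sketched as a simple chain of checks. This is not Raptor's implementation: the function names are hypothetical, the MIME types, byte patterns and file suffixes are a small illustrative table chosen for this sketch, and falling through to the identifier when the bytes are not recognised is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first len bytes of buf contain the string needle. */
int
sketch_contains(const unsigned char *buf, size_t len, const char *needle)
{
  size_t n = strlen(needle);
  size_t i;

  if(!buf || n == 0 || len < n)
    return 0;
  for(i = 0; i + n <= len; i++) {
    if(!memcmp(buf + i, needle, n))
      return 1;
  }
  return 0;
}

/* Returns 1 if s is non-NULL and ends with suffix. */
int
sketch_has_suffix(const char *s, const char *suffix)
{
  size_t s_len, suffix_len;

  if(!s)
    return 0;
  s_len = strlen(s);
  suffix_len = strlen(suffix);
  return s_len >= suffix_len && !strcmp(s + s_len - suffix_len, suffix);
}

/* Returns a parser name, or NULL if nothing matched. */
const char *
sketch_guess_parser(const char *mime_type,
                    const unsigned char *bytes, size_t len,
                    const char *identifier)
{
  /* 1. Content type supplied by the protocol: the primary signal. */
  if(mime_type) {
    if(!strcmp(mime_type, "application/rdf+xml"))
      return "rdfxml";
    if(!strcmp(mime_type, "text/turtle"))
      return "turtle";
  }

  /* 2. Content bytes, when available. */
  if(bytes && len) {
    if(sketch_contains(bytes, len, "<rdf:RDF"))
      return "rdfxml";
    if(sketch_contains(bytes, len, "@prefix"))
      return "turtle";
  }

  /* 3. Content identifier (file name or URI): the least reliable. */
  if(sketch_has_suffix(identifier, ".rdf"))
    return "rdfxml";
  if(sketch_has_suffix(identifier, ".ttl"))
    return "turtle";
  if(sketch_has_suffix(identifier, ".nt"))
    return "ntriples";

  return NULL;
}
```

Because the content type is checked first, a document served as application/rdf+xml is treated as RDF/XML even if its identifier ends in .ttl, which mirrors the reliability ordering given in the text.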

N-Triples parser (name <literal>ntriples</literal>)

A parser for the N-Triples syntax as used by the W3C RDF Core working group for the RDF Test Cases.

RDFa parser (name <literal>rdfa</literal>)

A parser for the RDFa syntax, W3C Candidate Recommendation of 20 June 2008, which allows reading XHTML and XML as RDF triples by interpreting attributes on elements to determine which ones have RDF semantics. This is implemented via librdfa, which is linked inside Raptor, was written by Manu Sporny of Digital Bazaar, and is licensed under the same license as Raptor.

This parser is beta quality and passes all but 4 of the RDFa tests as of Raptor 1.4.18.

RDF/XML parser - default (name <literal>rdfxml</literal>)

A parser for the standard RDF/XML syntax as revised by the W3C RDF Core working group. This is the default parser in Raptor.

Features of this parser: it fully handles the RDF/XML syntax updates for XML Base, xml:lang, RDF datatyping and Collections; handles all RDF vocabularies such as FOAF, RSS 1.0, Dublin Core, OWL and DOAP; handles rdf:resource / resource attributes; and uses the expat and/or (GNOME) libxml XML parsers as available or required.

RSS Tag Soup parser (name <literal>rss-tag-soup</literal>)

A parser for the multiple XML RSS formats that use elements such as channel, item, title and description in different ways. This includes support for the Atom 1.0 syndication format defined in IETF RFC 4287.

The parser attempts to turn the input into RDF triples in the RSS 1.0 model of a syndication feed. This includes triples for RSS Enclosures.

True RSS 1.0, when it is to be used as a full RDF vocabulary, is best parsed by the RDF/XML parser (name rdfxml).

TriG parser (name <literal>trig</literal>)

A parser for the TriG (Turtle with Named Graphs) syntax.

The parser is alpha quality and may not support the entire TriG specification.

Turtle Terse RDF Triple Language parser (name <literal>turtle</literal>)

A parser for the Turtle Terse RDF Triple Language syntax, designed as a useful subset of Notation 3.