Parsing syntaxes to RDF Triples

Introduction

The typical sequence of operations to parse is to create a parser object, set various callbacks and features, start the parsing, send some syntax content to the parser object, finish the parsing and destroy the parser object. Several parts of this process are optional, including actually using the triple results; parsing without using them is useful as a syntax check.

Create the Parser object

The parser can be created directly from a known name, such as rdfxml for the W3C Recommendation RDF/XML syntax:

    raptor_parser* rdf_parser;
    rdf_parser = raptor_new_parser("rdfxml");

or the name can be discovered from an enumeration, as discussed in Querying Functionality.

The parser can also be created by identifying the syntax by a URI, specifying the syntax by a MIME Type, providing an identifier for the content such as a filename or URI string, or giving some initial content bytes that can be used to guess. Using the raptor_new_parser_for_content() function, all of these can be given as optional parameters, using NULL or 0 for undefined parameters. The constructor will then use as much of this information as possible.

    raptor_parser* rdf_parser;
    raptor_uri* syntax_uri;

Create a parser that reads the MIME Type for RDF/XML, application/rdf+xml:

    rdf_parser = raptor_new_parser_for_content(NULL, "application/rdf+xml", NULL, 0, NULL);

Create a parser that can read a syntax identified by the URI for Turtle, http://www.dajobe.org/2004/01/turtle/, which has no registered MIME Type at this date:

    syntax_uri = raptor_new_uri("http://www.dajobe.org/2004/01/turtle/");
    rdf_parser = raptor_new_parser_for_content(syntax_uri, NULL, NULL, 0, NULL);

Create a parser that recognises the identifier foo.rss:

    rdf_parser = raptor_new_parser_for_content(NULL, NULL, NULL, 0, "foo.rss");

Create a parser that recognises the content in buffer:

    rdf_parser = raptor_new_parser_for_content(NULL, NULL, buffer, len, NULL);

Any of the constructor calls can return NULL if no matching parser could be found, or if the construction failed in another way.

Parser features

There are several options that can be set on parsers, called features. The exact list of features can be found via Querying Functionality or in the API reference for raptor_set_feature(). (This should properly be called raptor_parser_set_feature(), as it only applies to raptor_parser objects.) Features are integer enumerations of the raptor_feature enum and have values that are either integers (often acting as booleans) or strings. The two functions that set features are:

    /* Set an integer (or boolean) valued feature */
    raptor_set_feature(rdf_parser, feature, 1);

    /* Set a string valued feature */
    raptor_set_feature_string(rdf_parser, feature, "abc");

There are also two corresponding functions for reading the values of parser features: raptor_get_feature() and raptor_get_feature_string() take the feature enumeration parameter and return the integer or string value respectively.

Set RDF triple callback handler

The main reason to parse a syntax is to get RDF triples returned. This is done by a callback function that is called with a user data pointer and the triple itself. The handler is set with raptor_set_statement_handler() as follows:

    void triples_handler(void* user_data, const raptor_statement* triple) {
      /* do something with the triple */
    }

    raptor_set_statement_handler(rdf_parser, user_data, triples_handler);

Setting a handler function for triples is optional; parsing without one still has some uses, such as just counting triples or validating a syntax.

Set fatal error, error and warning handlers

There are several other callback handlers that can be set on parsers. These can be set at any time before parsing starts. Errors and warnings from parsing are reported through handlers that all have the type raptor_message_handler and the signature:

    void message_handler(void *user_data, raptor_locator* locator, const char *message) {
      /* do something with the message */
    }

The handler receives the user data given when it was set, the associated location information as a raptor_locator and the error/warning message itself. The locator structure contains full information on the details of where in the file or URI the message occurred.

The fatal error, error and warning handlers are all set with similar functions that take a handler as follows:

    raptor_set_fatal_error_handler(rdf_parser, user_data, fatal_handler);
    raptor_set_error_handler(rdf_parser, user_data, error_handler);
    raptor_set_warning_handler(rdf_parser, user_data, warning_handler);

The program will terminate with abort() if the fatal error handler returns.

Set the identifier creator handler

Identifiers are created in some parsers by generating them automatically or via hints given in the syntax. Raptor can customise this process using a user-supplied identifier handler function. For example, in RDF/XML, generated blank node identifiers and those specified with rdf:nodeID are passed through this process. Setting a handler allows the identifier generation mechanism to be fully replaced. A lighter alternative is to use raptor_set_default_generate_id_parameters() to adjust the default algorithm for generated identifiers.

The handler is set as follows:

    raptor_generate_id_handler id_handler;
    raptor_set_generate_id_handler(rdf_parser, user_data, id_handler);

The id_handler has the following signature:

    unsigned char* generate_id_handler(void* user_data, raptor_genid_type type, unsigned char* user_id) {
      /* return a new generated ID based on user_id (optional) */
    }

where the raptor_genid_type provides extra information on the identifier being created and user_id is an optional user-supplied identifier, such as the value of an rdf:nodeID in RDF/XML.
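The body of such a handler might look like the following minimal sketch. It is written against stand-in types so that it compiles without the Raptor headers: sketch_genid_type, sketch_id_state and sketch_generate_id are hypothetical names, not part of the library. The memory ownership shown (a user-supplied identifier is handed back unchanged, a generated one is allocated with malloc()) is an assumption that should be checked against the API reference.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for raptor_genid_type, so that the sketch
 * compiles without the library headers. */
typedef enum { SKETCH_GENID_BNODEID, SKETCH_GENID_BAGID } sketch_genid_type;

/* Hypothetical application state passed as user_data. */
typedef struct {
  const char* prefix;  /* prepended to every generated identifier */
  unsigned long next;  /* number used for the next generated identifier */
} sketch_id_state;

/* Same parameter shape as the handler signature shown above. */
unsigned char*
sketch_generate_id(void* user_data, sketch_genid_type type,
                   unsigned char* user_id)
{
  sketch_id_state* state = (sketch_id_state*)user_data;
  size_t size;
  char* id;

  assert(state != NULL && state->prefix != NULL);
  (void)type; /* this sketch treats every identifier type alike */

  /* Assumption: a user-supplied identifier is handed back unchanged. */
  if(user_id)
    return user_id;

  /* prefix + up to 20 decimal digits + terminating NUL */
  size = strlen(state->prefix) + 21;
  id = (char*)malloc(size);
  if(!id)
    return NULL;

  snprintf(id, size, "%s%lu", state->prefix, state->next++);
  return (unsigned char*)id;
}
```

With the real library, the function would take a raptor_genid_type argument and be registered with raptor_set_generate_id_handler(), passing a pointer to the state structure as user_data.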

Set namespace declared handler

Raptor can report when namespace prefixes/URIs are declared during parsing of a syntax, such as those in XML, RDF/XML or Turtle. A handler function can be set to receive these declarations using the namespace handler method:

    raptor_namespace_handler namespaces_handler;
    raptor_set_namespace_handler(rdf_parser, user_data, namespaces_handler);

The namespaces_handler has the following signature:

    void namespaces_handler(void* user_data, raptor_namespace *nspace) {
      /* do something with the namespace declaration */
    }

This may be called multiple times with the same namespace, if the namespace is declared inside different XML sub-trees.

Set the parsing strictness

raptor_set_parser_strict() sets the parser strictness flag. The default is lax parsing, which accepts older or deprecated syntax forms but may generate a warning. Setting the flag to non-0 (true) causes parser errors to be generated in these cases.

Provide syntax content to parse

The operation of turning syntax into RDF triples has several alternatives, from functions that do most of the work starting from a URI to functions that allow passing in data buffers.

Parsing and MIME Types

The MIME type of the retrieved content is not used to choose a parser unless the parser is of type guess. The guess parser will send an Accept: header for all known parser syntax MIME types (if a URI request is made) and, based on the response, including the identifiers used, pick the appropriate parser to execute. See raptor_guess_parser_name() for a full discussion of the inputs to the guessing.

Parse the content from a URI (raptor_parse_uri())

The URI is resolved and the content read from it and passed to the parser:

    raptor_parse_uri(rdf_parser, uri, base_uri);

The base_uri is optional (can be NULL) and will default to the uri.

Parse the content of a URI using an existing WWW connection (raptor_parse_uri_with_connection())

The URI is resolved using an existing WWW connection (for example a libcurl CURL handle) to allow for any existing WWW configuration to be reused. See raptor_www_new_with_connection for full details of how this works. The content is then read from the result of resolving the URI:

    raptor_parse_uri_with_connection(rdf_parser, uri, base_uri, connection);

The base_uri is optional (can be NULL) and will default to the uri.

Parse the content of a C FILE* (raptor_parse_file_stream())

Parsing can read from a C STDIO file handle:

    stream = fopen(filename, "rb");
    if(stream) {
      raptor_parse_file_stream(rdf_parser, stream, filename, base_uri);
      fclose(stream);
    }

This function can take an optional filename, which is used in locator error messages. The base_uri may be required by some parsers and if NULL will cause the parsing to fail.

Parse the content of a file URI (raptor_parse_file())

Parsing can read from a URI known to be a file: URI:

    raptor_parse_file(rdf_parser, file_uri, base_uri);

This function requires that the file_uri is a file URI, that is, raptor_uri_uri_string_is_file_uri(raptor_uri_as_string(file_uri)) must be true. The base_uri may be required by some parsers and if NULL will cause the parsing to fail.

Parse chunks of syntax content provided by the application (raptor_start_parse() and raptor_parse_chunk())

    raptor_start_parse(rdf_parser, base_uri);
    while(/* not finished getting content */) {
      unsigned char *buffer;
      size_t buffer_len;
      /* obtain some syntax content in buffer of size buffer_len bytes */
      raptor_parse_chunk(rdf_parser, buffer, buffer_len, 0);
    }
    raptor_parse_chunk(rdf_parser, NULL, 0, 1); /* no data and is_end = 1 */

The base_uri argument to raptor_start_parse() may be required by some parsers and if NULL will cause the parsing to fail.

On the last raptor_parse_chunk() call, or after the loop has ended, the is_end parameter must be set to non-0. Content can be passed with the final call. If no content is present at the end (such as in some kind of end of file situation), then a 0-length buffer_len or NULL buffer can be used.

The minimal case is an entire parse in one chunk as follows:

    raptor_start_parse(rdf_parser, base_uri);
    raptor_parse_chunk(rdf_parser, buffer, buffer_len, 1); /* is_end = 1 */
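The calling protocol above (zero or more data chunks with is_end = 0, then one final call with is_end non-0) can be sketched independently of the library. In the example below, sketch_chunk_fn stands in for raptor_parse_chunk() and sketch_tally_chunk merely records what it is given; every name starting with sketch_ is hypothetical. Treating a non-0 return value as failure is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback with the same argument order as
 * raptor_parse_chunk(): destination, buffer, length, is_end.
 * Assumption: it returns non-0 on failure. */
typedef int (*sketch_chunk_fn)(void* user_data, const unsigned char* buffer,
                               size_t len, int is_end);

/* Deliver data in pieces of at most chunk_size bytes, then signal the end. */
int
sketch_feed_chunks(const unsigned char* data, size_t len, size_t chunk_size,
                   sketch_chunk_fn fn, void* user_data)
{
  size_t offset = 0;

  assert(fn != NULL && chunk_size > 0);

  while(offset < len) {
    size_t n = len - offset;
    if(n > chunk_size)
      n = chunk_size;
    /* more content may follow, so is_end = 0 */
    if(fn(user_data, data + offset, n, 0))
      return 1;
    offset += n;
  }

  /* no data and is_end = 1 */
  return fn(user_data, NULL, 0, 1) ? 1 : 0;
}

/* Hypothetical recorder used to observe the calls made above. */
typedef struct {
  size_t calls;  /* number of callback invocations */
  size_t bytes;  /* total bytes delivered */
  int ended;     /* set once a call with is_end non-0 is seen */
} sketch_tally;

int
sketch_tally_chunk(void* user_data, const unsigned char* buffer,
                   size_t len, int is_end)
{
  sketch_tally* tally = (sketch_tally*)user_data;
  (void)buffer;
  tally->calls++;
  tally->bytes += len;
  if(is_end)
    tally->ended = 1;
  return 0;
}
```

Feeding 10 bytes with a chunk size of 4 results in three data calls followed by the terminating call. In a real program the callback body would forward its arguments to raptor_parse_chunk() on a parser that has already been started with raptor_start_parse().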

Restrict parser network access

Parsing can cause network requests to be performed, especially if a URI is given as an argument, such as with raptor_parse_uri(). There may also be indirect requests, such as with the GRDDL parser, which retrieves URIs depending on the results of initial parse requests. The application may want to forbid some of these requests or filter them, and this can be done in three ways.

Filtering parser network requests with feature RAPTOR_FEATURE_NO_NET

The parser feature RAPTOR_FEATURE_NO_NET can be set with raptor_set_feature() and forbids all network requests. There is no customisation with this approach; for that, see the URI filter in the next section.

    rdf_parser = raptor_new_parser("rdfxml");

    /* Disable internal network requests */
    raptor_set_feature(rdf_parser, RAPTOR_FEATURE_NO_NET, 1);

Filtering parser network requests with raptor_www_set_uri_filter()

The raptor_www_set_uri_filter() function allows setting of a filtering function to operate on all URIs retrieved by a WWW connection. This connection can then be used for parsing when the retrieval is driven by hand.

    void write_bytes_handler(raptor_www* www, void *user_data,
                             const void *ptr, size_t size, size_t nmemb)
    {
      raptor_parser* rdf_parser = (raptor_parser*)user_data;
      raptor_parse_chunk(rdf_parser, (unsigned char*)ptr, size*nmemb, 0);
    }

    int uri_filter(void* filter_user_data, raptor_uri* uri) {
      /* return non-0 to forbid the request */
      return 0; /* 0 permits the request */
    }

    int main(int argc, char *argv[])
    {
      ...

      rdf_parser = raptor_new_parser("rdfxml");
      www = raptor_new_www();

      /* filter all URI requests */
      raptor_www_set_uri_filter(www, uri_filter, filter_user_data);

      /* make WWW write bytes to parser */
      raptor_www_set_write_bytes_handler(www, write_bytes_handler, rdf_parser);

      raptor_start_parse(rdf_parser, uri);
      raptor_www_fetch(www, uri);

      /* tell the parser that we are done */
      raptor_parse_chunk(rdf_parser, NULL, 0, 1);

      raptor_www_free(www);
      raptor_free_parser(rdf_parser);

      ...
    }

Filtering parser network requests with raptor_parser_set_uri_filter()

The raptor_parser_set_uri_filter() function allows setting of a filtering function to operate on all URIs that the parser sees. This operates on the internal raptor_www object used inside parsing to retrieve URIs, similar to that described in the previous section.

    int uri_filter(void* filter_user_data, raptor_uri* uri) {
      /* return non-0 to forbid the request */
    }

    rdf_parser = raptor_new_parser("rdfxml");

    raptor_parser_set_uri_filter(rdf_parser, uri_filter, filter_user_data);

    /* parse content as normal */
    raptor_parse_uri(rdf_parser, uri, base_uri);
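The decision made inside a uri_filter function is left open above. One possible policy, shown here as a minimal sketch, is to permit only URIs that start with a given prefix. The helper sketch_forbid_uri_string() is hypothetical and works on plain C strings so that it compiles without the Raptor headers; it follows the same convention of returning non-0 to forbid the request.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy check on a URI string.
 * Returns non-0 to forbid the request, 0 to permit it. */
int
sketch_forbid_uri_string(const char* allowed_prefix, const char* uri_string)
{
  assert(allowed_prefix != NULL && uri_string != NULL);

  /* Permit only URIs that begin with the allowed prefix. */
  return strncmp(uri_string, allowed_prefix, strlen(allowed_prefix)) != 0;
}
```

Inside a real uri_filter, the string could be obtained with raptor_uri_as_string(uri) and the prefix passed in through filter_user_data. A plain prefix comparison is deliberately simple: it does not normalise case or resolve relative references.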

Setting timeout for parser network requests with feature RAPTOR_FEATURE_WWW_TIMEOUT

If the value of feature RAPTOR_FEATURE_WWW_TIMEOUT is set to a number greater than 0, it is used as the timeout in seconds for retrieving URIs during parsing (primarily for GRDDL). This uses raptor_www_set_connection_timeout() internally.

    rdf_parser = raptor_new_parser("grddl");

    /* set internal URI retrieval maximum time to 5 seconds */
    raptor_set_feature(rdf_parser, RAPTOR_FEATURE_WWW_TIMEOUT, 5);

Querying parser static information

These methods return information about the constructed parser implementation, corresponding to the information available via raptor_syntaxes_enumerate() for all parsers. raptor_get_name() returns the parser syntax name, raptor_get_label() the long label for the parser and raptor_get_mime_type() the primary MIME Type for the parser (there may be others that the parser will accept, but this is the main one). raptor_parser_get_accept_header() returns a string that would be sent in an HTTP request Accept: header for the syntaxes accepted by this parser only.

Querying parser run-time information

raptor_get_locator() returns the raptor_locator for the current position in the input stream. The locator structure contains full information on the details of where in the file or URI the current parser has reached.

Aborting parsing

raptor_parse_abort() allows the current parsing to be aborted, at which point no further triples will be passed to callbacks and the parser will attempt to return control to the application. This is most useful when called inside a handler function, which allows the application to decide to stop a parse that is in progress.

Destroy the parser

To tidy up, delete the parser object as follows:

    raptor_free_parser(rdf_parser);

Parsing example code

rdfprint.c: Parse an RDF/XML file and print the triples

Compile it like this:

    $ gcc -o rdfprint rdfprint.c `raptor-config --cflags` `raptor-config --libs`

and run it on an RDF file as:

    $ ./rdfprint raptor.rdf
    _:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://usefulinc.com/ns/doap#Project> .
    _:genid1 <http://usefulinc.com/ns/doap#name> "Raptor" .
    _:genid1 <http://usefulinc.com/ns/doap#homepage> <http://librdf.org/raptor/> .
    ...