drupal and xml

The best discussion of XML and drupal I've found on drupal.org is at http://drupal.org/node/29946, and I don't follow it all.

One person says that XML is not a good format for storing and working with documents. We have a ton of XML documents with highly variable data (chat reference transcripts) and when I tried importing them bit-by-bit into a flexinode, there were just too many joins for MySQL to handle (though my sense is, after some later experiences, that this might be a limitation in flexinode).

I wrote that I was using a custom XSL patch and anothermark asked what the heck I was talking about (http://drupalib.interoperating.info/node/10#comment).

To be honest, I'm not sure, but I'm happy to show and tell. Your feedback is appreciated.

1. I created a flexinode (http://drupal.org/project/flexinode) and made a simple node with just a title and a text-entry field for the XML.

2. I imported a bunch of documents with a Perl script and a lot of trial and error, making sure the right tables and fields got updated so that my site recognized the imported document as a new node.

3. I added an XSL function to core in common.inc (this is bad, right? I'm not supposed to modify the core? oh well....) I'm not sure where I borrowed this one from, but it works like a charm. The XSLT extension came with our hosting service, and I'm pretty sure it's Sablotron-based.

function xml2html($xmldata, $xsl)
{
   /* $xmldata -> your XML */
   /* $xsl -> XSLT file */

   $path = '/path/to/xsl/files';
   $arguments = array('/_xml' => $xmldata);
   $xsltproc = xslt_create();
   $html = xslt_process($xsltproc, 'arg:_xml', "$path/$xsl",NULL,$arguments);

   if (empty($html)) {
       die('XSLT processing error: '. xslt_error($xsltproc));
   }
   xslt_free($xsltproc);
   return $html;
}
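
For readers on PHP 5 or later, where the Sablotron-based xslt_* functions are no longer bundled, roughly the same transform can be sketched with the DOM and XSL extensions. This is only a sketch under assumptions: the function name xml2html_dom, the sample transcript, and the inline stylesheet are made up for illustration, and the stylesheet is passed as a string rather than read from a file as in the function above.

```php
<?php
// Sketch only: the same idea as xml2html(), using DOMDocument and
// XSLTProcessor (requires the dom and xsl extensions).
// xml2html_dom is a hypothetical name, not part of Drupal or flexinode.
function xml2html_dom($xmldata, $xsldata) {
  $xml = new DOMDocument();
  if (!$xml->loadXML($xmldata)) {
    return FALSE; // The XML document could not be parsed.
  }
  $xsl = new DOMDocument();
  if (!$xsl->loadXML($xsldata)) {
    return FALSE; // The stylesheet could not be parsed.
  }
  $proc = new XSLTProcessor();
  $proc->importStylesheet($xsl);
  return $proc->transformToXML($xml);
}

// Made-up sample data, standing in for a chat transcript and its stylesheet.
$xml = '<transcript><line who="patron">Hello</line></transcript>';
$xsl = '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
     . '<xsl:output method="html"/>'
     . '<xsl:template match="/transcript"><ul><xsl:apply-templates/></ul></xsl:template>'
     . '<xsl:template match="line"><li><xsl:value-of select="@who"/>: <xsl:value-of select="."/></li></xsl:template>'
     . '</xsl:stylesheet>';

$html = xml2html_dom($xml, $xsl);
echo $html;
```

Returning FALSE instead of calling die() leaves it to the caller to decide what a failed transform should look like on the page.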

4. I added a page template for that flexinode type so that when viewing nodes of this type, an XSLT is called to display the content.

        $output = xml2html($node->flexinode_field_no,'transcript.xsl');
        $output = str_replace(array("&gt;","&lt;","&apos;","&amp;"),array(">","<","'","&"),$output);
        echo $output;

5. I am still learning drupal slowly, so for now, the content isn't themed. This created a problem in the search indexing process, since the search indexer calls $content, and $content was the raw xml document, not what I actually wanted to index. I patched search.module in order to call the transform on nodes of this flexinode type (otherwise, the search indexed the xml document, tags and all, no good). I themed my first view last week, so I may soon be ready to nerd it up and do this right.

But is this a good solution?

* I'm definitely wary of upgrading Drupal and losing my patches. If I'm going to continue like this, it will have to be a custom module someday. If not, all the data is still in XML.

* I like working with XSL to transform the whole document or just bits and pieces of it. XSL transforms are elegant to me; token parsers and condition statements are crass and burly.

* Since all the data is in XML and is really only accessible through XSLT or a crass, burly token parser, I haven't figured out how to handle reports/views inside Drupal. Instead, I transform the XML into a tab-delimited file and export to Excel, and it almost defeats the purpose.

I would think this last part would be important for MARCXML - you are going to want to search by and create views for specific fields. But hasn't that been done?

Good set of problems...

Caleb,

3. I added an XSL function to core in common.inc (this is bad, right? I'm not supposed to modify the core? oh well....)

I agree -- I don't think hacking the core is a good idea. You have a number of options (I may be corrected if any of these don't actually work but they should): 1) you could create a module, 2) add this function to your theme, 3) create a PHP snippet from your function that transforms the XML, or 4) create an input format that transforms XML to HTML using XSLT.

This created a problem in the search indexing process, since the search indexer calls $content, and $content was the raw xml document, not what I actually wanted to index. I patched search.module in order to call the transform on nodes of this flexinode type (otherwise, the search indexed the xml document, tags and all, no good).

I've written a module called search_attachments that sucks the text out of pdf, doc, and other formats using helper apps like pdf2text and catdoc, then appends the text to the node content for indexing. It's still at "version 1" and I hope to add more functionality in the future. You could use it to index your XML by replacing the helper apps with a simple XSL stylesheet that sucked out the document text. Here's the hook_update_index function:

<?php
/**
 * Implementation of hook_nodeapi('update index').
 */
function search_attachments_node_update_index(&$node) {
  $combined_attachment_texts = array();

  // Select all filepaths associated with the node
  $result = db_query("SELECT filepath FROM {files} WHERE nid = %d", $node->nid);
  while ($row = db_fetch_array($result)) {
    $attachment_path = $row['filepath'];

    // Get file extension for the current file
    $attachment_extension = substr(strrchr($attachment_path, "."), 1);

    // Determine helper based on file extension
    switch ($attachment_extension) {
      case 'pdf':
        $helper_command = variable_get('search_attachments_path_to_pdf_helper', '');
        break;
      case 'doc':
        $helper_command = variable_get('search_attachments_path_to_doc_helper', '');
        break;
      case 'txt':
        $helper_command = variable_get('search_attachments_path_to_txt_helper', '');
        break;
      default:
        $helper_command = '';
    }

    // If we have determined which helper to use, extract the text from the attachment.
    if ($helper_command != '') { // Empty entries in settings form mean that helper is disabled.
      // %file% is a token that is placed in the helper's parameter list to represent
      // the file path to the attachment.
      $helper_command = preg_replace('/%file%/', $attachment_path, $helper_command);
      $helper_command = escapeshellcmd($helper_command);
      $attachment_text = shell_exec($helper_command);

      // Since we want the text of all the attachments for a single node, collect
      // each $attachment_text. Doing this inside the if block keeps a file with
      // no helper from re-adding the previous file's text.
      $combined_attachment_texts[] = $attachment_text;
    }
  }

  // Return the string containing all the concatenated text strings
  $string_to_index = implode(' ', $combined_attachment_texts);
  return $string_to_index;
}
?>

This doesn't provide element-level searching on the XML, but if your nodes are just a title and the raw XML, this technique should work fairly well for you.
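
As a rough illustration of "sucking out the document text": instead of an XSL stylesheet, the sketch below swaps in PHP's DOM extension and reads the root element's textContent, which gives the same tag-free text for indexing. The function name xml_text_for_index and the sample transcript are hypothetical, not part of search_attachments.

```php
<?php
// Hypothetical helper: return only the text content of an XML document,
// so that tags and attributes never reach the search index.
function xml_text_for_index($xmldata) {
  $doc = new DOMDocument();
  // Suppress parser warnings; an empty or malformed document yields ''.
  if ($xmldata === '' || !@$doc->loadXML($xmldata)) {
    return '';
  }
  // textContent concatenates every text node under the root element.
  $text = $doc->documentElement->textContent;
  // Collapse the runs of whitespace left behind between elements.
  return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample transcript.
$xml = "<transcript>\n"
     . "  <line who=\"patron\">Where is the atlas?</line>\n"
     . "  <line who=\"librarian\">Second floor.</line>\n"
     . "</transcript>";

echo xml_text_for_index($xml);
// prints "Where is the atlas? Second floor."
```

Attribute values (such as who="patron" here) are dropped along with the tags, so anything stored in attributes would need an explicit XPath or XSL step to be indexed.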

neat

Ok, it was ridiculously easy to just take that function and put it into a module and activate it. For some reason, calling it a "module" was daunting. The search hook should be easy now too - thanks.

I definitely will try out search_attachments for all of our pdfs.

As far as searching individual elements - this has to go for MARCXML records too, right? You're going to want to search and create views by author, title, subject, etc., right?

The main place that XSL hurts is in processor time. Drupal won't work on a page for more than 30 seconds, so I generally can't transform more than 200 documents at a time, which is a problem if I want to sort all 15,000 of them by zip code.

MySQL obviously wins here: if I stored zip codes in a zip code field, I could SELECT count(*) and GROUP BY and be done with it.

But I'm not ready to give up on the XML yet, however stubbornly. My sense was that you wanted to keep MARCXML around also, yes?

Now that I've written a "module" and have seen it's not so scary, it might make sense to create a separate search index, at least for the fields I want to use in views.
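
A minimal sketch of that idea, assuming the zip code lives in a <zip> element somewhere in each transcript (the element name, the function name extract_zip, and the sample documents are all assumptions): pull the value out once per document at import time, so the counting no longer needs a transform per document. Here array_count_values() stands in for the SELECT ... GROUP BY that would run once the value sits in its own column.

```php
<?php
// Hypothetical: extract one field from a transcript at import time.
// The <zip> element name is an assumption about the transcript schema.
function extract_zip($xmldata) {
  $doc = @simplexml_load_string($xmldata);
  if ($doc === FALSE) {
    return NULL; // Not well-formed XML.
  }
  $hits = $doc->xpath('//zip');
  return $hits ? trim((string) $hits[0]) : NULL;
}

// Made-up sample documents.
$documents = array(
  '<transcript><patron><zip>97201</zip></patron></transcript>',
  '<transcript><patron><zip>97201</zip></patron></transcript>',
  '<transcript><patron><zip>10027</zip></patron></transcript>',
  '<transcript><patron/></transcript>',
);

$zips = array();
foreach ($documents as $xml) {
  $zip = extract_zip($xml);
  if ($zip !== NULL) {
    // In Drupal this value would be written to its own indexed column.
    $zips[] = $zip;
  }
}

// Stand-in for: SELECT zip, COUNT(*) ... GROUP BY zip
$counts = array_count_values($zips);
print_r($counts);
```

The XML stays the source of record; the extracted column is just a lookup table that can be rebuilt from the documents at any time.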

Fielded searching

I'm writing a module that will allow "fielded" searching on CCK nodes. This is part of my "digital library content management system based on Drupal" project. I'll be demoing it at a conference in a couple of weeks, at which time I'd be happy to let everyone take a look. It's coming along nicely (battling with some stubborn "Headers already sent" messages, hacking through some theming issues, getting the drupal pager to work, etc.), but the hard parts, the indexing and boolean searching, work great. Basically, I iterate through all the fields in a CCK content type and index them on a field level. I ran this idea past one of the search.module maintainers and he said that approach was consistent with how the search_index table was intended to be used. I also want to add some administrative functions that allow you to configure your search forms in various ways, and doing so will take some time.

If you wanted to parse out certain fields from MARCXML records, load them into fields in a CCK content type, and search on them, you could do it with the module I'm working on. You could then display all or part of the MARCXML record using XSLT. I don't plan on porting it to flexinode, however.
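
The "parse out certain fields" step might look something like the sketch below, which uses DOMXPath to read a few subfields from a MARCXML record. It is not the module described above: the function name marcxml_fields, the choice of fields (100$a, 245$a, 650$a), and the sample record are illustrative assumptions, and loading the values into CCK fields is left out.

```php
<?php
// Sketch: pull a few fields out of a MARCXML record with DOMXPath.
// marcxml_fields is a hypothetical name; the field list is illustrative.
function marcxml_fields($xmldata) {
  $doc = new DOMDocument();
  if ($xmldata === '' || !@$doc->loadXML($xmldata)) {
    return array();
  }
  $xpath = new DOMXPath($doc);
  // MARCXML elements live in the MARC21 "slim" namespace.
  $xpath->registerNamespace('marc', 'http://www.loc.gov/MARC21/slim');

  // Label => (MARC tag, subfield code).
  $wanted = array(
    'author'  => array('100', 'a'),
    'title'   => array('245', 'a'),
    'subject' => array('650', 'a'),
  );

  $fields = array();
  foreach ($wanted as $label => $spec) {
    $query = sprintf(
      '//marc:datafield[@tag="%s"]/marc:subfield[@code="%s"]',
      $spec[0], $spec[1]
    );
    $fields[$label] = array();
    // Repeatable fields (such as 650) produce one entry per occurrence.
    foreach ($xpath->query($query) as $node) {
      $fields[$label][] = trim($node->textContent);
    }
  }
  return $fields;
}

// Made-up sample record for illustration.
$record = '<record xmlns="http://www.loc.gov/MARC21/slim">'
  . '<datafield tag="100" ind1="1" ind2=" "><subfield code="a">Melville, Herman,</subfield></datafield>'
  . '<datafield tag="245" ind1="1" ind2="0"><subfield code="a">Moby Dick /</subfield></datafield>'
  . '<datafield tag="650" ind1=" " ind2="0"><subfield code="a">Whaling</subfield></datafield>'
  . '<datafield tag="650" ind1=" " ind2="0"><subfield code="a">Sea stories</subfield></datafield>'
  . '</record>';

$fields = marcxml_fields($record);
print_r($fields);
```

Each label maps to an array because MARC fields can repeat; a single-valued field such as the title simply comes back as a one-element array.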
