3. I added an XSL function to core in common.inc (this is bad, right? I'm not supposed to modify the core? oh well....)
This created a problem in the search indexing process, since the search indexer calls $content, and $content was the raw xml document, not what I actually wanted to index. I patched search.module in order to call the transform on nodes of this flexinode type (otherwise, the search indexed the xml document, tags and all, no good).
Good set of problems...
Caleb,
I agree -- I don't think hacking the core is a good idea. You have a number of options (someone may correct me if any of these don't actually work, but they should): 1) create a module, 2) add this function to your theme, 3) create a PHP snippet from your function that transforms the XML, or 4) create an input format that transforms XML to HTML using XSLT.
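For option 1, something along these lines might be a starting point. It's only a sketch: mymodule_xml_to_html() is a made-up name rather than a Drupal API, and it assumes PHP's DOM and XSL extensions are enabled on your server.

```php
<?php
/**
 * Hypothetical helper (not part of Drupal): transform an XML string to HTML
 * with an XSL stylesheet, using PHP's DOM and XSL extensions.
 *
 * @param $xml  The raw XML document as a string.
 * @param $xsl  The XSL stylesheet as a string.
 * @return      The transformed markup, or '' if anything fails.
 */
function mymodule_xml_to_html($xml, $xsl) {
  $xml_doc = new DOMDocument();
  $xsl_doc = new DOMDocument();

  // Give up quietly if either document is not well-formed.
  if (!@$xml_doc->loadXML($xml) || !@$xsl_doc->loadXML($xsl)) {
    return '';
  }

  $processor = new XSLTProcessor();
  $processor->importStylesheet($xsl_doc);
  $html = $processor->transformToXML($xml_doc);

  return is_string($html) ? $html : '';
}
```

Because it lives in a module, the same function could be called from the theme or from an input format without touching common.inc.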
I've written a module called search_attachments that sucks the text out of PDF, DOC, and other formats using helper apps like pdf2text and catdoc, then appends the text to the node content for indexing. It's still at "version 1" and I hope to add more functionality in the future. You could use it to index your XML by replacing the helper apps with a simple XSL stylesheet that sucks out the document text. Here's the hook_update_index function:
<?php
/**
 * Implementation of hook_nodeapi('update index').
 */
function search_attachments_node_update_index(&$node) {
  $combined_attachment_texts = array();

  // Select all file paths associated with the node.
  $result = db_query("SELECT filepath FROM {files} WHERE nid = %d", $node->nid);
  while ($row = db_fetch_array($result)) {
    $attachment_path = $row['filepath'];

    // Get the file extension for the current file, lowercased so that
    // "PDF" and "pdf" select the same helper.
    $attachment_extension = strtolower(substr(strrchr($attachment_path, "."), 1));

    // Determine the helper based on the file extension.
    switch ($attachment_extension) {
      case 'pdf':
        $helper_command = variable_get('search_attachments_path_to_pdf_helper', '');
        break;
      case 'doc':
        $helper_command = variable_get('search_attachments_path_to_doc_helper', '');
        break;
      case 'txt':
        $helper_command = variable_get('search_attachments_path_to_txt_helper', '');
        break;
      default:
        $helper_command = '';
    }

    // If we have determined which helper to use, extract the text from the
    // attachment. An empty entry in the settings form means that helper is
    // disabled.
    if ($helper_command != '') {
      // %file% is a token that is placed in the helper's parameter list to
      // represent the file path to the attachment. str_replace() is used so
      // that "$" or "\" in a path is not treated as a regex backreference.
      $helper_command = str_replace('%file%', $attachment_path, $helper_command);
      $helper_command = escapeshellcmd($helper_command);
      $attachment_text = shell_exec($helper_command);

      // Collect the text only when a helper actually ran and returned output,
      // so files without a helper don't add an undefined or stale value.
      if (is_string($attachment_text) && $attachment_text !== '') {
        $combined_attachment_texts[] = $attachment_text;
      }
    }
  }

  // Return one string containing the text of all the node's attachments.
  return implode(' ', $combined_attachment_texts);
}
?>
This doesn't provide element-level searching on the XML, but if your nodes are just a title and the raw XML, this technique should work fairly well for you.
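To make the XSL idea a bit more concrete, here is a rough sketch of a text-only transform. The function name mymodule_xml_file_to_text() and the embedded stylesheet are my own illustration, not part of search_attachments, and it assumes PHP's DOM and XSL extensions are available.

```php
<?php
/**
 * Hypothetical helper (not part of search_attachments): return the text of
 * an XML file with all tags stripped, using a text-output XSL stylesheet.
 */
function mymodule_xml_file_to_text($xml_path) {
  // Emit every non-blank text node, whitespace-normalized and followed by a
  // space so that words from adjacent elements don't run together.
  $xsl = '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="text()">
    <xsl:if test="normalize-space(.)">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text> </xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>';

  $xml_doc = new DOMDocument();
  $xsl_doc = new DOMDocument();

  // Give up quietly if the file is missing or not well-formed XML.
  if (!@$xml_doc->load($xml_path) || !$xsl_doc->loadXML($xsl)) {
    return '';
  }

  $processor = new XSLTProcessor();
  $processor->importStylesheet($xsl_doc);
  $text = $processor->transformToXML($xml_doc);

  return is_string($text) ? trim($text) : '';
}
```

In the hook above, a case 'xml': branch could call something like this on $attachment_path instead of shelling out to a helper app. I haven't wired that up myself, so treat it as a direction rather than tested code.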