Posted in General | Leave a comment

Term Extraction in PHP

The new version of the term extraction tool on fivefilters.org is now in PHP.

Read the blog post explaining what’s new.

For anyone looking for a simple way to carry out term extraction on English text using PHP, here’s a snippet using the PHP port of Topia’s Term Extractor:

require 'TermExtractor/TermExtractor.php';

$text = 'Politics is the shadow cast on society by big business';

$extractor = new TermExtractor();
$terms = $extractor->extract($text);

// We're outputting results in plain text...
header('Content-Type: text/plain; charset=UTF-8');

// Loop through extracted terms and print each term on a new line
foreach ($terms as $term_info) {
  // index 0: term
  // index 1: number of occurrences in text
  // index 2: word count
  list($term, $occurrence, $word_count) = $term_info;
  echo "$term\n";
}
Posted in Code | Leave a comment

Chris Hedges: Assault on Gaza is Not a War, it is Murder

via Jonathan Cook

Posted in General | Comments closed

PHP DOMDocument replace DOMElement contents with HTML string

This is another StackOverflow answer I’m moving over to my blog.

AWinter asked:

Using PHP I’m attempting to take an HTML string passed from a WYSIWYG editor and replace the children of an element inside of a preloaded HTML document with the new HTML.

So far I’m loading the document identifying the element I want to change by ID but the process to convert an HTML to something that can be placed inside a DOMElement is eluding me.

$doc = new DOMDocument();
$doc->loadHTML($html);

$element = $doc->getElementById($item_id);
if(isset($element)){
    //Remove the old children from the element
    while($element->childNodes->length){
        $element->removeChild($element->firstChild);
    }

    //Need to build the new children from $html_string and append to $element
}

My answer:

If the HTML string can be parsed as XML, you can do this (after clearing the element of all child nodes):

$fragment = $doc->createDocumentFragment();
$fragment->appendXML($html_string);
$element->appendChild($fragment);

If $html_string cannot be parsed as XML, it will fail. If it does, you’ll have to use loadHTML(), which is less strict — but it will add elements around the fragment which you will have to strip.

Unlike PHP, Javascript has the innerHTML property which allows you to do this very easily. I needed something like it for a project so I extended PHP’s DOMElement to include Javascript-like innerHTML access.

With it you can access the innerHTML property and change it just as you would in Javascript:

echo $element->innerHTML;
$elem->innerHTML = 'example';
Posted in Code | Leave a comment

Clean up HTML on paste in CKEditor

We use CKEditor at FiveFilters.org for our PastePad service. The idea is to allow users to paste content that’s not currently publically available on the web for processing with one of our web tools. This can be content that’s in a Word document, an email, or behind a paywall.

CKEditor can automatically clean up HTML it identifies as coming from MS Word, but there’s no way to force cleanup on all pasted content. By default, HTML cleanup occurs in the following two cases:

  1. User clicks the ‘paste from word’ toolbar icon
  2. User pastes content copied from MS Word itself

In the second case, CKEditor looks for signs of MS Word formatting. It does this by testing whatever you paste against the following regular expression:

/(class=\"?Mso|style=\"[^\"]*\bmso\-|w:WordDocument)/

If there’s a match, it will be cleaned up. Otherwise it will paste as normal.

I want to avoid editing core files, so my solution is simply to ensure that this regular expression always matches pasted content. Here’s what I’ve come up with:

CKEDITOR.on('instanceReady', function(ev) {
    ev.editor.on('paste', function(evt) {    
        evt.data['html'] = ''+evt.data['html'];
    }, null, null, 9);
});

I haven’t tested extensively, but this appears to work as expected (CKEditor 3.6.2). You can try it out.

What the code does is it registers a new listener for the paste event, just like the Paste from Word plugin. When it receives the pasted HTML, it simply prepends an HTML comment containing one of the strings the Paste from Word plugin looks for. The listener has a priority of 9 to ensure it runs before the plugin which will trigger the actual cleaning (default priority of 10).

Note: I posted this solution on StackOverflow as an alternative to another solution, titled “CKEditor – use pastefromword filtering on all pasted content.” StackOverflow recently deleted some of my answers (and hid them from me) so I’m moving the rest of my meagre contributions over to my own blog.

Posted in Code | 6 Comments