Term Extraction in PHP
The new version of the term extraction tool on fivefilters.org is now in PHP.
Read the blog post explaining what’s new.
For anyone looking for a simple way to carry out term extraction on English text using PHP, here’s a snippet using the PHP port of Topia’s Term Extractor:
require 'TermExtractor/TermExtractor.php';
$text = 'Politics is the shadow cast on society by big business';
$extractor = new TermExtractor();
$terms = $extractor->extract($text);
// We're outputting results in plain text...
header('Content-Type: text/plain; charset=UTF-8');
// Loop through extracted terms and print each term on a new line
foreach ($terms as $term_info) {
    // index 0: term
    // index 1: number of occurrences in text
    // index 2: word count
    list($term, $occurrence, $word_count) = $term_info;
    echo "$term\n";
}
Posted in Code
Chris Hedges: Assault on Gaza is Not a War, it is Murder
via Jonathan Cook
Posted in General
PHP DOMDocument replace DOMElement contents with HTML string
This is another StackOverflow answer I’m moving over to my blog.
AWinter asked:
Using PHP I’m attempting to take an HTML string passed from a WYSIWYG editor and replace the children of an element inside of a preloaded HTML document with the new HTML.
So far I’m loading the document identifying the element I want to change by ID but the process to convert an HTML to something that can be placed inside a DOMElement is eluding me.
$doc = new DOMDocument();
$doc->loadHTML($html);
$element = $doc->getElementById($item_id);
if (isset($element)) {
    // Remove the old children from the element
    while ($element->childNodes->length) {
        $element->removeChild($element->firstChild);
    }
    // Need to build the new children from $html_string and append to $element
}
My answer:
If the HTML string can be parsed as XML, you can do this (after clearing the element of all child nodes):
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($html_string);
$element->appendChild($fragment);
If $html_string cannot be parsed as XML, appendXML() will fail. In that case you’ll have to use loadHTML(), which is less strict, but it will add wrapper elements around the fragment which you will have to strip.
Unlike PHP, JavaScript has the innerHTML property, which allows you to do this very easily. I needed something like it for a project, so I extended PHP’s DOMElement to include JavaScript-like innerHTML access.
With it you can read the innerHTML property and change it just as you would in JavaScript:
echo $element->innerHTML;
$element->innerHTML = 'example';
Posted in Code
Clean up HTML on paste in CKEditor
We use CKEditor at FiveFilters.org for our PastePad service. The idea is to allow users to paste content that’s not currently publicly available on the web for processing with one of our web tools. This can be content that’s in a Word document, an email, or behind a paywall.
CKEditor can automatically clean up HTML it identifies as coming from MS Word, but there’s no way to force cleanup on all pasted content. By default, HTML cleanup occurs in the following two cases:
- User clicks the ‘paste from word’ toolbar icon
- User pastes content copied from MS Word itself
In the second case, CKEditor looks for signs of MS Word formatting. It does this by testing whatever you paste against the following regular expression:
/(class=\"?Mso|style=\"[^\"]*\bmso\-|w:WordDocument)/
If there’s a match, it will be cleaned up. Otherwise it will paste as normal.
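To get a feel for what that pattern accepts, here is a small sketch that runs it against a few hand-written strings. The helper name looksLikeWord and the sample markup are made up for illustration. They are not part of CKEditor.

```javascript
// The detection pattern quoted above, written as a JavaScript regex literal.
const wordMarkup = /(class=\"?Mso|style=\"[^\"]*\bmso\-|w:WordDocument)/;

// Hypothetical helper, named only for this illustration.
function looksLikeWord(html) {
  return wordMarkup.test(html);
}

// Made-up sample fragments: three with Word-style markers, one without.
const samples = [
  '<p class="MsoNormal">Quoted Mso class</p>',
  '<p class=MsoNormal>Unquoted Mso class</p>',
  '<span style="color:red; mso-bidi-font-weight: bold">mso- style</span>',
  '<p class="intro">Plain web content</p>',
];

for (const html of samples) {
  console.log(looksLikeWord(html), html);
}
```

The first three samples match and the last one does not, which is why ordinary web content pastes without being cleaned.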
I want to avoid editing core files, so my solution is simply to ensure that this regular expression always matches pasted content. Here’s what I’ve come up with:
CKEDITOR.on('instanceReady', function(ev) {
    ev.editor.on('paste', function(evt) {
        // Prepend an HTML comment that matches the Paste from Word pattern
        evt.data['html'] = '<!--class="Mso"-->' + evt.data['html'];
    }, null, null, 9);
});
I haven’t tested extensively, but this appears to work as expected (CKEditor 3.6.2). You can try it out.
The code registers a new listener for the paste event, just as the Paste from Word plugin does. When it receives the pasted HTML, it simply prepends an HTML comment containing one of the strings the Paste from Word plugin looks for. The listener has a priority of 9 to ensure it runs before the plugin's own listener (default priority of 10), which triggers the actual cleaning.
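The ordering argument can be checked outside the editor with a toy event bus. This is only a sketch of the idea and not CKEditor's implementation: the TinyEvents class, the wouldClean flag and the stand-in plugin listener are all invented here, and the marker comment is just one string that satisfies the detection pattern.

```javascript
// Toy event bus (an assumption for illustration, NOT CKEditor's code):
// listeners run in ascending priority order, lower numbers first.
class TinyEvents {
  constructor() {
    this.listeners = [];
  }
  on(fn, priority = 10) {
    this.listeners.push({ fn, priority });
    // sort() is stable, so listeners with equal priority keep registration order.
    this.listeners.sort((a, b) => a.priority - b.priority);
  }
  fire(data) {
    for (const { fn } of this.listeners) {
      fn(data);
    }
    return data;
  }
}

const wordSigns = /(class=\"?Mso|style=\"[^\"]*\bmso\-|w:WordDocument)/;
const paste = new TinyEvents();

// Stand-in for the Paste from Word plugin (default priority 10).
// It only records whether it would have cleaned the content.
paste.on(function (data) {
  data.wouldClean = wordSigns.test(data.html);
});

// Our listener: registered later, but priority 9 makes it run first.
paste.on(function (data) {
  data.html = '<!--class="Mso"-->' + data.html;
}, 9);

const result = paste.fire({ html: '<p>Plain paragraph</p>' });
console.log(result.wouldClean); // true
```

If the second listener were registered with the default priority instead of 9, the stand-in plugin would test the untouched HTML first and wouldClean would be false.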
Note: I posted this solution on StackOverflow as an alternative to another solution, titled “CKEditor – use pastefromword filtering on all pasted content.” StackOverflow recently deleted some of my answers (and hid them from me) so I’m moving the rest of my meagre contributions over to my own blog.
Posted in Code