Jeff Schmidt on Academic Freedom

Jeff Schmidt, author of one of my favourite books (Disciplined Minds):

“Intellectual workers don’t need academic freedom to service the status quo, which is what they’ve been hired to do, but they do need academic freedom to do what they should be doing, which is questioning what they’ve been hired to do and working instead in the public interest. If salaried intellectual workers want to make a difference in the world, if they want to make the world a better place, then they have to do things beyond the service work that they’ve been hired to do. That’s what activists do: things that they weren’t hired to do.”

The clip below is from his talk at the Frederic Ewen Academic Freedom Center’s 2nd Annual Conference at New York University on April 3, 2009.

Posted in General

PHP Port of Arc90’s Readability

Last year I ported Arc90’s Readability for use in the Five Filters project. It’s been over a year now and Readability has improved a lot, thanks to Chris Dary and the rest of the team at Arc90.

As part of an update to the Full-Text RSS service I started porting a more recent version (1.6.2) to PHP, and the code is now online.

For anyone not familiar, Readability was created for use as a browser add-on (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.

It’s also very handy for content extraction, which is why I wanted to port it to PHP in the first place. Here’s an example of how to use the PHP port:

require_once 'Readability.php';
header('Content-Type: text/plain; charset=utf-8');

// get latest Medialens alert
// (change this URL to whatever you'd like to test)
$url = 'http://medialens.org/alerts/index.php';
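$html = file_get_contents($url);
if ($html === false) {
	// file_get_contents() returns false when the fetch fails
	exit("Could not fetch $url\n");
}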

// PHP Readability works with UTF-8 encoded content.
// If $html is not UTF-8 encoded, use iconv() or
// mb_convert_encoding() to convert to UTF-8.

// give it to Readability
$readability = new Readability($html, $url);

// print debug output?
// useful to compare against Arc90's original JS version -
// simply click the bookmarklet with FireBug's
// console window open
$readability->debug = false;

// convert links to footnotes?
$readability->convertLinksToFootnotes = true;

// process it
$result = $readability->init();

// does it look like we found what we wanted?
if ($result) {
	echo "== Title ===============================\n";
	echo $readability->getTitle()->textContent, "\n\n";

	echo "== Body ===============================\n";
	$content = $readability->getContent()->innerHTML;

	// if we've got Tidy, let's clean it up for output
	if (function_exists('tidy_parse_string')) {
		$tidy = tidy_parse_string($content,
			array('indent'=>true, 'show-body-only'=>true),
			'UTF8');
		$tidy->cleanRepair();
		$content = $tidy->value;
	}
	echo $content;
} else {
	echo 'Looks like we couldn\'t find the content.';
}
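The comment in the example above mentions converting non-UTF-8 input before handing it to Readability. Here is a minimal sketch of that step, assuming the mbstring extension is available. The helper name to_utf8() and the ISO-8859-1 default are my own assumptions for illustration and are not part of the PHP port; in practice you would pass whatever encoding the server or the page declares.

```php
<?php
// Hypothetical helper (not part of the PHP port): return $html as UTF-8.
// $sourceEncoding is an assumption; use the charset the page declares.
function to_utf8($html, $sourceEncoding = 'ISO-8859-1')
{
	// Already valid UTF-8? Leave it untouched.
	if (mb_check_encoding($html, 'UTF-8')) {
		return $html;
	}
	// Otherwise convert from the assumed source encoding.
	return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}
```

You would call it on the fetched page before creating the Readability object, for example $html = to_utf8($html);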

Differences between the PHP port and the original

Arc90’s Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page’s CSS styles have been applied and JavaScript code executed. This PHP port does not run inside a browser. We use PHP’s ability to parse HTML to build our DOM tree, but we cannot rely on CSS or JavaScript support. As such, the results will not always match Arc90’s Readability. (For example, if a web page contains CSS style rules or JavaScript code which hide certain HTML elements from display, Arc90’s Readability will dismiss those from consideration but our PHP port, unable to understand CSS or JavaScript, will not know any better.)
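To make the limitation concrete, here is a rough sketch of what a server-side DOM can and cannot see. It is not code from the port: find_inline_hidden() is a hypothetical helper that finds elements hidden through an inline style attribute. Elements hidden by a stylesheet rule or by JavaScript look perfectly ordinary to PHP’s DOM and go undetected.

```php
<?php
// Hypothetical helper, not part of the PHP port: return the elements that
// are hidden through an inline style attribute. Stylesheet rules and
// JavaScript changes are invisible to a plain DOM parse.
function find_inline_hidden(DOMDocument $doc)
{
	$xpath = new DOMXPath($doc);
	// Strip spaces from @style so "display: none" and "display:none"
	// both match. The match is case-sensitive, which is a simplification.
	$query = "//*[contains(translate(@style, ' ', ''), 'display:none')]";
	$hidden = array();
	foreach ($xpath->query($query) as $node) {
		$hidden[] = $node;
	}
	return $hidden;
}
```

Given a page where one paragraph carries style="display: none" and another is hidden by a class rule inside a style element, only the first is returned.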

Another significant difference is that the aim of Arc90’s Readability is to re-present the main content block of a given web page so users can read it more easily in their browsers. Correct identification, clean up, and separation of the content block is only a part of this process. This PHP port is concerned only with that part; it does not include code that relates to presentation in the browser. Arc90 already do that extremely well, and for PDF output there’s FiveFilters.org’s PDF Newspaper.

Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don’t want to do because it makes debugging and updating more difficult), I’ve tried to make it a little more developer friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.

Posted in Code | 7 Comments

JavaScript-like innerHTML access in PHP

As part of an update to the Five Filters Full-Text RSS service, I’ve been porting some JavaScript code (Arc90’s current version of Readability) to PHP. It contains a lot of DOM manipulation which translates very easily – thanks to PHP 5’s DOM support. But one thing I wasn’t able to do was manipulate the DOM tree through the innerHTML property.

In JavaScript, it’s very easy to do. The Mozilla Developer Network’s page on innerHTML gives the following example:

var content = element.innerHTML;
// Returns a string containing the HTML syntax describing all
// of the element's descendants
element.innerHTML = content;
// Removes all of element's descendants, parses the content
// string and assigns  the resulting nodes as descendants of
// the element.

Using PHP’s magic getter and setter methods, it’s possible to extend DOMElement to achieve this type of access and manipulation. My attempt at doing it is JSLikeHTMLElement. Here’s an example of how to use it:

require_once 'JSLikeHTMLElement.php';
$doc = new DOMDocument();
$doc->registerNodeClass('DOMElement', 'JSLikeHTMLElement');
$doc->loadHTML('<div><p>Para 1</p><p>Para 2</p></div>');
$elem = $doc->getElementsByTagName('div')->item(0);

// print innerHTML
echo $elem->innerHTML; // prints '<p>Para 1</p><p>Para 2</p>'

// set innerHTML
$elem->innerHTML = 'FF';

// print document (with our changes)
echo $doc->saveXML();

Download: JSLikeHTMLElement.php. Feedback appreciated.
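For anyone curious how the magic-method approach works, here is a simplified sketch. It is not the code in JSLikeHTMLElement.php: the class name InnerHtmlElement is hypothetical, and the sketch skips error handling and character-set handling (non-ASCII input would need extra care). The getter serialises each child node; the setter removes the existing children, parses the new markup in a scratch document and imports the resulting nodes.

```php
<?php
// Hypothetical class, a simplified sketch of the magic-method approach.
class InnerHtmlElement extends DOMElement
{
	// Runs when an undefined property is read, e.g. $elem->innerHTML
	public function __get($name)
	{
		if ($name !== 'innerHTML') {
			return null;
		}
		$html = '';
		foreach ($this->childNodes as $child) {
			$html .= $this->ownerDocument->saveXML($child);
		}
		return $html;
	}

	// Runs when an undefined property is written, e.g. $elem->innerHTML = '...'
	public function __set($name, $value)
	{
		if ($name !== 'innerHTML') {
			return;
		}
		// Remove all existing descendants
		while ($this->firstChild) {
			$this->removeChild($this->firstChild);
		}
		if ($value === '') {
			return;
		}
		// Parse the markup inside a wrapper div in a scratch document
		$scratch = new DOMDocument();
		$scratch->loadHTML('<div>' . $value . '</div>');
		$wrapper = $scratch->getElementsByTagName('div')->item(0);
		// Copy the parsed nodes into this element's own document
		foreach ($wrapper->childNodes as $child) {
			$this->appendChild($this->ownerDocument->importNode($child, true));
		}
	}
}
```

As in the example above, the class is registered with $doc->registerNodeClass('DOMElement', 'InnerHtmlElement') before the document is loaded, so every element the document returns picks up the innerHTML property.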

Posted in Code | 2 Comments

Propaganda, State Religion and the Attack on the Gaza Peace Flotilla

Another excellent alert from Medialens: Headshot – Propaganda, State Religion and the Attack on the Gaza Peace Flotilla. It compares the media reaction to the Israeli killing of activists with the reaction towards the 2007 incident involving British sailors being detained by Iranian forces.

…media coverage of the non-violent Iranian capture of 15 British sailors (in Iranian waters) focused on the humiliating failure of the sailors to open fire in self-defence. Journalists took a very different view of the May 31 Israeli attack on the ship Mavi Marmara carrying human rights activists and supplies to the besieged population of Gaza.

In this case, the key question was not why the activists failed to open fire (they had no guns) on their approaching kidnappers, but whether they used lesser violence – hitting with sticks and poles – before the commandos opened fire killing nine people and wounding several dozen more.

Posted in General | 1 Comment

WordPress Blog to PDF

If you’d like to turn your WordPress blog content into a printable PDF in newspaper format, there’s now a solution for you: Make PDF Newspaper by Martin Hawksey. It makes use of the FiveFilters.org PDF Newspaper source code.

Developing a plugin like this has been on my to-do list for the FiveFilters.org project for a while now. To give a little background, one goal of the project is to encourage users to explore the world of non-corporate media, and to do that we’re developing tools and services to make content on the web a little more accessible. One area of development has been the PDF Newspaper project, which can take feed or HTML input and turn it into a printable PDF in newspaper format. But while the service at FiveFilters.org has been up and running for over a year now, bloggers have not had a convenient way to generate and offer their content in PDF format directly from their blogs.

I’m happy to say that a few days ago Martin Hawksey got in touch to say he had integrated the FiveFilters.org source code into a WordPress plugin. Martin had previously released the Make Tabbloid plugin which did something similar using HP’s Tabbloid service (in fact, my own plan for the plugin was to build on Martin’s work). HP, however, pulled API access to the service earlier in the year without warning – breaking any application that depended on it (one risk of relying on closed, cloud-based services, but more on that in another post). The PDF Newspaper project was started to offer users a free software (open source) alternative to HP’s service, so I’m very happy that the work has now been extended to the WordPress platform. Thanks Martin!

For a full list of features and an example of the PDF output, please visit Make PDF Newspaper.

I tested the plugin on this site a little earlier, following the installation steps, and it worked without any problems. To see the output, you can view the generated PDF for this blog.

Posted in General | 2 Comments