PHP Port of Arc90’s Readability

Update 2011-03-23: Readers may also be interested in how we use PHP Readability at FiveFilters.org: Content Extraction at FiveFilters.org

Last year I ported Arc90’s Readability to use in the Five Filters project. It’s been over a year now and Readability has improved a lot — thanks to Chris Dary and the rest of the team at Arc90.

As part of an update to the Full-Text RSS service I started porting a more recent version (1.6.2) to PHP, and the code is now online at code.fivefilters.org.

For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.

It’s also very handy for content extraction, which is why I wanted to port it to PHP in the first place. Here’s an example of how to use the PHP port:

require_once 'Readability.php';
header('Content-Type: text/plain; charset=utf-8');

// get latest Medialens alert 
// (change this URL to whatever you'd like to test)
$url = 'http://medialens.org/alerts/index.php';
$html = file_get_contents($url);
 
// PHP Readability works with UTF-8 encoded content. 
// If $html is not UTF-8 encoded, use iconv() or 
// mb_convert_encoding() to convert to UTF-8.
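// For example (a rough sketch only - mb_detect_encoding's guess isn't always
// reliable, so adjust the candidate encoding list for your sources):
if (function_exists('mb_convert_encoding')) {
	$enc = mb_detect_encoding($html, array('UTF-8', 'ISO-8859-1', 'Windows-1252'), true);
	if ($enc && $enc !== 'UTF-8') {
		$html = mb_convert_encoding($html, 'UTF-8', $enc);
	}
}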

// If we've got Tidy, let's clean up input.
// This step is highly recommended - PHP's default HTML parser
// often does a terrible job and results in strange output.
if (function_exists('tidy_parse_string')) {
	$tidy = tidy_parse_string($html, array(), 'UTF8');
	$tidy->cleanRepair();
	$html = $tidy->value;
}

// give it to Readability
$readability = new Readability($html, $url);

// print debug output? 
// useful to compare against Arc90's original JS version - 
// simply click the bookmarklet with FireBug's 
// console window open
$readability->debug = false;

// convert links to footnotes?
$readability->convertLinksToFootnotes = true;

// process it
$result = $readability->init();

// does it look like we found what we wanted?
if ($result) {
	echo "== Title ===============================\n";
	echo $readability->getTitle()->textContent, "\n\n";

	echo "== Body ===============================\n";
	$content = $readability->getContent()->innerHTML;

	// if we've got Tidy, let's clean it up for output
	if (function_exists('tidy_parse_string')) {
		$tidy = tidy_parse_string($content, 
			array('indent'=>true, 'show-body-only'=>true), 
			'UTF8');
		$tidy->cleanRepair();
		$content = $tidy->value;
	}
	echo $content;
} else {
	echo 'Looks like we couldn\'t find the content.';
}

Differences between the PHP port and the original

Arc90’s Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page’s CSS styles have been applied and Javascript code executed. This PHP port does not run inside a browser. We use PHP’s ability to parse HTML to build our DOM tree, but we cannot rely on CSS or Javascript support. As such, the results will not always match Arc90’s Readability. (For example, if a web page contains CSS style rules or Javascript code which hide certain HTML elements from display, Arc90’s Readability will dismiss those from consideration but our PHP port, unable to understand CSS or Javascript, will not know any better.)

Another significant difference is that the aim of Arc90’s Readability is to re-present the main content block of a given web page so users can read it more easily in their browsers. Correct identification, clean-up, and separation of the content block is only one part of that process. This PHP port is concerned only with that part; it does not include code that relates to presentation in the browser — Arc90 already do that extremely well, and for PDF output there’s FiveFilters.org’s PDF Newspaper.

Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don’t want to do because it makes debugging and updating more difficult), I’ve tried to make it a little more developer friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.
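
For example, a minimal sketch of what that might look like (this isn’t taken from the library’s documentation; it assumes the port keeps Arc90’s helper method names, such as getInnerText() and getLinkDensity(), and that they are callable from outside the class):

require_once 'Readability.php';

// build a small DOM fragment yourself rather than handing over a full page
$doc = new DOMDocument();
$doc->loadHTML('<div><p>Some text with <a href="#">one link</a> in it.</p></div>');
$div = $doc->getElementsByTagName('div')->item(0);

// constructed with a trivial document here just to get at the helper methods
$readability = new Readability('<html><body></body></html>');
echo $readability->getInnerText($div), "\n";   // plain text of the fragment
echo $readability->getLinkDensity($div), "\n"; // proportion of the text that sits inside links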


49 Comments

  1. Posted 10 August 2010 at 1:47 am | Permalink

    There’s also a Python port and a Ruby port available if PHP isn’t your language.

  2. Posted 10 August 2010 at 3:50 pm | Permalink

    This is good stuff. I’ve played with it a bit, works great. Nice work, Keyvan!

  3. Posted 10 August 2010 at 5:21 pm | Permalink

    Thanks Chris! :)

  4. ph
    Posted 21 August 2010 at 8:54 pm | Permalink

    Hi,

    I’ve been trying to get this to work and all I’m getting is a blank screen. I turned on “debug:true” and still no output… any ideas?

  5. Posted 21 August 2010 at 10:58 pm | Permalink

    ph: try adding

    error_reporting(E_ALL);
    ini_set("display_errors", 1);

    to the top of the example file and see if that produces an error message.

  6. Neal
    Posted 1 September 2010 at 5:38 am | Permalink

    Good work, Keyvan.
    A quick question – do you plan to support multiple pages? i.e. porting findNextPageLink from the original js?

  7. Posted 1 September 2010 at 9:43 am | Permalink

    Neal: Thanks. I do plan to port that over too, yes. Hopefully sometime this month.

  8. Fab
    Posted 7 September 2010 at 2:28 pm | Permalink

    This is great work, thanks for sharing!

  9. Chris
    Posted 16 September 2010 at 5:04 am | Permalink

    There appears to be a few issues with images.

    It seems to be killing divs with images in them, whereas the JS version does not.

    Try this URL for instance: http://www.popsci.com/cars/article/2010-09/giving-traffic-lights-mind-their-own-can-reduce-congestion-study-says

    Any ideas?

  10. Chris
    Posted 16 September 2010 at 5:10 am | Permalink

    Oh, by the way, how about creating a GitHub repo for this? I’d gladly contribute. I was working on something similar back when they had 1.6.x lol.

  11. Posted 16 September 2010 at 10:16 am | Permalink

    Chris: Thanks for the report. I’ll look into it. As for the github, I’ll think about it. :)

  12. Mridang
    Posted 21 September 2010 at 12:09 pm | Permalink

    Is this the extraction backend that powers the FiveFilters Full-Text Feeds application? Thank you.

  13. Posted 21 September 2010 at 2:29 pm | Permalink

    Mridang: yes, it is

  14. Simon
    Posted 22 October 2010 at 7:03 pm | Permalink

    I’m a French speaker. I’ve tried it on a French page and found a bug with accented characters (I know a lot of languages use characters such as À, É, È and so on) on a website that doesn’t encode the É in the usual way. Is there a way to fix it?

  15. Posted 23 October 2010 at 12:14 pm | Permalink

    Simon: the content should be UTF-8 encoded before you pass it to PHP Readability. If it’s not you will need to convert it. If you’ve got a URL of the page I can take a look.

  16. pat
    Posted 10 November 2010 at 9:02 am | Permalink

    Hi, would love this on GitHub as well. I’m willing to contribute what I have.

  17. Edward
    Posted 31 December 2010 at 12:02 pm | Permalink

    This is WONDERFUL! Thanks for creating it. Any updates on additional features?

  18. Posted 12 January 2011 at 4:40 pm | Permalink

    Having problems with relative image paths on websites. Any hints or updates in the making?
    Allowing Github contributions would be cool indeed.

  19. Posted 14 February 2011 at 12:09 am | Permalink

    thinkery: PHP Readability does not automatically convert relative URLs to absolute ones, but it’s not difficult to do. Is that what you’re trying to do?
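
    If so, a very rough sketch of one way to start (it only handles root-relative paths like /images/photo.jpg; a full solution needs to deal with ../ paths and <base> tags too, and example.org here is just a placeholder):

    // prefix the site root onto src/href values that start with '/'
    $host = 'http://example.org'; // scheme + host of the page you fetched
    $content = preg_replace('/(src|href)=([\'"])\//i', '$1=$2' . $host . '/', $content);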

    Regarding github contributions: the source code has now moved to code.fivefilters.org using Indefero. You can now grab it with git. I hope that makes it easier for those of you who’d like to fork it and modify it. If you do make changes, please share them – I’ll consider incorporating any changes once tested.

  20. Posted 17 February 2011 at 3:01 pm | Permalink

    Hi,
    This is a great tool. It works like a charm; I will use it in combination with the Bing search engine. Is there a known algorithm behind this kind of extraction? I just want to learn about different algorithms for this purpose.

  21. Brad
    Posted 17 February 2011 at 10:05 pm | Permalink

    I’m getting errors on line 293:
    $this->dom->documentElement->appendChild($this->body);
    with poor content, such as when the source URL has become a 404. Is there any way for this to fail gracefully?

  22. Posted 17 February 2011 at 11:32 pm | Permalink

    Brad: can you give me an example of the HTML ($html) that produces that error in PHP Readability?

  23. Al
    Posted 22 February 2011 at 3:38 pm | Permalink

    Thanks for the work in doing this, exactly what I’m looking for. However I’m getting the following error when testing using the example provided:

    Warning: tidy_parse_string() [function.tidy-parse-string]: Could not load configuration file ‘UTF8’ in /home/public_html/readability/Readability.php on line 18

    Any ideas what is causing this?

    Cheers.

  24. Posted 22 February 2011 at 3:53 pm | Permalink

    Al: sorry, there was an error in the example code on this page (the example in the repository should work fine). tidy_parse_string expects the character encoding in the third argument but in the code I’d posted up it was being passed as the second argument. I’ve fixed it now by adding an empty array() as the second argument – please try copying again from the code on this page and let me know if you still get an error.

  25. Al
    Posted 22 February 2011 at 7:37 pm | Permalink

    Thank you! That was it!

    One other thing: I noticed it doesn’t work on Gizmodo pages. Any idea why it returns the results it does? Example URL:

    http://gizmodo.com/#!5765852/motorola-atrix-review-a-great-phone-makes-for-a-weak-netbook

    Thanks.

  26. Posted 22 February 2011 at 7:45 pm | Permalink

    Al: that’s great, thanks for letting me know.

    As for Gizmodo, they, like Twitter, have embraced a crazy new trend which breaks the way most people expect URLs to work. Tim Bray has written more about it here: http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch

    Basically, you will have to rewrite these hash bang (#!) URLs into a form which leads to a page with real content. The simple rule is replace ‘#!’ with ‘?_escaped_fragment_=’ so your example would become: http://gizmodo.com/?_escaped_fragment_=5765852/motorola-atrix-review-a-great-phone-makes-for-a-weak-netbook

    The full gory details available here: http://code.google.com/web/ajaxcrawling/docs/specification.html

    Hope that helps.

  27. Al
    Posted 22 February 2011 at 8:06 pm | Permalink

    Ahhh, pesky hashbangs! lol. OK, thanks for the info. This kinda screws things up for me a bit though, since I’m pulling the links in from Facebook, and on Facebook Gizmodo posts the links as http://gizmodo.com/5765852/motorola-atrix-review-a-great-phone-makes-for-a-weak-netbook but if you put that in the browser it becomes http://gizmodo.com/#!5765852/motorola-atrix-review-a-great-phone-makes-for-a-weak-netbook, so I basically have no way of knowing if they’re using hashbangs until the URL is fully resolved. So I’m kinda stuck…

    Thanks though!

  28. Al
    Posted 22 February 2011 at 9:46 pm | Permalink

    Following up on my last comment, In case anyone runs into issues with URLs redirecting to hashbang links I was able to resolve this by using the following function in my php code:

    function get_redirect_url($url){
    	$redirect_url = null; 
     
    	$url_parts = @parse_url($url);
    	if (!$url_parts) return false;
    	if (!isset($url_parts['host'])) return false; //can't process relative URLs
    	if (!isset($url_parts['path'])) $url_parts['path'] = '/';
     
    	$sock = fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
    	if (!$sock) return false;
     
    	$request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?'.$url_parts['query'] : '') . " HTTP/1.1\r\n"; 
    	$request .= 'Host: ' . $url_parts['host'] . "\r\n"; 
    	$request .= "Connection: Close\r\n\r\n"; 
    	fwrite($sock, $request);
    	$response = '';
    	while(!feof($sock)) $response .= fread($sock, 8192);
    	fclose($sock);
     
    	if (preg_match('/^Location: (.+?)$/m', $response, $matches)){
    		if ( substr($matches[1], 0, 1) == "/" )
    			return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
    		else
    			return trim($matches[1]);
     
    	} else {
    		return false;
    	}
     
    }
    
    // follow the redirect if there was one (get_redirect_url() returns false otherwise)
    $redirect_url = get_redirect_url($url);
    if ($redirect_url) $url = $redirect_url;
    
    // Rewrite any hashbangs
    $url = str_replace("#!", "?_escaped_fragment_=", $url);
    $html = file_get_contents($url);
    
  29. Posted 22 February 2011 at 11:45 pm | Permalink

    Al: that will work if you expect $url to have exactly one redirect. But if the URL returned by get_redirect_url() has further redirects, you might not catch the hash bang.

    A more robust solution would be to follow redirects one by one, resolving relative URLs and rewriting any hash bangs you encounter. A simpler option is to use cURL: let it handle redirects but grab the effective URL (the final URL it fetches) – see http://www.php.net/manual/en/function.curl-getinfo.php – and if that contains a hash bang, rewrite it and fetch again. I haven’t tested whether cURL preserves the fragment identifier when it returns the effective URL, though – if it doesn’t, this approach will be no good.
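
    Roughly, the cURL option might look like this (untested, as I said, and fetch_with_hashbang_rewrite is just a name made up for the sketch):

    function fetch_with_hashbang_rewrite($url) {
    	$ch = curl_init($url);
    	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    	curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    	$html = curl_exec($ch);
    	// the last URL cURL actually requested, after following redirects
    	$effective = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    	curl_close($ch);
    	// if we ended up on a hash bang URL, rewrite it and fetch the real content
    	if ($effective && strpos($effective, '#!') !== false) {
    		$html = file_get_contents(str_replace('#!', '?_escaped_fragment_=', $effective));
    	}
    	return $html;
    }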

  30. Al
    Posted 23 February 2011 at 3:46 pm | Permalink

    Thanks Keyvan – some good points. I’ll work that into my script.

    I have run into another problem with this link: http://gizmodo.com/5767306/apple-will-unveil-ipad-2-on-march-2 (unfortunately another gizmodo link). Really not sure what’s happening here, looks fine with readability bookmarklet, any idea what could be causing this to happen?

    Cheers

  31. Posted 23 February 2011 at 5:55 pm | Permalink

    Al: I’ll soon be collecting URLs of pages which fail extraction, in an effort to improve PHP Readability. I’ve deliberately held off changing the code because at the moment I don’t have a decent test framework in place to let me see the impact of changes on sites other than the one in question. Once I have something in place, I’ll post here and hopefully get help from anyone interested in improving the PHP Readability code.

    I think with the new readability.com service Arc90 are unlikely to continue developing their open source version, so perhaps a community effort can keep it alive.

    My suggestion to you regarding gizmodo.com, and any other site you think you’ll be extracting from fairly regularly, is to create your own extraction pattern and rely on PHP Readability if that pattern fails (e.g. in the case of a redesign). That’s actually what Flipboard and Instapaper appear to do – see links in the comments here: http://www.corgitoergosum.net/2011/01/17/replicating-flipboard-part-i-site-scraping/comment-page-1/#comment-39
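
    A rough sketch of that arrangement (the XPath rule here is made up – you’d write one per site you extract from regularly):

    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    // hypothetical site-specific rule - adjust it to the site's actual markup
    $nodes = $xpath->query("//div[@class='post-content']");
    if ($nodes->length > 0) {
    	$content = $doc->saveHTML($nodes->item(0));
    } else {
    	// the custom pattern failed (e.g. after a redesign) - fall back to PHP Readability
    	$readability = new Readability($html, $url);
    	$content = $readability->init() ? $readability->getContent()->innerHTML : null;
    }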

  32. Posted 14 March 2011 at 12:20 pm | Permalink

    Good initiative. Now also available as a plugin for SPIP.

  33. Alan
    Posted 15 March 2011 at 12:27 am | Permalink

    Just want to say thank you for this code. There are other PHP Readability libraries popping up on GitHub but none seem to work as well as this one. Would you be willing to create a repository for it on there too?

  34. Nikhil
    Posted 28 March 2011 at 3:14 pm | Permalink

    This is incredibly helpful. Thanks a lot for the port.

  35. Posted 6 October 2011 at 11:07 pm | Permalink

    I’m making a Digg-like URL submitter. How far can this class help me extract the title, description and images from any submitted URL?
    Thanks.

  36. Don
    Posted 2 December 2011 at 8:09 pm | Permalink

    I’m curious if it’s possible to use this to extract content from custom comment tags. For example, if the page I’m looking at contains something like: <code><!-- MYTAG myname=myvalue --></code>, is there a clear way to pull out the name and value using Readability?

  37. ragess
    Posted 8 December 2011 at 6:19 pm | Permalink

    Hello Keyvan. I came to your blog after googling ‘how to get full content from an RSS feed’. Your Readability sounds promising, but I tried your example and it gave me lots of tags and stray characters, not the text, not at all.
    I even used SimplePie, but again get_content() just gave me excerpts.
    Keyvan, could you please help me with how to get the full content from an RSS feed? Please describe it in detail, like which code to put where, if you could.
    Thanks.

  38. Posted 8 December 2011 at 6:33 pm | Permalink

    MetLife: Thank you!

    Alan: Thanks! Regarding GitHub, please see earlier comment http://www.keyvan.net/2010/08/php-readability/#comment-322379 – our repository on code.fivefilters.org is accessible via git, anyone can fork it and place it on GitHub. I’m not interested in doing that myself.

    Nikhil: Thanks!

    jaideep: I don’t know. Try it and see. If you want more control over extraction I suggest you check our Full-Text RSS tool: http://fivefilters.org/content-only/

    Don: I wouldn’t use PHP Readability for that. Try regular expressions.
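
    Something along these lines for the example you gave (a quick sketch, untested against your actual markup):

    if (preg_match('/<!--\s*MYTAG\s+(\w+)=(\S+?)\s*-->/', $html, $m)) {
    	list(, $name, $value) = $m;
    	echo "$name = $value"; // myname = myvalue
    }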

    ragess: If you’re trying to create full-text RSS feeds, I’d suggest you look at our Full-Text RSS tool: http://fivefilters.org/content-only/ You’ll find some documentation here: http://help.fivefilters.org/customer/portal/topics/62602-full-text-rss/articles

  39. Matt
    Posted 27 December 2011 at 8:04 pm | Permalink

    That is awesome stuff. Just what I needed. Thanks!

  40. Alan
    Posted 2 January 2012 at 10:32 pm | Permalink

    I am having some trouble in an implementation and would appreciate some guidance:

    With:
    $url = 'http://business.financialpost.com/2012/01/02/asias-double-edged-currency-sword/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+FP_TopStories+%28Financial+Post+-+Top+Stories%29';

    I get this as the first part of the output:

    == Title =====================================
    Asia currencies face volatility dilemma | Investing

    == Body ======================================

    <em><strong>By Emily Kaiser, Asia economics correspondent</strong></em>
    SINGAPORE – The roller-coaster ride for Asian currencies, which saw only the yen and yuan post significant gains for the year against the U.S. dollar, is set to continue in 2012.
    While Japan actively sought to stem the yen’s rise — drawing U.S. criticism last week — China intervened to ensure the yuan ended the year at a new high. Both currencies appreciated roughly 5% in 2011 against the dollar.
    The opposite approaches illustrate a dilemma facing Asian policymakers as they try to smooth out foreign exchange rate volatility, which shows no sign of abating in the new year. If the currency is too strong, exports get more expensive. Too weak, and imported inflation spikes and domestic buying power fades.

  41. Alan
    Posted 2 January 2012 at 10:50 pm | Permalink

    @Alan followup… it was actually showing the HTML markup with ‘p’ markers, but I see that’s actually desired.

    But I do see an issue: odd letters/symbols show up before the word We and after the word government in the following text:

    “We believe the government’s

  42. Andy
    Posted 15 February 2012 at 4:20 pm | Permalink

    Keyvan, another heuristic to consider is adding elements with an explicit style=”display:none” to the unlikely candidates list. I ran into some examples where a hidden DIV contained a bunch of text that the user would never see, and modified my copy of the library to throw these out.
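
    Roughly, the idea looks like this (just a sketch, not my exact patch; it strips elements with an inline display:none before scoring, and assumes you can get at the parsed DOMDocument):

    $xpath = new DOMXPath($dom); // $dom being the DOMDocument Readability works on
    $query = '//*[contains(translate(@style, " ", ""), "display:none")]';
    foreach ($xpath->query($query) as $hidden) {
    	$hidden->parentNode->removeChild($hidden);
    }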

  43. Vanina
    Posted 19 September 2012 at 10:28 am | Permalink

    I can not seem to access the code online. The page redirects to a 404.
    Where can I find the source code?

  44. Posted 20 September 2012 at 5:21 pm | Permalink

    Alan: Sorry for the late reply. That appears to be a character encoding issue. You need to make sure whatever you give PHP Readability is in UTF-8. And treat its output as UTF-8.

    Andy: Yes, we actually do that in Full-Text RSS. I guess it wasn’t in the original Readability code as those elements probably weren’t being considered when run in the browser.

    Vanina: We moved code.fivefilters.org which unfortunately broke a few URLs. I’ve updated the links on this page, so please try again. Thanks for the report.

  45. Frank
    Posted 22 December 2012 at 10:12 am | Permalink

    Hey Keyvan,
    Thanks for sharing! Quick question about images.
    In a comment above, you were saying you were going to look into it as they seem to get killed in the process. Have you worked on a fix? :)
    I’d like to be able to use them along with text in a small app I’m building.
    Thanks!

  46. Posted 3 January 2013 at 12:05 pm | Permalink

    Frank: regarding images, we’ve made a few changes to PHP Readability that will go into the release of Full-Text RSS 3.1. The changes should preserve more images and embedded videos. Once we’re ready with that release I’ll update the PHP Readability code linked here.

  47. Michael
    Posted 10 February 2013 at 9:05 pm | Permalink

    Hello. I’ve tried the class you posted and it’s really great! I’ve tested it on different types of web pages and on most of them it gives awesome results!

    But there’s a type of web page that holds multiple blocks of content of similar size. I turned debug on and it showed me scores of 42 to 52, and I think instead of grabbing one of them it would be better to get several.

    I was thinking about having some threshold, say 20-30% of the top candidate’s score, and taking all candidates that fall within it. In my case 20% of 52 is 10.4, so all candidates with score > 41.6 would be included in the output.

    Before I dive into rewriting it for my needs I wanted to ask this: do you have a version that extracts X top candidates instead of one or the way I described with the threshold? I’ve looked into the grabArticle code and it looks like it won’t be a quick fix to implement something like that.

    Feel free to contact me if you find this idea interesting.

  48. Posted 10 February 2013 at 9:33 pm | Permalink

    Hi Michael, that’s interesting. I’m not aware of a version that does that. For use on FiveFilters.org, we write custom extraction rules for sites where PHP Readability doesn’t extract what we want. In cases where we want to extract multiple elements, we use XPath to select them.

    I’m afraid I can’t help with what you’re trying to achieve, but it does sound interesting.

  49. Michael
    Posted 11 February 2013 at 12:03 am | Permalink

    Thanks for taking the time to reply.

    I will check the code again and try to come up with a way to make the changes I need. Thankfully it’s very well documented and easy to understand.

    Best regards.

