Content Extraction at FiveFilters.org

Full-Text RSS 2.7 from FiveFilters.org is now available. I thought I’d write about one area of improvement in this release: content extraction.

Automatic Extraction

Up to now we’ve relied mainly on PHP Readability to automatically identify and extract articles from web pages, and this is still how the majority of articles are extracted. It works extremely well for most pages, but there are still occasions when it fails – e.g. picks out the wrong HTML element, or doesn’t find anything at all. Improving PHP Readability will be one area of focus for future releases.

In 2.7 we still use PHP Readability, but we now recognise and prioritise hNews microformatting – if detected, we extract the first element marked entry-title and all elements marked entry-content. This is a standard that will hopefully be used more widely on the web. (For those who’ve asked, Twitter updates are now extracted properly because of hNews support.)
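
As a rough illustration of the hNews rule described above, here is a minimal Python sketch. It is not the Full-Text RSS code: the function names are hypothetical, and it assumes the page is well-formed XHTML so that the standard library XML parser can read it. It takes the first element whose class list contains entry-title and every element whose class list contains entry-content.

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    """True if `name` is one of the space-separated classes on the element."""
    return name in element.get("class", "").split()


def extract_hnews(markup):
    """Return (title, [body, ...]) from well-formed markup, or None if no match.

    Hypothetical helper: real-world HTML is rarely well-formed XML, so a
    production version would need a forgiving HTML parser instead.
    """
    root = ET.fromstring(markup)
    title = None
    bodies = []
    for element in root.iter():  # document order, root included
        if title is None and has_class(element, "entry-title"):
            title = "".join(element.itertext()).strip()
        if has_class(element, "entry-content"):
            bodies.append("".join(element.itertext()).strip())
    if title is None or not bodies:
        return None
    return title, bodies
```

Returning None when either part is missing lets a caller fall back to another extraction method.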

Site Patterns

Recognising that auto-detection does sometimes fail, in version 2.5 we introduced custom extraction patterns: a way for users to override auto-detection and tell Full-Text RSS (using CSS selectors) which element it should extract as the content block.

The biggest change in 2.7 is the introduction of site patterns. Site patterns sit in between custom extraction and auto-detection. They allow fine-grained control over extraction on a per-site basis. A site, identified by its domain name, can now have its own config file detailing extraction rules. Each time a URL is processed, we check to see if a corresponding site config exists, and if it does, we refer to it for instructions. Users can specify XPath expressions to match title and body elements and define rules to strip superfluous elements.

Rather than create our own configuration format for site patterns, we chose to adopt the same format used by Instapaper. Here’s what the entry for wikipedia.org looks like:

body: //div[@id = 'content']
strip_id_or_class: editsection
strip_id_or_class: toc
prune: no

Instapaper users will find these patterns by visiting instapaper.com/bodytext/ (login required).
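
To show how simple the format is to read, here is a minimal Python sketch of a parser for it. This is an illustration only, not the parser Full-Text RSS uses; the function name is hypothetical. It assumes one "key: value" rule per line and keeps every value in a list, because a key such as strip_id_or_class can appear more than once.

```python
from collections import defaultdict


def parse_site_config(text):
    """Parse 'key: value' lines into a dict mapping each key to a list of values."""
    rules = defaultdict(list)
    for line in text.splitlines():
        # Split on the first colon only, so colons inside a value survive.
        key, separator, value = line.partition(":")
        if not separator:
            continue  # blank line or a line without a rule
        rules[key.strip()].append(value.strip())
    return dict(rules)


WIKIPEDIA = """\
body: //div[@id = 'content']
strip_id_or_class: editsection
strip_id_or_class: toc
prune: no
"""

rules = parse_site_config(WIKIPEDIA)
```

With the wikipedia.org entry above, rules["strip_id_or_class"] holds both editsection and toc, in the order they appear in the file.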

One big advantage for us in using the same config format is that we can make use of all the existing rules listed on Instapaper. Marco, Instapaper’s creator, has opened up the database to allow for public contributions. So, included in Full-Text RSS 2.7 are over 100 site configuration files, which will be applied automatically (look inside the site_config/standard/ directory). Most of these are borrowed from Instapaper, but we’ll soon be adding our own, which we’ll share with everyone.

Users can also create their own site config files and drop them in the site_config/custom/ directory. Each site config is simply a text file named after the site. For example, if I wanted a special rule for extracting content from this site, I would create a keyvan.net.txt file with the appropriate rules inside.
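
The naming convention makes the lookup easy to picture. The Python sketch below is a guess at the general shape, not the actual implementation: the function name is hypothetical, and two details are assumptions on my part, namely that site_config/custom/ is checked before site_config/standard/ and that the file name must match the URL's host exactly.

```python
from pathlib import Path
from urllib.parse import urlsplit


def find_site_config(url, config_root):
    """Return the Path of the site config for `url`, or None if there is none.

    Assumptions (not confirmed by the release notes): custom files take
    precedence over standard ones, and the host must match the file name exactly.
    """
    host = urlsplit(url).hostname
    if host is None:
        return None
    for folder in ("custom", "standard"):
        candidate = Path(config_root) / folder / f"{host}.txt"
        if candidate.is_file():
            return candidate
    return None
```

For example, find_site_config("http://keyvan.net/some-post/", "site_config") would look for site_config/custom/keyvan.net.txt first and then site_config/standard/keyvan.net.txt.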

Extraction Process Overview

To summarise, Full-Text RSS 2.7 attempts to extract in the following order:

  1. Custom Extraction Pattern
  2. Site Patterns
  3. hNews
  4. PHP Readability

If at any stage we find we’ve got a successful title and body match, we do not proceed further. If, however, there is no match, we move down the list until there is (the only exception here is with custom extraction patterns – if the supplied CSS selector does not match, no further attempt is made).
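
The control flow just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the real code: the function and parameter names are hypothetical, and each extractor is assumed to be a callable that returns a (title, body) pair on success or None on failure.

```python
def extract(page, custom_pattern=None, fallbacks=()):
    """Run extractors in priority order and return the first (title, body) match.

    custom_pattern: optional extractor built from a user-supplied CSS selector.
    fallbacks: ordered extractors, e.g. site patterns, hNews, PHP Readability.
    """
    if custom_pattern is not None:
        # A custom extraction pattern is final: if it does not match,
        # no further attempt is made.
        return custom_pattern(page)
    for extractor in fallbacks:
        result = extractor(page)
        if result is not None:
            return result  # successful title and body match, stop here
    return None
```

Treating the custom pattern as a separate argument, rather than as the first item in the list, keeps the one exception to the fall-through rule explicit.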

Sound useful?

Full-Text RSS 2.7 is licensed under the AGPL and available to try or buy at fivefilters.org/content-only/.


10 Comments

  1. Posted 21 April 2011 at 6:07 pm | Permalink

    Hi,

    I just wanted to let you know that I’ve installed full-text-rss-2.7 and it’s fantastic. We’re building a news analysis system and we had been looking for a tool to retrieve full articles from RSS feeds for quite a while. Your program saved us a lot of time!

    Thanks

    PS: pretty good idea to sell it instead of just asking for contributions. You deserve it!

  2. Posted 21 April 2011 at 6:19 pm | Permalink

    Hi Sylvain,

    That’s great to hear. Do let me know if you have any trouble with it.

    Cheers,

    Keyvan

  3. Leo
    Posted 5 May 2011 at 1:15 am | Permalink

    Does version 2.7 convert feeds to full feeds faster than version 2.5??

  4. Posted 5 May 2011 at 1:30 am | Permalink

    Leo: version 2.6 included better support for parallel fetching. So it might improve performance if your server did not offer parallel fetching with 2.5 – to be sure you’ll have to download the free compatibility checker. 2.6 also improved performance for processing single page URLs (i.e. not feeds). If either of these affects you, you’ll notice a speed increase.

    The full changelog is here: http://fivefilters.org/content-only/changelog.txt

  5. Posted 24 June 2011 at 12:17 pm | Permalink

    Hi Leo,
    Yes, version 2.7 converts feeds to full feeds faster than version 2.5, you should try it!

  6. Raihan
    Posted 8 July 2011 at 12:58 pm | Permalink

    does ur script have any option to extract defined or selected content from this tag to that tag? i use 2.7 now. and thanks a lot for creating this script.

    in this http://rss.bdnews24.com/rss/english/home/rss.xml rss i want to remove last paragraph.

  7. Posted 8 July 2011 at 6:17 pm | Permalink

    Raihan: You should be able to use XPath to remove the last paragraph element inside the parent element.

  8. Raihan
    Posted 8 July 2011 at 7:16 pm | Permalink

    thanks for your reply
    i am novice…dont know xpath.
    in yahoo pipes i set start point and end point.in ur script in which file which section i need to edit…?
    and then its auto select that portion and give output.

  9. Posted 4 August 2011 at 3:39 am | Permalink

    Does five filter french language with the accents? thanks

  10. Posted 4 August 2011 at 12:22 pm | Permalink

    Raihan: Sorry for the late reply. With Full-Text RSS you will have to use XPath to specify which element should be removed. E.g. strip: //div[@id='content']/p[last()] will remove the last paragraph within the div element whose id is ‘content’ – you should include this line in the appropriate site config file. If you’re extracting from a page on example.org, the site config file will be named example.org.txt. You’ll find details in the user guide.

    Jean: If you mean does the script preserve accents, then yes. You can test it for yourself on http://fivefilters.org/content-only/ – give it a URL to a French article and check the results.
