Ubuntu insights, Programming in groovy, java, et als!

Friday, March 02, 2012

Find RSS Feed URL of a Webpage

Given the URL of a web page, one can programmatically search the link tags in the page's markup for alternate URLs (such as its Atom or RSS feed links) and then use those feeds to parse and process the page's content. This is typically how a feed reader like Google Reader discovers feeds. Here I present a very simple implementation of the same in Pharo Smalltalk.


Object subclass: #RSSReader
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'VamsiExperiments'


getURLContent: url
    "Answer the content of the web page at url as a String.
     url is the address of the page as a String,
     for example: http://nerdysermons.blogspot.in"
    | urlContent |
    urlContent := url asUrl retrieveContents contents asString.
    ^ urlContent



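findAlternateLinksInUrlContent: urlContent
    "Answer an OrderedCollection of the alternate links found in
     urlContent, the page content fetched by getURLContent:.
     Only links containing 'http://' are picked up."
    | links link |
    links := OrderedCollection new.
    urlContent linesDo: [:line |
        (line findString: '<link rel="alternate"') > 0
            ifTrue: [
                "Split the line on double quotes and take the piece holding the URL"
                link := line findTokens: '"' includes: 'http://'.
                "Skip lines with no http:// token, so that nil is never added"
                link ifNotNil: [links add: link]]].
    ^ links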
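
To try the two methods together, something like the following can be evaluated in a Workspace. This is only a minimal sketch: it assumes both methods are compiled on the instance side of RSSReader and that the machine has network access. The variable names (reader, content, links) are mine and purely illustrative, and the URL is the example from the method comment.

```smalltalk
"Workspace sketch: fetch a page and list the alternate (feed) links found in it."
| reader content links |
reader := RSSReader new.

"Fetch the raw page content as a String"
content := reader getURLContent: 'http://nerdysermons.blogspot.in'.

"Collect the URLs from the <link rel=""alternate"" ...> lines"
links := reader findAlternateLinksInUrlContent: content.

"Print each entry on its own Transcript line; printString is used
 so that any non-String entry is still displayed safely"
links do: [:each | Transcript show: each printString; cr].
```

Note that the search works line by line, so it only finds a feed link when the link tag and its URL sit on the same line of the page source.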
