One programming podcast that I subscribe to on my Zune is PHP Abstract. This episode was put together by Matthew Turland and discusses screen scraping. This is kind of a “black art” but interesting nonetheless. He covers retrieving a page’s source code in PHP both with a simple function call and with a POST request. He also goes into authentication and cookie setting.
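To make the retrieval step concrete, here is a rough sketch of what a GET, a POST and a cookie header can look like in plain PHP. This is my own illustration, not code from the podcast: the helper names build_request_options and fetch_source are hypothetical, and it assumes allow_url_fopen is enabled so that file_get_contents can open URLs.

```php
<?php
// Hypothetical helper (not from the podcast): builds the stream-context
// options for a GET request, or for a form POST when $fields is given.
function build_request_options(?array $fields = null, string $cookie = ''): array
{
    $http = ['method' => 'GET'];
    $headers = [];

    if ($fields !== null) {
        // Submit the fields the way a browser submits an HTML form.
        $http['method'] = 'POST';
        $http['content'] = http_build_query($fields);
        $headers[] = 'Content-Type: application/x-www-form-urlencoded';
    }
    if ($cookie !== '') {
        // Send back a cookie (for example a session id) from an earlier response.
        $headers[] = 'Cookie: ' . $cookie;
    }
    if ($headers) {
        $http['header'] = implode("\r\n", $headers);
    }
    return ['http' => $http];
}

// Hypothetical helper: returns the page source as a string, or false on failure.
// Assumes allow_url_fopen is enabled in php.ini.
function fetch_source(string $url, ?array $fields = null, string $cookie = '')
{
    $context = stream_context_create(build_request_options($fields, $cookie));
    return file_get_contents($url, false, $context);
}
```

Calling fetch_source with only a URL does a plain GET. Passing an array of form fields turns it into a POST, and the cookie string is where a session cookie from a login response would go. For anything heavier, the cURL extension is probably a better fit than stream contexts.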
In this podcast Matthew recommends using the Web Developer Toolbar for Firefox to get form information, etc. If you are an active web developer and don’t already have this free plugin, I recommend you get it. Next to SEO Quake it’s probably one of my favorites, and it is definitely handy if you are going to do a scrape.
He also discusses the second part of screen scraping, which is data analysis. Since most pages don’t have well-formed HTML, he offers some suggestions for cleaning up page markup (text normalization) with PHP Tidy.
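As a rough idea of what that cleanup step can look like, the sketch below repairs the markup and then pulls the link targets out of it. It is only an illustration under a few assumptions: I am taking “PHP Tidy” to mean PHP’s tidy extension, the helper name extract_links is hypothetical, and parsing with DOMDocument is my own choice rather than something taken from the podcast.

```php
<?php
// Hypothetical helper (not from the podcast): normalize messy markup,
// then collect the href of every link on the page.
function extract_links(string $html): array
{
    // Assumption: the tidy extension is installed. If it is not, skip the
    // repair step and let DOMDocument cope with the markup on its own.
    if (function_exists('tidy_repair_string')) {
        $repaired = tidy_repair_string($html, ['output-xhtml' => true, 'wrap' => 0], 'utf8');
        if (is_string($repaired) && $repaired !== '') {
            $html = $repaired;
        }
    }

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}
```

The point of repairing first is that the parsing code afterwards can assume closed tags and quoted attributes instead of guessing at whatever the original page author left behind.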
As I mentioned earlier, this is sometimes considered “Black Hat,” but we do have at least one client I know of that uses this approach in ColdFusion for legitimate reasons. This is very cool technology that I haven’t actually used outside of this personal website. My photo gallery is actually a scrape of my Picasa Public Photo Gallery. The script was written by someone else and contributed as open source, but it required some heavy manipulation to work in my blog, since the blog is heavily customized.
PHP Abstract Episode 22: Screen Scraping
Questions or Comments?