Thursday, June 28, 2007

HTML scraping, contd.

As soon as HBO changes their site layout, my scraper stops working. Here is the updated code:
scrape.awk:

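# Split each line on "<" or ">" so tags and text land in separate fields.
BEGIN { FS = "<|>"; OFS = "\t"; namecount = 0; timecount = 0 }

# Title line: $5 is the text that follows the second tag on the line.
/<dt class="title"/ { namecount += 1; titles[namecount] = $5 }
# Time line: join the text after the first tag ($3) and after the second ($5).
/<dd class="time"/ { timecount += 1; times[timecount] = $3 $5 }

# Print each time with its title, tab-separated, in page order.
END {
    for (i = 1; i <= namecount; i++) {
        print times[i], titles[i]
    }
}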
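
To see why $3 and $5 are the fields that matter, here is a small sketch that runs the same rules on two made-up listings. The markup in `sample` is only my guess at the kind of line the script matches (the real HBO page is not reproduced here), and the `scrape` function name and the `/movie/...` links are invented for the demo.

```shell
#!/bin/sh
# Hypothetical sample markup: an assumption, not the actual HBO page source.
sample='<dt class="title"><a href="/movie/1">Beverly Hills Cop</a></dt>
<dd class="time">1:15<span>AM</span></dd>
<dt class="title"><a href="/movie/2">The Killing Fields</a></dt>
<dd class="time">3:45<span>AM</span></dd>'

# Same rules as scrape.awk, inlined so the demo is self-contained.
scrape() {
  awk '
    BEGIN { FS = "<|>"; OFS = "\t" }
    # Fields: $1 empty, $2 first tag, $3 text, $4 second tag, $5 text
    /<dt class="title"/ { titles[++n] = $5 }
    /<dd class="time"/  { times[++t] = $3 $5 }
    END { for (i = 1; i <= n; i++) print times[i], titles[i] }
  '
}

printf '%s\n' "$sample" | scrape
```

With this sample, the title text sits after the second tag ($5), and the time is split around the span tag ($3 and $5), which is why the script glues those two together. If the site nests one more tag, the field numbers shift and the scrape breaks again.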


And this gives:


$ ./fetch.sh
1:15AM Beverly Hills Cop
3:15AM Tales From The Crypt 29: House Of Horror
3:45AM The Killing Fields
6:30AM A Very Long Engagement
9:15AM A Perfect Murder
11:30AM Family Sins
1:15PM Cyber Seduction : His Secret Life
3:00PM The Adventures Of Shark Boy And Lava Girl
5:00PM Blue Thunder
7:15PM Little Nicky
9:00PM Final Fantasy: The Spirits Within
11:15PM Monster-In-Law


The scraping code would not need to change if there were a way to automagically rip out the needed information, but that is too much AI for me.
BTW, 'Cyber Seduction' looks interesting ;-)
