Scraping is also just one step in gathering information and building the information commons. So is it a last resort, or a first step towards building a more solid information foundation?
One of our online management-types is opposed to screen-scraping. What do I tell him?
He says scraping is unreliable, and creates maintenance headaches down the road. He says scraping has the ability to stop working at a moment’s notice.
Some of this is true, but scraping is a legit tool, am I right? Or have I just been hacking away to get the information I need for so long…
5 Answers
Tell him that scraping is slightly more reliable than waiting for the data you want to fall out of the sky into your lap in a usable format.
Scraping is basically a last-resort method of collecting data that you can't get any other way. It's a last resort because it is unreliable. Ask him to recommend another method of getting the data you want. If he can't, then scraping is justified.
Scraping is unreliable, but if it's the only way to get your data, then it's certainly legit.
Also, is the value you get from the data you're scraping worth the effort of adjusting some scripts if the targets change their page structure?
Scraping is definitely a legit tool but, as others have noted, one of last resort.
When we switched CMS and our original vendor refused to hand over backups of the databases, we used scraping to capture our data ourselves, and I've heard of many other similar cases. Of course, if you're not looking at your own data but at government data or some other third party's, it's also useful and sometimes the only route to get what you're after.
I would, however, pay close attention and do frequent spot checks to make sure it's actually quality data.
I actually used a scraper to do some analytics work I needed done on a site I ran, and things worked great for 3 months. Then my numbers turned all wacky: about 80% were correct, and 20% were off one way or another. A minor site design tweak had completely thrown it off, and I ended up canning the scraper completely in favor of APIs.
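For what it's worth, the spot checks can be partly automated. Here's a rough sketch in Python of a scraper that refuses to return numbers when the page no longer looks the way it expects. Everything specific here is made up for illustration: the `<td class="count">` markup, the `scrape_counts` name, and the idea that you know how many rows to expect. It won't catch every kind of breakage, only the obvious structural ones.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collect the text of every <td class="count"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "count") in attrs:
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def scrape_counts(html, expected_rows):
    """Return the scraped integers, or raise if the page no longer looks right."""
    parser = CellParser()
    parser.feed(html)
    cells = [c.strip() for c in parser.cells]
    # Check 1: a redesign usually changes how many cells we find.
    if len(cells) != expected_rows:
        raise ValueError(f"expected {expected_rows} cells, found {len(cells)}")
    # Check 2: every cell must still be a plain non-negative integer.
    bad = [c for c in cells if not c.isdigit()]
    if bad:
        raise ValueError(f"non-numeric cells: {bad}")
    return [int(c) for c in cells]
```

With a page containing two such cells holding 12 and 7, `scrape_counts(page, 2)` returns `[12, 7]`. If a redesign renames the class, it raises `ValueError` instead of quietly handing back wrong numbers, which is the failure I'd have wanted to see in month four.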
It's been my experience that if a site doesn't have the resources to publish its content in a data-only format, e.g., an RSS feed or an API, it doesn't have the resources to change its pages that often either.
You could offer that to the online management type.

This did happen to me once. It was a good day.