How Not to Extract Data From the Web

By Pluralsight    |    July 21, 2015

Ever needed to extract data from the web for one project or another, and wondered what tool to use? Import.io is really handy when it comes to doing just that. There are lots of applications you can use it for, like price comparison, data journalism, data visualization, and point-of-interest data, and being able to insert thousands of URLs and extract all of the data from those web pages can be invaluable!

But what’s the problem? Well, the web is a massive place, and it’s easy to run into issues while extracting data. So today I’m going to give you a foolproof guide to the do’s and don’ts of extracting data from the web. Let’s dive in!

Crawl the Right Way

The Crawler is the most popular tool at import.io: it can extract lots of data from websites with very little training. Give it a minimum of five example pages and it learns what your data looks like, then goes through thousands of pages extracting whatever matches.

While this is really great, you can imagine that a website might have tens of thousands of pages, so what if your data sits on only 500 of them? By default, we set the page depth at 10, based on the average number of clicks away from the start pages, and our Crawler visits every link on each page. That means it follows links up to 10 clicks deep from every start page, which can queue up lots of pages that may or may not contain the data you want. Setting a slightly lower page depth (I recommend 5 at most) and pasting in more “Where to start?” URLs is a far more efficient way of crawling a website. And don’t forget, you can paste in as many “Where to start?” URLs as you want.
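To make that trade-off concrete, here’s a minimal sketch of a depth-limited, breadth-first crawler. This is not import.io’s actual implementation, just an illustration of why depth 5 with several start URLs beats depth 10 with one; it assumes the requests and beautifulsoup4 packages are installed:

```python
# Illustrative only: a breadth-first crawl that stops following links
# past max_depth. Not import.io's internals.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_urls, max_depth=5):
    """Visit pages breadth-first, never following links past max_depth."""
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        resp = requests.get(url, timeout=10)
        visited.append(url)
        if depth >= max_depth:
            continue  # fetch the page, but don't follow its links any deeper
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Because every extra level of depth multiplies the queue, starting closer to the data (more start URLs, lower depth) fetches far fewer irrelevant pages.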

Do Use Chained APIs

Now you have the ability to chain two APIs together. This is particularly useful for product lists and product pages, where the majority of your data sits on the product pages themselves. Chaining means that if API “A” has a list of links to pages that contain your data, you can chain it to API “B”, an API that extracts the data from the pages API “A” links to. What does this mean for your bottom line? Incredibly fast and accurate data extraction that can be repeated whenever you need it.
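Here’s a hedged sketch of the chaining idea in plain Python. The URL and the CSS selectors are hypothetical placeholders, not a real site or import.io’s API:

```python
# Sketch of chaining: "A" yields product-page links, "B" extracts fields
# from each of those pages. Selectors and URL are made-up placeholders.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def api_a(list_url):
    """Return the product-page links found on a listing page."""
    soup = BeautifulSoup(requests.get(list_url, timeout=10).text, "html.parser")
    return [urljoin(list_url, a["href"]) for a in soup.select("a.product-link")]

def api_b(product_url):
    """Extract the fields that live on an individual product page."""
    soup = BeautifulSoup(requests.get(product_url, timeout=10).text, "html.parser")
    return {
        "name": soup.select_one("h1.product-name").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }

# Chain them: every link "A" finds is fed straight into "B".
rows = [api_b(url) for url in api_a("https://example.com/products")]
```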

Do Use XPaths

XPaths are pieces of code that navigate to a certain part of a page’s HTML. They’re great because they let you look up a certain word on a web page and extract data relative to its position. Put simply, if your data moves around on the page, you can use a manual XPath to anchor the extraction to a word or phrase. There are also some really good XPath tutorials online, including this one from W3Schools.
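To show what anchoring to a word looks like, here’s a small sketch using the lxml library (my choice for the example, not necessarily what import.io runs). It finds the cell labelled “Price” and grabs the value next to it, so the extraction survives rows being reordered:

```python
# Anchor to the text "Price" rather than a fixed row index, so the
# XPath still works if the table rows move around.
from lxml import html

page = html.fromstring("""
<table>
  <tr><td>Colour</td><td>Blue</td></tr>
  <tr><td>Price</td><td>$19.99</td></tr>
</table>
""")

price = page.xpath('//td[text()="Price"]/following-sibling::td/text()')[0]
print(price)  # $19.99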

Don’t Use JavaScript (Where Possible)

JavaScript puts quite a large load on a web page and can push it over our per-page processing time limit, resulting in a publish failure. This usually means you end up with a static snapshot of the data instead of a nice, refreshed flow of data over time.
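A quick way to check whether you actually need JavaScript at all: fetch the raw HTML, with no JS executed, and see whether the value you want is already there. The URL and search string below are hypothetical:

```python
# If the value appears in the raw HTML, no JavaScript rendering is needed.
import requests

raw = requests.get("https://example.com/product/123", timeout=10).text
if "$19.99" in raw:
    print("Data is in the static HTML, so no JavaScript is needed.")
else:
    print("Value missing from raw HTML; it's probably rendered client-side.")
```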

Don’t Ignore the Webmaster’s Wishes

When extracting data, the webmaster’s decision is final. If they’re unhappy about you scraping their website, then so be it. That’s why we always advise our users to reference the source of the data wherever they use it: it can drive traffic to the webmaster’s site while you get to pull the data, which is a great outcome for both parties.
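A webmaster’s wishes are usually spelled out in the site’s robots.txt file, and Python’s standard library can check it for you. A minimal sketch, using a made-up URL and user-agent string:

```python
# Check robots.txt before fetching a page; skip anything disallowed.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/products/123"
if rp.can_fetch("my-crawler", url):
    print("Allowed to fetch", url)
else:
    print("Disallowed by robots.txt; respect the webmaster and skip it.")
```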

With these guidelines in mind, you should be extracting data and crawling like a pro. And if you’re a little foggy on any of the terms I covered, be sure to check out our Code School screencast on import.io.

About the author

Pluralsight is the technology skills platform. We enable individuals and teams to grow their skills, accelerate their careers and create the future.