Scraping Webmaster Tools with FMiner

Screen Scraping Webmaster Tools!

The biggest problem I have with Google Webmaster Tools (after the problems with its data quality) is that you can’t export all the data for external analysis. Luckily, the guys behind the FMiner.com web scraping tool contacted me a few weeks ago and asked me to test it. Web-based scrapers don’t work with Webmaster Tools, and the other screen scraping tools I tried did not cope well with the steps you need to take to reach the data inside Webmaster Tools. FMiner is available for Windows and Mac OS X.

FMiner is a classic screen scraping app that runs on your desktop, which is necessary because it has to emulate real browser behaviour. No coding is required, and the visual interface makes it possible to start scraping within minutes. Another feature I like is the option to upload a set of keywords, for example to scrape internal search engine result pages, something that is missing in a lot of other tools. If you need to scrape a lot of accounts, the tool offers multi-browser crawling, which cuts the time needed.
The tool can be used for a lot of scraping jobs, including Google SERPs, Facebook Graph Search, downloading files and images, and collecting e-mail addresses. For the really heavy scrapers there is also a built-in captcha-solving API, so passing captchas while scraping is no problem.

Below you can find an introduction to the tool, with one of their tutorial videos about scraping IMDB.com:

Cheatsheet: managing search robot behaviour

Search Robot Management Cheatsheet

There have been many discussions about the differences between crawling, indexing and caching. The behaviour of search engine robots can be controlled in many ways, and because of all the different options I often end up in discussions where I have to clarify my point of view over and over again. So, to make sure everyone is clear on how to control the crawling and indexing behaviour of the major search engines (Google, Bing, Yandex and Baidu), remember the following table, or print it and hang it next to your screen to win the next discussion with your fellow SEOs 🙂
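To make the crawling side of that distinction concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_crawlable` helper and the example.com URLs are made up for illustration. It only answers the question "may this robot crawl this URL?"; whether a page is indexed is controlled separately, for example with a noindex robots meta tag, which a robot can only see if it is allowed to crawl the page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to crawl url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/public.html", "/private/page.html"):
        url = "https://www.example.com" + path
        print(path, is_crawlable(ROBOTS_TXT, "Googlebot", url))
```

Note that the standard library parser follows the original robots.txt conventions and may not match every search engine's own extensions, so treat the result as an approximation.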

GWT Hack: Fetching external websites without verifying

For a quick scan of a random website, you can’t use the standard Fetch as Googlebot functionality without verifying the domain first. This kind of check is becoming more and more important now that Google also looks at hidden layers of content, ads, etcetera, and you may want to test the smartphone robot too. Luckily, there is a simple workaround.

Create a simple, clean HTML file on a domain you already own. In my case, I use notprovided.eu, which is verified in my Google Webmaster Tools account. In this HTML file you add an iframe or embed element containing the URL you want to test with the Fetch as Google function.

Within WMT, fetch the URL of the file that contains the iframe or embed element. This will show the external website as parsed by the selected Googlebot. Make sure the file is not blocked from crawling by robots.txt; to keep it out of the index, add a noindex tag instead. If you want to fetch an HTTPS URL, you also need to have an HTTPS domain verified in WMT before you can include such a URL via an iframe or embed code.
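The wrapper file described above can be generated with a few lines of Python. This is only a sketch: the function name `build_fetch_wrapper`, the iframe dimensions and the example.com target are my own placeholders, not something Webmaster Tools requires. The output is a minimal HTML page with a noindex robots meta tag and an iframe pointing at the URL you want to test.

```python
from html import escape


def build_fetch_wrapper(target_url: str, title: str = "Fetch test") -> str:
    """Return a minimal HTML page that embeds target_url in an iframe."""
    # Escape the values so characters like & or " cannot break the markup.
    safe_url = escape(target_url, quote=True)
    safe_title = escape(title)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        f"  <title>{safe_title}</title>\n"
        # Keep the wrapper page itself out of the index.
        '  <meta name="robots" content="noindex">\n'
        "</head>\n"
        "<body>\n"
        # Arbitrary size; adjust to the viewport you want to inspect.
        f'  <iframe src="{safe_url}" width="1024" height="768"></iframe>\n'
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical target; replace it with the external URL to test.
    print(build_fetch_wrapper("http://www.example.com/"))
```

Save the output as an HTML file on your verified domain, check that robots.txt does not block it, and fetch that file's URL in WMT.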
