Scraping Webmaster Tools with FMiner


The biggest problem I have with Google Webmaster Tools (after its data quality) is that you can't export all of the data for external analysis. Luckily, the people behind the FMiner.com web scraping tool contacted me a few weeks ago and asked me to test it. The problem with Webmaster Tools is that you can't use web-based scrapers on it, and the other screen scraping packages handle the steps needed to get to the data within Webmaster Tools poorly. FMiner is available for both Windows and Mac OS X.

FMiner is a classic screen scraping app that you install on your desktop, which is exactly what you need here: scraping Webmaster Tools requires emulating real browser behaviour. No coding is required, and the visual interface makes it possible to start scraping within minutes. Another feature I like is the ability to upload a set of keywords, for example to scrape internal search engine result pages, something that is missing in a lot of other tools. If you need to scrape a lot of accounts, the tool's multi-browser crawling decreases the time needed.
The tool can be used for a lot of scraping jobs, including Google SERPs, Facebook Graph Search, downloading files and images, and collecting e-mail addresses. And for the real heavy scrapers, there is also a built-in captcha solving API, so passing captchas while scraping is no problem.
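FMiner itself is visual and needs no code, but to make the keyword-upload idea concrete, here is a minimal Python sketch of what such a keyword-driven scrape boils down to; the URL, query parameter and CSS selector are hypothetical placeholders, not anything FMiner-specific:

```python
# Minimal sketch of what a keyword-driven scrape does under the hood.
# FMiner automates this visually; the URL pattern, parameter name and
# CSS selector below are hypothetical and depend on the target site.
import requests
from bs4 import BeautifulSoup

keywords = ["red shoes", "blue sneakers"]  # e.g. loaded from an uploaded keyword list

for keyword in keywords:
    # Request the site's internal search result page for each keyword
    response = requests.get("https://www.example.com/search", params={"q": keyword})
    soup = BeautifulSoup(response.text, "html.parser")
    # Extract the result titles; adjust the selector to the site's markup
    for result in soup.select("h3.result-title a"):
        print(keyword, "->", result.get_text(strip=True), result.get("href"))
```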

Below you can find an introduction to the tool, with one of their tutorial videos about scraping IMDB.com.

Cheatsheet: managing search robot behaviour


Many discussions take place about the differences between crawling, indexing and caching. The behaviour of search engine robots can be controlled in many ways, and because of all the different options I keep having the same discussions and have to clarify my point of view over and over again. So, to make sure everyone is clear on how you can control the crawling and indexing behaviour of the major search engines (Google, Bing, Yandex and Baidu), memorise the following table, or print it and hang it next to your screen to win the next discussion with your fellow SEOs 🙂
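To give a taste of what the cheatsheet covers, here are the three main control mechanisms in their raw form; the paths and directives are just examples. robots.txt controls crawling, while the meta robots tag and the X-Robots-Tag header control indexing and caching:

```
# robots.txt controls CRAWLING: a disallowed URL is never fetched,
# although it can still end up in the index if other sites link to it
User-agent: *
Disallow: /private/

<!-- The meta robots tag controls INDEXING and CACHING: the page must be
     crawlable, otherwise the robot never sees this instruction -->
<meta name="robots" content="noindex, noarchive">

# The X-Robots-Tag HTTP response header takes the same directives as the
# meta tag, but also works for non-HTML files such as PDFs and images
X-Robots-Tag: noindex, noarchive
```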

GWT Hack: Fetching external websites without verifying

For a simple quickscan of a random website, you can't use the standard Fetch as Googlebot functionality, because it requires you to verify the domain first. Such scans are becoming more and more important now that Google also looks at hidden layers of content, ads et cetera, and you want to test the smartphone robot too. Fortunately, you can make use of a simple workaround.

Create a simple, clean HTML file on a domain you already own. In my case I use notprovided.eu, which is verified in my Google Webmaster Tools account. Within this HTML file you can easily add an iframe or embed element containing the URL you want to test with the Fetch as Google function.

Within WMT you fetch the URL of the file that includes the iframe or embed section. This will show the external website, parsed by the selected Googlebot. Make sure the file itself is not blocked from crawling by robots.txt, and add a noindex tag instead so the wrapper page doesn't end up in the index. If you want to fetch an HTTPS URL, you also need a verified HTTPS domain in WMT to include such a URL via iframe or embed codes.
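A minimal version of such a wrapper file could look like this, where www.example.com stands in for the external URL you want to test:

```html
<!DOCTYPE html>
<html>
<head>
  <title>Fetch as Google wrapper</title>
  <!-- Keep the wrapper page itself out of Google's index -->
  <meta name="robots" content="noindex">
</head>
<body>
  <!-- The external URL you want to see through Googlebot's eyes -->
  <iframe src="https://www.example.com/" width="1200" height="900"></iframe>
</body>
</html>
```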

[Screenshot: Fetch as Google iframe example]

The danger of content marketing for SEO

Many articles have been written about content marketing in the past years, and online marketers are sometimes completely obsessed with it. Because of several projects I worked on during the past 12 months, and all those people using "Content is King" whenever possible, I would like to discuss some of the factors content marketing has an impact on. These factors are not commonly considered by everyone involved in online marketing campaigns.

Driven by the fast-paced development of Google, which has put more focus on the quality of content and websites through algorithmic filters like Panda, many websites have produced enormous amounts of textual content on their domains. A lot of websites still target specific queries with unique pages instead of targeting a specific group of people. Combined with the latest updates of Google's algorithms, Panda among them, this starts to cause problems. From a technical point of view, more content does not always influence the organic results positively.


The Future of Search – Race Expo Moscow 2014

Search engines like Yandex, Google and Baidu are changing rapidly these days. More and more, engineers try to imitate human behaviour and use user experience as a signal, as opposed to the old-fashioned link graph way of ranking pages in their indices. Not only have we seen interesting developments like Yandex Islands and Google's Knowledge Graph, but social is also starting to play a growing role in the technology behind search. Human search behaviour has changed considerably.

Together with the crowd, I'll explore the world behind search engines and try to create an understanding of how search engines work, how the future of search will be shaped, and how you as an affiliate marketer can make use of it.