Building the biggest knowledge graph, in layman’s terms an information database, of all time: that is what Google is researching at this moment. New details were presented at the annual Knowledge Discovery in Databases (KDD) conference in New York two weeks ago. Behind the scenes, corporations like Microsoft, Facebook, Amazon and IBM are developing similar systems. It’s a race towards the first database that archives all human knowledge.
The original paper: Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion – Link: http://bit.ly/knowledgevault
Google researchers presented Knowledge Vault, a system that automatically gathers and merges information into a database at a large scale. If you are familiar with SEO, you will have heard about and probably edited Freebase, one of the current sources Google uses to enrich the search engine result pages (SERPs) with information about persons, objects and companies. Google acquired Freebase in 2010 by buying Metaweb, the company behind it, since it is one of the biggest knowledge bases available.
Freebase is a collection of structured data gathered from individual sources like Wikipedia and maintained by community members, similar to the Wikipedia system. The difference with Wikipedia is the way the data is made accessible: the aim of Freebase is to create a global resource for common information, so it offers multiple APIs for using the data freely.
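To make "structured data" a bit more concrete: knowledge bases of this kind are commonly modelled as subject, predicate, object triples. The sketch below is only an illustration under that assumption. The entities, predicate names and helper functions are made up for this example and are not real Freebase identifiers or part of any Freebase API.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical example facts; not actual Freebase data or identifiers.
TRIPLES = [
    Triple("Ada Lovelace", "profession", "Mathematician"),
    Triple("Ada Lovelace", "nationality", "British"),
    Triple("Alan Turing", "profession", "Mathematician"),
]


def build_index(triples):
    """Group facts by subject so all facts about one entity can be looked up."""
    index = defaultdict(dict)
    for t in triples:
        index[t.subject][t.predicate] = t.obj
    return index


def missing(index, predicate):
    """Return the entities that have no value stored for the given predicate."""
    return sorted(s for s, facts in index.items() if predicate not in facts)


index = build_index(TRIPLES)
print(missing(index, "nationality"))  # entities without a nationality fact
```

Because every fact has the same shape, a question like "which people have no nationality assigned?" becomes a simple lookup instead of a text-parsing problem.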
Acquiring Freebase was one of the major steps for Google in creating a system that is able to understand relationships between entities. The problem Google has with Freebase is the integrity of the available data, so it is trying to develop a system that does not depend on external sources. With the help of advanced machine learning algorithms, the presented system is able to calculate the probability that a given piece of information is correct and is able to translate and merge new information at an enormous scale.
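To give a feeling for what "calculating the probability of correctness" can look like, here is a deliberately simplified sketch. It combines the confidence scores of several extractors for the same fact with a noisy-OR rule, which assumes the extractors are independent. This is a stand-in chosen for illustration, not the model described in the paper, and the facts, scores and threshold below are hypothetical.

```python
from math import prod


def fuse_noisy_or(confidences):
    """Probability that at least one extractor is right, assuming independence."""
    return 1.0 - prod(1.0 - c for c in confidences)


# Hypothetical candidate facts, each with the confidence reported by one or
# more extractors (for example a text extractor and a table extractor).
candidates = {
    ("Ada Lovelace", "nationality", "British"): [0.7, 0.8],
    ("Ada Lovelace", "nationality", "French"): [0.2],
}

THRESHOLD = 0.9  # assumed cut-off for accepting a fact

fused = {fact: fuse_noisy_or(scores) for fact, scores in candidates.items()}
accepted = [fact for fact, p in fused.items() if p >= THRESHOLD]

for fact, p in fused.items():
    print(fact, round(p, 2), "accepted" if p >= THRESHOLD else "rejected")
```

The idea to take away is that two moderately confident sources agreeing on a fact can push it over the threshold, while a single weak source cannot.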
Diving into the technical details of the presented system requires knowledge of information retrieval, natural language processing and machine learning. If you find these topics interesting, have a look at Coursera.org, where you can take introductory courses on these topics. A workshop about Constructing and Mining Web-scale Knowledge Graphs can be found at http://bit.ly/kv-workshop
One of the less complicated methods Google uses to gather and verify information is crowdsourcing. The paper references work by fellow Google researchers describing systems like the Feedback button below the current knowledge graph cards, or Quizz, a gamified system that assesses people’s knowledge.
Why does Google need such a system?
Since the introduction of the knowledge graph cards in the SERPs, Google has gathered lots of feedback on the information shown. Over the past months, I have read a lot of positive comments about these additions.
With the introduction of Hummingbird, a new way of processing queries and content, it has become even more important for Google to be able to recognize entities, with a focus on people, places, companies and products. The quality of the Freebase data is not sufficient for this: 75% of the people in it have no nationality assigned to their name. At this moment the Knowledge Vault has extracted 302 million facts with a correctness probability of 0.9 (0 being false, 1 being true), compared to 637 million facts in Freebase. Only 223 million facts overlap, which would mean that only 35% of the facts stored in Freebase (223 of 637 million) are probably correct.
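The percentage follows directly from the three numbers quoted above. The short calculation below only redoes that arithmetic; the variable names are my own.

```python
# Figures as quoted in this article.
KNOWLEDGE_VAULT_FACTS = 302_000_000  # facts with a correctness probability of 0.9
FREEBASE_FACTS = 637_000_000
OVERLAP = 223_000_000                # facts present in both

# Share of Freebase facts that also appear in the Knowledge Vault.
share_of_freebase = OVERLAP / FREEBASE_FACTS

# Share of Knowledge Vault facts that were already in Freebase.
share_of_vault = OVERLAP / KNOWLEDGE_VAULT_FACTS

# High-confidence facts that Freebase does not contain.
new_facts = KNOWLEDGE_VAULT_FACTS - OVERLAP

print(f"Freebase facts also in Knowledge Vault: {share_of_freebase:.0%}")
print(f"Knowledge Vault facts already in Freebase: {share_of_vault:.0%}")
print(f"Facts not in Freebase: {new_facts:,}")
```

Seen from the other side, the same numbers imply that 79 million of the high-confidence facts are not in Freebase at all.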
With personalised search becoming more useful and Artificial Intelligence systems being developed, predictive algorithms based on verified facts will work faster and better than those based on unvalidated information. Virtual assistants like Google Now will be capable of answering far more questions. Exploratory search (example queries: “Brighton sightseeing”, “Things to do in London”) will be more accurate, and enriched results like the carousel will be available for more places once Google has more information available. Currently these results only show up for the bigger cities.
While I understand Google’s need for information, and we as users will greatly benefit from it, I don’t like the fact that this system will not be open source. Freebase has always been freely available, for both commercial and non-commercial use. For SEO purposes, this will not have ... From a scientific point of view, I’m really enthusiastic about the system as described in the paper. From an online marketing perspective, I would have been happier if a company other than Google were in the lead of the race towards a database containing all human knowledge!
Do you want to know more about the history and future of the semantic web, structured data and the implications for online marketing? Join me during the session The Semantic Web & Structured Data, a Journey Into the Unknown.