Diffbot | Data Graphs, Extraction, Web Scraping & Crawling

Every now and then it’s worth getting back to Diffbot basics and asking a question so obvious that it hardly seems worth asking. Sometimes those questions have hidden depths. “What Is Product Data?” is one I recently asked myself, and it led me down a mini rabbit hole. The basic definition of a product is a thing produced by labor, such as farm or factory goods.

More broadly, a product can be the result of thought, time, or space, and that covers a lot of ground. By definition, a product can be almost anything you can imagine: any item on the supermarket shelf, an eBook, a house, or even just a theory. Diffbot pares that down from almost anything in the world to a useful data definition.

So, what is that useful definition of a product in the context of data? According to Diffbot: “A product is a physical or digital good, which has the attributes of existing, having a name, being tradable.” Beyond that, all bets are off! Framed as data, the universal attributes of a product are therefore data attributes such as Identifier and Price.

But there is, obviously, more to most product data than that. So, how do you define a set of attributes (a taxonomy) that is useful and describes all products as data? Diffbot’s team has come up with a couple of approaches to defining a product as data in detail, and they generally give thorough answers to that question. That aside, let’s discuss Diffbot itself.

What Diffbot Is All About

Diffbot is a web-based software platform that provides a suite of products built to turn unstructured data from across the web into structured, contextual databases. Its products are built on machine vision and natural language processing software, a toolkit that’s able to parse billions of web pages every day.

Diffbot provides solid APIs, technical resources, and overall service. In our experience, its extraction results are highly accurate, and its team keeps the APIs up to date with social media’s rapid evolution. Their customer support is also helpful, timely, and friendly.

They are also willing to work with flexible scenarios and accommodate small research groups on low budgets, and they provide demo and trial accounts for new customers to experiment with. Overall, in our experience of over a decade, they are the best social media data provider and analysis company we have worked with.

You can use it for online media operations like:
  • Live news monitoring of higher education entities
  • Pulling of trends for data journalism projects
  • Product price fluctuations for the purposes of placing affiliate links

Prior to using Diffbot, we relied primarily on either an RSS feed or a web scraping tool that worked from the visual layout and HTML of a webpage, which left us very dependent on XPaths to get the data we wanted. We find that the Diffbot crawlers are more stable in the long term because they are not as affected by website design changes.
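To make the XPath maintenance problem concrete, here is a minimal sketch using only Python’s standard library. The two page snippets and the price path are hypothetical, invented for illustration; they are not taken from any real site or from Diffbot.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the same product page before and after a redesign.
OLD_PAGE = "<html><body><div class='price'><span>19.99</span></div></body></html>"
NEW_PAGE = "<html><body><section><p class='price'>19.99</p></section></body></html>"

# A hand-written rule tied to the old layout (ElementTree's XPath subset).
PRICE_PATH = "./body/div[@class='price']/span"


def scrape_price(page):
    """Return the price text found by the fixed path, or None if it is missing."""
    root = ET.fromstring(page)
    return root.findtext(PRICE_PATH)


old_price = scrape_price(OLD_PAGE)  # '19.99'
new_price = scrape_price(NEW_PAGE)  # None: the redesign silently broke the rule
```

The price is still on the redesigned page, but the rule no longer finds it, and that is the kind of breakage that has to be fixed by hand in a rule-based scraper.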

Overall, this saves us a lot of time that we would otherwise spend on maintenance. Diffbot’s crawling product also has a relatively low barrier to entry: you can try it out to pull all sorts of data from competing sites. It also comes with a domain-specific language, Diffbot Query Language (DQL), which can come in handy (more on that later).

The Problems It’s Solving Plus Its Key Beneficial Features

The biggest problem that Diffbot helps solve is the amount of maintenance you have to do on scraped websites. On top of that, you can use Diffbot’s full-text capability and metadata. The metadata to use most is the language designation, which ensures that your clients see only articles in the languages they choose.
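As a rough illustration of filtering on the language designation, the sketch below keeps only articles whose language code is in an allowed set. The field name `humanLanguage` and the sample records are assumptions made for this example; check the actual response of the API you call before relying on them.

```python
def filter_by_language(articles, allowed_languages):
    """Keep only articles whose language code is in allowed_languages.

    Assumes each article is a dict with a 'humanLanguage' code such as 'en';
    articles without that field are dropped.
    """
    allowed = {code.lower() for code in allowed_languages}
    return [a for a in articles if a.get("humanLanguage", "").lower() in allowed]


# Hypothetical extracted articles, for illustration only.
sample_articles = [
    {"title": "Campus news", "humanLanguage": "en"},
    {"title": "Nouvelles du campus", "humanLanguage": "fr"},
    {"title": "Untitled"},  # no language metadata at all
]

english_only = filter_by_language(sample_articles, ["en"])
```

Dropping records that lack the field is a design choice made for this sketch; a real pipeline might prefer to route them to manual review instead.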

Secondly, we see great potential in using the bulk API to make the content ingest process more efficient, and we are excited to keep exploring this option. Our Web Tech Experts Taskforce has also been using the Article and Analyze APIs as a core part of our pipeline. What about a build-vs-buy comparison?

Our conclusion is that it is far preferable to leave this step to an external best-in-class solution rather than build (and, importantly, *maintain*) one in-house. Wherever the automated page structure analysis fails, our team can easily “teach” it the structure. And in the rare cases where that fails, the Diffbot team is very responsive in addressing issues.

Key Products:

Diffbot’s essential products support its mission to structure the world’s knowledge. With web-wide crawls, data extraction APIs, the ability to understand natural language documents, and the world’s most extensive Knowledge Graph, it turns the unstructured web into structured data. You can also enhance your own data with the Knowledge Graph.

Account-Based Marketing doesn’t work without fresh firmographic data. Enrich your lists with 50+ available fields from the Knowledge Graph, available in popular data visualization and automation platforms (Microsoft Excel, Google Sheets, Tableau, and Zapier) as well as through the REST API. With that in mind, let’s look at the key product features in detail.

1. The Knowledge Graph

It’s 2022, and a simple web search for real estate companies in Iceland results in paid ads and irrelevant how-to articles (seriously, give it a try). The problem is that the search engine you know and love is built for finding websites, not organizations, people, or news. Researching companies? Prospecting leads? Monitoring press mentions? That’s the KG’s job.

Diffbot’s Knowledge Graph (KG) is the world’s largest contextual database, comprising over 10 billion entities including organizations, people, products, articles, and more. Its scraping and fact-parsing technologies link entities into contextual databases, incorporating over 1 trillion “facts” from across the web in near real time.

Through the Knowledge Graph, you get long-tail person and organization data. Query the web for rich connected entities with 50x more articles than the Google News index. Simply put, the KG is comprehensive: products, people, corporations, and more are all linked together in a contextual way, with a user-friendly way to query almost all of it.

It can feel like you’ve scraped the whole web: no custom scraping rules, and no need to figure out the nuances of where information is housed online. Just query and see if what you’re looking for is on the public web. The standard export features are also good; you can easily export your results to either CSV or JSON.

2. Product Enhance Feature

Over 10 billion people, companies, products, articles, and discussions exist in the Diffbot Knowledge Graph — the largest in the world. If it’s something you can find on a website somewhere, you’ll find it (already clean and structured) in the KG. Results from the Knowledge Graph aren’t just a list of names either, they’re complete records with 50+ fields and properties.

Equally important, the Enhance product provides information about organizations and people you already hold some information on. Built on Knowledge Graph technology, Enhance lets users build robust data profiles for opportunities they already hold some data on, enriching person or organization records from the world’s largest Knowledge Graph.

Explore Featured Datasets:

Pull details sourced from across the web via AI into your knowledge workflows. Search and find linked news, organizations, and people without scraping or crawling. No credit card is required, and you get full API access. It’s simple: searching for companies? You get companies in the results. Searching for people? You get the idea.

Build accurate datasets of news, organizations, and people. Get relationship data sourced from the entire public web. Their AI reads the web nonstop so you don’t have to. Facts and relationships between entities are pulled out or inferred. Every entry in the KG provides detailed data provenance.

3. Enterprise-Built Crawling Bots

Turn any website into a structured database of all its products, articles, and discussions in minutes, and extract at the scale of the web. You can even schedule a demo if you want to see it in action first. The web spidering tool can crawl 50k sites or 50k URLs and apply extraction APIs to unlimited pages. Like Extract, Crawl requires no rules.

Simply point Crawl to a starting point on a website and it’ll spider through every link on that page and extract them all. Diffbot’s distributed, world-class crawling infrastructure processes millions of pages daily. Utilize their reserved fleet of proxy IPs.

Or optionally, upgrade to gain access to tens of thousands of unique IPs for truly diversified crawling plus region/country-specific extraction. Programmatically start crawls, check crawl statuses, and retrieve output using the Crawl API. Not sure where to start? Their solution experts are there to help you build a complete end-to-end integration.
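For a sense of what starting a crawl programmatically might involve, the sketch below only assembles the request data and sends nothing. The endpoint path and the field names (`token`, `name`, `seeds`, `apiUrl`) are assumptions based on our understanding of the Crawl API, and the job name and seed URL are made up; confirm all of them against the current documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the current Crawl API documentation.
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def build_crawl_job(token, name, seed_url):
    """Return the form fields for creating a crawl job (nothing is sent)."""
    return {
        "token": token,
        "name": name,                # hypothetical job name, chosen by you
        "seeds": seed_url,           # where the spider starts
        "apiUrl": "https://api.diffbot.com/v3/analyze",  # extraction API to apply
    }


def build_status_url(token, name):
    """Return a URL that would ask for the status of the named crawl job."""
    return CRAWL_ENDPOINT + "?" + urlencode({"token": token, "name": name})


job = build_crawl_job("YOUR_TOKEN", "shop-crawl", "https://example.com")
status_url = build_status_url("YOUR_TOKEN", "shop-crawl")
```

Under these assumptions, creating the job would be a POST of `job` to `CRAWL_ENDPOINT` with the HTTP client of your choice, and polling `status_url` would report progress.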

4. Data Structure Extraction APIs

Automatically extract content from a variety of websites. Scrape articles, product pages, discussions, and more without any rules. As you try extracting a web page, you’ll notice that Diffbot reads a site’s content much as a human would. As a human, you’re probably pretty good at telling a product page from a news article, right?

You’re probably also quite good at getting an idea of what a title says about the page you’re reading. But what if you need to do that 10,000 times a minute? You could hire a lot more humans, or you could let Diffbot read it for you. The Data Extraction APIs let you point Diffbot’s website parsing technologies at a predefined list of web properties.

See live-updating e-commerce listing info, find brand mention news, pull in discussion and review data from across many sites, and more. Extract structured data from any URL: the rule-less extraction APIs use machine vision to determine the relevant details to return from nearly any page online.

5. Natural Language Processing APIs

Create your own knowledge graphs from natural language. Diffbot’s Natural Language tool pulls structured entities, relationships, and sentiment from unstructured text. Detection accuracy and uptime are both very high: most of the time we can send API requests and know that the responses from Diffbot will be valid.

There is also a host of APIs for extracting data on different entity types. Use Diffbot Query Language (DQL) to find, filter, sort, and facet across billions of interlinked entities, either through the visual query builder or in pure DQL. API access is available for as long as you want it.

Need to tweak more advanced settings? Those are at your disposal as well. The REST API schema is simple and familiar, so the following lines are all you need to get started:

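# Python + Diffbot Extract
import requests

# TOKEN and URL are placeholders for your API token and the page to analyze
params = {'token': 'TOKEN', 'url': 'URL'}
# Passing params lets requests URL-encode the target page address
response = requests.get('https://api.diffbot.com/v3/analyze', params=params)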
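Once a response comes back, a natural next step is to see what kind of page was detected and what was pulled out of it. The sketch below assumes the JSON body carries an `objects` list whose items have `type` and `title` keys; that shape and the sample payload are assumptions for illustration, not a captured response.

```python
def summarize_objects(payload):
    """Return (type, title) pairs for every extracted object in a response body.

    Assumes the body has an 'objects' list; a missing list yields no rows.
    """
    summaries = []
    for obj in payload.get("objects", []):
        summaries.append((obj.get("type", "unknown"), obj.get("title", "")))
    return summaries


# Hypothetical response body, shaped the way we assume the API answers.
sample_payload = {
    "objects": [
        {"type": "product", "title": "Blue Mug", "offerPrice": "$12.50"},
        {"type": "article", "title": "How We Glaze Our Mugs"},
    ]
}

rows = summarize_objects(sample_payload)
```

With a live call, `payload` would be the decoded JSON body of the response, for example `response.json()` when using `requests`.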


Working in another language? No worries! Thanks to its basis in computer vision, Diffbot Extract works with any human language. Pair Extract with Crawl to automatically generate a database of all the products on a website or all the articles on a news site. You can schedule a demo to see it in action before you begin. Crawl + Extract = 🚀

Reduce your time to insight with no more manual research. Easily enrich or create datasets wherever your work lives, including Excel, Google Sheets, Tableau, and Power BI. No more wasting time tracking down the structure of your data: it has been done for you, and it is all a search away in the Knowledge Graph with the help of DQL.

What Is Diffbot Query Language (DQL)?

Diffbot Query Language (DQL) is a domain-specific language for querying entities housed in the world’s largest knowledge graph. At its simplest, you can think of DQL as Google search’s advanced search operators, but with many more query parameters and a much richer underlying data set. DQL supports a range of query parameters.

They include:
  • Summary view of a field of results (faceting)
  • Regex
  • Comparison operators
  • Geographical proximity operators
  • Sorting of results
  • Nested queries
  • A robust set of taxonomy fields for each entity type (orgs, people, products, articles, etc.)
  • Sentiment
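To give a feel for how a few of these parameters combine, here is a small sketch that assembles a DQL string and the URL that would carry it, without sending anything. The clause syntax (`type:`, `field:"value"`, `sortBy:`), the field names `location.country.name` and `nbEmployees`, and the endpoint and its parameters are all assumptions based on our reading of DQL; treat them as placeholders to verify against the documentation.

```python
from urllib.parse import urlencode

# Assumed Knowledge Graph query endpoint; verify before use.
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def build_dql(entity_type, filters, sort_by=None):
    """Join an entity type, exact-match filters, and an optional sort into one query."""
    clauses = ["type:" + entity_type]
    for field, value in filters.items():
        clauses.append(f'{field}:"{value}"')
    if sort_by:
        clauses.append("sortBy:" + sort_by)
    return " ".join(clauses)


def build_dql_url(token, query, size=10):
    """Return the GET URL for a query; urlencode escapes spaces and quotes."""
    params = {"type": "query", "token": token, "query": query, "size": size}
    return DQL_ENDPOINT + "?" + urlencode(params)


# Companies in Iceland, sorted by an assumed employee-count field.
query = build_dql(
    "Organization",
    {"location.country.name": "Iceland"},
    sort_by="nbEmployees",
)
url = build_dql_url("YOUR_TOKEN", query)
```

Building the query from a dict keeps the quoting in one place, which matters once values contain spaces.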

Additionally, the power of DQL comes from the underlying data it can query: the billions of entities and over a trillion facts contained in Diffbot’s Knowledge Graph. DQL wouldn’t be that useful in and of itself without the Knowledge Graph. Not sure where to start? Let the Diffbot solution experts craft a plan around your web scraping needs.

How Diffbot Web Scraping Toolkits Work

Web scraping is one of the best techniques for extracting important data from websites to use in your business or applications, but not all data is created equal, and not all web scraping tools can get you the data you need. Collecting data from the web isn’t necessarily the hard part. Essentially, web scraping relies on web crawlers: programs, or automated scripts, that collect various bits of data from different sources.

Any developer can build a relatively simple web scraper for their own use, and there are certainly companies that run their own web crawlers to gather data (a big one is Amazon). But the web scraping process isn’t always as straightforward as many would think.

For one thing, many factors can cause scrapers to break or become less efficient. So, while there are plenty of web crawlers out there that can get you some of the data you need, not all of them produce usable results. Unlike traditional web scraping tools, Diffbot doesn’t require any rules to read the content on a page. It all starts with computer vision.


Computer vision first classifies a page into one of 20 possible types. The content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type. The result is a website transformed into clean structured data (like JSON or CSV), ready for your application. If you’d like to see it in action, go ahead and try a demo.
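As a small illustration of what “ready for your application” can mean, the sketch below writes a list of already-extracted records out as CSV and as JSON using only Python’s standard library. The records and column names are invented for the example and do not come from a real extraction.

```python
import csv
import io
import json


def records_to_csv(records, fieldnames):
    """Serialize dict records to CSV text, keeping only the chosen columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)  # missing keys are written as empty cells
    return buffer.getvalue()


# Hypothetical extracted records, for illustration only.
records = [
    {"type": "product", "title": "Blue Mug", "price": "12.50", "sku": "X1"},
    {"type": "article", "title": "How We Glaze Our Mugs"},
]

csv_text = records_to_csv(records, ["type", "title", "price"])
json_text = json.dumps(records, indent=2)
```

CSV suits spreadsheets and flat reports, while JSON keeps every field of each record, including ones that only some page types have.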

There are certainly instances where a basic web scraper will get the job done, and not every company needs something robust to gather the data they need. However, the more data you have (especially if that data is fresh, well-structured, and contains the information you want), the better your results will be. And there’s something else worth saying.

That something is the value of having a third-party vendor on your side. Just because you can build a web crawler doesn’t mean you should have to. Developers work hard building complex programs and apps for businesses, and they should focus on their craft instead of spending energy scraping the web.

A Few Downsides Worth Mentioning

Before we conclude, it’s worth mentioning two issues that we find most challenging. First, Diffbot does not recognize PDF documents, and we frequently would like to ingest them as articles. Second, we find it difficult to troubleshoot a crawler when it’s not bringing in data, or not bringing in the data we expect.

One more thing, though this is more of a suggestion: Diffbot has several excellent capabilities, and the team is constantly improving them and adding new features. Current customers, and perhaps prospective ones too, would benefit from a weekly or monthly newsletter, or social media updates, about those changes.

Get Started:  If You’re Going To Learn One Language in 2021, Make it DQL

If you are still running an old version of Python (<3.0), it makes sense to upgrade before you start. Likewise, for the most advanced queries, you do have to learn Diffbot Query Language (DQL) in some detail before you can begin.

There’s a bit of a learning curve to DQL if you are not used to forming database queries. But once you work through a few examples and get used to building queries, you’ll realize just how powerful your searches become. Their team is also helpful!

Takeaway Notes:

The best thing about Diffbot is that you can start a free trial at any time. All you need to do is sign up to create your own account. You’ll notice that it doesn’t take long to get up and running with the KG: within a few minutes, you can begin querying different entity types.

If you want a little more hand-holding, reach out for a demo and their team will show you some cool queries, use cases for the Knowledge Graph, and so on. We’ve been using both the Knowledge Graph and Enhance products. We use the KG for a wider search, finding individuals with certain job titles at certain orgs, and then we enrich those profiles with Diffbot Enhance.

Together, the Diffbot Knowledge Graph and Enhance make great tools for web-based market research and lead enrichment.

In our Web Tech Experts team’s experience, writing and maintaining a web scraper is the bane of most developers’ existence. Now no one is forced to draw the short straw. That is why tools like Diffbot’s web crawling and scraping products exist, and why they are a worthwhile toolkit for many webmasters to put their efforts into.

That’s it! That is everything you need to know about Diffbot, its benefits, and how you can make effective use of it. So, have you been using it yet? If yes, you can share your overall User Experience (UX) with us. And if not, what’s still holding you back? You can try it to see it in action.

Finally, you are also welcome to Consult Us if you’ll need more help. Otherwise, you can also share your additional thoughts, opinions, suggestions, recommendations, or even contribution questions (for our FAQ Answers web segment) in our comments section. And, if you’re in a position, you can also donate to support our work as well as to motivate our team.
