Make your websites smarter with Schema.org, Part 1
Make your data more useful in automated applications
This content is part of the series: Make your websites smarter with Schema.org, Part 1
Stay tuned for additional content in this series.
With the rise of artificial intelligence (AI) and cognitive computing,
there’s an increasing need for a structured data format that other
computers can easily understand. To meet that need, in 2011 a group of
search engine companies and large-scale web publishers created an
initiative called Schema.org to describe objects
that web pages are actually about.
In this four-part series, I introduce you to Schema.org and show you how
to use it to create more searchable web pages. In Part 1, I begin by
explaining the history of the project.
Benefits of Schema.org
To begin with, let’s look at some of the benefits of Schema.org. Why add
Schema.org markup to your pages? The bottom line is that doing so will
make your pages more accessible and easier to find for search engines, AI
assistants, and related web applications. You don’t have to learn any new
development systems or tools to use the markup, and you can get up to
speed in a couple of hours. Other benefits include:
- Aid in contextual search. Search engine companies and specialists increasingly guide users based on particular interests rather than blanket search terms. They try to understand a user’s intent and surface content that answers it. Is the user shopping? Looking for a film to watch? Searching to solve a technical problem? Schema.org markup lets search engines include your sites according to such contextual features, even more so when users search by voice or on mobile devices.
- Signal updated, quality content. When it comes to increasing your search engine ranking, there’s no replacement for creating great, quality content and cultivating legitimate links to your content. But using Schema.org markup signals to search engines that your content is well updated and of good quality.
- Increase click-through rates. When your Schema.org-enriched sites do show up in search engine rankings, they do so with the modern contextual features of the listing, called rich snippets. Rich snippets stand out from other search results, leading to better click-through rates by users.
- Improve content’s maintainability. When planning a site’s content, many people forget to plan for how to deal with content when it is out-of-date or irrelevant. Having pages that include Schema.org markup makes it easier to identify these pages and implement a plan during times of transition. Adding Schema.org markup also makes it much easier to develop tools that work with your existing pages and to incorporate those pages into successive sites and software projects. It also makes it easier for you to collaborate with partners on new, joint projects based on your existing sites.
Home pages for eyeballs
In the first days of the web, everything you wanted to see was on a home
page. Those initial web pages were like a personal bulletin pinned on a
public board, but with hyperlinks. The goal was to have humans looking at
them.
Before long, the Mosaic browser made it possible to embed images among the
text, which made the web more enticing for users. Embedded media objects
opened the door to audio, video, and application objects. Quickly, other
industries besides information and communication started to use — and
eventually to dominate — the web.
With the explosion of data on the Internet, it quickly became necessary to
categorize and tag content so that humans could more easily find the
information they were looking for.
Early web inventors wanted to spread organizational tools more broadly on
the web. In the 1990s, work on the “web of data” technology began. The
initial predictions for data on the web were grand. A May 2001 story in
Scientific American, by Sir Tim Berners-Lee and colleagues,
entitled “The Semantic Web,” set forth their ambitions for a new
technology that would provide a common language for data on the web,
making automation easier.
While much of this envisioned automation is now a reality, it’s primarily
due to the extraordinary feats of intense data munging by large search
engines and tech companies, and not because the common language for data
on the web ever took off. As a result, the automation we have now is not
as useful as it would be if there were a common language. The web might
seem an amazingly innovative place, but we are missing out on many more
The advent of Schema.org will bring to life the promise of the semantic
web. Through the efforts of the big players, even smaller players can now
RDF, linked data, microformats, and
In 2000, I wrote an article for IBM developerWorks, “An introduction to
RDF,” that explained the technology that the World Wide Web Consortium
(W3C) was advocating to provide a common language for data on the web. The
Resource Description Framework (RDF) is a set of specifications for
modeling data on the web, to make work easier for autonomous agents and
improve search engines and service directories. RDF was originally
conceived as a simple model for expressing bits of data on the web.
Unfortunately, the W3C ended up piling so many complicated specifications
on top of RDF (including full-blown AI facilities) that they were never
really clear on how to boil the semantic web down to something simple
enough that a typical web developer could easily learn.
Figure 1. Semantic web layer cake
To counteract these complicated specifications, an initiative called
“Linked Open Data” began to push for a simplified set of principles. The
name shortened to “Linked Data” as it became clear that the principles
were useful even for enterprise and in private contexts. Linked Data
basically recommends using HTTP URLs to identify things, rather
than, say, plain text strings, and using conventions such as simple RDF to
provide associated information for the identified things. This information
might consist, for example, of labels that make use of plain text
At first this metadata was provided separately from the web page itself,
but web developers quickly began advocating for the use of simple HTML
conventions to encode metadata right in the web page. These were called
microformats.
All these developments crystallized over the course of a decade into
Schema.org in 2011. The high-minded semantic web was simplified into
Linked Data, while the need for separate file representations was
eliminated by using microformat techniques.
An information model for your web
So, what does all this mean to today’s web developer? For one thing, it
means you have to ask, “What is my content actually about?”
Let’s say you maintain a web site for a book club. What are your pages
about? They are probably about books, meetings, and members, and you
describe these things with a conventional set of descriptions. For
example:
- Books are described in terms of titles, authors, ISBNs, cover
images, and so on.
- Meetings are described in terms of times/dates, locations,
- Members are described in terms of their names, contact
information, and photos.
A person might be a member of the club and also a book’s author. In that
case, some elements of a member’s description could be shared with that of
an author. With that in mind, you might visualize the data describing your
club as similar to the kind of data organization found in object-oriented
programming.
Figure 2 shows part of this mental map, in which I’ve made up what I call
the Geo Book Club.
Figure 2. Book club raw information model
So, what are we looking at?
The ovals are web resources (a little bit analogous to object-oriented
instances). The most important thing about this mindset is that you think
about URLs as much in terms of things they describe as you do the content
http://example.com/geobookclub is the Geo Book
Club’s website. In this model, I also consider it a thing, that is, a
club. The resource type describes the type of thing that it is,
and I use a leading line in capital letters to indicate this in the
diagram.
Resource types organize the conventions for properties that are associated
with specific things. For example, a person wouldn’t be associated with an
ISBN. Resource types place controls over the data patterns, making it more
efficient for applications to understand the data.
The arrows show the relationships or links between objects. It’s important
to label every link that you wish to elevate to an explicit relationship.
You don’t just say that the book “Things Fall Apart” is related
to the person “Chinua Achebe.” Instead, be more specific: The
book “Things Fall Apart” is authored by the person “Chinua
Achebe.” Because a book could have other related people, such as editors
or illustrators, labeling the specific relationships helps web
applications accurately process the data.
Sometimes the value of a relationship is just text rather than another web
resource. The diagram shows these as rectangles, and they are called
literals. Literals can also be numbers, dates, Booleans, and
other sorts of fundamental data.
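One way to make the diagram concrete is to write it down as data. The following is a minimal Python sketch, not anything defined by Schema.org: the class names (`Resource`, `Literal`, `Link`), the helper `values_of`, the plain-text labels, and the book and person URLs are all assumptions made up for illustration. Only the club URL comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Resource:
    """A web resource (an oval in the diagram), identified by its URL."""
    url: str


@dataclass(frozen=True)
class Literal:
    """A plain value (a rectangle in the diagram): text, number, date, or Boolean."""
    value: Union[str, int, float, bool]


@dataclass(frozen=True)
class Link:
    """A labeled arrow from a resource to another resource or to a literal."""
    subject: Resource
    label: str
    value: Union[Resource, Literal]


# Only the club URL appears in the article; the other two URLs are hypothetical.
club = Resource("http://example.com/geobookclub")
book = Resource("http://example.com/geobookclub/books/things-fall-apart")
achebe = Resource("http://example.com/geobookclub/people/chinua-achebe")

model: List[Link] = [
    Link(book, "title", Literal("Things Fall Apart")),
    Link(achebe, "name", Literal("Chinua Achebe")),
    # A specific label ("author") rather than a vague "related to".
    Link(book, "author", achebe),
    # The same person can also be a member of the club.
    Link(club, "member", achebe),
]


def values_of(subject: Resource, label: str) -> List[Union[Resource, Literal]]:
    """Return every value linked from `subject` by the relationship `label`."""
    return [link.value for link in model
            if link.subject == subject and link.label == label]
```

Calling `values_of(book, "author")` returns a list holding the `achebe` resource, while `values_of(book, "title")` returns a list holding a `Literal`, which mirrors the oval-versus-rectangle distinction in the diagram.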
The cloud shapes are just convenient markers for detail we don’t need for
this tutorial. I used them to show that a club can have multiple meetings,
but in this series we care only about the details of the second one. The
clouds are meant to show that there can be multiple meetings, each a
You could imagine a way of modeling this with some sort of container
object, say “membership” to hold the members, or “schedule” to hold the
events. However, containers get complex quickly. Schema.org emphasizes
simplicity, so the more common convention is simply to express multiple
instances of a relationship.
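As a rough illustration of that convention, the sketch below repeats the same relationship label once per value instead of introducing a “membership” or “schedule” container. The member and meeting URLs, the labels, and the helper `group_by_label` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

club = "http://example.com/geobookclub"

# Hypothetical member and meeting URLs. Each one is attached directly to the
# club by repeating the relationship, with no container object in between.
links: List[Tuple[str, str, str]] = [
    (club, "member", club + "/members/1"),
    (club, "member", club + "/members/2"),
    (club, "meeting", club + "/meetings/1"),
    (club, "meeting", club + "/meetings/2"),
]


def group_by_label(subject: str,
                   links: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Collect the values of each relationship that starts at `subject`."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for link_subject, label, value in links:
        if link_subject == subject:
            grouped[label].append(value)
    return dict(grouped)
```

An application that wants “all members” can simply gather every value of the `member` relationship, so the grouping is computed when needed rather than modeled up front.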
The book cover is an interesting special case. For one thing, it is a web
URL linking to an image file. Schema.org allows you to include different
sorts of web URLs in relationships, including images and other non-text
media objects. There is also no resource type specified. In a few cases
such as this, you can let the relationship carry the weight, though
Schema.org does also provide a more thorough way of expressing such media
relationships where needed.
RDF version of the model
If the model described above makes sense to you, you are close to
understanding RDF well enough to start using Schema.org. Keep in mind just
a few points:
- All relationships must be URLs, not just simple strings such as
“member” and “author”. These are formally called predicates
in RDF, but Schema.org uses the term properties, and provides a web
page for each property it defines. That way, a person—or even a
machine—can just go to a relationship’s URL and see a readable
- Resource types are expressed using a special RDF predicate,
conventionally abbreviated as rdf:type. The value of this relationship
is called an RDF class.
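To see what these points look like in practice, here is a small Python sketch that stores RDF-style statements as plain tuples of three URLs. It is only an illustration: the book and person URLs are hypothetical, the helper `types_of` is invented for this example, and the choice of the Schema.org `Organization`, `Book`, and `Person` classes is an assumption about how the club might be typed. The `rdf:type` URL is the standard one from the RDF namespace.

```python
from typing import List, Tuple

# Full URL behind the conventional rdf:type abbreviation.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "http://schema.org/"

club = "http://example.com/geobookclub"
# Hypothetical URLs for the book and the person.
book = "http://example.com/geobookclub/books/things-fall-apart"
achebe = "http://example.com/geobookclub/people/chinua-achebe"

# Each statement is (subject, predicate, object). Every predicate is a URL,
# never a bare string such as "member" or "author".
triples: List[Tuple[str, str, str]] = [
    (club, RDF_TYPE, SCHEMA + "Organization"),
    (book, RDF_TYPE, SCHEMA + "Book"),
    (achebe, RDF_TYPE, SCHEMA + "Person"),
    (book, SCHEMA + "author", achebe),
    (club, SCHEMA + "member", achebe),
]


def types_of(resource: str) -> List[str]:
    """Return the RDF classes declared for `resource` through rdf:type."""
    return [obj for subj, pred, obj in triples
            if subj == resource and pred == RDF_TYPE]
```

Because the resource type is just another statement, finding a resource’s class needs no special machinery beyond filtering on the `rdf:type` predicate.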
Figure 3 shows a subset of the Geo Book Club model illustrating the fully
expressed predicates and type/class relationships. You can imagine how
cluttered it would be if I carried all that data through the entire
Figure 3. Book club information model snippet with
full RDF predicates and type info
There is no Schema.org class specifically for a book club, so I used the
one for an organization. Incidentally, Schema.org is not meant to provide
a comprehensive model of everything anyone might wish to express on the
web. However, if enough book club organizers got together and decided to
come up with Schema.org extensions to suit their needs, they might
eventually get them into the core Schema.org model. Rough consensus and
actual use are the most important drivers in the evolution of
Schema.org.
Fitting the model to Schema.org
The following diagram shows a Schema.org-conforming model. I use two
abbreviations to reduce clutter:
- URL abbreviation convention from RDF: A prefix followed by a colon and
the tail end of the URL.
- Resource type abbreviation: The resource type appears in parentheses
underneath the resource identifier.
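The URL abbreviation convention can be sketched in a few lines of Python. The prefix table and the function names `expand` and `abbreviate` are assumptions for this example; the two namespace URLs are the usual ones for Schema.org and RDF.

```python
# Prefix table: abbreviation -> the leading part of the full URL.
PREFIXES = {
    "schema": "http://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(abbreviated: str) -> str:
    """Turn 'prefix:tail' into a full URL, for example 'schema:Book'."""
    prefix, separator, tail = abbreviated.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {abbreviated!r}")
    return PREFIXES[prefix] + tail


def abbreviate(url: str) -> str:
    """Turn a full URL into 'prefix:tail' when a known prefix matches."""
    for prefix, base in PREFIXES.items():
        if url.startswith(base):
            return f"{prefix}:{url[len(base):]}"
    return url  # No known prefix: leave the URL as it is.
```

With this table, `expand("schema:Book")` yields `http://schema.org/Book`, and `abbreviate` reverses the mapping for URLs that start with a known prefix.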
Figure 4. Book club Schema.org information model
Besides the change to
schema:Organization, there is another
vocabulary change to match Schema.org. The
is given as
Schema.org supports a class inheritance capability similar to what you
might know from object-oriented programming. It has one ancestral class
schema:Thing, from which all the classes derive.
schema:Organization is a subclass of schema:Thing.
schema:Book is a subclass of
schema:CreativeWork, which is in turn a subclass of schema:Thing.
Even properties are subclasses of
schema:Thing, but this is a
bit of an arcane detail.
More interestingly, Schema.org makes much use of subproperties, which are
analogous to subclasses. For example, the Schema.org model doesn’t list
schema:isbn as a recognized property on
schema:Book. Rather it specifies
schema:identifier. However, there are several subproperties
These different sorts of identifiers make sense in specific contexts.
Subproperties follow the Liskov substitution principle, which you might remember from
object-oriented programming. In basic terms, that means that you can
substitute any subproperty for its parent. So since
schema:identifier is recognized on schema:Book,
you are free to substitute
schema:isbn, as I do in the Geo
Book Club example.
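The substitution rule can be sketched as a lookup over two small hierarchies. This is an illustrative model only, not how Schema.org tooling works: the dictionaries cover just the classes and properties named in this article, the placement of `schema:identifier` on `schema:Book` follows the description above, and the function names `ancestors` and `is_allowed` are invented for this example.

```python
from typing import Dict, Iterator, Optional, Set

# Child -> parent, limited to the classes named in this article.
SUBCLASS_OF: Dict[str, str] = {
    "schema:Organization": "schema:Thing",
    "schema:CreativeWork": "schema:Thing",
    "schema:Book": "schema:CreativeWork",
}

# Child -> parent for properties.
SUBPROPERTY_OF: Dict[str, str] = {
    "schema:isbn": "schema:identifier",
}

# Properties recognized on each class, following the article's description.
RECOGNIZED_ON: Dict[str, Set[str]] = {
    "schema:Book": {"schema:identifier"},
}


def ancestors(name: str, parent_of: Dict[str, str]) -> Iterator[str]:
    """Yield `name`, then each successive parent up the hierarchy."""
    current: Optional[str] = name
    while current is not None:
        yield current
        current = parent_of.get(current)


def is_allowed(prop: str, cls: str) -> bool:
    """True if `prop`, or a property it specializes, is recognized on `cls`
    or on one of the classes that `cls` derives from."""
    recognized: Set[str] = set()
    for ancestor_class in ancestors(cls, SUBCLASS_OF):
        recognized |= RECOGNIZED_ON.get(ancestor_class, set())
    return any(p in recognized for p in ancestors(prop, SUBPROPERTY_OF))
```

Under these assumptions, `is_allowed("schema:isbn", "schema:Book")` is true because `schema:isbn` specializes `schema:identifier`, while the same property is not accepted on `schema:Organization`.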
If you run a web site, you already deal with models and frameworks for how
web pages should look and behave. It’s becoming increasingly important to
define what the content means, and, in particular, to describe the things
discussed in the website. Schema.org provides a framework that’s growing
in popularity for expressing such information.
In this part, you learned how to create models that take the first step
toward Schema.org. Now that you understand the Schema.org-based diagram
described here, you are ready to implement this model in your own HTML web
pages. There are several syntax options for doing so, and I get to these
options in the next part.