Testing JSON-LD NewsArticle Markups + linked Images of 240 publishers using Puppeteer

Tobias Willmann
Feb 15 · 5 min read

Some aspects of JSON-LD markup are not tested by Google Search Console or by tools such as Screaming Frog’s schema validator, but they can be tested by running Puppeteer.

Puppeteer can, for example, fetch the status codes and real dimensions of the images mentioned in the markup. In addition, because it is scripted in JavaScript, it can run more complicated analyses on the crawled data.
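As a rough illustration of that idea, here is a minimal sketch and not the script used for this test. It assumes the `puppeteer` package is installed, and the names `imageUrlsFrom` and `inspectArticle` are hypothetical. It reads the JSON-LD blocks of an article page, then opens every image URL it finds and records the status code, content type and natural size.

```javascript
// Hypothetical helper: collect image URLs from one parsed JSON-LD object.
// "image" may be a string, an ImageObject, or an array of either.
function imageUrlsFrom(article) {
  return [].concat(article.image || [])
    .map((img) => (typeof img === "string" ? img : img && img.url))
    .filter(Boolean);
}

// Hypothetical sketch: inspect all images linked in the JSON-LD of one article URL.
async function inspectArticle(articleUrl) {
  const puppeteer = require("puppeteer"); // assumption: installed via npm
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(articleUrl, { waitUntil: "domcontentloaded" });
    // Read the raw text of every JSON-LD script tag on the page.
    const blocks = await page.$$eval(
      'script[type="application/ld+json"]',
      (nodes) => nodes.map((node) => node.textContent)
    );
    const results = [];
    for (const raw of blocks) {
      let data;
      try { data = JSON.parse(raw); } catch { continue; } // skip invalid JSON
      for (const item of [].concat(data)) {
        for (const url of imageUrlsFrom(item || {})) {
          // Open the image URL itself to get its status and real dimensions.
          const response = await page.goto(url);
          const size = await page.evaluate(() => {
            const img = document.images[0];
            return img ? { width: img.naturalWidth, height: img.naturalHeight } : null;
          });
          results.push({
            url,
            status: response ? response.status() : null,
            contentType: response ? response.headers()["content-type"] : null,
            size,
          });
        }
      }
    }
    return results;
  } finally {
    await browser.close();
  }
}
```

Calling `inspectArticle("https://example.com/some-article")` (a placeholder URL) would resolve to one entry per linked image. Images served as downloads instead of being displayed may make `page.goto` fail, so a real crawler would need more error handling.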

This article will show some first test results and common mistakes made by publishers related to JSON-LD NewsArticle Markups and Images.

What you could check with Puppeteer but NOT with other tools

Images in the NewsArticle markup

Many tools like GSC check for valid markup, but not the images listed within the markup. According to Google’s guidelines, these images should meet the following specifications:

  • Image URLs must be crawlable and indexable.
  • Images must be in .jpg, .png, or .gif format.
  • Images should be at least 1200 pixels wide.
  • For best results, provide multiple high-resolution images (minimum of 800,000 pixels when multiplying width and height) with the following aspect ratios: 16x9, 4x3, and 1x1.

So Puppeteer can, for example:

  • Check status codes of the linked images
  • Check filetypes
  • Check dimensions of the linked images + ratios
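The checks listed above can be sketched as one small function. This is only an illustration: the input shape (`status`, `contentType`, `width`, `height`) and the function name are assumptions, while the thresholds come from the specifications quoted earlier.

```javascript
// Allowed formats per the quoted guideline (.jpg, .png, .gif), expressed as MIME types.
const ALLOWED_IMAGE_TYPES = ["image/jpeg", "image/png", "image/gif"];

// Hypothetical check: returns a list of problems for one crawled image.
function checkArticleImage(image) {
  const problems = [];
  // Drop parameters such as "; charset=..." before comparing the content type.
  const type = String(image.contentType || "").split(";")[0].trim().toLowerCase();
  if (image.status !== 200) problems.push(`status ${image.status}`);
  if (!ALLOWED_IMAGE_TYPES.includes(type)) problems.push(`format ${type || "unknown"}`);
  if (image.width < 1200) problems.push("narrower than 1200px");
  if (image.width * image.height < 800000) problems.push("fewer than 800,000 pixels");
  return problems;
}
```

An empty array means the image passed all four checks. The aspect ratio is left out here because it only makes sense across the whole set of images of one article.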

Marked-up images are used, for example, in AMP carousels:

Publisher logo in the NewsArticle markup

Here, too, you could check the images and compare them with the requirements in Google’s logo guidelines:

  • The file must be a raster file, such as .jpg, .png, or .gif. Don’t use vector files, such as .svg or .eps.
  • The logo should fit in a 60x600px rectangle, and either be exactly 60px high (preferred), or exactly 600px wide. For example, 450x45px would not be acceptable, even though it fits within the 600x60px rectangle.
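A minimal sketch of these two logo rules, assuming the real width and height of the logo file are already known. The function name and the input shape are hypothetical.

```javascript
// Raster extensions named in the guideline (plus the common ".jpeg" spelling).
const RASTER_EXTENSIONS = [".jpg", ".jpeg", ".png", ".gif"];

// Hypothetical check: returns a list of problems for one publisher logo.
function checkPublisherLogo(logo) {
  const problems = [];
  const path = new URL(logo.url).pathname.toLowerCase();
  if (!RASTER_EXTENSIONS.some((ext) => path.endsWith(ext))) {
    problems.push("not a raster file");
  }
  // The logo must fit into 600x60px AND touch one of the two limits exactly.
  const fits = logo.width <= 600 && logo.height <= 60;
  const exact = logo.height === 60 || logo.width === 600;
  if (!fits) problems.push("does not fit in 600x60px");
  else if (!exact) problems.push("neither 60px high nor 600px wide");
  return problems;
}
```

With this, the 450x45px example from the guideline is reported as "neither 60px high nor 600px wide", even though it fits inside the rectangle. Checking the file extension is a simplification; the response content type would be more reliable.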

This is used for the logos in AMP previews:

How are publishers doing?

240 random article URLs of 240 publishers were tested. I first took some big publishers, especially from German-speaking countries, and later randomly picked some more from Wikipedia. I always took a random article URL from the homepage of the publisher.

Of course, some problems appeared while scraping. Picking random publishers also means that many of them have no AMP, and some pages produce load errors.

For now, I’m only testing AMP and JSON-LD, so the test is based on the 112 URLs that have both AMP and JSON-LD. About 9 more publishers used AMP with Microdata or RDFa, and about 10 more had AMP but some weird errors.

Publisher Logos

In 112 AMP URLs I found 102 publisher logos set:

  • 39 out of 102 publishers/URLs had no problem with the publisher logo markup
  • 2 had images with status code 304 instead of 200
  • 14 had width and height not defined in the markup. It’s a required field!
  • 38 had a publisher.logo whose real image width or height was not correct. The logo should fit in a 60x600px rectangle, and either be exactly 60px high (preferred), or exactly 600px wide (https://developers.google.com/search/docs/data-types/article#logo-guidelines).
  • 33 had differences between the natural and the defined publisher.logo image sizes: the sizes defined in the markup differed from the actual image dimensions, which I tested by opening the image URLs.
  • No publisher had problems with the filetype. No .svg found.
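The comparison between defined and natural sizes can be sketched like this. It assumes the markup values may be numbers or strings such as "600", and that the natural size was measured separately. The function names are hypothetical.

```javascript
// Hypothetical helper: turn a markup value (600, "600", "600px") into a number, or NaN.
function toPixels(value) {
  return parseFloat(value);
}

// Compare the size declared in the markup with the natural size of the image.
function compareDeclaredSize(declared, natural) {
  const width = toPixels(declared.width);
  const height = toPixels(declared.height);
  if (!Number.isFinite(width) || !Number.isFinite(height)) {
    // Width or height is missing in the markup.
    return { defined: false, matches: false };
  }
  return {
    defined: true,
    matches: width === natural.width && height === natural.height,
  };
}
```

A result with `defined: false` corresponds to the "width and height not defined" case, and `matches: false` to a mismatch between markup and real image.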

Image Objects

Numbers out of 112 AMP URLs / publishers:

  • 79 publishers use 1 Image Object markup
  • 4 publishers use 2 Image Object markups
  • 18 publishers use 3 Image Object markups
  • 2 publishers use 4 Image Object markups
  • 9 had no Image Object markup, or I was not able to find it with my script

In total, I found 148 images with status code 200:

  • 118 jpeg
  • 28 webp
  • 1 png
  • 1 error

Of course, images depend heavily on the related article, so tests with 147 images are statistically bullshit. I’ll try to prepare a bigger test set. But let’s go on, to get a feeling for what could be tested…

Image related problems found in 147 images:

  • 42 out of 147 images are less than 1200px wide and thus too small.
  • 15 images had a difference between natural and defined image sizes. Here the markup states something completely different from the actual image size.
  • One publisher really used a 5x5 image as Image Object. 🙈
  • 8 images were wider than 3000px (I guess that’s fine)

Ratios

There are ratio recommendations from Google, so let’s see how well publishers have adopted them:

  • Almost all publishers with 3 images tried to provide 16:9, 4:3 and 1:1 images. Just 3 out of 54 images are not 16:9, 4:3 or 1:1.
  • But only 18 out of 112 publishing sites tried this
  • Just 3 URLs/publishing sites in the test do it without errors (so no images that are too small + proper width and height definition)
  • 59 images out of 147 used some other ratio than 16:9, 4:3 or 1:1
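Classifying an image by ratio can be sketched as follows. The 1% tolerance is my assumption to allow for rounding (for example 1200x676 instead of 1200x675); it is not part of Google's recommendation, and the function name is hypothetical.

```javascript
// The three recommended aspect ratios as width / height.
const RECOMMENDED_RATIOS = { "16:9": 16 / 9, "4:3": 4 / 3, "1:1": 1 };

// Returns "16:9", "4:3", "1:1" or "other" for the given pixel dimensions.
function classifyRatio(width, height, tolerance = 0.01) {
  const ratio = width / height;
  for (const [name, target] of Object.entries(RECOMMENDED_RATIOS)) {
    // Relative deviation from the target ratio.
    if (Math.abs(ratio - target) / target <= tolerance) return name;
  }
  return "other";
}
```

For example, `classifyRatio(1200, 800)` falls into "other", because 3:2 is not one of the recommended ratios.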

More Learnings

1) There were almost no publishers with AMP and no markup at all

Most use JSON-LD.

2) There are more specific types of NewsArticle

https://schema.org/NewsArticle#subtypes

Only https://elmundo.es was using one of these.

3) Some publishers put the NewsArticle object into an array

By default? I don’t know why, but it still works.
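Both learnings matter when a script looks for the article object. A minimal sketch, assuming the JSON-LD has already been parsed: it accepts a single object or an array, and it also matches the NewsArticle subtypes as I understand the schema.org list. Handling `@graph` is an extra assumption of mine, and the function name is hypothetical.

```javascript
// NewsArticle plus its subtypes (assumption: per the schema.org list linked above).
const NEWS_ARTICLE_TYPES = new Set([
  "NewsArticle",
  "AnalysisNewsArticle",
  "AskPublicNewsArticle",
  "BackgroundNewsArticle",
  "OpinionNewsArticle",
  "ReportageNewsArticle",
  "ReviewNewsArticle",
]);

// Returns every NewsArticle-like object found in the parsed JSON-LD.
function findNewsArticles(parsed) {
  const found = [];
  const visit = (node) => {
    if (Array.isArray(node)) {
      node.forEach(visit); // the object wrapped in an array
      return;
    }
    if (!node || typeof node !== "object") return;
    // "@type" may be a single string or an array of strings.
    const types = [].concat(node["@type"] || []);
    if (types.some((type) => NEWS_ARTICLE_TYPES.has(type))) found.push(node);
    if (node["@graph"]) visit(node["@graph"]);
  };
  visit(parsed);
  return found;
}
```

This way `findNewsArticles([{ "@type": "NewsArticle" }])` and `findNewsArticles({ "@type": "NewsArticle" })` both return one article.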

4) It is possible to link markups like this

seen at https://www.zeit.de

Step 1:

Step 2:

Google’s testing tool detects that the two are connected.

UPDATE: Image widths

3187 image URLs + 2319 AMP URLs of 26 publishers analyzed:

At least 7 publishers don’t manage to have even ONE image that is 1200px wide or more.
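A per-publisher check like this one can be sketched as a simple grouping step. The input shape (`publisher`, `width`) and the function name are assumptions for illustration, not the format of my spreadsheet.

```javascript
// Returns the publishers whose widest image is still below minWidth.
function publishersWithoutLargeImage(images, minWidth = 1200) {
  const widest = new Map();
  for (const { publisher, width } of images) {
    // Remember the widest image seen so far for each publisher.
    widest.set(publisher, Math.max(widest.get(publisher) || 0, width));
  }
  return [...widest]
    .filter(([, maxWidth]) => maxWidth < minWidth)
    .map(([name]) => name);
}
```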

More

I won’t share the spreadsheet with the detailed test results or code for now…
