Testing JSON-LD NewsArticle Markups + linked Images of 240 publishers using Puppeteer
This article shows some first test results and the common mistakes publishers make in JSON-LD NewsArticle markup and the images it links to.
What you could check with Puppeteer but NOT with other tools
Images in the NewsArticle markup
Many tools, like Google Search Console (GSC), check whether the markup is valid, but not the images listed in it. According to Google's guidelines, these images should meet the following specifications:
- Image URLs must be crawlable and indexable.
- Images must be in .jpg, .png, or .gif format.
- Images should be at least 1200 pixels wide.
- For best results, provide multiple high-resolution images (minimum of 800,000 pixels when multiplying width and height) with the following aspect ratios: 16x9, 4x3, and 1x1.
So Puppeteer can, for example:
- Check status codes of the linked images
- Check filetypes
- Check dimensions of the linked images + ratios
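As a rough illustration, the checks above can be expressed as one small function. This is only a sketch, not the script used for the tests: the `ImageFacts` shape and the function name are hypothetical, and it assumes the status code, Content-Type header and natural dimensions have already been collected (with Puppeteer, for instance, from the response of `page.goto()` and from `naturalWidth`/`naturalHeight` of the loaded image).

```typescript
// Facts about one linked image, as a crawler might collect them.
// All names are hypothetical and not taken from the original test script.
interface ImageFacts {
  url: string;
  status: number;      // HTTP status code of the image request
  contentType: string; // value of the Content-Type response header
  width: number;       // natural width in pixels
  height: number;      // natural height in pixels
}

// Thresholds mirror the specification list quoted above.
const ALLOWED_TYPES = ["image/jpeg", "image/png", "image/gif"];
const MIN_WIDTH = 1200;
const MIN_PIXELS = 800000;

// Returns a list of human-readable problems; an empty list means "looks fine".
function checkArticleImage(img: ImageFacts): string[] {
  const problems: string[] = [];
  if (img.status !== 200) {
    problems.push(`status ${img.status} instead of 200`);
  }
  // Drop parameters such as "; charset=binary" before comparing.
  const mime = img.contentType.replace(/;.*$/, "").trim().toLowerCase();
  if (ALLOWED_TYPES.indexOf(mime) === -1) {
    problems.push(`file type ${mime} is not jpg, png or gif`);
  }
  if (img.width < MIN_WIDTH) {
    problems.push(`only ${img.width}px wide, expected at least ${MIN_WIDTH}px`);
  }
  if (img.width * img.height < MIN_PIXELS) {
    problems.push(`only ${img.width * img.height} pixels in total`);
  }
  return problems;
}
```

Note that with this strict reading of the format list, a WebP image is reported as a problem.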
Marked-up images are used, for example, in AMP carousels.
Publisher logo in the NewsArticle markup
Here, too, you can check the image and compare it with Google's requirements:
- The file must be a raster file, such as .jpg, .png, or .gif. Don’t use vector files, such as .svg or .eps.
- The logo should fit in a 600x60px rectangle (width x height), and either be exactly 60px high (preferred), or exactly 600px wide. For example, 450x45px would not be acceptable, even though it fits within the 600x60px rectangle.
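The dimension rule is easy to get wrong, so here is a minimal sketch of it. The function name is hypothetical, and it assumes the real (natural) width and height of the logo are already known; the file type check is left out.

```typescript
// Checks the logo dimension rule: fit into 600x60 (width x height)
// and be exactly 60px high or exactly 600px wide.
function checkPublisherLogo(width: number, height: number): string[] {
  const problems: string[] = [];
  if (width > 600 || height > 60) {
    problems.push(`${width}x${height} does not fit in a 600x60 rectangle`);
  }
  if (height !== 60 && width !== 600) {
    problems.push("neither exactly 60px high nor exactly 600px wide");
  }
  return problems;
}

// The example from the requirements: fits the rectangle, but is still rejected.
console.log(checkPublisherLogo(450, 45));
// A logo that is exactly 60px high passes.
console.log(checkPublisherLogo(300, 60));
```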
The logo is used in AMP previews.
How are publishers doing?
240 random article URLs from 240 publishers were tested. I started with some big publishers, mainly from German-speaking countries, and later added more picked at random from Wikipedia. For each publisher, I took a random article URL from its homepage.
Some problems appeared while scraping. Picking publishers at random also means that many have no AMP version, and some pages failed to load.
For now I'm only testing AMP and JSON-LD, so the test is based on the 112 URLs that have both. About 9 more publishers used AMP with Microdata or RDFa, and about 10 more had AMP but produced some odd errors.
In the 112 AMP URLs I found 102 with a publisher logo set:
- 39 out of 102 publishers/URLs had no problem with the publisher logo markup
- 2 had images with status code 304 instead of 200
- 14 did not define width and height in the markup. These are required fields!
- 38 had a real publisher.logo image with an incorrect width or height. The logo should fit in a 600x60px rectangle, and either be exactly 60px high (preferred), or exactly 600px wide.
- 33 had differences between natural and defined publisher.logo image sizes. The sizes defined in the markup differed from the actual image dimensions, which I tested by opening the image URLs.
- No publisher had problems with the filetype. No .svg found.
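The comparison between defined and natural sizes could look roughly like this. It is a sketch under assumptions: the names are hypothetical, and it assumes the markup gives width and height as plain numbers or strings such as "600" or "600px". Other forms, such as a nested object, are not handled.

```typescript
// A size value as it may appear in the markup.
type DeclaredSize = number | string | undefined;

// Converts a declared size to pixels, or null if it is missing or unreadable.
function toPixels(value: DeclaredSize): number | null {
  if (value === undefined) {
    return null;
  }
  const n = typeof value === "number" ? value : parseInt(value, 10);
  return isNaN(n) ? null : n;
}

// Compares the size defined in the markup with the real image size.
function compareSizes(
  declaredWidth: DeclaredSize,
  declaredHeight: DeclaredSize,
  naturalWidth: number,
  naturalHeight: number
): "missing" | "mismatch" | "ok" {
  const w = toPixels(declaredWidth);
  const h = toPixels(declaredHeight);
  if (w === null || h === null) {
    return "missing";
  }
  return w === naturalWidth && h === naturalHeight ? "ok" : "mismatch";
}
```

A logo defined as 600x60 whose file is really 1200x120 would be reported as "mismatch", and markup without width or height as "missing".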
Number of ImageObject markups per URL, out of 112 AMP URLs/publishers:
- 79 publishers use 1 ImageObject markup
- 4 publishers use 2 ImageObject markups
- 18 publishers use 3 ImageObject markups
- 2 publishers use 4 ImageObject markups
- 9 had no ImageObject markup, or my script was not able to find it
In total, I found 148 images with status code 200:
- 118 jpeg
- 28 webp
- 1 png
- 1 error
Of course, images depend heavily on the related article, so a test with 147 images is statistically meaningless. I'm trying to prepare a bigger test set. But let's go on, to get a feeling for what could be tested…
Image related problems found in 147 images:
- 42 out of 147 images are less than 1200px wide and thus too small.
- 15 images had a difference between natural and defined image sizes.
- One publisher really used a 5x5 image as ImageObject. 🙈
- 8 images were wider than 3000px (I guess that’s fine)
There are ratio recommendations from Google, so let's see how well publishers adopted them:
- Almost all publishers with 3 images tried to provide 16:9, 4:3 and 1:1 images. Just 3 out of 54 images are not 16:9, 4:3 or 1:1.
- But only 18 out of 112 publishing sites tried this.
- Just 3 URLs/publishing sites in the test do it without errors (no images that are too small, and proper width and height definitions).
- 59 images out of 147 used some other ratio than 16:9, 4:3 or 1:1.
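Sorting images into these ratio buckets could be sketched as follows. The function name and the 1% tolerance are assumptions made for this example. Some tolerance is needed because real image sizes rarely divide evenly.

```typescript
// The three aspect ratios recommended by Google.
const RECOMMENDED_RATIOS: { label: string; value: number }[] = [
  { label: "16:9", value: 16 / 9 },
  { label: "4:3", value: 4 / 3 },
  { label: "1:1", value: 1 },
];

// Returns "16:9", "4:3", "1:1", "other", or "invalid" for non-positive sizes.
// The tolerance is relative: 0.01 allows a deviation of 1%.
function classifyRatio(width: number, height: number, tolerance: number = 0.01): string {
  if (width <= 0 || height <= 0) {
    return "invalid";
  }
  const ratio = width / height;
  for (const candidate of RECOMMENDED_RATIOS) {
    if (Math.abs(ratio - candidate.value) / candidate.value <= tolerance) {
      return candidate.label;
    }
  }
  return "other";
}
```

With these settings 1920x1080 is classified as 16:9 and 1200x900 as 4:3, while a 3:2 image such as 1200x800 ends up in "other".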
1) Almost no publishers with AMP had no markup at all
Most use JSON-LD.
2) There are more specific types of NewsArticle
Only https://elmundo.es was using one.
3) Some publishers put the NewsArticle object into an array
Apparently by default. I don't know why, but it still works.
4) It is possible to link markups to each other
Seen at https://www.zeit.de.
Google's testing tool detects that they are connected.
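Observations 2 and 3 matter for any script that looks for the markup: a plain check for an object with the type NewsArticle would miss both cases. A minimal sketch of a more tolerant lookup is below. All names are hypothetical, the input is assumed to be the text content of the JSON-LD script tags of a page, and the subtype list follows schema.org and may be incomplete. Nested structures are not searched.

```typescript
type JsonLdNode = { [key: string]: unknown };

// NewsArticle plus its more specific types as listed on schema.org.
const NEWS_ARTICLE_TYPES = [
  "NewsArticle",
  "AnalysisNewsArticle",
  "AskPublicNewsArticle",
  "BackgroundNewsArticle",
  "OpinionNewsArticle",
  "ReportageNewsArticle",
  "ReviewNewsArticle",
];

// "@type" may be a single string or an array of strings.
function isNewsArticle(node: JsonLdNode): boolean {
  const rawType = node["@type"];
  const types: unknown[] = Array.isArray(rawType) ? rawType : [rawType];
  return types.some(
    (t) => typeof t === "string" && NEWS_ARTICLE_TYPES.indexOf(t) !== -1
  );
}

// Accepts the raw text of each JSON-LD script tag and returns all
// top-level NewsArticle nodes, whether given as an object or inside an array.
function findNewsArticles(scriptContents: string[]): JsonLdNode[] {
  const found: JsonLdNode[] = [];
  for (const text of scriptContents) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(text);
    } catch {
      continue; // skip script tags with broken JSON
    }
    const candidates: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
    for (const c of candidates) {
      if (typeof c === "object" && c !== null && !Array.isArray(c)) {
        const node = c as JsonLdNode;
        if (isNewsArticle(node)) {
          found.push(node);
        }
      }
    }
  }
  return found;
}
```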
UPDATE: Image widths
3187 images and 2319 AMP URLs of 26 publishers analyzed:
At least 7 publishers don't manage to provide even ONE image that is 1200px wide or wider.
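A count like this could be produced with a small grouping step. The sketch below assumes the result is meant per publisher across all of its analyzed images, and that every image has already been measured. The `MeasuredImage` shape and the function name are hypothetical.

```typescript
// One measured image, tagged with the publisher it belongs to.
interface MeasuredImage {
  publisher: string;
  width: number; // natural width in pixels
}

// Returns the publishers that have no image at all with the minimum width.
function publishersWithoutWideImage(
  images: MeasuredImage[],
  minWidth: number = 1200
): string[] {
  const hasWide: { [publisher: string]: boolean } = {};
  for (const img of images) {
    const seenWide = hasWide[img.publisher] === true;
    hasWide[img.publisher] = seenWide || img.width >= minWidth;
  }
  return Object.keys(hasWide)
    .filter((publisher) => !hasWide[publisher])
    .sort();
}
```

Publishers with no measured images at all do not show up in the result, so load errors would need to be tracked separately.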
I won’t share the spreadsheet with the detailed test results or code for now…