
Google Images has made a series of changes to help people explore, learn and do more through visual search. An important element of visual search is the ability for users to scan many ideas before coming to a decision, whether it’s purchasing a product, learning more about a stylish room, or finding instructions for a DIY project. Often this involves loading many web pages, which can slow down a search considerably and prevent users from completing a task. 

As previewed at Google I/O, we’re launching a new AMP-powered feature in Google Images on the mobile web, Swipe to Visit, which makes it faster and easier for users to browse and visit web pages. After a Google Images user selects an image to view on a mobile device, they will get a preview of the website header, which can be easily swiped up to load the web page instantly. 

Swipe to Visit uses AMP's prerender capability to show a preview of the page displayed at the bottom of the screen. When a user swipes up on the preview, the web page is displayed instantly and the publisher receives a pageview. The speed and ease of this experience make it more likely that users will visit a publisher's site, while still allowing them to continue their browsing session.

Publishers who support AMP don’t need to take any additional action for their sites to appear in Swipe to Visit on Google Images. Publishers who don’t support AMP can learn more about getting started with AMP here. In the coming weeks, publishers can also view their traffic data from AMP in Google Images in Search Console’s performance report for Google Images, in a new search area named “AMP on Image result”.

We look forward to continuing to support the Google Images ecosystem with features that help users and publishers alike.


Yesterday we announced that we're open-sourcing Google's production robots.txt parser. It was an exciting moment that paves the way for potential Search open sourcing projects in the future! Feedback is helpful, and we're eagerly collecting questions from developers and webmasters alike. One question stood out, which we'll address in this post:
Why isn't a code handler for other rules like crawl-delay included in the code?
The internet draft we published yesterday provides an extensible architecture for rules that are not part of the standard. This means that if a crawler wanted to support its own line like "unicorns: allowed", it could. To demonstrate how this would look in a parser, we included a very common line, sitemap, in our open-source robots.txt parser.
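To make the extension idea concrete, here is a minimal, self-contained sketch; it is our own illustration rather than the open-sourced library's interface, and every name in it is invented for the example. A line-based parser can dispatch on the rule name, handle the well-known keys, and route anything unrecognized, such as a hypothetical "unicorns" rule, to an extension hook:
// Illustrative sketch only; this is not the open-sourced library's API.
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

// Trim leading and trailing whitespace from a token.
static std::string Trim(const std::string& s) {
  const auto begin = s.find_first_not_of(" \t\r");
  if (begin == std::string::npos) return "";
  const auto end = s.find_last_not_of(" \t\r");
  return s.substr(begin, end - begin + 1);
}

// Lowercase a key so rule names match case-insensitively.
static std::string ToLower(std::string s) {
  for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  return s;
}

// Walk the file line by line; every "key: value" line is dispatched on its key.
void ParseRobotsTxt(const std::string& content) {
  std::istringstream lines(content);
  std::string line;
  while (std::getline(lines, line)) {
    const auto colon = line.find(':');
    if (colon == std::string::npos) continue;  // Not a "key: value" line.
    const std::string key = ToLower(Trim(line.substr(0, colon)));
    const std::string value = Trim(line.substr(colon + 1));
    if (key == "user-agent" || key == "allow" || key == "disallow") {
      std::cout << "standard rule: " << key << " = " << value << "\n";
    } else if (key == "sitemap") {
      // A widely used line that isn't part of the standard; handling it here
      // shows how an extension plugs into the same dispatch.
      std::cout << "sitemap: " << value << "\n";
    } else {
      // Extension hook: a crawler that wanted its own rule, say
      // "unicorns: allowed", would recognize it here instead of dropping it.
      std::cout << "unrecognized rule: " << key << " = " << value << "\n";
    }
  }
}

int main() {
  ParseRobotsTxt("User-agent: *\nDisallow: /private/\n"
                 "Sitemap: https://example.com/sitemap.xml\n"
                 "Unicorns: allowed\n");
}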
While open-sourcing our parser library, we analyzed the usage of robots.txt rules. In particular, we focused on rules unsupported by the internet draft, such as crawl-delay, nofollow, and noindex. Since these rules were never documented by Google, naturally, their usage in relation to Googlebot is very low. Digging further, we saw their usage was contradicted by other rules in all but 0.001% of all robots.txt files on the internet. These mistakes hurt websites' presence in Google's search results in ways we don’t think webmasters intended.
In the interest of maintaining a healthy ecosystem and preparing for potential future open source releases, we're retiring all code that handles unsupported and unpublished rules (such as noindex) on September 1, 2019. For those of you who relied on the noindex indexing directive in the robots.txt file, which controls crawling, there are a number of alternative options:
  • Noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed (see the example after this list).
  • 404 and 410 HTTP status codes: Both status codes mean that the page does not exist, which will drop such URLs from Google's index once they're crawled and processed.
  • Password protection: Unless markup is used to indicate subscription or paywalled content, hiding a page behind a login will generally remove it from Google's index.
  • Disallow in robots.txt: Search engines can only index pages that they know about, so blocking the page from being crawled usually means its content won’t be indexed.  While the search engine may also index a URL based on links from other pages, without seeing the content itself, we aim to make such pages less visible in the future.
  • Search Console Remove URL tool: The tool is a quick and easy method to remove a URL temporarily from Google's search results.
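As a concrete illustration of the first option above, here is a small sketch; the page and server are hypothetical, and it is only meant to show the two supported forms of the noindex signal, an X-Robots-Tag HTTP response header and a robots meta tag in the HTML:
// Illustrative sketch: the two supported ways to express noindex once it is
// no longer read from robots.txt. The response below is what a hypothetical
// server could send for a page that should stay out of the index.
#include <iostream>

int main() {
  std::cout <<
      // Option 1: the X-Robots-Tag HTTP response header (also works for
      // non-HTML resources such as PDFs).
      "HTTP/1.1 200 OK\r\n"
      "Content-Type: text/html; charset=utf-8\r\n"
      "X-Robots-Tag: noindex\r\n"
      "\r\n"
      // Option 2: a robots meta tag inside the HTML <head>.
      "<!doctype html>\n"
      "<html>\n"
      "  <head>\n"
      "    <meta name=\"robots\" content=\"noindex\">\n"
      "    <title>Not for the index</title>\n"
      "  </head>\n"
      "  <body>Crawlable, but excluded from search results.</body>\n"
      "</html>\n";
}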
For more guidance about how to remove information from Google's search results, visit our Help Center. If you have questions, you can find us on Twitter and in our Webmaster Community, both offline and online.

Posted by Gary

For 25 years, the Robots Exclusion Protocol (REP) was only a de-facto standard. This sometimes had frustrating implications. For webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. For crawler and tool developers, it brought uncertainty as well; for example, how should they deal with robots.txt files that are hundreds of megabytes large?
Today, we announced that we're spearheading the effort to make the REP an internet standard. While this is an important step, it means extra work for developers who parse robots.txt files.
We're here to help: we open-sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 90's. Since then, the library has evolved; we learned a lot about how webmasters write robots.txt files and which corner cases we had to cover, and, where it made sense, we also added what we learned over the years to the internet draft.
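To give a rough idea of what using the library looks like in code, here is a minimal sketch; the class and method names follow our reading of the repository, so treat it as illustrative and check the GitHub README for the authoritative interface and build instructions:
// Sketch of calling the parser/matcher; the googlebot::RobotsMatcher class
// and its OneAgentAllowedByRobots method reflect our reading of the GitHub
// repository, so verify against the README before relying on them.
#include <iostream>
#include <string>

#include "robots.h"  // From the google/robotstxt repository.

int main() {
  const std::string robots_txt =
      "user-agent: FooBot\n"
      "disallow: /private/\n";
  const std::string url = "https://example.com/private/page.html";  // Hypothetical URL.

  googlebot::RobotsMatcher matcher;
  const bool allowed = matcher.OneAgentAllowedByRobots(robots_txt, "FooBot", url);
  std::cout << (allowed ? "allowed" : "disallowed") << "\n";  // Prints "disallowed".
}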
We also included a testing tool in the open source package to help you test a few rules. Once built, the usage is very straightforward:
robots_main <robots.txt content> <user_agent> <url>
If you want to check out the library, head over to our GitHub repository for the robots.txt parser. We'd love to see what you can build with it! If you do build something using the library, drop us a comment on Twitter, and if you have comments or questions about the library, find us on GitHub.
Posted by Edu Pereda, Lode Vandevenne, and Gary, Search Open Sourcing team

For 25 years, the Robots Exclusion Protocol (REP) has been one of the most basic and critical components of the web. It allows website owners to exclude automated clients, for example web crawlers, from accessing their sites - either partially or completely.
In 1994, Martijn Koster (a webmaster himself) created the initial standard after crawlers were overwhelming his site. With more input from other webmasters, the REP was born, and it was adopted by search engines to help website owners manage their server resources more easily.
However, the REP was never turned into an official Internet standard, which means that developers have interpreted the protocol somewhat differently over the years. And since its inception, the REP hasn't been updated to cover today's corner cases. This is a challenging problem for website owners because the ambiguous de-facto standard makes it difficult to write the rules correctly.
We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers. Together with the original author of the protocol, webmasters, and other search engines, we've documented how the REP is used on the modern web, and submitted it to the IETF.
The proposed REP draft reflects over 20 years of real-world experience with robots.txt rules, which are used both by Googlebot and other major crawlers, as well as by about half a billion websites that rely on the REP. These fine-grained controls give publishers the power to decide what they'd like to be crawled on their site and potentially shown to interested users. The draft doesn't change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends the protocol for the modern web. Notably:
  1. Any URI based transfer protocol can use robots.txt. For example, it's not limited to HTTP anymore and can be used for FTP or CoAP as well.
  2. Developers must parse at least the first 500 kibibytes of a robots.txt. Defining a maximum file size ensures that connections are not open for too long, alleviating unnecessary strain on servers.
  3. A new maximum caching time of 24 hours (or the value of a cache directive, if available) gives website owners the flexibility to update their robots.txt whenever they want, without crawlers overloading websites with robots.txt requests. For example, in the case of HTTP, Cache-Control headers could be used to determine the caching time (see the sketch after this list).
  4. The specification now provisions that when a previously accessible robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
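To make items 2 through 4 concrete, here is a small crawler-side sketch; it is our own illustration rather than code from the draft or the open-sourced library, and all helper names are hypothetical:
// Crawler-side sketch of provisions 2-4; our own illustration, not code from
// the draft or the open-sourced library. All names are hypothetical.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <optional>
#include <string>

using std::chrono::hours;

constexpr size_t kMaxParsedBytes = 500 * 1024;  // Provision 2: parse at least (here, at most) 500 KiB.
constexpr hours kDefaultCacheTtl{24};           // Provision 3: 24-hour maximum by default.

// Provision 2: cap how much of the fetched body is handed to the parser.
std::string BodyToParse(const std::string& fetched_body) {
  return fetched_body.substr(0, std::min(fetched_body.size(), kMaxParsedBytes));
}

// Provision 3: honor a cache directive (e.g. Cache-Control max-age) when one
// is present, otherwise fall back to the 24-hour maximum.
hours CacheTtl(std::optional<hours> max_age_from_headers) {
  return max_age_from_headers.value_or(kDefaultCacheTtl);
}

// Provision 4: if robots.txt is temporarily unreachable because of server
// failures, keep applying the last known rules for a reasonably long period,
// i.e. don't crawl pages that were previously disallowed.
bool MayCrawl(bool robots_fetch_failed_with_5xx, bool previously_disallowed) {
  if (robots_fetch_failed_with_5xx) return !previously_disallowed;
  return true;  // Otherwise decide from the freshly parsed rules (not shown).
}

int main() {
  std::cout << "bytes parsed: " << BodyToParse(std::string(600 * 1024, 'a')).size() << "\n";  // 512000
  std::cout << "cache ttl (h): " << CacheTtl(std::nullopt).count() << "\n";                   // 24
  std::cout << "may crawl: " << MayCrawl(/*failed_with_5xx=*/true, /*previously_disallowed=*/true) << "\n";  // 0
}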
Additionally, we've updated the augmented Backus–Naur form in the internet draft to better define the syntax of robots.txt, which is critical for developers to parse the lines.
RFC stands for Request for Comments, and we mean it: we uploaded the draft to IETF to get feedback from developers who care about the basic building blocks of the internet. As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right.
If you'd like to drop us a comment, ask us questions, or just say hi, you can find us on Twitter and in our Webmaster Community, both offline and online.

Posted by Henner Zeller, Lizzi Harvey, and Gary