Web Crawlers, the DMCA, and Thinking Ahead

Anton Yarkov
optiklab
Mar 1, 2021 · 6 min read
Cover image from Unsplash. Copyright © Markus Winkler

Who should read it

This article is for web content makers, owners of public content platforms, web developers, and anyone who might suddenly publish content that becomes the subject of a DMCA claim.
A couple of examples are Twitter, GitHub, and Vimeo: platforms that allow users to publish pictures, videos, and source code that might turn out to violate copyright law.

Disclaimer

Of course, when we are talking about public resources like Twitter, it is not a problem for someone to write a web crawler smart enough to analyze specific resources and copy or download all the content it can find (or save that content on the user’s machine). In this case, your platform or website is simply one point in the chain of content distribution and does not know how this content is supposed to be shared afterwards. Since such a crawler is a separate resource with its own mission and its own reasons to work with this information (it may or may not have to satisfy DMCA rules itself), there is not much you can do about it. Web crawlers overall may become a massive problem for DMCA applicability. On the other hand, they play a substantial role as an external cache that lets people find information that has been lost on the original resource.

So, what’s the problem?

It’s not that simple when we talk about such public and well-known resources as the Wayback Machine (https://web.archive.org/), though. Their publicly announced mission states that it

“is a digital archive of the World Wide Web.”

In other words, they are going to re-publish all the content they find on the internet, and they do it quite openly. Authors and platform owners should definitely care about this, because it is a de facto mirror of your website.

Recently, I did a couple of interesting investigations and reported them to both GitHub and Twitter officials. I think it will be valuable to share some details so that readers can understand the problem better.

Case with GitHub

Usually, when a repository becomes the subject of a DMCA claim, you can no longer access the code repository and instead see this kind of message:

DMCA’d code repository

However, you can go straight to the Wayback Machine, find this page as it was alive some time ago, look into it, read the whole content and, guess what? You can download the DMCA’d source code as a ZIP archive right from there.
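As a side note for platform owners: you can check programmatically whether a taken-down page is still mirrored by querying the Wayback Machine’s public availability endpoint. Here is a minimal Python sketch (the repository URL in the example is hypothetical); it only tells you whether a snapshot exists, it does not remove anything:

```python
from typing import Optional

import requests  # third-party HTTP client: pip install requests

WAYBACK_AVAILABILITY_API = "https://archive.org/wayback/available"


def find_wayback_snapshot(url: str) -> Optional[str]:
    """Return the URL of the closest available Wayback Machine snapshot, if any."""
    response = requests.get(WAYBACK_AVAILABILITY_API, params={"url": url}, timeout=10)
    response.raise_for_status()
    closest = response.json().get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


if __name__ == "__main__":
    # Hypothetical repository URL that received a DMCA takedown on the original site.
    print(find_wayback_snapshot("https://github.com/some-org/some-repo")
          or "No snapshot found")
```

Running such a check for every URL in your takedown database gives you an idea of how much of the removed content still lives on as mirrors.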

Case with Twitter

If you go to https://twitter.com/Notices_DMCA, you can find a whole bunch of DMCA’d Twitter posts.

DMCA’d Twitter post

Then you can use your e-mail to officially request access to the original URLs via https://www.lumendatabase.org and get this information within a minute or so. Now you can use those URLs to find the images in the web archive.

I’m sharing this because it outlines things that you might not worry about until you reach the point of no return. I have already communicated with both GitHub and Twitter about these use cases. But guess how many more cases are out there?

Why do I care?

Imagine how you would feel if your business were threatened one day by somebody sharing your private assets. It’s a real issue many people are facing today.

Imagine if some source code were not just stolen but also modified and re-published with a virus or trojan program inside.

Finally (a perhaps exaggerated example), imagine if you or a family member were slandered by a crowd using photos or anything else that became publicly available. This can hurt a lot, and there are no cheap ways to do anything about it.

It would be beneficial if all development teams followed some easy-to-use rules and frameworks and handled such cases appropriately and accurately.

So, what do I do?

Nowadays, if you develop a new “Clubhouse” startup that might reach millions of users daily, you should think about many things:
- GDPR (at least in the EU),
- copyright and DMCA (at least in the USA),
- data persistency and data retention policies,
- and so on.
It’s not enough to put an image or an archive file on your website behind a simple direct link to the resource, allowing anyone to crawl all your assets and keep them forever, anywhere, automatically. Again, this might play a GOOD or a BAD role, depending on the situation. But it would be best if you thought about the potential cases in advance.

Where to start

If you feel that your database or file storage is full of content that might be subject to DMCA claims, then I would start with this:

  1. Go over your database of DMCA’d content (i.e., the content, links, etc., that were the subject of a claim under the DMCA) and aggregate all the necessary information about the content that has to be closed from public access. This list should be made available to teams building web crawlers via subscription mechanisms of any kind (RSS, e-mail, XML/JSON files, REST APIs, etc.); a sketch of such an endpoint follows this list.
  2. Update your terms & conditions to publicly state the rules for re-publishing content and the specific steps, regulations, and requirements that you follow to protect data privacy & security and to comply with the DMCA. It would help to state that anyone who copies your content automatically becomes responsible for following the same rules. Publish appropriate documentation and an API for developers that will help them learn about the DMCA’d content mentioned above and automatically remove it from their databases or storage of any kind.
  3. Ask the community to work together and form a list/database of the most well-known web crawlers and websites available on the internet (I mean websites that are easily reachable and, like the Wayback Machine, mirror your data). Ask every team/company that you find to follow the rules and remove DMCA’d content, and provide them with the documentation and API mentioned above.
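As an illustration of point 1, here is a minimal sketch of such a subscription endpoint, using Flask. The route name, the in-memory data, and the claim identifier are hypothetical; in a real system the list would come from your database of DMCA claims.

```python
from datetime import datetime, timezone

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory data; in a real system this would come from the
# database of DMCA claims described in step 1. All timestamps are UTC.
DMCA_TAKEDOWNS = [
    {
        "url": "https://mycoolwebsite.com/imgserver1/photo-123.jpg",
        "claim_id": "DMCA-2021-001",          # hypothetical claim identifier
        "taken_down_at": "2021-02-15T10:00:00",
    },
]


@app.get("/api/dmca/takedowns")
def list_takedowns():
    """Return all takedowns, optionally filtered with ?since=<ISO-8601, UTC>,
    so crawler operators can poll incrementally instead of re-reading everything."""
    items = DMCA_TAKEDOWNS
    since_param = request.args.get("since")
    if since_param:
        since = datetime.fromisoformat(since_param)
        items = [t for t in items
                 if datetime.fromisoformat(t["taken_down_at"]) > since]
    return jsonify({
        "takedowns": items,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    })


if __name__ == "__main__":
    app.run(port=8080)
```

A crawler operator could then poll GET /api/dmca/takedowns?since=2021-02-01T00:00:00 once a day and purge every matching URL from their archive.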

If you are building a new feature that allows people to upload and share content with anyone, then I would also recommend working on preventing such issues in the future, as follows:

  1. Any resource (image, file, external link, etc.) should not be available via a simple direct link; it should require additional user interaction before the content can be seen (i.e., before it becomes downloadable by the browser or an API). Technically, this can be done with additional scripting that forms the link only after the user interacts with the user interface, not merely after the DOM (document object model) has been loaded into the browser. By doing this, you make 100% of “simple” web crawlers unable to download the content; they only capture the look & feel of the page. As said in my disclaimer above, you are still not protected from “smart” crawlers written explicitly for your website. A server-side sketch of this idea follows this list.
  2. Know who crawls you (see point 3 above). You can build a list of the most well-known crawlers and control their access to specific content by restricting URL access. For example, you could give access to HTML/CSS when crawling the link https://twitter.com/server1/… but limit downloading pictures from https://mycoolwebsite.com/imgserver1/… until the user interacts within an authenticated session; see the second sketch below.
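To make point 1 more concrete, here is a minimal server-side sketch of the idea, using Flask. The route names, the signing key, and the storage location are hypothetical; treat it as an illustration under those assumptions, not a definitive implementation. The page itself never embeds a direct asset link: the front end asks for one only when the user clicks “Download”, and the link it receives expires within a minute.

```python
import hashlib
import hmac
import time

from flask import Flask, abort, request, send_file

app = Flask(__name__)
SECRET_KEY = b"change-me"          # hypothetical signing key, keep it out of source control
LINK_TTL_SECONDS = 60              # issued links expire quickly


def sign(path: str, expires: int) -> str:
    """HMAC signature binding the asset path to an expiry timestamp."""
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


@app.post("/api/request-download")
def request_download():
    """Called by front-end script when the user explicitly clicks 'Download'.
    The initial HTML never contains a direct link to the asset."""
    path = request.json["path"]                      # e.g. "imgserver1/photo-123.jpg"
    expires = int(time.time()) + LINK_TTL_SECONDS
    token = sign(path, expires)
    return {"url": f"/assets/{path}?expires={expires}&token={token}"}


@app.get("/assets/<path:path>")
def serve_asset(path: str):
    """Reject any request without a fresh, valid signature, so a crawler that
    only follows links found in the HTML gets nothing."""
    expires = int(request.args.get("expires", 0))
    token = request.args.get("token", "")
    if expires < time.time() or not hmac.compare_digest(token, sign(path, expires)):
        abort(403)
    return send_file(f"protected-storage/{path}")    # hypothetical storage location
```

A crawler that only harvests links from the initial HTML never sees a working URL for the asset, while a real user gets one transparently after the click.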
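For point 2, one simple way to enforce such a policy is a before-request hook that compares the User-Agent with the list of known crawlers you maintain (see point 3 of the previous list) and the path being requested. Again a minimal sketch; the crawler names and path prefixes are examples, not a definitive list.

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Hypothetical policy: known archiving crawlers may fetch pages (HTML/CSS)
# but not raw assets such as images or archives.
KNOWN_CRAWLER_AGENTS = ("archive.org_bot", "ia_archiver", "Googlebot")
ASSET_PREFIXES = ("/imgserver1/", "/downloads/")


@app.before_request
def restrict_crawler_access():
    user_agent = request.headers.get("User-Agent", "")
    is_known_crawler = any(bot.lower() in user_agent.lower()
                           for bot in KNOWN_CRAWLER_AGENTS)
    is_asset = request.path.startswith(ASSET_PREFIXES)
    if is_known_crawler and is_asset:
        abort(403)  # crawlers get the pages, not the protected assets


@app.get("/")
def index():
    return "<html><body>Public page, crawlable.</body></html>"


@app.get("/imgserver1/<name>")
def image(name: str):
    return f"(binary image data for {name})"
```

Known crawlers still get the public pages, so the archive keeps its value as an external cache, while the raw assets stay behind the interaction-based links from the previous sketch.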

I have listed some of the technical and organizational solutions that I would use to handle DMCA issues in a more automated way. I know this complicates things quite a bit for somebody on a tiny budget.
But as the problem evolves, I believe large companies should figure out cheaper ways.

Please help me generate more ideas and ways to handle this in the comments.

Anton Yarkov
optiklab

Senior Software Engineer and Engineering Manager with 10+ years of experience in developing high-load online systems.