public/robots.txt not properly configured to prevent crawling
Summary
Web crawling was hitting download URLs, causing the cache/archive to fill up at an alarming rate. The existing default robots.txt is not formatted according to the robots.txt standard, which expects records of the form:
# User-agent line followed by Disallow line(s)
# No blank lines between them (blank lines separate records)
User-agent: *
Disallow: /wizardworld/map/ # This is an infinite virtual URL space
# An alternate Disallow pattern
# must be in its own record with its own User-agent
# Wizardmapper knows where to go.
User-agent: wizardmapper
Disallow:
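These rules can be sanity-checked before deployment. The following is a minimal sketch using Python's standard-library robots.txt parser; the host name is a placeholder, not the real site.

from urllib.robotparser import RobotFileParser

# The example rules from above: block the infinite map space for everyone,
# but let the wizardmapper agent go anywhere.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wizardworld/map/

User-agent: wizardmapper
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Generic crawlers should be kept out of the map space...
print(parser.can_fetch("*", "https://example.test/wizardworld/map/room1"))             # False
# ...while wizardmapper is explicitly allowed in.
print(parser.can_fetch("wizardmapper", "https://example.test/wizardworld/map/room1"))  # True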
Steps to reproduce
I have not tried to reproduce this myself and have only seen the symptom of crawling, but it could most likely be reproduced using Python's Scrapy library, along the lines of the sketch below.
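A rough reproduction sketch with Scrapy, under the assumption that the domain and start URL below are placeholders: with ROBOTSTXT_OBEY enabled the spider behaves like a polite crawler, so any download URLs it still requests indicate that the current robots.txt is not blocking them.

import scrapy
from scrapy.crawler import CrawlerProcess


class RobotsCheckSpider(scrapy.Spider):
    name = "robots_check"
    allowed_domains = ["example.test"]          # placeholder domain
    start_urls = ["https://example.test/"]      # placeholder start page
    custom_settings = {"ROBOTSTXT_OBEY": True}  # honor robots.txt like a polite crawler

    def parse(self, response):
        # Log every page actually fetched and follow links, so it is visible
        # whether download URLs are reachable despite robots.txt.
        self.logger.info("Fetched %s", response.url)
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(RobotsCheckSpider)
    process.start()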
Expected behavior
Crawlers that honor the robots.txt standard should not be crawling the downloads page.
Actual behavior
Robots are crawling all pages, including the downloads page
Relevant logs and/or screenshots
This is a private repo, and as a result I will not be pasting any logs.
Output of checks
GitLab checks work fine; this is not applicable to them.
Results of GitLab environment info
Not applicable