Part of SEO is making sure that what Google sees is exactly what you want it to see. Nothing more, nothing less. Lots of sites though, especially Government ones, let Google in to see everything. As you will see in today’s show, there is at least one file I found that the Victorian Government expressly said they did not want released to the public. There it was though, sitting on a web server just waiting for Google to come along and find it. If I do this search, site:mornpen.vic.gov.au filetype:doc, I can see that my own council has 89 MS Word documents sitting on their web server. Most of those seem to be application forms of one description or another. If there is also an HTML version of these forms though, then we have a duplicate content issue. The same applies to PDFs and text files, which tend to be the other most popular duplicates. This is because a lot of sites offer a web page in different formats for accessibility reasons.
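If you want to run the same kind of check across several file types, here is a minimal Python sketch that just builds the search strings for you to paste into Google. The function name build_filetype_queries and the default list of file types are my own assumptions for illustration; only the site: and filetype: operators come from the search above.

```python
def build_filetype_queries(domain, filetypes=("doc", "pdf", "txt")):
    """Return one Google search string per file type for the given domain."""
    # Each string combines the site: and filetype: operators,
    # in the same form as the search shown above.
    return [f"site:{domain} filetype:{ft}" for ft in filetypes]


# Print the searches for the council site used in the example.
for query in build_filetype_queries("mornpen.vic.gov.au"):
    print(query)
```

Swap in your own domain and file types to see roughly how many documents of each kind Google has found on your site.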
There is no reason for Google to see these versions of the page or even know they are there. I have seen some sites where the majority of the content is PDF files. This is an issue, as the PDF files are usually either duplicates of a web page or duplicates of a manufacturer’s product sheet. You can’t put a canonical tag in these files to tell Google this is just another version. You need to block Google from them completely using robots.txt. However, if it is sensitive information, do not have it on your web server at all.
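To make the robots.txt step concrete, here is a rough Python sketch that generates rules for the file types mentioned above. It assumes the crawler supports the * and $ wildcards in robots.txt paths, which Google documents for Googlebot. The helper names build_robots_txt and pattern_blocks are hypothetical, and pattern_blocks is only a simplified approximation of how such a pattern is matched, not Google's actual matcher.

```python
import re


def build_robots_txt(extensions, user_agent="*"):
    """Return robots.txt text that disallows URLs ending in the given extensions."""
    lines = [f"User-agent: {user_agent}"]
    for ext in extensions:
        ext = ext.lstrip(".").lower()
        # "*" matches any run of characters; "$" anchors the end of the URL.
        lines.append(f"Disallow: /*.{ext}$")
    return "\n".join(lines) + "\n"


def pattern_blocks(pattern, path):
    """Simplified check of whether a wildcard Disallow pattern matches a URL path."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and let each "*" match anything.
    regex = "^" + ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        regex += "$"
    return re.search(regex, path) is not None


rules = build_robots_txt(["doc", "pdf", "txt"])
print(rules)

# Hypothetical paths, used only to illustrate the matching.
print(pattern_blocks("/*.pdf$", "/forms/permit.pdf"))   # True
print(pattern_blocks("/*.pdf$", "/forms/permit.html"))  # False
```

Keep in mind that robots.txt paths are case-sensitive, so a file ending in .PDF would need its own rule. The robots.txt file is also public, so it is no place to list anything sensitive.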
Your web server is a very public place and it is not where you should be storing documents. I’ve unintentionally found some interesting documents over the years. About ten years ago I was googling to see if I could find someone’s mobile phone number. I ended up finding the entire member database of the Australian Information Industry Association (AIIA), complete with usernames and passwords. I think I still have it somewhere. I did the right thing and informed them of it, and did not publish a story until they had it all secured. Thankfully these days we see fewer of these sorts of accidental data breaches, but you still do not want Google crawling different file types of your content. If you have some interesting searches you’d like to share, please do!