HTTPS sites going missing in the Internet Archive?

The Wayback Machine is a wonderful piece of technology. It crawls sites on the internet and stores the history of the publicly available web. That is a very important task in this ever-changing environment.

However, I’ve just noticed (though I may simply have been unlucky) that HTTPS sites, i.e. sites served over SSL/TLS encryption, do not seem to be archived. I noticed this because I wanted to archive the site Free & Social, a StatusNet instance that the Swedish Pirate Party runs. So I posted a message on the Internet Archive’s forum, hoping to clarify whether this is a feature or a bug:

I’ve tried searching on the web and checking the FAQ, but I couldn’t seem to find an answer to why SSL sites don’t work with the Wayback Machine.

With more and more sites using https, everything from personal blogs to just about any site with a login, it would be a shame if the Internet Archive could not fetch these. Is there a technical difficulty that must be managed, or some other reasoning behind this?

If I’m just mistaken and the https sites I’ve tried have malfunctioned for other reasons, I apologize. But from what I can see, a large (and growing) part of the internet is unfortunately not part of this mission as long as only HTTP connections get crawled.

I’m thinking maybe they don’t crawl SSL sites for some odd reason, such as identity verification. Perhaps they feel they can’t serve the site “properly” afterwards, or they argue that SSL sites are more secret. But I would argue that SSL sites only use SSL because cleartext transmissions are too easy to manipulate. The content itself is no different from what would be served over plain HTTP.
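To rule out plain bad luck, one could ask the Wayback Machine directly whether it holds a capture of a given address. Below is a minimal sketch in Python, assuming the Internet Archive's public availability endpoint at archive.org/wayback/available and its JSON reply with an "archived_snapshots" / "closest" entry. The function names are my own, and I have not checked how the service treats the http and https forms of the same address.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the Internet Archive's public availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url):
    """Return the lookup URL that asks whether page_url has a capture."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(response_text):
    """Return the URL of the closest capture in a JSON reply, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def is_archived(page_url, timeout=10):
    """Query the service over the network; True if a capture is reported."""
    with urllib.request.urlopen(build_query(page_url), timeout=timeout) as resp:
        return closest_snapshot(resp.read().decode("utf-8")) is not None
```

Calling is_archived() once with the http:// form and once with the https:// form of the same site would show whether only the encrypted address is missing, or whether the site is simply not captured at all.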
