Anonymising email archives

I've been thinking about a use case: publishing mailing list archives that have not previously been public. This requires anonymising the emails. Level 1 would anonymise every email, leaving only the subject and a server timestamp, to retain some semblance of a thread for discussions. A “higher” level would remove individual identifiers but still keep emails from any given individual consistently labelled as such, to give discussion threads additional structure.

My intent is to use this topic thread to note ideas for what is needed in the general case for level 1, and how to add more fine-grained capabilities.

The first step seems reasonably obvious: strip any headers added before the email was received by the mailing list server.

Noting that for higher levels, those same headers might be a way to help track users across threads.
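
A rough sketch of what level 1 could look like in Python, using only the standard library `email` package. The hostname `lists.example.org` and the function names are made up for illustration, and it assumes the list server stamps its own `Received` header with a date after the last semicolon. It builds a fresh message rather than deleting headers one by one, so anything not explicitly copied is dropped. The body is left out here and would need its own treatment.

```python
from email import message_from_string
from email.message import EmailMessage
from email.utils import format_datetime, parsedate_to_datetime

LIST_HOST = "lists.example.org"  # hypothetical list server hostname


def server_timestamp(msg, list_host=LIST_HOST):
    """Return the datetime from the Received header added by the list server."""
    for received in msg.get_all("Received", []):
        # Assumed shape: "from X by Y with ...; <date>"
        route, _, stamp = received.rpartition(";")
        if f"by {list_host}" in " ".join(route.split()):
            return parsedate_to_datetime(stamp.strip())
    return None


def level1(raw, list_host=LIST_HOST):
    """Build a new message carrying only the subject and the server timestamp."""
    original = message_from_string(raw)
    stripped = EmailMessage()
    # Collapse folded whitespace so the subject is a single line.
    stripped["Subject"] = " ".join(str(original.get("Subject", "")).split())
    stamp = server_timestamp(original, list_host)
    if stamp is not None:
        stripped["Date"] = format_datetime(stamp)
    return stripped


raw = (
    "Received: from mail.sender.example by lists.example.org with ESMTP;"
    " Tue, 02 Jan 2024 10:00:05 +0000\n"
    "Received: from laptop.home by mail.sender.example with ESMTP;"
    " Tue, 02 Jan 2024 10:00:01 +0000\n"
    "From: Alice <alice@sender.example>\n"
    "Message-ID: <abc123@sender.example>\n"
    "Subject: Re: budget\n"
    "\n"
    "Body text here.\n"
)
stripped = level1(raw)
print(stripped.keys())
```

Only the `Received` line written by the list server itself is consulted. The earlier hop, along with `From` and `Message-ID`, never makes it into the output.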

If it’s an email list archive, why is the mailing list server still involved? If the script operates on an archive, it can assign anonymous IDs and randomly generated labels to all parties involved and maintain the thread between those IDs when publishing. So publicly, a thread between John and Mary will appear as being between e.g. BlueDragon and PinkPanther.
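
Something along these lines might do for the label assignment. It is only a sketch: the `Pseudonymiser` class and the word lists are invented for the example, and it assumes the sender's address in the `From` header is a good enough key for "same individual". The mapping lives in memory and would have to stay private, since publishing it would undo the anonymisation.

```python
import secrets
from email.utils import parseaddr

# Hypothetical word lists; a real run would need far more combinations.
COLOURS = ["Blue", "Pink", "Green", "Amber", "Grey", "Red"]
ANIMALS = ["Dragon", "Panther", "Otter", "Heron", "Badger", "Lynx"]


class Pseudonymiser:
    """Hand out one random label per email address and reuse it consistently."""

    def __init__(self):
        self._labels = {}   # address -> label (private, never published)
        self._used = set()  # labels already handed out

    def label_for(self, from_header):
        # parseaddr pulls the bare address out of 'John <john@example.org>'.
        _, address = parseaddr(from_header)
        key = address.lower()
        if key not in self._labels:
            self._labels[key] = self._new_label()
        return self._labels[key]

    def _new_label(self):
        if len(self._used) >= len(COLOURS) * len(ANIMALS):
            raise RuntimeError("ran out of unique labels")
        while True:
            label = secrets.choice(COLOURS) + secrets.choice(ANIMALS)
            if label not in self._used:
                self._used.add(label)
                return label


names = Pseudonymiser()
john = names.label_for("John <john@example.org>")
mary = names.label_for("Mary <mary@example.org>")
john_again = names.label_for("JOHN@example.org")
print(john, mary, john_again)
```

Because the labels are drawn at random rather than derived from the address, nobody can recompute them from a guessed address. The catch is that the same mapping has to be kept around if the archive is ever re-exported.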

I’m assuming the archive stores all the headers, although I’ve not actually tested that assumption.