This a small script, which I used for it:
The mailbox needs to be piped through it, and in the meantime it will be deduplicated.
-D $((1024*1024*10)) sets a 10 Mebibyte cache, which is more than 10x the amount needed to deduplicate an entire year of my mail. YMMV, so adjust it accordingly. Setting it too high will cause some performance loss, setting it to low will let it slip duplicates.
https://linux.die.net/man/1/formail
formail is part of the procmail utility bundle, mktemp is part of coreutils.