From: W. Trevor King
Date: Fri, 7 Nov 2014 19:03:21 +0000 (+1600)
Subject: Mail archives in Git using ssoma
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=72f09751a66b15eaaf0c3c26e641f790a51da02f;p=notmuch-archives.git

Mail archives in Git using ssoma
---

diff --git a/64/45295b9f67d4c877c0e2e690c98bbf42e62414 b/64/45295b9f67d4c877c0e2e690c98bbf42e62414
new file mode 100644
index 000000000..26f8c0ac9
--- /dev/null
+++ b/64/45295b9f67d4c877c0e2e690c98bbf42e62414
@@ -0,0 +1,321 @@
+Return-Path:
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by olra.theworths.org (Postfix) with ESMTP id 16C81431FBC
+ for ; Fri, 7 Nov 2014 11:05:32 -0800 (PST)
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
+X-Spam-Flag: NO
+X-Spam-Score: 3.181
+X-Spam-Level: ***
+X-Spam-Status: No, score=3.181 tagged_above=-999 required=5
+ tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
+ FRT_SOMA=3.28, FRT_SOMA2=0.001, RCVD_IN_DNSWL_NONE=-0.0001]
+ autolearn=disabled
+Received: from olra.theworths.org ([127.0.0.1])
+ by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id jK9x8K3STcqZ for ;
+ Fri, 7 Nov 2014 11:05:23 -0800 (PST)
+Received: from resqmta-po-03v.sys.comcast.net (resqmta-po-03v.sys.comcast.net
+ [96.114.154.162])
+ (using TLSv1 with cipher DHE-RSA-AES128-SHA (128/128 bits))
+ (No client certificate requested)
+ by olra.theworths.org (Postfix) with ESMTPS id 85174431FB6
+ for ; Fri, 7 Nov 2014 11:05:23 -0800 (PST)
+Received: from resomta-po-02v.sys.comcast.net ([96.114.154.226])
+ by resqmta-po-03v.sys.comcast.net with comcast
+ id Cj4Y1p0084tLnxL01j5P4q; Fri, 07 Nov 2014 19:05:23 +0000
+Received: from odin.tremily.us ([24.18.63.50])
+ by resomta-po-02v.sys.comcast.net with comcast
+ id Cj3N1p005152l3L01j3NKA; Fri, 07 Nov 2014 19:03:23 +0000
+Received: by odin.tremily.us (Postfix, from userid 1000)
+ id AD3631476F60; Fri, 7 Nov 2014 11:03:21
-0800 (PST)
+DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tremily.us; s=odin;
+ t=1415387001; bh=Kh2QLEy4SAP5cx4l1iaezk+MrusS9fnXrUxa4FYdJUw=;
+ h=Date:From:To:Cc:Subject;
+ b=QlDq+Pvb+F9u6VN6FLPdEWkSgOglKj3QK0fJsxfGKjZVN8QQnpuO4JiXpB9qbIZIF
+ PnE0EJxcMQdel+3d6QF7WQvKInR/bIK/juQ87buJPerXtam+lZ8GEXNeeGqiuLqWvI
+ xQ6roGlXAr7i2KUzn+1++O9nU8pock1SIv4H54MY=
+Date: Fri, 7 Nov 2014 11:03:21 -0800
+From: "W. Trevor King"
+To: notmuch@notmuchmail.org
+Subject: Mail archives in Git using ssoma
+Message-ID: <20141107190321.GL23609@odin.tremily.us>
+MIME-Version: 1.0
+Content-Type: multipart/signed; micalg=pgp-sha1;
+ protocol="application/pgp-signature"; boundary="9JSHP372f+2dzJ8X"
+Content-Disposition: inline
+OpenPGP: id=39A2F3FA2AB17E5D8764F388FC29BDCDF15F5BE8;
+ url=http://tremily.us/pubkey.txt
+User-Agent: Mutt/1.5.23 (2014-03-12)
+DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net;
+ s=q20140121; t=1415387123;
+ bh=EMVFRZ821lxzwm/G4cwidMqYECm8nLl+PTjsH7YQgfc=;
+ h=Received:Received:Received:Date:From:To:Subject:Message-ID:
+ MIME-Version:Content-Type;
+ b=maAC+hXiJiVeTj+ewX3r21eF6EyrELH1QknYcT5nmUHSuhMpMMdA0xOnjxkgybAmt
+ ZoS1qi6ELkZTJ820eHxeIQ46z4pGBnrUyGlnDbD5Jgz3QzZC+90ynnk1kO5AcvOXcF
+ zEqeO6VkgseP4KZVAn3aQyF8uZVI6EioFxohztvVXcqk1jKlguia3S1+XmmV5z5sh3
+ UYwEaEj5QSHfMnBnCKu5NkJtlb1guqsidWYJNFcY6x8lipa+Gqi9uS26uqhEcU2zzA
+ cQeBzdr1WztR1Ra63FOaRAG2Vx+YQLMdDEu20L2HFhm6wbt9S0PySCIuSc+6toMebB
+ /+n/87/eNSakg==
+Cc: Eric Wong
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.13
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+
+List-Unsubscribe: ,
+
+List-Archive:
+List-Post:
+List-Help:
+List-Subscribe: ,
+
+X-List-Received-Date: Fri, 07 Nov 2014 19:05:32 -0000
+
+
+--9JSHP372f+2dzJ8X
+Content-Type: text/plain; charset=utf-8
+Content-Disposition: inline
+Content-Transfer-Encoding: quoted-printable
+
+Hello everyone :),
+
+I like Git, so when folks suggest storing things in Git, I'm usually
+excited ;).
Eric Wong has been working on some tools to store email
+in a Git repository, and his client-side code is ssoma [1]. I wanted
+a bit more metadata than the stock ssoma-mda [2], and ended up just
+writing a ssoma-mda in Python [3]. It needs Python =E2=89=A53.4 and pygit2.
+I had pygit2 already installed for Python 3.3 (which gave me a local
+libgit2), so I used pip to install it for 3.4:
+
+  $ python3.4 -m ensurepip --user
+  $ pip3.4 install --user pygit2
+
+Then I grabbed the archives, and pulled them into Git:
+
+  $ wget http://notmuchmail.org/archives/notmuch.mbox
+  $ git init --bare notmuch-archives.git
+  $ cd notmuch-archives.git
+  $ python3.4
+  >>> import email.utils
+  >>> import mailbox
+  >>> import ssoma_mda
+  >>> mbox =3D mailbox.mbox('../notmuch.mbox', factory=3DNone, create=3DFal=
+se)
+  >>> messages =3D sorted(mbox, key=3Dlambda m: email.utils.mktime_tz(email=
+=2Eutils.parsedate_tz(m['date'])))
+  >>> for message in messages:
+  ...     if ((message['message-id'] =3D=3D '<m2k4gmyjer.fsf@ecocode.net>' =
+and
+  ...          message['X-List-Received-Date'] =3D=3D 'Sat, 26 Feb 2011 =
+14:23:34 -0000') or
+  ...         (message['message-id'] =3D=3D '<4EDF728E.3050204@gmail.com>=
+' and
+  ...          message['X-List-Received-Date'] =3D=3D 'Wed, 07 Dec 2011 =
+14:05:16 -0000') or
+  ...         (message['message-id'] =3D=3D '<4FE369F2.5080804@gmail.com>'=
+ and
+  ...          message['X-List-Received-Date'] =3D=3D 'Thu, 21 Jun 2012 =
+18:38:07 -0000') or
+  ...         (message['message-id'] =3D=3D '<5122353D.4060601@gmail.com>=
+' and
+  ...          message['X-List-Received-Date'] =3D=3D 'Mon, 18 Feb 2013 =
+14:06:12 -0000') or
+  ...         (message['message-id'] =3D=3D '<CA+eQo_1hMsTD4+6ifqgEQXW0_qYX=
+GOdfkO6tBuGQKV+W7OSaKA@mail.gmail.com>' and
+  ...          message['X-List-Received-Date'] =3D=3D 'Wed, 24 Apr 2013 =
+18:09:55 -0000') or
+  ...         (message['message-id'] =3D=3D '<527B9E8C.5000001@krugs.de>'=
+ and
+  ...          message['X-List-Received-Date'] =3D=3D 'Thu, 07 Nov 2013 =
+14:07:32 -0000') or
+  ...         (message['message-id'] =3D=3D '<1399645162-8653-1-git-send-=
+email-wael.nasreddine@gmail.com>' and
+  ...
message['X-List-Received-Date'] =3D=3D 'Fri, 09 May 2014 =
+14:19:36 -0000') or
+  ...         (message['message-id'] =3D=3D '<m2mw9xkyvg.fsf@krugs.de>' a=
+nd
+  ...          message['X-List-Received-Date'] =3D=3D 'Thu, 18 Sep 2014 =
+10:27:35 -0000') or
+  ...         (message['message-id'] =3D=3D '<cover.1411379395.git.jani@nik=
+ula.org>' and
+  ...          message['X-List-Received-Date'] !=3D 'Mon, 22 Sep 2014 09=
+:54:16 -0000')):
+  ...         continue
+  ...     ssoma_mda.deliver(message=3Dmessage, once=3DTrue)
+  >>> ^D
+
+On my 1.1GHz Intel Celeron 847 Sandy Bridge netbook, that took about
+half an hour. The initial repository was large:
+
+  $ du -hs .
+  394M	.
+
+But packing it up made it small:
+
+  $ git gc --aggressive
+  $ du -hs .
+  51M	.
+
+With a few fewer messages than the mbox:
+
+  $ git log --oneline | wc -l
+  19650
+
+Compared with 19660 messages in the mbox at 107 MB (160 MB for the
+associated Maildir).
+
+The messages I dropped had duplicate Message-IDs:
+
+* id:m2k4gmyjer.fsf@ecocode.net had different received dates:
+
+  -X-List-Received-Date: Sat, 26 Feb 2011 14:12:20 -0000
+  +X-List-Received-Date: Sat, 26 Feb 2011 14:23:34 -0000
+
+  but no significant differences.
+
+* id:4EDF728E.3050204@gmail.com had a real address in the
+  first-to-arrive version:
+
+  -X-List-Received-Date: Wed, 07 Dec 2011 14:10:13 -0000
+  -> <4winter@informatik.uni-hamburg.de>
+
+  and an obfuscated one in the second-to-arrive version:
+
+  +X-List-Received-Date: Wed, 07 Dec 2011 14:05:16 -0000
+  +> <4winter-jNDFPZUTrfQBEfOqpokbeYV0Y/DQsy6Ps0AfqQuZ5sE@public.gmane.or=
+g>
+
+* id:4FE369F2.5080804@gmail.com had the same:
+
+  -X-List-Received-Date: Thu, 21 Jun 2012 18:37:54 -0000
+  -> > wrote:
+
+* id:5122353D.4060601@gmail.com had different received dates:
+
+  -X-List-Received-Date: Mon, 18 Feb 2013 14:06:05 -0000
+  +X-List-Received-Date: Mon, 18 Feb 2013 14:06:12 -0000
+
+  but no significant differences.
+
+* id:CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA@mail.gmail.com
+  had different MIME boundaries:
+
+  -Content-Type: multipart/alternative; boundary=3Df46d043be11ac45a0904db=
+1f3428
+  -X-List-Received-Date: Wed, 24 Apr 2013 18:09:46 -0000
+
+  +Content-Type: multipart/alternative; boundary=3De89a8f646ff3faa11d04db=
+1f3294
+  +X-List-Received-Date: Wed, 24 Apr 2013 18:09:55 -0000
+
+  but no significant differences.
+
+* id:527B9E8C.5000001@krugs.de had obfuscated addresses:
+
+  -X-List-Received-Date: Thu, 07 Nov 2013 14:07:33 -0000
+  -> Rainer M Krug writes:
+
+  +X-List-Received-Date: Thu, 07 Nov 2013 14:07:32 -0000
+  +> Rainer M Krug writes:
+
+* id:1399645162-8653-1-git-send-email-wael.nasreddine@gmail.com had
+  additional content in the later submission:
+
+  -Subject: [PATCH] Add Travis-CI config file.
+  -Date: Fri, 9 May 2014 07:19:22 -0700
+  -X-List-Received-Date: Fri, 09 May 2014 14:19:36 -0000
+  - .travis.yml | 10 ++++++++++
+  - 1 file changed, 10 insertions(+)
+
+  +Subject: [PATCH v2] Enable Travis-CI as a backup continuous integration
+    service.
+  +Date: Fri, 9 May 2014 14:44:50 -0700
+  +X-List-Received-Date: Fri, 09 May 2014 21:45:16 -0000
+  +
+  +The v2 adds a notification section to send failure (or back to passing=
+) notifications
+  +to the mailing list and to the IRC channel
+  +
+  + .travis.yml | 13 +++++++++++++
+  + 1 file changed, 13 insertions(+)
+
+* id:m2mw9xkyvg.fsf@krugs.de had an obfuscated address and different signat=
+ure:
+
+  -X-List-Received-Date: Thu, 18 Sep 2014 10:27:31 -0000
+  ->> guyzmo writes:
+   -----BEGIN PGP SIGNATURE-----
+   Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
+  -iQEcBAEBAgAGBQJUGrN3AAoJENvXNx4PUvmC4J0IAN9Wf+0ArvirJCoewItnEZoo
+  -ySg4VRP7uWVqDxHVl5N9XFv4YE2bZ2E2eMGvbo6v7I82lhqeR5dauZhlgCMki+ZI
+
+  +X-List-Received-Date: Thu, 18 Sep 2014 10:27:35 -0000
+  +>> guyzmo writes:
+   -----BEGIN PGP SIGNATURE-----
+   Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
+  +iQEcBAEBAgAGBQJUGrN4AAoJENvXNx4PUvmC6LsIAIaFrd4MFnm8EixrAHPGfW6j
+  +L3KNG7Dv+hQuNRUN6qn+emZHI8wX4O74HOZOpZWkE09CmjkPJBmf7IuJwtz2ONbM
+
+* id:cover.1411379395.git.jani@nikula.org came in three times, with
+  three dates, but no significant differences:
+
+  Date: Mon, 22 Sep 2014 11:54:20 +0200
+  X-List-Received-Date: Mon, 22 Sep 2014 09:54:16 -0000
+
+  Date: Mon, 22 Sep 2014 11:54:42 +0200
+  X-List-Received-Date: Mon, 22 Sep 2014 09:54:37 -0000
+
+  Date: Mon, 22 Sep 2014 11:54:51 +0200
+  X-List-Received-Date: Mon, 22 Sep 2014 09:54:49 -0000
+
+Anyhow, I've pushed the Git archive [4,5] if anyone wants to play
+around with ssoma. I think this would be a nice backend for folks
+building notmuch-based web archives, and pulling from Git is easier
+than downloading a new mbox ;).
+
+Cheers,
+Trevor
+
+[1]: http://ssoma.public-inbox.org/README
+[2]: http://public-inbox.org/meta/m/ec8f54cf6451eef6e9f59eff691cd9002f4fdf6=
+5.html
+[3]: http://git.tremily.us/?p=3Dssoma-mda.git;a=3Dshortlog;h=3Drefs/heads/p=
+ython
+     I have an uncommitted patch to work around http://bugs.python.org/issu=
+e22684
+[4]: http://git.tremily.us/?p=3Dnotmuch-archives.git
+[5]: git://tremily.us/notmuch-archives.git
+
+--=20
+This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
+For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy
+
+--9JSHP372f+2dzJ8X
+Content-Type: application/pgp-signature; name="signature.asc"
+Content-Description: OpenPGP digital signature
+
+-----BEGIN PGP SIGNATURE-----
+Version: GnuPG v2
+
+iQIcBAEBAgAGBQJUXRd3AAoJEG8/JgBt8ol86+IQAIJCbVS05SYDLgtTfo7+petJ
+hptvLRSWq0jNefa5r1ipOpCrzpWDSe/Rjcdb09jodvZzsNoHrS3rNQJPKmEvcE6b
+xP6YZ9wukBtMbSHx0XoCUpRri5LlQ2+774vMq2riT1X0qZov/uNG19XUorAoGj5U
+f64pYH7Q8rVk7NwfszNgmgbrujXoBMRIJV5CVkdiCOnTPxr9zmZc7wXuQheIueO6
+Ow0aR9A++Wo8lUwCpQPRqTr2Fl4xxBwtLhigJOezOh3gbqGavaua6j0K+B1oQ1nL
+W0iyE+GE4HVzx3npYWEqROMPnZ7Dsoiz2oQrbAZ+Xnkjw2SZyaFoI7KfpDa6WgD0
+hmVEdUBYD5uvrqmqKA12R6P70skiuujgKiW8npVcU2Xggoe0sS/gR6adkV2joF7F
+qeTNJ+AqzL2S7WQ7Kja43Y+a2Nrsk3nbDMDRmgUK+DL2JzXKcx9HtZO/9JeKMwh5
+xtsZJ08D2rgOMgM4pW6ZxZGcDLVeKDqvDF+dZA6v/ruaIJmbyen6RBGc6J63cSGI
+wfn1xFUbG0ZxhnV896UTuEMH5861pzenpXM2IZsT7T0XPCO/bTNdaBylnahQvBP4
+tIFD2smexq6CGAyw1SEy3CcJrFFyozAJ48gGaBmOdLt+SfoKrF9j/XX1bb4YqcWb
+q9xzd66reO3ffkkXPPuV
+=aswS
+-----END PGP SIGNATURE-----
+
+--9JSHP372f+2dzJ8X--
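
A note on the import loop in the archived message: it skips duplicates by listing each (Message-ID, X-List-Received-Date) pair by hand. As a rough sketch only, and not what ssoma-mda or the author's session does, the same filtering could be generalized by keeping one copy per Message-ID. The helper names `timestamp` and `unique_messages` are made up for this illustration, and the "keep the earliest X-List-Received-Date" policy is an assumption; the hand-picked list in the message sometimes keeps a later copy instead.

```python
import email.utils


def timestamp(message, header):
    """Return a header's date as seconds since the epoch, or None if unparsable."""
    parsed = email.utils.parsedate_tz(message[header] or '')
    return email.utils.mktime_tz(parsed) if parsed else None


def unique_messages(messages):
    """Return messages sorted by Date, with one copy kept per Message-ID.

    Assumed policy (not the author's): among copies sharing a Message-ID,
    keep the one with the earliest X-List-Received-Date.
    """
    kept = {}     # Message-ID -> (received timestamp, message)
    unkeyed = []  # messages lacking a Message-ID are never merged
    for message in messages:
        message_id = message['message-id']
        if message_id is None:
            unkeyed.append(message)
            continue
        received = timestamp(message, 'X-List-Received-Date')
        previous = kept.get(message_id)
        if (previous is None or
                (received is not None and
                 (previous[0] is None or received < previous[0]))):
            kept[message_id] = (received, message)
    survivors = unkeyed + [message for _, message in kept.values()]
    # Messages without a parsable Date sort first (timestamp treated as 0).
    return sorted(survivors, key=lambda m: timestamp(m, 'date') or 0)
```

The result could then be fed to the delivery loop in place of the hand-written condition, for example by iterating over `unique_messages(mbox)` and calling `ssoma_mda.deliver` on each message as the session above does.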