Return-Path: X-Original-To: notmuch@notmuchmail.org Delivered-To: notmuch@notmuchmail.org Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id 20B9D431FC7 for ; Thu, 30 Oct 2014 01:16:53 -0700 (PDT) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: -1.098 X-Spam-Level: X-Spam-Status: No, score=-1.098 tagged_above=-999 required=5 tests=[DKIM_ADSP_CUSTOM_MED=0.001, FREEMAIL_FROM=0.001, NML_ADSP_CUSTOM_MED=1.2, RCVD_IN_DNSWL_MED=-2.3] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id xL21Vn0hML0p for ; Thu, 30 Oct 2014 01:16:44 -0700 (PDT) Received: from mail2.qmul.ac.uk (mail2.qmul.ac.uk [138.37.6.6]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by olra.theworths.org (Postfix) with ESMTPS id 8435B431FB6 for ; Thu, 30 Oct 2014 01:16:43 -0700 (PDT) Received: from smtp.qmul.ac.uk ([138.37.6.40]) by mail2.qmul.ac.uk with esmtp (Exim 4.71) (envelope-from ) id 1Xjku5-0006EO-SM; Thu, 30 Oct 2014 08:16:42 +0000 Received: from 5751dfa2.skybroadband.com ([87.81.223.162] helo=localhost) by smtp.qmul.ac.uk with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.71) (envelope-from ) id 1Xjku3-0001UO-45; Thu, 30 Oct 2014 08:16:21 +0000 From: Mark Walters To: Michal Sojka , notmuch@notmuchmail.org Subject: Re: [PATCH v4 5/6] cli: search: Add configurable way to filter out duplicate addresses In-Reply-To: <1414421455-3037-6-git-send-email-sojkam1@fel.cvut.cz> References: <1414421455-3037-1-git-send-email-sojkam1@fel.cvut.cz> <1414421455-3037-6-git-send-email-sojkam1@fel.cvut.cz> User-Agent: Notmuch/0.18.1+86~gef5e66a (http://notmuchmail.org) Emacs/23.4.1 (x86_64-pc-linux-gnu) Date: Thu, 30 Oct 2014 08:16:18 +0000 Message-ID: <87egtqug4t.fsf@qmul.ac.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Sender-Host-Address: 87.81.223.162 X-QM-Geographic: According to ripencc, this message was delivered by a machine in Britain (UK) (GB). X-QM-SPAM-Info: Sender has good ham record. :) X-QM-Body-MD5: 9280583bd0e220ac2a1f2bae08d1669d (of first 20000 bytes) X-SpamAssassin-Score: -0.1 X-SpamAssassin-SpamBar: / X-SpamAssassin-Report: The QM spam filters have analysed this message to determine if it is spam. We require at least 5.0 points to mark a message as spam. This message scored -0.1 points. Summary of the scoring: * 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider * (markwalters1009[at]gmail.com) * -0.1 AWL AWL: From: address is in the auto white-list X-QM-Scan-Virus: ClamAV says the message is clean X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Oct 2014 08:16:53 -0000 On Mon, 27 Oct 2014, Michal Sojka wrote: > This adds an algorithm to filter out duplicate addresses from address > outputs (sender, receivers). The algorithm can be configured with > --filter-by command line option. > > The code here is an extended version of a patch from Jani Nikula. Hi As this is getting into the more controversial bike shedding region I wonder if it would be worth splitting this into 2 patches: the first could do the default dedupe based on name/address and the second could do add the filter-by options.=20 I think the default deduping is obviously worth doing but I am not sure about the rest. In any case I think the default deduping could go in pre-freeze but I would recommend the rest is left until after. > --- > completion/notmuch-completion.bash | 6 ++- > completion/notmuch-completion.zsh | 3 +- > doc/man1/notmuch-search.rst | 38 +++++++++++++++ > notmuch-search.c | 98 ++++++++++++++++++++++++++++++++= +++--- > test/T090-search-output.sh | 87 +++++++++++++++++++++++++++++++++ > test/T095-search-filter-by.sh | 64 +++++++++++++++++++++++++ > 6 files changed, 288 insertions(+), 8 deletions(-) > create mode 100755 test/T095-search-filter-by.sh > > diff --git a/completion/notmuch-completion.bash b/completion/notmuch-comp= letion.bash > index cfbd389..6b6d43a 100644 > --- a/completion/notmuch-completion.bash > +++ b/completion/notmuch-completion.bash > @@ -305,12 +305,16 @@ _notmuch_search() > COMPREPLY=3D( $( compgen -W "true false flag all" -- "${cur}" ) ) > return > ;; > + --filter-by) > + COMPREPLY=3D( $( compgen -W "nameaddr name addr addrfold nameaddrfo= ld" -- "${cur}" ) ) > + return > + ;; > esac >=20=20 > ! $split && > case "${cur}" in > -*) > - local options=3D"--format=3D --output=3D --sort=3D --offset=3D --li= mit=3D --exclude=3D --duplicate=3D" > + local options=3D"--format=3D --output=3D --sort=3D --offset=3D --li= mit=3D --exclude=3D --duplicate=3D --filter-by=3D" > compopt -o nospace > COMPREPLY=3D( $(compgen -W "$options" -- ${cur}) ) > ;; > diff --git a/completion/notmuch-completion.zsh b/completion/notmuch-compl= etion.zsh > index 3e52a00..3e535df 100644 > --- a/completion/notmuch-completion.zsh > +++ b/completion/notmuch-completion.zsh > @@ -53,7 +53,8 @@ _notmuch_search() > '--max-threads=3D[display only the first x threads from the search r= esults]:number of threads to show: ' \ > '--first=3D[omit the first x threads from the search results]:number= of threads to omit: ' \ > '--sort=3D[sort results]:sorting:((newest-first\:"reverse chronologi= cal order" oldest-first\:"chronological order"))' \ > - '--output=3D[select what to output]:output:((summary threads message= s files tags sender recipients))' > + '--output=3D[select what to output]:output:((summary threads message= s files tags sender recipients))' \ > + '--filter-by=3D[filter out duplicate addresses]:filter-by:((nameaddr= \:"both name and address part" name\:"name part" addr\:"address part" addrf= old\:"case-insensitive address part" nameaddrfold\:"name and case-insensiti= ve address part"))' > } >=20=20 > _notmuch() > diff --git a/doc/man1/notmuch-search.rst b/doc/man1/notmuch-search.rst > index b6607c9..84af2da 100644 > --- a/doc/man1/notmuch-search.rst > +++ b/doc/man1/notmuch-search.rst > @@ -85,6 +85,9 @@ Supported options for **search** include > (--format=3Dtext0), as a JSON array (--format=3Djson), or as > an S-Expression list (--format=3Dsexp). >=20=20 > + Duplicate addresses are filtered out. Filtering can be > + configured with the --filter-by option. > + > Note: Searching for **sender** should be much faster than > searching for **recipients**, because sender addresses are > cached directly in the database whereas other addresses > @@ -151,6 +154,41 @@ Supported options for **search** include > prefix. The prefix matches messages based on filenames. This > option filters filenames of the matching messages. >=20=20 > + ``--filter-by=3D``\ (**nameaddr**\ \|\ **name** \|\ **addr**\ \|\ **= addrfold**\ \|\ **nameaddrfold**\) > + > + Can be used with ``--output=3Dsender`` or > + ``--output=3Drecipients`` to filter out duplicate addresses. The > + filtering algorithm receives a sequence of email addresses and > + outputs the same sequence without the addresses that are > + considered a duplicate of a previously output address. What is > + considered a duplicate depends on how the two addresses are > + compared and this can be controlled with the follwing flags: > + > + **nameaddr** means that both name and address parts are > + compared in case-sensitive manner. Therefore, all same looking > + addresses strings are considered duplicate. This is the > + default. > + > + **name** means that only the name part is compared (in > + case-sensitive manner). For example, the addresses "John Doe > + " and "John Doe " will be > + considered duplicate. > + > + **addr** means that only the address part is compared (in > + case-sensitive manner). For example, the addresses "John Doe > + " and "Dr. John Doe " will > + be considered duplicate. > + > + **addrfold** is like **addr**, but comparison is done in > + canse-insensitive manner. For example, the addresses "John Doe > + " and "Dr. John Doe " will > + be considered duplicate. > + > + **nameaddrfold** is like **nameaddr**, but address comparison > + is done in canse-insensitive manner. For example, the > + addresses "John Doe " and "John Doe > + " will be considered duplicate. > + > EXIT STATUS > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >=20=20 > diff --git a/notmuch-search.c b/notmuch-search.c > index ce3bfb2..47aa979 100644 > --- a/notmuch-search.c > +++ b/notmuch-search.c > @@ -34,6 +34,14 @@ typedef enum { >=20=20 > #define OUTPUT_ADDRESS_FLAGS (OUTPUT_SENDER | OUTPUT_RECIPIENTS) >=20=20 > +typedef enum { > + FILTER_BY_NAMEADDR =3D 0, > + FILTER_BY_NAME, > + FILTER_BY_ADDR, > + FILTER_BY_ADDRFOLD, > + FILTER_BY_NAMEADDRFOLD, > +} filter_by_t; > + > typedef struct { > sprinter_t *format; > notmuch_query_t *query; > @@ -42,6 +50,7 @@ typedef struct { > int offset; > int limit; > int dupe; > + filter_by_t filter_by; > } search_options_t; >=20=20 > typedef struct { > @@ -229,6 +238,52 @@ do_search_threads (search_options_t *opt) > return 0; > } >=20=20 > +/* Returns TRUE iff name and/or addr is considered duplicite. */ A triviality; duplicite should be duplicate > +static notmuch_bool_t > +check_duplicite (const search_options_t *opt, GHashTable *addrs, const c= har *name, const char *addr) I am not sure on style but maybe is_duplicate would be clearer? Best wishes Mark > +{ > + notmuch_bool_t duplicite; > + char *key; > + > + if (opt->filter_by =3D=3D FILTER_BY_ADDRFOLD || > + opt->filter_by =3D=3D FILTER_BY_NAMEADDRFOLD) { > + gchar *folded =3D g_utf8_casefold (addr, -1); > + addr =3D talloc_strdup (opt->format, folded); > + g_free (folded); > + } > + switch (opt->filter_by) { > + case FILTER_BY_NAMEADDR: > + case FILTER_BY_NAMEADDRFOLD: > + key =3D talloc_asprintf (opt->format, "%s <%s>", name, addr); > + break; > + case FILTER_BY_NAME: > + key =3D talloc_strdup (opt->format, name); /* !name results in !key */ > + break; > + case FILTER_BY_ADDR: > + case FILTER_BY_ADDRFOLD: > + key =3D talloc_strdup (opt->format, addr); > + break; > + default: > + INTERNAL_ERROR("invalid --filter-by flags"); > + } > + > + if (opt->filter_by =3D=3D FILTER_BY_ADDRFOLD || > + opt->filter_by =3D=3D FILTER_BY_NAMEADDRFOLD) > + talloc_free ((char*)addr); > + > + if (! key) > + return FALSE; > + > + duplicite =3D g_hash_table_lookup_extended (addrs, key, NULL, NULL); > + > + if (! duplicite) > + g_hash_table_insert (addrs, key, NULL); > + else > + talloc_free (key); > + > + return duplicite; > +} > + > static void > print_mailbox (const search_options_t *opt, const mailbox_t *mailbox) > { > @@ -263,7 +318,8 @@ print_mailbox (const search_options_t *opt, const mai= lbox_t *mailbox) > } >=20=20 > static void > -process_address_list (const search_options_t *opt, InternetAddressList *= list) > +process_address_list (const search_options_t *opt, GHashTable *addrs, > + InternetAddressList *list) > { > InternetAddress *address; > int i; > @@ -279,7 +335,7 @@ process_address_list (const search_options_t *opt, In= ternetAddressList *list) > if (group_list =3D=3D NULL) > continue; >=20=20 > - process_address_list (opt, group_list); > + process_address_list (opt, addrs, group_list); > } else { > InternetAddressMailbox *mailbox =3D INTERNET_ADDRESS_MAILBOX (addre= ss); > mailbox_t mbx =3D { > @@ -287,13 +343,16 @@ process_address_list (const search_options_t *opt, = InternetAddressList *list) > .addr =3D internet_address_mailbox_get_addr (mailbox), > }; >=20=20 > + if (check_duplicite (opt, addrs, mbx.name, mbx.addr)) > + continue; > + > print_mailbox (opt, &mbx); > } > } > } >=20=20 > static void > -process_address_header (const search_options_t *opt, const char *value) > +process_address_header (const search_options_t *opt, GHashTable *addrs, = const char *value) > { > InternetAddressList *list; >=20=20 > @@ -304,7 +363,13 @@ process_address_header (const search_options_t *opt,= const char *value) > if (list =3D=3D NULL) > return; >=20=20 > - process_address_list (opt, list); > + process_address_list (opt, addrs, list); > +} > + > +static void > +_my_talloc_free_for_g_hash (void *ptr) > +{ > + talloc_free (ptr); > } >=20=20 > static int > @@ -314,8 +379,13 @@ do_search_messages (search_options_t *opt) > notmuch_messages_t *messages; > notmuch_filenames_t *filenames; > sprinter_t *format =3D opt->format; > + GHashTable *addresses =3D NULL; > int i; >=20=20 > + if (opt->output & OUTPUT_ADDRESS_FLAGS) > + addresses =3D g_hash_table_new_full (g_str_hash, g_str_equal, > + _my_talloc_free_for_g_hash, NULL); > + > if (opt->offset < 0) { > opt->offset +=3D notmuch_query_count_messages (opt->query); > if (opt->offset < 0) > @@ -363,7 +433,7 @@ do_search_messages (search_options_t *opt) > const char *addrs; >=20=20 > addrs =3D notmuch_message_get_header (message, "from"); > - process_address_header (opt, addrs); > + process_address_header (opt, addresses, addrs); > } >=20=20 > if (opt->output & OUTPUT_RECIPIENTS) { > @@ -373,7 +443,7 @@ do_search_messages (search_options_t *opt) >=20=20 > for (j =3D 0; j < ARRAY_SIZE (hdrs); j++) { > addrs =3D notmuch_message_get_header (message, hdrs[j]); > - process_address_header (opt, addrs); > + process_address_header (opt, addresses, addrs); > } > } > } > @@ -381,6 +451,9 @@ do_search_messages (search_options_t *opt) > notmuch_message_destroy (message); > } >=20=20 > + if (addresses) > + g_hash_table_unref (addresses); > + > notmuch_messages_destroy (messages); >=20=20 > format->end (format); > @@ -447,6 +520,7 @@ notmuch_search_command (notmuch_config_t *config, int= argc, char *argv[]) > .offset =3D 0, > .limit =3D -1, /* unlimited */ > .dupe =3D -1, > + .filter_by =3D FILTER_BY_NAMEADDR, > }; > char *query_str; > int opt_index, ret; > @@ -490,6 +564,13 @@ notmuch_search_command (notmuch_config_t *config, in= t argc, char *argv[]) > { NOTMUCH_OPT_INT, &opt.offset, "offset", 'O', 0 }, > { NOTMUCH_OPT_INT, &opt.limit, "limit", 'L', 0 }, > { NOTMUCH_OPT_INT, &opt.dupe, "duplicate", 'D', 0 }, > + { NOTMUCH_OPT_KEYWORD, &opt.filter_by, "filter-by", 'b', > + (notmuch_keyword_t []){ { "nameaddr", FILTER_BY_NAMEADDR }, > + { "name", FILTER_BY_NAME }, > + { "addr", FILTER_BY_ADDR }, > + { "addrfold", FILTER_BY_ADDRFOLD }, > + { "nameaddrfold", FILTER_BY_NAMEADDRFOLD }, > + { 0, 0 } } }, > { 0, 0, 0, 0, 0 } > }; >=20=20 > @@ -500,6 +581,11 @@ notmuch_search_command (notmuch_config_t *config, in= t argc, char *argv[]) > if (! opt.output) > opt.output =3D OUTPUT_SUMMARY; >=20=20 > + if (opt.filter_by && !(opt.output & OUTPUT_ADDRESS_FLAGS)) { > + fprintf (stderr, "Error: --filter-by can only be used with address outp= ut.\n"); > + return EXIT_FAILURE; > + } > + > switch (format_sel) { > case NOTMUCH_FORMAT_TEXT: > opt.format =3D sprinter_text_create (config, stdout); > diff --git a/test/T090-search-output.sh b/test/T090-search-output.sh > index 947d572..841a721 100755 > --- a/test/T090-search-output.sh > +++ b/test/T090-search-output.sh > @@ -387,6 +387,93 @@ cat <EXPECTED > EOF > test_expect_equal_file OUTPUT EXPECTED >=20=20 > +test_begin_subtest "--output=3Dsender" > +notmuch search --output=3Dsender '*' >OUTPUT > +cat <EXPECTED > +Fran=C3=A7ois Boulogne > +Olivier Berger > +Chris Wilson > +Carl Worth > +Alexander Botero-Lowry > +Keith Packard > +Jjgod Jiang > +Rolland Santimano > +Jan Janak > +Stewart Smith > +Lars Kellogg-Stedman > +Alex Botero-Lowry > +Ingmar Vanhassel > +Aron Griffis > +Adrian Perez de Castro > +Israel Herraiz > +Mikhail Gusarov > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > +test_begin_subtest "--output=3Dsender --format=3Djson" > +notmuch search --output=3Dsender --format=3Djson '*' >OUTPUT > +cat <EXPECTED > +[{"name": "Fran=C3=A7ois Boulogne", "address": "boulogne.f@gmail.com"}, > +{"name": "Olivier Berger", "address": "olivier.berger@it-sudparis.eu"}, > +{"name": "Chris Wilson", "address": "chris@chris-wilson.co.uk"}, > +{"name": "Carl Worth", "address": "cworth@cworth.org"}, > +{"name": "Alexander Botero-Lowry", "address": "alex.boterolowry@gmail.co= m"}, > +{"name": "Keith Packard", "address": "keithp@keithp.com"}, > +{"name": "Jjgod Jiang", "address": "gzjjgod@gmail.com"}, > +{"name": "Rolland Santimano", "address": "rollandsantimano@yahoo.com"}, > +{"name": "Jan Janak", "address": "jan@ryngle.com"}, > +{"name": "Stewart Smith", "address": "stewart@flamingspork.com"}, > +{"name": "Lars Kellogg-Stedman", "address": "lars@seas.harvard.edu"}, > +{"name": "Alex Botero-Lowry", "address": "alex.boterolowry@gmail.com"}, > +{"name": "Ingmar Vanhassel", "address": "ingmar@exherbo.org"}, > +{"name": "Aron Griffis", "address": "agriffis@n01se.net"}, > +{"name": "Adrian Perez de Castro", "address": "aperez@igalia.com"}, > +{"name": "Israel Herraiz", "address": "isra@herraiz.org"}, > +{"name": "Mikhail Gusarov", "address": "dottedmag@dottedmag.net"}] > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > +test_begin_subtest "--output=3Drecipients" > +notmuch search --output=3Drecipients '*' >OUTPUT > +cat <EXPECTED > +Allan McRae > +Discussion about the Arch User Repository (AUR) > +olivier.berger@it-sudparis.eu > +notmuch@notmuchmail.org > +notmuch > +Keith Packard > +Mikhail Gusarov > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > +test_begin_subtest "--output=3Dsender --output=3Drecipients" > +notmuch search --output=3Dsender --output=3Drecipients '*' >OUTPUT > +cat <EXPECTED > +Fran=C3=A7ois Boulogne > +Allan McRae > +Discussion about the Arch User Repository (AUR) > +Olivier Berger > +olivier.berger@it-sudparis.eu > +Chris Wilson > +notmuch@notmuchmail.org > +Carl Worth > +Alexander Botero-Lowry > +Keith Packard > +Jjgod Jiang > +Rolland Santimano > +Jan Janak > +Stewart Smith > +Lars Kellogg-Stedman > +notmuch > +Alex Botero-Lowry > +Ingmar Vanhassel > +Aron Griffis > +Adrian Perez de Castro > +Israel Herraiz > +Mikhail Gusarov > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > test_begin_subtest "sanitize output for quoted-printable line-breaks in = author and subject" > add_message "[subject]=3D'two =3D?ISO-8859-1?Q?line=3D0A_subject?=3D > headers'" > diff --git a/test/T095-search-filter-by.sh b/test/T095-search-filter-by.sh > new file mode 100755 > index 0000000..97d9a9b > --- /dev/null > +++ b/test/T095-search-filter-by.sh > @@ -0,0 +1,64 @@ > +#!/usr/bin/env bash > +test_description=3D'duplicite address filtering in "notmuch search --out= put=3Drecipients"' > +. ./test-lib.sh > + > +add_message '[to]=3D"Real Name , Real Name "' > +add_message '[to]=3D"Nickname "' '[cc]=3D"Real Name "' > +add_message '[to]=3D"Nickname "' '[bcc]=3D"Real Name "' > + > +test_begin_subtest "--output=3Drecipients" > +notmuch search --output=3Drecipients "*" >OUTPUT > +cat <EXPECTED > +Real Name > +Real Name > +Nickname > +Real Name > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > +test_begin_subtest "--output=3Drecipients --filter-by=3Dnameaddr" > +notmuch search --output=3Drecipients --filter-by=3Dnameaddr "*" >OUTPUT > +# The same as above > +cat <EXPECTED > +Real Name > +Real Name > +Nickname > +Real Name > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > +test_begin_subtest "--output=3Drecipients --filter-by=3Dname" > +notmuch search --output=3Drecipients --filter-by=3Dname "*" >OUTPUT > +cat <EXPECTED > +Real Name > +Nickname > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > +test_begin_subtest "--output=3Drecipients --filter-by=3Daddr" > +notmuch search --output=3Drecipients --filter-by=3Daddr "*" >OUTPUT > +cat <EXPECTED > +Real Name > +Real Name > +Real Name > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > +test_begin_subtest "--output=3Drecipients --filter-by=3Daddrfold" > +notmuch search --output=3Drecipients --filter-by=3Daddrfold "*" >OUTPUT > +cat <EXPECTED > +Real Name > +Real Name > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > +test_begin_subtest "--output=3Drecipients --filter-by=3Dnameaddrfold" > +notmuch search --output=3Drecipients --filter-by=3Dnameaddrfold "*" >OUT= PUT > +cat <EXPECTED > +Real Name > +Real Name > +Nickname > +EOF > +test_expect_equal_file OUTPUT EXPECTED > + > +test_done > --=20 > 2.1.1 > > _______________________________________________ > notmuch mailing list > notmuch@notmuchmail.org > http://notmuchmail.org/mailman/listinfo/notmuch