From: Tomi Ollila <tomi.ollila@iki.fi>
To: david@tethera.net, notmuch@notmuchmail.org
Subject: Re: [Patch v3 5/6] test: add generator for random "stub" messages
In-Reply-To: <1345382314-5330-6-git-send-email-david@tethera.net>
References: <1345382314-5330-1-git-send-email-david@tethera.net>
	<1345382314-5330-6-git-send-email-david@tethera.net>
User-Agent: Notmuch/0.14+11~gd9bf007 (http://notmuchmail.org) Emacs/24.2.1
	(x86_64-unknown-linux-gnu)
Date: Sat, 08 Sep 2012 16:38:35 +0300
Message-ID: <m2wr04ocro.fsf@guru.guru-group.fi>
MIME-Version: 1.0
Content-Type: text/plain
Cc: David Bremner <bremner@debian.org>
Precedence: list

On Sun, Aug 19 2012, david@tethera.net wrote:

> From: David Bremner <bremner@debian.org>
>
> Initial use case is testing dump and restore, so we only have
> message-ids and tags.
>
> The message ID's are nothing like RFC compliant, but it doesn't seem
> any harder to roundtrip random UTF-8 strings than RFC-compliant ones.
>
> Tags are UTF-8, even though notmuch is in principle more generous than
> that.
> ---

Mostly LGTM (the whole series). Few comments inline...

Finally, 6/6 adds known broken test -- when will we see this code
taken into use and the broken test fixed :)

>  test/.gitignore      |    1 +
>  test/Makefile.local  |    9 +++
>  test/basic           |    2 +-
>  test/random-corpus.c |  202 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 213 insertions(+), 1 deletion(-)
>  create mode 100644 test/random-corpus.c

[ ... ]

>  
> diff --git a/test/random-corpus.c b/test/random-corpus.c
> new file mode 100644
> index 0000000..8c5b559
> --- /dev/null
> +++ b/test/random-corpus.c

[ ... ]

> +
> +/* Current largest UTF-32 value defined. Note that most of these will
> + * be printed as boxes in most fonts.
> + */

Should we be talking about UTF-8 valies. UTF-8 (currently has the same
limit).

> +
> +#define GLYPH_MAX 0x10FFFE
> +
> +static gunichar
> +random_unichar ()
> +{
> +    int start = 1, stop = GLYPH_MAX;
> +    int class = random() % 2;
> +
> +    /*
> +     *  Choose about half ascii as test characters, as ascii
> +     *  punctation and whitespace is the main cause of problems for
> +     *  the (old) restore parser
> +    */
> +    switch (class) {
> +    case 0:
> +	/* ascii */
> +	start = 0x01;
> +	stop = 0x7f;
> +	break;
> +    case 1:
> +	/* the rest of unicode */
> +	start = 0x80;
> +	stop = GLYPH_MAX;
> +    }
> +
> +    if (start == stop)
> +	return start;
> +    else
> +	return start + (random() % (stop - start + 1));
> +}
> +
> +static char *
> +random_utf8_string (void *ctx, size_t char_count)
> +{
> +
> +    gchar *buf = NULL;
> +    size_t buf_size = 0;
> +
> +    size_t offset = 0;
> +
> +    size_t i;
> +
> +    buf = talloc_realloc (ctx, NULL, gchar, char_count);
> +    buf_size = char_count;
> +
> +    for (i = 0; i < char_count; i++) {
> +	gunichar randomchar;
> +	size_t written;
> +
> +	/* 6 for one glyph, one for null */
> +	if (buf_size - offset < 8) {
> +	    buf_size += 16;
> +	    buf = talloc_realloc (ctx, buf, gchar, buf_size);

This reallocation will hit many times, as originally there was just
char_count bytes allocated -- this limit will probably get hit before 
halfway the creation of random string (half uses 1 byte, other half
2, 3 or 4 bytes, mostly 4 (even only half of the 4-byte range is used...)

Maybe originally allocating char_count * 2 + 8 and if realloc required
(char_count - i) * 2 + 8... or maybe better, just doing the latter
realloc and replacing first with buf = NULL; buf_size = 0;

Alternatively you could play with random states; calculate size,
reset random state, alloc size + 1 and write chars.

> +	}
> +
> +	randomchar = random_unichar();
> +
> +	written = g_unichar_to_utf8 (randomchar, buf + offset);
> +
> +	if (written <= 0) {
> +	    fprintf (stderr, "error converting to utf8\n");
> +	    exit (1);
> +	}
> +
> +	offset += written;
> +
> +    }

Above there is extra newline. There are a few others in other
files (at least after opening and before closing brace).
Maybe uncrustify your source :)

> +    buf[offset] = 0;
> +    return buf;
> +}
> +

[ ... ]


Tomi