
Bug#208308: *printf() and incomplete multibyte sequences may cause infinite loops in applications



I checked the behavior described in this report with the latest glibc,
2.3.2.ds1-13.

At Tue, 2 Sep 2003 02:53:04 +0200,
R� Kuhlmann wrote:
> Let's look at the following test program:
> 
> | #include <stdio.h>
> | #include <stdarg.h>
> | #include <locale.h>
> | #include <wchar.h>
> | 
> | void vmain (const char *fmt, ...)
> | {
> |     va_list args;
> |     int rc;
> | 
> |     va_start (args, fmt);
> |     rc = vprintf (fmt, args);
> |     va_end (args);
> |     printf ("rc %d %d\n", rc, fwide (stdout, 0));
> | }
> | 
> | int main()
> | {
> |     const char *display = "\xe3\x83\x82x";
> | 
> |     setlocale (LC_ALL, "");
> |     vmain ("'%.*s'\n", 1, display);
> |     vmain ("'%.*s'\n", 3, display + 1);
> |     return 0;
> | }
> 
> Now, in the C locale it functions properly:
> 
> | 'ã'
> | rc 4 -1
> | 'x'
> | rc 6 -1
> 
> (the \x82 and \x83 are nonprintable, but hexdump -C reveals they are indeed
> printed)
> 
> Now, in the en_US.UTF-8 locale, the output is broken:
> 
> | ''
> | rc 3 -1
> | 'rc -1 -1
> 
> The first case may be okay: the sequence is incomplete. However, in the
> second case, the broken sequence is not only dropped, but parsing of the
> format string ceases altogether - the trailing quote and newline are
> missing as well! The bug is not only present in vprintf(), but also in
> vsnprintf(), so you can't argue about byte-oriented vs. multibyte-oriented.
> In fact, vsnprintf(buffer, sizeof(buffer), "%.*s", 3, "\x83\x82xyz") will
> always return -1, no matter how large the buffer actually is. The sample
> code given in the man page will turn into an infinite loop because of this!

Look at

	rc = vprintf (fmt, args);

In the second case it returns rc == -1.  If you call perror() when rc
is -1, you will see:

	printf: Invalid or incomplete multibyte or wide character

before the line:

	'rc -1 -1

So the output is cut off at "\x83\x82x".  printf() reports an invalid
character sequence because \x83 is not a valid first byte in UTF-8: it
is a continuation byte (note: the first byte of a UTF-8 multibyte
character is in the range 0xc0-0xfd).  This is why it looked to you as
if parsing of the format string had stopped.
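
To make the byte ranges concrete, here is a minimal sketch of the two
checks involved.  The function names are hypothetical (they are not
glibc interfaces), and the lead-byte range is simply the 0xc0-0xfd
range mentioned above.

```c
#include <stdbool.h>

/* Hypothetical helper: true if b may start a multibyte UTF-8
   character, using the 0xc0-0xfd range quoted in this mail. */
bool utf8_is_lead_byte(unsigned char b)
{
    return b >= 0xc0 && b <= 0xfd;
}

/* Hypothetical helper: true if b is a continuation byte, i.e. its
   top two bits are 10 (0x80-0xbf). */
bool utf8_is_continuation_byte(unsigned char b)
{
    return (b & 0xc0) == 0x80;
}
```

With these, \xe3 counts as a lead byte, while \x83 and \x82 are
continuation bytes, so a string starting at display + 1 cannot begin a
valid character.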

OK, now we can state the question behind this bug:

	Is it OK for printf("%s") to stop conversion when the string
	contains an invalid character sequence and the locale is not C?

Note that, as your first test case shows, glibc handles an incomplete
multibyte sequence (one whose first byte is representable and valid)
without error.


Solaris 8 behaves differently.  If the first byte of a multibyte
character is invalid, it falls back to "byte" mode, so printf()
reports no error.
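
Since the two implementations differ, a portable caller cannot assume
that the printf family never fails on "%s".  The following is only a
sketch of a caller that treats a negative return value as an error
rather than as "buffer too small".  It assumes C99 vsnprintf()
semantics (the return value is the length that would have been
written), and format_alloc is a hypothetical name, not the man page
sample.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly allocated string.
   Returns NULL on a formatting or allocation error; the caller
   frees the result. */
char *format_alloc(const char *fmt, ...)
{
    va_list ap;
    char *buf;
    int n;

    /* First pass: ask only for the required length. */
    va_start(ap, fmt);
    n = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);
    if (n < 0)
        return NULL;    /* conversion error: a larger buffer will not help */

    buf = malloc((size_t)n + 1);
    if (buf == NULL)
        return NULL;

    /* Second pass: format into a buffer of exactly the right size. */
    va_start(ap, fmt);
    n = vsnprintf(buf, (size_t)n + 1, fmt, ap);
    va_end(ap);
    if (n < 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

Because there is no retry with a bigger buffer, a persistent -1 ends
the call instead of looping.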

Glibc internally uses mbsrtowcs(3) to convert multibyte characters to
wide characters.  Your first case succeeds because its length is 1: the
conversion runs out of buffer before it can report EILSEQ.  The second
case fails because it contains a sequence that triggers EILSEQ.
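
The difference between "incomplete" and "invalid" can be seen outside
printf() as well.  This is a rough sketch, not glibc's internal code:
it uses mbrtowc(3) one character at a time instead of mbsrtowcs(3), and
mb_check, select_utf8_locale and the locale names tried are all
assumptions made for illustration.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

enum mb_status { MB_OK, MB_INCOMPLETE, MB_INVALID };

/* Hypothetical helper: try a few UTF-8 locale names.  The names are
   assumptions and may not be installed.  Returns 1 on success. */
int select_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Hypothetical helper: decode at most n bytes of s in the current
   locale and report how the conversion ended. */
enum mb_status mb_check(const char *s, size_t n)
{
    mbstate_t ps;

    memset(&ps, 0, sizeof ps);
    while (n > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, n, &ps);

        if (r == (size_t)-1)
            return MB_INVALID;      /* errno is set to EILSEQ */
        if (r == (size_t)-2)
            return MB_INCOMPLETE;   /* ran out of bytes mid-character */
        if (r == 0)
            break;                  /* reached the terminating NUL */
        s += r;
        n -= r;
    }
    return MB_OK;
}
```

In a UTF-8 locale I would expect mb_check("\xe3\x83\x82x", 1) to give
MB_INCOMPLETE and mb_check("\x83\x82x", 3) to give MB_INVALID, which
matches the two test cases in the report.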

Considering how mbsrtowcs behaves, I think glibc is correct.  It does
not violate any standard, and it cannot break a terminal's shift state
the way Solaris's raw "byte" stream of invalid characters can.


So I think this is a feature, not a bug, and I would like to close it.
If you object, please give me the reason along with the relevant
wording from the standard, because I'm not sure that the current glibc
behavior serves more users than Solaris-like behavior would.

Regards,
-- gotom


