Opened 10 years ago

Last modified 6 years ago

#581 new defect

strftime depends on fringe standard behavior for UTF-8 correctness

Reported by: Vojtech Horky
Owned by: Jakub Jermář
Priority: minor
Milestone:
Component: helenos/lib/c
Version: mainline
Keywords: first-patch
Cc:
Blocker for:
Depends on:
See also:

Description

The current implementation of strftime assumes that bytes correspond to characters in the format string. This (wrong) assumption simplifies the implementation but prevents safe use of the function with non-ASCII strings.

The cleanest solution is probably to add a new function with a different name (similar to strcpy vs str_cpy) that is UTF-8-aware, and to move strftime into libposix.

Change History (6)

comment:1 by Jiří Zárevúcky, 10 years ago

This is incorrect. The function handles UTF-8 correctly simply because it outputs non-specifier bytes without change. UTF-8 has the property that bytes encoding non-ASCII characters are always greater than 127. As specifiers are all ASCII, any explicit decoding would serve no purpose at all.
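The UTF-8 property referred to here can be illustrated with a small sketch. The helper names below are hypothetical and are not part of HelenOS; the point is only that a byte-wise scan for the ASCII `%` can never match inside a multi-byte sequence, because every byte of such a sequence has its high bit set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true for any byte that belongs to a multi-byte
 * UTF-8 sequence (lead or continuation). All such bytes are >= 0x80. */
static bool utf8_is_non_ascii_byte(unsigned char b)
{
    return (b & 0x80) != 0;
}

/* Hypothetical helper: counts '%' bytes by scanning raw bytes, with no
 * decoding. Since '%' is 0x25 (< 0x80), it cannot occur as part of an
 * encoded non-ASCII character, so the count equals the number of '%'
 * characters in the string. */
static size_t count_percent_bytes(const char *fmt)
{
    size_t n = 0;

    for (const unsigned char *p = (const unsigned char *) fmt; *p != '\0'; p++) {
        if (*p == '%')
            n++;
    }
    return n;
}
```

For example, scanning the bytes of "á%Yá" (with "á" encoded as 0xC3 0xA1) finds exactly one `%`.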

If there is a problem (do you have a test case that fails?), it could indicate a non-standard behavior of printf() with regards to the '%c' specifier, which is used to copy raw bytes to the output.

comment:2 by Vojtech Horky, 10 years ago

I am not so sure. Correct me if I am wrong, please. Consider the following code:

char my_date[10];
struct tm my_tm;
size_t res = strftime(my_date, 10, "á", &my_tm);

then my_date would contain "??" because each byte of the multibyte sequence would be appended separately and the check inside printf_core would treat the character as invalid (which is correct behaviour IMHO).

So, either you need to decode the characters properly or copy them to the output buffer verbatim without invoking snprintf.
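The per-byte path described above can be sketched roughly as follows. This is not the HelenOS implementation; `copy_via_percent_c` is a hypothetical name, and the sketch assumes a libc whose `%c` converts its argument to `unsigned char` and writes that byte unchanged. Under that assumption both bytes of "á" survive; the "??" result would appear only where printf validates each byte as a character on its own, as described for printf_core.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: appends each byte of src to dst by formatting it
 * with "%c", one byte at a time. Returns the number of bytes copied
 * (excluding the terminating NUL). Stops early if dst is full. */
static size_t copy_via_percent_c(char *dst, size_t size, const char *src)
{
    size_t used = 0;

    for (; *src != '\0' && used + 1 < size; src++) {
        /* Each call writes one byte plus a NUL into the remaining space. */
        int rc = snprintf(dst + used, size - used, "%c", (unsigned char) *src);
        if (rc != 1)
            break;
        used++;
    }
    if (size > 0)
        dst[used] = '\0';
    return used;
}
```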

in reply to: 2; comment:3 by Jiří Zárevúcky, 10 years ago

Summary: strftime is not UTF-aware → strftime depends on fringe standard behavior for UTF-8 correctness

Replying to vhotspur:

then my_date would contain "??" because each byte of the multibyte sequence would be appended separately and the check inside printf_core would treat the character as invalid (which is correct behaviour IMHO).

You are right, but this is incorrect behavior for printf(), because it is non-standard and useless (opening a new ticket).
(I believe the latest consensus is that) POSIXly named stuff should have POSIX behavior, unless really necessary. Departures like discarding non-UTF-8 bytes, where the original copies them verbatim, are completely unnecessary.
… In fact, most POSIX functions handle UTF-8 just fine without any special code, because UTF-8 was designed to work like that.


Still, it might be good to append raw bytes directly in strftime(), instead of depending on a fringe case in printf(). Should I leave it open for potential newcomers, or just write the patch?
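A minimal sketch of the "append raw bytes directly" idea follows. It is not the actual HelenOS patch: `sketch_strftime` is a hypothetical function that supports only `%Y` and `%%`, and it exists only to show non-specifier bytes being stored straight into the output buffer, so multi-byte UTF-8 sequences never pass through printf.

```c
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical, heavily reduced strftime-like formatter.
 * Supports only %Y and %%. Returns the number of bytes written
 * (excluding the NUL), or 0 if the result does not fit or an
 * unsupported specifier is met. */
static size_t sketch_strftime(char *buf, size_t size, const char *fmt,
    const struct tm *tm)
{
    size_t used = 0;

    while (*fmt != '\0') {
        if (*fmt != '%') {
            /* Non-specifier byte: copy verbatim, no printf involved. */
            if (used + 1 >= size)
                return 0;
            buf[used++] = *fmt++;
            continue;
        }

        fmt++; /* skip '%' */
        if (*fmt == 'Y') {
            int rc = snprintf(buf + used, size - used, "%d",
                tm->tm_year + 1900);
            if (rc < 0 || (size_t) rc >= size - used)
                return 0;
            used += (size_t) rc;
            fmt++;
        } else if (*fmt == '%') {
            if (used + 1 >= size)
                return 0;
            buf[used++] = '%';
            fmt++;
        } else {
            /* Unsupported in this sketch (including a trailing '%'). */
            return 0;
        }
    }

    if (used >= size)
        return 0;
    buf[used] = '\0';
    return used;
}
```

With this structure, a format such as "á %Y" yields the two original bytes of "á", a space, and the four-digit year, regardless of how the underlying printf treats `%c`.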

in reply to: 3; comment:4 by Vojtech Horky, 10 years ago

Replying to jirkazr:

Should I leave it open for potential newcomers, or just write the patch?

I would prefer leaving it open for newcomers (that is why I added the first-patch keyword). It is not a high-priority problem anyway. Thanks.

comment:5 by Jakub Jermář, 10 years ago

Milestone: 0.6.0 → 0.7.1

comment:6 by Jakub Jermář, 6 years ago

Milestone: 0.7.1 (removed)