Opened 11 years ago
Last modified 7 years ago
#581 new defect
strftime depends on fringe standard behavior for UTF-8 correctness
Reported by: | Vojtech Horky | Owned by: | Jakub Jermář |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | helenos/lib/c | Version: | mainline |
Keywords: | first-patch | Cc: | |
Blocker for: | | Depends on: | |
See also: | | | |
Description
The current implementation of strftime assumes that bytes correspond to characters in the format string. This (wrong) assumption simplifies the implementation but prevents safe use of the function with non-ASCII strings.
The cleanest solution is probably to add a new function with a different name (similar to strcpy vs str_cpy) that is UTF-aware, and to move strftime into libposix.
Change History (6)
comment:1 by , 11 years ago
follow-up: 3 comment:2 by , 11 years ago
I am not so sure. Correct me if I am wrong, please. Consider the following code:
    char my_date[10];
    struct tm my_tm;
    size_t res = strftime(my_date, 10, "á", &my_tm);
Then my_date would contain "??" because each byte of the multibyte sequence would be appended separately and the check inside printf_core would treat the character as invalid (which is correct behaviour IMHO).
So, either you need to decode the characters properly or copy them to the output buffer verbatim without invoking snprintf.
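The scenario above can be turned into a small check. This is only a sketch: the helper name strftime_passes_utf8_literal is hypothetical, the struct tm is zero-initialized here (the snippet above leaves it uninitialized), and "á" is spelled out as its two UTF-8 bytes so the result does not depend on the source file encoding. On a libc that copies ordinary bytes unchanged the helper should return 1; with the behavior described in this comment it would return 0.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper, not part of HelenOS: returns 1 if strftime copies
 * the two UTF-8 bytes of "á" to the output unchanged, 0 otherwise.
 */
int strftime_passes_utf8_literal(void)
{
    static const char fmt[] = "\xC3\xA1";   /* "á" encoded as UTF-8 */
    char my_date[10];
    struct tm my_tm;

    /* The format has no conversion specifiers, so the values do not matter. */
    memset(&my_tm, 0, sizeof(my_tm));

    size_t res = strftime(my_date, sizeof(my_date), fmt, &my_tm);

    /* Expect 2 bytes written, followed by the terminating NUL. */
    return res == 2 && memcmp(my_date, fmt, sizeof(fmt)) == 0;
}
```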
follow-up: 4 comment:3 by , 11 years ago
Summary: | strftime is not UTF-aware → strftime depends on fringe standard behavior for UTF-8 correctness |
---|
Replying to vhotspur:
then my_date would contain "??" because each byte of the multibyte sequence would be appended separately and the check inside printf_core would treat the character as invalid (which is correct behaviour IMHO).
You are right, but this is incorrect behavior for printf(), because it is non-standard and useless (opening a new ticket).
(I believe the latest consensus is that) POSIXly named stuff should have POSIX behavior, unless a departure is really necessary. Departures like discarding non-UTF-8 bytes, where the original copies them verbatim, are completely unnecessary.
… In fact, most POSIX functions handle UTF-8 just fine without any special code, because UTF-8 was designed to work like that.
Still, it might be good to append raw bytes directly in strftime(), instead of depending on a fringe case in printf(). Should I leave it open for potential newcomers, or just write the patch?
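Appending raw bytes directly could look roughly like the sketch below. It is an illustration only, not the HelenOS code: the helper name append_literal_run and its error convention are assumptions made for this example. It copies every byte up to the next '%' or the terminating NUL without going through printf(), so multibyte sequences pass through untouched.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not HelenOS code: copy the bytes of fmt up to the
 * next '%' or the terminating NUL into dst without interpreting them.
 * Returns the number of bytes copied, or (size_t) -1 if they do not fit
 * into room bytes. No NUL terminator is written.
 */
size_t append_literal_run(char *dst, size_t room, const char *fmt)
{
    size_t n = 0;

    while (fmt[n] != '\0' && fmt[n] != '%') {
        if (n >= room)
            return (size_t) -1;
        dst[n] = fmt[n];   /* raw byte copy, no decoding */
        n++;
    }
    return n;
}
```

A caller would advance both the format pointer and the output pointer by the returned count and then handle the specifier, if any.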
comment:4 by , 11 years ago
Replying to jirkazr:
Should I leave it open for potential newcomers, or just write the patch?
I would prefer leaving it open for newcomers (that is why I added the first-patch keyword). And it is not a high priority problem. Thanks.
comment:5 by , 10 years ago
Milestone: | 0.6.0 → 0.7.1 |
---|
comment:6 by , 7 years ago
Milestone: | 0.7.1 |
---|
This is incorrect. The function handles UTF-8 correctly simply because it outputs non-specifier bytes without change. UTF-8 has the property that bytes encoding non-ASCII characters are always greater than 127. As specifiers are all ASCII, any explicit decoding would serve no purpose at all.
If there is a problem (do you have a test case that fails?), it could indicate non-standard behavior of printf() with regard to the '%c' specifier, which is used to copy raw bytes to the output.
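A possible test for the '%c' path, as a sketch: the helper name percent_c_preserves_byte is hypothetical, and the check goes through snprintf() rather than any HelenOS-internal function. Standard C converts the '%c' argument to unsigned char and writes it as is, so the helper should return 1 for every byte value, including the bytes 0xC3 and 0xA1 that encode "á".

```c
#include <stdio.h>

/*
 * Hypothetical check, not HelenOS code: returns 1 if snprintf's %c
 * writes the byte b to the output unchanged, 0 otherwise.
 */
int percent_c_preserves_byte(unsigned char b)
{
    char out[2] = { 0, 0 };

    /* b is promoted to int, as %c expects. */
    int written = snprintf(out, sizeof(out), "%c", b);

    return written == 1 && (unsigned char) out[0] == b;
}
```

If this returns 0 for bytes above 127, the problem lies in printf() rather than in strftime().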