From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2024-04-25 15:33:00


On 4/25/24 18:28, Peter Dimov wrote:
> Andrey Semashev wrote:
>> On 4/25/24 17:53, Peter Dimov via Boost wrote:
>>> This behavior makes name UUIDs produced by e.g. "www.example.org"
>>> and L"www.example.org" different, which is unlikely to be what one
>>> wants in practice, and is against the recommendation of RFC 4122,
>>> which says
>>>
>>> o Convert the name to a canonical sequence of octets (as defined by
>>> the standards or conventions of its name space); put the name
>>> space ID in network byte order.
>>>
>>> I don't think anyone can justify the choice of e.g. 0x41 0x00 0x00
>>> 0x00 as the "canonical sequence of octets" for U"A".
>>
>> Perhaps we should simply assume that whatever form of the string the user
>> provided to the generator is the "canonical" form. That is, if the user wants
>> "www.example.org" and L"www.example.org" to produce the same UUID, it
>> is the user's responsibility to convert those strings to the same
>> representation before passing them to the generator.
>>
>> I think that in some regions Unicode might not be the first encoding of
>> choice, and there are also incorrectly encoded strings that cannot be
>> converted to UTF-8. I don't think that Boost.UUID should deal with those issues.
>
> The right way to not deal with these issues is to simply not take wide strings
> in the first place. This forces the user to supply "the canonical octet
> representation".
>
> Since we do take wide strings, we have implicitly accepted the responsibility
> to produce the canonical octet representation for them. And inserting zeroes
> randomly is simply wrong.

OK, so maybe we should simply deprecate support for wide string inputs?
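
For illustration, here is roughly what "convert those strings to the same
representation" could look like on the user's side if wide string inputs went
away. This is only a sketch: to_utf8 is a hypothetical helper, not something
Boost.UUID provides, and it assumes the input is valid UTF-32 (no surrogates,
nothing above U+10FFFF). Invalid input is exactly the case I would rather not
handle in the library.

```cpp
#include <cassert>
#include <string>

// Hypothetical user-side helper (not part of Boost.UUID):
// encodes UTF-32 code points as a UTF-8 octet sequence.
// Assumption: every element of s is a valid Unicode scalar value.
inline std::string to_utf8(std::u32string const& s)
{
    std::string out;

    for (char32_t c : s)
    {
        if (c < 0x80)
        {
            // 1 octet: 0xxxxxxx
            out += static_cast<char>(c);
        }
        else if (c < 0x800)
        {
            // 2 octets: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000)
        {
            // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else
        {
            // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }

    return out;
}
```

With this, to_utf8(U"www.example.org") yields the same octets as the narrow
literal "www.example.org", so passing the result to the name generator would
give the same UUID for both spellings, and U"A" becomes the single octet 0x41
rather than 0x41 0x00 0x00 0x00.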