MySQL unicode literals

I want to insert a record into MySQL that has a non-ASCII Unicode character, but I’m on a terminal that doesn’t let me easily type non-ASCII characters. How do I escape a Unicode literal in MySQL’s SQL syntax?

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

See: http://bugs.mysql.com/bug.php?id=10199
(Bug #10199: “Allow Unicode escape sequence for string literals.”)
This request has been “Open” since 2005. More details in Worklog Task #3529: Unicode Escape Sequences.

From https://web.archive.org/web/20091117221116/http://eng.kaching.com/2009/10/mysql-unicode-escape-sequences.html
though, you can see the following example, which does actually seem to work, but requires you to know the actual byte-by-byte UTF8 encoding:

You can also use the variable-length UTF-8 representation (convenient when, for example, copying from a utf-8 URL-encoded value like %E2%80%98).

mysql> select _utf8 x'E28098';  
+---+  
| ‘ |  
+---+

Method 2

This stored function provides the functionality MySQL is (apparently) missing, with a way to turn a literal code point into a character without having to already know the UTF-8 encoding.

If VARCHAR(1) seems strange, since utf8 characters in MySQL can be up to 3 bytes long, remember the size of VARCHAR is characters, not bytes. The function returns a single UTF-8-encoded character from the input value.

For hexadecimal literals, prepend 0x.

DELIMITER $$

DROP FUNCTION IF EXISTS `utf8_char` $$
CREATE FUNCTION `utf8_char`(v smallint unsigned) RETURNS VARCHAR(1) CHARSET utf8
NO SQL
DETERMINISTIC
BEGIN

-- http://stackoverflow.com/questions/3632410/mysql-unicode-literals/30675371#30675371

RETURN CHAR(CASE
            WHEN v <= 0x7F THEN v
            WHEN v <= 0x7FF THEN 0xC080 | ((v >> 6) << 8) | (v & 0x3F)
            WHEN v <= 0xFFFF THEN 0xE08080 | (((v >> 12) & 0x0F ) << 16)  | (((v >> 6) & 0x3F ) << 8) | (v & 0x3F)
            ELSE NULL END);

END $$

DELIMITER ;

Example output:

mysql> select utf8_char(8592) AS 'leftwards_arrow';
+-----------------+
| leftwards_arrow |
+-----------------+
| ←               |
+-----------------+
1 row in set (0.00 sec)

mysql> select utf8_char(0x2192) AS 'rightwards_arrow_hex';
+----------------------+
| rightwards_arrow_hex |
+----------------------+
| →                    |
+----------------------+
1 row in set (0.00 sec)

Method 3

If the goal is to specify the code point instead of the encoded byte sequence (i.e. 0x0F02 instead of the UTF-8 0xE0BC82 for “༂”), then you need to use an encoding in which the code point value just happens to be the encoded byte sequence. For example, “0xE28098″ is the UTF-8 encoded byte sequence for the ” ” character (as shown in dkamins’s answer), which is code point U+2018. However, 0x2018 is both the code point value for and the encoded byte sequence for ucs2 / utf16 (they are effectively the same encoding for BMP characters, but I prefer to use “utf16” as it is consistent with “utf8” and “utf32”, consistent in the “utf” theme). Hence:

_utf16 0x2018

returns the same character as:

_utf8 0xE0BC82

But, utf16 only works for BMP characters (code points U+0000 – U+FFFF) in terms of specifying the code point value. If you want a Supplementary Character (by specifying the code point instead of a specific encoding’s sequence of bytes), then you will need to use the utf32 encoding. Not only does _utf32 0x2018 return , but:

_utf32 0x1F47E

returns: 👾

To use either UTF-8 or UTF-16 encodings for that same Supplementary Character would require the following:

_utf8mb4 0xF09F91BE

_utf16 0xD83DDC7E

HOWEVER, if you are having trouble adding this to a string that is already utf8, then you will need to convert this into utf8 (or into utf8mb4 when creating Supplementary Characters as the utf8 encoding / charset can only handle BMP characters):

CONVERT(_utf32 0x1F47E USING utf8mb4)

Or, using the example character from Michael – sqlbot’s answer:

CONVERT(_utf32 0x2192 USING utf8)

returns a . Hence, a custom function is not needed in order to create a UTF-8 encoded character from its code point (at least not as of MySQL 8.0). Here is a test query

SELECT _utf32 0x1F47E AS "Supplementary Character in utf32",
       CONVERT(_utf32 0x1F47E USING utf8mb4) AS "Supplementary Character in utf8mb4",
       CHARSET(CONVERT(_utf32 0x1F47E USING utf8mb4)) AS "Proof",

       "---" AS "---",

       _utf32 0x2192 AS "BMP character in utf32",
       CONVERT(_utf32 0x2192 USING utf8) AS "BMP character in utf8",
       CHARSET(CONVERT(_utf32 0x2192 USING utf8)) AS "Proof";

And you can see it working on db<>fiddle (might not work in pre-8.0 MySQL).

 

For more details on these options, plus Unicode escape sequences for other languages and platforms, please see my post:

Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x