I’m wondering if there is a “best” choice for collation in MySQL for a general website where you aren’t 100% sure of what will be entered? I understand that all the encodings should be the same, such as MySQL, Apache, the HTML and anything inside PHP.
In the past I have set PHP to output in “UTF-8”, but which collation does this match in MySQL? I’m thinking it’s one of the UTF-8 ones, but I have used utf8_unicode_ci
, utf8_general_ci
, and utf8_bin
before.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
The main difference is sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin which is for comparing characters in binary format.
utf8_general_ci
is somewhat faster than utf8_unicode_ci
, but less accurate (for sorting). The specific language utf8 encoding (such as utf8_swedish_ci
) contain additional language rules that make them the most accurate to sort for those languages. Most of the time I use utf8_unicode_ci
(I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.
You can read more on specific unicode character sets on the MySQL manual – http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
Method 2
Actually, you probably want to use utf8_unicode_ci
or utf8_general_ci
.
utf8_general_ci
sorts by stripping away all accents and sorting as if it were ASCIIutf8_unicode_ci
uses the Unicode sort order, so it sorts correctly in more languages
However, if you are only using this to store English text, these shouldn’t differ.
Method 3
Be very, very aware of this problem that can occur when using utf8_general_ci
.
MySQL will not distinguish between some characters in select statements, when utf8_general_ci
collation is used. This can lead to very nasty bugs – especially for example, where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow malicious users to create a username matching an administrator account.
This problem exposes itself at the very least in early 5.x versions – I’m not sure if this behaviour has changed later.
I’m not a DBA, but to avoid this problem, I always go with utf8-bin
instead of a case-insensitive one.
The script below describes the problem by example.
-- first, create a sandbox to play in
CREATE DATABASE `sandbox`;
use `sandbox`;
-- next, make sure that your client connection is of the same
-- character/collate type as the one we're going to test next:
charset utf8 collate utf8_general_ci
-- now, create the table and fill it with values
CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) )
CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe');
-- (verify)
SELECT * FROM `test`;
-- now, expose the problem/bug:
SELECT * FROM test WHERE `value` = 'value';
--
-- Note that we get BOTH keys here! MySQLs UTF8 collates that are
-- case insensitive (ending with _ci) do not distinguish between
-- both values!
--
-- collate 'utf8_bin' doesn't have this problem, as I'll show next:
--
-- first, reset the client connection charset/collate type
charset utf8 collate utf8_bin
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Note that we get just one key now, as you'd expect.
--
-- This problem appears to be specific to utf8. Next, I'll try to
-- do the same with the 'latin1' charset:
--
-- first, reset the client connection charset/collate type
charset latin1 collate latin1_general_ci
-- next, convert the values that we've previously inserted
-- in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Again, only one key is returned (expected). This shows
-- that the problem with utf8/utf8_generic_ci isn't present
-- in latin1/latin1_general_ci
--
-- To complete the example, I'll check with the binary collate
-- of latin1 as well:
-- first, reset the client connection charset/collate type
charset latin1 collate latin1_bin
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Again, only one key is returned (expected).
--
-- Finally, I'll re-introduce the problem in the exact same
-- way (for any sceptics out there):
-- first, reset the client connection charset/collate type
charset utf8 collate utf8_generic_ci
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
-- now, re-check for the problem/bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Two keys.
--
DROP DATABASE sandbox;
Method 4
It is best to use character set utf8mb4
with the collation utf8mb4_unicode_ci
.
The character set, utf8
, only supports a small amount of UTF-8 code points, about 6% of possible characters. utf8
only supports the Basic Multilingual Plane (BMP). There 16 other planes. Each plane contains 65,536 characters. utf8mb4
supports all 17 planes.
MySQL will truncate 4 byte UTF-8 characters resulting in corrupted data.
The utf8mb4
character set was introduced in MySQL 5.5.3 on 2010-03-24.
Some of the required changes to use the new character set are not trivial:
- Changes may need to be made in your application database adapter.
- Changes will need to be made to my.cnf, including setting the character set, the collation and switching innodb_file_format to Barracuda
- SQL CREATE statements may need to include:
ROW_FORMAT=DYNAMIC
- DYNAMIC is required for indexes on VARCHAR(192) and larger.
NOTE: Switching to Barracuda
from Antelope
, may require restarting the MySQL service more than once. innodb_file_format_max
does not change until after the MySQL service has been restarted to: innodb_file_format = barracuda
.
MySQL uses the old Antelope
InnoDB file format. Barracuda
supports dynamic row formats, which you will need if you do not want to hit the SQL errors for creating indexes and keys after you switch to the charset: utf8mb4
- #1709 – Index column size too large. The maximum column size is 767 bytes.
- #1071 – Specified key was too long; max key length is 767 bytes
The following scenario has been tested on MySQL 5.6.17:
By default, MySQL is configured like this:
SHOW VARIABLES; innodb_large_prefix = OFF innodb_file_format = Antelope
Stop your MySQL service and add the options to your existing my.cnf:
[client] default-character-set= utf8mb4 [mysqld] explicit_defaults_for_timestamp = true innodb_large_prefix = true innodb_file_format = barracuda innodb_file_format_max = barracuda innodb_file_per_table = true # Character collation character_set_server=utf8mb4 collation_server=utf8mb4_unicode_ci
Example SQL CREATE statement:
CREATE TABLE Contacts ( id INT AUTO_INCREMENT NOT NULL, ownerId INT DEFAULT NULL, created timestamp NOT NULL DEFAULT '0000-00-00 00:00:00', modified timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, contact VARCHAR(640) NOT NULL, prefix VARCHAR(128) NOT NULL, first VARCHAR(128) NOT NULL, middle VARCHAR(128) NOT NULL, last VARCHAR(128) NOT NULL, suffix VARCHAR(128) NOT NULL, notes MEDIUMTEXT NOT NULL, INDEX IDX_CA367725E05EFD25 (ownerId), INDEX created (created), INDEX modified_idx (modified), INDEX contact_idx (contact), PRIMARY KEY(id) ) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ENGINE = InnoDB ROW_FORMAT=DYNAMIC;
- You can see error #1709 generated for
INDEX contact_idx (contact)
ifROW_FORMAT=DYNAMIC
is removed from the CREATE statement.
NOTE: Changing the index to limit to the first 128 characters on contact
eliminates the requirement for using Barracuda with ROW_FORMAT=DYNAMIC
INDEX contact_idx (contact(128)),
Also note: when it says the size of the field is VARCHAR(128)
, that is not 128 bytes. You can use have 128, 4 byte characters or 128, 1 byte characters.
This INSERT
statement should contain the 4 byte ‘poo’ character in the 2 row:
INSERT INTO `Contacts` (`id`, `ownerId`, `created`, `modified`, `contact`, `prefix`, `first`, `middle`, `last`, `suffix`, `notes`) VALUES (1, NULL, '0000-00-00 00:00:00', '2014-08-25 03:00:36', '1234567890', '12345678901234567890', '1234567890123456789012345678901234567890', '1234567890123456789012345678901234567890', '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678', '', ''), (2, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '', ''), (3, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '123💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '', '');
You can see the amount of space used by the last
column:
mysql> SELECT BIT_LENGTH(`last`), CHAR_LENGTH(`last`) FROM `Contacts`; +--------------------+---------------------+ | BIT_LENGTH(`last`) | CHAR_LENGTH(`last`) | +--------------------+---------------------+ | 1024 | 128 | -- All characters are ASCII | 4096 | 128 | -- All characters are 4 bytes | 4024 | 128 | -- 3 characters are ASCII, 125 are 4 bytes +--------------------+---------------------+
In your database adapter, you may want to set the charset and collation for your connection:
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'
In PHP, this would be set for: PDO::MYSQL_ATTR_INIT_COMMAND
References:
Method 5
Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.
Example from the documentation for charset unicode:
utf8_general_ci
also is satisfactory
for both German and French, except
that ‘ß’ is equal to ‘s’, and not to
‘ss’. If this is acceptable for your
application, then you should use
utf8_general_ci
because it is faster.
Otherwise, useutf8_unicode_ci
because
it is more accurate.
So – it depends on your expected user base and on how much you need correct sorting. For an English user base, utf8_general_ci
should suffice, for other languages, like Swedish, special collations have been created.
Method 6
Essentially, it depends on how you think of a string.
I always use utf8_bin because of the problem highlighted by Guus. In my opinion, as far as the database should be concerned, a string is still just a string. A string is a number of UTF-8 characters. A character has a binary representation so why does it need to know the language you’re using? Usually, people will be constructing databases for systems with the scope for multilingual sites. This is the whole point of using UTF-8 as a character set. I’m a bit of a pureist but I think the bug risks heavily outweigh the slight advantage you may get on indexing. Any language related rules should be done at a much higher level than the DBMS.
In my books “value” should never in a million years be equal to “valúe”.
If I want to store a text field and do a case insensitive search, I will use MYSQL string functions with PHP functions such as LOWER() and the php function strtolower().
Method 7
For UTF-8 textual information, you should use utf8_general_ci
because…
-
utf8_bin
: compare strings by the
binary value of each character in
the string -
utf8_general_ci
: compare strings
using general language rules and
using case-insensitive comparisons
a.k.a. it will should making searching and indexing the data faster/more efficient/more useful.
Method 8
The accepted answer fairly definitively suggests using utf8_unicode_ci, and whilst for new projects that’s great, I wanted to relate my recent contrary experience just in case it saves anyone some time.
Because utf8_general_ci is the default collation for Unicode in MySQL, if you want to use utf8_unicode_ci then you end up having to specify it in a lot of places.
For example, all client connections not only have a default charset (makes sense to me) but also a default collation (i.e. the collation will always default to utf8_general_ci for unicode).
Likely, if you use utf8_unicode_ci for your fields, your scripts that connect to the database will need to be updated to mention the desired collation explicitly — otherwise queries using text strings can fail when your connection is using the default collation.
The upshot is that when converting an existing system of any size to Unicode/utf8, you may end up being forced to use utf8_general_ci because of the way MySQL handles defaults.
Method 9
For the case highlighted by Guus, I would strongly suggest using either utf8_unicode_cs (case sensitive, strict matching, ordering correctly for the most part) instead of utf8_bin (strict matching, incorrect ordering).
If the field is intended to be searched, as opposed to matched for a user, then use utf8_general_ci or utf8_unicode_ci. Both are case-insensitive, one will losely match (‘ß’ is equal to ‘s’, and not to ‘ss’). There are also language specific versions, like utf8_german_ci where the lose matching is more suitable for the language specified.
[Edit – nearly 6 years later]
I no longer recommend the “utf8” character set on MySQL, and instead recommend the “utf8mb4” character set. They match almost entirely, but allow for a little (lot) more unicode characters.
Realistically, MySQL should have updated the “utf8” character set and respective collations to match the “utf8” specification, but instead, a separate character set and respective collations as to not impact storage designation for those already using their incomplete “utf8” character set.
Method 10
I found these collation charts helpful. http://collation-charts.org/mysql60/. I’m no sure which is the used utf8_general_ci though.
For example here is the chart for utf8_swedish_ci. It shows which characters it interprets as the same. http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html
Method 11
In your database upload file, add the followin line before any line:
SET NAMES utf8;
And your problem should be solved.
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0