java中使用mysql utf8mb4 字符集

emoji表情的存储问题

随着emoji表情使用越来越广泛,在mysql数据库在存储emoji表情的需求就越来越常见。mysql utf8 字符集只能存储三个字节,实际上并不是真正的UTF-8,不能存储emoji表情。所以如果mysql数据库和表的字符集都使用的utf8,在存emoji表情的时候就会出错:

1
2
3
4
5
6
7
8
9
10
11
12
Caused by: java.sql.SQLException: Incorrect string value: '\xF0\x9F\x8D\x93\xE4\xBA...' for column 'title' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3933)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3869)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2524)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2675)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2465)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1912)
at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2133)
at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2067)
at com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5175)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2052)

解决办法

为了支持emoji,mysql推出了utf8mb4字符集 , 那么如何在现有字符集为utf8的数据库上使用utf8mb4呢?

首先,当然是把需要存emoji的数据库、表、字段的符集都修改成为utf8mb4, 由于utf8mb4是兼容utf8的,所以修改并不会影响数据。
然后,然后设置客户端字符集
Java中一般使用Connector/J 来连接mysql ,看一下官网原文描述:

For Connector/J 5.1.46 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3.

For Connector/J 5.1.47 and later:

When UTF-8 is used for characterEncoding in the connection string, it maps to the MySQL character set name utf8mb4.

If the connection option connectionCollation is also set alongside characterEncoding and is incompatible with it, characterEncoding will be overridden with the encoding corresponding to connectionCollation.

Because there is no Java-style character set name for utfmb3 that you can use with the connection option charaterEncoding, the only way to use utf8mb3 as your connection character set is to use a utf8mb3 collation (for example, utf8_general_ci) for the connection option connectionCollation, which forces a utf8mb3 character set to be used, as explained in the last bullet.

可以看出:

  • 如果使用的是Connector/J 5.1.46 以及之前的版本,必须修改server 端的字符集:character_set_server=utf8mb4,这需要重启mysql服务,对于这使用云rds的同学来说更加困难
  • 更好的办法是升级 Connector/J5.1.47 之后的版本,这样当我们连接参数指定为characterEncoding=UTF-8 的时候, character_set_client character_set_connection 就会使用utf8mb4了。 当然还需要添加characterSetResults=UTF-8参数, 将character_set_results也设置为utf8mb4

通过Connector/J客户端设置utf8mb4字符集的 java 示例代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
public class Test{
public void testDevWithJDBC() throws SQLException, ClassNotFoundException {
Class.forName("com.mysql.jdbc.Driver");
Connection connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/jingdong_db?useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8"
+ "&user=***&password=***");
Statement statement = connection.createStatement();
viewSessionCharacterSet(statement);
}

private void viewSessionCharacterSet(Statement statement) throws SQLException {
ResultSet resultSet = statement.executeQuery("show variables like '%character%';");
while (resultSet.next()){
System.out.println(resultSet.getString(1) + ":" +resultSet.getString(2));
}
}
}

关于乱码问题

查看字符集
首先我们得登录数据库实例,查看当前数据库的字符集设置。

使用如下命令:
show variables like '%character%'
可以看到字符集的具体情况如下:

1
2
3
4
5
6
7
8
variable_name  value
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server utf8
character_set_system utf8

字符集参数说明

  • character_set_client: 此参数是客户端可以设置的,具体设置方式见下文详解。

  • character_set_connection: 此参数是客户端可以设置的,具体设置方式见下文详解。

  • character_set_database: 此参数值与你创建库的时候指定的字符集有关,关于创建库的时候如何指定字符集,可以参考创建库

  • character_set_results:此参数是客户端可以设置的,具体设置方式见下文详解。

  • character_set_server: 此参数是服务端的设置,此参数值可以通过参数组进行修改,可以参考 修改参数; 注意:此参数修改完后,记得重启实例才能确保 character_set_database 默认值与 character_set_server 保持同步。

  • character_set_system: 此参数可以不用关心

  • character_set_filesystem: 此参数可以不用关心

如何避免乱码

要避免在使用数据库过程中的乱码问题,需要确保character_set_clientcharacter_set_connectioncharacter_set_resultscharacter_set_database这四个参数值是保持一致的,如果不一致就会有可能出现乱码。

Java设置客户端设置字符集

  • character_set_clientcharacter_set_connection 通过characterEncoding 参数设置
  • character_set_results 通过characterSetResults 参数设置

示例: jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8