How to Properly Set Variables (character_set_%) for Proper Cyrillic Leters Reading from MySQL in R

I am experiencing the issue of unproper reading of cyrillic letters from MySQL table.

I use the following code:

library(RMySQL)
library(keyring)

mydb = dbConnect(MySQL(), ...)
dbReadTable(mydb, 'tregions2')

The table is read but cyrillic letters are substituted with question marks:

  id     regionname  iSOID administrativeCenter
1  1 ????????? ???? RU-ALT              ???????

I started investigating into the issue.

The result of the query show variables like 'character_set_%'; in MySQL Workbench for the same user logged in on the same PC returns:

character_set_client    utf8mb4
character_set_connection    utf8mb4
character_set_database  utf8
character_set_filesystem    binary
character_set_results   utf8mb4
character_set_server    utf8mb4
character_set_system    utf8
character_sets_dir  C:Program FilesMySQLMySQL Server 8.0sharecharsets

But result of the query returned by R is different:

> dbGetQuery(mydb, "show variables like 'character_set_%'")
             Variable_name                                                         Value
1     character_set_client                                                        latin1
2 character_set_connection                                                        latin1
3   character_set_database                                                          utf8
4 character_set_filesystem                                                        binary
5    character_set_results                                                        latin1
6     character_set_server                                                       utf8mb4
7     character_set_system                                                          utf8
8       character_sets_dir C:\Program Files\MySQL\MySQL Server 8.0\share\charsets\

The locale variables of R are the following:

> Sys.getlocale()
[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251

I tried to change character set and collation of the table in DB. Earlier setting cp1251 character set helped me to properly write the data into the database. But not now. I tried utf8/koi8r/cp1251 without any effect.

Attempt to execute Sys.setlocale(,"ru_RU") aborted with an error that it could not be executed.

I am stuck. Could anyone give me an advise what else I should do?

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

After several hours of investigation I finaly figured out the solution. Hope it will help someone encountering the same problem:

> dbExecute(mydb, "SET NAMES cp1251")
[1] 0
> dbGetQuery(mydb, "show variables like 'character_set_%'")
             Variable_name                                                         Value
1     character_set_client                                                        cp1251
2 character_set_connection                                                        cp1251
3   character_set_database                                                          utf8
4 character_set_filesystem                                                        binary
5    character_set_results                                                        cp1251
6     character_set_server                                                       utf8mb4
7     character_set_system                                                          utf8
8       character_sets_dir C:\Program Files\MySQL\MySQL Server 8.0\share\charsets\
> 
> TrTMP <- dbReadTable(mydb, 'tregions')
> TrTMP[1,c(1,2,6,14)]
  id     regionname  iSOID administrativeCenter
1  1 Алтайский край RU-ALT              Барнаул

Method 2

Tool -> Global Options -> Code -> Saving and put UTF-8
rs <- dbSendQuery(con, 'set character set "utf8"')
rs <- dbSendQuery(con, 'SET NAMES utf8')

options(encoding = “UTF-8”) at the top of my main script from which I call my package seems to fix the issue with having non-ascii characters in my package code.

read_chunk(lines = readLines("TestSpanishText.R", encoding = "UTF-8")) (also file())

For more flexibility, you should use utf8mb4 instead of cp1251. If you have data coming into the client in cp1251, then you probably have to stick with that charset.


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x