I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then Your Facebook Information, then Download Your Information, then create a file with at least the Messages box checked) to do some cool statistics.
However, there is a small problem with encoding. I’m not sure, but it looks like Facebook used a bad encoding for this data. When I open it with a text editor I see something like this: Rados\u00c5\u0082aw. When I try to open it with Python (UTF-8) I get RadosÅ\x82aw. However, I should get: Radosław.
My Python script:
text = open(os.path.join(subdir, file), encoding='utf-8')
conversations.append(json.load(text))
I tried a few of the most common encodings. Example data is:
{
  "sender_name": "Rados\u00c5\u0082aw",
  "timestamp": 1524558089,
  "content": "No to trzeba ostatnie treningi zrobi\u00c4\u0087 xD",
  "type": "Generic"
}
Answers:
Method 1
I can indeed confirm that the Facebook download data is incorrectly encoded; it is Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I’ll make sure to file a bug report.
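To make the diagnosis concrete, here is a minimal sketch that reproduces the damage on the name from the question and then reverses it. The three steps are only an assumption about what happens inside the exporter; the variable names are made up for illustration.

```python
import json

original = "Radosław"

# Step 1: the text is correctly encoded as UTF-8 ("ł" becomes bytes C5 82).
utf8_bytes = original.encode("utf-8")

# Step 2 (assumed): the bytes are wrongly decoded as Latin-1,
# which turns every byte into one character.
mojibake = utf8_bytes.decode("latin-1")

# Step 3: the JSON serializer escapes the non-ASCII characters.
exported = json.dumps(mojibake)
print(exported)  # "Rados\u00c5\u0082aw"

# Undoing the steps in reverse order recovers the original text.
repaired = json.loads(exported).encode("latin-1").decode("utf-8")
print(repaired)  # Radosław
```

The escaped form printed here matches what the question shows in the exported file, which is why re-encoding as Latin-1 and decoding as UTF-8 is enough to repair it.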
In the meantime, you can repair the damage in two ways:
- Decode the data as JSON, then re-encode any strings as Latin-1 and decode again as UTF-8:
>>> import json
>>> data = r'"Rados\u00c5\u0082aw"'
>>> json.loads(data).encode('latin1').decode('utf8')
'Radosław'
- Load the data as binary, replace all \u00hh sequences with the byte the last two hex digits represent, decode as UTF-8 and then decode as JSON:
import re
from functools import partial
fix_mojibake_escapes = partial(
    re.compile(rb'\\u00([\da-f]{2})').sub,
    lambda m: bytes.fromhex(m.group(1).decode()))
with open(os.path.join(subdir, file), 'rb') as binary_data:
    repaired = fix_mojibake_escapes(binary_data.read())
data = json.loads(repaired.decode('utf8'))
From your sample data this produces:
{'content': 'No to trzeba ostatnie treningi zrobić xD',
 'sender_name': 'Radosław',
 'timestamp': 1524558089,
 'type': 'Generic'}
Method 2
Here is a command-line solution with jq and iconv. Tested on Linux.
cat message_1.json | jq . | iconv -f utf8 -t latin1 > m1.json
Method 3
My solution uses the object_hook callback of the json load/loads functions:
import json
def parse_obj(dct):
    # Repair every value; this version assumes all values are strings
    for key in dct:
        dct[key] = dct[key].encode('latin_1').decode('utf-8')
    return dct
data = r'{"msg": "Ahoj sv\u00c4\u009bte"}'
# String
json.loads(data)
# Out: {'msg': 'Ahoj svÄ\x9bte'}
json.loads(data, object_hook=parse_obj)
# Out: {'msg': 'Ahoj světe'}
# File
with open('/path/to/file.json') as f:
    json.load(f, object_hook=parse_obj)
# Out: {'msg': 'Ahoj světe'}
Update:
The solution above does not work for lists of strings. Here is an updated version:
import json
def parse_obj(obj):
    for key in obj:
        if isinstance(obj[key], str):
            obj[key] = obj[key].encode('latin_1').decode('utf-8')
        elif isinstance(obj[key], list):
            # Repair strings inside the list; leave other items untouched
            obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
    return obj
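For anyone wondering when the hook actually runs: as far as I know, json calls object_hook once for every JSON object it finishes decoding, innermost first. So dicts nested inside lists pass through the hook on their own, while strings inside lists of lists would still be missed. A small sketch to see the call order (record, seen and sample are made-up names):

```python
import json

seen = []

def record(obj):
    # Called once for every JSON object (dict) the decoder completes.
    seen.append(sorted(obj))
    return obj

sample = '{"messages": [{"content": "a"}, {"content": "b"}], "title": "t"}'
json.loads(sample, object_hook=record)
print(seen)
# [['content'], ['content'], ['messages', 'title']]
```

The two inner message dicts are reported before the outer one, so by the time the outer dict reaches the hook its nested dicts have already been processed.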
Method 4
I would like to extend @Geekmoss’ answer with the following recursive code snippet, which I used to decode my Facebook data.
import json
def parse_obj(obj):
    if isinstance(obj, str):
        return obj.encode('latin_1').decode('utf-8')
    if isinstance(obj, list):
        return [parse_obj(o) for o in obj]
    if isinstance(obj, dict):
        return {key: parse_obj(item) for key, item in obj.items()}
    return obj
# file is expected to hold the JSON text itself, not a path
decoded_data = parse_obj(json.loads(file))
I noticed this works better, because the Facebook data you download might contain a list of dicts, in which case those dicts would just be returned ‘as is’ by the lambda identity function.
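One thing the recursive version does not cover is a string that is not mojibake at all: the encode/decode round trip then raises an exception. Below is a hedged variation that falls back to the original string and also repairs dict keys. The names fix_text and fix_obj and the "tags" field are made up for this sketch; they are not part of the Facebook export.

```python
import json

def fix_text(text):
    """Undo the Latin-1/UTF-8 mix-up, or return the text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake: keep the string as it is.
        return text

def fix_obj(obj):
    if isinstance(obj, str):
        return fix_text(obj)
    if isinstance(obj, list):
        return [fix_obj(item) for item in obj]
    if isinstance(obj, dict):
        # Keys are strings too, so they get the same treatment.
        return {fix_text(key): fix_obj(value) for key, value in obj.items()}
    return obj

# "tags" is a hypothetical field, used only to show a nested list.
sample = json.loads(
    r'{"sender_name": "Rados\u00c5\u0082aw", "tags": ["zrobi\u00c4\u0087"]}'
)
print(fix_obj(sample))
# {'sender_name': 'Radosław', 'tags': ['zrobić']}
```

Swallowing the exception is a trade-off: it keeps a mixed file loadable, but it can also hide strings that were damaged in some other way.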
Method 5
Based on @Martijn Pieters’ solution, I wrote something similar in Java.
public String getMessengerJson(Path path) throws IOException {
    String badlyEncoded = Files.readString(path, StandardCharsets.UTF_8);
    String unescaped = unescapeMessenger(badlyEncoded);
    byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
    String fixed = new String(bytes, StandardCharsets.UTF_8);
    return fixed;
}
The unescape method is inspired by the org.apache.commons.lang.StringEscapeUtils.
private String unescapeMessenger(String str) {
    if (str == null) {
        return null;
    }
    try {
        StringWriter writer = new StringWriter(str.length());
        unescapeMessenger(writer, str);
        return writer.toString();
    } catch (IOException ioe) {
        // this should never ever happen while writing to a StringWriter
        throw new UnhandledException(ioe);
    }
}
private void unescapeMessenger(Writer out, String str) throws IOException {
    if (out == null) {
        throw new IllegalArgumentException("The Writer must not be null");
    }
    if (str == null) {
        return;
    }
    int sz = str.length();
    StrBuilder unicode = new StrBuilder(4);
    boolean hadSlash = false;
    boolean inUnicode = false;
    for (int i = 0; i < sz; i++) {
        char ch = str.charAt(i);
        if (inUnicode) {
            unicode.append(ch);
            if (unicode.length() == 4) {
                // unicode now contains the four hex digits
                // which represents our unicode character
                try {
                    int value = Integer.parseInt(unicode.toString(), 16);
                    out.write((char) value);
                    unicode.setLength(0);
                    inUnicode = false;
                    hadSlash = false;
                } catch (NumberFormatException nfe) {
                    throw new NestableRuntimeException("Unable to parse unicode value: " + unicode, nfe);
                }
            }
            continue;
        }
        if (hadSlash) {
            hadSlash = false;
            if (ch == 'u') {
                inUnicode = true;
            } else {
                out.write("\\");
                out.write(ch);
            }
            continue;
        } else if (ch == '\\') {
            hadSlash = true;
            continue;
        }
        out.write(ch);
    }
    if (hadSlash) {
        // then we're in the weird case of a \ at the end of the
        // string, let's output it anyway.
        out.write('\\');
    }
}
Method 6
Facebook programmers seem to have mixed up the concepts of Unicode encoding and escape sequences, probably while implementing their own ad-hoc serializer. Further details in Invalid Unicode encodings in Facebook data exports.
Try this:
import json
import io
class FacebookIO(io.FileIO):
    def read(self, size: int = -1) -> bytes:
        data: bytes = super(FacebookIO, self).readall()
        new_data: bytes = b''
        i: int = 0
        while i < len(data):
            # \u00c4\u0085
            # 0123456789ab
            if data[i:].startswith(b'\\u00'):
                u: int = 0
                new_char: bytes = b''
                while data[i+u:].startswith(b'\\u00'):
                    hex = int(bytes([data[i+u+4], data[i+u+5]]), 16)
                    new_char = b''.join([new_char, bytes([hex])])
                    u += 6
                char : str = new_char.decode('utf-8')
                new_chars: bytes = bytes(json.dumps(char).strip('"'), 'ascii')
                new_data += new_chars
                i += u
            else:
                new_data = b''.join([new_data, bytes([data[i]])])
                i += 1
        return new_data
if __name__ == '__main__':
    f = FacebookIO('data.json','rb')
    d = json.load(f)
    print(d)
Method 7
This is @Geekmoss’ answer, but adapted for Python 3:
import json
from pathlib import Path
def parse_facebook_json(json_file_path):
    def parse_obj(obj):
        for key in obj:
            if isinstance(obj[key], str):
                obj[key] = obj[key].encode('latin_1').decode('utf-8')
            elif isinstance(obj[key], list):
                obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
        return obj
    # json_file_path must be a pathlib.Path
    with json_file_path.open('rb') as json_file:
        return json.load(json_file, object_hook=parse_obj)
# Usage
parse_facebook_json(Path("/.../message_1.json"))
Method 8
Extending Martijn’s solution #1, which I can see leads towards recursive object processing (it certainly led me there initially):
You can apply it to the whole string of the JSON object instead, as long as you dump with ensure_ascii=False:
json.dumps(obj, ensure_ascii=False, indent=2).encode('latin-1').decode('utf-8')
Then write the result to a file, or use it however you need.
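Here is a rough sketch of the whole round trip for one file, in case it helps. The function name repair_export and its two path arguments are made up; it assumes the input contains only mojibake text, because any character above U+00FF makes the Latin-1 encode step raise UnicodeEncodeError.

```python
import json
from pathlib import Path

def repair_export(src, dst):
    """Rewrite the mojibake JSON file at src as clean UTF-8 JSON at dst."""
    obj = json.loads(Path(src).read_text(encoding="utf-8"))
    # ensure_ascii=False keeps the characters unescaped, so each mojibake
    # character (all of them below U+0100) maps back to exactly one byte.
    text = json.dumps(obj, ensure_ascii=False, indent=2)
    repaired = text.encode("latin-1").decode("utf-8")
    Path(dst).write_text(repaired, encoding="utf-8")
```

Calling repair_export("message_1.json", "message_1_fixed.json") would leave the original untouched and write a repaired copy next to it; both file names are only examples.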
PS: This should be a comment on @Martijn’s answer: https://stackoverflow.com/a/50011987/1309932 (but I can’t add comments)
Method 9
This is my approach for Node 17.0.1, based on @hotigeftas’ recursive code, using the iconv-lite package.
import fs from 'fs';
import iconv from 'iconv-lite';
function parseObject(object) {
    if (typeof object == 'string') {
        return iconv.decode(iconv.encode(object, 'latin1'), 'utf8');
    }
    if (typeof object == 'object') {
        for (let key in object) {
            object[key] = parseObject(object[key]);
        }
        return object;
    }
    return object;
}
//usage
let file = JSON.parse(fs.readFileSync(fileName));
file = parseObject(file);
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.