Suppose I have a nested dictionary ‘user_dict’ with structure:
- Level 1: UserId (Long Integer)
- Level 2: Category (String)
- Level 3: Assorted Attributes (floats, ints, etc..)
For example, an entry of this dictionary would be:
user_dict[12] = { "Category 1": {"att_1": 1, "att_2": "whatever"}, "Category 2": {"att_1": 23, "att_2": "another"}}
each item in user_dict
has the same structure and user_dict
contains a large number of items which I want to feed to a pandas DataFrame, constructing the series from the attributes. In this case a hierarchical index would be useful for the purpose.
Specifically, my question is whether there exists a way to to help the DataFrame constructor understand that the series should be built from the values of the “level 3” in the dictionary?
If I try something like:
df = pandas.DataFrame(users_summary)
The items in “level 1” (the UserId’s) are taken as columns, which is the opposite of what I want to achieve (have UserId’s as index).
I know I could construct the series after iterating over the dictionary entries, but if there is a more direct way this would be very useful. A similar question would be asking whether it is possible to construct a pandas DataFrame from json objects listed in a file.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
A pandas MultiIndex consists of a list of tuples. So the most natural approach would be to reshape your input dict so that its keys are tuples corresponding to the multi-index values you require. Then you can just construct your dataframe using pd.DataFrame.from_dict
, using the option orient='index'
:
user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'}, 'Category 2': {'att_1': 23, 'att_2': 'another'}}, 15: {'Category 1': {'att_1': 10, 'att_2': 'foo'}, 'Category 2': {'att_1': 30, 'att_2': 'bar'}}} pd.DataFrame.from_dict({(i,j): user_dict[i][j] for i in user_dict.keys() for j in user_dict[i].keys()}, orient='index') att_1 att_2 12 Category 1 1 whatever Category 2 23 another 15 Category 1 10 foo Category 2 30 bar
An alternative approach would be to build your dataframe up by concatenating the component dataframes:
user_ids = [] frames = [] for user_id, d in user_dict.iteritems(): user_ids.append(user_id) frames.append(pd.DataFrame.from_dict(d, orient='index')) pd.concat(frames, keys=user_ids) att_1 att_2 12 Category 1 1 whatever Category 2 23 another 15 Category 1 10 foo Category 2 30 bar
Method 2
pd.concat
accepts a dictionary. With this in mind, it is possible to improve upon the currently accepted answer in terms of simplicity and performance by use a dictionary comprehension to build a dictionary mapping keys to sub-frames.
pd.concat({k: pd.DataFrame(v).T for k, v in user_dict.items()}, axis=0)
Or,
pd.concat({ k: pd.DataFrame.from_dict(v, 'index') for k, v in user_dict.items() }, axis=0)
att_1 att_2 12 Category 1 1 whatever Category 2 23 another 15 Category 1 10 foo Category 2 30 bar
Method 3
So I used to use a for loop for iterating through the dictionary as well, but one thing I’ve found that works much faster is to convert to a panel and then to a dataframe.
Say you have a dictionary d
import pandas as pd d {'RAY Index': {datetime.date(2014, 11, 3): {'PX_LAST': 1199.46, 'PX_OPEN': 1200.14}, datetime.date(2014, 11, 4): {'PX_LAST': 1195.323, 'PX_OPEN': 1197.69}, datetime.date(2014, 11, 5): {'PX_LAST': 1200.936, 'PX_OPEN': 1195.32}, datetime.date(2014, 11, 6): {'PX_LAST': 1206.061, 'PX_OPEN': 1200.62}}, 'SPX Index': {datetime.date(2014, 11, 3): {'PX_LAST': 2017.81, 'PX_OPEN': 2018.21}, datetime.date(2014, 11, 4): {'PX_LAST': 2012.1, 'PX_OPEN': 2015.81}, datetime.date(2014, 11, 5): {'PX_LAST': 2023.57, 'PX_OPEN': 2015.29}, datetime.date(2014, 11, 6): {'PX_LAST': 2031.21, 'PX_OPEN': 2023.33}}}
The command
pd.Panel(d) <class 'pandas.core.panel.Panel'> Dimensions: 2 (items) x 2 (major_axis) x 4 (minor_axis) Items axis: RAY Index to SPX Index Major_axis axis: PX_LAST to PX_OPEN Minor_axis axis: 2014-11-03 to 2014-11-06
where pd.Panel(d)[item] yields a dataframe
pd.Panel(d)['SPX Index'] 2014-11-03 2014-11-04 2014-11-05 2014-11-06 PX_LAST 2017.81 2012.10 2023.57 2031.21 PX_OPEN 2018.21 2015.81 2015.29 2023.33
You can then hit the command to_frame() to turn it into a dataframe. I use reset_index as well to turn the major and minor axis into columns rather than have them as indices.
pd.Panel(d).to_frame().reset_index() major minor RAY Index SPX Index PX_LAST 2014-11-03 1199.460 2017.81 PX_LAST 2014-11-04 1195.323 2012.10 PX_LAST 2014-11-05 1200.936 2023.57 PX_LAST 2014-11-06 1206.061 2031.21 PX_OPEN 2014-11-03 1200.140 2018.21 PX_OPEN 2014-11-04 1197.690 2015.81 PX_OPEN 2014-11-05 1195.320 2015.29 PX_OPEN 2014-11-06 1200.620 2023.33
Finally, if you don’t like the way the frame looks you can use the transpose function of panel to change the appearance before calling to_frame() see documentation here
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Panel.transpose.html
Just as an example
pd.Panel(d).transpose(2,0,1).to_frame().reset_index() major minor 2014-11-03 2014-11-04 2014-11-05 2014-11-06 RAY Index PX_LAST 1199.46 1195.323 1200.936 1206.061 RAY Index PX_OPEN 1200.14 1197.690 1195.320 1200.620 SPX Index PX_LAST 2017.81 2012.100 2023.570 2031.210 SPX Index PX_OPEN 2018.21 2015.810 2015.290 2023.330
Hope this helps.
Method 4
In case someone wants to get the data frame in a “long format” (leaf values have the same type) without multiindex, you can do this:
pd.DataFrame.from_records( [ (level1, level2, level3, leaf) for level1, level2_dict in user_dict.items() for level2, level3_dict in level2_dict.items() for level3, leaf in level3_dict.items() ], columns=['UserId', 'Category', 'Attribute', 'value'] ) UserId Category Attribute value 0 12 Category 1 att_1 1 1 12 Category 1 att_2 whatever 2 12 Category 2 att_1 23 3 12 Category 2 att_2 another 4 15 Category 1 att_1 10 5 15 Category 1 att_2 foo 6 15 Category 2 att_1 30 7 15 Category 2 att_2 bar
(I know the original question probably wants (I.) to have Levels 1 and 2 as multiindex and Level 3 as columns and (II.) asks about other ways than iteration over values in the dict. But I hope this answer is still relevant and useful (I.): to people like me who have tried to find a way to get the nested dict into this shape and google only returns this question and (II.): because other answers involve some iteration as well and I find this approach flexible and easy to read; not sure about performance, though.)
Method 5
This solution should work for arbitrary depth by flattening dictionary keys to a tuple chain
def flatten_dict(nested_dict): res = {} if isinstance(nested_dict, dict): for k in nested_dict: flattened_dict = flatten_dict(nested_dict[k]) for key, val in flattened_dict.items(): key = list(key) key.insert(0, k) res[tuple(key)] = val else: res[()] = nested_dict return res def nested_dict_to_df(values_dict): flat_dict = flatten_dict(values_dict) df = pd.DataFrame.from_dict(flat_dict, orient="index") df.index = pd.MultiIndex.from_tuples(df.index) df = df.unstack(level=-1) df.columns = df.columns.map("{0[1]}".format) return df
Method 6
For other ways to represent the data, you don’t need to do much. For example, if you just want the “outer” key to be an index, the “inner” key to be columns and the values to be cell values, this would do the trick:
df = pd.DataFrame.from_dict(user_dict, orient='index')
Method 7
Building on verified answer, for me this worked best:
ab = pd.concat({k: pd.DataFrame(v).T for k, v in data.items()}, axis=0) ab.T
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0