Python: Pandas dataframe from Series of dict

I have a Pandas dataframe:

type(original)
pandas.core.frame.DataFrame

which includes the series object original['user']:

type(original['user'])
pandas.core.series.Series

original['user'] points to a number of dicts:

type(original['user'].ix[0])
dict

Each dict has the same keys:

original['user'].ix[0].keys()

[u'follow_request_sent',
 u'profile_use_background_image',
 u'profile_text_color',
 u'id',
 u'verified',
 u'profile_location',
 # ... keys removed for brevity
]

Above is (part of) one of the dicts of user fields in a tweet from the Twitter API. I want to build a data frame from these dicts.

When I try to make a data frame directly, I get only one column for each row and this column contains the whole dict:

pd.DataFrame(original['user'][:2])
    user
0   {u'follow_request_sent': False, u'profile_use_...
1   {u'follow_request_sent': False, u'profile_use_..

When I try to create a data frame using from_dict() I get the same result:

pd.DataFrame.from_dict(original['user'][:2])

    user
0   {u'follow_request_sent': False, u'profile_use_...
1   {u'follow_request_sent': False, u'profile_use_..

Next I tried a list comprehension which returned an error:

item = [[k, v] for (k,v) in users]
ValueError: too many values to unpack

When I create a data frame from a single row, it nearly works:

df = pd.DataFrame.from_dict(original['user'].ix[0])
df.reset_index()

    index   contributors_enabled    created_at  default_profile     default_profile_image   description     entities    favourites_count    follow_request_sent     followers_count     following   friends_count   geo_enabled     id  id_str  is_translation_enabled  is_translator   lang    listed_count    location    name    notifications   profile_background_color    profile_background_image_url    profile_background_image_url_https  profile_background_tile     profile_image_url   profile_image_url_https     profile_link_color  profile_location    profile_sidebar_border_color    profile_sidebar_fill_color  profile_text_color  profile_use_background_image    protected   screen_name     statuses_count  time_zone   url     utc_offset  verified
0   description     False   Mon May 26 11:58:40 +0000 2014  True    False       {u'urls': []}   0   False   157

It works almost like I want it to, except it sets the description field as the default index.

Each of the dicts has 40 keys, but I only need about 10 of them, and the data frame has 28734 rows.

How can I filter out the keys which I do not need?

Answers:


Method 1

What I would try is the following:

new_df = pd.DataFrame(list(original['user']))

This converts the Series to a list of dicts and passes it to the DataFrame constructor, which expands each dict into a row with one column per key.
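
A minimal sketch of this approach on made-up data, since the question's data is not available. The small `original` frame, its field values, and the `wanted` list are hypothetical. Passing `index=` and `columns=` to the constructor is an addition not shown in the answer: the first keeps the original row labels, the second keeps only the listed keys, which is what the question asks for.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: a column of dicts
# that all share the same keys.
original = pd.DataFrame(
    {
        "user": [
            {"id": 1, "verified": False, "lang": "en", "profile_text_color": "333333"},
            {"id": 2, "verified": True, "lang": "de", "profile_text_color": "000000"},
        ]
    },
    index=[10, 11],
)

# Keys to keep (an illustrative choice, not taken from the question).
wanted = ["id", "verified", "lang"]

# list(...) turns the Series into a plain list of dicts; the constructor
# builds one column per key, and columns= restricts the result to `wanted`.
new_df = pd.DataFrame(list(original["user"]), columns=wanted, index=original.index)
print(new_df)
```

Without `index=original.index` the result gets a fresh 0..n-1 index, which only matters if the rows need to be matched back to `original` later.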

Method 2

df = original['user'].apply(pd.Series)

This works well: each dict is expanded into a row, with its keys as columns.

credit
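
A small sketch of what this does, again on hypothetical data (the `users` Series and its values are made up for illustration). As far as I know, `apply(pd.Series)` builds a new Series for every row, so it can be noticeably slower than the other methods on tens of thousands of rows; treat that as a rule of thumb rather than a measurement.

```python
import pandas as pd

# Hypothetical Series of dicts standing in for original['user'].
users = pd.Series(
    [
        {"id": 1, "verified": False, "lang": "en"},
        {"id": 2, "verified": True, "lang": "de"},
    ],
    index=["a", "b"],
)

# pd.Series is called once per dict; the per-row results are stacked
# into a DataFrame and the original index is preserved.
df = users.apply(pd.Series)

# Keep only the keys that are needed (illustrative selection).
subset = df[["id", "lang"]]
print(subset)
```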

Method 3

This works:

series_of_dicts = original['user']
df = pd.DataFrame.from_records(
    series_of_dicts.values, index=series_of_dicts.index
)

Or if you have a list or other iterable of dicts, then a simple

pd.DataFrame.from_records(iterable_of_dicts)

works.

Docs for DataFrame.from_records

I haven’t timed it, but I’d imagine it should be pretty fast, since this is exactly what DataFrame.from_records() was made for.
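
To illustrate why passing the index matters, here is a hedged sketch on made-up data. The `text` column, the row labels, and the `wanted` list are assumptions for the example. Because the expanded frame keeps the labels of `original`, a subset of its columns can be joined back onto the remaining columns row by row.

```python
import pandas as pd

# Hypothetical frame: one ordinary column plus a column of dicts.
original = pd.DataFrame(
    {
        "text": ["first tweet", "second tweet"],
        "user": [
            {"id": 1, "screen_name": "alice", "verified": False},
            {"id": 2, "screen_name": "bob", "verified": True},
        ],
    },
    index=[100, 200],
)

series_of_dicts = original["user"]

# Expand the dicts into columns, reusing the row labels of the Series.
users = pd.DataFrame.from_records(
    series_of_dicts.values, index=series_of_dicts.index
)

# Keep only the needed keys (illustrative) and attach them to the other columns.
wanted = ["id", "screen_name"]
combined = original.drop(columns="user").join(users[wanted])
print(combined)
```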


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
