I have a Pandas dataframe:
type(original)
pandas.core.frame.DataFrame
which includes the series object original['user']:
type(original['user'])
pandas.core.series.Series
original['user'] points to a number of dicts:
type(original['user'].ix[0])
dict
Each dict has the same keys:
original['user'].ix[0].keys()
[u'follow_request_sent', u'profile_use_background_image', u'profile_text_color', u'id', u'verified', u'profile_location', # ... keys removed for brevity ]
Above is (part of) one of the dicts of user fields in a tweet from the Twitter API. I want to build a data frame from these dicts.
When I try to make a data frame directly, I get only one column for each row and this column contains the whole dict:
pd.DataFrame(original['user'][:2])
user
0 {u'follow_request_sent': False, u'profile_use_...
1 {u'follow_request_sent': False, u'profile_use_..
When I try to create a data frame using from_dict() I get the same result:
pd.DataFrame.from_dict(original['user'][:2])
user
0 {u'follow_request_sent': False, u'profile_use_...
1 {u'follow_request_sent': False, u'profile_use_..
Next I tried a list comprehension which returned an error:
item = [[k, v] for (k,v) in users]
ValueError: too many values to unpack
When I create a data frame from a single row, it nearly works:
df = pd.DataFrame.from_dict(original['user'].ix[0])
df.reset_index()
index contributors_enabled created_at default_profile default_profile_image description entities favourites_count follow_request_sent followers_count following friends_count geo_enabled id id_str is_translation_enabled is_translator lang listed_count location name notifications profile_background_color profile_background_image_url profile_background_image_url_https profile_background_tile profile_image_url profile_image_url_https profile_link_color profile_location profile_sidebar_border_color profile_sidebar_fill_color profile_text_color profile_use_background_image protected screen_name statuses_count time_zone url utc_offset verified
0 description False Mon May 26 11:58:40 +0000 2014 True False {u'urls': []} 0 False 157
It works almost like I want it to, except it sets the description field as the default index.
Each of the dicts has 40 keys, but I only need about 10 of them, and the data frame has 28734 rows.
How can I filter out the keys which I do not need?
Answers:
Method 1
What I would try is the following:
new_df = pd.DataFrame(list(original['user']))
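This converts the Series to a list of dicts and passes it to the DataFrame constructor, which turns each dict key into a column. Note that the result gets a fresh 0..n-1 index rather than the index of the original Series.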
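Since the question also asks how to drop the keys that are not needed, here is a minimal sketch of that step on top of this approach. The sample dicts and the `wanted` list are made up for illustration; substitute the keys you actually need.

```python
import pandas as pd

# Hypothetical stand-in for original['user']: a Series of dicts
# with a non-default index.
users = pd.Series(
    [
        {"id": 1, "screen_name": "alice", "verified": False, "lang": "en"},
        {"id": 2, "screen_name": "bob", "verified": True, "lang": "de"},
    ],
    index=[10, 11],
)

# Keys to keep (assumed names; list the ~10 you care about).
wanted = ["id", "screen_name", "verified"]

# With a list of dicts, `columns` keeps only those keys;
# `index` carries over the original row labels.
new_df = pd.DataFrame(list(users), columns=wanted, index=users.index)
print(new_df)
```

Filtering at construction time avoids building the other 30 or so columns only to throw them away.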
Method 2
df = original['user'].apply(pd.Series)
This works well and keeps the original index.
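A small sketch of this method on invented data, followed by selecting the columns of interest. The sample values and column names are assumptions. Because `apply(pd.Series)` builds one Series per row, it may be noticeably slower than the other methods on tens of thousands of rows.

```python
import pandas as pd

# Hypothetical stand-in for original['user'].
users = pd.Series(
    [
        {"id": 1, "screen_name": "alice", "verified": False},
        {"id": 2, "screen_name": "bob", "verified": True},
    ],
    index=[10, 11],
)

# Each dict becomes a row; the row labels of `users` are preserved.
df = users.apply(pd.Series)

# Keep only the columns that are needed (names are assumed).
df = df[["id", "screen_name"]]
print(df)
```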
Method 3
This works:
series_of_dicts = original['user']
df = pd.DataFrame.from_records(
series_of_dicts.values, index=series_of_dicts.index
)
Or if you have a list or other iterable of dicts, then a simple
pd.DataFrame.from_records(iterable_of_dicts)
works.
See the pandas documentation for DataFrame.from_records.
I haven’t timed it, but I’d imagine it should be pretty fast, since this is exactly what DataFrame.from_records() was made for.
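If you want to check the speed on your own machine, a rough timing sketch along these lines could be used. The synthetic data, the row count, and the helper function names are all assumptions made for illustration; results will vary with data size and pandas version.

```python
import timeit

import pandas as pd

# Synthetic data standing in for original['user'] (values are invented).
series_of_dicts = pd.Series(
    [
        {"id": i, "screen_name": "user%d" % i, "verified": i % 2 == 0}
        for i in range(2000)
    ]
)


def with_from_records():
    # The approach from this answer.
    return pd.DataFrame.from_records(
        series_of_dicts.values, index=series_of_dicts.index
    )


def with_apply():
    # The apply(pd.Series) approach, for comparison.
    return series_of_dicts.apply(pd.Series)


# Run each a few times and print the total elapsed seconds.
t_records = timeit.timeit(with_from_records, number=3)
t_apply = timeit.timeit(with_apply, number=3)
print("from_records: %.4fs, apply(pd.Series): %.4fs" % (t_records, t_apply))
```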
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.