DataFrame.resample() works only with time-series data. I cannot find a way of getting every nth row from non-time-series data. What is the best method?
Answers:
Method 1
I’d use iloc, which takes a row/column slice, both based on integer position and following normal Python slice syntax. If you want every 5th row:
df.iloc[::5, :]
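As a minimal sketch of the slice above (the frame and its column name `value` are made up for illustration), the same syntax also accepts a start offset, so sampling does not have to begin at the first row:

```python
import pandas as pd

# Hypothetical 12-row frame; any DataFrame behaves the same way.
df = pd.DataFrame({"value": range(12)})

every_5th = df.iloc[::5, :]        # row positions 0, 5, 10
every_5th_from_2 = df.iloc[2::5]   # row positions 2, 7
```

Because iloc is purely positional, this works regardless of what the index contains.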
Method 2
Though @chrisb’s accepted answer does answer the question, I would like to add to it the following.
A simple method I use to get the nth data or drop the nth row is the following:
df1 = df[df.index % 3 != 0]  # Excludes every 3rd row starting from 0
df2 = df[df.index % 3 == 0]  # Selects every 3rd row starting from 0
This arithmetic-based sampling also enables more complex row selections.
This assumes, of course, that you have an index column of ordered, consecutive, integers starting at 0.
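If the index does not meet that assumption, one possible workaround is to build the mask from row positions instead. The sketch below uses a made-up frame with a string index and a hypothetical rule (keep the first two rows of every block of five) to show a more complex selection:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-integer index, so df.index % n would not work.
df = pd.DataFrame({"value": range(10)}, index=list("abcdefghij"))

pos = np.arange(len(df))               # row positions 0..9, independent of the index
first_two_of_five = df[pos % 5 < 2]    # keeps positions 0, 1, 5, 6
```

A boolean NumPy array is applied by position, so the labels in the index play no part in the selection.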
Method 3
There is an even simpler solution than the accepted answer, which involves directly invoking df.__getitem__.
df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
For example, to get every 2nd row, you can do
df[::2]
a b c
0 x x x
2 x x x
4 x x x
There’s also GroupBy.first/GroupBy.head, where you group on the index:
df.index // 2
# Int64Index([0, 0, 1, 1, 2], dtype='int64')
df.groupby(df.index // 2).first()
# Alternatively,
# df.groupby(df.index // 2).head(1)
a b c
0 x x x
1 x x x
2 x x x
The index is floor-divided by the stride (2, in this case). If the index is non-numeric, instead do
# df.groupby(np.arange(len(df)) // 2).first()
df.groupby(pd.RangeIndex(len(df)) // 2).first()
a b c
0 x x x
1 x x x
2 x x x
Method 4
Adding reset_index() to metastableB’s answer means you only need to assume that the rows are ordered and consecutive; the index itself no longer matters.
df1 = df[df.reset_index().index % 3 != 0]  # Excludes every 3rd row starting from 0
df2 = df[df.reset_index().index % 3 == 0]  # Selects every 3rd row starting from 0
df.reset_index().index will create an index that starts at 0 and increments by 1, allowing you to use the modulo easily.
Method 5
I had a similar requirement, but I wanted the n’th item in a particular group. This is how I solved it.
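# group_keys=False keeps the original row labels on the result,
# so the boolean mask still aligns with data (needed on pandas 2.x)
groups = data.groupby(['group_key'], group_keys=False)
selection = groups['index_col'].apply(lambda x: x % 3 == 0)
subset = data[selection]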
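A related sketch, which is not the answer's own method: if the goal is every nth row by position within each group, rather than by the value of an index column, GroupBy.cumcount() can supply the within-group position. The column names and data below are invented for illustration.

```python
import pandas as pd

# Hypothetical data: two groups whose rows are interleaved.
data = pd.DataFrame({
    "group_key": ["a", "b", "a", "b", "a", "a", "b", "a"],
    "value": range(8),
})

# cumcount() numbers the rows 0, 1, 2, ... within each group, in row order.
position_in_group = data.groupby("group_key").cumcount()
subset = data[position_in_group % 3 == 0]   # every 3rd row of each group
```

The result of cumcount() carries the same index as data, so it can be used directly as a mask.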
Method 6
A solution I came up with when using the index was not viable (possibly the multi-gigabyte .csv was too large, or I missed some technique that would allow me to reindex without crashing).
Walk through one row at a time and add the nth row to a new dataframe.
import pandas as pd
from csv import DictReader

def make_downsampled_df(filename, interval):
    rows = []
    with open(filename, 'r', newline='') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        column_names = csv_dict_reader.fieldnames
        for index, row in enumerate(csv_dict_reader):
            # Keep only every interval-th row
            if index % interval == 0:
                rows.append(row)
    # DataFrame.append was removed in pandas 2.0, so build the frame once at the end
    return pd.DataFrame(rows, columns=column_names)
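As an alternative sketch for the same large-file situation, pd.read_csv accepts a callable for skiprows, so unwanted lines can be dropped while the file is being parsed. The helper name read_every_nth and the in-memory sample are assumptions for illustration, and the sketch assumes a single header line with each record on one line.

```python
import io
import pandas as pd

def read_every_nth(source, interval):
    # Line 0 is the header; data row k sits on line k + 1.
    # Skip a line when it is a data row whose position is not a multiple of interval.
    return pd.read_csv(
        source,
        skiprows=lambda i: i > 0 and (i - 1) % interval != 0,
    )

# Small in-memory stand-in for a large CSV file.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))
sampled = read_every_nth(io.StringIO(csv_text), 3)
```

Unlike the DictReader version, this lets pandas infer column dtypes instead of leaving every value as a string.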
Method 7
df.drop(labels=df[df.index % 3 != 0].index, axis=0) # every 3rd row (mod 3)
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.