I have a very large CSV file, too big to read entirely into memory. I only want to read and process a few lines of it, so I am looking for a pandas function that can handle this. Plain Python handles it well:
with open('abc.csv') as f:
    line = f.readline()
    # pass until it reaches a particular line number....
However, if I do this in pandas, I always read the first line:
datainput1 = pd.read_csv('matrix.txt',sep=',', header = None, nrows = 1 )
datainput2 = pd.read_csv('matrix.txt',sep=',', header = None, nrows = 1 )
I am looking for an easier way to handle this task in pandas. For example, how can I quickly read rows 1000 to 2000?
I want to use pandas because I want to read data into the dataframe.
Answers:
Method 1
Use chunksize:
for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1):
    # do something with df, a one-row DataFrame
    pass
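As a rough illustration of the chunked loop above, the sketch below writes a small stand-in file and accumulates two totals one chunk at a time. The file name `matrix_demo.txt`, the chunk size of 4, and the variables `row_count` and `col0_total` are assumptions made for this example only; they are not part of the original answer.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in for the large file: 10 rows, 3 columns, no header.
# (Hypothetical demo data; the real file would be far larger.)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "matrix_demo.txt")
with open(path, "w") as f:
    for i in range(10):
        f.write(f"{i},{i * 2},{i * 3}\n")

row_count = 0
col0_total = 0
# With chunksize set, read_csv returns an iterator; each iteration yields
# a DataFrame of at most 4 rows, so only one chunk is in memory at a time.
for chunk in pd.read_csv(path, sep=",", header=None, chunksize=4):
    row_count += len(chunk)
    col0_total += int(chunk[0].sum())  # column labels are 0, 1, 2 when header=None

print(row_count, col0_total)  # prints: 10 45
```

A larger chunksize usually means fewer iterations at the cost of more memory per chunk, so the value is worth tuning to the file at hand.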
To answer the second part, do this:
reader = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, chunksize=1000)
df = next(reader)
This skips the first 1000 rows and returns an iterator that yields the rest of the file in 1000-row chunks, so taking the first chunk gives you rows 1000-2000. It is unclear whether you need the end points included, but you can adjust the numbers to get what you want.
Method 2
In addition to EdChum's answer, I find the nrows argument useful; it simply sets the number of rows you want to import. You then get a DataFrame holding nrows rows of the file rather than an iterator. It works with skiprows too.
df = pd.read_csv('matrix.txt',sep=',', header = None, skiprows= 1000, nrows=1000)
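To make the row arithmetic concrete, here is a minimal sketch that wraps the skiprows and nrows combination in a helper and checks it against a generated file. The helper name `read_rows`, the file name `matrix_demo.txt`, and the demo data are hypothetical; the sketch also assumes a headerless file, as in the question.

```python
import os
import tempfile

import pandas as pd


def read_rows(path, start, stop):
    """Return rows start..stop-1 (0-based) of a headerless CSV as a DataFrame."""
    # skiprows=start discards the first `start` lines;
    # nrows then limits how many of the remaining lines are parsed.
    return pd.read_csv(path, sep=",", header=None,
                       skiprows=start, nrows=stop - start)


# Hypothetical demo file: 3000 rows whose first column is the 0-based row number.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "matrix_demo.txt")
with open(path, "w") as f:
    for i in range(3000):
        f.write(f"{i},{i * 2}\n")

df = read_rows(path, 1000, 2000)
print(df.shape)                       # (1000, 2)
print(df.iloc[0, 0], df.iloc[-1, 0])  # 1000 1999
```

Note that the resulting DataFrame's index restarts at 0, so the original row numbers are not preserved unless you add them back yourself.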
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.