Python Pandas: How to read only the first n rows of a CSV file?

I have a very large data set and I can’t afford to read the entire data set in. So I’m thinking of reading only one chunk of it to train on, but I have no idea how to do it. Any thoughts would be appreciated.

Answers:


Method 1

If you only want to read the first 999,999 (non-header) rows:

read_csv(..., nrows=999999)
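As a small illustration of nrows, here is a sketch that uses an in-memory CSV in place of a large file. The names csv_text and first_rows and the column names are made up for the example.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a large file: a header line plus 10 data rows.
csv_text = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(10))

# nrows counts data rows only; the header line is parsed separately.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(first_rows)
```

With a real file you would pass the file path instead of the io.StringIO object.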

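If you only want to read rows 1,000,000 … 1,999,999 (and keep the header line as column names):

read_csv(..., skiprows=range(1, 1000000), nrows=1000000)

Note that an integer skiprows=1000000 skips the first 1,000,000 lines of the file, including the header line, so the first line actually read would be taken as the column names unless you pass header=None.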

nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files

skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file
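The two forms of skiprows behave differently with respect to the header line. The sketch below compares them on a small in-memory CSV; csv_text, by_count, by_index and the column names are illustrative only.

```python
import io
import pandas as pd

# Line 0 is the header; lines 1-10 hold ids 0-9.
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(10))

# Integer form: skips the first 4 lines of the file, header included,
# so the column names have to be supplied by hand.
by_count = pd.read_csv(io.StringIO(csv_text), skiprows=4, nrows=3,
                       header=None, names=["id", "value"])

# List-like form: skips lines 1-3 (0-indexed) and keeps line 0 as the header.
by_index = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 4), nrows=3)

print(by_count)
print(by_index)
```

Both calls return the same three data rows; only the list-like form picks up the column names from the file itself.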

For large files, you’ll probably also want to use chunksize:

chunksize : int, default None
Return TextFileReader object for iteration
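A minimal sketch of iterating with chunksize, again on an in-memory CSV rather than a real large file. The names csv_text, chunk_sizes and total are assumptions for the example.

```python
import io
import pandas as pd

# Header line plus 10 data rows.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(10))

chunk_sizes = []
total = 0
# With chunksize set, read_csv returns an iterator that yields DataFrames
# of at most that many rows, so only one chunk is in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total += int(chunk["value"].sum())

print(chunk_sizes, total)
```

Each chunk can be processed (or used for incremental training) and then discarded before the next one is read.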

pandas.io.parsers.read_csv documentation

Method 2

If you do not want to use Pandas, you can use the csv module and limit the rows read by breaking out of the iteration.

For example, I needed to read a list of CSV files (stored in the list result) to get only the header of each.

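import csv

# result is the list of CSV file names
for name in result:
    path = './' + name
    # 'ANSI' is only recognized on Windows; elsewhere use e.g. 'cp1252' or 'utf-8'
    with open(path, encoding='ANSI', newline='') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for count, row in enumerate(csv_reader):
            if count:  # stop once the first (header) row has been handled
                break
            print(row)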
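The same idea can be generalized to the first n rows with itertools.islice instead of a manual counter. This is only a sketch; read_first_rows is a hypothetical helper, not part of the csv module.

```python
import csv
import io
from itertools import islice

def read_first_rows(csv_file, n):
    """Return the header and at most n data rows from an open CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader, None)      # None if the file is empty
    rows = list(islice(reader, n))   # stops reading after n rows
    return header, rows

# In-memory stand-in for a file opened with open(path, newline='')
sample = io.StringIO("id,value\n0,a\n1,b\n2,c\n3,d\n")
header, rows = read_first_rows(sample, 2)
print(header, rows)
```

Because the reader is consumed lazily, the rest of the file is never parsed.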


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
