I have a very large data set and I can’t afford to read the entire data set in. So, I’m thinking of reading only one chunk of it to train but I have no idea how to do it. Any thought will be appreciated.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
If you only want to read the first 999,999 (non-header) rows:
read_csv(..., nrows=999999)
If you only want to read rows 1,000,000 … 1,999,999
read_csv(..., skiprows=1000000, nrows=999999)
nrows : int, default None Number of rows of file to read. Useful for
reading pieces of large files*
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file
and for large files, you’ll probably also want to use chunksize:
chunksize : int, default None
Return TextFileReader object for iteration
pandas.io.parsers.read_csv documentation
Method 2
If you do not want to use Pandas, you can use csv library and to limit row readed with interaction break.
For example, I needed to read a list of files stored in csvs list to get the only the header.
for csvs in result:
csvs = './'+csvs
with open(csvs,encoding='ANSI', newline='') as csv_file:
csv_reader = csv.reader(csv_file, delimiter=',')
count=0
for row in csv_reader:
if count:
break;
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0