How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?

I need to create a data frame by reading data from a file with the read_csv method. However, the separators are irregular: some columns are separated by tabs (\t), others by spaces. Moreover, some columns can be separated by two, three, or more spaces, or even by a combination of spaces and tabs (for example, three spaces, two tabs, and then one space).

Is there a way to tell pandas to treat these files properly?

By the way, I do not have this problem if I use plain Python. I use:

for line in open(file_name):
    fld = line.split()

And it works perfectly. It does not care whether there are 2 or 3 spaces between the fields, and even combinations of spaces and tabs cause no problems. Can pandas do the same?

Answers:

Method 1

From the documentation, you can use either a regex or delim_whitespace:

>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...     
'a\t  b\tc 1 2\n'
'd\t  e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
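
For readers who want to try this without creating a file, here is a minimal sketch that feeds the same kind of content through io.StringIO. The sample string is an assumption made up to mirror whitespace.csv above. Note that recent pandas releases deprecate delim_whitespace in favor of sep=r"\s+", so the regex form is probably the safer choice for new code.

```python
import io

import pandas as pd

# Hypothetical sample mirroring whitespace.csv: tabs and spaces are mixed.
raw = "a\t  b\tc 1 2\nd\t  e\tf 3 4\n"

# r"\s+" treats any run of spaces and/or tabs as a single separator.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")
print(df)
```

Both rows should come back with five fields each, regardless of how many spaces or tabs sit between them.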

Method 2

>>> pd.read_csv("whitespace.csv", header = None, sep = r"\s+|\t+|\s+\t+|\t+\s+")

would use any combination of any number of spaces and tabs as the separator.
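
Because \s already matches tabs as well as spaces, the longer alternation should behave the same as a plain \s+. The sketch below checks this with re.split on a single made-up line. It only approximates how pandas applies a regex separator, so treat it as a sanity check on the pattern rather than a description of pandas internals.

```python
import re

# Hypothetical line with a tab followed by spaces, a lone tab, and single spaces.
line = "a\t  b\tc 1 2"

long_pattern = r"\s+|\t+|\s+\t+|\t+\s+"
short_pattern = r"\s+"

# The first alternative (\s+) greedily consumes the whole run of whitespace,
# so the remaining alternatives never get a chance to match.
fields_long = re.split(long_pattern, line)
fields_short = re.split(short_pattern, line)

print(fields_long)
print(fields_long == fields_short)
```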

Method 3

Pandas has two csv readers, and only one is flexible regarding redundant leading white space:

pd.read_csv("whitespace.csv", skipinitialspace=True)

while the other (since deprecated and removed from recent pandas releases) is not:

pd.DataFrame.from_csv("whitespace.csv")

Neither is out-of-the-box flexible regarding trailing white space; see the answers with regular expressions. Avoid delim_whitespace, as it also allows just spaces (without , or \t) as separators.
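
A small sketch of the difference skipinitialspace makes, using a made-up comma-separated sample in which spaces follow the commas. The data and variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical comma-separated sample with spaces after the delimiters.
raw = "a, b,  c\n1, 2,  3\n"

# Default behaviour: the leading spaces stay part of the column names.
plain = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops whitespace that directly follows a delimiter.
trimmed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(plain.columns))
print(list(trimmed.columns))
```

Whitespace before a comma is not affected by this option, which is why the regex-based separators are still needed for trailing spaces.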

Method 4

The following separator matches a comma surrounded by any combination of zero or more spaces and tabs, so whitespace on either side of the comma is dropped:

pd.read_csv("whitespace.csv", header = None, sep = r"[ \t]*,[ \t]*")
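
As a rough illustration, the sketch below applies that separator to a made-up sample where spaces and tabs appear on both sides of the commas. Passing engine="python" is an assumption made here to select the regex-capable parser explicitly; pandas would otherwise fall back to it on its own and emit a warning.

```python
import io

import pandas as pd

# Hypothetical sample: whitespace before and after the commas.
raw = "a ,\tb,  c\n1 , 2 ,3\n"

# A comma with optional spaces/tabs on both sides acts as one separator.
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    sep=r"[ \t]*,[ \t]*",
    engine="python",  # regex separators need the Python parsing engine
)
print(df)
```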


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
