How do I read the following two-column data from a .dat file with pandas?
TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17
The column separator is at least two spaces.
I tried:
df = pd.read_table("test.dat", sep=r"\s+", usecols=['TIME', 'XGSM'])
print(df)
But it prints:
TIME XGSM
2004 6
2004 6
2004 6
2004 6
2004 6
Answers:
Method 1
With sep=r"\s+" every run of whitespace is a delimiter, so each data row splits into eight fields. Skip the header row and pick the fields you need by position with the usecols parameter:
import pandas as pd
from io import StringIO

temp = """TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""
# after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp),
                 sep=r"\s+",
                 skiprows=1,
                 usecols=[0, 7],
                 names=['TIME', 'XGSM'])
print(df)
   TIME  XGSM
0  2004     1
1  2004     5
2  2004     8
3  2004    11
4  2004    17
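To see where the positions 0 and 7 passed to usecols come from, it can help to split one data row by hand. This is a minimal sketch using only the standard re module; the row is taken from the question, and the two spaces before the last value are an assumption based on the stated separator.

```python
import re

# One data row from the question. The two spaces before the final value
# are assumed from "column separator is (at least) 2 spaces".
row = "2004 006 01 00 01 37 600  1"

# Splitting on any run of whitespace yields eight fields,
# so the year is field 0 and XGSM is field 7.
fields_any = re.split(r"\s+", row)
print(len(fields_any), fields_any[0], fields_any[7])

# Splitting on runs of two or more whitespace characters yields two fields.
fields_wide = re.split(r"\s{2,}", row)
print(fields_wide)
```

The second split is the behaviour you get from a regex separator that requires at least two whitespace characters.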
Edit:
To keep TIME as a single column, use a regex separator that matches two or more whitespace characters. Only the Python parser engine supports regex separators, so add engine='python' to avoid this warning:
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
import pandas as pd
from io import StringIO

temp = """TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""
# after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp), sep=r'\s{2,}', engine='python')
print(df)
                       TIME  XGSM
0  2004 006 01 00 01 37 600     1
1  2004 006 01 00 02 32 800     5
2  2004 006 01 00 03 28 000     8
3  2004 006 01 00 04 23 200    11
4  2004 006 01 00 05 18 400    17
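The comment in the code above says to swap StringIO(temp) for a file name once testing is done. A minimal sketch of that step follows; the temporary test.dat file is a hypothetical stand-in for the real data file, and the two-space column gap is assumed from the question.

```python
import os
import tempfile

import pandas as pd

# Sample rows from the question, written with two spaces between columns.
rows = [
    ("TIME", "XGSM"),
    ("2004 006 01 00 01 37 600", "1"),
    ("2004 006 01 00 02 32 800", "5"),
    ("2004 006 01 00 03 28 000", "8"),
    ("2004 006 01 00 04 23 200", "11"),
    ("2004 006 01 00 05 18 400", "17"),
]
text = "\n".join(f"{time}  {xgsm}" for time, xgsm in rows) + "\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for the real test.dat.
    path = os.path.join(tmp_dir, "test.dat")
    with open(path, "w") as handle:
        handle.write(text)

    # Pass the path directly; the regex separator needs the Python engine.
    df = pd.read_csv(path, sep=r"\s{2,}", engine="python")

print(df)
```

TIME stays a string column, so leading zeros such as 006 are preserved, while XGSM is parsed as integers.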
Method 2
You could also try pd.read_fwf(), which reads a table of fixed-width formatted lines into a DataFrame and, by default, infers the column boundaries from the whitespace in the data:
import pandas as pd
from io import StringIO

pd.read_fwf(StringIO("""TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""), usecols=["TIME", "XGSM"])
#   TIME  XGSM
#0  2004     1
#1  2004     5
#2  2004     8
#3  2004    11
#4  2004    17
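The output above keeps only the year in TIME, because the inferred boundaries also fall on the single spaces inside the timestamp. If the whole timestamp is wanted in one column, one option is to pass explicit boundaries through the colspecs parameter of read_fwf. This is a sketch under an assumed layout (TIME in characters 0 to 23, XGSM starting at character 26); the real file's widths may differ.

```python
from io import StringIO

import pandas as pd

times = [
    "2004 006 01 00 01 37 600",
    "2004 006 01 00 02 32 800",
    "2004 006 01 00 03 28 000",
    "2004 006 01 00 04 23 200",
    "2004 006 01 00 05 18 400",
]
values = [1, 5, 8, 11, 17]

# Build fixed-width lines: pad the first column to 26 characters so that
# XGSM always starts at character 26 (assumed layout, not from the source).
lines = ["TIME".ljust(26) + "XGSM"]
lines += [time.ljust(26) + str(value) for time, value in zip(times, values)]
data = "\n".join(lines)

# colspecs takes half-open (start, end) character intervals, one per column.
df = pd.read_fwf(StringIO(data), colspecs=[(0, 24), (26, 30)])
print(df)
```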
Method 3
I ran into the same problem when importing a file that contains a lot of white space. I solved it with:
pd.read_fwf(file_name)
If your file is a fixed-width text file, read_fwf may be the solution. It accepts a file name directly, so StringIO is not needed.
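As a rough illustration of calling read_fwf on a path, the sketch below writes the sample rows to a temporary file and reads it back with the default settings. The file name and the column alignment (XGSM starting at character 26) are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Sample rows from the question, padded so XGSM starts at character 26
# (assumed alignment).
lines = [
    "TIME".ljust(26) + "XGSM",
    "2004 006 01 00 01 37 600".ljust(26) + "1",
    "2004 006 01 00 02 32 800".ljust(26) + "5",
    "2004 006 01 00 03 28 000".ljust(26) + "8",
    "2004 006 01 00 04 23 200".ljust(26) + "11",
    "2004 006 01 00 05 18 400".ljust(26) + "17",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name standing in for the real data file.
    file_name = os.path.join(tmp_dir, "test.dat")
    with open(file_name, "w") as handle:
        handle.write("\n".join(lines) + "\n")

    # Column boundaries are inferred from the whitespace in the file.
    df = pd.read_fwf(file_name)

# Inspect which columns were inferred, then pick the named ones.
print(df.columns.tolist())
print(df[["TIME", "XGSM"]])
```

With the default inference the single spaces inside the timestamp should also act as column boundaries, so TIME holds only the year here; printing df.columns is a quick way to check what was detected before relying on it.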
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.