I would like to read several excel files from a directory into pandas and concatenate them into one big dataframe. I have not been able to figure it out though. I need some help with the for loop and building a concatenated dataframe:
Here is what I have so far:
import sys
import csv
import glob
import pandas as pd
# get data file names
path =r'C:DRODCL_rawdata_filesexcelfiles'
filenames = glob.glob(path + "/*.xlsx")
dfs = []
for df in dfs:
xl_file = pd.ExcelFile(filenames)
df=xl_file.parse('Sheet1')
dfs.concat(df, ignore_index=True)
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
As mentioned in the comments, one error you are making is that you are looping over an empty list.
Here is how I would do it, using an example of having 5 identical Excel files that are appended one after another.
(1) Imports:
import os import pandas as pd
(2) List files:
path = os.getcwd() files = os.listdir(path) files
Output:
['.DS_Store', '.ipynb_checkpoints', '.localized', 'Screen Shot 2013-12-28 at 7.15.45 PM.png', 'test1 2.xls', 'test1 3.xls', 'test1 4.xls', 'test1 5.xls', 'test1.xls', 'Untitled0.ipynb', 'Werewolf Modelling', '~$Random Numbers.xlsx']
(3) Pick out ‘xls’ files:
files_xls = [f for f in files if f[-3:] == 'xls'] files_xls
Output:
['test1 2.xls', 'test1 3.xls', 'test1 4.xls', 'test1 5.xls', 'test1.xls']
(4) Initialize empty dataframe:
df = pd.DataFrame()
(5) Loop over list of files to append to empty dataframe:
for f in files_xls:
data = pd.read_excel(f, 'Sheet1')
df = df.append(data)
(6) Enjoy your new dataframe. 🙂
df
Output:
Result Sample 0 a 1 1 b 2 2 c 3 3 d 4 4 e 5 5 f 6 6 g 7 7 h 8 8 i 9 9 j 10 0 a 1 1 b 2 2 c 3 3 d 4 4 e 5 5 f 6 6 g 7 7 h 8 8 i 9 9 j 10 0 a 1 1 b 2 2 c 3 3 d 4 4 e 5 5 f 6 6 g 7 7 h 8 8 i 9 9 j 10 0 a 1 1 b 2 2 c 3 3 d 4 4 e 5 5 f 6 6 g 7 7 h 8 8 i 9 9 j 10 0 a 1 1 b 2 2 c 3 3 d 4 4 e 5 5 f 6 6 g 7 7 h 8 8 i 9 9 j 10
Method 2
this works with python 2.x
be in the directory where the Excel files are
see http://pbpython.com/excel-file-combine.html
import numpy as np
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("*.xlsx"):
df = pd.read_excel(f)
all_data = all_data.append(df,ignore_index=True)
# now save the data frame
writer = pd.ExcelWriter('output.xlsx')
all_data.to_excel(writer,'sheet1')
writer.save()
Method 3
This can be done in this way:
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("/path/to/directory/*.xlsx"):
df = pd.read_excel(f)
all_data = all_data.append(df,ignore_index=True)
all_data.to_csv("new_combined_file.csv")
Method 4
#shortcut
import pandas as pd
from glob import glob
dfs=[]
for f in glob("data/*.xlsx"):
dfs.append(pd.read_excel(f))
df=pd.concat(dfs, ignore_index=True)
Method 5
I have multiple excel files and every file has a common id [every excel sheet has id column]. I tried in the following ways. I am not getting the correct data frame based on the id.
import pandas as pd
import os
path=os.getcwd()
path
files=os.listdir(path)
fil_xlsx=[f for f in files if f[-4:]=='xlsx']
df=pd.DataFrame()
for f in fil_xlsx:
data=pd.read_excel(f,'Sheet1')
df=df.append(data)
I am getting an empty data frame this way.
df=pd.DataFrame()
for f in fil_xlsx:
data=pd.read_excel(f,'Sheet1')
all1=pd.concat([data,df],ignore_index=True,join="inner")
Method 6
import pandas as pd
import os
os.chdir('...')
#read first file for column names
fdf= pd.read_excel("first_file.xlsx", sheet_name="sheet_name")
#create counter to segregate the different file's data
fdf["counter"]=1
nm= list(fdf)
c=2
#read first 1000 files
for i in os.listdir():
print(c)
if c<1001:
if "xlsx" in i:
df= pd.read_excel(i, sheet_name="sheet_name")
df["counter"]=c
if list(df)==nm:
fdf=fdf.append(df)
c+=1
else:
print("headers name not match")
else:
print("not xlsx")
fdf=fdf.reset_index(drop=True)
#relax
Method 7
import pandas as pd
import os
files = [file for file in os.listdir('./Salesfolder')]
all_month_sales= pd.DataFrame()
for file in files
df= pd.read_csv("./Salesfolder/"+file)
all_months_data=pd.concat([all_months_sales,df])
all_months_data.to_csv("all_data.csv",index=False)
You can go and read all your .xls files from folder (Salesfolder in my case) and same for your local path. Using iteration through whcih you can put them into empty data frame and you can concatnate your data frame to this . I have also exported to another csv for all months data into one csv file
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0