I need to compare two CSV files and write the differences to a third CSV file.
In my case, the first CSV, old.csv, is an old list of hashes, and the second CSV is the new list, which contains both the old and the new hashes.
Here is my code:
import csv
t1 = open('old.csv', 'r')
t2 = open('new.csv', 'r')
fileone = t1.readlines()
filetwo = t2.readlines()
t1.close()
t2.close()
outFile = open('update.csv', 'w')
x = 0
for i in fileone:
    if i != filetwo[x]:
        outFile.write(filetwo[x])
    x += 1
outFile.close()
The third file ends up as a copy of the old one rather than the update.
What’s wrong? I hope you can help me, many thanks!
PS: I don’t want to use diff.
Answers:
Method 1
The problem is that you are comparing each line in fileone to the line at the same position in filetwo. As soon as there is an extra line in one file, the lines will never be equal again. Try this:
with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
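For larger files, the "line not in fileone" test scans the whole list for every line of the new file. A minimal sketch of the same idea with a set for the lookup, assuming exact line matches are what you want. The helper name lines_only_in_new is made up for illustration and is not part of the original answer:

```python
def lines_only_in_new(old_lines, new_lines):
    """Return the lines of new_lines that do not occur in old_lines, in order."""
    seen = set(old_lines)  # set membership is O(1) on average
    return [line for line in new_lines if line not in seen]


# Small in-memory demonstration; with real files, pass t1.readlines()
# and t2.readlines() instead of these made-up lists.
old = ['aaa\n', 'bbb\n']
new = ['aaa\n', 'ccc\n', 'bbb\n', 'ddd\n']
print(lines_only_in_new(old, new))  # ['ccc\n', 'ddd\n']
```

The result can be written out with outFile.writelines(...). Note that a last line without a trailing newline will not match the same text followed by a newline.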
Method 2
You may find this package useful (csv-diff):
pip install csv-diff
Once installed, you can run it from the command line:
csv-diff one.csv two.csv --key=id
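To give an idea of what comparing by key means, here is a rough standard-library sketch. It is not how csv-diff is implemented; it only assumes that both files have a header row with a unique id column, as in the command above. The function name diff_by_key and the sample rows are hypothetical:

```python
import csv
import io


def diff_by_key(old_file, new_file, key='id'):
    """Compare two CSV file objects, matching rows on the key column."""
    old_rows = {row[key]: row for row in csv.DictReader(old_file)}
    new_rows = {row[key]: row for row in csv.DictReader(new_file)}
    return {
        # Keys that exist only in the new file
        'added': [new_rows[k] for k in new_rows if k not in old_rows],
        # Keys that exist only in the old file
        'removed': [old_rows[k] for k in old_rows if k not in new_rows],
        # Keys in both files whose rows differ (new version of the row)
        'changed': [new_rows[k] for k in new_rows
                    if k in old_rows and new_rows[k] != old_rows[k]],
    }


# Made-up sample data held in memory instead of real files
one = io.StringIO('id,name\n1,Cleo\n2,Pancakes\n')
two = io.StringIO('id,name\n1,Cleo\n2,Pancakes Jr\n3,Bailey\n')
result = diff_by_key(one, two)
print([row['id'] for row in result['added']])    # ['3']
print([row['id'] for row in result['changed']])  # ['2']
```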
Method 3
It feels natural to detect differences using sets.
#!/usr/bin/env python3
import sys
import argparse
import csv

def get_dataset(f):
    return set(map(tuple, csv.reader(f)))

def main(f1, f2, outfile, sorting_column):
    set1 = get_dataset(f1)
    set2 = get_dataset(f2)
    different = set1 ^ set2
    output = csv.writer(outfile)
    for row in sorted(different, key=lambda x: x[sorting_column], reverse=True):
        output.writerow(row)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', nargs=2, type=argparse.FileType('r'))
    parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'), default=sys.stdout)
    parser.add_argument('-sc', '--sorting-column', nargs='?', type=int, default=0)
    args = parser.parse_args()
    main(*args.infile, args.outfile, args.sorting_column)
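Note that set1 ^ set2 is the symmetric difference, so the output also contains rows that exist only in the first file. For the question as asked (rows that are new in the second file), a plain set difference may be closer to what is wanted. A small illustration with made-up rows:

```python
# Made-up rows, shaped like the tuples produced by get_dataset above
old_rows = {('a1',), ('b2',), ('c3',)}
new_rows = {('a1',), ('c3',), ('d4',)}

# Symmetric difference: rows present in exactly one of the two sets
either_side = old_rows ^ new_rows

# Plain difference: rows present in new_rows but not in old_rows
only_new = new_rows - old_rows

print(sorted(either_side))  # [('b2',), ('d4',)]
print(sorted(only_new))     # [('d4',)]
```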
Method 4
I assumed your new file was just like your old one, except that some lines were added in between the old ones. The old lines appear in the same order in both files.
Try this:
with open('old.csv', 'r') as t1:
    old_csv = t1.readlines()
with open('new.csv', 'r') as t2:
    new_csv = t2.readlines()

with open('update.csv', 'w') as out_file:
    line_in_new = 0
    line_in_old = 0
    while line_in_new < len(new_csv) and line_in_old < len(old_csv):
        if old_csv[line_in_old] != new_csv[line_in_new]:
            # Line exists only in the new file
            out_file.write(new_csv[line_in_new])
        else:
            # Same line in both files: advance in the old file too
            line_in_old += 1
        line_in_new += 1
    # Lines added after the last old line are new as well
    out_file.writelines(new_csv[line_in_new:])
- Note that I used the context manager with and some meaningful variable names, which makes the code instantly easier to understand. And you don’t need the csv package since you’re not using any of its functionality here.
- About your code: you were almost doing the right thing, except that you must not go to the next line in your old CSV unless you are reading the same thing in both CSVs. That is to say, if you find a new line, keep reading the new file until you stumble upon an old one, and then you’ll be able to continue reading.
UPDATE: This solution is not as pretty as Chris Mueller’s, which is perfect and very Pythonic for small files, but it only walks through each file once (keeping the idea of your original algorithm), so it can be better if you have larger files.
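If the files are too large to load with readlines(), the same two-pointer idea can be written over iterators. This is only a sketch under the same assumption as above (the old lines keep their order in the new file); the names iter_added_lines and write_update are made up here:

```python
def iter_added_lines(old_lines, new_lines):
    """Yield the lines of new_lines that were inserted between or after the old ones.

    Both arguments can be any iterables of lines, including open file objects.
    Assumes the old lines appear in new_lines in the same relative order.
    """
    old_iter = iter(old_lines)
    current_old = next(old_iter, None)  # None once the old lines are exhausted
    for line in new_lines:
        if line == current_old:
            # Same line in both inputs: advance in the old one too
            current_old = next(old_iter, None)
        else:
            yield line


def write_update(old_path, new_path, out_path):
    """Stream the added lines from new_path into out_path."""
    with open(old_path, 'r') as t1, open(new_path, 'r') as t2, \
            open(out_path, 'w') as out_file:
        out_file.writelines(iter_added_lines(t1, t2))


print(list(iter_added_lines(['a\n', 'b\n'], ['x\n', 'a\n', 'b\n', 'z\n'])))
# ['x\n', 'z\n']
```

With the file names from the question, the call would be write_update('old.csv', 'new.csv', 'update.csv').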
Method 5
with open('first_test_pipe.csv', 'r') as t1, open('validation.csv', 'r') as t2:
    filecoming = t1.readlines()
    filevalidation = t2.readlines()

for i in range(0, len(filevalidation)):
    # Compare the fields of line i in both files as sets
    coming_set = set(filecoming[i].replace("\n", "").split(","))
    validation_set = set(filevalidation[i].replace("\n", "").split(","))
    # Fields present in both lines
    ReceivedDataList = list(validation_set.intersection(coming_set))
    # Fields present in only one of the two lines
    NotReceivedDataList = list(coming_set.union(validation_set) -
                               coming_set.intersection(validation_set))
    print(NotReceivedDataList)
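Splitting on commas by hand breaks on quoted fields, and indexing filecoming[i] fails if that file is the shorter one. A sketch of the same per-line comparison using csv.reader and zip, which stops at the shorter file. The function name field_differences and the sample data are hypothetical:

```python
import csv
import io


def field_differences(coming_file, validation_file):
    """For each pair of rows, return the fields found in only one of the two rows."""
    rows = zip(csv.reader(coming_file), csv.reader(validation_file))
    # a ^ b is the symmetric difference, i.e. (a | b) - (a & b)
    return [sorted(set(coming) ^ set(validation)) for coming, validation in rows]


# Made-up data held in memory; the second row has a quoted field with a comma
coming = io.StringIO('a,b,c\n"x,1",y\n')
validation = io.StringIO('a,b,d\n"x,1",y\n')
print(field_differences(coming, validation))  # [['c', 'd'], []]
```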
Method 6
import pandas as pd
import sys

def dataframe_difference(df1: pd.DataFrame, df2: pd.DataFrame, csvfile, which=None):
    """Find rows which are different between two DataFrames."""
    comparison_df = df1.merge(
        df2,
        indicator=True,
        how='outer'
    )
    if which is None:
        diff_df = comparison_df[comparison_df['_merge'] != 'both']
    else:
        diff_df = comparison_df[comparison_df['_merge'] == which]
    diff_df.to_csv(csvfile)
    return diff_df

if __name__ == '__main__':
    df1 = pd.read_csv(sys.argv[1], sep=',')
    df2 = pd.read_csv(sys.argv[2], sep=',')
    # sort_values returns a new DataFrame, so the result has to be assigned
    df1 = df1.sort_values(sys.argv[3])
    df2 = df2.sort_values(sys.argv[3])
    #df1.drop(df1.columns[list(map(int, sys.argv[4].split()))], axis = 1, inplace = True)
    #df2.drop(df2.columns[list(map(int, sys.argv[4].split()))], axis = 1, inplace = True)
    # The output file is always the last argument, with or without the columns-to-drop argument
    print(dataframe_difference(df1, df2, sys.argv[-1]))
To use it, run:
python3 script.py file1.csv file2.csv some_common_header_to_sort_each_file output_file.csv
In case you want to drop any columns from the comparison, uncomment the df.drop lines and run
python3 script.py file1.csv file2.csv some_common_header_to_sort_each_file "x y z..." output_file.csv
where x, y, z are the numbers of the columns to drop; the index starts from 0.
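To see what the _merge column from indicator=True looks like, here is a small self-contained illustration with made-up data (the column name hash is an assumption based on the question). Filtering on 'right_only' corresponds to passing which='right_only' to dataframe_difference above:

```python
import pandas as pd

# Made-up stand-ins for the old and new CSV contents
old = pd.DataFrame({'hash': ['a1', 'b2', 'c3']})
new = pd.DataFrame({'hash': ['a1', 'c3', 'd4']})

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
merged = old.merge(new, how='outer', indicator=True)

# Rows that exist only in the new data
only_in_new = merged[merged['_merge'] == 'right_only']
print(only_in_new['hash'].tolist())  # ['d4']
```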
Method 7
Thanks to @vishnoo-rath’s comment under one of the answers above for providing a link to the following page:
https://github.com/simonw/csv-diff#as-a-python-library
from csv_diff import load_csv, compare
diff = compare(
    load_csv(open("one.csv"), key="id"),
    load_csv(open("two.csv"), key="id")
)
print(diff)
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.