I’m trying to read a text file line by line and delete every line that contains a specific string (in this case ‘bad’ or ‘naughty’).
The code I wrote looks like this:
infile = file('./oldfile.txt')
newopen = open('./newfile.txt', 'w')
for line in infile:
    if 'bad' in line:
        line = line.replace('.', '')
    if 'naughty' in line:
        line = line.replace('.', '')
    else:
        newopen.write(line)
newopen.close()
But it doesn’t work as intended.
One important thing: if the content of the text file is like this:
good baby
bad boy
good boy
normal boy
I don’t want the output to have empty lines.
So not like this:
good baby

good boy
normal boy
but like this:
good baby
good boy
normal boy
What should I change in the code above?
Answers:
Method 1
You can make your code simpler and more readable like this:
bad_words = ['bad', 'naughty']
with open('oldfile.txt') as oldfile, open('newfile.txt', 'w') as newfile:
    for line in oldfile:
        if not any(bad_word in line for bad_word in bad_words):
            newfile.write(line)
This uses a context manager and any().
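As a rough sketch of how this filter behaves on the example from the question, the same test can be wrapped in a small function and run on in-memory lines. The name `filter_lines` and the sample list (assuming one phrase per line in the input file) are illustrative assumptions, not part of the original answer.

```python
def filter_lines(lines, bad_words):
    """Return only the lines that contain none of the bad words."""
    return [line for line in lines
            if not any(bad_word in line for bad_word in bad_words)]


# Assumed sample input: one phrase per line, as in the question.
sample = ["good baby\n", "bad boy\n", "good boy\n", "normal boy\n"]
kept = filter_lines(sample, ["bad", "naughty"])
print("".join(kept), end="")
```

Because the unwanted line is skipped entirely instead of being replaced with an empty string, no empty line is left behind in the output.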
Method 2
Instead of calling replace, you could simply not write the line to the new file.
for line in infile:
    if 'bad' not in line and 'naughty' not in line:
        newopen.write(line)
Method 3
I have used this to remove unwanted words from text files:
bad_words = ['abc', 'def', 'ghi', 'jkl']
with open('List of words.txt') as badfile, open('Clean list of words.txt', 'w') as cleanfile:
    for line in badfile:
        clean = True
        for word in bad_words:
            if word in line:
                clean = False
        if clean:
            cleanfile.write(line)
Or to do the same for all files in a directory:
import os

bad_words = ['abc', 'def', 'ghi', 'jkl']

for root, dirs, files in os.walk(".", topdown=True):
    for file in files:
        if file.endswith('.txt'):
            # Join with root so that files in subdirectories can be opened
            source = os.path.join(root, file)
            target = os.path.join(root, 'clean ' + file)
            with open(source) as filename, open(target, 'w') as cleanfile:
                for line in filename:
                    clean = True
                    for word in bad_words:
                        if word in line:
                            clean = False
                    if clean:
                        cleanfile.write(line)
I’m sure there must be a more elegant way to do it, but this did what I wanted it to.
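One possibly tidier variant, offered only as a sketch: `pathlib` can find the files and `any()` can replace the flag variable. The function name `clean_tree`, its parameters, and the choice to skip files that already carry the `clean ` prefix are assumptions added here for illustration.

```python
from pathlib import Path

BAD_WORDS = ['abc', 'def', 'ghi', 'jkl']


def clean_tree(top, bad_words=BAD_WORDS, prefix='clean '):
    """Write a filtered copy of every .txt file found under `top`."""
    # Collect the paths first so that files created below are not picked up,
    # and skip outputs from earlier runs (assumed naming: prefix + name).
    sources = [p for p in Path(top).rglob('*.txt')
               if not p.name.startswith(prefix)]
    written = []
    for src in sources:
        dst = src.with_name(prefix + src.name)
        with src.open() as infile, dst.open('w') as outfile:
            for line in infile:
                if not any(word in line for word in bad_words):
                    outfile.write(line)
        written.append(dst)
    return written
```

Calling `clean_tree('.')` would process the current directory and its subdirectories, returning the paths of the files it wrote.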
Method 4
Today I needed to accomplish a similar task, so I wrote up a gist based on some research I did.
I hope that someone will find this useful!
import os

# Note: written for Python 2 (raw_input and print statements).
os.system('cls' if os.name == 'nt' else 'clear')
oldfile = raw_input('{*} Enter the file (with extension) you would like to strip domains from: ')
newfile = raw_input('{*} Enter the name of the file (with extension) you would like me to save: ')
emailDomains = ['windstream.net', 'mail.com', 'google.com', 'web.de', 'email', 'yandex.ru', 'ymail', 'mail.eu', 'mail.bg', 'comcast.net', 'yahoo', 'Yahoo', 'gmail', 'Gmail', 'GMAIL', 'hotmail', 'comcast', 'bellsouth.net', 'verizon.net', 'att.net', 'roadrunner.com', 'charter.net', 'mail.ru', '@live', 'icloud', '@aol', 'facebook', 'outlook', 'myspace', 'rocketmail']
print "\n[*] This script will remove records that contain the following strings: \n\n", emailDomains
raw_input("\n[!] Press any key to start...\n")
linecounter = 0
with open(oldfile) as oFile, open(newfile, 'w') as nFile:
    for line in oFile:
        if not any(domain in line for domain in emailDomains):
            nFile.write(line)
            linecounter = linecounter + 1
            print '[*] - {%s} Writing verified record to %s ---{ %s' % (linecounter, newfile, line)
print '[*] === COMPLETE === [*]'
print '[*] %s was saved' % newfile
print '[*] There are %s records in your saved file.' % linecounter
Link to Gist: emailStripper.py
Best,
Az
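A list like the one in this method repeats some entries in several capitalizations ('gmail', 'Gmail', 'GMAIL'). One way to avoid that, sketched here under assumptions, is to compare case-insensitively. The helper name `contains_any`, the shortened domain list, and the sample records are all made up for illustration.

```python
def contains_any(line, needles):
    """Case-insensitive substring test against a collection of needles."""
    lowered = line.casefold()
    return any(needle.casefold() in lowered for needle in needles)


# Hypothetical shortened list: one spelling per provider is enough here.
email_domains = ['yahoo', 'gmail', 'hotmail']

# Hypothetical sample records.
records = ['alice@GMail.com\n', 'bob@example.org\n', 'carol@YAHOO.com\n']
kept = [r for r in records if not contains_any(r, email_domains)]
print(kept)
```

The trade-off is that every capitalization is now filtered, which may or may not be what a given data set needs.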
Method 5
Use the python-textops package:
from textops import *
'oldfile.txt' | cat() | grepv('bad') | tofile('newfile.txt')
Method 6
The else is only connected to the last if. You want elif:
if 'bad' in line:
    pass
elif 'naughty' in line:
    pass
else:
    newopen.write(line)
Also note that I removed the line substitution, as you don’t write those lines anyway.
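To see the difference concretely, here is a small sketch that mimics both control-flow shapes on a single line of text. The function names `original_logic` and `elif_logic` are hypothetical, and a list stands in for the output file.

```python
def original_logic(line):
    """Two separate ifs; the else belongs only to the second one."""
    written = []
    if 'bad' in line:
        line = line.replace('.', '')
    if 'naughty' in line:
        line = line.replace('.', '')
    else:
        written.append(line)  # runs whenever 'naughty' is absent
    return written


def elif_logic(line):
    """A single if/elif/else chain; the else runs only if no test matched."""
    written = []
    if 'bad' in line:
        pass
    elif 'naughty' in line:
        pass
    else:
        written.append(line)
    return written


print(original_logic('bad boy\n'))  # the 'bad' line is still written
print(elif_logic('bad boy\n'))      # nothing is written
```

With two separate `if` statements, a line containing 'bad' but not 'naughty' falls through to the `else` and is written anyway.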
Method 7
Try this; it works well.
import re
text = "this is bad!"
text = re.sub(r"(.*?)bad(.*?)$|\n", "", text)
text = re.sub(r"(.*?)naughty(.*?)$|\n", "", text)
print(text)
Method 8
For my 23 MB test file, a regex was a little quicker than the accepted answer, though there isn’t much in it.
import re

bad_words = ['bad', 'naughty']
# (?:...) is a non-capturing group; re.escape guards against regex metacharacters
regex = f"^.*(?:{'|'.join(map(re.escape, bad_words))}).*\n"
subst = ""
with open('oldfile.txt') as oldfile:
    lines = oldfile.read()
# flags must be passed by keyword: the fourth positional argument is count
result = re.sub(regex, subst, lines, flags=re.MULTILINE)
with open('newfile.txt', 'w') as newfile:
    newfile.write(result)
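One detail worth checking when adapting a regex approach like this: the signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag passed as the fourth positional argument is treated as `count`. The sketch below uses a made-up in-memory sample in place of a file to show the difference.

```python
import re

# Hypothetical sample text standing in for the file contents.
text = "good baby\nbad boy\ngood boy\nnaughty boy\nnormal boy\n"
pattern = r"^.*(?:bad|naughty).*\n"

# Passing re.MULTILINE in the fourth position sets count, not flags.
# This call is equivalent: count is 8 and "^" only matches at the very start.
as_count = re.sub(pattern, "", text, count=int(re.MULTILINE))

# Passing it by keyword applies the flag, so "^" matches at every line start.
as_flags = re.sub(pattern, "", text, flags=re.MULTILINE)

print(as_count == text)
print(as_flags)
```

Without the flag applied, the pattern can only match on the first line, so the text comes back unchanged here.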
Method 9
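to_skip = ("bad", "naughty")
out_handle = open("testout", "w")
with open("testin", "r") as handle:
    for line in handle:
        # split() with no argument also strips the trailing newline,
        # so a bad word at the end of a line is still detected
        if set(line.split()).intersection(to_skip):
            continue
        out_handle.write(line)
out_handle.close()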
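Unlike the substring tests in the other methods, this one compares whole whitespace-separated words. A short sketch of the difference follows; the helper names and sample lines are hypothetical.

```python
to_skip = {"bad", "naughty"}


def has_bad_substring(line):
    """True if any unwanted string occurs anywhere in the line."""
    return any(word in line for word in to_skip)


def has_bad_word(line):
    """True if any whitespace-separated token equals an unwanted word."""
    # split() with no argument splits on any whitespace, newline included
    return not to_skip.isdisjoint(line.split())


line = "a shiny badge\n"
print(has_bad_substring(line))  # 'bad' occurs inside 'badge'
print(has_bad_word(line))       # but 'badge' is not the word 'bad'
```

Word matching avoids false hits such as 'badge', but it would miss 'bad,' or 'Bad' unless punctuation and case are normalized first.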
Method 10
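bad_words = ['doc:', 'strickland:', '\n']
with open('linetest.txt') as oldfile, open('linetestnew.txt', 'w') as newfile:
    for line in oldfile:
        if not any(bad_word in line for bad_word in bad_words):
            newfile.write(line)
The \n is an escape sequence for a newline. Note that every line read from a file normally ends with \n, so with '\n' in bad_words this check rejects nearly every line; to drop only blank lines, test line.strip() instead.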
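If the goal is to drop blank lines as well as lines containing unwanted strings, one possible sketch checks `line.strip()` separately, since a plain substring check for a newline matches any line that ends with one. The helper name `keep_line` and the sample lines are assumptions for illustration.

```python
def keep_line(line, bad_words):
    """Keep a line only if it is non-blank and contains no bad word."""
    return bool(line.strip()) and not any(w in line for w in bad_words)


# Hypothetical sample input, including an empty and a whitespace-only line.
lines = ["doc: hello\n", "\n", "marty: hi\n", "strickland: slacker\n", "   \n"]
kept = [line for line in lines if keep_line(line, ["doc:", "strickland:"])]
print(kept)
```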
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
