I am looping over the lines of a text file with a Python script.
I want to search for an img tag in each line and return the tag as text.
When I call re.match(pattern, line), it returns an _sre.SRE_Match object.
How do I get it to return a string?
import sys
import string
import re
f = open("sample.txt", 'r' )
l = open('writetest.txt', 'w')
count = 1
for line in f:
    line = line.rstrip()
    imgtag = re.match(r'<img.*?>',line)
    print("yo it's a {}".format(imgtag))
When run it prints:
yo it's a None
yo it's a None
yo it's a None
yo it's a <_sre.SRE_Match object at 0x7fd4ea90e578>
yo it's a None
yo it's a <_sre.SRE_Match object at 0x7fd4ea90e578>
yo it's a None
yo it's a <_sre.SRE_Match object at 0x7fd4ea90e578>
yo it's a <_sre.SRE_Match object at 0x7fd4ea90e5e0>
yo it's a None
yo it's a None
Answers:
Method 1
Call group(0) on the match object that re.match returns. For example:
imgtag = re.match(r'<img.*?>', line).group(0)
Edit:
You also might be better off doing something like
imgtag = re.match(r'<img.*?>', line)
if imgtag:
    print("yo it's a {}".format(imgtag.group(0)))
to eliminate all the Nones.
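The None check matters because calling .group(0) directly on the result of re.match raises AttributeError whenever a line has no match. Here is a minimal sketch of that behaviour; the helper name img_tag_or_none and the sample strings are made up for illustration and are not part of the original question.

```python
import re

def img_tag_or_none(line):
    """Return the img tag at the start of `line` as a string, or None."""
    m = re.match(r'<img.*?>', line)
    return m.group(0) if m else None

# Chaining .group(0) onto re.match fails when nothing matches,
# because re.match returned None rather than a match object.
try:
    re.match(r'<img.*?>', 'no tag here').group(0)
    chained_failed = False
except AttributeError:
    chained_failed = True

print(img_tag_or_none('<img src="x.gif"> tail'))
print(img_tag_or_none('plain text'))
print(chained_failed)
```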
Method 2
Use imgtag.group(0) or imgtag.group(). Both return the entire match as a string. Your pattern has no capturing groups, so there is nothing else to retrieve.
http://docs.python.org/release/2.5.2/lib/match-objects.html
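As a rough illustration of what group() returns with and without capturing groups, consider the sketch below. The sample line and the second pattern (with a capturing group around the src value) are assumptions added for the example; the original question only uses the pattern without groups.

```python
import re

# Hypothetical input line, for illustration only.
line = '<img src="logo.png" alt="Logo">'

m = re.match(r'<img.*?>', line)
whole_default = m.group()     # no argument means group 0
whole_explicit = m.group(0)   # the entire match

# Assumed variant of the pattern with one capturing group:
# group(1) then returns only the captured src value.
m2 = re.match(r'<img\s+src="([^"]*)".*?>', line)
src = m2.group(1)

print(whole_default)
print(src)
```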
Method 3
Note that re.match(pattern, string, flags=0) only returns a match at the beginning of the string. If you want to locate a match anywhere in the string, use re.search(pattern, string, flags=0) instead (https://docs.python.org/3/library/re.html). This scans the string and returns a match object for the first match. You can then extract the matching string with match_object.group(0), as the other answers suggest.
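To make the difference concrete, here is a small sketch comparing the two calls on the same input. The sample line is invented for the example; it places the img tag in the middle of the string, which is the case where re.match returns None.

```python
import re

# Hypothetical sample line: the img tag is not at the start of the string.
line = 'Some text <img src="a.png"> more text'
pattern = r'<img.*?>'

at_start = re.match(pattern, line)    # only tries to match at position 0
anywhere = re.search(pattern, line)   # scans forward for the first match

match_result = at_start
search_result = anywhere.group(0) if anywhere else None

print(match_result)
print(search_result)
```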
Method 4
Considering there might be several img tags I would recommend re.findall:
import re
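with open("sample.txt", 'r') as f_in, open('writetest.txt', 'w') as f_out:
    for line in f_in:
        # findall returns every non-overlapping match in the line as a string
        for img in re.findall(r'<img[^>]+>', line):
            # write() behaves the same in Python 2 and 3, unlike "print >> f_out"
            f_out.write("yo it's a {}\n".format(img))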
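For a quick check of how this behaves on a single line with more than one tag, the sketch below may help. The sample line is made up, and the use of re.finditer to recover match positions is an addition for illustration rather than something the answer above proposes.

```python
import re

# Hypothetical line containing two img tags.
line = '<p><img src="a.png"> and <img src="b.png" alt="B"></p>'
pattern = r'<img[^>]+>'

# findall returns the matched strings directly (the pattern has no groups).
tags = re.findall(pattern, line)

# finditer yields match objects, useful if the position is also needed.
starts = [m.start() for m in re.finditer(pattern, line)]

for tag in tags:
    print("yo it's a {}".format(tag))
```

Note that if the pattern contained capturing groups, re.findall would return the group contents instead of the whole tag.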
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.