I am very new to Python.
I need to match all cases with one regular expression and do a replacement. This is a sample substring and the desired result:
<cross_sell id="123" sell_type="456"> --> <cross_sell>
I am trying to do this in my code:
myString = re.sub(r'<[A-Za-z0-9_]+(\s[A-Za-z0-9_="\s]+)', "", myString)
Instead of replacing only what comes after <cross_sell, it replaces the whole match and just returns '>'.
Is there a way for re.sub to replace only the capturing group instead of the entire pattern?
Answers:
Method 1
You can use substitution groups:
>>> import re
>>> my_string = '<cross_sell id="123" sell_type="456"> --> <cross_sell>'
>>> re.sub(r'(<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)', r"\1", my_string)
'<cross_sell> --> <cross_sell>'
Notice that I put the first group (the one you want to keep) in parentheses and then kept it in the output by using the "\1" backreference (first group) in the replacement string.
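As a rough sketch of the same idea, the replacement can also be written with the \g<1> spelling of the backreference or with a function that returns the group to keep. The names TAG_WITH_ATTRS, strip_attributes and strip_attributes_with_callable are made up for this illustration; the pattern is the one from the answer above.

```python
import re

# Group 1 is the tag name to keep; group 2 is the attribute text to drop.
TAG_WITH_ATTRS = re.compile(r'(<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)')

def strip_attributes(text):
    """Hypothetical helper: replace each match with its first group only."""
    # \g<1> means the same as \1 but stays unambiguous if a digit follows it.
    return TAG_WITH_ATTRS.sub(r'\g<1>', text)

def strip_attributes_with_callable(text):
    """Same result, using a function that receives each match object."""
    return TAG_WITH_ATTRS.sub(lambda m: m.group(1), text)

sample = '<cross_sell id="123" sell_type="456">'
print(strip_attributes(sample))
print(strip_attributes_with_callable(sample))
```

Either way, re.sub still replaces the whole match; the trick is that the replacement puts the part you want to keep back in.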
Method 2
You can use a group reference to keep the first word and a negated character class to match the rest of the tag up to the closing >:
>>> s = '<cross_sell id="123" sell_type="456">'
>>> re.sub(r'(\w+)[^>]+', r'\1', s)
'<cross_sell>'
\w is equivalent to [A-Za-z0-9_] for ASCII input.
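One caveat, shown here as a small sketch: in Python 3, \w in a str pattern also matches non-ASCII word characters unless the re.ASCII flag is passed. The variable names below are only for illustration.

```python
import re

ascii_word = "cross_sell_123"
accented = "\u00e9"  # the single character e with an acute accent

# For plain ASCII text both spellings match the same strings.
same_on_ascii = (
    re.fullmatch(r'\w+', ascii_word) is not None
    and re.fullmatch(r'[A-Za-z0-9_]+', ascii_word) is not None
)

# By default, \w on a str pattern is Unicode-aware.
unicode_match = re.fullmatch(r'\w+', accented) is not None

# With re.ASCII, \w is restricted to [A-Za-z0-9_].
ascii_only_match = re.fullmatch(r'\w+', accented, flags=re.ASCII) is not None

print(same_on_ascii, unicode_match, ascii_only_match)
```

If tag names are known to be ASCII only, passing re.ASCII or writing the explicit character class keeps the behaviour the same as in the first answer.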
Method 3
Since the input data is XML, you’d better parse it with an XML parser.
Built-in xml.etree.ElementTree is one option:
>>> import xml.etree.ElementTree as ET
>>> data = '<cross_sell id="123" sell_type="456"></cross_sell>'
>>> cross_sell = ET.fromstring(data)
>>> cross_sell.attrib = {}
>>> ET.tostring(cross_sell, encoding='unicode')
'<cross_sell />'
lxml.etree is another option.
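Staying with the built-in xml.etree.ElementTree, the same approach can be extended to clear the attributes of every element in a document. This is only a sketch: strip_all_attributes is a hypothetical helper, and the sample document with its offers wrapper is invented for the example.

```python
import xml.etree.ElementTree as ET

def strip_all_attributes(xml_text):
    """Hypothetical helper: parse xml_text and drop the attributes of every element."""
    root = ET.fromstring(xml_text)
    # iter() yields the root element followed by all of its descendants.
    for element in root.iter():
        element.attrib.clear()
    # encoding='unicode' makes tostring return a str instead of bytes.
    return ET.tostring(root, encoding='unicode')

# Made-up sample data, not taken from the question.
data = '<offers region="eu"><cross_sell id="123" sell_type="456">Widget</cross_sell></offers>'
print(strip_all_attributes(data))
```

Unlike the regex versions, this requires well-formed XML with matching closing tags, so it will raise a parse error on a bare opening tag such as the one in the question.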
Method 4
The code below was tested under Python 3.6 and does not use a group:
import re

test = '<cross_sell id="123" sell_type="456">'
# \s+ also consumes the whitespace in front of each attribute
resp = re.sub(r'\s+\w+="\w+"', r'', test)
print(resp)
# prints: <cross_sell>
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.