RegEx Get string between two strings that has line breaks

I have the following text (formatted just like below):

<td scope="row" align="left">
      My Class: TEST DATA<br>
      Test Section: <br>
      MY SECTION<br>
      MY SECTION 2<br>
    </td>

I’m attempting to get the text between “Test Section:” and the </td> that follows MY SECTION 2.

I’ve tried several attempts with different RegEx patterns and I’m not getting anywhere.

If I do:

(?<=Test)(.*?)(?=<br)

Then I get the correct response of:

' Section: '

But, if I do

(?<=Test)(.*?)(?=</td>)

I get no results. The result should be:

MY SECTION
MY SECTION 2

I’ve tried using RegEx Multiline as well with no results.

Any help would be appreciated.

If it matters I’m coding in Python 2.7.

If something is not clear, or you need more info, please let me know.

Answers:

Method 1

Use the re.S (re.DOTALL) flag, or prepend (?s) to the regular expression, so that . matches every character, including newlines.

Without the flag, . does not match a newline.

(?s)(?<=Test)(.*?)(?=</td>)

Example:

>>> s = '''<td scope="row" align="left">
...       My Class: TEST DATA<br>
...       Test Section: <br>
...       MY SECTION<br>
...       MY SECTION 2<br>
...     </td>'''
>>>
>>> import re
>>> re.findall('(?<=Test)(.*?)(?=</td>)', s)  # without flags
[]
>>> re.findall('(?<=Test)(.*?)(?=</td>)', s, flags=re.S)
[' Section: <br>\n      MY SECTION<br>\n      MY SECTION 2<br>\n    ']
>>> re.findall('(?s)(?<=Test)(.*?)(?=</td>)', s)
[' Section: <br>\n      MY SECTION<br>\n      MY SECTION 2<br>\n    ']
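
The match above still carries the rest of the label and the <br> tags, while the question asks for only the section names. A minimal sketch of one way to get there, assuming the lookbehind is widened to the full label `Test Section: <br>` and that entries are always separated by `<br>`; `extract_sections` is a hypothetical helper name, not something from the original answer:

```python
import re

s = '''<td scope="row" align="left">
      My Class: TEST DATA<br>
      Test Section: <br>
      MY SECTION<br>
      MY SECTION 2<br>
    </td>'''


def extract_sections(html):
    """Return the entries between 'Test Section: <br>' and '</td>' (hypothetical helper)."""
    # (?s) lets . match newlines, so the lazy group can span several lines.
    match = re.search(r'(?s)(?<=Test Section: <br>)(.*?)(?=</td>)', html)
    if match is None:
        return []
    # Split on <br>, trim surrounding whitespace, and drop empty pieces.
    return [part.strip() for part in match.group(1).split('<br>') if part.strip()]


sections = extract_sections(s)
print(sections)
```

For anything beyond a small, predictable fragment like this one, an HTML parser is usually a more robust choice than a regular expression.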

Method 2

Get the matched group from index 1

Test Section:([\S\s]*)</td>

Note: change the last part of the pattern (here </td>) to suit your input.

sample code:

import re
# [\S\s] matches any character, including newlines, so re.DOTALL is not needed.
p = re.compile(ur'Test Section:([\S\s]*)</td>', re.MULTILINE)
test_str = u"..."

re.findall(p, test_str)
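
The re.MULTILINE flag in the sample is not what makes this pattern span lines, which is also why trying Multiline alone did not help in the question. A small sketch of the difference, using a shortened, made-up input string rather than the original HTML:

```python
import re

# Shortened, hypothetical input for illustration.
text = 'Test Section:\nMY SECTION\n</td>'

# re.MULTILINE only changes ^ and $ so they match at every line boundary.
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE)

# Under re.MULTILINE, . still stops at newlines, so nothing is found.
dot_multiline = re.findall(r'Test Section:(.*)</td>', text, flags=re.MULTILINE)

# re.DOTALL is the flag that lets . cross newlines.
dot_dotall = re.findall(r'Test Section:(.*)</td>', text, flags=re.DOTALL)

# [\S\s] matches any character regardless of flags.
char_class = re.findall(r'Test Section:([\S\s]*)</td>', text)

print(line_starts, dot_multiline, dot_dotall, char_class)
```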

Pattern Explanation:

  Test Section:            'Test Section:'
  (                        group and capture to 1:
    [\S\s]*                any character of: non-whitespace (all
                             but \n, \r, \t, \f, and " "), whitespace
                             (\n, \r, \t, \f, and " ") (0 or more
                             times (matching the most amount
                             possible))
  )                        end of 1
  </td>                    '</td>'
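
Because `[\S\s]*` is greedy ("matching the most amount possible"), it runs to the last </td> in the input, not the nearest one. A short sketch of the difference, assuming a hypothetical input with two cells; adding `?` makes the quantifier lazy:

```python
import re

# Hypothetical input with two cells, to show where each quantifier stops.
html = ('<td>Test Section: <br>\nA<br>\n</td>'
        '<td>Other: <br>\nB<br>\n</td>')

# Greedy: captures everything up to the LAST </td>.
greedy = re.findall(r'Test Section:([\S\s]*)</td>', html)

# Lazy: captures only up to the FIRST </td> after the label.
lazy = re.findall(r'Test Section:([\S\s]*?)</td>', html)

print(greedy)
print(lazy)
```

With a single cell, as in the question, both forms return the same text.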


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
