Title |
Pattern Title
|
Expression |
<(?:[^"']+?|.+?(?:"|').*?(?:"|')?.*?)*?> |
Description |
This matches every tag in a string, which makes it useful for stripping HTML or XML tags to get the plain text. It works with attributes that include JavaScript or "<>".
It will match all of these:
<hr size="3"
noshade
color="#000000"
align="left">
<p style="margin-top:0px;margin-bottom:0px"
align="center"><font face="Times New Roman"
size="5"><b>UNITED STATES</b></font></p>
<input type=button onclick='if(n.value>5)do_this();'> not this <br>
<input type=button onclick="n>5?a():b();" value=test> not this <br>
<input type=button onclick="n>5?a(\"OK\"):b('Not Ok');" value=test> not this <br>
<input type=button onclick='n>5' value=test onmouseover="n<5&&n>8" onmouseout='if(n>5)alert(\'True\');else alert("False")'> not this <br>
|
Matches |
<input type=button onclick='n>5' value=test onmouseover="n<5&&n>8" onm |
Non-Matches |
haven't found any exceptions yet |
Author |
Toby Henderson
|
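As a rough illustration of the stripping use case in the description, the sketch below applies the expression with Python's `re.sub`. Python is an assumption here (the only code on this page is a VBScript call), and `strip_tags` is a hypothetical helper name. The sketch is deliberately exercised only on short, well-formed input.

```python
import re

# The expression from this page, unchanged.
TAG_RE = re.compile(r"""<(?:[^"']+?|.+?(?:"|').*?(?:"|')?.*?)*?>""")


def strip_tags(text: str) -> str:
    """Remove everything the expression treats as a tag (hypothetical helper)."""
    return TAG_RE.sub("", text)


# A plain tag pair from the description.
print(strip_tags("<b>UNITED STATES</b>"))  # UNITED STATES

# An attribute whose quoted value contains '>'.
sample = "<input type=button onclick='if(n.value>5)do_this();'> not this <br>"
print(repr(strip_tags(sample)))  # ' not this '
```

The second call shows the point of the longer expression: the `>` inside the quoted `onclick` value does not end the match early.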
Title: Memory Peak
Name: Dave S
Date: 1/25/2006 10:06:34 AM
Comment:
This regexp choked on a string containing the 'less-than' character as part of invalid HTML. As in: 1 is < 2.
Everything following the < character causes greedy validation, and with a long string (748 characters), this regular expression caused CPU usage to peak and remain at 100%. This problem happened consistently (i.e. EVERY TIME that string was passed through the regex). I tracked down the problem to THIS regex with a Microsoft Tech agent who studied the memory dump produced by Windows and IIS. The memory dump pointed to this line:
isHTML = objRegExp.Test(str)
This indicates that the .Test method (in VBScript) of the regular expression object would choke on the 748-character-long string containing the 'less-than' character.
Obviously, in valid HTML that should be written as: 1 is &lt; 2. But many users don't know proper HTML entities. I've reverted to <[^>]+> for the time being.
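A minimal sketch of the fallback pattern mentioned above, written in Python rather than the VBScript used in the original report (an assumption; `strip_tags_simple` is a hypothetical helper name). Because `<[^>]+>` has a single negated character class and no nested repetition, a failed attempt is abandoned after one scan of the remaining text instead of being retried in many different ways.

```python
import re

# The simpler fallback quoted in the comment above.
SIMPLE_TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_simple(text: str) -> str:
    """Strip tags with the fallback pattern (hypothetical helper)."""
    return SIMPLE_TAG_RE.sub("", text)


# A stray '<' with no later '>' is left untouched, even in a long string.
long_text = "1 is < 2. " + "x" * 748
print(strip_tags_simple(long_text) == long_text)  # True

# With the entity written out, only the real tags are removed.
print(strip_tags_simple("<p>1 is &lt; 2.</p>"))  # 1 is &lt; 2.

# Caveat: a stray '<' followed later by a real tag is swallowed up to that tag's '>'.
print(repr(strip_tags_simple("<p>1 is < 2.</p>")))  # '1 is '
```

The last call is worth noting: the fallback avoids the runaway matching, but it still treats a bare `<` as the start of a tag whenever a `>` appears later in the text.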
Title: best one so
Name: manit chanthavong
Date: 11/3/2005 6:42:57 PM
Comment:
Looked for an RE to strip HTML tags from a document. This is the best one I've seen.
Title: Very good
Name: Simon Cann
Date: 10/3/2005 11:34:30 AM
Comment:
Well done for a great expression, it's just what I needed.
Title: RE:Half right, half wrong
Name: Toby Henderson
Date: 4/5/2005 6:11:32 AM
Comment:
Gideon, you are correct that those are not valid HTML tags. But seeing that they are meant to be tags, I would want them captured. I'm not testing for validity; I just want to find every tag in a document to do something with them.
Title: Half right, half wrong
Name: Gideon Engelberth
Date: 4/4/2005 11:27:53 AM
Comment:
This expression may not give false negatives (because it allows things inside tags), but it definitely gives false positives. Two examples of matches that, as far as I know, should not match are:
<tag attr="test>
<tag attr="test'>
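The two examples above can be checked with a short Python sketch (Python and the helper name `is_whole_tag` are assumptions, not part of the original discussion). It tests whether the expression accepts each string in its entirety, including the unterminated or mismatched quote.

```python
import re

# The expression under discussion, unchanged.
TAG_RE = re.compile(r"""<(?:[^"']+?|.+?(?:"|').*?(?:"|')?.*?)*?>""")


def is_whole_tag(text: str) -> bool:
    """True if the expression matches the whole string (hypothetical helper)."""
    return TAG_RE.fullmatch(text) is not None


samples = [
    '<tag attr="test>',    # quote opened but never closed
    "<tag attr=\"test'>",  # opened with " but closed with '
]
for s in samples:
    print(repr(s), is_whole_tag(s))  # both report True
```

Both strings are accepted because the closing quote in the expression is optional and either quote character may stand in for the other.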