Wednesday, August 6, 2008

Getting rid of HTML tags in a String

Today I found a post proposing a simple regular-expression trick to strip all HTML tags from a String:

String noHTMLString = htmlString.replaceAll("\\<.*?>","");

All excited, and test-infected, I hurried to write a simple test:

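@Test
public void testMinoreMaggiore() {
    // Italian for: "3 is less than ( < ) 4 and greater than ( > ) 1"
    // The inner double quotes must be escaped for the literal to compile.
    String instance = "<a href=\"...\">3 è minore ( < ) di 4 e maggiore ( > ) di 1</a>";
    String expResult = "3 è minore ( < ) di 4 e maggiore ( > ) di 1";
    String result = instance.replaceAll("\\<.*?>","");
    assertEquals(expResult, result);
}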

Not unexpectedly, the test failed. I couldn't help posting it to the author, hoping he wouldn't see me as an insufferable know-it-all but as a mere proofreader...
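
To make the failure concrete, here is a minimal sketch of what the pattern does to that kind of input. The class name TagStripDemo and both method names are made up for illustration, and the sample text is an English rendering of the test string, to keep the source file plain ASCII. The second method is only an assumption about how the pattern could be tightened, not something from the original post: it treats a '<' as the start of a tag only when a letter (optionally preceded by '/') follows it.

```java
public class TagStripDemo {

    // The pattern from the post: lazily match from any '<' to the next '>'.
    // A bare '<' in the text is therefore treated as the start of a tag.
    public static String stripNaive(String html) {
        return html.replaceAll("\\<.*?>", "");
    }

    // Hypothetical tighter variant (an assumption, not from the post):
    // '<', an optional '/', a letter, then anything except '<' or '>',
    // and finally the closing '>'.
    public static String stripTagLike(String html) {
        return html.replaceAll("</?[A-Za-z][^<>]*>", "");
    }

    public static void main(String[] args) {
        String input =
            "<a href=\"...\">3 is less than ( < ) 4 and greater than ( > ) 1</a>";

        // The span from the bare '<' to the bare '>' is swallowed as one "tag".
        System.out.println(stripNaive(input));   // 3 is less than (  ) 1

        // Only the <a ...> and </a> tags are removed.
        System.out.println(stripTagLike(input)); // 3 is less than ( < ) 4 and greater than ( > ) 1
    }
}
```

Even the tighter pattern is still just a regular expression: it can be fooled by a '>' inside an attribute value, by comments, or by script blocks, so for pages of unknown quality a real HTML parser is probably the safer choice.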

3 comments:

David Peterson said...

Erm, hang on, shouldn't you be using ampersand notation (&lt; and &gt;) for angle brackets within the text of an element?

David Peterson said...

This comment has been removed by the author.

Unknown said...

Actually one should, but the world outside isn't perfect and when I wrote the test I was thinking of stripping all HTML off a page I didn't build... luck favours the prepared :-)