Regex to remove html tags

May 15, 2020

I was working on a problem that required some string data cleanup. The strings held categorical survey responses (satisfied, dissatisfied, very satisfied, and so on), but with HTML tags embedded in them, for example:

<img src="https://organization123.surveycompany.com/CP/Graphic.php?IM=ABC" style="width: 41px; height: 39px;"></img><br>Very Satisfied

I was only interested in the user rating and wanted to get rid of all the HTML tags. While looking for a way to do this, I came across some very helpful Stack Overflow posts that used regular expressions to remove HTML tags. I chose one such regex and it worked like a charm:

<[^<]+?>

I know the basics of regular expressions, but I still didn’t understand the two symbols towards the end of this expression. So I searched again to find out what the expression was doing and found a great website where you can build an expression, see the meaning of each component, and test it.

Here is my breakdown of how this regular expression matches HTML tags:

Character	Meaning
<	Matches the character “<”
[^<]	Negated set: matches any single character that is not in the set, here anything except “<”
+	Matches one or more of the preceding token
?	When placed immediately after a quantifier (*, +, ?, or {}), makes that quantifier non-greedy, so it matches as few characters as possible
>	Matches the character “>”
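As a quick check of this breakdown, here is a minimal sketch using Python's built-in re module. It applies the pattern to the sample string from earlier, then compares it with a greedy variant on a made-up input ("<b>a > b</b>" is an assumption for illustration, not from the survey data) to show where the trailing ? makes a difference.

```python
import re

# Pattern from the article: "<", one or more non-"<" characters (lazy), then ">".
TAG_LAZY = re.compile(r"<[^<]+?>")
# Same pattern with a greedy "+", for comparison only.
TAG_GREEDY = re.compile(r"<[^<]+>")

# The sample survey string shown earlier in the post.
sample = (
    '<img src="https://organization123.surveycompany.com/CP/Graphic.php?IM=ABC" '
    'style="width: 41px; height: 39px;"></img><br>Very Satisfied'
)

cleaned = TAG_LAZY.sub("", sample)
print(cleaned)  # Very Satisfied

# Hypothetical input where the visible text itself contains ">".
tricky = "<b>a > b</b>"
lazy_result = TAG_LAZY.sub("", tricky)
greedy_result = TAG_GREEDY.sub("", tricky)
print(repr(lazy_result))    # 'a > b'
print(repr(greedy_result))  # ' b'
```

On the survey string both variants give the same result. The lazy version only behaves differently when a ">" appears in the text between tags, where the greedy version runs on to the last ">" it can reach before the next "<".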

I was using Python, and the data was in a pandas DataFrame, so I used pandas.Series.str.replace to perform the whole operation:

# Remove all HTML tags from the surveyAnswer column of dataframe df.
# Pass regex=True explicitly: since pandas 2.0 the default is regex=False,
# which would treat the pattern as a literal string.
df["surveyAnswer"] = df["surveyAnswer"].str.replace(r'<[^<]+?>', '', regex=True)
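To try the same call end to end, here is a small self-contained sketch. The dataframe contents and the example.com URL are made up for illustration, and the .str.strip() step is an addition that is not part of the original one-liner; it just drops any whitespace left around the rating. regex=True is passed explicitly because pandas 2.0 and later default to regex=False.

```python
import pandas as pd

# Hypothetical survey data; only the column name comes from the post.
df = pd.DataFrame(
    {
        "surveyAnswer": [
            '<img src="https://example.com/a.png"></img><br>Very Satisfied',
            "<br>Dissatisfied",
            "Satisfied",
        ]
    }
)

# Strip every HTML tag, then trim surrounding whitespace (extra step, optional).
df["surveyAnswer"] = (
    df["surveyAnswer"]
    .str.replace(r"<[^<]+?>", "", regex=True)
    .str.strip()
)

print(df["surveyAnswer"].tolist())
# ['Very Satisfied', 'Dissatisfied', 'Satisfied']
```

Rows without any tags pass through unchanged, so the same call can be run safely on a column that is only partly polluted with markup.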

Avinash Tripathi
