Friday, 10 October 2014

Useful commands when parsing HTML from the command line

These are commands I use all the time when working with HTML files from the command line:

Replacing <BR> tags with actual newlines:

sed 's_<[bB][rR][^>]*>_\n_g'
Note that this command works with:
  • Uppercase and lowercase tags, e.g. <BR> or <br>
  • Tags with additional attributes (which is nonsense, but it happens), e.g. <br class="something">
  • Tags with a space and a closing slash, e.g. <br />
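
As a quick check of the three cases above, here is a minimal sketch that wraps the command in a shell function and feeds it one line containing each variant. The function name br_to_newline is my own placeholder, and the sketch assumes GNU sed, which turns \n in the replacement into a newline; other sed implementations may not.

```shell
# Hypothetical wrapper around the command above (the name is an assumption).
# Assumes GNU sed: \n in the replacement is written out as a newline.
br_to_newline() {
  sed 's_<[bB][rR][^>]*>_\n_g'
}

# One input line with an uppercase tag, a tag with an attribute,
# and a tag with a space and closing slash.
printf 'one<BR>two<br class="x">three<br />four\n' | br_to_newline
```

With GNU sed this should print the four words on four separate lines.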

Replacing &nbsp; with spaces

sed 's_&nbsp;_ _g'
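
Some pages write the same non-breaking space as a numeric entity, &#160; or &#xA0;, instead of &nbsp;. If that turns up in your files, a possible extension is sketched below; the function name nbsp_to_space is a placeholder of mine, not part of the original command.

```shell
# Hypothetical helper (the name is an assumption).
# Replaces the named entity and both numeric spellings with a plain space.
nbsp_to_space() {
  sed -e 's_&nbsp;_ _g' \
      -e 's_&#160;_ _g' \
      -e 's_&#[xX][aA]0;_ _g'
}

printf 'a&nbsp;b&#160;c&#xA0;d\n' | nbsp_to_space   # prints: a b c d
```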

Remove leading whitespace from each line

sed 's_^[\t ]*__g'
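
One caveat: reading \t as a tab inside a bracket expression is a GNU sed extension. In strict POSIX bracket expressions the backslash is literal, so [\t ] would match a backslash, the letter t, or a space. A more portable sketch, assuming your sed supports POSIX character classes, uses [[:blank:]] (space or tab) instead. The function name is a placeholder of mine.

```shell
# Hypothetical portable variant (the name is an assumption).
# [[:blank:]] is the POSIX class for space and tab.
strip_leading_blanks() {
  sed 's_^[[:blank:]]*__'
}

# First line starts with a tab and two spaces, second line has no indent.
printf '\t  indented\nplain\n' | strip_leading_blanks
```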

Remove all remaining HTML tags

sed 's_<[^>]*>__g'
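
Because sed works line by line, this only removes tags that open and close on the same line. If a tag is split across lines, one possible workaround is to read the whole input into the pattern space first and then strip the tags. This is a sketch that assumes GNU sed (for the one-line script syntax with braces and semicolons) and input small enough to hold in memory; the function name is a placeholder of mine.

```shell
# Hypothetical variant for tags that span several lines (the name is an assumption).
# :a         label
# $!{N;ba}   while not on the last line, append the next line and loop
# s_..._     then strip tags from the accumulated text; [^>] also matches newlines
strip_tags_multiline() {
  sed ':a;$!{N;ba};s_<[^>]*>__g'
}

# The opening <a ...> tag is split across two lines.
printf '<a\nhref="x">link</a>\n' | strip_tags_multiline   # prints: link
```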

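You can then chain all these commands together. Keep them as separate sed processes in a pipe: the newlines created by the first command only count as line starts for the commands that come after it.
sed 's_<[bB][rR][^>]*>_\n_g' file.html | sed 's_&nbsp;_ _g' | sed 's_^[\t ]*__g' | sed 's_<[^>]*>__g'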
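
If you run this often, the pipeline can be wrapped in a small shell function. The sketch below assumes GNU sed (for \n in the replacement and \t inside brackets); the name html_to_text is a placeholder of mine. It reads the files given as arguments, or standard input when there are none.

```shell
# Hypothetical wrapper around the pipeline (the name is an assumption).
# Assumes GNU sed. "$@" passes file names to the first sed; with no
# arguments it reads standard input.
html_to_text() {
  sed 's_<[bB][rR][^>]*>_\n_g' "$@" |   # <br> variants to newlines
    sed 's_&nbsp;_ _g' |                # &nbsp; to plain spaces
    sed 's_^[\t ]*__g' |                # strip leading tabs and spaces
    sed 's_<[^>]*>__g'                  # drop any remaining tags
}

printf '  <p>Hello&nbsp;world<br />  second line</p>\n' | html_to_text
```

Usage is then either html_to_text file.html or some_command | html_to_text.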