Parsing HTML in Bash

I have a process where I need to copy all the images from a web page. I used to run this process with xmllint, which will process an XML or HTML file and print out the entries you specify. But when my server host provider upgraded their systems, they didn’t include xmllint. So I had to find another way to extract a list of images from an HTML page. It turns out you can do this in Bash.

The read Statement

You may not think Bash can parse data files, but it can with some clever thinking. Bash, like other UNIX shells before it, can parse lines one at a time from a file via the built-in read statement.

By default, the read statement scans a line of data and splits it into fields. Usually, read splits fields using spaces and tabs, with newlines ending each line, but you can change this behavior by setting the Internal Field Separator (IFS) value and the end-of-line delimiter (-d).
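As a quick illustration of those two settings, here is a minimal sketch that is not part of my original process. The record string, the colon and semicolon separators, and the variable names name and count are all made up for the demo.

```shell
#!/bin/bash
# Hypothetical sample data: fields separated by ':' and records ended by ';'
record='alice:42;bob:7;'

# Setting IFS on the same line applies it to this one read only.
# -d ';' tells read to stop at the first semicolon instead of at a newline.
IFS=':' read -d ';' name count <<< "$record"

echo "name=$name count=$count"
```

Only the first record is consumed here. Calling read again on the same input stream would pick up the next one.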

To parse an HTML file using read, set the IFS to a greater-than symbol (>) and the delimiter to a less-than symbol (<). Each time Bash scans a line, it parses up to the next < (the start of an HTML tag) then splits that data at each > (the end of an HTML tag). This sample code takes a line of input and splits the data into the TAG and VALUE variables:

local IFS='>'
read -d '<' TAG VALUE

Let’s explore how this works. Consider this simple HTML file:

<img src="logo.png"
alt="My logo" />
<p>some text</p>

The first time read parses this file, it stops at the first < symbol. Since < is the first character of this sample input, that means Bash finds an empty string. The resulting TAG and VALUE strings are also empty. But that’s fine for my use case.

The next time Bash reads the input, it gets img src="logo.png"↲alt="My logo" />↲ with a newline right before the alt, and stops before the < symbol on the next line. Then read splits the line at the > symbol, which leaves TAG with img src="logo.png"↲alt="My logo" / and VALUE with just a newline.

The third time read parses the HTML file, it gets p>some text. Bash splits the string at the >, resulting in TAG containing p and VALUE containing some text.
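To watch those three passes happen, you can run something like the following sketch. It holds the sample HTML in a variable (html and trace are names I picked for the demo) and uses the %q format of Bash's printf so that empty strings and embedded newlines show up visibly in the output.

```shell
#!/bin/bash
# The sample HTML from above, held in a variable for the demo
html='<img src="logo.png"
alt="My logo" />
<p>some text</p>'

# Read three times from the same input and show what each pass captures.
# %q prints values in shell-quoted form, so a newline appears as \n
# inside $'...' and an empty string appears as ''.
trace=$(
  printf '%s\n' "$html" | {
    for pass in 1 2 3; do
      IFS='>' read -d '<' TAG VALUE
      printf 'pass %d: TAG=%q VALUE=%q\n' "$pass" "$TAG" "$VALUE"
    done
  }
)

echo "$trace"
```

The first pass should report two empty strings, the second a TAG that spans two lines, and the third the p tag with its text.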

A Simple Parser

Now that you understand how to use read, it’s easy to parse a longer HTML file with Bash. Start with a Bash function called xmlgetnext to parse the data using read, since you’ll be doing this again and again in the script. I named my function xmlgetnext to remind me this is a replacement for the Linux xmllint program, but I could have just as easily named it htmlgetnext.

xmlgetnext () {
  local IFS='>'
  read -d '<' TAG VALUE
}

Now call that xmlgetnext function to parse the HTML file. This is my complete htmltags script:

#!/bin/bash
# print a list of all html tags

xmlgetnext () {
  local IFS='>'
  read -d '<' TAG VALUE
}

# $TAG is deliberately unquoted so echo prints each tag on a single line
cat "$1" | while xmlgetnext ; do echo $TAG ; done

The last line is the key. It loops through the file using xmlgetnext to parse the HTML, and prints out only the TAG entries. Because $TAG is unquoted, the shell splits it into words using the standard field separators and echo joins those words with single spaces. That means any lines like img src="logo.png"↲alt="My logo" / that contain a newline get printed on a single line, as img src="logo.png" alt="My logo" /.
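If you want to see the difference for yourself, this small sketch compares the unquoted and quoted forms. The variable names unquoted and quoted are only for the demo.

```shell
#!/bin/bash
# A TAG value with an embedded newline, as read captures it from the sample
TAG='img src="logo.png"
alt="My logo" /'

# Unquoted: the shell splits on spaces, tabs and newlines,
# then echo joins the resulting words with single spaces.
unquoted=$(echo $TAG)

# Quoted: the value is passed to echo as one word, newline intact.
quoted=$(echo "$TAG")

echo "$unquoted"
echo "$quoted"
```

One caveat with the unquoted form: the shell also expands wildcard characters such as * in unquoted values, so a tag that happens to contain one may print differently than expected.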

To fetch just the list of images, I run the output of this script through grep to only print the lines that have an img tag at the start of the line.
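Here is a rough sketch of that last step, written so it runs on its own. The sample page in the html variable is made up, and the images variable is just a convenient place to hold the result. With the htmltags script saved to a file, the equivalent would be piping its output through the same grep.

```shell
#!/bin/bash
xmlgetnext () {
  local IFS='>'
  read -d '<' TAG VALUE
}

# Hypothetical sample page for the demo
html='<html><body>
<img src="logo.png"
alt="My logo" />
<p>some text</p>
<img src="photo.jpg" alt="A photo">
</body></html>'

# Print every tag on one line, then keep only the lines
# that start with "img" followed by a space.
images=$(printf '%s\n' "$html" |
  while xmlgetnext ; do echo $TAG ; done |
  grep '^img ')

echo "$images"
```

The pattern is case-sensitive, so a page that writes its tags as IMG would need grep -i instead.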