HTML Extraction

September 19, 2013

(this is a document in progress)

Getting Set Up

There’s no one way or best way to extract text from HTML, and there’s no best scripting language. Here I’m going to demonstrate some steps for HTML extraction using the PHP scripting language. I don’t think PHP is the most elegant or efficient solution, but it does have the benefit of being very widely used for a range of web applications, including WordPress. If I have to choose a first language to introduce to people, I usually choose PHP because of its versatility and wide adoption. For more specific tasks, other languages are probably better suited (particularly where existing libraries facilitate the work).

Getting started with PHP will depend on your operating system. For Macs (and most Unix-like operating systems), PHP comes pre-installed and ready to use (extra steps are required if you want to use it in conjunction with the bundled web server, but we don’t need that now). For Windows, a more common solution is to install a so-called WAMP stack like WampServer that includes PHP (as well as a web server and database application).

Once PHP is installed, we need an editor. Editors, like programming languages, are subject to strong preferences and impassioned perspectives. You can actually use any plain-text editor (like Notepad on Windows). On Macs I’d tend to recommend TextMate because it’s free and relatively lightweight but provides syntax colouring (though not built-in syntax checking) and a mechanism for running scripts easily. A cross-platform solution is Sublime Text. A much heavier solution (and one that I use in other contexts) is Eclipse with PHP that includes as-you-type syntax checking. There are a lot of other options.

Hello World!

There’s no better way to find out if your PHP is working than to create a quick script and run it. We can fire up TextMate and paste the Hello World! code below into the editor window. Because PHP is used primarily in a web context and it’s possible to interweave PHP and HTML code, the basic syntax starts with the opening PHP tag <?php, followed by one or more lines of code like echo "hello world!";, and ends with the closing tag ?>.
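A minimal version of such a script, put together from the three pieces just described, might look like this (the comment line is my own addition):

```php
<?php
// print a short greeting to confirm that PHP is working
echo "hello world!";
?>
```

Nothing here depends on a web server; the script just writes its text to the output window.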

One of the things I like most about TextMate is that I can run/test the script without even saving the file (beware – you could also lose your work this way!) by pressing ⇧⌘R (Shift-Command-R).

[Screenshot: Hello World!]

If the box in the forefront doesn’t display “hello world!” then you may have a syntax error. Error reports can be cryptic and even misleading, but look in particular for the line number and try doing web searches for parts of the error message (in particular, parts that look like generic labels and that aren’t specific to your code). Learning to code is partly about learning to debug.

A Closer Look at HTML

HTML is part of a family of hierarchical markup languages. In practice, that means that elements are (or should be) nested cleanly (no overlapping structures). For instance, a document contains a body, a body can contain multiple sections, sections can contain paragraphs, and paragraphs can contain formatted characters. You can view the source HTML of any web page in most browsers, but the hierarchical structure may not be evident. A more useful way of looking at the structure is to use web development tools that are built into most browsers, such as the Web Console in Firefox or DevTools in Chrome. Both of these allow you to inspect elements in what’s called the document object model or DOM (the hierarchical structure of objects in the page).

For instance, if you view the Universal Declaration of Human Rights and open the web console in Firefox, you can see something like the image below. The tree structure is displayed on the right and I’ve selected the <div id="content" ... element, which can be seen in its hierarchy starting at the top: html > body > div#main > div#content (where div is the tag and #main or #content indicates the id attribute of the tag). The actual element is shown in a dotted box on the left side.


Approaches to Extracting Content

Let’s imagine that we just wanted to extract the main content of this page, without the header, the navigation elements and the footer (if we wanted to count words in the main part of the document, it would be misleading to include words from these paratextual elements – the occurrences of Article, for instance, in the left navigation bar in the image above).

There are two main approaches to extracting elements that we want. The first is by creating what’s called a regular expression, or a very flexible search query that can identify the start and the end of a section of interest. Assuming the HTML source for the document above is fairly predictable, we may be able to search for the start of our main content section like this (the slashes indicate the start and end of the expression):

/<div id="content"/

In practice, HTML is rarely that clean or predictable (unless it’s dynamically generated the exact same way by the server), so we’d want something more forgiving like this:

/<div\s+id=["']content["']/

\s+ matches one or more whitespace characters (whitespace between a tag name and its attributes is meaningless in HTML source code, so you could have one space, 5 spaces or 5 lines and it would look the same rendered in the browser), and ["'] indicates that the id attribute value might be surrounded by single or double quotes (in fact, messier HTML may have neither, which would break our expression).
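To get a feel for how forgiving such an expression is, we can try it against a few variations with PHP’s preg_match function. This is only a sketch: the pattern /<div\s+id=["']content["']/ follows the description above, and the sample strings are made up for illustration rather than taken from the real page.

```php
<?php
// a forgiving pattern for the opening tag of the main content section
// (inside a single-quoted PHP string, \' stands for a literal single quote)
$pattern = '/<div\s+id=["\']content["\']/';

// made-up variations of the same opening tag
$samples = [
    'double quotes' => '<div id="content" class="main">',
    'single quotes' => "<div   id='content'>",
    'line break'    => "<div\n id=\"content\">",
    'no quotes'     => '<div id=content>',
];

$results = [];
foreach ($samples as $label => $sample) {
    // preg_match returns 1 when the pattern is found in the string
    $results[$label] = preg_match($pattern, $sample) === 1;
    echo $label, ': ', $results[$label] ? 'match' : 'no match', "\n";
}
```

The first three variations match, while the unquoted one doesn’t, which is exactly the kind of messiness that makes this approach fragile.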

The bigger issue may be that the <div> tag has an identifiable id attribute at the beginning, but the end is a generic </div> tag, the same one that might be used to close other elements at other points in the hierarchy. The typical solution for this would be to search for a unique, identifying marker after the tag closes (something like div#bannerfooter in our example document). That often works, but it can be messy and error-prone. (Having said that, I think regular expressions are an absolutely essential tool in the arsenal of someone doing text analysis; see the Understanding Regular Expressions tutorial from the Programming Historian for a great introduction.)
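Here’s a rough sketch of that start-marker and end-marker idea in PHP. The HTML string is a made-up stand-in that borrows only the ids mentioned above (main, content, bannerfooter); the real page is certainly messier, so treat this as an illustration of the technique and not as a working extractor for that page.

```php
<?php
// made-up HTML using the same ids as the example page (an assumption)
$html = <<<HTML
<div id="main">
  <div id="nav"><a href="#a1">Article 1</a></div>
  <div id="content">
    <h1>Preamble</h1>
    <div class="note"><p>Whereas recognition of the inherent dignity...</p></div>
  </div>
  <div id="bannerfooter">Footer text</div>
</div>
HTML;

// capture everything between the opening content tag and the footer marker;
// the s flag lets . match line breaks, and .*? stops at the first footer marker
$pattern = '/<div\s+id=["\']content["\'][^>]*>(.*?)<div\s+id=["\']bannerfooter["\']/s';

$text = '';
if (preg_match($pattern, $html, $matches)) {
    // strip_tags removes the leftover markup, including the stray closing </div>
    $text = trim(strip_tags($matches[1]));
    echo $text, "\n";
} else {
    echo "no match\n";
}
```

Notice that the captured chunk still contains the closing tag of the content section itself; we only get away with it because we throw all the tags out afterwards.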

The second approach to extracting content from HTML (or XML-based languages) is to work directly with the document object model (DOM). This allows us to cleanly and unequivocally capture specific elements from start to finish (a web browser can construct a proper DOM even from invalid HTML source code).
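To make the contrast with regular expressions concrete, here’s a small sketch that uses PHP’s built-in DOM extension (DOMDocument and DOMXPath), which is not the library used in the rest of this walkthrough. The HTML string is made up and deliberately sloppy (there’s an unclosed paragraph) to show that the parser still builds a usable tree.

```php
<?php
// made-up, deliberately sloppy HTML: the <p> is never closed
$source = '<html><body><div id="main">'
    . '<div id="nav">Article 1</div>'
    . '<div id="content"><h1>Preamble</h1><div><p>Whereas recognition</div></div>'
    . '<div id="bannerfooter">Footer</div>'
    . '</div></body></html>';

// keep parser warnings about invalid markup out of the output
libxml_use_internal_errors(true);

// parse the source into a DOM tree
$dom = new DOMDocument();
$dom->loadHTML($source);

// ask for the div whose id attribute is "content"
$xpath = new DOMXPath($dom);
$content = $xpath->query('//div[@id="content"]')->item(0);

// textContent is the text of the element and of everything nested inside it
$text = $content !== null ? $content->textContent : '';
echo $text, "\n";
```

There’s no need to guess where the section ends: the element knows its own boundaries, nested div and all.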

Do It Yourself, But Not All of It

To work with the DOM we will use an existing PHP library. Libraries are a crucial part of coding: you want to reinvent the wheel as seldom as possible, especially since in most cases someone else has designed and tested a pretty decent wheel.

There are actually a couple of ways of working with HTML using built-in PHP libraries such as SimpleXML and DOM. Many libraries are actually extensions of other libraries to add even more convenience and functionality. That’s the case with the library that we will use: Simple HTML DOM. Head over to the download page and click on the link for the latest version. You will want to download the zip file and unzip it.

Now go back to TextMate and create a new document. The first thing we’ll add is a line that imports our library file (note that the second line, preceded by a double slash, is a comment; it isn’t executed as code):
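Something along these lines should do it, assuming the library file from the zip is called simple_html_dom.php (check the name of the file you unzipped) and that it ends up in the same folder as our script:

```php
<?php
// import the Simple HTML DOM library
require_once 'simple_html_dom.php';
```

If the file can’t be found, PHP stops with an error on this line, which is a handy early warning that the library isn’t where the script expects it.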

Let’s save this file into a new directory: choose Save from the File menu in TextMate, navigate to a directory of your choice, create a new folder, and then save the file as something like html-extraction.php.

Now we want to make a copy of the library file we downloaded earlier in the same folder. An easy way to do this is to drag and drop the file from the old folder to the new folder while keeping your finger on the Option key (a plus sign should appear to indicate that you’re duplicating the file, not just moving it).

Now we can edit our new script to test if the library is working properly:
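A sketch of such a test is below. It assumes the library file is named simple_html_dom.php and sits next to the script, and the address in $url is only a placeholder that you’d replace with the page you’re interested in. The file_get_html function comes from the library; it fetches a page and parses it into a DOM object.

```php
<?php
// import the Simple HTML DOM library
require_once 'simple_html_dom.php';

// placeholder address: replace it with the page you want to extract from
$url = 'http://example.com/';

// fetch the page and parse it into a DOM object
$html = file_get_html($url);

if (!$html) {
    echo "Could not load $url\n";
} else {
    // printing the DOM object outputs its HTML source
    echo $html;
}
```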

Hit ⇧⌘R to test this and you should see the HTML source code scroll by until the end.

Extracting Text from HTML

Now all we need to do is to determine how to search for the element of interest and display or extract it. The manual for our Simple HTML DOM library is short but sweet, and it provides examples for searching for elements with IDs. In particular, we can do something like this:
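Here’s a sketch of searching by id. To keep it self-contained it parses a small made-up HTML string with the library’s str_get_html function instead of fetching the real page, and it assumes the library file is named simple_html_dom.php and sits next to the script.

```php
<?php
// import the Simple HTML DOM library
require_once 'simple_html_dom.php';

// a small made-up page standing in for the real document
$html = str_get_html(
    '<html><body><div id="main">'
    . '<div id="nav">Article 1</div>'
    . '<div id="content"><p>All human beings are born free and equal.</p></div>'
    . '</div></body></html>'
);

// find the first element whose id is "content"
$content = $html->find('#content', 0);

// outertext is the element's HTML, including its own opening and closing tags
echo $content->outertext, "\n";
```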

The find method returns an array (an ordered list of matching elements). The second argument (the zero) specifies that we just want the first of these elements (arrays in PHP are numbered starting from zero, so the first element is at index zero).

In the simplest case, we can now just dump the plain text of this content object:
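The library exposes the text of an element, with the tags removed, through its plaintext property. A sketch, again with a made-up HTML string in place of the real page and assuming the library file is simple_html_dom.php in the same folder:

```php
<?php
// import the Simple HTML DOM library
require_once 'simple_html_dom.php';

// a small made-up page standing in for the real document
$html = str_get_html(
    '<html><body><div id="content">'
    . '<h1>Article 1</h1><p>All human beings are born free and equal.</p>'
    . '</div></body></html>'
);

// grab the first element with the id "content"
$content = $html->find('#content', 0);

// plaintext is the text of the element with all of the tags stripped out
echo $content->plaintext, "\n";
```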

Or more likely we might want to save this to a file. Putting it all together we get something like this:
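The version below is a sketch of the whole thing and not a finished tool. The address in $url and the file name in $outputFile are placeholders of my own choosing, the library file is assumed to be simple_html_dom.php in the same folder, and the id content is the one from the example page, so all of these will need adjusting for other documents.

```php
<?php
// import the Simple HTML DOM library
require_once 'simple_html_dom.php';

// placeholder values: swap in the real page address and a file name you like
$url = 'http://example.com/';
$outputFile = 'extracted-text.txt';

// fetch the page and parse it into a DOM object
$html = file_get_html($url);

// look for the first element with the id "content" (null if there isn't one)
$content = $html ? $html->find('#content', 0) : null;

if ($content === null) {
    echo "Could not find the content element in $url\n";
} else {
    // write the plain text (no tags) to the output file
    file_put_contents($outputFile, $content->plaintext);
    echo "Saved text to $outputFile\n";
}
```

Checking for a missing page or a missing element before writing anything means we don’t end up with an empty file and no idea why.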


