Walking the DOM in JavaScript has been covered well and good. But I couldn’t find substantial help when it came to walking the DOM tree of HTML files in PHP. It seems all there is to DOM is limited to XML. It would be wonderful if we could keep aside all tags and the related regex mumbo-jumbo to parse tags away, and instead check the values, text contained, IDs of elements, their class names and style rules. The function presented here walks through all the elements presented in a HTML file accessing all the node attributes and node values.

Anyways, being the wonderful thing that DOM is, I cooked up a small little function to walk the tree of an HTML file in PHP and print the output to screen, and boy does it walk. Anyway, less talk and more …
<?php
function walkDom($node, $level = 0)
{
$indent = ”;
for ($i = 0; $i < $level; $i++)
$indent .= ‘&nbsp;&nbsp;’; //prettifying the output
if($node->nodeType != XML_TEXT_NODE)
{
echo $indent.'<b>’.$node->nodeName.'</b>’;
if( $node->nodeType == XML_ELEMENT_NODE )
{
$attributes = $node->attributes; // get all the attributes(eg: id, class …)
foreach($attributes as $attribute)
{
echo ‘, ‘.$attribute->name.’=’.$attribute->value;
// $attribute->name is usually one of these:
// src, type, rel, link, name, value, href, onclick,
// id, class, style, title
// You can add your custom handlers depending on the Attribute.
}
//if( strlen(trim($node->childNodes->item(0)->nodeValue)) > 0 && count($cNodes) == 1 )
//echo ‘<br>’.$indent.'(contains=’.$node->childNodes->item(0)->nodeValue.’)’; // do this to print the contents of a node, which maybe the link text, contents of div and so on.
}
echo ‘<br><br>’;
}
$cNodes = $node->childNodes;
if (count($cNodes) > 0)
{
$level++ ; // go one level deeper
foreach($cNodes as $cNode)
walkDom($cNode, $level); //so this is recursion my professor kept talkin’ about
$level = $level – 1; // come a level up, and had to do it this way or else wordpress would take away one dash. 😦
}
}
?>

Is that good?? Because here is how you use it:
<?php
$doc = new DOMDocument();
@$doc->loadHTMLFile(‘http://www.google.com&#8217;);
walkDom($doc);
?>

And this prints away the entire DOM of the read in file specified by the URL to loadHTMLFile. More information about the used constants and functions can be found here. And believe me, this works.

Advertisements