Walking the HTML DOM tree in PHP

Sun, 30th Dec, '07

Walking the DOM in JavaScript has been covered well. But I couldn’t find substantial help when it came to walking the DOM tree of HTML files in PHP; most of what is written about DOM in PHP is limited to XML. It would be wonderful if we could set aside the regex mumbo-jumbo for parsing tags and instead check the values, contained text, IDs, class names and style rules of elements directly. The function presented here walks through all the elements in an HTML file, accessing every node's attributes and values.

DOM being the wonderful thing that it is, I cooked up a small function to walk the tree of an HTML file in PHP and print the output to screen, and boy does it walk. Anyway, less talk and more …
<?php
function walkDom($node, $level = 0)
{
    $indent = '';
    for ($i = 0; $i < $level; $i++)
        $indent .= '&nbsp;&nbsp;'; // prettifying the output
    if ($node->nodeType != XML_TEXT_NODE)
    {
        echo $indent.'<b>'.$node->nodeName.'</b>';
        if ($node->nodeType == XML_ELEMENT_NODE)
        {
            $attributes = $node->attributes; // get all the attributes (e.g. id, class …)
            foreach ($attributes as $attribute)
            {
                echo ', '.$attribute->name.'='.$attribute->value;
                // $attribute->name is usually one of these:
                // src, type, rel, link, name, value, href, onclick,
                // id, class, style, title
                // You can add your custom handlers depending on the attribute.
            }
            // Uncomment to print the contents of a node with a single child,
            // which may be the link text, the contents of a div and so on:
            //if ($node->childNodes->length == 1 && strlen(trim($node->childNodes->item(0)->nodeValue)) > 0)
            //    echo '<br>'.$indent.'(contains='.$node->childNodes->item(0)->nodeValue.')';
        }
        echo '<br><br>';
    }
    // hasChildNodes() works on every node type, so there is no need to count() the child list
    if ($node->hasChildNodes())
    {
        $level++; // go one level deeper
        foreach ($node->childNodes as $cNode)
            walkDom($cNode, $level); // so this is the recursion my professor kept talkin' about
        $level = $level - 1; // come a level up; had to write it this way or else WordPress would take away one dash 😦
    }
}
?>
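The comment inside the attribute loop mentions custom handlers. Here is a minimal sketch of one, assuming all you want is the `href` and `src` values rather than a printout of every node. The name `collectLinks` and the sample markup are made up for illustration; they are not part of the function above.

```php
<?php
// Hypothetical helper, not part of the original post: the same recursion as
// walkDom(), but it collects href/src attributes instead of printing every node.
function collectLinks($node)
{
    $links = array();
    if ($node->nodeType == XML_ELEMENT_NODE) {
        foreach ($node->attributes as $attribute) {
            // The "custom handler": only react to the attributes we care about.
            if ($attribute->name == 'href' || $attribute->name == 'src') {
                $links[] = $node->nodeName.' '.$attribute->name.'='.$attribute->value;
            }
        }
    }
    // hasChildNodes() is safe to call on every node type.
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $cNode) {
            $links = array_merge($links, collectLinks($cNode));
        }
    }
    return $links;
}

// Made-up sample markup, loaded from a string so no network access is needed.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><a href="/home">Home</a><img src="logo.png"></body></html>');
foreach (collectLinks($doc) as $link) {
    echo $link.'<br>';
}
```

Returning an array instead of echoing keeps the handler reusable: the caller decides whether to print, filter or store the result.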

Is that good? Because here is how you use walkDom():
<?php
$doc = new DOMDocument();
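// The @ silences warnings about sloppy markup, but it also hides real errors such as an unreachable URL.
@$doc->loadHTMLFile('http://www.google.com');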
walkDom($doc);
?>

And this prints the entire DOM of the file read from the URL passed to loadHTMLFile. More information about the constants and functions used can be found here. And believe me, this works.
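One caveat about the `@` in the usage above: it swallows every warning, so a failed load looks the same as an empty page. As a hedged alternative, here is a sketch that loads the HTML from a string and collects libxml's parser messages instead of suppressing them. The helper name `loadHtmlQuietly` and the sample markup are assumptions for illustration; how you obtain the HTML string (file, cURL, and so on) is up to you.

```php
<?php
// Hypothetical helper: returns a DOMDocument, or null when there is nothing to parse.
function loadHtmlQuietly($html)
{
    // Also covers a failed fetch that handed us false instead of a string.
    if (!is_string($html) || $html === '') {
        return null;
    }
    // Collect parser messages internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadHTML($html);
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError with public line and message properties.
        echo 'line '.$error->line.': '.trim($error->message)."<br>\n";
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting
    return $ok ? $doc : null;
}

// Made-up sample markup; in practice this string would come from your own fetch.
$doc = loadHtmlQuietly('<html><body><p id="intro">Hello</p></body></html>');
if ($doc === null) {
    echo 'Nothing to walk: the HTML could not be loaded.';
} else {
    echo 'Root element: '.$doc->documentElement->nodeName;
}
```

When the helper returns a document, pass it to walkDom() exactly as in the usage above.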


7 Responses to “Walking the HTML DOM tree in PHP”


  1. […] Observances has a script that walks the DOM tree of HTML files in PHP and prints the output to the screen. […]

  2. Saarland Says:

    Somehow I missed the point. Probably lost in translation 🙂 Anyway … nice blog to visit.

    cheers, Saarland

  3. Melvin J. Says:

    Thanks Kumar! This actually came in handy when I was debugging a PHP script I use to parse data from a website. I found out (using your “walker”) that somehow the table I was trying to parse cannot be seen by the PHP DOM! Must be malformed HTML markup.

  4. krahulg Says:

    That’s great, Melvin. I had to do this just to escape regexing [and I suck at regexes] through HTML files.

  5. Stefan Wagner Says:

    Hi all! Doesn’t work for me, I only get a #document!
    Using PHP 5.2.6
    Anybody any ideas?
    Thanks for any help!

    Here is my code, maybe someone can copy it and try it out?

    <?php
    function walkDom($node, $level = 0)
    {
        $indent = '';
        for ($i = 0; $i < $level; $i++)
            $indent .= '&nbsp;&nbsp;'; // prettifying the output
        if ($node->nodeType != XML_TEXT_NODE)
        {
            echo $indent.'<b>'.$node->nodeName.'</b>';
            if ($node->nodeType == XML_ELEMENT_NODE)
            {
                $attributes = $node->attributes; // get all the attributes (e.g. id, class …)
                foreach ($attributes as $attribute)
                {
                    echo ', '.$attribute->name.'='.$attribute->value;
                    // $attribute->name is usually one of these:
                    // src, type, rel, link, name, value, href, onclick,
                    // id, class, style, title
                    // You can add your custom handlers depending on the attribute.
                }
                // print the contents of a node with a single child, which may be the link text, contents of a div and so on
                if ($node->childNodes->length == 1 && strlen(trim($node->childNodes->item(0)->nodeValue)) > 0)
                    echo '<br>'.$indent.'(contains='.$node->childNodes->item(0)->nodeValue.')';
            }
            echo '<br><br>';
        }
        if ($node->hasChildNodes())
        {
            $level++; // go one level deeper
            foreach ($node->childNodes as $cNode)
                walkDom($cNode, $level); // so this is the recursion my professor kept talkin' about
            $level = $level - 1; // come a level up; had to write it this way or else WordPress would take away one dash 😦
        }
    }

    $doc = new DOMDocument();
    @$doc->loadHTMLFile('http://www.google.de');
    walkDom($doc);
    ?>

  6. Stefan Wagner Says:

    Hello!

    I fixed it! My ISP blocks external loadHTMLFile. Finding the error took me two days, because of the @!!
    After replacing the call, the function works fine 🙂 🙂

    Thanks a lot to the author! There are hundreds of these scripts for JavaScript on the web, but I found no one else doing it for PHP!!

    Greetings from Ukraine

    Stefan Wagner


    $doc->loadHTML($myWebSite);
    walkDom($doc);
    ?>

  7. Stefan Wagner Says:

    $doc->loadHTML($myWebSite);
    walkDom($doc);
    ?>

