Walking the HTML DOM tree in PHP

Sun, 30th Dec, '07

Walking the DOM in JavaScript has been covered well. But I couldn't find substantial help when it came to walking the DOM tree of HTML files in PHP; nearly everything written about PHP's DOM support is limited to XML. It would be wonderful if we could set aside the regex mumbo-jumbo for picking tags apart and instead check the values, contained text, IDs of elements, their class names and style rules directly. The function presented here walks through all the elements in an HTML file, accessing every node's attributes and values.

Anyway, the DOM being the wonderful thing that it is, I cooked up a small function to walk the tree of an HTML file in PHP and print the output to the screen, and boy does it walk. Less talk and more …
<?php
function walkDom($node, $level = 0)
{
    $indent = '';
    for ($i = 0; $i < $level; $i++)
        $indent .= '  '; // prettifying the output
    if ($node->nodeType != XML_TEXT_NODE)
    {
        echo $indent . $node->nodeName;
        if ($node->nodeType == XML_ELEMENT_NODE)
        {
            $attributes = $node->attributes; // get all the attributes (eg: id, class …)
            foreach ($attributes as $attribute)
            {
                echo ', ' . $attribute->name . '=' . $attribute->value;
                // $attribute->name is usually one of these:
                // src, type, rel, link, name, value, href, onclick,
                // id, class, style, title
                // You can add your custom handlers depending on the attribute.
            }
            // Uncomment to print the contents of a node, which may be the link text, contents of a div and so on:
            // if ($node->childNodes->length == 1 && strlen(trim($node->childNodes->item(0)->nodeValue)) > 0)
            //     echo ' ' . $indent . '(contains=' . $node->childNodes->item(0)->nodeValue . ')';
        }
        echo "\n"; // one node per line
    }
    if ($node->hasChildNodes()) // safe on every node type, unlike count() on childNodes
    {
        $level++; // go one level deeper
        foreach ($node->childNodes as $cNode)
            walkDom($cNode, $level); // so this is recursion my professor kept talkin' about
        $level = $level - 1; // come a level up, and had to do it this way or else wordpress would take away one dash. 😦
    }
}
?>
Is that good?? Because here is how you use it:
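<?php
$doc = new DOMDocument();
// The @ silences the warnings libxml raises on sloppy real-world markup.
// It also hides network failures, so drop it while debugging.
@$doc->loadHTMLFile('http://www.google.com');
walkDom($doc);
?>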
This prints the entire DOM of the file read from the URL passed to loadHTMLFile. More information about the constants and functions used can be found here. And believe me, this works.
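If you would rather try the walk without touching the network, the same idea can be sketched against an inline HTML string. This is only an illustration: `describeDom` is a hypothetical variant of `walkDom` that returns its lines instead of echoing them and skips everything that is not an element, and the sample markup is made up.

```php
<?php
// Hypothetical variant of walkDom(): collects one line per element
// (name plus attributes) and returns them instead of echoing.
function describeDom(DOMNode $node, int $level = 0): array
{
    $lines = [];
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $line = str_repeat('  ', $level) . $node->nodeName;
        foreach ($node->attributes as $attribute) {
            $line .= ', ' . $attribute->name . '=' . $attribute->value;
        }
        $lines[] = $line;
    }
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            // Recurse one level deeper and append the child's lines.
            $lines = array_merge($lines, describeDom($child, $level + 1));
        }
    }
    return $lines;
}

// Made-up sample markup, used only for this sketch.
$html = '<html><body><div id="main"><a href="/about" class="nav">About</a></div></body></html>';

libxml_use_internal_errors(true); // buffer parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$lines = describeDom($doc);
echo implode("\n", $lines), "\n";
```

Returning an array instead of echoing makes it easy to filter the result afterwards, for example to keep only the lines that mention an `href`.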

Posted by krg
Filed in Experiment
Tags: Document Object Model, DOM, php


7 Responses to “Walking the HTML DOM tree in PHP”

Make It Up As You Go » Blog Archive » Walking the HTML DOM Tree in PHP Says:

Sun, 30th Dec, '07 at 20:58
[…] Observances has a script that walks the DOM tree of HTML files in PHP and prints the output to the screen. […]

Saarland Says:

Fri, 20th Jun, '08 at 4:49
Somehow I missed the point. Probably lost in translation 🙂 Anyway … nice blog to visit.

cheers, Saarland

Melvin J. Says:

Thu, 4th Dec, '08 at 5:44
Thanks Kumar! This actually came in handy when I was debugging a PHP script I use to parse data from a website. I found out (using your "walker") that somehow the table I was trying to parse cannot be seen by the PHP DOM! Must be malformed HTML markup.

krahulg Says:

Thu, 4th Dec, '08 at 13:46
That's great, Melvin. I had to do this just to escape regexing (and I suck at regexes) through HTML files.

Stefan Wagner Says:

Mon, 8th Dec, '08 at 6:12
Hi all! Doesn't work for me, I only get a #document!
Using PHP 5.2.6.
Anybody any ideas?
Thanks for any help!

Here is my code, maybe someone can copy it and try it out:

<?php
function walkDom($node, $level = 0)
{
    $indent = '';
    for ($i = 0; $i < $level; $i++)
        $indent .= '  '; // prettifying the output
    if ($node->nodeType != XML_TEXT_NODE)
    {
        echo $indent . $node->nodeName;
        if ($node->nodeType == XML_ELEMENT_NODE)
        {
            $attributes = $node->attributes; // get all the attributes (eg: id, class …)
            foreach ($attributes as $attribute)
            {
                echo ', ' . $attribute->name . '=' . $attribute->value;
                // $attribute->name is usually one of these:
                // src, type, rel, link, name, value, href, onclick,
                // id, class, style, title
                // You can add your custom handlers depending on the attribute.
            }
            // $cNodes is not set yet at this point, so ask the node for its child count directly
            if ($node->childNodes->length == 1 && strlen(trim($node->childNodes->item(0)->nodeValue)) > 0)
                echo ' ' . $indent . '(contains=' . $node->childNodes->item(0)->nodeValue . ')'; // do this to print the contents of a node, which may be the link text, contents of a div and so on.
        }
        echo "\n";
    }
    if ($node->hasChildNodes())
    {
        $level++; // go one level deeper
        foreach ($node->childNodes as $cNode)
            walkDom($cNode, $level); // so this is recursion my professor kept talkin' about
        $level = $level - 1; // come a level up, and had to do it this way or else wordpress would take away one dash. 😦
    }
}

$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.google.de');
walkDom($doc);
?>

Stefan Wagner Says:

Wed, 10th Dec, '08 at 0:08
Hello!

I fixed it! My ISP blocks external loadHTMLFile. Finding the error took me two days, because the @ suppressed it!!
I replaced the call with the one below, and the function works fine: 🙂 🙂

Thanks a lot to the author! There are hundreds of these scripts for JavaScript on the web, but I found no other one for PHP!!

Greetings from Ukraine

Stefan Wagner

loadHTML($myWebSite); walkDom($doc); ?>
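As far as the surviving fragment shows, the fix was to hand the markup to loadHTML as a string rather than letting loadHTMLFile fetch it. A minimal sketch of that idea follows. How `$myWebSite` gets filled is not shown in the comment, so the inline string below is an assumption, and `loadHtmlString` is a hypothetical helper name. Instead of the `@` operator, it buffers libxml's complaints so they can be inspected.

```php
<?php
// Hypothetical helper: parse HTML that is already in a string and return
// both the document and libxml's messages, rather than hiding them with @.
function loadHtmlString(string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer parser errors
    $doc = new DOMDocument();
    $ok = $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = 'line ' . $error->line . ': ' . trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    return [$ok ? $doc : null, $messages];
}

// Assumption: in real use this string would be fetched by whatever means
// the host allows; here it is a made-up sample.
$myWebSite = '<html><body><p id="intro">Hello</p></body></html>';

[$doc, $messages] = loadHtmlString($myWebSite);
if ($doc === null) {
    exit("Could not parse the HTML\n");
}
foreach ($messages as $message) {
    echo "libxml: $message\n"; // visible, unlike with @
}
echo $doc->getElementsByTagName('p')->item(0)->getAttribute('id'), "\n";
```

The returned `$doc` can then be passed to `walkDom($doc)` exactly as in the post.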

Stefan Wagner Says:

Wed, 10th Dec, '08 at 0:08
loadHTML($myWebSite);
walkDom($doc);
?>



observances