Extracting Web Page Data with XPath in PHP: A Code Walkthrough

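To parse HTML content with XPath, PHP ships with two built-in classes: DOMDocument and DOMXPath.

During initialization, loadHTML usually emits many warnings, because real-world pages are rarely well-formed and may use tags the underlying parser does not recognize. These warnings do not prevent the document from being queried, so they can be suppressed with the @ operator.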
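As an alternative to @, which hides every error raised by the call, the parser messages can be collected with libxml_use_internal_errors() and inspected or discarded afterwards. The following is a minimal sketch, not part of the original class; the sample markup and variable names are assumptions made for illustration.

```php
<?php
// Hypothetical sample markup: <section> is an HTML5 tag that some libxml2
// versions report as invalid, which is a typical source of warnings.
$html = '<html><body><section><p>hello</p></section></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object (level, code, line, message).
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode for the rest of the program.
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$text = $xpath->query('//p')->item(0)->textContent;

echo 'parser messages: ', count($errors), PHP_EOL;
echo 'text: ', $text, PHP_EOL;
```

The number of collected messages depends on the installed libxml2 version, so it should not be relied on; the point is only that the messages end up in $errors rather than in the output or the error log.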

/**
 * Initialize the DOMXPath object.
 *
 * @param string $content Page content (HTML)
 * @param array  $patinfo Match configuration
 *
 * @return void
 */
private function _createXpathObj($content, $patinfo)
{
    // Skip initialization when no XPath rule is configured
    if (!$this->_existsXpathParse($patinfo)) {
        return;
    }
    try {
        $dom = new \DOMDocument();
        @$dom->loadHTML($content);
        $dom->normalize();
        $xpath = new \DOMXPath($dom);
        $this->xpathObj = $xpath;
    } catch (\Exception $e) {
        getService('logger')->warning('Parse html fail', ['content' => $content]);
    }
}

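In the method below, $node is the first node matched by the expression; when the expression selects an element, it is a DOMElement object.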

/**
 * Get the value matched by an XPath expression.
 *
 * @param string $pat XPath expression
 *
 * @return string
 */
private function _getXpathField($pat)
{
    $objs = $this->xpathObj->query($pat);
    // query() returns false when the expression is invalid
    if ($objs !== false && $objs->length > 0) {
        $node = $objs->item(0);
        $outerHTML = $node->ownerDocument->saveHTML($node);
        return trim($outerHTML);

        // Alternative: return the inner HTML
        // $innerHTML = '';
        // foreach ($node->childNodes as $childNode) {
        //     $innerHTML .= $childNode->ownerDocument->saveHTML($childNode);
        // }
        // return $innerHTML;

        // Alternative: return the text only, without tags
        // return $node->textContent; // or $node->nodeValue
    }
    return '';
}
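The method above returns only the first match. When a page contains several matching nodes, or when the value of interest is an attribute rather than an element, the same query() call can be iterated. The sketch below is a standalone illustration; the helper name xpathTexts and the sample markup are hypothetical and do not come from the original class.

```php
<?php
// Hypothetical sample markup, not taken from the original class.
$html = '<html><body><ul>'
      . '<li><a href="/a">First</a></li>'
      . '<li><a href="/b">Second</a></li>'
      . '</ul></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: return the trimmed text of every node matched by $pat.
function xpathTexts(DOMXPath $xpath, string $pat): array
{
    $nodes = $xpath->query($pat);
    if ($nodes === false) {
        // query() returns false when the expression is invalid.
        return [];
    }
    $values = [];
    foreach ($nodes as $node) {
        // textContent works for element nodes and attribute nodes alike.
        $values[] = trim($node->textContent);
    }
    return $values;
}

$titles = xpathTexts($xpath, '//li/a');        // element nodes
$links  = xpathTexts($xpath, '//li/a/@href');  // attribute nodes

print_r($titles);
print_r($links);
```

Here $titles holds the link texts and $links holds the href values, in document order.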

Example

<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML('<html><body><div><p>p1</p><p>p2</p></div></body></html>');
$node = $dom->getElementsByTagName('div')->item(0);

$outerHTML = $node->ownerDocument->saveHTML($node);

$innerHTML = '';
foreach ($node->childNodes as $childNode) {
    $innerHTML .= $childNode->ownerDocument->saveHTML($childNode);
}

echo '<h2>outerHTML: </h2>';
echo htmlspecialchars($outerHTML);
echo '<h2>innerHTML: </h2>';
echo htmlspecialchars($innerHTML);
?>
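The example covers outerHTML and innerHTML. The third form mentioned earlier, plain text without tags, can be sketched on the same markup by selecting the node with an XPath query instead of getElementsByTagName. This is an illustrative variation, not code from the original article.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML('<html><body><div><p>p1</p><p>p2</p></div></body></html>');

$xpath = new DOMXPath($dom);
$node = $xpath->query('//div')->item(0);

// textContent concatenates the text of all descendants, with no tags.
$text = $node->textContent;

echo $text, PHP_EOL; // prints "p1p2"
```

Because the two paragraphs are joined without a separator, iterating over the child nodes is the better choice when the text of each paragraph has to stay distinct.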

That is all for this article; I hope it helps with your studies.