hu.midori.kosmos.server.util
Class WebCrawlingUtils

java.lang.Object
  extended by hu.midori.kosmos.server.util.WebCrawlingUtils

public class WebCrawlingUtils
extends java.lang.Object

Utility methods for web crawling.

Version:
$Id$
Author:
Aron Gombas

Constructor Summary
protected WebCrawlingUtils()
          This class should never be instantiated.
 
Method Summary
static org.w3c.dom.Document downloadHtmlDom(java.net.URL url)
          Downloads and tidies up an HTML document from the given URL and returns it as DOM.
static org.w3c.dom.Document downloadXmlDom(java.net.URL url)
          Downloads an XML document from the given URL and returns it as DOM.
static java.lang.String eliminateEmptyValues(java.lang.String value)
          Eliminates the empty items from a scraped value string so that it can be tokenized reliably.
static org.w3c.dom.Node findDomNodeByAttribute(org.w3c.dom.NodeList nodes, java.lang.String attribName, java.lang.String attribValue)
          Returns the first node in the given list that has the given attribute value, or null if no such node is found.
static org.w3c.dom.Document parseStringDom(java.lang.String xmlString)
          Parses an XML document from the given string.
static java.util.List runXQuery(org.w3c.dom.Document dom, java.lang.String query)
          Runs an XQuery on the given DOM and returns the full result.
static int runXQueryInt(org.w3c.dom.Document dom, java.lang.String query)
          Runs an XQuery on the given DOM and returns a single int as result.
static java.lang.String runXQueryString(org.w3c.dom.Document dom, java.lang.String query)
          Runs an XQuery on the given DOM and returns a single String as result.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WebCrawlingUtils

protected WebCrawlingUtils()
This class should never be instantiated.

Method Detail

parseStringDom

public static org.w3c.dom.Document parseStringDom(java.lang.String xmlString)
                                           throws java.lang.Exception
Parses an XML document from the given string.

Throws:
java.lang.Exception
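
Example (illustrative sketch, not taken from the Kosmos sources): the page does not say which parser this method uses, so the code below assumes the JDK's built-in JAXP parser. The class name ParseStringDomSketch is hypothetical; only the method signature mirrors the one documented above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical stand-in class; an assumption about how such a helper could look.
class ParseStringDomSketch {

    /** Parses an XML string into a DOM tree with the JDK's JAXP parser. */
    static Document parseStringDom(String xmlString) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Wrapping the string in a Reader avoids any guessing about byte encodings.
        return builder.parse(new InputSource(new StringReader(xmlString)));
    }

    public static void main(String[] args) throws Exception {
        Document dom = parseStringDom("<feed><entry id=\"1\">hello</entry></feed>");
        System.out.println(dom.getDocumentElement().getTagName()); // prints "feed"
    }
}
```

Because the documented method declares java.lang.Exception, callers have to catch or propagate the broad type. In this sketch, malformed input surfaces as an org.xml.sax.SAXException.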

downloadXmlDom

public static org.w3c.dom.Document downloadXmlDom(java.net.URL url)
                                           throws java.lang.Exception
Downloads an XML document from the given URL and returns it as DOM.

Throws:
java.lang.Exception
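
Example (illustrative sketch, not taken from the Kosmos sources): one way a download-and-parse helper with this signature might be written using only JDK classes. The class name DownloadXmlDomSketch and the 10-second timeouts are assumptions. The demo reads a temporary file through a file: URL so that it runs without network access.

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical stand-in class; an assumption about how such a helper could look.
class DownloadXmlDomSketch {

    /** Opens the URL, streams its content into the JAXP parser and returns the DOM. */
    static Document downloadXmlDom(URL url) throws Exception {
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(10_000); // assumed timeout, in milliseconds
        connection.setReadTimeout(10_000);    // assumed timeout, in milliseconds
        try (InputStream in = connection.getInputStream()) {
            // The system id lets the parser resolve relative references in the document.
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in, url.toString());
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("sketch", ".xml");
        try {
            Files.write(tmp, "<projects><project/><project/></projects>"
                    .getBytes(StandardCharsets.UTF_8));
            Document dom = downloadXmlDom(tmp.toUri().toURL());
            System.out.println(dom.getElementsByTagName("project").getLength()); // prints 2
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```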

downloadHtmlDom

public static org.w3c.dom.Document downloadHtmlDom(java.net.URL url)
                                            throws java.lang.Exception
Downloads and tidies up an HTML document from the given URL and returns it as DOM.

Throws:
java.lang.Exception

findDomNodeByAttribute

public static org.w3c.dom.Node findDomNodeByAttribute(org.w3c.dom.NodeList nodes,
                                                      java.lang.String attribName,
                                                      java.lang.String attribValue)
Returns the first node in the given list that has the given attribute value, or null if no such node is found.
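
Example (illustrative sketch, not taken from the Kosmos sources): the lookup described above can be expressed as a linear scan over the NodeList. The class name FindDomNodeByAttributeSketch is hypothetical. The exact-match comparison and the skipping of non-element nodes are assumptions about the behavior.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in class; an assumption about how such a helper could look.
class FindDomNodeByAttributeSketch {

    /** Returns the first node whose attribute equals the given value, or null. */
    static Node findDomNodeByAttribute(NodeList nodes, String attribName, String attribValue) {
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            // Only element nodes carry attributes; text and comment nodes are skipped.
            if (node instanceof Element) {
                Element element = (Element) node;
                if (element.hasAttribute(attribName)
                        && element.getAttribute(attribName).equals(attribValue)) {
                    return node;
                }
            }
        }
        return null; // nothing matched
    }

    public static void main(String[] args) throws Exception {
        String xml = "<table><tr class=\"odd\">a</tr><tr class=\"even\">b</tr>"
                + "<tr class=\"even\">c</tr></table>";
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList rows = dom.getElementsByTagName("tr");

        Node firstEven = findDomNodeByAttribute(rows, "class", "even");
        System.out.println(firstEven.getTextContent());                      // prints "b"
        System.out.println(findDomNodeByAttribute(rows, "class", "missing")); // prints "null"
    }
}
```

Since the method signals "not found" with null rather than an exception, callers should check the result before dereferencing it.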


runXQuery

public static java.util.List runXQuery(org.w3c.dom.Document dom,
                                       java.lang.String query)
                                throws net.sf.saxon.trans.XPathException
Runs an XQuery on the given DOM and returns the full result.

Throws:
net.sf.saxon.trans.XPathException

runXQueryInt

public static int runXQueryInt(org.w3c.dom.Document dom,
                               java.lang.String query)
                        throws net.sf.saxon.trans.XPathException
Runs an XQuery on the given DOM and returns a single int as result.

Throws:
net.sf.saxon.trans.XPathException

runXQueryString

public static java.lang.String runXQueryString(org.w3c.dom.Document dom,
                                               java.lang.String query)
                                        throws net.sf.saxon.trans.XPathException
Runs an XQuery on the given DOM and returns a single String as result.

Throws:
net.sf.saxon.trans.XPathException
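
Example (illustrative sketch, not taken from the Kosmos sources): the documented methods run XQuery through Saxon, whose setup is not shown on this page. To illustrate only the idea of reducing a query result to a single int or String, the code below swaps in the JDK's XPath 1.0 API (javax.xml.xpath). It is not XQuery and does not use Saxon. The class name SingleValueQuerySketch and the method names queryInt and queryString are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical stand-in class: JDK XPath 1.0 is used here instead of Saxon XQuery.
class SingleValueQuerySketch {

    /** Evaluates the expression and converts the result to a single String. */
    static String queryString(Document dom, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate(expression, dom, XPathConstants.STRING);
    }

    /** Evaluates the expression and converts the result to a single int. */
    static int queryInt(Document dom, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Double number = (Double) xpath.evaluate(expression, dom, XPathConstants.NUMBER);
        // A non-numeric result is NaN in XPath 1.0, which intValue() turns into 0.
        return number.intValue();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<stats><name>kosmos</name><downloads>42</downloads></stats>";
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        System.out.println(queryString(dom, "/stats/name"));   // prints "kosmos"
        System.out.println(queryInt(dom, "/stats/downloads")); // prints 42
    }
}
```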

eliminateEmptyValues

public static java.lang.String eliminateEmptyValues(java.lang.String value)
Eliminates the empty items from a scraped value string so that it can be tokenized reliably. For example, "||xxx|" is transformed to "| |xxx|": a space is placed between the adjacent delimiters.
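
Example (illustrative sketch, not taken from the Kosmos sources): a minimal way to reproduce the documented transformation, assuming the delimiter is always the pipe character and the tokenizer in question behaves like java.util.StringTokenizer. The class name EliminateEmptyValuesSketch is hypothetical.

```java
import java.util.StringTokenizer;

// Hypothetical stand-in class; an assumption about how such a helper could look.
class EliminateEmptyValuesSketch {

    /** Inserts a single space between every pair of adjacent '|' delimiters. */
    static String eliminateEmptyValues(String value) {
        // Zero-width match: a position preceded by '|' and followed by '|'.
        return value.replaceAll("(?<=\\|)(?=\\|)", " ");
    }

    public static void main(String[] args) {
        String raw = "||xxx|";
        String fixed = eliminateEmptyValues(raw);
        System.out.println(fixed); // prints "| |xxx|"

        // StringTokenizer silently skips empty items, so the raw string loses a field.
        System.out.println(new StringTokenizer(raw, "|").countTokens());   // prints 1
        System.out.println(new StringTokenizer(fixed, "|").countTokens()); // prints 2
    }
}
```

The padded item comes back from the tokenizer as a single space, so callers would typically trim each token before using it.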