SQL tools for the Web
For starters, we need an engine to run the SQL statements. I started to implement such an engine and became increasingly excited about the idea as I did more work and ran more test queries. I started with a simple engine that allowed queries against a single-source Webpage. At the time of this writing, the engine has been extended to allow selections from multiple Webpages -- as though from separate database tables -- with the results being joined together.

Devising the queries to put through the engine was very hard work, so I came up with the idea of an Assistant that would present a source HTML page in the hierarchical form shown above in Figure 4. Not only would this show how sets of elements repeat (to help you decide where to put the wildcard characters), it would also allow an SQL statement to be constructed with a few mouse clicks.

The peculiarities of HTML would be handled by highlighting certain elements such as links, forms, and frames, and by allowing you to drill down into frames and follow links to find the data you desire. Form submissions would be automated in order to support dynamic -- and static -- pages as data sources.

I have produced a demonstration version of the Assistant -- as a Java 2 applet -- that lets you run some example SQL queries on live Web data using the SQL Engine. See the link in Resources to try it out.

This kind of assistant would be a useful end-user tool, allowing data to be extracted from the Web and exported to a spreadsheet or database application for further processing. But I'd really like to capture -- as would you, I suspect -- the results into a Java program, probably a servlet that could be used as a portal. To accomplish that task, we need a Java API.

The WDBC API
My title for the Java API -- Web DataBase Connectivity (WDBC) -- suggests an API similar to JDBC. Although I might do so in the future, I've not followed the JDBC API to the letter because we require only a small subset of the JDBC functionality for this task. Indeed, support for updates is not relevant, nor is support for stored procedures. However, I do think that the programmatic feel of the API should be similar to JDBC so, if you know what JDBC code looks like, I'm sure you'll spot the similarities in this WDBC code:


int sqlColumns=3;

String sqlStatement="SELECT .table[0001]...htxt[0000] AS linkText, "
+".table[0001]...table[*].tr[0000].td[0000].text[0001] AS description, "
+".table[0001]...table[*].tr[0000].td[0000].href[0000] AS link\n"
+"FROM http://search.itworld.com:8765/query.html\n"
+"WITHGET col=qt=ejb\n"
+"WHERE linkText MATCHES '*December*'";

SqlEngine sqlEngine=new SqlEngine();

try
{
  ResultSet resultSet=sqlEngine.executeQuery(sqlStatement);

  while (resultSet.next())
  {
    for (int c=1; c<=sqlColumns; c++)
    {
      System.out.print(resultSet.getString(c)+"|");
    }
    System.out.println();
  }
}
catch (Exception ex) { ex.printStackTrace(); }

Queries are submitted via an executeQuery() method and the results come back as a ResultSet, the elements of which you can step through one by one using next(). Having extracted the data into your Java program, you can now do whatever you like with it.

Uses for WDBC
I see three main uses for the HTMSQL and the WDBC Java API:

When you're thinking about possible applications, keep in mind that you're not restricted to the textual content of Webpages. Links, images, and other HTML elements can all be distinguished, meaning that your Java program could follow a link according to the host name of the link. Or what about displaying an image only if the accompanying text contains the word "Madonna?"

The example SQL statements provided for you to try the demonstration applet (see Resources) allow you to:

Conclusion
For quite some time I puzzled over how to extract information easily from Webpages so I could build new kinds of applications. Now that I've become so engrossed in what you might call the enabling technology, I've become interested in the technique itself rather than in its applications, so -- for the moment -- I'll leave the applications to someone else. Maybe you have some ideas of your own, which I'd be interested to hear.

The proof-of-concept SQL Engine has been rewritten to take the idea forward without much of the original code. The new engine includes support for selections -- and joins -- across multiple source pages, which I think will increase the usefulness enormously. The priority now focuses on extending the Assistant to a wider toolset that makes it all child's play.

I'm not the only one to think this is a great idea. A couple of commercial tools -- W4F from the University of Pennsylvania and Caesius's WebQL -- address the same problem in a similar, yet different way (see Resources).


Page 1 Access the world's biggest database with Web DataBase Connectivity
Page 2 SQL tools for the Web

Printer-friendly version Printer-friendly version | Send this article to a friend Mail this to a friend

About the author
Tony Loton is an independent consultant, course instructor, and technical author who writes regularly for JavaWorld and other Java journals. He launched his company -- LOTONtech Limited -- as a vehicle for researching, developing, and commercializing innovative software solutions independently and in collaboration.




Source: http://www.javaworld.com/jw-03-2001/jw-0316-webdb.html