SQL tools for the Web
For
starters, we need an engine to run the SQL statements. I started to
implement such an engine and became increasingly excited about the idea as
I did more work and ran more test queries. I started with a simple engine
that allowed queries against a single-source Webpage. At the time of this
writing, the engine has been extended to allow selections from multiple
Webpages -- as though from separate database tables -- with the results
being joined together.
Devising the queries to put through the engine was very hard work, so I came up with the idea of an Assistant that would present a source HTML page in the hierarchical form shown above in Figure 4. Not only would this show how sets of elements repeat (to help you decide where to put the wildcard characters), it would also allow an SQL statement to be constructed with a few mouse clicks.
The peculiarities of HTML would be handled by highlighting certain elements such as links, forms, and frames, and by allowing you to drill down into frames and follow links to find the data you desire. Form submissions would be automated in order to support dynamic -- and static -- pages as data sources.
I have produced a demonstration version of the Assistant -- as a Java 2 applet -- that lets you run some example SQL queries on live Web data using the SQL Engine. See the link in Resources to try it out.
This kind of assistant would be a useful end-user tool, allowing data to be extracted from the Web and exported to a spreadsheet or database application for further processing. But I'd really like to capture -- as would you, I suspect -- the results into a Java program, probably a servlet that could be used as a portal. To accomplish that task, we need a Java API.
The WDBC API
My title for the
Java API -- Web DataBase Connectivity (WDBC) -- suggests an API similar to
JDBC. Although I might do so in the future, I've not followed the JDBC API
to the letter because we require only a small subset of the JDBC
functionality for this task. Indeed, support for updates is not relevant,
nor is support for stored procedures. However, I do think that the
programmatic feel of the API should be similar to JDBC so, if you know
what JDBC code looks like, I'm sure you'll spot the similarities in this
WDBC code:
int sqlColumns=3;
String sqlStatement="SELECT
.table[0001]...htxt[0000] AS linkText,
"
+".table[0001]...table[*].tr[0000].td[0000].text[0001] AS
description, "
+".table[0001]...table[*].tr[0000].td[0000].href[0000]
AS link\n"
+"FROM
http://search.itworld.com:8765/query.html\n"
+"WITHGET
col=qt=ejb\n"
+"WHERE linkText MATCHES '*December*'";
SqlEngine
sqlEngine=new SqlEngine();
try
{
ResultSet
resultSet=sqlEngine.executeQuery(sqlStatement);
while
(resultSet.next())
{
for (int
c=1; c<=sqlColumns;
c++)
{
System.out.print(resultSet.getString(c)+"|");
}
System.out.println();
}
}
catch
(Exception ex) { ex.printStackTrace(); }
Queries are submitted via an executeQuery()
method and the
results come back as a ResultSet
, the elements of which you
can step through one by one using next()
. Having extracted
the data into your Java program, you can now do whatever you like with it.
Uses for WDBC
I see three
main uses for the HTMSQL and the WDBC Java API:
When you're thinking about possible applications, keep in mind that you're not restricted to the textual content of Webpages. Links, images, and other HTML elements can all be distinguished, meaning that your Java program could follow a link according to the host name of the link. Or what about displaying an image only if the accompanying text contains the word "Madonna?"
The example SQL statements provided for you to try the demonstration applet (see Resources) allow you to:
Conclusion
For quite some
time I puzzled over how to extract information easily from Webpages so I
could build new kinds of applications. Now that I've become so engrossed
in what you might call the enabling technology, I've become
interested in the technique itself rather than in its applications, so --
for the moment -- I'll leave the applications to someone else. Maybe you
have some ideas of your own, which I'd be interested to hear.
The proof-of-concept SQL Engine has been rewritten to take the idea forward without much of the original code. The new engine includes support for selections -- and joins -- across multiple source pages, which I think will increase the usefulness enormously. The priority now focuses on extending the Assistant to a wider toolset that makes it all child's play.
I'm not the only one to think this is a great idea. A couple of
commercial tools -- W4F from the University of Pennsylvania and Caesius's
WebQL -- address the same problem in a similar, yet different way (see Resources).
Printer-friendly
version |
Mail this to a friend
About the author
Tony Loton is an independent consultant, course
instructor, and technical author who writes regularly for
JavaWorld and other Java journals. He launched his company -- LOTONtech Limited -- as a vehicle for
researching, developing, and commercializing innovative software solutions
independently and in collaboration.
Source: http://www.javaworld.com/jw-03-2001/jw-0316-webdb.html