Building a news aggregator with Unix tools or webMethods


Curtis Autery - January 26, 2005


The purpose of this document is to compare the standard HTTP/HTML handling in two sets of tools: webMethods Integration Server 6.1 with the standard public HTTP, List, and String packages, and the Unix shell with perl, xargs, and wget. The task to be accomplished with both sets of tools is a simple news story aggregator built on the publicly available stories on E*Trade's news site, "https://us.etrade.com/e/t/invest/marketnews".

If viewed with a web browser, E*Trade's news site shows a list of story headlines, plus many extra items that are unrelated to our goal. Each headline can then be clicked to load a new page, which contains the story text and other items not pertaining to the story. This is a sample headlines page as viewed with Mozilla:

[Screenshot: the headlines page as rendered in Mozilla]

The headlines page shows links to 20 stories, each of which has the same URL pattern: "/e/t/invest/story?ID=STORYID=". This pattern can be used to distinguish an anchor to a story from the page's other anchors.

Each story page is a tangle of nested HTML tables, with the text of the story buried somewhere within. This is a sample HTML snippet from a story page (the copyrighted story text is omitted):

<a href="/e/t/invest/marketnews"><b>&lt; Back to Market News</b></a>
<hr size="1" noshade><br>
<table width="165" border="0" cellspacing="0" cellpadding="3" align="right"><tr>
<th colspan="4" align="left" class="rev">Related Quotes</th>
</tr><tr>
<th align="left" class="smrev">Sym.</th>
<td class="smrev">&nbsp;</td>
<th class="smrev" align="right">Price</th>
<th class="smrev" align="right">Chg.</th>
</tr>
 
<tr valign="top" bgcolor="#eeeeee">
<td class="sm"><a href="/e/t/invest/quotesandresearch?sym=PWRI&qmenu=2">PWRI</a></td>
<td align="center" class="sm"><a href="/e/t/invest/socreateentry?Symbol=PWRI">Trade</a><br>
 
<a href="/e/t/invest/newsandresearch?sym=PWRI&prod=PWRI:US:EQ">News</a></td>
<td nowrap align="right" class="sm">0.32</td>
<td nowrap align="right" class="sm" style="color:green">0.02</td>
</tr>
</table>
 
<b> Power2Ship Bidding on Security-Related Government Contracts;
(story here)
</td>
 
<td width="16" nowrap>&nbsp;</td></tr></table>
</td>
<!-- skins right rail -->
<td valign="top"><script TYPE='text/javascript' LANGUAGE="JavaScript" >

The story text can be located in one of two ways: it is everything between the first occurrence of "<b> " and the "</td>" that follows, or, equivalently, the second TD element of the first nested table of the document's third table. With this information, we can write some basic instructions describing what the aggregator will do:

  1. Download the contents of https://us.etrade.com/e/t/invest/marketnews
  2. Parse out a list of destination URLs for all anchors that contain "/e/t/invest/story?ID=STORYID"
  3. For each matching URL
      a. Download the URL in question
      b. Parse out the story content using either method described above
      c. Add the story content to the output queue

  4. Print the output queue and clean up


The output should be the HTML of all 20 stories concatenated together, minus extra formatting, links to E*Trade services, and advertising.
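
Before looking at either toolset, here is what those steps look like as a single perl sketch, just to make the plan concrete. It is only a sketch: LWP::Simple (with SSL support, which it needs to reach an HTTPS site) stands in for the downloading that wget and pub.client.http will do below, and the error handling and variable names are my own.

#!/usr/bin/perl
# Sketch of the four basic instructions, assuming LWP::Simple is available.
use strict;
use warnings;
use LWP::Simple qw(get);

my $base = "https://us.etrade.com";

# Step 1: download the headlines page
my $index = get("$base/e/t/invest/marketnews") or die "couldn't fetch index page";

# Step 2: parse out the destination URLs of all story anchors
my @urls = map { "$base$_" } $index =~ m!(/e/t/invest/story[^"]+)!g;

# Steps 3a-3c: download each story and keep everything between the markers,
# turning the closing </td> into a <br><br> separator
my @output;
for my $url (@urls) {
    my $page = get($url) or next;
    for my $line (split /\n/, $page) {
        next unless $line =~ /^<b> / .. $line =~ s!^</td>!<br><br>!;
        push @output, $line;
    }
}

# Step 4: print the output queue
print join("\n", @output), "\n";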

Unix Tools


The three programs I chose to accomplish this goal were ones I was already familiar with and had used extensively for other activities: wget, xargs, and perl. For those unfamiliar, wget is a command-line web page downloader, xargs converts its input into a command-line argument list ("xargs rm < file_list.txt" will remove all the files named in file_list.txt), and perl is a popular programming language with powerful string manipulation abilities.

Step 1 from our basic instructions is easily accomplished with wget:
wget https://us.etrade.com/e/t/invest/marketnews

This, however, is a bad start. First, when executed, several directories will be created, namely e, e/t, and e/t/invest, and finally the file e/t/invest/marketnews will be created. Second, when executed repeatedly, new files marketnews.1, marketnews.2, marketnews.3, etc. will be created. This clutters the directory and creates extra work later in double-checking which file is the most recent. It would be better in this case to work from standard output/standard input and pipe the results of one program directly into the next. Fortunately, wget supports writing its output to STDOUT with a simple switch:
wget -O - https://us.etrade.com/e/t/invest/marketnews |

For step 2, the anchors that need to be parsed look like this in the raw HTML:
<a href="/e/t/invest/story?ID=STORYID=description_and_date_here">

Fortunately, the entire pattern we are interested in is enclosed in quotes, follows a standard pattern, and does not wrap to multiple lines. A perl one-liner can take the HTML output from wget and pull out all the story links with a simple regular expression:
m!/e/t/invest/story[^"]+!

This says "Match the first substring on the input line that starts with /e/t/invest/story, and then contains one or more (+) non-quote ([^"]) characters. Stop at either the end of the input line, or the first quote. The substring that matches will be contained in the special perl variable $&. Each match must be prepended with https://us.etrade.com to make a complete URL, and each line of the previous wget command's output must be fed to the pattern matcher one at a time using perl's -n command-line switch, making the entire command:
perl -lne 'print "https://us.etrade.com$&" if m!/e/t/invest/story[^"]+!' |

Concise and elegant, one of the reasons perl is still popular after 17 years.
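
To try the one-liner in isolation, it can be fed a single hand-made anchor line (the story ID here is made up for illustration):

echo '<a href="/e/t/invest/story?ID=STORYID=sample_story_20050126">' |
perl -lne 'print "https://us.etrade.com$&" if m!/e/t/invest/story[^"]+!'

which prints the completed URL:

https://us.etrade.com/e/t/invest/story?ID=STORYID=sample_story_20050126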

For step 3, we have a list of URLs, and we need to download them all. Enter xargs calling wget:
xargs wget -O - |

If it is unclear what that will do: xargs builds a command line for wget that contains all the data just piped to it. The resulting command that gets executed will look like this:
wget -O - https://us.etrade…STORYID=story1 https://us.etrade…STORYID=story2 …etc
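
A quick way to preview the command xargs will build is to have it call echo instead of wget (the URLs here are stand-ins):

printf '%s\n' url1 url2 url3 | xargs echo wget -O -

which prints the assembled command ("wget -O - url1 url2 url3") instead of running it.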

The output from all the pages will be printed to STDOUT, where a final perl script can cull the data down to just the good parts, using the "<b> " and "</td>" markers referred to earlier. That this can be done with another perl one-liner may come as a surprise to those not familiar with the flexibility and text processing power of perl: the range operator (..) in scalar context, combined with the -n command-line switch, can cycle back and forth between sections delimited by the above markers. For example,
perl -lne 'print $_ if (/^<b> / .. /^<.td>/)'

…will cycle through each input line, and the if statement will evaluate to true starting at the first occurrence of "<b> ", stay true through the first occurrence of "</td>" (inclusive), go false again, then flip back to true at the next "<b> ", and so on, printing out each line of STDIN that falls between the markers (the special variable $_ refers to the current line in this case, and can also be omitted since it is the default). In addition, I would like to clean up the HTML output by adding separators between the stories, so instead of </td> I would like to see <br><br> in the output stream, making the final command:
perl -ne 'print $_ if /^<b> / .. s/^<.td>/<br><br>/'
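
To watch the range operator flip and flop before wiring everything together, the one-liner can be fed a few hand-made lines (the text is obviously not a real story):

printf '%s\n' 'leading junk' '<b> Headline' 'story text' '</td>' 'trailing junk' |
perl -ne 'print $_ if /^<b> / .. s/^<.td>/<br><br>/'

The output is just the three lines between the markers, with the closing </td> rewritten to <br><br>:

<b> Headline
story text
<br><br>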

Putting all our commands together in a shell script, and adding a wrapper that puts HTTP headers on the stream so it can be called as a CGI script, the final product looks like this:
#!/bin/bash
 
echo Content-type: text/html
echo
 
wget -O - https://us.etrade.com/e/t/invest/marketnews 2>/dev/null |
perl -lne 'print "https://us.etrade.com$&" if m!/e/t/invest/story[^"]+!' |
xargs wget -O - 2>/dev/null |
perl -ne 'print $_ if /^<b> / .. s/^<.td>/<br><br>/'
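
To run it as a CGI script, the file only needs to be dropped somewhere the web server executes CGI programs and marked executable. For example, assuming the script above was saved as get_news.sh and Apache serves CGI scripts out of /usr/lib/cgi-bin (both names are only examples):

cp get_news.sh /usr/lib/cgi-bin/get_news.cgi
chmod +x /usr/lib/cgi-bin/get_news.cgi

After that, requesting /cgi-bin/get_news.cgi from a browser returns the aggregated stories.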

The script is simple, but assumes a lot about the environment. It has to be on a system that supports shell scripting, such as *NIX or Windows with Cygwin; xargs and perl must be installed (they are on most *NIX systems by default); and the version of wget installed must support SSL in order to establish the HTTPS connections to E*Trade.

In other words, it's an effective, small solution using free tools, but its portability takes some work.

webMethods Integration Server 6.1


In webMethods, the perl mantra "There's more than one way to do it" applies very well.
webMethods is an extension of Java programming, in which raw Java code can be written alongside a visual, flowchart-like stringing together of processes. It is similar to Unix shell programming in that there are standard constructs like loops and variable assignments that work independently of the processes being called.

Here is an example to illustrate, showing a simple flow that creates a list of even numbers from 2 to 20:

[Screenshot: the example flow and its pipeline]

The top frame shows the sequence in which the flows are called. Repeat and Loop are built-in flow control statements, whereas appendToStringList and multiplyInts are flows in the standard public packages pub.list and pub.math, respectively. The bottom frame shows how variables are passed to and from specific flows (i.e., "the Pipeline").

There are two basic kinds of flows: the kind shown above, and pure Java. There is a simple interface for reading from and writing to the Pipeline. Here is an example of Java code that creates the same list as above and writes it to the Pipeline:
String Counts[] = new String[10];
for (int i = 0; i < 10; i++) {
    Counts[i] = String.valueOf(2 * (i + 1));
}
pipeline.getCursor().insertAfter("output", Counts);

WebMethods has many useful packages for such things as connecting to databases, transferring files, manipulating strings and arrays, working with XML and WSDLs, scheduling processes, managing SQL queries, and so on. The Integration Server that executes flow code runs on Windows or Unix (since it is basically Java underneath, making it cross-platform was a natural direction to go). The two main drawbacks are that it costs a fortune and eats up hordes of resources: the recommended memory to run the server on a Windows machine is a gig of RAM.

My first attempt at tackling the aggregator problem was to use regular flow services to download the web pages and turn them into strings, and to use Java both to parse the anchors out of the index page and to filter the story pages based on the HTML markers. Here is the state the anchor-parsing Java was in when I decided to scrap the idea:
//Declare variables
String web;          // Input web page
String tmp;          // Temp string
String links[];      // Array of <a href=(\w+).+?> or <a href="(.+?)".+?>
int count = 0;       // Current length of links array
int n;
int f;
IDataCursor my_pipe = pipeline.getCursor();

//Read data from pipeline
my_pipe.first("html_input");
web = (String) my_pipe.getValue();
tmp = web.toLowerCase();

//Count anchors
n = 0;
while ( tmp.indexOf("<a", n) != -1 ) {
    n = tmp.indexOf("<a", n) + 1;
    count++;
}

//Add anchors to links array
links = new String[count];
count = 0;
n = 0;
while ( tmp.indexOf("<a", n) != -1 ) {
    n = tmp.indexOf("<a", n) + 1;
    f = tmp.indexOf(">", n);
    links[count++] = web.substring(n, f);
}

// Write array to pipeline
my_pipe.insertAfter("links", links);
my_pipe.destroy();

Java isn't my strongest language. I was unhappy that I had to cycle through the input string twice, but I was trying to be efficient with memory usage, and I didn't want to be dependent on extra Java modules for queues, vectors, or something similar.

I scrapped the idea of using Java because the non-Java flow services were getting big, and the mix would have been extra mental overhead for anyone attempting to decipher the process later. I rewrote the process in pure webMethods flow, and it ended up looking like this:

[Screenshot: the get_news flow]

First, pub.client.http downloads the news index page as a byte stream, which I then turn into a string with bytesToString. The tokenize flow splits the string into an array of lines, and I loop through each of them checking for the key phrase "STORYID", which indicates the lines that contain a story link.

The first loop creates a second string array of just the lines with STORYID in them, and the second loop then cycles through those, extracts the destination URL from each, and downloads those pages. I could have combined them into one loop, but this was my first pass, before I had a breakthrough and scrapped this idea as well.

To finish out the flow from above, the story URLs are downloaded and converted to strings, and each one is passed to a second flow, story_filter, that returns just the meat of the story and concatenates it to the main output variable with pub.string.concat. The story_filter flow looks like this:

[Screenshot: the story_filter flow]

This is simpler than it looks. The string is again tokenized to an array of lines, and then the "<b> " and "</td>" markers are used to control a boolean value "record" that indicates whether the line in question should be written to this flow's output string. At the end, the meat of the story is returned to the main get_news flow.
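
The logic is just an explicit-flag version of the perl range operator used earlier. In perl terms, the story_filter loop does roughly this (a sketch only; @lines stands for the tokenized page and $story for the flow's output string):

my $record = 0;
my $story  = "";
for my $line (@lines) {
    $record = 1 if $line =~ /^<b> /;     # start recording at the opening marker
    $story .= "$line\n" if $record;      # copy the line if we are inside a story
    $record = 0 if $line =~ m!^</td>!;   # stop recording after the closing marker
}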

The two flows are fairly bloated, but before my final breakthrough I had intended to simplify them with a couple more passes until they were manageable. Fortunately, I avoided that work when I discovered the power of the built-in XML queries.

WebMethods can download a web page as an XML node, and use either the standard XQL query language or its own proprietary WQL query language to extract information from the page. Combining the two query languages, I was able to trim down the process into something a little more manageable:

[Screenshot: the final, seven-line flow]

That's the extent of the flow, seven lines, just like the earlier CGI shell script. The first loadXMLNode downloads the story index page, and the first queryXMLNode extracts story links with the XQL query:
//a[@href > "https://us.etrade.com/e/t/invest/story" $and$ @href < "https://us.etrade.com/e/t/invest/storz"]/@href

The list returned is every URL in the document that sorts between /e/t/invest/story and /e/t/invest/storz, which I find amusing, but it seemed to be the only way to filter out the links I didn't want. Unfortunately, I couldn't find a way to incorporate a regular expression into the XQL query.
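
The reason this works, presumably, is that the query compares the href strings lexicographically: every story URL begins with ".../story" followed by more characters, so it sorts after ".../story" and before ".../storz", since "y" comes just before "z". The same test in perl, over a hypothetical list of hrefs, would be:

for my $href (@hrefs) {
    print "$href\n"
        if $href gt "https://us.etrade.com/e/t/invest/story"
       and $href lt "https://us.etrade.com/e/t/invest/storz";
}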

Next, the returned URLs are looped over, and each page is downloaded with a new loadXMLNode and queried with queryXMLNode. In this case, I used the proprietary WQL language, which uses something similar to a DHTML document model to identify elements. The query I used is based on the location of the story that was identified earlier (second TD element of the first nested table of the document's third table). The query is:
doc.html[0].body[0].table[2].tr[0].td[0].table[0].tr[0].td[1].b|br|pre|p[].src

This goes to the document's third table (table[2]), then to the first cell and the nested table therein, and finally to that table's second TD element (td[1]). From there, the query filters on a few specific tags and returns their HTML source (.src). The tags chosen filter out the horizontal rules and nested tables that sometimes occur prior to a story: only text preceded by a <b>, <br>, <pre>, or <p> tag is returned. The result is an array of matching strings.

Lastly, the returned strings are appended to the output array of strings, which ends up corresponding to the same output lines as the CGI shell script.

A good understanding of this solution requires understanding how webMethods works, how the query languages work, what the HTML source of the pages in question looks like, and what the basic instructions from above were attempting to do. Much like the shell script, the solution has elegance, but is built on top of the hard work of many different people writing the underlying modules.

Unlike the shell solution, the webMethods solution is cross-platform as-is, requiring only the purchase of the webMethods suite, and an administrative staff to support it.
