Data Acquisition

Sportwire uses plug-ins for selecting the input feed and for pre-processing the input documents. Input feeds are selected through the wirefeeder.feedclass property; the feed class controls the connection to the feed and how to partition the feed into discrete story blocks. To free the input pipe as quickly as possible (and avoid any backlog in the input buffers), feed XML is queued as text blocks (one per story), with a number of worker threads set on the queue to parse, pre-process and store the documents. The overall process is illustrated in Figure 1.
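The feed selection might look like the following in a Java properties file; wirefeeder.feedclass is the property named above, while the package path is a placeholder:

```properties
# Select the input feed plug-in (package path is a placeholder)
wirefeeder.feedclass=com.sportwire.feeds.TSNFeed
```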

Figure 1. Data acquisition flow diagram illustrating the transformations from ASCII/XML to website presentation.

Input feed classes must provide a Reader over each single story. As each story is read, the text XML is tagged with an ID handle (used for debug reporting, most often derived from the DTD) and paired with a document handler before being queued. Document handlers allow alternatives in how the individual documents are parsed and entered into the database and let us ‘pipeline’ the processing. For example, the ToNewsMLFilter applies an XSL transform to the input feed and returns a new NewsML document to the queue, JDOMToFile renders the document XML out to a file (the filename extracted from the content-attribute XPath specified in the config file), and the XMLDBMSDocHandler selects an XMLDBMS Map schema for object-relational mapping and stores the document to the database.
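A minimal sketch of this story/handler pairing follows; the interface and class names here are assumptions for illustration, not the actual Sportwire API:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical sketch of the story/handler pairing; the interface and
// class names are assumptions, not the actual Sportwire API.
interface DocHandler {
    void handle(String id, String storyXml);
}

class QueuedStory {
    final String id;          // debug handle, often derived from the DTD
    final Reader story;       // Reader over one story's XML text
    final DocHandler handler; // how this document should be processed

    QueuedStory(String id, Reader story, DocHandler handler) {
        this.id = id;
        this.story = story;
        this.handler = handler;
    }

    // A queue worker would drain the Reader and invoke the handler.
    void process() {
        try {
            StringBuilder xml = new StringBuilder();
            char[] buf = new char[4096];
            for (int n; (n = story.read(buf)) != -1; )
                xml.append(buf, 0, n);
            handler.handle(id, xml.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```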

Data acquisition must trap incoming text streams with zero loss, and translate the feed XML to the standard NewsML format used in the backend system. Here's how the process works:

To remain flexible to new document types, the transformation from vendor-feed to NewsML is left to the XSL stylesheets. Ensuring the integrity of the transform result is the responsibility of the XSL author.

Although we have a Java-based representation of a feed-specific Document object at stage 3, that module has no knowledge of feed-specific rules or structure, so any modifications made through Java calls would have to be generic. At stage 5, however, we know we have a NewsML object, so we could inspect it, extend it or correct any flagged fields, for example to change league names or look up numerical ID values.
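A stage-5 fix-up might look like the following sketch; the element name and the league-name normalization are illustrative assumptions, and the standard W3C DOM stands in for JDOM:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch of a stage-5 fix-up on the known NewsML structure; the
// <league> element and the name mapping are illustrative assumptions.
class NewsMLFixup {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new java.io.ByteArrayInputStream(xml.getBytes("UTF-8")));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static void normalizeLeagues(Document doc) {
        NodeList leagues = doc.getElementsByTagName("league");
        for (int i = 0; i < leagues.getLength(); i++) {
            Node league = leagues.item(i);
            if ("NHL Hockey".equals(league.getTextContent()))
                league.setTextContent("NHL"); // map vendor name to canonical form
        }
    }
}
```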

The input feed requires preprocessing before it can be parsed using the Java tools[1]; parsing at this stage must be done using brute-force methods such as perl regular expressions.
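Such a brute-force pass might, for instance, escape bare ampersands that would otherwise break the parser; a sketch follows (in Java rather than perl), where the regular expression is an assumption about the kind of repairs applied:

```java
// Sketch of a brute-force pre-processing fix-up: escape bare ampersands
// that are not already part of an entity reference.
class FeedScrubber {
    static String escapeBareAmpersands(String line) {
        // '&' not followed by a named (&amp;) or numeric (&#38;) entity
        return line.replaceAll("&(?![A-Za-z]+;|#[0-9]+;)", "&amp;");
    }
}
```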

The illustration in Figure 2 shows the initial read process for the SportsNetwork TSNFeed feed handler:

  1. The connection is managed by an instance of the ORO TelnetClient (1) set with a timeout watch thread (2) to abort the feed process if the connection is lost (WireFeeder will attempt to reconnect).

  2. Each line read is added to a storybuffer; element contents are escaped to protect against common HTML entities.

  3. “Long” element text is most often preformatted ASCII text. These lines are collected in a CDATA buffer; when the reader detects the end of the element, the CDATA buffer is sent to an external txt2html™ process and the resulting HTML added to the storybuffer, protected with CDATA tags.

  4. When the feed reader detects the “end of story” tag, storybuffer is returned to the WireFeeder as a StringReader.
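The read loop in steps 2 and 4 above can be sketched as follows; the end-of-story tag name is an assumption, and the escaping of step 2 and the txt2html/CDATA handling of step 3 are omitted for brevity:

```java
import java.io.StringReader;

// Condensed sketch of the feed read loop. The end-of-story tag is an
// assumption; entity escaping and CDATA handling are left out.
class StoryReader {
    private static final String END_OF_STORY = "</story>"; // assumed tag
    private final StringBuilder storyBuffer = new StringBuilder();

    // Feed one line from the wire; returns a Reader over the completed
    // story when the end-of-story tag is seen, null otherwise.
    StringReader feedLine(String line) {
        storyBuffer.append(line).append('\n');
        if (!line.contains(END_OF_STORY))
            return null;
        String story = storyBuffer.toString();
        storyBuffer.setLength(0); // reset for the next story
        return new StringReader(story);
    }
}
```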

Vendor format XML entered in the document queue is paired with a document handler; in the case of the SportsNetwork feed, we have paired this with the ToNewsMLFilter handler which will transform the vendor format into the standard IPTC SportsML format.

The illustration in Figure 3 shows the transformation process:

  1. The received XML is parsed into a JDOM instance.

  2. The doctag (the systemID) is used to locate an XSL file which is applied to the JDOM; the resulting JDOM is then re-queued paired with the JDOMToFile handler.

  3. JDOMToFile extracts a filename from the XML object and serializes the object out to that file.
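Step 2 of this transformation can be sketched with the standard JAXP transform API; Sportwire itself works on JDOM instances and locates stylesheets on disk by systemID, but here the lookup is reduced to an in-memory map and plain strings:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

// Sketch of the systemID-to-XSL transform step; in Sportwire the
// stylesheet would be a file located by the doctag, not a map entry.
class ToNewsML {
    static final Map<String, String> XSL_BY_DOCTYPE = new HashMap<>();

    static String transform(String systemId, String inputXml) {
        String xsl = XSL_BY_DOCTYPE.get(systemId);
        if (xsl == null)
            throw new IllegalArgumentException("no stylesheet for " + systemId);
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(inputXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (javax.xml.transform.TransformerException e) {
            throw new RuntimeException(e);
        }
    }
}
```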

Adding support for a new XML document type requires creating the table(s) in the RDBMS (for document handlers requiring database storage), an XSLT stylesheet transforming the document to NewsML, and a Map file corresponding to the generated NewsML DocType system identifier. On the webserver side, database result sets may be obtained using standard access technologies (JDBC, ODBC &c) or by calling the XMLDBMS DBMStoDOM to retrieve the NewsML as a JDOM or W3C Document object.

News messages arrive in bursts; in the FTP delivery, or when stories are added at the command line, documents may arrive as fast as the program can read them. Parsing and database mapping, on the other hand, take significant time; just parsing one XML file through Xerces takes more time than reading 18 consecutive stories through stdin.

Sportwire uses a task-queue/worker system to buffer input; incoming character streams are segmented into document objects that are added to a thread-safe FIFO queue, and a fixed number of worker threads read this queue and process the documents.

The DocQueue is type-agnostic; the queue is a singleton object (to reduce any risk of collisions or resource-runaway conditions) which queues DocQueueElement items, and each worker thread simply pops the next item from the queue and passes it to the DocHandler. In the case of the XMLDBMSDocHandler, the objects are assumed to have the XML contents available through the toString() method; document handlers may use whatever methods they wish to extract the XML.
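A minimal sketch of this queue/worker arrangement follows, using java.util.concurrent types as stand-ins for the actual implementation (the real DocQueue holds DocQueueElement items; plain strings are used here):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch of the task-queue/worker pattern: a thread-safe FIFO fed by
// the reader, drained by a fixed pool of workers.
class SketchDocQueue {
    private static final String POISON = new String(); // unique sentinel
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ExecutorService workers;
    private final int nWorkers;

    SketchDocQueue(int nWorkers, Consumer<String> docHandler) {
        this.nWorkers = nWorkers;
        this.workers = Executors.newFixedThreadPool(nWorkers);
        for (int i = 0; i < nWorkers; i++) {
            workers.submit(() -> {
                try {
                    for (String doc = queue.take(); doc != POISON; doc = queue.take())
                        docHandler.accept(doc); // parse, pre-process, store
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    void add(String docXml) {
        queue.add(docXml);
    }

    // Finish all pending documents, then stop; this drain step is why a
    // clean shutdown can take minutes under a backlog.
    void drainAndExit() {
        for (int i = 0; i < nWorkers; i++)
            queue.add(POISON); // one end-marker per worker
        workers.shutdown();
        try {
            workers.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```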

After the end of the input, or on a critical failure in the input parsing, Sportwire alerts the DocQueue to complete the processing of all pending documents and then exit; for this reason, shutdown after a SIGTERM signal may take a few minutes, while SIGKILL will cause all pending documents to be lost.

Sportwire is only concerned with the parsing and archiving of XML data streams. In many applications, the actual interface may be via FTP or through the Sportsticker dedicated line. In these situations, data is pre-processed through a chain of applications before entering the Sportwire system.



[1] The vendors claim their XML will parse, and SportsNetwork claims (in private emails from BJ) that their feed is generated with XML tools, yet the feed contains no charset declaration, and entities and attributes often contain ampersands (&), backquotes and other characters not allowed by the Xalan/Xerces suite of XML parsing tools. Whether this is a flaw in their XML-generation tools or a flaw in the Apache parsers is unknown; clarification is welcome.