XML format for DataSet object creation
(The following is an extract from the Solo_Predictor
user's manual)
This appendix describes the XML format used to construct
a DataSet object. The DataSet
object is a container for scientific data which permits
the storage of numerical values (known as the data) along with
the typical associated contextual information.
For the purposes of this appendix, it is important
to note that the DataSet object allows the inclusion of one
or more sets of textual labels to be associated with each column
and/or row of a data matrix. In addition, numerical “axis scale”
values can also be associated with each column or row.
The object is considerably more flexible than this
document describes. For additional information on the fields
and structure of the DataSet object (DSO), see the object’s
documentation on the web: http://software.eigenvector.com/DataSet/
By convention in Solo and PLS_Toolbox, each
row of a data table is considered a “sample” (or “observation”)
and the columns of a data table are the variables measured on
each sample. Thus, to create a typical DataSet object (DSO)
which can be used to make a prediction, a DSO will be created
around a single row of values. An XML construct of a DSO will,
therefore, always contain at least a tag to describe the data.
Numerical values for an XML construct of a
DSO are given in comma-separated and semicolon-separated format.
Commas separate values on the same row of a matrix (item,item,item);
semicolons indicate row breaks (row; row; row). White space
is always ignored.
The basic XML DSO construct consists of the
outer object tag with a “class” attribute indicating that the
object is a DSO. There are actually two formats for creating
a DataSet object. One uses class="dataset" and is a complete
and complicated description of a DataSet object. The other uses
class="dso" and is much easier to create. We recommend class="dso"
for most applications and will not discuss class="dataset" in
this document.
The outer tag must contain a <data> tag,
which will always have the class="numeric" attribute (because
data will always be numeric).
<obj class="dso">
  <data class="numeric"> 1,2,3,4,5 </data>
</obj>
This XML construct would create a simple DSO
containing the values 1 to 5 in a row vector. If the DSO being
created should contain multiple rows, a semicolon should be
used after each row of numbers. Note that all rows must contain the same number of elements.
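As an illustration of the multi-row form, the following sketch (with made-up values) describes a DSO holding two samples of five variables each. The line break after the semicolon is only for readability, since white space is ignored:

```xml
<obj class="dso">
  <data class="numeric">
    1,2,3,4,5;
    6,7,8,9,10
  </data>
</obj>
```

Both rows contain five elements, as required.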
In most cases, it is desirable to associate
some contextual information regarding the variables which are
being passed to the predictor. This is often expressed as either
textual labels, indicating the measured parameter (often giving
the name or purpose of the device: “thermocouple A”) or numeric
axis scale values (often used in spectroscopy, electrochemistry,
time-based measurements, etc.). These contextual data will be
used by Solo_Predictor to help align new data to a model, verify
that the new data has all the expected variables, and replace
those which are missing.
To include labels in a DSO, an additional <label>
tag must be added to the XML description. The label tag can
contain one or more label “sets” each enclosed in a <set>
tag. Each set contains three elements: mode, name, and content.
The mode tag indicates the data mode (1=rows/samples, 2 = columns/variables)
for which the label set is being defined. The name tag is optional
but, if present, indicates a name to associate with this label
set. The content tag defines the actual labels for each element on the given mode. Each label
must be enclosed in its own separate <sr> tag and there
must be an appropriate number of tags for the number of columns
or rows (whichever mode the labels are being associated with).
For example, the following creates the labels “A” through “E”
for the five columns of our example data:
<obj class="dso">
  <data class="numeric"> 1,2,3,4,5 </data>
  <label>
    <set>
      <mode>2</mode>
      <name class="string">example variable labels</name>
      <content class="string">
        <sr>A</sr>
        <sr>B</sr>
        <sr>C</sr>
        <sr>D</sr>
        <sr>E</sr>
      </content>
    </set>
  </label>
</obj>
A label can be added for the sample (first mode) by including
an additional <set> tag inside the <label> tag (before or after
the <set> tag already included above):
<set>
  <mode>1</mode>
  <name>example sample label</name>
  <content class="string">
    <sr>This is my one sample</sr>
  </content>
</set>
Although the <content>
tag uses the <sr> tags to
enclose the string, this is not necessary in this case. Any
time a single string value is being created, the <sr>
tags can be omitted as can the class attribute. Thus the content
tag could have read:
<content>This is my one sample</content>
Numeric axis scale values can be added using
an axisscale tag (note the tag name does not contain a space)
with similar content to the label tag. The only difference is
that the axisscale property expects numeric values, so the <content>
tag is defined with the class="numeric" attribute and the values are supplied as
a comma-separated list. The following defines an axisscale
for the variables running from 500 to 508 in steps of 2:
<obj class="dso">
  <data class="numeric"> 1,2,3,4,5 </data>
  <axisscale>
    <set>
      <mode>2</mode>
      <name>example axis scale</name>
      <content class="numeric">500,502,504,506,508</content>
    </set>
  </axisscale>
</obj>
As with labels, note that the number of items
defined in the content must match the length (number of elements)
of the given mode (columns in this example).
Most of the remaining DSO properties (fields)
can be set using similar calls. For example, classes and titles
(see the DataSet object documentation for more information on
these fields) can be added to the DSO using tags similar to
label and axisscale. Titles must have content of class="string"
and must contain a single string. Classes can have numeric or
string content and must have sufficient elements to match the
size of the given mode.
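A sketch of what this might look like for the five-column example is given below. The tag names <class> and <title> are assumed here from the DataSet field names and are not confirmed by this appendix, as is the use of a single-string <content> without <sr> tags for the title. The class values and names are made up. Check the DataSet object documentation before relying on any of this.

```xml
<!-- Assumed tag names: <class> and <title>, following the label/axisscale pattern.
     Remove this comment if the receiving parser does not accept XML comments. -->
<obj class="dso">
  <data class="numeric"> 1,2,3,4,5 </data>
  <class>
    <set>
      <mode>2</mode>
      <name>example variable classes</name>
      <content class="numeric">1,1,2,2,2</content>
    </set>
  </class>
  <title>
    <set>
      <mode>2</mode>
      <content class="string">Variables</content>
    </set>
  </title>
</obj>
```

The class content has five elements to match the five columns, while the title content is a single string.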
In addition, the include field uses the <set>
notation described above, while the author, name, description,
and userdata fields all use the single-tag notation (as with
the data tag, where the tag is named after the field, the class
attribute gives the content type, and the content is placed
within the tag). For example:
<obj class="dso">
  <data class="numeric"> 1,2,3,4,5 </data>
  <name class="string">Name for Dataset</name>
  <author class="string">Dataset\'s Author</author>
  <description class="string">
    <sr>Include a multi-line string here</sr>
    <sr>Use as many sr tags as you have lines</sr>
  </description>
</obj>
Note the use of the backslash in front of the
single quote included in the author tag. This is only necessary
when passing XML through the Solo_Predictor interface. When
XML is saved to a file, the backslash is not needed.
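For comparison, here is a sketch of the same author tag in its two forms, based only on the rule stated above. When the XML is passed through the Solo_Predictor interface, the single quote is preceded by a backslash:

```xml
<author class="string">Dataset\'s Author</author>
```

When the same XML is saved to a file, the quote is written as-is:

```xml
<author class="string">Dataset's Author</author>
```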
The following complete example combines the elements described above:
<obj class="dso">
  <data class="numeric">1,2,3,4,5</data>
  <name>Name for Dataset</name>
  <author>Dataset\'s Author</author>
  <description class="string">
    <sr>Include a multi-line string here</sr>
    <sr>Use as many sr tags as you have lines</sr>
  </description>
  <axisscale>
    <set>
      <mode>2</mode>
      <name>example axis scale</name>
      <content class="numeric">500,502,504,506,508</content>
    </set>
  </axisscale>
  <label>
    <set>
      <mode>2</mode>
      <name>example variable labels</name>
      <content class="string">
        <sr>A</sr>
        <sr>B</sr>
        <sr>C</sr>
        <sr>D</sr>
        <sr>E</sr>
      </content>
    </set>
    <set>
      <mode>1</mode>
      <name>example sample label</name>
      <content>This is my one sample</content>
    </set>
  </label>
</obj>
Please contact Eigenvector Research for more information
on the DSO XML format, if needed.