Schemas, Content Models, Templates, and Interoperability

The following example demonstrates the importance of interoperability at the data level.

Two ornithologists are recording data describing birds. Tables 3 and 4 contain their results.

Specimen ID   Species       Wingspan   Bill Length
2312          C. Cristata   34 cm      2.2 cm
2313          C. Cristata   35 cm      2.5 cm
2314          C. Cristata   37 cm      2.3 cm
2315          C. Cristata   34.5 cm    2.2 cm
2316          C. Cristata   36 cm      2.1 cm
2317          C. Cristata   36 cm      2.4 cm

Table 3: Sample data describing blue jays

Specimen_Identifier   Sp.                Wingspan   Beak_Length   Units
g:12                  Larus argentatus   48         1.1           in
g:13                  Larus argentatus   55         1             in
g:14                  Larus argentatus   54         1.15          in
g:15                  Larus argentatus   49         1.2           in
g:16                  Larus argentatus   59         1.12          in

Table 4: Sample data describing seagulls

Both tables record the same kind of data: each matches identified individuals of a given bird species with corresponding measurements of beak length and wingspan.

But these tables are not interoperable, because they use different schemas. In Table 3, specimens are listed under the Specimen ID field; in Table 4, they are listed under the Specimen_Identifier field. In Table 3, the Species field name is spelled out, but the genus is abbreviated in each entry; in Table 4, the field name is abbreviated as Sp., but the genus is spelled out in each entry. Likewise, in Table 3 the measurement units for wingspan and beak length are given in the same field as the measurements themselves; in Table 4, the units are listed in a separate Units field, and the units themselves differ (centimeters in Table 3, inches in Table 4). Finally, in Table 3 beak measurements are listed under the Bill Length field; in Table 4, they are listed under the Beak_Length field.

All of this might not seem like much to a human operator. People can often understand and account for such discrepancies by drawing on experience and intuition (though schematic differences can occasionally baffle even human operators). A computer, by contrast, cannot understand that a beak can also be called a bill, or that species can be abbreviated sp., unless it is specifically instructed to do so. Nor can a computer interpret units of measure on its own: to a computer, there is no difference between 57 inches and 57 centimeters until it has been told how the two units relate.
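As a minimal sketch of what such instruction could look like, the Python below defines an alias table that tells a program which field names from Tables 3 and 4 mean the same thing. The names FIELD_ALIASES and normalize_fields, and the canonical field names, are hypothetical and chosen only for illustration; they are not part of any AASG tool.

```python
# Hypothetical alias table: the explicit "instruction" a program needs before
# it can treat differently named fields as the same thing.
FIELD_ALIASES = {
    "Specimen ID": "specimen",
    "Specimen_Identifier": "specimen",
    "Species": "species",
    "Sp.": "species",
    "Wingspan": "wingspan",
    "Bill Length": "beak_length",
    "Beak_Length": "beak_length",
    "Units": "units",
}


def normalize_fields(record):
    """Rename every field to its canonical name; unknown names raise KeyError."""
    return {FIELD_ALIASES[name]: value for name, value in record.items()}


# One row from each table, keyed by the original field names.
table3_row = {"Specimen ID": "2312", "Species": "C. Cristata",
              "Wingspan": "34 cm", "Bill Length": "2.2 cm"}
table4_row = {"Specimen_Identifier": "g:12", "Sp.": "Larus argentatus",
              "Wingspan": "48", "Beak_Length": "1.1", "Units": "in"}

# Field names the two rows have in common before and after normalization.
shared_before = set(table3_row) & set(table4_row)
shared_after = set(normalize_fields(table3_row)) & set(normalize_fields(table4_row))
```

Before normalization the two rows share only the Wingspan field name; afterward they share all four canonical fields. The units problem remains, since renaming fields does nothing to reconcile centimeters with inches.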

In order to make data sharing as easy as possible for both people and computers, AASG Geothermal Data uses content models to ensure that data providers submit their data the same way. This is not to say that the content models used by the Geothermal Data project are perfect. Rather, these content models represent a practical compromise between machine readability, human readability, and the demands of the data.

To continue the example above, imagine that the Ornithology Society has decided to adopt the following schema for data describing birds:

Field                   Description
Specimen                Unique identifier for each bird measured
Species                 Genus and species of specimen; provide full names for both genus and species
Wingspan (mm)           Measurement of specimen from wingtip to wingtip, in millimeters (mm)
Proboscis Length (mm)   Measurement of specimen beak or bill length, in millimeters (mm)

Table 5: A content model for an ornithological schema
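A content model like this can also be expressed in code, so that records are checked as they are created. The following is only a sketch under stated assumptions: the class name BirdRecord, its attribute names, and the specific validation rules (for example, treating any name part ending in a period as an abbreviation) are illustrative choices, not something the content model itself prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BirdRecord:
    """One row of the ornithological content model (hypothetical class)."""
    specimen: str                 # unique identifier for each bird measured
    species: str                  # full genus and species names
    wingspan_mm: float            # wingtip to wingtip, in millimeters
    proboscis_length_mm: float    # beak or bill length, in millimeters

    def __post_init__(self):
        if not self.specimen:
            raise ValueError("specimen identifier is required")
        # Assumed rule: at least two words, none abbreviated with a period.
        parts = self.species.split()
        if len(parts) < 2 or any(part.endswith(".") for part in parts):
            raise ValueError(
                f"species must give full genus and species: {self.species!r}"
            )
        if self.wingspan_mm <= 0 or self.proboscis_length_mm <= 0:
            raise ValueError("measurements must be positive, in millimeters")
```

With this sketch, BirdRecord("2312", "Cyanocitta cristata", 340, 22) is accepted, while BirdRecord("2312", "C. Cristata", 340, 22) raises ValueError because the genus is abbreviated.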

When fitted to the above schema, the data from Table 3 and Table 4 would appear as follows:

Specimen   Species               Wingspan (mm)   Proboscis Length (mm)
2312       Cyanocitta cristata   340             22
2313       Cyanocitta cristata   350             25
2314       Cyanocitta cristata   370             23
2315       Cyanocitta cristata   345             22
2316       Cyanocitta cristata   360             21
2317       Cyanocitta cristata   360             24
g:12       Larus argentatus      1219.2          27.94
g:13       Larus argentatus      1397            25.4
g:14       Larus argentatus      1371.6          29.21
g:15       Larus argentatus      1244.6          30.48
g:16       Larus argentatus      1498.6          28.45

Table 6: The data from Tables 3 and 4 fitted to the schema in Table 5

Because both data sets now use the same schema, the data from Tables 3 and 4 is interoperable and can be compared directly in a machine-friendly way, without operator interpretation.
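The mapping step itself can be sketched in a few lines of Python. This is an illustration rather than a prescribed procedure: the function names, the FULL_NAMES lookup used to expand the abbreviated genus, and the choice to round to two decimal places are all assumptions. The conversion factors are the standard ones (1 cm = 10 mm, 1 in = 25.4 mm exactly).

```python
# Millimeters per source unit.
MM_PER_UNIT = {"mm": 1.0, "cm": 10.0, "in": 25.4}

# Hypothetical lookup for expanding abbreviated names found in Table 3.
FULL_NAMES = {"C. Cristata": "Cyanocitta cristata"}


def to_mm(value, unit):
    """Convert a measurement to millimeters, rounded to two decimals."""
    return round(float(value) * MM_PER_UNIT[unit], 2)


def from_table3(row):
    # Table 3 keeps value and unit in one field, e.g. "34 cm".
    wing_value, wing_unit = row["Wingspan"].split()
    bill_value, bill_unit = row["Bill Length"].split()
    return {
        "Specimen": row["Specimen ID"],
        "Species": FULL_NAMES.get(row["Species"], row["Species"]),
        "Wingspan (mm)": to_mm(wing_value, wing_unit),
        "Proboscis Length (mm)": to_mm(bill_value, bill_unit),
    }


def from_table4(row):
    # Table 4 keeps the unit in a separate Units field shared by both values.
    unit = row["Units"]
    return {
        "Specimen": row["Specimen_Identifier"],
        "Species": row["Sp."],
        "Wingspan (mm)": to_mm(row["Wingspan"], unit),
        "Proboscis Length (mm)": to_mm(row["Beak_Length"], unit),
    }


jay = from_table3({"Specimen ID": "2312", "Species": "C. Cristata",
                   "Wingspan": "34 cm", "Bill Length": "2.2 cm"})
gull = from_table4({"Specimen_Identifier": "g:12", "Sp.": "Larus argentatus",
                    "Wingspan": "48", "Beak_Length": "1.1", "Units": "in"})
```

Each source table needs its own small adapter, but once the adapters exist, every record comes out with the same four fields and the same unit.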

Note that the measurements for both C. cristata and L. argentatus are in millimeters. This streamlines the process of searching the database, because neither the search engine nor the user needs to convert results when searching for bird specimens by measurement. The only downside is a certain degree of overhead on the part of data providers: in order to map their data to the schema, providers must convert their measurements to millimeters.
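To illustrate why a single unit simplifies searching, the sketch below filters records by wingspan using one numeric range. The function name find_by_wingspan and the in-memory list of records are assumptions made for the example; a real catalog would query a database instead. The sample rows are the blue jay measurements expressed in millimeters.

```python
# Blue jay records already expressed in the shared schema (millimeters).
records = [
    {"Specimen": "2312", "Species": "Cyanocitta cristata", "Wingspan (mm)": 340},
    {"Specimen": "2313", "Species": "Cyanocitta cristata", "Wingspan (mm)": 350},
    {"Specimen": "2314", "Species": "Cyanocitta cristata", "Wingspan (mm)": 370},
    {"Specimen": "2315", "Species": "Cyanocitta cristata", "Wingspan (mm)": 345},
    {"Specimen": "2316", "Species": "Cyanocitta cristata", "Wingspan (mm)": 360},
    {"Specimen": "2317", "Species": "Cyanocitta cristata", "Wingspan (mm)": 360},
]


def find_by_wingspan(rows, min_mm, max_mm):
    """Return specimen identifiers whose wingspan lies in [min_mm, max_mm]."""
    return [row["Specimen"] for row in rows
            if min_mm <= row["Wingspan (mm)"] <= max_mm]


matches = find_by_wingspan(records, 345, 360)
```

Because every row stores millimeters, the comparison needs no per-row unit check; the same function would work unchanged on records contributed by any provider that follows the schema.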

Note also that the specimen identifiers use different formats. Though mixing formats might run counter to best practice, it does not prevent interoperability, as long as each identifier is unique.
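Uniqueness is easy to verify mechanically, whatever format the identifiers take. A minimal sketch, assuming identifiers are handled as plain strings (the function name duplicate_identifiers is hypothetical):

```python
from collections import Counter


def duplicate_identifiers(identifiers):
    """Return a sorted list of identifiers that occur more than once."""
    counts = Counter(identifiers)
    return sorted(ident for ident, n in counts.items() if n > 1)


# Identifiers in two different formats, as in the combined data set.
combined = ["2312", "2313", "2314", "2315", "2316", "2317",
            "g:12", "g:13", "g:14", "g:15", "g:16"]
duplicates = duplicate_identifiers(combined)
```

An empty result means the combined data set can be merged safely; any identifier that appears in the result would need to be renamed by its provider before submission.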