Schemas, Content Models, Templates, and Interoperability
The following example will demonstrate the importance of interoperability at the data level.
Two ornithologists are recording data describing birds. Tables 3 and 4 contain their results.
Specimen ID | Species | Wingspan | Bill Length |
2312 | C. Cristata | 34 cm | 2.2 cm |
2313 | C. Cristata | 35 cm | 2.5 cm |
2314 | C. Cristata | 37 cm | 2.3 cm |
2315 | C. Cristata | 34.5 cm | 2.2 cm |
2316 | C. Cristata | 36 cm | 2.1 cm |
2317 | C. Cristata | 36 cm | 2.4 cm |
Table 3: Sample data describing blue jays
Specimen_Identifier | Sp. | Wingspan | Beak_Length | Units |
g:12 | Larus argentatus | 48 | 1.1 | in |
g:13 | Larus argentatus | 55 | 1 | in |
g:14 | Larus argentatus | 54 | 1.15 | in |
g:15 | Larus argentatus | 49 | 1.2 | in |
g:16 | Larus argentatus | 59 | 1.12 | in |
Table 4: Sample data describing seagulls
Both tables record the same kind of data: each matches identified individuals of a given bird species with corresponding measurements of wingspan and beak length.
But these tables are not interoperable because they use different schemas. In Table 3, specimens are listed under the Specimen ID field; in Table 4, specimens are listed under the Specimen_Identifier field. In Table 3, the field name Species is spelled out, but the genus is abbreviated in each entry; in Table 4, the field name is abbreviated as Sp., but the genus is spelled out in each entry. Likewise, in Table 3, the units for wingspan and beak measurements are provided in the same field as the measurements themselves; in Table 4, the units are listed in a separate Units field. Finally, in Table 3, beak measurements are listed under the Bill Length field; in Table 4, they are listed under the Beak_Length field.
All of this might not seem like much to a human operator. People can often understand and account for such discrepancies by drawing on experience and intuition (though schematic differences can occasionally baffle even human operators). A computer, by contrast, cannot understand that a beak can also be called a bill, or that species can be abbreviated sp., unless it is specifically instructed to do so. Nor can a computer interpret units of measure on its own: to a computer, there is no difference between 57 inches and 57 centimeters unless it has been instructed to distinguish them.
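The point can be illustrated with a minimal Python sketch. The variable names are illustrative assumptions; only the column headers come from Tables 3 and 4. A program that matches fields by name finds a single column in common, and a bare number carries no unit at all.

```python
# Column headers exactly as they appear in Tables 3 and 4.
table3_fields = ["Specimen ID", "Species", "Wingspan", "Bill Length"]
table4_fields = ["Specimen_Identifier", "Sp.", "Wingspan", "Beak_Length", "Units"]

# Matching fields by exact name finds only one column in common.
shared = set(table3_fields) & set(table4_fields)
print(shared)  # {'Wingspan'}

# A bare number has no unit attached: 57 (inches) and 57 (centimeters)
# are the same value as far as the program is concerned.
wingspan_in_inches = 57
wingspan_in_centimeters = 57
print(wingspan_in_inches == wingspan_in_centimeters)  # True
```

Nothing in the data tells the program that Bill Length and Beak_Length describe the same measurement; that knowledge has to be supplied from outside, for example by a shared schema.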
In order to make data sharing as easy as possible for both people and computers, AASG Geothermal Data uses content models to ensure that data providers submit their data the same way. This is not to say that the content models used by the Geothermal Data project are perfect. Rather, these content models represent a practical compromise between machine readability, human readability, and the demands of the data.
To continue the example above, imagine that the Ornithology Society has decided to adopt the following schema for data describing birds:
Specimen | Species | Wingspan (mm) | Proboscis Length (mm) |
Unique identifier for each bird measured | Genus and species of specimen; provide full names for both genus and species | Measurement of specimen from wingtip to wingtip, in millimeters (mm) | Measurement of specimen beak or bill length, in millimeters (mm) |
Table 5: A content model for an ornithological schema
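One way to make a content model like Table 5 machine-checkable is to express it as a typed record. The following Python sketch is an illustration only: the class name `BirdRecord`, the attribute names, and the validation rules are assumptions, not part of any AASG content model. The species check in particular is a crude heuristic that treats any two-word name without abbreviating periods as a full genus and species.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BirdRecord:
    """Hypothetical record type mirroring the four fields in Table 5."""
    specimen: str                # unique identifier for each bird measured
    species: str                 # full genus and species names
    wingspan_mm: float           # wingtip to wingtip, in millimeters
    proboscis_length_mm: float   # beak or bill length, in millimeters

    def __post_init__(self):
        if not self.specimen:
            raise ValueError("specimen identifier is required")
        # Crude check (an assumption): two words, neither abbreviated.
        parts = self.species.split()
        if len(parts) != 2 or any(p.endswith(".") for p in parts):
            raise ValueError(f"species must give full genus and species: {self.species!r}")
        if self.wingspan_mm <= 0 or self.proboscis_length_mm <= 0:
            raise ValueError("measurements must be positive numbers in millimeters")


# A record that follows the content model is accepted.
record = BirdRecord("2312", "Cyanocitta cristata", 340, 22)

# A record with an abbreviated genus is rejected.
try:
    BirdRecord("2312", "C. Cristata", 340, 22)
except ValueError as error:
    print(error)
```

Encoding the units in the field names, as Table 5 does in its column headers, means that no separate units field has to be interpreted when records are compared.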
When fitted to the above schema, the data from Table 3 and Table 4 would appear as follows:
Specimen | Species | Wingspan (mm) | Proboscis Length (mm) |
2312 | Cyanocitta cristata | 340 | 22 |
2313 | Cyanocitta cristata | 350 | 25 |
2314 | Cyanocitta cristata | 370 | 23 |
2315 | Cyanocitta cristata | 345 | 22 |
2316 | Cyanocitta cristata | 360 | 21 |
2317 | Cyanocitta cristata | 360 | 24 |
g:12 | Larus argentatus | 1219.2 | 27.94 |
g:13 | Larus argentatus | 1397 | 25.4 |
g:14 | Larus argentatus | 1371.6 | 29.21 |
g:15 | Larus argentatus | 1244.6 | 30.48 |
g:16 | Larus argentatus | 1498.6 | 28.45 |
Table 6: The data from Tables 3 and 4 fitted to the schema in Table 5
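The mapping from each source table to the shared schema can be sketched in Python as follows. The function names, the `FULL_NAMES` lookup, and the dictionary-based row format are assumptions made for illustration. The sketch converts units with the exact factors of 10 mm per centimeter and 25.4 mm per inch, and rounds results to two decimal places.

```python
# Millimeters per source unit (1 in = 25.4 mm exactly).
MM_PER_UNIT = {"mm": 1.0, "cm": 10.0, "in": 25.4}

# Hypothetical lookup that expands abbreviated names used in Table 3.
FULL_NAMES = {"C. Cristata": "Cyanocitta cristata"}


def to_mm(value, unit):
    """Convert a measurement to millimeters, rounded to two decimals."""
    return round(float(value) * MM_PER_UNIT[unit], 2)


def from_table3(row):
    """Map a Table 3 row, where units share a field with the value."""
    wingspan, wingspan_unit = row["Wingspan"].split()
    bill, bill_unit = row["Bill Length"].split()
    return {
        "Specimen": row["Specimen ID"],
        "Species": FULL_NAMES.get(row["Species"], row["Species"]),
        "Wingspan (mm)": to_mm(wingspan, wingspan_unit),
        "Proboscis Length (mm)": to_mm(bill, bill_unit),
    }


def from_table4(row):
    """Map a Table 4 row, where units are held in a separate field."""
    return {
        "Specimen": row["Specimen_Identifier"],
        "Species": row["Sp."],
        "Wingspan (mm)": to_mm(row["Wingspan"], row["Units"]),
        "Proboscis Length (mm)": to_mm(row["Beak_Length"], row["Units"]),
    }


jay = from_table3({"Specimen ID": "2312", "Species": "C. Cristata",
                   "Wingspan": "34 cm", "Bill Length": "2.2 cm"})
gull = from_table4({"Specimen_Identifier": "g:12", "Sp.": "Larus argentatus",
                    "Wingspan": "48", "Beak_Length": "1.1", "Units": "in"})
print(jay)
print(gull)
```

Each source table needs its own mapping function, but once the rows are mapped, every downstream tool can rely on a single set of field names and a single unit.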
Because both data sets now use the same schema, the data from Tables 3 and 4 is interoperable and can be compared directly in a machine-friendly way, without operator interpretation.
Note that the measurements for both C. cristata and L. argentatus are in millimeters. This streamlines the process of searching the database, because neither the search engine nor the user needs to convert results when searching for bird specimens by measurement. The only downside is a certain degree of overhead for data providers: in order to map their data to the schema, they must convert their measurements to millimeters.
Note also that the specimen identifiers use different formats. Though this might run counter to best practice, it does not prevent interoperability, as long as each identifier is unique.
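The uniqueness requirement is easy to check mechanically. The following Python sketch uses the specimen identifiers from Table 6; the function name `duplicate_identifiers` is a hypothetical helper introduced for illustration.

```python
from collections import Counter


def duplicate_identifiers(identifiers):
    """Return a sorted list of identifiers that occur more than once."""
    counts = Counter(identifiers)
    return sorted(name for name, count in counts.items() if count > 1)


# Specimen identifiers from Table 6, in their two different formats.
specimens = ["2312", "2313", "2314", "2315", "2316", "2317",
             "g:12", "g:13", "g:14", "g:15", "g:16"]

print(duplicate_identifiers(specimens))  # []
```

The check treats identifiers as opaque strings, so mixed formats such as 2312 and g:12 cause no difficulty as long as no value is repeated.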