Strawman - Data Type Record Scope and Serialization

Author: Daan Broeder

Date: 14 Jan, 2014

Sorry Giridhar, it passed me by without registering. I will have a look later this week.
G.
Daan
On 14 Jan 2014, at 21:54, "gmanepalli" <***@***.***> wrote:
Any comments or feedback from the group on the uploaded document? I know you are all contemplating internally, but sharing your thoughts would be good. :-)
--
Full post: https://rd-alliance.org/strawman-data-type-record-scope-and-serializatio...
Manage my subscriptions: https://rd-alliance.org/mailinglist
Stop emails for this post: https://rd-alliance.org/mailinglist/unsubscribe/1133

Author: Simon Cox

Date: 15 Jan, 2014

FYI - The link to the document metadata is
https://www.rd-alliance.org/filedepot?fid=367
and a direct link to download the document appears to be
https://www.rd-alliance.org/index.php?q=filedepot_download/386/367
Simon

Author: Giridhar Manepalli

Date: 20 Jan, 2014

(Daan submitted his comments inline in the document, uploaded here: https://rd-alliance.org/filedepot?fid=376)

Thanks Daan for your comments.

It appears most of Daan's comments revolve around the optional fields/extensions. In my strawman, I suggested that we support the notion of type extensions, in that more nuanced types can be based on existing types. The document further states that all mandatory fields should be captured within base types, and all optional fields should be registered as 'extensions' to those base types. I still stick to my original suggestion. However, I'm now leaning towards not defining what those extensions could or should be. Originally, I said those extensions could be about data encoding, semantics, or service/processing. After thinking about this for the umpteenth time, and after reading Daan's comments, it appears we shouldn't define what those extensions could be. Individual communities will register extensions however they prefer and require. Over a period of time, if we find common patterns in those extensions, then, and only then, should we classify the various extensions that are registered.

If this suggestion is agreeable, what this means is that as a DTR working group we only need to agree on what the mandatory fields of a base type are going to be, and that the type registry spec. should support the notion of extensions (where such extensions allow recording open-ended fields).

Right now, the mandatory list looks like this:

1. a unique and potentially resolvable identifier (assigned by a type registry),
2. human-readable description(s) of the data type,
3. conceptual details of the type (e.g., a weather dataset has time, location, and temperature details regardless of how those pieces are encoded), and
4. provenance information (as in who created the type record).

I would amend #1 and say

1. a unique and potentially resolvable identifier (assigned by a type registry or assigned by a user - varies across registry instances)

Daan suggested we do not make #3 mandatory. I think that is fair in that we do not always have 'properties' in mind when we think of a type. However, my concern is the value proposition of registering a type with just an ID, human description, and provenance. Is there enough value in recording just those three fields in a type registry?
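To make the record shape discussed above concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the class name `TypeRecord`, the field names, the `missing_mandatory` helper, and the example identifier are all hypothetical and not taken from the strawman. The conceptual `properties` (item 3) are left optional, following Daan's suggestion, and `extensions` is deliberately an open-ended mapping.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TypeRecord:
    """Hypothetical base type record; all field names are illustrative."""
    identifier: str    # item 1: assigned by the registry or by the user
    description: str   # item 2: human-readable description
    provenance: str    # item 4: e.g. who created the type record
    # Item 3 (conceptual details), treated as optional in this sketch.
    properties: List[str] = field(default_factory=list)
    # Open-ended, community-defined extension fields.
    extensions: Dict[str, Any] = field(default_factory=dict)

    def missing_mandatory(self) -> List[str]:
        """Return the names of mandatory fields that are empty."""
        required = ("identifier", "description", "provenance")
        return [name for name in required if not getattr(self, name)]


weather = TypeRecord(
    identifier="example/weather-1",  # made-up identifier
    description="Weather observations",
    provenance="created by an example user",
    properties=["time", "location", "temperature"],
    extensions={"encoding": "CSV"},  # communities decide what goes here
)
```

A record carrying only the three remaining mandatory fields would still pass `missing_mandatory()`, which is exactly the case whose value is being questioned above.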

Giridhar

Author: Norman Paskin

Date: 21 Jan, 2014

GM wrote: “all mandatory fields should be captured within base types, and all optional fields should be registered as 'extensions' to those base types ... we shouldn't define what those extensions could be. Individual communities will register extensions however they prefer and require. Over a period of time, if we find common patterns to those extensions, then and only then we shall classify the various extensions that are registered”
I agree with all these points. Some comments:
1. It seems to me that suggestions for such common patterns could also come from the communities using the types. Therefore it would be helpful to provide in the registry (perhaps as annotations) as much explanation of use and design of the type as possible, or links to examples of use. Another reason for doing this is (I suggest) to encourage the re-use of types or compatible development of new ones, as in my next point.
2. The classification of free extensions of types raises a challenge which I think we are all aware of but which is worth a little thought. If someone develops a data type for e.g. “author” and someone else is looking for a data type for e.g. “contributor”, then clearly there will be some potential for overlap: so they may want to either (a) reuse the type “author” instead of developing a new one; or (b) develop a new type “contributor” which shares the “author” data model or at least avoids conflicting use of common elements. This will be useful if those two types are subsequently used together in an application, i.e. in data integration.
If I read the proposal correctly, the data model for the extensions is essentially to be uncontrolled. The advantage of not having a common or controlled data model for all extensions is a lower barrier to entry and stimulation of use; the disadvantage is that without an underlying common data model, misunderstandings and misapplications of existing types could creep in, and conflicting data models might develop, hindering data integration. Both approaches have their uses, but we should be clear which approach we want to follow.
In an area I am more familiar with (content management), this issue of classification of data from disparate sources has been encountered before: e.g. in the Dublin Core approach of open “DC extensions”, contrasted with the DOI approach of controlled extensions of a kernel (where extensions are known to, if not managed centrally by, the DOI Foundation using the indecs-based data model to ensure compatibility). In the area of rights management (somewhat distant from RDA, but it shares some principles with defining types), data integration from disparate sources is currently a very active area: e.g. the Rights Data Integration project http://www.rdi-project.org/, based on the Linked Content Coalition principles http://www.linkedcontentcoalition.org/. [A declaration of interest: I was heavily involved in the Linked Content Coalition work and am currently involved in planning its ongoing activities].
Norman

Author: Simon Cox

Date: 24 Jan, 2014

Some more considerations:
1. Nested types
The issue of ‘nested’ and ‘aggregated’ types is a regular challenge.
For example, WaterML2 is a specialization of OMXML, which is a GML encoding of an observation model. GML is an XML application, and OMXML is a GML application, and WaterML2 is an OMXML application. WaterML2 data is an XML document using elements from a variety of namespaces, but conforming to the GML pattern which uses a very specific approach of object/property interleaving (like RDF/XML) and also uses xlinks for cross-referencing and turning a tree into a directed graph.
Such a dataset can be accessed by a generic XML client, but with more precision progressively by a GML client, an OMXML client, and a WaterML2 client. For transport it is often gzipped. The mime-type is application/gzip. When unzipped it is application/xml, or application/gml+xml (not yet accepted by IANA). A user should also be told that it uses the following XML namespaces:
http://www.opengis.net/gml/3.2
http://www.opengis.net/om/2.0
http://www.opengis.net/waterml/2.0
http://www.w3.org/1999/xlink
Note that WaterML2 is recommended by the EU for any environmental time-series data, so it is not even necessarily ‘water’. So this is largely about encoding, and hence related to mime-types, but (a) the IANA registration process is not agile enough for research data purposes (I was hoping that the DTR WG would develop a viable alternative to IANA registration), and (b) we must support registration of aggregate or nested types, i.e. a single identifier for an aggregate type.
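The layering described in this item can be sketched as a chain of "specializes" links. This is only a rough illustration: the mapping `SPECIALIZES` and the functions `ancestry` and `can_read` are hypothetical names, and the sketch ignores the gzip transport wrapper and the namespace list.

```python
from typing import Dict, List, Optional

# Hypothetical "specializes" links between registered encodings, following
# the WaterML2 example above. Each key points at the more generic type it
# is an application of; None marks the most generic type.
SPECIALIZES: Dict[str, Optional[str]] = {
    "XML": None,
    "GML": "XML",
    "OMXML": "GML",
    "WaterML2": "OMXML",
}


def ancestry(type_name: str) -> List[str]:
    """Return the type followed by each more generic type it specializes."""
    chain: List[str] = []
    current: Optional[str] = type_name
    while current is not None:
        chain.append(current)
        current = SPECIALIZES[current]
    return chain


def can_read(client_type: str, data_type: str) -> bool:
    """A client for a generic type can read any specialization of that type."""
    return client_type in ancestry(data_type)
```

With such links a single identifier ("WaterML2") is enough for a client to discover that a GML or plain XML client can also process the data, with less precision.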
2. The focus is on ‘conceptual’ typing.
But which concepts? For example, weather data is usually also geospatial data. From the point of view of consuming applications (like visualization) the dimensionality may be more important than the content: time-series, geospatial imagery, and classified maps are each conceptualizations on a different kind of axis. Most time-series can be visualized the same way, regardless of whether they represent water level or stock price. I’m all for promoting semantics, but I’m not sure we would all agree on what the key semantics of a dataset are!
3. Versions
The notion of ‘version’ is predicated on a single parent. Safer to register every item as a sibling of every other item, but also to allow relationships between items. One of the relationships can be ‘supersedes’ (with the inverse ‘supersededBy’) but its cardinality can be >1. During registration you might use a naming convention that hints at versioning, but the actual relationship to other entries, including 'previous versions' should be explicit.
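A minimal sketch of this idea, assuming made-up identifiers and a simple list of (subject, predicate, object) triples; the helper name `superseded_by` is hypothetical:

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Every item is registered as a sibling; versioning is expressed only
# through explicit relationships. Identifiers are made up for illustration.
items: Set[str] = {"type-a", "type-b", "type-c"}

# (subject, predicate, object): type-c supersedes both type-a and type-b,
# so the cardinality of 'supersedes' is greater than one here.
relations: List[Tuple[str, str, str]] = [
    ("type-c", "supersedes", "type-a"),
    ("type-c", "supersedes", "type-b"),
]


def superseded_by(rels: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Derive the inverse 'supersededBy' relationship from 'supersedes'."""
    inverse: Dict[str, List[str]] = defaultdict(list)
    for subject, predicate, obj in rels:
        if predicate == "supersedes":
            inverse[obj].append(subject)
    return dict(inverse)
```

Nothing in the identifiers themselves implies an order; the merge of two earlier entries into one is recoverable only from the explicit relationships.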
4. Worked example
The composite type has the same identifier as the base record. That seems to prevent alternative extensions to the same base.
Probably not what was intended.
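One possible way around this, sketched under assumptions: give each extension its own identifier and let it point at its base. The `registry` dictionary, the `extends` field, and the function names below are hypothetical, not taken from the worked example.

```python
from typing import Dict, List, Optional

# Hypothetical registry: each record has its own identifier; an extension
# refers to its base through 'extends' instead of reusing the base identifier.
registry: Dict[str, Dict[str, Optional[str]]] = {}


def register(identifier: str, extends: Optional[str] = None) -> None:
    """Register a record, rejecting reused identifiers and unknown bases."""
    if identifier in registry:
        raise ValueError(f"identifier already registered: {identifier}")
    if extends is not None and extends not in registry:
        raise KeyError(f"unknown base type: {extends}")
    registry[identifier] = {"extends": extends}


def extensions_of(base: str) -> List[str]:
    """List the identifiers of all records that extend the given base."""
    return sorted(i for i, rec in registry.items() if rec["extends"] == base)


register("base-1")
register("ext-1a", extends="base-1")  # two alternative extensions
register("ext-1b", extends="base-1")  # of the same base can coexist
```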
Simon

Author: Larry Lannom

Date: 25 Jan, 2014

Thanks to Daan, Norman, and Simon for the thoughtful and useful comments on the straw man. Giridhar is working on v2 of that document which should show up in a few days. But I wanted to ask if the approach we used in round 1 (upload a document to the file repository and then post a msg pointing to it and collect comments in responses to that post) was effective or whether we should consider a different approach. A few of you saw my message to the Secretariat asking about options and we were told that 1) sub-folders could be made in the file repository so that, e.g., there could be a sub-folder for this document with versions inside, although you would still have to come out here to see comments unless we all used Word track changes or similar and 2) the wiki feature had comments and a diff function that could be turned on, so we could just try that. Any views?

Author: Giridhar Manepalli

Date: 28 Jan, 2014

The straw man as you know posed a lot of questions and gave background information re. data types. Based on the thoughtful comments from Daan, Norman, Simon, Larry, and Christophe, I produced another document on this topic: https://rd-alliance.org/filedepot?fid=383

You will see that this document follows a Q&A format. A Q&A format seems natural to me for discussing this topic, at least because I had all those questions in mind. I do not think of this document as a second version to the straw man, but as a document that captures the group's current thinking.

I think we are close to forming a decent data model for the first version of the data type registry. One more round of (quick) feedback would be useful before a type registry prototype is released. Thanks.

Author: Stephen Richard

Date: 04 Feb, 2014

I’ve reworked the Q&A document (DataTypeRecordScope.docx) into a more declarative statement of the intention of the data type scheme and its applications. See what you think…
https://rd-alliance.org/filedepot_download/386/398
Sorry I won’t be able to make the meeting in Dublin; this is an interesting and important topic.
steve
Stephen M Richard
Arizona Geological Survey
416 W. Congress #100
Tucson, AZ
AZGS: 520-770-3500
Office: 520-209-4127
FAX: 520-770-3505