Collection requirements, streaming

16 Mar 2016

Dear all, Frederik,
attached is a new version of the draft requirements document with
updates made during and after the Tokyo plenary session. Not complete in
any sense, but a good start :-) There are still several things to think
about and discuss.
One item I find very interesting is Frederik's notion of a "stream view"
of collections (one possible model in the terms of our case statement).
So far, I understand this as putting a PID on a live broadcast video:
The collection grows over time, but only at the head, is strictly
ordered, and a typical access pattern is to receive continuous parts of
it. To fit into our current general model for collections, the items in
the stream may also have to be discrete and should not overlap; but we
may also think about items that are not discretely defined, which will
be a bit more complex. Is this what you had in mind? There are probably
several more aspects for this model we have to work out.
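
A minimal sketch of the stream view as described above, assuming discrete, non-overlapping items that each cover a half-open interval. The names `Item` and `StreamCollection` and the PID strings are hypothetical illustrations, not part of the draft requirements document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Item:
    """A discrete stream item covering the half-open interval [start, end)."""
    pid: str      # placeholder identifier string (assumption)
    start: int
    end: int


@dataclass
class StreamCollection:
    """Append-only, strictly ordered collection identified by a single PID."""
    pid: str
    items: List[Item] = field(default_factory=list)

    def append(self, item: Item) -> None:
        if item.end <= item.start:
            raise ValueError("item must cover a non-empty interval")
        # The collection grows only at the head, and items must not overlap.
        if self.items and item.start < self.items[-1].end:
            raise ValueError("item overlaps or precedes the current head")
        self.items.append(item)

    def read(self, start: int, end: int) -> List[Item]:
        """Return the ordered run of items that intersect [start, end)."""
        return [i for i in self.items if i.start < end and i.end > start]
```

Items that are not discretely defined would need a different representation; this sketch covers only the simpler discrete case.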
Best, Tobias
--
Tobias Weigel
Abteilung Datenmanagement
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45 a • 20146 Hamburg • Germany
Phone: +49 40 460094-104
Email: ***@***.***
URL: http://www.dkrz.de
ORCID: orcid.org/0000-0002-4040-0215
Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784

File Attachment: 
  • Author: Thomas Zastrow

    Date: 16 Mar, 2016

    From what I read in the doc, this "stream" concept seems to be provenance data?
    OwnCloud has an "activities" module that tracks all the
    things/changes you make (see attached screenshot).

    ATTACHMENT: activities.png (41.19 KB)

  • Author: Frederik Baumgardt

    Date: 16 Mar, 2016

    Yes to Tobias' question. And I think it might end up as provenance data, Thomas. The issue here is the definition of a PID.
    In the models that I’m familiar with, PIDs can only reference immutable objects and a stream would really be a set of relations between a series of static PID-referenced collections, e.g. realized as ‘parent’-properties.
    However, I wonder how the PID model for e.g. ORCID deals with the mutability of the personal information it references. Does it solve the issue with some sort of indirection, or are there two different concepts of identity at work here, semantic and structural, where semantic identity is preserved over structural changes and structural identity is not? In that case semantic identity would actually be an indirection, and I would be curious how that’s implemented (some constant properties in the referenced object?) and how it affects citability.
    Sorry if my lack of familiarity with the previous work shows here, I’m actually working through a couple of different PID specs at the moment.

  • Author: Tobias Weigel

    Date: 17 Mar, 2016

    Hello Frederik,
    to be precise: is a stream a set of relations between a series of static
    collections or static objects? I can imagine how both ways may be
    useful, but for me, they point to different models:
    A) Each object gets a PID. Each new object is related to its predecessor
    object through a relation. The relations together form a collection
    (with a dedicated PID).
    B) Each object gets a PID. Each state of the whole thing at a specific
    point in time gets a PID (forms a collection with objects as parents).
    Whenever a new state is introduced (for example, by adding a new object
    - but there can also be other changes!), a new static collection is
    formed and the new collection is related to the old with a relation,
    thereby creating a second hierarchical level on top that looks like (A)
    again.
    Obviously, B is more complex than A, but might be required for some use
    cases where there is a need to reference specific states. Regarding
    identity: Model A preserves semantic identity, but not structural
    identity. Model B preserves both (thus, the increased costs). Does this
    sound correct?
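
A rough sketch of how the two models might be represented, under the assumption that relations are stored as PID references on immutable records. `ObjectRecord`, `Snapshot`, `chain`, and `add_object` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ObjectRecord:
    """Model A: an object with a PID and a relation to its predecessor."""
    pid: str
    predecessor: Optional[str] = None


@dataclass(frozen=True)
class Snapshot:
    """Model B: a static collection capturing one state of the whole."""
    pid: str
    members: Tuple[str, ...]
    previous: Optional[str] = None   # relation to the older snapshot


def chain(records: Dict[str, ObjectRecord], head: str) -> List[str]:
    """Model A: follow predecessor relations from the newest object back,
    returning the object PIDs in oldest-first order."""
    out: List[str] = []
    cur: Optional[str] = head
    while cur is not None:
        out.append(cur)
        cur = records[cur].predecessor
    return out[::-1]


def add_object(prev: Optional[Snapshot], new_pid: str, obj_pid: str) -> Snapshot:
    """Model B: adding an object forms a new static snapshot that is
    related to the old one; the old snapshot is left unchanged."""
    members = (prev.members if prev else ()) + (obj_pid,)
    return Snapshot(new_pid, members, prev.pid if prev else None)
```

In this sketch the chain of `Snapshot.previous` relations has the same shape as the predecessor chain in Model A, which is the second hierarchical level mentioned above.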
    ORCID: I would assume that upon a change of personal information in
    ORCID, no new PID is formed - at least not an ORCID, as this would
    defeat the whole purpose. But ORCID might be a candidate for
    model A if there is enough end-user value.
    Best, Tobias

  • Author: Bridget Almas

    Date: 17 Mar, 2016

    Hi Tobias,
    Both would seem to be important to support. Model A represents a view on
    collections that I hadn't really thought about before. If I understand
    correctly, under this approach, the entire history of changes for an
    object is itself a collection? So a URI
    like orcid.org/0000-0001-7556-1572/history becomes a PID for the
    collection of records that make up the history of changes to the ORCID
    record?
    Best
    Bridget

  • Author: Tobias Weigel

    Date: 17 Mar, 2016

    Hi Bridget,
    yes, that is entirely possible. I think we can describe this through two
    further variations on A:
    A1) The objects do not have a shared identity, but together, they make
    up a whole (a bag of several apples).
    A2) Each object is a new iteration on the previous one, taking over some
    aspects of its identity (the ORCID history example or an svn trunk history).
    There are probably better ways to express this - we will need more
    precise model descriptions at some point.
    Best, Tobias

  • Author: Frederik Baumgardt

    Date: 17 Mar, 2016

    @Tobias: Do the diagrams I put up on the Wiki reflect your mental models? I wouldn’t expect it, so feel free to replace them or scribble on them. I’d also save some of this discussion there, but I’m not yet sure I have a full grasp of everybody’s conceptions.
    @Bridget: It’s my understanding that the history pointer in your example does not meet the criteria of a PID, in that its content is mutable. The same goes for the ‘latest’ pointer in a versioning system, or ‘HEAD’ in git. The commit IDs, on the other hand, would meet them.
    I have this vague idea that we can interface persistent and mutable data spaces with persistent traits, e.g. a typed PID on a typed object requires that certain properties are immutable, but others which are not part of the persistent data type can be mutable. Citing a PID-referenced datum is citing the immutable properties only. E.g. a person has a stable SSN and DOB, but their name and address could change. The person’s PID would reference the SSN and DOB, but not the name and address. You can access those dynamically once you’ve got access to the properties of the persistent trait. Does that make sense? I do think it’s outside the scope of the WG though? Have other WGs addressed this issue?
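
One possible reading of the persistent-trait idea, sketched under the assumption that an object carries two property sets and that a citation exposes only the immutable one. `TraitedObject`, `cite`, and the property values are hypothetical and do not come from any PID specification.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Dict, Mapping


@dataclass
class TraitedObject:
    pid: str
    persistent: Mapping[str, str]                 # part of the persistent type
    mutable: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Keep a read-only view of a private copy, so the persistent
        # properties cannot be edited through this mapping.
        self.persistent = MappingProxyType(dict(self.persistent))

    def cite(self) -> Dict[str, str]:
        """A citation covers the PID and the persistent properties only."""
        return {"pid": self.pid, **self.persistent}


person = TraitedObject(
    pid="pid:person-1",                            # placeholder PID
    persistent={"dob": "1970-01-01"},              # placeholder value
    mutable={"name": "A. Example", "address": "Old Street 1"},
)
person.mutable["address"] = "New Street 2"         # does not affect cite()
```

The mutable properties stay reachable through the object once it has been resolved, while `cite()` keeps returning the same content after they change.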
    Best,
    Frederik
