Array Database Assessment Recommendations
Array Database Assessment Working Group
Recommendation Title: Array Databases: Concepts, Standards, Implementations
Authors: Peter Baumann¹, Dimitar Misev¹, Vlad Merticariu¹, Bang Pham Huu¹, Brennan Bell¹, Kwo-Sen Kuo²
¹ Jacobs University, Large-Scale Scientific Information Systems Research Group, Bremen, Germany
² Bayesics, LLC / NASA, USA
Contributors: RDA Array Database Assessment Working Group members
Executive Summary
Multi-dimensional arrays (also known as raster data or gridded data) play a core role in many, if not all, science and engineering domains, where they typically represent spatio-temporal sensor, image, simulation output, or statistics “datacubes”. However, as classic database technology does not support arrays adequately, such data today are maintained mostly in silo solutions, with architectures that tend to erode and have difficulty keeping up with increasing requirements on service quality.
Array Database systems attempt to close this gap by providing declarative query support for flexible ad-hoc analytics on large n-D arrays, similar to what SQL offers on set-oriented data, XQuery on hierarchical data, and SPARQL or Cypher on graph data. Today, Petascale Array Database installations exist, employing massive parallelism and distributed processing. Hence, questions arise about the technology and standards available, usability, and overall maturity.
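As an illustration of such declarative array queries, here is a minimal sketch that sends an OGC WCPS query (one of the array standards the report covers) to a server over plain HTTP; the endpoint URL, coverage name, and band names are illustrative assumptions, not taken from the report:

```python
# Minimal sketch: a declarative OGC WCPS array query, evaluated entirely
# server-side. The endpoint, coverage name ("S2_cube") and band names are
# hypothetical placeholders.
import requests

# Mean NDVI over the whole coverage; the (float) cast avoids integer division.
wcps_query = """
for c in (S2_cube)
return avg( (c.nir - c.red) / (float)(c.nir + c.red) )
"""

response = requests.get(
    "https://example.org/rasdaman/ows",  # hypothetical service endpoint
    params={
        "service": "WCS",
        "version": "2.0.1",
        "request": "ProcessCoverages",
        "query": wcps_query,
    },
)
print(response.text)  # a single scalar comes back; no array data are downloaded
```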
To elicit the state of the art in Array Databases, the Research Data Alliance (RDA) has established the Array Database Assessment Working Group (ADA:WG) as a spin-off from the Big Data Interest Group. Between September 2016 and March 2018, the ADA:WG compiled an introduction to Array Database technology, a comparison of Array Database systems and related technology, a list of pertinent standards with tutorials, and comparative benchmarks, to essentially answer the question: how can data scientists and engineers benefit from Array Database technology?
An investigation of altogether 19 systems shows that there is a lively ecosystem of technology with increasing uptake, and that proven array analytics standards are in place. The tools, though, vary greatly in functionality, performance, and maturity. On one end of the spectrum we find Petascale-proven systems which parallelize across 1,000+ cloud nodes; on the other end, some systems appear as lab prototypes which still have to find their way into large-scale practice. In comparison to other array services (MapReduce-type systems, command line tools, libraries, etc.), Array Databases can excel in aspects like service friendliness to both users and administrators, standards adherence, and often performance. As it turns out, Array Databases can offer significant advantages in terms of flexibility, functionality, extensibility, as well as performance and scalability; in total, their approach of offering “datacubes” analysis-ready heralds a new level of service quality. Consequently, they have to be considered a serious option for “Big DataCube” services in science, engineering, and beyond.
The outcome of this investigation, a unique compilation and in-depth analysis of the state of the art in Array Databases, is intended to provide beneficial insight for both technologists and decision makers considering “Big Array Data” services in both academic and industrial environments.
Attachment | Size
---|---
Array-Database-Assessment-WG_report.xml | 3.77 KB
Array-Databases_abstract.pdf | 489.98 KB
RDA_ArrayDatabaseAssessment_Recommendations_Maintenance_note.pdf | 32.93 KB
Array-Databases_final-report.pdf | 1.87 MB
Author: Rainer Stotzka
Date: 12 Apr, 2018
(The thoughts I am describing in this comment reflect my personal opinions as an RDA member.)
Type of output
Dear members of the RDA WG Array Database Assessment,
Thank you very much for your report on Array Databases. Array databases, their technologies and implementations are an important topic for RDA and data sharing.
The report describes the state-of-the-art and compares various systems systematically from various perspectives. It shows a snapshot of the current situation and concludes with the need for further research.
I am not an expert in array databases, but I have the feeling that the report depicts a scientific study, and I would recommend it as excellent reading material for newcomers to this field. Considering the types of RDA outputs (https://rd-alliance.org/recommendations-outputs), I would label the report as a “supporting output” or “other output” rather than an “RDA recommendation”.
Author: Rainer Stotzka
Date: 21 Apr, 2018
(The thoughts I am describing in this comment reflect my personal opinions as an RDA member.)
Consensus
The research field of array databases is very narrow, making it hard to bring together expertise from various continents to RDA. This was also reflected in the very low participation in the last plenary meetings and in the email communications of the WG.
The authors of the report consist of five researchers from Jacobs University and one from Bayesics, LLC / NASA. It seems that at least three authors are not RDA members at all.
To my knowledge, we don’t have a clear definition in RDA of how and when consensus is reached that is sufficient for a balanced RDA output.
I would feel more comfortable if the report listed a few more authors from other locations, who ideally also contributed to the development of a variety of array DB systems.
Author: Lesley Wyborn
Date: 17 Apr, 2018
I have read the Array Database Assessment Working Group final report. It is a very good summary of the current state of play in Big Data analytical systems and will provide reference material to anyone new to the field, and even to more experienced people. It notes that there is a lively ecosystem of technologies available, and it is one of the more comprehensive reviews, covering 19 of the systems that are available.
However, I feel that there are some issues in this report that need to be clarified.
In view of the issues raised (some of which have also been raised by Rainer), I feel that this report needs more revision and more exposure to the RDA Array Database Assessment Working Group, as well as to the groups whose systems have been reviewed in this report. In addition, there are typographic errors that need to be addressed.
Author: Peter Baumann
Date: 17 Apr, 2018
Dear all,
thank you for your detailed feedback, which allows me to respond to several items (disclaimer: I am only talking for myself here). As I need to fit this into a full agenda I will do it piecemeal and - apologies - likely with some time delay. So this post is the first of a series, thereby trying to disentangle the discussion.
First, consensus: as Rainer states, rules about consensus seem nonexistent in RDA at this time. That's fine; building up such an organization is always a stepwise process, as I know from my own experience. Hence, this may be a good occasion to initiate a discussion so that rules can be agreed for the future, so as to close this gap.
However, it should be a matter of fairness not to apply rules retroactively - first letting people invest substantial work for 1.5 years, watching closely, and then at submission time telling them "that's not what we want".
RDA is very much carried by volunteers, and this precious resource should not be wasted.
cheers,
Peter
Author: Peter Baumann
Date: 17 Apr, 2018
back again. It turns out that this was our fault, and as coordinating author I feel I am to blame in the first place: the findings should have been phrased as recommendations, syntactically. What the report should express is, in a quick & likely dirty shot:
1 - For services on massive multi-dimensional arrays ("datacubes"), it is recommended to use array databases - they have proven mature and scalable to Petabytes, and further offer the advantage of "any query, any time" flexibility through their query languages.
2 - For the decision on a particular system, various aspects are relevant, including functionality, standards conformance, flexibility, scalability, performance. It is recommended to make a weighted decision based on the information provided in this report, rather than looking at any one criterion in isolation.
3 - As tuning can make a significant difference in performance, it is recommended to use the tuning parameters of array databases, based on the listing for the systems in this report, together with the further literature referenced.
4 - Due to the remarkable variety of datacube interfaces found, it is recommended to base services on open standards so as to avoid vendor lock-in.
5 - Array services are trending under the keyword "datacubes", hence the landscape of tools is developing quickly. It is recommended to watch it continuously, and also to extend the benchmarks which, for resource reasons, necessarily could not cover all tools.
best,
Peter
Author: Peter Baumann
Date: 17 Apr, 2018
Lesley, concerning adoption: given that the core question, as per the Charter, was "can Array Databases be used?", adoption obviously means: Array Databases are used. This is what the report collects. Of course there are zillions of services with whatever solution, but that was out of scope as per the Charter.
What's your point against research projects using Array Databases? I guess in the Research Data Alliance we mainly rely on those in our work.
You write "By this logic anyone who installs rasdaman (or any of the other 18 Array Database Systems reveiwed) could be interpreted as an ‘adopter’ of this RDA ‘recommendation’." Absolutely! Any large-scale installation of any Array DBMS, in conjunction of this report, is a proof of concept for usability. Again, see the Chater where this has been stated clearly: to seek real-life, large-scale installations.
-Peter
Author: Peter Baumann
Date: 17 Apr, 2018
Participation:
The Charter was published widely. We had plenaries with open discussions. We have 40 members who have subscribed willingly, thereby expressing interest. So there was ample opportunity to contribute and/or review. Of course, there is always more that can be done, but (i) volunteer resources are unfortunately limited, and (ii) I believed that RDA itself would spread the word - which did not happen, as I learnt only later.
Those complaining about missing participation I would cordially invite to set a shining example by implementing it - engagement is the fuel of RDA. A few of us have taken action, and IMHO my co-contributors deserve to have their work acknowledged by both activists and spectators.
So next time let's get all hands in for a joint endeavour!
-Peter
Author: Peter Baumann
Date: 17 Apr, 2018
Support. RDA claims to offer an environment supportive of scientists. Unfortunately, this is not always the case to the extent desirable. Examples include:
- the Wiki is not configured correctly; it makes collaboration difficult. We nevertheless used the wiki until the very last phase of translation to Word/PDF, despite these difficulties. My various requests to the maintainers remained unheard, unfortunately.
- I submitted the report confident that RDA would take all necessary steps, including informing relevant audiences. As I learnt yesterday, this has not been done.
Of course, we can do it all ourselves - in theory. In practice, we have resource constraints. Now I will send out an email to those people who have expressed interest by joining this WG plus the Big Data IG. But it is not entirely satisfying that RDA misses important tasks and we get blamed for it.
-Peter
Author: Peter Baumann
Date: 17 Apr, 2018
Lesley, you write: Further, as noted on the Array Database Assessment Working Group WIKI, “the wiki pages have been copied into an MS-Word document to produce the final PDF formatted result, and the wiki pages below are obsoleted.” This makes it very hard for members of the Working Group who were unable to attend RDA Berlin to have been able to contribute to the report, prior to its submission to council for review.
If you read on, you find the report uploaded and accessible on that page, so I fail to see how someone could not contribute. Further, the Wiki was available for 1.5 years (!) for contributions - we nevertheless did it the hard way, through a misconfigured Wiki, to be open for any and all contributions.
-Peter
Author: Peter Baumann
Date: 17 Apr, 2018
Lesley, you observe that we wrote about 3 systems where it was 4. Indeed, this is a mistake (in fact, a coordination issue); I take responsibility and will fix it. To be exact: 3 Array DBMSs (rasdaman, PostGIS Raster, SciDB) + 1 related tool (Open Data Cube, not an Array DBMS) = 4 systems have been benchmarked.
-Peter
Author: Ben Evans
Date: 18 Apr, 2018
A few comments on the document.
The document is quite useful and an interesting read, due to some detailed work to capture an interesting survey perspective on a class of datacube-style systems. I use the word Survey because I think it's a better description at the moment than Recommendation. The document asserts that arrays are motivated as the solution to a wide range of problems, but it's not clear that arrays equal a solution to the scientific problems. It's hard to be definitive about this, since the other software makers would need to have commented on their approach.
Even though the technologies reviewed have some similarities, it comes through that it's not a uniform landscape. Unfortunately it's not so clear what independent client software is using these array standards as an interface, or how broadly. It could be because there is no well-known client software taking this approach, though some may be doing bits of it. I also can't see the case for interoperability based on standards without it.
I also suggest that the benchmark results are interesting, but it's not easy to be convinced by them. This area is *hard* work, and I think it's really beyond what should be expected of this document. However, I think this should be re-cast as just a proposal saying "here is a first go at a test methodology" for arrays, and then the document could go on to describe that better. I don't easily see the relationship to Big Data problems based on the results, so it's not as interesting as it first seems. The results themselves, especially trying to compare all the different solutions, unfortunately can't be easily cited without more work and resolving some of the ambiguities.
Anyway, I would like to see some way that the document is resolved into something without needing wholesale rewrites. It's a substantial contribution and effort to bring this to light, and it helps to ask clearer questions about the nature of the various datacube approaches being used and perhaps where the field is going.
Author: Simon Cox
Date: 20 Apr, 2018
Peter -
There is no question that engagement was enabled, to the extent to which the RDA infrastructure allowed. While the absence of other contributions might be taken to signify consent, it might also show lack of time or interest, and definitely does not satisfy realistic expectations as evidence of consensus. I agree that RDA's procedures do not provide an explicit threshold or mechanism to demonstrate consensus. But where 5/6 of the authors are from one research team, and the one who isn't is tagged on as the last author, it does not make a compelling case.
I strongly agree with the other commenters that this is a significant piece of work, and should be published. But not as a 'Recommendation'.
Author: David Gavin
Date: 19 Apr, 2018
Hello, I am the technical lead for Digital Earth Australia, a principal contributor to the Open Data Cube (ODC) software and initiative. First off, I would like to thank you for your work within this working group in raising awareness of Array Database concepts, and for the inclusion of ODC within your study. As your paper indicates, ODC is not an Array Database, but a Python-based scripting interface which maps user queries via its API onto the respective datasets residing on file systems, with the help of a relational database, returning the resulting geodata as Python xarrays. There are three core paradigms of ODC which we would appreciate being reflected in this paper:
- The focus on providing a scalable platform for scientific work across multiple compute platforms, ranging from desktop to cloud to super-computer workloads.
- The ability to index and access data without the need for ingestion. Ingestion is a data transform step to reformat from the source format to a custodian-managed format or a compute-optimised format. Indexing creates the necessary database records and retains the source format;
- The Python environment was chosen for its wide applicability in the science community. This allows users and developers to connect to additional data analysis libraries and develop new features.
We understand that our existing documentation does not convey ODC’s ability to access and work with un-ingested data and seems to imply that data ingestion is mandatory when it is in fact an optional step.
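To make this concrete, here is a minimal sketch of the access pattern described above (the product name, extents, and band names are placeholders, not taken from the paper or our documentation):

```python
# Minimal sketch of the ODC access pattern: the API resolves a query
# against the index and returns an xarray Dataset; only the requested
# spatio-temporal window is read from the underlying files.
import datacube

dc = datacube.Datacube(app="rda-example")

data = dc.load(
    product="ls8_example_product",      # hypothetical product name
    x=(149.0, 149.2),                   # longitude range
    y=(-35.4, -35.2),                   # latitude range
    time=("2018-01-01", "2018-03-31"),
    measurements=["red", "nir"],
)

# Downstream analysis stays in the Python/xarray ecosystem:
ndvi = (data.nir - data.red) / (data.nir + data.red)
```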
As we are always seeking to improve both the performance and the impact of our software, we are keen to understand and ultimately replicate the benchmarks that you have performed as part of this paper. To that end, we would deeply appreciate it if your paper could include:
- The size, format and internal structure of the data/area of interest you included as part of your benchmark;
- As ODC typically stores data on file systems or within cloud object stores, details around the type of filesystem and underlying storage hardware used;
- An appendix including the exact Python scripts used for each of the tests, as well as the versions of any additional Python modules used;
- Correct and consistent references to the OpenDataCube codebase (https://github.com/opendatacube) and documentation (https://datacube-core.readthedocs.io/en/latest/).
I would like to make myself and other members of the ODC community available to you if there is any assistance or further detail we can provide.
Author: Sandro Fiore
Date: 09 May, 2018
Dear Peter, all,
In the following, my comments/review of the current draft.
1) Page 6: support: While -> support. While
2) Page 8, Figure 2: why don’t we report until 2018 instead of 2012? There have been additional efforts over the last few years that could be represented. I would also change the caption to better reflect this.
3) Page 8: the placeholder “Section X” appears several times in the text.
4) Page 13: Before the sentence “For example, in directional tiling ratios….” please add: “Ophidia [XX] provides a two-level hierarchical storage model [YY] relying on data partitioning and distribution”.
[XX]: Ophidia website http://ophidia.cmcc.it/
[YY]: Sandro Fiore, Alessandro D'Anca, Cosimo Palazzo, Ian T. Foster, Dean N. Williams, Giovanni Aloisio: Ophidia: Toward Big Data Analytics for eScience. ICCS 2013: 2376-2385
5) Page 16: Such technology can e implemented … -> Such technology can be implemented …
6) Page 21: I believe it would be of interest to add OGC WPS. In the climate change context, OGC WPS is being heavily exploited to provide an interoperable server-side interface to datacube processing facilities.
7) Page 22: Bi Data -> Big Data (?)
8) Page 22: “Please observe Etiquette (see bottom).” -> ???
9) Page 23, Array tools section: “they do not accept queries via Internet, but rather require being logged in on the server machine for executing shell commands (ex: Ophidia)”.
1) This is not correct, as Ophidia provides a (multi-threaded) server-side, parallel approach to data analytics. Users can ‘send’ declarative commands over the network without any need to be logged in on the server machine (see the sketch at the end of this comment). Moreover, 2) I would suggest moving Ophidia into the “Array Database systems” category, as it represents a full-stack solution for n-dimensional arrays (it also supports “query language, multi-user operation, storage, management, and access control mechanisms”, page 22).
10) Page 29: Please rename the section “Ophidia” as “Ophidia/ECAS”.
11) Page 29: Please add at the end of the Ophidia section:
“Ophidia is the core component of the ENES Climate Analytics Service (ECAS), a Thematic Service of the H2020 European Open Science Cloud Hub (EOSC-Hub) project.”
12) Page 29: The template is not clear to me. If, besides website and source code, publications can also be added, I would suggest adding a short sub-section:
Publications:
More scientific and technical details about Ophidia can be found in [a], [b], [c], [d]. A more complete list can be found at http://ophidia.cmcc.it/overview/ (bottom of the page).
Please list the following papers in the references section (pages 70-73):
[a] S. Fiore, C. Palazzo, A. D’Anca, I. T. Foster, D. N. Williams, G. Aloisio, “A big data analytics framework for scientific data management”, IEEE BigData Conference 2013: 1-8
[b] M. Plociennik, S. Fiore, G. Donvito, M. Owsiak, M. Fargetta, R. Barbera, R. Bruno, E. Giorgio, D. N. Williams, and G. Aloisio, “Two-level Dynamic Workflow Orchestration in the INDIGO DataCloud for Large-scale, Climate Change Data Analytics Experiments”, International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA. Procedia Computer Science, vol. 80, 2016, pp. 722-733
[c] S. Fiore, M. Plóciennik, C. M. Doutriaux, C. Palazzo, J. Boutte, T. Zok, D. Elia, M. Owsiak, A. D’Anca, Z. Shaheen, R. Bruno, M. Fargetta, M. Caballer, G. Moltó, I. Blanquer, R. Barbera, M. David, G. Donvito, D. N. Williams, V. Anantharaj, D. Salomoni, G. Aloisio, “Distributed and cloud-based multi-model analytics experiments on large volumes of climate change data in the earth system grid federation eco-system”. In Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 2016. p. 2911-2918.
[d] D. Elia, S. Fiore, A. D’Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “An in-memory based framework for scientific data analytics”. In Proceedings of the ACM International Conference on Computing Frontiers (CF ’16), May 16-19, 2016, Como, Italy, pp. 424-429
13) Page 32: You can add the following instance for Ophidia:
OphidiaLab@CMCC: an Ophidia big data instance is available at CMCC for production, training, and test purposes. After registration, you can get access to an Ophidia cluster with about 5 fat nodes (100 cores, 1.3 TB of memory, 50 TB of storage) - https://ophidialab.cmcc.it/web/home.html
14) Page 36: What about metadata support? Provenance?
I would suggest adding them as well, considering the relevance they have in some scientific domains (e.g. climate and weather).
15) Page 37: I have suggested moving Ophidia into the Array DBMS category. If this is OK, I can provide the input for the table in section 7.2.2. Please let me know.
16) Page 48, Section 7.3.1: I would suggest adding “in-memory” analytics among the optimizations, as it represents the enabling feature for ‘fast’ data analytics.
17) Page 50: Same as comment #15, for the table in section 7.3.2. Please let me know.
18) Page 51: According to my previous comments, Ophidia should be removed from the Array-tools table.
By the way, in the Array-tools table there are some errors (Ophidia supports partitioning) and incomplete information (caching is supported too).
19) Page 55: Same as comment #15, for the table in section 7.4.2. Please let me know.
20) Page 54: Under Processing and parallelism, I would also add “workflow support”.
21) Page 56: Most of the info reported for Ophidia in the ArrayTools table, section 7.4.2, is not correct (the architecture is “full-stack”, partitioning is supported and managed, tiles can be on separate nodes, and the system fully supports processing on existing archives). By the way, as commented before, my suggestion would be to move Ophidia from the array tools to the full-stack solutions.
22) I believe a section on real use cases from different domains (as discussed at the RDA WG:ADA meeting in Barcelona) could have provided a nice and useful end-user perspective on array databases, which I find somewhat missing in the report.
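To illustrate the remote, declarative access described in point 9 above, a minimal sketch of a typical PyOphidia session follows (host, credentials, file path, and variable name are of course placeholders):

```python
# Minimal sketch: declarative Ophidia commands sent over the network via
# the PyOphidia client; no shell login on the server is required.
# Server address, credentials, NetCDF path and variable are placeholders.
from PyOphidia import cube

cube.Cube.setclient(username="user", password="passwd",
                    server="ophidia.example.org", port="11732")

# Import a NetCDF variable into a datacube, then reduce it server-side.
mycube = cube.Cube(src_path="/data/tasmax.nc", measure="tasmax",
                   imp_dim="time")
maxcube = mycube.reduce(operation="max")
maxcube.exportnc2(output_path="/tmp")  # export the result back to NetCDF
```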
Author: Sandro Fiore
Date: 09 May, 2018
Noticing that some of the assessments made for Ophidia in the Section 7 tables were not correct (see my previous comment here), I believe the document should also state more clearly whether the info reported in the assessment tables in Section 7 was provided, or at least validated, for each software solution by one (or more) representative(s) of the associated software development team.