Text Mining at the 7th RDA Plenary

09 Mar 2016

By Piotr Przybyła

Blog by Piotr Przybyła, National Centre for Text Mining, University of Manchester - RDA Europe Plenary 7 Early Career Programme Winner

I participated in the 7th plenary meeting of RDA in Tokyo as part of the RDA Europe Early Career Programme. I work on natural language processing and text mining (TM) research, among other things in the OpenMinTeD (Open Mining Infrastructure for Text & Data) project. Participating in the plenary was a very satisfying experience, as it showed that the understanding and appreciation of TM in the data-driven science community is growing. Text mining is an exceptional field, as it requires the cooperation of two communities to succeed: researchers with knowledge of text processing, and experts in the domain to which the investigated text belongs. That domain could be a narrow biomedical issue (e.g. cancer research), but also a more general area in the humanities. That is why forums like the RDA plenary are important: they help us communicate with researchers in other fields and understand their needs. During the 7th plenary meeting, the OpenMinTeD project organised a Birds of a Feather session, aimed at sparking discussion on issues of data sharing in the area of text mining.

Open Science symposium

The RDA meeting was preceded by a data sharing symposium, “Data-driven Science – The trigger of Scientific development”, a place of vibrant discussion of the challenges and opportunities brought by the current trends usually called Open Science. The name may seem surprising: isn't science open by definition? One would assume so, as it progresses through wide communication of results. However, as we learned from the talks given by the keynote speakers, open science has many aspects:

open access to producing knowledge (known as citizen science),
open access to scientific publications (promoted by the open access movement),
open access to scientific data (the focus of RDA).

Clearly, we can't just make everything open here and now. The keynote presentations raised many open questions: What data should be restricted from public access? Who should pay for long-term storage? How can high quality be assured?

The interplay between open access to data and publications has been of particular interest to me as a text miner. In this field, we use both open and restricted access to acquire scientific papers, not to read them ourselves, but to treat them as input data for automatic reading algorithms. That raises several legal questions as well, concerning compliance with restrictions imposed by publishers. The problem could be solved at the policy level by introducing a so-called copyright exception for text and data mining. From that point of view, it was comforting to see representatives of the European Commission, the OECD, the National Science Foundation and the Japanese government at the symposium, all advocating more openness in science.

Data Foundations and Terminology IG

As an early career researcher, I was assigned to the Data Foundations and Terminology Interest Group to assist during the meeting and prepare a report. The group aims to support RDA efforts by providing a controlled vocabulary in the area of scientific data sharing. It was previously a working group, which produced several deliverables, and has since been transformed into an interest group to continue extending its resources and liaising with other RDA groups.

The group turned out to be surprisingly relevant to my research interests, and I was able to engage in the discussion during the meeting. Some of the problems are caused by terms that have different definitions in different scopes. This issue is well known in my domain, i.e. text processing, as word ambiguity. It's very common in everyday language (cf. bank: slope vs. institution), but I was surprised to find out that it's an issue in scientific vocabulary as well. That was just one point in a very active discussion; others included rules for creating definitions, legal metadata, conversion to common formats and the differences between an ontology and a vocabulary. These topics are still being discussed among the participants, many days after the meeting.

Poster session

As stated previously, interacting with researchers from other fields is an essential part of a text miner's work. The poster session, in which I presented my work in the OpenMinTeD project, generated really high interest in my own and other young researchers' work. I had the opportunity to talk to numerous people from a wide range of fields, from agriculture, through social science, to psychiatry, and to discuss the problems that we encounter when trying to re-use textual data. Some of these conversations continued during the Birds of a Feather session, devoted exclusively to text mining.

Text Mining Birds of a Feather

The first part of the session was a short introduction to the problems of text mining. One of the motivations behind the field is what is happening with the scientific literature. According to the STM report, the global research community generates 1.5 million new articles per year (as of 2009). It has been estimated that 90% of papers are never cited and 50% are never even read by anyone other than their authors and the people involved in the publication process. The situation is no better in narrow domains, e.g. there are 70,000 papers published on the single protein p53 (a tumor suppressor). Obviously, no human researcher could keep up with such an influx of knowledge. That's why people have turned towards machine reading, i.e. automatic processing of textual resources, organising and classifying them in various dimensions and extracting the main information items. First, a text mining system uses information retrieval techniques to select a group of documents that possibly match the user's query. Then, the program proceeds to the language understanding stage, in which linguistic knowledge is used to analyse every sentence. This allows the identification and extraction of entities and the relations between them, transforming unstructured data into structured knowledge. Finally, the obtained information can be used for multidimensional analysis and prediction, performed by humans or machines.
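To make these stages more concrete, here is a minimal Python sketch of such a pipeline. It is only a toy illustration, not the OpenMinTeD implementation: the corpus, the entity dictionary and all function names are invented for this example, retrieval is reduced to keyword overlap, and "language understanding" is replaced by a dictionary lookup plus sentence-level co-occurrence, which is far simpler than the linguistic analysis a real system would perform.

```python
import re
from itertools import combinations

# Hypothetical toy corpus and entity dictionary, invented for illustration only.
DOCUMENTS = {
    "doc1": "The protein p53 suppresses tumor growth. Mutations in p53 are common in cancer.",
    "doc2": "River banks erode during floods.",
    "doc3": "MDM2 binds p53 and inhibits its activity.",
}
ENTITY_TYPES = {"p53": "protein", "mdm2": "protein", "tumor": "disease", "cancer": "disease"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents):
    """Stage 1 (information retrieval): keep documents sharing a token with the query."""
    query_tokens = set(tokenize(query))
    return [doc_id for doc_id, text in documents.items()
            if query_tokens & set(tokenize(text))]


def split_sentences(text):
    """Stage 2 (a crude stand-in for linguistic analysis): split text into sentences."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_entities(sentence):
    """Stage 3a: return dictionary entities in order of first appearance."""
    found = []
    for token in tokenize(sentence):
        if token in ENTITY_TYPES and token not in found:
            found.append(token)
    return found


def extract_relations(sentence):
    """Stage 3b: treat every pair of entities in one sentence as a candidate relation."""
    return [(a, "co-occurs-with", b)
            for a, b in combinations(extract_entities(sentence), 2)]


def mine(query, documents):
    """Run all stages and return structured records ready for further analysis."""
    records = []
    for doc_id in retrieve(query, documents):
        for sentence in split_sentences(documents[doc_id]):
            for relation in extract_relations(sentence):
                records.append({"doc": doc_id, "relation": relation})
    return records


if __name__ == "__main__":
    for record in mine("p53", DOCUMENTS):
        print(record)
```

The list of records returned by mine() is the "structured knowledge" of the last step: it can be counted, filtered or passed on to a prediction model. In a real system each stage would be replaced by a much stronger component, but the overall flow from query, through documents and sentences, to entities and relations stays the same.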

The focus of the OpenMinTeD project is on facilitating this process by establishing an open and sustainable text and data mining platform and infrastructure, thanks to which researchers can discover, collaboratively create, share and re-use knowledge from a wide range of textual resources. It's important to note that OpenMinTeD is not a system that may become unavailable after some time. The most important part of the project is the infrastructure, which will continue to yield benefits for scientists in the long term.

The OpenMinTeD project wouldn't be possible without the involvement of external experts, who work with text mining on a daily basis and provide us with great insight into its challenges and problems. Their input is used within four different groups, each responsible for one area. The first group deals with resource metadata and aims to compile and maintain an inventory of metadata schemas. The second group, responsible for language resources, will create a specification of the data formats used for knowledge representation, focusing on maximising the re-usability of created content. The group working on IPR and licensing will try to identify copyright and related rights, restrictions and exceptions that limit the use of textual sources. Finally, the workflows and annotations group aims to define annotation models and type systems to ensure seamless integration of components from different ecosystems.

Next, we talked about the ways in which the OpenMinTeD project's efforts could be linked with RDA to benefit the text mining community. The first step was to hear more from the participants of the meeting about their experiences and needs in that area. As we received a lot of feedback, they were invited to participate in the working groups as external experts.

We also discussed which of the existing working and interest groups have produced results that could help with the text mining problems covered during the meeting. It seems that there is very little overlap, so creating a new group could be justified. Based on this very preliminary stage of the discussion, an interest group was selected as the most appropriate option. Now we just need to prepare a charter proposal and start benefiting from the spirit of RDA!
