-
One theme that might be worth discussing is whether a top-down or a bottom-up approach is more promising. By top-down I mean something like IUPAC proposing a "de jure" standard, and by bottom-up I mean collections of people developing their own practices that might morph into a "de facto" standard.
-
My first posting... I don't want to be prescriptive ("this won't work", "this will"), so I'll give examples and mantras that I hope may be helpful. I'll use "you" because it's your vision, but I'm happy to be part of it.
Chemistry has objective problems:
But on the positive side, single people and small groups have made important breakthroughs:
But "you cannot hide complexity, you can only move it around" . CIF has been going 30 years and now covers many crystallography related fields. Materials/chemistry will have to support:
We covered most of these in Chemical Markup Language. Whether you use XML or JSON or whatever is irrelevant. You are welcome to repurpose some or all of it. The big informatics advances are that the language is largely irrelevant and that it is now much easier to write code and validate systems.
The biggest advance of all is Wikidata. You don't have to reinvent anything other than the chemistry; bibliography, biology and so on are solved by others. If you can make Wikidata part of your core it will unite the community and bring help from outside, whether explicit or at second hand. It's also politically acceptable - a huge identifier system which you can rely on and modify whenever you need it.
P.
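As a minimal sketch of what "making Wikidata part of your core" could look like in practice: resolve a compound to its Wikidata QID by InChIKey and reuse that QID as the shared identifier in your own records. This is not project code; the property P235 (InChIKey) and the caffeine example reflect my reading of Wikidata and should be checked.

```python
# Minimal sketch (not project code): look up Wikidata items by InChIKey so the
# returned QID can serve as the shared identifier in our own records.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def wikidata_items_for_inchikey(inchikey: str):
    """Return (QID, label) pairs for items whose InChIKey (P235) matches."""
    query = f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P235 "{inchikey}" .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "madices-wikidata-sketch/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [
        (b["item"]["value"].rsplit("/", 1)[-1], b["itemLabel"]["value"])
        for b in bindings
    ]

# Example: caffeine (expected to resolve to Q60235)
print(wikidata_items_for_inchikey("RYYVLZVUVIJVGH-UHFFFAOYSA-N"))
```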
-
One resource I forgot to add is Stonebraker, M., Tamr White Paper – The Seven Tenets of Scalable Data Unification (<http://www.tamr.com/wp-content/uploads/2017/06/The-Seven-Tenets-of-Scalable-Data-Unification-WP.pdf>) - they say a schema-first approach will never scale. In the end, this also ties in with Shirky's "Ontology is Overrated -- Categories, Links, and Tags": standardization always seems to involve predicting the future, which is hard if not impossible. I feel that there is some truth to this. From my experience with the ELN, we can spend forever thinking about all eventualities, but this will just block us from writing code (and, most likely, will anyhow be incomplete).
-
On Sun, Feb 6, 2022 at 8:50 AM Kevin Jablonka ***@***.***> wrote:

> One resource I forgot to add is Stonebraker, M. Tamr White Paper – The Seven Tenets of Scalable Data Unification
> <http://www.tamr.com/wp-content/uploads/2017/06/The-Seven-Tenets-of-Scalable-Data-Unification-WP.pdf>
> - they say a schema first approach will never scale.
> I feel that there is some truth to this. From my experience with the ELN, we can spend forever thinking about all eventualities, but this will just block us from writing code (and, most likely, will anyhow be incomplete).
Very useful ideas.
The key message is "start now!" I have seen many systems that tried to design comprehensive schemas - an example is AnIML (animl.org, the Analytical Information Markup Language, from ASTM and others), which is 20 years old and, as far as I can see, is mainly still having meetings. MADICES (and I hope I can count myself in that and use "we") needs to make a splash in 6 months - with running code and use-cases.
The key question is "WHY are we doing X?" Is it:
1. to help ourselves
2. to help others
3. to build a community
4. to build a semi-intelligent system
5. to be a formal record for IP
6. to be a formal academic record (e.g. for theses and articles)
7. to record for health and safety
8. to gather knowledge for the organization
(add other organizational concerns)

No system can do everything. 5, 6, 7 and 8 are the main basis for commercial ELNs. They are for the organizations and require much more overhead than the others. They are production systems.
I've been involved in chemical ELNs - but not recently. My experience was that they were oriented towards:
* make compounds
* test their activity

That doesn't easily transfer to materials (but I may be out of date - that's why the meeting is so valuable for me!).
1, 2, 3 and 4 are open-ended research projects. That's what I am involved in. The web was built on open-ended development ("rough consensus and running code"). I think MADICES should take an exploratory approach - set some aggressive goals (which would be useful) and create prototypes.

I'll be proposing two subprojects that people can visit during the workshop. Both will drive data representation and vocabulary.
* docanalysis. Vocabulary. Extracting words and phrases from text, grouping them by subject and looking them up in Wikidata. We already have running code (we bolted in ScispaCy yesterday) and can link to Wikidata. Within seconds we can give word frequencies for a document. If we decide which are important we can create mini-dictionaries and use these for searching the literature. Created by Shweata N Hegde and Ayush Garg. (A rough sketch of this kind of pipeline follows this list.)
* pyamiimage. Extraction of values from diagrams. We have prototyped XRD and XANES plots and, if we get interest, would be able to automatically extract numeric data (as CSV), hopefully by Wednesday.
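For the curious, here is a rough sketch of the docanalysis-style term-frequency step. It is not the actual docanalysis code or API; the ScispaCy model name and the input filename are placeholders, and the Wikidata lookup from the earlier sketch could be chained onto the resulting terms.

```python
# Rough sketch (not docanalysis itself): extract noun-phrase terms from a text
# and rank them by frequency as candidates for a mini-dictionary.
from collections import Counter

import spacy

# Placeholder model: any spaCy/ScispaCy pipeline with a parser should work.
nlp = spacy.load("en_core_sci_sm")

def term_frequencies(text: str, top_n: int = 20):
    """Return the `top_n` most frequent noun-phrase terms in `text`."""
    doc = nlp(text)
    terms = [chunk.text.lower().strip() for chunk in doc.noun_chunks]
    return Counter(terms).most_common(top_n)

if __name__ == "__main__":
    with open("paper.txt") as fh:          # placeholder input document
        for term, count in term_frequencies(fh.read()):
            print(f"{count:4d}  {term}")
```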
The web was built on this type of experimentation. Experience from exercises like this will define what-we-already-do as the basis for more formal schemas.
P.
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. of Chemistry, University of Cambridge, CB2 1EW, UK
-
Frankly, to a degree I think right now anything achieving 1 or 2 stars would be good. That is, fully self-explanatory, arbitrarily structured digital data. In experimental chemistry too much information is still being stashed away in paper notebooks today. Reaching back for that paper-bound data is almost impossible, even if standards change massively and it could be valorized (i.e. huge incentives appear). I think that non-discoverable digital repositories will be cracked open by their creators once incentives are ripe (i.e. run-of-the-mill analysis tools that can easily be adapted to arbitrary file formats, data becoming publishable on its own, data becoming "pushable" to pre-existing databases, etc.).

In that sense, for agility's sake I would argue that the priority should be achieving a degree of self-curated, robust ELN architectures with good internal metadata that can and should be universally adopted now. I think there are incentives for adopting such technologies in place (i.e. ease of data management, automated SI generation, points 5-8 by @petermr above), but maybe more perks can be added on top (i.e. lightweight ML models that work on the fly, self-curation of outliers, and of course journal requirements) to push the remaining hard-liners.

There is an educational issue in here as well, but I wonder whether other institutions cover ELNs as part of the undergrad curriculum as of now. I think ELN developers (not as many as one would expect!) are up for the challenge for the most part, but of course there are a number of commercial interests in the field which may not be so keen on open metadata and integrability, because they look at user-generated data as an asset to keep users from migrating. This is extremely pernicious and should be pointed out, IMO, to experimental colleagues and developers alike.

EDIT: I think #6 (comment) meant something similar.
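To make "fully self-explanatory, arbitrarily structured digital data" concrete, here is a purely illustrative sketch (not any particular ELN's schema; all names, paths and numbers are placeholders): every value carries its unit, and the sample is pinned to a Wikidata QID, so the entry stays interpretable outside the system that produced it.

```python
# Purely illustrative, self-describing record; field names and values are
# placeholders, not a proposed standard.
import json
from datetime import datetime, timezone

record = {
    "entry_type": "measurement",
    "created": datetime.now(timezone.utc).isoformat(),
    "sample": {
        "label": "caffeine, batch 3",
        "wikidata_qid": "Q60235",                     # caffeine
        "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
    },
    "method": {"name": "powder XRD", "instrument": "example diffractometer"},
    "results": [
        # placeholder number, only to show value + unit travelling together
        {"quantity": "peak_position_2theta", "value": 11.9, "unit": "degree"},
    ],
    "raw_data_file": "xrd/batch3_scan1.csv",          # hypothetical relative path
}

print(json.dumps(record, indent=2))
```

Even this level of structure (units, identifiers, a pointer to the raw file) would already make the data far easier to crack open later than a paper notebook or an opaque vendor format.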
-
This thread is for discussing, finalizing and organizing a breakout with the theme in the title above.
Original text: