-
One theme that might be worth discussing is whether a top-down or a bottom-up approach is more promising. By top-down I mean something like IUPAC proposing a "de jure" standard, and by bottom-up I mean collections of people developing their own practices that might morph into a "de facto" standard.
-
My first posting... I don't want to be prescriptive ("this won't work", "this will"), so I'll give examples and mantras that I hope may be helpful. I'll use "you" because it's your vision, but I'm happy to be part of it.
Chemistry has objective problems:
But on the positive side, single people and small groups have made important breakthroughs:
But "you cannot hide complexity, you can only move it around" . CIF has been going 30 years and now covers many crystallography related fields. Materials/chemistry will have to support:
We covered most of these in Chemical Markup Language. Whether you use XML or JSON or whatever is irrelevant. You are welcome to repurpose some or all of it. The big informatics advances are that the language is largely irrelevant and that it is now much easier to write code and validate systems.
The biggest advance of all is Wikidata. You don't have to reinvent anything other than the chemistry; bibliography, biology and so on are solved by others. If you can make Wikidata part of your core it will unite the community and bring help from outside, whether explicit or at second hand. It's also politically acceptable - a huge identifier system which you can rely on and modify whenever you need it.
P.
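As a minimal sketch of what "making Wikidata part of your core" could look like in practice: resolve a compound to its Wikidata QID by InChIKey and reuse that QID as the shared identifier in your own records. This is not project code; the property P235 (InChIKey) and the caffeine example reflect my reading of Wikidata and should be checked.

```python
# Minimal sketch (not project code): look up Wikidata items by InChIKey so the
# returned QID can serve as the shared identifier in our own records.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def wikidata_items_for_inchikey(inchikey: str):
    """Return (QID, label) pairs for items whose InChIKey (P235) matches."""
    query = f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P235 "{inchikey}" .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "madices-wikidata-sketch/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [
        (b["item"]["value"].rsplit("/", 1)[-1], b["itemLabel"]["value"])
        for b in bindings
    ]

# Example: caffeine (expected to resolve to Q60235)
print(wikidata_items_for_inchikey("RYYVLZVUVIJVGH-UHFFFAOYSA-N"))
```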
-
One resource I forgot to add is Stonebraker, M., Tamr White Paper – The Seven Tenets of Scalable Data Unification (<http://www.tamr.com/wp-content/uploads/2017/06/The-Seven-Tenets-of-Scalable-Data-Unification-WP.pdf>) - they say a schema-first approach will never scale. In the end, this also ties in with Shirky's "Ontology is Overrated -- Categories, Links, and Tags": standardization always seems to involve predicting the future, which is hard if not impossible. I feel that there is some truth to this. From my experience with the ELN, we can spend forever thinking about all eventualities, but this will just block us from writing code (and, most likely, will anyhow be incomplete).
-
On Sun, Feb 6, 2022 at 8:50 AM Kevin Jablonka ***@***.***> wrote:

> One resource I forgot to add is Stonebraker, M. Tamr White Paper – The Seven Tenets of Scalable Data Unification
> <http://www.tamr.com/wp-content/uploads/2017/06/The-Seven-Tenets-of-Scalable-Data-Unification-WP.pdf>
> - they say a schema first approach will never scale.
> I feel that there is some truth to this. From my experience with the ELN, we can spend forever thinking about all eventualities, but this will just block us from writing code (and, most likely, will anyhow be incomplete).
Very useful ideas.
The key message is "start now!" I have seen many systems that tried to design comprehensive schemas - an example is AnIML (animl.org, the Analytical Information Markup Language, from ASTM and others), which is 20 years old and, as far as I can see, is mainly still having meetings. MADICES (and I hope I can count myself in that and use "we") needs to make a splash in 6 months - with running code and use-cases.
The key question is "WHY are we doing X?" Is it:
1. to help ourselves
2. to help others
3. to build a community
4. to build a semi-intelligent system
5. to be a formal record for IP
6. to be a formal academic record (e.g. for theses and articles)
7. to record for health and safety
8. to gather knowledge for the organization
(add other organizational concerns)

No system can do everything. 5, 6, 7 and 8 are the main basis for commercial ELNs. They are for the organizations and require much more overhead than the others. They are production systems.
I've been involved in chemical ELNs - but not recently. My experience was that they were oriented towards:
* make compounds
* test their activity

That doesn't easily transfer to materials (but I may be out of date - that's why the meeting is so valuable for me!).
1, 2, 3 and 4 are open-ended research projects. That's what I am involved in. The web was built on open-ended development ("rough consensus and running code"). I think MADICES should take an exploratory approach - set some aggressive goals (which would be useful) and create prototypes.

I'll be proposing two subprojects that people can visit during the workshop. Both will drive data representation and vocabulary.
* docanalysis. Vocabulary. Extracting words and phrases from text, grouping them by subject and looking them up in Wikidata. We already have running code (we bolted in ScispaCy yesterday) and can link to Wikidata. Within seconds we can give word frequencies for a document. If we decide which are important we can create mini-dictionaries and use these for searching the literature. Created by Shweata N Hegde and Ayush Garg. (A rough sketch of this kind of pipeline follows this list.)
* pyamiimage. Extraction of values from diagrams. We have prototyped XRD and XANES plots and, if we get interest, would be able to automatically extract numeric data (as CSV), hopefully by Wednesday.
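For the curious, here is a rough sketch of the docanalysis-style term-frequency step. It is not the actual docanalysis code or API; the ScispaCy model name and the input filename are placeholders, and the Wikidata lookup from the earlier sketch could be chained onto the resulting terms.

```python
# Rough sketch (not docanalysis itself): extract noun-phrase terms from a text
# and rank them by frequency as candidates for a mini-dictionary.
from collections import Counter

import spacy

# Placeholder model: any spaCy/ScispaCy pipeline with a parser should work.
nlp = spacy.load("en_core_sci_sm")

def term_frequencies(text: str, top_n: int = 20):
    """Return the `top_n` most frequent noun-phrase terms in `text`."""
    doc = nlp(text)
    terms = [chunk.text.lower().strip() for chunk in doc.noun_chunks]
    return Counter(terms).most_common(top_n)

if __name__ == "__main__":
    with open("paper.txt") as fh:          # placeholder input document
        for term, count in term_frequencies(fh.read()):
            print(f"{count:4d}  {term}")
```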
The web was built on this type of experimentation. Experience from exercises like this will define what-we-already-do as the basis for more formal schemas.
P.
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. of Chemistry, University of Cambridge, CB2 1EW, UK
-
Frankly, to a degree I think right now anything achieving 1 or 2 stars would be good. That is, fully self-explanatory, arbitrarily structured digital data. In experimental chemistry too much information is still being stashed away in paper notebooks today. Reaching back for that paper-bound data is almost impossible, even if standards change massively and it could be valorized (i.e. huge incentives appear). I think that non-discoverable digital repositories will be cracked open by their creators once incentives are ripe (i.e. run-of-the-mill analysis tools that can easily be adapted to arbitrary file formats, data becoming publishable on its own, data becoming "pushable" to pre-existing databases, etc.).

In that sense, for agility's sake I would argue that the priority should be achieving a degree of self-curated, robust ELN architectures with good internal metadata that can and should be universally adopted now. I think there are incentives for adopting such technologies in place (i.e. ease of data management, automated SI generation, points 5-8 by @petermr above), but maybe more perks can be added on top (i.e. lightweight ML models that work on the fly, self-curation of outliers, and of course journal requirements) to push the remaining hard-liners.

There is an educational issue in here as well, but I wonder whether other institutions cover ELNs as part of the undergrad curriculum as of now. I think ELN developers (not as many as one would expect!) are up for the challenge for the most part, but of course there are a number of commercial interests in the field which may not be so keen on open metadata and integrability, because they look at user-generated data as an asset to keep users from migrating. This is extremely pernicious and should be pointed out, IMO, to experimental colleagues and developers alike.

EDIT: I think #6 (comment) meant something similar.
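To make "fully self-explanatory, arbitrarily structured digital data" concrete, here is a purely illustrative sketch (not any particular ELN's schema; all names, paths and numbers are placeholders): every value carries its unit, and the sample is pinned to a Wikidata QID, so the entry stays interpretable outside the system that produced it.

```python
# Purely illustrative, self-describing record; field names and values are
# placeholders, not a proposed standard.
import json
from datetime import datetime, timezone

record = {
    "entry_type": "measurement",
    "created": datetime.now(timezone.utc).isoformat(),
    "sample": {
        "label": "caffeine, batch 3",
        "wikidata_qid": "Q60235",                     # caffeine
        "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
    },
    "method": {"name": "powder XRD", "instrument": "example diffractometer"},
    "results": [
        # placeholder number, only to show value + unit travelling together
        {"quantity": "peak_position_2theta", "value": 11.9, "unit": "degree"},
    ],
    "raw_data_file": "xrd/batch3_scan1.csv",          # hypothetical relative path
}

print(json.dumps(record, indent=2))
```

Even this level of structure (units, identifiers, a pointer to the raw file) would already make the data far easier to crack open later than a paper notebook or an opaque vendor format.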
-
This thread is for discussing, finalizing and organizing a breakout with the theme in the title above.
Original text: