Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tutorials needed on connecting to and using INBO data from R #37

Open
3 of 11 tasks
florisvdh opened this issue Feb 5, 2018 · 29 comments
Open
3 of 11 tasks

Tutorials needed on connecting to and using INBO data from R #37

florisvdh opened this issue Feb 5, 2018 · 29 comments

Comments

@florisvdh
Copy link
Member

florisvdh commented Feb 5, 2018

INBO hosts many databases, so tutorials are needed for scientific and scientific support staff on how to use them from within R.

These tutorials should focus on:

  • connecting to the respective (spatial and non-spatial) databases / datawarehouses from within R, in a dplyr / DBI compliant way (rather than using RODBC). Interesting examples can be found at https://db.rstudio.com/.
    • to achieve reproducibility (transferability to docker, ...) and scalability (servers), methods should be agnostic of the user's environment as much as possible. E.g.: the user's location, the operating system, the network.
    • one connection method that works with all INBO (spatial and non-spatial) databases / datawarehouses is the preferred way to go.
  • best practices using each specific database, by clearly defining the available tables and fields, and providing key scripting examples.

There is an urgent need for the following INBO databases:

Further, the need is also urgent for these databases:

  • LIMS
  • BWK-database
  • INBODEM

This list is most probably incomplete. (So having an overview of data is another, related need.)

@stijnvanhoey
Copy link
Collaborator

With respect to GIS-data, the available dbase of mercator-vonet should be in scope ass well. I cite here the information provided by Carine Wils:

INBO netwerk connecteren naar de nieuwe Read only databank van MercatorNet (= PostGIS databank). Daarin zitten een groot deel van de (vector)lagen van AIV (vroegere AGIV), maar ook van ANB – INBO – dep OMG (ex-dLNE+dRV) – OVAM – MOW – Erfgoed – VMM, ed.
Het gros van deze lagen is ook beschikbaar op onze S-schijf, maar dan in de vorm van compressed FileGDB's, dus niet zo handig voor niet-ArcGIS gebruikers.

Toegang tot deze databank is mogelijk via je milieuinfo account (diegene waarmee je inlogt op JIRA), maar je moet daarvoor wel een mail sturen naar help@milieuinfo.be met vraag om toegang tot de RO databank Mercator.

Als extra ondersteuning voor de gebruikers zijn er van elke dataset op de nieuwe databank connectie/legende bestanden aangemaakt (LYR en QLR). Deze zijn te vinden op de MercatorNet website: https://www.mercator.vonet.be/intern-ontsloten-data-lyr-qlr en zijn hier: https://github.com/MercatorNet/MercatorUpload in één keer te downloaden.

Meer info op deze website: https://www.mercator.vonet.be.

Voor de geïnteresseerden, de connectieparameters zijn als volgt:

• Database: mercator_ro_productie
• User: <milieuinfo user productie>
• Password: <eigen paswoord>
• Instance: 5613
• Hostname: db-pr.mercator.vonet.be

@peterdesmet
Copy link
Member

Just tested DBI/odbc and rodbc for a project:

  • RODBC: has better documentation.
  • odbc: is faster, but can't find more documentation than the same README that is copy/pasted everywhere. Still, it is the technology that is used in the RStudio connections pane, so I agree we should use that rather than RODBC.

Best for INBO would be to provide examples that can be added to the connections pane in RStudio. Or bundle it into something that can easily be installed for everyone?

@florisvdh
Copy link
Member Author

With respect to odbc and DBI, some information from two talks at useR!2017 can be found here. (google doc; will jump to the odbc talk; DBI is next)

@stijnvanhoey
Copy link
Collaborator

at https://inbo.github.io/tutorials/tutorials/r_database_access/ a introduction is provided about R access to databases

@florisvdh
Copy link
Member Author

Woohaa! This rocks. I'll certainly test this further, for now I'll propose a few tweaks here or there.

@ElsLommelen
Copy link
Contributor

Sorry for a late response, but I only now noticed this issue.

With respect to RODBC and DBI, our styleguide recommends to use RODBC to connect to SQL Server and Access, and DBI to connect to other databases. For me as a Windows user, this seems the right guideline to follow: the odbc-driver that is used with DBI for SQL Server on Windows causes a lot of difficulties with special characters. (Apparently Linux machines have a better driver and do not have these problems.)

As @peterdesmet already mentions: the documentation in RODBC is better, and to be honest: I think the error handling in DBI is terrible. Often I have to paste my query in SQL Server Management Studio to find out what is wrong.

For databases without any special characters, it could be ok to suggest to use DBI (e.g. advantage to use dbplyr), but otherwise I would recommend to test first in a Windows environment if all characters can be imported well. The majority of the INBO users has Windows, so I think choices should mainly be based on this OS.

@hansvancalster
Copy link
Contributor

A way to circumvent problems with special characters, yet still use DBI is given in this function. The input parameter scient_name in that function often contains special characters. This parameter is passed on to glue_sql with {} and afterwards converted to the correct encoding with iconv before passing it to dbGetQuery(). The problem was solved by @Jo-Loos.

@ElsLommelen
Copy link
Contributor

As you know, I knew this solution, and it works indeed for some special characters, but unfortunately not for all (e.g. characters for male, female, <= in one character,...). (The examples indeed have an alternative and can be replaced in the database, but there will probably pop up others without alternative that give the same problem.)

But what actually worries me, is that this all makes it more complicated for starting or occasional R programmers to use DBI. The whole process of connecting to a database and running a query is already complicated, and then they have to add another statement to translate these characters that are not presenting well,... Furthermore, we already searched for weeks to find this (partial) solution, they might already search for some time before they figure out it is not their mistake (which can be demotivating) and ask for a solution.

So my main concern was not if there is a solution, but: if we know there are some difficulties with DBI that will make things complicated for less experienced users and there is an alternative (RODBC), will we still recommend DBI for using with SQL Server? What is actually the advantage of using DBI over RODBC? Except for the presence of a connection pane in RStudio, I don't see any arguments to choose for DBI in the discussion of this issue.

So please convince me with some good arguments why DBI is prefered over RODBC.

(I don't know why @ThierryO mentioned RODBC as a recommended package for SQL Server in the styleguide. It seems he wanted to promote DBI (as he did for most database types), but for some reason he made an exception for SQL Server and Access.)

@hansvancalster
Copy link
Contributor

I was just pointing to that solution because it matters a lot where in your code the conversion to the correct encoding is done. For instance, encoding the input parameter and then passing it to glue_sql will not work.

I don't have strong opinions, about DBI or RODBC, but I find exploring the database via the connection pane a big plus and it is considered the "best in class" package by rstudio.

@florisvdh
Copy link
Member Author

Something else to consider here is that the tidyverse package dbplyr is designed to work with DBI connections (see next coding club).

@florisvdh
Copy link
Member Author

Fast forward to 2020: we now have the inbodb package to address some challenges! It would be good to have a pointer to inbodb in the tutorials website, both in the database-R tutorial and in the overview article of INBO software. @ElsLommelen can you give your opinion? (you're best placed to implement this as well)

Also the INBOVEG tutorial will need some updating @ElsDeBie

Beside that, I think we should further investigate / specify the need for functionality for specific databases, i.e. database-specific queries and preprocessing through R functions. This relates to the second sub-part of this issue, i.e. "best practices using each specific database".

@ElsLommelen
Copy link
Contributor

Some ideas for the database-R tutorial:

  • one easy step to start with: replace inborutils by inbodb to guide users to the right database
  • it could be useful to mention in this tutorial that for some databases (INBOVEG and Florabank), ready-to-use functions exist to query the database, with of course a reference to the INBOVEG tutorial
  • setting up the connection yourself: for Windows users it is certainly useful to mention here to add parameter encoding = "latin1"
  • certainly for Access we should do some further tests:
    • are special symbols and complex data types an issue as well in DBI/odbc?
    • should we add a specific function in inbodb that overcomes these problems (if possible), or mention RODBC as an alternative package, with some advantages and disadvantages for both? (Access provides possibilities to explore the database and compose queries as an alternative to the connection pane and dbplyr)
  • maybe change the url https://inbo.github.io/tutorials/tutorials/r_database_access/ because the word access can be confusing as it mainly deals with SQL server?
  • would it be useful to add a short note on other database types (postgresql, sqlite,...) and the fact that dedicated packages exist to make this connection?
  • is there a tutorial on dbplyr, or should this be added here?
  • the tutorial is rather long, which might make it difficult to look up specific information. Would it be an idea to split it in 2 (or 3) tutorials: one on a connection and one on queries, and maybe a third one as a general page that explains that one needs 2 steps: connection and queries.

INBO software: I'll add inbodb and a short description

I'll start working on a PR with the first idea for the database tutorial and the software and I'll wait on some more opinions for the other ideas. Maybe it would be better if someone with experience on it would add something on dbplyr (if necessary)?

Maybe it is a good idea to involve @hansvancalster in this discussion as well? (For the tutorials on Florabank and INBOVEG)

@florisvdh florisvdh modified the milestones: >2020, 2020 Jun 24, 2020
@ElsLommelen
Copy link
Contributor

Considering the INBOVEG tutorial: maybe it better fits as a vignet in inbodb?

@florisvdh
Copy link
Member Author

Fine ideas @ElsLommelen !

maybe change the url https://inbo.github.io/tutorials/tutorials/r_database_access/

perhaps r_database_sqlserver ? MS Access better goes into its own tutorial.

is there a tutorial on dbplyr, or should this be added here?

You could simply refer to https://dbplyr.tidyverse.org/index.html

split it in 2 (or 3) tutorials: one on a connection and one on queries

If you would do that, I'd prefer two tutorials rather than three, and which should clearly refer to each other at the beginning. Otherwise, the interconnection gets lost (the website doesn't provide tutorial-hierarchy) and in such case I think it's better to keep it together as one.

@ElsLommelen
Copy link
Contributor

perhaps r_database_sqlserver? MS Access better goes into its own tutorial.

Because the general use is similar for all database types, I am not sure the best option is to give each database type its own tutorial.

Also, maybe (part of) the database tutorial should better be part of a vignet in inbodb... I feel the whole idea should have some more considerations. For now I will make some minor changes to direct people to the right package, and I will have a closer look at it later on.

@florisvdh
Copy link
Member Author

I am not sure the best option is to give each database type its own tutorial

Sure, common parts better aren't duplicated.

Regarding the tutorials-website vs. vignette, I leave such preference to you, it is not that obvious to me what is best. It's certainly a good thing to have self-contained help within the package. If the tutorial (or parts of it) were to move to the inbodb package, I suggest to still leave a 'tutorial' here that directs to the inbodb package for more information.

@ElsLommelen
Copy link
Contributor

Part of the tutorial describes methods not using inbodb, so this will stay on the tutorial website anyway. Other parts can directly link to the relevant inbodb vignet.

@ElsLommelen
Copy link
Contributor

Another issue with inbodb: where does it fit best on the INBO software website? It certainly belongs to retrieve data, but it covers partly the technical aspect (connecting to databases) and it partly queries databases. For now it only queries databases with biological data (and is put in this section), but as part of the INBO databases contain environmental data, the package inbodb may in future contain queries on environmental data as well. Considering all this, where does it fit best?

@hansvancalster
Copy link
Contributor

Maybe in the table add a row retrieve data: general, or do not make that distinction in the table and just lump them in one row retrieve data? In the remainder of the article, the distinction is useful (and I guess also in the table, so I am in favor of adding a row).

@florisvdh
Copy link
Member Author

I agree with Hans to add a row Retrieve data: general, above the current ones on that topic. You can then add a new subtitle in the text, under 'Retrieve data' and put inbodb there.

@ElsDeBie
Copy link
Contributor

ElsDeBie commented Jul 1, 2020

Are there any good sites about 'how to write a vignette'? I quickly searched and find this: https://kbroman.org/pkg_primer/pages/vignettes.html
http://r-pkgs.had.co.nz/vignettes.html
For tutorials, we have our: https://inbo.github.io/tutorials/create_tutorial/. I can imagine that we have to write more vignettes, perhaps it's also useful to make a tutorial for this? Or are there generel R rules?

@florisvdh
Copy link
Member Author

Hi @ElsDeBie, I found Hadley Wickham's book very helpful; see the vignettes chapter: http://r-pkgs.had.co.nz/vignettes.html

@ElsLommelen
Copy link
Contributor

Euh, no idea. Easiest to generate a vignette is using the function usethis::use_vignette() and specify a name for your vignette. (This function deals with all settings and the heading, you only have to push these changes to github, and of course add content to your vignette.) That's about it for the technical part, I think.

And for the content, to me it seems the same as any other text: focus on a (potential) user group for your package and write down what they should really know to use your package. I generally start with writing the goal of the vignette (and the focal user group), and try to use a lot of examples.

@florisvdh
Copy link
Member Author

Hm, I should have looked better, you've listed r-pkgs already 😉 Anyway, thanks for the tutorial of Broman, seems interesting - didn't know it.

@ElsLommelen
Copy link
Contributor

@ElsDeBie In the link by Hadley Wickham, just replace package name devtools with usethis if the function is unknown: some functions have moved to usethis some years ago, and for some months these functions are effectively removed from devtools. Unfortunately a lot of documentation (even the develop cheatsheet) is not yet updated with these changes.

@florisvdh
Copy link
Member Author

There is a 2nd edition of the book under development at https://r-pkgs.org. It seems that the authors did already make updates to the vignettes chapter at https://r-pkgs.org/vignettes.html.

@ElsDeBie
Copy link
Contributor

ElsDeBie commented Jul 3, 2020

Els Lommelen: > Another issue with inbodb: where does it fit best on the INBO software website? It certainly belongs to retrieve data, but it covers partly the technical aspect (connecting to databases) and it partly queries databases. For now it only queries databases with biological data (and is put in this section), but as part of the INBO databases contain environmental data, the package inbodb may in future contain queries on environmental data as well. Considering all this, where does it fit best?

--> Is it an idea to put between brackets the databases mentioned in this package? Otherwise users can be lost if they don't no where to search for ..."Retrieve data: general | inbodb (technical support, Inboveg, FLoradatabank)".
Or maybe beter to put a third column with 'concerning databases'

https://inbo.github.io/tutorials/articles/inbo_software/

@florisvdh
Copy link
Member Author

florisvdh commented Jul 3, 2020

@ElsDeBie Adding the explicit databases has both pros (more informative) and cons (needs maintenance). @ElsLommelen what do you prefer?

Adding the explicit databases (in the description) was actually her first approach; I suggested generalization 🤔 (see #171 (comment)).

Regarding the form: I would not insert package-specific information in the table, as the table intends to give a very global overview. Package-specific information is given in the text below. @ElsDeBie Maybe you can propose a better category name than 'Retrieve data: general', that would be a better fit ?

Still another possibility could be to repeat the package under more than 1 category in the table.

@ElsLommelen
Copy link
Contributor

@ElsDeBie I agree with @florisvdh and I would rather prefer to not mention specific databases for maintenance reasons. When adding new databases to inbodb, it is easily added in for instance the readme of inbodb itself, but keeping the tutorials website up-to-date as well (and especially remembering to do so whenever we make changes in inbodb), will require an additional effort. On the other hand it would be nice to name all these specific databases (inboveg and florabank for now) in the readme of inbodb, so visitors will immediately have this information while using the link in the tutorials website.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants