Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntax for 'replace' is MySQL/MariaDB specific and not working with Postgres-based databases #11

Open
phainom opened this issue Dec 16, 2019 · 12 comments · Fixed by #14
Assignees
Labels
bug Something isn't working

Comments

@phainom
Copy link
Collaborator

phainom commented Dec 16, 2019

Replacing duplicate primary keys is as of now done by creating a temporary table and exchanging the actual one later by it.

The syntax CREATE TEMPORARY TABLE ... LIKE is specific to certain databases like mariadb. It is not working for postgres and databases building on it, like Amazon Redshift. Thus, the full compatibility with those databases cannot be ensured.

There are different paths to resolving the issue:

  • exchange the syntax by a long sql string making use of insert update duplicate keys.
  • put parentheses around the table to accomodate the different syntax for at least postgres.
@phainom phainom added the bug Something isn't working label Dec 16, 2019
@phainom phainom self-assigned this Dec 16, 2019
@phainom
Copy link
Collaborator Author

phainom commented Jan 21, 2020

The problem here is that from a design perspective, the whole send_data utility is different from all other utilities in the package. Apart from that feature, everything else is independent from the SQL dialect being used. In fact, this is the only method using any SQL code not given.

To further support different varieties of SQL, I see the following ways to change send_data:

  1. completely follow the syntax of to_sql by pandas
  2. change everything to basic insert statements
  3. delete it from the package completely

1 would mean just using SQL alchemy. It would break the current setup, since there is no update duplicate keys within pandas and afaik SQL alchemy. 2. would be super slow, especially compared to optimized packages such as psycopg2. 3. would break the current setup completely, but be the most natural thing to do, since this is the only utitlity directly using SQL code. The only other alternative would be to support and optimize for certain dialects (eg MariaDB and Postgres), which can't be the purpose of the package. I chose not to further use send_data in my projects, but this is because I usually use Redshift and tend to prefer to use the COPY statement within a bulk query, which is usually faster than to_sql. Since I cannot assume this is the case for other productionalized projects, I tend to go with 2., which would also mimic the setup of the R package dbtools. This would definitely slow things down from the current setup, but optimization can still be done on the users end by using the bulk_query utility.

@phainom
Copy link
Collaborator Author

phainom commented Feb 19, 2020

@miraKlein pointed me towards an additional problem with the current setup which temporary tables are causing, namely the inability of using chunk sizes. This is something which should be thought about when rewriting the syntax for replace.

@wahani
Copy link
Member

wahani commented Feb 24, 2020

We have discussed this with the INWT Python crowd. If we remove all connections to a specific driver, then dbrequest will deliver the following benefits:

  • management of connection objects
  • management of credentials
  • reading sql code from files
  • sparing us to write boilerplate around the pandas send to sql methods to achieve the points above
  • maybe some others

Some of these points, but not all of them, have been the goals with dbtools. There we wanted to consolidate the different approaches to connect to databases within the team at INWT. At the time this was only MySQL/MariaDB. We now also have Redshift and PostgreSQL. However, it was always the goal to implement SQL-Dialact/Driver specific solutions for issues in the team. So that they can be fixed once and properly and everyone benefits. Some of the Drivers are not public but are implemented in different packages.

What we had from the beginning, which I do not see in dbrequests right now, is dispatch on the connection object; aka the driver/SQL-dialect. In R this is accomplished by using generic functions and implement methods for the specific driver. You can see that here for sendQuery and here for sendData.

So what we have in dbtools is a set of basic functionality that we implemented for DBIConnections (which is the super class/parent of all DB connections in R/DBI). Then we override/extend this functionality for the specific drivers, currently MySQL, MariaDB and PostgreSQL for Redshift. We can do this in dbtools itself, or do this in other packages, like we did for Redshift.

So what I take from dbtools and would like to see in dbrequests is:

  • Driver specific solutions... (you seem to like to move away from this, which would be fine for me and I get back to it in a minute)
  • Extensibility for drivers that are currently not supported

Driver specific solutions...

Is something you would like to avoid, which I find very reasonable. In fact it is the right path to go: with dbtools we enforce that someone who uses Redshift has to install MariaDB and MySQL, because we have it inside of dbtools and our Redshift-engine uses dbtools as dependency. This is okay, because we have to have MariaDB anyways, it is typically the target in our ETLs. This is probably something we do not want in dbrequests in the future.

Extensibility for drivers and methods that are not supported by pandas

In dbtools we have something that should be similar to an abstract parent class in OOP, and then we define subclasses for the specific drivers. This makes it fairly easy to write backends for new drivers. I do not think this is possible right now in dbrequests (correct me if I'm wrong), but we would really need that for future development. Because, if we remove the dialect specific parts from dbrequests, which currently sit in the Connection class, we must have a mechanism to extend the Connection class. However that is a static dependency in the Database class, which I presume is the user interface. I guess a solution for this is not too much effort. One solution that crossed my mind is to use dependency injection for a Connection object in the Database class. So instead of importing it, the Database class would accept an object of the Connection class during init. Then we can extend the Connection class for driver specific modifications and do so in separate packages. But I lack any detailed knowledge about the implementation and Python...

Maybe we can schedule a call to discuss this. This got much longer than I thought already...

@phainom
Copy link
Collaborator Author

phainom commented Feb 25, 2020

I agree for the most part and like the idea of an extensible solution for driver support.

Regarding the benefits/main goals of the package:

  • management of connection objects
  • management of credentials
  • reading sql code from files
  • sparing us to write boilerplate around the pandas send to sql methods to achieve the points above
  • maybe some others

I'd argue the most important "benefit" from dbrequests and the reason it was built was to deliver an equivalent solution to dbtools and provide a smooth transition for users accustomed to its usage, essentially providing user interfaces like send_query and send_data. dbrequests does not follow Python conventions around database API's, specifically PEP 249. There are driver specific python packages which provide highly optimized utilities, security features etc above what dbrequests offers, eg psycopg2 for postgres/redshift and mariadb for mariadb. Shall the purpose of dbrequests be to provide an user interface to them different from PEP 249, which they are following? The other points you mention are either taken over by SQL alchemy or two-liners.

I would follow your proposal to change the connection class to an extensible format to be able to inherit advantages from other connector packages and make it easier for you to take over at least partial ownership of further developments. Lets schedule a call and discuss!

@wahani
Copy link
Member

wahani commented Feb 28, 2020

Short summary:

  • @phainom will refactor the code such that we can derive new Connection classes for new drivers
  • Driver specific logic will move into dedicated packages, so that the dependency graph of dbrequests remains small
  • In the future dbrequests should be viewed as opinionated interface, independent of SQL dialect and driver. It will continue to rely on pandas and its database abstractions.
  • People from INWT and me specifically will contribute back to this project. We already have a set of functionality that we should integrate back into this package.

@wahani
Copy link
Member

wahani commented Mar 1, 2020

@phainom this issue gets a bit off topic but I have some general questions I would like to ask:

  • If we have several queries in python, can we send them via the same connection? Currently my impression is, that I can only hand over one string to send_query. However I do not see a reason, why this string shouldn't contain more than one query separated by ';'. In that case I think the same connection is used, correct?
  • I played around with dbrequests and noticed, that 'pymysql' is not a dependency. Instead I have to install it once I try to initialize a Database. Option A: this was unintended and is a bug; Option B: we may consolidate all kinds of dialects in this package without a growing dependency graph after all. Which I like, because we may have all kinds of ideas that we can merge. What do you think?
  • For writing data in dbtools we always write a csv file and then use the load infile. This is fast for large datasets. I tested it with dbrequests and noticed that: pandas.DataFrame.to_csv is a bit slow and still it would be better for large datasets to first write to csv. I also tested with dask which does not help much in this case. Do you have any idea where we can find a reliable fast-write for pandas.DataFrame to CSVs?

@phainom
Copy link
Collaborator Author

phainom commented Mar 2, 2020

@phainom this issue gets a bit off topic but I have some general questions I would like to ask:

* If we have several queries in python, can we send them via the same connection? Currently my impression is, that I can only hand over one string to `send_query`. However I do not see a reason, why this string shouldn't contain more than one query separated by ';'. In that case I think the same connection is used, correct?

Yes, this is possible. Note that send_query always tries to return a pandas dataframe, if multiple select statements are sent, the last one will be the one returned. Note that the connections are closed by dbrequests also if the database was not set up in a with-statement, since connections is called inside a with-block in the database class.

* I played around with dbrequests and noticed, that 'pymysql' is not a dependency. Instead I have to install it once I try to initialize a Database. Option A: this was unintended and is a bug; Option B: we may consolidate all kinds of dialects in this package without a growing dependency graph after all. Which I like, because we may have all kinds of ideas that we can merge. What do you think?

Its a feature, not a bug. pymysql is only needed when used as a driver for connecting to a mysql/mariadb database. If you use another driver, thats fine (eg psycopg2-binary for connecting to postgres). dbrequests does not include any drivers and not directly depends on them. The reason pymysql is included in the pipfile is only for testing purposes. Direct dependencies from those packages can be outsourced to the child-packages by redefining the connection class (I already started working on this).

* For writing data in dbtools we always write a csv file and then use the [load infile](https://mariadb.com/kb/en/load-data-infile/). This is fast for large datasets. I tested it with dbrequests and noticed that: `pandas.DataFrame.to_csv` is a bit slow and still it would be better for large datasets to first write to csv. I also tested with dask which does not help much in this case. Do you have any idea where we can find a reliable fast-write for pandas.DataFrame to CSVs?

One could use compression (eg gzip) while writing to disk, or use hdf? Both options should speed up writing to a file considerably. For Redshift compressed files are supported, for mariadb via mysqlimport as well?

@wahani
Copy link
Member

wahani commented Mar 2, 2020

Its a feature, not a bug. pymysql is only needed when used as a driver for connecting to a mysql/mariadb database. If you use another driver, thats fine (eg psycopg2-binary for connecting to postgres). dbrequests does not include any drivers and not directly depends on them.

I thought so. However we could keep the implementations for MariaDB and Redshift in this package, couldn't we? My concern was, that additional dependencies would be necessary. This way, the user only needs to install the driver package if they actually need it. And dbrequests is still relatively small and could probably cope with some other utilities. An import like:

import dbrequests.mariadb as dbm

was what I was thinking about. I saw that dask is structured in a similar way with it's modules and we still may develop different backends side by side

@phainom
Copy link
Collaborator Author

phainom commented Mar 3, 2020

You are right, this would be another option, having the advantage of not needing additional packaging-boilerplating.

@wahani
Copy link
Member

wahani commented Mar 4, 2020

You are right, this would be another option, having the advantage of not needing additional packaging-boilerplating.

I would like to see the dialect specific backends in this package side by side. Changing between dialects is then a matter of a change in the import statement. And we would have a good overview of the things possible. Not sure how to represent that in the file structure though: would we create a directory for each backend we implement?

@wahani wahani linked a pull request Mar 4, 2020 that will close this issue
@phainom
Copy link
Collaborator Author

phainom commented Mar 4, 2020

Exactly, one folder including __init__.py per subpackage (See dask, sklearn etc). I will soon implement a first skeleton of a postgres subpackage and with it propose a structure.

This issue is still open though, the current Insert update Implementation would better fit into a mariadb subpackage.

phainom added a commit that referenced this issue Mar 12, 2020
connection class as parameter in database class
@phainom
Copy link
Collaborator Author

phainom commented Mar 12, 2020

Reopen since initial issue is still not fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants