-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Syntax for 'replace' is MySQL/MariaDB specific and not working with Postgres-based databases #11
Comments
The problem here is that from a design perspective, the whole send_data utility is different from all other utilities in the package. Apart from that feature, everything else is independent from the SQL dialect being used. In fact, this is the only method using any SQL code not given. To further support different varieties of SQL, I see the following ways to change send_data:
1 would mean just using SQL alchemy. It would break the current setup, since there is no update duplicate keys within pandas and afaik SQL alchemy. 2. would be super slow, especially compared to optimized packages such as psycopg2. 3. would break the current setup completely, but be the most natural thing to do, since this is the only utitlity directly using SQL code. The only other alternative would be to support and optimize for certain dialects (eg MariaDB and Postgres), which can't be the purpose of the package. I chose not to further use send_data in my projects, but this is because I usually use Redshift and tend to prefer to use the COPY statement within a bulk query, which is usually faster than to_sql. Since I cannot assume this is the case for other productionalized projects, I tend to go with 2., which would also mimic the setup of the R package dbtools. This would definitely slow things down from the current setup, but optimization can still be done on the users end by using the bulk_query utility. |
@miraKlein pointed me towards an additional problem with the current setup which temporary tables are causing, namely the inability of using chunk sizes. This is something which should be thought about when rewriting the syntax for replace. |
We have discussed this with the INWT Python crowd. If we remove all connections to a specific driver, then dbrequest will deliver the following benefits:
Some of these points, but not all of them, have been the goals with dbtools. There we wanted to consolidate the different approaches to connect to databases within the team at INWT. At the time this was only MySQL/MariaDB. We now also have Redshift and PostgreSQL. However, it was always the goal to implement SQL-Dialact/Driver specific solutions for issues in the team. So that they can be fixed once and properly and everyone benefits. Some of the Drivers are not public but are implemented in different packages. What we had from the beginning, which I do not see in dbrequests right now, is dispatch on the connection object; aka the driver/SQL-dialect. In R this is accomplished by using generic functions and implement methods for the specific driver. You can see that here for So what we have in dbtools is a set of basic functionality that we implemented for DBIConnections (which is the super class/parent of all DB connections in R/DBI). Then we override/extend this functionality for the specific drivers, currently MySQL, MariaDB and PostgreSQL for Redshift. We can do this in dbtools itself, or do this in other packages, like we did for Redshift. So what I take from dbtools and would like to see in dbrequests is:
Is something you would like to avoid, which I find very reasonable. In fact it is the right path to go: with dbtools we enforce that someone who uses Redshift has to install MariaDB and MySQL, because we have it inside of dbtools and our Redshift-engine uses dbtools as dependency. This is okay, because we have to have MariaDB anyways, it is typically the target in our ETLs. This is probably something we do not want in dbrequests in the future.
In dbtools we have something that should be similar to an abstract parent class in OOP, and then we define subclasses for the specific drivers. This makes it fairly easy to write backends for new drivers. I do not think this is possible right now in dbrequests (correct me if I'm wrong), but we would really need that for future development. Because, if we remove the dialect specific parts from dbrequests, which currently sit in the Connection class, we must have a mechanism to extend the Connection class. However that is a static dependency in the Database class, which I presume is the user interface. I guess a solution for this is not too much effort. One solution that crossed my mind is to use dependency injection for a Connection object in the Database class. So instead of importing it, the Database class would accept an object of the Connection class during init. Then we can extend the Connection class for driver specific modifications and do so in separate packages. But I lack any detailed knowledge about the implementation and Python... Maybe we can schedule a call to discuss this. This got much longer than I thought already... |
I agree for the most part and like the idea of an extensible solution for driver support. Regarding the benefits/main goals of the package:
I'd argue the most important "benefit" from dbrequests and the reason it was built was to deliver an equivalent solution to dbtools and provide a smooth transition for users accustomed to its usage, essentially providing user interfaces like send_query and send_data. dbrequests does not follow Python conventions around database API's, specifically PEP 249. There are driver specific python packages which provide highly optimized utilities, security features etc above what dbrequests offers, eg psycopg2 for postgres/redshift and mariadb for mariadb. Shall the purpose of dbrequests be to provide an user interface to them different from PEP 249, which they are following? The other points you mention are either taken over by SQL alchemy or two-liners. I would follow your proposal to change the connection class to an extensible format to be able to inherit advantages from other connector packages and make it easier for you to take over at least partial ownership of further developments. Lets schedule a call and discuss! |
Short summary:
|
@phainom this issue gets a bit off topic but I have some general questions I would like to ask:
|
Yes, this is possible. Note that send_query always tries to return a pandas dataframe, if multiple select statements are sent, the last one will be the one returned. Note that the connections are closed by dbrequests also if the database was not set up in a with-statement, since connections is called inside a with-block in the database class.
Its a feature, not a bug. pymysql is only needed when used as a driver for connecting to a mysql/mariadb database. If you use another driver, thats fine (eg psycopg2-binary for connecting to postgres). dbrequests does not include any drivers and not directly depends on them. The reason pymysql is included in the pipfile is only for testing purposes. Direct dependencies from those packages can be outsourced to the child-packages by redefining the connection class (I already started working on this).
One could use compression (eg gzip) while writing to disk, or use hdf? Both options should speed up writing to a file considerably. For Redshift compressed files are supported, for mariadb via mysqlimport as well? |
I thought so. However we could keep the implementations for MariaDB and Redshift in this package, couldn't we? My concern was, that additional dependencies would be necessary. This way, the user only needs to install the driver package if they actually need it. And dbrequests is still relatively small and could probably cope with some other utilities. An import like:
was what I was thinking about. I saw that |
You are right, this would be another option, having the advantage of not needing additional packaging-boilerplating. |
I would like to see the dialect specific backends in this package side by side. Changing between dialects is then a matter of a change in the import statement. And we would have a good overview of the things possible. Not sure how to represent that in the file structure though: would we create a directory for each backend we implement? |
Exactly, one folder including This issue is still open though, the current Insert update Implementation would better fit into a mariadb subpackage. |
connection class as parameter in database class
Reopen since initial issue is still not fixed. |
Replacing duplicate primary keys is as of now done by creating a temporary table and exchanging the actual one later by it.
The syntax CREATE TEMPORARY TABLE ... LIKE is specific to certain databases like mariadb. It is not working for postgres and databases building on it, like Amazon Redshift. Thus, the full compatibility with those databases cannot be ensured.
There are different paths to resolving the issue:
The text was updated successfully, but these errors were encountered: