Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MongoDB/PyMongo: Amalgamate PyMongo driver to use CrateDB as backend #83

Merged
merged 6 commits into from
Jul 6, 2024

Conversation

amotl
Copy link
Member

@amotl amotl commented Nov 27, 2023

About

This patch amalgamates the PyMongo driver to talk to CrateDB instead of a MongoDB server.

Documentation

Preview: https://cratedb-toolkit--83.org.readthedocs.build/adapter/pymongo.html

Synopsis

import pymongo
from cratedb_toolkit.adapter.pymongo import PyMongoCrateDBAdapter

with PyMongoCrateDBAdapter(dburi="crate://crate@localhost:4200"):
    client = pymongo.MongoClient("localhost", 27017)
    collection = client.foo.bar

    inserted_id = collection.insert_one({"author": "Mike", "text": "My first blog post!"}).inserted_id
    print(inserted_id)

    document = collection.find({"author": "Mike"})
    print(document)

Status

It can run 95% of the MongoDB "getting started" tutorial successfully.

-- https://pymongo.readthedocs.io/en/stable/tutorial.html
-- test_pymongo.py

Backlog

  • Software tests
  • Proper CI setup / configuration
  • Documentation

@amotl amotl force-pushed the pymongo-adapter branch 3 times, most recently from 27d19ad to be8f833 Compare November 28, 2023 23:30
Copy link

codecov bot commented Nov 28, 2023

Codecov Report

Attention: Patch coverage is 87.29412% with 54 lines in your changes missing coverage. Please review.

Project coverage is 82.01%. Comparing base (2851bb7) to head (746d7c2).

Files Patch % Lines
cratedb_toolkit/adapter/pymongo/cursor.py 78.92% 43 Missing ⚠️
cratedb_toolkit/adapter/pymongo/reactor.py 88.88% 4 Missing ⚠️
cratedb_toolkit/adapter/pymongo/collection.py 95.94% 3 Missing ⚠️
cratedb_toolkit/util/pandas.py 92.30% 3 Missing ⚠️
cratedb_toolkit/util/common.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #83       +/-   ##
===========================================
+ Coverage   69.35%   82.01%   +12.65%     
===========================================
  Files          73       81        +8     
  Lines        2826     3247      +421     
===========================================
+ Hits         1960     2663      +703     
+ Misses        866      584      -282     
Flag Coverage Δ
influxdb 30.73% <0.94%> (?)
main 60.45% <1.41%> (-8.91%) ⬇️
mongodb 39.05% <0.94%> (?)
pymongo 42.13% <86.82%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines 18 to 33
## Synopsis
```python
import pymongo
from cratedb_toolkit.adapter.pymongo import PyMongoCrateDbAdapter

with PyMongoCrateDbAdapter(dburi="crate://crate@localhost:4200"):
client = pymongo.MongoClient("localhost", 27017)
collection = client.foo.bar

inserted_id = collection.insert_one({"author": "Mike", "text": "My first blog post!"}).inserted_id
print(inserted_id)

document = collection.find_one({"author": "Mike"})
print(document)
```
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the "naming things" aspect: I am still not sure whether to say PyMongoCrateDbAdapter or PyMongoCrateDBAdapter. What do you think is better?

On another recent occasion, we used the name CrateDBTestAdapter. So, if we agree to always use CrateDB with capitalized DB, this spot will need to be adjusted.

/cc @pilosus, @surister, @matriv, @seut

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I like PyMongoCrateDBAdapter slightly better, contrary to what I've initially written down. But iirc, @matriv has a different stance on this. Maybe because he is used to Java naming conventions, being somewhat stricter on this topic?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep let's go with DB both captitals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Updated with 6d2730a.

Comment on lines 35 to 39
## Examples

To inspect the capabilities of the driver adapter, there is a basic program at
[pymongo_adapter.py], and test cases at [test_pymongo.py].
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 592442e.

amotl added 2 commits July 6, 2024 18:04
Setting the CrateDB column policy to `dynamic` means that new columns
can be added without needing to explicitly change the table definition
by running corresponding `ALTER TABLE` statements.

-- https://cratedb.com/docs/crate/reference/en/latest/general/ddl/column-policy.html#dynamic
@amotl amotl force-pushed the pymongo-adapter branch 2 times, most recently from 2b76fe7 to c925151 Compare July 6, 2024 17:46
Comment on lines +54 to +59
def patch_types_map():
"""
Register missing timestamp data type.
"""
# TODO: Submit patch to `crate-python`.
TYPES_MAP["timestamp without time zone"] = sqltypes.TIMESTAMP
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch should be submitted to sqlalchemy-cratedb.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a corresponding note to the backlog.

Copy link
Member Author

@amotl amotl Jul 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those patches might also be submitted to sqlalchemy-cratedb in one way or another.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a corresponding note to the backlog.

Comment on lines +43 to +63
def adjust_sqlalchemy(self):
"""
Configure CrateDB SQLAlchemy dialect.

Setting the CrateDB column policy to `dynamic` means that new columns
can be added without needing to explicitly change the table definition
by running corresponding `ALTER TABLE` statements.

https://cratedb.com/docs/crate/reference/en/latest/general/ddl/column-policy.html#dynamic
"""
# 1. Patch data types for CrateDB dialect.
# TODO: Upstream to `sqlalchemy-cratedb`.
patch_types_map()

# 2. Prepare pandas.
# TODO: Provide unpatching hook.
# TODO: Use `with table_kwargs(...)`.
from cratedb_toolkit.util.pandas import patch_pandas_sqltable_with_dialect_parameters

patch_pandas_sqltable_with_dialect_parameters(table_kwargs={"crate_column_policy": "'dynamic'"})
patch_pandas_sqltable_with_extended_mapping()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be generalized / upstreamed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a corresponding note to the backlog.

Comment on lines +66 to +91
def insert_returning_id(pd_table, conn, keys, data_iter):
"""
Use CrateDB's "bulk operations" endpoint as a fast path for pandas' and Dask's `to_sql()` [1] method.

The idea is to break out of SQLAlchemy, compile the insert statement, and use the raw
DBAPI connection client, in order to be able to amend the SQL statement, adding a
`RETURNING _id` clause.

The vanilla implementation, used by SQLAlchemy, is::

data = [dict(zip(keys, row)) for row in data_iter]
conn.execute(pd_table.table.insert(), data)
"""
nonlocal object_id_cratedb

# Compile SQL statement and materialize batch.
sql = str(pd_table.table.insert().compile(bind=conn))
data = list(data_iter)

# Invoke amended insert operation, returning the record
# identifier as surrogate to MongoDB's `ObjectId`.
cursor = conn._dbapi_connection.cursor()
cursor.execute(sql=sql + " RETURNING _id", parameters=data[0])
outcome = cursor.fetchone()
object_id_cratedb = outcome[0]
cursor.close()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to be upstreamed like sqlalchemy-cratedb's insert_bulk?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a corresponding note to the backlog.

amotl added 4 commits July 6, 2024 20:37
With corresponding improvements, the amalgamated PyMongo driver can now
run 95% of the MongoDB "getting started" tutorial successfully.
It needs to balance SQLAlchemy 1.x vs. 2.x throughout the toolkit test
cases, because JessiQL still uses SQLAlchemy 1.x.
Use `PyMongoCrateDBAdapter`, with capitalized "DB".
@amotl amotl merged commit 19c5cff into main Jul 6, 2024
23 checks passed
@amotl amotl deleted the pymongo-adapter branch July 6, 2024 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants