More documentation fixes. (#277)
* More documentation fixes.

- Use raw strings to escape newlines in examples
- Read examples from README.rst instead of copy-pasting them
- Add help.txt to let people read the full module docstring without
  downloading and installing smart_open

* improve S3 documentation

* add SSH to top level documentation

* add null handler for logging

* Update smart_open/s3.py

Co-Authored-By: mpenkov <penkov@pm.me>

* respond to review comments

* move null handler to __init__.py

* fix indentation
mpenkov authored Apr 1, 2019
1 parent cce1732 commit afe99b2
Showing 10 changed files with 326 additions and 49 deletions.
6 changes: 6 additions & 0 deletions README.rst
@@ -19,6 +19,8 @@ What?
``smart_open`` is well-tested, well-documented, and has a simple, Pythonic API:


.. _doctools_before_examples:

.. code-block:: python

    >>> from smart_open import open
@@ -77,12 +79,16 @@ Other examples of URLs that ``smart_open`` accepts::

    [ssh|scp|sftp]://username@host/path/file
    file:///home/user/file.xz

.. _doctools_after_examples:

For detailed API info, see the online help:

.. code-block:: python

    help('smart_open')

or click `here <https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt>`__ to view the help in your browser.

More examples:

.. code-block:: python
266 changes: 266 additions & 0 deletions help.txt
@@ -0,0 +1,266 @@
Help on package smart_open:

NAME
smart_open

FILE
/home/misha/git/smart_open/smart_open/__init__.py

DESCRIPTION
Utilities for streaming to/from several file-like data storages: S3 / HDFS / local
filesystem / compressed files, and many more, using a simple, Pythonic API.

The streaming makes heavy use of generators and pipes, to avoid loading
full file contents into memory, allowing work with arbitrarily large files.

The main functions are:

* `open()`, which opens the given file for reading/writing
* `s3_iter_bucket()`, which goes over all keys in an S3 bucket in parallel
* `register_compressor()`, which registers callbacks for transparent compressor handling

PACKAGE CONTENTS
doctools
hdfs
http
s3
smart_open_lib
ssh
tests (package)
webhdfs

FUNCTIONS
open(uri, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None, ignore_ext=False, transport_params={})
Open the URI object, returning a file-like object.

The URI is usually a string in a variety of formats:

1. a URI for the local filesystem: `./lines.txt`, `/home/joe/lines.txt.gz`, `file:///home/joe/lines.txt.bz2`
2. a URI for HDFS: `hdfs:///some/path/lines.txt`
3. a URI for Amazon's S3 (can also supply credentials inside the URI):
`s3://my_bucket/lines.txt`, `s3://my_aws_key_id:key_secret@my_bucket/lines.txt`

The URI may also be one of:

- an instance of the pathlib.Path class
- a stream (anything that implements io.IOBase-like functionality)

This function supports transparent compression and decompression using the
following codecs:

- ``.gz``
- ``.bz2``
- ``.xz``

The function depends on the file extension to determine the appropriate codec.
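
For example (an illustrative sketch; the local path is hypothetical), the
extension alone selects the codec, and ``ignore_ext=True`` bypasses it:

>>> # writing to a .gz path compresses transparently (hypothetical path):
>>> with open('/tmp/example.txt.gz', 'w') as fout:
...     n = fout.write('hello world\n')
>>> # ignore_ext=True skips decompression and yields the raw gzip bytes:
>>> with open('/tmp/example.txt.gz', 'rb', ignore_ext=True) as fin:
...     fin.read(2)
b'\x1f\x8b'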

Parameters
----------
uri: str or object
The object to open.
mode: str, optional
Mimics built-in open parameter of the same name.
buffering: int, optional
Mimics built-in open parameter of the same name.
encoding: str, optional
Mimics built-in open parameter of the same name.
errors: str, optional
Mimics built-in open parameter of the same name.
newline: str, optional
Mimics built-in open parameter of the same name.
closefd: boolean, optional
Mimics built-in open parameter of the same name. Ignored.
opener: object, optional
Mimics built-in open parameter of the same name. Ignored.
ignore_ext: boolean, optional
Disable transparent compression/decompression based on the file extension.
transport_params: dict, optional
Additional parameters for the transport layer (see notes below).

Returns
-------
A file-like object.

Notes
-----
smart_open has several implementations for its transport layer (e.g. S3, HTTP).
Each transport layer has a different set of keyword arguments for overriding
default behavior. If you specify a keyword argument that is *not* supported
by the transport layer being used, smart_open will ignore that argument and
log a warning message.

S3 (for details, see :mod:`smart_open.s3` and :func:`smart_open.s3.open`):

buffer_size: int, optional
The buffer size to use when performing I/O.
min_part_size: int, optional
The minimum part size for multipart uploads. For writing only.
session: object, optional
The S3 session to use when working with boto3.
resource_kwargs: dict, optional
Keyword arguments to use when creating a new resource. For writing only.
multipart_upload_kwargs: dict, optional
Additional parameters to pass to boto3's initiate_multipart_upload function.
For writing only.

HTTP (for details, see :mod:`smart_open.http` and :func:`smart_open.http.open`):

kerberos: boolean, optional
If True, will attempt to use the local Kerberos credentials.
user: str, optional
The username for authenticating over HTTP.
password: str, optional
The password for authenticating over HTTP.

WebHDFS (for details, see :mod:`smart_open.webhdfs` and :func:`smart_open.webhdfs.open`):

min_part_size: int, optional
For writing only.

SSH (for details, see :mod:`smart_open.ssh` and :func:`smart_open.ssh.open`):

mode: str, optional
The mode to use for opening the file.
host: str, optional
The hostname of the remote machine. May not be None.
user: str, optional
The username to use to login to the remote machine.
If None, defaults to the name of the current user.
port: int, optional
The port to connect to.
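
As a hedged sketch (bucket name, key, and profile are hypothetical), the
transport parameters above are passed to ``open`` as a dict:

>>> import boto3
>>> from smart_open import open
>>> session = boto3.Session(profile_name='my-profile')  # hypothetical profile
>>> params = {'session': session, 'min_part_size': 5 * 1024**2}
>>> with open('s3://my_bucket/my_key.txt', 'wb', transport_params=params) as fout:
...     n = fout.write(b'hello')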


Examples
--------
>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
... print(repr(line))
... break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
... print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
... with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
... for line in fin:
... fout.write(line)

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
... for line in fin:
... print(repr(line.decode('utf-8')))
... break
... offset = fin.seek(0) # seek to the beginning
... print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
... print(repr(line))
... break
'<!doctype html>\n'

Other examples of URLs that ``smart_open`` accepts::

s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
~/local/path/file
local/path/file
./local/path/file.gz
file:///home/user/file
file:///home/user/file.bz2
[ssh|scp|sftp]://username@host//path/file
[ssh|scp|sftp]://username@host/path/file


See Also
--------
- `Standard library reference <https://docs.python.org/3.7/library/functions.html#open>`__
- `smart_open README.rst <https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst>`__

register_compressor(ext, callback)
Register a callback for transparently decompressing files with a specific extension.

Parameters
----------
ext: str
The extension.
callback: callable
The callback. It must accept two positional arguments, file_obj and mode.

Examples
--------

Instruct smart_open to use the identity function whenever opening a file
with a .foo extension:

>>> def identity(file_obj, mode):
... return file_obj
>>> register_compressor('.foo', identity)
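
A more realistic callback would wrap the stream in a codec object; here is a
sketch using the standard library's gzip module (the .gz2 extension is
hypothetical):

>>> import gzip
>>> def handle_gz2(file_obj, mode):
...     return gzip.GzipFile(fileobj=file_obj, mode=mode)
>>> register_compressor('.gz2', handle_gz2)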

s3_iter_bucket = iter_bucket(bucket_name, prefix='', accept_key=None, key_limit=None, workers=16, retries=3)
Iterate and download all S3 objects under `s3://bucket_name/prefix`.

Parameters
----------
bucket_name: str
The name of the bucket.
prefix: str, optional
Limits the iteration to keys starting with the prefix.
accept_key: callable, optional
This is a function that accepts a key name (unicode string) and
returns True/False, signalling whether the given key should be downloaded.
The default behavior is to accept all keys.
key_limit: int, optional
If specified, the iterator will stop after yielding this many results.
workers: int, optional
The number of subprocesses to use.
retries: int, optional
The number of times to retry a failed download.

Yields
------
str
The full key name (does not include the bucket name).
bytes
The full contents of the key.

Notes
-----
The keys are processed in parallel, using `workers` processes (default: 16),
which greatly speeds up downloads. If multiprocessing is unavailable (i.e.
_MULTIPROCESSING is False), this parameter is ignored.

Examples
--------

>>> # get all JSON files under "mybucket/foo/"
>>> for key, content in iter_bucket(bucket_name, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
... print(key, len(content))

>>> # limit to 10k files, using 32 parallel workers (default is 16)
>>> for key, content in iter_bucket(bucket_name, key_limit=10000, workers=32):
... print(key, len(content))

smart_open(uri, mode='rb', **kw)
Deprecated, use smart_open.open instead.

DATA
__all__ = ['open', 'smart_open', 's3_iter_bucket', 'register_compresso...


5 changes: 5 additions & 0 deletions smart_open/__init__.py
@@ -22,6 +22,11 @@
"""

import logging

from .smart_open_lib import open, smart_open, register_compressor
from .s3 import iter_bucket as s3_iter_bucket
__all__ = ['open', 'smart_open', 's3_iter_bucket', 'register_compressor']

logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())
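
Attaching a NullHandler at the package root is the standard convention for
library logging: it suppresses "no handlers could be found" warnings while
leaving output policy to the application. A minimal sketch of how an
application would opt in to smart_open's logs (illustrative, not part of this
commit):

    import logging

    import smart_open  # the package only attaches a NullHandler to its logger

    # the application decides whether and how library logs appear:
    logging.basicConfig(level=logging.INFO)
    logging.getLogger('smart_open').setLevel(logging.DEBUG)
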
32 changes: 32 additions & 0 deletions smart_open/doctools.py
@@ -13,6 +13,8 @@

import inspect
import io
import os.path
import re


def extract_kwargs(docstring):
@@ -121,3 +123,33 @@ def to_docstring(kwargs, lpad=''):
    for line in description:
        buf.write('%s %s\n' % (lpad, line))
    return buf.getvalue()


def extract_examples_from_readme_rst(indent=' '):
    """Extract examples from this project's README.rst file.

    Parameters
    ----------
    indent: str
        Prepend each line with this string. Should contain some number of spaces.

    Returns
    -------
    str
        The examples.

    Notes
    -----
    Quite fragile, depends on named labels inside the README.rst file.

    """
    curr_dir = os.path.dirname(os.path.abspath(__file__))
    readme_path = os.path.join(curr_dir, '..', 'README.rst')
    try:
        with open(readme_path) as fin:
            lines = list(fin)
        start = lines.index('.. _doctools_before_examples:\n')
        end = lines.index('.. _doctools_after_examples:\n')
        lines = lines[start+4:end-2]
        return ''.join([indent + re.sub('^ ', '', l) for l in lines])
    except Exception:
        return indent + 'See README.rst'
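
A quick way to exercise this helper from a source checkout (the snippet is
illustrative; it relies only on the function defined above):

    from smart_open import doctools

    # prints the README examples with the given indent, or the
    # 'See README.rst' fallback when the labels cannot be found
    print(doctools.extract_examples_from_readme_rst(indent='    '))
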
1 change: 0 additions & 1 deletion smart_open/hdfs.py
@@ -19,7 +19,6 @@
import subprocess

 logger = logging.getLogger(__name__)
-logger.addHandler(logging.NullHandler())


def open(uri, mode):
1 change: 0 additions & 1 deletion smart_open/http.py
@@ -11,7 +11,6 @@
DEFAULT_BUFFER_SIZE = 128 * 1024

 logger = logging.getLogger(__name__)
-logger.addHandler(logging.NullHandler())


_HEADERS = {'Accept-Encoding': 'identity'}
10 changes: 5 additions & 5 deletions smart_open/s3.py
@@ -11,7 +11,6 @@
import six

 logger = logging.getLogger(__name__)
-logger.addHandler(logging.NullHandler())

# Multiprocessing is unavailable in App Engine (and possibly other sandboxes).
# The only method currently relying on it is iter_bucket, which is instructed
@@ -77,16 +76,17 @@ def open(
 key_id: str
     The name of the key within the bucket.
 mode: str
-    The mode with which to open the object. Must be either rb or wb.
+    The mode for opening the object. Must be either "rb" or "wb".
 buffer_size: int, optional
     The buffer size to use when performing I/O.
-min_part_size: int
-    For writing only.
+min_part_size: int, optional
+    The minimum part size for multipart uploads. For writing only.
 session: object, optional
     The S3 session to use when working with boto3.
 resource_kwargs: dict, optional
-    Keyword arguments to use when creating a new resource.
+    Keyword arguments to use when creating a new resource. For writing only.
 multipart_upload_kwargs: dict, optional
     Additional parameters to pass to boto3's initiate_multipart_upload function.
+    For writing only.
 """
