A Berkeley library for introductory data science.
Written by Professor John DeNero, Professor David Culler, Sam Lau, and Alvin Wan.
For an example of usage, see the Berkeley Data 8 class.
Use pip:
pip install datascience
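After installing, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and is not part of datascience.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if datascience is installed, otherwise None.
print(installed_version("datascience"))
```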
This project adheres to Semantic Versioning.
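Under Semantic Versioning, a version has the form MAJOR.MINOR.PATCH, and the component that changes signals the kind of release. The sketch below shows one way to compare two such version strings; the function names and the version numbers are illustrative assumptions, not actual datascience releases, and pre-release suffixes are not handled.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def bump_kind(old, new):
    """Name the first component that differs between two versions."""
    names = ("major", "minor", "patch")
    for name, before, after in zip(names, parse_version(old), parse_version(new)):
        if before != after:
            return name
    return "none"


# Hypothetical version numbers, used only for illustration.
print(bump_kind("0.9.5", "0.10.0"))
print(bump_kind("1.4.2", "2.0.0"))
```

Comparing the parsed tuples avoids the pitfall of comparing raw strings, where "0.10.0" would sort before "0.9.5".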
Switch from pandas.read_table to pandas.read_csv, to avoid deprecation warnings. Shouldn't change the behavior of the library.
Table.append_column now returns the table it is modifying.
- Add shuffle function to Table.
- Added join for multiple columns.
- Allow NumPy arrays to be appended into tables.
- Added optional formatters to "Table.with_column", "Table.with_columns", and "Table.append_column".
- Warning added for comparing iterables using predicates incorrectly.
- 'move_column' added.
- Created new methods 'first' and 'last'.
- 'append_column' now returns the table it is modifying.
- 'move_to_end' and 'move_to_start' can now take integer labels.
- Fixes test suite and removes all deprecated code in the test suite caused by deprecated API calls from the datascience library.
- Adds hist_of_counts function
- Fixes minor issues introduced by matplotlib 2.x upgrade (data-8#315)
- Fixes a bug in HTML table generation (data-8#315)
- Add sample_proportions function.
- Fix OrderedDict bug in Table.hist.
- Fix CurrencyFormatter to handle commas.
- Fix Table.hist to keep histograms in the order of the columns.
- Fix join so that it keeps all rows in the inner join of two tables.
- Added group_barh and group_bar to plot counts by a grouping category, a common use case.
- Added options to hist to produce a histogram for each group on a column.
- Deprecated Table method pivot_hist. Added an option to hist to simulate pivot_hist's behavior.
- DistributionFormatter added.
- Fix bug for relabeled columns that had a format already.
- Values bound to circles now determine the circle's area, not its radius.
- Scatter diagrams can take data-driven size and color parameters.
- Changed signature of apply, hist, and bin to accept multiple columns without a list
- Deprecate hist argument name counts in favor of bin_column
- Rename various positional args (technically could break some code, but won't)
- Unified with_column and with_columns (not a breaking change)
- Unified group and groups (not a breaking change)
- Added "Table.remove"
- Added proportions_from_distribution method to datascience.util. (993e3d2)
- Table.column now throws a descriptive ValueError instead of a KeyError when the column isn't in the table. (ef8b319)
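To illustrate the difference between a bare KeyError and a descriptive ValueError for a missing column, here is a small sketch. `TinyTable` is a hypothetical stand-in written for this example; it is not the datascience Table class, and the error message wording is an assumption.

```python
class TinyTable:
    """Hypothetical stand-in for a table; not the datascience Table class."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def column(self, label):
        # Check explicitly so the caller gets a message that names the
        # missing label and lists the labels that do exist.
        if label not in self._columns:
            raise ValueError(
                "The column {!r} is not in the table. Available columns: {}".format(
                    label, ", ".join(self._columns)
                )
            )
        return self._columns[label]


table = TinyTable({"name": ["Ani", "Sam"], "age": [21, 22]})
try:
    table.column("height")
except ValueError as error:
    print(error)
```

Code that previously caught KeyError around a column lookup would need to catch ValueError instead, since neither exception type is a subclass of the other.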
Breaking changes
- Change default behavior of table.sample to with_replacement=True instead of False. (3717b67)
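The practical effect of sampling with replacement versus without can be sketched with the standard library alone. This example uses Python's `random` module on a plain list of hypothetical rows; it illustrates the two sampling modes and is not how table.sample is implemented.

```python
import random

rows = ["a", "b", "c", "d"]  # hypothetical rows
rng = random.Random(0)       # seeded so repeated runs draw the same samples

# With replacement: each draw is independent, so a row may appear more
# than once and the sample may be larger than the population.
with_replacement = rng.choices(rows, k=6)

# Without replacement: each row appears at most once, so the sample
# size cannot exceed len(rows).
without_replacement = rng.sample(rows, k=4)

print(with_replacement)
print(without_replacement)
```

Code that relied on the old default to get distinct rows would need to request sampling without replacement explicitly.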
Additions
- Added Map.copy.
- Added Map.overlay, which overlays one or more features on a new copy of a Map. (315bb63e)
- Remove rogue print from table.hist
- Added predicates for string comparison: containing and contained_in. (#231)
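The two string predicates point in opposite directions, which is easy to mix up. The sketch below writes them as plain Python closures under the assumption that `containing` tests whether a value contains a given substring and `contained_in` tests whether a value appears inside a given longer string. These standalone functions are illustrative and are not the library's own predicate objects.

```python
def containing(substring):
    """Predicate: true for strings that contain `substring`."""
    return lambda value: substring in value


def contained_in(superstring):
    """Predicate: true for strings that appear inside `superstring`."""
    return lambda value: value in superstring


names = ["Berkeley", "Oakland", "Albany"]

# Keep names that contain the substring "an".
print([name for name in names if containing("an")(name)])

# Keep names that appear somewhere inside the longer string.
print([name for name in names if contained_in("Berkeley, CA")(name)])
```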
API reference is at http://data8.org/datascience/ .
The required environment for installation and tests is the Anaconda Python3 distribution.
If you encounter an Image not found error on Mac OSX, you may need an XQuartz upgrade.
Start by cloning this repository:
git clone https://github.com/data-8/datascience
Install the dependencies into a Conda environment with:
conda env create -f osx_environment.yml -n datascience
# For Linux, use
conda env create -f linux_environment.yml -n datascience
Source the environment to use the correct packages while developing:
source activate datascience
# `source deactivate` will unload the environment
The above command must be run each time you develop in the package. You can also install direnv to auto-load/unload the environment.
Install datascience locally with:
make install
Then, run the tests:
make test
After that, go ahead and start hacking!
Documentation is generated from the docstrings in the methods and is pushed online at http://data8.org/datascience/ automatically. If you want to preview the docs locally, use these commands:
make docs # Generates docs inside doc/ folder
make serve_docs # Starts a local server to view docs
We use Zenhub to organize development on this library. To get started, install the Zenhub Chrome Extension.
Then navigate to the issue board or press b. You'll see a board whose pipelines are used as follows:
- New Issues are issues that are just created and haven't been prioritized.
- Backlogged issues are issues that are not high priority, like nice-to-have features.
- To Do issues are high priority and should get done ASAP, such as breaking bugs or functionality that we need to lecture on soon.
- Once someone has been assigned to an issue, that issue should be moved into the In Progress column.
- When the task is complete, we close the related issue.
Here's an example workflow.
- John creates an issue called "Everything is breaking". It goes into the New Issues pipeline at first.
- This issue is important, so John immediately moves it into the To Do pipeline. Since he has to go lecture for 61A, he doesn't assign it to himself right away.
- Sam sees the issue, assigns himself to it, and moves it into the In Progress pipeline.
- After everything is fixed, Sam closes the issue.
Here's another example.
- Ani creates an issue asking for beautiful histograms. Like before, it goes into the New Issues pipeline.
- John decides that the issue is not as high priority right now because other things are breaking, so he moves it into the Backlog pipeline.
- When he has some more time, John assigns himself the issue and moves it into the In Progress pipeline.
- Once the issue is finished, he closes the issue.
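The two workflows above can be read as a small state machine. The sketch below models only the moves that appear in those examples. The `Issue` class, the "Closed" state name, and the rule that an issue needs an assignee before entering In Progress are assumptions made for illustration; this is not a Zenhub API.

```python
# Moves observed in the two example workflows, plus closing an issue.
# "Closed" is an assumed label for the terminal state.
ALLOWED_MOVES = {
    "New Issues": {"To Do", "Backlog"},
    "To Do": {"In Progress"},
    "Backlog": {"In Progress"},
    "In Progress": {"Closed"},
    "Closed": set(),
}


class Issue:
    """Hypothetical model of an issue on the board."""

    def __init__(self, title):
        self.title = title
        self.pipeline = "New Issues"  # every new issue starts here
        self.assignee = None

    def assign(self, person):
        self.assignee = person

    def move_to(self, pipeline):
        if pipeline not in ALLOWED_MOVES[self.pipeline]:
            raise ValueError(f"cannot move from {self.pipeline!r} to {pipeline!r}")
        if pipeline == "In Progress" and self.assignee is None:
            raise ValueError("assign someone before moving to 'In Progress'")
        self.pipeline = pipeline


# First example: high priority, picked up by Sam.
issue = Issue("Everything is breaking")
issue.move_to("To Do")
issue.assign("Sam")
issue.move_to("In Progress")
issue.move_to("Closed")
print(issue.title, "->", issue.pipeline)
```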
To publish a release to PyPI, run:
python setup.py sdist upload -r pypi