
Commit 39b2234

Better installation instructions and less restrictive dependency versions for easier install

ginic committed Mar 11, 2024
1 parent 425f293
Showing 2 changed files with 9 additions and 9 deletions.
README.md: 6 changes (3 additions & 3 deletions)

@@ -9,11 +9,11 @@ The `ihop` directory is a python module with submodules that can also be run as
 - `ihop.visualizations`: Visualization utilities to create T-SNE projections used in the cluster viewer applications
 - `ihop.utils`: Options to configure logging and Spark environment
 - `ihop.resources`: Data resources
-  - `ihop.resources.analogies`: Subreddit algebra analogies for tuning community2vec, taken from [social-dimensions](https://github.com/CSSLab/social-dimensions) with minor updates
+  - `ihop.resources.analogies`: Subreddit algebra analogies for tuning community2vec, taken from [social-dimensions](https://github.com/CSSLab/social-dimensions) with minor updates
   - `ihop.resources.collections`: Pre-defined collections of subreddits from the Media Cloud team.

 # External Dependencies
-- Python >= 3.7
+- Python >= 3.8, <3.11. Note that <3.11 is only necessary because of the `dev` dependency [irrCAC](https://pypi.org/project/irrcac/)
 - [Java](https://docs.oracle.com/en/java/javase/17/install/overview-jdk-installation.html) or [OpenJDK](https://openjdk.java.net/install/) (at least version 8). Make sure you have `JAVA_HOME` set appropriately
 - (Optional to support faster compression & customize Hadoop config for Spark) [Hadoop](https://hadoop.apache.org) at least version 3.3 is needed for Pyspark to properly decompress the Reddit zst files (see [this issue](https://stackoverflow.com/questions/64607248/configure-spark-on-yarn-to-use-hadoop-native-libraries) or [this one](https://stackoverflow.com/questions/67099204/reading-a-zst-archive-in-scala-spark-native-zstandard-library-not-available)). Install Hadoop and configure the environment variables using [these instructions](https://phoenixnap.com/kb/install-hadoop-ubuntu).
 - [unzstd](http://manpages.ubuntu.com/manpages/bionic/man1/unzstd.1.html) and [bzip2](https://www.sourceware.org/bzip2/) are used for recompressing the monthly Reddit dumps to bzip2 format, which Spark and Gensim are more readily able to handle than the zst files.
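For reference, the recompression mentioned in the last item is a two-step shell sketch (the dump filename is hypothetical, and the `--long=31` flag is an assumption for dumps compressed with a large zstd window):

    unzstd --long=31 RC_2021-06.zst   # decompress the monthly Reddit dump (hypothetical filename)
    bzip2 RC_2021-06                  # recompress to RC_2021-06.bz2, which Spark and Gensim handle directly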
@@ -22,7 +22,7 @@ The `ihop` directory is a python module with submodules that can also be run as
 # Setup and Installation
 Note that [Pyspark](https://spark.apache.org/docs/latest/api/python/getting_started/install.html#dependencies) is used for training embeddings and LDA models, so you must have Java installed (at least version 8) and `JAVA_HOME` set appropriately.

-Use [Anaconda](https://docs.anaconda.com/anaconda/install/index.html) to create the `ihop` environment by running `conda create -n ihop python=3.9`, then install the package and its dependencies using `pip install .`. Note that if you are using a Mac with an M1 or M2 chip, install sklearn and numpy *first* using conda, `conda install numpy==1.21.2 scikit-learn==1.0.1`. This installs everything needed just for preprocessing data and training models. If you're going to run the Dash app, install using `pip install .[app]`.
+Use [Anaconda](https://docs.anaconda.com/anaconda/install/index.html) to create the `ihop` environment by running `conda create -n ihop python=3.10`, then install the package and its dependencies using `pip install .`. This installs everything needed just for preprocessing data and training models. If you're going to run the Dash app, install using `pip install .[app]`. Note that if you are using a Mac with an M1 or M2 chip, install sklearn and numpy *first* using conda, `conda install numpy scikit-learn`.

 For testing and development tools, install the `ihop` package in editable mode using `pip install -e .[test,dev]`

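Taken together, the updated installation flow is roughly the following shell session (a sketch assembled from the commands in the README above; `conda activate` and the version checks are standard additions, not from the diff):

    java -version                    # Pyspark requires Java 8+ with JAVA_HOME set
    echo $JAVA_HOME
    conda create -n ihop python=3.10
    conda activate ihop
    # On Apple silicon (M1/M2) only, install these through conda first:
    # conda install numpy scikit-learn
    pip install .                    # preprocessing and model training only
    pip install .[app]               # additionally pulls in the Dash app
    pip install -e .[test,dev]       # editable install with test and dev tools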
setup.cfg: 12 changes (6 additions & 6 deletions)

@@ -1,27 +1,27 @@
 [metadata]
 name = ihop
-version = 2.0.0
+version = 2.1.0
 description = Tools for clustering Reddit based on NLP techniques
-long_description = = file: README.md
+long_description = file: README.md
 long_description_content_type = text/markdown
 license_files = LICENSE

 [options]
-python_requires = <3.10, >=3.8
+python_requires = >=3.8,<3.11
 packages = find:
 include_package_data = True
 install_requires =
     dvc[s3]==3.27.0
     gensim==4.1.2
-    joblib==1.1.0
-    numpy==1.21.2
+    joblib>=1.1.0
+    numpy>=1.21.0
     matplotlib==3.5.0
     pandas==1.3.5
     pyspark>=3.2.0
     pytimeparse==1.1.8
     scipy
     s3fs[boto3]>=2022.3.0
-    scikit-learn==1.0.1
+    scikit-learn>=1.4.0

 [options.extras_require]
 app =
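Because the pins above are now lower bounds rather than exact versions, the resolver may select newer releases; one way to confirm what was actually installed (a hedged sketch using standard pip and Python invocations):

    pip show joblib numpy scikit-learn
    python -c "import joblib, numpy, sklearn; print(joblib.__version__, numpy.__version__, sklearn.__version__)"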
