
Compatibility with PySpark 3.0.0 and NumPy #4

Open
browshanravan opened this issue Aug 31, 2020 · 0 comments
browshanravan commented Aug 31, 2020

When I run StratifiedCrossValidator in place of CrossValidator in my pipeline, I get the following error. I suspect it relates to the newer versions of PySpark and/or NumPy, since spark_stratifier pins pyspark==2.3.2 and numpy==1.15.1 as install dependencies.

Are there any plans to upgrade the package?

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-4-e237f44298bb> in <module>
    237 
    238 
--> 239 cvModel = crossval.fit(train)
    240 predictions = cvModel.transform(test)

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
    127                 return self.copy(params)._fit(dataset)
    128             else:
--> 129                 return self._fit(dataset)
    130         else:
    131             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/spark_stratifier/stratifier.py in _fit(self, dataset)
     45     metrics = [0.0] * numModels
     46 
---> 47     stratified_data = self.stratify_data(dataset)
     48 
     49     for i in range(nFolds):

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/spark_stratifier/stratifier.py in stratify_data(self, dataset)
     26     split_ratio = 1.0 / nFolds
     27 
---> 28     passes = dataset[dataset['label'] == 1]
     29     fails = dataset[dataset['label'] == 0]
     30 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/pyspark/sql/dataframe.py in __getitem__(self, item)
   1378         """
   1379         if isinstance(item, basestring):
-> 1380             jc = self._jdf.apply(item)
   1381             return Column(jc)
   1382         elif isinstance(item, Column):

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    135                 # Hide where the exception came from that shows a non-Pythonic
    136                 # JVM exception message.
--> 137                 raise_from(converted)
    138             else:
    139                 raise

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/pyspark/sql/utils.py in raise_from(e)

AnalysisException: Cannot resolve column name "label" among (type, amount, oldbalanceOrg, newbalanceOrig, isFraud, sample_weight_per_class);
@browshanravan browshanravan changed the title Compatibility with PySpark 3.0.0 Compatibility with PySpark 3.0.0 and NumPy Aug 31, 2020