wer_standardize error #97

Open
em-chiu opened this issue Nov 6, 2024 · 7 comments

em-chiu commented Nov 6, 2024

I used the code below to try to process the data, but got a ValueError:

import jiwer

string_wer_data["standardized_ref"] = jiwer.wer_standardize((string_wer_data["reference"]))
string_wer_data["standardized_hyp"] = jiwer.wer_standardize((string_wer_data["hypothesis"]))

I was emulating nikvaessen's response to another issue:

import jiwer

jiwer.wer(
  outputs_true, 
  outputs_pred,
  reference_transform=jiwer.wer_standardize, 
  hypothesis_transform=jiwer.wer_standardize
)

Originally posted by @nikvaessen in #85 (comment)

ValueError:

Traceback (most recent call last):
  File "/Users/emily/Desktop/whisper-input/wer_modules.py", line 73, in <module>
    string_wer_data["standardized_ref"] = jiwer.wer_standardize((string_wer_data["reference"]))
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/emily/miniconda3/lib/python3.12/site-packages/jiwer/transforms.py", line 130, in __call__
    text = tr(text)
           ^^^^^^^^
  File "/Users/emily/miniconda3/lib/python3.12/site-packages/jiwer/transforms.py", line 87, in __call__
    raise ValueError(
ValueError: input 0 

nikvaessen (Collaborator) commented Nov 7, 2024

Can you verify that the following assertions hold for string_wer_data["reference"]:

assert isinstance(string_wer_data["reference"], list)
assert all(isinstance(e, str) for e in string_wer_data["reference"])
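
A bare `assert` only reports that a check failed, not why. As a minimal sketch (the helper name `check_list_of_str` is hypothetical and not part of jiwer), the same two checks can be written so that they report the actual type that was found:

```python
def check_list_of_str(value):
    """Return a list of problems; an empty list means value is a list of str."""
    problems = []
    # First assertion: the container itself must be a plain Python list.
    if not isinstance(value, list):
        problems.append(f"expected list, got {type(value).__name__}")
        return problems
    # Second assertion: every element must be a str.
    for i, element in enumerate(value):
        if not isinstance(element, str):
            problems.append(f"element {i} is {type(element).__name__}, not str")
    return problems


print(check_list_of_str(["a b", "c d"]))
print(check_list_of_str(("a b", "c d")))
```

If the reported type turns out to be a pandas Series rather than a list, `Series.tolist()` returns a plain Python list that could be checked the same way.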

em-chiu (Author) commented Nov 12, 2024

I verified that those assertions are true.

em-chiu (Author) commented Dec 4, 2024

Thanks so much for looking into this. Is there any other info I can provide?

nikvaessen (Collaborator) commented Dec 4, 2024

Can you provide a minimal code sample which reproduces the error?

em-chiu (Author) commented Dec 8, 2024

import pandas as pd
import numpy as np

import jiwer
import glob
import os

hypotheses = []
references = []

# glob to get file list-- directory name, glob will give list of all files
hyps = glob.glob('/Users/emily/Desktop/whisper-output-turbo-word/*.txt')
refs = glob.glob('/Users/emily/Desktop/cd-reference-txt/*.txt')

for hyp, ref in zip(hyps, refs):
    with open(hyp) as hypfile, open(ref) as reffile: #loops through each file #hypfile = file handle
        hyp_read = hypfile.read() #get file contents w/ .read() method
        ref_read = reffile.read()
        hypotheses.append(hyp_read) #append() can add whole list as element to list
        references.append(ref_read)

whisper_wer_data = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))

string_wer_data = whisper_wer_data.astype("string") #creates copy of df in str type


string_wer_data["standardized_ref"] = jiwer.wer_standardize((string_wer_data["reference"]))
string_wer_data["standardized_hyp"] = jiwer.wer_standardize((string_wer_data["hypothesis"]))

The files are on my local machine. Is there anything else I should provide?

nikvaessen (Collaborator) commented

Preferably, it should be a minimal code sample that does not need to read any data from disk. This way, I can reproduce the error on my computer and debug the issue.

I would suggest that you try to isolate the particular string which causes the ValueError. You could loop over each file, read the content, and separately call wer_standardize on it.
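
The isolation step described above could look roughly like the following sketch. The helper `find_failing_inputs` and the stand-in `demo_transform` are hypothetical names introduced here for illustration. The stand-in exists only so the sketch runs without jiwer installed, and it does not reproduce what `wer_standardize` actually does.

```python
def find_failing_inputs(texts, transform):
    """Apply transform to each text on its own and collect the failures.

    Returns a list of (index, text, error message) tuples, one per input
    for which transform raised a ValueError.
    """
    failures = []
    for i, text in enumerate(texts):
        try:
            transform(text)
        except ValueError as exc:
            failures.append((i, text, str(exc)))
    return failures


# Hypothetical stand-in transform (NOT jiwer's behaviour): it rejects
# anything that is not a non-empty string.
def demo_transform(text):
    if not isinstance(text, str) or not text.strip():
        raise ValueError(f"input {text!r} was rejected")
    return [text.lower().split()]


# In-memory sample strings instead of file contents, so nothing is read from disk.
samples = ["Hello world", "", "Second FILE contents"]
for index, text, error in find_failing_inputs(samples, demo_transform):
    print(index, repr(text), error)
```

With jiwer installed, passing `jiwer.wer_standardize` as `transform` and the list of file contents as `texts` would run the real transform on each string separately. Any string reported this way can then be pasted into a short script that needs no files.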

em-chiu (Author) commented Dec 8, 2024

Apologies, I'm not very familiar with all the terminology for troubleshooting issues. I'll try to do as you suggest.
