CSV parser fails with mixed quoting and non-standard row_sep #107

yourpalal · 2019-10-03T14:17:38Z

opts = {row_sep: "|\n", col_sep: ","}
CSV.parse(CSV.generate(opts) { |csv|
  csv << ["yes, it's true"];
  csv << [ "CSV is broken"];
  csv << ["uhoh!"]; }, opts)

In the example above, the parser will raise a CSV::MalformedCSVError (Unquoted fields do not allow new line <"\n"> in line 2.) error.

It will succeed if any of the following are done:

pass force_quotes: true to generate
add a comma in each row
remove the comma from the first row
use row_sep: "\n"

Therefore, the problem seems to be that a non-standard row_sep + lines with a field being quoted in one row and unquoted in the next.

The text was updated successfully, but these errors were encountered:

yourpalal · 2019-10-03T14:18:00Z

I'm using the version included in ruby 2.6.5, btw!

yourpalal · 2019-10-03T14:32:02Z

The stack trace indicates that it's failing in parse_quotable_robust. I'm thinking the problem may be that in prepare_unquoted, the @unquoted_value regexp does not take into account the value of row_sep. Therefore, the | is consumed by the scanner as part of the unquoted field value. After this, the "|\n" no longer exists to denote a line break, and instead the \n is interpreted as the next character of this field.

kou · 2019-10-04T00:02:56Z

Why do you want to use "|\n"?
Could you show a real world use case?

yourpalal · 2019-10-04T13:55:04Z

We're using this because we're interfacing with other systems that have very poor support for CSV, and get confused by newlines in quoted fields. That said, these files are created and then transformed by our software before being passed off to the other software (various RDBMSes) so during that first, internal phase, we could just use the default row/column separators, and avoid needing to parse the non-standard format.

If you don't want to support it, that's ok but an error should probably be raised when reading with non-standard row/col separators.

kou · 2019-10-10T07:01:37Z

Could you show the real CSV that has problem?
The CSV does NOT use "|\n" as row separator, right?

yourpalal · 2019-10-10T15:16:35Z

The CSV was using "|\n" as a row separator.

This CSV is used for an export/import process that involves creating CSV files, transforming them, and then finally importing them into MySQL, Postgres, Oracle, etc. We use a "|\n" separator in the final CSV files to overcome very poor CSV support in some of these databases.

It used to be the case that the initial CSV files were written with the "|\n" row separator, which meant that we had to support it when reading the CSV files for the transformation & filtering step. The generated CSV was not any more complicated than what was produced in the initial example I provided.

After discovering this issue, I have modified our system to only use the "|\n" separator in the final step - right before the file is imported into the other SQL databases.

kou · 2019-10-11T21:12:37Z

Ah, sorry. I want to know the input of your transformation system. I think that the following is a situation of this case:

X -CSV(X)-> [YOUR SYSTEM] -CSV(RDBMS)-> RDBMS

I want to know a sample content of CSV(X).

yourpalal · 2019-10-11T21:37:40Z

The whole process starts with our database, so the situation is more like:

step	owner	system	name	read row_sep	write row_sep
1	me	ruby (rails)	DB to CSV (exporting)		`"\|\n"`
2	me	ruby	CSV to CSV (filtering)	`"\|\n"`	`"\|\n"`
3	me	ruby	CSV to CSV-like (transforming)	`"\|\n"`	various things, including `"\|\n"`
4	not me	RDBMS	CSV-like to DB	various things, including `"\|\n"`

the data is in our database
we export that data into many CSV files
those CSV files are transformed by removing rows and columns
those CSV files are transformed into an alternate CSV format, depending on what RDBMS they will be loaded into
they're loaded into the RDBMSes

Steps 1-2 happen daily, steps 3,4,5 are repeated many times with the output of steps 1-2.

The system has been modified to now work like this:

step	system	name	read row_sep	write row_sep
1	ruby (rails)	DB to CSV (exporting)		`"\n"`
2	ruby	CSV to CSV (filtering)	`"\n"`	`"\n"`
3	ruby	CSV to CSV-like (transforming)	`"\n"`	various things, including `"\|\n"`
4	RDBMS	CSV-like to DB	various things, including `"\|\n"`

With this change, we no longer need to read CSV files with row_sep: "|\n", which means this bug no longer affects us.

The reason we were reading with that row separator is because it is convenient to use the same row separator from start to end of the process. It's OK to have to use standard CSV in steps 1 and 2. The actual data we are processing is not relevant to the process or the usage of that row separator.

Therefore, in my opinion, the main bug is that you can set anything you want as row_sep during reading, but the reader will fail to read files with non-standard row_sep values.

kou · 2019-10-11T22:06:39Z

Thanks. I understand.
I'll think about how to deal with this case.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSV parser fails with mixed quoting and non-standard row_sep #107

CSV parser fails with mixed quoting and non-standard row_sep #107

yourpalal commented Oct 3, 2019

yourpalal commented Oct 3, 2019

yourpalal commented Oct 3, 2019 •

edited

Loading

kou commented Oct 4, 2019

yourpalal commented Oct 4, 2019

kou commented Oct 10, 2019

yourpalal commented Oct 10, 2019

kou commented Oct 11, 2019

yourpalal commented Oct 11, 2019 •

edited

Loading

kou commented Oct 11, 2019

CSV parser fails with mixed quoting and non-standard row_sep #107

CSV parser fails with mixed quoting and non-standard row_sep #107

Comments

yourpalal commented Oct 3, 2019

yourpalal commented Oct 3, 2019

yourpalal commented Oct 3, 2019 • edited Loading

kou commented Oct 4, 2019

yourpalal commented Oct 4, 2019

kou commented Oct 10, 2019

yourpalal commented Oct 10, 2019

kou commented Oct 11, 2019

yourpalal commented Oct 11, 2019 • edited Loading

kou commented Oct 11, 2019

yourpalal commented Oct 3, 2019 •

edited

Loading

yourpalal commented Oct 11, 2019 •

edited

Loading