Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow CSV.open with StringIO argument #302

Merged
merged 16 commits into from
Aug 5, 2024
Merged

Allow CSV.open with StringIO argument #302

merged 16 commits into from
Aug 5, 2024

Conversation

MarcPer
Copy link
Contributor

@MarcPer MarcPer commented May 4, 2024

Fix #300

lib/csv.rb Outdated Show resolved Hide resolved
@kou kou changed the title Allow CSV.foreach with StringIO argument Allow CSV.open with StringIO argument May 4, 2024
@MarcPer MarcPer requested a review from kou May 24, 2024 08:12
lib/csv.rb Outdated
Comment on lines 1604 to 1613
pos = filename_or_io.pos
f = StringIO.new(filename_or_io.read)
filename_or_io.seek(pos)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use StringIO#string here?

Suggested change
pos = filename_or_io.pos
f = StringIO.new(filename_or_io.read)
filename_or_io.seek(pos)
f = StringIO.new(filename_or_io.string)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd then run into an issue similar to #98, which I confirmed with another test:

 def test_foreach_stringio_with_bom
   s = StringIO.new
   s << "\ufeff"  # BOM
   s << @data
   s.rewind
   s.read(3)
   rows = CSV.foreach(s, col_sep: "\t", row_sep: "\r\n").to_a
   assert_equal(@rows, rows)
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the most relevant counter-argument is, CSV.new also uses read, so using read here is also a matter of consistency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Could you also add a test for the BOM case?

How about using StringIO#set_encoding_by_bom for it?

        f = StringIO.new(filename_or_io.string)
        unless f.set_encoding_by_bom
          f.set_encoding(filename_or_io.external_encoding)
        end

CSV.new uses already opened IO. So it must not ignore the current position. So we must use read.
But CSV.open opens a new IO. So it must use all contents instead of contents after the current position.
Could you check the current behavior of CSV.open with sought real IO?

Copy link
Contributor Author

@MarcPer MarcPer May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But CSV.open opens a new IO. So it must use all contents instead of contents after the current position.

You're right, CSV.open reads the whole file content. I changed open again to use string instead of read.

Could you also add a test for the BOM case?

Sure thing. I added it, and together with using the **file_opts in the new StringIO, it's handled correctly.

How about using StringIO#set_encoding_by_bom for it?

I couldn't see any effect of doing it. It seems like including the file_opts is enough. Or is there a case I'm missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't see any effect of doing it.

Ah, b706d91 is a trick for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything you still feel is missing, after I've added file_opts with the trick from b706d91 ?

test/csv/interface/test_read.rb Outdated Show resolved Hide resolved
test/csv/interface/test_read.rb Outdated Show resolved Hide resolved
test/csv/interface/test_read.rb Outdated Show resolved Hide resolved
test/csv/interface/test_read.rb Outdated Show resolved Hide resolved
test/csv/interface/test_read.rb Outdated Show resolved Hide resolved
@kou
Copy link
Member

kou commented Jun 4, 2024

Could you rebase on main? Our CI on master broken... and I've fixed it now.

@MarcPer
Copy link
Contributor Author

MarcPer commented Jun 5, 2024

Could you rebase on main? Our CI on master broken... and I've fixed it now.

Thank you, I've rebase it now.

@kou
Copy link
Member

kou commented Jun 6, 2024

Thanks. But sorry... The master CI was broken... I've fixed it. Could you rebase on master again?

@MarcPer
Copy link
Contributor Author

MarcPer commented Jun 6, 2024

Thanks. But sorry... The master CI was broken... I've fixed it. Could you rebase on master again?

No problem, I rebased it now.

@MarcPer
Copy link
Contributor Author

MarcPer commented Jun 13, 2024

I'm assuming the test failures are related to https://bugs.ruby-lang.org/issues/20526, right?

@kou
Copy link
Member

kou commented Jun 14, 2024

Hmm. It should be fixed by 47bf76f .
Could you rebase again?

@MarcPer
Copy link
Contributor Author

MarcPer commented Jun 14, 2024

Rebased it, and removed a now unneeded check when closing the IO within CSV#open.

@kou
Copy link
Member

kou commented Jun 16, 2024

Ah, the failed test is an added test.

Does this work?

diff --git a/lib/csv.rb b/lib/csv.rb
index 7511ced..d2d1c57 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -1591,7 +1591,10 @@ class CSV
       # wrap a File opened with the remaining +args+ with no newline
       # decorator
       file_opts = {}
-      may_enable_bom_deletection_automatically(mode, options, file_opts)
+      may_enable_bom_deletection_automatically(filename_or_io,
+                                               mode,
+                                               options,
+                                               file_opts)
       file_opts.merge!(options)
       unless file_opts.key?(:newline)
         file_opts[:universal_newline] ||= false
@@ -1900,10 +1903,15 @@ class CSV
     private_constant :ON_WINDOWS
 
     private
-    def may_enable_bom_deletection_automatically(mode, options, file_opts)
-      # "bom|utf-8" may be buggy on Windows:
-      # https://bugs.ruby-lang.org/issues/20526
-      return if ON_WINDOWS
+    def may_enable_bom_deletection_automatically(filename_or_io,
+                                                 mode,
+                                                 options,
+                                                 file_opts)
+      unless filename_or_io.is_a?(StringIO)
+        # "bom|utf-8" may be buggy on Windows:
+        # https://bugs.ruby-lang.org/issues/20526
+        return if ON_WINDOWS
+      end
       return unless Encoding.default_external == Encoding::UTF_8
       return if options.key?(:encoding)
       return if options.key?(:external_encoding)

@MarcPer
Copy link
Contributor Author

MarcPer commented Aug 1, 2024

Sorry, I was away for a while. I saw that support to StringIO was dropped for Ruby 2.6 and earlier, before handling of BOM encoding was added. Given that, would it be okay to skip the new tests for StringIO if RUBY_VERSION < "2.7"? Otherwise, the CSV gem might have to deal itself with detection and stripping of BOM.

Comment on lines 35 to 49
if RUBY_VERSION >= "2.7"
# Support to StringIO was dropped for Ruby 2.6 and earlier:
# https://github.com/ruby/stringio/pull/47
def test_foreach_stringio
string_io = StringIO.new(@data)
rows = CSV.foreach(string_io, col_sep: "\t", row_sep: "\r\n").to_a
assert_equal(@rows, rows)
end

def test_foreach_stringio_with_bom
string_io = StringIO.new("\ufeff#{@data}") # U+FEFF ZERO WIDTH NO-BREAK SPACE
rows = CSV.foreach(string_io, col_sep: "\t", row_sep: "\r\n").to_a
assert_equal(@rows, rows)
end
end
Copy link
Member

@kou kou Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does test_foreach_stringio work with Ruby 2.7? If so, could you omit only test_foreach_stringio_with_bom?

Suggested change
if RUBY_VERSION >= "2.7"
# Support to StringIO was dropped for Ruby 2.6 and earlier:
# https://github.com/ruby/stringio/pull/47
def test_foreach_stringio
string_io = StringIO.new(@data)
rows = CSV.foreach(string_io, col_sep: "\t", row_sep: "\r\n").to_a
assert_equal(@rows, rows)
end
def test_foreach_stringio_with_bom
string_io = StringIO.new("\ufeff#{@data}") # U+FEFF ZERO WIDTH NO-BREAK SPACE
rows = CSV.foreach(string_io, col_sep: "\t", row_sep: "\r\n").to_a
assert_equal(@rows, rows)
end
end
def test_foreach_stringio
string_io = StringIO.new(@data)
rows = CSV.foreach(string_io, col_sep: "\t", row_sep: "\r\n").to_a
assert_equal(@rows, rows)
end
def test_foreach_stringio_with_bom
if RUBY_VERSION < "2.7"
# Support to StringIO was dropped for Ruby 2.6 and earlier without BOM support:
# https://github.com/ruby/stringio/pull/47
omit("StringIO's BOM support isn't available with Ruby < 2.7")
end
string_io = StringIO.new("\ufeff#{@data}") # U+FEFF ZERO WIDTH NO-BREAK SPACE
rows = CSV.foreach(string_io, col_sep: "\t", row_sep: "\r\n").to_a
assert_equal(@rows, rows)
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does if I also initialize StringIO without additional options. I just pushed another commit like that. It's unfortunate having to run an additional RUBY_VERSION check at runtime, but at least that's only when calling open.

lib/csv.rb Outdated
Comment on lines 1608 to 1612
f = if RUBY_VERSION >= "2.7"
StringIO.new(filename_or_io.string, **file_opts)
else
StringIO.new(filename_or_io.string)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we always use an empty file_opts for Ruby 2.7 or earlier instead?

diff --git a/lib/csv.rb b/lib/csv.rb
index 5f75894..6f94bd4 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -1605,11 +1605,7 @@ class CSV
       options.delete_if {|k, _| /newline\z/.match?(k)}
 
       if filename_or_io.is_a?(StringIO)
-        f = if RUBY_VERSION >= "2.7"
-          StringIO.new(filename_or_io.string, **file_opts)
-        else
-          StringIO.new(filename_or_io.string)
-        end
+        f = StringIO.new(filename_or_io.string, **file_opts)
       else
         begin
           f = File.open(filename_or_io, mode, **file_opts)
@@ -1911,7 +1907,11 @@ class CSV
                                                mode,
                                                options,
                                                file_opts)
-      unless filename_or_io.is_a?(StringIO)
+      if filename_or_io.is_a?(StringIO)
+        # Support to StringIO was dropped for Ruby 2.6 and earlier without BOM support:
+        # https://github.com/ruby/stringio/pull/47
+        return if RUBY_VERSION < "3.0"
+      else
         # "bom|utf-8" may be buggy on Windows:
         # https://bugs.ruby-lang.org/issues/20526
         return if ON_WINDOWS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work because CSV#open still adds keys to file_opts here:

csv/lib/csv.rb

Lines 1586 to 1589 in bb93c28

file_opts.merge!(options)
unless file_opts.key?(:newline)
file_opts[:universal_newline] ||= false
end

For Ruby < 2.7, if file_opts is not empty, we get TypeError: no implicit conversion of Hash into String.

I like that something fails if options is given to CSV#open when it's not supported, as opposed to simply ignoring it, the way I did it, but then the error should be ideally more clear than what the old StringIO returns. I can go that route, by rescuing the TypeError, and raising another exception with a clearer message. In any case, the file_opts[:universal_newline] ||= false would still have to be removed for < 2.7.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work because CSV#open still adds keys to file_opts here:

csv/lib/csv.rb

Lines 1586 to 1589 in bb93c28

file_opts.merge!(options)
unless file_opts.key?(:newline)
file_opts[:universal_newline] ||= false
end

Oh, sorry.

For Ruby < 2.7, if file_opts is not empty, we get TypeError: no implicit conversion of Hash into String.

I like that something fails if options is given to CSV#open when it's not supported, as opposed to simply ignoring it, the way I did it, but then the error should be ideally more clear than what the old StringIO returns. I can go that route, by rescuing the TypeError, and raising another exception with a clearer message.

I agree with you but can we use raise ... unless file_opts.empty? for it instead of rescuing the TypeError?

In any case, the file_opts[:universal_newline] ||= false would still have to be removed for < 2.7.

I agree with you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out passing an empty file_opts doesn't work either. I've noticed this quirk from the interpreter in Ruby 2.6:

StringIO.new("bla", **{}) # => #<StringIO:0x0000563108bb88b8>

opts = {}
StringIO.new("bla", **opts) # => TypeError (no implicit conversion of Hash into String)

To address that, I've changed the StringIO initialization in Ruby < 2.7, so that it:

  • ignores options that are meant only for CSV.new
  • raises an ArgumentError if anything else is passed as options to CSV.open
  • initializes StringIO without keyword arguments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Let's use the approach.

lib/csv.rb Outdated Show resolved Hide resolved
lib/csv.rb Outdated Show resolved Hide resolved
lib/csv.rb Outdated Show resolved Hide resolved
@kou kou merged commit ce91198 into ruby:master Aug 5, 2024
46 checks passed
@kou
Copy link
Member

kou commented Aug 5, 2024

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Read StringIO with foreach
2 participants