Reading the code, I think we need to seek based on lines, because the original data is not really UTF-8, and it could contain ANSI sequence. That means we might need to have a buffer somewhere.
Umm... this is tough. We can't cut off ANSI sequence as well. Otherwise Ansi2html might not be able to do the job properly. Not very sure though. Line based might be the best effort. And if a single line is too long, we still need to cut it off. When that happen, we'll need to remove corrupted bytes, too.
Umm... what an oversight. Looks like we still need some of the patches in !10665 (closed), i.e. force_encoding for all read because to_json would try to encode it once again instead of just changing the encoding tag.
Umm... not really. I tried to add a test but it looks ok. So somewhere is not tagging UTF-8 correctly but that might not directly come from Gitlab::Ci::Trace::Stream
irb(main):086:0> data = build.trace.read do |stream|irb(main):087:1* stream.limitirb(main):088:1> state = 'eyJvZmZzZXQiOjUyNzA2LCJuX29wZW5fdGFncyI6MCwiZmdfY29sb3IiOm51bGwsImJnX2NvbG9yIjpudWxsLCJzdHlsZV9tYXNrIjowfQ%3D%3D'irb(main):089:1> trace = stream.html_with_state(state)irb(main):090:1> endArgumentError: invalid base64 from /opt/gitlab/embedded/lib/ruby/2.3.0/base64.rb:74:in `unpack' from /opt/gitlab/embedded/lib/ruby/2.3.0/base64.rb:74:in `strict_decode64' from /opt/gitlab/embedded/lib/ruby/2.3.0/base64.rb:105:in `urlsafe_decode64' from /opt/gitlab/embedded/service/gitlab-rails/lib/ci/ansi2html.rb:264:in `restore_state' from /opt/gitlab/embedded/service/gitlab-rails/lib/ci/ansi2html.rb:137:in `convert' from /opt/gitlab/embedded/service/gitlab-rails/lib/ci/ansi2html.rb:27:in `convert' from /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/ci/trace/stream.rb:59:in `html_with_state' from (irb):89:in `block in irb_binding' from /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/ci/trace.rb:61:in `read' from (irb):86 from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/railties-4.2.8/lib/rails/commands/console.rb:110:in `start' from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/railties-4.2.8/lib/rails/commands/console.rb:9:in `start' from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/railties-4.2.8/lib/rails/commands/commands_tasks.rb:68:in `console' from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/railties-4.2.8/lib/rails/commands/commands_tasks.rb:39:in `run_command!' from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/railties-4.2.8/lib/rails/commands.rb:17:in `<top (required)>' from bin/rails:9:in `require' from bin/rails:9:in `<main>'
I've identified the issue. If we're reading in binmode (Gitlab::Ci::Trace#read, via "rb"), then IO#each_line in Ci::Ansi2html::Converter#convert would give ASCII-8BIT which would convert @out to be encoded with ASCII-8BIT.
I cannot reproduce this via console because I am not using Gitlab::Ci::Trace#read therefore not reading in binmode.
I am looking in a way to switch to text mode, but I am not sure if there's a way to do so. If not, we could just hot patch this by enforce the encoding in @out after the scanning is done.
Or we could just read in text mode in the first place... I am not sure though. I think we do binmode because we want to seek in bytes, but of course seek cannot be encoding aware anyway, doing so with text mode or not might not matter. If that's the case, we could actually just read in text mode.
@smcgivern I just reopened this issue as I think we could just explore here, what do you think?
I think the ideal solution would be that Ci::Ansi2html::Converter should not use the stream from Gitlab::Ci::Trace::Stream directly. Alternatively, we could set the encoding just before passing to the converter. However this could be surprising because it's changing the state for the stream in Stream#html_with_state. On the other hand, we could also argue that we're already doing so by passing the stream to the converter, because it's reading from the stream directly, the IO#pos is already changing there.
Honestly I think Gitlab::Ci::Trace::Stream is poorly designed and we should restructure it.
@godfat thanks, makes sense to me. I don't have a strong opinion on the design here, so I'll leave that to @ayufan. I think the current workaround is not ideal, though