Error when trying to reproduce examples/python/ml/flax_gpt2 #245

Closed
Mingbo-Lee opened this issue Jul 17, 2023 · 16 comments

@Mingbo-Lee
Contributor

Issue Type

Bug

Modules Involved

Documentation/Tutorial/Example

Have you reproduced the bug with SPU HEAD?

Yes

Installation Kind

binary

SPU Version

spu 0.4.1b1

OS Platform and Distribution

Linux Ubuntu 20.04.6 LTS

Python Version

3.8.17

Compiler Version

GCC 9.4.0

Current Behavior?

I am trying to reproduce examples/python/ml/flax_gpt2,
following the instructions at https://github.com/secretflow/spu/tree/main/examples/python/ml/flax_gpt2 step by step.
When I run
bazel run -c opt //examples/python/utils:nodectl -- --config `pwd`/examples/python/ml/flax_gpt2/3pc.json up
the build fails with an error.

Standalone code to reproduce the issue

bazel run -c opt //examples/python/utils:nodectl -- --config `pwd`/examples/python/ml/flax_gpt2/3pc.json up

Relevant log output

INFO: From Compiling src/core/lib/channel/connected_channel.cc:
external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.cc: In lambda function:
external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.cc:337:54: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct grpc_transport_stream_op_batch'; use assignment or value-initialization instead [-Wclass-memaccess]
  337 |       memset(&recv_message_, 0, sizeof(recv_message_));
      |                                                      ^
In file included from external/com_github_grpc_grpc/src/core/lib/channel/channel_stack.h:75,
                 from external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.h:25,
                 from external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.cc:21:
external/com_github_grpc_grpc/src/core/lib/transport/transport.h:274:8: note: 'struct grpc_transport_stream_op_batch' declared here
  274 | struct grpc_transport_stream_op_batch {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.cc: In member function 'grpc_core::Poll<std::unique_ptr<grpc_metadata_batch, grpc_core::Arena::PooledDeleter> > grpc_core::{anonymous}::ClientStream::PollOnce()':
external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.cc:363:46: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct grpc_transport_stream_op_batch'; use assignment or value-initialization instead [-Wclass-memaccess]
  363 |       memset(&metadata_, 0, sizeof(metadata_));
      |                                              ^
In file included from external/com_github_grpc_grpc/src/core/lib/channel/channel_stack.h:75,
                 from external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.h:25,
                 from external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.cc:21:
external/com_github_grpc_grpc/src/core/lib/transport/transport.h:274:8: note: 'struct grpc_transport_stream_op_batch' declared here
  274 | struct grpc_transport_stream_op_batch {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.cc:409:56: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct grpc_transport_stream_op_batch'; use assignment or value-initialization instead [-Wclass-memaccess]
  409 |         memset(&send_message_, 0, sizeof(send_message_));
      |                                                        ^
In file included from external/com_github_grpc_grpc/src/core/lib/channel/channel_stack.h:75,
                 from external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.h:25,
                 from external/com_github_grpc_grpc/src/core/lib/channel/connected_channel.cc:21:
external/com_github_grpc_grpc/src/core/lib/transport/transport.h:274:8: note: 'struct grpc_transport_stream_op_batch' declared here
  274 | struct grpc_transport_stream_op_batch {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO: From Compiling src/core/lib/event_engine/posix_engine/tcp_socket_utils.cc:
external/com_github_grpc_grpc/src/core/lib/event_engine/posix_engine/tcp_socket_utils.cc: In function 'bool grpc_event_engine::posix_engine::SockaddrToV4Mapped(const grpc_event_engine::experimental::EventEngine::ResolvedAddress*, grpc_event_engine::experimental::EventEngine::ResolvedAddress*)':
external/com_github_grpc_grpc/src/core/lib/event_engine/posix_engine/tcp_socket_utils.cc:275:62: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'class grpc_event_engine::experimental::EventEngine::ResolvedAddress'; use assignment or value-initialization instead [-Wclass-memaccess]
  275 |     memset(resolved_addr6_out, 0, sizeof(*resolved_addr6_out));
      |                                                              ^
In file included from external/com_github_grpc_grpc/src/core/lib/event_engine/posix_engine/tcp_socket_utils.h:29,
                 from external/com_github_grpc_grpc/src/core/lib/event_engine/posix_engine/tcp_socket_utils.cc:17:
external/com_github_grpc_grpc/include/grpc/event_engine/event_engine.h:119:9: note: 'class grpc_event_engine::experimental::EventEngine::ResolvedAddress' declared here
  119 |   class ResolvedAddress {
      |         ^~~~~~~~~~~~~~~
INFO: From Compiling src/core/ext/filters/channel_idle/channel_idle_filter.cc:
In file included from external/com_github_grpc_grpc/src/core/ext/filters/channel_idle/channel_idle_filter.cc:44:
external/com_github_grpc_grpc/src/core/lib/promise/loop.h:121:31: warning: attribute ignored in declaration of 'union grpc_core::promise_detail::Loop<F>::<unnamed>' [-Wattributes]
  121 |   GPR_NO_UNIQUE_ADDRESS union {
      |                               ^
external/com_github_grpc_grpc/src/core/lib/promise/loop.h:121:31: note: attribute for 'union grpc_core::promise_detail::Loop<F>::<unnamed>' must follow the 'union' keyword
INFO: From Compiling src/core/ext/filters/client_channel/lb_policy/priority/priority.cc:
external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/priority/priority.cc:405:10: warning: 'uint32_t grpc_core::{anonymous}::PriorityLb::GetChildPriorityLocked(const string&) const' defined but not used [-Wunused-function]
  405 | uint32_t PriorityLb::GetChildPriorityLocked(
      |          ^~~~~~~~~~
ERROR: /home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/external/com_github_grpc_grpc/src/core/BUILD:3386:16: Compiling src/core/ext/xds/xds_route_config.cc failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 96 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
In file included from /usr/include/c++/9/bits/move.h:55,
                 from /usr/include/c++/9/bits/stl_pair.h:59,
                 from /usr/include/c++/9/utility:70,
                 from /usr/include/c++/9/algorithm:60,
                 from external/com_github_grpc_grpc/src/core/ext/xds/xds_route_config.h:24,
                 from external/com_github_grpc_grpc/src/core/ext/xds/xds_route_config.cc:19:
/usr/include/c++/9/type_traits: In instantiation of 'struct std::is_constructible<grpc_core::XdsListenerResource::FilterChainData>':
/usr/include/c++/9/type_traits:2912:25:   required from 'constexpr const bool std::is_constructible_v<grpc_core::XdsListenerResource::FilterChainData>'
/usr/include/c++/9/optional:479:66:   required by substitution of 'template<class ... _Args, typename std::enable_if<is_constructible_v<grpc_core::XdsListenerResource::FilterChainData, _Args&& ...>, bool>::type <anonymous> > constexpr std::_Optional_base<grpc_core::XdsListenerResource::FilterChainData, false, false>::_Optional_base(std::in_place_t, _Args&& ...) [with _Args = {}; typename std::enable_if<is_constructible_v<grpc_core::XdsListenerResource::FilterChainData, _Args&& ...>, bool>::type <anonymous> = <missing>]'
/usr/include/c++/9/type_traits:883:12:   required from 'struct std::is_constructible<grpc_core::XdsListenerResource::TcpListener, const grpc_core::XdsListenerResource::TcpListener&>'
/usr/include/c++/9/type_traits:901:12:   required from 'struct std::__is_copy_constructible_impl<grpc_core::XdsListenerResource::TcpListener, true>'
/usr/include/c++/9/type_traits:907:12:   required from 'struct std::is_copy_constructible<grpc_core::XdsListenerResource::TcpListener>'
/usr/include/c++/9/type_traits:2918:25:   required from 'constexpr const bool std::is_copy_constructible_v<grpc_core::XdsListenerResource::TcpListener>'
/usr/include/c++/9/variant:275:5:   required from 'constexpr const bool std::__detail::__variant::_Traits<grpc_core::XdsListenerResource::HttpConnectionManager, grpc_core::XdsListenerResource::TcpListener>::_S_copy_ctor'
/usr/include/c++/9/variant:1228:11:   required from 'class std::variant<grpc_core::XdsListenerResource::HttpConnectionManager, grpc_core::XdsListenerResource::TcpListener>'
external/com_github_grpc_grpc/src/core/ext/xds/xds_listener.h:189:53:   required from here
/usr/include/c++/9/type_traits:883:12: error: default member initializer for 'grpc_core::XdsListenerResource::DownstreamTlsContext::require_client_certificate' required before the end of its enclosing class
  883 |     struct is_constructible
      |            ^~~~~~~~~~~~~~~~
In file included from external/com_github_grpc_grpc/src/core/ext/xds/xds_routing.h:35,
                 from external/com_github_grpc_grpc/src/core/ext/xds/xds_route_config.cc:64:
external/com_github_grpc_grpc/src/core/ext/xds/xds_listener.h:83:37: note: defined here
   83 |     bool require_client_certificate = false;
      |                                     ^~~~~~~~
Target //examples/python/utils:nodectl failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 60.324s, Critical Path: 30.75s
INFO: 1435 processes: 59 internal, 1376 linux-sandbox.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
@anakinxc
Contributor

Hi @Mingbo-Lee

Please use GCC 11.2.
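
The build log above comes from GCC 9.4.0, and the suggested fix is GCC 11.2. A minimal sketch of a pre-build version check follows, assuming `gcc` is on the PATH and supports `-dumpfullversion` (available in GCC 7 and later). The helper names `parse_version`, `gcc_is_new_enough`, and `installed_gcc_version` are hypothetical and are not part of SPU or Bazel.

```python
import re
import subprocess

MIN_GCC = (11, 2)  # version suggested in this thread


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def gcc_is_new_enough(version_text, minimum=MIN_GCC):
    """Compare only as many components as the minimum specifies."""
    return parse_version(version_text)[: len(minimum)] >= minimum


def installed_gcc_version():
    """Ask the gcc found on PATH for its full version (assumes GCC 7+)."""
    result = subprocess.run(
        ["gcc", "-dumpfullversion"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The two versions mentioned in this thread.
for reported in ("9.4.0", "11.2.0"):
    print(reported, gcc_is_new_enough(reported))
# prints:
# 9.4.0 False
# 11.2.0 True
```

On a build machine, `gcc_is_new_enough(installed_gcc_version())` would apply the same check to the local compiler.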

@Mingbo-Lee
Contributor Author

After switching to GCC 11.2 and fixing a few other errors on my own, the problem is resolved. Thank you very much, @anakinxc

@Mingbo-Lee
Contributor Author

A new error has appeared:

Run on SPU
Traceback (most recent call last):
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/examples/python/ml/flax_gpt2/flax_gpt2.py", line 86, in <module>
    outputs_ids = run_on_spu()
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/examples/python/ml/flax_gpt2/flax_gpt2.py", line 72, in run_on_spu
    input_ids = ppd.device("P1")(lambda x: x)(inputs_ids)
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 499, in __call__
    self.device.node_client.run(server_fn, *args, **kwargs),
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 249, in run
    return self._call(self._stub.Run, fn, *args, **kwargs)
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 238, in _call
    rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 218, in rebuild_messages
    return b''.join([msg for msg in msgs])
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 218, in <listcomp>
    return b''.join([msg for msg in msgs])
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 238, in <genexpr>
    rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/com_github_grpc_grpc/src/python/grpcio/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/com_github_grpc_grpc/src/python/grpcio/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating responses: cannot import name 'HashableWrapper' from 'jax._src.util' (/opt/anaconda3/envs/limingbo_sf/lib/python3.8/site-packages/jax/_src/util.py)"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:9923 {created_time:"2023-07-19T13:53:33.587829594+00:00", grpc_status:2, grpc_message:"Exception iterating responses: cannot import name \'HashableWrapper\' from \'jax._src.util\' (/opt/anaconda3/envs/limingbo_sf/lib/python3.8/site-packages/jax/_src/util.py)"}"
>

Installing the latest versions of jax, jaxlib, and spu did not help.

(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ pip list | grep spu
spu                          0.4.1b1
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ pip list | grep jax
jax                          0.4.12
jaxlib                       0.4.12
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ 

Both jax/jaxlib 0.4.12 and 0.4.13 produce the error.

@anakinxc
Contributor

Try updating flax and see if that helps.

@Mingbo-Lee
Contributor Author

I updated flax to the latest version, and the error persists.

grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating responses: cannot import name 'HashableWrapper' from 'jax._src.util' (/opt/anaconda3/envs/limingbo_sf/lib/python3.8/site-packages/jax/_src/util.py)"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:9923 {grpc_message:"Exception iterating responses: cannot import name \'HashableWrapper\' from \'jax._src.util\' (/opt/anaconda3/envs/limingbo_sf/lib/python3.8/site-packages/jax/_src/util.py)", grpc_status:2, created_time:"2023-07-19T14:08:11.217088731+00:00"}"
>
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ pip list | grep flax
flax                         0.7.0
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ pip install flax==
ERROR: Could not find a version that satisfies the requirement flax== (from versions: 0.1.0rc1, 0.1.0rc2, 0.1.0, 0.2.0, 0.2.1, 0.2.2, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.4.0, 0.4.1, 0.4.2, 0.5.0, 0.5.1, 0.5.2, 0.5.3, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.6.8, 0.6.9, 0.6.10, 0.6.11, 0.7.0)
ERROR: No matching distribution found for flax==
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ 

@anakinxc
Contributor

That's a bit strange... HashableWrapper isn't something spu imports either...

Does the CPU run work fine? Also, which grpc version are you using?
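
Since the failing import does not come from spu itself, one way to narrow it down is to search the active environment's site-packages for modules that still mention `HashableWrapper`. The following is only a sketch of such a search; `find_references` is a hypothetical helper, not an SPU or JAX API.

```python
import os


def find_references(root, needle, suffix=".py"):
    """Return the sorted paths under root whose contents mention needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            try:
                # Ignore undecodable bytes; only a substring match is needed.
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    if needle in handle.read():
                        hits.append(path)
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
    return sorted(hits)


# Example call (scans the whole environment, so it can take a while):
#   import sysconfig
#   find_references(sysconfig.get_paths()["purelib"], "HashableWrapper")
```

Any path this returns points at a package that still expects the old `jax._src.util` layout and is a candidate for upgrading or reinstalling.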

@Mingbo-Lee
Contributor Author

1.49.1

(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ pip list | grep grpc
grpcio                       1.49.1
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ 

@Mingbo-Lee
Contributor Author

I updated to the latest version, 1.56.0, and the error persists.

flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 238, in <genexpr>
    rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/com_github_grpc_grpc/src/python/grpcio/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/com_github_grpc_grpc/src/python/grpcio/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating responses: cannot import name 'HashableWrapper' from 'jax._src.util' (/opt/anaconda3/envs/limingbo_sf/lib/python3.8/site-packages/jax/_src/util.py)"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:9923 {grpc_message:"Exception iterating responses: cannot import name \'HashableWrapper\' from \'jax._src.util\' (/opt/anaconda3/envs/limingbo_sf/lib/python3.8/site-packages/jax/_src/util.py)", grpc_status:2, created_time:"2023-07-19T14:20:51.958034795+00:00"}"
>
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ pip list | grep grpc
grpcio                       1.56.0
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ 

@Mingbo-Lee
Contributor Author

This approach reproduces easily: https://www.secretflow.org.cn/docs/secretflow/latest/zh-Hans/tutorial/gpt2_with_spu
By the way, what is the difference between these two ways of running SPU?

@Mingbo-Lee
Contributor Author

The CPU run works fine:

------
Run on CPU
2023-07-19 14:20:34.238997: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-07-19 14:20:34.239058: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-07-19 14:20:34.239068: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
I enjoy walking with my cute dog, but I'm not sure if I'll ever

------
Run on SPU
Traceback (most recent call last):
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/examples/python/ml/flax_gpt2/flax_gpt2.py", line 86, in <module>
    outputs_ids = run_on_spu()
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/examples/python/ml/flax_gpt2/flax_gpt2.py", line 72, in run_on_spu
    input_ids = ppd.device("P1")(lambda x: x)(inputs_ids)
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 499, in __call__
    self.device.node_client.run(server_fn, *args, **kwargs),
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 249, in run
    return self._call(self._stub.Run, fn, *args, **kwargs)
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 238, in _call
    rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 218, in rebuild_messages
    return b''.join([msg for msg in msgs])
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 218, in <listcomp>
    return b''.join([msg for msg in msgs])
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/spulib/spu/utils/distributed.py", line 238, in <genexpr>
    rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/com_github_grpc_grpc/src/python/grpcio/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/limingbo/.cache/bazel/_bazel_limingbo/b283a70469641d1121054548cb95c882/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_gpt2/flax_gpt2.runfiles/com_github_grpc_grpc/src/python/grpcio/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating responses: cannot import name 'HashableWrapper' from 'jax._src.util' (/opt/anaconda3/envs/limingbo_sf/lib/python3.8/site-packages/jax/_src/util.py)"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:9923 {grpc_message:"Exception iterating responses: cannot import name \'HashableWrapper\' from \'jax._src.util\' (/opt/anaconda3/envs/limingbo_sf/lib/python3.8/site-packages/jax/_src/util.py)", grpc_status:2, created_time:"2023-07-19T14:20:51.958034795+00:00"}"
>
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ pip list | grep grpc
grpcio                       1.56.0
(limingbo_sf) limingbo@luoyegroup-ubuntu:~/spu$ 

@anakinxc
Contributor

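> This approach reproduces easily: https://www.secretflow.org.cn/docs/secretflow/latest/zh-Hans/tutorial/gpt2_with_spu By the way, what is the difference between these two ways of running SPU?

The SecretFlow tutorial is built on sf & ray, while the SPU example uses a simple distributed framework that SPU implements itself.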

@Mingbo-Lee
Contributor Author

OK, thank you very much.

@anakinxc
Contributor

I'm trying to reproduce it now... one moment.

@Mingbo-Lee
Contributor Author

OK.

@anakinxc
Contributor

Hi @Mingbo-Lee

I just tried it from scratch and could not reproduce the error. Here are the steps I followed:

Pull a fresh secretflow/spu-ci:latest image.

pip install -r requirements.txt
pip install 'transformers[flax]'

bazel build //examples/python/... -c opt

Open two terminals.
In the first, run bazel-bin/examples/python/utils/nodectl --config `pwd`/examples/python/ml/flax_gpt2/3pc.json up

In the second, run bazel-bin/examples/python/ml/flax_gpt2/flax_gpt2 --config `pwd`/examples/python/ml/flax_gpt2/3pc.json

Output:

(base) root@c890861d99bb:/home/admin/dev# bazel-bin/examples/python/ml/flax_gpt2/flax_gpt2 --config `pwd`/examples/python/ml/flax_gpt2/3pc.json
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

------
Run on CPU
I enjoy walking with my cute dog, but I'm not sure if I'll ever

------
Run on SPU
I enjoy walking with my cute dog, but I'm not sure if I'll ever
(base) root@c890861d99bb:/home/admin/dev#

The pip list output is as follows:

Package             Version
------------------- --------
absl-py             1.4.0
cached-property     1.5.2
cachetools          5.3.1
certifi             2023.5.7
charset-normalizer  3.2.0
chex                0.1.7
cloudpickle         2.2.1
dill                0.3.6
dm-tree             0.1.8
etils               1.3.0
filelock            3.12.2
flax                0.7.0
fsspec              2023.6.0
grpcio              1.56.0
huggingface-hub     0.16.4
idna                3.4
importlib-metadata  6.8.0
importlib-resources 6.0.0
jax                 0.4.13
jaxlib              0.4.13
markdown-it-py      3.0.0
mdurl               0.1.2
ml-dtypes           0.2.0
msgpack             1.0.5
multiprocess        0.70.14
nest-asyncio        1.5.6
numpy               1.24.4
opt-einsum          3.3.0
optax               0.1.4
orbax-checkpoint    0.2.3
packaging           23.1
pip                 23.1.2
protobuf            3.20.3
Pygments            2.15.1
PyYAML              6.0.1
regex               2023.6.3
requests            2.31.0
rich                13.4.2
safetensors         0.3.1
scipy               1.10.1
setuptools          67.8.0
tensorstore         0.1.40
termcolor           2.3.0
tokenizers          0.13.3
toolz               0.12.0
tqdm                4.65.0
transformers        4.31.0
typing_extensions   4.7.1
urllib3             2.0.3
wheel               0.38.4
zipp                3.16.2

How about trying a fresh Python env?
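
Before rebuilding an environment, it may help to compare the installed versions against the working list above. This is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`). `KNOWN_GOOD` copies a few rows from that list, and `installed_version` and `diff_versions` are hypothetical helper names.

```python
from importlib import metadata

# A few rows copied from the pip list above.
KNOWN_GOOD = {
    "jax": "0.4.13",
    "jaxlib": "0.4.13",
    "flax": "0.7.0",
    "grpcio": "1.56.0",
    "transformers": "4.31.0",
    "cloudpickle": "2.2.1",
    "protobuf": "3.20.3",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def diff_versions(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        if found != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in sorted(diff_versions(KNOWN_GOOD).items()):
        print(f"{name}: expected {want}, found {found}")
```

A clean diff does not prove the environment is healthy: the jax, flax, and grpcio versions reported earlier in this thread were already at or near these values when the error occurred, so a stale package elsewhere may still be responsible.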

@Mingbo-Lee
Contributor Author

I created a fresh Python env and reproduced the example successfully. Thank you very much.

INFO: Elapsed time: 1.222s, Critical Path: 0.01s
INFO: 2 processes: 2 internal.
INFO: Build completed successfully, 2 total actions
INFO: Build completed successfully, 2 total actions
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

------
Run on CPU
2023-07-20 01:45:32.818594: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-07-20 01:45:32.818663: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-07-20 01:45:32.818674: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
I enjoy walking with my cute dog, but I'm not sure if I'll ever

------
Run on SPU
I enjoy walking with my cute dog, but I'm not sure if I'll ever
