
Commit

changed to proper Xavier initialization, existing implementation was … (#1927)

Summary:
…resulting in a large negative bias, which was killing all gradients through the following ReLU. https://paperswithcode.com/method/xavier-initialization

Pull Request resolved: #1927

Reviewed By: davidberard98

Differential Revision: D49754019

Pulled By: xuzhao9

fbshipit-source-id: 436676afed9bcc0f464cd1b25465444a98a52b5a
eknag authored and facebook-github-bot committed Sep 29, 2023
1 parent 3f11b81 commit 827f90b
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions torchbenchmark/models/dlrm/dlrm_s_pytorch.py
@@ -149,8 +149,7 @@ def create_mlp(self, ln, sigmoid_layer):
     mean = 0.0  # std_dev = np.sqrt(variance)
     std_dev = np.sqrt(2 / (m + n))  # np.sqrt(1 / m) # np.sqrt(1 / n)
     W = np.random.normal(mean, std_dev, size=(m, n)).astype(np.float32)
-    std_dev = np.sqrt(1 / m)  # np.sqrt(2 / (m + 1))
-    bt = np.random.normal(mean, std_dev, size=m).astype(np.float32)
+    bt = np.zeros(m).astype(np.float32)  # see upstream PR at https://github.com/facebookresearch/dlrm/pull/358
     # approach 1
     LL.weight.data = torch.tensor(W, requires_grad=True)
     LL.bias.data = torch.tensor(bt, requires_grad=True)
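
For context, a minimal sketch of the same fix applied to a standalone torch.nn.Linear layer. The helper name and the layer sizes below are hypothetical, not code from this repository; it only mirrors the pattern in the diff: weights drawn with Xavier (Glorot) scaling, std = sqrt(2 / (fan_in + fan_out)), and a zero bias so the following ReLU is not pushed into its dead region by a large negative offset.

    import numpy as np
    import torch

    def init_linear_xavier(layer: torch.nn.Linear) -> None:
        # Hypothetical helper for illustration; mirrors the initialization
        # in the diff: W ~ N(0, sqrt(2 / (fan_in + fan_out))), bias = 0.
        n, m = layer.in_features, layer.out_features  # fan_in, fan_out
        std_dev = np.sqrt(2 / (m + n))
        # nn.Linear stores weight as (out_features, in_features) == (m, n)
        W = np.random.normal(0.0, std_dev, size=(m, n)).astype(np.float32)
        bt = np.zeros(m, dtype=np.float32)  # zero bias keeps the following ReLU active
        layer.weight.data = torch.tensor(W, requires_grad=True)
        layer.bias.data = torch.tensor(bt, requires_grad=True)

    # usage (sizes are illustrative)
    layer = torch.nn.Linear(128, 64)
    init_linear_xavier(layer)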
