Numerical instability between Tensorflow and Pytorch #56

Open
Mustardburger opened this issue Jul 22, 2023 · 2 comments

Comments

@Mustardburger
Collaborator

Mustardburger commented Jul 22, 2023

Issue type

Bug or help needed

Relevant package versions

numpy == 1.24.1
tensorflow == 2.11.0
torch == 2.0.1

Python version

3.8.0

Current behaviour

The envelope forms in tensorflow and pytorch (defined here) yield very similar results: the difference between the two outputs is on the scale of 1e-8. However, these differences accumulate over time steps in the ODE solver and become very noticeable after around 150 to 200 steps.
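
For intuition, here is a toy sketch (hypothetical linear dynamics, not the CellBox ODE) of how a one-ULP float32 perturbation is amplified by a Heun solver over 200 steps with dT = 0.1:

import numpy as np

# Toy dynamics dx/dt = x: each Heun step multiplies the state by
# (1 + dt + dt**2 / 2), so an initial one-ULP difference (~6e-8)
# between two trajectories grows by roughly e^20 over 200 steps.
def heun_step(x, dt=np.float32(0.1)):
    f0 = x             # f(x) = x
    f1 = x + dt * f0   # f evaluated at the Euler predictor
    return x + dt * (f0 + f1) / np.float32(2.0)

x_a = np.float32(1.0)
x_b = np.float32(1.0 + 6e-8)  # rounds to one ULP above 1.0
for step in range(1, 201):
    x_a, x_b = heun_step(x_a), heun_step(x_b)
    if step % 50 == 0:
        print(step, abs(x_a - x_b))

With saturating dynamics like tanh the growth is slower, but the mechanism is the same: every step both injects new rounding differences and propagates the old ones.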

Code to reproduce

The recommended envelope form for CellBox is tanh. The code below computes the output of tensorflow's and pytorch's isolated envelope form set to tanh (configured in KernelConfig). No ODE solver is involved yet.

import numpy as np
import tensorflow.compat.v1 as tf
import torch
tf.disable_v2_behavior()

class KernelConfig(object):
    def __init__(self):
        
        self.n_x = 5
        self.envelope_form = "tanh" # options: tanh, polynomial, hill, linear, clip linear
        self.envelope_fn = None
        self.polynomial_k = 2 # larger than 1
        self.ode_degree = 1
        self.envelope = 0
        self.ode_solver = "heun" # options: euler, heun, rk4, midpoint
        self.dT = 0.1
        self.n_T = 1000
        self.gradient_zero_from = None

args = KernelConfig()
W = np.random.normal(loc=0.01, size=(args.n_x, args.n_x))
eps = np.ones((args.n_x, 1), dtype=np.float32)
alpha = np.ones((args.n_x, 1), dtype=np.float32)
y0_np = np.zeros((args.n_x, 1))

# Test the envelope
def tensorflow_envelope():
    from cellbox.kernel import get_envelope
    envelope_fn = get_envelope(args)

    params = {}
    W_copy = np.copy(W)
    params["W"] = tf.convert_to_tensor(W_copy, dtype=tf.float32)
    if args.ode_degree == 1:
        def weighted_sum(x):
            return tf.matmul(params['W'], x)
    
    # params["W"] is already a tensor, so apply the envelope to W @ W directly
    return envelope_fn(weighted_sum(params["W"])).eval(session=tf.Session())

def pytorch_get_envelope(args):
    """get the envelope form based on the given argument"""
    if args.envelope_form == 'tanh':
        args.envelope_fn = torch.tanh
    elif args.envelope_form == 'polynomial':
        k = args.polynomial_k
        assert k > 1, "Polynomial order has to be k >= 2."
        if k % 2 == 1:  # odd order polynomial equation
            args.envelope_fn = lambda x: x ** k / (1 + torch.abs(x) ** k)
        else:  # even order polynomial equation
            args.envelope_fn = lambda x: x**k/(1+x**k)*torch.sign(x)
    elif args.envelope_form == 'hill':
        k = args.polynomial_k
        assert k > 1, "Hill coefficient has to be k>=2."
        args.envelope_fn = lambda x: 2 * (1 - 1 / (1 + torch.nn.functional.relu(x + 1) ** k)) - 1
    elif args.envelope_form == 'linear':
        args.envelope_fn = lambda x: x
    elif args.envelope_form == 'clip linear':
        args.envelope_fn = lambda x: torch.clamp(x, min=-1, max=1)
    else:
        raise Exception("Illegal envelope function. Choose from [tanh, polynomial, hill, linear, clip linear]")
    return args.envelope_fn

def pytorch_envelope():
    envelope_fn = pytorch_get_envelope(args)
    params = {}
    W_copy = np.copy(W)
    params["W"] = torch.tensor(W_copy, dtype=torch.float32)
    if args.ode_degree == 1:
        def weighted_sum(x):
            return torch.matmul(params['W'], x)

    # params["W"] is already a tensor, so apply the envelope to W @ W directly
    return envelope_fn(weighted_sum(params["W"])).numpy()

tf_out = tensorflow_envelope()
torch_out = pytorch_envelope()
print(np.abs(tf_out - torch_out))

The output is:

[[0.0000000e+00 1.4901161e-08 0.0000000e+00 0.0000000e+00 0.0000000e+00]
 [5.9604645e-08 0.0000000e+00 5.9604645e-08 2.9802322e-08 5.9604645e-08]
 [1.1920929e-07 0.0000000e+00 9.3132257e-10 0.0000000e+00 2.9802322e-08]
 [2.9802322e-08 1.4901161e-08 5.9604645e-08 1.8626451e-09 5.9604645e-08]
 [5.9604645e-08 5.9604645e-08 5.9604645e-08 0.0000000e+00 0.0000000e+00]]
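
(For scale, 5.9604645e-08 is exactly 2^-24, one ULP of a float32 near 1, and 1.1920929e-07 is the float32 machine epsilon, so these look like single-rounding differences between the two implementations.)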

If using polynomial with args.polynomial_k = 2:

args.envelope_form = "polynomial"
args.polynomial_k = 2
tf_out = tensorflow_envelope()
torch_out = pytorch_envelope()
print(np.abs(tf_out - torch_out))

The output is:

[[0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 0.0000000e+00 5.9604645e-08 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 0.0000000e+00 1.4551915e-11 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]]

However, if the envelope form is changed to clip linear:

args.envelope_form = "clip linear"
tf_out = tensorflow_envelope()
torch_out = pytorch_envelope()
print(np.abs(tf_out - torch_out))

The output is:

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

This difference might be small, but it adds up within the ODE solver and causes the final results of the tensorflow and pytorch ODE solvers to differ significantly. The same issue persists when args.envelope_form is set to hill or polynomial. However, when args.envelope_form is set to linear or clip linear, the difference between the tensorflow and pytorch ODE solvers is exactly 0, leading me to believe the numerical discrepancy of the other envelope functions causes this behaviour.

Solution

Is there a way around this? And if the two ODE solutions are very different, which one is the correct solution?
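
One diagnostic that might help (an untested sketch, reusing W and the imports from the reproduction above): rerun the isolated comparison in float64. If the discrepancy is float32 rounding noise, it should shrink to around 1e-16.

# float64 version of the isolated tanh envelope in both frameworks
with tf.Session() as sess:
    W_tf = tf.convert_to_tensor(W, dtype=tf.float64)
    tf_out64 = sess.run(tf.tanh(tf.matmul(W_tf, W_tf)))
W_torch = torch.tensor(W, dtype=torch.float64)
torch_out64 = torch.tanh(torch.matmul(W_torch, W_torch)).numpy()
print(np.abs(tf_out64 - torch_out64).max())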

@Mustardburger Mustardburger added the bug Something isn't working label Jul 22, 2023
@cannin
Member

cannin commented Jul 24, 2023

@Mustardburger @DesmondYuan Some thoughts:

  1. Get rid of the cellbox requirement if you need to post this as a Torch issue.
  2. Is there some specific TF/Torch version of the power operator that would give the same output?
    args.envelope_fn = lambda x: x ** k / (1 + tf.abs(x) ** k)
  3. What happens if there is no compat.v1?
  4. In the equation (and similar):
lambda x: x ** k / (1 + torch.abs(x) ** k)

do you know at what specific point the error starts appearing? Is it x ** k, or 1 + torch.abs(x) ** k, or something else?
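
For example, something like this (untested sketch, reusing the imports from the reproduction above) could compare each sub-expression directly:

# Compare each sub-expression of the polynomial envelope between frameworks
x = np.random.normal(size=(5, 5)).astype(np.float32)
k = 2
with tf.Session() as sess:
    x_tf = tf.convert_to_tensor(x)
    tf_pow = sess.run(x_tf ** k)
    tf_den = sess.run(1 + tf.abs(x_tf) ** k)
x_torch = torch.tensor(x)
torch_pow = (x_torch ** k).numpy()
torch_den = (1 + torch.abs(x_torch) ** k).numpy()
print("x ** k         :", np.abs(tf_pow - torch_pow).max())
print("1 + abs(x) ** k:", np.abs(tf_den - torch_den).max())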

@Mustardburger
Collaborator Author

Mustardburger commented Jul 25, 2023

@cannin @DesmondYuan:

  1. Yes, for the issue on Pytorch I will get rid of everything CellBox related. If you think my way of presenting the issue is good, I can make some small changes and submit it to Pytorch.
  2. That's a good point, I'm not sure which function (power operator or division) within that envelope function causes the difference. I will have a look.
  3. That's also a good point, I haven't tried it. I doubt it will alleviate the issue, but if it does, does that indicate something about the numerical accuracy of tensorflow v1?
  4. Same for 2.

I am also wondering: since the tensorflow and pytorch code are now identical but lead to different solutions, which one of them is the more correct solution?
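
As a sanity check (an untested sketch, reusing W, tf_out, and torch_out from the reproduction above), one could compare both float32 outputs against a float64 numpy reference and see which framework lands closer:

# float64 numpy reference for the tanh envelope applied to W @ W,
# starting from the same float32-rounded W that both frameworks saw
W32 = W.astype(np.float32)
ref = np.tanh(W32.astype(np.float64) @ W32.astype(np.float64))
print("tf    vs float64 ref:", np.abs(tf_out.astype(np.float64) - ref).max())
print("torch vs float64 ref:", np.abs(torch_out.astype(np.float64) - ref).max())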
