Update tn #74

Open · wants to merge 14 commits into base: main
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -10,7 +10,7 @@ repos:
- id: check-toml
- id: debug-statements
- repo: https://github.com/psf/black
rev: 24.4.2
rev: 24.8.0
hooks:
- id: black
- repo: https://github.com/pycqa/isort
@@ -25,7 +25,7 @@ repos:
additional_dependencies: [tomli]
args: [--in-place, --config, ./pyproject.toml]
- repo: https://github.com/asottile/pyupgrade
rev: v3.16.0
rev: v3.17.0
hooks:
- id: pyupgrade
- repo: https://github.com/hadialqattan/pycln
1 change: 1 addition & 0 deletions src/qibotn/backends/cutensornet.py
@@ -12,6 +12,7 @@ class CuTensorNet(NumpyBackend):  # pragma: no cover

def __init__(self, runcard):
super().__init__()
import cuquantum
Tankya2 marked this conversation as resolved.
from cuquantum import cutensornet as cutn # pylint: disable=import-error

if runcard is not None:
101 changes: 83 additions & 18 deletions src/qibotn/eval.py
@@ -62,6 +62,7 @@ def dense_vector_tn_MPI(qibo_circ, datatype, n_samples=8):
Dense vector of quantum circuit.
"""

import cuquantum.cutensornet as cutn
Member:

Any reason to keep the imports within the functions? (instead of top-level)

I know it was like this even before this PR...

Contributor Author:

The reason was that not all functions require the import, specifically dense_vector_tn(), expectation_pauli_tn(), dense_vector_mps(), pauli_string_gen(). Do you think it is better to bring them to the top-level?
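A minimal sketch of the function-local ("lazy") import pattern under discussion, with placeholder function names rather than qibotn code: the heavy GPU-only dependency is imported only when a GPU code path actually runs, so the module can still be imported on machines without cuQuantum.

```python
def gpu_entry_point():
    # Deferred import: an ImportError is raised only if this code path is used.
    import cuquantum.cutensornet as cutn  # pylint: disable=import-error

    return cutn


def cpu_entry_point():
    # No cuQuantum needed here, so importing the module never requires it.
    return "runs anywhere"
```

A single top-level import would be tidier, but it would make importing the whole module fail wherever cuQuantum is not installed, which is the trade-off the thread above is weighing.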

from cuquantum import Network
from mpi4py import MPI

@@ -71,21 +72,31 @@ def dense_vector_tn_MPI(qibo_circ, datatype, n_samples=8):
size = comm.Get_size()

device_id = rank % getDeviceCount()
cp.cuda.Device(device_id).use()
mempool = cp.get_default_memory_pool()

# Perform circuit conversion
myconvertor = QiboCircuitToEinsum(qibo_circ, dtype=datatype)
if rank == 0:
myconvertor = QiboCircuitToEinsum(qibo_circ, dtype=datatype)

operands = myconvertor.state_vector_operands()
operands = myconvertor.state_vector_operands()
else:
operands = None
Member:

What's the actual purpose of this?

If rank != 0, qibo_circ is fully ignored...

Even if it is somehow meaningful (I'm not seeing how, but that may be my limitation), the result could only be trivial, so you could even return immediately, without executing all the other operations...

Contributor Author (Tankya2, Oct 30, 2024):

Each rank needs the same initial set of operands for the computation. Here, the operands are created on rank 0; for all other ranks the operands are just set to None. In line 86, the operands created on rank 0 are then broadcast to all other ranks.
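A small sketch of the broadcast pattern described above, assuming standard mpi4py semantics; the operand list is a placeholder, not the actual converted circuit tensors.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
root = 0

if rank == root:
    operands = ["tensor_0", "tensor_1"]  # stand-in for the converted circuit operands
else:
    operands = None

# After the broadcast every rank holds the same operands, so each rank can build the
# Network and contract its assigned slices in parallel.
operands = comm.bcast(operands, root)
print(rank, operands)
```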


# Assign the device for each process.
device_id = rank % getDeviceCount()
Comment on lines -80 to -81:

Member:

Do you remember why it was repeated before?

Member:

The comment may still be useful, and you could lift it to the line above.

operands = comm.bcast(operands, root)

# Create network object.
network = Network(*operands, options={"device_id": device_id})

# Compute the path on all ranks with 8 samples for hyperoptimization. Force slicing to enable parallel contraction.
path, info = network.contract_path(
optimize={"samples": n_samples, "slicing": {"min_slices": max(32, size)}}
optimize={
"samples": n_samples,
"slicing": {
"min_slices": max(32, size),
"memory_model": cutn.MemoryModel.CUTENSOR,
},
}
)

# Select the best path from all ranks.
@@ -114,6 +125,9 @@ def dense_vector_tn_MPI(qibo_circ, datatype, n_samples=8):
# Sum the partial contribution from each process on root.
result = comm.reduce(sendobj=result, op=MPI.SUM, root=root)

del network
Member:

What is this for? Why do you now need it?

Member:

Assuming it is for memory management:

Deletion of a name removes the binding of that name from the local or global namespace, depending on whether the name occurs in a global statement in the same code block. If the name is unbound, a NameError exception will be raised.
https://docs.python.org/3/reference/simple_stmts.html#the-del-statement

del x doesn't directly call x.__del__(); the former decrements the reference count for x by one, and the latter is only called when x's reference count reaches zero.
https://docs.python.org/3/reference/datamodel.html#object.__del__

So, unless it is exactly documented that you should apply del network before .free_all_blocks() (and even in that case, we should make sure that the underlying .__del__() is called when you wish), it makes little or no difference compared with just waiting for the return, since the network name would go out of scope anyway and the .__del__() method would be called. This holds only if no other references are kept in the returned objects, which is also a condition for the current del network statement to work, so nothing would change from that point of view.
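A toy sketch of the reference-counting behaviour quoted above, assuming CPython; Tracked is a hypothetical class standing in for the Network object.

```python
class Tracked:
    def __del__(self):
        print("finalized")


def with_extra_reference():
    obj = Tracked()
    keep = obj   # a second reference, e.g. one held inside a returned object
    del obj      # unbinds the name only; __del__ does not run yet
    print("after del, before return")
    return keep  # __del__ fires only once the caller drops this last reference


leftover = with_extra_reference()  # prints "after del, before return"
del leftover                       # only now is "finalized" printed (on CPython)
```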

mempool.free_all_blocks()

return result, rank


@@ -136,6 +150,7 @@ def dense_vector_tn_nccl(qibo_circ, datatype, n_samples=8):
Returns:
Dense vector of quantum circuit.
"""
import cuquantum.cutensornet as cutn
Member:

Same as above

from cupy.cuda import nccl
from cuquantum import Network
from mpi4py import MPI
@@ -148,6 +163,7 @@ def dense_vector_tn_nccl(qibo_circ, datatype, n_samples=8):
device_id = rank % getDeviceCount()

cp.cuda.Device(device_id).use()
mempool = cp.get_default_memory_pool()

# Set up the NCCL communicator.
nccl_id = nccl.get_unique_id() if rank == root else None
@@ -157,12 +173,26 @@ def dense_vector_tn_nccl(qibo_circ, datatype, n_samples=8):
# Perform circuit conversion
myconvertor = QiboCircuitToEinsum(qibo_circ, dtype=datatype)
operands = myconvertor.state_vector_operands()
# Perform circuit conversion
if rank == 0:
myconvertor = QiboCircuitToEinsum(qibo_circ, dtype=datatype)
operands = myconvertor.state_vector_operands()
Tankya2 marked this conversation as resolved.
else:
operands = None

operands = comm_mpi.bcast(operands, root)

network = Network(*operands)

# Compute the path on all ranks with 8 samples for hyperoptimization. Force slicing to enable parallel contraction.
path, info = network.contract_path(
optimize={"samples": n_samples, "slicing": {"min_slices": max(32, size)}}
optimize={
"samples": n_samples,
"slicing": {
"min_slices": max(32, size),
"memory_model": cutn.MemoryModel.CUTENSOR,
},
}
)

# Select the best path from all ranks.
@@ -200,6 +230,9 @@ def dense_vector_tn_nccl(qibo_circ, datatype, n_samples=8):
stream_ptr,
)

del network
Member:

Same as above

mempool.free_all_blocks()

return result, rank


@@ -226,6 +259,7 @@ def expectation_pauli_tn_nccl(qibo_circ, datatype, pauli_string_pattern, n_sampl
Returns:
Expectation of quantum circuit due to pauli string.
"""
import cuquantum.cutensornet as cutn
from cupy.cuda import nccl
from cuquantum import Network
from mpi4py import MPI
@@ -238,23 +272,36 @@ def expectation_pauli_tn_nccl(qibo_circ, datatype, pauli_string_pattern, n_sampl
device_id = rank % getDeviceCount()

cp.cuda.Device(device_id).use()
mempool = cp.get_default_memory_pool()

# Set up the NCCL communicator.
nccl_id = nccl.get_unique_id() if rank == root else None
nccl_id = comm_mpi.bcast(nccl_id, root)
comm_nccl = nccl.NcclCommunicator(size, nccl_id, rank)

# Perform circuit conversion
myconvertor = QiboCircuitToEinsum(qibo_circ, dtype=datatype)
operands = myconvertor.expectation_operands(
pauli_string_gen(qibo_circ.nqubits, pauli_string_pattern)
)
if rank == 0:

myconvertor = QiboCircuitToEinsum(qibo_circ, dtype=datatype)
operands = myconvertor.expectation_operands(
pauli_string_gen(qibo_circ.nqubits, pauli_string_pattern)
)
else:
operands = None

operands = comm_mpi.bcast(operands, root)

network = Network(*operands)

# Compute the path on all ranks with 8 samples for hyperoptimization. Force slicing to enable parallel contraction.
path, info = network.contract_path(
optimize={"samples": n_samples, "slicing": {"min_slices": max(32, size)}}
optimize={
"samples": n_samples,
"slicing": {
"min_slices": max(32, size),
"memory_model": cutn.MemoryModel.CUTENSOR,
},
}
)

# Select the best path from all ranks.
@@ -292,6 +339,9 @@ def expectation_pauli_tn_nccl(qibo_circ, datatype, pauli_string_pattern, n_sampl
stream_ptr,
)

del network
mempool.free_all_blocks()

return result, rank


@@ -318,6 +368,7 @@ def expectation_pauli_tn_MPI(qibo_circ, datatype, pauli_string_pattern, n_sample
Returns:
Expectation of quantum circuit due to pauli string.
"""
import cuquantum.cutensornet as cutn
from cuquantum import Network
from mpi4py import MPI # this line initializes MPI

@@ -326,24 +377,35 @@ def expectation_pauli_tn_MPI(qibo_circ, datatype, pauli_string_pattern, n_sample
rank = comm.Get_rank()
size = comm.Get_size()

# Assign the device for each process.
device_id = rank % getDeviceCount()
cp.cuda.Device(device_id).use()
mempool = cp.get_default_memory_pool()

# Perform circuit conversion
myconvertor = QiboCircuitToEinsum(qibo_circ, dtype=datatype)
if rank == 0:
myconvertor = QiboCircuitToEinsum(qibo_circ, dtype=datatype)

operands = myconvertor.expectation_operands(
pauli_string_gen(qibo_circ.nqubits, pauli_string_pattern)
)
operands = myconvertor.expectation_operands(
pauli_string_gen(qibo_circ.nqubits, pauli_string_pattern)
)
else:
operands = None

# Assign the device for each process.
device_id = rank % getDeviceCount()
operands = comm.bcast(operands, root)

# Create network object.
network = Network(*operands, options={"device_id": device_id})

# Compute the path on all ranks with 8 samples for hyperoptimization. Force slicing to enable parallel contraction.
path, info = network.contract_path(
optimize={"samples": n_samples, "slicing": {"min_slices": max(32, size)}}
optimize={
"samples": n_samples,
"slicing": {
"min_slices": max(32, size),
"memory_model": cutn.MemoryModel.CUTENSOR,
},
}
)

# Select the best path from all ranks.
@@ -372,6 +434,9 @@ def expectation_pauli_tn_MPI(qibo_circ, datatype, pauli_string_pattern, n_sample
# Sum the partial contribution from each process on root.
result = comm.reduce(sendobj=result, op=MPI.SUM, root=root)

del network
mempool.free_all_blocks()

return result, rank

