Serialization¶
Groundhog serializes function arguments and results to send them between processes. Understanding serialization helps you avoid common errors and handle large data efficiently.
Serialization Roundtrip¶
Serialization happens in a roundtrip process between your local Python process and the remote execution environment:
1. Arguments: Original Process -> Shell Script¶
When you call result = my_function.remote(x, y, z=10):
- Groundhog pickles
(args, kwargs)as a tuple:((x, y), {"z": 10}) - Base64-encodes the pickled bytes to create a text string
- Embeds the string in the shell script as a heredoc (written to a
.infile)
This happens identically for .remote(), .submit(), and .local() - all embed the serialized payload in the shell script.
2. Arguments: Shell Script -> Runner Process¶
The shell script runs (on HPC node for .remote(), locally for .local()):
- Writes the
.infile containing the base64-encoded payload - Executes the runner script
- Runner reads the
.infile - Runner base64-decodes and unpickles to get
(args, kwargs)
3. Function Executes¶
The runner calls your function with the deserialized arguments:
args, kwargs = deserialize(payload)
func = getattr(module, "function_name")
result = func(*args, **kwargs)
Your function executes and returns a value.
4. Results: Runner Process -> Shell Script¶
The runner serializes the return value:
- Pickles the result object
- Base64-encodes the pickled bytes
- Writes to a
.outfile - Prints the
.outcontents to stdout
5. Results: Shell Script -> Original Process¶
The shell script completes and Groundhog captures stdout:
- Extracts the serialized result from stdout
- Base64-decodes and unpickles to restore the original object
- Returns to your calling code
This roundtrip ensures arguments and results travel safely as plain text through shell scripts and Globus Compute's transport layer.
Size Limits¶
Globus Compute has a 10 MB payload limit. If your serialized arguments or results exceed this, remote calls fail with PayloadTooLargeError.
The limit applies to the serialized size, not the original object size. Pickle+b64 adds overhead, so a 5 MB NumPy array might serialize to 7-8 MB.
Check size before submission:
import numpy as np
large_array = np.random.random((1000, 1000))
try:
result = process.remote(large_array)
except PayloadTooLargeError as e:
print(f"Payload too large: {e.size_mb:.2f} MB")
# Use alternative approach
Handling Large Data¶
For data exceeding the 10 MB limit, you have options:
ProxyStore (.local() only)¶
Coming Soon for Remote Execution 👷🚧
ProxyStore integration is currently a proof-of-concept, and has only been implemented for .local() calls. Support for .remote() and .submit() proxying via Globus Transfer is under development.
.local() automatically uses ProxyStore for large data, creating an effective upper bound on the size of the shell script's embedded payload by writing data to disk and serializing a small proxy object instead. The proxy loads data on demand in the subprocess. See also: proxystore docs
This happens automatically - no configuration needed:
# Large data automatically uses ProxyStore
large_array = np.random.random((10000, 10000))
result = process.local(large_array) # Works, no size limit
Shared Storage (.remote() / .submit())¶
For remote execution, use HPC shared filesystems:
# Pass path instead of data
@hog.function(endpoint="anvil")
def process(data_path: str = "/scratch/shared/data.npy") -> float:
import numpy as np
data = np.load(data_path)
return float(np.mean(data))
Chunking¶
Split large data into smaller chunks:
chunks = np.array_split(large_array, 10)
futures = [process.submit(chunk) for chunk in chunks]
results = [f.result() for f in futures]
final_result = combine(results)
Best Practices¶
Convert to Plain Python Types¶
NumPy and pandas types (for example) don't always deserialize cleanly:
# Bad - returns numpy.float64
@hog.function(endpoint="anvil")
def compute_mean(data: list[float]) -> float:
import numpy as np
return np.mean(data)
# Better - returns plain float
@hog.function(endpoint="anvil")
def compute_mean(data: list[float]) -> float:
import numpy as np
return float(np.mean(data))
Explicit conversion to int(), float(), list() prevents errors.
Test with .local() First¶
.local() uses the same serialization as .remote() but runs locally. Test serialization without HPC access:
# Test locally first
result = my_function.local(args)
# If that works, remote will too (size permitting)
result = my_function.remote(args)
This catches serialization errors early.