Milo's parallel tuning #288

Open

wants to merge 116 commits into base: master

116 commits
e827e75
created remoteRunner class
MiloLurati Jan 22, 2024
deab579
added remote actor class
Jan 23, 2024
52287b9
update remote runner
Jan 23, 2024
40cc888
added remote_mode function argument to tune_kernel and related remote…
Jan 23, 2024
b14aaf0
added parallel tuning test
Jan 23, 2024
1a55a5c
added pool of actors
Jan 23, 2024
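The "pool of actors" commit refers to a pool of Ray actors, each bound to one GPU, that pull tuning configurations from shared work. The PR's Ray code is not shown here; as a rough stdlib sketch of the scheduling idea (all names illustrative, not the PR's API):

```python
import queue
import threading

def run_with_actor_pool(configs, device_ids, evaluate):
    """Illustrative stand-in for a pool of Ray actors: each worker is pinned
    to one device and pulls configurations from a shared queue until empty."""
    work = queue.Queue()
    for cfg in configs:
        work.put(cfg)
    results, lock = [], threading.Lock()

    def worker(device_id):
        while True:
            try:
                cfg = work.get_nowait()
            except queue.Empty:
                return  # no work left for this device
            res = evaluate(cfg, device_id)  # would benchmark on the GPU
            with lock:
                results.append(res)

    threads = [threading.Thread(target=worker, args=(d,)) for d in device_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_with_actor_pool(range(8), [0, 1], lambda c, d: (c, d))` evaluates all eight configurations across the two simulated devices.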
4fef594
clean up remote runner and actor
Jan 24, 2024
3f3b9e6
updates on remote code
Jan 30, 2024
fe5da39
changed naming from remote to parallel
Apr 2, 2024
a43dc84
added get_num_devices function
Apr 4, 2024
ab3aa24
added ensemble and parallel runner related stuff
Apr 4, 2024
e8a7228
switched to new naming of parallel remote and some clean up
Apr 4, 2024
3dd748c
added class instances needed down the line in the execution of the en…
Apr 4, 2024
e743bec
changed naming due to ensemble implementation, this was the original …
Apr 4, 2024
df949d0
started ensemble implementation, very basic functionality works
Apr 4, 2024
45a1747
updated tests
Apr 4, 2024
5fb5927
clean up in parallel runner
Apr 5, 2024
a96ef43
moved to sub directory ray
Apr 5, 2024
c831f5f
added subdirectory ray with all 3 actor classes
Apr 5, 2024
0cc2a6e
integrated calls to cache manager functions when running in ensemble
Apr 5, 2024
b816f3d
added cache manager logic
Apr 5, 2024
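The cache manager introduced here is a central process that all parallel workers consult so the same configuration is never benchmarked twice. In the PR this role is played by a Ray actor; a minimal lock-protected sketch of the idea (names illustrative):

```python
import threading

class CacheManager:
    """Toy stand-in for the PR's cache-manager actor: a single shared store
    of benchmarked configurations, safe to query from several workers."""

    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def check_and_retrieve(self, config):
        # return the stored result if this configuration was already benchmarked
        with self._lock:
            return self._cache.get(tuple(config))

    def store(self, config, result):
        with self._lock:
            self._cache[tuple(config)] = result
```

A worker would call `check_and_retrieve` before compiling and benchmarking, and `store` afterwards.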
781839a
added instances needed for the ensemble down the line of execution
Apr 8, 2024
9f8d212
added strategy option to get_options function
Apr 8, 2024
d08b5d4
added ignore_reinit_error to ray init
Apr 10, 2024
903c981
added ignore_reinit_error to ray init
Apr 10, 2024
1a2219a
added cache manager to parallel tuning
Apr 10, 2024
a476585
re-assign tuning options to final version from the cache manager at t…
Apr 11, 2024
6233e09
small bug fix in execute
MiloLurati Apr 11, 2024
cde62ae
Merge branch 'KernelTuner:master' into parallelTuning
MiloLurati Apr 14, 2024
0722629
Merge pull request #2 from KernelTuner/simulation-searchspace-improve…
MiloLurati Apr 14, 2024
14e5f0b
updates to run ensemble in simulation mode on CPUs
MiloLurati Apr 16, 2024
a963dac
fixed problem with ray resources and stalling actors
MiloLurati Apr 23, 2024
c55b870
added setup_resources and new impl of costfunc (not yet tested and st…
MiloLurati Apr 25, 2024
d8541a0
added ensemble and memetic to strategy map and import
MiloLurati Apr 25, 2024
c755254
rearranged how parallel runner deals with cache manager and actor's l…
MiloLurati Apr 25, 2024
a23ef94
initial adaptions for memetic and cleaned up logic of ensemble
MiloLurati Apr 25, 2024
697ead0
returning tuning_options for memetic logic
MiloLurati Apr 25, 2024
b247ed0
init impl of memetic strategy
MiloLurati Apr 25, 2024
948ab7f
initial adaption for memetic strategy
MiloLurati Apr 25, 2024
9e40d4e
removed brute_force from strategy map and import
MiloLurati Apr 25, 2024
3cb428d
fixes of new costfunc and stop criterion is checked retrospectively
MiloLurati Apr 25, 2024
2d13fc3
fixed bug with tuning options cache manager
MiloLurati Apr 29, 2024
1a2ba53
fixed some bugs for memetic algo functioning
MiloLurati Apr 29, 2024
2aba6f5
removed debug prints
MiloLurati Apr 29, 2024
cd3f212
fixed problem with single config input and final results data structure
MiloLurati Apr 29, 2024
af9bd5e
added progress prints of memetic algo and kill statement for cache ma…
MiloLurati Apr 29, 2024
d382f05
sort results for retrospective stop criterion check
MiloLurati Apr 29, 2024
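Sorting results for a retrospective stop-criterion check addresses a subtlety of parallel tuning: workers finish out of order, so the budget must be applied to results ordered by completion time after the fact. A minimal sketch, assuming a hypothetical `(timestamp, config, score)` result format:

```python
def apply_stop_criterion(results, max_fevals):
    """Retrospective stop criterion: order results by completion time and
    keep only those that fit within the evaluation budget."""
    ordered = sorted(results, key=lambda r: r[0])  # r[0] is the timestamp
    return ordered[:max_fevals]
```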
218b8ac
added comments
MiloLurati Apr 29, 2024
79b7a50
updated returning results logic in _evaluate_configs()
MiloLurati Apr 29, 2024
88f63b4
added comments
MiloLurati Apr 29, 2024
a2afd1d
updates to run more strategies than devices available
MiloLurati Apr 30, 2024
d950b2d
returning last two lists of candidates for memetic algo
MiloLurati May 3, 2024
980777f
returning last two candidates for memetic algo
MiloLurati May 3, 2024
95a2f0f
returning last two populations for memetic algo
MiloLurati May 3, 2024
89c499b
implemented adaptive local search depth logic and fix few issues, wor…
MiloLurati May 3, 2024
babba0b
modifications related to last iteration of memetic algo
MiloLurati May 6, 2024
e0e1e61
updates related to old population logic
MiloLurati May 6, 2024
6305782
unified two actors into one
MiloLurati May 7, 2024
0f2b7e4
updates related to actors unification and memetic algo development
MiloLurati May 7, 2024
63ddedb
added create_actor_on_device and initialize_ray
MiloLurati May 7, 2024
d7fe9b4
updates related to unification of actors, memetic algo, and reutiliza…
MiloLurati May 7, 2024
46fcde1
returning 80% of cpus for simulation mode in get_num_devices
MiloLurati May 7, 2024
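Capping simulation mode at 80% of the CPUs keeps the host responsive while still parallelizing widely. A hedged sketch of what such a helper could look like (the real `get_num_devices` also enumerates GPUs, which is stubbed out here):

```python
import os

def get_num_devices(simulation_mode=False):
    """Sketch of a device-count helper: in simulation mode use roughly 80%
    of the CPU cores; real GPU enumeration would need a backend."""
    if simulation_mode:
        return max(1, int(os.cpu_count() * 0.8))
    raise NotImplementedError("GPU enumeration requires a backend")
```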
d543848
updates related to actor unification and reutilization of actors for …
MiloLurati May 7, 2024
15df6ea
updates on feval counting and distributing
MiloLurati May 7, 2024
96d03b8
Merge branch 'KernelTuner:master' into parallelTuning
MiloLurati May 8, 2024
18ce214
Merge branch 'parallelTuning' of https://github.com/MiloLurati/kernel…
MiloLurati May 8, 2024
ec719a2
added logic for time limit stop
MiloLurati May 10, 2024
6c2a62b
debug prints clean up
MiloLurati May 10, 2024
c7fd2af
unified parallel tuning and parallel ensemble logic in ParallelRunner
MiloLurati May 10, 2024
af532c5
added self.init_arguments for parallel runner execution
MiloLurati May 28, 2024
82d9886
fix about non-pickleable observers and other small adjustments
MiloLurati May 28, 2024
c6a2f36
now the cache manager deals only with the cache and not with the enti…
MiloLurati May 28, 2024
5fe2e56
fix related to non-pickleable observers
MiloLurati May 28, 2024
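Several commits deal with non-pickleable observers: observers holding device handles or lambdas cannot be serialized and shipped to a Ray worker, so they must be detected and re-created inside the actor instead. The detection part can be sketched with the stdlib:

```python
import pickle

def is_pickleable(obj):
    """Return True if obj survives pickling; objects that fail (lambdas,
    open handles, device contexts) must be re-created on the worker side."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
```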
3b3317c
update related to new cache manager
MiloLurati May 28, 2024
1593806
added cleanup at the end of the ensemble
MiloLurati May 28, 2024
efd5be2
changes to hyperparameters
MiloLurati May 28, 2024
bc66244
changes related to non-pickleable observers
MiloLurati May 28, 2024
c5cfd05
Merge branch 'master' into parallelTuning
MiloLurati May 28, 2024
9e9f1af
updated init_arguments to a dict
MiloLurati May 31, 2024
3fed66c
updates for searchspace split, ensemble related fix, and observer exe…
MiloLurati May 31, 2024
86a9b67
small corrections related to stop criterion for memetic
MiloLurati Jun 5, 2024
de5fc49
added logic to check if all GPUs are of the same type
MiloLurati Jun 7, 2024
1b0adb0
deleted split searchspace function
MiloLurati Jun 7, 2024
5130286
changed place where ray is initialized
MiloLurati Jun 7, 2024
5b9d817
setting BO to random sampling if needed
MiloLurati Jun 7, 2024
8b1e57f
Merge branch 'KernelTuner:master' into parallelTuning
MiloLurati Jun 7, 2024
040a57e
added num_gpus option
MiloLurati Jun 7, 2024
acaaeb1
removed debug print
MiloLurati Jun 10, 2024
63d9f65
added check_and_retrive strategy option
MiloLurati Jun 10, 2024
e604510
moved reinitialization of actor observers to execute method, before w…
MiloLurati Jun 18, 2024
5933a69
changes related to re-initialization of observers in actor init and d…
MiloLurati Jun 18, 2024
4e4c47b
removed unnecessary blocking ray.get
MiloLurati Jun 21, 2024
104205d
removed debug prints
MiloLurati Jul 1, 2024
123fba5
added greedy ILS ensemble instead of default
MiloLurati Jul 1, 2024
d381011
added check on strategy_options
MiloLurati Jul 1, 2024
7e832e3
removed all memetic algo related stuff
MiloLurati Jul 1, 2024
e976bf8
Merge branch 'KernelTuner:master' into parallelTuning
MiloLurati Jul 1, 2024
65d32c1
added ray to pyproject.toml
MiloLurati Jul 1, 2024
a841f2a
Merge branch 'parallelTuning' of https://github.com/MiloLurati/kernel…
MiloLurati Jul 1, 2024
503df1b
updated toml file with ray dashboard
MiloLurati Jul 1, 2024
c126a01
fix small bug in _evaluate_configs
MiloLurati Jul 1, 2024
4df1b0d
adapted test for ensemble
MiloLurati Jul 1, 2024
29a507c
cleaned up unused imports
MiloLurati Jul 1, 2024
7c49a29
added comments
MiloLurati Jul 1, 2024
eb5db41
added documentation and related fixes
MiloLurati Jul 4, 2024
65c6a23
Merge branch 'MiloParallelTuning' into parallelTuning
isazi Mar 25, 2025
b8acffa
Merge pull request #241 from MiloLurati/parallelTuning
isazi Mar 25, 2025
bab28ef
Typo.
isazi Mar 25, 2025
5b93b9c
Remove Observer that does not exist anymore.
isazi Mar 25, 2025
20b7c96
Remove spurious parameter to function.
isazi Mar 25, 2025
50473a2
Remove nonexistent observer.
isazi Mar 25, 2025
ceb0996
Fix Runner interface.
isazi Mar 25, 2025
9739495
Reformat with black.
isazi Mar 25, 2025
d560a30
Fix a bunch of SonarCloud warnings.
isazi Mar 26, 2025
f1f872e
Fix some tests failing.
isazi Mar 26, 2025
de27d90
Fix SonarQube warning and format.
isazi Mar 26, 2025
1 change: 1 addition & 0 deletions doc/source/optimization.rst
@@ -25,6 +25,7 @@ the ``strategy=`` optional argument of ``tune_kernel()``. Kernel Tuner currently
 * "pso" particle swarm optimization
 * "random_sample" takes a random sample of the search space
 * "simulated_annealing" simulated annealing strategy
+* "ensemble" ensemble strategy
 
 Most strategies have some mechanism built in to detect when to stop tuning, which may be controlled through specific
 parameters that can be passed to the strategies using the ``strategy_options=`` optional argument of ``tune_kernel()``. You
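The documented "ensemble" strategy runs several optimizers concurrently on separate GPUs via Ray, sharing one cache of results. A toy, purely sequential illustration of the idea (not Kernel Tuner's API; all names here are illustrative):

```python
import random

def ensemble_search(space, score, n_members=4, evals_per_member=25, seed=0):
    """Toy illustration of an ensemble strategy: several independent search
    instances draw configurations while sharing one cache of evaluated
    points; the best cached configuration over all members wins."""
    cache = {}
    rngs = [random.Random(seed + i) for i in range(n_members)]
    for _ in range(evals_per_member):
        for rng in rngs:  # round-robin stands in for true parallelism
            cfg = rng.choice(space)
            if cfg not in cache:  # shared cache avoids duplicate benchmarks
                cache[cfg] = score(cfg)
    return min(cache, key=cache.get), cache
```

The shared cache is what makes an ensemble more than independent restarts: every member benefits from every other member's evaluations.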
15 changes: 5 additions & 10 deletions kernel_tuner/accuracy.py
@@ -46,9 +46,7 @@ def select_for_configuration(self, params):
 
         if option not in self.data:
             list = ", ".join(map(str, self.data.keys()))
-            raise KeyError(
-                f"'{option}' is not a valid parameter value, should be one of: {list}"
-            )
+            raise KeyError(f"'{option}' is not a valid parameter value, should be one of: {list}")
 
         return self.data[option]

@@ -60,12 +58,14 @@ def _find_bfloat16_if_available():
     # Try to get bfloat16 if available.
     try:
         from bfloat16 import bfloat16
+
         return bfloat16
     except ImportError:
         pass
 
     try:
         from tensorflow import bfloat16
+
         return bfloat16.as_numpy_dtype
     except ImportError:
         pass
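The hunk above only inserts blank lines (black's formatting), but the helper it touches shows a common optional-dependency pattern worth noting: probe packages in order of preference and fall back gracefully. A self-contained version of the same pattern:

```python
def find_bfloat16_if_available():
    """Same probing pattern as the accuracy.py helper: try optional packages
    in order of preference, returning None when none are installed."""
    try:
        from bfloat16 import bfloat16
        return bfloat16
    except ImportError:
        pass
    try:
        from tensorflow import bfloat16
        return bfloat16.as_numpy_dtype
    except ImportError:
        pass
    return None  # caller must handle the missing dtype
```

Returning a sentinel instead of raising lets the caller decide whether bfloat16 support is mandatory.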
@@ -102,9 +102,7 @@ def _to_float_dtype(x: str) -> np.dtype:
 
 
 class TunablePrecision(Tunable):
-    def __init__(
-        self, param_key: str, array: np.ndarray, dtypes: Dict[str, np.dtype] = None
-    ):
+    def __init__(self, param_key: str, array: np.ndarray, dtypes: Dict[str, np.dtype] = None):
         """The ``Tunable`` object can be used as an input argument when tuning
         kernels. It is a container that internally holds several arrays
         containing the same data, but stored in using different levels of
@@ -135,7 +133,6 @@ def __init__(
         if bfloat16 is not None:
             dtypes["bfloat16"] = bfloat16
 
-
         # If dtype is a list, convert it to a dictionary
         if isinstance(dtypes, (list, tuple)):
             dtypes = dict((name, _to_float_dtype(name)) for name in dtypes)
@@ -257,9 +254,7 @@ def metric(a, b):
         raise ValueError(f"invalid error metric provided: {user_key}")
 
     # cast both arguments to f64 before passing them to the metric
-    return lambda a, b: metric(
-        a.astype(np.float64, copy=False), b.astype(np.float64, copy=False)
-    )
+    return lambda a, b: metric(a.astype(np.float64, copy=False), b.astype(np.float64, copy=False))
 
 
 class AccuracyObserver(OutputObserver):
12 changes: 7 additions & 5 deletions kernel_tuner/backends/compiler.py
@@ -34,7 +34,7 @@
 try:
     from hip._util.types import DeviceArray
 except ImportError:
-    Pointer = Exception # using Exception here as a type that will never be among kernel arguments
+    Pointer = Exception  # using Exception here as a type that will never be among kernel arguments
     DeviceArray = Exception


@@ -157,7 +157,9 @@ def ready_argument_list(self, arguments):
 
         for i, arg in enumerate(arguments):
             if not (isinstance(arg, (np.ndarray, np.number, DeviceArray)) or is_cupy_array(arg)):
-                raise TypeError(f"Argument is not numpy or cupy ndarray or numpy scalar or HIP Python DeviceArray but a {type(arg)}")
+                raise TypeError(
+                    f"Argument is not numpy or cupy ndarray or numpy scalar or HIP Python DeviceArray but a {type(arg)}"
+                )
             dtype_str = arg.typestr if isinstance(arg, DeviceArray) else str(arg.dtype)
             if isinstance(arg, np.ndarray):
                 if dtype_str in dtype_map.keys():
@@ -288,7 +290,7 @@ def compile(self, kernel_instance):
             stdout=subprocess.PIPE,
             stderr=subprocess.PIPE,
             text=True,
-            check=True
+            check=True,
         )

subprocess.run(
@@ -299,7 +301,7 @@ def compile(self, kernel_instance):
             stdout=subprocess.PIPE,
             stderr=subprocess.PIPE,
             text=True,
-            check=True
+            check=True,
         )

self.lib = np.ctypeslib.load_library(filename, ".")
@@ -439,7 +441,7 @@ def cleanup_lib(self):
         """unload the previously loaded shared library"""
         if self.lib is None:
             return
 
         if not self.using_openmp and not self.using_openacc:
             # this if statement is necessary because shared libraries that use
             # OpenMP will core dump when unloaded, this is a well-known issue with OpenMP
8 changes: 2 additions & 6 deletions kernel_tuner/backends/cupy.py
@@ -70,9 +70,7 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
         # collect environment information
         env = dict()
         cupy_info = str(cp._cupyx.get_runtime_info()).split("\n")[:-1]
-        info_dict = {
-            s.split(":")[0].strip(): s.split(":")[1].strip() for s in cupy_info
-        }
+        info_dict = {s.split(":")[0].strip(): s.split(":")[1].strip() for s in cupy_info}
         env["device_name"] = info_dict[f"Device {device} Name"]
 
         env["cuda_version"] = cp.cuda.runtime.driverGetVersion()
@@ -129,9 +127,7 @@ def compile(self, kernel_instance):
 
         options = tuple(compiler_options)
 
-        self.current_module = cp.RawModule(
-            code=kernel_string, options=options, name_expressions=[kernel_name]
-        )
+        self.current_module = cp.RawModule(code=kernel_string, options=options, name_expressions=[kernel_name])
 
         self.func = self.current_module.get_function(kernel_name)
         self.num_regs = self.func.num_regs
24 changes: 6 additions & 18 deletions kernel_tuner/backends/nvcuda.py
@@ -56,13 +56,9 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
         CudaFunctions.last_selected_device = device
 
         # compute capabilities and device properties
-        err, major = cudart.cudaDeviceGetAttribute(
-            cudart.cudaDeviceAttr.cudaDevAttrComputeCapabilityMajor, device
-        )
+        err, major = cudart.cudaDeviceGetAttribute(cudart.cudaDeviceAttr.cudaDevAttrComputeCapabilityMajor, device)
         cuda_error_check(err)
-        err, minor = cudart.cudaDeviceGetAttribute(
-            cudart.cudaDeviceAttr.cudaDevAttrComputeCapabilityMinor, device
-        )
+        err, minor = cudart.cudaDeviceGetAttribute(cudart.cudaDeviceAttr.cudaDevAttrComputeCapabilityMinor, device)
         cuda_error_check(err)
         err, self.max_threads = cudart.cudaDeviceGetAttribute(
             cudart.cudaDeviceAttr.cudaDevAttrMaxThreadsPerBlock, device
@@ -164,20 +160,14 @@ def compile(self, kernel_instance):
         if not any(["--std=" in opt for opt in self.compiler_options]):
             self.compiler_options.append("--std=c++11")
         if not any([b"--gpu-architecture=" in opt or b"-arch" in opt for opt in compiler_options]):
-            compiler_options.append(
-                f"--gpu-architecture=compute_{to_valid_nvrtc_gpu_arch_cc(self.cc)}".encode("UTF-8")
-            )
+            compiler_options.append(f"--gpu-architecture=compute_{to_valid_nvrtc_gpu_arch_cc(self.cc)}".encode("UTF-8"))
         if not any(["--gpu-architecture=" in opt or "-arch" in opt for opt in self.compiler_options]):
             self.compiler_options.append(f"--gpu-architecture=compute_{to_valid_nvrtc_gpu_arch_cc(self.cc)}")
 
-        err, program = nvrtc.nvrtcCreateProgram(
-            str.encode(kernel_string), b"CUDAProgram", 0, [], []
-        )
+        err, program = nvrtc.nvrtcCreateProgram(str.encode(kernel_string), b"CUDAProgram", 0, [], [])
         try:
             cuda_error_check(err)
-            err = nvrtc.nvrtcCompileProgram(
-                program, len(compiler_options), compiler_options
-            )
+            err = nvrtc.nvrtcCompileProgram(program, len(compiler_options), compiler_options)
             cuda_error_check(err)
             err, size = nvrtc.nvrtcGetPTXSize(program)
             cuda_error_check(err)
@@ -189,9 +179,7 @@
                 raise SkippableFailure("uses too much shared data")
             else:
                 cuda_error_check(err)
-        err, self.func = cuda.cuModuleGetFunction(
-            self.current_module, str.encode(kernel_name)
-        )
+        err, self.func = cuda.cuModuleGetFunction(self.current_module, str.encode(kernel_name))
         cuda_error_check(err)
 
         # get the number of registers per thread used in this kernel
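The option handling reformatted in this file follows a "don't override the caller" rule: an architecture flag is only appended when none was supplied. The logic, extracted as a self-contained sketch (the `cc` default is an assumed example value, not from the PR):

```python
def ensure_arch_option(options, cc="80"):
    """Append a --gpu-architecture flag only when the caller did not already
    pass one (either spelling); mirrors the check in nvcuda.py's compile."""
    if not any("--gpu-architecture=" in opt or "-arch" in opt for opt in options):
        options.append(f"--gpu-architecture=compute_{cc}")
    return options
```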
16 changes: 4 additions & 12 deletions kernel_tuner/backends/opencl.py
@@ -16,9 +16,7 @@
 class OpenCLFunctions(GPUBackend):
     """Class that groups the OpenCL functions on maintains some state about the device."""
 
-    def __init__(
-        self, device=0, platform=0, iterations=7, compiler_options=None, observers=None
-    ):
+    def __init__(self, device=0, platform=0, iterations=7, compiler_options=None, observers=None):
         """Creates OpenCL device context and reads device properties.
 
         :param device: The ID of the OpenCL device to use for benchmarking
@@ -37,14 +35,10 @@ def __init__(
         platforms = cl.get_platforms()
         self.ctx = cl.Context(devices=[platforms[platform].get_devices()[device]])
 
-        self.queue = cl.CommandQueue(
-            self.ctx, properties=cl.command_queue_properties.PROFILING_ENABLE
-        )
+        self.queue = cl.CommandQueue(self.ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
         self.mf = cl.mem_flags
         # inspect device properties
-        self.max_threads = self.ctx.devices[0].get_info(
-            cl.device_info.MAX_WORK_GROUP_SIZE
-        )
+        self.max_threads = self.ctx.devices[0].get_info(cl.device_info.MAX_WORK_GROUP_SIZE)
         self.compiler_options = compiler_options or []
 
         # observer stuff
@@ -108,9 +102,7 @@ def compile(self, kernel_instance):
         :returns: An OpenCL kernel that can be called directly.
         :rtype: pyopencl.Kernel
         """
-        prg = cl.Program(self.ctx, kernel_instance.kernel_string).build(
-            options=self.compiler_options
-        )
+        prg = cl.Program(self.ctx, kernel_instance.kernel_string).build(options=self.compiler_options)
         func = getattr(prg, kernel_instance.name)
         return func
17 changes: 3 additions & 14 deletions kernel_tuner/backends/pycuda.py
@@ -97,13 +97,9 @@ def _finish_up():
         PyCudaFunctions.last_selected_context = self.context
 
         # inspect device properties
-        devprops = {
-            str(k): v for (k, v) in self.context.get_device().get_attributes().items()
-        }
+        devprops = {str(k): v for (k, v) in self.context.get_device().get_attributes().items()}
         self.max_threads = devprops["MAX_THREADS_PER_BLOCK"]
-        cc = str(devprops.get("COMPUTE_CAPABILITY_MAJOR", "0")) + str(
-            devprops.get("COMPUTE_CAPABILITY_MINOR", "0")
-        )
+        cc = str(devprops.get("COMPUTE_CAPABILITY_MAJOR", "0")) + str(devprops.get("COMPUTE_CAPABILITY_MINOR", "0"))
         if cc == "00":
             cc = self.context.get_device().compute_capability()
         self.cc = str(cc[0]) + str(cc[1])
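The compute-capability lookup collapsed by black above has a two-step shape: concatenate the major/minor attributes, and treat "00" as a signal that the attributes were missing so a direct device query is required. Extracted as a sketch, with the device query stubbed out as None:

```python
def compute_capability_string(devprops):
    """Concatenate major/minor compute-capability attributes; "00" means the
    attributes were absent and a direct device query would be needed."""
    cc = str(devprops.get("COMPUTE_CAPABILITY_MAJOR", "0")) + str(devprops.get("COMPUTE_CAPABILITY_MINOR", "0"))
    return None if cc == "00" else cc
```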
@@ -347,14 +343,7 @@ def run_kernel(self, func, gpu_args, threads, grid, stream=None):
         """
         if stream is None:
             stream = self.stream
-        func(
-            *gpu_args,
-            block=threads,
-            grid=grid,
-            stream=stream,
-            shared=self.smem_size,
-            texrefs=self.texrefs
-        )
+        func(*gpu_args, block=threads, grid=grid, stream=stream, shared=self.smem_size, texrefs=self.texrefs)
 
     def memset(self, allocation, value, size):
         """Set the memory in allocation to the value in value.