Distance based latency #858
This pull request was exported from Phabricator. Differential Revision: D80141665
Summary:
Now that the simnet is aware of which compute resource each ProcId maps to, when a message is sent we can look at the sender and destination ProcIds and compute the distance the message travels in order to determine its latency.
Latency is randomly sampled from a beta distribution, where the min and max for each distance are configurable.
Implementation details (follow along numbers in comments):
1. In the previous diff, when Procs were allocated, their coordinates (region, dc, zone, rack, host, gpu) were registered with the Simnet.
2. When SimTx posts a message, we can safely assume that it is a MessageEnvelope. MessageEnvelopes contain information about the sender and receiver, so we can determine which ProcIds the message is being sent between, which in turn tells us which coordinates it is being sent between.
3. We determine the distance between two coordinates by identifying the most major dimension in which they differ.
4. We create a struct called LatencyConfig which holds a distribution for sampling, as well as minimum and maximum values for each distance.
5. We use the identified distance to sample the latency for that send.
6. We pass that latency to the MessageDeliveryEvent to use as its duration.
7. The old network configuration, which was an all-to-all map of edges with latencies between nodes, has been removed along with all related structs.
8. Unit tests have been refactored so that when a particular message needs to be sent with a particular latency, we register the ProcIds with the appropriate coordinates and configure the interdistance latency.
test_allocator_registers_resources in alloc/sim.rs demonstrates that when we allocate a ProcMesh using the sim allocator, our Procs are registered as compute resources and the latencies are computed based on distance.
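To make steps 3 to 5 concrete, here is a minimal sketch of how the distance lookup and latency scaling could fit together. It is not the code from this diff: Coord, Distance, LatencyBounds, and the unit_sample parameter are hypothetical names chosen for illustration, and the draw in [0, 1] that the PR takes from a beta distribution is supplied by the caller here so that the sketch needs only the standard library.

```rust
use std::time::Duration;

/// Hypothetical coordinate of a proc, ordered from most major (region)
/// to most minor (gpu), mirroring the dimensions named in the summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Coord {
    region: u32,
    dc: u32,
    zone: u32,
    rack: u32,
    host: u32,
    gpu: u32,
}

/// Hypothetical distance classes, named after the most major dimension
/// in which two coordinates differ. `Same` means identical coordinates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Distance {
    Region,
    Dc,
    Zone,
    Rack,
    Host,
    Gpu,
    Same,
}

/// Walk the dimensions from most major to most minor and report the
/// first one in which the two coordinates differ (step 3).
fn distance(a: &Coord, b: &Coord) -> Distance {
    if a.region != b.region {
        Distance::Region
    } else if a.dc != b.dc {
        Distance::Dc
    } else if a.zone != b.zone {
        Distance::Zone
    } else if a.rack != b.rack {
        Distance::Rack
    } else if a.host != b.host {
        Distance::Host
    } else if a.gpu != b.gpu {
        Distance::Gpu
    } else {
        Distance::Same
    }
}

/// Hypothetical stand-in for the per-distance part of LatencyConfig (step 4):
/// one (min, max) pair per distance class, indexed by `Distance as usize`.
struct LatencyBounds {
    bounds: [(Duration, Duration); 7],
}

impl LatencyBounds {
    /// Scale a draw in [0, 1] into the [min, max] range for `d` (step 5).
    /// In the PR the draw comes from a beta distribution; here the caller
    /// passes it in. Out-of-range values are clamped and NaN is treated as 0.
    fn latency(&self, d: Distance, unit_sample: f64) -> Duration {
        let (min, max) = self.bounds[d as usize];
        let u = if unit_sample.is_nan() {
            0.0
        } else {
            unit_sample.clamp(0.0, 1.0)
        };
        // saturating_sub avoids a panic if a range is misconfigured (max < min).
        min + max.saturating_sub(min).mul_f64(u)
    }
}
```

With this shape, a send between two procs would call distance on their registered coordinates, draw a value in [0, 1], and pass both to latency; the resulting Duration plays the role of the value handed to the MessageDeliveryEvent in step 6. Taking the draw as a parameter also keeps tests deterministic, since a fixed draw always yields the same latency.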
a44e6bf to 21ec77b
21ec77b to 667531b
667531b to d4889c9
85d1018 to ba31c31
ba31c31 to 9ad771c
9ad771c to 8858e2d
8858e2d to bb0d755
Summary: Previously we had to use u64 for serialization reasons but those reasons no longer exist Differential Revision: D80556690
Summary: There was an open TODO to remove the global mailbox for SimClock. We don't actually even need mailboxes for sim clock and a oneshot works just fine Differential Revision: D80029571
Summary: Pull Request resolved: meta-pytorch#854 When we increase the number of actors in our simulation, it takes longer for all the events at a given time to complete, so we need to wait longer. If we wait too long, the simulation just runs slower than it needs to, so it's nice to make this configurable. In the long term we will come up with a more robust solution, but in the meantime that is not a priority. See EX528476 to understand the underlying problem the debounce is remedying. Differential Revision: D80137965 Reviewed By: pablorfb-meta
Summary: The sim allocator will now register the location (region, dc, zone, rack, host, gpu) of every ProcId upon creation with the simnet. Differential Revision: D80137963
Summary: Distance based latency, as described in the pull request summary above. Reviewed By: pablorfb-meta Differential Revision: D80141665
bb0d755 to 56fcf78
56fcf78 to f8a69db
This pull request has been merged in 112091d.