diff --git a/.gitignore b/.gitignore
index 7250c33d23..eb3d1ddab5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -50,8 +50,8 @@
# Plugins
/com.unity.ml-agents/VideoRecorder*
-# Generated doc folders
-/docs/html
+# MkDocs build output
+/site/
# Mac hidden files
*.DS_Store
diff --git a/.yamato/com.unity.ml-agents-test.yml b/.yamato/com.unity.ml-agents-test.yml
index 6345ad8608..1cda00ad8d 100644
--- a/.yamato/com.unity.ml-agents-test.yml
+++ b/.yamato/com.unity.ml-agents-test.yml
@@ -8,7 +8,7 @@ test_editors:
enableNoDefaultPackages: !!bool true
trunk_editor:
- - version: trunk
+ - version: 6000.3.0a3
testProject: DevProject
test_platforms:
diff --git a/com.unity.ml-agents/CHANGELOG.md b/com.unity.ml-agents/CHANGELOG.md
index bbfdcba4fe..48c9cccd39 100755
--- a/com.unity.ml-agents/CHANGELOG.md
+++ b/com.unity.ml-agents/CHANGELOG.md
@@ -11,11 +11,12 @@ and this project adheres to
#### com.unity.ml-agents (C#)
- Upgraded to Inference Engine 2.2.1 (#6212)
- The minimum supported Unity version was updated to 6000.0. (#6207)
-- Merge the extension package com.unity.ml-agents.extensions to the main package com.unity.ml-agents. (#6227)
+- Merged the extension package com.unity.ml-agents.extensions to the main package com.unity.ml-agents. (#6227)
### Minor Changes
#### com.unity.ml-agents (C#)
-- Remove broken sample from the package (#6230)
+- Removed broken sample from the package (#6230)
+- Moved to Unity Package documentation as the primary developer documentation. (#6232)
#### ml-agents / ml-agents-envs
- Bumped grpcio version to >=1.11.0,<=1.53.2 (#6208)
diff --git a/com.unity.ml-agents/Documentation~/Advanced-Features.md b/com.unity.ml-agents/Documentation~/Advanced-Features.md
new file mode 100644
index 0000000000..c4f44ec06f
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Advanced-Features.md
@@ -0,0 +1,16 @@
+# Advanced Features
+
+The ML-Agents Toolkit provides several advanced features that extend the core functionality and enable sophisticated use cases.
+
+
+| **Feature** | **Description** |
+|-------------------------------------------------------------|------------------------------------------------------------------------------|
+| [Custom Side Channels](Custom-SideChannels.md) | Create custom communication channels between Unity and Python. |
+| [Custom Grid Sensors](Custom-GridSensors.md) | Build specialized grid-based sensors for spatial data. |
+| [Input System Integration](InputSystem-Integration.md) | Integrate ML-Agents with Unity's Input System. |
+| [Inference Engine](Inference-Engine.md) | Deploy trained models for real-time inference. |
+| [Hugging Face Integration](Hugging-Face-Integration.md) | Connect with Hugging Face models and ecosystem. |
+| [Game Integrations](Integrations.md) | Integrate ML-Agents with specific game genres and mechanics (e.g., Match-3). |
+| [Match-3 Integration](Integrations-Match3.md) | Abstraction and tools for Match-3 board games (board, sensors, actuators). |
+| [ML-Agents Package Settings](Package-Settings.md) | Configure advanced package settings and preferences. |
+| [Unity Environment Registry](Unity-Environment-Registry.md) | Manage and register Unity environments programmatically. |
diff --git a/com.unity.ml-agents/Documentation~/Background-Machine-Learning.md b/com.unity.ml-agents/Documentation~/Background-Machine-Learning.md
new file mode 100644
index 0000000000..99fb9cfcb8
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Background-Machine-Learning.md
@@ -0,0 +1,51 @@
+# Background: Machine Learning
+
+Given that a number of users of the ML-Agents Toolkit might not have a formal machine learning background, this page provides an overview to facilitate the understanding of the ML-Agents Toolkit. However, we will not attempt to provide a thorough treatment of machine learning as there are fantastic resources online.
+
+Machine learning, a branch of artificial intelligence, focuses on learning patterns from data. The three main classes of machine learning algorithms include: unsupervised learning, supervised learning and reinforcement learning. Each class of algorithm learns from a different type of data. The following paragraphs provide an overview for each of these classes of machine learning, as well as introductory examples.
+
+## Unsupervised Learning
+
+The goal of [unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning) is to group or cluster similar items in a data set. For example, consider the players of a game. We may want to group the players depending on how engaged they are with the game. This would enable us to target different groups (e.g. for highly-engaged players we might invite them to be beta testers for new features, while for unengaged players we might email them helpful tutorials). Say that we wish to split our players into two groups. We would first define basic attributes of the players, such as the number of hours played, total money spent on in-app purchases and number of levels completed. We can then feed this data set (three attributes for every player) to an unsupervised learning algorithm where we specify the number of groups to be two. The algorithm would then split the data set of players into two groups where the players within each group would be similar to each other. Given the attributes we used to describe each player, in this case, the output would be a split of all the players into two groups, where one group would semantically represent the engaged players and the second group would semantically represent the unengaged players.
+
+With unsupervised learning, we did not provide specific examples of which players are considered engaged and which are considered unengaged. We just defined the appropriate attributes and relied on the algorithm to uncover the two groups on its own. This type of data set is typically called an unlabeled data set as it is lacking these direct labels. Consequently, unsupervised learning can be helpful in situations where these labels can be expensive or hard to produce. In the next paragraph, we overview supervised learning algorithms which accept input labels in addition to attributes.
+
+## Supervised Learning
+
+In [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning), we do not want to just group similar items but directly learn a mapping from each item to the group (or class) that it belongs to. Returning to our earlier example of clustering players, let's say we now wish to predict which of our players are about to churn (that is, stop playing the game for the next 30 days). We can look into our historical records and create a data set that contains attributes of our players in addition to a label indicating whether they have churned or not. Note that the player attributes we use for this churn prediction task may be different from the ones we used for our earlier clustering task. We can then feed this data set (attributes **and** label for each player) into a supervised learning algorithm which would learn a mapping from the player attributes to a label indicating whether that player will churn or not. The intuition is that the supervised learning algorithm will learn which values of these attributes typically correspond to players who have churned and not churned (for example, it may learn that players who spend very little and play for very short periods will most likely churn). Now given this learned model, we can provide it the attributes of a new player (one that recently started playing the game) and it would output a _predicted_ label for that player. This prediction is the algorithm's expectation of whether the player will churn or not. We can now use these predictions to target the players who are expected to churn and entice them to continue playing the game.
+
+As you may have noticed, for both supervised and unsupervised learning, there are two tasks that need to be performed: attribute selection and model selection. Attribute selection (also called feature selection) pertains to selecting how we wish to represent the entity of interest, in this case, the player. Model selection, on the other hand, pertains to selecting the algorithm (and its parameters) that perform the task well. Both of these tasks are active areas of machine learning research and, in practice, require several iterations to achieve good performance.
+
+We now switch to reinforcement learning, the third class of machine learning algorithms, and arguably the one most relevant for the ML-Agents Toolkit.
+
+## Reinforcement Learning
+
+[Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning) can be viewed as a form of learning for sequential decision making that is commonly associated with controlling robots (but is, in fact, much more general). Consider an autonomous firefighting robot that is tasked with navigating into an area, finding the fire and neutralizing it. At any given moment, the robot perceives the environment through its sensors (e.g. camera, heat, touch), processes this information and produces an action (e.g. move to the left, rotate the water hose, turn on the water). In other words, it is continuously making decisions about how to interact in this environment given its view of the world (i.e. sensors input) and objective (i.e. neutralizing the fire). Teaching a robot to be a successful firefighting machine is precisely what reinforcement learning is designed to do.
+
+More specifically, the goal of reinforcement learning is to learn a **policy**, which is essentially a mapping from **observations** to **actions**. An observation is what the robot can measure from its **environment** (in this case, all its sensory inputs) and an action, in its most raw form, is a change to the configuration of the robot (e.g. position of its base, position of its water hose and whether the hose is on or off).
+
+The last remaining piece of the reinforcement learning task is the **reward signal**. The robot is trained to learn a policy that maximizes its overall rewards. When training a robot to be a mean firefighting machine, we provide it with rewards (positive and negative) indicating how well it is doing on completing the task. Note that the robot does not _know_ how to put out fires before it is trained. It learns the objective because it receives a large positive reward when it puts out the fire and a small negative reward for every passing second. The fact that rewards are sparse (i.e. may not be provided at every step, but only when a robot arrives at a success or failure situation), is a defining characteristic of reinforcement learning and precisely why learning good policies can be difficult (and/or time-consuming) for complex environments.
+
+
+Learning a policy usually requires many trials and iterative policy updates. More specifically, the robot is placed in several fire situations and over time learns an optimal policy which allows it to put out fires more effectively. Obviously, we cannot expect to train a robot repeatedly in the real world, particularly when fires are involved. This is precisely why the use of Unity as a simulator serves as the perfect training grounds for learning such behaviors. While our discussion of reinforcement learning has centered around robots, there are strong parallels between robots and characters in a game. In fact, in many ways, one can view a non-playable character (NPC) as a virtual robot, with its own observations about the environment, its own set of actions and a specific objective. Thus it is natural to explore how we can train behaviors within Unity using reinforcement learning. This is precisely what the ML-Agents Toolkit offers. The video linked below includes a reinforcement learning demo showcasing training character behaviors using the ML-Agents Toolkit.
+
+
+
+Similar to both unsupervised and supervised learning, reinforcement learning also involves two tasks: attribute selection and model selection. Attribute selection is defining the set of observations for the robot that best help it complete its objective, while model selection is defining the form of the policy (mapping from observations to actions) and its parameters. In practice, training behaviors is an iterative process that may require changing the attribute and model choices.
+
+## Training and Inference
+
+One common aspect of all three branches of machine learning is that they all involve a **training phase** and an **inference phase**. While the details of the training and inference phases are different for each of the three, at a high-level, the training phase involves building a model using the provided data, while the inference phase involves applying this model to new, previously unseen, data. More specifically:
+
+- For our unsupervised learning example, the training phase learns the optimal two clusters based on the data describing existing players, while the inference phase assigns a new player to one of these two clusters.
+- For our supervised learning example, the training phase learns the mapping from player attributes to player label (whether they churned or not), and the inference phase predicts whether a new player will churn or not based on that learned mapping.
+- For our reinforcement learning example, the training phase learns the optimal policy through guided trials, and in the inference phase, the agent observes and takes actions in the wild using its learned policy.
+
+To briefly summarize: all three classes of algorithms involve training and inference phases in addition to attribute and model selections. What ultimately separates them is the type of data available to learn from. In unsupervised learning our data set was a collection of attributes, in supervised learning our data set was a collection of attribute-label pairs, and, lastly, in reinforcement learning our data set was a collection of observation-action-reward tuples.
+
+## Deep Learning
+
+[Deep learning](https://en.wikipedia.org/wiki/Deep_learning) is a family of algorithms that can be used to address any of the problems introduced above. More specifically, they can be used to solve both attribute and model selection tasks. Deep learning has gained popularity in recent years due to its outstanding performance on several challenging machine learning tasks. One example is [AlphaGo](https://en.wikipedia.org/wiki/AlphaGo), a [computer Go](https://en.wikipedia.org/wiki/Computer_Go) program that leverages deep learning and was able to beat Lee Sedol (a Go world champion).
+
+A key characteristic of deep learning algorithms is their ability to learn very complex functions from large amounts of training data. This makes them a natural choice for reinforcement learning tasks when a large amount of data can be generated, say through the use of a simulator or engine such as Unity. By generating hundreds of thousands of simulations of the environment within Unity, we can learn policies for very complex environments (a complex environment is one where the number of observations an agent perceives and the number of actions they can take are large). Many of the algorithms we provide in ML-Agents use some form of deep learning, built on top of the open-source library, [PyTorch](Background-PyTorch.md).
diff --git a/com.unity.ml-agents/Documentation~/Background-PyTorch.md b/com.unity.ml-agents/Documentation~/Background-PyTorch.md
new file mode 100644
index 0000000000..33253a77d1
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Background-PyTorch.md
@@ -0,0 +1,11 @@
+# Background: PyTorch
+
+As discussed in our [machine learning background page](Background-Machine-Learning.md), many of the algorithms we provide in the ML-Agents Toolkit leverage some form of deep learning. More specifically, our implementations are built on top of the open-source library [PyTorch](https://pytorch.org/). In this page we provide a brief overview of PyTorch and TensorBoard that we leverage within the ML-Agents Toolkit.
+
+## PyTorch
+
+[PyTorch](https://pytorch.org/) is an open source library for performing computations using data flow graphs, the underlying representation of deep learning models. It facilitates training and inference on CPUs and GPUs in a desktop, server, or mobile device. Within the ML-Agents Toolkit, when you train the behavior of an agent, the output is a model (.onnx) file that you can then associate with an Agent. Unless you implement a new algorithm, the use of PyTorch is mostly abstracted away and behind the scenes.
+
+## TensorBoard
+
+One component of training models with PyTorch is setting the values of certain model attributes (called _hyperparameters_). Finding the right values of these hyperparameters can require a few iterations. Consequently, we leverage a visualization tool called [TensorBoard](https://www.tensorflow.org/tensorboard). It allows the visualization of certain agent attributes (e.g. reward) throughout training which can be helpful in both building intuitions for the different hyperparameters and setting the optimal values for your Unity environment. We provide more details on setting the hyperparameters in the [Training ML-Agents](Training-ML-Agents.md) page. If you are unfamiliar with TensorBoard we recommend our guide on [using TensorBoard with ML-Agents](Using-Tensorboard.md) or this [tutorial](https://github.com/dandelionmane/tf-dev-summit-tensorboard-tutorial).
\ No newline at end of file
diff --git a/com.unity.ml-agents/Documentation~/Background-Unity.md b/com.unity.ml-agents/Documentation~/Background-Unity.md
new file mode 100644
index 0000000000..8bc78b21e2
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Background-Unity.md
@@ -0,0 +1,14 @@
+# Background: Unity
+
+If you are not familiar with the [Unity Engine](https://unity3d.com/unity), we highly recommend the [Unity Manual](https://docs.unity3d.com/Manual/index.html) and [Tutorials page](https://unity3d.com/learn/tutorials). The [Roll-a-ball tutorial](https://learn.unity.com/project/roll-a-ball) is a fantastic resource to learn all the basic concepts of Unity to get started with the ML-Agents Toolkit:
+
+- [Editor](https://docs.unity3d.com/Manual/sprite/sprite-editor/use-editor.html)
+- [Scene](https://docs.unity3d.com/Manual/CreatingScenes.html)
+- [GameObject](https://docs.unity3d.com/Manual/GameObjects.html)
+- [Rigidbody](https://docs.unity3d.com/ScriptReference/Rigidbody.html)
+- [Camera](https://docs.unity3d.com/Manual/Cameras.html)
+- [Scripting](https://docs.unity3d.com/Manual/ScriptingSection.html)
+- [Physics](https://docs.unity3d.com/Manual/PhysicsSection.html)
+- [Ordering of event functions](https://docs.unity3d.com/Manual/ExecutionOrder.html)
+  (e.g. FixedUpdate, Update)
+- [Prefabs](https://docs.unity3d.com/Manual/Prefabs.html)
diff --git a/com.unity.ml-agents/Documentation~/Background.md b/com.unity.ml-agents/Documentation~/Background.md
new file mode 100644
index 0000000000..44c467dc83
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Background.md
@@ -0,0 +1,11 @@
+# Background
+
+This section provides foundational knowledge to help you understand the technologies and concepts that power the ML-Agents Toolkit.
+
+| **Topic** | **Description** |
+|-----------------------------------------------------------|-------------------------------------------------------------------------------|
+| [Machine Learning](Background-Machine-Learning.md) | Introduction to ML concepts, reinforcement learning, and training principles. |
+| [Unity](Background-Unity.md) | Unity fundamentals for ML-Agents development and environment creation. |
+| [PyTorch](Background-PyTorch.md) | PyTorch basics for understanding the training pipeline and neural networks. |
+| [Using Virtual Environment](Using-Virtual-Environment.md) | Setting up and managing Python virtual environments for ML-Agents. |
+| [ELO Rating System](ELO-Rating-System.md) | Understanding ELO rating system for multi-agent training evaluation. |
diff --git a/com.unity.ml-agents/Documentation~/Blog-posts.md b/com.unity.ml-agents/Documentation~/Blog-posts.md
new file mode 100644
index 0000000000..c0fa4d2bf0
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Blog-posts.md
@@ -0,0 +1,18 @@
+We have published a series of blog posts that are relevant for ML-Agents:
+
+- (July 12, 2021)
+ [ML-Agents plays Dodgeball](https://blog.unity.com/technology/ml-agents-plays-dodgeball)
+- (May 5, 2021)
+ [ML-Agents v2.0 release: Now supports training complex cooperative behaviors](https://blogs.unity3d.com/2021/05/05/ml-agents-v2-0-release-now-supports-training-complex-cooperative-behaviors/)
+- (November 20, 2020)
+ [How Eidos-Montréal created Grid Sensors to improve observations for training agents](https://www.eidosmontreal.com/news/the-grid-sensor-for-automated-game-testing/)
+- (February 28, 2020)
+ [Training intelligent adversaries using self-play with ML-Agents](https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/)
+- (November 11, 2019)
+ [Training your agents 7 times faster with ML-Agents](https://blogs.unity3d.com/2019/11/11/training-your-agents-7-times-faster-with-ml-agents/)
+- (June 26, 2018)
+ [Solving sparse-reward tasks with Curiosity](https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/)
+- (June 19, 2018)
+ [Unity ML-Agents Toolkit v0.4 and Udacity Deep Reinforcement Learning Nanodegree](https://github.com/udacity/deep-reinforcement-learning)
+- (September 19, 2017)
+ [Introducing: Unity Machine Learning Agents Toolkit](https://blogs.unity3d.com/2017/09/19/introducing-unity-machine-learning-agents/)
diff --git a/com.unity.ml-agents/Documentation~/CONTRIBUTING.md b/com.unity.ml-agents/Documentation~/CONTRIBUTING.md
new file mode 100644
index 0000000000..6629acde5f
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/CONTRIBUTING.md
@@ -0,0 +1,36 @@
+# How to Contribute to ML-Agents
+
+## 1. Fork the repository
+Fork the ML-Agents repository by clicking on the "Fork" button in the top right corner of the GitHub page. This creates a copy of the repository under your GitHub account.
+
+## 2. Set up your development environment
+Clone the forked repository to your local machine using Git. Install the necessary dependencies and follow the instructions provided in the project's documentation to set up your development environment properly.
+
+## 3. Choose an issue or feature
+Browse the project's issue tracker or discussions to find an open issue or feature that you would like to contribute to. Read the guidelines and comments associated with the issue to understand the requirements and constraints.
+
+## 4. Make your changes
+Create a new branch for your changes based on the main branch of the ML-Agents repository. Implement your code changes or add new features as necessary. Ensure that your code follows the project's coding style and conventions.
+
+* Example: Let's say you want to add support for a new type of reward function in the ML-Agents framework. You can create a new branch named feature/reward-function to implement this feature.
+
+## 5. Test your changes
+Run the appropriate tests to ensure your changes work as intended. If necessary, add new tests to cover your code and verify that it doesn't introduce regressions.
+
+* Example: For the reward function feature, you would write tests to check different scenarios and expected outcomes of the new reward function.
+
+## 6. Submit a pull request
+Push your branch to your forked repository and submit a pull request (PR) to the ML-Agents main repository. Provide a clear and concise description of your changes, explaining the problem you solved or the feature you added.
+
+* Example: In the pull request description, you would explain how the new reward function works, its benefits, and any relevant implementation details.
+
+## 7. Respond to feedback
+Be responsive to any feedback or comments provided by the project maintainers. Address the feedback by making necessary revisions to your code and continue the discussion if required.
+
+## 8. Continuous integration and code review
+The ML-Agents project utilizes automated continuous integration (CI) systems to run tests on pull requests. Address any issues flagged by the CI system and actively participate in the code review process by addressing comments from reviewers.
+
+## 9. Merge your changes
+Once your pull request has been approved and meets all the project's requirements, a project maintainer will merge your changes into the main repository. Congratulations, your contribution has been successfully integrated!
+
+**Remember to always adhere to the project's code of conduct, be respectful, and follow any specific contribution guidelines provided by the ML-Agents project. Happy contributing!**
diff --git a/com.unity.ml-agents/Documentation~/Cloud-Deployment.md b/com.unity.ml-agents/Documentation~/Cloud-Deployment.md
new file mode 100644
index 0000000000..510ae31870
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Cloud-Deployment.md
@@ -0,0 +1,12 @@
+# Cloud & Deployment (deprecated)
+
+
+This section contains legacy documentation for deploying ML-Agents training in cloud environments. While these approaches may still work, they are no longer actively maintained or recommended.
+
+
+| **Platform** | **Description** |
+|----------------------------------------------------------|------------------------------------------------------|
+| [Using Docker](Using-Docker.md) | Containerized deployment with Docker (deprecated). |
+| [Amazon Web Services](Training-on-Amazon-Web-Service.md) | Training on AWS cloud infrastructure (deprecated). |
+| [Microsoft Azure](Training-on-Microsoft-Azure.md) | Training on Azure cloud services (deprecated). |
+
diff --git a/com.unity.ml-agents/Documentation~/Custom-GridSensors.md b/com.unity.ml-agents/Documentation~/Custom-GridSensors.md
new file mode 100644
index 0000000000..777e1f8896
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Custom-GridSensors.md
@@ -0,0 +1,69 @@
+# Custom Grid Sensors
+
+The Grid Sensor provides a 2D observation that detects objects around an agent from a top-down view. Compared to RayCasts, it receives a full observation in a grid area without gaps, and the detection is not blocked by objects around the agents. This gives a more granular view while requiring a higher usage of compute resources.
+
+One extra feature with Grid Sensors is that you can derive from the Grid Sensor base class to collect custom data besides the object tags, to include custom attributes as observations. This allows more flexibility for the use of GridSensor.
+
+## Creating Custom Grid Sensors
+To create a custom grid sensor, you'll need to derive from two classes: `GridSensorBase` and `GridSensorComponent`.
+
+## Deriving from `GridSensorBase`
+This is the implementation of your sensor. It defines how your sensor processes detected colliders, what the data looks like, and how the observations are constructed from the detected objects. Consider overriding the following methods depending on your use case (a minimal sketch is shown at the end of this section):
+* `protected virtual int GetCellObservationSize()`: Return the observation size per cell. Defaults to `1`.
+* `protected virtual void GetObjectData(GameObject detectedObject, int tagIndex, float[] dataBuffer)`: Constructs observations from the detected object. The input provides the detected GameObject and the index of its tag (0-indexed). The observations should be written to the given `dataBuffer` and the buffer size is defined in `GetCellObservationSize()`. This data will be gathered from each cell and sent to the trainer as observation.
+* `protected virtual bool IsDataNormalized()`: Return whether the observation is normalized to 0~1. This affects whether you're able to use compressed observations, as compressed data only supports normalized data. Return `true` if all the values written in `GetObjectData` are within the range of (0, 1), otherwise return `false`. Defaults to `false`.
+
+There might be cases when your data is not in the range of (0, 1) but you still wish to use compressed data to speed up training. If your data is naturally bounded within a range, normalize your data first to the possible range and fill the buffer with normalized data. For example, since the angle of rotation is bounded within `0 ~ 360`, record an angle `x` as `x/360` instead of `x`. If your data value is not bounded (position, velocity, etc.), consider setting a reasonable min/max value and use that to normalize your data.
+* `protected internal virtual ProcessCollidersMethod GetProcessCollidersMethod()`: Return the method to process colliders detected in a cell. This defines the sensor behavior when multiple objects with detectable tags are detected within a cell.
+  Currently, two methods are provided:
+  * `ProcessCollidersMethod.ProcessClosestColliders` (Default): Process the closest collider to the agent. In this case each cell's data is represented by one object.
+  * `ProcessCollidersMethod.ProcessAllColliders`: Process all detected colliders. This is useful when the data from each cell is additive, for instance, the count of detected objects in a cell. When using this option, the input `dataBuffer` in `GetObjectData()` will contain processed data from other colliders detected in the cell. You'll more likely want to add/subtract values from the buffer instead of overwriting it completely.
+
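+As an illustration of the overrides above, here is a minimal sketch of a custom sensor that records a detected object's Y rotation, normalized to (0, 1) as suggested in the note on normalization. The constructor shown here is an assumption that simply forwards the standard `GridSensorBase` arguments; check the base class for the exact parameter list.
+
+```csharp
+using UnityEngine;
+using Unity.MLAgents.Sensors;
+
+public class RotationGridSensor : GridSensorBase
+{
+    // Assumed constructor: forwards the usual GridSensorBase arguments unchanged.
+    public RotationGridSensor(string name, Vector3 cellScale, Vector3Int gridSize, string[] detectableTags, SensorCompressionType compression)
+        : base(name, cellScale, gridSize, detectableTags, compression)
+    {
+    }
+
+    protected override int GetCellObservationSize()
+    {
+        // One value per cell: the normalized rotation angle.
+        return 1;
+    }
+
+    protected override bool IsDataNormalized()
+    {
+        // The angle is divided by 360, so every value written below is within (0, 1).
+        return true;
+    }
+
+    protected override void GetObjectData(GameObject detectedObject, int tagIndex, float[] dataBuffer)
+    {
+        // Record the detected object's Y rotation, normalized as recommended above.
+        dataBuffer[0] = detectedObject.transform.rotation.eulerAngles.y / 360f;
+    }
+}
+```
+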
+## Deriving from `GridSensorComponent`
+To create your sensor component, derive from `GridSensorComponent` and override `GetGridSensors()` to return an array of the grid sensors you want the component to use. This can be used to create multiple different customized grid sensors, or you can also include the ones provided in our package (listed in the next section).
+
+Example:
+```csharp
+public class CustomGridSensorComponent : GridSensorComponent
+{
+    protected override GridSensorBase[] GetGridSensors()
+    {
+        return new GridSensorBase[] { new CustomGridSensor(...) };
+    }
+}
+```
+
+## Grid Sensor Types
+Here we list the two types of grid sensor provided in the package: `OneHotGridSensor` and `CountingGridSensor`. Their implementations are also a good reference for making your own.
+
+### OneHotGridSensor
+This is the default sensor used by `GridSensorComponent`. It detects objects with detectable tags and the observation is the one-hot representation of the detected tag index.
+
+The implementation of the sensor is defined as follows:
+* `GetCellObservationSize()`: `detectableTags.Length`
+* `IsDataNormalized()`: `true`
+* `ProcessCollidersMethod()`: `ProcessCollidersMethod.ProcessClosestColliders`
+* `GetObjectData()`:
+
+```csharp
+protected override void GetObjectData(GameObject detectedObject, int tagIndex, float[] dataBuffer)
+{
+    dataBuffer[tagIndex] = 1;
+}
+```
+
+### CountingGridSensor
+This is an example of using all colliders detected in a cell. It counts the number of objects detected for each detectable tag. The sensor cannot be used with data compression.
+
+The implementation of the sensor is defined as follows:
+* `GetCellObservationSize()`: `detectableTags.Length`
+* `IsDataNormalized()`: `false`
+* `ProcessCollidersMethod()`: `ProcessCollidersMethod.ProcessAllColliders`
+* `GetObjectData()`:
+
+```csharp
+protected override void GetObjectData(GameObject detectedObject, int tagIndex, float[] dataBuffer)
+{
+    dataBuffer[tagIndex] += 1;
+}
+```
diff --git a/com.unity.ml-agents/Documentation~/Custom-SideChannels.md b/com.unity.ml-agents/Documentation~/Custom-SideChannels.md
new file mode 100644
index 0000000000..2cead31c6d
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Custom-SideChannels.md
@@ -0,0 +1,187 @@
+# Custom Side Channels
+
+You can create your own side channel in C# and Python and use it to communicate custom data structures between the two. This can be useful for situations in which the data to be sent is too complex or structured for the built-in `EnvironmentParameters`, or is not related to any specific agent, and therefore inappropriate as an agent observation.
+
+## Overview
+
+In order to use a side channel, it must be implemented as both Unity and Python classes.
+
+### Unity side
+
+The side channel will have to implement the `SideChannel` abstract class and the following method.
+
+- `OnMessageReceived(IncomingMessage msg)` : You must implement this method and read the data from IncomingMessage. The data must be read in the order that it was written.
+
+The side channel must also assign a `ChannelId` property in the constructor. The `ChannelId` is a Guid (or UUID in Python) used to uniquely identify a side channel. This Guid must be the same on C# and Python. There can only be one side channel of a certain id during communication.
+
+To send data from C# to Python, create an `OutgoingMessage` instance, add data to it, call the `base.QueueMessageToSend(msg)` method inside the side channel, and call the `OutgoingMessage.Dispose()` method.
+
+To register a side channel on the Unity side, call `SideChannelManager.RegisterSideChannel` with the side channel as the only argument.
+
+### Python side
+
+The side channel will have to implement the `SideChannel` abstract class. You must implement:
+
+- `on_message_received(self, msg: "IncomingMessage") -> None` : You must implement this method and read the data from IncomingMessage. The data must be read in the order that it was written.
+
+The side channel must also assign a `channel_id` property in the constructor. The `channel_id` is a UUID (referred to in C# as a Guid) used to uniquely identify a side channel. This UUID must be the same on C# and Python. There can only be one side channel of a certain id during communication.
+
+To assign the `channel_id` call the abstract class constructor with the appropriate `channel_id` as follows:
+
+```python
+super().__init__(my_channel_id)
+```
+
+To send a byte array from Python to C#, create an `OutgoingMessage` instance, add data to it, and call the `super().queue_message_to_send(msg)` method inside the side channel.
+
+To register a side channel on the Python side, pass the side channel as an argument when creating the `UnityEnvironment` object. One of the arguments of the constructor (`side_channels`) is a list of side channels.
+
+## Example implementation
+
+Below is a simple implementation of a side channel that will exchange ASCII encoded strings between a Unity environment and Python.
+
+### Example Unity C# code
+
+The first step is to create the `StringLogSideChannel` class within the Unity project. Here is an implementation of a `StringLogSideChannel` that will listen for messages from Python and print them to the Unity debug log, as well as send error messages from Unity to Python.
+
+```csharp
+using UnityEngine;
+using Unity.MLAgents;
+using Unity.MLAgents.SideChannels;
+using System.Text;
+using System;
+
+public class StringLogSideChannel : SideChannel
+{
+    public StringLogSideChannel()
+    {
+        ChannelId = new Guid("621f0a70-4f87-11ea-a6bf-784f4387d1f7");
+    }
+
+    protected override void OnMessageReceived(IncomingMessage msg)
+    {
+        var receivedString = msg.ReadString();
+        Debug.Log("From Python : " + receivedString);
+    }
+
+    public void SendDebugStatementToPython(string logString, string stackTrace, LogType type)
+    {
+        if (type == LogType.Error)
+        {
+            var stringToSend = type.ToString() + ": " + logString + "\n" + stackTrace;
+            using (var msgOut = new OutgoingMessage())
+            {
+                msgOut.WriteString(stringToSend);
+                QueueMessageToSend(msgOut);
+            }
+        }
+    }
+}
+```
+
+Once we have defined our custom side channel class, we need to ensure that it is instantiated and registered. This can typically be done wherever it makes sense to associate the side channel's logic, for example on a MonoBehaviour object that might need to access data from the side channel. Here we show a simple MonoBehaviour object which instantiates and registers the new side channel. If you have not done it already, make sure that the MonoBehaviour which registers the side channel is attached to a GameObject which will be live in your Unity scene.
+
+```csharp
+using UnityEngine;
+using Unity.MLAgents;
+
+
+public class RegisterStringLogSideChannel : MonoBehaviour
+{
+
+    StringLogSideChannel stringChannel;
+
+    public void Awake()
+    {
+        // We create the Side Channel
+        stringChannel = new StringLogSideChannel();
+
+        // When a Debug.Log message is created, we send it to the stringChannel
+        Application.logMessageReceived += stringChannel.SendDebugStatementToPython;
+
+        // The channel must be registered with the SideChannelManager class
+        SideChannelManager.RegisterSideChannel(stringChannel);
+    }
+
+    public void OnDestroy()
+    {
+        // De-register the Debug.Log callback
+        Application.logMessageReceived -= stringChannel.SendDebugStatementToPython;
+        if (Academy.IsInitialized)
+        {
+            SideChannelManager.UnregisterSideChannel(stringChannel);
+        }
+    }
+
+    public void Update()
+    {
+        // Optional: if the space bar is pressed, raise an error!
+        if (Input.GetKeyDown(KeyCode.Space))
+        {
+            Debug.LogError("This is a fake error. Space bar was pressed in Unity.");
+        }
+    }
+}
+```
+
+### Example Python code
+
+Now that we have created the necessary Unity C# classes, we can create their Python counterparts.
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.side_channel.side_channel import (
+ SideChannel,
+ IncomingMessage,
+ OutgoingMessage,
+)
+import numpy as np
+import uuid
+
+
+# Create the StringLogChannel class
+class StringLogChannel(SideChannel):
+
+    def __init__(self) -> None:
+        super().__init__(uuid.UUID("621f0a70-4f87-11ea-a6bf-784f4387d1f7"))
+
+    def on_message_received(self, msg: IncomingMessage) -> None:
+        """
+        Note: We must implement this method of the SideChannel interface to
+        receive messages from Unity
+        """
+        # We simply read a string from the message and print it.
+        print(msg.read_string())
+
+    def send_string(self, data: str) -> None:
+        # Add the string to an OutgoingMessage
+        msg = OutgoingMessage()
+        msg.write_string(data)
+        # We call this method to queue the data we want to send
+        super().queue_message_to_send(msg)
+```
+
+We can then instantiate the new side channel, launch a `UnityEnvironment` with that side channel active, and send a series of messages to the Unity environment from Python using it.
+
+```python
+# Create the channel
+string_log = StringLogChannel()
+
+# We start the communication with the Unity Editor and pass the string_log side channel as input
+env = UnityEnvironment(side_channels=[string_log])
+env.reset()
+string_log.send_string("The environment was reset")
+
+group_name = list(env.behavior_specs.keys())[0] # Get the first group_name
+group_spec = env.behavior_specs[group_name]
+for i in range(1000):
+    decision_steps, terminal_steps = env.get_steps(group_name)
+    # We send data to Unity: a string with the number of agents at each step
+    string_log.send_string(
+        f"Step {i} occurred with {len(decision_steps)} deciding agents and "
+        f"{len(terminal_steps)} terminal agents"
+    )
+    env.step()  # Move the simulation forward
+
+env.close()
+```
+
+Now, if you run this script and press `Play` in the Unity Editor when prompted, the console in the Unity Editor will display a message at every Python step. Additionally, if you press the Space Bar in the Unity Editor, a message will appear in the terminal.
diff --git a/com.unity.ml-agents/Documentation~/ELO-Rating-System.md b/com.unity.ml-agents/Documentation~/ELO-Rating-System.md
new file mode 100644
index 0000000000..bd2d7de496
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/ELO-Rating-System.md
@@ -0,0 +1,56 @@
+# ELO Rating System
+In adversarial games, the cumulative environment reward may **not be a meaningful metric** by which to track learning progress.
+
+This is because the cumulative reward is **entirely dependent on the skill of the opponent**.
+
+An agent at a particular skill level will get more or less reward against a worse or better agent, respectively.
+
+Instead, it's better to use the ELO rating system, a method to calculate **the relative skill level between two players in a zero-sum game**.
+
+If the training performs correctly, **this value should steadily increase**.
+
+## What is a zero-sum game?
+A zero-sum game is a game where **each player's gain or loss of utility is exactly balanced by the gain or loss of the utility of the opponent**.
+
+Simply explained, we face a zero-sum game **when one agent gets a +1.0 reward and its opponent gets a -1.0 reward**.
+
+For instance, Tennis is a zero-sum game: if you win the point, you get a +1.0 reward and your opponent gets a -1.0 reward.
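+
+As a hedged sketch of what this means in ML-Agents terms (the class and method names here are hypothetical, not package API), the point could be resolved like this:
+
+```csharp
+using Unity.MLAgents;
+
+public static class TennisPointResolver
+{
+    // Zero-sum reward assignment: the winner's gain exactly balances the loser's loss.
+    public static void ResolvePoint(Agent winner, Agent loser)
+    {
+        winner.SetReward(1f);
+        loser.SetReward(-1f);
+        winner.EndEpisode();
+        loser.EndEpisode();
+    }
+}
+```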
+
+## How the ELO Rating System Works
+- Each player **has an initial ELO score**. It's defined in the `initial_elo` trainer config hyperparameter.
+
+- The **difference in rating between the two players** serves as the predictor of the outcomes of a match.
+
+
+*For instance, if player A has an ELO score of 2100 and player B has an ELO score of 1800, the chance that player A wins is 85% against 15% for player B.*
+
+- We calculate the **expected score of each player** using this formula:
+
+  `Ea = 1 / (1 + 10^((Rb - Ra) / 400))`, and symmetrically for `Eb`.
+
+- At the end of the game, based on the outcome, **we update each player’s actual ELO score** using a linear adjustment proportional to the amount by which the player over-performed or under-performed.
+The winning player takes points from the losing one:
+  - If the *higher-rated player wins* → **a few points** will be taken from the lower-rated player.
+  - If the *lower-rated player wins* → **a lot of points** will be taken from the higher-rated player.
+  - If it’s *a draw* → the lower-rated player gains **a few points** from the higher-rated one.
+
+- We update each player's rating using this formula:
+
+  `Ra' = Ra + K * (Sa - Ea)`, where `K` is the update factor (16 in the example below) and `Sa` is the actual score (1 for a win, 0.5 for a draw, 0 for a loss).
+
+### The Tennis example
+
+- We start to train our agents.
+- Both of them have the same skills, so each starts with the ELO score defined by the `initial_elo = 1200.0` parameter.
+
+We calculate the expected score E: Ea = 0.5 and Eb = 0.5
+
+This means that each player has a 50% chance of winning the point.
+
+If A wins, the new rating R would be:
+
+Ra = 1200 + 16 * (1 - 0.5) → 1208
+
+Rb = 1200 + 16 * (0 - 0.5) → 1192
+
+Player A now has an ELO score of 1208 and Player B an ELO score of 1192. Therefore, Player A is now a little bit **better than Player B**.
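+
+For reference, here is a small standalone sketch (not part of the ML-Agents API) that reproduces the numbers above:
+
+```csharp
+using System;
+
+public static class EloExample
+{
+    const double K = 16.0;  // update factor used in the example above
+
+    static double ExpectedScore(double ratingA, double ratingB)
+    {
+        return 1.0 / (1.0 + Math.Pow(10.0, (ratingB - ratingA) / 400.0));
+    }
+
+    public static void Main()
+    {
+        double ra = 1200.0, rb = 1200.0;    // initial_elo
+        double ea = ExpectedScore(ra, rb);  // 0.5
+        double eb = ExpectedScore(rb, ra);  // 0.5
+
+        // Player A wins the point: actual scores are 1 for A, 0 for B.
+        ra += K * (1.0 - ea);               // 1208
+        rb += K * (0.0 - eb);               // 1192
+
+        Console.WriteLine($"Ra = {ra}, Rb = {rb}");
+    }
+}
+```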
diff --git a/com.unity.ml-agents/Documentation~/FAQ.md b/com.unity.ml-agents/Documentation~/FAQ.md
new file mode 100644
index 0000000000..368fac2b2c
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/FAQ.md
@@ -0,0 +1,53 @@
+# Frequently Asked Questions
+
+## Installation problems
+
+## Environment Permission Error
+
+If you directly import your Unity environment without building it in the editor, you might need to give it additional permissions to execute it.
+
+If you receive such a permission error on macOS, run:
+
+```sh
+chmod -R 755 *.app
+```
+
+or on Linux:
+
+```sh
+chmod -R 755 *.x86_64
+```
+
+On Windows, you can find [instructions](https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/cc754344(v=ws.11)).
+
+## Environment Connection Timeout
+
+If you are able to launch the environment from `UnityEnvironment` but then receive a timeout error like this:
+
+```
+UnityAgentsException: The Communicator was unable to connect. Please make sure the External process is ready to accept communication with Unity.
+```
+
+There may be a number of possible causes:
+
+- _Cause_: There may be no agent in the scene
+- _Cause_: On OSX, the firewall may be preventing communication with the environment. _Solution_: Add the built environment binary to the list of exceptions on the firewall by following [instructions](https://support.apple.com/en-us/HT201642).
+- _Cause_: An error happened in the Unity Environment preventing communication. _Solution_: Look into the [log files](https://docs.unity3d.com/Manual/LogFiles.html) generated by the Unity Environment to figure out what error happened.
+- _Cause_: You have assigned `HTTP_PROXY` and `HTTPS_PROXY` values in your environment variables. _Solution_: Remove these values and try again.
+- _Cause_: You are running in a headless environment (e.g. remotely connected to a server). _Solution_: Pass `--no-graphics` to `mlagents-learn`, or `no_graphics=True` to `RemoteRegistryEntry.make()` or the `UnityEnvironment` initializer. If you need graphics for visual observations, you will need to set up `xvfb` (or equivalent).
+
+## Communication port {} still in use
+
+If you receive an exception `"Couldn't launch new environment because communication port {} is still in use. "`, you can change the worker number in the Python script when calling
+
+```python
+UnityEnvironment(file_name=filename, worker_id=X)
+```
+
+## Mean reward : nan
+
+If you receive a message `Mean reward : nan` when attempting to train a model using PPO, this is due to the episodes of the Learning Environment not terminating. In order to address this, set `Max Steps` for the Agents within the Scene Inspector to a value greater than 0. Alternatively, it is possible to manually set `done` conditions for episodes from within scripts for custom episode-terminating events.
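+
+For reference, a minimal sketch of both options is shown below (the agent subclass and method names are hypothetical):
+
+```csharp
+using Unity.MLAgents;
+
+public class MyAgent : Agent
+{
+    public override void Initialize()
+    {
+        // Equivalent to setting "Max Steps" in the Inspector: the episode ends
+        // automatically once this many steps have elapsed.
+        MaxStep = 5000;
+    }
+
+    // Alternatively, end the episode manually when a custom condition is met.
+    void OnReachedGoal()
+    {
+        AddReward(1f);
+        EndEpisode();
+    }
+}
+```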
+
+## "File name" cannot be opened because the developer cannot be verified.
+
+If you have downloaded the repository using the GitHub website on macOS 10.15 (Catalina) or later, you may see this error when attempting to play scenes in the Unity project. Workarounds include installing the package using the Unity Package Manager (this is the officially supported approach - see [here](Installation.md)), or following the instructions [here](https://support.apple.com/en-us/HT202491) to verify the relevant files on your machine on a file-by-file basis.
diff --git a/com.unity.ml-agents/Documentation~/Get-Started.md b/com.unity.ml-agents/Documentation~/Get-Started.md
new file mode 100644
index 0000000000..210c70fc09
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Get-Started.md
@@ -0,0 +1,16 @@
+# Get started
+The ML-Agents Toolkit contains several main components:
+
+- Unity package `com.unity.ml-agents` contains the Unity C# SDK that will be integrated into your Unity project.
+- Two Python packages:
+  - `mlagents` contains the machine learning algorithms that enable you to train behaviors in your Unity scene. Most users of ML-Agents will only need to directly install `mlagents`.
+  - `mlagents_envs` contains a set of Python APIs to interact with a Unity scene. It is a foundational layer that facilitates data messaging between the Unity scene and the Python machine learning algorithms. Consequently, `mlagents` depends on `mlagents_envs`.
+- Unity [Project](https://github.com/Unity-Technologies/ml-agents/tree/main/Project/Assets/ML-Agents/Examples) that contains several [example environments](Learning-Environment-Examples.md) that highlight the various features of the toolkit to help you get started.
+
+Use the following topics to get started with ML-Agents.
+
+| **Section** | **Description** |
+|---------------------------------------------------------------|------------------------------------|
+| [Install ML-Agents](Installation.md) | Install ML-Agents. |
+| [Sample: Running an Example Environment](Sample.md) | Learn how to run a sample project. |
+| [More Example Environments](Learning-Environment-Examples.md) | Explore 17+ sample environments. |
diff --git a/com.unity.ml-agents/Documentation~/Glossary.md b/com.unity.ml-agents/Documentation~/Glossary.md
new file mode 100644
index 0000000000..f2be15bc2a
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Glossary.md
@@ -0,0 +1,19 @@
+# ML-Agents Toolkit Glossary
+
+- **Academy** - Singleton object which controls timing, reset, and training/inference settings of the environment.
+- **Action** - The carrying-out of a decision on the part of an agent within the environment.
+- **Agent** - Unity Component which produces observations and takes actions in the environment. An Agent's actions are determined by decisions produced by a Policy.
+- **Decision** - The specification produced by a Policy for an action to be carried out given an observation.
+- **Editor** - The Unity Editor, which may include any pane (e.g. Hierarchy, Scene, Inspector).
+- **Environment** - The Unity scene which contains Agents.
+- **Experience** - Corresponds to a tuple of [Agent observations, actions, rewards] of a single Agent obtained after a Step.
+- **External Coordinator** - ML-Agents class responsible for communication with outside processes (in this case, the Python API).
+- **FixedUpdate** - Unity method called each time the game engine is stepped. ML-Agents logic should be placed here.
+- **Frame** - An instance of rendering the main camera for the display. Corresponds to each `Update` call of the game engine.
+- **Observation** - Partial information describing the state of the environment available to a given agent. (e.g. Vector, Visual)
+- **Policy** - The decision-making mechanism for producing decisions from observations, typically a neural network model.
+- **Reward** - Signal provided at every step used to indicate desirability of an agent’s action within the current state of the environment.
+- **State** - The underlying properties of the environment (including all agents within it) at a given time.
+- **Step** - Corresponds to an atomic change of the engine that happens between Agent decisions.
+- **Trainer** - Python class which is responsible for training a given group of Agents.
+- **Update** - Unity function called each time a frame is rendered. ML-Agents logic should not be placed here.
diff --git a/com.unity.ml-agents/Documentation~/Hugging-Face-Integration.md b/com.unity.ml-agents/Documentation~/Hugging-Face-Integration.md
new file mode 100644
index 0000000000..189624817f
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Hugging-Face-Integration.md
@@ -0,0 +1,56 @@
+# The Hugging Face Integration
+
+The [Hugging Face Hub 🤗](https://huggingface.co/models?pipeline_tag=reinforcement-learning) is a central place **where anyone can share and download models**.
+
+It allows you to:
+- **Host** your trained models.
+- **Download** trained models from the community.
+- Visualize your agents **playing directly on your browser**.
+
+You can see the list of ml-agents models [here](https://huggingface.co/models?library=ml-agents).
+
+We wrote a **complete tutorial to learn to train your first agent using ML-Agents and publish it to the Hub**:
+
+- A short tutorial where you [teach **Huggy the Dog to fetch the stick** and then play with him directly in your browser](https://huggingface.co/learn/deep-rl-course/unitbonus1/introduction)
+- A [more in-depth tutorial](https://huggingface.co/learn/deep-rl-course/unit5/introduction)
+
+## Download a model from the Hub
+
+You can simply download a model from the Hub using `mlagents-load-from-hf`.
+
+You need to define two parameters:
+
+- `--repo-id`: the name of the Hugging Face repo you want to download.
+- `--local-dir`: the path to download the model.
+
+For instance, I want to load the model with model-id "ThomasSimonini/MLAgents-Pyramids" and put it in the downloads directory:
+
+```sh
+mlagents-load-from-hf --repo-id="ThomasSimonini/MLAgents-Pyramids" --local-dir="./downloads"
+```
+
+## Upload a model to the Hub
+
+You can simply upload a model to the Hub using `mlagents-push-to-hf`.
+
+You need to define four parameters:
+
+- `--run-id`: the name of the training run id.
+- `--local-dir`: where the model was saved.
+- `--repo-id`: the name of the Hugging Face repo you want to create or update. It’s always in the form `<your username>/<repo name>`. If the repo does not exist, it will be created automatically.
+- `--commit-message`: since HF repos are git repositories you need to give a commit message.
+
+For instance, I want to upload my model trained with run-id "SnowballTarget1" to the repo-id: ThomasSimonini/ppo-SnowballTarget:
+
+```sh
+ mlagents-push-to-hf --run-id="SnowballTarget1" --local-dir="./results/SnowballTarget1" --repo-id="ThomasSimonini/ppo-SnowballTarget" --commit-message="First Push"
+```
+
+## Visualize an agent playing
+
+You can watch your agent playing directly in your browser (if the environment is from the [ML-Agents official environments](Learning-Environment-Examples.md))
+
+- Step 1: Go to https://huggingface.co/unity and select the environment demo.
+- Step 2: Find your model_id in the list.
+- Step 3: Select your .nn /.onnx file.
+- Step 4: Click on Watch the agent play
diff --git a/com.unity.ml-agents/Documentation~/Inference-Engine.md b/com.unity.ml-agents/Documentation~/Inference-Engine.md
new file mode 100644
index 0000000000..aca1dde648
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Inference-Engine.md
@@ -0,0 +1,26 @@
+# Inference Engine
+
+The ML-Agents Toolkit allows you to use pre-trained neural network models inside your Unity games. This support is possible thanks to the [Inference Engine](https://docs.unity3d.com/Packages/com.unity.ai.inference@latest). Inference Engine uses [compute shaders](https://docs.unity3d.com/Manual/class-ComputeShader.html) to run the neural network within Unity.
+
+## Supported devices
+
+Inference Engine supports [all Unity runtime platforms](https://docs.unity3d.com/Manual/PlatformSpecific.html).
+
+Scripting Backends: Inference Engine is generally faster with **IL2CPP** than with **Mono** for Standalone builds. In the Editor, it is not possible to use Inference Engine with a GPU device selected when Editor Graphics Emulation is set to **OpenGL(ES) 3.0 or 2.0 emulation**. Also, there might be non-fatal build-time errors when the target platform includes a Graphics API that does not support **Unity Compute Shaders**.
+
+In cases when it is not possible to use compute shaders on the target platform, inference can be performed using **CPU** or **GPUPixel** Inference Engine backends.
+
+## Using Inference Engine
+
+When using a model, drag the model file into the **Model** field in the Inspector of the Agent. Select the **Inference Device** (**Compute Shader**, **Burst**, or **Pixel Shader**) you want to use for inference.
+
+**Note:** For most of the models generated with the ML-Agents Toolkit, CPU inference (**Burst**) will be faster than GPU inference (**Compute Shader** or **Pixel Shader**). You should use GPU inference only if you use the ResNet visual encoder or have a large number of agents with visual observations.
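+
+If you need to assign a model or inference device from code at runtime, `Agent.SetModel` can be used. The sketch below is an assumption to verify against the Scripting API: the exact model-asset type (and its namespace) depends on the Inference Engine package version, and the behavior name shown is hypothetical.
+
+```csharp
+using Unity.MLAgents;
+using Unity.MLAgents.Policies;
+using Unity.InferenceEngine;  // assumed namespace for ModelAsset - verify for your package version
+using UnityEngine;
+
+public class ModelSwapper : MonoBehaviour
+{
+    [SerializeField] Agent m_Agent;
+    [SerializeField] ModelAsset m_TrainedModel;  // a trained .onnx asset assigned in the Inspector
+
+    void Start()
+    {
+        // Re-assign the model at runtime and run it on the CPU (Burst) backend.
+        m_Agent.SetModel("MyBehavior", m_TrainedModel, InferenceDevice.Burst);
+    }
+}
+```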
+
+## Unsupported use cases
+### Externally trained models
+The ML-Agents Toolkit only supports the models created with our trainers. Model loading expects certain conventions for constants and tensor names. While it is possible to construct a model that follows these conventions, we don't provide any additional help for this. More details can be found in [TensorNames.cs](https://github.com/Unity-Technologies/ml-agents/blob/release_22_docs/com.unity.ml-agents/Runtime/Inference/TensorNames.cs) and [SentisModelParamLoader.cs](https://github.com/Unity-Technologies/ml-agents/blob/release_22_docs/com.unity.ml-agents/Runtime/Inference/SentisModelParamLoader.cs).
+
+If you wish to run inference on an externally trained model, you should use Inference Engine directly, instead of trying to run it through ML-Agents.
+
+### Model inference outside of Unity
+We do not provide support for inference anywhere outside of Unity. The `.onnx` files produced by training use the open format ONNX; if you wish to convert a `.onnx` file to another format or run inference with them, refer to their documentation.
diff --git a/com.unity.ml-agents/Documentation~/InputSystem-Integration.md b/com.unity.ml-agents/Documentation~/InputSystem-Integration.md
new file mode 100644
index 0000000000..de5715ba0c
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/InputSystem-Integration.md
@@ -0,0 +1,39 @@
+# Input System Integration
+
+The ML-Agents package integrates with the [Input System Package](https://docs.unity3d.com/Packages/com.unity.inputsystem@1.14/manual/QuickStartGuide.html) through the `InputActuatorComponent`. This component sets up an action space for your `Agent` based on an `InputActionAsset` that is referenced by the `IInputActionAssetProvider` interface, or the `PlayerInput` component that may be living on your player-controlled `Agent`. This means that if you have code outside of your agent that handles input, you will not need to implement the Heuristic function in your Agent as well. The `InputActuatorComponent` will handle this for you. You can now train and run inference on `Agents` with an action space defined by an `InputActionAsset`.
+
+Take a look at how we have implemented the C# code in the example Input Integration scene (located under Project/Assets/ML-Agents/Examples/PushBlockWithInput/). Once you have some familiarity, the next step is to add the `InputActuatorComponent` to your player Agent. The example we have implemented uses C# Events to send information from the Input System.
+
+## Getting Started with Input System Integration
+1. Add the `com.unity.inputsystem` package, version 1.1.0-preview.3 or later, to your project via the Package Manager window.
+2. If you have already set up an `InputActionAsset`, skip to Step 3; otherwise follow these sub-steps:
+    1. Create an `InputActionAsset` to allow your Agent to be controlled by the Input System.
+    2. Handle the events from the Input System where you normally would (i.e. in a script external to your Agent class).
+3. Add the `InputActuatorComponent` to the GameObject that has the `PlayerInput` and `Agent` components attached.
+
+See below for additional technical specifications on the C# code for the `InputActuatorComponent`.
+
+## Technical Specifications
+
+### `IInputActionAssetProvider` Interface
+The `InputActuatorComponent` searches for a `Component` that implements
+`IInputActionAssetProvider` on the `GameObject` they both are attached to. It is important to note
+that if multiple `Components` on your `GameObject` need to access an `InputActionAsset` to handle events, they will need to share the same instance of the `InputActionAsset` that is returned from the
+`IInputActionAssetProvider`.
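+
+If the `GameObject` does not use a `PlayerInput` component, you can implement `IInputActionAssetProvider` on one of its components instead. The following is a minimal sketch rather than code from the package: the exact namespace and method signature of `IInputActionAssetProvider` are assumptions here, so verify them against the package source before relying on this.
+
+```csharp
+using UnityEngine;
+using UnityEngine.InputSystem;
+// Assumption: add a using directive for the namespace that contains
+// IInputActionAssetProvider in your installed ML-Agents version.
+
+public class MyInputActionAssetProvider : MonoBehaviour, IInputActionAssetProvider
+{
+    // Assign the same InputActionAsset that your gameplay scripts use, so that
+    // the InputActuatorComponent and your own input handling share one instance.
+    [SerializeField]
+    InputActionAsset m_ActionAsset;
+
+    // Assumed signature: InputActionAsset itself implements IInputActionCollection2,
+    // so the same asset can be returned for both parts of the tuple.
+    public (InputActionAsset, IInputActionCollection2) GetInputActionAsset()
+    {
+        return (m_ActionAsset, m_ActionAsset);
+    }
+}
+```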
+
+### `InputActuatorComponent` Class
+The `InputActuatorComponent` is the bridge between ML-Agents and the Input System. It allows ML-Agents to:
+* create an `ActionSpec` for your Agent based on an `InputActionAsset` that comes from an `IInputActionAssetProvider`.
+* send simulated input from a training process or a neural network
+* let developers keep their input handling code in one place
+
+This is accomplished by adding the `InputActuatorComponent` to an Agent which already has the PlayerInput component attached.
+
+## Requirements
+
+If using the `InputActuatorComponent`, install the `com.unity.inputsystem` package `version 1.1.0-preview.3` or later.
+
+## Known Limitations
+
+For the `InputActuatorComponent`:
+- Limited implementation of `InputControls`
+- No way to customize the action space of the `InputActuatorComponent`
diff --git a/com.unity.ml-agents/Documentation~/Installation.md b/com.unity.ml-agents/Documentation~/Installation.md
new file mode 100644
index 0000000000..921e4fa8b1
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Installation.md
@@ -0,0 +1,124 @@
+# Installation
+To install and use the ML-Agents Toolkit, follow the steps below. Detailed instructions for each step are provided later on this page.
+
+1. Install Unity (6000.0 or later)
+2. Install Python (>= 3.10.1, <=3.10.12) - we recommend using 3.10.12
+3. Install the `com.unity.ml-agents` Unity package; or clone this repository and install locally (recommended for the latest version and bug fixes)
+4. Install `mlagents-envs`
+5. Install `mlagents`
+
+### Install **Unity 6000.0** or Later
+
+[Download](https://unity3d.com/get-unity/download) and install Unity. We strongly recommend that you install Unity through the Unity Hub as it will enable you to manage multiple Unity versions.
+
+### Install **Python 3.10.12**
+
+We recommend [installing](https://www.python.org/downloads/) Python 3.10.12. If you are using Windows, please install the x86-64 version and not x86. If your Python environment doesn't include `pip3`, see these [instructions](https://packaging.python.org/guides/installing-using-linux-tools/#installing-pip-setuptools-wheel-with-linux-package-managers) on installing it. We also recommend using [conda](https://docs.conda.io/en/latest/) or [mamba](https://github.com/mamba-org/mamba) to manage your python virtual environments.
+
+#### Conda python setup
+
+Once conda has been installed on your system, open a terminal and execute the following commands to set up a Python 3.10.12 virtual environment and activate it.
+
+```shell
+conda create -n mlagents python=3.10.12 && conda activate mlagents
+```
+
+### Install the `com.unity.ml-agents` Unity package
+
+The Unity ML-Agents C# SDK is a Unity Package. You can install the `com.unity.ml-agents` package [directly from the Package Manager registry](https://docs.unity3d.com/Manual/upm-ui-install.html). Please make sure you enable 'Preview Packages' in the 'Advanced' dropdown in order to find the latest Preview release of the package.
+
+**NOTE:** If you do not see the ML-Agents package listed in the Package Manager please follow the advanced installation instructions below.
+
+#### Advanced: Local Installation for Development
+
+You will need to clone the repository if you plan to modify or extend the ML-Agents Toolkit for your purposes, or if you'd like to download our example environments. Some of our tutorials / guides assume you have access to our example environments.
+
+Use the command below to clone the repository:
+
+```sh
+git clone --branch release_22 https://github.com/Unity-Technologies/ml-agents.git
+```
+
+The `--branch release_22` option will switch to the tag of the latest stable release. Omitting that will get the `develop` branch which is potentially unstable. However, if you find that a release branch does not work, the recommendation is to use the `develop` branch as it may have potential fixes for bugs and dependency issues.
+
+(Optional) To get the bleeding-edge version instead:
+
+```sh
+git clone https://github.com/Unity-Technologies/ml-agents.git
+```
+
+If you plan to contribute those changes back, make sure to clone the `develop` branch (by omitting `--branch release_22` from the command above). See our [Contributions Guidelines](CONTRIBUTING.md) for more information on contributing to the ML-Agents Toolkit.
+
+You can [add the local](https://docs.unity3d.com/Manual/upm-ui-local.html) `com.unity.ml-agents` package (from the repository that you just cloned) to your project by:
+
+1. Navigate to the menu `Window` -> `Package Manager`.
+2. In the Package Manager window, click on the `+` button in the top left of the packages list.
+3. Select `Add package from disk...`
+4. Navigate into the `com.unity.ml-agents` folder.
+5. Select the `package.json` file.
+
+
+
+If you are going to follow the examples from our documentation, you can open the
+`Project` folder in Unity and start tinkering immediately.
+
+### Install Python package
+
+Installing the `mlagents` Python package involves installing other Python packages that `mlagents` depends on. So you may run into installation issues if your machine has older versions of any of those dependencies already installed. Consequently, our supported path for installing `mlagents` is to leverage Python Virtual Environments. Virtual Environments provide a mechanism for isolating the dependencies for each project and are supported on Mac / Windows / Linux. We offer a dedicated [guide on Virtual Environments](Using-Virtual-Environment.md).
+
+#### Installing `mlagents` from PyPi
+
+You can install the ML-Agents Python package directly from PyPi. This is the recommended approach if you installed the C# package via the Package Manager registry.
+
+**Important:** Ensure you install a Python package version that matches your Unity package version. Check the [release history](https://github.com/Unity-Technologies/ml-agents/releases) to find compatible versions.
+
+To install, activate your virtual environment and run the following command:
+
+```shell
+python -m pip install mlagents==1.1.0
+```
+
+which will install the specified version of the `mlagents` Python package and its associated dependencies from PyPi. If building the wheel for `grpcio` fails, run the following command before installing `mlagents` with pip:
+
+```shell
+conda install "grpcio=1.48.2" -c conda-forge
+```
+
+When you install the Python package, the dependencies listed in the [setup.py file](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents/setup.py) are also installed. These include [PyTorch](Background-PyTorch.md).
+
+
+#### Advanced: Local Installation for Development
+
+##### (Windows) Installing PyTorch
+
+On Windows, you'll have to install the PyTorch package separately prior to installing ML-Agents in order to make sure the CUDA-enabled version is used, rather than the CPU-only version. Activate your virtual environment and run from the command line:
+
+```sh
+pip3 install torch~=2.2.1 --index-url https://download.pytorch.org/whl/cu121
+```
+
+Note that on Windows, you may also need Microsoft's Visual C++ Redistributable if you don't have it already. See the [PyTorch installation guide](https://pytorch.org/get-started/locally/) for more installation options and versions.
+
+##### All Platforms
+
+To install the `mlagents` Python package, activate your virtual environment and run from the command line:
+
+```sh
+cd /path/to/ml-agents
+python -m pip install ./ml-agents-envs
+python -m pip install ./ml-agents
+```
+
+Note that this will install `mlagents` from the cloned repository, _not_ from the PyPi repository. If you installed this correctly, you should be able to run `mlagents-learn --help`, after which you will see the command line parameters you can use with `mlagents-learn`.
+
+
+
+If you intend to make modifications to `mlagents` or `mlagents_envs`, from the repository's root directory, run:
+
+```sh
+pip3 install torch -f https://download.pytorch.org/whl/torch_stable.html
+pip3 install -e ./ml-agents-envs
+pip3 install -e ./ml-agents
+```
+
+Running pip with the `-e` flag will let you make changes to the Python files directly and have those reflected when you run `mlagents-learn`. It is important to install these packages in this order as the `mlagents` package depends on `mlagents_envs`, and installing it in the other order will download `mlagents_envs` from PyPi.
diff --git a/com.unity.ml-agents/Documentation~/Integrations-Match3.md b/com.unity.ml-agents/Documentation~/Integrations-Match3.md
new file mode 100644
index 0000000000..d4495bfa19
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Integrations-Match3.md
@@ -0,0 +1,69 @@
+# Match-3 with ML-Agents
+
+
+
+## Getting started
+The C# code for Match-3 is included in the Unity package (`com.unity.ml-agents`). A good first step is to take a look at how we have implemented the C# code in the example Match-3 scene (located under `/Project/Assets/ML-Agents/Examples/match3`). Once you have some familiarity, the next step is to implement the C# code for Match-3 in your own game using the abstractions provided by the package.
+
+See below for additional technical specifications on the C# code for Match-3. Please note that the Match-3 game isn't human-playable as implemented and can only be played via training.
+
+## Technical specifications for Match-3 with ML-Agents
+
+### AbstractBoard class
+The `AbstractBoard` is the bridge between ML-Agents and your game. It allows ML-Agents to
+* ask your game what the current and maximum sizes (rows, columns, and potential piece types) of the board are
+* ask your game what the "color" of a cell is
+* ask whether the cell is a "special" piece type or not
+* ask your game whether a move is allowed
+* request that your game make a move
+
+These are handled by implementing the abstract methods of `AbstractBoard`.
+
+#### `public abstract BoardSize GetMaxBoardSize()`
+Returns the largest `BoardSize` that the game can use. This is used to determine the sizes of observations and sensors, so don't make it larger than necessary.
+
+#### `public virtual BoardSize GetCurrentBoardSize()`
+Returns the current size of the board. Each field on this `BoardSize` must be less than or equal to the corresponding field returned by `GetMaxBoardSize()`. This method is optional; if you always use the same size board, you don't need to override it.
+
+If the current board size is smaller than the maximum board size, `GetCellType()` and `GetSpecialType()` will not be called for cells outside the current board size, and `IsValidMove` won't be called for moves that would go outside of the current board size.
+
+#### `public abstract int GetCellType(int row, int col)`
+Returns the "color" of piece at the given row and column. This should be between 0 and BoardSize.NumCellTypes-1 (inclusive). The actual order of the values doesn't matter.
+
+#### `public abstract int GetSpecialType(int row, int col)`
+Returns the special type of the piece at the given row and column. This should be between 0 and BoardSize.NumSpecialTypes (inclusive). The actual order of the values doesn't matter.
+
+#### `public abstract bool IsMoveValid(Move m)`
+Check whether the particular `Move` is valid for the game. The actual results will depend on the rules of the game, but we provide the `SimpleIsMoveValid()` method that handles basic match3 rules with no special or immovable pieces.
+
+#### `public abstract bool MakeMove(Move m)`
+Instruct the game to make the given move. Returns true if the move was made. Note that during training, a move that was marked as invalid may occasionally still be requested. If this happens, it is safe to do nothing and request another move.
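+
+As a rough illustration, the following is a minimal sketch of an `AbstractBoard` implementation with no special pieces. It is not taken from the package samples; the namespace and the `BoardSize` field names `Rows` and `Columns` are assumptions to verify against your installed version.
+
+```csharp
+using Unity.MLAgents.Integrations.Match3; // assumed namespace; check the package source
+
+public class MyMatch3Board : AbstractBoard
+{
+    const int k_Rows = 8;
+    const int k_Columns = 8;
+    const int k_NumCellTypes = 4;
+
+    // Backing data for the board, filled in and updated by your own game logic.
+    int[,] m_Cells = new int[k_Columns, k_Rows];
+
+    public override BoardSize GetMaxBoardSize()
+    {
+        // Assumed field names on BoardSize.
+        return new BoardSize
+        {
+            Rows = k_Rows,
+            Columns = k_Columns,
+            NumCellTypes = k_NumCellTypes,
+            NumSpecialTypes = 0
+        };
+    }
+
+    public override int GetCellType(int row, int col)
+    {
+        return m_Cells[col, row];
+    }
+
+    public override int GetSpecialType(int row, int col)
+    {
+        return 0; // no special pieces in this sketch
+    }
+
+    public override bool IsMoveValid(Move m)
+    {
+        // SimpleIsMoveValid() is the provided helper mentioned above; it handles
+        // basic match-3 rules with no special or immovable pieces.
+        return SimpleIsMoveValid(m);
+    }
+
+    public override bool MakeMove(Move m)
+    {
+        // Apply the swap to your game state here and return true if the move was made.
+        return true;
+    }
+}
+```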
+
+### `Move` struct
+The `Move` struct encapsulates a swap of two adjacent cells. You can get the number of potential moves for a board of a given size with `Move.NumPotentialMoves(maxBoardSize)`. There are two helper functions to create a new `Move` (see the sketch after this list):
+* `public static Move FromMoveIndex(int moveIndex, BoardSize maxBoardSize)` can be used to iterate over all potential moves for the board by looping from 0 to `Move.NumPotentialMoves()`
+* `public static Move FromPositionAndDirection(int row, int col, Direction dir, BoardSize maxBoardSize)` creates a `Move` from a row, column, and direction (and board size).
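+
+As a rough sketch (not code from the package; the namespace is an assumption to verify against your installed version), these helpers can be combined with `IsMoveValid()` to enumerate the currently valid moves on a board:
+
+```csharp
+using System.Collections.Generic;
+using Unity.MLAgents.Integrations.Match3; // assumed namespace; check the package source
+
+public static class MoveEnumeration
+{
+    // Collects every currently valid move by iterating over all potential moves.
+    public static List<Move> GetValidMoves(AbstractBoard board)
+    {
+        var maxSize = board.GetMaxBoardSize();
+        var validMoves = new List<Move>();
+        for (var i = 0; i < Move.NumPotentialMoves(maxSize); i++)
+        {
+            var move = Move.FromMoveIndex(i, maxSize);
+            if (board.IsMoveValid(move))
+            {
+                validMoves.Add(move);
+            }
+        }
+        return validMoves;
+    }
+}
+```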
+
+### `BoardSize` struct
+Describes the "size" of the board, including the number of potential piece types that the board can have. This is returned by the AbstractBoard.GetMaxBoardSize() and GetCurrentBoardSize() methods.
+
+#### `Match3Sensor` and `Match3SensorComponent` classes
+The `Match3Sensor` generates observations about the state using the `AbstractBoard` interface. You can choose whether to use vector or "visual" observations; in theory, visual observations should perform better because they are 2-dimensional like the board, but we need to experiment more on this.
+
+A `Match3SensorComponent` generates `Match3Sensor`s (the exact number of sensors depends on your configuration) at runtime, and should be added to the same GameObject as your `Agent` implementation. You do not need to write any additional code to use them.
+
+#### `Match3Actuator` and `Match3ActuatorComponent` classes
+The `Match3Actuator` converts actions from training or inference into a `Move` that is sent to `AbstractBoard.MakeMove()`. It also checks `AbstractBoard.IsMoveValid()` for each potential move and uses this to set the action mask for the Agent.
+
+A `Match3ActuatorComponent` generates a `Match3Actuator` at runtime, and should be added to the same GameObject as your `Agent` implementation. You do not need to write any additional code to use it.
+
+### Setting up Match-3 simulation
+* Implement the `AbstractBoard` methods to integrate with your game.
+* Give the `Agent` rewards when it does what you want it to (matches multiple pieces in a row, clears pieces of a certain type, etc.).
+* Add the `Agent`, `AbstractBoard` implementation, `Match3SensorComponent`, and `Match3ActuatorComponent` to the same `GameObject`.
+* Call `Agent.RequestDecision()` when you're ready for the `Agent` to make a move on the next `Academy` step. During the next `Academy` step, the `MakeMove()` method on the board will be called.
+
+## Implementation Details
+
+### Action Space
+The indexing for actions is the same as described in [Human Like Playtesting with Deep Learning](https://www.researchgate.net/publication/328307928_Human-Like_Playtesting_with_Deep_Learning) (for example, Figure 2b). The horizontal moves are enumerated first, then the vertical ones.
diff --git a/com.unity.ml-agents/Documentation~/Integrations.md b/com.unity.ml-agents/Documentation~/Integrations.md
new file mode 100644
index 0000000000..fe202c0c8c
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Integrations.md
@@ -0,0 +1,5 @@
+# Game Integrations
+ML-Agents provides some utilities to make it easier to integrate with some common genres of games.
+
+## Match-3
+The [Match-3 integration](Integrations-Match3.md) provides an abstraction of a match-3 game board and moves, along with a sensor to observe the game state, and an actuator to translate the ML-Agent actions into game moves.
\ No newline at end of file
diff --git a/com.unity.ml-agents/Documentation~/Learning-Environment-Create-New.md b/com.unity.ml-agents/Documentation~/Learning-Environment-Create-New.md
new file mode 100644
index 0000000000..65b00dd771
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Learning-Environment-Create-New.md
@@ -0,0 +1,357 @@
+# Making a New Learning Environment
+
+This tutorial walks through the process of creating a Unity Environment from scratch. We recommend first reading the [Running an Example](Sample.md) guide to understand the concepts presented here in an already-built environment.
+
+
+
+In this example, we will create an agent capable of controlling a ball on a platform. We will then train the agent to roll the ball toward the cube while avoiding falling off the platform.
+
+## Overview
+
+Using the ML-Agents Toolkit in a Unity project involves the following basic steps:
+
+1. Create an environment for your agents to live in. An environment can range from a simple physical simulation containing a few objects to an entire game or ecosystem.
+2. Implement your Agent subclasses. An Agent subclass defines the code an Agent uses to observe its environment, to carry out assigned actions, and to calculate the rewards used for reinforcement training. You can also implement optional methods to reset the Agent when it has finished or failed its task.
+3. Add your Agent subclasses to appropriate GameObjects, typically, the object in the scene that represents the Agent in the simulation.
+
+**Note:** If you are unfamiliar with Unity, refer to the [Unity manual](https://docs.unity3d.com/Manual/index.html) if an Editor task isn't explained sufficiently in this tutorial.
+
+If you haven't already, follow the [installation instructions](Installation.md).
+
+## Set Up the Unity Project
+
+The first task to accomplish is simply creating a new Unity project and importing the ML-Agents assets into it:
+
+1. Launch Unity Hub and create a new 3D project named "RollerBall".
+2. [Add the ML-Agents Unity package](Installation.md#install-the-comunityml-agents-unity-package) to your project.
+
+Your Unity **Project** window should contain the following Packages:
+
+
+## Create the Environment
+
+Next, we will create a very simple scene to act as our learning environment. The "physical" components of the environment include a Plane to act as the floor for the Agent to move around on, a Cube to act as the goal or target for the agent to seek, and a Sphere to represent the Agent itself.
+
+### Create the Floor Plane
+
+1. Right click in Hierarchy window, select 3D Object > Plane.
+2. Name the GameObject "Floor".
+3. Select the Floor Plane to view its properties in the Inspector window.
+4. Set Transform to Position = `(0, 0, 0)`, Rotation = `(0, 0, 0)`, Scale =`(1, 1, 1)`.
+
+
+### Add the Target Cube
+
+1. Right click in Hierarchy window, select 3D Object > Cube.
+2. Name the GameObject "Target".
+3. Select the Target Cube to view its properties in the Inspector window.
+4. Set Transform to Position = `(3, 0.5, 3)`, Rotation = `(0, 0, 0)`, Scale =
+ `(1, 1, 1)`.
+
+
+### Add the Agent Sphere
+
+1. Right click in Hierarchy window, select 3D Object > Sphere.
+2. Name the GameObject "RollerAgent".
+3. Select the RollerAgent Sphere to view its properties in the Inspector window.
+4. Set Transform to Position = `(0, 0.5, 0)`, Rotation = `(0, 0, 0)`, Scale = `(1, 1, 1)`.
+5. Click **Add Component**.
+6. Add the `Rigidbody` component to the Sphere.
+
+### Group into Training Area
+
+Group the floor, target and agent under a single, empty, GameObject. This will simplify some of our subsequent steps.
+
+To do so:
+
+1. Right-click on your Project Hierarchy and create a new empty GameObject. Name it TrainingArea.
+2. Reset the TrainingArea’s Transform so that it is at `(0,0,0)` with Rotation `(0,0,0)` and Scale `(1,1,1)`.
+3. Drag the Floor, Target, and RollerAgent GameObjects in the Hierarchy into the TrainingArea GameObject.
+
+
+## Implement an Agent
+
+To create the Agent Script:
+
+1. Select the RollerAgent GameObject to view it in the Inspector window.
+2. Click **Add Component**.
+3. Click **New Script** in the list of components (at the bottom).
+4. Name the script "RollerAgent".
+5. Click **Create and Add**.
+
+Then, edit the new `RollerAgent` script:
+
+1. In the Unity Project window, double-click the `RollerAgent` script to open it in your code editor.
+2. Import the ML-Agents namespaces by adding
+
+    ```csharp
+    using Unity.MLAgents;
+    using Unity.MLAgents.Sensors;
+    using Unity.MLAgents.Actuators;
+    ```
+    then change the base class from `MonoBehaviour` to `Agent`.
+3. Delete `Update()` since we are not using it, but keep `Start()`.
+
+So far, these are the basic steps that you would use to add ML-Agents to any Unity project. Next, we will add the logic that will let our Agent learn to roll to the cube using reinforcement learning. More specifically, we will need to extend three methods from the `Agent` base class:
+
+- `OnEpisodeBegin()`
+- `CollectObservations(VectorSensor sensor)`
+- `OnActionReceived(ActionBuffers actionBuffers)`
+
+We overview each of these in more detail in the dedicated subsections below.
+
+### Initialization and Resetting the Agent
+
+The process of training in the ML-Agents Toolkit involves running episodes where the Agent (Sphere) attempts to solve the task. Each episode lasts until the Agent solves the task (i.e. reaches the cube), fails (rolls off the platform), or times out (takes too long to solve or fail at the task). At the start of each episode, `OnEpisodeBegin()` is called to set up the environment for a new episode. Typically the scene is initialized in a random manner to enable the agent to learn to solve the task under a variety of conditions.
+
+In this example, each time the Agent (Sphere) reaches its target (Cube), the episode ends and the target (Cube) is moved to a new random location; and if the Agent rolls off the platform, it will be put back onto the floor. These are all handled in `OnEpisodeBegin()`.
+
+To move the target (Cube), we need a reference to its Transform (which stores a GameObject's position, orientation and scale in the 3D world). To get this reference, add a public field of type `Transform` to the RollerAgent class. Public fields of a component in Unity get displayed in the Inspector window, allowing you to choose which GameObject to use as the target in the Unity Editor.
+
+To reset the Agent's velocity (and later to apply force to move the agent) we need a reference to the Rigidbody component. A [Rigidbody](https://docs.unity3d.com/ScriptReference/Rigidbody.html) is Unity's primary element for physics simulation. (See [Physics](https://docs.unity3d.com/Manual/PhysicsSection.html) for full documentation of Unity physics.) Since the Rigidbody component is on the same GameObject as our Agent script, the best way to get this reference is using
+`GameObject.GetComponent<T>()`, which we can call in our script's `Start()`
+method.
+
+So far, our RollerAgent script looks like:
+
+```csharp
+using System.Collections.Generic;
+using UnityEngine;
+using Unity.MLAgents;
+using Unity.MLAgents.Sensors;
+
+public class RollerAgent : Agent
+{
+    Rigidbody rBody;
+    void Start () {
+        rBody = GetComponent<Rigidbody>();
+    }
+
+    public Transform Target;
+    public override void OnEpisodeBegin()
+    {
+        // If the Agent fell, zero its momentum
+        if (this.transform.localPosition.y < 0)
+        {
+            this.rBody.angularVelocity = Vector3.zero;
+            this.rBody.velocity = Vector3.zero;
+            this.transform.localPosition = new Vector3(0, 0.5f, 0);
+        }
+
+        // Move the target to a new spot
+        Target.localPosition = new Vector3(Random.value * 8 - 4,
+                                           0.5f,
+                                           Random.value * 8 - 4);
+    }
+}
+```
+
+Next, let's implement the `Agent.CollectObservations(VectorSensor sensor)` method.
+
+### Observing the Environment
+
+The Agent sends the information we collect to the Brain, which uses it to make a decision. When you train the Agent (or use a trained model), the data is fed into a neural network as a feature vector. For an Agent to successfully learn a task, we need to provide the correct information. A good rule of thumb for deciding what information to collect is to consider what you would need to calculate an analytical solution to the problem.
+
+In our case, the information our Agent collects includes the position of the target, the position of the agent itself, and the velocity of the agent. This helps the Agent learn to control its speed so it doesn't overshoot the target and roll off the platform. In total, the agent observation contains 8 values as implemented below:
+
+```csharp
+public override void CollectObservations(VectorSensor sensor)
+{
+    // Target and Agent positions
+    sensor.AddObservation(Target.localPosition);
+    sensor.AddObservation(this.transform.localPosition);
+
+    // Agent velocity
+    sensor.AddObservation(rBody.velocity.x);
+    sensor.AddObservation(rBody.velocity.z);
+}
+```
+
+### Taking Actions and Assigning Rewards
+
+The final part of the Agent code is the `Agent.OnActionReceived()` method, which receives actions and assigns the reward.
+
+#### Actions
+
+To solve the task of moving towards the target, the Agent (Sphere) needs to be able to move in the `x` and `z` directions. As such, the agent needs 2 actions: the first determines the force applied along the x-axis; and the second determines the force applied along the z-axis. (If we allowed the Agent to move in three dimensions, then we would need a third action.)
+
+The RollerAgent applies the values from the `action[]` array to its Rigidbody component `rBody`, using `Rigidbody.AddForce()`:
+
+```csharp
+Vector3 controlSignal = Vector3.zero;
+controlSignal.x = action[0];
+controlSignal.z = action[1];
+rBody.AddForce(controlSignal * forceMultiplier);
+```
+
+#### Rewards
+
+Reinforcement learning requires rewards to signal which decisions are good and which are bad. The learning algorithm uses the rewards to determine whether it is giving the Agent the optimal actions. You want to reward an Agent for completing the assigned task. In this case, the Agent is given a reward of 1.0 for reaching the Target cube.
+
+Rewards are assigned in `OnActionReceived()`. The RollerAgent calculates the distance to detect when it reaches the target. When it does, the code calls `Agent.SetReward()` to assign a reward of 1.0 and marks the agent as finished by calling `EndEpisode()` on the Agent.
+
+```csharp
+float distanceToTarget = Vector3.Distance(this.transform.localPosition, Target.localPosition);
+// Reached target
+if (distanceToTarget < 1.42f)
+{
+    SetReward(1.0f);
+    EndEpisode();
+}
+```
+
+Finally, if the Agent falls off the platform, end the episode so that it can reset itself:
+
+```csharp
+// Fell off platform
+if (this.transform.localPosition.y < 0)
+{
+    EndEpisode();
+}
+```
+
+#### OnActionReceived()
+
+With the action and reward logic outlined above, the final version of
+`OnActionReceived()` looks like:
+
+```csharp
+public float forceMultiplier = 10;
+public override void OnActionReceived(ActionBuffers actionBuffers)
+{
+    // Actions, size = 2
+    Vector3 controlSignal = Vector3.zero;
+    controlSignal.x = actionBuffers.ContinuousActions[0];
+    controlSignal.z = actionBuffers.ContinuousActions[1];
+    rBody.AddForce(controlSignal * forceMultiplier);
+
+    // Rewards
+    float distanceToTarget = Vector3.Distance(this.transform.localPosition, Target.localPosition);
+
+    // Reached target
+    if (distanceToTarget < 1.42f)
+    {
+        SetReward(1.0f);
+        EndEpisode();
+    }
+
+    // Fell off platform
+    else if (this.transform.localPosition.y < 0)
+    {
+        EndEpisode();
+    }
+}
+```
+
+Note the `forceMultiplier` class variable is defined before the method definition. Since `forceMultiplier` is public, you can set the value from the Inspector window.
+
+## Final Agent Setup in Editor
+
+Now that all the GameObjects and ML-Agent components are in place, it is time to connect everything together in the Unity Editor. This involves adding and setting some of the Agent Component's properties so that they are compatible with our Agent script.
+
+1. Select the **RollerAgent** GameObject to show its properties in the Inspector window.
+2. Drag the Target GameObject in the Hierarchy into the `Target` field in RollerAgent Script.
+3. Add a `Decision Requester` script with the **Add Component** button. Set the **Decision Period** to `10`. For more information on decisions, see [the Agent documentation](Learning-Environment-Design-Agents.md#decisions)
+4. Add a `Behavior Parameters` script with the **Add Component** button. Set the Behavior Parameters of the Agent to the following:
+ - `Behavior Name`: _RollerBall_
+ - `Vector Observation` > `Space Size` = 8
+ - `Actions` > `Continuous Actions` = 2
+
+In the inspector, the `RollerAgent` should look like this now:
+
+
+Now you are ready to test the environment before training.
+
+## Testing the Environment
+
+It is always a good idea to first test your environment by controlling the Agent using the keyboard. To do so, you will need to extend the `Heuristic()` method in the `RollerAgent` class. For our example, the heuristic will generate an action corresponding to the values of the "Horizontal" and "Vertical" input axis (which correspond to the keyboard arrow keys):
+
+```csharp
+public override void Heuristic(in ActionBuffers actionsOut)
+{
+    var continuousActionsOut = actionsOut.ContinuousActions;
+    continuousActionsOut[0] = Input.GetAxis("Horizontal");
+    continuousActionsOut[1] = Input.GetAxis("Vertical");
+}
+```
+
+In order for the Agent to use the Heuristic, you will need to set the `Behavior Type` to `Heuristic Only` in the `Behavior Parameters` of the RollerAgent.
+
+Press **Play** to run the scene and use the arrow keys to move the Agent around the platform. Make sure that there are no errors displayed in the Unity Editor Console window and that the Agent resets when it reaches its target or falls from the platform.
+
+## Training the Environment
+
+The process is the same as described in the [Running an Example](Sample.md) guide.
+
+The hyperparameters for training are specified in a configuration file that you pass to the `mlagents-learn` program. Create a new `rollerball_config.yaml` file under `config/` and include the following hyperparameter values:
+
+```yml
+behaviors:
+  RollerBall:
+    trainer_type: ppo
+    hyperparameters:
+      batch_size: 10
+      buffer_size: 100
+      learning_rate: 3.0e-4
+      beta: 5.0e-4
+      epsilon: 0.2
+      lambd: 0.99
+      num_epoch: 3
+      learning_rate_schedule: linear
+      beta_schedule: constant
+      epsilon_schedule: linear
+    network_settings:
+      normalize: false
+      hidden_units: 128
+      num_layers: 2
+    reward_signals:
+      extrinsic:
+        gamma: 0.99
+        strength: 1.0
+    max_steps: 500000
+    time_horizon: 64
+    summary_freq: 10000
+```
+
+Hyperparameters are explained in [the training configuration file documentation](Training-Configuration-File.md)
+
+Since this example creates a very simple training environment with only a few inputs and outputs, using small batch and buffer sizes speeds up the training considerably. However, if you add more complexity to the environment or change the reward or observation functions, you might also find that training performs better with different hyperparameter values. In addition to setting these hyperparameter values, the Agent's **Decision Period** (set on the `Decision Requester` component) has a large effect on training time and success. A larger value reduces the number of decisions the training algorithm has to consider and, in this simple environment, speeds up training.
+
+To train your agent, run the following command before pressing **Play** in the Editor:
+
+`mlagents-learn config/rollerball_config.yaml --run-id=RollerBall`
+
+To monitor the statistics of Agent performance during training, use [TensorBoard](Using-Tensorboard.md).
+
+
+
+In particular, the _cumulative_reward_ and _value_estimate_ statistics show how well the Agent is achieving the task. In this example, the maximum reward an Agent can earn is 1.0, so these statistics approach that value when the Agent has successfully _solved_ the problem.
+
+## Optional: Multiple Training Areas within the Same Scene
+
+In many of the [example environments](Learning-Environment-Examples.md), many copies of the training area are instantiated in the scene. This generally speeds up training, allowing the environment to gather many experiences in parallel. This can be achieved simply by instantiating many Agents with the same `Behavior Name`. Note that we've already simplified our transition to using multiple areas by creating the `TrainingArea` GameObject and relying on local positions in `RollerAgent.cs`. Use the following steps to parallelize your RollerBall environment:
+
+1. Drag the TrainingArea GameObject, along with its attached GameObjects, into your Assets browser, turning it into a prefab.
+2. You can now instantiate copies of the TrainingArea prefab. Drag them into your scene, positioning them so that they do not overlap.
+
+Alternatively, you can use the `TrainingAreaReplicator` to replicate training areas. Use the following steps:
+
+1. Create a new empty Game Object in the scene.
+2. Click on the new object and add a TrainingAreaReplicator component to the empty Game Object through the inspector.
+3. Drag the training area to `Base Area` in the Training Area Replicator.
+4. Specify the number of areas to replicate and the separation between areas.
+5. Hit play and the areas will be replicated automatically!
+
+## Optional: Training Using Concurrent Unity Instances
+Another level of parallelization comes by training using [concurrent Unity instances](ML-Agents-Overview.md#additional-features). For example,
+
+```
+mlagents-learn config/rollerball_config.yaml --run-id=RollerBall --num-envs=2
+```
+
+will start ML-Agents training with two environment instances. Combining multiple training areas within the same scene with concurrent Unity instances effectively gives you two levels of parallelism to speed up training. The command line option `--num-envs=` controls the number of concurrent Unity instances that are executed in parallel during training.
diff --git a/com.unity.ml-agents/Documentation~/Learning-Environment-Design-Agents.md b/com.unity.ml-agents/Documentation~/Learning-Environment-Design-Agents.md
new file mode 100644
index 0000000000..c066f0cf94
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Learning-Environment-Design-Agents.md
@@ -0,0 +1,640 @@
+# Agents
+
+An agent is an entity that can observe its environment, decide on the best course of action using those observations, and execute those actions within its environment. Agents can be created in Unity by extending the `Agent` class. The most important aspects of creating agents that can successfully learn are the observations the agent collects, and the reward you assign to estimate the value of the agent's current state toward accomplishing its tasks.
+
+An Agent passes its observations to its Policy. The Policy then makes a decision and passes the chosen action back to the agent. Your agent code must execute the action, for example, move the agent in one direction or another. In order to train an agent using reinforcement learning, your agent must calculate a reward value at each action. The reward is used to discover the optimal decision-making policy.
+
+The `Policy` class abstracts out the decision making logic from the Agent itself so that you can use the same Policy in multiple Agents. How a Policy makes its decisions depends on the `Behavior Parameters` associated with the agent. If you set `Behavior Type` to `Heuristic Only`, the Agent will use its `Heuristic()` method to make decisions, which can allow you to control the Agent manually or write your own Policy. If the Agent has a `Model` file, its Policy will use the neural network `Model` to make decisions.
+
+When you create an Agent, you should usually extend the base Agent class. This includes implementing the following methods:
+
+- `Agent.OnEpisodeBegin()` — Called at the beginning of an Agent's episode, including at the beginning of the simulation.
+- `Agent.CollectObservations(VectorSensor sensor)` — Called every step that the Agent requests a decision. This is one possible way for collecting the Agent's observations of the environment; see [Generating Observations](#generating-observations) below for more options.
+- `Agent.OnActionReceived()` — Called every time the Agent receives an action to take. Receives the action chosen by the Agent. It is also common to assign a reward in this method.
+- `Agent.Heuristic()` - When the `Behavior Type` is set to `Heuristic Only` in the Behavior Parameters of the Agent, the Agent will use the `Heuristic()` method to generate its actions. As such, the `Heuristic()` method writes to the array of floats provided as its argument. __Note__: Do not create a new float array of actions in the `Heuristic()` method, as this will prevent writing floats to the original action array.
+
+As a concrete example, here is how the Ball3DAgent class implements these methods:
+
+- `Agent.OnEpisodeBegin()` — Resets the agent cube and ball to their starting positions. The function randomizes the reset values so that the training generalizes to more than a specific starting position and agent cube orientation.
+- `Agent.CollectObservations(VectorSensor sensor)` — Adds information about the orientation of the agent cube, the ball velocity, and the relative position between the ball and the cube. Since the `CollectObservations()` method calls `VectorSensor.AddObservation()` such that vector size adds up to 8, the Behavior Parameters of the Agent are set with vector observation space with a state size of 8.
+- `Agent.OnActionReceived()` — The action results in a small change in the agent cube's rotation at each step. In this example, an Agent receives a small positive reward for each step it keeps the ball on the agent cube's head and a larger, negative reward for dropping the ball. An Agent's episode is also ended when it drops the ball so that it will reset with a new ball for the next simulation step.
+- `Agent.Heuristic()` - Converts the keyboard inputs into actions.
+
+## Decisions
+
+The observation-decision-action-reward cycle repeats each time the Agent requests a decision. Agents will request a decision when `Agent.RequestDecision()` is called. If you need the Agent to request decisions on its own at regular intervals, add a `Decision Requester` component to the Agent's GameObject. Making decisions at regular step intervals is generally most appropriate for physics-based simulations. For example, an agent in a robotic simulator that must provide fine control of joint torques should make its decisions every step of the simulation. In games such as real-time strategy, where many agents make their decisions at regular intervals, the decision timing for each agent can be staggered by setting the `DecisionStep` parameter in the `Decision Requester` component for each agent. On the other hand, an agent that only needs to make decisions when certain game or simulation events occur, such as in a turn-based game, should call `Agent.RequestDecision()` manually, as in the sketch below.
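+
+A minimal sketch of manual decision requests in a turn-based setting follows; the `TurnManager` class and `OnAgentTurnStarted()` hook are hypothetical, and only `Agent.RequestDecision()` comes from the ML-Agents API.
+
+```csharp
+using Unity.MLAgents;
+using UnityEngine;
+
+// Hypothetical turn manager: only Agent.RequestDecision() is part of ML-Agents.
+public class TurnManager : MonoBehaviour
+{
+    [SerializeField]
+    Agent m_Agent;
+
+    // Call this from your game logic whenever it becomes the agent's turn.
+    public void OnAgentTurnStarted()
+    {
+        // Request exactly one decision for this turn instead of using a
+        // Decision Requester component to request decisions every N steps.
+        m_Agent.RequestDecision();
+    }
+}
+```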
+
+## Observations and Sensors
+In order for an agent to learn, the observations should include all the information an agent needs to accomplish its task. Without sufficient and relevant information, an agent may learn poorly or may not learn at all. A reasonable approach for determining what information should be included is to consider what you would need to calculate an analytical solution to the problem, or what you would expect a human to be able to use to solve the problem.
+
+### Generating Observations
+ML-Agents provides multiple ways for an Agent to make observations:
+ 1. Overriding the `Agent.CollectObservations()` method and passing the observations to the provided `VectorSensor`.
+ 2. Adding the `[Observable]` attribute to fields and properties on the Agent.
+ 3. Implementing the `ISensor` interface, using a `SensorComponent` attached to the Agent to create the `ISensor`.
+
+#### Agent.CollectObservations()
+Agent.CollectObservations() is best used for aspects of the environment which are numerical and non-visual. The Policy class calls the `CollectObservations(VectorSensor sensor)` method of each Agent. Your implementation of this function must call `VectorSensor.AddObservation` to add vector observations.
+
+The `VectorSensor.AddObservation` method provides a number of overloads for adding common types of data to your observation vector. You can add integers and booleans directly to the observation vector, as well as some common Unity data types such as `Vector2`, `Vector3`, and `Quaternion`.
+
+For examples of various state observation functions, you can look at the [example environments](Learning-Environment-Examples.md) included in the ML-Agents SDK. For instance, the 3DBall example uses the rotation of the platform, the relative position of the ball, and the velocity of the ball as its state observation.
+
+```csharp
+public GameObject ball;
+
+public override void CollectObservations(VectorSensor sensor)
+{
+    // Orientation of the cube (2 floats)
+    sensor.AddObservation(gameObject.transform.rotation.z);
+    sensor.AddObservation(gameObject.transform.rotation.x);
+    // Relative position of the ball to the cube (3 floats)
+    sensor.AddObservation(ball.transform.position - gameObject.transform.position);
+    // Velocity of the ball (3 floats)
+    sensor.AddObservation(m_BallRb.velocity);
+    // 8 floats total
+}
+```
+
+As an experiment, you can remove the velocity components from the observation and retrain the 3DBall agent. While it will learn to balance the ball reasonably well, the performance of the agent without using velocity is noticeably worse.
+
+The observations passed to `VectorSensor.AddObservation()` must always contain the same number of elements and must always be in the same order. If the number of observed entities in an environment can vary, you can pad the calls with zeros for any missing entities in a specific observation, or you can limit an agent's observations to a fixed subset. For example, instead of observing every enemy in an environment, you could only observe the closest five, as in the sketch below.
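+
+The following is a minimal sketch of that padding pattern; the `SquadAgent` class and the `FindClosestEnemies()` helper are hypothetical stand-ins for your own game code.
+
+```csharp
+using System.Collections.Generic;
+using UnityEngine;
+using Unity.MLAgents;
+using Unity.MLAgents.Sensors;
+
+public class SquadAgent : Agent
+{
+    const int k_MaxTrackedEnemies = 5;
+
+    // Hypothetical helper: returns up to k_MaxTrackedEnemies enemy transforms,
+    // sorted by distance to this agent. Replace with your own game-specific lookup.
+    List<Transform> FindClosestEnemies(int count)
+    {
+        return new List<Transform>();
+    }
+
+    public override void CollectObservations(VectorSensor sensor)
+    {
+        var closest = FindClosestEnemies(k_MaxTrackedEnemies);
+        for (var i = 0; i < k_MaxTrackedEnemies; i++)
+        {
+            if (i < closest.Count)
+            {
+                // Relative position of a real enemy (3 floats).
+                sensor.AddObservation(closest[i].localPosition - transform.localPosition);
+            }
+            else
+            {
+                // Pad with zeros so the observation size stays constant (3 floats).
+                sensor.AddObservation(Vector3.zero);
+            }
+        }
+    }
+}
+```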
+
+Additionally, when you set up an Agent's `Behavior Parameters` in the Unity Editor, you must set the **Vector Observations > Space Size** to equal the number of floats that are written by `CollectObservations()`.
+
+#### Observable Fields and Properties
+Another approach is to define the relevant observations as fields or properties on your Agent class, and annotate them with an `ObservableAttribute`. For example, in the Ball3DHardAgent, the difference between positions could be observed by adding a property to the Agent:
+```csharp
+using Unity.MLAgents.Sensors.Reflection;
+
+public class Ball3DHardAgent : Agent {
+
+    [Observable(numStackedObservations: 9)]
+    Vector3 PositionDelta
+    {
+        get
+        {
+            return ball.transform.position - gameObject.transform.position;
+        }
+    }
+}
+```
+`ObservableAttribute` currently supports most basic types (e.g. floats, ints, bools), as well as `Vector2`, `Vector3`, `Vector4`, `Quaternion`, and enums.
+
+The behavior of `ObservableAttribute`s is controlled by the "Observable Attribute Handling" setting in the Agent's `Behavior Parameters`. The possible values for this are:
+ * **Ignore** (default) - All ObservableAttributes on the Agent will be ignored. If there are no ObservableAttributes on the Agent, this will result in the fastest initialization time.
+ * **Exclude Inherited** - Only members on the declared class will be examined; members that are inherited are ignored. This is a reasonable tradeoff between performance and flexibility.
+ * **Examine All** - All members on the class will be examined. This can lead to slower startup times.
+
+"Exclude Inherited" is generally sufficient, but if your Agent inherits from another Agent implementation that has Observable members, you will need to use "Examine All".
+
+Internally, ObservableAttribute uses reflection to determine which members of the Agent have ObservableAttributes, and also uses reflection to access the fields or invoke the properties at runtime. This may be slower than using CollectObservations or an ISensor, although this might not be enough to noticeably affect performance.
+
+**NOTE**: you do not need to adjust the Space Size in the Agent's `Behavior Parameters` when you add `[Observable]` fields or properties to an Agent, since their size can be computed before they are used.
+
+#### ISensor interface and SensorComponents
+The `ISensor` interface is generally intended for advanced users. The `Write()` method is used to actually generate the observation, but some other methods such as returning the shape of the observations must also be implemented.
+
+The `SensorComponent` abstract class is used to create the actual `ISensor` at runtime. It must be attached to the same `GameObject` as the `Agent`, or to a child `GameObject`.
+
+There are several SensorComponents provided in the API, including:
+- `CameraSensorComponent` - Uses images from a `Camera` as observations.
+- `RenderTextureSensorComponent` - Uses the content of a `RenderTexture` as observations.
+- `RayPerceptionSensorComponent` - Uses the information from a set of ray casts as observations.
+- `Match3SensorComponent` - Uses the board of a [Match-3 game](Integrations-Match3.md) as observations.
+- `GridSensorComponent` - Uses a set of box queries in a grid shape as observations.
+
+**NOTE**: you do not need to adjust the Space Size in the Agent's `Behavior Parameters` when using `SensorComponent`s.
+
+Internally, both `Agent.CollectObservations` and the `[Observable]` attribute use `ISensor`s to write observations, although this is mostly abstracted from the user.
+
+### Vector Observations
+Both `Agent.CollectObservations()` and `ObservableAttribute`s produce vector observations, which are represented as lists of `float`s. `ISensor`s can produce both vector observations and visual observations, which are multi-dimensional arrays of floats.
+
+Below are some additional considerations when dealing with vector observations:
+
+#### One-hot encoding categorical information
+
+Type enumerations should be encoded in the _one-hot_ style. That is, add an element to the feature vector for each element of the enumeration, setting the element representing the observed member to one and setting the rest to zero. For example, if your enumeration contains \[Sword, Shield, Bow\] and the agent observes that the current item is a Bow, you would add the elements: 0, 0, 1 to the feature vector. The following code example illustrates how to add this manually.
+
+```csharp
+enum ItemType { Sword, Shield, Bow, LastItem }
+public override void CollectObservations(VectorSensor sensor)
+{
+    for (int ci = 0; ci < (int)ItemType.LastItem; ci++)
+    {
+        sensor.AddObservation((int)currentItem == ci ? 1.0f : 0.0f);
+    }
+}
+```
+
+`VectorSensor` also provides a two-argument function `AddOneHotObservation()` as a shortcut for _one-hot_ style observations. The following example is identical to the previous one.
+
+```csharp
+enum ItemType { Sword, Shield, Bow, LastItem }
+const int NUM_ITEM_TYPES = (int)ItemType.LastItem;  // Sword, Shield, Bow -> 3 item types
+
+public override void CollectObservations(VectorSensor sensor)
+{
+    // The first argument is the selection index; the second is the
+    // number of possibilities
+    sensor.AddOneHotObservation((int)currentItem, NUM_ITEM_TYPES);
+}
+```
+
+`ObservableAttribute` has built-in support for enums. Note that you don't need the `LastItem` placeholder in this case:
+```csharp
+enum ItemType { Sword, Shield, Bow }
+
+public class HeroAgent : Agent
+{
+    [Observable]
+    ItemType m_CurrentItem;
+}
+```
+
+#### Normalization
+
+For the best results when training, you should normalize the components of your feature vector to the range [-1, +1] or [0, 1]. When you normalize the values, the PPO neural network can often converge to a solution faster. Note that it isn't always necessary to normalize to these recommended ranges, but it is considered a best practice when using neural networks. The greater the variation in ranges between the components of your observation, the more likely that training will be affected.
+
+To normalize a value to [0, 1], you can use the following formula:
+
+```csharp
+normalizedValue = (currentValue - minValue)/(maxValue - minValue)
+```
+
+:warning: For vectors, you should apply the above formula to each component (x, y, and z). Note that this is _not_ the same as using the `Vector3.normalized` property or `Vector3.Normalize()` method in Unity (and similar for `Vector2`).
+
+Rotations and angles should also be normalized. For angles between 0 and 360 degrees, you can use the following formulas:
+
+```csharp
+Quaternion rotation = transform.rotation;
+Vector3 normalizedMinusOneToOne = rotation.eulerAngles / 180.0f - Vector3.one;  // [-1,1]
+Vector3 normalizedZeroToOne = rotation.eulerAngles / 360.0f;                    // [0,1]
+```
+
+For angles that can be outside the range [0,360], you can either reduce the angle, or, if the number of turns is significant, increase the maximum value used in your normalization formula.
+
+#### Stacking
+Stacking refers to repeating observations from previous steps as part of a larger observation. For example, consider an Agent that generates these observations in four steps
+```
+step 1: [0.1]
+step 2: [0.2]
+step 3: [0.3]
+step 4: [0.4]
+```
+
+If we use a stack size of 3, the observations would instead be:
+```
+step 1: [0.1, 0.0, 0.0]
+step 2: [0.2, 0.1, 0.0]
+step 3: [0.3, 0.2, 0.1]
+step 4: [0.4, 0.3, 0.2]
+```
+(The observations are padded with zeroes for the first `stackSize-1` steps). This is a simple way to give an Agent limited "memory" without the complexity of adding a recurrent neural network (RNN).
+
+The steps for enabling stacking depend on how you generate observations:
+* For Agent.CollectObservations(), set "Stacked Vectors" on the Agent's `Behavior Parameters` to a value greater than 1.
+* For ObservableAttribute, set the `numStackedObservations` parameter in the constructor, e.g. `[Observable(numStackedObservations: 2)]`.
+* For `ISensor`s, wrap them in a `StackingSensor` (which is also an `ISensor`). Generally, this should happen in the `CreateSensor()` method of your `SensorComponent`.
+
+#### Vector Observation Summary & Best Practices
+
+- Vector Observations should include all variables relevant for allowing the agent to take the optimally informed decision, and ideally no extraneous information.
+- In cases where Vector Observations need to be remembered or compared over time, either an RNN should be used in the model, or the `Stacked Vectors` value in the agent GameObject's `Behavior Parameters` should be changed.
+- Categorical variables such as type of object (Sword, Shield, Bow) should be encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`). This can be done automatically using the `AddOneHotObservation()` method of the `VectorSensor`, or using `[Observable]` on an enum field or property of the Agent.
+- In general, all inputs should be normalized to be in the range 0 to +1 (or -1 to 1). For example, the `x` position information of an agent where the maximum possible value is `maxValue` should be recorded as `VectorSensor.AddObservation(transform.position.x / maxValue);` rather than `VectorSensor.AddObservation(transform.position.x);`.
+- Positional information of relevant GameObjects should be encoded in relative coordinates wherever possible. This is often relative to the agent position.
+
+### Visual Observations
+
+Visual observations are generally provided to the agent via either a `CameraSensor` or `RenderTextureSensor`. These collect image information and transform it into a 3D tensor which can be fed into the convolutional neural network (CNN) of the agent policy. For more information on CNNs, see [this guide](http://cs231n.github.io/convolutional-networks/). This allows agents to learn from spatial regularities in the observation images. It is possible to use visual and vector observations with the same agent.
+
+Agents using visual observations can capture state of arbitrary complexity and are useful when the state is difficult to describe numerically. However, they are also typically less efficient and slower to train, and sometimes don't succeed at all as compared to vector observations. As such, they should only be used when it is not possible to properly define the problem using vector or ray-cast observations.
+
+Visual observations can be derived from Cameras or RenderTextures within your scene. To add a visual observation to an Agent, add either a Camera Sensor Component or a Render Texture Sensor Component to the Agent. Then drag the camera or render texture you want to add to the `Camera` or `RenderTexture` field. You can have more than one camera or render texture and even use a combination of both attached to an Agent. For each visual observation, set the width and height of the image (in pixels) and whether the observation is color or grayscale.
+
+
+
+or
+
+
+
+Each Agent that uses the same Policy must have the same number of visual observations, and they must all have the same resolutions (including whether or not they are grayscale). Additionally, each Sensor Component on an Agent must have a unique name so that they can be sorted deterministically (the name must be unique for that Agent, but multiple Agents can have a Sensor Component with the same name).
+
+Visual observations also support stacking, by specifying `Observation Stacks` to a value greater than 1. The visual observations from the last `stackSize` steps will be stacked on the last dimension (channel dimension).
+
+When using `RenderTexture` visual observations, a handy feature for debugging is adding a `Canvas`, then adding a `Raw Image` with its texture set to the Agent's `RenderTexture`. This will render the agent observation on the game screen.
+
+
+
+The [GridWorld environment](Learning-Environment-Examples.md#gridworld) is an example on how to use a RenderTexture for both debugging and observation. Note that in this example, a Camera is rendered to a RenderTexture, which is then used for observations and debugging. To update the RenderTexture, the Camera must be asked to render every time a decision is requested within the game code. When using Cameras as observations directly, this is done automatically by the Agent.
+
+
+
+#### Visual Observation Summary & Best Practices
+
+- To collect visual observations, attach `CameraSensor` or `RenderTextureSensor` components to the agent GameObject.
+- Visual observations should generally only be used when vector observations are not sufficient.
+- Image size should be kept as small as possible, without the loss of needed details for decision making.
+- Images should be made grayscale in situations where color information is not needed for making informed decisions.
+
+### Raycast Observations
+
+Raycasts are another possible method for providing observations to an agent. This can be easily implemented by adding a `RayPerceptionSensorComponent3D` (or `RayPerceptionSensorComponent2D`) to the Agent GameObject.
+
+During observations, several rays (or spheres, depending on settings) are cast into the physics world, and the objects that are hit determine the observation vector that is produced.
+
+
+
+Both sensor components have several settings:
+
+- _Detectable Tags_ A list of strings corresponding to the types of objects that the Agent should be able to distinguish between. For example, in the WallJump example, we use "wall", "goal", and "block" as the list of objects to detect.
+- _Rays Per Direction_ Determines the number of rays that are cast. One ray is always cast forward, and this many rays are cast to the left and right.
+- _Max Ray Degrees_ The angle (in degrees) for the outermost rays. 90 degrees corresponds to the left and right of the agent.
+- _Sphere Cast Radius_ The size of the sphere used for sphere casting. If set to 0, rays will be used instead of spheres. Rays may be more efficient, especially in complex scenes.
+- _Ray Length_ The length of the casts
+- _Ray Layer Mask_ The [LayerMask](https://docs.unity3d.com/ScriptReference/LayerMask.html) passed to the raycast or spherecast. This can be used to ignore certain types of objects when casting.
+- _Observation Stacks_ The number of previous results to "stack" with the cast results. Note that this can be independent of the "Stacked Vectors" setting in `Behavior Parameters`.
+- _Start Vertical Offset_ (3D only) The vertical offset of the ray start point.
+- _End Vertical Offset_ (3D only) The vertical offset of the ray end point.
+- _Alternating Ray Order_ Alternating is the default; it gives an order of (0, -delta, delta, -2*delta, 2*delta, ..., -n*delta, n*delta). If alternating is disabled, the order is left to right (-n*delta, -(n-1)*delta, ..., -delta, 0, delta, ..., (n-1)*delta, n*delta). For general usage there is no difference, but if you are using custom models, the left-to-right layout may be preferable because it matches the spatial structure of the rays (e.g. for processing with convolutional networks).
+- _Use Batched Raycasts_ (3D only) Whether to use batched raycasts. Enable to use batched raycasts and the jobs system.
+
+In the example image above, the Agent has two `RayPerceptionSensorComponent3D`s. Both use 3 Rays Per Direction and 90 Max Ray Degrees. One of the components has a vertical offset, so the Agent can tell whether it's clear to jump over the wall.
+
+The total size of the created observations is
+
+```
+(Observation Stacks) * (1 + 2 * Rays Per Direction) * (Num Detectable Tags + 2)
+```
+
+so the number of rays and tags should be kept as small as possible to reduce the amount of data used. Note that this is separate from the State Size defined in `Behavior Parameters`, so you don't need to worry about the formula above when setting the State Size.
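+
+As a purely illustrative worked example (the numbers are not from any particular example environment), a sensor with 1 observation stack, 3 rays per direction, and 3 detectable tags would produce:
+
+```
+1 * (1 + 2 * 3) * (3 + 2) = 35 observations
+```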
+
+#### RayCast Observation Summary & Best Practices
+
+- Attach `RayPerceptionSensorComponent3D` or `RayPerceptionSensorComponent2D` to use.
+- This observation type is best used when there is relevant spatial information for the agent that doesn't require a fully rendered image to convey.
+- Use as few rays and tags as necessary to solve the problem in order to improve learning stability and agent performance.
+- If you run into performance issues, try using batched raycasts by enabling the _Use Batched Raycasts_ setting. (Only available for 3D ray perception sensors.)
+
+### Grid Observations
+Grid-based observations combine the advantages of the 2D spatial representation of visual observations with the flexibility of defining detectable objects offered by raycast observations. The sensor uses a set of box queries in a grid shape and provides a top-down 2D view around the agent. This can be implemented by adding a `GridSensorComponent` to the Agent GameObject.
+
+During observations, the sensor detects the presence of detectable objects in each cell and encodes that into a one-hot representation. The information collected from each cell forms a 3D tensor observation that is fed into the convolutional neural network (CNN) of the agent policy, just like visual observations.
+
+
+
+The sensor component has the following settings:
+- _Cell Scale_ The scale of each cell in the grid.
+- _Grid Size_ Number of cells on each side of the grid.
+- _Agent Game Object_ The Agent that holds the grid sensor. This is used to disambiguate objects with the same tag as the agent so that the agent doesn't detect itself.
+- _Rotate With Agent_ Whether the grid rotates with the Agent.
+- _Detectable Tags_ A list of strings corresponding to the types of objects that the Agent should be able to distinguish between.
+- _Collider Mask_ The [LayerMask](https://docs.unity3d.com/ScriptReference/LayerMask.html) passed to the collider detection. This can be used to ignore certain types of objects.
+- _Initial Collider Buffer Size_ The initial size of the Collider buffer used in the non-allocating Physics calls for each cell.
+- _Max Collider Buffer Size_ The max size of the Collider buffer used in the non-allocating Physics calls for each cell.
+
+The observation for each grid cell is a one-hot encoding of the detected object. The total size of the created observations is
+
+```
+GridSize.x * GridSize.z * Num Detectable Tags
+```
+
+so the number of detectable tags and size of the grid should be kept as small as possible to reduce the amount of data used. This makes a trade-off between the granularity of the observation and training speed.
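+
+For example (illustrative numbers only), a 20 x 20 grid with 3 detectable tags would produce:
+
+```
+20 * 20 * 3 = 1200 observations
+```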
+
+To allow a greater variety of observations to be captured by the grid sensor, the `GridSensorComponent` and the underlying `GridSensorBase` also provide interfaces that can be overridden to collect customized observations from detected objects. See the Unity package documentation for more details on custom grid sensors.
+
+__Note__: The `GridSensor` only works in 3D environments and will not behave properly in 2D environments.
+
+#### Grid Observation Summary & Best Practices
+
+- Attach `GridSensorComponent` to use.
+- This observation type is best used when there is relevant non-visual spatial information that can be best captured in 2D representations.
+- Use as small a grid size and as few tags as necessary to solve the problem in order to improve learning stability and agent performance.
+- Do not use `GridSensor` in a 2D game.
+
+### Variable Length Observations
+
+It is possible for agents to collect observations from a varying number of GameObjects by using a `BufferSensor`. You can add a `BufferSensor` to your Agent by adding a `BufferSensorComponent` to its GameObject. The `BufferSensor` is useful in situations in which the Agent must pay attention to a varying number of entities (for example, a varying number of enemies or projectiles). On the trainer side, the `BufferSensor` is processed using an attention module. More information about attention mechanisms can be found [here](https://arxiv.org/abs/1706.03762). Training or doing inference with variable length observations can be slower than using a flat vector observation. However, attention mechanisms enable solving problems that require comparative reasoning between entities in a scene, such as our [Sorter environment](Learning-Environment-Examples.md#sorter). Note that even though the `BufferSensor` can process a variable number of entities, you still need to define a maximum number of entities, because the network architecture requires knowing the shape of the observations in advance. If fewer entities are observed than the maximum, the observation will be padded with zeros and the trainer will ignore the padded observations. Note that attention layers are invariant to the order of the entities, so there is no need to "order" the entities before feeding them into the `BufferSensor`.
+
+The `BufferSensorComponent` Editor inspector has two arguments:
+
+ - `Observation Size` : This is how many floats each entity will be represented with. This number is fixed and all entities must have the same representation. For example, if the relevant information for each entity is its position and speed, then the `Observation Size` should be 6 floats.
+ - `Maximum Number of Entities` : This is the maximum number of entities the `BufferSensor` will be able to collect.
+
+To add an entity's observations to a `BufferSensorComponent`, call `BufferSensorComponent.AppendObservation()` in the `Agent.CollectObservations()` method with a float array of size `Observation Size` as the argument.
+
+__Note__: Currently, the observations put into the `BufferSensor` are not normalized; you will need to normalize your observations manually between -1 and 1.
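+
+To make the call pattern concrete, below is a minimal, illustrative sketch (not taken from the toolkit samples). It assumes a `BufferSensorComponent` configured with an `Observation Size` of 4 and a hypothetical list of tracked enemies; the field names and the normalization constant are placeholders.
+
+```csharp
+using UnityEngine;
+using Unity.MLAgents;
+using Unity.MLAgents.Sensors;
+
+// Illustrative sketch only: an Agent that reports a variable number of enemies
+// through a BufferSensorComponent configured with Observation Size = 4.
+public class EnemyObserverAgent : Agent
+{
+    public BufferSensorComponent m_BufferSensor;   // assigned in the Inspector
+    public Transform[] m_Enemies;                  // hypothetical list of tracked enemies
+    const float k_MaxDistance = 20f;               // hypothetical normalization constant
+
+    public override void CollectObservations(VectorSensor sensor)
+    {
+        foreach (var enemy in m_Enemies)
+        {
+            var toEnemy = enemy.position - transform.position;
+            // Each call appends exactly `Observation Size` floats (4 here),
+            // normalized manually to roughly [-1, 1].
+            m_BufferSensor.AppendObservation(new float[]
+            {
+                Mathf.Clamp(toEnemy.x / k_MaxDistance, -1f, 1f),
+                Mathf.Clamp(toEnemy.z / k_MaxDistance, -1f, 1f),
+                Mathf.Clamp(toEnemy.magnitude / k_MaxDistance, 0f, 1f),
+                1f // e.g. a flag observation marking the enemy as active
+            });
+        }
+    }
+}
+```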
+
+#### Variable Length Observation Summary & Best Practices
+ - Attach `BufferSensorComponent` to use.
+ - Call `BufferSensorComponent.AppendObservation()` in the `Agent.CollectObservations()` method to add the observations of an entity to the `BufferSensor`.
+ - Normalize the entity observations before feeding them into the `BufferSensor`.
+
+### Goal Signal
+
+It is possible for agents to collect observations that will be treated as a "goal signal". A goal signal is used to condition the policy of the agent, meaning that if the goal changes, the policy (i.e. the mapping from observations to actions) will change as well. Note that this is true for any observation since all observations influence the policy of the Agent to some degree. But by specifying a goal signal explicitly, we can make this conditioning more important to the agent. This feature can be used in settings where an agent must learn to solve different tasks that are similar in some aspects, because the agent will learn to reuse what it has learned across tasks and generalize better. In Unity, you can specify that a `VectorSensor` or a `CameraSensor` is a goal by attaching a `VectorSensorComponent` or a `CameraSensorComponent` to the Agent and selecting `Goal Signal` as `Observation Type`. On the trainer side, there are two different ways to condition the policy. This setting is determined by the [goal_conditioning_type parameter](Training-Configuration-File.md#common-trainer-configurations). If set to `hyper` (default), a [HyperNetwork](https://arxiv.org/pdf/1609.09106.pdf) will be used to generate some of the weights of the policy using the goal observations as input. Note that using a HyperNetwork requires a lot of computation; it is recommended to use a smaller number of hidden units in the policy to alleviate this. If set to `none`, the goal signal will be treated as regular observations. For an example of how to use a goal signal, see the [GridWorld example](Learning-Environment-Examples.md#gridworld).
+
+#### Goal Signal Summary & Best Practices
+ - Attach a `VectorSensorComponent` or `CameraSensorComponent` to an agent and set its `Observation Type` to `Goal Signal` to use the feature.
+ - Set the `goal_conditioning_type` parameter in the training configuration.
+ - Reduce the number of hidden units in the network when using the HyperNetwork conditioning type.
+
+## Actions and Actuators
+
+An action is an instruction from the Policy that the agent carries out. The action is passed to an `IActionReceiver` (either an `Agent` or an `IActuator`) as the `ActionBuffers` parameter when the Academy invokes the `IActionReceiver.OnActionReceived()` function. There are two types of actions supported: **Continuous** and **Discrete**.
+
+Neither the Policy nor the training algorithm knows anything about what the action values themselves mean. The training algorithm simply tries different values for the action list and observes the effect on the accumulated rewards over time and many training episodes. Thus, the only place actions are defined for an Agent is in the `OnActionReceived()` function.
+
+For example, if you designed an agent to move in two dimensions, you could use either continuous or discrete actions. In the continuous case, you would set the action size to two (one for each dimension), and the agent's Policy would output an action with two floating point values. In the discrete case, you would use one Branch with a size of four (one for each direction), and the Policy would create an action array containing a single element with a value ranging from zero to three. Alternatively, you could create two branches of size two (one for horizontal movement and one for vertical movement), and the Policy would output an action array containing two elements with values ranging from zero to one. You could also use a combination of continuous and discrete actions, e.g., one continuous action for horizontal movement and a discrete branch of size two for vertical movement.
+
+Note that when you are programming actions for an agent, it is often helpful to test your action logic using the `Heuristic()` method of the Agent, which lets you map keyboard commands to actions.
+
+### Continuous Actions
+
+When an Agent's Policy has **Continuous** actions, the `ActionBuffers.ContinuousActions` passed to the Agent's `OnActionReceived()` function is an array with length equal to the `Continuous Action Size` property value. The individual values in the array have whatever meanings that you ascribe to them. If you assign an element in the array as the speed of an Agent, for example, the training process learns to control the speed of the Agent through this parameter.
+
+The [3DBall example](Learning-Environment-Examples.md#3dball-3d-balance-ball) uses continuous actions with two control values.
+
+
+
+These control values are applied as rotation to the cube:
+
+```csharp
+ public override void OnActionReceived(ActionBuffers actionBuffers)
+ {
+ var actionZ = 2f * Mathf.Clamp(actionBuffers.ContinuousActions[0], -1f, 1f);
+ var actionX = 2f * Mathf.Clamp(actionBuffers.ContinuousActions[1], -1f, 1f);
+
+ gameObject.transform.Rotate(new Vector3(0, 0, 1), actionZ);
+ gameObject.transform.Rotate(new Vector3(1, 0, 0), actionX);
+ }
+```
+
+By default, the output from our provided PPO algorithm pre-clamps the values of `ActionBuffers.ContinuousActions` into the [-1, 1] range. It is a best practice to manually clip these as well if you plan to use a third-party algorithm with your environment. As shown above, you can scale the control values as needed after clamping them.
+
+### Discrete Actions
+
+When an Agent's Policy uses **discrete** actions, the `ActionBuffers.DiscreteActions` passed to the Agent's `OnActionReceived()` function is an array of integers with one element per discrete branch. When defining the discrete actions, `Branches` is an array of integers where each value corresponds to the number of possibilities for that branch.
+
+For example, if we wanted an Agent that can move in a plane and jump, we could define two branches (one for motion and one for jumping) because we want our agent to be able to move **and** jump concurrently. We define the first branch to have 5 possible actions (don't move, go left, go right, go backward, go forward) and the second one to have 2 possible actions (don't jump, jump). The `OnActionReceived()` method would look something like:
+
+```csharp
+// Movement directions default to zero (no movement)
+float directionX = 0;
+float directionY = 0;
+float directionZ = 0;
+
+// Get the action index for movement
+int movement = actionBuffers.DiscreteActions[0];
+// Get the action index for jumping
+int jump = actionBuffers.DiscreteActions[1];
+
+// Look up the index in the movement action list:
+if (movement == 1) { directionX = -1; }
+if (movement == 2) { directionX = 1; }
+if (movement == 3) { directionZ = -1; }
+if (movement == 4) { directionZ = 1; }
+// Look up the index in the jump action list:
+if (jump == 1 && IsGrounded()) { directionY = 1; }
+
+// Apply the action results to move the Agent
+gameObject.GetComponent<Rigidbody>().AddForce(
+    new Vector3(
+        directionX * 40f, directionY * 300f, directionZ * 40f));
+```
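+
+When testing this kind of action logic with the Agent's `Heuristic()` method (mentioned earlier), a sketch like the following could map keyboard input to the two branches above. This is illustrative only; the key bindings are arbitrary and it assumes the legacy Input Manager is in use.
+
+```csharp
+// Illustrative only: map keyboard input to the two discrete branches described
+// above (branch 0 = movement with 5 actions, branch 1 = jump with 2 actions).
+public override void Heuristic(in ActionBuffers actionsOut)
+{
+    var discreteActions = actionsOut.DiscreteActions;
+    discreteActions[0] = 0;                                      // don't move
+    if (Input.GetKey(KeyCode.LeftArrow)) { discreteActions[0] = 1; }
+    if (Input.GetKey(KeyCode.RightArrow)) { discreteActions[0] = 2; }
+    if (Input.GetKey(KeyCode.DownArrow)) { discreteActions[0] = 3; }
+    if (Input.GetKey(KeyCode.UpArrow)) { discreteActions[0] = 4; }
+    discreteActions[1] = Input.GetKey(KeyCode.Space) ? 1 : 0;    // jump
+}
+```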
+
+#### Masking Discrete Actions
+
+When using Discrete Actions, it is possible to specify that some actions are impossible for the next decision. When the Agent is controlled by a neural network, the Agent will be unable to perform the specified action. Note that when the Agent is controlled by its Heuristic, the Agent will still be able to decide to perform the masked action. In order to disallow an action, override the `Agent.WriteDiscreteActionMask()` virtual method, and call `SetActionEnabled()` on the provided `IDiscreteActionMask`:
+
+```csharp
+public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
+{
+ actionMask.SetActionEnabled(branch, actionIndex, isEnabled);
+}
+```
+
+Where:
+
+- `branch` is the index (starting at 0) of the branch on which you want to allow or disallow the action.
+- `actionIndex` is the index of the action that you want to allow or disallow.
+- `isEnabled` is a bool indicating whether the action should be allowed or not.
+
+For example, suppose you have an Agent with 2 branches, and the first branch (branch 0) has 4 possible actions: _"do nothing"_, _"jump"_, _"shoot"_ and _"change weapon"_. With the code below, the Agent will either _"do nothing"_ or _"change weapon"_ for its next decision (since action indices 1 and 2 are masked):
+
+```csharp
+actionMask.SetActionEnabled(0, 1, false);
+actionMask.SetActionEnabled(0, 2, false);
+```
+
+Notes:
+
+- You can call `SetActionEnabled` multiple times if you want to put masks on multiple branches.
+- At each step, the state of an action is reset and enabled by default.
+- You cannot mask all the actions of a branch.
+- You cannot mask actions in continuous control.
+
+
+### IActuator interface and ActuatorComponents
+The Actuator API allows users to abstract behavior out of Agents and into components (similar to the ISensor API). The `IActuator` interface and the `Agent` class both implement the `IActionReceiver` interface, which allows for backward compatibility with the current `Agent.OnActionReceived`. This means you will not have to change your code until you decide to use the `IActuator` API.
+
+Like the `ISensor` interface, the `IActuator` interface is intended for advanced users.
+
+The `ActuatorComponent` abstract class is used to create the actual `IActuator` at runtime. It must be attached to the same `GameObject` as the `Agent`, or to a child `GameObject`. Actuators and all of their data structures are initialized during `Agent.Initialize`. This is done to prevent unexpected allocations at runtime.
+
+You can find an example of an `IActuator` implementation in the `Basic` example scene. **NOTE**: you do not need to adjust the Actions in the Agent's `Behavior Parameters` when using an `IActuator` and `ActuatorComponents`.
+
+Internally, `Agent.OnActionReceived` uses an `IActuator` to send actions to the Agent, although this is mostly abstracted from the user.
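+
+As a rough, hedged sketch of what a custom actuator can look like (this is not the `Basic` sample's implementation; the class names and the light it controls are made up), an `ActuatorComponent` might create a single-branch discrete `IActuator` that toggles a light:
+
+```csharp
+using UnityEngine;
+using Unity.MLAgents.Actuators;
+
+// Illustrative sketch only (not the Basic sample): a component that creates a
+// single IActuator with one discrete branch of size 2 (0 = off, 1 = on).
+public class LightSwitchActuatorComponent : ActuatorComponent
+{
+    public Light m_Light;   // hypothetical light controlled by the actuator
+
+    public override ActionSpec ActionSpec
+    {
+        get { return ActionSpec.MakeDiscrete(2); }
+    }
+
+    public override IActuator[] CreateActuators()
+    {
+        return new IActuator[] { new LightSwitchActuator(m_Light) };
+    }
+}
+
+public class LightSwitchActuator : IActuator
+{
+    readonly Light m_Light;
+    public LightSwitchActuator(Light light) { m_Light = light; }
+
+    public ActionSpec ActionSpec
+    {
+        get { return ActionSpec.MakeDiscrete(2); }
+    }
+
+    public string Name
+    {
+        get { return "LightSwitchActuator"; }
+    }
+
+    public void OnActionReceived(ActionBuffers actionBuffers)
+    {
+        // Branch 0, action 1 switches the light on; action 0 switches it off.
+        m_Light.enabled = actionBuffers.DiscreteActions[0] == 1;
+    }
+
+    public void Heuristic(in ActionBuffers actionBuffersOut)
+    {
+        // Illustrative key binding using the legacy Input Manager.
+        actionBuffersOut.DiscreteActions[0] = Input.GetKey(KeyCode.L) ? 1 : 0;
+    }
+
+    public void WriteDiscreteActionMask(IDiscreteActionMask actionMask) { }
+    public void ResetData() { }
+}
+```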
+
+
+### Actions Summary & Best Practices
+
+- Agents can use `Discrete` and/or `Continuous` actions.
+- Discrete actions can have multiple action branches, and it's possible to mask certain actions so that they won't be taken.
+- In general, fewer actions will make for easier learning.
+- Be sure to set the Continuous Action Size and Discrete Branch Size to the desired number for each type of action, and no greater, since unused actions can interfere with the efficiency of the training process.
+- Continuous action values should be clipped to an appropriate range. The provided PPO model automatically clips these values between -1 and 1, but third party training systems may not do so.
+
+## Rewards
+
+In reinforcement learning, the reward is a signal that the agent has done something right. The PPO reinforcement learning algorithm works by optimizing the choices an agent makes such that the agent earns the highest cumulative reward over time. The better your reward mechanism, the better your agent will learn.
+
+**Note:** Rewards are not used during inference by an Agent using a trained model, and are also not used during imitation learning.
+
+Perhaps the best advice is to start simple and only add complexity as needed. In general, you should reward results rather than actions you think will lead to the desired results. You can even use the Agent's Heuristic to control the Agent while watching how it accumulates rewards.
+
+Allocate rewards to an Agent by calling the `AddReward()` or `SetReward()` methods on the agent. The reward assigned between each decision should be in the range [-1,1]. Values outside this range can lead to unstable training. The `reward` value is reset to zero when the agent receives a new decision. If there are multiple calls to `AddReward()` for a single agent decision, the rewards will be summed together to evaluate how good the previous decision was. The `SetReward()` will override all previous rewards given to an agent since the previous decision.
+
+### Examples
+
+You can examine the `OnActionReceived()` functions defined in the [example environments](Learning-Environment-Examples.md) to see how those projects allocate rewards.
+
+The `GridAgent` class in the [GridWorld example](Learning-Environment-Examples.md#gridworld) uses a very simple reward system:
+
+```csharp
+Collider[] hitObjects = Physics.OverlapBox(trueAgent.transform.position,
+ new Vector3(0.3f, 0.3f, 0.3f));
+if (hitObjects.Where(col => col.gameObject.tag == "goal").ToArray().Length == 1)
+{
+ AddReward(1.0f);
+ EndEpisode();
+}
+else if (hitObjects.Where(col => col.gameObject.tag == "pit").ToArray().Length == 1)
+{
+ AddReward(-1f);
+ EndEpisode();
+}
+```
+
+The agent receives a positive reward when it reaches the goal and a negative reward when it falls into the pit. Otherwise, it gets no rewards. This is an example of a _sparse_ reward system. The agent must explore a lot to find the infrequent reward.
+
+In contrast, the `AreaAgent` in the [Area example](Learning-Environment-Examples.md#push-block) gets a small negative reward every step. In order to get the maximum reward, the agent must finish its task of reaching the goal square as quickly as possible:
+
+```csharp
+AddReward( -0.005f);
+MoveAgent(act);
+
+if (gameObject.transform.position.y < 0.0f ||
+ Mathf.Abs(gameObject.transform.position.x - area.transform.position.x) > 8f ||
+ Mathf.Abs(gameObject.transform.position.z + 5 - area.transform.position.z) > 8)
+{
+ AddReward(-1f);
+ EndEpisode();
+}
+```
+
+The agent also gets a larger negative penalty if it falls off the playing surface.
+
+The `Ball3DAgent` in the [3DBall example](Learning-Environment-Examples.md#3dball-3d-balance-ball) takes a similar approach, but allocates a small positive reward as long as the agent balances the ball. The agent can maximize its rewards by keeping the ball on the platform:
+
+```csharp
+
+SetReward(0.1f);
+
+// When ball falls mark Agent as finished and give a negative penalty
+if ((ball.transform.position.y - gameObject.transform.position.y) < -2f ||
+ Mathf.Abs(ball.transform.position.x - gameObject.transform.position.x) > 3f ||
+ Mathf.Abs(ball.transform.position.z - gameObject.transform.position.z) > 3f)
+{
+ SetReward(-1f);
+ EndEpisode();
+
+}
+```
+
+The `Ball3DAgent` also assigns a negative penalty when the ball falls off the platform.
+
+Note that all of these environments make use of the `EndEpisode()` method, which manually terminates an episode when a termination condition is reached. This can be called independently of the `Max Step` property.
+
+### Rewards Summary & Best Practices
+
+- Use `AddReward()` to accumulate rewards between decisions. Use `SetReward()` to overwrite any previous rewards accumulated between decisions.
+- The magnitude of any given reward should typically not be greater than 1.0 in order to ensure a more stable learning process.
+- Positive rewards are often more helpful to shaping the desired behavior of an agent than negative rewards. Excessive negative rewards can result in the agent failing to learn any meaningful behavior.
+- For locomotion tasks, a small positive reward (+0.1) for forward velocity is typically used.
+- If you want the agent to finish a task quickly, it is often helpful to provide a small penalty every step (-0.05) that the agent does not complete the task. In this case completion of the task should also coincide with the end of the episode by calling `EndEpisode()` on the agent when it has accomplished its goal.
+
+## Agent Properties
+
+
+
+- `Behavior Parameters` - The parameters dictating what Policy the Agent will receive.
+ - `Behavior Name` - The identifier for the behavior. Agents with the same behavior name will learn the same policy.
+ - `Vector Observation`
+ - `Space Size` - Length of vector observation for the Agent.
+ - `Stacked Vectors` - The number of previous vector observations that will be stacked and used collectively for decision making. This results in the effective size of the vector observation being passed to the Policy being: _Space Size_ x _Stacked Vectors_.
+ - `Actions`
+ - `Continuous Actions` - The number of concurrent continuous actions that the Agent can take.
+ - `Discrete Branches` - An array of integers, defines multiple concurrent discrete actions. The values in the `Discrete Branches` array correspond to the number of possible discrete values for each action branch.
+ - `Model` - The neural network model used for inference (obtained after training)
+ - `Inference Device` - Whether to use CPU or GPU to run the model during inference
+ - `Behavior Type` - Determines whether the Agent will do training, inference, or use its Heuristic() method:
+ - `Default` - the Agent will train if it connects to a Python trainer; otherwise it will perform inference.
+ - `Heuristic Only` - the Agent will always use the `Heuristic()` method.
+ - `Inference Only` - the Agent will always perform inference.
+ - `Team ID` - Used to define the team for self-play
+ - `Use Child Sensors` - Whether to use all Sensor components attached to child GameObjects of this Agent.
+- `Max Step` - The per-agent maximum number of steps. Once this number is reached, the Agent will be reset.
+
+## Destroying an Agent
+
+You can destroy an Agent GameObject during the simulation. Make sure that there is always at least one Agent training at all times by either spawning a new Agent every time one is destroyed or by re-spawning new Agents when the whole environment resets.
+
+## Defining Multi-agent Scenarios
+
+### Teams for Adversarial Scenarios
+
+Self-play is triggered by including the self-play hyperparameter hierarchy in the [trainer configuration](Training-ML-Agents.md#training-configurations). To distinguish opposing agents, set the team ID to different integer values in the behavior parameters script on the agent prefab.
+
+
+
+**_Team ID must be 0 or an integer greater than 0._**
+
+In symmetric games, since all agents (even on opposing teams) will share the same policy, they should have the same 'Behavior Name' in their Behavior Parameters Script. In asymmetric games, they should have a different Behavior Name in their Behavior Parameters script. Note, in asymmetric games, the agents must have both different Behavior Names _and_ different team IDs!
+
+For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis and Soccer environments. Tennis and Soccer provide examples of symmetric games. To train an asymmetric game, specify trainer configurations for each of your behavior names and include the self-play hyperparameter hierarchy in both.
+
+### Groups for Cooperative Scenarios
+
+Cooperative behavior in ML-Agents can be enabled by instantiating a `SimpleMultiAgentGroup`, typically in an environment controller or similar script, and adding agents to it using the `RegisterAgent(Agent agent)` method. Note that all agents added to the same `SimpleMultiAgentGroup` must have the same behavior name and Behavior Parameters. Using `SimpleMultiAgentGroup` enables the agents within a group to learn how to work together to achieve a common goal (i.e., maximize a group-given reward), even if one or more of the group members are removed before the episode ends. You can then use this group to add/set rewards, end or interrupt episodes at a group level using the `AddGroupReward()`, `SetGroupReward()`, `EndGroupEpisode()`, and
+`GroupEpisodeInterrupted()` methods. For example:
+
+```csharp
+// Create a Multi Agent Group in Start() or Initialize()
+m_AgentGroup = new SimpleMultiAgentGroup();
+
+// Register agents in group at the beginning of an episode
+foreach (var agent in AgentList)
+{
+ m_AgentGroup.RegisterAgent(agent);
+}
+
+// if the team scores a goal
+m_AgentGroup.AddGroupReward(rewardForGoal);
+
+// If the goal is reached and the episode is over
+m_AgentGroup.EndGroupEpisode();
+ResetScene();
+
+// If time ran out and we need to interrupt the episode
+m_AgentGroup.GroupEpisodeInterrupted();
+ResetScene();
+```
+
+Multi Agent Groups should be used with the MA-POCA trainer, which is explicitly designed to train cooperative environments. This can be enabled by using the `poca` trainer - see the [training configurations](Training-Configuration-File.md) doc for more information on configuring MA-POCA. When using MA-POCA, agents which are deactivated or removed from the Scene during the episode will still learn to contribute to the group's long term rewards, even if they are not active in the scene to experience them.
+
+See the [Cooperative Push Block](Learning-Environment-Examples.md#cooperative-push-block) environment for an example of how to use Multi Agent Groups, and the [Dungeon Escape](Learning-Environment-Examples.md#dungeon-escape) environment for an example of how the Multi Agent Group can be used with agents that are removed from the scene mid-episode.
+
+**NOTE**: Groups differ from Teams (for competitive settings) in the following way - Agents working together should be added to the same Group, while agents playing against each other should be given different Team Ids. If in the Scene there is one playing field and two teams, there should be two Groups, one for each team, and each team should be assigned a different Team Id. If this playing field is duplicated many times in the Scene (e.g. for training speedup), there should be two Groups _per playing field_, and two unique Team Ids _for the entire Scene_. In environments with both Groups and Team Ids configured, MA-POCA and self-play can be used together for training. In the diagram below, there are two agents on each team, and two playing fields where teams are pitted against each other. All the blue agents should share a Team Id (and the orange ones a different ID), and there should be four group managers, one per pair of agents.
+
+
+
+Please see the [SoccerTwos](Learning-Environment-Examples.md#soccer-twos) environment for an example.
+
+#### Cooperative Behaviors Notes and Best Practices
+* An agent can only be registered to one MultiAgentGroup at a time. If you want to re-assign an agent from one group to another, you have to unregister it from the current group first.
+
+* Agents with different behavior names in the same group are not supported.
+
+* Agents within groups should always set the `Max Step` parameter in the Agent script to 0. Instead, handle Max Step using the MultiAgentGroup by ending the episode for the entire Group using `GroupEpisodeInterrupted()`.
+
+* `EndGroupEpisode` and `GroupEpisodeInterrupted` do the same job in the game, but have slightly different effects on training. If the episode is completed, call `EndGroupEpisode`. If the episode is not over but has been running for enough steps (i.e. it has reached the max step count), call `GroupEpisodeInterrupted`.
+
+* If an agent finishes early, e.g. it completes its task, is removed, or is killed in the game, do not call `EndEpisode()` on the Agent. Instead, disable the agent and re-enable it when the next episode starts, or destroy the agent entirely. This is because calling `EndEpisode()` will call `OnEpisodeBegin()`, which will reset the agent immediately. While it is possible to call `EndEpisode()` in this way, it is usually not the desired behavior when training groups of agents.
+
+* If an agent that was disabled in a scene needs to be re-enabled, it must be re-registered to the MultiAgentGroup.
+
+* Group rewards are meant to encourage agents to act in the group's best interest instead of their individual ones, and are treated differently than individual agent rewards during training. So calling `AddGroupReward()` is not equivalent to calling `Agent.AddReward()` on each agent in the group.
+
+* You can still add incremental rewards to agents using `Agent.AddReward()` if they are in a Group. These rewards will only be given to those agents and are received when the Agent is active.
+
+* Environments which use Multi Agent Groups can be trained using PPO or SAC, but agents will not be able to learn from group rewards after deactivation/removal, nor will they behave as cooperatively.
+
+## Recording Demonstrations
+
+In order to record demonstrations from an agent, add the `Demonstration Recorder` component to a GameObject in the scene which contains an `Agent` component. Once added, it is possible to name the demonstration that will be recorded from the agent.
+
+
+
+When `Record` is checked, a demonstration will be created whenever the scene is played from the Editor. Depending on the complexity of the task, anywhere from a few minutes to a few hours of demonstration data may be necessary to be useful for imitation learning. To specify an exact number of steps to record, use the `Num Steps To Record` field; the editor will end your play session automatically once that many steps have been recorded. If you set `Num Steps To Record` to `0`, recording will continue until you manually end the play session. Once the play session ends, a `.demo` file will be created in the `Assets/Demonstrations` folder (by default). This file contains the demonstrations. Clicking on the file will provide metadata about the demonstration in the Inspector.
+
+
+
+You can then specify the path to this file in your [training configurations](Training-Configuration-File.md#behavioral-cloning).
diff --git a/com.unity.ml-agents/Documentation~/Learning-Environment-Design.md b/com.unity.ml-agents/Documentation~/Learning-Environment-Design.md
new file mode 100644
index 0000000000..42bc6b9bb7
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Learning-Environment-Design.md
@@ -0,0 +1,92 @@
+# Designing a Learning Environment
+
+This page contains general advice on how to design your learning environment, in addition to overviewing aspects of the ML-Agents Unity SDK that pertain to setting up your scene and simulation as opposed to designing your agents within the scene. We have a dedicated page on [Designing Agents](Learning-Environment-Design-Agents.md) which includes how to instrument observations, actions and rewards, define teams for multi-agent scenarios and record agent demonstrations for imitation learning.
+
+## The Simulation and Training Process
+
+Training and simulation proceed in steps orchestrated by the ML-Agents Academy class. The Academy works with Agent objects in the scene to step through the simulation.
+
+During training, the external Python training process communicates with the Academy to run a series of episodes while it collects data and optimizes its neural network model. When training is completed successfully, you can add the trained model file to your Unity project for later use.
+
+The ML-Agents Academy class orchestrates the agent simulation loop as follows:
+
+1. Calls your Academy's `OnEnvironmentReset` delegate.
+2. Calls the `OnEpisodeBegin()` function for each Agent in the scene.
+3. Gathers information about the scene. This is done by calling the `CollectObservations(VectorSensor sensor)` function for each Agent in the scene, as well as updating their sensor and collecting the resulting observations.
+4. Uses each Agent's Policy to decide on the Agent's next action.
+5. Calls the `OnActionReceived()` function for each Agent in the scene, passing in the action chosen by the Agent's Policy.
+6. Calls the Agent's `OnEpisodeBegin()` function if the Agent has reached its `Max Step` count or has otherwise ended its episode by calling `EndEpisode()`.
+
+To create a training environment, extend the Agent class to implement the above methods. Whether you need to implement all of them depends on your specific scenario.
+
+## Organizing the Unity Scene
+
+To train and use the ML-Agents Toolkit in a Unity scene, the scene should contain as many Agent subclasses as you need. Agent instances should be attached to the GameObject representing that Agent.
+
+### Academy
+
+The Academy is a singleton which orchestrates Agents and their decision making processes. Only a single Academy exists at a time.
+
+#### Academy resetting
+
+To alter the environment at the start of each episode, add your method to the Academy's OnEnvironmentReset action.
+
+```csharp
+public class MySceneBehavior : MonoBehaviour
+{
+ public void Awake()
+ {
+ Academy.Instance.OnEnvironmentReset += EnvironmentReset;
+ }
+
+ void EnvironmentReset()
+ {
+ // Reset the scene here
+ }
+}
+```
+
+For example, you might want to reset an Agent to its starting position or move a goal to a random position. An environment resets when the `reset()` method is called on the Python `UnityEnvironment`.
+
+When you reset an environment, consider the factors that should change so that training is generalizable to different conditions. For example, if you were training a maze-solving agent, you would probably want to change the maze itself for each training episode. Otherwise, the agent would probably only learn to solve one particular maze, not mazes in general.
+
+### Multiple Areas
+
+In many of the example environments, many copies of the training area are instantiated in the scene. This generally speeds up training, allowing the environment to gather many experiences in parallel. This can be achieved simply by instantiating many Agents with the same Behavior Name. If possible, consider designing your scene to support multiple areas.
+
+Check out our example environments to see examples of multiple areas. Additionally, the [Making a New Learning Environment](Learning-Environment-Create-New.md#optional-multiple-training-areas-within-the-same-scene) guide demonstrates this option.
+
+## Environments
+
+When you create a training environment in Unity, you must set up the scene so that it can be controlled by the external training process. Considerations include:
+
+- The training scene must start automatically when your Unity application is launched by the training process.
+- The Academy must reset the scene to a valid starting point for each episode of training.
+- A training episode must have a definite end — either using `Max Steps` or by each Agent ending its episode manually with `EndEpisode()`.
+
+## Environment Parameters
+
+Curriculum learning and environment parameter randomization are two training methods that control specific parameters in your environment. As such, it is important to ensure that your environment parameters are updated at each step to the correct values. To enable this, we expose an `EnvironmentParameters` C# class that you can use to retrieve the values of the parameters defined in the training configurations for both of those features. Please see our [documentation](Training-ML-Agents.md#environment-parameters) for curriculum learning and environment parameter randomization for details.
+
+We recommend modifying the environment from the Agent's `OnEpisodeBegin()` function by leveraging `Academy.Instance.EnvironmentParameters`. See the WallJump example environment for a sample usage (specifically, [WallJumpAgent.cs](https://github.com/Unity-Technologies/ml-agents/blob/release_22/Project/Assets/ML-Agents/Examples/WallJump/Scripts/WallJumpAgent.cs) ).
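+
+As a brief illustrative sketch (the parameter name `wall_height`, the class name, and the wall field are hypothetical), reading an environment parameter inside `OnEpisodeBegin()` might look like this:
+
+```csharp
+using UnityEngine;
+using Unity.MLAgents;
+
+// Illustrative only: resize a hypothetical wall based on a parameter defined
+// in the curriculum / parameter randomization section of the training config.
+public class WallAgent : Agent
+{
+    public Transform wall;   // hypothetical wall transform
+
+    public override void OnEpisodeBegin()
+    {
+        // Returns the value set by the trainer, or the default value
+        // if the parameter is not defined in the training configuration.
+        var height = Academy.Instance.EnvironmentParameters.GetWithDefault("wall_height", 1.0f);
+        wall.localScale = new Vector3(wall.localScale.x, height, wall.localScale.z);
+    }
+}
+```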
+
+## Agent
+
+The Agent class represents an actor in the scene that collects observations and carries out actions. The Agent class is typically attached to the GameObject in the scene that otherwise represents the actor — for example, to a player object in a football game or a car object in a vehicle simulation. Every Agent must have appropriate `Behavior Parameters`.
+
+Generally, when creating an Agent, you should extend the Agent class and implement the `CollectObservations(VectorSensor sensor)` and `OnActionReceived()` methods:
+
+- `CollectObservations(VectorSensor sensor)` — Collects the Agent's observation of its environment.
+- `OnActionReceived()` — Carries out the action chosen by the Agent's Policy and assigns a reward to the current state.
+
+Your implementations of these functions determine how the Behavior Parameters assigned to this Agent must be set.
+
+You must also determine how an Agent finishes its task or times out. You can manually terminate an Agent episode in your `OnActionReceived()` function when the Agent has finished (or irrevocably failed) its task by calling the `EndEpisode()` function. You can also set the Agent's `Max Steps` property to a positive value and the Agent will consider the episode over after it has taken that many steps. You can use the `Agent.OnEpisodeBegin()` function to prepare the Agent to start again.
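+
+Putting these pieces together, a minimal Agent skeleton might look like the sketch below. This is illustrative only: what you observe, how you act, when you reward, and when you end the episode depend entirely on your own game, and the `target` field and reward values are placeholders.
+
+```csharp
+using UnityEngine;
+using Unity.MLAgents;
+using Unity.MLAgents.Actuators;
+using Unity.MLAgents.Sensors;
+
+// Illustrative skeleton of an Agent subclass.
+public class MyAgent : Agent
+{
+    public Transform target;   // hypothetical target the agent should reach
+
+    public override void OnEpisodeBegin()
+    {
+        // Reset the agent (and any relevant scene state) for a new episode.
+        transform.localPosition = Vector3.zero;
+    }
+
+    public override void CollectObservations(VectorSensor sensor)
+    {
+        // The Space Size in Behavior Parameters must match: 3 + 3 = 6 floats here.
+        sensor.AddObservation(transform.localPosition);
+        sensor.AddObservation(target.localPosition);
+    }
+
+    public override void OnActionReceived(ActionBuffers actions)
+    {
+        // Two continuous actions interpreted as a movement direction.
+        var move = new Vector3(actions.ContinuousActions[0], 0f, actions.ContinuousActions[1]);
+        transform.localPosition += 0.1f * move;
+
+        if (Vector3.Distance(transform.localPosition, target.localPosition) < 1f)
+        {
+            AddReward(1f);
+            EndEpisode();
+        }
+    }
+}
+```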
+
+See [Agents](Learning-Environment-Design-Agents.md) for detailed information about programming your own Agents.
+
+## Recording Statistics
+
+We offer developers a mechanism to record statistics from within their Unity environments. These statistics are aggregated and generated during the training process. To record statistics, see the `StatsRecorder` C# class.
+
+See the FoodCollector example environment for a sample usage (specifically, [FoodCollectorSettings.cs](https://github.com/Unity-Technologies/ml-agents/blob/release_22/Project/Assets/ML-Agents/Examples/FoodCollector/Scripts/FoodCollectorSettings.cs) ).
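+
+As a small illustrative sketch (the metric name and the aggregation choice are arbitrary), recording a custom statistic can look like this:
+
+```csharp
+using Unity.MLAgents;
+
+// Illustrative only: record a custom metric at whatever cadence you choose.
+// The metric name "MyEnvironment/GoalsReached" is arbitrary.
+public static class MyStats
+{
+    public static void RecordGoalsReached(int goalsReached)
+    {
+        var recorder = Academy.Instance.StatsRecorder;
+        recorder.Add("MyEnvironment/GoalsReached", goalsReached, StatAggregationMethod.Average);
+    }
+}
+```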
diff --git a/com.unity.ml-agents/Documentation~/Learning-Environment-Examples.md b/com.unity.ml-agents/Documentation~/Learning-Environment-Examples.md
new file mode 100644
index 0000000000..364d5d061a
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Learning-Environment-Examples.md
@@ -0,0 +1,383 @@
+# Example Learning Environments
+
+
+
+The Unity ML-Agents Toolkit includes an expanding set of example environments that highlight the various features of the toolkit. These environments can also serve as templates for new environments or as ways to test new ML algorithms. Environments are located in `Project/Assets/ML-Agents/Examples` and summarized below.
+
+For the environments that highlight specific features of the toolkit, we provide the pre-trained model files and the training config file that enables you to train the scene yourself. The environments that are designed to serve as challenges for researchers do not have accompanying pre-trained model files or training configs and are marked as _Optional_ below.
+
+This page only overviews the example environments we provide. To learn more about how to design and build your own environments, see our [Making a New Learning Environment](Learning-Environment-Create-New.md) page. If you would like to contribute environments, please see our [contribution guidelines](CONTRIBUTING.md) page.
+
+## Basic
+
+
+
+- Set-up: A linear movement task where the agent must move left or right to rewarding states.
+- Goal: Move to the most rewarding state.
+- Agents: The environment contains one agent.
+- Agent Reward Function:
+ - -0.01 at each step
+ - +0.1 for arriving at suboptimal state.
+ - +1.0 for arriving at optimal state.
+- Behavior Parameters:
+ - Vector Observation space: One variable corresponding to current state.
+ - Actions: 1 discrete action branch with 3 actions (Move left, do nothing, move right).
+ - Visual Observations: None
+- Float Properties: None
+- Benchmark Mean Reward: 0.93
+
+## 3DBall: 3D Balance Ball
+
+
+
+- Set-up: A balance-ball task, where the agent balances the ball on its head.
+- Goal: The agent must balance the ball on its head for as long as possible.
+- Agents: The environment contains 12 agents of the same kind, all using the same Behavior Parameters.
+- Agent Reward Function:
+ - +0.1 for every step the ball remains on its head.
+ - -1.0 if the ball falls off.
+- Behavior Parameters:
+ - Vector Observation space: 8 variables corresponding to rotation of the agent cube, and position and velocity of ball.
+ - Vector Observation space (Hard Version): 5 variables corresponding to rotation of the agent cube and position of ball.
+ - Actions: 2 continuous actions, with one value corresponding to X-rotation, and the other to Z-rotation.
+ - Visual Observations: Third-person view from the upper-front of the agent. Use
+ `Visual3DBall` scene.
+- Float Properties: Three
+ - scale: Specifies the scale of the ball in the 3 dimensions (equal across the three dimensions)
+ - Default: 1
+ - Recommended Minimum: 0.2
+ - Recommended Maximum: 5
+ - gravity: Magnitude of gravity
+ - Default: 9.81
+ - Recommended Minimum: 4
+ - Recommended Maximum: 105
+ - mass: Specifies mass of the ball
+ - Default: 1
+ - Recommended Minimum: 0.1
+ - Recommended Maximum: 20
+- Benchmark Mean Reward: 100
+
+## GridWorld
+
+
+
+- Set-up: A multi-goal version of the grid-world task. Scene contains agent, goal, and obstacles.
+- Goal: The agent must navigate the grid to the appropriate goal while avoiding the obstacles.
+- Agents: The environment contains nine agents with the same Behavior Parameters.
+- Agent Reward Function:
+ - -0.01 for every step.
+ - +1.0 if the agent navigates to the correct goal (episode ends).
+ - -1.0 if the agent navigates to an incorrect goal (episode ends).
+- Behavior Parameters:
+ - Vector Observation space: None
+ - Actions: 1 discrete action branch with 5 actions, corresponding to movement in cardinal directions or not moving. Note that for this environment, [action masking](Learning-Environment-Design-Agents.md#masking-discrete-actions) is turned on by default (this option can be toggled using the `Mask Actions` checkbox within the `trueAgent` GameObject). The trained model file provided was generated with action masking turned on.
+ - Visual Observations: One corresponding to top-down view of GridWorld.
+ - Goal Signal : A one hot vector corresponding to which color is the correct goal for the Agent
+- Float Properties: Three, corresponding to grid size, number of green goals, and number of red goals.
+- Benchmark Mean Reward: 0.8
+
+## Push Block
+
+
+
+- Set-up: A platforming environment where the agent can push a block around.
+- Goal: The agent must push the block to the goal.
+- Agents: The environment contains one agent.
+- Agent Reward Function:
+ - -0.0025 for every step.
+ - +1.0 if the block touches the goal.
+- Behavior Parameters:
+ - Vector Observation space: (Continuous) 70 variables corresponding to 14 ray-casts each detecting one of three possible objects (wall, goal, or block).
+ - Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise and counterclockwise, move along four different face directions, or do nothing.
+- Float Properties: Four
+ - block_scale: Scale of the block along the x and z dimensions
+ - Default: 2
+ - Recommended Minimum: 0.5
+ - Recommended Maximum: 4
+ - dynamic_friction: Coefficient of friction for the ground material acting on moving objects
+ - Default: 0
+ - Recommended Minimum: 0
+ - Recommended Maximum: 1
+ - static_friction: Coefficient of friction for the ground material acting on stationary objects
+ - Default: 0
+ - Recommended Minimum: 0
+ - Recommended Maximum: 1
+ - block_drag: Effect of air resistance on block
+ - Default: 0.5
+ - Recommended Minimum: 0
+ - Recommended Maximum: 2000
+- Benchmark Mean Reward: 4.5
+
+## Wall Jump
+
+
+
+- Set-up: A platforming environment where the agent can jump over a wall.
+- Goal: The agent must use the block to scale the wall and reach the goal.
+- Agents: The environment contains one agent linked to two different Models. The Policy the agent is linked to changes depending on the height of the wall. The change of Policy is done in the WallJumpAgent class.
+- Agent Reward Function:
+ - -0.0005 for every step.
+ - +1.0 if the agent touches the goal.
+ - -1.0 if the agent falls off the platform.
+- Behavior Parameters:
+ - Vector Observation space: Size of 74, corresponding to 14 ray casts each detecting 4 possible objects, plus the global position of the agent and whether or not the agent is grounded.
+ - Actions: 4 discrete action branches:
+ - Forward Motion (3 possible actions: Forward, Backwards, No Action)
+ - Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)
+ - Side Motion (3 possible actions: Left, Right, No Action)
+ - Jump (2 possible actions: Jump, No Action)
+ - Visual Observations: None
+- Float Properties: Four
+- Benchmark Mean Reward (Big & Small Wall): 0.8
+
+## Crawler
+
+
+
+- Set-up: A creature with 4 arms and 4 forearms.
+- Goal: The agents must move their bodies toward the goal direction without falling.
+- Agents: The environment contains 10 agents with the same Behavior Parameters.
+- Agent Reward Function (independent): The reward function is geometric, meaning the reward each step is a product of all the rewards instead of a sum; this helps the agent try to maximize all rewards instead of only the easiest rewards.
+ - Body velocity matches goal velocity. (normalized between (0,1))
+ - Head direction alignment with goal direction. (normalized between (0,1))
+- Behavior Parameters:
+ - Vector Observation space: 172 variables corresponding to position, rotation, velocity, and angular velocities of each limb plus the acceleration and angular acceleration of the body.
+ - Actions: 20 continuous actions, corresponding to target rotations for joints.
+ - Visual Observations: None
+- Float Properties: None
+- Benchmark Mean Reward: 3000
+
+## Worm
+
+
+
+- Set-up: A worm with a head and 3 body segments.
+- Goal: The agents must move their bodies toward the goal direction.
+- Agents: The environment contains 10 agents with the same Behavior Parameters.
+- Agent Reward Function (independent): The reward function is geometric, meaning the reward each step is a product of all the rewards instead of a sum; this helps the agent try to maximize all rewards instead of only the easiest rewards.
+ - Body velocity matches goal velocity. (normalized between (0,1))
+ - Body direction alignment with goal direction. (normalized between (0,1))
+- Behavior Parameters:
+ - Vector Observation space: 64 variables corresponding to position, rotation, velocity, and angular velocities of each limb plus the acceleration and angular acceleration of the body.
+ - Actions: 9 continuous actions, corresponding to target rotations for joints.
+ - Visual Observations: None
+- Float Properties: None
+- Benchmark Mean Reward: 800
+
+## Food Collector
+
+
+
+- Set-up: A multi-agent environment where agents compete to collect food.
+- Goal: The agents must learn to collect as many green food spheres as possible while avoiding red spheres.
+- Agents: The environment contains 5 agents with the same Behavior Parameters.
+- Agent Reward Function (independent):
+ - +1 for interaction with green spheres
+ - -1 for interaction with red spheres
+- Behavior Parameters:
+ - Vector Observation space: 53 corresponding to velocity of agent (2), whether agent is frozen and/or shot its laser (2), plus grid-based perception of objects around agent's forward direction (40 by 40 with 6 different categories).
+ - Actions:
+ - 3 continuous actions correspond to Forward Motion, Side Motion and Rotation
+ - 1 discrete action branch for Laser with 2 possible actions corresponding to Shoot Laser or No Action
+ - Visual Observations (Optional): First-person camera per-agent, plus one vector flag representing the frozen state of the agent. This scene uses a combination of vector and visual observations and the training will not succeed without the frozen vector flag. Use `VisualFoodCollector` scene.
+- Float Properties: Two
+ - laser_length: Length of the laser used by the agent
+ - Default: 1
+ - Recommended Minimum: 0.2
+ - Recommended Maximum: 7
+ - agent_scale: Specifies the scale of the agent in the 3 dimensions (equal across the three dimensions)
+ - Default: 1
+ - Recommended Minimum: 0.5
+ - Recommended Maximum: 5
+- Benchmark Mean Reward: 10
+
+## Hallway
+
+
+
+- Set-up: Environment where the agent needs to find information in a room, remember it, and use it to move to the correct goal.
+- Goal: Move to the goal which corresponds to the color of the block in the room.
+- Agents: The environment contains one agent.
+- Agent Reward Function (independent):
+ - +1 For moving to correct goal.
+ - -0.1 For moving to incorrect goal.
+ - -0.0003 Existential penalty.
+- Behavior Parameters:
+ - Vector Observation space: 30 corresponding to local ray-casts detecting objects, goals, and walls.
+ - Actions: 1 discrete action Branch, with 4 actions corresponding to agent rotation and forward/backward movement.
+- Float Properties: None
+- Benchmark Mean Reward: 0.7
+ - To train this environment, you can enable curiosity by adding the `curiosity` reward signal in `config/ppo/Hallway.yaml`
+
+## Soccer Twos
+
+
+
+- Set-up: Environment where four agents compete in a 2 vs 2 toy soccer game.
+- Goal:
+ - Get the ball into the opponent's goal while preventing the ball from entering own goal.
+- Agents: The environment contains two different Multi Agent Groups with two agents in each. Behavior Parameters: SoccerTwos.
+- Agent Reward Function (dependent):
+ - (1 - `accumulated time penalty`) When ball enters opponent's goal `accumulated time penalty` is incremented by (1 / `MaxStep`) every fixed update and is reset to 0 at the beginning of an episode.
+ - -1 When ball enters team's goal.
+- Behavior Parameters:
+ - Vector Observation space: 336 corresponding to 11 ray-casts forward distributed over 120 degrees and 3 ray-casts backward distributed over 90 degrees each detecting 6 possible object types, along with the object's distance. The forward ray-casts contribute 264 state dimensions and backward 72 state dimensions over three observation stacks.
+ - Actions: 3 discrete branched actions corresponding to forward, backward, sideways movement, as well as rotation.
+ - Visual Observations: None
+- Float Properties: Two
+ - ball_scale: Specifies the scale of the ball in the 3 dimensions (equal across the three dimensions)
+ - Default: 7.5
+ - Recommended minimum: 4
+ - Recommended maximum: 10
+ - gravity: Magnitude of the gravity
+ - Default: 9.81
+ - Recommended minimum: 6
+ - Recommended maximum: 20
+
+## Strikers Vs. Goalie
+
+
+
+- Set-up: Environment where two agents compete in a 2 vs 1 soccer variant.
+- Goal:
+ - Striker: Get the ball into the opponent's goal.
+ - Goalie: Keep the ball out of the goal.
+- Agents: The environment contains two different Multi Agent Groups, one with two Strikers and the other with one Goalie. Behavior Parameters: Striker, Goalie.
+- Striker Agent Reward Function (dependent):
+ - +1 When ball enters opponent's goal.
+ - -0.001 Existential penalty.
+- Goalie Agent Reward Function (dependent):
+ - -1 When ball enters goal.
+ - 0.001 Existential bonus.
+- Behavior Parameters:
+ - Striker Vector Observation space: 294 corresponding to 11 ray-casts forward distributed over 120 degrees and 3 ray-casts backward distributed over 90 degrees each detecting 5 possible object types, along with the object's distance. The forward ray-casts contribute 231 state dimensions and backward 63 state dimensions over three observation stacks.
+ - Striker Actions: 3 discrete branched actions corresponding to forward, backward, sideways movement, as well as rotation.
+ - Goalie Vector Observation space: 738 corresponding to 41 ray-casts distributed over 360 degrees each detecting 4 possible object types, along with the object's distance and 3 observation stacks.
+ - Goalie Actions: 3 discrete branched actions corresponding to forward, backward, sideways movement, as well as rotation.
+ - Visual Observations: None
+- Float Properties: Two
+ - ball_scale: Specifies the scale of the ball in the 3 dimensions (equal across the three dimensions)
+ - Default: 7.5
+ - Recommended minimum: 4
+ - Recommended maximum: 10
+ - gravity: Magnitude of the gravity
+ - Default: 9.81
+ - Recommended minimum: 6
+ - Recommended maximum: 20
+
+## Walker
+
+
+
+- Set-up: Physics-based Humanoid agents with 26 degrees of freedom. These DOFs correspond to articulation of the following body-parts: hips, chest, spine, head, thighs, shins, feet, arms, forearms and hands.
+- Goal: The agents must move their bodies toward the goal direction without falling.
+- Agents: The environment contains 10 independent agents with the same Behavior Parameters.
+- Agent Reward Function (independent): The reward function is geometric, meaning the reward each step is a product of all the rewards instead of a sum; this helps the agent try to maximize all rewards instead of only the easiest rewards.
+ - Body velocity matches goal velocity. (normalized between (0,1))
+ - Head direction alignment with goal direction. (normalized between (0,1))
+- Behavior Parameters:
+ - Vector Observation space: 243 variables corresponding to position, rotation, velocity, and angular velocities of each limb, along with goal direction.
+ - Actions: 39 continuous actions, corresponding to target rotations and strength applicable to the joints.
+ - Visual Observations: None
+- Float Properties: Four
+ - gravity: Magnitude of gravity
+ - Default: 9.81
+ - Recommended Minimum:
+ - Recommended Maximum:
+ - hip_mass: Mass of the hip component of the walker
+ - Default: 8
+ - Recommended Minimum: 7
+ - Recommended Maximum: 28
+ - chest_mass: Mass of the chest component of the walker
+ - Default: 8
+ - Recommended Minimum: 3
+ - Recommended Maximum: 20
+ - spine_mass: Mass of the spine component of the walker
+ - Default: 8
+ - Recommended Minimum: 3
+ - Recommended Maximum: 20
+- Benchmark Mean Reward : 2500
+
+
+## Pyramids
+
+
+
+- Set-up: Environment where the agent needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top.
+- Goal: Move to the golden brick on top of the spawned pyramid.
+- Agents: The environment contains one agent.
+- Agent Reward Function (independent):
+ - +2 For moving to golden brick (minus 0.001 per step).
+- Behavior Parameters:
+ - Vector Observation space: 148 corresponding to local ray-casts detecting switch, bricks, golden brick, and walls, plus variable indicating switch state.
+ - Actions: 1 discrete action branch, with 4 actions corresponding to agent rotation and forward/backward movement.
+- Float Properties: None
+- Benchmark Mean Reward: 1.75
+
+## Match 3
+
+
+- Set-up: Simple match-3 game. Matched pieces are removed, and remaining pieces drop down. New pieces are spawned randomly at the top, with a chance of being "special".
+- Goal: Maximize score from matching pieces.
+- Agents: The environment contains several independent Agents.
+- Agent Reward Function (independent):
+ - +0.01 for each normal piece cleared. Special pieces are worth 2x or 3x.
+- Behavior Parameters:
+ - None
+ - Observations and actions are defined with a sensor and actuator respectively.
+- Float Properties: None
+- Benchmark Mean Reward:
+ - 39.5 for visual observations
+ - 38.5 for vector observations
+ - 34.2 for simple heuristic (pick a random valid move)
+ - 37.0 for greedy heuristic (pick the highest-scoring valid move)
+
+## Sorter
+
+
+- Set-up: The Agent is in a circular room with numbered tiles. The values of the tiles are random between 1 and 20. The tiles present in the room are randomized at each episode. When the Agent visits a tile, it turns green.
+- Goal: Visit all the tiles in ascending order.
+- Agents: The environment contains a single Agent.
+- Agent Reward Function:
+ - -0.0002 Existential penalty.
+ - +1 For visiting the right tile.
+ - -1 For visiting the wrong tile.
+- Behavior Parameters:
+ - Vector Observations: 4: 2 floats for position and 2 floats for orientation.
+ - Variable Length Observations: Between 1 and 20 entities (one for each tile), each with 23 observations: the first 20 are a one-hot encoding of the value of the tile, the 21st and 22nd represent the position of the tile relative to the Agent, and the 23rd is `1` if the tile was visited and `0` otherwise.
+ - Actions: 3 discrete branched actions corresponding to forward, backward, sideways movement, as well as rotation.
+- Float Properties: One
+ - num_tiles: The maximum number of tiles to sample.
+ - Default: 2
+ - Recommended Minimum: 1
+ - Recommended Maximum: 20
+- Benchmark Mean Reward: Depends on the number of tiles.
+
+## Cooperative Push Block
+
+
+- Set-up: Similar to Push Block, the agents are in an area with blocks that need to be pushed into a goal. Small blocks can be pushed by one agent and are worth +1 value, medium blocks require two agents to push in and are worth +2, and large blocks require all 3 agents to push and are worth +3.
+- Goal: Push all blocks into the goal.
+- Agents: The environment contains three Agents in a Multi Agent Group.
+- Agent Reward Function:
+ - -0.0001 Existential penalty, as a group reward.
+ - +1, +2, or +3 for pushing in a block, added as a group reward.
+- Behavior Parameters:
+ - Observation space: A single Grid Sensor with separate tags for each block size, the goal, the walls, and other agents.
+ - Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise and counterclockwise, move along four different face directions, or do nothing.
+- Float Properties: None
+- Benchmark Mean Reward: 11 (Group Reward)
+
+## Dungeon Escape
+
+
+- Set-up: Agents are trapped in a dungeon with a dragon, and must work together to escape. To retrieve the key, one of the agents must find and slay the dragon, sacrificing itself to do so. The dragon will drop a key for the others to use. The other agents can then pick up this key and unlock the dungeon door. If the agents take too long, the dragon will escape through a portal and the environment resets.
+- Goal: Unlock the dungeon door and leave.
+- Agents: The environment contains three Agents in a Multi Agent Group and one Dragon, which moves in a predetermined pattern.
+- Agent Reward Function:
+ - +1 group reward if any agent successfully unlocks the door and leaves the dungeon.
+- Behavior Parameters:
+ - Observation space: A Ray Perception Sensor with separate tags for the walls, other agents, the door, key, the dragon, and the dragon's portal. A single Vector Observation which indicates whether the agent is holding a key.
+ - Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise and counterclockwise, move along four different face directions, or do nothing.
+- Float Properties: None
+- Benchmark Mean Reward: 1.0 (Group Reward)
diff --git a/com.unity.ml-agents/Documentation~/Learning-Environment-Executable.md b/com.unity.ml-agents/Documentation~/Learning-Environment-Executable.md
new file mode 100644
index 0000000000..4c113377c4
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Learning-Environment-Executable.md
@@ -0,0 +1,156 @@
+# Using an Environment Executable
+
+This section will help you create and use a built environment (an executable) rather than the Editor to interact with an environment. Using an executable has some advantages over using the Editor:
+
+- You can exchange executables with other people without having to share your entire repository.
+- You can put your executable on a remote machine for faster training.
+- You can use `Server Build` (`Headless`) mode for faster training (as long as the executable does not need rendering).
+- You can keep using the Unity Editor for other tasks while the agents are training.
+
+## Building the 3DBall environment
+
+The first step is to open the Unity scene containing the 3D Balance Ball environment:
+
+1. Launch Unity.
+2. On the Projects dialog, choose the **Open** option at the top of the window.
+3. Using the file dialog that opens, locate the `Project` folder within the ML-Agents project and click **Open**.
+4. In the **Project** window, navigate to the folder `Assets/ML-Agents/Examples/3DBall/Scenes/`.
+5. Double-click the `3DBall` file to load the scene containing the Balance Ball environment.
+
+
+
+Next, we want to set up the scene to play correctly when the training process launches our environment executable. This means:
+
+- The environment application runs in the background.
+- No dialogs require interaction.
+- The correct scene loads automatically.
+
+1. Open Player Settings (menu: **Edit** > **Project Settings** > **Player**).
+2. Under **Resolution and Presentation**:
+ - Ensure that **Run in Background** is Checked.
+ - Ensure that **Display Resolution Dialog** is set to Disabled. (Note: this setting may not be available in newer versions of the editor.)
+3. Open the Build Settings window (menu:**File** > **Build Settings**).
+4. Choose your target platform.
+ - (optional) Select “Development Build” to [log debug messages](https://docs.unity3d.com/Manual/LogFiles.html).
+5. If any scenes are shown in the **Scenes in Build** list, make sure that the 3DBall Scene is the only one checked. (If the list is empty, then only the
+current scene is included in the build).
+6. Click **Build**:
+ - In the File dialog, navigate to your ML-Agents directory.
+ - Assign a file name and click **Save**.
+ - (For Windows) With Unity 2018.1, it will ask you to select a folder instead of a file name. Create a subfolder within the root directory and select that folder to build. In the following steps you will refer to this subfolder's name as `env_name`. You cannot create builds in the Assets folder.
+
+
+
+Now that we have a Unity executable containing the simulation environment, we can interact with it.
+
+## Interacting with the Environment
+
+If you want to use the [Python API](Python-LLAPI.md) to interact with your executable, you can pass the name of the executable with the `file_name` argument of `UnityEnvironment`. For instance:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+env = UnityEnvironment(file_name=<env_name>)
+```
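+
+Here `<env_name>` is the name/path of the build you created (for the 3DBall build above, simply `3DBall`). Once connected, you can drive the simulation yourself. The following is a minimal sketch (the `"3DBall"` build name and the step count are placeholders) that resets the environment and steps it with random actions through the low-level API:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+
+# Assumes a build named "3DBall" next to the script; adjust the path to your build.
+env = UnityEnvironment(file_name="3DBall")
+env.reset()
+
+# Each Behavior in the scene is exposed by name.
+behavior_name = list(env.behavior_specs)[0]
+spec = env.behavior_specs[behavior_name]
+
+for _ in range(100):
+    decision_steps, terminal_steps = env.get_steps(behavior_name)
+    # Sample a random action for every agent that is waiting for a decision.
+    actions = spec.action_spec.random_action(len(decision_steps))
+    env.set_actions(behavior_name, actions)
+    env.step()
+
+env.close()
+```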
+
+## Training the Environment
+
+1. Open a command or terminal window.
+2. Navigate to the folder where you installed the ML-Agents Toolkit. If you followed the default [installation](Installation.md), then navigate to the `ml-agents/` folder.
+3. Run `mlagents-learn <trainer-config-file> --env=<env_name> --run-id=<run-identifier>`, where:
+ - `<trainer-config-file>` is the file path of the trainer configuration YAML
+ - `<env_name>` is the name and path to the executable you exported from Unity (without extension)
+ - `<run-identifier>` is a string used to separate the results of different training runs
+
+For example, if you are training with a 3DBall executable, and you saved it to the directory where you installed the ML-Agents Toolkit, run:
+
+```sh
+mlagents-learn config/ppo/3DBall.yaml --env=3DBall --run-id=firstRun
+```
+
+And you should see something like
+
+```console
+ml-agents$ mlagents-learn config/ppo/3DBall.yaml --env=3DBall --run-id=first-run
+
+
+ ▄▄▄▓▓▓▓
+ ╓▓▓▓▓▓▓█▓▓▓▓▓
+ ,▄▄▄m▀▀▀' ,▓▓▓▀▓▓▄ ▓▓▓ ▓▓▌
+ ▄▓▓▓▀' ▄▓▓▀ ▓▓▓ ▄▄ ▄▄ ,▄▄ ▄▄▄▄ ,▄▄ ▄▓▓▌▄ ▄▄▄ ,▄▄
+ ▄▓▓▓▀ ▄▓▓▀ ▐▓▓▌ ▓▓▌ ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌ ╒▓▓▌
+ ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓ ▓▀ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▄ ▓▓▌
+ ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄ ▓▓ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▐▓▓
+ ^█▓▓▓ ▀▓▓▄ ▐▓▓▌ ▓▓▓▓▄▓▓▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▓▄ ▓▓▓▓`
+ '▀▓▓▓▄ ^▓▓▓ ▓▓▓ └▀▀▀▀ ▀▀ ^▀▀ `▀▀ `▀▀ '▀▀ ▐▓▓▌
+ ▀▀▀▀▓▄▄▄ ▓▓▓▓▓▓, ▓▓▓▓▀
+ `▀█▓▓▓▓▓▓▓▓▓▌
+ ¬`▀▀▀█▓
+
+```
+
+**Note**: If you're using Anaconda, don't forget to activate the ml-agents environment first.
+
+If `mlagents-learn` runs correctly and starts training, you should see something like this:
+
+```console
+CrashReporter: initialized
+Mono path[0] = '/Users/dericp/workspace/ml-agents/3DBall.app/Contents/Resources/Data/Managed'
+Mono config path = '/Users/dericp/workspace/ml-agents/3DBall.app/Contents/MonoBleedingEdge/etc'
+INFO:mlagents_envs:
+'Ball3DAcademy' started successfully!
+Unity Academy name: Ball3DAcademy
+
+INFO:mlagents_envs:Connected new brain:
+Unity brain name: Ball3DLearning
+ Number of Visual Observations (per agent): 0
+ Vector Observation space size (per agent): 8
+ Number of stacked Vector Observation: 1
+INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain Ball3DLearning:
+ batch_size: 64
+ beta: 0.001
+ buffer_size: 12000
+ epsilon: 0.2
+ gamma: 0.995
+ hidden_units: 128
+ lambd: 0.99
+ learning_rate: 0.0003
+ max_steps: 5.0e4
+ normalize: True
+ num_epoch: 3
+ num_layers: 2
+ time_horizon: 1000
+ sequence_length: 64
+ summary_freq: 1000
+ use_recurrent: False
+ memory_size: 256
+ use_curiosity: False
+ curiosity_strength: 0.01
+ curiosity_enc_size: 128
+ output_path: ./results/first-run-0/Ball3DLearning
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 1000. Mean Reward: 1.242. Std of Reward: 0.746. Training.
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 2000. Mean Reward: 1.319. Std of Reward: 0.693. Training.
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 3000. Mean Reward: 1.804. Std of Reward: 1.056. Training.
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 4000. Mean Reward: 2.151. Std of Reward: 1.432. Training.
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 5000. Mean Reward: 3.175. Std of Reward: 2.250. Training.
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 6000. Mean Reward: 4.898. Std of Reward: 4.019. Training.
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 7000. Mean Reward: 6.716. Std of Reward: 5.125. Training.
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 8000. Mean Reward: 12.124. Std of Reward: 11.929. Training.
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 9000. Mean Reward: 18.151. Std of Reward: 16.871. Training.
+INFO:mlagents.trainers: first-run-0: Ball3DLearning: Step: 10000. Mean Reward: 27.284. Std of Reward: 28.667. Training.
+```
+
+You can press Ctrl+C to stop the training, and your trained model will be at `results/<run-identifier>/<behavior_name>.onnx`, which corresponds to your model's latest checkpoint. (**Note:** There is a known bug on Windows that causes the saving of the model to fail when you terminate training early; it's recommended to wait until Step has reached the max_steps parameter you set in your config YAML.) You can now embed this trained model into your Agent by following the steps below:
+
+1. Move your model file into `Project/Assets/ML-Agents/Examples/3DBall/TFModels/`.
+2. Open the Unity Editor, and select the **3DBall** scene as described above.
+3. Select the **3DBall** prefab from the Project window and select **Agent**.
+4. Drag the `.onnx` file from the Project window of the Editor to the **Model** placeholder in the **Ball3DAgent** inspector window.
+5. Press the **Play** button at the top of the Editor.
+
+## Training on Headless Server
+
+To run training on a headless server with no graphics rendering support, you need to turn off graphics display in the Unity executable. There are two ways to achieve this:
+1. Pass the `--no-graphics` option to the `mlagents-learn` training command. This is equivalent to adding `-nographics -batchmode` to the Unity executable's command line.
+2. Build your Unity executable with **Server Build**. You can find this setting in Build Settings in the Unity Editor.
+
+If you want to train with graphics (for example, using camera and visual observations), you'll need to set up display rendering support (e.g. xvfb) on your server machine. In our [Colab Notebook Tutorials](Tutorial-Colab.md), the Setup section has examples of setting up xvfb on servers.
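+
+If you are launching the executable from the [Python API](Python-LLAPI.md) rather than through `mlagents-learn`, the equivalent of `--no-graphics` is the `no_graphics` argument of `UnityEnvironment`. The sketch below also speeds up the simulation clock through the engine configuration side channel; the build name and the `time_scale` value are placeholders:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.side_channel.engine_configuration_channel import (
+    EngineConfigurationChannel,
+)
+
+# Speeding up the simulation clock is optional but common on headless servers.
+engine_channel = EngineConfigurationChannel()
+
+env = UnityEnvironment(
+    file_name="3DBall",          # path to your build
+    no_graphics=True,            # adds -nographics -batchmode, like --no-graphics
+    side_channels=[engine_channel],
+)
+engine_channel.set_configuration_parameters(time_scale=20.0)
+env.reset()
+env.close()
+```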
diff --git a/com.unity.ml-agents/Documentation~/Learning-Environments-Agents.md b/com.unity.ml-agents/Documentation~/Learning-Environments-Agents.md
new file mode 100644
index 0000000000..8150b20819
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Learning-Environments-Agents.md
@@ -0,0 +1,11 @@
+# Learning environments and agents
+
+Use the following topics to dive into designing Learning Environments and Agents.
+
+
+| **Section** | **Description** |
+|---------------------------------------------------------------------------------|-----------------------------------------------------------------|
+| [Designing Learning Environments](Learning-Environment-Design.md) | Learn how to structure and configure environments |
+| [Designing Agents](Learning-Environment-Design-Agents.md) | Understand how to create agents |
+| [Sample: Making a New Learning Environment](Learning-Environment-Create-New.md) | Follow a step-by-step example to build a learning environment. |
+| [Using an Executable Environment](Learning-Environment-Executable.md) | Deploy and use executable environments for training. |
diff --git a/com.unity.ml-agents/Documentation~/Limitations.md b/com.unity.ml-agents/Documentation~/Limitations.md
new file mode 100644
index 0000000000..95e08f99a0
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Limitations.md
@@ -0,0 +1,7 @@
+# Limitations
+
+See the package-specific Limitations pages:
+
+- [`com.unity.ml-agents` Unity package](Package-Limitations.md)
+- [`mlagents` Python package](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents/README.md#limitations)
+- [`mlagents_envs` Python package](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents-envs/README.md#limitations)
diff --git a/com.unity.ml-agents/Documentation~/ML-Agents-Overview.md b/com.unity.ml-agents/Documentation~/ML-Agents-Overview.md
new file mode 100644
index 0000000000..ae31f3e538
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/ML-Agents-Overview.md
@@ -0,0 +1,275 @@
+# ML-Agents Theory
+
+Depending on your background (e.g. researcher, game developer, hobbyist), you may have very different questions on your mind at the moment. To make your transition to the ML-Agents Toolkit easier, we provide several background pages that include overviews and helpful resources on the [Unity Engine](Background-Unity.md), [machine learning](Background-Machine-Learning.md) and [PyTorch](Background-PyTorch.md). We **strongly** recommend browsing the relevant background pages if you're not familiar with Unity scenes or basic machine learning concepts, or have not previously heard of PyTorch.
+
+The remainder of this page contains a deep dive into ML-Agents, its key components, different training modes and scenarios. By the end of it, you should have a good sense of _what_ the ML-Agents Toolkit allows you to do. The subsequent documentation pages provide examples of _how_ to use ML-Agents. To get started, watch this [demo video of ML-Agents in action](https://www.youtube.com/watch?v=fiQsmdwEGT8&feature=youtu.be).
+
+## Running Example: Training NPC Behaviors
+
+To help explain the material and terminology in this page, we'll use a hypothetical, running example throughout. We will explore the problem of training the behavior of a non-playable character (NPC) in a game. (An NPC is a game character that is never controlled by a human player and its behavior is pre-defined by the game developer.) More specifically, let's assume we're building a multi-player, war-themed game in which players control the soldiers. In this game, we have a single NPC who serves as a medic, finding and reviving wounded players. Lastly, let us assume that there are two teams, each with five players and one NPC medic.
+
+The behavior of a medic is quite complex. It first needs to avoid getting injured, which requires detecting when it is in danger and moving to a safe location. Second, it needs to be aware of which of its team members are injured and require assistance. In the case of multiple injuries, it needs to assess the degree of injury and decide who to help first. Lastly, a good medic will always place itself in a position where it can quickly help its team members. Factoring in all of these traits means that at every instance, the medic needs to measure several attributes of the environment (e.g. position of team members, position of enemies, which of its team members are injured and to what degree) and then decide on an action (e.g. hide from enemy fire, move to help one of its members). Given the large number of settings of the environment and the large number of actions that the medic can take, defining and implementing such complex behaviors by hand is challenging and prone to errors.
+
+With ML-Agents, it is possible to _train_ the behaviors of such NPCs (called **Agents**) using a variety of methods. The basic idea is quite simple. We need to define three entities at every moment of the game (called **environment**):
+
+- **Observations** - what the medic perceives about the environment. Observations can be numeric and/or visual. Numeric observations measure attributes of the environment from the point of view of the agent. For our medic this would be attributes of the battlefield that are visible to it. For most interesting environments, an agent will require several continuous numeric observations. Visual observations, on the other hand, are images generated from the cameras attached to the agent and represent what the agent is seeing at that point in time. It is common to confuse an agent's observation with the environment (or game) **state**. The environment state represents information about the entire scene containing all the game characters. The agent's observation, however, only contains information that the agent is aware of and is typically a subset of the environment state. For example, the medic observation cannot include information about an enemy in hiding that the medic is unaware of.
+- **Actions** - what actions the medic can take. Similar to observations, actions can either be continuous or discrete depending on the complexity of the environment and agent. In the case of the medic, if the environment is a simple grid world where only their location matters, then a discrete action taking on one of four values (north, south, east, west) suffices. However, if the environment is more complex and the medic can move freely then using two continuous actions (one for direction and another for speed) is more appropriate.
+- **Reward signals** - a scalar value indicating how well the medic is doing. Note that the reward signal need not be provided at every moment, but only when the medic performs an action that is good or bad. For example, it can receive a large negative reward if it dies, a modest positive reward whenever it revives a wounded team member, and a modest negative reward when a wounded team member dies due to lack of assistance. Note that the reward signal is how the objectives of the task are communicated to the agent, so they need to be set up in a manner where maximizing reward generates the desired optimal behavior.
+
+After defining these three entities (the building blocks of a **reinforcement learning task**), we can now _train_ the medic's behavior. This is achieved by simulating the environment for many trials where the medic, over time, learns what is the optimal action to take for every observation it measures by maximizing its future reward. The key is that by learning the actions that maximize its reward, the medic is learning the behaviors that make it a good medic (i.e. one who saves the most lives). In **reinforcement learning** terminology, the behavior that is learned is called a **policy**, which is essentially an (optimal) mapping from observations to actions. Note that the process of learning a policy through running simulations is called the **training phase**, while playing the game with an NPC that is using its learned policy is called the **inference phase**.
+
+The ML-Agents Toolkit provides all the necessary tools for using Unity as the simulation engine for learning the policies of different objects in a Unity environment. In the next few sections, we discuss how the ML-Agents Toolkit achieves this and what features it provides.
+
+## Key Components
+
+The ML-Agents Toolkit contains the following high-level components:
+
+- **Learning Environment** - which contains the Unity scene and all the game characters. The Unity scene provides the environment in which agents observe, act, and learn. How you set up the Unity scene to serve as a learning environment really depends on your goal. You may be trying to solve a specific reinforcement learning problem of limited scope, in which case you can use the same scene for both training and for testing trained agents. Or, you may be training agents to operate in a complex game or simulation. In this case, it might be more efficient and practical to create a purpose-built training scene. The ML-Agents Toolkit includes an ML-Agents Unity SDK (`com.unity.ml-agents` package) that enables you to transform any Unity scene into a learning environment by defining the agents and their behaviors.
+- **Python Low-Level API** - which contains a low-level Python interface for interacting and manipulating a learning environment. Note that, unlike the Learning Environment, the Python API is not part of Unity, but lives outside and communicates with Unity through the Communicator. This API is contained in a dedicated `mlagents_envs` Python package and is used by the Python training process to communicate with and control the Academy during training. However, it can be used for other purposes as well. For example, you could use the API to use Unity as the simulation engine for your own machine learning algorithms. See [Python API](Python-LLAPI.md) for more information.
+- **External Communicator** - which connects the Learning Environment with the Python Low-Level API. It lives within the Learning Environment.
+- **Python Trainers** - which contains all the machine learning algorithms that enable training agents. The algorithms are implemented in Python and are part of their own `mlagents` Python package. The package exposes a single command-line utility `mlagents-learn` that supports all the training methods and options outlined in this document. The Python Trainers interface solely with the Python Low-Level API.
+- **Gym Wrapper** (not pictured). A common way in which machine learning researchers interact with simulation environments is via a wrapper provided by OpenAI called [gym](https://github.com/openai/gym). We provide a gym wrapper in the `ml-agents-envs` package and [instructions](Python-Gym-API.md) for using it with existing machine learning algorithms which utilize gym.
+- **PettingZoo Wrapper** (not pictured). PettingZoo is a Python API for interacting with multi-agent simulation environments that provides a gym-like interface. We provide a PettingZoo wrapper for Unity ML-Agents environments in the `ml-agents-envs` package and [instructions](Python-PettingZoo-API.md) for using it with machine learning algorithms.
+
+
+
+_Simplified block diagram of ML-Agents._
+
+The Learning Environment contains two Unity Components that help organize the Unity scene:
+
+- **Agents** - each attached to a Unity GameObject (any character within a scene) and responsible for generating its observations, performing the actions it receives, and assigning a reward (positive / negative) when appropriate. Each Agent is linked to a Behavior.
+- **Behavior** - defines specific attributes of the agent such as the number of actions that agent can take. Each Behavior is uniquely identified by a `Behavior Name` field. A Behavior can be thought of as a function that receives observations and rewards from the Agent and returns actions. A Behavior can be of one of three types: Learning, Heuristic or Inference. A Learning Behavior is one that is not yet defined but is about to be trained. A Heuristic Behavior is one that is defined by a hard-coded set of rules implemented in code. An Inference Behavior is one that includes a trained Neural Network file. In essence, after a Learning Behavior is trained, it becomes an Inference Behavior.
+
+Every Learning Environment will always have one Agent for every character in the scene. While each Agent must be linked to a Behavior, it is possible for Agents that have similar observations and actions to have the same Behavior. In our sample game, we have two teams each with their own medic. Thus we will have two Agents in our Learning Environment, one for each medic, but both of these medics can have the same Behavior. This does not mean that at each instance they will have identical observation and action _values_.
+
+
+
+_Example block diagram of ML-Agents Toolkit for our sample game._
+
+Note that in a single environment, there can be multiple Agents and multiple Behaviors at the same time. For example, if we expanded our game to include tank driver NPCs, then the Agents attached to those characters cannot share their Behavior with the Agents linked to the medics (medics and drivers have different actions). The Learning Environment, through the Academy (not represented in the diagram), ensures that all the Agents are in sync in addition to controlling environment-wide settings.
+
+Lastly, it is possible to exchange data between Unity and Python outside of the machine learning loop through _Side Channels_. One example of using _Side Channels_ is to exchange data with Python about _Environment Parameters_. The following diagram illustrates the above.
+
+
+
+## Training Modes
+
+Given the flexibility of ML-Agents, there are a few ways in which training and inference can proceed.
+
+### Built-in Training and Inference
+
+As mentioned previously, the ML-Agents Toolkit ships with several implementations of state-of-the-art algorithms for training intelligent agents. More specifically, during training, all the medics in the scene send their observations to the Python API through the External Communicator. The Python API processes these observations and sends back actions for each medic to take. During training these actions are mostly exploratory to help the Python API learn the best policy for each medic. Once training concludes, the learned policy for each medic can be exported as a model file. Then during the inference phase, the medics still continue to generate their observations, but instead of being sent to the Python API, they will be fed into their (internal, embedded) model to generate the _optimal_ action for each medic to take at every point in time.
+
+The [Running an Example](Sample.md) guide covers this training mode with the **3D Balance Ball** sample environment.
+
+#### Cross-Platform Inference
+
+It is important to note that the ML-Agents Toolkit leverages the [Inference Engine](Inference-Engine.md) to run the models within a Unity scene such that an agent can take the _optimal_ action at each step. Given that Inference Engine supports all Unity runtime platforms, this means that any model you train with the ML-Agents Toolkit can be embedded into your Unity application that runs on any platform.
+
+### Custom Training and Inference
+
+In the previous mode, the Agents were used for training to generate a PyTorch model that the Agents can later use. However, any user of the ML-Agents Toolkit can leverage their own algorithms for training. In this case, the behaviors of all the Agents in the scene will be controlled within Python. You can even turn your environment into a [gym](Python-Gym-API.md).
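+
+As a sketch of the gym route (the build name below is a placeholder, and the wrapper expects a build that contains a single Agent), you can wrap a built environment with `UnityToGymWrapper` so that any gym-compatible algorithm can interact with it:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
+
+unity_env = UnityEnvironment(file_name="MySingleAgentBuild")  # hypothetical build name
+env = UnityToGymWrapper(unity_env)
+
+obs = env.reset()
+for _ in range(1000):
+    # Replace the random policy with your own algorithm's action selection.
+    obs, reward, done, info = env.step(env.action_space.sample())
+    if done:
+        obs = env.reset()
+env.close()
+```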
+
+We do not currently have a tutorial highlighting this mode, but you can learn more about the Python API [here](Python-LLAPI.md).
+
+## Flexible Training Scenarios
+
+While the discussion so far has mostly focused on training a single agent, with ML-Agents, several training scenarios are possible. We are excited to see what kinds of novel and fun environments the community creates. For those new to training intelligent agents, below are a few examples that can serve as inspiration:
+
+- Single-Agent. A single agent, with its own reward signal. The traditional way of training an agent. An example is any single-player game, such as Chicken.
+- Simultaneous Single-Agent. Multiple independent agents with independent reward signals with same `Behavior Parameters`. A parallelized version of the traditional training scenario, which can speed up and stabilize the training process. Helpful when you have multiple versions of the same character in an environment who should learn similar behaviors. An example might be training a dozen robot-arms to each open a door simultaneously.
+- Adversarial Self-Play. Two interacting agents with inverse reward signals. In two-player games, adversarial self-play can allow an agent to become increasingly more skilled, while always having the perfectly matched opponent: itself. This was the strategy employed when training AlphaGo, and more recently used by OpenAI to train a human-beating 1-vs-1 Dota 2 agent.
+- Cooperative Multi-Agent. Multiple interacting agents with a shared reward signal with same or different `Behavior Parameters`. In this scenario, all agents must work together to accomplish a task that cannot be done alone. Examples include environments where each agent only has access to partial information, which needs to be shared in order to accomplish the task or collaboratively solve a puzzle.
+- Competitive Multi-Agent. Multiple interacting agents with inverse reward signals with same or different `Behavior Parameters`. In this scenario, agents must compete with one another to either win a competition, or obtain some limited set of resources. All team sports fall into this scenario.
+- Ecosystem. Multiple interacting agents with independent reward signals with same or different `Behavior Parameters`. This scenario can be thought of as creating a small world in which animals with different goals all interact, such as a savanna in which there might be zebras, elephants and giraffes, or an autonomous driving simulation within an urban environment.
+
+## Training Methods: Environment-agnostic
+
+The remaining sections overview the various state-of-the-art machine learning algorithms that are part of the ML-Agents Toolkit. If you aren't studying machine and reinforcement learning as a subject and just want to train agents to accomplish tasks, you can treat these algorithms as _black boxes_. There are a few training-related parameters to adjust inside Unity as well as on the Python training side, but you do not need in-depth knowledge of the algorithms themselves to successfully create and train agents. Step-by-step procedures for running the training process are provided in the [Training ML-Agents](Training-ML-Agents.md) page.
+
+This section specifically focuses on the training methods that are available regardless of the specifics of your learning environment.
+
+### A Quick Note on Reward Signals
+
+In this section we introduce the concepts of _intrinsic_ and _extrinsic_ rewards, which help explain some of the training methods.
+
+In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy) that maximizes a reward. You will need to provide the agent one or more reward signals to use during training. Typically, a reward is defined by your environment, and corresponds to reaching some goal. These are what we refer to as _extrinsic_ rewards, as they are defined external to the learning algorithm.
+
+Rewards, however, can be defined outside of the environment as well, to encourage the agent to behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these rewards as _intrinsic_ reward signals. The total reward that the agent will learn to maximize can be a mix of extrinsic and intrinsic reward signals.
+
+The ML-Agents Toolkit allows reward signals to be defined in a modular way, and we provide four reward signals that can be mixed and matched to help shape your agent's behavior:
+
+- `extrinsic`: represents the rewards defined in your environment, and is enabled by default
+- `gail`: represents an intrinsic reward signal that is defined by GAIL (see below)
+- `curiosity`: represents an intrinsic reward signal that encourages exploration in sparse-reward environments that is defined by the Curiosity module (see below).
+- `rnd`: represents an intrinsic reward signal that encourages exploration in sparse-reward environments that is defined by the Random Network Distillation (RND) module (see below).
+
+### Deep Reinforcement Learning
+
+ML-Agents provides implementations of two reinforcement learning algorithms:
+
+- [Proximal Policy Optimization (PPO)](https://openai.com/research/openai-baselines-ppo)
+- [Soft Actor-Critic (SAC)](https://bair.berkeley.edu/blog/2018/12/14/sac/)
+
+The default algorithm is PPO. This is a method that has been shown to be more general purpose and stable than many other RL algorithms.
+
+In contrast with PPO, SAC is _off-policy_, which means it can learn from experiences collected at any time during the past. As experiences are collected, they are placed in an experience replay buffer and randomly drawn during training. This makes SAC significantly more sample-efficient, often requiring 5-10 times fewer samples than PPO to learn the same task. However, SAC tends to require more model updates. SAC is a good choice for heavier or slower environments (about 0.1 seconds per step or more). SAC is also a "maximum entropy" algorithm, and enables exploration in an intrinsic way. Read more about maximum entropy RL [here](https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/).
+
+#### Curiosity for Sparse-reward Environments
+
+In environments where the agent receives rare or infrequent rewards (i.e. sparse-reward), an agent may never receive a reward signal on which to bootstrap its training process. This is a scenario where the use of intrinsic reward signals can be valuable. Curiosity is one such signal which can help the agent explore when extrinsic rewards are sparse.
+
+The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation of the approach described in [Curiosity-driven Exploration by Self-supervised Prediction](https://pathak22.github.io/noreward-rl/) by Pathak, et al. It trains two networks:
+
+- an inverse model, which takes the current and next observation of the agent, encodes them, and uses the encoding to predict the action that was taken between the observations
+- a forward model, which takes the encoded current observation and action, and predicts the next encoded observation.
+
+The loss of the forward model (the difference between the predicted and actual encoded observations) is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.
+
+For more information, see our dedicated [blog post on the Curiosity module](https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/).
+
+#### RND for Sparse-reward Environments
+
+Similarly to Curiosity, Random Network Distillation (RND) is useful in sparse or rare reward environments as it helps the Agent explore. The RND Module is implemented following the paper [Exploration by Random Network Distillation](https://arxiv.org/abs/1810.12894). RND uses two networks:
+
+ - The first is a network with fixed random weights that takes observations as inputs and generates an encoding
+ - The second is a network with similar architecture that is trained to predict the outputs of the first network and uses the observations the Agent collects as training data.
+
+The loss (the squared difference between the predicted and actual encoded observations) of the trained model is used as the intrinsic reward. The more an Agent visits a state, the more accurate the predictions and the lower the rewards, which encourages the Agent to explore new states with higher prediction errors.
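+
+To make the mechanism concrete, here is a toy NumPy sketch of the idea (not the trainer's implementation; the fixed random network, the linear predictor, and all dimensions are illustrative). A predictor is trained to match a fixed random network on frequently visited states, so rarely visited states produce larger prediction errors and therefore larger intrinsic rewards:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+obs_dim, hidden_dim, enc_dim, lr = 8, 32, 16, 0.05
+
+# Fixed, randomly initialized "target" network: it is never trained.
+A = rng.normal(size=(obs_dim, hidden_dim))
+B = rng.normal(size=(hidden_dim, enc_dim))
+
+def target_encoding(obs):
+    return np.tanh(obs @ A) @ B
+
+# Simple predictor trained to match the target's output on visited states.
+W = np.zeros((obs_dim, enc_dim))
+
+def intrinsic_reward(obs):
+    # The squared prediction error plays the role of the RND intrinsic reward.
+    return float(np.mean((obs @ W - target_encoding(obs)) ** 2))
+
+# States the agent visits often: a small region of observation space.
+visited = rng.normal(scale=0.3, size=(512, obs_dim))
+for _ in range(2000):
+    error = visited @ W - target_encoding(visited)
+    W -= lr * visited.T @ error / len(visited)   # gradient step on the squared error
+
+# A state far outside the visited region yields a much larger intrinsic reward.
+novel = np.full((1, obs_dim), 3.0)
+print(intrinsic_reward(visited[:1]), intrinsic_reward(novel))
+```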
+
+### Imitation Learning
+
+It is often more intuitive to simply demonstrate the behavior we want an agent to perform, rather than attempting to have it learn via trial-and-error methods. For example, instead of indirectly training a medic with the help of a reward function, we can give the medic real-world examples of observations from the game and actions from a game controller to guide the medic's behavior. Imitation Learning uses pairs of observations and actions from a demonstration to learn a policy. See this [video demo](https://youtu.be/kpb8ZkMBFYs) of imitation learning.
+
+Imitation learning can either be used alone or in conjunction with reinforcement learning. If used alone it can provide a mechanism for learning a specific type of behavior (i.e. a specific style of solving the task). If used in conjunction with reinforcement learning it can dramatically reduce the time the agent takes to solve the environment. This can be especially pronounced in sparse-reward environments. For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids), using 6 episodes of demonstrations can reduce training steps by more than 4 times. See Behavioral Cloning + GAIL + Curiosity + RL below.
+
+
+
+The ML-Agents Toolkit provides a way to learn directly from demonstrations, as well as use them to help speed up reward-based training (RL). We include two algorithms called Behavioral Cloning (BC) and Generative Adversarial Imitation Learning (GAIL). In most scenarios, you can combine these two features:
+
+- If you want to help your agents learn (especially with environments that have sparse rewards) using pre-recorded demonstrations, you can generally enable both GAIL and Behavioral Cloning at low strengths in addition to having an extrinsic reward. An example of this is provided for the PushBlock example environment in `config/imitation/PushBlock.yaml`.
+- If you want to train purely from demonstrations with GAIL and BC _without_ an extrinsic reward signal, please see the CrawlerStatic example environment configuration in `config/imitation/CrawlerStatic.yaml`.
+
+***Note:*** GAIL introduces a [_survivor bias_](https://arxiv.org/pdf/1809.02925.pdf) to the learning process. That is, by giving positive rewards based on similarity to the expert, the agent is incentivized to remain alive for as long as possible. This can directly conflict with goal-oriented tasks like our PushBlock or Pyramids example environments where an agent must reach a goal state thus ending the episode as quickly as possible. In these cases, we strongly recommend that you use a low strength GAIL reward signal and a sparse extrinsic signal when the agent achieves the task. This way, the GAIL reward signal will guide the agent until it discovers the extrinsic signal and will not overpower it. If the agent appears to be ignoring the extrinsic reward signal, you should reduce the strength of GAIL.
+
+#### GAIL (Generative Adversarial Imitation Learning)
+
+GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), uses an adversarial approach to reward your Agent for behaving similar to a set of demonstrations. GAIL can be used with or without environment rewards, and works well when there are a limited number of demonstrations. In this framework, a second neural network, the discriminator, is taught to distinguish whether an observation/action is from a demonstration or produced by the agent. This discriminator can then examine a new observation/action and provide it a reward based on how close it believes this new observation/action is to the provided demonstrations.
+
+At each training step, the agent tries to learn how to maximize this reward. Then, the discriminator is trained to better distinguish between demonstrations and agent state/actions. In this way, while the agent gets better and better at mimicking the demonstrations, the discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.
+
+This approach learns a _policy_ that produces states and actions similar to the demonstrations, requiring fewer demonstrations than direct cloning of the actions. In addition to learning purely from demonstrations, the GAIL reward signal can be mixed with an extrinsic reward signal to guide the learning process.
+
+#### Behavioral Cloning (BC)
+
+BC trains the Agent's policy to exactly mimic the actions shown in a set of demonstrations. The BC feature can be enabled on the PPO or SAC trainers. As BC cannot generalize past the examples shown in the demonstrations, BC tends to work best when there exist demonstrations for nearly all of the states that the agent can experience, or in conjunction with GAIL and/or an extrinsic reward.
+
+#### Recording Demonstrations
+
+Demonstrations of agent behavior can be recorded from the Unity Editor or build, and saved as assets. These demonstrations contain information on the observations, actions, and rewards for a given agent during the recording session. They can be managed in the Editor, as well as used for training with BC and GAIL. See the [Designing Agents](Learning-Environment-Design-Agents.md#recording-demonstrations) page for more information on how to record demonstrations for your agent.
+
+### Summary
+
+To summarize, we provide three training methods, BC, GAIL and RL (PPO or SAC), that can be used independently or together:
+
+- BC can be used on its own or as a pre-training step before GAIL and/or RL
+- GAIL can be used with or without extrinsic rewards
+- RL can be used on its own (either PPO or SAC) or in conjunction with BC and/or GAIL.
+
+Leveraging either BC or GAIL requires recorded demonstrations to be provided as input to the training algorithms.
+
+## Training Methods: Environment-specific
+
+In addition to the three environment-agnostic training methods introduced in the previous section, the ML-Agents Toolkit provides additional methods that can aid in training behaviors for specific types of environments.
+
+### Training in Competitive Multi-Agent Environments with Self-Play
+
+ML-Agents provides the functionality to train both symmetric and asymmetric adversarial games with [Self-Play](https://openai.com/research/competitive-self-play). A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and actions and learn from the same reward function and so _they can share the same policy_. In asymmetric games, this is not the case. An example of an asymmetric game is Hide and Seek. Agents in these types of games do not always have the same observation or actions and so sharing policy networks is not necessarily ideal.
+
+With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent (which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is compared to competing against the current, best opponent in every episode, which is constantly changing (because it's learning).
+
+Self-play can be used with our implementations of both Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). However, from the perspective of an individual agent, these scenarios appear to have non-stationary dynamics because the opponent is often changing. This can cause significant issues in the experience replay mechanism used by SAC. Thus, we recommend that users use PPO. For further reading on this issue in particular, see the paper [Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1702.08887.pdf).
+
+See our [Designing Agents](Learning-Environment-Design-Agents.md#defining-multi-agent-scenarios) page for more information on setting up teams in your Unity scene. Also, read our [blog post on self-play](https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/) for additional information. Additionally, see [ELO Rating System](ELO-Rating-System.md) for the method we use to calculate the relative skill level between two players.
+
+### Training in Cooperative Multi-Agent Environments with MA-POCA
+
+
+
+ML-Agents provides the functionality for training cooperative behaviors - i.e., groups of agents working towards a common goal, where the success of the individual is linked to the success of the whole group. In such a scenario, agents typically receive rewards as a group. For instance, if a team of agents wins a game against an opposing team, everyone is rewarded - even agents who did not directly contribute to the win. This makes learning what to do as an individual difficult - you may get a win for doing nothing, and a loss for doing your best.
+
+In ML-Agents, we provide MA-POCA (MultiAgent POsthumous Credit Assignment), which is a novel multi-agent trainer that trains a _centralized critic_, a neural network that acts as a "coach" for a whole group of agents. You can then give rewards to the team as a whole, and the agents will learn how best to contribute to achieving that reward. Agents can _also_ be given rewards individually, and the team will work together to help the individual achieve those goals. During an episode, agents can be added or removed from the group, such as when agents spawn or die in a game. If agents are removed mid-episode (e.g., if teammates die or are removed from the game), they will still learn whether their actions contributed to the team winning later, enabling agents to take group-beneficial actions even if they result in the individual being removed from the game (i.e., self-sacrifice). MA-POCA can also be combined with self-play to train teams of agents to play against each other.
+
+To learn more about enabling cooperative behaviors for agents in an ML-Agents environment, check out [this page](Learning-Environment-Design-Agents.md#groups-for-cooperative-scenarios).
+
+To learn more about MA-POCA, please see our paper [On the Use and Misuse of Absorbing States in Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/2111.05992.pdf). For further reading, MA-POCA builds on previous work in multi-agent cooperative learning ([Lowe et al.](https://arxiv.org/abs/1706.02275), [Foerster et al.](https://arxiv.org/pdf/1705.08926.pdf), among others) to enable the above use-cases.
+
+### Solving Complex Tasks using Curriculum Learning
+
+Curriculum learning is a way of training a machine learning model where more difficult aspects of a problem are gradually introduced in such a way that the model is always optimally challenged. This idea has been around for a long time, and it is how we humans typically learn. If you imagine any childhood primary school education, there is an ordering of classes and topics. Arithmetic is taught before algebra, for example. Likewise, algebra is taught before calculus. The skills and knowledge learned in the earlier subjects provide a scaffolding for later lessons. The same principle can be applied to machine learning, where training on easier tasks can provide a scaffolding for harder tasks in the future.
+
+Imagine training the medic to scale a wall to arrive at a wounded team member. The starting point when training a medic to accomplish this task will be a random policy. That starting policy will have the medic running in circles, and will likely never, or very rarely scale the wall properly to revive their team member (and achieve the reward). If we start with a simpler task, such as moving toward an unobstructed team member, then the medic can easily learn to accomplish the task. From there, we can slowly add to the difficulty of the task by increasing the size of the wall until the medic can complete the initially near-impossible task of scaling the wall. We have included an environment to demonstrate this with ML-Agents, called [Wall Jump](Learning-Environment-Examples.md#wall-jump).
+
+
+
+_Demonstration of a hypothetical curriculum training scenario in which a progressively taller wall obstructs the path to the goal._
+
+_[**Note**: The example provided above is for instructional purposes, and was based on an early version of the [Wall Jump example environment](Learning-Environment-Examples.md). As such, it is not possible to directly replicate the results here using that environment.]_
+
+The ML-Agents Toolkit supports modifying custom environment parameters during the training process to aid in learning. This allows elements of the environment related to difficulty or complexity to be dynamically adjusted based on training progress. The [Training ML-Agents](Training-ML-Agents.md#curriculum) page has more information on defining training curriculums.
+
+### Training Robust Agents using Environment Parameter Randomization
+
+An agent trained on a specific environment may be unable to generalize to any tweaks or variations in the environment (in machine learning this is referred to as overfitting). This becomes problematic in cases where environments are instantiated with varying objects or properties. One mechanism to alleviate this and train more robust agents that can generalize to unseen variations of the environment is to expose them to these variations during training. Similar to Curriculum Learning, where environments become more difficult as the agent learns, the ML-Agents Toolkit provides a way to randomly sample parameters of the environment during training. We refer to this approach as **Environment Parameter Randomization**. For those familiar with Reinforcement Learning research, this approach is based on the concept of [Domain Randomization](https://arxiv.org/abs/1703.06907). By using [parameter randomization during training](Training-ML-Agents.md#environment-parameter-randomization), the agent can be better suited to adapt (with higher performance) to future unseen variations of the environment.
+
+| Ball scale of 0.5 | Ball scale of 4 |
+| :--------------------------: | :------------------------: |
+|  |  |
+
+_Example of variations of the 3D Ball environment. The environment parameters are `gravity`, `ball_mass` and `ball_scale`._
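+
+When you launch a build directly from the Python API (rather than letting the trainer's sampler configuration randomize parameters for you), the same environment parameters can be set through the `EnvironmentParametersChannel` side channel. A minimal sketch, assuming a 3D Ball build named `3DBall`:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.side_channel.environment_parameters_channel import (
+    EnvironmentParametersChannel,
+)
+
+params_channel = EnvironmentParametersChannel()
+env = UnityEnvironment(file_name="3DBall", side_channels=[params_channel])
+
+# Keys match the environment parameters exposed by the scene (see caption above).
+params_channel.set_float_parameter("gravity", 9.81)
+params_channel.set_float_parameter("ball_scale", 0.5)
+env.reset()
+```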
+
+## Model Types
+
+Regardless of the training method deployed, there are a few model types that users can train using the ML-Agents Toolkit. This is due to the flexibility in defining agent observations, which include vector, ray cast and visual observations. You can learn more about how to instrument an agent's observation in the [Designing Agents](Learning-Environment-Design-Agents.md) guide.
+
+### Learning from Vector Observations
+
+Whether an agent's observations are ray cast or vector, the ML-Agents Toolkit provides a fully connected neural network model to learn from those observations. At training time you can configure different aspects of this model such as the number of hidden units and number of layers.
+
+### Learning from Cameras using Convolutional Neural Networks
+
+Unlike other platforms, where the agent’s observation might be limited to a single vector or image, the ML-Agents Toolkit allows multiple cameras to be used for observations per agent. This enables agents to learn to integrate information from multiple visual streams. This can be helpful in several scenarios such as training a self-driving car which requires multiple cameras with different viewpoints, or a navigational agent which might need to integrate aerial and first-person visuals. You can learn more about adding visual observations to an agent [here](Learning-Environment-Design-Agents.md#visual-observations).
+
+When visual observations are utilized, the ML-Agents Toolkit leverages convolutional neural networks (CNN) to learn from the input images. We offer three network architectures:
+
+- a simple encoder which consists of two convolutional layers
+- the implementation proposed by [Mnih et al.](https://www.nature.com/articles/nature14236), consisting of three convolutional layers,
+- the [IMPALA Resnet](https://arxiv.org/abs/1802.01561) consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two.
+
+The choice of the architecture depends on the visual complexity of the scene and the available computational resources.
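+
+For intuition, the following is a sketch of what a two-convolution encoder of this kind can look like in PyTorch; the kernel sizes, strides, channel counts, and encoding size are illustrative rather than the exact values used by the trainers:
+
+```python
+import torch
+import torch.nn as nn
+
+class SimpleVisualEncoder(nn.Module):
+    """A small two-convolution encoder for visual observations (illustrative sizes)."""
+
+    def __init__(self, height: int, width: int, channels: int, encoding_size: int = 128):
+        super().__init__()
+        self.conv = nn.Sequential(
+            nn.Conv2d(channels, 16, kernel_size=8, stride=4),
+            nn.LeakyReLU(),
+            nn.Conv2d(16, 32, kernel_size=4, stride=2),
+            nn.LeakyReLU(),
+            nn.Flatten(),
+        )
+        # Infer the flattened size by running a dummy observation through the conv stack.
+        with torch.no_grad():
+            flat = self.conv(torch.zeros(1, channels, height, width)).shape[1]
+        self.dense = nn.Sequential(nn.Linear(flat, encoding_size), nn.LeakyReLU())
+
+    def forward(self, visual_obs: torch.Tensor) -> torch.Tensor:
+        return self.dense(self.conv(visual_obs))
+
+# A batch of four 84x84 RGB observations -> one 128-dimensional encoding each.
+encoder = SimpleVisualEncoder(84, 84, 3)
+print(encoder(torch.zeros(4, 3, 84, 84)).shape)  # torch.Size([4, 128])
+```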
+
+### Learning from Variable Length Observations using Attention
+
+Using the ML-Agents Toolkit, it is possible to have agents learn from a varying number of inputs. To do so, each agent can keep track of a buffer of vector observations. At each step, the agent will go through all the elements in the buffer and extract information, but the elements in the buffer can change at every step. This can be useful in scenarios in which the agents must keep track of a varying number of elements throughout the episode, for example a game where an agent must learn to avoid projectiles whose number varies over time.
+
+
+
+You can learn more about variable length observations [here](Learning-Environment-Design-Agents.md#variable-length-observations). When variable length observations are utilized, the ML-Agents Toolkit leverages attention networks to learn from a varying number of entities. Agents using attention will ignore entities that are deemed not relevant and pay special attention to entities relevant to the current situation based on context.
+
+### Memory-enhanced Agents using Recurrent Neural Networks
+
+Have you ever entered a room to get something and immediately forgot what you were looking for? Don't let that happen to your agents.
+
+
+
+In some scenarios, agents must learn to remember the past in order to make the best decision. When an agent only has partial observability of the environment, keeping track of past observations can help the agent learn. Deciding what the agents should remember in order to solve a task is not easy to do by hand, but our training algorithms can learn to keep track of what is important to remember with [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory).
+
+## Additional Features
+
+Beyond the flexible training scenarios available, the ML-Agents Toolkit includes additional features which improve the flexibility and interpretability of the training process.
+
+- **Concurrent Unity Instances** - We enable developers to run concurrent, parallel instances of the Unity executable during training. For certain scenarios, this should speed up training. Check out our dedicated page on [creating a Unity executable](Learning-Environment-Executable.md) and the [Training ML-Agents](Training-ML-Agents.md#training-using-concurrent-unity-instances) page for instructions on how to set the number of concurrent instances.
+- **Recording Statistics from Unity** - We enable developers to [record statistics](Learning-Environment-Design.md#recording-statistics) from within their Unity environments. These statistics are aggregated and generated during the training process.
+- **Custom Side Channels** - We enable developers to [create custom side channels](Custom-SideChannels.md) to manage data transfer between Unity and Python that is unique to their training workflow and/or environment.
diff --git a/com.unity.ml-agents/Documentation~/Migrating.md b/com.unity.ml-agents/Documentation~/Migrating.md
new file mode 100644
index 0000000000..2ae1c6e4ec
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Migrating.md
@@ -0,0 +1,542 @@
+# Upgrading
+
+# Migrating
+
+## Migrating to the ml-agents-envs 0.30.0 package
+- Python 3.10.12 is now the minimum version of python supported due to [python3.6 EOL](https://endoflife.date/python). Please update your python installation to 3.10.12 or higher.
+- The `gym-unity` package has been refactored into the `ml-agents-envs` package. Please update your imports accordingly.
+- Example:
+ - Before
+ ```python
+ from gym_unity.unity_gym_env import UnityToGymWrapper
+ ```
+ - After:
+ ```python
+ from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
+ ```
+
+## Migrating the package to version 3.x
+- The official version of Unity that ML-Agents supports is now 6000.0. If you run into issues, please consider deleting your project's Library folder and reopening your project.
+
+
+## Migrating the package to version 2.x
+- The official version of Unity that ML-Agents supports is now 2022.3 LTS. If you run into issues, please consider deleting your project's Library folder and reopening your project.
+- If you used any of the APIs that were deprecated before version 2.0, you need to use their replacement. These deprecated APIs have been removed. See the migration steps below for specific API replacements.
+
+### Deprecated methods removed
+| **Deprecated API** | **Suggested Replacement** |
+|:-------:|:------:|
+| `IActuator ActuatorComponent.CreateActuator()` | `IActuator[] ActuatorComponent.CreateActuators()` |
+| `IActionReceiver.PackActions(in float[] destination)` | none |
+| `Agent.CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)` | `Agent.WriteDiscreteActionMask(IDiscreteActionMask actionMask)` |
+| `Agent.Heuristic(float[] actionsOut)` | `Agent.Heuristic(in ActionBuffers actionsOut)` |
+| `Agent.OnActionReceived(float[] vectorAction)` | `Agent.OnActionReceived(ActionBuffers actions)` |
+| `Agent.GetAction()` | `Agent.GetStoredActionBuffers()` |
+| `BrainParameters.SpaceType`, `VectorActionSize`, `VectorActionSpaceType`, and `NumActions` | `BrainParameters.ActionSpec` |
+| `ObservationWriter.AddRange(IEnumerable data, int writeOffset = 0)` | `ObservationWriter.AddList(IList data, int writeOffset = 0)` |
+| `SensorComponent.IsVisual()` and `IsVector()` | none |
+| `VectorSensor.AddObservation(IEnumerable observation)` | `VectorSensor.AddObservation(IList observation)` |
+| `SideChannelsManager` | `SideChannelManager` |
+
+### IDiscreteActionMask changes
+- The interface for disabling specific discrete actions has changed. `IDiscreteActionMask.WriteMask()` was removed, and replaced with `SetActionEnabled()`. Instead of returning an IEnumerable with indices to disable, you can now call `SetActionEnabled` for each index to disable (or enable). As an example, if you overrode `Agent.WriteDiscreteActionMask()` with something that looked like:
+
+```csharp
+public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
+{
+ var branch = 2;
+ var actionsToDisable = new[] {1, 3};
+ actionMask.WriteMask(branch, actionsToDisable);
+}
+```
+
+the equivalent code would now be
+
+```csharp
+public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
+{
+ var branch = 2;
+ actionMask.SetActionEnabled(branch, 1, false);
+ actionMask.SetActionEnabled(branch, 3, false);
+}
+```
+### IActuator changes
+- The `IActuator` interface now implements `IHeuristicProvider`. Please add the corresponding `Heuristic(in ActionBuffers)` method to your custom Actuator classes.
+
+### ISensor and SensorComponent changes
+- The `ISensor.GetObservationShape()` method and `ITypedSensor` and `IDimensionPropertiesSensor` interfaces were removed, and `GetObservationSpec()` was added. You can use `ObservationSpec.Vector()` or `ObservationSpec.Visual()` to generate `ObservationSpec`s that are equivalent to the previous shape. For example, if your old ISensor looked like:
+
+```csharp
+public override int[] GetObservationShape()
+{
+ return new[] { m_Height, m_Width, m_NumChannels };
+}
+```
+
+the equivalent code would now be
+
+```csharp
+public override ObservationSpec GetObservationSpec()
+{
+ return ObservationSpec.Visual(m_Height, m_Width, m_NumChannels);
+}
+```
+
+- The `ISensor.GetCompressionType()` method and `ISparseChannelSensor` interface were removed, and `GetCompressionSpec()` was added. You can use `CompressionSpec.Default()` or `CompressionSpec.Compressed()` to generate `CompressionSpec`s that are equivalent to the previous values. For example, if your old ISensor looked like:
+
+```csharp
+public virtual SensorCompressionType GetCompressionType()
+{
+ return SensorCompressionType.None;
+}
+```
+
+the equivalent code would now be
+
+```csharp
+public CompressionSpec GetCompressionSpec()
+{
+ return CompressionSpec.Default();
+}
+```
+
+- The abstract method `SensorComponent.GetObservationShape()` was removed.
+- The abstract method `SensorComponent.CreateSensor()` was replaced with `CreateSensors()`, which returns an `ISensor[]`.
+
+### Match3 integration changes
+The Match-3 integration utilities are now included in `com.unity.ml-agents`.
+
+The `AbstractBoard` interface was changed:
+* `AbstractBoard` no longer contains `Rows`, `Columns`, `NumCellTypes`, and `NumSpecialTypes` fields.
+* `public abstract BoardSize GetMaxBoardSize()` was added as an abstract method. `BoardSize` is a new struct that contains `Rows`, `Columns`, `NumCellTypes`, and `NumSpecialTypes` fields, with the same meanings as the old `AbstractBoard` fields.
+* `public virtual BoardSize GetCurrentBoardSize()` is an optional method; by default it returns `GetMaxBoardSize()`. If you wish to use a single behavior to work with multiple board sizes, override `GetCurrentBoardSize()` to return the current `BoardSize`. The values returned by `GetCurrentBoardSize()` must be less than or equal to the corresponding values from `GetMaxBoardSize()`.
+
+### GridSensor changes
+The sensor configuration has changed:
+* The sensor implementation has been refactored, and existing GridSensors created with the extension package will not work in newer versions. Some errors may show up when loading the old sensor in a scene; you'll need to remove the old sensor and create a new GridSensor.
+* These parameter names have changed but still refer to the same concepts in the sensor: `GridNumSide` -> `GridSize`, `RotateToAgent` -> `RotateWithAgent`, `ObserveMask` -> `ColliderMask`, `DetectableObjects` -> `DetectableTags`.
+* The `DepthType` (`ChanelBase`/`ChannelHot`) option and `ChannelDepth` have been removed. The default is now one-hot encoding for each detected tag. If you were using the original GridSensor without overriding any methods, switching to the new GridSensor will produce a similar effect for training, although the actual observations will be slightly different.
+
+To create your own GridSensor implementation with custom data:
+* To create a custom GridSensor, derive from `GridSensorBase` instead of `GridSensor`. Besides overriding `GetObjectData()`, you will also need to consider overriding `GetCellObservationSize()`, `IsDataNormalized()` and `GetProcessCollidersMethod()` according to the data you collect. You'll also need to override `GridSensorComponent.GetGridSensors()` to return your custom GridSensor.
+* The input argument `tagIndex` in `GetObjectData()` has changed from 1-indexed to 0-indexed, and its data type changed from `float` to `int`. The index of the first detectable tag is now 0 instead of 1. `normalizedDistance` was removed from the input.
+* The observation data should be written to the provided `dataBuffer` instead of creating and returning a new array.
+* The constraint that all data must be normalized has been removed. You should now specify whether your data is normalized in `IsDataNormalized()`. Sensors with non-normalized data cannot use the PNG compression type.
+* The sensor no longer further encodes the data received from `GetObjectData()`; the values returned from `GetObjectData()` will be the observations sent to the trainer.
+
+### LSTM models from previous releases no longer supported
+The way that Sentis processes LSTM (recurrent neural networks) has changed. As a result, models trained with previous versions of ML-Agents will not be usable at inference if they were trained with a `memory` setting in the `.yaml` config file. If you want to use a model that has a recurrent neural network in this release of ML-Agents, you need to train the model using the python trainer from this release.
+
+
+## Migrating to Release 13
+### Implementing IHeuristic in your IActuator implementations
+ - If you have any custom actuators, you can now implement the `IHeuristicProvider` interface to have your actuator handle the generation of actions when an Agent is running in heuristic mode.
+- `VectorSensor.AddObservation(IEnumerable)` is deprecated. Use `VectorSensor.AddObservation(IList)` instead.
+- `ObservationWriter.AddRange()` is deprecated. Use `ObservationWriter.AddList()` instead.
+- `ActuatorComponent.CreateActuator()` is deprecated. Please override `ActuatorComponent.CreateActuators()` instead. Since `ActuatorComponent.CreateActuator()` is abstract, you will still need to override it in your class until it is removed. It is only ever called if you don't override `ActuatorComponent.CreateActuators()`. You can suppress the warnings by surrounding the method with the following pragma:
+ ```c#
+ #pragma warning disable 672
+ public IActuator CreateActuator() { ... }
+ #pragma warning restore 672
+ ```
+
+
+## Migrating to Release 11
+### Agent virtual method deprecation
+ - `Agent.CollectDiscreteActionMasks()` was deprecated and should be replaced with `Agent.WriteDiscreteActionMask()`
+ - `Agent.Heuristic(float[])` was deprecated and should be replaced with `Agent.Heuristic(ActionBuffers)`.
+ - `Agent.OnActionReceived(float[])` was deprecated and should be replaced with `Agent.OnActionReceived(ActionBuffers)`.
+ - `Agent.GetAction()` was deprecated and should be replaced with `Agent.GetStoredActionBuffers()`.
+
+The default implementation of these will continue to call the deprecated versions where appropriate. However, the deprecated versions may not be compatible with continuous and discrete actions on the same Agent.
+
+### BrainParameters field and method deprecation
+ - `BrainParameters.VectorActionSize` was deprecated; you can now set `BrainParameters.ActionSpec.NumContinuousActions` or `BrainParameters.ActionSpec.BranchSizes` instead.
+ - `BrainParameters.VectorActionSpaceType` was deprecated, since both continuous and discrete actions can now be used.
+ - `BrainParameters.NumActions()` was deprecated. Use `BrainParameters.ActionSpec.NumContinuousActions` and
+ `BrainParameters.ActionSpec.NumDiscreteActions` instead.
+
+## Migrating from Release 7 to latest
+
+### Important changes
+- Some trainer files were moved. If you were using the `TrainerFactory` class, it was moved to the `trainers/trainer` folder.
+- The `components` folder containing `bc` and `reward_signals` code was moved to the `trainers/tf` folder
+
+### Steps to Migrate
+- Replace calls to `from mlagents.trainers.trainer_util import TrainerFactory` to `from mlagents.trainers.trainer import TrainerFactory`
+- Replace calls to `from mlagents.trainers.trainer_util import handle_existing_directories` to `from mlagents.trainers.directory_utils import validate_existing_directories`
+- Replace `mlagents.trainers.components` with `mlagents.trainers.tf.components` in your import statements.
+
+
+## Migrating from Release 3 to Release 7
+
+### Important changes
+- The Parameter Randomization feature has been merged with the Curriculum feature. It is now possible to specify a sampler in the lesson of a Curriculum. Curriculum has been refactored and is now specified at the level of the parameter, not the behavior. More information [here](Training-ML-Agents.md#curriculum).
+
+### Steps to Migrate
+- The configuration format for curriculum and parameter randomization has changed. To upgrade your configuration files, an upgrade script has been provided. Run `python -m mlagents.trainers.upgrade_config -h` to see the script usage. Note that you will need to upgrade to or install the current version of ML-Agents before running the script. To update manually:
+ - If your config file used a `parameter_randomization` section, rename that section to `environment_parameters`
+ - If your config file used a `curriculum` section, you will need to rewrite your curriculum with this [format](Training-ML-Agents.md#curriculum).
+
+## Migrating from Release 1 to Release 3
+
+### Important changes
+- Training artifacts (trained models, summaries) are now found under `results/` instead of `summaries/` and `models/`.
+- Trainer configuration, curriculum configuration, and parameter randomization configuration have all been moved to a single YAML file. (#3791)
+- Trainer configuration format has changed, and using a "default" behavior name has been deprecated. (#3936)
+- `max_step` in the `TerminalStep` and `TerminalSteps` objects was renamed `interrupted`.
+- On the UnityEnvironment API, `get_behavior_names()` and `get_behavior_specs()` methods were combined into the property `behavior_specs` that contains a mapping from behavior names to behavior spec.
+- `use_visual` and `allow_multiple_visual_obs` in the `UnityToGymWrapper` constructor were replaced by `allow_multiple_obs` which allows one or more visual observations and vector observations to be used simultaneously.
+- `--save-freq` has been removed from the CLI and is now configurable in the trainer configuration file.
+- `--lesson` has been removed from the CLI. Lessons will resume when using `--resume`. To start at a different lesson, modify your Curriculum configuration.
+
+### Steps to Migrate
+- To upgrade your configuration files, an upgrade script has been provided. Run `python -m mlagents.trainers.upgrade_config -h` to see the script usage. Note that you will need to upgrade to or install the current version of ML-Agents before running the script.
+
+- To do it manually, copy your `<behavior_name>` sections from `trainer_config.yaml` into a separate trainer configuration file, under a `behaviors` section. The `default` section is no longer needed. This new file should be specific to your environment, and not contain configurations for multiple environments (unless they have the same Behavior Names).
+ - You will need to reformat your trainer settings as per the [example](Training-ML-Agents.md).
+ - If your training uses [curriculum](Training-ML-Agents.md#curriculum-learning), move those configurations under a `curriculum` section.
+ - If your training uses [parameter randomization](Training-ML-Agents.md#environment-parameter-randomization), move the contents of the sampler config to `parameter_randomization` in the main trainer configuration.
+- If you are using `UnityEnvironment` directly, replace `max_step` with `interrupted` in the `TerminalStep` and `TerminalSteps` objects.
+  - Replace usage of `get_behavior_names()` and `get_behavior_specs()` in UnityEnvironment with `behavior_specs` (see the sketch after this list).
+ - If you use the `UnityToGymWrapper`, remove `use_visual` and `allow_multiple_visual_obs` from the constructor and add `allow_multiple_obs = True` if the environment contains either both visual and vector observations or multiple visual observations.
+ - If you were setting `--save-freq` in the CLI, add a `checkpoint_interval` value in your trainer configuration, and set it equal to `save-freq * n_agents_in_scene`.
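+
+For illustration, here is a minimal sketch of the `behavior_specs` replacement described above. The build path is a placeholder, and the old calls are shown only as comments:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+
+env = UnityEnvironment(file_name="path/to/your/build")  # placeholder path
+env.reset()
+
+# Before: names = env.get_behavior_names(); specs = env.get_behavior_specs(...)
+# After: behavior_specs is a mapping from behavior name to behavior spec.
+for behavior_name in env.behavior_specs:
+    decision_steps, terminal_steps = env.get_steps(behavior_name)
+    print(behavior_name, len(decision_steps))
+
+env.close()
+```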
+
+## Migrating from 0.15 to Release 1
+
+### Important changes
+
+- The `MLAgents` C# namespace was renamed to `Unity.MLAgents`, and other nested namespaces were similarly renamed (#3843).
+- The `--load` and `--train` command-line flags have been deprecated and replaced with `--resume` and `--inference`.
+- Running with the same `--run-id` twice will now throw an error.
+- The `play_against_current_self_ratio` self-play trainer hyperparameter has been renamed to `play_against_latest_model_ratio`
+- Removed the multi-agent gym option from the gym wrapper. For multi-agent scenarios, use the [Low Level Python API](Python-LLAPI.md).
+- The low level Python API has changed. You can look at the document [Low Level Python API documentation](Python-LLAPI.md) for more information. If you use `mlagents-learn` for training, this should be a transparent change.
+- The obsolete `Agent` methods `GiveModel`, `Done`, `InitializeAgent`, `AgentAction` and `AgentReset` have been removed.
+- The signature of `Agent.Heuristic()` was changed to take a `float[]` as a parameter, instead of returning the array. This was done to prevent a common source of error where users would return arrays of the wrong size.
+- The SideChannel API has changed (#3833, #3660):
+ - Introduced the `SideChannelManager` to register, unregister and access side channels.
+ - `EnvironmentParameters` replaces the default `FloatProperties`. You can access the `EnvironmentParameters` with `Academy.Instance.EnvironmentParameters` on C#. If you were previously creating a `UnityEnvironment` in python and passing it a `FloatPropertiesChannel`, create an `EnvironmentParametersChannel` instead.
+ - `SideChannel.OnMessageReceived` is now a protected method (was public)
+ - SideChannel IncomingMessages methods now take an optional default argument, which is used when trying to read more data than the message contains.
+ - Added a feature to allow sending stats from C# environments to TensorBoard (and other python StatsWriters). To do this from your code, use `Academy.Instance.StatsRecorder.Add(key, value)`(#3660)
+- `num_updates` and `train_interval` for SAC have been replaced with `steps_per_update`.
+- The `UnityEnv` class from the `gym-unity` package was renamed `UnityToGymWrapper` and no longer creates the `UnityEnvironment`. Instead, the `UnityEnvironment` must be passed as input to the constructor of `UnityToGymWrapper`
+- Public fields and properties on several classes were renamed to follow Unity's C# style conventions. All public fields and properties now use "PascalCase" instead of "camelCase"; for example, `Agent.maxStep` was renamed to `Agent.MaxStep`. For a full list of changes, see the pull request. (#3828)
+- `WriteAdapter` was renamed to `ObservationWriter`. (#3834)
+
+### Steps to Migrate
+
+- In C# code, replace `using MLAgents` with `using Unity.MLAgents`. Replace other nested namespaces such as `using MLAgents.Sensors` with `using Unity.MLAgents.Sensors`
+- Replace the `--load` flag with `--resume` when calling `mlagents-learn`, and don't use the `--train` flag as training will happen by default. To run without training, use `--inference`.
+- To force-overwrite files from a pre-existing run, add the `--force` command-line flag.
+- The Jupyter notebooks have been removed from the repository.
+- If your Agent class overrides `Heuristic()`, change the signature to `public override void Heuristic(float[] actionsOut)` and assign values to `actionsOut` instead of returning an array.
+- If you used `SideChannels` you must:
+ - Replace `Academy.FloatProperties` with `Academy.Instance.EnvironmentParameters`.
+ - `Academy.RegisterSideChannel` and `Academy.UnregisterSideChannel` were removed. Use `SideChannelManager.RegisterSideChannel` and `SideChannelManager.UnregisterSideChannel` instead.
+- Set `steps_per_update` to be roughly equal to the number of agents in your environment, multiplied by `num_updates` and divided by `train_interval`.
+- Replace `UnityEnv` with `UnityToGymWrapper` in your code. The constructor no longer takes a file name as input but a fully constructed `UnityEnvironment` instead.
+- Update uses of "camelCase" fields and properties to "PascalCase".
+
+## Migrating from 0.14 to 0.15
+
+### Important changes
+
+- The `Agent.CollectObservations()` virtual method now takes as input a `VectorSensor` sensor as argument. The `Agent.AddVectorObs()` methods were removed.
+- The `SetActionMask` method was renamed to `SetMask` and must now be called on the `DiscreteActionMasker` argument of the `CollectDiscreteActionMasks` virtual method.
+- We consolidated our API for `DiscreteActionMasker`. `SetMask` takes two arguments: the branch index and the list of masked actions for that branch.
+- The `Monitor` class has been moved to the Examples Project. (It was prone to errors during testing)
+- The `MLAgents.Sensors` namespace has been introduced. All sensor classes are part of the `MLAgents.Sensors` namespace.
+- The `MLAgents.SideChannels` namespace has been introduced. All side channel classes are part of the `MLAgents.SideChannels` namespace.
+- The interface for `RayPerceptionSensor.PerceiveStatic()` was changed to take an input class and write to an output class, and the method was renamed to `Perceive()`.
+- The `SetMask` method must now be called on the `DiscreteActionMasker` argument of the `CollectDiscreteActionMasks` method.
+- The method `GetStepCount()` on the Agent class has been replaced with the property getter `StepCount`
+- The `--multi-gpu` option has been removed temporarily.
+- `AgentInfo.actionMasks` has been renamed to `AgentInfo.discreteActionMasks`.
+- `BrainParameters` and `SpaceType` have been removed from the public API
+- `BehaviorParameters` have been removed from the public API.
+- `DecisionRequester` has been made internal (you can still use the DecisionRequesterComponent from the inspector). `RepeatAction` was renamed `TakeActionsBetweenDecisions` for clarity.
+- The following methods in the `Agent` class have been renamed. The original method names will be removed in a later release:
+ - `InitializeAgent()` was renamed to `Initialize()`
+ - `AgentAction()` was renamed to `OnActionReceived()`
+ - `AgentReset()` was renamed to `OnEpisodeBegin()`
+ - `Done()` was renamed to `EndEpisode()`
+ - `GiveModel()` was renamed to `SetModel()`
+- The `IFloatProperties` interface has been removed.
+- The interface for SideChannels was changed:
+  - In C#, `OnMessageReceived` now takes an `IncomingMessage` argument, and `QueueMessageToSend` takes an `OutgoingMessage` argument.
+  - In Python, `on_message_received` now takes an `IncomingMessage` argument, and `queue_message_to_send` takes an `OutgoingMessage` argument.
+ - Automatic stepping for Academy is now controlled from the AutomaticSteppingEnabled property.
+
+### Steps to Migrate
+
+- Add `using MLAgents.Sensors;` in addition to `using MLAgents;` at the top of your Agent's script.
+- Replace your Agent's implementation of `CollectObservations()` with `CollectObservations(VectorSensor sensor)`. In addition, replace all calls to `AddVectorObs()` with `sensor.AddObservation()` or `sensor.AddOneHotObservation()` on the `VectorSensor` passed as argument.
+- Replace your calls to `SetActionMask` on your Agent to `DiscreteActionMasker.SetActionMask` in `CollectDiscreteActionMasks`.
+- If you call `RayPerceptionSensor.PerceiveStatic()` manually, add your inputs to a `RayPerceptionInput`. To get the previous float array output, iterate through `RayPerceptionOutput.rayOutputs` and call `RayPerceptionOutput.RayOutput.ToFloatArray()`.
+- Replace all calls to `Agent.GetStepCount()` with `Agent.StepCount`
+- We strongly recommend replacing the following methods with their new equivalent as they will be removed in a later release:
+ - `InitializeAgent()` to `Initialize()`
+ - `AgentAction()` to `OnActionReceived()`
+ - `AgentReset()` to `OnEpisodeBegin()`
+ - `Done()` to `EndEpisode()`
+ - `GiveModel()` to `SetModel()`
+- Replace `IFloatProperties` variables with `FloatPropertiesChannel` variables.
+- If you implemented custom `SideChannels`, update the signatures of your methods, and add your data to the `OutgoingMessage` or read it from the `IncomingMessage`.
+- Replace calls to Academy.EnableAutomaticStepping()/DisableAutomaticStepping() with Academy.AutomaticSteppingEnabled = true/false.
+
+## Migrating from 0.13 to 0.14
+
+### Important changes
+
+- The `UnitySDK` folder has been split into a Unity Package (`com.unity.ml-agents`) and an examples project (`Project`). Please follow the [Installation Guide](Installation.md) to get up and running with this new repo structure.
+- Several changes were made to how agents are reset and marked as done:
+ - Calling `Done()` on the Agent will now reset it immediately and call the `AgentReset` virtual method. (This is to simplify the previous logic in which the Agent had to wait for the next `EnvironmentStep` to reset)
+ - The "Reset on Done" setting in AgentParameters was removed; this is now effectively always true. `AgentOnDone` virtual method on the Agent has been removed.
+- The `Decision Period` and `On Demand decision` checkbox have been removed from the Agent. On demand decision is now the default (calling `RequestDecision` on the Agent manually.)
+- The Academy class was changed to a singleton, and its virtual methods were removed.
+- Trainer steps are now counted per-Agent, not per-environment as in previous versions. For instance, if you have 10 Agents in the scene, 20 environment steps now correspond to 200 steps as printed in the terminal and in Tensorboard.
+- Curriculum config files are now YAML formatted and all curricula for a training run are combined into a single file.
+- The `--num-runs` command-line option has been removed from `mlagents-learn`.
+- Several fields on the Agent were removed or made private in order to simplify the interface.
+ - The `agentParameters` field of the Agent has been removed. (Contained only `maxStep` information)
+ - `maxStep` is now a public field on the Agent. (Was moved from `agentParameters`)
+ - The `Info` field of the Agent has been made private. (Was only used internally and not meant to be modified outside of the Agent)
+ - The `GetReward()` method on the Agent has been removed. (It was being confused with `GetCumulativeReward()`)
+ - The `AgentAction` struct no longer contains a `value` field. (Value estimates were not set during inference)
+ - The `GetValueEstimate()` method on the Agent has been removed.
+ - The `UpdateValueAction()` method on the Agent has been removed.
+- The deprecated `RayPerception3D` and `RayPerception2D` classes were removed, and the `legacyHitFractionBehavior` argument was removed from `RayPerceptionSensor.PerceiveStatic()`.
+- RayPerceptionSensor was inconsistent in how it handled scale on the Agent's transform. It now scales the ray length and sphere size for casting as the transform's scale changes.
+
+### Steps to Migrate
+
+- Follow the instructions on how to install the `com.unity.ml-agents` package into your project in the [Installation Guide](Installation.md).
+- If your Agent implemented `AgentOnDone` and did not have the checkbox `Reset On Done` checked in the inspector, you must call the code that was in `AgentOnDone` manually.
+- If you give your Agent a reward or penalty at the end of an episode (e.g. for reaching a goal or falling off of a platform), make sure you call `AddReward()` or `SetReward()` _before_ calling `Done()`. Previously, the order didn't matter.
+- If you were not using `On Demand Decision` for your Agent, you **must** add a `DecisionRequester` component to your Agent GameObject and set its `Decision Period` field to the old `Decision Period` of the Agent.
+- If you have a class that inherits from Academy:
+ - If the class didn't override any of the virtual methods and didn't store any additional data, you can just remove the old script from the scene.
+ - If the class had additional data, create a new MonoBehaviour and store the data in the new MonoBehaviour instead.
+ - If the class overrode the virtual methods, create a new MonoBehaviour and move the logic to it:
+ - Move the InitializeAcademy code to MonoBehaviour.Awake
+ - Move the AcademyStep code to MonoBehaviour.FixedUpdate
+ - Move the OnDestroy code to MonoBehaviour.OnDestroy.
+ - Move the AcademyReset code to a new method and add it to the Academy.OnEnvironmentReset action.
+- Multiply `max_steps` and `summary_freq` in your `trainer_config.yaml` by the number of Agents in the scene.
+- Combine curriculum configs into a single file. See [the WallJump curricula](https://github.com/Unity-Technologies/ml-agents/blob/0.14.1/config/curricula/wall_jump.yaml) for an example of the new curriculum config format. A tool like https://www.json2yaml.com may be useful to help with the conversion.
+- If you have a model trained which uses RayPerceptionSensor and has non-1.0 scale in the Agent's transform, it must be retrained.
+
+## Migrating from ML-Agents Toolkit v0.12.0 to v0.13.0
+
+### Important changes
+
+- The low level Python API has changed. You can look at the document [Low Level Python API documentation](Python-LLAPI.md) for more information. This should only affect you if you're writing a custom trainer; if you use `mlagents-learn` for training, this should be a transparent change.
+ - `reset()` on the Low-Level Python API no longer takes a `train_mode` argument. To modify the performance/speed of the engine, you must use an `EngineConfigurationChannel`
+ - `reset()` on the Low-Level Python API no longer takes a `config` argument. `UnityEnvironment` no longer has a `reset_parameters` field. To modify float properties in the environment, you must use a `FloatPropertiesChannel`. For more information, refer to the [Low Level Python API documentation](Python-LLAPI.md)
+- `CustomResetParameters` are now removed.
+- The Academy no longer has a `Training Configuration` nor `Inference Configuration` field in the inspector. To modify the configuration from the Low-Level Python API, use an `EngineConfigurationChannel`. To modify it during training, use the new command line arguments `--width`, `--height`, `--quality-level`, `--time-scale` and `--target-frame-rate` in `mlagents-learn`.
+- The Academy no longer has a `Default Reset Parameters` field in the inspector. The Academy class no longer has a `ResetParameters`. To access shared float properties with Python, use the new `FloatProperties` field on the Academy.
+- Offline Behavioral Cloning has been removed. To learn from demonstrations, use the GAIL and Behavioral Cloning features with either PPO or SAC.
+- `mlagents.envs` was renamed to `mlagents_envs`. The previous repo layout depended on [PEP420](https://www.python.org/dev/peps/pep-0420/), which caused problems with some of our tooling such as mypy and pylint.
+- The official version of Unity that ML-Agents supports is now 2022.3 LTS. If you run into issues, please consider deleting your Library folder and reopening your project. You will need to install the Sentis package into your project in order for ML-Agents to compile correctly.
+
+### Steps to Migrate
+
+- If you had a custom `Training Configuration` in the Academy inspector, you will need to pass your custom configuration at every training run using the new command line arguments `--width`, `--height`, `--quality-level`, `--time-scale` and `--target-frame-rate`.
+- If you were using `--slow` in `mlagents-learn`, you will need to pass your old `Inference Configuration` of the Academy inspector with the new command line arguments `--width`, `--height`, `--quality-level`, `--time-scale` and `--target-frame-rate` instead.
+- Any imports from `mlagents.envs` should be replaced with `mlagents_envs`.
+
+## Migrating from ML-Agents Toolkit v0.11.0 to v0.12.0
+
+### Important Changes
+
+- Text actions and observations, and custom action and observation protos have been removed.
+- RayPerception3D and RayPerception2D are marked deprecated, and will be removed in a future release. They can be replaced by RayPerceptionSensorComponent3D and RayPerceptionSensorComponent2D.
+- The `Use Heuristic` checkbox in Behavior Parameters has been replaced with a `Behavior Type` dropdown menu. This has the following options:
+ - `Default` corresponds to the previous unchecked behavior, meaning that Agents will train if they connect to a python trainer, otherwise they will perform inference.
+ - `Heuristic Only` means the Agent will always use the `Heuristic()` method. This corresponds to having "Use Heuristic" selected in 0.11.0.
+ - `Inference Only` means the Agent will always perform inference.
+- ML-Agents was upgraded to use Sentis 1.2.0-exp.2 and is installed via the package manager.
+
+### Steps to Migrate
+
+- We [fixed a bug](https://github.com/Unity-Technologies/ml-agents/pull/2823) in `RayPerception3d.Perceive()` that was causing the `endOffset` to be used incorrectly. However this may produce different behavior from previous versions if you use a non-zero `startOffset`. To reproduce the old behavior, you should increase the value of `endOffset` by `startOffset`. You can verify your raycasts are performing as expected in scene view using the debug rays.
+- If you use RayPerception3D, replace it with RayPerceptionSensorComponent3D (and similarly for 2D). The settings, such as ray angles and detectable tags, are configured on the component now. RayPerception3D would contribute `(# of rays) * (# of tags + 2)` to the State Size in Behavior Parameters, but this is no longer necessary, so you should reduce the State Size by this amount. Making this change will require retraining your model, since the observations that RayPerceptionSensorComponent3D produces are different from the old behavior.
+- If you see messages such as `The type or namespace 'Sentis' could not be found` or `The type or namespace 'Google' could not be found`, you will need to [install the Sentis preview package](Installation.md#package-installation).
+
+## Migrating from ML-Agents Toolkit v0.10 to v0.11.0
+
+### Important Changes
+
+- The definition of the gRPC service has changed.
+- The online BC training feature has been removed.
+- The BroadcastHub has been deprecated. If there is a training Python process, all LearningBrains in the scene will automatically be trained. If there is no Python process, inference will be used.
+- The Brain ScriptableObjects have been deprecated. The Brain Parameters are now on the Agent and are referred to as Behavior Parameters. Make sure the Behavior Parameters is attached to the Agent GameObject.
+- To use a heuristic behavior, implement the `Heuristic()` method in the Agent class and check the `use heuristic` checkbox in the Behavior Parameters.
+- Several changes were made to the setup for visual observations (i.e. using Cameras or RenderTextures):
+ - Camera resolutions are no longer stored in the Brain Parameters.
+ - AgentParameters no longer stores lists of Cameras and RenderTextures
+ - To add visual observations to an Agent, you must now attach a CameraSensorComponent or RenderTextureComponent to the agent. The corresponding Camera or RenderTexture can be added to these in the editor, and the resolution and color/grayscale is configured on the component itself.
+
+#### Steps to Migrate
+
+- In order to be able to train, make sure both your ML-Agents Python package and UnitySDK code come from the v0.11 release. Training will not work, for example, if you update the ML-Agents Python package, and only update the API Version in UnitySDK.
+- If your Agents used visual observations, you must add a CameraSensorComponent corresponding to each old Camera in the Agent's camera list (and similarly for RenderTextures).
+- Since Brain ScriptableObjects have been removed, you will need to delete all the Brain ScriptableObjects from your `Assets` folder. Then, add a `Behavior Parameters` component to each `Agent` GameObject. You will then need to complete the fields on the new `Behavior Parameters` component with the BrainParameters of the old Brain.
+
+## Migrating from ML-Agents Toolkit v0.9 to v0.10
+
+### Important Changes
+
+- We have updated the C# code in our repository to be in line with Unity Coding Conventions. This has changed the name of some public facing classes and enums.
+- The example environments have been updated. If you were using these environments to benchmark your training, please note that the resulting rewards may be slightly different in v0.10.
+
+#### Steps to Migrate
+
+- `UnitySDK/Assets/ML-Agents/Scripts/Communicator.cs` and its class `Communicator` have been renamed to `UnitySDK/Assets/ML-Agents/Scripts/ICommunicator.cs` and `ICommunicator` respectively.
+- The `SpaceType` Enums `discrete`, and `continuous` have been renamed to `Discrete` and `Continuous`.
+- We have removed the `Done` call as well as the capacity to set `Max Steps` on the Academy. Therefore, an AcademyReset will never be triggered from C# (only from Python). If you want to reset the simulation after a fixed number of steps, or when an event in the simulation occurs, we recommend looking at our multi-agent example environments (such as FoodCollector). In our examples, groups of Agents can be reset through an "Area" that can reset groups of Agents.
+- The import for `mlagents.envs.UnityEnvironment` was removed. If you are using the Python API, change `from mlagents_envs import UnityEnvironment` to `from mlagents_envs.environment import UnityEnvironment`.
+
+## Migrating from ML-Agents Toolkit v0.8 to v0.9
+
+### Important Changes
+
+- We have changed the way reward signals (including Curiosity) are defined in the `trainer_config.yaml`.
+- When using multiple environments, every "step" is recorded in TensorBoard.
+- The steps in the command line console correspond to a single step of a single environment. Previously, each step corresponded to one step for all environments (i.e., `num_envs` steps).
+
+#### Steps to Migrate
+
+- If you were overriding any of these following parameters in your config file, remove them from the top-level config and follow the steps below:
+  - `gamma`: Define a new `extrinsic` reward signal and set its `gamma` to your new gamma.
+ - `use_curiosity`, `curiosity_strength`, `curiosity_enc_size`: Define a `curiosity` reward signal and set its `strength` to `curiosity_strength`, and `encoding_size` to `curiosity_enc_size`. Give it the same `gamma` as your `extrinsic` signal to mimic previous behavior.
+- TensorBoards generated when running multiple environments in v0.8 are not comparable to those generated in v0.9 in terms of step count. Multiply your v0.8 step count by `num_envs` for an approximate comparison. You may need to change `max_steps` in your config as appropriate as well.
+
+## Migrating from ML-Agents Toolkit v0.7 to v0.8
+
+### Important Changes
+
+- We have split the Python packages into two separate packages `ml-agents` and `ml-agents-envs`.
+- `--worker-id` option of `learn.py` has been removed, use `--base-port` instead if you'd like to run multiple instances of `learn.py`.
+
+#### Steps to Migrate
+
+- If you are installing via PyPI, there is no change.
+- If you intend to make modifications to `ml-agents` or `ml-agents-envs`, please check the Installing for Development section of the [Installation documentation](Installation.md).
+
+## Migrating from ML-Agents Toolkit v0.6 to v0.7
+
+### Important Changes
+
+- We no longer support TensorFlowSharp (TFS) and are now using [Sentis](Inference-Engine.md) for inference.
+
+#### Steps to Migrate
+
+- Make sure to remove the `ENABLE_TENSORFLOW` flag in your Unity Project settings
+
+## Migrating from ML-Agents Toolkit v0.5 to v0.6
+
+### Important Changes
+
+- Brains are now Scriptable Objects instead of MonoBehaviors.
+- You can no longer modify the type of a Brain. If you want to switch between `PlayerBrain` and `LearningBrain` for multiple agents, you will need to assign a new Brain to each agent separately. **Note:** You can pass the same Brain to multiple agents in a scene by leveraging Unity's prefab system or look for all the agents in a scene using the search bar of the `Hierarchy` window with the word `Agent`.
+
+- We replaced the **Internal** and **External** Brain with **Learning Brain**. When you need to train a model, you need to drag it into the `Broadcast Hub` inside the `Academy` and check the `Control` checkbox.
+- We removed the `Broadcast` checkbox of the Brain; to use the broadcast functionality, you need to drag the Brain into the `Broadcast Hub`.
+- When training multiple Brains at the same time, each model is now stored into a separate model file rather than in the same file under different graph scopes.
+- The **Learning Brain** graph scope, placeholder names, output names and custom placeholders can no longer be modified.
+
+#### Steps to Migrate
+
+- To update a scene from v0.5 to v0.6, you must:
+ - Remove the `Brain` GameObjects in the scene. (Delete all of the Brain GameObjects under Academy in the scene.)
+ - Create new `Brain` Scriptable Objects using `Assets -> Create -> ML-Agents` for each type of the Brain you plan to use, and put the created files under a folder called Brains within your project.
+ - Edit their `Brain Parameters` to be the same as the parameters used in the `Brain` GameObjects.
+  - Agents have a `Brain` field in the Inspector; you need to drag the appropriate Brain ScriptableObject into it.
+  - The Academy has a `Broadcast Hub` field in the inspector, which is a list of brains used in the scene. To train or control your Brain from the `mlagents-learn` Python script, you need to drag the relevant `LearningBrain` ScriptableObjects used in your scene into entries in this list.
+
+## Migrating from ML-Agents Toolkit v0.4 to v0.5
+
+### Important
+
+- The Unity project `unity-environment` has been renamed `UnitySDK`.
+- The `python` folder has been renamed to `ml-agents`. It now contains two packages, `mlagents.env` and `mlagents.trainers`. `mlagents.env` can be used to interact directly with a Unity environment, while `mlagents.trainers` contains the classes for training agents.
+- The supported Unity version has changed from `2017.1 or later` to `2017.4 or later`. 2017.4 is an LTS (Long Term Support) version that helps us maintain good quality and support. Earlier versions of Unity might still work, but you may encounter an [error](FAQ.md#instance-of-corebraininternal-couldnt-be-created) listed here.
+
+### Unity API
+
+- Discrete Actions now use [branches](https://arxiv.org/abs/1711.08946). You can now specify concurrent discrete actions. You will need to update the Brain Parameters in the Brain Inspector in all your environments that use discrete actions. Refer to the [discrete action documentation](Learning-Environment-Design-Agents.md#discrete-action-space) for more information.
+
+### Python API
+
+- In order to run a training session, you can now use the command `mlagents-learn` instead of `python3 learn.py` after installing the `mlagents` packages. This change is documented [here](Training-ML-Agents.md#training-with-mlagents-learn). For example, if we previously ran
+
+ ```sh
+ python3 learn.py 3DBall --train
+ ```
+
+from the `python` subdirectory (which is changed to `ml-agents` subdirectory in v0.5), we now run
+
+ ```sh
+ mlagents-learn config/trainer_config.yaml --env=3DBall --train
+ ```
+
+from the root directory where we installed the ML-Agents Toolkit.
+
+- It is now required to specify the path to the yaml trainer configuration file when running `mlagents-learn`. For an example trainer configuration file, see [trainer_config.yaml](https://github.com/Unity-Technologies/ml-agents/blob/0.5.0a/config/trainer_config.yaml). An example of passing a trainer configuration to `mlagents-learn` is shown above.
+- The environment name is now passed through the `--env` option.
+- Curriculum learning has been changed. In summary:
+ - Curriculum files for the same environment must now be placed into a folder. Each curriculum file should be named after the Brain whose curriculum it specifies.
+ - `min_lesson_length` now specifies the minimum number of episodes in a lesson and affects reward thresholding.
+ - It is no longer necessary to specify the `Max Steps` of the Academy to use curriculum learning.
+
+## Migrating from ML-Agents Toolkit v0.3 to v0.4
+
+### Unity API
+
+- `using MLAgents;` needs to be added in all of the C# scripts that use ML-Agents.
+
+### Python API
+
+- We've changed some of the Python package dependencies in the requirements.txt file. Make sure to run `pip3 install -e .` within your `ml-agents/python` folder to update your Python packages.
+
+## Migrating from ML-Agents Toolkit v0.2 to v0.3
+
+There are a large number of new features and improvements in the ML-Agents Toolkit v0.3 which change both the training process and Unity API in ways which will cause incompatibilities with environments made using older versions. This page is designed to highlight those changes for users familiar with v0.1 or v0.2 in order to ensure a smooth transition.
+
+### Important
+
+- The ML-Agents Toolkit is no longer compatible with Python 2.
+
+### Python Training
+
+- The training script `ppo.py` and `PPO.ipynb` Python notebook have been replaced with a single `learn.py` script as the launching point for training with ML-Agents. For more information on using `learn.py`, see [here](Training-ML-Agents.md#training-with-mlagents-learn).
+- Hyperparameters for training Brains are now stored in the `trainer_config.yaml` file. For more information on using this file, see [here](Training-ML-Agents.md#training-configurations).
+
+### Unity API
+
+- Modifications to an Agent's rewards must now be done using either `AddReward()` or `SetReward()`.
+- Setting an Agent to done now requires the use of the `Done()` method.
+- `CollectStates()` has been replaced by `CollectObservations()`, which now no longer returns a list of floats.
+- To collect observations, call `AddVectorObs()` within `CollectObservations()`. Note that you can call `AddVectorObs()` with floats, integers, lists and arrays of floats, Vector3 and Quaternions.
+- `AgentStep()` has been replaced by `AgentAction()`.
+- `WaitTime()` has been removed.
+- The `Frame Skip` field of the Academy is replaced by the Agent's `Decision Frequency` field, enabling the Agent to make decisions at different frequencies.
+- The names of the inputs in the Internal Brain have been changed. You must replace `state` with `vector_observation` and `observation` with `visual_observation`. In addition, you must remove the `epsilon` placeholder.
+
+### Semantics
+
+In order to more closely align with the terminology used in the Reinforcement Learning field, and to be more descriptive, we have changed the names of some of the concepts used in ML-Agents. The changes are highlighted in the table below.
+
+| Old - v0.2 and earlier | New - v0.3 and later |
+| ---------------------- | -------------------- |
+| State | Vector Observation |
+| Observation | Visual Observation |
+| Action | Vector Action |
+| N/A | Text Observation |
+| N/A | Text Action |
diff --git a/com.unity.ml-agents/Documentation~/Package-Limitations.md b/com.unity.ml-agents/Documentation~/Package-Limitations.md
new file mode 100644
index 0000000000..2ecb4c5696
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Package-Limitations.md
@@ -0,0 +1,20 @@
+## Package Limitations
+### Training
+Training is limited to the Unity Editor and Standalone builds on Windows, MacOS, and Linux with the Mono scripting backend. Currently, training does not work with the IL2CPP scripting backend. Your environment will default to inference mode if training is not supported or is not currently running.
+
+### Inference
+Inference is executed via Unity Inference Engine on the end-user device. Therefore, it is subject to the performance limitations of the end-user CPU or GPU. Also, only models created with our trainers are supported for running ML-Agents with a neural network behavior.
+
+### Headless Mode
+If you enable Headless mode, you will not be able to collect visual observations from your agents.
+
+### Rendering Speed and Synchronization
+Currently the speed of the game physics can only be increased to 100x real-time. The Academy (the sentinel that controls the stepping of the game to make sure everything is synchronized, from collection of observations to applying actions generated from policy inference to the agent) also moves in time with `FixedUpdate()` rather than `Update()`, so game behavior implemented in `Update()` may be out of sync with the agent decision-making. See [Execution Order of Event Functions](https://docs.unity3d.com/Manual/execution-order.html) for more information.
+
+You can control the frequency of Academy stepping by calling `Academy.Instance.DisableAutomaticStepping()`, and then calling `Academy.Instance.EnvironmentStep()`.
+
+### Input System Integration
+
+For `InputActuatorComponent` (see [Input System Integration](InputSystem-Integration.md) for more information):
+- Limited implementation of `InputControls`
+- No way to customize the action space of the `InputActuatorComponent`
diff --git a/com.unity.ml-agents/Documentation~/Package-Settings.md b/com.unity.ml-agents/Documentation~/Package-Settings.md
new file mode 100644
index 0000000000..df21ed9a22
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Package-Settings.md
@@ -0,0 +1,23 @@
+# ML-Agents Package Settings
+
+ML-Agents Package Settings contains settings that apply to the whole project. It allows you to configure ML-Agents-specific settings in the Editor. These settings are available for use in both the Editor and Player.
+
+You can find them at `Edit` > `Project Settings...` > `ML-Agents`. It lists out all the available settings and their default values.
+
+
+## Create Custom Settings
+In order to use your own settings for your project, you'll need to create a settings asset.
+
+You can do this by clicking the `Create Settings Asset` button, or by clicking the gear icon in the top right and selecting `New Settings Asset...`. The asset file can be placed anywhere in the `Assets/` folder in your project. After creating the settings asset, you'll be able to modify the settings for your project, and your settings will be saved in the asset.
+
+
+
+
+## Multiple Custom Settings for Different Scenarios
+You can create multiple settings assets in one project.
+
+By clicking the gear on the top right you'll see all available settings listed in the drop-down menu to choose from.
+
+This allows you to create different settings for different scenarios. For example, you can create two separate settings for training and inference, and specify which one you want to use according to what you're currently running.
+
+
diff --git a/com.unity.ml-agents/Documentation~/Profiling-Python.md b/com.unity.ml-agents/Documentation~/Profiling-Python.md
new file mode 100644
index 0000000000..4b5079a26f
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Profiling-Python.md
@@ -0,0 +1,49 @@
+# Profiling in Python
+
+As part of the ML-Agents Toolkit, we provide a lightweight profiling system in order to identify hotspots in the training process and help spot regressions from changes.
+
+Timers are hierarchical, meaning that the time tracked in a block of code can be further split into other blocks if desired. This also means that a function that is called from multiple places in the code will appear in multiple places in the timing output.
+
+All timers operate using a "global" instance by default, but this can be overridden if necessary (mainly for testing).
+
+## Adding Profiling
+
+There are two ways to indicate that code should be included in profiling. The simplest way is to add the `@timed` decorator to a function or method of interest.
+
+```python
+class TrainerController:
+ # ....
+ @timed
+ def advance(self, env: EnvManager) -> int:
+ # do stuff
+```
+
+You can also use the `hierarchical_timer` context manager.
+
+```python
+with hierarchical_timer("communicator.exchange"):
+ outputs = self.communicator.exchange(step_input)
+```
+
+The context manager may be easier than the `@timed` decorator for profiling different parts of a large function, or for profiling calls to abstract methods that might not use the decorator.
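+
+For example, the two mechanisms can be nested, and nested blocks appear as children of the enclosing timer in the output. A small sketch (the `mlagents_envs.timers` import path is assumed here):
+
+```python
+from mlagents_envs.timers import hierarchical_timer, timed
+
+@timed
+def advance_simulation():
+    # This block shows up as a child of advance_simulation in the timer tree.
+    with hierarchical_timer("env.step"):
+        pass  # step the environment here
+```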
+
+## Output
+
+By default, at the end of training, timers are collected and written in JSON format to `{summaries_dir}/{run_id}_timers.json`. The output consists of node objects with the following keys:
+
+- total (float): The total time in seconds spent in the block, including child calls.
+- count (int): The number of times the block was called.
+- self (float): The total time in seconds spent in the block, excluding child calls.
+- children (dictionary): A dictionary of child nodes, keyed by the node name.
+- is_parallel (bool): Indicates that the block of code was executed in multiple threads or processes (see below). This is optional and defaults to false.
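+
+As an illustration only (the values below are hypothetical), a single node has the following shape, expressed here as a Python dictionary:
+
+```python
+node = {
+    "total": 12.3,   # seconds, including child calls
+    "count": 5,
+    "self": 1.1,     # seconds, excluding child calls
+    "children": {
+        "communicator.exchange": {"total": 11.2, "count": 5, "self": 11.2, "children": {}},
+    },
+    "is_parallel": False,
+}
+```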
+
+### Parallel execution
+
+#### Subprocesses
+
+For code that executes in multiple processes (for example, SubprocessEnvManager), we periodically send the timer information back to the "main" process, aggregate the timers there, and flush them in the subprocess. Note that (depending on the number of processes) this can result in timers where the total time may exceed the parent's total time. This is analogous to the difference between "real" and "user" values reported from the unix `time` command. In the timer output, blocks that were run in parallel are indicated by the `is_parallel` flag.
+
+#### Threads
+
+Timers currently use `time.perf_counter()` to track time spent, which may not give accurate results for multiple threads. If this is problematic, set
+`threaded: false` in your trainer configuration.
diff --git a/com.unity.ml-agents/Documentation~/Python-APIs.md b/com.unity.ml-agents/Documentation~/Python-APIs.md
new file mode 100644
index 0000000000..06d9068462
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-APIs.md
@@ -0,0 +1,15 @@
+# Python APIs
+
+The Python APIs allow you to control and interact with Unity environments from Python scripts. Each API is designed for specific use cases and offers different levels of abstraction and functionality.
+
+
+| **API** | **Description** |
+|--------------------------------------------------------------------------------------|---------------------------------------------------------------------|
+| [Python Gym API](Python-Gym-API.md) | OpenAI Gym-compatible interface for standard RL workflows. |
+| [Python Gym API Documentation](Python-Gym-API-Documentation.md) | Detailed documentation for the Python Gym API. |
+| [Python PettingZoo API](Python-PettingZoo-API.md) | Multi-agent environment interface compatible with PettingZoo. |
+| [Python PettingZoo API Documentation](Python-PettingZoo-API-Documentation.md) | Detailed documentation for the Python PettingZoo API. |
+| [Python Low-Level API](Python-LLAPI.md) | Direct low-level access for custom training and advanced use cases. |
+| [Python Low-Level API Documentation](Python-LLAPI-Documentation.md) | Detailed documentation for the Python Low-Level API. |
+| [On/Off Policy Trainer Documentation](Python-On-Off-Policy-Trainer-Documentation.md) | Documentation for on-policy and off-policy training methods. |
+| [Python Optimizer Documentation](Python-Optimizer-Documentation.md) | Documentation for optimizers used in training algorithms. |
diff --git a/com.unity.ml-agents/Documentation~/Python-Custom-Trainer-Plugin.md b/com.unity.ml-agents/Documentation~/Python-Custom-Trainer-Plugin.md
new file mode 100644
index 0000000000..7038c6f6d4
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-Custom-Trainer-Plugin.md
@@ -0,0 +1,39 @@
+# Unity ML-Agents Custom Trainers Plugin
+
+In an effort to bring a wider variety of reinforcement learning algorithms to our users, we have added custom trainer capabilities. We introduce an extensible plugin system for defining new trainers based on the high-level trainer API in the `ml-agents` package. This allows the `mlagents-learn` CLI to be rerouted to custom trainers and the config files to be extended with hyperparameters specific to your new trainers. We expose high-level extensible trainer (both on-policy and off-policy), optimizer, and hyperparameter classes with documentation for the use of this plugin. For more information on how the Python plugin system works, see [Plugin interfaces](Training-Plugins.md).
+## Overview
+Model-free RL algorithms generally fall into two broad categories: on-policy and off-policy. On-policy algorithms perform updates based on data gathered from the current policy. Off-policy algorithms learn a Q function from a buffer of previous data, then use this Q function to make decisions. Off-policy algorithms have key benefits in the context of ML-Agents: they tend to use fewer samples than on-policy algorithms, since they can pull and re-use data from the buffer many times, and they allow player demonstrations to be inserted in-line with RL data into the buffer, enabling new ways of doing imitation learning by streaming player data.
+
+To add new custom trainers to ML-Agents, you need to create a new Python package. To give you an idea of how to structure your package, we have created an example [mlagents_trainer_plugin](https://github.com/Unity-Technologies/ml-agents/tree/release_22/ml-agents-trainer-plugin) package with implementations of the `A2C` and `DQN` algorithms. You need a `setup.py` file to list extra requirements and to register the new RL algorithm in the ml-agents ecosystem, so that the `mlagents-learn` CLI can be called with your customized configuration (a sketch of such a `setup.py` is shown after the directory layout below).
+
+
+```shell
+├── mlagents_trainer_plugin
+│   ├── __init__.py
+│   ├── a2c
+│   │   ├── __init__.py
+│   │   ├── a2c_3DBall.yaml
+│   │   ├── a2c_optimizer.py
+│   │   └── a2c_trainer.py
+│   └── dqn
+│       ├── __init__.py
+│       ├── dqn_basic.yaml
+│       ├── dqn_optimizer.py
+│       └── dqn_trainer.py
+└── setup.py
+```
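+
+A minimal `setup.py` for such a plugin package could look like the sketch below. The entry-point group name (`mlagents.trainer_type`) and the `get_type_and_setting` accessors are assumptions modeled on the example plugin; check the example's own `setup.py` for the authoritative registration details.
+
+```python
+from setuptools import setup, find_packages
+
+setup(
+    name="mlagents_trainer_plugin",
+    version="0.0.1",
+    packages=find_packages(),
+    install_requires=["mlagents"],
+    # Assumed entry-point group used by the ml-agents plugin system; each key
+    # ("a2c", "dqn") becomes a trainer_type you can reference in a config file.
+    entry_points={
+        "mlagents.trainer_type": [
+            "a2c=mlagents_trainer_plugin.a2c.a2c_trainer:get_type_and_setting",
+            "dqn=mlagents_trainer_plugin.dqn.dqn_trainer:get_type_and_setting",
+        ]
+    },
+)
+```
+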
+## Installation and Execution
+If you haven't already, follow the [installation instructions](Installation.md). Once you have the `ml-agents-envs` and `ml-agents` packages, you can install the plugin package. From the repository's root directory, install `ml-agents-trainer-plugin` (or replace it with the name of your plugin folder):
+
+```sh
+pip3 install -e <./ml-agents-trainer-plugin>
+```
+
+Following the previous installation, your package is added as an entry point and you can use a config file with the new trainers:
+```sh
+mlagents-learn ml-agents-trainer-plugin/mlagents_trainer_plugin/a2c/a2c_3DBall.yaml --run-id <run-id> --env <path-to-env>
+```
+
+## Tutorial
+Here's a step-by-step [tutorial](Tutorial-Custom-Trainer-Plugin.md) on how to write a setup file and extend the ml-agents trainers, optimizers, and hyperparameter settings. To extend ML-Agents classes, see the references on [trainers](Python-On-Off-Policy-Trainer-Documentation.md) and [optimizers](Python-Optimizer-Documentation.md).
\ No newline at end of file
diff --git a/com.unity.ml-agents/Documentation~/Python-Gym-API-Documentation.md b/com.unity.ml-agents/Documentation~/Python-Gym-API-Documentation.md
new file mode 100644
index 0000000000..f3db6d59cb
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-Gym-API-Documentation.md
@@ -0,0 +1,133 @@
+# Python Gym API Documentation
+
+
+# mlagents\_envs.envs.unity\_gym\_env
+
+
+## UnityGymException Objects
+
+```python
+class UnityGymException(error.Error)
+```
+
+Any error related to the gym wrapper of ml-agents.
+
+
+## UnityToGymWrapper Objects
+
+```python
+class UnityToGymWrapper(gym.Env)
+```
+
+Provides Gym wrapper for Unity Learning Environments.
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__(unity_env: BaseEnv, uint8_visual: bool = False, flatten_branched: bool = False, allow_multiple_obs: bool = False, action_space_seed: Optional[int] = None)
+```
+
+Environment initialization
+
+**Arguments**:
+
+- `unity_env`: The Unity BaseEnv to be wrapped in the gym. Will be closed when the UnityToGymWrapper closes.
+- `uint8_visual`: Return visual observations as uint8 (0-255) matrices instead of float (0.0-1.0).
+- `flatten_branched`: If True, turn branched discrete action spaces into a Discrete space rather than MultiDiscrete.
+- `allow_multiple_obs`: If True, return a list of np.ndarrays as observations with the first elements containing the visual observations and the last element containing the array of vector observations. If False, returns a single np.ndarray containing either only a single visual observation or the array of vector observations.
+- `action_space_seed`: If non-None, will be used to set the random seed on created gym.Space instances.
+
+
+#### reset
+
+```python
+ | reset() -> Union[List[np.ndarray], np.ndarray]
+```
+
+Resets the state of the environment and returns an initial observation. Returns: observation (object/list): the initial observation of the space.
+
+
+#### step
+
+```python
+ | step(action: List[Any]) -> GymStepResult
+```
+
+Run one timestep of the environment's dynamics. When end of episode is reached, you are responsible for calling `reset()` to reset this environment's state. Accepts an action and returns a tuple (observation, reward, done, info).
+
+**Arguments**:
+
+- `action` _object/list_ - an action provided by the environment
+
+**Returns**:
+
+- `observation` _object/list_ - agent's observation of the current environment reward (float/list) : amount of reward returned after previous action
+- `done` _boolean/list_ - whether the episode has ended.
+- `info` _dict_ - contains auxiliary diagnostic information.
+
+
+#### render
+
+```python
+ | render(mode="rgb_array")
+```
+
+Return the latest visual observations. Note that it will not render a new frame of the environment.
+
+
+#### close
+
+```python
+ | close() -> None
+```
+
+Override _close in your subclass to perform any necessary cleanup. Environments will automatically close() themselves when garbage collected or when the program exits.
+
+
+#### seed
+
+```python
+ | seed(seed: Any = None) -> None
+```
+
+Sets the seed for this env's random number generator(s). Currently not implemented.
+
+
+## ActionFlattener Objects
+
+```python
+class ActionFlattener()
+```
+
+Flattens branched discrete action spaces into single-branch discrete action spaces.
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__(branched_action_space)
+```
+
+Initialize the flattener.
+
+**Arguments**:
+
+- `branched_action_space`: A List containing the sizes of each branch of the action space, e.g. [2,3,3] for three branches with size 2, 3, and 3 respectively.
+
+
+#### lookup\_action
+
+```python
+ | lookup_action(action)
+```
+
+Convert a scalar discrete action into a unique set of branched actions.
+
+**Arguments**:
+
+- `action`: A scalar value representing one of the discrete actions.
+
+**Returns**:
+
+The List containing the branched actions.
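+
+For example, a minimal sketch of using the flattener on its own (the `action_space` attribute is assumed to expose the resulting `Discrete` space):
+
+```python
+from mlagents_envs.envs.unity_gym_env import ActionFlattener
+
+# Three branches of sizes 2, 3 and 3 collapse into a single discrete space of 2*3*3 = 18 actions.
+flattener = ActionFlattener([2, 3, 3])
+print(flattener.action_space)      # assumed to be Discrete(18)
+print(flattener.lookup_action(7))  # the branched action encoded by scalar index 7
+```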
diff --git a/com.unity.ml-agents/Documentation~/Python-Gym-API.md b/com.unity.ml-agents/Documentation~/Python-Gym-API.md
new file mode 100644
index 0000000000..d08caae44a
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-Gym-API.md
@@ -0,0 +1,272 @@
+# Unity ML-Agents Gym Wrapper
+
+A common way in which machine learning researchers interact with simulation environments is via a wrapper provided by OpenAI called `gym`. For more information on the gym interface, see [here](https://github.com/openai/gym).
+
+We provide a gym wrapper and instructions for using it with existing machine learning algorithms which utilize gym. Our wrapper provides interfaces on top of our `UnityEnvironment` class, which is the default way of interfacing with a Unity environment via Python.
+
+## Installation
+
+The gym wrapper is part of the `mlagents_envs` package. Please refer to the [mlagents_envs installation instructions](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents-envs/README.md).
+
+
+## Using the Gym Wrapper
+
+The gym interface is available from `mlagents_envs.envs.unity_gym_env`. To launch an environment from the root of the project repository use:
+
+```python
+from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
+
+env = UnityToGymWrapper(unity_env, uint8_visual, flatten_branched, allow_multiple_obs)
+```
+
+- `unity_env` refers to the Unity environment to be wrapped.
+
+- `uint8_visual` refers to whether to output visual observations as `uint8` values (0-255). Many common Gym environments (e.g. Atari) do this. By default, they will be floats (0.0-1.0). Defaults to `False`.
+
+- `flatten_branched` will flatten a branched discrete action space into a Gym Discrete. Otherwise, it will be converted into a MultiDiscrete. Defaults to `False`.
+
+- `allow_multiple_obs` will return a list of observations. The first elements contain the visual observations and the last element contains the array of vector observations. If False, the environment returns a single array (containing a single visual observation, if present, otherwise the vector observation). Defaults to `False`.
+
+- `action_space_seed` is the optional seed for action sampling. If non-None, will be used to set the random seed on created gym.Space instances.
+
+The returned environment `env` will function as a gym.
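+
+For example, a minimal sketch of driving a wrapped single-agent build at `<path-to-environment>` with random actions, using the classic gym 4-tuple `step()` API returned by this wrapper:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
+
+# Launch the build and wrap it as a gym environment.
+unity_env = UnityEnvironment("<path-to-environment>")
+env = UnityToGymWrapper(unity_env, uint8_visual=False, flatten_branched=True)
+
+obs = env.reset()
+for _ in range(500):
+    action = env.action_space.sample()          # random action from the gym space
+    obs, reward, done, info = env.step(action)
+    if done:
+        obs = env.reset()
+env.close()
+```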
+
+## Limitations
+
+- It is only possible to use an environment with a **single** Agent.
+- By default, the first visual observation is provided as the `observation`, if present. Otherwise, vector observations are provided. You can receive all visual and vector observations by using the `allow_multiple_obs=True` option in the gym parameters. If set to `True`, you will receive a list of `observation` instead of only one.
+- The `TerminalSteps` or `DecisionSteps` output from the environment can still be accessed from the `info` provided by `env.step(action)`.
+- Stacked vector observations are not supported.
+- Environment registration for use with `gym.make()` is currently not supported.
+- Calling env.render() will not render a new frame of the environment. It will return the latest visual observation if using visual observations.
+
+## Running OpenAI Baselines Algorithms
+
+OpenAI provides a set of open-source maintained and tested Reinforcement Learning algorithms called the [Baselines](https://github.com/openai/baselines).
+
+Using the provided Gym wrapper, it is possible to train ML-Agents environments using these algorithms. This requires the creation of custom training scripts to launch each algorithm. In most cases these scripts can be created by making slight modifications to the ones provided for Atari and Mujoco environments.
+
+These examples were tested with baselines version 0.1.6.
+
+### Example - DQN Baseline
+
+In order to train an agent to play the `GridWorld` environment using the Baselines DQN algorithm, you first need to install the baselines package using pip:
+
+```
+pip install git+https://github.com/openai/baselines
+```
+
+Next, create a file called `train_unity.py`. Then create an `/envs/` directory and build the environment to that directory. For more information on building Unity environments, see [here](Learning-Environment-Executable.md). Note that because of limitations of the DQN baseline, the environment must have a single visual observation, a single discrete action and a single Agent in the scene. Add the following code to the `train_unity.py` file:
+
+```python
+import gym
+
+from baselines import deepq
+from baselines import logger
+
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
+
+
+def main():
+    unity_env = UnityEnvironment(<path-to-environment>)
+    env = UnityToGymWrapper(unity_env, uint8_visual=True)
+    logger.configure('./logs')  # Change to log in a different directory
+    act = deepq.learn(
+        env,
+        "cnn",  # For visual inputs
+        lr=2.5e-4,
+        total_timesteps=1000000,
+        buffer_size=50000,
+        exploration_fraction=0.05,
+        exploration_final_eps=0.1,
+        print_freq=20,
+        train_freq=5,
+        learning_starts=20000,
+        target_network_update_freq=50,
+        gamma=0.99,
+        prioritized_replay=False,
+        checkpoint_freq=1000,
+        checkpoint_path='./logs',  # Change to save model in a different directory
+        dueling=True
+    )
+    print("Saving model to unity_model.pkl")
+    act.save("unity_model.pkl")
+
+
+if __name__ == '__main__':
+    main()
+```
+
+To start the training process, run the following from the directory containing
+`train_unity.py`:
+
+```sh
+python -m train_unity
+```
+
+### Other Algorithms
+
+Other algorithms in the Baselines repository can be run using scripts similar to the examples from the baselines package. In most cases, the primary changes needed to use a Unity environment are to import `UnityToGymWrapper`, and to replace the environment creation code, typically `gym.make()`, with a call to `UnityToGymWrapper(unity_environment)` passing the environment as input.
+
+A typical rule of thumb is that for vision-based environments, modification should be done to Atari training scripts, and for vector observation environments, modification should be done to Mujoco scripts.
+
+Some algorithms will make use of `make_env()` or `make_mujoco_env()` functions. You can define a similar function for Unity environments. An example of such a method using the PPO2 baseline:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.envs import UnityToGymWrapper
+from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
+from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
+from baselines.bench import Monitor
+from baselines import logger
+import baselines.ppo2.ppo2 as ppo2
+
+import os
+
+try:
+ from mpi4py import MPI
+except ImportError:
+ MPI = None
+
+
+def make_unity_env(env_directory, num_env, visual, start_index=0):
+    """
+    Create a wrapped, monitored Unity environment.
+    """
+
+    def make_env(rank, use_visual=True):  # pylint: disable=C0111
+        def _thunk():
+            unity_env = UnityEnvironment(env_directory, base_port=5000 + rank)
+            env = UnityToGymWrapper(unity_env, uint8_visual=True)
+            env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
+            return env
+
+        return _thunk
+
+    if visual:
+        return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
+    else:
+        rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
+        return DummyVecEnv([make_env(rank, use_visual=False)])
+
+
+def main():
+    env = make_unity_env(<path-to-environment>, 4, True)
+    ppo2.learn(
+        network="mlp",
+        env=env,
+        total_timesteps=100000,
+        lr=1e-3,
+    )
+
+
+if __name__ == '__main__':
+    main()
+```
+
+## Run Google Dopamine Algorithms
+
+Google provides a framework [Dopamine](https://github.com/google/dopamine), and implementations of algorithms, e.g. DQN, Rainbow, and the C51 variant of Rainbow. Using the Gym wrapper, we can run Unity environments using Dopamine.
+
+First, after installing the Gym wrapper, clone the Dopamine repository.
+
+```
+git clone https://github.com/google/dopamine
+```
+
+Then, follow the appropriate install instructions as specified on [Dopamine's homepage](https://github.com/google/dopamine). Note that the Dopamine guide specifies using a virtualenv. If you choose to do so, make sure your unity_env package is also installed within the same virtualenv as Dopamine.
+
+### Adapting Dopamine's Scripts
+
+First, open `dopamine/atari/run_experiment.py`. Alternatively, copy the entire `atari` folder, and name it something else (e.g. `unity`). If you choose the copy approach, be sure to change the package names in the import statements in `train.py` to your new directory.
+
+Within `run_experiment.py`, we will need to make changes to which environment is instantiated, just as in the Baselines example. At the top of the file, insert
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.envs import UnityToGymWrapper
+```
+
+to import the Gym Wrapper. Navigate to the `create_atari_environment` method in the same file, and switch to instantiating a Unity environment by replacing the method with the following code.
+
+```python
+    game_version = 'v0' if sticky_actions else 'v4'
+    full_game_name = '{}NoFrameskip-{}'.format(game_name, game_version)
+    unity_env = UnityEnvironment(<environment-path>)
+    env = UnityToGymWrapper(unity_env, uint8_visual=True)
+    return env
+```
+
+`<environment-path>` is the path to your built Unity executable. For more information on building Unity environments, see [here](Learning-Environment-Executable.md), and note the Limitations section below.
+
+Note that we are not using the preprocessor from Dopamine, as it uses many Atari-specific calls. Furthermore, frame-skipping can be done from within Unity, rather than on the Python side.
+
+### Limitations
+
+Since Dopamine is designed around variants of DQN, it is only compatible with discrete action spaces, and specifically the Discrete Gym space. For environments that use branched discrete action spaces, you can enable the `flatten_branched` parameter in `UnityToGymWrapper`, which treats each combination of branched actions as separate actions.
+
+Furthermore, when building your environments, ensure that your Agent is using visual observations with greyscale enabled, and that the dimensions of the visual observations are 84 by 84 (matching the parameter found in `dqn_agent.py` and `rainbow_agent.py`). Dopamine's agents currently do not automatically adapt to the observation dimensions or number of channels.
+
+### Hyperparameters
+
+The hyperparameters provided by Dopamine are tailored to the Atari games, and you will likely need to adjust them for ML-Agents environments. Here is a sample `dopamine/agents/rainbow/configs/rainbow.gin` file that is known to work with a simple GridWorld.
+
+```python
+import dopamine.agents.rainbow.rainbow_agent
+import dopamine.unity.run_experiment
+import dopamine.replay_memory.prioritized_replay_buffer
+import gin.tf.external_configurables
+
+RainbowAgent.num_atoms = 51
+RainbowAgent.stack_size = 1
+RainbowAgent.vmax = 10.
+RainbowAgent.gamma = 0.99
+RainbowAgent.update_horizon = 3
+RainbowAgent.min_replay_history = 20000 # agent steps
+RainbowAgent.update_period = 5
+RainbowAgent.target_update_period = 50 # agent steps
+RainbowAgent.epsilon_train = 0.1
+RainbowAgent.epsilon_eval = 0.01
+RainbowAgent.epsilon_decay_period = 50000 # agent steps
+RainbowAgent.replay_scheme = 'prioritized'
+RainbowAgent.tf_device = '/cpu:0' # use '/cpu:*' for non-GPU version
+RainbowAgent.optimizer = @tf.train.AdamOptimizer()
+
+tf.train.AdamOptimizer.learning_rate = 0.00025
+tf.train.AdamOptimizer.epsilon = 0.0003125
+
+Runner.game_name = "Unity" # any name can be used here
+Runner.sticky_actions = False
+Runner.num_iterations = 200
+Runner.training_steps = 10000 # agent steps
+Runner.evaluation_steps = 500 # agent steps
+Runner.max_steps_per_episode = 27000 # agent steps
+
+WrappedPrioritizedReplayBuffer.replay_capacity = 1000000
+WrappedPrioritizedReplayBuffer.batch_size = 32
+```
+
+This example assumes you copied `atari` to a separate folder named `unity`. Replace `unity` in `import dopamine.unity.run_experiment` with the folder you copied your `run_experiment.py` and `trainer.py` files to. If you directly modified the existing files, then use `atari` here.
+
+### Starting a Run
+
+You can now run Dopamine as you would normally:
+
+```
+python -um dopamine.unity.train \
+ --agent_name=rainbow \
+ --base_dir=/tmp/dopamine \
+ --gin_files='dopamine/agents/rainbow/configs/rainbow.gin'
+```
+
+Again, we assume that you've copied `atari` into a separate folder. Remember to replace `unity` with the directory you copied your files into. If you edited the Atari files directly, this should be `atari`.
+
+### Example: GridWorld
+
+As a baseline, here are rewards over time for the three algorithms provided with Dopamine as run on the GridWorld example environment. All Dopamine (DQN, Rainbow, C51) runs were done with the same epsilon, epsilon decay, replay history, training steps, and buffer settings as specified above. Note that the first 20000 steps are used to pre-fill the training buffer, and no learning happens.
+
+We provide results from our PPO implementation and the DQN from Baselines as reference. Note that all runs used the same greyscale GridWorld as Dopamine. For PPO, `num_layers` was set to 2, and all other hyperparameters are the default for GridWorld in `config/ppo/GridWorld.yaml`. For Baselines DQN, the provided hyperparameters in the previous section are used. Note that Baselines implements certain features (e.g. dueling-Q) that are not enabled in Dopamine DQN.
+
+
+
diff --git a/com.unity.ml-agents/Documentation~/Python-LLAPI-Documentation.md b/com.unity.ml-agents/Documentation~/Python-LLAPI-Documentation.md
new file mode 100644
index 0000000000..213797ea9c
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-LLAPI-Documentation.md
@@ -0,0 +1,1114 @@
+# Python Low-Level API Documentation
+
+
+# mlagents\_envs.base\_env
+
+Python Environment API for the ML-Agents Toolkit. The aim of this API is to expose Agents evolving in a simulation to perform reinforcement learning on. This API supports multi-agent scenarios and groups similar Agents (same observations, action spaces and behavior) together. These groups of Agents are identified by their BehaviorName. For performance reasons, the data of each group of agents is processed in a batched manner. Agents are identified by a unique AgentId identifier that allows tracking of Agents across simulation steps. Note that there is no guarantee that the number or order of the Agents in the state will be consistent across simulation steps. A simulation step corresponds to moving the simulation forward until at least one agent in the simulation sends its observations to Python again. Since Agents can request decisions at different frequencies, a simulation step does not necessarily correspond to a fixed simulation time increment.
+
+
+## DecisionStep Objects
+
+```python
+class DecisionStep(NamedTuple)
+```
+
+Contains the data a single Agent collected since the last simulation step.
+ - obs is a list of numpy arrays observations collected by the agent.
+ - reward is a float. Corresponds to the rewards collected by the agent since the last simulation step.
+ - agent_id is an int and a unique identifier for the corresponding Agent.
+ - action_mask is an optional list of one dimensional array of booleans. Only available when using multi-discrete actions. Each array corresponds to an action branch. Each array contains a mask for each action of the branch. If true, the action is not available for the agent during this simulation step.
+
+
+## DecisionSteps Objects
+
+```python
+class DecisionSteps(Mapping)
+```
+
+Contains the data a batch of similar Agents collected since the last simulation step. Note that all Agents do not necessarily have new information to send at each simulation step. Therefore, the ordering of agents and the batch size of the DecisionSteps are not fixed across simulation steps.
+ - obs is a list of numpy arrays of observations collected by the batch of agents. Each obs has one extra dimension compared to DecisionStep: the first dimension of the array corresponds to the batch size of the batch.
+ - reward is a float vector of length batch size. Corresponds to the rewards collected by each agent since the last simulation step.
+ - agent_id is an int vector of length batch size containing unique identifier for the corresponding Agent. This is used to track Agents across simulation steps.
+ - action_mask is an optional list of two-dimensional array of booleans. Only available when using multi-discrete actions. Each array corresponds to an action branch. The first dimension of each array is the batch size and the second contains a mask for each action of the branch. If true, the action is not available for the agent during this simulation step.
+
+
+#### agent\_id\_to\_index
+
+```python
+ | @property
+ | agent_id_to_index() -> Dict[AgentId, int]
+```
+
+**Returns**:
+
+A Dict that maps agent_id to the index of those agents in this DecisionSteps.
+
+
+#### \_\_getitem\_\_
+
+```python
+ | __getitem__(agent_id: AgentId) -> DecisionStep
+```
+
+returns the DecisionStep for a specific agent.
+
+**Arguments**:
+
+- `agent_id`: The id of the agent
+
+**Returns**:
+
+The DecisionStep
+
+
+#### empty
+
+```python
+ | @staticmethod
+ | empty(spec: "BehaviorSpec") -> "DecisionSteps"
+```
+
+Returns an empty DecisionSteps.
+
+**Arguments**:
+
+- `spec`: The BehaviorSpec for the DecisionSteps
+
+
+## TerminalStep Objects
+
+```python
+class TerminalStep(NamedTuple)
+```
+
+Contains the data a single Agent collected when its episode ended.
+ - obs is a list of numpy arrays observations collected by the agent.
+ - reward is a float. Corresponds to the rewards collected by the agent since the last simulation step.
+ - interrupted is a bool. Is true if the Agent was interrupted since the last decision step. For example, if the Agent reached the maximum number of steps for the episode.
+ - agent_id is an int and a unique identifier for the corresponding Agent.
+
+
+## TerminalSteps Objects
+
+```python
+class TerminalSteps(Mapping)
+```
+
+Contains the data a batch of Agents collected when their episode terminated. All Agents present in the TerminalSteps have ended their episode.
+ - obs is a list of numpy arrays of observations collected by the batch of agents. Each obs has one extra dimension compared to DecisionStep: the first dimension of the array corresponds to the batch size of the batch.
+ - reward is a float vector of length batch size. Corresponds to the rewards collected by each agent since the last simulation step.
+ - interrupted is an array of booleans of length batch size. Is true if the associated Agent was interrupted since the last decision step. For example, if the Agent reached the maximum number of steps for the episode.
+ - agent_id is an int vector of length batch size containing unique identifier for the corresponding Agent. This is used to track Agents across simulation steps.
+
+
+#### agent\_id\_to\_index
+
+```python
+ | @property
+ | agent_id_to_index() -> Dict[AgentId, int]
+```
+
+**Returns**:
+
+A Dict that maps agent_id to the index of those agents in this TerminalSteps.
+
+
+#### \_\_getitem\_\_
+
+```python
+ | __getitem__(agent_id: AgentId) -> TerminalStep
+```
+
+returns the TerminalStep for a specific agent.
+
+**Arguments**:
+
+- `agent_id`: The id of the agent
+
+**Returns**:
+
+obs, reward, done, agent_id and optional action mask for a specific agent
+
+
+#### empty
+
+```python
+ | @staticmethod
+ | empty(spec: "BehaviorSpec") -> "TerminalSteps"
+```
+
+Returns an empty TerminalSteps.
+
+**Arguments**:
+
+- `spec`: The BehaviorSpec for the TerminalSteps
+
+
+## ActionTuple Objects
+
+```python
+class ActionTuple(_ActionTupleBase)
+```
+
+An object whose fields correspond to actions of different types. Continuous and discrete actions are numpy arrays of type float32 and int32, respectively and are type checked on construction. Dimensions are of (n_agents, continuous_size) and (n_agents, discrete_size), respectively. Note, this also holds when continuous or discrete size is zero.
+
+
+#### discrete\_dtype
+
+```python
+ | @property
+ | discrete_dtype() -> np.dtype
+```
+
+The dtype of a discrete action.
+
+
+## ActionSpec Objects
+
+```python
+class ActionSpec(NamedTuple)
+```
+
+A NamedTuple containing utility functions and information about the action spaces for a group of Agents under the same behavior.
+- num_continuous_actions is an int corresponding to the number of floats which constitute the action.
+- discrete_branch_sizes is a Tuple of int where each int corresponds to the number of discrete actions available to the agent on an independent action branch.
+
+
+#### is\_discrete
+
+```python
+ | is_discrete() -> bool
+```
+
+Returns true if this Behavior uses discrete actions
+
+
+#### is\_continuous
+
+```python
+ | is_continuous() -> bool
+```
+
+Returns true if this Behavior uses continuous actions
+
+
+#### discrete\_size
+
+```python
+ | @property
+ | discrete_size() -> int
+```
+
+Returns an int corresponding to the number of discrete branches.
+
+
+#### empty\_action
+
+```python
+ | empty_action(n_agents: int) -> ActionTuple
+```
+
+Generates ActionTuple corresponding to an empty action (all zeros) for a number of agents.
+
+**Arguments**:
+
+- `n_agents`: The number of agents that will have actions generated
+
+
+#### random\_action
+
+```python
+ | random_action(n_agents: int) -> ActionTuple
+```
+
+Generates ActionTuple corresponding to a random action (either discrete or continuous) for a number of agents.
+
+**Arguments**:
+
+- `n_agents`: The number of agents that will have actions generated
+
+
+#### create\_continuous
+
+```python
+ | @staticmethod
+ | create_continuous(continuous_size: int) -> "ActionSpec"
+```
+
+Creates an ActionSpec that is homogeneously continuous
+
+
+#### create\_discrete
+
+```python
+ | @staticmethod
+ | create_discrete(discrete_branches: Tuple[int]) -> "ActionSpec"
+```
+
+Creates an ActionSpec that is homogeneously discrete
+
+
+#### create\_hybrid
+
+```python
+ | @staticmethod
+ | create_hybrid(continuous_size: int, discrete_branches: Tuple[int]) -> "ActionSpec"
+```
+
+Creates a hybrid ActionSpec
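+
+For example, a minimal sketch of creating a hybrid spec and sampling actions for a batch of agents:
+
+```python
+from mlagents_envs.base_env import ActionSpec
+
+# Two continuous actions plus two discrete branches of sizes 3 and 2.
+spec = ActionSpec.create_hybrid(continuous_size=2, discrete_branches=(3, 2))
+
+print(spec.continuous_size)    # 2
+print(spec.discrete_size)      # 2 (number of discrete branches)
+print(spec.discrete_branches)  # (3, 2)
+
+# ActionTuples for a batch of 4 agents: continuous shape (4, 2), discrete shape (4, 2).
+random_actions = spec.random_action(n_agents=4)
+empty_actions = spec.empty_action(n_agents=4)
+```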
+
+
+## DimensionProperty Objects
+
+```python
+class DimensionProperty(IntFlag)
+```
+
+The dimension property of a dimension of an observation.
+
+
+#### UNSPECIFIED
+
+No properties specified.
+
+
+#### NONE
+
+No property of the observation in that dimension. Observations can be processed with fully connected networks.
+
+
+#### TRANSLATIONAL\_EQUIVARIANCE
+
+Means it is suitable to do a convolution in this dimension.
+
+
+#### VARIABLE\_SIZE
+
+Means that there can be a variable number of observations in this dimension. The observations are unordered.
+
+
+## ObservationType Objects
+
+```python
+class ObservationType(Enum)
+```
+
+An Enum which defines the type of information carried in the observation of the agent.
+
+
+#### DEFAULT
+
+Observation information is generic.
+
+
+#### GOAL\_SIGNAL
+
+Observation contains goal information for current task.
+
+
+## ObservationSpec Objects
+
+```python
+class ObservationSpec(NamedTuple)
+```
+
+A NamedTuple containing information about the observation of Agents.
+- shape is a Tuple of int : It corresponds to the shape of an observation's dimensions.
+- dimension_property is a Tuple of DimensionProperties flag, one flag for each dimension.
+- observation_type is an enum of ObservationType.
+
+
+## BehaviorSpec Objects
+
+```python
+class BehaviorSpec(NamedTuple)
+```
+
+A NamedTuple containing information about the observation and action spaces for a group of Agents under the same behavior.
+- observation_specs is a List of ObservationSpec NamedTuple containing information about the information of the Agent's observations such as their shapes. The order of the ObservationSpec is the same as the order of the observations of an agent.
+- action_spec is an ActionSpec NamedTuple.
+
+
+## BaseEnv Objects
+
+```python
+class BaseEnv(ABC)
+```
+
+
+#### step
+
+```python
+ | @abstractmethod
+ | step() -> None
+```
+
+Signals the environment that it must move the simulation forward by one step.
+
+
+#### reset
+
+```python
+ | @abstractmethod
+ | reset() -> None
+```
+
+Signals the environment that it must reset the simulation.
+
+
+#### close
+
+```python
+ | @abstractmethod
+ | close() -> None
+```
+
+Signals the environment that it must close.
+
+
+#### behavior\_specs
+
+```python
+ | @property
+ | @abstractmethod
+ | behavior_specs() -> MappingType[str, BehaviorSpec]
+```
+
+Returns a Mapping from behavior names to behavior specs. Agents grouped under the same behavior name have the same action and observation specs, and are expected to behave similarly in the environment. Note that new keys can be added to this mapping as new policies are instantiated.
+
+
+#### set\_actions
+
+```python
+ | @abstractmethod
+ | set_actions(behavior_name: BehaviorName, action: ActionTuple) -> None
+```
+
+Sets the action for all of the agents in the simulation for the next step. The Actions must be in the same order as the order received in the DecisionSteps.
+
+**Arguments**:
+
+- `behavior_name`: The name of the behavior the agents are part of
+- `action`: ActionTuple tuple of continuous and/or discrete action. Actions are np.arrays with dimensions (n_agents, continuous_size) and (n_agents, discrete_size), respectively.
+
+
+#### set\_action\_for\_agent
+
+```python
+ | @abstractmethod
+ | set_action_for_agent(behavior_name: BehaviorName, agent_id: AgentId, action: ActionTuple) -> None
+```
+
+Sets the action for one of the agents in the simulation for the next step.
+
+**Arguments**:
+
+- `behavior_name`: The name of the behavior the agent is part of
+- `agent_id`: The id of the agent the action is set for
+- `action`: ActionTuple tuple of continuous and/or discrete action. Actions are np.arrays with dimensions (1, continuous_size) and (1, discrete_size), respectively. Note, this initial dimensions of 1 is because this action is meant for a single agent.
+
+
+#### get\_steps
+
+```python
+ | @abstractmethod
+ | get_steps(behavior_name: BehaviorName) -> Tuple[DecisionSteps, TerminalSteps]
+```
+
+Retrieves the steps of the agents that requested a step in the simulation.
+
+**Arguments**:
+
+- `behavior_name`: The name of the behavior the agents are part of
+
+**Returns**:
+
+A tuple containing :
+- A DecisionSteps NamedTuple containing the observations, the rewards, the agent ids and the action masks for the Agents of the specified behavior. These Agents need an action this step.
+- A TerminalSteps NamedTuple containing the observations, rewards, agent ids and interrupted flags of the agents that had their episode terminated last step.
+
+
+# mlagents\_envs.environment
+
+
+## UnityEnvironment Objects
+
+```python
+class UnityEnvironment(BaseEnv)
+```
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__(file_name: Optional[str] = None, worker_id: int = 0, base_port: Optional[int] = None, seed: int = 0, no_graphics: bool = False, no_graphics_monitor: bool = False, timeout_wait: int = 60, additional_args: Optional[List[str]] = None, side_channels: Optional[List[SideChannel]] = None, log_folder: Optional[str] = None, num_areas: int = 1)
+```
+
+Starts a new unity environment and establishes a connection with the environment. Notice: Currently communication between Unity and Python takes place over an open socket without authentication. Ensure that the network where training takes place is secure.
+
+- `file_name` (str): Name of the Unity environment binary.
+- `base_port` (int): Baseline port number to connect to the Unity environment over. `worker_id` increments over this. If no environment is specified (i.e. `file_name` is None), the DEFAULT_EDITOR_PORT will be used.
+- `worker_id` (int): Offset from `base_port`. Used for training multiple environments simultaneously.
+- `no_graphics` (bool): Whether to run the Unity simulator in no-graphics mode.
+- `no_graphics_monitor` (bool): Whether to run the main worker in graphics mode, with the remaining workers in no-graphics mode.
+- `timeout_wait` (int): Time (in seconds) to wait for connection from the environment.
+- `additional_args` (list): Additional Unity command line arguments.
+- `side_channels` (list): Additional side channels for non-RL communication with Unity.
+- `log_folder` (str): Optional folder to write the Unity Player log file into. Requires an absolute path.
+
+
+#### close
+
+```python
+ | close()
+```
+
+Sends a shutdown signal to the unity environment, and closes the socket connection.
+
+
+# mlagents\_envs.registry
+
+
+# mlagents\_envs.registry.unity\_env\_registry
+
+
+## UnityEnvRegistry Objects
+
+```python
+class UnityEnvRegistry(Mapping)
+```
+
+### UnityEnvRegistry
+Provides a library of Unity environments that can be launched without the need of downloading the Unity Editor. The UnityEnvRegistry implements a Map, to access an entry of the Registry, use:
+```python
+registry = UnityEnvRegistry()
+entry = registry[<environment_identifier>]
+```
+An entry has the following properties :
+ * `identifier` : Uniquely identifies this environment
+ * `expected_reward` : Corresponds to the reward an agent must obtain for the task to be considered completed.
+ * `description` : A human readable description of the environment.
+
+To launch a Unity environment from a registry entry, use the `make` method:
+```python
+registry = UnityEnvRegistry()
+env = registry[<environment_identifier>].make()
+```
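+
+For example, the package ships a pre-populated `default_registry`; the sketch below assumes the `3DBall` entry listed in the [Unity Environment Registry](Unity-Environment-Registry.md):
+
+```python
+from mlagents_envs.registry import default_registry
+
+env = default_registry["3DBall"].make()  # downloads (if needed) and launches the environment
+env.reset()
+env.close()
+```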
+
+
+#### register
+
+```python
+ | register(new_entry: BaseRegistryEntry) -> None
+```
+
+Registers a new BaseRegistryEntry to the registry. The BaseRegistryEntry.identifier value will be used as the indexing key. If two or more environments are registered under the same key, the most recently added will replace the others.
+
+
+#### register\_from\_yaml
+
+```python
+ | register_from_yaml(path_to_yaml: str) -> None
+```
+
+Registers the environments listed in a yaml file (either local or remote). Note that the entries are registered lazily: the registration will only happen when an environment is accessed. The yaml file must have the following format:
+```yaml
+environments:
+- <environment-identifier>:
+    expected_reward: <expected reward for the environment>
+    description: |
+      <a multi-line description of the environment>
+    linux_url: <url of the Linux executable zip file>
+    darwin_url: <url of the macOS executable zip file>
+    win_url: <url of the Windows executable zip file>
+
+- <environment-identifier>:
+    expected_reward: <expected reward for the environment>
+    description: |
+      <a multi-line description of the environment>
+    linux_url: <url of the Linux executable zip file>
+    darwin_url: <url of the macOS executable zip file>
+    win_url: <url of the Windows executable zip file>
+
+- ...
+```
+
+**Arguments**:
+
+- `path_to_yaml`: A local path or url to the yaml file
+
+
+#### clear
+
+```python
+ | clear() -> None
+```
+
+Deletes all entries in the registry.
+
+
+#### \_\_getitem\_\_
+
+```python
+ | __getitem__(identifier: str) -> BaseRegistryEntry
+```
+
+Returns the BaseRegistryEntry with the provided identifier. BaseRegistryEntry can then be used to make a Unity Environment.
+
+**Arguments**:
+
+- `identifier`: The identifier of the BaseRegistryEntry
+
+**Returns**:
+
+The associated BaseRegistryEntry
+
+
+# mlagents\_envs.side\_channel
+
+
+# mlagents\_envs.side\_channel.raw\_bytes\_channel
+
+
+## RawBytesChannel Objects
+
+```python
+class RawBytesChannel(SideChannel)
+```
+
+This is an example of what the SideChannel for raw bytes exchange would look like. It is meant to be used for general research purposes.
+
+
+#### on\_message\_received
+
+```python
+ | on_message_received(msg: IncomingMessage) -> None
+```
+
+Is called by the environment to the side channel. Can be called multiple times per step if multiple messages are meant for that SideChannel.
+
+
+#### get\_and\_clear\_received\_messages
+
+```python
+ | get_and_clear_received_messages() -> List[bytes]
+```
+
+returns a list of bytearray received from the environment.
+
+
+#### send\_raw\_data
+
+```python
+ | send_raw_data(data: bytearray) -> None
+```
+
+Queues a message to be sent by the environment at the next call to step.
+
+
+# mlagents\_envs.side\_channel.outgoing\_message
+
+
+## OutgoingMessage Objects
+
+```python
+class OutgoingMessage()
+```
+
+Utility class for forming the message that is written to a SideChannel. All data is written in little-endian format using the struct module.
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__()
+```
+
+Create an OutgoingMessage with an empty buffer.
+
+
+#### write\_bool
+
+```python
+ | write_bool(b: bool) -> None
+```
+
+Append a boolean value.
+
+
+#### write\_int32
+
+```python
+ | write_int32(i: int) -> None
+```
+
+Append an integer value.
+
+
+#### write\_float32
+
+```python
+ | write_float32(f: float) -> None
+```
+
+Append a float value. It will be truncated to 32-bit precision.
+
+
+#### write\_float32\_list
+
+```python
+ | write_float32_list(float_list: List[float]) -> None
+```
+
+Append a list of float values. They will be truncated to 32-bit precision.
+
+
+#### write\_string
+
+```python
+ | write_string(s: str) -> None
+```
+
+Append a string value. Internally, it will be encoded to ascii, and the encoded length will also be written to the message.
+
+
+#### set\_raw\_bytes
+
+```python
+ | set_raw_bytes(buffer: bytearray) -> None
+```
+
+Set the internal buffer to a new bytearray. This will overwrite any existing data.
+
+**Arguments**:
+
+- `buffer`:
+
+**Returns**:
+
+
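+
+For example, a minimal sketch of writing a message and reading it back in the same order; `out.buffer` is assumed to be the internal bytearray holding the encoded payload:
+
+```python
+from mlagents_envs.side_channel.outgoing_message import OutgoingMessage
+from mlagents_envs.side_channel.incoming_message import IncomingMessage
+
+# Write a few values...
+out = OutgoingMessage()
+out.write_int32(42)
+out.write_float32(3.14)
+out.write_string("hello")
+
+# ...and read them back in the order they were written.
+msg = IncomingMessage(bytes(out.buffer))
+print(msg.read_int32())    # 42
+print(msg.read_float32())  # ~3.14 (32-bit precision)
+print(msg.read_string())   # "hello"
+```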
+
+
+# mlagents\_envs.side\_channel.engine\_configuration\_channel
+
+
+## EngineConfigurationChannel Objects
+
+```python
+class EngineConfigurationChannel(SideChannel)
+```
+
+This is the SideChannel for engine configuration exchange. The data in the engine configuration is as follows :
+ - int width;
+ - int height;
+ - int qualityLevel;
+ - float timeScale;
+ - int targetFrameRate;
+ - int captureFrameRate;
+
+
+#### on\_message\_received
+
+```python
+ | on_message_received(msg: IncomingMessage) -> None
+```
+
+Is called by the environment to the side channel. Can be called multiple times per step if multiple messages are meant for that SideChannel. Note that Python should never receive an engine configuration from Unity
+
+
+#### set\_configuration\_parameters
+
+```python
+ | set_configuration_parameters(width: Optional[int] = None, height: Optional[int] = None, quality_level: Optional[int] = None, time_scale: Optional[float] = None, target_frame_rate: Optional[int] = None, capture_frame_rate: Optional[int] = None) -> None
+```
+
+Sets the engine configuration. Takes as input the configurations of the engine.
+
+**Arguments**:
+
+- `width`: Defines the width of the display. (Must be set alongside height)
+- `height`: Defines the height of the display. (Must be set alongside width)
+- `quality_level`: Defines the quality level of the simulation.
+- `time_scale`: Defines the multiplier for the deltatime in the simulation. If set to a higher value, time will pass faster in the simulation but the physics might break.
+- `target_frame_rate`: Instructs simulation to try to render at a specified frame rate.
+- `capture_frame_rate`: Instructs the simulation to consider time between updates to always be constant, regardless of the actual frame rate.
+
+
+#### set\_configuration
+
+```python
+ | set_configuration(config: EngineConfig) -> None
+```
+
+Sets the engine configuration. Takes as input an EngineConfig.
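+
+For example, a minimal sketch of attaching the channel to an environment and speeding up the simulation for training:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel
+
+channel = EngineConfigurationChannel()
+# No file_name: the environment waits for you to press Play in the Editor.
+env = UnityEnvironment(side_channels=[channel])
+
+# Run the simulation 20x faster at the lowest quality level.
+channel.set_configuration_parameters(time_scale=20.0, quality_level=0)
+env.reset()
+```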
+
+
+# mlagents\_envs.side\_channel.side\_channel\_manager
+
+
+## SideChannelManager Objects
+
+```python
+class SideChannelManager()
+```
+
+
+#### process\_side\_channel\_message
+
+```python
+ | process_side_channel_message(data: bytes) -> None
+```
+
+Separates the data received from Python into individual messages for each registered side channel and calls on_message_received on them.
+
+**Arguments**:
+
+- `data`: The packed message sent by Unity
+
+
+#### generate\_side\_channel\_messages
+
+```python
+ | generate_side_channel_messages() -> bytearray
+```
+
+Gathers the messages that the registered side channels will send to Unity and combines them into a single message ready to be sent.
+
+
+# mlagents\_envs.side\_channel.stats\_side\_channel
+
+
+## StatsSideChannel Objects
+
+```python
+class StatsSideChannel(SideChannel)
+```
+
+Side channel that receives (string, float) pairs from the environment, so that they can eventually be passed to a StatsReporter.
+
+
+#### on\_message\_received
+
+```python
+ | on_message_received(msg: IncomingMessage) -> None
+```
+
+Receive the message from the environment, and save it for later retrieval.
+
+**Arguments**:
+
+- `msg`:
+
+**Returns**:
+
+
+
+
+#### get\_and\_reset\_stats
+
+```python
+ | get_and_reset_stats() -> EnvironmentStats
+```
+
+Returns the current stats, and resets the internal storage of the stats.
+
+**Returns**:
+
+
+
+
+# mlagents\_envs.side\_channel.incoming\_message
+
+
+## IncomingMessage Objects
+
+```python
+class IncomingMessage()
+```
+
+Utility class for reading the message written to a SideChannel. Values must be read in the order they were written.
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__(buffer: bytes, offset: int = 0)
+```
+
+Create a new IncomingMessage from the bytes.
+
+
+#### read\_bool
+
+```python
+ | read_bool(default_value: bool = False) -> bool
+```
+
+Read a boolean value from the message buffer.
+
+**Arguments**:
+
+- `default_value`: Default value to use if the end of the message is reached.
+
+**Returns**:
+
+The value read from the message, or the default value if the end was reached.
+
+
+#### read\_int32
+
+```python
+ | read_int32(default_value: int = 0) -> int
+```
+
+Read an integer value from the message buffer.
+
+**Arguments**:
+
+- `default_value`: Default value to use if the end of the message is reached.
+
+**Returns**:
+
+The value read from the message, or the default value if the end was reached.
+
+
+#### read\_float32
+
+```python
+ | read_float32(default_value: float = 0.0) -> float
+```
+
+Read a float value from the message buffer.
+
+**Arguments**:
+
+- `default_value`: Default value to use if the end of the message is reached.
+
+**Returns**:
+
+The value read from the message, or the default value if the end was reached.
+
+
+#### read\_float32\_list
+
+```python
+ | read_float32_list(default_value: List[float] = None) -> List[float]
+```
+
+Read a list of float values from the message buffer.
+
+**Arguments**:
+
+- `default_value`: Default value to use if the end of the message is reached.
+
+**Returns**:
+
+The value read from the message, or the default value if the end was reached.
+
+
+#### read\_string
+
+```python
+ | read_string(default_value: str = "") -> str
+```
+
+Read a string value from the message buffer.
+
+**Arguments**:
+
+- `default_value`: Default value to use if the end of the message is reached.
+
+**Returns**:
+
+The value read from the message, or the default value if the end was reached.
+
+
+#### get\_raw\_bytes
+
+```python
+ | get_raw_bytes() -> bytes
+```
+
+Get a copy of the internal bytes used by the message.
+
+
+# mlagents\_envs.side\_channel.float\_properties\_channel
+
+
+## FloatPropertiesChannel Objects
+
+```python
+class FloatPropertiesChannel(SideChannel)
+```
+
+This is the SideChannel for float properties shared with Unity. You can modify the float properties of an environment with the commands set_property, get_property and list_properties.
+
+
+#### on\_message\_received
+
+```python
+ | on_message_received(msg: IncomingMessage) -> None
+```
+
+Is called by the environment to the side channel. Can be called multiple times per step if multiple messages are meant for that SideChannel.
+
+
+#### set\_property
+
+```python
+ | set_property(key: str, value: float) -> None
+```
+
+Sets a property in the Unity Environment.
+
+**Arguments**:
+
+- `key`: The string identifier of the property.
+- `value`: The float value of the property.
+
+
+#### get\_property
+
+```python
+ | get_property(key: str) -> Optional[float]
+```
+
+Gets a property in the Unity Environment. If the property was not found, will return None.
+
+**Arguments**:
+
+- `key`: The string identifier of the property.
+
+**Returns**:
+
+The float value of the property or None.
+
+
+#### list\_properties
+
+```python
+ | list_properties() -> List[str]
+```
+
+Returns a list of all the string identifiers of the properties currently present in the Unity Environment.
+
+
+#### get\_property\_dict\_copy
+
+```python
+ | get_property_dict_copy() -> Dict[str, float]
+```
+
+Returns a copy of the float properties.
+
+**Returns**:
+
+
+
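+For example, a minimal sketch of sharing float properties with a running environment; the `difficulty` key is hypothetical and must match what the C# side reads:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.side_channel.float_properties_channel import FloatPropertiesChannel
+
+channel = FloatPropertiesChannel()
+# No file_name: the environment waits for you to press Play in the Editor.
+env = UnityEnvironment(side_channels=[channel])
+
+channel.set_property("difficulty", 2.0)    # hypothetical property name
+env.reset()
+print(channel.get_property_dict_copy())    # any properties sent back by the C# side
+```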
+
+# mlagents\_envs.side\_channel.environment\_parameters\_channel
+
+
+## EnvironmentParametersChannel Objects
+
+```python
+class EnvironmentParametersChannel(SideChannel)
+```
+
+This is the SideChannel for sending environment parameters to Unity. You can send parameters to an environment with the command set_float_parameter.
+
+
+#### set\_float\_parameter
+
+```python
+ | set_float_parameter(key: str, value: float) -> None
+```
+
+Sets a float environment parameter in the Unity Environment.
+
+**Arguments**:
+
+- `key`: The string identifier of the parameter.
+- `value`: The float value of the parameter.
+
+
+#### set\_uniform\_sampler\_parameters
+
+```python
+ | set_uniform_sampler_parameters(key: str, min_value: float, max_value: float, seed: int) -> None
+```
+
+Sets a uniform environment parameter sampler.
+
+**Arguments**:
+
+- `key`: The string identifier of the parameter.
+- `min_value`: The minimum of the sampling distribution.
+- `max_value`: The maximum of the sampling distribution.
+- `seed`: The random seed to initialize the sampler.
+
+
+#### set\_gaussian\_sampler\_parameters
+
+```python
+ | set_gaussian_sampler_parameters(key: str, mean: float, st_dev: float, seed: int) -> None
+```
+
+Sets a gaussian environment parameter sampler.
+
+**Arguments**:
+
+- `key`: The string identifier of the parameter.
+- `mean`: The mean of the sampling distribution.
+- `st_dev`: The standard deviation of the sampling distribution.
+- `seed`: The random seed to initialize the sampler.
+
+
+#### set\_multirangeuniform\_sampler\_parameters
+
+```python
+ | set_multirangeuniform_sampler_parameters(key: str, intervals: List[Tuple[float, float]], seed: int) -> None
+```
+
+Sets a multirangeuniform environment parameter sampler.
+
+**Arguments**:
+
+- `key`: The string identifier of the parameter.
+- `intervals`: The lists of min and max that define each uniform distribution.
+- `seed`: The random seed to initialize the sampler.
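+
+For example, a minimal sketch of sending a fixed parameter and a uniformly sampled parameter; the parameter names are hypothetical and must match what the C# side reads from `Academy.Instance.EnvironmentParameters`:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.side_channel.environment_parameters_channel import EnvironmentParametersChannel
+
+channel = EnvironmentParametersChannel()
+# No file_name: the environment waits for you to press Play in the Editor.
+env = UnityEnvironment(side_channels=[channel])
+
+channel.set_float_parameter("gravity", -9.81)  # hypothetical parameter name
+channel.set_uniform_sampler_parameters("obstacle_scale", min_value=0.5, max_value=2.0, seed=1)
+env.reset()
+```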
+
+
+# mlagents\_envs.side\_channel.side\_channel
+
+
+## SideChannel Objects
+
+```python
+class SideChannel(ABC)
+```
+
+The side channel just gets access to a bytes buffer that will be shared between C# and Python. For example, we will create a specific side channel for properties that will be a list of string (fixed size) to float number, that can be modified by both C# and Python. All side channels are passed to the Env object at construction.
+
+
+#### queue\_message\_to\_send
+
+```python
+ | queue_message_to_send(msg: OutgoingMessage) -> None
+```
+
+Queues a message to be sent by the environment at the next call to step.
+
+
+#### on\_message\_received
+
+```python
+ | @abstractmethod
+ | on_message_received(msg: IncomingMessage) -> None
+```
+
+Is called by the environment to the side channel. Can be called multiple times per step if multiple messages are meant for that SideChannel.
+
+
+#### channel\_id
+
+```python
+ | @property
+ | channel_id() -> uuid.UUID
+```
+
+**Returns**:
+
+The type of side channel used. Will influence how the data is processed in the environment.
diff --git a/com.unity.ml-agents/Documentation~/Python-LLAPI.md b/com.unity.ml-agents/Documentation~/Python-LLAPI.md
new file mode 100644
index 0000000000..45f560ccea
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-LLAPI.md
@@ -0,0 +1,204 @@
+# Unity ML-Agents Python Low Level API
+
+The `mlagents` Python package contains two components: a low level API which allows you to interact directly with a Unity Environment (`mlagents_envs`) and an entry point to train (`mlagents-learn`) which allows you to train agents in Unity Environments using our implementations of reinforcement learning or imitation learning. This document describes how to use the `mlagents_envs` API. For information on using `mlagents-learn`, see [here](Training-ML-Agents.md). For Python Low Level API documentation, see [here](Python-LLAPI-Documentation.md).
+
+The Python Low Level API can be used to interact directly with your Unity learning environment. As such, it can serve as the basis for developing and evaluating new learning algorithms.
+
+## mlagents_envs
+
+The ML-Agents Toolkit Low Level API is a Python API for controlling the simulation loop of an environment or game built with Unity. This API is used by the training algorithms inside the ML-Agents Toolkit, but you can also write your own Python programs using this API.
+
+The key objects in the Python API include:
+
+- **UnityEnvironment** — the main interface between the Unity application and your code. Use UnityEnvironment to start and control a simulation or training session.
+- **BehaviorName** - is a string that identifies a behavior in the simulation.
+- **AgentId** - is an `int` that serves as unique identifier for Agents in the simulation.
+- **DecisionSteps** — contains the data from Agents belonging to the same "Behavior" in the simulation, such as observations and rewards. Only Agents that requested a decision since the last call to `env.step()` are in the DecisionSteps object.
+- **TerminalSteps** — contains the data from Agents belonging to the same "Behavior" in the simulation, such as observations and rewards. Only Agents whose episode ended since the last call to `env.step()` are in the TerminalSteps object.
+- **BehaviorSpec** — describes the shape of the observation data inside DecisionSteps and TerminalSteps as well as the expected action shapes.
+
+These classes are all defined in the [base_env](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents-envs/mlagents_envs/base_env.py) script.
+
+An Agent "Behavior" is a group of Agents identified by a `BehaviorName` that share the same observations and action types (described in their `BehaviorSpec`). You can think about Agent Behavior as a group of agents that will share the same policy. All Agents with the same behavior have the same goal and reward signals.
+
+To communicate with an Agent in a Unity environment from a Python program, the Agent in the simulation must have `Behavior Parameters` set to communicate. You must set the `Behavior Type` to `Default` and give it a `Behavior Name`.
+
+_Notice: Currently communication between Unity and Python takes place over an open socket without authentication. As such, please make sure that the network where training takes place is secure. This will be addressed in a future release._
+
+## Loading a Unity Environment
+
+Python-side communication happens through `UnityEnvironment` which is located in [`environment.py`](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents-envs/mlagents_envs/environment.py). To load a Unity environment from a built binary file, put the file in the same directory as `envs`. For example, if the filename of your Unity environment is `3DBall`, in Python, run:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+# This is a non-blocking call that only loads the environment.
+env = UnityEnvironment(file_name="3DBall", seed=1, side_channels=[])
+# Start interacting with the environment.
+env.reset()
+behavior_names = env.behavior_specs.keys()
+...
+```
+**NOTE:** Please read [Interacting with a Unity Environment](#interacting-with-a-unity-environment) to read more about how you can interact with the Unity environment from Python.
+
+- `file_name` is the name of the environment binary (located in the root directory of the python project).
+- `worker_id` indicates which port to use for communication with the environment. For use in parallel training regimes such as A3C.
+- `seed` indicates the seed to use when generating random numbers during the training process. In environments which are stochastic, setting the seed enables reproducible experimentation by ensuring that the environment and trainers utilize the same random seed.
+- `side_channels` provides a way to exchange data with the Unity simulation that is not related to the reinforcement learning loop. For example: configurations or properties. More on them in the [Side Channels](Custom-SideChannels.md) doc.
+
+If you want to directly interact with the Editor, you need to use `file_name=None`, then press the **Play** button in the Editor when the message _"Start training by pressing the Play button in the Unity Editor"_ is displayed on the screen.
+
+### Interacting with a Unity Environment
+
+#### The BaseEnv interface
+
+A `BaseEnv` has the following methods:
+
+- **Reset : `env.reset()`** Sends a signal to reset the environment. Returns None.
+- **Step : `env.step()`** Sends a signal to step the environment. Returns None. Note that a "step" for Python does not correspond to either Unity `Update` or `FixedUpdate`. When `step()` or `reset()` is called, the Unity simulation will move forward until an Agent in the simulation needs an input from Python to act.
+- **Close : `env.close()`** Sends a shutdown signal to the environment and terminates the communication.
+- **Behavior Specs : `env.behavior_specs`** Returns a Mapping of `BehaviorName` to `BehaviorSpec` objects (read only). A `BehaviorSpec` contains the observation shapes and the `ActionSpec` (which defines the action shape). Note that the `BehaviorSpec` for a specific group is fixed throughout the simulation. The number of entries in the Mapping can change over time in the simulation if new Agent behaviors are created in the simulation.
+- **Get Steps : `env.get_steps(behavior_name: str)`** Returns a tuple `DecisionSteps, TerminalSteps` corresponding to the behavior_name given as input. The `DecisionSteps` contains information about the state of the agents
+ **that need an action this step** and have the behavior behavior_name. The `TerminalSteps` contains information about the state of the agents **whose episode ended** and have the behavior behavior_name. Both `DecisionSteps` and `TerminalSteps` contain information such as the observations, the rewards and the agent identifiers. `DecisionSteps` also contains action masks for the next action while `TerminalSteps` contains the reason for termination (did the Agent reach its maximum step and was interrupted). The data is in `np.array` of which the first dimension is always the number of agents. Note that the number of agents is not guaranteed to remain constant during the simulation, and it is not unusual to have either `DecisionSteps` or `TerminalSteps` contain no Agents at all.
+- **Set Actions :`env.set_actions(behavior_name: str, action: ActionTuple)`** Sets the actions for a whole agent group. `action` is an `ActionTuple`, which is made up of a 2D `np.array` of `dtype=np.int32` for discrete actions, and `dtype=np.float32` for continuous actions. The first dimension of the `np.array` in the tuple is the number of agents that requested a decision since the last call to `env.step()`. The second dimension is the number of discrete or continuous actions for the corresponding array.
+- **Set Action for Agent: `env.set_action_for_agent(agent_group: str, agent_id: int, action: ActionTuple)`** Sets the action for a specific Agent in an agent group. `agent_group` is the name of the group the Agent belongs to and `agent_id` is the integer identifier of the Agent. `action` is an `ActionTuple` as described above.
+**Note:** If no action is provided for an agent group between two calls to `env.step()` then the default action will be all zeros.
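+
+Putting these methods together, here is a minimal random-action loop. This is a sketch that assumes a build named `3DBall` next to the script; use `file_name=None` to connect to the Editor instead:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+
+env = UnityEnvironment(file_name="3DBall", seed=1, side_channels=[])
+env.reset()
+
+behavior_name = list(env.behavior_specs)[0]
+spec = env.behavior_specs[behavior_name]
+
+for _ in range(1000):
+    decision_steps, terminal_steps = env.get_steps(behavior_name)
+    if len(decision_steps) > 0:
+        # Random actions for every agent that requested a decision this step.
+        env.set_actions(behavior_name, spec.action_spec.random_action(len(decision_steps)))
+    env.step()
+
+env.close()
+```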
+
+#### DecisionSteps and DecisionStep
+
+`DecisionSteps` (with `s`) contains information about a whole batch of Agents while `DecisionStep` (no `s`) only contains information about a single Agent.
+
+A `DecisionSteps` has the following fields:
+
+- `obs` is a list of numpy array observations collected by the group of agents. The first dimension of each array corresponds to the batch size of the group (the number of agents requesting a decision since the last call to `env.step()`).
+- `reward` is a float vector of length batch size. Corresponds to the rewards collected by each agent since the last simulation step.
+- `agent_id` is an int vector of length batch size containing the unique identifiers of the corresponding Agents. This is used to track Agents across simulation steps.
+- `action_mask` is an optional list of two-dimensional arrays of booleans which is only available when using multi-discrete actions. Each array corresponds to an action branch. The first dimension of each array is the batch size and the second contains a mask for each action of the branch. If true, the action is not available for the agent during this simulation step.
+
+It also has the following two methods:
+
+- `len(DecisionSteps)` Returns the number of agents requesting a decision since the last call to `env.step()`.
+- `DecisionSteps[agent_id]` Returns a `DecisionStep` for the Agent with the `agent_id` unique identifier.
+
+A `DecisionStep` has the following fields:
+
+- `obs` is a list of numpy array observations collected by the agent. (Each array has one less dimension than the arrays in `DecisionSteps`.)
+- `reward` is a float. Corresponds to the rewards collected by the agent since the last simulation step.
+- `agent_id` is an int and a unique identifier for the corresponding Agent.
+- `action_mask` is an optional list of one-dimensional arrays of booleans which is only available when using multi-discrete actions. Each array corresponds to an action branch. Each array contains a mask for each action of the branch. If true, the action is not available for the agent during this simulation step.
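+
+As a hedged illustration (assuming `env`, `behavior_name` and a prior `env.step()` as in the earlier sketch), the snippet below shows how the batched `DecisionSteps` can be indexed into per-agent `DecisionStep` objects:
+
+```python
+decision_steps, terminal_steps = env.get_steps(behavior_name)
+print(f"{len(decision_steps)} agent(s) requested a decision")
+
+for agent_id in decision_steps.agent_id:
+    single_step = decision_steps[agent_id]  # a DecisionStep for this agent
+    first_obs = single_step.obs[0]          # first observation array (one less dim than the batch)
+    print(agent_id, single_step.reward, first_obs.shape)
+```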
+
+#### TerminalSteps and TerminalStep
+
+Similarly to `DecisionSteps` and `DecisionStep`, `TerminalSteps` (with `s`) contains information about a whole batch of Agents while `TerminalStep` (no `s`) only contains information about a single Agent.
+
+A `TerminalSteps` has the following fields:
+
+- `obs` is a list of numpy array observations collected by the group of agents. The first dimension of each array corresponds to the batch size of the group (the number of agents whose episode ended since the last call to `env.step()`).
+- `reward` is a float vector of length batch size. Corresponds to the rewards collected by each agent since the last simulation step.
+- `agent_id` is an int vector of length batch size containing the unique identifiers of the corresponding Agents. This is used to track Agents across simulation steps.
+- `interrupted` is an array of booleans of length batch size. Is true if the associated Agent was interrupted since the last decision step. For example, if the Agent reached the maximum number of steps for the episode.
+
+It also has the following two methods:
+
+- `len(TerminalSteps)` Returns the number of agents whose episode ended since the last call to `env.step()`.
+- `TerminalSteps[agent_id]` Returns a `TerminalStep` for the Agent with the `agent_id` unique identifier.
+
+A `TerminalStep` has the following fields:
+
+- `obs` is a list of numpy array observations collected by the agent. (Each array has one less dimension than the arrays in `TerminalSteps`.)
+- `reward` is a float. Corresponds to the rewards collected by the agent since the last simulation step.
+- `agent_id` is an int and a unique identifier for the corresponding Agent.
+- `interrupted` is a bool. Is true if the Agent was interrupted since the last decision step. For example, if the Agent reached the maximum number of steps for the episode.
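+
+A similar sketch for the terminal side (same assumptions as above) records the final reward of agents whose episode just ended and checks the `interrupted` flag:
+
+```python
+decision_steps, terminal_steps = env.get_steps(behavior_name)
+
+final_rewards = {}
+for agent_id in terminal_steps.agent_id:
+    step = terminal_steps[agent_id]  # a TerminalStep for this agent
+    final_rewards[agent_id] = step.reward
+    if step.interrupted:
+        # The episode ended because the Agent was interrupted (e.g. max steps reached),
+        # not because it reached a terminal state.
+        print(f"Agent {agent_id} was interrupted")
+```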
+
+#### BehaviorSpec
+
+A `BehaviorSpec` has the following fields:
+
+- `observation_specs` is a List of `ObservationSpec` objects. Each `ObservationSpec` corresponds to an observation's properties: `shape` is a tuple of ints that corresponds to the shape of the observation (without the number of agents dimension), `dimension_property` is a tuple of flags containing extra information about how the data should be processed in the corresponding dimension, and `observation_type` is an enum corresponding to what type of observation is generating the data (i.e., default, goal, etc.). Note that the `ObservationSpec` objects have the same ordering as the observations in `DecisionSteps`, `DecisionStep`, `TerminalSteps` and `TerminalStep`.
+- `action_spec` is an `ActionSpec` namedtuple that defines the number and types of actions for the Agent.
+
+An `ActionSpec` has the following fields and properties:
+- `continuous_size` is the number of floats that constitute the continuous actions.
+- `discrete_size` is the number of branches (the number of independent actions) that constitute the multi-discrete actions.
+- `discrete_branches` is a Tuple of ints. Each int corresponds to the number of different options for each branch of the action. For example: in a game with a direction input (no movement, left, right) and a jump input (no jump, jump) there will be two branches (direction and jump), the first one with 3 options and the second with 2 options. (`discrete_size = 2` and `discrete_branches = (3, 2)`)
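+
+For illustration, and under the same assumptions as the earlier sketches (an `env`, a `behavior_name` and a `decision_steps` batch from `env.get_steps`), the snippet below inspects a `BehaviorSpec` and builds a matching all-zero `ActionTuple` by hand:
+
+```python
+import numpy as np
+from mlagents_envs.base_env import ActionTuple
+
+spec = env.behavior_specs[behavior_name]
+for obs_spec in spec.observation_specs:
+    print(obs_spec.shape, obs_spec.observation_type)
+
+action_spec = spec.action_spec
+n_agents = len(decision_steps)  # agents that requested a decision this step
+
+# All-zero continuous and discrete actions, shaped (n_agents, size).
+action = ActionTuple(
+    continuous=np.zeros((n_agents, action_spec.continuous_size), dtype=np.float32),
+    discrete=np.zeros((n_agents, action_spec.discrete_size), dtype=np.int32),
+)
+env.set_actions(behavior_name, action)
+```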
+
+
+### Communicating additional information with the Environment
+
+In addition to the means of communicating between Unity and python described above, we also provide methods for sharing agent-agnostic information. These additional methods are referred to as side channels. ML-Agents includes two ready-made side channels, described below. It is also possible to create custom side channels to communicate any additional data between a Unity environment and Python. Instructions for creating custom side channels can be found [here](Custom-SideChannels.md).
+
+Side channels exist as separate classes which are instantiated, and then passed as a list to the `side_channels` argument of the constructor of the `UnityEnvironment` class.
+
+```python
+channel = MyChannel()
+
+env = UnityEnvironment(side_channels = [channel])
+```
+
+**Note**: A side channel will only send/receive messages when `env.step()` or
+`env.reset()` is called.
+
+#### EngineConfigurationChannel
+
+The `EngineConfigurationChannel` side channel allows you to modify the time scale, resolution, and graphics quality of the environment. This can be useful for adjusting the environment to perform better during training, or be more interpretable during inference.
+
+`EngineConfigurationChannel` has two methods:
+
+- `set_configuration_parameters` which takes the following arguments:
+ - `width`: Defines the width of the display. (Must be set alongside height)
+ - `height`: Defines the height of the display. (Must be set alongside width)
+ - `quality_level`: Defines the quality level of the simulation.
+  - `time_scale`: Defines the multiplier for the delta time in the simulation. If set to a higher value, time will pass faster in the simulation but the physics may behave unpredictably.
+  - `target_frame_rate`: Instructs the simulation to try to render at a specified frame rate.
+  - `capture_frame_rate`: Instructs the simulation to consider time between updates to always be constant, regardless of the actual frame rate.
+- `set_configuration` with argument config which is an `EngineConfig` NamedTuple object.
+
+For example, the following code would adjust the time scale of the simulation to be 2x real time.
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel
+
+channel = EngineConfigurationChannel()
+
+env = UnityEnvironment(side_channels=[channel])
+
+channel.set_configuration_parameters(time_scale=2.0)
+
+env.reset()
+...
+```
+
+#### EnvironmentParameters
+
+The `EnvironmentParameters` side channel allows you to set pre-defined numerical values (parameters) in the environment. This can be useful for adjusting environment-specific settings from Python. From Python, you call `set_float_parameter` on the side channel to write a parameter; in C#, you read it through `Academy.Instance.EnvironmentParameters`, as shown below.
+
+`EnvironmentParametersChannel` has one method:
+
+- `set_float_parameter` Sets a float parameter in the Unity Environment.
+  - `key`: The string identifier of the parameter.
+  - `value`: The float value of the parameter.
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.side_channel.environment_parameters_channel import EnvironmentParametersChannel
+
+channel = EnvironmentParametersChannel()
+
+env = UnityEnvironment(side_channels=[channel])
+
+channel.set_float_parameter("parameter_1", 2.0)
+
+env.reset()
+...
+```
+
+Once a parameter has been set in Python, you can access it in C# after the next call to `env.step()` as follows:
+
+```csharp
+var envParameters = Academy.Instance.EnvironmentParameters;
+float property1 = envParameters.GetWithDefault("parameter_1", 0.0f);
+```
+
+#### Custom side channels
+
+For information on how to make custom side channels for sending additional data types, see the documentation [here](Custom-SideChannels.md).
diff --git a/com.unity.ml-agents/Documentation~/Python-On-Off-Policy-Trainer-Documentation.md b/com.unity.ml-agents/Documentation~/Python-On-Off-Policy-Trainer-Documentation.md
new file mode 100644
index 0000000000..fc987405a4
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-On-Off-Policy-Trainer-Documentation.md
@@ -0,0 +1,682 @@
+# On/Off Policy Trainer Documentation
+
+
+# mlagents.trainers.trainer.on\_policy\_trainer
+
+
+## OnPolicyTrainer Objects
+
+```python
+class OnPolicyTrainer(RLTrainer)
+```
+
+The PPOTrainer is an implementation of the PPO algorithm.
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__(behavior_name: str, reward_buff_cap: int, trainer_settings: TrainerSettings, training: bool, load: bool, seed: int, artifact_path: str)
+```
+
+Responsible for collecting experiences and training an on-policy model.
+
+**Arguments**:
+
+- `behavior_name`: The name of the behavior associated with trainer config
+- `reward_buff_cap`: Max reward history to track in the reward buffer
+- `trainer_settings`: The parameters for the trainer.
+- `training`: Whether the trainer is set for training.
+- `load`: Whether the model should be loaded.
+- `seed`: The seed the model will be initialized with
+- `artifact_path`: The directory within which to store artifacts from this trainer.
+
+
+#### add\_policy
+
+```python
+ | add_policy(parsed_behavior_id: BehaviorIdentifiers, policy: Policy) -> None
+```
+
+Adds policy to trainer.
+
+**Arguments**:
+
+- `parsed_behavior_id`: Behavior identifiers that the policy should belong to.
+- `policy`: Policy to associate with name_behavior_id.
+
+
+# mlagents.trainers.trainer.off\_policy\_trainer
+
+
+## OffPolicyTrainer Objects
+
+```python
+class OffPolicyTrainer(RLTrainer)
+```
+
+The SACTrainer is an implementation of the SAC algorithm, with support for discrete actions and recurrent networks.
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__(behavior_name: str, reward_buff_cap: int, trainer_settings: TrainerSettings, training: bool, load: bool, seed: int, artifact_path: str)
+```
+
+Responsible for collecting experiences and training an off-policy model.
+
+**Arguments**:
+
+- `behavior_name`: The name of the behavior associated with trainer config
+- `reward_buff_cap`: Max reward history to track in the reward buffer
+- `trainer_settings`: The parameters for the trainer.
+- `training`: Whether the trainer is set for training.
+- `load`: Whether the model should be loaded.
+- `seed`: The seed the model will be initialized with
+- `artifact_path`: The directory within which to store artifacts from this trainer.
+
+
+#### save\_model
+
+```python
+ | save_model() -> None
+```
+
+Saves the final training model to memory. Overrides the default to save the replay buffer.
+
+
+#### save\_replay\_buffer
+
+```python
+ | save_replay_buffer() -> None
+```
+
+Save the training buffer's update buffer to a pickle file.
+
+
+#### load\_replay\_buffer
+
+```python
+ | load_replay_buffer() -> None
+```
+
+Loads the last saved replay buffer from a file.
+
+
+#### add\_policy
+
+```python
+ | add_policy(parsed_behavior_id: BehaviorIdentifiers, policy: Policy) -> None
+```
+
+Adds policy to trainer.
+
+
+# mlagents.trainers.trainer.rl\_trainer
+
+
+## RLTrainer Objects
+
+```python
+class RLTrainer(Trainer)
+```
+
+This class is the base class for trainers that use Reward Signals.
+
+
+#### end\_episode
+
+```python
+ | end_episode() -> None
+```
+
+A signal that the Episode has ended. The buffer must be reset. Only called when the Academy resets.
+
+
+#### create\_optimizer
+
+```python
+ | @abc.abstractmethod
+ | create_optimizer() -> TorchOptimizer
+```
+
+Creates an Optimizer object
+
+
+#### save\_model
+
+```python
+ | save_model() -> None
+```
+
+Saves the policy associated with this trainer.
+
+
+#### advance
+
+```python
+ | advance() -> None
+```
+
+Steps the trainer, taking in trajectories and updates if ready. Will block and wait briefly if there are no trajectories.
+
+
+# mlagents.trainers.trainer.trainer
+
+
+## Trainer Objects
+
+```python
+class Trainer(abc.ABC)
+```
+
+This class is the base class for the trainers in mlagents.trainers.
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__(brain_name: str, trainer_settings: TrainerSettings, training: bool, load: bool, artifact_path: str, reward_buff_cap: int = 1)
+```
+
+Responsible for collecting experiences and training a neural network model.
+
+**Arguments**:
+
+- `brain_name`: Brain name of brain to be trained.
+- `trainer_settings`: The parameters for the trainer (dictionary).
+- `training`: Whether the trainer is set for training.
+- `artifact_path`: The directory within which to store artifacts from this trainer
+- `reward_buff_cap`: Max reward history to track in the reward buffer.
+
+
+#### stats\_reporter
+
+```python
+ | @property
+ | stats_reporter()
+```
+
+Returns the stats reporter associated with this Trainer.
+
+
+#### parameters
+
+```python
+ | @property
+ | parameters() -> TrainerSettings
+```
+
+Returns the trainer parameters of the trainer.
+
+
+#### get\_max\_steps
+
+```python
+ | @property
+ | get_max_steps() -> int
+```
+
+Returns the maximum number of steps. Is used to know when the trainer should be stopped.
+
+**Returns**:
+
+The maximum number of steps of the trainer
+
+
+#### get\_step
+
+```python
+ | @property
+ | get_step() -> int
+```
+
+Returns the number of steps the trainer has performed
+
+**Returns**:
+
+the step count of the trainer
+
+
+#### threaded
+
+```python
+ | @property
+ | threaded() -> bool
+```
+
+Whether or not to run the trainer in a thread. True allows the trainer to update the policy while the environment is taking steps. Set to False to enforce strict on-policy updates (i.e. don't update the policy when taking steps.)
+
+
+#### should\_still\_train
+
+```python
+ | @property
+ | should_still_train() -> bool
+```
+
+Returns whether or not the trainer should train. A Trainer could stop training if it wasn't training to begin with, or if max_steps is reached.
+
+
+#### reward\_buffer
+
+```python
+ | @property
+ | reward_buffer() -> Deque[float]
+```
+
+Returns the reward buffer. The reward buffer contains the cumulative rewards of the most recent episodes completed by agents using this trainer.
+
+**Returns**:
+
+the reward buffer.
+
+
+#### save\_model
+
+```python
+ | @abc.abstractmethod
+ | save_model() -> None
+```
+
+Saves model file(s) for the policy or policies associated with this trainer.
+
+
+#### end\_episode
+
+```python
+ | @abc.abstractmethod
+ | end_episode()
+```
+
+A signal that the Episode has ended. The buffer must be reset. Only called when the Academy resets.
+
+
+#### create\_policy
+
+```python
+ | @abc.abstractmethod
+ | create_policy(parsed_behavior_id: BehaviorIdentifiers, behavior_spec: BehaviorSpec) -> Policy
+```
+
+Creates a Policy object
+
+
+#### add\_policy
+
+```python
+ | @abc.abstractmethod
+ | add_policy(parsed_behavior_id: BehaviorIdentifiers, policy: Policy) -> None
+```
+
+Adds policy to trainer.
+
+
+#### get\_policy
+
+```python
+ | get_policy(name_behavior_id: str) -> Policy
+```
+
+Gets policy associated with name_behavior_id
+
+**Arguments**:
+
+- `name_behavior_id`: Fully qualified behavior name
+
+**Returns**:
+
+Policy associated with name_behavior_id
+
+
+#### advance
+
+```python
+ | @abc.abstractmethod
+ | advance() -> None
+```
+
+Advances the trainer. Typically, this means grabbing trajectories from all subscribed trajectory queues (self.trajectory_queues), updating a policy using the steps in them, and, if needed, pushing a new policy onto the right policy queues (self.policy_queues).
+
+
+#### publish\_policy\_queue
+
+```python
+ | publish_policy_queue(policy_queue: AgentManagerQueue[Policy]) -> None
+```
+
+Adds a policy queue to the list of queues to publish to when this Trainer makes a policy update
+
+**Arguments**:
+
+- `policy_queue`: Policy queue to publish to.
+
+
+#### subscribe\_trajectory\_queue
+
+```python
+ | subscribe_trajectory_queue(trajectory_queue: AgentManagerQueue[Trajectory]) -> None
+```
+
+Adds a trajectory queue to the list of queues for the trainer to ingest Trajectories from.
+
+**Arguments**:
+
+- `trajectory_queue`: Trajectory queue to read from.
+
+
+# mlagents.trainers.settings
+
+
+#### deep\_update\_dict
+
+```python
+deep_update_dict(d: Dict, update_d: Mapping) -> None
+```
+
+Similar to dict.update(), but works for nested dicts of dicts as well.
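+
+A small usage sketch, assuming the in-place, deep-merge semantics described above (the nested keys not present in the update are preserved):
+
+```python
+from mlagents.trainers.settings import deep_update_dict
+
+d = {"hyperparameters": {"batch_size": 64, "learning_rate": 3.0e-4}}
+deep_update_dict(d, {"hyperparameters": {"batch_size": 128}})
+
+# Unlike dict.update(), the nested dict is merged rather than replaced:
+# d == {"hyperparameters": {"batch_size": 128, "learning_rate": 3.0e-4}}
+```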
+
+
+## RewardSignalSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class RewardSignalSettings()
+```
+
+
+#### structure
+
+```python
+ | @staticmethod
+ | structure(d: Mapping, t: type) -> Any
+```
+
+Helper method to structure a Dict of RewardSignalSettings classes. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure(). This is needed to handle the special Enum selection of RewardSignalSettings classes.
+
+
+## ParameterRandomizationSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class ParameterRandomizationSettings(abc.ABC)
+```
+
+
+#### \_\_str\_\_
+
+```python
+ | __str__() -> str
+```
+
+Helper method to output sampler stats to console.
+
+
+#### structure
+
+```python
+ | @staticmethod
+ | structure(d: Union[Mapping, float], t: type) -> "ParameterRandomizationSettings"
+```
+
+Helper method to structure a ParameterRandomizationSettings class. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure(). This is needed to handle the special Enum selection of ParameterRandomizationSettings classes.
+
+
+#### unstructure
+
+```python
+ | @staticmethod
+ | unstructure(d: "ParameterRandomizationSettings") -> Mapping
+```
+
+Helper method to unstructure a ParameterRandomizationSettings class. Meant to be registered with cattr.register_unstructure_hook() and called with cattr.unstructure().
+
+
+#### apply
+
+```python
+ | @abc.abstractmethod
+ | apply(key: str, env_channel: EnvironmentParametersChannel) -> None
+```
+
+Helper method to send sampler settings over EnvironmentParametersChannel. Calls the appropriate sampler type set method.
+
+**Arguments**:
+
+- `key`: environment parameter to be sampled
+- `env_channel`: The EnvironmentParametersChannel to communicate sampler settings to environment
+
+
+## ConstantSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class ConstantSettings(ParameterRandomizationSettings)
+```
+
+
+#### \_\_str\_\_
+
+```python
+ | __str__() -> str
+```
+
+Helper method to output sampler stats to console.
+
+
+#### apply
+
+```python
+ | apply(key: str, env_channel: EnvironmentParametersChannel) -> None
+```
+
+Helper method to send sampler settings over EnvironmentParametersChannel. Calls the constant sampler type set method.
+
+**Arguments**:
+
+- `key`: environment parameter to be sampled
+- `env_channel`: The EnvironmentParametersChannel to communicate sampler settings to environment
+
+
+## UniformSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class UniformSettings(ParameterRandomizationSettings)
+```
+
+
+#### \_\_str\_\_
+
+```python
+ | __str__() -> str
+```
+
+Helper method to output sampler stats to console.
+
+
+#### apply
+
+```python
+ | apply(key: str, env_channel: EnvironmentParametersChannel) -> None
+```
+
+Helper method to send sampler settings over EnvironmentParametersChannel. Calls the uniform sampler type set method.
+
+**Arguments**:
+
+- `key`: environment parameter to be sampled
+- `env_channel`: The EnvironmentParametersChannel to communicate sampler settings to environment
+
+
+## GaussianSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class GaussianSettings(ParameterRandomizationSettings)
+```
+
+
+#### \_\_str\_\_
+
+```python
+ | __str__() -> str
+```
+
+Helper method to output sampler stats to console.
+
+
+#### apply
+
+```python
+ | apply(key: str, env_channel: EnvironmentParametersChannel) -> None
+```
+
+Helper method to send sampler settings over EnvironmentParametersChannel. Calls the gaussian sampler type set method.
+
+**Arguments**:
+
+- `key`: environment parameter to be sampled
+- `env_channel`: The EnvironmentParametersChannel to communicate sampler settings to environment
+
+
+## MultiRangeUniformSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class MultiRangeUniformSettings(ParameterRandomizationSettings)
+```
+
+
+#### \_\_str\_\_
+
+```python
+ | __str__() -> str
+```
+
+Helper method to output sampler stats to console.
+
+
+#### apply
+
+```python
+ | apply(key: str, env_channel: EnvironmentParametersChannel) -> None
+```
+
+Helper method to send sampler settings over EnvironmentParametersChannel. Calls the multirangeuniform sampler type set method.
+
+**Arguments**:
+
+- `key`: environment parameter to be sampled
+- `env_channel`: The EnvironmentParametersChannel to communicate sampler settings to environment
+
+
+## CompletionCriteriaSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class CompletionCriteriaSettings()
+```
+
+CompletionCriteriaSettings contains the information needed to figure out if the next lesson must start.
+
+
+#### need\_increment
+
+```python
+ | need_increment(progress: float, reward_buffer: List[float], smoothing: float) -> Tuple[bool, float]
+```
+
+Given measures, this method returns a boolean indicating if the lesson needs to change now, and a float corresponding to the new smoothed value.
+
+
+## Lesson Objects
+
+```python
+@attr.s(auto_attribs=True)
+class Lesson()
+```
+
+Gathers the data of one lesson for one environment parameter including its name, the condition that must be fulfilled for the lesson to be completed and a sampler for the environment parameter. If the completion_criteria is None, then this is the last lesson in the curriculum.
+
+
+## EnvironmentParameterSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class EnvironmentParameterSettings()
+```
+
+EnvironmentParameterSettings is an ordered list of lessons for one environment parameter.
+
+
+#### structure
+
+```python
+ | @staticmethod
+ | structure(d: Mapping, t: type) -> Dict[str, "EnvironmentParameterSettings"]
+```
+
+Helper method to structure a Dict of EnvironmentParameterSettings classes. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure().
+
+
+## TrainerSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class TrainerSettings(ExportableSettings)
+```
+
+
+#### structure
+
+```python
+ | @staticmethod
+ | structure(d: Mapping, t: type) -> Any
+```
+
+Helper method to structure a TrainerSettings class. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure().
+
+
+## CheckpointSettings Objects
+
+```python
+@attr.s(auto_attribs=True)
+class CheckpointSettings()
+```
+
+
+#### prioritize\_resume\_init
+
+```python
+ | prioritize_resume_init() -> None
+```
+
+Prioritize explicit command line resume/init over conflicting yaml options. If both resume/init are set in one place, use resume.
+
+
+## RunOptions Objects
+
+```python
+@attr.s(auto_attribs=True)
+class RunOptions(ExportableSettings)
+```
+
+
+#### from\_argparse
+
+```python
+ | @staticmethod
+ | from_argparse(args: argparse.Namespace) -> "RunOptions"
+```
+
+Takes an argparse.Namespace as specified in `parse_command_line`, loads input configuration files from file paths, and converts to a RunOptions instance.
+
+**Arguments**:
+
+- `args`: collection of command-line parameters passed to mlagents-learn
+
+**Returns**:
+
+RunOptions representing the passed in arguments, with trainer config, curriculum and sampler configs loaded from files.
\ No newline at end of file
diff --git a/com.unity.ml-agents/Documentation~/Python-Optimizer-Documentation.md b/com.unity.ml-agents/Documentation~/Python-Optimizer-Documentation.md
new file mode 100644
index 0000000000..ee75dd9cff
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-Optimizer-Documentation.md
@@ -0,0 +1,75 @@
+# Python Optimizer
+
+
+# mlagents.trainers.optimizer.torch\_optimizer
+
+
+## TorchOptimizer Objects
+
+```python
+class TorchOptimizer(Optimizer)
+```
+
+
+#### create\_reward\_signals
+
+```python
+ | create_reward_signals(reward_signal_configs: Dict[RewardSignalType, RewardSignalSettings]) -> None
+```
+
+Create reward signals
+
+**Arguments**:
+
+- `reward_signal_configs`: Reward signal config.
+
+
+#### get\_trajectory\_value\_estimates
+
+```python
+ | get_trajectory_value_estimates(batch: AgentBuffer, next_obs: List[np.ndarray], done: bool, agent_id: str = "") -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]
+```
+
+Get value estimates and memories for a trajectory, in batch form.
+
+**Arguments**:
+
+- `batch`: An AgentBuffer that consists of a trajectory.
+- `next_obs`: the next observation (after the trajectory). Used for bootstrapping if this is not a terminal trajectory.
+- `done`: Set true if this is a terminal trajectory.
+- `agent_id`: Agent ID of the agent that this trajectory belongs to.
+
+**Returns**:
+
+A Tuple of the Value Estimates as a Dict of [name, np.ndarray(trajectory_len)], the final value estimate as a Dict of [name, float], and optionally (if using memories) an AgentBufferField of initial critic memories to be used during update.
+
+
+# mlagents.trainers.optimizer.optimizer
+
+
+## Optimizer Objects
+
+```python
+class Optimizer(abc.ABC)
+```
+
+Creates loss functions and auxiliary networks (e.g. Q or Value) needed for training. Provides methods to update the Policy.
+
+
+#### update
+
+```python
+ | @abc.abstractmethod
+ | update(batch: AgentBuffer, num_sequences: int) -> Dict[str, float]
+```
+
+Update the Policy based on the batch that was passed in.
+
+**Arguments**:
+
+- `batch`: AgentBuffer that contains the minibatch of data used for this update.
+- `num_sequences`: Number of recurrent sequences found in the minibatch.
+
+**Returns**:
+
+A Dict containing statistics (name, value) from the update (e.g. loss)
diff --git a/com.unity.ml-agents/Documentation~/Python-PettingZoo-API-Documentation.md b/com.unity.ml-agents/Documentation~/Python-PettingZoo-API-Documentation.md
new file mode 100644
index 0000000000..5040ac6879
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-PettingZoo-API-Documentation.md
@@ -0,0 +1,216 @@
+# Python PettingZoo API Documentation
+
+
+# mlagents\_envs.envs.pettingzoo\_env\_factory
+
+
+## PettingZooEnvFactory Objects
+
+```python
+class PettingZooEnvFactory()
+```
+
+
+#### env
+
+```python
+ | env(seed: Optional[int] = None, **kwargs: Union[List, int, bool, None]) -> UnityAECEnv
+```
+
+Creates the environment with env_id from Unity's default_registry and wraps it in a UnityToPettingZooWrapper.
+
+**Arguments**:
+
+- `seed`: The seed for the action spaces of the agents.
+- `kwargs`: Any argument accepted by the `UnityEnvironment` class except `file_name`.
+
+
+# mlagents\_envs.envs.unity\_aec\_env
+
+
+## UnityAECEnv Objects
+
+```python
+class UnityAECEnv(UnityPettingzooBaseEnv, AECEnv)
+```
+
+Unity AEC (PettingZoo) environment wrapper.
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__(env: BaseEnv, seed: Optional[int] = None)
+```
+
+Initializes a Unity AEC environment wrapper.
+
+**Arguments**:
+
+- `env`: The UnityEnvironment that is being wrapped.
+- `seed`: The seed for the action spaces of the agents.
+
+
+#### step
+
+```python
+ | step(action: Any) -> None
+```
+
+Sets the action of the active agent and gets the observation, reward, done and info of the next agent.
+
+**Arguments**:
+
+- `action`: The action for the active agent
+
+
+#### observe
+
+```python
+ | observe(agent_id)
+```
+
+Returns the observation an agent currently can make. `last()` calls this function.
+
+
+#### last
+
+```python
+ | last(observe=True)
+```
+
+Returns the observation, cumulative reward, done and info for the current agent (specified by self.agent_selection).
+
+
+# mlagents\_envs.envs.unity\_parallel\_env
+
+
+## UnityParallelEnv Objects
+
+```python
+class UnityParallelEnv(UnityPettingzooBaseEnv, ParallelEnv)
+```
+
+Unity Parallel (PettingZoo) environment wrapper.
+
+
+#### \_\_init\_\_
+
+```python
+ | __init__(env: BaseEnv, seed: Optional[int] = None)
+```
+
+Initializes a Unity Parallel environment wrapper.
+
+**Arguments**:
+
+- `env`: The UnityEnvironment that is being wrapped.
+- `seed`: The seed for the action spaces of the agents.
+
+
+#### reset
+
+```python
+ | reset() -> Dict[str, Any]
+```
+
+Resets the environment.
+
+
+# mlagents\_envs.envs.unity\_pettingzoo\_base\_env
+
+
+## UnityPettingzooBaseEnv Objects
+
+```python
+class UnityPettingzooBaseEnv()
+```
+
+Unity Petting Zoo base environment.
+
+
+#### observation\_spaces
+
+```python
+ | @property
+ | observation_spaces() -> Dict[str, spaces.Space]
+```
+
+Return the observation spaces of all the agents.
+
+
+#### observation\_space
+
+```python
+ | observation_space(agent: str) -> Optional[spaces.Space]
+```
+
+The observation space of the current agent.
+
+
+#### action\_spaces
+
+```python
+ | @property
+ | action_spaces() -> Dict[str, spaces.Space]
+```
+
+Return the action spaces of all the agents.
+
+
+#### action\_space
+
+```python
+ | action_space(agent: str) -> Optional[spaces.Space]
+```
+
+The action space of the current agent.
+
+
+#### side\_channel
+
+```python
+ | @property
+ | side_channel() -> Dict[str, Any]
+```
+
+The side channels of the environment. You can access the side channels of an environment through the `env.side_channel` mapping.
+
+
+#### reset
+
+```python
+ | reset()
+```
+
+Resets the environment.
+
+
+#### seed
+
+```python
+ | seed(seed=None)
+```
+
+Reseeds the environment (making the resulting environment deterministic).
+`reset()` must be called after `seed()`, and before `step()`.
+
+
+#### render
+
+```python
+ | render(mode="human")
+```
+
+NOT SUPPORTED.
+
+Displays a rendered frame from the environment, if supported. Alternate render modes in the default environments are `'rgb_array'` which returns a numpy array and is supported by all environments outside of classic, and `'ansi'` which returns the strings printed (specific to classic environments).
+
+
+#### close
+
+```python
+ | close() -> None
+```
+
+Close the environment.
diff --git a/com.unity.ml-agents/Documentation~/Python-PettingZoo-API.md b/com.unity.ml-agents/Documentation~/Python-PettingZoo-API.md
new file mode 100644
index 0000000000..ec6c3da3ba
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Python-PettingZoo-API.md
@@ -0,0 +1,34 @@
+# Unity ML-Agents PettingZoo Wrapper
+
+With the increasing interest in multi-agent training with a gym-like API, we provide a PettingZoo Wrapper around the [Petting Zoo API](https://pettingzoo.farama.org/). Our wrapper provides interfaces on top of our `UnityEnvironment` class, which is the default way of interfacing with a Unity environment via Python.
+
+## Installation and Examples
+
+The PettingZoo wrapper is part of the `mlagents_envs` package. Please refer to the [mlagents_envs installation instructions](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents-envs/README.md).
+
+[[Colab] PettingZoo Wrapper Example](https://colab.research.google.com/github/Unity-Technologies/ml-agents/blob/develop-python-api-ga/ml-agents-envs/colabs/Colab_PettingZoo.ipynb)
+
+This Colab notebook demonstrates example usage of the wrapper, including installation, basic usage, and an example with our [Strikers vs Goalie environment](Learning-Environment-Examples.md#strikers-vs-goalie), which is a multi-agent environment with multiple different behavior names.
+
+## API interface
+
+This wrapper is compatible with the PettingZoo API. Please check out the [PettingZoo API page](https://pettingzoo.farama.org/api/aec/) for more details. Here's an example of interacting with a wrapped environment:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.envs import UnityToPettingZooWrapper
+
+unity_env = UnityEnvironment("StrikersVsGoalie")
+env = UnityToPettingZooWrapper(unity_env)
+env.reset()
+for agent in env.agent_iter():
+ observation, reward, done, info = env.last()
+    action = policy(observation, agent)  # policy() is a user-defined function mapping an observation to an action
+ env.step(action)
+```
+
+## Notes
+- There is support for both [AEC](https://pettingzoo.farama.org/api/aec/) and [Parallel](https://pettingzoo.farama.org/api/parallel/) PettingZoo APIs.
+- The AEC wrapper is compatible with the PettingZoo (PZ) API interface but works in a slightly different way under the hood. For the AEC API, instead of stepping the environment on every `env.step(action)`, the PZ wrapper will store the action, and will only perform environment stepping when all the agents requesting actions in the current step have been assigned an action. This is for performance, considering that the communication between Unity and Python is more efficient when data is sent in batches.
+- Since the actions for the AEC wrapper are stored without applying them to the environment until all the actions are queued, some components of the API might behave in unexpected ways. For example, a call to `env.reward` should return the instantaneous reward for that particular step, but the true reward is only available when an actual environment step is performed. It's recommended that you follow the API definition for training (access rewards from `env.last()` instead of `env.reward`); the underlying mechanism shouldn't affect training results.
+- The environment will automatically reset when it's done, so `env.agent_iter(max_step)` will keep going until the specified max step is reached (default: `2**63`). There is no need to call `env.reset()` except at the very beginning after instantiating an environment.
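+
+For the Parallel API, a rough sketch of the equivalent loop is shown below. It assumes the `UnityParallelEnv` wrapper documented in the [PettingZoo API documentation](Python-PettingZoo-API-Documentation.md), that the wrapper exposes the standard PettingZoo `agents` list, and random actions sampled from each agent's action space; depending on your PettingZoo version, `step` may also return a separate truncations dictionary.
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.envs.unity_parallel_env import UnityParallelEnv
+
+unity_env = UnityEnvironment("StrikersVsGoalie")
+env = UnityParallelEnv(unity_env)
+
+observations = env.reset()
+for _ in range(1000):
+    # One random action per currently active agent.
+    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
+    observations, rewards, dones, infos = env.step(actions)
+env.close()
+```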
diff --git a/com.unity.ml-agents/Documentation~/Readme.md b/com.unity.ml-agents/Documentation~/Readme.md
new file mode 100644
index 0000000000..eac55b4b61
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Readme.md
@@ -0,0 +1,95 @@
+# Unity ML-Agents Toolkit
+
+[Documentation](https://docs.unity3d.com/Packages/com.unity.ml-agents@latest)
+
+[License](https://github.com/Unity-Technologies/ml-agents/blob/release_22/LICENSE.md)
+
+([latest release](https://github.com/Unity-Technologies/ml-agents/releases/tag/latest_release)) ([all releases](https://github.com/Unity-Technologies/ml-agents/releases))
+
+**The Unity Machine Learning Agents Toolkit** (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents. We provide implementations (based on PyTorch) of state-of-the-art algorithms to enable game developers and hobbyists to easily train intelligent agents for 2D, 3D and VR/AR games. Researchers can also use the provided simple-to-use Python API to train Agents using reinforcement learning, imitation learning, neuroevolution, or any other methods. These trained agents can be used for multiple purposes, including controlling NPC behavior (in a variety of settings such as multi-agent and adversarial), automated testing of game builds and evaluating different game design decisions pre-release. The ML-Agents Toolkit is mutually beneficial for both game developers and AI researchers as it provides a central platform where advances in AI can be evaluated on Unity’s rich environments and then made accessible to the wider research and game developer communities.
+
+## Features
+- 17+ [example Unity environments](Learning-Environment-Examples.md)
+- Support for multiple environment configurations and training scenarios
+- Flexible Unity SDK that can be integrated into your game or custom Unity scene
+- Support for training single-agent, multi-agent cooperative, and multi-agent competitive scenarios via several Deep Reinforcement Learning algorithms (PPO, SAC, MA-POCA, self-play).
+- Support for learning from demonstrations through two Imitation Learning algorithms (BC and GAIL).
+- Quickly and easily add your own [custom training algorithm](Python-Custom-Trainer-Plugin.md) and/or components.
+- Easily definable Curriculum Learning scenarios for complex tasks
+- Train robust agents using environment randomization
+- Flexible agent control with On Demand Decision Making
+- Train using multiple concurrent Unity environment instances
+- Utilizes the [Inference Engine](Inference-Engine.md) to provide native cross-platform support
+- Unity environment [control from Python](Python-LLAPI.md)
+- Wrap Unity learning environments as a [gym](Python-Gym-API.md) environment
+- Wrap Unity learning environments as a [PettingZoo](Python-PettingZoo-API.md) environment
+
+## Releases & Documentation
+
+> **⚠️ Documentation Migration Notice**
+> We have moved to [Unity Package documentation](https://docs.unity3d.com/Packages/com.unity.ml-agents@latest) as the **primary developer documentation** and have **deprecated** the maintenance of [web docs](https://unity-technologies.github.io/ml-agents/). Please use the Unity Package documentation for the most up-to-date information.
+
+The table below shows our latest release, including our `develop` branch which is under active development and may be unstable. A few helpful guidelines:
+
+- The [Versioning page](Versioning.md) overviews how we manage our GitHub releases and the versioning process for each of the ML-Agents components.
+- The [Releases page](https://github.com/Unity-Technologies/ml-agents/releases) contains details of the changes between releases.
+- The [Migration page](Migrating.md) contains details on how to upgrade from earlier releases of the ML-Agents Toolkit.
+- The `com.unity.ml-agents` package is [verified](https://docs.unity3d.com/2020.1/Documentation/Manual/pack-safe.html) for Unity 2020.1 and later. Verified package releases are numbered 1.0.x.
+
+| **Version** | **Release Date** | **Source** | **Documentation** | **Download** | **Python Package** | **Unity Package** |
+|:-----------:|:---------------:|:----------:|:-----------------:|:------------:|:------------------:|:-----------------:|
+| **Release 22** | **October 5, 2024** | **[source](https://github.com/Unity-Technologies/ml-agents/tree/release_22)** | **[docs](https://unity-technologies.github.io/ml-agents/)** | **[download](https://github.com/Unity-Technologies/ml-agents/archive/release_22.zip)** | **[1.1.0](https://pypi.org/project/mlagents/1.1.0/)** | **[3.0.0](https://docs.unity3d.com/Packages/com.unity.ml-agents@3.0/manual/index.html)** |
+| **develop (unstable)** | -- | [source](https://github.com/Unity-Technologies/ml-agents/tree/develop) | [docs](https://github.com/Unity-Technologies/ml-agents/tree/develop/com.unity.ml-agents/Documentation~/index.md) | [download](https://github.com/Unity-Technologies/ml-agents/archive/develop.zip) | -- | -- |
+
+
+
+If you are a researcher interested in a discussion of Unity as an AI platform, see a pre-print of our [reference paper on Unity and the ML-Agents Toolkit](https://arxiv.org/abs/1809.02627).
+
+If you use Unity or the ML-Agents Toolkit to conduct research, we ask that you cite the following paper as a reference:
+
+```
+@article{juliani2020,
+ title={Unity: A general platform for intelligent agents},
+ author={Juliani, Arthur and Berges, Vincent-Pierre and Teng, Ervin and Cohen, Andrew and Harper, Jonathan and Elion, Chris and Goy, Chris and Gao, Yuan and Henry, Hunter and Mattar, Marwan and Lange, Danny},
+ journal={arXiv preprint arXiv:1809.02627},
+ url={https://arxiv.org/pdf/1809.02627.pdf},
+ year={2020}
+}
+```
+
+Additionally, if you use the MA-POCA trainer in your research, we ask that you cite the following paper as a reference:
+
+```
+@article{cohen2022,
+ title={On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning},
+ author={Cohen, Andrew and Teng, Ervin and Berges, Vincent-Pierre and Dong, Ruo-Ping and Henry, Hunter and Mattar, Marwan and Zook, Alexander and Ganguly, Sujoy},
+ journal={RL in Games Workshop AAAI 2022},
+ url={http://aaai-rlg.mlanctot.info/papers/AAAI22-RLG_paper_32.pdf},
+ year={2022}
+}
+```
+
+
+## Additional Resources
+
+* [Unity Discussions](https://discussions.unity.com/tag/ml-agents)
+* [ML-Agents tutorials by CodeMonkeyUnity](https://www.youtube.com/playlist?list=PLzDRvYVwl53vehwiN_odYJkPBzcqFw110)
+* [Introduction to ML-Agents by Huggingface](https://huggingface.co/learn/deep-rl-course/en/unit5/introduction)
+* [Community created ML-Agents projects](https://discussions.unity.com/t/post-your-ml-agents-project/816756)
+* [ML-Agents models on Huggingface](https://huggingface.co/models?library=ml-agents)
+* [Blog posts](Blog-posts.md)
+
+## Community and Feedback
+
+The ML-Agents Toolkit is an open-source project and we encourage and welcome contributions. If you wish to contribute, be sure to review our [contribution guidelines](CONTRIBUTING.md) and [code of conduct](https://github.com/Unity-Technologies/ml-agents/blob/release_22/CODE_OF_CONDUCT.md).
+
+For problems with the installation and setup of the ML-Agents Toolkit, or discussions about how to best set up or train your agents, please create a new thread on the [Unity ML-Agents forum](https://forum.unity.com/forums/ml-agents.453/) and make sure to include as much detail as possible. If you run into any other problems using the ML-Agents Toolkit or have a specific feature request, please [submit a GitHub issue](https://github.com/Unity-Technologies/ml-agents/issues).
+
+Please tell us which samples you would like to see shipped with the ML-Agents Unity package by replying to [this forum thread](https://forum.unity.com/threads/feedback-wanted-shipping-sample-s-with-the-ml-agents-package.1073468/).
+
+
+Your opinion matters a great deal to us. Only by hearing your thoughts on the Unity ML-Agents Toolkit can we continue to improve and grow. Please take a few minutes to [let us know about it](https://unitysoftware.co1.qualtrics.com/jfe/form/SV_55pQKCZ578t0kbc).
+
+## Privacy
+
+In order to improve the developer experience for Unity ML-Agents Toolkit, we have added in-editor analytics. Please refer to "Information that is passively collected by Unity" in the [Unity Privacy Policy](https://unity3d.com/legal/privacy-policy).
diff --git a/com.unity.ml-agents/Documentation~/Reference-Support.md b/com.unity.ml-agents/Documentation~/Reference-Support.md
new file mode 100644
index 0000000000..922e65c79d
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Reference-Support.md
@@ -0,0 +1,12 @@
+# Reference & Support
+
+The Reference & Support section contains essential documentation for ongoing ML-Agents development.
+
+
+| **Resource** | **Description** |
+|--------------------------------------|--------------------------------------------------------------|
+| [FAQ](FAQ.md) | Frequently asked questions and common issues with solutions. |
+| [Limitations](Limitations.md) | Known limitations and constraints of the ML-Agents Toolkit. |
+| [Migrating](Migrating.md) | Migration guides for updating between ML-Agents versions. |
+| [Versioning](Versioning.md) | Information about ML-Agents versioning and release notes. |
+| [ML-Agents Glossary](Glossary.md) | Glossary of terms and concepts used in ML-Agents. |
diff --git a/com.unity.ml-agents/Documentation~/Sample.md b/com.unity.ml-agents/Documentation~/Sample.md
new file mode 100644
index 0000000000..50185bf733
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Sample.md
@@ -0,0 +1,151 @@
+# Running an Example Environment
+
+This guide walks through the end-to-end process of opening one of our [example environments](Learning-Environment-Examples.md) in Unity, training an Agent in it, and embedding the trained model into the Unity environment. After reading this tutorial, you should be able to train any of the example environments. If you are not familiar with the [Unity Engine](https://unity3d.com/unity), view our [Background: Unity](Background-Unity.md) page for helpful pointers. Additionally, if you're not familiar with machine learning, view our [Background: Machine Learning](Background-Machine-Learning.md) page for a brief overview and helpful pointers.
+
+
+
+For this guide, we'll use the **3D Balance Ball** environment which contains a number of agent cubes and balls (which are all copies of each other). Each agent cube tries to keep its ball from falling by rotating either horizontally or vertically. In this environment, an agent cube is an **Agent** that receives a reward for every step that it balances the ball. An agent is also penalized with a negative reward for dropping the ball. The goal of the training process is to have the agents learn to balance the ball on their head.
+
+Let's get started!
+
+## Installation
+
+If you haven't already, follow the Local Installation for Development section in [installation instructions](Installation.md). Afterward, open the Unity Project that contains all the example environments. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Scenes` folder and open the `3DBall` scene file.
+
+## Understanding a Unity Environment
+
+An agent is an autonomous actor that observes and interacts with an _environment_. In the context of Unity, an environment is a scene containing one or more Agent objects, and, of course, the other entities that an agent interacts with.
+
+
+
+**Note:** In Unity, the base object of everything in a scene is the _GameObject_. The GameObject is essentially a container for everything else, including behaviors, graphics, physics, etc. To see the components that make up a GameObject, select the GameObject in the Scene window, and open the Inspector window. The Inspector shows every component on a GameObject.
+
+The first thing you may notice after opening the 3D Balance Ball scene is that it contains not one, but several agent cubes. Each agent cube in the scene is an independent agent, but they all share the same Behavior. 3D Balance Ball does this to speed up training since all twelve agents contribute to training in parallel.
+
+### Agent
+
+The Agent is the actor that observes and takes actions in the environment. In the 3D Balance Ball environment, the Agent components are placed on the twelve "Agent" GameObjects. The base Agent object has a few properties that affect its behavior:
+
+- **Behavior Parameters** — Every Agent must have a Behavior. The Behavior determines how an Agent makes decisions.
+- **Max Step** — Defines how many simulation steps can occur before the Agent's episode ends. In 3D Balance Ball, an Agent restarts after 5000 steps.
+
+#### Behavior Parameters : Vector Observation Space
+
+Before making a decision, an agent collects its observations about its state in the world. The vector observation is a vector of floating point numbers which contains relevant information for the agent to make decisions.
+
+The Behavior Parameters of the 3D Balance Ball example uses a `Space Size` of 8. This means that the feature vector containing the Agent's observations contains eight elements: the `x` and `z` components of the agent cube's rotation and the `x`, `y`, and `z` components of the ball's relative position and velocity.
+
+#### Behavior Parameters : Actions
+
+An Agent is given instructions in the form of actions. The ML-Agents Toolkit classifies actions into two types: continuous and discrete. The 3D Balance Ball example is programmed to use continuous actions, which are a vector of floating-point numbers that can vary continuously. More specifically, the agent uses a `Space Size` of 2 to control the amount of `x` and `z` rotation to apply to itself to keep the ball balanced on its head.
+
+## Running a pre-trained model
+
+We include pre-trained models for our agents (`.onnx` files) and we use the [Inference Engine](Inference-Engine.md) to run these models inside Unity. In this section, we will use the pre-trained model for the 3D Ball example.
+
+1. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Prefabs` folder. Expand `3DBall` and click on the `Agent` prefab. You should see the `Agent` prefab in the **Inspector** window.
+
+ **Note**: The platforms in the `3DBall` scene were created using the `3DBall` prefab. Instead of updating all 12 platforms individually, you can update the `3DBall` prefab instead.
+
+
+
+2. In the **Project** window, drag the **3DBall** Model located in `Assets/ML-Agents/Examples/3DBall/TFModels` into the `Model` property under `Behavior Parameters (Script)` component in the Agent GameObject **Inspector** window.
+
+
+
+3. You should notice that each `Agent` under each `3DBall` in the **Hierarchy** window now contains **3DBall** as `Model` on the `Behavior Parameters`. **Note**: You can modify multiple game objects in a scene by selecting them all at once using the search bar in the Scene Hierarchy.
+4. Set the **Inference Device** to use for this model as `CPU`.
+5. Click the **Play** button in the Unity Editor and you will see the platforms balance the balls using the pre-trained model.
+
+## Training a new model with Reinforcement Learning
+
+While we provide pre-trained models for the agents in this environment, any environment you make yourself will require training agents from scratch to generate a new model file. In this section we will demonstrate how to use the reinforcement learning algorithms that are part of the ML-Agents Python package to accomplish this. We have provided a convenient command `mlagents-learn` which accepts arguments used to configure both training and inference phases.
+
+### Training the environment
+
+1. Open a command or terminal window.
+2. Navigate to the folder where you cloned the `ml-agents` repository. **Note**: If you followed the default [installation](Installation.md), then you should be able to run `mlagents-learn` from any directory.
+3. Run `mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun`.
+ - `config/ppo/3DBall.yaml` is the path to a default training configuration file that we provide. The `config/ppo` folder includes training configuration files for all our example environments, including 3DBall.
+ - `run-id` is a unique name for this training session.
+4. When the message _"Start training by pressing the Play button in the Unity Editor"_ is displayed on the screen, you can press the **Play** button in Unity to start training in the Editor.
+
+If `mlagents-learn` runs correctly and starts training, you should see something like this:
+
+```console
+INFO:mlagents_envs:
+'Ball3DAcademy' started successfully!
+Unity Academy name: Ball3DAcademy
+
+INFO:mlagents_envs:Connected new brain:
+Unity brain name: 3DBallLearning
+ Number of Visual Observations (per agent): 0
+ Vector Observation space size (per agent): 8
+ Number of stacked Vector Observation: 1
+INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain 3DBallLearning:
+ batch_size: 64
+ beta: 0.001
+ buffer_size: 12000
+ epsilon: 0.2
+ gamma: 0.995
+ hidden_units: 128
+ lambd: 0.99
+ learning_rate: 0.0003
+ max_steps: 5.0e4
+ normalize: True
+ num_epoch: 3
+ num_layers: 2
+ time_horizon: 1000
+ sequence_length: 64
+ summary_freq: 1000
+ use_recurrent: False
+ memory_size: 256
+ use_curiosity: False
+ curiosity_strength: 0.01
+ curiosity_enc_size: 128
+ output_path: ./results/first3DBallRun/3DBallLearning
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 1000. Mean Reward: 1.242. Std of Reward: 0.746. Training.
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 2000. Mean Reward: 1.319. Std of Reward: 0.693. Training.
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 3000. Mean Reward: 1.804. Std of Reward: 1.056. Training.
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 4000. Mean Reward: 2.151. Std of Reward: 1.432. Training.
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 5000. Mean Reward: 3.175. Std of Reward: 2.250. Training.
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 6000. Mean Reward: 4.898. Std of Reward: 4.019. Training.
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 7000. Mean Reward: 6.716. Std of Reward: 5.125. Training.
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 8000. Mean Reward: 12.124. Std of Reward: 11.929. Training.
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 9000. Mean Reward: 18.151. Std of Reward: 16.871. Training.
+INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 10000. Mean Reward: 27.284. Std of Reward: 28.667. Training.
+```
+
+Note how the `Mean Reward` value printed to the screen increases as training progresses. This is a positive sign that training is succeeding.
+
+**Note**: You can train using an executable rather than the Editor. To do so, follow the instructions in [Using an Executable](Learning-Environment-Executable.md).
+
+### Observing Training Progress
+
+Once you start training using `mlagents-learn` in the way described in the previous section, the `ml-agents` directory will contain a `results` directory. In order to observe the training process in more detail, you can use TensorBoard. From the command line run:
+
+```sh
+tensorboard --logdir results
+```
+
+Then navigate to `localhost:6006` in your browser to view the TensorBoard summary statistics as shown below. For the purposes of this section, the most important statistic is `Environment/Cumulative Reward` which should increase throughout training, eventually converging close to `100` which is the maximum reward the agent can accumulate.
+
+
+
+## Embedding the model into the Unity Environment
+
+Once the training process completes and saves the model (denoted by the `Saved Model` message), you can add it to the Unity project and use it with compatible Agents (the Agents that generated the model). **Note:** Do not just close the Unity window once the `Saved Model` message appears. Either wait for the training process to close the window or press `Ctrl+C` at the command-line prompt. If you close the window manually, the `.onnx` file containing the trained model is not exported into the ml-agents folder.
+
+If you've quit the training early using `Ctrl+C` and want to resume training, run the same command again, appending the `--resume` flag:
+
+```sh
+mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun --resume
+```
+
+Your trained model will be at `results/<run-identifier>/<behavior_name>.onnx`, where `<behavior_name>` is the name of the `Behavior Name` of the agents corresponding to the model. This file corresponds to your model's latest checkpoint. You can now embed this trained model into your Agents by following the steps below, which are similar to the steps described [above](#running-a-pre-trained-model).
+
+1. Move your model file into `Project/Assets/ML-Agents/Examples/3DBall/TFModels/`.
+2. Open the Unity Editor, and select the **3DBall** scene as described above.
+3. Select the **3DBall** prefab Agent object.
+4. Drag the `.onnx` file from the Project window of the Editor to the **Model** placeholder in the **Ball3DAgent** inspector window.
+5. Press the **Play** button at the top of the Editor.
diff --git a/com.unity.ml-agents/Documentation~/TableOfContents.md b/com.unity.ml-agents/Documentation~/TableOfContents.md
new file mode 100644
index 0000000000..7bfeac7d97
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/TableOfContents.md
@@ -0,0 +1,57 @@
+* [ML-Agents Package](index.md)
+* [ML-Agents Theory](ML-Agents-Overview.md)
+* [Get started](Get-Started.md)
+ * [Installation](Installation.md)
+ * [Sample: Running an Example Environment](Sample.md)
+ * [More Example Environments](Learning-Environment-Examples.md)
+* [Learning Environments and Agents](Learning-Environments-Agents.md)
+ * [Designing Learning Environments](Learning-Environment-Design.md)
+ * [Designing Agents](Learning-Environment-Design-Agents.md)
+ * [Sample: Making a New Learning Environment](Learning-Environment-Create-New.md)
+ * [Using an Executable Environment](Learning-Environment-Executable.md)
+* [Training](Training.md)
+ * [Training ML-Agents Basics](Training-ML-Agents.md)
+ * [Training Configuration File](Training-Configuration-File.md)
+ * [Using Tensorboard](Using-Tensorboard.md)
+ * [Customizing Training via Plugins](Training-Plugins.md)
+ * [Custom Trainer Plugin](Tutorial-Custom-Trainer-Plugin.md)
+ * [Profiling Trainers](Profiling-Python.md)
+* [Python APIs](Python-APIs.md)
+ * [Python Gym API](Python-Gym-API.md)
+ * [Python Gym API Documentation](Python-Gym-API-Documentation.md)
+ * [Python PettingZoo API](Python-PettingZoo-API.md)
+ * [Python PettingZoo API Documentation](Python-PettingZoo-API-Documentation.md)
+ * [Python Low-Level API](Python-LLAPI.md)
+ * [Python Low-Level API Documentation](Python-LLAPI-Documentation.md)
+ * [On/Off Policy Trainer Documentation](Python-On-Off-Policy-Trainer-Documentation.md)
+ * [Python Optimizer Documentation](Python-Optimizer-Documentation.md)
+* [Python Tutorial with Google Colab](Tutorial-Colab.md)
+ * [Using a UnityEnvironment](https://colab.research.google.com/github/Unity-Technologies/ml-agents/blob/release_22_docs/colab/Colab_UnityEnvironment_1_Run.ipynb)
+ * [Q-Learning with a UnityEnvironment](https://colab.research.google.com/github/Unity-Technologies/ml-agents/blob/release_22_docs/colab/Colab_UnityEnvironment_2_Train.ipynb)
+ * [Using Side Channels on a UnityEnvironment](https://colab.research.google.com/github/Unity-Technologies/ml-agents/blob/release_22_docs/colab/Colab_UnityEnvironment_3_SideChannel.ipynb)
+* [Advanced Features](Advanced-Features.md)
+ * [Custom Side Channels](Custom-SideChannels.md)
+ * [Custom Grid Sensors](Custom-GridSensors.md)
+ * [Input System Integration](InputSystem-Integration.md)
+ * [Inference Engine](Inference-Engine.md)
+ * [Hugging Face Integration](Hugging-Face-Integration.md)
+ * [Game Integrations](Integrations.md)
+ * [Match-3](Integrations-Match3.md)
+ * [ML-Agents Package Settings](Package-Settings.md)
+ * [Unity Environment Registry](Unity-Environment-Registry.md)
+* [Cloud & Deployment (deprecated)](Cloud-Deployment.md)
+ * [Using Docker](Using-Docker.md)
+ * [Amazon Web Services](Training-on-Amazon-Web-Service.md)
+ * [Microsoft Azure](Training-on-Microsoft-Azure.md)
+* [Reference & Support](Reference-Support.md)
+ * [FAQ](FAQ.md)
+ * [Limitations](Limitations.md)
+ * [Migrating](Migrating.md)
+  * [Versioning](Versioning.md)
+ * [ML-Agents Glossary](Glossary.md)
+* [Background](Background.md)
+ * [Machine Learning](Background-Machine-Learning.md)
+ * [Unity](Background-Unity.md)
+ * [PyTorch](Background-PyTorch.md)
+ * [Using Virtual Environment](Using-Virtual-Environment.md)
+ * [ELO](ELO-Rating-System.md)
diff --git a/com.unity.ml-agents/Documentation~/Training-Configuration-File.md b/com.unity.ml-agents/Documentation~/Training-Configuration-File.md
new file mode 100644
index 0000000000..cc0de05f8d
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Training-Configuration-File.md
@@ -0,0 +1,171 @@
+# Training Configuration File
+
+## Common Trainer Configurations
+
+One of the first decisions you need to make regarding your training run is which trainer to use: PPO, SAC, or POCA. There are some training configurations that are common to all trainers (which we review now) and others that depend on the choice of trainer (which we review in subsequent sections).
+
+| **Setting** | **Description** |
+| :----------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `trainer_type` | (default = `ppo`) The type of trainer to use: `ppo`, `sac`, or `poca`. |
+| `summary_freq` | (default = `50000`) Number of experiences that need to be collected before generating and displaying training statistics. This determines the granularity of the graphs in Tensorboard. |
+| `time_horizon` | (default = `64`) How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and a more biased, but less varied estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number may be preferable. This number should be large enough to capture all the important behavior within a sequence of an agent's actions. <br><br> Typical range: `32` - `2048` |
+| `max_steps` | (default = `500000`) Total number of steps (i.e., observation collected and action taken) that must be taken in the environment (or across all environments if using multiple in parallel) before ending the training process. If you have multiple agents with the same behavior name within your environment, all steps taken by those agents will contribute to the same `max_steps` count. <br><br> Typical range: `5e5` - `1e7` |
+| `keep_checkpoints` | (default = `5`) The maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the checkpoint_interval option. Once the maximum number of checkpoints has been reached, the oldest checkpoint is deleted when saving a new checkpoint. |
+| `even_checkpoints` | (default = `false`) If set to true, ignores `checkpoint_interval` and evenly distributes checkpoints throughout training based on `keep_checkpoints` and `max_steps`, i.e. `checkpoint_interval = max_steps / keep_checkpoints`. Useful for cataloging agent behavior throughout training. |
+| `checkpoint_interval` | (default = `500000`) The number of experiences collected between each checkpoint by the trainer. A maximum of `keep_checkpoints` checkpoints are saved before old ones are deleted. Each checkpoint saves the `.onnx` files in the `results/` folder. |
+| `init_path` | (default = None) Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br> You can provide either the file name or the full path to the checkpoint, e.g. `{checkpoint_name.pt}` or `./models/{run-id}/{behavior_name}/{checkpoint_name.pt}`. This option is provided in case you want to initialize different behaviors from different runs or initialize from an older checkpoint; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
+| `threaded` | (default = `false`) Allow environments to step while updating the model. This might result in a training speedup, especially when using SAC. For best performance, leave this set to `false` when using self-play. |
+| `hyperparameters -> learning_rate` | (default = `3e-4`) Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase. <br><br> Typical range: `1e-5` - `1e-3` |
+| `hyperparameters -> batch_size` | Number of experiences in each iteration of gradient descent. **This should always be multiple times smaller than `buffer_size`**. If you are using continuous actions, this value should be large (on the order of 1000s). If you are using only discrete actions, this value should be smaller (on the order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
+| `hyperparameters -> buffer_size` | (default = `10240` for PPO and `50000` for SAC) <br> **PPO:** Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. **This should be multiple times larger than `batch_size`**. Typically a larger `buffer_size` corresponds to more stable training updates. <br> **SAC:** The max size of the experience buffer - on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences. <br><br> Typical range: PPO: `2048` - `409600`; SAC: `50000` - `1000000` |
+| `hyperparameters -> learning_rate_schedule` | (default = `linear` for PPO and `constant` for SAC) Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until max_steps so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally. <br><br> `linear` decays the learning_rate linearly, reaching 0 at max_steps, while `constant` keeps the learning rate constant for the entire training run. |
+| `network_settings -> hidden_units` | (default = `128`) Number of units in the hidden layers of the neural network. Corresponds to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger. <br><br> Typical range: `32` - `512` |
+| `network_settings -> num_layers` | (default = `2`) The number of hidden layers in the neural network. Corresponds to how many hidden layers are present after the observation input, or after the CNN encoding of the visual observation. For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems. <br><br> Typical range: `1` - `3` |
+| `network_settings -> normalize` | (default = `false`) Whether normalization is applied to the vector observation inputs. This normalization is based on the running average and variance of the vector observation. Normalization can be helpful in cases with complex continuous control problems, but may be harmful with simpler discrete control problems. |
+| `network_settings -> vis_encode_type` | (default = `simple`) Encoder type for encoding visual observations. <br><br> `simple` (default) uses a simple encoder which consists of two convolutional layers, `nature_cnn` uses the CNN implementation proposed by [Mnih et al.](https://www.nature.com/articles/nature14236), consisting of three convolutional layers, and `resnet` uses the [IMPALA Resnet](https://arxiv.org/abs/1802.01561) consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. `match3` is a smaller CNN ([Gudmundsoon et al.](https://www.researchgate.net/publication/328307928_Human-Like_Playtesting_with_Deep_Learning)) that can capture more granular spatial relationships and is optimized for board games. `fully_connected` uses a single fully connected dense layer as encoder without any convolutional layers. <br><br> Due to the size of the convolution kernel, there is a minimum observation size limitation that each encoder type can handle - `simple`: 20x20, `nature_cnn`: 36x36, `resnet`: 15x15, `match3`: 5x5. `fully_connected` doesn't have convolutional layers and thus no size limits, but since it has less representation power it should be reserved for very small inputs. Note that using the `match3` CNN with very large visual input might result in a huge observation encoding and thus potentially slow down training or cause memory issues. |
+| `network_settings -> goal_conditioning_type` | (default = `hyper`) Conditioning type for the policy using goal observations. <br><br> `none` treats the goal observations as regular observations, `hyper` (default) uses a HyperNetwork with goal observations as input to generate some of the weights of the policy. Note that when using `hyper` the number of parameters of the network increases greatly. Therefore, it is recommended to reduce the number of `hidden_units` when using this `goal_conditioning_type`. |
+
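+As a quick orientation, the arrow notation above (e.g. `hyperparameters -> learning_rate`) corresponds to nesting in the YAML file. Below is a minimal, illustrative sketch of these common settings for a hypothetical behavior named `MyBehavior`; the values shown are simply the defaults discussed above, not a recommendation for any particular environment.
+
+```yaml
+behaviors:
+  MyBehavior:
+    trainer_type: ppo
+    summary_freq: 50000
+    time_horizon: 64
+    max_steps: 5.0e5
+    keep_checkpoints: 5
+    checkpoint_interval: 500000
+    threaded: false
+    hyperparameters:
+      learning_rate: 3.0e-4
+      learning_rate_schedule: linear
+      batch_size: 1024
+      buffer_size: 10240
+    network_settings:
+      normalize: false
+      hidden_units: 128
+      num_layers: 2
+      vis_encode_type: simple
+```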
+
+## Trainer-specific Configurations
+
+Depending on your choice of trainer, there are additional trainer-specific configurations. We present them below in two separate tables, but keep in mind that you only need to include the configurations for the trainer selected (i.e. the `trainer_type` setting above).
+
+### PPO-specific Configurations
+
+| **Setting** | **Description** |
+| :---------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `hyperparameters -> beta` | (default = `5.0e-3`) Strength of the entropy regularization, which makes the policy "more random." This ensures that agents properly explore the action space during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`. <br><br> Typical range: `1e-4` - `1e-2` |
+| `hyperparameters -> epsilon` | (default = `0.2`) Influences how rapidly the policy can evolve during training. Corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process. <br><br> Typical range: `0.1` - `0.3` |
+| `hyperparameters -> beta_schedule` | (default = `learning_rate_schedule`) Determines how beta changes over time. <br><br> `linear` decays beta linearly, reaching 0 at max_steps, while `constant` keeps beta constant for the entire training run. If not explicitly set, the default beta schedule will be set to `hyperparameters -> learning_rate_schedule`. |
+| `hyperparameters -> epsilon_schedule` | (default = `learning_rate_schedule`) Determines how epsilon changes over time (PPO only). <br><br> `linear` decays epsilon linearly, reaching 0 at max_steps, while `constant` keeps the epsilon constant for the entire training run. If not explicitly set, the default epsilon schedule will be set to `hyperparameters -> learning_rate_schedule`. |
+| `hyperparameters -> lambd` | (default = `0.95`) Regularization parameter (lambda) used when calculating the Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to relying more on the current value estimate (which can be high bias), and high values correspond to relying more on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process. <br><br> Typical range: `0.9` - `0.95` |
+| `hyperparameters -> num_epoch` | (default = `3`) Number of passes to make through the experience buffer when performing gradient descent optimization. The larger the batch_size, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning. <br><br> Typical range: `3` - `10` |
+| `hyperparameters -> shared_critic` | (default = `False`) Whether or not the policy and value function networks share a backbone. It may be useful to use a shared backbone when learning from image observations. |
+
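+As an illustrative sketch (using the default values listed above and a placeholder behavior name), the PPO-specific settings sit alongside the common entries under the `hyperparameters` key:
+
+```yaml
+behaviors:
+  MyBehavior:
+    trainer_type: ppo
+    hyperparameters:
+      # common hyperparameters (learning_rate, batch_size, buffer_size, ...) omitted
+      beta: 5.0e-3
+      epsilon: 0.2
+      lambd: 0.95
+      num_epoch: 3
+      shared_critic: false
+```
+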
+### SAC-specific Configurations
+
+| **Setting** | **Description** |
+| :------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `hyperparameters -> buffer_init_steps` | (default = `0`) Number of experiences to collect into the buffer before updating the policy model. As the untrained policy is fairly random, pre-filling the buffer with random actions is useful for exploration. Typically, at least several episodes of experiences should be pre-filled. <br><br> Typical range: `1000` - `10000` |
+| `hyperparameters -> init_entcoef` | (default = `1.0`) How much the agent should explore in the beginning of training. Corresponds to the initial entropy coefficient set at the beginning of training. In SAC, the agent is incentivized to make its actions entropic to facilitate better exploration. The entropy coefficient weighs the true reward with a bonus entropy reward. The entropy coefficient is [automatically adjusted](https://arxiv.org/abs/1812.05905) to a preset target entropy, so the `init_entcoef` only corresponds to the starting value of the entropy bonus. Increase `init_entcoef` to explore more in the beginning, decrease to converge to a solution faster. <br><br> Typical range: (Continuous): `0.5` - `1.0`; (Discrete): `0.05` - `0.5` |
+| `hyperparameters -> save_replay_buffer` | (default = `false`) Whether to save and load the experience replay buffer as well as the model when quitting and re-starting training. This may help resumes go more smoothly, as the experiences collected won't be wiped. Note that replay buffers can be very large, and will take up a considerable amount of disk space. For that reason, we disable this feature by default. |
+| `hyperparameters -> tau` | (default = `0.005`) How aggressively to update the target network used for bootstrapping value estimation in SAC. Corresponds to the magnitude of the target Q update during the SAC model update. In SAC, there are two neural networks: the target and the policy. The target network is used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed while the policy is being updated. This target is then slowly updated according to tau. Typically, this value should be left at 0.005. For simple problems, increasing tau to 0.01 might reduce the time it takes to learn, at the cost of stability. <br><br> Typical range: `0.005` - `0.01` |
+| `hyperparameters -> steps_per_update` | (default = `1`) Average ratio of agent steps (actions) taken to updates made of the agent's policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps. Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will improve sample efficiency (reduce the number of steps required to train) but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example environments) `steps_per_update` equal to the number of agents in the scene is a good balance. For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed. We can also change `steps_per_update` to lower than 1 to update more often than once per step, though this will usually result in a slowdown unless the environment is very slow. <br><br> Typical range: `1` - `20` |
+| `hyperparameters -> reward_signal_num_update` | (default = `steps_per_update`) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)), we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to `steps_per_update`. |
+
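+Similarly, an illustrative sketch of the SAC-specific entries (default values, placeholder behavior name):
+
+```yaml
+behaviors:
+  MyBehavior:
+    trainer_type: sac
+    hyperparameters:
+      # common hyperparameters omitted
+      buffer_init_steps: 0
+      init_entcoef: 1.0
+      save_replay_buffer: false
+      tau: 0.005
+      steps_per_update: 1
+```
+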
+### MA-POCA-specific Configurations
+MA-POCA uses the same configurations as PPO, and there are no additional POCA-specific parameters.
+
+**NOTE**: Reward signals other than Extrinsic Rewards have not been extensively tested with MA-POCA, though they can still be added and used for training on a your-mileage-may-vary basis.
+
+## Reward Signals
+
+The `reward_signals` section enables the specification of settings for both extrinsic (i.e. environment-based) and intrinsic reward signals (e.g. curiosity and GAIL). Each reward signal should define at least two parameters, `strength` and `gamma`, in addition to any class-specific hyperparameters. Note that to remove a reward signal, you should delete its entry entirely from `reward_signals`. At least one reward signal should be left defined at all times. Provide the following configurations to design the reward signal for your training run.
+
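+Structurally, every signal lives under the `reward_signals` key of a behavior's configuration. A minimal sketch with only the extrinsic signal enabled (default values, placeholder behavior name) looks like this; intrinsic signals such as `curiosity`, `gail` and `rnd` are added as sibling entries, as shown in the sections below.
+
+```yaml
+behaviors:
+  MyBehavior:
+    # trainer and hyperparameter settings omitted
+    reward_signals:
+      extrinsic:
+        strength: 1.0
+        gamma: 0.99
+```
+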
+### Extrinsic Rewards
+
+Enable these settings to ensure that your training run incorporates your environment-based reward signal:
+
+| **Setting** | **Description** |
+| :---------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `extrinsic -> strength` | (default = `1.0`) Factor by which to multiply the reward given by the environment. Typical ranges will vary depending on the reward signal. <br><br> Typical range: `1.00` |
+| `extrinsic -> gamma` | (default = `0.99`) Discount factor for future rewards coming from the environment. This can be thought of as how far into the future the agent should care about possible rewards. In situations when the agent should be acting in the present in order to prepare for rewards in the distant future, this value should be large. In cases when rewards are more immediate, it can be smaller. Must be strictly smaller than 1. <br><br> Typical range: `0.8` - `0.995` |
+
+### Curiosity Intrinsic Reward
+
+To enable curiosity, provide these settings:
+
+| **Setting** | **Description** |
+| :--------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `curiosity -> strength` | (default = `1.0`) Magnitude of the curiosity reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br> Typical range: `0.001` - `0.1` |
+| `curiosity -> gamma` | (default = `0.99`) Discount factor for future rewards. <br><br> Typical range: `0.8` - `0.995` |
+| `curiosity -> network_settings` | Please see the documentation for `network_settings` under [Common Trainer Configurations](#common-trainer-configurations). The network specs used by the intrinsic curiosity model. The value of `hidden_units` should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br> Typical range: `64` - `256` |
+| `curiosity -> learning_rate` | (default = `3e-4`) Learning rate used to update the intrinsic curiosity module. This should typically be decreased if training is unstable, and the curiosity loss is unstable. <br><br> Typical range: `1e-5` - `1e-3` |
+
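+For example, an illustrative `reward_signals` section with curiosity enabled (the values are drawn from the defaults and typical ranges above, not tuned for any particular environment):
+
+```yaml
+reward_signals:
+  # this block sits inside a behavior's configuration
+  extrinsic:
+    strength: 1.0
+    gamma: 0.99
+  curiosity:
+    strength: 0.02
+    gamma: 0.99
+    learning_rate: 3.0e-4
+```
+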
+### GAIL Intrinsic Reward
+
+To enable GAIL (assuming you have recorded demonstrations), provide these settings:
+
+| **Setting** | **Description** |
+| :---------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `gail -> strength` | (default = `1.0`) Factor by which to multiply the raw reward. Note that when using GAIL with an Extrinsic Signal, this value should be set lower if your demonstrations are suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. <br><br> Typical range: `0.01` - `1.0` |
+| `gail -> gamma` | (default = `0.99`) Discount factor for future rewards. <br><br> Typical range: `0.8` - `0.9` |
+| `gail -> demo_path` | (Required, no default) The path to your .demo file or directory of .demo files. |
+| `gail -> network_settings` | Please see the documentation for `network_settings` under [Common Trainer Configurations](#common-trainer-configurations). The network specs for the GAIL discriminator. The value of `hidden_units` should be small enough to encourage the discriminator to compress the original observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect training times. <br><br> Typical range: `64` - `256` |
+| `gail -> learning_rate` | (Optional, default = `3e-4`) Learning rate used to update the discriminator. This should typically be decreased if training is unstable, and the GAIL loss is unstable. <br><br> Typical range: `1e-5` - `1e-3` |
+| `gail -> use_actions` | (default = `false`) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower. |
+| `gail -> use_vail` | (default = `false`) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand. |
+
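+An illustrative GAIL entry might look like the following; the demonstration path is only an example and should point at your own recorded `.demo` file(s):
+
+```yaml
+reward_signals:
+  # this block sits inside a behavior's configuration
+  gail:
+    strength: 0.01
+    gamma: 0.99
+    demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
+    use_actions: false
+    use_vail: false
+    learning_rate: 3.0e-4
+```
+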
+### RND Intrinsic Reward
+
+Random Network Distillation (RND) is only available for the PyTorch trainers. To enable RND, provide these settings:
+
+| **Setting** | **Description** |
+| :--------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `rnd -> strength` | (default = `1.0`) Magnitude of the curiosity reward generated by the intrinsic rnd module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br> Typical range: `0.001` - `0.01` |
+| `rnd -> gamma` | (default = `0.99`) Discount factor for future rewards. <br><br> Typical range: `0.8` - `0.995` |
+| `rnd -> network_settings` | Please see the documentation for `network_settings` under [Common Trainer Configurations](#common-trainer-configurations). The network specs for the RND model. |
+| `rnd -> learning_rate` | (default = `3e-4`) Learning rate used to update the RND module. This should be large enough for the RND module to quickly learn the state representation, but small enough to allow for stable learning. <br><br> Typical range: `1e-5` - `1e-3` |
+
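+An illustrative RND entry (values chosen from the defaults and typical ranges above):
+
+```yaml
+reward_signals:
+  # this block sits inside a behavior's configuration
+  rnd:
+    strength: 0.01
+    gamma: 0.99
+    learning_rate: 1.0e-4
+```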
+
+## Behavioral Cloning
+
+To enable Behavioral Cloning as a pre-training option (assuming you have recorded demonstrations), provide the following configurations under the `behavioral_cloning` section:
+
+| **Setting** | **Description** |
+| :------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `demo_path` | (Required, no default) The path to your .demo file or directory of .demo files. |
+| `strength` | (default = `1.0`) Learning rate of the imitation relative to the learning rate of PPO, and roughly corresponds to how strongly we allow BC to influence the policy. <br><br> Typical range: `0.1` - `0.5` |
+| `steps` | (default = `0`) During BC, it is often desirable to stop using demonstrations after the agent has "seen" rewards, and allow it to optimize past the available demonstrations and/or generalize outside of the provided demonstrations. `steps` corresponds to the training steps over which BC is active. The learning rate of BC will anneal over the steps. Set the steps to 0 for constant imitation over the entire training run. |
+| `batch_size` | (default = `batch_size` of trainer) Number of demonstration experiences used for one iteration of a gradient descent update. If not specified, it will default to the `batch_size` of the trainer. <br><br> Typical range: (Continuous): `512` - `5120`; (Discrete): `32` - `512` |
+| `num_epoch` | (default = `num_epoch` of trainer) Number of passes through the experience buffer during gradient descent. If not specified, it will default to the number of epochs set for PPO. <br><br> Typical range: `3` - `10` |
+| `samples_per_update` | (default = `0`) Maximum number of samples to use during each imitation update. You may want to lower this if your demonstration dataset is very large to avoid overfitting the policy on demonstrations. Set to 0 to train over all of the demonstrations at each update step. <br><br> Typical range: `buffer_size` |
+
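+An illustrative `behavioral_cloning` section (the values and demonstration path are only examples; point `demo_path` at your own `.demo` file(s)):
+
+```yaml
+behaviors:
+  MyBehavior:
+    # trainer settings omitted
+    behavioral_cloning:
+      demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
+      strength: 0.5
+      steps: 150000
+```
+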
+## Memory-enhanced Agents using Recurrent Neural Networks
+
+You can enable your agents to use memory by adding a `memory` section under `network_settings`, and setting `memory_size` and `sequence_length`:
+
+| **Setting** | **Description** |
+| :---------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `network_settings -> memory -> memory_size` | (default = `128`) Size of the memory an agent must keep. In order to use an LSTM, training requires a sequence of experiences instead of single experiences. Corresponds to the size of the array of floating point numbers used to store the hidden state of the recurrent neural network of the policy. This value must be a multiple of 2, and should scale with the amount of information you expect the agent will need to remember in order to successfully complete the task. <br><br> Typical range: `32` - `256` |
+| `network_settings -> memory -> sequence_length` | (default = `64`) Defines how long the sequences of experiences must be while training. Note that if this number is too small, the agent will not be able to remember things over longer periods of time. If this number is too large, the neural network will take longer to train. <br><br> Typical range: `4` - `128` |
+
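+For example, a minimal sketch of a memory-enabled network configuration (default values shown):
+
+```yaml
+network_settings:
+  # other network settings omitted; this block sits inside a behavior's configuration
+  memory:
+    memory_size: 128
+    sequence_length: 64
+```
+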
+A few considerations when deciding to use memory:
+
+- LSTM does not work well with continuous actions. Please use discrete actions for better results.
+- Adding a recurrent layer increases the complexity of the neural network; it is recommended to decrease `num_layers` when using recurrent layers.
+- It is required that `memory_size` be divisible by 2.
+
+## Self-Play
+
+Training with self-play adds additional confounding factors to the usual issues faced by reinforcement learning. In general, the tradeoff is between the skill level and generality of the final policy and the stability of learning. Training against a set of slowly changing or unchanging adversaries with low diversity results in a more stable learning process than training against a set of quickly changing adversaries with high diversity. With this context, this guide discusses the exposed self-play hyperparameters and intuitions for tuning them.
+
+If your environment contains multiple agents that are divided into teams, you can leverage our self-play training option by providing these configurations for each Behavior:
+
+| **Setting** | **Description** |
+| :-------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `save_steps` | (default = `20000`) Number of _trainer steps_ between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. <br><br> A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to a more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent. <br><br> Typical range: `10000` - `100000` |
+| `team_change` | (default = `5 * save_steps`) Number of _trainer steps_ between switching the learning team. This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents per team switch. <br><br> A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies and so the agent may fail against the next batch of opponents. <br><br> The value of `team_change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we recommend setting this value as a function of the `save_steps` parameter discussed previously. <br><br> Typical range: 4x-10x where x=`save_steps` |
+| `swap_steps` | (default = `10000`) Number of _ghost steps_ (not trainer steps) between swapping the opponent's policy with a different snapshot. A 'ghost step' refers to a step taken by an agent _that is following a fixed policy and not learning_. The reason for this distinction is that in asymmetric games, we may have teams with an unequal number of agents, e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps`, if a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` agents during `team_change` total steps, is: `(num_agents / num_opponent_agents) * (team_change / x)` <br><br> Typical range: `10000` - `100000` |
+| `play_against_latest_model_ratio` | (default = `0.5`) Probability an agent will play against the latest opponent policy. With probability 1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its opponent from a past iteration. <br><br> A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curriculum](https://openai.com/research/emergent-tool-use) of increasingly challenging situations which may lead to a stronger final policy. <br><br> Typical range: `0.0` - `1.0` |
+| `window` | (default = `10`) Size of the sliding window of past snapshots from which the agent's opponents are sampled. For example, a `window` size of 5 will save the last 5 snapshots taken. Each time a new snapshot is taken, the oldest is discarded. A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run. As with the `save_steps` hyperparameter, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to a more general and robust policy at the end of training. <br><br> Typical range: `5` - `30` |
+
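+As an illustration, a `self_play` section using the default values above (placeholder behavior name; all other trainer settings omitted):
+
+```yaml
+behaviors:
+  MyBehavior:
+    self_play:
+      save_steps: 20000
+      team_change: 100000
+      swap_steps: 10000
+      play_against_latest_model_ratio: 0.5
+      window: 10
+```
+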
+### Note on Reward Signals
+
+We make the assumption that the final reward in a trajectory corresponds to the outcome of an episode. A final reward of +1 indicates winning, -1 indicates losing, and 0 indicates a draw. The ELO calculation (discussed below) depends on this final reward being either +1, 0, or -1.
+
+The reward signal should still be used as described in the documentation for the other trainers. However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing) and to allow for more iterations of training to compensate for the sparsity of reward.
+
+### Note on Swap Steps
+
+As an example, in a 2v1 scenario, if we want the swap to occur x=4 times during team-change=200000 steps, the swap_steps for the team of one agent is:
+
+swap_steps = (1 / 2) \* (200000 / 4) = 25000
+
+The swap_steps for the team of two agents is:
+
+swap_steps = (2 / 1) \* (200000 / 4) = 100000
+
+Note, with equal team sizes, the first term is equal to 1 and swap_steps can be calculated by just dividing the total steps by the desired number of swaps.
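+
+To make the 2v1 example concrete, here is an illustrative sketch of how the two resulting `swap_steps` values could appear in the configuration (the behavior names are placeholders; all other self-play settings are omitted):
+
+```yaml
+behaviors:
+  SoloAgentBehavior:      # team with one agent
+    self_play:
+      team_change: 200000
+      swap_steps: 25000
+  PairAgentsBehavior:     # team with two agents
+    self_play:
+      team_change: 200000
+      swap_steps: 100000
+```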
+
+A larger value of swap_steps means that an agent will play against the same fixed opponent for more training iterations. This results in a more stable training scenario, but leaves the agent open to the risk of overfitting its behavior to this particular opponent. Thus, when a new opponent is swapped in, the agent may lose more often than expected.
diff --git a/com.unity.ml-agents/Documentation~/Training-ML-Agents.md b/com.unity.ml-agents/Documentation~/Training-ML-Agents.md
new file mode 100644
index 0000000000..02170c3428
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Training-ML-Agents.md
@@ -0,0 +1,450 @@
+# Training ML-Agents
+
+Once your learning environment has been created and is ready for training, the next step is to initiate a training run. Training in the ML-Agents Toolkit is powered by a dedicated Python package, `mlagents`. This package exposes a command `mlagents-learn` that is the single entry point for all training workflows (e.g. reinforcement learning, imitation learning, curriculum learning). Its implementation can be found [here](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents/mlagents/trainers/learn.py).
+
+## Training with mlagents-learn
+
+### Starting Training
+
+`mlagents-learn` is the main training utility provided by the ML-Agents Toolkit. It accepts a number of CLI options in addition to a YAML configuration file that contains all the configurations and hyperparameters to be used during training. The set of configurations and hyperparameters to include in this file depends on the agents in your environment and the specific training method you wish to utilize. Keep in mind that the hyperparameter values can have a big impact on the training performance (i.e. your agent's ability to learn a policy that solves the task). On this page, we will review all the hyperparameters for all training methods and provide guidelines and advice on their values.
+
+To view a description of all the CLI options accepted by `mlagents-learn`, use the `--help` flag:
+
+```sh
+mlagents-learn --help
+```
+
+The basic command for training is:
+
+```sh
+mlagents-learn <trainer-config-file> --env=<env_name> --run-id=<run-identifier>
+```
+
+where
+
+- `<trainer-config-file>` is the file path of the trainer configuration YAML. This contains all the hyperparameter values. We offer a detailed guide on the structure of this file and the meaning of the hyperparameters (and advice on how to set them) in the dedicated [Training Configurations](#training-configurations) section below.
+- `<env_name>` **(Optional)** is the name (including path) of your [Unity executable](Learning-Environment-Executable.md) containing the agents to be trained. If `<env_name>` is not passed, the training will happen in the Editor. Press the **Play** button in Unity when the message _"Start training by pressing the Play button in the Unity Editor"_ is displayed on the screen.
+- `<run-identifier>` is a unique name you can use to identify the results of your training runs.
+
+See the [Running an Example](Sample.md#training-a-new-model-with-reinforcement-learning) guide for a sample execution of the `mlagents-learn` command.
+
+#### Observing Training
+
+Regardless of which training methods, configurations or hyperparameters you provide, the training process will always generate three artifacts, all found in the `results/` folder:
+
+1. Summaries: these are training metrics that are updated throughout the training process. They are helpful to monitor your training performance and may help inform how to update your hyperparameter values. See [Using TensorBoard](Using-Tensorboard.md) for more details on how to visualize the training metrics.
+2. Models: these contain the model checkpoints that are updated throughout training and the final model file (`.onnx`). This final model file is generated once either when training completes or is interrupted.
+3. Timers file (under `results//run_logs`): this contains aggregated metrics on your training process, including time spent on specific code blocks. See [Profiling in Python](Profiling-Python.md) for more information on the timers generated.
+
+These artifacts are updated throughout the training process and finalized when training is completed or is interrupted.
+
+#### Stopping and Resuming Training
+
+To interrupt training and save the current progress, hit `Ctrl+C` once and wait for the model(s) to be saved out.
+
+To resume a previously interrupted or completed training run, use the `--resume` flag and make sure to specify the previously used run ID.
+
+If you would like to re-run a previously interrupted or completed training run and re-use the same run ID (in this case, overwriting the previously generated artifacts), then use the `--force` flag.
+
+#### Loading an Existing Model
+
+You can also use `mlagents-learn` to run inference of an already-trained model in Python by using both the `--resume` and `--inference` flags. Note that if you want to run inference in Unity, you should use the [Inference Engine](Sample.md#running-a-pre-trained-model).
+
+Additionally, if the network architecture changes, you may still load an existing model, but ML-Agents will only load the parts of the model it can load and ignore all others. For instance, if you add a new reward signal, the existing model will load but the new reward signal will be initialized from scratch. If you have a model with a visual encoder (CNN) but change the `hidden_units`, the CNN will be loaded but the body of the network will be initialized from scratch.
+
+Alternatively, you might want to start a new training run but _initialize_ it using an already-trained model. You may want to do this, for instance, if your environment changed and you want a new model, but the old behavior is still better than random. You can do this by specifying `--initialize-from=<run-identifier>`, where `<run-identifier>` is the old run ID.
+
+## Training Configurations
+
+The Unity ML-Agents Toolkit provides a wide range of training scenarios, methods and options. As such, specific training runs may require different training configurations and may generate different artifacts and TensorBoard statistics. This section offers a detailed guide into how to manage the different training set-ups within the toolkit.
+
+More specifically, this section offers a detailed guide on the command-line flags for `mlagents-learn` that control the training configurations:
+
+- `<trainer-config-file>`: defines the training hyperparameters for each Behavior in the scene, and the set-ups for the environment parameters (Curriculum Learning and Environment Parameter Randomization)
+
+It is important to highlight that successfully training a Behavior in the ML-Agents Toolkit involves tuning the training hyperparameters and configuration. This guide contains some best practices for tuning the training process when the default parameters don't seem to be giving the level of performance you would like. We provide sample configuration files for our example environments in the [config](https://github.com/Unity-Technologies/ml-agents/tree/release_22/config/) directory. The `config/ppo/3DBall.yaml` was used to train the 3D Balance Ball in the [Running an Example](Sample.md) guide. That configuration file uses the PPO trainer, but we also have configuration files for SAC and GAIL.
+
+Additionally, the set of configurations you provide depend on the training functionalities you use (see [ML-Agents Theory](ML-Agents-Overview.md) for a description of all the training functionalities). Each functionality you add typically has its own training configurations. For instance:
+
+- Use PPO or SAC?
+- Use Recurrent Neural Networks for adding memory to your agents?
+- Use the intrinsic curiosity module?
+- Ignore the environment reward signal?
+- Pre-train using behavioral cloning? (Assuming you have recorded demonstrations.)
+- Include the GAIL intrinsic reward signals? (Assuming you have recorded demonstrations.)
+- Use self-play? (Assuming your environment includes multiple agents.)
+
+The trainer config file, `<trainer-config-file>`, determines the features you will use during training, and the answers to the above questions will dictate its contents. The rest of this guide breaks down the different sub-sections of the trainer config file and explains the possible settings for each. If you need a list of all the trainer configurations, please see [Training Configuration File](Training-Configuration-File.md).
+
+**NOTE:** The configuration file format changed between 0.17.0 and 0.18.0, and again in releases after 0.18.0. To convert an old set of configuration files (trainer config, curriculum, and sampler files) to the new format, a script has been provided. Run `python -m mlagents.trainers.upgrade_config -h` in your console to see the script's usage.
+
+### Adding CLI Arguments to the Training Configuration file
+
+Within the training configuration YAML file, you can also set the values of CLI arguments (such as `--num-envs`).
+
+As a reminder, a detailed description of all the CLI arguments can be found by using the help utility:
+
+```sh
+mlagents-learn --help
+```
+
+These additional CLI arguments are grouped into environment, engine, checkpoint and torch. The available settings and example values are shown below.
+
+#### Environment settings
+
+```yaml
+env_settings:
+ env_path: FoodCollector
+ env_args: null
+ base_port: 5005
+ num_envs: 1
+ timeout_wait: 10
+ seed: -1
+ max_lifetime_restarts: 10
+ restarts_rate_limit_n: 1
+ restarts_rate_limit_period_s: 60
+```
+
+#### Engine settings
+
+```yaml
+engine_settings:
+ width: 84
+ height: 84
+ quality_level: 5
+ time_scale: 20
+ target_frame_rate: -1
+ capture_frame_rate: 60
+ no_graphics: false
+```
+
+#### Checkpoint settings
+
+```yaml
+checkpoint_settings:
+ run_id: foodtorch
+ initialize_from: null
+ load_model: false
+ resume: false
+ force: true
+ train_model: false
+ inference: false
+```
+
+#### Torch settings
+
+```yaml
+torch_settings:
+ device: cpu
+```
+
+### Behavior Configurations
+
+The primary section of the trainer config file is a set of configurations for each Behavior in your scene. These are defined under the sub-section `behaviors` in your trainer config file. Some of the configurations are required while others are optional. To get you started, below is a sample file that includes all the possible settings if we're using a PPO trainer with all the possible training functionalities enabled (memory, behavioral cloning, curiosity, GAIL and self-play). You will notice that curriculum and environment parameter randomization settings are not part of the `behaviors` configuration, but in their own section called `environment_parameters`.
+
+```yaml
+behaviors:
+ BehaviorPPO:
+ trainer_type: ppo
+
+ hyperparameters:
+ # Hyperparameters common to PPO and SAC
+ batch_size: 1024
+ buffer_size: 10240
+ learning_rate: 3.0e-4
+ learning_rate_schedule: linear
+
+ # PPO-specific hyperparameters
+ beta: 5.0e-3
+ beta_schedule: constant
+ epsilon: 0.2
+ epsilon_schedule: linear
+ lambd: 0.95
+ num_epoch: 3
+ shared_critic: False
+
+ # Configuration of the neural network (common to PPO/SAC)
+ network_settings:
+ vis_encode_type: simple
+ normalize: false
+ hidden_units: 128
+ num_layers: 2
+ # memory
+ memory:
+ sequence_length: 64
+ memory_size: 256
+
+ # Trainer configurations common to all trainers
+ max_steps: 5.0e5
+ time_horizon: 64
+ summary_freq: 10000
+ keep_checkpoints: 5
+ checkpoint_interval: 50000
+ threaded: false
+ init_path: null
+
+ # behavior cloning
+ behavioral_cloning:
+ demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
+ strength: 0.5
+ steps: 150000
+ batch_size: 512
+ num_epoch: 3
+ samples_per_update: 0
+
+ reward_signals:
+ # environment reward (default)
+ extrinsic:
+ strength: 1.0
+ gamma: 0.99
+
+ # curiosity module
+ curiosity:
+ strength: 0.02
+ gamma: 0.99
+ encoding_size: 256
+ learning_rate: 3.0e-4
+
+ # GAIL
+ gail:
+ strength: 0.01
+ gamma: 0.99
+ encoding_size: 128
+ demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
+ learning_rate: 3.0e-4
+ use_actions: false
+ use_vail: false
+
+ # self-play
+ self_play:
+ window: 10
+ play_against_latest_model_ratio: 0.5
+ save_steps: 50000
+ swap_steps: 2000
+ team_change: 100000
+```
+
+Here is an equivalent file if we use an SAC trainer instead. Notice that the configurations for the additional functionalities (memory, behavioral cloning, curiosity and self-play) remain unchanged.
+
+```yaml
+behaviors:
+ BehaviorSAC:
+ trainer_type: sac
+
+ # Trainer configs common to PPO/SAC (excluding reward signals)
+ # same as PPO config
+
+ # SAC-specific configs (replaces the hyperparameters section above)
+ hyperparameters:
+ # Hyperparameters common to PPO and SAC
+ # Same as PPO config
+
+ # SAC-specific hyperparameters
+ # Replaces the "PPO-specific hyperparameters" section above
+ buffer_init_steps: 0
+ tau: 0.005
+ steps_per_update: 10.0
+ save_replay_buffer: false
+ init_entcoef: 0.5
+ reward_signal_steps_per_update: 10.0
+
+ # Configuration of the neural network (common to PPO/SAC)
+ network_settings:
+ # Same as PPO config
+
+ # Trainer configurations common to all trainers
+ #
+
+ # pre-training using behavior cloning
+ behavioral_cloning:
+ # same as PPO config
+
+ reward_signals:
+ # environment reward
+ extrinsic:
+ # same as PPO config
+
+ # curiosity module
+ curiosity:
+ # same as PPO config
+
+ # GAIL
+ gail:
+ # same as PPO config
+
+ # self-play
+ self_play:
+ # same as PPO config
+```
+
+We now break apart the components of the configuration file and describe what each of these parameters means, and provide guidelines on how to set them. See [Training Configuration File](Training-Configuration-File.md) for a detailed description of all the configurations listed above, along with their defaults. Unless otherwise specified, omitting a configuration will revert it to its default.
+
+### Default Behavior Settings
+
+In some cases, you may want to specify a set of default configurations for your Behaviors. This may be useful, for instance, if your Behavior names are generated procedurally by the environment and not known before runtime, or if you have many Behaviors with very similar settings. To specify a default configuration, insert a `default_settings` section in your YAML. This section should be formatted exactly like a configuration for a Behavior.
+
+```yaml
+default_settings:
+ # < Same as Behavior configuration >
+behaviors:
+ # < Same as above >
+```
+
+Behaviors found in the environment that aren't specified in the YAML will now use the `default_settings`, and unspecified settings in behavior configurations will default to the values in `default_settings` if specified there.
+
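+For instance, here is an illustrative sketch in which every behavior defaults to PPO with shared settings, and a single behavior (the name is a placeholder) overrides only its `max_steps`:
+
+```yaml
+default_settings:
+  trainer_type: ppo
+  max_steps: 5.0e5
+  hyperparameters:
+    learning_rate: 3.0e-4
+    batch_size: 1024
+    buffer_size: 10240
+  network_settings:
+    hidden_units: 128
+    num_layers: 2
+
+behaviors:
+  MyBehavior:
+    max_steps: 1.0e6   # overrides the default; everything else comes from default_settings
+```
+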
+### Environment Parameters
+
+In order to control the `EnvironmentParameters` in the Unity simulation during training, you need to add a section called `environment_parameters`. For example, you can set the value of an `EnvironmentParameter` called `my_environment_parameter` to `3.0` with the following code:
+
+```yml
+behaviors:
+ BehaviorY:
+ # < Same as above >
+
+# Add this section
+environment_parameters:
+ my_environment_parameter: 3.0
+```
+
+Inside the Unity simulation, you can access your Environment Parameters as follows:
+
+```csharp
+Academy.Instance.EnvironmentParameters.GetWithDefault("my_environment_parameter", 0.0f);
+```
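+
+For reference, if you drive the environment directly from the Python low-level API instead of `mlagents-learn`, the same parameters are set through an `EnvironmentParametersChannel` side channel. Below is a minimal sketch; the executable path is a placeholder you would replace with your own build:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+from mlagents_envs.side_channel.environment_parameters_channel import (
+    EnvironmentParametersChannel,
+)
+
+# Create the side channel and pass it to the environment at construction time.
+channel = EnvironmentParametersChannel()
+env = UnityEnvironment(file_name="path/to/your/build", side_channels=[channel])
+
+# The C# side reads this value with GetWithDefault("my_environment_parameter", 0.0f).
+channel.set_float_parameter("my_environment_parameter", 3.0)
+
+env.reset()
+env.close()
+```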
+
+#### Environment Parameter Randomization
+
+To enable environment parameter randomization, you need to edit the `environment_parameters` section of your training configuration YAML file. Instead of providing a single float value for your environment parameter, you can specify a sampler. Here is an example with three environment parameters called `mass`, `length` and `scale`:
+
+```yml
+behaviors:
+ BehaviorY:
+ # < Same as above >
+
+# Add this section
+environment_parameters:
+ mass:
+ sampler_type: uniform
+ sampler_parameters:
+ min_value: 0.5
+ max_value: 10
+
+ length:
+ sampler_type: multirangeuniform
+ sampler_parameters:
+ intervals: [[7, 10], [15, 20]]
+
+ scale:
+ sampler_type: gaussian
+ sampler_parameters:
+ mean: 2
+ st_dev: .3
+```
+
+
+| **Setting** | **Description** |
+| :--------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `sampler_type` | A string identifier for the type of sampler to use for this `Environment Parameter`. |
+| `sampler_parameters`         | The parameters for a given `sampler_type`. Samplers of different types can have different `sampler_parameters`. |
+
+**Supported Sampler Types**
+
+Below is a list of the `sampler_type` values supported by the toolkit.
+
+- `uniform` - Uniform sampler
+ - Uniformly samples a single float value from a range with a given minimum and maximum value (inclusive).
+ - **parameters** - `min_value`, `max_value`
+- `gaussian` - Gaussian sampler
+ - Samples a single float value from a normal distribution with a given mean and standard deviation.
+ - **parameters** - `mean`, `st_dev`
+- `multirangeuniform` - Multirange uniform sampler
+ - First, samples an interval from a set of intervals in proportion to relative length of the intervals. Then, uniformly samples a single float value from the sampled interval (inclusive). This sampler can take an arbitrary number of intervals in a list in the following format: [[`interval_1_min`, `interval_1_max`], [`interval_2_min`, `interval_2_max`], ...]
+ - **parameters** - `intervals`
+
+The implementation of the samplers can be found [here](https://github.com/Unity-Technologies/ml-agents/blob/main/com.unity.ml-agents/Runtime/Sampler.cs).
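+
+To make the two-stage `multirangeuniform` sampling described above concrete, here is a small illustrative Python sketch. This is not the toolkit's implementation (which is the C# `Sampler.cs` linked above); it only mirrors the described behavior:
+
+```python
+import random
+from typing import List, Tuple
+
+
+def multirange_uniform(intervals: List[Tuple[float, float]]) -> float:
+    """Pick an interval with probability proportional to its length,
+    then sample uniformly within the chosen interval."""
+    lengths = [high - low for low, high in intervals]
+    r = random.uniform(0.0, sum(lengths))
+    for (low, high), length in zip(intervals, lengths):
+        if r <= length:
+            return random.uniform(low, high)
+        r -= length
+    return random.uniform(*intervals[-1])  # guard against floating point edge cases
+
+
+print(multirange_uniform([(7, 10), (15, 20)]))  # mirrors the `length` example above
+```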
+
+**Training with Environment Parameter Randomization**
+
+After the sampler configuration is defined, we proceed by launching `mlagents-learn` and specifying the trainer configuration with parameter randomization enabled. For example, if we wanted to train the 3D ball agent with parameter randomization, we would run
+
+```sh
+mlagents-learn config/ppo/3DBall_randomize.yaml --run-id=3D-Ball-randomize
+```
+
+We can observe progress and metrics via TensorBoard.
+
+#### Curriculum
+
+To enable curriculum learning, you need to add a `curriculum` sub-section to your environment parameter. Here is one example with the environment parameter `my_environment_parameter`:
+
+```yml
+behaviors:
+ BehaviorY:
+ # < Same as above >
+
+# Add this section
+environment_parameters:
+ my_environment_parameter:
+ curriculum:
+ - name: MyFirstLesson # The '-' is important as this is a list
+ completion_criteria:
+ measure: progress
+ behavior: my_behavior
+ signal_smoothing: true
+ min_lesson_length: 100
+ threshold: 0.2
+ value: 0.0
+ - name: MySecondLesson # This is the start of the second lesson
+ completion_criteria:
+ measure: progress
+ behavior: my_behavior
+ signal_smoothing: true
+ min_lesson_length: 100
+ threshold: 0.6
+ require_reset: true
+ value:
+ sampler_type: uniform
+ sampler_parameters:
+ min_value: 4.0
+ max_value: 7.0
+ - name: MyLastLesson
+ value: 8.0
+```
+
+Note that this curriculum __only__ applies to `my_environment_parameter`. The `curriculum` section contains a list of `Lessons`. In the example, the lessons are named `MyFirstLesson`, `MySecondLesson` and `MyLastLesson`. Each `Lesson` has 3 fields:
+
+ - `name` which is a user defined name for the lesson (The name of the lesson will be displayed in the console when the lesson changes)
+ - `completion_criteria` which determines what needs to happen in the simulation before the lesson can be considered complete. When that condition is met, the curriculum moves on to the next `Lesson`. Note that you do not need to specify a `completion_criteria` for the last `Lesson`
+ - `value` which is the value the environment parameter will take during the lesson. Note that this can be a float or a sampler.
+
+Here are the different settings of the `completion_criteria`:
+
+
+| **Setting** | **Description** |
+| :------------------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `measure`           | What to measure learning progress and advancement in lessons by. `reward` uses a measure of received reward, `progress` uses the ratio of steps/max_steps, while `Elo` is available only for self-play situations and uses the Elo score as the curriculum completion measure. |
+| `behavior` | Specifies which behavior is being tracked. There can be multiple behaviors with different names, each at different points of training. This setting allows the curriculum to track only one of them. |
+| `threshold` | Determines at what point in value of `measure` the lesson should be increased. |
+| `min_lesson_length` | The minimum number of episodes that should be completed before the lesson can change. If `measure` is set to `reward`, the average cumulative reward of the last `min_lesson_length` episodes will be used to determine if the lesson should change. Must be nonnegative. **Important**: the average reward that is compared to the thresholds is different from the mean reward that is logged to the console. For example, if `min_lesson_length` is `100`, the lesson will increment after the average cumulative reward of the last `100` episodes exceeds the current threshold. The mean reward logged to the console is dictated by the `summary_freq` parameter defined above. |
+| `signal_smoothing` | Whether to weight the current progress measure by previous values. |
+| `require_reset`     | Whether changing lesson requires the environment to reset (default: false) |
+
+**Training with a Curriculum**
+
+Once we have specified our curricula, we can launch `mlagents-learn`, pointing to the config file containing them, and PPO will train using curriculum learning. For example, to train agents in the Wall Jump environment with curriculum learning, we can run:
+
+```sh
+mlagents-learn config/ppo/WallJump_curriculum.yaml --run-id=wall-jump-curriculum
+```
+
+We can then keep track of the current lessons and their progress via TensorBoard. If you've terminated the run, you can resume it using `--resume` and lesson progress will start off where it ended.
+
+
+### Training Using Concurrent Unity Instances
+
+In order to run concurrent Unity instances during training, set the number of environment instances using the command line option `--num-envs=` when you invoke `mlagents-learn`. Optionally, you can also set the `--base-port`, which is the starting port used for the concurrent Unity instances.
+
+Some considerations:
+
+- **Buffer Size** - If you are having trouble getting an agent to train, even with multiple concurrent Unity instances, you could increase `buffer_size` in the trainer config file. A common practice is to multiply `buffer_size` by `num-envs`.
+- **Resource Constraints** - Invoking concurrent Unity instances is constrained by the resources on the machine. Please use discretion when setting `--num-envs=`.
+- **Result Variation Using Concurrent Unity Instances** - If you keep all the hyperparameters the same, but change `--num-envs=`, the results and model would likely change.
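+
+For reference, each concurrent instance communicates over its own port, offset from the base port by a worker index. When using the low-level Python API directly, the equivalent knobs are the `worker_id` and `base_port` arguments of `UnityEnvironment`; a short sketch, where the executable path is a placeholder:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+
+# Two concurrent instances of the same executable; each listens on
+# base_port + worker_id, i.e. 5005 and 5006 here.
+envs = [
+    UnityEnvironment(file_name="path/to/your/build", worker_id=i, base_port=5005)
+    for i in range(2)
+]
+for env in envs:
+    env.reset()
+for env in envs:
+    env.close()
+```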
diff --git a/com.unity.ml-agents/Documentation~/Training-Plugins.md b/com.unity.ml-agents/Documentation~/Training-Plugins.md
new file mode 100644
index 0000000000..934c2c1819
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Training-Plugins.md
@@ -0,0 +1,45 @@
+# Customizing Training via Plugins
+
+ML-Agents provides support for running your own python implementations of specific interfaces during the training process. These interfaces are currently fairly limited, but will be expanded in the future.
+
+**Note:** Plugin interfaces should currently be considered "in beta", and they may change in future releases.
+
+## How to Write Your Own Plugin
+[This video](https://www.youtube.com/watch?v=fY3Y_xPKWNA) explains the basics of how to create a plugin system using setuptools, and is the same approach that ML-Agents' plugin system is based on.
+
+The `ml-agents-plugin-examples` directory contains a reference implementation of each plugin interface, so it's a good starting point.
+
+### setup.py
+If you don't already have a `setup.py` file for your python code, you'll need to add one. `ml-agents-plugin-examples` has a [minimal example](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents-plugin-examples/setup.py) of this.
+
+In the call to `setup()`, you'll need to add an entry to the `entry_points` dictionary for each plugin interface that you implement. The form of this is `{entry point name}={plugin module}:{plugin function}`. For example, in `ml-agents-plugin-examples`:
+```python
+entry_points={
+ ML_AGENTS_STATS_WRITER: [
+ "example=mlagents_plugin_examples.example_stats_writer:get_example_stats_writer"
+ ]
+}
+```
+* `ML_AGENTS_STATS_WRITER` (which is a string constant, `mlagents.stats_writer`) is the name of the plugin interface. This must be one of the provided interfaces ([see below](#plugin-interfaces)).
+* `example` is the plugin implementation name. This can be anything.
+* `mlagents_plugin_examples.example_stats_writer` is the plugin module. This points to the module where the plugin registration function is defined.
+* `get_example_stats_writer` is the plugin registration function. This is called when running `mlagents-learn`. The arguments and expected return type for this are different for each plugin interface.
+
+### Local Installation
+Once you've defined `entry_points` in your `setup.py`, you will need to run
+```sh
+pip install -e [path to your plugin code]
+```
+in the same Python virtual environment in which you have `mlagents` installed.
+
+## Plugin Interfaces
+
+### StatsWriter
+The StatsWriter class receives various information from the training process, such as the average Agent reward in each summary period. By default, we log this information to the console and write it to [TensorBoard](Using-Tensorboard.md).
+
+#### Interface
+The `StatsWriter.write_stats()` method must be implemented in any derived classes. It takes a "category" parameter, which typically is the behavior name of the Agents being trained, and a dictionary of `StatSummary` values with string keys. Additionally, `StatsWriter.on_add_stat()` may be extended to register a callback handler for each stat emission.
+
+#### Registration
+The `StatsWriter` registration function takes a `RunOptions` argument and returns a list of `StatsWriter`s. An example implementation is provided in [`mlagents_plugin_examples`](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents-plugin-examples/mlagents_plugin_examples/example_stats_writer.py)
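+
+To tie the pieces together, here is a minimal sketch of a custom `StatsWriter` and its registration function. The import paths follow the current `mlagents` package layout but should be treated as assumptions and checked against the release you are using:
+
+```python
+from typing import Dict, List
+
+from mlagents.trainers.settings import RunOptions
+from mlagents.trainers.stats import StatsSummary, StatsWriter
+
+
+class ConsoleCsvStatsWriter(StatsWriter):
+    """Prints each stat of a summary period as a CSV line to stdout."""
+
+    def write_stats(
+        self, category: str, values: Dict[str, StatsSummary], step: int
+    ) -> None:
+        for name, summary in values.items():
+            print(f"{category},{step},{name},{summary.mean}")
+
+
+def get_console_csv_stats_writer(run_options: RunOptions) -> List[StatsWriter]:
+    # Registration function referenced from entry_points; it receives the parsed
+    # RunOptions and returns the StatsWriter instances to add to the training run.
+    return [ConsoleCsvStatsWriter()]
+```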
diff --git a/com.unity.ml-agents/Documentation~/Training-on-Amazon-Web-Service.md b/com.unity.ml-agents/Documentation~/Training-on-Amazon-Web-Service.md
new file mode 100644
index 0000000000..fd57c59ade
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Training-on-Amazon-Web-Service.md
@@ -0,0 +1,296 @@
+# Training on Amazon Web Service
+
+:warning: **Note:** We no longer use this guide ourselves and so it may not work correctly. We've decided to keep it up just in case it is helpful to you.
+
+This page contains instructions for setting up an EC2 instance on Amazon Web Service for training ML-Agents environments.
+
+## Pre-configured AMI
+
+We've prepared a pre-configured AMI for you with the ID: `ami-016ff5559334f8619` in the `us-east-1` region. It was created as a modification of the Deep Learning AMI (Ubuntu). The AMI has been tested with a p2.xlarge instance. Furthermore, if you want to train without headless mode, you need to enable the X Server.
+
+After launching your EC2 instance using the AMI and connecting to it over SSH, run the following commands to enable the X Server:
+
+```sh
+# Start the X Server, press Enter to come back to the command line
+$ sudo /usr/bin/X :0 &
+
+# Check if Xorg process is running
+# You will have a list of processes running on the GPU, Xorg should be in the
+# list, as shown below
+$ nvidia-smi
+
+# Thu Jun 14 20:27:26 2018
+# +-----------------------------------------------------------------------------+
+# | NVIDIA-SMI 390.67 Driver Version: 390.67 |
+# |-------------------------------+----------------------+----------------------+
+# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
+# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
+# |===============================+======================+======================|
+# | 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
+# | N/A 35C P8 31W / 149W | 9MiB / 11441MiB | 0% Default |
+# +-------------------------------+----------------------+----------------------+
+#
+# +-----------------------------------------------------------------------------+
+# | Processes: GPU Memory |
+# | GPU PID Type Process name Usage |
+# |=============================================================================|
+# | 0 2331 G /usr/lib/xorg/Xorg 8MiB |
+# +-----------------------------------------------------------------------------+
+
+# Make Ubuntu use the X Server for display
+$ export DISPLAY=:0
+```
+
+## Configuring your own instance
+
+You could also choose to configure your own instance. To begin with, you will need an EC2 instance which contains the latest Nvidia drivers, CUDA9, and cuDNN. In this tutorial we used the Deep Learning AMI (Ubuntu) listed under AWS Marketplace with a p2.xlarge instance.
+
+### Installing the ML-Agents Toolkit on the instance
+
+After launching your EC2 instance using the AMI and connecting to it over SSH:
+
+1. Activate the python3 environment
+
+ ```sh
+ source activate python3
+ ```
+
+2. Clone the ML-Agents repo and install the required Python packages
+
+ ```sh
+ git clone --branch release_22 https://github.com/Unity-Technologies/ml-agents.git
+ cd ml-agents/ml-agents/
+ pip3 install -e .
+ ```
+
+### Setting up X Server (optional)
+
+X Server setup is only necessary if you want to do training that requires visual observation input. _Instructions here are adapted from this [Medium post](https://medium.com/towards-data-science/how-to-run-unity-on-amazon-cloud-or-without-monitor-3c10ce022639) on running general Unity applications in the cloud._
+
+Current limitations of the Unity Engine require that a screen be available to render to when using visual observations. In order to make this possible when training on a remote server, a virtual screen is required. We can do this by installing Xorg and creating a virtual screen. Once installed and created, we can display the Unity environment on the virtual screen and train as we would on a local machine. Ensure that `headless` mode is disabled when building Linux executables that use visual observations.
+
+#### Install and setup Xorg:
+
+ ```sh
+ # Install Xorg
+ $ sudo apt-get update
+ $ sudo apt-get install -y xserver-xorg mesa-utils
+ $ sudo nvidia-xconfig -a --use-display-device=None --virtual=1280x1024
+
+ # Get the BusID information
+ $ nvidia-xconfig --query-gpu-info
+
+ # Add the BusID information to your /etc/X11/xorg.conf file
+ $ sudo sed -i 's/ BoardName "Tesla K80"/ BoardName "Tesla K80"\n BusID "0:30:0"/g' /etc/X11/xorg.conf
+
+ # Remove the Section "Files" from the /etc/X11/xorg.conf file
+ # And remove two lines that contain Section "Files" and EndSection
+ $ sudo vim /etc/X11/xorg.conf
+ ```
+
+#### Update and setup Nvidia driver:
+
+ ```sh
+ # Download and install the latest Nvidia driver for ubuntu
+ # Please refer to http://download.nvidia.com/XFree86/Linux-x86_64/latest.txt
+ $ wget http://download.nvidia.com/XFree86/Linux-x86_64/390.87/NVIDIA-Linux-x86_64-390.87.run
+ $ sudo /bin/bash ./NVIDIA-Linux-x86_64-390.87.run --accept-license --no-questions --ui=none
+
+ # Disable Nouveau as it will clash with the Nvidia driver
+ $ sudo echo 'blacklist nouveau' | sudo tee -a /etc/modprobe.d/blacklist.conf
+ $ sudo echo 'options nouveau modeset=0' | sudo tee -a /etc/modprobe.d/blacklist.conf
+ $ sudo echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
+ $ sudo update-initramfs -u
+ ```
+
+#### Restart the EC2 instance:
+
+ ```sh
+ sudo reboot now
+ ```
+
+#### Make sure there are no Xorg processes running:
+
+```sh
+# Kill any possible running Xorg processes
+# Note that you might have to run this command multiple times depending on
+# how Xorg is configured.
+$ sudo killall Xorg
+
+# Check if there is any Xorg process left
+# You will have a list of processes running on the GPU, Xorg should not be in
+# the list, as shown below.
+$ nvidia-smi
+
+# Thu Jun 14 20:21:11 2018
+# +-----------------------------------------------------------------------------+
+# | NVIDIA-SMI 390.67 Driver Version: 390.67 |
+# |-------------------------------+----------------------+----------------------+
+# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
+# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
+# |===============================+======================+======================|
+# | 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
+# | N/A 37C P8 31W / 149W | 0MiB / 11441MiB | 0% Default |
+# +-------------------------------+----------------------+----------------------+
+#
+# +-----------------------------------------------------------------------------+
+# | Processes: GPU Memory |
+# | GPU PID Type Process name Usage |
+# |=============================================================================|
+# | No running processes found |
+# +-----------------------------------------------------------------------------+
+
+```
+
+#### Start X Server and make the ubuntu use X Server for display:
+
+ ```console
+ # Start the X Server, press Enter to come back to the command line
+ $ sudo /usr/bin/X :0 &
+
+ # Check if Xorg process is running
+ # You will have a list of processes running on the GPU, Xorg should be in the list.
+ $ nvidia-smi
+
+ # Make Ubuntu use the X Server for display
+ $ export DISPLAY=:0
+ ```
+
+#### Ensure the Xorg is correctly configured:
+
+ ```sh
+ # For more information on glxgears, see ftp://www.x.org/pub/X11R6.8.1/doc/glxgears.1.html.
+ $ glxgears
+ # If Xorg is configured correctly, you should see the following message
+
+ # Running synchronized to the vertical refresh. The framerate should be
+ # approximately the same as the monitor refresh rate.
+ # 137296 frames in 5.0 seconds = 27459.053 FPS
+ # 141674 frames in 5.0 seconds = 28334.779 FPS
+ # 141490 frames in 5.0 seconds = 28297.875 FPS
+
+ ```
+
+## Training on EC2 instance
+
+1. In the Unity Editor, load a project containing an ML-Agents environment (you can use one of the example environments if you have not created your own).
+2. Open the Build Settings window (menu: File > Build Settings).
+3. Select Linux as the Target Platform, and x86_64 as the target architecture (the default x86 currently does not work).
+4. Check Headless Mode if you have not set up the X Server. (If you do not use Headless Mode, you have to set up the X Server to enable training.)
+5. Click Build to build the Unity environment executable.
+6. Upload the executable to your EC2 instance within the `ml-agents` folder.
+7. Change the permissions of the executable.
+
+ ```sh
+ chmod +x <env_name>.x86_64
+ ```
+
+8. (Without Headless Mode) Start X Server and use it for display:
+
+ ```sh
+ # Start the X Server, press Enter to come back to the command line
+ $ sudo /usr/bin/X :0 &
+
+ # Check if Xorg process is running
+ # You will have a list of processes running on the GPU, Xorg should be in the list.
+ $ nvidia-smi
+
+ # Make Ubuntu use the X Server for display
+ $ export DISPLAY=:0
+ ```
+
+9. Test the instance setup from Python using:
+
+ ```python
+ from mlagents_envs.environment import UnityEnvironment
+
+ env = UnityEnvironment(file_name="<env_name>")
+ ```
+
+Where `<env_name>` corresponds to the path to your environment executable.
+
+You should receive a message confirming that the environment was loaded successfully.
+
+10. Train your models
+
+ ```console
+ mlagents-learn --env=<env_name> --train
+ ```
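+
+As an optional sanity check before or after kicking off training, you can reuse the `env` object created in step 9 to confirm that the executable exposes the behaviors you expect:
+
+```python
+env.reset()
+print("Behaviors:", list(env.behavior_specs.keys()))
+env.close()
+```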
+
+## FAQ
+
+### The \_Data folder hasn't been copied over
+
+If you've built your Linux executable but forgot to copy over the corresponding \_Data folder, you will see an error message like the following:
+
+```sh
+Set current directory to /home/ubuntu/ml-agents/ml-agents
+Found path: /home/ubuntu/ml-agents/ml-agents/3dball_linux.x86_64
+no boot config - using default values
+
+(Filename: Line: 403)
+
+There is no data folder
+```
+
+### Unity Environment not responding
+
+If you didn't set up the X Server, haven't launched it properly, your environment somehow crashes, or you haven't run `chmod +x` on your Unity environment, the connection between Unity and Python will fail. Then you will see something like this:
+
+```console
+Logging to /home/ubuntu/.config/unity3d//Player.log
+Traceback (most recent call last):
+ File "", line 1, in
+ File "/home/ubuntu/ml-agents/ml-agents/mlagents_envs/environment.py", line 63, in __init__
+ aca_params = self.send_academy_parameters(rl_init_parameters_in)
+ File "/home/ubuntu/ml-agents/ml-agents/mlagents_envs/environment.py", line 489, in send_academy_parameters
+ return self.communicator.initialize(inputs).rl_initialization_output
+ File "/home/ubuntu/ml-agents/ml-agents/mlagents_envs/rpc_communicator.py", line 60, in initialize
+mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
+ The environment does not need user interaction to launch
+ The environment and the Python interface have compatible versions.
+```
+
+It can also be helpful to check the `Player.log` under `/home/ubuntu/.config/unity3d/` to see what happened with your Unity environment.
+
+### Could not launch X Server
+
+When you execute:
+
+```sh
+sudo /usr/bin/X :0 &
+```
+
+You might see something like:
+
+```sh
+X.Org X Server 1.18.4
+...
+(==) Log file: "/var/log/Xorg.0.log", Time: Thu Oct 11 21:10:38 2018
+(==) Using config file: "/etc/X11/xorg.conf"
+(==) Using system config directory "/usr/share/X11/xorg.conf.d"
+(EE)
+Fatal server error:
+(EE) no screens found(EE)
+(EE)
+Please consult the X.Org Foundation support
+ at http://wiki.x.org
+ for help.
+(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
+(EE)
+(EE) Server terminated with error (1). Closing log file.
+```
+
+And when you execute:
+
+```sh
+nvidia-smi
+```
+
+You might see something like:
+
+```sh
+NVIDIA-SMI has failed because it could not communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
+```
+
+This means the NVIDIA driver needs to be updated. Refer to [this section](Training-on-Amazon-Web-Service.md#update-and-setup-nvidia-driver) for more information.
diff --git a/com.unity.ml-agents/Documentation~/Training-on-Microsoft-Azure.md b/com.unity.ml-agents/Documentation~/Training-on-Microsoft-Azure.md
new file mode 100644
index 0000000000..98c559845d
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Training-on-Microsoft-Azure.md
@@ -0,0 +1,166 @@
+# Training on Microsoft Azure (works with ML-Agents Toolkit v0.3)
+
+:warning: **Note:** We no longer use this guide ourselves and so it may not work correctly. We've decided to keep it up just in case it is helpful to you.
+
+This page contains instructions for setting up training on Microsoft Azure through either [Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances/) or Virtual Machines. Non "headless" training has not yet been tested to verify support.
+
+## Pre-Configured Azure Virtual Machine
+
+A pre-configured virtual machine image is available in the Azure Marketplace and is nearly completely ready for training. You can start by deploying the [Data Science Virtual Machine for Linux (Ubuntu)](https://learn.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro?view=azureml-api-2) into your Azure subscription.
+
+Note that, if you choose to deploy the image to an [N-Series GPU optimized VM](https://docs.microsoft.com/azure/virtual-machines/linux/sizes-gpu), training will, by default, run on the GPU. If you choose any other type of VM, training will run on the CPU.
+
+## Configuring your own Instance
+
+Setting up your own instance requires a number of package installations. Please view the documentation for doing so [here](#custom-instances).
+
+## Installing ML-Agents
+
+1. [Move](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/copy-files-to-linux-vm-using-scp) the `ml-agents` sub-folder of this ml-agents repo to the remote Azure instance, and set it as the working directory.
+2. Install the required packages: Torch: `pip3 install torch==1.7.0 -f https://download.pytorch.org/whl/torch_stable.html` and MLAgents: `python -m pip install mlagents==1.1.0`
+
+## Testing
+
+To verify that all steps worked correctly:
+
+1. In the Unity Editor, load a project containing an ML-Agents environment (you can use one of the example environments if you have not created your own).
+2. Open the Build Settings window (menu: File > Build Settings).
+3. Select Linux as the Target Platform, and x86_64 as the target architecture.
+4. Check Headless Mode.
+5. Click Build to build the Unity environment executable.
+6. Upload the resulting files to your Azure instance.
+7. Test the instance setup from Python using:
+
+```python
+from mlagents_envs.environment import UnityEnvironment
+
+env = UnityEnvironment(file_name="<env_name>", seed=1, side_channels=[])
+```
+
+Where `<env_name>` corresponds to the path to your environment executable (i.e. `/home/UserName/Build/yourFile`).
+
+You should receive a message confirming that the environment was loaded successfully.
+
+**Note:** When running your environment in headless mode, you must append `--no-graphics` to your `mlagents-learn` command, as it won't train otherwise.
+You can test this simply by aborting a training run and checking whether it says "Model Saved" or "Aborted", or by checking whether it generated the `.onnx` file in the results folder.
+
+## Running Training on your Virtual Machine
+
+To run your training on the VM:
+
+1. [Move](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/copy-files-to-linux-vm-using-scp) your built Unity application to your Virtual Machine.
+2. Set the directory where the ML-Agents Toolkit was installed to your working directory.
+3. Run the following command:
+
+```sh
+mlagents-learn --env=<env_name> --run-id=<run-id> --train
+```
+
+Where `<env_name>` is the path to your app (i.e. `~/unity-volume/3DBallHeadless`) and `<run-id>` is an identifier you would like to use for your training run.
+
+If you've selected to run on a N-Series VM with GPU support, you can verify that the GPU is being used by running `nvidia-smi` from the command line.
+
+## Monitoring your Training Run with TensorBoard
+
+Once you have started training, you can [use TensorBoard to observe the training](Using-Tensorboard.md).
+
+1. Start by [opening the appropriate port for web traffic to connect to your VM](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/nsg-quickstart-portal).
+
+ - Note that you don't need to generate a new `Network Security Group` but instead, go to the **Networking** tab under **Settings** for your VM.
+ - As an example, you could use the following settings to open the Port with the following Inbound Rule settings:
+ - Source: Any
+ - Source Port Ranges: \*
+ - Destination: Any
+ - Destination Port Ranges: 6006
+ - Protocol: Any
+ - Action: Allow
+ - Priority: (Leave as default)
+
+2. Unless you started the training as a background process, connect to your VM from another terminal instance.
+3. Run the following command from your terminal `tensorboard --logdir results --host 0.0.0.0`
+4. You should now be able to open a browser and navigate to `<your-vm-public-ip>:6006` to view the TensorBoard report.
+
+## Running on Azure Container Instances
+
+[Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances/) allow you to spin up a container, on demand, that will run your training and then be shut down. This ensures you aren't leaving a billable VM running when it isn't needed. Using ACI enables you to offload training of your models without needing to install Python and TensorFlow on your own computer.
+
+## Custom Instances
+
+This page contains instructions for setting up a custom Virtual Machine on Microsoft Azure so you can run ML-Agents training in the cloud.
+
+1. Start by [deploying an Azure VM](https://docs.microsoft.com/azure/virtual-machines/linux/quick-create-portal) with Ubuntu Linux (tests were done with 16.04 LTS). To use GPU support, use an N-Series VM.
+2. SSH into your VM.
+3. Start with the following commands to install the Nvidia driver:
+
+ ```sh
+ wget http://us.download.nvidia.com/tesla/375.66/nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb
+
+ sudo dpkg -i nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb
+
+ sudo apt-get update
+
+ sudo apt-get install cuda-drivers
+
+ sudo reboot
+ ```
+
+4. After a minute you should be able to reconnect to your VM and install the CUDA toolkit:
+
+ ```sh
+ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
+
+ sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
+
+ sudo apt-get update
+
+ sudo apt-get install cuda-8-0
+ ```
+
+5. You'll next need to download cuDNN from the Nvidia developer site. This requires a registered account.
+
+6. Navigate to [http://developer.nvidia.com](http://developer.nvidia.com) and create an account and verify it.
+
+7. Download (to your own computer) cuDNN from [this url](https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v6/prod/8.0_20170307/Ubuntu16_04_x64/libcudnn6_6.0.20-1+cuda8.0_amd64-deb).
+
+8. Copy the deb package to your VM:
+
+ ```sh
+ scp libcudnn6_6.0.21-1+cuda8.0_amd64.deb <username>@<your-vm-ip>:libcudnn6_6.0.21-1+cuda8.0_amd64.deb
+ ```
+
+9. SSH back to your VM and execute the following:
+
+ ```console
+ sudo dpkg -i libcudnn6_6.0.21-1+cuda8.0_amd64.deb
+
+ export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH
+ . ~/.profile
+
+ sudo reboot
+ ```
+
+10. After a minute, you should be able to SSH back into your VM. After doing so, run the following:
+
+ ```sh
+ sudo apt install python-pip
+ sudo apt install python3-pip
+ ```
+
+11. At this point, you need to install TensorFlow. The version you install depends on whether you are using the GPU to train:
+
+ ```sh
+ pip3 install tensorflow-gpu==1.4.0 keras==2.0.6
+ ```
+
+Or, if you are using the CPU to train:
+
+ ```sh
+ pip3 install tensorflow==1.4.0 keras==2.0.6
+ ```
+
+12. You'll then need to install additional dependencies:
+
+ ```sh
+ pip3 install pillow
+ pip3 install numpy
+ ```
diff --git a/com.unity.ml-agents/Documentation~/Training.md b/com.unity.ml-agents/Documentation~/Training.md
new file mode 100644
index 0000000000..f61463e49f
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Training.md
@@ -0,0 +1,13 @@
+# Training
+
+Use the following topics to learn about training in ML-Agents.
+
+
+| **Section** | **Description** |
+|---------------------------------------------------------------|--------------------------------------------|
+| [Training ML-Agents Basics](Training-ML-Agents.md) | Overview of training workflow. |
+| [Training Configuration File](Training-Configuration-File.md) | Training config file reference. |
+| [Using Tensorboard](Using-Tensorboard.md) | Visualize training progress. |
+| [Customizing Training via Plugins](Training-Plugins.md) | Extend training with custom plugins. |
+| [Custom Trainer Plugin](Tutorial-Custom-Trainer-Plugin.md) | Tutorial for creating custom trainers. |
+| [Profiling Trainers](Profiling-Python.md) | Analyze and optimize training performance. |
diff --git a/com.unity.ml-agents/Documentation~/Tutorial-Colab.md b/com.unity.ml-agents/Documentation~/Tutorial-Colab.md
new file mode 100644
index 0000000000..f91c847757
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Tutorial-Colab.md
@@ -0,0 +1,9 @@
+# Python Tutorial with Google Colab
+
+Interactive tutorials for using ML-Agents with Google Colab environments.
+
+| **Tutorial** | **Description** |
+|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| [Using a UnityEnvironment](https://colab.research.google.com/github/Unity-Technologies/ml-agents/blob/release_22_docs/colab/Colab_UnityEnvironment_1_Run.ipynb) | Learn how to set up and interact with Unity environments in Colab. |
+| [Q-Learning with a UnityEnvironment](https://colab.research.google.com/github/Unity-Technologies/ml-agents/blob/release_22_docs/colab/Colab_UnityEnvironment_2_Train.ipynb) | Implement Q-Learning algorithms with Unity ML-Agents in Colab. |
+| [Using Side Channels on a UnityEnvironment](https://colab.research.google.com/github/Unity-Technologies/ml-agents/blob/release_22_docs/colab/Colab_UnityEnvironment_3_SideChannel.ipynb) | Explore side channel communication between Unity and Python. |
diff --git a/com.unity.ml-agents/Documentation~/Tutorial-Custom-Trainer-Plugin.md b/com.unity.ml-agents/Documentation~/Tutorial-Custom-Trainer-Plugin.md
new file mode 100644
index 0000000000..bed6d1ef79
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Tutorial-Custom-Trainer-Plugin.md
@@ -0,0 +1,303 @@
+# Custom Trainer Plugin
+
+## How to write a custom trainer plugin
+
+### Step 1: Write your custom trainer class
+Before you start writing your code, make sure to use your favorite environment management tool (e.g. `venv` or `conda`) to create and activate a Python virtual environment. The following command uses `conda`, but other tools work similarly:
+```shell
+conda create -n trainer-env python=3.10.12
+conda activate trainer-env
+```
+
+Users of the plug-in system are responsible for implementing the trainer class subject to the API standard. Let us follow an example by implementing a custom trainer named "YourCustomTrainer". You can extend either the `OnPolicyTrainer` or the `OffPolicyTrainer` class, depending on the training strategy you choose.
+
+Please refer to the internal [PPO implementation](https://github.com/Unity-Technologies/ml-agents/blob/release_22/ml-agents/mlagents/trainers/ppo/trainer.py) for a complete code example. We will not provide complete, runnable code in this document. The purpose of the tutorial is to introduce you to the core components and interfaces of our plugin framework. We use code snippets and patterns to demonstrate the control and data flow.
+
+Your custom trainers are responsible for collecting experiences and training the models. Your custom trainer class acts as a coordinator for the policy and the optimizer. To start implementing methods in the class, create a policy object in the `create_policy` method:
+
+
+```python
+def create_policy(
+ self, parsed_behavior_id: BehaviorIdentifiers, behavior_spec: BehaviorSpec
+) -> TorchPolicy:
+
+ actor_cls: Union[Type[SimpleActor], Type[SharedActorCritic]] = SimpleActor
+ actor_kwargs: Dict[str, Any] = {
+ "conditional_sigma": False,
+ "tanh_squash": False,
+ }
+ if self.shared_critic:
+ reward_signal_configs = self.trainer_settings.reward_signals
+ reward_signal_names = [
+ key.value for key, _ in reward_signal_configs.items()
+ ]
+ actor_cls = SharedActorCritic
+ actor_kwargs.update({"stream_names": reward_signal_names})
+
+ policy = TorchPolicy(
+ self.seed,
+ behavior_spec,
+ self.trainer_settings.network_settings,
+ actor_cls,
+ actor_kwargs,
+ )
+ return policy
+
+```
+
+Depending on whether you use shared or separate network architecture for your policy, we provide `SimpleActor` and `SharedActorCritic` from `mlagents.trainers.torch_entities.networks` that you can choose from. In our example above, we use a `SimpleActor`.
+
+Next, create an optimizer class object from `create_optimizer` method and connect it to the policy object you created above:
+
+
+```python
+def create_optimizer(self) -> TorchOptimizer:
+ return TorchPPOOptimizer( # type: ignore
+ cast(TorchPolicy, self.policy), self.trainer_settings # type: ignore
+ ) # type: ignore
+
+```
+
+There are a couple of abstract methods (`_process_trajectory` and `_update_policy`) inherited from `RLTrainer` that you need to implement in your custom trainer class. `_process_trajectory` takes a trajectory and processes it, putting it into the update buffer. Processing involves calculating value and advantage targets for the model updating step. Given the input `trajectory: Trajectory`, users are responsible for processing the data in the trajectory and appending `agent_buffer_trajectory` to the back of the update buffer by calling `self._append_to_update_buffer(agent_buffer_trajectory)`, whose output will be used in updating the model in the `optimizer` class.
+
+A typical `_process_trajectory` function (incomplete) will convert a trajectory object to an agent buffer, then get all value estimates from the trajectory by calling `self.optimizer.get_trajectory_value_estimates`. From the returned dictionary of value estimates we extract reward signals keyed by their names:
+
+```python
+def _process_trajectory(self, trajectory: Trajectory) -> None:
+ super()._process_trajectory(trajectory)
+ agent_id = trajectory.agent_id # All the agents should have the same ID
+
+ agent_buffer_trajectory = trajectory.to_agentbuffer()
+
+ # Get all value estimates
+ (
+ value_estimates,
+ value_next,
+ value_memories,
+ ) = self.optimizer.get_trajectory_value_estimates(
+ agent_buffer_trajectory,
+ trajectory.next_obs,
+ trajectory.done_reached and not trajectory.interrupted,
+ )
+
+ for name, v in value_estimates.items():
+ agent_buffer_trajectory[RewardSignalUtil.value_estimates_key(name)].extend(
+ v
+ )
+ self._stats_reporter.add_stat(
+ f"Policy/{self.optimizer.reward_signals[name].name.capitalize()} Value Estimate",
+ np.mean(v),
+ )
+
+ # Evaluate all reward functions
+ self.collected_rewards["environment"][agent_id] += np.sum(
+ agent_buffer_trajectory[BufferKey.ENVIRONMENT_REWARDS]
+ )
+ for name, reward_signal in self.optimizer.reward_signals.items():
+ evaluate_result = (
+ reward_signal.evaluate(agent_buffer_trajectory) * reward_signal.strength
+ )
+ agent_buffer_trajectory[RewardSignalUtil.rewards_key(name)].extend(
+ evaluate_result
+ )
+ # Report the reward signals
+ self.collected_rewards[name][agent_id] += np.sum(evaluate_result)
+
+ self._append_to_update_buffer(agent_buffer_trajectory)
+
+```
+
+A trajectory is a list of dictionaries mapping strings to values of any type. When calling `forward` on a policy, the argument will include an "experience" dictionary from the last step. The `forward` method generates an action and the next "experience" dictionary. Examples of fields in the "experience" dictionary include the observation, action, reward, done status, group reward, LSTM memory state, etc.
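+
+For intuition only, a single "experience" entry in such a trajectory could look like the sketch below; the exact keys and shapes are assumptions for illustration and will differ from the internal buffer implementation:
+
+```python
+import numpy as np
+
+# Illustrative only: one step's "experience" dictionary.
+experience = {
+    "observation": np.zeros(8, dtype=np.float32),    # sensor observations
+    "action": np.array([0.5, -0.1], dtype=np.float32),
+    "reward": 0.1,
+    "done": False,
+    "group_reward": 0.0,
+    "memory": np.zeros(256, dtype=np.float32),        # LSTM memory state
+}
+
+# A trajectory is then an ordered list of such dictionaries.
+trajectory_like = [experience]
+```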
+
+
+
+### Step 2: Implement your custom optimizer for the trainer
+We will show you an example we implemented: `class TorchPPOOptimizer(TorchOptimizer)`, which takes a policy and a dict of trainer parameters and creates an optimizer that connects to the policy. Your optimizer should include a value estimator and a loss function in the `update` method.
+
+Before writing your optimizer class, first define a settings class, `class PPOSettings(OnPolicyHyperparamSettings)`, for your custom optimizer:
+
+
+
+```python
+class PPOSettings(OnPolicyHyperparamSettings):
+ beta: float = 5.0e-3
+ epsilon: float = 0.2
+ lambd: float = 0.95
+ num_epoch: int = 3
+ shared_critic: bool = False
+ learning_rate_schedule: ScheduleType = ScheduleType.LINEAR
+ beta_schedule: ScheduleType = ScheduleType.LINEAR
+ epsilon_schedule: ScheduleType = ScheduleType.LINEAR
+
+```
+
+You should implement the `update` function following this interface:
+
+
+```python
+def update(self, batch: AgentBuffer, num_sequences: int) -> Dict[str, float]:
+
+```
+
+In it, losses and other metrics are calculated from an `AgentBuffer` that is generated by your trainer class. Depending on which model you choose, the loss functions you implement will be different. In our case, we calculate a value loss from the critic and a trust region policy loss. A typical (incomplete) pattern for the calculations looks like the following:
+
+
+```python
+run_out = self.policy.actor.get_stats(
+ current_obs,
+ actions,
+ masks=act_masks,
+ memories=memories,
+ sequence_length=self.policy.sequence_length,
+)
+
+log_probs = run_out["log_probs"]
+entropy = run_out["entropy"]
+
+values, _ = self.critic.critic_pass(
+ current_obs,
+ memories=value_memories,
+ sequence_length=self.policy.sequence_length,
+)
+policy_loss = ModelUtils.trust_region_policy_loss(
+ ModelUtils.list_to_tensor(batch[BufferKey.ADVANTAGES]),
+ log_probs,
+ old_log_probs,
+ loss_masks,
+ decay_eps,
+)
+loss = (
+ policy_loss
+ + 0.5 * value_loss
+ - decay_bet * ModelUtils.masked_mean(entropy, loss_masks)
+)
+
+```
+
+Finally, update the model and return a dictionary including the calculated losses and the updated, decayed learning rate:
+
+
+```python
+ModelUtils.update_learning_rate(self.optimizer, decay_lr)
+self.optimizer.zero_grad()
+loss.backward()
+
+self.optimizer.step()
+update_stats = {
+ "Losses/Policy Loss": torch.abs(policy_loss).item(),
+ "Losses/Value Loss": value_loss.item(),
+ "Policy/Learning Rate": decay_lr,
+ "Policy/Epsilon": decay_eps,
+ "Policy/Beta": decay_bet,
+}
+
+```
+
+### Step 3: Integrate your custom trainer into the plugin system
+
+By integrating a custom trainer into the plugin system, a user can use their published packages, which contain their implementations. To do that, you need to add a `setup.py` file. In the call to `setup()`, you'll need to add to the `entry_points` dictionary for each plugin interface that you implement. The form of this is `{entry point name}={plugin module}:{plugin function}`. For example:
+
+
+
+```python
+entry_points={
+ ML_AGENTS_TRAINER_TYPE: [
+ "your_trainer_type=your_package.your_custom_trainer:get_type_and_setting"
+ ]
+ },
+```
+
+Some key elements in the code:
+
+```
+ML_AGENTS_TRAINER_TYPE: a string constant for trainer type
+your_trainer_type: name your trainer type, used in configuration file
+your_package: your pip installable package containing custom trainer implementation
+```
+
+Also define the `get_type_and_setting` method in your `YourCustomTrainer` class:
+
+
+```python
+def get_type_and_setting():
+ return {YourCustomTrainer.get_trainer_name(): YourCustomTrainer}, {
+ YourCustomTrainer.get_trainer_name(): YourCustomSetting
+ }
+
+```
+
+Finally, specify the trainer type in the config file:
+
+
+```yaml
+behaviors:
+ 3DBall:
+ trainer_type: your_trainer_type
+...
+```
+
+### Step 4: Install your custom trainer and run training:
+Before installing your custom trainer package, make sure you have `ml-agents-envs` and `ml-agents` installed:
+
+```shell
+pip3 install -e ./ml-agents-envs && pip3 install -e ./ml-agents
+```
+
+Install your custom trainer package (if your package is pip installable):
+```shell
+pip3 install your_custom_package
+```
+Or follow our internal implementations:
+```shell
+pip3 install -e ./ml-agents-trainer-plugin
+```
+
+After the previous installations, your package is added as an entry point and you can use a config file with the new trainers:
+```shell
+mlagents-learn ml-agents-trainer-plugin/mlagents_trainer_plugin/a2c/a2c_3DBall.yaml --run-id <run-id> --env <path-to-environment>
+```
+
+### Validate your implementations:
+Create a clean Python environment with Python 3.10.12 and activate it before you start, if you haven't done so already:
+```shell
+conda create -n trainer-env python=3.10.12
+conda activate trainer-env
+```
+
+Make sure you follow previous steps and install all required packages. We are testing internal implementations in this tutorial, but ML-Agents users can run similar validations once they have their own implementations installed:
+```shell
+pip3 install -e ./ml-agents-envs && pip3 install -e ./ml-agents
+pip3 install -e ./ml-agents-trainer-plugin
+```
+Once your package is added as an `entrypoint`, you can add the new trainer type to the config file. Check that the trainer type is specified in the config file `a2c_3DBall.yaml`:
+```yaml
+trainer_type: a2c
+```
+
+Test whether the custom trainer package is installed by running:
+```shell
+mlagents-learn ml-agents-trainer-plugin/mlagents_trainer_plugin/a2c/a2c_3DBall.yaml --run-id test-trainer
+```
+
+You can also list all trainers installed in the registry. Type `python` in your shell to open a REPL session. Run the Python code below; you should see all trainer types currently installed:
+```python
+>>> import pkg_resources
+>>> for entry in pkg_resources.iter_entry_points('mlagents.trainer_type'):
+... print(entry)
+...
+default = mlagents.plugins.trainer_type:get_default_trainer_types
+a2c = mlagents_trainer_plugin.a2c.a2c_trainer:get_type_and_setting
+dqn = mlagents_trainer_plugin.dqn.dqn_trainer:get_type_and_setting
+```
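+
+On recent Python versions, where `pkg_resources` is deprecated, the same listing can be done with `importlib.metadata` (a sketch assuming the entry-point group name shown above):
+
+```python
+from importlib.metadata import entry_points
+
+# Requires Python 3.10+ for the group= keyword argument.
+for ep in entry_points(group="mlagents.trainer_type"):
+    print(f"{ep.name} = {ep.value}")
+```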
+
+If it is properly installed, you will see the Unity logo and a message indicating that training will start:
+```
+[INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
+```
+
+If you see the following error message, it could be because the trainer type is wrong or the specified trainer type is not installed:
+```shell
+mlagents.trainers.exception.TrainerConfigError: Invalid trainer type a2c was found
+```
+
diff --git a/com.unity.ml-agents/Documentation~/Unity-Environment-Registry.md b/com.unity.ml-agents/Documentation~/Unity-Environment-Registry.md
new file mode 100644
index 0000000000..9279f7b329
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Unity-Environment-Registry.md
@@ -0,0 +1,60 @@
+# Unity Environment Registry [Experimental]
+
+The Unity Environment Registry is a database of pre-built Unity environments that can be easily used without having to install the Unity Editor. It is a great way to get started with our [UnityEnvironment API](Python-LLAPI.md).
+
+## Loading an Environment from the Registry
+
+To get started, you can access the default registry we provide with our [Example Environments](Learning-Environment-Examples.md). The Unity Environment Registry implements a _Mapping_; therefore, you can access an entry by its identifier using square brackets `[ ]`. Use the following code to list all of the environment identifiers present in the default registry:
+
+```python
+from mlagents_envs.registry import default_registry
+
+environment_names = list(default_registry.keys())
+for name in environment_names:
+ print(name)
+```
+
+The `make()` method on a registry value will return a `UnityEnvironment` ready to be used. All arguments passed to the make method will be passed to the constructor of the `UnityEnvironment` as well. Refer to the documentation on the [Python-API](Python-LLAPI.md) for more information about the arguments of the `UnityEnvironment` constructor. For example, the following code will create the environment under the identifier `"my-env"`, reset it, perform a few steps and finally close it:
+
+```python
+from mlagents_envs.registry import default_registry
+
+env = default_registry["my-env"].make()
+env.reset()
+for _ in range(10):
+ env.step()
+env.close()
+```
+
+## Create and share your own registry
+
+In order to share the `UnityEnvironment` you created, you must:
+
+ - [Create a Unity executable](Learning-Environment-Executable.md) of your environment for each platform (Linux, OSX and/or Windows)
+ - Place each executable in a `zip` compressed folder
+ - Upload each zip file online to your preferred hosting platform
+ - Create a `yaml` file that will contain the description and path to your environment
+ - Upload the `yaml` file online. The `yaml` file must have the following format:
+
+```yaml
+environments:
+  - <environment-identifier>:
+      expected_reward: <expected-reward>
+      description: <description-of-the-environment>
+      linux_url: <url-to-the-linux-zip-file>
+      darwin_url: <url-to-the-osx-zip-file>
+      win_url: <url-to-the-windows-zip-file>
+      additional_args:
+       - <optional-command-line-argument>
+       - ...
+```
+
+Your users can now use your environment with the following code:
+```python
+from mlagents_envs.registry import UnityEnvRegistry
+
+registry = UnityEnvRegistry()
+registry.register_from_yaml("url-or-path-to-your-yaml-file")
+```
+__Note__: The `"url-or-path-to-your-yaml-file"` can be either a URL or a local path.
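+
+After registering, entries are used exactly like the default registry above; a short sketch assuming your YAML defines an environment identified as `my-env`:
+
+```python
+env = registry["my-env"].make()
+env.reset()
+env.close()
+```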
+
diff --git a/com.unity.ml-agents/Documentation~/Using-Docker.md b/com.unity.ml-agents/Documentation~/Using-Docker.md
new file mode 100644
index 0000000000..0fc6329a35
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Using-Docker.md
@@ -0,0 +1,118 @@
+# Using Docker For ML-Agents (Deprecated)
+
+:warning: **Note:** We no longer use this guide ourselves and so it may not work correctly. We've decided to keep it up just in case it is helpful to you.
+
+We currently offer a solution for Windows and Mac users who would like to do training or inference using Docker. This option may be appealing to those who would like to avoid installing Python and TensorFlow themselves. The current setup forces both TensorFlow and Unity to _only_ rely on the CPU for computations. Consequently, our Docker simulation does not use a GPU and uses [`Xvfb`](https://en.wikipedia.org/wiki/Xvfb) to do visual rendering. `Xvfb` is a utility that enables `ML-Agents` (or any other application) to do rendering virtually i.e. it does not assume that the machine running `ML-Agents` has a GPU or a display attached to it. This means that rich environments which involve agents using camera-based visual observations might be slower.
+
+## Requirements
+
+- [Docker](https://www.docker.com)
+- Unity _Linux Build Support_ Component. Make sure to select the _Linux Build Support_ component when installing Unity.
+
+
+
+## Setup
+
+- [Download](https://unity3d.com/get-unity/download) the Unity Installer and add the _Linux Build Support_ Component
+
+- [Download](https://www.docker.com/community-edition#/download) and install Docker if you don't have it setup on your machine.
+
+- Since Docker runs a container in an environment that is isolated from the host machine, a mounted directory in your host machine is used to share data, e.g. the trainer configuration file, Unity executable and TensorFlow graph. For convenience, we created an empty `unity-volume` directory at the root of the repository for this purpose, but feel free to use any other directory. The remainder of this guide assumes that the `unity-volume` directory is the one used.
+
+## Usage
+
+Using Docker for ML-Agents involves three steps: building the Unity environment with specific flags, building a Docker container and, finally, running the container. If you are not familiar with building a Unity environment for ML-Agents, please read through our [Getting Started with the 3D Balance Ball Example](Sample.md) guide first.
+
+### Build the Environment (Optional)
+
+_If you want to use the Editor to perform training, you can skip this step._
+
+Since Docker typically runs a container sharing a (linux) kernel with the host machine, the Unity environment **has** to be built for the **linux platform**. When building a Unity environment, please select the following options from the Build Settings window:
+
+- Set the _Target Platform_ to `Linux`
+- Set the _Architecture_ to `x86_64`
+
+Then click `Build`, pick an environment name (e.g. `3DBall`) and set the output directory to `unity-volume`. After building, ensure that the file `<environment-name>.x86_64` and the subdirectory `<environment-name>_Data/` are created under `unity-volume`.
+
+
+
+### Build the Docker Container
+
+First, make sure the Docker engine is running on your machine. Then build the Docker container by calling the following command at the top-level of the repository:
+
+```sh
+docker build -t <image-name> .
+```
+
+Replace `<image-name>` with a name for the Docker image, e.g. `balance.ball.v0.1`.
+
+### Run the Docker Container
+
+Run the Docker container by calling the following command at the top-level of the repository:
+
+```sh
+docker run -it --name <container-name> \
+ --mount type=bind,source="$(pwd)"/unity-volume,target=/unity-volume \
+ -p 5005:5005 \
+ -p 6006:6006 \
+ <image-name>:latest <environment-name> \
+ <trainer-config-file> \
+ --env=<environment-name> \
+ --train \
+ --run-id=<run-id>
+```
+
+Notes on argument values:
+
+- `<container-name>` is used to identify the container (in case you want to interrupt and terminate it). This is optional and Docker will generate a random name if this is not set. _Note that this must be unique for every run of a Docker image._
+- `<image-name>` references the image name used when building the container.
+- `<environment-name>` **(Optional)**: If you are training with a Linux executable, this is the name of the executable. If you are training in the Editor, do not pass an `<environment-name>` argument and press the **Play** button in Unity when the message _"Start training by pressing the Play button in the Unity Editor"_ is displayed on the screen.
+- `source`: Reference to the path in your host OS where you will store the Unity executable.
+- `target`: Tells Docker to mount the `source` path as a disk with this name.
+- `trainer-config-file`, `train`, `run-id`: ML-Agents arguments passed to `mlagents-learn`. `trainer-config-file` is the filename of the trainer config file, `train` trains the algorithm, and `run-id` is used to tag each experiment with a unique identifier. We recommend placing the trainer-config file inside `unity-volume` so that the container has access to the file.
+
+To train with a `3DBall` environment executable, the command would be:
+
+```sh
+docker run -it --name 3DBallContainer.first.trial \
+ --mount type=bind,source="$(pwd)"/unity-volume,target=/unity-volume \
+ -p 5005:5005 \
+ -p 6006:6006 \
+ balance.ball.v0.1:latest 3DBall \
+ /unity-volume/trainer_config.yaml \
+ --env=/unity-volume/3DBall \
+ --train \
+ --run-id=3dball_first_trial
+```
+
+For more detail on Docker mounts, check out [these](https://docs.docker.com/storage/bind-mounts/) docs from Docker.
+
+**NOTE** If you are training using docker for environments that use visual observations, you may need to increase the default memory that Docker allocates for the container. For example, see [here](https://docs.docker.com/docker-for-mac/#advanced) for instructions for Docker for Mac.
+
+### Running Tensorboard
+
+You can run Tensorboard to monitor your training instance on http://localhost:6006:
+
+```sh
+docker exec -it <container-name> tensorboard --logdir /unity-volume/results --host 0.0.0.0
+```
+
+With our previous 3DBall example, this command would look like this:
+
+```sh
+docker exec -it 3DBallContainer.first.trial tensorboard --logdir /unity-volume/results --host 0.0.0.0
+```
+
+For more details on Tensorboard, check out the documentation about [Using Tensorboard](Using-Tensorboard.md).
+
+### Stopping Container and Saving State
+
+If you are satisfied with the training progress, you can stop the Docker container while saving state by either using `Ctrl+C` or `⌘+C` (Mac) or by using the following command:
+
+```sh
+docker kill --signal=SIGINT <container-name>
+```
+
+`<container-name>` is the name of the container specified in the earlier
+`docker run` command. If you didn't specify one, you can find the randomly
+generated identifier by running `docker container ls`.
diff --git a/com.unity.ml-agents/Documentation~/Using-Tensorboard.md b/com.unity.ml-agents/Documentation~/Using-Tensorboard.md
new file mode 100644
index 0000000000..c50f789042
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Using-Tensorboard.md
@@ -0,0 +1,93 @@
+# Using TensorBoard to Observe Training
+
+The ML-Agents Toolkit saves statistics during learning sessions that you can view with a TensorFlow utility named [TensorBoard](https://www.tensorflow.org/tensorboard).
+
+The `mlagents-learn` command saves training statistics to a folder named `results`, organized by the `run-id` value you assign to a training session.
+
+In order to observe the training process, either during training or afterward, start TensorBoard:
+
+1. Open a terminal or console window.
+2. Navigate to the directory where the ML-Agents Toolkit is installed.
+3. From the command line run: `tensorboard --logdir results --port 6006`
+4. Open a browser window and navigate to
+[localhost:6006](http://localhost:6006).
+
+**Note:** The default port TensorBoard uses is 6006. If there is an existing session running on port 6006, a new session can be launched on an open port using the `--port` option.
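+
+For example, if port 6006 is already in use, you can point a second TensorBoard instance at the same `results` folder on another free port (6007 here is an arbitrary choice):
+
+```sh
+tensorboard --logdir results --port 6007
+```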
+
+**Note:** If you don't assign a `run-id` identifier, `mlagents-learn` uses the default string, "ppo". You can delete the folders under the `results` directory to clear out old statistics.
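+
+For example (the config file path below is only illustrative), passing an explicit `--run-id` keeps each experiment's statistics in its own subfolder under `results`:
+
+```sh
+mlagents-learn config/ppo/3DBall.yaml --run-id=3DBall_first_run
+```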
+
+On the left side of the TensorBoard window, you can select which of the training runs you want to display. You can select multiple run-ids to compare statistics. The TensorBoard window also provides options for how to display and smooth graphs.
+
+## The ML-Agents Toolkit training statistics
+
+The ML-Agents training program saves the following statistics:
+
+
+
+### Environment Statistics
+
+- `Environment/Lesson` - Plots the progress from lesson to lesson. Only interesting when performing curriculum training.
+
+- `Environment/Cumulative Reward` - The mean cumulative episode reward over all agents. Should increase during a successful training session.
+
+- `Environment/Episode Length` - The mean length of each episode in the environment for all agents.
+
+### Is Training
+
+- `Is Training` - A boolean indicating if the agent is updating its model.
+
+### Policy Statistics
+
+- `Policy/Entropy` (PPO; SAC) - How random the decisions of the model are. It should slowly decrease during a successful training process. If it decreases too quickly, the `beta` hyperparameter should be increased.
+
+- `Policy/Learning Rate` (PPO; SAC) - How large a step the training algorithm takes as it searches for the optimal policy. Should decrease over time.
+
+- `Policy/Entropy Coefficient` (SAC) - Determines the relative importance of the entropy term. This value is adjusted automatically so that the agent retains some amount of randomness during training.
+
+- `Policy/Extrinsic Reward` (PPO; SAC) - This corresponds to the mean cumulative reward received from the environment per-episode.
+
+- `Policy/Value Estimate` (PPO; SAC) - The mean value estimate for all states visited by the agent. Should increase during a successful training session.
+
+- `Policy/Curiosity Reward` (PPO/SAC+Curiosity) - This corresponds to the mean cumulative intrinsic reward generated per-episode.
+
+- `Policy/Curiosity Value Estimate` (PPO/SAC+Curiosity) - The agent's value estimate for the curiosity reward.
+
+- `Policy/GAIL Reward` (PPO/SAC+GAIL) - This corresponds to the mean cumulative discriminator-based reward generated per-episode.
+
+- `Policy/GAIL Value Estimate` (PPO/SAC+GAIL) - The agent's value estimate for the GAIL reward.
+
+- `Policy/GAIL Policy Estimate` (PPO/SAC+GAIL) - The discriminator's estimate for states and actions generated by the policy.
+
+- `Policy/GAIL Expert Estimate` (PPO/SAC+GAIL) - The discriminator's estimate for states and actions drawn from expert demonstrations.
+
+### Learning Loss Functions
+
+- `Losses/Policy Loss` (PPO; SAC) - The mean magnitude of the policy loss function. Correlates to how much the policy (the process for deciding actions) is changing. The magnitude of this should decrease during a successful training session.
+
+- `Losses/Value Loss` (PPO; SAC) - The mean loss of the value function update. Correlates to how well the model is able to predict the value of each state. This should increase while the agent is learning, and then decrease once the reward stabilizes.
+
+- `Losses/Forward Loss` (PPO/SAC+Curiosity) - The mean magnitude of the forward model loss function. Corresponds to how well the model is able to predict the new observation encoding.
+
+- `Losses/Inverse Loss` (PPO/SAC+Curiosity) - The mean magnitude of the inverse model loss function. Corresponds to how well the model is able to predict the action taken between two observations.
+
+- `Losses/Pretraining Loss` (BC) - The mean magnitude of the behavioral cloning loss. Corresponds to how well the model imitates the demonstration data.
+
+- `Losses/GAIL Loss` (GAIL) - The mean magnitude of the GAIL discriminator loss. Corresponds to how well the model imitates the demonstration data.
+
+### Self-Play
+
+- `Self-Play/ELO` (Self-Play) - [ELO](https://en.wikipedia.org/wiki/Elo_rating_system) measures the relative skill level between two players. In a proper training run, the ELO of the agent should steadily increase.
+
+## Exporting Data from TensorBoard
+To export time series data in CSV or JSON format, check the "Show data download links" box in the upper left. This will enable download links below each chart.
+
+
+
+## Custom Metrics from Unity
+
+To get custom metrics from a C# environment into TensorBoard, you can use the `StatsRecorder`:
+
+```csharp
+// Record a custom value; it appears in TensorBoard under the stat name given here.
+var statsRecorder = Academy.Instance.StatsRecorder;
+statsRecorder.Add("MyMetric", 1.0f);
+```
diff --git a/com.unity.ml-agents/Documentation~/Using-Virtual-Environment.md b/com.unity.ml-agents/Documentation~/Using-Virtual-Environment.md
new file mode 100644
index 0000000000..cd7cfb28b9
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Using-Virtual-Environment.md
@@ -0,0 +1,54 @@
+# Using Virtual Environment
+
+## What is a Virtual Environment?
+
+A Virtual Environment is a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages. To learn more about Virtual Environments see [here](https://docs.python.org/3/library/venv.html).
+
+## Why should I use a Virtual Environment?
+
+A Virtual Environment keeps all dependencies for the Python project separate from dependencies of other projects. This has a few advantages:
+
+1. It makes dependency management for the project easy.
+2. It enables using and testing different library versions by quickly spinning up a new environment and verifying the compatibility of the code with each version.
+
+## Python Version Requirement (Required)
+This guide has been tested with Python 3.10.12. Newer versions might not have support for the dependent libraries, so they are not recommended.
+
+## Use Conda (or Mamba)
+
+While there are many options for setting up virtual environments for Python, by far the most common and simplest approach is to use Anaconda (also known as Conda). You can read the documentation on how to get started with Conda [here](https://learning.anaconda.cloud/get-started-with-anaconda).
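+
+As a minimal sketch (assuming Conda is already installed; the environment name `mlagents` is arbitrary), creating and activating an environment with the recommended Python version looks like this:
+
+```sh
+conda create -n mlagents python=3.10.12
+conda activate mlagents
+```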
+
+## Installing Pip (Required)
+
+1. Download the `get-pip.py` file using the command `curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py`
+2. Run the downloaded script with `python3 get-pip.py`
+3. Check pip version using `pip3 -V`
+
+Note (for Ubuntu users): If the `ModuleNotFoundError: No module named 'distutils.util'` error is encountered, then python3-distutils needs to be installed. Install python3-distutils using `sudo apt-get install python3-distutils`
+
+## Mac OS X Setup
+
+1. Create a folder where the virtual environments will reside `$ mkdir ~/python-envs`
+2. To create a new environment named `sample-env` execute `$ python3 -m venv ~/python-envs/sample-env`
+3. To activate the environment execute `$ source ~/python-envs/sample-env/bin/activate`
+4. Upgrade to the latest pip version using `$ pip3 install --upgrade pip`
+5. Upgrade to the latest setuptools version using `$ pip3 install --upgrade setuptools`
+6. To deactivate the environment execute `$ deactivate` (you can reactivate the environment using the same `activate` command listed above)
+
+## Ubuntu Setup
+
+1. Install the python3-venv package using `$ sudo apt-get install python3-venv`
+2. Follow the steps in the Mac OS X installation.
+
+## Windows Setup
+
+1. Create a folder where the virtual environments will reside `md python-envs`
+2. To create a new environment named `sample-env` execute `python -m venv python-envs\sample-env`
+3. To activate the environment execute `python-envs\sample-env\Scripts\activate`
+4. Upgrade to the latest pip version using `pip install --upgrade pip`
+5. To deactivate the environment execute `deactivate` (you can reactivate the environment using the same `activate` command listed above)
+
+Note:
+- Verify that you are using Python version 3.10.12. Launch a command prompt using `cmd` and execute `python --version` to verify the version.
+- Python3 installation may require admin privileges on Windows.
+- This guide is for Windows 10 using a 64-bit architecture only.
diff --git a/com.unity.ml-agents/Documentation~/Versioning.md b/com.unity.ml-agents/Documentation~/Versioning.md
new file mode 100644
index 0000000000..2f48a32319
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/Versioning.md
@@ -0,0 +1,49 @@
+# ML-Agents Versioning
+
+## Context
+As the ML-Agents project evolves into a more mature product, we want to clearly communicate the process we use to version our packages and the data that flows into, through, and out of them. Our project now has four packages (1 Unity, 3 Python) along with artifacts that are produced as well as consumed. This document covers the versioning for these packages and artifacts.
+
+## GitHub Releases
+Up until now, all packages were in lockstep in terms of versioning. As a result, the GitHub releases were tagged with the version of all those packages (e.g. v0.15.0, v0.15.1) and labeled accordingly. With the decoupling of package versions, we now need to revisit our GitHub release tagging. The proposal is that we move towards an integer release numbering for our repo and each such release will call out specific version upgrades of each package. For instance, with [the April 30th release](https://github.com/Unity-Technologies/ml-agents/releases/tag/release_1), we will have:
+- GitHub Release 1 (branch name: *release_1_branch*)
+ - com.unity.ml-agents release 1.0.0
+ - ml-agents release 0.16.0
+ - ml-agents-envs release 0.16.0
+ - gym-unity release 0.16.0
+
+Our release cadence will not be affected by these versioning changes. We will keep having monthly releases to fix bugs and release new features.
+
+## Packages
+All of the software packages and their generated artifacts will be versioned. Automation tools will not be versioned.
+
+### Unity package
+Package name: com.unity.ml-agents
+- Versioned following [Semantic Versioning Guidelines](https://www.semver.org)
+- This package consumes an artifact of the training process: the `.nn` file. These files are integer versioned and currently at version 2. The com.unity.ml-agents package will need to support the version of `.nn` files which existed at its 1.0.0 release. For example, consider that com.unity.ml-agents is at version 1.0.0 and the NN files are at version 2. If the NN files change to version 3, the next release of com.unity.ml-agents at version 1.1.0 guarantees it will be able to read both of these formats. If the NN files were to change to version 4 and com.unity.ml-agents to version 2.0.0, support for NN versions 2 and 3 could be dropped for com.unity.ml-agents version 2.0.0.
+- This package produces one artifact, the `.demo` files. These files will have integer versioning. This means their version will increment by 1 at each change. The com.unity.ml-agents package must be backward compatible with version changes that occur between minor versions.
+- To summarize, the artifacts produced and consumed by com.unity.ml-agents are guaranteed to be supported for 1.x.x versions of com.unity.ml-agents. We intend to provide stability for our users by moving to a 1.0.0 release of com.unity.ml-agents.
+
+
+### Python Packages
+Package names: ml-agents / ml-agents-envs / gym-unity
+- The python packages remain in "Beta." This means that breaking changes to the public API of the python packages can be made without a major version bump. Historically, the python and C# packages were in version lockstep. This is no longer the case. The python packages will remain in lockstep with each other for now, while the C# package will follow its own versioning as is appropriate. However, the python package versions may diverge in the future.
+- While the python packages will remain in Beta for now, we acknowledge that the most heavily used portion of our python interface is the `mlagents-learn` CLI and strive to make this part of our API backward compatible. We are actively working on this and expect to have a stable CLI in the next few weeks.
+
+## Communicator
+
+Packages which communicate: com.unity.ml-agents / ml-agents-envs
+
+Another entity of the ML-Agents Toolkit that requires versioning is the communication layer between C# and Python, which will also follow semantic versioning. This guarantees a level of backward compatibility between different versions of C# and Python packages which communicate. Any Communicator version 1.x.x of the Unity package should be compatible with any 1.x.x Communicator version in Python.
+
+An RLCapabilities struct keeps track of which features exist. One such struct is passed from C# to Python, and another from Python to C#. With this feature-level granularity, we can notify users more specifically about feature limitations based on what's available in both C# and Python. These notifications will be logged to the python terminal, or to the Unity Editor Console.
+
+
+## Side Channels
+
+The communicator is what manages data transfer between Unity and Python for the core training loop. Side Channels are another means of data transfer between Unity and Python. Side Channels are not versioned, but they have been designed to remain backward compatible. As of today, we provide 4 side channels:
+- FloatProperties: shared float data between Unity - Python (bidirectional)
+- RawBytes: raw data that can be sent Unity - Python (bidirectional)
+- EngineConfig: a set of numeric fields in a pre-defined order sent from Python to Unity
+- Stats: (name, value, agg) messages sent from Unity to Python
+
+Aside from the specific implementations of side channels we provide (and use ourselves), the Side Channel interface is made available for users to create their own custom side channels. As such, we guarantee that the built-in SideChannel interface between Unity and Python is backward compatible in packages that share the same major version.
diff --git a/com.unity.ml-agents/Documentation~/com.unity.ml-agents.md b/com.unity.ml-agents/Documentation~/com.unity.ml-agents.md
deleted file mode 100644
index 959f5edb75..0000000000
--- a/com.unity.ml-agents/Documentation~/com.unity.ml-agents.md
+++ /dev/null
@@ -1,197 +0,0 @@
-# ML-Agents Overview
-ML-agents enable games and simulations to serve as environments for training intelligent agents in Unity. Training can be done with reinforcement learning, imitation learning, neuroevolution, or any other methods. Trained agents can be used for many use cases, including controlling NPC behavior (in a variety of settings such as multi-agent and adversarial), automated testing of game builds and evaluating different game design decisions pre-release.
-
-The _ML-Agents_ package has a C# SDK for the [Unity ML-Agents Toolkit], which can be used outside of Unity. The scope of these docs is just to get started in the context of Unity, but further details and samples are located on the [github docs].
-
-## Capabilities
-The package allows you to convert any Unity scene into a learning environment and train character behaviors using a variety of machine-learning algorithms. Additionally, it allows you to embed these trained behaviors back into Unity scenes to control your characters. More specifically, the package provides the following core functionalities:
-
-* Define Agents: entities, or characters, whose behavior will be learned. Agents are entities that generate observations (through sensors), take actions, and receive rewards from the environment.
-* Define Behaviors: entities that specify how an agent should act. Multiple agents can share the same Behavior and a scene may have multiple Behaviors.
-* Record demonstrations: To show the behaviors of an agent within the Editor. You can use demonstrations to help train a behavior for that agent.
-* Embed a trained behavior (aka: run your ML model) in the scene via the [Unity Inference Engine]. Embedded behaviors allow you to switch an Agent between learning and inference.
-
-## Special Notes
-Note that the ML-Agents package does not contain the machine learning algorithms for training behaviors. The ML-Agents package only supports instrumenting a Unity scene, setting it up for training, and then embedding the trained model back into your Unity scene. The machine learning algorithms that orchestrate training are part of the companion [python package].
-
-## Package contents
-
-The following table describes the package folder structure:
-
-| **Location** | **Description** |
-| ---------------------- | ----------------------------------------------------------------------- |
-| _Documentation~_ | Contains the documentation for the Unity package. |
-| _Editor_ | Contains utilities for Editor windows and drawers. |
-| _Plugins_ | Contains third-party DLLs. |
-| _Runtime_ | Contains core C# APIs for integrating ML-Agents into your Unity scene. |
-| _Runtime/Integrations_ | Contains utilities for integrating ML-Agents into specific game genres. |
-| _Tests_ | Contains the unit tests for the package. |
-
-
-
-## Installation
-To add the ML-Agents package to a Unity project:
-
-* Create a new Unity project with Unity 6000.0 (or later) or open an existing one.
-* To open the Package Manager, navigate to Window > Package Manager.
-* Click + and select Add package by name...
-* Enter com.unity.ml-agents
-*Click Add to add the package to your project.
-
-To install the companion Python package to enable training behaviors, follow the [installation instructions] on our [GitHub repository].
-
-## Advanced Features
-
-### Custom Grid Sensors
-
-Grid Sensor provides a 2D observation that detects objects around an agent from a top-down view. Compared to RayCasts, it receives a full observation in a grid area without gaps, and the detection is not blocked by objects around the agents. This gives a more granular view while requiring a higher usage of compute resources.
-
-One extra feature with Grid Sensors is that you can derive from the Grid Sensor base class to collect custom data besides the object tags, to include custom attributes as observations. This allows more flexibility for the use of GridSensor.
-
-#### Creating Custom Grid Sensors
-To create a custom grid sensor, you'll need to derive from two classes: `GridSensorBase` and `GridSensorComponent`.
-
-##### Deriving from `GridSensorBase`
-This is the implementation of your sensor. This defines how your sensor process detected colliders,
-what the data looks like, and how the observations are constructed from the detected objects.
-Consider overriding the following methods depending on your use case:
-* `protected virtual int GetCellObservationSize()`: Return the observation size per cell. Default to `1`.
-* `protected virtual void GetObjectData(GameObject detectedObject, int tagIndex, float[] dataBuffer)`: Constructs observations from the detected object. The input provides the detected GameObject and the index of its tag (0-indexed). The observations should be written to the given `dataBuffer` and the buffer size is defined in `GetCellObservationSize()`. This data will be gathered from each cell and sent to the trainer as observation.
-* `protected virtual bool IsDataNormalized()`: Return whether the observation is normalized to 0~1. This affects whether you're able to use compressed observations as compressed data only supports normalized data. Return `true` if all the values written in `GetObjectData` are within the range of (0, 1), otherwise return `false`. Default to `false`.
-
- There might be cases when your data is not in the range of (0, 1) but you still wish to use compressed data to speed up training. If your data is naturally bounded within a range, normalize your data first to the possible range and fill the buffer with normalized data. For example, since the angle of rotation is bounded within `0 ~ 360`, record an angle `x` as `x/360` instead of `x`. If your data value is not bounded (position, velocity, etc.), consider setting a reasonable min/max value and use that to normalize your data.
-* `protected internal virtual ProcessCollidersMethod GetProcessCollidersMethod()`: Return the method to process colliders detected in a cell. This defines the sensor behavior when multiple objects with detectable tags are detected within a cell.
-Currently two methods are provided:
- * `ProcessCollidersMethod.ProcessClosestColliders` (Default): Process the closest collider to the agent. In this case each cell's data is represented by one object.
- * `ProcessCollidersMethod.ProcessAllColliders`: Process all detected colliders. This is useful when the data from each cell is additive, for instance, the count of detected objects in a cell. When using this option, the input `dataBuffer` in `GetObjectData()` will contain processed data from other colliders detected in the cell. You'll more likely want to add/subtract values from the buffer instead of overwrite it completely.
-
-##### Deriving from `GridSensorComponent`
-To create your sensor, you need to override the sensor component and add your sensor to the creation.
-Specifically, you need to override `GetGridSensors()` and return an array of grid sensors you want to use in the component.
-It can be used to create multiple different customized grid sensors, or you can also include the ones provided in our package (listed in the next section).
-
-Example:
-```csharp
-public class CustomGridSensorComponent : GridSensorComponent
-{
- protected override GridSensorBase[] GetGridSensors()
- {
- return new GridSensorBase[] { new CustomGridSensor(...)};
- }
-}
-```
-
-#### Grid Sensor Types
-Here we list out two types of grid sensor provided in the package: `OneHotGridSensor` and `CountingGridSensor`.
-Their implementations are also a good reference for making you own ones.
-
-##### OneHotGridSensor
-This is the default sensor used by `GridSensorComponent`. It detects objects with detectable tags and the observation is the one-hot representation of the detected tag index.
-
-The implementation of the sensor is defined as following:
-* `GetCellObservationSize()`: `detectableTags.Length`
-* `IsDataNormalized()`: `true`
-* `ProcessCollidersMethod()`: `ProcessCollidersMethod.ProcessClosestColliders`
-* `GetObjectData()`:
-
-```csharp
-protected override void GetObjectData(GameObject detectedObject, int tagIndex, float[] dataBuffer)
-{
- dataBuffer[tagIndex] = 1;
-}
-```
-
-##### CountingGridSensor
-This is an example of using all colliders detected in a cell. It counts the number of objects detected for each detectable tag. The sensor cannot be used with data compression.
-
-The implementation of the sensor is defined as following:
-* `GetCellObservationSize()`: `detectableTags.Length`
-* `IsDataNormalized()`: `false`
-* `ProcessCollidersMethod()`: `ProcessCollidersMethod.ProcessAllColliders`
-* `GetObjectData()`:
-
-```csharp
-protected override void GetObjectData(GameObject detectedObject, int tagIndex, float[] dataBuffer)
-{
- dataBuffer[tagIndex] += 1;
-}
-```
-
-### Input System Integration
-
-The ML-Agents package integrates with the [Input System Package](https://docs.unity3d.com/Packages/com.unity.inputsystem@1.1/manual/QuickStartGuide.html) through the `InputActuatorComponent`. This component sets up an action space for your `Agent` based on an `InputActionAsset` that is referenced by the `IInputActionAssetProvider` interface, or the `PlayerInput` component that may be living on your player controlled `Agent`. This means that if you have code outside of your agent that handles input, you will not need to implement the Heuristic function in agent as well. The `InputActuatorComponent` will handle this for you. You can now train and run inference on `Agents` with an action space defined by an `InputActionAsset`.
-
-Take a look at how we have implemented the C# code in the example Input Integration scene (located under Project/Assets/ML-Agents/Examples/PushBlockWithInput/). Once you have some familiarity, then the next step would be to add the InputActuatorComponent to your player Agent. The example we have implemented uses C# Events to send information from the Input System.
-
-#### Getting Started with Input System Integration
-1. Add the `com.unity.inputsystem` version 1.1.0-preview.3 or later to your project via the Package Manager window.
-2. If you have already setup an InputActionAsset skip to Step 3, otherwise follow these sub steps:
- 1. Create an InputActionAsset to allow your Agent to be controlled by the Input System.
- 2. Handle the events from the Input System where you normally would (i.e. a script external to your Agent class).
-3. Add the InputSystemActuatorComponent to the GameObject that has the `PlayerInput` and `Agent` components attached.
-
-Additionally, see below for additional technical specifications on the C# code for the InputActuatorComponent.
-#### Technical Specifications
-
-##### `IInputActionsAssetProvider` Interface
-The `InputActuatorComponent` searches for a `Component` that implements
-`IInputActionAssetProvider` on the `GameObject` they both are attached to. It is important to note
-that if multiple `Components` on your `GameObject` need to access an `InputActionAsset` to handle events,
-they will need to share the same instance of the `InputActionAsset` that is returned from the
-`IInputActionAssetProvider`.
-
-##### `InputActuatorComponent` Class
-The `InputActuatorComponent` is the bridge between ML-Agents and the Input System. It allows ML-Agents to:
-* create an `ActionSpec` for your Agent based on an `InputActionAsset` that comes from an
-`IInputActionAssetProvider`.
-* send simulated input from a training process or a neural network
-* let developers keep their input handling code in one place
-
-This is accomplished by adding the `InputActuatorComponent` to an Agent which already has the PlayerInput component attached.
-
-## Known Limitations
-
-### Training
-
-Training is limited to the Unity Editor and Standalone builds on Windows, MacOS,
-and Linux with the Mono scripting backend. Currently, training does not work
-with the IL2CPP scripting backend. Your environment will default to inference
-mode if training is not supported or is not currently running.
-
-### Inference
-
-Inference is executed via [Unity Inference Engine](https://docs.unity3d.com/Packages/com.unity.ai.inference@latest) on the end-user device. Therefore, it is subject to the performance limitations of the end-user CPU or GPU. Also, only models created with our trainers are supported for running ML-Agents with a neural network behavior.
-
-### Headless Mode
-
-If you enable Headless mode, you will not be able to collect visual observations from your agents.
-
-### Rendering Speed and Synchronization
-
-Currently the speed of the game physics can only be increased to 100x real-time. The Academy (the sentinel that controls the stepping of the game to make sure everything is synchronized, from collection of observations to applying actions generated from policy inference to the agent) also moves in time with `FixedUpdate()` rather than `Update()`, so game behavior implemented in Update() may be out of sync with the agent decision-making. See [Execution Order of Event Functions] for more information.
-
-You can control the frequency of Academy stepping by calling `Academy.Instance.DisableAutomaticStepping()`, and then calling `Academy.Instance.EnvironmentStep()`.
-
-### Input System Integration
-
- For the `InputActuatorComponent`
- - Limited implementation of `InputControls`
- - No way to customize the action space of the `InputActuatorComponent`
-
-## Additional Resources
-
-* [GitHub repository]
-* [Unity Discussions]
-* [Discord]
-* [Website]
-
-[github docs]: https://unity-technologies.github.io/ml-agents/
-[installation instructions]: https://github.com/Unity-Technologies/ml-agents/blob/release_22_docs/docs/Installation.md
-[Unity Inference Engine]: https://docs.unity3d.com/Packages/com.unity.ai.inference@2.2/manual/index.html
-[python package]: https://github.com/Unity-Technologies/ml-agents
-[GitHub repository]: https://github.com/Unity-Technologies/ml-agents
-[Execution Order of Event Functions]: https://docs.unity3d.com/Manual/ExecutionOrder.html
-[Unity Discussions]: https://discussions.unity.com/tag/ml-agents
-[Discord]: https://discord.com/channels/489222168727519232/1202574086115557446
-[Website]: https://unity-technologies.github.io/ml-agents/
-
diff --git a/com.unity.ml-agents/Documentation~/images/3dball_big.png b/com.unity.ml-agents/Documentation~/images/3dball_big.png
new file mode 100644
index 0000000000..c5a786825b
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/3dball_big.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/3dball_learning_brain.png b/com.unity.ml-agents/Documentation~/images/3dball_learning_brain.png
new file mode 100644
index 0000000000..68757fa6ce
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/3dball_learning_brain.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/3dball_small.png b/com.unity.ml-agents/Documentation~/images/3dball_small.png
new file mode 100644
index 0000000000..2591e0c3f6
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/3dball_small.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/TensorBoard-download.png b/com.unity.ml-agents/Documentation~/images/TensorBoard-download.png
new file mode 100644
index 0000000000..d5f38a17b2
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/TensorBoard-download.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/U_MachineLearningAgents_Logo_Black_RGB.png b/com.unity.ml-agents/Documentation~/images/U_MachineLearningAgents_Logo_Black_RGB.png
new file mode 100644
index 0000000000..88e2173ac4
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/U_MachineLearningAgents_Logo_Black_RGB.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/anaconda_default.PNG b/com.unity.ml-agents/Documentation~/images/anaconda_default.PNG
new file mode 100644
index 0000000000..9d65d81b69
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/anaconda_default.PNG differ
diff --git a/com.unity.ml-agents/Documentation~/images/anaconda_install.PNG b/com.unity.ml-agents/Documentation~/images/anaconda_install.PNG
new file mode 100644
index 0000000000..237c59cae7
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/anaconda_install.PNG differ
diff --git a/com.unity.ml-agents/Documentation~/images/balance.png b/com.unity.ml-agents/Documentation~/images/balance.png
new file mode 100644
index 0000000000..edef5c9342
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/balance.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/banner.png b/com.unity.ml-agents/Documentation~/images/banner.png
new file mode 100644
index 0000000000..9068615db9
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/banner.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/basic.png b/com.unity.ml-agents/Documentation~/images/basic.png
new file mode 100644
index 0000000000..1fec46c779
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/basic.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/conda_new.PNG b/com.unity.ml-agents/Documentation~/images/conda_new.PNG
new file mode 100644
index 0000000000..96d6cc8bf4
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/conda_new.PNG differ
diff --git a/com.unity.ml-agents/Documentation~/images/cooperative_pushblock.png b/com.unity.ml-agents/Documentation~/images/cooperative_pushblock.png
new file mode 100644
index 0000000000..71e9efb51e
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/cooperative_pushblock.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/crawler.png b/com.unity.ml-agents/Documentation~/images/crawler.png
new file mode 100644
index 0000000000..819733113a
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/crawler.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/cuDNN_membership_required.png b/com.unity.ml-agents/Documentation~/images/cuDNN_membership_required.png
new file mode 100644
index 0000000000..6a7ffc6cd2
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/cuDNN_membership_required.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/cuda_toolkit_directory.PNG b/com.unity.ml-agents/Documentation~/images/cuda_toolkit_directory.PNG
new file mode 100644
index 0000000000..304ec7fc57
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/cuda_toolkit_directory.PNG differ
diff --git a/com.unity.ml-agents/Documentation~/images/cudnn_zip_files.PNG b/com.unity.ml-agents/Documentation~/images/cudnn_zip_files.PNG
new file mode 100644
index 0000000000..9170f34f94
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/cudnn_zip_files.PNG differ
diff --git a/com.unity.ml-agents/Documentation~/images/curriculum.png b/com.unity.ml-agents/Documentation~/images/curriculum.png
new file mode 100644
index 0000000000..f490530a96
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/curriculum.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/demo_component.png b/com.unity.ml-agents/Documentation~/images/demo_component.png
new file mode 100644
index 0000000000..1762257611
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/demo_component.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/demo_inspector.png b/com.unity.ml-agents/Documentation~/images/demo_inspector.png
new file mode 100644
index 0000000000..d1d62bba84
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/demo_inspector.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/docker_build_settings.png b/com.unity.ml-agents/Documentation~/images/docker_build_settings.png
new file mode 100644
index 0000000000..c942b6c2fe
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/docker_build_settings.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/dopamine_gridworld_plot.png b/com.unity.ml-agents/Documentation~/images/dopamine_gridworld_plot.png
new file mode 100644
index 0000000000..a7ca9b9cc3
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/dopamine_gridworld_plot.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/dopamine_visualbanana_plot.png b/com.unity.ml-agents/Documentation~/images/dopamine_visualbanana_plot.png
new file mode 100644
index 0000000000..90bd68c976
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/dopamine_visualbanana_plot.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/dungeon_escape.png b/com.unity.ml-agents/Documentation~/images/dungeon_escape.png
new file mode 100644
index 0000000000..e9f3d66201
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/dungeon_escape.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/edit_env_var.png b/com.unity.ml-agents/Documentation~/images/edit_env_var.png
new file mode 100644
index 0000000000..2fd622c431
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/edit_env_var.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/elo_example.png b/com.unity.ml-agents/Documentation~/images/elo_example.png
new file mode 100644
index 0000000000..fdf6bc40e9
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/elo_example.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/elo_expected_score_formula.png b/com.unity.ml-agents/Documentation~/images/elo_expected_score_formula.png
new file mode 100644
index 0000000000..19c2916a9d
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/elo_expected_score_formula.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/elo_score_update_formula.png b/com.unity.ml-agents/Documentation~/images/elo_score_update_formula.png
new file mode 100644
index 0000000000..a2768c5228
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/elo_score_update_formula.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/example-envs.png b/com.unity.ml-agents/Documentation~/images/example-envs.png
new file mode 100644
index 0000000000..1e4c66cca2
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/example-envs.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/foodCollector.png b/com.unity.ml-agents/Documentation~/images/foodCollector.png
new file mode 100644
index 0000000000..2e9ebc67f9
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/foodCollector.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/grid_sensor.png b/com.unity.ml-agents/Documentation~/images/grid_sensor.png
new file mode 100644
index 0000000000..2911874add
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/grid_sensor.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/gridworld.png b/com.unity.ml-agents/Documentation~/images/gridworld.png
new file mode 100644
index 0000000000..671830eb2f
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/gridworld.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/groupmanager_teamid.png b/com.unity.ml-agents/Documentation~/images/groupmanager_teamid.png
new file mode 100644
index 0000000000..ff064b95e3
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/groupmanager_teamid.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/hallway.png b/com.unity.ml-agents/Documentation~/images/hallway.png
new file mode 100644
index 0000000000..fab2cc152d
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/hallway.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/image-banner.png b/com.unity.ml-agents/Documentation~/images/image-banner.png
new file mode 100644
index 0000000000..860b1f737e
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/image-banner.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/learning_environment_basic.png b/com.unity.ml-agents/Documentation~/images/learning_environment_basic.png
new file mode 100644
index 0000000000..d58d4e6142
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/learning_environment_basic.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/learning_environment_example.png b/com.unity.ml-agents/Documentation~/images/learning_environment_example.png
new file mode 100644
index 0000000000..6567cc93ef
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/learning_environment_example.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/learning_environment_full.png b/com.unity.ml-agents/Documentation~/images/learning_environment_full.png
new file mode 100644
index 0000000000..39794932d0
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/learning_environment_full.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/math.png b/com.unity.ml-agents/Documentation~/images/math.png
new file mode 100644
index 0000000000..6cd1b696dd
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/math.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/ml-agents-LSTM.png b/com.unity.ml-agents/Documentation~/images/ml-agents-LSTM.png
new file mode 100644
index 0000000000..29867ec8bc
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/ml-agents-LSTM.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/mlagents-3DBallHierarchy.png b/com.unity.ml-agents/Documentation~/images/mlagents-3DBallHierarchy.png
new file mode 100644
index 0000000000..9603e6f77d
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/mlagents-3DBallHierarchy.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/mlagents-BuildWindow.png b/com.unity.ml-agents/Documentation~/images/mlagents-BuildWindow.png
new file mode 100644
index 0000000000..4eae2512d6
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/mlagents-BuildWindow.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/mlagents-ImitationAndRL.png b/com.unity.ml-agents/Documentation~/images/mlagents-ImitationAndRL.png
new file mode 100644
index 0000000000..614f7473b9
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/mlagents-ImitationAndRL.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/mlagents-NewTutSplash.png b/com.unity.ml-agents/Documentation~/images/mlagents-NewTutSplash.png
new file mode 100644
index 0000000000..0e6efc2181
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/mlagents-NewTutSplash.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/mlagents-Open3DBall.png b/com.unity.ml-agents/Documentation~/images/mlagents-Open3DBall.png
new file mode 100644
index 0000000000..c9e6abaf91
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/mlagents-Open3DBall.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/mlagents-RollerAgentStats.png b/com.unity.ml-agents/Documentation~/images/mlagents-RollerAgentStats.png
new file mode 100644
index 0000000000..f1cde7cda2
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/mlagents-RollerAgentStats.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/mlagents-TensorBoard.png b/com.unity.ml-agents/Documentation~/images/mlagents-TensorBoard.png
new file mode 100644
index 0000000000..a4e3fde36f
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/mlagents-TensorBoard.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/multiple-settings.png b/com.unity.ml-agents/Documentation~/images/multiple-settings.png
new file mode 100644
index 0000000000..8865503b70
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/multiple-settings.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/new_system_variable.PNG b/com.unity.ml-agents/Documentation~/images/new_system_variable.PNG
new file mode 100644
index 0000000000..b27365977a
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/new_system_variable.PNG differ
diff --git a/com.unity.ml-agents/Documentation~/images/package-settings.png b/com.unity.ml-agents/Documentation~/images/package-settings.png
new file mode 100644
index 0000000000..82a2ff9afe
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/package-settings.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/path_variables.PNG b/com.unity.ml-agents/Documentation~/images/path_variables.PNG
new file mode 100644
index 0000000000..35745c56a5
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/path_variables.PNG differ
diff --git a/com.unity.ml-agents/Documentation~/images/platform_prefab.png b/com.unity.ml-agents/Documentation~/images/platform_prefab.png
new file mode 100644
index 0000000000..ccf0ba7ca4
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/platform_prefab.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/push.png b/com.unity.ml-agents/Documentation~/images/push.png
new file mode 100644
index 0000000000..fd94b1cfac
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/push.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/pyramids.png b/com.unity.ml-agents/Documentation~/images/pyramids.png
new file mode 100644
index 0000000000..df67ac2e67
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/pyramids.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/ray_perception.png b/com.unity.ml-agents/Documentation~/images/ray_perception.png
new file mode 100644
index 0000000000..6eef39dcd6
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/ray_perception.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/rl_cycle.png b/com.unity.ml-agents/Documentation~/images/rl_cycle.png
new file mode 100644
index 0000000000..2283360dd7
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/rl_cycle.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/roller-ball-agent.png b/com.unity.ml-agents/Documentation~/images/roller-ball-agent.png
new file mode 100644
index 0000000000..a1d84f75ef
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/roller-ball-agent.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/roller-ball-floor.png b/com.unity.ml-agents/Documentation~/images/roller-ball-floor.png
new file mode 100644
index 0000000000..ce27b19bec
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/roller-ball-floor.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/roller-ball-hierarchy.png b/com.unity.ml-agents/Documentation~/images/roller-ball-hierarchy.png
new file mode 100644
index 0000000000..7c87cd1a27
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/roller-ball-hierarchy.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/roller-ball-projects.png b/com.unity.ml-agents/Documentation~/images/roller-ball-projects.png
new file mode 100644
index 0000000000..71e063c52e
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/roller-ball-projects.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/roller-ball-target.png b/com.unity.ml-agents/Documentation~/images/roller-ball-target.png
new file mode 100644
index 0000000000..29c990af5a
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/roller-ball-target.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/soccer.png b/com.unity.ml-agents/Documentation~/images/soccer.png
new file mode 100644
index 0000000000..998ae9236a
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/soccer.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/sorter.png b/com.unity.ml-agents/Documentation~/images/sorter.png
new file mode 100644
index 0000000000..b947fe1a73
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/sorter.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/strikersvsgoalie.png b/com.unity.ml-agents/Documentation~/images/strikersvsgoalie.png
new file mode 100644
index 0000000000..1945dddd12
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/strikersvsgoalie.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/system_variable_name_value.PNG b/com.unity.ml-agents/Documentation~/images/system_variable_name_value.PNG
new file mode 100644
index 0000000000..ae3a47d623
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/system_variable_name_value.PNG differ
diff --git a/com.unity.ml-agents/Documentation~/images/team_id.png b/com.unity.ml-agents/Documentation~/images/team_id.png
new file mode 100644
index 0000000000..2cff1b07a9
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/team_id.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/unity-logo-black.png b/com.unity.ml-agents/Documentation~/images/unity-logo-black.png
new file mode 100644
index 0000000000..66ab2e3d98
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/unity-logo-black.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/unity-logo.png b/com.unity.ml-agents/Documentation~/images/unity-logo.png
new file mode 100644
index 0000000000..15d4cb9ebe
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/unity-logo.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/unity-wide.png b/com.unity.ml-agents/Documentation~/images/unity-wide.png
new file mode 100644
index 0000000000..1668b46745
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/unity-wide.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/unity_linux_build_support.png b/com.unity.ml-agents/Documentation~/images/unity_linux_build_support.png
new file mode 100644
index 0000000000..c253efcae6
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/unity_linux_build_support.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/unity_package_json.png b/com.unity.ml-agents/Documentation~/images/unity_package_json.png
new file mode 100644
index 0000000000..9200a51bbd
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/unity_package_json.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/unity_package_manager_git_url.png b/com.unity.ml-agents/Documentation~/images/unity_package_manager_git_url.png
new file mode 100644
index 0000000000..ccd0a58dcd
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/unity_package_manager_git_url.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/unity_package_manager_window.png b/com.unity.ml-agents/Documentation~/images/unity_package_manager_window.png
new file mode 100644
index 0000000000..df0b4daf8f
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/unity_package_manager_window.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/variable-length-observation-illustrated.png b/com.unity.ml-agents/Documentation~/images/variable-length-observation-illustrated.png
new file mode 100644
index 0000000000..4a7a40f864
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/variable-length-observation-illustrated.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/visual-observation-rawimage.png b/com.unity.ml-agents/Documentation~/images/visual-observation-rawimage.png
new file mode 100644
index 0000000000..5e34c07367
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/visual-observation-rawimage.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/visual-observation-rendertexture.png b/com.unity.ml-agents/Documentation~/images/visual-observation-rendertexture.png
new file mode 100644
index 0000000000..9fb40290b3
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/visual-observation-rendertexture.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/visual-observation.png b/com.unity.ml-agents/Documentation~/images/visual-observation.png
new file mode 100644
index 0000000000..5533ee92c9
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/visual-observation.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/walker.png b/com.unity.ml-agents/Documentation~/images/walker.png
new file mode 100644
index 0000000000..99f7cb2e08
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/walker.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/wall.png b/com.unity.ml-agents/Documentation~/images/wall.png
new file mode 100644
index 0000000000..b6b9212ee5
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/wall.png differ
diff --git a/com.unity.ml-agents/Documentation~/images/worm.png b/com.unity.ml-agents/Documentation~/images/worm.png
new file mode 100644
index 0000000000..6a36baff6e
Binary files /dev/null and b/com.unity.ml-agents/Documentation~/images/worm.png differ
diff --git a/com.unity.ml-agents/Documentation~/index.md b/com.unity.ml-agents/Documentation~/index.md
new file mode 100644
index 0000000000..7fc00ba596
--- /dev/null
+++ b/com.unity.ml-agents/Documentation~/index.md
@@ -0,0 +1,54 @@
+# ML-Agents Overview
+
+
+
+ML-Agents enables games and simulations to serve as environments for training intelligent agents in Unity. Training can be done with reinforcement learning, imitation learning, neuroevolution, or any other methods. Trained agents can be used for many use cases, including controlling NPC behavior (in a variety of settings such as multi-agent and adversarial), automated testing of game builds and evaluating different game design decisions pre-release.
+
+This documentation contains comprehensive instructions about the [Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents), including the C# package, along with detailed training guides and Python API references. Note that the C# package does not contain the machine learning algorithms for training behaviors. The C# package only supports instrumenting a Unity scene, setting it up for training, and then embedding the trained model back into your Unity scene. The machine learning algorithms that orchestrate training are part of the companion Python package.
+
+## Documentation structure
+
+| **Section** | **Description** |
+|---------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [ML-Agents Theory](ML-Agents-Overview.md) | Learn about core concepts of ML-Agents. |
+| [Get started](Get-Started.md) | Learn how to install ML-Agents and explore samples. |
+| [Learning Environments and Agents](Learning-Environments-Agents.md) | Learn about Environments, Agents, creating environments, and using executable builds. |
+| [Training](Training.md) | Training workflow, config file, monitoring tools, custom plugins, and profiling. |
+| [Python APIs](Python-APIs.md) | Gym, PettingZoo, low-level interfaces, and trainer documentation. |
+| [Python Tutorial with Google Colab](Tutorial-Colab.md) | Interactive tutorials for using ML-Agents with Google Colab. |
+| [Advanced Features](Advanced-Features.md) | Custom sensors, side channels, package settings, environment registry, input system integrations, and game integrations (e.g., [Match-3](Integrations-Match3.md)). |
+| [Cloud & Deployment](Cloud-Deployment.md) | Legacy cloud deployment guides (deprecated). |
+| [Reference & Support](Reference-Support.md) | FAQ, troubleshooting, and migration guides. |
+| [Background](Background.md) | Machine Learning, Unity, PyTorch fundamentals, virtual environments, and ELO rating systems. |
+
+## Capabilities
+The package allows you to convert any Unity scene into a learning environment and train character behaviors using a variety of machine-learning algorithms. Additionally, it allows you to embed these trained behaviors back into Unity scenes to control your characters. More specifically, the package provides the following core functionalities:
+
+* Define Agents: entities, or characters, whose behavior will be learned. Agents are entities that generate observations (through sensors), take actions, and receive rewards from the environment.
+* Define Behaviors: entities that specify how an agent should act. Multiple agents can share the same Behavior and a scene may have multiple Behaviors.
+* Record demonstrations: To show the behaviors of an agent within the Editor. You can use demonstrations to help train a behavior for that agent.
+* Embed a trained behavior (aka: run your ML model) in the scene via the [Inference Engine](https://docs.unity3d.com/Packages/com.unity.ai.inference@latest). Embedded behaviors allow you to switch an Agent between learning and inference.
+
+## Community and Feedback
+
+The ML-Agents Toolkit is an open-source project, and we encourage and welcome contributions. If you wish to contribute, be sure to review our [contribution guidelines](CONTRIBUTING.md) and [code of conduct](https://github.com/Unity-Technologies/ml-agents/blob/release_22/CODE_OF_CONDUCT.md).
+
+For problems with the installation and setup of the ML-Agents Toolkit, or discussions about how best to set up or train your agents, please create a new thread on the [Unity ML-Agents forum](https://forum.unity.com/forums/ml-agents.453/) and include as much detail as possible. If you run into any other problems using the ML-Agents Toolkit or have a specific feature request, please [submit a GitHub issue](https://github.com/Unity-Technologies/ml-agents/issues).
+
+Please tell us which samples you would like to see shipped with the ML-Agents Unity package by replying to [this forum thread](https://forum.unity.com/threads/feedback-wanted-shipping-sample-s-with-the-ml-agents-package.1073468/).
+
+Your opinion matters a great deal to us. Only by hearing your thoughts on the Unity ML-Agents Toolkit can we continue to improve and grow. Please take a few minutes to [let us know what you think](https://unitysoftware.co1.qualtrics.com/jfe/form/SV_55pQKCZ578t0kbc).
+
+## Privacy
+To improve the developer experience for the Unity ML-Agents Toolkit, we have added in-editor analytics. Please refer to "Information that is passively collected by Unity" in the [Unity Privacy Policy](https://unity3d.com/legal/privacy-policy).
+
+## Additional Resources
+
+* [GitHub repository](https://github.com/Unity-Technologies/ml-agents)
+* [Unity Discussions](https://discussions.unity.com/tag/ml-agents)
+* [ML-Agents tutorials by CodeMonkeyUnity](https://www.youtube.com/playlist?list=PLzDRvYVwl53vehwiN_odYJkPBzcqFw110)
+* [Introduction to ML-Agents by Hugging Face](https://huggingface.co/learn/deep-rl-course/en/unit5/introduction)
+* [Community created ML-Agents projects](https://discussions.unity.com/t/post-your-ml-agents-project/816756)
+* [ML-Agents models on Hugging Face](https://huggingface.co/models?library=ml-agents)
+* [Blog posts](Blog-posts.md)
diff --git a/com.unity.ml-agents/README.md b/com.unity.ml-agents/README.md
index cbcc9397e2..64418f5631 100644
--- a/com.unity.ml-agents/README.md
+++ b/com.unity.ml-agents/README.md
@@ -2,14 +2,5 @@
ML-Agents is a Unity package that allows users to use state-of-the-art machine learning to create intelligent character behaviors in any Unity environment (games, robotics, film, etc.).
-## Installation
+Please refer to the [Unity Package Documentation](https://docs.unity3d.com/Packages/com.unity.ml-agents@latest) for comprehensive installation instructions, usage guides, tutorials, and API references.
-Please refer to the [ML-Agents github repo] for installation instructions.
-
-## Usage
-
-Please refer to the [ML-Agents documentation] page for usage guides.
-
-
-[ML-Agents github repo]: https://github.com/Unity-Technologies/ml-agents
-[ML-Agents documentation]: https://unity-technologies.github.io/ml-agents/
diff --git a/docs/deprecation-banner.js b/docs/deprecation-banner.js
new file mode 100644
index 0000000000..cf7334c09c
--- /dev/null
+++ b/docs/deprecation-banner.js
@@ -0,0 +1,12 @@
+// Add deprecation banner with link
+document.addEventListener('DOMContentLoaded', function() {
+ // Create the banner element
+ const banner = document.createElement('a');
+ banner.href = 'https://docs.unity3d.com/Packages/com.unity.ml-agents@latest';
+ banner.target = '_blank';
+ banner.className = 'deprecation-banner';
+ banner.innerHTML = '⚠️ DEPRECATED: This documentation has moved to Unity Package Documentation - Click here to view the latest documentation';
+
+ // Insert at the top of the body
+ document.body.insertBefore(banner, document.body.firstChild);
+});
diff --git a/docs/extra.css b/docs/extra.css
index 39f517eca6..c883dc429d 100644
--- a/docs/extra.css
+++ b/docs/extra.css
@@ -1,3 +1,27 @@
+/* Deprecation banner styling */
+.deprecation-banner {
+ display: block;
+ background: #ff6b6b;
+ color: white;
+ padding: 10px;
+ text-align: center;
+ font-weight: bold;
+ position: sticky;
+ top: 0;
+ z-index: 1000;
+ cursor: pointer;
+ text-decoration: none;
+ border: none;
+ margin: 0;
+}
+
+.deprecation-banner:hover {
+ background: #ff5252;
+ text-decoration: underline;
+ color: white;
+}
+
+/* Original ML-Agents styling */
.wy-nav-top, .wy-side-nav-search {
background: #439b47;
}
diff --git a/docs/index.md b/docs/index.md
index 4b822c4d61..3e8b8b758e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,2 +1,8 @@
+# ⚠️ Documentation Moved ⚠️
+
+**This documentation is deprecated and no longer maintained. Visit the [Unity Package Documentation](https://docs.unity3d.com/Packages/com.unity.ml-agents@latest) for the latest ML-Agents documentation. This site remains for legacy reference only.**
+
+---
+
{!README.md!}
diff --git a/mkdocs.yml b/mkdocs.yml
index 277adfe151..47927a21bf 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -2,11 +2,11 @@ site_name: Unity ML-Agents Toolkit
site_url: https://unity-technologies.github.io/ml-agents/
repo_url: https://github.com/Unity-Technologies/ml-agents
edit_uri: edit/main/docs/
-site_description: The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents.
+site_description: "DEPRECATED: This documentation has been moved to the Unity Package Documentation. Please visit https://docs.unity3d.com/Packages/com.unity.ml-agents@latest for current documentation. The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents."
site_author: Unity Technologies
copyright: com.unity.ml-agents copyright © 2017 - 2022 Unity Technologies
nav:
-- Home: index.md
+- "⚠️ DOCUMENTATION MOVED": index.md
- ML-Agents Overview: ML-Agents-Overview.md
- Installation: Installation.md
- Toolkit Documentation: ML-Agents-Toolkit-Documentation.md
@@ -53,6 +53,8 @@ theme:
extra_css:
- extra.css
+extra_javascript:
+ - deprecation-banner.js
markdown_extensions:
- markdown_include.include:
base_path: docs
diff --git a/ml-agents-envs/README.md b/ml-agents-envs/README.md
index 4db68723d2..2a2e0232f5 100644
--- a/ml-agents-envs/README.md
+++ b/ml-agents-envs/README.md
@@ -23,15 +23,15 @@ python -m pip install mlagents_envs==1.1.0
## Usage & More Information
See
-- [Gym API Guide](../docs/Python-Gym-API.md)
-- [PettingZoo API Guide](../docs/Python-PettingZoo-API.md)
-- [Python API Guide](../docs/Python-LLAPI.md)
+- [Gym API Guide](../com.unity.ml-agents/Documentation~/Python-Gym-API.md)
+- [PettingZoo API Guide](../com.unity.ml-agents/Documentation~/Python-PettingZoo-API.md)
+- [Python API Guide](../com.unity.ml-agents/Documentation~/Python-LLAPI.md)
for more information on how to use the API to interact with a Unity environment.
For more information on the ML-Agents Toolkit and how to instrument a Unity
scene with the ML-Agents SDK, check out the main
-[ML-Agents Toolkit documentation](../docs/Readme.md).
+[ML-Agents Toolkit documentation](../com.unity.ml-agents/Documentation~/Readme.md).
## Limitations
diff --git a/ml-agents-plugin-examples/README.md b/ml-agents-plugin-examples/README.md
index db66662d70..6ecae9c369 100644
--- a/ml-agents-plugin-examples/README.md
+++ b/ml-agents-plugin-examples/README.md
@@ -1,3 +1,3 @@
# ML-Agents Plugins
-See the [Plugins documentation](../docs/Training-Plugins.md) for more information.
+See the [Plugins documentation](../com.unity.ml-agents/Documentation~/Training-Plugins.md) for more information.
diff --git a/ml-agents/README.md b/ml-agents/README.md
index 16b3bada70..16ac4dd486 100644
--- a/ml-agents/README.md
+++ b/ml-agents/README.md
@@ -4,11 +4,11 @@ The `mlagents` Python package is part of the
[ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents). `mlagents`
provides a set of reinforcement and imitation learning algorithms designed to be
used with Unity environments. The algorithms interface with the Python API
-provided by the `mlagents_envs` package. See [here](../docs/Python-LLAPI.md) for
+provided by the `mlagents_envs` package. See [here](../com.unity.ml-agents/Documentation~/Python-LLAPI.md) for
more information on `mlagents_envs`.
The algorithms can be accessed using the: `mlagents-learn` access point. See
-[here](../docs/Training-ML-Agents.md) for more information on using this
+[here](../com.unity.ml-agents/Documentation~/Training-ML-Agents.md) for more information on using this
package.
## Installation
@@ -23,7 +23,7 @@ python -m pip install mlagents==1.1.0
For more information on the ML-Agents Toolkit and how to instrument a Unity
scene with the ML-Agents SDK, check out the main
-[ML-Agents Toolkit documentation](../docs/Readme.md).
+[ML-Agents Toolkit documentation](../com.unity.ml-agents/Documentation~/Readme.md).
## Limitations
diff --git a/utils/make_readme_table.py b/utils/make_readme_table.py
index bf467fd731..aefe0614be 100644
--- a/utils/make_readme_table.py
+++ b/utils/make_readme_table.py
@@ -96,14 +96,18 @@ def doc_link(self):
if self.is_verified:
return "https://github.com/Unity-Technologies/ml-agents/blob/release_2_verified_docs/docs/Readme.md"
- # TODO remove in favor of webdocs. commenting out for now.
- # # For release_X branches, docs are on a separate tag.
- # if self.release_tag.startswith("release"):
- # docs_name = self.release_tag + "_docs"
- # else:
- # docs_name = self.release_tag
- # return f"https://github.com/Unity-Technologies/ml-agents/tree/{docs_name}/docs/Readme.md"
- return "https://unity-technologies.github.io/ml-agents/"
+ if self.csharp_version == "develop":
+ return (
+ "https://github.com/Unity-Technologies/ml-agents/tree/"
+ "develop/com.unity.ml-agents/Documentation~/index.md"
+ )
+
+ # Prioritize Unity Package documentation over web docs
+ try:
+            StrictVersion(self.csharp_version).version  # parsing raises ValueError for non-release version strings
+ return "https://docs.unity3d.com/Packages/com.unity.ml-agents@latest"
+ except ValueError:
+ return "https://unity-technologies.github.io/ml-agents/ (DEPRECATED)"
@property
def package_link(self):